This chapter covers AWS Elastic Disaster Recovery (DRS), a managed service for replicating and recovering workloads from on-premises or other clouds to AWS. For the SOA-C02 exam, DRS is a key topic under Domain 2: Reliability, specifically Objective 2.2: Implement disaster recovery (DR) for AWS workloads. Expect 2-3 questions on DRS, focusing on its features, configuration, recovery objectives, and how it differs from other DR services like CloudEndure Disaster Recovery (now part of DRS), AWS Backup, and Storage Gateway. Mastery of DRS is essential for designing resilient architectures.
Imagine you run a busy office with all your critical documents in a single filing cabinet. One day, a fire destroys the building. Without a backup, you lose everything. AWS Elastic Disaster Recovery (DRS) is like having a second, identical filing cabinet in a secure offsite vault that is continuously updated. Here is how it works mechanically: You have a replication agent installed on each server (like a dedicated clerk). This agent continuously monitors all changes to your data (files, databases, configurations) and sends only the changed blocks to the vault, not the entire file every time. The vault stores these changes in a staging area (like a temporary holding shelf) that is always ready. When disaster strikes, you initiate a recovery drill or actual failover. The vault then uses the latest consistent snapshot to spin up a fully functional server (like pulling a complete, up-to-date copy from the shelf and placing it on a new desk). The key is that the replication is ongoing and low-latency, so the failover server can be ready in minutes. The staging area also allows you to test failovers without affecting the source, just as you can rehearse a fire drill without setting off the real alarm.
What is AWS Elastic Disaster Recovery (DRS)?
AWS Elastic Disaster Recovery (DRS) is a fully managed service that enables you to replicate and recover your physical, virtual, and cloud-based servers into AWS. It is designed for a Recovery Point Objective (RPO) of seconds and a Recovery Time Objective (RTO) of minutes. DRS is the successor to CloudEndure Disaster Recovery, a product of CloudEndure, which AWS acquired in 2019. It supports a wide range of operating systems (Windows, Linux, etc.) and, because it replicates at the block level, servers running databases such as Oracle, SQL Server, and MySQL.
How DRS Works Internally
DRS operates by installing a lightweight, block-level replication agent on each source server. This agent continuously replicates the server's entire disk (OS, applications, data) to a staging area in your target AWS account. The staging area is a low-cost, high-performance storage tier (typically Amazon EBS volumes) that holds the replicated data in a consistent state. The agent uses asynchronous replication, meaning that changes are sent in near-real-time but not necessarily synchronously. This allows for minimal impact on source performance.
Key Components and Defaults
Replication Agent: A software agent installed on the source server. It captures disk writes and sends them to the staging area. The agent is available for Windows and Linux. It requires outbound internet access to AWS endpoints (or via AWS PrivateLink).
Staging Area: A temporary storage location in your target AWS account. It consists of a small EBS volume (default 30 GB for boot volume, plus additional for data volumes) and an EC2 instance (default t3.small for staging) that processes incoming data. The staging area is automatically created when you set up a replication configuration.
Recovery Instance: The EC2 instance launched during a recovery drill or actual failover. It is created from the latest consistent snapshot in the staging area. The instance type can be customized (e.g., m5.large) and must be specified in the recovery plan.
Conversion Server: A temporary server used during recovery to convert the replicated data into a bootable AMI. This process is automated.
Recovery Point Objective (RPO): DRS targets an RPO of seconds (typically under 10 seconds). The actual RPO depends on network bandwidth and data change rate.
Recovery Time Objective (RTO): DRS targets an RTO of minutes (typically under 5 minutes for most workloads). The RTO includes the time to launch the recovery instance and boot the OS.
Data Encryption: DRS supports encryption at rest using AWS KMS (both source and target). Data in transit is encrypted using TLS.
Bandwidth Throttling: You can throttle replication bandwidth to avoid saturating the source network. Default is unlimited.
Retention Policy: DRS retains recovery points for a configurable period (default 7 days, max 30 days).
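The dependence of the achieved RPO on network bandwidth and data change rate can be illustrated with simple arithmetic. The following is a minimal sketch, assuming a steady write rate and a fixed uplink; the function names and the sample figures are illustrative and are not DRS parameters.

```python
def replication_keeps_up(change_rate_mbps: float, bandwidth_mbps: float) -> bool:
    """Return True if the link can ship changes at least as fast as they are written."""
    return bandwidth_mbps >= change_rate_mbps


def backlog_drain_seconds(backlog_mb: float, change_rate_mbps: float,
                          bandwidth_mbps: float) -> float:
    """Seconds needed to clear a backlog while new writes keep arriving.

    Rates are in megabits per second; the backlog is in megabytes.
    Returns infinity when the link is not faster than the change rate.
    """
    spare_mbps = bandwidth_mbps - change_rate_mbps
    if spare_mbps <= 0:
        return float("inf")
    return (backlog_mb * 8) / spare_mbps


if __name__ == "__main__":
    # 500 MB backlog, 20 Mbps of ongoing writes, 100 Mbps uplink
    print(backlog_drain_seconds(500, 20, 100))  # 50.0
```

If the change rate meets or exceeds the available bandwidth, the backlog never drains and the effective RPO grows, which is why throttling and link sizing matter.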
Configuration and Verification
To set up DRS, you first install the replication agent on the source server. The agent registers with the DRS service and starts replicating. You then configure a recovery plan in the DRS console, specifying target instance type, security groups, subnet, etc. You can perform non-disruptive recovery drills to verify failover readiness. Commands for agent installation (example for Linux):
sudo wget -O ./aws-replication-installer-init https://aws-elastic-disaster-recovery-us-east-1.s3.us-east-1.amazonaws.com/latest/linux/aws-replication-installer-init
sudo chmod +x aws-replication-installer-init
sudo ./aws-replication-installer-init
To check replication status, use the AWS CLI:
aws drs describe-replication-configuration-templates
aws drs describe-source-servers
Interaction with Related Technologies
DRS integrates with AWS Backup for additional backup retention. It can also be used with AWS CloudFormation for automated recovery plan deployment. DRS is often compared to AWS Storage Gateway Volume Gateway (which replicates volumes but not full servers) and AWS Elastic Beanstalk (which is for application tier, not DR). For database workloads, DRS can be combined with AWS Database Migration Service (DMS) for ongoing replication of databases to RDS or Aurora. However, DRS provides block-level replication, which is OS-agnostic and works for any application.
Exam Emphasis
On the SOA-C02 exam, you need to know:
DRS is the primary service for full server DR (as opposed to AWS Backup which is for individual resources like EBS snapshots, RDS snapshots, etc.).
DRS supports both physical and virtual servers on-premises, as well as EC2 instances (cross-region or cross-account).
The staging area is a temporary, low-cost storage that is not the final recovery instance. It holds the replicated data in a consistent state.
DRS uses continuous block-level replication, not periodic snapshots.
RPO is seconds, RTO is minutes.
DRS can be used for both DR and migration (e.g., lift-and-shift to AWS).
DRS supports non-disruptive recovery drills (test failovers) that do not affect source servers.
DRS is a managed service; you do not need to manage the replication infrastructure.
DRS is region-specific; replication is from source region to target region (or within same region for cross-account).
DRS supports Windows and Linux operating systems.
DRS can use AWS PrivateLink for secure replication without internet exposure; it also works over the public internet.
DRS pricing is based on the number of source servers and storage used in the staging area.
Common Pitfalls
Confusing DRS with AWS Backup: AWS Backup is for scheduled backups of individual resources, not continuous replication. DRS is for full server DR with low RPO/RTO.
Thinking DRS requires a VPN or Direct Connect: It can use the internet, but best practice is to use AWS PrivateLink or VPN for security.
Assuming DRS replicates only data volumes: It replicates entire disks (OS + data).
Overlooking the need to install the agent on each server: DRS has no agentless replication mode (unlike the older AWS Server Migration Service, which replicated VMs through a connector appliance).
Forgetting that DRS staging area is in the target AWS account, not the source.
Command Examples
To initiate a recovery drill:
aws drs start-recovery --source-servers sourceServerID=<source-server-id> --is-drill
To describe recovery instances:
aws drs describe-recovery-instances
To update replication configuration:
aws drs update-replication-configuration-template --replication-configuration-template-id <template-id> --bandwidth-throttling <value>
Install Replication Agent on Source
Download and run the DRS replication agent installer on each source server (physical, virtual, or cloud). The agent registers with the DRS service and authenticates using IAM roles or access keys. It then begins scanning the disk to identify all volumes and file systems. The agent is lightweight and typically uses less than 5% CPU and 100 MB RAM. It requires outbound HTTPS (TCP port 443) access to the DRS endpoint (e.g., drs.us-east-1.amazonaws.com) and to Amazon S3 to download the agent software, plus outbound TCP port 1500 to the replication servers in the staging area subnet, which carries the replicated data. The agent runs as a system service (e.g., aws-replication-agent on Linux).
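Before installing the agent, it can help to confirm that the source server can actually open the outbound connections it needs. The following is a minimal sketch of such a pre-install check, not part of the DRS tooling: the function name check_endpoints and its connect parameter are hypothetical, and the endpoint list is whatever your own network design calls for.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple

Endpoint = Tuple[str, int]


def check_endpoints(endpoints: Iterable[Endpoint],
                    connect: Callable = socket.create_connection,
                    timeout: float = 3.0) -> Dict[Endpoint, bool]:
    """Try a TCP connection to each (host, port) and report reachability.

    The connect callable is injectable so the check can be exercised
    without real network access.
    """
    results: Dict[Endpoint, bool] = {}
    for host, port in endpoints:
        try:
            conn = connect((host, port), timeout)
            conn.close()
            results[(host, port)] = True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            results[(host, port)] = False
    return results
```

Calling check_endpoints([("drs.us-east-1.amazonaws.com", 443)]) from the source server returns a dictionary mapping each endpoint to True or False. A False entry usually points at a firewall or proxy rule rather than at DRS itself.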
Create Staging Area in Target Account
When you configure replication for a source server, DRS automatically creates a staging area in your target AWS account. This includes a small EC2 instance (e.g., t3.small) that acts as a replication server, and EBS volumes to store the replicated data. The staging area is created in the specified target region and subnet. The staging instance processes incoming data and maintains point-in-time consistent snapshots. The staging area is not the final recovery instance; it is temporary and low-cost. You can configure the staging area subnet, security groups, and instance type.
Continuous Block-Level Replication
The replication agent continuously monitors disk writes at the block level. It sends only the changed blocks to the replication servers in the staging area over an encrypted TCP connection (port 1500). The staging area applies these changes to its local copy, maintaining a consistent state. The replication is asynchronous, so the source server does not wait for acknowledgment. The agent uses a buffer to handle network congestion. The target RPO is seconds, but actual RPO depends on bandwidth and write rate. DRS uses deduplication and compression to optimize data transfer.
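The idea of shipping only changed blocks can be sketched in a few lines. This is an illustration of the concept only, under the assumption that a disk image grows or stays the same size. It compares block hashes, which is not necessarily how the DRS agent detects changes, and every name in it is hypothetical.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4096  # bytes; an arbitrary size chosen for the illustration


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def changed_blocks(old: bytes, new: bytes,
                   block_size: int = BLOCK_SIZE) -> Dict[int, bytes]:
    """Return {block_index: block_bytes} for blocks that differ between old and new."""
    old_hashes = [hashlib.sha256(b).digest() for b in split_blocks(old, block_size)]
    delta: Dict[int, bytes] = {}
    for index, block in enumerate(split_blocks(new, block_size)):
        is_new = index >= len(old_hashes)
        if is_new or hashlib.sha256(block).digest() != old_hashes[index]:
            delta[index] = block
    return delta


def apply_blocks(replica: bytes, delta: Dict[int, bytes],
                 block_size: int = BLOCK_SIZE) -> bytes:
    """Apply changed blocks to the replica copy, as the receiving side would."""
    blocks = split_blocks(replica, block_size)
    for index in sorted(delta):
        if index < len(blocks):
            blocks[index] = delta[index]
        else:
            blocks.append(delta[index])
    return b"".join(blocks)
```

Changing one 4 KiB block in a multi-gigabyte image produces a delta containing just that block, which is why continuous replication costs far less bandwidth than repeated full copies.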
Configure Recovery Plan
In the DRS console, you define a recovery plan that specifies how the recovery instance should be launched. This includes: target instance type (e.g., m5.large), VPC, subnet, security groups, IAM role, and any user data scripts. You can also specify launch order for multiple servers (e.g., database before application). The recovery plan is saved as a template. You can create multiple plans for different scenarios (e.g., drill vs. actual failover). The plan also includes retention settings for recovery points (default 7 days).
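The launch-order requirement (database before application) is a dependency-ordering problem. The following is a minimal sketch of how such an order could be computed with Python's standard library, assuming you describe dependencies yourself; the server names and the launch_order helper are hypothetical and are not part of the DRS API.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def launch_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return server names ordered so every server follows its dependencies.

    depends_on maps each server to the set of servers that must be
    running before it starts. Raises graphlib.CycleError on circular
    dependencies.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    order = launch_order({"db": set(), "app": {"db"}, "web": {"app"}})
    print(order)  # ['db', 'app', 'web']
```

Requires Python 3.9 or later for graphlib. A circular dependency raises an error instead of producing a misleading order, which is the behavior you want when validating a plan before a real failover.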
Perform Recovery Drill or Failover
To test DR readiness, you initiate a recovery drill. DRS launches a recovery instance from the latest consistent snapshot in the staging area, but in a separate VPC or subnet to avoid IP conflicts. The drill does not affect the source server. During an actual failover, you initiate a final sync to capture the latest changes, then launch the recovery instance. The recovery instance is fully functional and can be used as the new production server. After failover, you can perform a failback by reversing replication (back to source or new target). The RTO is typically under 5 minutes, but depends on instance size and boot time.
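A drill and a real recovery differ only in one flag when started through the API. The following is a minimal sketch of building the request for the boto3 DRS client's start_recovery call, whose sourceServers and isDrill parameters are part of the DRS API. The helper name and the sample server ID are hypothetical, and the actual call is left as a comment because it needs AWS credentials.

```python
from typing import Dict, Iterable, List


def build_start_recovery_request(source_server_ids: Iterable[str],
                                 is_drill: bool) -> Dict:
    """Build keyword arguments for the boto3 DRS client's start_recovery call."""
    servers: List[Dict[str, str]] = [
        {"sourceServerID": server_id} for server_id in source_server_ids
    ]
    if not servers:
        raise ValueError("at least one source server ID is required")
    return {"sourceServers": servers, "isDrill": is_drill}


# Usage with boto3 (requires AWS credentials and the boto3 package):
#   import boto3
#   drs = boto3.client("drs", region_name="us-west-2")
#   request = build_start_recovery_request(["s-1234567890abcdef0"], is_drill=True)
#   response = drs.start_recovery(**request)
```

Keeping request construction separate from the API call makes it easy to review exactly which servers a runbook will launch, and whether it is a drill, before anything is started.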
Monitor and Manage Replication
Use the DRS console or AWS CLI to monitor replication status, data lag, and errors. You can view metrics like replication lag (time since last successful sync), bytes remaining, and throughput. Alerts can be set up via Amazon CloudWatch. You can also modify replication settings, such as bandwidth throttling, staging area resources, and retention policy. If a source server fails, DRS automatically pauses replication and alerts you. You can also delete replication configurations when no longer needed.
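Replication lag can also be checked programmatically from the output of describe-source-servers. The following is a minimal sketch, assuming the response carries an items list in which each entry has a sourceServerID and a dataReplicationInfo.lagDuration field formatted as an ISO 8601 duration (for example PT1M30S). Verify the field names against the current API reference before relying on them; the helper names are hypothetical.

```python
import re
from typing import Dict, List

# Handles simple durations such as 'PT5S', 'PT1M30S', or 'P1DT2H'.
_DURATION = re.compile(
    r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?$"
)


def duration_to_seconds(value: str) -> float:
    """Convert a simple ISO 8601 duration to seconds."""
    match = _DURATION.match(value)
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    days, hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def lagging_servers(response: Dict, threshold_seconds: float) -> List[str]:
    """Return IDs of source servers whose reported lag exceeds the threshold."""
    lagging = []
    for item in response.get("items", []):
        lag = item.get("dataReplicationInfo", {}).get("lagDuration")
        if lag and duration_to_seconds(lag) > threshold_seconds:
            lagging.append(item["sourceServerID"])
    return lagging
```

Passing the parsed JSON output together with a threshold such as 10 seconds yields the list of servers to investigate. Servers that report no lag value are skipped rather than flagged.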
Enterprise Scenario 1: On-Premises to AWS DR for a Financial Services Company
A large bank runs its core banking application on physical servers in a data center. They need a DR solution with RPO of 5 seconds and RTO of 2 minutes to meet regulatory requirements. They choose AWS DRS because it offers continuous block-level replication with low RPO/RTO. They install the agent on each server (Windows Server 2019) and configure replication to a staging area in us-west-2. They set up a recovery plan that launches the recovery instance as an m5.xlarge with the same security groups and IAM roles as the source. They perform weekly non-disruptive drills to ensure the recovery works. During a real disaster, they initiate a failover and the recovery instance boots in under 2 minutes. The bank uses AWS PrivateLink to keep replication traffic within the AWS network, avoiding internet exposure. They also use CloudWatch alarms to alert if replication lag exceeds 10 seconds. The staging area is sized with 100 GB of EBS gp3 volumes per server, costing approximately $30/month per server. The total cost includes agent licensing (per server), staging storage, and recovery instance costs during drills. The bank saves 60% compared to their previous DR solution.
Enterprise Scenario 2: Cross-Region DR for an E-Commerce Platform
A global e-commerce company runs its web application on EC2 instances in us-east-1. They want a DR site in us-west-2 with automatic DNS failover. They use DRS to replicate the EC2 instances cross-region. They install the agent on each EC2 instance (Amazon Linux 2) and configure replication to a staging area in us-west-2. They set up a recovery plan that uses an Application Load Balancer (ALB) to route traffic to the recovery instances. They also use Amazon Route 53 health checks to fail over DNS automatically. During a regional outage, they initiate recovery and the recovery instances are launched in us-west-2. The ALB is pre-configured with the same target group. The entire failover takes less than 5 minutes. They also use DRS's non-disruptive drills to test the failover process monthly. The company's RPO is under 10 seconds, and RTO is under 5 minutes. They use AWS KMS to encrypt replication data at rest. The staging area uses t3.medium instances for replication processing. The company monitors replication lag using CloudWatch and has automated alerts for any lag exceeding 15 seconds.
Common Misconfigurations
Not sizing the staging area properly: The staging area must have enough storage to hold the full disk image plus changes. If the staging area runs out of space, replication pauses. The default staging volume size is 30 GB, but you should increase it based on source disk size.
Forgetting to open outbound ports: The agent needs HTTPS (TCP 443) to the DRS endpoint and to Amazon S3, and TCP 1500 to the replication servers in the staging area subnet. If a firewall blocks either, replication fails.
Using the same subnet for drills and production: During a drill, the recovery instance must not conflict with the source IP. Use a separate subnet or VPC.
Not testing failovers regularly: Many companies skip drills and discover issues only during a real disaster. DRS makes drills easy and non-disruptive.
Overlooking bandwidth throttling: With the default unlimited setting, the agent can saturate a constrained source network link. Set throttling to limit replication traffic to, say, 50 Mbps.
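The misconfigurations above lend themselves to a simple pre-flight check. The following is a minimal sketch that assumes you keep your own description of the setup in a dictionary. Every key name is hypothetical and chosen for the illustration; none of them is a DRS setting name.

```python
from typing import Dict, List


def preflight_issues(config: Dict) -> List[str]:
    """Return a list of human-readable problems found in a DR setup description."""
    issues: List[str] = []
    if config["staging_volume_gb"] < config["source_disk_gb"]:
        issues.append("staging volume smaller than source disk")
    if config["drill_subnet"] == config["production_subnet"]:
        issues.append("drill subnet matches production subnet")
    if 443 not in config["open_outbound_ports"]:
        issues.append("outbound TCP 443 is blocked")
    # A value of 0 (or a missing key) stands for 'unlimited' in this sketch.
    if config.get("bandwidth_throttle_mbps", 0) == 0:
        issues.append("replication bandwidth is unthrottled")
    return issues


if __name__ == "__main__":
    sample = {
        "staging_volume_gb": 30,
        "source_disk_gb": 200,
        "drill_subnet": "subnet-drill",
        "production_subnet": "subnet-prod",
        "open_outbound_ports": [443],
        "bandwidth_throttle_mbps": 50,
    }
    print(preflight_issues(sample))  # ['staging volume smaller than source disk']
```

Running a check like this before each scheduled drill turns the list of pitfalls into something that fails loudly instead of something that has to be remembered.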
What SOA-C02 Tests on DRS
The SOA-C02 exam tests DRS under Domain 2: Reliability, Objective 2.2: Implement disaster recovery (DR) for AWS workloads. Specific sub-objectives include: selecting appropriate DR strategies (pilot light, warm standby, multi-site), implementing automated failover and failback, and using AWS services for DR. DRS is directly tested in questions about continuous replication, RPO/RTO targets, and recovery drills.
Common Wrong Answers and Why Candidates Choose Them
Choosing AWS Backup instead of DRS: AWS Backup is for scheduled backups (snapshots) with RPO of hours, not seconds. Candidates see "backup" and think it covers DR, but DRS provides faster RPO/RTO. Wrong answer: "Use AWS Backup with continuous backup". Reality: AWS Backup does not offer continuous block-level replication; it uses periodic snapshots.
Selecting AWS Storage Gateway Volume Gateway: Volume Gateway replicates volumes to S3, but it does not replicate the entire server (OS). Candidates confuse volume replication with full server DR. Wrong answer: "Use Storage Gateway to replicate the server's volumes". Reality: DRS replicates the entire disk, including OS.
Assuming DRS requires a VPN or Direct Connect: DRS can work over the internet, but best practice is to use AWS PrivateLink. Candidates think a dedicated connection is mandatory. Wrong answer: "You must establish a VPN connection before using DRS". Reality: DRS works over HTTPS, but for security, PrivateLink is recommended.
Confusing staging area with recovery instance: The staging area is temporary storage, not the final recovery instance. Candidates think the staging area is the DR site. Wrong answer: "The staging area is the EC2 instance that runs during failover". Reality: The recovery instance is launched from the staging area's snapshots.
Specific Numbers and Terms on the Exam
RPO: seconds (typically under 10 seconds)
RTO: minutes (typically under 5 minutes)
Maximum retention for recovery points: 30 days (default 7 days)
Staging area default instance type: t3.small
Staging area default boot volume size: 30 GB
DRS supports Windows and Linux
DRS uses block-level replication (not file-level)
DRS is the evolution of CloudEndure Disaster Recovery
Edge Cases
DRS supports physical (bare metal) servers as well as virtual machines, provided the operating system is on the supported list; no hypervisor is required.
DRS cannot replicate servers with unsupported file systems (e.g., ReFS on Windows).
DRS replication agent requires outbound internet access; if the source is in a private subnet, use AWS PrivateLink or a NAT gateway.
For EC2 sources, the recovery target must differ from the source in Region, Availability Zone, or account; cross-region replication within the same account is supported.
DRS pricing includes the staging area costs (EC2 and EBS) plus per-server licensing.
How to Eliminate Wrong Answers
If the question mentions RPO of seconds or RTO of minutes, DRS is likely the answer.
If the question describes continuous replication of entire servers, choose DRS over AWS Backup or Storage Gateway.
If the question asks about non-disruptive testing (drills), DRS supports that.
If the question involves on-premises servers, DRS is designed for that.
If the question mentions "agent-based replication", DRS uses an agent.
Eliminate answers that suggest periodic snapshots (e.g., AWS Backup) or file-level replication (e.g., AWS DataSync).
AWS Elastic Disaster Recovery (DRS) provides continuous block-level replication with an RPO of seconds and an RTO of minutes.
DRS is the evolution of CloudEndure Disaster Recovery and supports Windows and Linux servers.
The staging area is a temporary, low-cost storage and processing environment in the target AWS account.
DRS is agent-based; you must install the replication agent on each source server.
DRS can replicate on-premises physical/virtual servers as well as EC2 instances.
Non-disruptive recovery drills allow you to test failover without affecting the source.
DRS integrates with AWS PrivateLink for secure replication over the AWS network.
Default retention for recovery points is 7 days, configurable up to 30 days.
DRS is different from AWS Backup, which provides scheduled snapshots with higher RPO.
DRS pricing includes per-server licensing and staging area costs (EC2 and EBS).
These come up on the exam all the time. Here's how to tell them apart.
AWS Elastic Disaster Recovery (DRS)
Continuous block-level replication with RPO of seconds
RTO of minutes (fast failover)
Replicates entire server (OS + data)
Agent-based replication on source server
Supports non-disruptive recovery drills
AWS Backup
Scheduled backups (snapshots) with RPO of hours
RTO of minutes to hours (restore from snapshot)
Backs up individual resources (EBS, RDS, etc.), not full server
Agentless (uses AWS APIs for snapshots)
No built-in drill capability; restore is disruptive
Mistake
DRS replicates only data volumes, not the OS.
Correct
DRS replicates the entire disk, including the operating system, system files, and all data volumes. It captures block-level changes across all attached disks.
Mistake
The staging area is the same as the recovery instance.
Correct
The staging area is a temporary storage and processing environment (small EC2 instance + EBS volumes) that holds the replicated data. The recovery instance is a separate EC2 instance launched from the staging area during failover.
Mistake
DRS requires a VPN or Direct Connect to function.
Correct
DRS can operate over the public internet using HTTPS. However, for enhanced security, you can use AWS PrivateLink or a VPN. A dedicated connection is not mandatory.
Mistake
DRS is only for on-premises servers, not for EC2 instances.
Correct
DRS supports replication of EC2 instances (cross-region or cross-account), as well as physical and virtual servers on-premises. It is a universal DR solution.
Mistake
DRS provides an RTO of seconds and RPO of minutes.
Correct
It's the opposite: DRS targets an RPO of seconds (typically under 10 seconds) and an RTO of minutes (typically under 5 minutes). RPO is about data loss, RTO is about downtime.
AWS Elastic Disaster Recovery (DRS) provides continuous block-level replication of entire servers, achieving an RPO of seconds and RTO of minutes. It is agent-based and supports non-disruptive drills. AWS Backup, on the other hand, provides scheduled snapshots of individual AWS resources (EBS volumes, RDS databases, etc.) with an RPO of hours. DRS is designed for full server disaster recovery, while AWS Backup is for backup and restore of specific services.
DRS supports replication of EC2 instances for cross-region or cross-account disaster recovery. You install the replication agent on the source EC2 instance and configure replication to a target region or account. This allows you to fail over to a different region or account in case of a regional outage.
The staging area is a temporary environment created in your target AWS account when you set up replication. It consists of a small EC2 instance (default t3.small) and EBS volumes that store the replicated data. The staging area processes incoming block-level changes and maintains point-in-time consistent snapshots. It is not the final recovery instance; it is used to launch recovery instances during failover or drills.
To perform a non-disruptive drill, use the AWS DRS console or CLI to start a recovery with the --is-drill flag. DRS launches a recovery instance from the latest consistent snapshot in the staging area, but in a separate subnet or VPC to avoid IP conflicts. The drill does not affect the source server or ongoing replication. After testing, you can terminate the drill instance.
The replication agent requires outbound HTTPS (TCP port 443) access to the DRS service endpoint (e.g., drs.us-east-1.amazonaws.com) and to Amazon S3 to download the agent software, plus outbound TCP port 1500 to the replication servers in the staging area subnet for the replicated data. For secure replication, you can use AWS PrivateLink to keep traffic within the AWS network. If the source is in a private subnet, you need a NAT gateway or VPC endpoint.
DRS can replicate servers running databases because it replicates the entire disk at the block level. This includes the database files, logs, and system files. However, for database-consistent recovery, you should ensure that the database is in a consistent state (e.g., by using application-consistent snapshots or database-specific tools). DRS recovery points are crash-consistent; DRS does not provide application-level consistency by default.
The default retention period is 7 days, and you can configure it up to a maximum of 30 days. Recovery points older than the retention period are automatically deleted. You can adjust the retention setting in the replication configuration template.