This chapter covers disaster recovery (DR) strategies on AWS, a critical topic for the SAA-C03 exam, where it appears in roughly 10-15% of questions. You will learn the four main DR strategies — backup and restore, pilot light, warm standby, and multi-site active/active — along with their RTO and RPO characteristics, and which AWS services enable each. We'll also dive into the underlying mechanisms of key services like RDS Multi-AZ, CloudFormation, Route 53, and S3 Cross-Region Replication, and how to design for recovery objectives. By the end, you'll be able to select the appropriate strategy for a given business requirement and avoid common pitfalls that trip up exam candidates.
Imagine a company operating out of a single office building (the primary site). The business requires continuous operations, so it implements a disaster recovery plan. First, it has a backup generator (RDS Multi-AZ) that kicks in automatically if power fails: minimal downtime, but still a single building. For a fire that destroys the building, it has a second, identical building in another city (cross-region DR) kept as a cold standby: servers are off but ready to power on when needed. The most demanding option is a hot site that mirrors all operations in real time and shares the workload (multi-region active/active). The company tests failover quarterly by simulating a power outage (DR drills).

Two metrics describe each option. Recovery Time Objective (RTO) is the maximum acceptable downtime: for the generator it might be 10 seconds, for the cold site 4 hours. Recovery Point Objective (RPO) is the maximum acceptable data loss: for the generator it is effectively zero, for the cold site it is whatever changed since the last backup (perhaps 24 hours). AWS services map onto the analogy: RDS Multi-AZ (generator), RDS Cross-Region Read Replicas (a standby copy in the other city that needs manual promotion), S3 Cross-Region Replication (asynchronous, near-real-time data copy), and CloudFormation as infrastructure as code to bring the cold site up quickly. The analogy breaks down if you think the generator protects the building itself: it keeps the system running through a power failure but does nothing about a fire. Similarly, Multi-AZ doesn't protect against region failure.
What is Disaster Recovery and Why It Exists
Disaster Recovery (DR) is the process of restoring IT infrastructure and data after a catastrophic event that renders the primary site unavailable. In the cloud, DR strategies aim to minimize downtime (RTO) and data loss (RPO) while balancing cost. The SAA-C03 exam expects you to understand the trade-offs between the four common strategies and which AWS services implement each.
The Four DR Strategies
#### Backup and Restore
- RTO: Hours to days. RPO: Hours (last backup).
- Mechanism: Data is backed up to S3 (or Glacier) periodically. On failure, you restore from backup to new infrastructure.
- Key Services: AWS Backup, S3, S3 Glacier, Snowball, Storage Gateway.
- Exam Note: Cheapest strategy but highest RTO/RPO. Use when data changes infrequently or cost is paramount.

#### Pilot Light
- RTO: Minutes to hours. RPO: Minutes (near real-time replication).
- Mechanism: A minimal core set of services (the "pilot light") runs in the DR region, replicating data. On failure, you scale up by launching additional resources (e.g., EC2 instances, load balancers) using pre-prepared AMIs and CloudFormation templates.
- Key Services: RDS Cross-Region Read Replicas, CloudFormation, Route 53, AMIs.
- Exam Note: Core data services (like RDS) are always on; compute is scaled up only during failover. RTO is limited by the time to launch and configure new instances.

#### Warm Standby
- RTO: Minutes. RPO: Seconds to minutes (continuous replication).
- Mechanism: A scaled-down but fully functional copy of your production environment runs in the DR region. On failure, you scale up (e.g., increase instance sizes, add capacity) and redirect traffic.
- Key Services: RDS Multi-AZ with Cross-Region Read Replicas, EC2 Auto Scaling, ELB, Route 53.
- Exam Note: More expensive than pilot light because compute is always running. RTO is the time to scale up and switch DNS.
#### Multi-Site Active/Active
- RTO: Real-time (seconds). RPO: Near-zero (continuous, low-lag replication).
- Mechanism: Two or more regions actively serve traffic simultaneously. Data is replicated continuously between regions; DynamoDB Global Tables and Aurora Global Database do this asynchronously, typically with about a second of lag. Traffic is distributed via Route 53 latency-based or geolocation routing. If one region fails, Route 53 routes all traffic to the remaining regions.
- Key Services: Route 53, ELB, Auto Scaling, DynamoDB Global Tables, Aurora Global Database, S3 Cross-Region Replication.
- Exam Note: Most expensive but lowest RTO/RPO. Requires the application to be stateless or to handle cross-region data consistency.
RTO and RPO Defined
Recovery Time Objective (RTO): The maximum acceptable time to restore service after a disaster. Measured in seconds, minutes, or hours.
Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. For example, an RPO of 15 minutes means you can lose at most 15 minutes of data.
Exam Trap: Candidates confuse RTO and RPO. RTO is about downtime; RPO is about data loss.
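To make the distinction concrete, here is a minimal Python sketch that measures both quantities for a single incident. The timestamps and the helper names (`recovery_metrics`, `meets_objectives`) are invented for illustration; they are not part of any AWS API.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_backup, disaster, service_restored):
    """Return (data_loss_window, downtime) for one incident."""
    data_loss = disaster - last_backup       # the quantity RPO puts a ceiling on
    downtime = service_restored - disaster   # the quantity RTO puts a ceiling on
    return data_loss, downtime


def meets_objectives(data_loss, downtime, rpo, rto):
    """A recovery is acceptable only if BOTH objectives are met."""
    return data_loss <= rpo and downtime <= rto


# Hypothetical incident: last backup 02:00, disaster 02:10, service back 02:55.
last_backup = datetime(2024, 1, 1, 2, 0)
disaster = datetime(2024, 1, 1, 2, 10)
restored = datetime(2024, 1, 1, 2, 55)

data_loss, downtime = recovery_metrics(last_backup, disaster, restored)
print(data_loss, downtime)  # 0:10:00 0:45:00

# RPO of 15 minutes and RTO of 1 hour: both satisfied.
print(meets_objectives(data_loss, downtime, timedelta(minutes=15), timedelta(hours=1)))
# Tighten the RPO to 5 minutes: downtime is still fine, but the data loss is not.
print(meets_objectives(data_loss, downtime, timedelta(minutes=5), timedelta(hours=1)))
```

The second check fails even though the service came back well within the RTO, which is exactly the pattern the exam trap relies on.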
Key AWS Services and Their DR Roles
#### Amazon RDS Multi-AZ
- Provides automatic failover to a standby in a different Availability Zone (AZ) within the same region.
- RTO: Typically 1-2 minutes. RPO: Near-zero (synchronous replication).
- Not a DR solution for region failure; only protects against AZ failure.
- Mechanism: Synchronous replication from primary to standby. On failure, DNS is updated to point to the standby. No manual intervention required.

#### RDS Cross-Region Read Replicas
- Asynchronous replication from primary region to read replicas in other regions.
- RPO: Seconds to minutes (asynchronous, some lag possible).
- Failover: You must manually promote the read replica to a standalone instance. RTO includes the time to detect failure and promote.
- Exam Note: Not automatic like Multi-AZ. Used for pilot light or warm standby strategies.
#### Amazon S3 Cross-Region Replication (CRR)
- Automatically replicates objects from a source bucket to a destination bucket in another region. Versioning must be enabled on both buckets.
- RPO: Typically 15 minutes to a few hours (asynchronous, eventual consistency). S3 Replication Time Control (RTC) adds an SLA for replicating most objects within 15 minutes.
- Mechanism: Replication is triggered by new object writes (PUT operations). Existing objects are not replicated unless you use S3 Batch Replication.
- Exam Note: CRR is for DR; Same-Region Replication (SRR) is for compliance or data aggregation.
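As a rough illustration of what a replication configuration contains, the sketch below builds one as a Python dictionary and prints it as JSON. The role ARN, bucket ARN, account ID, and rule ID are placeholders, and `build_replication_config` is a hypothetical helper, not an AWS function. The field names follow the S3 replication configuration schema as I understand it; check them against the current S3 documentation before relying on them.

```python
import json


def build_replication_config(role_arn, dest_bucket_arn, rule_id="dr-replicate-all"):
    """Build a single-rule replication configuration covering the whole bucket."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to copy objects
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter: the rule applies to every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }


# Placeholder ARNs for illustration only.
config = build_replication_config(
    role_arn="arn:aws:iam::123456789012:role/example-replication-role",
    dest_bucket_arn="arn:aws:s3:::example-destination-bucket",
)
print(json.dumps(config, indent=2))
```

The printed JSON is the kind of document a `--replication-configuration file://...` argument expects.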
#### AWS CloudFormation
- Infrastructure as Code (IaC) service that allows you to define your entire environment in templates.
- Role in DR: Quickly launch identical environments in a DR region. Used in pilot light and warm standby to scale up compute.
- Exam Note: CloudFormation StackSets can deploy templates across multiple regions and accounts.

#### Route 53
- DNS service with routing policies: failover, latency, geolocation, weighted.
- DR Role: Failover routing automatically directs traffic to a healthy endpoint. Health checks monitor endpoints and trigger failover.
- Exam Note: Failover routing requires a primary and secondary record set. Associated health checks must be configured.

#### AWS Backup
- Centralized backup service that automates and consolidates backups across AWS services (EBS, RDS, DynamoDB, etc.).
- DR Role: Enables backup and restore strategy. Supports cross-region and cross-account backups.
- Exam Note: Backup plans define frequency, retention, and lifecycle rules.
Step-by-Step DR Failover Process (Pilot Light Example)
Setup: In the primary region, run your production workload. In the DR region, run a minimal RDS Cross-Region Read Replica (pilot light). Have CloudFormation templates ready to launch EC2 instances, ALB, and Auto Scaling groups. Store AMIs in the DR region.
Monitoring: Use Route 53 health checks to monitor your primary endpoint (e.g., ALB). If health checks fail consecutively (e.g., 3 failures), Route 53 marks the endpoint as unhealthy.
Failover Trigger: When the primary is unhealthy, Route 53 failover routing automatically returns the IP of the DR region's endpoint (if already running) or you manually initiate failover.
Scale Up DR Region: Execute CloudFormation template to launch EC2 instances from AMIs, configure ALB, and attach Auto Scaling groups. Promote the RDS read replica to a standalone instance (if using RDS) or redirect application to the read replica.
DNS Switch: Route 53 failover routing now points to the DR region's ALB. Users experience minimal interruption.
Recovery: Once the primary region is restored, you can replicate data back and switch over again.
DR Architecture Patterns on the Exam
Backup and Restore: S3 + AWS Backup + Snowball for large data migration. Cheapest.
Pilot Light: RDS Cross-Region Read Replica + CloudFormation + Route 53. Core DB always on.
Warm Standby: RDS Multi-AZ with Cross-Region Read Replica + EC2 Auto Scaling (min 1 instance) + ALB + Route 53. Compute always on but scaled down.
Multi-Site Active/Active: DynamoDB Global Tables or Aurora Global Database + Route 53 latency-based routing + Auto Scaling in each region. All regions serve traffic.
Exam Pitfalls
Multi-AZ vs. Cross-Region Read Replica: Multi-AZ is for AZ failure within a region, not DR. Cross-Region Read Replicas are for DR but require manual promotion.
RPO vs. RTO: Many questions describe a scenario and ask for the strategy that meets given RPO/RTO. Know the approximate ranges.
Cost vs. Recovery Speed: Cheapest = backup/restore; fastest = multi-site active/active. The exam will test your ability to match business requirements to the right strategy.
S3 CRR vs. SRR: CRR is for DR across regions; SRR is for same-region compliance. Both are asynchronous.
Configuration Examples
#### S3 CRR Setup
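```bash
aws s3api put-bucket-replication --bucket source-bucket --replication-configuration file://replication.json
```

Here `replication.json` specifies the destination bucket, the IAM role, and the rule status.

#### RDS Read Replica Promotion

```bash
aws rds promote-read-replica --db-instance-identifier my-read-replica
```

#### CloudFormation StackSet

```bash
# Create the stack set, then deploy stack instances into the DR regions.
# Target regions are an argument of create-stack-instances; 123456789012 is a placeholder account ID.
aws cloudformation create-stack-set --stack-set-name MyDRStack --template-body file://template.yaml --capabilities CAPABILITY_IAM
aws cloudformation create-stack-instances --stack-set-name MyDRStack --accounts 123456789012 --regions eu-west-1 eu-west-2
```

Summary of DR Strategies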
| Strategy | RTO | RPO | Cost | Complexity |
|----------|-----|-----|------|------------|
| Backup & Restore | Hours to days | Hours (last backup) | Low | Low |
| Pilot Light | Minutes to hours | Minutes (near real-time) | Medium | Medium |
| Warm Standby | Minutes | Seconds to minutes | Medium-High | Medium |
| Multi-Site Active/Active | Real-time (seconds) | Near-zero | High | High |
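The selection logic behind the table can be sketched as "pick the cheapest strategy whose typical RTO and RPO both fit the requirement." The Python below is only an illustration: the numeric ceilings are assumptions chosen to fall inside the ranges in the table, not AWS guarantees, and `choose_strategy` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_seconds: int  # assumed typical worst-case recovery time
    rpo_seconds: int  # assumed typical worst-case data loss


# Ordered cheapest first. All numbers are illustrative assumptions.
STRATEGIES = [
    Strategy("Backup & Restore", rto_seconds=24 * 3600, rpo_seconds=24 * 3600),
    Strategy("Pilot Light", rto_seconds=3600, rpo_seconds=600),
    Strategy("Warm Standby", rto_seconds=600, rpo_seconds=60),
    Strategy("Multi-Site Active/Active", rto_seconds=30, rpo_seconds=1),
]


def choose_strategy(required_rto_s: int, required_rpo_s: int) -> Optional[Strategy]:
    """Return the cheapest strategy meeting both objectives, or None if none does."""
    for strategy in STRATEGIES:
        if strategy.rto_seconds <= required_rto_s and strategy.rpo_seconds <= required_rpo_s:
            return strategy
    return None


# RTO under 15 minutes, RPO under 5 minutes.
print(choose_strategy(15 * 60, 5 * 60).name)   # Warm Standby
# Relaxed requirement: two days of downtime and one day of data loss are acceptable.
print(choose_strategy(48 * 3600, 24 * 3600).name)  # Backup & Restore
```

Because the list is ordered by cost, the first match is also the cheapest acceptable answer, which mirrors how "most cost-effective" exam questions are usually resolved.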
Plan and Replicate Data
Identify critical data and services. For pilot light, set up RDS Cross-Region Read Replicas in the DR region. Enable S3 CRR for static assets. Ensure replication is active and monitor lag. For warm standby, also launch a minimal EC2 instance in the DR region with the application stack. Use CloudFormation to define the full environment. This step involves configuring IAM roles for replication and testing that data flows correctly. The RPO depends on replication frequency: synchronous for Multi-AZ, asynchronous for cross-region. Monitor CloudWatch metrics like ReplicaLag for RDS and ReplicationLatency for S3.
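One way to reason about replication lag is to compare the worst observed lag against the RPO with some headroom. The sketch below assumes you already have lag samples in seconds (for example, exported from a metric such as ReplicaLag); `rpo_at_risk` and the 80% headroom figure are illustrative assumptions, not an AWS feature.

```python
from typing import Sequence


def rpo_at_risk(lag_samples_s: Sequence[float], rpo_s: float, headroom: float = 0.8) -> bool:
    """Return True if the worst observed lag exceeds headroom * RPO.

    With asynchronous replication, the lag at the moment of failure is
    roughly the data you lose, so the maximum matters more than the average.
    """
    if not lag_samples_s:
        raise ValueError("need at least one lag sample")
    return max(lag_samples_s) > headroom * rpo_s


# Hypothetical samples against a 5-minute (300 s) RPO.
print(rpo_at_risk([2, 5, 40], rpo_s=300))    # False: worst lag is 40 s
print(rpo_at_risk([2, 250, 10], rpo_s=300))  # True: 250 s is above the 240 s alert line
```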
Monitor Primary Site Health
Use Route 53 health checks to monitor the primary endpoint (e.g., ALB). Configure health check with a threshold (e.g., 3 consecutive failures) and interval (e.g., 30 seconds). Optionally use CloudWatch alarms to trigger SNS notifications or Lambda functions. For RDS, you can monitor the DB instance status. The health check should test a specific path (e.g., /health) that returns 200 OK. If the primary fails, Route 53 can automatically route to the DR endpoint if failover routing is configured. This step is critical for meeting RTO; faster detection reduces downtime.
Initiate Failover
When the primary is deemed unhealthy, initiate failover. For pilot light, this means executing CloudFormation templates to launch EC2 instances, ALB, and Auto Scaling groups in the DR region. Promote the RDS read replica to a standalone instance. For warm standby, you may only need to scale up existing instances (e.g., change instance type or add capacity). For multi-site active/active, no failover is needed; Route 53 simply stops routing to the failed region. This step must be automated as much as possible to minimize RTO. Use AWS Lambda or Step Functions to orchestrate the process.
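The orchestration idea can be sketched as an ordered runbook that stops at the first failing step. This is a plain-Python illustration, not Lambda or Step Functions code: the step functions are hypothetical stand-ins that only print, where a real implementation would call the relevant AWS APIs.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order; halt on the first failure and report what completed."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise RuntimeError(
                f"failover halted at '{name}' after completing {completed}"
            ) from exc
        completed.append(name)
    return completed


# Hypothetical stand-ins for the real actions.
def promote_read_replica() -> None:
    print("promoting RDS read replica")


def launch_compute_stack() -> None:
    print("launching EC2, ALB, and Auto Scaling from templates")


def verify_dr_endpoint() -> None:
    print("checking DR endpoint health")


pilot_light_runbook: List[Step] = [
    ("promote-replica", promote_read_replica),
    ("launch-compute", launch_compute_stack),
    ("verify-endpoint", verify_dr_endpoint),
]

print(run_failover(pilot_light_runbook))
```

Halting on the first failure keeps later steps (such as shifting traffic) from running against a half-built DR environment.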
Redirect Traffic to DR Region
Update DNS to point to the DR region's endpoint. With Route 53 failover routing, this happens automatically when the primary health check fails. Alternatively, you can manually update a CNAME or use weighted routing with 0 weight to primary. Ensure that the DR region's ALB is healthy and accepting traffic. Test that users can access the application. This step may also involve updating application configurations (e.g., database connection strings) if not using DNS. The time to propagate DNS changes affects RTO; Route 53 supports 60-second TTLs.
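A back-of-the-envelope estimate of how long an automatic DNS cutover takes is the health-check detection time plus the record TTL. The sketch below is a simplification under stated assumptions: it treats detection as interval times failure threshold, and it ignores resolvers that cache beyond the TTL and any time the DR stack needs to become ready. The function names are hypothetical.

```python
def detection_seconds(interval_s: int, failure_threshold: int) -> int:
    """Approximate time for a health check to declare the endpoint unhealthy."""
    return interval_s * failure_threshold


def worst_case_cutover_seconds(interval_s: int, failure_threshold: int, dns_ttl_s: int) -> int:
    """Detection time plus the longest a client may keep using a cached DNS answer."""
    return detection_seconds(interval_s, failure_threshold) + dns_ttl_s


# Standard 30-second checks, 3 failures, 60-second TTL.
print(worst_case_cutover_seconds(30, 3, 60))  # 150
# Fast 10-second checks with the same threshold and TTL.
print(worst_case_cutover_seconds(10, 3, 60))  # 90
```

Even in the faster case, the DNS portion alone consumes a minute and a half of the RTO budget before any compute scaling is counted.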
Recover and Fail Back
Once the primary region is restored, replicate data back from DR to primary. This may involve reversing replication direction or using database dumps. For RDS, you can create a new read replica from the promoted instance and then promote it back. For S3, use CRR in reverse or copy objects. Update Route 53 to route traffic back to primary. Perform a controlled failback during a maintenance window to minimize disruption. Document the process and test regularly. Failback is often more complex than failover, and the exam may test your understanding of data synchronization.
Enterprise Scenario 1: E-Commerce Platform
A large e-commerce company with a global customer base requires RTO < 15 minutes and RPO < 5 minutes. They choose a warm standby strategy across two AWS regions (us-east-1 and eu-west-1). In the primary region, they run a full production stack: EC2 instances behind an ALB, RDS Multi-AZ for the database, ElastiCache for session data, and S3 for product images. In the DR region, they run a scaled-down version: one EC2 instance (t3.medium vs. m5.large in prod), an RDS Cross-Region Read Replica, and a replica of the S3 bucket using CRR. Route 53 failover routing sends traffic to the primary region while it is healthy. Health checks monitor the primary ALB every 30 seconds with a 3-failure threshold. On failure, Route 53 starts answering with the DR endpoint, and a Lambda function scales out the DR Auto Scaling group onto larger instance types and promotes the RDS read replica. The RTO is about 10 minutes, and RPO is under 5 minutes. Common misconfigurations: forgetting to pre-warm the DR load balancer, or not reserving enough capacity in the DR region for scaling.
Enterprise Scenario 2: Financial Services with Compliance
A bank requires an RPO of 1 hour and an RTO of 4 hours for a legacy application that cannot be modified. They choose a backup and restore strategy using AWS Backup. Hourly backups of EBS volumes and RDS databases are copied to a backup vault in a separate region; a daily schedule would not meet the 1-hour RPO, and an archive tier such as S3 Glacier Deep Archive would not meet the 4-hour RTO because retrieval alone can take 12 hours or more. On disaster, they restore from the latest recovery point, launch EC2 instances from AMIs, and attach restored EBS volumes. The RTO is dominated by the time to restore large volumes (a volume created from a snapshot is usable immediately, but its blocks load lazily, so a 1 TB volume can take on the order of an hour to reach full performance). They also use Snowball for initial data seeding. The bank must validate quarterly that backups are restorable. A common pitfall is not enabling cross-region backup for RDS; the exam often tests that RDS automated backups are region-specific by default and must be copied to another region manually or with AWS Backup.
Enterprise Scenario 3: Media Streaming Service
A video streaming service requires zero downtime and near-zero data loss. They deploy a multi-site active/active architecture across three AWS regions (us-east-1, eu-west-1, ap-southeast-1). User sessions are stored in DynamoDB Global Tables (asynchronous replication, typically within about a second). Video content is replicated via S3 CRR. Route 53 uses geolocation routing to direct users to the region that serves their location. Each region has its own Auto Scaling group and ALB. If one region fails, Route 53 automatically redistributes traffic to the remaining regions. The service can handle a region failure without any manual intervention. The main challenge is managing eventual consistency for user metadata; the application must be designed to handle conflicts. The exam may ask about trade-offs: DynamoDB Global Tables offer strongly consistent reads within the region that accepted the write, but replicas in other regions are eventually consistent.
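DynamoDB Global Tables reconcile concurrent writes to the same item with a last-writer-wins rule. The sketch below is a deliberately simplified model of that idea, not DynamoDB's internal implementation: the `Version` type is hypothetical, and breaking timestamp ties by region name is an assumption added so that every replica picks the same winner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    written_at: float  # epoch seconds of the write
    region: str        # region that accepted the write


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the later write; break exact ties by region name (assumed rule)."""
    return max(a, b, key=lambda v: (v.written_at, v.region))


# Two regions update the same user's theme preference half a second apart.
from_us = Version("dark", written_at=100.0, region="us-east-1")
from_eu = Version("light", written_at=100.5, region="eu-west-1")

winner = last_writer_wins(from_us, from_eu)
print(winner.value)  # light
```

The earlier write is discarded without an error, so data that must never be overwritten silently is usually routed to a single region for writes or modeled as append-only items.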
SAA-C03 Exam Focus on Disaster Recovery
This topic falls under Domain 2: Design Resilient Architectures. In the official exam guide, DR strategies (backup and restore, pilot light, warm standby, active-active failover, RPO, RTO) are listed as required knowledge under Task Statement 2.2, "Design highly available and/or fault-tolerant architectures." The exam tests your ability to select the appropriate DR strategy based on given RTO and RPO requirements, and to identify which AWS services support each strategy. Expect 2-3 scenario-based questions.
Most Common Wrong Answers
Choosing Multi-AZ for region-level DR: Many candidates see "automatic failover" and think it covers region failure. Reality: Multi-AZ only protects within a region. For cross-region DR, you need RDS Cross-Region Read Replicas or Aurora Global Database.
Confusing RTO and RPO: A question might say "The company can tolerate losing at most 1 hour of data"; that describes RPO, not RTO. Candidates often pick a strategy that meets RTO but not RPO.
Selecting backup and restore when RTO is minutes: Backup and restore has hours of RTO. The exam will present a scenario with aggressive RTO (e.g., 5 minutes) and expect you to choose warm standby or multi-site.
Assuming S3 CRR provides synchronous replication: S3 CRR is asynchronous. For much lower cross-region lag (typically about a second, though still asynchronous), use DynamoDB Global Tables (for tables) or Aurora Global Database (for relational databases).
Specific Numbers and Terms
RDS Multi-AZ: RTO ~1-2 minutes, RPO near-zero.
RDS Cross-Region Read Replica: RPO seconds to minutes, RTO includes promotion time (~1-2 minutes).
S3 CRR: RPO typically 15 minutes or more.
Route 53 health check intervals: 30 seconds (standard) or 10 seconds (fast).
Route 53 failover routing: Requires primary and secondary records.
CloudFormation StackSets: Deploy to multiple regions/accounts.
AWS Backup: Supports cross-region and cross-account copy. RDS automated backups have a maximum retention of 35 days.
Edge Cases and Exceptions
If the primary region is completely unavailable and you have a read replica in another region, you must promote it manually. The exam may ask: "How do you failover RDS to another region?" Answer: Promote read replica.
For DynamoDB Global Tables, no promotion is needed: every replica table accepts reads and writes (multi-active). Failover is only transparent to users if the application retries against another region's endpoint.
S3 CRR does not replicate existing objects unless you use S3 Batch Replication. This is a common trick question.
If RTO is 0, you need multi-site active/active. No other strategy achieves zero downtime.
Eliminating Wrong Answers
If the question says "cost-effective" and RTO is hours, eliminate warm standby and multi-site. Choose backup/restore.
If RPO is 0 (no data loss), eliminate S3 CRR and cross-region read replicas, which can lag by seconds to minutes. Within a region, choose synchronous replication (Multi-AZ). Across regions, the closest fits are Aurora Global Database and DynamoDB Global Tables, which replicate asynchronously but typically within about a second.
If the application cannot be modified, eliminate strategies that require code changes (e.g., active/active may require session management changes).
The four DR strategies are: Backup and Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active, ordered by increasing cost and decreasing RTO/RPO.
RTO is the maximum acceptable downtime; RPO is the maximum acceptable data loss. Know typical ranges: Backup/Restore RTO hours, Pilot Light minutes-hours, Warm Standby minutes, Active/Active seconds.
RDS Multi-AZ protects against AZ failure only, not region failure. Use Cross-Region Read Replicas for cross-region DR (manual promotion required).
S3 Cross-Region Replication (CRR) is asynchronous; RPO is typically 15+ minutes. For near-zero cross-region RPO, use DynamoDB Global Tables or Aurora Global Database, which replicate asynchronously with typical lag of about a second.
Route 53 failover routing uses health checks to automatically direct traffic to a secondary endpoint when the primary fails.
CloudFormation StackSets can deploy identical infrastructure across multiple regions for DR.
AWS Backup centralizes backups and supports cross-region copy, enabling backup and restore strategy.
For zero RTO and near-zero RPO, you must use a multi-site active/active architecture with low-lag cross-region replication and DNS routing.
Always consider cost vs. recovery speed: the exam will test your ability to match the strategy to business requirements.
Failback is often more complex than failover; plan for data synchronization and DNS switch back.
These come up on the exam all the time. Here's how to tell them apart.
Pilot Light
Core data services (e.g., RDS) are replicated and running in DR region.
Compute (EC2) is not running; must be launched on failover.
Lower cost because no idle compute in DR.
Higher RTO (minutes to hours) due to compute launch time.
Uses CloudFormation to launch infrastructure on failover.
Warm Standby
Full stack (compute, data, networking) runs in DR region at reduced capacity.
Compute is always on (e.g., 1 EC2 instance vs. 10 in prod).
Higher cost due to always-on compute.
Lower RTO (minutes) because compute is already running.
Scales up existing resources (e.g., change instance size) on failover.
Mistake
RDS Multi-AZ provides disaster recovery across regions.
Correct
Multi-AZ only protects against an Availability Zone failure within the same region. For cross-region DR, you need RDS Cross-Region Read Replicas or Aurora Global Database.
Mistake
S3 Cross-Region Replication provides synchronous replication with zero data loss.
Correct
S3 CRR is asynchronous and eventually consistent. Data loss can occur if a disaster strikes before replication completes. RPO is typically 15 minutes or more.
Mistake
Backup and restore is the fastest recovery strategy.
Correct
Backup and restore has the highest RTO (hours to days) because you must restore data and launch infrastructure. The fastest recovery is multi-site active/active with real-time failover.
Mistake
Pilot light and warm standby are the same strategy.
Correct
In pilot light, core data services run in the DR region but compute is not active. In warm standby, a scaled-down version of the full stack runs, including compute. Warm standby has lower RTO but higher cost.
Mistake
Route 53 failover routing automatically promotes an RDS read replica.
Correct
Route 53 only handles DNS routing. Promoting an RDS read replica must be done manually or via automation (e.g., Lambda). Route 53 health checks can trigger the automation, but do not perform the promotion themselves.
RTO (Recovery Time Objective) is the maximum acceptable time to restore service after a disaster. RPO (Recovery Point Objective) is the maximum acceptable amount of data loss measured in time. For example, an RTO of 1 hour means the system must be back up within 1 hour; an RPO of 15 minutes means you can lose at most 15 minutes of data. RTO focuses on downtime, RPO on data loss. Both are specified by the business, and the chosen DR strategy must meet both.
No. RDS Multi-AZ only provides automatic failover to a standby in a different Availability Zone within the same region. For cross-region DR, you need to set up RDS Cross-Region Read Replicas, which are asynchronous and require manual promotion. Alternatively, use Aurora Global Database, which replicates across regions with typical lag under a second.
S3 CRR is asynchronous, so the RPO is typically 15 minutes or more, depending on object size and replication queue. It is not suitable for scenarios requiring near-zero data loss. For low-lag cross-region replication, consider DynamoDB Global Tables or Aurora Global Database; both are still asynchronous, but typically lag by about a second.
Use Route 53 health checks to detect primary failure. When health checks fail, a CloudWatch alarm can trigger a Lambda function or Step Functions workflow that: (1) promotes RDS read replicas, (2) executes CloudFormation templates to launch EC2 instances and other resources, and (3) updates Route 53 DNS records to point to the DR region. Automation is key to minimizing RTO.
Backup and Restore is the cheapest because it only incurs storage costs for backups (e.g., S3, Glacier) and no compute resources are running in the DR region until recovery. However, it has the highest RTO (hours to days) and RPO (hours). For cost-sensitive scenarios with relaxed recovery requirements, this is the best choice.
Yes, but AMIs are region-specific. You must copy the AMI to the DR region using the AWS Management Console, CLI, or SDK. Use the `copy-image` API. Alternatively, use a tool like EC2 Image Builder to automate cross-region AMI distribution. Remember to update the AMI after changes in the primary region.
CloudFormation allows you to define your entire infrastructure as code (IaC). In DR, you can store CloudFormation templates in a version-controlled S3 bucket and execute them in the DR region to quickly launch identical environments. StackSets enable deploying templates across multiple regions simultaneously, which is useful for warm standby or pilot light strategies.