This chapter covers AWS Disaster Recovery (DR) strategies: Backup/Restore, Pilot Light, Warm Standby, and Multi-Site. These are core to the Reliability pillar and a frequent exam topic, appearing in about 8-12% of SOA-C02 questions. You must understand the trade-offs between Recovery Time Objective (RTO), Recovery Point Objective (RPO), and cost to select the right strategy for a given scenario. The exam tests your ability to match business requirements to the correct AWS architecture.
Imagine a restaurant chain with multiple locations. The Backup/Restore strategy is like keeping frozen ingredients in a warehouse. If the restaurant burns down, you can thaw ingredients and cook meals, but it takes hours and customers leave. Pilot Light is like keeping a small prep kitchen running 24/7 with essential staff, ready to scale up to full cooking when disaster strikes: faster, but it costs more. Warm Standby is like a second, fully equipped restaurant that stays open with a skeleton crew and serves only a few tables; when disaster strikes you call in the rest of the staff, so downtime is a matter of minutes, but you pay rent on a second site. Multi-Site is like two identical restaurants both open, serving customers simultaneously; if one fails, the other handles all traffic almost instantly, giving maximum resilience at the highest cost. Each strategy trades off cost, complexity, and recovery objectives (RTO/RPO). On AWS, Backup/Restore uses S3 and S3 Glacier, Pilot Light uses a minimal EC2 and RDS stack in a second Region, Warm Standby uses a scaled-down Auto Scaling group in a second Region, and Multi-Site uses Route53 with active-active DNS routing and cross-Region data replication.
What is Disaster Recovery in AWS?
Disaster Recovery (DR) is the process of restoring IT infrastructure and data after a disaster (natural, human, or technical). AWS defines four DR strategies on a spectrum from low cost/high RTO to high cost/low RTO. The key metrics are RTO (time to recover) and RPO (maximum acceptable data loss). Exam questions often present a business requirement (e.g., 'RTO of 1 hour, RPO of 15 minutes') and ask you to choose the strategy.
Backup/Restore (RTO: hours to days, RPO: 24 hours typical)
This is the simplest and cheapest strategy. Data is backed up to Amazon S3 (or S3 Glacier for archival) periodically. In a disaster, you restore from backups by launching new EC2 instances from AMIs, restoring RDS snapshots, and copying data from S3. The RTO is high because you must provision infrastructure and restore data. The RPO depends on backup frequency (e.g., daily backups = up to 24 hours data loss). AWS services: AWS Backup, S3 Lifecycle policies, RDS automated snapshots (retention up to 35 days), manual snapshots. Exam tip: Know that RDS automated snapshots have a retention limit of 35 days; manual snapshots are kept until deleted.
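The arithmetic behind these two numbers can be sketched in a few lines of Python. This is a rough model under stated assumptions: the function names, the provisioning time, and the restore throughput are hypothetical illustration values, not AWS figures.

```python
def worst_case_rpo_hours(backup_interval_hours: float) -> float:
    """Worst case: the disaster strikes just before the next backup runs."""
    return backup_interval_hours


def estimated_rto_hours(provision_hours: float, data_gb: float,
                        restore_gb_per_hour: float) -> float:
    """Time to provision infrastructure plus time to copy the data back."""
    return provision_hours + data_gb / restore_gb_per_hour


# Daily backups: up to 24 hours of data can be lost.
print(worst_case_rpo_hours(24))
# Assumed: 1 hour of provisioning, 2000 GB restored at 500 GB per hour.
print(estimated_rto_hours(1.0, 2000, 500))
```

Halving the backup interval halves the worst-case RPO, but the RTO only improves if provisioning or restore throughput improves.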
Pilot Light (RTO: 10-60 minutes, RPO: minutes)
A minimal copy of your core infrastructure runs continuously. For example, a small EC2 instance and a database replica (an RDS cross-Region read replica; Multi-AZ does not span Regions) are kept alive in another Region. In a disaster, you "scale up" by launching additional instances, promoting the read replica, and updating DNS (Route53 failover). The core system (the "pilot light") is always on, reducing RTO. RPO is low because data is continuously replicated. Exam focus: The key is that the pilot light stack is minimal: only critical components (e.g., database, a few EC2 instances) are running. You should automate the scale-up, for example with CloudFormation or Elastic Beanstalk, so that recovery does not depend on manual steps.
Warm Standby (RTO: minutes, RPO: seconds)
A scaled-down but fully functional copy of your production environment runs in another Region. It is idle or handling minimal traffic. In a disaster, you scale up the standby to full capacity and redirect traffic. RTO is low (minutes) because the environment is already running. RPO is low because data replication is near real-time (e.g., using Database Migration Service or cross-Region replication). Exam tip: Warm Standby often uses Auto Scaling groups with minimum size = 1 (or more) in the standby Region. You must also replicate data continuously (e.g., S3 Cross-Region Replication, RDS cross-Region read replicas).
Multi-Site (Active-Active) (RTO: near zero, RPO: seconds)
Two or more independent, fully functional environments run simultaneously, each handling part of the traffic. This is the most expensive but provides the lowest RTO/RPO. DNS routing (Route53 weighted or latency-based routing) distributes traffic. If one site fails, traffic is routed to the remaining sites. Data must be replicated synchronously or asynchronously across sites. AWS services: Route53, Global Accelerator, ElastiCache Global Datastore, DynamoDB Global Tables, RDS cross-Region replication. Exam focus: Multi-Site requires applications to be stateless or handle eventual consistency. The exam often tests the difference between active-active (Multi-Site) and active-passive (Warm Standby).
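With weighted routing, each healthy record receives a share of traffic equal to its weight divided by the sum of the weights of all healthy records. The sketch below models only that arithmetic; the Region names and the `traffic_share` helper are hypothetical, and Route 53's own fallback behavior when no record is healthy is not modeled.

```python
from __future__ import annotations


def traffic_share(weights: dict[str, int], healthy: set[str]) -> dict[str, float]:
    """Share of traffic per site: weight / sum of weights of healthy sites."""
    active = {name: w for name, w in weights.items() if name in healthy and w > 0}
    total = sum(active.values())
    if total == 0:
        return {}  # nothing healthy; real DNS fallback behavior is not modeled here
    return {name: w / total for name, w in active.items()}


weights = {"us-east-1": 50, "eu-west-1": 50}

# Both sites healthy: traffic is split evenly.
print(traffic_share(weights, {"us-east-1", "eu-west-1"}))
# One site fails its health check: the survivor takes all traffic.
print(traffic_share(weights, {"eu-west-1"}))
```

This is why each site in an active-active design needs enough capacity (or fast enough scaling) to absorb the full load when its peer drops out.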
Key Components and Defaults
RTO/RPO: Not AWS defaults but business-defined. AWS provides tools to achieve them.
Route53 Failover: Health checks run every 30 seconds by default; a 10-second "fast" interval is available at extra cost. Failure threshold: 3 consecutive failures (default). Both values are configurable.
RDS Multi-AZ: Synchronous replication to standby in different AZ. RPO = 0 for AZ failure.
RDS Cross-Region Read Replicas: Asynchronous replication. RPO = seconds to minutes.
S3 Cross-Region Replication (CRR): Asynchronous. Most objects replicate within 15 minutes, but that window is backed by an SLA only when S3 Replication Time Control (RTC) is enabled.
DynamoDB Global Tables: Multi-Region, multi-master. Replication latency ~1 second.
How They Interact
Often a single DR plan uses multiple strategies: Backup/Restore for long-term archival, Pilot Light for critical data, and Multi-Site for front-end web servers. The exam expects you to identify which strategy meets given RTO/RPO and cost constraints.
Configuration and Verification
AWS Backup: Create backup plans with lifecycle rules. Verify by restoring a test instance.
CloudFormation: Use StackSets to deploy infrastructure in multiple Regions.
Route53: Configure failover routing with health checks. Test by disabling a health check target.
Data replication: Monitor replication lag via CloudWatch metrics (e.g., RDS ReplicaLag).
Common Exam Scenarios
A company requires RTO < 1 hour and RPO < 5 minutes. Warm Standby meets both targets with margin and is cheaper than Multi-Site; Pilot Light can also qualify if its scale-up reliably completes within the hour.
A company has a small budget and can tolerate 24-hour RTO. Backup/Restore is correct.
A company needs zero downtime. Multi-Site is required.
A company has a critical database and wants minimal data loss. Use Pilot Light with RDS cross-Region read replica.
The exam may present a scenario where you must choose the right services: e.g., "Which strategy uses Route53 failover and Auto Scaling?" Answer: Warm Standby or Pilot Light (depending on details).
Trap Patterns
Confusing Pilot Light with Warm Standby: Pilot Light has minimal running resources; Warm Standby has a scaled-down but full environment.
Assuming Multi-Site requires synchronous replication: It can be asynchronous if application tolerates eventual consistency.
Thinking Backup/Restore is always wrong: It is valid for low-priority workloads.
Overlooking RDS Multi-AZ: It only protects against AZ failure, not Region failure. For cross-Region, you need cross-Region read replicas.
Summary of Trade-offs
| Strategy | RTO | RPO | Cost |
|----------|-----|-----|------|
| Backup/Restore | Hours | 24h | $ |
| Pilot Light | 10-60 min | Minutes | $$ |
| Warm Standby | Minutes | Seconds | $$$ |
| Multi-Site | Near zero | Seconds | $$$$ |
Assess Business Requirements
First, determine the RTO and RPO from the business. For example, an e-commerce site might require RTO < 15 minutes and RPO < 1 minute. This drives the strategy choice. Also consider budget, compliance, and geographic constraints. Document these in a DR plan.
Select DR Strategy and AWS Services
Based on requirements, choose Backup/Restore, Pilot Light, Warm Standby, or Multi-Site. Map services: Backup/Restore uses S3 and AWS Backup; Pilot Light uses minimal EC2, RDS read replicas; Warm Standby uses Auto Scaling and Route53; Multi-Site uses Route53 active-active, Global Accelerator, and multi-Region databases.
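The selection logic can be sketched as "pick the cheapest strategy whose best case still meets both targets." The thresholds below are illustrative assumptions loosely based on the ranges quoted in this chapter, not AWS guarantees, and `pick_strategy` is a hypothetical helper.

```python
from __future__ import annotations

# (name, assumed best RTO in minutes, assumed best RPO in minutes),
# ordered from cheapest to most expensive. Illustrative numbers only.
STRATEGIES = [
    ("Backup/Restore", 120.0, 60.0),
    ("Pilot Light", 10.0, 1.0),
    ("Warm Standby", 2.0, 0.1),
    ("Multi-Site", 0.0, 0.1),
]


def pick_strategy(rto_minutes: float, rpo_minutes: float) -> str | None:
    """Return the cheapest strategy whose assumed best case meets both targets."""
    for name, best_rto, best_rpo in STRATEGIES:
        if best_rto <= rto_minutes and best_rpo <= rpo_minutes:
            return name
    return None  # no strategy in this table meets the targets (e.g. RPO = 0)


print(pick_strategy(24 * 60, 24 * 60))  # Backup/Restore
print(pick_strategy(30, 10))            # Pilot Light
print(pick_strategy(5, 1))              # Warm Standby
print(pick_strategy(0.5, 1))            # Multi-Site
```

Ordering the table by cost is the design choice that matters: it encodes the exam heuristic of never paying for a tighter RTO/RPO than the business asked for.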
Implement Data Replication
For Pilot Light and above, replicate data continuously. Use RDS cross-Region read replicas (asynchronous), S3 CRR, or DynamoDB Global Tables. For Backup/Restore, schedule snapshots. Ensure replication lag meets RPO. Monitor with CloudWatch metrics like ReplicaLag.
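Checking replication lag against the RPO is a simple comparison once the datapoints are in hand. The sketch below assumes the lag samples (in seconds, as the RDS ReplicaLag metric reports them) have already been fetched from CloudWatch; the sample values and the `rpo_breaches` helper are hypothetical.

```python
from __future__ import annotations


def rpo_breaches(lag_seconds: list[float], rpo_seconds: float) -> list[int]:
    """Return the indexes of samples where replication lag exceeded the RPO."""
    return [i for i, lag in enumerate(lag_seconds) if lag > rpo_seconds]


# Hypothetical ReplicaLag datapoints, one per minute, in seconds.
samples = [2.0, 3.5, 75.0, 4.0]

breaches = rpo_breaches(samples, rpo_seconds=60)
print(breaches)  # [2]
if breaches:
    print(f"RPO at risk in {len(breaches)} of {len(samples)} samples")
```

In practice the same threshold would usually live in a CloudWatch alarm, so that a breach pages someone before a disaster turns the lag into actual data loss.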
Deploy Infrastructure as Code
Use CloudFormation or Terraform to define the DR environment. For Pilot Light, a template launches minimal resources. For Warm Standby, the template includes Auto Scaling groups with desired capacity = 1+ in standby Region. Use StackSets for multi-Region deployment.
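As a rough illustration, the standby Auto Scaling group can be generated as a CloudFormation resource fragment whose capacity is the only thing that changes between "standby" and "failed over." The logical IDs `StandbySubnets` and `AppLaunchTemplate` are hypothetical and would have to be defined elsewhere in the template; this is a fragment, not a deployable stack.

```python
import json


def standby_asg(min_size: int, desired: int, max_size: int) -> dict:
    """Build an AWS::AutoScaling::AutoScalingGroup resource fragment."""
    return {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            # CloudFormation expects these sizes as strings.
            "MinSize": str(min_size),
            "DesiredCapacity": str(desired),
            "MaxSize": str(max_size),
            # Hypothetical references, defined elsewhere in the template.
            "VPCZoneIdentifier": {"Ref": "StandbySubnets"},
            "LaunchTemplate": {
                "LaunchTemplateId": {"Ref": "AppLaunchTemplate"},
                "Version": {"Fn::GetAtt": ["AppLaunchTemplate", "LatestVersionNumber"]},
            },
        },
    }


# Warm Standby: one instance running, room to scale to ten after failover.
fragment = {"Resources": {"StandbyAsg": standby_asg(1, 1, 10)}}
print(json.dumps(fragment, indent=2))
```

Keeping the capacity values as parameters means the same definition can describe both the idle standby and the scaled-up recovery state.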
Configure DNS Failover and Health Checks
In Route53, create a failover routing policy (active-passive) for Pilot Light/Warm Standby, or weighted/latency for Multi-Site. Associate health checks with each endpoint. Set the health check interval (30s by default, or 10s for fast checks) and the failure threshold (default 3). Test by simulating failure.
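These settings put a floor under the achievable RTO. A rough estimate of how long clients keep reaching the failed endpoint is the detection time (interval times failure threshold) plus the DNS record TTL. The sketch below is a simplification that ignores health-checker aggregation and resolver caching quirks; `failover_seconds` is a hypothetical helper, and the 60-second TTL is an assumed value.

```python
def failover_seconds(interval_s: int, failure_threshold: int, record_ttl_s: int) -> int:
    """Rough time until clients resolve the secondary: detection plus TTL."""
    if interval_s not in (10, 30):
        raise ValueError("Route 53 request intervals are 10 or 30 seconds")
    if failure_threshold < 1:
        raise ValueError("failure threshold must be at least 1")
    return interval_s * failure_threshold + record_ttl_s


# 30-second checks, 3 failures, 60-second TTL.
print(failover_seconds(30, 3, 60))  # 150
# 10-second fast checks with the same threshold and TTL.
print(failover_seconds(10, 3, 60))  # 90
```

If the target RTO is only a few minutes, DNS detection and TTL alone can consume most of the budget, which is worth checking before sizing the rest of the recovery.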
Automate Failover and Recovery
Use AWS Lambda or Step Functions to automate failover actions (e.g., promote read replica, scale up Auto Scaling group, update DNS). For Multi-Site, traffic shifting is automatic via DNS. Document runbooks and test regularly.
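The shape of such automation is an ordered runbook that stops at the first failed step, so DNS is never pointed at a database that was not promoted. The sketch below uses stand-in functions instead of real AWS calls; every step name and function here is hypothetical.

```python
from __future__ import annotations

from typing import Callable


def run_failover(steps: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run steps in order; halt on the first one that reports failure."""
    completed: list[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"failover halted at step: {name}")
        completed.append(name)
    return completed


# Stand-ins for the real actions (each would call an AWS API in practice).
def promote_replica() -> bool:
    return True


def scale_up_standby() -> bool:
    return True


def update_dns() -> bool:
    return True


runbook = [
    ("promote read replica", promote_replica),
    ("scale up Auto Scaling group", scale_up_standby),
    ("update DNS", update_dns),
]
print(run_failover(runbook))
```

Step Functions expresses the same idea as a state machine, with retries and error handling declared per step rather than coded by hand.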
Test and Validate DR Plan
Conduct regular DR drills. Use AWS Fault Injection Simulator to inject failures. Verify RTO/RPO metrics. For Backup/Restore, restore a test instance. For Pilot Light, simulate Region failure by disabling primary site. Adjust configurations based on results.
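A drill only validates the plan if the achieved RTO and RPO are actually measured. One simple way, assuming you record three timestamps during the drill, is shown below; the timestamps are made-up example values and the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def measured_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime actually experienced during the drill."""
    return service_restored - outage_start


def measured_rpo(last_replicated_write: datetime, outage_start: datetime) -> timedelta:
    """Window of writes that never reached the standby."""
    return outage_start - last_replicated_write


outage = datetime(2024, 1, 15, 10, 0, 0)
restored = datetime(2024, 1, 15, 10, 12, 30)
last_write = datetime(2024, 1, 15, 9, 59, 40)

print(measured_rto(outage, restored))    # 0:12:30
print(measured_rpo(last_write, outage))  # 0:00:20

# Compare against the targets from the DR plan (assumed values).
print(measured_rto(outage, restored) <= timedelta(minutes=15))
print(measured_rpo(last_write, outage) <= timedelta(minutes=1))
```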
Enterprise Scenario 1: E-commerce Platform with Strict RTO/RPO
A large online retailer requires RTO < 5 minutes and RPO < 1 minute for its transactional database. They choose Warm Standby with a scaled-down environment in a second AWS Region. The primary Region runs production on an RDS Multi-AZ database, while the standby Region runs a single EC2 instance for the web app and a cross-Region read replica of that database. Data is replicated asynchronously via RDS replication, so replica lag must be monitored against the 1-minute RPO. Route53 is configured with failover routing and health checks on the primary ALB. In a disaster, the read replica is promoted to a standalone primary, the Auto Scaling group scales up to full capacity, and Route53 fails over DNS to the standby. The cost is about 40% of production, and downtime is limited to a few minutes.
Enterprise Scenario 2: Regulatory Compliance with Long-Term Archival
A financial institution must retain transaction records for 7 years. They use Backup/Restore for archival data. Daily RDS snapshots are taken; recent snapshots are kept for restores, while older snapshot data is exported to S3 and moved to S3 Glacier Deep Archive by lifecycle rules. AWS Backup schedules the backups and enforces retention. RTO is 24 hours, RPO is 24 hours, which is acceptable for compliance. In a disaster, they restore from the latest snapshot to a new RDS instance. They also have a separate Pilot Light for critical customer-facing APIs with RTO of 30 minutes.
Common Pitfalls
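Underestimating replication lag: Cross-Region replication can have lag of seconds to minutes. If the RPO is tighter than the lag you observe, asynchronous replication cannot meet it. RDS cross-Region read replicas and DynamoDB Global Tables (by default) both replicate asynchronously, so monitor the lag and alarm on it rather than assuming it is zero.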
Not testing failover: Many companies fail their first DR test because automation scripts have errors or permissions are missing.
Ignoring data consistency: Asynchronous replication can lead to conflicts. Use conflict resolution strategies (e.g., last writer wins).
Over-provisioning standby: For Warm Standby, keep standby small to save cost, but ensure it can scale quickly. Use Auto Scaling with scale-out policies based on CPU or memory.
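The conflict-resolution pitfall above is easier to see with a concrete case. The sketch below shows last-writer-wins applied to two versions of the same item written in different Regions; the `Version` type, its fields, and the sample values are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    updated_at: float  # write timestamp in seconds (hypothetical field)
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the version with the newer timestamp; ties go to the first argument."""
    return a if a.updated_at >= b.updated_at else b


east = Version("cart=3 items", 1700000000.0, "us-east-1")
west = Version("cart=4 items", 1700000000.4, "us-west-2")

winner = last_writer_wins(east, west)
print(winner.value)   # cart=4 items
print(winner.region)  # us-west-2
```

Note what the rule costs: the us-east-1 write is silently discarded. That is acceptable for a shopping cart and usually not for a ledger, which is why the application has to be designed around it.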
What SOA-C02 Tests on This Topic
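Domain 2 (Reliability and Business Continuity): implementing high availability, resilient environments, and backup and restore strategies. You must be able to recommend the appropriate DR strategy based on RTO/RPO and cost.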
Common services: Route53, RDS, S3, CloudFormation, Auto Scaling, AWS Backup.
Specific values: Route53 health check interval (30s standard, 10s fast), failure threshold (3), RDS automated snapshot retention (up to 35 days).
Most Common Wrong Answers
Choosing Multi-Site when Warm Standby is sufficient: Candidates see low RTO and automatically pick Multi-Site, but Multi-Site is expensive and unnecessary if RTO of minutes is acceptable.
Selecting Pilot Light when Backup/Restore is cheaper: If RTO is 24 hours, Backup/Restore is correct. Pilot Light would be overkill.
Confusing RDS Multi-AZ with cross-Region DR: Multi-AZ protects against AZ failure, not Region failure. For DR, you need cross-Region read replicas or Aurora Global Database.
Assuming Backup/Restore cannot meet low RPO: With RDS automated backups and point-in-time recovery, the latest restorable time is typically within the last 5 minutes, so RPO can be low, but RTO remains high due to restore time.
Specific Exam Traps
The exam may ask: "Which strategy uses Route53 failover and Auto Scaling?" The answer could be Pilot Light or Warm Standby. The difference: Pilot Light has minimal running instances; Warm Standby has a scaled-down environment that can handle some traffic.
Questions about RDS: Know that automated snapshots have 35-day retention; manual snapshots are indefinite.
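S3 CRR: Requires versioning enabled on both buckets. Most objects replicate within 15 minutes, but that window is backed by an SLA only when S3 Replication Time Control (RTC) is enabled.
DynamoDB Global Tables: Multi-master and, by default, eventually consistent across Regions. Strongly consistent reads only reflect writes made in the same Region, and DAX does not change that (it is a cache, not a consistency mechanism), so cross-Region consistency has to be handled in application logic.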
How to Eliminate Wrong Answers
If the scenario mentions "minimal cost" and "RTO of 24 hours", eliminate Pilot Light, Warm Standby, Multi-Site — they are more expensive.
If the scenario says "RTO of 1 hour", Backup/Restore is too slow; Pilot Light or Warm Standby fits.
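If the scenario says "zero data loss" (RPO=0), you need synchronous replication. On AWS that is what RDS Multi-AZ provides between AZs within one Region. Cross-Region options such as RDS cross-Region read replicas and Aurora Global Database replicate asynchronously (Aurora Global Database typically lags by about a second), so Pilot Light, Warm Standby, and most Multi-Site designs deliver an RPO of seconds rather than zero.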
If the scenario mentions "active-active", it must be Multi-Site.
Edge Cases
Cross-Region failover for S3: Use CRR or S3 Batch Operations to replicate data. For static websites, use Route53 failover to a secondary bucket.
Multi-Region application with stateful sessions: Use ElastiCache Global Datastore or DynamoDB Global Tables for session state.
DR for serverless applications: Use Lambda with cross-Region replication of DynamoDB tables and S3 buckets. API Gateway can have custom domain names with Route53 failover.
DR strategies on a spectrum: Backup/Restore (low cost, high RTO) to Multi-Site (high cost, low RTO).
RTO and RPO are business-defined; AWS provides services to meet them.
Pilot Light uses a minimal running stack; Warm Standby uses a scaled-down but full stack.
RDS Multi-AZ is for AZ failure within a Region, not cross-Region DR.
Route53 health checks run every 30 seconds by default (10 seconds with fast checks); the default failure threshold is 3.
S3 Cross-Region Replication requires versioning on both source and destination buckets.
DynamoDB Global Tables provide multi-Region, multi-master with eventual consistency.
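AWS Backup can automate backup schedules and retention rules for services such as RDS, EC2, EFS, DynamoDB, and S3.
For RPO=0, you need synchronous replication (e.g., RDS Multi-AZ within a Region); cross-Region options such as Aurora Global Database are asynchronous, with an RPO that is typically about a second.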
Regular DR testing is essential; use AWS Fault Injection Simulator.
CloudFormation StackSets allow consistent deployment across Regions.
Always consider cost vs. RTO/RPO trade-offs when choosing a strategy.
These come up on the exam all the time. Here's how to tell them apart.
Pilot Light
Minimal running resources (e.g., one EC2, one DB replica).
RTO: 10-60 minutes (requires scaling up).
Cost: ~10-20% of production.
Data replication: continuous (e.g., RDS read replica).
Failover: manual or automated scaling of resources.
Warm Standby
Scaled-down but complete environment (e.g., Auto Scaling min=1).
RTO: minutes (already running, scale up quickly).
Cost: ~30-50% of production.
Data replication: continuous (e.g., RDS cross-Region replica).
Failover: DNS switch and scale up (if needed).
Backup/Restore
RTO: hours to days.
RPO: depends on backup frequency (e.g., 24h).
Cost: lowest (storage only).
Recovery: manual restore from snapshots/backups.
Use case: non-critical workloads, compliance archives.
Multi-Site
RTO: near zero (automatic traffic shift).
RPO: seconds (synchronous or async replication).
Cost: highest (two full environments).
Recovery: automatic via DNS routing.
Use case: mission-critical, zero downtime required.
Mistake
Pilot Light and Warm Standby are the same strategy.
Correct
Pilot Light runs only a minimal core (e.g., database, a single EC2) while Warm Standby runs a scaled-down but complete environment that can handle some traffic. RTO for Pilot Light is typically 10-60 minutes; Warm Standby can achieve minutes.
Mistake
RDS Multi-AZ provides cross-Region disaster recovery.
Correct
RDS Multi-AZ only replicates synchronously within a single Region (different AZs). For cross-Region DR, you need RDS cross-Region read replicas or Aurora Global Database.
Mistake
Backup/Restore always has RPO of 24 hours.
Correct
You can achieve lower RPO by taking more frequent snapshots (e.g., every hour), but RTO remains high because restoring infrastructure takes time.
Mistake
Multi-Site requires synchronous replication between sites.
Correct
Multi-Site can use asynchronous replication if the application can tolerate eventual consistency. Synchronous replication is only needed for zero RPO.
Mistake
Route53 health checks are only for public endpoints.
Correct
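Route53 health checkers run on the public internet and cannot reach private endpoints directly. To monitor a private resource, publish a CloudWatch metric for it, create a CloudWatch alarm on that metric, and base a Route53 health check on the alarm state.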
Pilot Light runs only a minimal set of core resources (e.g., a database replica and a small EC2 instance) that are always on. When a disaster occurs, you scale up by launching additional infrastructure. Warm Standby runs a scaled-down but fully functional copy of your production environment (e.g., Auto Scaling group with min=1). It can handle some traffic immediately. The key difference: Pilot Light requires more provisioning during recovery (higher RTO), while Warm Standby is ready to scale quickly (lower RTO).
Yes, but only with synchronous replication. Within a Region, RDS Multi-AZ replicates synchronously to a standby in another AZ, so an AZ failure loses no committed data. Across Regions, the managed options (RDS cross-Region read replicas, Aurora Global Database, and DynamoDB Global Tables in their default configuration) replicate asynchronously, so guaranteeing zero data loss across Regions requires application-level synchronous replication, which adds write latency. Most DR strategies (Pilot Light, Warm Standby) use asynchronous replication, so RPO is seconds to minutes.
AWS Backup is the primary service for managing backups across services (EC2, RDS, EFS, DynamoDB). For long-term archival, use S3 with Lifecycle policies to transition to Glacier or Deep Archive. RDS automated snapshots (retention up to 35 days) and manual snapshots (indefinite) are also key. For EC2, create AMIs regularly. S3 Cross-Region Replication can be used for additional resilience.
Create a Route53 health check that monitors your primary endpoint. You can manually disable the health check target (e.g., stop the EC2 instance or block traffic) to simulate failure. Route53 will automatically failover to the secondary record after the health check threshold (default 3 failures). Monitor the Route53 console for health check status. You can also use AWS Fault Injection Simulator to introduce latency or errors.
Multi-Site requires two fully provisioned, active environments, so you pay for both sites at full production capacity. Warm Standby runs a scaled-down environment (e.g., 1 EC2 instance instead of 10), so it typically costs 30-50% of production. Multi-Site can be 2x production cost. For Pilot Light, cost is even lower (10-20%) because only critical components run.
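The percentages above translate into a quick back-of-the-envelope estimate of what the DR environment adds to the bill. The sketch uses this chapter's ranges as assumptions (they are rules of thumb, not AWS pricing), and the $10,000 monthly production cost is a made-up example.

```python
from __future__ import annotations

# Extra monthly cost of the DR environment as a percentage of production.
# Ranges are this chapter's rules of thumb, not AWS pricing.
DR_COST_PERCENT = {
    "Pilot Light": (10, 20),
    "Warm Standby": (30, 50),
    "Multi-Site": (100, 100),  # a second full environment, so 2x in total
}


def dr_cost_range(production_monthly: float, strategy: str) -> tuple[float, float]:
    """Return the (low, high) added monthly cost for the given strategy."""
    low, high = DR_COST_PERCENT[strategy]
    return (production_monthly * low / 100, production_monthly * high / 100)


for name in DR_COST_PERCENT:
    print(name, dr_cost_range(10_000, name))
```

Backup/Restore is left out on purpose: its cost is driven by storage volume rather than by a fraction of production compute.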
Yes. Use CloudFormation StackSets to deploy the same stack in multiple Regions. For Pilot Light, the stack includes a minimal set of resources. For Warm Standby, include an Auto Scaling group with a desired capacity that can be scaled up. For failover automation, you can use Lambda functions triggered by CloudWatch alarms or Route53 health check status.
RTO for Backup/Restore is typically hours to days. Restoring data from S3 to new EC2 instances involves launching instances, restoring EBS volumes from snapshots, and copying data from S3. For large datasets, this can take several hours. If you need faster recovery, consider Pilot Light or Warm Standby.