SOA-C02 · Chapter 57 of 104 · Objective 2.2

DR Strategies: Backup/Restore, Pilot Light, Warm Standby, Multi-Site

This chapter covers AWS Disaster Recovery (DR) strategies: Backup/Restore, Pilot Light, Warm Standby, and Multi-Site. These are core to the Reliability pillar and a frequent exam topic, appearing in about 8-12% of SOA-C02 questions. You must understand the trade-offs between Recovery Time Objective (RTO), Recovery Point Objective (RPO), and cost to select the right strategy for a given scenario. The exam tests your ability to match business requirements to the correct AWS architecture.

25 min read

Intermediate

Updated May 31, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security

Disaster Recovery as Backup Plans for a Restaurant Chain

Imagine a restaurant chain with multiple locations. Backup/Restore is like keeping frozen ingredients in a warehouse: if the restaurant burns down, you can find a new kitchen, thaw the ingredients, and cook again, but it takes hours or days and customers leave in the meantime. Pilot Light is like keeping a small prep kitchen running 24/7 with essential staff, ready to scale up to full service when disaster strikes; recovery is faster, but it costs more. Warm Standby is like a second restaurant that is fully equipped and open with a skeleton crew, serving a few tables; when the main location fails, you call in more staff and it takes over within minutes, at the cost of a second lease. Multi-Site is like two identical restaurants both open and serving customers simultaneously; if one fails, the other absorbs all the traffic almost instantly, which gives maximum resilience at the highest cost. Each strategy trades off cost, complexity, and recovery objectives (RTO/RPO). On AWS, Backup/Restore uses S3 and S3 Glacier, Pilot Light uses a minimal EC2 and RDS stack in a second Region, Warm Standby uses a scaled-down Auto Scaling group in a second Region, and Multi-Site uses Route53 with active-active routing and cross-Region data replication.

How It Actually Works

What is Disaster Recovery in AWS?

Disaster Recovery (DR) is the process of restoring IT infrastructure and data after a disaster (natural, human, or technical). AWS defines four DR strategies on a spectrum from low cost/high RTO to high cost/low RTO. The key metrics are RTO (time to recover) and RPO (maximum acceptable data loss). Exam questions often present a business requirement (e.g., 'RTO of 1 hour, RPO of 15 minutes') and ask you to choose the strategy.

Backup/Restore (RTO: hours to days, RPO: 24 hours typical)

This is the simplest and cheapest strategy. Data is backed up to Amazon S3 (or S3 Glacier for archival) periodically. In a disaster, you restore from backups by launching new EC2 instances from AMIs, restoring RDS snapshots, and copying data from S3. The RTO is high because you must provision infrastructure and restore data. The RPO depends on backup frequency (e.g., daily backups = up to 24 hours data loss). AWS services: AWS Backup, S3 Lifecycle policies, RDS automated snapshots (retention up to 35 days), manual snapshots. Exam tip: Know that RDS automated snapshots have a retention limit of 35 days; manual snapshots are kept until deleted.
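As a rough worked example of why Backup/Restore gives an RPO tied to backup frequency and an RTO dominated by provisioning and restore time, the sketch below does the arithmetic. The function names and the sample figures (a 500 GB dataset, 200 MB/s restore throughput, 30 minutes of provisioning) are illustrative assumptions, not AWS-published numbers.

```python
def worst_case_rpo_hours(backup_interval_hours: float) -> float:
    """Worst case: the disaster hits just before the next backup would run."""
    return backup_interval_hours


def estimated_rto_hours(data_gb: float, restore_mb_per_s: float,
                        provisioning_minutes: float) -> float:
    """Provisioning time plus the time to copy the data back.

    Uses 1 GB = 1000 MB; all inputs are assumptions supplied by the caller.
    """
    restore_seconds = (data_gb * 1000) / restore_mb_per_s
    return provisioning_minutes / 60 + restore_seconds / 3600


if __name__ == "__main__":
    # Daily backups: up to 24 hours of data can be lost.
    print(worst_case_rpo_hours(24))
    # Hypothetical 500 GB restore at 200 MB/s after 30 minutes of provisioning.
    print(round(estimated_rto_hours(500, 200, 30), 2))
```

Real restores are usually slower than a raw throughput figure suggests, so treat the result as a lower bound and confirm it with an actual restore test.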

Pilot Light (RTO: 10-60 minutes, RPO: minutes)

The data layer and other core components of your infrastructure run continuously in another Region at minimal size. For example, an RDS cross-Region read replica and, at most, a small EC2 instance are kept alive there (RDS Multi-AZ does not qualify, because its standby stays in the same Region). In a disaster, you "scale up" by launching additional instances, promoting the read replica, and updating DNS (Route53 failover). Because the core system (the "pilot light") is always on, RTO is lower than with Backup/Restore. RPO is low because data is continuously replicated. Exam focus: The key is that the pilot light stack is minimal; only critical components (e.g., the database and perhaps a few EC2 instances) are running, and it cannot serve production traffic until it is scaled up. Automate the scale-up, for example with CloudFormation or Elastic Beanstalk, so that recovery is repeatable.

Warm Standby (RTO: minutes, RPO: seconds)

A scaled-down but fully functional copy of your production environment runs in another Region. It is idle or handling minimal traffic. In a disaster, you scale up the standby to full capacity and redirect traffic. RTO is low (minutes) because the environment is already running. RPO is low because data replication is near real-time (e.g., using Database Migration Service or cross-Region replication). Exam tip: Warm Standby often uses Auto Scaling groups with minimum size = 1 (or more) in the standby Region. You must also replicate data continuously (e.g., S3 Cross-Region Replication, RDS cross-Region read replicas).

Multi-Site (Active-Active) (RTO: near zero, RPO: seconds)

Two or more independent, fully functional environments run simultaneously, each handling part of the traffic. This is the most expensive but provides the lowest RTO/RPO. DNS routing (Route53 weighted or latency-based routing) distributes traffic. If one site fails, traffic is routed to the remaining sites. Data must be replicated synchronously or asynchronously across sites. AWS services: Route53, Global Accelerator, ElastiCache Global Datastore, DynamoDB Global Tables, RDS cross-Region replication. Exam focus: Multi-Site requires applications to be stateless or handle eventual consistency. The exam often tests the difference between active-active (Multi-Site) and active-passive (Warm Standby).

Key Components and Defaults

RTO/RPO: Not AWS defaults but business-defined. AWS provides tools to achieve them.

Route53 Failover: Health checks run every 30 seconds by default (10 seconds with the fast interval option). Failure threshold: 3 consecutive failures by default. Both are configurable.

RDS Multi-AZ: Synchronous replication to standby in different AZ. RPO = 0 for AZ failure.

RDS Cross-Region Read Replicas: Asynchronous replication. RPO = seconds to minutes.

S3 Cross-Region Replication (CRR): Asynchronous. Most objects replicate within 15 minutes, but only S3 Replication Time Control (RTC) backs that 15-minute target with an SLA.

DynamoDB Global Tables: Multi-Region, multi-master. Replication latency ~1 second.

How They Interact

Often a single DR plan uses multiple strategies: Backup/Restore for long-term archival, Pilot Light for critical data, and Multi-Site for front-end web servers. The exam expects you to identify which strategy meets given RTO/RPO and cost constraints.

Configuration and Verification

AWS Backup: Create backup plans with lifecycle rules. Verify by restoring a test instance.

CloudFormation: Use StackSets to deploy infrastructure in multiple Regions.

Route53: Configure failover routing with health checks. Test by disabling a health check target.

Data replication: Monitor replication lag via CloudWatch metrics (e.g., RDS ReplicaLag).

Common Exam Scenarios

A company requires RTO < 1 hour and RPO < 5 minutes. Warm Standby is best because it's cheaper than Multi-Site but meets RTO/RPO.

A company has a small budget and can tolerate 24-hour RTO. Backup/Restore is correct.

A company needs zero downtime. Multi-Site is required.

A company has a critical database and wants minimal data loss. Use Pilot Light with RDS cross-Region read replica.

The exam may present a scenario where you must choose the right services: e.g., "Which strategy uses Route53 failover and Auto Scaling?" Answer: Warm Standby or Pilot Light (depending on details).

Trap Patterns

Confusing Pilot Light with Warm Standby: Pilot Light has minimal running resources; Warm Standby has a scaled-down but full environment.

Assuming Multi-Site requires synchronous replication: It can be asynchronous if application tolerates eventual consistency.

Thinking Backup/Restore is always wrong: It is valid for low-priority workloads.

Overlooking RDS Multi-AZ: It only protects against AZ failure, not Region failure. For cross-Region, you need cross-Region read replicas.

Summary of Trade-offs

| Strategy | RTO | RPO | Cost |
|----------|-----|-----|------|
| Backup/Restore | Hours | 24h | $ |
| Pilot Light | 10-60 min | Minutes | $$ |
| Warm Standby | Minutes | Seconds | $$$ |
| Multi-Site | Near zero | Seconds | $$$$ |
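The selection logic behind this table can be sketched as "pick the cheapest strategy whose RTO and RPO both fit the requirement". The numeric ceilings below are assumptions chosen to make the table's ranges concrete (for example, treating "Minutes" as 15 and "Seconds" as 1 minute); they are not AWS figures, and the names `Strategy` and `cheapest_strategy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed worst-case recovery time
    rpo_minutes: float  # assumed worst-case data-loss window
    cost_rank: int      # 1 = cheapest, 4 = most expensive


# Illustrative ceilings loosely derived from the table above.
STRATEGIES = [
    Strategy("Backup/Restore", 24 * 60, 24 * 60, 1),
    Strategy("Pilot Light", 60, 15, 2),
    Strategy("Warm Standby", 10, 1, 3),
    Strategy("Multi-Site", 1, 0.5, 4),
]


def cheapest_strategy(rto_minutes: float, rpo_minutes: float) -> Optional[str]:
    """Return the lowest-cost strategy whose assumed RTO and RPO fit the targets."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= rto_minutes and s.rpo_minutes <= rpo_minutes
    ]
    if not candidates:
        return None  # targets are tighter than any assumed ceiling
    return min(candidates, key=lambda s: s.cost_rank).name


if __name__ == "__main__":
    print(cheapest_strategy(60, 5))            # RTO 1 hour, RPO 5 minutes
    print(cheapest_strategy(24 * 60, 24 * 60)) # RTO and RPO of 24 hours
```

The same elimination order works on exam questions: discard every strategy that misses the RTO or RPO, then choose the cheapest one left.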

Walk-Through

Assess Business Requirements

First, determine the RTO and RPO from the business. For example, an e-commerce site might require RTO < 15 minutes and RPO < 1 minute. This drives the strategy choice. Also consider budget, compliance, and geographic constraints. Document these in a DR plan.

Select DR Strategy and AWS Services

Based on requirements, choose Backup/Restore, Pilot Light, Warm Standby, or Multi-Site. Map services: Backup/Restore uses S3 and AWS Backup; Pilot Light uses minimal EC2, RDS read replicas; Warm Standby uses Auto Scaling and Route53; Multi-Site uses Route53 active-active, Global Accelerator, and multi-Region databases.

Implement Data Replication

For Pilot Light and above, replicate data continuously. Use RDS cross-Region read replicas (asynchronous), S3 CRR, or DynamoDB Global Tables. For Backup/Restore, schedule snapshots. Ensure replication lag meets RPO. Monitor with CloudWatch metrics like ReplicaLag.
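Checking that replication lag meets the RPO can be sketched as a comparison of lag samples against the target. The sketch assumes the samples are `ReplicaLag` values in seconds that have already been retrieved from CloudWatch; the retrieval itself is not shown, and the function names and sample values are hypothetical.

```python
from typing import Iterable, List


def rpo_breaches(replica_lag_seconds: Iterable[float], rpo_seconds: float) -> List[float]:
    """Return the lag samples that exceed the RPO target."""
    return [lag for lag in replica_lag_seconds if lag > rpo_seconds]


def meets_rpo(replica_lag_seconds: Iterable[float], rpo_seconds: float) -> bool:
    """True when every observed lag sample is within the RPO target."""
    return not rpo_breaches(replica_lag_seconds, rpo_seconds)


if __name__ == "__main__":
    samples = [2.0, 5.5, 61.0, 3.1]   # hypothetical ReplicaLag datapoints
    print(rpo_breaches(samples, 60))  # samples above a 60-second RPO
    print(meets_rpo(samples, 60))
```

Judging against the maximum observed lag rather than the average is the conservative choice, because a disaster during a lag spike loses that spike's worth of data.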

Deploy Infrastructure as Code

Use CloudFormation or Terraform to define the DR environment. For Pilot Light, a template launches minimal resources. For Warm Standby, the template includes Auto Scaling groups with desired capacity = 1+ in standby Region. Use StackSets for multi-Region deployment.
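To make the template difference concrete, the sketch below generates the Auto Scaling group portion of a CloudFormation template for the standby Region. It is a fragment under stated assumptions: the logical ID `StandbyAppGroup` and the function name are hypothetical, the pilot-light value of 0 running instances is one possible choice (a pilot light may instead keep a single small instance), and a real template also needs a launch template and subnets.

```python
import json


def standby_asg_fragment(mode: str, max_size: int = 10) -> dict:
    """Build an AWS::AutoScaling::AutoScalingGroup fragment for the DR Region.

    mode "pilot_light": assumed to keep no app servers running until failover.
    mode "warm_standby": keeps one instance running so the stack can serve traffic.
    """
    running = {"pilot_light": 0, "warm_standby": 1}[mode]
    return {
        "StandbyAppGroup": {  # hypothetical logical ID
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                # CloudFormation declares these size properties as strings.
                "MinSize": str(running),
                "DesiredCapacity": str(running),
                "MaxSize": str(max_size),
                # A launch template (or equivalent) and subnets are also
                # required in a real template; they are omitted here.
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(standby_asg_fragment("warm_standby"), indent=2))
```

Keeping `MaxSize` at production scale in both modes means failover only has to raise the desired capacity, not redefine the group.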

Configure DNS Failover and Health Checks

In Route53, create a failover routing policy (active-passive) for Pilot Light/Warm Standby, or weighted/latency routing for Multi-Site. Associate health checks with each endpoint. Set the health check request interval (30s by default, or 10s with the fast option) and the failure threshold (default 3). Test by simulating failure.
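As a sketch of what the failover configuration looks like in practice, the function below builds the change batch structure that the Route53 `ChangeResourceRecordSets` API accepts, with a PRIMARY record tied to a health check and a SECONDARY record without one. The domain names, health check ID, and function names are placeholders, and CNAME records are used for simplicity (load balancers are more commonly targeted with alias records). The second function is only a rough estimate of detection time, assuming detection takes interval times threshold and that resolvers cache the old answer for one TTL.

```python
from typing import Optional


def failover_change_batch(name: str, primary_dns: str, secondary_dns: str,
                          health_check_id: str, ttl: int = 60) -> dict:
    """Build a change batch with PRIMARY and SECONDARY failover records."""

    def record(role: str, target: str, set_id: str,
               check_id: Optional[str] = None) -> dict:
        rrset = {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": set_id,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        }
        if check_id:
            rrset["HealthCheckId"] = check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {"Changes": [
        record("PRIMARY", primary_dns, "primary", health_check_id),
        record("SECONDARY", secondary_dns, "secondary"),
    ]}


def rough_detection_seconds(interval_s: int, failure_threshold: int, ttl_s: int) -> int:
    """Rough estimate: time to mark the endpoint unhealthy plus one DNS TTL."""
    return interval_s * failure_threshold + ttl_s


if __name__ == "__main__":
    batch = failover_change_batch("app.example.com", "primary.example.com",
                                  "standby.example.com", "hc-placeholder")
    print(len(batch["Changes"]))
    print(rough_detection_seconds(30, 3, 60), rough_detection_seconds(10, 3, 60))
```

The estimate shows why a short TTL matters for RTO: with a 60-second TTL, DNS caching alone can add a minute on top of health check detection.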

Automate Failover and Recovery

Use AWS Lambda or Step Functions to automate failover actions (e.g., promote read replica, scale up Auto Scaling group, update DNS). For Multi-Site, traffic shifting is automatic via DNS. Document runbooks and test regularly.
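The failover actions can be sketched as a small function that a Lambda handler or a Step Functions task might call. It assumes `rds` and `autoscaling` are boto3 clients for the standby Region (or stand-ins with the same method names), that the group's maximum size already allows `full_capacity`, and that DNS is switched separately by Route53 health checks. The resource names and the `RecordingClient` dry-run stub are hypothetical.

```python
from typing import Any, List, Tuple


class RecordingClient:
    """Dry-run stand-in for a boto3 client: records calls instead of making them."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, dict]] = []

    def __getattr__(self, method_name: str) -> Any:
        def call(**kwargs: Any) -> dict:
            self.calls.append((method_name, kwargs))
            return {}
        return call


def fail_over_to_standby(rds: Any, autoscaling: Any, replica_id: str,
                         asg_name: str, full_capacity: int) -> List[str]:
    """Promote the standby database, then scale the standby app tier."""
    steps = []
    # Promotion turns the read replica into a standalone, writable instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    steps.append("promoted " + replica_id)
    # Assumes the group's MaxSize is already >= full_capacity.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=full_capacity,
        DesiredCapacity=full_capacity,
    )
    steps.append("scaled " + asg_name + " to " + str(full_capacity))
    return steps


if __name__ == "__main__":
    rds, asg = RecordingClient(), RecordingClient()
    print(fail_over_to_standby(rds, asg, "orders-replica", "standby-asg", 4))
    print(rds.calls, asg.calls)
```

Passing the clients in as arguments keeps the runbook testable: the same function can run against the recording stub in a drill rehearsal and against real clients during failover.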

Test and Validate DR Plan

Conduct regular DR drills. Use AWS Fault Injection Simulator to inject failures. Verify RTO/RPO metrics. For Backup/Restore, restore a test instance. For Pilot Light, simulate Region failure by disabling primary site. Adjust configurations based on results.

What This Looks Like on the Job

Enterprise Scenario 1: E-commerce Platform with Strict RTO/RPO

A large online retailer requires RTO < 5 minutes and RPO < 1 minute for its transactional database. They choose Warm Standby with a scaled-down environment in a second AWS Region. The primary Region runs production on an RDS Multi-AZ database, while the standby Region has a single EC2 instance running the web app and a cross-Region read replica of that database. Data is replicated asynchronously via RDS replication. Route53 is configured with failover routing and health checks on the primary ALB. In a disaster, the read replica is promoted to a standalone primary, the Auto Scaling group scales up to full capacity, and Route53 shifts DNS to the standby. The standby costs about 40% of production and keeps downtime to a few minutes.

Enterprise Scenario 2: Regulatory Compliance with Long-Term Archival

A financial institution must retain transaction records for 7 years. They use Backup/Restore for archival data. Daily RDS snapshots are exported to Amazon S3, where lifecycle rules transition them to S3 Glacier Deep Archive. AWS Backup manages the backup schedule and retention. RTO is 24 hours and RPO is 24 hours, which is acceptable for compliance. In a disaster, they restore from the latest snapshot to a new RDS instance. They also run a separate Pilot Light for critical customer-facing APIs with an RTO of 30 minutes.

Common Pitfalls

Underestimating replication lag: Cross-Region replication can lag by seconds to minutes. If the RPO is tighter than the lag you observe, asynchronous replication will not meet it; monitor the lag continuously and, for a zero RPO, use synchronous replication.

Not testing failover: Many companies fail their first DR test because automation scripts have errors or permissions are missing.

Ignoring data consistency: Asynchronous replication can lead to conflicts. Use conflict resolution strategies (e.g., last writer wins).

Over-provisioning standby: For Warm Standby, keep standby small to save cost, but ensure it can scale quickly. Use Auto Scaling with scale-out policies based on CPU or memory.
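The "last writer wins" strategy mentioned above can be illustrated with a minimal merge of two replicas' records. This is a teaching sketch, not any AWS service's implementation: it assumes each value carries a write timestamp, that clocks are comparable across sites, and that ties keep the first replica's value. All names and sample data are hypothetical.

```python
from typing import Dict, Tuple

Versioned = Tuple[float, str]  # (write timestamp, value)


def merge_last_writer_wins(a: Dict[str, Versioned],
                           b: Dict[str, Versioned]) -> Dict[str, Versioned]:
    """Merge two replicas, keeping the most recently written value per key."""
    merged = dict(a)
    for key, (timestamp, value) in b.items():
        # Take b's version only if the key is new or b's write is strictly newer.
        if key not in merged or timestamp > merged[key][0]:
            merged[key] = (timestamp, value)
    return merged


if __name__ == "__main__":
    site_a = {"cart": (100.0, "2 items"), "address": (90.0, "Main St")}
    site_b = {"cart": (105.0, "3 items"), "address": (80.0, "Old Rd")}
    print(merge_last_writer_wins(site_a, site_b))
```

The older write is discarded silently, which is why this approach suits data where the latest value is all that matters and is risky for counters or balances.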

How SOA-C02 Actually Tests This

What SOA-C02 Tests on This Topic

Objective 2.2: Implement DR strategies. You must be able to recommend the appropriate strategy based on RTO/RPO and cost.

Common services: Route53, RDS, S3, CloudFormation, Auto Scaling, AWS Backup.

Specific values: Route53 health check interval (30s standard, 10s fast), failure threshold (3), RDS automated snapshot retention (up to 35 days).

Most Common Wrong Answers

Choosing Multi-Site when Warm Standby is sufficient: Candidates see low RTO and automatically pick Multi-Site, but Multi-Site is expensive and unnecessary if RTO of minutes is acceptable.

Selecting Pilot Light when Backup/Restore is cheaper: If RTO is 24 hours, Backup/Restore is correct. Pilot Light would be overkill.

Confusing RDS Multi-AZ with cross-Region DR: Multi-AZ protects against AZ failure, not Region failure. For DR, you need cross-Region read replicas or Aurora Global Database.

Assuming Backup/Restore cannot meet low RPO: With frequent backups, RPO can be low (for example, RDS automated backups support point-in-time recovery, typically to within the last 5 minutes), but RTO remains high due to restore time.

Specific Exam Traps

The exam may ask: "Which strategy uses Route53 failover and Auto Scaling?" The answer could be Pilot Light or Warm Standby. The difference: Pilot Light has minimal running instances; Warm Standby has a scaled-down environment that can handle some traffic.

Questions about RDS: Know that automated snapshots have 35-day retention; manual snapshots are indefinite.

S3 CRR: Requires versioning enabled on both buckets. Most objects replicate within 15 minutes; an SLA on that 15-minute target applies only when S3 Replication Time Control (RTC) is enabled.

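DynamoDB Global Tables: Multi-master; cross-Region replication is asynchronous and eventually consistent by default. DAX is an in-memory cache and does not make cross-Region reads strongly consistent; if you need that guarantee, handle it in application logic (for example, by sending all reads and writes for an item to one Region).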

How to Eliminate Wrong Answers

If the scenario mentions "minimal cost" and "RTO of 24 hours", eliminate Pilot Light, Warm Standby, Multi-Site — they are more expensive.

If the scenario says "RTO of 1 hour", Backup/Restore is too slow; Pilot Light or Warm Standby fits.

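If the scenario says "zero data loss" (RPO=0), you need synchronous replication. Asynchronous options, including RDS cross-Region read replicas and Aurora Global Database, typically lag by seconds, so Pilot Light and Warm Standby designs built on them cannot guarantee RPO=0. Look for the answer that replicates synchronously (e.g., RDS Multi-AZ within a Region, or a Multi-Site design in which the application commits each write to both sites).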

If the scenario mentions "active-active", it must be Multi-Site.

Edge Cases

Cross-Region failover for S3: Use CRR or S3 Batch Operations to replicate data. For static websites, use Route53 failover to a secondary bucket.

Multi-Region application with stateful sessions: Use ElastiCache Global Datastore or DynamoDB Global Tables for session state.

DR for serverless applications: Use Lambda with cross-Region replication of DynamoDB tables and S3 buckets. API Gateway can have custom domain names with Route53 failover.

Key Takeaways

DR strategies on a spectrum: Backup/Restore (low cost, high RTO) to Multi-Site (high cost, low RTO).

RTO and RPO are business-defined; AWS provides services to meet them.

Pilot Light uses a minimal running stack; Warm Standby uses a scaled-down but full stack.

RDS Multi-AZ is for AZ failure within a Region, not cross-Region DR.

Route53 health check interval is 30 seconds by default (10 seconds with the fast option); failure threshold is 3 by default.

S3 Cross-Region Replication requires versioning on both source and destination buckets.

DynamoDB Global Tables provide multi-Region, multi-master with eventual consistency.

AWS Backup can automate lifecycle policies for S3 and RDS snapshots.

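For RPO=0, use synchronous replication (e.g., RDS Multi-AZ within a Region); Aurora Global Database replicates asynchronously across Regions, so its RPO is typically seconds rather than zero.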

Regular DR testing is essential; use AWS Fault Injection Simulator.

CloudFormation StackSets allow consistent deployment across Regions.

Always consider cost vs. RTO/RPO trade-offs when choosing a strategy.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Pilot Light

Minimal running resources (e.g., one EC2, one DB replica).

RTO: 10-60 minutes (requires scaling up).

Cost: ~10-20% of production.

Data replication: continuous (e.g., RDS read replica).

Failover: manual or automated scaling of resources.

Warm Standby

Scaled-down but complete environment (e.g., Auto Scaling min=1).

RTO: minutes (already running, scale up quickly).

Cost: ~30-50% of production.

Data replication: continuous (e.g., RDS cross-Region replica).

Failover: DNS switch and scale up (if needed).

Backup/Restore

RTO: hours to days.

RPO: depends on backup frequency (e.g., 24h).

Cost: lowest (storage only).

Recovery: manual restore from snapshots/backups.

Use case: non-critical workloads, compliance archives.

Multi-Site

RTO: near zero (automatic traffic shift).

RPO: seconds (synchronous or async replication).

Cost: highest (two full environments).

Recovery: automatic via DNS routing.

Use case: mission-critical, zero downtime required.

Watch Out for These

Mistake

Pilot Light and Warm Standby are the same strategy.

Correct

Pilot Light runs only a minimal core (e.g., database, a single EC2) while Warm Standby runs a scaled-down but complete environment that can handle some traffic. RTO for Pilot Light is typically 10-60 minutes; Warm Standby can achieve minutes.

Mistake

RDS Multi-AZ provides cross-Region disaster recovery.

Correct

RDS Multi-AZ only replicates synchronously within a single Region (different AZs). For cross-Region DR, you need RDS cross-Region read replicas or Aurora Global Database.

Mistake

Backup/Restore always has RPO of 24 hours.

Correct

You can achieve lower RPO by taking more frequent snapshots (e.g., every hour), but RTO remains high because restoring infrastructure takes time.

Mistake

Multi-Site requires synchronous replication between sites.

Correct

Multi-Site can use asynchronous replication if the application can tolerate eventual consistency. Synchronous replication is only needed for zero RPO.

Mistake

Route53 health checks are only for public endpoints.

Correct

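Route53 health checkers run on the public internet and cannot reach private endpoints directly. To monitor a private resource, publish a CloudWatch metric for it, create a CloudWatch alarm, and associate a Route53 health check with that alarm.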

Frequently Asked Questions

What is the difference between Pilot Light and Warm Standby in AWS DR?

Pilot Light runs only a minimal set of core resources (e.g., a database replica and a small EC2 instance) that are always on. When a disaster occurs, you scale up by launching additional infrastructure. Warm Standby runs a scaled-down but fully functional copy of your production environment (e.g., Auto Scaling group with min=1). It can handle some traffic immediately. The key difference: Pilot Light requires more provisioning during recovery (higher RTO), while Warm Standby is ready to scale quickly (lower RTO).

Can I achieve RPO of zero with AWS DR strategies?

Only with synchronous replication. Within a Region, RDS Multi-AZ replicates synchronously to a standby in another AZ, giving an RPO of 0 for an AZ failure. Across Regions, Aurora Global Database and RDS cross-Region read replicas replicate asynchronously, so an unplanned Region failure can lose the most recent writes; RPO is typically seconds rather than zero. Achieving zero RPO across Regions generally requires application-level synchronous writes or a data store explicitly configured for multi-Region strong consistency, at the cost of added write latency. Most DR strategies (Pilot Light, Warm Standby) use asynchronous replication, so RPO is seconds to minutes.

What AWS services are best for Backup/Restore DR strategy?

AWS Backup is the primary service for managing backups across services (EC2, RDS, EFS, DynamoDB). For long-term archival, use S3 with Lifecycle policies to transition to Glacier or Deep Archive. RDS automated snapshots (retention up to 35 days) and manual snapshots (indefinite) are also key. For EC2, create AMIs regularly. S3 Cross-Region Replication can be used for additional resilience.

How do I test DR failover for Route53?

Create a Route53 health check that monitors your primary endpoint. You can manually disable the health check target (e.g., stop the EC2 instance or block traffic) to simulate failure. Route53 will automatically failover to the secondary record after the health check threshold (default 3 failures). Monitor the Route53 console for health check status. You can also use AWS Fault Injection Simulator to introduce latency or errors.

What is the cost implication of Multi-Site vs Warm Standby?

Multi-Site requires two fully provisioned, active environments, so you pay for both sites at full production capacity. Warm Standby runs a scaled-down environment (e.g., 1 EC2 instance instead of 10), so it typically costs 30-50% of production. Multi-Site can be 2x production cost. For Pilot Light, cost is even lower (10-20%) because only critical components run.

Can I use CloudFormation to automate DR deployment?

Yes. Use CloudFormation StackSets to deploy the same stack in multiple Regions. For Pilot Light, the stack includes a minimal set of resources. For Warm Standby, include an Auto Scaling group with a desired capacity that can be scaled up. For failover automation, you can use Lambda functions triggered by CloudWatch alarms or Route53 health check status.

What is the RTO for Backup/Restore using S3?

RTO for Backup/Restore is typically hours to days. Restoring data from S3 to new EC2 instances involves launching instances, restoring EBS volumes from snapshots, and copying data from S3. For large datasets, this can take several hours. If you need faster recovery, consider Pilot Light or Warm Standby.
