CLF-C02 | Chapter 39 of 130 | Objective 1.4

Disaster Recovery Concepts on AWS

This chapter covers disaster recovery (DR) concepts on AWS, a critical topic for the CLF-C02 exam under Objective 1.4 of Domain 1: Cloud Concepts (the domain accounts for approximately 24% of the exam). You will learn the key DR strategies, the AWS services that enable them, and how to design for resilience. Mastering these concepts prepares you to answer questions about business continuity, recovery objectives, and multi-Region architectures, all of which are common on the exam.

25 min read
Intermediate
Updated May 31, 2026

The City Emergency Backup Plan

Imagine you run a city with a central data center (your primary AWS Region). A disaster, such as a flood or an earthquake, could destroy that center and stop all city services. To prepare, you create a backup plan.

First, you have a 'pilot light' strategy: you keep a small emergency office in another town (another Region) with just a few staff and essential supplies. If disaster strikes, you can quickly scale up that office to run full city operations. This resembles AWS Elastic Disaster Recovery (DRS), which keeps data replicated to a low-cost staging area and launches full servers only when they are needed. Next, you have a 'warm standby' approach: you maintain a fully equipped backup office that runs at low capacity, ready to take over within minutes. This mirrors an active/passive setup with Amazon Route 53 failover routing. Finally, you have 'multi-site active/active': two fully operational city halls sharing the load, so if one fails, the other handles everything. That is like running workloads in two AWS Regions with Route 53 weighted routing.

The key mechanism is replication. You copy city records (data) continuously to the backup office using secure couriers (AWS DataSync or S3 Cross-Region Replication). The time to get back online (Recovery Time Objective, RTO) and how much data you lose (Recovery Point Objective, RPO) depend on how often you send couriers and how ready the backup office is. Cheaper plans have longer RTOs and RPOs; expensive plans have near-zero values.

How It Actually Works

What is Disaster Recovery and Why Does It Matter?

Disaster recovery (DR) is the process of restoring IT systems and data after a disruptive event, such as a natural disaster, power outage, cyberattack, or human error. On AWS, DR leverages the cloud's global infrastructure to replicate workloads across multiple Availability Zones (AZs) and Regions, providing high availability and business continuity. The exam tests your understanding of four main DR strategies: Backup and Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active. Each strategy balances cost, complexity, and recovery speed.

Key Metrics: RTO and RPO

Two critical metrics define DR effectiveness:
- Recovery Time Objective (RTO): The maximum acceptable time to restore systems after a disaster. For example, an RTO of 1 hour means the system must be fully operational within 60 minutes.
- Recovery Point Objective (RPO): The maximum acceptable data loss measured in time. An RPO of 15 minutes means you can lose at most 15 minutes of data.

On the exam, you must associate each DR strategy with typical RTO/RPO ranges. Backup and Restore has the highest RTO (hours to days) and RPO (hours), while Multi-Site Active/Active has the lowest (near zero for both).
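To make the two metrics concrete, here is a minimal sketch that measures the data loss and downtime of a single incident and compares them with the targets. The timeline values and function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_good_copy: datetime, disaster: datetime) -> timedelta:
    """Data written after the last recoverable copy is lost."""
    return disaster - last_good_copy


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Downtime runs from the disaster until service is restored."""
    return service_restored - disaster


# Hypothetical incident timeline (illustrative values only).
last_backup = datetime(2026, 1, 10, 2, 0)   # nightly backup finished at 02:00
disaster = datetime(2026, 1, 10, 14, 30)    # primary site lost at 14:30
restored = datetime(2026, 1, 10, 18, 0)     # service back at 18:00

rpo_target = timedelta(hours=24)
rto_target = timedelta(hours=4)

data_loss = achieved_rpo(last_backup, disaster)  # 12 h 30 min of lost writes
downtime = achieved_rto(disaster, restored)      # 3 h 30 min of downtime

print(f"Data loss {data_loss}, within RPO: {data_loss <= rpo_target}")
print(f"Downtime {downtime}, within RTO: {downtime <= rto_target}")
```

The two values are computed from different points on the timeline, which is why a plan can meet its RTO while missing its RPO, or the reverse.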

DR Strategies in Detail

#### 1. Backup and Restore

This is the simplest and cheapest DR strategy. You regularly back up data and applications to AWS (e.g., Amazon S3, AWS Backup) and restore them to a new environment after a disaster. RTO and RPO are measured in hours or even days. It is suitable for non-critical workloads where downtime is acceptable.

How it works:
- Use AWS Backup to automate backups of EBS volumes, RDS databases, and DynamoDB tables.
- Store backups in S3, optionally using S3 Glacier for long-term archival.
- In a disaster, launch new EC2 instances from AMIs (which are backed by EBS snapshots), restore RDS snapshots, and reconfigure networking.

AWS services involved: AWS Backup, Amazon S3, Amazon S3 Glacier, Amazon EBS snapshots, RDS snapshots.

Example: A company backs up its on-premises database nightly to S3. If the office is destroyed, they restore the latest backup into a new database in an AWS Region (for example, on Amazon RDS). RTO: 4 hours, RPO: 24 hours.

#### 2. Pilot Light

Pilot Light replicates core services (like a database) to a secondary Region while keeping them running in a minimal state. In a disaster, you 'turn on' the full environment around the replicated data. RTO: 10-60 minutes, RPO: minutes to hours.

How it works:
- Replicate data continuously using services like AWS Database Migration Service (DMS) or S3 Cross-Region Replication.
- Maintain a small set of running resources (e.g., a single EC2 instance with a database) in the DR Region.
- During failover, scale up by launching additional EC2 instances, updating Route 53 DNS records, and scaling the database.

AWS services involved: Amazon RDS cross-Region replication, Amazon S3 CRR, AWS DMS, Amazon Route 53, EC2 Auto Scaling.

Example: A web application runs in us-east-1. The database is replicated to us-west-2. A small EC2 instance runs there with the database but no web servers. During a disaster, Auto Scaling launches a full fleet of web servers in us-west-2, and Route 53 redirects traffic. RTO: 30 minutes, RPO: 5 minutes.

#### 3. Warm Standby

Warm Standby runs a scaled-down version of your production environment in the DR Region. It is always running, but at lower capacity. During failover, you scale up to full production. RTO: minutes, RPO: seconds.

How it works:
- Deploy a duplicate of your production environment in another Region, but with fewer instances (e.g., one web server instead of ten).
- Use Route 53 health checks to monitor the primary Region.
- On failure, Route 53 automatically routes traffic to the DR Region, and Auto Scaling increases capacity.

AWS services involved: Elastic Load Balancing (ELB), EC2 Auto Scaling, Route 53, RDS Multi-AZ or cross-Region replica, AWS Global Accelerator.

Example: An e-commerce site runs in eu-west-1 with a warm standby in eu-central-1. The standby has one web server and a read replica of the database. If eu-west-1 fails, Route 53 sends traffic to eu-central-1, and Auto Scaling adds more web servers. RTO: 5 minutes, RPO: seconds (cross-Region read replicas replicate asynchronously, so the replica can lag slightly behind the primary).

#### 4. Multi-Site Active/Active

This is the most resilient and expensive strategy. You run identical workloads in two or more Regions simultaneously, with traffic distributed across them. If one Region fails, the others handle all traffic. RTO: near zero, RPO: near zero (cross-Region replication is asynchronous, but lag is typically under a second).

How it works:
- Deploy full production stacks in multiple Regions.
- Use Route 53 with latency-based routing or weighted routing to distribute traffic.
- Use DynamoDB global tables or Aurora Global Database for multi-Region data replication.
- Implement health checks and automated failover.

AWS services involved: Route 53, ELB, EC2 Auto Scaling, DynamoDB Global Tables, Aurora Global Database, AWS Global Accelerator, Amazon CloudFront.

Example: A global social media platform runs in us-east-1, eu-west-1, and ap-southeast-1. Users are routed to the nearest Region. If us-east-1 fails, traffic shifts to the other two Regions seamlessly. RTO: <1 minute, RPO: <1 second.
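The way traffic redistributes when a Region drops out can be sketched with a simplified model of weighted routing combined with health checks. This is an illustration only, assuming each healthy Region receives its weight divided by the total weight of healthy Regions; the function name and the equal weights are hypothetical.

```python
def traffic_share(weights: dict[str, int], healthy: set[str]) -> dict[str, float]:
    """Return the fraction of traffic each Region receives.

    Simplified model: only healthy Regions with a nonzero weight are
    candidates, and each gets weight / (sum of candidate weights).
    """
    candidates = {r: w for r, w in weights.items() if r in healthy and w > 0}
    total = sum(candidates.values())
    if total == 0:
        raise ValueError("no healthy Region available to serve traffic")
    return {r: candidates.get(r, 0) / total for r in weights}


# Hypothetical equal weights across three Regions.
regions = {"us-east-1": 1, "eu-west-1": 1, "ap-southeast-1": 1}

normal = traffic_share(regions, healthy=set(regions))
degraded = traffic_share(regions, healthy={"eu-west-1", "ap-southeast-1"})

print(normal)    # each Region carries one third
print(degraded)  # us-east-1 carries nothing; the other two carry half each
```

The sketch also shows why each Region in an active/active design needs headroom: after the failure, the surviving Regions carry a larger share than they do in normal operation.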

AWS Services for Disaster Recovery

The exam expects you to know which AWS services support DR:

AWS Backup: Centralized backup service for multiple AWS resources. Supports cross-Region and cross-account backup.

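Amazon S3 Cross-Region Replication (CRR): Automatically replicates objects to a bucket in another Region. Replication is asynchronous, so RPO is typically measured in minutes; S3 Replication Time Control (RTC) is designed to replicate 99.99% of objects within 15 minutes.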

Amazon RDS Multi-AZ and Cross-Region Read Replicas: Multi-AZ provides automatic failover within a Region; cross-Region replicas allow DR across Regions.

Amazon DynamoDB Global Tables: Fully managed, multi-Region, multi-active database with eventual consistency.

Amazon Aurora Global Database: One primary Region and up to five secondary Regions, with replication lag that is typically under 1 second.

AWS Elastic Disaster Recovery (DRS): Continuous block-level replication for EC2 instances and on-premises servers, with automated recovery in minutes.

CloudEndure Disaster Recovery: The predecessor of AWS DRS. It is a legacy name that older study material may still reference; current exam content uses AWS Elastic Disaster Recovery.

VMware Cloud on AWS: For hybrid DR scenarios.

Comparison to On-Premises DR

Traditional on-premises DR requires owning a secondary data center, which is costly and underutilized. AWS DR eliminates that upfront capital expenditure: you pay for backup storage, replication, and any standby resources you keep running, and for full-scale compute only during testing or an actual failover. The cloud also provides elasticity, so you can scale rapidly during a disaster. However, you must account for data transfer costs for replication and ensure compliance with data residency requirements.

When to Use Each Strategy

Backup and Restore: Non-critical applications, long RTO/RPO acceptable, low budget.

Pilot Light: Critical applications with moderate RTO/RPO, cost-sensitive.

Warm Standby: Business-critical applications requiring fast recovery, moderate cost.

Multi-Site Active/Active: Mission-critical applications requiring near-zero downtime, high budget.

The exam may ask you to recommend a DR strategy based on RTO/RPO requirements. Always match the strategy to the given metrics.

Walk-Through

1. Assess RTO and RPO Requirements

Begin by determining your Recovery Time Objective (RTO) and Recovery Point Objective (RPO). For example, a critical database may require an RTO of 5 minutes and an RPO of 1 second, while a reporting system may tolerate 24 hours. These metrics drive your DR strategy choice. On AWS, you can achieve sub-minute RTOs with active-active configurations, but at higher cost. Document these requirements before selecting services.

2. Choose a DR Strategy

Based on RTO/RPO, select one of the four strategies: Backup and Restore (high RTO/RPO), Pilot Light (moderate), Warm Standby (low), or Multi-Site Active/Active (near zero). For the exam, remember that Pilot Light is like the small flame in a gas furnace that is always lit and can quickly ignite the full burner; Warm Standby is a pre-warmed engine; Active/Active is two engines running simultaneously. Each has specific AWS services and cost implications.

3. Configure Data Replication

Implement replication using appropriate AWS services. For databases, use RDS cross-Region read replicas or Aurora Global Database. For S3, enable Cross-Region Replication (CRR) with a rule that applies to the entire bucket or to specific prefixes. For EC2, use AWS DRS to replicate entire servers. Ensure replication is continuous or scheduled based on your RPO. Note that CRR is asynchronous: most objects replicate within minutes, and S3 Replication Time Control (RTC) is designed to replicate 99.99% of objects within 15 minutes. For a lower RPO on database workloads, use a service with sub-second replication lag such as Aurora Global Database (which is also asynchronous).
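As an illustration of a CRR rule scoped to the whole bucket or to a prefix, the sketch below builds a replication configuration in the shape accepted by the S3 PutBucketReplication API. It only constructs the dictionary and does not call AWS. The role ARN, bucket ARN, rule ID, and prefix are hypothetical placeholders.

```python
import json


def build_replication_config(role_arn: str, dest_bucket_arn: str, prefix: str = "") -> dict:
    """Build a replication configuration with a single rule.

    An empty prefix applies the rule to the whole bucket; a non-empty
    prefix limits it to keys that start with that prefix.
    """
    rule = {
        "ID": "dr-replication",  # hypothetical rule name
        "Priority": 1,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {"Bucket": dest_bucket_arn},
    }
    return {"Role": role_arn, "Rules": [rule]}


# Hypothetical ARNs, for illustration only.
config = build_replication_config(
    role_arn="arn:aws:iam::111122223333:role/example-replication-role",
    dest_bucket_arn="arn:aws:s3:::example-dr-bucket",
    prefix="backups/",
)
print(json.dumps(config, indent=2))
```

With boto3, a dictionary like this would be passed as the `ReplicationConfiguration` argument of `put_bucket_replication`. Versioning must be enabled on both the source and destination buckets before replication can be configured.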

4. Deploy Infrastructure in DR Region

In the secondary Region, create the required infrastructure using AWS CloudFormation or Terraform. For Pilot Light, deploy only core components (e.g., a small EC2 instance and database). For Warm Standby, deploy a scaled-down version of your production stack. Use Auto Scaling groups with minimum and desired capacities set appropriately. This step ensures you can fail over quickly.
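The difference between the strategies shows up in the Auto Scaling capacity of the web tier in the DR Region. The following sketch models that with plain Python; the `Capacity` class, the strategy names, and the fleet sizes are hypothetical and not taken from any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    min_size: int
    desired: int
    max_size: int


# Hypothetical production web fleet.
PRODUCTION = Capacity(min_size=10, desired=10, max_size=20)


def standby_capacity(strategy: str, production: Capacity) -> Capacity:
    """Return the steady-state web tier capacity in the DR Region."""
    if strategy == "pilot_light":
        # Core data is replicated, but no web servers run until failover.
        return Capacity(0, 0, production.max_size)
    if strategy == "warm_standby":
        # A minimal but working copy of the stack is always running.
        return Capacity(1, 1, production.max_size)
    if strategy == "active_active":
        # Every Region carries production traffic at full size.
        return production
    raise ValueError(f"unknown strategy: {strategy}")


def fail_over(standby: Capacity, production: Capacity) -> Capacity:
    """On failover, raise the DR group to production capacity."""
    return Capacity(production.min_size, production.desired, standby.max_size)


warm = standby_capacity("warm_standby", PRODUCTION)
print(warm)
print(fail_over(warm, PRODUCTION))
```

Keeping the maximum size at the production value in every profile means failover only has to raise the minimum and desired values, which is a design choice of this sketch rather than a requirement.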

5. Implement DNS Failover with Route 53

Configure Amazon Route 53 with health checks and failover routing policies. Create a primary record pointing to your primary Region's load balancer and a secondary record to the DR Region. Set health checks to monitor the primary endpoint (e.g., HTTP 200 response). When the primary fails, Route 53 automatically switches to the secondary record. For active-active, use weighted or latency-based routing.
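The primary and secondary records described above can be sketched as a change batch in the shape used by the Route 53 ChangeResourceRecordSets API. The code only builds the dictionary and does not call AWS. The domain name, load balancer DNS names, hosted zone IDs, and health check ID are hypothetical placeholders.

```python
def failover_record(name, role, alb_dns, alb_zone_id, health_check_id=None):
    """Build one alias record for a Route 53 failover pair.

    role is "PRIMARY" or "SECONDARY".
    """
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,   # hosted zone ID of the load balancer
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Hypothetical values, for illustration only.
change_batch = {
    "Comment": "DR failover pair",
    "Changes": [
        failover_record("app.example.com.", "PRIMARY",
                        "primary-alb.example.com.", "ZPRIMARYEXAMPLE",
                        health_check_id="hc-primary-example"),
        failover_record("app.example.com.", "SECONDARY",
                        "dr-alb.example.com.", "ZSECONDARYEXAMPLE"),
    ],
}
print(change_batch["Changes"][0]["ResourceRecordSet"]["SetIdentifier"])
```

With boto3, a batch like this would be passed as the `ChangeBatch` argument of `change_resource_record_sets`, together with the ID of the hosted zone that owns the domain.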

6. Test Failover Regularly

Periodically test your DR plan by simulating a disaster. Use AWS Systems Manager Automation to run failover drills without impacting production. For example, in a test environment, terminate the primary instances and verify that Route 53 routes traffic to the DR Region. Measure the achieved RTO and RPO during the test, and adjust configurations if recovery times exceed targets. Test on a regular schedule, for example at least every 6 months.

What This Looks Like on the Job

Scenario 1: Financial Services Compliance

A bank must comply with regulatory requirements for disaster recovery, mandating an RTO of 15 minutes and an RPO of 5 minutes for its transaction processing system. The bank uses AWS DRS to replicate its on-premises Oracle database servers to a pilot light environment in an AWS Region. During an outage at the primary site, the bank initiates recovery and AWS DRS launches the replicated servers in the DR Region. Route 53 health checks detect the failure and redirect traffic. The bank achieves an RTO of 12 minutes and an RPO of 4 minutes, meeting compliance. Cost is higher due to continuous replication and standby compute, but it avoids regulatory penalties.

Scenario 2: E-commerce Flash Sale

A retail company runs a global e-commerce platform on AWS. During Black Friday, they need zero downtime. They deploy a multi-site active-active architecture using DynamoDB Global Tables and CloudFront for content delivery. Traffic is distributed across three Regions using Route 53 latency-based routing. When an AWS Region experiences an outage, traffic is automatically rerouted to the nearest healthy Region. The company experiences no downtime and no data loss. The cost is significant due to running full stacks in multiple Regions, but the revenue during the sale justifies it.

Scenario 3: Startup with Limited Budget

A startup runs a non-critical blog on a single EC2 instance in us-east-1 with a small RDS database. They cannot afford a full DR setup. They use AWS Backup to take daily snapshots of the EBS volume and nightly backups of the RDS database. If the instance fails, they launch a new instance from the latest backup and restore the database. RTO is 4 hours, RPO is 24 hours. This is acceptable for their use case. The cost is minimal (storage for snapshots). They also keep exported backup files in an S3 bucket and set up S3 Cross-Region Replication to another Region for added safety. Misconfiguration: initially they forgot to enable versioning on the backup bucket. Cross-Region Replication requires versioning on both the source and destination buckets, and without versioning an accidentally deleted backup file could not be recovered. They corrected this after a test failure.

How CLF-C02 Actually Tests This

What CLF-C02 Tests

Objective 1.4 falls under Domain 1 (Cloud Concepts) and covers disaster recovery strategies, RTO/RPO definitions, and the AWS services that support DR. The exam expects you to:

Distinguish between Backup and Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active.

Match RTO/RPO values to the appropriate strategy.

Identify which AWS service to use for a given DR scenario (e.g., AWS DRS for server replication, RDS cross-Region replicas for databases).

Understand the trade-offs between cost and recovery speed.

Common Wrong Answers and Why
1. Choosing 'Multi-Site Active/Active' when RTO is 1 hour and RPO is 1 hour. Candidates think active-active is always best, but it is overkill. The correct answer is Pilot Light or Warm Standby, which can meet those metrics at lower cost.
2. Confusing 'Pilot Light' with 'Warm Standby'. Pilot Light has only core services running; Warm Standby has a scaled-down version of the full environment. The exam may describe a scenario where a small EC2 instance and database are running (Pilot Light) vs. a full but smaller stack (Warm Standby).
3. Selecting 'CloudEndure Disaster Recovery' instead of 'AWS Elastic Disaster Recovery'. CloudEndure is the old name; the current service is AWS DRS. The exam uses the new name.
4. Thinking 'RTO' means data loss. RTO is time to recover, not data loss. RPO is data loss. Candidates often swap them.

Specific Terms and Values
- RTO: time to recover, measured in hours/minutes.
- RPO: data loss tolerance, measured in time.
- AWS DRS: continuous replication, RTO of minutes, RPO of seconds.
- S3 CRR: asynchronous, RPO of minutes (within 15 minutes for 99.99% of objects when S3 Replication Time Control is enabled).
- Aurora Global Database: RPO of <1 second (typical).
- DynamoDB Global Tables: eventual consistency, RPO of <1 second typically.

Tricky Distinctions
- Backup and Restore vs. Pilot Light: Both use backups, but Pilot Light maintains a running core service for faster failover.
- Warm Standby vs. Active/Active: Warm standby is passive (only one site active); active-active has both sites serving traffic.

Decision Rule

When asked to choose a DR strategy, first identify the RTO/RPO. If RTO > 1 hour and RPO > 1 hour, choose Backup and Restore. If RTO < 1 hour but > 10 minutes, and RPO < 1 hour, choose Pilot Light. If RTO < 10 minutes and RPO < 1 minute, choose Warm Standby. If RTO near zero and RPO near zero, choose Active/Active. Also consider budget: lower RTO/RPO costs more.
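The decision rule can be sketched as a small function that returns the cheapest strategy able to meet both targets. The rule as stated leaves some combinations uncovered (for example, a long RTO with a short RPO), so the boundaries below are assumptions made for this sketch: 'near zero' is taken to mean an RTO under 1 minute or an RPO under 1 second, and any gap is resolved by moving to the next faster strategy.

```python
from datetime import timedelta

HOUR = timedelta(hours=1)
MINUTE = timedelta(minutes=1)
SECOND = timedelta(seconds=1)


def choose_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Pick the cheapest DR strategy that meets both targets.

    Thresholds follow the chapter's decision rule; boundary handling
    is an assumption of this sketch.
    """
    if rto > HOUR and rpo > HOUR:
        return "Backup and Restore"
    if rto >= 10 * MINUTE and rpo >= MINUTE:
        return "Pilot Light"
    if rto >= MINUTE and rpo >= SECOND:
        return "Warm Standby"
    return "Multi-Site Active/Active"


# Targets taken from the examples in this chapter.
print(choose_strategy(timedelta(hours=4), timedelta(hours=24)))
print(choose_strategy(timedelta(minutes=30), timedelta(minutes=5)))
print(choose_strategy(timedelta(minutes=5), timedelta(seconds=1)))
print(choose_strategy(timedelta(seconds=30), timedelta(0)))
```

Checking the cheapest option first mirrors the exam logic: a faster strategy would also meet the targets, but it is not the best answer when a cheaper one already does.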

Key Takeaways

Disaster recovery on AWS uses four strategies: Backup and Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active.

RTO (Recovery Time Objective) is the maximum acceptable downtime; RPO (Recovery Point Objective) is the maximum acceptable data loss.

Backup and Restore has the highest RTO/RPO and lowest cost; Multi-Site Active/Active has the lowest RTO/RPO and highest cost.

AWS services for DR include AWS Backup, S3 CRR, RDS cross-Region replicas, DynamoDB Global Tables, Aurora Global Database, and AWS DRS.

Pilot Light runs only core services; Warm Standby runs a scaled-down full stack.

Route 53 health checks and failover routing are essential for DNS-based failover.

Regular testing of DR plans is critical; use AWS Systems Manager Automation for automated drills.

AWS DRS provides continuous replication with RTO of minutes and RPO of seconds for EC2 and on-premises servers.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Backup and Restore

Highest RTO (hours to days)

Highest RPO (hours)

Lowest cost (only storage for backups)

No running resources in DR Region

Suitable for non-critical workloads

Pilot Light

Moderate RTO (10-60 minutes)

Moderate RPO (minutes to hours)

Moderate cost (storage + small compute)

Core services running (e.g., database)

Suitable for critical workloads with some tolerance

Warm Standby

Low RTO (minutes)

Low RPO (seconds)

Higher cost (scaled-down environment)

Passive DR site (not serving traffic)

Failover requires scaling up

Multi-Site Active/Active

Near-zero RTO

Near-zero RPO

Highest cost (full production in multiple Regions)

All sites active and serving traffic

No failover required; traffic shifts automatically

Watch Out for These

Mistake

AWS guarantees automatic disaster recovery across Regions for all services.

Correct

AWS provides tools and services to implement DR, but you must configure them. For example, EC2 instances are not automatically replicated; you must use AMI backups or AWS DRS. Only some managed services like DynamoDB Global Tables offer built-in multi-Region capabilities.

Mistake

Pilot Light and Warm Standby are the same thing.

Correct

Pilot Light runs only core services (e.g., a database) in a minimal state; Warm Standby runs a scaled-down version of the full application stack. Pilot Light requires more steps to failover (launching additional resources), while Warm Standby is closer to full production.

Mistake

RTO and RPO are interchangeable terms.

Correct

RTO is the time to restore service; RPO is the maximum acceptable data loss (time since last backup). For example, an RTO of 1 hour means systems are back in 1 hour; an RPO of 1 hour means you may lose up to 1 hour of data. They are independent but often correlated.

Mistake

Multi-Site Active/Active means you only pay for one Region's resources.

Correct

You pay for resources in all active Regions. For example, if you run EC2 instances in two Regions, you are billed for both. This can double your compute costs, but it provides the lowest RTO/RPO.

Mistake

AWS DRS only works for on-premises servers, not EC2.

Correct

AWS DRS can replicate both on-premises servers and EC2 instances. It uses continuous block-level replication to a staging area in your chosen AWS Region, then can launch recovery instances on demand.

Frequently Asked Questions

What is the difference between Pilot Light and Warm Standby on AWS?

Pilot Light runs only the core services (e.g., a database) in a minimal state in the DR Region. During failover, you launch additional resources (web servers, etc.) around that core. Warm Standby runs a scaled-down version of the entire production stack (e.g., one web server, one app server, one database) that is always on, but at lower capacity. Failover mainly involves scaling up what is already running, not building out missing tiers. On the exam, Pilot Light is described as having a small running core; Warm Standby has a full but smaller environment.

How do I choose the right DR strategy for my workload?

First, define your RTO and RPO. If you can tolerate hours of downtime and data loss, choose Backup and Restore. If you need recovery within 10-60 minutes and can lose minutes of data, choose Pilot Light. For recovery within minutes and seconds of data loss, choose Warm Standby. For near-zero downtime and data loss, choose Multi-Site Active/Active. Also consider cost: lower RTO/RPO costs more. For example, a blog might use Backup and Restore; a banking app might use Warm Standby.

What AWS service provides continuous replication for EC2 instances?

AWS Elastic Disaster Recovery (AWS DRS) provides continuous block-level replication for EC2 instances and on-premises servers. It replicates data to a staging area in your chosen AWS Region, allowing you to launch recovery instances in minutes with an RPO of seconds. It replaced CloudEndure Disaster Recovery. On the exam, know that AWS DRS is the service for server-level DR with low RTO/RPO.

Can I use S3 Cross-Region Replication for disaster recovery?

Yes, S3 Cross-Region Replication (CRR) automatically replicates objects to a bucket in another Region. It is suitable for data-level DR, but not for entire application failover. CRR is asynchronous, so its RPO is typically minutes rather than real time; with S3 Replication Time Control enabled, 99.99% of objects are designed to replicate within 15 minutes. You must combine it with other services (e.g., EC2 Auto Scaling, Route 53) for full DR. It is cost-effective for replicating static assets or backups.

What is the RPO of Amazon Aurora Global Database?

Amazon Aurora Global Database typically provides an RPO of less than 1 second. It uses dedicated, fast replication links to replicate data from one primary Region to up to five secondary Regions. Failover can typically be completed in less than 1 minute. This makes it ideal for global applications requiring low data loss. On the exam, remember that Aurora Global Database offers one of the lowest cross-Region RPOs among AWS database services.

How does Route 53 support disaster recovery?

Route 53 supports DR through health checks and failover routing policies. You can create a failover record set with a primary and secondary endpoint. Route 53 monitors the primary endpoint's health using health checks; if the primary fails, it automatically returns the secondary endpoint's IP address. You can also use latency-based routing for active-active setups. This enables DNS-level failover without changing application code.
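How quickly DNS failover takes effect can be estimated with simple arithmetic: the time for health checks to declare the primary unhealthy, plus the time for cached DNS answers to expire. The sketch below assumes a 30-second standard or 10-second fast request interval, a failure threshold of 3 consecutive failed checks, and a hypothetical record TTL of 60 seconds. It is a rough estimate that ignores resolver behavior and application startup time.

```python
def estimated_dns_failover_seconds(request_interval: int,
                                   failure_threshold: int,
                                   record_ttl: int) -> int:
    """Rough estimate: detection time plus time for cached answers to expire."""
    detection = request_interval * failure_threshold
    return detection + record_ttl


# Assumed settings: threshold of 3 failed checks, 60-second TTL.
standard = estimated_dns_failover_seconds(30, 3, 60)  # 90 s detection + 60 s TTL
fast = estimated_dns_failover_seconds(10, 3, 60)      # 30 s detection + 60 s TTL

print(f"Standard interval: about {standard} seconds")
print(f"Fast interval: about {fast} seconds")
```

This is why DNS-based failover usually fits RTO targets measured in minutes, and why records used for failover are commonly given short TTLs.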

Is it necessary to test disaster recovery on AWS?

Yes, regular testing is crucial to ensure your DR plan works. Test on a regular schedule, for example at least every 6 months. You can use AWS Systems Manager Automation to automate failover drills without impacting production. Testing helps validate RTO/RPO, identify configuration errors, and train staff. Without testing, you risk discovering issues during an actual disaster.
