AWS Disaster Recovery Questions: Backup, Pilot Light and Warm Standby

AWS DR strategies are a regular SAA-C03 topic. Here is how to choose between backup and restore, pilot light, warm standby, and multi-site active-active based on RPO and RTO.

Quick answer

AWS disaster recovery questions require you to match an RPO/RTO requirement to the correct DR strategy. The four strategies sit on a spectrum from cheapest-and-slowest to most-expensive-and-fastest.

The Four DR Strategies

1. Backup and Restore

The simplest strategy. Data is regularly backed up to S3 or another durable store. In a disaster, you restore from the latest backup and rebuild infrastructure.

  • Cost: Lowest
  • RPO: Hours (depends on backup frequency)
  • RTO: Hours (time to restore + rebuild)
  • When to use: non-critical workloads where extended downtime is acceptable

2. Pilot Light

Core infrastructure is kept running in the DR region at minimum scale — like a pilot light in a furnace. Typically this means the database is replicated and running, but application servers are stopped or not deployed.

In a disaster, the DB is already current (or close to it). Application servers are launched from AMIs and pointed to the replicated DB.

  • Cost: Low (only core components running)
  • RPO: Minutes (database replication lag)
  • RTO: Minutes to tens of minutes (time to launch app servers and update DNS)
  • When to use: workloads that can tolerate some downtime but need current data

3. Warm Standby

A fully functional but scaled-down version of the production environment runs continuously in the DR region. It can handle some traffic immediately and scales up to full production capacity during a disaster.

  • Cost: Higher (all components running, just smaller)
  • RPO: Seconds to minutes
  • RTO: Minutes (scale up existing infrastructure)
  • When to use: business-critical workloads that require rapid recovery

4. Multi-Site Active-Active

Full production capacity runs simultaneously in two or more regions. Traffic is distributed across regions via Route 53 or Global Accelerator. If one region fails, the other handles all traffic immediately.

  • Cost: Highest (full duplicate infrastructure)
  • RPO: Near-zero (real-time replication)
  • RTO: Near-zero (already running)
  • When to use: mission-critical workloads with zero tolerance for downtime

How the Exam Tests These

The exam gives you an RPO and RTO requirement and asks which strategy to use:

  • RPO 24 hours, RTO 24 hours, cost is the top priority → Backup and Restore
  • RPO 1 hour, RTO 4 hours, moderate cost → Pilot Light
  • RPO 5 minutes, RTO 30 minutes → Warm Standby
  • RPO 0, RTO 0 (zero downtime) → Multi-Site Active-Active

Exam trap: candidates sometimes confuse Pilot Light with Warm Standby. Pilot Light has the absolute minimum infrastructure running (DB only, no app servers). Warm Standby has all components running but scaled down. The key question is: is the application running in DR right now (even at low capacity)?

RPO vs RTO

  • RPO (Recovery Point Objective) — How much data loss is acceptable? An RPO of 1 hour means you can lose up to 1 hour of data.
  • RTO (Recovery Time Objective) — How quickly must the system be available after a disaster? An RTO of 4 hours means the system must be operational within 4 hours.

Shorter RPO and RTO requirements push you toward more expensive strategies.
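
To make the two definitions concrete, here is a minimal sketch that checks a periodic backup schedule against an RPO and RTO. It assumes the worst-case data loss equals the backup interval (a failure just before the next backup) and takes the recovery time as an estimate you supply. The function names are hypothetical and exist only for illustration.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, a failure just before the next run loses one full interval."""
    return backup_interval


def meets_objectives(
    backup_interval: timedelta,
    estimated_recovery: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> bool:
    """True when both the data-loss window and the recovery time fit the targets."""
    return worst_case_data_loss(backup_interval) <= rpo and estimated_recovery <= rto


# Daily backups cannot satisfy a 1-hour RPO, even if the restore is fast enough.
daily = meets_objectives(
    timedelta(hours=24), timedelta(hours=3), rpo=timedelta(hours=1), rto=timedelta(hours=4)
)
# Backups every 30 minutes satisfy the same targets.
half_hourly = meets_objectives(
    timedelta(minutes=30), timedelta(hours=3), rpo=timedelta(hours=1), rto=timedelta(hours=4)
)
print(daily, half_hourly)
```

The point of the sketch is that RPO constrains how often data is captured, while RTO constrains how long the restore and rebuild may take; the two are checked independently.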

AWS Services Used in DR

Component | Service
Data backup | S3, AWS Backup
Database replication | RDS Read Replica (cross-region), DynamoDB Global Tables
DNS failover | Route 53 Failover routing
Infrastructure as code | CloudFormation (rebuild in DR region)
AMI copying | AMI copied to DR region
Continuous replication | AWS Elastic Disaster Recovery (CloudEndure)

Practice SAA-C03 disaster recovery questions with RPO/RTO scenarios to build automatic recognition of the four strategies.

RTO and RPO — Specific Numbers the Exam Tests

The exam gives you a target RTO and RPO and asks which DR strategy meets it at the lowest cost. You need to know the approximate recovery time and data loss window for each strategy.

DR Strategy | RTO | RPO | Relative Cost
Backup and Restore | Hours | Hours (since last backup) | Lowest
Pilot Light | 10–30 minutes | Minutes | Low–Medium
Warm Standby | 2–10 minutes | Seconds to minutes | Medium–High
Multi-Site Active/Active | Near zero (<1 min) | Near zero | Highest

As RTO and RPO decrease toward zero, cost increases. This relationship is the foundation of every DR strategy question.

The exam scenario: "A company requires an RTO of 15 minutes and can tolerate up to 5 minutes of data loss." Backup and restore takes hours, so it is eliminated. Pilot light already replicates the database, so a 5-minute RPO is achievable as long as replication lag stays low, but its 10–30 minute recovery time meets a 15-minute RTO only at the fast end of the range. Warm standby, with a reduced-capacity environment already running, meets that RTO comfortably and is the safer answer. The correct answer depends on the exact numbers, so learn the ranges, not just the rank order.

The cost ranking matters for questions that say "at the lowest cost that meets the requirements." Always pick the cheapest strategy whose RTO/RPO meets the stated requirement, not the most resilient one.

The Pilot Light Confusion — What's Running vs What's Not

Pilot light is the strategy that confuses the most candidates because the name implies "almost nothing is running" but that's not quite accurate.

In a pilot light setup:

  • The database is running in the DR region with live data replication. The data needs to be current (low RPO), so you can't afford to have the database stopped.
  • Application servers are NOT running. They're defined (as AMIs or Launch Templates) but not launched. There are no active EC2 instances processing requests.
  • When disaster strikes, you launch the application servers (takes 10-30 minutes to boot, configure, and join the load balancer) and the database is already current.

Warm standby is the next tier up:

  • Everything is running at reduced capacity — database AND application servers. Application servers might be running on smaller instance types (t3.small instead of m5.large) or fewer instances.
  • When disaster strikes, you scale out the application tier (minutes) and scale up instance types. Data is already being served.

The exam trap: "A pilot light configuration has database servers running at all times in the secondary region." This is TRUE, but it is not the distinguishing characteristic, because warm standby also keeps the database running. The distinguishing characteristic is that application servers are not running (stopped or not yet launched). A question that says "both database and application servers are running at minimum capacity" describes warm standby, not pilot light.
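
The distinction above reduces to two yes/no questions, which can be sketched as a small classifier. This is a simplified illustration, not an AWS definition: the function name and its return strings are hypothetical, and real environments have more components than a database tier and an application tier.

```python
def classify_standby(database_running: bool, app_servers_running: bool) -> str:
    """Classify a DR-region setup by what is running there right now."""
    if not database_running:
        # No live, replicated database in the DR region: neither strategy applies.
        return "neither (no live database in DR)"
    if app_servers_running:
        # Database and application tier both running, even at reduced capacity.
        return "warm standby"
    # Database replicating, application servers defined but not launched.
    return "pilot light"


print(classify_standby(database_running=True, app_servers_running=False))
print(classify_standby(database_running=True, app_servers_running=True))
```

Note that the database flag never separates the two strategies; only the application-server flag does.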

Elastic Disaster Recovery (DRS) — The Modern Tool

AWS Elastic Disaster Recovery (previously CloudEndure Disaster Recovery) is the answer when the exam asks about continuous replication and fast failover without a complex standby infrastructure.

DRS continuously replicates block storage from source servers (physical, virtual, or cloud) to a staging area in the target AWS region. The staging area uses low-cost storage and minimal compute. At failover, DRS launches EC2 instances from the latest replicated state — typically in minutes.

Key characteristics to recognize on the exam:

  • Continuous replication: RPO measured in seconds, not hours
  • Automated recovery orchestration: you initiate drills and actual failover from the console, and DRS launches the recovery instances
  • Cross-platform: protects on-premises servers (physical or virtual) by replicating them into AWS, not just AWS-to-AWS
  • Point-in-time recovery: can recover to a specific snapshot in the replication timeline

The exam scenario that points to DRS: "A company runs workloads on-premises and wants to set up disaster recovery in AWS with minimal RTO and RPO. They want ongoing replication without managing standby infrastructure." DRS fits this exactly.

Don't confuse DRS with AWS Backup. AWS Backup manages scheduled backup policies for services such as RDS, EBS, and DynamoDB, so its RPO follows the backup schedule and is typically measured in hours. DRS handles continuous block-level replication with RPO in seconds.

Route 53 Health Checks in DR Scenarios

DR strategies don't mean much without DNS failover — users need to be routed to the DR environment when the primary fails. Route 53 health checks drive this.

Route 53 can monitor an endpoint (by HTTP, HTTPS, or TCP), a CloudWatch alarm, or other health checks (calculated health checks). If a health check fails, Route 53 stops returning the associated DNS record.

Active-Passive (Failover routing policy): Two records with the same name — one primary, one secondary. Route 53 returns the primary record when its health check passes. If the health check fails, Route 53 returns the secondary record. TTL determines how quickly DNS resolvers pick up the change. This is the pattern for pilot light and warm standby DR: primary serves normally, secondary sits ready, DNS switches automatically on failure.
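
As a rough illustration of the active-passive pattern, the sketch below builds the pair of failover records as a plain Python dictionary in the shape accepted by the Route 53 `ChangeResourceRecordSets` API (for example, boto3's `change_resource_record_sets(HostedZoneId=..., ChangeBatch=...)`). No AWS call is made here. The domain name, IP addresses, set identifiers, and health check ID are all hypothetical placeholders, and plain A records are assumed for brevity where a real setup would often use alias records pointing at load balancers.

```python
from typing import Optional


def failover_record(
    name: str,
    ip_address: str,
    role: str,
    set_identifier: str,
    health_check_id: Optional[str] = None,
    ttl: int = 60,
) -> dict:
    """Build one UPSERT change for a failover-routed A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_identifier,  # distinguishes records that share a name
        "Failover": role,
        "TTL": ttl,  # a low TTL shortens how long resolvers cache the old answer
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Hypothetical values throughout; the primary carries the health check.
change_batch = {
    "Comment": "Active-passive failover between primary and DR region",
    "Changes": [
        failover_record(
            "app.example.com.", "192.0.2.10", "PRIMARY", "primary-region",
            health_check_id="hypothetical-health-check-id",
        ),
        failover_record("app.example.com.", "198.51.100.20", "SECONDARY", "dr-region"),
    ],
}
print(len(change_batch["Changes"]))
```

Both records share the same name and type; the `Failover` role and the health check on the primary are what let Route 53 switch answers automatically.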

Active-Active: Multiple records, each with a health check. Route 53 answers queries only from the healthy records, choosing among them according to the routing policy (weighted, latency-based, or geolocation). This matches multi-site active/active DR: traffic always flows to both regions, and losing one region reduces capacity rather than causing a full outage.

The health check request interval is 30 seconds (standard) or 10 seconds (fast). Even with a 10-second interval, DNS TTL and resolver caching mean actual failover from a user's perspective takes longer, typically 1-3 minutes for most resolvers to pick up the change.
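
A back-of-the-envelope estimate shows where the 1-3 minute figure comes from. The sketch assumes that an endpoint is marked unhealthy after a number of consecutive failed checks (a failure threshold of 3 is used here as an assumption) and that, in the worst case, a resolver cached the old answer just before the switch and holds it for the full TTL. The function name is hypothetical, and the model ignores resolvers that disregard TTLs.

```python
def estimated_failover_seconds(
    request_interval_s: int, failure_threshold: int, record_ttl_s: int
) -> int:
    """Worst-case seconds until a client sees the secondary record."""
    detection = request_interval_s * failure_threshold  # time to declare the endpoint unhealthy
    return detection + record_ttl_s  # plus one full TTL of cached answers


fast = estimated_failover_seconds(request_interval_s=10, failure_threshold=3, record_ttl_s=60)
standard = estimated_failover_seconds(request_interval_s=30, failure_threshold=3, record_ttl_s=60)
print(fast, standard)  # 90 150
```

Under these assumptions, fast health checks with a 60-second TTL give about 90 seconds and standard checks about 150 seconds, which is why DNS failover alone cannot deliver a near-zero RTO.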

S3 Cross-Region Replication for DR

S3 data often needs to be part of a DR strategy. CRR (Cross-Region Replication) is the mechanism, and it has constraints the exam tests.

Versioning required: Both source and destination buckets must have versioning enabled. CRR will not configure without this.

Existing objects are not replicated automatically: CRR only replicates objects written after the replication rule is enabled. To replicate existing objects, you must run S3 Batch Replication, which is an S3 Batch Operations job. This is a common trap: companies enable CRR and assume all existing data is now protected, but it isn't.

Replication lag: CRR is asynchronous. There is typically a replication lag of seconds to minutes for most objects, but large objects can lag more. The RPO for CRR-based DR is not zero — it's the current replication lag.

Delete markers: By default, delete markers are not replicated. If you delete an object in the source (creating a delete marker), the object remains in the destination bucket. You can configure replication to also replicate delete markers, but this is not the default.
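
The constraints above show up directly in the replication configuration. The sketch below builds a configuration dictionary in the shape accepted by the S3 `PutBucketReplication` API (for example, boto3's `put_bucket_replication(Bucket=..., ReplicationConfiguration=...)`), without calling AWS. The role ARN, bucket ARN, and rule ID are hypothetical placeholders, and the helper name is made up for this example. Versioning must already be enabled on both buckets for the real call to succeed.

```python
def replication_configuration(
    role_arn: str,
    destination_bucket_arn: str,
    replicate_delete_markers: bool = False,
) -> dict:
    """Build a single-rule CRR configuration covering the whole bucket."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "dr-replication",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: apply to all objects
                "DeleteMarkerReplication": {
                    # Disabled mirrors the default behavior described above.
                    "Status": "Enabled" if replicate_delete_markers else "Disabled"
                },
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


config = replication_configuration(
    "arn:aws:iam::111122223333:role/hypothetical-replication-role",
    "arn:aws:s3:::hypothetical-dr-bucket",
)
print(config["Rules"][0]["DeleteMarkerReplication"]["Status"])  # Disabled
```

Leaving delete marker replication disabled means an accidental delete in the source does not hide the object in the DR bucket; enabling it keeps the two buckets closer to identical.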

The Cost vs RTO Tradeoff Scenario

The exam regularly presents this pattern: a company needs DR, has budget constraints, and you must find the strategy that meets the RTO/RPO at the lowest cost.

The logic to apply:

  1. Determine the required RTO and RPO from the question
  2. Eliminate strategies that can't meet those requirements
  3. Among the remaining strategies, pick the cheapest
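
The three steps can be sketched as a filter followed by a minimum. The numeric limits below are illustrative assumptions chosen to sit at the upper end of the ranges this guide gives for each strategy (with "hours" modeled as 24 hours and "near zero" modeled as 0); they are not official AWS figures, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed worst-case recovery time
    rpo_minutes: float  # assumed worst-case data-loss window
    cost_rank: int      # 1 = cheapest


# Illustrative numbers only; real values depend on the workload.
STRATEGIES = [
    Strategy("Backup and Restore", 24 * 60, 24 * 60, 1),
    Strategy("Pilot Light", 30, 10, 2),
    Strategy("Warm Standby", 10, 1, 3),
    Strategy("Multi-Site Active-Active", 0, 0, 4),
]


def cheapest_strategy(required_rto_minutes: float, required_rpo_minutes: float) -> Strategy:
    """Eliminate strategies that miss the targets, then pick the cheapest survivor."""
    if required_rto_minutes < 0 or required_rpo_minutes < 0:
        raise ValueError("RTO and RPO must be non-negative")
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= required_rto_minutes and s.rpo_minutes <= required_rpo_minutes
    ]
    return min(candidates, key=lambda s: s.cost_rank)


print(cheapest_strategy(24 * 60, 24 * 60).name)  # Backup and Restore
print(cheapest_strategy(15, 5).name)             # Warm Standby
print(cheapest_strategy(0, 0).name)              # Multi-Site Active-Active
```

Because the most resilient strategy passes every filter, the `min` over cost is what keeps the answer from defaulting to multi-site.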

Multi-site active/active meets even the strictest RTO/RPO requirement because both are near zero. It is also the most expensive option because you run full production capacity in two regions simultaneously. Select it only if the RTO requirement is "near zero" or "seconds" and cost is not a constraint.

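If RTO is 4 hours and RPO is 1 hour, backup and restore can work and is the cheapest, provided backups run at least hourly and the restore and rebuild fit within 4 hours; if either condition fails, step up to pilot light. If RTO is 30 minutes and RPO is 5 minutes, warm standby or pilot light with database replication is the answer. Always apply the minimum necessary strategy.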
