This chapter covers the Pilot Light disaster recovery pattern on AWS, a core strategy tested in the SAA-C03 exam under Domain 2: Resilient Architectures (Objective 2.6: Choose disaster recovery strategies). The Pilot Light pattern is one of the four common DR strategies (Backup & Restore, Pilot Light, Warm Standby, Multi-Site Active/Active). Approximately 10-15% of exam questions touch on disaster recovery, and Pilot Light is frequently the correct answer when the requirement balances cost and recovery time. You will learn the precise architecture, step-by-step failover process, key AWS services involved, and common exam traps.
Imagine a house with a gas furnace that has a pilot light. The pilot light is a small, continuously burning flame that keeps the furnace ready to ignite the main burners at a moment's notice. In normal operation, the pilot light uses a tiny amount of gas, just enough to maintain its flame. If the thermostat calls for heat, the main gas valve opens, and the pilot light instantly ignites the larger burners to warm the house. Now, consider that your house needs to be fully operational only during a severe winter storm, but you don't want to waste energy keeping it warm all year. Instead, you keep the pilot light on – a minimal set of systems – and the rest of the house is cold and dark. When a storm is forecast, you turn up the thermostat; the pilot light triggers the main furnace, and within minutes, the house is warm. In AWS terms, the pilot light is your smallest possible core set of services (e.g., a small database, a minimal app server) that are always running. The rest of your infrastructure (full app servers, load balancers, auto scaling groups) is stopped or in a dormant state. When disaster strikes, you 'turn up the thermostat' by scaling out those services, and your full production environment becomes operational quickly. The pilot light pattern ensures you have the essential data and configuration ready, but you only pay for the tiny flame until you need the full fire.
What Is the Pilot Light DR Pattern?
The Pilot Light pattern is a disaster recovery (DR) strategy where a minimal core set of AWS resources runs continuously in a secondary Region (or Availability Zone), while the full production environment is scaled down or stopped. The name comes from a gas furnace pilot light: a small flame that can instantly ignite the main burners. In AWS, the 'pilot light' typically includes:
A replicated database (e.g., an Amazon RDS cross-Region read replica, an Aurora Global Database secondary cluster, or DynamoDB global tables)
A small EC2 instance or two running critical services (e.g., DNS, configuration management)
Possibly a small ALB, or a load balancer that is defined in a template and deployed only at failover (a load balancer cannot be stopped)
Data replication mechanisms (e.g., S3 Cross-Region Replication, AWS DMS ongoing replication)
The rest of the infrastructure – such as Auto Scaling groups with many EC2 instances, Elastic Load Balancers, and full application tiers – is either not deployed or is in a stopped/detached state. Upon a disaster declaration, you 'ignite' the full environment by scaling out the existing pilot light resources and launching additional resources from pre-prepared Amazon Machine Images (AMIs) or CloudFormation templates.
Why Use Pilot Light?
Pilot Light offers a middle ground between Backup & Restore (low cost but high Recovery Time Objective – RTO) and Warm Standby (higher cost but lower RTO). Typical RTO for Pilot Light is minutes to a few hours, and Recovery Point Objective (RPO) is typically seconds to minutes, depending on data replication frequency. It is ideal for workloads that can tolerate a short outage (e.g., 15-30 minutes) but cannot afford the cost of a full standby environment. The exam expects you to recommend Pilot Light when the scenario mentions 'cost-sensitive' and 'RTO of less than an hour' but not 'near-zero RTO.'
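The selection rule described above can be sketched as a small decision function. This is only an illustration of the chapter's rules of thumb: the `Requirement` type, the function name, and the 1-minute and 60-minute cut-offs are assumptions made for the example, not AWS-published limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    rto_minutes: float    # maximum tolerable downtime
    cost_sensitive: bool  # scenario says "minimize cost" or similar


def recommend_strategy(req: Requirement) -> str:
    """Map an RTO/cost requirement to a DR strategy (illustrative thresholds)."""
    if req.rto_minutes < 1:
        # Near-zero RTO needs capacity that is already running.
        return "Warm Standby or Multi-Site Active/Active"
    if req.rto_minutes <= 60:
        # Under an hour: Pilot Light when cost matters, otherwise Warm Standby.
        return "Pilot Light" if req.cost_sensitive else "Warm Standby"
    # Simplification: anything slower than an hour is treated as Backup & Restore.
    return "Backup & Restore"


print(recommend_strategy(Requirement(rto_minutes=30, cost_sensitive=True)))
```

Real scenarios also weigh RPO and operational complexity, so treat the output as a first filter rather than a final answer.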
How It Works Internally
#### Data Replication Layer
The foundation of Pilot Light is continuous data replication. The most common services:
Amazon RDS: Use a cross-Region read replica (supported for MySQL, MariaDB, PostgreSQL, Oracle, and SQL Server). The replica is promoted to a standalone primary instance during failover. A Multi-AZ deployment keeps its standby inside the same Region, so on its own it does not provide cross-Region recovery.
Amazon Aurora: Use Aurora Global Database, which replicates to a secondary cluster in another Region that can be promoted during failover.
Amazon DynamoDB: Use global tables to replicate data across Regions, typically with sub-second latency.
Amazon S3: Enable Cross-Region Replication (CRR) on buckets. Replication is asynchronous; most objects replicate within 15 minutes, and S3 Replication Time Control (RTC) adds an SLA for that window.
AWS Database Migration Service (DMS): For ongoing replication from on-premises or EC2 databases.
#### Compute Layer
EC2 Instances: The pilot light includes a small number of EC2 instances (e.g., t3.nano) running essential services such as internal DNS, monitoring agents, or a minimal web server. They are kept small to minimize cost.
Auto Scaling Groups: Not active in the pilot light state. Instead, you maintain pre-built AMIs for your application servers. These AMIs are updated regularly (e.g., weekly) via automated pipelines (e.g., EC2 Image Builder).
Elastic Load Balancer: An ALB cannot be stopped, and it accrues hourly charges for as long as it exists. You can either keep a small ALB deployed in the DR Region with an empty target group, or define the ALB in a CloudFormation template and launch it only during failover.
#### Networking Layer
VPC: A VPC exists in the DR Region, with subnets, route tables, and security groups pre-configured. The pilot light instances run in private subnets.
Route 53: DNS records are configured with health checks and failover routing. Typically, you use a weighted or failover record set that points to the primary Region's load balancer. When the primary fails, Route 53 automatically routes traffic to the DR Region's load balancer.
VPN/Direct Connect: If connecting to on-premises, you may have a secondary VPN connection or Direct Connect virtual interface to the DR Region.
Step-by-Step Failover Process
Detection: Use Amazon CloudWatch alarms or AWS Health Dashboard to detect a disaster impacting the primary Region. Automated scripts (e.g., AWS Lambda) can initiate failover.
Scale Out Compute: Launch EC2 instances from the pre-built AMIs. Register them with the Auto Scaling group and attach to the pre-created ALB target group. This can be automated via AWS Systems Manager Automation or CloudFormation.
Promote Database: If using an RDS read replica, promote it to a standalone instance. For Aurora, fail over to the Aurora Global Database secondary Region.
Update DNS: Route 53 health checks detect primary failure and automatically switch to the DR load balancer. Alternatively, manually update DNS records.
Verify: Run health checks and smoke tests to confirm the application is functional.
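The scale-out and promotion steps above can be sketched as one orchestration function. This is a minimal sketch, not a production runbook: the function name and its parameters are hypothetical, and the clients are passed in (for example `boto3.client("autoscaling", region_name=...)` and `boto3.client("rds", region_name=...)` for the DR Region) so the logic can be exercised with stubs. It assumes the Auto Scaling group already exists with zero capacity.

```python
import time
from typing import Any, Callable, List


def run_failover(
    autoscaling: Any,
    rds: Any,
    asg_name: str,
    replica_id: str,
    desired_capacity: int,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 60,
) -> List[str]:
    """Scale out compute, promote the replica, and wait until it is standalone."""
    steps: List[str] = []

    # Scale out the dormant Auto Scaling group in the DR Region.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=desired_capacity,
        MaxSize=desired_capacity * 2,  # headroom is an arbitrary choice here
        DesiredCapacity=desired_capacity,
    )
    steps.append("scaled_out")

    # Promote the cross-Region read replica to a standalone instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    steps.append("promotion_requested")

    # Poll until the instance is available and no longer reports a source.
    for _ in range(max_polls):
        instance = rds.describe_db_instances(
            DBInstanceIdentifier=replica_id
        )["DBInstances"][0]
        is_available = instance["DBInstanceStatus"] == "available"
        still_replica = bool(instance.get("ReadReplicaSourceDBInstanceIdentifier"))
        if is_available and not still_replica:
            steps.append("database_available")
            return steps
        sleep(10)
    raise TimeoutError(f"{replica_id} was not promoted within the polling window")
```

DNS cutover is left to the Route 53 health checks in this sketch; a drill would finish with the smoke tests from the Verify step.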
Key Components, Values, and Defaults
RDS Cross-Region Read Replica: Supports MySQL, MariaDB, PostgreSQL, Oracle, and SQL Server. Replication is asynchronous; lag is often only a few seconds or less under normal conditions, but it is not guaranteed and grows with write volume. Promotion typically completes within a few minutes.
Aurora Global Database: Consists of one primary Region and up to five secondary Regions. Replication lag is typically < 1 second. Failover to a secondary Region takes about 1 minute.
DynamoDB Global Tables: Replication is multi-Region, fully managed, and provides eventual consistency. RPO is typically < 1 second.
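S3 CRR: Requires versioning on both source and destination buckets. Replication is asynchronous; with S3 Replication Time Control (RTC) enabled, 99.99% of objects are replicated within 15 minutes. Without RTC, most objects replicate within 15 minutes but there is no time guarantee.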
EC2 AMI: Use EC2 Image Builder to automate AMI creation on a schedule (e.g., daily, weekly). Store AMIs in the DR Region.
CloudFormation: Use StackSets to deploy infrastructure across Regions consistently.
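For the AMI item in the list above, copying an image into the DR Region can be sketched as follows. The helper name is hypothetical; `ec2_dr` is assumed to be an EC2 client created for the DR Region (for example `boto3.client("ec2", region_name="us-west-2")`), because an AMI copy is requested from the destination side. EC2 Image Builder can also distribute AMIs to other Regions itself, so this is only one possible approach.

```python
from typing import Any


def copy_ami_to_dr(ec2_dr: Any, source_image_id: str, source_region: str, name: str) -> str:
    """Copy an AMI into the DR Region and return the new image ID."""
    response = ec2_dr.copy_image(
        Name=name,
        SourceImageId=source_image_id,
        SourceRegion=source_region,
        Description=f"DR copy of {source_image_id} from {source_region}",
    )
    # The copy runs asynchronously; the new AMI starts in the "pending" state.
    return response["ImageId"]
```

The returned image ID is what the DR launch template or CloudFormation parameter would need to reference.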
Configuration and Verification Commands (AWS CLI)
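Create RDS cross-Region read replica (run against the destination Region):
    aws rds create-db-instance-read-replica \
        --db-instance-identifier my-read-replica \
        --source-db-instance-identifier arn:aws:rds:us-west-2:123456789012:db:my-primary \
        --region us-east-1
Promote read replica:
    aws rds promote-read-replica \
        --db-instance-identifier my-read-replica \
        --region us-east-1
Enable S3 CRR (save as replication.json; versioning must already be enabled on both buckets):
    {
      "Role": "arn:aws:iam::account-id:role/s3-replication-role",
      "Rules": [
        {
          "Status": "Enabled",
          "Priority": 1,
          "Filter": {},
          "DeleteMarkerReplication": { "Status": "Disabled" },
          "Destination": {
            "Bucket": "arn:aws:s3:::destination-bucket"
          }
        }
      ]
    }
Apply the configuration to the source bucket (source-bucket is a placeholder name):
    aws s3api put-bucket-replication \
        --bucket source-bucket \
        --replication-configuration file://replication.json
Interaction with Related Technologies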
AWS Backup: Use AWS Backup to create cross-Region backup copies of EBS snapshots, RDS snapshots, etc. These can be used to restore data in the DR Region if replication fails.
AWS Elastic Disaster Recovery (DRS): A newer service that provides continuous replication for EC2 instances. It can be used as an alternative to custom Pilot Light setups, but the exam still tests the traditional patterns.
AWS Global Accelerator: Can be used to route traffic to the nearest healthy endpoint, providing automatic failover across Regions.
AWS Health Dashboard: Provides events that can trigger automated failover via EventBridge.
Common Exam Scenarios
Scenario: A company has a web application in us-east-1. They want DR in us-west-2 with an RTO of 30 minutes and RPO of 5 minutes. They want to minimize cost. Answer: Pilot Light with RDS cross-Region read replica and pre-provisioned AMIs.
Trap: The question says 'minimize cost' but also 'RTO less than 1 minute.' Pilot Light cannot achieve that; Warm Standby or Multi-Site is needed.
Trap: The question says 'serverless application using Lambda and DynamoDB.' Pilot Light may not be the best fit because Lambda is ephemeral and DynamoDB global tables provide automatic replication. The correct answer might be 'Multi-Region application with DynamoDB global tables.'
Edge Cases and Exceptions
RDS Read Replica Promotion: After promotion, the replica becomes a standalone instance. It cannot be converted back to a replica. You must re-create the replica after failback.
Aurora Global Database: The secondary Region can be promoted to primary in less than 1 minute, but the old primary becomes unavailable. Failback requires rebuilding the global database.
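S3 CRR: When you delete an object in the source bucket without specifying a version ID, S3 adds a delete marker; that marker is replicated only if the rule enables 'DeleteMarkerReplication.' Deletions of a specific object version are never replicated. Also, CRR does not replicate objects that existed before it was enabled (unless you use S3 Batch Replication).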
Route 53 Failover: The TTL of DNS records affects how quickly clients switch. Use a low TTL (e.g., 60 seconds) to speed up failover.
Identify Core Services
Determine the minimal set of services that must always be running in the DR Region to support failover. This typically includes a replicated database (e.g., RDS read replica or DynamoDB global table), a small EC2 instance for DNS or configuration, and an S3 bucket with CRR enabled. The goal is to keep the 'pilot light' as small as possible to minimize cost, while ensuring that the essential data and configuration are available. For example, if your application uses RDS, you create a cross-Region read replica that stays in sync with the primary. This replica consumes compute and storage resources 24/7, but you can choose a smaller instance class than the primary to save money (though performance may lag). The pilot light also includes IAM roles, security groups, and VPC resources that are pre-created in the DR Region.
Prepare Automated Recovery
Create automation scripts or CloudFormation templates that can launch the full production environment in the DR Region. This includes Auto Scaling groups, ELBs, additional EC2 instances, and any dependent services. The automation should reference pre-built AMIs that are regularly updated. Use AWS Systems Manager Automation or AWS Lambda functions to orchestrate the failover. For example, a Lambda function can be triggered by an EventBridge rule that detects a Region failure via CloudWatch alarms. The function then calls CloudFormation to deploy the full stack. Alternatively, you can use AWS Elastic Disaster Recovery (DRS) which automates replication and launch of instances.
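The "function calls CloudFormation" step could look like the sketch below. The helper name and the template parameter keys `AmiId` and `DesiredCapacity` are hypothetical and depend entirely on how your template is written; `cfn` is assumed to be a CloudFormation client for the DR Region, which a Lambda handler would create with `boto3.client("cloudformation", region_name=...)`.

```python
from typing import Any


def deploy_full_stack(
    cfn: Any,
    stack_name: str,
    template_url: str,
    ami_id: str,
    desired_capacity: int,
) -> str:
    """Launch the full DR environment from a template stored in S3."""
    response = cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=[
            # Parameter keys are assumptions about the template's interface.
            {"ParameterKey": "AmiId", "ParameterValue": ami_id},
            {"ParameterKey": "DesiredCapacity", "ParameterValue": str(desired_capacity)},
        ],
        # Needed only if the template creates named IAM resources.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    return response["StackId"]
```

Keeping the client as an argument makes the function easy to rehearse in DR drills with a stub before pointing it at a real account.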
Configure DNS Failover
Set up Amazon Route 53 with failover routing policy. Create a record set for your domain (e.g., www.example.com) with two records: one pointing to the primary Region's load balancer (set as primary) and one pointing to the DR Region's load balancer (set as secondary). Associate health checks with the primary record. If the health check fails, Route 53 automatically routes traffic to the secondary. Ensure TTL is low (e.g., 60 seconds) to allow quick propagation. For weighted routing, you can assign weight 100 to primary and 0 to secondary, then change weights during failover.
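The primary/secondary record pair described above can be expressed as a Route 53 change batch. The sketch below only builds the request body; the function name, load balancer hostnames, and health check ID are placeholders. It uses CNAME records so that the 60-second TTL from the text applies; alias records pointing at a load balancer are a common alternative and do not take a TTL.

```python
from typing import Any, Dict, Optional


def failover_change_batch(
    name: str,
    primary_dns: str,
    secondary_dns: str,
    health_check_id: str,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build a ChangeBatch with PRIMARY and SECONDARY failover records."""

    def record(role: str, target: str, check: Optional[str] = None) -> Dict[str, Any]:
        rrset: Dict[str, Any] = {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        }
        if check is not None:
            rrset["HealthCheckId"] = check
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Pilot light failover records",
        "Changes": [
            record("PRIMARY", primary_dns, health_check_id),
            record("SECONDARY", secondary_dns),
        ],
    }
```

The result would be passed as `ChangeBatch` to `change_resource_record_sets` on a Route 53 client, together with the hosted zone ID.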
Test Failover Regularly
Conduct periodic DR drills to validate that the pilot light pattern works. Simulate a Region failure by stopping the primary instance or blocking network traffic. Verify that the automation launches the full environment, the database promotes correctly, and DNS switches traffic. Measure RTO and RPO to ensure they meet requirements. Use AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) to inject failures. Document the runbook and update automation based on lessons learned. Regular testing is critical because stale AMIs or configuration drift can cause failover failures.
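Measuring RTO and RPO in a drill comes down to three timestamps: when the outage began, when the last write reached the replica, and when service was restored. A minimal sketch, with hypothetical function names and made-up drill times:

```python
from datetime import datetime, timedelta
from typing import Tuple


def measure_drill(
    outage_start: datetime,
    last_replicated_write: datetime,
    service_restored: datetime,
) -> Tuple[timedelta, timedelta]:
    """Return (achieved RTO, achieved RPO) for one drill."""
    rto = service_restored - outage_start       # how long users were down
    rpo = outage_start - last_replicated_write  # how much data was at risk
    return rto, rpo


def meets_targets(rto: timedelta, rpo: timedelta,
                  rto_target: timedelta, rpo_target: timedelta) -> bool:
    """True when both achieved values are within their targets."""
    return rto <= rto_target and rpo <= rpo_target


# Example values are invented for illustration.
rto, rpo = measure_drill(
    outage_start=datetime(2024, 1, 1, 12, 0, 0),
    last_replicated_write=datetime(2024, 1, 1, 11, 59, 58),
    service_restored=datetime(2024, 1, 1, 12, 22, 0),
)
print(rto, rpo, meets_targets(rto, rpo, timedelta(minutes=30), timedelta(minutes=5)))
```

Recording these values for every drill makes drift visible: a rising RTO often points at stale AMIs or slower stack launches.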
Implement Monitoring and Alerts
Set up Amazon CloudWatch alarms to monitor the health of the primary Region and the replication lag of the database. For RDS read replicas, monitor the 'ReplicaLag' metric. For DynamoDB global tables, monitor 'ReplicationLatency'. Create alarms that trigger when lag exceeds acceptable thresholds (e.g., > 5 seconds). Also monitor the health of the pilot light instances. Use AWS Health Dashboard events to detect Region-wide issues. When an alarm triggers, send notifications via SNS and optionally invoke automated failover scripts.
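The replica lag alarm described above can be written down as the keyword arguments for CloudWatch's `put_metric_alarm`. This sketch only builds the parameters; the function name, alarm name pattern, evaluation settings, and SNS topic ARN are assumptions to adapt, and the 5-second threshold simply mirrors the example in the text.

```python
from typing import Any, Dict


def replica_lag_alarm_params(
    replica_id: str,
    sns_topic_arn: str,
    threshold_seconds: float = 5.0,
) -> Dict[str, Any]:
    """Build put_metric_alarm arguments for an RDS ReplicaLag alarm."""
    return {
        "AlarmName": f"{replica_id}-replica-lag",
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",  # reported in seconds
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        "Statistic": "Average",
        "Period": 60,            # one-minute datapoints
        "EvaluationPeriods": 3,  # require three consecutive breaches
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        # A replica that stops reporting lag is treated as unhealthy.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }
```

With a CloudWatch client in the replica's Region, the alarm would be created with `cloudwatch.put_metric_alarm(**params)`.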
Scenario 1: E-Commerce Platform with RTO of 30 Minutes
A mid-size e-commerce company runs its application on EC2 instances behind an ALB with an RDS Multi-AZ database in us-east-1. They need DR in us-west-2 with an RTO of 30 minutes and RPO of 5 minutes, but they cannot afford a full warm standby. The solution: implement Pilot Light. They create a cross-Region read replica of their RDS instance in us-west-2 (using db.r5.large instead of db.r5.xlarge to save 50% on compute). They also run a single t3.nano EC2 instance hosting a configuration management agent, plus a small ElastiCache for Redis node for session data. S3 Cross-Region Replication is enabled for static assets. They use EC2 Image Builder to create a weekly AMI of their application server and copy it to us-west-2. A CloudFormation template is stored in S3 and can be executed to launch the full environment (Auto Scaling group, ALB, additional instances). A Lambda function triggered by a CloudWatch alarm (based on RDS ReplicaLag > 10 seconds) executes the template. In production, a minor AWS outage in us-east-1 caused a 15-minute database failover; the pilot light pattern allowed full recovery in 22 minutes, well within the 30-minute RTO. The monthly cost for the DR resources was under $500, compared to an estimated $3,000 for a warm standby.
Scenario 2: Financial Services Application with Strict Compliance
A financial services firm runs a critical application on a stack of EC2 instances, RDS for PostgreSQL, and ElastiCache. They require DR in a second Region with an RPO of 1 second and RTO of 5 minutes. They initially considered Multi-Site Active/Active but found it too complex and costly. Instead, they chose Pilot Light with Aurora Global Database, which provides replication lag of typically < 1 second. They run a single db.r5.large instance in the secondary cluster in the DR Region. (Aurora also allows a headless secondary cluster with no DB instances until failover, which lowers cost but adds instance creation time to the RTO.) They also maintain a small EC2 instance running a health check endpoint. Automation uses AWS Systems Manager Automation documents that promote the secondary Aurora cluster, update the Auto Scaling group launch template, and switch Route 53 records. During a compliance audit, they demonstrated a failover drill that completed in 4 minutes 30 seconds. The key challenge was ensuring that the AMIs for EC2 instances were updated daily to include the latest security patches; they used EC2 Image Builder with a scheduled pipeline.
Scenario 3: SaaS Provider with Multi-Tenant Architecture
A SaaS provider uses a multi-Region architecture but wants to minimize DR costs. They run a fleet of EC2 instances, RDS for MySQL, and S3 for tenant data. For DR, they implement Pilot Light by using RDS cross-Region read replicas for each tenant database (they have 50 tenants). They also run a single small EC2 instance that hosts a configuration service. They use AWS CloudFormation StackSets to deploy the same infrastructure in the DR Region but with a parameter to keep the compute resources at zero (e.g., Auto Scaling group min=0). Upon failover, they update the parameter to min=5 and the stack updates. They also use DynamoDB global tables for session state. The main challenge was managing the replication lag across 50 read replicas; they used CloudWatch composite alarms to detect when any replica lag exceeded 10 seconds. During a real outage caused by a power failure in the primary data center, they successfully failed over in 18 minutes. The cost savings compared to a warm standby were approximately 70%.
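The composite alarm in this scenario fires when any of the per-replica lag alarms is in the ALARM state. A sketch of building the rule expression, assuming one metric alarm per tenant replica already exists; the function name and the alarm names are invented for the example:

```python
from typing import Iterable


def any_replica_lagging_rule(alarm_names: Iterable[str]) -> str:
    """Build a composite AlarmRule that is true if any child alarm is in ALARM."""
    names = list(alarm_names)
    if not names:
        raise ValueError("at least one child alarm name is required")
    return " OR ".join(f'ALARM("{name}")' for name in names)


# Hypothetical naming scheme: one lag alarm per tenant replica.
child_alarms = [f"tenant-{i:02d}-replica-lag" for i in range(1, 4)]
print(any_replica_lagging_rule(child_alarms))
```

The resulting string would be passed as `AlarmRule` to CloudWatch's `put_composite_alarm`. The rule expression has a length limit, so a very large fleet may need to be grouped into several composite alarms.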
SAA-C03 Exam Focus on Pilot Light
The SAA-C03 exam tests Pilot Light under Objective 2.6: 'Choose disaster recovery strategies.' You must be able to differentiate Pilot Light from Backup & Restore, Warm Standby, and Multi-Site Active/Active. The exam often presents a scenario with specific RTO/RPO requirements and cost constraints. Pilot Light is the correct answer when:
RTO is 'minutes to an hour' (typically 15-30 minutes)
RPO is 'seconds to minutes'
Cost is a concern (you want to minimize ongoing DR costs)
The application can tolerate a short failover time
Common Wrong Answers and Why
Warm Standby – Candidates often choose this because it has a lower RTO, but the scenario emphasizes 'minimize cost.' Warm standby runs a scaled-down version of the production environment at all times, which costs significantly more than pilot light. If the scenario says 'cost-effective' or 'minimize cost,' Pilot Light is usually the answer.
Backup & Restore – This has the lowest cost but the highest RTO (hours). If the scenario requires RTO of 30 minutes, Backup & Restore is too slow.
Multi-Site Active/Active – This provides near-zero RTO but is the most expensive and complex. It is only correct when the scenario says 'zero downtime' or 'active-active across Regions.'
Using only S3 CRR – Some candidates think S3 CRR alone constitutes Pilot Light, but Pilot Light requires compute and database replication as well. S3 CRR is just one component.
Specific Numbers and Terms That Appear on the Exam
RTO for Pilot Light: typically 15-30 minutes (the exam may say 'under 1 hour')
RPO: seconds to minutes (e.g., 5 minutes)
Services: RDS cross-Region read replica, Aurora Global Database, DynamoDB global tables, S3 CRR, EC2 AMIs, CloudFormation, Route 53 failover
The phrase 'pilot light' is used explicitly in the exam
The exam may ask: 'Which DR strategy uses a small core set of resources running in the DR Region?'
Edge Cases and Exceptions
If the database is not compatible with cross-Region read replicas (e.g., SQL Server with basic license), Pilot Light may require log shipping or DMS.
If the application is serverless (Lambda, DynamoDB, S3), Pilot Light may not be the best pattern: these services are still Regional, but they scale on demand and bill per use, so there is no idle compute to keep switched off. The exam might expect 'Multi-Region serverless' instead.
If the RPO must be zero, Pilot Light cannot achieve it; you need synchronous replication, which the cross-Region replication services covered here do not provide (Aurora Global Database is also asynchronous, though with very low lag).
How to Eliminate Wrong Answers
If the scenario mentions 'cost' and 'RTO of less than 1 hour,' eliminate Backup & Restore (too slow) and Warm Standby (too expensive).
If the scenario mentions 'automated failover in under 1 minute,' eliminate Pilot Light (failover takes minutes).
If the scenario mentions 'on-premises to AWS DR,' Pilot Light is still applicable but may require AWS DMS or Storage Gateway.
Exam Tip: Always read the RTO and cost requirements first. Then match to the DR pattern. Pilot Light is the sweet spot for moderate RTO and low cost.
Pilot Light is a DR pattern where only a minimal core set of resources (e.g., database replica, small EC2) runs in the DR Region to reduce cost.
Typical RTO for Pilot Light is 15-30 minutes; RPO is seconds to minutes (depends on replication).
Key services: RDS cross-Region read replicas, Aurora Global Database, DynamoDB global tables, S3 CRR, EC2 AMIs, CloudFormation, Route 53 failover.
The exam expects you to choose Pilot Light when cost is a concern and RTO is under 1 hour but not sub-minute.
Do not confuse Pilot Light with Warm Standby (which runs a scaled-down environment) or Backup & Restore (which has higher RTO).
Automation (Lambda, CloudFormation, Systems Manager) is critical for scaling up during failover.
Regular DR testing is required because stale AMIs or configuration drift can cause failover failure.
Pilot Light and Warm Standby come up on the exam all the time. Here's how to tell them apart.
Pilot Light
Runs only core data and minimal compute (pilot light).
Lower cost – only pays for minimal resources.
RTO: minutes to hours (typically 15-30 minutes).
Requires automation to scale up during failover.
Database replication is continuous (e.g., read replica).
Warm Standby
Runs a scaled-down but fully functional environment.
Higher cost – pays for standby compute 24/7.
RTO: seconds to minutes (typically < 5 minutes).
Can handle traffic immediately after DNS switch.
Database is fully operational and can serve reads.
Mistake
Pilot Light requires running a full-scale production environment in the DR Region.
Correct
Pilot Light runs only a minimal core set of resources (e.g., a small database replica and a single EC2 instance). The full environment is launched only when failover occurs.
Mistake
Pilot Light and Warm Standby are the same thing.
Correct
Warm Standby runs a scaled-down but fully functional version of the production environment that can handle traffic immediately. Pilot Light runs only essential data and configuration; compute must be scaled up during failover.
Mistake
S3 Cross-Region Replication alone constitutes a Pilot Light.
Correct
S3 CRR is only one component. Pilot Light also requires database replication and compute resources that can be scaled up quickly.
Mistake
Pilot Light can achieve an RTO of less than 1 minute.
Correct
Pilot Light typically has an RTO of 15-30 minutes because it requires launching instances and promoting databases. For sub-minute RTO, use Multi-Site Active/Active or Warm Standby with pre-provisioned resources.
Mistake
You cannot use Pilot Light with serverless applications.
Correct
Serverless applications (Lambda, DynamoDB, S3) can use a form of Pilot Light by having DynamoDB global tables and S3 CRR, but the compute (Lambda) is automatically scaled. However, the exam typically does not refer to this as Pilot Light; it's often called 'Multi-Region serverless.'
Pilot Light runs only essential data and a tiny compute footprint (e.g., a database replica and a single EC2 instance). Upon failover, you launch the full environment from pre-provisioned AMIs and CloudFormation templates. Warm Standby runs a scaled-down but fully functional version of your production environment that can handle traffic immediately after DNS switch. Warm Standby has a lower RTO (seconds to minutes) but higher cost. Pilot Light is cheaper but has a higher RTO (minutes to hours). The exam tests this distinction frequently.
Use Amazon EventBridge to detect a disaster event (e.g., CloudWatch alarm on RDS ReplicaLag exceeding a threshold). Trigger an AWS Lambda function that runs CloudFormation to deploy the full environment, promotes the database replica, and updates Route 53 DNS records. Alternatively, use AWS Systems Manager Automation documents. Ensure your automation scripts are tested regularly.
Yes. You can replicate data from on-premises to AWS using AWS DMS (ongoing replication) or Storage Gateway. The pilot light runs in a chosen AWS Region. During failover, you launch the full environment in that Region. This is a common pattern for hybrid DR.
Cost includes the database replica (e.g., RDS instance), a small EC2 instance, S3 storage and replication costs, and data transfer. You avoid paying for idle compute resources. Typically, Pilot Light costs 10-20% of a full warm standby. For example, an RDS db.r5.large replica costs ~$0.24/hour, plus a t3.nano EC2 instance at ~$0.0052/hour.
RPO depends on the replication mechanism. For RDS cross-Region read replicas, RPO equals the replica lag, which is often seconds under normal conditions. For DynamoDB global tables, RPO is typically sub-second. For S3 CRR with Replication Time Control (RTC), 99.99% of objects are replicated within 15 minutes. With Aurora Global Database, RPO is typically < 1 second. Overall, you can achieve an RPO of seconds to minutes.
Yes. After the primary Region recovers, you can failback by reversing the replication direction. For RDS, you create a new read replica from the promoted instance and then promote it back. For Aurora, you rebuild the global database. Automation scripts should include failback steps.
Pilot Light cannot achieve sub-minute RTO because it requires launching compute resources. It also requires ongoing replication, which incurs data transfer costs. If the application state is not fully captured in the database (e.g., in-memory caches), you may lose some state. Additionally, the database replica may lag during peak loads.
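The hourly rates quoted in the cost answer above can be turned into a rough monthly figure. This is a back-of-the-envelope sketch: it assumes 730 hours per month and covers only the two always-on instances, leaving out storage, replication, and data transfer charges.

```python
from typing import Dict

HOURS_PER_MONTH = 730  # common approximation (8,760 hours / 12)


def monthly_cost(hourly_rates: Dict[str, float]) -> float:
    """Sum hourly rates and convert to an approximate monthly cost in USD."""
    return round(sum(hourly_rates.values()) * HOURS_PER_MONTH, 2)


# Rates as quoted in the text; actual prices vary by Region and engine.
pilot_light = {"rds_db_r5_large": 0.24, "ec2_t3_nano": 0.0052}
print(monthly_cost(pilot_light))  # prints 179.0
```

Almost all of the always-on compute cost comes from the database replica, which is why choosing a smaller replica instance class has the largest effect on the pilot light bill.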