
Multi-Site Active-Active DR on AWS

This chapter covers multi-site active-active disaster recovery on AWS, a critical pattern for achieving high availability and resilience across multiple geographic locations. For the SAA-C03 exam, this topic appears in roughly 10-15% of questions related to Resilient Architectures (Objective 2.6). You will need to understand how to design and implement active-active architectures using Route 53, Global Accelerator, Aurora Global Database, DynamoDB global tables, and S3 Cross-Region Replication. We will focus on the mechanisms, trade-offs, and exam traps associated with active-active vs. active-passive designs.

25 min read
Intermediate
Updated May 31, 2026

Two Data Centers as a Dual-Engine Jet

Imagine a commercial jet with two independent engines, each capable of powering the entire aircraft. In normal flight, both engines run simultaneously, sharing the load. If one engine fails, the other instantly takes over at full power, and the plane continues without any loss of altitude or speed. This is active-active disaster recovery on AWS. You have two (or more) AWS Regions or Availability Zones, each actively serving traffic and processing data. They are fully independent but work together to provide a seamless experience. Just as the jet's engines have their own fuel lines, control systems, and sensors, each site has its own compute, storage, and networking resources. The pilot doesn't switch to a backup engine—both are always running. Similarly, in active-active DR, there is no standby site; every site is live. If one site fails, the remaining sites absorb the load instantly because they were already handling traffic. The key is that the load distribution and failover must be automatic and transparent to the users, just like the jet's flight computer balances thrust without pilot intervention. This requires careful design of data replication, DNS routing, and application architecture to ensure consistency and availability.

How It Actually Works

What is Multi-Site Active-Active DR?

Multi-site active-active disaster recovery (DR) is an architecture where two or more geographically separated AWS Regions (or Availability Zones) actively serve user traffic and process data simultaneously. Unlike active-passive (or pilot light) where one site is idle and only takes over during failure, active-active distributes load across all sites. This provides higher resource utilization, lower latency for global users, and faster failover (typically seconds, not minutes). However, it introduces complexity in data consistency, conflict resolution, and cost.

Why Active-Active?

The primary driver is recovery time objective (RTO) and recovery point objective (RPO). Active-active can achieve RTO of seconds to minutes and RPO of seconds (or even near-zero with synchronous replication). It also eliminates the waste of idle resources in passive setups. For global applications, active-active allows traffic to be served from the nearest region, reducing latency. The SAA-C03 exam expects you to know when to choose active-active over active-passive based on RTO/RPO requirements.

Core Components

1. Route 53 – DNS-based traffic routing with health checks. Use latency-based, geolocation, or weighted routing policies to distribute traffic. When a health check fails, Route 53 stops returning the unhealthy record, but resolvers and clients that cached the old answer keep using it until the TTL expires (TTL matters).

2. AWS Global Accelerator – Uses anycast IPs to route traffic to the nearest healthy endpoint. Provides faster failover than DNS because it operates at the network layer and doesn't depend on DNS TTL; once an endpoint is detected as unhealthy, rerouting is sub-second.

3. Data Replication – For databases, use Aurora Global Database (asynchronous storage-level replication across regions, typical lag and RPO < 1 second) or DynamoDB global tables (active-active replication, eventually consistent across regions by default). For S3, use Cross-Region Replication (CRR) to copy objects to another region; Same-Region Replication (SRR) copies objects within one region and does not protect against a regional outage. Replication introduces latency and cost.

4. Application Architecture – Must be stateless or have session state stored in a shared, replicated store (e.g., ElastiCache with global datastore, or DynamoDB). Stateful applications require careful design to avoid data loss during failover.

5. Health Checks and Monitoring – Route 53 health checks, CloudWatch alarms, and AWS Health events trigger failover actions.

How Active-Active Works Internally

Let's step through a typical active-active setup with two regions (us-east-1 and eu-west-1):

DNS Resolution: A user in New York queries Route 53 for app.example.com. Route 53 uses a latency-based routing policy. It evaluates the latency between the user's DNS resolver and each region's health-checked endpoints. It returns the IP address of the region with the lowest latency (likely us-east-1). The DNS response includes a TTL (e.g., 60 seconds).

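Traffic Flow: The user's browser connects directly to the ALB in us-east-1. The application serves the request. If the application needs to write data, it writes to the local Aurora database (primary in us-east-1). Aurora Global Database asynchronously replicates the write to eu-west-1, typically in under a second. A request served in eu-west-1 reads from its local secondary cluster but must send writes to the us-east-1 primary, either directly or through write forwarding.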

Failover Scenario: If us-east-1 experiences a failure, Route 53 health checks fail (after 3 consecutive failures by default; the check interval is 30 seconds by default, or 10 seconds with fast health checks). Route 53 stops returning us-east-1's IP. New DNS queries return only eu-west-1's IP. Existing users with cached DNS records may still hit the failed region until TTL expires. To accelerate failover, use Global Accelerator: it checks endpoint health every 30 seconds by default (10 seconds is optional) and, once an endpoint is marked unhealthy, shifts traffic to a healthy endpoint without waiting for DNS caches to expire.

Data Consistency: During the failover, any un-replicated writes in us-east-1 are lost (RPO = replication lag). Aurora Global Database can promote one of the secondary regions to primary if needed. DynamoDB global tables use last-writer-wins conflict resolution.
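The resolution and failover steps above can be sketched as a small selection function. This is a simplified model rather than Route 53's actual algorithm: the RegionRecord type, the latency figures, and the resolve function are hypothetical names chosen for illustration, and the sketch simply returns None when no region is healthy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RegionRecord:
    region: str
    latency_ms: float  # assumed latency from the user's resolver to this region
    healthy: bool      # outcome of the health check attached to the record


def resolve(records: List[RegionRecord]) -> Optional[str]:
    """Return the lowest-latency healthy region, or None if none is healthy."""
    healthy = [r for r in records if r.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.latency_ms).region


records = [
    RegionRecord("us-east-1", latency_ms=12.0, healthy=True),
    RegionRecord("eu-west-1", latency_ms=85.0, healthy=True),
]
normal_answer = resolve(records)    # both regions healthy

records[0].healthy = False          # us-east-1 fails its health checks
failover_answer = resolve(records)  # only eu-west-1 remains eligible

print(normal_answer, failover_answer)
```

Clients that cached the first answer would keep using it until the TTL expires, which is the gap that anycast routing is meant to close.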

Key Defaults and Timers

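Route 53 health check interval: 30 seconds (standard, the default) or 10 seconds (fast). Failure threshold: 3 (default). So detection time is ~90 seconds with standard checks or ~30 seconds with fast checks.

DNS TTL: you choose the TTL on non-alias records (the Route 53 console suggests 300 seconds). Alias records inherit the TTL of their target (60 seconds for an ELB load balancer) and cannot be changed. For active-active, use low TTL (e.g., 60 seconds) to speed failover.

Global Accelerator health check interval: 30 seconds (default) or 10 seconds. Failure threshold: 3 (default). Detection time is ~30 seconds at the 10-second interval, but traffic rerouting is sub-second after detection.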

Aurora Global Database replication lag: typically <1 second within the same continent, up to a few seconds across continents.

DynamoDB global tables eventual consistency: typically <1 second.
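As a rough worked example of how these timers combine, the sketch below multiplies the health check interval by the failure threshold and adds the TTL for DNS-based routing. The function names are hypothetical, and the result is an upper-bound estimate under the assumption that a client cached the DNS answer just before the failure.

```python
def detection_seconds(interval_s: int, failure_threshold: int) -> int:
    """Time for consecutive failed checks to mark an endpoint unhealthy."""
    return interval_s * failure_threshold


def worst_case_dns_failover(interval_s: int, failure_threshold: int, ttl_s: int) -> int:
    """Detection time plus the longest a cached DNS answer can live."""
    return detection_seconds(interval_s, failure_threshold) + ttl_s


# 10-second checks, threshold 3, TTL 60 seconds
low_ttl = worst_case_dns_failover(10, 3, 60)
# Same checks, TTL 300 seconds
high_ttl = worst_case_dns_failover(10, 3, 300)
# 30-second checks, threshold 3, TTL 60 seconds
slow_checks = worst_case_dns_failover(30, 3, 60)

print(low_ttl, high_ttl, slow_checks)
```

Lowering the TTL from 300 to 60 seconds removes four minutes from the worst case, while the detection term stays the same.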

Configuration and Verification

Route 53: Create a latency-based routing policy with multiple records (one per region). Each record points to an ALB or Global Accelerator endpoint. Attach health checks to each record. Use aws route53 list-resource-record-sets --hosted-zone-id ZONEID to verify.

Global Accelerator: Create an accelerator with two endpoint groups (one per region). Each group has endpoints (e.g., ALBs). Health checks are automatic. Use aws globalaccelerator list-accelerators to verify.

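Aurora Global Database: Create an Aurora cluster as primary, create the global cluster from it, then add a secondary region via the console or CLI: aws rds create-global-cluster --global-cluster-identifier myglobal --source-db-cluster-identifier PRIMARY_CLUSTER_ARN, followed by aws rds create-db-cluster --region eu-west-1 --db-cluster-identifier aurora-secondary --engine aurora-mysql --global-cluster-identifier myglobal. A secondary cluster takes its credentials from the primary, so no master username or password is passed, and it must use the same engine and version as the primary. Verify with aws rds describe-global-clusters.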

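DynamoDB Global Tables: Enable DynamoDB Streams (new and old images) on the table, then add a replica region. With the current version (2019.11.21), use aws dynamodb update-table --table-name mytable --replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]' and verify with aws dynamodb describe-table --table-name mytable. The legacy version (2017.11.29) instead uses aws dynamodb create-global-table --global-table-name mytable --replication-group RegionName=us-east-1 RegionName=eu-west-1 and is verified with aws dynamodb describe-global-table.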

Interactions with Related Technologies

AWS WAF & Shield: Can be integrated with ALB or CloudFront for DDoS protection. In active-active, each region's ALB should have its own WAF configuration.

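CloudFront: Can be used as a CDN in front of multiple origins (ALBs in each region). CloudFront supports origin failover through origin groups: when the primary origin returns one of the HTTP status codes you configure, or cannot be reached, CloudFront retries the request against the secondary origin.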

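VPC Peering / Transit Gateway: For inter-region communication between your own workloads, use VPC peering or Transit Gateway with inter-region peering. Managed replication (Aurora Global Database, DynamoDB global tables, S3 CRR) runs over AWS-managed infrastructure and does not require it.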

Direct Connect: For hybrid scenarios, each region may have its own Direct Connect connection to on-premises.

Common Exam Traps

Active-Active vs. Active-Passive: The exam may present a scenario with RTO of 1 minute and RPO of 5 minutes. Active-active can achieve this, but active-passive with warm standby might also work. The key is cost: active-active is more expensive due to running resources in multiple regions. If the question emphasizes cost savings, active-passive may be preferred.

DNS TTL and Failover Speed: Candidates often assume Route 53 failover is instantaneous. In reality, DNS caching can cause delays. The exam may ask how to achieve sub-second failover: the answer is Global Accelerator. Route 53 with health checks and a low TTL shortens failover but cannot guarantee sub-second behavior, because resolvers and clients cache answers until the TTL expires.

Data Consistency: Active-active with asynchronous replication can lead to conflicts. The exam tests understanding of eventual consistency and conflict resolution (e.g., DynamoDB's last-writer-wins). Aurora Global Database is also asynchronous, so for strong consistency send all writes to a single primary region (the Aurora Global Database model) or use active-passive.

Global Accelerator Pricing: It has a fixed hourly cost plus data transfer costs. The exam may ask about cost vs. performance trade-offs.

Best Practices

Use Global Accelerator for latency-sensitive applications requiring fast failover.

Implement health checks for all endpoints.

Design applications to be stateless or use a distributed session store (e.g., ElastiCache for Redis with global datastore).

Regularly test failover using Route 53 health check simulation or Global Accelerator endpoint deactivation.

Monitor replication lag using CloudWatch metrics (e.g., AuroraGlobalDBReplicationLag for Aurora Global Database, ReplicationLatency for DynamoDB global tables).

Walk-Through

1. Design Application for Statelessness

The application must not store session state locally on a single server. Instead, session data is stored in a shared, replicated data store such as ElastiCache for Redis with global datastore or DynamoDB global tables. This ensures that if a user's request is routed to a different region after failover, their session is preserved. For example, a shopping cart stored in an in-memory cache in us-east-1 must be available in eu-west-1. Use Redis global datastore to replicate data across regions with sub-second latency. Statelessness also allows any region to handle any request, enabling true active-active load distribution.
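A minimal sketch of the idea, assuming a hypothetical SharedSessionStore class that stands in for a replicated store such as a Redis global datastore or a DynamoDB global table. A plain in-memory dict plays that role here, so replication lag and network calls are deliberately not modeled.

```python
from typing import Dict, List


class SharedSessionStore:
    """Stand-in for a cross-region replicated session store (assumption)."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def get(self, session_id: str) -> dict:
        return self._data.get(session_id, {})

    def put(self, session_id: str, session: dict) -> None:
        self._data[session_id] = session


def add_to_cart(store: SharedSessionStore, session_id: str, item: str, region: str) -> List[str]:
    """Handle a request in any region without touching local server memory."""
    session = store.get(session_id)
    cart = list(session.get("cart", []))
    cart.append(item)
    store.put(session_id, {"cart": cart, "last_region": region})
    return cart


store = SharedSessionStore()
add_to_cart(store, "session-123", "book", region="us-east-1")
# After a failover, the next request for the same session lands in eu-west-1.
cart_after_failover = add_to_cart(store, "session-123", "pen", region="eu-west-1")
print(cart_after_failover)
```

Because the handler keeps nothing between calls, the region that serves the second request does not need to be the one that served the first.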

2. Configure DNS or Anycast Routing

Choose between Route 53 with latency-based routing and AWS Global Accelerator. For Route 53, create multiple A records (one per region) with the same name but different set identifiers. Attach health checks to each record. Set TTL to 60 seconds or less to minimize DNS caching. For Global Accelerator, create an accelerator with two endpoint groups. Each group contains the regional ALB as an endpoint. Global Accelerator uses anycast IPs, so the user's traffic is routed to the nearest healthy region based on network conditions, not DNS. This provides faster failover (sub-second) because it doesn't depend on DNS propagation.
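The Route 53 option can be sketched as a change batch with one latency record per region. The snippet only builds and prints the JSON document; the IP addresses (taken from documentation ranges), the health check IDs, and the latency_record helper are placeholders invented for illustration, not values from a real deployment.

```python
import json


def latency_record(name: str, region: str, ip: str, health_check_id: str, ttl: int = 60) -> dict:
    """Build one UPSERT change for a latency-based A record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{region}-endpoint",  # must differ per record
            "Region": region,                       # selects latency-based routing
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
            "HealthCheckId": health_check_id,       # placeholder ID
        },
    }


change_batch = {
    "Comment": "Latency-based records for an active-active pair (illustrative)",
    "Changes": [
        latency_record("app.example.com", "us-east-1", "192.0.2.10", "hc-us-east-1-placeholder"),
        latency_record("app.example.com", "eu-west-1", "198.51.100.10", "hc-eu-west-1-placeholder"),
    ],
}

print(json.dumps(change_batch, indent=2))
```

A document of this shape is what aws route53 change-resource-record-sets accepts through its --change-batch option. When the records point at ALBs, alias records are the usual choice instead, and those carry no TTL field.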

3. Set Up Cross-Region Data Replication

For databases, use Aurora Global Database for MySQL/PostgreSQL or DynamoDB global tables. Aurora Global Database asynchronously replicates data from the primary region to one or more secondary regions at the storage layer. The replication lag is typically under 1 second. DynamoDB global tables replicate data across regions using DynamoDB Streams; each write is eventually replicated to all replicas. Conflict resolution uses last-writer-wins. For S3, enable Cross-Region Replication (CRR) to copy objects to a bucket in another region. CRR is asynchronous: most objects replicate within 15 minutes, but it can take longer, and only S3 Replication Time Control (RTC) adds an SLA for that window. For file systems, use EFS replication.

4. Implement Health Checks and Monitoring

Create Route 53 health checks for each regional endpoint (e.g., ALB). Health checks can be HTTP/HTTPS or TCP. They run every 30 seconds by default, or every 10 seconds with fast health checks. If three consecutive checks fail, the endpoint is marked unhealthy. Route 53 then stops returning that endpoint's IP in DNS responses. For Global Accelerator, health checks are automatic: ALB and Network Load Balancer endpoints rely on the load balancer's own health checks, while EC2 instance and Elastic IP endpoints use the TCP, HTTP, or HTTPS check configured on the endpoint group. Monitor replication lag using CloudWatch metrics. Set up CloudWatch alarms to notify when lag exceeds a threshold (e.g., 5 seconds). Also monitor CPU, memory, and error rates in each region.
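The consecutive-failure rule can be illustrated with a small state tracker. This is a sketch under simplifying assumptions, not the Route 53 implementation: the HealthTracker class is hypothetical, and a single passing check restores the healthy state here, whereas a real health checker may require several consecutive passes.

```python
class HealthTracker:
    """Marks an endpoint unhealthy after N consecutive failed checks."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.healthy = True

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return the current health state."""
        if check_passed:
            # Simplification: one success resets the counter and the state.
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


tracker = HealthTracker(failure_threshold=3)
results = [True, False, False, True, False, False, False]
states = [tracker.record(r) for r in results]
print(states)
```

Two failures followed by a success do not trip the threshold; only the final run of three consecutive failures does, which is why a higher threshold directly lengthens detection time.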

5. Test Failover and Validate RTO/RPO

Simulate a regional failure by deactivating an endpoint in Global Accelerator (for example, setting the endpoint group's traffic dial to 0) or by inverting a Route 53 health check so that it reports unhealthy. Measure the time from failure to traffic rerouting. For DNS-based failover, this includes health check detection time (about 30 seconds with fast health checks, about 90 seconds at the default interval) plus DNS propagation (TTL). For Global Accelerator, detection is similar but rerouting is immediate. Verify that the application continues to function correctly and that data is consistent. Check replication lag before failover to estimate data loss (RPO). Use tools like AWS Fault Injection Simulator to automate testing. Document the failover process and run drills quarterly.
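One way to turn a drill into a pass/fail result is to compare the measured recovery time and the observed replication lag against the targets. The sketch below uses made-up timestamps and lag samples; the function names are hypothetical, and taking the maximum observed lag as the RPO estimate is an assumption, not an AWS-defined calculation.

```python
from datetime import datetime, timedelta
from typing import List


def measured_rto(failure_at: datetime, recovered_at: datetime) -> timedelta:
    """Elapsed time between the injected failure and restored service."""
    return recovered_at - failure_at


def estimated_rpo(lag_samples_s: List[float]) -> timedelta:
    """Use the worst replication lag seen before failover as the RPO estimate."""
    return timedelta(seconds=max(lag_samples_s))


def meets_targets(rto: timedelta, rpo: timedelta,
                  rto_target: timedelta, rpo_target: timedelta) -> bool:
    return rto <= rto_target and rpo <= rpo_target


# Illustrative drill data (not real measurements)
failure_at = datetime(2026, 1, 15, 12, 0, 0)
recovered_at = datetime(2026, 1, 15, 12, 0, 45)
lag_samples_s = [0.4, 0.7, 1.2]

rto = measured_rto(failure_at, recovered_at)
rpo = estimated_rpo(lag_samples_s)
drill_passed = meets_targets(rto, rpo, timedelta(minutes=1), timedelta(seconds=5))
print(rto.total_seconds(), rpo.total_seconds(), drill_passed)
```

Recording these numbers for every drill makes it easier to notice when detection time or replication lag drifts toward the target.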

What This Looks Like on the Job

Enterprise Scenario 1: Global E-Commerce Platform

A large e-commerce company operates in North America and Europe. They need low latency for all users and must survive a regional outage without losing sales. They deploy an active-active architecture using AWS Global Accelerator with two regions: us-east-1 and eu-west-1. Each region has a full stack: ALB, EC2 Auto Scaling group, Aurora Global Database, and ElastiCache for Redis global datastore. Route 53 is used for DNS resolution, but Global Accelerator handles the actual traffic routing. During a DDoS attack on us-east-1, Global Accelerator automatically routes all traffic to eu-west-1 within 30 seconds. The application remains available, though latency increases for North American users. The company achieves RTO of 30 seconds and RPO of under 1 second. They monitor replication lag using CloudWatch and have automated runbooks to promote a secondary Aurora region to primary if needed. The main challenge is cost: running two full stacks doubles infrastructure spend. They optimize by using Reserved Instances and Savings Plans.

Enterprise Scenario 2: Financial Services with Strict Compliance

A bank requires RTO of 1 minute and RPO of 5 seconds across two regions in the same country (us-east-1 and us-west-2). They use active-active with Aurora Global Database for their transaction database. However, certain critical data cannot tolerate stale reads. Writes go to a primary Aurora cluster, and a secondary cluster is kept up to date using Aurora Global Database (asynchronous, but typically <1 second). For data that needs strict consistency, they keep a single write region in DynamoDB global tables and issue strongly consistent reads against that region's replica, because a strongly consistent read only reflects writes made in the same region (at the cost of higher latency for users far from it). They deploy Global Accelerator for routing. The bank's compliance team requires regular failover testing. They use AWS Fault Injection Simulator to simulate a full region outage. One common misconfiguration is setting health check thresholds too high (e.g., 10 consecutive failures), which delays failover. They now use the default of 3 and low TTL of 60 seconds. They also discovered that, for ALB endpoints, Global Accelerator relies on the ALB's target group health checks, so those checks must hit the application's health endpoint rather than a default path.

What Goes Wrong When Misconfigured

Data Conflicts in DynamoDB Global Tables: If two regions write to the same item simultaneously, last-writer-wins may cause data loss. The application must handle conflicts or design to avoid concurrent writes to the same key.

DNS Causing Traffic to Hit Dead Region: If TTL is set too high (e.g., 300 seconds), users may continue to reach a failed region for up to 5 minutes, causing errors. Always use low TTL (60 seconds or less) for active-active.

Replication Lag Exceeding RPO: During peak traffic, replication lag may spike. Without monitoring, a failover could lose minutes of data. Set CloudWatch alarms on replication lag and consider throttling writes if lag exceeds threshold.

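Global Accelerator Cost Overruns: On top of the fixed hourly fee, Global Accelerator charges a per-gigabyte premium for data transfer in addition to standard data transfer charges, so high-volume traffic can become expensive. Ensure that data transfer pricing is factored into the budget.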
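The DynamoDB conflict case in the list above can be made concrete with a toy model of last-writer-wins. The Write type and the timestamps are hypothetical, and the sketch compares whole-item writes by timestamp only; it does not reproduce DynamoDB's internal mechanics.

```python
from dataclasses import dataclass


@dataclass
class Write:
    region: str
    timestamp: float  # seconds; assumed comparable across regions
    value: dict       # full item as written by that region


def last_writer_wins(a: Write, b: Write) -> Write:
    """Keep the write with the later timestamp and discard the other."""
    return a if a.timestamp >= b.timestamp else b


# Two regions update the same cart item 50 ms apart.
us_write = Write("us-east-1", 100.000, {"cart_id": "c1", "items": ["book"]})
eu_write = Write("eu-west-1", 100.050, {"cart_id": "c1", "items": ["pen"]})

winner = last_writer_wins(us_write, eu_write)
print(winner.region, winner.value["items"])
```

The later write replaces the earlier one entirely, so the item added in us-east-1 disappears. Keying data so that a given item is only written from one region avoids this class of loss.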

How SAA-C03 Actually Tests This

What SAA-C03 Tests on This Topic (Objective 2.6)

The exam tests your ability to design a multi-site active-active DR solution that meets specific RTO and RPO requirements. You must know the differences between Route 53 and Global Accelerator for routing, the replication mechanisms of Aurora Global Database and DynamoDB global tables, and the trade-offs of active-active vs. active-passive. Expect scenario-based questions where you need to choose the correct combination of services.

Common Wrong Answers and Why Candidates Choose Them

1. 'Use Route 53 with failover routing policy for sub-second failover' – Wrong because DNS failover depends on TTL and health check intervals; it cannot achieve sub-second failover. The correct answer is Global Accelerator; lowering the Route 53 TTL shortens failover but never makes it sub-second.

2. 'Active-active is always better than active-passive' – Wrong because active-active is more expensive and introduces data consistency challenges. The exam may present a scenario with low budget where active-passive (pilot light) is appropriate.

3. 'DynamoDB global tables provide strong consistency across regions' – Wrong; by default they are eventually consistent across regions. Aurora Global Database is also asynchronous (typical lag under 1 second), so when a question demands strong consistency, send the reads and writes that need it to a single write region (for example, the Aurora Global Database primary) or choose active-passive.

4. 'S3 Cross-Region Replication is synchronous' – Wrong; it's asynchronous. S3 Same-Region Replication is asynchronous too, so neither option is synchronous.

Specific Numbers and Terms That Appear on the Exam

RTO < 1 minute, RPO < 5 seconds – This is a typical requirement that points to active-active with Aurora Global Database or DynamoDB global tables.

TTL values – 60 seconds or 300 seconds. The exam may ask how to reduce failover time: reduce TTL.

Health check failure threshold – Default 3. The exam may ask the impact of increasing it to 5: slower failover.

Global Accelerator – Know that it uses anycast and provides faster failover than DNS.

Aurora Global Database – Supports up to 5 secondary regions. Replication lag is typically <1 second.

DynamoDB Global Tables – Use last-writer-wins. Replication is eventually consistent.

Edge Cases and Exceptions

Cross-region replication for S3 – Only objects written after replication is enabled are replicated automatically; existing objects are not. To replicate existing objects, use S3 Batch Replication (an S3 Batch Operations job).

Aurora Global Database write forwarding – Secondary regions can forward writes to the primary, but this adds latency. The exam may test that writes are always forwarded to the primary.

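Global Accelerator traffic dials and endpoint weights – A traffic dial sets the percentage of traffic accepted by an endpoint group (a region), while endpoint weights split traffic among the endpoints inside one group, e.g., 80% to one ALB and 20% to another. Both are useful for gradual migration.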
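For the weights case, the proportion each endpoint receives within a single endpoint group is its weight divided by the sum of the weights in that group. The sketch below shows only that arithmetic; the endpoint names are placeholders, the 0 to 255 range check reflects the documented weight range, and returning zero shares when every weight is 0 is a choice made for this example.

```python
from typing import Dict


def traffic_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Return each endpoint's share of traffic within one endpoint group."""
    for name, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name} must be between 0 and 255")
    total = sum(weights.values())
    if total == 0:
        # Example-only behavior: no endpoint has a share.
        return {name: 0.0 for name in weights}
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical endpoints in the same endpoint group
shares = traffic_shares({"alb-current": 200, "alb-new": 50})
print(shares)
```

Shifting the weights step by step (for example 200/50, then 125/125, then 50/200) is one way to move traffic gradually onto a new endpoint.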

How to Eliminate Wrong Answers

If the question mentions 'sub-second failover', eliminate any answer that relies solely on DNS (Route 53). Choose Global Accelerator.

If the question emphasizes cost savings, avoid active-active and choose active-passive (e.g., pilot light).

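If the question requires strong consistency across regions, avoid DynamoDB global tables and choose a single-writer design such as Aurora Global Database, where all writes go to the primary region (reads from secondary regions can still lag slightly).

If the question requires zero RPO, remember that the cross-region options covered here (S3 CRR, DynamoDB global tables, Aurora Global Database) replicate asynchronously; synchronous replication is available within a region (e.g., Multi-AZ). If it requires near-zero RPO across regions, Aurora Global Database (typical lag under 1 second) is the usual fit.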

Key Takeaways

Active-active DR requires all sites to be actively serving traffic, providing faster failover and better resource utilization than active-passive.

For sub-second failover, use AWS Global Accelerator; for cost-sensitive scenarios, use Route 53 with low TTL (60s) and health checks.

Aurora Global Database provides asynchronous replication across regions with typical lag (and RPO) under 1 second and supports up to 5 secondary regions.

DynamoDB global tables use eventual consistency and last-writer-wins conflict resolution; not suitable for strong consistency requirements.

S3 Cross-Region Replication is asynchronous; do not rely on it for RPO less than minutes.

Stateless application design is critical for active-active; use ElastiCache global datastore or DynamoDB global tables for session state.

Regularly test failover using AWS Fault Injection Simulator or manual endpoint deactivation to validate RTO/RPO.

Monitor replication lag with CloudWatch metrics and set alarms to detect anomalies before they cause data loss.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Route 53 Latency-Based Routing

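DNS-based, subject to TTL caching (resolvers keep answers for the record's TTL, e.g., 60 or 300 seconds)

Failover time: detection (~30s with fast health checks, ~90s at the default interval) + TTL

No hourly accelerator fee; you pay Route 53 hosted zone, query, and health check charges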

Supports geolocation, weighted, and failover policies

Health checks are optional but recommended

AWS Global Accelerator

Anycast IP-based, no DNS caching

Failover time: sub-second after health check (30s detection)

Additional hourly cost + data transfer costs

Traffic steering through endpoint group traffic dials and endpoint weights

Health checks are automatic and mandatory

Watch Out for These

Mistake

Route 53 latency-based routing automatically fails over when a region becomes unhealthy.

Correct

Latency-based routing does not include health checks by default. You must attach health checks to each record. Without health checks, Route 53 will continue to return the IP of a failed region as long as it is the lowest latency.

Mistake

Active-active DR always provides zero data loss (RPO=0).

Correct

Active-active typically uses asynchronous replication, so there is always some replication lag. RPO is non-zero (e.g., <1 second). For zero RPO, you need synchronous replication, which the services in this chapter provide only within a single region (e.g., Multi-AZ). Amazon EFS replication, like S3 CRR, is asynchronous.

Mistake

Global Accelerator and Route 53 provide the same failover speed.

Correct

Global Accelerator provides sub-second failover because it operates at the network layer and updates its anycast routing tables immediately upon health check failure. Route 53 failover is delayed by DNS caching (TTL) and health check intervals, typically taking 30-60 seconds.

Mistake

DynamoDB global tables provide strongly consistent reads across all regions.

Correct

DynamoDB global tables provide eventual consistency. Reads from a replica region may not reflect the latest writes from another region. For strongly consistent reads, you must read from the region where the write was performed, or use a different database.

Mistake

S3 Cross-Region Replication is synchronous and guarantees immediate replication.

Correct

S3 CRR is asynchronous. Most objects replicate within 15 minutes, but there is no SLA on replication time unless you enable S3 Replication Time Control (RTC), which is designed to replicate 99.99% of new objects within 15 minutes. Replication can otherwise take longer depending on object size and queue length.


Frequently Asked Questions

What is the difference between active-active and active-passive DR on AWS?

Active-active DR has multiple regions actively serving traffic and processing data simultaneously. All regions are live, and traffic is distributed among them. Active-passive DR has one primary region handling traffic and a secondary region on standby (either pilot light, warm standby, or cold standby). During a failure, traffic is redirected to the secondary region. Active-active provides faster failover (seconds) and better resource utilization, but is more expensive and complex. Active-passive is cheaper but has longer RTO (minutes to hours).

How do I achieve sub-second failover for my multi-region application?

Use AWS Global Accelerator. It uses anycast IP addresses and health checks to route traffic to the nearest healthy endpoint. Once an endpoint is marked unhealthy, Global Accelerator reroutes traffic to a healthy endpoint almost immediately, without relying on DNS caching. In contrast, Route 53 DNS failover takes the health check detection time plus the TTL, typically 30-60 seconds or more. For sub-second failover, Global Accelerator is the recommended service.

Can I use DynamoDB global tables for active-active DR with strong consistency?

No, DynamoDB global tables provide eventual consistency. Reads from a replica region may not reflect the latest writes from another region. For strong consistency, you would need to read from the region where the write was performed, which defeats the purpose of active-active. For strongly consistent multi-region databases, consider Aurora Global Database (which is asynchronous but typically <1 second lag) or a custom solution with synchronous replication.

What is the RPO of Aurora Global Database?

Aurora Global Database typically has an RPO of less than 1 second. It uses asynchronous replication from the primary region to secondary regions. The actual RPO depends on network latency and write workload. In practice, replication lag is usually under 1 second for same-continent regions and a few seconds for cross-continent. You can monitor lag using the AuroraGlobalDBReplicationLag CloudWatch metric.

How do I handle session state in an active-active multi-region architecture?

Session state must be stored in a shared, replicated data store. Use ElastiCache for Redis with global datastore to replicate session data across regions with sub-second latency. Alternatively, use DynamoDB global tables for session storage. The application must be stateless, meaning it does not rely on local server memory for sessions. This allows any region to serve any user, even after failover.

Is S3 Cross-Region Replication suitable for active-active DR?

S3 CRR is asynchronous; most objects replicate within 15 minutes, but delays can be longer. It is not suitable for active-active DR where RPO must be seconds. If you need a predictable replication window, enable S3 Replication Time Control (RTC). However, for most active-active scenarios, you would use a database with built-in replication (Aurora Global Database or DynamoDB global tables) rather than S3 CRR.

What is the cost implication of active-active vs active-passive?

Active-active requires running full infrastructure in multiple regions simultaneously, doubling compute, storage, and data transfer costs. Active-passive (e.g., pilot light) only runs minimal resources in the secondary region (e.g., database replicas without compute), significantly reducing cost. For example, active-active might cost $10,000/month per region, while pilot light might cost $2,000/month for the secondary region. The trade-off is faster failover and better resource utilization.
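The example figures above work out as follows. The dollar amounts are the illustrative numbers from the answer, not real pricing, and the two-region layout is an assumption.

```python
# Illustrative monthly figures from the example above (not real pricing)
ACTIVE_REGION_MONTHLY = 10_000          # one fully active region
PILOT_LIGHT_SECONDARY_MONTHLY = 2_000   # minimal standby footprint

# Two regions, both fully active
active_active_total = 2 * ACTIVE_REGION_MONTHLY

# One active region plus a pilot light secondary
pilot_light_total = ACTIVE_REGION_MONTHLY + PILOT_LIGHT_SECONDARY_MONTHLY

# Extra monthly spend for choosing active-active
active_active_premium = active_active_total - pilot_light_total

print(active_active_total, pilot_light_total, active_active_premium)
```

With these numbers the faster failover costs an extra $8,000 per month, which is the kind of trade-off a cost-focused exam scenario expects you to weigh.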
