AZ-305 | Chapter 25 of 103 | Objective 3.2

Disaster Recovery Patterns: Active-Active vs Active-Passive

This chapter covers two disaster recovery patterns in Azure: Active-Active and Active-Passive. Understanding them is critical for the AZ-305 exam, where the business continuity domain (which includes Business Continuity and Disaster Recovery, BCDR) carries roughly 15-20% of the exam weight. You will learn the mechanisms, trade-offs, and configuration details for each pattern, including when to use Azure Traffic Manager, Azure Front Door, or Azure Site Recovery. The exam expects you to select the appropriate pattern based on RPO, RTO, cost, and complexity requirements.

25 min read
Intermediate
Updated May 31, 2026

Two-Bridge Traffic Model for Active-Active vs Active-Passive

Imagine a city with two bridges connecting the mainland to an island. In an Active-Active design, both bridges are open and carry traffic simultaneously. Cars are directed to the least congested bridge using real-time sensors and traffic lights. If one bridge collapses, the other instantly takes all traffic, though with some delay as cars reroute. In an Active-Passive design, only one bridge is open; the other is kept closed but fully maintained. A standby crew monitors the active bridge. If it fails, they open the closed bridge, but this takes several minutes to verify the failure, redirect traffic, and manually raise the barrier. The passive bridge never shares load during normal operation, so the active bridge may become congested. This mirrors how Active-Active distributes load across multiple regions, while Active-Passive keeps a standby region idle until failover is triggered.

How It Actually Works

What Are Active-Active and Active-Passive Patterns?

Disaster recovery (DR) patterns define how applications and data are replicated and accessed across multiple Azure regions to ensure availability during a regional outage. The two primary patterns are Active-Active and Active-Passive.

Active-Active: Multiple regions simultaneously serve traffic. Each region runs a fully functional copy of the application and processes user requests. Load is distributed across regions using a global load balancer like Azure Traffic Manager or Azure Front Door. Data is replicated asynchronously or synchronously, depending on consistency needs. If one region fails, traffic is redirected to the remaining regions with minimal disruption.

Active-Passive: Only one region (active) serves traffic. The other region (passive) runs a scaled-down or idle copy of the application and data, and is not used until a failover event occurs. Failover is typically orchestrated with Azure Site Recovery (ASR) or with custom scripts, and it can be started manually or by automation. The passive region may have compute resources stopped or running at minimal capacity to save cost.

Why Both Patterns Exist

The choice between patterns depends on business requirements:
- RTO (Recovery Time Objective): How quickly must the application be available after a disaster?
- RPO (Recovery Point Objective): How much data loss is acceptable?
- Cost: Active-Active typically costs more because both regions run at full capacity. Active-Passive can reduce costs by running the passive region at minimal capacity.
- Complexity: Active-Active requires careful handling of data consistency, session state, and geo-distributed traffic routing. Active-Passive is simpler to implement but may have longer failover times.

How Active-Active Works Internally

In an Active-Active deployment, traffic is distributed across multiple Azure regions using a global load balancer. The load balancer uses health probes to detect region health and routes traffic away from unhealthy regions.

1. Traffic Routing: Azure Traffic Manager uses DNS-based routing policies (e.g., Performance, Priority, Weighted) to direct users to the closest or healthiest region. Azure Front Door uses HTTP/HTTPS-based routing with anycast and split TCP for faster failover.

2. Data Replication: Data must be replicated between regions to ensure consistency. Common strategies:
- Azure SQL Database Active Geo-Replication: Asynchronously replicates committed transactions to up to four readable secondary databases in different regions. RPO is typically <5 seconds.
- Azure Cosmos DB Multi-region Writes: Each region can accept writes; conflicts are resolved using configurable conflict resolution policies. Replication between regions is asynchronous, so RPO is near-zero rather than guaranteed zero: a regional outage can lose writes that have not yet replicated, and conflict resolution can discard the losing write.
- Azure Storage Geo-Redundant Storage (GRS): Asynchronously replicates data to a paired region. RPO is typically <15 minutes.

3. Session State: For stateful applications, session state must be stored in a shared data store accessible from all regions, such as Azure Cache for Redis or Azure Cosmos DB. Session affinity (sticky sessions) can keep a user on the same region for the duration of a session, but it is a feature of Azure Front Door (cookie-based). Traffic Manager works at the DNS level, never sees HTTP requests, and cannot provide it.

4. Failover: When a region becomes unhealthy, the load balancer detects the failure via health probes and stops sending traffic to that region. Users experience a brief interruption (seconds to minutes) while DNS TTL expires or the load balancer updates routing. The remaining regions continue serving traffic.
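The failover delay described in step 4 can be approximated for a DNS-based balancer such as Traffic Manager as probe detection time plus the DNS TTL. The following is a minimal sketch under that assumption; the function name and the input values are illustrative, and it ignores probe timeouts and resolver behavior.

```python
def estimate_dns_failover_seconds(probe_interval: int, tolerated_failures: int, dns_ttl: int) -> int:
    """Rough worst-case time before clients are sent to a healthy region."""
    # An endpoint is marked unhealthy after one more failed probe than is tolerated.
    detection = probe_interval * (tolerated_failures + 1)
    # Clients may keep using a cached DNS answer for up to one full TTL.
    return detection + dns_ttl


# Illustrative inputs: fast probing with a low TTL versus slower probing with a high TTL.
fast = estimate_dns_failover_seconds(probe_interval=10, tolerated_failures=3, dns_ttl=30)
slow = estimate_dns_failover_seconds(probe_interval=30, tolerated_failures=3, dns_ttl=300)
print(fast, slow)  # 70 420
```

The comparison shows why both the probe settings and the TTL matter: lowering only one of them leaves the other as the dominant delay.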

How Active-Passive Works Internally

In an Active-Passive deployment, the passive region does not serve traffic until a failover is triggered. The passive region may have:
- Compute: Stopped VMs or scaled-down App Service plans to save cost.
- Data: Replicated asynchronously using Azure Site Recovery (for VMs) or database replication (for PaaS).

1. Replication: Azure Site Recovery continuously replicates Azure VMs from the primary region to the secondary region. Replication is asynchronous with RPO as low as 30 seconds for Azure-to-Azure replication. For SQL databases, Active Geo-Replication or failover groups are used.

2. Health Monitoring: Azure Site Recovery monitors replication health for the protected VMs, but it does not start a failover on its own. To detect an outage and trigger failover, use health probes with Traffic Manager or Front Door, or Azure Monitor alerts that start an automation runbook.

3. Failover Process:
- Test Failover: Non-disruptive drill that creates a recovery environment without impacting production.
- Planned Failover: Zero data loss; used for maintenance. Shuts down primary VMs before replicating final changes.
- Unplanned Failover: Triggered by disaster. Primary VMs may be lost, so RPO is based on last successful replication.
- Failback: After the primary region recovers, data is replicated back and traffic is switched back.

4. Traffic Switch: After failover, the passive region becomes active. Traffic is redirected using DNS changes (e.g., update Traffic Manager endpoint priority or Azure DNS alias record). This can take minutes due to DNS propagation.
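The difference between planned and unplanned failover in step 3 comes down to how much was written after the last replicated recovery point. A small sketch of that reasoning follows; the function name and timestamps are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta


def data_loss_window(kind: str, last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Return the span of writes that may be lost for a given failover kind."""
    if kind == "planned":
        # The primary is shut down and final changes are replicated before switching.
        return timedelta(0)
    if kind == "unplanned":
        # Anything written after the last replicated point may be gone.
        return max(failure_time - last_recovery_point, timedelta(0))
    raise ValueError(f"unknown failover kind: {kind}")


failure = datetime(2026, 1, 1, 12, 0, 0)
last_point = failure - timedelta(seconds=30)  # illustrative replication lag
print(data_loss_window("planned", last_point, failure))    # 0:00:00
print(data_loss_window("unplanned", last_point, failure))  # 0:00:30
```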

Key Components, Values, Defaults, and Timers

Azure Traffic Manager:

DNS TTL: Default 300 seconds (5 minutes). Can be set as low as 0 (not recommended) for faster failover.

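Health Probe Interval: 30 seconds (default), or 10 seconds with fast probing. The probe timeout defaults to 10 seconds at the 30-second interval; with fast probing it can be set between 5 and 9 seconds.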

Tolerated Number of Failures: 3 consecutive failures (default) before marking endpoint as degraded.

Routing Methods: Performance (lowest-latency region), Priority (active-passive), Weighted (traffic split in proportion to endpoint weights; equal weights give an even distribution), Geographic, Subnet, MultiValue.
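To make the Priority and Weighted methods concrete, here is a simplified sketch of how a balancer could pick an endpoint. It is a model for study purposes, not Traffic Manager's implementation: the `Endpoint` class and both functions are hypothetical, and real behavior when no endpoint is healthy differs from the `None` returned here.

```python
import random
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    priority: int
    weight: int
    healthy: bool = True


def pick_priority(endpoints):
    """Priority routing: lowest priority number among healthy endpoints."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority) if healthy else None


def pick_weighted(endpoints, rng):
    """Weighted routing: healthy endpoints chosen in proportion to their weight."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return rng.choices(healthy, weights=[e.weight for e in healthy], k=1)[0]


endpoints = [
    Endpoint("eastus", priority=1, weight=80),
    Endpoint("westeurope", priority=2, weight=20),
]
before = pick_priority(endpoints).name
endpoints[0].healthy = False  # simulate a regional outage
after = pick_priority(endpoints).name
print(before, after)  # eastus westeurope
```

With Priority routing the secondary receives traffic only after the primary is marked unhealthy, which is why it maps naturally to Active-Passive.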

Azure Front Door:

Health Probe Interval: 30 seconds (default).

Health Probe Path: / (root) by default.

Load Balancing Method: Latency-sensitive (lowest latency) or Priority (active-passive).

Session Affinity: Enabled using a cookie (Front Door sets a cookie to route subsequent requests to the same backend).

Azure Site Recovery (ASR):

Replication Policy: Default RPO threshold is 15 minutes. If exceeded, alert is raised.

Recovery Point Retention: 24 hours by default (up to 72 hours).

Replication Frequency: 30 seconds for Azure-to-Azure replication.

Failover Time: Typically 10-30 minutes for VMs, depending on size and network speed.

Azure SQL Database Failover Groups:

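Grace Period (Data Loss): 1 hour (default). With the automatic (Microsoft-managed) failover policy, a forced failover that may lose data is not initiated until the grace period has elapsed, which leaves time for manual intervention. The grace period delays failover (an RTO concern); it is not the RPO.

Read/Write Listener Endpoint: <failover-group-name>.database.windows.net always points to the current primary.

Read-Only Listener Endpoint: <failover-group-name>.secondary.database.windows.net points to the secondary.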

Configuration and Verification Commands

Azure Traffic Manager (using Azure CLI):

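# Create a Traffic Manager profile with Performance routing
# --unique-dns-name is required; "mytmprofile-demo" is a placeholder and must be globally unique
az network traffic-manager profile create \
  --name myTMProfile \
  --resource-group myRG \
  --routing-method Performance \
  --unique-dns-name mytmprofile-demo \
  --ttl 30 \
  --protocol HTTP \
  --port 80 \
  --path "/health"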

# Add endpoints
az network traffic-manager endpoint create \
  --name endpoint-eastus \
  --profile-name myTMProfile \
  --resource-group myRG \
  --type azureEndpoints \
  --target-resource-id /subscriptions/.../publicIPAddresses/eastus-pip

Azure Front Door (using Azure CLI):

# Create a Front Door profile
az afd profile create --profile-name myAFDProfile --resource-group myRG --sku Standard_AzureFrontDoor

# Add an endpoint (an origin group, origins, and a route must also be created before traffic reaches the backends)
az afd endpoint create --resource-group myRG --profile-name myAFDProfile --endpoint-name myEndpoint --enabled-state Enabled

Azure Site Recovery (using PowerShell):

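# Enable Azure-to-Azure replication for a VM. Assumes the vault context is already set and that
# $mapping (protection container mapping), $diskConfigs, and $recoveryRg were created in earlier steps.
$vm = Get-AzVM -ResourceGroupName "myRG" -Name "myVM"
New-AzRecoveryServicesAsrReplicationProtectedItem -AzureToAzure -AzureVmId $vm.Id `
  -Name (New-Guid).Guid -ProtectionContainerMapping $mapping `
  -AzureToAzureDiskReplicationConfiguration $diskConfigs -RecoveryResourceGroupId $recoveryRg.ResourceId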

Interaction with Related Technologies

Azure DNS: Used for custom domain names. Traffic Manager uses DNS to route traffic. DNS TTL affects failover speed.

Azure Load Balancer: Regional load balancer (Layer 4) distributes traffic within a region. It does not replace global load balancers.

Azure Application Gateway: Regional Layer 7 load balancer with SSL termination and WAF. Can be used behind Traffic Manager or Front Door.

Azure Monitor: Monitors health and performance. Alerts can trigger automation runbooks to initiate failover.

Azure Automation: Can run scripts to automate failover steps (e.g., change Traffic Manager endpoint priority, start VMs in passive region).

Exam-First Perspective

The AZ-305 exam focuses on selecting the right pattern based on given requirements. Key differentiators:
- Active-Active: Required when RTO < 1 minute and RPO < 5 seconds, and the application is stateless or can handle eventual consistency.
- Active-Passive: Used when RTO is minutes to hours and cost savings are important. Often used for legacy applications that cannot handle multi-region writes.

The exam may present scenarios with specific RPO/RTO and ask for the appropriate Azure service (e.g., Traffic Manager vs. Front Door vs. Site Recovery).

Walk-Through

1. Assess RPO and RTO Requirements

Start by determining the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for the application. For Active-Active, RPO is typically near-zero (seconds) and RTO is under a minute because traffic is immediately redirected. For Active-Passive, RPO may be minutes (due to async replication) and RTO can be 10-30 minutes (time to failover and update DNS). The exam expects you to match these values to the pattern: RPO < 5 seconds and RTO < 1 minute → Active-Active; RPO 15 minutes and RTO 30 minutes → Active-Passive.
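The matching rule above can be written as a tiny decision helper. This is only a sketch of the rule of thumb: the function name is hypothetical, the thresholds are the ones quoted in the text, and treating either strict requirement as sufficient for Active-Active is a conservative assumption of this example. Real designs also weigh cost and complexity.

```python
def recommend_pattern(rpo_seconds: float, rto_seconds: float) -> str:
    """Map required RPO/RTO to a DR pattern using the chapter's thresholds."""
    # Assumption: a sub-5-second RPO or a sub-minute RTO requirement
    # is treated as needing Active-Active.
    if rpo_seconds < 5 or rto_seconds < 60:
        return "Active-Active"
    return "Active-Passive"


print(recommend_pattern(rpo_seconds=4, rto_seconds=30))         # Active-Active
print(recommend_pattern(rpo_seconds=15 * 60, rto_seconds=30 * 60))  # Active-Passive
```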

2. Design Data Replication Strategy

Choose a replication method that meets RPO and consistency needs. For Active-Active, use Azure SQL Database Active Geo-Replication (RPO < 5 seconds) or Cosmos DB multi-region writes (near-zero RPO; replication between regions is asynchronous, and conflict resolution may discard writes). For Active-Passive, use Azure Site Recovery for VMs (RPO as low as 30 seconds) or SQL Database failover groups (RPO 0 for planned failover; for unplanned failover, data loss is bounded by geo-replication lag, while the default 1-hour grace period delays automatic failover rather than setting the RPO). Cross-region replication is normally asynchronous so that it does not add latency to writes in the primary region.

3. Implement Global Load Balancing

Deploy Azure Traffic Manager or Azure Front Door to route traffic across regions. For Active-Active, use Performance routing (Traffic Manager) or latency-based routing (Front Door). For Active-Passive, use Priority routing: primary region has priority 1, secondary has priority 2. Set health probes to monitor endpoints. Configure low TTL (e.g., 30 seconds) for faster failover. For Front Door, enable session affinity if needed.

4. Configure Health Monitoring and Alerts

Set up health probes on the application endpoints (e.g., /health). For Traffic Manager, the probe interval is 30 seconds by default (10 seconds with fast probing) and the default tolerated number of failures is 3. For Front Door, the interval is 30 seconds, and health is judged from a sample of recent probes (sample size and successful samples required) rather than a tolerated-failure count. Configure Azure Monitor alerts on probe failures to trigger automated failover actions (e.g., via Azure Automation runbook). For Active-Passive, also monitor replication health (e.g., ASR replication status) to ensure data is up-to-date.
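A probe is only as useful as the endpoint it hits. The sketch below shows one way the logic behind a /health endpoint could aggregate dependency checks into a status code that a probe treats as healthy (200) or unhealthy (503). The function and check names are hypothetical, and wiring it into a web framework is left out.

```python
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run each dependency check and return (HTTP status, per-dependency result)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = "failed"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


def broken_cache() -> bool:
    # Stand-in for a dependency that cannot be reached.
    raise ConnectionError("cache unreachable")


status, detail = evaluate_health({"database": lambda: True, "cache": broken_cache})
print(status, detail)  # 503 {'database': 'ok', 'cache': 'failed'}
```

Returning a non-200 status when a critical dependency is down lets the global load balancer move traffic away even though the web server itself is still responding.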

5. Test Failover and Validate Recovery

Perform regular test failovers (disaster recovery drills) to ensure the pattern works as expected. For Active-Active, test by disabling one region's endpoints in the load balancer and verify traffic shifts to other regions. For Active-Passive, use Azure Site Recovery's test failover feature to spin up VMs in the passive region without impacting production. Validate application functionality, data consistency, and recovery time. Document the failover process and update runbooks.

What This Looks Like on the Job

Enterprise Scenario 1: Global E-Commerce Platform

A multinational e-commerce company deploys its customer-facing website in Active-Active across East US and West Europe. They use Azure Front Door for global load balancing with latency-based routing. The application is stateless; session state is stored in Azure Cache for Redis with geo-replication. Product catalog data is stored in Azure Cosmos DB with multi-region writes, using last-writer-wins conflict resolution. Traffic is split roughly 60% to East US and 40% to West Europe, and Front Door adjusts the split automatically based on latency. During a regional outage in East US, Front Door detects the failure within 30 seconds and routes all traffic to West Europe. Users experience a brief interruption (less than 1 minute) while health probes mark the East US origin unhealthy; because Front Door routes at the edge, no DNS change is needed. The RPO for catalog updates is near-zero because Cosmos DB replicates writes asynchronously within seconds. This pattern ensures high availability with minimal data loss.

Enterprise Scenario 2: Financial Services Application

A bank runs a core banking application in Active-Passive across Canada Central (active) and Canada East (passive). The application is stateful and requires zero data loss for transactions during planned failover. They use Azure SQL Database failover groups (grace period set to 1 hour) and initiate failover manually. The passive region has a smaller SQL database tier (S2 vs. S4 in primary) to reduce cost, which must be scaled up after failover to carry the production load. Compute is provided by Azure VMs replicated via Azure Site Recovery with 15-minute RPO. During planned maintenance, the bank performs a planned failover: they stop the application, replicate final transactions (RPO=0), then switch the DNS record to the passive region. The failover takes 10 minutes. For unplanned disasters, Azure Monitor triggers an alert, and the operations team manually initiates failover via the Azure portal. The RTO is 30 minutes. This pattern saves cost by not running full capacity in the passive region.

Common Misconfigurations

Incorrect TTL Settings: Setting DNS TTL too high (e.g., 300 seconds) delays failover. Many engineers forget to lower TTL for critical applications.

Health Probe Path Not Monitored: Using root path ("/") may not reflect application health. A dedicated health endpoint (e.g., /api/health) that checks dependencies is better.

Ignoring Session State: In Active-Active, if session state is stored locally (e.g., in-memory), users lose session on failover. Always use a shared store like Redis.

Overlooking Data Consistency: Active-Active with multi-region writes can lead to conflicts. Without proper conflict resolution, data can be lost. Cosmos DB provides last-writer-wins, but custom logic may be needed.

Not Testing Failover: Many companies configure DR but never test it. During an actual disaster, they discover missing dependencies or misconfigured replication.
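The session-state pitfall in the list above is easy to see in a toy model. In this sketch a plain dictionary stands in for a replicated cache, and the `Region` class is hypothetical; it only shows why a session created in one region is invisible in another unless both read the same store.

```python
class Region:
    def __init__(self, name, shared_store=None):
        self.name = name
        # With no shared store, sessions live only in this region's memory.
        self.sessions = shared_store if shared_store is not None else {}

    def login(self, session_id, user):
        self.sessions[session_id] = user

    def whoami(self, session_id):
        return self.sessions.get(session_id)


# Local in-memory sessions: the session is lost when traffic moves regions.
east, west = Region("eastus"), Region("westeurope")
east.login("s1", "alice")
local_result = west.whoami("s1")

# Shared store: both regions see the same session after failover.
store = {}
east, west = Region("eastus", store), Region("westeurope", store)
east.login("s1", "alice")
shared_result = west.whoami("s1")

print(local_result, shared_result)  # None alice
```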

How AZ-305 Actually Tests This

What AZ-305 Tests on This Topic

The AZ-305 exam objective 3.2 covers "Design a disaster recovery strategy" and specifically tests:

Selecting between Active-Active and Active-Passive based on RPO, RTO, cost, and complexity.

Choosing the appropriate Azure service: Traffic Manager, Front Door, Site Recovery, Azure SQL failover groups, Cosmos DB multi-region writes.

Understanding replication types: synchronous vs. asynchronous, geo-replication vs. local redundancy.

Configuring health probes, failover policies, and DNS TTL.

Common Wrong Answers and Why Candidates Choose Them

1. Choosing Active-Passive when RTO < 5 minutes: Candidates often think Active-Passive is cheaper and sufficient, but failover time (DNS propagation, VM start) typically exceeds 5 minutes. Active-Active is required for sub-minute RTO.

2. Selecting Azure Load Balancer instead of Traffic Manager: Azure Load Balancer is a regional service; it cannot route traffic across regions. Candidates confuse it with Traffic Manager, which is global. The exam tests this distinction.

3. Assuming synchronous replication for Active-Active: Synchronous replication across regions introduces latency and is rarely used. The exam expects you to know that Active-Active uses asynchronous replication with eventual consistency.

4. Using Azure Site Recovery for stateless apps: ASR is designed for VM replication; for stateless apps, a simpler Active-Active with Traffic Manager is more appropriate. Candidates may overcomplicate the solution.

Specific Numbers and Terms on the Exam

RPO: Active-Active: <5 seconds (SQL geo-replication), near-zero (Cosmos DB multi-region writes, asynchronous between regions). Active-Passive: 30 seconds (ASR), 15 minutes (ASR default threshold). The 1-hour SQL failover group grace period is a delay before automatic failover, not an RPO.

RTO: Active-Active: <1 minute (with low TTL and health probes). Active-Passive: 10-30 minutes (ASR failover), minutes to hours (manual DNS update).

Health Probe Defaults: Traffic Manager: interval 30s (10s with fast probing), tolerated failures 3. Front Door: interval 30s, health evaluated over a sample of recent probes.

DNS TTL: Default 300s; recommended 30s for fast failover.

Edge Cases and Exceptions

Planned vs. Unplanned Failover: Planned failover (e.g., for maintenance) can achieve zero data loss by shutting down the primary. Unplanned failover may lose data up to the last replication.

Failback: After disaster, you must replicate data back to the original region and switch traffic. This is often overlooked in design.

Multi-region writes with conflict resolution: Cosmos DB offers a last-writer-wins policy and a custom policy (a merge stored procedure, or manual resolution from the conflicts feed). The exam may test which to use based on application needs.
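Last-writer-wins is simple enough to sketch in a few lines, and the sketch makes the data-loss risk visible: one of the two conflicting writes is silently dropped. The `Write` type and its logical timestamp are hypothetical stand-ins, not the Cosmos DB SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    value: str
    timestamp: int  # hypothetical logical timestamp; higher means later


def last_writer_wins(a: Write, b: Write) -> Write:
    """Keep the write with the higher timestamp; the other is discarded."""
    return a if a.timestamp >= b.timestamp else b


east = Write("eastus", "price=10", timestamp=100)
west = Write("westeurope", "price=12", timestamp=101)
winner = last_writer_wins(east, west)
print(winner.value)  # price=12
```

If losing the East US update is unacceptable for the application, a custom policy that merges both writes is the option to consider.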

How to Eliminate Wrong Answers

If the question mentions "cost savings" and "RTO of 1 hour", eliminate Active-Active (more expensive, lower RTO).

If the question says "global load balancing" and "multiple active regions", eliminate Azure Load Balancer and Application Gateway (regional).

If the question says "zero data loss" and "planned failover", consider SQL failover groups or ASR with planned failover.

If the question says "session affinity" and "global", Front Door supports it natively; Traffic Manager requires external session store.

Key Takeaways

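Active-Active provides RTO <1 minute and RPO <5 seconds; Active-Passive provides RTO 10-30 minutes and RPO from about 30 seconds up to 15 minutes with ASR, or seconds with SQL geo-replication (the failover group's 1-hour grace period delays automatic failover and is not an RPO).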

For Active-Active, use Azure Traffic Manager (DNS-based) or Azure Front Door (HTTP/HTTPS) for global load balancing.

For Active-Passive, use Azure Site Recovery for VM replication and SQL Database failover groups for PaaS databases.

Health probe defaults: Traffic Manager interval 30s (10s with fast probing), tolerated failures 3; Front Door interval 30s, health evaluated over a sample of recent probes.

DNS TTL should be set low (e.g., 30 seconds) for fast failover in Active-Passive.

Session state must be stored in a shared data store (e.g., Azure Cache for Redis) for Active-Active.

Planned failover can achieve zero data loss; unplanned failover risks data loss up to the last replication.

Cost vs. RTO/RPO trade-off: Active-Active costs more but meets stringent recovery objectives.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Active-Active

Both regions serve traffic simultaneously.

RPO typically <5 seconds (SQL geo-replication) or near-zero (Cosmos DB multi-region writes).

RTO <1 minute (health probe detection + DNS TTL).

Higher cost (both regions at full capacity).

Requires stateless or eventually consistent application design.

Active-Passive

Only one region serves traffic at a time.

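RPO typically 30 seconds (ASR) to 15 minutes (ASR default threshold). SQL failover groups replicate asynchronously with an RPO of seconds; their 1-hour grace period affects failover time, not data loss.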

RTO 10-30 minutes (ASR failover) to hours (manual DNS update).

Lower cost (passive region can run minimal capacity).

Simpler implementation; suitable for stateful apps with strict consistency.

Watch Out for These

Mistake

Active-Active always means all regions are actively serving traffic at full capacity.

Correct

Active-Active can have regions running at different capacities. For example, one region may handle 80% of traffic, another 20%. The key is that both are active and can serve traffic, not that they are equally loaded.

Mistake

Active-Passive provides the same RTO as Active-Active because Azure Site Recovery can failover in seconds.

Correct

ASR failover time is typically 10-30 minutes for VMs, plus DNS propagation. Active-Active RTO is under 1 minute because traffic is redirected without waiting for compute to start.

Mistake

Synchronous replication is required for Active-Active to ensure data consistency.

Correct

Synchronous replication across regions introduces significant latency (hundreds of milliseconds) and is rarely used. Active-Active relies on asynchronous replication with eventual consistency. Applications must tolerate brief data divergence.

Mistake

Azure Traffic Manager and Azure Front Door are interchangeable for all scenarios.

Correct

Traffic Manager is a DNS-based load balancer: it only answers DNS queries and never sees application traffic, so it offers no SSL termination or WAF. Front Door is a Layer 7 service with SSL offload, WAF, and URL-based routing. For HTTP/HTTPS applications requiring advanced features, Front Door is preferred.

Mistake

You can achieve zero data loss with Active-Passive unplanned failover.

Correct

Unplanned failover with asynchronous replication always risks data loss up to the last replication. Only planned failover (with primary shutdown) can achieve zero data loss.


Frequently Asked Questions

What is the difference between Active-Active and Active-Passive in Azure?

Active-Active means multiple Azure regions serve traffic simultaneously, distributing load via a global load balancer. RPO is near-zero and RTO under a minute. Active-Passive means one region is active, the other is on standby; failover is triggered manually or automatically, with RPO typically minutes and RTO tens of minutes. Choose based on your RPO/RTO requirements and budget.

Which Azure service should I use for Active-Active global load balancing?

Use Azure Traffic Manager for DNS-based routing (supports Performance, Priority, Weighted methods) or Azure Front Door for HTTP/HTTPS routing with SSL termination, WAF, and URL-based routing. Front Door offers faster failover (anycast) and more features, but Traffic Manager is simpler for non-HTTP endpoints.

Can I use Azure Site Recovery for Active-Active?

No, Azure Site Recovery is designed for Active-Passive disaster recovery. It replicates VMs from a primary region to a secondary region and requires failover to activate the secondary. For Active-Active, you need a different replication strategy (e.g., Azure SQL Active Geo-Replication, Cosmos DB multi-region writes) and a global load balancer.

What is the default RPO for Azure Site Recovery?

The default RPO threshold for Azure Site Recovery is 15 minutes. Actual RPO can be as low as 30 seconds for Azure-to-Azure replication. If the RPO exceeds the threshold, an alert is raised. You can configure the threshold to be lower (e.g., 5 minutes) but this may increase replication frequency and cost.

How do I achieve zero data loss during a disaster?

Zero data loss (RPO=0) is only possible with a planned failover where the primary region is shut down gracefully before failing over. For unplanned failover, asynchronous replication always risks data loss. Use Azure SQL Database failover groups with planned failover or Azure Site Recovery with planned failover to achieve RPO=0. For unplanned scenarios, you can use synchronous replication but it impacts performance.

What is the recommended DNS TTL for fast failover?

For fast failover, set DNS TTL to 30 seconds or lower. The default is 300 seconds (5 minutes), which delays traffic redirection. However, very low TTL increases DNS query load. Balance between speed and cost. Azure Traffic Manager allows TTL as low as 0, but this is not recommended due to excessive DNS traffic.
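The trade-off between failover speed and DNS query load is simple arithmetic: a caching resolver re-queries at most once per TTL. The sketch below computes that upper bound per resolver; the function name is hypothetical and real query volumes depend on how many resolvers and clients are involved.

```python
def queries_per_hour(ttl_seconds: int) -> float:
    """Upper bound on DNS lookups per hour from one caching resolver."""
    if ttl_seconds <= 0:
        raise ValueError("this sketch only covers positive TTLs")
    return 3600 / ttl_seconds


for ttl in (300, 30, 10):
    print(ttl, queries_per_hour(ttl))
# 300 12.0
# 30 120.0
# 10 360.0
```

Dropping the TTL from 300 to 30 seconds multiplies the potential query volume by ten, which is the cost side of faster failover.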

Can I use Azure Load Balancer for cross-region failover?

No, Azure Load Balancer is a regional service that distributes traffic within a single region. For cross-region failover, use Azure Traffic Manager or Azure Front Door. Azure Load Balancer can be used in conjunction with Traffic Manager to load balance within each region.
