This chapter covers multi-region active-active architecture in Azure, a key design pattern for achieving high availability and disaster recovery. For the AZ-305 exam, this topic is critical as it tests your ability to design resilient solutions that meet uptime SLAs and RPO/RTO targets. Approximately 15-20% of exam questions touch on business continuity topics, with active-active architectures being a significant portion. You will learn the mechanisms, components, and trade-offs of deploying workloads across multiple Azure regions where all regions are actively serving traffic.
Imagine a company with headquarters in New York and London, each fully staffed and operating simultaneously. Customers worldwide can contact either office, and both have the same customer database that updates in real time. If a customer calls New York, the New York team handles the request. If the New York office experiences a power outage, calls automatically route to London where the same data is available. This is active-active architecture: both sites are live, serving traffic and sharing data. In contrast, an active-passive setup would have London as a backup, idle until New York fails. The challenge is keeping both databases synchronized, just as the two offices must share customer updates instantly. In Azure, this is achieved with global database replication and traffic routing via Azure Traffic Manager or Azure Front Door. The key is ensuring that write conflicts are handled correctly—like two employees updating the same customer record simultaneously—using conflict resolution policies. This architecture provides maximum availability and performance but requires careful design to avoid data inconsistency and to manage the cost of running multiple active regions.
What Is Multi-Region Active-Active Architecture?
Multi-region active-active architecture is a deployment model where an application is deployed in two or more Azure regions, and each region actively handles user traffic simultaneously. Unlike active-passive (or active-standby), where one region is idle until failover, active-active keeps all regions live, providing higher resource utilization and lower latency for global users. The primary goals are:
- High Availability: If one region fails, traffic is redistributed to the remaining regions with little or no downtime.
- Disaster Recovery: Data is replicated across regions, enabling recovery from region-wide outages with minimal data loss (RPO) and recovery time (RTO).
- Performance: Users are directed to the nearest region, reducing latency.
How It Works Internally – The Mechanism
At its core, active-active architecture relies on two key components: traffic routing and data synchronization.
Traffic Routing: Azure Traffic Manager or Azure Front Door distributes incoming requests across regions. Traffic Manager offers routing methods such as Performance, Geographic, Priority, and Weighted; Front Door routes by latency, priority, and weight. Traffic Manager uses DNS-based load balancing: when a client resolves the application domain, Traffic Manager answers with a healthy endpoint chosen by the routing method (the lowest-latency one under Performance routing), and the client then connects to that endpoint directly. Health probes (HTTP, HTTPS, or TCP) are sent to each endpoint every 30 seconds by default, or every 10 seconds with fast probing; once an endpoint exceeds the tolerated number of consecutive failures (default 3, configurable from 0 to 9), it is marked degraded and removed from DNS responses until it recovers.
Data Synchronization: For stateful applications, data must be replicated across regions. Azure offers:
- Azure SQL Database Active Geo-Replication: Creates readable secondaries in other regions using asynchronous replication. RPO is typically up to 5 seconds but can be higher under load. Only the primary accepts writes.
- Cosmos DB Multi-Region Writes: Allows writes in any region with automatic conflict resolution (e.g., last-writer-wins). Replication across regions is asynchronous, so concurrent writes to the same item in different regions can conflict, and writes not yet replicated when a region fails may be lost (RPO is not zero).
- Azure Storage Read-Access Geo-Redundant Storage (RA-GRS): Provides read access to a secondary region. Replication is asynchronous; RPO is typically under 15 minutes for blobs, with no SLA on that figure.
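To make the last-writer-wins idea concrete, here is a minimal sketch in plain shell. It is an illustration only, not how Cosmos DB is implemented: the function name `lww_winner`, the sample timestamps, and the tie-breaking rule (the first write wins a tie) are all assumptions made for this example.

```shell
# Illustrative last-writer-wins (LWW) resolution between two conflicting writes.
# Usage: lww_winner TS_A VALUE_A TS_B VALUE_B
# Prints the value carrying the higher timestamp; the other write is discarded.
# Tie-breaking (first write wins) is an assumption of this sketch.
lww_winner() {
  if [ "$3" -gt "$1" ]; then
    printf '%s\n' "$4"
  else
    printf '%s\n' "$2"
  fi
}

# Two regions update the same item during a partition.
lww_winner 1700000005 "stock=7" 1700000009 "stock=4"   # prints stock=4
```

The write with the lower timestamp disappears without an error, which is why conflict rates are worth monitoring when LWW is in use.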
Key Components, Values, Defaults, and Timers
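Azure Traffic Manager: Profile with routing method (Performance, Geographic, Priority, Weighted, MultiValue, Subnet). Health probe interval: 30 seconds (default) or 10 seconds (fast probing). Tolerated number of failures: 3 (default), configurable from 0 to 9. TTL for DNS responses: configurable per profile and can be set as low as 0 seconds; lower values speed up failover at the cost of more DNS queries. Endpoint monitoring: HTTP, HTTPS, or TCP; for HTTP/HTTPS a 200 OK response is expected unless other status code ranges are configured.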
Azure Front Door: Application-layer load balancer with global anycast. Health probe interval: 30 seconds (default). Supports path-based routing, SSL termination, and WAF.
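Azure SQL Active Geo-Replication: Secondary databases are readable. Failover can be initiated manually or automated through auto-failover groups. RPO: typically <5 seconds. Failover time: ~1 minute once failover starts (depends on number of transactions); with the Microsoft-managed failover policy, automatic failover begins only after the configured grace period, which is at least one hour.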
Cosmos DB: Multi-region writes enabled at account level. Consistency levels: Strong, Bounded Staleness, Session, Consistent Prefix, Eventual. Strong is not supported with multi-region writes; Session (the account default) or Eventual are the usual choices. Conflict resolution: last-writer-wins (LWW) or custom.
Azure Storage: RA-GRS provides read access to the secondary region. RPO: typically under 15 minutes for blobs (no SLA). Failover: customer-initiated via the Azure portal, PowerShell, or CLI.
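Because the storage RPO is a typical figure rather than a guarantee, it can help to check how far behind the secondary actually is. The sketch below queries the account's last sync time; the account name `mystorageacct` is a hypothetical placeholder, and `MyRG` is reused from the examples in this chapter.

```shell
# Show the last time the secondary region was known to be in sync.
# mystorageacct is a placeholder; writes made after the returned time
# may not yet exist in the secondary region.
az storage account show \
  --name mystorageacct \
  --resource-group MyRG \
  --expand geoReplicationStats \
  --query geoReplicationStats.lastSyncTime \
  --output tsv
```

Reads against the secondary use the `-secondary` host name, for example `mystorageacct-secondary.blob.core.windows.net` for blobs.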
Configuration and Verification Commands
To create a Traffic Manager profile with two endpoints:
az network traffic-manager profile create \
--name MyProfile \
--resource-group MyRG \
--routing-method Performance \
--unique-dns-name myapp \
--ttl 30 \
--protocol HTTP \
--port 80 \
--path "/"
az network traffic-manager endpoint create \
--name eastus-endpoint \
--profile-name MyProfile \
--resource-group MyRG \
--type azureEndpoints \
--target-resource-id /subscriptions/.../.../appService/eastus-app \
--endpoint-status Enabled
az network traffic-manager endpoint create \
--name westeurope-endpoint \
--profile-name MyProfile \
--resource-group MyRG \
--type azureEndpoints \
--target-resource-id /subscriptions/.../.../appService/weu-app \
--endpoint-status Enabled
To enable multi-region writes in Cosmos DB:
az cosmosdb update \
--name mycosmosdb \
--resource-group MyRG \
--enable-multiple-write-locations true
Interaction with Related Technologies
Active-active architecture often integrates with Azure DevOps for CI/CD pipelines that deploy to multiple regions simultaneously. Azure Monitor and Application Insights provide cross-region monitoring. Azure Policy can enforce geo-redundancy. Azure Site Recovery is not used in active-active (it's for active-passive); instead, native replication services are preferred. The architecture also impacts cost: running two active regions doubles compute and storage costs, but can be offset by improved SLAs and reduced latency.
Design the Application for Statelessness
Ensure the application tier is stateless so that any region can handle any request. Session state should be stored externally in a distributed store such as Azure Cache for Redis or Cosmos DB. This allows requests to be routed to any region without requiring sticky sessions. If the application is stateful, you must implement session affinity (e.g., using Application Gateway with cookie-based affinity), which complicates active-active. For the exam, remember that stateless design is a prerequisite for true active-active.
Configure Global Traffic Routing
Deploy Azure Traffic Manager or Azure Front Door to distribute traffic. For example, use the Traffic Manager Performance routing method to direct users to the region with the lowest network latency. Configure health probes to monitor each endpoint's health. To detect failures quickly, lower the probe interval from the 30-second default to 10 seconds (fast probing) and reduce the tolerated number of failures (default 3). DNS TTL should be low (e.g., 30 seconds) to allow fast failover. Verify that endpoints return HTTP 200 to be considered healthy.
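As a rough sketch of these settings, the commands below tighten monitoring on the `MyProfile` profile created earlier and then list each endpoint's monitor status. The specific values (10-second interval, 5-second timeout, one tolerated failure, 30-second TTL) are example choices, not recommendations from the source.

```shell
# Switch the profile to fast probing and a short DNS TTL.
# A 10-second interval requires a probe timeout below 10 seconds.
az network traffic-manager profile update \
  --name MyProfile \
  --resource-group MyRG \
  --interval 10 \
  --timeout 5 \
  --max-failures 1 \
  --ttl 30

# Check how Traffic Manager currently sees each endpoint.
az network traffic-manager profile show \
  --name MyProfile \
  --resource-group MyRG \
  --query "endpoints[].{name:name, monitorStatus:endpointMonitorStatus}" \
  --output table
```

An endpoint that is serving traffic should report `Online`; `Degraded` means the probes are failing and the endpoint is being left out of DNS answers.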
Implement Cross-Region Data Replication
Use Azure SQL Active Geo-Replication or Cosmos DB multi-region writes. For SQL, create auto-failover groups with a primary and secondary region. The secondary is readable and can be used for read-heavy workloads. For Cosmos DB, enable multi-region writes and choose a consistency level. Be aware that strong consistency is not supported with multi-region writes. For storage, use RA-GRS to provide read access to a secondary region. Ensure the RPO meets your requirements.
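A minimal sketch of the SQL and Cosmos DB pieces follows. The failover group name `myfog`, the server names `sql-eastus` and `sql-westeurope`, and the database name `appdb` are hypothetical; both servers are assumed to exist already in `MyRG`.

```shell
# Create an auto-failover group that replicates appdb to the partner server.
# --grace-period is in hours; automatic failover waits at least this long.
az sql failover-group create \
  --name myfog \
  --resource-group MyRG \
  --server sql-eastus \
  --partner-server sql-westeurope \
  --add-db appdb \
  --failover-policy Automatic \
  --grace-period 1

# Confirm that multi-region writes are enabled on the Cosmos DB account.
az cosmosdb show \
  --name mycosmosdb \
  --resource-group MyRG \
  --query enableMultipleWriteLocations
```

Applications then connect through the failover group listeners, `myfog.database.windows.net` for read-write traffic and `myfog.secondary.database.windows.net` for read-only traffic, so connection strings do not change after a failover.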
Deploy Application to Multiple Regions
Deploy identical application stacks (compute, storage, etc.) to each region. Use Azure Resource Manager templates or Terraform to ensure consistency. Each region should have its own resources (e.g., App Service, VMs, databases). The application code must be able to read/write to the local database instance or use the global endpoint. For Cosmos DB, the SDK routes requests to the nearest available region when the client is configured with its application region or preferred regions. For SQL, use the failover group listener endpoint.
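One way to keep regional stacks identical is to deploy the same template once per region from a single loop. This is a sketch under assumptions: the template file `main.bicep`, its `location` parameter, and the resource group naming pattern `rg-myapp-<region>` are all hypothetical and would need to match your own setup.

```shell
# Deploy the same template to each region so the stacks cannot drift apart.
# Resource groups rg-myapp-eastus and rg-myapp-westeurope are assumed to exist.
for region in eastus westeurope; do
  az deployment group create \
    --resource-group "rg-myapp-${region}" \
    --template-file main.bicep \
    --parameters location="${region}"
done
```

Running the same loop from a CI/CD pipeline gives every region the same artifact and configuration on each release.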
Test Failover and Monitor
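Simulate a region failure by disabling an endpoint in Traffic Manager or stopping the application in one region. Verify that traffic redirects to the healthy region within roughly the failure-detection time plus the DNS TTL. For example, with a 10-second probe interval, one tolerated failure, a 5-second probe timeout, and a 30-second TTL, expect redirection within about a minute; disabling an endpoint manually skips detection, so only the TTL applies. Monitor the application using Azure Monitor and set alerts for regional health. Test data consistency after failover: ensure no data loss or conflicts. Perform this test regularly as part of DR drills.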
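The timing can be estimated before running a drill. The sketch below uses a simplified worst-case model that is an assumption of this example: the endpoint fails just after a successful probe, Traffic Manager needs one more failed probe than the tolerated count, and clients honor the DNS TTL. The function names are hypothetical; the endpoint and profile names come from the earlier commands.

```shell
# Rough worst-case failover estimate in seconds.
# $1 = DNS TTL, $2 = probe interval, $3 = tolerated failures, $4 = probe timeout
estimate_failover_seconds() {
  echo $(( ($3 + 1) * $2 + $4 + $1 ))
}

# Drill helper: take the East US endpoint out of rotation (not run automatically).
simulate_region_failure() {
  az network traffic-manager endpoint update \
    --name eastus-endpoint \
    --profile-name MyProfile \
    --resource-group MyRG \
    --type azureEndpoints \
    --endpoint-status Disabled
}

estimate_failover_seconds 30 10 1 5   # prints 55
```

During the drill, repeated lookups of `myapp.trafficmanager.net` (for example with `nslookup`) show when DNS answers stop returning the disabled region.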
Enterprise Scenario 1: Global E-Commerce Platform
A large e-commerce company deploys its web application across Azure regions in North America, Europe, and Asia. They use Azure Front Door with latency-based routing to send users to the nearest region, reducing page load times by 40%. The product catalog is stored in Cosmos DB with multi-region writes, allowing inventory updates from any region. During a regional outage in one Azure region, Front Door automatically routes traffic to the next closest region with no downtime. The challenge is managing write conflicts during network partitions; they use last-writer-wins on a timestamp, which silently discards the losing write and can pick the wrong winner if the timestamps come from clocks that drift. They mitigate this by using NTP-synchronized clocks and monitoring conflict rates.
Enterprise Scenario 2: Financial Services Application
A bank deploys a transaction processing system across two Azure regions using active-active architecture at the application tier. They use Azure SQL Database with auto-failover groups and geo-replication. The application is stateless, with session data stored in Azure Cache for Redis replicated across regions. Traffic Manager with performance routing distributes user traffic to both regions. During normal operations, both regions serve requests, but all database writes go to the primary through the failover group's read-write listener; the secondary region's database is read-only (active geo-replication) and serves read-heavy queries. If the primary region fails, failover promotes the secondary to primary. The RPO is under 5 seconds, and the failover itself takes about 1 minute once initiated. The bank must ensure that transactions are idempotent to handle duplicate processing after failover. They also use Azure Policy to enforce that all resources are geo-redundant.
Common Pitfalls
Misconfigured Health Probes: If health probes are too permissive (e.g., checking only the homepage), a region may be considered healthy even if the database is down. Use custom probes that check critical dependencies.
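A dependency-aware probe can be exercised by hand before pointing Traffic Manager at it. The sketch below assumes the application exposes a hypothetical `/health` path that returns 200 only when its database and other critical dependencies respond; the function names and the path are illustrative, not part of any Azure API.

```shell
# Succeeds only for the status code a probe treats as healthy by default.
probe_passes() {
  [ "$1" -eq 200 ]
}

# Fetch the health path of one regional deployment and report the result.
# $1 = base URL of the regional app, e.g. the App Service host name.
check_region() {
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1/health")
  if probe_passes "$status"; then
    echo "healthy"
  else
    echo "unhealthy (HTTP $status)"
  fi
}
```

Once the path behaves as expected in every region, set the profile's `--path` to it so a region with a failed dependency drops out of DNS answers.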
Inconsistent Deployment: If the application code or configuration differs between regions, users may experience different behavior. Use CI/CD pipelines that deploy identical artifacts to all regions.
Overlooking DNS Caching: Clients and intermediate DNS resolvers may cache DNS responses beyond the TTL, delaying failover. Set the TTL as low as practical (for example, 30 seconds) and remember that browsers and operating systems keep their own DNS caches.
Cost Overruns: Running two active regions doubles compute and storage costs. Use reserved instances and auto-scaling to optimize. Consider using a single active region for non-critical workloads.
Exam Focus for AZ-305 (Objective 3.2)
The AZ-305 exam tests your ability to recommend appropriate business continuity solutions. For active-active architecture, the exam focuses on:
- Identifying when to use active-active vs active-passive: Active-active is preferred when low RTO (seconds) and low RPO (seconds) are required, and the application can handle multi-region writes. Active-passive is simpler and cheaper.
- Selecting the right replication technology: For Azure SQL, use Active Geo-Replication or auto-failover groups; the secondary is readable, but writes go only to the primary. For Cosmos DB, enable multi-region writes. For storage, use RA-GRS.
- Traffic routing: Understand when to use Traffic Manager vs Front Door. Front Door provides application-layer features (WAF, SSL offload) but typically costs more. Traffic Manager is DNS-based and simpler.
- Common wrong answers:
  1. Choosing Azure Site Recovery for active-active: ASR is for active-passive (replication of VMs).
  2. Assuming strong consistency is possible with multi-region writes: It is not; Session or Eventual are the usual choices.
  3. Forgetting that SQL geo-replication is asynchronous: RPO is not zero.
- Specific numbers: Traffic Manager health probe interval 30s by default, 10s with fast probing, with 3 tolerated failures by default. Front Door health probe interval 30s by default. DNS TTL is configurable per Traffic Manager profile. RPO for SQL geo-replication <5s. RPO for RA-GRS typically under 15 minutes.
- Edge cases: If using Cosmos DB with multi-region writes and a network partition occurs, conflicts may arise. The exam may test that conflict resolution policies (LWW or custom) must be configured.
- How to eliminate wrong answers: If the question mentions "zero RPO" or "synchronous replication across regions", eliminate any option that uses asynchronous replication. If the question requires immediate failover with no manual steps, look for auto-failover groups or Front Door health probes.
Active-active architecture requires stateless application design and externalized session state.
Azure Traffic Manager uses DNS-based routing with a default health probe interval of 30 seconds (10 seconds with fast probing).
Azure Front Door provides application-layer routing with WAF and SSL termination.
Azure SQL Active Geo-Replication is asynchronous with RPO typically <5 seconds.
Cosmos DB multi-region writes do not support strong consistency; Session and Eventual are the usual choices.
RA-GRS storage provides read access to secondary region with RPO of 15 minutes for blobs.
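Auto-failover groups in SQL Database enable automated failover; the failover itself takes about 1 minute once initiated, after any configured grace period.
Cost of active-active is approximately double that of a single-region deployment due to duplicate resources; active-passive costs less because standby resources can be scaled down.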
These come up on the exam all the time. Here's how to tell them apart.
Active-Active Architecture
All regions serve traffic simultaneously.
Lower latency for global users due to proximity.
Requires complex data replication with conflict handling.
Higher cost (multiple active resources).
RTO is near zero (seconds) if traffic routing is fast.
Active-Passive Architecture
Only one region serves traffic; others are standby.
Higher latency for users far from primary region.
Simpler data replication (one-way, no conflicts).
Lower cost (standby resources can be scaled down).
RTO includes failover time (minutes to hours).
Mistake
Active-active architecture means all regions are identical and handle all traffic equally.
Correct
Not necessarily. Traffic can be distributed unevenly using weighted routing. Also, some regions may be used only for reads while writes go to a primary region (active-active reads, active-passive writes).
Mistake
Azure SQL Active Geo-Replication provides zero RPO.
Correct
It is asynchronous, so RPO is typically under 5 seconds but can be higher under load. Zero RPO is only possible with synchronous replication within a region (e.g., SQL Always On Availability Groups).
Mistake
Traffic Manager and Front Door both work at the application layer.
Correct
Traffic Manager works at the DNS level: it only answers DNS queries and never sees application traffic, although its health probes can use HTTP or HTTPS. Front Door works at the application layer (Layer 7) and can inspect HTTP headers, perform URL rewrites, and terminate SSL.
Mistake
You can use Azure Site Recovery for active-active replication.
Correct
Azure Site Recovery is designed for active-passive disaster recovery. It replicates VMs and can be used for failover, but it does not support active-active because the replicated VMs are not meant to serve traffic until failover.
Mistake
Cosmos DB multi-region writes guarantee strong consistency globally.
Correct
Strong consistency is not supported with multi-region writes. The strongest level you can select is Bounded Staleness, and the account default is Session; Consistent Prefix and Eventual are also available.
Traffic Manager is a DNS-based load balancer that routes traffic at the DNS level, returning the IP of the closest healthy endpoint. It works for any protocol (HTTP, TCP, etc.) but cannot inspect application-layer data. Front Door is an application-layer load balancer that uses anycast and can route based on HTTP headers, perform SSL offload, URL rewrites, and integrate with WAF. For active-active, use Traffic Manager for simple global load balancing or Front Door if you need application-layer features. Front Door also provides faster failover due to anycast.
Yes, but with limitations. Azure SQL supports Active Geo-Replication where secondary databases are readable. For true active-active (both regions accepting writes), you would need to implement a custom conflict resolution layer or use Cosmos DB. SQL's geo-replication is asynchronous and one-way; writes only go to the primary. You can use auto-failover groups to promote a secondary, but during normal operation only one region accepts writes.
With Azure Traffic Manager and Cosmos DB multi-region writes, RTO can be very low (seconds to a few minutes, depending on probe settings and DNS TTL) because traffic is automatically redirected when health probes fail. RPO depends on the replication method: for Cosmos DB multi-region writes, replication between regions is asynchronous, so recent writes and the losing side of a conflict can be lost. For SQL geo-replication, RPO is typically under 5 seconds. For RA-GRS storage, RPO is typically under 15 minutes. The exam expects you to know these numbers.
The Performance routing method in Traffic Manager directs users to the endpoint with the lowest network latency from the client's DNS resolver. This is ideal for active-active because it automatically routes users to the fastest region. Geographic routing can also be used if you want to pin users to a specific region for data sovereignty.
Store session state externally in a distributed cache like Azure Cache for Redis, which can be replicated across regions using geo-replication. Alternatively, use Cosmos DB with session consistency. The application must be stateless so that any region can process any request. Avoid using in-memory session state on the web server, as it will be lost if the server fails or traffic is routed to a different region.
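No, a regional Azure Load Balancer operates within a single region. For multi-region load balancing of web applications, you need a global service like Traffic Manager or Front Door. Standard Load Balancer can be used within a region for high availability; its Global tier (cross-region load balancer) distributes Layer 4 traffic across regions but offers no application-layer routing.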
If a region fails, Cosmos DB automatically routes writes to the nearest healthy region. The SDK handles this transparently. During the outage, any writes that were in transit to the failed region may be lost (RPO depends on consistency level). After recovery, conflicts may be detected and resolved using the configured conflict resolution policy (e.g., last-writer-wins).