This chapter covers Azure resiliency patterns and anti-patterns, a critical topic for the AZ-305 exam's Business Continuity domain (objective 3.2). Understanding these patterns is essential for designing solutions that withstand failures and maintain availability. Approximately 15-20% of exam questions touch on resiliency, including choosing between redundancy options, implementing disaster recovery, and avoiding common design mistakes. We will explore the core mechanisms, step-by-step processes, real-world scenarios, and exam-focused insights to ensure you can architect resilient Azure solutions.
Imagine you are a librarian managing a three-ring binder containing the only copy of a critical reference book. If a single page is torn, the entire book is incomplete. To prevent this, you make three identical copies of the binder and store them in three separate fireproof safes in different buildings. When a patron requests a page, you retrieve it from the nearest safe. If one safe is destroyed, you still have two intact copies. This is Azure's replication: each binder is a replica in a different Availability Zone. If a zone fails, you seamlessly switch to another. But what if a fire destroys an entire city? You then need a fourth copy in a different city—Azure's cross-region replication. The librarian's log (Azure Traffic Manager) tracks which safes are reachable and redirects requests accordingly. The key is that the binder's content is always consistent because you update all copies in a specific order (e.g., synchronous for local, asynchronous for remote). Without this mechanism, a torn page (data corruption) could be propagated to all copies, making the entire system unreliable. Azure's resiliency patterns mirror this: redundancy, isolation, and consistency protocols ensure that even if components fail, the service remains available and data intact.
What Are Azure Resiliency Patterns and Anti-Patterns?
Resiliency in Azure refers to the ability of a system to recover from failures and continue to function. Patterns are proven, repeatable solutions to common design problems, while anti-patterns are common pitfalls that lead to reduced availability or data loss. The AZ-305 exam tests your ability to select appropriate patterns for given requirements and recognize anti-patterns that must be avoided. Key concepts include redundancy, fault isolation, graceful degradation, and disaster recovery.
Why Resiliency Patterns Exist
Cloud environments are inherently unreliable at the component level. Hardware fails, networks partition, and software bugs cause outages. Resiliency patterns mitigate these risks by distributing workloads across multiple independent failure domains, automating failover, and ensuring data durability. Without them, a single server failure could take down an entire application.
How Resiliency Patterns Work Internally
#### Redundancy Patterns
- Active-Active: Multiple instances of a service run simultaneously, all receiving traffic. Traffic Manager or Azure Front Door distributes requests. If one instance fails, traffic is redirected to healthy instances. This requires stateless application design or distributed session state (e.g., Azure Redis Cache).
- Active-Passive: One instance handles traffic; a standby instance remains idle. On failure, a health probe detects the outage and triggers failover. Azure Traffic Manager with priority routing implements this. Failover time depends on probe interval (default 30 seconds) and time-to-live (TTL) of DNS records.
- N+1 Redundancy: Deploy one extra instance beyond the minimum required to handle load. For example, if three VMs can handle peak load, deploy four. This provides buffer for maintenance and failures.
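The N+1 rule above is simple arithmetic, and a minimal sketch makes it concrete. The function name, parameters, and load figures below are illustrative assumptions, not values taken from Azure.

```python
import math


def instances_to_deploy(peak_load: float, capacity_per_instance: float, spares: int = 1) -> int:
    """Return the instance count for N+spares redundancy.

    N is the minimum number of instances that can carry the peak load;
    `spares` extra instances are added as a buffer for failures and maintenance.
    """
    if peak_load < 0 or capacity_per_instance <= 0 or spares < 0:
        raise ValueError("peak_load and spares must be >= 0 and capacity_per_instance > 0")
    needed = math.ceil(peak_load / capacity_per_instance)  # N
    return needed + spares


# Hypothetical numbers: 2,700 requests/s at peak, 1,000 requests/s per VM.
print(instances_to_deploy(peak_load=2700, capacity_per_instance=1000))  # 4 (3 needed + 1 spare)
```

Raising `spares` to 2 gives N+2, which may suit cases where one instance can be down for maintenance while another fails.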
#### Fault Isolation Patterns
- Availability Zones: Physically separate datacenters within an Azure region, each with independent power, cooling, and networking. Deploying VMs across zones ensures that a zone failure affects only a subset of instances. Zone-redundant storage (ZRS) replicates data synchronously across zones.
- Region Pairs: Azure pairs regions within the same geography (e.g., East US and West US) for disaster recovery. Data is replicated asynchronously across regions. During planned maintenance, updates are applied sequentially to paired regions to minimize impact.
#### Data Resiliency Patterns
- Locally Redundant Storage (LRS): Three synchronous copies within a single datacenter. Protects against drive failures but not datacenter failure.
- Zone-Redundant Storage (ZRS): Three copies across three Availability Zones in a region. Protects against zone failure.
- Geo-Redundant Storage (GRS): Six copies total: three in the primary region (LRS) and three in a secondary region (LRS). Asynchronous replication to secondary region. Protects against region failure.
- Read-Access Geo-Redundant Storage (RA-GRS): Same as GRS but with read access to the secondary region.
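The four storage options above map onto three yes/no questions about failure scope. The sketch below encodes that mapping as this chapter describes it; the function and parameter names are hypothetical, and it deliberately covers only the four options listed here.

```python
def choose_storage_redundancy(
    survive_zone_failure: bool,
    survive_region_failure: bool,
    read_from_secondary: bool = False,
) -> str:
    """Pick LRS, ZRS, GRS, or RA-GRS from the required failure scope.

    Simplification: only the four options described in this chapter are
    considered, and reading from a secondary region implies geo-replication.
    """
    if survive_region_failure or read_from_secondary:
        return "RA-GRS" if read_from_secondary else "GRS"
    if survive_zone_failure:
        return "ZRS"
    return "LRS"


print(choose_storage_redundancy(False, False))        # LRS
print(choose_storage_redundancy(True, False))         # ZRS
print(choose_storage_redundancy(True, True))          # GRS
print(choose_storage_redundancy(False, True, True))   # RA-GRS
```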
Key Components, Values, Defaults, and Timers
Azure Traffic Manager: DNS-based traffic load balancer. Uses DNS resolution to direct clients to the nearest or healthiest endpoint. Default DNS TTL: 300 seconds. Health probe interval: 30 seconds (configurable). Failure threshold: 3 consecutive failures (configurable).
Azure Load Balancer: Layer 4 (TCP/UDP) load balancer. Health probes: default interval 5 seconds, number of probes 2. Probes can be HTTP, TCP, or HTTPS.
Azure Front Door: Layer 7 global load balancer with SSL offload and WAF. Health probe interval: default 30 seconds. Supports custom probes.
Azure Site Recovery: Orchestrates replication and failover of VMs. Default recovery point objective (RPO): 15 seconds for Azure-to-Azure replication. Recovery time objective (RTO) depends on configuration, typically minutes.
Azure Backup: Backs up data with configurable retention. Default backup frequency: daily. Retention: up to 99 years.
Configuration and Verification Commands
#### Azure Traffic Manager Profile Creation
New-AzTrafficManagerProfile -Name "MyProfile" -ResourceGroupName "RG1" -RelativeDnsName "myapp" -TrafficRoutingMethod Priority -Ttl 30 -MonitorProtocol HTTP -MonitorPort 80 -MonitorPath "/health" -ProfileStatus Enabled
#### Add Endpoints
New-AzTrafficManagerEndpoint -Name "Endpoint1" -ProfileName "MyProfile" -ResourceGroupName "RG1" -Type AzureEndpoints -TargetResourceId "/subscriptions/.../Microsoft.Network/publicIPAddresses/myPublicIP" -EndpointStatus Enabled -Priority 1
#### Verify Health Probes
Get-AzTrafficManagerEndpoint -Name "Endpoint1" -Type AzureEndpoints -ProfileName "MyProfile" -ResourceGroupName "RG1" | Select-Object Name, EndpointMonitorStatus
#### Configure Azure Site Recovery for Azure VMs
The commands below assume that `$vm`, `$mapping` (a protection container mapping), `$diskConfigs` (disk replication settings), and `$rgId` (the recovery resource group ID) were created in earlier steps.
New-AzRecoveryServicesVault -Name "MyVault" -ResourceGroupName "RG1" -Location "East US"
Set-AzRecoveryServicesAsrVaultContext -Vault (Get-AzRecoveryServicesVault -Name "MyVault" -ResourceGroupName "RG1")
New-AzRecoveryServicesAsrReplicationProtectedItem -AzureToAzure -AzureVmId $vm.Id -Name (New-Guid).Guid -ProtectionContainerMapping $mapping -AzureToAzureDiskReplicationConfiguration $diskConfigs -RecoveryResourceGroupId $rgId
Interaction with Related Technologies
Resiliency patterns often combine multiple services. For example, a web application might use Azure Front Door for global load balancing and SSL termination, Azure Traffic Manager for DNS failover between regions, and Azure Load Balancer for distributing traffic within a region. Data is stored in Azure SQL Database with active geo-replication for cross-region failover. Azure Monitor and Application Insights provide health monitoring and alerting. Understanding how these components interact is crucial for designing end-to-end resiliency.
Anti-Patterns to Avoid
Single Point of Failure (SPOF): Deploying only one VM or database instance without redundancy. The exam loves to test this: if a question asks about ensuring availability, never accept a single-instance solution.
Tight Coupling: Designing components that depend on each other's direct availability, e.g., synchronous calls between microservices without circuit breakers. This leads to cascading failures.
Ignoring Data Consistency: Using asynchronous replication without understanding replication lag. During a disaster, data written after the last replicated point is lost. The exam may ask about RPO and RTO trade-offs.
Over-Engineering: Implementing unnecessary redundancy (e.g., using geo-replication for a non-critical app) increases cost and complexity. The exam expects you to balance cost and resiliency.
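The tight-coupling item above mentions circuit breakers. As a rough illustration of the idea, the sketch below stops calling a failing dependency for a cooling-off period instead of letting every caller wait on it. The class, its thresholds, and the injectable `clock` are assumptions made for this example, not part of any Azure SDK.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so one trial call is allowed through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_inventory, item_id)` with a hypothetical `fetch_inventory`, and treat `CircuitOpenError` as a signal to degrade gracefully rather than retry immediately.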
Specific Numbers and Defaults for the Exam
Availability Zones are supported in most Azure regions but not all. Know examples of regions that have zones (e.g., East US, East US 2, West US 2, West Europe, North Europe) and regions that do not (e.g., West US, North Central US, UK West). The exam may present a scenario where a region without zones is chosen, requiring a different approach.
Azure Site Recovery RPO: 15 seconds for Azure-to-Azure replication. For on-premises to Azure, RPO can be as low as 30 seconds.
Traffic Manager default probe interval: 30 seconds. Failure threshold: 3. So failover can take up to 90 seconds (3 intervals) + DNS TTL propagation.
Azure Load Balancer health probe: 5-second interval, 2 failed probes = 10 seconds to mark unhealthy.
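The timings above follow one simple model: detection time is probe interval times failure threshold, and DNS-based failover adds the TTL on top. The sketch below uses only that simplified model and the defaults quoted in this chapter; real failover can also be affected by probe timeouts and resolver behavior, which are not modeled here.

```python
def detection_time_s(probe_interval_s: int, failure_threshold: int) -> int:
    """Seconds until an endpoint is marked unhealthy (simplified model)."""
    return probe_interval_s * failure_threshold


def traffic_manager_worst_case_s(
    probe_interval_s: int = 30,
    failure_threshold: int = 3,
    dns_ttl_s: int = 300,
) -> int:
    """Detection time plus the DNS TTL that cached answers may still live for."""
    return detection_time_s(probe_interval_s, failure_threshold) + dns_ttl_s


print(traffic_manager_worst_case_s())           # 390 (defaults quoted in this chapter)
print(traffic_manager_worst_case_s(10, 2, 30))  # 50  (fast probes, low TTL)
print(detection_time_s(5, 2))                   # 10  (Load Balancer figures above)
```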
Step-by-Step: Implementing Active-Passive Failover with Traffic Manager
Deploy primary and secondary VMs in different regions or Availability Zones.
Configure health probe on each VM to respond on a specific path (e.g., /health).
Create Traffic Manager profile with priority routing. Set primary endpoint priority 1, secondary priority 2.
Set DNS TTL to a low value (e.g., 30 seconds) for fast failover.
Test failover by stopping the primary VM. Traffic Manager detects failure after probe interval * failure threshold (e.g., 90 seconds) and redirects traffic to secondary.
Monitor using Traffic Manager's monitor status and Azure Monitor alerts.
This pattern is suitable for stateless applications where a brief outage (1-2 minutes) is acceptable.
Define Resiliency Requirements
Start by determining the required recovery point objective (RPO) and recovery time objective (RTO) for your application. RPO is the maximum acceptable data loss in time; RTO is the maximum acceptable downtime. For example, a critical banking app might require RPO of 5 seconds and RTO of 1 minute, while a reporting tool might tolerate RPO of 1 hour and RTO of 4 hours. These values drive pattern selection. Also identify SLAs from Azure services (e.g., 99.99% for Azure SQL Database Business Critical tier). Document these requirements before designing any resiliency mechanism.
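One way to keep these requirements explicit is to record them as data and compare each candidate design against them. The sketch below is a minimal illustration; the `RecoveryTargets` type and the candidate figures are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTargets:
    rpo_s: float  # maximum acceptable data loss, in seconds
    rto_s: float  # maximum acceptable downtime, in seconds


def meets_targets(required: RecoveryTargets, offered: RecoveryTargets) -> bool:
    """A design qualifies only if both its RPO and RTO are within the requirement."""
    return offered.rpo_s <= required.rpo_s and offered.rto_s <= required.rto_s


banking = RecoveryTargets(rpo_s=5, rto_s=60)               # example from the text
reporting = RecoveryTargets(rpo_s=3600, rto_s=4 * 3600)    # example from the text
candidate = RecoveryTargets(rpo_s=15, rto_s=120)           # hypothetical design

print(meets_targets(banking, candidate))    # False
print(meets_targets(reporting, candidate))  # True
```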
Choose Redundancy Model
Based on RPO and RTO, decide between active-active, active-passive, or pilot light. Active-active provides sub-second failover but requires stateless design and session management (e.g., Azure Redis Cache). Active-passive is simpler but has failover delay (typically 1-2 minutes). Pilot light is for disaster recovery: keep a minimal environment running and scale up on failover. For data, choose replication: synchronous for low RPO (e.g., Availability Zones with ZRS) or asynchronous for cross-region (e.g., GRS with typical RPO of 15 minutes). The exam often presents a scenario with specific RPO/RTO and asks you to pick the correct replication or deployment model.
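The selection logic in this section can be sketched as a small decision function. The cut-off values (60 seconds and 1 hour) are assumptions chosen to match the rough guidance in this chapter, and treating only a zero RPO as requiring synchronous replication is a simplification; real designs weigh cost, state, and service limits as well.

```python
def choose_models(rpo_s: float, rto_s: float) -> tuple:
    """Return (deployment_model, replication_mode) for the given targets.

    Assumed thresholds, for illustration only:
      RTO under 60 s   -> active-active
      RTO under 1 hour -> active-passive
      otherwise        -> pilot light
    """
    if rpo_s < 0 or rto_s < 0:
        raise ValueError("rpo_s and rto_s must be non-negative")
    if rto_s < 60:
        deployment = "active-active"
    elif rto_s < 3600:
        deployment = "active-passive"
    else:
        deployment = "pilot light"
    # Zero data loss needs synchronous replication; otherwise asynchronous
    # replication is acceptable as long as its lag stays within the RPO.
    replication = "synchronous" if rpo_s == 0 else "asynchronous"
    return deployment, replication


print(choose_models(rpo_s=0, rto_s=30))          # ('active-active', 'synchronous')
print(choose_models(rpo_s=3600, rto_s=300))      # ('active-passive', 'asynchronous')
print(choose_models(rpo_s=3600, rto_s=4 * 3600)) # ('pilot light', 'asynchronous')
```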
Implement Fault Isolation
Deploy resources across fault domains: use Availability Zones within a region for local redundancy, and region pairs for regional disaster recovery. For example, deploy VMs across three zones and use a zone-redundant load balancer (Standard SKU). For storage, use ZRS for zone-level protection and GRS for region-level. Ensure that the application can tolerate the loss of one zone or region. Azure Site Recovery can orchestrate failover between regions. Test failover regularly to validate that isolation boundaries work as expected.
Configure Health Monitoring and Auto-Failover
Set up health probes for all endpoints. Use Azure Traffic Manager, Load Balancer, or Front Door to monitor and redirect traffic. Configure probe intervals and failure thresholds to meet your RTO. For example, Traffic Manager with probe interval 10 seconds and failure threshold 2 gives failover detection in 20 seconds plus DNS propagation. Use custom health endpoints that check application health, not just OS responsiveness. Integrate with Azure Monitor to alert on failures. Also configure auto-scaling rules to add capacity on increased load after failover.
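A custom health endpoint, as recommended above, might look like the following sketch. It uses only the Python standard library; the dependency checks are placeholders, and the `/health` path and port are assumptions that would need to match the probe configuration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    """Placeholder: replace with a real query against the database."""
    return True


def check_cache() -> bool:
    """Placeholder: replace with a real round trip to the cache."""
    return True


CHECKS = {"database": check_database, "cache": check_cache}


def evaluate_health(checks):
    """Return (http_status, body): 200 only if every dependency check passes."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failed.append(name)
    if failed:
        return 503, "unhealthy: " + ", ".join(failed)
    return 200, "healthy"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = evaluate_health(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; call this from the application's entry point."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Returning 503 when any dependency fails lets an HTTP probe take the instance out of rotation even though the OS and the listening port are still up.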
Validate and Test Resiliency
Conduct regular chaos engineering and failover drills. Use Azure Chaos Studio to inject faults (e.g., VM shutdown, network latency). Verify that failover works end-to-end, including database connections, caching, and external dependencies. Measure actual RPO and RTO achieved. Document runbooks for manual steps. The exam may ask about testing strategies: never assume a pattern works without testing. Also test recovery of data from backups (Azure Backup) and ensure retention policies align with compliance.
Monitor and Optimize Continuously
After deployment, continuously monitor metrics like failover time, replication lag, and error rates. Use Azure Monitor and Application Insights to detect anomalies. Review and adjust configurations: for example, if RTO is too high, reduce probe interval or increase instance count. Also review cost: active-active can be expensive; consider if lower redundancy tier meets SLAs. Update documentation and runbooks as the application evolves. The exam emphasizes that resiliency is an ongoing process, not a one-time setup.
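Replication lag is one of the metrics named above, and comparing it with the RPO is a simple check to automate. The sketch below assumes lag samples (in seconds) have already been exported from a monitoring system; the function name and the sample values are hypothetical.

```python
def summarize_replication_lag(lag_samples_s, rpo_s):
    """Summarize how observed replication lag compares with the RPO target."""
    if not lag_samples_s:
        raise ValueError("at least one lag sample is required")
    breaches = [lag for lag in lag_samples_s if lag > rpo_s]
    return {
        "max_lag_s": max(lag_samples_s),
        "breach_count": len(breaches),
        "breach_ratio": len(breaches) / len(lag_samples_s),
        "within_rpo": not breaches,
    }


# Hypothetical samples checked against a 5-second RPO.
print(summarize_replication_lag([2, 4, 9, 3], rpo_s=5))
```

A non-zero `breach_count` suggests the achieved RPO is worse than the target and the replication choice or capacity should be revisited.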
Scenario 1: E-Commerce Platform with Global Customer Base
A large e-commerce company deployed its web application on Azure VMs in West Europe. Initially, they used a single VM with LRS storage. During a regional outage, the entire site went down for hours, causing significant revenue loss. They redesigned using active-active pattern: deployed VMs in three Availability Zones in West Europe, with Azure Load Balancer (Standard SKU) distributing traffic. For data, they used Azure SQL Database with Business Critical tier (zone-redundant) and Azure Redis Cache for session state. They also set up a secondary deployment in East US using Azure Site Recovery for disaster recovery, with Traffic Manager using performance routing. Health probes were configured to check application response time. During a zone outage, traffic automatically shifted to healthy zones with minimal impact (< 10 second disruption). Key lesson: proper health probe design is critical—they initially probed only TCP port, which didn't detect application-level failures. Switching to HTTP probes with custom path improved detection.
Scenario 2: Financial Services with Strict RPO/RTO
A bank required RPO of 5 seconds and RTO of 1 minute for its transaction processing system. They used Azure SQL Database Hyperscale tier with zone-redundant configuration (synchronous replication within region) and active geo-replication to a paired region (asynchronous). For compute, they deployed mission-critical VMs across Availability Zones with Azure Site Recovery replicating to secondary region with 15-second RPO. They used Azure Front Door for global load balancing with health probes every 10 seconds. During a planned failover test, they discovered that database failover took 30 seconds due to DNS propagation of read-write listener. They mitigated by using connection retry logic in the application. The exam might ask about combining synchronous and asynchronous replication: synchronous ensures zero data loss within region, but cross-region asynchronous may lose up to 15 seconds of transactions.
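The connection retry logic mentioned in this scenario is commonly built as retries with exponential backoff. The sketch below is one possible shape, not the bank's actual implementation; the defaults are assumptions chosen so that the total wait (about 45 seconds) exceeds the 30-second database failover described above.

```python
import random
import time


def retry_with_backoff(
    operation,
    max_attempts=8,
    base_delay_s=1.0,
    max_delay_s=10.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    The delay doubles after each failure, is capped at `max_delay_s`, and gets
    a small random jitter so that many clients do not retry in lockstep.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
```

A caller would pass a zero-argument function that opens the connection, for example `retry_with_backoff(lambda: connect(dsn))`, where `connect` and `dsn` stand in for the application's own database client.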
Scenario 3: SaaS Provider with Cost Constraints
A SaaS startup needed high availability but had limited budget. They chose active-passive pattern with two VMs in different Availability Zones (same region), using Azure Traffic Manager with priority routing. For storage, they used ZRS for blob data. Their RTO was 5 minutes, RPO 1 hour. They used Azure Backup for daily backups with 30-day retention. During a zone failure, failover took 2 minutes (probe interval 30 seconds, failure threshold 3). However, they realized that their application had stateful sessions stored in-memory, causing user logout on failover. They added Azure Redis Cache as a session store, making the app stateless. This pattern saved costs compared to active-active, but required careful application design. The exam often tests the trade-off between cost and complexity: active-passive is cheaper but has longer failover time and may require state management changes.
Exactly What AZ-305 Tests on This Topic
Objective 3.2: Design for business continuity. The exam expects you to:
Choose between redundancy patterns (active-active vs. active-passive) based on RPO/RTO.
Select appropriate Azure services for load balancing and failover (Traffic Manager, Load Balancer, Front Door, Application Gateway).
Understand data replication options (LRS, ZRS, GRS, RA-GRS) and their failure scopes.
Recognize anti-patterns like single points of failure, tight coupling, and ignoring data consistency.
Design disaster recovery solutions using Azure Site Recovery and Azure Backup.
Apply fault isolation using Availability Zones and region pairs.
Common Wrong Answers and Why Candidates Choose Them
Choosing LRS for a mission-critical application: Candidates see "redundant" in the name and think it's sufficient, but LRS protects only against drive failure, not datacenter failure. The exam expects ZRS or GRS for higher availability.
Selecting Traffic Manager with performance routing for active-passive disaster recovery: Performance routing sends each client to the lowest-latency healthy endpoint, so a standby region receives live traffic during normal operation. To keep a designated primary with a standby that takes over only on failure, use priority routing. Candidates confuse routing methods.
Assuming all regions support Availability Zones: Some regions (e.g., West US, North Central US) do not have zones. The exam tests this by giving a region without zones and asking for an alternative (e.g., region pairs).
Overlooking DNS TTL impact on failover time: Candidates set a low probe interval but forget that DNS clients cache records. Traffic Manager failover is not instant due to TTL. The exam may ask you to calculate worst-case failover time.
Using geo-redundant storage (GRS) without understanding read access: With GRS, the secondary region cannot be read until a failover to it has completed. RA-GRS provides read access to the secondary during normal operation. Candidates might choose GRS when read availability in the secondary is required.
Specific Numbers, Values, and Terms That Appear Verbatim
Azure Site Recovery RPO: 15 seconds (Azure-to-Azure).
Traffic Manager default probe interval: 30 seconds; failure threshold: 3.
Azure Load Balancer Standard SKU health probe interval: 5 seconds; unhealthy threshold: 2.
Availability Zones: supported in regions like East US 2, West US 2, West Europe, Southeast Asia.
Region pairs: e.g., East US and West US, UK South and UK West.
Storage replication: LRS (3 copies in one datacenter), ZRS (3 copies across zones), GRS (6 copies across two regions), RA-GRS (read access to secondary).
Edge Cases and Exceptions
What if the application is stateful? Active-active requires distributed session state (e.g., Redis). Without it, use active-passive or sticky sessions (but sticky sessions can cause uneven load).
What if the database cannot be replicated synchronously? Use asynchronous replication and accept higher RPO. For example, Azure SQL Database geo-replication is asynchronous.
What if the application is deployed in a single region without zones? Use region pairs for disaster recovery. Or use Azure Site Recovery to replicate to a different region.
What about third-party appliances? They may not support Availability Zones; consider using Azure native services.
How to Eliminate Wrong Answers Using the Underlying Mechanism
If a question asks for "zero data loss," look for synchronous replication (e.g., Availability Zones with ZRS, Azure SQL Business Critical tier). Asynchronous options (GRS, geo-replication) are wrong.
If a question asks for "automatic failover in under 1 minute," ensure the solution uses active-active or has very low probe intervals and TTL. Active-passive with default settings may exceed 1 minute.
If a question mentions "cost-effective" and "RTO of 1 hour," choose active-passive or pilot light, not active-active.
If a question involves multiple regions, consider Traffic Manager or Front Door for global routing. For within a region, use Load Balancer or Application Gateway.
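The last rule above, combined with the protocol distinction used elsewhere in this chapter, reduces to two questions: is the traffic spread across regions, and is it HTTP/HTTPS only? The sketch below encodes that shortcut. Placing Application Gateway as the regional HTTP option reflects its role as a Layer 7 service, which this chapter does not spell out, so treat the mapping as a study aid rather than a complete decision tree.

```python
def pick_traffic_service(multi_region: bool, http_only: bool) -> str:
    """Map scope (global vs. regional) and protocol to a load-balancing service."""
    if multi_region:
        # Global: Front Door proxies HTTP/S; Traffic Manager is DNS-based and protocol-agnostic.
        return "Azure Front Door" if http_only else "Azure Traffic Manager"
    # Regional: Application Gateway for HTTP/S, Load Balancer for Layer 4 (TCP/UDP).
    return "Azure Application Gateway" if http_only else "Azure Load Balancer"


print(pick_traffic_service(multi_region=True, http_only=True))    # Azure Front Door
print(pick_traffic_service(multi_region=True, http_only=False))   # Azure Traffic Manager
print(pick_traffic_service(multi_region=False, http_only=True))   # Azure Application Gateway
print(pick_traffic_service(multi_region=False, http_only=False))  # Azure Load Balancer
```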
Resiliency patterns include active-active, active-passive, and pilot light; choose based on RPO and RTO.
Availability Zones protect against datacenter failures within a region; region pairs protect against regional failures.
Data replication: LRS protects against drive failure; ZRS protects against zone failure; GRS/RA-GRS protects against region failure.
Azure Traffic Manager uses DNS routing; failover time = (probe interval × failure threshold) + DNS TTL.
Azure Load Balancer Standard SKU uses TCP/HTTP health probes with default 5-second interval and 2 failed probes.
Azure Site Recovery RPO is 15 seconds for Azure-to-Azure replication; RTO depends on application startup time.
Anti-patterns: single point of failure, tight coupling, ignoring data consistency, over-engineering.
Always test failover regularly; use Azure Chaos Studio for fault injection.
Cost vs. resiliency trade-off: active-active costs more but provides faster failover; active-passive is cheaper but slower.
For zero data loss, use synchronous replication (e.g., Availability Zones with ZRS, Azure SQL Business Critical).
Remember that not all regions support Availability Zones; check before designing.
Health probes should test application health, not just OS or port availability.
These come up on the exam all the time. Here's how to tell them apart.
Active-Active Pattern
All instances handle traffic simultaneously.
Failover is instantaneous (sub-second) as traffic is already distributed.
Requires stateless application design or distributed session state (e.g., Redis).
Higher cost due to running multiple instances at full capacity.
Better suited for low RTO (seconds) and high availability requirements.
Active-Passive Pattern
Only primary instance handles traffic; secondary is idle.
Failover takes time: (health probe interval × failure threshold) + DNS TTL expiry (typically 1-2 minutes).
Can support stateful applications with sticky sessions (but risk of session loss on failover).
Lower cost because secondary can be smaller or stopped.
Suitable for moderate RTO (minutes) and cost-sensitive scenarios.
Azure Traffic Manager
DNS-based traffic routing; it answers DNS queries and never proxies application traffic.
Supports priority, weighted, performance, geographic, multivalue, and subnet routing.
No SSL offloading or Web Application Firewall (WAF).
Health probes use HTTP, HTTPS, or TCP; because routing is DNS-based, failover time depends on DNS TTL.
Best for global load balancing of non-HTTP/S traffic (e.g., TCP).
Azure Front Door
HTTP/HTTPS reverse proxy with global load balancing (Layer 7).
Supports path-based routing, SSL offloading, and WAF.
Provides application-layer health probes and faster failover (no DNS caching).
Requires HTTP/HTTPS traffic; more features but higher cost.
Best for web applications needing advanced traffic management and security.
Mistake
Availability Zones are available in all Azure regions.
Correct
Not all regions support Availability Zones. As of 2025, regions such as West US, North Central US, and UK West do not have zones. Always verify before designing.
Mistake
GRS provides read access to the secondary region during normal operations.
Correct
With GRS, data in the secondary region is not readable until a failover to the secondary has completed. RA-GRS is required for read access during normal operations.
Mistake
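Traffic Manager with performance routing keeps a secondary region idle until the primary fails.
Correct
Performance routing sends each client to the lowest-latency healthy endpoint, so every enabled endpoint can receive traffic. Unhealthy endpoints are skipped under every routing method, but to keep a standby idle until the primary fails, use priority routing.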
Mistake
Azure Load Balancer's health probe checks the application's functionality.
Correct
By default, health probes only check TCP connectivity or HTTP response code. To verify application health, use a custom health endpoint that returns 200 only if the app is fully functional.
Mistake
Azure Site Recovery guarantees zero data loss.
Correct
Azure Site Recovery for Azure-to-Azure replication has an RPO of 15 seconds, meaning up to 15 seconds of data loss is possible. For zero data loss, use synchronous replication options such as zone-redundant storage (ZRS) across Availability Zones or the Azure SQL Database Business Critical tier.
LRS (Locally Redundant Storage) keeps three copies within a single datacenter, protecting against drive failures. ZRS (Zone-Redundant Storage) replicates across three Availability Zones in a region, protecting against a zone failure. GRS (Geo-Redundant Storage) replicates to a paired region asynchronously, with three copies in the primary and three in the secondary. RA-GRS adds read access to the secondary region during normal operation. For the exam, choose ZRS for zone-level protection, GRS for region-level disaster recovery, and RA-GRS if you need read access to the secondary.
Use Traffic Manager for DNS-based global traffic distribution that works with any protocol (HTTP, TCP, etc.). It supports priority, performance, and geographic routing. Use Front Door for HTTP/HTTPS applications requiring advanced features like SSL offloading, path-based routing, WAF, and faster failover (no DNS caching). The exam often presents a scenario: if the question mentions 'web application with SSL termination,' choose Front Door; if it mentions 'global DNS failover for a non-HTTP service,' choose Traffic Manager.
Default health probe interval is 30 seconds, and failure threshold is 3 consecutive failures. So it takes up to 90 seconds to detect failure. After detection, DNS TTL (default 300 seconds) must expire for clients to resolve to the new endpoint. Worst-case failover time is 90 seconds + 300 seconds = 6.5 minutes. To reduce failover time, lower the probe interval (minimum 10 seconds) and TTL (minimum 0, but not recommended). The exam may ask you to calculate failover time given specific settings.
No, Availability Zones are used by deploying multiple VMs across different zones. A single VM cannot span zones. To protect a single VM, you would need to deploy at least two VMs in different zones and use a load balancer. Alternatively, use Azure Site Recovery to replicate the VM to another region. The exam tests that you understand that Availability Zones provide redundancy for multi-instance deployments, not for single instances.
Azure Site Recovery for Azure-to-Azure replication has a default RPO of 15 seconds. This means that in the event of a disaster, up to 15 seconds of data may be lost. For lower RPO, consider using synchronous replication options like Azure SQL Database Business Critical tier or Availability Zones with ZRS storage. The exam expects you to know this value and compare it with other replication options.
A common anti-pattern is the 'single point of failure' (SPOF). This occurs when a critical component (e.g., a single VM, database, or load balancer) has no redundancy. If that component fails, the entire application goes down. Another anti-pattern is 'tight coupling' where components depend on each other's availability, leading to cascading failures. The exam often presents a design with a single VM or database and asks you to identify the risk. Always look for redundancy at every tier.