This chapter covers two foundational cloud concepts: fault tolerance and high availability. Understanding the difference is crucial for the AZ-900 exam, as it directly tests your ability to recommend the right reliability strategy for a given workload. The 'Cloud Concepts' objective area carries approximately 25-30% of the exam weight, and questions on fault tolerance vs. high availability appear frequently. By the end of this chapter, you will be able to define each term, explain how Azure implements them, and choose the appropriate design for different business requirements.
Imagine you are the CEO of a regional airline. Your most profitable route is a daily flight from New York to Chicago. You want to ensure that flight always departs on time, even if the assigned pilot calls in sick. Fault tolerance is like having two pilots in the cockpit at all times. If the primary pilot has a heart attack, the co-pilot can take over instantly—no delay, no missed departure. The flight continues without interruption. High availability is different: it is like having a backup engine on the plane. If the main engine fails, the backup engine kicks in, but there is a brief moment of hesitation as the system switches. The flight might be delayed by a few seconds, but it still reaches Chicago. In Azure, fault tolerance means a service continues operating even when a component fails—no downtime. High availability means the service is designed to be available most of the time, but a brief interruption (seconds to minutes) is acceptable. For the airline, fault tolerance is critical for safety; for your business, it depends on whether you can afford any downtime at all.
What Are Fault Tolerance and High Availability?
Fault tolerance and high availability are two strategies for ensuring that an application or service remains operational despite failures. They are often confused, but they address different levels of reliability.
Fault tolerance is the property that enables a system to continue operating properly in the event of a failure of one or more of its components. The system is designed so that no single point of failure can bring it down. In a fault-tolerant system, there is no interruption of service—the failure is completely transparent to the end user. For example, a fault-tolerant web application might run on multiple virtual machines (VMs) behind a load balancer. If one VM crashes, the load balancer immediately redirects traffic to the remaining healthy VMs. The user never notices a glitch.
High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, over a given period. High availability is measured as a percentage, such as 99.9% (three nines) or 99.99% (four nines). Unlike fault tolerance, HA allows for brief periods of downtime during failover. For example, an HA pair of VMs might use Azure Availability Sets. If one VM fails, the other takes over, but there may be a few seconds of disruption while the failover occurs. HA is often more cost-effective than full fault tolerance.
The Business Problem They Solve
Both concepts address the business need for continuity and reliability. When you move to the cloud, you are responsible for architecting your applications to handle failures. Azure provides the building blocks, but you must choose the right one based on your recovery time objective (RTO) and recovery point objective (RPO).
RTO is the maximum acceptable time your application can be down after a failure.
RPO is the maximum acceptable amount of data loss measured in time.
For a fault-tolerant system, RTO is essentially zero—no downtime. For a highly available system, RTO might be a few seconds to minutes. The cost increases as you move from HA to fault tolerance.
How Azure Implements Fault Tolerance
Azure achieves fault tolerance through redundancy at every layer. The key mechanisms include:
Availability Zones: Physically separate locations within an Azure region, each made up of one or more datacenters with independent power, cooling, and networking. By deploying resources across multiple zones, you can tolerate a complete zone failure. For example, a zonal VM (a VM pinned to a specific zone) can be paired with an identical VM in another zone, and Azure Load Balancer can distribute traffic across both.
Azure Site Recovery: Replicates workloads from a primary region to a secondary region. In the event of a regional disaster, you can fail over to the secondary region with minimal data loss. This is disaster recovery, which is a form of fault tolerance at the regional level.
Azure SQL Database Active Geo-Replication: Creates readable secondary databases in different regions. Replication to the secondaries is asynchronous, so if the primary fails you can fail over to a secondary with minimal data loss, although a forced failover may lose the most recent transactions.
How Azure Implements High Availability
High availability is typically achieved through redundancy with automatic failover, but with some acceptable downtime. Examples include:
Availability Sets: Logical grouping of VMs that spreads them across multiple fault domains (racks) and update domains (maintenance windows). If a rack fails or Azure performs maintenance, only a subset of VMs is affected. The remaining VMs continue serving traffic. However, there is a brief interruption while the load balancer re-routes traffic.
Azure Load Balancer: Distributes incoming traffic across multiple VMs. If a VM fails, the load balancer stops sending traffic to that VM. The failover is quick but not instantaneous.
Azure App Service: Provides built-in high availability through automatic patching and load balancing across multiple instances. If an instance fails, the platform automatically redirects traffic.
Key Components and Tiers
Azure offers different SLA levels depending on the configuration:
Single VM: 99.9% uptime (if all OS and data disks use premium SSD or Ultra Disk; no availability set is required).
Two VMs in an Availability Set: 99.95% uptime.
Two or more VMs across two or more Availability Zones: 99.99% uptime.
Azure SQL Database: Up to 99.995% with Zone Redundancy.
These SLAs are financial guarantees—if Microsoft fails to meet the SLA, you may receive service credits.
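To make these percentages concrete, the sketch below converts an SLA figure into the downtime it permits. It is illustrative arithmetic only: the function name `allowed_downtime_minutes`, the 30-day month, and the 365-day year are assumptions chosen for simplicity, not anything defined by Azure.

```python
def allowed_downtime_minutes(sla_percent: float, period_hours: float) -> float:
    """Return the downtime (in minutes) that an SLA percentage permits over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_hours * 60 * (1 - sla_percent / 100)


HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month
HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year

if __name__ == "__main__":
    for sla in (99.9, 99.95, 99.99, 99.995):
        per_month = allowed_downtime_minutes(sla, HOURS_PER_MONTH)
        per_year_hours = allowed_downtime_minutes(sla, HOURS_PER_YEAR) / 60
        print(f"{sla}%: {per_month:.1f} min/month, {per_year_hours:.2f} h/year")
```

Under these assumptions, 99.9% works out to about 43 minutes per month (8.76 hours per year), while 99.99% leaves only about 4.3 minutes per month.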
Comparison to On-Premises
In an on-premises datacenter, achieving fault tolerance requires duplicating hardware, power, and network paths. This is expensive and often only done for critical systems. High availability in on-premises might involve clustering with shared storage, which also has cost and complexity. Azure abstracts the physical infrastructure, allowing you to achieve high availability and fault tolerance without owning the hardware. You pay only for the redundant resources you consume.
Azure Portal and CLI Touchpoints
You can configure fault tolerance and high availability through the Azure portal, CLI, or ARM templates.
Azure CLI example for creating an availability set:
az vm availability-set create \
--resource-group myResourceGroup \
--name myAvailabilitySet \
--platform-fault-domain-count 2 \
--platform-update-domain-count 2
Azure CLI example for creating VMs in availability zones:
az vm create \
--resource-group myResourceGroup \
--name myVM \
--image Ubuntu2204 \
--zone 1 \
--generate-ssh-keys
ARM template snippet for an availability set:
{
"type": "Microsoft.Compute/availabilitySets",
"apiVersion": "2023-03-01",
"name": "myAvailabilitySet",
"location": "[resourceGroup().location]",
"properties": {
"platformFaultDomainCount": 2,
"platformUpdateDomainCount": 2
}
}
Concrete Business Scenarios
Scenario 1: E-commerce Platform – An online retailer needs 99.99% uptime during Black Friday. They deploy VMs across three availability zones with a load balancer. If one zone fails, traffic is routed to the remaining zones. This is fault tolerance because there is no downtime.
Scenario 2: Internal HR System – A company’s HR portal can tolerate 5 minutes of downtime per month. They use an availability set with two VMs. If one VM fails, the other takes over after a few seconds. This is high availability.
Scenario 3: Global SaaS Application – A SaaS provider replicates their database across two regions using Azure SQL Active Geo-Replication. If the primary region fails, they initiate a failover to the secondary region. This is disaster recovery, a form of fault tolerance at the regional level.
Design Redundancy Strategy
First, determine your RTO and RPO. For fault tolerance, RTO must be near zero. For high availability, RTO can be seconds to minutes. Based on this, choose between availability zones (fault tolerance) and availability sets (high availability). Document the decision and its cost implications.
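The decision rule in this step can be sketched as a small function. This is a simplification of the chapter's guidance, not an official sizing tool: the name `recommend_strategy` and the `near_zero_seconds` cutoff are hypothetical, and a real decision would also weigh RPO and cost.

```python
def recommend_strategy(rto_seconds: float, near_zero_seconds: float = 1.0) -> str:
    """Map an RTO to one of the chapter's two designs.

    near_zero_seconds is an assumed cutoff for "RTO near zero"; tune it to
    whatever your business actually treats as no noticeable interruption.
    """
    if rto_seconds < 0:
        raise ValueError("rto_seconds cannot be negative")
    if rto_seconds <= near_zero_seconds:
        return "fault tolerance: redundant VMs across availability zones"
    return "high availability: VMs in an availability set"


if __name__ == "__main__":
    for rto in (0, 30, 300):
        print(f"RTO {rto}s -> {recommend_strategy(rto)}")
```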
Create Availability Set or Zones
In the Azure portal, navigate to Virtual Machines > Create. Under 'Availability options', select 'Availability set' or 'Availability zone'. For an availability set, specify the number of fault domains and update domains. For zones, select one or more zones. Azure automatically distributes VMs across the chosen fault domains or zones.
Configure Load Balancer
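Create an Azure Load Balancer. Use the Standard SKU if the backend VMs are spread across availability zones, because the Basic SKU does not support zones. Add a backend pool containing the VMs. Define health probes (e.g., HTTP GET on port 80). The load balancer monitors VM health and stops sending traffic to unhealthy VMs. For fault tolerance, give the load balancer a zone-redundant frontend so that it survives a zone failure itself.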
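Health probe settings determine how quickly a failed VM is taken out of rotation. The sketch below gives a rough worst-case estimate, assuming the load balancer marks a VM unhealthy after a fixed number of consecutive failed probes and ignoring probe timeouts and network delay. The function and parameter names are illustrative, not Azure API names.

```python
def worst_case_detection_seconds(interval_seconds: float, failures_to_unhealthy: int) -> float:
    """Rough upper bound on the time needed to detect a failed VM.

    Assumes the VM fails just after a successful probe, so every one of the
    required consecutive failures costs a full probe interval.
    """
    if interval_seconds <= 0 or failures_to_unhealthy < 1:
        raise ValueError("interval must be positive and threshold at least 1")
    return interval_seconds * failures_to_unhealthy


def probe_meets_rto(interval_seconds: float, failures_to_unhealthy: int, rto_seconds: float) -> bool:
    """Check whether the detection time alone fits inside the RTO."""
    return worst_case_detection_seconds(interval_seconds, failures_to_unhealthy) <= rto_seconds


if __name__ == "__main__":
    print(worst_case_detection_seconds(5, 2))    # 5 s interval, 2 failures
    print(probe_meets_rto(15, 2, rto_seconds=10))
```

Detection is only one part of failover, so treat a passing result as a necessary condition for meeting the RTO, not a sufficient one.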
Set Up Health Monitoring
Enable Azure Monitor and Application Insights to track VM performance and availability. Set up alerts for when a VM becomes unhealthy. For critical workloads, configure autoscale rules (for example, on a Virtual Machine Scale Set) to automatically add VMs during peak traffic or after a failure.
Test Failover Scenarios
Simulate a failure by stopping a VM or disabling a network interface. Verify that the load balancer redirects traffic to the remaining VMs. Measure the time it takes for the failover to complete. Adjust health probe settings if needed to meet your RTO.
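One way to measure failover time is to poll the service while you trigger the failure and record how long requests fail. The following is a minimal sketch using only the Python standard library; `measure_outage` and `http_probe` are hypothetical helper names, and the result is only as precise as the polling interval.

```python
import time
import urllib.request
from typing import Callable, Optional


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to url succeeds, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:  # a probe treats every failure as "unhealthy"
        return False


def measure_outage(
    probe: Callable[[], bool],
    interval: float = 1.0,
    timeout: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe until it fails and then succeeds again.

    Returns the observed outage in seconds, 0.0 if no failure was seen
    before timeout, or None if the service had not recovered by timeout.
    """
    start = clock()
    down_since = None
    while clock() - start < timeout:
        healthy = probe()
        now = clock()
        if not healthy and down_since is None:
            down_since = now          # first failed probe: outage begins
        elif healthy and down_since is not None:
            return now - down_since   # first success after a failure
        sleep(interval)
    return 0.0 if down_since is None else None
```

To use it, start `measure_outage(lambda: http_probe(url))` with `url` set to your load balancer's address, then stop one VM and compare the returned duration with your RTO.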
Scenario 1: Financial Trading Platform
A hedge fund runs a real-time trading application that processes thousands of transactions per second. Even a few seconds of downtime could result in millions of dollars in losses. The team deploys the application across three availability zones in the East US region. Each zone contains multiple VM instances running the trading engine. An Azure Load Balancer with a health probe checks the status of each VM every 5 seconds. If a zone experiences a power outage, the load balancer immediately stops sending traffic to that zone's VMs and distributes the load among the remaining zones. The database layer uses Azure SQL Database with Zone Redundancy, which synchronously replicates data across zones. The entire failover is transparent to users. The cost is high—triple the compute and storage—but the business justifies it because the cost of downtime is even higher.
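The benefit of spreading instances across zones can be approximated with standard reliability arithmetic. The sketch below assumes zone failures are independent and that the service stays up as long as at least one zone is healthy. Both are simplifying assumptions, and the result is not an Azure SLA figure.

```python
def parallel_availability(single: float, replicas: int) -> float:
    """Availability of N redundant replicas when any one of them can serve traffic.

    single is the availability of one replica as a fraction (e.g. 0.99).
    Assumes replica failures are statistically independent.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} replica(s): {parallel_availability(0.99, n):.6f}")
```

Each added replica multiplies the chance of total failure by the failure probability of a single replica, which is why three zones are so much more resilient than one even though the cost only triples.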
Scenario 2: E-Learning Platform
A university runs an online course portal that serves 10,000 students. The portal can tolerate up to 5 minutes of downtime per month. The IT team uses an availability set with two VMs running the web server and a separate VM for the database. The web VMs are behind an internal load balancer. During Azure planned maintenance, one VM is updated while the other remains online. There is a brief interruption (about 30 seconds) while the load balancer switches traffic. The database uses a simple backup and restore strategy with a 5-minute RPO. The total cost is moderate—double the VM cost for the web tier. This is a classic high-availability design.
Scenario 3: Global Social Media App
A social media startup wants to ensure their app is available even if an entire Azure region fails. They deploy the app in two regions: West Europe and North Europe. Traffic is routed using Azure Traffic Manager (DNS-based load balancing). The database uses Azure Cosmos DB with multi-region writes, which is fault tolerant at the global level. If West Europe goes down, Traffic Manager automatically directs users to North Europe. The failover takes 1-2 minutes because DNS propagation is not instantaneous. The cost is significant—double the infrastructure—but the startup considers it necessary for their global user base.
Common Pitfalls
Misconfiguring health probes: If the probe interval is too long, the load balancer may continue sending traffic to a failed VM, causing errors. Always set the probe interval to meet your RTO.
Not testing failover: Many teams assume failover works but never simulate a failure. Regular testing is essential.
Ignoring data replication: High availability for compute is useless if the database is a single point of failure. Ensure data tier is also redundant.
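The last pitfall can be quantified. When a request must pass through every tier, the tiers are in series and their availabilities multiply, so the weakest tier dominates. The figures below are illustrative inputs, not quoted SLAs for any particular architecture, and `composite_availability` is a hypothetical helper name.

```python
from math import prod


def composite_availability(*tiers: float) -> float:
    """Availability of a chain of tiers that must all be up at once.

    Each tier is given as a fraction (e.g. 0.9995); the result is their product.
    """
    if not tiers:
        raise ValueError("provide at least one tier")
    if any(not 0.0 <= t <= 1.0 for t in tiers):
        raise ValueError("each tier must be a fraction between 0 and 1")
    return prod(tiers)


if __name__ == "__main__":
    web_tier = 0.9995   # illustrative: redundant web VMs
    database = 0.999    # illustrative: single database VM
    print(f"{composite_availability(web_tier, database):.7f}")
```

With these inputs the combined figure is lower than either tier alone, which shows why a redundant web tier does not compensate for a single database instance.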
Objective Code: 1.5 – Understand the concepts of fault tolerance and high availability.
What AZ-900 Tests:
- Definition of fault tolerance vs. high availability.
- Which Azure services provide each (e.g., Availability Zones for fault tolerance, Availability Sets for high availability).
- SLA implications: 99.99% vs. 99.95% vs. 99.9%.
- Scenarios: choose between fault tolerance and high availability based on RTO.
Top 4 Wrong Answers and Why:
1. 'Fault tolerance and high availability are the same.' – This is the most common mistake. Candidates confuse the terms because both involve redundancy. The key difference is that fault tolerance means zero downtime, while high availability allows for brief interruptions.
2. 'Availability Zones guarantee fault tolerance.' – Availability Zones provide fault tolerance only if you deploy resources across multiple zones and use a load balancer. Simply enabling zones on a single VM does not make it fault tolerant.
3. 'Availability Sets provide fault tolerance.' – Availability Sets provide high availability, not fault tolerance. They protect against rack failures and maintenance, but failover takes seconds.
4. 'A single VM with premium SSD is fault tolerant.' – A single VM is a single point of failure. Even with premium SSD, if the VM fails, the application is down. Redundancy is required.
Specific Terms and Values on the Exam:
- Fault domain: A rack of servers. Availability Sets distribute VMs across up to 3 fault domains.
- Update domain: A group of VMs that are updated together during maintenance. Availability Sets support up to 20 update domains.
- SLA for single VM: 99.9% (with premium SSD or Ultra Disk on all disks).
- SLA for two VMs in availability set: 99.95%.
- SLA for two VMs across availability zones: 99.99%.
Edge Cases:
- The exam may ask about a scenario where the RTO is 0 seconds. The correct answer is fault tolerance, not high availability.
- If the question mentions 'planned maintenance' or 'unplanned hardware failure', the best choice is often an Availability Set (high availability) because it protects against both.
- If the question mentions 'regional disaster', the answer is a multi-region option such as Azure Site Recovery or geo-replication. Availability Zones protect against datacenter failures within a region, not against the loss of the whole region.
Memory Trick: Think of a car. Fault tolerance is like having two engines running simultaneously—if one fails, the other is already running. High availability is like a spare tire—you have to stop and change it, but you can continue driving shortly after.
Fault tolerance = no downtime; high availability = minimal downtime.
Availability Zones provide fault tolerance within a region; Availability Sets provide high availability within a datacenter.
Azure Load Balancer is essential for distributing traffic across redundant VMs.
SLA for a single VM is 99.9% (with premium SSD or Ultra Disk on all disks).
SLA for two VMs in an availability set is 99.95%.
SLA for two VMs across two or more availability zones is 99.99%.
Choose fault tolerance when RTO must be zero; choose high availability when some downtime is acceptable.
Always test failover scenarios to ensure your configuration meets RTO and RPO.
These come up on the exam all the time. Here's how to tell them apart.
Fault Tolerance
Zero downtime on failure
Requires fully redundant components (active-active)
Higher cost (e.g., 3x infrastructure)
Example: Availability Zones with load balancer
RTO = 0 seconds
High Availability
Brief downtime (seconds to minutes) on failure
Uses redundant components with failover (active-passive)
Moderate cost (e.g., 2x infrastructure)
Example: Availability Sets
RTO = seconds to minutes
Mistake
Fault tolerance and high availability are the same thing.
Correct
They are different. Fault tolerance means zero downtime; high availability allows for brief downtime (e.g., seconds to minutes).
Mistake
Availability Zones guarantee fault tolerance for any application.
Correct
Availability Zones only provide fault tolerance if you deploy redundant resources across zones and use a load balancer. A single VM in a zone is not fault tolerant.
Mistake
An Availability Set protects against a regional outage.
Correct
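Availability Sets only protect against failures within a single datacenter (rack failures, maintenance). Availability Zones extend protection to the loss of an entire datacenter within a region. For protection against a regional outage, replicate to a second region, for example with Azure Site Recovery.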
Mistake
A 99.9% SLA means my application will never be down for more than 8.76 hours per year.
Correct
A 99.9% SLA covers only the underlying infrastructure, and it is a commitment backed by service credits rather than a hard cap on downtime. Your application may still be down due to your own code or configuration errors. For reference, 99.9% equates to about 8.76 hours of potential downtime per year.
Mistake
Fault tolerance is always the best choice.
Correct
Fault tolerance costs more because it requires fully redundant resources. For many applications, high availability is sufficient and more cost-effective.
Fault tolerance ensures a system continues operating without interruption when a component fails. High availability aims to keep the system operational most of the time, but allows for brief periods of downtime during failover. In Azure, fault tolerance is achieved through Availability Zones and redundant deployments, while high availability uses Availability Sets or similar redundancy with automatic failover that takes seconds.
Availability Zones provide fault tolerance at the datacenter level within a region. By deploying resources across multiple zones, you can tolerate a complete zone failure. For regional fault tolerance, use Azure Site Recovery to replicate to a secondary region. Additionally, Azure SQL Database's Zone Redundancy offers fault tolerance for databases.
An Availability Set is a logical grouping of VMs that distributes them across fault domains (racks) and update domains (maintenance groups). It provides high availability by ensuring that not all VMs are affected by a single rack failure or Azure maintenance. Failover takes a few seconds, so there is brief downtime. It is a cost-effective way to improve availability compared to full fault tolerance.
For a single VM that uses premium SSD or Ultra Disk for all of its disks, the SLA is 99.9%; no availability set is required. With lower disk tiers the single-VM SLA is lower (for example, 99.5% with standard SSD and 95% with standard HDD). The SLA is a financial guarantee; if Microsoft fails to meet the uptime, you may receive service credits.
No. A single VM is a single point of failure. Even with high-performance storage, if the VM fails, your application goes down. Fault tolerance requires at least two redundant instances behind a load balancer, ideally in different availability zones.
Fault tolerance typically costs more because you need fully redundant resources (e.g., three VMs in three zones). High availability with an availability set might only require two VMs. The exact cost depends on the number of instances, storage, and networking. For example, three VMs cost three times the compute, while two VMs cost twice.
Base your decision on RTO (recovery time objective). If your business requires zero downtime, choose fault tolerance. If you can tolerate a few seconds to minutes of downtime, high availability is sufficient. Also consider cost: fault tolerance is more expensive. For most applications, high availability with an availability set is a good balance.