AZ-900 | Chapter 8 of 127 | Objective 1.5

Reliability and Predictability

This chapter covers two foundational cloud principles: reliability and predictability. Reliability ensures your applications stay running despite failures, while predictability means you can anticipate performance and costs. These concepts are central to the AZ-900 exam, appearing in roughly 10-15% of questions across the 'Cloud Concepts' domain (objective 1.5). By the end, you'll understand how Azure achieves high availability, disaster recovery, and consistent pricing, and you'll be ready to answer exam questions about SLAs, fault domains, and reserved capacity.

25 min read
Beginner
Updated May 31, 2026

The Reliable Hotel Chain

Imagine you're a travel agency booking rooms for clients across multiple cities. You need a hotel chain that guarantees rooms are always available when clients arrive, even if a hotel has a fire or flood. This chain operates by having multiple identical hotels in each city, each with the same room layout, amenities, and service standards. When a client books, the system automatically assigns the nearest available hotel. If one hotel goes offline, bookings seamlessly shift to another, with no client noticing. The chain also predicts future demand using historical data—if a conference is coming, they pre-book extra rooms so you never hear 'no vacancy.' They offer two service tiers: Standard (99.9% uptime, meaning you might get a few hours of downtime per year) and Premium (99.99% uptime, with dedicated backup generators and redundant staff). As a travel agency, you pay a monthly fee based on the number of rooms you anticipate needing, not per booking. This allows you to budget predictably. The chain's central system monitors occupancy, maintenance, and performance across all hotels, sending you a monthly report with guarantees. If a hotel fails to meet its uptime guarantee, you get a service credit—a discount on your next bill. This is exactly how Azure's reliability and predictability work: redundancy across regions, automatic failover, reserved capacity, and SLAs with credits for downtime.

How It Actually Works

What Are Reliability and Predictability?

Reliability in cloud computing means that a service or application continues to operate correctly and without interruption, even when underlying components fail. Predictability means that you can forecast the behavior of the system—its performance, cost, and availability—with a high degree of confidence. Together, they allow businesses to move critical workloads to the cloud without fear of unexpected downtime or budget overruns.

The Business Problem They Solve

On-premises, building reliability requires massive capital investment in redundant hardware, backup generators, and disaster recovery sites. Even then, failures happen. Predictability is also hard: capacity planning is guesswork, and costs spike when you need to scale. Azure solves this by abstracting hardware management and offering built-in redundancy, global scale, and multiple pricing models.

How Azure Achieves Reliability: Regions, Availability Zones, and Redundancy

Azure operates from data centers grouped into regions (e.g., East US, West Europe). Most regions are paired with another region in the same geography, typically hundreds of miles away, to form a region pair. Services that support geo-redundancy (for example, geo-redundant storage) can replicate data to the paired region for disaster recovery. Within a region, availability zones are physically separate locations, each made up of one or more data centers with independent power, cooling, and networking. By deploying resources across multiple zones, you protect against a single data center failure.

Step-by-Step Mechanism for High Availability

1. Design for redundancy: Deploy at least two instances of a virtual machine (VM) or service across availability zones or regions.

2. Use load balancers: Azure Load Balancer or Application Gateway distributes traffic to healthy instances.

3. Monitor health: Load balancer health probes detect failed instances and take them out of rotation, while Azure Monitor surfaces metrics and alerts.

4. Failover: If an instance or zone fails, traffic moves to healthy instances without manual intervention.

5. Data replication: Azure Storage, SQL Database, and Cosmos DB replicate data synchronously or asynchronously to ensure durability.

Key Components and Tiers

Service Level Agreements (SLAs): Formal commitments for uptime and connectivity. For example, a single VM has a 99.9% SLA (when all of its disks are Premium SSD or Ultra Disk); two VMs in an availability set have 99.95%; two VMs in availability zones have 99.99%.

Availability Sets: Logical grouping of VMs that spreads them across fault domains (hardware racks) and update domains (maintenance schedules).

Availability Zones: Physical separation within a region.

Azure Site Recovery: Replicates workloads from one region to another for disaster recovery.

Reserved Instances: Pre-pay for VM capacity for 1 or 3 years to get up to 72% discount, providing cost predictability.
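To make the SLA percentages above more concrete, the following Python sketch converts an uptime percentage into the downtime it permits. The function name `allowed_downtime` and the 30-day month are illustrative assumptions, not part of any Azure tooling.

```python
HOURS_PER_YEAR = 365 * 24          # 8,760 hours, ignoring leap years
MINUTES_PER_MONTH = 30 * 24 * 60   # assumes a 30-day month


def allowed_downtime(sla_percent: float) -> tuple[float, float]:
    """Return (hours per year, minutes per 30-day month) of downtime an SLA permits."""
    unavailable = 1 - sla_percent / 100
    return unavailable * HOURS_PER_YEAR, unavailable * MINUTES_PER_MONTH


for sla in (99.9, 99.95, 99.99):
    hours, minutes = allowed_downtime(sla)
    print(f"{sla}% -> {hours:.2f} h/year, {minutes:.1f} min/month")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why moving from 99.9% to 99.99% is a bigger step than the numbers suggest.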

How It Compares to On-Premises

On-premises, you buy and maintain redundant servers, storage, and networking. Achieving 99.99% uptime requires duplicating everything, plus a secondary site. Azure provides this as a service: you pay only for what you use, and Microsoft handles hardware failures, power outages, and network issues. The trade-off is that you share infrastructure with other tenants (multi-tenancy), but Azure isolates workloads via virtual networks and security boundaries.

Azure Portal and CLI Touchpoints

In the Azure portal, you can configure availability zones when creating a VM (select 'Availability zone' in the Basics tab). You can create an availability set under 'High availability' options. For cost predictability, use the Azure Pricing Calculator to estimate costs, and purchase Reserved Instances under 'Reservations'.

CLI example to create a VM in availability zone 1:

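# Create a Linux VM pinned to availability zone 1.
# The resource group's region must support availability zones.
az vm create \
  --resource-group myRG \
  --name myVM \
  --image Ubuntu2204 \
  --zone 1 \
  --generate-ssh-keys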

To list the VM sizes available in a region (the output does not include SLA figures; those are published in Microsoft's SLA documentation):

az vm list-sizes --location eastus --output table

Pricing Models for Predictability

Pay-as-you-go: No upfront cost, highest per-hour rate. Predictable per hour but variable total.

Reserved Instances: 1 or 3-year commitment, significant discount. Very predictable.

Spot VMs: Discounted but can be evicted. Not predictable; used for batch jobs.

Azure Hybrid Benefit: Apply eligible existing licenses (for example, Windows Server or SQL Server with Software Assurance) to reduce cost.

Business Scenario: E-Commerce Platform

An e-commerce company runs its website on Azure VMs. During Black Friday, traffic spikes 10x. With auto-scaling and load balancing, Azure adds VMs automatically. The company uses reserved instances for baseline capacity (predictable cost) and pay-as-you-go for burst capacity. The site stays up because VMs are spread across availability zones. If one zone fails, traffic shifts to another. The SLA guarantees 99.99% uptime for the multi-VM deployment, and the company receives a credit if the SLA is not met.

Summary of Reliability & Predictability Mechanisms

Redundancy across zones/regions.

Automatic failover via load balancers and health probes.

Data replication for durability.

SLAs with financial backing.

Reserved capacity for cost stability.

Monitoring and alerts via Azure Monitor.

Exam Tip

On the AZ-900, you will be asked to identify which configuration achieves a given SLA percentage. Remember: one VM = 99.9%; two VMs in same availability set = 99.95%; two VMs in different availability zones = 99.99%. Also, know that SLAs do not cover application-level failures—only infrastructure uptime.

Walk-Through

1

Design for redundancy across zones

When creating a virtual machine, you choose an availability zone or availability set. Each VM is pinned to a single zone, so for maximum reliability deploy at least two VMs and place them in different zones (two or three zones in total). Azure ensures each zone has independent power, cooling, and networking. This step is critical because a single-zone deployment is vulnerable to a data center outage. In the portal, under 'Availability options,' pick 'Availability zone' and select Zone 1, 2, or 3 for each VM. The VM's managed disks can also be made zone-redundant by choosing zone-redundant storage (ZRS).

2

Configure load balancing and health probes

Deploy an Azure Load Balancer (Layer 4) or Application Gateway (Layer 7) in front of your VMs. Define a backend pool containing your VM instances. Configure health probes to ping a specific port (e.g., 80 for HTTP). If a VM fails to respond, the load balancer stops sending traffic to it and redirects to healthy VMs. This automatic failover happens in seconds. In the portal, create a load balancer, add backend pools, and set probe settings like interval (5 seconds) and unhealthy threshold (2 failures).
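The probe behavior described in this step can be sketched in Python. This is a simplified model, not Azure Load Balancer's actual implementation: the `Backend` class, `record_probe`, and `healthy_pool` are hypothetical names, and a single successful probe is assumed to restore a backend.

```python
from dataclasses import dataclass

UNHEALTHY_THRESHOLD = 2  # consecutive failed probes, as in the example above


@dataclass
class Backend:
    name: str
    consecutive_failures: int = 0
    healthy: bool = True


def record_probe(backend: Backend, responded: bool) -> None:
    """Update a backend's health state after one probe result."""
    if responded:
        backend.consecutive_failures = 0
        backend.healthy = True  # simplification: one success restores the backend
    else:
        backend.consecutive_failures += 1
        if backend.consecutive_failures >= UNHEALTHY_THRESHOLD:
            backend.healthy = False


def healthy_pool(backends: list[Backend]) -> list[str]:
    """Names of the backends that should still receive traffic."""
    return [b.name for b in backends if b.healthy]


pool = [Backend("vm-zone1"), Backend("vm-zone2")]
record_probe(pool[0], responded=False)  # first miss: still in rotation
record_probe(pool[0], responded=False)  # second miss: removed from rotation
print(healthy_pool(pool))
```

With a 5-second interval and a threshold of 2, a failed VM would drop out of rotation after roughly 10 seconds in this model, which matches the "failover happens in seconds" claim.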

3

Set up auto-scaling rules

Use Azure Virtual Machine Scale Sets (VMSS) to automatically adjust the number of VM instances based on CPU usage or other metrics. For example, scale out when average CPU > 75% for 5 minutes, scale in when < 25%. This ensures performance predictability under varying load. In the portal, create a VMSS, define scaling rules, and set minimum (e.g., 2) and maximum (e.g., 10) instances. Behind the scenes, Azure Monitor collects metrics and triggers scaling actions.
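A minimal Python sketch of the scaling rule described in this step, assuming the CPU figure has already been averaged over the 5-minute window and that each rule changes the count by one instance (both are assumptions; real scale sets let you configure the step size and cooldown). The function name `desired_instances` is hypothetical.

```python
def desired_instances(current: int, avg_cpu: float,
                      minimum: int = 2, maximum: int = 10,
                      scale_out_above: float = 75.0,
                      scale_in_below: float = 25.0) -> int:
    """Return the instance count after applying one scale-out/scale-in evaluation."""
    if avg_cpu > scale_out_above:
        target = current + 1      # scale out by one instance
    elif avg_cpu < scale_in_below:
        target = current - 1      # scale in by one instance
    else:
        target = current          # within the comfortable band: no change
    # Clamp to the configured limits so costs and capacity stay bounded.
    return max(minimum, min(maximum, target))


print(desired_instances(current=4, avg_cpu=82.0))   # scales out
print(desired_instances(current=10, avg_cpu=95.0))  # already at the maximum
print(desired_instances(current=2, avg_cpu=10.0))   # already at the minimum
```

The clamp on the last line is what keeps an aggressive rule from running up an unbounded bill: the maximum instance count acts as a hard cost ceiling.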

4

Enable disaster recovery with Azure Site Recovery

For workloads that must survive a full region outage, configure Azure Site Recovery (ASR). ASR replicates your VMs to a secondary region (e.g., East US to West US). You define a recovery plan with order of operations. During a disaster, you initiate a failover; ASR spins up VMs in the secondary region using replicated disks. This provides recovery point objective (RPO) of seconds to minutes and recovery time objective (RTO) of minutes. In the portal, create a Recovery Services vault, enable replication for your VMs, and test failover regularly.
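RPO and RTO are easy to confuse, so here is a small Python sketch that measures both for a single failover drill. The timestamps and target values are hypothetical, chosen only for illustration, and `recovery_metrics` is not an Azure API.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_recovery_point: datetime,
                     failure_time: datetime,
                     service_restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (data-loss window, downtime) for one failover event."""
    data_loss = failure_time - last_recovery_point  # compare against the RPO
    downtime = service_restored - failure_time      # compare against the RTO
    return data_loss, downtime


# Hypothetical targets and drill timestamps.
RPO_TARGET = timedelta(minutes=5)
RTO_TARGET = timedelta(minutes=30)

data_loss, downtime = recovery_metrics(
    last_recovery_point=datetime(2026, 1, 15, 9, 59, 30),
    failure_time=datetime(2026, 1, 15, 10, 0, 0),
    service_restored=datetime(2026, 1, 15, 10, 12, 0),
)
print(f"Data-loss window: {data_loss} (within RPO: {data_loss <= RPO_TARGET})")
print(f"Downtime: {downtime} (within RTO: {downtime <= RTO_TARGET})")
```

RPO looks backward from the failure (how much data could be lost), while RTO looks forward from it (how long until service is back).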

5

Purchase reserved instances for cost predictability

To stabilize costs, buy Reserved Instances (RI) for baseline VM usage. In the Azure portal, go to 'Reservations' and select the VM series, region, and term (1 or 3 years). You can pay upfront or monthly. RI gives a discount of up to 72% compared to pay-as-you-go. Azure automatically applies the discount to matching VMs. This step ensures that even if usage spikes, the baseline cost is fixed. For variable workloads, combine RI with pay-as-you-go.
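The way a reservation fixes the baseline while burst usage stays variable can be sketched in Python. The hourly rate and discount below are hypothetical placeholders, not real Azure prices, and `monthly_cost` is an illustrative helper.

```python
HOURS_PER_MONTH = 730  # common approximation of the hours in a month


def monthly_cost(baseline_vms: int, burst_vm_hours: float,
                 payg_rate: float, ri_discount: float) -> tuple[float, float]:
    """Return (fixed reserved cost, variable pay-as-you-go cost) for one month."""
    reserved = baseline_vms * HOURS_PER_MONTH * payg_rate * (1 - ri_discount)
    burst = burst_vm_hours * payg_rate
    return reserved, burst


# Hypothetical figures: $0.10/hour list price and a 40% reservation discount.
quiet = monthly_cost(baseline_vms=10, burst_vm_hours=0, payg_rate=0.10, ri_discount=0.40)
busy = monthly_cost(baseline_vms=10, burst_vm_hours=2000, payg_rate=0.10, ri_discount=0.40)
print(f"Quiet month: reserved ${quiet[0]:.2f}, burst ${quiet[1]:.2f}")
print(f"Busy month:  reserved ${busy[0]:.2f}, burst ${busy[1]:.2f}")
```

The reserved portion is identical in both months; only the pay-as-you-go portion moves with demand.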

What This Looks Like on the Job

Scenario 1: Global SaaS Provider

A SaaS company provides a customer relationship management (CRM) application to clients worldwide. They deploy their multi-tier application across Azure regions: primary in West Europe, secondary in North Europe. They use Azure Traffic Manager to route users to the nearest healthy endpoint. Inside each region, they use availability zones for each tier: web servers, application servers, and database. The database tier uses Azure SQL Database with active geo-replication to the secondary region. The team configures auto-scaling for web servers based on CPU load. Cost is managed by reserving 70% of capacity with Reserved Instances, leaving 30% for pay-as-you-go burst. The SLA for the multi-region deployment is 99.99%. If West Europe goes down, Traffic Manager automatically fails over to North Europe within minutes. The business problem solved: zero downtime for critical sales operations, and predictable monthly cloud spend.

Scenario 2: E-Commerce Platform with Seasonal Spikes

An online retailer experiences 100x traffic during Black Friday. Their on-premises data center cannot handle the peak. They migrate to Azure using VM Scale Sets and Azure Kubernetes Service (AKS). They deploy across two availability zones in East US. During normal days, they run 10 VMs (all reserved). During spikes, auto-scaling adds up to 200 pay-as-you-go VMs. They use Azure Load Balancer to distribute traffic. The database is Azure Cosmos DB with multi-region writes for high availability. The team monitors with Azure Monitor and sets alerts for cost anomalies. The business problem: ability to handle massive traffic without provisioning for peak capacity, and cost that scales with revenue. Misconfiguration risk: if auto-scaling rules are too aggressive, costs can skyrocket; they set a maximum instance limit and use budget alerts.

Scenario 3: Financial Services with Compliance

A bank runs a trading application that requires 99.999% uptime and data residency in the US. They deploy across three availability zones in East US 2. They use Azure Proximity Placement Groups to keep VMs close for low latency. They configure Azure Site Recovery to a paired region (Central US) for disaster recovery. The database uses SQL Server with Always On availability groups across zones. They purchase three-year Reserved Instances for all baseline VMs. The SLA for three-zone deployment is 99.99% (cannot reach 99.999% with standard SLAs, but they combine with custom redundancy). The business problem: meeting regulatory requirements while avoiding massive capital expenditure on a secondary data center. Common mistake: forgetting to test failover regularly; they run a failover drill quarterly.
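Scenario 3 mentions combining redundancy to go beyond a single 99.99% SLA. The arithmetic behind that idea can be sketched as follows, under two idealized assumptions: the redundant copies fail independently, and failover between them is instant and flawless. The result is a theoretical availability estimate, not a contractual SLA, and `redundant_availability` is a hypothetical helper.

```python
def redundant_availability(single_percent: float, copies: int) -> float:
    """Availability (percent) when any one of `copies` independent deployments suffices."""
    all_down = (1 - single_percent / 100) ** copies  # every copy down at once
    return (1 - all_down) * 100


two_regions = redundant_availability(99.99, copies=2)
print(f"{two_regions:.6f}%")
```

In practice, shared dependencies and failover time reduce this figure, which is why regular failover drills matter.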

How AZ-900 Actually Tests This

Objective 1.5: Describe reliability and predictability

This objective is part of the 'Cloud Concepts' domain (25-30% of the exam). You need to understand how Azure delivers high availability, disaster recovery, and cost predictability. The exam will test your ability to:

Identify the SLA for different deployment configurations.

Understand the difference between availability sets and availability zones.

Know that SLAs are for infrastructure uptime, not application performance.

Recognize that Reserved Instances provide cost predictability.

Common Wrong Answers and Why

1. 'A single VM has a 99.99% SLA.' Wrong. A single VM has 99.9%. Candidates confuse the SLA for a single VM with multi-VM deployments. Remember: more redundancy = higher SLA.

2. 'Availability zones and availability sets are the same.' Wrong. Availability sets protect against rack-level failures within a single data center; availability zones protect against entire data center failures. The exam loves this distinction.

3. 'SLAs guarantee application performance.' Wrong. SLAs cover uptime and connectivity of Azure services, not how fast your app runs. Candidates think SLA means 'my app will be fast.'

4. 'Pay-as-you-go is the most cost-predictable model.' Wrong. Pay-as-you-go varies with usage; Reserved Instances are more predictable because you commit to a fixed cost.

Specific Terms and Values to Memorize

99.9% = one VM

99.95% = two VMs in an availability set

99.99% = two VMs in availability zones

Region pair = two regions within same geography (e.g., East US & West US)

Azure Site Recovery = disaster recovery service

Reserved Instances = 1 or 3-year commitment for discount

SLA credit = a percentage of the monthly fee if the SLA is not met (commonly 10% or 25%, rising to 100% for severe shortfalls on some services)

Edge Cases and Tricky Distinctions

Composite SLA: If you have an app with a web tier (99.95%) and a database tier (99.99%), the composite SLA is 99.95% * 99.99% = 99.94%. The exam may ask you to calculate this.

Free services: Azure free services have no SLA. Don't expect uptime guarantees.

SLA for Microsoft Entra ID (formerly Azure AD): the paid Premium tiers carry a 99.99% SLA; the Free tier has no financially backed SLA.

SLA for Azure DevOps: Only for certain services; check documentation.
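The composite SLA calculation from the first item above can be checked with a short Python sketch. It assumes every component is required for the application to work (a serial dependency); `composite_sla` is a hypothetical helper name.

```python
from math import prod


def composite_sla(*sla_percents: float) -> float:
    """Composite SLA (percent) for components that must all be up at once."""
    return prod(p / 100 for p in sla_percents) * 100


web_tier, database_tier = 99.95, 99.99
print(f"{composite_sla(web_tier, database_tier):.2f}%")  # 99.94%
```

Adding more required components can only lower the result, so the composite figure is always at or below the weakest tier.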

Memory Trick: 'One VM, Nine Nine' = 99.9%

For SLAs, think: one VM = 99.9% (three nines). Two VMs in availability set = 99.95% (three nines plus a half). Two VMs in zones = 99.99% (four nines). The more nines, the more redundant.

Decision Tree for Eliminating Wrong Answers

Is the question about infrastructure uptime? If yes, SLA applies. If about app performance, it's not covered.

Is the deployment single VM? Then answer is 99.9%.

Is the deployment multi-VM? Check if they mention availability set or zone.

Is the question about cost predictability? Look for 'Reserved Instances' or 'reserved capacity.'

Is the question about disaster recovery? Look for 'Azure Site Recovery' or 'geo-replication.'
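The VM branch of this decision tree can be written as a small Python lookup, using the SLA figures quoted in this chapter. The function name and the placement labels are hypothetical, and the single-VM figure applies when the VM uses Premium SSD or Ultra Disk for all disks.

```python
def vm_sla(instance_count: int, placement: str) -> float:
    """Return the VM SLA (percent) for a deployment.

    placement is one of: 'none', 'availability_set', 'availability_zones'.
    """
    if instance_count >= 2 and placement == "availability_zones":
        return 99.99  # two or more VMs spread across zones
    if instance_count >= 2 and placement == "availability_set":
        return 99.95  # two or more VMs in one availability set
    return 99.9       # single VM (with premium disks)


print(vm_sla(1, "none"))
print(vm_sla(2, "availability_set"))
print(vm_sla(3, "availability_zones"))
```

Note that a single VM placed in a zone still falls through to the last branch: the higher figures require at least two instances.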

Key Takeaways

Reliability in Azure is achieved through redundancy across availability zones and region pairs.

A single VM has an SLA of 99.9%; two VMs in an availability set have 99.95%; two VMs in availability zones have 99.99%.

SLAs cover infrastructure uptime, not application performance or response times.

Azure Site Recovery provides disaster recovery with RPO of seconds to minutes and RTO of minutes.

Reserved Instances (1 or 3-year commitment) provide up to 72% discount and cost predictability.

Availability sets protect against rack failures; availability zones protect against data center failures.

Composite SLA is calculated by multiplying the SLAs of individual components.

Free Azure services have no SLA; paid services have SLAs that vary by tier.

Azure Load Balancer and Application Gateway enable automatic failover across instances.

Use Azure Pricing Calculator and Total Cost of Ownership (TCO) calculator to estimate costs.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Availability Set

Logical grouping within a single data center

Protects against rack-level failures (fault domains)

Protects against planned maintenance (update domains)

SLA for 2+ VMs: 99.95%

No additional cost; uses same data center

Availability Zone

Physical separation within a region (different data centers)

Protects against entire data center failure

No update domain concept; each zone is independent

SLA for 2+ VMs across zones: 99.99%

May incur inter-zone data transfer costs

Watch Out for These

Mistake

Availability zones and availability sets provide the same level of reliability.

Correct

Availability sets protect against failures within a single data center (rack-level). Availability zones protect against entire data center failures because they are physically separate data centers within a region. Zones offer higher reliability.

Mistake

An SLA of 99.9% means the service will be down exactly 8.76 hours per year.

Correct

99.9% uptime allows up to 8.76 hours of downtime per year before the commitment is breached, but it is not guaranteed to be exactly that. The SLA is a commitment; actual downtime may be less. Also, SLA compliance is measured per monthly billing period, not annually, so 99.9% works out to roughly 43 minutes per month.

Mistake

Deploying a single VM with premium SSD gives a 99.99% SLA.

Correct

The SLA for a single VM tops out at 99.9%, and only when all of its disks are Premium SSD or Ultra Disk; with standard disks the single-VM SLA is lower. To get 99.99%, you need at least two VMs in different availability zones. Disk type cannot raise a single VM above 99.9%.

Mistake

Reserved instances guarantee performance.

Correct

Reserved instances only guarantee a discounted price for a committed capacity. They do not affect performance or uptime. Performance depends on the VM size and configuration.

Mistake

Azure Site Recovery provides real-time replication with zero data loss.

Correct

Azure Site Recovery typically has an RPO of a few seconds to minutes, not zero. For near-zero data loss, you need synchronous replication, which is more expensive and has latency trade-offs. ASR is asynchronous by default.

Frequently Asked Questions

What is the difference between reliability and availability in Azure?

Reliability is the ability of a system to function correctly over time, including recovering from failures. Availability is the percentage of time a service is operational and accessible. In Azure, SLAs measure availability (uptime). Reliability is achieved through redundant design, while availability is the measured outcome. For the exam, know that 'reliability' is the broader concept including disaster recovery, and 'availability' is often quantified by SLA percentages.

How do I calculate the composite SLA for a multi-tier application?

Multiply the SLAs of each component. For example, if your web tier has a 99.95% SLA and your database tier has 99.99%, the composite SLA is 99.95% * 99.99% = 99.94%. This means the overall availability is slightly lower than the weakest component. The exam may ask you to compute this. Convert the percentages to decimals before multiplying: 99.95% = 0.9995, 99.99% = 0.9999.

What happens if Azure does not meet its SLA?

You can request a service credit, which is a percentage discount on your monthly bill for the affected service. The credit amount depends on the actual uptime and is tiered: a small shortfall typically earns a 10% credit, a larger one 25%, and for some services (such as Virtual Machines) uptime below 95% earns a 100% credit. Credits are not applied automatically; you must submit a claim within the period stated in the SLA. This is covered in the 'Service Level Agreements' section of the exam.

Can I get a 99.999% SLA in Azure?

Only a few services publish a 99.999% SLA, for example Azure Cosmos DB for accounts that span multiple regions. For Virtual Machines, the highest standard SLA is 99.99% for multi-zone deployments. To approach 99.999% for a VM-based workload, you need a custom architecture that spans multiple regions with automated failover, and Microsoft does not provide a financial SLA for that level of the overall application. For critical workloads, you might combine Azure with third-party solutions.

What is the difference between Azure Site Recovery and Azure Backup?

Azure Site Recovery (ASR) replicates entire workloads (VMs, apps) to a secondary region for disaster recovery, enabling failover. Azure Backup backs up data (files, databases) to a vault for point-in-time recovery. ASR is for rapid recovery from a disaster; Backup is for data protection against accidental deletion or corruption. Both contribute to reliability but serve different purposes.

Do all Azure services have an SLA?

No. Free services (e.g., Azure Free Account, free tiers of some services) do not have an SLA. Paid services typically have SLAs, but the percentage varies. For example, the Free tier of Microsoft Entra ID (formerly Azure AD) has no financially backed SLA, while the Premium tiers have 99.99%. Always check the SLA documentation for each service.

How does reserved capacity provide cost predictability?

Reserved Instances (RI) allow you to commit to a specific VM configuration (size, region) for 1 or 3 years. You pay upfront or monthly, and in return, you get a significant discount (up to 72%) compared to pay-as-you-go. This locks in your cost for that capacity, making it predictable regardless of usage fluctuations. However, you are charged even if you don't use the VMs.

