AZ-305 | Chapter 34 of 103 | Objective 1.3

Azure Architecture Design Principles

This chapter covers the Azure Architecture Design Principles, the foundation of the AZ-305 exam. These principles are derived from the Microsoft Azure Well-Architected Framework and guide every decision a solutions architect makes. Approximately 15-20% of exam questions directly test your understanding of these principles, either explicitly or as the basis for scenario-based questions. Mastering them is essential for designing secure, scalable, resilient, and cost-effective solutions on Azure.

25 min read
Intermediate
Updated May 31, 2026

Architecture Design as City Planning

Designing an Azure architecture is like planning a modern city. The city has zones (regions), districts (resource groups), and buildings (resources). The city planner must decide on zoning laws (Azure Policy), traffic flow (networking), utilities (storage), emergency services (disaster recovery), and growth patterns (scalability). Just as a city planner cannot redesign the entire city every few years, an architect must design for longevity and change. The Azure Well-Architected Framework provides the principles: reliability (redundant power feeds), security (police and fire departments), cost optimization (budgeting for services), operational excellence (city management systems), and performance efficiency (road capacity and public transit). A poorly planned city leads to traffic jams, power outages, and unsafe neighborhoods; similarly, a poorly architected Azure solution leads to outages, security breaches, and runaway costs. The architect must consider future expansion (suburbs), peak loads (rush hour), and disaster recovery (emergency shelters). Each design decision has trade-offs: a wider road (more throughput) costs more land (higher cost). The AZ-305 exam tests your ability to make these trade-offs using the Well-Architected Framework principles.

How It Actually Works

What Are the Azure Architecture Design Principles?

The Azure Architecture Design Principles are a set of five pillars defined in the Microsoft Azure Well-Architected Framework. They are not product-specific but rather a mental model for making architectural trade-offs. The five pillars are:

- Reliability: The ability of a system to recover from failures and continue to function.
- Security: Protecting applications and data from threats.
- Cost Optimization: Managing costs to maximize the value delivered.
- Operational Excellence: Operations processes that keep a system running in production.
- Performance Efficiency: The ability of a system to adapt to changes in load.

These pillars are interconnected. For example, increasing reliability (e.g., adding redundancy) often increases cost. The exam expects you to prioritize based on business requirements.

Why These Principles Exist

Before the Well-Architected Framework, architects often focused on a single aspect (usually performance or cost) and neglected others, leading to brittle systems. Microsoft created the framework based on lessons learned from thousands of customer deployments. The principles provide a common language and set of trade-offs that align with business goals.

How the Principles Work Internally

Each pillar has a set of design principles. For example, under Reliability, the principles include:

- Design for business continuity: Plan for failure from the start.
- Design for recovery: Implement automated recovery processes.
- Design for redundancy: Use multiple instances and regions.
- Design for self-healing: Automatically detect and recover from failures.
- Design for failure modes: Understand how each component can fail.

The exam does not require memorizing all sub-principles but does test your ability to apply them. For instance, if a scenario requires high availability, you should think about Availability Zones, load balancers, and geo-redundant storage.

Key Components, Values, and Defaults

Reliability: Target Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are defined by the business. Azure Site Recovery supports RTO of minutes and RPO of seconds. Two or more VMs deployed across Availability Zones carry a 99.99% SLA.

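Security: Use Azure Policy to enforce compliance. Azure Security Center (now Microsoft Defender for Cloud) provides threat protection. The default rules in a network security group allow inbound traffic from within the virtual network and from Azure Load Balancer, and deny all other inbound traffic.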

Cost Optimization: Use Azure Reservations for 1- or 3-year terms to save up to 72% compared to pay-as-you-go. Right-size VMs using Azure Advisor.

Operational Excellence: Use Infrastructure as Code (IaC) with ARM templates or Bicep, and Azure Monitor for logging and alerts. Activity Log events are retained for 90 days by default; to keep them longer, send them to a Log Analytics workspace and set a custom retention period there.

Performance Efficiency: Autoscale rules with minimum and maximum instance counts. Use Azure Load Balancer or Application Gateway for traffic distribution. Default autoscale cool-down period is 5 minutes.
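To make the interaction between minimum/maximum instance counts and the cool-down period concrete, here is a minimal Python sketch of a single scale decision. It is not how Azure evaluates autoscale internally: the CPU thresholds (70% and 30%), the one-instance step, and the names `AutoscaleProfile` and `next_instance_count` are illustrative assumptions. Only the 5-minute cool-down comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class AutoscaleProfile:
    minimum: int = 1              # lower bound on instance count
    maximum: int = 10             # upper bound (assumed value for illustration)
    cooldown_seconds: int = 300   # 5-minute cool-down mentioned in the text


def next_instance_count(profile, current, avg_cpu, seconds_since_last_scale,
                        scale_out_above=70.0, scale_in_below=30.0):
    """Return the instance count after evaluating one hypothetical CPU rule."""
    # Inside the cool-down window, no scale action is taken.
    if seconds_since_last_scale < profile.cooldown_seconds:
        return current
    if avg_cpu > scale_out_above:
        # Scale out by one, but never above the configured maximum.
        return min(current + 1, profile.maximum)
    if avg_cpu < scale_in_below:
        # Scale in by one, but never below the configured minimum.
        return max(current - 1, profile.minimum)
    return current


profile = AutoscaleProfile(minimum=2, maximum=5)
print(next_instance_count(profile, current=2, avg_cpu=85.0, seconds_since_last_scale=600))
print(next_instance_count(profile, current=5, avg_cpu=85.0, seconds_since_last_scale=600))
print(next_instance_count(profile, current=3, avg_cpu=10.0, seconds_since_last_scale=60))
```

The second call shows the maximum acting as a cost ceiling, and the third shows the cool-down suppressing a scale-in that the CPU rule alone would have triggered.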

Configuration and Verification Commands

While the exam is not command-line heavy, you should know how to verify designs using Azure Portal or CLI. For example:

To check redundancy: az storage account show --name <name> --query 'sku.name' returns Standard_GRS for geo-redundant.

To verify autoscale settings: az monitor autoscale show --resource-group <rg> --name <autoscale-name>

For security policies: az policy assignment list --query '[].{Name:name, PolicyDefinitionId:policyDefinitionId}'
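If you capture the `sku.name` value from the storage command above, a small helper can classify it. The sketch below is a hypothetical Python function (the name `describe_redundancy` is an assumption) that checks the value against the standard Azure Storage SKU names. It does not call Azure; it only interprets a string you already have.

```python
# Storage SKU names that replicate to a secondary region.
GEO_REDUNDANT_SKUS = {
    "Standard_GRS", "Standard_RAGRS", "Standard_GZRS", "Standard_RAGZRS",
}
# Storage SKU names that replicate synchronously across availability zones.
ZONE_REDUNDANT_SKUS = {
    "Standard_ZRS", "Premium_ZRS", "Standard_GZRS", "Standard_RAGZRS",
}


def describe_redundancy(sku_name):
    """Classify a storage account SKU name by its redundancy characteristics."""
    return {
        "geo_redundant": sku_name in GEO_REDUNDANT_SKUS,
        "zone_redundant": sku_name in ZONE_REDUNDANT_SKUS,
    }


print(describe_redundancy("Standard_GRS"))
print(describe_redundancy("Standard_LRS"))
```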

Interaction with Related Technologies

The principles are applied across all Azure services. For example:

- Reliability + Networking: Use Azure Traffic Manager for global load balancing with multiple endpoints.
- Security + Compute: Use Azure Disk Encryption for VMs using BitLocker or DM-Crypt.
- Cost Optimization + Storage: Use Azure Blob Storage access tiers (Hot, Cool, Archive) to reduce costs based on access patterns.
- Operational Excellence + DevOps: Use Azure DevOps pipelines to deploy ARM templates after approval.
- Performance Efficiency + Databases: Use Azure SQL Database Hyperscale for auto-scaling storage up to 100 TB.
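The storage-tier trade-off can be sketched as a simple rule based on how recently data was accessed. The Python function below is a rough illustration only: the name `suggest_access_tier` and the 30-day and 180-day cut-offs are assumptions chosen for the example, and a real decision would also weigh retrieval cost and rehydration time.

```python
def suggest_access_tier(days_since_last_access):
    """Suggest a Blob Storage access tier from an assumed access-age rule."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    if days_since_last_access >= 180:   # assumed cut-off for rarely read data
        return "Archive"
    if days_since_last_access >= 30:    # assumed cut-off for infrequent access
        return "Cool"
    return "Hot"


for age in (3, 45, 400):
    print(age, suggest_access_tier(age))
```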

The exam often presents a scenario where you must choose between two services based on these trade-offs. For instance, choosing between Azure SQL Database and SQL Managed Instance: Managed Instance offers more compatibility but higher cost; SQL Database offers lower cost but less control.

Exam Emphasis

The AZ-305 exam focuses on the trade-offs between pillars. A common question: "You need to design a solution that minimizes cost while meeting a 99.9% SLA. Which approach should you use?" The answer often involves weighing a single VM with premium storage (lower cost, 99.9% SLA) against multiple VMs across zones (higher reliability, higher cost). For a 99.9% target the single VM is sufficient; a stricter target requires the redundant option. The correct answer depends on the SLA requirement.
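It can help to translate an SLA percentage into the downtime it actually permits. The short Python sketch below does that arithmetic for the VM SLA figures used in this chapter; the function name `allowed_downtime_minutes` is hypothetical, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime (in minutes) that an SLA percentage permits over a period."""
    return period_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.95, 99.99):
    print(f"{sla}% SLA allows about {allowed_downtime_minutes(sla):.1f} minutes of downtime per year")
```

Each extra "nine" cuts the permitted downtime by roughly a factor of ten, which is why the step from 99.9% to 99.99% usually justifies the move from a single VM to a zone-redundant design.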

Specific Numbers and Terms

SLA for Azure VMs: 99.9% for single instance with premium storage, 99.95% for two or more instances in same availability set, 99.99% for two or more instances across availability zones.

Azure Site Recovery: Supports RPO as low as 30 seconds.

Azure Backup: Default retention for daily backups is 30 days.

Azure Policy: Over 80 built-in policies for security and compliance.

Cost Management: Budget alert thresholds are configurable; common choices are 50%, 75%, 90%, and 100% of budget.
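The budget thresholds above can be reasoned about with a few lines of arithmetic. This Python sketch is only an illustration of the idea, not the Cost Management API; the function name `triggered_thresholds` and the spend figures are assumptions.

```python
def triggered_thresholds(actual_spend, budget, thresholds=(50, 75, 90, 100)):
    """Return the budget thresholds (in percent) that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    percent_used = actual_spend / budget * 100
    return [t for t in thresholds if percent_used >= t]


print(triggered_thresholds(800, 1000))   # spend at 80% of a hypothetical budget
print(triggered_thresholds(400, 1000))   # spend at 40%, below every threshold
```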

Trap Patterns

Candidates often confuse availability sets (fault domains and update domains within a single datacenter) with availability zones (physically separate datacenters within a region). Availability sets protect against rack failures and planned maintenance; availability zones protect against entire datacenter failures. The exam will test this distinction.

Another trap: RTO vs. RPO. RTO is time to recover; RPO is data loss tolerance. For example, if RPO is 1 hour, you can lose up to 1 hour of data. Candidates often reverse them.
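One way to keep the two straight is to check them against separate measurements: RPO against the gap between recovery points, RTO against how long a restore takes. The Python sketch below shows that comparison under simplified assumptions (worst-case data loss equals the full interval between recovery points); the function name and the sample values are hypothetical.

```python
from datetime import timedelta


def meets_objectives(recovery_point_interval, measured_recovery_time, rpo, rto):
    """Compare a protection design against its RPO and RTO targets."""
    return {
        # Worst-case data loss is the time since the last recovery point.
        "rpo_met": recovery_point_interval <= rpo,
        # Downtime is how long it takes to bring the system back.
        "rto_met": measured_recovery_time <= rto,
    }


result = meets_objectives(
    recovery_point_interval=timedelta(hours=4),   # e.g. a backup every 4 hours
    measured_recovery_time=timedelta(hours=2),    # e.g. a tested restore time
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)
```

In this example the design recovers fast enough (RTO met) but could lose up to four hours of data against a one-hour RPO, so the fix is more frequent recovery points, not a faster restore.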

Finally, candidates sometimes think that using more services automatically increases reliability. In fact, adding complexity can reduce reliability if not designed properly. The exam expects you to consider the reliability of each component and the overall system.
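The effect of complexity on reliability can be shown with composite availability arithmetic. In the Python sketch below, components that must all be up are multiplied together, while redundant copies are combined as the chance that not all fail at once. This assumes failures are independent, which is a simplification, and the function names and availability figures are illustrative.

```python
from math import prod


def serial_availability(*components):
    """Availability of a chain where every component must be up."""
    return prod(components)


def redundant_availability(single, copies):
    """Availability of N independent copies where one surviving copy is enough."""
    return 1 - (1 - single) ** copies


# Chaining a web tier and a database lowers the overall figure.
print(serial_availability(0.9995, 0.9999))
# Two redundant copies of a 99.9% component raise it.
print(redundant_availability(0.999, 2))
```

Adding a dependency in series always lowers the composite figure, whereas adding a redundant copy raises it, which is the distinction the exam expects you to recognize.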

Walk-Through

1. Identify Business Requirements

Start by gathering non-functional requirements such as uptime SLA (e.g., 99.99%), RTO (e.g., 4 hours), RPO (e.g., 1 hour), budget constraints, and compliance needs (e.g., GDPR, HIPAA). Interview stakeholders to understand criticality of each workload. Document these in a requirements matrix. This step is often overlooked but is the foundation of all architectural decisions. For example, a financial trading system might require RPO of seconds, while a blog can tolerate hours.
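A requirements matrix can be as simple as one structured record per workload. The Python sketch below is one possible shape for it; the class name, field names, and every value are hypothetical placeholders rather than recommendations.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class WorkloadRequirements:
    name: str
    sla_percent: float
    rto: timedelta
    rpo: timedelta
    monthly_budget_usd: float
    compliance: list = field(default_factory=list)


# Illustrative entries only; real values come from stakeholder interviews.
matrix = [
    WorkloadRequirements("trading-system", 99.99, timedelta(minutes=15),
                         timedelta(seconds=5), 50000.0, ["GDPR"]),
    WorkloadRequirements("blog", 99.9, timedelta(hours=4),
                         timedelta(hours=12), 200.0),
]

# The workload with the tightest RPO is the one least tolerant of data loss.
most_critical = min(matrix, key=lambda w: w.rpo)
print(most_critical.name)
```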

2. Map Requirements to Pillars

For each requirement, determine which Well-Architected pillar(s) it maps to. High availability maps to Reliability. Data encryption maps to Security. Cost cap maps to Cost Optimization. Operational runbooks map to Operational Excellence. Performance during peak maps to Performance Efficiency. Prioritize pillars if there is conflict (e.g., cost vs. reliability). Document trade-offs. For instance, if budget is tight but reliability is critical, consider using Azure Reserved Instances to reduce cost.
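The mapping step can be practiced with a small lookup. The Python sketch below uses naive keyword matching, which is an assumption made purely for illustration; the keyword lists and the function name `pillars_for` are hypothetical, and real requirements need human judgment.

```python
# Hypothetical keyword lists; extend or replace them for real use.
PILLAR_KEYWORDS = {
    "Reliability": ("availability", "rto", "rpo", "failover"),
    "Security": ("encryption", "identity", "compliance"),
    "Cost Optimization": ("budget", "cost"),
    "Operational Excellence": ("runbook", "monitoring", "deployment"),
    "Performance Efficiency": ("latency", "peak", "throughput"),
}


def pillars_for(requirement):
    """Return the pillars whose keywords appear in a requirement statement."""
    text = requirement.lower()
    return [
        pillar
        for pillar, keywords in PILLAR_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


print(pillars_for("High availability with RPO of 1 hour"))
print(pillars_for("Encryption at rest within a fixed budget"))
```

A requirement that maps to two pillars, like the second example, is exactly where a trade-off needs to be documented.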

3. Design the Architecture

Using the prioritized pillars, select Azure services and configuration. For reliability, design for redundancy: use Availability Zones for VMs, geo-redundant storage (GRS), and active-passive or active-active database configurations. For security, implement network segmentation with VNet, NSGs, and Azure Firewall. For cost, choose appropriate tiers and use Azure Reservations. For operations, plan for monitoring with Azure Monitor and alerts. For performance, set autoscaling rules. Document the architecture in a diagram with components and data flows.

4. Validate Against Principles

Review the design against each pillar. Ask: Does it meet the required SLA? Is data encrypted at rest and in transit? Are there cost overruns? Is there a deployment automation? Can it scale to expected peak load? Use Azure Advisor to check for best practices. For example, Advisor might flag a VM that is underutilized (cost optimization) or a storage account without geo-redundancy (reliability). Adjust design accordingly.

5. Implement and Monitor

Deploy using IaC (ARM/Bicep/Terraform) to ensure repeatability. After deployment, set up Azure Monitor dashboards and alerts for key metrics like CPU, memory, latency, and error rates. Configure budget alerts in Cost Management. Regularly review Advisor recommendations. Implement a process for continuous improvement: every quarter, review the architecture against evolving business requirements and new Azure features.

What This Looks Like on the Job

Enterprise Scenario 1: Global E-Commerce Platform

Problem: A retail company needs a highly available, globally distributed e-commerce platform with 99.99% uptime, RPO of 5 minutes, and RTO of 15 minutes. They must handle traffic spikes during Black Friday.

Solution: They use Azure Front Door for global load balancing with health probes. The web tier runs on Virtual Machine Scale Sets (VMSS) across two availability zones in each of two paired regions (e.g., East US and West US). Azure Traffic Manager routes traffic to the active region. The database uses Azure SQL Database with active geo-replication (RPO of 5 seconds). Azure Cache for Redis holds session state, and Azure Cosmos DB with multi-region writes serves the product catalog. Cost is optimized using reserved capacity for SQL Database and Cosmos DB. Monitoring is handled by Azure Monitor and Application Insights.

Common Pitfall: Misconfiguring health probes in Front Door, causing false positives and routing traffic to unhealthy instances. Also, failing to test disaster recovery quarterly.

Enterprise Scenario 2: Healthcare Application with Compliance

Problem: A hospital system needs to store patient records with HIPAA compliance. They require encryption at rest and in transit, audit logging, and a backup retention of 7 years. Budget is constrained.

Solution: They use Azure Blob Storage with cool tier for active data and archive tier for older records. Azure Policy enforces encryption and logging. Azure Key Vault for encryption keys. Azure SQL Database with Transparent Data Encryption (TDE). Azure Backup with 7-year retention using Backup Vault. Cost is minimized by using cool and archive tiers, and by right-sizing VMs. Operational excellence achieved through Azure Policy and Azure Blueprints.

Common Pitfall: Not enabling audit logging on Azure SQL, leading to compliance violation. Also, not testing backup restoration.

Enterprise Scenario 3: Financial Services with Low Latency

Problem: A trading platform requires sub-millisecond latency for transactions, 99.999% uptime, and RPO of zero (no data loss). They have unlimited budget.

Solution: They use Azure Dedicated Host for VMs with low latency networking (Accelerated Networking). VMs are placed in proximity placement groups within a single availability zone to minimize latency. Database uses Azure SQL Database Hyperscale with zone-redundant configuration and automatic failover groups. Azure ExpressRoute for dedicated connectivity. For disaster recovery, they use Azure Site Recovery with continuous replication. Cost is not a concern, so they use premium SSDs and reserved capacity.

Common Pitfall: Assuming that cross-zone traffic adds no latency; in reality, latency between zones is higher than latency within a single zone. Also, not testing failover regularly.

How AZ-305 Actually Tests This

AZ-305 Exam Focus on Architecture Design Principles

This topic maps to objective 1.3: "Design a solution for identity governance" but also underpins all design decisions. The exam tests your ability to apply the Well-Architected Framework pillars to specific scenarios. Expect 3-5 questions directly on these principles, plus many scenario-based questions where you must choose the best architecture based on trade-offs.

Most Common Wrong Answers

1. Choosing Availability Sets over Availability Zones for high availability: Candidates often select availability sets because they are familiar. However, availability sets only protect against rack failures within a single datacenter, not entire datacenter outages. For 99.99% SLA, you need availability zones.

2. Confusing RTO and RPO: A question might ask: "You need to minimize data loss. Which metric is most important?" The answer is RPO, but candidates often choose RTO.

3. Selecting the cheapest option without considering SLA: For a mission-critical app, choosing a single VM with standard HDD (low cost) instead of multiple VMs with premium SSD (higher cost) will not meet the 99.9% SLA. The exam expects you to consider the cost-SLA trade-off.

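4. Assuming all Azure services are inherently secure: Services such as Azure Storage and Azure SQL Database (Transparent Data Encryption on newly created databases) encrypt data at rest by default, but other protections, such as Azure Disk Encryption on VMs or auditing on Azure SQL, must be enabled explicitly. The exam tests your knowledge of which services have default security features.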

Specific Numbers and Terms

SLA values: 99.9%, 99.95%, 99.99% for VMs; 99.99% for Azure SQL Database, rising to 99.995% for zone-redundant Business Critical databases; 99.99% for Azure Front Door.

RTO/RPO: Typical RTO for Azure Site Recovery is minutes; RPO as low as 30 seconds.

Cost savings: Reservations save up to 72%; Azure Hybrid Benefit saves up to 40% on Windows Server and SQL Server.

Autoscale: Default cool-down period is 5 minutes; minimum instance count is 1.

Backup retention: Maximum 99 years for Azure Backup; default 30 days for daily backups.

Edge Cases and Exceptions

Single VM SLA: A single VM has a 99.9% SLA only if all of its OS and data disks use premium SSD or Ultra Disk. Using standard SSD lowers the SLA to 99.5%, and using standard HDD lowers it to 95%.

Zone-redundant storage (ZRS): For blobs, ZRS is available in select regions. The exam might ask: "Which storage type provides synchronous replication across zones?" Answer: ZRS.

Cosmos DB consistency levels: Strong consistency offers the strongest data consistency guarantee but also the highest latency. The exam might test trade-offs between consistency and performance.

How to Eliminate Wrong Answers

If a question mentions "minimize downtime," eliminate options that do not include redundancy (e.g., single instance).

If "minimize cost" is the primary driver, eliminate options with geo-redundancy or premium tiers unless the scenario explicitly requires them to meet a stated SLA.

If "compliance" is mentioned, look for options that include Azure Policy, Azure Blueprints, or encryption.

Always map the requirement to the pillar. For example, "fast recovery" maps to RTO (Reliability), "minimal data loss" maps to RPO (Reliability).

Key Takeaways

The five pillars of the Well-Architected Framework are Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency.

Availability Zones provide 99.99% SLA; Availability Sets provide 99.95% SLA for multi-VM deployments.

RTO (Recovery Time Objective) is the maximum acceptable downtime; RPO (Recovery Point Objective) is the maximum acceptable data loss.

Azure Site Recovery supports RPO as low as 30 seconds and RTO of minutes.

Azure Reservations can save up to 72% compared to pay-as-you-go pricing.

Azure Policy enforces compliance; over 80 built-in policies are available.

Autoscale default cool-down period is 5 minutes; minimum instance count is 1.

Azure Backup default retention for daily backups is 30 days; maximum is 99 years.

Azure Advisor provides best practice recommendations for reliability, security, cost, operational excellence, and performance.

Always map business requirements to the appropriate pillar before making design decisions.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Availability Sets

Protects against rack failures and planned maintenance within a single datacenter.

Provides 99.95% SLA for two or more VMs.

No additional cost for the availability set itself.

VMs in the same availability set are placed in different fault domains (up to 3) and update domains (up to 20).

Does not protect against datacenter-level outages.
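To visualize how fault domains and update domains spread VMs out, the Python sketch below assigns VMs round-robin. This is a simplified model, not Azure's actual placement logic: the round-robin rule, the function name `place_vms`, and the default of 5 update domains are assumptions for illustration, while the limit of 3 fault domains comes from the list above.

```python
def place_vms(vm_count, fault_domains=3, update_domains=5):
    """Assign each VM index to a fault domain and update domain, round-robin."""
    return [
        {
            "vm": i,
            "fault_domain": i % fault_domains,
            "update_domain": i % update_domains,
        }
        for i in range(vm_count)
    ]


for placement in place_vms(4):
    print(placement)
```

With four VMs and three fault domains, the fourth VM shares a rack-level fault domain with the first, so a single rack failure could take out two of the four VMs.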

Availability Zones

Protects against entire datacenter failures within a region.

Provides 99.99% SLA for two or more VMs across zones.

May incur additional cost for cross-zone traffic.

Each zone is a separate physical location with independent power, cooling, and networking.

Requires VMs to be deployed in at least two zones; not all services support zones.

Watch Out for These

Mistake: Availability sets provide the same SLA as availability zones.

Correct: Availability sets protect against rack failures within a single datacenter and provide 99.95% SLA for two or more VMs. Availability zones protect against entire datacenter failures and provide 99.99% SLA. They are not equivalent.

Mistake: Azure Backup automatically retains backups for 99 years.

Correct: Azure Backup supports a maximum retention of 99 years, but you must configure it. Default retention for daily backups is 30 days. You need to set the retention policy explicitly.

Mistake: All Azure services are encrypted by default at rest.

Correct: Many services, such as Azure Blob Storage, Azure SQL Database, and managed disks, encrypt data at rest by default with platform-managed keys. Additional layers, such as Azure Disk Encryption inside a VM (BitLocker or DM-Crypt) or customer-managed keys, must be enabled explicitly. Always check service documentation.

Mistake: Using more Azure services always improves reliability.

Correct: Adding more services increases complexity and potential failure points. Reliability improves only if you design for redundancy and failure modes. A single well-designed service can be more reliable than a complex multi-service architecture without proper failover.

Mistake: Cost optimization means always choosing the cheapest option.

Correct: Cost optimization is about maximizing value, not minimizing cost. The cheapest option may not meet performance or reliability requirements. The goal is to choose the right tier and features that balance cost with business needs.

Frequently Asked Questions

What is the difference between an availability set and an availability zone?

An availability set is a logical grouping of VMs that protects against rack failures and planned maintenance within a single datacenter. It distributes VMs across up to 3 fault domains (different racks) and up to 20 update domains (to stagger updates). An availability zone is a physically separate datacenter within a region, with independent power, cooling, and networking. Deploying VMs across two zones protects against entire datacenter failures. Availability sets provide 99.95% SLA for multi-VM, while availability zones provide 99.99% SLA.

How do I choose between RTO and RPO?

RTO (Recovery Time Objective) is the maximum time you can afford to be without your system after a disaster. RPO (Recovery Point Objective) is the maximum amount of data you can afford to lose. If minimizing downtime is critical, prioritize low RTO. If minimizing data loss is critical, prioritize low RPO. For example, a financial transaction system might require RPO of seconds and RTO of minutes, while a content management system might tolerate RPO of hours and RTO of a day.

What is the most cost-effective way to achieve high availability for VMs?

The most cost-effective way depends on your SLA requirement. For 99.95% SLA, use two VMs in an availability set with premium storage. For 99.99% SLA, use two VMs across availability zones with premium storage. To reduce cost, use Azure Reserved Instances (1 or 3 years) and Azure Hybrid Benefit if you have existing Windows Server licenses. Also, right-size VMs using Azure Advisor to avoid overprovisioning.
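The cost side of this answer is simple arithmetic. The Python sketch below compares pay-as-you-go with a reservation for two VMs; the hourly rate is a made-up figure, the 730-hour month is an assumption, and 72% is the maximum reservation saving quoted in this chapter, so actual discounts are likely to be lower.

```python
def monthly_cost(hourly_rate, vm_count, discount_percent=0.0, hours_per_month=730):
    """Estimate monthly compute cost for a number of identical VMs."""
    return hourly_rate * vm_count * hours_per_month * (1 - discount_percent / 100)


HOURLY_RATE = 0.20  # hypothetical pay-as-you-go rate in USD per VM-hour

pay_as_you_go = monthly_cost(HOURLY_RATE, vm_count=2)
reserved = monthly_cost(HOURLY_RATE, vm_count=2, discount_percent=72)
print(f"Pay-as-you-go: ${pay_as_you_go:.2f} per month")
print(f"Reserved (best case): ${reserved:.2f} per month")
```

In the best case, two reserved VMs cost less than one pay-as-you-go VM, which is why reservations are often the answer when a scenario needs redundancy on a tight budget.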

How does Azure Policy help with operational excellence?

Azure Policy allows you to enforce organizational standards and assess compliance at scale. You can create policies that require specific resource types, enforce tagging, or restrict locations. For example, a policy can ensure all storage accounts use geo-redundant storage. This automates compliance and reduces manual errors, contributing to operational excellence by standardizing deployments and simplifying governance.

What is the default SLA for a single VM with standard HDD?

A single VM that uses standard HDD managed disks has a 95% SLA. The figure rises to 99.5% with standard SSD and to 99.9% only when all OS and data disks use premium SSD or Ultra Disk. Always check the SLA documentation for specific configurations.

Can I use Azure Backup to replicate data to another region?

Yes. Azure Backup supports Cross Region Restore (CRR) for Azure VMs. The Recovery Services vault must use geo-redundant storage (GRS), which replicates backup data to the paired region, and CRR must be enabled on the vault. You can then restore in the paired region, which provides disaster recovery capability. Note that CRR is an optional feature and incurs additional cost.

What is the difference between Azure Site Recovery and Azure Backup?

Azure Site Recovery (ASR) is a disaster recovery solution that replicates workloads from a primary site to a secondary site (Azure or on-premises) to enable failover and failback. It focuses on minimizing downtime (RTO) and data loss (RPO) during a disaster. Azure Backup is a backup service that creates point-in-time copies of data (VMs, files, databases) and retains them for a specified period. It focuses on long-term retention and recovery from accidental deletion or corruption. ASR is for full site failover; Backup is for data recovery.
