High availability and fault tolerance are not the same thing, and the SAA-C03 exam expects you to know the difference. A highly available system recovers quickly when something fails. A fault-tolerant system continues running without any interruption when something fails. Both require redundancy, but fault tolerance requires redundancy that is active and seamless. Every architecture question on the exam is implicitly a resilience question: how do you design this to stay available when a server dies, an AZ goes dark, or a region has an outage? Get comfortable thinking in layers: instance, AZ, and region.
Each Availability Zone consists of one or more physically separate data centers within a region, and the AZs in a region are connected by low-latency links. Deploying across multiple AZs protects against the failure of a single AZ. If AZ-a loses power, your resources in AZ-b and AZ-c keep serving traffic. This is the minimum acceptable architecture for production workloads. Single-AZ deployments have no protection against AZ-level failures.
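The benefit of spreading across AZs can be estimated with simple arithmetic. The sketch below assumes that AZ failures are independent and that each AZ is available 99% of the time. Both are illustrative assumptions, not AWS figures, and the function name is hypothetical.

```python
def combined_availability(per_az_availability: float, az_count: int) -> float:
    """Probability that at least one AZ is up, assuming independent failures."""
    if not 0.0 <= per_az_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # The workload is down only if every AZ is down at the same time.
    all_down = (1.0 - per_az_availability) ** az_count
    return 1.0 - all_down


# 0.99 per AZ is a made-up figure used only to show the trend.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, each additional AZ multiplies the chance of a total outage by the per-AZ failure probability, which is why moving from one AZ to two gives the largest gain.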
Multi-region architectures protect against entire region failures or provide lower latency to users in different geographies. They are significantly more complex and expensive. A Pilot Light architecture keeps a minimal version of the environment running in a secondary region, ready to scale up in an emergency. A Warm Standby keeps a scaled-down but fully functional version running in the secondary region. Active-Active runs full capacity in multiple regions simultaneously, with traffic distributed between them.
For RDS, Multi-AZ creates a synchronous standby replica in a different AZ. If the primary fails, AWS automatically fails over to the standby, typically within 60 to 120 seconds, by repointing the database's DNS endpoint at the new primary. The standby in a Multi-AZ DB instance deployment is not readable during normal operation. Read Replicas are asynchronous and readable, and are used for read scaling rather than HA.
Load balancers perform health checks and route traffic only to healthy targets. If a target fails its health check, it is removed from rotation. New targets are added when Auto Scaling launches them. This combination of load balancer health checks and Auto Scaling is the foundation of elastic, self-healing architectures: when demand spikes, more instances launch; when they fail, they are replaced automatically.
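The interaction between health checks and Auto Scaling can be modeled in a few lines. This is a toy simulation, not the AWS API: the `Instance` and `Fleet` classes, the instance ID format, and the `reconcile` method are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


@dataclass
class Fleet:
    """Toy model of a load balancer target group backed by an Auto Scaling group."""
    desired_capacity: int
    instances: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def launch(self) -> Instance:
        inst = Instance(f"i-{next(self._ids):04d}")
        self.instances.append(inst)
        return inst

    def in_rotation(self) -> list:
        """Targets the load balancer routes to: healthy instances only."""
        return [i for i in self.instances if i.healthy]

    def reconcile(self) -> None:
        """Drop unhealthy instances, then launch until desired capacity is met."""
        self.instances = self.in_rotation()
        while len(self.instances) < self.desired_capacity:
            self.launch()


fleet = Fleet(desired_capacity=3)
fleet.reconcile()                      # initial launch of three instances
fleet.instances[0].healthy = False     # simulate a failed health check
routable = [i.instance_id for i in fleet.in_rotation()]
fleet.reconcile()                      # failed instance is replaced
replaced = [i.instance_id for i in fleet.instances]
print(routable)   # ['i-0002', 'i-0003']
print(replaced)   # ['i-0002', 'i-0003', 'i-0004']
```

The failed instance leaves rotation immediately, and the fleet returns to its desired capacity on the next reconcile pass, which mirrors the self-healing behavior described above.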
Route 53 health checks enable DNS-level failover between resources, including between regions. A failover routing policy keeps a primary record active as long as health checks pass and switches to a secondary record if the primary fails. This provides region-level failover in which users see a delay of roughly the health check detection time plus the DNS TTL.
Designing for failure means assuming every component will fail and building so that failure of any single component does not bring down the system. Single points of failure (SPOF) are the enemy. If removing any one component takes down the system, that component is a SPOF and must be made redundant.
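One way to practice spotting SPOFs is to list each tier's components and where they run, then flag any tier that lacks redundancy. The sketch below is a simplified check under assumed rules (a tier needs at least two components in at least two AZs). The tier names, component names, and the `find_spofs` function are hypothetical.

```python
def find_spofs(tiers: dict[str, list[tuple[str, str]]]) -> dict[str, str]:
    """Return tiers that a single component or single AZ failure would take down.

    `tiers` maps a tier name to a list of (component_name, availability_zone).
    """
    findings = {}
    for tier, members in tiers.items():
        azs = {az for _, az in members}
        if len(members) < 2:
            findings[tier] = "fewer than two components"
        elif len(azs) < 2:
            findings[tier] = "all components in one AZ"
    return findings


# Hypothetical architecture used only as an example.
architecture = {
    "web": [("web-1", "us-east-1a"), ("web-2", "us-east-1b")],
    "nat": [("nat-1", "us-east-1a")],
    "db": [("db-primary", "us-east-1a"), ("db-standby", "us-east-1a")],
}
print(find_spofs(architecture))
# {'nat': 'fewer than two components', 'db': 'all components in one AZ'}
```

Here the web tier passes, the lone NAT component is flagged, and the database is flagged because both copies share an AZ.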
Availability goal: survive a single instance failure = Auto Scaling group. Survive an AZ failure = deploy across multiple AZs with a load balancer. Survive a region failure = Multi-region with Route 53 failover.
RDS HA: Multi-AZ for automatic failover. Read Replicas for read scaling, not HA.
High availability vs fault tolerance: HA = system recovers quickly (some downtime). FT = system continues without interruption. Fault tolerance costs more because active redundancy is always running.
RPO and RTO: RPO = maximum acceptable data loss (how old can the restored data be?). RTO = maximum acceptable downtime (how long to recover?). Architect backup and replication frequency to meet RPO. Architect failover speed to meet RTO.
Pilot Light: minimal standby, scales up in emergency. Warm Standby: scaled-down but functional. Active-Active: full traffic in multiple regions simultaneously.
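The RPO and RTO definitions above translate directly into two comparisons. The sketch below assumes that worst-case data loss equals the interval between backups or replication runs, and that downtime equals the measured recovery time. The function and parameter names are hypothetical.

```python
from datetime import timedelta


def meets_objectives(replication_interval: timedelta, recovery_time: timedelta,
                     rpo: timedelta, rto: timedelta) -> dict:
    """Compare a design's worst-case data loss and downtime against RPO and RTO."""
    return {
        # Data written since the last copy is lost, so the interval bounds the loss.
        "rpo_met": replication_interval <= rpo,
        # Time to restore service must fit inside the allowed downtime.
        "rto_met": recovery_time <= rto,
    }


result = meets_objectives(
    replication_interval=timedelta(hours=6),   # backups every 6 hours
    recovery_time=timedelta(hours=2),          # restore takes 2 hours
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

In this example the restore is fast enough for the RTO, but six-hourly backups cannot satisfy a one-hour RPO, so the design needs more frequent replication.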
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup and Restore | Hours | Hours to days | Lowest | Restore from backup when disaster strikes |
| Pilot Light | Minutes to hours | Minutes | Low | Minimal env running in secondary region, scales up on failover |
| Warm Standby | Minutes | Seconds to minutes | Medium | Scaled-down full env running, scale up on failover |
| Active-Active | Seconds or none | Near zero | Highest | Full capacity in multiple regions, immediate failover |
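The table can be read as a lookup: walk the strategies from cheapest to most expensive and stop at the first one that meets both objectives. The sketch below does that. The numeric bounds are assumptions chosen to put concrete values on the table's qualitative ranges ("hours", "minutes", "seconds"); they are not official AWS figures, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta
from typing import Optional

# (name, assumed worst-case RTO, assumed worst-case RPO), cheapest first.
# All numbers are illustrative assumptions, not AWS guarantees.
STRATEGIES = [
    ("Backup and Restore", timedelta(hours=24), timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=4), timedelta(minutes=15)),
    ("Warm Standby", timedelta(minutes=30), timedelta(minutes=5)),
    ("Active-Active", timedelta(seconds=30), timedelta(seconds=5)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[str]:
    """Return the lowest-cost strategy whose assumed bounds fit within RTO and RPO."""
    for name, worst_rto, worst_rpo in STRATEGIES:
        if worst_rto <= rto and worst_rpo <= rpo:
            return name
    return None  # nothing in the table meets the requirement


print(cheapest_strategy(timedelta(hours=4), timedelta(hours=1)))
# Pilot Light
print(cheapest_strategy(timedelta(minutes=1), timedelta(seconds=10)))
# Active-Active
```

Exam questions follow the same logic: several strategies may meet the requirement, and the expected answer is usually the cheapest one that does.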
Misconception: Multi-AZ RDS provides read scaling.
The Multi-AZ standby is not readable. It exists purely for failover. To offload read traffic, create Read Replicas, which are readable but asynchronous and do not provide automatic failover for the primary.
Misconception: High availability and fault tolerance mean the same thing.
High availability systems recover quickly from failure, meaning there may be brief downtime. Fault-tolerant systems continue operating without interruption through failures. Fault tolerance is stricter and more expensive because it requires active redundancy that can absorb failures seamlessly.
Misconception: Route 53 DNS failover is instantaneous.
Route 53 failover switches the DNS record, but clients cache the old record for the duration of the TTL. To make failover faster, lower the TTL on your records before an expected change. Failover detection plus TTL propagation means actual traffic shifting can take minutes even with health checks configured.
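A rough worst-case estimate of that delay is the detection time (health check interval multiplied by the number of consecutive failures required) plus the record's TTL. The sketch below is a simplification: it ignores resolvers that cache beyond the TTL and any client-side connection reuse. The function name and the example values are assumptions chosen for illustration.

```python
def estimated_failover_seconds(check_interval_s: int, failure_threshold: int,
                               ttl_s: int) -> int:
    """Approximate worst-case time until clients resolve the secondary record."""
    # Time for the health check to declare the primary unhealthy.
    detection = check_interval_s * failure_threshold
    # Clients may keep using the cached record for up to one full TTL.
    return detection + ttl_s


# Example values (assumed): 30 s checks, 3 failures to trip, 300 s TTL.
print(estimated_failover_seconds(30, 3, 300))  # 390
# Faster checks and a shorter TTL shrink the window.
print(estimated_failover_seconds(10, 3, 60))   # 90
```

Even the tighter configuration leaves a window of more than a minute, which is why DNS failover alone supports high availability rather than fault tolerance.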
These questions are representative of what you will see on the AWS SAA-C03 exam. An explanation of the correct answer follows each question.
An RDS instance is deployed with Multi-AZ enabled. The primary instance in us-east-1a fails. What happens?
Explanation: RDS Multi-AZ maintains a synchronous standby in a different AZ. On primary failure, AWS automatically promotes the standby and updates the RDS DNS endpoint to point to the new primary — typically within 60-120 seconds. Applications use the same DNS endpoint (no reconfiguration needed). No data is lost because replication is synchronous. Read Replicas are a separate feature for scaling reads.
A company needs its application to continue serving traffic even if an entire AWS Availability Zone fails. What architecture achieves this?
Explanation: Multi-AZ deployment with an Elastic Load Balancer is the standard pattern for AZ-failure resilience. The ELB distributes traffic across instances in multiple AZs and performs health checks. If an AZ fails, health checks fail for those instances, and the ELB routes traffic only to healthy instances in remaining AZs. Backups help with data recovery but don't provide HA. Reserved Instances are a cost optimization, not an HA feature.
A company's RTO is 4 hours and RPO is 1 hour for a DR scenario. Their workload runs in us-east-1. Which disaster recovery strategy is most appropriate and cost-effective?
Explanation: Pilot Light keeps a minimal environment (core services running, data replicated) in a secondary region that can scale up within hours when needed. This meets the 4-hour RTO (scale-up time) and 1-hour RPO (replication frequency). Active-Active would easily meet the RTO/RPO but costs significantly more. Warm Standby meets the requirements but is more expensive than Pilot Light. Single-region offers no region-level DR.
A company wants to minimize data loss (near-zero RPO) and downtime (near-zero RTO) for its critical application across two AWS regions. Which DR strategy is required?
Explanation: Active-Active runs full capacity in multiple regions simultaneously, with traffic distributed between them. Failover is immediate (traffic is already serving from both regions) and data loss is near-zero (writes go to both regions). This is the highest-cost DR strategy but provides the lowest RTO and RPO. Pilot Light and Warm Standby both have meaningful scale-up time (RTO > seconds).
What is the difference between High Availability and Fault Tolerance in AWS architecture?
Explanation: High Availability means a system is designed to recover quickly from failures, accepting brief downtime measured in seconds or minutes. Fault Tolerance means the system continues operating without any interruption when components fail — requiring active redundancy that absorbs failures seamlessly. Fault tolerance is stricter, more expensive, and requires more complex architecture (active-active, not active-passive).
RDS Multi-AZ creates a synchronous standby replica in a different AZ specifically for high availability. The standby is NOT readable during normal operation. On primary failure, AWS automatically fails over to the standby. RDS Read Replicas use asynchronous replication to create readable copies of the database, used for offloading read traffic and improving read performance. Read Replicas can be promoted to standalone databases but don't provide automatic failover for the primary.
RPO (Recovery Point Objective) is the maximum acceptable amount of data loss measured in time — how old can the restored data be? An RPO of 1 hour means you can tolerate losing up to 1 hour of data. RTO (Recovery Time Objective) is the maximum acceptable downtime — how long can the system be unavailable? Lower RPO requires more frequent replication. Lower RTO requires faster failover mechanisms. Both directly drive architecture decisions and cost.
From lowest cost/slowest to highest cost/fastest: (1) Backup and Restore — restore from backups when disaster strikes (hours RTO/RPO, cheapest). (2) Pilot Light — minimal secondary environment with core services running, scales up during DR (minutes-to-hours RTO). (3) Warm Standby — scaled-down but fully functional secondary, scales up on failover (minutes RTO). (4) Active-Active — full capacity in multiple regions, immediate failover (seconds/zero RTO, most expensive).
Identify each component whose failure would bring down the application. For compute: use Auto Scaling groups across multiple AZs with an ELB. For databases: use Multi-AZ RDS or Aurora with multiple replicas. For networking: use redundant NAT Gateways (one per AZ). For DNS: Route 53 is globally redundant by design. For storage: S3 and EFS store data redundantly across multiple AZs in a region, except in their One Zone storage classes. The goal is that no single failure (instance, AZ, or service) brings down the application.
SAA-C03 heavily tests HA architecture patterns: Multi-AZ vs Multi-Region, RDS Multi-AZ vs Read Replicas, ELB health checks with Auto Scaling, Route 53 failover routing, and DR strategy selection (Backup and Restore vs Pilot Light vs Warm Standby vs Active-Active). RPO/RTO trade-offs and cost implications are key themes. Expect scenarios where you must choose between HA options based on availability and budget requirements.