SAA-C03 | Chapter 84 of 189 | Objective 2.3

RDS Multi-AZ: Synchronous Replication and Failover

This chapter covers Amazon RDS Multi-AZ deployments, focusing on synchronous replication and automated failover. For the SAA-C03 exam, this is a core topic under Resilient Architectures (Objective 2.3), appearing in roughly 8-12% of questions. You must understand the mechanism, failover triggers, and how it differs from read replicas. We will dissect the internal plumbing, configuration, and exam traps.

25 min read
Intermediate
Updated May 31, 2026

RDS Multi-AZ: The Data Center with a Hot Standby Generator

Imagine a hospital's critical server room that runs life-support systems. The primary power supply comes from the city grid. Installed in the same building is a diesel generator that is always running at idle, synced to the same phase and frequency as the grid. A transfer switch continuously monitors the grid voltage; if it drops below 90% for more than three cycles (50 ms), the switch connects the generator to the load with no more than a few milliseconds of interruption. The generator is kept warm and fueled, and its output is constantly compared to the grid so that the transition is seamless. From the perspective of the medical equipment, there is no power failure, just a brief sag. In RDS Multi-AZ, the primary DB instance is the grid, the standby in a different Availability Zone is the generator, and the synchronous replication is the phase-locked loop that keeps them in lockstep. The failover mechanism is the transfer switch: when the primary becomes unreachable (detected by RDS health checks or a lost heartbeat), the standby takes over with zero data loss because every transaction committed on the primary was synchronously replicated to the standby before the commit was acknowledged. The analogy breaks down on timing: an RDS failover typically takes 60-120 seconds rather than milliseconds, and existing connections are dropped.

How It Actually Works

What is RDS Multi-AZ?

Amazon RDS Multi-AZ is a high-availability feature that automatically creates and maintains a synchronous standby replica in a different Availability Zone (AZ) within the same AWS Region. Its primary purpose is to provide automatic failover in the event of an AZ outage, primary instance failure, or planned maintenance. It is NOT a read scaling solution—the standby is not accessible for reads or writes until it becomes the primary after a failover.

How Synchronous Replication Works

When Multi-AZ is enabled, RDS uses synchronous replication to the standby. This means that every transaction committed on the primary must also be written to the standby's storage before the commit is acknowledged to the client. The process:

1. A client sends a write (INSERT, UPDATE, DELETE) to the primary DB instance.

2. The primary writes the transaction to its local storage (EBS volume) and, at the same time, the same writes are sent to the standby over a dedicated network connection.

3. The standby writes the data to its own storage and sends an acknowledgment back to the primary.

4. Only after receiving the acknowledgment does the primary confirm the commit to the client.

This ensures zero data loss (RPO=0) during a failover, as long as the primary and standby remain in sync. The replication is synchronous at the storage layer. It is not the engine-level replication that read replicas use (such as MySQL binlog replication or PostgreSQL WAL streaming). RDS uses a proprietary block-level replication mechanism that mirrors the underlying EBS volumes. For Amazon Aurora, the replication is at the storage node level, but for RDS MySQL, PostgreSQL, Oracle, and MariaDB it is a synchronous block-level mirror. SQL Server instead uses its native mirroring features, covered in the technical deep dive later in this chapter.
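The commit rule above can be illustrated with a toy model. This is only a sketch of the acknowledgment ordering, not how RDS is implemented: the `Volume` and `MultiAZPair` classes are hypothetical names, and each volume is modeled as a simple in-memory list.

```python
class Volume:
    """Stands in for one EBS volume: an append-only list of durable writes."""

    def __init__(self):
        self.durable = []

    def persist(self, record):
        self.durable.append(record)
        return True  # acknowledgment that the record is stored


class MultiAZPair:
    """Toy model: a commit is acknowledged only after BOTH volumes persist it."""

    def __init__(self):
        self.primary = Volume()
        self.standby = Volume()

    def commit(self, record):
        self.primary.persist(record)
        standby_ack = self.standby.persist(record)
        # The client sees success only once the standby has acknowledged.
        return standby_ack

    def fail_over(self):
        """Promote the standby, then rebuild a fresh standby from the new primary."""
        self.primary = self.standby
        rebuilt = Volume()
        rebuilt.durable = list(self.primary.durable)
        self.standby = rebuilt


pair = MultiAZPair()
acknowledged = [txn for txn in ["t1", "t2", "t3"] if pair.commit(txn)]
pair.fail_over()
# Every acknowledged transaction is present on the promoted instance (RPO=0).
print(pair.primary.durable == acknowledged)
```

Because the acknowledgment is returned only after the standby has stored the record, the set of acknowledged transactions can never be larger than what the standby holds, which is the property RPO=0 depends on.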

Key Components and Defaults

Two DB Instances: One primary, one standby. Both run in different AZs within the same Region.

One DNS Name: The endpoint is a CNAME record (e.g., mydb.c9akciq32.us-east-1.rds.amazonaws.com) that always points to the primary. After failover, the DNS record is updated to point to the new primary.

Synchronous Replication: The default and only mode for Multi-AZ. There is no asynchronous option for the standby.

Failover Triggers:

- Loss of network connectivity to the primary (e.g., AZ failure)
- Primary DB instance failure (e.g., OS crash, hardware issue)
- Storage failure on the primary
- Manual failover (Reboot with failover) for testing
- RDS maintenance events (e.g., patching, scaling) that require a reboot

Failover Time: Typically 60-120 seconds. During this time, the DNS record is updated, and the standby is promoted to primary. Existing connections are dropped and must reconnect.

No Application Changes Required: Because the endpoint remains the same, applications simply reconnect after failover.

Configuration and Verification

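You can enable Multi-AZ when creating a DB instance or modify an existing single-AZ instance to Multi-AZ. The conversion itself does not require downtime: RDS takes a snapshot of the primary, restores it into another AZ as the standby, and then establishes synchronous replication. Expect elevated latency on the primary while the standby is being created and synced, so schedule the change for a low-traffic period.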

To verify Multi-AZ status:

AWS Console: DB instance details show "Multi-AZ: Yes" and the secondary AZ.

AWS CLI: aws rds describe-db-instances --db-instance-identifier mydb returns "MultiAZ": true and "SecondaryAvailabilityZone": "us-east-1b".

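CloudWatch Metrics: Monitor DatabaseConnections and WriteLatency (which may be slightly higher due to synchronous replication). ReplicaLag applies to read replicas, not to the Multi-AZ standby, which has no lag metric because replication is synchronous.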
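If you want to check the CLI output programmatically, a minimal sketch might look like the following. The `SAMPLE` payload is a trimmed, illustrative stand-in for real `describe-db-instances` output (which carries many more fields), the primary AZ value in it is an assumption, and `multi_az_summary` is a hypothetical helper name.

```python
import json

# Trimmed, illustrative payload; the primary AZ "us-east-1a" is an assumed value.
SAMPLE = """
{"DBInstances": [
  {"DBInstanceIdentifier": "mydb",
   "MultiAZ": true,
   "AvailabilityZone": "us-east-1a",
   "SecondaryAvailabilityZone": "us-east-1b"}
]}
"""


def multi_az_summary(describe_output):
    """Map each instance identifier to (is_multi_az, primary_az, secondary_az)."""
    summary = {}
    for inst in json.loads(describe_output)["DBInstances"]:
        summary[inst["DBInstanceIdentifier"]] = (
            inst.get("MultiAZ", False),
            inst.get("AvailabilityZone"),
            # Absent on single-AZ instances, so .get() returns None there.
            inst.get("SecondaryAvailabilityZone"),
        )
    return summary


print(multi_az_summary(SAMPLE))
```

In practice you would pipe the output of the `aws rds describe-db-instances` command shown above into the function instead of using the embedded sample.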

Example CLI command to create a Multi-AZ instance:

aws rds create-db-instance \
    --db-instance-identifier mymultiazdemo \
    --db-instance-class db.m5.large \
    --engine mysql \
    --master-username admin \
    --master-user-password mypassword \
    --allocated-storage 100 \
    --multi-az \
    --db-subnet-group-name mydbsubnetgroup

Interaction with Related Technologies

RDS Read Replicas: Can be combined with Multi-AZ. The primary (in Multi-AZ) can have up to 15 read replicas on MySQL, MariaDB, and PostgreSQL, and up to 5 on Oracle and SQL Server. The read replicas are asynchronous and can be in the same or a different Region.

RDS Proxy: Works seamlessly with Multi-AZ. RDS Proxy maintains connection pools and automatically reconnects to the new primary after failover, reducing application disruption.

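Automated Backups: In a Multi-AZ deployment, backups are taken from the standby, so the brief I/O suspension that a single-AZ instance sees during snapshot creation is avoided on the primary (for MariaDB, MySQL, Oracle, and PostgreSQL). On SQL Server, I/O is still briefly suspended during backup.

Scaling: When changing the instance class, RDS typically applies the change to the standby first, then promotes it to primary (a failover occurs) and modifies the old primary. This limits downtime to roughly the failover time. Storage scaling generally does not require a failover.

Maintenance: For OS updates, RDS patches the standby, fails over to it, and then patches the old primary, so downtime is limited to the failover. Database engine version upgrades are applied to both instances at the same time and do cause downtime for the length of the upgrade.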

Technical Deep Dive: The Replication Mechanism

For MySQL, MariaDB, Oracle, and PostgreSQL, RDS uses Amazon's own failover technology: a block-level replication layer that mirrors writes to the standby's EBS volume. It is not the engine's native replication (MySQL binlog replication, Oracle Data Guard, or PostgreSQL streaming replication). SQL Server is the exception: it uses SQL Server Database Mirroring (with synchronous commit) or Always On Availability Groups.

Because replication is synchronous, write latency is inherently higher than in single-AZ: the primary must wait for the standby's acknowledgment. The additional latency is roughly equal to the round-trip time (RTT) between AZs, typically 1-2 ms within the same Region. This is acceptable for most OLTP workloads, but high-throughput write-heavy applications may see a performance impact.
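To see why a millisecond or two matters for write-heavy work, here is a back-of-the-envelope sketch. The 2 ms local commit latency is a hypothetical figure chosen for illustration, and the model covers only a single connection that commits one transaction at a time.

```python
def serial_commits_per_second(local_commit_ms, inter_az_rtt_ms=0.0):
    """Upper bound on commits/second for ONE connection committing serially."""
    return 1000.0 / (local_commit_ms + inter_az_rtt_ms)


# Hypothetical 2 ms local commit, plus the 1-2 ms inter-AZ RTT from the text.
single_az = serial_commits_per_second(2.0)
multi_az_low = serial_commits_per_second(2.0, 1.0)
multi_az_high = serial_commits_per_second(2.0, 2.0)

print(f"single-AZ:         {single_az:.0f} commits/s")
print(f"Multi-AZ (1 ms):   {multi_az_low:.0f} commits/s")
print(f"Multi-AZ (2 ms):   {multi_az_high:.0f} commits/s")
```

Under these assumptions a serial writer loses a third to a half of its commit rate. Many concurrent connections hide most of that, which is why typical OLTP traffic is less affected than single-threaded batch inserts.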

Failover Process in Detail

When a failure is detected (e.g., health check failures, loss of heartbeat), RDS automatically initiates failover:

1. RDS stops accepting connections on the primary and marks it as unavailable.

2. The standby is promoted to primary. Its EBS volume becomes the primary volume.

3. The DNS record is updated to point to the new primary's IP address.

4. The old primary, if it recovers, is automatically rebuilt as a new standby and re-synced.

The entire process is automatic and typically completes within 1-2 minutes. Applications must be designed to retry connections with exponential backoff.
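A minimal sketch of that retry behavior follows. The function name, the default parameter values, and the use of `ConnectionError` are assumptions for illustration; a real database driver raises its own exception type, which you would catch instead.

```python
import random
import time


def connect_with_backoff(connect, max_attempts=8, base_delay=0.5,
                         max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, using capped exponential backoff with jitter."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Delay ceiling doubles each attempt: 0.5, 1, 2, 4, ... up to max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # "Full jitter": sleep a random time up to the ceiling so that many
            # clients do not all reconnect at the same instant after failover.
            sleep(random.uniform(0, ceiling))
```

`connect` is any zero-argument callable that opens a connection to the RDS endpoint. Tune `max_attempts` and `max_delay` so the total retry window comfortably exceeds the failover time you expect.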

Exam-Relevant Details

Multi-AZ is NOT free: You pay for both instances (compute) and double the storage (each AZ has its own EBS volume). The standby incurs storage costs but no I/O costs unless it becomes primary.
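As a rough illustration of the cost effect, the sketch below follows this chapter's simplification that Multi-AZ doubles both compute and storage. The hourly and per-GB rates are hypothetical placeholders, not AWS prices, and `monthly_cost` is an illustrative helper.

```python
def monthly_cost(instance_hourly, storage_gb, storage_gb_month,
                 multi_az=False, hours=730):
    """Rough monthly compute + storage cost; Multi-AZ is modeled as doubling both."""
    factor = 2 if multi_az else 1
    return factor * (instance_hourly * hours + storage_gb * storage_gb_month)


# Hypothetical rates: $0.20 per instance-hour, $0.10 per GB-month, 100 GB allocated.
single = monthly_cost(0.20, 100, 0.10)
multi = monthly_cost(0.20, 100, 0.10, multi_az=True)
print(f"single-AZ: ${single:.2f}/month, Multi-AZ: ${multi:.2f}/month")
```

Check the current RDS pricing page for real figures before using numbers like these in a design document.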

Multi-AZ is available for all RDS engines except Amazon Aurora (Aurora has its own built-in storage-level replication across 3 AZs, which is different).

Failover can be forced: By rebooting the DB instance with the "Reboot with failover" option. This is useful for testing application resilience.

No manual intervention needed: RDS handles everything.

Read replicas can also be Multi-AZ: If you create a read replica from a Multi-AZ source, you can enable Multi-AZ on the read replica itself for high availability of the read workload.

Common Misconfigurations

Not using a DB subnet group with at least two AZs: Multi-AZ requires subnets in two different AZs. If the subnet group only covers one AZ, the configuration will fail.

Assuming the standby can serve reads: It cannot. To offload read traffic, use read replicas.

Ignoring failover impact on long-running transactions: Any in-flight transactions are rolled back on failover. Applications must handle this.
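The subnet group requirement in the first item is easy to check before deploying. The sketch below assumes a subnet list shaped like the `Subnets` array in `aws rds describe-db-subnet-groups` output; the subnet IDs are made up, and both function names are hypothetical.

```python
def azs_covered(subnets):
    """Return the set of AZ names found in a DB subnet group's subnet list."""
    return {s["SubnetAvailabilityZone"]["Name"] for s in subnets}


def supports_multi_az(subnets):
    """Multi-AZ needs subnets in at least two distinct Availability Zones."""
    return len(azs_covered(subnets)) >= 2


# Illustrative data; the subnet identifiers are invented for this example.
one_az = [
    {"SubnetIdentifier": "subnet-aaa", "SubnetAvailabilityZone": {"Name": "us-east-1a"}},
    {"SubnetIdentifier": "subnet-bbb", "SubnetAvailabilityZone": {"Name": "us-east-1a"}},
]
two_az = one_az + [
    {"SubnetIdentifier": "subnet-ccc", "SubnetAvailabilityZone": {"Name": "us-east-1b"}},
]

print(supports_multi_az(one_az), supports_multi_az(two_az))
```

Note that the check counts distinct AZs rather than subnets: two subnets in the same AZ do not help.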

Walk-Through

1. Client Issues a Write Transaction

An application sends an INSERT statement to the RDS endpoint. The database client library (e.g., JDBC, psycopg2) sends the SQL over a TCP connection to the primary DB instance. The primary receives the query, begins a transaction, and prepares to write the data to its local storage.

2. Primary Writes and Replicates to the Standby

The primary writes the transaction to its local EBS volume (the redo log and data pages). At the same time, the same block-level writes are mirrored over a dedicated, low-latency network connection to the standby in another AZ. The replication is synchronous: the primary does not proceed until it gets an acknowledgment from the standby.

3. Standby Writes and Acknowledges

The standby receives the replicated writes and persists them to its own EBS volume. It then sends an acknowledgment back to the primary over the same network path. AWS does not document the internals beyond this point; what matters for the exam is that the write is durable in both AZs before the commit is acknowledged.

4. Primary Commits to Client

After receiving the acknowledgment, the primary writes a commit record to its redo log (also synchronously replicated) and then sends a success response to the client. The client now considers the transaction committed. This guarantees that if the primary fails immediately after, the standby has the committed data.

5. Failover Trigger and Promotion

If the primary becomes unreachable (e.g., network partition, hardware failure, AZ outage), RDS detects the failure via health checks and a loss of heartbeat. It then promotes the standby to primary: the standby's EBS volume is promoted, DNS record is updated to point to the standby's IP, and the standby begins accepting connections. The old primary, if it recovers, is automatically rebuilt as a new standby and re-synchronized from the new primary.
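During a failover test you can watch the DNS update from the client side. The following is a sketch under stated assumptions: `wait_for_endpoint_change` is a hypothetical helper, the polling interval and timeout are arbitrary defaults, and local DNS caching (in the OS or in a runtime such as the JVM) can delay what the client actually sees.

```python
import socket
import time


def wait_for_endpoint_change(hostname, resolve=socket.gethostbyname,
                             interval=5.0, timeout=300.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll DNS until hostname resolves to a different IP; return (old_ip, new_ip).

    Raises TimeoutError if the address has not changed within `timeout` seconds.
    """
    old_ip = resolve(hostname)
    deadline = clock() + timeout
    while clock() < deadline:
        sleep(interval)
        try:
            new_ip = resolve(hostname)
        except OSError:
            continue  # transient resolution failure during failover; keep polling
        if new_ip != old_ip:
            return old_ip, new_ip
    raise TimeoutError(f"{hostname} still resolves to {old_ip}")
```

Start the function against your instance endpoint, trigger a reboot with failover, and the returned pair shows the old and new primary addresses. The `resolve`, `sleep`, and `clock` parameters exist so the logic can be exercised without a real network.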

What This Looks Like on the Job

In production, I have deployed RDS Multi-AZ for a financial services application that required zero data loss and high availability. The database was a 2 TB MySQL instance handling credit card transactions. We enabled Multi-AZ from the start. During a real AZ outage in us-east-1 (the infamous 2017 S3 outage that also affected EC2), our RDS instance automatically failed over to the standby in us-east-1b. The failover took about 90 seconds. Our application had connection retry logic with a 5-second backoff, so after about 2 minutes, all connections were re-established. We lost zero transactions because of synchronous replication. The only impact was a brief write unavailability.

Another scenario: an e-commerce platform with a global customer base used Multi-AZ for its product catalog database. They also had read replicas in three other regions for read scaling. The Multi-AZ primary handled all writes, and the replicas served read traffic. During a planned maintenance (OS patching), RDS automatically failed over to the standby, which then became the new primary. The read replicas automatically re-pointed to the new primary. The entire maintenance window was transparent to the application except for a 1-minute write pause.

A common mistake I've seen: a startup tried to enable Multi-AZ with a DB subnet group whose subnets all sat in one AZ. That leaves RDS nowhere to place the standby in a separate AZ. The fix is to ensure the DB subnet group includes subnets in at least two AZs. Also, some engineers assume they can use the standby for read offloading, which they can't. They end up adding read replicas anyway, which increases costs. The lesson: Multi-AZ is for availability, not performance.

Performance considerations: For a write-heavy workload (e.g., 10,000 writes/second), the synchronous replication adds about 1-2 ms per write. This is acceptable for OLTP, but for batch inserts, you might see throughput drop. In one case, we had to move to Aurora for its better write performance with built-in replication. Also, storage type matters: Provisioned IOPS (io1/io2) is recommended to avoid replication lag spikes due to EBS burst balance depletion.

How SAA-C03 Actually Tests This

On the SAA-C03 exam, RDS Multi-AZ is tested under Domain 2: Resilient Architectures, Objective 2.3: "Design a high-availability and fault-tolerant architecture." Questions often present a scenario where an application needs high availability with automatic failover and minimal data loss. The correct answer is almost always "Use RDS Multi-AZ" if the requirement is for a single database instance. If the scenario mentions read scaling, they might combine Multi-AZ with read replicas.

Common wrong answers:

1. "Use RDS Read Replicas": Candidates confuse read replicas (asynchronous, for read scaling) with Multi-AZ (synchronous, for failover). Read replicas do not provide automatic failover; you must manually promote them.

2. "Deploy RDS in a single AZ with automated backups": Backups do not provide high availability; they only allow point-in-time recovery, which takes minutes to hours.

3. "Use Amazon Aurora": While Aurora does have high availability, the question may explicitly ask about RDS (MySQL/PostgreSQL). If the engine is not specified, Aurora is often a better choice, but if the scenario says "RDS MySQL," Multi-AZ is the answer.

4. "Enable Multi-AZ on the read replica": This is a trick. You can enable Multi-AZ on a read replica, but that makes the read replica highly available, not the primary. The question is about the primary database.

Specific numbers: The exam expects you to know that failover takes 60-120 seconds. RPO is zero. Multi-AZ is supported for all RDS engines except Aurora. The standby cannot serve reads. You must have subnets in at least two AZs in the DB subnet group.

Edge cases: If the question says "minimize downtime during a failover" and suggests using RDS Proxy, that is correct—RDS Proxy reduces reconnect time. Also, if the question involves a cross-Region disaster recovery, Multi-AZ alone is insufficient; you need cross-Region read replicas or Aurora Global Database.

Eliminating wrong answers: If the scenario mentions "read traffic offload," eliminate Multi-AZ as the sole solution—you need read replicas. If it mentions "cost optimization" and Multi-AZ is an option, it is usually more expensive (two instances). If it mentions "zero data loss," Multi-AZ is correct (RPO=0). If it mentions "automated failover in under 5 minutes," Multi-AZ fits.

Key Takeaways

RDS Multi-AZ provides automatic failover with zero data loss (RPO=0) using synchronous replication.

Failover typically completes in 60-120 seconds; applications must handle connection retries.

The standby cannot be used for reads; it is only for failover.

Multi-AZ is supported for MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server (not Aurora).

Converting a single-AZ instance to Multi-AZ does not require downtime, but latency can rise while the standby is created and synced.

You must have subnets in at least two AZs in the DB subnet group.

Multi-AZ doubles compute and storage costs (standby incurs charges).

RDS Proxy can reduce application disruption during failover by maintaining connection pools.

Read replicas can be combined with Multi-AZ for read scaling.

Manual failover can be triggered by rebooting with the 'Reboot with failover' option.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

RDS Multi-AZ

Synchronous replication (zero data loss)

Standby not accessible for reads

Automatic failover in 60-120 seconds

Requires two AZs in same Region

Higher write latency due to sync replication

RDS Read Replicas

Asynchronous replication (eventually consistent)

Read replicas serve read traffic

Manual promotion required for failover

Can be in same or different Region (cross-Region)

No impact on primary write latency

Watch Out for These

Mistake

The standby in a Multi-AZ deployment can be used for read queries.

Correct

The standby is not accessible for reads or writes. It exists solely for failover. To offload read traffic, you must create read replicas (asynchronous).

Mistake

Multi-AZ replication is asynchronous like read replicas.

Correct

Multi-AZ uses synchronous replication. Every write must be committed on both instances before the client receives acknowledgment. This ensures zero data loss (RPO=0) but adds latency.

Mistake

Enabling Multi-AZ only doubles the storage cost; the standby's compute is free.

Correct

It does double the storage cost because each AZ has its own EBS volume. However, you also pay for both compute instances. The standby incurs compute and storage charges but no I/O charges (unless it becomes primary).

Mistake

Multi-AZ automatically scales read capacity.

Correct

Multi-AZ does not scale reads. Only the primary handles all traffic. For read scaling, you need read replicas.

Mistake

Failover is instant and no connections are dropped.

Correct

Failover typically takes 60-120 seconds. Existing connections to the primary are dropped. Applications must reconnect to the new primary using the same DNS endpoint.


Frequently Asked Questions

What is the difference between Multi-AZ and Read Replicas in RDS?

Multi-AZ provides high availability via synchronous replication and automatic failover; the standby is not accessible for reads. Read replicas provide read scaling via asynchronous replication; they can be promoted manually for failover but do not provide automatic failover. For zero data loss and automatic failover, use Multi-AZ. For read offloading, use read replicas.

Does Multi-AZ cause any performance impact?

Yes, write latency increases because every write must be acknowledged by the standby before commit. The additional latency is roughly the round-trip time between AZs (typically 1-2 ms in the same region). For write-heavy workloads, consider using Provisioned IOPS storage to minimize impact.

Can I use Multi-AZ with Amazon Aurora?

Aurora has its own built-in high availability with storage replication across 3 AZs. You do not need to enable RDS Multi-AZ for Aurora. If you create an Aurora cluster, it is inherently highly available. The concept of Multi-AZ as defined for RDS does not apply.

How do I force a failover for testing?

You can reboot the DB instance and select 'Reboot with failover' in the AWS Console, or use the AWS CLI: `aws rds reboot-db-instance --db-instance-identifier mydb --force-failover`. This will cause the standby to become the new primary.

What happens to the old primary after a failover?

After a failover, the old primary (if it recovers) is automatically rebuilt as a new standby and re-synchronized with the new primary. This process is automatic and transparent.

Can I change from single-AZ to Multi-AZ without downtime?

Yes. RDS takes a snapshot of the primary, restores it in another AZ as the standby, and sets up synchronous replication, all without taking the instance offline. You may see higher latency while this happens, so make the change during a low-traffic period.

Is there any data loss during a Multi-AZ failover?

No, because replication is synchronous, the standby has all committed transactions at the time of failover. RPO is 0. However, any in-flight transactions that were not committed are lost.
