This chapter covers Amazon RDS Multi-AZ deployments and Read Replicas, two critical features for database high availability and scalability. For the SOA-C02 exam, understanding the differences between synchronous and asynchronous replication, failover behavior, and use cases is essential. Approximately 10-15% of exam questions touch on RDS, with Multi-AZ and Read Replicas being frequent topics. You will need to know when to use each, how they interact with other services like Route 53 and CloudWatch, and common pitfalls.
Imagine a university library with a main collection and a backup storage facility. Multi-AZ is like having a complete backup copy of every book stored in a fireproof vault in a separate building. If the main library burns down, the vault copy is immediately available—no data loss, but only one copy can be checked out at a time. Read Replicas, on the other hand, are like satellite libraries in different towns. The main library sends a daily courier with copies of new books (asynchronous replication). Students in those towns can read the copies without affecting the main library's operations. However, if the main library burns down, the satellite copies might be missing the last few hours of new arrivals. Also, you cannot write to the satellite libraries—they are read-only. In AWS terms, Multi-AZ provides synchronous replication to a standby in another AZ for high availability, while Read Replicas provide asynchronous replication to up to 15 read-only copies for scaling read traffic. The standby in Multi-AZ is not used for reads unless a failover occurs, whereas Read Replicas actively serve SELECT queries.
What is RDS Multi-AZ?
Amazon RDS Multi-AZ (Multi-Availability Zone) is a deployment option that provides enhanced availability and durability for RDS database instances. When you enable Multi-AZ, Amazon RDS automatically provisions and maintains a synchronous standby replica in a different Availability Zone (AZ) within the same AWS Region. The primary database synchronously replicates every write to the standby. This ensures zero data loss (RPO=0) and a brief failover (typically 1-2 minutes) if the primary fails. Multi-AZ is not for scaling read traffic; the standby is not accessible for reads unless a failover occurs.
What are Read Replicas?
Read Replicas are read-only copies of your RDS database that are asynchronously replicated from the primary instance. You can create up to 15 Read Replicas per primary for MySQL, MariaDB, and PostgreSQL; Oracle and SQL Server support up to 5. They are used to offload read-heavy workloads, such as reporting or analytics, from the primary database. Read Replicas can be in the same region or cross-region, and they can be promoted to a standalone primary if needed. However, asynchronous replication means there may be a small lag (typically <1 second) between a write on the primary and its visibility on the replica, so replicas are not suitable for applications requiring strong consistency.
How Multi-AZ Works Internally
When you enable Multi-AZ on an RDS instance, AWS replicates synchronously at the storage (block) level between the primary and standby. For MySQL, MariaDB, PostgreSQL, and Oracle this is Amazon's own replication technology, not the engine's native mechanism (MySQL's semi-synchronous replication, for example, is not involved); SQL Server instead relies on Database Mirroring or Always On Availability Groups. The primary writes data to its EBS volume, and the same data is written synchronously to the standby's EBS volume. The standby is fully provisioned with the same storage, compute, and network resources. AWS monitors the health of both instances using a heartbeat mechanism. If the primary becomes unreachable (e.g., AZ outage, instance failure, network partition), Amazon RDS automatically fails over by updating the DNS record (CNAME) behind the instance endpoint to point to the standby. The failover typically takes 60-120 seconds. Multi-AZ is supported for all RDS database engines except for some older versions.
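One way to see this failover behavior for yourself is to trigger it deliberately and watch the primary's Availability Zone change. The following is a minimal sketch, assuming a Multi-AZ instance named mydb (the identifier used in this chapter's examples) and an AWS CLI that is already configured with credentials. The function names current_az and force_failover are illustrative helpers, not AWS CLI commands.

```shell
# Print the Availability Zone that currently hosts the primary.
current_az() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].AvailabilityZone' \
    --output text
}

# Reboot with a forced failover so the standby takes over.
# Only valid on Multi-AZ instances; connections drop while DNS is repointed.
force_failover() {
  local db="$1"
  echo "primary AZ before failover: $(current_az "$db")"
  aws rds reboot-db-instance \
    --db-instance-identifier "$db" \
    --force-failover >/dev/null
}

# Usage (hypothetical identifier):
#   force_failover mydb
#   current_az mydb   # run again after a couple of minutes; the AZ should differ
```

Run this only against a test instance, since it causes a real outage for the duration of the failover.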
How Read Replicas Work Internally
Read Replicas use asynchronous replication based on the database engine's native replication. For MySQL and MariaDB, this is based on binary log (binlog) replication. The primary writes changes to its binlog, and the replica reads the binlog and applies the changes. For PostgreSQL, it uses streaming replication with WAL (Write-Ahead Log) segments. For Oracle, it uses Active Data Guard. For SQL Server, it uses Always On Availability Groups. The replication is asynchronous, meaning the primary does not wait for the replica to confirm receipt before committing the transaction. This can introduce replication lag. Amazon RDS monitors replication lag via the Amazon CloudWatch metric ReplicaLag. If the lag exceeds a threshold (default 5 minutes for MySQL), the replica will automatically restart replication. Read Replicas can be created in the same region or a different region. Cross-region replicas are useful for disaster recovery or geographic proximity. Read Replicas can also be promoted to a standalone primary, breaking the replication chain.
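To make the ReplicaLag metric concrete, here is a small sketch that reads the recent average from CloudWatch and classifies it. It assumes a replica named myreplica and configured AWS CLI credentials; the 30-second default threshold and the helper names replica_lag_avg and lag_status are arbitrary choices for illustration.

```shell
# Print the average ReplicaLag (in seconds) over the last 10 minutes.
# Prints "None" when CloudWatch has no datapoints for the window.
replica_lag_avg() {
  local replica="$1" now
  now="$(date +%s)"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=${replica}" \
    --start-time "$((now - 600))" \
    --end-time "$now" \
    --period 600 \
    --statistics Average \
    --query 'Datapoints[0].Average' \
    --output text
}

# Classify a lag value against a threshold in seconds (default 30).
lag_status() {
  local lag="$1" threshold="${2:-30}"
  case "$lag" in
    ''|None) echo "NO_DATA"; return 0 ;;
  esac
  # awk handles the fractional values CloudWatch returns.
  awk -v l="$lag" -v t="$threshold" \
    'BEGIN { if (l + 0 > t + 0) print "LAGGING"; else print "OK" }'
}

# Usage (hypothetical identifier):
#   lag_status "$(replica_lag_avg myreplica)"
```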
Key Components, Values, Defaults, and Timers
Multi-AZ: Synchronous replication, automatic failover, standby is not used for reads. Failover time: 60-120 seconds. RPO=0. The standby is not free: you pay for both instances, so a Multi-AZ deployment costs roughly twice as much as Single-AZ. Supported for all RDS engines except some older versions.
Read Replicas: Asynchronous replication, up to 15 replicas per primary. Replicas can be in same or different region. Replication lag is normal; maximum lag before replica restart is configurable (default 5 minutes for MySQL). Read Replicas can be promoted (takes a few minutes). Cross-region replicas incur data transfer costs.
Failover Behavior: In Multi-AZ, if the primary fails, RDS updates the DNS record to point to the standby. The application should use the RDS endpoint (not the instance IP) to automatically reconnect. No need to modify application code. For Read Replicas, if the primary fails, the replicas continue to serve stale data until promoted or the primary recovers.
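Because the endpoint name stays the same during a Multi-AZ failover, the client's job is simply to retry until DNS points at the new primary. The sketch below shows one possible retry loop with exponential backoff. The function name retry_with_backoff, the attempt count, and the delays are illustrative assumptions, and the endpoint in the usage comment is a made-up hostname.

```shell
# Retry a command with exponential backoff.
# Arguments: max attempts, initial delay in seconds, then the command to run.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage (hypothetical endpoint). Six attempts starting at 5 seconds wait up to
# 5+10+20+40+80 = 155 seconds in total, which covers a 60-120 second failover:
#   retry_with_backoff 6 5 mysqladmin ping -h mydb.example.us-east-1.rds.amazonaws.com
```

In a real application the same idea usually lives in the database driver or connection pool configuration rather than in a shell loop.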
Configuration and Verification
To enable Multi-AZ, specify the MultiAZ parameter when creating or modifying a DB instance via the AWS Console, CLI, or API. For example:
aws rds create-db-instance --db-instance-identifier mydb --multi-az --engine mysql --db-instance-class db.t3.medium --master-username admin --master-user-password password --allocated-storage 100
To modify an existing instance:
aws rds modify-db-instance --db-instance-identifier mydb --multi-az --apply-immediately
To create a Read Replica:
aws rds create-db-instance-read-replica --db-instance-identifier myreplica --source-db-instance-identifier mydb --region us-west-2
To verify status:
aws rds describe-db-instances --db-instance-identifier mydb --query 'DBInstances[0].MultiAZ'
aws rds describe-db-instances --db-instance-identifier myreplica --query 'DBInstances[0].ReadReplicaSourceDBInstanceIdentifier'
Interaction with Related Technologies
Route 53: RDS uses DNS CNAME records for endpoints. Multi-AZ failover updates the CNAME. Read Replicas have their own endpoints. You can use Route 53 weighted routing to distribute read traffic among replicas.
CloudWatch: Monitor ReplicaLag for Read Replicas, DatabaseConnections, CPUUtilization, etc. CloudWatch alarms can notify an SNS topic, which can in turn invoke automation (for example, a Lambda function that promotes a replica) when a threshold is crossed.
DMS (Database Migration Service): Can be used to migrate data to RDS with Multi-AZ or create continuous replication for migration.
RDS Proxy: Can be used to pool connections and improve scalability for both Multi-AZ and Read Replicas.
Lambda: Can be triggered by CloudWatch alarms to automate failover or scaling actions.
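The Route 53 weighted routing idea from the list above can be sketched as a change batch with one weighted CNAME per replica. This is a rough illustration only: the record name, replica endpoints, and hosted zone variable are hypothetical, and weighted_cname_change is a helper invented here to build the JSON that aws route53 change-resource-record-sets expects.

```shell
# Build a Route 53 change batch that upserts one weighted CNAME record.
# Arguments: record name, set identifier, weight, target replica endpoint.
weighted_cname_change() {
  local name="$1" set_id="$2" weight="$3" target="$4"
  printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"CNAME","SetIdentifier":"%s","Weight":%s,"TTL":60,"ResourceRecords":[{"Value":"%s"}]}}]}\n' \
    "$name" "$set_id" "$weight" "$target"
}

# Usage (hypothetical zone ID and endpoints), one call per replica:
#   aws route53 change-resource-record-sets \
#     --hosted-zone-id "$ZONE_ID" \
#     --change-batch "$(weighted_cname_change reads.example.internal replica-1 50 \
#                       replica1.example.us-east-1.rds.amazonaws.com)"
```

Records that share a name and type but have different set identifiers receive traffic in proportion to their weights, so two replicas at weight 50 each would split reads roughly evenly.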
Performance Considerations
Multi-AZ adds a small latency overhead due to synchronous replication (typically <1ms). The standby consumes compute and storage resources. For Read Replicas, the primary's binlog or WAL generation can add I/O overhead. Replication lag can increase under heavy write load on the primary. To minimize lag, ensure the replica instance class is at least as powerful as the primary. Also consider using Multi-AZ for the primary if you need high availability while also using Read Replicas for read scaling.
Exam Tips
Multi-AZ is for high availability, not read scaling. Read Replicas are for read scaling, not high availability (unless you promote them).
Multi-AZ failover is automatic and requires no manual intervention. Read Replica promotion is manual.
Multi-AZ synchronous replication ensures zero data loss. Read Replicas may lose recent writes if the primary fails before they have been replicated.
You can have both Multi-AZ and Read Replicas on the same primary. The Multi-AZ standby is synchronous, while Read Replicas are asynchronous.
Cross-region Read Replicas are useful for disaster recovery but incur data transfer costs and higher latency.
The exam may ask about the maximum number of Read Replicas (15) and the replication lag metric (ReplicaLag).
1. Enable Multi-AZ on RDS
When you enable Multi-AZ on an RDS instance, AWS automatically provisions a standby instance in a different Availability Zone within the same region. The standby has the same compute, storage, and network configuration as the primary. Synchronous replication is established at the block level. The primary writes data to its EBS volume, and the same data is written to the standby's EBS volume before the write is acknowledged to the application. This ensures zero data loss (RPO=0). The standby is fully synchronized and ready to take over immediately if the primary fails.
2. Monitor health with heartbeat
Amazon RDS continuously monitors the health of the primary and standby instances using a heartbeat mechanism. The heartbeat checks network connectivity, database availability, and storage health. If the primary becomes unreachable due to an AZ outage, instance failure, or network partition, the heartbeat fails. AWS initiates a failover by updating the DNS CNAME record to point to the standby's endpoint. The failover typically completes within 60-120 seconds. During this time, existing connections to the primary are dropped, and applications must reconnect using the same endpoint.
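After a failover, the RDS event log is one place to confirm what happened and when. The sketch below lists recent events for an instance and keeps only those that mention a failover. It assumes an instance named mydb and configured AWS CLI credentials; failover_events is an illustrative helper name, and the exact wording of event messages is not guaranteed here.

```shell
# List recent events for a DB instance and keep those mentioning a failover.
# Arguments: instance identifier, look-back window in minutes (default 60).
failover_events() {
  local db="$1" minutes="${2:-60}"
  aws rds describe-events \
    --source-identifier "$db" \
    --source-type db-instance \
    --duration "$minutes" \
    --query 'Events[].[Date,Message]' \
    --output text | grep -i 'failover'
}

# Usage (hypothetical identifier):
#   failover_events mydb 120
```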
3. Create Read Replica
To create a Read Replica, you specify the source DB instance and optionally a different region. AWS takes a snapshot of the source instance and creates a new RDS instance from that snapshot. Then, asynchronous replication is established using the database engine's native replication (e.g., binlog for MySQL, WAL streaming for PostgreSQL). The primary continues to serve writes, and changes are propagated to the replica asynchronously. The replica can be used for read queries immediately after creation, but it will initially lag behind the primary until the snapshot is applied and replication catches up.
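For a cross-region replica, the source must be given as an ARN rather than a plain identifier, and the command runs in the destination region. Here is a minimal sketch under those assumptions; the account ID, regions, and identifiers in the usage comment are placeholders, and both helper function names are invented for this example.

```shell
# Build the ARN of a DB instance from its region, account ID, and identifier.
rds_instance_arn() {
  printf 'arn:aws:rds:%s:%s:db:%s\n' "$1" "$2" "$3"
}

# Create a read replica in target_region from a source identified by ARN.
# Encrypted sources additionally need a KMS key from the target region
# (--kms-key-id), which this sketch leaves out.
create_cross_region_replica() {
  local source_arn="$1" replica_id="$2" target_region="$3"
  aws rds create-db-instance-read-replica \
    --db-instance-identifier "$replica_id" \
    --source-db-instance-identifier "$source_arn" \
    --region "$target_region"
}

# Usage (placeholder account ID and names):
#   create_cross_region_replica \
#     "$(rds_instance_arn us-east-1 123456789012 mydb)" mydb-eu eu-west-1
```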
4. Monitor replication lag
Replication lag is the time difference between a write on the primary and its application on the replica. Amazon RDS exposes the `ReplicaLag` metric in CloudWatch, measured in seconds. If the lag exceeds a threshold (default 5 minutes for MySQL), the replica automatically restarts replication to recover. High lag can be caused by heavy write load on the primary, insufficient replica compute capacity, or network latency. You can set CloudWatch alarms to notify you if lag exceeds a certain value, and even trigger automated actions like promoting the replica if lag is too high.
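A CloudWatch alarm on ReplicaLag might look like the sketch below. It assumes an existing SNS topic to notify; the alarm name pattern, the five one-minute evaluation periods, and the helper name create_lag_alarm are illustrative choices, not recommended values.

```shell
# Create an alarm that fires when average ReplicaLag stays above the
# threshold (seconds) for five consecutive one-minute periods.
# Arguments: replica identifier, threshold in seconds, SNS topic ARN.
create_lag_alarm() {
  local replica="$1" threshold="$2" topic_arn="$3"
  aws cloudwatch put-metric-alarm \
    --alarm-name "${replica}-replica-lag" \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=${replica}" \
    --statistic Average \
    --period 60 \
    --evaluation-periods 5 \
    --threshold "$threshold" \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$topic_arn"
}

# Usage (placeholder topic ARN):
#   create_lag_alarm myreplica 30 arn:aws:sns:us-east-1:123456789012:db-alerts
```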
5. Promote Read Replica (if needed)
If the primary fails and you want to make a Read Replica the new primary, you can promote it. Promotion stops replication and makes the replica a standalone, read-write DB instance. This process takes a few minutes. Promotion is one-way: the promoted instance cannot be turned back into a replica of its former source. Promotion is manual and should be used for disaster recovery or scaling purposes. Note that a promoted replica may be missing the last few seconds of writes if the primary failed before replication completed.
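For a planned promotion, it can help to guard the command with a lag check so you do not promote a replica that is still catching up. The sketch below takes the lag value as an argument (for example, the latest ReplicaLag reading) and only promotes when it is at or below a limit. The 5-second default and the helper name promote_if_caught_up are assumptions for illustration; in a real outage you may have to promote regardless of lag.

```shell
# Promote a replica only if its lag (seconds) is at or below max_lag.
# Arguments: replica identifier, current lag, max acceptable lag (default 5).
promote_if_caught_up() {
  local replica="$1" lag="$2" max_lag="${3:-5}"
  case "$lag" in
    ''|*[!0-9.]*)
      echo "no usable lag value for ${replica}; not promoting" >&2
      return 1 ;;
  esac
  if awk -v l="$lag" -v m="$max_lag" 'BEGIN { exit !(l + 0 <= m + 0) }'; then
    aws rds promote-read-replica --db-instance-identifier "$replica"
  else
    echo "lag ${lag}s exceeds ${max_lag}s; not promoting ${replica}" >&2
    return 1
  fi
}

# Usage (hypothetical identifier and lag reading):
#   promote_if_caught_up myreplica 0.4
```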
Enterprise Scenario 1: E-commerce Platform with High Availability
A large e-commerce company runs its transactional database on RDS MySQL with Multi-AZ enabled. The database handles order processing, inventory updates, and user authentication. Multi-AZ ensures that if an AZ outage occurs, the standby in another AZ takes over without data loss. The failover is automatic and typically completes within 90 seconds. The application uses the RDS endpoint and reconnects seamlessly. The company also uses Read Replicas for reporting and analytics. They have three Read Replicas in the same region for real-time dashboards. The replicas are asynchronously replicated, so they accept a few seconds of lag for non-critical reports. They monitor ReplicaLag and have a CloudWatch alarm that triggers an SNS notification if lag exceeds 30 seconds. They also use Route 53 weighted routing to distribute read queries among the replicas.
Enterprise Scenario 2: Global SaaS Application with Cross-Region Disaster Recovery
A SaaS provider with customers worldwide uses RDS PostgreSQL for its multi-tenant database. They have a primary instance in us-east-1 with Multi-AZ enabled for high availability. To provide disaster recovery across regions, they set up a cross-region Read Replica in eu-west-1. The replica is used for read traffic from European customers and as a failover target if the primary region goes down. They periodically test failover by promoting the replica and redirecting traffic. They incur data transfer costs for cross-region replication. They also have local Read Replicas in each region for read scaling. The setup is complex but provides both high availability and global performance.
Common Pitfalls
Using Multi-AZ for read scaling: Some candidates think Multi-AZ provides a read endpoint. It does not. The standby is not accessible for reads unless failover occurs.
Assuming Read Replicas are highly available: Read Replicas are not automatically failed over. If a replica fails, you must create a new one. They are not a replacement for Multi-AZ.
Ignoring replication lag: In production, lag can cause stale reads. Applications must tolerate eventual consistency. Use ReplicaLag monitoring and set appropriate thresholds.
Promoting a replica with pending changes: If you promote a replica while replication is still catching up, the promoted instance may not have the latest data. Always wait for lag to be minimal.
What SOA-C02 Tests
Domain 2: Reliability, Objective 2.1: Implement high availability and scaling for compute, storage, and database services. Specific sub-objectives include:
Configure Multi-AZ for RDS to achieve high availability.
Configure Read Replicas for read scaling and disaster recovery.
Understand the difference between synchronous and asynchronous replication.
Know the failover behavior and how to verify it.
Understand the limitations: Multi-AZ standby is not used for reads; Read Replicas have replication lag; maximum 15 replicas.
Common Wrong Answers and Why
"Multi-AZ can be used to scale read traffic" – This is false because the standby is not accessible for reads. Candidates confuse Multi-AZ with Read Replicas.
"Read Replicas provide automatic failover" – False. Read Replicas are not failed over automatically. You must promote them manually. Candidates think replication implies automatic failover.
"Multi-AZ uses asynchronous replication" – False. Multi-AZ uses synchronous replication. Candidates confuse with Read Replicas.
"You can have up to 5 Read Replicas" – The correct limit is 15. Candidates might remember the old limit (5) from earlier AWS documentation.
Specific Numbers and Terms
Maximum Read Replicas: 15 per primary.
Failover time: 60-120 seconds.
Replication lag metric: ReplicaLag in CloudWatch.
Default max lag before restart: 5 minutes for MySQL.
Multi-AZ RPO: 0 (zero data loss).
Read Replica promotion: Manual, takes a few minutes.
Cross-region replicas: Supported for MySQL, MariaDB, PostgreSQL, Oracle, SQL Server.
Edge Cases and Exceptions
Multi-AZ for SQL Server: Uses SQL Server Database Mirroring (DBM) or Always On Availability Groups (AGs) instead of the block-level replication used by the other engines; which one applies depends on the engine version and edition.
Read Replica for Oracle: Requires Oracle Active Data Guard license.
Read Replica promotion: Once promoted, the replica becomes a standalone instance and cannot be reintegrated as a replica.
Replication lag in cross-region replicas: Can be higher due to network latency; typical lag is <1 second in same region, but can be seconds across regions.
How to Eliminate Wrong Answers
If the question mentions "automatic failover" and "no data loss", the answer is Multi-AZ.
If the question mentions "offload read traffic" or "scale reads", the answer is Read Replicas.
If the question mentions "asynchronous replication" and "eventual consistency", the answer is Read Replicas.
If the question mentions "synchronous replication" and "zero RPO", the answer is Multi-AZ.
If the question asks about increasing read capacity without changing the primary, think Read Replicas.
If the question asks about disaster recovery across regions, think cross-region Read Replicas.
Multi-AZ provides synchronous replication and automatic failover with RPO=0.
Read Replicas provide asynchronous replication for read scaling, up to 15 per primary.
Multi-AZ standby is not accessible for reads; Read Replicas are read-only.
Failover in Multi-AZ takes 60-120 seconds; Read Replica promotion is manual.
Monitor replication lag using CloudWatch metric ReplicaLag.
Cross-region Read Replicas incur data transfer costs.
You can have both Multi-AZ and Read Replicas on the same primary.
These come up on the exam all the time. Here's how to tell them apart.
Multi-AZ
Synchronous replication – zero data loss
Automatic failover – no manual intervention
Standby not used for reads
High availability within one region
No scaling of read traffic
Read Replicas
Asynchronous replication – possible data loss
Manual promotion for failover
Replicas serve read traffic
Read scaling and cross-region disaster recovery
Up to 15 replicas per primary
Mistake
Multi-AZ can be used to scale read traffic because you can read from the standby.
Correct
The standby in a Multi-AZ deployment is not accessible for reads or writes. It is only used for failover. To scale read traffic, use Read Replicas.
Mistake
Read Replicas automatically fail over if the primary fails.
Correct
Read Replicas do not provide automatic failover. If the primary fails, the replicas continue to serve stale data until you manually promote one of them to a new primary.
Mistake
Multi-AZ uses asynchronous replication, so there may be data loss.
Correct
Multi-AZ uses synchronous replication, ensuring zero data loss (RPO=0). Asynchronous replication is used for Read Replicas.
Mistake
You can have up to 5 Read Replicas for an RDS instance.
Correct
The current limit is 15 Read Replicas per primary instance. The old limit was 5, but it has been increased.
Mistake
Cross-region Read Replicas are free; you only pay for the replica instance.
Correct
Cross-region replication incurs data transfer costs between regions. You also pay for the replica instance itself.
No, Multi-AZ is for high availability only. The standby is not accessible for reads. You must use Read Replicas to offload read traffic.
Multi-AZ uses synchronous replication to a standby in another AZ for automatic failover. Read Replicas use asynchronous replication to multiple read-only copies for scaling reads. Multi-AZ provides zero data loss; Read Replicas may have lag.
You can create up to 15 Read Replicas per primary RDS instance.
No, Multi-AZ only operates within a single region across Availability Zones. For cross-region disaster recovery, use cross-region Read Replicas.
The Read Replicas continue to serve read traffic with the data they have. They do not automatically become the primary. You must manually promote one to a standalone primary if needed.
Yes, you can have a Multi-AZ primary and also create Read Replicas from it. The Multi-AZ standby is synchronous, while Read Replicas are asynchronous.
Use the Amazon CloudWatch metric 'ReplicaLag'. It shows the lag in seconds between the primary and replica.