This chapter covers Cloud SQL failover replicas and High Availability (HA) configuration, a core topic for the ACE exam under Objective 2.3 (Planning Solutions). Understanding how to plan for database resilience using regional persistent disks, standby instances, and automatic failover is critical for designing fault-tolerant applications. Expect 2-4 exam questions on this topic, focusing on configuration requirements, failover behavior, and cost implications.
Imagine a hospital operating room that must stay powered at all times. Primary power comes from the city grid. A backup generator is always running but is not connected to the main circuit. A transfer switch monitors the primary line: if the voltage drops below 200V for more than 10 seconds, the switch disconnects the grid and connects the generator within 5 seconds. The generator has its own fuel supply and is tested monthly, and the switch logs every event.
In Cloud SQL High Availability (HA), the primary instance is the grid, the standby instance is the generator, and the transfer switch is the failover mechanism: the health checks that detect a primary failure and the promotion of the standby. The analogy breaks down in one place. The generator has its own fuel supply, whereas in Cloud SQL both instances share the same regional persistent disk, which is replicated synchronously, so the standby has exactly the same data and does not need to catch up. The 10-second threshold corresponds to the 60-second health check failure window, the 5-second switchover corresponds to the roughly 60-120 second failover time, and the monthly test corresponds to the forced failover you should perform to verify readiness.
What is Cloud SQL High Availability (HA)?
Cloud SQL HA provides automatic failover capability for MySQL, PostgreSQL, and SQL Server instances. It ensures that if the primary zone or instance becomes unavailable, a standby instance in a different zone within the same region takes over with minimal data loss. The HA configuration is not a separate product; it is an option enabled when creating or modifying a Cloud SQL instance. The key architectural difference from a standalone instance is the use of a regional persistent disk and a standby instance.
How HA Works Internally
When you enable HA on a Cloud SQL instance, the following components are provisioned:
- Primary instance: Runs in one zone (e.g., us-central1-a).
- Standby instance: Runs in a different zone (e.g., us-central1-b) within the same region.
- Regional persistent disk: An SSD or HDD persistent disk that is replicated synchronously across the two zones. The primary and standby instances both mount this same disk, but only the primary has read-write access. The standby has read-only access.
- Synchronous replication: Every write to the primary is synchronously replicated to the regional persistent disk. This ensures that the standby always has the latest committed data. The replication is handled at the storage layer, not the database engine.
- Health checks: Google Cloud health checking agents monitor the primary instance every 60 seconds. If the primary fails to respond for 60 seconds, the system initiates a failover.
- Failover process: The standby is promoted to primary. The regional persistent disk is remounted with read-write access for the new primary. The old primary, if it recovers, becomes the new standby. The entire process typically takes 60-120 seconds.
Key Components and Defaults
Regional persistent disk: Required for HA. Cannot be changed to zonal after HA is enabled. The disk type (SSD or HDD) must be chosen at creation; SSD is recommended for production. Minimum size is 10 GB for SSD, 200 GB for HDD. Maximum throughput depends on disk size.
Zone selection: You can specify the preferred zones for primary and standby, or let Cloud SQL choose automatically. The zones must be in the same region.
Failover time: Typically 60-120 seconds. This includes detection time (up to 60 seconds) and promotion time (30-60 seconds). During failover, the instance is unavailable.
Data loss: With synchronous replication, committed transactions are preserved. Uncommitted transactions are lost. There is no data loss for committed writes.
Connection strings: After failover, the public IP and private IP remain the same. The DNS name does not change. Applications do not need to update connection strings.
Read replicas: HA instances can have up to 8 read replicas. Read replicas are not affected by failover; they continue to replicate from the new primary.
Backups: Automated backups and point-in-time recovery are supported.
Cost: HA instances incur charges for both the primary and standby instances (2x compute), plus the regional persistent disk (2x storage cost compared to zonal disk). You pay for both instances even if the standby is idle.
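The cost rule above can be turned into a quick estimate. The following is a minimal sketch that applies the 2x compute and 2x storage multipliers stated in this chapter. The function name `ha_monthly_cost` and the dollar figures are hypothetical and do not reflect real Cloud SQL pricing.

```bash
#!/usr/bin/env bash
# ha_monthly_cost COMPUTE STORAGE
# Inputs are whole-dollar monthly costs of the equivalent zonal instance
# (hypothetical figures, not real pricing). Applies the 2x compute and
# 2x storage multipliers stated in the text and prints the HA estimate.
ha_monthly_cost() {
  local compute=$1 storage=$2
  echo $((2 * compute + 2 * storage))
}
```

For example, `ha_monthly_cost 80 20` describes a zonal instance that costs $100 per month in total and prints the estimated monthly cost of the HA equivalent under these assumptions.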
Configuration and Verification Commands
To create a Cloud SQL instance with HA using the gcloud command:
gcloud sql instances create my-instance \
--database-version=MYSQL_8_0 \
--region=us-central1 \
--zone=us-central1-a \
--secondary-zone=us-central1-b \
--availability-type=REGIONAL \
--tier=db-n1-standard-2 \
--storage-type=SSD \
--storage-size=100GB
--availability-type=REGIONAL enables HA. The default is ZONAL (no HA).
--zone and --secondary-zone specify the primary and standby zones. If omitted, Cloud SQL selects them.
To modify an existing instance to enable HA:
gcloud sql instances patch my-instance --availability-type=REGIONAL --secondary-zone=us-central1-b
Note: This operation requires a brief downtime (minutes).
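Because the patch causes downtime, it can be worth checking whether HA is already enabled before running it. The sketch below is one possible way to do that. The function name `enable_ha` is hypothetical, and the script assumes that `availabilityType` is reported under `settings` in the `describe` output. It omits `--secondary-zone`, so Cloud SQL would choose the standby zone.

```bash
#!/usr/bin/env bash
# enable_ha INSTANCE
# Patches the instance to REGIONAL only if it is not already HA,
# so a repeated run does not trigger an unnecessary update.
enable_ha() {
  local instance=$1 current
  # Read the current availability type (REGIONAL or ZONAL).
  current=$(gcloud sql instances describe "$instance" \
    --format="value(settings.availabilityType)") || return 1
  if [ "$current" = "REGIONAL" ]; then
    echo "$instance already has HA enabled; nothing to do."
    return 0
  fi
  gcloud sql instances patch "$instance" --availability-type=REGIONAL
}
```

Calling `enable_ha my-instance` either reports that nothing needs to change or issues the patch.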
To verify HA status:
gcloud sql instances describe my-instance
Look for the availabilityType field. It will be REGIONAL if HA is enabled. Also check gceZone (primary zone) and secondaryZone.
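Rather than scanning the full output, the two most useful fields can be extracted with `--format`. This is a small sketch, not an official recipe. The function name `ha_summary` is hypothetical, and the field path `settings.availabilityType` is an assumption about where the field sits in the `describe` output.

```bash
#!/usr/bin/env bash
# ha_summary INSTANCE
# Prints "<availabilityType> <primary zone>" on one line,
# for example "REGIONAL us-central1-a".
ha_summary() {
  local instance=$1 availability zone
  availability=$(gcloud sql instances describe "$instance" \
    --format="value(settings.availabilityType)") || return 1
  zone=$(gcloud sql instances describe "$instance" \
    --format="value(gceZone)") || return 1
  echo "$availability $zone"
}
```

Running `ha_summary my-instance` before and after a change gives a quick one-line confirmation of the instance's HA state.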
To perform a manual failover (testing):
gcloud sql instances failover my-instance
This simulates a zone failure and promotes the standby. The instance becomes unavailable during the failover.
Interaction with Related Technologies
Cloud SQL Proxy: Works seamlessly with HA. The proxy connects to the instance's IP, which does not change after failover.
Private Services Access: HA instances can use private IPs via VPC peering. The private IP remains stable across failover.
Cloud Armor: Can be used for DDoS protection but does not affect HA behavior.
Cloud Monitoring: You can set up alerts for failover events. Metrics include cloudsql.googleapis.com/database/failover.
Backup and DR: HA is not a replacement for backups. Use automated backups and point-in-time recovery for disaster recovery across regions. HA protects against zonal failures within a region.
Limitations
HA is not available for all tiers. For MySQL, HA requires at least db-n1-standard-1. For PostgreSQL, at least db-custom-1-3840. For SQL Server, Express edition does not support HA.
HA does not protect against regional outages. For regional DR, use cross-region read replicas or backups.
The standby instance cannot be used for reads. It is idle and cannot serve traffic.
Failover is not instantaneous. Expect 60-120 seconds of downtime.
Regional persistent disks have slightly lower IOPS than zonal disks (by ~10%).
You cannot downgrade from HA to zonal without recreating the instance.
How HA Differs from Read Replicas
Read replicas are separate instances that asynchronously replicate from the primary. They can serve read traffic and can be promoted to standalone instances, but promotion is manual and may result in data loss (since replication is asynchronous). HA uses synchronous replication and automatic failover, with no data loss for committed transactions. Read replicas are for read scaling; HA is for high availability.
Enable HA on Instance
When you create or modify a Cloud SQL instance with `--availability-type=REGIONAL`, Google Cloud provisions a regional persistent disk and a standby instance in a different zone. The primary instance is created in the specified zone (or auto-selected), and the standby is placed in the secondary zone. The regional persistent disk is configured with synchronous replication between the two zones. This step takes several minutes as the disk and instances are provisioned.
Health Check Monitoring
Every 60 seconds, Google's health checking system sends a TCP health check to the primary instance's IP on the database port (e.g., 3306 for MySQL). The health check expects a successful connection establishment within a few seconds. Three consecutive failed checks (180 seconds in total) mark the instance as unhealthy, but failover is triggered earlier, after a single 60-second failure window. The health check also monitors the underlying VM and disk health.
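The same idea can be approximated from a client machine to see how a TCP probe with a consecutive-failure threshold behaves. This is only an illustrative sketch and not Google's internal health checker. The function names `tcp_probe` and `unhealthy_after` are hypothetical, and the script assumes bash with `/dev/tcp` support and the coreutils `timeout` command.

```bash
#!/usr/bin/env bash
# tcp_probe HOST PORT [TIMEOUT_SECONDS]
# Succeeds if a TCP connection to HOST:PORT can be opened within the timeout.
tcp_probe() {
  local host=$1 port=$2 limit=${3:-3}
  timeout "$limit" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# unhealthy_after THRESHOLD INTERVAL_SECONDS PROBE_COMMAND [ARGS...]
# Returns 0 once THRESHOLD consecutive probes have failed.
# Returns 1 as soon as any probe succeeds.
unhealthy_after() {
  local threshold=$1 interval=$2
  shift 2
  local failures=0
  while [ "$failures" -lt "$threshold" ]; do
    if "$@"; then
      return 1
    fi
    failures=$((failures + 1))
    # Wait between probes, but not after the final failure.
    if [ "$failures" -lt "$threshold" ]; then
      sleep "$interval"
    fi
  done
  return 0
}
```

For example, `unhealthy_after 3 60 tcp_probe DB_HOST 3306` (where `DB_HOST` is a placeholder) reports the instance as unhealthy only after three failed probes spaced 60 seconds apart.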
Failover Initiation
When the health check detects that the primary instance is unresponsive for 60 seconds, the failover process begins. The system verifies that the standby instance is healthy and has the latest data from the regional persistent disk. The standby is then promoted to primary. The regional persistent disk is remounted with read-write access for the new primary. The old primary, if it recovers, is automatically reconfigured as the new standby. The entire process typically takes 60-120 seconds.
Connection Redirection
During failover, the public IP and private IP of the Cloud SQL instance remain the same. The DNS name also remains unchanged. This is because the IP is associated with the instance's forwarding rule, not the underlying VM. After the standby is promoted, the forwarding rule is updated to point to the new primary's VM. Applications do not need to modify connection strings. However, existing connections to the old primary are dropped and must be re-established.
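Since dropped connections must be re-established, clients usually need some form of retry. The following is a generic sketch of retry with exponential backoff. The function name `retry_with_backoff` is hypothetical, and the command being retried stands in for whatever connection attempt your application makes.

```bash
#!/usr/bin/env bash
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after each failure. Returns 0 on success, 1 if every attempt failed.
retry_with_backoff() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

With six attempts and an initial delay of 5 seconds, the waits are 5, 10, 20, 40, and 80 seconds, which spans 155 seconds in total and therefore outlasts the 120-second upper bound quoted for failover.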
Post-Failover Verification
After failover completes, the instance state returns to RUNNABLE. The new primary zone is the previous secondary zone, and the old primary becomes the new standby. You can verify the new primary zone using `gcloud sql instances describe my-instance` and checking the `gceZone` field. Test failover regularly with the `gcloud sql instances failover` command to ensure your application handles the brief downtime gracefully.
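A failover drill can be scripted so that the zone change is checked automatically. The sketch below is one way to do it and is intended for a staging instance. The function names `primary_zone` and `failover_and_check` are hypothetical, and the script assumes the `failover` command returns only after the operation has finished.

```bash
#!/usr/bin/env bash
# primary_zone INSTANCE
# Prints the zone that currently hosts the primary.
primary_zone() {
  gcloud sql instances describe "$1" --format="value(gceZone)"
}

# failover_and_check INSTANCE
# Records the primary zone, triggers a manual failover, then confirms
# that the primary zone changed. --quiet suppresses interactive prompts.
failover_and_check() {
  local instance=$1 before after
  before=$(primary_zone "$instance") || return 1
  gcloud sql instances failover "$instance" --quiet || return 1
  after=$(primary_zone "$instance") || return 1
  echo "primary zone: $before -> $after"
  # Succeed only if the primary actually moved to another zone.
  [ "$before" != "$after" ]
}
```

A non-zero exit status from `failover_and_check my-instance` indicates that the primary zone did not change, which would warrant a closer look at the instance.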
Enterprise Scenario 1: E-commerce Platform with Zonal SLA
A large e-commerce company runs its production database on Cloud SQL MySQL. They have a requirement for 99.99% availability within a region. They enable HA with regional persistent disks (SSD, 500 GB) and use db-n1-standard-8 tier. The primary zone is us-east1-b, standby in us-east1-c. During a real incident, a network switch failure in us-east1-b caused the primary to become unreachable. The health check failed after 60 seconds, and failover completed in 90 seconds. The application experienced a 90-second write outage but no data loss. The standby took over, and the application recovered automatically. The company learned that their connection pool timeout was set to 30 seconds, causing some application servers to throw errors before the failover completed. They increased the timeout to 120 seconds and added retry logic. They also perform a forced failover every month during maintenance windows to ensure the process works.
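The timeout lesson in this scenario comes down to simple arithmetic. The sketch below checks a client's retry budget against the worst case implied by the figures in this chapter (up to 60 seconds of detection plus up to 60 seconds of promotion). The function names are hypothetical, and the constants are the chapter's stated figures, not guaranteed service behavior.

```bash
#!/usr/bin/env bash
# Figures from the text: detection up to 60 s, promotion up to 60 s.
DETECTION_MAX=60
PROMOTION_MAX=60

# worst_case_failover
# Prints the longest outage implied by the figures above, in seconds.
worst_case_failover() {
  echo $((DETECTION_MAX + PROMOTION_MAX))
}

# timeout_covers_failover BUDGET_SECONDS
# Succeeds when a client's total timeout or retry budget is at least
# as long as the worst-case failover.
timeout_covers_failover() {
  local budget=$1
  [ "$budget" -ge "$(worst_case_failover)" ]
}
```

Here `timeout_covers_failover 30` fails while `timeout_covers_failover 120` succeeds, which matches the change from a 30-second to a 120-second timeout described above.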
Enterprise Scenario 2: Financial Services with Strict RPO
A financial services firm requires zero data loss for committed transactions (RPO=0). They use Cloud SQL PostgreSQL with HA enabled. They also have a cross-region read replica on a different continent for disaster recovery. The primary is in europe-west1-b, standby in europe-west1-c. During a software upgrade that caused a kernel panic on the primary, failover occurred in 75 seconds. All committed transactions were preserved. The application's hardcoded connection string, which pointed to the instance's internal IP, remained valid because the IP did not change. The real problem was that the application's connection pool was not configured to retry on connection failures. They fixed this by using the Cloud SQL Proxy with a Unix socket, which automatically reconnects. They also monitor the failover event using Cloud Monitoring and trigger a PagerDuty alert.
Common Misconfiguration Pitfalls
Not testing failover: Many enterprises enable HA but never test it. When a real failover occurs, they discover that their application's connection timeout is too short (e.g., 30 seconds) or that their monitoring doesn't alert on failover. Always test failover in a staging environment first.
Choosing wrong tier: HA requires a minimum tier. Using db-f1-micro or db-g1-small will cause HA configuration to fail. Always check the documentation for minimum tier requirements.
Ignoring cost: HA doubles compute cost and increases storage cost by ~2x. For non-critical workloads, a zonal instance with automated backups may be sufficient.
Using HA as a DR solution: HA only protects against zonal failures. For regional disasters, you need cross-region replicas or backups.
What the ACE Exam Tests (Objective 2.3)
The ACE exam specifically tests your ability to plan for high availability in Cloud SQL. Key areas:
- Understanding the difference between HA (regional) and zonal instances: You must know that HA uses a regional persistent disk and a standby instance in a different zone. Zonal instances use a zonal disk and have no standby.
- Failover behavior: Know that failover is automatic, takes 60-120 seconds, and does not change the IP address. Committed transactions are preserved; uncommitted transactions are lost.
- Configuration requirements: HA requires a minimum tier (e.g., db-n1-standard-1 for MySQL). It is not available for all machine types. You must specify --availability-type=REGIONAL.
- Cost implications: HA incurs charges for both primary and standby instances (2x compute) and regional persistent disk (2x storage cost).
- Limitations: Standby cannot serve reads. HA does not protect against regional outages. You cannot downgrade from HA to zonal.
Common Wrong Answers and Traps
"Failover changes the IP address": Wrong. The IP remains the same because it is associated with the forwarding rule, not the VM. The forwarding rule is updated to point to the new primary.
"HA provides zero downtime": Wrong. Failover takes 60-120 seconds, so there is a brief downtime. The exam may say "HA eliminates downtime" — that is false.
"You can use the standby for read queries": Wrong. The standby is idle and cannot serve traffic. Only the primary handles reads and writes.
"HA protects against regional outages": Wrong. HA only protects against zonal failures within a region. Use cross-region replicas for regional DR.
"HA is available for all machine types": Wrong. There are minimum tier requirements. The exam may list a micro tier as an option — that is incorrect.
Specific Numbers and Terms
60 seconds: Health check interval and failure detection window.
60-120 seconds: Typical failover time.
2x compute + 2x storage: Cost multiplier for HA.
`--availability-type=REGIONAL`: CLI flag to enable HA.
Regional persistent disk: Required storage type for HA.
gcloud sql instances failover: Command to manually trigger failover.
Edge Cases
What if both zones fail? HA cannot protect the instance. It will be unavailable until at least one zone recovers, which is why cross-region DR is important.
What if the standby fails? The primary continues to operate, but HA is degraded. You should recreate the instance to restore HA.
What if you enable HA on an existing instance? There will be a brief downtime during the conversion. Plan accordingly.
How to Eliminate Wrong Answers
If a question asks about failover behavior, eliminate any answer that mentions IP address change, zero downtime, or read-capable standby. If the question asks about cost, eliminate any answer that says HA does not increase cost. If the question asks about configuration, eliminate any answer that uses a micro tier or zonal disk.
Cloud SQL HA uses a regional persistent disk and a standby instance in a different zone for automatic failover.
Failover is automatic and completes in 60-120 seconds; the IP address does not change.
HA doubles compute cost (2 instances) and storage cost (regional disk).
The standby instance cannot serve read traffic; it is idle.
HA requires a minimum tier (e.g., db-n1-standard-1 for MySQL).
HA only protects against zonal failures, not regional outages.
Use `gcloud sql instances failover` to test failover manually.
These come up on the exam all the time. Here's how to tell them apart.
Cloud SQL HA (Regional)
Uses regional persistent disk with synchronous replication across zones.
Provides automatic failover within 60-120 seconds with no data loss for committed transactions.
Costs 2x compute (primary + standby) and 2x storage (regional disk).
Requires minimum tier (e.g., db-n1-standard-1 for MySQL).
Protects against zonal failures but not regional outages.
Cloud SQL Zonal (No HA)
Uses zonal persistent disk; no replication across zones.
No automatic failover. If the zone fails, the instance is unavailable until the zone recovers or you restore from backup.
Standard compute and storage costs (1x).
Available for all tiers, including micro.
No protection against zonal failures. Requires manual recovery.
Mistake
Cloud SQL HA provides zero downtime during failover.
Correct
Failover takes 60-120 seconds, during which the instance is unavailable. Applications must handle connection retries.
Mistake
After failover, the IP address changes, so applications must update connection strings.
Correct
The IP address remains the same. Only the underlying VM changes. The forwarding rule is updated automatically.
Mistake
You can use the standby instance for read queries to reduce load on the primary.
Correct
The standby is idle and cannot serve traffic. Use read replicas for read scaling.
Mistake
HA protects against all types of failures, including regional outages.
Correct
HA only protects against zonal failures. For regional DR, use cross-region read replicas or backups.
Mistake
Enabling HA does not increase the cost of the Cloud SQL instance.
Correct
HA doubles compute cost (primary + standby) and increases storage cost (regional disk is 2x the cost of zonal disk).
Failover typically takes 60-120 seconds. This includes up to 60 seconds for health check detection and 30-60 seconds for promoting the standby. Applications should be designed to handle this brief downtime with retry logic.
No, the IP address remains the same. The forwarding rule is updated to point to the new primary VM. Applications do not need to update connection strings. However, existing connections are dropped and must be re-established.
No, the standby instance is idle and cannot serve traffic. Only the primary instance handles reads and writes. For read scaling, use read replicas.
HA costs approximately 2x more for compute (you pay for both primary and standby instances) and 2x more for storage (regional persistent disk is double the cost of zonal disk). For example, if a zonal instance costs $100/month, a comparable HA instance would cost ~$200/month.
No, HA requires a minimum tier. For MySQL, the minimum is db-n1-standard-1. For PostgreSQL, db-custom-1-3840. For SQL Server, Express edition does not support HA. Check the documentation for the latest requirements.
No, HA only protects against zonal failures within the same region. For regional disaster recovery, use cross-region read replicas or automated backups with point-in-time recovery.
No, you cannot downgrade an existing HA instance to zonal. You would need to create a new zonal instance and migrate data (e.g., using export/import or a read replica promotion).