This chapter covers the critical concept of high availability (HA) expressed in terms of 'nines' – 99.9%, 99.99%, and 99.999% uptime. For the GCDL exam, understanding the real-world meaning, cost implications, and architectural requirements of each tier is essential, as approximately 10-15% of questions touch on availability, SLAs, or disaster recovery. You will learn not only the percentages and corresponding downtime values but also the specific Google Cloud services and designs that enable each level. The exam tests your ability to match availability targets to appropriate architectural patterns, not just memorize numbers.
Imagine a 100-story office building with a single elevator. 99.9% availability (three nines) means the elevator is out of service for about 8.77 hours per year. That's roughly one full workday annually. For most tenants, that's acceptable – they can take the stairs occasionally. Now consider 99.99% (four nines): only 52.6 minutes of downtime per year. The elevator now has a backup generator and a redundant motor. If the main motor fails, the backup kicks in within seconds. Tenants experience a brief blink, but never a long outage. Finally, 99.999% (five nines) means just 5.26 minutes of downtime per year. This requires multiple independent elevator shafts, each with its own power supply and motor, plus a smart control system that instantly reroutes cars if one shaft is blocked. The building also has a dedicated maintenance crew on site 24/7. The cost and complexity scale dramatically: as an illustration, three nines might cost $1M for the elevator system, four nines $3M, and five nines $10M+ because every component must be duplicated and failure detection must be sub-second. In cloud computing, the same principle applies: each additional nine multiplies infrastructure cost and complexity, often by a factor of two or more.
What Are the Nines and Why Do They Matter?
High availability is quantified as a percentage of uptime over a defined period, typically one year. The term 'nines' refers to the number of 9s in the percentage: 99.9% (three nines), 99.99% (four nines), and 99.999% (five nines). Each additional nine reduces allowable downtime by a factor of 10. The GCDL exam expects you to know the annual downtime values:
- 99.9%: 8.77 hours (about 526 minutes)
- 99.99%: 52.6 minutes
- 99.999%: 5.26 minutes
These values are calculated as: (1 - availability) * 365.25 * 24 * 60 = downtime minutes. For monthly SLAs, divide by 12. For example, 99.9% monthly is about 43.8 minutes.
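The formula above can be checked with a small shell sketch. The function names are illustrative, and the script assumes a POSIX shell with `awk` available; it uses the 365.25-day year stated in the formula.

```shell
# Annual downtime budget in minutes for an availability percentage,
# using the 365.25-day year from the formula above.
downtime_minutes_per_year() {
  awk -v a="$1" 'BEGIN { printf "%.2f\n", (100 - a) / 100 * 365.25 * 24 * 60 }'
}

# Monthly equivalent: the annual budget divided by 12.
downtime_minutes_per_month() {
  awk -v a="$1" 'BEGIN { printf "%.2f\n", (100 - a) / 100 * 365.25 * 24 * 60 / 12 }'
}

# Print the budget for each tier discussed in this chapter.
for pct in 99.9 99.99 99.999; do
  echo "$pct% -> $(downtime_minutes_per_year "$pct") min/year," \
       "$(downtime_minutes_per_month "$pct") min/month"
done
```

Dividing the three-nines result by 60 gives the 8.77-hour figure quoted in this chapter.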
How Availability Is Achieved in Google Cloud
Achieving high availability requires eliminating single points of failure (SPOFs) through redundancy, fault isolation, and automated failover. Google Cloud provides services that map to each tier:
99.9% (Three Nines) – Suitable for development, test, or non-critical workloads. Achieved by deploying a single resource (e.g., one Compute Engine VM) with basic monitoring and manual recovery. Google's infrastructure itself offers 99.9% availability for individual zonal resources. No special design needed.
99.99% (Four Nines) – Requires multi-zone deployment. For Compute Engine, use managed instance groups (MIGs) with autoscaling across at least two zones. A regional load balancer distributes traffic. If one zone fails, traffic is rerouted within seconds. Cloud SQL offers 99.99% for regional instances (with automatic failover to a standby in another zone). Key: you must configure multi-zonal or regional resources.
99.999% (Five Nines) – Demands multi-region deployment with active-active or active-passive failover. For example, Spanner provides 99.999% availability by replicating data across three or more regions using synchronous replication and Paxos consensus. Cloud Load Balancing with global anycast IP can direct users to the nearest healthy region. Achieving five nines often requires custom application-level redundancy and careful planning for planned maintenance windows.
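As a study aid, the three tiers above can be condensed into a lookup. This is only a sketch: the function name `pattern_for_target` is made up for illustration, and the strings simply restate the patterns described in this section.

```shell
# Map an availability target (in percent) to the deployment pattern
# this section associates with it.
pattern_for_target() {
  case "$1" in
    99.9)   echo "single zone: one zonal resource, basic monitoring, manual recovery" ;;
    99.99)  echo "multi-zone: MIG across zones behind a load balancer, regional Cloud SQL" ;;
    99.999) echo "multi-region: Spanner multi-region, global load balancing" ;;
    *)      echo "unknown target: $1" >&2; return 1 ;;
  esac
}

pattern_for_target 99.99
```

Targets outside the three tiers return a non-zero status, which mirrors the exam habit of first deciding which tier a requirement falls into.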
Key Components and Defaults
Compute Engine VM: Default SLA is 99.9% for a single instance with a premium OS license. To get 99.99%, use a MIG with at least two zones.
Cloud SQL: Standard tier: 99.9%. Regional tier: 99.99% (automatic failover to standby in another zone within ~60 seconds).
Cloud Storage: Multi-regional storage: 99.95% (eleven 9s durability, but availability is 99.95%). Dual-region: 99.99%.
Spanner: 99.999% for multi-region configurations.
BigQuery: 99.99% SLA for storage and query processing.
Configuration and Verification
To configure a multi-zone MIG for 99.99%:
gcloud compute instance-groups managed create my-mig \
  --region=us-central1 \
  --template=my-template \
  --size=3 \
  --zones=us-central1-a,us-central1-b,us-central1-c
This creates a regional MIG with instances spread across three zones. Next, create the backend service for the load balancer. The --global flag makes it a backend service for a global external HTTP(S) load balancer, and my-health-check must already exist:
gcloud compute backend-services create my-backend \
  --load-balancing-scheme=EXTERNAL \
  --protocol=HTTP \
  --health-checks=my-health-check \
  --global
Then add the MIG as a backend. Because the MIG is regional, reference it by region rather than by zone:
gcloud compute backend-services add-backend my-backend \
  --global \
  --instance-group=my-mig \
  --instance-group-region=us-central1 \
  --balancing-mode=RATE \
  --max-rate-per-instance=100
How It Interacts with Related Technologies
Health Checks: Required for load balancers to detect failed instances. Default interval: 5 seconds, timeout: 5 seconds, unhealthy threshold: 2. Faster checks improve recovery time but increase cost.
Autoscaling: Works with MIGs to maintain desired capacity. The default cool-down (initialization) period for new instances is 60 seconds.
Disaster Recovery: For five nines, you need cross-region replication. Cloud Storage dual-region or Spanner multi-region handles this. For self-managed VMs, use snapshot replication or rsync.
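To make the health check defaults above concrete, here is a rough sketch. It assumes that detection time is approximately the probe interval multiplied by the unhealthy threshold. The helper names and the port are illustrative, and `create_health_check` is only defined, not called, so nothing is created until you run it yourself.

```shell
# Approximate time to mark a backend unhealthy:
# probe interval (seconds) x consecutive failed probes required.
detection_seconds() {
  echo $(( $1 * $2 ))
}

INTERVAL=5    # seconds between probes (the default quoted above)
UNHEALTHY=2   # failed probes before a backend is marked unhealthy

echo "Failure detected after about $(detection_seconds "$INTERVAL" "$UNHEALTHY") seconds"

# Hypothetical helper: creates a global HTTP health check with those values.
# Port 80 and the name my-health-check are assumptions for illustration.
create_health_check() {
  gcloud compute health-checks create http my-health-check \
    --global \
    --port=80 \
    --check-interval="${INTERVAL}s" \
    --timeout=5s \
    --unhealthy-threshold="$UNHEALTHY" \
    --healthy-threshold=2
}
```

Lowering the interval or threshold shortens detection time at the price of more probe traffic and a higher chance of reacting to transient blips.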
Cost Implications
Each additional nine roughly doubles infrastructure cost. A single zonal VM costs ~$50/month; a multi-region Spanner instance can cost thousands. The exam tests that you understand the cost vs. benefit tradeoff: not every workload needs five nines. For example, a batch processing job that runs nightly can tolerate hours of downtime – three nines is sufficient.
Define Availability Target
First, determine the required uptime percentage based on business criticality. For GCDL, you must know that 99.9% allows 8.77 hours/year downtime, 99.99% allows 52.6 minutes, and 99.999% allows 5.26 minutes. This step involves identifying the workload's tolerance for unplanned outages and planned maintenance. For example, an e-commerce site may need four nines during business hours but can accept lower availability overnight. Document the target in an SLA.
Eliminate Single Points of Failure
Identify all components that can cause a total outage: compute, storage, network, database, and load balancer. For each, design redundancy. For compute, use multiple instances across zones (MIG). For storage, use multi-regional or dual-region Cloud Storage. For databases, use Cloud SQL regional or Spanner multi-region. For load balancing, use regional or global load balancers with health checks. Each component must have at least one backup that can take over automatically.
Implement Automated Failover
Configure health checks and automatic failover mechanisms. For Cloud SQL regional, the failover is automatic – a standby instance in another zone takes over if the primary fails. For Compute Engine MIGs, the load balancer health check detects unhealthy instances and stops sending traffic; the MIG autoheals by recreating instances. Ensure health check parameters (interval, timeout, thresholds) are tuned for your recovery time objective (RTO). Defaults may be too slow for five nines.
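Autohealing on a MIG is configured separately from the load balancer's health check. The sketch below only prints the command so it can be reviewed before running. The function name, the resource names, and the 300-second initial delay are assumptions for illustration.

```shell
# Build (but do not run) the command that attaches an autohealing
# policy to a regional MIG. Arguments: MIG REGION HEALTH_CHECK [DELAY_SECONDS]
autohealing_command() {
  mig="$1"; region="$2"; check="$3"; delay="${4:-300}"
  echo "gcloud compute instance-groups managed update $mig" \
       "--region=$region --health-check=$check --initial-delay=$delay"
}

autohealing_command my-mig us-central1 my-health-check
```

The initial delay gives new instances time to boot before failed probes trigger a recreate; set it from your application's actual startup time.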
Test Failure Scenarios
Simulate zone failures, instance crashes, and network partitions. Chaos Monkey is a Netflix open-source tool, not a gcloud feature; you can inject faults with a tool of that kind or by stopping resources directly. For example, to stop an instance: `gcloud compute instances stop my-instance --zone=us-central1-a`. Verify that the load balancer reroutes traffic within the expected time (typically <30 seconds). Measure actual downtime and compare to target. Adjust health check thresholds if needed. Testing should be done regularly, at least quarterly.
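One simple way to measure downtime during such a drill is to probe the endpoint at a fixed interval and count failures. This is a minimal sketch: the function name is made up, and the address in the commented example is a documentation placeholder, not a real service.

```shell
# Run a probe command repeatedly and print how many attempts failed.
# Usage: count_failed_probes ATTEMPTS INTERVAL_SECONDS CMD [ARGS...]
count_failed_probes() {
  attempts="$1"; interval="$2"; shift 2
  failed=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # A non-zero exit status from the probe counts as one failed attempt.
    "$@" >/dev/null 2>&1 || failed=$((failed + 1))
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$failed"
}

# Example (placeholder address): probe once per second for two minutes while
# the instance is stopped; each failure approximates one second of downtime.
# count_failed_probes 120 1 curl --fail --silent --max-time 1 http://203.0.113.10/
```

Comparing the failure count to the expected failover time shows whether the health check settings meet your RTO.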
Monitor and Report SLA Compliance
Use Cloud Monitoring (formerly Stackdriver) to track uptime. Create uptime checks that probe your endpoints from multiple locations. Set alerts for any breach of the availability target. For example, if your target is 99.99%, alert if downtime in a rolling 30-day period exceeds 4.38 minutes (the monthly equivalent). Generate monthly SLA reports for stakeholders. The exam may ask about using Cloud Monitoring for availability tracking.
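The alert threshold in the example above is just the monthly downtime budget. A hedged sketch of that comparison follows. The function name is illustrative, and the observed downtime is assumed to come from your own monitoring export.

```shell
# Compare observed downtime (minutes) against the monthly budget for a
# target availability (percent). Exit status 0 means within budget.
within_monthly_budget() {
  awk -v a="$1" -v d="$2" 'BEGIN {
    budget = (100 - a) / 100 * 365.25 * 24 * 60 / 12
    printf "budget=%.2f min, observed=%.2f min\n", budget, d
    if (d <= budget) exit 0
    exit 1
  }'
}

if within_monthly_budget 99.99 3; then
  echo "SLA target met"
else
  echo "SLA target breached"
fi
```

The same check works for any tier, for example `within_monthly_budget 99.9 50` for a three-nines workload.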
Scenario 1: E-commerce Platform (99.99% required)
A mid-size e-commerce company runs its website on Google Cloud. They need four nines during peak shopping seasons (Black Friday). They deploy a regional managed instance group across three zones (us-central1-a, b, c) with a global HTTP(S) load balancer. The backend service uses a Cloud SQL regional instance for the product catalog. They configure health checks with a 5-second interval and an unhealthy threshold of 2 (so ~10 seconds to detect failure). During a zone outage, the load balancer routes traffic to healthy zones within seconds. The Cloud SQL failover completes in under 60 seconds. Total downtime per year is under 52.6 minutes. Common misconfiguration: not setting up read replicas for the database, causing read-heavy traffic to overwhelm the primary during failover. They learned to use Cloud SQL read replicas in each zone to offload reads.
Scenario 2: Financial Trading System (99.999% required)
A financial services firm requires five nines for its trading application. They use Spanner multi-region (e.g., us-central1, us-east1, us-west1) to achieve 99.999% availability. Spanner automatically handles synchronous replication across regions. For compute, they deploy regional MIGs in three regions (MIGs are zonal or regional, never global), all behind a single global load balancer. They use Cloud CDN for static content. The challenge: planned maintenance (e.g., OS patching) must not cause downtime. They use rolling updates with a max surge of 1 and a max unavailable of 0. They also implement canary deployments to test new code. Cost is high (~$10k/month for Spanner alone) but justified by the cost of downtime: each minute of outage costs $100k in lost trades.
Scenario 3: Internal HR Portal (99.9% sufficient)
A company's HR portal is used for employee self-service (benefits, time-off requests). It runs on a single Compute Engine VM with a Cloud SQL standard instance. If the VM goes down, IT manually restores from a snapshot. Downtime of a few hours is acceptable because employees can submit requests later. They save costs by not implementing multi-zone redundancy. The GCDL exam tests that you can identify when three nines is appropriate: non-critical, batch, or internal tools.
High availability is tested under Objective 2.1 (Infrastructure) of the GCDL exam and also appears in disaster recovery scenarios. You must memorize the exact downtime values for three, four, and five nines. Common wrong answers include:
- 99.9% = 8.76 hours (trap): The correct value is 8.77 hours (about 526 minutes). The difference arises from using 365 days vs. 365.25 days. The exam uses 365.25.
- 99.99% = 52.56 minutes (trap): Correct is 52.6 minutes. Again, the 365.25-day year matters.
- 99.999% = 5.26 minutes: This is correct; some candidates confuse it with 5.25 minutes.
Another trap: confusing availability with durability. Cloud Storage offers 99.999999999% (11 9s) durability, but availability is only 99.95% for multi-regional. The exam may ask: 'Which Google Cloud service provides 99.999% availability?' Answer: Spanner (multi-region). Not Cloud Storage.
Key terms that appear verbatim:
- 'Regional' vs 'zonal' resources: Regional resources (e.g., Cloud SQL regional, regional MIGs) provide higher availability.
- 'Managed instance group' (MIG) and 'autoscaling'.
- 'Health checks' and 'failover'.
Edge cases:
Planned maintenance: Google performs maintenance that may cause brief downtime. For five nines, you must design to survive maintenance windows. The exam may ask: 'What is the impact of a zonal maintenance event on a multi-zone deployment?' Answer: No impact if properly configured.
Single VM with a premium SLA: 99.9% is the maximum for a single instance, even with premium OS. To exceed, you need redundancy.
How to eliminate wrong answers:
If a question asks for the best design for 99.99% availability, eliminate options that use only one zone or one VM.
If a question mentions 'low cost' and 'high availability', 99.9% is likely the answer.
Always check whether the solution includes a load balancer and multiple zones for compute, and a regional database for stateful workloads.
99.9% availability = 8.77 hours downtime/year; 99.99% = 52.6 minutes; 99.999% = 5.26 minutes.
Single-zone resources max out at 99.9% availability. For higher, use multi-zone or multi-region.
Managed instance groups with autoscaling and regional load balancers enable 99.99% for compute.
Cloud SQL regional provides 99.99% availability with automatic failover to a standby in another zone.
Spanner multi-region offers 99.999% availability using synchronous replication and Paxos.
Each additional nine roughly doubles infrastructure cost; choose the right tier based on business need.
Health check intervals and thresholds directly impact recovery time; tune for your RTO.
Durability and availability are different: Cloud Storage has 11 nines durability but only 99.95% availability for multi-regional.
Planned maintenance (e.g., OS patching) must be accounted for in five nines designs; use rolling updates with max surge and max unavailable settings.
The GCDL exam tests your ability to map availability percentages to appropriate Google Cloud architectures.
These come up on the exam all the time. Here's how to tell them apart.
99.9% (Three Nines)
8.77 hours downtime/year
Single-zone deployment possible
Cost: ~$50/month for a single VM
Suitable for dev/test or non-critical workloads
Manual recovery acceptable
99.99% (Four Nines)
52.6 minutes downtime/year
Requires multi-zone deployment (at least 2 zones)
Cost: ~$200-500/month for redundant VMs and load balancer
Suitable for production workloads
Automated failover required (health checks, MIGs)
99.99% (Four Nines)
52.6 minutes downtime/year
Regional redundancy (within one region, multiple zones)
Cost: moderate (e.g., Cloud SQL regional, MIGs)
RTO: ~60 seconds for database failover
Planned maintenance can cause brief downtime
99.999% (Five Nines)
5.26 minutes downtime/year
Multi-region redundancy (three or more regions)
Cost: high (e.g., Spanner multi-region, global load balancers)
RTO: sub-second to seconds (Paxos consensus)
Must survive planned maintenance without downtime
Mistake
99.99% availability means 52.56 minutes of downtime per year.
Correct
The correct value is 52.6 minutes (based on 365.25 days). The discrepancy comes from using 365 days vs 365.25 days. The annual figures in this chapter assume 365.25 days; Google Cloud SLAs themselves are measured per month.
Mistake
A single Compute Engine VM with premium OS can achieve 99.99% availability.
Correct
A single VM has a maximum SLA of 99.9% (with premium OS). To achieve 99.99%, you need at least two VMs in different zones behind a load balancer (e.g., a managed instance group).
Mistake
Cloud Storage multi-regional provides 99.999% availability.
Correct
Cloud Storage multi-regional offers 99.95% availability (and 99.999999999% durability). For 99.99% availability, use dual-region. For 99.999%, use Spanner.
Mistake
99.999% availability means the system never goes down.
Correct
It allows 5.26 minutes of downtime per year. This includes both unplanned and planned maintenance. Achieving this requires multi-region redundancy and careful planning for maintenance windows.
Mistake
All Google Cloud services offer the same availability SLA.
Correct
Different services have different SLAs. For example, Compute Engine zonal VMs: 99.9%; Cloud SQL regional: 99.99%; Spanner multi-region: 99.999%. You must choose services that match your target.
Availability measures the percentage of time a system is accessible and functioning (uptime). Durability measures the likelihood that data will not be lost. For example, Cloud Storage multi-regional offers 99.95% availability but 99.999999999% (11 nines) durability. A system can be durable (data safe) but not available (can't access it). The exam tests this distinction: you need both for a reliable application.
Downtime per year = (1 - availability) * 365.25 * 24 * 60 minutes. For 99.9%: 0.001 * 365.25 * 1440 = 525.96 minutes (8.77 hours). For 99.99%: 0.0001 * 365.25 * 1440 = 52.6 minutes. For 99.999%: 0.00001 * 365.25 * 1440 = 5.26 minutes. Remember to use 365.25 days, not 365.
It is possible but difficult and expensive. You would need to deploy VMs across at least three regions, each with multiple zones, use global load balancing, and implement cross-region replication for stateful data. Most architectures reach 99.999% by using managed services like Spanner for the database and Cloud Load Balancing for traffic. The exam expects you to know that Spanner multi-region provides 99.999% out of the box.
A single VM with a premium OS license has a 99.9% SLA. If you use a standard OS, the SLA is 99.5%. To get higher than 99.9%, you must use a managed instance group with instances in multiple zones. The GCDL exam may ask: 'What is the maximum availability for a single-zone deployment?' Answer: 99.9%.
For services like Compute Engine, Google performs live migration of VMs during host maintenance, which usually causes no downtime. However, for five nines, you must design your application to handle potential brief pauses. Using multi-zone MIGs ensures that even if one VM is migrated, others handle traffic. The exam may ask about live migration as a feature for maintaining availability.
Zonal resources run in a single zone (e.g., a zonal VM). Regional resources run across multiple zones within one region (e.g., a regional MIG or Cloud SQL regional). Multi-regional resources span multiple regions (e.g., Spanner multi-region, Cloud Storage multi-regional). Each step up provides higher availability but also higher cost and latency.
Monthly downtime = (1 - 0.9999) * (365.25/12) * 24 * 60 = 4.38 minutes. Alternatively, divide annual downtime by 12: 52.6 / 12 = 4.38 minutes. The exam may ask for monthly equivalents.