CCNA Ensure solution and operations reliability Questions

75 of 99 questions · Page 1/2 · Ensure solution and operations reliability · Answers revealed

1

MCQhard

A company is running a critical application on Compute Engine. The application writes logs to a local persistent disk. The operations team wants to ensure logs are not lost if the VM fails. What should they do?

A.Use a regional persistent disk to replicate data across zones.

B.Schedule persistent disk snapshots every 5 minutes.

C.Create a script to copy logs to a Cloud Storage bucket every minute.

D.Configure the application to write logs to Cloud Logging using the Logging agent.

AnswerD

Logs are streamed to a durable, centralized service, ensuring no loss on VM failure.

Why this answer

Option D is correct because Cloud Logging with the Logging agent provides a centralized, durable, and managed log storage solution. The agent streams logs from the VM to Cloud Logging in near real-time, ensuring logs are preserved even if the VM or its local persistent disk fails. This decouples log storage from the VM's lifecycle, meeting the operations team's requirement for log durability.

Exam trap

The trap here is that candidates often overestimate the reliability of local persistent disks or periodic backups (snapshots/scripts) for log durability, failing to recognize that only a real-time, off-instance streaming solution like Cloud Logging eliminates the risk of log loss during VM failure.

How to eliminate wrong answers

Option A is wrong because regional persistent disks replicate data synchronously across zones within a region, but they still depend on the VM being operational; if the VM fails, the disk is inaccessible until the VM is recovered, and logs on the disk are not automatically exported. Option B is wrong because scheduling snapshots every 5 minutes introduces a recovery point objective (RPO) of up to 5 minutes, meaning logs written between snapshots are lost if the VM fails; snapshots are also not a real-time streaming solution. Option C is wrong because a script copying logs to Cloud Storage every minute creates an RPO of up to 1 minute, still risking log loss, and adds complexity and potential failure points (e.g., script crashes, permissions issues) without guaranteeing delivery.
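The RPO argument for options B and C can be checked with a small calculation. This is a minimal sketch, assuming exports run at exact multiples of the interval and complete instantly; the function name and the timestamps are illustrative, not part of any Google Cloud API.

```python
def lost_entries(entry_times, export_interval, failure_time):
    """Return log timestamps written after the last completed export.

    Assumes (for illustration only) that exports happen at exact multiples
    of `export_interval` seconds and finish instantly.
    """
    last_export = (failure_time // export_interval) * export_interval
    return [t for t in entry_times if last_export < t <= failure_time]


# One log entry every 10 seconds; the VM fails at t = 295 s.
entries = list(range(0, 300, 10))
lost_with_snapshots = lost_entries(entries, 300, 295)  # 5-minute snapshots
lost_with_script = lost_entries(entries, 60, 295)      # 1-minute copy script
```

With these inputs the 5-minute schedule loses 29 entries and the 1-minute script loses 5. Streaming through an agent shrinks that window to whatever has not yet been sent.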

2

MCQmedium

Your company's global e-commerce platform uses a managed instance group (MIG) in us-central1 and a Cloud Load Balancer. Traffic has grown, and you want to improve availability by distributing load across multiple regions. What should you do?

A.Increase the machine type of the existing instances to handle more traffic.

B.Enable Cloud CDN to cache content closer to users.

C.Create MIGs in additional regions and add them as backends to the existing global load balancer.

D.Change the load balancer to global and configure a single backend.

AnswerC

Multiple backends across regions with health checks enable the load balancer to route traffic only to healthy backends, improving availability.

Why this answer

Option C is correct because a global external HTTP(S) load balancer can have backends in multiple regions. By creating managed instance groups (MIGs) in additional regions and adding them as backends to the existing global load balancer, you distribute traffic across regions, improving availability and reducing latency for users worldwide. This approach leverages the load balancer's anycast IP and cross-region load balancing capabilities.

Exam trap

The trap here is that candidates confuse Cloud CDN (which caches content) with multi-region backend distribution, or think that simply making the load balancer 'global' with a single backend achieves regional redundancy, when in fact you must add backends in multiple regions to distribute load and improve availability.

How to eliminate wrong answers

Option A is wrong because increasing the machine type of existing instances only scales vertically within a single region, which does not address multi-region availability or distribute load geographically. Option B is wrong because Cloud CDN caches static content at edge locations but does not distribute compute load across regions; it reduces latency for cached content but does not improve availability for dynamic requests or handle regional failures. Option D is wrong because changing the load balancer to global and configuring a single backend (a single MIG) still limits compute resources to one region, failing to provide multi-region distribution or fault isolation.
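Why a single backend cannot give regional redundancy can be shown with a toy routing model. This is a simplified sketch, not the load balancer's real algorithm: the `Backend` class, the latency field, and the selection rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    region: str
    healthy: bool
    latency_ms: int  # hypothetical client-to-region latency


def pick_backend(backends: List[Backend]) -> Optional[Backend]:
    """Pick the healthy backend with the lowest latency, or None if all are down."""
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda b: b.latency_ms)


single_region = [Backend("us-central1", healthy=False, latency_ms=20)]
multi_region = single_region + [Backend("europe-west1", healthy=True, latency_ms=110)]
```

With `single_region` there is nothing left to route to once us-central1 is unhealthy. With `multi_region` the request still lands on a healthy backend, at higher latency.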

3

MCQhard

A user reports that an application running on instance-1 is unreliable and often restarts. What is the most likely cause?

A.The instance is in a single zone without redundancy.

B.The machine type is too small.

C.The instance is using an outdated image.

D.The instance is preemptible and can be terminated at any time.

AnswerD

Preemptible VMs can be stopped at any time and always stop within 24 hours of starting.

Why this answer

Preemptible instances (since succeeded by Spot VMs, which work the same way but have no 24-hour limit) can be terminated by Compute Engine at any time when it needs the capacity back, with only 30 seconds of warning. This makes them unsuitable for applications that require reliability and continuous uptime, as the instance can be stopped abruptly, causing the application to restart or become unavailable.

Exam trap

Google Cloud often tests the distinction between preemptible instances and other common causes of instability, such as resource exhaustion or zone failures, to see if candidates understand that preemptible instances are explicitly designed to be terminated at any time.

How to eliminate wrong answers

Option A is wrong because a single-zone deployment without redundancy can cause downtime if the zone fails, but it does not cause frequent, unpredictable restarts of the instance itself. Option B is wrong because a machine type that is too small would typically cause performance degradation or out-of-memory errors, not frequent restarts of the instance. Option C is wrong because an outdated image may have security vulnerabilities or missing patches, but it does not directly cause the instance to restart repeatedly.

4

MCQmedium

Refer to the exhibit. The exhibit shows logs and a metric from a GCE instance that was terminated. The instance was part of a managed instance group. Which diagnostic step should be taken FIRST to prevent recurrence?

A.Review the memory usage metric for the instance prior to termination.

B.Set a disk usage alert to be notified when disk exceeds 90%.

C.Increase the disk size of the instance template and redeploy.

D.Add a startup script to clear temporary files on boot.

AnswerA

Memory usage history will reveal if the instance was memory-constrained, guiding whether to increase memory.

Why this answer

The logs indicate OOM kills, and the disk is nearly full. The most likely cause is a combination of high memory usage and disk filling up (possibly swap or logs). First, check memory usage history to confirm if the instance was under-provisioned.

5

MCQhard

A company uses Cloud NAT to allow private instances to access the internet. They notice intermittent connectivity issues. What should they check first?

A.Cloud NAT gateway has at least one NAT IP address configured.

B.Cloud NAT router has a configured IP address range.

C.Cloud NAT gateway is in the same region as the instances.

D.The VPC subnet has private Google access enabled.

E.The instances have external IP addresses assigned.

AnswerA

Without enough NAT IP addresses (and therefore source ports), some connections cannot be translated and are dropped, which shows up as intermittent failures.

Why this answer

Intermittent connectivity issues when using Cloud NAT are most commonly caused by a lack of NAT IP addresses. Cloud NAT uses source network address translation (SNAT) to map private instance traffic to a public IP address; if the gateway has no NAT IP addresses configured, or if the number of concurrent connections exceeds the available port capacity of the assigned NAT IPs, packets are dropped, leading to intermittent failures. Checking that at least one NAT IP address is assigned is the first and most critical troubleshooting step.

Exam trap

Google Cloud often tests the misconception that Cloud NAT requires a router with a configured IP range or that Private Google Access is needed for internet access, but the real first check is ensuring NAT IP addresses are assigned to the gateway.

How to eliminate wrong answers

Option B is wrong because Cloud NAT does not require a configured IP address range on the router; the router handles dynamic routing, but NAT IPs are assigned directly to the Cloud NAT gateway, not as a range on the router. Option C is wrong because Cloud NAT is a regional resource and must be in the same region as the instances by design; if it were in a different region, connectivity would fail entirely, not intermittently. Option D is wrong because Private Google Access enables instances to reach Google APIs and services without public IPs, but it does not affect general internet connectivity through Cloud NAT.

Option E is wrong because instances behind Cloud NAT should not have external IP addresses; assigning external IPs bypasses Cloud NAT entirely and would cause direct internet access, not intermittent NAT issues.
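The port-capacity point in the explanation comes down to simple arithmetic. The sketch below assumes static port allocation, 64,512 usable source ports per NAT IP address, and a default minimum of 64 ports per VM; these are the commonly documented figures, so check them against the current Cloud NAT documentation before relying on them.

```python
PORTS_PER_NAT_IP = 64_512  # 65,536 ports minus the 1,024 well-known ports


def max_vms_per_nat_ip(min_ports_per_vm: int = 64) -> int:
    """How many VMs one NAT IP can serve at a given per-VM port reservation."""
    return PORTS_PER_NAT_IP // min_ports_per_vm


def nat_ips_needed(vm_count: int, min_ports_per_vm: int = 64) -> int:
    """Smallest number of NAT IPs that covers `vm_count` VMs (ceiling division)."""
    per_ip = max_vms_per_nat_ip(min_ports_per_vm)
    return -(-vm_count // per_ip)
```

Under these assumptions one NAT IP covers 1,008 VMs at 64 ports each. Raising the per-VM minimum to 1,024 ports drops that to 63 VMs per IP, which is how a gateway that looks correctly configured can still run out of ports.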

6

Multi-Selecteasy

A company wants to monitor the health of their Cloud Run services. Which THREE metrics should they use to define a comprehensive health SLI? (Choose 3)

Select 3 answers

A.Latency (e.g., p99 response time)

B.CPU utilization

C.Request count

D.Instance count

E.Error rate (percentage of 5xx responses)

AnswersA, C, E

Latency is a key performance SLI for user experience.

Why this answer

Latency (p99 response time) is a critical metric for Cloud Run because it measures the end-to-end request processing time, directly reflecting user experience. In a serverless environment, high latency can indicate cold starts, insufficient concurrency, or downstream service bottlenecks, making it essential for a comprehensive health SLI.

Exam trap

Google Cloud often tests the misconception that infrastructure-level metrics like CPU or instance count are valid health SLIs for serverless services, when in fact user-facing metrics (latency, errors, request count) are the correct choices for a comprehensive health SLI.
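The three user-facing SLIs can be derived from nothing more than a list of request records. This is a minimal sketch: the `(status_code, latency_ms)` record shape and the nearest-rank percentile method are assumptions for illustration, not how Cloud Monitoring computes its metrics.

```python
def compute_slis(requests):
    """Compute request count, 5xx error rate, and p99 latency.

    `requests` is a list of (status_code, latency_ms) tuples.
    """
    count = len(requests)
    if count == 0:
        return {"request_count": 0, "error_rate": 0.0, "p99_latency_ms": None}
    errors = sum(1 for status, _ in requests if 500 <= status <= 599)
    latencies = sorted(latency for _, latency in requests)
    # Nearest-rank percentile, using integer math to avoid float rounding.
    rank = (99 * count + 99) // 100
    return {
        "request_count": count,
        "error_rate": errors / count,
        "p99_latency_ms": latencies[rank - 1],
    }
```

None of the three values depends on CPU utilization or instance count, which is why those options do not belong in a health SLI for a serverless service.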

7

MCQeasy

A company uses Cloud SQL for MySQL to host its production database. The database experiences high read traffic. The team wants to improve read performance without modifying the application. What should they do?

A.Increase the number of CPUs on the primary Cloud SQL instance.

B.Use Cloud SQL Proxy with connection pooling.

C.Add read replicas and configure the application to use them for read queries.

D.Enable automatic storage increase to allow more data.

AnswerC

Read replicas distribute read load, improving performance without application code changes.

Why this answer

Option C is correct because adding read replicas offloads read queries from the primary Cloud SQL instance, distributing the read load across multiple replicas. This improves read performance without application code changes: only the connection configuration needs to point read queries at the replica endpoints. Cloud SQL for MySQL replicas use asynchronous replication, so they can lag slightly behind the primary, which read-heavy workloads can usually tolerate.

Exam trap

The trap here is that candidates confuse scaling the primary instance (vertical scaling) with offloading reads via replicas (horizontal scaling), or they mistakenly believe Cloud SQL Proxy provides performance benefits when it is only a connectivity and security layer.

How to eliminate wrong answers

Option A is wrong because increasing CPUs on the primary instance only scales vertical capacity, which does not address high read traffic without modifying the application; it also increases cost and may hit instance limits. Option B is wrong because Cloud SQL Proxy is a secure connectivity tool that provides IAM-based authentication and encryption, not a connection pooler; it does not improve read performance or offload read traffic. Option D is wrong because enabling automatic storage increase only prevents out-of-disk errors by expanding storage capacity, which has no impact on read performance or query throughput.

8

MCQeasy

A company uses Cloud SQL for PostgreSQL. They want to minimize downtime during maintenance. Which feature should they enable?

A.Read replicas.

B.High availability with a standby in another zone.

C.Point-in-time recovery.

D.Automated backups.

AnswerB

Provides automatic failover.

Why this answer

High availability (HA) with a standby in another zone ensures that Cloud SQL for PostgreSQL automatically fails over to a standby instance in a different zone if the primary zone experiences an outage. This minimizes downtime during maintenance because Cloud SQL can perform a controlled failover to the standby, so the interruption is limited to the failover itself rather than a full instance restart or rebuild.

Exam trap

The trap here is that candidates often confuse read replicas with high availability, assuming read replicas can automatically take over for the primary, but read replicas require manual promotion and do not provide automatic failover, making HA with a standby the correct choice for minimizing downtime during maintenance.

How to eliminate wrong answers

Option A is wrong because read replicas are designed for offloading read traffic and do not provide automatic failover for the primary instance; they require manual promotion, which introduces downtime. Option C is wrong because point-in-time recovery (PITR) is used for restoring data to a specific timestamp after data corruption or accidental deletion, not for reducing downtime during planned maintenance. Option D is wrong because automated backups protect against data loss by creating periodic backups, but they do not provide a standby instance for failover, so maintenance still requires downtime to restart the primary instance.

9

MCQeasy

Your company runs a critical application on Google Kubernetes Engine (GKE) with 5 nodes. The application experiences intermittent high latency every Friday afternoon. The team has ruled out infrastructure issues and suspects the application logic. You need to instrument the application to identify the root cause. Which approach should you take?

A.Use Cloud Monitoring to create custom metrics for application performance and investigate recent code changes.

B.Increase the number of nodes in the GKE cluster to handle the load.

C.Enable Cloud Logging and analyze logs for error messages during the latency periods.

D.Configure GKE usage metering to track resource consumption by namespace.

AnswerA

Custom metrics provide visibility into application logic performance, and correlating with code changes can pinpoint the cause.

Why this answer

Option A is correct because the team has already ruled out infrastructure issues and suspects application logic. Creating custom metrics in Cloud Monitoring allows you to instrument the application with key performance indicators (e.g., request latency, error rates) and correlate them with recent code changes to pinpoint the root cause of intermittent high latency. This approach directly addresses the need to monitor application-level behavior rather than infrastructure metrics.

Exam trap

The trap here is that candidates often confuse operational logging (Option C) with performance monitoring, failing to recognize that intermittent latency without errors requires custom metrics to measure application-specific performance indicators.

How to eliminate wrong answers

Option B is wrong because increasing the number of nodes addresses infrastructure capacity, which has already been ruled out as the cause; it does not help identify application logic issues. Option C is wrong because while Cloud Logging can capture error messages, the problem is intermittent high latency without necessarily generating errors; analyzing logs alone may miss performance bottlenecks that require custom metrics. Option D is wrong because GKE usage metering tracks resource consumption by namespace for cost allocation, not application performance or latency issues.
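Instrumenting application logic usually starts with timing the code paths under suspicion. The sketch below records latencies in an in-memory dictionary as a stand-in for a metrics backend. The decorator, the metric name, and the `checkout` function are hypothetical, and exporting the samples to Cloud Monitoring is deliberately left out.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a metrics backend: metric name -> list of seconds.
latency_samples = defaultdict(list)


def timed(metric_name):
    """Decorator that records how long each call to the wrapped function takes."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the call raised an exception.
                latency_samples[metric_name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("checkout_latency_seconds")
def checkout(item_prices):
    return sum(item_prices)
```

Once each suspect code path carries its own metric, the Friday-afternoon spikes can be matched to the specific function that slows down, which log lines alone would not show if no errors are raised.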

10

Multi-Selectmedium

A company has deployed a critical application on Google Kubernetes Engine (GKE) with a Regional cluster (us-central1). The application uses a Cloud SQL for PostgreSQL database with a cross-region replica for disaster recovery. The SRE team needs to ensure that the application can survive a regional outage with minimal data loss. Which TWO actions should the team take to improve the reliability of the solution?

Select 2 answers

A.Configure the application to automatically promote the Cloud SQL cross-region replica to a primary instance when the primary region is unavailable.

B.Configure Cloud SQL cross-region replication to be synchronous to ensure zero data loss during failover.

C.Configure an external HTTP(S) load balancer with a backend service pointing to both the primary and secondary GKE clusters, and use a DNS failover policy to route traffic to the secondary region if the primary region becomes unhealthy.

D.Deploy a secondary GKE cluster in the same region as the primary to provide a hot standby that can take over immediately.

E.Use a TCP/UDP load balancer to route traffic to both regions based on latency.

AnswersA, C

Option A is correct because promoting the replica makes it the new primary, allowing the application to continue with minimal data loss.

Why this answer

Option A is correct because promoting a Cloud SQL cross-region replica to a primary instance is the standard procedure for disaster recovery when the primary region fails. This operation is supported by Cloud SQL and can be automated using Cloud Functions or Cloud Run triggers, ensuring minimal manual intervention and reduced RTO. The cross-region replica is updated asynchronously, so data loss is limited to the replication lag at the moment of failover.

Exam trap

The trap here is that candidates often assume synchronous replication is possible across regions for zero data loss, but in practice, cross-region replication is always asynchronous due to the speed of light and network latency constraints.

11

MCQhard

A company uses Cloud Armor to protect their HTTP Load Balancer from DDoS attacks. During a traffic spike from a legitimate source, legitimate requests are being blocked. How should they tune the security policy to minimize false positives?

A.Enable adaptive protection with machine learning

B.Increase the rate limiting threshold

C.Disable the Web Application Firewall (WAF) rules

D.Use the JSON Web Token (JWT) authentication to filter requests

AnswerA

Adaptive protection dynamically adjusts based on traffic patterns, reducing false positives.

Why this answer

Option A is correct because enabling adaptive protection uses machine learning to distinguish between legitimate and malicious traffic, automatically adjusting rules. Increasing the rate limiting threshold may help but is static. Whitelisting IPs is not scalable.

Disabling WAF rules removes protection entirely.

12

Multi-Selecthard

An organization deploys a microservices application on Google Kubernetes Engine (GKE) with multiple Deployments. They want to ensure that the application remains available during a cluster-wide upgrade. Which three best practices should they follow? (Choose three.)

Select 3 answers

A.Enable cluster autoscaling.

B.Use StatefulSets instead of Deployments for all services.

C.Use multiple node pools across different zones.

D.Use node pools with multiple node types.

E.Set up a load balancer with health checks.

F.Configure PodDisruptionBudgets for each deployment.

AnswersC, E, F

Spreading node pools across zones allows rolling upgrades zone by zone.

Why this answer

Option C is correct because deploying node pools across multiple zones ensures that GKE can perform a cluster-wide upgrade by upgrading nodes in one zone at a time, maintaining application availability as long as the workload is replicated across zones. This leverages GKE's zonal upgrade strategy, which upgrades nodes in a single zone before moving to the next, preventing simultaneous disruption of all replicas.

Exam trap

The trap here is that candidates often confuse cluster autoscaling or multi-node types with high availability during upgrades, but GKE's upgrade process is zone-aware, not node-type-aware, and autoscaling only handles scaling, not disruption management.
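The PodDisruptionBudget option in this question is what limits how many replicas an upgrade may take down at once. The following is a simplified model of the `minAvailable` rule only. The function names are illustrative, and real evictions are evaluated one pod at a time by the Kubernetes eviction API.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary disruptions permitted right now under a minAvailable budget."""
    return max(0, healthy_pods - min_available)


def can_evict(healthy_pods: int, min_available: int) -> bool:
    """True if evicting one more pod keeps the budget satisfied."""
    return allowed_disruptions(healthy_pods, min_available) >= 1
```

With 3 healthy replicas and `minAvailable: 2`, one pod may be evicted during a node drain. With only 2 healthy replicas the drain has to wait until a replacement pod is ready elsewhere.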

13

MCQmedium

A company monitors their application with Cloud Monitoring. They set up an alerting policy to notify the on-call team when the 99th percentile latency exceeds 500 ms for 5 minutes. However, they receive false positive alerts due to short bursts. How should they refine the policy?

A.Set up alerting on each data point individually.

B.Decrease the threshold to 400 ms.

C.Change the metric to average latency instead of 99th percentile.

D.Increase the evaluation window to 10 minutes.

AnswerD

Longer window filters out transient spikes, alerting only on sustained high latency.

Why this answer

Option D is correct because increasing the evaluation window to 10 minutes smooths out short bursts of high latency, ensuring the alert triggers only when the 99th percentile latency exceeds 500 ms for a sustained period. Cloud Monitoring evaluates metrics over the specified window, so a longer window reduces false positives from transient spikes while still detecting genuine degradation.

Exam trap

Google Cloud often tests the misconception that lowering thresholds or changing percentiles reduces false positives, when in reality the evaluation window duration is the key lever for filtering out short-lived bursts without sacrificing sensitivity to sustained issues.

How to eliminate wrong answers

Option A is wrong because setting up alerting on each data point individually would make the policy hypersensitive to every single spike, increasing false positives rather than reducing them. Option B is wrong because decreasing the threshold to 400 ms would cause the alert to fire even more frequently, including during normal operation, exacerbating the false positive problem. Option C is wrong because changing the metric to average latency masks tail latency issues; the 99th percentile is specifically used to catch outliers, and averaging would hide the very bursts they want to monitor, potentially missing real problems.
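The effect of the evaluation window can be demonstrated with a toy evaluator. This sketch assumes one p99 sample per minute and fires only when every sample in a run of `window` consecutive minutes exceeds the threshold. It models the idea of a duration window, not Cloud Monitoring's actual evaluation engine.

```python
def alert_fires(samples_ms, threshold_ms, window):
    """True if `window` consecutive samples all exceed `threshold_ms`."""
    run = 0
    for value in samples_ms:
        run = run + 1 if value > threshold_ms else 0
        if run >= window:
            return True
    return False


# A 6-minute burst surrounded by normal latency (one sample per minute).
burst = [300] * 10 + [700] * 6 + [300] * 10
# Twelve minutes of sustained degradation.
sustained = [700] * 12
```

The 6-minute burst trips a 5-minute window but not a 10-minute one, while the sustained degradation still fires with the longer window.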

14

MCQeasy

A company deploys a stateful workload using StatefulSets on GKE. They want to ensure that if a pod is evicted, its persistent volume claim (PVC) is reattached to the replacement pod in the same zone. Which configuration achieves this?

A.Use a StatefulSet with a volumeClaimTemplate referencing a persistent disk in the same zone.

B.Use a Deployment with a PVC that has allowedTopologies restricting to the desired zone.

C.Use a Deployment with a persistent volume that is manually attached after pod creation.

D.Use a StatefulSet with a persistent disk that has access mode ReadOnlyMany.

AnswerA

StatefulSet ensures stable pod identity and PVC reattachment; zone affinity ensures the disk is in the same zone.

Why this answer

StatefulSets are designed for stateful workloads and guarantee stable network identities and persistent storage. When a pod is evicted, the StatefulSet controller ensures the replacement pod uses the same PVC, which is bound to a GCE Persistent Disk in the same zone as the original pod, provided the volumeClaimTemplate specifies a disk in that zone. This maintains data locality and avoids cross-zone reattachment.

Exam trap

Google Cloud often tests the misconception that Deployments can handle stateful workloads with persistent storage, but they lack the ordinal identity and PVC reattachment guarantees that StatefulSets provide for zone-pinned recovery.

How to eliminate wrong answers

Option B is wrong because Deployments do not guarantee stable pod identities or PVC reattachment to the same zone; allowedTopologies can restrict where a PVC is created but do not ensure the replacement pod reuses the same PVC after eviction. Option C is wrong because manually attaching a persistent volume after pod creation is not automated and defeats the purpose of a self-healing, declarative Kubernetes setup. Option D is wrong because ReadOnlyMany access mode allows multiple pods to read the same volume but does not ensure zone-pinned reattachment or single-pod write access, and StatefulSets typically use ReadWriteOnce for stateful workloads.
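The reattachment guarantee rests on a naming rule: each StatefulSet pod gets a PVC named from the volumeClaimTemplate name, the StatefulSet name, and the pod's ordinal. A short sketch of that rule follows; the template and StatefulSet names used here are made up for the example.

```python
def pvc_name(template: str, statefulset: str, ordinal: int) -> str:
    """PVC name Kubernetes derives for one StatefulSet pod."""
    return f"{template}-{statefulset}-{ordinal}"


def pvc_names(template: str, statefulset: str, replicas: int) -> list:
    """PVC names for every pod ordinal in the StatefulSet."""
    return [pvc_name(template, statefulset, i) for i in range(replicas)]
```

For a hypothetical StatefulSet `db` with a template named `data`, pod `db-1` always maps to `data-db-1`, so its replacement after an eviction binds to the same claim and therefore the same disk. Deployment pods have random name suffixes and no such mapping.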

15

MCQhard

You are designing a high-availability architecture for a global e-commerce platform that uses Cloud SQL for MySQL as the primary database. The application writes to a single Cloud SQL instance in us-central1 and reads from read replicas in us-central1 and us-west1. During a recent regional outage in us-central1, the primary instance became unavailable, and the application experienced full downtime for 3 hours because the failover to a read replica was not automatic. The application can tolerate up to 10 minutes of data loss but needs to recover within 30 minutes. You need to automate failover to a geographically distant region with minimal manual intervention. The application's connection string must not change. Which solution meets these requirements?

A.Set up a Cloud SQL for MySQL high-availability configuration across zones within us-central1

B.Create a cross-region read replica in us-west1, use Cloud Load Balancing with a static IP that maps to the primary or promoted replica, and automate monitoring and failover via Cloud Functions

C.Configure an external read replica in us-west1 and manually promote it using gcloud commands during an incident

D.Enable automatic failover by creating a Cloud SQL for MySQL regional failover replica in us-central1

AnswerB

Correct: cross-region replica with load balancer and automation meets RTO and RPO.

Why this answer

Option B is correct because it uses a cross-region read replica in us-west1 combined with a static IP via Cloud Load Balancing, which allows the connection string to remain unchanged after failover. Cloud Functions automate the monitoring and promotion of the replica, meeting the 30-minute recovery and 10-minute data loss tolerance. This design ensures failover to a geographically distant region with minimal manual intervention, unlike single-zone or same-region HA configurations.

Exam trap

The trap here is that candidates often confuse zonal high-availability (HA) with cross-region disaster recovery, assuming that a regional failover replica (Option D) provides geographic redundancy, when in fact it only spans zones within the same region.

How to eliminate wrong answers

Option A is wrong because a high-availability configuration across zones within us-central1 does not provide failover to a geographically distant region; it only protects against zonal failures, not a full regional outage. Option C is wrong because manually promoting an external read replica using gcloud commands during an incident does not meet the requirement for automated failover with minimal manual intervention, and it would likely exceed the 30-minute recovery time. Option D is wrong because a regional failover replica in us-central1 is still within the same region and cannot recover from a regional outage; it only provides zonal HA within that region.
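The RPO and RTO targets in the question (10 minutes of data loss, 30 minutes to recover) can be turned into a quick check on a failover plan. All durations below are hypothetical inputs chosen for illustration. They are not measured Cloud SQL figures.

```python
from dataclasses import dataclass


@dataclass
class FailoverPlan:
    replication_lag_s: float  # data at risk at the moment of failure
    detection_s: float        # time to notice the outage
    promotion_s: float        # time to promote the replica
    redirect_s: float         # time to repoint traffic


def meets_objectives(plan: FailoverPlan, rpo_s: float = 600, rto_s: float = 1800) -> bool:
    """True if replication lag fits the RPO and total recovery time fits the RTO."""
    recovery_s = plan.detection_s + plan.promotion_s + plan.redirect_s
    return plan.replication_lag_s <= rpo_s and recovery_s <= rto_s


# Hypothetical numbers: automated pipeline vs. a manual response like the 3-hour outage.
automated = FailoverPlan(replication_lag_s=30, detection_s=120, promotion_s=600, redirect_s=60)
manual = FailoverPlan(replication_lag_s=30, detection_s=1800, promotion_s=600, redirect_s=8400)
```

The automated plan passes both checks. The manual one fails on recovery time alone, even though its data loss is identical.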

16

MCQeasy

A company uses Cloud Storage for backup data. They want to protect against accidental deletion. Which option is best?

A.Enable object versioning.

B.Use a lifecycle policy.

C.Set a retention policy.

D.Enable object versioning.

AnswerA

Preserves noncurrent versions for recovery.

Why this answer

Object versioning in Cloud Storage preserves earlier versions of an object when it is overwritten or deleted. When versioning is enabled, deleting a live object turns it into a noncurrent version instead of permanently removing it, allowing easy recovery. This directly protects against accidental deletion by retaining previous object versions.

Exam trap

Google Cloud often tests the distinction between versioning (which allows recovery from accidental deletion) and retention policies (which prevent deletion but do not provide recovery after the fact), leading candidates to confuse compliance protection with accidental deletion protection.

How to eliminate wrong answers

Option B is wrong because lifecycle policies automate transitions or deletions based on age or conditions, but they do not prevent accidental deletion; they can actually cause deletion if misconfigured. Option C is wrong because retention policies (e.g., Bucket Lock) prevent object modification or deletion for a fixed period, but they are designed for compliance and data retention, not for recovering from accidental deletion after the fact. Option D is a duplicate of the correct answer and is not a separate option; the question lists two identical 'Enable object versioning' entries, but only one is correct.
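How versioning makes a delete recoverable can be modeled in a few lines. This is an in-memory toy, not the Cloud Storage client library. The class and method names are invented for the sketch, and generation numbers are simple counters.

```python
class VersionedBucket:
    """Toy model of a bucket with object versioning enabled."""

    def __init__(self):
        self._versions = {}  # object name -> {generation: data}
        self._live = {}      # object name -> generation of the live version
        self._next_generation = 1

    def upload(self, name, data):
        generation = self._next_generation
        self._next_generation += 1
        self._versions.setdefault(name, {})[generation] = data
        self._live[name] = generation  # any previous version becomes noncurrent
        return generation

    def delete(self, name):
        # The live version becomes noncurrent; its data is retained.
        self._live.pop(name, None)

    def get(self, name):
        generation = self._live.get(name)
        return None if generation is None else self._versions[name][generation]

    def restore(self, name, generation):
        # Restoring copies a noncurrent generation back as a new live version.
        return self.upload(name, self._versions[name][generation])
```

After `delete`, `get` returns nothing, but the earlier generation is still stored and `restore` brings it back. A retention policy would instead refuse the delete and offers no way back once the retention period has passed and the object is removed.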

17

MCQeasy

A company runs a global e-commerce site on GKE. They want to ensure disaster recovery with multi-region deployment. What is the best practice for configuring GKE clusters?

A.Deploy separate regional clusters in two or more regions.

B.Use a single zonal cluster with node auto-repair.

C.Deploy a single cluster with multi-master setup.

D.Use a single regional cluster with multiple zones.

AnswerA

Separate clusters in multiple regions provide geographic redundancy.

Why this answer

For disaster recovery with a multi-region deployment, the best practice is to deploy separate regional clusters in two or more regions. This ensures that if an entire region fails, traffic can be redirected to the other region's cluster, providing true geographic redundancy. A single cluster, whether zonal or regional, cannot survive a regional outage because both its control plane and its nodes are confined to one region.

Exam trap

Google Cloud often tests the misconception that a regional cluster with multiple zones is sufficient for disaster recovery, but the trap here is that a regional cluster is still confined to a single region and cannot survive a full regional outage.

How to eliminate wrong answers

Option B is wrong because a single zonal cluster with node auto-repair only protects against node-level failures within that single zone, not against a full zone or regional outage, and thus does not meet multi-region disaster recovery requirements. Option C is wrong because GKE does not offer a user-configurable multi-master setup; Google manages the control plane, and even a regional cluster's replicated control plane stays within one region. Option D is wrong because a single regional cluster with multiple zones provides high availability within a single region but cannot survive a regional failure, as the control plane is still regional and would be unavailable if the entire region goes down.

18

MCQeasy

Your company runs a stateless web application on Compute Engine. You want to ensure that if a zone fails, the application continues to serve traffic with minimal manual intervention. What should you do?

A.Schedule regular snapshots of each instance's persistent disk to a regional bucket.

B.Create a regional managed instance group with an autoscaling policy and use a global Cloud Load Balancer.

C.Use a global Cloud Load Balancer and enable Cloud CDN.

D.Create an instance template and manually deploy instances in another zone.

AnswerB

A regional MIG with autoscaling across zones and a global load balancer ensures traffic is rerouted away from failed zones and instances are automatically replaced.

Why this answer

A regional managed instance group (MIG) distributes instances across multiple zones within a region, ensuring that if one zone fails, the remaining zones continue serving traffic. Combined with a global Cloud Load Balancer, traffic is automatically routed to healthy instances in any zone, providing high availability with minimal manual intervention. Autoscaling further ensures that new instances are created to handle load, even if a zone becomes unavailable.

Exam trap

Google Cloud often tests the distinction between data backup (snapshots) and compute redundancy (MIGs), leading candidates to choose backup solutions when the question asks for continuous traffic serving during a zone failure.

How to eliminate wrong answers

Option A is wrong because scheduling snapshots to a regional bucket provides data backup and disaster recovery for persistent disks, but does not automatically redirect traffic or maintain application availability during a zone failure; it requires manual restoration and reconfiguration. Option C is wrong because enabling Cloud CDN caches static content at edge locations, which improves performance and reduces load on origin servers, but does not provide zone-level redundancy or automatic failover for the compute instances themselves. Option D is wrong because manually deploying instances in another zone is a manual, slow process that does not provide automated failover or load balancing; it also lacks autoscaling and health checking, leading to potential downtime and increased operational overhead.

19

Multi-Selecteasy

A company uses Cloud Storage to store user-uploaded content. They want to ensure that the data is highly durable and protected against accidental deletion. Which two features should they enable? (Choose two.)

Select 2 answers

A.Requester pays.

B.Lifecycle management.

C.Object versioning.

D.Bucket retention policy.

E.Uniform bucket-level access.

AnswersC, D

Versioning protects against accidental deletion or overwrite.

Why this answer

Options C and D are correct: Object versioning preserves previous versions, and a bucket retention policy prevents deletion until the retention period ends. Option B, lifecycle management, automates actions such as deletion or storage-class transitions and does not protect data. Option A, requester pays, only changes who is billed.

Option E, uniform bucket-level access, governs access control, not durability or deletion protection.

20

MCQhard

A company is designing a disaster recovery plan for a Cloud SQL for PostgreSQL instance. They want to failover to a different region with minimal data loss and recovery time under 10 minutes. The database is 500 GB and experiences 2,000 write transactions per second. Which solution should they use?

A.Export the database daily using gsutil and import in the other region using pg_restore.

B.Create a cross-region read replica and promote it to primary during failover.

C.Configure a cross-region replica instance using Cloud SQL's cross-region replication feature.

D.Automated backups with point-in-time recovery to a new instance in the other region.

AnswerC

Cross-region replication provides a continuously updated replica in another region (replication is asynchronous), giving minimal data loss and failover in minutes.

Why this answer

Cloud SQL for PostgreSQL offers a managed cross-region replication feature that creates a replica instance in a different region, using asynchronous replication to keep data nearly in sync. This solution meets the RPO (minimal data loss) and RTO (under 10 minutes) requirements because the replica is continuously updated and can be promoted to primary in minutes, without needing to restore from a backup or export.

Exam trap

Google Cloud often tests the distinction between read replicas (which are intra-region only for Cloud SQL PostgreSQL) and cross-region replica instances (which are a separate managed feature), leading candidates to incorrectly choose option B because they assume read replicas can be cross-region.

How to eliminate wrong answers

Option A is wrong because daily exports using gsutil and pg_restore would result in up to 24 hours of data loss (poor RPO) and the recovery time would exceed 10 minutes due to the time needed to transfer and restore a 500 GB database. Option B is wrong because Cloud SQL for PostgreSQL does not support cross-region read replicas; read replicas are only available within the same region, so this option is not technically feasible. Option D is wrong because automated backups with point-in-time recovery require restoring from a backup stored in the same region or a different region, but the restore process can take significantly longer than 10 minutes for a 500 GB database, and the recovery point would be at best the last backup, not near-real-time.
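
The data-loss exposure behind these options can be put in rough numbers. The sketch below is illustrative arithmetic only: the 2,000 writes per second comes from the question, while the 5-second replication lag is an assumed figure chosen for the example, not a measured or documented value.

```python
# Rough recovery point objective (RPO) arithmetic for the scenario above.
WRITES_PER_SECOND = 2_000  # from the question


def writes_at_risk(rpo_seconds: int, writes_per_second: int = WRITES_PER_SECOND) -> int:
    """Upper bound on writes lost if a failure hits rpo_seconds after the last copy."""
    return rpo_seconds * writes_per_second


# Daily export: the last copy can be up to 24 hours old.
daily_export = writes_at_risk(24 * 60 * 60)

# ASSUMPTION: a continuously updated replica that lags the primary by 5 seconds.
replica_lag = writes_at_risk(5)

print(f"daily export: up to {daily_export:,} writes at risk")
print(f"replica with 5 s lag: up to {replica_lag:,} writes at risk")
```

Under these assumptions the gap is several orders of magnitude, which is why a schedule-based export cannot meet a "minimal data loss" requirement at this write rate.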

21

Drag & Dropmedium

Drag and drop the steps to set up a Cloud VPN tunnel between Google Cloud and an on-premises network into the correct order.

Why this order

Cloud Router is used for dynamic routing. The tunnel requires the on-premises public IP and pre-shared key.

22

MCQmedium

A team is using Cloud Functions and wants to ensure retries on failure. What is the best practice?

A.Increase function timeout.

B.Use background functions with Pub/Sub.

C.Configure maximum retries and set dead-letter topic.

D.Use synchronous invocation.

AnswerC

Automatic retries with dead-letter for investigation.

Why this answer

Option C is correct because Cloud Functions (2nd gen) and Cloud Run allow configuring maximum retry attempts and a dead-letter topic to handle messages that repeatedly fail processing. This ensures that transient failures are retried automatically, while persistent failures are captured in a dead-letter queue for later analysis, preventing message loss and enabling reliable event-driven processing.

Exam trap

Google Cloud often tests the misconception that simply using a background function or increasing timeout is sufficient for reliability, when in fact explicit retry configuration and dead-letter handling are required for robust error recovery.

How to eliminate wrong answers

Option A is wrong because increasing function timeout does not cause retries; it only extends the maximum execution duration, and if the function fails after the timeout, no retry is triggered unless explicitly configured. Option B is wrong because background functions with Pub/Sub are a type of function, not a retry mechanism; while Pub/Sub can be used with retry policies, the statement itself does not address configuring retries or dead-letter topics. Option D is wrong because synchronous invocation (e.g., via HTTP triggers) does not inherently provide retry logic; the caller must implement retries, and Cloud Functions does not automatically retry synchronous invocations on failure.
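
The retry-then-dead-letter pattern described above can be sketched in plain Python. This is a local simulation under stated assumptions, not Pub/Sub or Cloud Functions code: `deliver_with_retries`, `make_flaky_handler`, and the list standing in for a dead-letter topic are all hypothetical names invented for illustration.

```python
from typing import Callable, List, Tuple


def deliver_with_retries(
    message: str,
    handler: Callable[[str], None],
    max_attempts: int,
    dead_letter: List[str],
) -> Tuple[bool, int]:
    """Call handler until it succeeds or max_attempts is used up.

    Returns (succeeded, attempts_used). A message that exhausts its
    attempts is appended to dead_letter for later investigation.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True, attempt
        except Exception:
            continue  # a real system would back off before redelivering
    dead_letter.append(message)
    return False, max_attempts


def make_flaky_handler(failures_before_success: int) -> Callable[[str], None]:
    """Hypothetical handler that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def handler(message: str) -> None:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError(f"transient failure processing {message!r}")

    return handler


dead_letter_queue: List[str] = []
# Transient failure: recovers on the third attempt.
transient = deliver_with_retries("order-1", make_flaky_handler(2), 5, dead_letter_queue)
# Persistent failure: never recovers within five attempts, so it is dead-lettered.
persistent = deliver_with_retries("order-2", make_flaky_handler(10), 5, dead_letter_queue)

print(transient, persistent, dead_letter_queue)
```

The point of the cap is that a message which can never succeed stops consuming retries and is set aside where someone can inspect it, instead of being lost or retried forever.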

23

MCQeasy

A company wants to monitor their Cloud Run services for errors and latency. Which Google Cloud product should they use?

A.Cloud Trace

B.Cloud Monitoring

C.Cloud Logging

D.Error Reporting

AnswerB

Provides metrics, dashboards, and alerts.

Why this answer

Cloud Monitoring (formerly Stackdriver Monitoring) provides comprehensive observability for Cloud Run services, including built-in dashboards for request latency, error rates, and resource utilization. It collects metrics like request count, request latencies, and container instance counts, and allows you to set alerting policies based on these metrics. While Cloud Trace can help with latency analysis and Cloud Logging captures logs, Cloud Monitoring is the primary product for monitoring both errors and latency in a unified view.

Exam trap

The trap here is that candidates often confuse Cloud Trace (for latency) or Error Reporting (for errors) as standalone solutions, but the question asks for a single product that monitors both errors and latency, which is Cloud Monitoring's role as the central metrics and alerting platform.

How to eliminate wrong answers

Option A is wrong because Cloud Trace is a distributed tracing tool focused on analyzing latency across service requests, but it does not provide a unified dashboard for error rates or resource metrics for Cloud Run. Option C is wrong because Cloud Logging is for storing, searching, and analyzing log data, not for monitoring metrics like latency percentiles or error counts in real-time dashboards. Option D is wrong because Error Reporting aggregates and analyzes application errors from logs, but it does not monitor latency or provide a holistic view of service health.

24

Multi-Selecthard

Your service has a 99.99% uptime SLO (monthly error budget ~ 4 minutes). Which TWO monitoring practices best support this SLO? (Choose 2)

Select 2 answers

A.Monitor CPU utilization and alert when average exceeds 80%.

B.Use a combination of availability (e.g., HTTP 200 rate) and latency (e.g., p99) as SLIs.

C.Use only synthetic monitoring from multiple locations.

D.Alert on every 5xx error immediately.

E.Track error budget consumption and alert when burn rate exceeds a threshold.

AnswersB, E

Good SLIs reflect user experience; availability and latency are common SLIs.

Why this answer

Options B and E are correct. Good SLIs measure what users experience, so availability and latency together are the usual choice; tracking error budget burn rate is the standard way to manage an SLO. Option A is wrong: CPU alone is not a user-facing SLI.

Option D is wrong: Alerting on every 5xx error can lead to alert fatigue; better to alert based on error budget burn rate. Option C is wrong: Synthetic monitoring is useful but not alone sufficient; a combination of real and synthetic is recommended.
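
The "~ 4 minutes" figure and the idea of a burn rate follow from simple arithmetic, sketched below. The 30-day window, the 0.1% observed error ratio, and the alert threshold of 2.0 are assumptions chosen for illustration; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes over the window for an SLO such as 0.9999."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return observed_error_ratio / (1.0 - slo)


SLO = 0.9999
BURN_RATE_THRESHOLD = 2.0  # ASSUMPTION: example threshold, not a recommended value

budget = error_budget_minutes(SLO)                  # roughly 4.32 minutes per 30 days
rate = burn_rate(observed_error_ratio=0.001, slo=SLO)  # 0.1% errors is roughly 10x budget
should_alert = rate > BURN_RATE_THRESHOLD

print(f"budget: {budget:.2f} min, burn rate: {rate:.1f}x, alert: {should_alert}")
```

A burn rate of 1 means the budget runs out exactly at the end of the window; alerting on a sustained rate above some threshold catches real budget consumption without paging on every individual 5xx.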

25

MCQmedium

A company uses the above IAM policy on a Cloud Storage bucket. They find that Bob can view objects in the bucket. Which statement explains this?

A.There is a higher-level policy that grants Bob viewer access.

B.The etag is mismatched causing policy override.

C.The bucket has uniform bucket-level access disabled.

D.Bob is a member of the group viewers@example.com.

E.The objectCreator role implicitly includes read access.

AnswerD

Group membership grants viewer access to Bob.

Why this answer

Option D is correct because the IAM policy shown includes a binding that grants the `roles/storage.objectViewer` role to the group `viewers@example.com`. If Bob is a member of that group, he inherits the permissions to view objects in the bucket. The policy explicitly lists this group as a principal, so Bob's ability to view objects is directly explained by his group membership.

Exam trap

Google Cloud often tests the distinction between IAM roles and ACLs, and the trap here is that candidates may overlook the group membership in the policy and instead incorrectly attribute Bob's access to a higher-level policy or a misunderstanding of role permissions.

How to eliminate wrong answers

Option A is wrong because the question asks which statement explains Bob's access given the provided IAM policy; a higher-level policy is not shown and would be an assumption, not a direct explanation from the given policy. Option B is wrong because the `etag` is used for optimistic concurrency control to prevent concurrent modification conflicts, not to cause policy overrides or grant access. Option C is wrong because uniform bucket-level access controls whether IAM policies alone govern access (disabling it would allow ACLs, but the policy shown still grants Bob access via IAM, so this does not explain his access).

Option E is wrong because the `roles/storage.objectCreator` role only allows creating objects, not reading them; read access requires the `roles/storage.objectViewer` role or equivalent.

26

MCQhard

A company runs a batch processing workload on Compute Engine that processes financial transactions. The workload runs daily and must complete within a 4-hour window. The application reads input data from Cloud Storage, processes it, and writes output to another Cloud Storage bucket. The current implementation uses a single VM with a 500 GB persistent disk. Recently, the data volume has increased, and the job is now taking over 6 hours, exceeding the SLA. The team is tasked with redesigning the solution to be faster and more reliable. They want to minimize costs and operational overhead. The data is critical and must not be lost. Which approach should they take?

A.Use a managed instance group with a startup script that processes data, and use Cloud Pub/Sub to coordinate.

B.Increase the VM to a high-CPU machine type with a regional persistent disk for HA.

C.Deploy the processing logic in Cloud Functions and trigger from Cloud Storage events.

D.Use Cloud Dataflow with autoscaling to process the data in parallel.

AnswerD

Dataflow is a managed service that can scale horizontally, complete the job within the window, and provides fault tolerance.

Why this answer

Cloud Dataflow with autoscaling is the correct choice because it provides a fully managed, serverless service for parallel data processing that can automatically scale resources based on the volume of data. This directly addresses the need to complete the batch workload within the 4-hour SLA, as Dataflow can distribute the processing across many workers, significantly reducing execution time. It also ensures reliability and data durability through checkpointing and exactly-once processing semantics, meeting the critical data loss prevention requirement.

Exam trap

Google Cloud often tests the misconception that serverless functions like Cloud Functions can handle long-running batch jobs, but the key trap is ignoring the 9-minute timeout and lack of state management, leading candidates to choose Option C over the correct Dataflow solution.

How to eliminate wrong answers

Option A is wrong because using a managed instance group with a startup script and Cloud Pub/Sub adds unnecessary operational overhead and complexity for a batch workload; it does not natively provide parallel processing or autoscaling for data pipelines, and the coordination via Pub/Sub is not designed for batch processing of this nature. Option B is wrong because simply increasing the VM to a high-CPU machine type with a regional persistent disk does not address the parallelism needed to reduce processing time from 6+ hours to under 4 hours; it is a vertical scaling approach that has limits and does not improve reliability through distribution, and regional persistent disks provide high availability but not faster processing. Option C is wrong because Cloud Functions are designed for event-driven, short-lived executions with a maximum timeout of 9 minutes (540 seconds) and are not suitable for long-running batch processing jobs that can take hours; they also lack the ability to handle large-scale data shuffling and stateful processing required for financial transactions.

27

Multi-Selecthard

A company runs a stateful application on GKE using StatefulSets. Which THREE practices improve reliability?

Select 3 answers

A.Use headless services.

B.Use horizontal autoscaling based on disk usage.

C.Use volume snapshots for backup.

D.Use pod disruption budgets.

E.Use persistent volumes with reclaim policy Delete.

AnswersA, C, D

Provides stable network identities for stateful workloads.

Why this answer

A headless service (clusterIP: None) allows direct pod-to-pod communication without load balancing, which is essential for stateful applications like databases that require stable network identities. Each pod in a StatefulSet gets a unique DNS name (e.g., pod-0.service.namespace.svc.cluster.local), enabling reliable discovery and ordering for replication, leader election, and failover. This ensures that clients always reach the correct pod instance, improving overall reliability.
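
As a small illustration of those stable identities, the sketch below builds the per-pod DNS names a StatefulSet gets behind a headless service. The StatefulSet name `db`, service name `db-headless`, and namespace `prod` are hypothetical, and the default cluster domain `cluster.local` is assumed.

```python
from typing import List


def statefulset_pod_dns(
    name: str, replicas: int, service: str, namespace: str = "default"
) -> List[str]:
    """Per-pod DNS names: <statefulset>-<ordinal>.<service>.<namespace>.svc.cluster.local"""
    return [
        f"{name}-{ordinal}.{service}.{namespace}.svc.cluster.local"
        for ordinal in range(replicas)
    ]


# Hypothetical three-replica database StatefulSet.
names = statefulset_pod_dns("db", 3, "db-headless", "prod")
for dns_name in names:
    print(dns_name)
```

Because the ordinal is part of the name, a rescheduled `db-0` comes back under the same address, which is what replication and leader-election logic relies on.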

Exam trap

Google Cloud often tests the misconception that horizontal autoscaling can be based on any arbitrary metric like disk usage, but the HPA only supports CPU, memory, and custom/external metrics that must be exposed through the Metrics Server or a custom metrics adapter.

28

Multi-Selectmedium

Which TWO options are best practices for ensuring high availability of an application running on Google Kubernetes Engine (GKE)?

Select 2 answers

A.Use pod anti-affinity to spread pods across multiple zones.

B.Deploy all nodes in the same zone to simplify networking.

C.Configure managed instance groups with autohealing.

D.Prefer using preemptible VMs for cost savings.

E.Use a single zonal cluster to avoid cross-zone latency.

AnswersA, C

Spreading pods across zones improves resilience to zonal failures.

Why this answer

Option A is correct because pod anti-affinity ensures that pods from the same application are scheduled on different nodes across multiple zones, reducing the blast radius of a zonal failure. This is a key pattern for achieving high availability in GKE, as it prevents a single zone outage from taking down all replicas of your application.

Exam trap

The trap here is that candidates often confuse cost-optimization strategies (like preemptible VMs) with high-availability strategies, or they mistakenly believe that a single-zone cluster with autohealing is sufficient for zonal fault tolerance, when in fact you need multi-zone distribution and a regional cluster.

29

MCQhard

You are running a Kubernetes cluster in GKE with the default node pool configuration shown in the exhibit. Your application requires high disk I/O performance. You notice that the application is experiencing high latency for disk operations. What is the most likely cause?

A.Node auto-repair is causing disk contention.

B.The default node pool uses pd-standard disks, which have low IOPS.

C.The OAuth scopes restrict disk access, causing high latency.

D.The machine type n1-standard-2 does not have enough CPU.

AnswerB

pd-standard is HDD with lower IOPS; pd-ssd provides higher performance for high I/O workloads.

Why this answer

The default node pool in GKE uses pd-standard (standard persistent disk) which provides lower IOPS compared to pd-ssd. For applications requiring high disk I/O performance, pd-standard disks become a bottleneck, causing high latency. Upgrading to pd-ssd or using local SSDs would resolve this issue.

Exam trap

Google Cloud often tests the distinction between storage performance (disk type) and other operational features (auto-repair, scopes, machine type), leading candidates to confuse node health mechanisms or permission settings with actual I/O performance bottlenecks.

How to eliminate wrong answers

Option A is wrong because node auto-repair is a GKE feature that automatically repairs unhealthy nodes (e.g., if the node fails health checks), but it does not cause disk contention; it operates at the node level, not by interfering with disk I/O. Option C is wrong because OAuth scopes control API access permissions (e.g., read/write to Cloud Storage), not the performance characteristics of persistent disk operations; disk I/O latency is a storage performance issue, not an authorization issue. Option D is wrong because n1-standard-2 (2 vCPUs, 7.5 GB memory) is a general-purpose machine type that can handle moderate workloads; insufficient CPU would manifest as high CPU utilization or scheduling delays, not specifically high disk I/O latency.

30

MCQhard

Your company runs a critical multi-tier application: a global HTTP(S) load balancer, multiple regional managed instance groups (MIGs) for the web tier, and Cloud Spanner for the data tier. You need to design for zone-level and region-level failures. What architecture ensures the highest availability?

A.Use a global HTTP(S) load balancer with a single global MIG and a multi-region Cloud Spanner instance.

B.Use a global HTTP(S) load balancer with a single zonal MIG and Cloud Spanner single-region.

C.Use a global HTTP(S) load balancer with regional MIGs in multiple regions, each spanning zones, and a multi-region Cloud Spanner instance.

D.Use a regional HTTP(S) load balancer with a regional MIG and Cloud SQL with cross-region replication.

AnswerC

Regional MIGs across zones handle zone failures; multiple regions and multi-region Spanner handle region failures.

Why this answer

Option C is correct because it combines a global HTTP(S) load balancer (which can route traffic to healthy backends across regions), regional MIGs that span multiple zones within each region (providing zone-level redundancy), and a multi-region Cloud Spanner instance (which provides synchronous replication across regions for strong consistency and automatic failover). This architecture ensures that if an entire zone or region fails, traffic is automatically redirected to healthy backends in other zones/regions, and Spanner continues to serve reads and writes without manual intervention.

Exam trap

Google Cloud often tests the distinction between 'regional' and 'global' load balancers, and the trap here is that candidates might choose a regional load balancer (Option D) thinking it is sufficient, but it cannot route traffic across regions, making it unsuitable for region-level failure recovery.

How to eliminate wrong answers

Option A is wrong because a single global MIG (even if multi-zonal) is still deployed within a single region; if that entire region fails, the application becomes unavailable. Option B is wrong because a single zonal MIG cannot survive even a zone failure, and a single-region Cloud Spanner instance cannot survive a regional failure. Option D is wrong because a regional HTTP(S) load balancer cannot distribute traffic across multiple regions, and Cloud SQL with cross-region replication does not provide the same strong consistency and automatic failover as multi-region Spanner; also, Cloud SQL cross-region replication is asynchronous and may lose data during a failover.
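
The benefit of adding a second region can be approximated with a standard redundancy formula. The sketch below is idealised: the 99.9% per-region availability is a hypothetical figure, and regions are assumed to fail independently, which real outages do not always respect.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one of `regions` independent regions is up."""
    return 1.0 - (1.0 - per_region) ** regions


PER_REGION = 0.999  # ASSUMPTION: illustrative per-region availability

single = combined_availability(PER_REGION, 1)
dual = combined_availability(PER_REGION, 2)

print(f"one region: {single:.6f}, two regions: {dual:.6f}")
```

Under these assumptions, a second region takes the serving tier from roughly three nines to roughly six, provided the load balancer and data tier can also survive the regional failure.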

31

MCQhard

A global application uses Cloud Spanner with a multi-region configuration. During a regional outage, some transactions are failing. What is the recommended approach to maintain write availability?

A.Implement application-level retry with exponential backoff

B.Use a single-region Spanner instance with a standby in a different zone

C.Configure Spanner with leader-based replication and rely on automatic failover

D.Manually failover to a different region using a script

AnswerC

Spanner automatically fails over to another region if the leader region fails.

Why this answer

Cloud Spanner's multi-region configuration uses leader-based replication: each split has a single leader replica, normally located in the default leader region, that coordinates writes. During a regional outage, Spanner automatically moves leadership to a read-write replica in another region, ensuring write availability without manual intervention. This is the recommended approach because it leverages Spanner's built-in synchronous replication and automatic failover to maintain consistency and availability.

Exam trap

The trap here is that candidates confuse Spanner's automatic failover with manual failover approaches used in traditional databases, or assume that application-level retry alone can compensate for a regional outage, ignoring Spanner's built-in leader election and synchronous replication.

How to eliminate wrong answers

Option A is wrong because application-level retry with exponential backoff is a general resilience pattern but does not address the root cause of write unavailability during a regional outage; Spanner's automatic failover is required to restore write capability. Option B is wrong because a single-region Spanner instance with a standby in a different zone does not provide multi-region write availability; it only offers zone-level redundancy within a single region, which cannot survive a full regional outage. Option D is wrong because manual failover using a script is not recommended for Spanner; the service handles failover automatically via its leader-based replication, and manual intervention can lead to inconsistencies or extended downtime.

32

Multi-Selectmedium

A company is migrating a critical database to Cloud SQL for MySQL. Which TWO actions ensure high availability?

Select 2 answers

A.Use read replicas in multiple zones.

B.Configure a failover replica with a different IP.

C.Enable automatic backups.

D.Enable high availability with a standby in another zone.

E.Enable multi-region failover.

AnswersC, D

Allows point-in-time recovery in case of data loss.

Why this answer

Option C is correct because enabling automatic backups in Cloud SQL for MySQL ensures that point-in-time recovery (PITR) and daily backups are automatically taken, which is a fundamental requirement for high availability. While backups alone do not provide instant failover, they are essential for data durability and recovery in case of a disaster, and the question asks for actions that 'ensure high availability'—backups are a core component of a high-availability strategy by enabling recovery from data loss or corruption.

Exam trap

The trap here is that candidates often confuse read replicas (which are for read scaling) with high-availability standby instances (which are for automatic failover), and they may also mistakenly think that multi-region failover is a built-in Cloud SQL feature when it is not supported for MySQL.

33

Multi-Selecteasy

A company runs a containerized application on Cloud Run. Which TWO actions will most improve the reliability of the service?

Select 2 answers

A.Enable CPU always allocated

B.Disable concurrency

C.Deploy the service in multiple regions

D.Set min instances to at least 1

E.Use Cloud CDN

AnswersC, D

Multi-region deployment provides high availability and failover if one region becomes unavailable.

Why this answer

Setting min instances > 0 prevents cold starts, ensuring consistent performance. Deploying in multiple regions provides regional failover. Enabling CPU always allocated is for background tasks, not reliability.

Disabling concurrency limits throughput. Cloud CDN is for static content, not compute reliability.

34

MCQeasy

A company runs a batch process every night that loads data into BigQuery. They want to ensure that if the job fails, it is retried automatically up to 3 times. Which configuration should they use?

A.Cloud Run jobs with --max-retries=3.

B.Cloud Scheduler with retry policy.

C.BigQuery load job with maximum retries setting.

D.Cloud Functions with error handling.

E.Cloud Composer (Airflow) tasks with retries.

AnswerE

Cloud Composer (Airflow) provides built-in task retry mechanism.

Why this answer

Option E is correct because Cloud Composer (Airflow) natively supports task-level retries via the `retries` parameter in task definitions, allowing you to specify up to 3 automatic retries on failure. This is the appropriate choice for orchestrating a batch process that loads data into BigQuery, as Airflow provides robust retry logic, dependency management, and monitoring for complex workflows.

Exam trap

Google Cloud often tests the distinction between a simple retry mechanism (like Cloud Scheduler or Cloud Functions) and a full workflow orchestration tool (Cloud Composer) that can manage retries within a multi-step batch process, leading candidates to pick a simpler option that lacks the necessary pipeline control.

How to eliminate wrong answers

Option A is wrong because Cloud Run jobs are designed for stateless containerized workloads, not for orchestrating batch data loads into BigQuery; their `--max-retries` applies to the job execution itself, not to the data loading step within a pipeline. Option B is wrong because Cloud Scheduler retry policies handle HTTP request failures, not the success or failure of the underlying BigQuery load job; it would retry the scheduler trigger, not the data load. Option C is wrong because BigQuery load jobs do not have a 'maximum retries setting' — they either succeed or fail, and retries must be managed externally by the caller.

Option D is wrong because Cloud Functions error handling (e.g., retry on failure) is for function execution, not for orchestrating a batch load job; it lacks the workflow-level retry control and dependency management needed for a nightly batch process.

35

MCQeasy

You are the lead cloud architect for a startup that runs a web application on Google Kubernetes Engine (GKE) with a standard (zonal) cluster. The application is deployed with 3 replicas of a stateless frontend service. During a recent incident, a zone outage caused all GKE nodes to become unavailable, leading to application downtime of 45 minutes. You need to redesign the cluster to tolerate a single zone failure with no more than 5 minutes of downtime. Your budget allows for at most a 20% increase in compute costs. Which approach should you take?

A.Increase the number of replicas from 3 to 9 and keep the zonal cluster

B.Change the frontend deployment to use regional persistent disks

C.Deploy a second GKE cluster in another region and use a global load balancer for failover

D.Migrate the cluster to a regional GKE cluster with nodes in 3 zones and distribute replicas across zones

AnswerD

Correct: regional cluster survives zone failure.

Why this answer

D is correct because a regional GKE cluster distributes nodes across three zones, ensuring that if one zone fails, the remaining two zones continue serving traffic. By spreading the 3 replicas across zones (e.g., one per zone), the application tolerates a single zone outage with near-zero downtime, and the 20% cost increase covers the additional node pool overhead without exceeding the budget.

Exam trap

The trap here is that candidates confuse increasing replica count with achieving zone redundancy, failing to realize that replicas must be distributed across failure domains (zones) to survive a zone outage, and that regional persistent disks are irrelevant for stateless workloads.

How to eliminate wrong answers

Option A is wrong because increasing replicas to 9 in a zonal cluster does not provide zone redundancy; all nodes remain in a single zone, so a zone outage still takes down all replicas. Option B is wrong because regional persistent disks are used for stateful workloads (e.g., databases) and do not help with zone-level node failure for a stateless frontend; the frontend does not require persistent disks. Option C is wrong because deploying a second cluster in another region introduces cross-region latency and failover complexity, and the 5-minute downtime target cannot be met with DNS propagation or global load balancer failover; it also likely exceeds the 20% cost increase due to full cluster duplication.
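
The difference between adding replicas and spreading them can be shown with a toy placement model. This is a simplified simulation, not scheduler behaviour: replicas are assigned round-robin, and the zone names are used only as example labels.

```python
from collections import Counter
from typing import Dict, List


def place_replicas(replicas: int, zones: List[str]) -> Dict[str, int]:
    """Round-robin replicas over zones (an idealised even spread)."""
    placement: Counter = Counter()
    for i in range(replicas):
        placement[zones[i % len(zones)]] += 1
    return dict(placement)


def surviving_replicas(placement: Dict[str, int], failed_zone: str) -> int:
    """Replicas still running after one zone is lost."""
    return sum(count for zone, count in placement.items() if zone != failed_zone)


# Option A: nine replicas, all in one zone.
zonal = place_replicas(9, ["us-central1-a"])
# Option D: three replicas spread over three zones.
regional = place_replicas(3, ["us-central1-a", "us-central1-b", "us-central1-c"])

zonal_left = surviving_replicas(zonal, "us-central1-a")
regional_left = surviving_replicas(regional, "us-central1-a")

print(f"zonal cluster survivors: {zonal_left}, regional cluster survivors: {regional_left}")
```

Tripling the replica count in one zone leaves nothing running after that zone fails, while three replicas across three zones keep two serving.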

36

MCQhard

You are designing a Dataflow streaming pipeline for real-time event processing. The pipeline must be cost-effective while tolerating worker failures without data loss. Which configuration should you use?

A.Use a batch Dataflow job with preemptible workers.

B.Use high-memory machine types for all workers to avoid preemption.

C.Use FlexRS with preemptible workers and enable streaming engine.

D.Use a standard Dataflow job with non-preemptible workers.

AnswerC

FlexRS allows preemptible workers with cost savings; Dataflow's checkpointing prevents data loss on preemption.

Why this answer

Option C is correct because FlexRS (Flexible Resource Scheduling) allows you to use preemptible workers in a streaming pipeline, which significantly reduces cost while the Streaming Engine provides durable state storage and checkpointing to tolerate worker failures without data loss. Preemptible workers are cheaper but can be terminated at any time; the Streaming Engine ensures that pipeline state is preserved and processing can resume seamlessly from the last checkpoint.

Exam trap

Google Cloud often tests the misconception that preemptible workers cannot be used in streaming pipelines, or that high-memory instances alone solve reliability, but the key is that FlexRS with Streaming Engine is the only option that combines cost savings with failure tolerance for real-time processing.

How to eliminate wrong answers

Option A is wrong because batch Dataflow jobs do not support streaming mode, and preemptible workers in a batch job can cause data loss if not combined with appropriate checkpointing mechanisms, which are not designed for real-time event processing. Option B is wrong because using high-memory machine types does not prevent preemption; preemptible workers are still subject to termination, and this approach increases cost without addressing failure tolerance. Option D is wrong because non-preemptible workers are more expensive and do not inherently provide the cost-effectiveness required, while standard Dataflow jobs without Streaming Engine may lose data on worker failure in streaming mode due to lack of durable state persistence.

37

MCQhard

Company B uses Cloud Endpoints to expose their API. Recently, they started seeing 503 errors during periods of high traffic. They have enabled Cloud Endpoints with a moderate quota. The backend is running on Cloud Run. The Cloud Run service is configured with min instances = 0 and max instances = 100. The container concurrency is set to 80. The average request latency is 200ms. What is the most likely cause and what should they do?

A.The container concurrency is too low; increase it to 200.

B.The backend is experiencing cold starts; set a higher CPU limit.

C.Cloud Run is scaling too slowly; set min instances to a higher value.

D.The Cloud Endpoints quota is being exhausted; increase the quota.

AnswerC

Cold starts cause latency spikes leading to 503s; warm instances mitigate this.

Why this answer

The 503 errors during high traffic are most likely caused by Cloud Run's scaling latency. With min instances = 0, new requests must wait for a container to start (cold start), and during traffic spikes, the scaling algorithm may not provision instances quickly enough, leading to request timeouts and 503s. Setting a higher min instances value ensures a warm pool of containers is always ready to handle traffic bursts, reducing cold start delays.

Exam trap

The trap here is that candidates confuse 503 errors with quota exhaustion (option D) or misattribute the issue to concurrency limits (option A), when in fact the 503 is a classic symptom of Cloud Run's cold start and scaling delay with min instances = 0.

How to eliminate wrong answers

Option A is wrong because container concurrency (80) is already high; increasing it to 200 would not address the root cause of scaling delays and could overload containers, increasing latency. Option B is wrong because cold starts are caused by idle instances being shut down (min instances = 0), not by CPU limits; increasing CPU limits would not prevent cold starts. Option D is wrong because Cloud Endpoints quota is described as 'moderate' and the errors occur during high traffic on the backend, not at the API gateway; quota exhaustion would typically return 429 or 403 errors, not 503.
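As a rough capacity check for the scaling argument above, Little's law (requests in flight = arrival rate × latency) estimates how many warm instances a given request rate would need. This is a minimal sketch: the concurrency (80) and latency (200 ms) come from the question, while the traffic figures and the function name `instances_needed` are hypothetical.

```python
import math


def instances_needed(requests_per_second: int, latency_ms: int, concurrency: int) -> int:
    """Estimate instances via Little's law: in-flight requests = rate x latency."""
    in_flight = requests_per_second * latency_ms / 1000
    return max(1, math.ceil(in_flight / concurrency))


# Concurrency 80 and 200 ms latency are taken from the question;
# the request rates below are made-up illustrations.
for rps in (400, 4_000, 20_000):
    print(f"{rps} req/s -> about {instances_needed(rps, 200, 80)} instance(s)")
```

With min instances = 0, every one of those instances has to be started on demand during a spike, which is where the cold-start delay comes from.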

38

MCQeasy

A startup uses Cloud Functions for event-driven processing. They notice some functions are timing out. How can they increase reliability without changing the business logic?

A.Increase the function timeout to the maximum allowed

B.Use Cloud Tasks to decouple and retry synchronously

C.Enable retry on failure for the event-driven function

D.Refactor the function to reduce complexity

AnswerC

Cloud Functions supports automatic retry for event-driven triggers, which handles transient timeouts.

Why this answer

Option C is correct because enabling retry on failure for event-driven Cloud Functions allows the platform to automatically retry the invocation when a function times out or fails, without modifying the business logic. This leverages the built-in retry mechanism for background functions, which uses exponential backoff to handle transient failures and improve reliability.
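The exponential backoff mentioned above can be illustrated with a small sketch. The base delay, multiplier, and cap below are hypothetical values chosen for illustration, not the schedule the platform actually uses.

```python
from typing import List


def backoff_delays(base_seconds: int, factor: int, max_seconds: int, attempts: int) -> List[int]:
    """Return the wait before each retry, growing geometrically up to a cap."""
    return [min(max_seconds, base_seconds * factor ** n) for n in range(attempts)]


# Hypothetical parameters: start at 1 s, double each time, cap at 60 s.
print(backoff_delays(base_seconds=1, factor=2, max_seconds=60, attempts=8))
```

Because a retried event may be delivered more than once, the function should be safe to run repeatedly for the same event.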

Exam trap

Google Cloud often tests the misconception that increasing timeout or refactoring code is the only way to handle timeouts, but the trap here is that enabling retry on failure is a configuration-only change that improves reliability without altering business logic.

How to eliminate wrong answers

Option A is wrong because simply increasing the timeout to the maximum allowed (e.g., 540 seconds for event-driven functions) does not address the root cause of timeouts; it only postpones the failure and can lead to resource exhaustion. Option B is wrong because Cloud Tasks decouples and retries asynchronously, not synchronously; using it would require changing the architecture and business logic, which contradicts the requirement to not change business logic. Option D is wrong because refactoring the function to reduce complexity changes the business logic, which is explicitly disallowed by the question.

39

MCQeasy

You are using Cloud SQL for PostgreSQL. You want to ensure that data can be recovered to any point within the last 7 days. What should you enable?

A.Export the database daily to Cloud Storage.

B.Create a cross-region read replica.

C.Enable automated backups with a 7-day retention period.

D.Enable automated backups and set the backup configuration to enable binary logging (point-in-time recovery).

AnswerD

Point-in-time recovery replays archived transaction logs (write-ahead logs in PostgreSQL), enabling recovery to any second within the retention period.

Why this answer

Option D is correct because point-in-time recovery (PITR) requires automated backups plus transaction log archiving. In Cloud SQL for PostgreSQL the archived logs are write-ahead logs (WAL); "binary logging" is the equivalent MySQL term that the option borrows. With PITR enabled you can restore the database to any specific timestamp within the retention window, which you can set to 7 days. Automated backups alone only provide daily snapshot restores, not the granularity needed for recovery to any point in time.

Exam trap

The trap here is that candidates confuse automated backups (daily snapshots) with point-in-time recovery, assuming that a 7-day backup retention alone provides the ability to restore to any moment, when in fact binary logging (WAL archiving) is required for that granularity.

How to eliminate wrong answers

Option A is wrong because exporting the database daily to Cloud Storage creates static snapshots at a single point in time each day; you cannot recover to arbitrary timestamps between exports, and the process is manual or scheduled, not a continuous recovery mechanism. Option B is wrong because a cross-region read replica provides read-only copies for disaster recovery or read scaling, but it does not enable point-in-time recovery or retain transaction logs for the primary instance. Option C is wrong because enabling automated backups with a 7-day retention period only stores daily full backups; without binary logging (WAL archiving), you can only restore to the exact backup timestamps, not to any arbitrary point within the 7 days.
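The difference in restore granularity can be sketched as follows. This is an illustrative model only: the dates, the 02:00 backup time, and both function names are hypothetical, and real restores are performed through Cloud SQL rather than code like this.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def restore_point_backups_only(target: datetime, backups: List[datetime]) -> Optional[datetime]:
    """With daily backups alone, the best available state is the latest backup at or before the target."""
    earlier = [b for b in backups if b <= target]
    return max(earlier) if earlier else None


def restore_point_with_pitr(target: datetime, now: datetime, retention_days: int = 7) -> Optional[datetime]:
    """With log archiving, any target inside the retention window is restorable exactly."""
    window_start = now - timedelta(days=retention_days)
    return target if window_start <= target <= now else None


# Hypothetical timeline: one backup per day at 02:00.
now = datetime(2024, 1, 8, 12, 0)
backups = [datetime(2024, 1, day, 2, 0) for day in range(2, 9)]
target = datetime(2024, 1, 5, 15, 30)

nearest = restore_point_backups_only(target, backups)
print("backups only:", nearest, "| changes lost:", target - nearest)
print("with PITR:   ", restore_point_with_pitr(target, now))
```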

40

MCQhard

The exhibit shows a managed instance group configuration. What is the primary purpose of the 'autoHealingPolicies' section?

A.Distribute incoming traffic evenly across the instances.

B.Automatically add more instances when CPU utilization exceeds 60%.

C.Automatically replace instances that are deemed unhealthy based on the health check.

D.Automatically update instances to a new instance template.

AnswerC

Autohealing monitors instance health and replaces unhealthy ones.

Why this answer

The 'autoHealingPolicies' section in a managed instance group configuration is specifically designed to automatically replace instances that are deemed unhealthy based on a configured health check. When a health check probe (e.g., HTTP, TCP, or SSL) fails for a sustained period, the managed instance group terminates the unhealthy instance and creates a new one from the instance template, ensuring the desired number of healthy instances is maintained. This is distinct from autoscaling, which adjusts instance count based on load metrics.
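The "fails for a sustained period" condition can be modeled as a run of consecutive failed probes. The sketch below is a simplified illustration, not the managed instance group's actual implementation; the function name and the threshold value are assumptions.

```python
from typing import Iterable


def needs_recreation(probe_results: Iterable[bool], unhealthy_threshold: int) -> bool:
    """Return True once `unhealthy_threshold` consecutive probes have failed."""
    consecutive_failures = 0
    for healthy in probe_results:
        # A successful probe resets the streak; a failure extends it.
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= unhealthy_threshold:
            return True
    return False


# Hypothetical probe histories (True = probe succeeded).
print(needs_recreation([True, False, False, True], unhealthy_threshold=3))   # brief blip
print(needs_recreation([True, False, False, False], unhealthy_threshold=3))  # sustained failure
```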

Exam trap

Google Cloud often tests the distinction between 'autohealing' (health-based instance replacement) and 'autoscaling' (metric-based instance count adjustment), causing candidates to confuse the purpose of the 'autoHealingPolicies' section with scaling policies.

How to eliminate wrong answers

Option A is wrong because distributing incoming traffic evenly across instances is the function of a load balancer (e.g., HTTP(S) Load Balancer or Network Load Balancer) and its backend service, not the 'autoHealingPolicies' section of a managed instance group. Option B is wrong because automatically adding instances when CPU utilization exceeds 60% is a function of the 'autoscaling' policy (based on a CPU utilization metric), not the 'autoHealingPolicies' section, which only reacts to health check failures. Option D is wrong because automatically updating instances to a new instance template is achieved through a 'rolling update' or 'canary update' strategy (e.g., using the 'updatePolicy' section), not through 'autoHealingPolicies', which only replaces unhealthy instances with the current template.

41

MCQhard

A company has a Spanner instance for global transactions. They need to ensure reliability during a regional outage. What is the best approach?

A.Spanner is already resilient across zones; use backup/restore for regions.

B.Use multiple instances in different regions.

C.Enable multi-region configuration with read-only replicas.

D.Enable leader placement option.

AnswerC

Automatic failover and read scalability.

Why this answer

Spanner's multi-region configuration replicates data synchronously across regions, providing global transactional consistency and automatic failover without manual intervention. Read-only replicas serve reads locally; if the leader region becomes unavailable, leadership moves automatically to a read-write replica in another region. This approach ensures reliability during a regional outage while preserving Spanner's strong consistency guarantees.

Exam trap

Google Cloud often tests the misconception that multiple independent instances (Option B) are needed for cross-region resilience, when in fact Spanner's multi-region configuration with read-only replicas provides built-in global high availability without application changes.

How to eliminate wrong answers

Option A is wrong because Spanner's zone-level resilience does not protect against a full regional outage; backup/restore is a disaster recovery mechanism with significant RTO/RPO, not a high-availability solution. Option B is wrong because using multiple independent Spanner instances in different regions would require application-level sharding or eventual consistency, breaking Spanner's native strong consistency and global transaction support. Option D is wrong because leader placement only controls which zone holds the leader for a given split, but does not provide read-only replicas or automatic failover across regions for regional outage protection.

42

Multi-Selectmedium

A company runs a stateful workload on Compute Engine with regional persistent disks (PD). They need to implement a disaster recovery (DR) plan with a Recovery Point Objective (RPO) of less than 1 hour and Recovery Time Objective (RTO) of less than 4 hours. Which THREE steps should they include in their DR plan? (Choose three.)

Select 3 answers

A.Take snapshots of the persistent disk every 30 minutes and copy them to a Cloud Storage bucket in another region

B.Create a snapshot schedule for the persistent disk every 4 hours

C.Create a custom machine image of the instance and store it in a Cloud Storage bucket in the DR region

D.Use regional persistent disks to automatically replicate data to a second zone

E.Test the failover procedure quarterly to validate RTO and RPO

AnswersA, C, E

Correct: meets RPO and protects against regional failure.

Why this answer

Option A is correct because taking snapshots every 30 minutes meets the RPO of less than 1 hour. By copying these snapshots to a Cloud Storage bucket in another region, you ensure data is available in a DR region for recovery, which is essential for cross-region disaster recovery.
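The RPO reasoning reduces to simple arithmetic: in the worst case, a failure just before the next snapshot loses roughly one full interval of data. The sketch below ignores snapshot creation and copy time, which is a simplifying assumption; the function name is hypothetical.

```python
def meets_rpo(snapshot_interval_minutes: int, rpo_limit_minutes: int) -> bool:
    """Worst-case data loss is about one snapshot interval, so it must stay under the RPO."""
    return snapshot_interval_minutes < rpo_limit_minutes


RPO_LIMIT_MINUTES = 60  # "less than 1 hour" from the question

print("every 30 minutes:", meets_rpo(30, RPO_LIMIT_MINUTES))   # option A
print("every 4 hours:   ", meets_rpo(240, RPO_LIMIT_MINUTES))  # option B
```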

Exam trap

The trap here is confusing cross-zone replication within one region (regional PD) with cross-region disaster recovery; regional PDs only protect against zonal failures, not regional outages, so they cannot meet a cross-region DR requirement.

43

MCQhard

An organization uses Cloud Functions (2nd gen) for event-driven processing. They notice that some functions fail with 'memory limit exceeded' errors during peak load. The function processes messages from Pub/Sub and writes to Firestore. What should they do to improve reliability without sacrificing throughput?

A.Increase the maximum number of concurrent function instances.

B.Increase the memory allocated to the Cloud Function.

C.Enable Pub/Sub batching to reduce the number of function invocations.

D.Split the function into multiple smaller functions, each handling a subset of the data.

AnswerB

More memory allows the function to handle larger data per invocation without hitting the limit.

Why this answer

The 'memory limit exceeded' error indicates that the function's allocated memory is insufficient for the workload during peak load. Increasing the memory allocation (Option B) directly resolves this by providing more RAM for processing larger messages or concurrent operations, without altering the invocation pattern or throughput. Cloud Functions (2nd gen) allow memory to be set up to 32 GiB, and this change does not reduce the number of events processed per second.

Exam trap

Google Cloud often tests the misconception that scaling out (more instances) solves memory issues, but the trap here is that memory limits are per-instance, so only increasing the per-instance memory allocation directly resolves the error.

How to eliminate wrong answers

Option A is wrong because increasing the maximum number of concurrent instances does not address the per-instance memory limit; it may actually worsen the problem by allowing more instances to hit the same memory ceiling simultaneously. Option C is wrong because Pub/Sub batching reduces the number of function invocations but does not increase the memory available per invocation; it could also increase latency and does not fix the root cause of memory exhaustion. Option D is wrong because splitting the function into multiple smaller functions does not increase the memory per function instance; it adds complexity and may reduce throughput due to additional overhead, without guaranteeing that each smaller function avoids memory limits.

44

MCQeasy

A company uses Cloud Spanner for a global financial application. They experience increased latency and transaction aborts during peak hours. Which measure should they take first to improve reliability?

A.Increase the number of nodes in the Spanner instance.

B.Reduce the number of indexes on frequently updated columns.

C.Optimize transactions to reduce lock contention.

D.Use interleaved tables to co-locate related data.

AnswerC

Short, single-partition transactions reduce the chance of conflicts and aborts.

Why this answer

Option C is correct because transaction aborts and latency in Cloud Spanner are most commonly caused by lock contention during peak hours. By optimizing transactions—such as reducing their scope, using read-only transactions where possible, and avoiding hot-spot writes—you directly address the root cause of contention without incurring additional cost or schema changes. This aligns with Google's best practices for Spanner reliability.

Exam trap

Google Cloud often tests the misconception that scaling nodes (Option A) is the universal fix for performance issues, but the trap here is that Spanner's horizontal scaling does not resolve lock contention—it only increases parallelism, which can worsen contention if transactions are not optimized.

How to eliminate wrong answers

Option A is wrong because increasing nodes primarily improves throughput and storage capacity, not latency or abort rates caused by lock contention; adding nodes can even increase distributed transaction overhead. Option B is wrong because reducing indexes on frequently updated columns may reduce write amplification but does not address the immediate issue of lock contention and aborts; indexes are not the primary cause of transaction conflicts. Option D is wrong because interleaved tables co-locate parent-child rows for faster joins and lower latency, but they do not reduce lock contention; in fact, they can increase contention if the parent row becomes a hot spot.

45

MCQeasy

A developer wants to monitor a custom application metric from their application running on GKE. What should they use?

A.Cloud Logging

B.Cloud Trace

C.Cloud Debugger

D.Cloud Monitoring custom metrics API

AnswerD

The custom metrics API allows ingesting and monitoring custom application metrics.

Why this answer

Cloud Monitoring custom metrics API (option D) is the correct choice because it allows a developer to push custom application-specific metrics (e.g., request latency, queue depth) from a GKE pod using the `custom.googleapis.com` metric domain. This integrates directly with Cloud Monitoring for alerting and dashboards, whereas Cloud Logging is for log data, not metrics.
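To make the shape of a custom metric concrete, the sketch below builds one time-series entry as a plain dictionary. The field names follow the general layout of a Cloud Monitoring time series as understood here and should be checked against the API reference before use; the metric name `queue_depth`, the label values, and the helper `build_time_series` are all hypothetical, and nothing is sent over the network.

```python
import json
from datetime import datetime, timezone
from typing import Dict


def build_time_series(metric_name: str, value: float, pod_labels: Dict[str, str]) -> dict:
    """Assemble one data point for a metric under the custom.googleapis.com domain."""
    return {
        "metric": {"type": f"custom.googleapis.com/{metric_name}"},
        # Assumed resource type and labels for a GKE pod; verify before use.
        "resource": {"type": "k8s_pod", "labels": pod_labels},
        "points": [
            {
                "interval": {"endTime": datetime.now(timezone.utc).isoformat()},
                "value": {"doubleValue": float(value)},
            }
        ],
    }


# Hypothetical label values for illustration only.
series = build_time_series(
    "queue_depth",
    42,
    {
        "project_id": "example-project",
        "location": "us-central1",
        "cluster_name": "example-cluster",
        "namespace_name": "default",
        "pod_name": "worker-0",
    },
)
print(json.dumps({"timeSeries": [series]}, indent=2))
```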

Exam trap

The trap here is that candidates confuse Cloud Logging (for logs) with Cloud Monitoring (for metrics), or assume that Cloud Trace can handle custom metrics because it deals with application performance data.

How to eliminate wrong answers

Option A is wrong because Cloud Logging ingests log entries (text-based events), not numeric metric data points; it cannot be used to monitor custom application metrics like counters or gauges. Option B is wrong because Cloud Trace is a distributed tracing system for latency analysis of requests, not for publishing custom numeric metrics. Option C is wrong because Cloud Debugger is used for inspecting application state at specific code points without stopping the app, not for collecting or monitoring time-series metrics.

46

MCQmedium

Refer to the exhibit. The process-image function fails intermittently with a memory limit exceeded error. Which action will MOST effectively resolve the issue?

A.Increase the function memory to 256MB.

B.Increase the function timeout to 120 seconds.

C.Reduce the maximum concurrent executions to 5.

D.Change the trigger to Cloud Pub/Sub to reduce load.

AnswerA

More memory directly addresses the 'memory limit exceeded' error.

Why this answer

The error indicates memory limit exceeded. Increasing the memory allocation will give the function more memory to process images, and is the direct fix.

47

MCQmedium

Company A runs a containerized application on Google Kubernetes Engine (GKE) with 3 node pools: one for frontend, one for backend, and one for stateful databases. The backend services experience periodic latency spikes. After investigation, they found that the spikes correlate with the node pool autoscaler scaling down nodes. The backend services are deployed as Deployments with resource requests and limits set to 100m CPU and 200Mi memory each. The node pool uses n1-standard-2 machine types. The cluster autoscaler is enabled. What should they do to prevent the latency spikes?

A.Disable cluster autoscaler for the backend node pool.

B.Use node taints and tolerations to isolate the backend services.

C.Increase the resource requests for the backend services to ensure they are scheduled on dedicated nodes.

D.Configure a PodDisruptionBudget for the backend Deployment with minAvailable set to a high value.

AnswerD

Limits the number of pods that can be disrupted during voluntary disruptions.

Why this answer

The latency spikes occur because the cluster autoscaler is terminating nodes that host backend Pods, causing those Pods to be rescheduled and disrupting traffic. A PodDisruptionBudget (PDB) with a high minAvailable value ensures that a minimum number of backend Pods remain available during voluntary disruptions like node scale-down, preventing the sudden loss of capacity that leads to latency spikes. This directly addresses the root cause without disabling autoscaling or misconfiguring scheduling.
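The effect of minAvailable can be sketched as a budget: voluntary evictions are allowed only while the number of healthy Pods stays at or above the minimum. The replica counts below are hypothetical, and the function is an illustration of the rule rather than Kubernetes code.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of Pods that may be evicted voluntarily without violating the budget."""
    return max(0, healthy_pods - min_available)


# Hypothetical backend Deployment with 5 healthy replicas.
print(allowed_disruptions(healthy_pods=5, min_available=4))  # one Pod may be evicted at a time
print(allowed_disruptions(healthy_pods=4, min_available=4))  # evictions must wait
```

Setting minAvailable equal to the replica count leaves a budget of zero, which would block node scale-down entirely, so "high" should still leave room for at least one eviction.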

Exam trap

The trap here is that candidates often confuse resource requests/limits or node isolation with disruption protection, failing to recognize that PodDisruptionBudgets are the specific Kubernetes mechanism to control voluntary disruptions like autoscaler-driven node termination.

How to eliminate wrong answers

Option A is wrong because disabling the cluster autoscaler for the backend node pool would prevent automatic scaling entirely, leading to either over-provisioning (waste) or under-provisioning (capacity issues), and does not solve the disruption caused by scaling events. Option B is wrong because node taints and tolerations isolate Pods to specific nodes but do not prevent the autoscaler from terminating those nodes, so latency spikes would still occur during scale-down. Option C is wrong because resource requests only influence where Pods are scheduled and how many fit on a node; they do not protect Pods from being evicted when the autoscaler decides to scale down a node.

48

MCQmedium

A company uses Cloud Spanner for a global financial application. They need to ensure that a regional outage does not cause data loss. The application requires strong consistency and low latency reads and writes across multiple regions. Which configuration meets the reliability requirements?

A.Use a multi-region Spanner instance with read replicas in two other regions

B.Use a single-region Spanner instance and schedule backups to Cloud Storage

C.Use a multi-region Spanner instance with a primary region and two witness regions

D.Use a single-region Spanner instance with point-in-time recovery (PITR) enabled

AnswerC

Correct: provides synchronous replication and automatic failover.

Why this answer

Option C is correct because a multi-region Spanner instance with a primary region and two witness regions uses Google's synchronous replication across three regions, ensuring strong consistency and no data loss during a regional outage. Witness regions participate in the Paxos quorum without serving read traffic, guaranteeing that writes are committed in at least two regions before acknowledgment, which meets the requirement for zero data loss and low latency reads and writes.
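The quorum argument above can be illustrated with generic majority arithmetic. This is a simplified sketch that assumes one vote per region; real Spanner configurations place several voting replicas across regions, so the numbers here are illustrative only.

```python
def majority_quorum(voters: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority can still commit writes."""
    return voters - majority_quorum(voters)


# Simplifying assumption: three regions, one vote each.
print("votes needed to commit:", majority_quorum(3))
print("regions that can fail: ", tolerated_failures(3))
```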

Exam trap

Google Cloud often tests the misconception that read replicas or backups can prevent data loss during a regional outage, but in Spanner, only synchronous replication via a multi-region instance with a quorum of regions (including witness regions) guarantees zero data loss and strong consistency across regions.

How to eliminate wrong answers

Option A is wrong because read replicas in Spanner are not a supported configuration; Spanner uses multi-region instances with regional replicas or witness regions, and read replicas would not participate in the write quorum, thus failing to prevent data loss during a regional outage. Option B is wrong because a single-region instance with backups to Cloud Storage cannot provide strong consistency and low latency across multiple regions, and backups are asynchronous, risking data loss of recent writes during an outage. Option D is wrong because point-in-time recovery (PITR) only protects against accidental data deletion or corruption within a single region, not against a regional outage, and it does not provide multi-region availability or low latency reads and writes across regions.

49

MCQmedium

Your company runs a customer-facing API on Cloud Run with a concurrency setting of 80. The API calls a backend Cloud Function that performs a heavy computation (2–5 seconds). During peak hours, the API experiences increased latency and some requests time out after 60 seconds. Monitoring shows that the Cloud Run max instances is set to 100, and the Cloud Function max instances is set to 10. The timeout for Cloud Run is set to 300 seconds. The Cloud Function's timeout is set to 540 seconds. You need to reduce end-to-end latency and prevent timeouts while minimizing cost. Which action is most effective?

A.Increase Cloud Run max instances from 100 to 500

B.Increase Cloud Run request timeout from 300 to 600 seconds

C.Increase Cloud Function max instances from 10 to 100

D.Reduce Cloud Run concurrency from 80 to 10

AnswerC

Correct: removes backend capacity bottleneck.

Why this answer

Option C is correct because the bottleneck is the Cloud Function's low max instances setting (10), which causes requests to queue behind the 2–5 second computation. Raising the Cloud Function max instances lets more requests be processed concurrently, reducing latency and timeouts. Option A is wrong because increasing Cloud Run max instances alone does not help when Cloud Function capacity is the limit.

Option B is wrong because increasing the Cloud Run request timeout does not reduce latency; it only keeps the connection open longer. Option D is wrong because Cloud Run concurrency is separate from backend capacity; reducing it would require more Cloud Run instances and increase cost.
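The size of the backend bottleneck can be estimated with a short calculation. This sketch assumes each Cloud Function instance handles one request at a time, which is an assumption not stated in the question; the instance counts and the 2–5 second computation time come from the scenario.

```python
def max_throughput_rps(max_instances: int, requests_per_instance: int, seconds_per_request: float) -> float:
    """Upper bound on sustained requests per second for a pool of instances."""
    return max_instances * requests_per_instance / seconds_per_request


# Assumption: one in-flight request per function instance.
for instances in (10, 100):
    best = max_throughput_rps(instances, 1, 2)   # 2-second computations
    worst = max_throughput_rps(instances, 1, 5)  # 5-second computations
    print(f"{instances} instances: between {worst} and {best} req/s")
```

Under that assumption, ten instances sustain only a handful of requests per second, so anything beyond that queues until the caller times out.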

50

MCQmedium

A developer ran the above command to create a health check for a backend service. Which of the following should they do to resolve the error?

A.Change the request-path to a different value.

B.Delete the existing health check and recreate it.

C.Add the --global flag to the command.

D.Use --load-balancer-type internal to create a new health check with the same name.

E.Use a different name for the health check.

AnswerE

Using a different name resolves the conflict without disruption.

Why this answer

Option E is correct because the error indicates that a health check with the same name already exists. In Google Cloud, a resource name must be unique within its project and scope (global, or a given region), so the name cannot be reused there. By using a different name, the developer can create a new health check without conflicting with the existing one.

Exam trap

Google Cloud often tests the misconception that modifying parameters like request-path or load balancer type can resolve naming conflicts, when in fact the core issue is a duplicate name that must be changed.

How to eliminate wrong answers

Option A is wrong because changing the request-path does not resolve a naming conflict; it only alters the path that health check probes request. Option B is wrong because the existing health check may be referenced by a backend service, so deleting and recreating it risks disruption when a new name avoids the conflict entirely. Option C is wrong because the --global flag only selects the scope of the health check (global rather than regional); it does not address a duplicate name.

Option D is wrong because --load-balancer-type internal specifies the load balancer type, not the health check name; it does not address the duplicate name error.

51

MCQhard

Your company runs a data pipeline on Google Cloud using Cloud Dataflow for streaming processing from Pub/Sub to BigQuery. The pipeline writes to a BigQuery table partitioned by day. The data is used for real-time dashboards. Recently, a spike in traffic caused the Dataflow pipeline to fall behind, and the dashboard displayed stale data. You need to design the pipeline to handle traffic spikes without data loss or long delays. The pipeline must be cost-efficient and use defaults where possible. Which solution should you implement?

A.Enable autoscaling in the Dataflow pipeline and use Streaming Engine to handle larger throughput

B.Modify the pipeline to use a batch (non-streaming) approach, writing hourly batches from Pub/Sub to BigQuery

C.Create a Cloud Scheduler job that increases the number of Dataflow workers every 5 minutes based on Pub/Sub subscription backlog

D.Change the Dataflow worker machine type from n1-standard-4 to n1-highmem-8

AnswerA

Correct: autoscaling dynamically adjusts workers; Streaming Engine reduces checkpoint overhead.

Why this answer

Option A is correct because enabling autoscaling in Dataflow allows the pipeline to dynamically adjust the number of workers based on the processing backlog, while Streaming Engine offloads the shuffle and state storage to Google-managed resources, reducing the impact of traffic spikes. This combination ensures the pipeline can scale up quickly to handle increased throughput without data loss or long delays, and it remains cost-efficient by scaling down when demand decreases.

Exam trap

Google Cloud often tests the misconception that manual scaling (Option C) or static resource changes (Option D) are sufficient for handling spikes, when in fact Dataflow's built-in autoscaling and Streaming Engine are the designed, cost-efficient solutions for dynamic workloads.

How to eliminate wrong answers

Option B is wrong because switching to a batch approach introduces inherent latency (hourly batches) that would make the real-time dashboard stale, violating the requirement for minimal delays; it also does not handle spikes within the batch window. Option C is wrong because using Cloud Scheduler to manually adjust worker count every 5 minutes is reactive, not adaptive, and cannot respond quickly enough to sudden spikes; Dataflow's native autoscaling is designed to adjust more granularly and efficiently. Option D is wrong because simply changing the worker machine type to a larger instance (n1-highmem-8) does not address the need for dynamic scaling; it increases cost without guaranteeing sufficient capacity during spikes and does not leverage Dataflow's autoscaling capabilities.

52

MCQhard

A financial services company runs a stateful backend service on Google Kubernetes Engine (GKE) using StatefulSets with Persistent Volumes. They observe that after a node failure, the pod is rescheduled on a different node but the Persistent Volume cannot be attached because it is still "released" and not "available". What is the most likely cause and solution?

A.The PersistentVolume has reclaim policy "Retain"; manually delete and recreate the volume.

B.The PersistentVolume has reclaim policy "Recycle"; it is not supported in GKE.

C.The PersistentVolumeClaim was not created with the correct storage class; recreate with reclaim policy "Delete".

D.The PersistentVolumeClaim's access mode is ReadWriteOnce, which prevents attachment to a new node; change to ReadWriteMany.

E.The PersistentVolume has reclaim policy "Retain" and the previous pod's volume attachment is not cleared; use a StatefulSet with volumeClaimTemplates and reclaim policy "Delete".

AnswerA

Retain policy leaves PV in 'Released' state; manual intervention is needed.

Why this answer

Option A is correct because when a PersistentVolume (PV) has a reclaim policy of 'Retain', after the PersistentVolumeClaim (PVC) is deleted, the PV enters a 'Released' state and is not automatically recycled for reuse. The underlying storage resource (e.g., a Compute Engine persistent disk) still exists but the PV cannot be re-attached until an administrator manually deletes the PV and recreates it, or edits the PV to remove the claim reference. This explains why the pod rescheduled on a new node cannot attach the volume.

Exam trap

Google Cloud often tests the distinction between PV reclaim policies and the 'Released' vs 'Available' states, where candidates mistakenly think the issue is with the PVC's access mode or storage class rather than the PV's manual cleanup requirement.

How to eliminate wrong answers

Option B is wrong because the 'Recycle' reclaim policy is deprecated and not supported in GKE, but the scenario describes a 'Released' state, not a 'Recycle' issue. Option C is wrong because the storage class and reclaim policy are not the cause; the problem is the PV's 'Retain' policy, not the PVC's creation parameters. Option D is wrong because ReadWriteOnce allows attachment to a single node at a time, but after a node failure the PVC is unbound and can be re-attached to a new node; the issue is the PV's state, not the access mode.

Option E is wrong because while using volumeClaimTemplates with reclaim policy 'Delete' would avoid the problem, the existing PV has 'Retain' policy, and the solution is to manually handle the released PV, not to change the StatefulSet definition.
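The 'Retain' behaviour described above can be sketched as a small state model. This is an illustration only, not the Kubernetes controller: the class and the function names (`bind`, `release`, `clear_claim_ref`) are hypothetical, and the model covers just the phases mentioned in the explanation (Available, Bound, Released).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersistentVolume:
    """Toy model of a PV; field names are illustrative, not the real API schema."""
    name: str
    reclaim_policy: str               # "Retain" or "Delete"
    claim_ref: Optional[str] = None   # name of the bound PVC, if any
    phase: str = "Available"


def bind(pv: PersistentVolume, pvc_name: str) -> None:
    """Bind a claim; only an Available PV can be bound."""
    if pv.phase != "Available":
        raise ValueError(f"{pv.name} is {pv.phase}, not Available")
    pv.claim_ref = pvc_name
    pv.phase = "Bound"


def release(pv: PersistentVolume) -> Optional[PersistentVolume]:
    """Model deletion of the bound PVC."""
    if pv.reclaim_policy == "Delete":
        return None                   # PV (and its disk) are removed
    pv.phase = "Released"             # Retain: the stale claim_ref stays in place
    return pv


def clear_claim_ref(pv: PersistentVolume) -> None:
    """Manual administrator step: drop the stale claim reference."""
    pv.claim_ref = None
    pv.phase = "Available"
```

In this model a Released volume rejects a new `bind` until `clear_claim_ref` is called, which mirrors the manual intervention the explanation refers to.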

Practice this question →

53

MCQmedium

An organization wants to define an SLO for their API hosted on Cloud Endpoints. Which metric should they use as a Service Level Indicator (SLI) for availability?

A.Number of HTTP 5xx errors

B.Request latency at the 99th percentile

C.Ratio of HTTP 200 responses to total requests

D.CPU utilization of backend instances

AnswerC

This directly measures the availability of the API.

Why this answer

For an availability SLO, the SLI must measure the proportion of successful requests. In Cloud Endpoints, availability is defined as the ratio of successful (HTTP 200) responses to total requests, as this directly reflects whether the API is functioning correctly. Option C is correct because it captures the fraction of requests that completed without error, which is the standard definition of availability in service-level monitoring.

Exam trap

Google Cloud often tests the distinction between availability and performance metrics, so the trap here is that candidates confuse latency (a performance SLI) with availability, or they mistakenly think that counting only server-side errors (5xx) is sufficient for an availability SLI, ignoring that availability is a ratio of successful to total requests.

How to eliminate wrong answers

Option A is wrong because HTTP 5xx errors are only one component of unavailability; they do not account for other failure modes (e.g., timeouts, 4xx errors caused by infrastructure issues) and using just the count of 5xx errors would not produce a ratio suitable for an availability SLI. Option B is wrong because request latency at the 99th percentile measures performance, not availability; an API can be available but slow, and latency is used for a different SLO (e.g., responsiveness). Option D is wrong because CPU utilization of backend instances is an infrastructure metric that does not directly measure whether the API is serving requests successfully; high CPU may indicate performance issues but does not equate to availability failures.
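As a rough illustration of why a ratio, rather than a raw error count, is the useful SLI, the sketch below computes availability from two counters. The function names and the choice to treat an empty window as fully available are assumptions made for the example.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded in the measurement window."""
    if total_requests < 0 or successful_requests < 0:
        raise ValueError("request counts must be non-negative")
    if successful_requests > total_requests:
        raise ValueError("successful requests cannot exceed total requests")
    if total_requests == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return successful_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """Compare the measured SLI against the SLO target (e.g. 0.999)."""
    return sli >= slo_target
```

For example, 9,990 successful responses out of 10,000 give an SLI of 0.999, which meets a 99.9% target but not a 99.95% one. A bare count of 10 errors says nothing until it is divided by the total.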

Practice this question →

54

MCQmedium

The exhibit shows the output of a 'gcloud compute instances describe' command for an instance. What is the most likely impact on reliability if the host machine needs maintenance?

A.The instance will be terminated and then restarted, causing a brief downtime.

B.The instance will not be affected because automatic restart is enabled.

C.The instance will be backed up automatically before maintenance.

D.The instance will be live migrated to another host without interruption.

AnswerA

With TERMINATE, the instance is shut down and later restarted on a healthy host, resulting in downtime.

Why this answer

Option A is correct because when a host machine requires maintenance, Google Compute Engine instances that are not configured for live migration will be terminated and then restarted on another host. This behavior is determined by the 'onHostMaintenance' setting; if it is set to 'TERMINATE' (the required value for instances with GPUs and for preemptible VMs), the instance stops and restarts, causing brief downtime. The exhibit likely shows 'onHostMaintenance: TERMINATE' or the instance lacks live migration support, making termination and restart the expected outcome.

Exam trap

Google Cloud often tests the distinction between 'automatic restart' (which handles crash recovery) and 'onHostMaintenance' (which handles planned maintenance), causing candidates to mistakenly think automatic restart prevents downtime during maintenance.

How to eliminate wrong answers

Option B is wrong because 'automatic restart' is a separate setting that controls whether an instance restarts after a failure or crash, not how it behaves during host maintenance; it does not prevent downtime from maintenance events. Option C is wrong because Google Compute Engine does not automatically back up instances before host maintenance; backups must be configured separately via snapshots or images. Option D is wrong because live migration is only possible if the instance has 'onHostMaintenance' set to 'MIGRATE' and is neither a GPU-attached nor a preemptible VM; the exhibit likely shows a configuration that disables live migration, such as a GPU attached or the setting explicitly set to 'TERMINATE'.
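The distinction between the two scheduling settings can be sketched as a small decision function. It reads the `onHostMaintenance` and `automaticRestart` fields as they appear in the `scheduling` section of the describe output; the function name, the returned strings, and the defaults used when a field is missing are assumptions made for illustration.

```python
def maintenance_impact(scheduling: dict) -> str:
    """Summarise what a host maintenance event means for one instance."""
    # Assumption: a missing field falls back to MIGRATE / automatic restart on.
    policy = scheduling.get("onHostMaintenance", "MIGRATE")
    auto_restart = scheduling.get("automaticRestart", True)

    if policy == "MIGRATE":
        return "live migrated, no planned downtime"
    if policy == "TERMINATE":
        if auto_restart:
            return "terminated, then restarted (brief downtime)"
        return "terminated and left stopped"
    raise ValueError(f"unknown onHostMaintenance value: {policy!r}")
```

Note that `automaticRestart` only changes what happens after the instance has already been stopped; it never turns a TERMINATE policy into a migration.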

Practice this question →

55

MCQmedium

A company has a microservices architecture on GKE. One service is failing due to resource exhaustion. How can they proactively prevent this?

A.Use vertical pod autoscaling.

B.Set up autoscaling based on CPU utilization.

C.Configure a horizontal pod autoscaler with custom metrics.

D.Implement a cluster autoscaler.

AnswerC

Custom metrics can detect specific exhaustion signals.

Why this answer

Option C is correct because a horizontal pod autoscaler with custom metrics (e.g., memory, request queue depth) can detect resource exhaustion early and scale pods before failure. Option B is wrong because CPU-based autoscaling may not capture all exhaustion types. Option A is wrong because vertical pod autoscaling may not react fast enough.

Option D is wrong because cluster autoscaler scales nodes, not pods.

Practice this question →

56

Multi-Selectmedium

A company is designing a highly available application on GCE. Which TWO steps should they take to ensure reliability?

Select 2 answers

A.Use a global external HTTP(S) load balancer.

B.Use a managed instance group with autohealing.

C.Configure health checks that check the application endpoint.

D.Use persistent disks without snapshots.

E.Deploy instances in a single zone to avoid latency.

AnswersB, C

Automatically replaces unhealthy instances.

Why this answer

Option B is correct because a managed instance group (MIG) with autohealing automatically replaces unhealthy VM instances based on health check results, ensuring the application remains available even if individual instances fail. This is a core reliability pattern for stateless applications on Compute Engine, as it provides self-healing infrastructure without manual intervention.

Exam trap

Google Cloud often tests the distinction between load balancing (traffic distribution) and instance-level recovery (autohealing), causing candidates to incorrectly select a global load balancer as the sole reliability measure without recognizing the need for health-check-driven instance replacement.

Practice this question →

57

MCQmedium

A company runs a web application on Google Kubernetes Engine (GKE) with Cluster Autoscaler enabled. During a traffic spike, the application becomes slow and some requests time out. The cluster has sufficient CPU and memory headroom. What is the most likely cause and solution?

A.Increase the node pool's machine type to a larger size.

B.Enable Cluster Autoscaler to add more nodes.

C.Deploy the application in a regional cluster for higher availability.

D.Configure Horizontal Pod Autoscaler (HPA) based on CPU utilization or custom metrics.

AnswerD

HPA automatically scales pods based on load, resolving the timeout issue.

Why this answer

The correct answer is D because the cluster has sufficient CPU and memory headroom, indicating that the issue is not about cluster capacity but about pod-level scaling. The Horizontal Pod Autoscaler (HPA) automatically scales the number of pod replicas based on observed CPU utilization or custom metrics, which directly addresses the application slowdown and timeouts during traffic spikes by distributing the load across more pods.

Exam trap

Google Cloud often tests the distinction between node-level scaling (Cluster Autoscaler) and pod-level scaling (HPA), trapping candidates who assume that adding more nodes is the solution when the cluster already has headroom, whereas the real issue is insufficient pod replicas to handle the load.

How to eliminate wrong answers

Option A is wrong because increasing the node pool's machine type addresses node-level resource constraints, but the cluster already has sufficient CPU and memory headroom, so the bottleneck is at the pod level, not the node level. Option B is wrong because Cluster Autoscaler is already enabled and the cluster has headroom, so adding more nodes would not solve the problem of insufficient pod replicas to handle the traffic spike. Option C is wrong because deploying in a regional cluster improves availability and resilience to zone failures, but does not directly address the performance degradation and timeouts caused by insufficient application instances during a traffic spike.
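To make the pod-level scaling concrete, the sketch below applies the replica calculation documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The function name and the min/max clamping arguments are illustrative, and the sketch ignores the tolerance band and stabilization windows that the real controller applies.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Simplified HPA calculation, clamped to the configured replica bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas averaging 90% CPU against a 60% target, the calculation asks for 6 replicas. No new nodes are needed as long as the cluster has headroom to schedule them, which is the situation the question describes.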

Practice this question →

58

MCQmedium

After deploying the above configuration, the application is not receiving traffic from the Kubernetes Service. The Service is correctly configured to target port 8080. What is the most likely issue?

A.The initialDelaySeconds for readiness probe is too short; increase it.

B.The port name is not defined; add a name to the container port.

C.The readiness probe is using HTTP but the container may not be ready on that path; change to TCP.

D.The image pull policy is not set to Always; new pods may use stale image.

E.The liveness probe uses tcpSocket; it should be HTTPGet.

AnswerC

If /healthz is not served, the probe fails and the pod is never marked ready.

Why this answer

Option C is correct because the readiness probe is configured as an HTTP GET request, but the application container may not be serving traffic on the specified HTTP path at startup. If the application listens on a TCP port but does not respond to HTTP GET on that path, the readiness probe will fail, causing the Service to not route traffic to the Pod. Changing the readiness probe to a TCP socket check ensures the probe only verifies that the port is open, which is more reliable when the application does not expose an HTTP endpoint for health checks.

Exam trap

Google Cloud often tests the distinction between readiness and liveness probes, and the trap here is that candidates confuse a failing readiness probe with a liveness probe issue, or assume that any HTTP probe is better than TCP without considering the application's actual behavior.

How to eliminate wrong answers

Option A is wrong because the initialDelaySeconds for the readiness probe being too short would cause the probe to start too early, potentially failing temporarily, but the application would eventually become ready; the issue described is that the application never receives traffic, indicating a persistent probe failure, not a timing issue. Option B is wrong because the port name is optional for Service targeting; the Service correctly targets port 8080 by number, so a missing port name does not prevent traffic routing. Option D is wrong because the image pull policy not being set to Always does not affect traffic routing; it only controls when the image is pulled, and stale images would still run and serve traffic if the container starts.

Option E is wrong because the liveness probe using tcpSocket is valid and does not affect traffic routing; the liveness probe is for restarting the container, not for Service traffic distribution.

Practice this question →

59

MCQmedium

A company uses Cloud Storage for backups of on-premises databases. They want to ensure that data is protected against accidental deletion or modification by users. Which combination of features should they enable?

A.Object versioning and lifecycle management to delete old versions.

B.Bucket locking with retention policy and bucket-level IAM restrictions.

C.Bucket locking with retention policy and object holds.

D.Object versioning and bucket locking with retention policy.

E.Object versioning and IAM conditions restricting access to specific IP ranges.

AnswerD

Versioning preserves overwrites; retention policy prevents deletion.

Why this answer

Option D is correct because object versioning protects against accidental deletion or modification by preserving all versions of an object, while a bucket lock with a retention policy enforces a minimum retention period, preventing premature deletion or alteration. Together, they provide both recoverability and immutable compliance, which is essential for backup data integrity.

Exam trap

Google Cloud often tests the misconception that object holds alone provide sufficient immutability, but they are per-object and temporary, whereas a bucket lock with a retention policy provides a bucket-wide, locked-in immutable period that cannot be bypassed even by the bucket owner.

How to eliminate wrong answers

Option A is wrong because lifecycle management to delete old versions actively removes data, which contradicts the goal of protecting against accidental deletion. Option B is wrong because bucket-level IAM restrictions alone do not prevent a user with sufficient permissions from deleting or modifying objects; they lack the versioning-based recovery mechanism. Option C is wrong because object holds are temporary and must be manually applied per object, making them impractical for broad backup protection and not providing the automatic version history that versioning offers.

Option E is wrong because IAM conditions restricting access to specific IP ranges only control network-level access, not the ability to delete or modify objects once accessed, and they do not provide any data recovery or immutability features.

Practice this question →

60

MCQmedium

The exhibit shows a Cloud Storage bucket configuration. What does this configuration ensure?

A.Older versions of objects are automatically transferred to a different storage class.

B.Data is replicated to another region for disaster recovery.

C.Objects can only be permanently deleted after the retention period expires.

D.Objects older than 30 days will be automatically deleted.

AnswerC

A locked retention policy prevents permanent deletion before the retention period ends. Versioning retains noncurrent versions.

Why this answer

The exhibit shows a bucket configured with a retention policy. When a retention policy is set on a Cloud Storage bucket, objects cannot be deleted or overwritten until the retention period expires. This ensures that objects can only be permanently deleted after the retention period ends, which is exactly what option C describes.

Exam trap

The trap here is that candidates confuse retention policies with lifecycle management rules, mistakenly thinking retention policies automatically delete or transition objects, when in fact they only prevent deletion until the retention period expires.

How to eliminate wrong answers

Option A is wrong because retention policies do not automatically transfer objects to a different storage class; that is the function of lifecycle management rules, not retention policies. Option B is wrong because retention policies do not replicate data to another region; geographic redundancy comes from the bucket's location type (dual-region or multi-region), not from a retention policy. Option D is wrong because retention policies do not automatically delete objects after a period; they prevent deletion until the retention period expires, and automatic deletion is achieved via lifecycle rules with a Delete action.
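The difference between "may be deleted after the period" and "is deleted after the period" can be sketched with a simple age check. This models the rule only; the function names are hypothetical, and the 30-day period in the usage note is an example value, not something read from the exhibit.

```python
from datetime import datetime, timedelta, timezone


def earliest_delete_time(created: datetime, retention: timedelta) -> datetime:
    """First moment at which a delete request would be allowed."""
    return created + retention


def can_delete(created: datetime, retention: timedelta, now: datetime) -> bool:
    """A retention policy only blocks deletes; it never performs them."""
    return now >= earliest_delete_time(created, retention)
```

For an object created on 1 January with a 30-day retention period, `can_delete` is false on 15 January and true from 31 January onward. The object still exists on 31 January unless a user or a lifecycle rule actually deletes it.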

Practice this question →

61

MCQmedium

A company deploys a microservices application on Google Kubernetes Engine (GKE). Pods in one deployment are frequently OOMKilled. The team sets memory requests and limits, but pods still crash. What is the most likely remaining cause?

A.CPU requests are too low, causing throttling and eventual crash.

B.The node pool is too small, causing memory pressure on the node.

C.Memory limits are set higher than the node's allocatable memory.

D.The application has a memory leak that eventually exceeds the limit.

AnswerD

A memory leak causes continuous memory growth until the limit is hit, resulting in OOMKill.

Why this answer

Option D is correct because OOMKilled errors occur when a container exceeds its memory limit. Setting memory requests and limits prevents unbounded usage, but if the application has a memory leak, it will continue to consume memory until it hits the configured limit, causing the kernel's Out-Of-Memory (OOM) killer to terminate the pod. The fact that pods still crash after setting limits indicates the application itself is the root cause, not resource configuration.

Exam trap

The trap here is that candidates confuse OOMKilled (per-container limit) with node-pressure eviction (node-level memory), or assume that setting requests/limits automatically fixes all memory issues, ignoring application-level bugs like memory leaks.

How to eliminate wrong answers

Option A is wrong because CPU throttling does not cause OOMKilled; CPU limits throttle performance but do not trigger the OOM killer, which is specific to memory exhaustion. Option B is wrong because node-level memory pressure would cause pods to be evicted (not OOMKilled) or the node to become NotReady, but the question states pods are OOMKilled, which is a per-container limit violation, not a node-level issue. Option C is wrong because setting memory limits higher than the node's allocatable memory would prevent the pod from being scheduled (pending state), not cause it to run and then be OOMKilled.
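A back-of-the-envelope model shows why raising the limit does not cure a leak. The sketch assumes a constant leak rate, which real applications rarely have, and all names and numbers are illustrative.

```python
from typing import Optional


def seconds_until_oom(start_mib: float,
                      leak_mib_per_second: float,
                      limit_mib: float) -> Optional[float]:
    """Time until usage reaches the container memory limit, or None if it never does."""
    if start_mib >= limit_mib:
        return 0.0                      # already at or over the limit
    if leak_mib_per_second <= 0:
        return None                     # no growth, so the limit is never reached
    return (limit_mib - start_mib) / leak_mib_per_second
```

A container that starts at 200 MiB and leaks 2 MiB per second hits a 512 MiB limit after 156 seconds and a 1024 MiB limit after 412 seconds. Doubling the limit only postpones the OOMKill; fixing the leak is what removes it.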

Practice this question →

62

Multi-Selectmedium

Which TWO actions should you take to improve the reliability of a stateful application deployed on Compute Engine with regional persistent disks?

Select 2 answers

A.Use a regional persistent disk to replicate data across two zones.

B.Deploy the application across multiple zones in a managed instance group with autohealing.

C.Use preemptible VMs to reduce costs.

D.Place an HTTP(S) load balancer in front of the application.

E.Schedule regular snapshots of the persistent disk to Cloud Storage.

AnswersA, B

Regional persistent disks replicate data synchronously across zones, protecting against zone failure.

Why this answer

Option A is correct because regional persistent disks (RPDs) synchronously replicate data between two zones within a region, ensuring that if one zone fails, the data remains available in the other zone without data loss. This directly improves the reliability of a stateful application by providing a durable, zone-failure-tolerant storage layer that maintains data consistency across zones.

Exam trap

Google Cloud often tests the distinction between data durability (synchronous replication) and data backup (asynchronous snapshots), and candidates mistakenly choose scheduled snapshots (Option E) thinking they improve reliability, when in fact they only provide disaster recovery with a non-zero RPO.

Practice this question →

63

MCQmedium

A company is deploying a critical application on Compute Engine with an HTTP load balancer. They want to ensure that if an instance health check fails, traffic is automatically rerouted to healthy instances. Which configuration should they implement?

A.Use an HTTP(S) load balancer with a backend service configured with a health check and enable connection draining.

B.Use an internal load balancer with a backend service configured with a health check.

C.Use a network load balancer with a health check configured on the target pool.

D.Use an HTTP(S) load balancer with a backend service configured with a health check and enable session affinity.

E.Use a TCP proxy load balancer with a backend service configured with a health check.

AnswerA

HTTP(S) LB with health checks automatically reroutes traffic; connection draining adds graceful shutdown.

Why this answer

Option A is correct because an HTTP(S) load balancer with a backend service configured with a health check automatically monitors instance health and reroutes traffic away from unhealthy instances. Enabling connection draining ensures that in-flight requests to an unhealthy instance are given time to complete before the instance is removed from the load balancing pool, preventing disruption to active sessions.

Exam trap

Google Cloud often tests the distinction between connection draining and session affinity, where candidates mistakenly think session affinity is needed for failover, but in reality session affinity prevents rerouting and should be avoided for high-availability scenarios.

How to eliminate wrong answers

Option B is wrong because an internal load balancer is used for private traffic within a VPC and does not handle external HTTP(S) traffic, nor does it provide the automatic rerouting required for a public-facing critical application. Option C is wrong because a network load balancer (TCP/UDP) operates at layer 4 and does not support HTTP(S) health checks or connection draining; it forwards traffic based on IP and port, not application-level health. Option D is wrong because session affinity (sticky sessions) pins a client to a specific instance, which would prevent traffic from being rerouted away from an unhealthy instance, defeating the purpose of health-check-based failover.

Option E is wrong because a TCP proxy load balancer terminates TCP connections and forwards traffic at layer 4, lacking HTTP(S)-specific health checks and the ability to reroute based on application-level health status.

Practice this question →

64

MCQmedium

A company uses Cloud Interconnect to connect on-premises network to GCP. They want to ensure that if one interconnect link fails, traffic is automatically rerouted to another link. Which configuration should they implement?

A.Configure BGP sessions with equal-cost multi-path (ECMP) over multiple interconnect links.

B.Use a VPN as backup for the interconnect.

C.Use a single VLAN attachment with multiple interconnect links.

D.Create a second interconnect in a different metro and use BGP with MED.

E.Use multiple VLAN attachments with the same interconnect.

AnswerD

Two interconnects in different metro areas with BGP MED provide automatic failover.

Why this answer

Option D is correct because using a second interconnect in a different metro with BGP MED (Multi-Exit Discriminator) allows you to influence inbound traffic path selection and provides true geographic redundancy. If one interconnect link fails, BGP withdraws the routes, and traffic automatically fails over to the remaining interconnect via the alternate path, ensuring high availability without relying on a single point of failure.

Exam trap

The trap here is that candidates often confuse link-level redundancy (e.g., ECMP or multiple VLAN attachments on the same interconnect) with true geographic redundancy, failing to recognize that a single interconnect location is a single point of failure regardless of how many links or VLANs are used.

How to eliminate wrong answers

Option A is wrong because ECMP over multiple interconnect links requires all links to be active and does not provide automatic rerouting if a link fails; BGP would still need to withdraw routes, and ECMP alone does not handle failover. Option B is wrong because using a VPN as a backup introduces a different technology with lower bandwidth and higher latency, and it is not the recommended configuration for automatic rerouting over dedicated interconnect links. Option C is wrong because a single VLAN attachment cannot span multiple interconnect links; VLAN attachments are tied to a specific interconnect, so this configuration does not provide link-level redundancy.

Option E is wrong because multiple VLAN attachments on the same interconnect do not protect against the failure of that single interconnect; they only provide logical separation, not physical link redundancy.
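The failover behaviour in option D can be sketched as "prefer the lowest MED among routes that are still advertised". This is a heavily simplified model: real BGP best-path selection compares several attributes before MED, and the `Route` class, link names, and MED values below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    link: str        # label for the interconnect carrying this route
    med: int         # lower MED is preferred
    up: bool = True  # False models BGP withdrawing the route after a link failure


def best_route(routes: List[Route]) -> Optional[Route]:
    """Pick the preferred route among those still advertised."""
    candidates = [r for r in routes if r.up]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.med)
```

With a primary route at MED 100 and a secondary in another metro at MED 200, traffic uses the primary. Marking the primary as down leaves only the secondary, so the same selection rule moves traffic over without any manual change.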

Practice this question →

65

MCQeasy

Your company runs an e-commerce platform on Google Cloud. The application is deployed on Compute Engine instances in a managed instance group (MIG) with autoscaling based on CPU utilization. The database uses Cloud SQL for MySQL with a single instance. During a recent flash sale, traffic spiked and the application became slow, resulting in a poor user experience. After analyzing the incident, you discovered that the MIG scaled up but the Cloud SQL instance reached its maximum connections limit, causing some requests to fail. You need to recommend a solution to improve the reliability of the application for future traffic spikes. What should you do?

A.Increase the maximum connections setting on the Cloud SQL instance and also increase the instance's tier to handle more concurrent connections.

B.Migrate the database to Cloud Spanner to provide unlimited scalability and automatic sharding.

C.Implement a connection pooling library in the application code to reuse database connections and reduce the number of new connections.

D.Deploy the Cloud SQL Proxy on each Compute Engine instance to manage database connections more efficiently, and configure a connection pool size that matches the maximum connections of the Cloud SQL instance.

AnswerD

Option D reduces the number of open connections and efficiently distributes them.

Why this answer

Option D is correct because deploying the Cloud SQL Proxy on each Compute Engine instance provides a secure, efficient way to manage database connections. The proxy can be configured with a connection pool size that matches the Cloud SQL instance's maximum connections, preventing the application from exhausting the database's connection limit. This approach also reduces the overhead of establishing new connections and improves connection reuse, directly addressing the bottleneck during traffic spikes.

Exam trap

The trap here is that candidates often assume increasing the database tier or max_connections is the simplest fix, but the PCA exam tests the understanding that connection pooling with a proxy is a more scalable and cost-effective reliability pattern, especially when combined with autoscaling compute instances.

How to eliminate wrong answers

Option A is wrong because simply increasing the maximum connections and tier on Cloud SQL does not address the root cause of connection exhaustion; it only delays the problem and increases cost without improving connection management efficiency. Option B is wrong because migrating to Cloud Spanner is an over-engineered solution for a MySQL-based application; it introduces significant complexity, cost, and potential application rewrites, and is not necessary for handling connection limits. Option C is wrong because implementing a connection pooling library in the application code alone does not prevent the application from opening too many connections if the pool size is not properly configured; it also does not provide the secure, managed connection handling that Cloud SQL Proxy offers, and the application may still exceed the database's connection limit without a centralized proxy.
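One way to reason about pool sizing with an autoscaled MIG is to divide the database's connection limit across the largest number of instances the group can reach. The sketch below is an assumption about how such a budget might be computed, not a Cloud SQL or proxy setting; the function name, the `reserved` parameter, and the example numbers are all hypothetical.

```python
def per_instance_pool_size(db_max_connections: int,
                           max_instances: int,
                           reserved: int = 0) -> int:
    """Largest per-instance pool that keeps the fleet under the database limit.

    reserved: connections held back for admin or maintenance sessions (assumption).
    """
    if max_instances <= 0:
        raise ValueError("max_instances must be positive")
    usable = db_max_connections - reserved
    size = usable // max_instances
    if size < 1:
        raise ValueError("connection limit too low for this many instances")
    return size
```

With a limit of 500 connections, 20 reserved, and an autoscaling maximum of 20 instances, each instance gets a pool of 24. Even at full scale-out the fleet then opens at most 480 connections, so a traffic spike queues inside the pools instead of exhausting the database.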

Practice this question →

66

MCQmedium

After a data corruption incident, a company needs to restore their Cloud SQL for PostgreSQL instance from a backup. What is the correct procedure to minimize downtime?

A.Restore the backup directly to the existing Cloud SQL instance

B.Create a new instance from the backup, then rename and delete the old instance

C.Use point-in-time recovery to restore to a time before corruption

D.Export the backup to Cloud Storage and import into the existing instance

AnswerA

Cloud SQL supports restoring from backup to the same instance with minimal steps.

Why this answer

Restoring a backup directly to the existing Cloud SQL instance is the fastest method to minimize downtime because it overwrites the current data in-place without requiring DNS propagation, connection string changes, or reconfiguration of applications. Cloud SQL supports in-place restore from automated or on-demand backups, which typically completes within minutes for most instance sizes, as the operation leverages the underlying storage layer to apply the backup snapshot directly to the existing persistent disk.

Exam trap

Google Cloud often tests the misconception that creating a new instance and renaming it is the standard recovery procedure, but the trap here is that candidates overlook the additional downtime caused by DNS propagation and connection string updates, making the direct in-place restore the correct choice for minimizing downtime.

How to eliminate wrong answers

Option B is wrong because creating a new instance from the backup, then renaming and deleting the old instance introduces significant additional downtime due to the time required for provisioning a new instance, DNS propagation (which can take up to 5 minutes or more), and the need to update application connection strings or IP addresses. Option C is wrong because point-in-time recovery (PITR) is used for transactional log replay to restore to a specific timestamp, but it requires that write-ahead logs (WAL) are still available and is not the correct procedure for restoring from a backup after data corruption; PITR is typically slower and more complex than a direct backup restore. Option D is wrong because exporting a backup to Cloud Storage and then importing it into the existing instance is a multi-step, time-consuming process that involves exporting the database dump (e.g., using pg_dump), uploading to Cloud Storage, and then running an import operation (e.g., using psql or the Cloud SQL import feature), which can take hours for large databases and is not designed for minimizing downtime.

Practice this question →

67

Multi-Selecthard

A company uses Cloud CDN to accelerate content delivery. They notice that some users receive stale content even after purging the cache. Which THREE factors could cause this?

Select 3 answers

A.The content is compressed with gzip.

B.The purge request did not complete successfully.

C.The content was cached at multiple edge locations and not all were purged.

D.The CDN is configured with signed URLs.

E.The origin server returns a long Cache-Control: max-age header, causing the CDN to ignore the purge.

AnswersB, C, E

Failed purge operations leave stale cache intact.

Why this answer

Option B is correct because a purge request that does not complete successfully will leave cached content intact, causing users to receive stale data. Cloud CDN processes purge requests asynchronously, and if the request fails (e.g., due to network issues or invalid paths), the cache is not invalidated. This directly explains why stale content persists despite an attempted purge.

Exam trap

Google Cloud often tests the misconception that a purge is instantaneous and global, leading candidates to overlook that incomplete or failed purge requests can leave stale content at some edge locations.

Practice this question →

68

MCQhard

A financial services company is migrating a monolithic Java application to Google Kubernetes Engine (GKE) for improved scalability and reliability. The application serves real-time trading data and has strict latency requirements. Post-migration, the team observes frequent pod restarts due to OutOfMemory (OOM) errors, increased latency during peak trading hours, and occasional database connection timeouts. The current setup uses a single GKE cluster with a node pool of n1-standard-4 machines, a stateless application deployed as a Deployment with resource requests and limits set to 512 Mi memory and 1 CPU. The database is a Cloud SQL PostgreSQL instance with 2 vCPUs and 7.5 GB memory, and applications connect using a hardcoded connection string. The team wants to ensure reliable operation under load and during node maintenance events. Which course of action best addresses the reliability issues?

A.Adjust resource requests to 1 Gi memory and 2 CPU, set limits to 2 Gi and 4 CPU, create an HPA based on a custom metric (e.g., requests per second), enable cluster autoscaler, implement Cloud SQL connection pooling via Cloud SQL Auth Proxy with a max connection pool size, and configure PDB with maxUnavailable 1.

B.Enable GKE node auto-upgrade, configure Pod Disruption Budgets (PDB) with minAvailable 1, and set readiness probes to check application health.

C.Migrate the database to a StatefulSet in GKE with persistent volumes, increase node count to 10, and enable cluster autoscaler.

D.Increase memory limits to 2 Gi and CPU to 2, add Horizontal Pod Autoscaler (HPA) based on CPU utilization, and implement connection pooling using Cloud SQL Auth Proxy.

AnswerA

Correctly addresses all issues: resource tuning for OOM, custom metric HPA for load, cluster autoscaler for capacity, connection pooling for timeouts, and PDB for maintenance.

Why this answer

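Option A comprehensively addresses all of the issues. Higher memory requests and limits give the Java application enough headroom to stop the OOM kills, and the requests let the scheduler place pods on nodes with sufficient capacity. An HPA on a custom metric (e.g., requests per second) scales with actual load, and the cluster autoscaler adds node capacity when pods are pending. Connection pooling through the Cloud SQL Auth Proxy with a capped pool size prevents connection exhaustion and secures the connection, and a PDB keeps replicas available during node maintenance. Option B covers maintenance and health checking but ignores resource tuning, autoscaling, and connection pooling. Option C moves the database into a StatefulSet unnecessarily, giving up the managed Cloud SQL service, and does nothing about the OOM errors or connection timeouts. Option D raises the limits and adds connection pooling, but it scales only on CPU utilization and omits the cluster autoscaler and a PDB, so peak load and maintenance events remain unaddressed.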

69

MCQmedium

Your team manages a service with a 99.9% uptime SLO over a 30-day window. The error budget for this period is 43 minutes. In the first week, outages consumed 30 minutes of the budget. You are planning a new release. What should you do?

A.Reduce the SLO to 99.8% to increase the error budget.

B.Proceed with the release because the remaining budget is sufficient.

C.Delay the release and focus on improving reliability to rebuild the error budget.

D.Release the feature but only to a small percentage of users.

AnswerC

Conservative approach: wait until more error budget is earned (e.g., through flawless operation) before releasing.

Why this answer

With only 13 minutes of error budget remaining after the first week, proceeding with the release (Option B) risks exhausting the budget entirely from any unforeseen issues, violating the 99.9% SLO. Delaying the release (Option C) allows the team to focus on reliability improvements, such as implementing canary deployments, adding circuit breakers, or enhancing monitoring with tools like Prometheus and Grafana, to rebuild the error budget over the remaining 23 days. This aligns with the principle of using error budgets to balance innovation with reliability, as defined in Google's SRE practices.
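
The budget arithmetic behind this answer can be checked with a short sketch. The function name and the printed summary are illustrative assumptions, not part of any SRE tooling. The exact budget for 99.9% over 30 days is about 43.2 minutes, which the question rounds to 43.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


# Figures from the question: 99.9% SLO, 30-day window, 30 minutes already consumed.
budget = error_budget_minutes(99.9, 30)
consumed = 30.0
remaining = budget - consumed

print(f"budget:    {budget:.1f} min")
print(f"remaining: {remaining:.1f} min")
print(f"consumed:  {consumed / budget:.0%} of the budget")
```

Calling `error_budget_minutes(99.8, 30)` gives the doubled budget that Option A would produce by lowering the SLO.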

Exam trap

Google Cloud often tests the misconception that a canary release (Option D) is always safe, but the trap here is that it still consumes error budget and does not solve the underlying reliability deficit when the budget is already critically low.

How to eliminate wrong answers

Option A is wrong because reducing the SLO to 99.8% would increase the error budget to 86.4 minutes, but this is a reactive measure that lowers the reliability target rather than addressing the root cause of the outages; it also violates the principle of maintaining a consistent SLO commitment to customers. Option B is wrong because proceeding with the release with only 13 minutes of error budget left is reckless—any minor incident could exhaust the budget, leading to SLO violations and potential service credits or customer dissatisfaction, especially since the first week already consumed 70% of the budget. Option D is wrong because releasing to a small percentage of users (e.g., a canary deployment) is a valid risk mitigation strategy, but it does not address the fact that the error budget is nearly depleted; even a small-scale release could introduce bugs that consume the remaining budget, and the team should first stabilize the service before any new changes.

70

Multi-Selecteasy

You are deploying a stateless web application on Compute Engine. Which TWO actions improve availability? (Choose 2)

Select 2 answers

A.Use a regional managed instance group.

B.Enable Cloud CDN for the static content.

C.Purchase 1-year committed use contracts for the instances.

D.Enable automatic restart on the instance template.

E.Use preemptible VMs to reduce cost.

AnswersA, D

Regional MIGs spread instances across zones; if one zone fails, other zones continue serving.

Why this answer

A regional managed instance group (MIG) distributes instances across multiple zones within a region, ensuring that if one zone fails, traffic is automatically routed to healthy instances in other zones. This provides high availability by eliminating a single zone of failure, which is critical for stateless web applications that can serve requests from any instance.

Exam trap

Google Cloud often tests the distinction between cost optimization (committed use contracts, preemptible VMs) and availability improvements, leading candidates to mistakenly choose financial commitments or caching services as availability solutions.

71

MCQeasy

A company deploys a web application on Compute Engine behind an HTTP Load Balancer. They want to ensure only healthy instances receive traffic. What should they configure?

A.Configure the instance group autoscaling based on CPU utilization

B.Configure an HTTP health check with a custom request path that returns a 200 status

C.Configure a TCP health check on port 80

D.Configure an SSL health check to verify TLS handshake

AnswerB

HTTP health check validates the application layer by checking a specific endpoint.

Why this answer

Option B is correct because an HTTP health check with a custom request path that returns a 200 status allows the HTTP Load Balancer to verify that the web application is actually serving requests correctly. This ensures that only instances passing the application-level health check are considered healthy and receive traffic, preventing requests from being routed to instances that may be running but not serving the expected content.
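
As a rough illustration of what the load balancer probes, the sketch below serves a health endpoint with Python's standard library. The `/healthz` path, the `app_is_healthy` placeholder, and the port are assumptions chosen for the example. The request path configured on the health check would need to match whatever path the application actually exposes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTH_PATH = "/healthz"  # hypothetical path; must match the health check's request path


def app_is_healthy() -> bool:
    """Placeholder; a real service would verify its own dependencies here."""
    return True


def health_response(path: str, healthy: bool) -> tuple[int, bytes]:
    """Return (status, body) for a GET request to the given path."""
    if path != HEALTH_PATH:
        return 404, b"not found"
    # Any non-200 status marks the instance unhealthy for an HTTP health check.
    return (200, b"ok") if healthy else (503, b"unhealthy")


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path, app_is_healthy())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer((host, port), Handler)
```

A TCP check would pass as soon as the port accepts connections, whereas this endpoint only reports healthy when the application's own check succeeds.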

Exam trap

The trap here is that candidates often confuse health checks with autoscaling metrics, assuming that CPU-based autoscaling alone ensures traffic is only sent to healthy instances, when in fact health checks are a separate mechanism required for load balancer traffic routing.

How to eliminate wrong answers

Option A is wrong because autoscaling based on CPU utilization manages the number of instances but does not determine which instances are healthy for traffic routing; the load balancer still needs health checks to decide which instances to send traffic to. Option C is wrong because a TCP health check on port 80 only verifies that the TCP port is open, not that the web application is responding correctly; an instance could have a listening port but return errors or be unresponsive at the application layer. Option D is wrong because an SSL health check verifies the TLS handshake, which is unnecessary for HTTP traffic and does not validate the application's response; it is designed for HTTPS backends, not plain HTTP.

72

Multi-Selectmedium

A company wants to improve the reliability of their microservices architecture on Google Cloud. Which TWO practices should they implement? (Choose 2)

Select 2 answers

A.Design with a single point of failure for simplicity

B.Implement retry with exponential backoff

C.Use synchronous communication between all services

D.Implement circuit breaker pattern

E.Disable health checks to reduce latency

AnswersB, D

Retry with backoff handles transient failures without overwhelming the system.

Why this answer

B is correct because implementing retry with exponential backoff allows transient failures (e.g., network timeouts, temporary service unavailability) to be handled gracefully by automatically retrying the request after increasing delays, reducing load on the recovering service. This pattern is essential in microservices on Google Cloud to improve reliability without overwhelming downstream dependencies.
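
A minimal sketch of the pattern follows, assuming a hypothetical `TransientError` exception to mark retryable failures. The delay values and the use of full jitter (a random wait up to the exponential cap) are illustrative choices, not a prescribed configuration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds, backing off exponentially between tries."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The cap doubles with each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random time up to the cap so that many
            # clients do not retry in lockstep against a recovering service.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists only so the waits can be replaced in tests. In normal use the default `time.sleep` applies.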

Exam trap

Google Cloud often tests the misconception that synchronous communication is more reliable because it provides immediate feedback, but in distributed systems, asynchronous patterns and resilience mechanisms like retries and circuit breakers are actually critical for reliability.

73

Multi-Selecthard

A company runs a microservices-based application on Google Kubernetes Engine (GKE) with a Regional cluster. They want to improve reliability by implementing best practices for pod scheduling and resilience. Which TWO actions should they take? (Choose two.)

Select 2 answers

A.Set terminationGracePeriodSeconds to 0 for faster pod termination during scale-down

B.Enable cluster autoscaler to automatically add nodes when pods are pending

C.Define a PodDisruptionBudget for each deployment to limit the number of concurrent disruptions

D.Set resource requests equal to limits to ensure guaranteed QoS class

E.Configure pod anti-affinity to spread replicas across different zones

AnswersC, E

Correct: PDB ensures minimum availability during voluntary disruptions.

Why this answer

Option C is correct because a PodDisruptionBudget (PDB) limits the number of Pods of a replicated application that can be down simultaneously from voluntary disruptions, such as node maintenance or cluster upgrades. This ensures that a minimum number of replicas remain available, improving application reliability during planned events.
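
To make the budget concrete, the sketch below computes how many voluntary evictions a PDB would currently permit. It is a simplified model with integer values only. The function and parameter names are assumptions for illustration, and percentage-based budgets are not handled.

```python
def disruptions_allowed(healthy_pods, desired_replicas,
                        min_available=None, max_unavailable=None):
    """Voluntary evictions a PDB would permit right now (integer budgets only)."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        required_healthy = min_available
    else:
        required_healthy = desired_replicas - max_unavailable
    # Evictions are refused once the healthy count would drop below the requirement.
    return max(0, healthy_pods - required_healthy)


# Three replicas with maxUnavailable 1: one pod may be evicted at a time.
print(disruptions_allowed(3, 3, max_unavailable=1))
# If one replica is already unhealthy, further evictions are blocked.
print(disruptions_allowed(2, 3, max_unavailable=1))
```

The second call shows why a node drain waits: with one replica already down, the budget is exhausted until a replacement pod becomes healthy.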

Exam trap

Google Cloud often tests the distinction between voluntary disruptions (handled by PDB) and involuntary disruptions (e.g., node failure), and the trap here is that candidates confuse resource optimization (requests/limits) or scaling (cluster autoscaler) with resilience mechanisms like PDB and anti-affinity.

74

MCQhard

Refer to the exhibit. The HPA is configured to scale based on CPU, but it has not scaled up despite the CPU usage being above the target. Which is the most likely cause?

A.The cluster has autoscaling enabled, which may conflict with HPA.

B.The node pool oauthScopes lack the monitoring scope required for HPA to read metrics.

C.The HPA target is 80%, but the current CPU is 90% which should trigger scaling; the HPA may be broken.

D.The HPA min replicas is 3, so it cannot scale down, but it should scale up.

AnswerB

Without the monitoring scope, the HPA cannot retrieve CPU metrics from the nodes.

Why this answer

The node pool uses a service account with devstorage.read_only scope, which does not include the required permissions for the HPA to read metrics. The HPA needs the monitoring scope or a service account with monitoring roles to access CPU metrics.
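
For context, the sketch below applies the replica formula from the Kubernetes HPA documentation to the 80% target and 90% usage mentioned in the options. The exhibit is not reproduced here, so the replica counts (3 current, 3 minimum, 10 maximum) are assumptions. The point is that the HPA would request more replicas if it could read the metric, which is consistent with a metrics-access problem rather than a broken controller.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas, tolerance=0.1):
    """desired = ceil(current * currentMetric / targetMetric), clamped to min/max."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band (0.1 by default) the HPA leaves the count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Assumed counts: 3 replicas running, bounds of 3 and 10.
print(desired_replicas(3, 90, 80, min_replicas=3, max_replicas=10))
```

When the metric cannot be fetched at all, there is no ratio to compute, so the replica count stays where it is.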

75

MCQeasy

A company needs to deploy a stateless web application that can handle variable traffic. Which compute option is the most cost-effective and scales automatically?

A.App Engine standard environment with automatic scaling.

B.Google Kubernetes Engine (GKE) with cluster autoscaling.

C.Compute Engine with managed instance groups and autoscaling.

D.Compute Engine with preemptible VMs.

E.Cloud Run with CPU always allocated.

AnswerA

App Engine standard is serverless, cost-effective, and auto-scales.

Why this answer

App Engine standard environment with automatic scaling is the most cost-effective option here because it can scale to zero instances when there is no traffic and scales out automatically as requests arrive, which suits a stateless web application with variable traffic. It abstracts infrastructure management, bills only for the instance time actually used, and adds instances quickly without any VM provisioning.

Exam trap

Google Cloud often tests the misconception that managed instance groups or GKE are always the best for autoscaling, but the trap here is that for a stateless web app with variable traffic, serverless options like App Engine standard are more cost-effective because they scale to zero and require no infrastructure management.

How to eliminate wrong answers

Option B is wrong because GKE with cluster autoscaling requires managing a Kubernetes cluster, which adds operational overhead and cost for a simple stateless web app, and it does not scale to zero. Option C is wrong because Compute Engine with managed instance groups and autoscaling still requires managing VMs and has a minimum instance count, leading to higher costs and slower scaling compared to serverless options. Option D is wrong because preemptible VMs can be terminated at any time, making them unsuitable for a production web application that needs reliability and consistent availability. Option E is wrong because Cloud Run with CPU always allocated incurs costs even when the application is idle, whereas the default CPU-throttled mode is more cost-effective for variable traffic.
