CCNA Ensuring Successful Operation Of A Cloud Solution Questions

1

MCQ · easy

A load balancer is routing traffic to a VM where the application process has crashed, but the VM itself is still running. What prevents the load balancer from continuing to send traffic to this instance?

A. A VPC firewall rule blocking traffic to the VM

B. An HTTP health check configured on the backend service

C. A Cloud Armor security policy blocking the crashed instance's IP

D. The instance group autoscaling policy detecting the failure

Answer: B

HTTP health checks probe the application port. A crashed application fails the probe, causing the load balancer to stop directing traffic to that VM until it recovers.

Why this answer

The load balancer uses an HTTP health check to periodically probe the application on the VM. When the application process crashes, the probe fails (for example, the connection is refused, the request times out, or the response is not HTTP 200), and after the configured number of consecutive failures the load balancer stops routing new traffic to that unhealthy instance. This is the standard mechanism in Google Cloud for detecting application-level failures, as opposed to infrastructure-level failures.
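
The mechanism described above can be sketched as a small simulation. This is a simplified model, not Google Cloud code: the names `Backend`, `record_probe`, `routable`, and `UNHEALTHY_THRESHOLD` are hypothetical, and a real health check also applies a healthy threshold, a check interval, and a timeout.

```python
from dataclasses import dataclass

UNHEALTHY_THRESHOLD = 2  # assumed: consecutive failed probes before removal


@dataclass
class Backend:
    name: str
    healthy: bool = True
    consecutive_failures: int = 0

    def record_probe(self, status_code):
        """Update health from one probe; None models a timeout or refused connection."""
        if status_code == 200:
            # Simplification: a single success marks the backend healthy again.
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= UNHEALTHY_THRESHOLD:
                self.healthy = False


def routable(backends):
    """Return the names of backends that should still receive traffic."""
    return [b.name for b in backends if b.healthy]


vm_a, vm_b = Backend("vm-a"), Backend("vm-b")
# vm-a's application process has crashed: the VM is up, but probes fail.
for _ in range(UNHEALTHY_THRESHOLD):
    vm_a.record_probe(None)
    vm_b.record_probe(200)
```

In this sketch `vm-a` keeps running as a VM, yet it drops out of `routable(...)` purely because its probes fail, which is the distinction the question tests.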

Exam trap

The trap here is that candidates confuse infrastructure-level health (VM running) with application-level health (process responding), and assume autoscaling or firewall rules handle this, when in fact only a properly configured health check can detect a crashed application process.

How to eliminate wrong answers

Option A is wrong because a VPC firewall rule would block traffic at the network layer, but the question states the VM is still running and the application has crashed—firewall rules do not detect application crashes. Option C is wrong because Cloud Armor security policies filter traffic based on IP addresses, geographic regions, or layer 7 attributes, not based on the health of the application process on a VM. Option D is wrong because the instance group autoscaling policy reacts to overall load metrics (e.g., CPU utilization, requests per second) and may replace unhealthy instances, but it does not directly prevent the load balancer from sending traffic to a crashed instance—that is the health check's role.

2

MCQ · medium

A GKE node pool has auto-repair enabled. A node becomes unresponsive (not ready) for 10 minutes. What action does GKE's auto-repair feature take?

A. GKE alerts the team and waits for manual intervention

B. GKE drains the node, recreates it, and rejoins it to the cluster automatically

C. GKE terminates the node permanently and adds a replacement from the node pool's minimum size

D. GKE migrates all Pods to other nodes but leaves the unresponsive node running

Answer: B

GKE auto-repair drains the unhealthy node (evicting Pods while respecting PDBs), recreates it from the node pool configuration, and reconnects it to the cluster.

Why this answer

B is correct because GKE's auto-repair feature periodically performs health checks on nodes. When a node is in a 'NotReady' state for the default timeout of 10 minutes, GKE automatically initiates a repair: it drains the node (evicting pods gracefully), recreates it from the node pool's instance template, and rejoins it to the cluster. This ensures minimal disruption without requiring manual intervention.
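
The decision and the repair sequence described above could be modeled roughly as follows. This is an illustrative sketch only: `needs_repair`, `repair_steps`, and `NOT_READY_THRESHOLD` are hypothetical names, and the 10-minute value is taken from the explanation rather than read from any GKE API.

```python
from datetime import datetime, timedelta

NOT_READY_THRESHOLD = timedelta(minutes=10)  # threshold stated in the explanation


def needs_repair(status, not_ready_since, now):
    """Return True when a node has reported NotReady for at least the threshold."""
    if status != "NotReady" or not_ready_since is None:
        return False
    return now - not_ready_since >= NOT_READY_THRESHOLD


def repair_steps(node_name):
    """Ordered repair steps, as the explanation describes them."""
    return [
        f"drain {node_name}",     # evict Pods gracefully
        f"recreate {node_name}",  # rebuild from the node pool configuration
        f"rejoin {node_name}",    # register with the cluster again
    ]
```

A node that has been `NotReady` for only a few minutes returns `False` here, which mirrors why auto-repair does not react to brief interruptions.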

Exam trap

The trap here is that candidates often confuse auto-repair with auto-scaling or manual node recovery, assuming the node is permanently deleted or that the cluster waits for human action, when in fact GKE's auto-repair is a fully automated process that recreates the unhealthy node rather than permanently removing it.

How to eliminate wrong answers

Option A is wrong because auto-repair is designed to be fully automated; it does not wait for manual intervention—it triggers the repair process after the 10-minute 'NotReady' threshold. Option C is wrong because GKE does not permanently terminate the node; it recreates the node using the same instance template, and the node pool's minimum size is irrelevant to the repair action (the node is replaced, not permanently removed). Option D is wrong because GKE does not leave the unresponsive node running; it drains the node and then recreates it, fully replacing the unhealthy node.

3

MCQ · easy

A developer is using Cloud Build to deploy a containerized application to Cloud Run. The deployment fails with an error 'Permission denied' when pulling the container from Container Registry. What is the most likely cause?

A. The Cloud Build service account lacks the Cloud Run Invoker role.

B. The Cloud Run service agent does not have the Cloud Run Invoker role.

C. The Container Registry is not configured to work with Cloud Run.

D. The Cloud Build service account lacks the Storage Object Viewer role on the container registry bucket.

Answer: D

To pull images, Cloud Build needs read access to Container Registry (which uses Cloud Storage).

Why this answer

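Option D is correct because Container Registry stores images in a Cloud Storage bucket, so the Cloud Build service account needs the Storage Object Viewer role on that bucket to pull them. Option A is wrong because the Cloud Run Invoker role only permits invoking a deployed service; it grants no access to images. Option B is wrong for the same reason: Cloud Run Invoker governs who may call a service, not who may read the registry.

Option C is wrong because Container Registry needs no special configuration to work with Cloud Run; a 'Permission denied' error points to missing IAM permissions.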

4

MCQ · medium

A company wants to monitor the CPU utilization of their Compute Engine instances and receive an alert if utilization exceeds 80% for 5 minutes. Which services should they combine?

A. Cloud Functions and Cloud Tasks.

B. Cloud Audit Logs and Cloud Storage.

C. Cloud Monitoring and Cloud Pub/Sub.

D. Cloud Logging and Cloud Functions.

Answer: C

Cloud Monitoring collects metrics and sends alerts; Pub/Sub can be used for notifications.

Why this answer

Cloud Monitoring collects CPU utilization metrics from Compute Engine instances and can evaluate them against a threshold-based alerting policy. When the condition (CPU > 80% for 5 minutes) is met, the alert fires and sends a notification to a Cloud Pub/Sub topic, which can then trigger downstream actions such as sending emails or invoking serverless functions. This combination provides the metric ingestion, alert evaluation, and event-driven notification pipeline required for the use case.
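
As a rough illustration of the alerting policy described above, the following builds a policy document shaped like the Cloud Monitoring `AlertPolicy` resource. The display names, `PROJECT_ID`, and `CHANNEL_ID` are placeholders, and the aggregation settings are an assumption; treat it as a sketch to adapt, not a verified production policy.

```python
import json

# Field names follow the Cloud Monitoring AlertPolicy resource; display names
# and the notification channel path are hypothetical placeholders.
alert_policy = {
    "displayName": "High CPU on Compute Engine",
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "CPU utilization above 80% for 5 minutes",
            "conditionThreshold": {
                "filter": (
                    'metric.type="compute.googleapis.com/instance/cpu/utilization" '
                    'AND resource.type="gce_instance"'
                ),
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.8,  # utilization is a fraction of 1.0
                "duration": "300s",     # condition must hold for 5 minutes
                "aggregations": [
                    # Assumed alignment: one averaged point per minute.
                    {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN"}
                ],
            },
        }
    ],
    # Assumed: a Pub/Sub notification channel was created beforehand.
    "notificationChannels": [
        "projects/PROJECT_ID/notificationChannels/CHANNEL_ID"
    ],
}

policy_json = json.dumps(alert_policy, indent=2)
```

The `duration` field is what expresses "for 5 minutes": a brief spike above the threshold does not fire the alert.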

Exam trap

Google Cloud often tests the distinction between logging (Cloud Logging) and monitoring (Cloud Monitoring) — the trap here is that candidates confuse log-based metrics with native system metrics, assuming Cloud Logging can evaluate CPU thresholds when it can only parse log entries, not numeric time-series data.

How to eliminate wrong answers

Option A is wrong because Cloud Functions and Cloud Tasks are serverless compute and task orchestration services, not designed for metric collection or threshold-based alerting; they lack native monitoring of CPU utilization. Option B is wrong because Cloud Audit Logs record administrative actions and access events, not system metrics like CPU utilization, and Cloud Storage is an object store with no alerting capability for real-time metrics. Option D is wrong because Cloud Logging ingests log data, not time-series metrics, and Cloud Functions alone cannot evaluate metric thresholds over a sliding window; the alerting logic must be handled by Cloud Monitoring's alerting policies.

5

MCQ · easy

A DevOps team notices that a Compute Engine instance running a critical application has been terminated unexpectedly. The team wants to ensure the instance restarts automatically if it stops. Which configuration should they use?

A. Configure a startup script that checks for termination and restarts the instance.

B. Set the 'On host maintenance' policy to 'Migrate VM instance'.

C. Enable the 'Automatic restart' flag on the instance template.

D. Create a firewall rule to allow health check traffic from the load balancer.

Answer: C

Enabling automatic restart causes Compute Engine to restart the instance if it terminates for a non-user-initiated reason.

Why this answer

Option C is correct because enabling the 'Automatic restart' flag on the instance template ensures that Compute Engine automatically restarts the VM if it terminates due to a non-user-initiated failure (e.g., hardware failure, system crash). This is the native mechanism for automatic recovery without requiring external scripts or manual intervention.

Exam trap

Google Cloud often tests the distinction between 'Automatic restart' (for infrastructure failures) and managed instance group autohealing (for application-level health), leading candidates to confuse the two or incorrectly choose a startup script as a restart mechanism.

How to eliminate wrong answers

Option A is wrong because a startup script runs only when the instance boots, but it cannot detect termination events or trigger a restart; it would require an external monitoring system to restart the VM, which is not a built-in Compute Engine feature. Option B is wrong because the 'On host maintenance' policy (Migrate VM instance) controls behavior during host maintenance events (e.g., live migration), not automatic restart after unexpected termination. Option D is wrong because firewall rules for health check traffic are used by load balancers to determine instance health, but they do not cause an instance to restart; they only allow or deny traffic.

6

MCQ · easy

A Compute Engine VM's boot disk is nearly full and the application is failing. You want to snapshot the disk first (for safety), then resize it online. What is the correct sequence of gcloud commands?

A. Stop the VM, resize the disk, take a snapshot, restart the VM.

B. Snapshot the disk, resize the disk with `gcloud compute disks resize`, then grow the filesystem within the VM.

C. Resize the disk with `gcloud compute instances set-disk-auto-delete` to automatically expand the disk.

D. Create a new larger disk, attach it as a secondary disk, and move data using rsync.

Answer: B

Snapshot first for safety → resize disk (online, no stop needed) → resize filesystem inside the VM with resize2fs. This is the correct, safe sequence.

Why this answer

Option B is correct because you snapshot the disk first to keep a recovery point before making changes, then resize the disk using `gcloud compute disks resize` (which works while the VM is running), and finally grow the filesystem inside the VM to use the additional space. This sequence avoids downtime and preserves a recovery point.
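
The ordering can be made concrete with a small helper that only assembles the commands and executes nothing. The disk name, zone, and snapshot name in the usage are hypothetical, and the in-VM steps assume an ext4 root filesystem on `/dev/sda1`; other filesystems or device layouts need different commands.

```python
def resize_plan(disk, zone, new_size_gb, snapshot_name,
                device="/dev/sda", partition=1):
    """Return the ordered commands for a snapshot-then-resize; nothing is run."""
    return [
        # 1. Recovery point before any change.
        f"gcloud compute disks snapshot {disk} --zone={zone} "
        f"--snapshot-names={snapshot_name}",
        # 2. Grow the persistent disk while the VM keeps running.
        f"gcloud compute disks resize {disk} --zone={zone} --size={new_size_gb}GB",
        # 3. Inside the VM: grow the partition, then the ext4 filesystem.
        f"sudo growpart {device} {partition}",
        f"sudo resize2fs {device}{partition}",
    ]


plan = resize_plan("app-boot", "us-central1-a", 200, "app-boot-pre-resize")
```

Printing `plan` line by line gives a checklist in the safe order: snapshot, disk resize, then partition and filesystem growth.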

Exam trap

The trap here is that candidates assume a VM must be stopped before resizing a boot disk, but Google Cloud supports live resize for most disk types, making the snapshot-then-resize-then-grow sequence the correct online approach.

How to eliminate wrong answers

Option A is wrong because stopping the VM is unnecessary for a live resize and introduces downtime; also, taking the snapshot after resizing would capture the resized disk, not the original state for safety. Option C is wrong because `gcloud compute instances set-disk-auto-delete` controls whether a disk is deleted when the instance is deleted, not disk resizing or expansion. Option D is wrong because creating a new disk and using rsync is a valid migration approach but is not the correct sequence for resizing an existing boot disk online as specified in the question.

7

MCQ · medium

Your application is deployed on GKE and experiencing increased latency. You suspect a memory leak causing the JVM to run frequent garbage collection cycles. Cloud Monitoring shows high memory usage but you need to understand the garbage collection behavior over time. Which GCP tool provides JVM-level profiling including memory allocation data?

A. Cloud Trace

B. Cloud Profiler with heap profiling enabled for the JVM application.

C. Cloud Monitoring with JVM MBeans metrics exported via the Ops Agent.

D. Error Reporting filtered for OutOfMemoryError exceptions.

Answer: B

Cloud Profiler captures CPU and memory profiles from production JVMs with minimal overhead (<1% CPU impact). Heap profiling shows memory allocation patterns and identifies the source of leaks.

Why this answer

Cloud Profiler with heap profiling enabled captures JVM-level memory allocation data over time and attributes heap usage to the methods that allocate it, so you can locate the leak that is driving the frequent garbage collection cycles. Unlike generic memory monitoring, which reports only how much memory is in use, it shows where in the code that memory is allocated.

Exam trap

Google Cloud often tests the distinction between metric-based monitoring (Cloud Monitoring with MBeans) and profiling (Cloud Profiler), where candidates mistakenly choose Cloud Monitoring because it shows memory usage, but it lacks the allocation-level detail needed to diagnose garbage collection behavior.

How to eliminate wrong answers

Option A is wrong because Cloud Trace is a distributed tracing tool for request latency analysis, not JVM memory profiling; it cannot show garbage collection or heap allocation data. Option C is wrong because Cloud Monitoring with JVM MBeans via the Ops Agent provides metric-based memory usage (e.g., heap usage) but lacks the detailed allocation profiling and GC cycle breakdown that Cloud Profiler's heap profiling offers. Option D is wrong because Error Reporting only aggregates application errors like OutOfMemoryError; it does not provide proactive profiling of memory allocation or garbage collection behavior over time.

8

MCQ · medium

A DevOps team deploys a MIG (Managed Instance Group) with autohealing configured. The health check probes `/health` on port 8080 with a 30-second initial delay. After deployment, new VMs are failing the health check and being immediately recreated, causing a restart loop. What is the most likely cause?

A. The health check HTTP path `/health` doesn't exist; the application uses `/healthz`

B. The initial delay is too short; the application hasn't finished starting before the health check probes begin

C. Autohealing is incompatible with autoscaling; they cannot be used together

D. The MIG does not support HTTP health checks; TCP checks must be used instead

Answer: B

A 30-second initial delay may not be enough for slow-starting applications. Increasing `initialDelaySec` gives the app time to start before health checks begin, breaking the restart loop.

Why this answer

Option B is correct because the 30-second initial delay is too short for the application to complete its startup sequence. When the health check begins probing before the application is ready, it immediately fails, causing the MIG autohealing mechanism to treat the VM as unhealthy and recreate it, leading to a restart loop. The initial delay must be set to a value that exceeds the application's typical startup time.
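
The restart loop can be illustrated with a toy simulation. It assumes autohealing recreates a VM as soon as the first evaluation after the initial delay fails, which ignores the unhealthy threshold of a real health check; `first_probe_passes` and `simulate` are hypothetical helpers, not MIG APIs.

```python
def first_probe_passes(startup_seconds, initial_delay_sec):
    """True if the app is already serving when the initial delay expires."""
    return startup_seconds <= initial_delay_sec


def simulate(startup_seconds, initial_delay_sec, max_attempts=5):
    """Count how many times a VM is recreated before one survives (capped)."""
    recreations = 0
    for _ in range(max_attempts):
        if first_probe_passes(startup_seconds, initial_delay_sec):
            return recreations
        # VM judged unhealthy and recreated; startup begins again from zero.
        recreations += 1
    return recreations


loop_case = simulate(startup_seconds=90, initial_delay_sec=30)
fixed_case = simulate(startup_seconds=90, initial_delay_sec=120)
```

With a 90-second startup and a 30-second delay every attempt is recreated until the cap is hit, while a 120-second delay lets the first VM survive.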

Exam trap

Google Cloud often tests the distinction between health check path errors (which cause persistent failure) and initial delay misconfiguration (which causes a restart loop), trapping candidates who focus on the path mismatch rather than the timing issue.

How to eliminate wrong answers

Option A is less likely because the scenario gives no evidence that the application serves a different path, while it does call out the short 30-second initial delay; a wrong path would keep failing no matter how long the application had been running. Option C is wrong because autohealing and autoscaling are fully compatible in Google Cloud MIGs; they serve different purposes (health-based repair vs. load-based scaling) and can be used together without conflict. Option D is wrong because MIGs fully support HTTP health checks; TCP checks are an alternative but not a requirement, and the issue described is unrelated to the health check protocol.

9

MCQ · medium

You need to create a dashboard in Cloud Monitoring that shows: (1) Cloud Run request count per second, (2) Cloud Run p99 latency, (3) GKE pod CPU utilization, and (4) Cloud SQL query duration, all on a single screen. Which Cloud Monitoring feature enables this multi-service overview?

A. Create four separate alerting policies and pin them to a shared alerting page.

B. Create a Cloud Monitoring custom dashboard with chart widgets for each metric across the different services.

C. Use BigQuery to query the metrics export and build a Looker Studio dashboard.

D. Use Cloud Logging to create a log-based dashboard with all four metrics.

Answer: B

Custom dashboards support heterogeneous metric widgets from any GCP service. Each widget is independently configured, creating a unified operational view across Cloud Run, GKE, and Cloud SQL.

Why this answer

Option B is correct because Cloud Monitoring custom dashboards allow you to combine chart widgets from multiple monitored services (Cloud Run, GKE, Cloud SQL) into a single screen. This feature supports heterogeneous metric queries using the Monitoring Query Language (MQL) or metric selectors, enabling a unified view without needing separate tools or exports.

Exam trap

Google Cloud often tests the distinction between monitoring (dashboards) and alerting (policies), and the trap here is assuming that alerting policies can serve as a dashboard or that logging tools can natively display numeric metrics without additional configuration.

How to eliminate wrong answers

Option A is wrong because alerting policies are designed for threshold-based notifications, not for displaying real-time metric data on a dashboard; pinning alerts to a shared page does not create a visual dashboard with time-series charts. Option C is wrong because while BigQuery metrics export can feed Looker Studio, this adds unnecessary complexity and latency, and is not the native Cloud Monitoring feature for a single-screen overview. Option D is wrong because Cloud Logging is for log data, not numeric metrics like request count or latency; log-based dashboards cannot natively chart metric time-series such as p99 latency or CPU utilization.

10

MCQ · medium

A billing report shows a Compute Engine VM has been running unused for 3 months. The team wants to stop it to save costs but needs the VM's disk data preserved for potential future use. What should they do?

A. Delete the VM to avoid all compute charges

B. Stop (shut down) the VM; compute charges stop but disk storage charges continue

C. Snapshot the VM disk and delete the VM

D. Set the VM to a smaller machine type to reduce costs

Answer: B

Stopping a VM halts compute billing immediately while preserving all disk data. The team only pays for persistent disk storage until the VM is restarted or the disks are deleted.

Why this answer

Option B is correct because stopping (shutting down) a Compute Engine VM immediately halts all compute charges (vCPU, memory, GPU) while preserving the persistent disk and its data. The disk continues to incur storage costs, which is acceptable since the team wants to retain the data for potential future use. This is the most cost-effective approach that meets the requirement of preserving disk data without paying for idle compute resources.

Exam trap

Google Cloud often tests the misconception that stopping a VM eliminates all costs, but the trap here is that persistent disk storage charges continue even when the VM is stopped, which candidates may overlook when focusing only on compute savings.

How to eliminate wrong answers

Option A is wrong because deleting the VM also deletes its boot disk by default (and any attached disk set to auto-delete), which would destroy the disk data unless a snapshot or disk backup was taken beforehand. Option C is wrong because while snapshotting the disk and deleting the VM does preserve the data, it introduces unnecessary complexity and additional snapshot storage costs; stopping the VM is simpler and directly meets the goal without extra steps. Option D is wrong because resizing to a smaller machine type reduces but does not eliminate compute charges, and the VM would still be running unused, wasting resources; the goal is to stop compute charges entirely.

11

MCQ · hard

A managed instance group (MIG) is configured with autohealing using a health check. During a rolling update, several VMs become unhealthy before the new application version starts responding to health checks. The MIG deletes and recreates these VMs repeatedly, causing a deployment loop. How should you fix this?

A. Disable autohealing during rolling updates by removing the health check.

B. Increase the `initialDelaySec` in the autohealing policy to give VMs time to start before health checks are evaluated.

C. Switch from rolling update to canary update to reduce the number of affected VMs.

D. Reduce the health check interval and timeout to detect unhealthy VMs faster.

Answer: B

initialDelaySec defines how long the MIG waits after VM creation before starting health check evaluation. Setting it longer than startup time prevents deletion of healthy VMs that are still initializing.

Why this answer

Option B is correct because increasing `initialDelaySec` in the autohealing policy gives the new application version sufficient time to start and become healthy before the health check begins evaluating the VM. This prevents the MIG from prematurely marking VMs as unhealthy during the rolling update, breaking the deployment loop where VMs are repeatedly deleted and recreated.
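
One way to pick a value is to derive `initialDelaySec` from a measured startup time plus a safety margin. The sketch below builds an `autoHealingPolicies` fragment in the shape used by the instance group manager resource; the helper name, the 1.5x margin, the health check path, and the 3600-second cap are assumptions to verify against current documentation.

```python
import json

MAX_INITIAL_DELAY_SEC = 3600  # assumed upper bound; confirm in the API reference


def autohealing_policy(health_check_url, startup_seconds, margin=1.5):
    """Build an autoHealingPolicies fragment with a delay above startup time."""
    initial_delay = min(int(startup_seconds * margin), MAX_INITIAL_DELAY_SEC)
    return {
        "autoHealingPolicies": [
            {
                "healthCheck": health_check_url,
                "initialDelaySec": initial_delay,
            }
        ]
    }


# Hypothetical health check path and a measured 120-second startup.
policy = autohealing_policy("projects/p/global/healthChecks/app-hc", 120)
policy_json = json.dumps(policy, indent=2)
```

Basing the delay on observed startup time keeps genuinely slow starts from being treated as failures, while still letting autohealing act on VMs that never become healthy.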

Exam trap

The trap here is that candidates often confuse autohealing health checks with load balancer health checks, assuming that reducing intervals or timeouts will speed up recovery, when in fact it exacerbates the deployment loop by triggering autohealing before the new version is ready.

How to eliminate wrong answers

Option A is wrong because disabling autohealing entirely removes the ability to recover from genuine failures during the update, leaving the MIG vulnerable to stuck deployments without automatic recovery. Option C is wrong because switching to a canary update does not address the root cause—the health check timing—and can still result in a deployment loop if the new version is slow to start. Option D is wrong because reducing the health check interval and timeout would make the problem worse by evaluating health more aggressively, increasing the likelihood of premature unhealthy detection and loop amplification.

12

Multi-Select · hard

Which THREE configurations are required to allow a Compute Engine instance in VPC A (without external IP) to send emails through a third-party SMTP server on the internet? (Choose three.)

Select 3 answers

A. A Cloud NAT gateway attached to VPC A.

B. A route for the default internet gateway (0.0.0.0/0).

C. A firewall rule allowing egress traffic to the SMTP server's IP and port.

D. A VPC firewall rule allowing ingress from the SMTP server.

E. An external IP address on the instance.

Answers: A, B, C

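Cloud NAT, a default internet gateway route, and an egress firewall rule together give an instance without an external IP outbound internet access.

Why this answer

Option A is correct because a Cloud NAT gateway lets instances without external IP addresses initiate outbound connections to the internet by translating their private IPs to a public IP. Option B is correct because Cloud NAT only handles traffic that follows a route whose next hop is the default internet gateway, so the 0.0.0.0/0 route must exist. Option C is correct because egress to the SMTP server's IP and port must be permitted; if any rule denies outbound traffic, an explicit egress allow rule is needed. Option D is wrong because VPC firewall rules are stateful, so replies to connections the instance initiates are allowed without an ingress rule. Option E is wrong because Cloud NAT exists precisely so that the instance does not need an external IP.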

Exam trap

Google Cloud often tests the misconception that an instance without an external IP cannot reach the internet at all, but Cloud NAT provides outbound-only connectivity without a public IP on the instance, and candidates may incorrectly think ingress rules are needed for outbound traffic.

13

MCQ · easy

A company is running a batch processing job on Compute Engine every night. The job usually completes in 2 hours, but recently it has been taking over 4 hours. The CPU utilization on the VM is consistently below 20%. What is the most likely cause?

A. The VM is using a shared-core machine type.

B. The VM's machine type is too small.

C. The VM is running out of memory and swapping to disk.

D. The VM's persistent disk is in a different zone.

Answer: C

Swapping causes high disk I/O and slow performance with low CPU.

Why this answer

The correct answer is C. When CPU utilization is below 20% but job execution time has doubled, the bottleneck is likely I/O, not compute. Swapping to disk occurs when the VM runs out of memory, causing the kernel to page memory to the persistent disk, which is orders of magnitude slower than RAM. This I/O wait directly increases job duration without raising CPU utilization.

Exam trap

Google Cloud often tests the misconception that low CPU utilization always means the machine is over-provisioned, but the trap here is that I/O-bound workloads (like memory swapping) can cause severe performance degradation while CPU remains idle, leading candidates to incorrectly choose a machine type or disk zone issue.

How to eliminate wrong answers

Option A is wrong because shared-core machine types (e.g., e2-micro) can cause CPU throttling under sustained load, but the symptom would be high CPU credit exhaustion and visible CPU throttling, not consistently low CPU utilization. Option B is wrong because if the machine type were too small, CPU utilization would be high (near 100%) as the job struggles to complete, not below 20%. Option D is wrong because a persistent disk in a different zone than the VM is not supported; Compute Engine requires the disk to be in the same zone as the VM, so this configuration would cause an immediate launch failure, not a gradual performance degradation.

14

MCQ · easy

A Cloud Run service handles payment processing. A monitoring alert shows the service is experiencing 3-second P99 latency, up from its normal 200ms. The team wants to find the slowest individual requests in the last hour. Which tool provides per-request latency data?

A. Cloud Monitoring: check the request_latencies metric distribution

B. Cloud Trace: sort traces by latency in the last hour

C. Cloud Logging: filter for requests with duration > 3s

D. Cloud Profiler: view the slowest functions in the last hour

Answer: B

Cloud Trace records individual request traces with full timing breakdown. Sorting by latency in the Trace list immediately surfaces the slowest requests.

Why this answer

Cloud Trace is designed to capture end-to-end latency for individual requests, allowing you to sort and identify the slowest requests in a specific time range. The P99 latency increase indicates a tail-latency problem, and Trace provides per-request granularity to pinpoint the exact slow requests. This makes it the correct tool for finding the slowest individual requests in the last hour.
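
The difference between an aggregated percentile and per-request data can be shown with a few made-up trace records. The trace IDs and latencies below are invented for illustration, and `percentile` uses the nearest-rank method, which is an assumption rather than how Cloud Monitoring computes its distributions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def slowest(traces, n):
    """Return the n traces with the highest latency, slowest first."""
    return sorted(traces, key=lambda t: t["latency_ms"], reverse=True)[:n]


# Hypothetical per-request records, as a tracing tool might list them.
traces = [
    {"trace_id": "t1", "latency_ms": 180},
    {"trace_id": "t2", "latency_ms": 3100},
    {"trace_id": "t3", "latency_ms": 210},
    {"trace_id": "t4", "latency_ms": 2950},
    {"trace_id": "t5", "latency_ms": 195},
]

p99 = percentile([t["latency_ms"] for t in traces], 99)
top_two = [t["trace_id"] for t in slowest(traces, 2)]
```

The single `p99` number says that something is slow; only the sorted per-request list says which requests to open and inspect.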

Exam trap

Google Cloud often tests the distinction between aggregated metrics (Cloud Monitoring) and per-request tracing (Cloud Trace), trapping candidates who assume a metric distribution can identify individual slow requests.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring's request_latencies metric is a distribution (e.g., histogram) that shows aggregated percentiles like P99, not individual request latencies. Option C is wrong because, although Cloud Run request logs record each request's latency in the `httpRequest.latency` field, a filter for requests over 3 seconds only returns matching log entries; it does not rank them by latency or show where the time was spent within each request. Option D is wrong because Cloud Profiler samples function call stacks and CPU/memory usage, not per-request latency data; it identifies slow functions, not slow individual requests.

15

MCQ · medium

A Go service is consuming significantly more CPU than expected. The team suspects an inefficient function but doesn't know which one. Which Cloud Operations tool identifies CPU hotspots in production code?

A. Cloud Debugger

B. Cloud Trace

C. Cloud Profiler

D. Cloud Monitoring custom dashboards

Answer: C

Cloud Profiler samples production applications continuously with minimal overhead, generating flame graphs that show exactly which functions are most CPU-intensive.

Why this answer

Cloud Profiler is the correct tool because it continuously gathers CPU and heap usage data from production services using statistical sampling, then presents a flame graph or call tree that pinpoints which functions consume the most CPU. Unlike debugging or tracing tools, Profiler is designed specifically for identifying performance bottlenecks like CPU hotspots with minimal overhead, making it ideal for diagnosing an inefficient function in a Go service running in production.

Exam trap

Google Cloud often tests the distinction between latency-focused tools (Trace) and resource-usage-focused tools (Profiler), and the trap here is that candidates confuse 'slow function' (latency) with 'CPU-hungry function' (resource consumption), leading them to pick Cloud Trace instead of Cloud Profiler.

How to eliminate wrong answers

Option A is wrong because Cloud Debugger is used for inspecting application state (variables, stack traces) at a specific point in time without stopping the service, but it does not collect or aggregate CPU usage data over time to identify hotspots. Option B is wrong because Cloud Trace focuses on latency analysis of requests and spans, measuring how long operations take, not CPU consumption; it can show slow operations but cannot attribute CPU usage to specific functions. Option D is wrong because Cloud Monitoring custom dashboards display metrics like CPU utilization at the instance or container level, but they cannot drill down into function-level CPU hotspots within the application code.

16

MCQ · medium

A company is using Cloud Storage to store sensitive data. They want to enforce that objects are automatically deleted after 90 days. Which configuration should they use?

A. Configure Object Lifecycle Management with a Delete action after 90 days.

B. Set a retention policy on the bucket for 90 days.

C. Enable bucket locking with a retention period of 90 days.

D. Enable object versioning and set a lifecycle rule to delete noncurrent versions.

Answer: A

Lifecycle rules can be set to delete objects when they reach a certain age.

Why this answer

Object Lifecycle Management allows you to set rules that automatically perform actions on objects after a specified number of days. By configuring a rule with a Delete action set to trigger after 90 days, objects in the bucket will be automatically removed, meeting the requirement without manual intervention.
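
A lifecycle rule of this kind is usually expressed as a small JSON document. The sketch below generates one with a single `Delete` action conditioned on an age of 90 days; the variable names are arbitrary, and the bucket it would be applied to is left as a placeholder.

```python
import json

# Lifecycle configuration in the JSON format Cloud Storage accepts:
# one rule that deletes objects once they are 90 days old.
lifecycle_config = {
    "rule": [
        {
            "action": {"type": "Delete"},
            "condition": {"age": 90},  # age in days since object creation
        }
    ]
}

lifecycle_json = json.dumps(lifecycle_config, indent=2)
```

Saved to a file, this document could then be applied with a command such as `gsutil lifecycle set FILE gs://BUCKET`, where `FILE` and `BUCKET` are placeholders.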

Exam trap

Google Cloud often tests the distinction between retention policies (which protect data from deletion) and lifecycle rules (which automate deletion), leading candidates to confuse a retention policy with a deletion policy.

How to eliminate wrong answers

Option B is wrong because a retention policy on a bucket prevents objects from being deleted or overwritten until the retention period expires, which is the opposite of automatically deleting objects after 90 days. Option C is wrong because bucket locking with a retention period enforces a write-once-read-many (WORM) policy that prevents object deletion or modification, not automatic deletion. Option D is wrong because enabling object versioning and setting a lifecycle rule to delete noncurrent versions only removes older versions of objects, not the current live objects, and does not guarantee deletion of all objects after 90 days.

17

MCQ · hard

Your Dataflow streaming pipeline is consuming messages from Pub/Sub but the pipeline's throughput has dropped significantly. Cloud Monitoring shows the `pubsub/subscription/oldest_unacked_message_age` metric is growing. The pipeline has enough workers. What is the most likely bottleneck, and how should you address it?

A. Increase the number of Dataflow workers to process messages faster.

B. Inspect Dataflow job graph metrics to identify the slow stage, then optimize that stage's logic or address data skew.

C. Increase the Pub/Sub subscription's ack deadline to 600 seconds.

D. Switch from Dataflow to Pub/Sub Lite for lower cost and higher throughput.

Answer: B

The Dataflow job monitoring graph shows per-stage throughput and latency. Identifying the slow stage reveals the root cause (slow transform, external call, skew), enabling targeted optimization.

Why this answer

The growing `oldest_unacked_message_age` metric indicates that messages are not being processed and acknowledged quickly enough, even though the pipeline has enough workers. This points to a bottleneck within a specific stage of the Dataflow pipeline, such as a transformation or grouping operation that is slow or suffering from data skew. Option B is correct because inspecting the job graph metrics (e.g., wall time, backlog, and throughput per stage) allows you to identify the slow stage and then optimize its logic or address data skew, which directly resolves the processing delay.

Exam trap

Google Cloud often tests the misconception that adding more workers or increasing timeouts always solves throughput issues, but the correct approach is to diagnose the specific bottleneck stage using Dataflow's built-in metrics.

How to eliminate wrong answers

Option A is wrong because the question explicitly states the pipeline has enough workers, so adding more workers would not address the root cause of a slow stage or data skew; it could even increase cost without improving throughput. Option C is wrong because increasing the ack deadline to 600 seconds only gives workers more time to process messages but does not fix the underlying bottleneck; it may delay the detection of stuck messages and could lead to duplicate processing if workers fail. Option D is wrong because switching to Pub/Sub Lite does not address a Dataflow pipeline bottleneck; Pub/Sub Lite is designed for lower cost and predictable throughput but does not resolve slow stage logic or data skew within Dataflow.

18

Multi-Selecteasy

Which THREE actions are recommended to ensure the successful operation of a Compute Engine instance running a production workload?

Select 3 answers

A.Enable deletion protection on the instance.

B.Set up a health check and autohealing policy for the instance group.

C.Use a custom VPC with a subnet in a single zone.

D.Configure snapshots with a retention policy.

E.Attach a GPU for machine learning inference.

AnswersA, B, D

Deletion protection prevents accidental deletion of a production instance.

Why this answer

Enabling deletion protection prevents accidental deletion of the instance. Configuring snapshots with a retention policy ensures data backup and recovery. Setting up a health check and autohealing automatically recreates unhealthy instances, improving availability.

Using a single-zone subnet (C) reduces fault tolerance, and attaching a GPU (E) is needed only for specific workloads; neither is a general best practice.

19

MCQmedium

An operations team wants to count how many times the string 'PaymentFailure' appears in application logs per minute and alert when it exceeds 10 occurrences. Cloud Monitoring doesn't have a native metric for this log pattern. What is the correct approach?

A.Create a Cloud Monitoring custom metric and write values via the application's exception handler

B.Create a log-based metric in Cloud Logging with a filter matching 'PaymentFailure', then alert on it in Cloud Monitoring

C.Export logs to BigQuery and run a scheduled query counting PaymentFailure entries

D.Enable Cloud Trace and look for PaymentFailure in trace annotations

AnswerB

Log-based metrics automatically count (or extract values from) log entries matching a filter. The resulting time-series metric is immediately usable in Cloud Monitoring alerting policies.

Why this answer

Option B is correct because Cloud Logging log-based metrics allow you to define a filter (e.g., `textPayload:"PaymentFailure"`) that counts matching log entries in real time, and Cloud Monitoring can directly create an alerting policy on that metric with a threshold of 10 occurrences per minute. This approach avoids custom code, external exports, or additional services, and it leverages the native integration between Cloud Logging and Cloud Monitoring.

Exam trap

Google Cloud often tests the distinction between native log-based metrics (which require no code or external services) and custom metrics or export-based solutions, leading candidates to overcomplicate the answer by choosing BigQuery or custom code.

How to eliminate wrong answers

Option A is wrong because creating a custom metric via the application's exception handler requires modifying application code and introduces latency and reliability issues, whereas a log-based metric is serverless and automatically counts log entries without code changes. Option C is wrong because exporting logs to BigQuery and running a scheduled query adds unnecessary complexity, cost, and delay (minutes to hours) compared to near-real-time log-based metrics. Option D is wrong because Cloud Trace is designed for distributed tracing of request latency, not for counting log patterns; it does not provide a metric for the frequency of a string in logs.
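
To make the counting and threshold logic concrete, here is a small local simulation of what a counter-type log-based metric plus an alerting threshold computes. It does not call Cloud Logging or Cloud Monitoring; the sample entries, function names, and timestamps are all hypothetical.

```python
from collections import Counter
from datetime import datetime

PATTERN = "PaymentFailure"
THRESHOLD = 10  # alert when a single minute has more than 10 matches


def matches_per_minute(entries):
    """Count entries containing PATTERN, bucketed by minute.

    `entries` is an iterable of (ISO-8601 timestamp, message) pairs.
    """
    counts = Counter()
    for timestamp, message in entries:
        if PATTERN in message:
            minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
            counts[minute] += 1
    return counts


def minutes_over_threshold(counts, threshold=THRESHOLD):
    """Return the minutes whose match count exceeds the threshold, in order."""
    return sorted(minute for minute, n in counts.items() if n > threshold)


# Hypothetical sample: 11 failures in one minute, 2 in the next, 1 unrelated entry.
sample = [(f"2024-01-01T12:00:{s:02d}", "PaymentFailure: card declined") for s in range(11)]
sample += [(f"2024-01-01T12:01:{s:02d}", "PaymentFailure: timeout") for s in range(2)]
sample.append(("2024-01-01T12:01:30", "PaymentSucceeded"))

counts = matches_per_minute(sample)
alerts = minutes_over_threshold(counts)
print(alerts)
```

In the managed setup, the filter plays the role of `PATTERN` and the alerting policy plays the role of `minutes_over_threshold`, with no application code involved.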

20

MCQhard

A team needs to export all Cloud Logging entries from a production GCP project to a BigQuery dataset for long-term analysis and compliance. The export must be near-real-time and include future log entries automatically. Which approach achieves this?

A.Schedule a daily Cloud Function to query the Logging API and write results to BigQuery

B.Create a log sink (Log Router) that routes all logs from the project to the BigQuery dataset

C.Use BigQuery Data Transfer Service to pull logs from Cloud Logging on a schedule

D.Enable VPC Flow Logs and stream them directly to BigQuery

AnswerB

Log sinks continuously route matching log entries to BigQuery in near-real-time. Once configured, future entries are exported automatically without manual steps.

Why this answer

Option B is correct because a log sink (Log Router) in Cloud Logging can be configured to route log entries in near-real-time to a BigQuery dataset. This approach automatically includes all future log entries without requiring any scheduled jobs or manual intervention, making it ideal for long-term analysis and compliance.

Exam trap

The trap here is that candidates may confuse BigQuery Data Transfer Service with a general-purpose data ingestion tool, but it does not support Cloud Logging as a source, leading them to choose option C instead of the correct log sink approach.

How to eliminate wrong answers

Option A is wrong because scheduling a daily Cloud Function to query the Logging API and write results to BigQuery introduces latency (up to 24 hours) and is not near-real-time; it also requires custom code to handle pagination and deduplication. Option C is wrong because BigQuery Data Transfer Service does not support Cloud Logging as a source; it is designed for transferring data from services like Google Ads, Amazon S3, or Teradata, not for streaming log entries. Option D is wrong because VPC Flow Logs only capture network traffic metadata, not all Cloud Logging entries (e.g., application logs, audit logs), and they cannot be streamed directly to BigQuery without an intermediary sink or export.

21

MCQmedium

A GKE node pool has an autoscaler configured with min=2, max=10 nodes. After a sustained traffic surge, the cluster scaled to 10 nodes. Traffic drops overnight to its normal level, but the cluster remains at 10 nodes after 4 hours. What is the most likely reason?

A.Cluster autoscaler only scales down once per day

B.Pods have a PodDisruptionBudget preventing eviction, or nodes are still within the scale-down cooldown period

C.The max node count prevents scale-down — setting max=10 locks the cluster at 10 nodes

D.Cloud Monitoring alerts are pausing the autoscaler to investigate the traffic drop

AnswerB

Cluster autoscaler delays scale-down to avoid premature removal of needed capacity. PodDisruptionBudgets that prevent Pod eviction also block node removal.

Why this answer

Option B is correct because the cluster autoscaler will not scale down nodes that are hosting pods protected by a PodDisruptionBudget (PDB) that prevents eviction, or if the nodes are still within the scale-down cooldown period (default 10 minutes after a scale-up). Since the cluster scaled up to 10 nodes during the traffic surge, the autoscaler must wait for the cooldown to expire and for all pods to be safely evictable before removing nodes. If any pods have a PDB that blocks eviction (e.g., requiring at least 3 replicas), the autoscaler cannot terminate the underlying nodes, keeping the cluster at 10 nodes even after traffic drops.

Exam trap

Google Cloud often tests the misconception that the autoscaler's max node count prevents scale-down, when in fact the max only caps scale-up, and scale-down is blocked by cooldown periods or PodDisruptionBudgets.

How to eliminate wrong answers

Option A is wrong because the cluster autoscaler does not have a 'once per day' scale-down limit; it evaluates scale-down continuously, subject to cooldown periods and other constraints. Option C is wrong because the max node count (10) only sets an upper bound on scaling up, not a lock; the autoscaler can scale down to the min node count (2) when utilization is low. Option D is wrong because Cloud Monitoring alerts do not pause the autoscaler; the autoscaler operates independently based on resource utilization metrics, and alerts are for notification, not control.

22

MCQhard

A production GKE service processes payments and must maintain at least 3 replicas running at all times, even during node upgrades or Pod evictions. How should this be enforced?

A.Set the Deployment replica count to 6 — node upgrades will only affect half at a time

B.Create a PodDisruptionBudget with minAvailable: 3 targeting the payment service Pods

C.Add node affinity rules pinning all 3 replicas to specific long-lived nodes

D.Enable cluster autoscaler with minNodeCount=3 — this preserves Pod availability

AnswerB

A PDB with minAvailable: 3 instructs GKE's node drain process (during upgrades, autoscaler removals) to ensure at least 3 payment service Pods remain running throughout the disruption.

Why this answer

A PodDisruptionBudget (PDB) with `minAvailable: 3` ensures that at least 3 replicas of the payment service remain available during voluntary disruptions like node upgrades or Pod evictions. The eviction API honors the PDB by refusing evictions that would drop the number of healthy Pods below the specified threshold, so a node drain waits until a replacement Pod is running elsewhere before it removes the next one.

Exam trap

Google Cloud often tests the misconception that increasing replica count or using node-level controls (affinity, autoscaler) alone can guarantee availability during voluntary disruptions, when in fact only a PodDisruptionBudget provides the explicit eviction protection needed.

How to eliminate wrong answers

Option A is wrong because simply setting the replica count to 6 does not enforce availability during node upgrades; node upgrades can still evict all Pods on a node, and without a PDB, the eviction controller may drain all replicas simultaneously, dropping below 3. Option C is wrong because node affinity rules pin Pods to specific nodes, but those nodes themselves can be upgraded or fail, and affinity does not prevent evictions during node maintenance, so availability is not guaranteed. Option D is wrong because cluster autoscaler with `minNodeCount=3` only ensures a minimum number of nodes exist, not that Pods are distributed or protected from eviction; Pods can still be evicted from those nodes during upgrades, and the autoscaler does not enforce a minimum number of running replicas.
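
The sketch below shows what such a PodDisruptionBudget manifest might look like, built as a Python dictionary and printed as JSON. The object name `payment-pdb` and the label `app: payment` are assumptions for illustration. The `eviction_allowed` helper is a simplified model of the rule, not the actual Kubernetes implementation.

```python
import json


def pod_disruption_budget(name: str, match_labels: dict, min_available: int) -> dict:
    """Build a PodDisruptionBudget manifest for Pods matching `match_labels`."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": match_labels},
        },
    }


def eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    """Simplified model: evicting one Pod must leave at least `min_available` healthy."""
    return healthy_pods - 1 >= min_available


# Hypothetical name and label; adjust to match the real payment Deployment.
pdb = pod_disruption_budget("payment-pdb", {"app": "payment"}, 3)
print(json.dumps(pdb, indent=2))
print(eviction_allowed(healthy_pods=4, min_available=3))
print(eviction_allowed(healthy_pods=3, min_available=3))
```

One consequence of the model: with exactly 3 replicas and `minAvailable: 3`, no voluntary eviction is ever allowed, so the Deployment usually needs more than 3 replicas for node drains to make progress.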

23

MCQhard

Your company runs a stateful web application on a managed instance group (MIG) in us-central1 across three zones. The application uses a TCP health check on port 8080. Recently, instances have been recreated frequently due to health check failures, though the application logs show no errors and instances are accessible via SSH. The firewall rules allow traffic from the health check ranges (130.211.0.0/22 and 35.191.0.0/16) to port 8080. You have verified that the application is listening on 0.0.0.0:8080. Which of the following is the most likely cause of the health check failures?

A.The health check is using the HTTP protocol instead of TCP.

B.The MIG's autohealing policy is set to a too-aggressive initial delay.

C.The firewall rule allowing health check probes is not applied to the instances due to missing network tags.

D.The application is only listening on the localhost interface.

AnswerC

Health check probes will be blocked if the firewall rule does not apply to the instances' network tags, even if the rule exists.

Why this answer

The application listens on 0.0.0.0:8080, so it accepts connections on all interfaces, and a firewall rule for the health check source ranges exists. A rule that exists can still fail to apply: if it targets a network tag that the MIG's instances do not carry, the probes are dropped before they reach the application. Option C is therefore the most likely cause, and it is consistent with clean application logs and working SSH access.

Option A is wrong because the scenario states that the health check uses TCP. Option B (an overly aggressive initial delay in the autohealing policy) changes when autohealing acts on a failed check but does not make the probes themselves fail. Option D is ruled out because the application is confirmed to be listening on 0.0.0.0.

24

MCQmedium

A team needs to roll out a configuration change to a MIG (Managed Instance Group) — updating the instance template to set a new environment variable. They want to validate the change on 1 VM before rolling it out to all 20 VMs. Which MIG update type supports this?

A.Canary update — update 1 VM to the new template, verify, then roll out to all

B.Opportunistic update — the new template applies only when VMs are naturally replaced

C.Recreate update — terminate all 20 VMs and recreate with the new template simultaneously

D.Snapshot the existing VMs and restore 1 with the new environment variable to test

AnswerA

MIG supports canary updates where a targetSize partition runs the new template while the rest run the old template. The team validates the canary VM before proceeding with full rollout.

Why this answer

A canary update in a Managed Instance Group (MIG) allows you to specify a target size (e.g., 1 VM) to update to the new instance template first. After validating the change on that single VM, you can then promote the canary to roll out the new template to the remaining 19 VMs. This matches the requirement to test on 1 VM before full rollout.

Exam trap

Google Cloud often tests the distinction between 'canary' and 'opportunistic' updates, where candidates mistakenly think opportunistic allows manual selection of a single VM to update, but it only applies changes during natural instance replacement.

How to eliminate wrong answers

Option B is wrong because an opportunistic update only applies the new template when existing VMs are stopped or terminated naturally (e.g., by autohealing or manual deletion), not on demand for a single VM test. Option C is wrong because a recreate update terminates all 20 VMs simultaneously and recreates them with the new template, which does not allow a staged validation on just 1 VM. Option D is wrong because snapshotting and restoring a VM does not update the instance template of the MIG; the MIG would still use the old template for any new VMs or managed operations, and this approach bypasses the MIG's update management entirely.
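
As a rough sketch of the canary idea, the snippet below models a two-version MIG in which one version carries a fixed `targetSize` of 1 and the other receives the remaining VMs. The template names are hypothetical, and the dictionary layout is an assumption that mirrors the `versions` and `targetSize` fields mentioned above; check the instance group manager API reference before relying on it.

```python
def canary_versions(stable_template: str, canary_template: str, canary_size: int = 1) -> list:
    """Describe a MIG with a stable version and a fixed-size canary version."""
    return [
        # The version without a targetSize receives all remaining instances.
        {"name": "stable", "instanceTemplate": stable_template},
        {
            "name": "canary",
            "instanceTemplate": canary_template,
            "targetSize": {"fixed": canary_size},
        },
    ]


def instance_split(total_vms: int, canary_size: int) -> dict:
    """How many VMs run each version while the canary is being validated."""
    if not 0 <= canary_size <= total_vms:
        raise ValueError("canary_size must be between 0 and total_vms")
    return {"stable": total_vms - canary_size, "canary": canary_size}


# Hypothetical template names.
versions = canary_versions("app-template-v1", "app-template-v2", canary_size=1)
split = instance_split(total_vms=20, canary_size=1)
print(versions)
print(split)
```

Promoting the canary then amounts to making the new template the only version, at which point the remaining 19 VMs are updated.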

25

Multi-Selecthard

Which THREE are best practices for monitoring and alerting in Google Cloud?

Select 3 answers

A.Configure budget alerts in Cloud Billing to notify when costs exceed a threshold.

B.Use Cloud Audit Logs to monitor administrative actions and data access.

C.Enable VPC Flow Logs for all subnets to capture all network traffic metadata.

D.Set up alerts based on log-based metrics to detect specific error patterns.

E.Use Cloud Monitoring to create uptime checks for external-facing services.

AnswersB, D, E

Audit logs are essential for security and compliance monitoring.

Why this answer

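Option B is correct because Cloud Audit Logs provide a comprehensive record of administrative actions (Admin Activity logs) and data access (Data Access logs) within Google Cloud. Monitoring these logs is a best practice for security, compliance, and operational troubleshooting, as they capture who did what, where, and when across services like IAM, Compute Engine, and Cloud Storage. Option D is correct because alerts on log-based metrics turn specific error patterns in logs into actionable notifications. Option E is correct because uptime checks probe external-facing services from outside and detect outages that internal metrics can miss.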

Exam trap

Google Cloud often tests the distinction between cost management (budget alerts) and operational monitoring (Cloud Monitoring, log-based metrics, uptime checks), leading candidates to incorrectly include budget alerts as a monitoring best practice.

26

MCQmedium

A Cloud Run service has been running for weeks. A sudden spike in 5xx errors appears in Cloud Monitoring. The team wants to view the actual request logs to identify which endpoint is failing. Where should they look?

A.Cloud Monitoring Metrics Explorer — filter by request_count metric with error status

B.Cloud Logging Logs Explorer — filter by resource type 'cloud_run_revision'

C.Cloud Trace — filter by 5xx response code

D.Cloud Debugger — set a snapshot at the error handler

AnswerB

Cloud Run streams request logs to Cloud Logging automatically. Filtering by `resource.type="cloud_run_revision"` in Logs Explorer shows individual request details including URL path and status codes.

Why this answer

Cloud Logging's Logs Explorer is the correct place to view actual request logs for a Cloud Run service. It allows filtering by resource type 'cloud_run_revision' and by HTTP status codes (e.g., 5xx) to identify which specific endpoint is failing. Cloud Monitoring Metrics Explorer shows aggregated metrics, not individual log entries, so it cannot pinpoint the exact endpoint.

Exam trap

Google Cloud often tests the distinction between aggregated metrics (Cloud Monitoring) and raw logs (Cloud Logging), trapping candidates who think Metrics Explorer can show individual request details when it only provides statistical aggregates.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring Metrics Explorer displays aggregated metrics (e.g., request_count) and cannot show the actual request logs needed to identify the specific failing endpoint. Option C is wrong because Cloud Trace is designed for distributed tracing latency analysis, not for viewing request logs filtered by response code; it does not store or expose full request logs. Option D is wrong because Cloud Debugger is used for inspecting live application state (e.g., variable values) via snapshots, not for viewing historical request logs or error details.
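
For reference, a Logs Explorer query for this scenario can be assembled as shown in the sketch below. The service name `checkout-api` is a hypothetical placeholder; the field names (`resource.type`, `resource.labels.service_name`, `httpRequest.status`) come from the Cloud Logging query language and the Cloud Run request log format.

```python
def cloud_run_5xx_filter(service_name: str) -> str:
    """Build a Logs Explorer filter for 5xx request logs of one Cloud Run service."""
    clauses = [
        'resource.type="cloud_run_revision"',
        f'resource.labels.service_name="{service_name}"',
        "httpRequest.status>=500",
    ]
    return " AND ".join(clauses)


# Hypothetical service name.
log_filter = cloud_run_5xx_filter("checkout-api")
print(log_filter)
```

Each matching entry carries the request URL in `httpRequest.requestUrl`, which is what identifies the failing endpoint.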

27

MCQhard

A platform team wants to define a formal service level objective (SLO) for their API: 99.9% of requests must succeed (HTTP 2xx) over a 30-day rolling window. Which Cloud Monitoring feature tracks this?

A.Create an alerting policy with a 99.9% threshold on the request success metric

B.Define a Cloud Monitoring SLO with a 99.9% availability target over a 30-day rolling window

C.Build a BigQuery dashboard showing 30-day average success rates from exported logs

D.Set an uptime check target of 99.9% in Cloud Monitoring

AnswerB

Cloud Monitoring SLOs track error budget consumption over a rolling window and alert on burn rate — specifically designed for this use case.

Why this answer

Option B is correct because Cloud Monitoring's SLO feature is specifically designed to track compliance with a formal service level objective, such as 99.9% of requests succeeding over a 30-day rolling window. It automatically calculates the success rate from the selected metric (e.g., request count or latency) and reports the SLO's performance over the defined period, including error budgets and burn rates.

Exam trap

Google Cloud often tests the distinction between an SLO (a formal target with error budgets and burn rates) and a simple threshold alert or uptime check, so candidates mistakenly choose an alerting policy or uptime check because they think any 99.9% threshold tracking qualifies as an SLO.

How to eliminate wrong answers

Option A is wrong because an alerting policy with a 99.9% threshold on the request success metric would trigger an alert when the metric drops below that threshold, but it does not track the SLO over a 30-day rolling window or provide the structured SLO monitoring, error budget, and burn rate analysis that the SLO feature offers. Option C is wrong because building a BigQuery dashboard from exported logs is an indirect, manual approach that lacks the native integration, automatic calculation, and built-in alerting of Cloud Monitoring SLOs; it also requires additional setup and does not provide real-time SLO tracking. Option D is wrong because an uptime check target of 99.9% in Cloud Monitoring measures external availability via synthetic probes (e.g., HTTP GET to a URL), not the actual success rate of all API requests (HTTP 2xx) as defined in the SLO, and it does not track a 30-day rolling window of request-level success.
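
The error budget behind a 99.9% target can be worked out with simple arithmetic. The sketch below is only an illustration of that arithmetic, not the Cloud Monitoring SLO API; the traffic figures are hypothetical.

```python
def error_budget(total_requests: int, slo_target: float) -> float:
    """Number of requests allowed to fail while still meeting the SLO."""
    return total_requests * (1.0 - slo_target)


def budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget(total_requests, slo_target)
    return (budget - failed_requests) / budget


SLO_TARGET = 0.999

# Hypothetical traffic: 1,000,000 requests in the 30-day window, 250 of them failed.
budget = error_budget(1_000_000, SLO_TARGET)
remaining = budget_remaining(1_000_000, 250, SLO_TARGET)
print(f"budget: {budget:.0f} failed requests, remaining: {remaining:.0%}")
```

With these numbers, roughly 1,000 failures are tolerated per window and about three quarters of the budget is left. A burn-rate alert fires when the budget is being consumed faster than the window allows.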

28

MCQmedium

A Kubernetes namespace is shared by multiple teams. The platform team wants to ensure no single team's workloads can consume more than 10 CPU cores and 20 GB memory in that namespace. Which Kubernetes resource enforces this constraint?

A.LimitRange — sets per-Pod CPU and memory limits

B.ResourceQuota scoped to the namespace

C.PodDisruptionBudget limiting the number of running Pods

D.Network Policy restricting namespace traffic to avoid resource contention

AnswerB

ResourceQuota enforces aggregate limits on resource consumption within a namespace (e.g., `requests.cpu: 10`, `requests.memory: 20Gi`). API server rejects Pods that would exceed the quota.

Why this answer

ResourceQuota is the Kubernetes resource that enforces aggregate resource consumption limits at the namespace level. By configuring a ResourceQuota with `spec.hard` entries such as `requests.cpu: 10` and `requests.memory: 20Gi`, the platform team caps the total CPU and memory requested by all Pods in the namespace, so workloads in that namespace cannot collectively exceed those limits.

Exam trap

The trap here is that candidates confuse LimitRange (per-Pod constraints) with ResourceQuota (namespace-level aggregate constraints), leading them to select LimitRange when the question explicitly asks for a resource that enforces a total cap across all workloads.

How to eliminate wrong answers

Option A is wrong because LimitRange sets per-Pod or per-Container default and minimum/maximum resource requests and limits, not an aggregate namespace-wide cap; it cannot prevent the sum of all Pods from exceeding 10 CPU cores and 20 GB memory. Option C is wrong because PodDisruptionBudget limits the number of Pods that can be voluntarily disrupted (e.g., during node maintenance), not the total resource consumption or running Pod count. Option D is wrong because Network Policy controls traffic flow between Pods based on labels and namespaces, not resource usage; it has no mechanism to enforce CPU or memory quotas.
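
The following sketch pairs a ResourceQuota manifest with a simplified model of the admission check it implies. The object name `team-quota` and namespace `shared` are hypothetical, and the `admits` helper only approximates the API server's behavior by comparing summed requests against the hard limits.

```python
import json

# Hypothetical name and namespace; the hard limits match the scenario above.
resource_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "team-quota", "namespace": "shared"},
    "spec": {"hard": {"requests.cpu": "10", "requests.memory": "20Gi"}},
}

HARD_CPU_MILLICORES = 10 * 1000  # 10 cores
HARD_MEMORY_MIB = 20 * 1024      # 20Gi


def admits(used_cpu: int, used_mem: int, pod_cpu: int, pod_mem: int) -> bool:
    """Simplified model: a new Pod is admitted only if the namespace totals stay within quota.

    CPU values are in millicores, memory values in MiB.
    """
    return (
        used_cpu + pod_cpu <= HARD_CPU_MILLICORES
        and used_mem + pod_mem <= HARD_MEMORY_MIB
    )


print(json.dumps(resource_quota, indent=2))
print(admits(used_cpu=9500, used_mem=1024, pod_cpu=500, pod_mem=512))
print(admits(used_cpu=9500, used_mem=1024, pod_cpu=501, pod_mem=512))
```

Because the quota applies to the namespace as a whole, giving each team its own namespace and quota is the usual way to cap teams individually.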

29

MCQhard

An application running on Compute Engine writes logs to a local file. The operations team wants to centralize these logs in Cloud Logging with minimal code changes. What is the recommended approach?

A.Use Cloud Scheduler to tail the log file and send entries to Logging.

B.Install and configure the Cloud Logging agent on the instance.

C.Set up a scheduled cron job to upload the log file via gcloud logging write.

D.Modify the application to use the Cloud Logging client library.

AnswerB

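The Cloud Logging agent can tail log files and send their entries to Cloud Logging.

Why this answer

Option B is correct because the Cloud Logging agent tails local log files and forwards the entries to Cloud Logging without any application code changes. Option A is wrong because Cloud Scheduler triggers jobs on a schedule and cannot tail a file on a VM. Option C is wrong because a cron job that uploads the file with gcloud logging write is a manual, delayed workaround rather than continuous collection. Option D is wrong because adopting the Cloud Logging client library requires modifying the application, which conflicts with the minimal-code-change requirement.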

30

MCQmedium

Your GKE application's pods are being evicted frequently during periods of high traffic. You notice that pods without resource requests are being evicted first. The nodes are running at ~85% memory utilization. What should you do to reduce pod eviction?

A.Set memory requests and limits for all pods to match their actual memory usage.

B.Increase the node machine type to have more memory.

C.Configure pod disruption budgets (PDBs) to prevent eviction.

D.Enable cluster autoscaler to add nodes before memory pressure occurs.

AnswerA

Pods with requests are classified as Burstable (or Guaranteed if limits equal requests). These are evicted after BestEffort pods. Proper requests also help the scheduler distribute pods evenly, reducing node pressure.

Why this answer

Setting memory requests and limits for all pods to match their actual memory usage lets the Kubernetes scheduler place pods according to what they really need. Without requests or limits, pods fall into the BestEffort QoS class, which makes them the first candidates for eviction when a node comes under memory pressure. With requests defined, pods are classified as Burstable or Guaranteed, so they are evicted later and the scheduler avoids packing more pods onto a node than its memory can support.

Exam trap

Google Cloud often tests the misconception that increasing node resources or adding nodes (autoscaling) solves eviction, when the real issue is the lack of resource requests that determines eviction priority under memory pressure.

How to eliminate wrong answers

Option B is wrong because simply increasing node memory does not address the root cause—pods without requests are still BestEffort and will be evicted first under any memory pressure, regardless of total node capacity. Option C is wrong because PodDisruptionBudgets (PDBs) only protect against voluntary disruptions (e.g., node drains), not involuntary evictions caused by node memory pressure (kubelet eviction). Option D is wrong because Cluster Autoscaler adds nodes only when pods are unschedulable due to resource shortages, not to prevent eviction of already-running pods; memory pressure eviction occurs before autoscaling can react.
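
The QoS classification that drives eviction order can be summarized in a few lines. This is a simplified sketch of the rules, not the kubelet's code: it considers only CPU and memory, and the container dictionaries are a hypothetical stand-in for the `resources` section of a Pod spec.

```python
def qos_class(containers: list) -> str:
    """Classify a Pod as BestEffort, Burstable, or Guaranteed.

    Each container is a dict with optional "requests" and "limits" dicts
    keyed by "cpu" and "memory".
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A request that is omitted defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))
print(qos_class([{"requests": {"memory": "256Mi"}}]))
print(qos_class([{"requests": {"cpu": "500m", "memory": "256Mi"},
                  "limits": {"cpu": "500m", "memory": "256Mi"}}]))
```

The first call models the pods described in the question: with nothing set, they are BestEffort and are evicted first.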

31

MCQhard

You are managing a GKE cluster that runs a mixed workload: latency-sensitive web services and batch data processing jobs. The batch jobs run for hours and consume significant CPU/memory. During batch peaks, the web services experience CPU throttling. What is the best configuration to prevent batch jobs from impacting web service latency?

A.Set CPU requests and limits on batch job pods to be lower than web service pods.

B.Assign web service pods a higher PriorityClass and run batch jobs on a separate node pool with taints.

C.Use Horizontal Pod Autoscaler for batch jobs so they scale down during peak web traffic.

D.Enable Cluster Autoscaler so new nodes are added when batch jobs demand more resources.

AnswerB

Separate node pools with taints prevent batch pods from running on web-service nodes (hard isolation). PriorityClass ensures web pods preempt batch pods if they ever share resources. Together this prevents batch-induced throttling.

Why this answer

Option B is correct because it combines isolation with priority. Running batch jobs on a separate node pool, with taints plus matching tolerations and node selectors steering each workload to its own pool, keeps batch pods off the nodes that host web pods, so the two never compete for the same CPU. A higher PriorityClass for the web service pods ensures they are scheduled first, and can preempt batch pods, if the two ever contend for the same resources.

Exam trap

Google Cloud often tests the misconception that resource limits alone (Option A) or autoscaling (Options C and D) can solve resource contention, when in reality priority and isolation mechanisms are required to guarantee QoS for latency-sensitive workloads.

How to eliminate wrong answers

Option A is wrong because setting CPU requests and limits lower on batch pods does not prevent them from consuming CPU when they are scheduled on the same node as web pods; CPU throttling occurs when the node's CPU is oversubscribed, and lower limits only cap the batch pod's usage but do not guarantee that web pods get CPU time first. Option C is wrong because Horizontal Pod Autoscaler (HPA) scales batch pods based on their own metrics (e.g., CPU utilization), not on web traffic; scaling down batch jobs during web peaks would require custom metrics or manual intervention, and HPA does not inherently prioritize web services. Option D is wrong because Cluster Autoscaler adds nodes when pods are unschedulable, but it does not prevent batch jobs from being scheduled on the same nodes as web pods; if batch jobs are already running on nodes with web pods, adding new nodes does not relieve the existing CPU contention on those nodes.

32

MCQeasy

A team wants to create a consistent backup of a Compute Engine VM's boot disk before applying a major OS patch. The backup should be usable to restore the disk to a new VM if the patch fails. What is the recommended approach?

A.Create a custom VM image from the running instance using `gcloud compute images create`

B.Create a persistent disk snapshot using `gcloud compute disks snapshot`

C.Copy all files to a Cloud Storage bucket using gsutil rsync

D.Enable live migration for the VM — it automatically backs up state before migrations

AnswerB

Disk snapshots capture the disk's exact state. Stopping the VM before snapshotting ensures consistency. Snapshots can restore to a new disk or new VM if needed.

Why this answer

Option B is correct because a persistent disk snapshot captures a point-in-time, crash-consistent backup of the disk, including the boot disk, without requiring the VM to be stopped. Snapshots are the recommended method for backing up persistent disks because they support incremental backups, can be used to create new disks, and are optimized for restore operations to a new VM. This approach ensures the backup is usable for disaster recovery if the OS patch fails.

Exam trap

The trap here is that candidates confuse VM images (used for creating identical instances) with disk snapshots (used for backups), leading them to choose Option A, even though images require the VM to be stopped for consistency and are not designed for point-in-time recovery of a single disk.

How to eliminate wrong answers

Option A is wrong because creating a custom VM image from a running instance requires the instance to be stopped or the image creation process to quiesce the filesystem, which is not a consistent backup method for a live boot disk and is intended for creating reusable templates, not for point-in-time backups. Option C is wrong because using gsutil rsync to copy files to Cloud Storage does not capture the disk's boot sector, partition table, or system state, making it impossible to restore a bootable disk to a new VM; it only copies file-level data. Option D is wrong because live migration is a feature that moves a running VM between hosts without downtime and does not create any backup or snapshot of the disk; it is unrelated to backup creation.

33

Drag & Dropmedium

Order the steps to set up a VPC network with a subnet, firewall rule allowing SSH, and a Compute Engine instance in that subnet.


Steps

Order

Why this order

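The VPC network and its subnet must exist before an instance can be created in that subnet, and the firewall rule must allow SSH (TCP port 22) for the instance to be reachable.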


34

MCQhard

A team is using Terraform to manage infrastructure in Google Cloud. After running terraform apply, they receive an error: 'Error 409: Resource already exists'. The team needs to resolve this without deleting and recreating the resource. What should they do?

A.Run 'terraform refresh' to update the state file.

B.Import the existing resource into Terraform state with 'terraform import'.

C.Set the 'create_before_destroy' lifecycle rule on the resource.

D.Change the resource name in the Terraform configuration.

AnswerB

Importing tells Terraform to adopt the existing resource into its state.

Why this answer

Option B is correct because the 'Error 409: Resource already exists' indicates that the resource was created outside of Terraform or the state file lost track of it. Running 'terraform import' brings the existing resource under Terraform management by adding its current attributes to the state file, allowing subsequent operations without deletion or recreation.

Exam trap

Google Cloud often tests the misconception that 'terraform refresh' can fix state mismatches for missing resources, but it only syncs attributes for resources already in state and cannot import new ones.

How to eliminate wrong answers

Option A is wrong because 'terraform refresh' only updates the state file with current real-world attributes of resources already tracked in state; it cannot add a resource that is missing from state. Option C is wrong because 'create_before_destroy' is a lifecycle meta-argument that controls the order of creation and destruction during updates, not a mechanism to resolve a state mismatch or import an existing resource. Option D is wrong because changing the resource name in the configuration would cause Terraform to attempt creating a new resource with the new name, leaving the existing resource unmanaged and still causing a conflict if the original name is reused elsewhere.
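The difference between refreshing and importing can be illustrated with a toy model. This is a conceptual sketch in Python, not how Terraform is implemented: `state` and `cloud` are plain dictionaries, and the resource addresses and IDs are hypothetical.

```python
# Toy model: state maps configuration addresses to real-world resource IDs.
# Conceptual only; Terraform's actual state handling is more involved.

def refresh(state: dict, cloud: dict) -> dict:
    """Re-read attributes, but only for resources already tracked in state."""
    return {
        addr: {**entry, "attrs": cloud[entry["id"]]}
        for addr, entry in state.items()
        if entry["id"] in cloud
    }

def import_resource(state: dict, cloud: dict, addr: str, resource_id: str) -> dict:
    """Adopt an existing real-world resource under a configuration address."""
    if resource_id not in cloud:
        raise KeyError(f"{resource_id} does not exist")
    return {**state, addr: {"id": resource_id, "attrs": cloud[resource_id]}}

# Hypothetical real-world resources keyed by ID.
cloud = {
    "projects/demo/buckets/logs": {"location": "US"},
    "projects/demo/buckets/assets": {"location": "EU"},
}
# State tracks only the first bucket; the second was created outside Terraform.
state = {
    "google_storage_bucket.logs": {"id": "projects/demo/buckets/logs", "attrs": {}},
}

refreshed = refresh(state, cloud)
imported = import_resource(
    refreshed, cloud, "google_storage_bucket.assets", "projects/demo/buckets/assets"
)
```

In this model, `refreshed` still lacks the untracked bucket, while `imported` contains it, which mirrors why the import step is the one that resolves the conflict.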


35

MCQmedium

A microservices application has intermittent high latency. The team wants to identify which specific service-to-service call in the request chain is causing the slowdown. Which Cloud Operations tool is designed for this?

A.Cloud Monitoring Metrics Explorer

B.Cloud Logging log viewer

C.Cloud Trace

D.Cloud Profiler

AnswerC

Cloud Trace instruments requests as they flow through services, recording each span's duration and parent-child relationships, making it ideal for pinpointing latency in distributed systems.

Why this answer

Cloud Trace is designed to capture latency data for individual service-to-service calls in a distributed request chain. It provides end-to-end tracing by collecting trace spans from each microservice, allowing you to pinpoint which specific call is causing the slowdown. This makes it the correct tool for identifying the exact service-to-service latency bottleneck.

Exam trap

Google Cloud often tests the distinction between tools that monitor aggregate metrics (Cloud Monitoring) versus tools that trace individual request paths (Cloud Trace), and the trap here is that candidates confuse Cloud Profiler's code-level profiling with distributed tracing, leading them to pick D instead of C.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring Metrics Explorer aggregates metrics like CPU, memory, and request counts, but it does not trace individual request paths or provide per-call latency breakdowns across services. Option B is wrong because Cloud Logging log viewer collects and filters log entries, but it lacks the distributed tracing context needed to correlate spans and measure latency for each service-to-service hop. Option D is wrong because Cloud Profiler continuously analyzes CPU and memory usage of running code to identify performance hotspots within a single service, but it does not trace request flows or measure network latency between services.
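To make the idea of spans and parent-child relationships concrete, here is a minimal sketch of how a trace can be searched for the slowest hop. The `Span` class, the service names, and the timings are hypothetical and are not part of the Cloud Trace API; the sketch also assumes child spans run sequentially.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

    @property
    def self_time_ms(self) -> float:
        # Time spent in this span excluding its direct children
        # (assumes children do not overlap in time).
        return self.duration_ms - sum(c.duration_ms for c in self.children)

def walk(span):
    """Yield a span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)

def slowest_by_self_time(root):
    """Return the span that accounts for the most time on its own."""
    return max(walk(root), key=lambda s: s.self_time_ms)

# Hypothetical trace: frontend -> auth, frontend -> orders -> inventory-db.
trace = Span("frontend", 0, 500, [
    Span("auth", 10, 40),
    Span("orders", 50, 480, [Span("inventory-db", 60, 460)]),
])

slowest = slowest_by_self_time(trace)
```

Here the `inventory-db` span holds 400 ms of self time, so it is the call worth investigating even though `frontend` has the longest total duration.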


36

MCQhard

You are on-call and receive a PagerDuty alert: `Cloud SQL CPU utilization > 90% for 15 minutes`. Checking `pg_stat_activity`, you find 200 connections with many in `idle` state and 15 queries running for > 5 minutes each. The long queries are table scans on a 500 GB unindexed table. What should you do IMMEDIATELY to restore service, and what is the root cause fix?

A.Restart the Cloud SQL instance to clear all connections and queries.

B.Terminate the long-running table scan queries immediately, then add indexes on the frequently queried columns as the root cause fix.

C.Increase Cloud SQL's CPU to a larger machine type to handle the current load.

D.Reduce `max_connections` to prevent new connections from adding load.

AnswerB

pg_terminate_backend(pid) kills the specific long-running queries to restore CPU. Adding indexes addresses the root cause: full table scans due to missing indexes.

Why this answer

Option B is correct because terminating the long-running table scans immediately stops the CPU-intensive queries, restoring service. The root cause is the missing index on the 500 GB table, which forces sequential scans and high CPU usage. Adding indexes on frequently queried columns eliminates the need for full table scans, preventing recurrence.

Exam trap

Google Cloud often tests the distinction between immediate mitigation (terminating bad queries) and root cause fix (adding indexes), tempting candidates to choose a scaling or restart option that avoids addressing the fundamental indexing problem.

How to eliminate wrong answers

Option A is wrong because restarting the Cloud SQL instance kills all connections and queries indiscriminately, but the long-running scans will resume on restart if the root cause (missing index) is not addressed, and it causes unnecessary downtime. Option C is wrong because increasing CPU only masks the symptom; the unindexed scans will still consume excessive CPU on a larger machine, and it does not fix the underlying query performance issue. Option D is wrong because reducing max_connections does not stop the already-running long queries; it only prevents new connections, leaving the CPU-hogging scans active and service degraded.
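The immediate mitigation can be sketched as a filter over `pg_stat_activity` rows that emits one `pg_terminate_backend` call per long-running active query. This is an illustrative sketch: the rows below are hypothetical sample data shaped like a subset of `pg_stat_activity` columns, and the generated statements are only built, not executed.

```python
from datetime import datetime, timedelta, timezone

def termination_statements(rows, now, max_age=timedelta(minutes=5)):
    """Build a pg_terminate_backend call for each active query older than max_age."""
    return [
        f"SELECT pg_terminate_backend({int(row['pid'])});"
        for row in rows
        # Idle connections are left alone; only running queries are targeted.
        if row["state"] == "active" and now - row["query_start"] > max_age
    ]

# Hypothetical snapshot of pg_stat_activity (pid, state, query_start only).
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"pid": 101, "state": "active", "query_start": now - timedelta(minutes=10)},
    {"pid": 102, "state": "idle", "query_start": now - timedelta(hours=1)},
    {"pid": 103, "state": "active", "query_start": now - timedelta(minutes=2)},
]

statements = termination_statements(rows, now)
```

Only the query from pid 101 qualifies: pid 102 is idle and pid 103 has been running for less than five minutes. Reviewing the list before executing it keeps the mitigation targeted, in contrast to a full restart.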


37

MCQhard

Refer to the exhibit. An engineer runs this command and sees the output. The instance is unable to reach the internet. What is the most likely reason?

A.VPC firewall rules are blocking egress traffic.

B.The instance needs a Cloud NAT gateway for outbound connectivity.

C.The instance does not have a public IP address.

D.The subnetwork is misconfigured.

AnswerC

The output shows no accessConfigs, confirming no external IP.

Why this answer

The instance cannot reach the internet because it lacks a public IP address. In Google Cloud, an instance without an external IP address cannot initiate outbound connections to the internet unless a Cloud NAT or a VM with a public IP is used as a proxy. The command output likely shows that the instance only has an internal IP, confirming this as the root cause.

Exam trap

Google Cloud often tests the misconception that Cloud NAT is always required for internet access, but the trap here is that an instance with a public IP can directly reach the internet without NAT, so the absence of a public IP is the primary issue.

How to eliminate wrong answers

Option A is wrong because VPC firewall rules are stateful and allow egress traffic by default; unless explicitly blocked, they would not prevent outbound connectivity. Option B is wrong because Cloud NAT is not required for instances with a public IP; it is only needed for instances without one to access the internet. Option D is wrong because a misconfigured subnetwork would affect internal routing or IP allocation, but the instance still has a valid internal IP and the subnet is correctly assigned; the issue is the lack of a public IP, not subnet configuration.
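The check described above can be sketched as a small function over the JSON form of an instance description. The field names `networkInterfaces`, `accessConfigs`, `natIP`, and `networkIP` follow the Compute Engine instance resource; the two sample instances and their addresses are hypothetical.

```python
def external_ips(instance: dict) -> list:
    """Collect external IPs from an instance description; empty means none assigned."""
    return [
        cfg["natIP"]
        for nic in instance.get("networkInterfaces", [])
        for cfg in nic.get("accessConfigs", [])
        if "natIP" in cfg
    ]

# Hypothetical instance with only an internal address (no accessConfigs key).
private_vm = {
    "name": "batch-worker",
    "networkInterfaces": [{"networkIP": "10.128.0.5"}],
}

# Hypothetical instance with an external address attached.
public_vm = {
    "name": "bastion",
    "networkInterfaces": [
        {"networkIP": "10.128.0.6", "accessConfigs": [{"natIP": "203.0.113.10"}]}
    ],
}

private_result = external_ips(private_vm)
public_result = external_ips(public_vm)
```

An empty result for `private_vm` corresponds to the situation in the question: without an external IP, outbound internet access needs Cloud NAT or a proxy.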


38

MCQhard

A managed instance group serves production traffic. During a rolling update to a new VM template, 30% of instances become unhealthy (failing health checks). The update has not completed yet. What should the team do to immediately restore service?

A.Delete all unhealthy instances manually; the MIG will recreate them with the old template

B.Pause the rolling update, then roll back using `gcloud compute instance-groups managed rolling-action rollback`

C.Increase the target size of the MIG to dilute the unhealthy instances

D.Disable autohealing temporarily to prevent the MIG from restarting instances

AnswerB

Rolling back the MIG update reverts all instances (including the updated ones) to the previous template version, quickly restoring service.

Why this answer

Option B is correct because pausing the update stops the rollout from replacing more instances, and rolling back returns every instance in the managed instance group (MIG) to the previous, known-good template so that health checks pass again. In gcloud, the rollback itself is issued as a new rolling update (`rolling-action start-update`) that targets the previous instance template, since `rolling-action` has no separate rollback subcommand.

Exam trap

Google Cloud often tests the misconception that manual deletion or disabling autohealing can fix a failed rolling update, when in fact only a deliberate rollback to the previous template restores the known-good state and ensures health checks pass again.

How to eliminate wrong answers

Option A is wrong because manually deleting unhealthy instances does not guarantee they will be recreated with the old template; the MIG's current instance template is the new one, so autohealing would recreate them with the new template, perpetuating the failure. Option C is wrong because increasing the target size does not fix the underlying health issue; it only adds more instances (potentially also unhealthy if based on the new template) and does not restore service for the existing unhealthy instances. Option D is wrong because disabling autohealing prevents the MIG from replacing unhealthy instances, leaving them in a failed state and not restoring service; it merely stops the system from acting on the health check failures.
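A rollback is commonly expressed as a new rolling update whose target is the previous template. The sketch below only assembles such a command line and does not run it; the group, template, and zone names are hypothetical, and the flags should be checked against the installed gcloud version before use.

```python
import shlex

def rollback_command(group: str, previous_template: str, zone: str) -> list:
    """Build a gcloud invocation that rolls the MIG back to a previous template."""
    return [
        "gcloud", "compute", "instance-groups", "managed",
        "rolling-action", "start-update", group,
        # Targeting the old template is what makes this update a rollback.
        f"--version=template={previous_template}",
        f"--zone={zone}",
    ]

# Hypothetical names for illustration only.
cmd = rollback_command("web-mig", "web-template-v1", "us-central1-a")
command_line = shlex.join(cmd)
```

Building the argument list separately from executing it makes the command easy to review during an incident before anyone runs it.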


39

MCQmedium

A developer reports that a Cloud Run service is returning 503 errors intermittently. Based on the log entry, what is the most likely cause?

A.The service has too few max instances configured.

B.The revision has been deleted.

C.The service's CPU throttling is too aggressive.

D.The service is experiencing cold starts with high latency.

E.The container image has a bug that prevents it from starting quickly.

AnswerD

The log directly states the instance is being started, which is typical of cold starts in Cloud Run.

Why this answer

The log indicates the instance is not ready and is being started, which is characteristic of a cold start. Cold starts occur when a new instance is created to handle incoming requests, causing a startup delay. Too few max instances (A) would lead to request queuing, not necessarily startup logs.

A deleted revision (B) would show a different error. CPU throttling (C) would not cause a 'being started' message. A container bug (E) would show different errors.


40

MCQmedium

A company is using BigQuery for analytics. They notice that queries are slow and expensive. The data is loaded daily into a single table. Which action would most improve performance and reduce cost?

A.Use a flat-rate reservation to improve query concurrency.

B.Denormalize the table to reduce joins.

C.Increase the number of slots available for the project.

D.Partition the table by date and cluster by frequently filtered columns.

AnswerD

Partitioning lets BigQuery skip partitions that a query does not need, and clustering makes filters more efficient; both reduce the amount of data scanned.

Why this answer

Partitioning the table by date allows BigQuery to prune partitions during query execution, scanning only the relevant daily data instead of the entire table. Clustering on frequently filtered columns further reduces the data scanned by sorting data within partitions. This directly reduces both query cost (pay-per-byte) and latency, addressing the core issue of slow, expensive queries on a large daily-loaded table.

Exam trap

Google Cloud often tests the misconception that increasing compute resources (slots or concurrency) is the primary fix for slow queries, when in reality data pruning via partitioning and clustering is the first and most impactful optimization for cost and performance.

How to eliminate wrong answers

Option A is wrong because a flat-rate reservation improves query concurrency and provides predictable slot capacity, but it does not reduce the amount of data scanned per query; slow and expensive queries due to scanning the entire table would persist. Option B is wrong because denormalization reduces joins but does not address the primary issue of scanning a massive single table; it may even increase storage costs and data scanned if not combined with partitioning/clustering. Option C is wrong because increasing slots (via reservations or flex slots) improves query execution speed by providing more parallel processing, but it does not reduce the bytes billed; queries would still scan the entire table, keeping costs high.
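The effect of partition pruning on bytes scanned can be shown with simple arithmetic. This is a sketch with made-up numbers (365 daily partitions of 2 GB each and a query over the last 7 days), not a model of BigQuery's actual billing.

```python
from datetime import date, timedelta

def bytes_scanned(partitions: dict, start: date, end: date, partitioned: bool) -> int:
    """Estimate bytes read by a query filtering on the date range [start, end]."""
    if not partitioned:
        # Without partitioning, the date filter cannot skip any data.
        return sum(partitions.values())
    return sum(size for day, size in partitions.items() if start <= day <= end)

# Hypothetical table: one 2 GB load per day for 365 days.
first_day = date(2024, 1, 1)
partitions = {first_day + timedelta(days=i): 2 * 10**9 for i in range(365)}

last_day = max(partitions)
week_start = last_day - timedelta(days=6)

full_scan = bytes_scanned(partitions, week_start, last_day, partitioned=False)
pruned_scan = bytes_scanned(partitions, week_start, last_day, partitioned=True)
```

Under these assumptions the unpartitioned query reads 730 GB while the partitioned one reads 14 GB, which is why pruning affects both cost and latency while adding slots affects only latency.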


41

Matchingmedium

Match each GCP networking concept to its description.


Concepts

Matches

Virtual private cloud network

Regional IP address range within a VPC

Outbound internet access for private instances

Distributes traffic across instances

Content delivery network for low-latency delivery

Why these pairings

These are fundamental networking components in GCP.


42

MCQhard

Your Cloud SQL for MySQL primary instance in `us-central1` has failed. Cloud SQL HA automatically fails over to the standby. After the failover, your application is experiencing intermittent connection errors. What is the most likely cause and solution?

A.The standby instance has a different IP address; update the connection string.

B.Application connection pools hold stale connections to the failed primary; configure pools to validate connections and reconnect after failure.

C.The standby replica must be manually promoted before it can accept connections.

D.The MySQL binary log is incomplete after failover; run `mysqlcheck` to repair tables.

AnswerB

Stale connection pool entries are the most common cause of post-failover errors. Configuring connection validation on borrow and using the Cloud SQL Auth Proxy (which handles reconnection transparently) resolves this.

Why this answer

Option B is correct because after a Cloud SQL HA failover, the standby instance becomes the new primary with the same IP address, but existing application connections that were established to the old primary are now broken. Connection pools that do not validate connections before reuse will attempt to use these stale connections, causing intermittent errors. Configuring the pool to test connections (e.g., via `SELECT 1` or JDBC `connectionTestQuery`) and automatically reconnect resolves this by discarding dead connections and establishing fresh ones to the new primary.

Exam trap

Google Cloud often tests the misconception that IP addresses change during HA failover, leading candidates to incorrectly choose Option A, but in Cloud SQL HA the VIP remains constant, and the real issue is stale connections in the application pool.

How to eliminate wrong answers

Option A is wrong because Cloud SQL HA failover preserves the same IP address (the VIP is moved to the standby), so updating the connection string is unnecessary and would not fix stale connection pool issues. Option C is wrong because Cloud SQL HA automatically promotes the standby to primary during failover; no manual promotion is required, and the standby accepts connections once failover completes. Option D is wrong because Cloud SQL HA keeps the standby in sync through synchronous replication to a regional persistent disk rather than through binary logs, so failover does not leave the binary log incomplete or tables in need of repair; `mysqlcheck` checks and repairs tables and is unrelated to the described connection errors.
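Validation on borrow can be sketched with a toy pool. `FakeConnection` and `ValidatingPool` are hypothetical stand-ins written for this illustration; a real pool would run a cheap query such as `SELECT 1` against the database instead of reading a flag.

```python
from collections import deque

class FakeConnection:
    """Stand-in for a database connection (illustration only)."""
    def __init__(self, server: dict):
        self.server = server

    def ping(self) -> bool:
        # A real pool would issue a lightweight query such as SELECT 1 here.
        return self.server["up"]

class ValidatingPool:
    def __init__(self, connect):
        self._connect = connect
        self._idle = deque()

    def release(self, conn):
        self._idle.append(conn)

    def borrow(self):
        # Discard idle connections that fail validation, then open a fresh one.
        while self._idle:
            conn = self._idle.popleft()
            if conn.ping():
                return conn
        return self._connect()

# The "primary" key plays the role of the stable IP; what sits behind it changes.
servers = {"primary": {"name": "old-primary", "up": True}}
pool = ValidatingPool(lambda: FakeConnection(servers["primary"]))

stale = pool.borrow()
pool.release(stale)

# Simulated failover: the old primary dies and the standby takes over the address.
servers["primary"]["up"] = False
servers["primary"] = {"name": "new-primary", "up": True}

fresh = pool.borrow()
```

After the simulated failover, the pooled connection fails its ping and is dropped, so the caller receives a new connection to the promoted instance instead of an error.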


43

MCQmedium

A DevOps team monitors a Cloud SQL instance and notices its CPU is consistently above 85% for several hours. The instance handles a critical production database. What should be the team's immediate action?

A.Enable read replicas to distribute query load

B.Scale up the Cloud SQL instance machine type to add more vCPUs

C.Restart the Cloud SQL instance to clear CPU-intensive processes

D.Delete and recreate the instance with a larger machine type

AnswerB

Scaling up the Cloud SQL instance (more CPUs/RAM) immediately provides more compute capacity. For Cloud SQL, this operation involves a brief restart but is the fastest relief for CPU saturation.

Why this answer

Scaling up the Cloud SQL instance to a larger machine type with more vCPUs directly addresses sustained high CPU utilization by providing additional compute capacity. This is the immediate corrective action for a critical production database because it increases the instance's processing power without architectural changes, at the cost of a brief restart while the new machine type is applied. Read replicas help with read-heavy workloads but do not reduce CPU load from write operations or complex queries on the primary instance.

Exam trap

Google Cloud often tests the misconception that read replicas can solve all performance issues, but the trap here is that replicas only help with read scaling, not CPU-bound write or compute workloads on the primary instance.

How to eliminate wrong answers

Option A is wrong because enabling read replicas offloads read queries but does not reduce CPU usage from write operations, DDL statements, or complex queries that execute on the primary instance; the primary's CPU would remain high. Option C is wrong because restarting the instance only temporarily clears transient processes; the underlying workload or insufficient capacity will cause CPU to spike again, and restarting a critical production database risks downtime and connection drops. Option D is wrong because deleting and recreating the instance is unnecessarily destructive and causes extended downtime; changing the machine type is an in-place edit that needs only a brief restart, whereas recreating requires restoring from backup.


44

MCQmedium

You notice that your Cloud SQL for PostgreSQL instance's `pg_stat_activity` shows many connections in `idle in transaction` state, and the connection count is near the max_connections limit. Application threads are blocking waiting for connections. What is the most effective solution to manage database connections for a GKE-hosted application?

A.Increase `max_connections` in the Cloud SQL PostgreSQL instance flags.

B.Deploy PgBouncer as a sidecar or deployment to pool connections to Cloud SQL in transaction mode.

C.Switch from Cloud SQL to Cloud Spanner, which has no connection limits.

D.Restart the Cloud SQL instance to clear idle connections.

AnswerB

PgBouncer in transaction pooling mode multiplexes many application connections onto fewer database connections, eliminating idle-in-transaction waste and staying well below max_connections.

Why this answer

PgBouncer is a lightweight connection pooler that can be deployed as a sidecar or separate deployment in GKE to manage connections to Cloud SQL for PostgreSQL. By operating in transaction mode, it holds database connections only for the duration of a transaction, not for the entire client session, which drastically reduces the number of concurrent connections to the database. This directly addresses the `idle in transaction` connections and the near-max_connections issue without requiring application code changes.

Exam trap

Google Cloud often tests the misconception that simply increasing `max_connections` is a safe scaling solution, when in fact it can lead to resource exhaustion and does not address the underlying idle connection problem.

How to eliminate wrong answers

Option A is wrong because increasing `max_connections` only raises the hard limit without solving the root cause of idle connections; it can also degrade database performance due to increased context switching and memory overhead. Option C is wrong because Cloud Spanner is a globally distributed, horizontally scalable database with a different API and consistency model, not a drop-in replacement for PostgreSQL, and it still has connection limits (though higher). Option D is wrong because restarting the instance is a disruptive, temporary fix that kills all connections but does not prevent idle connections from reaccumulating, and it causes downtime for the application.


45

MCQhard

A company runs a batch job on Compute Engine that processes large files from Cloud Storage. The job is taking longer than expected. The instances are using standard persistent disks. Which change would most likely improve I/O performance?

A.Use regional persistent disks instead of zonal.

B.Increase the machine type to have more vCPUs.

C.Add local SSDs to the instances.

D.Replace standard persistent disks with SSD persistent disks.

AnswerD

SSD persistent disks offer better I/O performance.

Why this answer

Standard persistent disks (pd-standard) are backed by HDDs and have lower IOPS and throughput compared to SSD persistent disks (pd-ssd). Since the batch job processes large files from Cloud Storage, the bottleneck is likely disk I/O performance. Upgrading to SSD persistent disks provides higher IOPS and throughput, directly improving I/O performance for read/write operations.

Exam trap

Google Cloud often tests the distinction between persistent disk types (standard vs. SSD) versus disk replication options (zonal vs. regional), leading candidates to mistakenly choose regional disks for performance instead of durability.

How to eliminate wrong answers

Option A is wrong because regional persistent disks provide synchronous replication across two zones for durability, not higher I/O performance; they have the same performance characteristics as zonal persistent disks. Option B is wrong because increasing vCPUs does not improve disk I/O performance; the bottleneck is the disk subsystem, not CPU capacity. Option C is wrong because local SSDs provide high IOPS but are ephemeral and cannot be used for persistent data; the job processes files from Cloud Storage, which requires persistent storage for the batch job's working data.


46

MCQmedium

You need to monitor a Cloud Run service for errors and receive a PagerDuty notification when the number of 5xx errors exceeds 10 in any 5-minute window. Which Cloud Monitoring feature should you use?

A.Create a log-based metric on Cloud Run error logs, then create an alerting policy on that metric with a PagerDuty notification channel.

B.Configure Cloud Run to send error emails directly to the PagerDuty email integration.

C.Use Cloud Pub/Sub to stream Cloud Run logs to a custom application that pages PagerDuty.

D.Enable Cloud Run's built-in alerting feature in the service configuration.

AnswerA

Log-based metrics extract the error count from Cloud Run's request logs. An alerting policy monitors the metric and fires when the threshold is exceeded, notifying PagerDuty via a configured notification channel.

Why this answer

A log-based metric extracts a numeric counter from Cloud Run error logs (e.g., HTTP 5xx status codes). An alerting policy can then evaluate that metric over a sliding 5-minute window, triggering a PagerDuty notification via a configured notification channel when the count exceeds 10. This is the native, serverless approach that requires no additional infrastructure.

Exam trap

Google Cloud often tests the misconception that Cloud Run has built-in alerting or that direct email integration is sufficient, when in fact Cloud Monitoring's log-based metrics and alerting policies are the required mechanism for threshold-based paging.

How to eliminate wrong answers

Option B is wrong because Cloud Run does not have a built-in feature to send error emails directly to a PagerDuty email integration; it would require custom log routing and filtering. Option C is wrong because using Cloud Pub/Sub and a custom application adds unnecessary complexity and latency compared to the native Cloud Monitoring alerting pipeline. Option D is wrong because Cloud Run does not have a built-in alerting feature in its service configuration; alerting must be configured externally via Cloud Monitoring.
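The threshold logic behind such an alert (more than 10 errors in any 5-minute window) can be sketched as a sliding-window counter. This only illustrates the condition; it is not how Cloud Monitoring evaluates alerting policies internally, and the class name and timestamps are hypothetical.

```python
from collections import deque

class ErrorRateAlert:
    def __init__(self, threshold: int = 10, window_s: int = 300):
        self.threshold = threshold
        self.window_s = window_s
        self._events = deque()  # timestamps of recent 5xx responses

    def record(self, timestamp_s: float, status: int) -> bool:
        """Record a response; return True when 5xx count in the window exceeds the threshold."""
        if 500 <= status <= 599:
            self._events.append(timestamp_s)
        # Drop errors that have aged out of the sliding window.
        while self._events and self._events[0] <= timestamp_s - self.window_s:
            self._events.popleft()
        return len(self._events) > self.threshold

alert = ErrorRateAlert()

# Eleven 503 responses, ten seconds apart (t = 0, 10, ..., 100).
fired = [alert.record(t, 503) for t in range(0, 110, 10)]

# A successful request long after the burst: old errors have left the window.
later = alert.record(400, 200)
```

The condition first becomes true on the eleventh error, and it clears once the errors fall outside the 5-minute window.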


47

MCQeasy

A team wants logs from their Python application running on a Compute Engine VM to appear in Cloud Logging. What must be installed on the VM?

A.Cloud Trace SDK for the Python application

B.Ops Agent (Google Cloud's combined logging and monitoring agent)

C.Cloud Monitoring agent only

D.No installation needed — GCE VMs automatically stream logs to Cloud Logging

AnswerB

The Ops Agent collects logs from system files and application log streams and forwards them to Cloud Logging. It must be installed explicitly on Compute Engine VMs.

Why this answer

The Ops Agent is Google Cloud's unified agent for both logging and monitoring, and it is required to stream custom application logs from a Compute Engine VM to Cloud Logging. While the VM itself sends basic platform logs (e.g., serial console output), application-level logs (e.g., from a Python app) require the Ops Agent to collect, parse, and forward them to the Cloud Logging API.

Exam trap

The trap here is that candidates assume GCE VMs automatically send all logs (including application logs) to Cloud Logging, but in reality only platform-level logs are auto-streamed, and application logs require the Ops Agent.

How to eliminate wrong answers

Option A is wrong because the Cloud Trace SDK is used for distributed tracing, not for collecting or forwarding application logs to Cloud Logging. Option C is wrong because the Cloud Monitoring agent only handles metrics for Cloud Monitoring, not logs for Cloud Logging; the Ops Agent replaces both the legacy logging and monitoring agents. Option D is wrong because GCE VMs do not automatically stream application logs; they only send basic platform logs (e.g., from the guest environment), and custom application logs require an agent like the Ops Agent to be installed and configured.


48

MCQhard

Your organization uses Cloud Functions to process messages from a Pub/Sub topic. Each function processes a single message and writes results to BigQuery. Recently, the function has been timing out and the Pub/Sub subscription's unacknowledged message count is growing rapidly. The function's memory is set to 256 MB and timeout is 60 seconds. The function logs show occasional 'memory limit exceeded' errors. You suspect that the function is leaking memory when processing large messages. What should you do to resolve the issue while minimizing cost and complexity?

A.Increase the function's memory to 1 GB and timeout to 540 seconds.

B.Set up a retry policy on the Pub/Sub subscription to dead-letter undelivered messages.

C.Increase the function's timeout to 120 seconds and reduce the batch size.

D.Increase the function's memory to 512 MB and timeout to 120 seconds.

AnswerD

Increases memory to eliminate memory errors and timeout to allow longer processing.

Why this answer

The function is timing out and running out of memory when it processes large messages. Increasing memory to 512 MB provides more headroom for processing, and raising the timeout to 120 seconds gives the function enough time to complete without unnecessary cost. This relieves the memory and timeout failures while keeping complexity low, although a genuine leak would still need to be fixed in code.

Exam trap

Google Cloud often tests the misconception that increasing timeout alone (Option C) or adding a dead-letter queue (Option B) solves memory-related failures, when in fact memory must be increased to prevent 'memory limit exceeded' errors.

How to eliminate wrong answers

Option A is wrong because increasing memory to 1 GB and timeout to 540 seconds is over-provisioned and unnecessarily increases cost without addressing the root cause of memory leaks; 540 seconds is the maximum timeout for 1st gen functions, far more than a single-message handler should need. Option B is wrong because a dead-letter queue only handles undelivered messages after retries, but does not fix the underlying memory leak or timeout; messages will still fail and accumulate. Option C is wrong because reducing batch size is irrelevant since each function processes a single message, and increasing timeout alone without addressing memory will still cause 'memory limit exceeded' errors.


49

MCQeasy

You need to monitor the CPU utilization across all instances in a managed instance group. What is the most efficient way to create an alerting policy?

A.Create an alerting policy using the Logs Explorer to parse instance logs.

B.Use Cloud Scheduler to call the monitoring API periodically.

C.Set up a cron job to run gcloud compute instances list and check CPU.

D.Create an alerting policy in Cloud Monitoring for the metric 'compute.googleapis.com/instance/cpu/utilization'.

AnswerD

Cloud Monitoring provides native CPU metrics for easy alerting.

Why this answer

Option D is correct because Cloud Monitoring provides a pre-built metric, 'compute.googleapis.com/instance/cpu/utilization', which directly measures CPU usage for VM instances. Creating an alerting policy based on this metric is the most efficient approach, as it requires no custom scripting or external scheduling, and integrates natively with managed instance groups to aggregate data across all instances.

Exam trap

Google Cloud often tests the distinction between logs and metrics, and the trap here is that candidates may confuse log-based analysis (Option A) with metric-based alerting, or assume that custom scripting (Options B and C) is necessary when a native monitoring service already provides the required functionality.

How to eliminate wrong answers

Option A is wrong because the Logs Explorer parses log entries, not real-time metrics; CPU utilization is a metric, not a log event, and parsing logs for CPU data would be inefficient and miss real-time thresholds. Option B is wrong because Cloud Scheduler calling the Monitoring API periodically introduces latency and complexity, and is not the recommended method for continuous metric-based alerting; alerting policies are designed to evaluate metrics automatically. Option C is wrong because a cron job running 'gcloud compute instances list' only retrieves instance metadata, not CPU utilization metrics, and would require additional commands and scripting to fetch and analyze monitoring data, making it inefficient and non-native.


50

MCQeasy

A team wants to receive an email alert when the average CPU utilization of VMs in a managed instance group exceeds 80% for more than 5 minutes. What should they create in Cloud Monitoring?

A.A dashboard with a CPU utilization chart

B.An alerting policy with a CPU utilization threshold condition

C.A log-based metric filter for high-CPU events

D.An uptime check targeting the managed instance group

AnswerB

Alerting policies evaluate metric conditions continuously and send notifications via configured channels when thresholds are breached for the specified duration.

Why this answer

B is correct because Cloud Monitoring alerting policies allow you to define conditions based on metric thresholds, such as average CPU utilization exceeding 80% for a specified duration (5 minutes). This directly meets the requirement to trigger an email alert when the condition is met.

Exam trap

Google Cloud often tests the distinction between alerting policies (which trigger notifications) and dashboards (which only display data), so candidates mistakenly choose a dashboard thinking it can send alerts.

How to eliminate wrong answers

Option A is wrong because a dashboard with a CPU utilization chart only visualizes data; it does not send alerts. Option C is wrong because log-based metric filters are used to extract metrics from log entries (e.g., custom application logs), not to monitor VM CPU utilization metrics which are already collected by Cloud Monitoring. Option D is wrong because uptime checks monitor the availability and response of HTTP/HTTPS services, not CPU utilization of VMs.


51

Multi-Selecteasy

Which TWO statements are true about Cloud IAM roles?

Select 2 answers

A.Custom roles are available for all Google Cloud services by default.

B.IAM roles are collections of permissions.

C.Roles assigned to a project are automatically inherited by all resources in the project.

D.The basic roles include Owner, Editor, and Viewer.

E.Primitive roles are the same as predefined roles.

AnswersB, D

IAM roles define what actions a principal can perform.

Why this answer

Option B is correct because an IAM role is a collection of permissions that defines what actions an identity can perform on Google Cloud resources. Permissions are not granted directly; they are bundled into roles, and a role is bound to a principal (a user, group, or service account) to grant access. Option D is correct because the basic roles, formerly called primitive roles, are Owner, Editor, and Viewer.

Exam trap

Google Cloud often tests the misconception that IAM roles assigned at the project level are automatically inherited by all resources in the project, but in reality, inheritance can be overridden by resource-level policies, and some resources (like Cloud Storage buckets) have their own ACLs that can bypass IAM inheritance entirely.


52

MCQmedium

Your company runs a data processing pipeline on Cloud Dataproc. The pipeline reads data from Cloud Storage, processes it with Spark, and writes results to BigQuery. Recently, the pipeline has been failing with errors indicating insufficient disk space on the worker nodes. The cluster is configured with standard worker nodes with 100 GB of standard persistent disk. The data size being processed has grown from 50 GB to 150 GB. What is the most cost-effective way to resolve the disk space issue?

A.Increase the size of the persistent disks on the worker nodes to 200 GB.

B.Use local SSDs instead of persistent disks for temporary storage.

C.Enable automatic disk resizing for the cluster.

D.Increase the number of worker nodes in the cluster.

AnswerC

Automatic disk resizing adjusts disk size based on usage, managing cost efficiently.

Why this answer

Option C is correct because Cloud Dataproc's automatic disk resizing feature dynamically increases the size of persistent disks on worker nodes when disk usage exceeds a threshold (default 90%). This resolves the insufficient disk space issue without manual intervention or additional cost for unused capacity, making it the most cost-effective solution for handling the increased data volume from 50 GB to 150 GB.

Exam trap

Google Cloud often tests the misconception that adding more nodes (scaling out) is the default solution for storage issues, but the trap here is that the problem is disk space per node, not cluster capacity, making automatic disk resizing the most cost-effective and operationally efficient fix.

How to eliminate wrong answers

Option A is wrong because increasing persistent disks to 200 GB incurs ongoing costs for the full provisioned size, even if only a portion is used, and is less cost-effective than automatic resizing which only grows disks as needed. Option B is wrong because local SSDs provide temporary, non-persistent storage that is lost on VM termination and cannot be used for the pipeline's intermediate data if it must survive restarts or failures; additionally, local SSDs are more expensive per GB than persistent disks and require manual configuration. Option D is wrong because adding more worker nodes increases the total disk capacity but also increases compute costs unnecessarily; the issue is disk space per node, not insufficient nodes, and scaling out does not address the root cause of insufficient local storage on existing nodes.


53

MCQmedium

A managed instance group (MIG) is running 4 VMs with a CPU autoscaling target of 60%. A traffic spike drives average CPU to 90%. How does the autoscaler respond?

A.The MIG terminates the 2 least-used VMs to trigger a restart with higher performance settings

B.The autoscaler adds VMs until average CPU across the group drops to approximately 60%

C.The MIG live-migrates instances to larger machine types automatically

D.The MIG restarts all existing VMs to clear cached load

AnswerB

The autoscaler computes how many VMs are needed to bring average utilization to the target and scales out accordingly.

Why this answer

The autoscaler for a managed instance group (MIG) uses a target utilization metric—here, CPU at 60%. When average CPU exceeds that target (90%), the autoscaler calculates the desired number of VMs to bring utilization back to 60% (e.g., 4 VMs * 90% / 60% = 6 VMs) and adds instances accordingly. It does not terminate, migrate, or restart VMs; it scales out horizontally.
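The arithmetic in that example can be written as a short Python sketch. The function name `recommended_size` is hypothetical, and the real autoscaler also applies limits such as the maximum number of replicas and stabilization periods, which are left out here.

```python
import math


def recommended_size(current_vms: int, observed_pct: int, target_pct: int) -> int:
    """Number of VMs needed so that the same load averages out at the target."""
    # Integer percentages keep the arithmetic exact before rounding up.
    return math.ceil(current_vms * observed_pct / target_pct)


print(recommended_size(4, 90, 60))  # 6
```

Rounding up matters: any fractional result means the current load would still sit above the target with one VM fewer.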

Exam trap

Google Cloud often tests the misconception that autoscaling involves modifying existing instances (e.g., restarting, migrating, or resizing) rather than simply adding or removing instances based on a target metric.

How to eliminate wrong answers

Option A is wrong because the autoscaler does not terminate VMs to trigger restarts; it adds VMs to reduce load, and termination would increase load on remaining instances. Option C is wrong because MIGs do not support live migration to larger machine types; autoscaling only adds or removes instances of the same template, and changing machine type requires a new instance template or a different MIG. Option D is wrong because restarting VMs does not reduce CPU utilization; it temporarily disrupts service and does not address sustained high load.


54

MCQmedium

A developer has deployed a Cloud Run service but receives a 503 error when accessing it. The service logs show 'The request was aborted because there was no available instance.' What is the most likely cause?

A.The minimum number of instances is set too high.

B.The container health checks are failing.

C.The service is experiencing a spike in traffic and the max instances are too low.

D.The service's memory limit is set too low.

AnswerC

When all instances are busy, new requests are rejected with a 503 and this log message.

Why this answer

The 503 error with the message 'The request was aborted because there was no available instance' indicates that all current instances are saturated and Cloud Run cannot scale up quickly enough to handle the incoming requests. This occurs when traffic spikes exceed the configured maximum number of instances, causing new requests to be rejected until an instance becomes free. Option C correctly identifies that the max instances setting is too low for the traffic spike.

Exam trap

Google Cloud often tests the distinction between scaling limits (max instances) and resource constraints (memory/CPU), where candidates mistakenly attribute 503 errors to resource limits rather than the explicit scaling cap.

How to eliminate wrong answers

Option A is wrong because setting the minimum number of instances too high would keep idle instances running, which would reduce cold starts and help handle traffic, not cause a 503 due to no available instances. Option B is wrong because failing container health checks would cause the instance to be marked unhealthy and removed from serving, but the error message specifically states 'no available instance' rather than 'unhealthy instance' or 'health check failure'. Option D is wrong because a memory limit set too low would cause the container to be killed (OOMKilled) or return 502/504 errors, not a 503 with the specific 'no available instance' message.


55

MCQhard

A GKE cluster running Kubernetes 1.27 needs to be upgraded to 1.29. The cluster has a stateful workload with a PodDisruptionBudget requiring at least 2 out of 3 replicas running at all times. What is the correct upgrade sequence?

A.Upgrade node pools first, then upgrade the control plane

B.Upgrade the control plane to 1.28 first, then 1.29; then upgrade node pools incrementally, respecting the PDB

C.Delete and recreate the cluster at version 1.29 to skip incremental upgrades

D.Upgrade control plane directly from 1.27 to 1.29, then upgrade node pools

AnswerB

GKE requires incremental minor version upgrades (1.27→1.28→1.29). The control plane upgrades first, then nodes are drained one at a time respecting the PodDisruptionBudget.

Why this answer

Option B is correct because Kubernetes requires that the control plane be upgraded one minor version at a time (e.g., 1.27 → 1.28 → 1.29) to maintain API compatibility and stability. After the control plane is upgraded, node pools can be upgraded incrementally, and the PodDisruptionBudget (PDB) ensures that during node upgrades, at least 2 out of 3 replicas remain available, preventing workload disruption.
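Two pieces of that reasoning can be sketched in Python: the sequence of control plane versions, and how many Pods the PodDisruptionBudget lets a node drain evict at once. Both helper names are hypothetical, and the sketch only covers forward upgrades within one major version.

```python
def control_plane_path(current: str, target: str) -> list[str]:
    """List each minor version the control plane passes through, in order."""
    major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("sketch only covers forward upgrades within one major version")
    return [f"{major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


def allowed_disruptions(healthy_replicas: int, min_available: int) -> int:
    """Pods that may be evicted right now without violating the budget."""
    return max(0, healthy_replicas - min_available)


print(control_plane_path("1.27", "1.29"))  # ['1.28', '1.29']
print(allowed_disruptions(3, 2))           # 1
```

With only one disruption allowed, a drain has to wait for the evicted replica to become healthy elsewhere before it can evict the next one.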

Exam trap

The trap here is that candidates may think node pools can be upgraded first (Option A) or that skipping minor versions is acceptable (Option D), but Google Cloud tests the strict Kubernetes version skew policy and the requirement for sequential control plane upgrades, especially when a PDB is in play.

How to eliminate wrong answers

Option A is wrong because upgrading node pools before the control plane violates Kubernetes version skew policy, which requires the control plane to be at a higher or equal minor version than nodes. Option C is wrong because deleting and recreating the cluster at version 1.29 is unnecessary and causes downtime, whereas a rolling upgrade with PDB compliance is the correct approach for stateful workloads. Option D is wrong because skipping a minor version (1.27 to 1.29) is not supported; Kubernetes requires sequential minor version upgrades to ensure API compatibility and safe migration.


56

MCQhard

A team manages multiple Kubernetes Engine clusters across different projects. They need to enforce that all clusters have the same security policies, including private cluster settings and workload identity. Which approach is most scalable?

A.Use Cloud Asset Inventory to compare configurations and alert on differences.

B.Retrieve cluster configuration for each cluster using gcloud container clusters describe and apply changes manually.

C.Use Config Connector with deployment scripts to manage cluster resources as Kubernetes custom resources.

D.Use Terraform with a module that defines the standard cluster configuration, and apply it to each project.

AnswerD

Terraform provides infrastructure-as-code for consistent, scalable deployment.

Why this answer

Option D is correct because Terraform, combined with a reusable module, provides an Infrastructure as Code (IaC) approach that enforces consistent cluster configurations across multiple projects declaratively. This method is scalable as it allows you to define the standard security policies (private cluster settings, Workload Identity) once in a module and apply it to any number of clusters, ensuring drift is prevented and changes are auditable.

Exam trap

The trap here is that candidates may confuse monitoring tools (Cloud Asset Inventory) or Kubernetes-native tools (Config Connector) with true IaC enforcement, overlooking that Terraform's declarative, module-based approach is the only option that provides scalable, automated, and consistent policy application across multiple projects.

How to eliminate wrong answers

Option A is wrong because Cloud Asset Inventory is a monitoring and alerting tool, not a configuration enforcement mechanism; it can detect differences but cannot automatically apply or remediate policies, making it reactive rather than proactive and less scalable for enforcement. Option B is wrong because manually retrieving and applying configurations with gcloud commands is error-prone, time-consuming, and does not scale across multiple clusters and projects, as it lacks automation and version control. Option C is wrong because Config Connector manages Google Cloud resources as Kubernetes custom resources, but it requires a Kubernetes cluster to run and is primarily designed for managing resources within a single project or from a central cluster, not for enforcing identical policies across multiple independent clusters in different projects.


57

Multi-Selecteasy

Which TWO actions should be taken to reduce latency for users accessing a global application hosted on Compute Engine? (Choose two.)

Select 2 answers

A.Use a single region with more instances.

B.Deploy instances in multiple regions behind a global load balancer.

C.Enable Cloud Armor to filter traffic.

D.Use Cloud Interconnect for connectivity.

E.Use Cloud CDN with the backend bucket.

AnswersB, E

Routes users to closest healthy backend.

Why this answer

Deploying instances in multiple regions behind a global load balancer (Option B) reduces latency by directing user traffic to the closest healthy backend, minimizing network round-trip time. Using Cloud CDN with a backend bucket (Option E) caches static content at edge locations worldwide, serving users from a nearby cache and offloading origin requests. Together, these actions ensure both dynamic and static content are delivered with minimal latency across a global user base.

Exam trap

The trap here is that candidates often confuse 'scaling up' (more instances in one region) with 'scaling out' (multi-region deployment) and overlook that Cloud CDN is specifically for static content caching, not dynamic requests—though the question does not specify content type, the combination of B and E is the standard best practice for global latency reduction.


58

MCQeasy

A team's GKE application is running out of memory due to a memory leak. Pods are restarting with OOMKilled status. As an immediate measure before a code fix is available, what kubectl action provides the most insight into which container is leaking?

A.kubectl get events --field-selector=reason=OOMKilling

B.kubectl top pods --containers -n [NAMESPACE]

C.kubectl delete pod [POD_NAME] --force=true to clear the memory leak

D.gcloud container clusters describe [CLUSTER] --memory-usage

AnswerB

`kubectl top pods --containers` shows real-time CPU and memory consumption per container, helping identify which container is consuming the most memory — essential for diagnosing leaks.

Why this answer

Option B is correct because `kubectl top pods --containers` shows per-container CPU and memory usage for each pod in the namespace. This allows you to identify which specific container within a pod is consuming excessive memory and triggering the OOMKilled status, even before a code fix is deployed. It provides immediate, real-time insight into resource consumption at the container level, which is essential for diagnosing a memory leak in a multi-container pod.
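As a rough illustration of reading that output, the Python sketch below picks the container with the highest memory figure from a pasted table. The sample rows are invented, the column layout (POD, NAME, CPU, MEMORY) is an assumption about what the command prints, and `heaviest_container` is a hypothetical helper.

```python
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to bytes; plain numbers pass through."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def heaviest_container(top_output: str) -> tuple[str, str]:
    """Return (pod, container) for the row with the largest memory value."""
    rows = [line.split() for line in top_output.strip().splitlines()[1:]]  # skip header
    pod, name, _cpu, _memory = max(rows, key=lambda row: to_bytes(row[3]))
    return pod, name


# Invented sample, shaped like the assumed column layout.
SAMPLE = """
POD        NAME      CPU(cores)   MEMORY(bytes)
web-7d9f   app       120m         1840Mi
web-7d9f   sidecar   5m           42Mi
"""

print(heaviest_container(SAMPLE))  # ('web-7d9f', 'app')
```

A single snapshot shows the largest consumer; repeating the command over time is what reveals steady growth typical of a leak.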

Exam trap

Google Cloud often tests the misconception that cluster-level or event-based commands (like `kubectl get events` or `gcloud container clusters describe`) provide container-level resource diagnostics, when in fact only `kubectl top` with the `--containers` flag gives per-container memory usage in real time.

How to eliminate wrong answers

Option A is wrong because `kubectl get events --field-selector=reason=OOMKilling` only shows that an OOMKill event occurred, but does not reveal which specific container within the pod leaked memory; it lacks the granularity needed to pinpoint the leaking container. Option C is wrong because `kubectl delete pod --force=true` merely terminates the pod, which does not provide any diagnostic insight into which container caused the memory leak; it is a destructive action that removes the evidence without analysis. Option D is wrong because `gcloud container clusters describe` does not support a `--memory-usage` flag; cluster-level description commands provide static configuration metadata, not real-time per-container memory metrics.


59

MCQeasy

A developer deployed a new version of a Compute Engine instance but the startup script fails to run. The developer needs to debug the startup script. Which step should be taken first?

A.RDP into the instance and check the system logs.

B.Check the instance's metadata for startup script errors.

C.Recreate the instance with a new image.

D.Review the serial port 1 output in the Google Cloud console.

AnswerD

Serial port 1 displays boot and startup script logs.

Why this answer

Serial port 1 output in the Google Cloud console captures the instance's serial console logs, including startup script execution output and any errors. This is the first and most direct step to debug a failing startup script because it shows the script's stdout, stderr, and any system messages during boot, without requiring network access or additional tools.

Exam trap

The trap here is that candidates confuse checking instance metadata (which stores the script) with viewing execution logs (serial port output), or they assume RDP/SSH is available when the script failure may prevent those services from starting.

How to eliminate wrong answers

Option A is wrong because RDP applies only to Windows instances, and even there it may be unavailable if the startup script fails before remote access is ready; serial port output needs no network access to the instance. Option B is wrong because the instance's metadata stores the startup script content and configuration, not runtime errors or execution logs; checking metadata will not show why the script failed. Option C is wrong because recreating the instance with a new image does not help debug the existing script failure; it would only reset the environment without revealing the root cause.


60

MCQhard

Your application running on GKE is experiencing intermittent 500 errors. You want to create an alert that fires when the 99th percentile latency exceeds 2 seconds OR when the error rate (5xx responses) exceeds 1% of all requests over a 5-minute window. You have Cloud Monitoring configured with the application exporting metrics via OpenTelemetry. What should you create in Cloud Monitoring?

A.Two separate alerting policies — one for latency and one for error rate — each with their own notification channel.

B.A single alerting policy with two conditions (p99 latency and error rate) joined with OR logic.

C.A log-based alert using Cloud Logging to detect 5xx response codes in access logs.

D.An SLO with error budget burn rate alerts configured in Cloud Monitoring.

AnswerB

Cloud Monitoring alerting policies support multi-condition policies with AND/OR combiners. A single OR-combined policy fires when either condition breaches its threshold.

Why this answer

Option B is correct because Cloud Monitoring alerting policies support multiple conditions combined with AND/OR logic, allowing you to trigger a single alert when either the 99th percentile latency exceeds 2 seconds or the error rate exceeds 1% over a 5-minute window. This directly matches the requirement without needing separate policies or relying on log-based detection.
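A minimal sketch of what such a policy body might look like, built as a Python dictionary. The two metric types are hypothetical stand-ins for whatever the application exports via OpenTelemetry, the latency metric is assumed to be a distribution measured in seconds, and the error metric is assumed to be an already-computed ratio; a real filter would normally also name a resource type.

```python
import json

# Hypothetical metric types; replace with the names the application actually exports.
LATENCY_METRIC = "workload.googleapis.com/http/server/duration"
ERROR_RATIO_METRIC = "workload.googleapis.com/http/error_ratio"


def threshold_condition(name: str, metric_type: str, aligner: str, threshold: float) -> dict:
    """Build one threshold condition that must hold for 5 minutes."""
    return {
        "displayName": name,
        "conditionThreshold": {
            "filter": f'metric.type="{metric_type}"',
            "comparison": "COMPARISON_GT",
            "thresholdValue": threshold,
            "duration": "300s",
            "aggregations": [{"alignmentPeriod": "60s", "perSeriesAligner": aligner}],
        },
    }


policy = {
    "displayName": "High latency or error rate",
    "combiner": "OR",  # the policy fires when either condition is met
    "conditions": [
        threshold_condition("p99 latency above 2s", LATENCY_METRIC, "ALIGN_PERCENTILE_99", 2.0),
        threshold_condition("5xx ratio above 1%", ERROR_RATIO_METRIC, "ALIGN_MEAN", 0.01),
    ],
}

print(json.dumps(policy, indent=2))
```

Switching `combiner` to `AND` would require both conditions to be met at once, which is not what the scenario asks for.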

Exam trap

Google Cloud often tests the distinction between metric-based alerts and log-based alerts, and the trap here is that candidates may choose a log-based alert (Option C) because they associate error detection with logs, but the question explicitly states metrics are exported via OpenTelemetry, making metric-based alerts the correct and more efficient choice.

How to eliminate wrong answers

Option A is wrong because creating two separate alerting policies would result in two independent alerts, which is unnecessary and less manageable; Cloud Monitoring supports multiple conditions in a single policy with OR logic, making this approach inefficient. Option C is wrong because a log-based alert using Cloud Logging would only detect 5xx errors from access logs, but the question specifies that metrics are exported via OpenTelemetry, so a metric-based alert is more appropriate and avoids log parsing latency. Option D is wrong because an SLO with error budget burn rate alerts is designed for tracking service-level objectives over longer periods (e.g., 30 days), not for real-time threshold-based alerting on latency and error rate over a 5-minute window.


61

MCQmedium

Your company runs a critical web application on Google Kubernetes Engine (GKE) with a regional cluster. The application uses a Cloud SQL instance for database. Recently, users have been experiencing intermittent connection timeouts. The application logs show database connection errors, but the Cloud SQL instance's CPU and memory usage are low. The GKE cluster and Cloud SQL are in the same region. You notice that the Cloud SQL instance is configured with a private IP address. What is the most likely cause of the timeouts?

A.The Cloud SQL instance is not configured with automatic failover.

B.The Cloud SQL instance's connection pool size is too small.

C.The GKE cluster is not using a Private Service Connect endpoint to reach Cloud SQL.

D.The GKE cluster's nodes are in a different VPC subnet than the Cloud SQL instance.

AnswerC

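Reaching Cloud SQL over private IP requires private connectivity to be set up for the VPC, such as a Private Service Connect endpoint (private services access through VPC peering is the other supported path).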

Why this answer

The most likely cause is that the GKE cluster is not using a Private Service Connect endpoint to reach the Cloud SQL instance. When Cloud SQL uses a private IP, it is accessible only through a VPC network that has a Private Service Connect endpoint or a VPC peering connection to the Service Networking API. Without this endpoint, the GKE nodes cannot route traffic to the Cloud SQL private IP, leading to intermittent connection timeouts even though the instance itself is healthy.

Exam trap

Google Cloud often tests the misconception that resources in the same region and VPC can communicate automatically via private IP, but Cloud SQL private IP requires explicit Private Service Connect or VPC peering, not just same-region placement.

How to eliminate wrong answers

Option A is wrong because automatic failover affects high availability during a zonal outage, not intermittent connection timeouts when CPU and memory are low. Option B is wrong because a small connection pool would cause connection refused errors or queueing, not timeouts, and the logs show database connection errors, not pool exhaustion. Option D is wrong because the GKE cluster and Cloud SQL are in the same region, and VPC subnets can be different as long as they are in the same VPC and have proper routing; the real issue is the lack of a Private Service Connect endpoint or VPC peering to expose the Cloud SQL private IP.


62

Matchingmedium

Match each Cloud Monitoring resource to its purpose.


Concepts

Matches

Measurable data point from a resource

Notification based on a condition

Customizable view of metrics

Monitors availability of a service

Metric derived from log entries

Why these pairings

Cloud Monitoring helps observe and respond to system health.


63

MCQmedium

A team discovers their Cloud Logging costs are unexpectedly high. The majority of costs come from verbose DEBUG-level logs from a development service in production. They want to stop storing DEBUG logs without modifying the application. What is the solution?

A.Set the application's log level to INFO — this is the only way to reduce log volume

B.Create a Cloud Logging exclusion filter to discard DEBUG-level log entries from the service

C.Move the development service to a separate GCP project with a lower logging tier

D.Delete old DEBUG log entries manually — Cloud Logging charges for stored volume

AnswerB

Logging exclusion filters (in Log Router) match and discard specified log entries before storage. A filter for `severity=DEBUG` on the resource type drops debug logs without application changes.

Why this answer

Option B is correct because Cloud Logging exclusion filters allow you to discard log entries based on criteria such as severity level, log name, or resource labels before they are ingested and stored. By creating an exclusion filter that matches DEBUG-level log entries from the specific development service, you can stop storing those logs without modifying the application code. This approach directly reduces storage costs because excluded logs are not indexed or retained.
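The effect of an exclusion can be mimicked in a few lines of Python. This is only an illustration of the matching idea: the entries are invented, the `service` field and the `dev-service` name are hypothetical, and the real filter is written in the Logging query language on the sink rather than in application code.

```python
# Hypothetical service whose DEBUG entries should no longer be stored.
EXCLUDED_SERVICE = "dev-service"


def is_excluded(entry: dict) -> bool:
    """Match DEBUG entries from the one service; everything else is kept."""
    return entry["severity"] == "DEBUG" and entry["service"] == EXCLUDED_SERVICE


entries = [
    {"service": "dev-service", "severity": "DEBUG", "message": "cache miss"},
    {"service": "dev-service", "severity": "ERROR", "message": "db timeout"},
    {"service": "checkout", "severity": "DEBUG", "message": "cart loaded"},
]

stored = [entry for entry in entries if not is_excluded(entry)]
print([entry["message"] for entry in stored])  # ['db timeout', 'cart loaded']
```

Scoping the match to one service keeps ERROR entries from that service and DEBUG entries from other services available for troubleshooting.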

Exam trap

The trap here is that candidates may think modifying the application's log level is the only way to reduce log volume, but Cloud Logging exclusion filters provide a non-invasive, infrastructure-level solution that avoids code changes.

How to eliminate wrong answers

Option A is wrong because setting the application's log level to INFO would require modifying the application code or configuration, which the question explicitly states is not allowed. Option C is wrong because moving the service to a separate GCP project does not reduce log volume; it merely shifts the cost to another project, and Cloud Logging charges are based on ingestion and storage regardless of project. Option D is wrong because deleting old DEBUG log entries manually does not prevent future DEBUG logs from being ingested and stored, and Cloud Logging charges are primarily for ingestion volume, not just stored volume.


64

MCQeasy

A small business is deploying a web application on Compute Engine and wants to ensure high availability. They have set up two instances in different zones behind a TCP load balancer. What should they also configure to detect and route traffic away from unhealthy instances?

A.Configure a health check on the load balancer.

B.Set a firewall rule to allow traffic only on port 80.

C.Use a global HTTP(S) load balancer instead.

D.Create a snapshot schedule for the persistent disks.

AnswerA

Health checks allow the load balancer to stop sending traffic to unhealthy instances.

Why this answer

A health check is required for the TCP load balancer to monitor the backend instances. It periodically probes each instance on a specified port (HTTP-based checks also request a path) and marks the instance unhealthy if it fails to respond. The load balancer then stops routing new traffic to unhealthy instances, ensuring high availability by directing traffic only to healthy backends.

Exam trap

The trap here is that candidates often confuse health checks with firewall rules or backup strategies, thinking that allowing traffic or creating snapshots ensures availability, but only health checks provide the active monitoring needed to detect and route around failures.

How to eliminate wrong answers

Option B is wrong because a firewall rule allowing only port 80 controls network access but does not detect instance health or influence load balancer routing decisions. Option C is wrong because switching to a global HTTP(S) load balancer changes the load balancer type but does not by itself detect unhealthy instances; the existing TCP load balancer only needs a health check added. Option D is wrong because snapshot schedules are for backup and disaster recovery of persistent disks, not for real-time health detection or traffic routing.


65

MCQmedium

A cost-conscious team notices their GKE cluster's node pools have consistently high memory utilization (>90%) while CPU remains at 30%. Pods are occasionally OOMKilled. What should they do to balance resource efficiency and stability?

A.Switch node pool machine type to a memory-optimized series (e.g., m2-ultramem) and ensure Pod memory requests are accurate

B.Increase CPU limits for all Pods to use the available CPU capacity

C.Enable vertical pod autoscaling (VPA) set to Recreate mode as the only change

D.Reduce the number of replica Pods to lower memory consumption

AnswerA

Memory-optimized machine types provide more RAM per vCPU, directly addressing the memory bottleneck. Accurate Pod requests let the scheduler pack Pods efficiently and let the autoscaler add the right type of capacity.

Why this answer

Option A is correct because the team has a memory-bound workload (high memory utilization, low CPU, OOMKills). Switching to a memory-optimized machine series (e.g., m2-ultramem) provides a higher memory-to-CPU ratio, directly addressing the memory pressure. Ensuring accurate Pod memory requests allows the scheduler to place Pods efficiently and prevents overcommitment, balancing resource efficiency with stability.

Exam trap

Google Cloud often tests the misconception that vertical scaling (VPA) alone can fix memory pressure without considering the node's physical resource ratio, leading candidates to pick Option C and overlook the need for a memory-optimized machine type.

How to eliminate wrong answers

Option B is wrong because increasing CPU limits does not address memory pressure or OOMKills; it wastes CPU capacity that is already underutilized and may cause unnecessary throttling or scheduling inefficiencies. Option C is wrong because enabling VPA in Recreate mode as the only change will adjust CPU and memory requests based on historical usage, but it does not change the underlying machine type's memory-to-CPU ratio; the node pool may still lack sufficient memory capacity, leading to continued OOMKills or failed VPA recommendations. Option D is wrong because reducing replica Pods lowers overall memory consumption but also reduces application throughput and availability; it does not fix the root cause of memory inefficiency per Pod and may violate stability or SLA requirements.


66

MCQeasy

A company is using Cloud Run for a stateless application. The application sometimes fails with HTTP 503 errors when traffic spikes. Which action should the team take to improve reliability?

A.Configure a liveness probe with a higher initial delay.

B.Increase the maximum number of container instances.

C.Use Cloud Functions instead of Cloud Run.

D.Enable HTTP load balancing with Cloud CDN.

AnswerB

Increasing the max instances allows more concurrent requests, reducing 503s.

Why this answer

HTTP 503 errors during traffic spikes indicate that Cloud Run is scaling out but hitting the maximum number of container instances limit, causing new requests to be rejected. Increasing the maximum number of container instances allows Cloud Run to spin up more concurrent containers to handle the burst, directly improving reliability under load.
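A back-of-the-envelope Python sketch of why the cap matters. It assumes capacity is simply maximum instances multiplied by per-instance concurrency and ignores request queueing and startup time; the numbers and both function names are made up for illustration.

```python
def max_in_flight(max_instances: int, concurrency: int) -> int:
    """Upper bound on simultaneous requests the service can hold."""
    return max_instances * concurrency


def rejected(in_flight: int, max_instances: int, concurrency: int) -> int:
    """Requests beyond the bound, which is where 503s would start to appear."""
    return max(0, in_flight - max_in_flight(max_instances, concurrency))


# A spike of 1000 simultaneous requests against two different instance caps.
print(rejected(1000, max_instances=10, concurrency=80))  # 200
print(rejected(1000, max_instances=15, concurrency=80))  # 0
```

Raising the cap removes the overflow in this model without touching the application, which is why it is the first lever to check.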

Exam trap

The trap here is that candidates confuse liveness probes (which check container health) with scaling mechanisms, or assume that adding a CDN or switching to Cloud Functions will magically absorb traffic spikes, when the root cause is simply hitting the instance cap.

How to eliminate wrong answers

Option A is wrong because a liveness probe with a higher initial delay only affects when the container is considered healthy after startup; it does not address capacity limits during traffic spikes. Option C is wrong because Cloud Functions has similar or stricter concurrency and scaling limits, and switching to it would not inherently solve capacity-related 503 errors. Option D is wrong because HTTP load balancing with Cloud CDN caches static content but does not increase the backend's ability to handle more concurrent requests; the 503 originates from Cloud Run's instance cap, not from network-level congestion.

67

MCQmedium

A team wants to automatically restart any GKE Pod that fails a liveness probe three consecutive times. The probe should check HTTP GET /healthz on port 8080, starting after 30 seconds and checking every 10 seconds. Which Pod spec configuration implements this?

A.readinessProbe: httpGet: path: /healthz port: 8080 initialDelaySeconds: 30 periodSeconds: 10 failureThreshold: 3

B.livenessProbe: httpGet: path: /healthz port: 8080 initialDelaySeconds: 30 periodSeconds: 10 failureThreshold: 3

C.startupProbe: httpGet: path: /healthz port: 8080 failureThreshold: 3

D.lifecycle: postStart: httpGet: path: /healthz port: 8080

AnswerB

livenessProbe with the correct httpGet, timing, and failureThreshold configuration will restart the container after 3 consecutive failures — exactly the described behavior.

Why this answer

Option B is correct because a livenessProbe with an HTTP GET on /healthz at port 8080, configured with initialDelaySeconds: 30, periodSeconds: 10, and failureThreshold: 3, causes the kubelet to restart the container after three consecutive failed checks (the Pod itself is not recreated). This directly matches the requirement to restart on liveness probe failures, as liveness probes are specifically designed to determine whether a container should be restarted.

Exam trap

Google Cloud often tests the distinction between readinessProbe and livenessProbe, trapping candidates who confuse 'restart on failure' with 'stop sending traffic on failure'.

How to eliminate wrong answers

Option A is wrong because it uses a readinessProbe, which only controls whether the Pod receives traffic from Services, not whether the container is restarted; readiness probes do not trigger restarts on failure. Option C is wrong because a startupProbe is used to delay other probes until the application has started, and it does not cause restarts after the initial startup phase; it also lacks the required initialDelaySeconds and periodSeconds. Option D is wrong because lifecycle hooks like postStart execute a command or HTTP request once after container creation, not as a recurring health check, and they cannot be configured with failure thresholds or periodic checks.
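
To make the timing fields concrete, here is a small sketch that writes the probe from option B as a Python dictionary and estimates when the restart would be triggered. The helper name and the timing model are assumptions: it treats the first probe as running exactly at initialDelaySeconds and every probe as failing instantly, whereas real kubelet timing includes jitter and probe timeouts.

```python
import json

# The probe from option B, expressed as the structure a Pod spec would carry.
liveness_probe = {
    "httpGet": {"path": "/healthz", "port": 8080},
    "initialDelaySeconds": 30,
    "periodSeconds": 10,
    "failureThreshold": 3,
}


def earliest_restart_seconds(probe: dict) -> int:
    """Approximate seconds from container start to the final consecutive failure."""
    extra_probes = probe["failureThreshold"] - 1
    return probe["initialDelaySeconds"] + extra_probes * probe["periodSeconds"]


print(json.dumps({"livenessProbe": liveness_probe}, indent=2))
print(earliest_restart_seconds(liveness_probe))
```

With these values, the three failures land at roughly 30, 40, and 50 seconds after the container starts.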

68

MCQmedium

An organization has deployed a Compute Engine VM instance running a web server. The web server is not responding to HTTP requests from the internet. The firewall rules allow ingress traffic on port 80 and 443 from any source (0.0.0.0/0). The VM has a public IP address and is in a VPC network with default subnets. What is the most likely cause of the issue?

A.The VM does not have an HTTP health check configured.

B.The web server service is not running on the VM.

C.The VPC network's default firewall rule blocks ingress traffic.

D.The VM is not in the same region as the global load balancer.

AnswerB

If the web server process is not running, it will not respond to HTTP requests.

Why this answer

Option B is correct because the most likely cause of the web server not responding to HTTP requests, despite correct firewall rules and a public IP, is that the web server service (e.g., Apache, Nginx) is not running on the VM. Firewall rules only control network traffic; they do not ensure that the application process is listening on the specified ports. A simple `sudo systemctl status apache2` or `netstat -tlnp` would confirm whether the service is active.

Exam trap

Google Cloud often tests the misconception that firewall rules alone guarantee application availability, when in fact the application service must be running and listening on the correct port.

How to eliminate wrong answers

Option A is wrong because HTTP health checks are used by load balancers to monitor instance health, but they are not required for a standalone VM to respond to HTTP requests; the VM can serve traffic directly without any health check configuration. Option C is wrong because the scenario states that firewall rules already allow ingress on ports 80 and 443 from 0.0.0.0/0; those explicit allow rules override the VPC's implied deny-ingress rule, so the firewall is not what blocks the traffic. Option D is wrong because a global load balancer is not mentioned in the scenario, and even if one were used, the VM does not need to be in the same region as the load balancer; global load balancers route traffic to backends in any region.
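
The distinction between "the firewall allows the port" and "something is listening on the port" can be checked with a plain TCP connection attempt. The sketch below is one minimal way to do that using only the Python standard library; the function name is hypothetical, and it would be run on the VM itself against 127.0.0.1.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Check the two web ports locally; a stopped web server shows as not listening.
for port in (80, 443):
    state = "listening" if is_port_open("127.0.0.1", port) else "not listening"
    print(f"port {port}: {state}")
```

If the ports report as not listening locally, no firewall change will help, because no process is accepting connections.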

69

MCQmedium

A team wants proactive alerting if their public HTTPS endpoint returns a non-2xx HTTP status code or becomes unreachable — before users report it. Which Cloud Monitoring capability provides this?

A.A log-based alert on 5xx error log entries

B.An uptime check with an HTTP status code condition

C.A Cloud Armor rule blocking 5xx responses

D.A metric alert on instance CPU exceeding 90%

AnswerB

Uptime checks actively probe the endpoint from multiple locations, alert on non-2xx status codes, and detect outages even during zero-traffic periods.

Why this answer

An uptime check with an HTTP status code condition is the correct choice because Cloud Monitoring’s uptime checks are specifically designed to proactively verify that a public HTTPS endpoint is reachable and returns a successful HTTP status (e.g., 2xx). When the check detects a non-2xx status or a timeout/unreachable condition, it can trigger an alert before users are impacted. This is the only option that directly monitors endpoint availability and HTTP response codes from an external perspective.

Exam trap

Google Cloud often tests the distinction between proactive monitoring (uptime checks) and reactive logging (log-based alerts), trapping candidates who assume that log entries for 5xx errors are sufficient for early detection, when in fact they require the error to already occur and be logged.

How to eliminate wrong answers

Option A is wrong because a log-based alert on 5xx error log entries is reactive—it only fires after a 5xx response has been logged, and it cannot detect unreachable endpoints (e.g., DNS failures or connection timeouts) that never generate a log entry. Option C is wrong because Cloud Armor is a web application firewall that blocks or filters traffic based on rules, not a monitoring tool; it does not generate proactive alerts about endpoint status. Option D is wrong because a metric alert on instance CPU exceeding 90% monitors compute resource utilization, not the HTTP endpoint’s availability or response status, so it would not detect a non-2xx or unreachable condition.
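
The logic of an external probe can be sketched in a few lines. This is not how Cloud Monitoring implements uptime checks; it is a hypothetical single-location probe, using only the Python standard library, that separates the three outcomes the question cares about: a 2xx response, a non-2xx response, and an unreachable endpoint.

```python
import urllib.error
import urllib.request
from typing import Optional


def classify(status: Optional[int]) -> str:
    """Map a probe result to 'ok', 'bad_status', or 'unreachable'."""
    if status is None:
        return "unreachable"
    return "ok" if 200 <= status < 300 else "bad_status"


def probe(url: str, timeout: float = 5.0) -> str:
    """Issue one GET request and classify the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return classify(err.code)
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, TLS error, or timeout.
        return classify(None)
```

The "unreachable" branch is the case a log-based alert cannot see, since a request that never arrives produces no log entry.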

70

Multi-Selecteasy

Which TWO practices help ensure the reliability of a Cloud Functions deployment? (Choose two.)

Select 2 answers

A.Deploy functions in a single region to minimize latency.

B.Configure a VPC connector for all functions.

C.Set maximum instances to 1 to avoid resource contention.

D.Use Cloud Tasks to decouple function invocations.

E.Implement retry policies for background functions.

AnswersD, E

Cloud Tasks provides retries and scheduling.

Why this answer

Option D is correct because Cloud Tasks decouples function invocations by queuing requests and delivering them asynchronously, which improves reliability by handling spikes in traffic without dropping requests and providing automatic retries on failure. Option E is correct because implementing retry policies for background functions (e.g., Cloud Functions triggered by Pub/Sub or Cloud Storage) ensures that transient failures are automatically retried, increasing the overall reliability of the deployment.

Exam trap

Google Cloud often tests the misconception that capping scale (e.g., max instances = 1) improves reliability, when in fact it reduces fault tolerance and increases latency under load.
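
Retry behavior for transient failures can be illustrated at the application level. The sketch below is a generic retry-with-exponential-backoff helper, not the platform's built-in retry setting; the function name, attempt count, and delays are assumptions chosen for illustration. Because a retried call may run more than once, the wrapped operation should be idempotent.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError("attempts must be at least 1")
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.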

71

MCQmedium

A BigQuery table in a data pipeline receives daily data loads. To control storage costs, the team wants table data older than 180 days to be automatically deleted at the table level, not the dataset level. How should this be configured?

A.Set a dataset-level default table expiration of 180 days in the dataset properties

B.Use a Cloud Scheduler job to run a DELETE statement on rows older than 180 days nightly

C.Configure partition expiration on a date-partitioned table to expire partitions after 180 days

D.Set a table-level TTL using BigQuery's TTL API with a 180-day value

AnswerC

For date-partitioned tables, partition expiration automatically deletes partitions older than the specified number of days — the most efficient and zero-maintenance approach for time-series data.

Why this answer

Option C is correct because BigQuery's partition expiration feature allows you to automatically delete entire partitions from a date-partitioned table after a specified number of days. By setting the partition expiration to 180 days, all data in partitions older than 180 days is dropped at the table level, meeting the requirement without affecting other tables in the dataset.

Exam trap

Google Cloud often tests the distinction between dataset-level defaults and table-level partition expiration, and the trap here is that candidates confuse dataset-level table expiration (which deletes entire tables) with the requirement to delete only old rows within a single table.

How to eliminate wrong answers

Option A is wrong because dataset-level default table expiration is a dataset setting rather than a setting on the specific table, and it deletes entire tables, not rows or partitions. Option B is wrong because a Cloud Scheduler job that runs a DELETE statement incurs query costs and requires ongoing maintenance instead of using BigQuery's native expiration settings. Option D is wrong because BigQuery has no 'TTL API'; a table can be given an expiration time, but that deletes the whole table when it elapses, whereas partition expiration removes only the old partitions.
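
In BigQuery, partition expiration is a table option (`partition_expiration_days`). The sketch below does not call BigQuery; it is a hypothetical whole-day approximation of the rule, showing which daily partitions would fall outside a 180-day window on a given date. The function name and sample dates are illustrative only.

```python
from datetime import date, timedelta
from typing import Iterable, List


def expired_partitions(
    partition_dates: Iterable[date], today: date, expiration_days: int = 180
) -> List[date]:
    """Return partition dates older than the expiration window, oldest first."""
    cutoff = today - timedelta(days=expiration_days)
    return sorted(d for d in partition_dates if d < cutoff)


# Illustrative daily partitions checked against a 180-day window.
partitions = [date(2024, 1, 1), date(2024, 1, 3), date(2024, 6, 30)]
print(expired_partitions(partitions, today=date(2024, 7, 1)))
```

Only whole partitions are dropped, which is why the table must be partitioned by date for this approach to work.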

72

MCQmedium

A team stores sensitive configuration files in Cloud Storage that internal services download at startup. External partners occasionally need time-limited access to specific files without creating GCP accounts. Which feature grants temporary access without modifying bucket permissions?

A.Make the specific files publicly readable and share the direct URL

B.Generate a Signed URL for the specific files with the required expiration time

C.Create a temporary GCP service account for the partner and share its JSON key

D.Enable uniform bucket-level access and create a public IAM binding for 24 hours

AnswerB

Signed URLs are cryptographically signed, time-limited URLs that grant access to specific Cloud Storage objects. Partners access the file via the URL without needing GCP credentials.

Why this answer

Option B is correct because Signed URLs provide time-limited, granular access to specific Cloud Storage objects without altering the underlying bucket permissions. The partner receives a URL that embeds authentication information and an expiration time, enabling secure, temporary downloads without requiring a GCP account or IAM role.

Exam trap

Google Cloud often tests the distinction between Signed URLs (object-level, temporary, no IAM changes) and Signed Policy Documents (form uploads) or public access, trapping candidates who confuse 'temporary access' with 'making objects public' or 'creating temporary credentials.'

How to eliminate wrong answers

Option A is wrong because making files publicly readable grants unrestricted access to anyone with the URL, violating the requirement for time-limited access and potentially exposing sensitive data indefinitely. Option C is wrong because creating a temporary service account and sharing its JSON key violates security best practices (key exposure risk) and requires the partner to manage GCP credentials, which contradicts the 'without creating GCP accounts' requirement. Option D is wrong because a public IAM binding exposes every object in the bucket rather than specific files, does not expire on its own after 24 hours, and modifies bucket-level permissions, which the question explicitly forbids.
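
The idea behind a signed URL, an expiry time plus a signature that covers both the path and the expiry, can be shown with a toy example. This is explicitly not Cloud Storage's signing algorithm: it is a conceptual sketch using HMAC from the Python standard library, and the secret, parameter names, and URL layout are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key, for illustration only


def _signature(path: str, expires: int) -> str:
    """HMAC over the path and expiry, so neither can be altered."""
    return hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()


def sign_url(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return the path with an expiry timestamp and signature appended."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{path}?{query}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    return hmac.compare_digest(sig, _signature(parsed.path, expires)) and current < expires
```

Changing the path or the expiry invalidates the signature, which is what lets a URL grant access to one object for a limited time without any change to bucket permissions.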

73

MCQmedium

An organization has a VPC with subnets in us-central1 and europe-west1. They want to allow traffic from a specific on-premises IP range to reach a Compute Engine instance in europe-west1, but only through a single Cloud VPN tunnel attached to the us-central1 gateway. What configuration is required?

A.Create a route in us-central1 with the on-premises range and next hop set to the VPN tunnel. Add a firewall rule allowing the traffic.

B.Use policy-based routing on the Cloud VPN gateway to route the traffic to europe-west1.

C.Create a static route for the on-premises range in the europe-west1 subnet pointing to the VPN tunnel in us-central1.

D.Configure the VPN tunnel with BGP to advertise the on-premises range to both regions.

AnswerA

This routes traffic through the desired tunnel.

Why this answer

Option A is correct because the VPN tunnel is attached to the us-central1 gateway, and a static route with the on-premises IP range as the destination and the VPN tunnel as the next hop sends the VPC's traffic for that range, including the instance's replies, through the tunnel. Since the VPC is global, the route applies to all regions, and the Compute Engine instance in europe-west1 is reachable as long as the on-premises traffic enters the VPC through the us-central1 tunnel. A firewall rule is required to allow the inbound traffic from the on-premises range to the instance.

Exam trap

The trap here is that candidates assume routes must be created in the same region as the destination instance, but in a global VPC, a route in one region can direct traffic to instances in another region as long as the next hop is valid and the traffic enters through the correct gateway.

How to eliminate wrong answers

Option B is wrong because policy-based routing on a VPN tunnel only defines which local and remote IP ranges (traffic selectors) the tunnel carries; it does not steer traffic toward a particular region, and the option omits the firewall rule the traffic still needs. Option C is wrong because static routes are defined on the VPC network, not on an individual subnet; a single network-level route with the VPN tunnel as its next hop already applies to instances in every region, including europe-west1. Option D is wrong because BGP requires a Cloud Router and dynamic routing, which the scenario does not call for; with the default regional dynamic routing mode, routes learned in us-central1 would not be applied in europe-west1, and the option also omits the firewall rule.
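
Because the route belongs to the VPC network rather than to a region, the next-hop decision depends only on the destination address. The sketch below models that with a most-specific-prefix lookup; the route table, the on-premises range, and the next-hop names are hypothetical.

```python
import ipaddress
from typing import List, Optional, Tuple

# Hypothetical network-wide route table: (destination CIDR, next hop).
ROUTES: List[Tuple[str, str]] = [
    ("0.0.0.0/0", "default-internet-gateway"),
    ("192.168.10.0/24", "vpn-tunnel-us-central1"),
]


def next_hop(destination_ip: str, routes: List[Tuple[str, str]] = ROUTES) -> Optional[str]:
    """Pick the most specific (longest-prefix) route containing the address."""
    addr = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), hop)
        for cidr, hop in routes
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# A reply from the europe-west1 instance to an on-premises host uses the tunnel.
print(next_hop("192.168.10.5"))
```

Note that the lookup takes no region argument: an instance in europe-west1 resolves the same next hop as one in us-central1.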

74

MCQmedium

A Cloud CDN cache is serving stale content after a website update. New files were deployed to Cloud Storage but CDN is still serving the old versions to some users. What is the fastest way to force CDN to serve the updated content?

A.Wait for the CDN TTL to expire — cached content automatically refreshes

B.Run a CDN cache invalidation for the affected URL paths

C.Delete and recreate the Cloud Storage bucket — CDN will detect the new bucket as a fresh origin

D.Disable Cloud CDN temporarily — all users will hit the origin until CDN is re-enabled

AnswerB

`gcloud compute url-maps invalidate-cdn-cache [URL_MAP] --path=[PATH_PATTERN]` immediately purges matching cached content. CDN fetches fresh content on the next request.

Why this answer

Option B is correct because Cloud CDN supports cache invalidation, which immediately removes cached objects from edge caches for specified URL paths. This forces the CDN to fetch fresh content from the origin (Cloud Storage) on the next request, providing the fastest way to serve updated content without waiting for TTL expiry.

Exam trap

Google Cloud often tests the misconception that modifying the origin (e.g., deleting/recreating a bucket) automatically clears the CDN cache, when in fact the CDN cache is independent and requires explicit invalidation or TTL expiry to refresh.

How to eliminate wrong answers

Option A is wrong because waiting for TTL expiry is passive and can take minutes to hours depending on the configured cache duration, which is not the fastest solution. Option C is wrong because deleting and recreating the Cloud Storage bucket does not affect CDN cache; the CDN still holds stale content from the old bucket URL, and the new bucket would require a new CDN configuration. Option D is wrong because disabling Cloud CDN temporarily disrupts service for all users and does not clear the cache; re-enabling it would still serve stale content until TTL expires or invalidation is performed.
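
When choosing a `--path` value for an invalidation, it helps to reason about which cached paths a pattern covers. The sketch below assumes the pattern is either an exact path or a prefix ending in `/*`; that rule set and the function name are assumptions for illustration, so confirm the accepted pattern syntax in the Cloud CDN documentation before relying on it.

```python
def matches_invalidation(pattern: str, cached_path: str) -> bool:
    """True if cached_path would be covered by the invalidation pattern.

    Assumed rules: a pattern is an exact path, or a prefix ending in '/*'.
    """
    if not pattern.startswith("/"):
        raise ValueError("pattern must start with '/'")
    if pattern.endswith("/*"):
        # Keep the trailing slash so '/images/*' does not match '/images-old/x'.
        return cached_path.startswith(pattern[:-1])
    return cached_path == pattern


cached = ["/index.html", "/images/logo.png", "/images-old/logo.png"]
print([p for p in cached if matches_invalidation("/images/*", p)])
```

Narrow patterns limit how much content the CDN has to refetch from the origin after the invalidation.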

75

MCQhard

A company runs a stable production workload on 20 n2-standard-8 VMs that run continuously year-round. Which pricing commitment maximizes cost savings on these VMs?

A.Sustained use discounts (automatically applied)

B.1-year committed use discount (CUD)

C.3-year committed use discount (CUD)

D.Switching to Spot VMs

AnswerC

3-year CUDs for N2 VMs offer up to 57% discount compared to on-demand pricing — the highest available discount for stable, continuously-running workloads.

Why this answer

The 3-year committed use discount (CUD) offers the highest discount rate (up to 57% for N2 machine types) compared to 1-year CUDs (up to 37%) or sustained use discounts (up to 20% on N2 machine types for running a VM the entire month). Since the workload runs 20 n2-standard-8 VMs continuously year-round, a 3-year CUD locks in the maximum savings for this predictable, steady-state usage.

Exam trap

Google Cloud often tests the misconception that sustained use discounts are always the best option for long-running workloads, but candidates must recognize that committed use discounts provide significantly higher savings for predictable, continuous usage, especially with a 3-year term.

How to eliminate wrong answers

Option A is wrong because sustained use discounts are applied automatically once a VM runs for more than 25% of a month, but on N2 machine types they max out at a 20% discount, which is far lower than the 3-year CUD's up to 57%. Option B is wrong because a 1-year CUD offers a lower discount (up to 37%) than a 3-year CUD, and since the workload runs continuously for multiple years, the longer commitment yields greater savings. Option D is wrong because Spot VMs can be preempted at any time, making them unsuitable for a stable production workload that requires continuous availability and cannot tolerate interruptions.
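
The savings can be put into numbers with simple arithmetic. The sketch below uses a hypothetical on-demand rate of $0.40 per VM-hour, which is not an actual n2-standard-8 price, and applies the "up to 57%" figure quoted above as a flat discount; real bills depend on region and on how the commitment is structured.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost(vm_count: int, hourly_rate: float, discount: float = 0.0) -> float:
    """Yearly cost for always-on VMs after applying a fractional discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return vm_count * hourly_rate * HOURS_PER_YEAR * (1.0 - discount)


# Hypothetical $0.40/hour rate; 0.57 is the 'up to 57%' 3-year CUD figure.
on_demand = annual_cost(20, 0.40)
committed = annual_cost(20, 0.40, discount=0.57)
print(f"on-demand: ${on_demand:,.0f}  3-year CUD: ${committed:,.0f}")
```

Swapping in another discount rate for the `discount` argument gives a like-for-like comparison of the other options.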
