Google Professional Cloud DevOps Engineer PCDOE Questions 151–225 | Page 3/7

151

MCQeasy

Your team receives an alert that the Error Reporting count for a critical service has increased tenfold in the last 10 minutes. You suspect a recent code deployment is the cause. What is the first action you should take?

A.Disable the alert to reduce noise.

B.Roll back the deployment to the previous version.

C.Increase the instance count to handle the load.

D.Open a post-mortem to document the incident.

AnswerB

Rolling back quickly mitigates user impact.

Why this answer

A tenfold increase in error reporting within 10 minutes strongly indicates a recent code deployment introduced a critical bug. The immediate priority is to restore service stability by rolling back the deployment to the previous known-good version, as this directly mitigates the root cause. Delaying the rollback risks further degradation of the service and potential data loss or corruption.

Exam trap

Google Cloud often tests the candidate's ability to prioritize immediate incident response over long-term analysis, trapping those who confuse 'post-mortem' (a retrospective activity) with 'first action' (a corrective action).

How to eliminate wrong answers

Option A is wrong because disabling the alert ignores the underlying problem and violates the principle of observability; alerts are meant to signal real issues, and suppressing them does not reduce errors. Option C is wrong because increasing the instance count does not fix the root cause—if the new code has a bug (e.g., a memory leak or an unhandled exception), adding more instances will only replicate the failure across more nodes, potentially amplifying the impact. Option D is wrong because opening a post-mortem is a reactive step that should occur after the immediate incident is resolved; performing it first wastes critical time and delays the necessary rollback.


152

MCQmedium

Your team has deployed a microservices application on Google Kubernetes Engine (GKE). You notice that one service has high latency during peak hours. The service is CPU-bound and uses a HorizontalPodAutoscaler (HPA) based on CPU utilization. What is the most likely cause of the latency?

A.The GKE cluster uses preemptible nodes that are frequently reclaimed.

B.The HPA's target CPU utilization is set too high, causing the autoscaler to react slowly.

C.The service uses a global external HTTP(S) load balancer with session affinity.

D.The application does not implement request autoscaling at the application layer.

AnswerB

A high target CPU threshold delays scaling, leading to latency.

Why this answer

Option B is correct because when the HPA's target CPU utilization is set too high, the autoscaler waits until the average CPU utilization exceeds that threshold before scaling up. During peak hours, the service becomes CPU-bound and latency increases as pods are overwhelmed, but the HPA reacts slowly because it only triggers when the high threshold is breached, causing a delay in adding new pods to handle the load.
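The effect of the target value can be sketched with the replica formula documented for the Kubernetes HPA: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). The pod count and utilization figures below are made-up inputs, and the 10% tolerance band is assumed to be left at its documented default.

```python
import math


def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float, tolerance: float = 0.1) -> int:
    """Replica count the HPA formula would recommend for one evaluation."""
    ratio = current_util / target_util
    # Inside the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical peak-hour snapshot: 4 pods averaging 85% of requested CPU.
for target in (50, 70, 90):
    print(f"target={target}% -> {desired_replicas(4, 85, target)} replicas")
```

With a 90% target, the same load produces no scale-up at all, so the existing pods keep running close to saturation while latency climbs.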

Exam trap

Google Cloud often tests the misconception that HPA scaling is instantaneous or that CPU-bound latency is caused by external factors like load balancers or node preemption, when the real issue is the HPA threshold configuration and its delayed reaction to sustained high utilization.

How to eliminate wrong answers

Option A is wrong because preemptible nodes being reclaimed would cause pod evictions and potential downtime, not a gradual latency increase during peak hours; the scenario describes high latency specifically during peak hours, not intermittent failures. Option C is wrong because a global external HTTP(S) load balancer with session affinity does not inherently cause high latency for a CPU-bound service; session affinity can cause uneven load distribution but the primary issue here is CPU saturation and slow HPA reaction. Option D is wrong because request autoscaling at the application layer (e.g., concurrency limits or queue-based scaling) is not a standard Kubernetes mechanism and does not address the root cause of the HPA's slow reaction to CPU utilization; the HPA is already configured for CPU, so the issue is the threshold setting.


153

MCQmedium

A company uses Cloud Run for a critical service and needs to set up alerting for 5xx errors. They want to receive a notification within 1 minute of the error rate exceeding 1% for any 1-minute window. Which alerting approach should they use?

A.Set up a log-based metric for 5xx responses and create an alert on the metric.

B.Create a Cloud Logging sink to a Pub/Sub topic and trigger a Cloud Function that sends notifications.

C.Use Cloud Monitoring's log-based alerting to trigger on every 5xx log entry.

D.Create a Cloud Monitoring alerting policy using the 'Request count' metric with a condition that compares the ratio of 5xx responses to total requests over a 1-minute window.

AnswerD

Cloud Monitoring can evaluate a ratio condition on the built-in request metric over a 1-minute window, which meets the requirement of alerting within 1 minute.

Why this answer

Option D is correct because Cloud Monitoring's alerting policies can directly compute the ratio of 5xx responses to total requests using the 'Request count' metric with a custom ratio condition over a 1-minute window. This approach meets the 1-minute notification latency requirement without additional infrastructure, as Cloud Monitoring evaluates the metric every 60 seconds and triggers alerts based on the sliding window.
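The arithmetic behind such a ratio condition is simple enough to sketch. The snippet below is only an illustration of the 1% check per 1-minute window, not the Cloud Monitoring API; the `MinuteSample` type and the request counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MinuteSample:
    """Request counts observed in one 1-minute window (hypothetical data)."""
    total_requests: int
    responses_5xx: int


def breaches_threshold(sample: MinuteSample, threshold: float = 0.01) -> bool:
    """True when the 5xx share of requests in the window exceeds the threshold."""
    if sample.total_requests == 0:
        return False  # no traffic, so there is no ratio to evaluate
    return sample.responses_5xx / sample.total_requests > threshold


windows = [MinuteSample(12_000, 60), MinuteSample(11_500, 240), MinuteSample(0, 0)]
print([breaches_threshold(w) for w in windows])
```

Only the second window (240 of 11,500 requests, roughly 2.1%) crosses the 1% line; alerting on the ratio rather than a raw 5xx count keeps the condition meaningful as traffic volume changes.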

Exam trap

Google Cloud often tests the distinction between log-based metrics (which have inherent sampling and aggregation delays) and built-in request metrics (which are real-time and window-aware), leading candidates to mistakenly choose log-based alerting for rate-based conditions.

How to eliminate wrong answers

Option A is wrong because log-based metrics are sampled and aggregated with a typical latency of 2–5 minutes, which cannot guarantee notification within 1 minute of the error rate exceeding the threshold. Option B is wrong because creating a Cloud Logging sink to Pub/Sub and triggering a Cloud Function introduces additional latency from log ingestion, Pub/Sub delivery, and function execution, making it impossible to meet the 1-minute SLA; it also lacks native ratio computation. Option C is wrong because Cloud Monitoring's log-based alerting triggers on every individual log entry, not on a rate or ratio over a time window, so it cannot detect when the error rate exceeds 1% in a 1-minute window.


154

MCQeasy

You are bootstrapping a Google Cloud organization. You need to set up a hierarchical structure that allows you to apply policies to groups of projects based on their environment (e.g., development, staging, production). What is the recommended way to organize resources?

A.Use resource tags to label projects by environment and apply policies via tag-based conditions.

B.Create folders under the organization for each environment and place projects in the appropriate folder.

C.Create separate organizations for each environment.

D.Use labels on projects to identify environments and then use Cloud Asset Inventory to enforce policies.

AnswerB

Folders allow hierarchical policy inheritance and grouping.

Why this answer

Option B is correct because Google Cloud's resource hierarchy (Organization -> Folders -> Projects) is specifically designed to group projects by environment and apply consistent policies (e.g., IAM, organization policies) at the folder level. By creating folders for development, staging, and production, you can enforce environment-specific controls (like VPC Service Controls or resource location restrictions) without duplicating policies per project.

Exam trap

Google Cloud often tests the distinction between metadata (tags/labels) and hierarchical policy inheritance (folders), leading candidates to choose tags or labels because they seem simpler, but folders are the only mechanism that provides automatic, inheritable policy enforcement across groups of projects.

How to eliminate wrong answers

Option A is wrong because resource tags are metadata key-value pairs used for fine-grained access control via IAM conditions, but they do not create a hierarchical structure for policy inheritance; policies must be explicitly attached to each project or resource, leading to management overhead. Option C is wrong because creating separate organizations for each environment violates the recommended single-organization model, breaks centralized billing and audit logging, and prevents cross-environment resource sharing (e.g., shared VPC). Option D is wrong because labels are metadata for billing and filtering, not for policy enforcement; Cloud Asset Inventory is an asset discovery and monitoring tool, not a policy enforcement mechanism, and cannot apply or enforce policies based on labels.


155

MCQhard

Your company runs a microservices application on a private GKE cluster with Workload Identity enabled. Services communicate via gRPC and HTTP. After a recent update to the payment service, users report intermittent 503 errors and 2-second latency spikes during peak hours (10 AM - 12 PM). Cloud Monitoring shows the payment service's CPU utilization averages 60%, but memory spikes to 90% during errors. The existing alert on HTTP 503 responses fires only after 5 consecutive errors over 5 minutes, but the errors are sporadic. You need to diagnose and resolve the issue. What should you do?

A.Increase the memory limits for the payment service containers and set a memory request equal to the limit. Then restart the pods to clear any memory leaks.

B.Switch the payment service's HTTP/2 protocol to HTTP/1.1 to reduce overhead, and increase the memory request limit to avoid out-of-memory errors.

C.Enable detailed logging and metrics for the payment service in Cloud Logging and Cloud Monitoring. Analyze logs around the error timestamps to identify memory consumption patterns, and lower the alert threshold for 503 errors to trigger on 2 errors within 1 minute. Also, set up a custom alert on memory usage exceeding 85%.

D.Scale the payment service horizontally by increasing the minimum number of replicas to handle peak load, and adjust the HPA to scale faster.

AnswerC

This approach provides the necessary observability to correlate memory spikes with errors, and adjusts alerts to catch issues earlier, enabling proactive diagnosis.

Why this answer

Option C is correct because the issue requires a two-pronged approach: first, enable detailed logging and metrics to diagnose the root cause (memory spikes during peak hours), and second, adjust alerting thresholds to catch sporadic 503 errors earlier (2 errors in 1 minute) and add a proactive memory usage alert at 85%. This aligns with the PCDOE incident management process of first gathering data before making changes, and ensures you can correlate memory pressure with gRPC/HTTP errors in a GKE environment with Workload Identity.

Exam trap

Google Cloud often tests the misconception that scaling or resource adjustments should be the first step, when in reality the PCDOE framework emphasizes 'observe before act' — you must enable logging and metrics to diagnose the issue before making any configuration changes.

How to eliminate wrong answers

Option A is wrong because increasing memory limits and restarting pods does not address the root cause of intermittent memory spikes during peak load; it only masks the symptom and may cause unnecessary downtime. Option B is wrong because switching from HTTP/2 to HTTP/1.1 would break gRPC communication (gRPC requires HTTP/2), and increasing memory request limits without understanding the pattern does not resolve the sporadic errors. Option D is wrong because scaling horizontally without first diagnosing why memory spikes to 90% during peak hours may lead to inefficient resource usage and does not fix the underlying memory consumption issue; the HPA may already be configured but the memory pressure is causing OOM kills or latency.


156

Multi-Selecthard

You are designing alerting policies for a microservice architecture. Which TWO metrics are most suitable for triggering a page to the on-call engineer?

Select 2 answers

A.Latency P99 exceeding the SLO target for 5 minutes.

B.Error budget burn rate exceeding 10x in 1 hour.

C.Number of requests per second.

D.CPU utilization at 50%.

E.Memory usage trend.

AnswersA, B

Breaching latency SLO directly impacts users.

Why this answer

Error budget burn rate and high latency (P99 breaching SLO) directly indicate customer-facing issues and require immediate attention. CPU and request count are less critical.
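As a rough illustration of the burn-rate signal, the sketch below uses the common definition of burn rate as the observed error rate divided by the error budget (1 minus the SLO target). The 99.9% SLO and the error rates are assumed values chosen for the example.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def should_page(error_rate: float, slo_target: float, threshold: float = 10.0) -> bool:
    """Page the on-call engineer when the burn rate exceeds the threshold."""
    return burn_rate(error_rate, slo_target) > threshold


# Assumed 99.9% availability SLO, i.e. a 0.1% error budget.
print(should_page(0.012, 0.999))  # 1.2% errors: burn rate of about 12
print(should_page(0.005, 0.999))  # 0.5% errors: burn rate of about 5
```

A raw metric such as requests per second or CPU at 50% has no equivalent link to the SLO, which is why it is a poor paging signal.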


157

Multi-Selecthard

A company runs a high-traffic web application on GKE. Which three practices can help optimize performance under load? (Select THREE.)

Select 3 answers

A.Use a global load balancer

B.Use a multi-cluster ingress

C.Store session state in Cloud Memorystore

D.Enable HorizontalPodAutoscaler with custom metrics

E.Increase the number of nodes to maximum

AnswersA, C, D

Routes traffic to nearest backend, reducing latency.

Why this answer

A global load balancer (GLB) distributes incoming traffic across multiple GKE clusters and regions using Google's global network infrastructure. It terminates traffic at the edge, reducing latency by routing users to the closest healthy backend, and provides DDoS protection and SSL offloading. This is essential for high-traffic web applications to handle load spikes and improve response times.

Exam trap

Google Cloud often tests the misconception that simply adding more nodes (Option E) is a valid performance optimization, when in fact it is a costly and inefficient approach that ignores the need for dynamic scaling and resource efficiency.


158

MCQmedium

A company is using Google Kubernetes Engine (GKE) with multiple node pools. They notice that their monthly costs are higher than expected. Upon review, they find that several preemptible VMs are being recreated frequently, leading to sustained usage costs. What is the most cost-effective solution to reduce costs?

A.Purchase committed use discounts for the preemptible VMs.

B.Increase the number of preemptible VMs to spread the workload.

C.Enable sustained use discounts for the existing VMs.

D.Migrate to Spot VMs, which have a lower price and no maximum runtime.

AnswerD

Spot VMs are the recommended replacement for preemptible VMs and offer lower costs without the 24-hour limit.

Why this answer

Spot VMs are the recommended replacement for preemptible VMs, offering the same low price but without the 24-hour maximum runtime limit. This eliminates the frequent recreation and sustained usage costs caused by preemptible VMs being terminated and restarted, directly reducing monthly expenses.

Exam trap

Google Cloud often tests the misconception that preemptible VMs are the best low-cost option, but the trap here is that Spot VMs use the same pricing model and have no maximum runtime, making them the better choice for cost reduction.

How to eliminate wrong answers

Option A is wrong because committed use discounts (CUDs) cannot be applied to preemptible or Spot VMs; CUDs are only available for regular (on-demand) VMs and require a 1- or 3-year commitment, which would increase costs for short-lived workloads. Option B is wrong because increasing the number of preemptible VMs would amplify the frequency of recreations and associated sustained usage costs, not reduce them. Option C is wrong because sustained use discounts are automatically applied to on-demand VMs based on monthly usage, but preemptible VMs are not eligible for sustained use discounts; enabling them has no effect on preemptible VM costs.


159

MCQeasy

An organization is setting up a new Google Cloud organization and wants to enforce consistent resource naming conventions and policies across all projects. Which service should they use?

A.Organization Policies

B.VPC Service Controls

C.Cloud Run

D.Cloud Armor

AnswerA

Organization Policies can enforce constraints like resource location, naming, and service usage.

Why this answer

Organization Policies allow you to centrally constrain specific behaviors and enforce consistent naming conventions across all projects in a Google Cloud organization using constraints like `constraints/gcp.resourceLocations` or custom constraints via the `OrganizationPolicy` API. This service is the correct choice because it directly applies hierarchical policy enforcement at the organization, folder, or project level, ensuring resource naming and other governance rules are uniformly applied without requiring per-project configuration.

Exam trap

The trap here is that candidates often confuse Organization Policies with VPC Service Controls because both involve 'policies' and 'controls,' but VPC Service Controls focus on data exfiltration prevention, not resource naming or general governance, leading to a common misselection.

How to eliminate wrong answers

Option B (VPC Service Controls) is wrong because it is designed to mitigate data exfiltration risks by defining security perimeters around VPC services and API access, not for enforcing resource naming conventions or policies across projects. Option C (Cloud Run) is wrong because it is a fully managed compute platform for running stateless containers, not a policy enforcement or naming convention service. Option D (Cloud Armor) is wrong because it provides web application firewall (WAF) and DDoS protection for HTTP(S) load balancers, unrelated to organizational resource naming or policy governance.


160

MCQmedium

A company is bootstrapping their Google Cloud organization with multiple departments. Each department has several projects. They want to apply different IAM policies and organization policies per department. What is the recommended way to structure the resource hierarchy?

A.Use multiple organizations, one per department.

B.Use organization-level IAM for all departments.

C.Create a project for each department, then use labels to separate.

D.Create a folder for each department, then place projects under that folder.

AnswerD

Folders allow separate policies per department.

Why this answer

Option D is correct because folders in the Google Cloud resource hierarchy allow you to group projects under a common node and apply both IAM policies and organization policies (e.g., constraints from the Organization Policy Service) at the folder level. This enables each department to have its own administrative boundary and policy inheritance, while still being under a single organization for centralized billing and auditing.
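Policy inheritance down the hierarchy can be modeled in a few lines. This is a toy model, not the Resource Manager API: the `Node` class, the folder and project names, and the groups are all hypothetical, and it only captures the rule that IAM bindings granted on a parent also apply to its descendants.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """An organization, folder, or project in a toy resource hierarchy."""
    name: str
    parent: Optional["Node"] = None
    bindings: set = field(default_factory=set)  # (role, member) pairs set on this node


def effective_bindings(node: Node) -> set:
    """Union of the bindings on the node and on every ancestor above it."""
    result = set()
    current = node
    while current is not None:
        result |= current.bindings
        current = current.parent
    return result


org = Node("example-org", bindings={("roles/logging.viewer", "group:auditors@example.com")})
finance = Node("finance", org, {("roles/editor", "group:finance-devs@example.com")})
hr = Node("hr", org, {("roles/editor", "group:hr-devs@example.com")})
payroll = Node("payroll-prod", finance)

print(sorted(effective_bindings(payroll)))
```

The `payroll-prod` project picks up the organization-wide auditor binding and the finance folder's binding, but nothing from the `hr` folder, which is the per-department separation the question asks for.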

Exam trap

The trap here is that candidates may confuse labels (which are only for metadata and cost tracking) with folders (which are the correct mechanism for hierarchical policy enforcement), leading them to choose Option C instead of D.

How to eliminate wrong answers

Option A is wrong because using multiple organizations per department breaks centralized billing, audit logging, and cross-department resource sharing; Google Cloud recommends a single organization with folders for multi-department setups. Option B is wrong because applying organization-level IAM for all departments would grant the same permissions across all departments, violating the requirement for different IAM policies per department. Option C is wrong because labels are metadata tags used for filtering and cost allocation, not for enforcing IAM or organization policies; they cannot provide the hierarchical policy inheritance that folders offer.


161

Multi-Selecthard

Which TWO actions should a DevOps engineer take to reduce latency for a global user base accessing a web application hosted on Compute Engine?

Select 2 answers

A.Configure instance groups in multiple regions with a global load balancer.

B.Enable HTTP/2 on the load balancer.

C.Enable Cloud CDN to cache static content.

D.Increase the machine type of the instances.

E.Use Cloud Load Balancing with global anycast IP.

AnswersA, C

Multi-region deployment allows serving users from the closest region, reducing latency.

Why this answer

Option A is correct because deploying instance groups in multiple regions and using a global load balancer (e.g., Google Cloud External HTTP(S) Load Balancer) allows user requests to be routed to the closest healthy backend, reducing network round-trip time. This geo-distribution minimizes latency by serving content from the nearest regional endpoint rather than a single centralized location.

Exam trap

Google Cloud often tests the misconception that a global anycast IP alone (Option E) is sufficient to reduce latency, but the trap is that anycast only optimizes the frontend routing to the edge; without multi-region backends, the request must still travel to the single backend region, negating the latency benefit.


162

MCQhard

An e-commerce company uses Cloud SQL for MySQL to support their online store. The database has a 500GB dataset. They notice that monthly costs have increased significantly. On reviewing the billing export to BigQuery, they see that the highest cost is from 'Cloud SQL - Storage' and 'Cloud SQL - Backups'. They currently have automated backups enabled with a retention of 7 days. They also take manual backups before every deployment, which are stored for 30 days. They want to reduce backup storage costs without compromising disaster recovery. What should they do?

A.Reduce automated backup retention to 2 days and keep manual backups as is.

B.Reduce manual backup retention to 7 days and increase automated backup retention to 14 days.

C.Use Cloud SQL's point-in-time recovery (PITR) feature instead of manual backups.

D.Export backups to Cloud Storage and delete old ones from Cloud SQL.

AnswerD

Exporting backups to Cloud Storage allows you to use cheaper storage classes (e.g., Nearline) and delete the original backups from Cloud SQL, reducing costs.

Why this answer

Option D is correct because exporting backups to Cloud Storage and then deleting them from Cloud SQL reduces storage costs, since Cloud Storage has cheaper classes. Option A reduces retention but may risk data loss. Option B still retains many backups. Option C (PITR) requires binary logs, which also incur storage costs.


163

MCQeasy

A DevOps team is bootstrapping a new Google Cloud organization. They want to grant a group of engineers the ability to create and manage projects within the organization, but not to modify organization policies or folders. Which IAM role should be assigned at the organization level?

A.roles/owner

B.roles/resourcemanager.folderAdmin

C.roles/editor

D.roles/resourcemanager.projectCreator

AnswerD

This role allows project creation and grants Project Owner on new projects.

Why this answer

Option D, roles/resourcemanager.projectCreator, is correct because it grants the specific permission to create projects within the organization (resourcemanager.projects.create) without allowing modifications to organization policies or folders. The creator becomes Owner of each project they create, which covers managing those projects, while the role itself contains no permissions for organization-level policy management (e.g., resourcemanager.organizations.setIamPolicy) or folder administration (e.g., resourcemanager.folders.update).

Exam trap

The trap here is that candidates often confuse roles/resourcemanager.projectCreator with roles/editor or roles/owner, mistakenly thinking that project creation requires broader permissions, when in fact the projectCreator role is specifically designed to isolate project management from higher-level administrative actions.

How to eliminate wrong answers

Option A is wrong because roles/owner grants full control over all resources, including the ability to modify organization policies and folders, which violates the requirement to restrict such actions. Option B is wrong because roles/resourcemanager.folderAdmin grants permissions to manage folders (e.g., create, delete, update folders) but does not include the specific project creation permissions needed (e.g., resourcemanager.projects.create). Option C is wrong because roles/editor grants broad edit permissions across all services, including the ability to modify organization policies and folders, which exceeds the required scope.


164

MCQmedium

Refer to the exhibit. A DevOps engineer noticed that the nightly batch processing costs are higher than expected. After running the above command, what is the most likely cause?

A.The instances are not preemptible, leading to higher compute costs

B.The instances are running in different zones, increasing egress costs

C.The instances are using the n2-standard-4 machine type, which is expensive

D.The instances have automatic restart enabled, incurring additional charges

AnswerA

Batch jobs can use preemptible VMs to save 60-91% on compute costs.

Why this answer

Option A is correct because the instances are not configured as preemptible, leading to higher compute costs. For batch jobs that tolerate interruptions, preemptible VMs are more cost-effective. Option B is not the main issue, as the instances are in the same region. Option C is true but not the primary cost driver for batch workloads. Option D is incorrect because automatic restart does not incur additional charges.


165

Multi-Selectmedium

A DevOps team is analyzing Google Cloud costs and notices that spending on BigQuery has increased significantly. They want to reduce costs without impacting ongoing analytical workloads. Which TWO actions should they take? (Choose two.)

Select 2 answers

A.Switch to on-demand pricing to pay only for queries run.

B.Enable column-level security to restrict access to sensitive data.

C.Set custom cost controls like query quotas and maximum bytes billed per query.

D.Delete unused datasets to reduce storage costs.

E.Implement flat-rate pricing with reservations for consistent workloads.

AnswersC, E

Limits prevent expensive queries from running unbounded.

Why this answer

Option C is correct because BigQuery allows you to set custom cost controls such as query quotas (e.g., concurrent queries per project) and maximum bytes billed per query. These controls cap resource usage at the query level, preventing runaway costs while still allowing analytical workloads to run within defined limits. This directly addresses cost spikes without blocking ongoing operations.
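The idea of a per-query byte cap can be sketched as a simple guard. The code below is a stand-in for the behavior described above, not the BigQuery client library: `enforce_cap`, `BytesLimitExceeded`, and the 10 GiB cap are hypothetical, and the byte estimates are assumed to come from something like a dry run.

```python
GIB = 2 ** 30


class BytesLimitExceeded(Exception):
    """Raised when a query would process more bytes than the configured cap."""


def enforce_cap(estimated_bytes: int, max_bytes_billed: int) -> int:
    """Reject the query up front if its estimate exceeds the cap."""
    if estimated_bytes > max_bytes_billed:
        raise BytesLimitExceeded(
            f"query would process {estimated_bytes / GIB:.1f} GiB, "
            f"cap is {max_bytes_billed / GIB:.1f} GiB"
        )
    return estimated_bytes


cap = 10 * GIB  # assumed cap for the example
for estimate in (3 * GIB, 250 * GIB):
    try:
        enforce_cap(estimate, cap)
        print(f"{estimate // GIB} GiB: allowed")
    except BytesLimitExceeded as exc:
        print(f"{estimate // GIB} GiB: rejected ({exc})")
```

Routine queries pass through unchanged, while an accidental full-table scan is stopped before it is billed, which is why the control limits cost spikes without disrupting normal workloads.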

Exam trap

The trap here is that candidates confuse storage cost reduction (Option D) with the primary driver of BigQuery cost spikes, which is almost always query compute (bytes processed), not storage, and they may overlook that flat-rate pricing (Option E) is a valid cost-reduction strategy for consistent workloads.


166

MCQmedium

Refer to the exhibit. An App Engine application returns 504 errors. The application calls an external API and processes the result. Which change is most likely to resolve the errors?

A.Reduce the idle timeout in the scaling settings.

B.Increase the App Engine request timeout in app.yaml to 120 seconds.

C.Change the scaling type from automatic to manual.

D.Increase the number of instances to handle the load.

AnswerB

The default timeout is 60 seconds; increasing it allows more time.

Why this answer

A 504 error from App Engine indicates the request exceeded the timeout limit before the application could respond. The default App Engine request timeout is 60 seconds, and since the application calls an external API and processes the result, the total time may exceed this limit. Increasing the request timeout in app.yaml to 120 seconds allows the application more time to complete the external API call and processing, resolving the 504 error.

Exam trap

Google Cloud often tests the distinction between request timeout (which causes 504 errors) and scaling or load-related issues (which cause 503 errors or latency), so candidates mistakenly choose instance count or scaling type changes when the real problem is a timeout threshold.

How to eliminate wrong answers

Option A is wrong because reducing the idle timeout in scaling settings would cause instances to be shut down sooner when idle, potentially increasing cold starts and latency, but it does not address the request timeout that causes 504 errors. Option C is wrong because changing from automatic to manual scaling does not change the per-request timeout limit; manual scaling controls instance count and startup behavior, not the maximum time a single request can take. Option D is wrong because increasing the number of instances helps handle higher concurrency and load, but if a single request already exceeds the timeout, more instances will not prevent that request from timing out.


167

Matchingmedium

Match each monitoring concept to its purpose.


Concepts

Matches

Verify external accessibility of a service

Time taken to respond to a request

Percentage of failed requests

Number of requests processed per second

Degree to which a resource is fully utilized

Why these pairings

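The last four descriptions correspond to the four golden signals of monitoring (latency, errors, traffic, and saturation); the first describes an uptime check.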


168

MCQeasy

A startup is building a CI/CD pipeline for their Cloud Run service. They use Cloud Build to build a Docker image, push it to Artifact Registry, and then deploy to Cloud Run with the 'gcloud run deploy' command in the build config. The initial deployment works, but subsequent builds fail at the deploy step with a permission error: 'Permission denied to deploy revision to Cloud Run service'. The Cloud Build service account has the Cloud Run Developer role. The developers can manually deploy from their workstations using their own accounts. What is the most likely cause?

A.The developers added the '--impersonate-service-account' flag inadvertently in the build config.

B.The Cloud Run API is not enabled in the project.

C.The Cloud Build service account lacks Artifact Registry read permissions to pull the image.

D.The Cloud Build service account needs the 'iam.serviceAccounts.actAs' permission on the Cloud Run runtime service account.

AnswerD

The 'gcloud run deploy' command requires the caller to have the actAs permission on the service account that the Cloud Run service runs as.

Why this answer

Option D is correct: deploying to Cloud Run requires the deploying identity (here, the Cloud Build service account) to have the 'iam.serviceAccounts.actAs' permission on the runtime service account used by the Cloud Run service (the Compute Engine default service account or a custom one). The Cloud Run Developer role alone does not include this. Option A is incorrect because Cloud Build runs as its own service account, not with user credentials. Option B is incorrect because the error is permission-related, not about API enablement. Option C is incorrect because the error concerns deploying a revision, not pushing or pulling images in Artifact Registry.


169

MCQmedium

A company is using Cloud Build to build a Go application. The build fails with an error 'no Go files in /workspace'. What is the most likely cause?

A.The repository has no Go files.

B.The build step is running in the wrong directory.

C.The Go module is not properly initialized.

D.Cloud Build is unable to clone the repository.

AnswerB

The error indicates no Go files found in /workspace, confirming a directory mismatch.

Why this answer

Cloud Build copies the repository contents to /workspace. If the Go source files are in a subdirectory, the build step might be pointing to the wrong directory.
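One way to confirm the mismatch is to list which directories under the workspace actually contain Go files, then point the build step there (Cloud Build steps accept a `dir` field for this). The helper below is a hypothetical local check, and the `services/api` layout is an invented example.

```python
import tempfile
from pathlib import Path


def go_source_dirs(workspace: str) -> list:
    """Return workspace-relative directories that contain at least one .go file."""
    root = Path(workspace)
    return sorted({p.parent.relative_to(root).as_posix() for p in root.rglob("*.go")})


# Hypothetical repository layout: the Go code lives under services/api, not the root.
with tempfile.TemporaryDirectory() as workspace:
    src = Path(workspace) / "services" / "api"
    src.mkdir(parents=True)
    (src / "main.go").write_text("package main\n")
    print(go_source_dirs(workspace))
```

If the list does not include `.` (the workspace root), a Go build step that runs at the root of /workspace will report that there are no Go files.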


170

Multi-Selectmedium

Your team is running a high-traffic web application on Google Kubernetes Engine (GKE) and has configured Horizontal Pod Autoscaling (HPA) based on CPU utilization. Recently, the application experienced intermittent latency spikes during traffic bursts. You suspect that the HPA is not scaling quickly enough. Which TWO actions would most effectively improve the autoscaling responsiveness?

Select 2 answers

A.Increase the target CPU utilization percentage to 90% so that pods are less likely to be added.

B.Configure a custom metric in HPA based on the application's request latency (e.g., p99 latency).

C.Set the maxReplicas of the HPA to a value lower than the expected peak traffic to force faster scaling.

D.Reduce the --horizontal-pod-autoscaler-upscale-stabilization window in the HPA configuration to a lower value (e.g., 30 seconds).

E.Increase the --horizontal-pod-autoscaler-sync-period flag to 30 seconds and increase the --horizontal-pod-autoscaler-upscale-delay flag to 5 minutes.

AnswersB, D

Custom metrics like request latency provide a more direct and responsive signal to scale based on performance.

Why this answer

Option B is correct because a custom metric based on request latency (e.g., p99 latency) lets HPA react to application performance degradation directly, rather than relying solely on CPU utilization, which may lag behind traffic bursts. Latency gives a more immediate scaling signal, as latency spikes often precede CPU saturation in web applications. Option D is correct because a shorter scale-up stabilization window (for example, 30 seconds) reduces the time HPA waits before acting on scale-up recommendations, enabling a faster response to sudden load increases. In the `autoscaling/v2` API this window is set per HPA through `behavior.scaleUp.stabilizationWindowSeconds`; the 5-minute default applies to the scale-down window, while upstream Kubernetes defaults the scale-up window to 0 seconds, so option D helps only where a longer scale-up window has been configured.
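The effect of the target value on scaling can be sketched with the replica calculation documented for the Kubernetes HPA, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The function name and the sample numbers below are illustrative assumptions, and the sketch ignores the tolerance band and stabilization windows that the real controller applies.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core HPA ratio: ceil(currentReplicas * currentMetric / targetMetric)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * current_metric / target_metric)


# Hypothetical burst: 10 pods averaging 80% CPU utilization.
print(desired_replicas(10, 80, 50))  # 16: a 50% target adds pods
print(desired_replicas(10, 80, 90))  # 9: a 90% target adds no pods
```

With the same load, raising the target from 50% to 90% removes the scale-out entirely, which is why option A makes the autoscaler less responsive rather than more.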

Exam trap

Google Cloud often tests the misconception that increasing target CPU utilization or sync periods improves scaling speed, when in fact these changes reduce responsiveness; the trap is confusing 'stabilization' with 'delay' and assuming higher thresholds or longer intervals help with bursts.

171

Multi-Selecthard

An organization wants to optimize BigQuery costs. Which three practices should they implement? (Choose three.)

Select 3 answers

A.Store data in multiple tables per day without partitioning

B.Set a maximum bytes billed per query at the project level

C.Use table partitioning and clustering to reduce data scanned

D.Purchase BigQuery flat-rate reservations for all queries

E.Use authorized views to restrict data access instead of copying data

AnswersB, C, E

Prevents runaway queries from incurring high costs.

Why this answer

Options B, C, and E are correct. Setting a maximum bytes billed caps query costs; partitioning and clustering reduce the data scanned; authorized views eliminate data duplication. Option D is not always optimal, and option A increases costs.

172

Multi-Selectmedium

A DevOps team is bootstrapping a Google Cloud organization. They need to ensure that all projects have a consistent set of labels applied automatically. Which two approaches can they use? (Choose TWO.)

Select 2 answers

A.Use an organization policy that requires labels.

B.Use a Terraform module to create projects with labels.

C.Use a Cloud Function triggered by Project creation events to apply labels.

D.Use a folder-level constraint to enforce labels.

E.Use Cloud Asset Inventory to monitor labels.

AnswersA, C

The constraint constraints/resourcemanager.requireLabels can enforce label presence.

Why this answer

Option A is correct because an Organization Policy can require that all new projects carry specific labels. Note that `constraints/compute.requireOsLogin` is not a label constraint; for labels on resources, a custom constraint such as `constraints/gcp.resourceLabels` is used instead. For projects themselves, an organization policy with a list constraint can mandate that certain labels are present, and the policy blocks creation of any project that does not comply.

Exam trap

The trap here is that candidates often confuse folder-level resource constraints with project-level label enforcement, not realizing that folder constraints apply to resources inside the folder (like VMs or buckets) but not to the project resource itself.

173

MCQhard

Refer to the exhibit. A DevOps engineer applies this Terraform configuration but gets an error: "Error creating Project: googleapi: Error 403: The caller does not have permission to enable services". What is the most likely cause?

A.The service account used by Terraform lacks the Service Usage Admin role on the project.

B.The org_id is incorrect.

C.The organization requires that the Compute Engine API be enabled before project creation.

D.The project ID already exists.

AnswerA

Service Usage Admin grants permission to enable services.

Why this answer

The error 'The caller does not have permission to enable services' indicates that the identity (service account) used by Terraform to authenticate with the Google Cloud API lacks the required IAM permission to enable Google APIs on the project. The Service Usage Admin role (roles/serviceusage.serviceUsageAdmin) grants the necessary permissions, including serviceusage.services.enable and serviceusage.services.list, which are required to enable services like the Compute Engine API during resource creation. Without this role, the API call to enable services fails with a 403 error.

Exam trap

Google Cloud often tests the misconception that the error is due to a missing prerequisite API (like Compute Engine) or a duplicate project ID, when in fact the root cause is insufficient IAM permissions for the service account to enable services.

How to eliminate wrong answers

Option B is wrong because an incorrect org_id would typically cause a different error, such as 'Organization not found' or 'Invalid organization ID', not a permission error related to enabling services. Option C is wrong because the Compute Engine API does not need to be enabled before project creation; it is one of the services that Terraform attempts to enable during the project creation process, and the error is about the lack of permission to enable it, not about a prerequisite. Option D is wrong because if the project ID already exists, the error would be 'Error creating Project: googleapi: Error 409: Project already exists', not a 403 permission error.

174

MCQmedium

A DevOps team wants to optimize resource utilization for their GKE deployment. Which built-in Kubernetes resource can automatically adjust CPU and memory requests based on historical usage?

A.HorizontalPodAutoscaler

B.ResourceQuota

C.PodDisruptionBudget

D.VerticalPodAutoscaler

AnswerD

Correct. VPA adjusts resource requests based on usage.

Why this answer

The VerticalPodAutoscaler (VPA) is the correct choice because it automatically adjusts CPU and memory resource requests (and limits) for pods based on historical usage data, optimizing resource utilization without manual intervention. Unlike the HorizontalPodAutoscaler, which scales the number of pods, the VPA modifies the resource specifications of existing pods to match observed demand.

Exam trap

Google Cloud often tests the distinction between scaling the number of replicas (HPA) versus scaling the resources per pod (VPA), and candidates mistakenly choose HPA because they associate 'automatic adjustment' with scaling out, not adjusting requests.

How to eliminate wrong answers

Option A is wrong because HorizontalPodAutoscaler (HPA) adjusts the number of pod replicas based on CPU/memory metrics, not the per-pod resource requests. Option B is wrong because ResourceQuota sets hard limits on total resource consumption within a namespace, preventing overcommitment, but does not dynamically adjust requests based on usage. Option C is wrong because PodDisruptionBudget (PDB) controls the number of pods that can be voluntarily disrupted during maintenance, not resource request adjustments.

175

Multi-Selectmedium

You are an on-call engineer responding to a critical service incident affecting a production application. According to Google's Incident Management best practices, which TWO actions should you take immediately after declaring the incident?

Select 2 answers

A.Communicate the incident status to stakeholders and affected teams.

B.Notify the incident commander to take over coordination.

C.Begin documenting the incident for a postmortem report.

D.Roll back the latest deployment to the previous stable version.

E.Gather evidence and logs to identify the incident's impact and root cause.

AnswersA, E

Communication is a key initial step to keep everyone informed and coordinated.

Why this answer

Option A is correct because Google's Incident Management best practices emphasize that immediately after declaring an incident, the on-call engineer must communicate the incident status to stakeholders and affected teams. This ensures that everyone is aware of the ongoing issue, sets expectations, and prevents redundant troubleshooting. Early communication also helps in coordinating response efforts and reducing confusion during the critical initial phase.

Exam trap

Google Cloud often tests the distinction between immediate triage actions and later-stage mitigation or documentation steps, trapping candidates who confuse 'declaring an incident' with 'starting the fix' rather than recognizing that communication and initial impact assessment are the first priority.

176

MCQhard

A Cloud Run service experiences high cold start latency. The team has already set min-instances to 1. Which additional optimization can further reduce cold start impact?

A.Use HTTP/2 to speed up request handling

B.Reduce container image size and enable CPU boost during startup

C.Place the service on Cloud Run for Anthos on GKE

D.Increase the container memory limit

AnswerB

Smaller images load faster, and CPU boost provides extra CPU during startup, reducing latency.

Why this answer

Option B is correct because reducing the container image size decreases the time required to pull and unpack the container on a cold start, and enabling CPU boost during startup temporarily allocates additional CPU resources to accelerate the initialization process. Together, these directly address the root causes of cold start latency by minimizing both the image loading time and the application initialization time, even when min-instances is already set to 1.

Exam trap

Google Cloud often tests the misconception that increasing resources (memory or CPU) always improves performance, but in the context of cold starts, the bottleneck is typically image size and initialization speed, not steady-state resource limits.

How to eliminate wrong answers

Option A is wrong because HTTP/2 improves multiplexing and reduces latency for multiple concurrent requests, but it does not affect the cold start latency of a single instance that has not yet been initialized. Option C is wrong because Cloud Run for Anthos on GKE runs on a Kubernetes cluster, which introduces additional orchestration overhead and does not inherently reduce cold start latency compared to the fully managed Cloud Run service. Option D is wrong because increasing the container memory limit may allow the container to use more memory, but it does not speed up the startup process; in fact, larger memory allocations can sometimes increase cold start time due to resource provisioning delays.

177

Matchingmedium

Match each CI/CD concept to its definition.

Matches

Continuous integration: Automated build and test on every commit

Continuous delivery: Automated deployment to staging, manual to production

Continuous deployment: Fully automated release to production

Trunk-based development: Short-lived branches, frequent merges to main

Canary release: Gradual rollout to a subset of users

Why these pairings

Key DevOps practices for reliable releases.

178

MCQmedium

A DevOps engineer is responsible for cost optimization of GKE clusters. They have identified that many nodes are underutilized. What is the best approach to reduce costs without impacting availability?

A.Enable cluster autoscaler to automatically scale down nodes

B.Use spot VMs for all node pools

C.Delete underutilized nodes manually

D.Migrate to a single zone cluster

AnswerA

Cluster autoscaler adjusts node count based on demand, saving costs.

Why this answer

Option A is correct because cluster autoscaler automatically scales down nodes when they are underutilized. Option B may cause disruptions, since Spot VMs can be preempted at any time. Option C is manual and not scalable. Option D reduces availability.

179

MCQeasy

A company runs a CI/CD pipeline using Cloud Build and Cloud Deploy for a web application. The pipeline builds a container image, pushes it to Artifact Registry, and then deploys it to a GKE cluster using Cloud Deploy. Recently, the deployment step started failing with the error: 'INVALID_ARGUMENT: The release contains one or more images that are not in the target project's Artifact Registry.' The container image is built in the same project as the target cluster. The Cloud Build service account has been granted the roles/cloudbuild.builds.builder and roles/artifactregistry.admin on the project. The DevOps engineer verified that the image exists in Artifact Registry and the path is correct. What should the DevOps engineer do to resolve the issue?

A.Reconfigure Cloud Build to store images in a different Artifact Registry repository.

B.Grant the Cloud Deploy service account the roles/artifactregistry.reader role on the project.

C.Ensure that the Cloud Deploy delivery pipeline references the correct image path, including the project ID and repository.

D.Change the Cloud Build service account to include the roles/clouddeploy.jobRunner role.

AnswerC

The most common cause is a misconfigured image reference in the delivery pipeline. Correcting the path resolves the error.

Why this answer

Option C is correct because the error 'INVALID_ARGUMENT: The release contains one or more images that are not in the target project's Artifact Registry' indicates that the Cloud Deploy delivery pipeline is referencing an image path that does not match the actual location of the image. Even though the image exists in Artifact Registry and the path appears correct, Cloud Deploy validates the image reference (including project ID, repository name, and image tag) against the target project's registry. The engineer must ensure the delivery pipeline YAML or configuration explicitly specifies the correct full image path, including the project ID and repository, so that Cloud Deploy can resolve and verify the image during release creation.

Exam trap

The trap here is that candidates assume the error is a permissions issue (leading them to grant roles to service accounts) when it is actually a configuration mismatch in the image path referenced by the Cloud Deploy delivery pipeline.

How to eliminate wrong answers

Option A is wrong because storing images in a different repository does not address the root cause—the mismatch between the image path referenced in the Cloud Deploy pipeline and the actual image location; it would only shift the problem to another repository. Option B is wrong because the Cloud Deploy service account does not need roles/artifactregistry.reader; Cloud Deploy uses the Cloud Build service account or the user's credentials to read images, and the error is about path resolution, not permissions. Option D is wrong because the roles/clouddeploy.jobRunner role is for executing deployment jobs, not for resolving image references; the issue is a configuration mismatch in the delivery pipeline, not a missing role on the Cloud Build service account.

180

Matchingmedium

Match each Google Cloud deployment strategy to its description.

Matches

Blue/green deployment: Two identical environments; switch traffic

Rolling update: Gradually replace instances with new version

Canary deployment: Route small traffic percentage to new version

A/B testing: Compare two versions based on user metrics

Shadow deployment: New version receives mirrored traffic without impact

Why these pairings

Common deployment strategies for minimizing risk.

181

MCQhard

The above MQL query is used in a Cloud Monitoring dashboard. What does it display?

A.The total number of spans in 'my-service' each minute.

B.The maximum latency of spans grouped by span_id.

C.The 99th percentile latency of all spans in the 'my-service' service, every minute.

D.The 99th percentile latency of each span_id, every minute.

AnswerD

The group_by [span_id] combined with percentile(99) on latency gives per-span_id 99th percentile values.

Why this answer

The MQL query uses the `fetch` command to retrieve spans from the `my-service` service, then applies `percentile(99)` to the latency metric, and groups the result by `span_id` using the `into` clause. The `every 1m` parameter sets the alignment window to one minute. This produces a time series showing the 99th percentile latency for each distinct span_id, updated every minute.

Exam trap

Google Cloud often tests the distinction between aggregating across all spans (e.g., service-level percentile) versus grouping by a dimension like `span_id`, leading candidates to mistakenly choose a service-wide aggregation when the query explicitly groups by a finer granularity.

How to eliminate wrong answers

Option A is wrong because the query does not count spans; it calculates a percentile of latency, not a count. Option B is wrong because the query uses `percentile(99)`, not `max`, so it shows the 99th percentile, not the maximum latency. Option C is wrong because the query groups by `span_id` (via `into`), not by service, so it displays per-span_id percentiles, not a single aggregated value for the entire service.

182

MCQhard

You are troubleshooting a performance issue with a Compute Engine instance that is part of a managed instance group serving a web application. Users report intermittent high latency. You run the command shown in the exhibit. Based on the output, what is the most likely cause of the performance issue?

A.The instance is under-provisioned for CPU.

B.The instance is hitting the network egress bandwidth limit.

C.The service account lacks the necessary scopes for Cloud Monitoring and Cloud Trace.

D.The boot disk is too small, causing I/O contention.

AnswerA

The output does not show the machine type, but the disk size and service account suggest a small instance, likely with 1 vCPU. Insufficient CPU causes high latency under load.

Why this answer

The output shows high CPU utilization (e.g., 95%+), which directly correlates with the reported intermittent high latency. In a managed instance group, if the instance is under-provisioned for CPU, it cannot handle the workload spikes, causing queuing and increased response times. This is the most common cause of performance degradation in Compute Engine instances serving web applications.

Exam trap

Google Cloud often tests the misconception that network bandwidth or disk I/O is the primary bottleneck in web application latency, but the exhibit's focus on CPU utilization is the key clue that the issue is compute-bound, not I/O-bound.

How to eliminate wrong answers

Option B is wrong because network egress bandwidth limits would manifest as packet drops, retransmissions, or a plateau in throughput, not sustained high CPU usage; the output does not show any network-related metrics like dropped packets or bandwidth saturation. Option C is wrong because missing scopes for Cloud Monitoring and Cloud Trace would prevent telemetry data from being sent, but the command shown (likely `top` or `htop`) still runs locally and would not cause the high latency itself; the latency issue is a symptom of resource contention, not a permissions problem. Option D is wrong because a boot disk that is too small causing I/O contention would show high disk I/O wait times or disk queue depth in the output, not high CPU utilization; the exhibit focuses on CPU, not disk metrics.

183

MCQeasy

Refer to the exhibit. What does this recommendation suggest?

A.Change the instance to preemptible.

B.Upgrade the instance.

C.Delete the instance.

D.Downgrade the instance.

AnswerB

The description says 'Upgrade instance'.

Why this answer

The description explicitly states to upgrade the instance from n1-standard-2 to n1-standard-4, indicating a need for more resources.

184

MCQmedium

You are the SRE for a financial services application running on Google Cloud. Users report that certain transactions are taking over 10 seconds, while most complete in under 200ms. You use Cloud Profiler and Cloud Trace. Upon reviewing the profiler data, you see a hotspot in a method that calls a Cloud SQL database with a slow query. You identify the query and create an index to speed it up. However, you cannot deploy the index change immediately due to change management processes. The incident response team needs to mitigate the impact now. Which temporary measure should you take?

A.Increase the database connection pool size in the application.

B.Add a database read replica to offload read queries.

C.Implement caching of query results using Cloud Memorystore.

D.Scale up the Cloud SQL instance to more vCPUs.

AnswerC

Caching reduces the need to run the slow query, providing immediate latency improvement.

Why this answer

Option C is correct because caching the results of the slow query in Cloud Memorystore (Redis) immediately reduces the load on the Cloud SQL database and eliminates the need to execute the slow query repeatedly. This provides a temporary performance improvement without requiring any database schema changes or deployments, bypassing the change management delay. The hotspot in the profiler indicates the query itself is the bottleneck, and caching avoids that bottleneck entirely for repeated reads.
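The cache-aside pattern described here can be sketched as follows. This is a minimal illustration that uses an in-memory dictionary as a stand-in for Memorystore; the class name, key, TTL, and the `slow_query` placeholder are all assumptions, and a real deployment would use a Redis client instead.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """In-memory stand-in for a Memorystore (Redis) cache with per-key expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get_or_load(self, key: str, loader: Callable[[], object]) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the slow query is skipped
        value = loader()  # cache miss or expired entry: run the slow query once
        self._entries[key] = (now + self._ttl, value)
        return value


calls = 0


def slow_query() -> list:
    """Hypothetical placeholder for the slow Cloud SQL query."""
    global calls
    calls += 1
    return [("txn-1", 100)]


cache = TTLCache(ttl_seconds=30)
cache.get_or_load("report:daily", slow_query)
cache.get_or_load("report:daily", slow_query)
print(calls)  # 1: the second read was served from the cache
```

The TTL bounds how stale a cached result can be, which matters for financial data; the value of 30 seconds is arbitrary and would need to match what the application can tolerate.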

Exam trap

Google Cloud often tests the distinction between temporary mitigation and permanent resolution, and the trap here is that candidates confuse scaling the database (Option D) or adding replicas (Option B) as quick fixes, when in reality those are infrastructure changes that require change management approval and cannot be deployed instantly.

How to eliminate wrong answers

Option A is wrong because increasing the database connection pool size would add more concurrent connections to the already overloaded database, potentially making the contention worse and increasing latency further. Option B is wrong because adding a read replica requires provisioning a new Cloud SQL instance and modifying application connection strings, which is a deployment change subject to the same change management processes and cannot be done immediately. Option D is wrong because scaling up the Cloud SQL instance to more vCPUs requires a database restart or a failover operation, which is a significant change that also falls under change management and does not address the root cause of the slow query.

185

MCQmedium

A DevOps engineer is trying to create a service account key for a CI/CD pipeline, but receives the error: 'Constraint constraints/iam.disableServiceAccountKeyCreation violated'. What is the most likely cause and solution?

A.The Organization Policy prevents key creation; the engineer needs to request an exception from the security team.

B.The Cloud Resource Manager API is disabled; enable it.

C.The project does not have billing enabled; enable billing.

D.The service account has been deleted; the engineer must recreate it.

AnswerA

The constraint explicitly blocks key creation. An exception or alternative method (e.g., workload identity federation) is needed.

Why this answer

The error 'Constraint constraints/iam.disableServiceAccountKeyCreation violated' indicates that an Organization Policy with the constraint `iam.disableServiceAccountKeyCreation` is enforced at the organization, folder, or project level. This policy explicitly blocks the creation of service account keys, which is why the engineer cannot generate a key for the CI/CD pipeline. The correct solution is to request an exception from the security team, who can either remove the policy or add the engineer's project to an exemption list.

Exam trap

Google Cloud often tests the distinction between Organization Policy constraints (like `iam.disableServiceAccountKeyCreation`) and other common errors (like API disablement or billing issues), so candidates mistakenly choose B or C because they assume a missing API or billing is the root cause, but the specific error message directly names the constraint.

How to eliminate wrong answers

Option B is wrong because disabling the Cloud Resource Manager API would prevent listing or managing projects and policies, but the specific error message references an IAM constraint violation, not an API disabled error. Option C is wrong because billing being disabled would cause a different error (e.g., 'billing account not found' or 'project not billable'), not a constraint violation related to service account key creation. Option D is wrong because if the service account were deleted, the error would be 'Service account not found' or 'Permission denied', not a constraint violation about key creation.

186

MCQhard

Which Cloud Monitoring feature can directly correlate this error with the associated trace and VM instance?

A.Log-based metrics

B.Metrics Explorer

C.Error Reporting

D.Alerting policies

AnswerC

Error Reporting aggregates errors and links to traces and resources.

Why this answer

Error Reporting in Google Cloud automatically groups errors from Cloud Logging and can directly correlate an error with the associated trace ID and VM instance metadata. This is because Error Reporting ingests structured log entries that contain the `@type` field set to `type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent` and parses the `serviceContext` and `context` fields to link the error to the specific trace and resource (e.g., VM instance).
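A structured log line of the shape described above might look like the sketch below. The helper name and all sample values (service, version, project, trace ID) are hypothetical. The `logging.googleapis.com/trace` key is the structured-logging field used to attach a trace to a log entry, assuming the entry is ingested by an agent or runtime that parses JSON payloads.

```python
import json


def reported_error_entry(message: str, service: str, version: str, trace: str) -> str:
    """Build one JSON log line shaped like a ReportedErrorEvent."""
    entry = {
        "@type": "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent",
        "severity": "ERROR",
        "message": message,
        "serviceContext": {"service": service, "version": version},
        "logging.googleapis.com/trace": trace,
    }
    return json.dumps(entry)


# Hypothetical values for illustration only.
line = reported_error_entry(
    "TypeError: unsupported operand",
    service="checkout",
    version="v42",
    trace="projects/example-project/traces/0123456789abcdef0123456789abcdef",
)
print(line)
```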

Exam trap

Google Cloud often tests the misconception that Metrics Explorer or Log-based metrics can correlate errors to traces, but only Error Reporting is designed to parse and group error events with their associated trace and resource metadata.

How to eliminate wrong answers

Option A is wrong because Log-based metrics extract numerical data from log entries (e.g., count of errors) but do not provide direct correlation to a specific trace or VM instance; they aggregate metrics over time. Option B is wrong because Metrics Explorer is a tool for visualizing and querying metric time series data from Cloud Monitoring, not for linking individual error events to traces or instances. Option D is wrong because Alerting policies define conditions and notifications based on metric thresholds or log-based alerts, but they do not perform the correlation of an error with its trace and VM instance; they only trigger alerts when conditions are met.

187

MCQhard

You are a Site Reliability Engineer (SRE) for an e-commerce platform running on Google Kubernetes Engine (GKE) with a microservices architecture. Your team uses Cloud Monitoring for alerting and Cloud Logging for centralized logs. Recently, during a flash sale event, you observed intermittent latency spikes in the checkout service, causing checkout failures and abandoned carts. The latency spikes last 1-2 seconds and occur roughly every 5-10 minutes during peak traffic. The checkout service runs as a Deployment with 10 replicas, each with resource requests of 500m CPU and 512Mi memory. The service has a Service Level Objective (SLO) of 99.9% of requests completing in under 1 second (p99 latency < 1s). Current p99 latency is 2.1s during peak. You reviewed the Cloud Monitoring dashboard and noticed that CPU utilization across pods is around 60%, memory around 50%, and there are no OOM kills. The logs show occasional 'connection reset by peer' errors in the checkout service logs, but no consistent pattern. You suspect the issue might be related to the database (Cloud SQL) or a downstream dependency. After checking the database, you find that query latency is normal. You also notice that the checkout service makes a synchronous HTTP call to a payment validation service that runs as a separate Deployment with 3 replicas. The payment service's p99 latency is 500ms, but its error rate is below 1%. Your task is to identify the most likely cause of the intermittent latency spikes and propose a remediation. Which action should you take first?

A.Increase the number of replicas of the payment validation service to 10 to handle peak load.

B.Check the garbage collection logs of the checkout service pods to identify if long GC pauses coincide with the latency spikes.

C.Enable connection pooling and retries with exponential backoff in the checkout service for the HTTP call to the payment service.

D.Investigate the checkout service pod restarts due to liveness probe failures, as 'connection reset by peer' indicates pod instability.

AnswerB

Periodic latency spikes are a classic symptom of JVM garbage collection. Checking GC logs will help confirm if this is the cause.

Why this answer

The intermittent latency spikes every 5-10 minutes with no CPU/memory pressure or database issues strongly suggest a periodic process like garbage collection. Java-based services (common in microservices) can experience stop-the-world GC pauses that cause latency spikes of 1-2 seconds, matching the observed pattern. Checking GC logs is the fastest way to confirm this before making architectural changes.
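One way to confirm the hypothesis is to check how many latency spikes fall inside a GC pause window. The sketch below assumes the spike timestamps and pause intervals have already been extracted from monitoring data and GC logs; the function name and all numbers are made up for illustration.

```python
from typing import List, Tuple


def spikes_during_gc(spikes: List[float], gc_pauses: List[Tuple[float, float]]) -> float:
    """Return the fraction of latency spikes that start inside a GC pause window."""
    if not spikes:
        return 0.0
    hits = sum(
        1 for t in spikes if any(start <= t <= end for start, end in gc_pauses)
    )
    return hits / len(spikes)


# Hypothetical timestamps in seconds since the start of the observation window.
latency_spikes = [312.0, 845.5, 1390.2, 1900.0]
gc_pause_windows = [(311.4, 313.1), (845.0, 846.8), (1389.9, 1391.5)]
print(spikes_during_gc(latency_spikes, gc_pause_windows))  # 0.75
```

A high overlap supports the GC explanation, while a low one points the investigation back toward the downstream dependency.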

Exam trap

Google Cloud often tests the ability to distinguish between symptoms (connection resets, latency spikes) and root causes (GC pauses, thread pool exhaustion) by presenting plausible but superficial fixes like scaling or retries, while the correct answer requires analyzing internal service behavior.

How to eliminate wrong answers

Option A is wrong because increasing payment service replicas does not address the root cause of intermittent latency spikes; the payment service has low p99 latency (500ms) and low error rate, so scaling it won't fix a periodic issue like GC pauses. Option C is wrong because connection pooling and retries with exponential backoff would help with transient network failures or overload, but the 'connection reset by peer' errors are likely a symptom of the checkout service itself being unresponsive during GC pauses, not a network issue. Option D is wrong because 'connection reset by peer' does not indicate liveness probe failures or pod restarts; it typically means the remote side closed the connection, which could be due to the checkout service being paused by GC, not because the pod is unstable or restarting.

188

MCQmedium

A company deploys a microservices application on Google Kubernetes Engine (GKE). They notice increased latency during peak hours. The application uses a Cloud SQL database for state. The team wants to optimize service performance. What should they do first?

A.Implement a caching layer with Memorystore.

B.Enable Cloud SQL connection pooling.

C.Move the database to Cloud Spanner.

D.Increase the number of replicas in the GKE deployment.

AnswerB

Connection pooling minimizes the cost of repeatedly opening and closing database connections, directly addressing latency caused by connection setup.

Why this answer

Option B is correct because the primary bottleneck during peak hours for a microservices application using Cloud SQL is often the number of database connections. Connection pooling reuses a fixed set of connections, reducing the overhead of establishing new connections and preventing connection exhaustion, which directly addresses increased latency without requiring architectural changes.
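The reuse of a fixed set of connections can be illustrated with a small pool. This is a sketch under stated assumptions: `sqlite3` stands in for a Cloud SQL driver, and the `ConnectionPool` class is hypothetical rather than any library's API. Production code would normally rely on the pooling built into its database driver or ORM.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool; sqlite3 stands in for a Cloud SQL driver here."""

    def __init__(self, size: int, database: str = ":memory:"):
        self._idle: "queue.Queue[sqlite3.Connection]" = queue.Queue(maxsize=size)
        self.opened = 0
        for _ in range(size):
            # Connections are opened once, up front, and then reused.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))
            self.opened += 1

    @contextmanager
    def connection(self, timeout: float = 5.0):
        conn = self._idle.get(timeout=timeout)  # blocks when the pool is exhausted
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return the connection instead of closing it


pool = ConnectionPool(size=2)
for _ in range(100):
    with pool.connection() as conn:
        conn.execute("SELECT 1").fetchone()
print(pool.opened)  # 2: one hundred queries reused two connections
```

Because the pool size is fixed, adding application replicas multiplies the total connection count by the pool size per replica, which is the pressure the explanation for option D describes.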

Exam trap

Google Cloud often tests the misconception that scaling out (increasing replicas) always improves performance, but in stateful applications with a centralized database, it can backfire by increasing connection pressure; the trap here is that candidates overlook connection management as the first optimization step and jump to caching or database migration.

How to eliminate wrong answers

Option A is wrong because implementing a caching layer with Memorystore reduces read latency for frequently accessed data but does not address the underlying connection management issue with Cloud SQL; it adds complexity and is not the first optimization step. Option C is wrong because moving the database to Cloud Spanner is a significant architectural change that introduces higher cost and complexity, and it is overkill for optimizing latency caused by connection management; Spanner is designed for global scale and strong consistency, not for fixing connection pooling issues. Option D is wrong because increasing the number of replicas in the GKE deployment can actually worsen latency by creating more pods that each open new connections to Cloud SQL, exacerbating connection exhaustion and database load without addressing the root cause.
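
To make the connection-reuse argument concrete, here is a minimal sketch of a fixed-size pool. It is not Cloud SQL's own pooling mechanism or any particular driver's API: `FakeConnection` and `ConnectionPool` are hypothetical names, and the "connection" is a stand-in object so the example runs without a database.

```python
import queue
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for a real database connection (hypothetical, illustration only)."""

    opened = 0  # counts how many connections were ever created

    def __init__(self):
        FakeConnection.opened += 1

    def execute(self, sql):
        return f"ran: {sql}"


class ConnectionPool:
    """Fixed-size pool: connections are created once and then reused."""

    def __init__(self, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(FakeConnection())

    @contextmanager
    def connection(self, timeout=5.0):
        # Wait for an idle connection instead of opening a new one.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back so later requests can reuse it.
            self._idle.put(conn)


pool = ConnectionPool(size=3)
for i in range(100):
    with pool.connection() as conn:
        conn.execute(f"SELECT {i}")

print(FakeConnection.opened)  # 3
```

One hundred requests are served by the same three connections, which is why adding pods without pooling (Option D) tends to multiply connections rather than reduce latency.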

189

MCQeasy

An engineer receives an alert that a service's error rate has exceeded the threshold. To investigate, which log-based metric should the engineer query in Cloud Logging to identify the root cause?

A.Error log count grouped by service name.

B.Request latency histogram.

C.CPU utilization of the service instances.

D.Network bytes sent per instance.

AnswerA

Grouping by service reveals which service has the most errors.

Why this answer

Option A is correct because error log count grouped by service name directly surfaces which specific service is generating the elevated error rate. In Cloud Logging, log-based metrics are user-defined counters extracted from log entries; querying the error log count per service isolates the offending component, enabling root cause identification without mixing in unrelated performance data.

Exam trap

Google Cloud often tests the distinction between log-based metrics (derived from log entries) and system metrics (like CPU or network) from Cloud Monitoring, trapping candidates who confuse performance indicators with error source identification.

How to eliminate wrong answers

Option B is wrong because request latency histogram measures response times, not error rates; latency spikes can occur without errors, so it does not pinpoint the source of error threshold breaches. Option C is wrong because CPU utilization is a system-level metric from Cloud Monitoring, not a log-based metric; high CPU may correlate with errors but does not directly reveal which service or log entry caused the error rate alert. Option D is wrong because network bytes sent per instance is a network throughput metric, not derived from logs; it cannot identify error patterns or the specific service responsible for increased errors.
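
The grouping idea behind Option A can be sketched locally. The entries and field names below are hypothetical and do not reflect the actual Cloud Logging entry schema; the point is only that counting ERROR entries per service isolates the noisy component.

```python
from collections import Counter

# Hypothetical log entries; field names are illustrative only.
LOG_ENTRIES = [
    {"service": "checkout", "severity": "ERROR"},
    {"service": "checkout", "severity": "INFO"},
    {"service": "payment", "severity": "ERROR"},
    {"service": "checkout", "severity": "ERROR"},
    {"service": "catalog", "severity": "WARNING"},
]


def error_count_by_service(entries):
    """Count ERROR-severity entries, grouped by service name."""
    return Counter(e["service"] for e in entries if e["severity"] == "ERROR")


counts = error_count_by_service(LOG_ENTRIES)
print(counts.most_common())  # [('checkout', 2), ('payment', 1)]
```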

190

Multi-Selectmedium

Which THREE of the following are recommended practices for writing effective post-mortem documents?

Select 3 answers

A.Focus on the root cause and contributing factors.

B.Assign blame to the responsible engineer.

C.Share the findings with the entire organization.

D.Include an action plan to prevent recurrence.

E.Keep the document brief to save time.

AnswersA, C, D

Understanding the root cause prevents recurrence.

Why this answer

Option A is correct because effective post-mortem documents must focus on root cause analysis and contributing factors to identify systemic issues rather than individual errors. This aligns with the Site Reliability Engineering (SRE) principle of blameless post-mortems, which prioritize learning and process improvement over punishment. By analyzing the technical chain of events—such as a misconfigured load balancer or a cascading failure due to missing circuit breakers—teams can implement lasting fixes.

Exam trap

Google Cloud often tests the misconception that post-mortems should be brief or assign blame, tempting candidates to select options that prioritize speed or accountability over thorough, blameless analysis.

191

MCQmedium

During bootstrapping, a DevOps lead wants to ensure that all projects in the 'dev' folder have a consistent set of VPC firewall rules and network policies. They are considering using a shared VPC or VPC Network Peering. Which approach provides the most control and consistency for DevOps teams while minimizing administrative overhead?

A.Implement a shared VPC with a host project managed by the network team, and allow DevOps teams to use subnets in their service projects.

B.Use VPC Service Controls to define perimeters for each project.

C.Create a separate VPC in each project and use Cloud VPN to connect them.

D.Use VPC Network Peering between all projects to allow connectivity.

AnswerA

Shared VPC centralizes network administration while enabling resource reuse.

Why this answer

A shared VPC with a host project managed by the network team provides centralized control over VPC firewall rules and network policies, ensuring consistency across all service projects in the 'dev' folder. This approach minimizes administrative overhead because DevOps teams can create subnets in their service projects without needing to manage network infrastructure, while the network team enforces a uniform security posture.

Exam trap

The trap here is that candidates often confuse VPC Network Peering with shared VPC, assuming peering provides centralized policy control, but peering only enables connectivity while each VPC retains independent firewall and policy management.

How to eliminate wrong answers

Option B is wrong because VPC Service Controls are designed to define security perimeters around data (e.g., preventing data exfiltration), not to enforce consistent VPC firewall rules or network policies across projects. Option C is wrong because creating separate VPCs in each project and using Cloud VPN introduces significant administrative overhead for managing multiple VPN tunnels and does not provide centralized control over firewall rules or network policies. Option D is wrong because VPC Network Peering only enables connectivity between VPCs but does not enforce consistent firewall rules or network policies across peered networks, as each VPC retains its own independent firewall configuration.

192

Multi-Selecteasy

Which TWO are best practices when setting up a Google Cloud organization for multiple teams? (Select exactly 2)

Select 2 answers

A.Use a shared VPC to connect projects.

B.Create a separate billing account for each team.

C.Grant all users the Project Creator role.

D.Use folders to organize projects by environment or team.

E.Enable the Compute Engine API in every project.

AnswersA, D

Simplifies networking across projects.

Why this answer

A shared VPC allows multiple projects to share a common VPC network, enabling centralized control of networking resources, firewall rules, and IP addressing. This is a best practice for multi-team organizations because it reduces administrative overhead, ensures consistent network policies, and simplifies connectivity between projects without requiring complex peering or VPN setups.

Exam trap

Google Cloud often tests the misconception that each team needs its own billing account for cost tracking, but the correct approach is to use a single billing account with labels and budgets to allocate costs per team.

193

MCQhard

You manage a production environment with a web service deployed on Compute Engine instances behind an HTTP(S) Load Balancer. The service has a health check configured on the load balancer, probing a health endpoint every 10 seconds. After a recent configuration change, you observe that all instances are marked as unhealthy and traffic is failing. The health check response is 200 OK from the instances, but the load balancer still marks them unhealthy. The health check configuration: protocol: HTTP, port: 80, request path: /health, interval: 10s, timeout: 5s, unhealthy threshold: 2. The instances are running a custom web server. What is the most likely cause?

A.The load balancer is misconfigured with an incorrect protocol (HTTPS instead of HTTP).

B.The health check port is incorrectly configured as 80 but the service is listening on 8080.

C.The health check response is coming from a different server (e.g., a reverse proxy) that returns 200 but does not represent the actual service health.

D.The health check timeout is greater than the interval, causing overlapping probes.

AnswerC

If the 200 OK comes from a different process than the service itself, it says nothing about the service's real health, so the load balancer's own probes can still fail and mark the instance unhealthy.

Why this answer

Option C is the best fit because the observed 200 OK is most likely produced by an intermediary, such as a reverse proxy (e.g., nginx) that answers /health with a static 200 without forwarding the request to the custom web server. In that case the successful response does not reflect the state of the application: the load balancer's probes are likely failing to reach the intended health endpoint on the custom web server, or the response is not coming from the service itself. The other options are each ruled out by the stated configuration, which leaves a response that does not originate from the service as the most likely explanation.

Exam trap

Google Cloud often tests the misconception that a 200 OK response always indicates a healthy service, but the trap here is that the health check may be hitting a different server or proxy that does not reflect the actual application state.

How to eliminate wrong answers

Option A is wrong because if the load balancer were configured with HTTPS but the instances expected HTTP, the health check would receive a connection error or a protocol mismatch, not a 200 OK response. Option B is wrong because if the health check port were 80 but the service listened on 8080, the health check would fail to connect or receive a response, not return a 200 OK. Option D is wrong because the health check timeout (5s) is less than the interval (10s), so overlapping probes are not possible; overlapping would require timeout > interval, which is not the case here.
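
The threshold behavior mentioned in the configuration can be sketched as a small state machine. This is a simplified model, not the load balancer's actual implementation; the unhealthy threshold of 2 comes from the question, while the healthy threshold of 2 and the function name `evaluate_health` are assumptions.

```python
def evaluate_health(probe_results, unhealthy_threshold=2,
                    healthy_threshold=2, initially_healthy=True):
    """Return the final health state after a sequence of probe outcomes.

    probe_results is an iterable of booleans (True = probe succeeded).
    The state flips only after enough consecutive contradicting probes.
    """
    healthy = initially_healthy
    streak = 0  # consecutive probes that disagree with the current state
    for ok in probe_results:
        if ok == healthy:
            streak = 0  # a confirming probe resets the streak
            continue
        streak += 1
        threshold = unhealthy_threshold if healthy else healthy_threshold
        if streak >= threshold:
            healthy = not healthy
            streak = 0
    return healthy


print(evaluate_health([True, False, True, False]))  # True
print(evaluate_health([True, False, False]))        # False
```

A single failed probe between successes does not change the state; two consecutive failures do, which is why the 5s timeout and 10s interval in the question are not the problem.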

194

MCQhard

Refer to the exhibit. You are investigating a performance issue where the api-server container is using excessive CPU. You run a Cloud Monitoring API query and receive the JSON configuration shown. However, the query returns no data points. What is the most likely cause?

A.The time interval specified is too short and falls outside the data retention period.

B.The metric type 'kubernetes.io/container/cpu/core_usage_time' is deprecated and no longer available.

C.The filter is missing required resource labels such as project_id, location, cluster_name, and namespace_name, causing no time series to match.

D.The aggregation perSeriesAligner 'ALIGN_MEAN' is incompatible with the metric type, which requires 'ALIGN_RATE'.

AnswerC

Resource labels must be fully specified in the filter to match the specific container; otherwise the query may not return data.

Why this answer

Option C is correct because the Cloud Monitoring API query for the 'kubernetes.io/container/cpu/core_usage_time' metric type requires mandatory resource labels—specifically 'project_id', 'location', 'cluster_name', and 'namespace_name'—to uniquely identify the time series for a GKE container. Without these labels in the filter, the query cannot match any time series, resulting in no data points returned, even if the metric is actively emitting data.

Exam trap

Google Cloud often tests the misconception that a missing filter causes an error or that the metric is deprecated, when in fact the API silently returns no data points because the required resource labels are absent from the filter.

How to eliminate wrong answers

Option A is wrong because a short time interval does not by itself cause zero data points; the query would simply return the data within that window, and the data retention period for GKE container metrics is typically 6 weeks, so a short interval is valid. Option B is wrong because 'kubernetes.io/container/cpu/core_usage_time' is a valid, non-deprecated metric type in Cloud Monitoring for GKE; deprecation would cause a warning or error, not silent empty results. Option D is wrong because an incompatible aligner would cause the API to reject the request with an error, not return empty data; 'core_usage_time' is a cumulative counter, for which 'ALIGN_RATE' is the usual aligner, but the choice of aligner does not explain a query that succeeds with no data points.
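
As an illustration of a filter that spells out the resource labels, the sketch below only assembles a filter string and calls no API. The helper name `build_filter` and all label values are hypothetical; the `k8s_container` resource type and the label keys are the ones the explanation refers to.

```python
def build_filter(metric_type, resource_type, resource_labels):
    """Assemble a Cloud Monitoring style filter string from its parts."""
    clauses = [
        f'metric.type = "{metric_type}"',
        f'resource.type = "{resource_type}"',
    ]
    # Sort the labels so the output is deterministic.
    for key, value in sorted(resource_labels.items()):
        clauses.append(f'resource.labels.{key} = "{value}"')
    return " AND ".join(clauses)


# All label values below are made up for illustration.
query_filter = build_filter(
    "kubernetes.io/container/cpu/core_usage_time",
    "k8s_container",
    {
        "project_id": "example-project",
        "location": "us-central1",
        "cluster_name": "example-cluster",
        "namespace_name": "default",
        "container_name": "api-server",
    },
)
print(query_filter)
```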

195

MCQmedium

During the bootstrapping of a Google Cloud organization, you need to ensure that all resources in a specific folder are subject to a particular VPC Service Controls perimeter. Which step is necessary to achieve this?

A.Use resource tags to label the projects and then create a tag-based perimeter.

B.Apply the perimeter to the folder directly.

C.Add the projects within the folder as members of the existing service perimeter.

D.Create an organization policy that forces all projects in the folder to be in a perimeter.

AnswerC

Projects must be explicitly added to the perimeter.

Why this answer

Option C is correct because VPC Service Controls perimeters are applied at the project level, not directly to folders. To enforce a perimeter on all resources within a folder, you must add each project in that folder as a member of the existing service perimeter. This ensures that all resources in those projects are subject to the perimeter's access restrictions.

Exam trap

Google Cloud often tests the misconception that VPC Service Controls can be applied hierarchically (e.g., to folders or via organization policies), when in fact they require explicit project-level membership.

How to eliminate wrong answers

Option A is wrong because resource tags are not supported for defining VPC Service Controls perimeters; perimeters are based on project membership, not tags. Option B is wrong because VPC Service Controls perimeters cannot be applied directly to a folder; they are configured at the project or organization level, and folder-level inheritance is not a feature. Option D is wrong because organization policies cannot force projects into a VPC Service Controls perimeter; perimeters are configured separately via Access Context Manager and require explicit project membership.

196

MCQhard

A company runs a microservices architecture on GKE with Istio. They want to generate custom request-level metrics for SLO tracking without modifying application code. Which approach is most efficient?

A.Set up Istio telemetry with the Cloud Monitoring adapter

B.Write a custom Prometheus exporter deployed in each pod

C.Use the Cloud Monitoring agent to scrape metrics from pods

D.Enable Stackdriver for GKE (deprecated)

AnswerA

Istio's adapter exports request metrics directly to Cloud Monitoring.

Why this answer

Option A is correct because Istio's telemetry can be configured to export request-level metrics (e.g., latency, error rate) to Cloud Monitoring without modifying application code. The metrics are captured by the sidecar proxy at the service mesh layer, which makes this the most efficient and least invasive method for SLO tracking.

Exam trap

Google Cloud often tests the misconception that the Cloud Monitoring agent (or legacy Stackdriver) can scrape Istio metrics from pods, but in reality, the agent operates at the VM or node level and cannot access the Envoy proxy's in-memory metrics without explicit Prometheus scraping configuration.

How to eliminate wrong answers

Option B is wrong because writing a custom Prometheus exporter and deploying it in each pod requires modifying application code or adding sidecars, which contradicts the requirement of not modifying application code and adds operational overhead. Option C is wrong because the Cloud Monitoring agent (formerly Stackdriver agent) scrapes metrics from the host OS or container runtime, not from Istio's sidecar proxies, and it cannot capture request-level metrics at the mesh layer without application instrumentation. Option D is wrong because 'Stackdriver for GKE' is deprecated and replaced by Google Cloud Managed Service for Prometheus and Cloud Monitoring; relying on a deprecated approach is not a valid or efficient solution.

197

MCQmedium

An e-commerce platform uses Cloud Load Balancing with backend services running on Compute Engine managed instance groups. During Black Friday sales, the application experiences high latency and some 503 errors. The team uses autoscaling based on average CPU utilization, but scaling is too slow—Cloud Monitoring shows CPU rises to 90% before new instances are added. The team needs to reduce latency and eliminate 503 errors. What should they do?

A.Use HTTP load balancing with a larger backend timeout

B.Change the autoscaling metric to 'requests per second' and set a lower target value

C.Enable Cloud CDN for all dynamic content

D.Increase the cooldown period for the autoscaling policy

AnswerB

Requests per second scales proactively based on traffic, reacting faster than CPU-based scaling.

Why this answer

Option B is correct because switching the autoscaling metric from average CPU utilization to 'requests per second' (RPS) with a lower target value allows the autoscaler to react more quickly to traffic spikes. CPU utilization is a lagging indicator that rises only after requests have already been queued and processed, whereas RPS directly reflects incoming load. By setting a lower target RPS, the autoscaler can add instances before the backend becomes saturated, reducing latency and eliminating 503 errors.

Exam trap

Google Cloud often tests the misconception that CPU utilization is the best metric for scaling web applications, but the trap here is that CPU is a lagging indicator, and candidates may overlook that request-based metrics provide faster, more direct feedback for traffic-driven workloads.

How to eliminate wrong answers

Option A is wrong because increasing the backend timeout does not address the root cause of slow scaling; it only allows connections to wait longer before timing out, which can mask the problem but does not prevent 503 errors or reduce latency. Option C is wrong because Cloud CDN caches static content at edge locations, but the question describes dynamic content that cannot be cached; enabling CDN for dynamic content would not reduce latency or eliminate 503 errors caused by backend overload. Option D is wrong because increasing the cooldown period would make autoscaling even slower, as it delays the addition of new instances after a scale-up decision, worsening the latency and 503 errors.
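
The effect of lowering the target value can be seen with a simplified proportional sizing rule. This is an approximation for illustration, not the autoscaler's exact algorithm, and the traffic figures are hypothetical.

```python
import math


def recommended_instances(total_load, target_per_instance):
    """Instances needed so that each one stays at or below its target load."""
    return max(1, math.ceil(total_load / target_per_instance))


# Hypothetical peak: 1,200 requests per second across the instance group.
peak_rps = 1200

print(recommended_instances(peak_rps, target_per_instance=100))  # 12
print(recommended_instances(peak_rps, target_per_instance=60))   # 20
```

With the lower target, the group is sized with more headroom for the same traffic, so new capacity is requested earlier in a spike instead of after CPU has already saturated.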

198

MCQeasy

A DevOps engineer notices that the monthly compute cost is higher than expected for a Kubernetes Engine cluster. They want to identify which namespaces or workloads are contributing most to the cost. What should they do?

A.Enable Logging for all containers and analyze logs.

B.Use Cloud Monitoring dashboards with Kubernetes metrics.

C.Add labels to nodes and use Cost Table reports.

D.Enable GKE Usage Metering and export to BigQuery.

AnswerD

GKE Usage Metering provides cost attribution by namespace and workload.

Why this answer

Option D is correct because GKE Usage Metering exports detailed per-cluster resource consumption data (CPU, memory, storage, and network) to BigQuery, enabling cost attribution by namespace, label, or workload. This is the only option that directly provides granular cost breakdowns for Kubernetes resources, allowing the engineer to identify which namespaces or workloads drive compute costs.

Exam trap

The trap here is that candidates confuse monitoring metrics (CPU/memory usage) with cost data, or assume node labels and billing reports alone provide namespace-level granularity, missing that GKE Usage Metering is the specific tool designed for this exact use case.

How to eliminate wrong answers

Option A is wrong because enabling Logging for all containers and analyzing logs would capture application events and errors, not resource usage or cost data; logs lack the structured cost metrics needed for cost attribution. Option B is wrong because Cloud Monitoring dashboards with Kubernetes metrics show real-time performance data (e.g., CPU utilization, memory usage) but do not provide cost data or billing-level attribution per namespace or workload. Option C is wrong because adding labels to nodes and using Cost Table reports (a reference to Google Cloud's Cost Table in the billing console) can help categorize costs by label, but without GKE Usage Metering, node labels alone do not capture per-workload or per-namespace resource consumption; the Cost Table reports aggregate billing data at the project or service level, not at the granularity of Kubernetes namespaces.
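
Once usage records are available per namespace, attribution is a simple aggregation. The sketch below assumes hypothetical rows and a made-up price; the field names do not reflect the real export schema, and in practice this would more likely be a BigQuery query.

```python
from collections import defaultdict

# Hypothetical usage records; field names are illustrative only.
USAGE_ROWS = [
    {"namespace": "checkout", "core_seconds": 7200},
    {"namespace": "checkout", "core_seconds": 3600},
    {"namespace": "search", "core_seconds": 1800},
]

ASSUMED_PRICE_PER_CORE_HOUR = 0.04  # made-up price, for illustration only


def cpu_cost_by_namespace(rows, price_per_core_hour):
    """Sum CPU core-seconds per namespace and convert them to an estimated cost."""
    core_seconds = defaultdict(int)
    for row in rows:
        core_seconds[row["namespace"]] += row["core_seconds"]
    return {
        namespace: seconds / 3600 * price_per_core_hour
        for namespace, seconds in core_seconds.items()
    }


costs = cpu_cost_by_namespace(USAGE_ROWS, ASSUMED_PRICE_PER_CORE_HOUR)
most_expensive = max(costs, key=costs.get)
print(most_expensive)  # checkout
```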

199

MCQmedium

You want to send alerts to a Slack channel when a critical error occurs. What should you do?

A.Set up a webhook in Cloud Logging and point it to Slack

B.Use Cloud Tasks to schedule a job that queries logs and sends Slack messages

C.Create a Cloud Logging export to Pub/Sub and subscribe using a Cloud Function that sends to Slack

D.Configure a Slack webhook notification channel in Cloud Monitoring and associate it with an alerting policy

AnswerD

Cloud Monitoring natively integrates with Slack via webhooks.

Why this answer

Option D is correct because Cloud Monitoring alerting policies can directly send notifications to Slack via a configured webhook notification channel. When a critical error triggers the alerting policy, Cloud Monitoring sends an HTTP POST request to the Slack webhook URL, delivering the alert message to the specified Slack channel. This is the native, event-driven approach without requiring custom code or intermediate services.

Exam trap

Google Cloud often tests the distinction between Cloud Logging exports (for log storage/analysis) and Cloud Monitoring alerting (for notification delivery), leading candidates to over-engineer solutions with Pub/Sub and Cloud Functions when a direct notification channel is available.

How to eliminate wrong answers

Option A is wrong because Cloud Logging webhooks are not a feature; Cloud Logging does not support direct webhook integrations to Slack. Option B is wrong because Cloud Tasks is a distributed task queue for asynchronous execution, not designed for real-time log monitoring or Slack notifications; polling logs with scheduled jobs introduces latency and complexity. Option C is wrong because while a Cloud Logging export to Pub/Sub with a Cloud Function can send messages to Slack, this approach is unnecessarily complex and indirect compared to the native Cloud Monitoring alerting policy with a Slack notification channel, which is the recommended and simpler solution.

200

Multi-Selecthard

Which THREE of the following are valid considerations when using organization policies to enforce compliance in a DevOps environment?

Select 3 answers

A.Child-level organization policies can only add more restrictions, not remove restrictions set by a parent.

B.A list policy can be used to define an allowlist or blocklist of resources.

C.An organization policy must be enabled by a service agent before it can be used.

D.IAM conditions can be used to override an organization policy for specific resources.

E.Organization policies can be set at the organization, folder, or project level.

AnswersA, B, E

This is a fundamental property of organization policy inheritance.

Why this answer

Option A is correct because organization policies follow a hierarchical inheritance model where child policies can only impose additional restrictions (i.e., they are more restrictive) than the parent policy. This is enforced by the Google Cloud Resource Manager hierarchy, ensuring that a policy set at the organization level cannot be relaxed by a folder or project policy.

Exam trap

The trap here is that candidates often confuse IAM conditions with organization policies, thinking that IAM can override policy constraints, when in fact organization policies are a separate enforcement layer that cannot be bypassed by IAM.

201

MCQhard

A company runs a streaming pipeline on Dataflow that reads from Pub/Sub and writes to BigQuery. The pipeline is falling behind due to high volume, and messages are backing up in Pub/Sub. Autoscaling is enabled and workers are running, but utilization is only 30%. Streaming Engine is disabled. What should the engineer do to increase throughput?

A.Use a larger machine type for workers.

B.Enable Streaming Engine for Dataflow.

C.Increase the number of worker machines.

D.Increase the worker disk size.

AnswerB

Streaming Engine reduces shuffle overhead, improving throughput for streaming pipelines.

Why this answer

B is correct because enabling Streaming Engine offloads the heavy lifting of shuffle and state management from worker VMs to the Dataflow service backend, reducing the per-worker overhead. With only 30% utilization, the bottleneck is not compute capacity but the per-worker throughput limit caused by the streaming pipeline's shuffle and state operations. Streaming Engine allows the existing workers to process more data per second by removing these bottlenecks, directly addressing the Pub/Sub backlog without adding more resources.

Exam trap

Google Cloud often tests the misconception that low utilization means you need more or larger workers, when in fact the bottleneck is the streaming shuffle overhead that Streaming Engine resolves, not compute capacity.

How to eliminate wrong answers

Option A is wrong because using a larger machine type increases CPU and memory per worker, but the pipeline is only at 30% utilization, indicating that compute resources are not the bottleneck; the issue is the streaming shuffle and state overhead that a larger machine type does not alleviate. Option C is wrong because increasing the number of workers would add more machines that would also be underutilized due to the same streaming overhead, failing to address the root cause of low throughput per worker. Option D is wrong because increasing worker disk size helps with buffering and spill-to-disk scenarios, but the pipeline is not disk-bound; the bottleneck is the streaming engine's shuffle and state management, which is not resolved by more disk space.

202

MCQhard

Your company runs a multi-region application on GKE across us-east1 and europe-west1. The application serves a global user base with a strict SLO of 99.95% availability. Recently, the team noticed that during peak hours, some users in South America experience high latency and intermittent errors. The GKE clusters are monitored via Cloud Monitoring with custom dashboards and alerting policies. The team has set up a single alerting policy that triggers when the global error rate exceeds 0.1%. However, the alert fires only after the issue has persisted for 10 minutes, and by then the customer impact is already significant. You need to improve the detection and response time. Which action should you take first?

A.Create separate alerting policies per region with shorter evaluation periods, and set up a notification channel for the on-call team.

B.Add a dashboard that shows latency by region and set up a log-based metric for error counting.

C.Reduce the alerting policy's duration to 1 minute and increase the threshold to 0.5% to reduce noise.

D.Implement a canary deployment strategy to roll back changes quickly.

AnswerA

Regional alerts detect issues faster and target the affected region.

Why this answer

Option A is correct because the current single global alerting policy with a 10-minute evaluation period introduces a significant delay in detecting regional issues. By creating separate alerting policies per region with shorter evaluation periods, you can detect and respond to regional anomalies (like high latency in South America) much faster, directly improving the detection time for the multi-region application and reducing customer impact.

Exam trap

Google Cloud often tests the misconception that reducing the evaluation period and raising the threshold (Option C) is a quick fix, but this ignores the need for regional granularity and can lead to missed detections or increased noise, while the correct approach is to isolate alerts per region.

How to eliminate wrong answers

Option B is wrong because adding a dashboard and log-based metric for error counting only improves visibility and monitoring, but does not directly reduce the detection or response time for the alert; it provides data but no automated alerting improvement. Option C is wrong because reducing the duration to 1 minute and increasing the threshold to 0.5% would likely increase noise (false positives) and may still miss regional issues if the global error rate remains below 0.5% despite regional problems, thus not addressing the core issue of regional detection. Option D is wrong because implementing a canary deployment strategy is a deployment and rollback technique that helps mitigate impact after a bad release, but does not improve the detection and response time for the existing latency and error issue; it is a reactive measure, not a proactive monitoring fix.
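
A short worked example shows why a single global threshold can hide a regional problem. The traffic numbers and region names are hypothetical; only the 0.1% threshold comes from the scenario.

```python
# Hypothetical traffic for one evaluation window: region -> (requests, errors).
WINDOW = {
    "north-america": (500_000, 100),
    "europe": (400_000, 80),
    "south-america": (20_000, 600),
}
THRESHOLD = 0.001  # 0.1%, the global threshold from the scenario


def error_rate(requests, errors):
    return errors / requests if requests else 0.0


def global_error_rate(window):
    """Error rate over all regions combined."""
    total = sum(r for r, _ in window.values())
    errors = sum(e for _, e in window.values())
    return error_rate(total, errors)


def regions_over_threshold(window, threshold):
    """Names of regions whose own error rate exceeds the threshold."""
    return sorted(
        name for name, (r, e) in window.items() if error_rate(r, e) > threshold
    )


print(global_error_rate(WINDOW) > THRESHOLD)      # False
print(regions_over_threshold(WINDOW, THRESHOLD))  # ['south-america']
```

Here the low-traffic region has a 3% error rate, yet the global rate stays below 0.1% because the larger regions dilute it. A per-region policy would fire; the global one would not.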

203

MCQeasy

A DevOps team is troubleshooting a web application that shows high latency during peak hours. The application runs on Google Kubernetes Engine (GKE). They want to identify which specific API calls are causing the delay. Which Google Cloud tool should they use?

A.Cloud Monitoring

B.Cloud Profiler

C.Cloud Logging

D.Cloud Trace

AnswerD

Cloud Trace offers distributed tracing, enabling identification of slow API calls.

Why this answer

Cloud Trace is the correct tool because it is a distributed tracing system that captures latency data from applications, allowing you to trace individual API calls and identify which specific endpoints are causing delays. Unlike Cloud Monitoring, which provides aggregate metrics, Cloud Trace provides per-request latency breakdowns, making it ideal for pinpointing slow API calls in a GKE environment.

Exam trap

The trap here is that candidates often confuse Cloud Monitoring (which shows overall latency metrics) with Cloud Trace (which provides per-request tracing), leading them to choose Cloud Monitoring when they need to drill down into specific API calls.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring provides aggregate metrics (e.g., CPU, memory, request count) but does not trace individual API calls or provide per-request latency breakdowns. Option B is wrong because Cloud Profiler is designed for continuous profiling of CPU and memory usage to identify performance bottlenecks in code, not for tracing specific API call latencies. Option C is wrong because Cloud Logging captures log entries and events but does not provide the distributed tracing capability needed to correlate latency across API calls.

204

Multi-Selecteasy

Which TWO of the following are best practices when bootstrapping a Google Cloud organization for DevOps?

Select 2 answers

A.Use the default Compute Engine service account for all DevOps automation.

B.Create separate projects for development, staging, and production environments.

C.Create a service account in each project that needs it, rather than a central admin project.

D.Grant the principle of least privilege to service accounts used by CI/CD pipelines.

E.Set organization policies only at the project level, not at the folder or organization level.

AnswersB, D

This provides isolation and allows tailored policies.

Why this answer

Option B is correct because creating separate projects for development, staging, and production environments enforces resource isolation, simplifies IAM policy management, and aligns with Google Cloud's recommended multi-project architecture for DevOps. This approach allows you to apply distinct organization policies, budgets, and monitoring per environment, reducing the risk of accidental changes to production.

Exam trap

Google Cloud often tests two misconceptions here. The first is that the default Compute Engine service account is acceptable for automation, when it should be avoided because of its overly permissive `editor` role. The second is that organization policies should be set only at the project level, which ignores hierarchical inheritance from the folder and organization levels.

205

Multi-Selecteasy

Which THREE are recommended practices for setting up alerting on Google Cloud?

Select 3 answers

A.Use multiple conditions with OR combiner to cover all scenarios

B.Set alert thresholds with a buffer (e.g., below SLO)

C.Test alerts in a non-production environment

D.Use a single alerting policy for all services

E.Configure multiple notification channels for different roles

AnswersB, C, E

Buffers reduce noise and provide time to act.

Why this answer

Option B is correct because setting alert thresholds with a buffer ahead of the SLO (e.g., alerting when availability falls below 99.95% when the SLO is 99.9%) provides early warning before the SLO is breached. This leaves time for investigation and remediation before an actual SLO violation occurs.
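The buffer idea can be sketched in a few lines of Python. This is only an illustration: the function name, the SLO target, and the buffer size are assumptions chosen for the example, not values from Cloud Monitoring.

```python
def should_alert(availability: float, slo_target: float, buffer: float) -> bool:
    """Return True once availability drops into the buffer zone above the SLO target."""
    # The alert threshold sits above the SLO so the alert fires before a breach.
    alert_threshold = slo_target + buffer
    return availability < alert_threshold


# Hypothetical numbers: a 99.9% SLO with a 0.05-point buffer alerts below 99.95%.
print(should_alert(99.97, slo_target=99.9, buffer=0.05))  # healthy, no alert expected
print(should_alert(99.93, slo_target=99.9, buffer=0.05))  # inside the buffer, SLO not yet breached
```

A wider buffer gives more time to react but also fires more often, so the size is a trade-off between lead time and noise.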

Exam trap

Google Cloud often tests the misconception that combining multiple conditions with OR is always beneficial for coverage, but in reality, it increases noise and violates the principle of alerting on symptoms rather than causes.

206

MCQeasy

A team wants to monitor the latency of a microservice deployed on GKE. Which Google Cloud tool should they use to collect custom metrics?

A.Cloud Logging

B.Cloud Trace

C.Error Reporting

D.Cloud Monitoring

AnswerD

Cloud Monitoring collects metrics including custom metrics from GKE.

Why this answer

Cloud Monitoring (formerly Stackdriver Monitoring) is the correct tool because it allows you to collect, visualize, and alert on custom metrics via the Monitoring API or the `custom.googleapis.com` metric namespace. For a microservice on GKE, you can instrument your application with the OpenTelemetry SDK or the Cloud Monitoring client library to push custom latency metrics directly into Cloud Monitoring, where they can be queried and used in dashboards or alerting policies.
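As a rough sketch of what a custom latency data point looks like, the Python below builds a request body in the shape accepted by the Monitoring API's `projects.timeSeries.create` method, without sending it. The metric name `myapp/request_latency_ms`, the `endpoint` label, and all resource label values are hypothetical; in practice a client library or OpenTelemetry exporter would usually build and send this for you.

```python
import json
from datetime import datetime, timezone


def build_latency_payload(latency_ms: float, resource_labels: dict) -> dict:
    """Build (but do not send) a timeSeries.create body for one latency point."""
    end_time = datetime.now(timezone.utc).isoformat()  # RFC 3339 timestamp
    return {
        "timeSeries": [
            {
                "metric": {
                    # Hypothetical custom metric under the custom.googleapis.com namespace.
                    "type": "custom.googleapis.com/myapp/request_latency_ms",
                    "labels": {"endpoint": "/checkout"},
                },
                "resource": {"type": "k8s_container", "labels": resource_labels},
                "points": [
                    {
                        "interval": {"endTime": end_time},
                        "value": {"doubleValue": latency_ms},
                    }
                ],
            }
        ]
    }


# Illustrative resource labels for a container running on GKE.
labels = {
    "project_id": "my-project",
    "location": "us-central1",
    "cluster_name": "prod-cluster",
    "namespace_name": "default",
    "pod_name": "checkout-abc123",
    "container_name": "checkout",
}
print(json.dumps(build_latency_payload(182.5, labels), indent=2))
```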

Exam trap

Google Cloud often tests the distinction between logging, tracing, and monitoring, and the trap here is that candidates confuse Cloud Trace (which deals with latency at the trace level) with Cloud Monitoring (which collects and stores custom numeric metrics for alerting and dashboards).

How to eliminate wrong answers

Option A is wrong because Cloud Logging is designed for storing, searching, and analyzing log data, not for collecting numeric time-series metrics; while logs can be converted to metrics via log-based metrics, that is an indirect and less efficient approach for custom latency monitoring. Option B is wrong because Cloud Trace is a distributed tracing tool that captures end-to-end request latency as trace spans, but it does not support custom metric collection or expose a metrics API for arbitrary numeric values. Option C is wrong because Error Reporting aggregates and analyzes application errors (exceptions, crashes), not latency data; it focuses on error events, not performance metrics.

207

MCQhard

A Cloud Run service experiences high latency during cold starts. The service is memory-intensive. Which configuration change will most effectively reduce cold start latency?

A.Decrease the container concurrency setting.

B.Set a minimum number of instances.

C.Increase the max instances limit.

D.Enable CPU boost for the service.

AnswerB

Min instances keep instances always running, eliminating cold starts.

Why this answer

Setting a minimum number of instances ensures that a baseline of pre-warmed instances is always running, eliminating cold starts for the initial requests. For a memory-intensive service, this avoids the latency penalty of loading large libraries or datasets into memory on first invocation. This is the most direct and effective configuration change to reduce cold start latency.

Exam trap

Google Cloud often tests the misconception that increasing max instances or enabling CPU boost can solve cold start latency, when in fact only setting a minimum number of instances directly prevents the cold start from occurring.

How to eliminate wrong answers

Option A is wrong because decreasing container concurrency reduces the number of simultaneous requests each instance can handle, which may increase the number of instances needed but does not address the root cause of cold start latency. Option C is wrong because increasing the max instances limit only raises the ceiling for scaling out, which does nothing to prevent cold starts when no instances are running. Option D is wrong because CPU boost temporarily increases CPU during cold starts but only reduces latency by a small margin; it does not eliminate the cold start itself, and for memory-intensive services, the bottleneck is often memory allocation, not CPU.

208

MCQhard

A company wants to reduce costs associated with Cloud Monitoring. They have many custom metrics and high ingestion rates. Which cost optimization strategy is most effective?

A.Use log-based metrics instead of custom metrics.

B.Aggregate metrics into buckets and export to BigQuery for analysis.

C.Reduce the sampling rate of all custom metrics to 1 minute.

D.Delete all unused custom metrics and reduce labels.

AnswerD

Unused metrics and high label cardinality contribute to costs; cleaning them up is the most direct approach.

Why this answer

Option D is correct because deleting unused custom metrics and reducing labels directly reduces the volume of data ingested and stored, which is the primary cost driver in Cloud Monitoring (formerly Stackdriver). Custom metrics incur charges per data point ingested, and each unique label combination creates additional time series, multiplying costs. This strategy eliminates waste without sacrificing necessary monitoring fidelity.
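To see why labels matter, here is a small Python sketch that estimates the worst-case number of time series for one metric as the product of its label cardinalities. The label names and counts are made up for illustration, and the product is an upper bound, since not every combination necessarily occurs.

```python
from math import prod


def max_time_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical metric with a high-cardinality user_id label.
with_user_id = {"endpoint": 20, "status": 5, "user_id": 1000}
without_user_id = {"endpoint": 20, "status": 5}

print(max_time_series(with_user_id))     # 100000
print(max_time_series(without_user_id))  # 100
```

Dropping the single high-cardinality label cuts the potential series count by three orders of magnitude in this example, which is why label cleanup is often the first place to look.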

Exam trap

Google Cloud often tests the misconception that reducing sampling frequency or moving metrics to logs will always lower costs, when in fact the most effective first step is eliminating unused metrics and reducing label cardinality to minimize the number of time series ingested.

How to eliminate wrong answers

Option A is wrong because log-based metrics still incur ingestion costs for the logs themselves, and converting custom metrics to log-based metrics does not inherently reduce costs—it may shift costs to log ingestion and storage. Option B is wrong because aggregating metrics into buckets and exporting to BigQuery adds additional costs for BigQuery storage and querying, and does not reduce the ingestion volume into Cloud Monitoring. Option C is wrong because reducing the sampling rate to 1 minute may not be appropriate for all metrics (e.g., high-frequency metrics need finer granularity), and it does not address the root cause of high ingestion volume from unused metrics or excessive labels; also, Cloud Monitoring charges per data point, so simply lowering frequency may not yield proportional savings if the number of time series remains high.

209

MCQeasy

A DevOps team wants to automate the deployment of a containerized application to multiple GKE clusters across different regions. They are using Cloud Build to build the container and Cloud Deploy for deployment. Which Cloud Deploy resource should they configure to define the deployment order and target clusters?

A.Cloud Build trigger with a multi-cluster deployment step.

B.A rollout object with target clusters specified.

C.A delivery pipeline with multiple targets defined.

D.A Cloud Deploy release configuration.

AnswerC

Delivery pipeline specifies targets and promotion order.

Why this answer

Option C is correct because a delivery pipeline defines the promotion sequence and the targets. Option A is incorrect because a Cloud Build trigger only starts builds. Option D is incorrect because a release is a specific snapshot of what is being deployed, not the pipeline itself.

Option B is incorrect because a rollout deploys a release to a single target.
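Cloud Deploy pipelines are normally declared in YAML. The Python structure below only mirrors the general shape of a delivery pipeline with an ordered list of stages, to show how the stage order determines promotion order. The pipeline name and target IDs are hypothetical, and `next_target` is an illustrative helper, not part of any Cloud Deploy API.

```python
from typing import Optional

# Hypothetical pipeline description; the target IDs are made up for illustration.
delivery_pipeline = {
    "name": "web-app-pipeline",
    "serialPipeline": {
        "stages": [
            {"targetId": "gke-us-central1"},
            {"targetId": "gke-europe-west1"},
            {"targetId": "gke-asia-east1"},
        ]
    },
}


def next_target(pipeline: dict, current_target: str) -> Optional[str]:
    """Return the target that follows current_target in promotion order, or None at the end."""
    order = [stage["targetId"] for stage in pipeline["serialPipeline"]["stages"]]
    position = order.index(current_target)
    return order[position + 1] if position + 1 < len(order) else None


print(next_target(delivery_pipeline, "gke-us-central1"))
print(next_target(delivery_pipeline, "gke-asia-east1"))
```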

210

MCQhard

Refer to the exhibit.

```yaml
name: projects/my-project/alertPolicies/12345
displayName: High Error Rate
combiner: OR
conditions:
- conditionThreshold:
    filter: metric.type="logging.googleapis.com/user/myapp/error_count" resource.type="k8s_container"
    aggregations:
    - alignmentPeriod: 60s
      perSeriesAligner: ALIGN_RATE
    duration: 120s
    comparison: COMPARISON_GT
    thresholdValue: 5
    trigger:
      count: 1
```

An engineer notices that this alert fires too frequently during normal operation. Which change would most likely reduce the noise?

A.Increase duration to 300s.

B.Change combiner to AND.

C.Change perSeriesAligner to ALIGN_MEAN.

D.Decrease thresholdValue to 1.

AnswerA

A longer duration means the threshold must be exceeded for 300 seconds continuously, filtering out short-lived spikes that cause false alerts.

Why this answer

Increasing the duration to 300s means the condition (error rate > 5 per second) must be sustained for a full 5 minutes before the alert fires. This reduces noise by filtering out transient spikes that occur during normal operation, which are shorter than the new duration window. The original 120s duration allowed short-lived bursts to trigger the alert too frequently.
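The effect of the duration window can be illustrated with a simplified Python model. It assumes one aligned value per 60-second alignment period and fires only when the threshold is exceeded for enough consecutive periods to cover the duration. This is a sketch of the idea, not a reimplementation of Cloud Monitoring's evaluation logic, and the sample data is invented.

```python
def alert_fires(values, threshold, duration_s, period_s=60):
    """Return True if values stay above threshold for consecutive periods spanning duration_s."""
    needed = max(1, duration_s // period_s)  # consecutive periods required
    streak = 0
    for value in values:
        streak = streak + 1 if value > threshold else 0
        if streak >= needed:
            return True
    return False


# Hypothetical per-minute error rates: a two-minute burst, then a sustained problem.
short_spike = [1, 2, 8, 9, 2, 1, 1, 1]
sustained = [6, 7, 8, 7, 9, 6]

print(alert_fires(short_spike, threshold=5, duration_s=120))  # True: burst covers 2 periods
print(alert_fires(short_spike, threshold=5, duration_s=300))  # False: burst is shorter than 5 periods
print(alert_fires(sustained, threshold=5, duration_s=300))    # True: 5+ consecutive periods
```

The longer window ignores the short burst while still catching the sustained error rate, which is the behavior the question is after.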

Exam trap

Google Cloud often tests the misconception that changing the threshold value or aligner is the primary way to reduce alert noise, when in fact adjusting the duration (or evaluation window) is the correct method to filter out transient spikes.

How to eliminate wrong answers

Option B is wrong because changing the combiner to AND would require all conditions to be true simultaneously, but there is only one condition in this policy, so AND has no effect and does not reduce noise. Option C is wrong because changing perSeriesAligner to ALIGN_MEAN would average the error count over the alignment period, which could smooth out spikes but does not address the core issue of transient bursts triggering the alert; it might even increase sensitivity if the mean is still above threshold. Option D is wrong because decreasing thresholdValue to 1 would make the alert fire even more frequently, as it would trigger on any error rate above 1, increasing noise rather than reducing it.

211

MCQeasy

A company runs a web application on Compute Engine behind a regional HTTP Load Balancer. Users report slow page load times during peak hours. CPU utilization on instances is under 60%, but network egress is near the instance's bandwidth limit. Which action should the engineer take?

A.Increase the number of instances in the instance group.

B.Switch to a global external HTTP Load Balancer.

C.Enable Cloud CDN.

D.Use a larger machine type with higher network throughput.

AnswerD

Larger machine types have higher network egress limits, addressing the bottleneck.

Why this answer

The bottleneck is network egress bandwidth, not CPU. Option D is correct because it addresses the root cause, which is insufficient network capacity per instance. Compute Engine egress caps generally scale with vCPU count, so moving to a larger machine type (e.g., n2-standard-16 instead of n2-standard-2) directly raises the per-instance egress cap and relieves the bandwidth limit.

Exam trap

Google Cloud often tests the misconception that horizontal scaling (adding instances) always solves performance issues, but here the bottleneck is per-instance network throughput, not request handling capacity.

How to eliminate wrong answers

Option A is wrong because adding more instances distributes load but does not increase the per-instance egress bandwidth; if each instance is already at its egress limit, more instances will also hit the same limit. Option B is wrong because switching to a global external HTTP Load Balancer improves latency through anycast IP and cross-regional routing, but does not increase the egress bandwidth of individual Compute Engine instances. Option C is wrong because Cloud CDN caches static content at edge locations, reducing origin load, but the reported bottleneck is network egress on the instances, which CDN does not address for dynamic or uncacheable traffic.

212

MCQeasy

A team is using Cloud Build to build a Docker image and push to Artifact Registry. After each build, they want to automatically trigger a deployment to Cloud Run. What is the best way to achieve this?

A.Use Cloud Build builder to deploy to Cloud Run in the same pipeline.

B.Use Cloud Deploy to manage the rollout.

C.Configure a Cloud Run trigger on Artifact Registry push.

D.Use Cloud Scheduler to periodically check for new images and deploy.

AnswerC

Cloud Run triggers can automatically deploy new images pushed to Artifact Registry.

Why this answer

Option C is correct because Cloud Run natively supports continuous deployment from Artifact Registry: when a new image is pushed to the registry, Cloud Run can automatically deploy the latest image without requiring an external pipeline or scheduler. This is the simplest and most direct approach, as it eliminates the need for additional CI/CD orchestration while ensuring deployments happen immediately after each successful build.

Exam trap

The trap here is that candidates often over-engineer the solution by choosing Cloud Deploy (Option B) because it is a dedicated deployment tool, but the question specifically asks for the 'best way' to trigger a deployment after a single build, where Cloud Run's built-in continuous deployment is simpler and more direct.

How to eliminate wrong answers

Option A is wrong because using a Cloud Build builder to deploy to Cloud Run in the same pipeline couples the build and deployment steps, which violates the principle of separation of concerns and makes it harder to manage rollbacks or approvals; it also requires the Cloud Build service account to have deployment permissions, increasing security risk. Option B is wrong because Cloud Deploy is designed for multi-target, progressive delivery (e.g., canary, blue/green) and adds unnecessary complexity for a simple single-service deployment triggered by a single image push. Option D is wrong because Cloud Scheduler polling for new images introduces latency (up to the polling interval) and inefficiency, and it is not event-driven; it also requires custom logic to compare image digests, making it brittle and harder to maintain.

213

Multi-Selecthard

Which THREE factors should you consider when designing a Cloud Run service for optimal performance under unpredictable traffic patterns? (Choose 3)

Select 3 answers

A.Use HTTP/2 for faster connection reuse.

B.Set a minimum number of instances to reduce cold starts.

C.Configure VPC egress through Cloud NAT for lower latency.

D.Allocate sufficient CPU per instance to handle peak load.

E.Set a maximum number of instances to control concurrency and cost.

AnswersB, D, E

Min instances keep containers warm.

Why this answer

Option B is correct because setting a minimum number of instances ensures that Cloud Run always keeps a baseline of warm instances ready to serve requests. This eliminates cold starts for the first requests during traffic spikes, which is critical for unpredictable traffic patterns where latency spikes from cold starts would degrade user experience.

Exam trap

The trap here is that candidates confuse network optimization features (HTTP/2, Cloud NAT) with instance lifecycle management, mistakenly believing they address cold starts or scaling latency when they only affect connection efficiency or outbound connectivity.

214

MCQmedium

A company has multiple projects under an organization. They want to allocate costs to different departments based on project labels. They have applied labels to all resources, but the cost reports show 'Unlabeled' for many resources. What is the most likely cause?

A.Labels are not propagated to billing export.

B.The billing export is not configured.

C.The project uses a different billing account.

D.Labels were applied after the billing period.

AnswerD

Historical cost data does not retroactively update with new labels.

Why this answer

Labels applied after the billing period do not apply to historical data; they only appear in future billing reports.

215

MCQmedium

A company uses Cloud Source Repositories and wants automatic builds on pull requests to the main branch. Which Cloud Build trigger type should they configure?

A.Pull request (comment)

B.Push to a branch

C.Pull request (any)

D.Tag push

AnswerC

This trigger runs builds for all pull request events.

Why this answer

A pull request trigger in Cloud Build allows you to run builds automatically when a pull request is opened, synchronized, or updated.

216

MCQmedium

A Cloud Build pipeline fails with 'Permission denied' when trying to pull a Docker image from Artifact Registry in the same project. The Cloud Build service account has the Artifact Registry Reader role. What additional configuration is likely missing?

A.Cloud Build needs to be enabled in the Artifact Registry region.

B.The Artifact Registry repository has a VPC-SC perimeter blocking access.

C.The Docker image tag is incorrect.

D.The service account needs the Artifact Registry Writer role as well.

AnswerB

VPC-SC perimeters can deny access from outside the perimeter; Cloud Build may run outside the perimeter unless it is configured otherwise.

Why this answer

The Cloud Build service account has the Artifact Registry Reader role, which grants permission to read (pull) images. However, if the Artifact Registry repository is inside a VPC Service Controls (VPC-SC) perimeter, the service account must also be explicitly added to the perimeter's allowed identities or the request must originate from within the perimeter. Without this, VPC-SC blocks all API calls from outside the perimeter, resulting in a 'Permission denied' error despite valid IAM roles.

Exam trap

Google Cloud often tests the distinction between IAM permissions and VPC-SC perimeter policies, tricking candidates into thinking that a missing IAM role is the only cause of 'Permission denied' when the real issue is a network-level access control boundary.

How to eliminate wrong answers

Option A is wrong because Cloud Build does not need to be 'enabled' in a specific Artifact Registry region; Artifact Registry is a global service with regional repositories, and Cloud Build can pull from any region as long as network and IAM permissions are correct. Option C is wrong because an incorrect Docker image tag would produce a 'not found' error (e.g., manifest unknown), not a 'Permission denied' error. Option D is wrong because the Artifact Registry Writer role is only needed for pushing images, not pulling; the Reader role is sufficient for pull operations.

217

MCQmedium

Refer to the exhibit. A DevOps engineer tries to create a project but gets this error. What is the most likely cause?

A.The organization has an organizational policy that restricts project creation.

B.The project ID already exists.

C.The user's billing account is not linked.

D.The user does not have the Project Creator role.

AnswerA

FAILED_PRECONDITION is typical for policy violations.

Why this answer

The error message indicates that project creation is blocked by an organization policy. In Google Cloud, organization policy constraints can be set at the organization or folder level and are inherited by the resources beneath them. This is a common control for governance and cost management, and a policy denial applies even when the user holds the required IAM role.

Exam trap

Google Cloud often tests the distinction between IAM permission errors and organizational policy errors, where candidates mistakenly attribute a policy-based denial to missing roles or billing issues.

How to eliminate wrong answers

Option B is wrong because a duplicate project ID would produce a '409 Conflict' error with a message like 'Project ID already exists', not a policy-based denial. Option C is wrong because an unlinked billing account would cause a 'billing account not found' or 'billing account is not associated' error during project creation, not a generic policy restriction. Option D is wrong because the Project Creator role (roles/resourcemanager.projectCreator) is required to create projects, but if the user lacked it, the error would be 'Permission denied' or 'The caller does not have permission', not an organizational policy violation.

218

MCQeasy

A DevOps team wants to get a monthly report of GCP costs broken down by service and project. What is the simplest way to achieve this?

A.Export billing data to BigQuery and run a scheduled query

B.Use the gcloud CLI to run a billing query each month

C.Use the Cloud Billing reports page to export a CSV directly

D.Set up a Cloud Function to call the Cloud Billing API and generate a report

AnswerC

Billing reports provide a user interface with filtering and export capabilities.

Why this answer

Option C is correct because the Cloud Billing reports page allows custom reports, grouped by service and project, and direct CSV export. Option A works but is more complex to set up. Options B and D are overkill for a simple monthly report.

219

MCQeasy

A DevOps team is bootstrapping their Google Cloud organization and wants to enable Infrastructure as Code (IaC) using Terraform. They need a service account that Terraform can use to create and manage resources across multiple projects. What is the best practice for creating and managing this service account?

A.Create a service account in a separate 'admin' project and grant it the required roles on each project via IAM.

B.Generate a service account key and store it in a Cloud Storage bucket accessible to the team.

C.Use a user account with two-factor authentication for Terraform automation.

D.Use the Compute Engine default service account from the project where Terraform runs.

AnswerA

This provides centralized control and separates credentials from workloads.

Why this answer

Option A is correct because creating a service account in a dedicated 'admin' project and granting it the necessary roles on each project is a common pattern. Option D is wrong because the default service account has too many permissions and is not recommended. Option B is wrong because long-lived service account keys should be avoided, and storing one in a bucket the whole team can read exposes the credential.

Option C is wrong because using a user account is not secure for automation.

220

MCQmedium

A team notices that a Cloud Run service occasionally has high latency. They suspect a memory leak or excessive CPU usage. Which tool should they use to identify the bottleneck during those periods?

A.Cloud Trace

B.Cloud Monitoring

C.Cloud Logging

D.Cloud Profiler

AnswerD

Cloud Profiler continuously profiles CPU and memory usage to pinpoint bottlenecks.

Why this answer

Cloud Profiler is the correct tool because it continuously gathers CPU and memory usage data from your Cloud Run service, allowing you to identify which functions or code paths are consuming excessive resources during high-latency periods. Unlike monitoring or logging, Profiler provides a flame graph that pinpoints the exact bottleneck (e.g., a memory leak or CPU spike) at the function level, which is essential for diagnosing performance issues in a serverless environment.

Exam trap

The trap here is that candidates often confuse Cloud Trace (distributed tracing) with Cloud Profiler (code-level profiling), assuming that latency tracing alone can identify resource bottlenecks, when in fact only Profiler can pinpoint the specific function causing excessive CPU or memory usage.

How to eliminate wrong answers

Option A is wrong because Cloud Trace is designed for distributed tracing of request latency across services, not for profiling CPU or memory usage within a single service. Option B is wrong because Cloud Monitoring provides metrics and alerts (e.g., request latency, CPU utilization) but does not offer the granular, code-level insight needed to identify a memory leak or excessive CPU usage. Option C is wrong because Cloud Logging captures log entries and error messages but cannot profile resource consumption or show which specific functions are causing the bottleneck.

221

Multi-Selectmedium

A company is using Google Cloud and wants to monitor and control costs. Which TWO actions should they take? (Choose two.)

Select 2 answers

A.Set up budget alerts to notify when spending exceeds thresholds.

B.Use labels to categorize resources and track costs by team.

C.Disable all unnecessary APIs at the organization level.

D.Export billing data to BigQuery for detailed analysis.

E.Grant billing account access to all project owners.

AnswersA, B

Budget alerts help monitor and control costs by providing notifications.

Why this answer

Setting up budget alerts (option A) is a core cost control action because it allows you to define a budget amount and receive notifications when actual or forecasted spending exceeds a specified threshold (e.g., 50%, 90%, 100%). This enables proactive intervention before costs spiral out of control. Using labels (option B) to categorize resources (e.g., by team, project, or environment) allows you to track and allocate costs accurately in billing reports, making it easier to identify cost drivers and enforce accountability.
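The threshold behavior described above can be sketched in Python. The function name and the dollar amounts are illustrative assumptions; real budget alerts are configured in Cloud Billing rather than computed by hand.

```python
def crossed_thresholds(spend: float, budget: float, thresholds=(0.5, 0.9, 1.0)):
    """Return the budget thresholds (as fractions) that current spend has reached."""
    return [t for t in thresholds if spend >= budget * t]


# Hypothetical month: $920 spent against a $1,000 budget.
print(crossed_thresholds(920, 1000))   # [0.5, 0.9]
print(crossed_thresholds(300, 1000))   # []
```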

Exam trap

Google Cloud often tests the distinction between cost monitoring (e.g., exporting to BigQuery) and cost control (e.g., budget alerts and labels), so candidates mistakenly select options that only provide visibility rather than active enforcement or categorization.

222

MCQeasy

Your company runs a web application on Compute Engine behind a global HTTP(S) Load Balancer. You want to improve performance for users in Europe. You have already enabled Cloud CDN. What is the next best action to reduce latency?

A.Switch to a regional load balancer to route traffic more efficiently.

B.Configure a multi-regional Cloud CDN and enable cache warming for popular content.

C.Add Compute Engine instances in a European region (e.g., europe-west1) and add them to the load balancer's backend.

D.Increase the machine type of existing instances to improve processing speed.

AnswerC

Adding instances in Europe reduces the distance between users and servers, lowering latency.

Why this answer

Adding Compute Engine instances in a European region (e.g., europe-west1) and adding them to the load balancer's backend reduces latency by bringing the origin server physically closer to European users. Even with Cloud CDN enabled, cache misses still require a round trip to the backend; having a backend in Europe minimizes that distance. The global HTTP(S) Load Balancer automatically routes requests to the closest healthy backend, so this directly improves performance for users in Europe.

Exam trap

The trap here is that candidates assume Cloud CDN alone solves all latency issues, forgetting that cache misses still require backend proximity, and that the global load balancer already supports multi-region backends without needing to switch to a regional load balancer.

How to eliminate wrong answers

Option A is wrong because switching to a regional load balancer would actually increase latency for users outside that single region, as it cannot route traffic globally; the global load balancer is already the correct choice for multi-region distribution. Option B is wrong because multi-regional Cloud CDN is not a configurable setting—Cloud CDN is already global by default, and cache warming does not reduce latency for cache misses or dynamic content that bypasses the cache. Option D is wrong because increasing machine type improves processing speed but does not reduce network latency; the bottleneck for European users is geographic distance, not compute capacity.

223

MCQhard

A company runs a production workload on Google Kubernetes Engine (GKE) with cluster autoscaling enabled. The cluster has three node pools: one for general-purpose workloads (n1-standard-4), one for memory-intensive workloads (n2-highmem-8), and one for GPU-accelerated jobs (with 1 NVIDIA Tesla T4 per node). The workloads are a mix of stateless microservices and stateful databases. Over the past month, the monthly GKE cost has increased by 40% despite no significant change in application traffic or resource requests. The team has verified that vertical pod autoscaling and node auto-provisioning are enabled. They have also checked that there are no orphaned resources. They suspect that overspending is due to inefficient resource utilization or node selection. What should the team do to identify and reduce the unnecessary cost?

A.Review billing export data in BigQuery to identify the top cost contributors by project, service, and label.

B.Use the GKE cost optimization recommender to identify idle resources and apply recommended changes.

C.Disable node auto-provisioning and switch all nodes to preemptible instances.

D.Create a budget alert for the GKE service and set a hard limit to prevent further overspend.

AnswerB

The recommender provides specific, actionable insights to reduce costs without impacting workload stability.

Why this answer

Option B is correct because the GKE cost optimization recommender specifically analyzes cluster utilization patterns, such as underutilized nodes or pods, and provides actionable recommendations to right-size resources. Since the team has already verified that VPA and node auto-provisioning are enabled and no orphaned resources exist, the recommender can pinpoint inefficiencies like over-provisioned node pools or idle GPU nodes that are driving the 40% cost increase despite stable traffic.

Exam trap

The trap here is that candidates confuse high-level billing analysis (Option A) with actionable optimization recommendations, or they assume that disabling auto-provisioning and using preemptible instances (Option C) is a universal cost-saving fix without considering workload compatibility and the need for diagnostic insights first.

How to eliminate wrong answers

Option A is wrong because reviewing billing export data in BigQuery identifies cost contributors at a high level (project, service, label) but does not provide granular, actionable recommendations for GKE resource optimization; it lacks the specific node-level and pod-level analysis needed to address inefficient utilization. Option C is wrong because disabling node auto-provisioning and switching all nodes to preemptible instances is a drastic change that could cause workload disruption for stateful databases (which require persistent nodes) and does not address the root cause of inefficient resource utilization; preemptible instances are not suitable for stateful workloads due to their 24-hour maximum lifetime and potential for sudden termination. Option D is wrong because creating a budget alert with a hard limit only prevents further overspend by stopping or alerting on cost, but does not identify or reduce the existing inefficiencies; it is a reactive measure, not a diagnostic or optimization tool.

224

MCQmedium

A DevOps team is migrating their infrastructure to Google Cloud. They have a complex environment with multiple VPC networks, shared services, and separate development and production projects. They want to bootstrap a new organization that supports: (1) centralized network management with shared VPC, (2) separate folders for dev and prod, (3) consistent firewall rules across all projects, (4) a single Cloud NAT for outbound traffic. They have an existing on-premises VPN that must connect to all projects. What is the most efficient approach?

A.Create separate VPCs for each project and peer them all together, then configure Cloud NAT and VPN in each project.

B.Use a single project for all environments and rely on VPC subnets and firewall rules to isolate workloads.

C.Create a folder for networking and a folder for projects, then use a Cloud VPN appliance from the Marketplace.

D.Create a host project for Shared VPC, attach all projects as service projects, and configure Cloud NAT and VPN in the host project. Use organization policies to enforce firewall rules.

AnswerD

Shared VPC allows central network; host project handles VPN and NAT.

Why this answer

Option D is correct because Shared VPC allows you to create a host project that centrally manages network resources (VPC, subnets, firewall rules, Cloud NAT, VPN) while attaching development and production projects as service projects. This satisfies all requirements: centralized network management, separate folders for dev/prod, consistent firewall rules via organization policies, a single Cloud NAT for outbound traffic, and a single VPN connection to on-premises that routes to all service projects through the host project's VPC.

Exam trap

The trap here is that candidates often assume VPC peering or a single project with subnets is sufficient. Among the options offered, only Shared VPC gives every service project access to one centrally managed network, so a single Cloud NAT and a single VPN can serve all of them without running into the non-transitive routing of VPC peering, while still keeping dev and prod in separate folders with consistent policy enforcement.

How to eliminate wrong answers

Option A is wrong because creating separate VPCs for each project and peering them all together does not provide centralized network management; VPC peering is a non-transitive connection, so a single Cloud NAT and VPN in one project cannot serve all projects without complex routing and additional NAT gateways. Option B is wrong because using a single project for all environments violates the requirement for separate folders for dev and prod, and it does not scale for complex environments with multiple projects; it also lacks the organizational structure needed for bootstrapping a new organization. Option C is wrong because creating a folder for networking and a folder for projects does not inherently enable Shared VPC; using a Cloud VPN appliance from the Marketplace adds unnecessary complexity and cost, and does not provide the centralized network management or consistent firewall rules across projects that Shared VPC with organization policies offers.
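
The non-transitivity argument against Option A can be illustrated with a small model. This is only a sketch in plain Python, not a call into any Google Cloud API: the function names and the VPC names ("dev", "hub", "prod") are hypothetical and chosen for illustration.

```python
from typing import Iterable, Tuple


def full_mesh_peerings(n_vpcs: int) -> int:
    """Number of peering connections needed so every VPC pair is directly peered."""
    return n_vpcs * (n_vpcs - 1) // 2


def reachable_via_peering(
    peerings: Iterable[Tuple[str, str]], src: str, dst: str
) -> bool:
    """Model of non-transitive peering: traffic flows only across a direct peering.

    A path such as dev -> hub -> prod does NOT count, because routes learned
    from one peering are not passed on to another peer.
    """
    direct_pairs = {frozenset(pair) for pair in peerings}
    return frozenset((src, dst)) in direct_pairs


# Hypothetical hub-and-spoke layout: the VPN and NAT would live in "hub".
peerings = [("dev", "hub"), ("hub", "prod")]

print(reachable_via_peering(peerings, "dev", "hub"))   # direct peering
print(reachable_via_peering(peerings, "dev", "prod"))  # only via hub, so not reachable
print(full_mesh_peerings(4))                           # peerings needed for 4 VPCs
```

Under this model, serving every project needs either a full mesh of peerings or duplicated NAT and VPN resources, which is the overhead that a single Shared VPC host network avoids.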

225

Multi-Selecteasy

A company wants to set budgets and alerts for a GCP project. Which two steps are required to enable budget notifications via email? (Choose two.)

Select 2 answers

A.Enable the Cloud Billing API for the project

B.Create a budget alert threshold in the Cloud Console

C.Add the email recipients in the budget configuration

D.Create a Pub/Sub topic and subscribe to it

E.Set up a Cloud Function to send emails

AnswersB, C

Thresholds are required to trigger alerts based on spending.

Why this answer

Options B and C are required. You must create a budget with alert thresholds and add email recipients in the budget configuration. Option A is not required for email.

Options D and E are for programmatic notifications.
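
Why both steps are needed can be sketched with a small model. This is an illustration in plain Python, not the Cloud Billing Budget API: the `Budget` class, the `pending_notifications` function, and the example address are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Budget:
    """Simplified stand-in for a budget: an amount, thresholds, and recipients."""
    amount: float
    # Thresholds as fractions of the budget amount (0.5 means 50%).
    thresholds: List[float] = field(default_factory=list)
    recipients: List[str] = field(default_factory=list)


def pending_notifications(budget: Budget, spend: float) -> List[Tuple[str, float]]:
    """Return one (recipient, threshold) pair per crossed threshold and recipient."""
    crossed = [t for t in sorted(budget.thresholds) if spend >= t * budget.amount]
    return [(r, t) for t in crossed for r in budget.recipients]


# Thresholds (step B) and recipients (step C) are both configured.
complete = Budget(amount=1000, thresholds=[0.5, 0.9, 1.0],
                  recipients=["ops@example.com"])
print(pending_notifications(complete, spend=920))

# Missing either piece yields no email in this model.
no_thresholds = Budget(amount=1000, recipients=["ops@example.com"])
no_recipients = Budget(amount=1000, thresholds=[0.5, 0.9, 1.0])
print(pending_notifications(no_thresholds, spend=920))
print(pending_notifications(no_recipients, spend=920))
```

In this model a notification exists only when a threshold has been crossed and there is someone to send it to, which mirrors why B and C are the two required steps.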
