Google Professional Cloud DevOps Engineer PCDOE Questions 376–450 | Page 6/7

376

MCQ (hard)

Refer to the exhibit. A payment microservice on GKE logs frequent 'connection closed' errors. The service connects to a backend database. Which approach is most effective to reduce these errors?

A.Implement retry logic with exponential backoff in the service code.

B.Increase the number of pod replicas to distribute load.

C.Adjust the readiness probe to be more aggressive.

D.Increase the CPU and memory limits for the container.

Answer: A

Retries handle transient connection closures.

Why this answer

The 'connection closed' errors indicate transient network failures or database server-side connection drops. Implementing retry logic with exponential backoff in the service code is the most effective approach because it allows the microservice to gracefully recover from intermittent failures without overwhelming the database with immediate retries. This pattern is a standard resilience technique for cloud-native applications on GKE, as it handles temporary issues like network blips or database connection pool exhaustion.
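The retry pattern described above can be sketched as follows. This is a minimal illustration, not the service's actual code: `ConnectionClosedError`, `retry_with_backoff`, and the delay values are hypothetical names and defaults chosen for the example, and random jitter is added so that many pods do not retry in lockstep.

```python
import random
import time


class ConnectionClosedError(Exception):
    """Hypothetical stand-in for a database driver's 'connection closed' error."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation(); on ConnectionClosedError, wait and try again.

    The wait grows exponentially with each attempt (capped at max_delay)
    and is randomized so that concurrent clients spread out their retries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionClosedError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Only errors known to be transient should be retried this way. Retrying non-idempotent operations, such as a payment write, would also need an idempotency safeguard, which this sketch leaves out.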

Exam trap

Google Cloud often tests the misconception that scaling resources (pods or limits) fixes all performance issues, but here the trap is that 'connection closed' errors are typically transient network or database-side issues, not resource bottlenecks, so retry logic is the correct resilience pattern.

How to eliminate wrong answers

Option B is wrong because increasing pod replicas distributes load but does not address the root cause of transient connection failures; it may even increase the number of concurrent connections, potentially worsening the problem. Option C is wrong because adjusting the readiness probe to be more aggressive (e.g., shorter interval or lower threshold) could cause pods to be prematurely removed from service during brief hiccups, leading to more instability and connection errors. Option D is wrong because increasing CPU and memory limits addresses resource starvation, not transient network or database connection drops; the errors are not caused by insufficient resources but by connection lifecycle issues.

377

MCQ (medium)

A company uses Cloud Build for CI/CD. They want to allow Cloud Build to deploy to Cloud Run. What is the minimum IAM role to assign to the Cloud Build service account?

A.roles/cloudbuild.builds.builder

B.roles/run.admin

C.roles/editor

D.roles/run.invoker

Answer: B

Provides full control over Cloud Run services, enabling deployment.

Why this answer

The Cloud Build service account needs permission to create and manage Cloud Run resources, including deploying new revisions. Of the listed options, `roles/run.admin` is the least-privileged role that permits deployment: it provides full control over Cloud Run services. The `roles/cloudbuild.builds.builder` role covers Cloud Build operations such as running builds, not deploying to Cloud Run.

Exam trap

The trap here is that candidates often confuse the Cloud Build service account's role with the Cloud Build builder role (`roles/cloudbuild.builds.builder`), mistakenly thinking it includes deployment permissions, when in fact it only covers build orchestration.

How to eliminate wrong answers

Option A is wrong because `roles/cloudbuild.builds.builder` grants permissions only for Cloud Build operations (e.g., creating builds, viewing logs) and does not include any Cloud Run deployment permissions. Option C is wrong because `roles/editor` is a broad, basic role that includes many permissions beyond what is needed, violating the principle of least privilege; it is not the minimum IAM role. Option D is wrong because `roles/run.invoker` only allows invoking (calling) an existing Cloud Run service, not deploying or updating it.

378

MCQ (easy)

A company serves static assets (images, CSS) to global users. Users in distant regions experience slow load times. Which service should they use to optimize delivery?

A.Cloud CDN

B.Cloud Load Balancing

C.Cloud NAT

D.Cloud Armor

Answer: A

Cloud CDN caches static content at global edge locations, reducing latency for distant users.

Why this answer

Cloud CDN (Content Delivery Network) caches static assets (images, CSS) at edge locations worldwide, reducing latency for distant users by serving content from a nearby point of presence (PoP). This directly addresses slow load times caused by geographic distance, as the origin server is no longer the sole source of delivery.

Exam trap

Google Cloud often tests the distinction between caching (CDN) and load balancing, where candidates mistakenly think distributing traffic globally (Cloud Load Balancing) will also cache content, but load balancing alone does not reduce latency for static assets without edge caching.

How to eliminate wrong answers

Option B (Cloud Load Balancing) is wrong because it distributes incoming traffic across multiple backend instances to improve availability and fault tolerance, but it does not cache content or reduce latency for static assets globally. Option C (Cloud NAT) is wrong because it provides outbound internet connectivity for private instances (e.g., VMs without public IPs) by translating private IPs to public IPs, and has no role in content delivery or caching. Option D (Cloud Armor) is wrong because it is a web application firewall (WAF) that protects against DDoS and application-layer attacks (e.g., SQL injection, XSS), not a caching or content delivery service.

379

MCQ (easy)

A DevOps team is defining an SLO for a web application that runs on Compute Engine behind an HTTP Load Balancer. They need to measure the proportion of requests that complete within 300ms. Which Cloud Monitoring metric is most appropriate as the SLI?

A.loadbalancing.googleapis.com/https/backend_request_bytes

B.loadbalancing.googleapis.com/https/frontend_tcp_rtt

C.loadbalancing.googleapis.com/https/request_count

D.loadbalancing.googleapis.com/https/total_latencies

Answer: D

This metric reports a latency distribution, from which the share of requests completing within 300ms can be derived, making it ideal for a latency SLI.

Why this answer

The SLI must measure the proportion of requests completing within 300ms, which is a latency distribution metric. The `total_latencies` metric from the HTTP Load Balancer provides a histogram of request latencies, allowing you to compute the percentage of requests below a threshold (e.g., 300ms). This directly supports the SLO definition.
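As a rough illustration of how a histogram yields this SLI, the sketch below computes the fraction of requests in buckets at or below a threshold. The bucket boundaries and counts are invented for the example and do not reflect the actual bucket layout of `total_latencies`. The function name `fraction_within` is also hypothetical.

```python
def fraction_within(bucket_upper_bounds_ms, bucket_counts, threshold_ms):
    """Return the fraction of requests in buckets whose upper bound is
    at or below threshold_ms, or None if there were no requests.

    Assumes threshold_ms coincides with a bucket boundary; otherwise the
    bucket that straddles the threshold is conservatively counted as slow.
    """
    total = sum(bucket_counts)
    if total == 0:
        return None
    good = sum(
        count
        for bound, count in zip(bucket_upper_bounds_ms, bucket_counts)
        if bound <= threshold_ms
    )
    return good / total


# Illustrative data only: upper bounds in ms and request counts per bucket.
bounds = [100, 200, 300, 500, 1000]
counts = [600, 250, 100, 40, 10]
sli = fraction_within(bounds, counts, threshold_ms=300)
print(f"Share of requests within 300ms: {sli:.2%}")
```

A counter such as `request_count` has no per-bucket breakdown, so the same calculation is not possible with it.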

Exam trap

Google Cloud often tests the distinction between latency metrics (histogram-based) and simple counters or byte metrics, expecting candidates to recognize that only a distribution metric like `total_latencies` can compute percentile-based SLIs.

How to eliminate wrong answers

Option A is wrong because `backend_request_bytes` measures the size of request data sent to backends, not latency. Option B is wrong because `frontend_tcp_rtt` measures TCP round-trip time between client and load balancer, not application-layer request latency. Option C is wrong because `request_count` only counts total requests without any latency information, so it cannot be used to measure the proportion of fast requests.

380

MCQ (hard)

You created the above alert policy to detect high CPU utilization in your GKE cluster. However, you are receiving too many false positive alerts. What is the most likely reason?

A.The threshold value of 0.8 is too low; it should be 0.9 for production.

B.The crossSeriesReducer is set to REDUCE_SUM, which sums CPU across containers, so a namespace with many containers can trigger the alert even if each container uses less than 80%.

C.The duration of 300 seconds (5 minutes) is too short; it should be longer to avoid transient spikes.

D.The filter does not specify a specific namespace, causing alerts from all namespaces.

Answer: B

REDUCE_SUM adds up CPU usage of all containers in the namespace/container group. This can exceed 0.8 when many containers are active, even if each is below 80%. Using REDUCE_MAX per container would be more appropriate.

Why this answer

Option B is correct because the crossSeriesReducer set to REDUCE_SUM aggregates CPU utilization across all containers in a namespace. This means that even if each container uses only 20% CPU, a namespace with five containers would show a total of 100%, triggering the alert when the threshold is 0.8 (80%). This causes false positives because the alert fires on the sum, not on individual container utilization.
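The arithmetic in this explanation can be checked with a small sketch. The two helper functions only mimic the effect of the reducers on a list of per-container values. They are not the Cloud Monitoring API, and the utilization figures are the hypothetical ones from the example above.

```python
def reduce_sum(values):
    """Combine all time series into one by adding them (REDUCE_SUM-style)."""
    return sum(values)


def reduce_max(values):
    """Combine all time series into one by keeping the largest (REDUCE_MAX-style)."""
    return max(values)


THRESHOLD = 0.8
# Five containers, each at 20% CPU utilization.
per_container_utilization = [0.2, 0.2, 0.2, 0.2, 0.2]

sum_fires = reduce_sum(per_container_utilization) > THRESHOLD
max_fires = reduce_max(per_container_utilization) > THRESHOLD
print(f"REDUCE_SUM alert fires: {sum_fires}")
print(f"REDUCE_MAX alert fires: {max_fires}")
```

With the sum, the alert fires although no single container is busy. With the maximum, it fires only when at least one container actually exceeds the threshold.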

Exam trap

Google Cloud often tests the misconception that false positives are caused by thresholds being too low or durations too short, when the real issue is an incorrect aggregation reducer that sums metrics across multiple resources.

How to eliminate wrong answers

Option A is wrong because raising the threshold to 0.9 would not fix the root cause—the aggregation issue—and could still trigger false positives if the sum of many low-utilization containers exceeds 0.9. Option C is wrong because the duration of 300 seconds is already long enough to filter transient spikes; extending it further would delay legitimate alerts without addressing the aggregation problem. Option D is wrong because the filter not specifying a namespace is not the primary cause; the alert would still fire on aggregated CPU across all containers, and adding a namespace filter would not prevent false positives from summed utilization within that namespace.

381

MCQ (easy)

A company wants to receive notifications when their Google Cloud costs exceed $5000 in a month. They have set a budget alert at the billing account level. What is the minimum configuration required to ensure they get alerted?

A.Set budget amount to $5000 and alert threshold at 100% and configure a Pub/Sub topic for notifications.

B.Set budget amount to $5000 and alert threshold at 100%.

C.Set budget amount to $5000 and alert threshold at 50% and 100%.

D.Set budget amount to $5000 and alert threshold at 100% and ensure the budget is scoped to a single project.

Answer: A

This includes both the threshold and a notification channel (Pub/Sub or email), meeting the minimum requirement.

Why this answer

Option A is correct because budget alerts require both a threshold and a notification channel (Pub/Sub or email). Option B sets the threshold but is missing the notification channel. Option C adds an unnecessary 50% threshold and still lacks a notification channel. Option D incorrectly scopes the budget to a single project, whereas the budget is set at the billing account level.

382

MCQ (hard)

A financial services firm is implementing a CI/CD pipeline with Cloud Build and Artifact Registry. Their security policy requires all data to remain within a VPC Service Controls perimeter. They have configured Cloud Build to use a private worker pool with no external IP addresses and have set up VPC-SC to allow traffic between Cloud Build and Artifact Registry within the perimeter. However, builds that push Docker images to Artifact Registry fail with the error: 'denied: Unauthenticated request. Push access to the repository is denied.' The build configuration includes the step: 'steps: - name: gcr.io/cloud-builders/docker args: [push, us-central1-docker.pkg.dev/myproject/my-repo/myimage]' The Cloud Build service account has been granted roles/artifactregistry.writer on the repository. What is the most likely cause?

A.The Cloud Build service account does not have permissions to authenticate to Artifact Registry when using a private pool.

B.The VPC-SC perimeter does not allow egress to the Artifact Registry API endpoint.

C.The Docker push is failing because the image tag is missing a version.

D.The Artifact Registry repository is in a different region than the Cloud Build worker pool.

Answer: B

VPC-SC can restrict access to APIs; the Artifact Registry endpoint must be explicitly allowed in the perimeter.

Why this answer

Option B is correct because VPC Service Controls can block access to Artifact Registry API endpoints that the perimeter does not allow, resulting in a denied error even with correct IAM permissions. Option A is incorrect because the IAM permissions are correct: the service account already has roles/artifactregistry.writer on the repository. Option C is incorrect because a missing tag defaults to `latest` and would not produce an authentication error. Option D is incorrect because Artifact Registry is regional, but private pools can access repositories in any region.

383

Multi-Select (hard)

A DevOps team uses Cloud Build and Cloud Deploy to deploy to GKE. They want to implement a gated deployment where a manual approval is required before promoting from staging to production. What two resources should they configure? (Select TWO)

Select 2 answers

A.A Cloud Pub/Sub topic to notify approvers

B.A Cloud Deploy rollout with a pre-deploy hook

C.A Cloud Deploy approval rule in the delivery pipeline

D.A Cloud Deploy target with a requireApproval attribute set to true

E.A Cloud Build trigger with a manual approval step

Answers: C, D

Approval rules define stages where manual approval is needed.

Why this answer

Option C is correct because a Cloud Deploy approval rule in the delivery pipeline defines a manual gate that pauses the pipeline at a specific stage (e.g., before promoting to production) and requires explicit approval to proceed. Option D is correct because setting the `requireApproval` attribute to `true` on a Cloud Deploy target enforces that any rollout targeting that environment must receive manual approval before the deployment proceeds.

Exam trap

Google Cloud often tests the distinction between Cloud Deploy's native approval mechanism (approval rules and `requireApproval` on targets) and Cloud Build's manual approval steps, which are separate and apply to build pipelines, not deployment pipelines.

384

MCQ (hard)

You are a DevOps engineer at a large e-commerce company that runs its production workloads on Google Kubernetes Engine (GKE) in the us-central1 region. The cluster has 500 nodes, each with 8 vCPUs and 32 GB of memory, and uses preemptible VMs for cost savings. Over the past month, the monthly GKE cost has increased by 30% unexpectedly. Upon reviewing the billing reports, you notice a significant spike in Compute Engine costs, specifically for 'Sustained Use Discount' line items, but the total cost is higher than expected. You also observe that the cluster's node utilization is inconsistent, with some nodes running at 90% CPU and memory while others are below 20%. Your team has been deploying stateless microservices and using Cluster Autoscaler with default settings. The application traffic is variable but predictable, with peaks on weekends. You need to reduce the GKE costs without impacting performance. What should you do?

A.Enable node auto-provisioning and migrate baseline workloads to nodes covered by committed use discounts (1-year or 3-year).

B.Increase the minimum number of nodes in the cluster to 300 and use a larger machine type to reduce the number of pods per node.

C.Switch the cluster to a single-zone configuration and reduce the number of nodes to 200 to lower base costs.

D.Reduce the number of preemptible VMs to 30% and use only on-demand VMs for the remaining nodes to improve reliability.

Answer: A

Node auto-provisioning optimizes resource allocation based on pod requirements, and CUDs provide significant savings for stable workloads.

Why this answer

Option A is correct because enabling node auto-provisioning allows GKE to automatically select the most cost-effective node configurations for your workloads, reducing waste from over-provisioned nodes. Migrating baseline workloads to nodes covered by committed use discounts (CUDs) locks in lower prices for predictable usage, directly addressing the 30% cost spike caused by inconsistent utilization and unexpected sustained use discount charges. This combination optimizes both variable and steady-state workloads without sacrificing performance.

Exam trap

Google Cloud often tests the misconception that simply reducing node count or switching to cheaper VM types (like preemptible) will solve cost issues, without addressing the root cause of inconsistent utilization and the need for committed use discounts for baseline workloads.

How to eliminate wrong answers

Option B is wrong because increasing the minimum number of nodes to 300 and using larger machine types would increase, not reduce, costs by forcing more idle capacity and higher base compute charges. Option C is wrong because switching to a single-zone configuration reduces resilience and availability, and simply reducing node count to 200 does not address the root cause of inconsistent utilization or the sustained use discount anomaly. Option D is wrong because reducing preemptible VMs to 30% and using more on-demand VMs would raise costs significantly, as preemptible VMs are cheaper; the issue is utilization, not reliability.

385

Multi-Select (medium)

Which TWO of the following are required steps to set up a shared VPC for DevOps teams?

Select 2 answers

A.Attach the service projects to the host project.

B.Create a new VPC in the service project and peer it with the host project.

C.Configure Cloud Interconnect between the host and service projects.

D.Designate the host project and enable Shared VPC for it.

E.Grant the Shared VPC Admin role (roles/compute.xpnAdmin) to the service project team.

Answers: A, D

Service projects must be explicitly attached to use the shared VPC.

Why this answer

Option A is correct because attaching service projects to the host project is a mandatory step in Shared VPC setup. After designating the host project and enabling Shared VPC, you must attach each service project to the host project so that the service projects can consume subnets from the host project's VPC. Without this attachment, the service projects cannot use the shared networking resources.

Exam trap

Google Cloud often tests the misconception that Shared VPC requires VPC peering or that service projects need their own VPC, but the correct model is a single host project VPC shared via attachment, not peering.

386

Multi-Select (hard)

A company runs a web application on Google Kubernetes Engine (GKE) with multiple services. They want to reduce costs without impacting performance. Which THREE actions should they take? (Choose three.)

Select 3 answers

A.Enable cluster autoscaling and manually scale nodes based on peak load.

B.Deploy a service mesh like Istio to optimize traffic routing.

C.Enable node auto-provisioning to automatically adjust node pools.

D.Right-size CPU and memory requests and limits for each service.

E.Use preemptible VMs for stateless, fault-tolerant workloads.

Answers: C, D, E

Node auto-provisioning ensures the cluster uses the right size and type of nodes.

Why this answer

Option C is correct because node auto-provisioning in GKE automatically creates and scales node pools based on the resource requirements of pending pods. This eliminates the need for manual node pool management and ensures that only the necessary compute resources are provisioned, reducing costs without manual intervention or over-provisioning.

Exam trap

Google Cloud often tests the misconception that manual scaling or service meshes are cost-saving measures, when in fact they either increase costs or fail to address the root cause of over-provisioning.

387

Multi-Select (easy)

A DevOps engineer notices that some Compute Engine instances are not reporting metrics to Cloud Monitoring. Which two potential causes should they investigate? (Choose two.)

Select 2 answers

A.The instances are in a different region and Cloud Monitoring doesn't support cross-region.

B.The instances are preemptible and automatically stop reporting after 24 hours.

C.The instances have insufficient IAM permissions to write metrics.

D.The instances are in a different project and not peered.

E.The Ops Agent is not installed on the instances.

Answers: C, E

Instances need the roles/monitoring.metricWriter role to send metrics.

Why this answer

Option C is correct because Compute Engine instances require the appropriate IAM permissions (e.g., roles/monitoring.metricWriter) to write metrics to Cloud Monitoring. Without these permissions, the API calls to ingest metric data are denied, even if the Ops Agent is installed and running.

Exam trap

Google Cloud often tests the misconception that preemptible instances have a built-in metric reporting cutoff, when in fact they can report metrics normally until they are preempted, and the real issue is often IAM permissions or missing agent installation.

388

MCQ (hard)

An organization has multiple projects under a common folder. They want to enforce that all projects use the same VPC network from a central host project. However, one project needs to use a different VPC due to compliance requirements. How can this be achieved?

A.Set an organization policy to enforce shared VPC and create an exception for the specific project using the policy condition.

B.Use VPC Network Peering to connect the project to the host project.

C.Create a separate folder for the exception project and apply a different organizational policy.

D.Grant the project the necessary permissions to use its own VPC.

Answer: A

Organizational policies support conditions for exemptions.

Why this answer

Option A is correct because Google Cloud Organization Policies can enforce constraints like `compute.restrictSharedVpcHostProjects` to mandate shared VPC usage across projects. You can use policy conditions (e.g., `resource.matchTag`) to create an exception for a specific project that needs its own VPC, allowing it to bypass the constraint while all other projects remain bound to the central host project.

Exam trap

Google Cloud often tests the misconception that VPC peering or folder restructuring can solve policy enforcement exceptions, when in reality only organization policy conditions provide the precise, hierarchical override needed without breaking the uniform constraint.

How to eliminate wrong answers

Option B is wrong because VPC Network Peering connects two VPCs for communication but does not enforce that all projects use the same VPC from a central host project; it allows independent VPCs to exchange traffic, which contradicts the requirement of uniform VPC usage. Option C is wrong because creating a separate folder and applying a different organizational policy would affect all projects in that folder, not just the single exception project, and it violates the principle of minimal exception management. Option D is wrong because granting permissions to use its own VPC does not override an organization policy constraint; the policy must be explicitly exempted via conditions, not just by IAM permissions.

389

MCQ (easy)

A company uses Cloud Source Repositories and Cloud Build to build and deploy a Node.js application to Google Kubernetes Engine (GKE). The build step fails intermittently with an error 'npm ERR! network timeout'. What is the most efficient way to reduce build failures?

A.Configure npm to use a proxy and increase the timeout in the build step.

B.Use Artifact Registry to cache npm packages and change the npm registry URL.

C.Set the build to retry on failure in the Cloud Build trigger configuration.

D.Increase the machine type to e2-highmem-4 in the cloudbuild.yaml.

Answer: A

A longer timeout reduces failures due to temporary network issues.

Why this answer

Option A is correct because configuring a proxy or specifying a longer timeout in the npm config can mitigate network timeouts. Option B is incorrect because moving to Artifact Registry doesn't affect npm network calls. Option C is incorrect because retries in Cloud Build don't fix the underlying timeout. Option D is incorrect because increasing machine size doesn't resolve network timeouts.

390

Multi-Select (medium)

Which TWO metrics from Cloud Monitoring would best indicate that a GKE workload is experiencing CPU throttling due to a resource quota? (Choose 2)

Select 2 answers

A.node/cpu/usage_time

B.container/cpu/throttled_time

C.container/memory/usage_bytes

D.container/cpu/usage_time

E.container/accelerator/duty_cycle

Answers: B, D

Directly shows time spent throttled.

Why this answer

Option B is correct because `container/cpu/throttled_time` directly measures the cumulative time a container's CPU usage was throttled due to exceeding its assigned CPU quota (CFS quota). Option D is correct because `container/cpu/usage_time` shows the actual CPU time used by the container; when compared against the quota limit, a high usage_time relative to the quota indicates that throttling is likely occurring. Together, these two metrics confirm both the occurrence and the cause of CPU throttling.

Exam trap

Google Cloud often tests the distinction between node-level and container-level metrics, and the trap here is that candidates may pick `node/cpu/usage_time` (Option A) thinking it reflects container throttling, when in fact it aggregates all pods on the node and cannot reveal per-container quota enforcement.

391

MCQ (easy)

Refer to the exhibit. A GKE node shows MemoryPressure condition. What should the team do to improve performance of pods scheduled on this node?

A.Enable cluster autoscaler to scale up new nodes

B.Increase the node's memory by changing the machine type

C.Adjust pod resource requests to leave more allocatable memory

D.Evict pods and delete the node

Answer: A

Cluster autoscaler adds nodes when pods are unschedulable due to memory pressure, distributing the load.

Why this answer

When a GKE node reports a MemoryPressure condition, available memory on the node has dropped below the kubelet's eviction threshold, so the kubelet may evict pods to free memory, which degrades performance. Enabling cluster autoscaler allows the cluster to automatically provision new nodes when existing nodes are under memory pressure, redistributing pods and alleviating the condition without manual intervention.

Exam trap

Google Cloud often tests the misconception that MemoryPressure can be resolved by modifying pod requests or node size, when the correct automated solution is cluster autoscaler to add capacity dynamically.

How to eliminate wrong answers

Option B is wrong because changing the machine type requires recreating the node, which is disruptive and not a dynamic solution; cluster autoscaler handles scaling without manual node replacement. Option C is wrong because adjusting pod resource requests only affects future scheduling, not the current memory pressure on the node, and does not free memory for existing pods. Option D is wrong because evicting pods and deleting the node is a manual, reactive action that causes downtime, whereas cluster autoscaler provides automated, proactive scaling.

392

MCQ (medium)

A company runs a microservices application on GKE. The checkout service has high tail latency. Using Cloud Profiler, the team finds that most time is spent in database queries. Which action should they take to improve performance?

A.Migrate the database to Cloud Spanner.

B.Increase the number of replicas of the checkout service.

C.Add database connection pooling using a sidecar proxy.

D.Enable Cloud CDN for the checkout API.

Answer: C

Connection pooling reduces overhead of establishing connections, improving latency.

Why this answer

Option C is correct because database connection pooling reduces the overhead of establishing new connections for each request, which is a common cause of high tail latency in microservices. By using a sidecar proxy (e.g., Envoy or a dedicated connection pooler like PgBouncer), the checkout service can reuse existing database connections, minimizing latency spikes from connection setup and teardown. This directly addresses the root cause identified by Cloud Profiler—time spent in database queries—without requiring a database migration or scaling the service itself.
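To show what connection reuse means in practice, here is a minimal sketch of a fixed-size pool. It is an in-process pool rather than the sidecar proxy named in the answer, and it illustrates only the reuse idea. The `ConnectionPool` class and the `connect` factory are hypothetical names for the example, with no real database driver involved.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool that opens its connections once and hands them out
    for reuse instead of opening a new connection per request."""

    def __init__(self, connect, size):
        # `connect` is any zero-argument callable returning a connection.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        """Borrow a connection; it is returned to the pool on exit.

        Blocks while all connections are in use, and raises queue.Empty
        if `timeout` seconds pass without one becoming free.
        """
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)
```

A caller would wrap each query in `with pool.connection() as conn:` so that the setup cost is paid only when the pool is created. A production pooler also validates and replaces broken connections, which this sketch omits.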

Exam trap

Google Cloud often tests the misconception that scaling out (increasing replicas) or migrating to a different database solves all performance issues, when the real problem is connection management overhead within the existing database layer.

How to eliminate wrong answers

Option A is wrong because migrating to Cloud Spanner does not inherently reduce per-query latency; it provides horizontal scalability and strong consistency, but the bottleneck is connection overhead, not database throughput or consistency. Option B is wrong because increasing replicas of the checkout service does not reduce the latency of individual database queries; it may even increase connection churn and exacerbate the problem. Option D is wrong because Cloud CDN caches static content at edge locations, but the checkout API involves dynamic, transactional database queries that cannot be cached, so CDN provides no benefit for this latency issue.

393

MCQ (medium)

A DevOps engineer is setting up a Cloud Build trigger that deploys to Cloud Run. The build succeeds but the deployment fails with 'Permission denied on the Cloud Run service'. What is the most likely cause?

A.The Cloud Run service account lacks the roles/cloudbuild.builds.builder role.

B.The trigger is missing the required deployment configuration.

C.The cloudbuild.yaml file has an incorrect image tag.

D.The Cloud Build service account lacks the roles/run.admin role.

Answer: D

This role is necessary to deploy to Cloud Run.

Why this answer

The Cloud Build service account (typically the Compute Engine default service account or a user-specified service account) requires the roles/run.admin role to deploy to Cloud Run. This role grants permission to create, update, and manage Cloud Run services. Without it, the deployment step fails with a 'Permission denied' error, even though the build itself succeeds.

Exam trap

The trap here is that candidates often confuse the Cloud Build service account with the Cloud Run service account, mistakenly thinking the Cloud Run service account needs the builder role, when in fact the Cloud Build service account needs the run.admin role to deploy.

How to eliminate wrong answers

Option A is wrong because the Cloud Run service account does not need the roles/cloudbuild.builds.builder role; that role is for triggering builds, not for deploying to Cloud Run. Option B is wrong because a missing deployment configuration would typically cause a different error (e.g., missing 'service' or 'region' fields), not a permission denied error. Option C is wrong because an incorrect image tag would cause a build or deployment failure related to image resolution (e.g., 'Image not found'), not a permission denied error on the Cloud Run service.

394

MCQ (medium)

An organization uses Cloud Build with a private pool to build container images that require access to on-premises Artifactory. After moving to a new VPC, builds fail with 'Connection refused' when fetching dependencies. What is the best step to troubleshoot?

A.Verify that VPC Network Peering is established between the Cloud Build private pool's service producer VPC and the customer VPC, and that routes to on-premises are present.

B.Verify that the Cloud Build service account has the dns.networks.bindPrivateZone permission.

C.Check that the Cloud Build service account has the storage.objectViewer role on the Artifactory bucket.

D.Ensure that Cloud NAT is configured in the private pool's VPC.

Answer: A

Private pools require peering; missing peering stops traffic.

Why this answer

The error 'Connection refused' indicates that the Cloud Build private pool's worker VMs cannot reach the on-premises Artifactory server. Private pools are deployed in a Google-managed service producer VPC that must be connected to the customer VPC via VPC Network Peering. Without this peering and the correct routes to the on-premises network (e.g., via Cloud VPN or Dedicated Interconnect), traffic from the private pool is dropped, causing the connection refusal.

Exam trap

The trap here is that candidates confuse connectivity issues with IAM permissions or misapply Cloud NAT, thinking it provides outbound access to on-premises, when in reality private pools require VPC peering and proper routing to reach non-Google Cloud endpoints.

How to eliminate wrong answers

Option B is wrong because the dns.networks.bindPrivateZone permission is used for binding a private DNS zone to a VPC network, which is unrelated to the connectivity issue causing 'Connection refused'. Option C is wrong because Artifactory is an on-premises service, not a Google Cloud Storage bucket; the storage.objectViewer role applies to GCS buckets, not to on-premises HTTP/HTTPS endpoints. Option D is wrong because Cloud NAT provides outbound internet access for private VMs, but the private pool's VPC is the service producer VPC managed by Google, not the customer's VPC; Cloud NAT in the customer VPC does not affect the private pool's connectivity to on-premises.

395

MCQeasy

You are a DevOps engineer for a startup bootstrapping their Google Cloud organization. They have a single project for all environments (dev, test, prod) and a flat resource hierarchy. Recently, a developer accidentally deleted a production Cloud Storage bucket, causing data loss. The team wants to prevent this in the future with minimal disruption. They also want to enforce that all new projects follow a naming convention like 'company-environment-xxx'. The CTO wants a solution using native Google Cloud services without third-party tools. What should you do?

A.Implement a Cloud Function that renames projects not following the convention and deletes buckets not in a folder.

B.Grant all users the Project Creator role but restrict bucket deletion with IAM.

C.Use Google Cloud Deployment Manager to create projects with predefined templates.

D.Create folders for each environment, move existing resources into folders, and apply an organization policy to enforce the naming convention on project creation.

AnswerD

Folders provide isolation; org policy enforces naming.

Why this answer

Option D is correct because creating folders for each environment (dev, test, prod) and moving existing resources into them establishes a hierarchical resource structure that allows organization policies to be applied at the folder level. An organization policy can then enforce the naming convention on project creation, and IAM roles can be scoped to folders to restrict bucket deletion (e.g., granting `roles/storage.objectAdmin` instead of `roles/storage.admin`). This solution uses native Google Cloud services (Resource Manager, Organization Policy, IAM) with minimal disruption because it requires no code changes or third-party tools.

Exam trap

Google Cloud often tests the misconception that Cloud Functions or Deployment Manager can enforce governance retroactively, when in fact organization policies and folders are the only native Google Cloud services that can enforce naming conventions and resource hierarchy constraints at scale.

How to eliminate wrong answers

Option A is wrong because Cloud Functions cannot rename projects or delete buckets based on folder membership; project IDs are immutable after creation, and bucket deletion is controlled by IAM permissions, not serverless functions. Option B is wrong because granting all users the Project Creator role would allow them to create projects without naming enforcement, and restricting bucket deletion with IAM alone does not prevent accidental deletion in a flat hierarchy where permissions are inherited broadly. Option C is wrong because Deployment Manager can create projects from templates but cannot enforce naming conventions retroactively on existing projects or prevent bucket deletion; it is a deployment tool, not a governance enforcement mechanism.

396

MCQeasy

A DevOps engineer receives an alert that the error budget for a critical service has been exhausted. The service runs on Compute Engine behind an HTTP(S) load balancer. The team wants to reduce the impact on users while investigating. What should the engineer do first?

A.Roll back the most recent deployment

B.Begin a detailed postmortem analysis

C.Disable the alerting policy to reduce noise

D.Increase the number of instances in the managed instance group

AnswerA

Rolling back quickly restores the previous stable version.

Why this answer

Rolling back the most recent deployment is the correct first action because it immediately restores the service to a known stable state, stopping further consumption of the error budget. This aligns with the incident management principle of 'mitigate first, investigate later' — reducing user impact takes priority over root cause analysis. The HTTP(S) load balancer will automatically route traffic to the previous healthy version once the rollback is complete.
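
To make "error budget exhausted" concrete, the sketch below computes how much budget remains for an availability SLO. The 99.9% target and the request counts are assumed values for illustration; they do not come from the question.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative when overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("the SLO target leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: 99.9% SLO, 1,000,000 requests, 1,200 of them failed.
remaining = error_budget_remaining(0.999, 1_000_000, 1_200)
print(f"Remaining budget: {remaining:.0%}")
```

A negative result means the budget is overspent, which is the situation where mitigation such as a rollback takes priority over root-cause analysis.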

Exam trap

Google Cloud often tests the misconception that scaling out (increasing instances) is the correct response to any degradation, but here the error budget exhaustion indicates a functional defect, not a capacity issue, so scaling would not fix the root cause.

How to eliminate wrong answers

Option B is wrong because beginning a detailed postmortem analysis is a later step; the immediate priority is to restore service, not analyze the incident. Option C is wrong because disabling the alerting policy would hide the problem rather than fix it, violating the principle of observability and potentially allowing further degradation. Option D is wrong because increasing the number of instances in the managed instance group does not address the root cause (likely a code or configuration defect) and may only temporarily mask the issue while continuing to exhaust the error budget.

397

MCQeasy

A company wants to reduce the response time of a globally distributed web application. Which Google Cloud service can cache static content at edge locations to improve performance?

A.Cloud DNS

B.Cloud NAT

C.Cloud Armor

D.Cloud CDN

AnswerD

Correct. Cloud CDN caches content at edge locations to reduce latency.

Why this answer

Cloud CDN (Content Delivery Network) uses Google's globally distributed edge caches to serve static content (e.g., images, CSS, JavaScript) from locations closer to users, reducing latency and offloading origin servers. It integrates with external HTTPS load balancers to automatically cache responses based on cache-control headers, directly addressing the goal of improving response time for a globally distributed web application.

Exam trap

The trap here is that candidates confuse Cloud Armor (a security service) with a content delivery service, or assume Cloud DNS can cache content because it involves 'edge' name servers, but DNS caching is for DNS records, not web content.

How to eliminate wrong answers

Option A is wrong because Cloud DNS is a domain name resolution service that translates domain names to IP addresses; it does not cache or serve static content at edge locations. Option B is wrong because Cloud NAT provides outbound internet connectivity for private instances via network address translation, with no caching or edge delivery capabilities. Option C is wrong because Cloud Armor is a web application firewall (WAF) and DDoS protection service that filters traffic based on security policies; it does not cache static content or accelerate content delivery.

398

MCQeasy

You are debugging a production issue where a Cloud Function occasionally throws a 'memory limit exceeded' error. You want to inspect the memory usage at the time of the error. What should you do?

A.Check Cloud Logging for memory metrics.

B.Use Cloud Trace to trace the invocations.

C.Use Cloud Debugger to set a breakpoint.

D.Enable Cloud Profiler and analyze the snapshot.

AnswerD

Profiler provides memory and CPU profiling snapshots.

Why this answer

Option D is correct because Cloud Profiler provides continuous, low-overhead profiling of CPU and memory usage, and its snapshot analysis can pinpoint memory allocation patterns at the time of a memory limit exceeded error. Unlike other tools, Profiler captures the call stack and memory consumption per function invocation, enabling you to identify the specific code path causing the spike.

Exam trap

Google Cloud often tests the distinction between monitoring (logging/tracing) and profiling, leading candidates to choose Cloud Logging or Cloud Trace because they are more familiar, while the correct answer requires a tool specifically designed for memory analysis.

How to eliminate wrong answers

Option A is wrong because Cloud Logging does not natively expose memory metrics for Cloud Functions; it logs textual events and errors, but memory usage is not a standard log entry unless explicitly instrumented. Option B is wrong because Cloud Trace focuses on latency and request tracing, not memory profiling; it can show execution time but not memory allocation details. Option C is wrong because Cloud Debugger is designed for inspecting code state at a breakpoint without stopping execution, but it cannot capture memory usage snapshots or profile memory over time, and it may alter the function's runtime behavior.

399

Multi-Selecthard

A DevOps team is designing a CI/CD pipeline using Cloud Build and Spinnaker. They want to ensure secrets are managed securely. Which three recommended practices should they implement? (Choose THREE.)

Select 3 answers

A.Grant Cloud Build service account access to secrets via IAM.

B.Use Cloud KMS to encrypt secrets before storing in Cloud Storage.

C.Base64 encode secrets and store them in Cloud Build substitutions.

D.Rotate secrets regularly using Secret Manager.

E.Store secrets in Cloud Secret Manager.

AnswersA, D, E

Least-privilege access to necessary secrets.

Why this answer

A is correct because Cloud Build's service account must be granted IAM roles (e.g., roles/secretmanager.secretAccessor) on the Secret Manager secret to allow the pipeline to retrieve the secret value at build time. Without explicit IAM binding, the service account lacks permission to access the secret, causing the build to fail. This follows the principle of least privilege and ensures that only authorized identities can read secrets.

Exam trap

Google Cloud often tests the misconception that Base64 encoding or encrypting secrets with Cloud KMS before storage is sufficient, when in fact Secret Manager provides native secure storage, access control, and rotation—making options like B and C redundant or insecure.

400

Multi-Selecthard

A team is designing a dashboard for their production environment using Cloud Monitoring. Which three types of information should be included on the dashboard to support incident response? (Choose three.)

Select 3 answers

A.Resource utilization trends

B.Recent alerting history

C.Real-time user feedback

D.Security audit logs

E.Service Level Indicators (SLIs)

AnswersA, B, E

Trends help identify capacity-related issues during incidents.

Why this answer

Resource utilization trends (A) are essential for incident response because they provide historical context, enabling responders to identify anomalies, correlate changes with incidents, and predict capacity issues. Cloud Monitoring's Metrics Explorer and dashboards allow you to plot trends over time, which is critical for root cause analysis during an active incident.

Exam trap

Google Cloud often tests the distinction between operational monitoring data (metrics, alerts, SLIs) and non-operational data (user feedback, audit logs), expecting candidates to recognize that dashboards for incident response must contain only real-time, actionable, and metric-based information.

401

MCQmedium

An SRE team created the above logs-based metric. They expect it to count the number of HTTP 500 errors per instance. However, the metric shows no data. What is the most likely cause?

A.The metric kind is DELTA but should be CUMULATIVE.

B.The log entries might not have the 'status' field in jsonPayload; it could be in a different location or format.

C.The metric name does not follow the required naming convention.

D.The labelExtractors must use regex instead of JSON path.

AnswerB

If the logs are structured differently, the filter will not match, resulting in no data.

Why this answer

Option B is correct because the most likely reason for a logs-based metric showing no data is that the log entries do not contain the expected 'status' field in jsonPayload, or it is located in a different field (e.g., httpRequest.status) or formatted as a string instead of an integer. Cloud Logging metrics rely on exact field paths defined in the metric descriptor; if the field is missing or misnamed, no data points are generated.
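
The field-path mismatch can be illustrated with a small simulation. This is only a sketch: the log entries are hypothetical, and the strict equality check is a simplification of how Cloud Logging evaluates filters.

```python
from typing import Any, Optional


def get_field(entry: dict, path: str) -> Optional[Any]:
    """Walk a dotted field path through a nested log entry; None if any part is missing."""
    current: Any = entry
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def count_matches(entries: list, path: str, expected: Any) -> int:
    """Count entries whose field at `path` equals `expected` (type-sensitive in this sketch)."""
    return sum(1 for entry in entries if get_field(entry, path) == expected)


# Hypothetical entries: the status code lives under httpRequest, not jsonPayload.
logs = [
    {"httpRequest": {"status": 500}, "jsonPayload": {"message": "upstream failed"}},
    {"httpRequest": {"status": 200}, "jsonPayload": {"message": "ok"}},
    {"httpRequest": {"status": 500}, "jsonPayload": {"message": "timeout"}},
]

print(count_matches(logs, "jsonPayload.status", 500))  # 0: the field is absent
print(count_matches(logs, "httpRequest.status", 500))  # 2
```

Inspecting a real log entry in Logs Explorer before writing the metric filter is the practical equivalent of this check.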

Exam trap

Google Cloud often tests the misconception that metric kind or naming conventions cause missing data, but the real issue is almost always a mismatch between the log entry's actual field structure and the metric's extraction configuration.

How to eliminate wrong answers

Option A is wrong because the metric kind (DELTA vs. CUMULATIVE) affects how values are aggregated over time, not whether data appears; a DELTA metric will still show data if log entries match the filter and field extraction succeeds. Option C is wrong because Cloud Monitoring metric names do not have strict naming conventions that would cause zero data; they follow a simple resource type and metric type pattern, and invalid names would cause a creation error, not silent data absence. Option D is wrong because labelExtractors are not limited to regex: they accept `EXTRACT(field)` to copy a field value as well as `REGEXP_EXTRACT(field, regex)`, so switching to regex is not required to make the metric work.

402

MCQeasy

A company wants to monitor custom application metrics in real-time and trigger alerts when a metric exceeds a threshold. Which Google Cloud service should they use?

A.Cloud Monitoring

B.Cloud Audit Logs

C.Cloud Logging

D.Cloud Error Reporting

AnswerA

Cloud Monitoring ingests custom metrics and provides alerting capabilities.

Why this answer

Cloud Monitoring (formerly Stackdriver Monitoring) is the correct service because it is designed to ingest custom application metrics via the Monitoring API or OpenTelemetry, create dashboards for real-time visualization, and configure alerting policies that trigger notifications when a metric exceeds a defined threshold. This directly meets the requirement for real-time monitoring and threshold-based alerts.

Exam trap

Google Cloud often tests the distinction between logging (text-based events) and monitoring (numeric time-series metrics), so the trap here is that candidates confuse Cloud Logging's log-based metrics or alerting on log entries with the dedicated metric monitoring and alerting capabilities of Cloud Monitoring.

How to eliminate wrong answers

Option B (Cloud Audit Logs) is wrong because it records administrative actions and access events for compliance and security auditing, not real-time custom application metrics or threshold-based alerting. Option C (Cloud Logging) is wrong because it ingests and stores log data (text-based events) and can trigger alerts on log content, but it is not designed for numeric metric time-series ingestion or threshold evaluation. Option D (Cloud Error Reporting) is wrong because it aggregates and analyzes application errors (e.g., exceptions, stack traces) from logs, not custom numeric metrics, and does not support threshold-based alerting on metric values.

403

Multi-Selectmedium

A company wants to optimize costs for their Google Kubernetes Engine (GKE) clusters. Which three best practices should they implement? (Choose three.)

Select 3 answers

A.Use node auto-provisioning to dynamically add nodes

B.Use regional clusters instead of zonal clusters

C.Use committed use discounts for long-running workloads

D.Use pod resource requests and limits appropriately

E.Use preemptible nodes for stateless workloads

AnswersC, D, E

Committed use discounts provide significant savings for predictable, steady-state workloads.

Why this answer

Options C, D, and E are correct. Committed use discounts lower costs for long-running workloads. Proper pod resource requests and limits prevent overprovisioning. Preemptible nodes are cost-effective for stateless workloads. Option A (node auto-provisioning) can help but is not a direct cost optimization best practice on its own; it may increase costs if not tuned. Option B (regional clusters) increases costs due to multi-zone replication.

404

MCQmedium

An application on GKE frequently reads the same data from a Cloud Storage bucket. The data changes rarely. Which solution will best improve read performance and reduce costs?

A.Deploy a sidecar container that caches the data in an emptyDir volume.

B.Configure a Cloud SQL read replica for the data.

C.Increase the number of nodes in the cluster.

D.Use a StatefulSet with a persistent volume claim to store the data.

AnswerA

Sidecar with caching can serve data from local disk, reducing Cloud Storage reads.

Why this answer

Option A is correct: a sidecar container with a shared emptyDir volume can cache data from Cloud Storage using a tool like gcsfuse with caching, so repeated reads are served locally. Option B is wrong because read replicas are for databases. Option C is wrong because increasing node count does not improve per-pod read speed. Option D is wrong because persistent volumes are for stateful workloads.
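
The caching idea can be sketched as a read-through cache. This is a minimal illustration, not a gcsfuse configuration: a temporary directory stands in for the emptyDir volume, and `fake_bucket_read` is a hypothetical stand-in for a Cloud Storage download.

```python
import tempfile
from pathlib import Path
from typing import Callable


class ReadThroughCache:
    """Serve objects from a local directory, reading from the origin only on a miss."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str], bytes]) -> None:
        self.cache_dir = cache_dir
        self.fetch = fetch
        self.origin_reads = 0

    def get(self, name: str) -> bytes:
        path = self.cache_dir / name
        if path.exists():
            return path.read_bytes()  # cache hit: no origin read
        data = self.fetch(name)  # cache miss: exactly one origin read
        self.origin_reads += 1
        path.write_bytes(data)
        return data


def fake_bucket_read(name: str) -> bytes:
    """Hypothetical stand-in for downloading an object from a bucket."""
    return f"contents of {name}".encode()


with tempfile.TemporaryDirectory() as tmp:
    cache = ReadThroughCache(Path(tmp), fake_bucket_read)
    for _ in range(5):
        payload = cache.get("config.json")
    print(cache.origin_reads)  # 1: only the first read reaches the origin
```

Because the data changes rarely, the sketch omits invalidation; a real cache would need a TTL or a version check to pick up updates.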

405

MCQeasy

A startup runs a mobile app backend on App Engine standard environment. They recently added new features, and the app's response time increased significantly. The team suspects instance startup time is causing cold starts for new users. They have already reduced code size and enabled warmup requests. What is the best next step to improve performance?

A.Migrate to App Engine flexible environment

B.Increase the number of idle instances using automatic scaling settings

C.Implement a latency-based health check to redirect traffic

D.Use Cloud Endpoints to limit traffic and reduce load

AnswerB

Setting min_idle_instances to a higher value keeps instances warm, eliminating cold start delays.

Why this answer

Warmup requests reduce cold starts by initializing the app before live traffic arrives, but they don't eliminate startup latency for new instances. Increasing the number of idle instances via automatic scaling settings ensures that pre-warmed, ready-to-serve instances are always available, so new users never trigger a cold start. This directly addresses the root cause—instance startup time—without changing the environment or adding complexity.

Exam trap

Google Cloud often tests the misconception that warmup requests alone solve cold starts, when in fact they only reduce the impact—idle instances are required to eliminate the latency entirely.

How to eliminate wrong answers

Option A is wrong because migrating to App Engine flexible environment would increase cold start latency (VMs take longer to boot than containers) and adds operational overhead, making performance worse, not better. Option C is wrong because latency-based health checks redirect traffic away from unhealthy instances but do not reduce cold start latency; they only manage traffic routing after an instance is already slow. Option D is wrong because Cloud Endpoints manages API authentication and throttling, not instance startup performance; limiting traffic does not reduce the time it takes for a new instance to become ready.

406

MCQhard

A company uses Cloud Deploy for continuous delivery to GKE. They have a delivery pipeline with a rollout strategy: canary (25% for 30m) then full. The canary rollout fails because the new revision's health check errors. The team wants to automatically rollback the canary and notify. What native GCP feature can achieve this?

A.Configure Cloud Monitoring alerting policy on deployment errors that triggers a Cloud Function to rollback.

B.Set up a Cloud Build trigger that detects deployment failure and runs a rollback.

C.Configure a Cloud Deploy rollout strategy with an automated rollback policy.

D.Use a Cloud Deploy rollout strategy with a post-deploy hook that calls Cloud Run jobs to revert.

AnswerC

Cloud Deploy can automatically roll back a failed rollout through its built-in automation, without custom code.

Why this answer

Option C is correct because Cloud Deploy natively supports automated rollback of a failed rollout. Option A is incorrect because it requires custom scripting and is not as native as Cloud Deploy's own feature. Option B is incorrect because Cloud Build triggers are not designed for rollback automation. Option D is incorrect because post-deploy hooks are not meant for rollbacks.

407

MCQmedium

You are creating a Cloud Monitoring dashboard to display the 99th percentile latency of your HTTP Load Balancer over the last 6 hours. Which MQL query should you use?

A.fetch https_lb_rule :: latency | align 99p

B.fetch loadbalancing.googleapis.com/https/total_latencies | align percentile(99)

C.fetch loadbalancing.googleapis.com/https/request_count | align | ratio

D.fetch loadbalancing.googleapis.com/https/total_latencies | align 99 | with latency

AnswerB

This query fetches the latency distribution and aligns to the 99th percentile, exactly as needed.

Why this answer

Option B is correct because it uses the correct metric type (`total_latencies`) and the proper MQL function `percentile(99)` to compute the 99th percentile latency. The `fetch` statement targets the exact Cloud Monitoring metric for HTTPS load balancer latencies, and `align percentile(99)` aggregates the raw latency distribution data over the specified time window (last 6 hours) to produce the desired percentile value.
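
For intuition about what a 99th percentile reports, the sketch below computes a nearest-rank percentile over raw samples. The latencies are assumed values, and the method is a simplification: Cloud Monitoring stores latency as a bucketed distribution, so the percentiles it reports are estimates rather than exact sample values.

```python
import math


def percentile_nearest_rank(samples: list, pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [100.0] * 98 + [2000.0] * 2
print(percentile_nearest_rank(latencies_ms, 50))  # 100.0
print(percentile_nearest_rank(latencies_ms, 99))  # 2000.0
```

The median hides the two slow requests entirely, while the 99th percentile surfaces them, which is why tail percentiles are the usual choice for latency dashboards.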

Exam trap

Google Cloud often tests the distinction between valid metric names (e.g., `total_latencies` vs. `latency`) and correct MQL syntax (e.g., `percentile(99)` vs. `99p` or `align 99`), leading candidates to choose syntactically close but incorrect options like A or D.

How to eliminate wrong answers

Option A is wrong because `https_lb_rule` is a monitored resource type rather than a metric type, and `latency` is not the metric's name; the metric is `loadbalancing.googleapis.com/https/total_latencies`. Additionally, `align 99p` is not valid MQL syntax; the correct function is `percentile(99)`. Option C is wrong because `request_count` is a count metric, not a latency metric, and using `ratio` would compute a ratio of request counts, not a latency percentile. Option D is wrong because `align 99` is invalid MQL syntax (percentile requires the `percentile()` function), and `with latency` is not a recognized MQL clause for extracting or labeling the result.

408

Multi-Selecthard

An incident is declared for a production service running on GKE. The on-call engineer suspects a recent code change may have introduced a memory leak. Which THREE actions should the engineer take to investigate and mitigate?

Select 3 answers

A.Increase the memory limit for the container as a temporary mitigation

B.Scale down the number of replicas to reduce memory pressure

C.Roll back the deployment immediately without further investigation

D.Check container logs for Out of Memory (OOM) killed messages

E.Compare memory usage metrics before and after the deployment using Cloud Monitoring

AnswersA, D, E

Temporary increase buys time for a permanent fix.

Why this answer

Option A is correct because increasing the memory limit for the container provides a temporary mitigation to prevent the service from being killed by the Out of Memory (OOM) killer while the root cause is investigated. In GKE, the container's memory limit is defined in the pod spec under `resources.limits.memory`, and raising it gives the application more headroom to continue serving requests without immediate termination. This is a standard incident response practice to buy time for deeper analysis, such as reviewing logs and metrics, before applying a permanent fix.

Exam trap

Google Cloud often tests the misconception that scaling down replicas reduces memory pressure, when in fact it reduces total available memory and can worsen the impact of a memory leak.

409

MCQhard

A company uses BigQuery for analytics with many queries scanning terabytes daily. They need to reduce query costs without reducing usage. What is the most effective strategy?

A.Reserve capacity in specific regions.

B.Use flat-rate pricing with slots.

C.Partition and cluster tables.

D.Use standard SQL instead of legacy SQL.

AnswerC

Partitioning and clustering limit the data scanned per query, reducing cost.

Why this answer

Partitioning and clustering tables reduce the amount of data scanned per query, directly lowering costs.
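
Since on-demand queries are billed by bytes scanned, the effect of partition pruning can be estimated with simple arithmetic. The table layout below (30 daily partitions of 100 GiB each) is a hypothetical example, not data from the question.

```python
GIB = 1024 ** 3


def bytes_scanned(partition_sizes: dict, start: str, end: str) -> int:
    """Sum the sizes of date partitions in [start, end]; ISO dates order correctly as strings."""
    return sum(size for day, size in partition_sizes.items() if start <= day <= end)


# Hypothetical table: 30 daily partitions of 100 GiB each.
partitions = {f"2024-06-{day:02d}": 100 * GIB for day in range(1, 31)}

full_scan = sum(partitions.values())
pruned = bytes_scanned(partitions, "2024-06-24", "2024-06-30")
print(f"full scan: {full_scan // GIB} GiB, last 7 days only: {pruned // GIB} GiB")
```

A query that filters on the partitioning column for the last seven days scans 700 GiB instead of 3000 GiB in this example; clustering can narrow the scan further within each partition.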

410

MCQeasy

A company runs a multi-region web application on Google Kubernetes Engine (GKE) using Cloud Load Balancing and Cloud Armor. They use Cloud Monitoring to track user-facing latency. Recently, they noticed that the p99 latency has increased from 200ms to 2s during peak hours, but only for users in the US region. The team suspects a specific backend service in us-central1 is causing the spike. They have set up a dashboard showing latency by region, but the latency metric is aggregated globally, not broken down by region. What should they do to pinpoint the issue?

A.Deploy a sidecar proxy in each pod to collect detailed latency data and export it to a third-party tool.

B.Use Cloud Monitoring's 'Service Monitoring' to set up a service SLO and create a burn-rate alert.

C.Use the GKE Dashboard to view per-pod latency metrics.

D.Create a custom log-based metric that extracts latency per region from application logs.

AnswerD

Log-based metrics allow you to parse latency values and labels (e.g., region) from structured logs, providing per-region latency data to pinpoint the issue.

Why this answer

Option D is correct because creating a custom log-based metric that extracts latency per region from application logs allows you to break down the globally aggregated latency metric into per-region slices. This directly addresses the need to isolate the us-central1 backend service's impact on p99 latency during US peak hours, without requiring additional infrastructure or third-party tools.
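
The value of a region label can be shown with a small grouping sketch. The field names `region` and `latency_ms` and the sample entries are assumptions for illustration; real application logs may be structured differently.

```python
import math
from collections import defaultdict


def p99_by_label(entries: list, label: str, value: str) -> dict:
    """Group numeric values by a label field and return the nearest-rank p99 per group."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry[label]].append(entry[value])
    result = {}
    for key, values in groups.items():
        ordered = sorted(values)
        result[key] = ordered[math.ceil(99 * len(ordered) / 100) - 1]
    return result


# Hypothetical structured log entries with a region label and a latency value.
entries = (
    [{"region": "us-central1", "latency_ms": 2000}] * 5
    + [{"region": "us-central1", "latency_ms": 180}] * 95
    + [{"region": "europe-west1", "latency_ms": 190}] * 100
)
print(p99_by_label(entries, "region", "latency_ms"))
```

Splitting by the label isolates the slow tail to us-central1, which a single global series would blur together with healthy traffic from other regions.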

Exam trap

The trap here is that candidates may assume per-pod metrics (Option C) are sufficient for user-facing latency analysis, but GKE Dashboard metrics are infrastructure-focused and lack the regional breakdown needed to isolate a specific backend service's impact on global p99 latency.

How to eliminate wrong answers

Option A is wrong because deploying a sidecar proxy adds unnecessary complexity and cost, and exporting data to a third-party tool is not required when Cloud Monitoring's log-based metrics can already extract and filter latency by region from existing application logs. Option B is wrong because Service Monitoring and SLO burn-rate alerts are designed to detect when a service level objective is being violated, not to diagnose the root cause of a latency spike by region; they would only confirm the problem exists, not pinpoint the specific backend. Option C is wrong because the GKE Dashboard provides per-pod metrics like CPU and memory, but it does not expose user-facing latency broken down by region; latency metrics are typically collected at the load balancer or application layer, not at the pod level.

411

MCQeasy

A company is setting up a new Google Cloud organization. They want to ensure that all projects inherit common IAM policies. What is the best practice?

A.Apply IAM policies at the folder level.

B.Apply IAM policies at the project level.

C.Apply IAM policies at the organization level.

D.Use multiple organizations to isolate policies.

AnswerC

Organization-level policies apply to all projects and folders under the organization.

Why this answer

Applying policies at the organization level ensures all projects and folders inherit them, providing consistent enforcement and reducing administrative overhead.

412

MCQhard

A company has purchased Compute Engine committed use discounts (CUD) for 1 year for vCPU and memory. After 3 months, they need to upgrade some VMs to a larger machine type. What happens to the CUD coverage?

A.The CUD is automatically adjusted to cover the new machine type

B.The CUD continues to apply to the original resource, and any additional usage is charged at on-demand rates

C.The CUD applies only to the specific machine types in the commitment, so upgrades are not covered

D.The CUD is voided and a refund is issued

AnswerB

CUD covers the committed vCPU and memory; if you upgrade, the CUD still applies to the original amount, and extra usage is on-demand.

Why this answer

Option B is correct because CUDs apply to resource usage (vCPU, memory), not to specific machine types. The original commitment continues to cover usage up to the committed amount; any additional usage is billed at on-demand rates. Option A is incorrect because CUDs are not auto-adjusted. Option C is incorrect because the commitment covers the resources rather than specific machine types, so upgraded VMs remain covered up to the committed amount as long as the resource types match. Option D is incorrect because CUDs are not voided.
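
A quick calculation shows how usage splits between the commitment and on-demand billing after an upgrade. The commitment size and VM shapes are hypothetical, and the sketch tracks vCPUs only; memory would be handled the same way.

```python
def split_usage(committed_vcpus: int, used_vcpus: int) -> tuple:
    """Return (vCPUs covered by the commitment, vCPUs billed on demand)."""
    covered = min(committed_vcpus, used_vcpus)
    on_demand = max(0, used_vcpus - committed_vcpus)
    return covered, on_demand


# Hypothetical: a 16-vCPU commitment; four 4-vCPU VMs are upgraded to 8 vCPUs each.
before = split_usage(16, 4 * 4)
after = split_usage(16, 4 * 8)
print(before, after)  # (16, 0) (16, 16)
```

The commitment still covers 16 vCPUs after the upgrade; only the 16 additional vCPUs are billed on demand.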

413

MCQhard

Based on the log entry, what is the most likely cause of the 404 error?

A.The user does not have permission to invoke the service.

B.The revision is not configured with the correct container port.

C.The Cloud Run service is not autoscaling properly, causing requests to be dropped.

D.The service has run out of memory.

AnswerB

A 404 often means the container is listening on a different port than what Cloud Run expects.

Why this answer

A 404 error on Cloud Run typically indicates that the request reached the service but no container is listening on the configured port. If the revision's container port does not match the port the application is actually serving on (e.g., the app listens on 8080 but the revision is configured for 3000), Cloud Run's HTTP ingress will fail to route traffic, resulting in a 404. This is the most likely cause because the error is not a permission or resource issue, but a routing mismatch at the container level.

Exam trap

Google Cloud often tests the distinction between HTTP status codes (404 vs 403 vs 503 vs 500) and their root causes in serverless environments, trapping candidates who confuse permission errors with routing misconfigurations.

How to eliminate wrong answers

Option A is wrong because a 403 Forbidden error, not a 404, would occur if the user lacks permission to invoke the service (IAM permissions control invocation, not routing). Option C is wrong because autoscaling issues typically cause 503 Service Unavailable or 429 Too Many Requests errors, not 404s; a 404 indicates the service exists but the endpoint is not reachable. Option D is wrong because running out of memory would cause the container to crash or return a 500 Internal Server Error, not a 404; memory limits affect container health, not HTTP routing.

414

MCQeasy

Refer to the exhibit. What does the alert condition indicate?

A.It alerts when the request count drops below 1000 for 1 minute.

B.It alerts for any Cloud Run revision that has more than 1000 requests in a 1-minute window.

C.It alerts when the average request count across all revisions exceeds 1000 over 1 minute.

D.It alerts when the total request count across all revisions exceeds 1000 per minute.

AnswerB

For each revision, if its request count exceeds 1000 for at least 1 minute, alert fires.

Why this answer

The alert condition in the exhibit uses a per-revision metric (e.g., `run.googleapis.com/request_count`) with a threshold of 1000 and a 1-minute window. This means the alert fires for any individual Cloud Run revision that exceeds 1000 requests within that window, not for the aggregate across all revisions. Option B correctly identifies this per-revision behavior.

Exam trap

Google Cloud often tests the distinction between per-resource and aggregate metrics, so the trap here is assuming that a threshold on a metric like 'request_count' automatically implies a sum across all revisions, when in fact it applies to each individual revision's time series.

How to eliminate wrong answers

Option A is wrong because the alert condition is set to fire when the request count exceeds 1000, not when it drops below 1000; a 'less than' threshold would require a different condition. Option C is wrong because the alert evaluates each revision independently, not the average across all revisions; averaging would require a different aggregation function like `mean` or `avg`. Option D is wrong because the alert does not sum request counts across all revisions; it triggers per revision, so a single revision exceeding 1000 requests in a minute fires the alert regardless of other revisions' counts.
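
The difference between per-revision, summed, and averaged evaluation can be sketched in a few lines. This is a minimal illustration, not the Cloud Monitoring implementation: the revision names, the counts, and the function names are all hypothetical, and only the threshold of 1000 comes from the question.

```python
from statistics import mean

THRESHOLD = 1000  # requests per 1-minute window, as in the question


def revisions_over_threshold(counts_by_revision, threshold=THRESHOLD):
    """Per-series evaluation: each revision is compared to the threshold on its own."""
    return sorted(rev for rev, count in counts_by_revision.items() if count > threshold)


def total_over_threshold(counts_by_revision, threshold=THRESHOLD):
    """Aggregate evaluation using a sum across all revisions (what option D describes)."""
    return sum(counts_by_revision.values()) > threshold


def mean_over_threshold(counts_by_revision, threshold=THRESHOLD):
    """Aggregate evaluation using a mean across all revisions (what option C describes)."""
    return mean(counts_by_revision.values()) > threshold


# Hypothetical one-minute request counts.
busy_single = {"rev-a": 1200, "rev-b": 300, "rev-c": 100}
evenly_split = {"rev-a": 600, "rev-b": 600}

print(revisions_over_threshold(busy_single))   # only rev-a crosses the line on its own
print(revisions_over_threshold(evenly_split))  # no single revision crosses it
print(total_over_threshold(evenly_split))      # but the sum does
```

With the evenly split traffic, a summed condition would fire while the per-revision condition stays quiet, which is the distinction the question is testing.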

415

MCQmedium

A DevOps engineer needs to set up a centralized logging solution for multiple projects. They want to store logs in a BigQuery dataset for analysis. What is the best approach?

A.Use Cloud Logging's export feature to Pub/Sub and then to BigQuery.

B.Use the BigQuery Data Transfer Service for logs.

C.Create a sink in each project to export logs to the BigQuery dataset.

D.Create an aggregated sink at the organization or folder level to export logs to BigQuery.

AnswerD

Centralized and efficient.

Why this answer

Option D is correct because an aggregated sink at the organization or folder level allows you to collect logs from all projects within that hierarchy into a single BigQuery dataset in a centralized project. This approach eliminates the need to configure individual sinks per project, reduces administrative overhead, and ensures consistent log routing across the entire organization.

Exam trap

The trap here is that candidates often choose Option C (per-project sinks) because they think each project must independently export its logs, failing to recognize that aggregated sinks at the organization or folder level provide a centralized, scalable solution that reduces management overhead.

How to eliminate wrong answers

Option A is wrong because Cloud Logging's export to Pub/Sub then to BigQuery introduces unnecessary complexity and latency; Pub/Sub is typically used for real-time streaming or fan-out to multiple subscribers, not as a direct path to BigQuery when a sink can write directly. Option B is wrong because the BigQuery Data Transfer Service is designed for scheduled data imports from external sources (e.g., Google Ads, Amazon S3), not for ingesting Cloud Logging logs. Option C is wrong because creating a sink in each project is inefficient and error-prone for a multi-project setup; it requires manual configuration per project and does not scale, whereas an aggregated sink centralizes management.

416

Multi-Selectmedium

A team is setting up CI/CD for a microservices architecture. They want to ensure each service is independently buildable and deployable. What practices should they adopt? (Select THREE)

Select 3 answers

A.Use Artifact Registry with separate repositories per service

B.Use Cloud Deploy's multi-target pipeline

C.Use a single repository with separate Cloud Build triggers per service

D.Use separate repositories per service

E.Use Cloud Build's build config with substitutions to build multiple services

AnswersA, C, D

Separate repositories provide isolation and access control per service.

Why this answer

Options A, C, and D are correct. Separate repositories (D) or separate triggers with includeFiles filters in a single repository (C) ensure independent builds. Separate Artifact Registry repositories (A) ensure artifact isolation.

Option E builds multiple services in one config, reducing independence. Option B is about deployment targets, not builds.

417

Multi-Selecteasy

A company is bootstrapping a Google Cloud organization for DevOps. Which TWO practices should be implemented to ensure secure and efficient management of infrastructure as code (IaC) pipelines?

Select 2 answers

A.Store infrastructure secrets (e.g., API keys) directly in Terraform configuration files for simplicity.

B.Use a dedicated project for CI/CD pipelines that houses Cloud Build triggers and Cloud Source Repositories.

C.Use a single project to host all development, staging, and production environments to reduce complexity.

D.Implement separation of duties by using least-privilege service accounts for Terraform and restricting direct human access to production projects.

E.Require manual approval from a security team for every infrastructure change.

AnswersB, D

A separate project isolates CI/CD resources and simplifies IAM management for pipeline service accounts.

Why this answer

Option B is correct because using a dedicated project for CI/CD pipelines isolates Cloud Build triggers and Cloud Source Repositories from other workloads, preventing accidental interference and simplifying access control. This aligns with Google Cloud's recommended landing zone pattern where pipeline infrastructure is managed separately from application environments.

Exam trap

The trap here is that candidates often confuse 'simplicity' with 'security' and choose a single project for all environments (Option C) or manual approval for every change (Option E), failing to recognize that Google Cloud's recommended architecture emphasizes isolation and automated guardrails over manual processes.

418

MCQhard

A financial services company uses Spanner for their core database. They notice that some transactions are taking longer than expected, especially during cross-region writes. They have set up Spanner with regional configuration. What is the most likely cause?

A.The transaction is experiencing contention due to a hot spot

B.The transaction is using stale reads

C.The transaction is not using a read-write transaction

D.The transaction is too large

AnswerA

Contention on popular keys causes retries and delays.

Why this answer

In a regional Spanner configuration, cross-region writes are not possible; hot spots (contention) are a common cause of latency. Stale reads are fast, and transaction size alone rarely causes significant delays.

419

MCQeasy

A web application serves static assets (images, CSS, JavaScript) from Compute Engine instances. Users in different geographic regions report slow page loads. Which Google Cloud service can be used to improve performance for these users?

A.VPC Network Peering

B.Cloud Load Balancing

C.Cloud CDN

D.Cloud NAT

AnswerC

Cloud CDN uses Google's global edge network to cache static content closer to users.

Why this answer

Cloud CDN (Content Delivery Network) caches static assets at Google's globally distributed edge points of presence (PoPs). When users request images, CSS, or JavaScript, the content is served from the nearest edge cache rather than the origin Compute Engine instances, reducing latency and improving page load times for geographically distributed users.

Exam trap

The trap here is that candidates confuse Cloud Load Balancing (which only distributes traffic) with Cloud CDN (which caches at edge locations), assuming load balancing alone solves geographic latency issues.

How to eliminate wrong answers

Option A is wrong because VPC Network Peering connects two VPC networks for private IP communication; it does not cache content or accelerate delivery to end users. Option B is wrong because Cloud Load Balancing distributes traffic across backend instances but does not cache responses at edge locations; it can be used with Cloud CDN but alone does not reduce latency for static assets. Option D is wrong because Cloud NAT provides outbound internet connectivity for instances without external IPs; it does not cache or accelerate content delivery.

420

MCQeasy

Which tool is recommended for managing the initial setup of a Google Cloud organization, including creating folders, projects, and IAM policies in an automated and repeatable manner?

A.Terraform

B.Deployment Manager

C.Cloud Console

D.gcloud command line

AnswerA

Terraform is widely adopted and Google recommends it for infrastructure automation.

Why this answer

Terraform is the recommended tool for bootstrapping a Google Cloud organization because it is declarative, idempotent, and supports infrastructure-as-code (IaC) for creating folders, projects, and IAM policies in an automated and repeatable manner. Unlike Google Cloud's Deployment Manager, Terraform is cloud-agnostic and has a mature provider (hashicorp/google) that directly manages organization-level resources such as google_folder, google_project, and google_organization_iam_member. This aligns with DevOps best practices for version-controlled, reproducible infrastructure provisioning.

Exam trap

Google Cloud often tests the misconception that Deployment Manager is the best choice because it is Google-native, but the question specifically asks for a tool that is 'recommended' for automated and repeatable bootstrapping, which Terraform achieves through its declarative, stateful, and multi-cloud design.

How to eliminate wrong answers

Option B is wrong because Deployment Manager is a Google Cloud-native IaC tool that uses YAML or Python templates, but it is less portable and lacks the broad community support and multi-cloud capabilities of Terraform; it also does not natively support the same level of modularity and state management for bootstrapping an organization. Option C is wrong because Cloud Console is a manual, click-based web interface that cannot be automated or repeated programmatically, making it unsuitable for initial setup in a DevOps pipeline. Option D is wrong because the gcloud command line is imperative and requires sequential commands, which is error-prone and not designed for idempotent, stateful infrastructure management across multiple environments.

421

MCQmedium

A team uses a monorepo with multiple microservices in separate directories. They want to build only the changed service(s) when a push occurs to the repo. How can they achieve this efficiently?

A.Use a single Cloud Build trigger with a Dockerfile build step that builds all services.

B.Use Cloud Functions to invoke Cloud Build per changed directory.

C.Create multiple Cloud Build triggers, each with a different includeFiles filter matching the service directory.

D.Use a Cloud Build trigger with a build config that dynamically detects changes using git diff.

AnswerC

includeFiles and excludeFiles allow triggering only when files in specific paths change.

Why this answer

Option C is correct because Cloud Build triggers can use includeFiles filters to fire only when files in a specific directory change. Option A builds all services, which is inefficient. Option B is possible but not native.

Option D adds complexity.
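
The effect of per-service includeFiles filters can be approximated with a small matcher. This is only a sketch: the trigger names and the `services/<name>/` layout are hypothetical, and Python's `fnmatchcase` stands in for Cloud Build's own glob matching, which may differ in edge cases.

```python
from fnmatch import fnmatchcase

# Hypothetical triggers; each pattern list plays the role of an includeFiles filter.
TRIGGERS = {
    "build-cart": ["services/cart/**"],
    "build-checkout": ["services/checkout/**"],
    "build-search": ["services/search/**"],
}


def triggers_to_fire(changed_files, triggers=TRIGGERS):
    """Return the triggers whose filter matches at least one changed file."""
    fired = []
    for name, patterns in triggers.items():
        if any(fnmatchcase(path, pattern)
               for path in changed_files
               for pattern in patterns):
            fired.append(name)
    return sorted(fired)


# A push touching one service and a top-level file builds only that service.
print(triggers_to_fire(["services/cart/src/main.py", "README.md"]))
```

A push that changes only files outside every service directory matches no filter, so nothing is rebuilt.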

422

Drag & Dropmedium

Arrange the steps to migrate a monolithic application to microservices on Google Kubernetes Engine.

Why this order

Identify contexts, containerize, deploy, set up communication, redirect traffic.

423

Multi-Selecteasy

A company is bootstrapping a Google Cloud organization with multiple projects. They want to enable consistent security and compliance across all projects. Which two organization policies should they consider? (Choose TWO.)

Select 2 answers

A.Require all service accounts to have a unique naming convention.

B.Restrict domain of users to the company domain.

C.Enforce that all projects have a Cloud Storage bucket.

D.Allow all projects to use any external IPs.

E.Prevent users from disabling audit logging.

AnswersB, E

Use constraints/iam.allowedPolicyMemberDomains.

Why this answer

Option B is correct because the 'Restrict domain of users to the company domain' organization policy (constraints/iam.allowedPolicyMemberDomains) ensures that only identities from the specified Google Workspace or Cloud Identity domain can be added as members in IAM policies across all projects. This prevents external users from gaining access, enforcing a consistent security boundary from the outset of bootstrapping.

Exam trap

The trap here is that candidates often confuse organization policies with project-level configurations or best practices, mistakenly thinking that naming conventions or resource creation requirements can be enforced as organization policies, when in reality only specific predefined constraints are available.

424

MCQhard

A company runs a microservices architecture on GKE with Istio service mesh. They observe that service-to-service latency has increased after enabling mTLS. What is the most likely cause?

A.mTLS encryption overhead

B.Incorrect load balancer configuration

C.Network policy restriction

D.Sidecar proxy resource limits

AnswerA

Encrypting and decrypting each request adds CPU overhead and latency.

Why this answer

Enabling mTLS in Istio encrypts all service-to-service traffic using mutual TLS, which adds CPU overhead for encryption and decryption of each request. This encryption overhead directly increases latency, especially for high-throughput or small-payload services, as the sidecar proxies must perform TLS handshakes and cryptographic operations on every packet.

Exam trap

Google Cloud often tests the misconception that mTLS only adds security without performance impact, but candidates must recognize that encryption/decryption at the sidecar proxy level introduces measurable CPU-bound latency.

How to eliminate wrong answers

Option B is wrong because incorrect load balancer configuration would cause traffic routing issues or dropped connections, not a general increase in latency after enabling mTLS. Option C is wrong because network policy restrictions would block or drop traffic, not simply increase latency across all service-to-service calls. Option D is wrong because sidecar proxy resource limits would cause throttling, timeouts, or OOM kills, but the question states latency increased after enabling mTLS, not after changing resource limits.

425

Multi-Selectmedium

A service experiences increased latency and HTTP 503 errors. The engineer finds that the backend managed instance group (MIG) is at max instances and CPU utilization is 90%. Which TWO actions should the engineer take to restore the service quickly?

Select 2 answers

A.Enable autoscaling based on HTTP load balancing utilization

B.Increase the autoscaling target CPU utilization to 95%

C.Increase the maximum number of instances in the MIG

D.Reduce the autoscaling target CPU utilization to 50%

E.Reduce the number of instances to avoid resource contention

AnswersA, C

Scales based on request rate, which is more responsive than CPU.

Why this answer

Option A is correct because enabling autoscaling based on HTTP load balancing utilization allows the MIG to scale out based on the actual request load, which directly addresses the 503 errors caused by the backend being at max capacity. This metric is more responsive to traffic spikes than CPU utilization alone, as it reflects the frontend load balancer's view of backend capacity.

Exam trap

Google Cloud often tests the misconception that adjusting CPU utilization thresholds (either up or down) is a quick fix for capacity issues, when in fact the immediate solution is to increase the maximum instance count or enable a more responsive scaling metric.
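
A simplified proportional model shows why changing the CPU target does not help once the group is pinned at its maximum. This is a sketch under stated assumptions, not the actual autoscaler algorithm: the group size of 10, the 60% starting target, and the function name are hypothetical, and only the 90% CPU reading comes from the question.

```python
def recommended_size(current_instances, cpu_pct, target_pct, max_instances):
    """Instances needed to bring average CPU down to the target, capped at the MIG maximum.

    Assumes load spreads evenly, so required capacity scales with cpu_pct / target_pct.
    """
    wanted = -(-current_instances * cpu_pct // target_pct)  # ceiling division on integers
    return min(max(wanted, 1), max_instances)


# Hypothetical MIG: 10 instances, already at its maximum of 10, CPU at 90%.
print(recommended_size(10, 90, 60, 10))  # capped at the maximum
print(recommended_size(10, 90, 50, 10))  # a lower target wants more, still capped
print(recommended_size(10, 90, 95, 10))  # a higher target adds no capacity
print(recommended_size(10, 90, 60, 20))  # raising the maximum lets the group grow
```

In this model only the last call changes the outcome, which matches the point that the maximum instance count is the binding constraint.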

426

MCQeasy

A company uses Cloud Storage to store archival data. They want to minimize storage costs while maintaining availability. Which storage class should they use?

A.Nearline storage class.

B.Standard storage class.

C.Archive storage class.

D.Coldline storage class.

AnswerC

Archive is the lowest-cost storage class for long-term archival data.

Why this answer

The Archive storage class is the correct choice because it offers the lowest storage cost for archival data that is accessed less than once per year. The trade-offs are a 365-day minimum storage duration and retrieval costs that are higher than those of the other classes. The availability requirement is still met: Archive objects stay online, are served with millisecond latency like the other classes, and are stored with the same durability.

Exam trap

The trap here is that candidates often assume Coldline is the cheapest option because of its name, but Archive is the lowest-cost class for truly archival data. Google Cloud tests whether you know the access-frequency and minimum-storage-duration differences between Coldline and Archive.

How to eliminate wrong answers

Option A is wrong because Nearline is designed for data accessed less than once per 30 days, not for archival data, and has higher storage costs than Archive. Option B is wrong because Standard is for frequently accessed data (e.g., multiple times per month) and has the highest storage cost, making it unsuitable for minimizing costs on archival data. Option D is wrong because Coldline is for data accessed less than once per 90 days, with storage costs higher than Archive and a 90-day minimum storage duration, which does not provide the lowest cost for long-term archival.
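
The access-frequency thresholds quoted above can be turned into a small lookup. This is a rough sketch with a hypothetical helper name; a real decision would also weigh retrieval fees and minimum storage durations, which this ignores.

```python
# (class name, minimum expected days between accesses), coldest first.
# Thresholds follow the access frequencies quoted in the explanation above.
ACCESS_TIERS = [
    ("ARCHIVE", 365),
    ("COLDLINE", 90),
    ("NEARLINE", 30),
]


def pick_storage_class(days_between_accesses):
    """Return the coldest class whose access-frequency threshold the data meets."""
    for name, min_days in ACCESS_TIERS:
        if days_between_accesses >= min_days:
            return name
    return "STANDARD"


# Data touched roughly once every 400 days lands in the archival tier.
print(pick_storage_class(400))
```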

427

MCQeasy

A team deploys a Cloud Function that processes user requests. They notice cold starts cause high latency for the first request after a period of inactivity. What is the most effective way to reduce cold starts?

A.Use a larger function timeout

B.Set the minimum instances to 1

C.Increase the memory allocation

D.Deploy the function in multiple regions

AnswerB

Keeping at least one warm instance eliminates cold start latency.

Why this answer

Setting minimum instances to 1 pre-warms a function instance, keeping it idle and ready to serve requests immediately. This eliminates the cold start latency for the first request after inactivity because the runtime environment is already initialized and loaded into memory.

Exam trap

Google Cloud often tests the misconception that increasing resources (memory or timeout) or spreading across regions solves cold starts, when the actual solution is keeping an instance alive via minimum instances or similar warm-start mechanisms.

How to eliminate wrong answers

Option A is wrong because increasing the function timeout does not prevent cold starts; it only allows the function to run longer before being terminated, which does not address the initialization delay. Option C is wrong because increasing memory allocation can improve performance during execution but does not keep an instance alive or reduce the cold start penalty; cold starts still occur after idle periods. Option D is wrong because deploying in multiple regions improves geographic latency and availability but does not reduce cold starts; each regional deployment still experiences cold starts independently after inactivity.

428

Multi-Selecthard

A company runs a stateful workload on Compute Engine with local SSDs. They need to improve disk I/O performance without changing the instance type. Which THREE actions should they take?

Select 3 answers

A.Migrate to persistent SSD for better durability.

B.Stripe data across multiple local SSD volumes using RAID 0.

C.Use a filesystem optimized for SSDs, such as ext4 with noatime and nodiratime options.

D.Ensure the instance is in the same zone as the application that accesses the disks.

E.Enable encryption for the local SSDs to reduce I/O overhead.

AnswersB, C, D

Increases throughput and IOPS.

Why this answer

Option B is correct because striping data across multiple local SSD volumes using RAID 0 increases the aggregate I/O throughput and IOPS by distributing read and write operations across all disks in parallel. This directly improves disk I/O performance without changing the instance type, as local SSDs are physically attached to the host and offer the highest performance when combined.

Exam trap

Google Cloud often tests the misconception that persistent SSDs are always better for performance, but local SSDs provide lower latency and higher IOPS for stateful workloads, and striping them with RAID 0 is the key to maximizing I/O without changing the instance type.

429

MCQhard

A company is transferring large datasets from on-premises to Google Cloud using a VPN. They notice high latency due to packet loss. What is the most effective way to improve throughput?

A.Set up Dedicated Interconnect for a more reliable connection.

B.Enable compression on the VPN tunnel.

C.Increase the number of VPN tunnels and use BGP multipath.

D.Use a multi-region GCP endpoint and distribute traffic.

AnswerA

Dedicated Interconnect provides a direct physical connection, reducing packet loss and latency.

Why this answer

Dedicated Interconnect provides a direct, private physical connection between on-premises and Google Cloud, bypassing the public internet entirely. This eliminates the packet loss and high latency inherent in VPN tunnels over the internet, offering consistent throughput and lower latency for large dataset transfers.

Exam trap

Google Cloud often tests the misconception that adding more VPN tunnels or enabling compression can overcome internet-based packet loss, but the correct solution is to eliminate the unreliable public internet path entirely with a dedicated connection like Interconnect.

How to eliminate wrong answers

Option B is wrong because enabling compression on the VPN tunnel can reduce the amount of data transmitted but does not address the underlying packet loss causing high latency; in fact, compression can increase CPU overhead and may worsen performance if packet loss is present. Option C is wrong because increasing the number of VPN tunnels with BGP multipath can improve bandwidth utilization but still relies on the public internet, so packet loss and latency issues remain; it does not provide a reliable, low-latency path. Option D is wrong because using a multi-region GCP endpoint and distributing traffic does not solve the fundamental problem of packet loss on the VPN connection; it only spreads traffic across regions, which may introduce additional latency and complexity without addressing the unreliable internet path.

430

MCQhard

Refer to the exhibit. A DevOps engineer is debugging a Cloud Build pipeline that fails after the second step. The error indicates that the docker push fails with a permission denied error. The service account used by Cloud Build has the roles/storage.objectAdmin role on the project. What is the most likely cause of the failure?

A.The docker push command uses an incorrect repository path.

B.The service account does not have permission to push to Artifact Registry.

C.The Cloud Build service account needs the roles/artifactregistry.writer role.

D.The gcloud auth configure-docker step must be run for Artifact Registry.

AnswerC

Artifact Registry requires specific roles; storage.objectAdmin is insufficient for pushing images.

Why this answer

The service account has storage.objectAdmin, which grants access to Cloud Storage objects, not to Artifact Registry. Pushing to Artifact Registry requires the roles/artifactregistry.writer role (or a broader role such as admin). Option B describes the symptom but is too vague to identify the fix.

Option D is already performed in the first step. Option A is less likely because the repository path appears correct. Option C correctly identifies the missing role.

431

MCQhard

A company uses BigQuery for analytics. They notice high costs due to queries scanning large amounts of data. They want to reduce costs without sacrificing performance for urgent queries. Which approach is most cost-effective?

A.Use flat-rate pricing with reservations

B.Partition and cluster tables, and use BI Engine for acceleration

C.Use on-demand pricing with query caching

D.Use materialized views and limit query jobs to interactive priority

AnswerB

Partitioning and clustering limit the data scanned, and BI Engine provides fast in-memory analysis for critical queries without scanning.

Why this answer

Option B is correct because partitioning and clustering reduce the data scanned, and BI Engine accelerates queries. Option A (flat-rate) is costly if the reserved capacity is not fully utilized. Option C (caching) helps only with repeated identical queries and does not reduce the data scanned by new ones.

Option D (materialized views) incurs storage costs and may not be suitable for all queries.

432

MCQhard

A company uses Spinnaker for continuous delivery across multiple GKE clusters. After a recent infrastructure change, the 'Canary' deployment strategy fails during the 'disable' phase of the old version. The error log shows: 'Unable to disable server group: Not authorized to perform compute.instanceGroups.update.' What is the most likely root cause?

A.The GKE cluster has reached its maximum node quota.

B.The Cloud Deploy pipeline is missing the required IAM role for the Spinnaker service account.

C.The Spinnaker service account lacks the compute.instanceGroups.update permission on the project.

D.The Kayenta canary analysis service is not configured correctly.

AnswerC

Correct: Spinnaker uses this permission to disable old server groups.

Why this answer

The error 'Unable to disable server group: Not authorized to perform compute.instanceGroups.update' directly indicates an IAM permissions issue. In Spinnaker, the service account used to interact with GCP must have the compute.instanceGroups.update permission to manage instance groups during the disable phase of a canary deployment. Option C correctly identifies that the Spinnaker service account lacks this specific permission on the project.

Exam trap

Google Cloud often tests the distinction between permissions errors and resource quota errors, leading candidates to incorrectly select quota-related options when the error message explicitly states 'Not authorized'.

How to eliminate wrong answers

Option A is wrong because reaching the maximum node quota would cause a failure to provision new nodes, not a permissions error during the disable phase. Option B is wrong because Cloud Deploy is a separate Google Cloud service; the error is from Spinnaker's own service account, not from a Cloud Deploy pipeline. Option D is wrong because Kayenta handles canary analysis and metric evaluation, not the disabling of server groups; the error is an IAM authorization failure, not a configuration issue with Kayenta.

433

MCQmedium

A company uses Cloud Build to deploy a microservices application to Google Kubernetes Engine (GKE). They want to integrate Container Analysis to scan images for vulnerabilities before deployment. What is the minimal set of changes needed to achieve this?

A.Enable the Container Analysis API; no changes to the build configuration are needed.

B.Migrate images from Container Registry to Artifact Registry and enable vulnerability scanning there.

C.Add a build step to run a vulnerability scanner CLI tool before pushing the image.

D.Enable Binary Authorization to block deployment of vulnerable images.

AnswerA

Cloud Build automatically pushes images to the defined registry, and Container Analysis scans them when the API is enabled.

Why this answer

Option A is correct because Cloud Build natively integrates with Container Analysis; once the API is enabled, images that the build pushes are scanned automatically. Option C is incorrect because a separate scan step is not needed. Option D is incorrect because Binary Authorization is for policy enforcement, not scanning.

Option B is incorrect because moving to Artifact Registry does not replace scanning and is more change than the question requires.

434

Multi-Selecteasy

A DevOps team wants to monitor the performance of a Cloud SQL database. Which two metrics should they track? (Select TWO.)

Select 2 answers

A.Auto-increment counter

B.Query error rate

C.CPU utilization

D.Number of active connections

E.Disk read/write latency

AnswersC, E

High CPU may indicate inefficient queries or need for scaling.

Why this answer

CPU utilization (C) is a critical metric for Cloud SQL because high CPU usage indicates that the database instance is struggling to process queries, often due to inefficient queries or insufficient compute capacity. Monitoring CPU utilization helps teams decide when to scale up or optimize query performance.

Exam trap

Google Cloud often tests the distinction between metrics that measure performance (e.g., CPU, latency) versus metrics that measure capacity or configuration (e.g., active connections, auto-increment), leading candidates to select D because they conflate 'active connections' with performance impact.

435

MCQmedium

An organization wants to enforce that all Compute Engine VMs use only specific machine families (e.g., N2, C2). Which mechanism should they use?

A.IAM deny policies

B.Quota management

C.Folders with different owners

D.Organization policy with compute.restrictComputeEngineMachineTypes

AnswerD

Org policies can restrict machine types.

Why this answer

Organization policies in Google Cloud allow administrators to enforce constraints on resources across the entire hierarchy. The `compute.restrictComputeEngineMachineTypes` constraint specifically limits which machine families (e.g., N2, C2) can be used when creating Compute Engine VMs, making it the correct mechanism for this requirement.

Exam trap

The trap here is that candidates often confuse IAM deny policies with organization policy constraints, thinking that deny policies can restrict resource configurations, when in fact they only control identity-based access, not resource properties.

How to eliminate wrong answers

Option A is wrong because IAM deny policies control who can perform actions (e.g., deny a user from creating VMs), not which machine types are allowed; they cannot restrict specific machine families. Option B is wrong because quota management limits the quantity of resources (e.g., number of vCPUs or GPUs) but does not restrict the selection of machine families like N2 or C2. Option C is wrong because folders with different owners are an organizational structure for delegating administration and access control, not a mechanism to enforce technical constraints on machine families.

436

MCQhard

Your organization runs a critical e-commerce platform on Google Kubernetes Engine (GKE). The platform uses Cloud Service Mesh (Anthos Service Mesh) for traffic management and Cloud Monitoring for observability. Recently, after a new release, you observe that the p99 latency of the checkout service has increased from 200ms to 2s. The service's CPU and memory metrics appear normal, and there are no error logs. The release included a change to the Istio VirtualService configuration that added a retry policy: 3 retries with a 500ms timeout per retry. You suspect that the retries are contributing to the latency increase. You want to use Cloud Monitoring to confirm this hypothesis. Which approach should you take?

A.Use Cloud Trace to analyze distributed traces for the checkout service and look for retry spans

B.Check the 'Services' dashboard in Cloud Monitoring, which shows a pre-built latency chart for all services

C.Use Metrics Explorer to query the istio.io/service/server/request_count metric, filtered by response_code_class and destination_service, and include the istio.io/service/server/request_retries metric to see retry counts alongside latency

D.Use Logs Explorer to search for logs containing 'retry' in the checkout service namespace

AnswerC

This directly shows the correlation between retries and latency.

Why this answer

Option C is correct because it directly correlates retry attempts with latency by querying the `istio.io/service/server/request_retries` metric alongside the `istio.io/service/server/request_count` metric in Metrics Explorer. This allows you to visualize the retry count per destination service (checkout) and compare it with the p99 latency increase, confirming whether the retry policy is causing the observed latency spike. The retry policy (3 retries with 500ms timeout) can add up to 1.5s of additional latency per request, which aligns with the increase from 200ms to 2s.
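The 1.5s figure above can be checked with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes every attempt runs to its full per-try timeout with no backoff delay between attempts, which the question does not state.

```python
def worst_case_retry_latency_ms(retries: int, per_try_timeout_ms: int) -> dict:
    """Upper bound on request latency when every attempt times out.

    Assumption (not from the question): no backoff delay between attempts.
    """
    # Latency contributed by the retries alone.
    added_by_retries_ms = retries * per_try_timeout_ms
    # The original attempt plus all retries, each hitting the timeout.
    total_ms = (1 + retries) * per_try_timeout_ms
    return {"added_by_retries_ms": added_by_retries_ms, "total_ms": total_ms}


result = worst_case_retry_latency_ms(retries=3, per_try_timeout_ms=500)
print(result)  # {'added_by_retries_ms': 1500, 'total_ms': 2000}
```

Under these assumptions the retries add up to 1.5s, and a request that times out on every attempt takes about 2s in total, which is in line with the observed p99.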

Exam trap

Google Cloud often tests the distinction between metrics (which aggregate over time) and traces (which show individual request paths), leading candidates to choose Cloud Trace (Option A) when they should use Metrics Explorer with retry-specific metrics to confirm a latency hypothesis.

How to eliminate wrong answers

Option A is wrong because Cloud Trace shows distributed traces and retry spans, but it does not provide aggregated metrics like p99 latency or retry counts over time; it is more suitable for debugging individual requests rather than confirming a hypothesis about overall latency trends. Option B is wrong because the pre-built 'Services' dashboard in Cloud Monitoring shows latency charts but does not include retry metrics, so you cannot directly correlate retries with latency increases. Option D is wrong because Logs Explorer searching for 'retry' logs is inefficient and unreliable; Istio retries are not always logged by default, and even if they are, logs do not provide the aggregated time-series data needed to confirm a latency hypothesis.

437

MCQhard

An organization is implementing SLO-based alerting for a critical service. They want to alert when the service has consumed 50% of its error budget over a 30-day window. Considering best practices for alert sensitivity and noise reduction, which alerting approach should they use?

A.Alert on the burn rate over a 1-hour window with a threshold of 10.

B.Alert on the burn rate over a 5-minute window with a threshold of 0.5.

C.Alert on the error budget remaining with a threshold of 50%.

D.Alert on the SLI value directly with a threshold of 99.9%.

AnswerA

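A burn rate of 10 means the error budget is being consumed 10 times faster than the SLO allows, so the 30-day budget would be exhausted in 3 days (30 days / 10 = 3 days) and 50% of it would be gone in about 1.5 days. Evaluating that rate over a 1-hour window detects the problem early without reacting to brief spikes.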

Why this answer

Option A is correct because alerting on a burn rate of 10 over a 1-hour window directly indicates that the service is consuming error budget at a rate that would exhaust the entire 30-day budget in 3 days (since 30 days / 10 = 3 days). This approach balances sensitivity and noise reduction by using a sufficiently long window (1 hour) to smooth out transient spikes, while the high threshold ensures only significant sustained degradation triggers an alert, aligning with SRE best practices for multi-window, multi-burn-rate alerting.
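The burn-rate arithmetic can be sketched as follows. This is a minimal illustration, not a Cloud Monitoring API: the function names are hypothetical, and a burn rate of 1 is taken to mean the budget is consumed exactly over the full SLO window.

```python
def hours_to_exhaust_budget(window_days: float, burn_rate: float) -> float:
    """Hours until the whole error budget is spent at a constant burn rate."""
    return window_days * 24 / burn_rate


def budget_fraction_consumed(window_days: float, burn_rate: float,
                             elapsed_hours: float) -> float:
    """Fraction of the error budget spent after elapsed_hours at this burn rate."""
    return burn_rate * elapsed_hours / (window_days * 24)


exhaust_hours = hours_to_exhaust_budget(window_days=30, burn_rate=10)
hours_to_half = 0.5 * exhaust_hours
consumed_in_1h = budget_fraction_consumed(window_days=30, burn_rate=10,
                                          elapsed_hours=1)

print(f"budget exhausted after {exhaust_hours / 24:.1f} days")   # 3.0 days
print(f"50% of budget consumed after {hours_to_half:.0f} hours")  # 36 hours
print(f"budget consumed in the 1-hour window: {consumed_in_1h:.2%}")  # 1.39%
```

At a burn rate of 10, a single 1-hour window accounts for roughly 1.4% of the 30-day budget, so the alert fires well before half of the budget is gone.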

Exam trap

Google Cloud often tests the misconception that shorter windows (like 5 minutes) are better for fast detection, but the trap here is that overly short windows increase noise and false positives, whereas a 1-hour window with a high burn rate threshold provides the right balance for a 30-day SLO.

How to eliminate wrong answers

Option B is wrong because a 5-minute window with a burn rate threshold of 0.5 is far too sensitive and noisy; it would trigger alerts on minor, transient blips that do not meaningfully consume error budget over the 30-day window, leading to alert fatigue. Option C is wrong because alerting on error budget remaining at 50% is a reactive, threshold-based approach that provides no lead time; by the time 50% is consumed, the service may already be in a critical state, and it does not account for the rate of consumption. Option D is wrong because alerting on the SLI value directly (e.g., 99.9%) is a static threshold that ignores the error budget entirely; it can trigger false positives during normal fluctuations and fails to measure the actual impact on the SLO over the compliance period.

438

MCQmedium

Refer to the exhibit. You see this log entry from a Cloud Run service. The stack trace shows the error occurs in handler.js at line 50. You want to see the state of variables at that point in the production environment without adding logging or redeploying. What should you do?

A.Use Error Reporting to view similar errors.

B.Use Cloud Profiler to capture a heap snapshot.

C.Use Cloud Debugger to set a snapshot location at line 50 in handler.js.

D.Use Cloud Trace to trace the request.

AnswerC

Cloud Debugger can capture local variables at specific lines in live applications.

Why this answer

Option C is correct because Cloud Debugger allows you to inspect the state of an application, including local variables and call stack, at a specific line of code in a production environment without modifying or redeploying the application. By setting a snapshot at line 50 in handler.js, you can capture the variable values at the exact point where the error occurs, which directly addresses the need to debug without adding logging or redeploying.

Exam trap

Google Cloud often tests the distinction between debugging tools (Cloud Debugger) and monitoring/observability tools (Error Reporting, Cloud Profiler, Cloud Trace), so the trap here is that candidates may confuse Cloud Debugger with Error Reporting or Cloud Trace, thinking any tool that shows errors or traces can also reveal variable state.

How to eliminate wrong answers

Option A is wrong because Error Reporting aggregates and analyzes errors but does not provide the ability to inspect variable state at a specific line of code; it only shows error logs and stack traces. Option B is wrong because Cloud Profiler is used for continuous profiling of CPU and memory usage to identify performance bottlenecks, not for capturing variable state at a specific code location. Option D is wrong because Cloud Trace is a distributed tracing system that tracks request latency and path through services, but it does not capture local variable values or allow inspection of application state at a specific line of code.

439

MCQeasy

What is the primary benefit of using preemptible VMs?

A.Higher reliability.

B.Faster performance.

C.Better security.

D.Lower cost.

AnswerD

Preemptible VMs are significantly cheaper than regular VMs.

Why this answer

Preemptible VMs are Compute Engine instances that last a maximum of 24 hours and can be terminated at any time by Google. They are discounted by roughly 60-91% compared to standard VMs, which makes them a good fit for batch jobs and fault-tolerant workloads where cost savings are the primary benefit.

Exam trap

Google Cloud often tests the misconception that preemptible VMs offer higher performance or reliability, but the exam trap is that candidates confuse the cost-saving benefit with other attributes like speed or availability, which are not improved.

How to eliminate wrong answers

Option A is wrong because preemptible VMs have no reliability guarantees; they can be terminated at any time, so they are less reliable than standard VMs. Option B is wrong because preemptible VMs use the same machine types and performance as standard VMs; there is no performance boost. Option C is wrong because preemptible VMs do not provide better security; they share the same security model as standard VMs and are not designed for security enhancements.

440

Matchingmedium

Match each cost optimization practice to its description.

Concepts

Matches

Discount for 1- or 3-year resource commitment

Automatic discounts for running instances most of month

Short-lived, low-cost instances for batch jobs

Adjusting machine type to match workload needs

Notifications when spending exceeds thresholds

Why these pairings

Common cost management techniques on Google Cloud.

441

MCQhard

A company runs a critical application on Google Kubernetes Engine (GKE) with 3 nodes (e2-standard-4). To reduce costs, the team is considering right-sizing the nodes. The application is latency-sensitive and experiences periodic traffic spikes. What is the most cost-effective approach that maintains performance during spikes?

A.Use a single larger node (n2-standard-8) to reduce node count and network overhead.

B.Switch to 3 n2-standard-2 nodes to reduce vCPU and memory, and rely on horizontal pod autoscaling.

C.Create a node pool with smaller nodes (e2-standard-2) using preemptible VMs and enable cluster autoscaler.

D.Keep current node size but use committed use discounts for 1 year to reduce per-hour cost.

AnswerC

Preemptible VMs are cheaper, and autoscaler adds nodes during spikes, maintaining performance.

Why this answer

Option C is the most cost-effective approach because it uses smaller e2-standard-2 nodes with preemptible VMs, which are significantly cheaper than regular VMs, and combines this with the cluster autoscaler to automatically add nodes during traffic spikes. This ensures that the latency-sensitive application maintains performance by scaling out horizontally when needed, while minimizing baseline costs with smaller, cheaper nodes.

Exam trap

The trap here is that candidates often assume preemptible VMs are unsuitable for production or latency-sensitive workloads, but the cluster autoscaler mitigates the risk of VM termination by quickly replacing nodes, making this a valid cost-saving strategy for spike-tolerant applications.

How to eliminate wrong answers

Option A is wrong because using a single larger node (n2-standard-8) creates a single point of failure and does not leverage horizontal scaling; during spikes, the application may still suffer from resource contention on a single node, and network overhead is negligible in GKE. Option B is wrong because switching to 3 n2-standard-2 nodes reduces total vCPU and memory by 50%, and relying solely on horizontal pod autoscaling without cluster autoscaler means the node pool cannot grow during spikes, leading to pod scheduling failures and performance degradation. Option D is wrong because keeping the current node size and using committed use discounts only reduces per-hour cost by about 20-30% but does not address the goal of right-sizing to reduce costs; it maintains the same over-provisioned resources, which is less cost-effective than using smaller nodes with autoscaling.
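The capacity comparison behind these eliminations can be sketched in a few lines. The vCPU and memory figures are the standard shapes for these machine types; the dictionary and function names are hypothetical, and the sketch ignores capacity reserved by GKE system components.

```python
# (vCPUs, memory in GB) per node for the machine types named in the options.
MACHINE_SHAPES = {
    "e2-standard-2": (2, 8),
    "e2-standard-4": (4, 16),
    "n2-standard-2": (2, 8),
    "n2-standard-8": (8, 32),
}


def pool_capacity(machine_type: str, node_count: int) -> tuple[int, int]:
    """Total (vCPUs, memory GB) for a node pool of identical nodes."""
    vcpus, memory_gb = MACHINE_SHAPES[machine_type]
    return vcpus * node_count, memory_gb * node_count


current = pool_capacity("e2-standard-4", 3)   # (12, 48)
option_a = pool_capacity("n2-standard-8", 1)  # (8, 32)
option_b = pool_capacity("n2-standard-2", 3)  # (6, 24)

# Option B halves both vCPU and memory relative to the current pool.
vcpu_reduction_b = 1 - option_b[0] / current[0]
print(current, option_a, option_b, f"{vcpu_reduction_b:.0%}")
```

Without the cluster autoscaler, Option B's fixed 6 vCPUs are the ceiling during a spike, whereas an autoscaled pool of small nodes can grow past its baseline.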

442

MCQhard

A company has on-premises servers running Linux and GKE clusters. They want to monitor all infrastructure using Cloud Monitoring. Which solution is most scalable and aligned with Google's best practices?

A.Use collectd on on-prem servers to send to Cloud Monitoring via the Stackdriver agent configuration.

B.Deploy Prometheus on both environments and use the PromQL adapter for Cloud Monitoring.

C.Use Google's managed service for Prometheus on GKE and a Prometheus federation for on-prem.

D.Install the Ops Agent on all on-prem servers and use Google's default GKE monitoring.

AnswerC

The managed service for Prometheus on GKE is fully integrated, and federation from on-prem Prometheus scales well.

Why this answer

Option C is correct because it leverages Google's managed service for Prometheus on GKE, which is fully integrated with Cloud Monitoring and eliminates the operational overhead of self-managing Prometheus. For on-premises servers, Prometheus federation allows scraping metrics from the on-prem Prometheus instance and forwarding them to the managed service, providing a unified, scalable monitoring solution that aligns with Google's best practices for hybrid environments.

Exam trap

Google Cloud often tests the misconception that self-managed Prometheus with a custom adapter is the most scalable solution, when in fact Google's managed service eliminates operational overhead and provides native integration with Cloud Monitoring, making it the best practice for hybrid environments.

How to eliminate wrong answers

Option A is wrong because collectd is a legacy agent that requires manual configuration and does not natively integrate with Cloud Monitoring's modern metric pipeline; the Stackdriver agent is deprecated in favor of the Ops Agent. Option B is wrong because deploying self-managed Prometheus on both environments and using the PromQL adapter adds unnecessary complexity and does not leverage Google's managed service, which provides automatic scaling, high availability, and native integration with Cloud Monitoring. Option D is wrong because the Ops Agent is designed for on-premises VMs but does not provide the same level of integration for GKE clusters as the managed Prometheus service; using Google's default GKE monitoring lacks the flexibility and advanced querying capabilities of Prometheus for custom metrics.

443

MCQhard

A large enterprise is designing a centralized DevOps platform across multiple business units. They want to use a shared CI/CD pipeline that deploys to projects in different folders. Which approach ensures secure, auditable deployments while minimizing IAM administration?

A.Use a cross-project service account in the CI/CD project with required roles (e.g., Cloud Run Admin, Compute Admin) on target projects via IAM.

B.Use Cloud Build triggers directly in each target project with separate code repositories.

C.Grant the Cloud Build Editor role to all developers across projects to allow them to create pipelines.

D.Create a separate service account in each target project with the Cloud Build service agent role, and use impersonation from the CI/CD project.

AnswerA

Centralized service account with cross-project IAM is best practice; it simplifies management and audit.

Why this answer

Option A is correct because a cross-project service account in the CI/CD project, granted the necessary roles (e.g., Cloud Run Admin, Compute Admin) on target projects via IAM, allows the shared pipeline to deploy resources across folders without duplicating service accounts. This centralizes IAM administration, ensures auditability through a single identity, and follows the principle of least privilege by granting only required roles on target projects.

Exam trap

The trap here is that candidates often confuse the Cloud Build service agent role (used for internal Cloud Build operations) with the cross-project service account pattern, leading them to choose Option D, which adds unnecessary administrative overhead instead of leveraging IAM's native cross-project delegation.

How to eliminate wrong answers

Option B is wrong because using Cloud Build triggers directly in each target project with separate code repositories defeats the purpose of a centralized DevOps platform, increasing IAM administration overhead and fragmenting audit trails across multiple projects. Option C is wrong because granting the Cloud Build Editor role to all developers across projects violates the principle of least privilege, introduces excessive permissions, and undermines auditable deployments by allowing developers to create arbitrary pipelines. Option D is wrong because creating a separate service account in each target project with the Cloud Build service agent role and using impersonation from the CI/CD project adds unnecessary IAM complexity and administrative burden, as the cross-project service account approach in Option A achieves the same goal more efficiently.

444

Multi-Selecteasy

Which TWO statements about bootstrapping a Google Cloud organization for DevOps are correct?

Select 2 answers

A.After enabling the cloudresourcemanager.googleapis.com API, organization policies are automatically applied.

B.Cloud Asset Inventory can be used to discover all resources in the organization.

C.All projects in an organization automatically share a default VPC network.

D.Cloud Audit Logs are disabled by default and must be enabled for each service.

E.Organization policies can be applied at the organization, folder, or project level.

AnswersB, E

Correct: Cloud Asset Inventory provides a historical view of all resources.

Why this answer

Option B is correct because Cloud Asset Inventory provides a complete view of all resources (e.g., Compute Engine instances, Cloud Storage buckets, IAM policies) across the entire organization, including all folders and projects. This is essential for DevOps bootstrapping to audit, monitor, and manage resources at scale. It uses the Cloud Asset API to export asset metadata and supports real-time feeds for change detection. Option E is correct because organization policies can be set on the organization, a folder, or a project, and are inherited down the resource hierarchy.

Exam trap

Google Cloud often tests the misconception that organization policies are automatically applied after enabling an API, or that Cloud Audit Logs are disabled by default, when in fact Admin Activity logs are always enabled and Data Access logs require explicit activation.

445

MCQeasy

A Cloud Run service is experiencing increased cold start latency. The service is written in Python and uses several large dependencies. Which action would most effectively reduce cold start latency?

A.Set concurrency to 1 to ensure each request gets a dedicated container.

B.Increase the CPU allocation to 4 vCPUs.

C.Set a minimum number of instances to keep containers warm.

D.Increase memory to 2 GiB.

AnswerC

Min instances eliminate cold start by keeping containers ready.

Why this answer

Option C is correct because setting a minimum number of instances ensures that the Cloud Run service always has a pool of warm containers ready to serve requests, so requests handled by those instances avoid the cold start penalty. Cold starts in Python are particularly severe due to the time required to import large dependencies (e.g., NumPy, TensorFlow) and initialize the runtime. By keeping containers alive, you skip the initialization phase for those requests, directly addressing the root cause of the increased latency. Instances added beyond the minimum during scale-out still incur a cold start.

Exam trap

Google Cloud often tests the misconception that increasing CPU or memory directly reduces cold start latency, when in fact cold starts are primarily caused by initialization overhead (dependency loading, runtime startup) that is not mitigated by resource scaling.

How to eliminate wrong answers

Option A is wrong because setting concurrency to 1 does not reduce cold start latency; it forces each request to have a dedicated container, which can actually increase the number of cold starts if the service scales up, and it wastes resources without addressing the initialization delay. Option B is wrong because increasing CPU allocation speeds up request processing after the container is warm, but it does not reduce the time taken to import large Python dependencies or start the application—cold start latency is dominated by I/O and import overhead, not CPU speed. Option D is wrong because increasing memory provides more headroom for the container but does not affect the initialization sequence; cold start latency is caused by loading dependencies and starting the runtime, not by memory pressure.

446

MCQmedium

A company is setting up a new Google Cloud organization for DevOps. They want to enforce that all projects have a specific set of VPC Service Controls perimeters. Which approach should they use to ensure these perimeters are automatically applied to all new projects?

A.Configure Cloud Shell to run a script that creates a perimeter when a new project is created.

B.Define an organization policy with a constraint that requires all projects to be within a perimeter.

C.Use Deployment Manager to deploy a configuration that creates a perimeter for each new project.

D.Create a VPC Service Controls perimeter and add the organization node as a member.

AnswerB

Organization policies can enforce constraints like 'vpcServiceControls' across projects.

Why this answer

Option B is correct because Google Cloud Organization Policies allow you to define and enforce constraints at the organization, folder, or project level. The `constraints/compute.restrictVpcServiceControls` constraint can be set to require all new projects to be within a specific VPC Service Controls perimeter, ensuring automatic enforcement without manual intervention.

Exam trap

The trap here is that candidates often confuse VPC Service Controls perimeter membership (which is a resource-level attribute) with organization policy enforcement (which is a hierarchical governance mechanism), leading them to choose Option D or A instead of the correct policy-based approach.

How to eliminate wrong answers

Option A is wrong because Cloud Shell scripts are not a scalable or reliable mechanism for enforcing policies on all new projects; they require manual execution or a separate trigger and do not provide automatic, organization-wide enforcement. Option C is wrong because Deployment Manager is an infrastructure-as-code tool for deploying resources, but it does not automatically apply to every new project created outside of its deployment scope; it would require a separate deployment per project. Option D is wrong because adding the organization node as a member to a VPC Service Controls perimeter does not automatically enforce that all projects within the organization are inside the perimeter; it only allows the organization to be a member, but projects must still be explicitly added or constrained via policy.

447

MCQhard

A large enterprise is migrating to Google Cloud and wants to bootstrap their organization for DevOps. They have multiple business units, each needing their own folder with projects. Security requires that all projects in the 'prod' folder must have a specific set of organization policies enforced, such as restricting service account key creation. They also want to allow individual teams to create project-level policies as long as they don't conflict with the organization policies. Which approach ensures this while minimizing administrative overhead?

A.Set the required organization policies on the 'prod' folder and allow teams to set additional policies at the project level as long as they don't conflict.

B.Set organization policies at the organization level and use IAM conditions to apply them only to the prod folder.

C.Create custom roles containing the required constraints and assign them to the team's IAM members.

D.Place all production workloads in a single project and use VPC Service Controls for security.

AnswerA

Folder-level policies are inherited; project policies can add restrictions but cannot relax them.

Why this answer

Option A is correct because Google Cloud Organization Policies can be set at the folder level, allowing the 'prod' folder to inherit constraints like `iam.disableServiceAccountKeyCreation` across all its projects. Teams can then add additional project-level policies that are more restrictive, as long as they do not conflict with the inherited folder-level policies, which is enforced by the policy hierarchy. This minimizes administrative overhead by centralizing mandatory controls at the folder level while delegating flexibility to teams.

Exam trap

The trap here is confusing IAM roles and conditions with organization policy constraints, leading candidates to incorrectly select Option B or C, when in fact organization policies are a separate, hierarchical mechanism that cannot be bypassed by IAM or custom roles.

How to eliminate wrong answers

Option B is wrong because organization policies cannot be applied selectively using IAM conditions; IAM conditions control access to resources, not the enforcement of organization policy constraints. Option C is wrong because custom roles define IAM permissions, not organization policy constraints; constraints like restricting service account key creation are enforced via organization policies, not IAM roles. Option D is wrong because placing all production workloads in a single project violates the requirement for multiple business units to have their own folders and projects, and VPC Service Controls address data exfiltration, not organization policy enforcement.

448

MCQeasy

A company uses Error Budgets for their service. The SLO is 99.9% availability over a 30-day window. The service has been down for 30 minutes in the current window. What is the remaining error budget?

A.43.2 minutes

B.60 minutes

C.13.2 minutes

D.30 minutes

AnswerC

Calculation: 0.001 * 43200 minutes = 43.2 minutes budget, minus 30 = 13.2.

Why this answer

The SLO of 99.9% over a 30-day window allows a total error budget of 43.2 minutes (30 days × 24 hours × 60 minutes × 0.001). The service has already consumed 30 minutes of downtime, so the remaining error budget is 43.2 - 30 = 13.2 minutes. Option C is correct because it reflects this precise calculation.
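The same calculation as a small sketch, in case you want to try other SLO targets or windows. The function names are hypothetical, and results are rounded to one decimal place to hide floating-point noise.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return round((1 - slo) * window_minutes, 1)


def remaining_budget_minutes(slo: float, window_days: int,
                             downtime_minutes: float) -> float:
    """Error budget left after subtracting downtime already incurred."""
    return round(error_budget_minutes(slo, window_days) - downtime_minutes, 1)


total = error_budget_minutes(slo=0.999, window_days=30)
remaining = remaining_budget_minutes(slo=0.999, window_days=30,
                                     downtime_minutes=30)
print(total, remaining)  # 43.2 13.2
```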

Exam trap

Google Cloud often tests the distinction between total error budget and remaining error budget, trapping candidates who forget to subtract the already consumed downtime from the total allowable downtime.

How to eliminate wrong answers

Option A is wrong because 43.2 minutes is the total error budget for the 30-day window, not the remaining budget after 30 minutes of downtime. Option B is wrong because 60 minutes would correspond to an SLO of approximately 99.86% (43.2 minutes is the correct total for 99.9%), and it does not account for the 30 minutes already consumed. Option D is wrong because 30 minutes is simply the downtime already incurred, not the remaining error budget.

449

MCQeasy

A team wants to monitor a Google Cloud Run service for application crashes. Which Google Cloud tool automatically captures and notifies on application errors?

A.Cloud Logging

B.Cloud Monitoring

C.Cloud Console

D.Error Reporting

AnswerD

Error Reporting automatically aggregates errors and can send notifications.

Why this answer

Error Reporting (D) is the correct answer because it is a Google Cloud service specifically designed to automatically capture, aggregate, and notify on application errors, including crashes in Cloud Run services. It ingests error events from Cloud Logging and provides real-time alerts and dashboards, making it the dedicated tool for this use case.

Exam trap

Google Cloud often tests the distinction between log storage (Cloud Logging) and error-specific analysis (Error Reporting), leading candidates to mistakenly choose Cloud Logging because they think 'logs contain errors, so that must be the tool.'

How to eliminate wrong answers

Option A is wrong because Cloud Logging is a centralized log storage and querying service; it does not automatically parse or notify on application errors without additional configuration (e.g., log-based metrics or sinks). Option B is wrong because Cloud Monitoring focuses on metrics, uptime checks, and alerting based on performance thresholds, not on automatically capturing and categorizing application crash errors. Option C is wrong because Cloud Console is a web-based UI for managing Google Cloud resources; it provides no automated error capture or notification capabilities.

450

MCQhard

A company uses Cloud Monitoring to track latency for a multi-region web application. The SLO is 99.9% of requests under 500ms over a 30-day rolling window. The error budget has been rapidly depleting over the last week. The operations team wants to understand the impact of recent deployments. Which approach should they use to correlate deployment changes with latency spikes?

A.Use Cloud Logging to search for deployment logs and manually compare with latency metrics

B.Use Cloud Trace to analyze latency distributions for each deployment version

C.Create a custom dashboard in Cloud Monitoring that includes latency charts and use annotation markers to indicate deployment times

D.Configure Error Reporting to alert on latency threshold breaches

AnswerC

Annotation markers allow you to overlay deployment events on time-series charts, making it easy to correlate changes with latency spikes.

Why this answer

Option C is correct because Cloud Monitoring supports custom dashboards with annotation markers that can be programmatically or manually added to indicate deployment events. By overlaying these markers on latency charts, the operations team can visually correlate deployment times with latency spikes, enabling direct root-cause analysis without manual log searching or separate tools.

Exam trap

Google Cloud often tests the distinction between monitoring tools (Cloud Monitoring for dashboards and annotations) versus debugging tools (Cloud Trace for per-request analysis) or logging tools (Cloud Logging for raw logs), leading candidates to choose a tool that addresses part of the problem but not the correlation requirement.

How to eliminate wrong answers

Option A is wrong because manually searching Cloud Logging for deployment logs and comparing them with latency metrics is inefficient, error-prone, and does not provide a real-time or automated correlation; it relies on manual cross-referencing, which is not scalable for a multi-region application. Option B is wrong because Cloud Trace is designed for distributed tracing of individual requests and analyzing latency distributions per version, but it does not natively support overlaying deployment timelines or providing a high-level dashboard view for correlation with deployment events. Option D is wrong because Error Reporting is focused on aggregating and alerting on application errors (e.g., exceptions, crashes), not on latency threshold breaches; configuring it to alert on latency would misuse its purpose, and it lacks the ability to correlate alerts with deployment timelines.
