How many troubleshooting scenario questions are on this page?

This page has 15 troubleshooting scenario questions for the PCDOE exam, each with detailed explanations and wrong-answer analysis.

How should I approach PCDOE scenario questions?

Read the full scenario before looking at the answer options. Identify the constraint or requirement in the scenario, then eliminate options that are generally true but wrong for this specific case. Scenario questions reward careful reading over pattern matching.

Scenario-based practice

Troubleshooting Scenario Questions

Practise for the Google Professional Cloud DevOps Engineer exam with original exam-style scenarios covering every exam domain, with detailed explanations, wrong-answer analysis, and common exam traps.

Scenario questions: 15

Exam code: PCDOE

Vendor: Google Cloud

Scenario guide

How to approach troubleshooting scenario questions

These questions describe a symptom and ask you to identify the root cause or the correct fix. They appear across all certification exams and reward systematic thinking over memorisation. The best candidates follow a consistent troubleshooting framework even under time pressure.

Quick answer

Troubleshooting scenario questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Practice scenarios

Question 1 (medium, multiple choice)

An organization uses Cloud Build with a private pool to build container images that require access to on-premises Artifactory. After moving to a new VPC, builds fail with 'Connection refused' when fetching dependencies. What is the best step to troubleshoot?

A
Verify that VPC Network Peering is established between the Cloud Build private pool's service producer VPC and the customer VPC, and that routes to on-premises are present.
Correct: Private pools require peering; missing peering stops traffic.
B
Verify that the Cloud Build service account has the dns.networks.bindPrivateZone permission.
Why wrong: DNS permissions are needed for private zones, but the error is connection refused, not DNS resolution.
C
Check that the Cloud Build service account has the storage.objectViewer role on the Artifactory bucket.
Why wrong: Artifactory is not a GCS bucket; the error is network, not permissions.
D
Ensure that Cloud NAT is configured in the private pool's VPC.
Why wrong: Cloud NAT is for internet egress; on-premises is accessed via VPN/Interconnect.

Question 2 (medium, multiple choice)

Your company runs a multi-region application on Google Kubernetes Engine. You have implemented Cloud Monitoring dashboards to track cluster resource utilization and application SLIs. After a recent upgrade, you notice that the dashboard shows a sudden drop in CPU utilization for all nodes in one zone, but the application is still serving traffic normally. You suspect a monitoring issue. What should you investigate first?

A
Check if the nodes in that zone have been cordoned.
Why wrong: Cordoning only prevents new pods from being scheduled on a node; it does not stop metric collection, and the application is still serving traffic.
B
Check if the application's resource requests and limits have changed.
Why wrong: Resource requests affect scheduling, not metric collection.
C
Check if the Kubernetes Metrics Server is running correctly in that zone.
Correct: Metrics Server is responsible for collecting resource usage; if it's down, CPU data would drop.
D
Check if the Cloud Monitoring agent has been updated incorrectly.
Why wrong: GKE uses Metrics Server, not an agent, for resource metrics.

Question 3 (hard, multiple choice)

A DevOps team is troubleshooting a Cloud Build pipeline that fails intermittently when building a container image. The build step uses a custom build step that runs a vulnerability scan. The error log shows: 'Step #1: Error: failed to scan image: context deadline exceeded'. The build configuration includes 'timeout: 600s'. Which is the most likely cause and solution?

A
The scan tool requires a specific dependency; add an installation step before scanning.
Why wrong: Missing dependency would cause command not found or similar errors.
B
There is network latency between Cloud Build and the container registry; use VPC Service Controls.
Why wrong: Network latency would cause connection errors, not a deadline exceeded within the build.
C
The build step is running out of memory; increase the machine type to e2-highcpu-8.
Why wrong: Memory issues would produce out-of-memory errors, not timeout.
D
The scan step is taking longer than the build timeout; increase the timeout value in the build configuration.
Correct: The error 'context deadline exceeded' indicates the step timed out.

Question 4 (easy, multiple choice)

A team uses Cloud SQL for PostgreSQL. They receive an alert that the database's CPU utilization is above 95% for the past 30 minutes. Queries are taking longer than usual. They want to investigate without causing further impact. What should they do first?

A
Increase the number of vCPUs of the Cloud SQL instance
Why wrong: Scaling up is a mitigation but should be done after understanding the cause.
B
Restart the Cloud SQL instance to clear the cache
Why wrong: Restarting causes downtime and does not fix the root cause.
C
Migrate the database to Cloud Spanner
Why wrong: Migration is a long-term project, not an immediate investigation step.
D
Use Cloud SQL Query Insights to find the most time-consuming queries
Correct: Query Insights shows top queries by CPU and latency.

Question 5 (hard, multiple choice)

A company uses Spinnaker for continuous delivery across multiple GKE clusters. After a recent infrastructure change, the 'Canary' deployment strategy fails during the 'disable' phase of the old version. The error log shows: 'Unable to disable server group: Not authorized to perform compute.instanceGroups.update.' What is the most likely root cause?

A
The GKE cluster has reached its maximum node quota.
Why wrong: Quota issues would produce resource exhaustion errors, not authorization.
B
The Cloud Deploy pipeline is missing the required IAM role for the Spinnaker service account.
Why wrong: The error is from Spinnaker directly, not Cloud Deploy.
C
The Spinnaker service account lacks the compute.instanceGroups.update permission on the project.
Correct: Spinnaker uses this permission to disable old server groups.
D
The Kayenta canary analysis service is not configured correctly.
Why wrong: Kayenta handles metric analysis, not disabling server groups.

Question 6 (hard, multi-select)

An incident is declared for a production service running on GKE. The on-call engineer suspects a recent code change may have introduced a memory leak. Which THREE actions should the engineer take to investigate and mitigate?

A
Increase the memory limit for the container as a temporary mitigation
Correct: A temporary increase buys time for a permanent fix.
B
Scale down the number of replicas to reduce memory pressure
Why wrong: Scaling down reduces total memory but each container still leaks, causing crashes.
C
Roll back the deployment immediately without further investigation
Why wrong: Rollback is mitigation, but the question asks for investigation and mitigation steps.
D
Check container logs for Out of Memory (OOM) killed messages
Correct: OOM messages confirm memory exhaustion.
E
Compare memory usage metrics before and after the deployment using Cloud Monitoring
Correct: The comparison shows whether memory usage increased after the change.
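
The before-and-after comparison in option E can be sketched in a few lines. This is a minimal illustration, assuming memory samples have already been exported from Cloud Monitoring as (timestamp, value) pairs. The sample data and the `memory_growth` helper are hypothetical and not part of any Google Cloud API.

```python
from statistics import mean


def memory_growth(samples, deploy_time):
    """Return (mean_before, mean_after, ratio) for (timestamp, value) samples.

    Samples taken at or after deploy_time count as "after".
    """
    before = [value for ts, value in samples if ts < deploy_time]
    after = [value for ts, value in samples if ts >= deploy_time]
    if not before or not after:
        raise ValueError("need samples on both sides of the deployment")
    mean_before, mean_after = mean(before), mean(after)
    return mean_before, mean_after, mean_after / mean_before


# Hypothetical data: (minutes since start, container memory in MiB).
samples = [(0, 200), (10, 210), (20, 190), (30, 300), (40, 400), (50, 500)]
before, after, ratio = memory_growth(samples, deploy_time=30)
print(f"before={before:.0f} MiB, after={after:.0f} MiB, ratio={ratio:.2f}")
```

A ratio well above 1 that keeps climbing over time is consistent with a leak, whereas a one-off step up may only reflect a larger but stable footprint in the new version.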

Question 7 (medium, multi-select)

A team uses Google Kubernetes Engine (GKE) with cluster telemetry enabled. During an incident, they notice that a deployment's pods are repeatedly crashing with Exit Code 137. The team wants to investigate the root cause. Which two Google Cloud services should they use together to correlate resource usage and logs?

A
Cloud Monitoring and Cloud Logging
Correct: Monitoring shows resource usage; Logging shows container logs and OOM events.
B
Security Command Center and Cloud Logging
Why wrong: Security Command Center is for vulnerabilities, not incident root cause.
C
Cloud Trace and Cloud Monitoring
Why wrong: Trace is for request latency, not resource usage or crash logs.
D
Cloud Error Reporting and Cloud Logging
Why wrong: Error Reporting does not show resource metrics.
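
Exit Code 137 follows the common convention that an exit status above 128 means the process was terminated by signal number (status - 128). The sketch below decodes that convention, assuming a POSIX system; `describe_exit_code` is a hypothetical helper written for illustration.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {code - 128}"
        return f"terminated by {name}"
    return f"exited with error status {code}"


print(describe_exit_code(137))
```

On Linux, 137 decodes to SIGKILL (signal 9). The kernel's OOM killer sends SIGKILL, which is why 137 often points to a memory limit, but any other SIGKILL produces the same code, so the resource metrics and logs are still needed to confirm the cause.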

Question 8 (easy, multiple choice)

An engineer receives an alert that a service's error rate has exceeded the threshold. To investigate, which log-based metric should the engineer query in Cloud Logging to identify the root cause?

A
Error log count grouped by service name.
Correct: Grouping by service reveals which service has the most errors.
B
Request latency histogram.
Why wrong: Latency does not identify the cause of the errors.
C
CPU utilization of the service instances.
Why wrong: CPU utilization may correlate with errors, but it does not show where the errors originate.
D
Network bytes sent per instance.
Why wrong: Network traffic volume does not indicate where the errors originate.
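
The grouping in option A amounts to counting error-level entries per service. The sketch below shows the idea on in-memory records; the `service` and `severity` field names and the sample entries are assumptions for illustration, not the Cloud Logging entry schema.

```python
from collections import Counter

ERROR_LEVELS = {"ERROR", "CRITICAL"}


def error_counts_by_service(entries):
    """Count error-level log entries per service name."""
    return Counter(
        entry["service"] for entry in entries if entry["severity"] in ERROR_LEVELS
    )


# Hypothetical log entries.
entries = [
    {"service": "checkout", "severity": "ERROR"},
    {"service": "checkout", "severity": "ERROR"},
    {"service": "search", "severity": "INFO"},
    {"service": "auth", "severity": "CRITICAL"},
]
for service, count in error_counts_by_service(entries).most_common():
    print(service, count)
```

Sorting by count puts the service contributing most of the errors first, which narrows where to look next.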

Question 9 (medium, multiple choice)

You are a DevOps engineer for a SaaS company that provides a REST API. The API is deployed on Google Cloud Run. You have configured Cloud Monitoring alerts for 5xx errors. Recently, you received an alert that the error rate exceeded 5% for 5 minutes. You investigated and found that the errors were HTTP 503 (Service Unavailable) from a specific endpoint. The endpoint calls an internal Cloud SQL database. The database CPU utilization was at 90% during that period. You suspect the database is the bottleneck. Which action should you take to reduce the error rate without over-provisioning?

A
Implement connection pooling and retry logic with exponential backoff in the API service
Correct: This reduces the number of simultaneous connections to the database and handles transient failures gracefully.
B
Increase the max instances per revision in Cloud Run to handle more concurrent requests
Why wrong: Increasing Cloud Run instances could increase load on the already stressed database, worsening the issue.
C
Reduce the min instances of Cloud Run to decrease load on the database
Why wrong: Reducing instances may cause cold starts and does not address the root cause of database overload.
D
Add a Cloud SQL read replica and route read queries to it
Why wrong: The endpoint causing 503 errors likely involves writes; read replicas won't reduce write load.
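
The retry half of option A can be sketched as follows. This is a minimal illustration of exponential backoff with jitter, not a production client: `call_with_backoff` and `flaky_query` are hypothetical names, the delay values are arbitrary, and connection pooling itself would normally come from the database driver or ORM rather than hand-written code.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=0.05, max_delay=1.0,
                      sleep=time.sleep):
    """Call fn, retrying on ConnectionError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            # The cap doubles each attempt; a random delay below the cap
            # spreads retries out so clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


# Hypothetical database call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_query():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("database busy")
    return "ok"


print(call_with_backoff(flaky_query))
```

The randomised delay matters here because many Cloud Run instances retrying at the same instant would hit the already loaded database with synchronized bursts.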

Question 10 (hard, multiple choice)

You are troubleshooting a performance issue with a Compute Engine instance that is part of a managed instance group serving a web application. Users report intermittent high latency. You run the command shown in the exhibit. Based on the output, what is the most likely cause of the performance issue?

Exhibit

Refer to the exhibit.

```
$ gcloud compute instances describe instance-1 --zone=us-central1-a
...
networkInterfaces:
- accessConfigs:
  - name: external-nat
    natIP: 34.123.45.67
    type: ONE_TO_ONE_NAT
  name: nic0
  network: https://www.googleapis.com/compute/v1/projects/my-project/global/networks/default
  subnetwork: https://www.googleapis.com/compute/v1/projects/my-project/regions/us-central1/subnetworks/default
...
disks:
- autoDelete: true
  boot: true
  deviceName: instance-1
  diskSizeGb: '100'
  interface: SCSI
  source: https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a/disks/instance-1
  type: PERSISTENT
...
serviceAccounts:
- email: 123456789-compute@developer.gserviceaccount.com
  scopes:
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring.write
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/service.management.readonly
  - https://www.googleapis.com/auth/trace.append
```

A
The instance is under-provisioned for CPU.
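Correct: The exhibit does not show the machine type, so CPU sizing cannot be confirmed from this output. Under-provisioned CPU remains the most likely cause by elimination, because the exhibit contradicts or fails to support the other three options, and insufficient CPU causes high latency under load.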
B
The instance is hitting the network egress bandwidth limit.
Why wrong: No evidence of network throttling; the instance has a standard external IP and typical egress limits for its machine type.
C
The service account lacks the necessary scopes for Cloud Monitoring and Cloud Trace.
Why wrong: The service account has monitoring.write and trace.append scopes, which are adequate for sending metrics and traces.
D
The boot disk is too small, causing I/O contention.
Why wrong: A 100GB persistent disk is sufficient for typical web server operations; disk I/O is unlikely the bottleneck unless there is heavy logging.
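
Option C can be ruled out mechanically by comparing the scopes listed in the exhibit with the two scopes needed to write metrics and traces. The sketch below does that with a set difference; `missing_scopes` is a hypothetical helper, and the granted list is copied from the exhibit.

```python
REQUIRED_SCOPES = {
    "https://www.googleapis.com/auth/monitoring.write",
    "https://www.googleapis.com/auth/trace.append",
}

# Scopes copied from the serviceAccounts section of the exhibit.
GRANTED_SCOPES = [
    "https://www.googleapis.com/auth/devstorage.read_only",
    "https://www.googleapis.com/auth/logging.write",
    "https://www.googleapis.com/auth/monitoring.write",
    "https://www.googleapis.com/auth/servicecontrol",
    "https://www.googleapis.com/auth/service.management.readonly",
    "https://www.googleapis.com/auth/trace.append",
]


def missing_scopes(granted, required=REQUIRED_SCOPES):
    """Return the required scopes that are absent from the granted list."""
    return sorted(set(required) - set(granted))


print(missing_scopes(GRANTED_SCOPES))  # [] because both scopes are granted
```

An empty result means the instance can send metrics and traces, so the service account scopes do not explain the latency.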

Question 11 (medium, multi-select)

A team is troubleshooting a slow response time on an App Engine standard environment application. The application uses Cloud SQL as its database. Which TWO actions should the team take to identify the bottleneck?

A
Examine App Engine request logs for latency patterns.
Correct: Request latency patterns show which requests are slow and can be correlated with slow queries.
B
Increase the number of App Engine instances.
Why wrong: This may not address the root cause.
C
Enable Cloud SQL slow query logging and analyze long-running queries.
Correct: Identifies database-level latency.
D
Enable Cloud CDN to cache responses.
Why wrong: Not suitable for dynamic content.
E
Disable caching to ensure fresh data.
Why wrong: Disabling caching worsens performance.

Question 12 (medium, drag order)

Arrange the steps to troubleshoot a high latency issue on a Google Cloud HTTP(S) Load Balancer.

Question 13 (easy, multiple choice)

Refer to the exhibit. A budget alert has fired for project dev-123 indicating that the cost has exceeded the budget of $1000. What should the team do next to investigate the cost overrun?

Exhibit

```
{
  "budgetName": "projects/my-project/budgets/my-budget",
  "costAmount": 1500.00,
  "budgetAmount": 1000.00,
  "alertThresholdExceeded": 1.0,
  "budgetFilter": {
    "projects": ["projects/dev-123"],
    "creditTypesTreatment": "INCLUDE_ALL_CREDITS"
  }
}
```

A
Disable all APIs in project dev-123 immediately.
Why wrong: Disabling APIs could disrupt services; investigation should come first.
B
Open a BigQuery query on the billing export table, filtering by project 'dev-123' and service.
Correct: BigQuery billing exports provide detailed cost data for root cause analysis.
C
Set up a Cloud Function to automatically shut down resources when the budget is exceeded.
Why wrong: Automated shutdown is a remediation step, not an investigation step.
D
View the billing account's cost table in the Cloud Console.
Why wrong: The console shows aggregated costs but lacks granular breakdown to pinpoint the cause.
E
Create a new budget with a lower threshold to get alerted earlier.
Why wrong: This does not help investigate; it's a reactive measure.
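
The exhibit and option B can be tied together with a short sketch: read the overrun from the notification, then build a per-service cost query for the affected project. Several things here are assumptions made for illustration: the helper names, the placeholder table name `my-project.billing.gcp_billing_export`, and the column names `service.description`, `project.id`, and `cost`, which assume the standard usage cost export schema. The string interpolation is only for readability; real code should use query parameters.

```python
import json

# Budget notification copied from the exhibit.
NOTIFICATION = """
{
  "budgetName": "projects/my-project/budgets/my-budget",
  "costAmount": 1500.00,
  "budgetAmount": 1000.00,
  "alertThresholdExceeded": 1.0,
  "budgetFilter": {
    "projects": ["projects/dev-123"],
    "creditTypesTreatment": "INCLUDE_ALL_CREDITS"
  }
}
"""


def summarize_overrun(raw):
    """Return (overage, cost/budget ratio, project IDs) from a notification."""
    data = json.loads(raw)
    overage = data["costAmount"] - data["budgetAmount"]
    ratio = data["costAmount"] / data["budgetAmount"]
    projects = [p.split("/", 1)[1] for p in data["budgetFilter"]["projects"]]
    return overage, ratio, projects


def cost_by_service_query(table, project_id):
    """Build a query that totals cost per service for one project."""
    return (
        "SELECT service.description AS service_name, SUM(cost) AS total_cost\n"
        f"FROM `{table}`\n"
        f"WHERE project.id = '{project_id}'\n"
        "GROUP BY service_name\n"
        "ORDER BY total_cost DESC"
    )


overage, ratio, projects = summarize_overrun(NOTIFICATION)
print(f"Over budget by {overage:.2f} ({ratio:.0%} of budget) in {projects}")
# The table name below is a hypothetical placeholder.
print(cost_by_service_query("my-project.billing.gcp_billing_export", projects[0]))
```

Ordering by total cost puts the service driving the overrun at the top, which is the granular breakdown the question is asking for.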

Question 14 (hard, multiple choice)

A DevOps engineer is troubleshooting a Cloud Build failure. The build log shows the error: 'Permission denied for resource projects/my-project/locations/us-central1/repositories/my-repo'. The Cloud Build service account (PROJECT_NUMBER@cloudbuild.gserviceaccount.com) is used. What is the most likely missing role?

A
roles/artifactregistry.reader
Why wrong: This only allows reading, not writing.
B
roles/artifactregistry.admin
Why wrong: This is too broad and grants management permissions.
C
roles/cloudbuild.builds.builder
Why wrong: This is for Cloud Build execution, not for accessing Artifact Registry.
D
roles/artifactregistry.writer
Correct: This allows pushing artifacts to repositories.

Question 15 (medium, multiple choice)

A DevOps engineer is troubleshooting a production incident where users are getting 502 errors from a Google Cloud HTTP(S) Load Balancer. The backend service is a GKE deployment. Initial checks show the backend pods are healthy and responding. What is the most likely cause?

A
The load balancer's health check is failing on the backend instance group due to mismatch between health check port and backend port.
Correct: 502 errors indicate that the load balancer considers the backend unhealthy, even though the pods themselves are responding.
B
The backend pods are out of memory and crashing.
Why wrong: Crashing pods would be unhealthy, which contradicts the scenario.
C
The IAM permissions for the load balancer service account are misconfigured.
Why wrong: An IAM misconfiguration would cause 403 errors, not 502.
D
The backend service has been accidentally deleted by another engineer.
Why wrong: A deleted backend service would cause 'connection refused' errors, not 502.

These PCDOE practice questions are part of Courseiva's free Google Cloud certification practice question bank. Courseiva provides original exam-style PCDOE questions with detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics.

