PCDOE · topic practice

Managing service incidents practice questions

Practise Google Professional Cloud DevOps Engineer Managing service incidents practice questions — original exam-style scenarios with answer choices, explanations, and analysis of common mistakes.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations — not memorise copied exam dumps.

Reviewed byJohnson Ajibi· MSc IT Security
20 questionsDomain: Managing service incidents

What the exam tests

What to know about Managing service incidents

Managing service incidents questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Watch out for

Common Managing service incidents exam traps

  • Answering from memory before reading the full scenario.
  • Missing a constraint such as cost, availability, security, scope or command context.
  • Choosing a broad answer when the question asks for the most specific fix.
  • Ignoring why the wrong options are tempting.

Practice set

Managing service incidents questions

20 questions · select your answer, then reveal the explanation

A team uses Google Kubernetes Engine (GKE) with cluster telemetry enabled. During an incident, they notice that a deployment's pods are repeatedly crashing with Exit Code 137. The team wants to investigate the root cause. Which two Google Cloud services should they use together to correlate resource usage and logs?

A DevOps engineer receives an alert that the error budget for a critical service has been exhausted. The service runs on Compute Engine behind an HTTP(S) load balancer. The team wants to reduce the impact on users while investigating. What should the engineer do first?

A company uses Cloud Run for a stateless API service with concurrency set to 80. During a traffic spike, some requests return HTTP 500 errors and latency spikes. Cloud Monitoring shows container CPU utilization at 100% and memory usage at 70%. What is the most likely cause and the best first step?

A team uses Cloud SQL for PostgreSQL. They receive an alert that the database's CPU utilization is above 95% for the past 30 minutes. Queries are taking longer than usual. They want to investigate without causing further impact. What should they do first?

A company's SRE team is designing an incident management process. They want to ensure that alerts are actionable and that on-call engineers are not overwhelmed by false positives. Which approach should they take?

An incident is declared for a production service running on GKE. The on-call engineer suspects a recent code change may have introduced a memory leak. Which THREE actions should the engineer take to investigate and mitigate?

A service experiences increased latency and HTTP 503 errors. The engineer finds that the backend managed instance group (MIG) is at max instances and CPU utilization is 90%. Which TWO actions should the engineer take to restore the service quickly?

Refer to the exhibit. A GKE pod is repeatedly crashing with the error shown. The deployment has resource requests of 512 MiB memory and limits of 1 GiB. What is the most likely cause and the best remediation?

Exhibit

Refer to the exhibit.

```
{
  "severity": "ERROR",
  "textPayload": "Exception: java.lang.OutOfMemoryError: Java heap space\n at com.example.service.DataProcessor.process(DataProcessor.java:45)\n at com.example.service.Main.main(Main.java:20)",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "cluster_name": "prod-cluster",
      "namespace_name": "default",
      "pod_name": "data-processor-7d4f8b6c9-abcde",
      "container_name": "data-processor"
    }
  },
  "labels": {
    "k8s-pod/app": "data-processor"
  }
}
```

Refer to the exhibit. A Cloud Function (2nd gen) is timing out. The function's timeout is set to 60 seconds. The function queries a Cloud SQL database. What is the most likely cause and the best action?

Network Topology
limit 5format jsonRefer to the exhibit.```"severity": "WARNING","timestamp": "2025-03-15T14:23:00Z"},"timestamp": "2025-03-15T14:24:00Z"
Question 10hardmultiple choice
Read the full NAT/PAT explanation →

You are a Site Reliability Engineer (SRE) for an e-commerce platform running on Google Kubernetes Engine (GKE) with a microservices architecture. Your team uses Cloud Monitoring for alerting and Cloud Logging for centralized logs. Recently, during a flash sale event, you observed intermittent latency spikes in the checkout service, causing checkout failures and abandoned carts. The latency spikes last 1-2 seconds and occur roughly every 5-10 minutes during peak traffic. The checkout service runs as a Deployment with 10 replicas, each with resource requests of 500m CPU and 512Mi memory. The service has a Service Level Objective (SLO) of 99.9% of requests completing in under 1 second (p99 latency < 1s). Current p99 latency is 2.1s during peak. You reviewed the Cloud Monitoring dashboard and noticed that CPU utilization across pods is around 60%, memory around 50%, and there are no OOM kills. The logs show occasional 'connection reset by peer' errors in the checkout service logs, but no consistent pattern. You suspect the issue might be related to the database (Cloud SQL) or a downstream dependency. After checking the database, you find that query latency is normal. You also notice that the checkout service makes a synchronous HTTP call to a payment validation service that runs as a separate Deployment with 3 replicas. The payment service's p99 latency is 500ms, but its error rate is below 1%. Your task is to identify the most likely cause of the intermittent latency spikes and propose a remediation. Which action should you take first?

Your team is using Cloud Monitoring to track the health of a distributed microservices application. You notice that the error rate for the checkout service has increased significantly, but no alerts are firing. The SLO for checkout is 99.9% availability over a 28-day rolling window. You inspect the alerting policy and find it uses a time series aggregation with a 1-minute alignment period and a condition that triggers when the ratio of errors to total requests exceeds 0.001 for 5 consecutive minutes. What is the most likely reason the alert is not firing?

You are the DevOps engineer for a large e-commerce platform running on Google Kubernetes Engine (GKE). During a flash sale, you observe that the payments service is experiencing high latency and intermittent errors. The service is deployed with HorizontalPodAutoscaler (HPA) based on CPU utilization. You need to quickly diagnose and mitigate the issue. Which TWO actions should you take?

Refer to the exhibit. You are investigating a performance issue where the api-server container is using excessive CPU. You run a Cloud Monitoring API query and receive the JSON configuration shown. However, the query returns no data points. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
{
  "monitoredResource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "my-project",
      "location": "us-central1",
      "cluster_name": "prod-cluster",
      "namespace_name": "default",
      "pod_name": "api-server-7d8f9c",
      "container_name": "api-server"
    }
  },
  "interval": {
    "startTime": "2025-02-10T10:00:00Z",
    "endTime": "2025-02-10T11:00:00Z"
  },
  "aggregation": {
    "alignmentPeriod": "60s",
    "perSeriesAligner": "ALIGN_MEAN",
    "crossSeriesReducer": "REDUCE_SUM"
  },
  "filter": "metric.type="kubernetes.io/container/cpu/core_usage_time" AND resource.labels.container_name="api-server"",
  "metric": {
    "type": "kubernetes.io/container/cpu/core_usage_time"
  }
}

Order the steps to deploy a new version of a microservice to Google Kubernetes Engine using a rolling update.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps
Order
1Step 1
2Step 2
3Step 3
4Step 4
5Step 5

Arrange the steps to migrate a monolithic application to microservices on Google Kubernetes Engine.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps
Order
1Step 1
2Step 2
3Step 3
4Step 4
5Step 5

Match each CI/CD concept to its definition.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts
Matches

Automated build and test on every commit

Automated deployment to staging, manual to production

Fully automated release to production

Short-lived branches, frequent merges to main

Gradual rollout to a subset of users

Match each Google Cloud DevOps capability to its benefit.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts
Matches

Managed continuous delivery to GKE

Centralized container and package storage

Private Git repositories integrated with Cloud Build

IDE plugins for Kubernetes and Cloud Run

CLI for continuous development on Kubernetes

Your team receives an alert that the Error Reporting count for a critical service has increased tenfold in the last 10 minutes. You suspect a recent code deployment is the cause. What is the first action you should take?

You are investigating a slow increase in latency for a service running on Compute Engine. You have Cloud Monitoring and Cloud Logging set up. Which tool would best help you identify the cause of the latency?

Your team uses a canary deployment strategy on Google Kubernetes Engine (GKE). During a rollback, you notice that the rollback caused a brief period of downtime because the previous version's readiness probe was not properly configured. Which of the following best prevents this issue in the future?

Free account

Track your progress over time

Create a free account to save your results and see which topics improve across sessions.

Focused Managing service incidents sessions

Start a Managing service incidents only practice session

Every question in these sessions is drawn from the Managing service incidents domain — nothing else.

Related practice questions

Related PCDOE topic practice pages

Move into related areas when this topic feels solid.

Frequently asked questions

What does the PCDOE exam test about Managing service incidents?
Managing service incidents questions test whether you can apply the concept in context, not just recognise a definition.
How should I use these practice questions?
Select your answer before revealing the explanation. Then read why each option is right or wrong — this active recall approach builds retention far faster than re-reading notes.
Can I practise just Managing service incidents questions in a focused session?
Yes — the session launcher on this page draws every question from the Managing service incidents domain. Use a 10-question session first to gauge your baseline, then move to 20 or 30 once the weak spots are clear.
Where can I practise other PCDOE topics?
Use the topic links above to move to related areas, or go back to the PCDOE question bank to see all topics.
Are these real exam questions or dumps?
These are original practice questions written to test the same concepts the PCDOE exam covers. They are not copied from any real exam or dump site.