How many Managing service incidents questions are on the PCDOE exam?

The Managing service incidents domain is one of the weighted domains on the PCDOE exam. The Courseiva question bank has 63 practice questions for this domain.

Free PCDOE Managing service incidents Practice Questions (2026)

Q: What does the Managing service incidents domain cover on the PCDOE exam?

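The Managing service incidents domain covers this area of the PCDOE exam blueprint published by Google Cloud. The questions listed on this page touch on alert design, error budgets and burn rates, rollbacks on GKE and Cloud Run, load balancer health checks, and postmortems.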

Q: How can I practice Managing service incidents questions for PCDOE?

Click any of the 63 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Managing service incidents domain.


All PCDOE Managing service incidents questions (63)


Click any question to see the full explanation and answer options, or start a focused practice session above.

A team uses Google Kubernetes Engine (GKE) with cluster telemetry enabled. During an incident, they notice that a deployment's pods are repeatedly crashing with Exit Code 137. The team wants to investigate the root cause. Which two Google Cloud services should they use together to correlate resource usage and logs?

A DevOps engineer receives an alert that the error budget for a critical service has been exhausted. The service runs on Compute Engine behind an HTTP(S) load balancer. The team wants to reduce the impact on users while investigating. What should the engineer do first?

A company uses Cloud Run for a stateless API service with concurrency set to 80. During a traffic spike, some requests return HTTP 500 errors and latency spikes. Cloud Monitoring shows container CPU utilization at 100% and memory usage at 70%. What is the most likely cause and the best first step?

A team uses Cloud SQL for PostgreSQL. They receive an alert that the database's CPU utilization is above 95% for the past 30 minutes. Queries are taking longer than usual. They want to investigate without causing further impact. What should they do first?

A company's SRE team is designing an incident management process. They want to ensure that alerts are actionable and that on-call engineers are not overwhelmed by false positives. Which approach should they take?

An incident is declared for a production service running on GKE. The on-call engineer suspects a recent code change may have introduced a memory leak. Which THREE actions should the engineer take to investigate and mitigate?

A service experiences increased latency and HTTP 503 errors. The engineer finds that the backend managed instance group (MIG) is at max instances and CPU utilization is 90%. Which TWO actions should the engineer take to restore the service quickly?

Refer to the exhibit. A GKE pod is repeatedly crashing with the error shown. The deployment has resource requests of 512 MiB memory and limits of 1 GiB. What is the most likely cause and the best remediation?

Refer to the exhibit. A Cloud Function (2nd gen) is timing out. The function's timeout is set to 60 seconds. The function queries a Cloud SQL database. What is the most likely cause and the best action?

You are a Site Reliability Engineer (SRE) for an e-commerce platform running on Google Kubernetes Engine (GKE) with a microservices architecture. Your team uses Cloud Monitoring for alerting and Cloud Logging for centralized logs. Recently, during a flash sale event, you observed intermittent latency spikes in the checkout service, causing checkout failures and abandoned carts. The latency spikes last 1-2 seconds and occur roughly every 5-10 minutes during peak traffic. The checkout service runs as a Deployment with 10 replicas, each with resource requests of 500m CPU and 512Mi memory. The service has a Service Level Objective (SLO) of 99.9% of requests completing in under 1 second (p99 latency < 1s). Current p99 latency is 2.1s during peak. You reviewed the Cloud Monitoring dashboard and noticed that CPU utilization across pods is around 60%, memory around 50%, and there are no OOM kills. The logs show occasional 'connection reset by peer' errors in the checkout service logs, but no consistent pattern. You suspect the issue might be related to the database (Cloud SQL) or a downstream dependency. After checking the database, you find that query latency is normal. You also notice that the checkout service makes a synchronous HTTP call to a payment validation service that runs as a separate Deployment with 3 replicas. The payment service's p99 latency is 500ms, but its error rate is below 1%. Your task is to identify the most likely cause of the intermittent latency spikes and propose a remediation. Which action should you take first?

Your team is using Cloud Monitoring to track the health of a distributed microservices application. You notice that the error rate for the checkout service has increased significantly, but no alerts are firing. The SLO for checkout is 99.9% availability over a 28-day rolling window. You inspect the alerting policy and find it uses a time series aggregation with a 1-minute alignment period and a condition that triggers when the ratio of errors to total requests exceeds 0.001 for 5 consecutive minutes. What is the most likely reason the alert is not firing?

You are the DevOps engineer for a large e-commerce platform running on Google Kubernetes Engine (GKE). During a flash sale, you observe that the payments service is experiencing high latency and intermittent errors. The service is deployed with HorizontalPodAutoscaler (HPA) based on CPU utilization. You need to quickly diagnose and mitigate the issue. Which TWO actions should you take?

Refer to the exhibit. You are investigating a performance issue where the api-server container is using excessive CPU. You run a Cloud Monitoring API query and receive the JSON configuration shown. However, the query returns no data points. What is the most likely cause?

Order the steps to deploy a new version of a microservice to Google Kubernetes Engine using a rolling update.

Arrange the steps to migrate a monolithic application to microservices on Google Kubernetes Engine.

Match each CI/CD concept to its definition.

Match each Google Cloud DevOps capability to its benefit.

Your team receives an alert that the Error Reporting count for a critical service has increased tenfold in the last 10 minutes. You suspect a recent code deployment is the cause. What is the first action you should take?

You are investigating a slow increase in latency for a service running on Compute Engine. You have Cloud Monitoring and Cloud Logging set up. Which tool would best help you identify the cause of the latency?

Your team uses a canary deployment strategy on Google Kubernetes Engine (GKE). During a rollback, you notice that the rollback caused a brief period of downtime because the previous version's readiness probe was not properly configured. Which of the following best prevents this issue in the future?

Your SLO for availability is 99.9% over a 30-day window. You want an alert that fires when the error budget burn rate is high, leaving less than 5% of the error budget remaining in the next 6 hours. What type of alerting policy should you configure?

A GKE cluster node fails, causing pods to be rescheduled. However, some pods remain in 'CrashLoopBackOff' state. After examining logs, you find the application has a dependency on local SSD that was ephemeral. What is the best long-term solution?

During a post-mortem, you identify that an incident was caused by a configuration change that was not reviewed. Which of the following is the most effective preventive action?

Your incident response team uses a follow-the-sun model. An incident occurs during the Asia-Pacific shift, but the escalation path requires sign-off from the US-based team lead. This causes delays. What change should you recommend?

You are debugging a production issue where a Cloud Function occasionally throws a 'memory limit exceeded' error. You want to inspect the memory usage at the time of the error. What should you do?

Your application runs in two GCP regions. A regional outage occurs in the primary region. You have a Cloud Load Balancer with a failover backend. However, the failover did not trigger because the health check passed on a stale connection. What is the best solution?

You are responding to an incident where a new release has caused increased error rates. Which TWO actions should you take immediately?

Which THREE of the following are recommended practices for writing effective post-mortem documents?

You are designing alerting policies for a microservice architecture. Which TWO metrics are most suitable for triggering a page to the on-call engineer?

Refer to the exhibit. You see this log entry from a Cloud Run service. The stack trace shows the error occurs in handler.js at line 50. You want to see the state of variables at that point in the production environment without adding logging or redeploying. What should you do?

Refer to the exhibit. You are reviewing an alert policy for CPU utilization. What is a potential problem with this configuration?

Refer to the exhibit. Your team deployed a new revision to Cloud Run. After deployment, error rates increased. You want to roll back to the previous revision, which is still serving. Which command should you use?

A team is experiencing increased latency in their microservices application after a new deployment. They suspect a specific service is the bottleneck. Which tool should they use to identify the slowest service in the request path?

During an incident, a DevOps engineer needs to temporarily increase the capacity of a Google Kubernetes Engine (GKE) cluster to handle the traffic surge. Which approach minimizes manual intervention and follows Google best practices?

A company uses Error Budgets for their service. The SLO is 99.9% availability over a 30-day window. The service has been down for 30 minutes in the current window. What is the remaining error budget?
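The arithmetic behind this kind of question can be sketched in a few lines. This is a minimal illustration, assuming a time-based availability SLO where every minute of downtime counts minute-for-minute against the budget; the function names are hypothetical and not part of any Google Cloud API.

```python
# Error-budget arithmetic for a time-based availability SLO.
# Assumption: downtime is charged minute-for-minute against the budget.

def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed downtime, in minutes, over the whole SLO window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def remaining_budget_minutes(slo: float, window_days: int,
                             downtime_minutes: float) -> float:
    """Budget left after the given downtime; negative means the SLO is missed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


total = error_budget_minutes(0.999, 30)
left = remaining_budget_minutes(0.999, 30, 30)
print(f"total budget: {total:.1f} min, remaining: {left:.1f} min")
```

A 30-day window has 43,200 minutes, so a 99.9% target allows 43.2 minutes of downtime; 30 minutes of downtime leaves 13.2 minutes.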

A team has configured an uptime check with a 5xx threshold alert. During an incident, the alert fires with severity 'critical'. The team mitigates the issue, but the alert keeps firing for 15 more minutes due to a slow-responding downstream dependency. What should the team do to avoid false alarms in future incidents?

A DevOps engineer is troubleshooting a production incident where users are getting 502 errors from a Google Cloud HTTP(S) Load Balancer. The backend service is a GKE deployment. Initial checks show the backend pods are healthy and responding. What is the most likely cause?

After deploying a new version of a Cloud Run service, the team notices an increase in 5xx errors. They want to quickly revert to the previous version while minimizing user impact. What is the recommended approach?

A multinational company runs an application on Google Cloud with an SLO of 99.99% monthly availability. They use a multi-region deployment with Cloud Load Balancing and Cloud Spanner. During a regional outage in us-central1, traffic fails over to us-east1. However, the incident response team is not alerted because the error budget burn rate remained below the alert threshold. What should the team change to ensure timely alerting for such regional failures?

An organization has a service that must meet a 99.99% SLO. The service runs on GKE and uses Cloud SQL. The team notices that during a major incident, the error budget is consumed rapidly. They want to implement a mechanism to automatically roll back deployments that cause sustained error budget consumption above a threshold. What is the best approach?

During a post-incident review, the team discovers that a misconfiguration in Cloud Armor caused legitimate traffic to be blocked, leading to an outage. The misconfiguration was introduced by a junior engineer who had overly permissive IAM roles. What is the best way to prevent similar incidents in the future?

Which TWO of the following are best practices for managing incident response on Google Cloud?

Which THREE of the following are valid techniques for mitigating a denial-of-service (DoS) attack against a Google Cloud HTTP(S) Load Balancer?

Which TWO of the following are essential elements of a comprehensive incident post-mortem document according to Google's Site Reliability Engineering (SRE) best practices?

A DevOps engineer notices that a critical service is down, but no alert has been received. The engineer checks Cloud Monitoring and sees that the alerting policy appears to be correctly configured. What is the most likely cause?

After a recent deployment, the mean latency of a user-facing service increased from 200ms to 500ms. The engineer uses Cloud Trace to analyze traces. Which trace characteristic should the engineer focus on to identify the bottleneck?

A team defines an SLO of 99.9% availability over a 30-day window. They use a multi-window, multi-burn-rate alerting approach. Which alerting condition should trigger a page based on fast burn rate?
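As background for burn-rate questions, the relationship between error ratio, burn rate, and budget consumed can be sketched as follows. This is a simplified model rather than Cloud Monitoring's implementation: the function names are hypothetical, and the 14.4 threshold with a 1-hour window is the commonly cited example from the Google SRE Workbook, used here as an assumption.

```python
# Burn rate: how fast the error budget is being spent relative to a pace
# that would use exactly 100% of it over the full SLO period.

def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo)


def budget_consumed(rate: float, window_hours: float,
                    slo_period_hours: float) -> float:
    """Fraction of the total error budget spent during the window."""
    return rate * window_hours / slo_period_hours


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float, threshold: float) -> bool:
    """Multi-window check: page only if both windows exceed the threshold."""
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)


# A burn rate of 14.4 sustained for 1 hour of a 30-day (720-hour) period
# spends about 2% of the budget.
print(f"{budget_consumed(14.4, 1, 720):.2%}")
print(should_page(0.02, 0.02, 0.999, 14.4))
```

Requiring both a long and a short window to exceed the threshold keeps the alert from paging on an error burst that has already stopped.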

During a canary deployment of a new version of a microservice, the engineer notices increased error rates in the canary instances. What is the best immediate action?

An engineer receives an alert that a service's error rate has exceeded the threshold. To investigate, which log-based metric should the engineer query in Cloud Logging to identify the root cause?

In Google's incident management process, which role is responsible for communication with stakeholders and users during an incident?

An engineer wants to ensure that an alert is escalated if not acknowledged within 5 minutes. Which feature of Cloud Monitoring can achieve this?

Which TWO practices help reduce Mean Time to Resolve (MTTR) for production incidents?

Which THREE steps are typically part of a formal incident postmortem according to Google SRE best practices?

Which TWO tools should be used for real-time incident collaboration and communication?

Refer to the exhibit. If the error rate spikes to 2% for only 2 minutes, why does the alert not fire?

You are the DevOps engineer for a social media platform. After a recent code rollout, you receive multiple user complaints about failed logins. The service logs show a sharp increase in 5xx errors from the authentication service. However, the existing alerting policy for the authentication service did not fire. The policy is configured to trigger if the error rate exceeds 5% for 5 minutes. Upon checking Cloud Monitoring, you see that the error rate spiked to 15% for 3 minutes, then dropped back to normal. What is the most likely reason the alert did not fire?
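The interaction between a threshold and a duration window in a scenario like this can be illustrated with a small simulation. It is a deliberately simplified model, assuming one error-rate sample per minute and a condition that must hold for every sample in the duration window; real Cloud Monitoring policies also involve alignment and aggregation settings that are not modeled here, and `alert_fires` is a hypothetical helper.

```python
from typing import Sequence


def alert_fires(samples: Sequence[float], threshold: float,
                duration_samples: int) -> bool:
    """Return True if the value stays above the threshold for at least
    `duration_samples` consecutive samples."""
    consecutive = 0
    for value in samples:
        # Any sample at or below the threshold resets the streak.
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive >= duration_samples:
            return True
    return False


# One sample per minute: a 3-minute spike to 15% surrounded by 1% baseline.
per_minute_error_rate = [0.01, 0.01, 0.15, 0.15, 0.15, 0.01, 0.01]

print(alert_fires(per_minute_error_rate, threshold=0.05, duration_samples=5))
print(alert_fires(per_minute_error_rate, threshold=0.05, duration_samples=2))
```

With a 5-minute duration the 3-minute spike never satisfies the condition, while a 2-minute duration would catch it at the cost of more sensitivity to short blips.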

Your company runs an e-commerce application on Google Kubernetes Engine (GKE) with a microservice architecture. During a Black Friday sale, the orders service experiences a sudden increase in latency and errors. You notice that the database connection pool in the orders service is exhausted, leading to timeouts. The service is written in Java and uses HikariCP connection pool. You need to mitigate the incident quickly. Which action should you take first?

You manage a production environment with a web service deployed on Compute Engine instances behind an HTTP(S) Load Balancer. The service has a health check configured on the load balancer, probing a health endpoint every 10 seconds. After a recent configuration change, you observe that all instances are marked as unhealthy and traffic is failing. The health check response is 200 OK from the instances, but the load balancer still marks them unhealthy. The health check configuration: protocol: HTTP, port: 80, request path: /health, interval: 10s, timeout: 5s, unhealthy threshold: 2. The instances are running a custom web server. What is the most likely cause?

You are the SRE for a financial services application running on Google Cloud. Users report that certain transactions are taking over 10 seconds, while most complete in under 200ms. You use Cloud Profiler and Cloud Trace. Upon reviewing the profiler data, you see a hotspot in a method that calls a Cloud SQL database with a slow query. You identify the query and create an index to speed it up. However, you cannot deploy the index change immediately due to change management processes. The incident response team needs to mitigate the impact now. Which temporary measure should you take?

Your company runs a microservices application on a private GKE cluster with Workload Identity enabled. Services communicate via gRPC and HTTP. After a recent update to the payment service, users report intermittent 503 errors and 2-second latency spikes during peak hours (10 AM - 12 PM). Cloud Monitoring shows the payment service's CPU utilization averages 60%, but memory spikes to 90% during errors. The existing alert on HTTP 503 responses fires only after 5 consecutive errors over 5 minutes, but the errors are sporadic. You need to diagnose and resolve the issue. What should you do?

You are an on-call engineer responding to a critical service incident affecting a production application. According to Google's Incident Management best practices, which TWO actions should you take immediately after declaring the incident?

Based on the log entry, what is the most likely cause of the 404 error?

Your company runs a microservices application on Google Kubernetes Engine (GKE) with shared Istio service mesh across multiple namespaces. You use Cloud Monitoring and Cloud Logging for observability. At 10:30 AM, you receive an alert that the checkout service is returning high 5xx errors (over 20%) and latency is above 5 seconds. The incident response team is assembled, and you are the incident commander. The team suspects a recent deployment (v2.1) to the checkout service at 10:00 AM. The deployment was a minor configuration update. The team is divided: some want to immediately roll back, others want to analyze traces. You have access to the GCP console. What should you do first to ensure a swift and effective incident response?


Other PCDOE exam domains

Bootstrapping a Google Cloud organization for DevOps

Managing Google Cloud costs

Building and implementing CI/CD pipelines

Implementing service monitoring strategies

Optimizing service performance

Frequently asked questions

What does the Managing service incidents domain cover on the PCDOE exam?

The Managing service incidents domain covers the key concepts tested in this area of the PCDOE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PCDOE domains — no account required.

How many Managing service incidents questions are in the PCDOE question bank?

The Courseiva PCDOE question bank contains 63 questions in the Managing service incidents domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Managing service incidents for PCDOE?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Managing service incidents questions for PCDOE?

Yes — the session launcher on this page draws questions exclusively from the Managing service incidents domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
