How should I use these Managing service incidents practice questions?

Read each scenario carefully and choose your answer before revealing the explanation. Then check why your choice was right or wrong. Repeat until the reasoning feels automatic.

Can I practise just Managing service incidents questions in a focused session?

Yes — use the session launcher on this page to start a 10-, 20-, 30- or 50-question session drawn entirely from the Managing service incidents domain.

PCDOE · topic practice

Managing service incidents practice questions

Practise Managing service incidents questions for the Google Professional Cloud DevOps Engineer exam: original exam-style scenarios with answer choices, explanations, and analysis of common mistakes.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations — not memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security

20 questions · Domain: Managing service incidents

What the exam tests

What to know about Managing service incidents

Managing service incidents questions test whether you can apply the concept in context, not just recognise a definition.

▸How the topic appears in realistic exam-style scenarios.
▸Which detail in the question changes the correct answer.
▸How to eliminate plausible but wrong options.
▸How to connect the question back to the wider exam objective.

Watch out for

Common Managing service incidents exam traps

▸Answering from memory before reading the full scenario.
▸Missing a constraint such as cost, availability, security, scope or command context.
▸Choosing a broad answer when the question asks for the most specific fix.
▸Ignoring why the wrong options are tempting.

Practice set

Managing service incidents questions

20 questions · select your answer, then reveal the explanation

Question 1 · medium · multi select

A team uses Google Kubernetes Engine (GKE) with cluster telemetry enabled. During an incident, they notice that a deployment's pods are repeatedly crashing with Exit Code 137. The team wants to investigate the root cause. Which two Google Cloud services should they use together to correlate resource usage and logs?

Trap 1: Security Command Center and Cloud Logging

Security Command Center is for vulnerabilities, not incident root cause.

Trap 2: Cloud Error Reporting and Cloud Logging

Error Reporting does not show resource metrics.

A
Cloud Monitoring and Cloud Logging
Monitoring shows resource usage; Logging shows container logs and OOM events.
B
Security Command Center and Cloud Logging
Why wrong: Security Command Center is for vulnerabilities, not incident root cause.
C
Cloud Trace and Cloud Monitoring
Why wrong: Trace is for request latency, not resource usage or crash logs.
D
Cloud Error Reporting and Cloud Logging
Why wrong: Error Reporting does not show resource metrics.
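
Exit code 137 follows the common convention of 128 plus a signal number, so it corresponds to signal 9 (SIGKILL), the signal the kernel's OOM killer sends. A minimal Python sketch of that decoding, assuming a Linux host; the helper name `describe_exit_code` is hypothetical:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable reason."""
    if code > 128:
        try:
            # Codes above 128 conventionally mean "terminated by signal (code - 128)".
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"terminated by unknown signal {code - 128}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

A SIGKILL on its own does not prove an OOM kill, which is why the resource metrics and the logs are read together.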

Question 2 · easy · multiple choice

A DevOps engineer receives an alert that the error budget for a critical service has been exhausted. The service runs on Compute Engine behind an HTTP(S) load balancer. The team wants to reduce the impact on users while investigating. What should the engineer do first?

Trap 1: Begin a detailed postmortem analysis

Postmortem should happen after mitigation, not before.

Trap 2: Disable the alerting policy to reduce noise

Ignoring alerts does not resolve the incident.

Trap 3: Increase the number of instances in the managed instance group

Scaling out may not address the root cause and could increase costs.

A
Roll back the most recent deployment
Rolling back quickly restores the previous stable version.
B
Begin a detailed postmortem analysis
Why wrong: Postmortem should happen after mitigation, not before.
C
Disable the alerting policy to reduce noise
Why wrong: Ignoring alerts does not resolve the incident.
D
Increase the number of instances in the managed instance group
Why wrong: Scaling out may not address the root cause and could increase costs.

Question 3 · hard · multiple choice

A company uses Cloud Run for a stateless API service with concurrency set to 80. During a traffic spike, some requests return HTTP 500 errors and latency spikes. Cloud Monitoring shows container CPU utilization at 100% and memory usage at 70%. What is the most likely cause and the best first step?

Trap 1: Maximum instances limit is too low; increase from 10 to 100

Cloud Run scales automatically; max instances limit prevents unbounded scaling but 500s indicate per-instance overload.

Trap 2: Min idle instances is too low; set min idle to 5 to reduce cold…

Cold starts cause latency but not 500s; CPU is already 100%.

Trap 3: Memory limit is too low; increase memory from 256 MiB to 512 MiB

Memory usage is only 70%, so memory is not the bottleneck.

A
Concurrency per container is too high; reduce concurrency to 10
Lowering concurrency reduces CPU contention, preventing timeouts and 500s.
B
Maximum instances limit is too low; increase from 10 to 100
Why wrong: Cloud Run scales automatically; max instances limit prevents unbounded scaling but 500s indicate per-instance overload.
C
Min idle instances is too low; set min idle to 5 to reduce cold starts
Why wrong: Cold starts cause latency but not 500s; CPU is already 100%.
D
Memory limit is too low; increase memory from 256 MiB to 512 MiB
Why wrong: Memory usage is only 70%, so memory is not the bottleneck.
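
As a rough model of why lowering concurrency helps, the CPU available to each in-flight request is the instance's vCPU count divided by the number of concurrent requests. The sketch below assumes a 1 vCPU instance and CPU-bound requests that share the CPU evenly, neither of which is stated in the question; `cpu_share_per_request` is a hypothetical helper.

```python
def cpu_share_per_request(vcpus: float, concurrency: int) -> float:
    """Approximate vCPU share per request when an instance is fully loaded."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return vcpus / concurrency


# Assumed 1 vCPU instance; compare the two concurrency settings.
for concurrency in (80, 10):
    share = cpu_share_per_request(1.0, concurrency)
    print(f"concurrency={concurrency}: {share:.4f} vCPU per request")
```

With a lower concurrency, Cloud Run has to start more instances for the same traffic, so the maximum instances setting still needs enough headroom.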

Question 4 · easy · multiple choice

A team uses Cloud SQL for PostgreSQL. They receive an alert that the database's CPU utilization is above 95% for the past 30 minutes. Queries are taking longer than usual. They want to investigate without causing further impact. What should they do first?

Trap 1: Increase the number of vCPUs of the Cloud SQL instance

Scaling up is a mitigation but should be done after understanding the cause.

Trap 2: Restart the Cloud SQL instance to clear the cache

Restarting causes downtime and does not fix the root cause.

Trap 3: Migrate the database to Cloud Spanner

Migration is a long-term project, not an immediate investigation step.

A
Increase the number of vCPUs of the Cloud SQL instance
Why wrong: Scaling up is a mitigation but should be done after understanding the cause.
B
Restart the Cloud SQL instance to clear the cache
Why wrong: Restarting causes downtime and does not fix the root cause.
C
Migrate the database to Cloud Spanner
Why wrong: Migration is a long-term project, not an immediate investigation step.
D
Use Cloud SQL Query Insights to find the most time-consuming queries
Query Insights shows top queries by CPU and latency.

Question 5 · medium · multiple choice

A company's SRE team is designing an incident management process. They want to ensure that alerts are actionable and that on-call engineers are not overwhelmed by false positives. Which approach should they take?

Trap 1: Use only critical severity alerts and rely on manual dashboard…

Manual review is inefficient and may miss issues.

Trap 2: Create alerting policies for every available metric to ensure…

This would cause alert fatigue with many false positives.

Trap 3: Set all alert thresholds to 50% above the average value to avoid…

This may miss real issues and does not consider SLOs.

A
Use only critical severity alerts and rely on manual dashboard review for lower severity
Why wrong: Manual review is inefficient and may miss issues.
B
Create alerting policies for every available metric to ensure nothing is missed
Why wrong: This would cause alert fatigue with many false positives.
C
Set all alert thresholds to 50% above the average value to avoid false positives
Why wrong: This may miss real issues and does not consider SLOs.
D
Define SLOs and set alert thresholds based on historical error budget consumption
SLO-based alerting focuses on user-facing impact and reduces noise.
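
The arithmetic behind SLO-based alerting can be sketched in a few lines: the error budget is one minus the SLO, and the burn rate is the observed error ratio divided by that budget. The 99.9% target, 28-day window, and 1% error ratio below are illustrative assumptions, and both function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of total outage the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1 - slo)


print(round(error_budget_minutes(0.999, 28), 2))
print(round(burn_rate(0.01, 0.999), 2))
```

A burn rate of 1 spends the budget exactly over the window; alerting on a sustained burn rate well above 1 ties pages to user-facing impact rather than to raw metric values.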

Question 6 · hard · multi select

An incident is declared for a production service running on GKE. The on-call engineer suspects a recent code change may have introduced a memory leak. Which THREE actions should the engineer take to investigate and mitigate?

Trap 1: Scale down the number of replicas to reduce memory pressure

Scaling down reduces total memory but each container still leaks, causing crashes.

Trap 2: Roll back the deployment immediately without further investigation

Rollback is mitigation, but the question asks for investigation and mitigation steps.

A
Increase the memory limit for the container as a temporary mitigation
Temporary increase buys time for a permanent fix.
B
Scale down the number of replicas to reduce memory pressure
Why wrong: Scaling down reduces total memory but each container still leaks, causing crashes.
C
Roll back the deployment immediately without further investigation
Why wrong: Rollback is mitigation, but the question asks for investigation and mitigation steps.
D
Check container logs for Out of Memory (OOM) killed messages
OOM messages confirm memory exhaustion.
E
Compare memory usage metrics before and after the deployment using Cloud Monitoring
Identifies if memory usage increased after the change.
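
Comparing memory before and after a deployment can be reduced to comparing trends: a flat series suggests steady state, while a steadily rising one is consistent with a leak. A minimal sketch using a least-squares slope over evenly spaced samples; the sample values (MiB per minute) are made up and `trend_slope` is a hypothetical helper.

```python
def trend_slope(samples: list[float]) -> float:
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical memory usage in MiB, one sample per minute.
before_deploy = [300, 302, 299, 301, 300]
after_deploy = [300, 320, 340, 360, 380]

print(trend_slope(before_deploy))
print(trend_slope(after_deploy))
```

A rising slope only shows growth; the OOM messages in the container logs are still needed to confirm that the growth ends in a kill.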

Question 7 · medium · multi select

A service experiences increased latency and HTTP 503 errors. The engineer finds that the backend managed instance group (MIG) is at max instances and CPU utilization is 90%. Which TWO actions should the engineer take to restore the service quickly?

Trap 1: Increase the autoscaling target CPU utilization to 95%

Higher target would delay scaling, worsening the issue.

Trap 2: Reduce the autoscaling target CPU utilization to 50%

Lower target would trigger more instances but the MIG is already at max; need to increase max instances first.

Trap 3: Reduce the number of instances to avoid resource contention

Reducing instances would increase load on remaining instances.

A
Enable autoscaling based on HTTP load balancing utilization
Scales based on request rate, which is more responsive than CPU.
B
Increase the autoscaling target CPU utilization to 95%
Why wrong: Higher target would delay scaling, worsening the issue.
C
Increase the maximum number of instances in the MIG
Allows the MIG to scale out further to handle load.
D
Reduce the autoscaling target CPU utilization to 50%
Why wrong: Lower target would trigger more instances but the MIG is already at max; need to increase max instances first.
E
Reduce the number of instances to avoid resource contention
Why wrong: Reducing instances would increase load on remaining instances.
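
A simplified model of target-based autoscaling shows why the maximum matters more than the target here: the wanted size is the current size scaled by observed over target utilization, then capped at the maximum. This is an approximation for illustration, not the exact Compute Engine autoscaler algorithm; the 60% starting target, the instance counts, and the function name are assumptions.

```python
import math


def recommended_size(current: int, utilization_pct: int, target_pct: int, max_size: int) -> int:
    """Scale the group so utilization approaches the target, capped at max_size."""
    wanted = math.ceil(current * utilization_pct / target_pct)
    return min(wanted, max_size)


# Group already at its assumed maximum of 10 instances, CPU at 90%.
print(recommended_size(10, 90, 60, max_size=10))  # target 60%, capped
print(recommended_size(10, 90, 50, max_size=10))  # lower target, still capped
print(recommended_size(10, 90, 60, max_size=20))  # raised maximum
```

Lowering the target changes the wanted size but not the capped result, whereas raising the maximum lets the group actually grow.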

Question 8 · hard · multiple choice

Refer to the exhibit. A GKE pod is repeatedly crashing with the error shown. The deployment has resource requests of 512 MiB memory and limits of 1 GiB. What is the most likely cause and the best remediation?

Exhibit

```
{
  "severity": "ERROR",
  "textPayload": "Exception: java.lang.OutOfMemoryError: Java heap space\n at com.example.service.DataProcessor.process(DataProcessor.java:45)\n at com.example.service.Main.main(Main.java:20)",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "cluster_name": "prod-cluster",
      "namespace_name": "default",
      "pod_name": "data-processor-7d4f8b6c9-abcde",
      "container_name": "data-processor"
    }
  },
  "labels": {
    "k8s-pod/app": "data-processor"
  }
}
```

Trap 1: The node is under memory pressure; add more nodes to the cluster

The error is Java heap OOM, not node-level memory pressure.

Trap 2: The container needs more CPU; increase CPU request and limit

CPU is not related to heap space errors.

Trap 3: The application has a memory leak; refactor the DataProcessor class

No evidence of a leak; stack trace shows a simple allocation failure.

A
The Java heap size exceeds the container memory limit; reduce the JVM heap size or increase the container memory limit
JVM heap must fit within the container limit to avoid OOM.
B
The node is under memory pressure; add more nodes to the cluster
Why wrong: The error is Java heap OOM, not node-level memory pressure.
C
The container needs more CPU; increase CPU request and limit
Why wrong: CPU is not related to heap space errors.
D
The application has a memory leak; refactor the DataProcessor class
Why wrong: No evidence of a leak; stack trace shows a simple allocation failure.
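
When triaging entries like the one in the exhibit, it can help to pull out the exception headline and the top stack frame programmatically. A minimal sketch, assuming the entry has already been parsed into a dictionary shaped like the exhibit; `summarize_entry` is a hypothetical helper and only the fields it reads are reproduced.

```python
entry = {
    "severity": "ERROR",
    "textPayload": (
        "Exception: java.lang.OutOfMemoryError: Java heap space\n"
        " at com.example.service.DataProcessor.process(DataProcessor.java:45)\n"
        " at com.example.service.Main.main(Main.java:20)"
    ),
    "resource": {
        "type": "k8s_container",
        "labels": {"container_name": "data-processor"},
    },
}


def summarize_entry(log_entry: dict) -> dict:
    """Extract the container name, error headline, and top stack frame."""
    lines = log_entry["textPayload"].splitlines()
    frames = [line.strip() for line in lines[1:] if line.strip().startswith("at ")]
    return {
        "container": log_entry["resource"]["labels"]["container_name"],
        "is_heap_space_error": "java.lang.OutOfMemoryError: Java heap space" in lines[0],
        "top_frame": frames[0] if frames else None,
    }


print(summarize_entry(entry))
```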

Question 9 · medium · multiple choice

Refer to the exhibit. A Cloud Function (2nd gen) is timing out. The function's timeout is set to 60 seconds. The function queries a Cloud SQL database. What is the most likely cause and the best action?

Exhibit: Network Topology

Trap 1: Reduce the function's allocated memory to decrease cold start time

Cold start is not related to timeout; memory reduction may slow processing.

Trap 2: Increase the function timeout to 120 seconds

The function is not exceeding 60s; increasing timeout won't fix the slow query.

Trap 3: Increase the Cloud SQL max connections setting

No evidence of connection exhaustion; the issue is query latency.

A
Reduce the function's allocated memory to decrease cold start time
Why wrong: Cold start is not related to timeout; memory reduction may slow processing.
B
Add indexes to the database tables queried by the function
Slow queries often indicate missing indexes; adding them reduces query time.
C
Increase the function timeout to 120 seconds
Why wrong: The function is not exceeding 60s; increasing timeout won't fix the slow query.
D
Increase the Cloud SQL max connections setting
Why wrong: No evidence of connection exhaustion; the issue is query latency.

Question 10 · hard · multiple choice

You are a Site Reliability Engineer (SRE) for an e-commerce platform running on Google Kubernetes Engine (GKE) with a microservices architecture. Your team uses Cloud Monitoring for alerting and Cloud Logging for centralized logs. Recently, during a flash sale event, you observed intermittent latency spikes in the checkout service, causing checkout failures and abandoned carts. The latency spikes last 1-2 seconds and occur roughly every 5-10 minutes during peak traffic. The checkout service runs as a Deployment with 10 replicas, each with resource requests of 500m CPU and 512Mi memory. The service has a Service Level Objective (SLO) of 99.9% of requests completing in under 1 second (p99 latency < 1s). Current p99 latency is 2.1s during peak. You reviewed the Cloud Monitoring dashboard and noticed that CPU utilization across pods is around 60%, memory around 50%, and there are no OOM kills. The logs show occasional 'connection reset by peer' errors in the checkout service logs, but no consistent pattern. You suspect the issue might be related to the database (Cloud SQL) or a downstream dependency. After checking the database, you find that query latency is normal. You also notice that the checkout service makes a synchronous HTTP call to a payment validation service that runs as a separate Deployment with 3 replicas. The payment service's p99 latency is 500ms, but its error rate is below 1%. Your task is to identify the most likely cause of the intermittent latency spikes and propose a remediation. Which action should you take first?

Trap 1: Increase the number of replicas of the payment validation service…

The payment service has low error rate and latency within normal range; increasing replicas may not resolve the issue and is not the first step.

Trap 2: Enable connection pooling and retries with exponential backoff in…

This can improve resilience but does not address the root cause; it may mask the problem.

Trap 3: Investigate the checkout service pod restarts due to liveness probe…

While 'connection reset by peer' can indicate pod restarts, the latency pattern is more consistent with GC pauses; pod restarts would likely cause more frequent errors and a different latency pattern.

A
Increase the number of replicas of the payment validation service to 10 to handle peak load.
Why wrong: The payment service has low error rate and latency within normal range; increasing replicas may not resolve the issue and is not the first step.
B
Check the garbage collection logs of the checkout service pods to identify if long GC pauses coincide with the latency spikes.
Periodic latency spikes are a classic symptom of JVM garbage collection. Checking GC logs will help confirm if this is the cause.
C
Enable connection pooling and retries with exponential backoff in the checkout service for the HTTP call to the payment service.
Why wrong: This can improve resilience but does not address the root cause; it may mask the problem.
D
Investigate the checkout service pod restarts due to liveness probe failures, as 'connection reset by peer' indicates pod instability.
Why wrong: While 'connection reset by peer' can indicate pod restarts, the latency pattern is more consistent with GC pauses; pod restarts would likely cause more frequent errors and a different latency pattern.
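
One way to test the GC hypothesis is to check whether latency spike timestamps fall close to GC pause timestamps. A minimal sketch with made-up timestamps (seconds since the start of the sale); the function name, the 2-second tolerance, and the data are all hypothetical.

```python
def spikes_near_gc(spike_times: list[float], gc_times: list[float], tolerance_s: float = 2.0) -> list[float]:
    """Return the spikes that occur within tolerance_s seconds of any GC pause."""
    return [
        spike
        for spike in spike_times
        if any(abs(spike - pause) <= tolerance_s for pause in gc_times)
    ]


# Hypothetical timestamps extracted from latency metrics and GC logs.
latency_spikes = [300.0, 720.0, 1250.0]
gc_pauses = [299.5, 721.0, 1800.0]

matched = spikes_near_gc(latency_spikes, gc_pauses)
print(f"{len(matched)} of {len(latency_spikes)} spikes coincide with a GC pause")
```

If most spikes line up with pauses, GC tuning is the next step; if few do, the downstream call or connection handling deserves a closer look.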

Question 11 · medium · multiple choice

Your team is using Cloud Monitoring to track the health of a distributed microservices application. You notice that the error rate for the checkout service has increased significantly, but no alerts are firing. The SLO for checkout is 99.9% availability over a 28-day rolling window. You inspect the alerting policy and find it uses a time series aggregation with a 1-minute alignment period and a condition that triggers when the ratio of errors to total requests exceeds 0.001 for 5 consecutive minutes. What is the most likely reason the alert is not firing?

Trap 1: The error budget has been exhausted, so the alert is suppressed.

Error budget exhaustion does not suppress alerts; it might indicate the SLO is being breached, but alerts should still fire.

Trap 2: The SLO window is too long, and the alert condition uses a…

The SLO window (28 days) is separate from the alert condition's alignment period; the alert evaluates the ratio over 1-minute windows, not the SLO window.

Trap 3: The ratio threshold is too high because the total request count is…

The threshold is a ratio, so it is independent of absolute counts; low request counts could cause high ratio but still trigger if sustained.

A
The alert condition requires 5 consecutive minutes of breach, but the error rate spikes are intermittent and not sustained.
The alert requires 5 consecutive minutes of the ratio exceeding 0.001; intermittent spikes may not meet this condition.
B
The error budget has been exhausted, so the alert is suppressed.
Why wrong: Error budget exhaustion does not suppress alerts; it might indicate the SLO is being breached, but alerts should still fire.
C
The SLO window is too long, and the alert condition uses a different measurement period.
Why wrong: The SLO window (28 days) is separate from the alert condition's alignment period; the alert evaluates the ratio over 1-minute windows, not the SLO window.
D
The ratio threshold is too high because the total request count is low.
Why wrong: The threshold is a ratio, so it is independent of absolute counts; low request counts could cause high ratio but still trigger if sustained.
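
The effect of a duration requirement can be modelled with a simple run counter: the condition fires only when the ratio stays above the threshold for the required number of consecutive one-minute points. This is a simplified sketch of that logic, not Cloud Monitoring's implementation; `alert_fires` and the sample series are hypothetical.

```python
def alert_fires(ratios: list[float], threshold: float = 0.001, duration: int = 5) -> bool:
    """True if `duration` consecutive points exceed the threshold."""
    run = 0
    for ratio in ratios:
        run = run + 1 if ratio > threshold else 0
        if run >= duration:
            return True
    return False


# One value per minute: errors divided by total requests.
intermittent = [0.01, 0.01, 0.0, 0.01, 0.01, 0.01, 0.0005, 0.01]
sustained = [0.002, 0.002, 0.002, 0.002, 0.002]

print(alert_fires(intermittent))
print(alert_fires(sustained))
```

The intermittent series has a much higher error ratio overall, yet it never fires because each dip below the threshold resets the run.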

Question 12 · hard · multi select

You are the DevOps engineer for a large e-commerce platform running on Google Kubernetes Engine (GKE). During a flash sale, you observe that the payments service is experiencing high latency and intermittent errors. The service is deployed with HorizontalPodAutoscaler (HPA) based on CPU utilization. You need to quickly diagnose and mitigate the issue. Which TWO actions should you take?

Trap 1: Check the GKE node's network performance using VPC Flow Logs and…

The issue is with the payments service, not the underlying network or node capacity.

Trap 2: Modify the HPA to use memory utilization instead of CPU, as memory…

Changing the metric without understanding the bottleneck may not help; latency issues are often not due to memory.

Trap 3: Configure a custom metric in Cloud Monitoring for the payments…

Custom metrics require the service to expose them; this would take time to implement and won't help immediately.

A
Use Cloud Monitoring to examine the payments service's request latency and error rate metrics, and create a custom dashboard for real-time monitoring.
Cloud Monitoring provides latency and error metrics via Istio or GKE metrics; a custom dashboard helps visualize the issue.
B
Check the GKE node's network performance using VPC Flow Logs and increase the node pool size.
Why wrong: The issue is with the payments service, not the underlying network or node capacity.
C
Modify the HPA to use memory utilization instead of CPU, as memory is more indicative of the service's performance.
Why wrong: Changing the metric without understanding the bottleneck may not help; latency issues are often not due to memory.
D
Configure a custom metric in Cloud Monitoring for the payments service's request queue depth and use it for HPA.
Why wrong: Custom metrics require the service to expose them; this would take time to implement and won't help immediately.
E
Manually scale up the payments service deployment to more replicas to handle the increased load.
Manual scaling provides immediate additional capacity while HPA responds to the load.

Question 13 · hard · multiple choice

Refer to the exhibit. You are investigating a performance issue where the api-server container is using excessive CPU. You run a Cloud Monitoring API query and receive the JSON configuration shown. However, the query returns no data points. What is the most likely cause?

Exhibit

```
{
  "monitoredResource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "my-project",
      "location": "us-central1",
      "cluster_name": "prod-cluster",
      "namespace_name": "default",
      "pod_name": "api-server-7d8f9c",
      "container_name": "api-server"
    }
  },
  "interval": {
    "startTime": "2025-02-10T10:00:00Z",
    "endTime": "2025-02-10T11:00:00Z"
  },
  "aggregation": {
    "alignmentPeriod": "60s",
    "perSeriesAligner": "ALIGN_MEAN",
    "crossSeriesReducer": "REDUCE_SUM"
  },
  "filter": "metric.type=\"kubernetes.io/container/cpu/core_usage_time\" AND resource.labels.container_name=\"api-server\"",
  "metric": {
    "type": "kubernetes.io/container/cpu/core_usage_time"
  }
}
```

Trap 1: The time interval specified is too short and falls outside the data…

The interval is one hour, which is well within the retention period (typically 6 weeks for Cloud Monitoring).

Trap 2: The metric type 'kubernetes.io/container/cpu/core_usage_time' is…

The metric type is valid and commonly used for GKE container CPU usage.

Trap 3: The aggregation perSeriesAligner 'ALIGN_MEAN' is incompatible with…

'ALIGN_MEAN' is compatible with cumulative metrics when used correctly; the issue is not the aligner.

A
The time interval specified is too short and falls outside the data retention period.
Why wrong: The interval is one hour, which is well within the retention period (typically 6 weeks for Cloud Monitoring).
B
The metric type 'kubernetes.io/container/cpu/core_usage_time' is deprecated and no longer available.
Why wrong: The metric type is valid and commonly used for GKE container CPU usage.
C
The filter is missing required resource labels such as project_id, location, cluster_name, and namespace_name, causing no time series to match.
Resource labels must be fully specified in the filter to match the specific container; otherwise the query may not return data.
D
The aggregation perSeriesAligner 'ALIGN_MEAN' is incompatible with the metric type, which requires 'ALIGN_RATE'.
Why wrong: 'ALIGN_MEAN' is compatible with cumulative metrics when used correctly; the issue is not the aligner.

Question 14 · medium · drag order

Order the steps to deploy a new version of a microservice to Google Kubernetes Engine using a rolling update.

Question 15 · medium · drag order

Arrange the steps to migrate a monolithic application to microservices on Google Kubernetes Engine.

Question 16 · medium · matching

Match each CI/CD concept to its definition.

Matches

Automated build and test on every commit

Automated deployment to staging, manual to production

Fully automated release to production

Short-lived branches, frequent merges to main

Gradual rollout to a subset of users

Question 17 · medium · matching

Match each Google Cloud DevOps capability to its benefit.

Matches

Managed continuous delivery to GKE

Centralized container and package storage

Private Git repositories integrated with Cloud Build

IDE plugins for Kubernetes and Cloud Run

CLI for continuous development on Kubernetes

Question 18 · easy · multiple choice

Your team receives an alert that the Error Reporting count for a critical service has increased tenfold in the last 10 minutes. You suspect a recent code deployment is the cause. What is the first action you should take?

Trap 1: Disable the alert to reduce noise.

Disabling alerts ignores the problem and delays response.

Trap 2: Increase the instance count to handle the load.

Scaling up may help but does not address the root cause (buggy code).

Trap 3: Open a post-mortem to document the incident.

Post-mortem is important but not the first action; containment comes first.

A
Disable the alert to reduce noise.
Why wrong: Disabling alerts ignores the problem and delays response.
B
Roll back the deployment to the previous version.
Rolling back quickly mitigates user impact.
C
Increase the instance count to handle the load.
Why wrong: Scaling up may help but does not address the root cause (buggy code).
D
Open a post-mortem to document the incident.
Why wrong: Post-mortem is important but not the first action; containment comes first.

Question 19 · easy · multiple choice

You are investigating a slow increase in latency for a service running on Compute Engine. You have Cloud Monitoring and Cloud Logging set up. Which tool would best help you identify the cause of the latency?

Trap 1: Error Reporting

Error Reporting aggregates errors, not latency.

Trap 2: Cloud Profiler

Profiler is for CPU/memory profiling, not latency tracing.

Trap 3: Cloud Debugger

Debugger inspects live code state but does not aggregate latency data.

A
Error Reporting
Why wrong: Error Reporting aggregates errors, not latency.
B
Cloud Profiler
Why wrong: Profiler is for CPU/memory profiling, not latency tracing.
C
Cloud Debugger
Why wrong: Debugger inspects live code state but does not aggregate latency data.
D
Cloud Trace
Cloud Trace traces requests and identifies latency contributors.

Question 20 · hard · multiple choice

Your team uses a canary deployment strategy on Google Kubernetes Engine (GKE). During a rollback, you notice that the rollback caused a brief period of downtime because the previous version's readiness probe was not properly configured. Which of the following best prevents this issue in the future?

Trap 1: Perform a gradual rollback with a managed instance group.

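Managed instance groups manage Compute Engine VMs (including GKE nodes), not Kubernetes workload rollouts, so this does not apply to a Deployment rollback.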

Trap 2: Use a blue/green deployment instead.

Blue/green also relies on probes; the same issue could occur.

Trap 3: Use a Kubernetes Job to run a post-deployment validation.

Post-deployment validation is good but does not prevent the initial issue during rollback.

A
Perform a gradual rollback with a managed instance group.
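Why wrong: Managed instance groups manage Compute Engine VMs (including GKE nodes), not Kubernetes workload rollouts, so this does not apply to a Deployment rollback.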
B
Use a blue/green deployment instead.
Why wrong: Blue/green also relies on probes; the same issue could occur.
C
Ensure that the readiness probe is tested as part of the pre-deployment validation.
Validating probes before deployment ensures both new and old versions are ready.
D
Use a Kubernetes Job to run a post-deployment validation.
Why wrong: Post-deployment validation is good but does not prevent the initial issue during rollback.


Frequently asked questions

What does the PCDOE exam test about Managing service incidents?

Managing service incidents questions test whether you can apply the concept in context, not just recognise a definition.

How should I use these practice questions?

Select your answer before revealing the explanation. Then read why each option is right or wrong; this active recall approach builds retention far faster than re-reading notes.

Can I practise just Managing service incidents questions in a focused session?

Yes. The session launcher on this page draws every question from the Managing service incidents domain. Use a 10-question session first to gauge your baseline, then move to 20 or 30 once the weak spots are clear.

Where can I practise other PCDOE topics?

Use the topic links above to move to related areas, or go back to the PCDOE question bank to see all topics.

Are these real exam questions or dumps?

These are original practice questions written to test the same concepts the PCDOE exam covers. They are not copied from any real exam or dump site.

Managing service incidents questions

A DevOps engineer receives an alert that the error budget for a critical service has been exhausted. The service runs on Compute Engine behind an HTTP(S) load balancer. The team wants to reduce the impact on users while investigating. What should the engineer do first?

A company uses Cloud Run for a stateless API service with concurrency set to 80. During a traffic spike, some requests return HTTP 500 errors and latency spikes. Cloud Monitoring shows container CPU utilization at 100% and memory usage at 70%. What is the most likely cause and the best first step?

A team uses Cloud SQL for PostgreSQL. They receive an alert that the database's CPU utilization is above 95% for the past 30 minutes. Queries are taking longer than usual. They want to investigate without causing further impact. What should they do first?

A company's SRE team is designing an incident management process. They want to ensure that alerts are actionable and that on-call engineers are not overwhelmed by false positives. Which approach should they take?

An incident is declared for a production service running on GKE. The on-call engineer suspects a recent code change may have introduced a memory leak. Which THREE actions should the engineer take to investigate and mitigate?

A service experiences increased latency and HTTP 503 errors. The engineer finds that the backend managed instance group (MIG) is at max instances and CPU utilization is 90%. Which TWO actions should the engineer take to restore the service quickly?

Refer to the exhibit. A GKE pod is repeatedly crashing with the error shown. The deployment has resource requests of 512 MiB memory and limits of 1 GiB. What is the most likely cause and the best remediation?

Refer to the exhibit. A Cloud Function (2nd gen) is timing out. The function's timeout is set to 60 seconds. The function queries a Cloud SQL database. What is the most likely cause and the best action?

Refer to the exhibit. You are investigating a performance issue where the api-server container is using excessive CPU. You run a Cloud Monitoring API query and receive the JSON configuration shown. However, the query returns no data points. What is the most likely cause?

Order the steps to deploy a new version of a microservice to Google Kubernetes Engine using a rolling update.

Arrange the steps to migrate a monolithic application to microservices on Google Kubernetes Engine.

Match each CI/CD concept to its definition.

Match each Google Cloud DevOps capability to its benefit.

Your team receives an alert that the Error Reporting count for a critical service has increased tenfold in the last 10 minutes. You suspect a recent code deployment is the cause. What is the first action you should take?

You are investigating a slow increase in latency for a service running on Compute Engine. You have Cloud Monitoring and Cloud Logging set up. Which tool would best help you identify the cause of the latency?

Your team uses a canary deployment strategy on Google Kubernetes Engine (GKE). During a rollback, you notice that the rollback caused a brief period of downtime because the previous version's readiness probe was not properly configured. Which of the following best prevents this issue in the future?
