20+ practice questions focused on Managing service incidents — one of the most tested topics on the Google Professional Cloud DevOps Engineer exam. Each question includes a detailed explanation so you learn why the right answer is correct.
A team uses Google Kubernetes Engine (GKE) with cluster telemetry enabled. During an incident, they notice that a deployment's pods are repeatedly crashing with Exit Code 137. The team wants to investigate the root cause. Which two Google Cloud services should they use together to correlate resource usage and logs?
Explanation: Exit Code 137 indicates that a container was killed by SIGKILL (signal 9), typically due to an out-of-memory (OOM) condition. Cloud Monitoring provides metrics such as memory usage and OOM kill counts, while Cloud Logging captures the container's termination logs and system events. By correlating these two services, the team can identify when memory usage spiked and confirm that the pod was OOM-killed, enabling root cause analysis.
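The 137 in this explanation follows the common shell convention that a process killed by a signal reports 128 plus the signal number, so 137 = 128 + 9 (SIGKILL). The short Python sketch below decodes an exit code that way. The function name and the small signal table are illustrative assumptions, not part of any Google Cloud or Kubernetes API.

```python
# Conventional Linux signal numbers; only a few common ones are listed here.
SIGNAL_NAMES = {2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        sig = code - 128
        name = SIGNAL_NAMES.get(sig, f"signal {sig}")
        return f"terminated by {name}"
    return f"application error (exit status {code})"


print(describe_exit_code(137))  # terminated by SIGKILL
```

A SIGKILL on its own does not prove an out-of-memory kill, which is why the explanation pairs the exit code with memory metrics and termination logs.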
A DevOps engineer receives an alert that the error budget for a critical service has been exhausted. The service runs on Compute Engine behind an HTTP(S) load balancer. The team wants to reduce the impact on users while investigating. What should the engineer do first?
Explanation: Rolling back the most recent deployment is the correct first action because it immediately restores the service to a known stable state, stopping further consumption of the error budget. This aligns with the incident management principle of 'mitigate first, investigate later': reducing user impact takes priority over root cause analysis. Once the rollback completes, the HTTP(S) load balancer routes traffic to the backends running the previous version as they pass health checks.
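To make "error budget exhausted" concrete, here is a minimal Python sketch of the arithmetic for a request-based SLO. The function name and the sample numbers are hypothetical; real budgets would normally be read from the SLO monitoring system rather than computed by hand.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target and request count must leave a non-zero budget")
    return 1.0 - failed_requests / allowed_failures


# Hypothetical month: 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
print(round(error_budget_remaining(0.999, 1_000_000, 500), 3))  # 0.5
```

A result at or below zero corresponds to the alert in the question: the budget is spent, so mitigation comes before investigation.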
A company uses Cloud Run for a stateless API service with concurrency set to 80. During a traffic spike, some requests return HTTP 500 errors and latency spikes. Cloud Monitoring shows container CPU utilization at 100% and memory usage at 70%. What is the most likely cause and the best first step?
Explanation: The correct answer is A because with CPU at 100% and memory at only 70%, the bottleneck is CPU, not memory. Cloud Run containers handle requests concurrently; setting concurrency to 80 means each container processes up to 80 requests simultaneously. When CPU is saturated, requests queue up, causing latency spikes and eventual HTTP 500 errors as the container becomes unresponsive. Reducing concurrency to 10 lowers the per-container request load, so each in-flight request gets a larger share of CPU, and Cloud Run scales out to more instances to absorb the rest of the traffic.
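The effect of the concurrency setting can be estimated with Little's law (in-flight requests = arrival rate × average latency). The sketch below is a rough capacity estimate under assumed traffic figures, not a description of Cloud Run's actual autoscaling algorithm; the function name and the numbers are hypothetical.

```python
import math


def estimated_instances(requests_per_second: float, avg_latency_seconds: float, max_concurrency: int) -> int:
    """Rough instance count: in-flight requests divided by per-instance concurrency."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    # Little's law: average in-flight requests = arrival rate * average latency.
    in_flight = requests_per_second * avg_latency_seconds
    return max(1, math.ceil(in_flight / max_concurrency))


# Assumed spike: 400 requests/s at 0.5 s average latency (about 200 in flight).
for concurrency in (80, 10):
    print(concurrency, estimated_instances(400, 0.5, concurrency))
```

Under these assumptions, concurrency 80 packs the load onto about 3 instances, while concurrency 10 spreads it across about 20, which is how lowering the setting relieves per-container CPU pressure.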
A team uses Cloud SQL for PostgreSQL. They receive an alert that the database's CPU utilization is above 95% for the past 30 minutes. Queries are taking longer than usual. They want to investigate without causing further impact. What should they do first?
Explanation: Cloud SQL Query Insights is a managed monitoring tool that automatically captures and analyzes query performance metrics, including CPU consumption, latency, and execution plans. In this scenario, it allows the team to identify the specific queries causing high CPU utilization without making any changes to the instance, thus avoiding further impact. This is the first and safest diagnostic step before any remediation.
A company's SRE team is designing an incident management process. They want to ensure that alerts are actionable and that on-call engineers are not overwhelmed by false positives. Which approach should they take?
Explanation: Option D is correct because defining SLOs and setting alert thresholds based on historical error budget consumption ensures alerts are directly tied to user-facing reliability. This approach reduces false positives by triggering only when the error budget is being consumed faster than expected, which keeps alerts actionable and cuts noise for on-call engineers.
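One common way to express "consumed faster than expected" is a burn rate: the observed error rate divided by the error rate the SLO allows. The Python sketch below pages only when both a long and a short window exceed a threshold, which filters out brief blips. The function names are hypothetical, and the default threshold of 14.4 is an assumed example value (it corresponds to spending 2% of a 30-day budget in one hour), not a setting taken from the source.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1.0 - slo_target)


def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn the budget faster than the threshold."""
    return (burn_rate(long_window_error_rate, slo_target) >= threshold
            and burn_rate(short_window_error_rate, slo_target) >= threshold)


# Sustained 2% errors against a 99.9% SLO: burn rate near 20 in both windows.
print(should_page(0.02, 0.03, 0.999))    # True
# The long window is still high, but the short window has already recovered.
print(should_page(0.02, 0.0005, 0.999))  # False
```

Requiring the short window as well means the alert clears soon after the problem stops, so on-call engineers are not paged for incidents that have already resolved.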
1. Baseline your knowledge
Start with 10 questions to gauge your current understanding of Managing service incidents. This tells you whether you need a concept refresher or just practice.
2. Review every explanation
For each question — right or wrong — read the full explanation. Understanding why an answer is correct is more valuable than knowing the answer itself.
3. Focus on exam traps
Managing service incidents questions on the PCDOE frequently use trap wording. Look for subtle differences in answers that test your precision, not just general knowledge.
4. Reach 80% consistently
Do repeated sessions until you score 80%+ three times in a row. Then move to mixed-mode practice to test cross-topic recall under realistic conditions.
The exact number varies per candidate. Managing service incidents is tested as part of the Google Professional Cloud DevOps Engineer blueprint. Practicing with targeted Managing service incidents questions ensures you can handle any format or difficulty that appears.
Yes. Courseiva provides free PCDOE practice questions across all exam topics and domains. The platform includes topic-based practice, mock exams, missed-question review, bookmarked questions, and readiness tracking — no account required.
Difficulty is subjective, but Managing service incidents is a high-priority exam concept tested in multiple ways — direct recall, scenario analysis, and command-output interpretation. Consistent practice is the best way to build confidence.