PCDOE Managing service incidents Questions

63 questions · Managing service incidents · All types, answers revealed

1

MCQ · medium

Refer to the exhibit. You are reviewing an alert policy for CPU utilization. What is a potential problem with this configuration?

A. The autoClose time is too long.

B. The duration is too short, which may cause noise during spikes.

C. The threshold is set too low.

D. The combiner should be 'AND'.

Answer: B

A 60-second window may not filter out short-lived bursts.

Why this answer

Option B is correct because a short duration in a CPU utilization alert policy means the threshold must be breached for only a brief period before triggering an incident. This can cause noise during transient spikes that are not indicative of a sustained problem, leading to false positives and alert fatigue for operations teams.

Exam trap

Google Cloud often tests the distinction between duration and threshold, tricking candidates into thinking a low threshold is the primary cause of noise, when in fact a short duration is the more direct trigger for false positives from spikes.

How to eliminate wrong answers

Option A is wrong because a long autoClose time is not inherently a problem; it simply means incidents remain open longer, which may be acceptable or even desirable for tracking. Option C is wrong because the exhibit does not indicate that the threshold is too low; the question focuses on duration, not threshold value. Option D is wrong because changing the combiner to 'AND' would require all conditions to be true simultaneously, which makes the policy less sensitive and could miss issues; 'OR' is the appropriate combiner when any single condition should trigger the alert.

2

MCQ · easy

You are investigating a slow increase in latency for a service running on Compute Engine. You have Cloud Monitoring and Cloud Logging set up. Which tool would best help you identify the cause of the latency?

A. Error Reporting

B. Cloud Profiler

C. Cloud Debugger

D. Cloud Trace

Answer: D

Cloud Trace traces requests and identifies latency contributors.

Why this answer

Cloud Trace is designed to capture latency data from distributed applications by tracing requests as they propagate through services. It provides detailed latency distributions and per-span breakdowns, allowing you to pinpoint which component or operation is causing the slowdown. For a Compute Engine service with Cloud Monitoring and Logging already in place, Cloud Trace is the most direct tool to analyze request-level latency.

Exam trap

Google Cloud often tests the distinction between tools that measure latency (Cloud Trace) versus tools that measure resource utilization (Cloud Profiler) or error rates (Error Reporting), leading candidates to confuse profiling with tracing.

How to eliminate wrong answers

Option A is wrong because Error Reporting aggregates and analyzes application errors (exceptions, crashes), not latency metrics; it would not help identify the cause of a slow increase in latency. Option B is wrong because Cloud Profiler continuously samples CPU and heap usage to identify performance bottlenecks in code, but it focuses on resource consumption rather than request-level latency tracing. Option C is wrong because Cloud Debugger allows you to inspect application state at a specific point in code without stopping execution, but it is meant for debugging logic issues, not for analyzing latency trends over time.

3

MCQ · medium

A company's SRE team is designing an incident management process. They want to ensure that alerts are actionable and that on-call engineers are not overwhelmed by false positives. Which approach should they take?

A. Use only critical severity alerts and rely on manual dashboard review for lower severity

B. Create alerting policies for every available metric to ensure nothing is missed

C. Set all alert thresholds to 50% above the average value to avoid false positives

D. Define SLOs and set alert thresholds based on historical error budget consumption

Answer: D

SLO-based alerting focuses on user-facing impact and reduces noise.

Why this answer

Option D is correct because defining SLOs and setting alert thresholds based on historical error budget consumption ensures alerts are directly tied to user-facing reliability. This approach prevents false positives by only triggering when the error budget is being consumed faster than expected, making alerts actionable and reducing noise for on-call engineers.

Exam trap

Google Cloud often tests the misconception that more alerts or higher thresholds equal better reliability, when in fact the key is aligning alerts with SLOs and error budgets to ensure they are actionable and reduce noise.

How to eliminate wrong answers

Option A is wrong because relying solely on critical alerts and manual dashboard review for lower severity risks missing early warning signs of degradation, leading to delayed incident response and potential SLO breaches. Option B is wrong because creating alerting policies for every available metric generates excessive noise and alert fatigue, overwhelming on-call engineers with non-actionable alerts. Option C is wrong because setting all alert thresholds to 50% above the average value is arbitrary and does not account for normal variance or seasonal patterns, which can either miss real issues or still produce false positives during low-traffic periods.

4

MCQ · easy

Your SLO for availability is 99.9% over a 30-day window. You want an alert that fires when the error budget burn rate is high, leaving less than 5% of the error budget remaining in the next 6 hours. What type of alerting policy should you configure?

A. A custom log-based metric.

B. A static threshold alert based on the error rate.

C. A burn rate alert based on the forecasted consumption.

D. An exponential decay alert.

Answer: C

Burn rate alerts track how fast the error budget is being used and can forecast depletion.

Why this answer

Option C is correct because a burn rate alert based on forecasted consumption directly monitors the rate at which the error budget is being consumed and triggers when the projected remaining budget falls below 5% within the next 6 hours. This aligns with Google SRE best practices for proactive error budget management, using a multi-window, multi-burn-rate approach to detect rapid consumption before the budget is exhausted.

Exam trap

Google Cloud often tests the distinction between static error rate thresholds and dynamic burn rate alerts, trapping candidates who think a simple error rate threshold can adequately protect an SLO without considering the time-based consumption of the error budget.

How to eliminate wrong answers

Option A is wrong because a custom log-based metric is used to extract and measure specific events from logs (e.g., counting 5xx errors), but it does not inherently provide burn rate forecasting or alerting on remaining budget over a time window. Option B is wrong because a static threshold alert based on error rate fires when the error rate exceeds a fixed value, but it does not account for the error budget consumption rate or the remaining budget percentage over the next 6 hours, leading to either premature or missed alerts. Option D is wrong because an exponential decay alert is not an SLO alerting mechanism; error budget burn is tracked with windowed burn-rate conditions.
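
The burn-rate idea behind this question can be illustrated with a little arithmetic. The following is a minimal sketch, not Cloud Monitoring's implementation: the `Slo` class, the function names, and the 1.44% observed error ratio are all hypothetical values chosen for illustration. It assumes a constant error ratio and defines burn rate as the observed error ratio divided by the error budget fraction.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    target: float        # e.g. 0.999 for 99.9% availability
    window_hours: float  # e.g. 30 * 24 for a 30-day window

    @property
    def error_budget(self) -> float:
        """Fraction of requests allowed to fail over the window."""
        return 1.0 - self.target


def burn_rate(slo: Slo, observed_error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / slo.error_budget


def hours_until_exhausted(slo: Slo, rate: float, budget_remaining: float = 1.0) -> float:
    """Hours until the remaining budget fraction is used up at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return budget_remaining * slo.window_hours / rate


# Hypothetical numbers: 99.9% over 30 days, with 1.44% of requests currently failing.
slo = Slo(target=0.999, window_hours=30 * 24)
rate = burn_rate(slo, observed_error_ratio=0.0144)
hours_left = hours_until_exhausted(slo, rate)

print(f"burn rate: {rate:.1f}x")
print(f"full budget exhausted in about {hours_left:.0f} hours")
```

With these assumed inputs the burn rate is about 14.4, so a full 30-day budget would last roughly 50 hours. A static error-rate threshold carries no such notion of time remaining, which is the distinction the question is probing.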

5

MCQ · easy

A team is experiencing increased latency in their microservices application after a new deployment. They suspect a specific service is the bottleneck. Which tool should they use to identify the slowest service in the request path?

A. Cloud Profiler

B. Cloud Logging

C. Cloud Monitoring

D. Cloud Trace

Answer: D

Cloud Trace enables distributed tracing to identify slow services.

Why this answer

Cloud Trace is the correct tool because it provides end-to-end latency analysis by capturing trace spans from each microservice in a request path. It aggregates and visualizes the time spent in each service, allowing you to pinpoint the slowest service causing the bottleneck. This directly addresses the need to identify the specific service responsible for increased latency after a deployment.

Exam trap

Google Cloud often tests the distinction between profiling (Cloud Profiler) and tracing (Cloud Trace), where candidates mistakenly choose Cloud Profiler because they confuse 'profiling' with 'tracing' or think it can analyze request paths across services.

How to eliminate wrong answers

Option A is wrong because Cloud Profiler is designed for continuous profiling of CPU and memory usage to identify performance hotspots within a single service's code, not for tracing request latency across multiple services. Option B is wrong because Cloud Logging collects and stores log entries but does not provide distributed tracing or latency breakdowns across service boundaries. Option C is wrong because Cloud Monitoring focuses on metrics, alerts, and dashboards for overall system health and resource utilization, but it lacks the trace-level detail needed to isolate the slowest service in a request path.
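
To make the per-span breakdown concrete, here is a small sketch of how trace data points at the slowest service. It does not call the Cloud Trace API; the `Span` class and the sample trace are hypothetical, and the self-time calculation assumes that child spans of the same parent do not overlap.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_by_service(spans: List[Span]) -> Dict[str, int]:
    """Time spent in each service, excluding time spent waiting on child spans."""
    child_time: Dict[str, int] = {}
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] = child_time.get(span.parent_id, 0) + span.duration_ms

    totals: Dict[str, int] = {}
    for span in spans:
        own = span.duration_ms - child_time.get(span.span_id, 0)
        totals[span.service] = totals.get(span.service, 0) + own
    return totals


# Hypothetical trace for a single request.
trace = [
    Span("a", None, "frontend", 0, 480),
    Span("b", "a", "cart", 20, 90),
    Span("c", "a", "checkout", 100, 460),
    Span("d", "c", "payments", 120, 440),
]

totals = self_time_by_service(trace)
slowest = max(totals, key=totals.get)
print(totals)
print("slowest service:", slowest)
```

In this made-up trace the frontend span is the longest overall, but most of that time is spent waiting on downstream calls; subtracting child time shows that payments is where the latency actually accrues.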

6

Multi-Select · hard

Which THREE steps are typically part of a formal incident postmortem according to Google SRE best practices?

Select 3 answers

A. Identify the person responsible for the incident.

B. Assign action items with deadlines.

C. Summarize the incident timeline.

D. Implement a solution immediately.

E. Determine contributing factors.

Answers: B, C, E

Action items ensure follow-up on improvements.

Why this answer

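Option B is correct because Google SRE postmortems emphasize creating actionable follow-ups to prevent recurrence. Assigning action items with deadlines ensures that identified issues are systematically addressed, which is a core principle of blameless postmortems focused on process improvement rather than punishment. Options C and E are correct for the same reason: a postmortem records what happened (the timeline) and why it happened (the contributing factors). Option A is wrong because blameless postmortems do not single out individuals, and Option D is wrong because immediate fixes belong to incident mitigation, not to the postmortem.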

Exam trap

Google Cloud often tests the misconception that postmortems are about assigning blame or immediate fixes, when in reality the PCDOE exam expects knowledge of blameless, data-driven retrospectives with actionable follow-ups as per Google SRE best practices.

7

Drag & Drop · medium

Order the steps to deploy a new version of a microservice to Google Kubernetes Engine using a rolling update.

Why this order

Update the manifest, apply it, monitor, verify, and roll back if needed.

8

MCQ · easy

During an incident, a DevOps engineer needs to temporarily increase the capacity of a Google Kubernetes Engine (GKE) cluster to handle the traffic surge. Which approach minimizes manual intervention and follows Google best practices?

A. Enable cluster autoscaler and update the horizontal pod autoscaler to scale faster.

B. Manually add a node pool with larger machines via the Google Cloud Console.

C. Create a new node pool and migrate pods using kubectl drain.

D. Scale the existing node pool by increasing the maximum node count in the cluster.

Answer: A

Cluster autoscaler adds nodes automatically; HPA scales pods.

Why this answer

Option A is correct because enabling the cluster autoscaler automatically adjusts the number of nodes in the node pool based on resource demands, while updating the horizontal pod autoscaler (HPA) to scale faster (e.g., reducing the scale-up stabilization window or lowering the target CPU utilization) allows pods to replicate more quickly. This combination minimizes manual intervention by automating both pod-level and node-level scaling, aligning with Google's best practices for handling traffic surges in GKE.

Exam trap

The trap here is that candidates often confuse simply increasing the maximum node count (Option D) with enabling the cluster autoscaler, assuming that raising the cap alone triggers automatic scaling, when in fact the cluster autoscaler must be explicitly enabled to add nodes based on demand.

How to eliminate wrong answers

Option B is wrong because manually adding a node pool with larger machines via the Google Cloud Console requires human intervention and does not automate scaling, violating the goal of minimizing manual effort. Option C is wrong because creating a new node pool and migrating pods using kubectl drain is a manual, multi-step process that does not leverage GKE's native autoscaling capabilities and introduces operational overhead. Option D is wrong because simply increasing the maximum node count in the cluster does not trigger automatic scaling; it only sets an upper limit, and without the cluster autoscaler enabled, nodes will not be added automatically in response to traffic surges.
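
How an HPA setting translates into replica counts can be sketched with the core formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). This is a simplified illustration only: it ignores the tolerance band, stabilization windows, and min/max replica bounds, and the replica counts and utilization figures are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core HPA formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * current_metric / target_metric)


# Hypothetical surge: 4 pods averaging 90% CPU utilization.
with_target_80 = desired_replicas(current_replicas=4, current_metric=90, target_metric=80)
with_target_50 = desired_replicas(current_replicas=4, current_metric=90, target_metric=50)

print("target 80% ->", with_target_80, "replicas")
print("target 50% ->", with_target_50, "replicas")
```

Under the same load, the lower utilization target asks for more replicas sooner. Those extra pods only get scheduled if nodes are available, which is where the cluster autoscaler comes in.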

9

Multi-Select · hard

You are the DevOps engineer for a large e-commerce platform running on Google Kubernetes Engine (GKE). During a flash sale, you observe that the payments service is experiencing high latency and intermittent errors. The service is deployed with HorizontalPodAutoscaler (HPA) based on CPU utilization. You need to quickly diagnose and mitigate the issue. Which TWO actions should you take?

Select 2 answers

A. Use Cloud Monitoring to examine the payments service's request latency and error rate metrics, and create a custom dashboard for real-time monitoring.

B. Check the GKE node's network performance using VPC Flow Logs and increase the node pool size.

C. Modify the HPA to use memory utilization instead of CPU, as memory is more indicative of the service's performance.

D. Configure a custom metric in Cloud Monitoring for the payments service's request queue depth and use it for HPA.

E. Manually scale up the payments service deployment to more replicas to handle the increased load.

Answers: A, E

Cloud Monitoring provides latency and error metrics via Istio or GKE metrics; a custom dashboard helps visualize the issue.

Why this answer

Option A is correct because Cloud Monitoring provides the necessary observability to diagnose the root cause of high latency and intermittent errors by examining request latency and error rate metrics. Creating a custom dashboard enables real-time monitoring, allowing you to correlate performance degradation with traffic spikes during the flash sale. This is the first step in incident management: observe before acting. Option E is correct because manually adding replicas is the quickest mitigation while diagnosis continues, and it does not depend on the CPU-based HPA reacting to the load.

Exam trap

Google Cloud often tests the misconception that scaling actions (like manual scaling or changing HPA metrics) are the first step in incident response, when in fact observability and diagnosis must precede any mitigation to avoid making the problem worse.

10

MCQ · medium

Refer to the exhibit. If the error rate spikes to 2% for only 2 minutes, why does the alert not fire?

A. The notification rate limit prevents the alert from firing.

B. The duration of 300s requires the condition to be met for 5 minutes.

C. The threshold value of 1% is too high.

D. The alignment period of 60s is too short.

Answer: B

The spike lasted only 2 minutes, insufficient to meet the duration requirement.

Why this answer

Option B is correct because the alert condition requires the error rate to exceed 1% for a duration of 300 seconds (5 minutes). A spike lasting only 2 minutes does not meet the minimum duration requirement, so the condition never triggers and the alert does not fire. In Cloud Monitoring, the duration parameter defines how long the condition must be continuously true before an incident is opened.

Exam trap

Google Cloud often tests the distinction between the threshold value and the duration parameter, tricking candidates into thinking a spike above the threshold should always trigger an alert, when in fact the duration requirement must also be satisfied.

How to eliminate wrong answers

Option A is wrong because the notification rate limit controls how often notifications are sent after an alert has fired, not whether the alert fires in the first place. Option C is wrong because the 1% threshold is not the issue: the error rate spikes to 2%, which exceeds it. Option D is wrong because the 60s alignment period defines the window over which data points are aggregated (e.g., averaged per 60-second window) and does not affect the duration requirement; a different alignment period would not cause the alert to fire if the 300s duration is not met.
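
The interaction between threshold and duration can be illustrated with a simplified model. The sketch below is not Cloud Monitoring's evaluation engine: it assumes one aligned sample per 60-second alignment period and treats the duration as a required run of consecutive breaching samples. The function name and sample values are hypothetical.

```python
from typing import Sequence


def alert_fires(samples: Sequence[float], threshold: float,
                duration_s: int, alignment_s: int = 60) -> bool:
    """Return True if samples stay above threshold for at least duration_s.

    samples holds one aligned value per alignment_s seconds.
    """
    needed = max(1, duration_s // alignment_s)
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= needed:
            return True
    return False


baseline = 0.002  # 0.2% error rate
two_minute_spike = [baseline] * 3 + [0.02] * 2 + [baseline] * 5
five_minute_breach = [baseline] * 3 + [0.02] * 5 + [baseline] * 2

print(alert_fires(two_minute_spike, threshold=0.01, duration_s=300))    # False
print(alert_fires(five_minute_breach, threshold=0.01, duration_s=300))  # True
```

Both series exceed the 1% threshold, but only the second stays above it for five consecutive aligned samples, which is why the 2-minute spike in the question produces no alert.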

11

MCQ · medium

Your company runs an e-commerce application on Google Kubernetes Engine (GKE) with a microservice architecture. During a Black Friday sale, the orders service experiences a sudden increase in latency and errors. You notice that the database connection pool in the orders service is exhausted, leading to timeouts. The service is written in Java and uses HikariCP connection pool. You need to mitigate the incident quickly. Which action should you take first?

A. Increase the number of replicas of the orders service.

B. Add more database instances.

C. Enable connection pooling at the database side.

D. Temporarily reduce the maximum connection pool size.

Answer: A

More replicas mean more connection pools, increasing total connections and reducing load per pod.

Why this answer

Increasing the number of replicas of the orders service is the fastest way to mitigate the incident because it horizontally scales the application tier, distributing incoming requests across more pods. This reduces the load per pod, which in turn reduces the number of concurrent database connections each pod attempts to acquire from its HikariCP pool, alleviating pool exhaustion and timeouts without requiring a database or code change.

Exam trap

Google Cloud often tests the misconception that database connection pool exhaustion is always a database-side problem, leading candidates to choose database scaling or connection pooling changes instead of recognizing that the immediate fix is to scale the application tier.

How to eliminate wrong answers

Option B is wrong because adding more database instances addresses database-side capacity but does not directly solve the connection pool exhaustion at the application tier; the existing pods would still attempt to open connections to the new instances, and the pool exhaustion is a client-side issue. Option C is wrong because connection pooling at the database side (e.g., PgBouncer or ProxySQL) is a valid architectural improvement but requires deployment and configuration changes that take time, making it unsuitable for immediate mitigation during an incident. Option D is wrong because temporarily reducing the maximum connection pool size would worsen the bottleneck by allowing even fewer concurrent database operations, increasing queueing and latency rather than resolving the exhaustion.
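
The arithmetic behind this reasoning can be sketched as follows. All numbers are hypothetical, and the model is deliberately crude: it assumes each in-flight request holds one database connection and that load is spread evenly across pods. The pool size of 10 matches HikariCP's documented default `maximumPoolSize`; the database connection limit is an assumed value.

```python
from dataclasses import dataclass


@dataclass
class PoolEstimate:
    requests_per_pod: float
    pool_saturated: bool
    total_connections: int
    within_db_limit: bool


def estimate(replicas: int, max_pool_size: int,
             concurrent_requests: int, db_max_connections: int) -> PoolEstimate:
    """Estimate per-pod pool pressure and total connections opened against the database."""
    requests_per_pod = concurrent_requests / replicas
    total_connections = replicas * max_pool_size
    return PoolEstimate(
        requests_per_pod=requests_per_pod,
        pool_saturated=requests_per_pod > max_pool_size,
        total_connections=total_connections,
        within_db_limit=total_connections <= db_max_connections,
    )


# Hypothetical incident: 60 concurrent requests, pool size 10, database limit of 100 connections.
before = estimate(replicas=3, max_pool_size=10, concurrent_requests=60, db_max_connections=100)
after = estimate(replicas=8, max_pool_size=10, concurrent_requests=60, db_max_connections=100)

print(before)
print(after)
```

In this example scaling from 3 to 8 replicas brings per-pod demand under the pool size while total connections stay within the assumed database limit. If replicas × pool size exceeded that limit, scaling out would simply move the bottleneck to the database, so the limit is worth checking before scaling.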

12

MCQ · hard

During a post-incident review, the team discovers that a misconfiguration in Cloud Armor caused legitimate traffic to be blocked, leading to an outage. The misconfiguration was introduced by a junior engineer who had overly permissive IAM roles. What is the best way to prevent similar incidents in the future?

A. Enforce a mandatory peer review for all Cloud Armor configuration changes.

B. Revoke the junior engineer's access to Cloud Armor and grant read-only access.

C. Enable Cloud Armor security policy logs and create alerting for blocked traffic spikes.

D. Use Organization Policy constraints to restrict allowed IP ranges and rules in Cloud Armor security policies.

Answer: D

Prevents creation of overly permissive rules.

Why this answer

Option D is correct because enforcing guardrails with Organization Policies can prevent misconfigurations at scale. Option A is wrong because peer reviews reduce human error but are not automated. Option B is wrong because revoking the engineer's access is punitive and does not prevent others from making the same mistake. Option C is wrong because Cloud Armor logs help with detection but not prevention.

13

MCQ · easy

Your incident response team uses a follow-the-sun model. An incident occurs during the Asia-Pacific shift, but the escalation path requires sign-off from the US-based team lead. This causes delays. What change should you recommend?

A. Use a chatbot for automated responses.

B. Implement a global incident commander role with delegated authority.

C. Increase the number of US team members.

D. Schedule the US team lead to work overnight.

Answer: B

This empowers regional leads to make decisions quickly.

Why this answer

Option B is correct because a global incident commander with delegated authority can make escalation decisions without waiting for a specific time-zone-based team lead. This role operates across shifts, ensuring that critical incident response actions are not delayed by geographic handoffs. In a follow-the-sun model, this role provides continuous, authoritative decision-making, aligning with ITIL incident management best practices for global teams.

Exam trap

Google Cloud often tests the misconception that adding more staff or automating responses can solve process delays, when the real issue is a lack of delegated authority across time zones.

How to eliminate wrong answers

Option A is wrong because a chatbot for automated responses cannot replace human judgment for escalation sign-offs; it handles only predefined, low-complexity tasks and lacks the authority to approve critical incident actions. Option C is wrong because increasing the number of US team members does not address the root cause of delayed sign-offs during the Asia-Pacific shift; the US team lead remains unavailable during that time, and more team members do not grant them escalation authority. Option D is wrong because scheduling the US team lead to work overnight is unsustainable, leads to burnout, and violates the follow-the-sun model's intent of balanced global coverage; it also does not scale for incidents in other time zones.

14

MCQ · hard

Your application runs in two GCP regions. A regional outage occurs in the primary region. You have a Cloud Load Balancer with a failover backend. However, the failover did not trigger because the health check passed on a stale connection. What is the best solution?

A. Use a passive health check.

B. Use a global load balancer with HTTP health checks based on application health.

C. Configure a custom health check that checks the database.

D. Use TCP health checks with a shorter interval.

Answer: B

HTTP health checks test actual application response, reducing false positives.

Why this answer

Option B is correct because a global load balancer with HTTP health checks can probe the actual application endpoint (e.g., /healthz) to verify end-to-end functionality, not just TCP connectivity. This prevents the stale connection issue where a TCP health check passes on an existing but broken session, ensuring failover triggers only when the application is genuinely unhealthy.

Exam trap

The trap here is that candidates assume TCP health checks are sufficient for failover, but Google Cloud tests the nuance that stale connections can mask application failure, requiring application-layer (HTTP) health checks to ensure true failover.

How to eliminate wrong answers

Option A is wrong because passive health checks rely on observing failures in live traffic rather than on active probing, so they would not detect a stale connection that still appears to pass traffic. Option C is wrong because a custom health check that checks the database introduces unnecessary complexity and dependency; the health check should validate the application's own readiness, not an external service that may be unrelated to the outage. Option D is wrong because TCP health checks with a shorter interval still only verify layer-4 connectivity, not application health; a stale TCP connection can persist and pass the check even if the application is unresponsive.
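
An application-level health endpoint of the kind described above might look like the following minimal sketch, written with Python's standard `http.server`. The `/healthz` path comes from the explanation; the check names, the `APP_CHECKS` dictionary, and port 8080 are hypothetical. The key point is that the response reflects the application's own state rather than merely the fact that a TCP connection can be opened.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple


def health_response(checks: Dict[str, bool]) -> Tuple[int, str]:
    """Map named readiness checks to an HTTP status code and body."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    if failed:
        return 503, "unhealthy: " + ", ".join(failed)
    return 200, "ok"


# Hypothetical in-process state that the application itself would keep up to date.
APP_CHECKS: Dict[str, bool] = {"request_loop": True, "config_loaded": True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = health_response(APP_CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Serve the health endpoint until interrupted (not called automatically here)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

A load balancer HTTP health check pointed at `/healthz` would see a 503 as soon as any check is marked false, whereas a TCP check would keep passing for as long as the socket accepts connections.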

15

MCQ · easy

Your company runs a microservices application on Google Kubernetes Engine (GKE) with shared Istio service mesh across multiple namespaces. You use Cloud Monitoring and Cloud Logging for observability. At 10:30 AM, you receive an alert that the checkout service is returning high 5xx errors (over 20%) and latency is above 5 seconds. The incident response team is assembled, and you are the incident commander. The team suspects a recent deployment (v2.1) to the checkout service at 10:00 AM. The deployment was a minor configuration update. The team is divided: some want to immediately roll back, others want to analyze traces. You have access to the GCP console. What should you do first to ensure a swift and effective incident response?

A. Review the deployment history of the checkout service alongside Cloud Monitoring metrics and logs to identify the exact time and nature of the change.

B. Check the Error Reporting dashboard to view aggregated error logs and stack traces for the checkout service.

C. Immediately roll back the checkout service to the previous version and monitor if errors decrease.

D. Declare the incident, assign roles, and start a postmortem document.

Answer: A

This correlates the deployment with the incident symptoms, providing evidence for the best course of action.

Why this answer

Option A is correct because comparing the deployment changes with monitoring metrics and logs helps correlate the incident with the deployment, providing evidence to guide next steps. Option C is premature: rolling back without confirming that the deployment is the cause may not fix the issue and can introduce side effects. Option B is useful but might not pinpoint the root cause as quickly as comparing the deployment history with metrics. Option D is good practice but not the first action; you need to understand the impact first.

16

MCQ · medium

Your team is using Cloud Monitoring to track the health of a distributed microservices application. You notice that the error rate for the checkout service has increased significantly, but no alerts are firing. The SLO for checkout is 99.9% availability over a 28-day rolling window. You inspect the alerting policy and find it uses a time series aggregation with a 1-minute alignment period and a condition that triggers when the ratio of errors to total requests exceeds 0.001 for 5 consecutive minutes. What is the most likely reason the alert is not firing?

A. The alert condition requires 5 consecutive minutes of breach, but the error rate spikes are intermittent and not sustained.

B. The error budget has been exhausted, so the alert is suppressed.

C. The SLO window is too long, and the alert condition uses a different measurement period.

D. The ratio threshold is too high because the total request count is low.

Answer: A

The alert requires 5 consecutive minutes of the ratio exceeding 0.001; intermittent spikes may not meet this condition.

Why this answer

The alert condition requires the error ratio to exceed 0.001 for 5 consecutive 1-minute alignment periods. If the error rate spikes are intermittent—lasting only a minute or two before returning to normal—the condition of 5 consecutive minutes of breach is never met, so the alert remains silent. This is a classic case where the alerting policy's duration setting is too long relative to the bursty nature of the errors.

Exam trap

Google Cloud often tests the distinction between a threshold being breached and the alert condition's duration requirement being met, leading candidates to overlook that the 'for' parameter (e.g., 5 consecutive minutes) is a separate, critical condition that must be satisfied for the alert to fire.

How to eliminate wrong answers

Option B is wrong because error budget exhaustion does not suppress alerts; it is a separate SLO metric that tracks cumulative availability over the 28-day window, and alerts are independent of budget status. Option C is wrong because the SLO window (28 days) and the alert condition measurement period (1-minute alignment) are intentionally different—the alert uses a short-term metric to detect immediate problems, not the long-term SLO window. Option D is wrong because the ratio threshold of 0.001 (0.1%) is standard for a 99.9% SLO; a low total request count would make the ratio more volatile but does not inherently prevent the alert from firing if the condition is met.

17

MCQ · easy

A DevOps engineer notices that a critical service is down, but no alert has been received. The engineer checks Cloud Monitoring and sees that the alerting policy appears to be correctly configured. What is the most likely cause?

A. The metrics for the service are not being collected.

B. The incident was created but automatically closed.

C. The SLO is defined too loosely, so the error budget is not exhausted.

D. The notification channel is misconfigured.

Answer: D

If the notification channel is misconfigured, the alert condition may be met but the alert is not delivered.

Why this answer

Option D is correct because if the notification channel is misconfigured (e.g., invalid webhook URL, incorrect email address, or missing PagerDuty integration key), Cloud Monitoring will correctly detect the incident but fail to deliver the alert. The engineer sees the alerting policy appears correct because the policy itself is valid, but the channel configuration prevents the notification from reaching the recipient.

Exam trap

Google Cloud often tests the misconception that a correctly configured alerting policy guarantees alert delivery, ignoring that the notification channel is a separate configuration layer that can silently fail.

How to eliminate wrong answers

Option A is wrong because if metrics were not being collected, the alerting policy would show a 'no data' status or the metric would be absent from the dashboard, which the engineer would have noticed when checking Cloud Monitoring. Option B is wrong because an incident that was created and automatically closed would still generate a notification when the incident is created; the engineer would have received an alert for the opening event. Option C is wrong because an SLO and error budget are unrelated to alert delivery; they define service level targets, not whether a notification channel functions correctly.

18

MCQ · easy

Your team receives an alert that the Error Reporting count for a critical service has increased tenfold in the last 10 minutes. You suspect a recent code deployment is the cause. What is the first action you should take?

A. Disable the alert to reduce noise.

B. Roll back the deployment to the previous version.

C. Increase the instance count to handle the load.

D. Open a post-mortem to document the incident.

Answer: B

Rolling back quickly mitigates user impact.

Why this answer

A tenfold increase in error reporting within 10 minutes strongly indicates a recent code deployment introduced a critical bug. The immediate priority is to restore service stability by rolling back the deployment to the previous known-good version, as this directly mitigates the root cause. Delaying the rollback risks further degradation of the service and potential data loss or corruption.

Exam trap

Google Cloud often tests the candidate's ability to prioritize immediate incident response over long-term analysis, trapping those who confuse 'post-mortem' (a retrospective activity) with 'first action' (a corrective action).

How to eliminate wrong answers

Option A is wrong because disabling the alert ignores the underlying problem and violates the principle of observability; alerts are meant to signal real issues, and suppressing them does not reduce errors. Option C is wrong because increasing the instance count does not fix the root cause—if the new code has a bug (e.g., a memory leak or an unhandled exception), adding more instances will only replicate the failure across more nodes, potentially amplifying the impact. Option D is wrong because opening a post-mortem is a reactive step that should occur after the immediate incident is resolved; performing it first wastes critical time and delays the necessary rollback.

19

MCQhard

Your company runs a microservices application on a private GKE cluster with Workload Identity enabled. Services communicate via gRPC and HTTP. After a recent update to the payment service, users report intermittent 503 errors and 2-second latency spikes during peak hours (10 AM - 12 PM). Cloud Monitoring shows the payment service's CPU utilization averages 60%, but memory spikes to 90% during errors. The existing alert on HTTP 503 responses fires only after 5 consecutive errors over 5 minutes, but the errors are sporadic. You need to diagnose and resolve the issue. What should you do?

A.Increase the memory limits for the payment service containers and set a memory request equal to the limit. Then restart the pods to clear any memory leaks.

B.Switch the payment service's HTTP/2 protocol to HTTP/1.1 to reduce overhead, and increase the memory request limit to avoid out-of-memory errors.

C.Enable detailed logging and metrics for the payment service in Cloud Logging and Cloud Monitoring. Analyze logs around the error timestamps to identify memory consumption patterns, and lower the alert threshold for 503 errors to trigger on 2 errors within 1 minute. Also, set up a custom alert on memory usage exceeding 85%.

D.Scale the payment service horizontally by increasing the minimum number of replicas to handle peak load, and adjust the HPA to scale faster.

AnswerC

This approach provides the necessary observability to correlate memory spikes with errors, and adjusts alerts to catch issues earlier, enabling proactive diagnosis.

Why this answer

Option C is correct because the issue requires a two-pronged approach: first, enable detailed logging and metrics to diagnose the root cause (memory spikes during peak hours), and second, adjust alerting thresholds to catch sporadic 503 errors earlier (2 errors in 1 minute) and add a proactive memory usage alert at 85%. This aligns with the PCDOE incident management process of first gathering data before making changes, and ensures you can correlate memory pressure with gRPC/HTTP errors in a GKE environment with Workload Identity.

Exam trap

Google Cloud often tests the misconception that scaling or resource adjustments should be the first step, when in reality the PCDOE framework emphasizes 'observe before act' — you must enable logging and metrics to diagnose the issue before making any configuration changes.

How to eliminate wrong answers

Option A is wrong because increasing memory limits and restarting pods does not address the root cause of intermittent memory spikes during peak load; it only masks the symptom and may cause unnecessary downtime. Option B is wrong because switching from HTTP/2 to HTTP/1.1 would break gRPC communication (gRPC requires HTTP/2), and increasing memory request limits without understanding the pattern does not resolve the sporadic errors. Option D is wrong because scaling horizontally without first diagnosing why memory spikes to 90% during peak hours may lead to inefficient resource usage and does not fix the underlying memory consumption issue; the HPA may already be configured but the memory pressure is causing OOM kills or latency.
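The alert-tuning part of Option C can be illustrated with a small sketch. It evaluates the two rules from the question against a made-up list of sporadic 503 timestamps. The function name and the sample data are hypothetical, and "5 consecutive errors over 5 minutes" is approximated here as five errors inside a 300-second window.

```python
from typing import List

def fires(error_times: List[float], min_errors: int, window_s: float) -> bool:
    """Return True if any window of length window_s holds at least min_errors errors."""
    times = sorted(error_times)
    start = 0
    for end, t in enumerate(times):
        # Move the left edge forward until the span from times[start] to t fits the window.
        while t - times[start] > window_s:
            start += 1
        if end - start + 1 >= min_errors:
            return True
    return False

# Hypothetical sporadic 503s (seconds since 10:00): small bursts, minutes apart.
sporadic_503s = [12.0, 40.0, 410.0, 445.0, 900.0, 930.0]

old_rule = fires(sporadic_503s, min_errors=5, window_s=300)  # existing alert
new_rule = fires(sporadic_503s, min_errors=2, window_s=60)   # tightened alert
print(old_rule, new_rule)
```

With this sample data the existing rule never fires, while the tightened rule does, which is the behavior the question describes for sporadic errors.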

20

Multi-Selecthard

You are designing alerting policies for a microservice architecture. Which TWO metrics are most suitable for triggering a page to the on-call engineer?

Select 2 answers

A.Latency P99 exceeding the SLO target for 5 minutes.

B.Error budget burn rate exceeding 10x in 1 hour.

C.Number of requests per second.

D.CPU utilization at 50%.

E.Memory usage trend.

AnswersA, B

Breaching latency SLO directly impacts users.

Why this answer

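Error budget burn rate and P99 latency breaching the SLO directly indicate customer-facing impact and require immediate attention. Requests per second, CPU utilization at 50%, and memory usage trends describe load or resource state rather than user impact, so they are better suited to dashboards or tickets than to pages.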

21

Multi-Selectmedium

You are an on-call engineer responding to a critical service incident affecting a production application. According to Google's Incident Management best practices, which TWO actions should you take immediately after declaring the incident?

Select 2 answers

A.Communicate the incident status to stakeholders and affected teams.

B.Notify the incident commander to take over coordination.

C.Begin documenting the incident for a postmortem report.

D.Roll back the latest deployment to the previous stable version.

E.Gather evidence and logs to identify the incident's impact and root cause.

AnswersA, E

Communication is a key initial step to keep everyone informed and coordinated.

Why this answer

Option A is correct because Google's Incident Management best practices emphasize that immediately after declaring an incident, the on-call engineer must communicate the incident status to stakeholders and affected teams. This ensures that everyone is aware of the ongoing issue, sets expectations, and prevents redundant troubleshooting. Early communication also helps in coordinating response efforts and reducing confusion during the critical initial phase.

Exam trap

Google Cloud often tests the distinction between immediate triage actions and later-stage mitigation or documentation steps, trapping candidates who confuse 'declaring an incident' with 'starting the fix' rather than recognizing that communication and initial impact assessment are the first priority.

22

Matchingmedium

Match each CI/CD concept to its definition.

Concepts

Matches

Automated build and test on every commit

Automated deployment to staging, manual to production

Fully automated release to production

Short-lived branches, frequent merges to main

Gradual rollout to a subset of users

Why these pairings

Key DevOps practices for reliable releases.

23

MCQmedium

You are the SRE for a financial services application running on Google Cloud. Users report that certain transactions are taking over 10 seconds, while most complete in under 200ms. You use Cloud Profiler and Cloud Trace. Upon reviewing the profiler data, you see a hotspot in a method that calls a Cloud SQL database with a slow query. You identify the query and create an index to speed it up. However, you cannot deploy the index change immediately due to change management processes. The incident response team needs to mitigate the impact now. Which temporary measure should you take?

A.Increase the database connection pool size in the application.

B.Add a database read replica to offload read queries.

C.Implement caching of query results using Cloud Memorystore.

D.Scale up the Cloud SQL instance to more vCPUs.

AnswerC

Caching reduces the need to run the slow query, providing immediate latency improvement.

Why this answer

Option C is correct because caching the results of the slow query in Cloud Memorystore (Redis) immediately reduces the load on the Cloud SQL database and eliminates the need to execute the slow query repeatedly. This provides a temporary performance improvement without requiring any database schema changes or deployments, bypassing the change management delay. The hotspot in the profiler indicates the query itself is the bottleneck, and caching avoids that bottleneck entirely for repeated reads.
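As a rough illustration of the caching idea, the sketch below shows a cache-aside lookup with a time-to-live. It uses an in-process dictionary as a stand-in for Memorystore, and `TTLCache`, `slow_query`, and the key name are hypothetical. A real deployment would use a Redis client, and for financial data the TTL has to be chosen with staleness in mind.

```python
import time
from typing import Any, Callable, Dict, Tuple

class TTLCache:
    """In-process stand-in for a Redis-style cache with per-key expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]              # cache hit: the slow query is skipped
        value = loader()                 # cache miss: run the slow query once
        self._store[key] = (now + self._ttl, value)
        return value

stats = {"calls": 0}

def slow_query() -> list:
    """Hypothetical placeholder for the slow Cloud SQL query."""
    stats["calls"] += 1
    return [("txn-1", 100)]

cache = TTLCache(ttl_seconds=30)
first = cache.get_or_load("report:today", slow_query)
second = cache.get_or_load("report:today", slow_query)
print(stats["calls"])
```

The second lookup is served from the cache, so the placeholder query runs only once within the TTL.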

Exam trap

Google Cloud often tests the distinction between temporary mitigation and permanent resolution, and the trap here is that candidates confuse scaling the database (Option D) or adding replicas (Option B) as quick fixes, when in reality those are infrastructure changes that require change management approval and cannot be deployed instantly.

How to eliminate wrong answers

Option A is wrong because increasing the database connection pool size would add more concurrent connections to the already overloaded database, potentially making the contention worse and increasing latency further. Option B is wrong because adding a read replica requires provisioning a new Cloud SQL instance and modifying application connection strings, which is a deployment change subject to the same change management processes and cannot be done immediately. Option D is wrong because scaling up the Cloud SQL instance to more vCPUs requires a database restart or a failover operation, which is a significant change that also falls under change management and does not address the root cause of the slow query.

24

MCQhard

You are a Site Reliability Engineer (SRE) for an e-commerce platform running on Google Kubernetes Engine (GKE) with a microservices architecture. Your team uses Cloud Monitoring for alerting and Cloud Logging for centralized logs. Recently, during a flash sale event, you observed intermittent latency spikes in the checkout service, causing checkout failures and abandoned carts. The latency spikes last 1-2 seconds and occur roughly every 5-10 minutes during peak traffic. The checkout service runs as a Deployment with 10 replicas, each with resource requests of 500m CPU and 512Mi memory. The service has a Service Level Objective (SLO) of 99.9% of requests completing in under 1 second (p99 latency < 1s). Current p99 latency is 2.1s during peak. You reviewed the Cloud Monitoring dashboard and noticed that CPU utilization across pods is around 60%, memory around 50%, and there are no OOM kills. The logs show occasional 'connection reset by peer' errors in the checkout service logs, but no consistent pattern. You suspect the issue might be related to the database (Cloud SQL) or a downstream dependency. After checking the database, you find that query latency is normal. You also notice that the checkout service makes a synchronous HTTP call to a payment validation service that runs as a separate Deployment with 3 replicas. The payment service's p99 latency is 500ms, but its error rate is below 1%. Your task is to identify the most likely cause of the intermittent latency spikes and propose a remediation. Which action should you take first?

A.Increase the number of replicas of the payment validation service to 10 to handle peak load.

B.Check the garbage collection logs of the checkout service pods to identify if long GC pauses coincide with the latency spikes.

C.Enable connection pooling and retries with exponential backoff in the checkout service for the HTTP call to the payment service.

D.Investigate the checkout service pod restarts due to liveness probe failures, as 'connection reset by peer' indicates pod instability.

AnswerB

Periodic latency spikes are a classic symptom of JVM garbage collection. Checking GC logs will help confirm if this is the cause.

Why this answer

The intermittent latency spikes every 5-10 minutes with no CPU/memory pressure or database issues strongly suggest a periodic process like garbage collection. Java-based services (common in microservices) can experience stop-the-world GC pauses that cause latency spikes of 1-2 seconds, matching the observed pattern. Checking GC logs is the fastest way to confirm this before making architectural changes.

Exam trap

Google Cloud often tests the ability to distinguish between symptoms (connection resets, latency spikes) and root causes (GC pauses, thread pool exhaustion) by presenting plausible but superficial fixes like scaling or retries, while the correct answer requires analyzing internal service behavior.

How to eliminate wrong answers

Option A is wrong because increasing payment service replicas does not address the root cause of intermittent latency spikes; the payment service has low p99 latency (500ms) and low error rate, so scaling it won't fix a periodic issue like GC pauses. Option C is wrong because connection pooling and retries with exponential backoff would help with transient network failures or overload, but the 'connection reset by peer' errors are likely a symptom of the checkout service itself being unresponsive during GC pauses, not a network issue. Option D is wrong because 'connection reset by peer' does not indicate liveness probe failures or pod restarts; it typically means the remote side closed the connection, which could be due to the checkout service being paused by GC, not because the pod is unstable or restarting.
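One way to test the GC hypothesis is to line up pause records against spike timestamps. The sketch below assumes the GC log has already been parsed into (start, duration) pairs. The function name, the tolerance value, and all sample numbers are hypothetical.

```python
from typing import List, Tuple

def spikes_explained_by_gc(
    spike_times: List[float],
    gc_pauses: List[Tuple[float, float]],
    tolerance_s: float = 0.5,
) -> float:
    """Fraction of latency spikes that fall inside a GC pause, plus or minus a tolerance."""
    if not spike_times:
        return 0.0
    matched = 0
    for t in spike_times:
        if any(start - tolerance_s <= t <= start + duration + tolerance_s
               for start, duration in gc_pauses):
            matched += 1
    return matched / len(spike_times)

# Hypothetical data: (pause start, pause duration) in seconds, and spike timestamps.
gc_pauses = [(300.0, 1.4), (780.0, 1.8), (1310.0, 1.2)]
spike_times = [300.6, 781.0, 1000.0, 1310.9]

ratio = spikes_explained_by_gc(spike_times, gc_pauses)
print(f"{ratio:.0%} of spikes overlap a GC pause")
```

A high overlap supports the GC explanation; a low one suggests looking elsewhere, for example at the downstream call.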

25

MCQeasy

An engineer receives an alert that a service's error rate has exceeded the threshold. To investigate, which log-based metric should the engineer query in Cloud Logging to identify the root cause?

A.Error log count grouped by service name.

B.Request latency histogram.

C.CPU utilization of the service instances.

D.Network bytes sent per instance.

AnswerA

Grouping by service reveals which service has the most errors.

Why this answer

Option A is correct because error log count grouped by service name directly surfaces which specific service is generating the elevated error rate. In Cloud Logging, log-based metrics are user-defined counters extracted from log entries; querying the error log count per service isolates the offending component, enabling root cause identification without mixing in unrelated performance data.

Exam trap

Google Cloud often tests the distinction between log-based metrics (derived from log entries) and system metrics (like CPU or network) from Cloud Monitoring, trapping candidates who confuse performance indicators with error source identification.

How to eliminate wrong answers

Option B is wrong because request latency histogram measures response times, not error rates; latency spikes can occur without errors, so it does not pinpoint the source of error threshold breaches. Option C is wrong because CPU utilization is a system-level metric from Cloud Monitoring, not a log-based metric; high CPU may correlate with errors but does not directly reveal which service or log entry caused the error rate alert. Option D is wrong because network bytes sent per instance is a network throughput metric, not derived from logs; it cannot identify error patterns or the specific service responsible for increased errors.
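The grouping in Option A amounts to counting error-level entries per service. The sketch below does that over a list of already-exported log entries. The dictionary shape and the `service` field are assumptions for illustration; only the severity names follow Cloud Logging's severity levels.

```python
from collections import Counter
from typing import Dict, Iterable

ERROR_LEVELS = {"ERROR", "CRITICAL", "ALERT", "EMERGENCY"}

def error_count_by_service(entries: Iterable[Dict[str, str]]) -> Counter:
    """Count entries at severity ERROR or above, keyed by service name."""
    counts: Counter = Counter()
    for entry in entries:
        if entry.get("severity") in ERROR_LEVELS:
            counts[entry.get("service", "unknown")] += 1
    return counts

# Hypothetical exported log entries.
logs = [
    {"service": "checkout", "severity": "ERROR"},
    {"service": "checkout", "severity": "ERROR"},
    {"service": "cart", "severity": "INFO"},
    {"service": "checkout", "severity": "CRITICAL"},
    {"service": "payment", "severity": "ERROR"},
    {"service": "cart", "severity": "WARNING"},
]

counts = error_count_by_service(logs)
print(counts.most_common(1))
```

The service at the top of the ranking is the first place to look when the error-rate alert fires.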

26

Multi-Selectmedium

Which THREE of the following are recommended practices for writing effective post-mortem documents?

Select 3 answers

A.Focus on the root cause and contributing factors.

B.Assign blame to the responsible engineer.

C.Share the findings with the entire organization.

D.Include an action plan to prevent recurrence.

E.Keep the document brief to save time.

AnswersA, C, D

Understanding the root cause prevents recurrence.

Why this answer

Option A is correct because effective post-mortem documents must focus on root cause analysis and contributing factors to identify systemic issues rather than individual errors. This aligns with the Site Reliability Engineering (SRE) principle of blameless post-mortems, which prioritize learning and process improvement over punishment. By analyzing the technical chain of events—such as a misconfigured load balancer or a cascading failure due to missing circuit breakers—teams can implement lasting fixes.

Exam trap

Google Cloud often tests the misconception that post-mortems should be brief or assign blame, tempting candidates to select options that prioritize speed or accountability over thorough, blameless analysis.

27

MCQhard

You manage a production environment with a web service deployed on Compute Engine instances behind a HTTP(S) Load Balancer. The service has a health check configured on the load balancer, probing a health endpoint every 10 seconds. After a recent configuration change, you observe that all instances are marked as unhealthy and traffic is failing. The health check response is 200 OK from the instances, but the load balancer still marks them unhealthy. The health check configuration: protocol: HTTP, port: 80, request path: /health, interval: 10s, timeout: 5s, unhealthy threshold: 2. The instances are running a custom web server. What is the most likely cause?

A.The load balancer is misconfigured with an incorrect protocol (HTTPS instead of HTTP).

B.The health check port is incorrectly configured as 80 but the service is listening on 8080.

C.The health check response is coming from a different server (e.g., a reverse proxy) that returns 200 but does not represent the actual service health.

D.The health check timeout is greater than the interval, causing overlapping probes.

AnswerC

If the response is from a different process, the load balancer may mark the instance unhealthy because the health check's interpretation of health is not satisfied.

Why this answer

Option C is correct because the 200 OK response is coming from a reverse proxy or intermediary that is not the actual web server, so it does not reflect the true health of the application. Even though that response is successful, the load balancer marks instances unhealthy because its probe is likely failing to reach the intended health endpoint on the custom web server, or the response is not coming from the service itself. This scenario often occurs when a reverse proxy (e.g., nginx) returns a static 200 for /health without forwarding the request to the backend service, so a 200 OK observed on the instance says nothing about whether the load balancer's probe is reaching the custom web server.

Exam trap

Google Cloud often tests the misconception that a 200 OK response always indicates a healthy service, but the trap here is that the health check may be hitting a different server or proxy that does not reflect the actual application state.

How to eliminate wrong answers

Option A is wrong because if the load balancer were configured with HTTPS but the instances expected HTTP, the health check would receive a connection error or a protocol mismatch, not a 200 OK response. Option B is wrong because if the health check port were 80 but the service listened on 8080, the health check would fail to connect or receive a response, not return a 200 OK. Option D is wrong because the health check timeout (5s) is less than the interval (10s), so overlapping probes are not possible; overlapping would require timeout > interval, which is not the case here.

28

MCQhard

Refer to the exhibit. You are investigating a performance issue where the api-server container is using excessive CPU. You run a Cloud Monitoring API query and receive the JSON configuration shown. However, the query returns no data points. What is the most likely cause?

A.The time interval specified is too short and falls outside the data retention period.

B.The metric type 'kubernetes.io/container/cpu/core_usage_time' is deprecated and no longer available.

C.The filter is missing required resource labels such as project_id, location, cluster_name, and namespace_name, causing no time series to match.

D.The aggregation perSeriesAligner 'ALIGN_MEAN' is incompatible with the metric type, which requires 'ALIGN_RATE'.

AnswerC

Resource labels must be fully specified in the filter to match the specific container; otherwise the query may not return data.

Why this answer

Option C is correct because the Cloud Monitoring API query for the 'kubernetes.io/container/cpu/core_usage_time' metric type requires mandatory resource labels—specifically 'project_id', 'location', 'cluster_name', and 'namespace_name'—to uniquely identify the time series for a GKE container. Without these labels in the filter, the query cannot match any time series, resulting in no data points returned, even if the metric is actively emitting data.

Exam trap

Google Cloud often tests the misconception that a missing filter causes an error or that the metric is deprecated, when in fact the API silently returns no data points because the required resource labels are absent from the filter.

How to eliminate wrong answers

Option A is wrong because a short time interval does not by itself cause zero data points; the query would simply return the data within that window, and the data retention period for GKE container metrics is typically 6 weeks, so a short, recent interval is valid. Option B is wrong because 'kubernetes.io/container/cpu/core_usage_time' is a valid, non-deprecated metric type in Cloud Monitoring for GKE; deprecation would cause a warning or error, not silent empty results. Option D is wrong because an aligner that is incompatible with the metric kind would cause the API to reject the request with an error rather than return an empty result; 'ALIGN_RATE' is the usual choice for a cumulative counter such as 'core_usage_time', but the aligner does not explain a successful query with no data points.
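A filter that names the resource labels explicitly might be assembled as in the sketch below. The function name and all label values are hypothetical; the metric type comes from the question, and `k8s_container` is the monitored resource type that carries these labels.

```python
def container_cpu_filter(project_id: str, location: str, cluster_name: str,
                         namespace_name: str, container_name: str) -> str:
    """Build a Cloud Monitoring filter string scoped to one GKE container."""
    labels = {
        "project_id": project_id,
        "location": location,
        "cluster_name": cluster_name,
        "namespace_name": namespace_name,
        "container_name": container_name,
    }
    clauses = [
        'metric.type = "kubernetes.io/container/cpu/core_usage_time"',
        'resource.type = "k8s_container"',
    ]
    # One equality clause per resource label, joined with AND.
    clauses += [f'resource.labels.{key} = "{value}"' for key, value in labels.items()]
    return " AND ".join(clauses)

# Hypothetical values for illustration only.
monitoring_filter = container_cpu_filter(
    "my-project", "us-central1", "prod-cluster", "default", "api-server"
)
print(monitoring_filter)
```

If a label value in the filter does not match any existing time series, the query still succeeds but returns nothing, which is the symptom described in the question.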

29

Multi-Selecteasy

Which TWO of the following are best practices for managing incident response on Google Cloud?

Select 2 answers

A.Automatically escalate all incidents to the engineering manager.

B.Establish a clear escalation path and ensure that on-call engineers are aware of their roles.

C.Assign only one engineer to be on call to reduce confusion.

D.Use only one notification channel (e.g., email) to keep the team focused.

E.Create a written incident response plan that defines severities, roles, and communication channels.

AnswersB, E

Clarity reduces response time.

Why this answer

Option B is correct because a clear escalation path with defined roles ensures that on-call engineers know exactly whom to contact for different severity levels, reducing response time and preventing miscommunication. Google Cloud's operations suite (formerly Stackdriver) supports structured escalation policies through alerting channels and notification routing, making this a foundational best practice for incident management.

Exam trap

Google Cloud often tests the misconception that simplicity (e.g., single on-call engineer or single notification channel) is a best practice, when in reality redundancy and multi-channel communication are critical for reliability.

30

MCQeasy

A team uses Cloud SQL for PostgreSQL. They receive an alert that the database's CPU utilization is above 95% for the past 30 minutes. Queries are taking longer than usual. They want to investigate without causing further impact. What should they do first?

A.Increase the number of vCPUs of the Cloud SQL instance

B.Restart the Cloud SQL instance to clear the cache

C.Migrate the database to Cloud Spanner

D.Use Cloud SQL Query Insights to find the most time-consuming queries

AnswerD

Query Insights shows top queries by CPU and latency.

Why this answer

Cloud SQL Query Insights is a managed monitoring tool that automatically captures and analyzes query performance metrics, including CPU consumption, latency, and execution plans. In this scenario, it allows the team to identify the specific queries causing high CPU utilization without making any changes to the instance, thus avoiding further impact. This is the first and safest diagnostic step before any remediation.

Exam trap

Google Cloud often tests the principle that the first step in incident management is always to gather diagnostic data without making changes, and the trap here is that candidates may jump to scaling (Option A) or restarting (Option B) as quick fixes, ignoring that these actions can cause further disruption and do not provide root-cause analysis.

How to eliminate wrong answers

Option A is wrong because increasing vCPUs is a scaling action that may temporarily reduce performance during the resize operation and does not address the root cause—it only masks symptoms without identifying the problematic queries. Option B is wrong because restarting the instance clears the buffer cache, which can actually worsen performance temporarily as caches rebuild, and it does not provide any diagnostic information about what caused the high CPU. Option C is wrong because migrating to Cloud Spanner is a major architectural change that is not appropriate for initial investigation; it is a costly, complex migration that should only be considered after identifying that the workload fundamentally requires a horizontally scalable database.

31

MCQhard

A team defines an SLO of 99.9% availability over a 30-day window. They use a multi-window, multi-burn-rate alerting approach. Which alerting condition should trigger a page based on fast burn rate?

A.1% error budget consumed in 1 hour.

B.10% error budget consumed in 1 hour.

C.2% error budget consumed in 10 minutes.

D.5% error budget consumed in 6 hours.

AnswerC

This is a fast burn rate, consuming budget quickly; should trigger a page.

Why this answer

Option C is correct because a multi-window, multi-burn-rate alerting approach uses a short window (e.g., 10 minutes) to detect fast burn rates that could rapidly exhaust the error budget. Consuming 2% of the error budget in 10 minutes corresponds to a burn rate of about 86x (0.02 divided by the fraction of the 30-day window that 10 minutes represents, 10/43,200), the highest of the four options. At that pace the whole budget would be exhausted in roughly 8 hours, which requires immediate paging to prevent a breach of the 99.9% SLO over the 30-day window.

Exam trap

Google Cloud often tests the distinction between burn rate thresholds and time windows, where candidates mistakenly associate a larger percentage consumed (like 10% or 5%) with fast burn, without considering the window duration and the resulting burn rate.

How to eliminate wrong answers

Option A is wrong because consuming 1% of the error budget in 1 hour is a burn rate of about 7.2x relative to the 30-day (720-hour) window, far below Option C's rate. Option B is wrong because, although 10% consumed in 1 hour is a steep burn of about 72x, it is measured over a longer window and is still below Option C's roughly 86x, so it is not the fastest-burn condition listed. Option D is wrong because 5% consumed in 6 hours is a burn rate of about 6x, the slowest of the four options; it belongs to a long-window condition rather than the fast-burn one.
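The burn rates for the four options can be checked with a few lines of arithmetic. The sketch below uses the common definition of burn rate: the fraction of the error budget consumed divided by the fraction of the SLO window that has elapsed. The function and variable names are hypothetical.

```python
def burn_rate(budget_fraction_consumed: float, alert_window_hours: float,
              slo_window_hours: float = 30 * 24) -> float:
    """Burn rate = budget fraction consumed / fraction of the SLO window elapsed."""
    return budget_fraction_consumed / (alert_window_hours / slo_window_hours)

# (budget fraction consumed, alert window in hours) for each answer option.
options = {
    "A": (0.01, 1.0),
    "B": (0.10, 1.0),
    "C": (0.02, 10 / 60),
    "D": (0.05, 6.0),
}

rates = {name: round(burn_rate(frac, hours), 1)
         for name, (frac, hours) in options.items()}
fastest = max(rates, key=rates.get)
print(rates, fastest)
```

A burn rate of 1x means the budget lasts exactly the 30-day window, so dividing 720 hours by the burn rate gives the time until the budget is exhausted at that pace.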

32

MCQmedium

Refer to the exhibit. A Cloud Function (2nd gen) is timing out. The function's timeout is set to 60 seconds. The function queries a Cloud SQL database. What is the most likely cause and the best action?

A.Reduce the function's allocated memory to decrease cold start time

B.Add indexes to the database tables queried by the function

C.Increase the function timeout to 120 seconds

D.Increase the Cloud SQL max connections setting

AnswerB

Slow queries often indicate missing indexes; adding them reduces query time.

Why this answer

The most likely cause of the timeout is that the database queries are slow due to missing indexes, causing the function to wait longer than its 60-second timeout for results. Adding indexes to the queried columns reduces query execution time, resolving the timeout without changing the function's configuration. This aligns with best practices for optimizing Cloud SQL queries in serverless environments.

Exam trap

Google Cloud often tests the misconception that increasing timeouts or resources is the default fix for timeouts, when the real issue is almost always unoptimized queries or missing indexes in database-backed serverless functions.

How to eliminate wrong answers

Option A is wrong because reducing memory typically increases cold start time (as less CPU is allocated), which would worsen performance, not fix a timeout caused by slow queries. Option C is wrong because increasing the timeout to 120 seconds only masks the symptom; it does not address the root cause of slow database queries, and the function may still fail if queries remain unoptimized. Option D is wrong because increasing Cloud SQL max connections does not speed up individual queries; it only allows more concurrent connections, which could even increase database load and worsen latency.

33

MCQhard

A multinational company runs an application on Google Cloud with an SLO of 99.99% monthly availability. They use a multi-region deployment with Cloud Load Balancing and Cloud Spanner. During a regional outage in us-central1, traffic fails over to us-east1. However, the incident response team is not alerted because the error budget burn rate remained below the alert threshold. What should the team change to ensure timely alerting for such regional failures?

A.Shorten the SLO compliance window from 30 days to 7 days.

B.Create a custom dashboard and alert for regional unavailability using Cloud Monitoring metrics like load_balancing/backend_request_count and region health checks.

C.Change the SLO to 99.9% to allow more error budget.

D.Reduce the error budget burn rate alert threshold from 10% to 5% per hour.

AnswerB

Direct alerts for regional failures catch issues early.

Why this answer

Option B is correct because a 'signals of possible trouble' dashboard and alerts for regional failures provide early warning even before the SLO is breached. Option A is wrong because the burn rate is calculated over the compliance window; shortening that window might help but could increase noise, and it still does not detect a regional failure directly. Option C is wrong because changing the SLO to 99.9% would increase the error budget but not address alerting. Option D is wrong because lowering the burn rate alert threshold might help but adds noise, and it is not the best practice for regional failures.
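The reason a regional alert catches what the burn-rate alert misses can be sketched with per-region request rates. The function name, the 50% drop threshold, and all numbers below are hypothetical; in practice the inputs would come from load balancer metrics broken down by backend region.

```python
from typing import Dict, List

def regions_with_traffic_drop(baseline_rps: Dict[str, float],
                              current_rps: Dict[str, float],
                              drop_threshold: float = 0.5) -> List[str]:
    """Return regions serving less than drop_threshold of their baseline rate."""
    flagged = []
    for region, base in baseline_rps.items():
        if base <= 0:
            continue  # no meaningful baseline to compare against
        if current_rps.get(region, 0.0) < drop_threshold * base:
            flagged.append(region)
    return sorted(flagged)

# Hypothetical request rates (requests per second) before and during the outage.
baseline = {"us-central1": 1200.0, "us-east1": 1100.0}
during_outage = {"us-central1": 15.0, "us-east1": 2250.0}

flagged = regions_with_traffic_drop(baseline, during_outage)
total_before = sum(baseline.values())
total_during = sum(during_outage.values())
print(flagged, total_before, total_during)
```

Total traffic barely changes because failover absorbs it, so a global signal stays quiet while the per-region check flags us-central1.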

34

MCQhard

During a post-mortem, you identify that an incident was caused by a configuration change that was not reviewed. Which of the following is the most effective preventive action?

A.Add more monitoring alerts.

B.Implement a change management process with mandatory peer review.

C.Schedule weekly meetings to review changes.

D.Use a configuration management database (CMDB).

AnswerB

Peer review catches misconfigurations before deployment.

Why this answer

Option B is correct because a change management process with mandatory peer review directly addresses the root cause: a configuration change was made without oversight. By requiring at least one additional engineer to review and approve changes before implementation, the process catches misconfigurations, policy violations, or unintended side effects before they reach production. This is a preventive control, not a detective or corrective one, and aligns with ITIL best practices for change management.

Exam trap

Google Cloud often tests the distinction between preventive controls (like peer review) and detective controls (like monitoring), leading candidates to mistakenly choose monitoring alerts because they seem proactive, when in fact they only detect failures after they happen.

How to eliminate wrong answers

Option A is wrong because adding more monitoring alerts is a detective control that only notifies you after the incident has already occurred; it does not prevent the unreviewed configuration change from causing the incident. Option C is wrong because scheduling weekly meetings to review changes is a reactive, after-the-fact review that does not prevent the change from being applied without review; the damage is already done by the time the meeting occurs. Option D is wrong because a configuration management database (CMDB) is a repository for storing configuration item data and relationships; it does not enforce any review or approval workflow, so an unreviewed change can still be made without any preventive barrier.

35

MCQmedium

During a canary deployment of a new version of a microservice, the engineer notices increased error rates in the canary instances. What is the best immediate action?

A.Continue the rollout to see if errors stabilize.

B.Perform a rollback of the canary to the previous version.

C.Scale up the canary instances to handle load.

D.Pause the rollout and investigate the errors.

AnswerB

Rolling back immediately stops the errors and protects users.

Why this answer

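When error rates increase in a canary, the safest immediate action is to roll back the canary to prevent further impact on users. Pausing and investigating is reasonable, but the canary keeps serving errors to its share of users while the investigation runs. Scaling up the canary does not fix the faulty version and adds capacity for the version that is producing errors. Continuing the rollout would expose more users to a version that is already showing elevated errors.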

36

MCQ · hard

Refer to the exhibit. Your team deployed a new revision to Cloud Run. After deployment, error rates increased. You want to roll back to the previous revision, which is still serving. Which command should you use?

A. gcloud run services update-traffic my-service --to-revisions=my-service-00001-caz=100

B. gcloud run services rollback my-service

C. gcloud run revisions delete my-service-00002-caw

D. gcloud run deploy my-service --image gcr.io/my-project/my-image:v1

Answer: A

This command sends 100% traffic to the previous revision.

Why this answer

Option A is correct because `gcloud run services update-traffic` allows you to precisely control traffic splitting between revisions. By setting `--to-revisions=my-service-00001-caz=100`, you direct 100% of incoming requests to the previous revision, effectively rolling back without deleting the current revision. This command is the standard method for traffic-based rollbacks in Cloud Run.

Exam trap

Google Cloud often tests the misconception that a 'rollback' command exists for Cloud Run, but the correct approach is to use traffic management commands like `update-traffic` to shift traffic away from the problematic revision.

How to eliminate wrong answers

Option B is wrong because `gcloud run services rollback` is not a valid command in the gcloud CLI; Cloud Run does not have a built-in rollback subcommand, so this would result in an error. Option C is wrong because deleting the current revision (`my-service-00002-caw`) is not a traffic operation: it does not shift requests to the previous revision, and Cloud Run rejects deletion of a revision that is still assigned traffic. Option D is wrong because `gcloud run deploy` with a previous image creates a new revision (e.g., `my-service-00003`) rather than reverting to the existing previous revision, which may introduce additional changes and does not leverage the already-serving revision.

37

MCQ · hard

In Google's incident management process, which role is responsible for communication with stakeholders and users during an incident?

A. Incident Commander.

B. Communications Lead.

C. Technical Lead.

D. Operations Lead.

Answer: B

The Communications Lead manages all communication with stakeholders.

Why this answer

In Google's incident management process, the Communications Lead is explicitly responsible for managing all external and internal communications, including updates to stakeholders and users. This role ensures that accurate, timely information is disseminated while the Incident Commander focuses on coordinating the response. The Communications Lead does not engage in technical troubleshooting or operational tasks, which are handled by other roles.

Exam trap

Google Cloud often tests the misconception that the Incident Commander handles all aspects of an incident, including communication, but in Google's model, the Incident Commander delegates communication to a dedicated Communications Lead to maintain focus on coordination.

How to eliminate wrong answers

Option A is wrong because the Incident Commander is responsible for overall coordination and decision-making during the incident, not for direct stakeholder communication; they delegate that to the Communications Lead. Option C is wrong because the Technical Lead focuses on diagnosing and resolving the technical issue, not on communicating with stakeholders or users. Option D is wrong because the Operations Lead handles operational tasks such as resource allocation and infrastructure management, not stakeholder communication.

38

Multi-Select · medium

Which THREE of the following are valid techniques for mitigating a denial-of-service (DoS) attack against a Google Cloud HTTP(S) Load Balancer?

Select 3 answers

A. Increase the number of backend instances to absorb traffic.

B. Enable autoscaling on the backend services to handle increased load.

C. Modify VPC firewall rules to block all traffic from the source IP.

D. Configure rate limiting per client IP using Cloud Armor or the load balancer's settings.

E. Enable Cloud Armor and create a security policy to block suspicious IP addresses.

Answers: B, D, E

Helps absorb legitimate traffic surge.

Why this answer

Option B is correct because enabling autoscaling on backend services lets the backend instance groups add instances automatically in response to increased traffic, helping to absorb a DoS attack by scaling out capacity. This is a valid mitigation technique as it leverages Google Cloud's managed scaling to maintain service availability under load.

Exam trap

Google Cloud often tests the misconception that manually increasing backend instances (Option A) is a valid real-time mitigation technique, but in practice, autoscaling (Option B) is the correct automated approach, and candidates may overlook that firewall rules (Option C) cannot block application-layer attacks on a load balancer.

39

Multi-Select · medium

A team uses Google Kubernetes Engine (GKE) with cluster telemetry enabled. During an incident, they notice that a deployment's pods are repeatedly crashing with Exit Code 137. The team wants to investigate the root cause. Which two Google Cloud services should they use together to correlate resource usage and logs?

Select 2 answers

A. Cloud Monitoring and Cloud Logging

B. Security Command Center and Cloud Logging

C. Cloud Trace and Cloud Monitoring

D. Cloud Error Reporting and Cloud Logging

Answers: A, C

Monitoring shows resource usage; Logging shows container logs and OOM events.

Why this answer

Exit Code 137 indicates that a container was killed by SIGKILL (signal 9), typically due to an out-of-memory (OOM) condition. Cloud Monitoring provides metrics such as memory usage and OOM kill counts, while Cloud Logging captures the container's termination logs and system events. By correlating these two services, the team can identify when memory usage spiked and confirm that the pod was OOM-killed, enabling root cause analysis.
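
The number 137 follows the common convention of reporting 128 plus the signal number when a process is terminated by a signal. A minimal Python sketch of that decoding, assuming a POSIX system where SIGKILL is signal 9; the helper name `signal_from_exit_code` is illustrative and not part of any Google Cloud tooling:

```python
import signal
from typing import Optional


def signal_from_exit_code(exit_code: int) -> Optional[str]:
    """Return the signal name encoded in an exit code above 128, else None."""
    if exit_code <= 128:
        return None  # ordinary exit status, no signal involved
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        return None  # number does not map to a known signal on this platform


print(signal_from_exit_code(137))
```

The exit code alone only says the process received SIGKILL; confirming that the kill was memory-related still requires the metrics and logs described above.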

Exam trap

Google Cloud often tests the distinction between services that handle metrics (Cloud Monitoring) versus logs (Cloud Logging) versus errors (Cloud Error Reporting), and the trap here is that candidates may confuse Cloud Error Reporting with Cloud Logging, not realizing that Error Reporting only surfaces application-level exceptions, not system-level OOM kills or resource metrics.

40

MCQ · medium

A team has configured an uptime check with a 5xx threshold alert. During an incident, the alert fires with severity 'critical'. The team mitigates the issue, but the alert keeps firing for 15 more minutes due to a slow-responding downstream dependency. What should the team do to avoid false alarms in future incidents?

A. Add a second notification channel to send alerts to a different team.

B. Increase the 'duration' field in the alerting policy to require the condition to be true for a longer time before alerting.

C. Modify the alert condition to check only for 5xx errors and ignore other status codes.

D. Decrease the check frequency to every 30 seconds to get faster feedback.

Answer: B

A longer duration reduces false alerts from transient issues.

Why this answer

Option B is correct because increasing the 'duration' field in the alerting policy ensures that the condition (e.g., 5xx errors) must persist for a longer, defined period before the alert fires. This prevents false alarms from transient issues like a slow-responding downstream dependency that temporarily triggers the threshold but resolves before the alert duration expires. In Google Cloud Monitoring, the duration parameter specifies the minimum time the condition must be true, filtering out short-lived spikes.

Exam trap

Google Cloud often tests the misconception that increasing alert sensitivity (e.g., faster checks) or adding more notification channels improves incident response, when the correct approach is to tune the alert duration to match the expected persistence of the underlying issue.

How to eliminate wrong answers

Option A is wrong because adding a second notification channel does not address the root cause of false alarms; it merely duplicates alerts to another team, increasing noise without reducing false positives. Option C is wrong because the alert already checks for 5xx errors (as stated in the question), and ignoring other status codes would not prevent false alarms caused by a slow downstream dependency that still returns 5xx errors. Option D is wrong because running the check more often (every 30 seconds) would make the alert more sensitive to transient conditions, likely increasing false alarms rather than reducing them.

41

MCQ · medium

A DevOps engineer is troubleshooting a production incident where users are getting 502 errors from a Google Cloud HTTP(S) Load Balancer. The backend service is a GKE deployment. Initial checks show the backend pods are healthy and responding. What is the most likely cause?

A. The load balancer's health check is failing on the backend instance group due to mismatch between health check port and backend port.

B. The backend pods are out of memory and crashing.

C. The IAM permissions for the load balancer service account are misconfigured.

D. The backend service has been accidentally deleted by another engineer.

Answer: A

502 errors indicate the backend is unhealthy to the load balancer.

Why this answer

A 502 error from an HTTP(S) Load Balancer indicates that the load balancer could not get a valid response from a backend. Even though the backend pods are healthy and responding, the load balancer's health check may be failing because it probes a different port than the one the backend actually serves traffic on. This mismatch causes the load balancer to mark the backends as unhealthy, and with no healthy backend available it returns 502 errors to users.

Exam trap

Google Cloud often tests the distinction between backend health and health check configuration, where candidates assume that if pods are healthy, the load balancer must also see them as healthy, ignoring the port mismatch or firewall rules that block health check probes.

How to eliminate wrong answers

Option B is wrong because if the backend pods were out of memory and crashing, they would not be 'healthy and responding' as stated in the question; the engineer's initial checks would have found them unhealthy. Option C is wrong because IAM permissions for the load balancer service account affect the load balancer's ability to access Google Cloud APIs (e.g., to read instance groups), not the direct HTTP connection between the load balancer and backend pods; a misconfiguration here would typically cause a 500 or 403 error, not a 502. Option D is wrong because if the backend service were deleted, the load balancer would have no target to forward traffic to, resulting in a 503 or 404 error, not a 502; the engineer's checks confirm the backend service exists and pods are responding.

42

MCQ · hard

Refer to the exhibit. A GKE pod is repeatedly crashing with the error shown. The deployment has resource requests of 512 MiB memory and limits of 1 GiB. What is the most likely cause and the best remediation?

A. The Java heap size exceeds the container memory limit; reduce the JVM heap size or increase the container memory limit

B. The node is under memory pressure; add more nodes to the cluster

C. The container needs more CPU; increase CPU request and limit

D. The application has a memory leak; refactor the DataProcessor class

Answer: A

JVM heap must fit within the container limit to avoid OOM.

Why this answer

The error indicates an OutOfMemoryError (OOM) in the Java application, which points to a mismatch between the JVM's memory settings and the container's memory limit. The deployment has a memory limit of 1 GiB, so if the JVM is configured with a maximum heap larger than this limit (or if the heap plus non-heap usage such as metaspace and thread stacks exceeds it), the container is OOM-killed and Kubernetes restarts it. Reducing the JVM heap size or increasing the container memory limit directly resolves the mismatch.
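
The sizing check behind this answer is simple arithmetic. A rough Python sketch, in which only the 1 GiB limit comes from the question; the heap and non-heap figures and the helper `fits_in_container` are hypothetical values chosen for illustration:

```python
MIB_PER_GIB = 1024


def fits_in_container(max_heap_mib: int, non_heap_mib: int, limit_mib: int) -> bool:
    """True when heap plus non-heap memory stays within the container limit."""
    return max_heap_mib + non_heap_mib <= limit_mib


limit_mib = 1 * MIB_PER_GIB  # container memory limit from the question

# Hypothetical: a 1 GiB heap leaves no room for non-heap memory.
print(fits_in_container(max_heap_mib=1024, non_heap_mib=256, limit_mib=limit_mib))  # False

# Hypothetical: a 640 MiB heap plus 256 MiB of non-heap memory fits.
print(fits_in_container(max_heap_mib=640, non_heap_mib=256, limit_mib=limit_mib))  # True
```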

Exam trap

Google Cloud often tests the distinction between application-level errors (like JVM OOM) and infrastructure-level issues (like node pressure), tempting candidates to choose a cluster scaling solution when the root cause is a misconfigured application resource limit.

How to eliminate wrong answers

Option B is wrong because node memory pressure would cause pod eviction or scheduling failures, not a Java-specific OOM error within a running container; adding nodes does not fix the application's memory configuration. Option C is wrong because the error is an OutOfMemoryError, not a CPU starvation issue; increasing CPU resources would not prevent the JVM from exceeding the memory limit. Option D is wrong because while a memory leak could cause OOM over time, the immediate error message points to heap size exceeding limits, and refactoring the DataProcessor class is a speculative fix that does not address the explicit memory limit configuration.

43

MCQ · easy

You are the DevOps engineer for a social media platform. After a recent code rollout, you receive multiple user complaints about failed logins. The service logs show a sharp increase in 5xx errors from the authentication service. However, the existing alerting policy for the authentication service did not fire. The policy is configured to trigger if the error rate exceeds 5% for 5 minutes. Upon checking Cloud Monitoring, you see that the error rate spiked to 15% for 3 minutes, then dropped back to normal. What is the most likely reason the alert did not fire?

A. The error rate threshold of 5% was too low, causing the alert to be suppressed.

B. The alignment period for the metric was set to 5 minutes, hiding the spike.

C. The duration condition of 5 minutes was not satisfied.

D. The notification channel was incorrectly configured.

Answer: C

The spike lasted 3 minutes, less than required 5 minutes.

Why this answer

The alert did not fire because the policy requires the error rate to exceed 5% for a continuous duration of 5 minutes. The spike only lasted 3 minutes, which is shorter than the configured duration condition, so the alerting policy's condition was never fully met. In Google Cloud Monitoring, alerting policies evaluate both the threshold and the duration window before transitioning to a firing state.
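
The interaction between threshold and duration can be sketched in a few lines of Python. This is a simplified model, assuming one error-rate sample per minute and ignoring alignment and aggregation; the function `alert_fires` and the sample series are illustrative, not the Cloud Monitoring evaluation engine:

```python
def alert_fires(samples, threshold, duration_minutes):
    """Return True if samples exceed threshold for duration_minutes in a row.

    samples holds one error-rate value per minute.
    """
    consecutive = 0
    for rate in samples:
        # Any sample at or below the threshold resets the streak.
        consecutive = consecutive + 1 if rate > threshold else 0
        if consecutive >= duration_minutes:
            return True
    return False


# Scenario from the question: normal traffic, a 15% spike for 3 minutes, then recovery.
samples = [0.01] * 5 + [0.15] * 3 + [0.01] * 5

print(alert_fires(samples, threshold=0.05, duration_minutes=5))  # False
print(alert_fires(samples, threshold=0.05, duration_minutes=3))  # True
```

With the same threshold, shortening the duration to 3 minutes would have caught this spike, at the cost of more alerts on transient bursts.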

Exam trap

Google Cloud often tests the distinction between threshold-based alerts and duration-based conditions, tricking candidates into focusing on the threshold value or notification channels when the real issue is the unmet time window requirement.

How to eliminate wrong answers

Option A is wrong because a lower threshold (5%) would make the alert more sensitive, not suppress it; the spike exceeded 5%, so the threshold was not the issue. Option B is wrong because the alignment period controls how raw data points are aggregated into a time series, while the 5-minute duration is a separate parameter that requires the threshold to be breached for that entire window. A 5-minute alignment period would smooth out short spikes, but in either case a 3-minute spike does not satisfy the 5-minute duration. Option D is wrong because the notification channel configuration only affects delivery of the alert, not whether the alert fires; if the policy's conditions are not met, no alert is generated regardless of channel settings.

44

MCQ · medium

After a recent deployment, the mean latency of a user-facing service increased from 200ms to 500ms. The engineer uses Cloud Trace to analyze traces. Which trace characteristic should the engineer focus on to identify the bottleneck?

A. Timestamps of the trace ID.

B. Distribution of span latencies across services.

C. Error count per span.

D. Total number of spans in the trace.

Answer: B

Span latencies show how long each service took, pinpointing the slowest.

Why this answer

The engineer should focus on the distribution of span latencies across services (Option B) because Cloud Trace captures the latency of each span in a distributed trace. By examining the histogram or distribution of span latencies, the engineer can identify which specific service or component is contributing the most to the overall increase from 200ms to 500ms, pinpointing the bottleneck. This approach aligns with the principle of distributed tracing, where end-to-end latency is dominated by the slowest spans on the critical path.
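
The comparison can be illustrated with a small Python sketch. It does not call the Cloud Trace API; the service names and span durations are made-up values chosen so that each trace totals 200 ms before the deployment and 500 ms after it:

```python
from collections import defaultdict
from statistics import mean


def mean_latency_by_service(spans):
    """Group (service, duration_ms) pairs and average the durations per service."""
    grouped = defaultdict(list)
    for service, duration_ms in spans:
        grouped[service].append(duration_ms)
    return {service: mean(values) for service, values in grouped.items()}


def largest_regression(before, after):
    """Return the service whose mean span latency grew the most, with the delta."""
    old = mean_latency_by_service(before)
    new = mean_latency_by_service(after)
    deltas = {service: new[service] - old.get(service, 0) for service in new}
    worst = max(deltas, key=deltas.get)
    return worst, deltas[worst]


# Hypothetical spans from two traces before and two traces after the deployment.
before = [("frontend", 40), ("auth", 30), ("db", 130)] * 2
after = [("frontend", 42), ("auth", 33), ("db", 425)] * 2

print(largest_regression(before, after))
```

In this made-up data the `db` spans account for almost all of the 300 ms increase, which is the kind of signal the per-service latency distribution is meant to surface.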

Exam trap

Google Cloud often tests the misconception that timestamps or error counts are the primary indicators of performance bottlenecks, but in distributed tracing, the distribution of span latencies is the key to identifying which service is the root cause of increased latency.

How to eliminate wrong answers

Option A is wrong because timestamps of the trace ID only indicate when the trace started and ended, not the relative performance of individual services; they cannot reveal which service caused the latency increase. Option C is wrong because error count per span focuses on failures, not performance degradation; a service can have zero errors yet still be the bottleneck due to high latency. Option D is wrong because the total number of spans in the trace reflects the complexity or depth of the request path, not the latency contribution of any single service; a trace with many spans can still have a single slow span causing the bottleneck.

45

Multi-Select · medium

You are responding to an incident where a new release has caused increased error rates. Which TWO actions should you take immediately?

Select 2 answers

A. Disable the alert.

B. Notify stakeholders.

C. Push a hotfix without testing.

D. Roll back the release.

E. Create a post-mortem document.

Answers: B, D

Keeping stakeholders informed is critical during an incident.

Why this answer

Option B is correct because immediately notifying stakeholders (such as product owners, support teams, and affected users) is a critical first step in incident management. It ensures transparency, sets expectations, and allows coordinated response efforts. On the PCDOE exam, stakeholder communication is prioritized to maintain trust and align business impact with technical remediation.

Exam trap

Google Cloud often tests the distinction between immediate containment actions (rollback, notification) versus post-incident tasks (post-mortem) or harmful actions (disabling alerts, untested hotfixes) to see if candidates understand the priority of stopping user impact over preserving data or process.

46

Multi-Select · medium

Which TWO practices help reduce Mean Time to Resolve (MTTR) for production incidents?

Select 2 answers

A. Conduct postmortems only after major incidents.

B. Implement runbooks for common incident types.

C. Use a shared on-call rotating schedule.

D. Establish a war room procedure for critical incidents.

E. Increase logging verbosity for all services.

Answers: B, D

Runbooks provide step-by-step guidance, speeding up resolution.

Why this answer

B is correct because runbooks provide step-by-step, pre-approved procedures for common incident types, enabling engineers to follow a consistent, repeatable process without needing to diagnose from scratch. This reduces the time spent on investigation and decision-making, directly lowering Mean Time to Resolve (MTTR) by standardizing the response for known issues.

Exam trap

Google Cloud often tests the distinction between practices that directly reduce MTTR (like runbooks and war rooms) versus practices that improve reliability or team health (like postmortems and on-call schedules) but do not directly shorten the resolution time during an incident.

47

MCQ · hard

Your team uses a canary deployment strategy on Google Kubernetes Engine (GKE). During a rollback, you notice that the rollback caused a brief period of downtime because the previous version's readiness probe was not properly configured. Which of the following best prevents this issue in the future?

A. Perform a gradual rollback with a managed instance group.

B. Use a blue/green deployment instead.

C. Ensure that the readiness probe is tested as part of the pre-deployment validation.

D. Use a Kubernetes Job to run a post-deployment validation.

Answer: C

Validating probes before deployment ensures both new and old versions are ready.

Why this answer

Option C is correct because the root cause of the downtime was a misconfigured readiness probe on the previous version. By testing the readiness probe as part of pre-deployment validation, you ensure that the probe correctly reflects the application's ability to serve traffic before it is used in a rollback. This prevents the scenario where a rollback deploys a version that fails its readiness check, causing the service to be removed from the load balancer and resulting in downtime.

Exam trap

Google Cloud often tests the misconception that changing deployment strategies (like blue/green or canary) solves all rollback issues, when in fact the real problem is a misconfigured health check that must be validated before the rollback is executed.

How to eliminate wrong answers

Option A is wrong because a managed instance group (MIG) is a Compute Engine concept, not a native GKE resource; gradual rollbacks in GKE are handled by Deployment strategies (e.g., maxSurge/maxUnavailable), and a MIG does not address readiness probe misconfiguration. Option B is wrong because blue/green deployment is a release strategy that reduces risk during rollouts, but it does not inherently validate readiness probes; a rollback in blue/green still relies on the same probe configuration, so a misconfigured probe would cause the same downtime. Option D is wrong because a Kubernetes Job runs a post-deployment validation, which occurs after the deployment is live; it cannot prevent the downtime that happens during the rollback itself, as the probe failure causes immediate traffic disruption before the Job executes.

48

MCQ · easy

A DevOps engineer receives an alert that the error budget for a critical service has been exhausted. The service runs on Compute Engine behind an HTTP(S) load balancer. The team wants to reduce the impact on users while investigating. What should the engineer do first?

A. Roll back the most recent deployment

B. Begin a detailed postmortem analysis

C. Disable the alerting policy to reduce noise

D. Increase the number of instances in the managed instance group

Answer: A

Rolling back quickly restores the previous stable version.

Why this answer

Rolling back the most recent deployment is the correct first action because it immediately restores the service to a known stable state, stopping further consumption of the error budget. This aligns with the incident management principle of 'mitigate first, investigate later' — reducing user impact takes priority over root cause analysis. The HTTP(S) load balancer will automatically route traffic to the previous healthy version once the rollback is complete.

Exam trap

Google Cloud often tests the misconception that scaling out (increasing instances) is the correct response to any degradation, but here the error budget exhaustion indicates a functional defect, not a capacity issue, so scaling would not fix the root cause.

How to eliminate wrong answers

Option B is wrong because beginning a detailed postmortem analysis is a later step; the immediate priority is to restore service, not analyze the incident. Option C is wrong because disabling the alerting policy would hide the problem rather than fix it, violating the principle of observability and potentially allowing further degradation. Option D is wrong because increasing the number of instances in the managed instance group does not address the root cause (likely a code or configuration defect) and may only temporarily mask the issue while continuing to exhaust the error budget.

49

MCQ · easy

You are debugging a production issue where a Cloud Function occasionally throws a 'memory limit exceeded' error. You want to inspect the memory usage at the time of the error. What should you do?

A. Check Cloud Logging for memory metrics.

B. Use Cloud Trace to trace the invocations.

C. Use Cloud Debugger to set a breakpoint.

D. Enable Cloud Profiler and analyze the snapshot.

Answer: D

Profiler provides memory and CPU profiling snapshots.

Why this answer

Option D is correct because Cloud Profiler provides continuous, low-overhead profiling of CPU and memory usage, and its heap profiles show memory allocation patterns around the time of a memory limit exceeded error. Unlike the other options, Profiler statistically samples call stacks and heap allocations, so you can attribute memory consumption to specific code paths and identify the one causing the spike.

Exam trap

Google Cloud often tests the distinction between monitoring (logging/tracing) and profiling, leading candidates to choose Cloud Logging or Cloud Trace because they are more familiar, while the correct answer requires a tool specifically designed for memory analysis.

How to eliminate wrong answers

Option A is wrong because Cloud Logging does not natively expose memory metrics for Cloud Functions; it logs textual events and errors, but memory usage is not a standard log entry unless explicitly instrumented. Option B is wrong because Cloud Trace focuses on latency and request tracing, not memory profiling; it can show execution time but not memory allocation details. Option C is wrong because Cloud Debugger is designed for inspecting code state at a breakpoint without stopping execution, but it cannot capture memory usage snapshots or profile memory over time, and it may alter the function's runtime behavior.

50

Multi-Select · hard

An incident is declared for a production service running on GKE. The on-call engineer suspects a recent code change may have introduced a memory leak. Which THREE actions should the engineer take to investigate and mitigate?

Select 3 answers

A. Increase the memory limit for the container as a temporary mitigation

B. Scale down the number of replicas to reduce memory pressure

C. Roll back the deployment immediately without further investigation

D. Check container logs for Out of Memory (OOM) killed messages

E. Compare memory usage metrics before and after the deployment using Cloud Monitoring

Answers: A, D, E

Temporary increase buys time for a permanent fix.

Why this answer

Option A is correct because increasing the memory limit for the container provides a temporary mitigation to prevent the service from being killed by the Out of Memory (OOM) killer while the root cause is investigated. In GKE, the container's memory limit is defined in the pod spec under `resources.limits.memory`, and raising it gives the application more headroom to continue serving requests without immediate termination. This is a standard incident response practice to buy time for deeper analysis, such as reviewing logs and metrics, before applying a permanent fix.

Exam trap

Google Cloud often tests the misconception that scaling down replicas reduces memory pressure, when in fact it reduces total available memory and can worsen the impact of a memory leak.

51

MCQ · hard

Based on the log entry, what is the most likely cause of the 404 error?

A. The user does not have permission to invoke the service.

B. The revision is not configured with the correct container port.

C. The Cloud Run service is not autoscaling properly, causing requests to be dropped.

D. The service has run out of memory.

Answer: B

A 404 often means the container is listening on a different port than what Cloud Run expects.

Why this answer

A 404 error on Cloud Run typically indicates that the request reached the service but no container is listening on the configured port. If the revision's container port does not match the port the application is actually serving on (e.g., the app listens on 8080 but the revision is configured for 3000), Cloud Run's HTTP ingress will fail to route traffic, resulting in a 404. This is the most likely cause because the error is not a permission or resource issue, but a routing mismatch at the container level.

Exam trap

Google Cloud often tests the distinction between HTTP status codes (404 vs 403 vs 503 vs 500) and their root causes in serverless environments, trapping candidates who confuse permission errors with routing misconfigurations.

How to eliminate wrong answers

Option A is wrong because a 403 Forbidden error, not a 404, would occur if the user lacks permission to invoke the service (IAM permissions control invocation, not routing). Option C is wrong because autoscaling issues typically cause 503 Service Unavailable or 429 Too Many Requests errors, not 404s; a 404 indicates the service exists but the endpoint is not reachable. Option D is wrong because running out of memory would cause the container to crash or return a 500 Internal Server Error, not a 404; memory limits affect container health, not HTTP routing.

52

Drag & Drop · medium

Arrange the steps to migrate a monolithic application to microservices on Google Kubernetes Engine.

Why this order

Identify contexts, containerize, deploy, set up communication, redirect traffic.

53

Multi-Select · medium

A service experiences increased latency and HTTP 503 errors. The engineer finds that the backend managed instance group (MIG) is at max instances and CPU utilization is 90%. Which TWO actions should the engineer take to restore the service quickly?

Select 2 answers

A. Enable autoscaling based on HTTP load balancing utilization

B. Increase the autoscaling target CPU utilization to 95%

C. Increase the maximum number of instances in the MIG

D. Reduce the autoscaling target CPU utilization to 50%

E. Reduce the number of instances to avoid resource contention

Answers: A, C

Scales based on request rate, which is more responsive than CPU.

Why this answer

Option A is correct because enabling autoscaling based on HTTP load balancing utilization allows the MIG to scale out based on the actual request load, which directly addresses the 503 errors caused by the backend being at max capacity. This metric is more responsive to traffic spikes than CPU utilization alone, as it reflects the frontend load balancer's view of backend capacity.

Exam trap

Google Cloud often tests the misconception that adjusting CPU utilization thresholds (either up or down) is a quick fix for capacity issues, when in fact the immediate solution is to increase the maximum instance count or enable a more responsive scaling metric.
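
Why the maximum instance count matters can be seen with a simplified proportional scaling model in Python. This is an illustration only, not the actual MIG autoscaler algorithm; the 90% utilization comes from the question, while the group size of 10, the 60% target, and the function `desired_instances` are assumptions:

```python
def desired_instances(current: int, utilization_pct: int, target_pct: int,
                      max_instances: int) -> int:
    """Proportional scale-out estimate, capped at the group's maximum size."""
    # Integer ceiling division avoids floating-point rounding surprises.
    uncapped = -((-current * utilization_pct) // target_pct)
    return min(uncapped, max_instances)


# Hypothetical group of 10 instances at 90% CPU with a 60% target.
print(desired_instances(10, 90, 60, max_instances=10))  # 10: the cap blocks scale-out
print(desired_instances(10, 90, 60, max_instances=20))  # 15: raising the cap adds capacity
```

In this model, changing the target utilization alters the uncapped estimate but cannot add capacity while the group is pinned at its maximum.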

54

MCQ · medium

Refer to the exhibit. You see this log entry from a Cloud Run service. The stack trace shows the error occurs in handler.js at line 50. You want to see the state of variables at that point in the production environment without adding logging or redeploying. What should you do?

A. Use Error Reporting to view similar errors.

B. Use Cloud Profiler to capture a heap snapshot.

C. Use Cloud Debugger to set a snapshot location at line 50 in handler.js.

D. Use Cloud Trace to trace the request.

Answer: C

Cloud Debugger can capture local variables at specific lines in live applications.

Why this answer

Option C is correct because Cloud Debugger allows you to inspect the state of an application, including local variables and call stack, at a specific line of code in a production environment without modifying or redeploying the application. By setting a snapshot at line 50 in handler.js, you can capture the variable values at the exact point where the error occurs, which directly addresses the need to debug without adding logging or redeploying.

Exam trap

Google Cloud often tests the distinction between debugging tools (Cloud Debugger) and monitoring/observability tools (Error Reporting, Cloud Profiler, Cloud Trace), so the trap here is that candidates may confuse Cloud Debugger with Error Reporting or Cloud Trace, thinking any tool that shows errors or traces can also reveal variable state.

How to eliminate wrong answers

Option A is wrong because Error Reporting aggregates and analyzes errors but does not provide the ability to inspect variable state at a specific line of code; it only shows error logs and stack traces. Option B is wrong because Cloud Profiler is used for continuous profiling of CPU and memory usage to identify performance bottlenecks, not for capturing variable state at a specific code location. Option D is wrong because Cloud Trace is a distributed tracing system that tracks request latency and path through services, but it does not capture local variable values or allow inspection of application state at a specific line of code.

55

MCQeasy

A company uses Error Budgets for their service. The SLO is 99.9% availability over a 30-day window. The service has been down for 30 minutes in the current window. What is the remaining error budget?

A.43.2 minutes

B.60 minutes

C.13.2 minutes

D.30 minutes

AnswerC

Calculation: 0.001 * 43200 minutes = 43.2 minutes budget, minus 30 = 13.2.

Why this answer

The SLO of 99.9% over a 30-day window allows a total error budget of 43.2 minutes (30 days × 24 hours × 60 minutes × 0.001). The service has already consumed 30 minutes of downtime, so the remaining error budget is 43.2 - 30 = 13.2 minutes. Option C is correct because it reflects this precise calculation.
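
The arithmetic above can be checked with a short Python sketch. The function names are illustrative only and do not come from any Google Cloud library; the inputs are the figures given in the question.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def remaining_budget_minutes(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Budget left after subtracting the downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# Figures from the question: 99.9% SLO, 30-day window, 30 minutes of downtime.
total = error_budget_minutes(0.999, 30)
remaining = remaining_budget_minutes(0.999, 30, 30)
print(f"total={total:.1f} min, remaining={remaining:.1f} min")
# prints "total=43.2 min, remaining=13.2 min"
```

The same function shows why 60 minutes is not a plausible budget here: it would require an SLO of about 99.86% over the same window.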

Exam trap

Google Cloud often tests the distinction between total error budget and remaining error budget, trapping candidates who forget to subtract the already consumed downtime from the total allowable downtime.

How to eliminate wrong answers

Option A is wrong because 43.2 minutes is the total error budget for the 30-day window, not the remaining budget after 30 minutes of downtime. Option B is wrong because 60 minutes would correspond to an SLO of approximately 99.86% (43.2 minutes is the correct total for 99.9%), and it does not account for the 30 minutes already consumed. Option D is wrong because 30 minutes is simply the downtime already incurred, not the remaining error budget.

56

MCQhard

A company uses Cloud Run for a stateless API service with concurrency set to 80. During a traffic spike, some requests return HTTP 500 errors and latency spikes. Cloud Monitoring shows container CPU utilization at 100% and memory usage at 70%. What is the most likely cause and the best first step?

A.Concurrency per container is too high; reduce concurrency to 10

B.Maximum instances limit is too low; increase from 10 to 100

C.Min idle instances is too low; set min idle to 5 to reduce cold starts

D.Memory limit is too low; increase memory from 256 MiB to 512 MiB

AnswerA

Lowering concurrency reduces CPU contention, preventing timeouts and 500s.

Why this answer

The correct answer is A because with CPU at 100% and memory at only 70%, the bottleneck is CPU, not memory. Cloud Run containers handle requests concurrently; setting concurrency to 80 means each container processes up to 80 requests simultaneously. When CPU is saturated, requests queue up, causing latency spikes and eventual HTTP 500 errors as the container becomes unresponsive.

Reducing concurrency to 10 lowers the per-container request load, allowing each request to complete before CPU saturation occurs.
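
As a rough illustration of why lower concurrency helps a CPU-bound service, the sketch below uses a deliberately simple model: it assumes each container has 1 vCPU (the question does not state the CPU allocation) and that in-flight requests share the CPU evenly. Real scheduling is more complex, so treat the numbers as indicative only.

```python
def cpu_share_per_request(vcpus: float, in_flight: int) -> float:
    """Average CPU available to each in-flight request under even sharing."""
    return vcpus / in_flight


# Assumed allocation: 1 vCPU per container (not given in the question).
for concurrency in (80, 10):
    share = cpu_share_per_request(1.0, concurrency)
    print(f"concurrency={concurrency}: {share:.4f} vCPU per request")
```

Under this model, a fully loaded container at concurrency 80 gives each request one eightieth of a vCPU, while concurrency 10 gives each request eight times as much.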

Exam trap

Google Cloud often tests the misconception that HTTP 500 errors during spikes are always due to insufficient instances or memory, but the key diagnostic clue here is CPU at 100% with memory well below limit, pointing to concurrency overload as the root cause.

How to eliminate wrong answers

Option B is wrong because increasing the maximum instances limit would add more containers, but each new container would also be configured with concurrency 80 and would immediately hit the same CPU bottleneck, spreading the problem without solving it. Option C is wrong because min idle instances addresses cold start latency for new containers, but the issue here is CPU saturation during a traffic spike, not cold starts; idle instances would still be overwhelmed by the high concurrency setting. Option D is wrong because memory usage is at 70%, not 100%, so memory is not the bottleneck; increasing memory would not resolve CPU saturation and could even increase per-container cost without benefit.

57

Matchingmedium

Match each Google Cloud DevOps capability to its benefit.

Concepts

Matches

Managed continuous delivery to GKE

Centralized container and package storage

Private Git repositories integrated with Cloud Build

IDE plugins for Kubernetes and Cloud Run

CLI for continuous development on Kubernetes

Why these pairings

Tools that accelerate DevOps workflows.

58

MCQhard

An organization has a service that must meet a 99.99% SLO. The service runs on GKE and uses Cloud SQL. The team notices that during a major incident, the error budget is consumed rapidly. They want to implement a mechanism to automatically rollback deployments that cause sustained error budget consumption above a threshold. What is the best approach?

A.Use Cloud Scheduler to run a script that checks error budget and rolls back if needed.

B.Set up a deployment pipeline with Cloud Deploy that includes a predeployment validation step that checks the current error budget burn rate and blocks the release if the burn rate exceeds 10% per hour.

C.Implement a canary deployment strategy with manual approval steps.

D.Configure Cloud Build to automatically revert the last commit if error budget is consumed.

AnswerB

Automated policy prevents deployments that would consume error budget quickly.

Why this answer

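Option B is correct because Cloud Deploy handles deployment orchestration, and a pipeline gate keyed to the error budget burn rate applies the policy automatically, with no human in the loop. Option A is wrong because a Cloud Scheduler script polls on a fixed schedule and sits outside the deployment pipeline, so it is an ad hoc workaround rather than a managed control. Option C is wrong because canary deployments reduce blast radius, but manual approval steps are not automated.

Option D is wrong because Cloud Build is a CI tool, not deployment orchestration, and reverting a commit does not by itself roll back what is running in the cluster.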
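
The burn-rate check described in the question can be sketched in Python as follows. This is only the decision logic, not a Cloud Deploy API: the function names are hypothetical, the 30-day SLO window is an assumption (the question does not state one), and how the check would be wired into a pipeline is not shown.

```python
def budget_fraction_burned(bad_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the whole window's error budget consumed by `bad_minutes`."""
    budget_minutes = (1.0 - slo) * window_days * 24 * 60
    return bad_minutes / budget_minutes


def should_block_release(bad_minutes_last_hour: float, slo: float,
                         max_burn_per_hour: float = 0.10) -> bool:
    """True when the last hour consumed more than the allowed share of the budget."""
    return budget_fraction_burned(bad_minutes_last_hour, slo) > max_burn_per_hour


# A 99.99% SLO over an assumed 30-day window allows about 4.32 bad minutes in total,
# so 10% per hour corresponds to roughly 0.43 bad minutes in a single hour.
print(should_block_release(1.0, 0.9999))  # 1 bad minute in the last hour
print(should_block_release(0.2, 0.9999))  # 0.2 bad minutes in the last hour
```

With these inputs the first call reports that the release should be blocked and the second that it may proceed, which shows how little downtime a 99.99% target tolerates.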

59

Multi-Selecthard

Which TWO of the following are essential elements of a comprehensive incident post-mortem document according to Google's Site Reliability Engineering (SRE) best practices?

Select 2 answers

A.Exact code line numbers and commits that caused the incident.

B.A timeline of events leading to and during the incident.

C.A detailed analysis of the root cause only.

D.An attribution of blame to the individual or team responsible.

E.A list of action items with owners and deadlines to prevent recurrence.

AnswersB, E

A timeline reconstructs the sequence of events, and owned action items prevent recurrence.

Why this answer

Option B is correct because a timeline of events is a core element of an incident post-mortem in Google SRE practice, as it provides a chronological reconstruction of the incident's progression, enabling teams to understand the sequence of failures and responses. This timeline is essential for identifying contributing factors and evaluating the effectiveness of mitigation actions, not just the root cause. Option E is correct because action items with named owners and deadlines turn the findings into tracked follow-up work, which is how a post-mortem actually prevents recurrence.

Exam trap

Google Cloud often tests the misconception that a post-mortem is solely about finding the root cause or assigning blame, but SRE best practices emphasize a blameless culture and a comprehensive review that includes a timeline and actionable follow-ups.

60

MCQmedium

A GKE cluster node fails, causing pods to be rescheduled. However, some pods remain in 'CrashLoopBackOff' state. After examining logs, you find the application has a dependency on local SSD that was ephemeral. What is the best long-term solution?

A.Use PersistentVolumes with ReadWriteOnce access mode.

B.Configure pod anti-affinity.

C.Increase the node pool size.

D.Use a DaemonSet to run the application.

AnswerA

PersistentVolumes retain data across pod rescheduling and node failures.

Why this answer

The correct answer is A because the application's dependency on local SSD (ephemeral storage) means that when the node fails, the data is lost, causing the pods to crash. PersistentVolumes (PVs) with ReadWriteOnce (RWO) access mode provide durable, node-independent storage that survives node failures, ensuring pods can be rescheduled on any node and access their data. This is the best long-term solution because it decouples storage from the node lifecycle, preventing CrashLoopBackOff due to missing local data.
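
For illustration, a minimal PersistentVolumeClaim requesting ReadWriteOnce storage could look like the sketch below. It is written as a Python dictionary and printed as JSON, which kubectl accepts alongside YAML. The claim name and the 10Gi size are placeholders, not values from the question, and the storage class is left to the cluster default.

```python
import json

# Minimal PersistentVolumeClaim manifest; name and size are hypothetical.
pvc_manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "app-data"},
    "spec": {
        # ReadWriteOnce: the volume can be mounted read-write by a single node.
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "10Gi"}},
    },
}

print(json.dumps(pvc_manifest, indent=2))
```

The pod spec would then reference the claim by name in its volumes section instead of relying on node-local storage.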

Exam trap

Google Cloud often tests the misconception that scaling resources (e.g., node pool size) or controlling pod placement (e.g., anti-affinity) can fix data persistence issues, but the trap here is that ephemeral storage is tied to the node's lifecycle, so only persistent storage solutions like PersistentVolumes address the root cause.

How to eliminate wrong answers

Option B is wrong because pod anti-affinity controls pod placement (e.g., spreading pods across nodes) but does not address the root cause of data loss from ephemeral local SSD; it would not prevent CrashLoopBackOff if the data is missing. Option C is wrong because increasing the node pool size adds more nodes but does not solve the problem of ephemeral storage being tied to a failed node; pods would still fail on new nodes if they rely on local SSD that is not replicated. Option D is wrong because a DaemonSet runs one pod per node and ties each pod to its node, but it does not provide persistent storage; a pod started on a replacement node still lacks the local SSD data, leading to the same CrashLoopBackOff issue.

61

MCQmedium

After deploying a new version of a Cloud Run service, the team notices an increase in 5xx errors. They want to quickly revert to the previous version while minimizing user impact. What is the recommended approach?

A.Set the minimum number of instances of the new revision to 0.

B.Redeploy the previous version from the container registry.

C.Modify the ingress settings to restrict traffic to the new revision.

D.Use Cloud Run's traffic management to set 100% of traffic to the previous revision.

AnswerD

Traffic splitting achieves instant rollback.

Why this answer

Cloud Run supports traffic splitting between revisions, allowing you to instantly route 100% of traffic to the previous revision without redeploying. This minimizes user impact because the rollback is immediate and does not require rebuilding or re-pulling container images. Option D is correct because it leverages Cloud Run's built-in traffic management feature for zero-downtime rollbacks.
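
As a sketch of what the traffic-level rollback looks like in practice, the helper below assembles a `gcloud run services update-traffic` command that sends all traffic to a named revision. The helper itself, the service name, the revision name, and the region are hypothetical placeholders; the script only prints the command and does not run it.

```python
import shlex
from typing import List


def rollback_command(service: str, revision: str, region: str) -> List[str]:
    """Build the gcloud arguments that route 100% of traffic to `revision`."""
    return [
        "gcloud", "run", "services", "update-traffic", service,
        f"--to-revisions={revision}=100",
        f"--region={region}",
    ]


# Placeholder names for illustration only.
cmd = rollback_command("my-api", "my-api-00041-abc", "us-central1")
print(shlex.join(cmd))
```

Because the previous revision is already stored by Cloud Run, no image build or deployment is involved; only the traffic assignment changes.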

Exam trap

Google Cloud often tests the misconception that you must redeploy or delete a revision to roll back, when in fact Cloud Run's traffic management allows instant, traffic-level rollbacks without any deployment action.

How to eliminate wrong answers

Option A is wrong because setting the minimum number of instances of the new revision to 0 does not stop traffic from reaching it; it only affects scaling behavior, and the revision would still serve requests if it receives traffic. Option B is wrong because redeploying the previous version from the container registry is unnecessary and slower; Cloud Run already retains the previous revision, so you can simply shift traffic to it. Option C is wrong because ingress settings apply to the service as a whole and control which network sources can reach it; they cannot target a single revision, so tightening ingress would block users rather than route them to the previous revision.

62

MCQmedium

An engineer wants to ensure that an alert is escalated if not acknowledged within 5 minutes. Which feature of Cloud Monitoring can achieve this?

A.Incident management tool like PagerDuty.

B.Notification channel with escalation configuration.

C.Using a webhook notification channel.

D.Alerting policy with a condition that checks for acknowledgment.

AnswerB

Escalation channels in notification settings allow sending to different recipients after a delay.

Why this answer

Notification channels in Google Cloud Monitoring can be configured with an escalation policy that defines a sequence of recipients or notification methods to be contacted if an alert is not acknowledged within a specified time. By setting the escalation duration to 5 minutes, the system automatically escalates the alert to the next tier or channel if no acknowledgment is received, directly meeting the engineer's requirement.

Exam trap

Google Cloud often tests the distinction between alerting policy conditions (what triggers an alert) and notification channel escalation (what happens after the alert fires), leading candidates to incorrectly select option D because they confuse the condition's 'duration' with the escalation timeout.

How to eliminate wrong answers

Option A is wrong because PagerDuty is an external incident management tool that integrates with Cloud Monitoring via notification channels, but it is not a built-in feature of Cloud Monitoring itself; the question asks for a feature of Cloud Monitoring, not an external tool. Option C is wrong because a webhook notification channel simply sends HTTP POST requests to a specified URL when an alert fires; it does not inherently support escalation logic or acknowledgment tracking. Option D is wrong because an alerting policy condition defines the metric or log criteria that trigger an alert, not the escalation behavior after the alert fires; acknowledgment and escalation are handled by the notification channel's configuration, not the policy condition.

63

Multi-Selectmedium

Which TWO tools should be used for real-time incident collaboration and communication?

Select 2 answers

A.Google Meet only.

B.Jira.

C.Cloud Monitoring Incident Response (beta).

D.Google Chat with a dedicated incident room.

E.Cloud Trace.

AnswersC, D

Incident Response centralizes incident management, and a dedicated Chat room gives responders a real-time communication channel.

Why this answer

Cloud Monitoring Incident Response (beta) provides a dedicated interface for real-time incident management, including automated notifications, escalation policies, and a centralized timeline for collaboration. Google Chat with a dedicated incident room enables real-time communication and coordination among incident responders, allowing them to share updates, logs, and runbooks in a structured channel. Together, these tools fulfill the need for both incident orchestration and synchronous collaboration during active incidents.

Exam trap

Google Cloud often tests the distinction between tools that support real-time collaboration (like Chat rooms) versus tools that are for asynchronous tracking (like Jira) or monitoring-only (like Cloud Trace), leading candidates to select familiar but incorrect options like Jira for incident communication.
