Practice PCDOE Implementing service monitoring strategies questions with full explanations on every answer.
1. A team is monitoring a production service on Google Kubernetes Engine (GKE) and notices that a deployment is occasionally returning HTTP 503 errors. The team has set up a ServiceMonitor in Prometheus to scrape metrics from the pods. What is the most likely cause of the intermittent 503 errors?
2. A cloud operations team is implementing monitoring for a microservices application deployed on Compute Engine. They want to create a custom dashboard in Cloud Monitoring that shows the 99th percentile latency of a specific service over the last hour. Which combination of Cloud Monitoring features should they use?
3. An e-commerce platform is using Cloud Load Balancing with a backend service that has a custom health check. The health check is failing intermittently, causing traffic to be routed away from healthy instances. The team has enabled Cloud Logging and wants to diagnose the issue. Which log view should they examine to see the health check probe results?
4. A DevOps engineer is setting up alerting policies for a critical API service. They want to receive an alert if the error rate exceeds 5% for at least 5 minutes, but only during business hours (9 AM to 5 PM). Which approach should they use?
5. A company is running a stateful workload on Compute Engine and has configured a TCP health check on port 8080. The health check is failing, but the application is running and responding on port 8080 when tested manually from within the instance. What is the most likely cause of the health check failure?
6. Which TWO of the following are best practices for implementing service monitoring in Google Cloud? (Choose 2)
7. Which THREE of the following are valid approaches to monitor a custom application metric in Cloud Monitoring? (Choose 3)
8. A DevOps engineer runs the command above and gets the output shown. What does this output indicate?
9. A team has deployed a Prometheus server on GKE using the configuration above. They expect Prometheus to scrape metrics from pods with the label 'app: my-app' and the annotation 'prometheus.io/scrape: true' on port 8080. However, no metrics are being collected. What is the most likely cause?
10. You are monitoring a microservices application deployed on Google Kubernetes Engine (GKE) that uses Cloud Monitoring for observability. You notice that the error rate for a critical service has increased, but the CPU and memory usage remain normal. The service uses gRPC and logs are structured. Which Cloud Monitoring tool should you use first to diagnose the root cause of the increased error rate?
11. A company uses Cloud Monitoring to track latency for a multi-region web application. The SLO is 99.9% of requests under 500ms over a 30-day rolling window. The error budget has been rapidly depleting over the last week. The operations team wants to understand the impact of recent deployments. Which approach should they use to correlate deployment changes with latency spikes?
12. You are setting up alerting for a batch processing job that runs daily on Compute Engine. The job must complete within 2 hours. Which metric and alert condition should you use to ensure you are notified if the job is still running after 90 minutes?
13. Which TWO metrics should be included in a comprehensive monitoring strategy for a production Kubernetes workload to detect performance degradation and capacity issues?
14. Your organization runs a critical e-commerce platform on Google Kubernetes Engine (GKE). The platform uses Cloud Service Mesh (Anthos Service Mesh) for traffic management and Cloud Monitoring for observability. Recently, after a new release, you observe that the p99 latency of the checkout service has increased from 200ms to 2s. The service's CPU and memory metrics appear normal, and there are no error logs. The release included a change to the Istio VirtualService configuration that added a retry policy: 3 retries with a 500ms timeout per retry. You suspect that the retries are contributing to the latency increase. You want to use Cloud Monitoring to confirm this hypothesis. Which approach should you take?
15. You are a DevOps engineer for a SaaS company that provides a REST API. The API is deployed on Google Cloud Run. You have configured Cloud Monitoring alerts for 5xx errors. Recently, you received an alert that the error rate exceeded 5% for 5 minutes. You investigated and found that the errors were HTTP 503 (Service Unavailable) from a specific endpoint. The endpoint calls an internal Cloud SQL database. The database CPU utilization was at 90% during that period. You suspect the database is the bottleneck. Which action should you take to reduce the error rate without over-provisioning?
16. A company uses Cloud Run for a critical service and needs to set up alerting for 5xx errors. They want to receive a notification within 1 minute of the error rate exceeding 1% for any 1-minute window. Which alerting approach should they use?
17. Which TWO are best practices for implementing service monitoring strategies in Google Cloud?
18. A team has set up the alerting policies shown in the exhibit. They receive an alert for High Memory but not for High CPU. What is the most likely reason?
19. Order the steps to configure a VPC Network Peering between two projects.
20. Match each Google Cloud tool to its function in incident management.
21. A DevOps team is defining an SLO for a web application that runs on Compute Engine behind an HTTP Load Balancer. They need to measure the proportion of requests that complete within 300ms. Which Cloud Monitoring metric is most appropriate as the SLI?
22. An alerting policy triggers frequently for a spike in CPU utilization on a Compute Engine instance, but the spike lasts only a few seconds. The SRE team wants to reduce false positives. Which change should they make?
23. You are designing a monitoring strategy for a microservices application running on Google Kubernetes Engine (GKE). You need to create a custom metric that counts the number of failed login attempts from the application logs. The logs are in JSON format and contain a field 'status' with value 'FAILED'. Which approach should you use?
24. A team needs to monitor the availability of an HTTPS endpoint that requires a Bearer token in the request header. What is the simplest way to configure this with Cloud Monitoring?
25. You are creating a Cloud Monitoring dashboard to display the 99th percentile latency of your HTTP Load Balancer over the last 6 hours. Which MQL query should you use?
26. A company has on-premises servers running Linux and GKE clusters. They want to monitor all infrastructure using Cloud Monitoring. Which solution is most scalable and aligned with Google's best practices?
27. An application running on App Engine is throwing exceptions. The DevOps team wants to be notified when a new type of exception appears. Which Cloud Monitoring feature should they use?
28. An SRE team needs to implement an incident management workflow that automatically creates a ticket in their ITSM tool when a critical alert fires. They use Cloud Monitoring. Which approach should they use?
29. A company wants to reduce costs associated with Cloud Monitoring. They have many custom metrics and high ingestion rates. Which cost optimization strategy is most effective?
30. A DevOps engineer notices that some Compute Engine instances are not reporting metrics to Cloud Monitoring. Which two potential causes should they investigate? (Choose two.)
31. An alerting policy for high CPU utilization on a VM is firing even when CPU is not high. The team suspects a misconfiguration. Which two possible issues should they check? (Choose two.)
32. You are designing SLO monitoring for a high-traffic e-commerce platform. Which three best practices should you follow? (Choose three.)
33. An SRE team created the above logs-based metric. They expect it to count the number of HTTP 500 errors per instance. However, the metric shows no data. What is the most likely cause?
34. A team notices that the 'cpu-high' alert fires frequently even for short bursts. The 'disk-full' alert never sends notifications. Based on the exhibit, what is the issue with each?
35. The above MQL query is used in a Cloud Monitoring dashboard. What does it display?
36. A team wants to monitor the latency of a microservice deployed on GKE. Which Google Cloud tool should they use to collect custom metrics?
37. A company has an application that experiences intermittent errors. They want to be notified immediately when the error rate exceeds 1% of total requests. What should they implement?
38. A DevOps team is setting up SLOs for a service with two critical metrics: availability and latency. They want to measure over a 30-day window. Which approach correctly defines an SLO?
39. You need to monitor the health of an external HTTP endpoint. Which resource should you create?
40. Your team wants to create a dashboard that shows request latency broken down by API version. Which approach is most efficient?
41. A service is deployed on Cloud Run. You need to monitor memory usage per revision. How can you create an alert?
42. Which service provides built-in dashboards for Google Cloud services?
43. You want to send alerts to a Slack channel when a critical error occurs. What should you do?
44. A team wants to implement multi-cluster monitoring for GKE using Managed Service for Prometheus. Which configuration is required?
45. Your team is implementing SLO monitoring. Which TWO tools should they use to create and monitor SLIs?
46. You are designing a monitoring strategy for a cloud-native application. Which THREE components are essential for observability?
47. A company needs to monitor custom application metrics from Compute Engine instances. Which TWO methods can be used?
48. Based on the exhibit, what does the duration of 300s mean in this alerting policy?
49. Based on the exhibit, which Cloud Logging query filter will return all logs of this type?
50. Refer to the exhibit. What is the effect of the metricRelabelings section in this ServiceMonitor?
51. A team wants to monitor a Google Cloud Run service for application crashes. Which Google Cloud tool automatically captures and notifies on application errors?
52. An SRE team needs to define an SLI for a web service's availability SLO of 99.9%. Which metric should they use?
53. A company runs a microservices architecture on GKE with Istio. They want to generate custom request-level metrics for SLO tracking without modifying application code. Which approach is most efficient?
54. A developer wants to view logs from all pods in a GKE namespace in real time. Which command-line tool should they use?
55. A DevOps engineer needs to verify if a load balancer's health check is behaving normally by examining historical trends. Where should they look?
56. You need to monitor a multi-step login flow that involves calling an API, validating a token, and redirecting. Which type of uptime check should you use?
57. Which Google Cloud tool automatically captures and visualizes traces for applications running on App Engine?
58. An organization wants to be alerted when the total size of a Cloud Storage bucket exceeds 1 TB. Which metric should they monitor?
59. A team uses Cloud Monitoring alerting policies with multiple conditions. They want an incident to fire only when both conditions are met simultaneously. What should they configure?
60. Which TWO options are valid methods to create a custom metric descriptor in Cloud Monitoring?
61. Which THREE are recommended practices for setting up alerting on Google Cloud?
62. Which TWO metrics should be monitored to detect a potential memory leak in a Compute Engine VM?
63. What is the effect of the 'timeshiftDuration' of '3600s' in the dashboard widget?
64. Which Cloud Monitoring feature can directly correlate this error with the associated trace and VM instance?
65. Your company runs a multi-region application on GKE across us-east1 and europe-west1. The application serves a global user base with a strict SLO of 99.95% availability. Recently, the team noticed that during peak hours, some users in South America experience high latency and intermittent errors. The GKE clusters are monitored via Cloud Monitoring with custom dashboards and alerting policies. The team has set up a single alerting policy that triggers when the global error rate exceeds 0.1%. However, the alert fires only after the issue has persisted for 10 minutes, and by then the customer impact is already significant. You need to improve the detection and response time. Which action should you take first?
66. A team is using Cloud Monitoring to track the performance of a microservices application. They set up an uptime check for each service, but they notice that some checks are failing intermittently without actual service degradation. What is the most likely cause?
67. A company wants to monitor custom application metrics in real-time and trigger alerts when a metric exceeds a threshold. Which Google Cloud service should they use?
68. An organization is implementing SLO-based alerting for a critical service. They want to alert when the service has consumed 50% of its error budget over a 30-day window. Considering best practices for alert sensitivity and noise reduction, which alerting approach should they use?
69. A DevSecOps team is configuring Cloud Monitoring alerts for proactive incident response. Which two practices are recommended for effective alerting? (Choose two.)
70. A team is designing a dashboard for their production environment using Cloud Monitoring. Which three types of information should be included on the dashboard to support incident response? (Choose three.)
71. Your company runs a stateless web application on Google Kubernetes Engine (GKE). You have configured Cloud Monitoring to track request latency and set up an alert when p95 latency exceeds 500ms for 5 minutes. Recently, the alert has been firing frequently during peak hours. You examine the metrics and see that p95 latency spikes to 600ms for short periods. The application's SLO is 99.9% availability with a latency threshold of 1 second. What should you do to reduce alert noise without compromising the SLO?
72. Your organization is migrating a monolithic application to microservices on Cloud Run. You need to monitor the health of each microservice and aggregate logs and metrics in a central dashboard. You have set up Cloud Monitoring custom dashboards and logs-based metrics. After the initial deployment, you notice that the dashboards show data only for some services, while others appear to have no metrics. You verify that all services are running and emitting logs. What is the most likely cause?
73. You are the DevOps engineer for a large gaming company. Your game backend runs on Compute Engine instances behind a global HTTP(S) Load Balancer. You have set up Cloud Monitoring with an uptime check for the load balancer's IP address, and you are using logging to capture 404 errors. Recently, a new game update caused a surge in traffic, and you started receiving many alerts from your uptime check indicating that the site is down. However, you verify that the backend instances are healthy and the load balancer is responding correctly, though some requests are timing out due to the increased load. Your alerting policy currently triggers when 2 consecutive checks fail. What is the most likely reason for the false positive alerts?
74. A small startup uses Cloud Functions for their backend and wants to monitor function execution times and error rates. They have enabled Cloud Monitoring and are viewing metrics in the Cloud Console. They notice that the execution time metric for a particular function shows an average of 200ms, but occasionally there are spikes to 5 seconds, which correspond to user-reported slow responses. They want to be alerted when the function exceeds 1 second for any invocation. What is the simplest way to achieve this?
75. Your company runs a multi-region application on Google Kubernetes Engine. You have implemented Cloud Monitoring dashboards to track cluster resource utilization and application SLIs. After a recent upgrade, you notice that the dashboard shows a sudden drop in CPU utilization for all nodes in one zone, but the application is still serving traffic normally. You suspect a monitoring issue. What should you investigate first?
76. A site reliability engineer is defining SLOs for a microservice application running on Google Kubernetes Engine. The application serves user-facing API requests. Which TWO approaches should the engineer take to effectively monitor the service's performance?
77. Refer to the exhibit.
```yaml
name: projects/my-project/alertPolicies/12345
displayName: High Error Rate
combiner: OR
conditions:
- conditionThreshold:
    filter: metric.type="logging.googleapis.com/user/myapp/error_count" resource.type="k8s_container"
    aggregations:
    - alignmentPeriod: 60s
      perSeriesAligner: ALIGN_RATE
    duration: 120s
    comparison: COMPARISON_GT
    thresholdValue: 5
    trigger:
      count: 1
```
An engineer notices that this alert fires too frequently during normal operation. Which change would most likely reduce the noise?
78. A company runs a multi-region web application on Google Kubernetes Engine (GKE) using Cloud Load Balancing and Cloud Armor. They use Cloud Monitoring to track user-facing latency. Recently, they noticed that the p99 latency has increased from 200ms to 2s during peak hours, but only for users in the US region. The team suspects a specific backend service in us-central1 is causing the spike. They have set up a dashboard showing latency by region, but the latency metric is aggregated globally, not broken down by region. What should they do to pinpoint the issue?
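Several questions in the list (11, 65, and 68, for example) depend on error-budget arithmetic for a 30-day window. The sketch below shows that arithmetic only; the function names are hypothetical and nothing here calls a Cloud Monitoring API. It assumes a time-based budget measured in minutes and a request-based budget measured in events.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed 'bad' time in the window for a time-based SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def budget_consumed(bad_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget used so far for a request-based SLO."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = total_events * (1.0 - slo_target)
    return bad_events / allowed_bad


if __name__ == "__main__":
    # 99.9% over 30 days leaves roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999, 30), 1))
    # 99.95% over 30 days leaves roughly 21.6 minutes.
    print(round(error_budget_minutes(0.9995, 30), 1))
    # 500 slow requests out of 1,000,000 at a 99.9% target is about half the budget.
    print(round(budget_consumed(500, 1_000_000, 0.999), 2))
```

Working the numbers by hand like this makes it easier to judge whether a proposed alert threshold or window in an answer option is plausible.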
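Question 14 describes a retry policy of 3 retries with a 500ms timeout per try and a p99 latency that rose to 2s. A quick upper-bound calculation shows why the retries are a reasonable suspect. This is a sketch under stated assumptions: the per-try timeout is taken to apply to the initial attempt as well as each retry, backoff between attempts defaults to zero, and the function name is hypothetical. It does not confirm the hypothesis; the question asks how to do that with Cloud Monitoring.

```python
def worst_case_latency_ms(retries: int, per_try_timeout_ms: int, backoff_ms: int = 0) -> int:
    """Upper bound on request latency when every attempt times out.

    Assumes one initial attempt plus `retries` retries, each bounded by
    `per_try_timeout_ms`, with an optional fixed backoff before each retry.
    """
    if retries < 0 or per_try_timeout_ms < 0 or backoff_ms < 0:
        raise ValueError("arguments must be non-negative")
    attempts = 1 + retries
    return attempts * per_try_timeout_ms + retries * backoff_ms


if __name__ == "__main__":
    # 1 initial attempt + 3 retries, each up to 500 ms: 4 * 500 = 2000 ms.
    print(worst_case_latency_ms(retries=3, per_try_timeout_ms=500))
```

Under these assumptions the bound is 2000 ms, which lines up with the 2s p99 in the scenario.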
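Questions 4, 22, 48, and 77 turn on what a duration setting does in a threshold alert condition. The sketch below is a deliberately simplified model, not Cloud Monitoring's actual evaluation logic: it assumes evenly spaced, already-aligned samples and treats the condition as met only when every sample in the trailing duration window exceeds the threshold. The function and parameter names are hypothetical.

```python
from typing import Sequence


def condition_met(samples: Sequence[float], threshold: float,
                  period_s: int, duration_s: int) -> bool:
    """Return True if all samples covering the trailing duration exceed threshold.

    `samples` holds one aligned value per `period_s` seconds, oldest first.
    A duration of 0 means a single violating sample is enough.
    """
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    needed = max(1, duration_s // period_s)
    if len(samples) < needed:
        return False  # not enough history to cover the duration window
    return all(value > threshold for value in samples[-needed:])


if __name__ == "__main__":
    spike = [0.20, 0.95, 0.30, 0.20, 0.20]      # brief burst of CPU
    sustained = [0.90, 0.92, 0.91, 0.95, 0.93]  # high for the full window
    print(condition_met(spike, 0.8, period_s=60, duration_s=300))      # False
    print(condition_met(sustained, 0.8, period_s=60, duration_s=300))  # True
    print(condition_met([0.20, 0.95], 0.8, period_s=60, duration_s=0)) # True
```

In this model a longer duration filters out short spikes at the cost of slower detection, which is the trade-off several of the alert-noise questions are probing.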
The Implementing service monitoring strategies domain covers the key concepts tested in this area of the PCDOE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PCDOE domains — no account required.
The Courseiva PCDOE question bank contains 78 questions in the Implementing service monitoring strategies domain. Click any question to see the full explanation and answer breakdown.
Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.
The session launcher on this page draws questions exclusively from the Implementing service monitoring strategies domain, so you can practice this domain on its own. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.