Free PCDOE Implementing service monitoring strategies Practice Questions (2026)

Q: How many Implementing service monitoring strategies questions are on the PCDOE exam?

The Implementing service monitoring strategies domain is one of the weighted domains on the PCDOE exam. The Courseiva question bank has 78 practice questions for this domain.

Q: How can I practice Implementing service monitoring strategies questions for PCDOE?

Click any of the 78 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Implementing service monitoring strategies domain.

Practice Implementing service monitoring strategies questions


All PCDOE Implementing service monitoring strategies questions (78)


Click any question to see the full explanation and answer options, or start a focused practice session above.

A team is monitoring a production service on Google Kubernetes Engine (GKE) and notices that a deployment is occasionally returning HTTP 503 errors. The team has set up a ServiceMonitor in Prometheus to scrape metrics from the pods. What is the most likely cause of the intermittent 503 errors?

A cloud operations team is implementing monitoring for a microservices application deployed on Compute Engine. They want to create a custom dashboard in Cloud Monitoring that shows the 99th percentile latency of a specific service over the last hour. Which combination of Cloud Monitoring features should they use?

An e-commerce platform is using Cloud Load Balancing with a backend service that has a custom health check. The health check is failing intermittently, causing traffic to be routed away from healthy instances. The team has enabled Cloud Logging and wants to diagnose the issue. Which log view should they examine to see the health check probe results?

A DevOps engineer is setting up alerting policies for a critical API service. They want to receive an alert if the error rate exceeds 5% for at least 5 minutes, but only during business hours (9 AM to 5 PM). Which approach should they use?

A company is running a stateful workload on Compute Engine and has configured a TCP health check on port 8080. The health check is failing, but the application is running and responding on port 8080 when tested manually from within the instance. What is the most likely cause of the health check failure?

Which TWO of the following are best practices for implementing service monitoring in Google Cloud? (Choose 2)

Which THREE of the following are valid approaches to monitor a custom application metric in Cloud Monitoring? (Choose 3)

A DevOps engineer runs the command above and gets the output shown. What does this output indicate?

A team has deployed a Prometheus server on GKE using the configuration above. They expect Prometheus to scrape metrics from pods with the label 'app: my-app' and the annotation 'prometheus.io/scrape: true' on port 8080. However, no metrics are being collected. What is the most likely cause?

You are monitoring a microservices application deployed on Google Kubernetes Engine (GKE) that uses Cloud Monitoring for observability. You notice that the error rate for a critical service has increased, but the CPU and memory usage remain normal. The service uses gRPC and logs are structured. Which Cloud Monitoring tool should you use first to diagnose the root cause of the increased error rate?

A company uses Cloud Monitoring to track latency for a multi-region web application. The SLO is 99.9% of requests under 500ms over a 30-day rolling window. The error budget has been rapidly depleting over the last week. The operations team wants to understand the impact of recent deployments. Which approach should they use to correlate deployment changes with latency spikes?
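As background for the error-budget wording in the question above, here is a minimal sketch of the arithmetic behind a request-based SLO. The function names and the request counts are hypothetical and are not taken from the question bank or from any Cloud Monitoring API.

```python
def error_budget(total_requests: int, slo_target: float) -> float:
    """Requests allowed to miss the SLO within the window (hypothetical helper)."""
    return total_requests * (1.0 - slo_target)


def budget_remaining(total_requests: int, bad_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    budget = error_budget(total_requests, slo_target)
    return (budget - bad_requests) / budget


# Assumed traffic: 10,000,000 requests in the 30-day window, 4,000 slower than 500 ms.
remaining = budget_remaining(10_000_000, 4_000, 0.999)
print(f"{remaining:.0%} of the error budget remains")
```

With a 99.9% target, roughly 1 in 1,000 requests may exceed the latency threshold, so the assumed 4,000 slow requests spend about 40% of the budget.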

You are setting up alerting for a batch processing job that runs daily on Compute Engine. The job must complete within 2 hours. Which metric and alert condition should you use to ensure you are notified if the job is still running after 90 minutes?

Which TWO metrics should be included in a comprehensive monitoring strategy for a production Kubernetes workload to detect performance degradation and capacity issues?

Your organization runs a critical e-commerce platform on Google Kubernetes Engine (GKE). The platform uses Cloud Service Mesh (Anthos Service Mesh) for traffic management and Cloud Monitoring for observability. Recently, after a new release, you observe that the p99 latency of the checkout service has increased from 200ms to 2s. The service's CPU and memory metrics appear normal, and there are no error logs. The release included a change to the Istio VirtualService configuration that added a retry policy: 3 retries with a 500ms timeout per retry. You suspect that the retries are contributing to the latency increase. You want to use Cloud Monitoring to confirm this hypothesis. Which approach should you take?
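For intuition on the retry scenario above, the following back-of-the-envelope sketch computes an upper bound on request latency under a retry policy. It assumes the per-try timeout applies to the original attempt and to each retry, that attempts run one after another, and that backoff between attempts is ignored. The function name is hypothetical.

```python
def worst_case_latency_ms(retries: int, per_try_timeout_ms: int) -> int:
    """Upper bound on latency when every attempt runs until its timeout.

    Assumption: one original attempt plus `retries` retries, executed
    sequentially, with no backoff delay between them.
    """
    attempts = 1 + retries
    return attempts * per_try_timeout_ms


# Values from the scenario: 3 retries, 500 ms per try.
print(worst_case_latency_ms(retries=3, per_try_timeout_ms=500))  # 2000
```

Under these assumptions the bound is 2000 ms, the same order as the 2s p99 in the scenario. That is consistent with the hypothesis but does not confirm it; the monitoring data is still needed.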

You are a DevOps engineer for a SaaS company that provides a REST API. The API is deployed on Google Cloud Run. You have configured Cloud Monitoring alerts for 5xx errors. Recently, you received an alert that the error rate exceeded 5% for 5 minutes. You investigated and found that the errors were HTTP 503 (Service Unavailable) from a specific endpoint. The endpoint calls an internal Cloud SQL database. The database CPU utilization was at 90% during that period. You suspect the database is the bottleneck. Which action should you take to reduce the error rate without over-provisioning?

A company uses Cloud Run for a critical service and needs to set up alerting for 5xx errors. They want to receive a notification within 1 minute of the error rate exceeding 1% for any 1-minute window. Which alerting approach should they use?

Which TWO are best practices for implementing service monitoring strategies in Google Cloud?

A team has set up the alerting policies shown in the exhibit. They receive an alert for High Memory but not for High CPU. What is the most likely reason?

Order the steps to configure a VPC Network Peering between two projects.

Match each Google Cloud tool to its function in incident management.

A DevOps team is defining an SLO for a web application that runs on Compute Engine behind an HTTP Load Balancer. They need to measure the proportion of requests that complete within 300ms. Which Cloud Monitoring metric is most appropriate as the SLI?

An alerting policy triggers frequently for a spike in CPU utilization on a Compute Engine instance, but the spike lasts only a few seconds. The SRE team wants to reduce false positives. Which change should they make?

You are designing a monitoring strategy for a microservices application running on Google Kubernetes Engine (GKE). You need to create a custom metric that counts the number of failed login attempts from the application logs. The logs are in JSON format and contain a field 'status' with value 'FAILED'. Which approach should you use?

A team needs to monitor the availability of an HTTPS endpoint that requires a Bearer token in the request header. What is the simplest way to configure this with Cloud Monitoring?

You are creating a Cloud Monitoring dashboard to display the 99th percentile latency of your HTTP Load Balancer over the last 6 hours. Which MQL query should you use?

A company has on-premises servers running Linux and GKE clusters. They want to monitor all infrastructure using Cloud Monitoring. Which solution is most scalable and aligned with Google's best practices?

An application running on App Engine is throwing exceptions. The DevOps team wants to be notified when a new type of exception appears. Which Cloud Monitoring feature should they use?

An SRE team needs to implement an incident management workflow that automatically creates a ticket in their ITSM tool when a critical alert fires. They use Cloud Monitoring. Which approach should they use?

A company wants to reduce costs associated with Cloud Monitoring. They have many custom metrics and high ingestion rates. Which cost optimization strategy is most effective?

A DevOps engineer notices that some Compute Engine instances are not reporting metrics to Cloud Monitoring. Which two potential causes should they investigate? (Choose two.)

An alerting policy for high CPU utilization on a VM is firing even when CPU is not high. The team suspects a misconfiguration. Which two possible issues should they check? (Choose two.)

You are designing SLO monitoring for a high-traffic e-commerce platform. Which three best practices should you follow? (Choose three.)

An SRE team created the above logs-based metric. They expect it to count the number of HTTP 500 errors per instance. However, the metric shows no data. What is the most likely cause?

A team notices that the 'cpu-high' alert fires frequently even for short bursts. The 'disk-full' alert never sends notifications. Based on the exhibit, what is the issue with each?

The above MQL query is used in a Cloud Monitoring dashboard. What does it display?

A team wants to monitor the latency of a microservice deployed on GKE. Which Google Cloud tool should they use to collect custom metrics?

A company has an application that experiences intermittent errors. They want to be notified immediately when the error rate exceeds 1% of total requests. What should they implement?

A DevOps team is setting up SLOs for a service with two critical metrics: availability and latency. They want to measure over a 30-day window. Which approach correctly defines an SLO?

You need to monitor the health of an external HTTP endpoint. Which resource should you create?

Your team wants to create a dashboard that shows request latency broken down by API version. Which approach is most efficient?

A service is deployed on Cloud Run. You need to monitor memory usage per revision. How can you create an alert?

Which service provides built-in dashboards for Google Cloud services?

You want to send alerts to a Slack channel when a critical error occurs. What should you do?

A team wants to implement multi-cluster monitoring for GKE using Managed Service for Prometheus. Which configuration is required?

Your team is implementing SLO monitoring. Which TWO tools should they use to create and monitor SLIs?

You are designing a monitoring strategy for a cloud-native application. Which THREE components are essential for observability?

A company needs to monitor custom application metrics from Compute Engine instances. Which TWO methods can be used?

Based on the exhibit, what does the duration of 300s mean in this alerting policy?

Based on the exhibit, which Cloud Logging query filter will return all logs of this type?

Refer to the exhibit. What is the effect of the metricRelabelings section in this ServiceMonitor?

A team wants to monitor a Google Cloud Run service for application crashes. Which Google Cloud tool automatically captures and notifies on application errors?

An SRE team needs to define an SLI for a web service's availability SLO of 99.9%. Which metric should they use?

A company runs a microservices architecture on GKE with Istio. They want to generate custom request-level metrics for SLO tracking without modifying application code. Which approach is most efficient?

A developer wants to view logs from all pods in a GKE namespace in real time. Which command-line tool should they use?

A DevOps engineer needs to verify if a load balancer's health check is behaving normally by examining historical trends. Where should they look?

You need to monitor a multi-step login flow that involves calling an API, validating a token, and redirecting. Which type of uptime check should you use?

Which Google Cloud tool automatically captures and visualizes traces for applications running on App Engine?

An organization wants to be alerted when the total size of a Cloud Storage bucket exceeds 1 TB. Which metric should they monitor?

A team uses Cloud Monitoring alerting policies with multiple conditions. They want an incident to fire only when both conditions are met simultaneously. What should they configure?

Which TWO options are valid methods to create a custom metric descriptor in Cloud Monitoring?

Which THREE are recommended practices for setting up alerting on Google Cloud?

Which TWO metrics should be monitored to detect a potential memory leak in a Compute Engine VM?

What is the effect of the 'timeshiftDuration' of '3600s' in the dashboard widget?

Which Cloud Monitoring feature can directly correlate this error with the associated trace and VM instance?

Your company runs a multi-region application on GKE across us-east1 and europe-west1. The application serves a global user base with a strict SLO of 99.95% availability. Recently, the team noticed that during peak hours, some users in South America experience high latency and intermittent errors. The GKE clusters are monitored via Cloud Monitoring with custom dashboards and alerting policies. The team has set up a single alerting policy that triggers when the global error rate exceeds 0.1%. However, the alert fires only after the issue has persisted for 10 minutes, and by then the customer impact is already significant. You need to improve the detection and response time. Which action should you take first?

A team is using Cloud Monitoring to track the performance of a microservices application. They set up an uptime check for each service, but they notice that some checks are failing intermittently without actual service degradation. What is the most likely cause?

A company wants to monitor custom application metrics in real-time and trigger alerts when a metric exceeds a threshold. Which Google Cloud service should they use?

An organization is implementing SLO-based alerting for a critical service. They want to alert when the service has consumed 50% of its error budget over a 30-day window. Considering best practices for alert sensitivity and noise reduction, which alerting approach should they use?
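The relationship between "fraction of error budget consumed" and burn rate can be sketched as below. This is an illustration of the commonly used burn-rate arithmetic, not a statement of which answer is correct. The function name and the lookback values are assumptions chosen for the example.

```python
def burn_rate_threshold(budget_fraction: float, lookback_hours: float,
                        window_hours: float = 30 * 24) -> float:
    """Burn rate that spends `budget_fraction` of the budget within `lookback_hours`.

    A burn rate of 1 spends exactly the whole budget over the full SLO window.
    """
    return budget_fraction * window_hours / lookback_hours


# Assumed lookback periods, shown only to illustrate the scale of the numbers.
for hours in (24, 72, 360):
    rate = burn_rate_threshold(0.5, hours)
    print(f"50% of a 30-day budget in {hours} h -> burn rate {rate:.1f}")
```

Shorter lookback periods need a higher burn rate to consume the same share of the budget, which is why fast-burn and slow-burn conditions are often paired.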

A DevSecOps team is configuring Cloud Monitoring alerts for proactive incident response. Which two practices are recommended for effective alerting? (Choose two.)

A team is designing a dashboard for their production environment using Cloud Monitoring. Which three types of information should be included on the dashboard to support incident response? (Choose three.)

Your company runs a stateless web application on Google Kubernetes Engine (GKE). You have configured Cloud Monitoring to track request latency and set up an alert when p95 latency exceeds 500ms for 5 minutes. Recently, the alert has been firing frequently during peak hours. You examine the metrics and see that p95 latency spikes to 600ms for short periods. The application's SLO is 99.9% availability with a latency threshold of 1 second. What should you do to reduce alert noise without compromising the SLO?

Your organization is migrating a monolithic application to microservices on Cloud Run. You need to monitor the health of each microservice and aggregate logs and metrics in a central dashboard. You have set up Cloud Monitoring custom dashboards and logs-based metrics. After the initial deployment, you notice that the dashboards show data only for some services, while others appear to have no metrics. You verify that all services are running and emitting logs. What is the most likely cause?

You are the DevOps engineer for a large gaming company. Your game backend runs on Compute Engine instances behind a global HTTP(S) Load Balancer. You have set up Cloud Monitoring with an uptime check for the load balancer's IP address, and you are using logging to capture 404 errors. Recently, a new game update caused a surge in traffic, and you started receiving many alerts from your uptime check indicating that the site is down. However, you verify that the backend instances are healthy and the load balancer is responding correctly, though some requests are timing out due to the increased load. Your alerting policy currently triggers when 2 consecutive checks fail. What is the most likely reason for the false positive alerts?

A small startup uses Cloud Functions for their backend and wants to monitor function execution times and error rates. They have enabled Cloud Monitoring and are viewing metrics in the Cloud Console. They notice that the execution time metric for a particular function shows an average of 200ms, but occasionally there are spikes to 5 seconds, which correspond to user-reported slow responses. They want to be alerted when the function exceeds 1 second for any invocation. What is the simplest way to achieve this?

Your company runs a multi-region application on Google Kubernetes Engine. You have implemented Cloud Monitoring dashboards to track cluster resource utilization and application SLIs. After a recent upgrade, you notice that the dashboard shows a sudden drop in CPU utilization for all nodes in one zone, but the application is still serving traffic normally. You suspect a monitoring issue. What should you investigate first?

A site reliability engineer is defining SLOs for a microservice application running on Google Kubernetes Engine. The application serves user-facing API requests. Which TWO approaches should the engineer take to effectively monitor the service's performance?

Refer to the exhibit.

```yaml
name: projects/my-project/alertPolicies/12345
displayName: High Error Rate
combiner: OR
conditions:
- conditionThreshold:
    filter: metric.type="logging.googleapis.com/user/myapp/error_count" resource.type="k8s_container"
    aggregations:
    - alignmentPeriod: 60s
      perSeriesAligner: ALIGN_RATE
    duration: 120s
    comparison: COMPARISON_GT
    thresholdValue: 5
    trigger:
      count: 1
```

An engineer notices that this alert fires too frequently during normal operation. Which change would most likely reduce the noise?
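To reason about how a threshold condition with a duration behaves, a simplified model can help. The sketch below treats the time series as one aligned point per alignment period and treats the condition as met once enough consecutive points exceed the threshold to cover the duration. This is an approximation for study purposes, not Cloud Monitoring's actual evaluation logic. The function name and sample values are hypothetical.

```python
from typing import Optional, Sequence


def first_firing_index(points: Sequence[float], threshold: float,
                       duration_s: int, alignment_period_s: int) -> Optional[int]:
    """Index of the aligned point at which the condition is first met, else None.

    Simplification: the condition needs duration_s // alignment_period_s
    consecutive points above the threshold (at least one).
    """
    needed = max(1, duration_s // alignment_period_s)
    streak = 0
    for i, value in enumerate(points):
        streak = streak + 1 if value > threshold else 0
        if streak >= needed:
            return i
    return None


# Assumed per-minute aligned error rates; threshold 5 as in the exhibit.
rates = [1, 6, 2, 7, 8, 3]
print(first_firing_index(rates, 5, duration_s=120, alignment_period_s=60))  # 4
print(first_firing_index(rates, 5, duration_s=300, alignment_period_s=60))  # None
```

In this model a single spike (index 1) does not trigger the 120s condition, while two consecutive violations do. Lengthening the duration requires a longer sustained violation before the condition is met.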

A company runs a multi-region web application on Google Kubernetes Engine (GKE) using Cloud Load Balancing and Cloud Armor. They use Cloud Monitoring to track user-facing latency. Recently, they noticed that the p99 latency has increased from 200ms to 2s during peak hours, but only for users in the US region. The team suspects a specific backend service in us-central1 is causing the spike. They have set up a dashboard showing latency by region, but the latency metric is aggregated globally, not broken down by region. What should they do to pinpoint the issue?


Other PCDOE exam domains

Bootstrapping a Google Cloud organization for DevOps

Managing service incidents

Managing Google Cloud costs

Building and implementing CI/CD pipelines

Optimizing service performance

Frequently asked questions

What does the Implementing service monitoring strategies domain cover on the PCDOE exam?

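The Implementing service monitoring strategies domain covers the monitoring topics in the PCDOE exam blueprint published by Google Cloud. The questions on this page touch on Cloud Monitoring dashboards and alerting policies, SLIs, SLOs and error budgets, uptime checks, logs-based metrics, and Prometheus-based monitoring on GKE. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PCDOE domains, with no account required.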

How many Implementing service monitoring strategies questions are in the PCDOE question bank?

The Courseiva PCDOE question bank contains 78 questions in the Implementing service monitoring strategies domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Implementing service monitoring strategies for PCDOE?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation, even for questions you answer correctly, to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Implementing service monitoring strategies questions for PCDOE?

Yes. The session launcher on this page draws questions exclusively from the Implementing service monitoring strategies domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.

Free forever · No credit card required
