PCDOE Implementing service monitoring strategies — All Questions With Answers

Question 1 (easy, multiple choice)

A team is monitoring a production service on Google Kubernetes Engine (GKE) and notices that a deployment is occasionally returning HTTP 503 errors. The team has set up a ServiceMonitor in Prometheus to scrape metrics from the pods. What is the most likely cause of the intermittent 503 errors?

Question 2 (medium, multiple choice)

A cloud operations team is implementing monitoring for a microservices application deployed on Compute Engine. They want to create a custom dashboard in Cloud Monitoring that shows the 99th percentile latency of a specific service over the last hour. Which combination of Cloud Monitoring features should they use?
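
As background for what "99th percentile latency" means, here is a minimal Python sketch using the nearest-rank method on raw samples. This is only a conceptual illustration: Cloud Monitoring estimates percentiles from distribution-valued metrics rather than from a raw sample list, and the function name and sample data are hypothetical.

```python
def percentile_nearest_rank(values, pct: int):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer form of ceil(pct / 100 * n), avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p99 = percentile_nearest_rank(latencies_ms, 99)
p50 = percentile_nearest_rank(latencies_ms, 50)
print(p99, p50)
```

With these samples, 99% of requests completed in `p99` milliseconds or less, which is why a p99 chart exposes tail latency that an average would hide.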

Question 3 (hard, multiple choice)

An e-commerce platform is using Cloud Load Balancing with a backend service that has a custom health check. The health check is failing intermittently, causing traffic to be routed away from healthy instances. The team has enabled Cloud Logging and wants to diagnose the issue. Which log view should they examine to see the health check probe results?

Question 4 (medium, multiple choice)

A DevOps engineer is setting up alerting policies for a critical API service. They want to receive an alert if the error rate exceeds 5% for at least 5 minutes, but only during business hours (9 AM to 5 PM). Which approach should they use?

Question 5 (hard, multiple choice)

A company is running a stateful workload on Compute Engine and has configured a TCP health check on port 8080. The health check is failing, but the application is running and responding on port 8080 when tested manually from within the instance. What is the most likely cause of the health check failure?

Question 6 (medium, multi-select)

Which TWO of the following are best practices for implementing service monitoring in Google Cloud? (Choose 2)

Question 7 (hard, multi-select)

Which THREE of the following are valid approaches to monitor a custom application metric in Cloud Monitoring? (Choose 3)

Question 8 (easy, multiple choice)

A DevOps engineer runs the command above and gets the output shown. What does this output indicate?

Question 9 (medium, multiple choice)

A team has deployed a Prometheus server on GKE using the configuration shown in the exhibit. They expect Prometheus to scrape metrics from pods with the label 'app: my-app' and the annotation 'prometheus.io/scrape: true' on port 8080. However, no metrics are being collected. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
# prometheus.yml
scrape_configs:
  - job_name: 'my-app'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: my-app
      action: keep
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      regex: "true"
      action: keep
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: (.+)
      replacement: $1:8080
```
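
To trace what the last rule in the exhibit writes into `__address__`, here is a minimal Python sketch of a `replace` relabel step. It assumes Prometheus's documented behavior: the regex is anchored against the whole source value, `$1` expands to the first capture group, and a non-matching regex leaves the target label unchanged. The function name `relabel_replace`, the annotation value `9090`, and the address `10.0.0.5:9090` are hypothetical, and Python's `re` module only approximates Prometheus's RE2 syntax.

```python
import re


def relabel_replace(source_value: str, regex: str, replacement: str, current_target: str) -> str:
    """Simplified model of a Prometheus `replace` relabel action."""
    match = re.fullmatch(regex, source_value)  # Prometheus anchors the regex
    if match is None:
        # No match: the target label keeps its current value.
        return current_target
    # Expand $1, $2, ... with the corresponding capture groups.
    return re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), replacement)


# Pod annotated with prometheus.io/port: "9090" (hypothetical value).
with_annotation = relabel_replace("9090", r"(.+)", "$1:8080", "10.0.0.5:9090")
# Pod without the annotation: the source value is the empty string.
without_annotation = relabel_replace("", r"(.+)", "$1:8080", "10.0.0.5:9090")

print(with_annotation)
print(without_annotation)
```

Comparing the two printed values against a normal `host:port` scrape address is a quick way to reason about whether the rule in the exhibit yields a reachable target.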

Question 10 (medium, multiple choice)

You are monitoring a microservices application deployed on Google Kubernetes Engine (GKE) that uses Cloud Monitoring for observability. You notice that the error rate for a critical service has increased, but the CPU and memory usage remain normal. The service uses gRPC and logs are structured. Which Cloud Monitoring tool should you use first to diagnose the root cause of the increased error rate?

Question 11 (hard, multiple choice)

A company uses Cloud Monitoring to track latency for a multi-region web application. The SLO is 99.9% of requests under 500ms over a 30-day rolling window. The error budget has been rapidly depleting over the last week. The operations team wants to understand the impact of recent deployments. Which approach should they use to correlate deployment changes with latency spikes?
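
The error-budget arithmetic behind this scenario can be sketched in a few lines of Python. The request counts below are hypothetical, and the sketch treats the SLO as a simple good-requests ratio over the window; it does not model how Cloud Monitoring stores or evaluates SLOs.

```python
def error_budget_report(slo: float, total_requests: int, bad_requests: int) -> dict:
    """Summarize error-budget consumption for a request-based SLO."""
    # Number of requests allowed to miss the target in this window.
    allowed_bad = (1.0 - slo) * total_requests
    consumed = bad_requests / allowed_bad if allowed_bad else float("inf")
    return {
        "allowed_bad": allowed_bad,
        "consumed": consumed,          # fraction of the budget already spent
        "remaining": 1.0 - consumed,   # fraction of the budget left
    }


# Hypothetical 30-day window: 1,000,000 requests, 400 of them slower than 500 ms.
report = error_budget_report(slo=0.999, total_requests=1_000_000, bad_requests=400)
print(report)
```

With a 99.9% target, roughly 1,000 of these requests may exceed 500 ms, so 400 slow requests spend about 40% of the budget.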

Question 12 (easy, multiple choice)

You are setting up alerting for a batch processing job that runs daily on Compute Engine. The job must complete within 2 hours. Which metric and alert condition should you use to ensure you are notified if the job is still running after 90 minutes?

Question 13 (medium, multi-select)

Which TWO metrics should be included in a comprehensive monitoring strategy for a production Kubernetes workload to detect performance degradation and capacity issues?

Question 14 (hard, multiple choice)

Your organization runs a critical e-commerce platform on Google Kubernetes Engine (GKE). The platform uses Cloud Service Mesh (Anthos Service Mesh) for traffic management and Cloud Monitoring for observability. Recently, after a new release, you observe that the p99 latency of the checkout service has increased from 200ms to 2s. The service's CPU and memory metrics appear normal, and there are no error logs. The release included a change to the Istio VirtualService configuration that added a retry policy: 3 retries with a 500ms timeout per retry. You suspect that the retries are contributing to the latency increase. You want to use Cloud Monitoring to confirm this hypothesis. Which approach should you take?
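
The numbers in this scenario can be checked with a short worst-case calculation. This is a rough sketch that assumes the initial attempt is subject to the same 500 ms per-try timeout as each retry and that any backoff between attempts is ignored; the function name is hypothetical.

```python
def worst_case_latency_ms(retries: int, per_try_timeout_ms: int) -> int:
    """Upper bound on request latency when every attempt times out."""
    attempts = 1 + retries  # the initial attempt plus each retry
    return attempts * per_try_timeout_ms


# Retry policy from the scenario: 3 retries, 500 ms timeout per try.
print(worst_case_latency_ms(retries=3, per_try_timeout_ms=500))  # 2000
```

Under these assumptions the upper bound is 2000 ms, which is consistent with the observed p99 of 2s.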

Question 15 (medium, multiple choice)

You are a DevOps engineer for a SaaS company that provides a REST API. The API is deployed on Google Cloud Run. You have configured Cloud Monitoring alerts for 5xx errors. Recently, you received an alert that the error rate exceeded 5% for 5 minutes. You investigated and found that the errors were HTTP 503 (Service Unavailable) from a specific endpoint. The endpoint calls an internal Cloud SQL database. The database CPU utilization was at 90% during that period. You suspect the database is the bottleneck. Which action should you take to reduce the error rate without over-provisioning?

Question 16 (medium, multiple choice)

A company uses Cloud Run for a critical service and needs to set up alerting for 5xx errors. They want to receive a notification within 1 minute of the error rate exceeding 1% for any 1-minute window. Which alerting approach should they use?

Question 17 (easy, multi-select)

Which TWO are best practices for implementing service monitoring strategies in Google Cloud?

Question 18 (hard, multiple choice)

A team has set up the alerting policies shown in the exhibit. They receive an alert for High Memory but not for High CPU. What is the most likely reason?

Exhibit

Refer to the exhibit.

```
{
  "alertPolicies": [
    {
      "displayName": "High CPU Alert",
      "combiner": "OR",
      "conditions": [
        {
          "displayName": "CPU usage > 80%",
          "conditionThreshold": {
            "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" resource.type=\"gce_instance\"",
            "comparison": "COMPARISON_GT",
            "thresholdValue": 0.8,
            "duration": "300s",
            "trigger": {
              "count": 1
            }
          }
        }
      ]
    },
    {
      "displayName": "High Memory Alert",
      "conditions": [
        {
          "displayName": "Memory usage > 90%",
          "conditionThreshold": {
            "filter": "metric.type=\"agent.googleapis.com/memory/percent_used\" resource.type=\"gce_instance\"",
            "comparison": "COMPARISON_GT",
            "thresholdValue": 0.9,
            "duration": "60s",
            "trigger": {
              "count": 1
            }
          }
        }
      ]
    }
  ]
}
```

Question 19 (medium, drag order)

Order the steps to configure a VPC Network Peering between two projects.

Question 20 (medium, matching)

Match each Google Cloud tool to its function in incident management.

Matches

End-to-end incident lifecycle tool

Third-party alerting and on-call scheduling

Asynchronous messaging for event-driven alerts

Serverless automation for incident response

Containerized event-driven applications

Question 21 (easy, multiple choice)

A DevOps team is defining an SLO for a web application that runs on Compute Engine behind an HTTP Load Balancer. They need to measure the proportion of requests that complete within 300ms. Which Cloud Monitoring metric is most appropriate as the SLI?
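
The "proportion of requests that complete within 300ms" is a good-events over total-events ratio. Here is a minimal Python sketch of that ratio on hypothetical raw latency samples; in Cloud Monitoring the same ratio would be derived from a metric rather than from a list of samples.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("at least one request is required")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


# Hypothetical request latencies in milliseconds.
sample = [120, 250, 310, 90, 299, 1200, 300, 45]
sli = latency_sli(sample, threshold_ms=300)
print(sli)  # 0.75
```

Six of the eight sample requests finish within 300 ms, so the SLI for this window is 0.75, which would then be compared against the SLO target.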

Question 22 (medium, multiple choice)

An alerting policy triggers frequently for a spike in CPU utilization on a Compute Engine instance, but the spike lasts only a few seconds. The SRE team wants to reduce false positives. Which change should they make?

Question 23 (hard, multiple choice)

You are designing a monitoring strategy for a microservices application running on Google Kubernetes Engine (GKE). You need to create a custom metric that counts the number of failed login attempts from the application logs. The logs are in JSON format and contain a field 'status' with value 'FAILED'. Which approach should you use?

Question 24 (easy, multiple choice)

A team needs to monitor the availability of an HTTPS endpoint that requires a Bearer token in the request header. What is the simplest way to configure this with Cloud Monitoring?

Question 25 (medium, multiple choice)

You are creating a Cloud Monitoring dashboard to display the 99th percentile latency of your HTTP Load Balancer over the last 6 hours. Which MQL query should you use?

Question 26 (hard, multiple choice)

A company has on-premises servers running Linux and GKE clusters. They want to monitor all infrastructure using Cloud Monitoring. Which solution is most scalable and aligned with Google's best practices?

Question 27 (easy, multiple choice)

An application running on App Engine is throwing exceptions. The DevOps team wants to be notified when a new type of exception appears. Which Cloud Monitoring feature should they use?

Question 28 (medium, multiple choice)

An SRE team needs to implement an incident management workflow that automatically creates a ticket in their ITSM tool when a critical alert fires. They use Cloud Monitoring. Which approach should they use?

Question 29 (hard, multiple choice)

A company wants to reduce costs associated with Cloud Monitoring. They have many custom metrics and high ingestion rates. Which cost optimization strategy is most effective?

Question 30 (easy, multi-select)

A DevOps engineer notices that some Compute Engine instances are not reporting metrics to Cloud Monitoring. Which two potential causes should they investigate? (Choose two.)

Question 31 (medium, multi-select)

An alerting policy for high CPU utilization on a VM is firing even when CPU is not high. The team suspects a misconfiguration. Which two possible issues should they check? (Choose two.)

Question 32 (hard, multi-select)

You are designing SLO monitoring for a high-traffic e-commerce platform. Which three best practices should you follow? (Choose three.)

Question 33 (medium, multiple choice)

An SRE team created the logs-based metric shown in the exhibit. They expect it to count the number of HTTP 500 errors per instance. However, the metric shows no data. What is the most likely cause?

Exhibit

```
"logsBasedMetric": {
  "filter": "resource.type=\"gce_instance\" AND jsonPayload.status=\"500\"",
  "metricDescriptor": {
    "metricKind": "DELTA",
    "valueType": "INT64",
    "name": "custom.googleapis.com/errors/5xx"
  },
  "labelExtractors": {
    "instance_id": "EXTRACT(jsonPayload.instance_id)"
  },
  "description": "Count of 500 errors per instance"
}
```

Question 34 (easy, multiple choice)

A team notices that the 'cpu-high' alert fires frequently even for short bursts. The 'disk-full' alert never sends notifications. Based on the exhibit, what is the issue with each?

Exhibit

```
NAME: cpu-high
CONDITION: cpu_utilization > 0.8
DURATION: 0s
NOTIFICATION_CHANNELS: email

NAME: disk-full
CONDITION: disk_utilization > 0.95
DURATION: 300s
NOTIFICATION_CHANNELS: none
```

Question 35 (hard, multiple choice)

The MQL query in the exhibit is used in a Cloud Monitoring dashboard. What does it display?

Exhibit

```
fetch cloud_run_revision::https://googleapis.com/traces/span
| filter spans == "my-service"
| align delta(1m)
| every 1m
| group_by [span_id], [latency: percentile(99)]
```

Question 36 (easy, multiple choice)

A team wants to monitor the latency of a microservice deployed on GKE. Which Google Cloud tool should they use to collect custom metrics?

Question 37 (medium, multiple choice)

A company has an application that experiences intermittent errors. They want to be notified immediately when the error rate exceeds 1% of total requests. What should they implement?

Question 38 (hard, multiple choice)

A DevOps team is setting up SLOs for a service with two critical metrics: availability and latency. They want to measure over a 30-day window. Which approach correctly defines an SLO?

Question 39 (easy, multiple choice)

You need to monitor the health of an external HTTP endpoint. Which resource should you create?

Question 40 (medium, multiple choice)

Your team wants to create a dashboard that shows request latency broken down by API version. Which approach is most efficient?

Question 41 (hard, multiple choice)

A service is deployed on Cloud Run. You need to monitor memory usage per revision. How can you create an alert?

Question 42 (easy, multiple choice)

Which service provides built-in dashboards for Google Cloud services?

Question 43 (medium, multiple choice)

You want to send alerts to a Slack channel when a critical error occurs. What should you do?

Question 44 (hard, multiple choice)

A team wants to implement multi-cluster monitoring for GKE using Managed Service for Prometheus. Which configuration is required?

Question 45 (medium, multi-select)

Your team is implementing SLO monitoring. Which TWO tools should they use to create and monitor SLIs?

Question 46 (easy, multi-select)

You are designing a monitoring strategy for a cloud-native application. Which THREE components are essential for observability?

Question 47 (hard, multi-select)

A company needs to monitor custom application metrics from Compute Engine instances. Which TWO methods can be used?

Question 48 (medium, multiple choice)

Based on the exhibit, what does the duration of 300s mean in this alerting policy?

Exhibit

Refer to the exhibit.
```json
{
  "name": "projects/my-project/alertPolicies/123456789",
  "displayName": "High CPU Alert",
  "conditions": [
    {
      "name": "projects/my-project/alertPolicies/123456789/conditions/987654321",
      "displayName": "CPU Utilization > 80%",
      "conditionThreshold": {
        "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" resource.type=\"gce_instance\"",
        "aggregations": [
          {
            "alignmentPeriod": "60s",
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ],
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.8,
        "duration": "300s",
        "trigger": {
          "count": 1
        }
      }
    }
  ]
}
```
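
To reason about how `duration` interacts with `alignmentPeriod`, here is a deliberately simplified Python model. It assumes one aligned point per 60s period and treats the condition as met once the value has stayed above the threshold for enough consecutive points to cover the duration. The real evaluation timing in Cloud Monitoring may differ, and the function name and sample values are hypothetical.

```python
from typing import Optional, Sequence


def first_alert_index(points: Sequence[float], threshold: float,
                      duration_s: int, period_s: int) -> Optional[int]:
    """Index of the first aligned point at which the condition is met, or None."""
    needed = -(-duration_s // period_s)  # ceil(duration / period) consecutive points
    streak = 0
    for index, value in enumerate(points):
        # Any point at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= needed:
            return index
    return None


# Hypothetical one-minute mean CPU utilization values.
sustained = [0.85, 0.90, 0.70, 0.85, 0.90, 0.95, 0.88, 0.91]
short_spike = [0.90, 0.95, 0.50, 0.40]

print(first_alert_index(sustained, 0.8, duration_s=300, period_s=60))    # 7
print(first_alert_index(short_spike, 0.8, duration_s=300, period_s=60))  # None
```

In this model the dip to 0.70 resets the count, so the condition is only met after five uninterrupted one-minute points above 0.8, and a two-minute spike never meets it.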

Question 49 (easy, multiple choice)

Based on the exhibit, which Cloud Logging query filter will return all logs of this type?

Exhibit

Refer to the exhibit.
```json
{
  "insertId": "abc123",
  "severity": "ERROR",
  "jsonPayload": {
    "message": "Connection timeout",
    "service": "auth-service"
  },
  "resource": {
    "type": "cloud_function",
    "labels": {
      "function_name": "process_login"
    }
  }
}
```
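
Logging filters address nested fields with dotted paths such as `resource.type` or `jsonPayload.service`. The Python sketch below mimics that lookup on the entry from the exhibit. It is not the Logging query language itself, the helper names are hypothetical, and the criteria shown are only illustrative rather than the expected answer.

```python
from typing import Any


def get_field(entry: dict, dotted_path: str) -> Any:
    """Walk a nested dict using a dotted path; return None if any key is missing."""
    current: Any = entry
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(entry: dict, criteria: dict) -> bool:
    """True when every dotted path in criteria equals the expected value (logical AND)."""
    return all(get_field(entry, path) == expected for path, expected in criteria.items())


# Log entry copied from the exhibit.
entry = {
    "insertId": "abc123",
    "severity": "ERROR",
    "jsonPayload": {"message": "Connection timeout", "service": "auth-service"},
    "resource": {"type": "cloud_function", "labels": {"function_name": "process_login"}},
}

print(matches(entry, {"resource.type": "cloud_function", "severity": "ERROR"}))  # True
print(matches(entry, {"resource.type": "gce_instance"}))                         # False
```

Note that `severity` sits at the top level of this entry while `service` is nested under `jsonPayload`, so the path used in a filter has to reflect where the field actually lives.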

Question 50 (hard, multiple choice)

Refer to the exhibit. What is the effect of the metricRelabelings section in this ServiceMonitor?

Exhibit

Refer to the exhibit.
```yaml
apiVersion: monitoring.googleapis.com/v1
kind: ServiceMonitor
metadata:
  name: my-service-monitor
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: http
    interval: 30s
  namespaceSelector:
    matchNames:
    - production
  sampleLimit: 1000
  targetLabels:
  - instance
  metricRelabelings:
  - sourceLabels: [__name__]
    regex: 'container_.*'
    action: drop
```

Question 51 (easy, multiple choice)

A team wants to monitor a Google Cloud Run service for application crashes. Which Google Cloud tool automatically captures and notifies on application errors?

Question 52 (medium, multiple choice)

An SRE team needs to define an SLI for a web service's availability SLO of 99.9%. Which metric should they use?

Question 53 (hard, multiple choice)

A company runs a microservices architecture on GKE with Istio. They want to generate custom request-level metrics for SLO tracking without modifying application code. Which approach is most efficient?

Question 54 (easy, multiple choice)

A developer wants to view logs from all pods in a GKE namespace in real time. Which command-line tool should they use?

Question 55 (easy, multiple choice)

A DevOps engineer needs to verify if a load balancer's health check is behaving normally by examining historical trends. Where should they look?

Question 56 (medium, multiple choice)

You need to monitor a multi-step login flow that involves calling an API, validating a token, and redirecting. Which type of uptime check should you use?

Question 57 (easy, multiple choice)

Which Google Cloud tool automatically captures and visualizes traces for applications running on App Engine?

Question 58 (medium, multiple choice)

An organization wants to be alerted when the total size of a Cloud Storage bucket exceeds 1 TB. Which metric should they monitor?

Question 59 (hard, multiple choice)

A team uses Cloud Monitoring alerting policies with multiple conditions. They want an incident to fire only when both conditions are met simultaneously. What should they configure?

Question 60 (medium, multi-select)

Which TWO options are valid methods to create a custom metric descriptor in Cloud Monitoring?

Question 61 (easy, multi-select)

Which THREE are recommended practices for setting up alerting on Google Cloud?

Question 62 (hard, multi-select)

Which TWO metrics should be monitored to detect a potential memory leak in a Compute Engine VM?

Question 63 (medium, multiple choice)

What is the effect of the 'timeshiftDuration' of '3600s' in the dashboard widget?

Question 64 (hard, multiple choice)

Which Cloud Monitoring feature can directly correlate this error with the associated trace and VM instance?

Exhibit

Refer to the exhibit.
```
{
  "insertId": "abc123",
  "jsonPayload": {
    "severity": "ERROR",
    "message": "Database connection timeout",
    "component": "authservice",
    "trace": "projects/my-project/traces/xxx"
  },
  "resource": {
    "type": "gce_instance",
    "labels": {
      "instance_id": "123",
      "zone": "us-central1-a"
    }
  },
  "logName": "projects/my-project/logs/authservice-log",
  "timestamp": "2024-10-01T12:00:00Z"
}
```

Question 65 (hard, multiple choice)

Your company runs a multi-region application on GKE across us-east1 and europe-west1. The application serves a global user base with a strict SLO of 99.95% availability. Recently, the team noticed that during peak hours, some users in South America experience high latency and intermittent errors. The GKE clusters are monitored via Cloud Monitoring with custom dashboards and alerting policies. The team has set up a single alerting policy that triggers when the global error rate exceeds 0.1%. However, the alert fires only after the issue has persisted for 10 minutes, and by then the customer impact is already significant. You need to improve the detection and response time. Which action should you take first?

Question 66 (medium, multiple choice)

A team is using Cloud Monitoring to track the performance of a microservices application. They set up an uptime check for each service, but they notice that some checks are failing intermittently without actual service degradation. What is the most likely cause?

Question 67 (easy, multiple choice)

A company wants to monitor custom application metrics in real-time and trigger alerts when a metric exceeds a threshold. Which Google Cloud service should they use?

Question 68 (hard, multiple choice)

An organization is implementing SLO-based alerting for a critical service. They want to alert when the service has consumed 50% of its error budget over a 30-day window. Considering best practices for alert sensitivity and noise reduction, which alerting approach should they use?
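
Burn rate ties an observed error rate to how fast the error budget is being spent. The Python sketch below uses the common definition (observed error rate divided by the error rate the SLO allows). The 99.9% target, the 0.5% error rate, and the 72-hour window are hypothetical, since the scenario does not state them.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_rate / (1.0 - slo)


def budget_consumed(observed_error_rate: float, slo: float,
                    window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the compliance period's error budget spent during the window."""
    return burn_rate(observed_error_rate, slo) * window_hours / period_hours


# Hypothetical: 0.5% errors against a 99.9% SLO, sustained for 72 hours.
rate = burn_rate(0.005, 0.999)
spent = budget_consumed(0.005, 0.999, window_hours=72)
print(rate, spent)
```

Under these assumptions the burn rate is about 5, and sustaining it for 72 hours of a 720-hour period spends about half of the 30-day budget.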

Question 69 (medium, multi-select)

A DevSecOps team is configuring Cloud Monitoring alerts for proactive incident response. Which two practices are recommended for effective alerting? (Choose two.)

Question 70 (hard, multi-select)

A team is designing a dashboard for their production environment using Cloud Monitoring. Which three types of information should be included on the dashboard to support incident response? (Choose three.)

Question 71 (easy, multiple choice)

Your company runs a stateless web application on Google Kubernetes Engine (GKE). You have configured Cloud Monitoring to track request latency and set up an alert when p95 latency exceeds 500ms for 5 minutes. Recently, the alert has been firing frequently during peak hours. You examine the metrics and see that p95 latency spikes to 600ms for short periods. The application's SLO is 99.9% availability with a latency threshold of 1 second. What should you do to reduce alert noise without compromising the SLO?

Question 72 (medium, multiple choice)

Your organization is migrating a monolithic application to microservices on Cloud Run. You need to monitor the health of each microservice and aggregate logs and metrics in a central dashboard. You have set up Cloud Monitoring custom dashboards and logs-based metrics. After the initial deployment, you notice that the dashboards show data only for some services, while others appear to have no metrics. You verify that all services are running and emitting logs. What is the most likely cause?

Question 73 (hard, multiple choice)

You are the DevOps engineer for a large gaming company. Your game backend runs on Compute Engine instances behind a global HTTP(S) Load Balancer. You have set up Cloud Monitoring with an uptime check for the load balancer's IP address, and you are using logging to capture 404 errors. Recently, a new game update caused a surge in traffic, and you started receiving many alerts from your uptime check indicating that the site is down. However, you verify that the backend instances are healthy and the load balancer is responding correctly, though some requests are timing out due to the increased load. Your alerting policy currently triggers when 2 consecutive checks fail. What is the most likely reason for the false positive alerts?

Question 74 (easy, multiple choice)

A small startup uses Cloud Functions for their backend and wants to monitor function execution times and error rates. They have enabled Cloud Monitoring and are viewing metrics in the Cloud Console. They notice that the execution time metric for a particular function shows an average of 200ms, but occasionally there are spikes to 5 seconds, which correspond to user-reported slow responses. They want to be alerted when the function exceeds 1 second for any invocation. What is the simplest way to achieve this?

Question 75 (medium, multiple choice)

Your company runs a multi-region application on Google Kubernetes Engine. You have implemented Cloud Monitoring dashboards to track cluster resource utilization and application SLIs. After a recent upgrade, you notice that the dashboard shows a sudden drop in CPU utilization for all nodes in one zone, but the application is still serving traffic normally. You suspect a monitoring issue. What should you investigate first?

Question 76 (medium, multi-select)

A site reliability engineer is defining SLOs for a microservice application running on Google Kubernetes Engine. The application serves user-facing API requests. Which TWO approaches should the engineer take to effectively monitor the service's performance?

Question 77 (hard, multiple choice)

Refer to the exhibit.

```yaml
name: projects/my-project/alertPolicies/12345
displayName: High Error Rate
combiner: OR
conditions:
- conditionThreshold:
    filter: metric.type="logging.googleapis.com/user/myapp/error_count" resource.type="k8s_container"
    aggregations:
    - alignmentPeriod: 60s
      perSeriesAligner: ALIGN_RATE
    duration: 120s
    comparison: COMPARISON_GT
    thresholdValue: 5
    trigger:
      count: 1
```

An engineer notices that this alert fires too frequently during normal operation. Which change would most likely reduce the noise?

Question 78 (easy, multiple choice)

A company runs a multi-region web application on Google Kubernetes Engine (GKE) using Cloud Load Balancing and Cloud Armor. They use Cloud Monitoring to track user-facing latency. Recently, they noticed that the p99 latency has increased from 200ms to 2s during peak hours, but only for users in the US region. The team suspects a specific backend service in us-central1 is causing the spike. They have set up a dashboard showing latency by region, but the latency metric is aggregated globally, not broken down by region. What should they do to pinpoint the issue?
