CCNA Service Monitoring Questions

75 of 78 questions · Page 1/2 · Service Monitoring topic · Answers revealed

1

MCQ · hard

You are the DevOps engineer for a large gaming company. Your game backend runs on Compute Engine instances behind a global HTTP(S) Load Balancer. You have set up Cloud Monitoring with an uptime check for the load balancer's IP address, and you are using logging to capture 404 errors. Recently, a new game update caused a surge in traffic, and you started receiving many alerts from your uptime check indicating that the site is down. However, you verify that the backend instances are healthy and the load balancer is responding correctly, though some requests are timing out due to the increased load. Your alerting policy currently triggers when 2 consecutive checks fail. What is the most likely reason for the false positive alerts?

A. The global load balancer's health check is failing due to the surge.

B. The monitoring project has reached its limit for concurrent uptime checks.

C. The uptime check is configured to check a specific URL that is returning a 503 status code.

D. The uptime check's timeout is too short for the current response times.

Answer: D

During a traffic surge, response times increase; if the timeout is too short, the check fails even though the site is up.

Why this answer

Option D is correct because the uptime check's timeout is too short for the current response times. During the surge the load balancer still answers most requests correctly, but an uptime check fails whenever the response does not arrive within its configured timeout (10 seconds by default). Because the alert triggers after only 2 consecutive failures, two slow responses in a row are enough to report the site as down even though the backend and load balancer are healthy.
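The interaction between the timeout and the "2 consecutive failures" rule can be sketched in a few lines of Python. This is a simplified model, not Cloud Monitoring's implementation; the function names and the sample response times are hypothetical.

```python
from typing import Sequence


def check_passes(response_time_s: float, timeout_s: float) -> bool:
    """A probe passes only if the response arrives within the timeout."""
    return response_time_s <= timeout_s


def alert_fires(response_times_s: Sequence[float], timeout_s: float,
                consecutive_failures: int = 2) -> bool:
    """Return True once `consecutive_failures` probes in a row have failed."""
    streak = 0
    for rt in response_times_s:
        streak = 0 if check_passes(rt, timeout_s) else streak + 1
        if streak >= consecutive_failures:
            return True
    return False


# Hypothetical response times (seconds) seen by the probe during the surge.
surge = [2.0, 11.5, 12.1, 3.0]
print(alert_fires(surge, timeout_s=10))  # True: two slow responses in a row
print(alert_fires(surge, timeout_s=15))  # False: the same responses fit the timeout
```

The site is "up" in both runs; only the timeout changes, which is why a longer timeout (or a larger failure count) removes the false positive.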

Exam trap

Google Cloud often tests the distinction between health checks (which verify backend instance health) and uptime checks (which verify end-to-end availability from a monitoring perspective), leading candidates to confuse a healthy backend with a successful uptime check response.

How to eliminate wrong answers

Option A is wrong because the global load balancer's health check is separate from the uptime check; the health check monitors backend instance health, and the scenario states the backend instances are healthy, so the health check is not failing. Option B is wrong because Cloud Monitoring does not have a hard limit on concurrent uptime checks; the limit is on the number of uptime checks per project (100), not on concurrency, and a surge in traffic would not cause a limit to be reached. Option C is wrong because the scenario mentions capturing 404 errors, not 503 errors; a 503 status code would indicate the backend is unavailable, but the problem states the backend is healthy and the load balancer is responding correctly, so the uptime check is not receiving a 503.

2

MCQ · medium

A team has deployed a Prometheus server on GKE using the configuration above. They expect Prometheus to scrape metrics from pods with the label 'app: my-app' and the annotation 'prometheus.io/scrape: true' on port 8080. However, no metrics are being collected. What is the most likely cause?

A. The kubernetes_sd_configs role is set to 'pod' but should be 'endpoints'.

B. Prometheus needs to be configured to listen on port 9090 for scraping.

C. The keep action for the label 'my-app' is filtering out all pods.

D. The relabel_config for port incorrectly constructs the target address; it should use the annotation value directly without appending ':8080'.

Answer: D

The port annotation already carries the port number, so appending a fixed port on top of it is wrong.

Why this answer

Option D is correct because the relabel_config constructs the target address incorrectly. The configuration likely reads the port from `__meta_kubernetes_pod_annotation_prometheus_io_port` but then appends ':8080' statically, overriding the annotation value. If the annotation specifies a different port or is missing, the hardcoded port makes Prometheus scrape the wrong endpoint or fail entirely.

The correct approach is to use the annotation value directly without appending a fixed port, or to use a default port only when the annotation is absent.
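Since the exhibit is not reproduced here, the following Python sketch only emulates the idea of building the scrape address from the annotation value. The regular expression is modeled on the pattern commonly seen in Prometheus example configurations for pod discovery; the function name and sample addresses are hypothetical.

```python
import re

# Joins "<address>;<port annotation>" and keeps the host plus the annotated port.
# Prometheus anchors relabel regexes, which fullmatch() reproduces here.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Return host:port using the annotation; leave the address alone if absent."""
    match = ADDRESS_RE.fullmatch(f"{address};{port_annotation}")
    if match is None:
        return address  # no usable annotation, so keep the discovered address
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.0.0.5:9102", "8080"))  # 10.0.0.5:8080
print(rewrite_address("10.0.0.5", "8080"))       # 10.0.0.5:8080
print(rewrite_address("10.0.0.5:9102", ""))      # 10.0.0.5:9102
```

The port comes entirely from the annotation, and nothing is appended after it, which is the behavior option D describes.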

Exam trap

The trap here is that candidates often overlook the relabel_config port construction and assume the issue is with service discovery roles or label filtering, when in fact the static port override is the subtle cause of scrape failures.

How to eliminate wrong answers

Option A is wrong because the `kubernetes_sd_configs` role set to 'pod' is correct for scraping pods directly; changing it to 'endpoints' would scrape service endpoints, which is not required here and would not fix the issue. Option B is wrong because Prometheus's own listening port (default 9090) is for its web UI and API, not for scraping targets; scraping uses the target's port, not Prometheus's port. Option C is wrong because the `keep` action for the label 'my-app' is likely intended to filter pods with that label, and if it were filtering out all pods, no targets would be discovered at all; the problem is specifically with port construction, not label matching.

3

MCQ · hard

A company is running a stateful workload on Compute Engine and has configured a TCP health check on port 8080. The health check is failing, but the application is running and responding on port 8080 when tested manually from within the instance. What is the most likely cause of the health check failure?

A. The health check is configured to use port 80 instead of port 8080.

B. The firewall rules are not allowing traffic from the health check probe IP ranges.

C. The instance's DNS resolution is failing, causing the health check to use the wrong IP.

D. The health check response timeout is set too low (e.g., 1 second).

Answer: B

Health check probes use specific IP ranges that must be allowed.

Why this answer

The health check probes originate from Google's health check systems, which use specific IP ranges (e.g., 35.191.0.0/16, 130.211.0.0/22). If firewall rules on the instance or VPC do not explicitly allow inbound traffic from these probe IP ranges on port 8080, the health check will fail even though the application is running and responding to manual tests from within the instance. This is the most common cause of health check failures when the application itself is healthy.
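To make the reasoning concrete, here is a small Python model of ingress rules using the two probe ranges quoted above. It is a deliberately simplified sketch: the rule format, the function name, and the sample addresses are assumptions, and a real VPC firewall has priorities, tags, and deny rules that are not modeled.

```python
import ipaddress

# Probe source ranges quoted in the explanation above.
HEALTH_CHECK_RANGES = ["35.191.0.0/16", "130.211.0.0/22"]


def firewall_allows(source_ip, port, allow_rules):
    """allow_rules is a list of (cidr, port) ingress allow rules."""
    addr = ipaddress.ip_address(source_ip)
    return any(
        addr in ipaddress.ip_network(cidr) and port == rule_port
        for cidr, rule_port in allow_rules
    )


internal_only = [("10.0.0.0/8", 8080)]
with_probes = internal_only + [(cidr, 8080) for cidr in HEALTH_CHECK_RANGES]

probe_ip = "35.191.10.20"  # hypothetical address inside the first probe range
print(firewall_allows("10.1.2.3", 8080, internal_only))  # True
print(firewall_allows(probe_ip, 8080, internal_only))    # False
print(firewall_allows(probe_ip, 8080, with_probes))      # True
```

Traffic from inside the network passes while the probe is dropped until its source range is allowed, matching the symptom in the question.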

Exam trap

Google Cloud often tests the misconception that health check failures are always due to application misconfiguration or port mismatches, but the trap here is that the health check probes come from external Google IP ranges that must be explicitly allowed in firewall rules, not from within the instance's own network.

How to eliminate wrong answers

Option A is wrong because the health check is explicitly configured to use port 8080, and the question states the application responds on that port; a misconfiguration to port 80 would be a different issue, but the scenario describes the health check as failing despite the correct port being configured. Option C is wrong because health checks in Google Cloud use IP addresses, not DNS names, so DNS resolution is irrelevant to the probe reaching the instance. Option D is wrong because a timeout set too low would cause intermittent failures or timeouts, but the question states the health check is consistently failing, and the application responds instantly from within the instance, indicating the probes are not reaching the instance at all.

4

MCQ · hard

You are designing a monitoring strategy for a microservices application running on Google Kubernetes Engine (GKE). You need to create a custom metric that counts the number of failed login attempts from the application logs. The logs are in JSON format and contain a field 'status' with value 'FAILED'. Which approach should you use?

A. Use Cloud Monitoring's Metrics Explorer to create a metric from logs using a filter.

B. Install the Ops Agent on GKE nodes to collect application metrics directly.

C. Configure a logs-based metric in Cloud Logging that filters for the condition and counts.

D. Export logs to BigQuery and then create a custom metric from the exported data.

Answer: C

Logs-based metrics are designed for this purpose – they count log entries that match a filter and expose them as custom metrics in Cloud Monitoring.

Why this answer

Option C is correct because a logs-based metric in Cloud Logging directly counts occurrences of a specific log entry pattern (e.g., 'status' = 'FAILED') without requiring additional agents or data exports. This approach is purpose-built for deriving metrics from log data and integrates seamlessly with Cloud Monitoring for alerting and dashboards.
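Conceptually, a counter logs-based metric does what the short Python sketch below does: apply a filter to each entry and count the matches. In Cloud Logging the filter would likely be something along the lines of `jsonPayload.status="FAILED"`; the function name and the sample log lines here are hypothetical.

```python
import json


def count_failed_logins(log_lines):
    """Count JSON log entries whose 'status' field equals 'FAILED'."""
    count = 0
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict) and entry.get("status") == "FAILED":
            count += 1
    return count


# Hypothetical application log lines.
sample_logs = [
    '{"user": "alice", "status": "FAILED"}',
    '{"user": "bob", "status": "OK"}',
    "plain text line",
    '{"user": "carol", "status": "FAILED"}',
]
print(count_failed_logins(sample_logs))  # 2
```

The managed service performs this counting continuously at ingestion time, which is why no agent or export is needed.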

Exam trap

Google Cloud often tests the misconception that Metrics Explorer can create metrics from logs, when in fact it only queries and charts existing metrics, while logs-based metrics are the correct service for deriving custom metrics from log data.

How to eliminate wrong answers

Option A is wrong because Metrics Explorer is a visualization tool for existing metrics, not a mechanism to create new metrics from log content. Option B is wrong because the Ops Agent collects system and application metrics from VM instances, not from GKE pods or container logs, and it cannot parse JSON log fields to count failed logins. Option D is wrong because exporting logs to BigQuery adds latency, cost, and complexity; while you could query the data, it is not the recommended or direct method for creating a custom metric from logs.

5

MCQ · hard

A DevOps team is setting up SLOs for a service with two critical metrics: availability and latency. They want to measure over a 30-day window. Which approach correctly defines an SLO?

A. Use Cloud Monitoring to create a custom SLI based on logs and set an SLO with a 7-day rolling window

B. Use Cloud Tasks to schedule a cron job that calculates availability

C. Use Cloud Monitoring to create a custom SLI based on request latency metrics and set an SLO with a 30-day rolling window

D. Use Stackdriver Monitoring (deprecated) to set an SLO with a fixed 30-day window

Answer: C

This correctly uses Cloud Monitoring and a 30-day rolling window.

Why this answer

Option C is correct because it uses Cloud Monitoring to create a custom SLI based on request latency metrics, which is a valid SLI type, and sets an SLO with a 30-day rolling window, matching the requirement. Cloud Monitoring's SLO feature natively supports rolling windows, and latency is a standard metric for defining service-level objectives.

Exam trap

Google Cloud often tests the distinction between deprecated services (Stackdriver) and current ones (Cloud Monitoring), and the requirement for a rolling window versus a fixed window, leading candidates to pick D if they are unaware of deprecation or misunderstand window types.

How to eliminate wrong answers

Option A is wrong because it specifies a 7-day rolling window instead of the required 30-day window, and while logs can be used for SLIs, the question explicitly asks for a 30-day window. Option B is wrong because Cloud Tasks is a task queue service for asynchronous work, not a monitoring or SLO calculation tool; using it to calculate availability is architecturally incorrect and not a supported approach for SLOs. Option D is wrong because Stackdriver Monitoring is deprecated and should not be used; the current service is Cloud Monitoring, and a fixed 30-day window is not the standard rolling window approach for SLOs in Cloud Monitoring.

6

MCQ · medium

Based on the exhibit, what does the duration of 300s mean in this alerting policy?

A. The alert fires if CPU utilization is above 80% for at least 5 consecutive minutes.

B. The alert fires after 300 seconds of sustained CPU utilization above 80% with a count of 1.

C. The alert fires if CPU utilization averaged over 5 minutes exceeds 80%.

D. The alert fires if CPU utilization is above 80% for any 5-minute window in the last hour.

Answer: A

The duration is the minimum continuous time the condition must hold before the alert fires.

Why this answer

In the PagerDuty alerting policy shown in the exhibit, the duration of 300s (5 minutes) defines the minimum period during which the CPU utilization must continuously exceed the 80% threshold before the alert fires. This prevents transient spikes from triggering unnecessary alerts. Option A correctly states that the alert fires only if CPU utilization is above 80% for at least 5 consecutive minutes, matching the policy's configuration.
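The difference between a sustained condition (option A) and an average over the window (option C) is easy to see in code. The sketch below assumes one sample per minute, so 300s corresponds to five samples; the function names and CPU values are hypothetical and the model ignores alignment details.

```python
def fires_sustained(samples, threshold, duration_s, period_s):
    """Fire only if every sample stays above the threshold for the whole duration."""
    needed = duration_s // period_s  # consecutive samples required
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= needed:
            return True
    return False


def fires_on_average(samples, threshold, duration_s, period_s):
    """Fire if the mean of any window of the given duration exceeds the threshold."""
    needed = duration_s // period_s
    for i in range(len(samples) - needed + 1):
        if sum(samples[i:i + needed]) / needed > threshold:
            return True
    return False


# Hypothetical CPU utilization in percent, one sample per minute.
cpu = [95, 95, 70, 95, 95, 60]
print(fires_sustained(cpu, 80, 300, 60))   # False: the dip to 70 resets the streak
print(fires_on_average(cpu, 80, 300, 60))  # True: the first five samples average 90
```

The same data triggers the average-based rule but not the sustained one, which is why options A and C are not interchangeable.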

Exam trap

Google Cloud often tests the distinction between 'sustained threshold' (all data points above the threshold for the duration) and 'average-based threshold' (mean over the window), and candidates frequently confuse the duration as a sliding window or a count-based condition instead of a consecutive period.

How to eliminate wrong answers

Option B is wrong because it incorrectly implies that the alert fires after 300 seconds of sustained CPU utilization above 80% with a count of 1, but the 'count' parameter in PagerDuty refers to the number of times the condition must be met within a window, not a separate condition; here, the duration alone defines the sustained period. Option C is wrong because it describes an average-based threshold (CPU utilization averaged over 5 minutes exceeds 80%), but the exhibit shows a 'threshold' condition, not an average; the alert fires only when every data point in the 300s window is above 80%, not when the average is above 80%. Option D is wrong because it introduces a 'last hour' window, which is not part of the policy; the duration of 300s is a fixed consecutive period, not a sliding window within an hour.

7

Multi-Select · medium

An alerting policy for high CPU utilization on a VM is firing even when CPU is not high. The team suspects a misconfiguration. Which two possible issues should they check? (Choose two.)

Select 2 answers

A. The alert condition is using the average aggregation with a short alignment period.

B. The threshold is set too low compared to actual CPU usage.

C. The metric is being duplicated because multiple agents are running.

D. The alerting policy was created in a different project and not imported.

E. The VM is reporting metrics from a custom namespace instead of the standard agent.

Answers: A, C

A short alignment period makes the alert sensitive to brief spikes, causing false positives.

Why this answer

Option A is correct because using a short alignment period with average aggregation can cause the alert to fire on brief spikes in CPU utilization that do not represent sustained high usage. If the alignment period is too short (e.g., 1 minute), the alerting policy may trigger on transient bursts, even when the overall CPU load is low. This is a common misconfiguration in Google Cloud Monitoring (formerly Stackdriver) where the alignment period and aggregator settings must match the expected workload pattern.
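A rough Python illustration of the alignment effect follows. It averages raw samples into fixed windows and checks whether any aligned point crosses the threshold; the helper name and the CPU values are hypothetical, and real alignment works on timestamps rather than sample counts.

```python
def align_mean(samples, points_per_window):
    """Average raw samples into consecutive, non-overlapping windows."""
    n = points_per_window
    return [
        sum(samples[i:i + n]) / n
        for i in range(0, len(samples) - n + 1, n)
    ]


# Hypothetical per-minute CPU utilization (percent) with one brief spike.
cpu = [20, 20, 95, 20, 20, 20, 20, 20, 20, 20]
threshold = 80

short = align_mean(cpu, 1)  # 1-minute alignment keeps the spike intact
long = align_mean(cpu, 5)   # 5-minute alignment smooths it out

print(any(v > threshold for v in short))  # True
print(any(v > threshold for v in long))   # False
```

With the short window the single spike becomes an aligned point above the threshold; with the longer window the same data never crosses it.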

Exam trap

Google Cloud often tests the misconception that a low threshold (Option B) is the cause of false positives, but the real issue is the alignment period and aggregation settings that amplify transient spikes, not the threshold value itself.

8

MCQ · medium

You are a DevOps engineer for a SaaS company that provides a REST API. The API is deployed on Google Cloud Run. You have configured Cloud Monitoring alerts for 5xx errors. Recently, you received an alert that the error rate exceeded 5% for 5 minutes. You investigated and found that the errors were HTTP 503 (Service Unavailable) from a specific endpoint. The endpoint calls an internal Cloud SQL database. The database CPU utilization was at 90% during that period. You suspect the database is the bottleneck. Which action should you take to reduce the error rate without over-provisioning?

A. Implement connection pooling and retry logic with exponential backoff in the API service

B. Increase the max instances per revision in Cloud Run to handle more concurrent requests

C. Reduce the min instances of Cloud Run to decrease load on the database

D. Add a Cloud SQL read replica and route read queries to it

Answer: A

This reduces the number of simultaneous connections to the database and handles transient failures gracefully.

Why this answer

Option A is correct because implementing connection pooling and retry logic with exponential backoff directly addresses the database bottleneck without over-provisioning. Connection pooling reduces the number of concurrent connections to Cloud SQL, lowering CPU contention, while exponential backoff prevents thundering herd retries that could further overwhelm the database. This approach optimizes existing resources rather than scaling infrastructure.
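The retry half of this answer can be sketched as below: exponential backoff with jitter, written so the delay function can be swapped out for testing. `TransientError`, `call_with_backoff`, and the flaky operation are hypothetical names; connection pooling itself would normally come from the database driver or an ORM and is not shown.

```python
import random
import time


class TransientError(Exception):
    """Stands in for a retryable failure such as a refused DB connection."""


def call_with_backoff(operation, max_attempts=5, base_delay_s=0.1,
                      max_delay_s=5.0, sleep=time.sleep, rng=random.random):
    """Retry `operation`, doubling the delay cap after each transient failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(rng() * cap)  # jitter spreads retries from many clients


# Demo: an operation that fails twice, then succeeds. Delays are recorded
# instead of slept so the example finishes instantly.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("database busy")
    return "ok"


delays = []
print(call_with_backoff(flaky_query, sleep=delays.append, rng=lambda: 1.0))  # ok
print(delays)  # [0.1, 0.2]
```

Randomizing each delay keeps many clients from retrying at the same instant, which is the thundering-herd effect the explanation warns about.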

Exam trap

Google Cloud often tests the misconception that scaling application instances (Cloud Run) is the default fix for backend bottlenecks, but the trap here is that increasing concurrency without addressing database connection limits can exacerbate the problem.

How to eliminate wrong answers

Option B is wrong because increasing Cloud Run max instances would amplify concurrent requests to the database, worsening CPU saturation and potentially increasing 503 errors. Option C is wrong because reducing min instances would decrease baseline capacity, causing cold starts and potentially increasing latency or errors under load, not reducing database CPU. Option D is wrong because adding a read replica only helps with read-heavy workloads, but the endpoint in question likely performs writes or mixed operations; moreover, replicas do not reduce write-related CPU load on the primary instance.

9

MCQ · easy

Which service provides built-in dashboards for Google Cloud services?

A. Cloud Shell

B. Cloud Console

C. Cloud Logging

D. Cloud Monitoring

Answer: D

Cloud Monitoring has built-in dashboards for Google Cloud services.

Why this answer

Cloud Monitoring (option D) provides built-in dashboards for Google Cloud services, offering pre-configured visualizations of metrics, logs, and alerts without requiring manual setup. These dashboards aggregate data from services like Compute Engine, Cloud SQL, and Kubernetes Engine, enabling real-time monitoring of resource utilization and performance. This aligns with the PCDOE domain of implementing service monitoring strategies by providing out-of-the-box observability.

Exam trap

Google Cloud often tests the distinction between Cloud Logging (for logs) and Cloud Monitoring (for metrics and dashboards), so the trap here is confusing the log storage and analysis service with the visualization and alerting service.

How to eliminate wrong answers

Option A is wrong because Cloud Shell is a browser-based command-line interface for managing Google Cloud resources, not a monitoring or dashboard service. Option B is wrong because Cloud Console is the web-based GUI for managing Google Cloud projects, but it does not provide built-in dashboards for monitoring; it relies on Cloud Monitoring for that functionality. Option C is wrong because Cloud Logging is a service for storing, searching, and analyzing log data, not for providing dashboards; it integrates with Cloud Monitoring for visualization.

10

MCQ · easy

A DevOps engineer needs to verify if a load balancer's health check is behaving normally by examining historical trends. Where should they look?

A. Cloud Monitoring Metrics Explorer

B. Cloud Logging

C. Cloud Console health check page

D. Cloud Load Balancing logs

Answer: A

Metrics Explorer stores health check metrics for historical analysis.

Why this answer

Cloud Monitoring Metrics Explorer is the correct place to examine historical trends of a load balancer's health check because it provides time-series metrics such as `https/backend_request_count`, `https/backend_latencies`, and health check probe success/failure counts. These metrics can be queried over custom time ranges and aggregated to detect anomalies or degradation patterns, which is exactly what the question asks for — verifying historical trends of health check behavior.

Exam trap

The trap here is that candidates confuse Cloud Logging (which shows raw events) with Cloud Monitoring Metrics (which shows aggregated trends), leading them to choose Cloud Logging or the health check page when the question explicitly asks for historical trends.

How to eliminate wrong answers

Option B is wrong because Cloud Logging stores discrete log entries (e.g., individual health check probe results or request logs), not aggregated time-series metrics; it is designed for troubleshooting specific events, not for analyzing historical trends over time. Option C is wrong because the Cloud Console health check page shows the current status and recent results of health checks, but it does not provide historical trend data or allow you to examine patterns over days or weeks. Option D is wrong because Cloud Load Balancing logs (e.g., access logs) record individual requests and responses, not health check probe metrics; they are useful for traffic analysis but not for monitoring the health check mechanism itself over time.

11

Multi-Select · hard

You are designing SLO monitoring for a high-traffic e-commerce platform. Which three best practices should you follow? (Choose three.)

Select 3 answers

A. Use multiple SLOs for different critical user journeys.

B. Monitor error budgets and alert when depletion is imminent.

C. Use SLI metrics that align with user experience, like request latency and errors.

D. Use a single global SLO for all customer segments.

E. Set the SLO to 100% to ensure maximum reliability.

Answers: A, B, C

Separate SLOs allow targeted monitoring and alerting for each journey's requirements.

Why this answer

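Option A is correct because using multiple SLOs for different critical user journeys (e.g., checkout, product search, login) allows you to tailor reliability targets to the specific performance and availability needs of each workflow. This granularity prevents a single, coarse SLO from masking issues that affect only a subset of users, enabling more precise monitoring and faster incident response. Option B is correct because tracking the error budget and alerting before it is exhausted gives the team time to react while the SLO can still be met. Option C is correct because SLIs such as request latency and error rate reflect what users actually experience, so an SLO built on them measures real service quality.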
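Options A and B can be tied together with a small error-budget calculation per user journey. The sketch below is illustrative only: the function name, the journeys, the SLO targets, and the request counts are all made-up figures.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% target leaves no budget at all
    return 1 - failed_requests / allowed_failures


# Hypothetical journeys: (SLO target, total requests, failed requests).
journeys = {
    "checkout": (0.999, 1_000_000, 250),
    "search": (0.99, 5_000_000, 60_000),
}

for name, (target, total, failed) in journeys.items():
    remaining = error_budget_remaining(target, total, failed)
    state = "alert" if remaining < 0.25 else "ok"
    print(f"{name}: {remaining:.0%} of budget left ({state})")
```

Tracking each journey separately shows that one path can be well within budget while another has already overspent, and the zero-budget case shows why a 100% target (option E) is unworkable.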

Exam trap

Google Cloud often tests the misconception that a single, high-level SLO is sufficient for monitoring, when in reality multiple SLOs aligned to user journeys are required to detect partial outages that affect specific critical paths.

12

MCQ · medium

A company has an application that experiences intermittent errors. They want to be notified immediately when the error rate exceeds 1% of total requests. What should they implement?

A. Create an uptime check pointing to the application endpoint

B. Create a log-based metric counting error logs and set an alerting policy in Cloud Monitoring

C. Use Cloud Trace to analyze latency and set an alert on trace spans

D. Create a dashboard showing error count over time

Answer: B

This directly monitors error rate from logs and alerts on threshold.

Why this answer

Option B is correct because the requirement is to be notified when the error rate exceeds 1% of total requests, which requires a metric that counts error logs relative to total requests. A log-based metric in Cloud Logging can filter for error log entries (e.g., status codes 5xx), and an alerting policy in Cloud Monitoring can trigger when the ratio of error logs to total requests surpasses 1%. This directly addresses the intermittent error rate condition with precise threshold-based alerting.
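The ratio condition itself is simple arithmetic, sketched below in Python. The function names and counts are hypothetical; in practice the two counts would come from two metrics (errors and total requests) evaluated over the same window.

```python
def error_rate(error_count, total_count):
    """Errors as a fraction of all requests; 0.0 when there is no traffic."""
    return error_count / total_count if total_count else 0.0


def should_alert(error_count, total_count, threshold=0.01):
    """True when the error rate is strictly above the threshold (1% by default)."""
    return error_rate(error_count, total_count) > threshold


print(should_alert(12, 1000))  # True: 1.2% of requests failed
print(should_alert(5, 1000))   # False: 0.5% of requests failed
print(should_alert(0, 0))      # False: no traffic, nothing to alert on
```

Alerting on the ratio rather than the raw error count keeps the policy meaningful as traffic volume changes.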

Exam trap

Google Cloud often tests the distinction between monitoring availability (uptime checks) and monitoring error rates (log-based metrics), leading candidates to mistakenly choose uptime checks for error rate detection.

How to eliminate wrong answers

Option A is wrong because an uptime check only verifies that the application endpoint is reachable and responds (e.g., HTTP 200), but it cannot detect intermittent errors within successful responses or calculate an error rate percentage. Option C is wrong because Cloud Trace analyzes latency and trace spans, not error counts or error rates; it is designed for performance troubleshooting, not real-time error rate alerting. Option D is wrong because a dashboard showing error count over time provides visualization but does not include alerting or notification capabilities; it requires manual monitoring and cannot trigger immediate notifications when the error rate exceeds 1%.

13

Drag & Drop · medium

Order the steps to configure a VPC Network Peering between two projects.

Why this order

Create networks, set firewalls, create peering in both directions, verify.

14

MCQ · medium

A cloud operations team is implementing monitoring for a microservices application deployed on Compute Engine. They want to create a custom dashboard in Cloud Monitoring that shows the 99th percentile latency of a specific service over the last hour. Which combination of Cloud Monitoring features should they use?

A. Use a gauge metric with the max alignment function in a Metrics Explorer chart.

B. Use a distribution metric with the 99th percentile alignment function in a Metrics Explorer chart.

C. Use an uptime check metric and configure the latency percentile in the chart.

D. Create a logs-based metric from application logs and use the count alignment.

Answer: B

Distribution metrics support percentile alignments like 99th percentile.

Why this answer

Option B is correct because Cloud Monitoring's distribution metrics inherently store a histogram of values, allowing percentile calculations like the 99th percentile. By selecting the 99th percentile alignment function in a Metrics Explorer chart, the dashboard directly computes and displays the desired latency threshold from the distribution data over the specified time window.
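To show why a histogram is enough to answer percentile questions, the Python sketch below estimates a percentile from bucket counts by returning the upper bound of the bucket that contains the target rank. This is a coarse, conservative estimate under assumed bucket boundaries; production systems typically interpolate within the bucket, and the function name and data are hypothetical.

```python
def percentile_from_buckets(upper_bounds, counts, pct):
    """Upper bound of the histogram bucket that contains the pct-th percentile."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    rank = pct / 100 * total  # how many observations lie at or below the percentile
    cumulative = 0
    for bound, count in zip(upper_bounds, counts):
        cumulative += count
        if cumulative >= rank:
            return bound
    return upper_bounds[-1]


# Hypothetical latency histogram: bucket upper bounds in ms and request counts.
bounds_ms = [50, 100, 200, 400, 800]
counts = [600, 300, 80, 15, 5]

print(percentile_from_buckets(bounds_ms, counts, 50))  # 50
print(percentile_from_buckets(bounds_ms, counts, 99))  # 400
```

A gauge holding only the latest or maximum value could report 800 ms at best; the bucket counts are what make the 99th percentile recoverable.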

Exam trap

Google Cloud often tests the distinction between metric types (gauge vs. distribution) and the specific alignment functions available in Cloud Monitoring, trapping candidates who confuse max alignment with percentile calculation or assume uptime checks can measure internal service latency.

How to eliminate wrong answers

Option A is wrong because gauge metrics represent a single instantaneous value and cannot compute percentiles; the max alignment function only returns the maximum value, not a percentile. Option C is wrong because uptime check metrics measure availability and response time from external probes, not the internal 99th percentile latency of a specific microservice, and they lack a built-in percentile configuration. Option D is wrong because a logs-based metric with count alignment counts log entries, not latency values, and cannot derive percentile latency from application logs without additional distribution metric configuration.

15

MCQ · medium

Your organization is migrating a monolithic application to microservices on Cloud Run. You need to monitor the health of each microservice and aggregate logs and metrics in a central dashboard. You have set up Cloud Monitoring custom dashboards and logs-based metrics. After the initial deployment, you notice that the dashboards show data only for some services, while others appear to have no metrics. You verify that all services are running and emitting logs. What is the most likely cause?

A. The services are not exporting metrics to Cloud Monitoring via the Monitoring API.

B. The logs-based metrics are not configured to parse logs from all services.

C. The services are running in different GCP projects and you are viewing only one project.

D. The Cloud Monitoring agent is not installed on the Cloud Run instances.

Answer: C

Dashboards in Cloud Monitoring are project-scoped unless configured with a metrics scope.

Why this answer

Option C is correct because Cloud Monitoring dashboards and logs-based metrics are scoped to a single GCP project. If microservices are deployed across multiple projects, metrics from services in other projects will not appear unless the dashboard is configured to aggregate data from all relevant projects. Since the question states that some services have no metrics while all are running and emitting logs, the most likely cause is that those services reside in different projects.

Exam trap

Google Cloud often tests the misconception that all GCP services automatically share monitoring data across projects, when in fact Cloud Monitoring is inherently project-scoped and requires explicit cross-project configuration via metrics scopes.

How to eliminate wrong answers

Option A is wrong because Cloud Run services automatically export built-in metrics (e.g., request count, latency) to Cloud Monitoring without requiring explicit use of the Monitoring API; the issue is not about manual metric export. Option B is wrong because logs-based metrics are defined using filters that apply to all logs ingested into Cloud Logging within the project; if logs are being emitted, the metrics would appear unless the filter explicitly excludes those services, but the question states logs are emitted and the dashboard shows no metrics at all for some services, which points to a project-scope issue. Option D is wrong because Cloud Run is a serverless platform; there is no Cloud Monitoring agent to install on instances—the agent is used for Compute Engine VMs, not Cloud Run.

16

Multi-Select · medium

Which TWO of the following are best practices for implementing service monitoring in Google Cloud? (Choose 2)

Select 2 answers

A. Set static alert thresholds without considering historical baselines.

B. Use Cloud Monitoring uptime checks to verify that services are reachable from external locations.

C. Use the USE method (Utilization, Saturation, Errors) for service-level monitoring.

D. Define service level indicators (SLIs) using the RED method (Rate, Errors, Duration).

E. Alert on cause-based metrics (e.g., CPU utilization) rather than symptom-based metrics (e.g., latency).

Answers: B, D

Uptime checks verify external accessibility.

Why this answer

Option B is correct because Cloud Monitoring uptime checks verify that a service is reachable from external locations by sending HTTP, HTTPS, or TCP requests from Google Cloud's global vantage points. This validates external availability and helps detect regional outages or DNS resolution issues, which is a core best practice for service monitoring. Option D is correct because the RED method (Rate, Errors, Duration) describes a request-driven service from the caller's perspective, which is what an SLI should capture.

Exam trap

Google Cloud often tests the distinction between resource-level monitoring (USE method) and service-level monitoring (RED method), trapping candidates who apply the USE method to services or confuse cause-based alerts with symptom-based alerts.

17

MCQ · medium

You need to monitor a multi-step login flow that involves calling an API, validating a token, and redirecting. Which type of uptime check should you use?

A. Cloud Endpoint check

B. TCP check

C. HTTP GET check

D. Synthetic Monitor (Cloud Functions)

Answer: D

Synthetic monitors can script multi-step flows.

Why this answer

A synthetic monitor using Cloud Functions is the correct choice because it can simulate a multi-step login flow by executing custom code that calls an API, validates a token, and performs a redirect. Unlike simple HTTP GET or TCP checks, synthetic monitors can handle stateful interactions and conditional logic, making them ideal for complex transaction monitoring.

Exam trap

Google Cloud often tests the misconception that an HTTP GET check can handle multi-step flows because it can follow redirects, but in reality, HTTP GET checks in uptime monitoring tools (like Google Cloud Monitoring) do not execute JavaScript or manage session state, making them unsuitable for token validation and conditional redirects.

How to eliminate wrong answers

Option A is wrong because a 'Cloud Endpoint check' is not a documented Cloud Monitoring uptime check type; even read as a basic endpoint probe against an IP address or URL, it could not execute a multi-step workflow with token validation and redirects. Option B is wrong because a TCP check only verifies that a port is open and accepting connections; it cannot validate application-layer logic like token authentication or redirect handling. Option C is wrong because an HTTP GET check issues a single request and evaluates that one response (e.g., a 200 OK status code); even where redirects are followed, it cannot maintain session state or validate tokens across multiple steps.
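
The difference between a single probe and a scripted flow can be sketched in plain Python. This is only an illustration of the multi-step idea: `run_login_flow` and the three injected callables are hypothetical names, and a real synthetic monitor would make actual HTTP calls inside a Cloud Function instead of using the fakes shown here.

```python
def run_login_flow(call_api, validate_token, follow_redirect):
    """Run three dependent steps; return (ok, name_of_failed_step)."""
    response = call_api()
    if response.get("status") != 200:
        return False, "call_api"

    # State from step 1 (the token) is carried into the later steps.
    token = response.get("token")
    if not validate_token(token):
        return False, "validate_token"

    landing = follow_redirect(token)
    if landing.get("status") != 200:
        return False, "follow_redirect"
    return True, None


# Fake collaborators standing in for real HTTP calls.
healthy = run_login_flow(
    call_api=lambda: {"status": 200, "token": "abc"},
    validate_token=lambda t: t == "abc",
    follow_redirect=lambda t: {"status": 200},
)
bad_token = run_login_flow(
    call_api=lambda: {"status": 200, "token": "expired"},
    validate_token=lambda t: t == "abc",
    follow_redirect=lambda t: {"status": 200},
)
```

Because each step depends on the result of the previous one, the check reports which stage of the flow broke, something a single GET or TCP probe cannot express.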

18

MCQhard

A team wants to implement multi-cluster monitoring for GKE using Managed Service for Prometheus. Which configuration is required?

A.Enable Managed Service for Prometheus in one cluster and have other clusters forward metrics to it

B.Enable Managed Service for Prometheus in each cluster and configure a single Cloud Monitoring workspace to collect metrics from all clusters

C.Use Cloud Monitoring agent on nodes in each cluster

D.Set up a separate workspace per cluster

AnswerB

This aggregates metrics from multiple clusters in one workspace.

Why this answer

Managed Service for Prometheus is a Google Cloud-managed, multi-cluster monitoring solution. To collect metrics from multiple GKE clusters, you must enable the service in each cluster individually and then configure a single Cloud Monitoring workspace to aggregate the data. This ensures each cluster runs its own collection pipeline, while the workspace provides a unified view across all clusters.

Exam trap

The trap here is that candidates assume a single 'central' cluster can aggregate metrics from others (like a traditional Prometheus federation), but Managed Service for Prometheus requires each cluster to independently send metrics to a shared Cloud Monitoring workspace.

How to eliminate wrong answers

Option A is wrong because Managed Service for Prometheus does not support a hub-and-spoke forwarding model; each cluster must run its own collection pipeline and cannot simply forward metrics to another cluster's service. Option C is wrong because the Cloud Monitoring agent (formerly Stackdriver agent) is a legacy solution for collecting system metrics and logs, not Prometheus metrics; Managed Service for Prometheus uses its own managed collector based on the Prometheus server, not the Cloud Monitoring agent. Option D is wrong because using separate workspaces per cluster defeats the purpose of multi-cluster monitoring, which requires a single workspace to aggregate and visualize metrics from all clusters in one place.

19

Multi-Selectmedium

Which TWO options are valid methods to create a custom metric descriptor in Cloud Monitoring?

Select 2 answers

A.Creating a log-based metric

B.Using the gcloud commands: gcloud monitoring metrics create

C.Using Terraform resource google_monitoring_metric_descriptor

D.Deploying a Prometheus exporter with the Ops Agent

E.Using the Cloud Monitoring API CreateMetricDescriptor

AnswersB, E

gcloud CLI can create custom metric descriptors.

Why this answer

Option B is correct because the `gcloud monitoring metrics create` command directly invokes the Cloud Monitoring API to create a custom metric descriptor. Option E is correct because the Cloud Monitoring API's `CreateMetricDescriptor` method is the programmatic way to define a custom metric, specifying its type, unit, and labels. Both methods result in a new metric descriptor being registered in the project's metric store.

Exam trap

Google Cloud often tests the distinction between creating a metric descriptor (the schema) and writing metric data (time series); candidates confuse log-based metrics (which auto-create a descriptor) with the explicit creation of a custom descriptor, or assume that any tool that produces metrics (like Prometheus exporters) directly creates a descriptor, when in fact the Ops Agent handles descriptor creation automatically for Prometheus endpoints.
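
To illustrate the schema-versus-data distinction, the sketch below builds the kind of descriptor body that a create-descriptor request carries. The field names follow the REST `MetricDescriptor` resource as I understand it and should be treated as assumptions to verify against the API reference; the helper `build_metric_descriptor` and the metric name are hypothetical, and nothing here calls Google Cloud.

```python
ALLOWED_KINDS = {"GAUGE", "DELTA", "CUMULATIVE"}
ALLOWED_VALUE_TYPES = {"BOOL", "INT64", "DOUBLE", "STRING", "DISTRIBUTION"}


def build_metric_descriptor(name, metric_kind, value_type, unit, label_keys):
    """Return a dict describing a custom metric's schema (no data points)."""
    if metric_kind not in ALLOWED_KINDS:
        raise ValueError(f"unsupported metric kind: {metric_kind}")
    if value_type not in ALLOWED_VALUE_TYPES:
        raise ValueError(f"unsupported value type: {value_type}")
    return {
        # Custom metrics live under the custom.googleapis.com namespace.
        "type": f"custom.googleapis.com/{name}",
        "metricKind": metric_kind,
        "valueType": value_type,
        "unit": unit,
        "labels": [{"key": key, "valueType": "STRING"} for key in label_keys],
    }


descriptor = build_metric_descriptor(
    "checkout/latency", "GAUGE", "DOUBLE", "ms", ["region"]
)
```

The descriptor only declares type, kind, unit, and labels; writing time series against it is a separate step.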

20

MCQmedium

A DevOps engineer is setting up alerting policies for a critical API service. They want to receive an alert if the error rate exceeds 5% for at least 5 minutes, but only during business hours (9 AM to 5 PM). Which approach should they use?

A.Create a log-based metric for errors and use a condition with a threshold, then set the alert policy to only run during business hours using the 'condition' schedule.

B.Create an alerting policy with a condition that triggers when the error rate is above 5% for 5 minutes, and configure the notification channel to only send notifications during business hours using a webhook receiver that checks time.

C.Create two separate alert policies, one for business hours and one for off-hours, each with different thresholds.

D.Use Cloud Scheduler to enable and disable the alerting policy at the start and end of business hours.

AnswerB

This approach uses a custom notification channel to filter by time.

Why this answer

Option B is correct because it uses a single alerting policy with a condition that triggers when the error rate exceeds 5% for 5 minutes, and then controls notification delivery via a webhook receiver that checks the current time. This approach ensures the alert is evaluated continuously (so the 5-minute window is respected) but only notifications are suppressed outside business hours, which is the most reliable way to meet the requirement without missing alert evaluations or relying on external scheduling.

Exam trap

Google Cloud often tests the distinction between 'when the alert is evaluated' versus 'when notifications are sent' — candidates mistakenly think scheduling the condition itself (Option A) or toggling the entire policy (Option D) is valid, but the correct approach is to keep evaluation always on and only filter notifications.

How to eliminate wrong answers

Option A is wrong because log-based metrics and conditions do not have a 'condition schedule' to restrict evaluation to business hours; alert conditions evaluate continuously, and scheduling is applied at the policy level, not the condition level. Option C is wrong because creating two separate policies with different thresholds would either trigger alerts off-hours (if thresholds are the same) or miss the 5% threshold requirement (if thresholds differ), and it unnecessarily duplicates management overhead. Option D is wrong because using Cloud Scheduler to enable/disable the alerting policy would stop all evaluation during off-hours, meaning the 5-minute window would not be maintained across the boundary (e.g., an error spike starting at 4:58 PM would not be detected until 9 AM the next day, breaking the 'at least 5 minutes' requirement).
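
The "always evaluate, filter on delivery" idea behind Option B can be sketched as the time check a webhook receiver might run before forwarding a notification. The function name `should_forward` is hypothetical, and the sketch assumes the timestamp is already in the team's local time zone.

```python
from datetime import datetime, time

BUSINESS_START = time(9, 0)   # 9 AM, inclusive
BUSINESS_END = time(17, 0)    # 5 PM, exclusive


def should_forward(fired_at: datetime) -> bool:
    """Return True when the alert fired inside business hours."""
    return BUSINESS_START <= fired_at.time() < BUSINESS_END


during_day = should_forward(datetime(2024, 1, 8, 16, 59))
after_hours = should_forward(datetime(2024, 1, 8, 17, 0))
```

The alerting policy keeps evaluating around the clock; only the forwarding decision depends on the time of day.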

21

MCQeasy

A small startup uses Cloud Functions for their backend and wants to monitor function execution times and error rates. They have enabled Cloud Monitoring and are viewing metrics in the Cloud Console. They notice that the execution time metric for a particular function shows an average of 200ms, but occasionally there are spikes to 5 seconds, which correspond to user-reported slow responses. They want to be alerted when the function exceeds 1 second for any invocation. What is the simplest way to achieve this?

A.Create a log-based metric for function duration and set a threshold alert.

B.Configure a Cloud Monitoring uptime check for the function URL.

C.Use the built-in Cloud Functions latency metric and create a metric threshold alert for the max value over 1 minute.

D.Use Cloud Error Reporting to capture slow responses.

AnswerC

This directly uses the existing metric and alerts on the maximum value, catching spikes.

Why this answer

Option C is correct because Cloud Functions automatically emits a built-in execution time metric (`cloudfunctions.googleapis.com/function/execution_times`, reported in nanoseconds) to Cloud Monitoring. By creating a metric threshold alert on the maximum value of this metric over a 1-minute window, you can trigger an alert whenever any single invocation exceeds 1 second, directly matching the requirement to be alerted per invocation spike.

Exam trap

Google Cloud often tests the distinction between built-in metrics and log-based metrics, and the trap here is that candidates overcomplicate by choosing log-based metrics (Option A) when a simpler built-in metric already satisfies the requirement.

How to eliminate wrong answers

Option A is wrong because log-based metrics require you to parse structured logs and define custom metrics, which adds unnecessary complexity when a built-in metric already exists for execution time. Option B is wrong because an uptime check only verifies that the function URL is reachable and returns a response; it does not measure execution duration or error rates for individual invocations. Option D is wrong because Cloud Error Reporting captures only errors (exceptions, crashes), not slow responses that complete successfully but exceed a latency threshold.
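
Why the maximum (rather than the average) catches single slow invocations can be shown with a small Python sketch. The sample data and the helper `windows_over_threshold` are made up for illustration; Cloud Monitoring performs the equivalent alignment for you.

```python
from collections import defaultdict

THRESHOLD_MS = 1000  # alert when any invocation exceeds 1 second


def windows_over_threshold(samples, threshold_ms=THRESHOLD_MS):
    """samples: iterable of (epoch_seconds, duration_ms).

    Returns the sorted start times of 1-minute windows whose maximum
    duration exceeds the threshold.
    """
    max_by_window = defaultdict(float)
    for ts, duration in samples:
        window_start = ts - ts % 60
        max_by_window[window_start] = max(max_by_window[window_start], duration)
    return sorted(w for w, m in max_by_window.items() if m > threshold_ms)


samples = [(0, 200), (30, 5000), (60, 200), (90, 210), (125, 1000)]
alerting = windows_over_threshold(samples)
```

Only the first window contains a 5-second spike, so it is the only one flagged; a window whose slowest call is exactly 1 second does not exceed the threshold.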

22

MCQmedium

What is the effect of the 'timeshiftDuration' of '3600s' in the dashboard widget?

A.The chart compares current data to data from 1 hour ago

B.The chart shows data from the last 1 hour

C.The metric is aggregated over 1 hour

D.The dashboard updates every hour

AnswerA

Timeshift adds a series shifted by the specified duration.

Why this answer

The 'timeshiftDuration' parameter in a dashboard widget shifts the time range of the comparison data relative to the primary time range. Setting 'timeshiftDuration' to '3600s' means the chart compares the current data (within the selected time range) against data from exactly 1 hour (3600 seconds) earlier. This is commonly used for offset comparisons, such as week-over-week or hour-over-hour analysis, and does not alter the primary data range or aggregation.

Exam trap

Google Cloud often tests the distinction between 'timeshiftDuration' (comparison offset) and 'timeRange' (primary data window), leading candidates to confuse shifting the comparison data with changing the chart's visible time range.

How to eliminate wrong answers

Option B is wrong because 'timeshiftDuration' does not set the time range of the chart; it only shifts the comparison data. The chart's primary time range is defined separately (e.g., by the dashboard's time picker). Option C is wrong because 'timeshiftDuration' has no effect on metric aggregation; aggregation is controlled by the query's 'aggregation' settings (alignment period, aligner, and reducer). Option D is wrong because 'timeshiftDuration' does not control dashboard refresh intervals; refresh behavior is governed by the dashboard's auto-refresh setting.
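
The effect of a time shift can be sketched in a few lines of Python: earlier points are re-plotted later by the shift amount so they line up with current points. The helper names and the sample series are hypothetical; this only models the comparison, not the dashboard itself.

```python
def timeshift(series, shift_seconds):
    """Re-plot each (epoch_seconds, value) point shift_seconds later."""
    return [(ts + shift_seconds, value) for ts, value in series]


def compare_with_timeshift(series, shift_seconds):
    """Pair each current point with the value from shift_seconds earlier."""
    shifted = dict(timeshift(series, shift_seconds))
    return [(ts, value, shifted[ts]) for ts, value in series if ts in shifted]


series = [(0, 10), (3600, 12), (7200, 15)]
pairs = compare_with_timeshift(series, 3600)
```

Each tuple holds the timestamp, the current value, and the value from one hour before, which is the comparison the widget overlays.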

23

Matchingmedium

Match each Google Cloud tool to its function in incident management.

Concepts

Matches

End-to-end incident lifecycle tool

Third-party alerting and on-call scheduling

Asynchronous messaging for event-driven alerts

Serverless automation for incident response

Containerized event-driven applications

Why these pairings

Tools used in incident response workflows.

24

MCQeasy

A DevOps engineer runs the command above and gets the output shown. What does this output indicate?

A.The instance's disk is full, causing write errors.

B.An application running on the instance encountered a connection timeout to a backend service.

C.The instance failed to authenticate with the metadata server.

D.A health check probe failed to reach the instance.

AnswerB

The log message explicitly states 'Connection timeout to backend service'.

Why this answer

The output shows a 'Connection timed out' error when attempting to reach a backend service. This indicates that the application on the instance is unable to establish a TCP connection to the specified IP and port, typically due to network issues, firewall rules, or the backend service being down. The error is specific to application-level connectivity, not disk space or authentication.

Exam trap

The trap here is that candidates confuse application-level connection timeouts with infrastructure-level issues like disk full or health check failures, but the specific 'Connection timed out' message directly points to a network connectivity problem to a backend service.

How to eliminate wrong answers

Option A is wrong because a full disk would produce 'No space left on device' or write-related errors, not a connection timeout. Option C is wrong because metadata server authentication failures result in 401 Unauthorized or 'Failed to retrieve metadata' errors, not a TCP connection timeout. Option D is wrong because a health check probe failure would be reported by the load balancer or monitoring system (e.g., 'Unhealthy' status), not as a connection timeout from within the instance's application logs.

25

Multi-Selectmedium

Your team is implementing SLO monitoring. Which TWO tools should they use to create and monitor SLIs?

Select 2 answers

A.Error Reporting

B.Cloud Monitoring

C.Cloud Logging

D.Cloud Trace

E.Cloud Profiler

AnswersB, C

Cloud Monitoring is the primary tool for creating metric-based SLIs and SLOs.

Why this answer

Cloud Monitoring (option B) is the correct tool for creating and monitoring Service Level Indicators (SLIs) because it allows you to define custom metrics, set up alerting policies, and create dashboards to track the performance of your services. Cloud Logging (option C) is also correct because SLIs often rely on log-based metrics, such as request latency or error counts, which are extracted from log entries using log-based metrics in Cloud Logging. Together, these two tools provide the data ingestion and monitoring capabilities needed to define and observe SLIs.

Exam trap

Google Cloud often tests the distinction between tools that collect raw data (like Cloud Logging) and tools that aggregate and alert on that data (like Cloud Monitoring), and the trap here is that candidates might think Error Reporting or Cloud Trace are sufficient for SLI monitoring, when they lack the metric aggregation and alerting capabilities required for SLO monitoring.

26

MCQeasy

You need to monitor the health of an external HTTP endpoint. Which resource should you create?

A.Load balancer with a health probe

B.Internal health check in Compute Engine

C.Uptime check in Cloud Monitoring

D.Cloud Logging log-based metric

AnswerC

Uptime checks monitor external endpoints for availability.

Why this answer

To monitor the health of an external HTTP endpoint, you need a service that can reach the endpoint from outside your VPC and provide alerting and visibility. Cloud Monitoring's Uptime Check is specifically designed for this purpose: it verifies that an external HTTP(S) endpoint is reachable and responsive, and can trigger alerting policies based on the results. This is the correct resource because it operates from Google's infrastructure, not from within your project, and directly supports external endpoint monitoring.

Exam trap

Google Cloud often tests the distinction between internal health checks (for VMs/backends) and external uptime checks (for public endpoints), so the trap here is that candidates confuse a load balancer health probe (which checks backends) with a tool for monitoring an external HTTP endpoint.

How to eliminate wrong answers

Option A is wrong because a Load Balancer with a health probe is used to distribute traffic and check the health of backend instances within your VPC, not to monitor an external HTTP endpoint from outside your network. Option B is wrong because an Internal health check in Compute Engine is designed to verify the health of VM instances within the same VPC using internal IPs, and cannot reach external endpoints. Option D is wrong because Cloud Logging log-based metrics are used to count or measure log entries based on filters, not to actively probe or monitor the availability of an external HTTP endpoint.

27

MCQmedium

A company uses Cloud Run for a critical service and needs to set up alerting for 5xx errors. They want to receive a notification within 1 minute of the error rate exceeding 1% for any 1-minute window. Which alerting approach should they use?

A.Set up a log-based metric for 5xx responses and create an alert on the metric.

B.Create a Cloud Logging sink to a Pub/Sub topic and trigger a Cloud Function that sends notifications.

C.Use Cloud Monitoring's log-based alerting to trigger on every 5xx log entry.

D.Create a Cloud Monitoring alerting policy using the 'Request count' metric with a condition that compares the ratio of 5xx responses to total requests over a 1-minute window.

AnswerD

Cloud Monitoring supports metric evaluation every few seconds, and the ratio condition meets the requirement of alerting within 1 minute.

Why this answer

Option D is correct because Cloud Monitoring's alerting policies can directly compute the ratio of 5xx responses to total requests using the 'Request count' metric with a custom ratio condition over a 1-minute window. This approach meets the 1-minute notification latency requirement without additional infrastructure, as Cloud Monitoring evaluates the metric every 60 seconds and triggers alerts based on the sliding window.

Exam trap

Google Cloud often tests the distinction between log-based metrics (which have inherent sampling and aggregation delays) and built-in request metrics (which are real-time and window-aware), leading candidates to mistakenly choose log-based alerting for rate-based conditions.

How to eliminate wrong answers

Option A is wrong because log-based metrics are sampled and aggregated with a typical latency of 2–5 minutes, which cannot guarantee notification within 1 minute of the error rate exceeding the threshold. Option B is wrong because creating a Cloud Logging sink to Pub/Sub and triggering a Cloud Function introduces additional latency from log ingestion, Pub/Sub delivery, and function execution, making it impossible to meet the 1-minute SLA; it also lacks native ratio computation. Option C is wrong because Cloud Monitoring's log-based alerting triggers on every individual log entry, not on a rate or ratio over a time window, so it cannot detect when the error rate exceeds 1% in a 1-minute window.
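
The ratio condition in Option D amounts to dividing 5xx responses by total requests within each 1-minute window. The Python sketch below works through that arithmetic on made-up data; the function names are hypothetical and the real computation happens inside the alerting policy.

```python
from collections import defaultdict


def error_ratio_by_minute(requests):
    """requests: iterable of (epoch_seconds, http_status)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ts, status in requests:
        window_start = ts - ts % 60
        totals[window_start] += 1
        if 500 <= status <= 599:
            errors[window_start] += 1
    return {w: errors[w] / totals[w] for w in totals}


def breaching_windows(requests, threshold=0.01):
    """Windows whose 5xx ratio is strictly above the threshold."""
    ratios = error_ratio_by_minute(requests)
    return sorted(w for w, ratio in ratios.items() if ratio > threshold)


# Window 0: 3 errors in 200 requests (1.5%). Window 60: 1 in 100 (1.0%).
requests = (
    [(i % 60, 200) for i in range(197)]
    + [(5, 500), (6, 503), (7, 502)]
    + [(60 + i % 60, 200) for i in range(99)]
    + [(61, 500)]
)
breaches = breaching_windows(requests)
```

A per-entry trigger would fire on each of the four errors; the ratio fires only for the window where errors exceed 1% of traffic.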

28

MCQhard

The above MQL query is used in a Cloud Monitoring dashboard. What does it display?

A.The total number of spans in 'my-service' each minute.

B.The maximum latency of spans grouped by span_id.

C.The 99th percentile latency of all spans in the 'my-service' service, every minute.

D.The 99th percentile latency of each span_id, every minute.

AnswerD

The group_by [span_id] combined with percentile(99) on latency gives per-span_id 99th percentile values.

Why this answer

The MQL query uses the `fetch` command to retrieve spans from the `my-service` service, groups the series by `span_id` with `group_by`, and aggregates each group with `percentile(99)` on the latency metric. The `every 1m` parameter sets the alignment window to one minute. This produces a time series showing the 99th percentile latency for each distinct span_id, updated every minute.

Exam trap

Google Cloud often tests the distinction between aggregating across all spans (e.g., service-level percentile) versus grouping by a dimension like `span_id`, leading candidates to mistakenly choose a service-wide aggregation when the query explicitly groups by a finer granularity.

How to eliminate wrong answers

Option A is wrong because the query does not count spans; it calculates a percentile of latency, not a count. Option B is wrong because the query uses `percentile(99)`, not `max`, so it shows the 99th percentile, not the maximum latency. Option C is wrong because the query groups by `span_id` (via `group_by`), not by service, so it displays per-span_id percentiles, not a single aggregated value for the entire service.
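
The grouping behaviour can be mimicked in plain Python: bucket points by minute and by group key, then take a percentile per bucket. This is a sketch of the idea only; it uses a nearest-rank percentile, which may differ from how Cloud Monitoring interpolates, and all names and data are hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def p99_by_group_per_minute(points):
    """points: iterable of (epoch_seconds, group_key, latency_ms)."""
    buckets = defaultdict(list)
    for ts, group, latency in points:
        buckets[(ts - ts % 60, group)].append(latency)
    return {key: percentile(values, 99) for key, values in buckets.items()}


points = [(0, "a", 10), (10, "a", 20), (20, "b", 5), (70, "a", 30)]
p99 = p99_by_group_per_minute(points)
```

The result has one entry per (minute, group) pair, which is why the chart shows a separate line for each group instead of one service-wide value.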

29

MCQhard

Which Cloud Monitoring feature can directly correlate this error with the associated trace and VM instance?

A.Log-based metrics

B.Metrics Explorer

C.Error Reporting

D.Alerting policies

AnswerC

Error Reporting aggregates errors and links to traces and resources.

Why this answer

Error Reporting in Google Cloud automatically groups errors from Cloud Logging and can directly correlate an error with the associated trace ID and VM instance metadata. This is because Error Reporting ingests structured log entries that contain the `@type` field set to `type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent` and parses the `serviceContext` and `context` fields to link the error to the specific trace and resource (e.g., VM instance).

Exam trap

Google Cloud often tests the misconception that Metrics Explorer or Log-based metrics can correlate errors to traces, but only Error Reporting is designed to parse and group error events with their associated trace and resource metadata.

How to eliminate wrong answers

Option A is wrong because Log-based metrics extract numerical data from log entries (e.g., count of errors) but do not provide direct correlation to a specific trace or VM instance; they aggregate metrics over time. Option B is wrong because Metrics Explorer is a tool for visualizing and querying metric time series data from Cloud Monitoring, not for linking individual error events to traces or instances. Option D is wrong because Alerting policies define conditions and notifications based on metric thresholds or log-based alerts, but they do not perform the correlation of an error with its trace and VM instance; they only trigger alerts when conditions are met.

30

MCQhard

A company runs a microservices architecture on GKE with Istio. They want to generate custom request-level metrics for SLO tracking without modifying application code. Which approach is most efficient?

A.Set up Istio telemetry with the Cloud Monitoring adapter

B.Write a custom Prometheus exporter deploying in each pod

C.Use the Cloud Monitoring agent to scrape metrics from pods

D.Enable Stackdriver for GKE (deprecated)

AnswerA

Istio's adapter exports request metrics directly to Cloud Monitoring.

Why this answer

Option A is correct because Istio's telemetry v2 can be configured to export custom request-level metrics (e.g., latency, error rate) to Cloud Monitoring via the Cloud Monitoring adapter without modifying application code. This approach leverages Istio's sidecar proxy to capture metrics at the service mesh layer, making it the most efficient and non-invasive method for SLO tracking.

Exam trap

Google Cloud often tests the misconception that the Cloud Monitoring agent (or legacy Stackdriver) can scrape Istio metrics from pods, but in reality, the agent operates at the VM or node level and cannot access the Envoy proxy's in-memory metrics without explicit Prometheus scraping configuration.

How to eliminate wrong answers

Option B is wrong because writing a custom Prometheus exporter and deploying it in each pod requires modifying application code or adding sidecars, which contradicts the requirement of not modifying application code and adds operational overhead. Option C is wrong because the Cloud Monitoring agent (formerly Stackdriver agent) scrapes metrics from the host OS or container runtime, not from Istio's sidecar proxies, and it cannot capture request-level metrics at the mesh layer without application instrumentation. Option D is wrong because 'Stackdriver for GKE' is deprecated and replaced by Google Cloud Managed Service for Prometheus and Cloud Monitoring; relying on a deprecated approach is not a valid or efficient solution.

31

MCQmedium

You want to send alerts to a Slack channel when a critical error occurs. What should you do?

A.Set up a webhook in Cloud Logging and point it to Slack

B.Use Cloud Tasks to schedule a job that queries logs and sends Slack messages

C.Create a Cloud Logging export to Pub/Sub and subscribe using a Cloud Function that sends to Slack

D.Configure a Slack webhook notification channel in Cloud Monitoring and associate it with an alerting policy

AnswerD

Cloud Monitoring natively integrates with Slack via webhooks.

Why this answer

Option D is correct because Cloud Monitoring alerting policies can directly send notifications to Slack via a configured webhook notification channel. When a critical error triggers the alerting policy, Cloud Monitoring sends an HTTP POST request to the Slack webhook URL, delivering the alert message to the specified Slack channel. This is the native, event-driven approach without requiring custom code or intermediate services.

Exam trap

Google Cloud often tests the distinction between Cloud Logging exports (for log storage/analysis) and Cloud Monitoring alerting (for notification delivery), leading candidates to over-engineer solutions with Pub/Sub and Cloud Functions when a direct notification channel is available.

How to eliminate wrong answers

Option A is wrong because Cloud Logging webhooks are not a feature; Cloud Logging does not support direct webhook integrations to Slack. Option B is wrong because Cloud Tasks is a distributed task queue for asynchronous execution, not designed for real-time log monitoring or Slack notifications; polling logs with scheduled jobs introduces latency and complexity. Option C is wrong because while a Cloud Logging export to Pub/Sub with a Cloud Function can send messages to Slack, this approach is unnecessarily complex and indirect compared to the native Cloud Monitoring alerting policy with a Slack notification channel, which is the recommended and simpler solution.

32

MCQhard

Your company runs a multi-region application on GKE across us-east1 and europe-west1. The application serves a global user base with a strict SLO of 99.95% availability. Recently, the team noticed that during peak hours, some users in South America experience high latency and intermittent errors. The GKE clusters are monitored via Cloud Monitoring with custom dashboards and alerting policies. The team has set up a single alerting policy that triggers when the global error rate exceeds 0.1%. However, the alert fires only after the issue has persisted for 10 minutes, and by then the customer impact is already significant. You need to improve the detection and response time. Which action should you take first?

A.Create separate alerting policies per region with shorter evaluation periods, and set up a notification channel for the on-call team.

B.Add a dashboard that shows latency by region and set up a log-based metric for error counting.

C.Reduce the alerting policy's duration to 1 minute and increase the threshold to 0.5% to reduce noise.

D.Implement a canary deployment strategy to roll back changes quickly.

AnswerA

Regional alerts detect issues faster and target the affected region.

Why this answer

Option A is correct because the current single global alerting policy with a 10-minute evaluation period introduces a significant delay in detecting regional issues. By creating separate alerting policies per region with shorter evaluation periods, you can detect and respond to regional anomalies (like high latency in South America) much faster, directly improving the detection time for the multi-region application and reducing customer impact.

Exam trap

Google Cloud often tests the misconception that reducing the evaluation period and raising the threshold (Option C) is a quick fix, but this ignores the need for regional granularity and can lead to missed detections or increased noise, while the correct approach is to isolate alerts per region.

How to eliminate wrong answers

Option B is wrong because adding a dashboard and log-based metric for error counting only improves visibility and monitoring, but does not directly reduce the detection or response time for the alert; it provides data but no automated alerting improvement. Option C is wrong because reducing the duration to 1 minute and increasing the threshold to 0.5% would likely increase noise (false positives) and may still miss regional issues if the global error rate remains below 0.5% despite regional problems, thus not addressing the core issue of regional detection. Option D is wrong because implementing a canary deployment strategy is a deployment and rollback technique that helps mitigate impact after a bad release, but does not improve the detection and response time for the existing latency and error issue; it is a reactive measure, not a proactive monitoring fix.
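
A quick calculation shows how a global error rate can hide a regional problem. The traffic numbers below are invented for illustration; only the region names and the 0.1% threshold come from the scenario.

```python
def error_rate(errors, total):
    """Fraction of failed requests; 0.0 when there is no traffic."""
    return errors / total if total else 0.0


def global_error_rate(traffic):
    """traffic: {region: (errors, total_requests)}."""
    errors = sum(e for e, _ in traffic.values())
    total = sum(t for _, t in traffic.values())
    return error_rate(errors, total)


def regions_over_threshold(traffic, threshold):
    """Regions whose own error rate is strictly above the threshold."""
    return sorted(
        region for region, (e, t) in traffic.items()
        if error_rate(e, t) > threshold
    )


THRESHOLD = 0.001  # 0.1%
traffic = {"us-east1": (5, 100_000), "europe-west1": (40, 10_000)}
global_rate = global_error_rate(traffic)
hot_regions = regions_over_threshold(traffic, THRESHOLD)
```

Here the global rate is about 0.04%, well under the threshold, while `europe-west1` on its own sits at 0.4%. A per-region policy would fire; the single global policy would stay silent.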

33

Multi-Selecteasy

Which THREE are recommended practices for setting up alerting on Google Cloud?

Select 3 answers

A.Use multiple conditions with OR combiner to cover all scenarios

B.Set alert thresholds with a buffer (e.g., below SLO)

C.Test alerts in a non-production environment

D.Use a single alerting policy for all services

E.Configure multiple notification channels for different roles

AnswersB, C, E

Buffers reduce noise and provide time to act.

Why this answer

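Option B is correct because setting alert thresholds with a buffer ahead of the SLO (e.g., alerting when availability drops to 99.95% while the SLO is 99.9%) provides early warning before the SLO is breached. This proactive approach allows time for investigation and remediation, preventing actual SLO violations and ensuring service reliability targets are met. Option C is correct because exercising alerts in a non-production environment confirms that conditions fire and notifications are routed as intended before an incident depends on them. Option E is correct because separate notification channels let each role (for example, on-call engineers and service owners) receive the alerts relevant to it.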

Exam trap

Google Cloud often tests the misconception that combining multiple conditions with OR is always beneficial for coverage, but in reality, it increases noise and violates the principle of alerting on symptoms rather than causes.
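
The buffer idea from Option B reduces to a simple comparison: warn while availability is still above the SLO but inside a safety margin. The sketch below uses illustrative numbers (a 99.9% SLO with a 0.05-point buffer); the function names are hypothetical.

```python
def should_warn(availability, slo_target, buffer):
    """True once availability falls inside the buffer above the SLO."""
    return availability < slo_target + buffer


def slo_breached(availability, slo_target):
    """True once availability is below the SLO itself."""
    return availability < slo_target


SLO = 0.999      # 99.9% availability target
BUFFER = 0.0005  # warn from 99.95% downward

early_warning = should_warn(0.9993, SLO, BUFFER)
already_breached = slo_breached(0.9993, SLO)
```

At 99.93% availability the warning fires while the SLO is still intact, which is the window the team uses to investigate.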

34

MCQeasy

A team wants to monitor the latency of a microservice deployed on GKE. Which Google Cloud tool should they use to collect custom metrics?

A.Cloud Logging

B.Cloud Trace

C.Error Reporting

D.Cloud Monitoring

AnswerD

Cloud Monitoring collects metrics including custom metrics from GKE.

Why this answer

Cloud Monitoring (formerly Stackdriver Monitoring) is the correct tool because it allows you to collect, visualize, and alert on custom metrics via the Monitoring API or the `custom.googleapis.com` metric namespace. For a microservice on GKE, you can instrument your application with the OpenTelemetry SDK or the Cloud Monitoring client library to push custom latency metrics directly into Cloud Monitoring, where they can be queried and used in dashboards or alerting policies.

Exam trap

Google Cloud often tests the distinction between logging, tracing, and monitoring, and the trap here is that candidates confuse Cloud Trace (which deals with latency at the trace level) with Cloud Monitoring (which collects and stores custom numeric metrics for alerting and dashboards).

How to eliminate wrong answers

Option A is wrong because Cloud Logging is designed for storing, searching, and analyzing log data, not for collecting numeric time-series metrics; while logs can be converted to metrics via log-based metrics, that is an indirect and less efficient approach for custom latency monitoring. Option B is wrong because Cloud Trace is a distributed tracing tool that captures end-to-end request latency as trace spans, but it does not support custom metric collection or expose a metrics API for arbitrary numeric values. Option C is wrong because Error Reporting aggregates and analyzes application errors (exceptions, crashes), not latency data; it focuses on error events, not performance metrics.

35

MCQhard

A company wants to reduce costs associated with Cloud Monitoring. They have many custom metrics and high ingestion rates. Which cost optimization strategy is most effective?

A.Use log-based metrics instead of custom metrics.

B.Aggregate metrics into buckets and export to BigQuery for analysis.

C.Reduce the sampling rate of all custom metrics to 1 minute.

D.Delete all unused custom metrics and reduce labels.

AnswerD

Unused metrics and high label cardinality contribute to costs; cleaning them up is the most direct approach.

Why this answer

Option D is correct because deleting unused custom metrics and reducing labels directly reduces the volume of data ingested and stored, which is the primary cost driver in Cloud Monitoring (formerly Stackdriver). Custom metrics incur charges per data point ingested, and each unique label combination creates additional time series, multiplying costs. This strategy eliminates waste without sacrificing necessary monitoring fidelity.
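To see why label cardinality matters, a minimal sketch: the worst-case number of time series for one metric is the product of the distinct values each label can take. The label names and counts below are made up for illustration.

```python
from math import prod


def time_series_count(label_cardinalities):
    """Upper bound on time series for one metric.

    Multiplies the number of distinct values of every label.
    """
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single custom metric.
before = {"endpoint": 20, "status_code": 8, "user_id": 5000}
after = {"endpoint": 20, "status_code": 8}  # high-cardinality user_id dropped

print(time_series_count(before))
print(time_series_count(after))
```

Dropping a single high-cardinality label shrinks the upper bound by that label's full factor, which is why trimming labels is usually more effective than tuning individual metrics.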

Exam trap

Google Cloud often tests the misconception that reducing sampling frequency or moving metrics to logs will always lower costs, when in fact the most effective first step is eliminating unused metrics and reducing label cardinality to minimize the number of time series ingested.

How to eliminate wrong answers

Option A is wrong because log-based metrics still incur ingestion costs for the logs themselves, and converting custom metrics to log-based metrics does not inherently reduce costs—it may shift costs to log ingestion and storage. Option B is wrong because aggregating metrics into buckets and exporting to BigQuery adds additional costs for BigQuery storage and querying, and does not reduce the ingestion volume into Cloud Monitoring. Option C is wrong because reducing the sampling rate to 1 minute may not be appropriate for all metrics (e.g., high-frequency metrics need finer granularity), and it does not address the root cause of high ingestion volume from unused metrics or excessive labels; also, Cloud Monitoring charges per data point, so simply lowering frequency may not yield proportional savings if the number of time series remains high.


36

MCQhard

Refer to the exhibit.

```yaml
name: projects/my-project/alertPolicies/12345
displayName: High Error Rate
combiner: OR
conditions:
- conditionThreshold:
    filter: metric.type="logging.googleapis.com/user/myapp/error_count" resource.type="k8s_container"
    aggregations:
    - alignmentPeriod: 60s
      perSeriesAligner: ALIGN_RATE
    duration: 120s
    comparison: COMPARISON_GT
    thresholdValue: 5
    trigger:
      count: 1
```

An engineer notices that this alert fires too frequently during normal operation. Which change would most likely reduce the noise?

A.Increase duration to 300s.

B.Change combiner to AND.

C.Change perSeriesAligner to ALIGN_MEAN.

D.Decrease thresholdValue to 1.

AnswerA

A longer duration means the threshold must be exceeded for 300 seconds continuously, filtering out short-lived spikes that cause false alerts.

Why this answer

Increasing the duration to 300s means the condition (error rate > 5 per second) must be sustained for a full 5 minutes before the alert fires. This reduces noise by filtering out transient spikes that occur during normal operation, which are shorter than the new duration window. The original 120s duration allowed short-lived bursts to trigger the alert too frequently.
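The effect of the duration window can be illustrated with a small simulation. This is a simplified model of a threshold condition, not Cloud Monitoring's actual evaluator, and the per-minute rates are invented for the example.

```python
def fires(rates, threshold, alignment_s, duration_s):
    """Return True if `rates` stays above `threshold` for at least `duration_s`.

    `rates` holds one aligned value per alignment period.
    """
    needed = max(1, duration_s // alignment_s)  # consecutive points required
    streak = 0
    for rate in rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= needed:
            return True
    return False


# Hypothetical aligned error rates (one value per 60s) with a 3-minute burst.
rates = [1, 2, 9, 8, 7, 2, 1, 1]

print(fires(rates, threshold=5, alignment_s=60, duration_s=120))
print(fires(rates, threshold=5, alignment_s=60, duration_s=300))
```

With a 120s duration the three-minute burst is long enough to fire, while a 300s duration requires five consecutive violating points and the same burst is ignored.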

Exam trap

Google Cloud often tests the misconception that changing the threshold value or aligner is the primary way to reduce alert noise, when in fact adjusting the duration (or evaluation window) is the correct method to filter out transient spikes.

How to eliminate wrong answers

Option B is wrong because changing the combiner to AND would require all conditions to be true simultaneously, but there is only one condition in this policy, so AND has no effect and does not reduce noise. Option C is wrong because changing perSeriesAligner to ALIGN_MEAN would average the error count over the alignment period, which could smooth out spikes but does not address the core issue of transient bursts triggering the alert; it might even increase sensitivity if the mean is still above threshold. Option D is wrong because decreasing thresholdValue to 1 would make the alert fire even more frequently, as it would trigger on any error rate above 1, increasing noise rather than reducing it.


37

MCQeasy

You are setting up alerting for a batch processing job that runs daily on Compute Engine. The job must complete within 2 hours. Which metric and alert condition should you use to ensure you are notified if the job is still running after 90 minutes?

A.Alert on CPU utilization greater than 80% for the instance running the job

B.Create a custom metric that emits 1 when the job starts and 0 when it finishes; alert if the metric is 1 for more than 90 minutes

C.Use a heartbeat metric that reports every 5 minutes; alert if no heartbeat for 90 minutes

D.Set up a log-based metric that counts job completion log entries; alert if the count is zero after 90 minutes

AnswerB

This directly measures job duration and triggers an alert if it exceeds 90 minutes.

Why this answer

Option B is correct because it directly monitors the job's running state using a custom metric that emits 1 at job start and 0 at completion. By alerting when the metric remains at 1 for more than 90 minutes, you are notified if the job exceeds the 90-minute threshold, ensuring you catch failures before the 2-hour deadline. This approach is precise and avoids false positives from indirect signals like CPU or logs.

Exam trap

Google Cloud often tests the distinction between direct state monitoring (custom metric with start/end signals) and indirect signals (CPU, heartbeats, log counts), where candidates mistakenly choose an indirect metric that seems plausible but fails to accurately capture the specific condition of 'still running after 90 minutes'.

How to eliminate wrong answers

Option A is wrong because CPU utilization greater than 80% does not reliably indicate that the job is still running; the job could be idle or waiting on I/O while CPU is low, or CPU could be high due to unrelated processes, leading to false alarms or missed alerts. Option C is wrong because a missing heartbeat signals that the job has stopped reporting, which is the opposite of the condition you care about: a job that is still running after 90 minutes keeps sending heartbeats, so the alert never fires. Option D is wrong because a log-based metric counting job completion log entries would alert if the count is zero after 90 minutes, but this only detects that the job has not completed; it does not distinguish between a job that is still running and one that never started or failed silently, and it may also be delayed by log ingestion latency.


38

Multi-Selectmedium

A DevSecOps team is configuring Cloud Monitoring alerts for proactive incident response. Which two practices are recommended for effective alerting? (Choose two.)

Select 2 answers

A.Define clear escalation paths for different alert severities.

B.Alert on every microsecond of latency increase.

C.Use a single high-level alert that covers all symptoms.

D.Set alert thresholds based on arbitrary guesses.

E.Create separate alerts for different symptom classes.

AnswersA, E

Clear escalation ensures the right team is notified based on severity.

Why this answer

Recommended practices include defining clear escalation paths for different severities and creating separate alerts for different symptom classes to reduce noise and ensure proper routing.


39

MCQeasy

Your company runs a stateless web application on Google Kubernetes Engine (GKE). You have configured Cloud Monitoring to track request latency and set up an alert when p95 latency exceeds 500ms for 5 minutes. Recently, the alert has been firing frequently during peak hours. You examine the metrics and see that p95 latency spikes to 600ms for short periods. The application's SLO is 99.9% availability with a latency threshold of 1 second. What should you do to reduce alert noise without compromising the SLO?

A.Disable the alert during peak hours.

B.Implement a multi-window burn-rate alerting approach.

C.Increase the alert threshold to 1 second to match the SLO.

D.Change the alert to use a longer evaluation window, e.g., 30 minutes.

AnswerB

Burn-rate alerts use multiple windows to detect fast consumption while filtering out brief spikes.

Why this answer

The best approach is to implement a multi-window burn-rate alerting strategy, which is less sensitive to short spikes and directly tracks error budget consumption, aligning with the SLO.
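A minimal sketch of the multi-window idea, assuming "bad" requests are those that fail or exceed the 1-second latency threshold. The burn rate is the observed bad-request ratio divided by the ratio the SLO allows, and an alert is raised only when both a long and a short window burn faster than a chosen threshold. The 14.4 threshold and the request counts are illustrative assumptions, not values from the question.

```python
def burn_rate(bad, total, slo):
    """Error-budget burn rate: observed bad ratio divided by the allowed ratio."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo)


def should_alert(long_window, short_window, slo, threshold):
    """Alert only when both windows burn faster than `threshold`.

    Each window is a (bad_requests, total_requests) tuple.
    """
    return (
        burn_rate(*long_window, slo) >= threshold
        and burn_rate(*short_window, slo) >= threshold
    )


SLO = 0.999
THRESHOLD = 14.4  # assumed threshold for a 1-hour / 5-minute window pair

# Brief spike: the short window is hot, the long window is not.
print(should_alert((60, 100_000), (200, 8_000), SLO, THRESHOLD))
# Sustained problem: both windows exceed the threshold.
print(should_alert((2_000, 100_000), (200, 8_000), SLO, THRESHOLD))
```

Requiring both windows to agree is what filters out short peak-hour spikes while still catching sustained budget consumption.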


40

MCQeasy

Which Google Cloud tool automatically captures and visualizes traces for applications running on App Engine?

A.Cloud Debugger

B.Cloud Monitoring

C.Cloud Logging

D.Cloud Trace

AnswerD

Cloud Trace captures and displays latency data.

Why this answer

Cloud Trace is the correct Google Cloud tool for automatically capturing and visualizing latency traces from applications running on App Engine. It provides end-to-end latency insights by collecting trace data from distributed systems, enabling you to analyze request performance and identify bottlenecks without manual instrumentation.

Exam trap

The trap here is that candidates confuse Cloud Trace with Cloud Monitoring or Cloud Logging because all three are part of Google Cloud's operations suite, but only Cloud Trace is designed for distributed tracing and latency visualization.

How to eliminate wrong answers

Option A is wrong because Cloud Debugger is used for inspecting application state and code execution in real time without stopping the application, not for capturing or visualizing traces. Option B is wrong because Cloud Monitoring focuses on collecting metrics, uptime checks, and alerting policies, not trace data or distributed request visualization. Option C is wrong because Cloud Logging handles log data storage, search, and analysis, but does not capture or visualize trace spans or latency distributions.


41

MCQmedium

Your team wants to create a dashboard that shows request latency broken down by API version. Which approach is most efficient?

A.Use Cloud Monitoring Metrics Explorer to query the latency metric, group by API version, and save the chart as a dashboard

B.Write a custom application to output metrics to a file and send to Cloud Monitoring

C.Export Cloud Logging logs to BigQuery and create a dashboard in Data Studio

D.Enable Cloud Trace and create a dashboard based on trace data

AnswerA

Metrics Explorer directly supports grouping and creating dashboards.

Why this answer

Cloud Monitoring Metrics Explorer allows you to directly query the latency metric (e.g., `request_latencies`) and group by the `version` label, then save the resulting chart as a dashboard widget. This is the most efficient approach because it requires no data export, no custom code, and no additional services — the metric is already available in Cloud Monitoring if your API is instrumented with the appropriate label.

Exam trap

Google Cloud often tests the misconception that exporting logs to BigQuery or using Cloud Trace is always better for analysis, but here the question specifically asks for the most efficient approach to display a pre-existing metric broken down by a label, which Metrics Explorer does directly without extra steps or cost.

How to eliminate wrong answers

Option B is wrong because writing a custom application to output metrics to a file and send to Cloud Monitoring is unnecessarily complex and inefficient; Cloud Monitoring already ingests metrics natively via the Monitoring API or agent, and a file-based pipeline adds latency and operational overhead. Option C is wrong because exporting Cloud Logging logs to BigQuery and creating a dashboard in Data Studio is a roundabout, slower approach — logs are not metrics, and you would need to parse and aggregate log entries, incurring BigQuery costs and delays, whereas the metric is already pre-aggregated in Cloud Monitoring. Option D is wrong because Cloud Trace is designed for distributed tracing (individual request spans), not for aggregating latency metrics by API version; creating a dashboard from trace data would require custom aggregation and is not the most efficient way to show a pre-aggregated metric broken down by a label.


42

MCQeasy

An application running on App Engine is throwing exceptions. The DevOps team wants to be notified when a new type of exception appears. Which Cloud Monitoring feature should they use?

A.Error Reporting

B.Uptime Checks

C.Custom alerts

D.Logs-based metrics

AnswerA

Error Reporting analyzes error logs, groups similar exceptions, and can send notifications for new error groups.

Why this answer

Error Reporting is the correct choice because it is specifically designed to count, analyze, and aggregate the exceptions raised by running applications. It automatically groups exceptions by type and can trigger notifications when a new type of exception (one not seen before) appears, which directly meets the DevOps team's requirement.

Exam trap

Google Cloud often tests the distinction between features that passively log data (Logs-based metrics) versus features that actively analyze and classify errors (Error Reporting), leading candidates to choose Logs-based metrics because they think 'any exception can be captured in logs.'

How to eliminate wrong answers

Option B is wrong because Uptime Checks monitor the availability and response latency of a service by sending synthetic requests (e.g., HTTP GET) from Google Cloud locations, not application-level exceptions. Option C is wrong because Custom alerts are generic alerting policies that can be based on any metric, but they do not inherently detect or classify new exception types; they require a pre-defined metric or condition. Option D is wrong because Logs-based metrics extract numerical data (e.g., count of log entries matching a filter) from logs, but they do not automatically identify or notify on new exception types without manual configuration of a metric and alert.


43

Multi-Selectmedium

Which TWO metrics should be included in a comprehensive monitoring strategy for a production Kubernetes workload to detect performance degradation and capacity issues?

Select 2 answers

A.Disk read IOPS per pod

B.Container CPU utilization

C.Number of nodes in the cluster

D.Network bytes received per second

E.Request latency percentiles (e.g., p99)

AnswersB, E

High CPU utilization can indicate capacity pressure and performance issues.

Why this answer

Container CPU utilization (Option B) is a direct indicator of resource pressure and potential performance degradation in a Kubernetes workload. High CPU utilization can lead to throttling, increased request latency, and pod evictions, making it essential for detecting capacity issues. Request latency percentiles (Option E) are the gold standard for measuring user-facing performance degradation, as they reflect the actual experience of end users and can reveal subtle slowdowns before resource metrics show saturation.

Exam trap

Google Cloud often tests the distinction between infrastructure-level metrics (like node count or network bytes) and application-level metrics (like latency percentiles) that directly measure user experience and workload health.


44

MCQhard

An e-commerce platform is using Cloud Load Balancing with a backend service that has a custom health check. The health check is failing intermittently, causing traffic to be routed away from healthy instances. The team has enabled Cloud Logging and wants to diagnose the issue. Which log view should they examine to see the health check probe results?

A.VPC flow logs

B.Cloud Audit Logs (Admin Activity)

C.Instance serial port output logs

D.Load balancer logs (type: 'loadbalancing.googleapis.com')

AnswerD

Load balancer logs contain health check probe results.

Why this answer

Load balancer logs (type: 'loadbalancing.googleapis.com') contain detailed records of health check probes, including the probe source IP, target instance, response code, and latency. This is the exact log view that captures health check probe results, enabling the team to identify intermittent failures by correlating probe timestamps with instance health status changes.

Exam trap

The trap here is that candidates confuse VPC flow logs (which show network-level traffic) with load balancer logs (which show application-level health check results), or assume that audit logs or serial console logs would contain runtime health check data.

How to eliminate wrong answers

Option A is wrong because VPC flow logs capture network traffic metadata (source/destination IP, ports, protocols) but do not include health check probe results or application-layer health check responses. Option B is wrong because Cloud Audit Logs (Admin Activity) record administrative actions like creating or modifying load balancer configurations, not the runtime results of health check probes. Option C is wrong because instance serial port output logs contain OS-level boot and kernel messages, not health check probe data from the load balancer.


45

MCQhard

A service is deployed on Cloud Run. You need to monitor memory usage per revision. How can you create an alert?

A.Deploy a sidecar container that collects memory metrics and pushes to Cloud Monitoring

B.Configure Cloud Run to send metrics to Cloud Monitoring and create an alert in Cloud Monitoring

C.Use Cloud Console and navigate to Cloud Run services to set an alert directly

D.Use Cloud Logging to parse container logs for memory usage

AnswerB

Cloud Run metrics are automatically available in Cloud Monitoring.

Why this answer

Cloud Run automatically exports built-in metrics, including memory usage per revision, to Cloud Monitoring without any additional configuration. By creating an alerting policy in Cloud Monitoring based on the `run.googleapis.com/container/memory/utilizations` metric, you can monitor memory usage per revision directly. Option B correctly identifies this native integration, making it the most efficient and reliable approach.
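As a hedged sketch, an alerting policy on this metric could be described with a body like the one below, written as a plain Python dictionary in the shape used by the Monitoring API's alert policies. The 0.8 threshold, the 300s duration, the display names, and the choice of a 99th-percentile aligner are assumptions for illustration. The percentile aligner is used on the understanding that this metric is reported as a distribution.

```python
import json


def memory_alert_policy(threshold=0.8, duration="300s"):
    """Build an alert policy body for Cloud Run memory utilization.

    Threshold, duration, and display names are illustrative assumptions.
    """
    return {
        "displayName": "Cloud Run memory high (per revision)",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "p99 memory utilization above threshold",
                "conditionThreshold": {
                    "filter": (
                        'metric.type="run.googleapis.com/container/memory/utilizations" '
                        'AND resource.type="cloud_run_revision"'
                    ),
                    "aggregations": [
                        {
                            "alignmentPeriod": "60s",
                            # Distribution metric, so align to a percentile.
                            "perSeriesAligner": "ALIGN_PERCENTILE_99",
                            # Keep one series per revision.
                            "crossSeriesReducer": "REDUCE_MAX",
                            "groupByFields": ["resource.label.revision_name"],
                        }
                    ],
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": threshold,
                    "duration": duration,
                },
            }
        ],
    }


policy = memory_alert_policy()
print(json.dumps(policy, indent=2))
```

Grouping by `revision_name` is what makes the condition evaluate each revision separately, as the question requires.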

Exam trap

Google Cloud often tests the misconception that you must manually configure metric export or use sidecars for Cloud Run monitoring, when in fact Cloud Run's native integration with Cloud Monitoring handles this automatically.

How to eliminate wrong answers

Option A is wrong because Cloud Run already sends memory metrics to Cloud Monitoring natively; deploying a sidecar container to collect and push metrics is unnecessary, adds complexity, and violates the principle of using built-in observability. Option C is wrong because the Cloud Run console does not provide a direct interface to create alerting policies; alerts must be configured in Cloud Monitoring, not within the Cloud Run service page. Option D is wrong because Cloud Logging is designed for log analysis, not metric-based alerting; parsing container logs for memory usage is inefficient, unreliable, and not the intended method for monitoring resource utilization metrics.


46

MCQeasy

A team needs to monitor the availability of an HTTPS endpoint that requires a Bearer token in the request header. What is the simplest way to configure this with Cloud Monitoring?

A.Deploy a sidecar container that handles the authentication and exposes a plain endpoint.

B.Use a synthetic monitor from Cloud Monitoring that handles authentication via a script.

C.Export the endpoint logs to Cloud Logging and set up a log-based metric for availability.

D.Configure the Uptime Check to include a custom header with the Bearer token.

AnswerD

Uptime Checks allow custom headers, so you can directly set the Authorization header with the token.

Why this answer

Option D is correct because Cloud Monitoring's Uptime Checks natively support custom HTTP headers, including Authorization headers with Bearer tokens. This allows you to directly monitor an authenticated HTTPS endpoint without any additional infrastructure, scripting, or log-based workarounds. It is the simplest and most straightforward configuration for this requirement.
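For illustration, an uptime check configuration carrying the header might look like the following dictionary, shaped like the Monitoring API's uptime check resource. The host, path, project ID, and token are placeholders, and the period and timeout values are assumptions.

```python
import json


def uptime_check_with_bearer(project_id, host, path, token):
    """Build an uptime check config body with an Authorization header.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "displayName": f"https-{host}",
        "monitoredResource": {
            "type": "uptime_url",
            "labels": {"project_id": project_id, "host": host},
        },
        "httpCheck": {
            "path": path,
            "port": 443,
            "useSsl": True,
            "validateSsl": True,
            "headers": {"Authorization": f"Bearer {token}"},
            "maskHeaders": True,  # hide header values when the config is read back
        },
        "period": "60s",
        "timeout": "10s",
    }


config = uptime_check_with_bearer(
    "my-project", "api.example.com", "/healthz", "REDACTED_TOKEN"
)
print(json.dumps(config, indent=2))
```

A static token in a header suits long-lived credentials; if the token expires frequently, the check would need to be updated whenever it rotates.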

Exam trap

Google Cloud often tests the misconception that Uptime Checks cannot handle authentication, leading candidates to overcomplicate the solution with sidecars or scripts, when in fact custom headers are a built-in feature.

How to eliminate wrong answers

Option A is wrong because deploying a sidecar container adds unnecessary complexity and defeats the purpose of a simple configuration; it also introduces an extra point of failure and maintenance overhead. Option B is wrong because synthetic monitors are designed for multi-step or browser-based scenarios, not for simple single-request availability checks; using a script for a basic header-based authentication is overkill and not the simplest approach. Option C is wrong because exporting logs and setting up a log-based metric for availability is indirect, introduces latency, and does not provide proactive monitoring; it relies on the endpoint already generating logs, which is not guaranteed for simple availability checks.


47

Multi-Selecthard

A company needs to monitor custom application metrics from Compute Engine instances. Which TWO methods can be used?

Select 2 answers

A.Use the deprecated Stackdriver Agent

B.Install Cloud Monitoring agent on instances

C.Use Cloud Trace

D.Use OpenTelemetry to send metrics to Cloud Monitoring

E.Install Cloud Logging agent

AnswersB, D

The agent collects custom metrics and sends to Cloud Monitoring.

Why this answer

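Option B is correct because the Cloud Monitoring agent is specifically designed to collect custom application metrics from Compute Engine instances and send them to Cloud Monitoring. It supports both third-party applications and custom metrics via its built-in integration with collectd and a configuration interface for defining custom metrics. This is the standard, supported method for monitoring custom metrics from VMs. Option D is also correct because OpenTelemetry can export application metrics to Cloud Monitoring, so an instrumented application on a VM can send custom metrics without relying on the agent's collectd integration.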

Exam trap

Google Cloud often tests the distinction between agents for logging versus monitoring, and candidates mistakenly think the Cloud Logging agent can also handle metrics, or they confuse Cloud Trace (tracing) with metric collection.


48

MCQeasy

A team is monitoring a production service on Google Kubernetes Engine (GKE) and notices that a deployment is occasionally returning HTTP 503 errors. The team has set up a ServiceMonitor in Prometheus to scrape metrics from the pods. What is the most likely cause of the intermittent 503 errors?

A.The pods are crashing and restarting frequently.

B.The Prometheus scrape interval is too long, causing missed metrics.

C.The readiness probes are failing, causing the pods to be removed from the service endpoints.

D.The container resource limits are set too low, causing out-of-memory errors.

AnswerC

Readiness probe failures remove pods from service endpoints, causing 503s if all replicas fail.

Why this answer

Intermittent HTTP 503 errors in a GKE deployment typically indicate that the service's endpoints are temporarily unavailable. When a readiness probe fails, Kubernetes removes the pod from the Service's endpoints, causing traffic to be routed to remaining healthy pods. If multiple pods fail their readiness probes simultaneously or in quick succession, the Service may have no available endpoints, resulting in 503 errors for incoming requests.
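The mechanism can be modelled in a few lines. This is a deliberately simplified sketch: it treats only ready pods as Service endpoints and assumes the proxy or load balancer in front answers with 503 when none remain. The pod names are hypothetical.

```python
def route_request(pods):
    """Return the HTTP status a client would see, given pod readiness.

    `pods` maps pod name to a boolean readiness state.
    """
    endpoints = [name for name, ready in pods.items() if ready]
    return 200 if endpoints else 503


# One pod failing readiness: traffic shifts to the remaining pod.
print(route_request({"web-1": True, "web-2": False}))
# All pods failing readiness: no endpoints left to serve the request.
print(route_request({"web-1": False, "web-2": False}))
```

The errors are intermittent because readiness flaps: as soon as any pod passes its probe again, the Service regains an endpoint and requests succeed.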

Exam trap

Google Cloud often tests the distinction between liveness probes (which restart pods) and readiness probes (which control traffic routing), and candidates mistakenly attribute 503 errors to pod crashes or resource limits rather than the readiness probe's role in endpoint management.

How to eliminate wrong answers

Option A is wrong because pods crashing and restarting frequently would cause more persistent errors or connection resets, not intermittent 503 errors, and the ServiceMonitor would still scrape metrics from the restarted pods. Option B is wrong because the Prometheus scrape interval affects metric collection, not the availability of the service endpoints; a long scrape interval may cause gaps in monitoring data but does not directly cause HTTP 503 errors. Option D is wrong because out-of-memory errors typically cause pod crashes (OOMKilled) and restarts, which would manifest as connection timeouts or 502 errors rather than intermittent 503 errors from the service endpoint perspective.


49

MCQmedium

You are monitoring a microservices application deployed on Google Kubernetes Engine (GKE) that uses Cloud Monitoring for observability. You notice that the error rate for a critical service has increased, but the CPU and memory usage remain normal. The service uses gRPC and logs are structured. Which Cloud Monitoring tool should you use first to diagnose the root cause of the increased error rate?

A.Logs Explorer to filter logs by error status codes

B.Service Monitoring to create a custom dashboard

C.Error Reporting to automatically group error occurrences

D.Metrics Explorer to view error rate and latency charts

AnswerA

Logs Explorer allows you to examine structured logs, including gRPC status codes, to find error patterns.

Why this answer

Option A is correct because Logs Explorer allows you to directly query structured gRPC logs by filtering on error status codes (e.g., gRPC status codes like `UNAVAILABLE`, `INTERNAL`, or `DEADLINE_EXCEEDED`). Since the service uses structured logging, you can quickly isolate the exact error messages and stack traces without needing to pre-configure dashboards or wait for automated grouping. This is the fastest first step to identify the root cause of an increased error rate when CPU and memory are normal, as it points to application-level or dependency issues.

Exam trap

Google Cloud often tests the distinction between monitoring (Metrics Explorer, dashboards) and logging (Logs Explorer) — the trap here is assuming that aggregated metrics or automated error grouping are the fastest path to root cause, when in fact direct log inspection is required to see the specific error details and status codes.

How to eliminate wrong answers

Option B is wrong because creating a custom dashboard with Service Monitoring is a longer-term visualization setup, not a diagnostic tool for immediate root cause analysis; it does not provide the granular log-level filtering needed to inspect individual error occurrences. Option C is wrong because Error Reporting automatically groups error occurrences based on stack traces, but it requires the errors to be sent to Cloud Logging and may take time to aggregate; it is better for ongoing monitoring after the initial diagnosis, not the first tool to use. Option D is wrong because Metrics Explorer shows aggregated error rate and latency charts, which can confirm the problem but cannot drill into individual log entries or specific gRPC status codes to identify the root cause.


50

MCQeasy

A developer wants to view logs from all pods in a GKE namespace in real time. Which command-line tool should they use?

A.gcloud logging read

B.Cloud Console Logs Viewer

C.kubectl logs --tail=100

D.gcloud logging tail

AnswerD

This streams logs in real time across resources.

Why this answer

The `gcloud logging tail` command streams logs in real time from all pods in a GKE namespace, as it directly queries the Cloud Logging API for live log entries. This is the correct tool for real-time log streaming across multiple pods, unlike `kubectl logs` which only shows logs from a single pod or a limited set. The command accepts a Logging filter as its argument, such as `resource.labels.namespace_name="NAMESPACE"`, to scope the output to a specific namespace.

Exam trap

Google Cloud often tests the distinction between historical log retrieval (`gcloud logging read`) and real-time streaming (`gcloud logging tail`), and the trap here is that candidates mistakenly choose `kubectl logs` because they are familiar with its `-f` flag, but they overlook that it cannot aggregate logs from all pods in a namespace without complex scripting.

How to eliminate wrong answers

Option A is wrong because `gcloud logging read` retrieves historical logs from Cloud Logging, not real-time streaming; it queries a time range and returns a snapshot. Option B is wrong because Cloud Console Logs Viewer is a web-based UI, not a command-line tool, so it does not meet the requirement. Option C is wrong because `kubectl logs --tail=100` prints the last 100 lines from a single pod and exits; it does not stream unless combined with `-f` (follow), and covering every pod in a namespace would additionally require a label selector or a per-pod loop.


51

Multi-Selecthard

Which THREE of the following are valid approaches to monitor a custom application metric in Cloud Monitoring? (Choose 3)

Select 3 answers

A.Install the Stackdriver Monitoring agent on a Windows VM and configure custom metric collection in the agent configuration file.

B.Use the Cloud Monitoring API to write time series data directly.

C.Create a logs-based metric from application logs that contain the metric value.

D.Use the built-in JMX plugin in the Cloud Monitoring agent to collect Java application metrics.

E.Use the OpenTelemetry Collector with the Google Cloud Monitoring exporter.

AnswersB, C, E

The API allows writing custom metrics.

Why this answer

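Option B is correct because the Cloud Monitoring API allows you to write custom metric data directly via the `projects.timeSeries.create` endpoint. This is the most direct programmatic approach, supporting arbitrary metric descriptors and time series data without requiring any agent or intermediary. Option C is correct because a logs-based metric can extract a numeric value from matching log entries and expose it as a metric in Cloud Monitoring. Option E is correct because the OpenTelemetry Collector can forward application metrics to Cloud Monitoring through its Google Cloud exporter.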

Exam trap

Google Cloud often tests the distinction between predefined agent-collected metrics (like JMX plugin metrics) and custom metrics that require explicit API or logs-based creation, leading candidates to incorrectly select agent-based options for custom metric monitoring.


52

MCQeasy

A team notices that the 'cpu-high' alert fires frequently even for short bursts. The 'disk-full' alert never sends notifications. Based on the exhibit, what is the issue with each?

A.The cpu-high uses email which is unreliable; the disk-full condition is too low.

B.Both alerts have misconfigured durations.

C.The cpu-high alert duration is too short; the disk-full alert has no notification channel.

D.The cpu-high threshold is too high; the disk-full duration is too long.

AnswerC

Duration of 0s causes firing on any transient spike; missing notification channel means no alerts are delivered.

Why this answer

Option C is correct because the 'cpu-high' alert fires frequently for short bursts due to its duration being set too short, causing it to trigger on transient spikes. The 'disk-full' alert never sends notifications because it lacks a configured notification channel, so even when the condition is met, no alert is dispatched.

Exam trap

Google Cloud often tests the distinction between alert condition configuration (threshold/duration) and notification delivery, leading candidates to confuse a missing notification channel with a threshold or duration misconfiguration.

How to eliminate wrong answers

Option A is wrong because email is not inherently unreliable in this context; the issue is the duration setting, not the channel. Additionally, the 'disk-full' condition being 'too low' would cause false positives, not silence. Option B is wrong because both alerts do not have misconfigured durations; only the 'cpu-high' alert has a duration issue, while the 'disk-full' alert has a missing notification channel. Option D is wrong because a threshold that is too high would reduce false alarms, not increase them; the 'disk-full' duration being too long would delay alerts, not prevent them entirely.

53

MCQmedium

An SRE team needs to define an SLI for a web service's availability SLO of 99.9%. Which metric should they use?

A.Error budget

B.CPU utilization

C.Request latency (p99)

D.Uptime check success rate

AnswerD

Uptime checks measure the fraction of successful probes, directly reflecting availability.

Why this answer

Option D is correct because an uptime check success rate directly measures the proportion of time the service is reachable and responding, which aligns with the definition of availability for a 99.9% SLO. This metric is typically derived from synthetic probes or health check endpoints (e.g., HTTP 200 responses) and reflects the binary state of the service being up or down, making it the appropriate SLI for availability.
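To make the link between probe results and an availability SLI concrete, here is a minimal sketch. The probe counts are invented for illustration, and the function names `availability_sli` and `meets_slo` are hypothetical.

```python
def availability_sli(probe_results):
    """Fraction of probes that succeeded; probe_results is a list of booleans."""
    if not probe_results:
        raise ValueError("no probes recorded")
    return sum(probe_results) / len(probe_results)


def meets_slo(sli, target=0.999):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= target


# Made-up sample: 9,995 successful probes and 5 failures
probes = [True] * 9995 + [False] * 5
sli = availability_sli(probes)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli)}")
```

With 5 failures in 10,000 probes the SLI is 99.95%, which leaves room under a 99.9% target; 11 or more failures in the same sample would breach it.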

Exam trap

Google Cloud often tests the distinction between availability (binary up/down) and performance (latency/error rate), so candidates mistakenly choose latency metrics like p99 for availability SLOs, conflating responsiveness with uptime.

How to eliminate wrong answers

Option A is wrong because error budget is a derived concept (the allowed amount of downtime or errors before violating the SLO), not a raw metric used as an SLI; it is calculated from the SLI and SLO, not measured directly. Option B is wrong because CPU utilization is a resource-level metric that does not directly measure service availability; a service can have high CPU usage but still be available, or low CPU usage but be unresponsive due to other failures. Option C is wrong because request latency (p99) measures performance (e.g., the 99th percentile of response times), not availability; a service could be available but slow, or unavailable but not captured by latency metrics if requests fail entirely.

54

MCQmedium

An alerting policy triggers frequently for a spike in CPU utilization on a Compute Engine instance, but the spike lasts only a few seconds. The SRE team wants to reduce false positives. Which change should they make?

A.Increase the notification channel threshold.

B.Decrease the alerting duration to 0s.

C.Increase the evaluation period and duration.

D.Change the aggregation to mean instead of max.

AnswerC

Longer evaluation period and duration require the condition to persist, reducing false positives from short-lived spikes.

Why this answer

Option C is correct because increasing the evaluation period and duration ensures that the alerting policy only fires when the CPU utilization spike persists over a longer window, filtering out transient spikes that last only a few seconds. This directly reduces false positives by requiring sustained high utilization before triggering an alert, aligning with Google Cloud Monitoring's sliding window evaluation logic.
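The effect of a longer duration can be seen in a small simulation. This is a simplified model, not Cloud Monitoring's actual evaluator: it assumes one sample per alignment period and treats "duration" as a number of consecutive samples above the threshold. The sample values and the function name `fires` are hypothetical.

```python
def fires(samples, threshold, duration_samples):
    """Return True if samples stay above threshold for at least
    duration_samples consecutive points (a duration of 0 behaves like 1)."""
    needed = max(duration_samples, 1)
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= needed:
            return True
    return False


spike = [0.30, 0.35, 0.95, 0.32, 0.31, 0.33]      # one short burst
sustained = [0.30, 0.92, 0.94, 0.95, 0.96, 0.93]  # prolonged high CPU

print(fires(spike, 0.8, 1))      # short duration: the burst fires the alert
print(fires(spike, 0.8, 3))      # longer duration: the burst is ignored
print(fires(sustained, 0.8, 3))  # sustained load still fires
```

The same threshold produces an alert on the burst only when the required run length is short, which is the behavior the longer duration is meant to suppress.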

Exam trap

Google Cloud often tests the misconception that reducing the duration or changing aggregation will solve false positives, but the correct approach is to increase the evaluation window to require sustained anomalous behavior, not to react to every momentary deviation.

How to eliminate wrong answers

Option A is wrong because increasing the notification channel threshold does not affect the alerting condition; it only controls how many notifications are sent, not the sensitivity of the metric evaluation. Option B is wrong because decreasing the alerting duration to 0s would cause the policy to trigger on any single data point, making false positives worse by reacting to every momentary spike. Option D is wrong because changing the aggregation to mean instead of max would smooth out spikes, potentially masking real issues and still not addressing the transient nature of the spike; the mean could still be elevated if the spike is high enough, but the core problem is the short duration, not the aggregation method.

55

MCQhard

A team uses Cloud Monitoring alerting policies with multiple conditions. They want an incident to fire only when both conditions are met simultaneously. What should they configure?

A.Set the alerting policy combiner to 'AND'

B.Create two separate alerting policies

C.Create a single condition with a ratio metric

D.Use a log-based metric condition

AnswerA

The combiner 'AND' ensures all conditions must be met.

Why this answer

Option A is correct because Cloud Monitoring alerting policies support a 'combiner' field that can be set to 'AND' to require that all conditions are met simultaneously before the incident fires. This ensures the alert triggers only when both conditions are true at the same evaluation window, rather than when either condition is met.
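A policy with two conditions joined by `AND` might look roughly like the following, expressed as a Python dictionary that mirrors the AlertPolicy JSON. The display names, thresholds, and the choice of metrics are hypothetical; the sketch only shows where the `combiner` field sits relative to the conditions.

```python
import json

policy = {
    "displayName": "High CPU and high memory",  # hypothetical name
    "combiner": "AND",  # every condition must be met for an incident
    "conditions": [
        {
            "displayName": "CPU above 80% for 5 minutes",
            "conditionThreshold": {
                "filter": (
                    'metric.type="compute.googleapis.com/instance/cpu/utilization" '
                    'AND resource.type="gce_instance"'
                ),
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.8,
                "duration": "300s",
            },
        },
        {
            "displayName": "Memory above 90% for 5 minutes",
            "conditionThreshold": {
                "filter": (
                    'metric.type="agent.googleapis.com/memory/percent_used" '
                    'AND resource.type="gce_instance" '
                    'AND metric.label.state="used"'
                ),
                "comparison": "COMPARISON_GT",
                "thresholdValue": 90,
                "duration": "300s",
            },
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Switching `combiner` to `OR` in the same structure would open an incident when either condition is met.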

Exam trap

Google Cloud often tests the misconception that creating multiple alerting policies or using a ratio metric can achieve multi-condition AND logic, but the correct approach is to use the combiner field within a single alerting policy.

How to eliminate wrong answers

Option B is wrong because creating two separate alerting policies would result in each condition firing its own independent incident, not a single incident that requires both conditions to be met simultaneously. Option C is wrong because a ratio metric condition calculates a ratio of two metrics but does not enforce that two separate conditions must both be true at the same time; it is a single condition, not a multi-condition AND logic. Option D is wrong because a log-based metric condition is a type of condition that uses logs to create a metric, but it does not provide a mechanism to combine multiple conditions with an AND operator.

56

MCQmedium

An organization wants to be alerted when the total size of a Cloud Storage bucket exceeds 1 TB. Which metric should they monitor?

A.storage.googleapis.com/storage/total_bytes

B.storage.googleapis.com/storage/object_count

C.storage.googleapis.com/storage/network_sent_bytes

D.storage.googleapis.com/storage/request_count

AnswerA

This metric measures total bucket size.

Why this answer

The metric `storage.googleapis.com/storage/total_bytes` directly measures the total amount of data stored in a Cloud Storage bucket, including all object data and metadata. Monitoring this metric allows the organization to set an alert threshold at 1 TB (1,000,000,000,000 bytes in decimal units; 1,099,511,627,776 bytes if 1 TiB is intended) so the alert triggers when the bucket exceeds that size. This is the correct metric for tracking storage capacity usage.

Exam trap

Google Cloud often tests the distinction between metrics that measure capacity (total_bytes) versus metrics that measure activity (object_count, request_count) or throughput (network_sent_bytes), leading candidates to confuse object count with total size.

How to eliminate wrong answers

Option B is wrong because `storage.googleapis.com/storage/object_count` tracks the number of objects in the bucket, not their total size; a bucket could have millions of small objects that total far less than 1 TB. Option C is wrong because `storage.googleapis.com/storage/network_sent_bytes` measures outbound network traffic from the bucket, which is unrelated to the stored data size. Option D is wrong because `storage.googleapis.com/storage/request_count` counts API requests made to the bucket, which does not reflect the total storage consumed.

57

MCQhard

A team has set up the alerting policies shown in the exhibit. They receive an alert for High Memory but not for High CPU. What is the most likely reason?

A.The Cloud Monitoring agent is not installed or not reporting on the instance, so the memory metric is missing.

B.The CPU alert's duration of 300 seconds prevents it from firing before the memory alert.

C.The memory alert has a higher threshold value, making it easier to trigger.

D.The CPU metric is not available because the instance does not have the Cloud Monitoring agent installed.

AnswerA

The agent is required for agent.googleapis.com metrics.

Why this answer

Option A is the keyed answer because the two alerts depend on different data sources. Memory metrics come from the agent (`agent.googleapis.com/memory/...`) and exist only while the Ops Agent or legacy Monitoring agent is installed and reporting, whereas CPU utilization (`compute.googleapis.com/instance/cpu/utilization`) is collected by the hypervisor with no agent at all. The exhibit is not reproduced here, so how the High Memory policy treats missing data cannot be verified; the answer is consistent with a memory condition that fires on metric absence or treats missing data as a violation, while the CPU condition stays silent simply because CPU is below its threshold.

Exam trap

Google Cloud often tests which metrics need an agent: CPU utilization is always available from the hypervisor, while memory and other guest-level metrics require the agent. Candidates who assume both come from the same source tend to pick option D.

How to eliminate wrong answers

Option B is wrong because a 300-second duration only means the CPU condition must persist for 5 minutes before the alert fires; it delays the CPU alert but has no bearing on why the memory alert fires. Option C is wrong because a higher threshold value makes an alert harder to trigger, not easier; a memory alert with a higher threshold would require a more extreme condition to fire. Option D is wrong because CPU utilization does not depend on the Cloud Monitoring agent; the hypervisor reports it for every instance, so a missing agent cannot make the CPU metric unavailable.

58

MCQmedium

An SRE team needs to implement an incident management workflow that automatically creates a ticket in their ITSM tool when a critical alert fires. They use Cloud Monitoring. Which approach should they use?

A.Configure the alerting policy to send notifications via email to the ITSM system's email-to-ticket feature.

B.Create a webhook notification channel directly to the ITSM tool.

C.Use a Cloud Pub/Sub notification channel and a Cloud Function that receives the alert and calls the ITSM API.

D.Use the Cloud Monitoring API to periodically pull alerts and create tickets.

AnswerC

Pub/Sub ensures reliable delivery, and Cloud Function can transform and forward alerts to the ITSM tool.

Why this answer

Option C is correct because Cloud Monitoring can send alert notifications to a Cloud Pub/Sub topic, which then triggers a Cloud Function. The Cloud Function can parse the alert payload and call the ITSM tool's API to create a ticket, providing a reliable, scalable, and decoupled integration that supports custom logic and error handling.
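A minimal sketch of the parsing step inside such a function is shown below. It does not call any ITSM API; it only decodes a Pub/Sub-style message and builds a ticket payload. The `incident` field names (`policy_name`, `state`, `summary`, `url`, `incident_id`) are assumptions about the notification schema and should be checked against a real message, and the ticket fields and function name are hypothetical.

```python
import base64
import json


def ticket_from_pubsub(event):
    """Turn a Pub/Sub-delivered alert notification into a ticket payload.

    `event` mimics the Pub/Sub message shape: {"data": <base64-encoded JSON>}.
    """
    notification = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    incident = notification.get("incident", {})
    state = incident.get("state", "unknown")
    policy = incident.get("policy_name", "unnamed policy")
    return {
        "title": f"[{state}] {policy}",
        "description": incident.get("summary", ""),
        "link": incident.get("url", ""),
        "external_id": incident.get("incident_id", ""),
    }


# Hand-written sample notification, not captured from a real alert
sample = {
    "incident": {
        "incident_id": "abc123",
        "policy_name": "High CPU",
        "state": "open",
        "summary": "CPU utilization is above 0.8",
        "url": "https://example.com/incident/abc123",
    }
}
event = {"data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")}
ticket = ticket_from_pubsub(event)
print(ticket)
```

A real function would pass `ticket` to the ITSM tool's API and let Pub/Sub redeliver the message if that call fails.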

Exam trap

Google Cloud often tests the misconception that direct webhooks (Option B) are sufficient for ITSM integration. A Cloud Monitoring webhook channel sends a fixed JSON payload and supports only basic or token authentication, so it offers no place for the payload transformation, custom headers, or error handling that enterprise ITSM APIs usually require.

How to eliminate wrong answers

Option A is wrong because email-to-ticket features are unreliable for critical alerts due to potential delays, spam filtering, and lack of guaranteed delivery; they also do not support structured data or automated acknowledgment. Option B is wrong because a direct webhook notification channel sends HTTP POST requests in Cloud Monitoring's own format, with only basic or token authentication and no payload transformation, which rarely matches what an ITSM ticket-creation API expects. Option D is wrong because periodically pulling alerts via the Cloud Monitoring API introduces latency, misses real-time alerting requirements, and adds unnecessary complexity compared to event-driven push notifications.

59

MCQmedium

A team is using Cloud Monitoring to track the performance of a microservices application. They set up an uptime check for each service, but they notice that some checks are failing intermittently without actual service degradation. What is the most likely cause?

A.The services are behind a load balancer that occasionally returns 503 during scaling.

B.The timeout setting is too short for the service's typical latency.

C.Uptime checks are deployed in a single region, causing false positives.

D.The project's quota for uptime checks has been exceeded.

AnswerB

A short timeout can cause the check to fail even when the service is healthy, especially during transient latency spikes.

Why this answer

The most likely cause is that the timeout setting is too short, causing false positives when the service response time temporarily exceeds the timeout. The other options are less plausible: uptime checks typically run from multiple regions; load balancer 503 errors would indicate a real issue rather than a false positive; and exceeding the uptime check quota would prevent checks from running instead of making them fail intermittently.

60

MCQeasy

A DevOps team is defining an SLO for a web application that runs on Compute Engine behind an HTTP Load Balancer. They need to measure the proportion of requests that complete within 300ms. Which Cloud Monitoring metric is most appropriate as the SLI?

A.loadbalancing.googleapis.com/https/backend_request_bytes

B.loadbalancing.googleapis.com/https/frontend_tcp_rtt

C.loadbalancing.googleapis.com/https/request_count

D.loadbalancing.googleapis.com/https/total_latencies

AnswerD

This metric gives latency distribution, including percentiles, making it ideal for a latency SLI.

Why this answer

The SLI must measure the proportion of requests completing within 300ms, which is a latency distribution metric. The `total_latencies` metric from the HTTP Load Balancer provides a histogram of request latencies, allowing you to compute the percentage of requests below a threshold (e.g., 300ms). This directly supports the SLO definition.
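The "proportion under a threshold" calculation from a latency histogram can be sketched as follows. The bucket boundaries and counts are invented for illustration; the real `total_latencies` distribution uses its own bucket layout, so 300 ms may not fall exactly on a boundary and some interpolation or rounding to the nearest bucket would be needed.

```python
def fraction_within(bucket_upper_bounds_ms, bucket_counts, threshold_ms):
    """Fraction of requests in buckets whose upper bound is <= threshold_ms."""
    total = sum(bucket_counts)
    if total == 0:
        raise ValueError("empty distribution")
    good = sum(
        count
        for upper, count in zip(bucket_upper_bounds_ms, bucket_counts)
        if upper <= threshold_ms
    )
    return good / total


# Hypothetical histogram: upper bounds in milliseconds and request counts
bounds = [50, 100, 300, 1000, float("inf")]
counts = [400, 350, 200, 40, 10]

sli = fraction_within(bounds, counts, 300)
print(f"{sli:.1%} of requests completed within 300 ms")
```

Here 950 of 1,000 requests land in buckets at or below 300 ms, so the latency SLI for this sample is 95%.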

Exam trap

Google Cloud often tests the distinction between latency metrics (histogram-based) and simple counters or byte metrics, expecting candidates to recognize that only a distribution metric like `total_latencies` can compute percentile-based SLIs.

How to eliminate wrong answers

Option A is wrong because `backend_request_bytes` measures the size of request data sent to backends, not latency. Option B is wrong because `frontend_tcp_rtt` measures TCP round-trip time between client and load balancer, not application-layer request latency. Option C is wrong because `request_count` only counts total requests without any latency information, so it cannot be used to measure the proportion of fast requests.

61

Multi-Selecteasy

A DevOps engineer notices that some Compute Engine instances are not reporting metrics to Cloud Monitoring. Which two potential causes should they investigate? (Choose two.)

Select 2 answers

A.The instances are in a different region and Cloud Monitoring doesn't support cross-region.

B.The instances are preemptible and automatically stop reporting after 24 hours.

C.The instances have insufficient IAM permissions to write metrics.

D.The instances are in a different project and not peered.

E.The Ops Agent is not installed on the instances.

AnswersC, E

Instances need the roles/monitoring.metricWriter role to send metrics.

Why this answer

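Option C is correct because Compute Engine instances require the appropriate IAM permissions (e.g., roles/monitoring.metricWriter) to write metrics to Cloud Monitoring. Without these permissions, the API calls to ingest metric data are denied, even if the Ops Agent is installed and running. Option E is correct because guest-level metrics such as memory and disk usage are collected by the Ops Agent; without it, only hypervisor-level metrics reach Cloud Monitoring.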

Exam trap

Google Cloud often tests the misconception that preemptible instances have a built-in metric reporting cutoff, when in fact they can report metrics normally until they are preempted, and the real issue is often IAM permissions or missing agent installation.

62

Multi-Selecthard

A team is designing a dashboard for their production environment using Cloud Monitoring. Which three types of information should be included on the dashboard to support incident response? (Choose three.)

Select 3 answers

A.Resource utilization trends

B.Recent alerting history

C.Real-time user feedback

D.Security audit logs

E.Service Level Indicators (SLIs)

AnswersA, B, E

Trends help identify capacity-related issues during incidents.

Why this answer

Resource utilization trends (A) are essential for incident response because they provide historical context, enabling responders to identify anomalies, correlate changes with incidents, and predict capacity issues. Cloud Monitoring's Metrics Explorer and dashboards allow you to plot trends over time, which is critical for root cause analysis during an active incident.

Exam trap

Google Cloud often tests the distinction between operational monitoring data (metrics, alerts, SLIs) and non-operational data (user feedback, audit logs), expecting candidates to recognize that dashboards for incident response must contain only real-time, actionable, and metric-based information.

63

MCQmedium

An SRE team created the above logs-based metric. They expect it to count the number of HTTP 500 errors per instance. However, the metric shows no data. What is the most likely cause?

A.The metric kind is DELTA but should be CUMULATIVE.

B.The log entries might not have the 'status' field in jsonPayload; it could be in a different location or format.

C.The metric name does not follow the required naming convention.

D.The labelExtractors must use regex instead of JSON path.

AnswerB

If the logs are structured differently, the filter will not match, resulting in no data.

Why this answer

Option B is correct because the most likely reason for a logs-based metric showing no data is that the log entries do not contain the expected 'status' field in jsonPayload, or it is located in a different field (e.g., httpRequest.status) or formatted as a string instead of an integer. Logs-based metrics rely on the exact field paths used in the metric's filter and label extractors; if the field is missing or misnamed, no log entries match and no data points are generated.
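One way to check this hypothesis is to inspect a few real log entries and see where the status actually lives. The helper below is a hypothetical sketch that works on log entries already loaded as Python dictionaries; the two sample entries are invented.

```python
def locate_status(entry):
    """Report where an HTTP status appears in a log entry dict, if anywhere.

    Returns a mapping of field path -> (value, type name) for each path found.
    """
    candidates = {
        "jsonPayload.status": entry.get("jsonPayload", {}).get("status"),
        "httpRequest.status": entry.get("httpRequest", {}).get("status"),
    }
    return {
        path: (value, type(value).__name__)
        for path, value in candidates.items()
        if value is not None
    }


entry_a = {"jsonPayload": {"status": "500", "message": "boom"}}  # string status
entry_b = {"httpRequest": {"status": 500}}                       # different field

print(locate_status(entry_a))
print(locate_status(entry_b))
```

A filter written as `jsonPayload.status=500` would miss `entry_b` entirely, and whether it matches `entry_a` depends on how the string value is compared, so checking the raw entries first avoids guessing.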

Exam trap

Google Cloud often tests the misconception that metric kind or naming conventions cause missing data, but the real issue is almost always a mismatch between the log entry's actual field structure and the metric's extraction configuration.

How to eliminate wrong answers

Option A is wrong because the metric kind (DELTA vs. CUMULATIVE) affects how values are aggregated over time, not whether data appears; a DELTA metric will still show data if log entries match the filter and field extraction succeeds. Option C is wrong because an invalid metric name would cause a creation error, not silent data absence. Option D is wrong because label extractors accept either `EXTRACT(field)` or `REGEXP_EXTRACT(field, regex)`; a regular expression is optional, so using a plain field reference is not what prevents the metric from producing data.

64

MCQeasy

A company wants to monitor custom application metrics in real-time and trigger alerts when a metric exceeds a threshold. Which Google Cloud service should they use?

A.Cloud Monitoring

B.Cloud Audit Logs

C.Cloud Logging

D.Cloud Error Reporting

AnswerA

Cloud Monitoring ingests custom metrics and provides alerting capabilities.

Why this answer

Cloud Monitoring (formerly Stackdriver Monitoring) is the correct service because it is designed to ingest custom application metrics via the Monitoring API or OpenTelemetry, create dashboards for real-time visualization, and configure alerting policies that trigger notifications when a metric exceeds a defined threshold. This directly meets the requirement for real-time monitoring and threshold-based alerts.

Exam trap

Google Cloud often tests the distinction between logging (text-based events) and monitoring (numeric time-series metrics), so the trap here is that candidates confuse Cloud Logging's log-based metrics or alerting on log entries with the dedicated metric monitoring and alerting capabilities of Cloud Monitoring.

How to eliminate wrong answers

Option B (Cloud Audit Logs) is wrong because it records administrative actions and access events for compliance and security auditing, not real-time custom application metrics or threshold-based alerting. Option C (Cloud Logging) is wrong because it ingests and stores log data (text-based events) and can trigger alerts on log content, but it is not designed for numeric metric time-series ingestion or threshold evaluation. Option D (Cloud Error Reporting) is wrong because it aggregates and analyzes application errors (e.g., exceptions, stack traces) from logs, not custom numeric metrics, and does not support threshold-based alerting on metric values.

65

MCQmedium

You are creating a Cloud Monitoring dashboard to display the 99th percentile latency of your HTTP Load Balancer over the last 6 hours. Which MQL query should you use?

A.fetch https_lb_rule :: latency | align 99p

B.fetch loadbalancing.googleapis.com/https/total_latencies | align percentile(99)

C.fetch loadbalancing.googleapis.com/https/request_count | align | ratio

D.fetch loadbalancing.googleapis.com/https/total_latencies | align 99 | with latency

AnswerB

This query fetches the latency distribution and aligns to the 99th percentile, exactly as needed.

Why this answer

Option B is correct because it uses the correct metric type (`total_latencies`) and the proper MQL function `percentile(99)` to compute the 99th percentile latency. The `fetch` statement targets the exact Cloud Monitoring metric for HTTPS load balancer latencies, and `align percentile(99)` aggregates the raw latency distribution data over the specified time window (last 6 hours) to produce the desired percentile value.

Exam trap

Google Cloud often tests the distinction between valid metric names (e.g., `total_latencies` vs. `latency`) and correct MQL syntax (e.g., `percentile(99)` vs. `99p` or `align 99`), leading candidates to choose syntactically close but incorrect options like A or D.

How to eliminate wrong answers

Option A is wrong because `https_lb_rule` is a monitored resource type, not a metric type, and `latency` is not the name of a Cloud Monitoring metric; the metric needed here is `loadbalancing.googleapis.com/https/total_latencies`. Additionally, `align 99p` is not valid MQL syntax; the percentile must be requested with `percentile(99)`. Option C is wrong because `request_count` is a count metric, not a latency metric, and using `ratio` would compute a ratio of request counts, not a latency percentile. Option D is wrong because `align 99` is invalid MQL syntax (a percentile requires the `percentile()` function), and `with latency` is not a recognized MQL clause for extracting or labeling the result.

66

MCQeasy

A company runs a multi-region web application on Google Kubernetes Engine (GKE) using Cloud Load Balancing and Cloud Armor. They use Cloud Monitoring to track user-facing latency. Recently, they noticed that the p99 latency has increased from 200ms to 2s during peak hours, but only for users in the US region. The team suspects a specific backend service in us-central1 is causing the spike. They have set up a dashboard showing latency by region, but the latency metric is aggregated globally, not broken down by region. What should they do to pinpoint the issue?

A.Deploy a sidecar proxy in each pod to collect detailed latency data and export it to a third-party tool.

B.Use Cloud Monitoring's 'Service Monitoring' to set up a service SLO and create a burn-rate alert.

C.Use the GKE Dashboard to view per-pod latency metrics.

D.Create a custom log-based metric that extracts latency per region from application logs.

AnswerD

Log-based metrics allow you to parse latency values and labels (e.g., region) from structured logs, providing per-region latency data to pinpoint the issue.

Why this answer

Option D is correct because creating a custom log-based metric that extracts latency per region from application logs allows you to break down the globally aggregated latency metric into per-region slices. This directly addresses the need to isolate the us-central1 backend service's impact on p99 latency during US peak hours, without requiring additional infrastructure or third-party tools.

Exam trap

The trap here is that candidates may assume per-pod metrics (Option C) are sufficient for user-facing latency analysis, but GKE Dashboard metrics are infrastructure-focused and lack the regional breakdown needed to isolate a specific backend service's impact on global p99 latency.

How to eliminate wrong answers

Option A is wrong because deploying a sidecar proxy adds unnecessary complexity and cost, and exporting data to a third-party tool is not required when Cloud Monitoring's log-based metrics can already extract and filter latency by region from existing application logs. Option B is wrong because Service Monitoring and SLO burn-rate alerts are designed to detect when a service level objective is being violated, not to diagnose the root cause of a latency spike by region; they would only confirm the problem exists, not pinpoint the specific backend. Option C is wrong because the GKE Dashboard provides per-pod metrics like CPU and memory, but it does not expose user-facing latency broken down by region; latency metrics are typically collected at the load balancer or application layer, not at the pod level.

67

MCQhard

Your organization runs a critical e-commerce platform on Google Kubernetes Engine (GKE). The platform uses Cloud Service Mesh (Anthos Service Mesh) for traffic management and Cloud Monitoring for observability. Recently, after a new release, you observe that the p99 latency of the checkout service has increased from 200ms to 2s. The service's CPU and memory metrics appear normal, and there are no error logs. The release included a change to the Istio VirtualService configuration that added a retry policy: 3 retries with a 500ms timeout per retry. You suspect that the retries are contributing to the latency increase. You want to use Cloud Monitoring to confirm this hypothesis. Which approach should you take?

A.Use Cloud Trace to analyze distributed traces for the checkout service and look for retry spans

B.Check the 'Services' dashboard in Cloud Monitoring, which shows a pre-built latency chart for all services

C.Use Metrics Explorer to query the istio.io/service/server/request_count metric, filtered by response_code_class and destination_service, and include the istio.io/service/server/request_retries metric to see retry counts alongside latency

D.Use Logs Explorer to search for logs containing 'retry' in the checkout service namespace

AnswerC

This directly shows the correlation between retries and latency.

Why this answer

Option C is correct because it directly correlates retry attempts with latency by querying the `istio.io/service/server/request_retries` metric alongside the `istio.io/service/server/request_count` metric in Metrics Explorer. This allows you to visualize the retry count per destination service (checkout) and compare it with the p99 latency increase, confirming whether the retry policy is causing the observed latency spike. The retry policy (3 retries with 500ms timeout) can add up to 1.5s of additional latency per request, which aligns with the increase from 200ms to 2s.
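The latency arithmetic in this explanation can be worked through in a few lines. This is a back-of-the-envelope sketch that assumes every attempt, including the first, is bounded by the 500 ms per-try timeout and that there is no backoff delay between attempts; the function name is hypothetical.

```python
def worst_case_latency_ms(per_try_timeout_ms, retries):
    """Return (total, added) worst-case latency in milliseconds.

    total: initial attempt plus all retries, each hitting the per-try timeout.
    added: the portion contributed by the retries alone.
    """
    added = retries * per_try_timeout_ms
    total = per_try_timeout_ms + added
    return total, added


total_ms, added_ms = worst_case_latency_ms(per_try_timeout_ms=500, retries=3)
print(f"retries add up to {added_ms} ms; worst case is {total_ms} ms per request")
```

Under these assumptions the retries alone contribute up to 1.5 s and a request that times out on every attempt takes about 2 s, which is in line with the observed p99.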

Exam trap

Google Cloud often tests the distinction between metrics (which aggregate over time) and traces (which show individual request paths), leading candidates to choose Cloud Trace (Option A) when they should use Metrics Explorer with retry-specific metrics to confirm a latency hypothesis.

How to eliminate wrong answers

Option A is wrong because Cloud Trace shows distributed traces and retry spans, but it does not provide aggregated metrics like p99 latency or retry counts over time; it is more suitable for debugging individual requests rather than confirming a hypothesis about overall latency trends. Option B is wrong because the pre-built 'Services' dashboard in Cloud Monitoring shows latency charts but does not include retry metrics, so you cannot directly correlate retries with latency increases. Option D is wrong because Logs Explorer searching for 'retry' logs is inefficient and unreliable; Istio retries are not always logged by default, and even if they are, logs do not provide the aggregated time-series data needed to confirm a latency hypothesis.

68

MCQhard

An organization is implementing SLO-based alerting for a critical service. They want to alert when the service has consumed 50% of its error budget over a 30-day window. Considering best practices for alert sensitivity and noise reduction, which alerting approach should they use?

A.Alert on the burn rate over a 1-hour window with a threshold of 10.

B.Alert on the burn rate over a 5-minute window with a threshold of 0.5.

C.Alert on the error budget remaining with a threshold of 50%.

D.Alert on the SLI value directly with a threshold of 99.9%.

AnswerA

A burn rate of 10 means the error budget would be exhausted in 3 days (30 days / 10 = 3 days), so half of the budget would be gone in about 1.5 days if that rate continued; evaluating it over a 1-hour window detects the problem early without reacting to brief blips.

Why this answer

Option A is correct because alerting on a burn rate of 10 over a 1-hour window directly indicates that the service is consuming error budget at a rate that would exhaust the entire 30-day budget in 3 days (since 30 days / 10 = 3 days). This approach balances sensitivity and noise reduction by using a sufficiently long window (1 hour) to smooth out transient spikes, while the high threshold ensures only significant sustained degradation triggers an alert, aligning with SRE best practices for multi-window, multi-burn-rate alerting.
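The burn-rate arithmetic can be verified with a short calculation. The sketch assumes the usual definition in which a burn rate of 1 consumes exactly the whole budget over the SLO window; the function names are hypothetical.

```python
def days_to_exhaustion(slo_window_days, burn_rate):
    """Days until the whole error budget is spent at a constant burn rate."""
    return slo_window_days / burn_rate


def budget_consumed(burn_rate, alert_window_hours, slo_window_days):
    """Fraction of the error budget spent during one alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


exhaust_days = days_to_exhaustion(slo_window_days=30, burn_rate=10)
half_days = exhaust_days / 2
per_hour = budget_consumed(burn_rate=10, alert_window_hours=1, slo_window_days=30)

print(f"budget exhausted in {exhaust_days} days, 50% consumed in {half_days} days")
print(f"one hour at burn rate 10 spends {per_hour:.2%} of the budget")
```

At a burn rate of 10 the budget lasts 3 days and the 50% mark is reached after 1.5 days, while a single hour at that rate spends roughly 1.4% of the 30-day budget.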

Exam trap

Google Cloud often tests the misconception that shorter windows (like 5 minutes) are better for fast detection, but the trap here is that overly short windows increase noise and false positives, whereas a 1-hour window with a high burn rate threshold provides the right balance for a 30-day SLO.

How to eliminate wrong answers

Option B is wrong because a 5-minute window with a burn rate threshold of 0.5 is far too sensitive and noisy; it would trigger alerts on minor, transient blips that do not meaningfully consume error budget over the 30-day window, leading to alert fatigue. Option C is wrong because alerting on error budget remaining at 50% is a reactive, threshold-based approach that provides no lead time; by the time 50% is consumed, the service may already be in a critical state, and it does not account for the rate of consumption. Option D is wrong because alerting on the SLI value directly (e.g., 99.9%) is a static threshold that ignores the error budget entirely; it can trigger false positives during normal fluctuations and fails to measure the actual impact on the SLO over the compliance period.

69

MCQ · hard

A company has on-premises servers running Linux and GKE clusters. They want to monitor all infrastructure using Cloud Monitoring. Which solution is most scalable and aligned with Google's best practices?

A. Use collectd on on-prem servers to send to Cloud Monitoring via the Stackdriver agent configuration.

B. Deploy Prometheus on both environments and use the PromQL adapter for Cloud Monitoring.

C. Use Google's managed service for Prometheus on GKE and a Prometheus federation for on-prem.

D. Install the Ops Agent on all on-prem servers and use Google's default GKE monitoring.

Answer: C

The managed service for Prometheus on GKE is fully integrated, and federation from on-prem Prometheus scales well.

Why this answer

Option C is correct because it leverages Google's managed service for Prometheus on GKE, which is fully integrated with Cloud Monitoring and eliminates the operational overhead of self-managing Prometheus. For on-premises servers, Prometheus federation allows scraping metrics from the on-prem Prometheus instance and forwarding them to the managed service, providing a unified, scalable monitoring solution that aligns with Google's best practices for hybrid environments.

Exam trap

Google Cloud often tests the misconception that self-managed Prometheus with a custom adapter is the most scalable solution, when in fact Google's managed service eliminates operational overhead and provides native integration with Cloud Monitoring, making it the best practice for hybrid environments.

How to eliminate wrong answers

Option A is wrong because collectd is a legacy agent that requires manual configuration and does not natively integrate with Cloud Monitoring's modern metric pipeline; the Stackdriver agent is deprecated in favor of the Ops Agent. Option B is wrong because deploying self-managed Prometheus on both environments and using the PromQL adapter adds unnecessary complexity and does not leverage Google's managed service, which provides automatic scaling, high availability, and native integration with Cloud Monitoring. Option D is wrong because the Ops Agent is built for Compute Engine VMs rather than on-premises servers, so it is a poor fit for the on-prem half of the fleet; in addition, Google's default GKE monitoring lacks the flexibility and advanced querying capabilities of Prometheus for custom metrics.

70

MCQ · easy

A team wants to monitor a Google Cloud Run service for application crashes. Which Google Cloud tool automatically captures and notifies on application errors?

A. Cloud Logging

B. Cloud Monitoring

C. Cloud Console

D. Error Reporting

Answer: D

Error Reporting automatically aggregates errors and can send notifications.

Why this answer

Error Reporting (D) is the correct answer because it is a Google Cloud service specifically designed to automatically capture, aggregate, and notify on application errors, including crashes in Cloud Run services. It ingests error events from Cloud Logging and provides real-time alerts and dashboards, making it the dedicated tool for this use case.
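As a rough sketch of how a crash could reach Error Reporting through Cloud Logging, the snippet below writes one JSON line to stderr. It assumes the service runs on Cloud Run, where stdout/stderr lines that parse as JSON are ingested as structured log entries. `format_error_entry` and `handle_request` are hypothetical names, and the `@type` value is the marker Error Reporting documents for a reported error event.

```python
import json
import sys
import traceback

# Marker that identifies a log entry as a reported error event.
REPORTED_ERROR_TYPE = (
    "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent"
)


def format_error_entry(exc: BaseException) -> str:
    """Build one structured log line carrying the exception's stack trace."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return json.dumps(
        {
            "severity": "ERROR",
            "message": stack,
            "@type": REPORTED_ERROR_TYPE,
        }
    )


def handle_request(payload: dict) -> int:
    """Hypothetical request handler that fails when 'score' is missing."""
    return payload["score"] * 2


if __name__ == "__main__":
    try:
        handle_request({})
    except KeyError as exc:
        # One JSON object per line so the log agent can parse it.
        print(format_error_entry(exc), file=sys.stderr)
```

Grouping and notification then happen on the Error Reporting side; the application only has to emit the stack trace in a recognizable form.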

Exam trap

Google Cloud often tests the distinction between log storage (Cloud Logging) and error-specific analysis (Error Reporting), leading candidates to mistakenly choose Cloud Logging because they think 'logs contain errors, so that must be the tool.'

How to eliminate wrong answers

Option A is wrong because Cloud Logging is a centralized log storage and querying service; it does not automatically parse or notify on application errors without additional configuration (e.g., log-based metrics or sinks). Option B is wrong because Cloud Monitoring focuses on metrics, uptime checks, and alerting based on performance thresholds, not on automatically capturing and categorizing application crash errors. Option C is wrong because Cloud Console is a web-based UI for managing Google Cloud resources; it provides no automated error capture or notification capabilities.

71

MCQ · hard

A company uses Cloud Monitoring to track latency for a multi-region web application. The SLO is 99.9% of requests under 500ms over a 30-day rolling window. The error budget has been rapidly depleting over the last week. The operations team wants to understand the impact of recent deployments. Which approach should they use to correlate deployment changes with latency spikes?

A. Use Cloud Logging to search for deployment logs and manually compare with latency metrics

B. Use Cloud Trace to analyze latency distributions for each deployment version

C. Create a custom dashboard in Cloud Monitoring that includes latency charts and use annotation markers to indicate deployment times

D. Configure Error Reporting to alert on latency threshold breaches

Answer: C

Annotation markers allow you to overlay deployment events on time-series charts, making it easy to correlate changes with latency spikes.

Why this answer

Option C is correct because Cloud Monitoring supports custom dashboards with annotation markers that can be programmatically or manually added to indicate deployment events. By overlaying these markers on latency charts, the operations team can visually correlate deployment times with latency spikes, enabling direct root-cause analysis without manual log searching or separate tools.
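The correlation that annotation markers make visible on a chart can also be sketched numerically: compare the mean latency in a window before and after each deployment timestamp. Everything here is made up for illustration (the sample data, the `latency_shift` helper, the 10-minute window); it is a local calculation, not a Cloud Monitoring API call.

```python
from datetime import datetime, timedelta
from statistics import mean


def latency_shift(samples, deployed_at, window=timedelta(minutes=10)):
    """Return (mean_before, mean_after) latency in ms around a deployment.

    samples is a list of (timestamp, latency_ms) tuples. A side with no
    samples inside the window is reported as None.
    """
    before = [ms for ts, ms in samples if deployed_at - window <= ts < deployed_at]
    after = [ms for ts, ms in samples if deployed_at <= ts < deployed_at + window]
    return (mean(before) if before else None, mean(after) if after else None)


start = datetime(2024, 1, 1, 12, 0)

# Hypothetical latency samples, one per minute: 300 ms, then 650 ms from minute 10.
samples = [(start + timedelta(minutes=i), 300 if i < 10 else 650) for i in range(20)]

# Hypothetical deployment marker at minute 10.
deployed_at = start + timedelta(minutes=10)

before, after = latency_shift(samples, deployed_at)
print(before, after)
```

A jump that lines up with the marker, as in this synthetic data, is the pattern the team would look for on the dashboard.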

Exam trap

Google Cloud often tests the distinction between monitoring tools (Cloud Monitoring for dashboards and annotations) versus debugging tools (Cloud Trace for per-request analysis) or logging tools (Cloud Logging for raw logs), leading candidates to choose a tool that addresses part of the problem but not the correlation requirement.

How to eliminate wrong answers

Option A is wrong because manually searching Cloud Logging for deployment logs and comparing them with latency metrics is inefficient, error-prone, and does not provide a real-time or automated correlation; it relies on manual cross-referencing, which is not scalable for a multi-region application. Option B is wrong because Cloud Trace is designed for distributed tracing of individual requests and analyzing latency distributions per version, but it does not natively support overlaying deployment timelines or providing a high-level dashboard view for correlation with deployment events. Option D is wrong because Error Reporting is focused on aggregating and alerting on application errors (e.g., exceptions, crashes), not on latency threshold breaches; configuring it to alert on latency would misuse its purpose, and it lacks the ability to correlate alerts with deployment timelines.

72

MCQ · medium

Your company runs a multi-region application on Google Kubernetes Engine. You have implemented Cloud Monitoring dashboards to track cluster resource utilization and application SLIs. After a recent upgrade, you notice that the dashboard shows a sudden drop in CPU utilization for all nodes in one zone, but the application is still serving traffic normally. You suspect a monitoring issue. What should you investigate first?

A. Check if the nodes in that zone have been cordoned.

B. Check if the application's resource requests and limits have changed.

C. Check if the Kubernetes Metrics Server is running correctly in that zone.

D. Check if the Cloud Monitoring agent has been updated incorrectly.

Answer: C

Metrics Server is responsible for collecting resource usage; if it's down, CPU data would drop.

Why this answer

The Kubernetes Metrics Server is responsible for collecting resource metrics from Kubelets and exposing them via the Metrics API, which Cloud Monitoring uses to display CPU utilization. A sudden drop in CPU utilization across all nodes in a single zone, while the application continues to serve traffic normally, strongly indicates that the Metrics Server in that zone has failed or is not reporting metrics, rather than an actual change in workload. Investigating the Metrics Server's health and logs is the correct first step to confirm whether the monitoring pipeline is broken.

Exam trap

Google Cloud often tests the misconception that Cloud Monitoring relies on an external agent for all metrics, when in fact GKE integrates natively with the Metrics Server for node and pod resource utilization, making agent-related options a red herring.

How to eliminate wrong answers

Option A is wrong because cordoning a node prevents new pods from being scheduled but does not affect the reporting of CPU utilization for existing pods; the Metrics Server would still report metrics from running pods on cordoned nodes. Option B is wrong because changes to resource requests and limits affect pod scheduling and resource guarantees, not the actual CPU utilization reported by the Metrics Server; a sudden drop in reported utilization across all nodes in a zone is not caused by request/limit changes. Option D is wrong because Cloud Monitoring agents are not required for GKE node metrics; the Metrics Server collects and exposes node and pod metrics natively via the Kubernetes API, and Cloud Monitoring integrates directly with the Metrics API, not through a separate agent.

73

Multi-Select · hard

Which TWO metrics should be monitored to detect a potential memory leak in a Compute Engine VM?

Select 2 answers

A. CPU utilization

B. Process count

C. Memory usage (percentage)

D. Disk read IOPS

E. Network bytes sent

Answers: B, C

A memory leak may cause the application to spawn more processes.

Why this answer

Option B is correct because a memory leak causes processes to consume increasing amounts of memory without releasing it, leading to a growing process count as new instances of the leaking process are spawned or existing processes remain active. Monitoring the process count helps detect abnormal growth that correlates with memory exhaustion. Option C is correct because memory usage percentage directly reflects how much of the VM's available RAM is consumed; a steady upward trend without a corresponding increase in workload indicates a leak.
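The "steady upward trend" in memory usage can be made concrete with a least-squares slope over recent samples. This is a minimal sketch: the sample values, the function names, and the `min_slope` threshold of 0.5 percentage points per sample are assumptions chosen for illustration, not recommended alerting settings.

```python
def slope_per_sample(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_leak(memory_pct, min_slope=0.5):
    """Flag a sustained rise; min_slope is an assumed threshold in
    percentage points per sample."""
    return slope_per_sample(memory_pct) >= min_slope


# Hypothetical memory usage percentages sampled at a fixed interval.
steady = [41, 43, 40, 42, 41, 43, 42, 40]
leaking = [40, 42, 45, 47, 50, 52, 55, 57]

print(looks_like_leak(steady), looks_like_leak(leaking))
```

The same slope check could be applied to process count, so that a rise in both series together points more strongly at a leak than either one alone.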

Exam trap

Google Cloud often tests the misconception that CPU utilization is a primary indicator of memory leaks, but in reality, a leak can silently consume memory without spiking CPU until the system is critically low on memory.

74

Multi-Select · easy

You are designing a monitoring strategy for a cloud-native application. Which THREE components are essential for observability?

Select 3 answers

A. Metrics

B. Alerts

C. Dashboards

D. Traces

E. Logs

Answers: A, D, E

Metrics provide quantitative data about system performance.

Why this answer

Metrics (A) are essential because they provide quantitative, time-series data about system health and performance, such as CPU utilization, request latency, and error rates. In cloud-native observability, metrics are typically collected via Prometheus or similar systems, enabling trend analysis and threshold-based alerting. Without metrics, you cannot measure the overall state of your application or infrastructure over time.

Exam trap

Google Cloud often tests the distinction between the core observability data types (metrics, logs, traces) and the operational tools built on them (alerts, dashboards), trapping candidates who confuse 'observability components' with 'monitoring tools'.

75

MCQ · easy

Based on the exhibit, which Cloud Logging query filter will return all logs of this type?

A. severity>=ERROR

B. severity:"ERROR"

C. severity=ERROR

D. jsonPayload.severity:ERROR

Answer: C

This is the correct syntax to match logs with severity exactly 'ERROR'.

Why this answer

Option C is correct because Cloud Logging uses the `severity=ERROR` syntax to filter logs by exact severity level. The `=` operator performs an exact match on the severity field, which is a standard LogEntry field with predefined values (DEFAULT, DEBUG, INFO, NOTICE, WARNING, ERROR, CRITICAL, ALERT, EMERGENCY). This filter returns all logs where the severity is exactly ERROR, matching the requirement.
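The difference between an exact match and an ordered comparison on severity can be modeled locally. The sketch below does not call the Logging API; it only mirrors the numeric ordering of the LogEntry severity enum. The helper names and the sample entries are made up for illustration.

```python
from enum import IntEnum


class LogSeverity(IntEnum):
    """Numeric values of the LogEntry severity enum."""
    DEFAULT = 0
    DEBUG = 100
    INFO = 200
    NOTICE = 300
    WARNING = 400
    ERROR = 500
    CRITICAL = 600
    ALERT = 700
    EMERGENCY = 800


def matches_exact(entry_severity: str, wanted: str) -> bool:
    """Models a filter of the form severity=WANTED."""
    return LogSeverity[entry_severity] == LogSeverity[wanted]


def matches_at_least(entry_severity: str, wanted: str) -> bool:
    """Models a filter of the form severity>=WANTED."""
    return LogSeverity[entry_severity] >= LogSeverity[wanted]


# Hypothetical severities of five log entries.
entries = ["INFO", "ERROR", "CRITICAL", "ERROR", "WARNING"]

exact = [s for s in entries if matches_exact(s, "ERROR")]
at_least = [s for s in entries if matches_at_least(s, "ERROR")]

print(exact)
print(at_least)
```

The exact match keeps only the two ERROR entries, while the ordered comparison also lets the CRITICAL entry through.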

Exam trap

Google Cloud often tests the distinction between the exact match (`=`) and substring match (`:`) operators in Cloud Logging. Candidates also reach for the range comparison (`>=`), which matches more than the requested level, or confuse the top-level severity field with a JSON payload field.

How to eliminate wrong answers

Option A is wrong because `severity>=ERROR` is a valid filter but a broader one: it returns ERROR and every higher level (CRITICAL, ALERT, EMERGENCY), not only entries at exactly ERROR. Option B is wrong because `:` is the "has" operator, which performs substring matching; exact matching on the enumerated severity field uses `=`. Option D is wrong because `jsonPayload.severity:ERROR` references a nested field under `jsonPayload`, but severity is a top-level LogEntry field, not part of the JSON payload; this filter would look for a custom field and miss entries that carry only the standard severity.

Page 1 of 2 · 78 questions total
