ACE · Chapter 38 of 101 · Objective 4.1

Cloud Monitoring and Uptime Checks

This chapter covers Cloud Monitoring and Uptime Checks, a core component of Google Cloud's observability suite. On the ACE exam, approximately 5-10% of questions touch on monitoring concepts, with a focus on configuring uptime checks, alerting policies, and interpreting monitoring dashboards. Understanding how to verify resource health and set up proactive notifications is essential for ensuring high availability and meeting SLAs. This chapter provides a deep dive into the mechanisms, configuration, and exam-specific nuances of uptime checks within Cloud Monitoring.

25 min read
Intermediate
Updated May 31, 2026

The Building Security Guard Analogy

Imagine a large office building with a security guard at the front desk. The guard's job is to ensure that specific areas of the building are accessible and responsive. The guard periodically walks to each department (like a server) and knocks on the door. If someone answers within a reasonable time, the guard notes that the department is 'up.' If no one answers after three knocks spaced 10 seconds apart, the guard marks that department as 'down' and alerts the building manager. The guard can also check if the department's phone is working (like an HTTP check) or just if the door opens (like a TCP check). The building manager can configure the guard to check every 5 minutes for critical departments (e.g., IT) and every 10 minutes for less critical ones. If a department fails three consecutive checks, the manager is notified via a pager (alerting policy). The guard also maintains a log of all checks and response times, which the manager reviews monthly. In this analogy, the security guard is the uptime check system, the departments are your resources, the knocks are probes, and the manager is the Cloud Monitoring alerting system.

How It Actually Works

What Are Uptime Checks and Why Do They Exist?

Uptime checks are a feature of Cloud Monitoring, part of Google Cloud Observability (formerly the operations suite, and before that Stackdriver). They monitor the availability and responsiveness of your applications and resources by sending probes from outside your network, which simulates how real users reach the service. The primary purpose is to detect outages or performance degradation before they affect end users, so that you can respond proactively.

On the ACE exam, you must know how to create and configure uptime checks, understand the different check types (HTTP, HTTPS, TCP), interpret the results, and integrate them with alerting policies. Uptime checks belong to Cloud Monitoring, which collects metrics; logs and traces are handled by the companion services Cloud Logging and Cloud Trace.

How Uptime Checks Work Internally

An uptime check is a periodic probe sent from Google Cloud's monitoring infrastructure to your resource. The probe originates from multiple geographic locations (called 'locations' in the console) to verify global reachability. The check can target:

A public IP address or URL

A load balancer frontend

A VM instance (via its external IP or a public endpoint)

For HTTP/HTTPS checks, the probe sends an HTTP request (GET by default) to the specified URL and expects a response within a configurable timeout (10 seconds by default in the console; the API accepts 1 to 60 seconds). By default a check passes only on a 2xx status code; you can widen or narrow this by listing accepted status classes or individual codes. For TCP checks, the probe attempts a TCP handshake on the specified port. If the handshake completes within the timeout, the check is successful.
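The TCP variant is simple enough to imitate locally. The following is a minimal sketch of a single handshake attempt, not Google's checker implementation; the function name `tcp_probe` and its return shape are assumptions made for illustration.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout_s: float = 10.0) -> tuple[bool, float]:
    """Return (passed, elapsed_seconds) for one TCP handshake attempt."""
    start = time.monotonic()
    try:
        # create_connection either completes the handshake or raises OSError.
        with socket.create_connection((host, port), timeout=timeout_s):
            passed = True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        passed = False
    return passed, time.monotonic() - start
```

Calling `tcp_probe("example.com", 443, timeout_s=5)` from a machine outside your network gives a rough local approximation of what a TCP uptime check sees.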

Each check runs on a configurable interval of 1, 5, 10, or 15 minutes (the API defaults to 1 minute). Every probe result is written to the metric uptime_check/check_passed, a boolean that is true (1) for a pass and false (0) for a fail. The latency of each probe is recorded as uptime_check/request_latency.

Key Components, Values, Defaults, and Timers

Check Type: HTTP, HTTPS, or TCP. HTTPS checks validate SSL certificates by default, but you can disable validation.

Check Interval: 1, 5, 10, or 15 minutes. The API defaults to 1 minute. The interval applies per location, so every selected location probes once per interval.

Timeout: Default 10 seconds in the console. The allowed range is 1 to 60 seconds. If no response arrives within the timeout, the check fails.

Locations: Probes come from uptime check regions such as USA (Oregon, Iowa, Virginia), Europe, South America, and Asia Pacific, not from arbitrary Compute Engine regions. The selection must cover at least three checker locations; if you select nothing, all regions are used.

Failure Threshold: Not a field on the uptime check itself. How many consecutive failures count as 'down' is set in the alerting policy, through the condition's duration and trigger settings.

Success Threshold: Also not an uptime check field. Healthy and unhealthy thresholds are settings of load balancer health checks, which are easy to confuse with uptime checks.

Response Status Codes: For HTTP/HTTPS, the default accepted class is 2xx. You can accept additional classes (for example 3xx) or a custom list of codes.

Content Matching: Optionally, you can check that the response body contains a specific string (e.g., 'OK'). This is useful for verifying application-level health.

Alerting: Uptime checks can trigger alerting policies when probes fail for longer than the policy's condition allows. Alerts can be sent via email, SMS, PagerDuty, Slack, etc.

Uptime Check Groups: You can group multiple resources into a single uptime check. The group is considered 'up' if all resources in the group pass.
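The pass/fail decision for one HTTP probe combines several of the settings above: timeout, accepted status codes, and the optional content match. A minimal sketch of that logic follows. The function `http_probe_passes` and its parameters are hypothetical, and the accepted codes are supplied by the caller rather than hard-coded.

```python
from typing import Optional


def http_probe_passes(
    status: int,
    body: str,
    elapsed_s: float,
    timeout_s: float,
    accepted_codes: range,
    must_contain: Optional[str] = None,
) -> bool:
    """Decide whether a single HTTP probe counts as a pass."""
    if elapsed_s > timeout_s:
        return False  # the response arrived too late
    if status not in accepted_codes:
        return False  # status code outside the accepted set
    if must_contain is not None and must_contain not in body:
        return False  # content match failed (substring test, case-sensitive)
    return True
```

A health endpoint that returns 200 with the body "status: Healthy" would fail a content match for "healthy" in this sketch, which mirrors the case-sensitivity pitfall.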

Configuration and Verification Commands

Uptime checks can be configured via the Cloud Console, gcloud CLI, or API. Here are key gcloud commands:

# Flag names below follow the gcloud monitoring uptime reference;
# confirm them with --help for your SDK version. PROJECT_ID, INSTANCE_ID,
# ZONE, and CHECK_ID are placeholders.

# Create an HTTPS uptime check against a public URL
gcloud monitoring uptime create my-http-check \
    --resource-type=uptime-url \
    --resource-labels=host=example.com,project_id=PROJECT_ID \
    --protocol=https \
    --path=/health \
    --period=5 \
    --timeout=10 \
    --regions=usa-oregon,europe,asia-pacific \
    --status-classes=2xx,3xx

# Create a TCP uptime check against a VM instance
gcloud monitoring uptime create my-tcp-check \
    --resource-type=gce-instance \
    --resource-labels=project_id=PROJECT_ID,instance_id=INSTANCE_ID,zone=ZONE \
    --protocol=tcp \
    --port=80 \
    --period=1 \
    --timeout=5 \
    --regions=usa-oregon,usa-iowa,usa-virginia

# List uptime checks
gcloud monitoring uptime list-configs

# Describe an uptime check (CHECK_ID comes from the list output)
gcloud monitoring uptime describe CHECK_ID

# Delete an uptime check
gcloud monitoring uptime delete CHECK_ID

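To look up the uptime check metric descriptor, query the Monitoring API (PROJECT_ID is a placeholder; metric descriptors are listed through the API rather than through a gcloud monitoring subcommand):

curl -s -G \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode 'filter=metric.type = "monitoring.googleapis.com/uptime_check/check_passed"' \
    "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors"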

Interaction with Related Technologies

Uptime checks integrate with:

Alerting Policies: You can create a policy that triggers when an uptime check fails. For example, if the metric uptime_check/check_passed is 0 for 5 minutes, send a notification.

Dashboards: Uptime check results can be displayed on custom dashboards using the Metrics Explorer.

Health Checks: Uptime checks are different from load balancer health checks. Load balancer health checks are internal and used for traffic routing, while uptime checks are external monitoring probes.

SLA Monitoring: Uptime checks are used to measure actual uptime against SLAs. You can set up alerts when uptime drops below a certain threshold.
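The "0 for 5 minutes" style of condition can be pictured as a check on how long the current run of failures has lasted. The sketch below is an illustration under that assumption, not how Cloud Monitoring evaluates policies internally; `failing_for` and the sample layout are hypothetical.

```python
from datetime import datetime, timedelta


def failing_for(samples: list[tuple[datetime, bool]], duration: timedelta, now: datetime) -> bool:
    """samples holds (timestamp, passed) pairs sorted oldest first.

    Returns True when the most recent unbroken run of failures started
    at least `duration` before `now`.
    """
    streak_start = None
    for timestamp, passed in reversed(samples):
        if passed:
            break  # the failure run ends at the most recent pass
        streak_start = timestamp
    return streak_start is not None and now - streak_start >= duration
```

A single passing probe inside the window resets the run, which is why a flapping endpoint may never satisfy a long duration.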

Exam Tips

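Remember the key values: a 10-second default timeout, 2xx as the default accepted status class, and intervals of 1, 5, 10, or 15 minutes. Consecutive-failure counts live in the alerting policy rather than in the check.

Public uptime checks can only monitor resources that are reachable from the internet (a public IP or URL). Internal-only resources need either a public proxy or load balancer in front of them, or a private uptime check, which reaches internal IPs through Service Directory.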

For HTTPS checks, SSL certificate validation is enabled by default. If your certificate is self-signed or invalid, the check will fail unless you disable validation.

Content matching is optional but powerful for verifying application health beyond just a TCP connection.

Uptime check locations are grouped into global regions (USA, Europe, South America, Asia Pacific), and the selection must cover at least three checker locations. Spreading probes across regions gives a more complete view of availability from different parts of the world.

Each location reports its result independently as its own check_passed time series. How many failing locations, and for how long, count as 'down' is decided by the alerting policy condition. Uptime alert conditions typically require failures from more than one location, which filters out network blips that affect a single checker.
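Combining per-location results into one up/down decision takes only a few lines. In this sketch, `resource_is_down` and `min_failing_locations` are hypothetical names: a value of 1 reproduces an "any location fails" rule, while 2 requires agreement between locations. The location keys are illustrative labels.

```python
def resource_is_down(latest_by_location: dict[str, bool], min_failing_locations: int = 1) -> bool:
    """latest_by_location maps a location name to its latest pass (True) or fail (False)."""
    failing = [location for location, passed in latest_by_location.items() if not passed]
    return len(failing) >= min_failing_locations


latest = {"usa-oregon": True, "europe": False, "asia-pacific": True}
print(resource_is_down(latest))                           # any-location rule
print(resource_is_down(latest, min_failing_locations=2))  # stricter rule
```

Requiring two failing locations trades slightly slower detection for fewer alerts caused by a problem on the checker's side of the network.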

Common Pitfalls

Misconfiguring the response status codes: If your health endpoint returns a 401 (unauthorized), the check will fail if 401 is not in the accepted range.

Timeout too low: If your application takes longer than the timeout to respond, the check will fail even if the application is healthy.

Not using content matching: A TCP check or a simple HTTP status check may not catch application-level errors (e.g., a web server returning 200 but with an error page).

Overlooking location selection: If all probes come from one part of the world (for example, only the three USA locations), you may miss regional outages that affect other areas.

Walk-Through

1

Define the resource to monitor

Identify the target resource. This can be a URL (e.g., https://example.com/health), a VM instance with a public IP, or a load balancer frontend. For HTTP/HTTPS checks, you need a valid URL. For TCP checks, you need the IP address or hostname and port. The resource must be publicly accessible from the internet because uptime check probes originate from Google Cloud's monitoring infrastructure outside your VPC network.

2

Choose check type and configure parameters

Select HTTP, HTTPS, or TCP. Set the check interval (1, 5, 10, or 15 minutes) and the timeout (default 10 sec, max 60 sec). For HTTP/HTTPS, specify accepted response status codes (default 2xx) and optionally a content match string. For HTTPS, decide whether to validate the SSL certificate (default enabled). For TCP, specify the port. The number of consecutive failures that should raise an alert is configured later, in the alerting policy.

3

Select probe locations

Choose the regions from which probes will be sent: USA (Oregon, Iowa, Virginia), Europe, South America, and Asia Pacific. The selection must cover at least three checker locations, and leaving it empty uses all of them. Using several regions provides redundancy and a global perspective. Each location independently sends probes at the configured interval, and the alerting policy decides how many failing locations mark the resource as 'down'.

4

Create an alerting policy

Navigate to Cloud Monitoring > Alerting > Create Policy. Select the metric 'uptime_check/check_passed' with a filter for your uptime check ID. Set a condition that counts failing probes, e.g., 'checks fail from more than one location for 3 minutes'. Configure notification channels (email, SMS, PagerDuty, etc.) and set a message. Alerts fire once the condition has held for the configured duration.
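The same policy can be expressed as a JSON document for the Monitoring API. The sketch below builds one as a Python dictionary. The helper name `uptime_alert_policy`, the check ID, and the channel name are placeholders, and the aggregation settings (counting failing time series across locations) are one common pattern rather than the only valid one.

```python
import json


def uptime_alert_policy(check_id: str, display_name: str, channel_names: list[str]) -> dict:
    """Build an AlertPolicy body that fires when more than one location fails."""
    metric_filter = (
        'metric.type="monitoring.googleapis.com/uptime_check/check_passed" '
        f'AND metric.label.check_id="{check_id}" '
        'AND resource.type="uptime_url"'
    )
    return {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [{
            "displayName": f"Uptime check {check_id} failing",
            "conditionThreshold": {
                "filter": metric_filter,
                "aggregations": [{
                    "alignmentPeriod": "1200s",
                    "perSeriesAligner": "ALIGN_NEXT_OLDER",
                    # Count the locations whose latest result is a failure.
                    "crossSeriesReducer": "REDUCE_COUNT_FALSE",
                    "groupByFields": ["resource.label.*"],
                }],
                "comparison": "COMPARISON_GT",
                "thresholdValue": 1,
                "duration": "180s",  # condition must hold for 3 minutes
                "trigger": {"count": 1},
            },
        }],
        "notificationChannels": channel_names,
    }


policy = uptime_alert_policy(
    "my-http-check", "Health endpoint down",
    ["projects/PROJECT_ID/notificationChannels/CHANNEL_ID"],
)
print(json.dumps(policy, indent=2))
```

The printed JSON could be saved to a file and supplied to the policy creation command or API call; check the current gcloud reference for the exact command before relying on it.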

5

Monitor and interpret results

View uptime check results in the Cloud Monitoring console under 'Uptime checks'. The dashboard shows a timeline of pass/fail status per location, latency metrics, and overall availability percentage. Use the Metrics Explorer to create custom charts. Regularly review trends to identify intermittent issues. If a check consistently fails, investigate the resource's health, network connectivity, or configuration.
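The availability percentage on the dashboard is, in essence, the share of probes that passed. A minimal sketch of that arithmetic follows, with a helper for spotting slow probes; both function names are hypothetical, and the inputs stand in for data exported from the uptime metrics.

```python
def availability_percent(results: list[bool]) -> float:
    """Share of passed probes, as a percentage."""
    if not results:
        raise ValueError("no probe results to summarize")
    return 100.0 * sum(results) / len(results)


def slow_probe_count(latencies_ms: list[float], threshold_ms: float) -> int:
    """Number of probes whose latency exceeded the threshold."""
    return sum(1 for latency in latencies_ms if latency > threshold_ms)


results = [True] * 99 + [False]
print(availability_percent(results))
print(slow_probe_count([120.0, 480.0, 950.0, 1300.0], threshold_ms=900.0))
```

Tracking slow probes alongside availability helps catch degradation before it turns into outright timeouts.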

What This Looks Like on the Job

Enterprise Scenario 1: E-commerce Application Health Monitoring

A large e-commerce company runs a multi-region application on Google Kubernetes Engine (GKE) behind a global HTTP(S) load balancer. They need to ensure that the checkout endpoint is always responsive from different parts of the world. They create an HTTPS uptime check targeting https://shop.example.com/checkout/health with a 10-second timeout and a content match for the string 'healthy'. They keep all uptime check regions selected: USA (Oregon, Iowa, Virginia), Europe, South America, and Asia Pacific. The check interval is set to 1 minute for rapid detection. An alerting policy is configured to notify the on-call engineer via PagerDuty if the check fails for 3 consecutive intervals from any location. This setup catches regional DNS issues or backend failures quickly. In production, they observed that during a traffic spike, the checkout endpoint responded in 12 seconds, causing timeouts. They adjusted the timeout to 15 seconds and added a separate latency alert when response time exceeds 10 seconds.

Enterprise Scenario 2: Internal Application Monitoring via Proxy

A financial services firm has a critical internal application that is not publicly accessible. They cannot use standard uptime checks directly. Instead, they deploy a reverse proxy (e.g., NGINX) in a public subnet that forwards health check requests to the internal application. The proxy exposes a /health endpoint that returns 200 only if the internal app is healthy. They create an HTTPS uptime check targeting the proxy's public URL. The proxy also adds latency and error rate metrics. This approach allows them to monitor internal services without exposing them directly. A common misconfiguration is forgetting to update the proxy's health endpoint when the internal app's health check path changes, causing false positives.
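The proxy's /health endpoint can be sketched with the standard library alone. This is an illustration of the pattern, not the firm's NGINX configuration: `INTERNAL_HEALTH_URL`, `internal_app_healthy`, and `HealthHandler` are hypothetical names, and a production proxy would also need TLS and access controls.

```python
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

INTERNAL_HEALTH_URL = "http://10.0.0.5:8080/healthz"  # hypothetical internal address


def internal_app_healthy(url: str = INTERNAL_HEALTH_URL, timeout_s: float = 2.0) -> bool:
    """Return True only if the internal app answers its health path with 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


class HealthHandler(BaseHTTPRequestHandler):
    # Stored as a static method so that subclasses can swap in another probe.
    probe = staticmethod(internal_app_healthy)

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        healthy = self.probe()
        body = b"healthy" if healthy else b"unhealthy"
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Because the endpoint returns 503 when the internal probe fails, an uptime check with the default accepted status codes would register the failure without any content matching.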

Enterprise Scenario 3: Multi-Cloud SLA Monitoring

A company uses both Google Cloud and AWS. They want to monitor the uptime of their AWS-hosted API from Google Cloud. They create an HTTP uptime check targeting the AWS API's public endpoint. This provides an independent verification of availability from a different cloud provider's network. They use this data to compare with AWS's reported uptime and to trigger failover to GCP if the API is down. A challenge is that the AWS side may block repeated probes from unfamiliar addresses, so they add the uptime checker IP addresses (listed by gcloud monitoring uptime list-ips) to the allowlist. They also require five consecutive failed checks in the alerting condition to avoid false alarms due to transient network issues between clouds.

How ACE Actually Tests This

What the ACE Exam Tests

Objective 4.1: 'Monitor Compute resources using uptime checks and alerting policies.' The exam expects you to:

Create and configure uptime checks (HTTP, HTTPS, TCP)

Understand default values (interval, timeout, accepted status codes)

Interpret uptime check metrics and logs

Set up alerting policies based on uptime check failures

Differentiate between uptime checks and load balancer health checks

Common Wrong Answers and Why Candidates Choose Them

1.

'Uptime checks can monitor internal IP addresses.' – For standard public uptime checks this is false: they require a publicly accessible endpoint, and reaching internal IPs takes a separately configured private uptime check. Candidates confuse uptime checks with load balancer health checks, which can monitor internal instances.

2.

'The default timeout is 30 seconds.' – The default is 10 seconds, and the allowed range is 1 to 60 seconds. Many candidates remember a larger number but not the default.

3.

'Uptime checks automatically create alerting policies.' – They do not. You must explicitly create an alerting policy. Candidates assume that because the check is created, alerts are automatic.

4.

'You can select only one location.' – The selected regions must cover at least three checker locations, and the exam may also ask about best practices. Spreading probes across several regions is recommended for global monitoring.

Specific Numbers and Terms That Appear on the Exam

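Check interval: 1, 5, 10, or 15 minutes (API default: 1 minute)

Default timeout: 10 seconds (allowed range: 1 to 60 seconds)

Failure and success thresholds: set in alerting policies and in load balancer health checks, not in the uptime check

Accepted response status codes: 2xx (default)

Uptime check metrics: uptime_check/check_passed, uptime_check/request_latency

Check types: HTTP, HTTPS, TCP

Locations: USA (Oregon, Iowa, Virginia), Europe, South America, Asia Pacific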

Edge Cases and Exceptions

Self-signed SSL certificates: If SSL validation is enabled, the check fails. Disable validation for self-signed certs.

Content matching case sensitivity: Content matching is case-sensitive by default.

Check groups: If you create a group, all resources must pass for the group to be 'up'. A single failure brings the group down.

Private resources: To monitor internal-only resources, expose them via a load balancer or proxy, or configure a private uptime check.

How to Eliminate Wrong Answers

If an answer says 'uptime checks can monitor any resource in your VPC,' eliminate it because they need public access.

If an answer mentions 'automatic alerting,' eliminate it because alerting policies are separate.

If an answer suggests a timeout of 30 seconds as default, eliminate it; default is 10.

If an answer says 'TCP checks can validate response content,' eliminate it; only HTTP/HTTPS supports content matching.

Key Takeaways

Uptime check interval is 1, 5, 10, or 15 minutes; the API default is 1 minute.

Default timeout is 10 seconds; maximum is 60 seconds.

Failure and success thresholds are not uptime check settings; consecutive-failure logic belongs in the alerting policy.

Public uptime checks require a publicly accessible endpoint (URL, public IP, or load balancer).

HTTP/HTTPS checks can validate response status codes (default 2xx) and content.

TCP checks only verify that a TCP handshake completes within the timeout.

Uptime checks do not automatically create alerting policies; you must create them separately.

Select multiple geographic locations for comprehensive global monitoring.

The metric `uptime_check/check_passed` is 1 for pass, 0 for fail.

Uptime checks are different from load balancer health checks; do not confuse them.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Uptime Checks

External probes from Google Cloud monitoring infrastructure

Used for availability monitoring and SLA tracking

Supports HTTP, HTTPS, and TCP

Configurable interval, timeout, regions

Can trigger alerting policies

Load Balancer Health Checks

Internal probes from the load balancer to backend instances

Used for traffic routing (determines which backends receive traffic)

Supports HTTP, HTTPS, HTTP/2, TCP, SSL, and gRPC

Configurable interval, timeout, healthy and unhealthy thresholds

Automatically marks backends as unhealthy; no direct alerting

Watch Out for These

Mistake

Uptime checks are the same as load balancer health checks.

Correct

They are different. Load balancer health checks are internal and used for traffic routing; uptime checks are external monitoring probes from Google's infrastructure.

Mistake

Uptime checks can monitor resources with internal IPs only.

Correct

Public uptime checks require a publicly accessible endpoint (public IP, URL, or load balancer frontend). They cannot reach internal-only IPs; that takes a private uptime check or a public proxy.

Mistake

The default timeout for uptime checks is 30 seconds.

Correct

The default timeout is 10 seconds. The maximum configurable timeout is 60 seconds.

Mistake

Creating an uptime check automatically sets up alerting.

Correct

Uptime checks and alerting policies are separate. You must manually create an alerting policy to receive notifications on failures.

Mistake

Uptime checks can only be configured via the Cloud Console.

Correct

They can also be configured via gcloud CLI, API, and Terraform. The exam may test knowledge of gcloud commands.


Frequently Asked Questions

Can uptime checks monitor resources in a VPC that have no public IP?

Not with a standard public uptime check, which requires a publicly accessible endpoint. To monitor internal resources, you can expose them via a load balancer with a public frontend or a reverse proxy, or configure a private uptime check. The exam may test this limitation.

What is the difference between an uptime check and a load balancer health check?

Uptime checks are external monitoring probes from Google Cloud's infrastructure used for availability monitoring and alerting. Load balancer health checks are internal probes from the load balancer to its backends, used solely for traffic routing. They have different configurations and purposes.

How do I set up an alert when an uptime check fails?

Navigate to Cloud Monitoring > Alerting > Create Policy. Select the metric `uptime_check/check_passed` with a filter for your check ID. Set a condition that counts failing probes (e.g., 'checks fail from more than one location for 5 minutes'). Configure notification channels and save. The alert fires once the condition has held for the configured duration.

What are the default values for uptime check interval and timeout?

The check interval can be 1, 5, 10, or 15 minutes, and the API defaults to 1 minute. The default timeout is 10 seconds, and you can raise it to as much as 60 seconds. These are common exam questions.

Can I use uptime checks to monitor an HTTPS endpoint with a self-signed certificate?

Yes, but you must disable SSL certificate validation in the uptime check configuration. By default, validation is enabled, and a self-signed certificate will cause the check to fail. Disable validation to ignore certificate errors.
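The effect of that setting can be illustrated with Python's own TLS client configuration. This is an analogy for what a probe does when validation is switched off, not the uptime checker's code; `probe_tls_context` is a hypothetical helper.

```python
import ssl


def probe_tls_context(validate_certificate: bool = True) -> ssl.SSLContext:
    """Build a client TLS context with certificate validation on or off."""
    context = ssl.create_default_context()
    if not validate_certificate:
        # Order matters: hostname checking must be disabled before CERT_NONE is set.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

With validation on, a handshake against a self-signed certificate raises a verification error; with validation off, the connection proceeds but no longer proves the server's identity.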

What is the purpose of content matching in an uptime check?

Content matching allows you to verify that the response body contains a specific string. This is useful for application-level health checks, e.g., ensuring the response includes 'healthy' rather than just a 200 status code. It adds an extra layer of validation.

How many locations should I select for an uptime check?

The selected regions must cover at least three checker locations; by default all regions are used. For global coverage and redundancy, keep regions on several continents selected. The exam may recommend using multiple locations to get a comprehensive view of availability from different regions.
