This chapter covers observability on Google Cloud, focusing on the three pillars: logging, monitoring, and tracing. For the GCDL exam, this topic appears in roughly 10-15% of questions, typically scenario-based where you must choose the correct tool (Cloud Logging, Cloud Monitoring, Cloud Trace) for a given need. You will also need to understand how these services integrate and how to export logs to BigQuery for analysis. Mastery of this chapter ensures you can answer questions about incident response, performance optimization, and compliance auditing.
Observability in Google Cloud is like the flight data recorder (black box) and cockpit instruments on a modern aircraft. The black box continuously logs every parameter—altitude, speed, engine temperature, control surface positions—at high frequency. If something goes wrong, investigators replay the exact sequence of events to determine the root cause. Meanwhile, the cockpit instruments provide real-time monitoring: the pilot sees altitude, airspeed, engine RPM, and fuel levels on gauges, with alarms for critical thresholds (e.g., low fuel, engine fire). Tracing is like the flight path recorder that shows the entire journey from takeoff to landing, including each waypoint and the time between them. If a flight is delayed, the airline can trace the route to see where the delay occurred—e.g., holding pattern at JFK. In Google Cloud, logs are the black box data (Cloud Logging), metrics are the cockpit gauges (Cloud Monitoring), and traces are the flight path (Cloud Trace). Together, they give you full observability: you can detect anomalies in real time, drill into logs to debug, and trace requests across microservices to find performance bottlenecks. Without observability, you're flying blind—you know something is wrong only when customers complain, and you have no way to replay the incident.
What is Observability?
Observability is the ability to measure the internal state of a system by examining its outputs. In Google Cloud, observability is implemented through three integrated services: Cloud Logging for logs, Cloud Monitoring for metrics, and Cloud Trace for distributed tracing. Together, they form the Google Cloud Observability suite. The GCDL exam expects you to know the purpose of each service and when to use it.
Cloud Logging: The Central Log Repository
Cloud Logging is a fully managed service that collects, stores, and analyzes log data from Google Cloud services and custom applications. It is the central repository for all logs. Key concepts:
- Log entries: Each log entry has a timestamp, a severity (DEFAULT, DEBUG, INFO, NOTICE, WARNING, ERROR, CRITICAL, ALERT, EMERGENCY), and a payload (plain text, JSON, or protocol buffer).
- Log buckets: Logs are stored in log buckets. By default, each Google Cloud project has a _Default log bucket that retains logs for 30 days. You can create custom buckets with retention periods from 1 day to 3650 days (10 years) for compliance.
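- Log views: A log view exposes a subset of the logs in a bucket. Broad IAM roles such as roles/logging.viewer grant read access to logs, while roles/logging.viewAccessor can be granted on a specific view to limit a user to that subset.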
- Log sinks: You can export logs to destinations like Cloud Storage (for archival), BigQuery (for analysis), Pub/Sub (for streaming to third-party tools), or another project. Sinks are configured with a filter (e.g., severity>=ERROR).
- Log-based metrics: You can create metrics from log content (e.g., count of ERROR logs) and use them in Cloud Monitoring alerts.
- Exclusions: You can exclude certain logs from ingestion to reduce costs, using exclusion filters.
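To make the `severity>=ERROR` style of filter concrete, here is a minimal local sketch of how severities are ordered. It does not call Cloud Logging; the `at_least` helper and the sample entries are hypothetical, and only the numeric values follow the LogSeverity enum.

```python
# Numeric values of the LogSeverity enum; higher means more severe.
SEVERITY = {
    "DEFAULT": 0, "DEBUG": 100, "INFO": 200, "NOTICE": 300, "WARNING": 400,
    "ERROR": 500, "CRITICAL": 600, "ALERT": 700, "EMERGENCY": 800,
}


def at_least(entries, minimum):
    """Return the entries a filter like severity>=MINIMUM would keep.

    Entries without a severity are treated as DEFAULT.
    """
    floor = SEVERITY[minimum]
    return [e for e in entries if SEVERITY[e.get("severity", "DEFAULT")] >= floor]


# Hypothetical log entries, reduced to the two fields used here.
entries = [
    {"severity": "INFO", "message": "request served"},
    {"severity": "ERROR", "message": "db timeout"},
    {"severity": "CRITICAL", "message": "out of memory"},
    {"message": "no severity set"},
]

selected = at_least(entries, "ERROR")
print([e["message"] for e in selected])
```

The same ordering is what lets a sink or exclusion filter use comparison operators on severity rather than listing every level.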
Cloud Monitoring: Metrics and Alerting
Cloud Monitoring collects metrics (time-series data) from Google Cloud services, AWS, and custom applications via the Monitoring API. It provides dashboards, alerting, and uptime checks. Key components:
- Metrics: Both Google Cloud metrics (e.g., CPU utilization, network bytes) and custom metrics (via OpenTelemetry, the successor to OpenCensus, or the Monitoring API). Each metric has a metric kind (gauge, delta, cumulative) and is associated with a monitored resource type (e.g., gce_instance, k8s_container).
- Dashboards: Customizable visualizations of metrics. You can create charts (line, bar, heatmap) and group them into dashboards.
- Alerting policies: Conditions (e.g., metric > threshold for 5 minutes), notification channels (email, SMS, PagerDuty, Slack, webhook), and duration (how long condition must hold before firing). Default evaluation period is 60 seconds, but you can set from 30 seconds to 23 hours.
- Uptime checks: HTTP/HTTPS/TCP checks from multiple locations worldwide (e.g., 6 regions by default). You can configure check frequency (1, 5, 10, 15 minutes) and response time thresholds.
- Service Level Objectives (SLOs): Define SLOs for your services (e.g., 99.9% uptime over 30 days) and track burn rate. Cloud Monitoring provides SLO monitoring and alerting on error budget consumption.
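The SLO example above (99.9% over 30 days) can be turned into concrete numbers. The sketch below is plain arithmetic, not a Cloud Monitoring API call; the function names are illustrative, and burn rate is taken to mean the observed error ratio divided by the ratio the SLO allows.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of full unavailability the SLO tolerates over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def burn_rate(observed_error_ratio, slo_target):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / (1.0 - slo_target)


budget = error_budget_minutes(0.999, 30)
print(round(budget, 1))                    # about 43.2 minutes
print(round(burn_rate(0.005, 0.999), 1))   # about 5x the sustainable rate
```

A burn rate of 5 sustained over the whole window would exhaust a 30-day budget in roughly 6 days, which is why burn-rate alerts are useful early warnings.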
Cloud Trace: Distributed Tracing
Cloud Trace is a distributed tracing system that captures latency data from applications. It helps you identify performance bottlenecks across microservices. Key concepts:
- Spans: A trace is composed of spans, each representing a unit of work (e.g., an RPC call, a database query). Spans have start time, end time, and parent span ID.
- Trace context: Propagated via HTTP headers (X-Cloud-Trace-Context). Google Cloud services like App Engine and Cloud Run automatically instrument traces.
- Trace sampling: To reduce overhead, traces are sampled. Default sampling rate is 1 trace per 10 seconds per instance, but you can configure custom rates.
- Trace analysis: The Trace UI shows waterfall charts of spans, latency distributions, and performance dashboards. You can filter by service, operation, or latency.
- Integration: Cloud Trace integrates with Cloud Logging: you can view logs for a specific trace by clicking the trace ID in the Logs Explorer.
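To show what propagating trace context means in practice, here is a small sketch that parses an `X-Cloud-Trace-Context` header value. It assumes the documented `TRACE_ID/SPAN_ID;o=OPTIONS` shape, with a 32-character hexadecimal trace ID and the span ID and options both treated as optional. The `parse_trace_context` helper is hypothetical and not part of any Google library.

```python
import re

# TRACE_ID is 32 hex characters; SPAN_ID is a decimal number; o=1 marks the
# request as sampled. Span and options are treated as optional here.
_HEADER_RE = re.compile(r"^([0-9a-fA-F]{32})(?:/(\d+))?(?:;o=(\d))?$")


def parse_trace_context(header):
    """Split a header value into its parts, or return None if malformed."""
    match = _HEADER_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, options = match.groups()
    return {
        "trace_id": trace_id.lower(),
        "span_id": span_id,
        "sampled": options == "1",
    }


ctx = parse_trace_context("105445aa7843bc8bf206b12000100000/1;o=1")
print(ctx)
```

A service that forwards this header unchanged on its outgoing calls lets the downstream spans join the same trace.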
How They Work Together
Observability is not just about three separate tools; they are designed to interoperate. For example:
When an alert fires in Cloud Monitoring, you can click a link to view related logs in Cloud Logging.
In Cloud Logging, you can see trace IDs in log entries and jump to Cloud Trace to see the full trace.
Cloud Monitoring can ingest custom metrics derived from logs (log-based metrics) or from traces (e.g., trace latency metrics).
Exporting and Analyzing Logs
For compliance or advanced analytics, you export logs to BigQuery. Steps:
1. Create a log sink in Cloud Logging with BigQuery as the destination.
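2. Logs are written to a BigQuery dataset, one table per log. Table names are derived from the log name with unsupported characters replaced by underscores, e.g., the log cloudaudit.googleapis.com%2Factivity lands in a table named cloudaudit_googleapis_com_activity.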
3. Use SQL to query logs, e.g., SELECT * FROM logs WHERE severity='ERROR' AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR).
Cost Considerations
Cloud Logging: Ingestion costs $0.50 per GiB, with the first 50 GiB per project each month free. Logs kept for the default 30 days incur no separate storage charge; retention beyond 30 days in custom buckets costs about $0.01 per GiB per month.
Cloud Monitoring: Metrics from Google Cloud services are free. Custom metrics are billed by the volume of data ingested beyond a monthly free allotment, not per metric. Uptime checks and alerting have their own pricing terms.
Cloud Trace: The first 2.5 million spans ingested per month are free. Additional spans cost $0.20 per million.
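A quick way to sanity-check a logging bill is to apply the free allotment before the per-GiB price. The sketch below uses the Cloud Logging figures quoted above ($0.50 per GiB, 50 GiB free) as default parameters. Treat them as assumptions and check the current pricing page; the function name is illustrative.

```python
def logging_ingestion_cost(ingested_gib, free_gib=50.0, price_per_gib=0.50):
    """Estimated monthly ingestion charge in USD after the free allotment."""
    billable = max(0.0, ingested_gib - free_gib)
    return billable * price_per_gib


print(logging_ingestion_cost(30))    # under the free allotment
print(logging_ingestion_cost(250))   # 200 billable GiB
```

Running the same estimate before and after adding an exclusion filter gives a rough idea of what the filter saves.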
Exam Relevance
The GCDL exam tests your ability to choose the right observability tool for a given scenario. Common questions:
- 'Which service should you use to view application logs?' -> Cloud Logging
- 'Which service should you use to set up an alert when CPU exceeds 80%?' -> Cloud Monitoring
- 'Which service helps you identify which microservice is causing a slow response?' -> Cloud Trace
- 'How do you export logs to BigQuery?' -> Create a log sink with BigQuery destination.
Step-by-Step: Incident Response with Observability
1. Alert triggers in Cloud Monitoring when CPU exceeds 90% for 5 minutes.
2. Investigate metrics: Open Cloud Monitoring dashboard to see CPU spike correlated with increased request latency.
3. View logs: Click 'View logs' on the alert to see error logs from the application server.
4. Trace request: Identify a trace ID in the logs, open Cloud Trace to see the waterfall chart.
5. Find bottleneck: The trace shows a database query taking 8 seconds. The query is missing an index.
6. Fix and verify: Add index, redeploy. The alert clears. Use logs to confirm no further errors.
Best Practices
Use structured logging (JSON) to make logs searchable.
Set up log-based metrics for critical error patterns.
Configure alerting with appropriate duration to avoid flapping.
Use trace sampling judiciously—too high sampling increases cost and overhead.
Export audit logs to BigQuery for compliance and security analysis.
Common Pitfalls
Forgetting to set retention policies; default 30 days may not meet compliance.
Not using exclusion filters for noisy logs, leading to high costs.
Over-alerting: setting thresholds too low causes alert fatigue.
Not instrumenting custom applications for tracing—you miss distributed tracing.
Integration with Other Services
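Cloud Audit Logs: Admin Activity and System Event audit logs are always written to Cloud Logging. Data Access audit logs are disabled by default for most services (BigQuery is an exception) and must be enabled explicitly. Audit logs are essential for security and compliance.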
Cloud Error Reporting: Built on Cloud Logging, it aggregates and displays application errors.
Cloud Profiler: Continuous profiling of CPU and memory usage, complementary to tracing.
Cloud Debugger: Allowed you to inspect application state without stopping it (deprecated and shut down in 2023, but it may still be tested).
Summary
Observability in Google Cloud is a unified approach using logs, metrics, and traces. Cloud Logging stores and analyzes logs, Cloud Monitoring provides metrics and alerting, and Cloud Trace offers distributed tracing. They integrate seamlessly to enable rapid incident response and performance optimization. For the GCDL exam, focus on the use cases and how to export logs to BigQuery.
Create a Log Sink to BigQuery
In the Cloud Console, navigate to Logging > Log Router. Click 'Create Sink'. Provide a name (e.g., 'audit-logs-to-bq'). Choose 'BigQuery dataset' as destination. Select the dataset (create one if needed). In the 'Build inclusion filter' section, specify a filter like `severity>=WARNING` to export only warnings and above. Click 'Create Sink'. The sink will start exporting matching log entries to BigQuery in near real-time. Each log gets its own table, named after the log with unsupported characters replaced by underscores (e.g., the log `cloudaudit.googleapis.com%2Factivity` becomes the table `cloudaudit_googleapis_com_activity`). You can then query using SQL.
Create a Log-Based Metric
Go to Cloud Logging > Log-based Metrics. Click 'Create Metric'. Provide a name like 'error_count'. For the filter, use `severity=ERROR`. Choose 'Counter' as the metric type. Click 'Create'. This creates a new metric in Cloud Monitoring named `logging.googleapis.com/user/error_count`. You can now use this metric in alerting policies or dashboards. For example, alert when the count of errors exceeds 100 in 5 minutes.
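Conceptually, a counter log-based metric tallies matching entries per time window. The sketch below imitates that locally for the `severity=ERROR` filter and the 5-minute window mentioned above. It is an illustration only: the `count_per_window` helper and the sample entries are hypothetical, and no Google Cloud API is called.

```python
from collections import Counter
from datetime import datetime


def count_per_window(entries, severity="ERROR", window_seconds=300):
    """Count entries with the given severity, bucketed by window start (epoch seconds)."""
    counts = Counter()
    for entry in entries:
        if entry["severity"] != severity:
            continue
        epoch = int(datetime.fromisoformat(entry["timestamp"]).timestamp())
        counts[epoch - epoch % window_seconds] += 1
    return counts


# Hypothetical entries with ISO 8601 timestamps in UTC.
entries = [
    {"severity": "ERROR", "timestamp": "2024-01-01T00:01:00+00:00"},
    {"severity": "ERROR", "timestamp": "2024-01-01T00:04:00+00:00"},
    {"severity": "INFO", "timestamp": "2024-01-01T00:04:30+00:00"},
    {"severity": "ERROR", "timestamp": "2024-01-01T00:06:00+00:00"},
]

counts = count_per_window(entries)
print(sorted(counts.values(), reverse=True))
```

An alerting policy on the real metric would then compare each window's count with a threshold such as 100.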
Set Up an Alerting Policy
In Cloud Monitoring, go to Alerting > Create Policy. Select the metric (e.g., CPU utilization for a VM). Set a threshold condition of > 80% with a duration of 1 minute, so the condition must hold for a full minute before the alert fires; this helps avoid flapping. Configure notification channels (email, Slack, etc.). Name the policy 'High CPU Alert'. Click 'Save'. The alert fires when the condition is met and sends notifications. You can also configure auto-resolve after a period.
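The effect of the duration setting is easier to see with numbers. The following sketch simulates a threshold condition that must hold for a number of consecutive samples before firing. It is a simplified model, not how Cloud Monitoring is implemented; the function name and the sample values are made up for illustration.

```python
def first_firing_index(samples, threshold, duration_samples):
    """Index of the sample at which the condition has held long enough, or None."""
    streak = 0
    for i, value in enumerate(samples):
        # A single sample at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= duration_samples:
            return i
    return None


# Hypothetical CPU utilization, one sample per minute.
cpu = [0.70, 0.85, 0.60, 0.82, 0.88, 0.91, 0.95, 0.79]

print(first_firing_index(cpu, 0.80, 1))  # fires on the brief spike at index 1
print(first_firing_index(cpu, 0.80, 3))  # waits for three samples in a row
print(first_firing_index(cpu, 0.80, 5))  # never sustained long enough
```

With a duration of one sample the policy reacts to a momentary spike, while a longer duration ignores it and only fires on sustained load.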
View Traces in Cloud Trace
In Cloud Console, go to Trace > Trace List. You see a list of recent traces with latency. Click on a trace to see the waterfall view. Each span shows the service name, operation, start time, and duration. You can expand spans to see metadata like HTTP status code or custom tags. Use filters to find slow traces (e.g., latency > 2s). Click on a span to see logs associated with that span if integrated.
Correlate Logs and Traces
In Cloud Logging, enable 'Show traces' toggle. Log entries with trace IDs show a 'Trace' button. Click it to open Cloud Trace with the specific trace. Conversely, in Cloud Trace, you can click 'View logs' on a span to see logs for that span. This cross-linking enables efficient debugging: from an error log, you can trace the request path, or from a slow trace, you can examine logs for errors.
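For this cross-linking to work, a log entry has to carry its trace ID. On runtimes that parse JSON written to standard output (Cloud Run and GKE, for example), a structured line can set the `severity`, `message`, and `logging.googleapis.com/trace` fields. The sketch below only builds such a line; the helper name, project ID, and trace ID are placeholders.

```python
import json


def structured_log_line(message, severity, project_id, trace_id):
    """Build one JSON log line that links the entry to a trace."""
    entry = {
        "severity": severity,
        "message": message,
        # Full resource name of the trace this entry belongs to.
        "logging.googleapis.com/trace": f"projects/{project_id}/traces/{trace_id}",
    }
    return json.dumps(entry)


line = structured_log_line(
    "payment failed", "ERROR", "my-project", "105445aa7843bc8bf206b12000100000"
)
print(line)
```

Printing the line to stdout is enough on those runtimes; elsewhere, a logging client library or agent has to ship the entry.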
Scenario 1: E-commerce Platform Outage
A large e-commerce platform runs on Google Kubernetes Engine (GKE) with microservices. Customers report slow checkout. The operations team uses Cloud Monitoring dashboards to see a spike in latency for the payment service. They drill into Cloud Trace and see that the payment service is calling a legacy database with a query taking 12 seconds. The trace shows the database span. They check Cloud Logging for that trace ID and see database connection timeout errors. The root cause is a missing index. They add the index, and latency drops to 200ms. Without observability, they would have guessed and possibly restarted services, causing more downtime.
Scenario 2: Compliance Auditing for Healthcare
A healthcare company must retain logs for 7 years per HIPAA. They create a log sink in Cloud Logging with a filter for all audit logs (Admin Activity and Data Access) and export to a Cloud Storage bucket with retention policy set to 7 years. They also export to BigQuery for quarterly compliance reports. They use Cloud Monitoring to alert on any Data Access log that shows a user accessing protected health information outside of business hours. This setup satisfies auditors and provides security monitoring.
Scenario 3: Multi-cloud Observability
A company uses both Google Cloud and AWS. They deploy the Google Cloud Operations Agent on their AWS EC2 instances to collect metrics and logs. The agent sends data to Cloud Monitoring and Cloud Logging. They create a single dashboard that shows metrics from both clouds. They set up alerts for CPU usage across all instances. They also use Cloud Trace for their application that runs on both GKE and EKS. This unified observability reduces tool sprawl and provides a single pane of glass. One risk to watch for is misconfiguration: if the agent is not properly configured, logs may be lost. To save costs, they use exclusion filters to drop verbose debug logs from development instances.
What GCDL Tests
Domain: Apps | Objective 4.3: 'Describe the observability tools available in Google Cloud.' The exam tests your ability to identify which tool to use for a given requirement. Specific codes: 4.3.1 (Cloud Logging), 4.3.2 (Cloud Monitoring), 4.3.3 (Cloud Trace). You will NOT be asked to configure them in detail, but you must know their primary functions and key features.
Common Wrong Answers
Choosing Cloud Monitoring for logs: Candidates confuse 'monitoring' with 'logging'. Remember: Monitoring = metrics and alerts; Logging = logs.
Choosing Cloud Trace for real-time alerting: Trace is for latency analysis, not real-time alerts. Alerts use Cloud Monitoring.
Thinking logs are automatically stored forever: Default retention is 30 days. You must configure custom buckets or exports for longer retention.
Assuming Cloud Trace is always on: It requires instrumentation. Google-managed services like App Engine are auto-instrumented, but custom apps need the Trace SDK.
Confusing log sinks and log buckets: Sinks route logs to a destination, buckets store them. Multiple sinks can route to the same bucket.
Specific Numbers and Terms
Default log retention: 30 days
Maximum custom retention: 3650 days (10 years)
Cloud Monitoring default evaluation period: 60 seconds
Cloud Trace default sampling: 1 trace per 10 seconds per instance
Free tier: 50 GiB log ingestion per project per month, Google Cloud service metrics at no charge, 2.5 million trace spans per month
Uptime check locations: 6 default (can add more)
Log severity levels: DEFAULT, DEBUG, INFO, NOTICE, WARNING, ERROR, CRITICAL, ALERT, EMERGENCY
Edge Cases
What if you need to retain logs for more than 30 days but don't want to export? Use a custom log bucket with longer retention.
What if you need to send logs to a third-party SIEM? Use a log sink to Pub/Sub, then subscribe with your SIEM.
What if you have multiple projects? You can aggregate logs into a single project using sinks with a destination in a central project.
How to Eliminate Wrong Answers
If the scenario mentions 'real-time alert' or 'threshold', the answer is Cloud Monitoring.
If the scenario mentions 'debugging errors' or 'viewing event records', the answer is Cloud Logging.
If the scenario mentions 'performance bottleneck' or 'request latency across services', the answer is Cloud Trace.
If the scenario mentions 'compliance' or 'audit', think Cloud Logging with export to BigQuery or Cloud Storage.
If the scenario mentions 'custom metrics from application code', think Cloud Monitoring with custom metrics.
Cloud Logging is for logs; Cloud Monitoring is for metrics and alerts; Cloud Trace is for distributed tracing.
Default log retention is 30 days; use custom buckets for longer retention up to 10 years.
Log sinks export logs to BigQuery, Cloud Storage, or Pub/Sub.
Cloud Monitoring alerting policies require a condition (metric + threshold) and a duration (default 60 seconds).
Cloud Trace uses sampling (default 1 trace per 10 seconds per instance) to reduce overhead.
Google-managed services auto-instrument traces; custom apps need the Trace SDK.
Log-based metrics bridge logs and monitoring: create a metric from log content and use it in alerts.
Audit logs are written to Cloud Logging and are critical for compliance: Admin Activity logs are always on, while Data Access logs must be enabled for most services.
Uptime checks verify external accessibility from multiple locations; frequency can be 1, 5, 10, or 15 minutes.
Observability tools are integrated: from an alert, you can view logs; from logs, you can view traces.
These come up on the exam all the time. Here's how to tell them apart.
Cloud Logging
Stores and analyzes log data (events, errors, audit trails).
Logs have severity levels and timestamps.
Default retention 30 days, custom up to 10 years.
Export to BigQuery, Cloud Storage, Pub/Sub.
Use for debugging, compliance, and security analysis.
Cloud Monitoring
Collects and visualizes metrics (time-series data).
Provides alerting based on thresholds or anomaly detection.
Metrics retained for 6 weeks by default (custom up to 24 months).
Supports custom metrics from applications.
Use for real-time monitoring, dashboards, and alerting.
Mistake
Cloud Logging and Cloud Monitoring are the same service.
Correct
They are separate services. Cloud Logging handles log data (text/JSON events), while Cloud Monitoring handles time-series metrics and alerting. They integrate but are distinct.
Mistake
Logs are automatically retained indefinitely.
Correct
Default retention is 30 days. You must configure custom log buckets with longer retention (up to 3650 days) or export logs to Cloud Storage or BigQuery.
Mistake
Cloud Trace automatically traces all requests without any setup.
Correct
Google-managed services (App Engine, Cloud Run) auto-instrument, but for custom applications you must add the Trace SDK and propagate trace context via headers.
Mistake
Cloud Monitoring can only monitor Google Cloud resources.
Correct
Cloud Monitoring can monitor AWS resources (via the AWS connector) and on-premises resources (via the Operations Agent).
Mistake
Log-based metrics are stored in Cloud Logging, not Cloud Monitoring.
Correct
Log-based metrics are created in Cloud Logging but become custom metrics in Cloud Monitoring, where you can use them in dashboards and alerts.
The default retention period is 30 days for the `_Default` log bucket. You can create custom buckets with retention from 1 day to 3650 days (10 years). For compliance, you may need to export logs to Cloud Storage or BigQuery for longer retention.
Create a log sink in Cloud Logging with BigQuery as the destination. Specify the BigQuery dataset and an optional filter to export only certain logs. Logs are streamed to BigQuery tables automatically. You can then query the data using standard SQL.
Yes. Cloud Monitoring can monitor AWS resources via the AWS connector (which collects CloudWatch metrics) and on-premises resources using the Operations Agent or the Monitoring API. You can also send custom metrics from any application.
A log bucket is a storage container for logs within Cloud Logging. A log sink routes log entries to a destination (e.g., a log bucket, BigQuery, Cloud Storage, Pub/Sub). Multiple sinks can route to the same bucket, and you can exclude logs from ingestion using exclusion filters.
Log entries can include a trace ID (e.g., from the `X-Cloud-Trace-Context` header). In Cloud Logging, you can click the trace ID to open the corresponding trace in Cloud Trace. Conversely, in Cloud Trace, you can click 'View logs' on a span to see logs associated with that span. This integration enables end-to-end debugging.
Log-based metrics are custom metrics created from the content of log entries. For example, you can create a counter metric that increments each time a log with severity ERROR appears. These metrics become available in Cloud Monitoring for dashboards and alerting.
Use exclusion filters to drop verbose logs (e.g., DEBUG logs from development). Set appropriate retention periods (shorter for non-critical logs). Use log sinks to archive older logs to cheaper storage like Cloud Storage (Nearline or Coldline). Monitor your log usage in the Logging dashboard.