This chapter covers Amazon CloudWatch from the perspective of a developer preparing for the DVA-C02 exam. CloudWatch is a core monitoring service for AWS resources and applications, and it appears in roughly 10-15% of exam questions, often integrated with other services like Lambda, EC2, and Auto Scaling. You will learn about metrics, alarms, logs, dashboards, and how CloudWatch integrates with AWS services. The focus is on what developers need to know to monitor and troubleshoot applications effectively, including key defaults, retention periods, and common pitfalls.
Imagine a large office building with thousands of sensors: temperature, humidity, motion, power usage, and door access. Each sensor sends data to a central monitoring station. The station records every reading, but it keeps fine-grained data only for a limited time (e.g., 15 days for one-minute readings). For older data, it keeps only coarser summaries, first 5-minute and then 1-hour aggregates, to save space. The building manager can set alarms: if a room's temperature exceeds 80°F for 5 minutes, send an email to the HVAC team. The manager can also query the station to see trends, like average power usage over the last month. Additionally, the manager can log custom events, like 'scheduled maintenance' or 'fire drill', to correlate with sensor data. The key is that the sensor system is passive: it does not control anything directly; it only observes and alerts. If the manager wants to automatically turn on cooling when temperature is high, they need a separate automation system (like a thermostat) that reads the sensor data and acts. In AWS, CloudWatch is the sensor system: it collects metrics, logs, and events from AWS resources and applications, stores them with different retention policies, and triggers actions via alarms and events.
What is Amazon CloudWatch?
Amazon CloudWatch is a monitoring and observability service that collects data from AWS resources and applications. It provides metrics (time-series data), logs (textual event records), and events (state changes) in a centralized platform. Developers use CloudWatch to gain operational visibility, set alarms, create dashboards, and trigger automated responses.
Core Components
CloudWatch consists of several key components:
Metrics: A metric is a time-ordered set of data points representing a variable (e.g., CPU utilization, request count). Each metric belongs to a namespace (e.g., AWS/EC2, AWS/Lambda, or a custom namespace that you choose). Metrics have dimensions (key-value pairs) that identify the source (e.g., InstanceId, FunctionName). Metric data is retained according to the period of the data points:
- Data points with a period of less than 60 seconds: 3 hours
- Data points with a period of 60 seconds (1 minute): 15 days
- Data points with a period of 300 seconds (5 minutes): 63 days
- Data points with a period of 3600 seconds (1 hour): 455 days (15 months)
Alarms: Alarms watch a single metric (or math expression) and perform an action when a threshold is breached. Alarms have three states: OK, ALARM, and INSUFFICIENT_DATA. You can set actions like sending an SNS notification or triggering Auto Scaling.
Logs: CloudWatch Logs centralizes log files from EC2 instances, Lambda functions, and other sources. Logs are organized into log groups, each containing log streams (sequences of log events). You can filter logs, create metric filters to generate metrics from log patterns, and export logs to S3.
Events (CloudWatch Events/EventBridge): CloudWatch Events (now largely replaced by Amazon EventBridge) delivers a stream of system events describing changes in AWS resources. You can create rules to match events and route them to targets like Lambda, SNS, or SQS.
Dashboards: Customizable pages that display metrics and alarms in widgets. You can share dashboards across accounts.
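The retention tiers listed under Metrics can be expressed as a small lookup. The following is a minimal sketch, not an AWS API: the names `RETENTION_TIERS`, `retention_for_period`, and `finest_period_available` are hypothetical, and the sketch assumes that any period between two tiers falls into the lower tier.

```python
from datetime import timedelta

# Retention tiers from the list above: (minimum period in seconds, retention).
RETENTION_TIERS = [
    (3600, timedelta(days=455)),
    (300, timedelta(days=63)),
    (60, timedelta(days=15)),
    (1, timedelta(hours=3)),
]


def retention_for_period(period_seconds: int) -> timedelta:
    """Return how long data points with the given period remain available."""
    if period_seconds < 1:
        raise ValueError("period must be at least 1 second")
    for min_period, retention in RETENTION_TIERS:
        if period_seconds >= min_period:
            return retention
    raise AssertionError("unreachable: the last tier starts at 1 second")


def finest_period_available(age: timedelta) -> int:
    """Return the smallest period (in seconds) still retained for data of this age."""
    for min_period, retention in reversed(RETENTION_TIERS):
        if age <= retention:
            return min_period
    raise ValueError("data older than 455 days has expired")


if __name__ == "__main__":
    print(retention_for_period(60))                    # retention of 1-minute data
    print(finest_period_available(timedelta(days=30)))  # finest period for month-old data
```

The second function illustrates the practical consequence: a graph of month-old data can no longer show one-minute detail, only 5-minute or coarser aggregates.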
How CloudWatch Works Internally
AWS services publish their metrics to CloudWatch on your behalf; your own code publishes custom metrics by calling the PutMetricData API. Each data point includes a timestamp and a value, and optionally a unit and dimensions. CloudWatch stores the data in a distributed time-series database optimized for aggregation and retrieval. When you query metrics (e.g., via GetMetricStatistics), CloudWatch aggregates data based on the specified period and statistics (Average, Sum, Minimum, Maximum, SampleCount). Each statistic is computed over the data points that fall within each period.
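To make the aggregation step concrete, here is a minimal local sketch of grouping data points into periods and computing the five statistics. It is an illustration only and does not call CloudWatch; the function name `aggregate` and the sample CPU readings are hypothetical.

```python
from collections import defaultdict


def aggregate(datapoints, period_seconds):
    """Group (epoch_seconds, value) pairs into periods and compute statistics.

    Returns a dict keyed by the start of each period (epoch seconds).
    """
    buckets = defaultdict(list)
    for timestamp, value in datapoints:
        period_start = timestamp - (timestamp % period_seconds)
        buckets[period_start].append(value)

    result = {}
    for period_start in sorted(buckets):
        values = buckets[period_start]
        result[period_start] = {
            "SampleCount": len(values),
            "Sum": sum(values),
            "Average": sum(values) / len(values),
            "Minimum": min(values),
            "Maximum": max(values),
        }
    return result


if __name__ == "__main__":
    # Five one-minute CPU readings with a single short spike.
    cpu = [(0, 20), (60, 20), (120, 100), (180, 20), (240, 20)]
    print(aggregate(cpu, 300))  # one 5-minute period
    print(aggregate(cpu, 60))   # five 1-minute periods
```

With a 300-second period, the spike to 100 is averaged down to 36, so an alarm on Average > 80 would not see it; the Maximum statistic or a 60-second period would.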
For logs, agents (like the CloudWatch Logs agent or unified CloudWatch agent) or AWS services send log events to the CloudWatch Logs API (PutLogEvents). The service buffers logs and writes them to durable storage. You can then search logs using filter patterns (e.g., "ERROR") or create metric filters that count occurrences.
CloudWatch Alarms
An alarm evaluates a metric over a specified number of evaluation periods. For example, an alarm on CPU utilization with period=300 seconds (5 minutes) and evaluation periods=2 means the alarm checks the metric every 5 minutes and if the threshold is breached for two consecutive periods, it transitions to ALARM. The alarm has a TreatMissingData setting to handle missing data points. Common values: notBreaching, breaching, ignore, missing.
Alarm actions can be:
SNS topic (send notification)
Auto Scaling policy (scale out/in)
EC2 action (stop, terminate, reboot, recover)
Lambda function (invoked directly as an alarm action; supported since December 2023)
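The evaluation rule described before the list of actions (consecutive breaching periods plus a TreatMissingData setting) can be sketched as a small function. This is a deliberately simplified model, not CloudWatch's actual algorithm: it assumes that any unresolved missing point in the window yields INSUFFICIENT_DATA (for `missing`) or keeps the previous state (for `ignore`), whereas the real service looks at a wider evaluation range. The name `evaluate_alarm` and its parameters are hypothetical.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods,
                   treat_missing="missing", previous_state="OK"):
    """Return 'OK', 'ALARM', or 'INSUFFICIENT_DATA' for the most recent window.

    datapoints: one value per period, oldest first; None marks a missing point.
    The alarm condition in this sketch is "value > threshold".
    """
    window = list(datapoints)[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"

    verdicts = []
    for value in window:
        if value is not None:
            verdicts.append(value > threshold)
        elif treat_missing == "breaching":
            verdicts.append(True)    # missing counts as a breach
        elif treat_missing == "notBreaching":
            verdicts.append(False)   # missing counts as healthy
        else:
            verdicts.append(None)    # "missing" or "ignore": unresolved

    if None in verdicts:
        # Simplification: any unresolved point decides the outcome.
        return "INSUFFICIENT_DATA" if treat_missing == "missing" else previous_state
    return "ALARM" if all(verdicts) else "OK"


if __name__ == "__main__":
    # period=300, evaluation periods=2, threshold 80% CPU
    print(evaluate_alarm([70, 85, 90], threshold=80, evaluation_periods=2))
    print(evaluate_alarm([85, 70], threshold=80, evaluation_periods=2))
```

The sketch shows why a single breaching period does not fire an alarm configured with two evaluation periods, and why the TreatMissingData choice matters for metrics that report only sporadically.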
CloudWatch Logs Features
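Log Retention: By default, log groups never expire. You can set a retention policy in days, choosing from a fixed set of values: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653. To return to indefinite retention, choose "Never expire" in the console or delete the retention policy (DeleteRetentionPolicy).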
Metric Filters: You can create a metric filter that extracts patterns from log events and publishes a metric. For example, count occurrences of "ERROR" and create a metric ErrorCount.
Export to S3: You can export log data to S3 for archival or analysis by creating an export task. Log data can take up to 12 hours to become available for export, so exports are not a real-time mechanism.
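CloudWatch Logs Insights: A query engine for logs. You write queries in a purpose-built, pipe-delimited query language to filter, aggregate, and analyze logs. Queries time out after 60 minutes under the current service quotas (older study material cites 15 minutes) and can scan up to 1 TB per query.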
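Because retention accepts only a fixed set of values, a compliance requirement such as "keep logs at least 45 days" has to be rounded up to the next allowed value. The sketch below assumes the value set documented for the `PutRetentionPolicy` API; the names `ALLOWED_RETENTION_DAYS` and `nearest_allowed_retention` are hypothetical.

```python
import bisect

# Assumed set of values accepted for retentionInDays, in ascending order.
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def nearest_allowed_retention(min_days: int) -> int:
    """Return the smallest allowed retention that keeps logs at least min_days."""
    if min_days < 1:
        raise ValueError("min_days must be at least 1")
    index = bisect.bisect_left(ALLOWED_RETENTION_DAYS, min_days)
    if index == len(ALLOWED_RETENTION_DAYS):
        raise ValueError("longer than 3653 days: leave the log group without a retention policy")
    return ALLOWED_RETENTION_DAYS[index]


if __name__ == "__main__":
    print(nearest_allowed_retention(45))   # next allowed value at or above 45 days
    print(nearest_allowed_retention(800))  # next allowed value at or above 800 days
```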
CloudWatch Logs Agent vs. Unified CloudWatch Agent
The legacy CloudWatch Logs agent only sends logs. The unified CloudWatch agent can collect both metrics and logs. It supports collecting system metrics (CPU, memory, disk, network) and custom metrics from applications. You configure the unified agent with a JSON file, which the agent translates into TOML internally.
CloudWatch vs. AWS X-Ray
CloudWatch provides aggregated metrics and logs, while AWS X-Ray provides distributed tracing for requests as they travel through an application. X-Ray traces are sampled and show the path of a single request. CloudWatch and X-Ray are complementary: CloudWatch for overall health, X-Ray for debugging specific issues.
Integration with Lambda
Lambda automatically sends metrics to CloudWatch: Invocations, Errors, Duration, Throttles, ConcurrentExecutions, etc. Lambda also sends logs to CloudWatch Logs (if you use print statements or logging libraries). By default, Lambda creates a log group named /aws/lambda/<function-name> with log streams per container.
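As an illustration of the logging path, the following sketch shows a Python handler that writes a structured log line through the standard `logging` module. When run in Lambda, output like this ends up in the function's log group, provided the execution role allows writing to CloudWatch Logs. The handler name, the `order_id` field, and the log fields are hypothetical.

```python
import json
import logging
import time

# The root logger is used here; in Lambda its output is captured as log events.
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event, context):
    """Process a (hypothetical) order event and log one JSON line about it."""
    started = time.perf_counter()
    order_id = event.get("order_id", "unknown")

    # ... business logic would go here ...

    elapsed_ms = (time.perf_counter() - started) * 1000
    logger.info(json.dumps({
        "level": "INFO",
        "message": "order processed",
        "order_id": order_id,
        "elapsed_ms": round(elapsed_ms, 2),
    }))
    return {"statusCode": 200, "order_id": order_id}


if __name__ == "__main__":
    logging.basicConfig()  # local runs only: print log records to stderr
    print(handler({"order_id": "A1"}, None))
```

Logging JSON rather than free text is a design choice: it makes fields such as `order_id` easier to filter on later with metric filters or Logs Insights.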
Integration with EC2
EC2 sends basic metrics (CPU utilization, disk I/O, network) at 5-minute intervals. For 1-minute intervals, you must enable detailed monitoring (costs extra). Memory utilization and disk space usage are not visible from outside the instance, so you must install the CloudWatch agent to publish them.
Custom Metrics
You can publish custom metrics using the AWS CLI, SDK, or API. Custom metrics can be stored at high resolution (1-second granularity) or standard resolution (1-minute granularity). High-resolution metrics cost more. You can specify storage resolution in PutMetricData: StorageResolution = 1 (high) or 60 (standard).
CloudWatch Dashboard
Dashboards are created in the CloudWatch console. You can add metric widgets, alarm widgets, and text widgets. Dashboards can be shared through a shareable link with people who do not have direct access to your AWS account. You can also create dashboards programmatically using the PutDashboard API.
CloudWatch Events / EventBridge
CloudWatch Events (now part of EventBridge) delivers a stream of real-time events from AWS services. You can create rules that match event patterns and route to targets. For example, an EC2 instance state change to 'stopped' can trigger a Lambda function. EventBridge is the newer version with additional features like schema registry and event buses.
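The EC2 example can be written as an event pattern. The pattern below uses the `source`, `detail-type`, and `detail.state` fields of EC2 state-change events; the matcher is a simplified local sketch that only supports exact-value matching (real EventBridge patterns also support prefixes, numeric ranges, and more). The function name `matches` and the sample instance ID are hypothetical.

```python
# Event pattern for "an EC2 instance entered the stopped state".
EC2_STOPPED_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}


def matches(pattern, event):
    """Return True if every field in the pattern matches the event.

    A list in the pattern means "the event value must be one of these";
    a nested dict is matched recursively against the nested event object.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


if __name__ == "__main__":
    sample_event = {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
    }
    print(matches(EC2_STOPPED_PATTERN, sample_event))
```

Fields not named in the pattern (such as `instance-id` here) are ignored, which is why one rule can cover every instance in the account.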
Pricing
CloudWatch pricing is based on:
Metrics: number of custom metrics, number of API requests
Logs: data ingested, data stored, data archived (S3 export)
Dashboards: per dashboard per month (first 3 free)
Alarms: per alarm per month
Exam-Relevant Defaults and Limits
Default metric retention: as above (3 hours for <60s, 15 days for 1 min, 63 days for 5 min, 455 days for 1 hour)
Log retention: never expire by default
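Alarm evaluation periods: minimum 1; the evaluation window (period multiplied by evaluation periods) cannot exceed one day for periods shorter than one hour, or seven days for longer periods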
Maximum number of dimensions per metric: 30
Maximum number of metrics per alarm: 1 (unless using math expressions)
CloudWatch Logs Insights: query timeout 60 minutes under current quotas (older material cites 15 minutes), max scan 1 TB
Log group name: can contain alphanumeric, hyphen, underscore, forward slash, period
Log stream name: same as log group
Log event size: maximum 256 KB (uncompressed)
Log batch size: maximum 1 MB (total), 10,000 events
Metric filter: maximum 100 filters per log group
CloudWatch dashboard: maximum 100 widgets per dashboard
Common Exam Scenarios
Setting an alarm on a custom metric published by an application
Creating a metric filter to count errors in logs and triggering an alarm
Using CloudWatch Logs Insights to troubleshoot a slow Lambda function
Understanding that CloudWatch does not automatically aggregate across dimensions; you need to use math expressions or separate metrics
Knowing that CloudWatch alarms can trigger Auto Scaling policies
Recognizing that CloudWatch Events can trigger Lambda functions for automated remediation
Publish a Custom Metric
First, you publish a custom metric using the AWS CLI or SDK. For example, using the AWS CLI: `aws cloudwatch put-metric-data --namespace "MyApp" --metric-name "PageViewCount" --value 1 --timestamp 2025-03-15T12:00:00Z`. This sends a data point to CloudWatch with a namespace, metric name, value, and optional dimensions. If you omit the timestamp, CloudWatch uses the time the data point is received. The data is stored in the CloudWatch time-series database. By default, the storage resolution is 60 seconds (standard). To publish at high resolution (1 second), add `--storage-resolution 1`. A new metric is not visible instantly: it can take up to two minutes before statistics can be retrieved and up to 15 minutes before the metric appears in `list-metrics`.
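The same data point can be described as an SDK payload. The sketch below only builds the dictionary and does not call AWS; with boto3 installed, a dictionary of this shape could be passed as one element of `MetricData` in `put_metric_data(Namespace="MyApp", MetricData=[...])`. The helper name `build_metric_datum` and the `Page` dimension are hypothetical.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None,
                       high_resolution=False, timestamp=None):
    """Build one PutMetricData entry as a plain dictionary."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        # An explicit UTC timestamp; CloudWatch would use the receive time if omitted.
        "Timestamp": timestamp or datetime.now(timezone.utc),
        # 1 = high resolution (1 second), 60 = standard resolution (1 minute).
        "StorageResolution": 1 if high_resolution else 60,
    }
    if dimensions:
        if len(dimensions) > 30:
            raise ValueError("a metric can have at most 30 dimensions")
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in dimensions.items()
        ]
    return datum


if __name__ == "__main__":
    datum = build_metric_datum(
        "PageViewCount", 1,
        dimensions={"Page": "home"},
        timestamp=datetime(2025, 3, 15, 12, 0, tzinfo=timezone.utc),
    )
    print(datum)
```

Note that each distinct combination of dimension values creates a separate metric, so adding a high-cardinality dimension (such as a user ID) multiplies the number of custom metrics you pay for.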
Create a Metric Filter in Logs
To extract a metric from logs, navigate to CloudWatch Logs, select a log group, and create a metric filter. Specify a filter pattern, e.g., `"ERROR"` to match any log event containing 'ERROR'. Then define the metric name and namespace. CloudWatch will scan incoming log events and increment a count metric each time the pattern matches. The metric is published with a timestamp equal to the log event timestamp. The filter can also extract numeric values from logs using pattern syntax like `[..., severity, ...]` where severity is a number. This allows you to create metrics like `ErrorCount` or `Latency` from log data.
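What a simple term filter computes can be imitated locally. The sketch below counts log events containing a term and groups the counts by minute, roughly what an `ErrorCount` metric with a metric value of 1 would record. It is an approximation for illustration only; the function name `count_matches` and the sample events are hypothetical, and it assumes the term match is case-sensitive.

```python
from collections import Counter


def count_matches(log_events, term="ERROR"):
    """Count events whose message contains the term, per minute.

    log_events: iterable of (epoch_milliseconds, message) pairs.
    Returns a Counter keyed by the start of each minute in epoch seconds.
    """
    counts = Counter()
    for timestamp_ms, message in log_events:
        if term in message:  # case-sensitive: "error" does not match "ERROR"
            minute_start = (timestamp_ms // 60_000) * 60
            counts[minute_start] += 1
    return counts


if __name__ == "__main__":
    events = [
        (0, "ERROR database timeout"),
        (30_000, "error in lowercase is not counted"),
        (59_000, "request failed: ERROR 500"),
        (61_000, "ERROR again in the next minute"),
    ]
    print(count_matches(events))
```

Minutes with no matching events produce no data point at all rather than a zero, which is worth remembering when choosing how an alarm on this metric treats missing data.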
Set Up a CloudWatch Alarm
In the CloudWatch console, create an alarm. Choose a metric (e.g., `ErrorCount` from the metric filter). Define the condition: e.g., `ErrorCount > 5` for 2 consecutive periods of 5 minutes. Set the evaluation period to 2 (meaning the threshold must be breached for 2 data points). Configure actions: e.g., send notification to an SNS topic. The alarm will evaluate every period (5 minutes). When the metric exceeds 5 for two consecutive readings, the alarm state changes to ALARM and triggers the action. If the metric drops below the threshold, it returns to OK. If no data is received, it goes to INSUFFICIENT_DATA (depending on TreatMissingData setting).
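The console settings above correspond to parameters of the `PutMetricAlarm` API. The following sketch builds those parameters as a dictionary without calling AWS; with boto3, a dictionary like this could be unpacked into `put_metric_alarm(**params)`. The helper name, the alarm name, and the SNS topic ARN are hypothetical, and the choice of `Sum` and `notBreaching` is an assumption suited to a count metric that emits nothing when no errors occur.

```python
def build_alarm_params(alarm_name, namespace, metric_name, threshold,
                       period=300, evaluation_periods=2, sns_topic_arn=None):
    """Build PutMetricAlarm parameters for 'metric > threshold' over N periods."""
    params = {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",                 # total errors per period
        "Period": period,                   # seconds per data point
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # A filter-based count metric has gaps when nothing matches;
        # treating gaps as healthy avoids INSUFFICIENT_DATA noise.
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


if __name__ == "__main__":
    params = build_alarm_params(
        "HighErrorCount", "MyApp", "ErrorCount", threshold=5,
        sns_topic_arn="arn:aws:sns:us-east-1:123456789012:on-call-alerts",
    )
    for key, value in params.items():
        print(f"{key}: {value}")
```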
Configure CloudWatch Agent for EC2
To collect memory and disk metrics from EC2, install the unified CloudWatch agent. Download the agent package from Amazon S3 and run the installer. Create a configuration file (e.g., `/opt/aws/amazon-cloudwatch-agent/bin/config.json`) specifying metrics to collect: `{"metrics":{"metrics_collected":{"mem":{"measurement":["mem_used_percent"]},"disk":{"measurement":["disk_used_percent"],"resources":["*"]}}}}`. Start the agent. The agent sends metrics to CloudWatch every 60 seconds. You can then create alarms on memory usage. The agent also collects logs if configured.
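The one-line JSON above is easier to review when generated and pretty-printed. This sketch builds the same `mem` and `disk` sections with Python's `json` module and adds the `metrics_collection_interval` setting to make the 60-second interval explicit; the function name `build_agent_config` is hypothetical, and where you write the resulting file depends on your installation.

```python
import json


def build_agent_config(interval_seconds=60):
    """Return a unified CloudWatch agent config collecting memory and disk usage."""
    return {
        "metrics": {
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                },
                "disk": {
                    "measurement": ["disk_used_percent"],
                    "resources": ["*"],  # all mounted file systems
                    "metrics_collection_interval": interval_seconds,
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the configuration; redirect the output to your config file as needed.
    print(json.dumps(build_agent_config(), indent=2))
```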
Query Logs with CloudWatch Logs Insights
To analyze logs, open CloudWatch Logs Insights in the console. Select a log group and write a query, e.g., `fields @timestamp, @message | filter @message like /ERROR/ | stats count() by bin(5m)`. This retrieves log events, filters for ERROR, and counts per 5-minute interval. Run the query. The results are displayed as a table or time series. You can export results to CSV. Insights queries time out after 60 minutes under current quotas (older material cites 15 minutes) and can scan up to 1 TB of data. This is useful for troubleshooting specific errors or performance issues.
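To show what the example query computes, here is a local sketch of the same filter-then-bin logic applied to in-memory events. It does not call Logs Insights; the function name `errors_per_bin` and the sample events are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timezone


def errors_per_bin(events, bin_minutes=5):
    """Count events matching /ERROR/ per time bin.

    events: iterable of (timezone-aware datetime, message) pairs.
    Returns a dict mapping each bin's start time (UTC) to its count.
    """
    pattern = re.compile(r"ERROR")          # like: filter @message like /ERROR/
    bin_seconds = bin_minutes * 60
    counts = Counter()
    for timestamp, message in events:
        if pattern.search(message):
            epoch = int(timestamp.timestamp())
            bin_start = epoch - (epoch % bin_seconds)   # like: bin(5m)
            counts[datetime.fromtimestamp(bin_start, tz=timezone.utc)] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    def at(minute):
        return datetime(2025, 3, 15, 12, minute, tzinfo=timezone.utc)

    sample = [
        (at(1), "ERROR payment declined"),
        (at(3), "INFO checkout ok"),
        (at(4), "ERROR payment declined"),
        (at(7), "ERROR inventory lookup failed"),
    ]
    for bin_start, count in errors_per_bin(sample).items():
        print(bin_start.isoformat(), count)
```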
In a production e-commerce platform, CloudWatch is used to monitor hundreds of EC2 instances running a microservices architecture. The operations team sets alarms on CPU utilization (threshold > 80% for 5 minutes) to trigger Auto Scaling policies that add instances during traffic spikes. Custom metrics are published from the application to track business metrics like 'CheckoutSuccessRate' and 'PaymentErrorCount'. These metrics are visualized on a CloudWatch dashboard shared with the business team. Logs from all services are sent to CloudWatch Logs with a retention of 30 days. Metric filters count the number of HTTP 500 errors and trigger an alarm that sends an SNS notification to the on-call engineer. The team uses CloudWatch Logs Insights to investigate specific incidents by querying logs for correlation IDs.
Another scenario: a serverless application using Lambda and API Gateway. CloudWatch collects Lambda metrics (Duration, Errors, Throttles) and API Gateway metrics (4XX, 5XX, Latency). The developer sets a CloudWatch alarm on Lambda Errors > 0 for 1 minute to trigger a Lambda function that sends a Slack message. They also create a dashboard showing the error rate and average duration. Logs from Lambda are automatically sent to CloudWatch Logs; the developer sets a metric filter for 'Exception' to count exceptions. When the exception count spikes, an alarm triggers a Lambda function that runs a diagnostic script.
Common misconfigurations: forgetting to set log retention, which lets logs accumulate and incur high storage costs; setting alarm evaluation periods too low (e.g., 1 period), which causes the alarm to flap between states; and not enabling detailed monitoring on EC2 instances, which leaves you with 5-minute granularity that may miss short spikes. On the exam, you might see a scenario where a developer publishes a custom metric but forgets to include a timestamp, so CloudWatch uses the time the data point was received. Or they publish to the wrong namespace, so alarms and dashboards that reference the expected namespace find no data.
The DVA-C02 exam tests CloudWatch under Domain 4: Troubleshooting and Optimization (Task Statement 4.2: Instrument code for observability). Common exam topics include:
- Metric retention periods: Know the exact values: 3 hours for <60s granularity, 15 days for 1 min, 63 days for 5 min, 455 days for 1 hour. The exam often asks what happens to older data (it is aggregated).
- Log retention: Default is never expire. You must explicitly set retention. Exam questions may present a scenario where logs are accumulating costs and ask how to reduce storage.
- Alarm evaluation: Understand that an alarm evaluates over evaluation periods and period. For example, period=300 seconds, evaluation periods=2 means the alarm checks every 5 minutes and needs two consecutive breaches to alarm.
- CloudWatch Events vs. EventBridge: Know that EventBridge is the newer service with additional features like schema registry and event buses. CloudWatch Events is legacy but still appears on the exam.
- Custom metrics: You can publish custom metrics using PutMetricData. Know the difference between standard (60s) and high resolution (1s). High resolution costs more.
- Metric filters: Used to extract metrics from logs. The exam may ask how to count error occurrences in logs.
- CloudWatch Logs Insights: Used for ad-hoc log analysis. Query timeout is 60 minutes under current quotas (older material cites 15 minutes), max scan 1 TB.
- Integration with Auto Scaling: Alarms can trigger Auto Scaling policies.
- Common wrong answers: Candidates often confuse CloudWatch with CloudTrail (audit logs) or X-Ray (tracing). Another trap is assuming CloudWatch cannot stop an EC2 instance: it can, through an EC2 alarm action, provided the alarm is defined on a per-instance metric. Similarly, some assume CloudWatch Logs cannot stream data in near real time: it can, using subscription filters that deliver to Lambda or Kinesis.
- Exam edge cases: What happens when an alarm has insufficient data? The state becomes INSUFFICIENT_DATA. What if a metric stops reporting? You cannot delete a metric (it simply expires), but alarms on it still exist and, with the default missing-data treatment, show INSUFFICIENT_DATA. How to send logs to multiple destinations? Use subscription filters.
- Eliminating wrong answers: If a question asks about monitoring application performance, and one option mentions CloudTrail, eliminate it (CloudTrail is for API activity). If the question asks about tracing a single request across services, the answer is X-Ray, not CloudWatch.
CloudWatch Metrics retention: <60s period = 3 hours, 1 min = 15 days, 5 min = 63 days, 1 hour = 455 days.
CloudWatch Logs default retention is never expire; always set a retention policy to control costs.
Alarms evaluate over `evaluation periods` (number of consecutive periods) and `period` (duration of each data point).
Custom metrics can be published with `PutMetricData`; use `StorageResolution=1` for high resolution (1 second).
Metric filters in CloudWatch Logs allow you to create metrics from log patterns (e.g., count errors).
CloudWatch Logs Insights has a 60-minute query timeout under current quotas (older material cites 15 minutes) and can scan up to 1 TB.
CloudWatch does not automatically aggregate metrics across dimensions; use math expressions or separate metrics.
CloudWatch Events (EventBridge) can trigger Lambda, SNS, SQS, and other targets based on AWS resource state changes.
To collect memory metrics from EC2, install the unified CloudWatch agent.
Alarms can trigger Auto Scaling policies, SNS notifications, EC2 actions (stop, terminate, reboot, recover), or (since December 2023) Lambda functions.
These come up on the exam all the time. Here's how to tell them apart.
CloudWatch Logs
Stores text log events with timestamps
Retention can be set from 1 day to 10 years or never expire
Supports metric filters to extract metrics from logs
Can be exported to S3 for archival
Supports Insights queries for analysis
CloudWatch Metrics
Stores numeric time-series data points
Retention depends on data point period (3h to 15 months)
Not searched with filter patterns; retrieved with GetMetricData or GetMetricStatistics and combined with math expressions
Can be aggregated using statistics (Average, Sum, etc.)
Used for alarms and dashboards
Mistake
CloudWatch automatically collects memory metrics from EC2 instances.
Correct
CloudWatch only collects basic metrics (CPU, disk, network) by default. Memory metrics require installing the CloudWatch agent or publishing custom metrics.
Mistake
CloudWatch Logs automatically deletes old log data after a default retention period.
Correct
There is no default expiry: a new log group is set to never expire. Unless you explicitly set a retention policy, logs are stored forever, incurring costs.
Mistake
CloudWatch alarms can reach a Lambda function only through SNS.
Correct
Alarms can trigger SNS topics, Auto Scaling policies, or EC2 actions, and since December 2023 they can also invoke a Lambda function directly as an alarm action. The older pattern, in which an SNS topic is the alarm action and the Lambda function subscribes to that topic, still works and still appears in exam questions.
Mistake
CloudWatch Metrics are aggregated across all dimensions automatically.
Correct
Metrics are stored per unique set of dimensions. To get an aggregate across dimensions (e.g., all EC2 instances), you must use math expressions or publish a separate metric without dimensions.
Mistake
CloudWatch Logs Insights can query live streaming logs.
Correct
CloudWatch Logs Insights queries only the data that has been ingested into CloudWatch Logs. It does not query real-time streaming data; you need to use subscription filters for streaming.
CloudWatch retains metric data based on the period of the data points. Data points with a period less than 60 seconds are retained for 3 hours. Data points with a period of 60 seconds (1 minute) are retained for 15 days. Data points with a period of 300 seconds (5 minutes) are retained for 63 days. Data points with a period of 3600 seconds (1 hour) are retained for 455 days (15 months). This is a common exam question.
You can set a retention policy on a log group via the AWS CLI, SDK, or console. Use the `put-retention-policy` CLI command: `aws logs put-retention-policy --log-group-name /aws/lambda/my-function --retention-in-days 30`. This sets the log group to expire log events older than 30 days. You can also set retention during log group creation. The default is never expire, which can lead to high costs.
Yes, although older study material says otherwise. Since December 2023, a CloudWatch alarm can invoke a Lambda function directly as an alarm action. The older patterns still work and still appear in exam questions: send the alarm notification to an SNS topic and subscribe the Lambda function to that topic, or use CloudWatch Events (EventBridge) to match alarm state changes and route them to Lambda. The alarm state change is an event that can be captured by EventBridge.
CloudWatch is for monitoring performance and health of AWS resources and applications (metrics, logs, alarms). CloudTrail is for auditing API activity (who did what, when, from where). They serve different purposes. The exam often tests this distinction: CloudTrail logs API calls, CloudWatch logs application logs and metrics.
In the CloudWatch console, select a log group, choose 'Metric filters' tab, and click 'Create metric filter'. Specify a filter pattern (e.g., `"ERROR"` for exact match, or `[..., severity, ...]` for numeric extraction). Then define the metric namespace and metric name. The filter will start publishing a metric each time a log event matches the pattern. You can also use the CLI: `aws logs put-metric-filter --log-group-name my-log-group --filter-name ErrorFilter --filter-pattern "ERROR" --metric-transformations metricName=ErrorCount,metricNamespace=MyApp,metricValue=1`.
The maximum size of a log event (the message plus metadata) is 256 KB, uncompressed. If a single log event exceeds this, CloudWatch Logs will reject it. This is important for applications that log large payloads. You should truncate or split large log messages.
You can use a subscription filter to stream log events in near real time to a Kinesis data stream, an Amazon Data Firehose delivery stream, a Lambda function, or Amazon OpenSearch Service (formerly Amazon Elasticsearch Service). Create a subscription filter using the CLI: `aws logs put-subscription-filter --log-group-name my-log-group --filter-name myFilter --filter-pattern "" --destination-arn arn:aws:kinesis:region:account-id:stream/my-stream --role-arn arn:aws:iam::account-id:role/my-role`. The filter pattern can be empty to send all logs.
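When the destination is a Lambda function, the log events arrive base64-encoded and gzip-compressed under `event["awslogs"]["data"]`. The sketch below decodes such a payload and builds a fake one for local testing; the function names and the sample log group and stream names are hypothetical.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload that CloudWatch Logs delivers to a subscribed Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    """Return the messages of all delivered log events."""
    payload = decode_subscription_event(event)
    return [log_event["message"] for log_event in payload["logEvents"]]


def make_fake_event(messages):
    """Build a payload shaped like a real delivery, for local testing only."""
    payload = {
        "messageType": "DATA_MESSAGE",
        "logGroup": "my-log-group",
        "logStream": "my-log-stream",
        "logEvents": [
            {"id": str(index), "timestamp": 1742040000000 + index, "message": text}
            for index, text in enumerate(messages)
        ],
    }
    raw = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(raw).decode("ascii")}}


if __name__ == "__main__":
    print(handler(make_fake_event(["ERROR one", "ERROR two"]), None))
```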