CloudWatch Metrics and Alarms

This chapter covers Amazon CloudWatch Metrics and Alarms, a core monitoring service for AWS. For the SOA-C02 exam, approximately 15-20% of questions touch on monitoring, with a significant portion focusing on CloudWatch metrics, alarms, and their integration with other services. Understanding how to collect, visualize, and act on metrics is essential for a SysOps Administrator. This chapter will equip you with the precise knowledge needed to configure, troubleshoot, and optimize CloudWatch Metrics and Alarms, directly addressing exam objectives 1.1 (Monitor and report on performance and availability) and 1.2 (Monitor and report on cost and usage).

25 min read
Intermediate
Updated May 31, 2026

CloudWatch Metrics: A Bank's Surveillance System

Imagine a bank with thousands of customers, ATMs, and online transactions. The bank installs a surveillance system (CloudWatch) that monitors every activity: ATM cash levels, transaction counts, server CPU loads, and even the number of people in line. Each sensor (metric) continuously reports data points (e.g., 'ATM 5 has $10,000 cash at 10:00 AM'). The surveillance system stores these data points in a ledger (metrics repository) with timestamps, allowing the bank to view trends (graphs) and set rules (alarms) such as 'If ATM 5 cash drops below $5,000 for 2 consecutive minutes, send a text to the manager.'

The manager can also set up a dashboard (CloudWatch Dashboard) to see the most critical metrics at a glance. If the alarm triggers, the bank can automatically dispatch a cash refill (Auto Scaling or Lambda) or simply notify the team (SNS). Just as the surveillance system can monitor different floors (namespaces) and different types of sensors (dimensions), CloudWatch organizes metrics by namespace (e.g., AWS/EC2) and dimensions (e.g., InstanceId). The system also allows custom sensors (custom metrics) for things the bank cares about, like 'number of loan applications per hour.'

Under the hood, each data point carries a timestamp, value, unit, and optional dimensions. CloudWatch aggregates these into statistics (e.g., average, sum) over specified periods (e.g., 5 minutes). Alarms evaluate these statistics against a threshold and transition between states: OK, ALARM, or INSUFFICIENT_DATA. Because each alarm has a single threshold, the bank creates one alarm per level (e.g., a warning alarm at $10,000 and a critical alarm at $5,000) and can attach actions to each state. This mechanistic analogy mirrors CloudWatch's structure: metrics are time-ordered data points, alarms are state machines that evaluate expressions, and actions are triggered by state changes.

How It Actually Works

What Are CloudWatch Metrics and Alarms?

Amazon CloudWatch is a monitoring service for AWS resources and applications. It collects raw data in the form of metrics — time-ordered sets of data points. Each metric represents a variable to monitor, such as CPU utilization, network throughput, or disk I/O. Metrics are stored for up to 15 months, allowing historical analysis. An alarm watches a single metric or an expression based on multiple metrics over a specified time period, and performs one or more actions when the metric crosses a threshold. Alarms can trigger actions such as sending a notification via Amazon SNS, performing an EC2 Auto Scaling action, or invoking a Lambda function.

How CloudWatch Metrics Work

Metrics are organized into namespaces — containers for metrics from different sources. AWS services publish metrics to predefined namespaces like AWS/EC2, AWS/Lambda, AWS/RDS. You can also publish custom metrics to namespaces like Custom/MyApp. Each metric is uniquely identified by its namespace, metric name, and zero or more dimensions. Dimensions are key-value pairs that further refine the metric. For example, the CPUUtilization metric in the AWS/EC2 namespace can have a dimension InstanceId to isolate data for a specific EC2 instance.
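The identity rule above (namespace, metric name, and dimensions together) can be illustrated with a small sketch. The helper `metric_key` and the second instance ID are hypothetical and exist only for this example; nothing here calls AWS.

```python
from typing import Dict, Tuple


def metric_key(namespace: str, name: str, dimensions: Dict[str, str]) -> Tuple:
    """Return a hashable identity: namespace + metric name + dimension set."""
    # Sort the dimensions so the key does not depend on the order supplied.
    return (namespace, name, tuple(sorted(dimensions.items())))


# Same namespace and metric name, different dimensions: three separate metrics.
a = metric_key("AWS/EC2", "CPUUtilization", {"InstanceId": "i-1234567890abcdef0"})
b = metric_key("AWS/EC2", "CPUUtilization", {"InstanceId": "i-0fedcba0987654321"})
c = metric_key("AWS/EC2", "CPUUtilization", {})

print(len({a, b, c}))  # 3
```

The practical consequence is that an alarm or query must name the same dimension set the data was published with; otherwise it points at a different (possibly empty) metric.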

When a metric is emitted, it includes:

- Timestamp: The time when the data point was collected (default: current time).
- Value: The numeric value of the metric.
- Unit: The unit of measurement (e.g., Percent, Count, Bytes).
- Dimensions: Optional key-value pairs.
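As a rough illustration of those four fields, the sketch below assembles one data point as a plain dictionary shaped like an entry in the `MetricData` list that `PutMetricData` accepts. The helper name `build_datum` and the `Service` dimension are assumptions made for the example, and nothing is sent to CloudWatch.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def build_datum(name: str, value: float, unit: str = "None",
                dimensions: Optional[Dict[str, str]] = None,
                timestamp: Optional[datetime] = None) -> dict:
    """Assemble one metric data point as a plain dict (hypothetical helper)."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        # Fall back to the current UTC time when no timestamp is supplied.
        "Timestamp": timestamp or datetime.now(timezone.utc),
    }
    if dimensions:
        # Dimensions are expressed as a list of Name/Value pairs.
        datum["Dimensions"] = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    return datum


datum = build_datum("Requests", 10, unit="Count", dimensions={"Service": "checkout"})
print(sorted(datum))  # ['Dimensions', 'MetricName', 'Timestamp', 'Unit', 'Value']
```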

CloudWatch stores metrics as a series of data points. You can retrieve statistics such as Average, Sum, Minimum, Maximum, SampleCount, and percentile values (e.g., p50, p90, p99). Statistics are computed over a specified period, the length of time over which data is aggregated. A period can be as short as 1 second or as long as 1 day (86400 seconds); valid values are 1, 5, 10, 30, or any multiple of 60 seconds, and sub-minute periods apply only to high-resolution metrics. For example, with a period of 300 seconds (5 minutes), CloudWatch aggregates all data points within each 5-minute interval into one data point per statistic (average, sum, and so on).
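To make the period-based aggregation concrete, here is a minimal sketch that groups raw data points into fixed-length buckets and computes the basic statistics for each bucket. The function `aggregate` and the sample values are illustrative assumptions; this approximates the idea, not CloudWatch's internal implementation.

```python
from collections import defaultdict


def aggregate(points, period):
    """Group (epoch_seconds, value) pairs into period-aligned buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its period.
        buckets[ts - ts % period].append(value)
    return {
        start: {
            "SampleCount": len(vals),
            "Sum": sum(vals),
            "Average": sum(vals) / len(vals),
            "Minimum": min(vals),
            "Maximum": max(vals),
        }
        for start, vals in sorted(buckets.items())
    }


# Five 1-minute data points aggregated with a 300-second period.
points = [(0, 10.0), (60, 20.0), (120, 30.0), (300, 50.0), (360, 70.0)]
stats = aggregate(points, 300)
print(stats[0]["Average"], stats[300]["Average"])  # 20.0 60.0
```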

Metric Resolution and Storage

Metrics have a resolution that determines the granularity at which data is stored:

- Standard resolution: 1-minute granularity (data points aggregated every minute).
- High resolution: 1-second granularity (data points stored at 1-second intervals).

Custom metrics can be published with high resolution by setting the StorageResolution parameter to 1. High-resolution metrics are stored for 3 hours at 1-second resolution, then aggregated to 1-minute for 15 days, then to 5-minute for 63 days, and finally to 1-hour for 15 months.
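The retention schedule can be read as a lookup: given the age of a data point, what is the finest period still available? The sketch below encodes the tiers for a high-resolution metric (a standard-resolution metric would start at the 60-second tier). The names `RETENTION_TIERS` and `finest_period` are hypothetical, and treating each boundary as inclusive is an assumption.

```python
HOUR = 3600
DAY = 24 * HOUR

# (period in seconds, maximum age in seconds) for a high-resolution metric.
RETENTION_TIERS = [
    (1, 3 * HOUR),       # sub-minute data: 3 hours
    (60, 15 * DAY),      # 1-minute data: 15 days
    (300, 63 * DAY),     # 5-minute data: 63 days
    (3600, 455 * DAY),   # 1-hour data: 455 days (15 months)
]


def finest_period(age_seconds):
    """Return the smallest period still retained for data of this age."""
    for period, max_age in RETENTION_TIERS:
        if age_seconds <= max_age:
            return period
    return None  # older than 455 days: no longer retained


print(finest_period(2 * HOUR), finest_period(30 * DAY), finest_period(500 * DAY))
# 1 300 None
```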

CloudWatch Alarms

An alarm watches a single metric or a metric math expression. The alarm evaluates the metric against a threshold over a specified number of evaluation periods. The alarm can be in one of three states:

- OK: The metric is within the defined threshold.
- ALARM: The required number of data points within the evaluation window have breached the threshold.
- INSUFFICIENT_DATA: Not enough data is available to evaluate the alarm (e.g., newly created metric, missing data points).

#### Alarm Components

- Metric: The metric to monitor (e.g., AWS/EC2 CPUUtilization).
- Statistic: The statistic to evaluate (e.g., Average, Sum, Maximum).
- Period: The length of time in seconds to evaluate the metric.
- Evaluation Periods: The number of most recent periods to consider.
- Datapoints to Alarm: The number of data points within the evaluation periods that must be breaching to trigger ALARM state. If omitted, it defaults to the value of Evaluation Periods.
- Threshold: The value that triggers the alarm.
- Comparison Operator: e.g., GreaterThanOrEqualToThreshold, LessThanThreshold.
- Treat Missing Data: How to handle missing data points (breaching, notBreaching, ignore, or missing). The default is missing: absent data points are counted as neither breaching nor not breaching, and if every data point in the evaluation range is missing the alarm moves to INSUFFICIENT_DATA.
- Actions: Actions to take on state transitions (OK, ALARM, INSUFFICIENT_DATA). Common actions include SNS notifications, Auto Scaling policies, and EC2 actions (stop, terminate, reboot).

#### Alarm Evaluation Logic

The alarm evaluates the metric over the last N evaluation periods. The alarm transitions to ALARM if the metric breaches the threshold for at least DatapointsToAlarm out of the last EvaluationPeriods periods. For example, with EvaluationPeriods=3 and DatapointsToAlarm=2, the alarm goes to ALARM if 2 out of the last 3 periods are breaching. This provides tolerance for transient spikes.
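The "M out of N" rule can be sketched in a few lines. This is a simplified model that assumes every period has a data point; the function name `alarm_state` is hypothetical, and returning INSUFFICIENT_DATA when fewer than N points exist is a simplification for the example.

```python
import operator

COMPARATORS = {
    "GreaterThanThreshold": operator.gt,
    "GreaterThanOrEqualToThreshold": operator.ge,
    "LessThanThreshold": operator.lt,
    "LessThanOrEqualToThreshold": operator.le,
}


def alarm_state(datapoints, threshold, comparison,
                evaluation_periods, datapoints_to_alarm):
    """Evaluate the most recent N data points with an M-out-of-N rule."""
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"  # simplification: not enough history yet
    compare = COMPARATORS[comparison]
    breaching = sum(1 for value in window if compare(value, threshold))
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# EvaluationPeriods=3, DatapointsToAlarm=2: two of the last three breach.
print(alarm_state([50, 85, 60, 90], 80, "GreaterThanThreshold", 3, 2))  # ALARM
# Only one of the last three breaches, so a single spike is tolerated.
print(alarm_state([50, 60, 90], 80, "GreaterThanThreshold", 3, 2))      # OK
```

Note that the two breaching points in the first call are not adjacent; they only need to fall within the same three-period window.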

Metric Math

You can create alarms based on expressions combining multiple metrics. Metric math supports arithmetic operations (+, -, *, /), statistical functions (AVG, SUM, MIN, MAX, METRIC_COUNT), and time series functions (METRICS, FILL). For example, you could create an expression (m1 + m2) / 2 to average two metrics. The expression result is a time series that the alarm evaluates.
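The expression (m1 + m2) / 2 is applied point by point across the two time series. The sketch below mimics that in plain Python; the function name and sample data are hypothetical, and the choice to emit output only for timestamps present in both series is an assumption of this example.

```python
def metric_math_average(m1, m2):
    """Pointwise (m1 + m2) / 2 over two {timestamp: value} series."""
    # Only timestamps present in both series produce a result here.
    shared = sorted(m1.keys() & m2.keys())
    return {ts: (m1[ts] + m2[ts]) / 2 for ts in shared}


m1 = {0: 40.0, 300: 60.0, 600: 80.0}
m2 = {0: 20.0, 300: 100.0}
print(metric_math_average(m1, m2))  # {0: 30.0, 300: 80.0}
```

The result is itself a time series, which is what an alarm would compare against its threshold.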

CloudWatch Dashboards

Dashboards allow you to create customizable views of metrics and alarms. You can add widgets (graphs, text, alarms) and arrange them. Dashboards can be shared across accounts or made public. They support automatic refresh and can display metrics from multiple regions.

Integration with Other Services

Auto Scaling: Use CloudWatch alarms to trigger Auto Scaling policies (scale out/in).

AWS Lambda: Invoke Lambda functions in response to alarm state changes.

Amazon SNS: Send notifications to email, SMS, or other endpoints.

AWS Systems Manager: Automate remediation actions using Automation documents.

CloudWatch Events/EventBridge: Capture alarm state changes and route to targets.

CloudWatch Logs and Metrics

CloudWatch Logs can be used to extract metrics from log data using metric filters. A metric filter defines a pattern to look for in log events and creates a metric when a match is found. You can then create alarms on those custom metrics. For example, you can count the number of ERROR log entries and alarm if errors exceed a threshold.
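Conceptually, a metric filter turns matching log events into a count per period. The sketch below imitates that for the ERROR example with a plain substring check, which is only a stand-in for real filter-pattern syntax. The function name and log lines are made up for illustration.

```python
from collections import Counter


def count_matches(events, term, period=60):
    """Count log events containing `term`, bucketed by period start."""
    counts = Counter()
    for ts, message in events:
        if term in message:  # stand-in for a real filter pattern
            counts[ts - ts % period] += 1
    return dict(counts)


events = [
    (5, "INFO service started"),
    (12, "ERROR db timeout"),
    (47, "ERROR retry failed"),
    (75, "ERROR db timeout"),
    (130, "INFO request ok"),
]
print(count_matches(events, "ERROR"))  # {0: 2, 60: 1}
```

An alarm on the resulting metric could then use the Sum statistic with a threshold on errors per period.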

CloudWatch Agent

The CloudWatch agent collects metrics and logs from EC2 instances and on-premises servers. It can collect system metrics (CPU, memory, disk, network) and custom metrics (e.g., application performance). The agent can also collect logs and send them to CloudWatch Logs. It supports both Linux and Windows.

Pricing

CloudWatch has a free tier: 10 custom metrics, 10 alarms, 1 million API requests, and 5 GB of log ingestion per month. Beyond that, you pay per metric, per alarm, per API request, and per GB of logs ingested and stored. High-resolution metrics cost more than standard resolution.

Exam-Relevant Details

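Default metric resolution for AWS services: 1 minute (standard). Note that EC2 basic monitoring publishes data points every 5 minutes; detailed monitoring publishes every 1 minute.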

Custom metrics: Can be high-resolution (1 second) or standard (1 minute).

Alarm evaluation: EvaluationPeriods and DatapointsToAlarm are key to understanding alarm behavior.

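TreatMissingData: Default is missing, which is not the same as notBreaching. Under missing, absent data points are left out of the evaluation, and an alarm with no data at all goes to INSUFFICIENT_DATA. Set breaching if missing data should count toward ALARM, or notBreaching if it should count as within threshold.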

Metric retention: Data points with period less than 60 seconds are retained for 3 hours; 60-second data for 15 days; 300-second data for 63 days; 3600-second data for 455 days (15 months).

Anomaly Detection: CloudWatch can create a model of expected metric behavior and alarm on anomalies.

Composite Alarms: Combine multiple alarms into a single alarm using AND, OR, and NOT logic.

Alarm actions: Can be added to existing alarms via the AWS CLI or console.
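For the TreatMissingData setting listed above, the clearest case is the one where every data point in the evaluation range is missing. The sketch below covers only that case; windows with a mix of present and missing points follow more involved rules that are not modelled here. The function name is hypothetical.

```python
def state_when_all_missing(treat_missing_data, current_state):
    """Alarm state when every data point in the evaluation range is missing."""
    outcomes = {
        "missing": "INSUFFICIENT_DATA",  # default: nothing to evaluate
        "ignore": current_state,         # keep whatever state the alarm had
        "breaching": "ALARM",            # gaps count as over the threshold
        "notBreaching": "OK",            # gaps count as within the threshold
    }
    return outcomes[treat_missing_data]


for option in ("missing", "ignore", "breaching", "notBreaching"):
    print(option, "->", state_when_all_missing(option, current_state="OK"))
```

This is why an alarm on a metric that reports only sporadically often sits in INSUFFICIENT_DATA under the default setting.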

CLI Examples

To put custom metric data:

aws cloudwatch put-metric-data --namespace "Custom/MyApp" --metric-data "[{\"MetricName\":\"Requests\",\"Value\":10,\"Unit\":\"Count\",\"Timestamp\":\"2025-04-01T12:00:00Z\"}]"

To create an alarm:

aws cloudwatch put-metric-alarm --alarm-name "HighCPU" --alarm-description "Alarm when CPU exceeds 80%" --metric-name CPUUtilization --namespace AWS/EC2 --statistic Average --period 300 --evaluation-periods 2 --threshold 80 --comparison-operator GreaterThanThreshold --dimensions Name=InstanceId,Value=i-1234567890abcdef0 --alarm-actions arn:aws:sns:us-east-1:123456789012:my-topic

To describe alarms:

aws cloudwatch describe-alarms --alarm-name-prefix "HighCPU"
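For readers who script against the API rather than the CLI, the `put-metric-alarm` example maps onto a set of keyword parameters. The sketch below only builds that parameter dictionary, using the same values as the CLI command; the function name is hypothetical, and the boto3 call in the trailing comment is shown as one possible way to send it, assuming boto3 is installed and credentials are configured.

```python
def high_cpu_alarm_params(instance_id, topic_arn):
    """Build PutMetricAlarm parameters mirroring the CLI example."""
    return {
        "AlarmName": "HighCPU",
        "AlarmDescription": "Alarm when CPU exceeds 80%",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds
        "EvaluationPeriods": 2,   # 2 x 300 s = 10 minutes of sustained breach
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


params = high_cpu_alarm_params(
    "i-1234567890abcdef0",
    "arn:aws:sns:us-east-1:123456789012:my-topic",
)
print(params["Period"] * params["EvaluationPeriods"])  # 600

# Possible usage (not executed here):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```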

Walk-Through

1

Select Metric and Namespace

First, identify the metric you want to monitor. AWS services automatically publish metrics to predefined namespaces (e.g., AWS/EC2 for EC2 metrics). For custom metrics, you define a namespace (e.g., Custom/MyApp). Each metric has a name (e.g., CPUUtilization) and optional dimensions (e.g., InstanceId=i-123). You must know the exact metric name and dimensions to create an alarm. Use the CloudWatch console, CLI, or API to list available metrics. For exam purposes, remember that dimensions uniquely identify a metric stream.

2

Define Alarm Parameters

Specify the statistic (e.g., Average, Sum), period (e.g., 300 seconds), evaluation periods (e.g., 2), datapoints to alarm (e.g., 2), threshold (e.g., 80), and comparison operator (e.g., GreaterThanThreshold). Also set how to treat missing data (default: missing, which leaves gaps out of the evaluation and can result in INSUFFICIENT_DATA when no data arrives). These parameters determine when the alarm triggers. For example, with EvaluationPeriods=3 and DatapointsToAlarm=2, the alarm requires 2 of the 3 most recent periods to be breaching; the two need not be adjacent.

3

Configure Alarm Actions

Choose actions for each state (OK, ALARM, INSUFFICIENT_DATA). Common actions: send notification via SNS topic, trigger Auto Scaling policy, perform EC2 action (stop, terminate, reboot), or invoke Lambda. You can specify multiple actions per state. Actions are ARNs (Amazon Resource Names). For exam, know that SNS topics must be created beforehand and that you can add actions after alarm creation.

4

Create Alarm via Console or CLI

Use the CloudWatch console to create an alarm by navigating to Alarms > Create alarm. Select the metric, define conditions, and configure actions. Alternatively, use the AWS CLI `put-metric-alarm` command. The alarm is created in the same region as the metric. After creation, the alarm begins evaluating data. It may initially show INSUFFICIENT_DATA until enough data points are collected.

5

Monitor and Maintain Alarm

After creation, monitor the alarm state via the console, CLI (`describe-alarms`), or dashboards. You can edit alarm parameters (e.g., threshold, actions) without deleting it. If the alarm triggers, verify the action occurred (e.g., SNS message received). For exam, know that you can disable and re-enable an alarm's actions (`disable-alarm-actions`, `enable-alarm-actions`), and that alarms can be deleted. Also, note that alarm history is stored for 14 days.

What This Looks Like on the Job

Enterprise Scenario 1: Auto Scaling Based on CPU Utilization

A large e-commerce company runs a fleet of EC2 instances behind an Auto Scaling group. They need to automatically scale out when CPU utilization exceeds 70% for 10 minutes and scale in when it drops below 30% for 10 minutes. They create two CloudWatch alarms: HighCPU (Average CPUUtilization > 70 for 2 consecutive 5-minute periods) and LowCPU (Average CPUUtilization < 30 for 2 consecutive 5-minute periods). The HighCPU alarm triggers a scale-out policy that adds 2 instances; LowCPU triggers a scale-in policy that removes 1 instance. They also set up an SNS notification to the operations team for both alarms.

In production, they notice that during flash sales, CPU spikes briefly above 70% but not for 10 minutes, so the alarm doesn't trigger unnecessarily. However, they also experience a slow ramp-up where CPU averages 68% for 10 minutes but never hits 70%. The alarm doesn't trigger, causing performance degradation, so they lower the threshold to 65% to catch this. Misconfiguration: setting EvaluationPeriods too low (e.g., 1) causes flapping (the alarm toggles rapidly), which can trigger unnecessary scaling actions. Best practice: use at least 2 evaluation periods to dampen noise.
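The flapping problem in this scenario can be seen by counting state changes on a noisy CPU series. The sketch below is a simplified model in which the alarm is in ALARM only when every point in the window breaches; the function name and the sample CPU values are invented for illustration.

```python
def state_changes(datapoints, threshold, evaluation_periods):
    """Count OK/ALARM transitions when all points in the window must breach."""
    states = []
    for end in range(evaluation_periods, len(datapoints) + 1):
        window = datapoints[end - evaluation_periods:end]
        states.append("ALARM" if all(v > threshold for v in window) else "OK")
    # A transition is any pair of neighbouring evaluations that differ.
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# CPU hovering around the 70% threshold, one value per 5-minute period.
cpu = [65, 75, 66, 78, 64, 76, 68, 74]
print(state_changes(cpu, 70, evaluation_periods=1))  # 7
print(state_changes(cpu, 70, evaluation_periods=2))  # 0
```

With a single evaluation period every oscillation flips the alarm, and each flip could fire a scaling action; requiring two consecutive breaching periods suppresses all of them for this series.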

Enterprise Scenario 2: Custom Application Monitoring

A financial services company runs a microservices application on ECS Fargate. They publish custom metrics for request latency, error count, and order volume. They create two metric alarms, one for ErrorCount > 10 and one for Latency > 500 ms, each sustained for 5 minutes, and combine them in a composite alarm that fires only when both are in ALARM. This alarm invokes a Lambda function that runs a diagnostic script and posts a message to a Slack channel via SNS. They also create a dashboard showing real-time metrics for each microservice.

In production, they encounter a problem where missing data points (e.g., due to a service restart) cause the alarm to enter INSUFFICIENT_DATA, a state they had not been acting on. They change TreatMissingData to notBreaching so that gaps are treated as within threshold and the alarm stays in OK. They also learn that high-resolution metrics (1-second) are useful for debugging but cost more, so they only use them for critical services.
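A composite alarm is defined by a rule expression over the states of other alarms. The sketch below builds such a rule string in the `ALARM("name")` form and evaluates the same logic locally. The child alarm names `HighErrorCount` and `HighLatency` are hypothetical, as are both helper functions.

```python
def composite_rule(alarm_names, logic="AND"):
    """Build an alarm rule string such as: ALARM("a") AND ALARM("b")."""
    if logic not in ("AND", "OR"):
        raise ValueError("logic must be 'AND' or 'OR'")
    return f" {logic} ".join(f'ALARM("{name}")' for name in alarm_names)


def rule_fires(child_states, logic="AND"):
    """Evaluate the same logic locally against child alarm states."""
    in_alarm = [state == "ALARM" for state in child_states.values()]
    return all(in_alarm) if logic == "AND" else any(in_alarm)


print(composite_rule(["HighErrorCount", "HighLatency"]))
# ALARM("HighErrorCount") AND ALARM("HighLatency")

# Errors are high but latency is fine: the AND rule stays quiet.
print(rule_fires({"HighErrorCount": "ALARM", "HighLatency": "OK"}))  # False
```

Using AND here reduces noise: a burst of errors alone, or slow responses alone, does not page anyone.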

Enterprise Scenario 3: Cost Monitoring with Budget Alarms

A startup uses AWS Budgets to set cost thresholds, but they also create CloudWatch alarms on the Billing metric (namespace: AWS/Billing, metric: EstimatedCharges). They set an alarm to notify when estimated charges exceed $1,000 for the month. This requires enabling billing alerts in the account settings. They also create a custom metric that tracks cost per service by using cost allocation tags and publishing metrics via a Lambda function. Misconfiguration: forgetting to enable billing alerts results in no data for the billing metric, causing the alarm to remain in INSUFFICIENT_DATA. They also set the period to 6 hours (21600 seconds) to avoid daily fluctuations. Common mistake: using Sum instead of Maximum. Sum adds up every EstimatedCharges data point in the period, so it can exceed the threshold early in the month; Maximum gives the current estimated total.
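The Sum-versus-Maximum mistake is easy to see with numbers. Because each EstimatedCharges data point already reports the running month-to-date total, adding several of them together double-counts. The readings below are invented for illustration, and the helper name is hypothetical.

```python
def breaches(readings, threshold, statistic):
    """Apply the chosen statistic to one period's readings and compare."""
    value = max(readings) if statistic == "Maximum" else sum(readings)
    return value > threshold


# Three month-to-date readings that fall inside one 6-hour period.
readings = [400.0, 410.0, 425.0]

print(sum(readings), breaches(readings, 1000, "Sum"))      # 1235.0 True
print(max(readings), breaches(readings, 1000, "Maximum"))  # 425.0 False
```

With Sum the alarm fires at an apparent $1,235 even though actual month-to-date charges are $425.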

How SOA-C02 Actually Tests This

What SOA-C02 Tests on CloudWatch Metrics and Alarms

The exam focuses on practical knowledge of creating, configuring, and troubleshooting CloudWatch metrics and alarms. Key objectives:

- Objective 1.1: Monitor and report on performance and availability (e.g., set up alarms for EC2, RDS, ELB).
- Objective 1.2: Monitor and report on cost and usage (e.g., billing alarms).
- Objective 2.1: Implement and manage monitoring and reporting solutions (e.g., custom metrics, metric filters).

Common Wrong Answers and Why Candidates Choose Them

1.

Setting `EvaluationPeriods` equal to `DatapointsToAlarm`: Many candidates think this means the alarm triggers immediately after the first breach. Actually, it means the alarm requires that many consecutive breaching periods. For example, if both are 2, the alarm triggers only after 2 consecutive breaching periods. The wrong answer often says 'triggers after 1 period' — that would require DatapointsToAlarm=1.

2.

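Confusing `TreatMissingData` default: Candidates often think missing data causes the alarm to go to ALARM, or that the default behaves like notBreaching. The default is missing: absent data points are left out of the evaluation, and if no data is available at all the alarm goes to INSUFFICIENT_DATA. To make missing data count toward ALARM, set TreatMissingData to breaching; to make it count as within threshold, set notBreaching.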

3.

Thinking custom metrics must be published with high resolution: Custom metrics can be standard (1-minute) or high-resolution (1-second). High resolution is optional and incurs higher costs.

4.

Believing CloudWatch metrics are stored indefinitely: Metrics are retained for up to 15 months, but the granularity decreases over time (see retention details above).

Specific Numbers and Terms to Memorize

Default metric resolution: 1 minute (standard).

High resolution: 1 second (StorageResolution=1).

Retention: 3 hours for <60s data, 15 days for 60s, 63 days for 300s, 455 days for 3600s.

Alarm evaluation window: EvaluationPeriods × Period cannot exceed 86,400 seconds (one day) for periods shorter than one hour.

Period: 1 to 86400 seconds.

TreatMissingData options: breaching, notBreaching, ignore, missing.

Composite alarms: combine other alarms with AND, OR, and NOT; a single rule can reference up to 100 underlying alarms.

Anomaly detection: uses machine learning to model expected bounds.

Edge Cases and Exceptions

Billing metrics: Must enable billing alerts in account settings; only available in us-east-1.

Cross-account dashboards: A dashboard can display metrics from other accounts once cross-account functionality is enabled in CloudWatch, which relies on an IAM role in each sharing account.

Alarm history: Stored for 14 days.

Metric math expressions: Can include up to 10 metrics.

Alarm actions: Can be added after creation via CLI put-metric-alarm with updated actions list.

How to Eliminate Wrong Answers

If an answer says 'alarm triggers after 1 breach' while EvaluationPeriods is 3 and DatapointsToAlarm is either unset (so it defaults to 3) or greater than 1, it's wrong because the alarm needs multiple breaching periods.

If an answer suggests using CloudWatch Logs for real-time metrics without mentioning metric filters, it's incomplete.

If an answer says 'custom metrics can only be published via CloudWatch agent', it's wrong; you can use API/CLI.

If an answer says 'alarms can only have one action per state', it's wrong; multiple actions are allowed.

Key Takeaways

CloudWatch metrics are time-ordered data points identified by namespace, metric name, and dimensions.

Default metric resolution is 1 minute (standard); high-resolution (1 second) is optional for custom metrics.

Alarms evaluate metrics over a specified number of periods (EvaluationPeriods) and require a certain number of breaching periods (DatapointsToAlarm).

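TreatMissingData defaults to 'missing' (gaps are not evaluated; no data at all gives INSUFFICIENT_DATA); change to 'breaching' if missing data should trigger the alarm, or 'notBreaching' if it should count as healthy.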

Metric retention: <60s for 3 hours, 60s for 15 days, 300s for 63 days, 3600s for 455 days.

Alarm actions can include SNS, Auto Scaling, EC2 actions, Lambda, and Systems Manager.

Composite alarms combine multiple alarms with AND/OR logic.

Billing metrics require enabling billing alerts and are only available in us-east-1.

CloudWatch agent collects memory and disk metrics from EC2 and on-premises servers.

Metric filters in CloudWatch Logs extract metrics from log data.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Standard Resolution Metrics

1-minute granularity (aggregated every 60 seconds).

Lower cost than high resolution.

Retained for 15 days at 1-minute resolution.

Default for all AWS service metrics.

Suitable for most monitoring scenarios.

High Resolution Metrics

1-second granularity (data points at 1-second intervals).

Higher cost per metric.

Retained for 3 hours at 1-second resolution.

Must be explicitly set via StorageResolution=1 for custom metrics.

Used for real-time anomaly detection or high-frequency monitoring.

Watch Out for These

Mistake

CloudWatch automatically monitors memory and disk metrics for EC2 instances.

Correct

CloudWatch only monitors basic system metrics (CPU, network, disk I/O) by default. Memory and disk utilization require the CloudWatch agent or custom scripts to publish custom metrics.

Mistake

An alarm triggers immediately when a metric crosses the threshold.

Correct

The alarm triggers only after the required number of data points (DatapointsToAlarm) within the most recent EvaluationPeriods breach the threshold. For example, with EvaluationPeriods=3 and DatapointsToAlarm=2, 2 of the last 3 periods must be breaching.

Mistake

CloudWatch metrics are stored indefinitely at full resolution.

Correct

Metrics are retained for up to 15 months, but resolution degrades over time: high-resolution data (sub-minute) is kept for 3 hours, then aggregated to 1-minute for 15 days, 5-minute for 63 days, and 1-hour for 15 months.

Mistake

You can create custom metrics without specifying a namespace.

Correct

Every custom metric must have a namespace. If you omit it, the API call will fail. The namespace helps organize metrics and avoid collisions.

Mistake

Alarms can only send notifications via SNS.

Correct

Alarms can trigger multiple actions, including Auto Scaling policies, EC2 actions (stop, terminate, reboot), Lambda functions, and Systems Manager automations, in addition to SNS notifications.

Frequently Asked Questions

How do I create a CloudWatch alarm that triggers when CPU > 80% for 5 minutes?

Use the AWS CLI: `aws cloudwatch put-metric-alarm --alarm-name HighCPU --metric-name CPUUtilization --namespace AWS/EC2 --statistic Average --period 300 --evaluation-periods 1 --threshold 80 --comparison-operator GreaterThanThreshold --dimensions Name=InstanceId,Value=i-123 --alarm-actions arn:aws:sns:us-east-1:123456789012:my-topic`. The period of 300 seconds (5 minutes) and evaluation-periods of 1 means the alarm triggers if the average CPU exceeds 80% in a single 5-minute window. For exam, remember that period is in seconds.

What is the difference between EvaluationPeriods and DatapointsToAlarm?

EvaluationPeriods is the number of most recent periods to consider. DatapointsToAlarm is the number of those periods that must be breaching to trigger the alarm. For example, EvaluationPeriods=3, DatapointsToAlarm=2 means the alarm triggers if 2 out of the last 3 periods are breaching. This allows tolerance for transient spikes. If both are equal (e.g., 2), the alarm requires that many consecutive breaching periods.

Can I create an alarm on a custom metric that hasn't been published yet?

Yes, you can create an alarm on a metric that doesn't exist yet. The alarm will start in INSUFFICIENT_DATA state until data points are published. Once data arrives, the alarm evaluates normally. This is useful for setting up alarms before deploying an application.

How do I get memory and disk metrics for EC2 instances?

You must install the CloudWatch agent on the instance. The agent collects memory, disk, and other system metrics and publishes them as custom metrics to CloudWatch. You can then create alarms on these metrics. Without the agent, only basic metrics (CPU, network, disk I/O) are available.

What happens if I delete a CloudWatch alarm?

Deleting an alarm removes it permanently. It stops evaluating the metric and no longer triggers actions. The metric data is unaffected. You can create a new alarm with the same name later, but it starts with a fresh state. For exam, know that an alarm's actions can be disabled temporarily (`disable-alarm-actions`) without deleting the alarm.

Can I share a CloudWatch dashboard with another AWS account?

Yes. CloudWatch dashboard sharing lets you make a dashboard public (accessible by anyone with the link) or share it with specific people, identified by email address, who sign in to view it. To show metrics from several accounts on one dashboard, enable cross-account functionality in CloudWatch, which relies on an IAM role in each sharing account. A shared dashboard displays metrics from the source account even when the viewer has no access to that account.

How do I create an alarm that triggers on a metric math expression?

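Use the `put-metric-alarm` CLI with the `--metrics` parameter instead of `--metric-name`, `--namespace`, `--statistic`, `--period`, and `--dimensions`. Define each input metric and the expression by ID, and set `ReturnData` to true only on the expression the alarm should evaluate. Example: `--metrics "[{\"Id\":\"m1\",\"MetricStat\":{\"Metric\":{\"Namespace\":\"AWS/EC2\",\"MetricName\":\"CPUUtilization\"},\"Period\":300,\"Stat\":\"Average\"},\"ReturnData\":false},{\"Id\":\"expr1\",\"Expression\":\"m1 > 80\",\"ReturnData\":true}]"`. Because the comparison `m1 > 80` yields 1 when true and 0 when false, pair it with `--threshold 1 --comparison-operator GreaterThanOrEqualToThreshold`. The alarm evaluates the expression result.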
