This chapter covers CloudWatch Log Metric Alarms, a feature that lets you monitor log data in near real time and trigger automated actions based on patterns in your logs. For the SOA-C02 exam, this topic falls under Domain 1: Monitoring, Logging, and Remediation, and you can expect 2-3 questions that test your understanding of metric filters, alarm states, and evaluation periods. Mastering it is essential for designing proactive monitoring that can scale, notify, or remediate without manual intervention.
Imagine a factory floor with hundreds of machines (log streams) producing noise (log events). The factory manager wants to know if any machine overheats (error count) or if the entire floor gets too loud (sum of logs). Instead of listening to every machine, he installs a monitor (metric filter) that counts specific sounds (e.g., beeps = errors) and reports the count every minute. This count is like a CloudWatch metric. Now, he sets a threshold: if the count exceeds 5 per minute, the monitor triggers a red light (alarm). The alarm can also call the maintenance team (SNS) or shut down a machine (Auto Scaling). But the monitor only checks every 60 seconds (period), so if a machine overheats and cools down within 30 seconds, the alarm may never fire. The factory manager must choose the right evaluation periods (e.g., 2 consecutive minutes over threshold) to avoid false alarms. He also needs to decide whether the monitor should use the average, sum, or max of counts over the period. This is exactly how CloudWatch Log Metric Alarms work: you define a filter on log data, extract a metric, set an alarm on that metric with a statistic and evaluation periods, and trigger actions when the threshold is breached.
What are CloudWatch Log Metric Alarms?
CloudWatch Log Metric Alarms combine two CloudWatch features: metric filters in CloudWatch Logs and CloudWatch alarms. You define a metric filter on a log group, which extracts numeric data from log events and publishes it as a custom CloudWatch metric. You then create an alarm on that metric that triggers when it crosses a threshold for a specified number of evaluation periods.
Why Use Log Metric Alarms?
Logs are a rich source of operational data. By converting log data into metrics, you can:
Monitor error rates, request counts, or any numeric pattern in your logs.
Set alarms that trigger actions like sending an SNS notification or invoking a Lambda function.
Auto-scale your infrastructure based on log-derived metrics.
Visualize trends in CloudWatch dashboards.
How It Works Internally
The process involves three main steps:
1. Define a Metric Filter: You create a filter pattern that matches specific log events. The filter can count occurrences or extract a numeric value from the log event. For example, you can count all log lines containing "ERROR" or extract the value of a JSON field like latency.
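2. Metric is Published: When log events match the filter, CloudWatch Logs publishes data points to a custom CloudWatch metric, aggregated and reported once per minute. The metric is created automatically in the namespace you specify in the metric transformation; the namespace is required, and the AWS/Logs namespace is reserved for the service's own metrics such as IncomingLogEvents.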
3. Alarm Evaluates: You create a CloudWatch alarm on that metric. The alarm periodically evaluates the metric against a threshold. If the metric breaches the threshold for the specified number of evaluation periods, the alarm changes state (e.g., from OK to ALARM).
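The three steps above can be sketched locally. This is a minimal model, not the service's implementation: the sample events, the one-minute bucketing by epoch seconds, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def count_matches_per_minute(events, term):
    """Return {minute_start_epoch: match_count} for events containing `term`."""
    counts = Counter()
    for timestamp, message in events:
        if term in message:
            counts[timestamp - timestamp % 60] += 1  # bucket by minute
    return dict(counts)


def breaching_minutes(per_minute, threshold):
    """Return the sorted minute buckets whose count is strictly above threshold."""
    return sorted(minute for minute, count in per_minute.items() if count > threshold)


# Hypothetical sample data: (epoch seconds, log message)
events = [
    (0, "INFO start"),
    (5, "ERROR db timeout"),
    (30, "ERROR db timeout"),
    (59, "ERROR retry failed"),
    (61, "ERROR db timeout"),
    (90, "INFO recovered"),
]

per_minute = count_matches_per_minute(events, "ERROR")
print(per_minute)
print(breaching_minutes(per_minute, 2))
```

The first function plays the role of the metric filter (step 1) and the per-minute metric (step 2); the second is a stand-in for the threshold comparison an alarm performs (step 3).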
Key Components, Values, Defaults, and Timers
Metric Filter:
- Filter Pattern: A string that defines which log events to match. Patterns can be simple (e.g., "ERROR") or complex using JSON or space-delimited parsing.
- Metric Transformations: You specify a metric namespace, metric name, metric value, and optionally a default value. The metric value is typically 1 for count filters, or you can extract a numeric field from the log event.
- Default Value: The value published for a period in which logs were ingested but none matched the filter (e.g., 0). If no logs are ingested at all during a period, nothing is published, so the default value alone does not prevent missing data points.
Alarm:
- Period: The length of time (in seconds) over which the statistic is aggregated. Valid values are 10, 30, and any multiple of 60; the 10- and 30-second periods apply only to high-resolution metrics, which metric filters do not produce. The console defaults to 300 (5 minutes); the CLI and API require you to set it.
- Evaluation Periods: The number of most recent periods the alarm looks at when deciding its state (N). The console defaults to 1.
- Datapoints to Alarm: The number of data points within the evaluation periods that must be breaching (M) for the alarm to fire. If omitted, it equals Evaluation Periods, so every period in the window must breach.
- Statistic: The statistic to apply to the metric (e.g., Sum, Average, SampleCount, Minimum, Maximum). For log-derived metrics, Sum is common for count filters.
- Threshold: The value that triggers the alarm. Can be static or anomaly detection.
- Treat Missing Data: How the alarm handles periods with no data points: missing, notBreaching, breaching, or ignore. The default is missing, which means missing periods count neither for nor against the threshold, and the alarm moves to INSUFFICIENT_DATA when every data point in the evaluation window is missing.
Alarm States:
- OK: Metric is within the threshold.
- ALARM: Metric is outside the threshold for the specified evaluation periods.
- INSUFFICIENT_DATA: Not enough data points to evaluate (e.g., no log events in the period).
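How Evaluation Periods (N) and Datapoints to Alarm (M) interact is easier to see in code. The sketch below is a simplified model with GreaterThanThreshold semantics that assumes one data point per period and no gaps; the function name is hypothetical, and the real service applies additional rules when data points are missing.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Simplified M-out-of-N check.

    datapoints: per-period values, oldest first, with no gaps.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    window = datapoints[-evaluation_periods:]          # the N most recent periods
    breaches = sum(1 for value in window if value > threshold)
    return "ALARM" if breaches >= datapoints_to_alarm else "OK"


# 2 out of 2: both of the latest periods must breach.
print(alarm_state([12, 3, 15], threshold=10, evaluation_periods=2, datapoints_to_alarm=2))
print(alarm_state([3, 12, 15], threshold=10, evaluation_periods=2, datapoints_to_alarm=2))
# 2 out of 3: the breaches do not have to be consecutive.
print(alarm_state([12, 3, 15], threshold=10, evaluation_periods=3, datapoints_to_alarm=2))
```

Requiring M equal to N suppresses one-off spikes, while M lower than N tolerates an occasional normal period inside a sustained problem.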
Configuration and Verification
Using AWS CLI:
Create a metric filter:
aws logs put-metric-filter \
--log-group-name "/aws/lambda/my-function" \
--filter-name "ErrorCount" \
--filter-pattern "ERROR" \
--metric-transformations \
metricName=ErrorCount,metricNamespace=MyApp,metricValue=1,defaultValue=0
Create an alarm:
aws cloudwatch put-metric-alarm \
--alarm-name "HighErrorRate" \
--alarm-description "Alarm when error count > 10 per minute" \
--metric-name ErrorCount \
--namespace MyApp \
--statistic Sum \
--period 60 \
--evaluation-periods 2 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:MyTopic
Verification:
Use aws logs describe-metric-filters --log-group-name <name> to list filters.
Use aws cloudwatch describe-alarms --alarm-names <name> to check alarm state.
Use aws cloudwatch get-metric-statistics --namespace MyApp --metric-name ErrorCount --start-time ... --end-time ... --period 60 --statistics Sum to see metric data.
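If you script this setup in Python instead of the CLI, the same settings map onto the boto3 `put_metric_filter` and `put_metric_alarm` calls. The sketch below only builds the request arguments so it can run without AWS access; the helper names are hypothetical, and the `apply` function assumes boto3 is installed and credentials are configured.

```python
def metric_filter_request(log_group):
    """Keyword arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group,
        "filterName": "ErrorCount",
        "filterPattern": "ERROR",
        "metricTransformations": [{
            "metricName": "ErrorCount",
            "metricNamespace": "MyApp",
            "metricValue": "1",      # the API takes the metric value as a string
            "defaultValue": 0.0,
        }],
    }


def alarm_request(topic_arn):
    """Keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": "HighErrorRate",
        "AlarmDescription": "Alarm when error count > 10 per minute",
        "MetricName": "ErrorCount",
        "Namespace": "MyApp",
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 10.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def apply(filter_kwargs, alarm_kwargs):
    """Send both requests; needs boto3 and AWS credentials, so it is not run here."""
    import boto3
    boto3.client("logs").put_metric_filter(**filter_kwargs)
    boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)


filter_kwargs = metric_filter_request("/aws/lambda/my-function")
alarm_kwargs = alarm_request("arn:aws:sns:us-east-1:123456789012:MyTopic")
print(filter_kwargs["metricTransformations"][0]["metricName"], alarm_kwargs["Period"])
```

Keeping the request builders separate from the call that sends them makes the configuration easy to review or unit test before anything is created in an account.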
How It Interacts with Related Technologies
CloudWatch Logs Insights: Use for ad-hoc querying; metric filters are for real-time monitoring.
CloudWatch Dashboards: Display log-derived metrics alongside infrastructure metrics.
Lambda: Trigger a Lambda function from an alarm to perform custom remediation.
Auto Scaling: Attach a step or simple scaling policy as an alarm action, or reference a log-derived metric in a target tracking policy when the metric rises and falls in proportion to load per instance.
SNS: Send notifications to email, SMS, or other endpoints.
Important Considerations
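Filter Pattern Syntax: Patterns are case-sensitive. For plain-text events, space-separated terms must all match, a leading ? makes terms alternatives (e.g., ?ERROR ?WARN), and a leading - excludes a term. JSON and space-delimited patterns support && and || logical operators, the * wildcard in string values, and JSON extraction with $.field.
Metric Publishing Frequency: CloudWatch Logs aggregates matching events and reports the resulting metric to CloudWatch once per minute, so the metric is standard resolution no matter how quickly log events arrive.
Alarm Evaluation: The alarm's period defines the aggregation window for the chosen statistic. With a 300-second period and the Sum statistic, for example, the alarm compares the total of five one-minute data points against the threshold.
Missing Data: If no log events match in a period, no metric data point is published by default. The defaultValue in the metric transformation publishes a value (e.g., 0) for periods in which logs were ingested but none matched; if no logs are ingested at all, nothing is published.
Costs: Metrics created by metric filters are billed as custom metrics, per metric per month, and each dimension combination counts as a separate metric. Alarms are billed per alarm per month, and log ingestion and storage are billed separately.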
Common Configuration Patterns
Error Rate Monitoring: Filter for "ERROR" or "Exception", count per minute, alarm if sum > threshold for 2 consecutive periods.
Latency Monitoring: Extract latency from JSON logs, use Average statistic, alarm if > 500ms for 3 periods.
403 Forbidden Count: Filter for "403", count per 5 minutes, alarm if sum > 100.
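The latency pattern above can be modeled locally as follows. This is a sketch under assumptions: the JSON field name `latency`, the sample events, and both function names are illustrative, and consecutive buckets are assumed to have no gaps between them.

```python
import json
from statistics import mean


def average_latency_per_minute(events, field="latency"):
    """Average a numeric JSON field per one-minute bucket, skipping non-JSON lines."""
    buckets = {}
    for timestamp, message in events:
        try:
            record = json.loads(message)
        except json.JSONDecodeError:
            continue  # plain-text lines do not match a JSON pattern
        value = record.get(field) if isinstance(record, dict) else None
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            buckets.setdefault(timestamp - timestamp % 60, []).append(value)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}


def breaches_for_n_periods(averages, threshold, n):
    """True when the n most recent averages are all above the threshold."""
    recent = list(averages.values())[-n:]
    return len(recent) == n and all(value > threshold for value in recent)


# Hypothetical sample data: (epoch seconds, log message)
events = [
    (0, '{"latency": 400}'),
    (10, '{"latency": 700}'),
    (65, '{"latency": 600}'),
    (70, "plain text line"),
    (125, '{"latency": 520}'),
    (130, '{"latency": 580}'),
]

averages = average_latency_per_minute(events)
print(averages)
print(breaches_for_n_periods(averages, threshold=500, n=3))
```

Using Average rather than Sum matters here: the metric value is the extracted latency itself, so summing it would grow with traffic rather than with slowness.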
Exam Tips
Remember that metric filters are defined at the log group level, not per log stream.
The metric is automatically created in CloudWatch; you do not need to create it manually.
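The CLI and API let you create an alarm on a metric that has not published data yet; the alarm simply stays in INSUFFICIENT_DATA until data points arrive.
The defaultValue keeps data flowing when logs are ingested but nothing matches. When no logs are produced at all, nothing is published, so use treatMissingData (e.g., notBreaching) to keep the alarm out of INSUFFICIENT_DATA.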
Evaluation periods and datapoints to alarm allow you to reduce false positives by requiring multiple breaches.
Troubleshooting
Alarm stuck in INSUFFICIENT_DATA: Check if the metric filter has matched any log events. Use aws logs describe-metric-filters to confirm the filter exists, and check CloudWatch metrics for the custom namespace.
Alarm not triggering: Verify the threshold, statistic, and evaluation periods. Ensure the alarm actions (SNS, Lambda) have correct permissions.
Metric not appearing: It can take a few minutes for the first data point to appear. Also, check if the filter pattern is correct.
Identify Log Group and Pattern
First, identify the CloudWatch Log Group that contains the logs you want to monitor. This could be from an EC2 instance, Lambda function, or any AWS service that publishes logs. Then, define the filter pattern that matches the log events of interest. For example, a pattern like "ERROR" will match any log line containing the word ERROR. Patterns are case-sensitive, so "ERROR" does not match "error". The pattern can be a simple string or a complex JSON expression. Keep the pattern precise so that irrelevant events do not inflate the metric and cause false alarms.
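Before creating a filter, it can help to try a candidate pattern against sample log lines. The sketch below is a rough local approximation of plain-text term matching only: required terms, `?` alternatives, and `-` exclusions. It assumes simple substring containment, which may differ from how the service tokenizes terms, and it does not cover quoted phrases, JSON, or space-delimited patterns. The function name is hypothetical.

```python
def matches(pattern, message):
    """Approximate plain-text term matching (case-sensitive substring tests)."""
    terms = pattern.split()
    optional = [term[1:] for term in terms if term.startswith("?")]
    excluded = [term[1:] for term in terms if term.startswith("-")]
    required = [term for term in terms if not term.startswith(("?", "-"))]
    if optional and (required or excluded):
        raise ValueError("mixing ?terms with other terms is not modeled here")
    if optional:
        return any(term in message for term in optional)      # any alternative
    return (all(term in message for term in required)         # every required term
            and not any(term in message for term in excluded))


print(matches("ERROR", "[ERROR 400] BAD REQUEST"))
print(matches("ERROR", "error in lowercase"))
print(matches("?ERROR ?WARN", "WARN disk almost full"))
print(matches("ERROR -EXIT", "ERROR EXIT code 1"))
```

For anything beyond a quick sanity check, the console's pattern test feature or a real log group is the authoritative reference.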
Create Metric Filter with Transform
Using the AWS Management Console, CLI, or SDK, create a metric filter on the log group. Specify the filter pattern and the metric transformation: a metric namespace (e.g., MyApp), metric name (e.g., ErrorCount), metric value (usually 1 for counting, or a numeric field from the log), and a default value (e.g., 0). The default value is published for periods in which logs are ingested but none match, which prevents gaps in the metric while the application is still logging. The metric is automatically created in CloudWatch under the specified namespace.
Verify Metric Data Points
After the metric filter is created, it will start publishing data points to CloudWatch. Use the CloudWatch console or CLI to verify that the metric appears and data points are being recorded. You can query the metric using `get-metric-statistics` to see the values over time. This step is crucial to confirm that the filter pattern is working correctly. If no data points appear, check that logs are being generated and the pattern matches.
Create Alarm with Threshold and Period
Create a CloudWatch alarm on the custom metric. Choose the statistic (e.g., Sum for count metrics), period (e.g., 60 seconds), evaluation periods (e.g., 2), and threshold (e.g., 10). The alarm will evaluate the metric every period and check if the threshold is breached for the specified number of consecutive periods. Also, configure how to treat missing data (e.g., treat as not breaching). Attach actions such as SNS topic ARN or Lambda function ARN to trigger when the alarm state changes.
Test Alarm and Monitor State
Generate log events that match the filter pattern to trigger the alarm. For example, if monitoring errors, cause an error in your application. Observe the alarm state change from OK to ALARM after the evaluation periods are met. Verify that the configured actions (e.g., email notification) are triggered. Use CloudWatch dashboards to monitor the metric and alarm history. Adjust threshold or evaluation periods if needed to reduce false positives or ensure timely alerts.
Scenario 1: E-commerce Application Error Monitoring
A large e-commerce platform uses AWS Lambda for order processing. They want to be alerted if the error rate exceeds 5% of total invocations. They configure a metric filter on the Lambda log group to count "ERROR" log events and another filter to count invocations (by matching the REPORT line that Lambda writes once per invocation). This gives them one CloudWatch metric for error count and one for total count. Then, they use a CloudWatch math expression to compute error rate = (errorCount / totalCount) * 100, and create an alarm on that expression with a threshold of 5. The alarm triggers an SNS notification to the operations team. In production, they set evaluation periods to 3 (5-minute periods) to avoid flapping. They also set a default value of 0 for both metrics so that periods with logs but no matches still report a data point. The system processes millions of requests daily, generating thousands of log lines per second, so they keep the filters narrow to keep the metrics meaningful. Misconfiguration: Initially, they used a filter pattern that was too broad, matching many irrelevant lines, which inflated the error count and caused false alarms. They refined the pattern to match only lines with specific error codes.
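An alarm on a math expression is defined with the `Metrics` parameter of `put_metric_alarm` rather than a single metric name. The sketch below builds such a request for this scenario; the alarm name, metric names, namespace, and topic ARN are hypothetical, and the request is only constructed, not sent. The small helper mirrors the expression locally and returns None for a zero denominator, since how the service handles that case is not modeled here.

```python
def error_rate_alarm_request(namespace, error_metric, total_metric, topic_arn):
    """Keyword arguments for put_metric_alarm using a metric math expression."""
    def stat(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {"Namespace": namespace, "MetricName": metric_name},
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,   # input to the expression, not alarmed on directly
        }

    return {
        "AlarmName": "OrderErrorRateHigh",
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": 5.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
        "Metrics": [
            stat("errors", error_metric),
            stat("total", total_metric),
            {
                "Id": "rate",
                "Expression": "(errors / total) * 100",
                "Label": "ErrorRatePercent",
                "ReturnData": True,   # the alarm evaluates this expression
            },
        ],
    }


def error_rate_percent(errors, total):
    """Local equivalent of the expression; None when there were no invocations."""
    return None if total == 0 else errors * 100 / total


request = error_rate_alarm_request(
    "MyApp", "ErrorCount", "InvocationCount",
    "arn:aws:sns:us-east-1:123456789012:MyTopic",
)
print(request["Metrics"][2]["Expression"])
print(error_rate_percent(30, 500))
```

Exactly one entry in `Metrics` has `ReturnData` set to true, and that entry is what the threshold is compared against.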
Scenario 2: Real-time Security Alerting
A financial services company monitors for unauthorized access attempts. They have a CloudTrail log group capturing API calls. They create a metric filter to count events where eventName is "ConsoleLogin" and errorMessage contains "Failed authentication". They set an alarm with a threshold of 10 failed logins in 5 minutes. The alarm triggers a Lambda function that revokes the user's credentials temporarily and sends an alert to the security team. They use a period of 300 seconds and evaluation periods of 1 for immediate response. Performance: The CloudTrail logs can be high volume, but the filter pattern is specific, so only relevant events are counted. They also use a default value of 0 so that the alarm returns to OK when events are flowing but no failed logins occur. Misconfiguration: Initially, they forgot to set a default value, so during quiet periods the alarm went to INSUFFICIENT_DATA, which they had to handle. They also learned that the Lambda function's resource-based policy must allow CloudWatch alarms to invoke it.
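A JSON filter pattern for this scenario, together with a local check that applies the same condition to sample events, might look like the sketch below. It assumes an exact match on errorMessage (a wildcard form such as "*Failed authentication*" would be needed for a true "contains" match) and treats a count of 10 or more as breaching; both are assumptions, as are the sample events and function names.

```python
import json

# Candidate pattern for the metric filter (exact-match form).
FILTER_PATTERN = (
    '{ ($.eventName = ConsoleLogin) && ($.errorMessage = "Failed authentication") }'
)


def is_failed_console_login(raw_event):
    """Apply the same condition as FILTER_PATTERN to one CloudTrail event."""
    event = json.loads(raw_event)
    return (event.get("eventName") == "ConsoleLogin"
            and event.get("errorMessage") == "Failed authentication")


def should_alarm(raw_events, threshold=10):
    """True when the failed logins in one 5-minute window reach the threshold."""
    failed = sum(1 for raw in raw_events if is_failed_console_login(raw))
    return failed >= threshold


# Hypothetical events from a single 5-minute window.
failed_login = json.dumps(
    {"eventName": "ConsoleLogin", "errorMessage": "Failed authentication"}
)
good_login = json.dumps({"eventName": "ConsoleLogin"})
other_call = json.dumps({"eventName": "DescribeInstances"})
window = [failed_login] * 10 + [good_login, other_call]

print(should_alarm(window))
print(should_alarm(window[:5]))
```

Successful logins carry no errorMessage field, so they fall out of the count without any extra condition.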
Scenario 3: Auto Scaling based on Logged Latency
A video streaming service uses EC2 instances behind an ALB. They want to scale based on 99th percentile latency measured from application logs. They configure a metric filter to extract the latency field from JSON logs and publish it as a custom metric; the p99 statistic is chosen when the metric is read, not in the filter. They create a target tracking scaling policy using this custom metric. The auto scaling group adds instances when latency exceeds 200ms. They set the metric's period to 60 seconds and use a warm-up time for new instances. Common pitfall: The custom metric must be published continuously; if the application stops logging, the metric stops, and the scaling policy may not work correctly. A default value does not help here, because it is published only when logs are ingested without a match, and a default of 0 would also drag the latency statistic down. Also, they ensure the metric filter is defined on the correct log group (e.g., /aws/ec2/instance-id).
What SOA-C02 Tests on This Topic
Domain 1: Monitoring, Logging, and Remediation, Task Statement 1.1: Implement metrics, alarms, and filters by using AWS monitoring and logging services. Specifically, you need to know how to create metric filters and alarms, and understand the alarm states and evaluation logic. Expect questions about:
The difference between metric filter and CloudWatch Logs Insights.
How to configure an alarm to trigger on a custom log metric.
The meaning of alarm states: OK, ALARM, INSUFFICIENT_DATA.
The effect of defaultValue and treatMissingData.
How evaluation periods and datapoints to alarm work.
Common Wrong Answers and Why
"Metric filters are created at the log stream level." – Wrong. They are created at the log group level and apply to all streams in that group. Candidates confuse log groups and streams.
"You must create the custom metric in CloudWatch before creating the filter." – Wrong. The metric is automatically created the first time a data point is published. No manual creation needed.
"Setting `treatMissingData` to `breaching` will cause the alarm to go to ALARM when no data is present." – This is true, but candidates often think missing is the same as notBreaching. The exam tests the exact behavior.
"An alarm with 2 evaluation periods and 2 datapoints to alarm will trigger after 2 consecutive breaches." – Correct. With 2 out of 2, both of the most recent periods must breach, so a single breach followed by a normal period does not trigger it. Non-consecutive breaches can trigger an alarm only when datapoints to alarm is lower than evaluation periods (e.g., 2 out of 3). The exam may ask about non-consecutive breaches.
Specific Numbers and Terms
Default period: 300 seconds (5 minutes) in the console; the CLI and API require an explicit value.
Default evaluation periods: 1 in the console.
Default datapoints to alarm: same as evaluation periods.
Valid periods: 10, 30, and any multiple of 60 seconds (10 and 30 only for high-resolution metrics).
Metric filter pattern syntax: "ERROR", { $.status = 404 }.
Metric transformation: metricValue can be 1 or a number extracted from log.
Edge Cases and Exceptions
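No matching logs: If logs are ingested but none match, no data point is published unless defaultValue is set. If no logs are ingested at all, nothing is published even with a defaultValue, and the alarm may go to INSUFFICIENT_DATA.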
High-resolution metrics: Custom metrics from log filters are standard resolution (1-minute granularity). You cannot get high-resolution (sub-minute) from log filters.
Multiple filters on same log group: Allowed, but each generates separate metrics.
Alarm on math expression: You can create an alarm on a CloudWatch math expression that combines multiple log-derived metrics.
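The "no matching logs" edge case has three distinct outcomes, which the sketch below models for a single one-minute period of a count filter. It assumes the default value is reported only for periods in which logs were ingested without a match; the function name and sample messages are hypothetical.

```python
def published_datapoint(ingested_messages, term, default_value=None):
    """Return the value reported for one minute, or None when nothing is reported."""
    if not ingested_messages:
        return None                    # no logs ingested: nothing is published
    matches = sum(1 for message in ingested_messages if term in message)
    if matches:
        return matches                 # matching events: the count is published
    return default_value               # logs ingested, none matched


print(published_datapoint(["ERROR a", "ERROR b", "INFO c"], "ERROR", default_value=0))
print(published_datapoint(["INFO ok"], "ERROR", default_value=0))
print(published_datapoint(["INFO ok"], "ERROR"))
print(published_datapoint([], "ERROR", default_value=0))
```

Only the last case, a completely silent log group, needs the alarm's treatMissingData setting to decide the outcome.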
How to Eliminate Wrong Answers
Focus on the mechanism: metric filters extract data from logs into custom metrics. Alarms evaluate those metrics.
If an answer says "Create a custom metric first", it's likely wrong.
If an answer says "Metric filter applies to a single log stream", it's wrong.
Understand the difference between treatMissingData options: missing (default) treats missing as missing (INSUFFICIENT_DATA), notBreaching treats as OK, breaching treats as ALARM, ignore continues current state.
For alarm evaluation, remember that the alarm state changes only after the evaluation periods are met. One breach is not enough if evaluation periods > 1.
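The four treatMissingData options can be summarized as a lookup for the simplest case, where every data point in the evaluation window is missing. This is a sketch of that one case only, with a hypothetical function name; partially missing windows follow more detailed rules that are not modeled here.

```python
def state_when_all_missing(treat_missing_data, current_state):
    """Resulting alarm state when the whole evaluation window has no data."""
    outcomes = {
        "missing": "INSUFFICIENT_DATA",   # default: no data, no verdict
        "notBreaching": "OK",             # missing counts as within the threshold
        "breaching": "ALARM",             # missing counts as over the threshold
        "ignore": current_state,          # keep whatever state the alarm had
    }
    return outcomes[treat_missing_data]


for option in ("missing", "notBreaching", "breaching", "ignore"):
    print(option, "->", state_when_all_missing(option, current_state="ALARM"))
```

The `ignore` row is the one most often misread: it does not mean OK, it means no change.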
Metric filters are defined at the log group level and apply to all log streams in that group.
The metric is automatically created in CloudWatch when the first data point is published.
In the console, the default alarm period is 300 seconds (5 minutes) and the default evaluation periods is 1.
Use `defaultValue` in the metric transformation to publish a value when logs arrive but none match; use `treatMissingData` to handle periods with no logs at all.
Alarm states: OK, ALARM, INSUFFICIENT_DATA.
`treatMissingData` options: missing (default), notBreaching, breaching, ignore.
You can create alarms on CloudWatch math expressions that combine log-derived metrics.
Metric filters support JSON extraction with `$.field` and space-delimited parsing.
High-resolution metrics (sub-minute) are not supported from log filters; only standard resolution (1-minute).
Alarm actions can include SNS, Lambda, Auto Scaling, and EC2 actions.
These come up on the exam all the time. Here's how to tell them apart.
Metric Filter on Log Group
Real-time: publishes metric data points continuously as logs arrive.
Used for automated alarms and dashboards.
Billed as a custom metric (per metric per month).
Limited to simple counting or field extraction.
Metrics are retained according to CloudWatch metrics retention (15 months).
CloudWatch Logs Insights Query
On-demand: run queries on historical log data.
Used for ad-hoc analysis and troubleshooting.
Costs per amount of data scanned.
Supports complex SQL-like queries, aggregations, and visualizations.
Results are not persisted as metrics unless exported.
Mistake
A metric filter must be created before the log group exists.
Correct
The log group must exist first. Metric filters are attached to existing log groups.
Mistake
The custom metric must be manually created in CloudWatch before creating the filter.
Correct
The metric is automatically created when the first data point is published by the filter. No manual creation is needed.
Mistake
Metric filters can only count occurrences; they cannot extract numeric values.
Correct
Metric filters can extract numeric values from log events using JSON or space-delimited parsing, allowing you to monitor things like latency or request size.
Mistake
An alarm will trigger immediately when a log event matches the filter.
Correct
The alarm evaluates based on the period and evaluation periods. It only triggers after the metric breaches the threshold for the specified number of consecutive periods.
Mistake
If no log events are generated, the alarm will stay in OK state.
Correct
By default, if no data points are published, the alarm enters INSUFFICIENT_DATA state. To keep it in OK, set `treatMissingData` to `notBreaching`. A `defaultValue` of 0 helps only when logs are still being ingested but none match.
Use the AWS CLI command `aws logs put-metric-filter` with a filter pattern like "ERROR" and metric transformation with metricValue=1. For example: `aws logs put-metric-filter --log-group-name /aws/lambda/myFunc --filter-name ErrorCount --filter-pattern "ERROR" --metric-transformations metricName=ErrorCount,metricNamespace=MyApp,metricValue=1,defaultValue=0`. This will publish a metric data point with value 1 for each log line containing "ERROR". Then create an alarm on that metric.
By default, no metric data point is published. The alarm has missing data for that period, which by default is treated as 'missing', and the alarm state may become INSUFFICIENT_DATA once every data point in the evaluation window is missing. If logs are still arriving but none match, set a `defaultValue` in the metric transformation (e.g., 0) so that a data point is published anyway. If no logs arrive at all, set `treatMissingData` on the alarm to `notBreaching` to treat missing as OK.
No. The log group must exist before you can attach a metric filter. The alarm is different: the CLI and API accept an alarm on a metric that has no data yet, and the alarm stays in INSUFFICIENT_DATA until data points arrive. In the console it is usually easier to create the filter, wait for data to appear, and then create the alarm.
A metric filter extracts data from log events and publishes it as a CloudWatch metric. A subscription filter sends log events to another destination like Lambda, Kinesis, or Amazon OpenSearch Service (formerly Amazon ES) for real-time processing. They are both filters but serve different purposes.
Create a metric filter to count errors. Then create an alarm with period=300 (5 minutes), evaluation periods=3, datapoints to alarm=3, statistic=Sum, comparison operator=GreaterThanThreshold, and threshold=your desired count. This ensures the alarm only triggers after three consecutive periods where the sum of errors exceeds the threshold.
Yes, if you use the CloudWatch agent to send logs from on-premises to CloudWatch Logs. Once the logs are in a log group, you can create metric filters on that log group just like for AWS services.
You need `logs:PutMetricFilter` to create the filter and `logs:DescribeLogGroups` to find the log group. To create an alarm, you need `cloudwatch:PutMetricAlarm`. If the alarm invokes a Lambda function, the function's resource-based policy must grant `lambda:InvokeFunction` to the CloudWatch alarms service principal.