SOA-C02 | Chapter 83 of 104 | Objective 1.2

Lambda for Automation and Remediation

This chapter covers AWS Lambda as a core automation and remediation engine for AWS environments, a key topic for the SOA-C02 exam under Domain 1 (Monitoring, Logging, and Remediation), Objective 1.2. You will learn how to design event-driven automation using Lambda, CloudWatch, EventBridge, and Systems Manager to detect and automatically remediate common operational issues. By this guide's estimate, roughly 10-15% of exam questions touch on Lambda-based automation, often in the context of responding to CloudWatch alarms or AWS Config rules.

25 min read
Intermediate
Updated May 31, 2026

Lambda as a Programmable Fire Alarm

Imagine a large building with a fire alarm system that does more than just make noise. Instead of a simple bell, the building has a central panel that monitors hundreds of sensors: smoke detectors, heat sensors, manual pull stations, and even water flow sensors on sprinklers. Each sensor sends a signal to the panel when triggered. The panel has a set of rules: if a smoke detector in the kitchen goes off, it first checks if the heat sensor also triggered; if yes, it calls the fire department, unlocks the kitchen doors, and turns on exhaust fans. If only a manual pull station is pulled, the panel announces a fire drill and locks all doors except exits. The panel can be reprogrammed at any time without rewiring the sensors. In this analogy, the sensors are AWS CloudWatch alarms or EventBridge events, the central panel is the EventBridge rule, and the rules are Lambda functions. The panel doesn't care what kind of sensor triggers it—it just executes the appropriate response. If a smoke detector triggers but the heat sensor is normal, the panel might decide to just log it and send an alert to maintenance. This is exactly how Lambda-based automation works: events trigger functions that execute custom logic, and you can change the logic without touching the event sources. The building's fire code is your IAM policy; the panel's programming is your Lambda code. Misconfigure the panel, and you might flood the building with water when someone burns toast.

How It Actually Works

What is Lambda and Why Use It for Automation?

AWS Lambda is a serverless compute service that executes code in response to events. For automation and remediation, Lambda functions are the workhorses that run custom logic when triggered by services like CloudWatch Alarms, EventBridge events, AWS Config rules, or even S3 events. The primary advantage is that you don't provision or manage servers; you only pay for execution time. This makes Lambda ideal for automating responses to operational events—for example, stopping an EC2 instance that is running an unauthorized instance type, or restarting a failed service on an EC2 instance via Systems Manager.

How Lambda Automation Works Internally

When an event source (like a CloudWatch alarm) triggers a Lambda function, the following sequence occurs:

1. The event source publishes an event to EventBridge (or directly invokes Lambda via service integration).

2. EventBridge evaluates rules to determine which Lambda function to invoke.

3. Lambda receives the event payload (a JSON object containing details about the event).

4. Lambda runs your code (Node.js, Python, Java, etc.) in a secure, isolated execution environment.

5. The function interacts with other AWS services using the AWS SDK (e.g., calling ec2.stop_instances).

6. Lambda logs the execution to CloudWatch Logs.

7. The function returns a response (optional).
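The sequence above can be sketched as a small Python handler. This is a minimal illustration, not a production function: the `make_handler` factory is a hypothetical helper that lets the EC2 client be injected (inside Lambda you would pass `boto3.client("ec2")`), and the policy of stopping every instance that enters the running state is an assumption chosen only to keep the example short. The `detail["instance-id"]` and `detail["state"]` keys follow the EC2 instance state-change event that EventBridge delivers.

```python
import json


def make_handler(ec2_client):
    """Return a Lambda-style handler bound to an EC2 client.

    In Lambda, ec2_client would be boto3.client("ec2"); it is injected
    here so the logic can be exercised without AWS access.
    """

    def handler(event, context):
        # Step 3: Lambda hands the JSON payload to the function as a dict.
        # Step 6: anything written to stdout ends up in CloudWatch Logs.
        print(json.dumps({"received": event}))

        detail = event.get("detail", {})
        instance_id = detail.get("instance-id")
        if event.get("source") != "aws.ec2" or not instance_id:
            return {"action": "ignored"}
        if detail.get("state") != "running":
            return {"action": "ignored"}

        # Step 5: call another AWS service through the SDK.
        ec2_client.stop_instances(InstanceIds=[instance_id])

        # Step 7: the return value is optional for async invocations.
        return {"action": "stopped", "instance": instance_id}

    return handler
```

Keeping the client outside the handler body also mirrors the common practice of creating SDK clients once per execution environment rather than once per invocation.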

Execution environment details: Each Lambda function runs in a sandbox with 512 MB of ephemeral storage by default (configurable up to 10 GB). The maximum execution timeout is 15 minutes. For automation, keep functions short—typically under 5 minutes. The function's IAM role must grant permissions to the resources it needs (e.g., ec2:StopInstances).

Key Components, Values, Defaults, and Timers

Memory: 128 MB to 10,240 MB (in 1 MB increments). CPU is proportional to memory.

Timeout: 1 second to 900 seconds (15 minutes). Default is 3 seconds.

Ephemeral storage: 512 MB to 10,240 MB.

Concurrency: Soft limit of 1,000 concurrent executions per region (can be increased).

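Reserved concurrency: Sets aside part of the account's concurrency for one function and also caps that function at the reserved amount, so it can neither be starved by other functions nor consume all available concurrency.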

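Invocation payload size: Up to 6 MB for synchronous invocations; asynchronous invocations have a much smaller limit (256 KB is the figure this exam's material uses, so check the current Lambda quotas page before relying on it).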

Deployment package: Up to 50 MB (zipped) for direct upload, 250 MB (unzipped) including layers.

Cold start: First invocation after idle period incurs additional latency (typically 100-500 ms).

Destinations: For async invocations, you can send success/failure records to SQS, SNS, etc.
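As a rough illustration of the limits listed above, the sketch below validates a configuration before it is sent to Lambda. The function name `build_function_config` is hypothetical; the returned keys (`MemorySize`, `Timeout`, `EphemeralStorage`) match the parameter names accepted by boto3's `update_function_configuration`, and the defaults mirror the values in this section.

```python
def build_function_config(memory_mb=128, timeout_s=3, ephemeral_mb=512):
    """Validate settings against the documented ranges and return kwargs
    suitable for lambda_client.update_function_configuration(
        FunctionName=..., **kwargs)."""
    if not 128 <= memory_mb <= 10240:
        raise ValueError("memory must be between 128 and 10,240 MB")
    if not 1 <= timeout_s <= 900:
        raise ValueError("timeout must be between 1 and 900 seconds")
    if not 512 <= ephemeral_mb <= 10240:
        raise ValueError("ephemeral storage must be between 512 and 10,240 MB")
    return {
        "MemorySize": memory_mb,
        "Timeout": timeout_s,
        "EphemeralStorage": {"Size": ephemeral_mb},
    }
```

For a simple remediation that makes one or two API calls, something like `build_function_config(memory_mb=256, timeout_s=30)` is usually a more sensible starting point than the 3-second default.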

Configuration and Verification Commands

Using AWS CLI:

# Create a Lambda function
aws lambda create-function --function-name my-remediation --runtime python3.9 --role arn:aws:iam::123456789012:role/lambda-remediation-role --handler lambda_function.lambda_handler --zip-file fileb://function.zip

# Update function code
aws lambda update-function-code --function-name my-remediation --zip-file fileb://function.zip

# Invoke function manually (test); AWS CLI v2 needs the binary-format flag to accept a raw JSON payload
aws lambda invoke --function-name my-remediation --cli-binary-format raw-in-base64-out --payload '{"key":"value"}' output.txt

# View recent logs (AWS CLI v2)
aws logs tail /aws/lambda/my-remediation --since 10m

How Lambda Interacts with Related Technologies

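CloudWatch Alarms: An alarm that changes state can trigger a Lambda function, classically through an SNS topic that the function subscribes to. The message delivered through SNS includes the alarm name, the new state, the state reason, and the metric trigger details.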

EventBridge: Rules match events from various sources (e.g., EC2 state changes, API calls via CloudTrail). The rule target can be a Lambda function.

AWS Config: Config rules can invoke a Lambda function to evaluate resource compliance (custom rule). The function receives a configuration item and reports COMPLIANT or NON_COMPLIANT back to AWS Config by calling PutEvaluations.

Systems Manager Automation: Lambda can be called from an SSM automation document (aws:invokeLambdaFunction step). Also, Lambda can call SSM Run Command to execute scripts on EC2 instances.

AWS Health: Personal Health Dashboard events can be sent to EventBridge to trigger remediation.

S3 Events: Object creation/deletion can invoke Lambda for processing.
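For the AWS Config integration in the list above, a custom rule function might build its evaluation roughly as follows. This is a sketch under assumptions: the allowed instance types are invented for illustration, and `build_evaluation` is a hypothetical helper. The event fields used (`invokingEvent` as a JSON string containing `configurationItem`) and the evaluation keys are the ones AWS Config custom rules work with.

```python
import json


def build_evaluation(event, allowed_types=("t3.micro", "t3.small")):
    """Turn an AWS Config custom-rule event into one evaluation dict."""
    # invokingEvent arrives as a JSON string, not a nested dict.
    invoking = json.loads(event["invokingEvent"])
    item = invoking["configurationItem"]

    if item["resourceType"] != "AWS::EC2::Instance":
        compliance = "NOT_APPLICABLE"
    elif item["configuration"]["instanceType"] in allowed_types:
        compliance = "COMPLIANT"
    else:
        compliance = "NON_COMPLIANT"

    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": compliance,
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }
```

Inside the handler, the result would be reported with `config_client.put_evaluations(Evaluations=[evaluation], ResultToken=event["resultToken"])`; the function's return value alone does not record compliance.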

Best Practices for Automation with Lambda

Idempotency: Ensure your function can be safely retried. Use idempotency tokens if needed.

Error handling: Use DLQ (Dead Letter Queue) for async invocations to capture failures.

Logging: Always log the event payload for debugging. Use structured logging.

Permissions: Follow least privilege. Use separate IAM roles per function.

Testing: Test with sample events from the source service.

Monitoring: Set up CloudWatch alarms on function errors, throttles, and duration.

Common Automation Patterns

1. Auto-remediate non-compliant resources: An AWS Config rule triggers a Lambda that fixes the resource (e.g., enable S3 bucket encryption).

2. Stop idle resources: A CloudWatch alarm on low CPU triggers a Lambda that stops an EC2 instance.

3. Rotate credentials: A scheduled EventBridge rule (cron) triggers a Lambda that rotates RDS passwords.

4. Respond to health events: A Personal Health Dashboard event triggers a Lambda that replaces an impaired EC2 instance.

Limits and Constraints

Lambda function code cannot assume it runs on a specific machine. Ephemeral storage is not persistent.

Functions cannot have a static IP (unless using a VPC with a NAT Gateway).

Maximum execution duration is 15 minutes. For longer tasks, use Step Functions or ECS.

If a function is invoked too quickly, it may hit concurrency limits. Use reserved concurrency to guarantee capacity.

Security Considerations

Never hardcode credentials; use IAM roles and environment variables for configuration.

If the function needs to access resources in a VPC, configure it to run in the VPC (which adds ENIs and cold start latency).

Use KMS to encrypt environment variables.

Validate event sources using resource-based policies or IAM conditions.

Step-by-Step: Setting Up a Remediation Function

1. Create the Lambda function with appropriate IAM role.

2. Write the code to parse the event and perform remediation (e.g., stop EC2 instance).

3. Create a CloudWatch alarm that triggers an SNS topic.

4. Subscribe the Lambda function to the SNS topic.

5. Test by putting the alarm into ALARM state.
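The final test step can be scripted. The sketch below assumes a CloudWatch client is passed in (in practice `boto3.client("cloudwatch")`); `force_alarm` is a hypothetical helper name, while `set_alarm_state` and its three parameters are the real boto3 call.

```python
def force_alarm(cloudwatch_client, alarm_name):
    """Temporarily push an alarm into ALARM so its actions fire."""
    cloudwatch_client.set_alarm_state(
        AlarmName=alarm_name,
        StateValue="ALARM",
        StateReason="Manual test of the remediation path",
    )
    return alarm_name
```

The forced state is temporary: the alarm returns to its real state at the next evaluation, which is convenient for testing because nothing has to be reset by hand.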

Troubleshooting

If the function doesn't trigger, check the SNS subscription confirmation.

If the function fails, check CloudWatch Logs for error messages.

If the function times out, increase the timeout or optimize code.

If permissions fail, check the IAM role for the function.

Walk-Through

1. Define the Automation Goal

Clearly identify the operational problem you want to solve. For example: 'Automatically stop EC2 instances that have been running for more than 24 hours with CPU utilization below 1%.' This step determines the event source (CloudWatch alarm), the target resource (EC2 instances), and the remediation action (stop). On the exam, you must choose the correct combination of services to achieve the goal.

2. Create the Lambda Function

Write the remediation code in a supported runtime (a current Python 3.x or Node.js release, for example). The function must parse the incoming event to extract the resource ID (e.g., instance ID from the alarm message). Use the AWS SDK to perform the action. Ensure the function's IAM role grants only the necessary permissions (e.g., ec2:StopInstances). Set the timeout appropriately (e.g., 30 seconds). Test the function manually with a sample event.
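Extracting the instance ID from an alarm delivered through SNS might look like the sketch below. The helper name is hypothetical, and the key names inside the alarm message (`NewStateValue`, `Trigger`, and lowercase `name`/`value` in each dimension) are written from the commonly seen message format; treat them as an assumption and confirm them against a sample event captured from your own alarm.

```python
import json


def instance_ids_from_sns_alarm(event):
    """Collect InstanceId dimensions from SNS-wrapped CloudWatch alarms."""
    ids = []
    for record in event.get("Records", []):
        # SNS delivers the alarm as a JSON string in the Message field.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue  # ignore OK and INSUFFICIENT_DATA transitions
        for dim in message.get("Trigger", {}).get("Dimensions", []):
            if dim.get("name") == "InstanceId":
                ids.append(dim["value"])
    return ids
```

Returning an empty list for non-ALARM transitions lets the handler exit quietly when the same topic also carries OK notifications.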

3. Configure the Event Source

Choose the event source that will trigger the function. For CloudWatch alarms, create an SNS topic and subscribe the Lambda function to it. For EventBridge, create a rule that matches specific events (e.g., EC2 instance state change) and set the target to the Lambda function. For AWS Config, create a custom rule that invokes the Lambda function. Verify that the function is invoked when the event occurs.
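To make the idea of an EventBridge rule matching specific events concrete, here is a deliberately simplified matcher. It is not EventBridge's implementation: it handles only exact-value lists and nested objects, and ignores prefix, numeric, and other advanced operators. The `matches` function is hypothetical; the pattern shape and the EC2 state-change field names are the real ones.

```python
def matches(pattern, event):
    """Simplified event-pattern check: every pattern key must be present,
    and the event value must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern, e.g. the "detail" section.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Pattern for "an EC2 instance entered the stopped state".
STOPPED_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}
```

Fields that the pattern does not mention are ignored, which is why a short pattern can match events that carry many more attributes.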

4. Set Up Monitoring and Logging

Lambda writes execution logs to CloudWatch Logs automatically, provided the execution role allows logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents, so confirm those permissions first. Create CloudWatch alarms on the function's metrics: Errors, Throttles, Duration, and Invocations. Set up a dashboard to monitor remediation activity. Also configure a Dead Letter Queue (DLQ) for async invocations to capture failed events for retry analysis.
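An alarm on the function's Errors metric could be defined along these lines. The helper name, the 5-minute period, and the threshold of one error are illustrative assumptions; the keys are real `put_metric_alarm` parameters, and `AWS/Lambda` with the `FunctionName` dimension is where Lambda publishes its metrics.

```python
def lambda_error_alarm(function_name, topic_arn):
    """Return kwargs for cloudwatch_client.put_metric_alarm(**kwargs)."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,                # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 1,               # alarm on the first error in a window
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no invocations is not a failure
        "AlarmActions": [topic_arn],
    }
```

The same shape works for Throttles by changing `MetricName`; send the notification to a topic that does not itself depend on the function being monitored.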

5. Test and Validate the Automation

Simulate the event that triggers the automation. For instance, manually stop an instance or create a non-compliant resource. Verify that the Lambda function executes and the remediation action is taken. Check CloudWatch Logs for any errors. Ensure idempotency: if the function runs multiple times, it should not cause unintended side effects. Finally, document the automation for operational handover.

What This Looks Like on the Job

Enterprise Scenario 1: Auto-Stop Idle EC2 Instances

A large e-commerce company runs hundreds of EC2 instances for development and testing. Developers often forget to stop instances at the end of the day, wasting thousands of dollars monthly. The solution: a CloudWatch alarm that monitors CPU utilization (average <1% for 2 hours) publishes to an SNS topic. A Lambda function subscribed to the topic receives the alarm message, extracts the instance ID from the alarm dimensions, and calls the EC2 StopInstances API (stop_instances in boto3). The function also tags the instance with 'AutoStopped: true' and sends a notification to the developer's Slack channel via webhook. In production, the function must handle instances that are already stopped (idempotency) and avoid stopping instances with specific tags (e.g., 'AutoStop: false'). The team also implemented a scheduled EventBridge rule that starts instances at 9 AM Monday-Friday. Performance consideration: Lambda concurrency must be reserved to handle a burst of alarms if many instances become idle simultaneously. Misconfiguration: If the IAM role lacks ec2:DescribeInstances, the function cannot verify the instance state or read its tags before stopping, leading to errors; tagging the instance likewise requires ec2:CreateTags.
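The idempotency and opt-out rules in this scenario could be sketched as follows. `stop_if_eligible` is a hypothetical helper and the return strings are invented for illustration; `describe_instances`, `stop_instances`, and `create_tags` are the boto3 EC2 calls, and the client is injected so the logic can be tested offline.

```python
def stop_if_eligible(ec2_client, instance_id):
    """Stop a running instance unless it is tagged AutoStop=false."""
    resp = ec2_client.describe_instances(InstanceIds=[instance_id])
    instance = resp["Reservations"][0]["Instances"][0]
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}

    # Opt-out tag wins over everything else.
    if tags.get("AutoStop", "").lower() == "false":
        return "skipped-opt-out"

    # Idempotency: a repeated alarm for a stopped instance is a no-op.
    if instance["State"]["Name"] != "running":
        return "skipped-not-running"

    ec2_client.stop_instances(InstanceIds=[instance_id])
    ec2_client.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "AutoStopped", "Value": "true"}],
    )
    return "stopped"
```

Checking state first is what makes the role need `ec2:DescribeInstances` in addition to `ec2:StopInstances` and `ec2:CreateTags`.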

Enterprise Scenario 2: Remediate Unencrypted S3 Buckets

A financial services company must ensure all S3 buckets have default encryption enabled. They use AWS Config with a custom rule that invokes a Lambda function every time a bucket is created or changed. The function checks if the bucket has AES256 encryption; if not, it applies the encryption policy using put_bucket_encryption. The function also logs the remediation to CloudWatch Logs. In production, the function must handle buckets in other accounts (via cross-account roles) and avoid remediating buckets that are intentionally unencrypted (e.g., static website buckets). The company set up a CloudWatch alarm on Lambda errors to detect when the function fails due to insufficient permissions. Common pitfall: The function may exceed the 15-minute timeout if processing many buckets; they use batch processing with pagination.
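A minimal version of the check-then-remediate logic for one bucket is sketched below. `ensure_default_encryption` is a hypothetical helper, the S3 client is injected, and the error is inspected through the exception's `response` attribute (the shape botocore's `ClientError` exposes) so the block does not need to import botocore. The error code string is the one S3 returns when a bucket has no encryption configuration.

```python
NOT_FOUND = "ServerSideEncryptionConfigurationNotFoundError"


def ensure_default_encryption(s3_client, bucket):
    """Apply SSE-S3 (AES256) default encryption if none is configured."""
    try:
        s3_client.get_bucket_encryption(Bucket=bucket)
        return "already-encrypted"
    except Exception as exc:
        # botocore ClientError carries the service error in exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code != NOT_FOUND:
            raise  # permissions or other failures should surface

    s3_client.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    )
    return "remediated"
```

Re-raising unexpected errors matters here: an access-denied response should fail the invocation and trip the Lambda error alarm rather than be mistaken for a missing configuration.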

Enterprise Scenario 3: Auto-Rotate RDS Database Passwords

A healthcare provider needs to rotate RDS master passwords every 90 days for compliance. They use a scheduled EventBridge rule (cron expression '0 0 1 */3 ? *') that triggers a Lambda function. The function generates a new password using Secrets Manager, updates the RDS instance, and stores the new password in Secrets Manager. The function also updates the parameter in the application's configuration (e.g., via SSM Parameter Store). To avoid downtime, the function performs the rotation during a maintenance window. The IAM role must allow secretsmanager:GetRandomPassword, rds:ModifyDBInstance, and secretsmanager:PutSecretValue. Performance: The function must handle up to 50 RDS instances; it uses parallel processing with asyncio. Misconfiguration: If the function fails mid-rotation, the password may be inconsistent; they implement a rollback step using Step Functions.
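The core of the rotation described here can be sketched in a few lines. This is an illustration only: `rotate_master_password` is a hypothetical helper, the secret's JSON layout (`{"password": ...}`) is an assumption, and the rollback handling the scenario mentions is left out. The three SDK calls are the boto3 operations behind the permissions listed above, with both clients injected.

```python
import json


def rotate_master_password(secrets_client, rds_client, db_instance_id, secret_id):
    """Generate a password, apply it to RDS, then record it in the secret."""
    new_password = secrets_client.get_random_password(
        PasswordLength=32, ExcludePunctuation=True
    )["RandomPassword"]

    # Change the database first: if this call fails, the stored secret
    # still holds the password that actually works.
    rds_client.modify_db_instance(
        DBInstanceIdentifier=db_instance_id,
        MasterUserPassword=new_password,
        ApplyImmediately=True,
    )

    secrets_client.put_secret_value(
        SecretId=secret_id,
        SecretString=json.dumps({"password": new_password}),
    )
    return secret_id
```

The password change on the instance is applied asynchronously, so a gap remains between the two writes; that window is the reason the scenario adds an orchestrated rollback step.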

How SOA-C02 Actually Tests This

What SOA-C02 Tests on This Topic (Objective 1.2)

The exam focuses on your ability to design and implement automated remediation using Lambda, CloudWatch, EventBridge, and Systems Manager. Specific objectives include:

- 1.2: Automate remediation of operational issues using AWS services.

- You must know how to trigger a Lambda function from CloudWatch alarms, EventBridge rules, and AWS Config rules.

- You must understand how to grant permissions to Lambda using IAM roles.

- You must be able to troubleshoot why a remediation did not occur.

Common Wrong Answers and Why Candidates Choose Them

1. Wrong answer: 'Use EC2 Auto Scaling to stop idle instances.'
Why chosen: Candidates confuse Auto Scaling (which launches/terminates based on load) with simple stop/start automation.
Correct: Use CloudWatch alarm + Lambda to stop instances.

2. Wrong answer: 'Create a Lambda function that directly reads CloudWatch metrics.'
Why chosen: Candidates think Lambda can poll metrics; but Lambda is event-driven, not a poller.
Correct: Use CloudWatch alarm to send event to Lambda.

3. Wrong answer: 'Use AWS Config to automatically remediate non-compliant resources without Lambda.'
Why chosen: AWS Config does offer remediation actions (e.g., enable encryption), which run Systems Manager Automation documents, so candidates assume they cover every case.
Correct: When the question calls for custom evaluation or remediation logic, use a Lambda-backed custom rule.

4. Wrong answer: 'Set the Lambda timeout to 15 minutes for all automation.'
Why chosen: Candidates remember max is 15 min, but for simple API calls (e.g., stop instance), 30 seconds is enough.
Correct: Set timeout based on the actual task; 15 min is for long-running tasks.

Specific Numbers, Values, and Terms That Appear on the Exam

Lambda timeout: default 3 seconds, max 15 minutes (900 seconds).

Memory: 128 MB to 10,240 MB.

Concurrency: default 1,000 per region.

Invocation payload: 6 MB (sync); a much smaller limit for async (256 KB is the figure this exam's material uses).

DLQ: for async invocations only.

Reserved concurrency: guarantees capacity for a function and also caps it at that amount.

Provisioned concurrency: pre-warms instances to avoid cold starts.

Cold start: adds latency (typically <1 second).

IAM role: must have permissions for the actions the function performs.

Edge Cases and Exceptions the Exam Loves to Test

Idempotency: The exam asks what happens if the function is invoked twice. Answer: The second invocation should succeed without side effects (e.g., stopping an already stopped instance).

Concurrency limits: If many alarms trigger simultaneously, some invocations may be throttled. Solution: reserved concurrency.

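VPC access: If the function needs to access an RDS database in a VPC, it must be configured to run in the VPC with subnets and a security group. Once attached, it needs a NAT gateway or VPC endpoints to reach AWS service APIs or the internet, and the network setup may add some cold start overhead.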

Cross-account remediation: Lambda can assume a role in another account to remediate resources.

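Error handling: Alarm actions fire on state transitions, so a failed invocation is not re-sent by the alarm while it stays in ALARM. Rely on Lambda's asynchronous retries and use a DLQ to capture failures.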
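The cross-account case from the list above usually comes down to one STS call. In this sketch the helper name and the default session name are hypothetical, and the STS client is injected; `assume_role` and the `Credentials` keys are the real boto3 response shape, and the returned keyword arguments are the ones `boto3.client` accepts.

```python
def cross_account_client_kwargs(sts_client, role_arn, session_name="remediation"):
    """Assume a role in another account and return credential kwargs
    that can be passed to boto3.client(service, **kwargs)."""
    creds = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
    )["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

For this to work, the function's execution role needs `sts:AssumeRole` on the target role, and the target role's trust policy must allow the execution role as a principal.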

How to Eliminate Wrong Answers Using the Underlying Mechanism

If a question asks 'Which service should you use to automatically stop an EC2 instance when CPU is low?', remember that Lambda is event-driven. The flow is: CloudWatch alarm -> SNS -> Lambda -> stop instance. Any answer that suggests a different flow (e.g., 'Lambda polls CloudWatch metrics') is wrong because Lambda does not poll. Also, if an answer uses 'AWS Config' for a metric-based alarm, it's wrong because Config is for configuration compliance, not metrics. Use the mechanism: identify the event source, the action, and the permissions.

Key Takeaways

Lambda is the primary compute service for custom automation and remediation in AWS.

CloudWatch alarms trigger Lambda via SNS; EventBridge rules trigger Lambda directly.

Lambda execution timeout max is 15 minutes; set it based on the task duration.

Always assign an IAM execution role with least privilege to the Lambda function.

Use reserved concurrency to guarantee capacity for critical remediation functions.

For async invocations, configure a DLQ to capture failed events for analysis.

Idempotency is crucial: ensure multiple invocations do not cause unintended side effects.

Cold starts add latency; use provisioned concurrency for latency-sensitive functions.

Lambda can be invoked by AWS Config custom rules to evaluate and remediate compliance.

Test Lambda functions with sample events from the source service before production deployment.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Lambda with CloudWatch Alarms

Triggered only when alarm state changes to ALARM, OK, or INSUFFICIENT_DATA

Payload contains metric data, alarm name, and state

Classic exam pattern uses an SNS topic as intermediary (alarm -> SNS -> Lambda)

Best for metric-based thresholds (e.g., CPU > 90%)

Limited to CloudWatch metric alarms; cannot respond to non-metric events

Lambda with EventBridge Rules

Triggered by any event from AWS services, CloudTrail, or custom sources

Payload is the original event (e.g., EC2 state change, API call)

Direct integration: EventBridge rule -> Lambda (no SNS needed)

Best for API call events (e.g., RunInstances) or schedule-based triggers

More flexible; can match events based on JSON pattern matching

Watch Out for These

Mistake

Lambda can be triggered directly by a CloudWatch alarm without any intermediate service.

Correct

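In the pattern SOA-C02 expects, the alarm publishes to an SNS topic and the Lambda function subscribes to that topic. Alternatively, use EventBridge to route alarm state changes to Lambda. CloudWatch has since added Lambda functions as a native alarm action, so treat "no direct invocation" as the exam's model rather than a current service limit.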

Mistake

Lambda functions automatically retry on failure.

Correct

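For synchronous invocations, Lambda itself does not retry; the caller decides whether to try again. For asynchronous invocations, Lambda retries function errors twice by default (3 attempts total), waiting about 1 minute before the first retry and 2 minutes before the second. Configure a DLQ or an on-failure destination to capture events that still fail.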
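Retry behaviour for asynchronous invocations is configurable. The sketch below builds the settings with basic validation; the helper name and the one-hour default age are assumptions, while the keys are real parameters of boto3's `put_function_event_invoke_config`. Note that it uses an on-failure destination, which is a separate mechanism from a classic DLQ set on the function configuration.

```python
def async_failure_config(function_name, failure_queue_arn, retries=2, max_age_s=3600):
    """Return kwargs for lambda_client.put_function_event_invoke_config."""
    if retries not in (0, 1, 2):
        raise ValueError("retries must be 0, 1, or 2")
    if not 60 <= max_age_s <= 21600:
        raise ValueError("event age must be between 60 and 21,600 seconds")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": retries,
        "MaximumEventAgeInSeconds": max_age_s,
        "DestinationConfig": {
            "OnFailure": {"Destination": failure_queue_arn},
        },
    }
```

Setting `retries=0` can be the safer choice for remediation that is not idempotent, since a failed attempt is then sent to the failure queue once instead of being replayed.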

Mistake

You can use Lambda to stop an EC2 instance without any IAM role.

Correct

Lambda always needs an IAM execution role. To stop an EC2 instance, the role must include the ec2:StopInstances permission. Without it, the function fails with an access denied error.

Mistake

Lambda can run indefinitely as long as it is actively processing.

Correct

Lambda has a maximum execution timeout of 15 minutes (900 seconds). If the function runs longer, it is forcibly terminated. For long-running tasks, use AWS Step Functions or ECS.

Mistake

EventBridge can trigger Lambda only for AWS service events, not custom events.

Correct

EventBridge can process custom events published by your own applications via the PutEvents API. You can create rules that match custom event patterns and invoke Lambda.


Frequently Asked Questions

How do I trigger a Lambda function from a CloudWatch alarm?

Create an SNS topic, subscribe your Lambda function to it, then configure the CloudWatch alarm to send notifications to that SNS topic. When the alarm changes state, SNS invokes the Lambda function with the alarm payload. Alternatively, use EventBridge: create a rule that matches alarm state change events (via CloudWatch Alarm state change events) and set the target to your Lambda function.

What permissions does a Lambda function need to stop an EC2 instance?

The Lambda execution role must include an IAM policy that allows ec2:StopInstances. Optionally, include ec2:DescribeInstances to verify the instance state before stopping. The policy should specify the resource ARN of the instances or use a condition to restrict to specific tags. Example statement: { "Effect": "Allow", "Action": "ec2:StopInstances", "Resource": "arn:aws:ec2:us-east-1:123456789012:instance/*" }.
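A fuller policy with a tag condition might be generated like this. The helper name and the `AutoStop=true` tag are illustrative assumptions; the policy grammar, the `aws:ResourceTag/` condition key, and the need for `"Resource": "*"` on `ec2:DescribeInstances` (which does not support resource-level permissions) are standard IAM behaviour.

```python
import json


def stop_instances_policy(region, account_id, tag_key="AutoStop", tag_value="true"):
    """Build a least-privilege policy document for the remediation role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ec2:StopInstances",
                "Resource": f"arn:aws:ec2:{region}:{account_id}:instance/*",
                # Only instances carrying the tag can be stopped.
                "Condition": {
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": tag_value}
                },
            },
            {
                "Effect": "Allow",
                "Action": "ec2:DescribeInstances",
                "Resource": "*",
            },
        ],
    }


# JSON text ready to paste into the console or pass to an IAM API call.
POLICY_JSON = json.dumps(stop_instances_policy("us-east-1", "123456789012"), indent=2)
```

Generating the document from code keeps the JSON valid (double quotes, no trailing commas), which hand-written policies often get wrong.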

Can Lambda be triggered by a scheduled event?

Yes, you can use EventBridge to create a scheduled rule (cron or rate expression) that invokes a Lambda function at regular intervals. For example, a rate of 5 minutes or a cron expression for daily at midnight. This is useful for periodic tasks like cleaning up old snapshots or rotating credentials.
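The two schedule formats mentioned above have a strict syntax, which the sketch below encodes. The helper name is hypothetical; the `rate(...)` grammar (singular unit when the value is 1) and the six-field `cron(...)` form are EventBridge's own, and the resulting string is what goes into the `ScheduleExpression` parameter of `put_rule`.

```python
def rate_expression(value, unit):
    """Build an EventBridge rate expression such as 'rate(5 minutes)'."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be minute, hour, or day")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # EventBridge requires the singular unit when the value is 1.
    plural = unit if value == 1 else unit + "s"
    return f"rate({value} {plural})"


# Six fields: minutes hours day-of-month month day-of-week year (UTC).
# Either day-of-month or day-of-week must be '?'.
DAILY_AT_MIDNIGHT_UTC = "cron(0 0 * * ? *)"
```

For example, `rate_expression(5, "minute")` yields `rate(5 minutes)`, matching the five-minute example in the answer above.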

What happens if a Lambda function times out?

Lambda terminates the execution and logs a "Task timed out" error to CloudWatch Logs. For synchronous invocations through the Invoke API, the response is flagged as a function error and its payload carries the timeout message instead of a normal result. For asynchronous invocations, Lambda retries twice. To avoid timeouts, increase the function timeout or optimize the code. For long-running tasks, use Step Functions.

How do I debug a Lambda function that is not triggering?

First, check the event source configuration: verify that the SNS subscription is confirmed (if using SNS) or that the EventBridge rule is enabled and has the correct target. Check CloudWatch Logs for any invocation logs. If no logs appear, the function may not be invoked due to permission issues (e.g., SNS not allowed to invoke Lambda). Check the Lambda function's resource-based policy, which must allow the invoking service (sns.amazonaws.com or events.amazonaws.com) to call lambda:InvokeFunction; the execution role only governs what the function itself can do once it runs.

What is the difference between reserved concurrency and provisioned concurrency?

Reserved concurrency guarantees that a function has a specific number of concurrent executions available, preventing it from being throttled by other functions. Provisioned concurrency pre-warms a specified number of execution environments to eliminate cold starts. Reserved concurrency is for controlling capacity; provisioned concurrency is for reducing latency.

Can a Lambda function be invoked by multiple event sources simultaneously?

Yes, a single Lambda function can be the target of multiple EventBridge rules, SNS topics, S3 events, etc. Each invocation is independent. However, if the function is not idempotent, simultaneous invocations may cause race conditions. Use idempotency logic or a mutex (e.g., DynamoDB lock) to handle this.
