SOA-C02 | Chapter 38 of 104 | Objective 3.2

SSM Automation Runbooks

This chapter covers AWS Systems Manager Automation Runbooks, a powerful feature for automating operational tasks on AWS resources. You will learn how to create, execute, and manage runbooks to perform common SysOps tasks such as patching, AMI creation, and incident remediation. This topic appears in approximately 5-8% of SOA-C02 exam questions, primarily in the Deployment, Provisioning, and Automation domain (Objective 3.2). Mastering Automation Runbooks is essential for demonstrating your ability to implement repeatable, auditable, and safe automation at scale.

25 min read
Intermediate
Updated May 31, 2026

SSM Automation as a Hotel Maintenance Robot

Imagine a hotel with 1,000 rooms, each containing servers that need periodic maintenance. The hotel has a maintenance robot that can be programmed with runbooks: step-by-step instructions like 'Check server temperature, if > 80°C, throttle CPU; then update firmware; then reboot.' The robot can execute these runbooks on any room's server, but it needs a small agent installed in each room to receive commands and report back. The hotel manager (SysOps) can run a runbook on a single room, a floor (using tags), or the entire building. The robot also has safety features: it can pause (via an approval step) before a risky action like a reboot, and it can jump to a recovery step if something goes wrong. The robot's actions are logged in a central journal (CloudTrail). If the robot encounters an error, it can stop the entire runbook or skip that step and continue. This is how AWS Systems Manager Automation works: a service that runs predefined or custom runbooks (YAML or JSON documents) against EC2 instances or other AWS resources, relying on the SSM Agent for steps that run commands on an instance, with support for approval, error handling, and integration with other AWS services.

How It Actually Works

What is SSM Automation?

AWS Systems Manager Automation is a service that enables you to safely automate common and repetitive IT operations and management tasks. An Automation Runbook is a document that defines the actions that Systems Manager performs on your AWS resources. Runbooks can be either predefined by AWS (owned by Amazon) or custom-written by you. They are written in YAML or JSON and consist of a series of steps, each of which can invoke an AWS API action (like aws:executeAwsApi), run a script on an instance (aws:runCommand), or perform conditional logic (aws:branch).

Why Automation Runbooks Exist

Before Automation, SysOps engineers had to manually perform repetitive tasks like patching instances, creating AMIs, or responding to CloudWatch alarms. This was error-prone, time-consuming, and difficult to audit. Automation Runbooks solve this by providing a declarative, version-controlled, and auditable way to define and execute operational procedures. They integrate with IAM for fine-grained permissions, CloudTrail for logging, and CloudWatch Events (now Amazon EventBridge) for triggering based on events.

How Automation Works Internally

When you start an Automation execution, the Systems Manager service reads the runbook document and begins executing steps sequentially. Each step has an action that maps to an AWS API call or a Systems Manager command. The service uses the IAM role associated with the execution (the Automation Assume Role) to perform actions on your behalf. For steps that run commands on EC2 instances, the SSM Agent on the instance executes the command and returns the output. The Automation service tracks the status of each step (Success, Failed, TimedOut, Cancelled) and can branch based on conditions. If a step fails, the onFailure property determines whether the runbook stops, continues with the next step, or jumps to another named step (such as a rollback step).

Key Components and Defaults

Runbook Document: A YAML or JSON file with a schemaVersion (currently 0.3), description, mainSteps array, and optional parameters. Each step has: name, action, inputs, maxAttempts (default 1), timeoutSeconds (default varies by action, e.g., 3600 for aws:runCommand), and onFailure (default Abort).

Automation Execution: An instance of a runbook being run. It has an execution ID and a status (Pending, InProgress, Success, Failed, TimedOut, Cancelled).

Automation Assume Role: An IAM role that Automation assumes to perform actions. This role must have permissions for all actions in the runbook.

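Rate Control: For multi-target executions, you can set MaxConcurrency (how many targets are processed at the same time; the StartAutomationExecution API defaults to 10) and MaxErrors (how many errors are tolerated before Automation stops starting the runbook on additional targets). Both accept an absolute number or a percentage of the target set.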

Approval Steps: Use aws:approve to pause execution until a user approves or denies via the console or CLI.
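To make the components above concrete, here is a minimal sketch that assembles a runbook as a Python dictionary and serializes it to JSON, which Automation accepts alongside YAML. The runbook itself (one step that stops an instance), the step name `stopInstance`, and the 600-second timeout are illustrative assumptions, not values from this chapter.

```python
import json

# Hypothetical single-step runbook; names and values are illustrative only.
runbook = {
    "schemaVersion": "0.3",
    "description": "Stop an EC2 instance (illustrative example).",
    "assumeRole": "{{ AutomationAssumeRole }}",
    "parameters": {
        "InstanceId": {"type": "String", "description": "Instance to stop."},
        "AutomationAssumeRole": {
            "type": "String",
            "description": "ARN of the role Automation assumes.",
            "default": "",
        },
    },
    "mainSteps": [
        {
            "name": "stopInstance",
            "action": "aws:changeInstanceState",
            "maxAttempts": 1,        # the default, written out for clarity
            "timeoutSeconds": 600,   # assumed value for this sketch
            "onFailure": "Abort",    # the default, written out for clarity
            "inputs": {
                "InstanceIds": ["{{ InstanceId }}"],
                "DesiredState": "stopped",
            },
        }
    ],
}


def runbook_json(doc):
    """Serialize the runbook dictionary to JSON document content."""
    return json.dumps(doc, indent=2)


print(runbook_json(runbook))
```

The printed JSON could be saved to a file and passed to `aws ssm create-document` with `--document-type Automation` and `--document-format JSON`.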

Configuration and Verification Commands

To create a custom runbook, you can write a document and store it in Systems Manager Documents (under Automation). Example CLI command to create a document:

aws ssm create-document \
  --content file://myRunbook.yaml \
  --name "MyCustomRunbook" \
  --document-type "Automation" \
  --document-format YAML

To start an execution:

aws ssm start-automation-execution \
  --document-name "MyCustomRunbook" \
  --parameters "key=value" \
  --target-parameter-name "InstanceId" \
  --targets "Key=tag:Environment,Values=Production"

To check execution status:

aws ssm describe-automation-executions --filters "Key=DocumentNamePrefix,Values=MyCustomRunbook"
aws ssm get-automation-execution --automation-execution-id "execution-id"

Interaction with Related Technologies

AWS Config: Runbooks can be triggered by Config rules to remediate non-compliant resources (e.g., open security groups).

Amazon EventBridge: You can create rules that trigger Automation in response to events (e.g., EC2 instance state change).

AWS CloudTrail: All Automation API calls are logged, providing an audit trail.

AWS Lambda: Runbooks can invoke Lambda functions via aws:invokeLambdaFunction for custom logic.

AWS Step Functions: For complex workflows, you might use Step Functions instead of Automation, but Automation is simpler for straightforward operational tasks.

Step Details and Execution Flow

Each step in a runbook can be one of many action types. Common ones include:

`aws:executeAwsApi`: Calls an AWS API operation (e.g., the EC2 CreateImage operation). Inputs include Service, Api, and the request parameters of the operation being called.

`aws:runCommand`: Sends a command to one or more instances via SSM Run Command. Inputs include DocumentName, Parameters, and Targets.

`aws:invokeLambdaFunction`: Invokes a Lambda function.

`aws:branch`: Chooses a branch based on a condition (e.g., StringEquals).

`aws:sleep`: Pauses execution for a specified Duration in ISO 8601 format (e.g., PT10M for ten minutes) or until a given Timestamp.

`aws:approve`: Waits for manual approval.

Steps can also use outputs to capture data from one step and pass it to subsequent steps via {{ stepName.outputName }}.
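As a sketch of how outputs flow between steps, the two steps below create an AMI and then wait for it to become available, with the second step reading the first step's `ImageId` output. The step names and the small `undefined_references` checker are assumptions made for illustration; the checker is not part of Systems Manager.

```python
import re

# Two illustrative steps; step names are hypothetical.
main_steps = [
    {
        "name": "createImage",
        "action": "aws:executeAwsApi",
        "inputs": {
            "Service": "ec2",
            "Api": "CreateImage",
            "InstanceId": "{{ InstanceId }}",
            "Name": "backup-{{ InstanceId }}",
        },
        # Selector is a JSONPath into the API response.
        "outputs": [{"Name": "ImageId", "Selector": "$.ImageId", "Type": "String"}],
    },
    {
        "name": "waitForImage",
        "action": "aws:waitForAwsResourceProperty",
        "inputs": {
            "Service": "ec2",
            "Api": "DescribeImages",
            "ImageIds": ["{{ createImage.ImageId }}"],
            "PropertySelector": "$.Images[0].State",
            "DesiredValues": ["available"],
        },
    },
]

# Matches {{ stepName.outputName }} references (not plain {{ Parameter }} ones).
REF = re.compile(r"\{\{\s*(\w+)\.(\w+)\s*\}\}")


def undefined_references(steps):
    """Return (step, output) references that no earlier step declares."""
    declared, missing = set(), []
    for step in steps:
        for ref in REF.findall(str(step.get("inputs", {}))):
            if ref not in declared:
                missing.append(ref)
        for out in step.get("outputs", []):
            declared.add((step["name"], out["Name"]))
    return missing


print(undefined_references(main_steps))
```

A check like this can catch a misspelled step or output name before the document is uploaded, since Automation would otherwise only report the problem when the document is created or run.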

Error Handling and Rollback

`onFailure`: Can be Abort (stop entire execution, default), Continue (move on to the next step), or step:stepName (jump to a named step, typically one that cleans up or rolls back).

`maxAttempts`: Number of times to retry a step on failure (default 1).

`timeoutSeconds`: Maximum time a step can run before being considered failed.
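The control flow these settings produce can be sketched with a toy model. The simulator below is not how the service is implemented; it only mirrors the documented choices (Abort, Continue, step:stepName, plus the `isEnd` step property) to show which steps would be attempted. Retries via maxAttempts are left out, and all step names are hypothetical.

```python
def simulate(steps, failing):
    """Return the order in which steps are attempted.

    steps:   list of dicts with "name" and optional "onFailure" / "isEnd".
    failing: set of step names assumed to fail on every attempt.
    """
    index = {s["name"]: i for i, s in enumerate(steps)}
    attempted, i = [], 0
    # The length guard protects this toy model against jump loops.
    while i < len(steps) and len(attempted) < 2 * len(steps):
        step = steps[i]
        attempted.append(step["name"])
        if step["name"] not in failing:
            if step.get("isEnd"):
                break          # successful final step ends the run
            i += 1
            continue
        policy = step.get("onFailure", "Abort")
        if policy == "Abort":
            break              # default: stop the whole execution
        if policy == "Continue":
            i += 1             # move on to the next step
        else:
            i = index[policy.split(":", 1)[1]]  # "step:<name>" jump
    return attempted


steps = [
    {"name": "patch", "onFailure": "step:restoreSnapshot"},
    {"name": "verify", "onFailure": "Continue"},
    {"name": "notify", "isEnd": True},
    {"name": "restoreSnapshot"},
]

print(simulate(steps, failing={"patch"}))
print(simulate(steps, failing={"verify"}))
```

Marking `notify` with `isEnd` keeps the happy path from falling through into the recovery step, which is a common layout when a runbook places its rollback step last.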

Security and Compliance

Automation executions are logged in CloudTrail. You can use IAM policies to restrict who can start executions and which documents they can use. The Automation Assume Role must have permissions for all actions in the runbook. Use aws:SourceAccount and aws:SourceArn in the role trust policy to prevent confused deputy problems.
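A trust policy with those two condition keys might look like the sketch below. The account ID is a placeholder, and the `automation-execution/*` source ARN pattern is the form commonly shown for Automation; verify it against current AWS guidance before relying on it.

```python
import json


def automation_trust_policy(account_id):
    """Build a trust policy letting Systems Manager assume the role,
    restricted to automation executions in the given account."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ssm.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"aws:SourceAccount": account_id},
                    "ArnLike": {
                        "aws:SourceArn": (
                            f"arn:aws:ssm:*:{account_id}:automation-execution/*"
                        )
                    },
                },
            }
        ],
    }


# 123456789012 is a placeholder account ID.
print(json.dumps(automation_trust_policy("123456789012"), indent=2))
```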

Common Use Cases

AMI Creation: Automate creating AMIs from running instances, including pre- and post-steps like de-registering old AMIs.

Patch Management: Run the AWS-RunPatchBaseline command document from an aws:runCommand step to scan for and install patches.

Incident Remediation: Trigger a runbook when a CloudWatch alarm fires (e.g., high CPU, then restart instance).

Account Cleanup: Automate deletion of stale snapshots or unused security groups.

Performance and Limits

Maximum runbook document size: 64 KB.

Maximum number of steps per runbook: 100.

Maximum execution history retention: 30 days (can be extended by logging to S3 or CloudWatch Logs).

Rate of execution starts: 10 per second per account per region (soft limit).

Troubleshooting

Permission Errors: Check the Automation Assume Role and ensure it has the necessary actions. Also verify the instance profile for SSM Agent commands.

Timeouts: Increase timeoutSeconds or check network connectivity.

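Step Failures: Run get-automation-execution and inspect the failed step's status and failure message. Declare outputs on aws:executeAwsApi steps to capture response values for later steps, and check CloudTrail for the underlying API calls.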
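When triaging a failed execution, the first failing step is usually the quickest lead. The helper below is a sketch that walks a response shaped like the output of `get-automation-execution`; the sample response, including its failure message, is made up for illustration.

```python
def first_failed_step(response):
    """Return (step name, failure message) for the first failed or
    timed-out step, or None if every recorded step succeeded."""
    execution = response["AutomationExecution"]
    for step in execution.get("StepExecutions", []):
        if step.get("StepStatus") in ("Failed", "TimedOut"):
            return step["StepName"], step.get("FailureMessage", "")
    return None


# Fabricated sample in the shape of a get-automation-execution response.
sample = {
    "AutomationExecution": {
        "AutomationExecutionStatus": "Failed",
        "StepExecutions": [
            {"StepName": "createImage", "Action": "aws:executeAwsApi",
             "StepStatus": "Success"},
            {"StepName": "tagImage", "Action": "aws:createTags",
             "StepStatus": "Failed",
             "FailureMessage": "(illustrative) access denied for ec2:CreateTags"},
        ],
    }
}

print(first_failed_step(sample))
```

With boto3 installed and credentials configured, the same function could be fed the result of `boto3.client("ssm").get_automation_execution(AutomationExecutionId=...)`.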

Best Practices

Use version control for runbook documents (e.g., store in Git and sync to SSM).

Test runbooks in a non-production environment first.

Use approval steps for destructive actions.

Leverage aws:executeAwsApi for custom API calls instead of Lambda for simplicity.

Tag runbook executions for cost tracking and filtering.

Walk-Through

1. Create or Select Runbook

The first step is to either use an AWS-provided runbook (e.g., AWS-RestartEC2Instance or AWS-CreateImage) or create a custom one. Custom runbooks are written in YAML or JSON with schemaVersion 0.3. You define parameters that users can input when starting execution, and mainSteps that list the actions to perform. The runbook is stored as a Systems Manager Document. When creating a custom runbook, you must specify the document type as 'Automation'. The document content includes metadata like description and, optionally, assumeRole. You can also use the AWS Management Console to create or import runbooks.

2. Configure Execution Details

When starting an Automation execution, you must specify the document name, parameters, and targets (if applicable). You can target instances by tags, resource groups, or individual instance IDs. You also define the Automation Assume Role — an IAM role that Automation will assume to perform actions. Optionally, you can set rate control parameters: maxConcurrency (how many instances to process simultaneously) and maxErrors (how many errors to tolerate before stopping). For approval steps, you can set the number of approvers needed and the timeout for approval.
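The execution settings described here map onto the arguments of the StartAutomationExecution API. The sketch below only builds the argument dictionary, in the shape boto3's `start_automation_execution` expects, so it runs without AWS access. The role ARN and the concurrency and error values are placeholders.

```python
def build_start_args(document_name, tag_key, tag_value, role_arn,
                     max_concurrency="10", max_errors="5"):
    """Build keyword arguments for ssm.start_automation_execution."""
    return {
        "DocumentName": document_name,
        # Runbook parameter values are passed as lists of strings.
        "Parameters": {"AutomationAssumeRole": [role_arn]},
        # Each matched resource is fed into the runbook's InstanceId parameter.
        "TargetParameterName": "InstanceId",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        # Rate control: strings holding a count ("10") or a percentage ("10%").
        "MaxConcurrency": max_concurrency,
        "MaxErrors": max_errors,
    }


args = build_start_args(
    "MyCustomRunbook",
    "Environment",
    "Production",
    "arn:aws:iam::123456789012:role/HypotheticalAutomationRole",  # placeholder
)
print(args["Targets"])
# With boto3 installed and credentials configured, the call would be:
#   boto3.client("ssm").start_automation_execution(**args)
```

Note that rate control appears here, in the execution request, rather than in the runbook document.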

3. Execution Begins and Steps Run

Once started, the Automation service reads the runbook and begins executing steps sequentially. For each step, the service uses the Assume Role to call the specified AWS API or send a command through Run Command. If the step runs a command on EC2 instances, the SSM Agent on each instance executes the command and returns output. Steps can include conditional branching (aws:branch) to choose different paths based on previous step outputs. The service tracks each step's status and logs all actions to CloudTrail. If a step fails, the onFailure setting determines behavior: Abort stops execution, Continue moves on to the next step, and step:stepName jumps to a named step (for example, a rollback step).

4. Approval Step (if present)

If the runbook includes an aws:approve step, execution pauses and waits for a user to approve or deny via the AWS Management Console, CLI, or SDK. The approver must be listed in the step's Approvers input and have permission to call ssm:SendAutomationSignal. The step can require a single approver or multiple approvers. A timeout can be set with timeoutSeconds; if no decision is made in time, the step fails and its onFailure setting decides what happens next. Once approved, execution resumes with the next step. This is commonly used before destructive actions like instance termination or AMI deletion.
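An approval step and the matching approval call can be sketched as follows. The SNS topic, approver ARN, message, and one-hour timeout are placeholders. `approval_signal` builds arguments in the shape of boto3's `send_automation_signal`, the API behind `aws ssm send-automation-signal`.

```python
# Illustrative aws:approve step; ARNs and names are placeholders.
approve_step = {
    "name": "approveReboot",
    "action": "aws:approve",
    "timeoutSeconds": 3600,   # assumed: fail the step after one hour
    "onFailure": "Abort",
    "inputs": {
        "NotificationArn": "arn:aws:sns:us-east-1:123456789012:HypotheticalApprovals",
        "Message": "Approve reboot of patched instances?",
        "MinRequiredApprovals": 1,
        "Approvers": ["arn:aws:iam::123456789012:user/hypothetical-approver"],
    },
}


def approval_signal(execution_id, approve, comment):
    """Build keyword arguments for ssm.send_automation_signal."""
    return {
        "AutomationExecutionId": execution_id,
        "SignalType": "Approve" if approve else "Reject",
        # Payload values are lists of strings.
        "Payload": {"Comment": [comment]},
    }


print(approval_signal("example-execution-id", True, "Change window confirmed"))
```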

5. Execution Completes and Logs

After all steps have executed (or failed/stopped), the execution enters a final state: Success, Failed, TimedOut, or Cancelled. The execution history, including inputs, outputs, and step statuses, is retained for 30 days by default. You can view details in the Systems Manager console or via CLI (get-automation-execution). All API calls made by Automation are logged in CloudTrail, providing an audit trail. You can also configure Automation to send execution logs to CloudWatch Logs or S3 for longer retention. Failed executions can be analyzed to identify the failing step and error message.

What This Looks Like on the Job

Enterprise Scenario 1: Automated AMI Creation and Lifecycle Management

A large enterprise runs hundreds of EC2 instances across multiple accounts. They need to create weekly AMIs of all instances for disaster recovery, while also deregistering AMIs older than 90 days. Using a custom Automation Runbook, they automate the entire process. The runbook first stops the instance (if required), creates an AMI with tags (e.g., CreationDate, Owner), then restarts the instance. It also deletes old AMIs and associated snapshots. The runbook is triggered weekly by an EventBridge schedule. Rate control is set to maxConcurrency=10 and maxErrors=5 to avoid overwhelming the account. The runbook uses aws:executeAwsApi to call ec2:CreateImage and ec2:DeregisterImage. IAM policies are scoped to specific resources. Common misconfiguration: forgetting to exclude instances with a 'NoBackup' tag, causing unnecessary downtime. The solution uses a branching step to check tags before proceeding.
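The tag check in this scenario could be expressed as an aws:branch step like the one sketched below. The step names (`getTag`, `skipInstance`, `createImage`) and the `NoBackup` output are hypothetical, and `next_step` is a simplified stand-in for the service's evaluation that handles only two of the supported comparison operators.

```python
# Illustrative aws:branch step; referenced step names are hypothetical.
branch_step = {
    "name": "checkBackupTag",
    "action": "aws:branch",
    "inputs": {
        "Choices": [
            {
                "NextStep": "skipInstance",
                "Variable": "{{ getTag.NoBackup }}",
                "StringEquals": "true",
            }
        ],
        "Default": "createImage",
    },
}


def next_step(inputs, resolved):
    """Pick the next step name.

    resolved maps each Variable template to its runtime value.
    Only StringEquals and EqualsIgnoreCase are modeled here.
    """
    for choice in inputs["Choices"]:
        value = resolved.get(choice["Variable"], "")
        if "StringEquals" in choice and value == choice["StringEquals"]:
            return choice["NextStep"]
        if ("EqualsIgnoreCase" in choice
                and value.lower() == choice["EqualsIgnoreCase"].lower()):
            return choice["NextStep"]
    return inputs.get("Default")


print(next_step(branch_step["inputs"], {"{{ getTag.NoBackup }}": "true"}))
print(next_step(branch_step["inputs"], {"{{ getTag.NoBackup }}": "false"}))
```

Choices are evaluated in order and the first match wins, so the exclusion check belongs ahead of any broader conditions.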

Enterprise Scenario 2: Automated Patch Remediation

A financial services company must patch all Windows and Linux instances monthly. They run the AWS-provided command document AWS-RunPatchBaseline from an aws:runCommand step in a custom runbook, with custom parameters. The document scans for missing patches, installs them, and reboots if required. To avoid business impact, they have the patch step skip the reboot and place an approval step before a separate reboot step. The runbook targets instances by the tag 'PatchGroup: Monthly'. The Automation Assume Role is scoped to allow ec2:RebootInstances only on tagged instances. They also integrate with AWS Config: if an instance falls out of compliance (e.g., missing critical patches), a Config rule triggers a remediation runbook that patches the instance immediately. A common pitfall is setting the RebootOption parameter incorrectly, causing instances to reboot unexpectedly or not reboot when needed. The team uses maxAttempts: 3 on the patch step to handle transient errors.
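A patch-then-approve-then-reboot sequence along these lines could be sketched as three steps. Step names and the approver ARN are placeholders, and passing the AWS-RunPatchBaseline parameters as plain strings is an assumption worth verifying against the document's parameter definitions.

```python
import json

# Illustrative steps; names and ARNs are placeholders.
patch_steps = [
    {
        "name": "installPatches",
        "action": "aws:runCommand",
        "maxAttempts": 3,  # retry transient failures, as in the scenario
        "onFailure": "Abort",
        "inputs": {
            "DocumentName": "AWS-RunPatchBaseline",
            "InstanceIds": ["{{ InstanceId }}"],
            # Assumed parameter shape: install patches but defer the reboot.
            "Parameters": {"Operation": "Install", "RebootOption": "NoReboot"},
        },
    },
    {
        "name": "approveReboot",
        "action": "aws:approve",
        "inputs": {
            "Message": "Patches installed. Approve reboot?",
            "MinRequiredApprovals": 1,
            "Approvers": ["arn:aws:iam::123456789012:role/HypotheticalApprover"],
        },
    },
    {
        "name": "rebootInstance",
        "action": "aws:executeAwsApi",
        "inputs": {
            "Service": "ec2",
            "Api": "RebootInstances",
            "InstanceIds": ["{{ InstanceId }}"],
        },
    },
]

print(json.dumps([s["name"] for s in patch_steps]))
```

Separating the reboot into its own step is what lets the approval gate sit between patch installation and restart.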

Scenario 3: Incident Response Automation

A SaaS company uses CloudWatch alarms for high CPU utilization. When an alarm triggers, an EventBridge rule starts an Automation Runbook that captures a memory dump, stops the instance, attaches a diagnostic volume, runs analysis scripts, and then restarts the instance. The runbook includes an approval step before stopping the instance to allow a human to intervene if it's a critical production server. The runbook outputs the analysis results to an S3 bucket. The team learned the hard way that the Automation Assume Role must include permissions for ec2:StopInstances and ec2:CreateVolume, and that the SSM Agent must be installed on the instance. They also set timeoutSeconds on the analysis step to 600 to avoid hanging.

How SOA-C02 Actually Tests This

Exam Focus for SOA-C02 Objective 3.2

The SOA-C02 exam tests your ability to implement automation using Systems Manager Automation Runbooks. Specifically, you need to know:

1. How to create and execute runbooks (both predefined and custom).

2. How to target resources using tags, resource groups, or instance IDs.

3. How to use parameters and rate control (maxConcurrency, maxErrors).

4. How approval steps work and how to configure them.

5. How to handle errors with onFailure (Abort, Continue, or a jump to a named step such as a rollback step).

6. How Automation integrates with CloudTrail, EventBridge, and AWS Config.

7. IAM permissions required for Automation (Assume Role and instance profile).

Common Wrong Answers and Why

Wrong: 'Automation Runbooks can only be executed on EC2 instances.' Reality: Runbooks can perform actions on any AWS resource (e.g., create an RDS snapshot, modify security groups) using aws:executeAwsApi, not just on EC2.

Wrong: 'The Automation Assume Role must be the same as the instance profile role.' Reality: These are separate. The Assume Role is for Automation to call AWS APIs; the instance profile role is for the SSM Agent to receive commands. They often need different permissions.

Wrong: 'Rate control parameters (maxConcurrency and maxErrors) are set in the runbook document.' Reality: They are set at execution time, not in the document. The document defines steps; the execution defines how many instances run in parallel.

Wrong: 'If a step fails, the entire execution stops immediately.' Reality: That is only the default behavior (onFailure: Abort). You can set onFailure to Continue, or to step:stepName to jump to a recovery step. The exam may test the default.

Specific Numbers and Terms

Default onFailure: Abort

Default maxAttempts: 1

Default timeoutSeconds: varies, but for aws:runCommand it's 3600 seconds (1 hour).

Schema version: 0.3

Maximum steps: 100

Maximum document size: 64 KB

Execution history retention: 30 days

Approval step timeout: set with timeoutSeconds on the aws:approve step (default 7 days; the step fails if no decision is reached in time).

Edge Cases

Cross-account Automation: An execution started in a central account can act on other accounts and Regions through target locations, with Automation assuming an execution role in each target account.

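Using aws:branch: The exam may test that branching evaluates Choices that compare a variable (often a step output) using operators such as StringEquals, EqualsIgnoreCase, Contains, or NumericGreater, with an optional Default step.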

Output variables: Syntax {{ stepName.outputName }} to pass data between steps.

Chaining runbooks: You can start another Automation execution from within a runbook using aws:executeAutomation.

How to Eliminate Wrong Answers

If the question involves patching, think of AWS-RunPatchBaseline (a command document, run directly or from an aws:runCommand step).

If the question involves approval, look for aws:approve.

If the question involves error handling, check the onFailure setting.

If the question involves targeting, look for tags or --targets.

Remember that Automation is not just for EC2; it can call any AWS API.

Key Takeaways

Automation Runbooks are defined in YAML/JSON with schemaVersion 0.3.

Steps can call any AWS API via aws:executeAwsApi.

Default onFailure is Abort; can be changed to Continue or rollback.

Rate control (maxConcurrency, maxErrors) is set at execution time, not in document.

Approval steps pause execution until a user approves via console or CLI.

Runbooks can be triggered by EventBridge, AWS Config, or other runbooks.

Execution history is retained for 30 days by default.

The Automation Assume Role must have permissions for all actions in the runbook.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

SSM Automation Runbooks

Simpler syntax (YAML/JSON) with predefined actions like aws:runCommand.

Limited to 100 steps per runbook.

Native integration with SSM Agent and AWS Config remediation.

Best for straightforward operational tasks like patching and AMI creation.

Approval steps built-in via aws:approve.

AWS Step Functions

More complex but flexible state machine with support for parallel branches.

Can handle very long workflows; a Standard workflow execution can record up to 25,000 events in its history.

Integrates with over 200 AWS services via SDK integrations.

Best for complex business processes and microservice orchestration.

Approval requires a separate task token pattern or Lambda.

Watch Out for These

Mistake

Automation can only run commands on EC2 instances.

Correct

Automation can perform any AWS API action via aws:executeAwsApi, such as creating RDS snapshots, modifying security groups, or stopping instances. It is not limited to EC2.

Mistake

The Automation Assume Role must have the same permissions as the instance profile.

Correct

The Assume Role is used by the Automation service to call AWS APIs. The instance profile role is used by the SSM Agent to execute commands. They are separate and often require different permissions.

Mistake

Rate control parameters (maxConcurrency, maxErrors) are defined in the runbook document.

Correct

These parameters are specified when starting the execution, not in the document. The document defines the steps; the execution defines how many instances run in parallel and error tolerance.

Mistake

If a step fails, the entire execution always stops.

Correct

Abort is only the default onFailure behavior; you can set it to Continue, or to step:stepName to jump to another step. The exam tests the default and the ability to change it.

Mistake

Automation runbooks can only be executed manually.

Correct

Runbooks can be triggered automatically by EventBridge rules, AWS Config rules, or even by other runbooks using aws:executeAutomation.


Frequently Asked Questions

Can I use Automation Runbooks to manage resources in multiple AWS accounts?

Yes. Start the execution from a central account and specify target locations (accounts and Regions). Automation uses an administration role in the central account to assume an execution role in each target account, so that execution role must exist in every target account and trust the central account. Alternatively, you can use AWS Organizations and StackSets for multi-account management.
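For multi-account runs, the StartAutomationExecution API accepts a TargetLocations argument. The sketch below builds one entry in that shape; the account IDs, Regions, and rate-control values are placeholders, and the default role name shown is the one AWS documentation conventionally uses for the per-account execution role.

```python
def target_locations(accounts, regions,
                     execution_role="AWS-SystemsManager-AutomationExecutionRole"):
    """Build a TargetLocations list for ssm.start_automation_execution."""
    return [
        {
            "Accounts": list(accounts),
            "Regions": list(regions),
            # Role that Automation assumes inside each target account.
            "ExecutionRoleName": execution_role,
            # Placeholder rate control across account/Region pairs.
            "TargetLocationMaxConcurrency": "2",
            "TargetLocationMaxErrors": "0",
        }
    ]


# Placeholder account IDs and Regions.
locations = target_locations(["111111111111", "222222222222"],
                             ["us-east-1", "eu-west-1"])
print(locations[0]["Accounts"], locations[0]["Regions"])
```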

What is the difference between SSM Automation and SSM Run Command?

SSM Run Command is used to run commands on one or more EC2 instances or on-premises servers. It is a single action. Automation is a more powerful service that can execute multi-step workflows involving multiple AWS resources, not just instances. Automation can include Run Command as one of its steps (aws:runCommand).

How do I pass outputs from one step to another in an Automation runbook?

You can capture outputs from a step by defining them in the step's outputs section. Then reference them in subsequent steps using the syntax {{ stepName.outputName }}. For example, if a step named 'CreateImage' has an output 'ImageId', you can use {{ CreateImage.ImageId }} in the next step.

What happens if the Automation Assume Role does not have sufficient permissions?

The execution will fail with an access denied error. The step that attempted the action will have a status of 'Failed' and the error message will indicate the missing permission. You should check the IAM role and add the necessary actions. Also ensure the trust policy allows the Systems Manager service to assume the role.

Can I run an Automation runbook on an instance that does not have the SSM Agent installed?

No, if the runbook includes steps that target instances (like aws:runCommand), the SSM Agent must be installed and configured. However, if the runbook only calls AWS APIs (like creating an AMI from an instance ID), the SSM Agent is not required because the action is performed by the Automation service itself.

How do I set up an approval step in an Automation runbook?

Add a step with action 'aws:approve' in your runbook. You can specify inputs like 'Approvers' (list of IAM users or roles), 'MinRequiredApprovals' (default 1), 'NotificationArn' (an SNS topic that announces the pending approval), and 'Message'. Set the step's timeoutSeconds to control how long it waits (default 7 days). When execution reaches this step, it pauses. Approvers can approve or deny via the Systems Manager console or using the CLI command 'aws ssm send-automation-signal'.

What is the maximum number of steps in an Automation runbook?

The maximum number of steps is 100. If you need more steps, consider using nested runbooks (aws:executeAutomation) or switch to AWS Step Functions for more complex workflows.
