This chapter covers AWS Systems Manager Incident Manager, a service for managing and responding to incidents on AWS. It is part of the Monitoring domain (Objective 1.2) on the SOA-C02 exam. Incident Manager typically appears in 2-3 exam questions, focusing on its integration with CloudWatch alarms, the incident lifecycle, and response plans. You will learn the core components, step-by-step incident flow, and common pitfalls to avoid.
Imagine a city fire station with a dispatch center. The dispatch center monitors alarms from various buildings (your AWS resources). When a fire alarm (incident) is triggered, the dispatcher (Incident Manager) immediately creates an incident record, assigns a severity (e.g., 1 for a full blaze, 2 for a small fire), and notifies the appropriate fire crew (on-call responders) via pagers (SNS, SMS, or chat). The dispatch center uses a pre-defined runbook (a checklist of actions: which hoses to use, which exits to block) to guide the crew. As the crew fights the fire, they update the dispatch board (incident timeline) with progress. If the fire escalates, the dispatcher can automatically involve more crews (escalation). After the fire is out, the dispatch center archives the incident for post-mortem analysis. Without this dispatch system, each building would have to manually call fire stations, causing delays and confusion. Incident Manager automates the entire lifecycle: detection, notification, response, and resolution, ensuring the right people are alerted with the right context at the right time.
What is AWS Systems Manager Incident Manager?
AWS Systems Manager Incident Manager is a fully managed service that helps you prepare for, detect, respond to, and resolve incidents. It centralizes incident management by automating response actions, notifying the right people, and providing a structured process to minimize downtime. Incident Manager is part of AWS Systems Manager, a suite of tools for operational management.
Why Incident Manager Exists
Before Incident Manager, teams had to manually handle incidents using custom scripts, third-party tools, or spreadsheets. This led to inconsistent responses, delayed notifications, and difficulty in tracking resolution progress. Incident Manager provides a standardized, automated approach that integrates with AWS services like CloudWatch, EventBridge, and Systems Manager Automation runbooks. It ensures that incidents are handled consistently, with a clear audit trail.
How Incident Manager Works Internally
Incident Manager operates through three main concepts: response plans, incidents, and engagement plans.
Response Plan: Defines the actions to take when an incident is created. It specifies the contacts, escalation channels, and automation runbooks to run. Each response plan is associated with a specific CloudWatch alarm or EventBridge rule.
Incident: A record of an event that requires attention. It has a severity level (1-5), a status (Open, In Progress, Resolved), and a timeline of events. Incidents are created automatically by CloudWatch alarms or manually via the console/API.
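Engagement Plan: Determines how and when to notify responders. It is built from contacts, each with contact channels (SMS, email, voice), plus escalation rules; the Incident Manager console and API call the escalation piece an escalation plan. Chat notifications are configured separately through AWS Chatbot. Engagement plans can be reused across multiple response plans.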
When a CloudWatch alarm triggers, it can invoke a response plan via an EventBridge rule. The response plan then:
1. Creates an incident with a specified severity.
2. Starts an automation runbook (e.g., to take a snapshot, restart an instance).
3. Engages the on-call team using the engagement plan.
4. Updates the incident timeline as responders provide updates.
Key Components, Values, Defaults, and Timers
Severity Levels: 1 (critical) to 5 (informational). Default is 3.
Incident Status: Open, In Progress, Resolved. You can also set a resolution plan.
Contact Channels: SMS (10-160 characters), email, voice (up to 5 minutes), chat (via AWS Chatbot).
Engagement Plan: Supports multiple contacts and escalation rules. Escalation can be time-based (e.g., if not acknowledged in 5 minutes, escalate to next tier).
Automation Runbooks: Pre-defined Systems Manager Automation documents (e.g., AWSIncidents-CriticalIncidentRunbookTemplate) or custom runbooks.
Timeline: Automatically records events like incident creation, responder acknowledgments, runbook execution, and resolution. You can add manual entries.
Integration with CloudWatch: Incidents can be created from CloudWatch alarm state changes (OK, ALARM, INSUFFICIENT_DATA), most commonly the transition to ALARM. The alarm is linked to a response plan either as an alarm action or through an EventBridge rule.
Integration with EventBridge: EventBridge rules match alarm state changes and trigger the response plan.
Configuration and Verification Commands
To create a response plan using the AWS CLI:
aws ssm-incidents create-response-plan \
--name "MyResponsePlan" \
--incident-template "{\"title\": \"Example incident\", \"impact\": 3}" \
--integrations "[{\"pagerDutyConfiguration\": {\"name\": \"PagerDuty\", \"pagerDutyIncidentConfiguration\": {\"serviceId\": \"P12345\"}, \"secretId\": \"arn:aws:secretsmanager:us-east-1:123456789012:secret:MyPagerDutyKey\"}}]"
Note that the API calls the severity field impact and expects an integer from 1 to 5.
To list incidents:
aws ssm-incidents list-incident-records
To update an incident status (the ARN takes the form incident-record/response-plan-name/incident-id):
aws ssm-incidents update-incident-record \
--arn "arn:aws:ssm-incidents::123456789012:incident-record/MyResponsePlan/<incident-id>" \
--status RESOLVED
How It Interacts with Related Technologies
AWS Systems Manager Automation: Runbooks can execute automated actions like restarting EC2 instances, taking EBS snapshots, or patching.
AWS Chatbot: Sends notifications to Slack or Amazon Chime channels.
Amazon CloudWatch: Alarms trigger incidents via EventBridge.
Amazon EventBridge: Routes alarm state changes to response plans.
AWS Lambda: Custom actions can be triggered via runbooks or EventBridge targets.
AWS Secrets Manager: Stores credentials for third-party integrations like PagerDuty.
AWS CloudTrail: Logs all Incident Manager API calls for auditing.
Exam Tips
Remember that Incident Manager is part of AWS Systems Manager, not a standalone service.
Know the difference between a response plan (defines actions) and an engagement plan (defines notification rules).
Understand that incidents can be created manually or automatically from CloudWatch alarms.
Be aware that you can integrate with third-party tools like PagerDuty and Slack.
The default severity is 3. Severity 1 is highest.
Escalation rules are time-based; if a responder does not acknowledge within a set time, the incident escalates.
CloudWatch Alarm Triggers
A CloudWatch alarm enters the ALARM state based on a metric threshold (e.g., CPU > 80% for 5 minutes). CloudWatch publishes every alarm state change to EventBridge automatically, and the alarm can additionally notify an SNS topic. An EventBridge rule matches the state change and invokes a specific Incident Manager response plan. The alarm ARN is passed as context to the response plan.
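The routing step above can be sketched with the AWS CLI. This is a minimal sketch under stated assumptions, not a definitive setup: the rule name, target ID, account ID, and role name are hypothetical, and the role is assumed to allow ssm-incidents:StartIncident on the response plan. The functions are only defined here so nothing is created until you call them.

```shell
# Hypothetical names and ARNs; replace with your own.
RULE_NAME="alarm-to-incident"
RESPONSE_PLAN_ARN="arn:aws:ssm-incidents::123456789012:response-plan/MyResponsePlan"
ROLE_ARN="arn:aws:iam::123456789012:role/EventBridgeStartIncidentRole"

# Match any CloudWatch alarm that transitions into the ALARM state.
EVENT_PATTERN='{"source":["aws.cloudwatch"],"detail-type":["CloudWatch Alarm State Change"],"detail":{"state":{"value":["ALARM"]}}}'

# Create (or update) the rule with the pattern above.
create_alarm_rule() {
  aws events put-rule --name "$RULE_NAME" --event-pattern "$EVENT_PATTERN"
}

# Point the rule at the response plan; EventBridge assumes ROLE_ARN to start the incident.
attach_response_plan() {
  aws events put-targets --rule "$RULE_NAME" \
    --targets "Id=incident-manager,Arn=$RESPONSE_PLAN_ARN,RoleArn=$ROLE_ARN"
}
```

Calling create_alarm_rule followed by attach_response_plan wires every alarm in the account to the plan. To scope the rule to one alarm, an "alarmName" filter could be added under "detail" in the pattern.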
Response Plan Execution
The response plan receives the alarm context. It creates an incident record with a predefined severity (e.g., 2) and title. It also triggers any associated Systems Manager Automation runbook. The runbook can perform automated actions like stopping an EC2 instance or taking a snapshot. The incident is created in 'Open' status.
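What the response plan does at this step can also be reproduced by hand, which is a convenient way to test a plan before wiring it to an alarm. The sketch below assumes a hypothetical response plan ARN; the API names the severity field impact and takes an integer from 1 to 5. A title or impact passed here overrides the value in the plan's incident template.

```shell
# Hypothetical response plan ARN; replace with your own.
RESPONSE_PLAN_ARN="arn:aws:ssm-incidents::123456789012:response-plan/MyResponsePlan"

# Impact must be an integer from 1 (critical) to 5.
valid_impact() {
  case "$1" in
    1|2|3|4|5) return 0 ;;
    *) return 1 ;;
  esac
}

# $1 = incident title, $2 = impact (1-5)
start_incident() {
  if ! valid_impact "$2"; then
    echo "impact must be 1-5, got: $2" >&2
    return 1
  fi
  aws ssm-incidents start-incident \
    --response-plan-arn "$RESPONSE_PLAN_ARN" \
    --title "$1" \
    --impact "$2"
}
```

For example, start_incident "Checkout latency test" 2 would open a severity 2 incident and run whatever runbook and engagements the plan defines.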
Engage On-Call Team
The engagement plan associated with the response plan sends notifications to the on-call contacts via configured channels (SMS, email, voice, chat). The engagement plan may have escalation rules: if the primary responder does not acknowledge within 5 minutes, the incident escalates to the secondary responder. Acknowledgment is done via the Incident Manager console or by replying to a notification.
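The two-tier escalation described above maps onto an escalation plan in the ssm-contacts API. The following is a sketch only: the contact ARNs and the alias are hypothetical, and it assumes both contacts already exist with their own contact channels. Each stage waits DurationInMinutes before the next stage begins, and an acknowledgment is expected to stop further stages.

```shell
# Hypothetical contact ARNs for a primary and a secondary responder.
PRIMARY_ARN="arn:aws:ssm-contacts:us-east-1:123456789012:contact/primary-oncall"
SECONDARY_ARN="arn:aws:ssm-contacts:us-east-1:123456789012:contact/secondary-oncall"

# Build a two-stage plan: engage the primary, wait $1 minutes, then engage the secondary.
escalation_plan_json() {
  printf '{"Stages":[{"DurationInMinutes":%s,"Targets":[{"ContactTargetInfo":{"ContactId":"%s","IsEssential":true}}]},{"DurationInMinutes":0,"Targets":[{"ContactTargetInfo":{"ContactId":"%s","IsEssential":true}}]}]}' \
    "$1" "$PRIMARY_ARN" "$SECONDARY_ARN"
}

# Create the escalation plan with a 5-minute acknowledgment window.
create_escalation_plan() {
  aws ssm-contacts create-contact \
    --alias "payments-escalation" \
    --type ESCALATION \
    --plan "$(escalation_plan_json 5)"
}
```

Keeping the JSON in a small function makes the acknowledgment window a single parameter, so a stricter plan for severity 1 incidents is just escalation_plan_json 2.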
Responder Investigates and Updates
The responder reviews the incident details, including the alarm context and runbook output. They update the incident status to 'In Progress' and add timeline entries (e.g., 'Investigating root cause'). They can also run additional runbooks from the incident console. The timeline provides a chronological log of all actions.
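Manual timeline entries like the one above can also be added from the CLI. This is a hedged sketch: the incident ARN is a placeholder, and the note text is passed as a JSON-quoted string, which is the form the CLI example for custom events uses.

```shell
# Placeholder incident ARN; substitute the ARN returned when the incident was created.
INCIDENT_ARN="arn:aws:ssm-incidents::123456789012:incident-record/MyResponsePlan/<incident-id>"

# Current time in UTC, ISO 8601.
utc_now() {
  date -u +%Y-%m-%dT%H:%M:%SZ
}

# $1 = free-text note, e.g. "Investigating root cause"
add_timeline_note() {
  aws ssm-incidents create-timeline-event \
    --incident-record-arn "$INCIDENT_ARN" \
    --event-time "$(utc_now)" \
    --event-type "Custom Event" \
    --event-data "\"$1\""
}
```

Calling add_timeline_note "Investigating root cause" appends a custom event to the incident timeline alongside the automatically recorded ones.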
Incident Resolution and Closure
Once the issue is resolved, the responder sets the incident status to 'Resolved'. They may add a resolution note. Incident Manager archives the incident for later analysis. The timeline is preserved for post-incident reviews. CloudWatch alarms that triggered the incident may return to OK state, but the incident remains resolved.
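Because the alarm and the incident have separate lifecycles, it can help to compare them explicitly. The sketch below assumes you pass in a real alarm name and incident ARN; the helper names are hypothetical. It flags the common case where the alarm has recovered but nobody has resolved the incident.

```shell
# $1 = alarm name; prints OK, ALARM, or INSUFFICIENT_DATA
alarm_state() {
  aws cloudwatch describe-alarms --alarm-names "$1" \
    --query 'MetricAlarms[0].StateValue' --output text
}

# $1 = incident record ARN; prints the status reported by the API
incident_status() {
  aws ssm-incidents get-incident-record --arn "$1" \
    --query 'incidentRecord.status' --output text
}

# Pure check: succeeds when the alarm is back to OK but the incident is still open.
# $1 = alarm state, $2 = incident status
still_open_after_recovery() {
  [ "$1" = "OK" ] && [ "$2" = "OPEN" ]
}
```

A periodic job could run still_open_after_recovery "$(alarm_state HighCPU)" "$(incident_status "$ARN")" and remind the responder to close the incident.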
Enterprise Scenario 1: E-Commerce Platform with Critical Latency Spikes
A large e-commerce company uses Incident Manager to respond to latency spikes in their payment processing service. They have a CloudWatch alarm on the API Gateway latency metric. When latency exceeds 2 seconds for 3 consecutive periods, the alarm triggers Incident Manager. The response plan runs a Systems Manager Automation runbook that captures a thread dump from the affected EC2 instances and stores it in S3. It also creates an incident with severity 1. The engagement plan sends SMS and Slack messages to the on-call team. The team uses the thread dump to identify a database bottleneck. They update the incident with findings and resolve it after scaling the database. The incident timeline provides a full audit trail for compliance.
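The alarm in this scenario could look roughly like the following. It is a sketch with assumptions: the alarm name, API name, the 60-second period, and the Average statistic are illustrative choices, not details from the scenario. API Gateway reports Latency in milliseconds, so the 2-second threshold is converted first.

```shell
# Scenario threshold: 2 seconds, expressed in milliseconds for the Latency metric.
LATENCY_SECONDS=2
THRESHOLD_MS=$((LATENCY_SECONDS * 1000))

# Alarm when average latency exceeds the threshold for 3 consecutive periods.
create_latency_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "payments-api-latency" \
    --namespace "AWS/ApiGateway" \
    --metric-name "Latency" \
    --dimensions "Name=ApiName,Value=payments-api" \
    --statistic Average \
    --period 60 \
    --evaluation-periods 3 \
    --threshold "$THRESHOLD_MS" \
    --comparison-operator GreaterThanThreshold
}
```

With a 60-second period and 3 evaluation periods, the alarm fires after roughly three minutes of sustained latency above 2000 ms.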
Enterprise Scenario 2: Financial Services with Compliance Requirements
A bank uses Incident Manager to handle security incidents. They have a CloudWatch alarm on a metric filter over CloudTrail logs that counts unauthorized API calls. When triggered, Incident Manager creates a severity 2 incident and runs a runbook that isolates the compromised IAM user by attaching a DenyAll policy. The engagement plan calls the security team via voice and sends an email to the manager. The team investigates and updates the incident. After resolution, they export the incident timeline for compliance reporting. They have configured separate engagement plans for each severity level: severity 1 escalates to the CISO after 2 minutes, and severity 2 escalates to the team lead after 5 minutes.
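The isolation step in this scenario amounts to one IAM call. The sketch below shows the CLI equivalent of what such a runbook might do; the policy name is hypothetical, and a real runbook would normally issue the same API call from an Automation step rather than from a shell.

```shell
# Explicit deny on every action and resource; an explicit deny overrides any allow.
DENY_ALL_POLICY='{"Version":"2012-10-17","Statement":[{"Effect":"Deny","Action":"*","Resource":"*"}]}'

# $1 = IAM user name to isolate
isolate_user() {
  aws iam put-user-policy \
    --user-name "$1" \
    --policy-name "IncidentDenyAll" \
    --policy-document "$DENY_ALL_POLICY"
}
```

Using an inline policy keeps the change easy to reverse with aws iam delete-user-policy once the investigation is complete.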
Common Misconfiguration and Issues
Missing IAM Permissions: The Incident Manager service role must have permissions to run runbooks and access resources. If the role is missing, the runbook fails silently.
Incorrect EventBridge Rule: The rule must match the exact alarm state change. A common mistake is using the wrong event pattern.
Engagement Plan Timeouts: If responders do not acknowledge, the escalation may be too slow. Set realistic acknowledgment windows.
Third-Party Integration Failures: If PagerDuty or Slack tokens expire, notifications fail. Use Secrets Manager to rotate secrets.
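An incorrect event pattern is easy to catch before deployment, because EventBridge can evaluate a pattern against a sample event. The sketch below assumes a hand-written sample event with a hypothetical alarm named HighCPU; real alarm events carry more fields under "detail".

```shell
# Pattern under test: CloudWatch alarms entering the ALARM state.
EVENT_PATTERN='{"source":["aws.cloudwatch"],"detail-type":["CloudWatch Alarm State Change"],"detail":{"state":{"value":["ALARM"]}}}'

# Hand-written sample event (hypothetical alarm name and account ID).
SAMPLE_EVENT='{"id":"1","account":"123456789012","source":"aws.cloudwatch","time":"2024-01-01T00:00:00Z","region":"us-east-1","resources":["arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighCPU"],"detail-type":"CloudWatch Alarm State Change","detail":{"alarmName":"HighCPU","state":{"value":"ALARM"}}}'

# Asks EventBridge whether the pattern matches the sample event.
check_pattern() {
  aws events test-event-pattern \
    --event-pattern "$EVENT_PATTERN" \
    --event "$SAMPLE_EVENT"
}
```

Running check_pattern reports whether the pattern matches, which narrows a "nothing happened" failure down to either the rule or the target.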
SOA-C02 Objective Coverage
Incident Manager falls under Domain 1: Monitoring and Reporting, Objective 1.2: Manage incidents using AWS Systems Manager Incident Manager. The exam tests your ability to:
Configure response plans and engagement plans.
Integrate with CloudWatch alarms and EventBridge.
Understand the incident lifecycle (Open, In Progress, Resolved).
Identify the correct order of steps when an alarm triggers.
Differentiate between Incident Manager and other Systems Manager capabilities.
Common Wrong Answers and Why Candidates Choose Them
1. 'Incident Manager can directly restart EC2 instances without a runbook.' - Wrong because Incident Manager itself does not execute actions; it uses Systems Manager Automation runbooks to perform actions. Candidates often think Incident Manager has built-in remediation.
2. 'Incident Manager requires a third-party tool like PagerDuty to send notifications.' - Wrong because Incident Manager has built-in notification channels (SMS, email, voice, chat). PagerDuty is an optional integration. Candidates assume third-party is mandatory.
3. 'Incident Manager incidents are automatically resolved when the CloudWatch alarm returns to OK.' - Wrong because incidents must be manually resolved or via a runbook. The alarm state change does not automatically resolve the incident. Candidates confuse alarm lifecycle with incident lifecycle.
4. 'You can create an incident only from a CloudWatch alarm.' - Wrong because you can manually create incidents from the console or API. Candidates overlook the manual creation option.
Specific Numbers and Values to Memorize
Severity levels: 1 (critical) to 5 (informational). Default: 3.
Engagement plan escalation timers: can be set in minutes (e.g., 5 minutes).
Supported contact channels: SMS, email, voice, chat (via AWS Chatbot).
Incident statuses: Open, In Progress, Resolved.
Integration with Systems Manager Automation runbooks.
Edge Cases and Exceptions
If a response plan is deleted, existing incidents are not affected.
Engagement plans can be shared across multiple response plans.
Incidents are regional: each incident is created and managed in a single Region, although the replication set configured at onboarding can copy Incident Manager data to additional Regions.
You can assign a custom incident template with placeholders for alarm details.
How to Eliminate Wrong Answers
If an answer mentions direct remediation without runbooks, it is wrong.
If an answer says incidents auto-resolve when alarm OKs, it is wrong.
If an answer says you must use a third-party for notifications, it is wrong.
Look for keywords: 'response plan', 'engagement plan', 'runbook', 'severity'.
Incident Manager is part of AWS Systems Manager, used for automated incident response.
Response plans define actions; engagement plans define notifications and escalations.
Incidents can be created automatically from CloudWatch alarms or manually.
Severity levels range from 1 (critical) to 5 (informational); default is 3.
Incident statuses: Open, In Progress, Resolved.
Automation runbooks execute remediation steps (e.g., restart instances).
Notifications can be sent via SMS, email, voice, or chat (Slack/Chime).
Escalation rules are time-based; if not acknowledged, incident escalates to next responder.
Incidents must be manually resolved; they do not auto-resolve when alarm clears.
Incident Manager integrates with CloudWatch, EventBridge, Systems Manager Automation, and third-party tools like PagerDuty.
These come up on the exam all the time. Here's how to tell them apart.
Incident Manager
Automated incident creation from CloudWatch alarms
Built-in notification channels (SMS, email, voice, chat)
Automated runbook execution for remediation
Centralized incident timeline and audit trail
Escalation rules with time-based triggers
Manual Incident Handling
Requires manual creation of incident tickets
Notifications sent via separate tools or scripts
Remediation done manually by responders
No standardized timeline; relies on emails or notes
Escalation depends on human intervention
Mistake
Incident Manager is a standalone service separate from Systems Manager.
Correct
Incident Manager is a capability of AWS Systems Manager, not a standalone service. It is accessed via the Systems Manager console.
Mistake
Incidents can only be created automatically from CloudWatch alarms.
Correct
Incidents can be created manually via the console or API, not just from alarms. Manual creation is useful for testing or external events.
Mistake
Incident Manager can automatically resolve incidents when the underlying alarm returns to OK.
Correct
Incidents must be resolved manually or via a runbook. There is no automatic resolution based on alarm state.
Mistake
You must use PagerDuty or another third-party tool to send notifications.
Correct
Incident Manager has built-in notification channels: SMS, email, voice, and chat (via AWS Chatbot). Third-party tools are optional.
Mistake
Engagement plans are the same as response plans.
Correct
Response plans define the actions (runbooks, incident template) and reference an engagement plan. Engagement plans define notification and escalation rules.
First, create a response plan in Incident Manager that specifies the severity and any automation runbooks. Then, create an EventBridge rule that matches the CloudWatch alarm state change (e.g., ALARM) and targets the response plan. When the alarm triggers, EventBridge invokes the response plan, which creates an incident and sends notifications.
No, incidents must be resolved manually by a responder or via a Systems Manager Automation runbook. There is no automatic resolution when the triggering alarm returns to OK. You can create a runbook that sets the incident status to Resolved, but it must be triggered manually or by another event.
Incident Manager supports SMS (text messages), email, voice calls, and chat notifications via AWS Chatbot (which integrates with Slack and Amazon Chime). You can also integrate with third-party tools like PagerDuty and Atlassian Opsgenie.
A response plan defines the incident template (title, severity) and the automation runbooks to run when an incident is created. It also references an engagement plan. An engagement plan defines the contacts to notify, their notification channels, and escalation rules (e.g., escalate after 5 minutes if not acknowledged).
In the engagement plan, you can define multiple contacts in a hierarchy. For each contact, you can set a duration (in minutes) after which the incident escalates to the next contact if the current one does not acknowledge. Acknowledgment can be done via the Incident Manager console or by responding to a notification.
Yes, you can manually create incidents from the Incident Manager console or API. This is useful for incidents reported by users or from external monitoring tools. Manual incidents follow the same lifecycle as automatically created ones.
Incident Manager is a regional service. Incidents are created and managed within a single AWS region. If you need cross-region incident management, you must set up separate response plans in each region or use a global dashboard via third-party tools.