SOA-C02 | Chapter 54 of 104 | Objective 2.1

EC2 Status Checks and Auto Recovery

This chapter covers EC2 status checks and the Auto Recovery feature, which are critical for maintaining high availability of EC2 instances. For the SOA-C02 exam, understanding how status checks work, how to configure CloudWatch alarms to trigger recovery, and the limitations of Auto Recovery is essential. Approximately 5-10% of exam questions touch on EC2 monitoring, recovery, or related CloudWatch concepts. This chapter will equip you with the deep technical knowledge needed to answer scenario-based questions about instance health monitoring and automated recovery actions.

25 min read
Intermediate
Updated May 31, 2026

EC2 Auto Recovery: Like a Hospital ICU Monitor

Imagine a hospital ICU where each patient (EC2 instance) is connected to a vital signs monitor (status check). The monitor continuously checks heart rate, blood pressure, and oxygen levels (instance status checks) and also ensures the room's power and network connections work (system status checks). If the monitor detects a critical issue, like a heart attack (impaired instance), it doesn't immediately call a code blue. Instead, it first waits for a short period (status check failure threshold) to confirm the problem is persistent, not a transient glitch. Once confirmed, the system automatically alerts the on-call doctor (CloudWatch alarm) who can trigger a predefined action (Auto Recovery). The recovery action is like moving the patient to a different ICU bed on the same floor (same Availability Zone) with a fresh set of equipment (new host hardware) while preserving the patient's IV lines and monitors (same instance ID, private IP, Elastic IP, and EBS attachments). The patient never loses their identity, and the recovery is seamless to the outside world. If the patient's condition is too severe (e.g., hardware failure), the system may decide to terminate and restart with a clean slate (terminate and replace), but that's a different procedure. The key is that the monitor and the automated response work together to minimize downtime without human intervention.

How It Actually Works

What Are EC2 Status Checks and Why Do They Exist?

EC2 status checks are automated checks that run every minute to detect hardware and software issues that may impair an instance. They are the foundation of instance health monitoring in AWS. There are two types:

System Status Checks: These monitor the health of the underlying physical host, including network connectivity, power supply, and the host's software stack. A failure here indicates a problem that AWS can often fix by moving the instance to a new host.

Instance Status Checks: These monitor the software and network configuration of the instance itself, such as operating system responsiveness, file system integrity, and proper networking stack. A failure here indicates an issue inside the instance that typically requires the customer to take action (e.g., reboot, repair OS).

Status checks are not configurable — they run automatically every minute and report either "OK" or "Impaired". The results are visible in the EC2 console, CloudWatch, and can be used to trigger CloudWatch alarms and automated actions.

How It Works Internally — The Mechanism

Each EC2 instance has an associated hypervisor agent that runs on the physical host. This agent performs the following checks:

1. System Status Check Components:

Network reachability: Verifies that the instance can communicate with the AWS internal network.

Power status: Ensures the host has stable power.

Physical host health: Checks for hardware degradation like memory errors or disk failures.

AWS software health: Checks the hypervisor and other host-level software.

2. Instance Status Check Components:

Operating system responsiveness: AWS documents that EC2 sends an Address Resolution Protocol (ARP) request to the instance's network interface and expects a reply from the guest OS.

File system integrity: Checks if the root volume is accessible and not corrupted.

Networking configuration: Verifies that the instance's network interface is properly configured and responding.

If any of these checks fail for two consecutive minutes (the default), the status check result is set to "Impaired". The system does not immediately declare failure; it waits for two checks to avoid transient issues.

Key Components, Values, Defaults, and Timers

- Status check interval: Every 1 minute.
- Failure threshold: 2 consecutive failures (2 minutes) before status is marked "Impaired".
- Status check types:
  - system-status-reachability: System status check.
  - instance-status-reachability: Instance status check.
- CloudWatch metrics:
  - StatusCheckFailed: Metric that counts the number of failed status checks in a 1-minute period. Values can be 0 or 1.
  - StatusCheckFailed_System: Count of failed system status checks.
  - StatusCheckFailed_Instance: Count of failed instance status checks.
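The relationship between the three metrics can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes the two check types described here are the only inputs to the combined metric.

```python
def status_check_metrics(system_ok: bool, instance_ok: bool) -> dict:
    """Map one minute's check results to the three CloudWatch metric values."""
    system_failed = 0 if system_ok else 1
    instance_failed = 0 if instance_ok else 1
    return {
        "StatusCheckFailed_System": system_failed,
        "StatusCheckFailed_Instance": instance_failed,
        # The combined metric is 1 if either individual check failed.
        "StatusCheckFailed": max(system_failed, instance_failed),
    }


# A host-level failure raises the system metric and the combined metric,
# but leaves the instance metric at 0.
print(status_check_metrics(system_ok=False, instance_ok=True))
```

Because only StatusCheckFailed_System isolates host problems, that is the metric an Auto Recovery alarm watches.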

Auto Recovery prerequisites:

The instance must be in a VPC (not EC2-Classic).

The instance must use an EBS root volume (not instance store).

The instance must be running (not stopped or terminated).

An Elastic IP is not required, but if one is associated with the instance, recovery preserves it.

- Recovery actions:
  - Auto Recovery: The instance is stopped and started on a new physical host within the same Availability Zone. The instance ID, private IP, Elastic IP, and EBS volumes are preserved. This typically takes 5-10 minutes.
  - Terminate and Replace: Not automatic; must be configured via CloudWatch alarm. The instance is terminated and a new one is launched (e.g., via Auto Scaling).

Configuration and Verification Commands

To configure Auto Recovery, you create a CloudWatch alarm that triggers on the StatusCheckFailed_System metric. The alarm action must be recover.

AWS CLI Example:

aws cloudwatch put-metric-alarm \
    --alarm-name "EC2-AutoRecover-i-1234567890abcdef0" \
    --alarm-description "Auto-recover instance i-1234567890abcdef0 on system status check failure" \
    --metric-name StatusCheckFailed_System \
    --namespace AWS/EC2 \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
    --alarm-actions arn:aws:automate:us-east-1:ec2:recover

Note: The recover action is an ARN in the format arn:aws:automate:REGION:ec2:recover. The region must match the instance's region.
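The same alarm can be described in Python. The sketch below only builds the parameter dictionary and makes no AWS call, so it runs without credentials. The helper names `recover_action_arn` and `build_recover_alarm` are hypothetical; the dictionary keys are the parameter names used by boto3's CloudWatch `put_metric_alarm`.

```python
def recover_action_arn(region: str) -> str:
    """Build the recover action ARN for the given region."""
    return f"arn:aws:automate:{region}:ec2:recover"


def build_recover_alarm(instance_id: str, region: str) -> dict:
    """Return put_metric_alarm parameters mirroring the CLI example above."""
    return {
        "AlarmName": f"EC2-AutoRecover-{instance_id}",
        "AlarmDescription": (
            f"Auto-recover instance {instance_id} on system status check failure"
        ),
        "MetricName": "StatusCheckFailed_System",
        "Namespace": "AWS/EC2",
        "Statistic": "Maximum",
        "Period": 60,                # seconds, matches the 1-minute check interval
        "EvaluationPeriods": 2,      # two consecutive breaching periods
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        # The ARN region must match the instance's region.
        "AlarmActions": [recover_action_arn(region)],
    }


params = build_recover_alarm("i-1234567890abcdef0", "us-east-1")
print(params["AlarmActions"][0])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)`.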

Verification:

Check CloudWatch alarm state: aws cloudwatch describe-alarms --alarm-names "EC2-AutoRecover-i-1234567890abcdef0"

Check instance status: aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0

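View recovery history: aws cloudwatch describe-alarm-history --alarm-name "EC2-AutoRecover-i-1234567890abcdef0" lists the alarm's state changes and the actions it triggered.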
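When scripting these checks, the output of describe-instance-status can be reduced to the two fields that matter. The sketch below assumes the response shape returned by the EC2 API (an `InstanceStatuses` list whose entries carry `SystemStatus` and `InstanceStatus`); the `sample` dictionary is made-up illustrative data, not real output, and `summarize_status` is a hypothetical helper.

```python
def summarize_status(response: dict) -> dict:
    """Return {instance_id: {"system": ..., "instance": ...}} from a response."""
    summary = {}
    for item in response.get("InstanceStatuses", []):
        summary[item["InstanceId"]] = {
            "system": item["SystemStatus"]["Status"],
            "instance": item["InstanceStatus"]["Status"],
        }
    return summary


# Hypothetical response for an instance whose host is unhealthy.
sample = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-1234567890abcdef0",
            "InstanceState": {"Name": "running"},
            "SystemStatus": {"Status": "impaired"},
            "InstanceStatus": {"Status": "ok"},
        }
    ]
}

print(summarize_status(sample))
```

An "impaired" system status alongside an "ok" instance status is the pattern that an Auto Recovery alarm is meant to catch.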

Interaction with Related Technologies

CloudWatch Alarms: The primary trigger for Auto Recovery. Alarms can also trigger SNS notifications or Lambda functions for custom actions.

Auto Scaling: If an instance fails a system status check, Auto Scaling can terminate it and launch a new one if part of an Auto Scaling group. However, Auto Recovery is often preferred for stateful instances.

AWS Systems Manager Automation: Can be used to create custom recovery workflows that go beyond simple recovery, such as running scripts before recovery.

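Elastic Load Balancing: A load balancer decides where to route traffic based on its own health checks, not on EC2 status checks. An instance-level problem will usually fail those health checks as well, so traffic stops flowing to the instance, but Auto Recovery does not fix instance-level issues; you need to reboot the instance or fix the OS.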

Important Limitations

Auto Recovery only works for system status check failures, not instance status check failures.

The instance must be in a VPC.

The instance must have an EBS root volume.

If the instance is part of an Auto Scaling group, Auto Recovery may conflict with Auto Scaling health checks. AWS recommends disabling Auto Recovery for instances in an Auto Scaling group and letting Auto Scaling handle recovery.

After recovery, the instance is rebooted, and any memory state is lost. Data on EBS volumes persists.

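AWS documentation states that an auto-assigned public IPv4 address is retained after an automatic recovery. It is released only on a manual stop and start (or termination), so if you need a fixed public IP across those operations too, use an Elastic IP.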

Recovery may fail, for example when there is temporarily insufficient replacement capacity in the Availability Zone. In that case the instance remains impaired, and AWS recommends manually stopping and starting it.
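The prerequisites and limitations in this section can be collected into a small pre-flight check. This is a sketch under stated assumptions: the function name and its arguments are hypothetical, and it encodes only the rules listed in this chapter, not the full AWS support matrix (instance type support, for example, is not checked).

```python
from typing import List


def auto_recovery_concerns(
    in_vpc: bool,
    root_device_type: str,           # "ebs" or "instance-store"
    state: str,                      # e.g. "running", "stopped"
    in_auto_scaling_group: bool = False,
) -> List[str]:
    """Return reasons why Auto Recovery is unavailable or not advisable."""
    concerns = []
    if not in_vpc:
        concerns.append("instance must be in a VPC")
    if root_device_type != "ebs":
        concerns.append("root volume must be EBS")
    if state != "running":
        concerns.append("instance must be running")
    if in_auto_scaling_group:
        # Not a hard blocker, but the chapter's guidance is to avoid it.
        concerns.append("use Auto Scaling health checks instead")
    return concerns


print(auto_recovery_concerns(True, "ebs", "running"))
print(auto_recovery_concerns(True, "instance-store", "running"))
```

An empty list means none of the limitations covered here apply; it does not guarantee that the instance type itself supports the recover action.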

Exam Tips

The exam loves to test the difference between system and instance status check failures. Remember: system = host problem (AWS responsible), instance = guest problem (customer responsible).

Auto Recovery is only for system status check failures.

Status checks run every 60 seconds. The alarm itself has no built-in default for evaluation periods; the commonly used configuration is a 60-second period with 2 evaluation periods (2 minutes).

The recover action is not available for all instance types. Check the documentation for supported types.

When an instance is recovered, it retains its instance ID, private IP, Elastic IP, and all EBS volumes. However, any data stored in instance store volumes is lost (but instance store-backed instances cannot use Auto Recovery anyway).

Walk-Through

1. Status Check Execution

Every 60 seconds, the EC2 hypervisor agent performs system and instance status checks. The system check verifies the physical host's network, power, and hardware health. The instance check tests OS responsiveness, file system integrity, and network configuration. For the instance check, AWS documents that EC2 sends an Address Resolution Protocol (ARP) request to the instance's network interface. If all checks pass, the status is 'OK'. If any check fails, it is recorded as a failure for that minute.

2. Failure Accumulation and Threshold

A single failed status check does not immediately impair the instance. The system uses a threshold of two consecutive failures (2 minutes) before marking the status as 'Impaired'. This prevents transient issues from triggering unnecessary actions. The CloudWatch metric `StatusCheckFailed_System` reports a 1 for each minute a system check fails. The alarm is typically set to 2 evaluation periods of 60 seconds with a threshold of 1 and the Maximum statistic. Unless you set a lower datapoints-to-alarm value, both periods must breach, so the alarm triggers only after the metric is >= 1 for two consecutive minutes.
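To make the evaluation concrete, here is a simplified model of how such an alarm reaches the ALARM state. It assumes one datapoint per 60-second period and that DatapointsToAlarm is left unset, in which case CloudWatch requires every one of the last EvaluationPeriods datapoints to breach. Missing-data handling and M-of-N alarms are ignored, and `alarm_state` is a hypothetical helper, not a CloudWatch API.

```python
from typing import Sequence


def alarm_state(
    datapoints: Sequence[float],
    evaluation_periods: int = 2,
    threshold: float = 1,
) -> str:
    """Evaluate the most recent datapoints of StatusCheckFailed_System."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    # GreaterThanOrEqualToThreshold must hold for every evaluated period.
    breaching = all(value >= threshold for value in recent)
    return "ALARM" if breaching else "OK"


print(alarm_state([0, 1]))      # one failed minute: a transient glitch
print(alarm_state([0, 1, 1]))   # two consecutive failed minutes
print(alarm_state([1, 0, 1]))   # failures that are not consecutive
```

Only the second call reports ALARM, which is why two evaluation periods filter out a single failed minute.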

3. CloudWatch Alarm Triggers

When the CloudWatch alarm enters the ALARM state, it executes the configured actions. For Auto Recovery, the action is the `recover` ARN. The alarm can also send notifications via SNS or invoke a Lambda function. The state change and the actions taken are recorded in the alarm's history. The alarm evaluates the metric once per period (60 seconds in the example configuration) and transitions to ALARM after the specified number of evaluation periods with a breach.

4. Recovery Initiation

AWS EC2 receives the recovery action and begins the process. The behavior resembles a stop and start: the instance is taken off the impaired host, AWS selects a healthy physical host in the same Availability Zone, the EBS volumes are made available there, and the instance boots again. AWS documentation describes this as a migration performed as part of an instance reboot, which is why in-memory data is lost while the instance's identity and addresses are kept. During this time, the instance is unavailable, typically for 5-10 minutes.

5. Post-Recovery Verification

After the instance starts, the hypervisor agent resumes status checks. The instance should now pass system status checks. The instance ID, private IP, Elastic IP, and all EBS volumes (including data) are preserved. AWS documentation states that an auto-assigned public IPv4 address is also retained after recovery, unlike after a manual stop and start. You can confirm that the recover action ran from the alarm's history in CloudWatch, and monitor the instance's status checks in the console or via CLI.

What This Looks Like on the Job

In a production environment, EC2 Auto Recovery is commonly used for stateful applications that cannot easily be replaced by a new instance. For example, consider a company running a legacy database on an EC2 instance with a large EBS volume. The database cannot be easily replicated or replaced without downtime. By configuring a CloudWatch alarm on StatusCheckFailed_System with the recover action, the database instance can automatically recover from underlying hardware failures without manual intervention. The instance retains its private IP, so applications that connect via that IP need no reconfiguration, although existing connections drop during the reboot and must be re-established. The team must also ensure that the database can handle a crash recovery (e.g., by using a transactional database that replays logs on restart).

Another scenario is a bastion host or NAT instance that serves as a gateway. These instances are often stateless, but losing their IP would break connectivity. Auto Recovery preserves the Elastic IP, so the gateway remains reachable.

In large-scale deployments, such as a fleet of web servers behind a load balancer, Auto Recovery is less critical because Auto Scaling can replace unhealthy instances. In fact, combining Auto Recovery with Auto Scaling can cause conflicts: if an instance fails a system check, Auto Recovery might start the instance again, while Auto Scaling could terminate it simultaneously. Best practice is to disable Auto Recovery for instances in an Auto Scaling group and rely on Auto Scaling health checks instead.

Performance considerations: Recovery takes 5-10 minutes, during which the instance is unavailable. For critical workloads, you should deploy multiple instances in different Availability Zones and use a load balancer. Misconfiguration often occurs when the alarm is too sensitive (e.g., an evaluation period of 1), causing false positives from transient network glitches. Relying on an auto-assigned public IP is also fragile: AWS documents that it survives an automatic recovery, but it changes on a manual stop and start, which breaks client connections, so attach an Elastic IP if clients depend on a fixed address. Finally, not every instance type or configuration supports recovery; always check the documentation.

How SOA-C02 Actually Tests This

The SOA-C02 exam tests EC2 Status Checks and Auto Recovery under Domain 2.1 (Reliability). The objective is to "Implement and manage high availability and disaster recovery strategies." Specific exam topics include:

- Distinguishing between system and instance status checks.
- Configuring CloudWatch alarms for Auto Recovery.
- Understanding the limitations and prerequisites of Auto Recovery.
- Knowing that Auto Recovery only works for system status check failures.
- Recognizing that instance status check failures require customer action (e.g., reboot, OS repair).
- The status check interval is 1 minute, and the alarm is typically configured to evaluate 2 periods (2 minutes).
- The recover action ARN format: arn:aws:automate:REGION:ec2:recover.
- Auto Recovery preserves instance ID, private IP, Elastic IP, and EBS volumes.
- Common wrong answers:
  1. Choosing "Auto Recovery for instance status check failures": this is incorrect; Auto Recovery only applies to system status checks.
  2. Thinking that Auto Recovery works for instance store-backed instances: it does not; the root volume must be EBS.
  3. Believing that Auto Recovery changes the instance's IP addresses: private IPs and Elastic IPs are preserved, and AWS documentation states that an auto-assigned public IPv4 address is retained as well.
  4. Assuming that Auto Recovery works in EC2-Classic: it does not; the instance must be in a VPC.
- Edge cases:

If the instance is in an Auto Scaling group, Auto Recovery may conflict with Auto Scaling health checks. The exam expects you to know that Auto Recovery should be disabled in that case.

If the recovery fails due to insufficient capacity, the instance remains impaired. You need to manually stop and start it.

The exam may ask about the number of evaluation periods needed to avoid false positives: the commonly recommended value is 2.

How to eliminate wrong answers: Focus on the type of failure. If the scenario describes a problem with the underlying hardware (e.g., "physical host failure"), the answer involves system status check and Auto Recovery. If the problem is inside the instance (e.g., "OS crash"), the answer involves instance status check and manual intervention (reboot or terminate).
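The elimination rule above amounts to a two-branch decision, sketched here in Python. The function name and the returned labels are hypothetical and exist only to illustrate the reasoning.

```python
def recommended_response(system_failed: bool, instance_failed: bool) -> str:
    """Pick the response the exam expects for a given status check result."""
    if system_failed:
        # Host-level problem: AWS-side fault, Auto Recovery applies.
        return "auto-recovery"
    if instance_failed:
        # Guest-level problem: the customer must reboot, repair, or replace.
        return "customer-action"
    return "none"


print(recommended_response(system_failed=True, instance_failed=False))
print(recommended_response(system_failed=False, instance_failed=True))
```

The system check is tested first because a failing host usually makes the instance check fail as well, and in that case recovery is still the relevant answer.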

Key Takeaways

EC2 status checks run every 60 seconds and have two types: system (host health) and instance (guest health).

Auto Recovery only works for system status check failures and requires the instance to be in a VPC with an EBS root volume.

A typical CloudWatch alarm for Auto Recovery evaluates 2 periods of 60 seconds (2 minutes); this is a common configuration, not a built-in default.

After Auto Recovery, the instance retains its instance ID, private IP, Elastic IP, and all EBS volumes; AWS documentation states that an auto-assigned public IPv4 address is retained as well.

Auto Recovery is not supported for instance store-backed instances or instances in EC2-Classic.

Do not use Auto Recovery for instances in an Auto Scaling group; use Auto Scaling health checks instead.

The recover action ARN format is: arn:aws:automate:REGION:ec2:recover.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Auto Recovery

Triggers on system status check failures only.

Preserves instance ID, private IP, Elastic IP, and EBS volumes.

Instance is stopped and started on a new host (takes 5-10 minutes).

Best for stateful instances that cannot be replaced.

Does not automatically terminate the instance; recovers it.

Auto Scaling Health Check Replacement

Triggers on EC2 status checks (both system and instance) or ELB health checks.

Terminates the unhealthy instance and launches a new one (new instance ID, new IP).

New instance is launched from a launch template or configuration (fast if using AMI).

Best for stateless instances that can be replaced without data loss.

Automatically terminates the instance and launches a replacement.

Watch Out for These

Mistake

Auto Recovery works for both system and instance status check failures.

Correct

Auto Recovery only triggers on system status check failures. Instance status check failures require customer action such as rebooting the instance, fixing the OS, or replacing the instance.

Mistake

After Auto Recovery, the instance gets a new private IP address.

Correct

The private IP address is preserved. Elastic IPs are also preserved, and AWS documentation states that an auto-assigned public IPv4 address is retained after recovery as well.

Mistake

Auto Recovery works for instance store-backed instances.

Correct

Auto Recovery requires an EBS root volume. Instance store-backed instances cannot be recovered because the root volume is ephemeral and data is lost on stop/termination.

Mistake

You can configure Auto Recovery directly in the EC2 console without CloudWatch.

Correct

Auto Recovery is configured via CloudWatch alarms. In the EC2 console, you can create the alarm from the instance's monitoring tab, but it still creates a CloudWatch alarm underneath.

Mistake

Auto Recovery works in EC2-Classic.

Correct

Auto Recovery is only supported for instances in a VPC. EC2-Classic instances cannot use Auto Recovery.


Frequently Asked Questions

What is the difference between system status check and instance status check on EC2?

System status checks monitor the health of the underlying physical host, including network, power, and hardware. Instance status checks monitor the software and OS of the instance itself, such as file system integrity and network configuration. AWS is responsible for system failures; customers are responsible for instance failures.

Can Auto Recovery fix instance status check failures?

No. Auto Recovery only triggers on system status check failures. For instance status check failures, you need to manually reboot the instance, fix the OS, or replace the instance. Instance status checks indicate issues inside the instance that AWS cannot resolve by moving the instance to a new host.

Does Auto Recovery preserve the public IP address of my instance?

Yes. An Elastic IP is preserved, and AWS documentation states that an auto-assigned public IPv4 address is also retained after an automatic recovery. The auto-assigned address is released only when the instance is stopped and started (or terminated), so assign an Elastic IP if you need an address that survives those operations too.

What happens if Auto Recovery fails due to insufficient capacity?

If there is temporarily insufficient capacity in the Availability Zone to place the instance on a new host, the recovery fails and the instance remains impaired. AWS attempts automatic recovery only a limited number of times per day, so if the system status check keeps failing you should manually stop and start the instance, or change the instance type or Availability Zone.

Can I use Auto Recovery for instances in an Auto Scaling group?

It is not recommended. If an instance in an Auto Scaling group fails a system status check, both Auto Recovery and Auto Scaling health checks may attempt to recover/replace the instance, causing conflicts. AWS recommends disabling Auto Recovery for instances in an Auto Scaling group and using Auto Scaling health checks instead.

How do I configure Auto Recovery for an EC2 instance?

You create a CloudWatch alarm on the StatusCheckFailed_System metric with a threshold of 1, evaluation period of 2, and action set to the recover ARN. You can do this via the AWS Management Console, AWS CLI, or CloudFormation. The action ARN is arn:aws:automate:REGION:ec2:recover.

Does Auto Recovery work for all EC2 instance types?

No. Auto Recovery is supported for most current generation instance types, but not all. Check the AWS documentation for the specific list. For example, T1 micro and some older types are not supported. Always verify if your instance type is supported.
