
Route 53 Health Checks and DNS Failover

This chapter covers Route 53 health checks and DNS failover, a core mechanism for building resilient architectures on AWS. For the SAA-C03 exam, this topic appears in roughly 10-15% of questions, often combined with Auto Scaling, ELB, and multi-region architectures. You will learn how health checks monitor endpoint availability, how Route 53 uses them to direct traffic, and how to configure failover routing policies for high availability. Mastering these concepts is essential for achieving the 2.5 objective under Resilient Architectures.

25 min read
Intermediate
Updated May 31, 2026

Route 53 Health Checks as a Building Security System

Imagine a large office building with two entrances: a main door and a side door. A security guard monitors both doors. Each door has a sensor that reports its status (open/closed, locked/unlocked) every 30 seconds. If the main door sensor fails to report for three consecutive intervals (90 seconds), the guard marks it as unhealthy. The guard then updates a sign outside: 'Use side entrance.' Visitors arriving at the main door see the sign and walk to the side door. The guard also checks the side door; if both doors fail, he posts 'Building closed.' This mirrors Route 53 health checks: each endpoint (door) is probed at a configurable interval (default 30 seconds). After a configurable number of consecutive failures (default 3), the endpoint is marked unhealthy. Route 53 then updates DNS responses, directing traffic to healthy endpoints. The analogy extends to health check types: TCP checks are like the guard listening for the door click; HTTP checks are like the guard asking for a password (expecting a 200 OK). The guard's log is analogous to CloudWatch metrics.

How It Actually Works

What Are Route 53 Health Checks and Why Do They Exist?

Route 53 health checks are automated probes that monitor the availability and health of your endpoints (e.g., web servers, load balancers, or other AWS resources). They exist to enable DNS-level failover: if an endpoint becomes unhealthy, Route 53 can automatically route traffic away from it to a healthy endpoint. This is critical for building fault-tolerant applications that can survive individual component failures without manual intervention.

How Health Checks Work Internally

Route 53 health checks operate from a global network of health checkers distributed across AWS regions. When you create a health check, Route 53 assigns a set of these checkers to probe your endpoint from multiple locations. The health checkers send requests based on the type of check:

- TCP: Opens a TCP connection to the specified IP address and port. If the connection is established within the timeout, the endpoint is healthy.
- HTTP/HTTPS: Opens a TCP connection, sends an HTTP request, and expects a 2xx or 3xx status code. The string-matching variants additionally require a specified string to appear in the response body.
- Calculated checks: Combine the results of multiple child health checks (all healthy, at least one healthy, or at least N healthy), optionally inverting the result.
- CloudWatch alarm checks: Use the state of a CloudWatch alarm (OK, ALARM, INSUFFICIENT_DATA) as the health status.

Each health checker independently reports the endpoint status. Route 53 aggregates these reports: if more than 18% of the health checkers report the endpoint as healthy, the overall health check is considered healthy; at 18% or fewer, it is considered unhealthy. The cutoff is deliberately low so that a network problem affecting only some checker locations does not mark a working endpoint as down.
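The aggregation step can be sketched in a few lines of Python. This is an illustrative model only, not Route 53's implementation: the function name `aggregate_health` is hypothetical, and the cutoff is a parameter because AWS documentation currently gives it as "more than 18%" and notes that the value may change.

```python
def aggregate_health(reports: list[bool], healthy_fraction: float = 0.18) -> bool:
    """Model of how per-checker reports roll up into one health status.

    reports: one boolean per health checker (True = checker saw a healthy endpoint).
    healthy_fraction: assumed cutoff; the endpoint counts as healthy only when the
    share of healthy reports is strictly greater than this value.
    """
    if not reports:
        raise ValueError("at least one checker report is required")
    healthy_share = sum(reports) / len(reports)
    return healthy_share > healthy_fraction


# Example: 3 of 15 checkers (20%) still reach the endpoint, so it stays healthy.
still_healthy = aggregate_health([True] * 3 + [False] * 12)
```

Passing the cutoff as an argument keeps the sketch usable if the documented percentage changes.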

Key Parameters, Defaults, and Timers

Request interval: Default 30 seconds. You can set it to 10 seconds (faster detection, higher cost) or 30 seconds.

Failure threshold: Default 3. Range 1-10. This is the number of consecutive checks an endpoint must fail before it is marked unhealthy, and likewise the number it must pass before it is marked healthy again.

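Request timeout: Not configurable. For HTTP and HTTPS checks, Route 53 must establish the TCP connection within 4 seconds and receive a 2xx or 3xx response within 2 seconds of connecting. For TCP checks, the connection must be established within 10 seconds.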

Health checker regions: By default, Route 53 probes from checkers in all of its health-checking regions. You can optionally restrict a health check to specific regions, but you must select at least three.

DNS failover TTL: The TTL of the DNS record that uses the health check. When a health check flips, Route 53 updates DNS responses immediately, but clients may cache the old records until TTL expires. Recommended TTL: 60 seconds or less for fast failover.
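To see how these parameters combine, here is a rough back-of-the-envelope calculation in Python. It is a sketch under simplifying assumptions: it ignores the time Route 53 needs to aggregate checker reports and assumes resolvers honor the record TTL. The function name is hypothetical.

```python
def failover_window_seconds(request_interval: int, failure_threshold: int, ttl: int) -> tuple[int, int]:
    """Estimate (detection_time, worst_case_client_time) in seconds.

    detection_time: consecutive failed probes needed before the check flips.
    worst_case_client_time: detection plus one full TTL, for a client that
    cached the old answer just before the flip.
    """
    detection = request_interval * failure_threshold
    return detection, detection + ttl


# Defaults from this section: 30-second interval, threshold 3, 60-second TTL.
default_window = failover_window_seconds(30, 3, 60)
# Fast interval with the same threshold and TTL.
fast_window = failover_window_seconds(10, 3, 60)
```

With the defaults the estimate is 90 seconds to detect and up to 150 seconds for a client holding a cached answer; the 10-second interval cuts those to 30 and 90 seconds.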

Configuration and Verification

You create a health check via AWS Console, CLI, or SDK. Example using AWS CLI:

aws route53 create-health-check --caller-reference unique-string --health-check-config '{
    "IPAddress": "203.0.113.10",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "FullyQualifiedDomainName": "example.com",
    "RequestInterval": 30,
    "FailureThreshold": 3
}'
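If you generate the JSON programmatically, a small validating builder catches out-of-range values before the API call. The following Python sketch is hypothetical helper code, not part of any AWS SDK; it covers only the TCP, HTTP, and HTTPS types and mirrors the field names used in the CLI example above. The resulting dictionary could be serialized with `json.dumps` for `--health-check-config`.

```python
import json

ENDPOINT_TYPES = {"TCP", "HTTP", "HTTPS"}   # subset of types handled by this sketch
ALLOWED_INTERVALS = {10, 30}                # seconds
THRESHOLD_RANGE = range(1, 11)              # failure threshold 1-10 inclusive


def build_health_check_config(ip_address, port, check_type, resource_path=None,
                              fqdn=None, request_interval=30, failure_threshold=3):
    """Return a HealthCheckConfig-shaped dict, raising ValueError on bad input."""
    if check_type not in ENDPOINT_TYPES:
        raise ValueError(f"unsupported type for this sketch: {check_type}")
    if request_interval not in ALLOWED_INTERVALS:
        raise ValueError("request interval must be 10 or 30 seconds")
    if failure_threshold not in THRESHOLD_RANGE:
        raise ValueError("failure threshold must be between 1 and 10")
    config = {
        "IPAddress": ip_address,
        "Port": port,
        "Type": check_type,
        "RequestInterval": request_interval,
        "FailureThreshold": failure_threshold,
    }
    # A path only makes sense for HTTP/HTTPS checks.
    if resource_path is not None and check_type != "TCP":
        config["ResourcePath"] = resource_path
    if fqdn is not None:
        config["FullyQualifiedDomainName"] = fqdn
    return config


config = build_health_check_config("203.0.113.10", 443, "HTTPS", "/health", "example.com")
config_json = json.dumps(config)  # string suitable for --health-check-config
```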

To associate a health check with a DNS record, you use a routing policy that supports health checks: failover, weighted, latency, geolocation, geoproximity, or multivalue answer. For example, for failover routing:

aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch '{
    "Changes": [{
        "Action": "CREATE",
        "ResourceRecordSet": {
            "Name": "www.example.com.",
            "Type": "A",
            "SetIdentifier": "primary",
            "Failover": "PRIMARY",
            "HealthCheckId": "health-check-id",
            "TTL": 60,
            "ResourceRecords": [{"Value": "203.0.113.10"}]
        }
    }]
}'
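A failover setup needs a matching SECONDARY record as well. As a sketch, the Python below builds a change batch containing both records in the same shape as the CLI example. The helper names, the secondary IP address, and the health check IDs are placeholders chosen for illustration.

```python
import json


def failover_record(name, role, ip, health_check_id, ttl=60):
    """Build one CREATE change for a failover A record (role is PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "CREATE",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": role.lower(),
            "Failover": role,
            "HealthCheckId": health_check_id,
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        },
    }


def failover_change_batch(name, primary, secondary, ttl=60):
    """primary and secondary are (ip, health_check_id) pairs; both records share name and TTL."""
    return {
        "Changes": [
            failover_record(name, "PRIMARY", primary[0], primary[1], ttl),
            failover_record(name, "SECONDARY", secondary[0], secondary[1], ttl),
        ]
    }


batch = failover_change_batch(
    "www.example.com.",
    primary=("203.0.113.10", "primary-health-check-id"),      # placeholder ID
    secondary=("198.51.100.20", "secondary-health-check-id"),  # placeholder IP and ID
)
batch_json = json.dumps(batch, indent=4)  # string suitable for --change-batch
```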

To monitor health check status, use:

aws route53 get-health-check-status --health-check-id id

Interaction with Related Technologies

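Auto Scaling: Auto Scaling groups rely on their own EC2 and ELB health checks to replace failed instances. Route 53 health checks do not feed into Auto Scaling; they operate one layer up, steering DNS traffic away from an entire endpoint (for example, a regional load balancer) when it stops responding.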

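Elastic Load Balancer (ELB): The ELB has its own health checks for backend instances. A Route 53 health check can probe an ELB's DNS name, but when an alias record targets a load balancer the usual approach is to set Evaluate Target Health to Yes, so that Route 53 takes the load balancer's health into account without a separate health check.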

CloudWatch: Route 53 health checks can trigger CloudWatch alarms for notifications or automated actions.

AWS Global Accelerator: Uses health checks to route traffic to healthy endpoints in multiple regions, similar to Route 53 but at the network layer.

DNS Failover Mechanism

When a health check becomes unhealthy, Route 53 automatically changes its DNS responses for any record that references that health check. For failover routing, if the primary record's health check is unhealthy, Route 53 returns the secondary record. If both are unhealthy, Route 53 still returns the primary record rather than returning no answer at all.

Failover can be active-passive (primary/standby) or active-active (using weighted routing with health checks). For active-passive, you create two records with the same name, one PRIMARY and one SECONDARY. The primary is served until its health check fails, then the secondary is served. When the primary recovers, traffic shifts back automatically.
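The active-passive decision described above reduces to a small truth table. The Python sketch below models it with a hypothetical function name; it covers only the simple case of one primary and one secondary record.

```python
def select_failover_answer(primary_healthy: bool, secondary_healthy: bool) -> str:
    """Return which failover record is served for a given pair of health states."""
    if primary_healthy:
        return "PRIMARY"      # normal operation, and automatic failback after recovery
    if secondary_healthy:
        return "SECONDARY"    # primary is down, standby takes over
    return "PRIMARY"          # both down: the primary record is still returned


# Enumerate all four combinations of (primary, secondary) health.
table = {
    (p, s): select_failover_answer(p, s)
    for p in (True, False)
    for s in (True, False)
}
```

Note that the secondary's health matters only while the primary is unhealthy.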

Calculated Health Checks

Calculated health checks allow you to combine multiple child health checks using a logical operator. For example, you can create a health check that is healthy only if at least 2 out of 3 regional endpoints are healthy. This is useful for multi-region deployments where you want failover only if multiple regions fail.
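The "at least 2 of 3" rule can be modeled as a threshold over child results. The Python below is a sketch: the function is hypothetical, while `health_threshold` and `inverted` correspond to the HealthThreshold and Inverted settings of a calculated health check. AND is the special case where the threshold equals the number of children, and OR is a threshold of 1.

```python
def calculated_health(children: list[bool], health_threshold: int, inverted: bool = False) -> bool:
    """Healthy when at least health_threshold children are healthy; optionally inverted."""
    healthy = sum(children) >= health_threshold
    return (not healthy) if inverted else healthy


regions = [True, False, True]  # hypothetical child checks for three regional endpoints

at_least_two = calculated_health(regions, health_threshold=2)             # 2-of-3 rule
all_required = calculated_health(regions, health_threshold=len(regions))  # AND
any_required = calculated_health(regions, health_threshold=1)             # OR
```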

Private Hosted Zones and Health Checks

By default, Route 53 health checkers run from public IP addresses and cannot reach endpoints in private subnets (VPC). To health check private resources, you must use a CloudWatch alarm that monitors a custom metric (e.g., from a Lambda function) and then associate that alarm with the health check (CloudWatch alarm health check type). Alternatively, you can use a proxy endpoint in a public subnet that forwards to the private resource.
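How an alarm state translates into a health status can be sketched as follows. This Python model is illustrative: the function name is hypothetical, and the three options for missing data (Healthy, Unhealthy, LastKnownStatus) correspond to the InsufficientDataHealthStatus setting of a CloudWatch alarm health check. The default chosen here is arbitrary for the sketch.

```python
def alarm_health(alarm_state: str, insufficient_data_status: str = "LastKnownStatus",
                 last_known_healthy: bool = True) -> bool:
    """Map a CloudWatch alarm state to a health check result."""
    if alarm_state == "OK":
        return True
    if alarm_state == "ALARM":
        return False
    if alarm_state == "INSUFFICIENT_DATA":
        # The probe stopped publishing (or has not published yet); the outcome is configurable.
        if insufficient_data_status == "Healthy":
            return True
        if insufficient_data_status == "Unhealthy":
            return False
        if insufficient_data_status == "LastKnownStatus":
            return last_known_healthy
        raise ValueError(f"unknown insufficient-data setting: {insufficient_data_status}")
    raise ValueError(f"unknown alarm state: {alarm_state}")


# A probe that has gone silent, with the check configured to treat missing data as unhealthy.
silent_probe = alarm_health("INSUFFICIENT_DATA", insufficient_data_status="Unhealthy")
```

Treating missing data as unhealthy is the conservative choice when a silent probe most likely means the probe path itself is broken.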

Cost Considerations

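A basic health check costs about $0.50 per month for an AWS endpoint (more for non-AWS endpoints). Optional features such as the 10-second fast interval, HTTPS, string matching, and latency measurement are each billed as a monthly add-on on top of that; check the Route 53 pricing page for current figures.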

Health checkers send requests from multiple regions; you are not charged for data transfer for health check requests.

Calculated health checks are billed as health checks in their own right, on top of the child checks they monitor.

Limits

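Maximum 200 active health checks per AWS account (soft limit, can be increased).

A calculated health check can monitor up to 256 child health checks.

Health checks can be associated with records in both public and private hosted zones, but the health checkers themselves sit outside your VPC and can only probe publicly reachable endpoints.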

Step-by-Step Configuration

1. Create a health check for your primary endpoint.

2. Create a health check for your secondary endpoint.

3. Create a failover routing policy record set for primary, referencing the primary health check.

4. Create a failover routing policy record set for secondary, referencing the secondary health check.

5. Test by disabling the primary endpoint and verifying DNS resolution returns the secondary IP.

Monitoring and Troubleshooting

Use CloudWatch metrics: HealthCheckStatus (1=healthy, 0=unhealthy), HealthCheckPercentageHealthy (percentage of health checkers reporting healthy). Common failure reasons: endpoint not responding, incorrect port, timeout, string mismatch. Check health check status via CLI: aws route53 get-health-check-status --health-check-id id returns last status and failure reason.

Walk-Through

1. Create Health Check for Primary

In the Route 53 console, choose Health Checks > Create Health Check. Specify a name, protocol (TCP/HTTP/HTTPS), endpoint (IP or domain), port, path (for HTTP/HTTPS), request interval (10 or 30 seconds), and failure threshold (1-10). Optionally enable string matching. Each health checker then begins probing the endpoint at the chosen interval, so the endpoint sees several requests per interval in total. Unless you pick specific regions, Route 53 uses checkers in all of its health-checking regions. The health check status initially shows 'Initializing' for a few minutes until enough data is collected.

2. Create Health Check for Secondary

Repeat the process for the secondary endpoint. Use a different caller reference. For active-passive failover, the secondary endpoint may be in a different region or AZ. Ensure the health check is configured with the same interval and threshold for consistency. If using a CloudWatch alarm health check, create the alarm first, then reference its ARN.

3. Create Failover Record for Primary

In the hosted zone, create a record set with the same name as your domain (e.g., www.example.com). Choose routing policy 'Failover'. Set the failover record type to 'Primary'. Enter the primary endpoint's IP or alias target. Under 'Health Check ID', select the health check you created for the primary. Set TTL to 60 seconds or less for fast failover. Route 53 returns this record while its health check is healthy, and also as a fallback when both the primary and the secondary are unhealthy.

4. Create Failover Record for Secondary

Create another record set with the same name and type. Choose routing policy 'Failover' and set failover record type to 'Secondary'. Enter the secondary endpoint's IP or alias target. Select the secondary health check. Set TTL to the same value as primary. The secondary record is returned only when the primary health check is unhealthy (and secondary is healthy). If both are unhealthy, Route 53 returns the primary record by default.

5. Test Failover Behavior

Simulate a failure by stopping the primary endpoint (e.g., stop the web server or block the port). Wait for the failure threshold (default 3 x 30s = 90 seconds). Use `dig` or `nslookup` to query the DNS record. Initially, you get the primary IP. After 90 seconds, the query returns the secondary IP. Restart the primary; after the health check recovers (3 consecutive successes), DNS returns the primary IP again. Check CloudWatch metrics to see the health status change.
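The "three consecutive failures, then three consecutive successes" behavior in this test can be simulated before touching real infrastructure. The Python sketch below models a single checker's view with a hypothetical function; it applies the same threshold in both directions and ignores cross-checker aggregation.

```python
def simulate_status(probe_results: list[bool], threshold: int = 3,
                    initially_healthy: bool = True) -> list[bool]:
    """Return the health status after each probe, flipping only after
    `threshold` consecutive results that disagree with the current status."""
    status = initially_healthy
    streak = 0  # consecutive probes contradicting the current status
    timeline = []
    for ok in probe_results:
        if ok != status:
            streak += 1
            if streak >= threshold:
                status = ok
                streak = 0
        else:
            streak = 0  # an agreeing probe resets the count
        timeline.append(status)
    return timeline


# One good probe, three failures (primary stopped), then three successes (primary restarted).
outage = simulate_status([True, False, False, False, True, True, True])
# Two failures followed by a success never reach the threshold.
blip = simulate_status([False, False, True, False])
```

In the outage run the status flips to unhealthy on the third failed probe and back to healthy on the third successful one; the transient blip never changes the status.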

What This Looks Like on the Job

Enterprise Scenario 1: Multi-Region Active-Passive Web Application

A global e-commerce company runs a web application in us-east-1 (primary) and eu-west-1 (secondary). They use Route 53 failover routing with health checks pointing to an Application Load Balancer (ALB) in each region. The health checks are HTTP checks on the ALB's /health endpoint. The primary ALB is configured with a 30-second interval and 3 failure threshold. When us-east-1 experiences an outage, Route 53 detects three consecutive failures after 90 seconds and automatically returns the eu-west-1 ALB IP. The TTL is set to 60 seconds, so most clients fail over within 2-3 minutes. A common misconfiguration is forgetting to set the health check path to a lightweight endpoint that reflects application health, not just the load balancer's health. In production, they also use a CloudWatch alarm health check to monitor a custom metric that checks database connectivity, ensuring the health check is application-aware.

Enterprise Scenario 2: Weighted Routing with Health Checks for Blue/Green Deployment

A SaaS provider uses weighted routing with health checks to gradually shift traffic from blue to green deployments. They create two A records with weights 100 and 0, each associated with a health check. The green deployment's health check is configured to fail until the deployment is verified. Once verified, they update the green health check to be healthy, and adjust weights to 50/50. If the green deployment shows errors, its health check fails and Route 53 automatically stops sending traffic to it, so no manual rollback is needed. A common pitfall is not setting the health check interval low enough (10 seconds) for rapid detection; with a 30-second interval and 3 failures, it takes 90 seconds to detect a problem, which may be too slow for critical deployments.
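The effect of a failing health check on weighted traffic can be sketched numerically. The Python below is a simplified model with a hypothetical function name: unhealthy records are excluded, and if every record is unhealthy all of them are treated as healthy. Route 53's special handling of zero weights is not modeled.

```python
def traffic_shares(records: dict[str, tuple[int, bool]]) -> dict[str, float]:
    """records maps a record name to (weight, healthy); returns each record's traffic share."""
    eligible = {name: weight for name, (weight, healthy) in records.items() if healthy}
    if not eligible:
        # No healthy record left: fall back to considering every record.
        eligible = {name: weight for name, (weight, _) in records.items()}
    total = sum(eligible.values())
    if total == 0:
        raise ValueError("zero-weight handling is outside the scope of this sketch")
    return {name: eligible.get(name, 0) / total for name in records}


# 50/50 split while both deployments are healthy.
steady = traffic_shares({"blue": (50, True), "green": (50, True)})
# Green starts failing its health check: blue receives all traffic.
green_down = traffic_shares({"blue": (50, True), "green": (50, False)})
```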

Enterprise Scenario 3: Private Endpoint Health Check via CloudWatch Alarm

A financial services company hosts a critical API behind an internal Network Load Balancer (NLB) in a private subnet. Since Route 53 health checkers are public, they cannot directly probe the NLB. They deploy a Lambda function inside the VPC that periodically calls the API and publishes a custom metric to CloudWatch (e.g., api_health with value 1 for healthy, 0 for unhealthy). They create a CloudWatch alarm that triggers when the metric is below 1 for 1 minute. Then they create a Route 53 health check of type 'CloudWatch Alarm' and associate it with the alarm. This health check is used in a failover record. A common issue is the Lambda function's execution role not having permissions to publish metrics, causing the alarm to never trigger. They also set the alarm's evaluation period to match the health check's failure threshold to avoid flapping.

How SAA-C03 Actually Tests This

What SAA-C03 Tests

This topic falls under Objective 2.5: Design resilient architectures. The exam expects you to:

Understand how Route 53 health checks work with different routing policies (failover, weighted, latency, geolocation, multivalue answer).

Know the default values: request interval (30s), failure threshold (3), and the fixed timeouts (4s to connect plus 2s to respond for HTTP/HTTPS; 10s to connect for TCP).

Recognize when to use a health check vs. other monitoring (e.g., ELB health checks, Auto Scaling health checks).

Know how to health check private resources (use CloudWatch alarm health check).

Understand the difference between active-passive and active-active failover.

Common Wrong Answers

1. 'Use an ELB health check instead': Candidates often confuse ELB health checks (which monitor backend instances behind a load balancer) with Route 53 health checks (which monitor the endpoint itself). The exam will ask: 'How to route traffic away from a failed endpoint at the DNS level?' The answer is Route 53 health check, not ELB health check.

2. 'Set TTL to 0 for instant failover': Route 53 accepts a TTL of 0, but many resolvers ignore it and apply their own minimum (often 30-60 seconds), and clients may still cache. The correct approach is to use a low TTL (e.g., 60s) and rely on health check failover.

3. 'A standard health check can probe a private endpoint in a private hosted zone': This is false. Records in private hosted zones can reference health checks, but the health checkers run outside the VPC and cannot reach private IP addresses. For private endpoints, you must use CloudWatch alarm health checks.

4. 'You must manually switch back when the primary recovers': This is false. Route 53 automatically returns the primary record when its health check becomes healthy again; no record changes are needed.

Specific Numbers and Terms

Default request interval: 30 seconds (10 seconds available but costs more).

Default failure threshold: 3 consecutive failures.

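Timeouts (fixed, not configurable): 4 seconds to connect plus 2 seconds to respond for HTTP/HTTPS; 10 seconds to connect for TCP.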

Health check types: TCP, HTTP, HTTPS, Calculated, CloudWatch Alarm.

Routing policies that support health checks: Failover, Weighted, Latency, Geolocation, Geoproximity, Multivalue Answer.

Failover record types: PRIMARY and SECONDARY.

Health checker regions: all regions by default; at least 3 if you choose your own.

Edge Cases

Invert health check: You can invert the health check status, so a healthy endpoint is considered unhealthy. This is useful for scenarios like maintenance mode.

Calculated health checks: The exam may ask about combining health checks with AND/OR/NOT. For example, a calculated health check that is healthy only if at least 2 of 3 child checks are healthy.

Health check latency: The exam might test that health checks are measured from multiple regions, and the response time is not part of the health status (only availability).

How to Eliminate Wrong Answers

If the question involves DNS-level traffic routing away from a failed resource, eliminate answers that mention ELB or Auto Scaling health checks.

If the endpoint is in a private subnet, eliminate answers that use standard health checks; the correct answer must involve CloudWatch alarms.

If the question asks for 'fastest failover', look for options with 10-second interval and low TTL (but not 0).

If the question mentions 'active-active' failover, look for weighted or latency routing with health checks, not failover routing (which is active-passive).

Key Takeaways

Route 53 health checks monitor endpoint availability using health checkers distributed across multiple AWS regions.

Default request interval is 30 seconds; can be set to 10 seconds for faster detection (higher cost).

Default failure threshold is 3 consecutive failures; range 1-10.

Health checks support TCP, HTTP, HTTPS, Calculated, and CloudWatch Alarm types.

Health checks can be associated with failover, weighted, latency, geolocation, geoproximity, and multivalue answer routing policies.

To health check private endpoints, use CloudWatch alarm health checks with a custom metric.

Failover routing is active-passive: primary is served until unhealthy, then secondary is served.

When all records are unhealthy, Route 53 returns the primary record by default.

TTL should be set low (e.g., 60 seconds) for faster failover; TTL=0 is not recommended.

Records in private hosted zones can reference health checks, but because the health checkers run outside the VPC, private endpoints need a CloudWatch alarm health check.

Calculated health checks combine multiple child health checks using AND/OR/NOT logic.

Health check status can be monitored via CloudWatch metrics (HealthCheckStatus, HealthCheckPercentageHealthy).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Route 53 Health Check (Standard)

Directly probes endpoint via TCP/HTTP/HTTPS from public checkers.

Cannot reach private IPs in VPC.

Default interval 30s, failure threshold 3.

Supports string matching and invert health check.

Cost: $0.50/month per health check.

CloudWatch Alarm Health Check

Monitors the state of a CloudWatch alarm (OK/ALARM/INSUFFICIENT_DATA).

Can monitor private endpoints via custom metrics published by a probe in VPC.

Alarm evaluation period and datapoints determine failure detection time.

No direct probing; relies on metric data.

Cost: CloudWatch alarm costs + health check cost.

Watch Out for These

Mistake

Route 53 health checks can monitor any endpoint, including private IPs in VPC, without additional setup.

Correct

Route 53 health checkers are deployed on the public internet and cannot reach private IP addresses in VPCs. To health check private endpoints, you must use CloudWatch alarm health checks that monitor custom metrics published by a probe inside the VPC.

Mistake

Setting TTL to 0 ensures instant DNS failover.

Correct

Route 53 accepts a TTL of 0, but many DNS resolvers ignore it and apply their own minimum (e.g., 30 seconds). Client-side caching can delay failover further. The best practice is to use a low TTL (e.g., 60 seconds) combined with health check failover.

Mistake

Health checks are only used with failover routing policy.

Correct

Health checks can be associated with multiple routing policies: failover, weighted, latency, geolocation, geoproximity, and multivalue answer. For example, weighted routing with health checks ensures that unhealthy endpoints receive zero traffic regardless of weight.

Mistake

A health check failure immediately changes DNS responses.

Correct

The health check must fail a configurable number of consecutive times (default 3) before the endpoint is marked unhealthy. Only then does Route 53 alter DNS responses. This prevents flapping due to transient failures.

Mistake

Records in a private hosted zone cannot use health checks.

Correct

Records in private hosted zones can be associated with health checks. The real constraint is reachability: Route 53 health checkers run outside the VPC, so they cannot probe private IP addresses directly. To drive routing for private endpoints, use a CloudWatch alarm health check fed by a metric published from inside the VPC.


Frequently Asked Questions

How does Route 53 health check work?

Route 53 health checks work by having a global network of health checkers probe your endpoint at a configurable interval (default 30 seconds). Each health checker independently reports whether the endpoint responded successfully. Route 53 aggregates these reports: if more than 18% of the checkers report healthy, the overall health check is considered healthy. After a configurable number of consecutive failures (default 3), a checker treats the endpoint as unhealthy, and once too few checkers report healthy, Route 53 updates DNS responses accordingly.

What is the difference between Route 53 health check and ELB health check?

Route 53 health checks monitor the endpoint itself (e.g., an ALB, EC2 instance) from multiple global locations. ELB health checks monitor the health of backend instances behind the load balancer. Route 53 health checks are used for DNS-level failover, while ELB health checks are used to manage traffic distribution within a load balancer. For the SAA-C03 exam, understand that Route 53 health checks are for routing traffic away from a failed endpoint at the DNS level, whereas ELB health checks are for instance-level health within a load balancer.

Can Route 53 health check private IPs?

No, standard Route 53 health checks cannot reach private IPs in VPCs because health checkers run on the public internet. To health check private endpoints, you must use a CloudWatch alarm health check. First, deploy a probe (e.g., Lambda function) inside the VPC that publishes a custom metric to CloudWatch. Then create a CloudWatch alarm on that metric. Finally, create a Route 53 health check of type 'CloudWatch Alarm' and associate it with the alarm.

What is the default failure threshold for Route 53 health checks?

The default failure threshold is 3 consecutive failures. This means the endpoint must fail to respond three times in a row before it is marked unhealthy. You can configure this value between 1 and 10. A lower threshold (e.g., 2) provides faster failover but increases the risk of false positives from transient issues.

How do I set up DNS failover with Route 53?

To set up DNS failover, first create health checks for your primary and secondary endpoints. Then create two record sets with the same DNS name in your hosted zone: one with failover routing policy set to PRIMARY (associated with the primary health check) and one with failover routing policy set to SECONDARY (associated with the secondary health check). Set a low TTL (e.g., 60 seconds). Route 53 will automatically return the primary record when its health check is healthy, and the secondary record when the primary is unhealthy.

What routing policies support health checks?

The following routing policies support health checks: Failover (active-passive), Weighted, Latency, Geolocation, Geoproximity (traffic flow only), and Multivalue Answer. Simple routing does not support health checks. For the exam, know that health checks can be used with any policy that allows multiple records, except Simple.

What is a calculated health check?

A calculated health check is a health check that combines the status of multiple child health checks using a logical operator: AND, OR, or NOT. For example, you can create a calculated health check that is healthy only if at least 2 out of 3 child health checks are healthy. This is useful for multi-region deployments where you want failover only when multiple regions fail.
