This chapter covers Route 53 health checks and DNS failover, a critical topic for the SOA-C02 exam's Networking domain (Objective 5.2). Health checks enable Route 53 to monitor the health of your resources (e.g., web servers, load balancers) and automatically respond to failures by rerouting DNS traffic. Expect roughly 5-10% of exam questions to touch on health checks, failover routing, and related configurations. Mastering this topic requires understanding health check types, monitoring intervals, failure thresholds, and how DNS failover works at the protocol level.
Imagine a hospital emergency room with a single main entrance. Patients arrive and are assessed by a triage nurse. For non-critical cases, the nurse directs them to a waiting room and checks periodically. For critical cases, the nurse immediately alerts the trauma team. The hospital also has a secondary entrance that is used only if the main entrance is blocked (e.g., by a fire). In that case, a sign outside redirects ambulances to the backup entrance. The triage nurse is like Route 53 health checks: she constantly monitors each patient's condition (health check endpoints) and decides whether the patient (resource) is healthy enough to handle new arrivals (traffic). If a patient in the waiting room deteriorates, the nurse escalates (health check fails) and the main entrance may be closed, triggering a failover to the backup entrance (DNS failover). The periodic checks (health check interval) ensure the nurse's assessment is current. The backup entrance is only used when the primary is unhealthy, just like Route 53 failover routing. The nurse's decision is based on specific criteria (status codes, response time), and she can combine several assessments into a single verdict, much like a calculated health check.
What Are Route 53 Health Checks and DNS Failover?
Route 53 health checks are automated requests that evaluate the availability and responsiveness of an endpoint (e.g., an IP address or domain name). DNS failover uses these health checks to automatically change which records Route 53 returns when a resource becomes unhealthy, directing traffic to healthy resources instead. This is essential for building highly available applications.
How Health Checks Work Internally
Route 53 health checkers are distributed across multiple AWS regions (at least 8 regions globally). Each health checker sends a request to the endpoint's IP address or domain name at a configurable interval (standard: 30 seconds; fast: 10 seconds). The request is an HTTP/HTTPS GET or a TCP connection attempt. For HTTP/HTTPS, the health checker must establish a TCP connection within 4 seconds and then receive a 2xx or 3xx status code within 2 seconds. For TCP, it must complete the connection within 10 seconds. These timeouts are fixed and cannot be configured.
If the endpoint does not respond in time, the health checker marks that request as failed. Each health checker applies the failure threshold (default: 3 consecutive failures) on its own before it reports the endpoint as unhealthy. Route 53 then aggregates the reports: the health check is considered healthy only while more than 18% of health checkers report the endpoint healthy. The request interval and failure threshold together determine the detection time: e.g., with a 30-second interval and a threshold of 3, it takes roughly 90 seconds (3 cycles) to mark the endpoint unhealthy.
Key Components and Defaults
- Health Check Types:
  - Endpoint monitoring: Checks an IP address or domain name.
  - Calculated health checks: Combine multiple health checks via OR, AND, or at-least-N logic.
  - CloudWatch alarm health checks: Monitor a CloudWatch alarm (e.g., for a specific metric).
- Health Check Configuration Values:
  - Request interval: 10 or 30 seconds (default 30).
  - Failure threshold: 1-10, default 3.
  - Timeout: Fixed, not configurable (HTTP/HTTPS: 4 seconds to connect, then 2 seconds to respond; TCP: 10 seconds to connect).
  - String matching: Optionally check for a specific string in the first 5,120 bytes of the response body.
  - Latency graphs: Disabled by default; must be enabled when the health check is created.
  - Invert health check status: Reverses healthy/unhealthy (useful for CloudWatch alarms).
- DNS Failover Routing Policy:
  - Failover: You designate one record as primary and one as secondary. Route 53 returns the primary record if its health check is healthy; otherwise, it returns the secondary record. You can associate health checks with other routing policies as well, but only failover routing uses primary/secondary logic.
  - Active-active vs. active-passive: Failover routing is active-passive (primary-secondary). For active-active, use weighted or latency routing with health checks.
Configuration and Verification
To create a health check via AWS CLI:
aws route53 create-health-check --caller-reference $(uuidgen) --health-check-config '{
  "IPAddress": "203.0.113.1",
  "Port": 80,
  "Type": "HTTP",
  "ResourcePath": "/health",
  "RequestInterval": 30,
  "FailureThreshold": 3
}'
To associate a health check with a DNS record:
aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "example.com",
      "Type": "A",
      "SetIdentifier": "primary",
      "Failover": "PRIMARY",
      "HealthCheckId": "HEALTHCHECKID",
      "TTL": 60,
      "ResourceRecords": [{"Value": "203.0.113.1"}]
    }
  }]
}'
To verify health check status:
aws route53 get-health-check-status --health-check-id HEALTHCHECKID
Interaction with Related Technologies
Elastic Load Balancing (ELB): Route 53 health checks can monitor ELB endpoints. ELB itself has health checks for its targets, but Route 53 health checks are independent and can be used to failover between load balancers across regions.
Auto Scaling: Combined with Auto Scaling, Route 53 health checks can trigger failover before instances are terminated, reducing downtime.
CloudWatch: Route 53 can monitor CloudWatch alarms (e.g., for EC2 status checks) and use them as health checks.
AWS Global Accelerator: Provides anycast IPs and health checks at the edge, but Route 53 is still needed for DNS-level failover.
DNS Failover Mechanism
When a health check fails, Route 53 propagates the status change. For failover routing, Route 53 will immediately return the secondary record in response to DNS queries. However, DNS caching by clients and resolvers (TTL) can delay the effect. The TTL on the record should be low (e.g., 60 seconds) to minimize failover time. Route 53 supports health checks for all routing policies except simple routing (which cannot have health checks).
Calculated Health Checks
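Calculated health checks allow you to create a parent health check that monitors the status of up to 256 child health checks. You specify a health threshold (e.g., at least 2 of 3 child checks must be healthy): a threshold of 1 behaves like OR, and a threshold equal to the number of children behaves like AND. This is useful for complex dependencies. Note that the insufficient-data setting (healthy, unhealthy, or last known status) belongs to CloudWatch alarm health checks, not to calculated ones.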
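The threshold logic can be illustrated with a small shell function. This is only a sketch of the rule, not how Route 53 implements it: `parent_status` is a hypothetical helper, and the child statuses are passed in by hand as the words `healthy` or `unhealthy`.

```shell
# Hypothetical helper: evaluate a calculated (parent) health check.
# Usage: parent_status THRESHOLD CHILD_STATUS...
# Each child status is the literal word "healthy" or "unhealthy".
parent_status() (
  threshold=$1
  shift
  healthy=0
  for status in "$@"; do
    if [ "$status" = "healthy" ]; then
      healthy=$(( healthy + 1 ))
    fi
  done
  # The parent is healthy when enough children are healthy.
  if [ "$healthy" -ge "$threshold" ]; then
    echo healthy
  else
    echo unhealthy
  fi
)

parent_status 2 healthy unhealthy healthy     # at least 2 of 3 -> healthy
parent_status 3 healthy unhealthy healthy     # AND over 3 children -> unhealthy
parent_status 1 unhealthy unhealthy healthy   # OR over 3 children -> healthy
```

Expressing AND and OR as special cases of one threshold keeps the rule to a single comparison.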
Private Hosted Zones and Health Checks
Health checks cannot directly monitor resources in a VPC because they run from public IPs. To monitor private resources, you must either:
Use a CloudWatch alarm health check that monitors a metric from a private resource (e.g., an EC2 instance status check).
Use a TCP health check to a public-facing endpoint that proxies to the private resource.
Exam-Relevant Limits
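Maximum 200 active health checks per AWS account by default (soft limit, can be increased).
Each DNS record can be associated with only one health check; to combine several, point the record at a calculated health check.
Health checkers are in at least 8 regions; the health check is considered healthy while more than 18% of health checkers report the endpoint healthy, and unhealthy otherwise.
Route 53 health checks can be configured to check from specific regions (by default, all regions); if you customize the list, you must keep at least three regions.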
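Restricting the checker regions is done through the `Regions` field of the health check configuration. The sketch below only prints an example configuration document; the function name `region_limited_config`, the domain, and the three regions chosen are illustrative assumptions. The API requires at least three regions when the field is set.

```shell
# Hypothetical helper: print a health check config limited to three regions.
region_limited_config() (
  cat <<'EOF'
{
  "Type": "HTTPS",
  "FullyQualifiedDomainName": "example.com",
  "Port": 443,
  "ResourcePath": "/health",
  "RequestInterval": 30,
  "FailureThreshold": 3,
  "Regions": ["us-east-1", "us-west-2", "eu-west-1"]
}
EOF
)

region_limited_config
# To use it (not executed here):
# aws route53 create-health-check --caller-reference "$(uuidgen)" \
#   --health-check-config "$(region_limited_config)"
```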
Create a Health Check
Using the Route 53 console, CLI, or SDK, define a health check specifying the endpoint (IP or domain), protocol (HTTP, HTTPS, TCP), port, resource path (for HTTP/HTTPS), request interval (10 or 30 seconds), failure threshold (1-10), and optional string matching. The health check is assigned a unique ID. Route 53 then begins sending probes from its distributed health checkers.
Health Checkers Probe Endpoint
Route 53 health checkers located in at least 8 AWS regions send requests to the endpoint at the configured interval. For HTTP/HTTPS, they perform GET requests and expect a 2xx or 3xx status code (and, if string matching is configured, the search string in the response body). For TCP, they attempt a three-way handshake. Each health checker records success or failure based on fixed timeouts: 4 seconds to connect plus 2 seconds to respond for HTTP/HTTPS, and 10 seconds to connect for TCP. A failure is recorded if there is no response in time or the status code is wrong.
Aggregate Results and Determine Status
Route 53 aggregates the results from all health checkers. Each health checker decides on its own whether the endpoint is healthy, switching to unhealthy once its consecutive failed probes reach the failure threshold. If more than 18% of health checkers report the endpoint healthy, the health check is healthy; otherwise its status changes to unhealthy. For example, with a 30-second interval and threshold of 3, the health check becomes unhealthy after roughly 90 seconds of continuous failures. The status is then propagated to Route 53 DNS servers.
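The AWS documentation describes the aggregation step as a percentage rule: the health check counts as healthy while more than 18% of health checkers report the endpoint healthy (AWS notes this value may change). A minimal sketch of that rule follows; `aggregate_status` is a hypothetical helper and the checker count of 16 is an assumption chosen only for illustration.

```shell
# Hypothetical helper: apply the "more than 18% healthy" rule.
# Usage: aggregate_status HEALTHY_CHECKERS TOTAL_CHECKERS
aggregate_status() (
  healthy=$1
  total=$2
  # Integer arithmetic: healthy/total > 18/100  <=>  healthy*100 > total*18
  if [ $(( healthy * 100 )) -gt $(( total * 18 )) ]; then
    echo healthy
  else
    echo unhealthy
  fi
)

aggregate_status 3 16   # 18.75% of checkers report healthy -> healthy
aggregate_status 2 16   # 12.5% -> unhealthy
```

The low percentage means a health check stays healthy even when most checkers fail, which guards against localized network problems between a few checkers and the endpoint.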
DNS Failover Routing Responds
When a DNS query arrives for a record with failover routing and an associated health check, Route 53 checks the health check status. If the primary record's health check is healthy, Route 53 returns the primary record. If it is unhealthy, Route 53 returns the secondary record (if healthy). If the secondary is also unhealthy, Route 53 still has to answer, so it treats the records as healthy and returns the primary record as a last resort.
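The decision can be written out as a small shell function. This is a sketch of the selection rule only: `failover_answer` is a hypothetical helper, and the fallback to the primary when neither record is healthy reflects the behavior described in the Route 53 developer guide as understood here.

```shell
# Hypothetical helper: which failover record is returned?
# Usage: failover_answer PRIMARY_STATUS SECONDARY_STATUS
# Each status is the literal word "healthy" or "unhealthy".
failover_answer() (
  primary=$1
  secondary=$2
  if [ "$primary" = "healthy" ]; then
    echo PRIMARY
  elif [ "$secondary" = "healthy" ]; then
    echo SECONDARY
  else
    # Neither record is healthy: Route 53 still has to answer,
    # so the primary is returned as a last resort.
    echo PRIMARY
  fi
)

failover_answer healthy healthy       # PRIMARY
failover_answer unhealthy healthy     # SECONDARY
failover_answer unhealthy unhealthy   # PRIMARY
```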
Client Cache and TTL Considerations
After Route 53 returns a DNS response, the client and any intermediate resolvers may cache the result for the TTL duration (e.g., 60 seconds). This means failover is not instantaneous; the effective failover time is the health check detection time plus the remaining TTL. To minimize failover time, use a low TTL (e.g., 60 seconds) and a low failure threshold (e.g., 2). Route 53 also supports 'Evaluate Target Health' on alias records, which makes the health of the alias target (e.g., an ELB) count toward the health of the record.
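The timing arithmetic can be sketched as follows. `failover_seconds` is a hypothetical helper that gives a rough worst-case estimate (interval times threshold, plus a full TTL); it ignores the time Route 53 needs to propagate the status change.

```shell
# Hypothetical helper: rough worst-case failover time in seconds.
# Usage: failover_seconds INTERVAL THRESHOLD TTL
failover_seconds() (
  interval=$1    # request interval in seconds (10 or 30)
  threshold=$2   # consecutive failures required
  ttl=$3         # record TTL in seconds
  echo $(( interval * threshold + ttl ))
)

failover_seconds 30 3 60   # prints 150
failover_seconds 10 2 60   # prints 80
```

With these inputs, moving to the 10-second interval and a threshold of 2 cuts the estimate from 150 to 80 seconds, and the 60-second TTL then dominates the total.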
In a typical enterprise deployment, Route 53 health checks and DNS failover are used to provide multi-region high availability. For example, a global e-commerce website runs active-active in us-east-1 and eu-west-1 with an Application Load Balancer (ALB) in each region. Route 53 latency-based routing is used with health checks on each ALB. If the us-east-1 ALB becomes unhealthy, Route 53 automatically stops sending traffic to that region, directing all requests to eu-west-1. This setup requires careful TTL tuning (e.g., 60 seconds) and health check intervals (10 seconds with a low threshold) to keep failover time close to a minute.
Another common scenario is active-passive failover for a legacy application that runs on EC2 instances in a single region, with a standby environment in another region. Route 53 failover routing is configured with a primary record pointing to the active environment and a secondary record pointing to the standby. Health checks monitor the primary's ELB endpoint. If the primary fails, DNS automatically resolves to the standby's IP.
Misconfiguration often occurs when health checks are not associated with the correct record, or when the health check endpoint is not reachable from Route 53 (e.g., security groups blocking probes). Also, forgetting to set 'Evaluate Target Health' on alias records to ELBs can cause Route 53 to return unhealthy ELB IPs. In production, teams often use calculated health checks to combine multiple endpoints (e.g., web server and database) into a single health check for a service.
Scaling considerations include the default limit of 200 health checks per account, which may be insufficient for large environments. Route 53 health checks also incur charges per check per month, so cost optimization may involve reducing check frequency or using CloudWatch alarm health checks instead of endpoint checks for some scenarios.
The SOA-C02 exam tests Route 53 health checks and DNS failover under Objective 5.2 (Implement Route 53 routing policies). Key areas: (1) Health check types: endpoint, calculated, CloudWatch alarm. (2) Failover routing policy: primary/secondary logic, association with health checks. (3) Health check configuration: request interval (10 or 30 sec), failure threshold (1-10, default 3), and fixed timeouts (HTTP/HTTPS: 4 sec to connect plus 2 sec to respond; TCP: 10 sec to connect). (4) How health checks interact with other routing policies (weighted, latency, geolocation).
Common wrong answers: (A) Thinking health checks can be assigned to simple routing records; they cannot. (B) Assuming health checks work for private IPs without a proxy; they need public endpoints or CloudWatch alarms. (C) Believing failover is instant; TTL and health check intervals cause delay. (D) Confusing calculated health checks with routing policies; calculated health checks aggregate child health checks, not DNS records.
The exam loves edge cases: (1) What happens when both primary and secondary are unhealthy? Route 53 returns the primary record, because it treats all records as healthy when none of them is. (2) How to monitor a private instance? Use a CloudWatch alarm health check on the instance's status check metric. (3) How long does failover take? With a 10-second interval and a failure threshold of 2, detection takes about 20 seconds, plus TTL (e.g., 60 seconds) = about 80 seconds. (4) String matching: only the first 5,120 bytes of the response body are checked.
To eliminate wrong answers, focus on the mechanism: health checkers are distributed, they use TCP/HTTP, failure is based on consecutive failures, and DNS failover is passive (client cache matters). Remember that Route 53 health checks are independent of ELB health checks; they can be used together or separately.
Health check types: endpoint, calculated, CloudWatch alarm.
Default health check interval: 30 seconds (can be 10 seconds).
Default failure threshold: 3 consecutive failures.
Timeouts are fixed: HTTP/HTTPS checks allow 4 seconds to connect plus 2 seconds to respond; TCP checks allow 10 seconds to connect.
Health checks cannot be associated with simple routing records.
Failover routing policy requires exactly one primary and one secondary record per name/type.
Calculated health checks can combine up to 256 child health checks.
Health checkers are in at least 8 AWS regions.
DNS failover is not instant due to TTL caching.
To monitor private resources, use CloudWatch alarm health checks.
These come up on the exam all the time. Here's how to tell them apart.
Endpoint Health Check
Monitors a specific IP or domain directly.
Can check HTTP/TCP response and string matching.
Requires the endpoint to be publicly accessible.
Probes from multiple global locations.
Ideal for checking web servers and load balancers.
CloudWatch Alarm Health Check
Monitors a CloudWatch alarm (e.g., EC2 status check, custom metric).
Does not require public endpoint; can monitor private resources.
No string matching; alarm state is either OK, ALARM, or INSUFFICIENT_DATA.
Uses CloudWatch metric evaluation; no direct probing.
Useful for monitoring instance health or custom metrics.
Mistake
Health checks can monitor any resource in a VPC.
Correct
Health checkers run from public IPs and cannot reach private IPs in a VPC. To monitor private resources, use a CloudWatch alarm health check, or an endpoint health check against a public-facing proxy that forwards to the private resource.
Mistake
DNS failover is immediate after a health check fails.
Correct
After a health check fails, Route 53 starts answering queries with the failover record, but client DNS resolvers keep serving the old cached record for the TTL duration. The effective failover time is health check detection time plus remaining TTL. Low TTL (e.g., 60 seconds) is recommended.
Mistake
You can associate a health check with a simple routing record.
Correct
Simple routing records cannot have health checks. Health checks are only supported with alias records, weighted, latency, geolocation, geoproximity, multivalue answer, and failover routing policies.
Mistake
A health check with a 30-second interval checks every 30 seconds from a single location.
Correct
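Route 53 health checkers are distributed across at least 8 regions. Each checker sends a request every 30 seconds, and the checkers do not coordinate with one another, so the endpoint receives probes far more often than the interval suggests: on average about one every two seconds at the 30-second interval, and more than one per second at the 10-second interval. Health status is evaluated per checker and then aggregated.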
Mistake
If a health check is unhealthy, Route 53 stops sending queries to that record entirely.
Correct
For failover routing, Route 53 returns the secondary record if the primary is unhealthy. For other routing policies, Route 53 excludes records whose associated health check is unhealthy, or whose alias target is unhealthy when 'Evaluate Target Health' is enabled. If all records are unhealthy, Route 53 treats them all as healthy and answers according to the routing policy rather than returning nothing.
You cannot directly use an endpoint health check because the instance is not publicly accessible. Instead, create a CloudWatch alarm on the EC2 instance's status check metric (e.g., StatusCheckFailed). Then create a Route 53 health check of type 'CloudWatch alarm' and associate it with that alarm. The health check will reflect the alarm state: OK = healthy, ALARM = unhealthy. This allows you to monitor private instances without exposing them.
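A CloudWatch alarm health check is created with the `CLOUDWATCH_METRIC` type and an `AlarmIdentifier`. The sketch below only prints the configuration document; the helper name `alarm_health_check_config`, the alarm name, and the choice of `LastKnownStatus` for insufficient data are illustrative assumptions.

```shell
# Hypothetical helper: print a CloudWatch alarm health check config.
# Usage: alarm_health_check_config REGION ALARM_NAME
alarm_health_check_config() (
  region=$1
  alarm=$2
  cat <<EOF
{
  "Type": "CLOUDWATCH_METRIC",
  "AlarmIdentifier": {"Region": "$region", "Name": "$alarm"},
  "InsufficientDataHealthStatus": "LastKnownStatus"
}
EOF
)

alarm_health_check_config us-east-1 private-instance-status-alarm
# To use it (not executed here):
# aws route53 create-health-check --caller-reference "$(uuidgen)" \
#   --health-check-config "$(alarm_health_check_config us-east-1 private-instance-status-alarm)"
```

`InsufficientDataHealthStatus` accepts `Healthy`, `Unhealthy`, or `LastKnownStatus` and controls what the health check reports when the alarm has no data.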
'Evaluate Target Health' is a setting on alias records (e.g., for ELB) that tells Route 53 to consider the health of the target resource. For example, if an alias points to an ELB, enabling 'Evaluate Target Health' will check the ELB's own health check status. A health check is a separate Route 53 resource that directly probes an endpoint. Both can be used together: you can associate a health check with a record and also enable 'Evaluate Target Health' on an alias record.
Yes, as long as the servers are publicly accessible. You create endpoint health checks pointing to the public IPs or domain names of your on-premises servers. Then configure failover routing records: primary points to the main server, secondary to the backup. Route 53 will automatically fail over when the primary health check fails. Ensure your firewall allows inbound TCP/HTTP from Route 53 health checker IP ranges (published in the AWS documentation).
A calculated health check has a parent health check that monitors the status of up to 256 child health checks. You configure a rule: OR (any child healthy = parent healthy), AND (all children healthy = parent healthy), or at least N (e.g., at least 3 of 5). All three are expressed through a single health threshold: 1 for OR, the number of children for AND, or N. This is useful for complex dependencies, e.g., a service that requires both a web server and a database server to be healthy.
Route 53 returns the primary record even though it is unhealthy. When no record in a group is healthy, Route 53 treats all of them as healthy rather than returning no answer, and for failover routing that means the primary. To avoid a complete outage, you can add a further fallback (e.g., a static placeholder) or use a CloudWatch alarm to trigger an external notification.
No, the request interval is set at creation and cannot be modified. You must delete and recreate the health check to change the interval. The failure threshold can be updated via the API or console.
AWS publishes the IP ranges for Route 53 health checkers in the AWS IP Address Ranges JSON file (ip-ranges.json). They are listed with the service 'ROUTE53_HEALTHCHECKS'. You should allow these IPs in your security groups or firewalls if you use endpoint health checks.