N10-009 | Chapter 105 of 163 | Objective 3.4

Failover and Recovery Testing

This chapter covers failover and recovery testing, a critical skill for ensuring network resilience under Objective 3.4 of the CompTIA Network+ N10-009 exam. You will learn the purpose and methods of testing failover mechanisms, including active-passive and active-active configurations, as well as recovery testing procedures like backup validation and disaster recovery drills. Approximately 10–15% of exam questions touch on network operations and resilience, with failover testing being a key subtopic. Mastering this material will help you design, verify, and troubleshoot high-availability networks.

25 min read
Intermediate
Updated May 31, 2026

Fire Drill for Your Network

Think of failover and recovery testing like a fire drill for a large office building. The building (your network) has two stairwells (primary and backup data paths). Every month, the fire marshal (network engineer) triggers a drill by pulling the alarm on the primary stairwell. The drill tests: (1) Do the doors automatically unlock on the backup stairwell? (2) Do the emergency lights (routing protocols) illuminate and guide people (traffic) to the backup path? (3) Does the PA system (monitoring) correctly announce the switch? (4) How long does it take for all employees (packets) to evacuate and re-enter through the backup? If the backup stairwell is blocked by storage boxes (misconfiguration), the drill fails. The fire marshal logs the time, identifies bottlenecks, and drills again after clearing the obstruction. This is exactly what failover testing does: it simulates a failure, measures convergence time, verifies that backup paths work, and documents results to improve recovery procedures. Without drills, you only discover problems during a real fire — and that’s too late.

How It Actually Works

What is Failover and Recovery Testing?

Failover testing is the process of intentionally causing a failure in a network component (link, device, or service) to verify that the backup component takes over seamlessly. Recovery testing validates that the system returns to normal operation after the failure is resolved. These tests are essential for meeting Service Level Agreements (SLAs) that specify uptime (e.g., 99.999% “five nines” equals ~5.26 minutes of downtime per year). Without testing, failover mechanisms may silently fail due to configuration drift, hardware aging, or software bugs.
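The downtime budget behind an availability target is simple arithmetic, and it is worth checking before you set a convergence goal. The sketch below is illustrative only; the function name is hypothetical, and it assumes an average year of 365.25 days.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

At five nines the whole yearly budget is about 5.26 minutes, so a handful of slow failovers can consume it.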

How Failover Works Internally

Failover relies on a combination of protocols and hardware. At Layer 2, protocols like Spanning Tree Protocol (STP) or Rapid Spanning Tree Protocol (RSTP) provide link redundancy. At Layer 3, First Hop Redundancy Protocols (FHRPs) such as Hot Standby Router Protocol (HSRP), Virtual Router Redundancy Protocol (VRRP), or Gateway Load Balancing Protocol (GLBP) allow multiple routers to share a virtual IP address. Stateful failover (e.g., in firewalls) synchronizes session tables between active and standby units so that existing connections survive a switchover.

When a failure occurs, the following sequence happens:

Detection: The backup device detects the failure via missing hello messages (e.g., HSRP hello timer defaults to 3 seconds, hold timer 10 seconds).

Election: In a multi-device group, the backup with the highest priority becomes active; the highest IP address breaks a tie.

Convergence: Routing tables update; for FHRPs, the virtual MAC address moves to the new active device.

Traffic Redirection: Switches update their MAC address tables, and clients resume sending traffic to the virtual IP.
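The election step can be sketched in a few lines. This is a simplified model, not how any vendor implements it: the Router class and elect_active function are hypothetical names, and the model runs a fresh election among all live routers, ignoring preemption and timers.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Router:
    name: str
    priority: int
    ip: str
    alive: bool = True


def elect_active(routers):
    """Pick the active router: highest priority wins, highest IP breaks ties."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, IPv4Address(r.ip)))


group = [
    Router("R1", 110, "192.168.1.1"),
    Router("R2", 100, "192.168.1.2"),
    Router("R3", 100, "192.168.1.3"),
]
print(elect_active(group).name)  # R1 has the highest priority

# Simulate R1 failing: R2 and R3 tie on priority, so the higher IP wins.
after_failure = [Router("R1", 110, "192.168.1.1", alive=False)] + group[1:]
print(elect_active(after_failure).name)
```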

Key Components, Values, Defaults, and Timers

HSRP (Cisco proprietary):

Hello timer: 3 seconds

Hold timer: 10 seconds

Priority range: 0–255 (default 100)

Preemption: disabled by default; must be enabled to allow a higher-priority router to take over after recovery.

Virtual MAC: 0000.0c07.acXX (HSRP version 1; XX = group number in hexadecimal)

VRRP (IETF standard, RFC 5798):

Hello timer (advertisement interval): 1 second (default)

Master down interval: 3 × advertisement interval + skew time, where skew time = (256 – priority) / 256 seconds

Priority range: 1–254 (default 100); 255 is reserved for the router that owns the virtual IP address

Virtual MAC: 0000.5e00.01XX (XX = VRID in hexadecimal)
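To see what the master down interval formula gives in practice, here is a small worked calculation. It uses the formula exactly as stated above with a 1-second advertisement interval; the function name is hypothetical.

```python
def master_down_interval(priority: int, advertisement_interval: float = 1.0) -> float:
    """Seconds a VRRP backup waits before declaring the master down."""
    if not 1 <= priority <= 255:
        raise ValueError("VRRP priority must be between 1 and 255")
    skew_time = (256 - priority) / 256
    return 3 * advertisement_interval + skew_time


for priority in (100, 200, 254):
    print(f"priority {priority}: {master_down_interval(priority):.3f} s")
```

At the default priority of 100 the result is about 3.6 seconds, and it shrinks toward 3 seconds as priority rises. The skew term is what lets the highest-priority backup time out first and take over.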

GLBP (Cisco proprietary):

Hello timer: 3 seconds

Hold timer: 10 seconds

Load balancing methods: round-robin, weighted, host-dependent

Virtual MAC: 0007.b400.XXYY (XX = group, YY = AVF)

Active/Active vs Active/Passive:

Active/Passive: One device handles traffic; the standby remains idle until failure. Example: a single HSRP or VRRP group, where only the active (or master) router forwards traffic.

Active/Active: Both devices handle traffic simultaneously; failure causes load redistribution. Example: GLBP or equal-cost multipath (ECMP) routing.

Configuration and Verification Commands

HSRP Configuration (Cisco IOS):

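interface GigabitEthernet0/0
 ip address 192.168.1.1 255.255.255.0
 ! Virtual gateway address shared by HSRP group 1
 standby 1 ip 192.168.1.254
 ! Higher than the default of 100, so this router is preferred
 standby 1 priority 110
 ! Reclaim the active role after recovery
 standby 1 preempt
 ! Hello 1 second, hold 3 seconds (defaults are 3 and 10)
 standby 1 timers 1 3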

Verification:

show standby
show standby brief
show ip interface brief

VRRP Configuration (Juniper):

set interfaces ge-0/0/0 unit 0 family inet address 192.168.1.1/24 vrrp-group 1 virtual-address 192.168.1.254
set interfaces ge-0/0/0 unit 0 family inet address 192.168.1.1/24 vrrp-group 1 priority 110

Verification:

show vrrp
show vrrp detail

How Failover Testing Interacts with Related Technologies

Failover testing must consider dependencies:

DNS: If failover changes IP addresses, DNS TTLs may cause clients to cache old records. Use low TTLs (e.g., 60 seconds) for critical services.

Load Balancers: Application load balancers (e.g., F5, HAProxy) have health checks that mark servers as down. Testing must ensure health check intervals (e.g., 5 seconds) are shorter than failover timers.

SDN/Network Virtualization: In software-defined networks, the controller orchestrates failover. Testing should verify that the controller detects failures and reprograms flows.

Cloud Environments: AWS Route 53 health checks and failover routing policies mimic FHRPs. Testing involves disabling an endpoint and verifying DNS resolution changes.

Step-by-Step Failover Testing Procedure

1. Document current state: Record device roles, MAC tables, routing tables, and active sessions.

2. Simulate failure: Disconnect a primary link or power off the active device.

3. Measure convergence time: Use continuous ping or traffic generators to detect the gap.

4. Verify backup operation: Confirm that the standby takes over the virtual IP and forwards traffic.

5. Restore primary: Reconnect or power on the primary device.

6. Test recovery: Verify that the primary resumes its role (if preemption is enabled) and that traffic returns without interruption.

Common Failure Scenarios Tested

Link failure: Fiber cut, cable disconnect

Device failure: Power supply failure, OS crash

Interface failure: Port error-disable, flapping

Routing protocol failure: BGP session drop, OSPF neighbor loss

Power failure: UPS failure, PDU trip

Metrics to Collect

Convergence time: Time from failure to traffic restoration. Target: < 1 second for FHRPs (achievable only with tuned, sub-second timers; the default HSRP hold time is 10 seconds), < 10 seconds for routing protocols.

Packet loss: Percentage of dropped packets during failover. Should be close to zero; with stateful failover, established sessions should also survive the switchover.

Recovery time: Time to restore primary and return to original state.

False positives: Number of tests where failover triggered incorrectly (e.g., due to flapping).
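The metrics above are easy to compute once a test run is recorded. The following is a minimal sketch with hypothetical names and made-up sample numbers; the pass/fail thresholds are whatever your SLA specifies, not values taken from a standard.

```python
from dataclasses import dataclass


@dataclass
class FailoverTest:
    packets_sent: int
    packets_received: int
    failure_time: float   # seconds on the test clock when the fault was injected
    restored_time: float  # seconds on the test clock when traffic flowed again

    @property
    def packet_loss_percent(self) -> float:
        if self.packets_sent == 0:
            return 0.0
        lost = self.packets_sent - self.packets_received
        return 100 * lost / self.packets_sent

    @property
    def convergence_seconds(self) -> float:
        return self.restored_time - self.failure_time


def within_target(test: FailoverTest, max_convergence: float, max_loss: float) -> bool:
    """True when both convergence time and packet loss meet the given limits."""
    return (test.convergence_seconds <= max_convergence
            and test.packet_loss_percent <= max_loss)


run = FailoverTest(packets_sent=1000, packets_received=992,
                   failure_time=12.0, restored_time=12.8)
print(f"loss {run.packet_loss_percent:.1f}%, convergence {run.convergence_seconds:.1f} s")
print("pass" if within_target(run, max_convergence=1.0, max_loss=1.0) else "fail")
```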

Pitfalls and Common Mistakes

Preemption disabled: After recovery, the primary does not reclaim its role, leading to suboptimal routing.

Timer mismatch: If hello and hold timers are not consistent across devices, flapping may occur.

Forgotten VLANs: FHRP configuration must be applied to each VLAN interface; missing one causes partial failover.

Stateful failover not tested: Session tables may not synchronize correctly, dropping active connections.

Monitoring gaps: Without SNMP traps or syslog, failures may go unnoticed until manual inspection.

Walk-Through

1. Document Baseline State

Before testing, record the current network state. Use commands like 'show standby', 'show ip route', and 'show mac address-table' to capture active and standby roles, routing tables, and MAC tables. Note the virtual IP and MAC addresses. Also document the number of active sessions (e.g., via firewall session tables). This baseline allows you to compare post-failover behavior. If you don't know the baseline, you cannot measure convergence time or detect anomalies.
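One way to make the baseline useful is to store it as key/value pairs and diff it against the post-failover state. The sketch below assumes you have already collected the values by hand or by script; the keys and values shown are hypothetical examples, not real command output.

```python
def diff_state(baseline: dict, current: dict) -> dict:
    """Return {key: (before, after)} for every key whose value changed."""
    keys = baseline.keys() | current.keys()
    return {
        key: (baseline.get(key), current.get(key))
        for key in sorted(keys)
        if baseline.get(key) != current.get(key)
    }


baseline = {
    "hsrp_state_r1": "Active",
    "hsrp_state_r2": "Standby",
    "virtual_ip": "192.168.1.254",
}
after_failover = {
    "hsrp_state_r1": "Init",
    "hsrp_state_r2": "Active",
    "virtual_ip": "192.168.1.254",
}

for key, (before, after) in diff_state(baseline, after_failover).items():
    print(f"{key}: {before} -> {after}")
```

Anything that changed unexpectedly, or failed to change when it should have, is a finding for the test report.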

2. Simulate a Failure

Introduce a controlled failure. Common methods: shut down the active device's interface (e.g., 'shutdown' on the primary router's LAN interface), disconnect the fiber patch cable, or power off the device. For testing at scale, use traffic generators that can inject BGP withdrawals or OSPF LSA flushes. Ensure you have monitoring running (e.g., continuous ping from a test host to the virtual IP). The failure should be isolated to avoid impacting production traffic beyond the test scope.

3. Measure Convergence Time

Use a continuous ping from a client to the virtual IP. The ping will show a gap (timeouts) during failover. Count the number of lost pings and multiply by the ping interval (e.g., 100 ms) to estimate convergence time. For precise measurement, use tools like Wireshark to capture traffic or dedicated network test equipment. The goal is to verify that convergence time is within SLA (e.g., < 1 second for FHRPs). Also monitor syslog messages for HSRP/VRRP state changes.
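The lost-pings calculation can be sketched as follows. It assumes you have already turned the ping output into a list of booleans (True for a reply, False for a timeout); the function names are hypothetical. The estimate is only as precise as the ping interval.

```python
def longest_loss_run(replies) -> int:
    """Length of the longest run of consecutive lost pings."""
    longest = current = 0
    for got_reply in replies:
        if got_reply:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


def estimate_convergence_seconds(replies, interval_seconds: float = 0.1) -> float:
    """Estimate the outage as (consecutive lost pings) x (ping interval)."""
    return longest_loss_run(replies) * interval_seconds


# Five replies, eight timeouts during failover, then replies resume.
sample = [True] * 5 + [False] * 8 + [True] * 5
print(f"{estimate_convergence_seconds(sample):.1f} s")
```

Using the longest consecutive run, rather than the total number of lost pings, keeps a stray dropped packet elsewhere in the capture from inflating the estimate.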

4. Verify Backup Operation

After failover, confirm that the standby device has taken over. Check its status (e.g., 'show standby' shows Active state). Verify that the virtual MAC address now appears on the standby's interface. Test connectivity by pinging the virtual IP from multiple subnets. Ensure that traffic is being forwarded correctly by checking routing tables and NAT translations if applicable. Also verify that any stateful sessions (e.g., VoIP calls) are maintained if stateful failover is configured.

5. Restore Primary and Test Recovery

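Reconnect the primary device or bring its interface back up. If preemption is enabled, the primary should automatically reclaim the active role after a brief delay (configurable, e.g., 'standby 1 preempt delay minimum 60'). Monitor the transition using 'show standby' and ping tests. Verify that the virtual IP and MAC return to the primary. Check for any packet loss during the recovery process. If preemption is disabled, the former standby keeps the active role. Lowering its priority alone does not move the role back, because the primary will not take over from a working active router unless preemption is enabled on the primary. In that case, plan a controlled switchover during a maintenance window, for example by enabling preemption on the primary.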

What This Looks Like on the Job

Enterprise Scenario 1: Dual-Homed Data Center with HSRP

A financial services company runs a data center with two Cisco 4500-X switches acting as default gateways for 500 servers. They use HSRP with preemption enabled and a priority of 150 on the primary. During a quarterly failover test, the network engineer shuts the primary's uplink to the core. The failover completes in under 2 seconds, but the test reveals that the standby switch's CPU spikes to 90% due to processing ARP requests from all 500 servers. The team mitigates this by enabling 'standby use-bia' (use burned-in address) to avoid MAC address flapping; the trade-off is that hosts must then relearn the gateway MAC through gratuitous ARP after each failover. They also discover that a misconfigured VLAN interface (missing 'standby 1 ip' command) caused that VLAN to lose connectivity entirely. The fix: a script audits all VLAN interfaces for HSRP configuration. Common scale consideration: HSRP group numbers only need to be unique per interface, but using the VLAN ID as the group number is a common convention that simplifies troubleshooting. It requires HSRP version 2 (groups 0–4095) for VLAN IDs above 255.
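An audit like the one in this scenario could look something like the sketch below. It is not the team's actual script: it assumes IOS-style running-config text in which sub-commands are indented by one space under each 'interface' line, and all function names and the sample configuration are hypothetical.

```python
import re

# Matches "standby 10 ip 10.0.10.1" and the group-less form "standby ip 10.0.10.1".
HSRP_IP = re.compile(r"standby\s+(\d+\s+)?ip\s+\S+")


def parse_interfaces(config_text: str) -> dict:
    """Map each interface name to the list of its indented sub-commands."""
    interfaces = {}
    current = None
    for line in config_text.splitlines():
        if line.startswith("interface "):
            current = line.split(None, 1)[1].strip()
            interfaces[current] = []
        elif line.startswith(" ") and current is not None:
            interfaces[current].append(line.strip())
        else:
            current = None  # "!" separators and global commands end the block
    return interfaces


def vlan_interfaces_missing_hsrp(config_text: str) -> list:
    """List VLAN interfaces that have an IP address but no HSRP virtual IP."""
    missing = []
    for name, commands in parse_interfaces(config_text).items():
        is_vlan = name.lower().startswith("vlan")
        has_ip = any(c.startswith("ip address ") for c in commands)
        has_hsrp = any(HSRP_IP.match(c) for c in commands)
        if is_vlan and has_ip and not has_hsrp:
            missing.append(name)
    return missing


SAMPLE = """\
interface Vlan10
 ip address 10.0.10.2 255.255.255.0
 standby 10 ip 10.0.10.1
!
interface Vlan20
 ip address 10.0.20.2 255.255.255.0
!
interface GigabitEthernet0/1
 ip address 172.16.0.1 255.255.255.252
"""

print(vlan_interfaces_missing_hsrp(SAMPLE))
```

Running the audit before every failover test catches the forgotten-VLAN problem without waiting for a drill to expose it.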

Enterprise Scenario 2: Active-Active Load Balancing with GLBP

A large e-commerce platform uses GLBP on two routers to load balance outbound traffic across 20 Gbps links. During a recovery test, they simulate a router failure by killing the GLBP process. The failover is seamless (sub-second) because the surviving router takes over the failed router's virtual forwarder MAC address, so hosts keep using the gateway MAC they already learned. However, after restoring the failed router, they notice that it does not reclaim the active virtual gateway (AVG) role, because AVG preemption is disabled by default in GLBP; forwarder preemption, by contrast, is enabled by default with a 30-second delay. The engineer must enable 'glbp preempt' or adjust weighting. Scale consideration: GLBP supports at most four active virtual forwarders per group, and group numbers range from 0 to 1023.

Scenario 3: Cloud Failover with AWS Route 53

A SaaS provider uses AWS Route 53 with failover routing policy across two Regions. They perform a recovery test by marking the primary health check as unhealthy. Route 53's DNS failover takes effect once the health check is considered failed and cached records expire (TTL set to 60 seconds). The test reveals that clients with DNS resolvers that ignore TTL (some ISPs) continue to resolve to the failed endpoint for hours. The solution: use Route 53's latency-based routing with health checks, and set TTL to 10 seconds for critical records. For active-passive, they use primary and secondary failover records with health checks. Common misconfiguration: not associating the health check with the failover record, causing Route 53 to never fail over.
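A rough estimate of DNS failover time helps set expectations before a test like this. The sketch below is a simplification: it assumes detection takes roughly the health check interval times the failure threshold, and that well-behaved resolvers then hold the old answer for up to one TTL. The example figures of a 30-second interval (or 10 seconds for fast checks) and a threshold of 3 are assumptions about typical Route 53 settings, so check your own configuration. The function name is hypothetical.

```python
def estimated_dns_failover_seconds(check_interval: float,
                                   failure_threshold: int,
                                   record_ttl: float) -> float:
    """Approximate worst case: time to detect the failure plus one full TTL."""
    detection = check_interval * failure_threshold
    return detection + record_ttl


# Standard checks with a 60-second TTL versus fast checks with a 10-second TTL.
print(estimated_dns_failover_seconds(30, 3, 60))
print(estimated_dns_failover_seconds(10, 3, 10))
```

Resolvers that ignore TTL fall outside this model entirely, which is why the measured result in the scenario was so much worse than the estimate.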

How N10-009 Actually Tests This

N10-009 Exam Focus: Failover and Recovery Testing

This topic falls under Objective 3.4: “Given a scenario, implement network resiliency and high availability.” The exam expects you to understand the purpose of failover testing, the differences between active/active and active/passive, and how to interpret test results. Specific exam points include:

Identify the correct failover mechanism for a given scenario (e.g., active/passive for critical services, active/active for load sharing).

Know default timers for HSRP (hello 3s, hold 10s), VRRP (advertisement 1s), and GLBP (hello 3s).

Understand preemption and when to enable it.

Recognize common testing mistakes, such as not testing stateful failover or ignoring recovery time.

Common Wrong Answers and Why Candidates Choose Them

1. “Failover testing should be performed during peak hours to simulate real conditions.” This is wrong because testing during peak hours risks impacting production traffic. The exam wants you to know that testing should be done during maintenance windows.

2. “HSRP and VRRP are interchangeable and have identical timers.” Wrong. VRRP’s default advertisement interval is 1 second, while HSRP’s hello is 3 seconds. Candidates confuse the two because both provide virtual IP redundancy.

3. “In active/active failover, one device is always idle.” This is the definition of active/passive. Active/active means both devices handle traffic.

4. “After a failover, the primary device automatically resumes its role when it comes back online.” Only if preemption is enabled. Many candidates assume preemption is default; it is not in HSRP, nor for the GLBP active virtual gateway role. (VRRP, by contrast, enables preemption by default.)

Specific Numbers and Terms That Appear on the Exam

HSRP group number: 0–255 (HSRP version 1) or 0–4095 (HSRP version 2)

VRRP VRID: 1–255

GLBP group number: 0–1023

FHRP virtual MAC patterns: HSRP 0000.0c07.acXX, VRRP 0000.5e00.01XX, GLBP 0007.b400.XXYY

Convergence time targets: < 1 second for FHRPs, < 10 seconds for routing protocols
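The XX in the HSRP and VRRP virtual MAC patterns is the group number or VRID written in hexadecimal, which is easy to get wrong under exam pressure. This small sketch builds the addresses from the patterns listed above; the function names are hypothetical, and the HSRP pattern applies to HSRP version 1.

```python
def hsrp_v1_virtual_mac(group: int) -> str:
    """Virtual MAC for an HSRP version 1 group (0-255)."""
    if not 0 <= group <= 255:
        raise ValueError("HSRPv1 group must be between 0 and 255")
    return f"0000.0c07.ac{group:02x}"


def vrrp_virtual_mac(vrid: int) -> str:
    """Virtual MAC for a VRRP virtual router ID (1-255)."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be between 1 and 255")
    return f"0000.5e00.01{vrid:02x}"


print(hsrp_v1_virtual_mac(1))   # group 1
print(hsrp_v1_virtual_mac(10))  # group 10 is 0a in hex, not 10
print(vrrp_virtual_mac(1))
```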

Edge Cases the Exam Loves to Test

Multiple VLANs: HSRP runs per interface, so each VLAN interface needs its own HSRP group configuration; one group does not cover several VLANs.

Preemption delay: Used to prevent flapping when a recovering device is unstable.

Object tracking: HSRP can track an uplink interface; if the uplink goes down, the router lowers its priority, forcing failover even if the LAN interface is up.

Stateful failover: Requires synchronization of session tables; not all devices support it.
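The object tracking edge case is worth working through with numbers, because a decrement that is too small does not cause a failover. The sketch below is a simplified model with hypothetical function names. It assumes a default decrement of 10 and that a standby router preempts only when preemption is enabled on it and its priority is strictly higher than the active router's.

```python
def effective_priority(configured: int, tracked_object_up: bool,
                       decrement: int = 10) -> int:
    """Priority after object tracking is applied (never below zero)."""
    if tracked_object_up:
        return configured
    return max(configured - decrement, 0)


def standby_takes_over(active_priority: int, standby_priority: int,
                       standby_preempt: bool) -> bool:
    """A standby preempts only with preemption on and a strictly higher priority."""
    return standby_preempt and standby_priority > active_priority


# Primary configured at 110, standby at 100, primary's uplink goes down.
primary = effective_priority(110, tracked_object_up=False)
print(primary, standby_takes_over(primary, 100, standby_preempt=True))

# A decrement of 20 drops the primary below the standby.
primary = effective_priority(110, tracked_object_up=False, decrement=20)
print(primary, standby_takes_over(primary, 100, standby_preempt=True))
```

In the first case the priorities tie at 100 and the standby stays put, so a failover test of the tracked uplink would fail even though tracking is configured.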

How to Eliminate Wrong Answers

If the question mentions “both devices forward traffic simultaneously,” eliminate any answer describing “standby” or “passive.”

If the question asks about “automatic recovery,” check for “preemption” in the answer. If preemption is not mentioned, the answer is likely wrong.

For timer questions, remember: VRRP is faster (1s advertisement) than HSRP (3s hello).

Key Takeaways

Failover testing validates that backup components activate within acceptable convergence time (e.g., < 1 second for FHRPs with tuned timers).

HSRP default timers: hello 3 seconds, hold 10 seconds. VRRP: advertisement 1 second, master down interval about 3.6 seconds at the default priority of 100 (3 × 1 + 156/256).

Preemption is disabled by default in HSRP and for the GLBP active virtual gateway role; enable it to allow automatic role recovery. VRRP enables preemption by default.

Active/passive uses one device; active/active uses both devices for load sharing.

Stateful failover requires session table synchronization; test it explicitly.

Document baseline state before testing and measure packet loss during failover.

Common exam traps: confusing HSRP/VRRP timers, assuming preemption is default, and misidentifying active/active vs active/passive.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Active/Passive Failover

One device handles all traffic; standby is idle.

Simpler configuration; no load balancing logic.

Convergence time is typically a few seconds with FHRPs at default timers; sub-second requires tuned timers.

Wastes capacity (standby resources unused).

Common with HSRP/VRRP in small to medium networks.

Active/Active Failover

Both devices handle traffic simultaneously.

Requires load balancing algorithm (e.g., round-robin).

Failover causes load redistribution, not full switchover.

Maximizes resource utilization.

Common with GLBP or ECMP routing.

Watch Out for These

Mistake

Failover testing is only necessary once during initial deployment.

Correct

Failover testing must be performed regularly (e.g., quarterly) because configuration changes, hardware aging, and software updates can break failover mechanisms. Silent failures are common.

Mistake

Active/passive failover provides load balancing.

Correct

Active/passive means one device handles all traffic; the standby is idle. Load balancing requires active/active or a load balancer.

Mistake

HSRP and VRRP use the same hello timer default.

Correct

HSRP hello defaults to 3 seconds; VRRP advertisement interval defaults to 1 second. They are not the same.

Mistake

Preemption is enabled by default in HSRP.

Correct

Preemption is disabled by default in HSRP and for the GLBP active virtual gateway role. It must be explicitly configured for the primary to reclaim its role after recovery.

Mistake

Failover testing always requires taking the network down.

Correct

Failover testing can be done without impacting production by using isolated test environments, redundant paths, or scheduled maintenance windows. Many tests are non-intrusive (e.g., shutting down a single interface).


Frequently Asked Questions

What is the difference between HSRP and VRRP?

HSRP is Cisco proprietary, while VRRP is an open IETF standard (RFC 5798). Both provide virtual IP redundancy, but HSRP uses an active/standby model with a virtual MAC (0000.0c07.acXX), while VRRP uses a master/backup model with a different virtual MAC (0000.5e00.01XX). Default timers differ: HSRP hello is 3 seconds, VRRP advertisement is 1 second. The exam expects you to know these differences.

Should I enable preemption on HSRP?

Enable preemption if you want the primary router to automatically reclaim its role after recovery. Without preemption, the standby remains active even after the primary comes back online. However, be cautious: preemption can cause flapping if the primary is unstable. Use preemption delay (e.g., 'standby 1 preempt delay minimum 60') to allow the primary to stabilize before taking over.

How do I test failover without causing downtime?

Use a maintenance window or test on a redundant path. For example, if you have dual uplinks, shut down one interface and verify that traffic fails over to the other. Use continuous ping to measure convergence time. For non-disruptive testing, use simulation tools like GNS3 or EVE-NG to model the network.

What is object tracking in HSRP?

Object tracking allows HSRP to monitor an interface (e.g., uplink) and adjust priority if that interface goes down. For example, if the primary's WAN link fails, object tracking lowers its priority, forcing the standby to become active. This ensures failover occurs even if the LAN interface is still up.

How does GLBP differ from HSRP?

GLBP (Gateway Load Balancing Protocol) provides active/active load balancing, where multiple routers share the virtual IP and forward traffic simultaneously. HSRP is active/passive. GLBP uses a single virtual IP but multiple virtual MACs, one per virtual forwarder. It supports up to four virtual forwarders per group. The exam may ask you to choose GLBP when load balancing is required.

What is the purpose of a failover test?

The purpose is to verify that the backup component takes over seamlessly within the allowed convergence time, that no traffic is lost beyond acceptable thresholds, and that the system returns to normal after recovery. It also uncovers configuration errors, hardware issues, or software bugs that could cause silent failures.

What metrics should I collect during a failover test?

Key metrics include: convergence time (time from failure to traffic restoration), packet loss percentage, recovery time (time to restore primary), and any false positives (unexpected failovers). Also monitor CPU and memory usage on the standby device during failover.
