This chapter covers failover and recovery testing, a critical skill for ensuring network resilience under Objective 3.3 (disaster recovery concepts) of the CompTIA Network+ N10-009 exam. You will learn the purpose and methods of testing failover mechanisms, including active-passive and active-active configurations, as well as recovery testing procedures like backup validation and disaster recovery drills. Approximately 10–15% of exam questions touch on network operations and resilience, with failover testing being a key subtopic. Mastering this material will help you design, verify, and troubleshoot high-availability networks.
Think of failover and recovery testing like a fire drill for a large office building. The building (your network) has two stairwells (primary and backup data paths). Every month, the fire marshal (network engineer) triggers a drill by pulling the alarm on the primary stairwell. The drill tests: (1) Do the doors automatically unlock on the backup stairwell? (2) Do the emergency lights (routing protocols) illuminate and guide people (traffic) to the backup path? (3) Does the PA system (monitoring) correctly announce the switch? (4) How long does it take for all employees (packets) to evacuate and re-enter through the backup? If the backup stairwell is blocked by storage boxes (misconfiguration), the drill fails. The fire marshal logs the time, identifies bottlenecks, and drills again after clearing the obstruction. This is exactly what failover testing does: it simulates a failure, measures convergence time, verifies that backup paths work, and documents results to improve recovery procedures. Without drills, you only discover problems during a real fire — and that’s too late.
What is Failover and Recovery Testing?
Failover testing is the process of intentionally causing a failure in a network component (link, device, or service) to verify that the backup component takes over seamlessly. Recovery testing validates that the system returns to normal operation after the failure is resolved. These tests are essential for meeting Service Level Agreements (SLAs) that specify uptime (e.g., 99.999% “five nines” equals ~5.26 minutes of downtime per year). Without testing, failover mechanisms may silently fail due to configuration drift, hardware aging, or software bugs.
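The "five nines" figure above can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {downtime_minutes_per_year(pct):.2f} min/year")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why convergence time measured in seconds matters for five-nines SLAs.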
How Failover Works Internally
Failover relies on a combination of protocols and hardware. At Layer 2, protocols like Spanning Tree Protocol (STP) or Rapid Spanning Tree Protocol (RSTP) provide link redundancy. At Layer 3, First Hop Redundancy Protocols (FHRPs) such as Hot Standby Router Protocol (HSRP), Virtual Router Redundancy Protocol (VRRP), or Gateway Load Balancing Protocol (GLBP) allow multiple routers to share a virtual IP address. Stateful failover (e.g., in firewalls) synchronizes session tables between active and standby units so that existing connections survive a switchover.
When a failure occurs, the following sequence happens:
- Detection: The backup device detects the failure via missing hello messages (e.g., HSRP hello timer defaults to 3 seconds, hold timer 10 seconds).
- Election: In a multi-device group, the backup with the highest priority (or IP) becomes active.
- Convergence: Routing tables update; for FHRPs, the virtual MAC address moves to the new active device.
- Traffic Redirection: Switches update their MAC address tables, and clients resume sending traffic to the virtual IP.
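The election step can be sketched as a comparison on priority with the IP address as tie-breaker. This is a simplified model, not a protocol implementation: the `Router` class, router names, and addresses are hypothetical, and real FHRPs also involve state machines and timers.

```python
from dataclasses import dataclass, replace
from ipaddress import IPv4Address
from typing import Optional, Sequence


@dataclass(frozen=True)
class Router:
    name: str
    priority: int
    ip: IPv4Address
    alive: bool = True


def elect_active(routers: Sequence[Router]) -> Optional[Router]:
    """Pick the live router with the highest priority; ties go to the highest IP."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, r.ip))


group = [
    Router("R1", 110, IPv4Address("192.168.1.1")),
    Router("R2", 100, IPv4Address("192.168.1.2")),
    Router("R3", 100, IPv4Address("192.168.1.3")),
]
print(elect_active(group).name)  # R1 (highest priority)

# Simulate R1 failing: R2 and R3 tie on priority, so the higher IP wins.
degraded = [replace(r, alive=False) if r.name == "R1" else r for r in group]
print(elect_active(degraded).name)  # R3
```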
Key Components, Values, Defaults, and Timers
HSRP (Cisco proprietary):
- Hello timer: 3 seconds
- Hold timer: 10 seconds
- Priority range: 0–255 (default 100)
- Preemption: disabled by default; must be enabled to allow a higher-priority router to take over after recovery.
- Virtual MAC: 0000.0c07.acXX (XX = group number)
VRRP (IETF standard, RFC 5798):
- Hello timer (advertisement interval): 1 second (default)
- Master down interval: 3 × advertisement interval + skew time (skew = (256 – priority) / 256 × advertisement interval)
- Priority range: 1–254 configurable (default 100); 255 is reserved for the router that owns the virtual IP address
- Virtual MAC: 0000.5e00.01XX (XX = VRID)
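As a worked example of the master down interval, the sketch below plugs the default priority into the formula. Scaling the skew by the advertisement interval follows RFC 5798 and gives the same result as (256 – priority) / 256 at the default 1-second interval. The function name is illustrative.

```python
def master_down_interval(priority: int, advert_interval_s: float = 1.0) -> float:
    """Seconds a VRRP backup waits without advertisements before taking over."""
    if not 1 <= priority <= 255:
        raise ValueError("priority must be between 1 and 255")
    skew = (256 - priority) / 256 * advert_interval_s
    return 3 * advert_interval_s + skew


print(master_down_interval(100))  # 3.609375 (default priority)
print(master_down_interval(254))  # 3.0078125 (highest configurable priority)
```

Because the skew shrinks as priority rises, the highest-priority backup times out first and wins the race to become master.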
GLBP (Cisco proprietary):
- Hello timer: 3 seconds
- Hold timer: 10 seconds
- Load balancing methods: round-robin, weighted, host-dependent
- Virtual MAC: 0007.b400.XXYY (XX = group, YY = AVF)
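The virtual MAC patterns listed above can be derived from the group number, which helps when checking MAC address tables during a test. This is a minimal sketch with hypothetical helper names. The GLBP helper follows the simplified XXYY pattern given here and assumes group numbers up to 255.

```python
def _dotted(mac_hex: str) -> str:
    """Format 12 hex digits in Cisco dotted notation (xxxx.xxxx.xxxx)."""
    return ".".join(mac_hex[i:i + 4] for i in range(0, 12, 4))


def hsrp_v1_mac(group: int) -> str:
    if not 0 <= group <= 255:
        raise ValueError("HSRPv1 group must be 0-255")
    return _dotted(f"00000c07ac{group:02x}")


def vrrp_mac(vrid: int) -> str:
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be 1-255")
    return _dotted(f"00005e0001{vrid:02x}")


def glbp_mac(group: int, forwarder: int) -> str:
    # Assumption: simplified pattern, valid only for groups that fit in one byte.
    if not 0 <= group <= 255 or not 1 <= forwarder <= 4:
        raise ValueError("group must be 0-255 and forwarder 1-4 in this sketch")
    return _dotted(f"0007b400{group:02x}{forwarder:02x}")


print(hsrp_v1_mac(1))   # 0000.0c07.ac01
print(vrrp_mac(10))     # 0000.5e00.010a
print(glbp_mac(1, 2))   # 0007.b400.0102
```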
Active/Active vs Active/Passive:
- Active/Passive: One device handles traffic; the standby remains idle until failure. Example: HSRP with preemption disabled.
- Active/Active: Both devices handle traffic simultaneously; failure causes load redistribution. Example: GLBP or equal-cost multipath (ECMP) routing.
Configuration and Verification Commands
HSRP Configuration (Cisco IOS):
interface GigabitEthernet0/0
 ip address 192.168.1.1 255.255.255.0
 standby 1 ip 192.168.1.254
 standby 1 priority 110
 standby 1 preempt
 standby 1 timers 1 3
Verification:
show standby
show standby brief
show ip interface brief
VRRP Configuration (Juniper):
set interfaces ge-0/0/0 unit 0 family inet address 192.168.1.1/24 vrrp-group 1 virtual-address 192.168.1.254
set interfaces ge-0/0/0 unit 0 family inet address 192.168.1.1/24 vrrp-group 1 priority 110
Verification:
show vrrp
show vrrp detail
How Failover Testing Interacts with Related Technologies
Failover testing must consider dependencies:
- DNS: If failover changes IP addresses, DNS TTLs may cause clients to cache old records. Use low TTLs (e.g., 60 seconds) for critical services.
- Load Balancers: Application load balancers (e.g., F5, HAProxy) have health checks that mark servers as down. Testing must ensure health check intervals (e.g., 5 seconds) are shorter than failover timers.
- SDN/Network Virtualization: In software-defined networks, the controller orchestrates failover. Testing should verify that the controller detects failures and reprograms flows.
- Cloud Environments: AWS Route 53 health checks and failover routing policies mimic FHRPs. Testing involves disabling an endpoint and verifying DNS resolution changes.
Step-by-Step Failover Testing Procedure
Document current state: Record device roles, MAC tables, routing tables, and active sessions.
Simulate failure: Disconnect a primary link or power off the active device.
Measure convergence time: Use continuous ping or traffic generators to detect the gap.
Verify backup operation: Confirm that the standby takes over the virtual IP and forwards traffic.
Restore primary: Reconnect or power on the primary device.
Test recovery: Verify that the primary resumes its role (if preemption is enabled) and that traffic returns without interruption.
Common Failure Scenarios Tested
Link failure: Fiber cut, cable disconnect
Device failure: Power supply failure, OS crash
Interface failure: Port error-disable, flapping
Routing protocol failure: BGP session drop, OSPF neighbor loss
Power failure: UPS failure, PDU trip
Metrics to Collect
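Convergence time: Time from failure to traffic restoration. Target: < 1 second for FHRPs with tuned timers (with default HSRP timers, detection alone can take up to the 10-second hold time), < 10 seconds for routing protocols.
Packet loss: Percentage of dropped packets during failover. Should be at or near zero for stateful failover, where existing sessions are expected to survive the switchover.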
Recovery time: Time to restore primary and return to original state.
False positives: Number of tests where failover triggered incorrectly (e.g., due to flapping).
Pitfalls and Common Mistakes
Preemption disabled: After recovery, the primary does not reclaim its role, leading to suboptimal routing.
Timer mismatch: If hello and hold timers are not consistent across devices, flapping may occur.
Forgotten VLANs: FHRP configuration must be applied to each VLAN interface; missing one causes partial failover.
Stateful failover not tested: Session tables may not synchronize correctly, dropping active connections.
Monitoring gaps: Without SNMP traps or syslog, failures may go unnoticed until manual inspection.
Document Baseline State
Before testing, record the current network state. Use commands like 'show standby', 'show ip route', and 'show mac address-table' to capture active and standby roles, routing tables, and MAC tables. Note the virtual IP and MAC addresses. Also document the number of active sessions (e.g., via firewall session tables). This baseline allows you to compare post-failover behavior. If you don't know the baseline, you cannot measure convergence time or detect anomalies.
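One way to make the baseline comparison concrete is to store the recorded values as key/value pairs and diff them after the test. The sketch below assumes you have already collected the values by hand or by script. The field names and numbers are invented for illustration.

```python
def diff_state(baseline: dict, current: dict) -> dict:
    """Return {field: (before, after)} for every field that changed."""
    changes = {}
    for key in sorted(baseline.keys() | current.keys()):
        before, after = baseline.get(key), current.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Hypothetical snapshots taken before and after shutting the active router.
baseline = {
    "hsrp_active": "R1",
    "hsrp_standby": "R2",
    "virtual_ip": "192.168.1.254",
    "virtual_mac": "0000.0c07.ac01",
    "firewall_sessions": 1842,
}
after_failover = {
    "hsrp_active": "R2",
    "hsrp_standby": None,
    "virtual_ip": "192.168.1.254",
    "virtual_mac": "0000.0c07.ac01",
    "firewall_sessions": 1842,
}

for field, (before, after) in diff_state(baseline, after_failover).items():
    print(f"{field}: {before} -> {after}")
```

In this example only the role fields change. An unexpected change in the virtual IP, virtual MAC, or session count would be the anomaly to investigate.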
Simulate a Failure
Introduce a controlled failure. Common methods: shut down the active device's interface (e.g., 'shutdown' on the primary router's LAN interface), disconnect the fiber patch cable, or power off the device. For testing at scale, use traffic generators that can inject BGP withdrawals or OSPF LSA flushes. Ensure you have monitoring running (e.g., continuous ping from a test host to the virtual IP). The failure should be isolated to avoid impacting production traffic beyond the test scope.
Measure Convergence Time
Use a continuous ping from a client to the virtual IP. The ping will show a gap (timeouts) during failover. Count the number of lost pings and multiply by the ping interval (e.g., 100 ms) to estimate convergence time. For precise measurement, use tools like Wireshark to capture traffic or dedicated network test equipment. The goal is to verify that convergence time is within SLA (e.g., < 1 second for FHRPs). Also monitor syslog messages for HSRP/VRRP state changes.
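The lost-ping arithmetic described above can be sketched as follows. It assumes the ping results have already been reduced to a list of booleans (True for a reply, False for a timeout). Collecting that list from a real ping tool is left out, and the function names are illustrative.

```python
def longest_outage_s(replies, interval_s: float) -> float:
    """Estimate convergence time as the longest run of lost pings times the interval."""
    longest = current = 0
    for ok in replies:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest * interval_s


def packet_loss_percent(replies) -> float:
    """Percentage of pings that received no reply."""
    replies = list(replies)
    if not replies:
        raise ValueError("no ping results supplied")
    lost = sum(1 for ok in replies if not ok)
    return 100 * lost / len(replies)


# Hypothetical run: 100 pings at 100 ms, with 8 consecutive timeouts during failover.
replies = [True] * 50 + [False] * 8 + [True] * 42
print(f"convergence ~ {longest_outage_s(replies, 0.1):.1f} s")   # ~0.8 s
print(f"packet loss = {packet_loss_percent(replies):.1f} %")     # 8.0 %
```

The estimate is only as fine as the ping interval, so a 100 ms interval cannot resolve differences smaller than about 100 ms.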
Verify Backup Operation
After failover, confirm that the standby device has taken over. Check its status (e.g., 'show standby' shows Active state). Verify that the virtual MAC address now appears on the standby's interface. Test connectivity by pinging the virtual IP from multiple subnets. Ensure that traffic is being forwarded correctly by checking routing tables and NAT translations if applicable. Also verify that any stateful sessions (e.g., VoIP calls) are maintained if stateful failover is configured.
Restore Primary and Test Recovery
Reconnect the primary device or bring its interface back up. If preemption is enabled, the primary should automatically reclaim the active role after a brief delay (configurable, e.g., 'standby 1 preempt delay minimum 60'). Monitor the transition using 'show standby' and ping tests. Verify that the virtual IP and MAC return to the primary. Check for any packet loss during the recovery process. If preemption is disabled, the former standby keeps the active role until you force a switchover manually, for example by temporarily enabling preemption on the primary during a maintenance window.
Enterprise Scenario 1: Dual-Homed Data Center with HSRP
A financial services company runs a data center with two Cisco 4500-X switches acting as default gateways for 500 servers. They use HSRP with preemption enabled and a priority of 150 on the primary. During a quarterly failover test, the network engineer shuts the primary's uplink to the core. The failover completes in under 2 seconds, but the test reveals that the standby switch's CPU spikes to 90% due to processing ARP requests from all 500 servers. The team mitigates this by enabling 'standby use-bia' (use burned-in address) to avoid MAC address flapping. They also discover that a misconfigured VLAN interface (missing 'standby 1 ip' command) caused that VLAN to lose connectivity entirely. The fix: a script audits all VLAN interfaces for HSRP configuration. Common scale consideration: with thousands of VLANs, HSRP group numbers must be unique; using VLAN ID as group number is a best practice.
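The audit script mentioned in this scenario could look something like the sketch below. It is not the company's actual script. It assumes an IOS-style running configuration saved as text, with interface sub-commands indented, and the sample configuration is invented.

```python
import re

HSRP_IP = re.compile(r"standby( \d+)? ip \S+")


def vlan_interfaces_missing_hsrp(config_text: str) -> list:
    """Return names of 'interface Vlan...' blocks that have no 'standby ... ip' line."""
    missing = []
    current = None      # name of the VLAN interface block being scanned
    has_hsrp = False

    def flush():
        if current is not None and not has_hsrp:
            missing.append(current)

    for line in config_text.splitlines():
        if line.startswith("interface "):
            flush()
            name = line.split(maxsplit=1)[1].strip()
            current = name if name.lower().startswith("vlan") else None
            has_hsrp = False
        elif current is not None:
            if line[:1] in (" ", "\t"):          # still inside the interface block
                if HSRP_IP.match(line.strip()):
                    has_hsrp = True
            else:                                 # "!" or another top-level command
                flush()
                current = None
    flush()
    return missing


SAMPLE_CONFIG = """\
interface Vlan10
 ip address 10.0.10.2 255.255.255.0
 standby 10 ip 10.0.10.1
!
interface Vlan20
 ip address 10.0.20.2 255.255.255.0
!
interface GigabitEthernet0/1
 no ip address
"""

print(vlan_interfaces_missing_hsrp(SAMPLE_CONFIG))  # ['Vlan20']
```

Running a check like this against both gateways before each test catches the "forgotten VLAN" problem without waiting for a failover to expose it.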
Enterprise Scenario 2: Active-Active Load Balancing with GLBP
A large e-commerce platform uses GLBP on two routers to load balance outbound traffic across 20 Gbps links. During a recovery test, they simulate a router failure by killing the GLBP process. The failover is seamless (sub-second) because GLBP uses round-robin assignment of virtual forwarders. However, after restoring the failed router, they notice that the recovered router does not immediately resume its original role because GLBP gateway (AVG) preemption is disabled by default. The engineer must explicitly enable preemption or adjust weighting. Performance consideration: GLBP supports at most four active virtual forwarders per group, so a single group spreads load across no more than four routers; group numbers range from 0 to 1023.
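The round-robin behavior in this scenario can be modeled in a few lines. This is a conceptual sketch only: client names, router names, and MAC values are hypothetical, and real GLBP hands out virtual MACs in ARP replies rather than through a lookup table.

```python
from itertools import cycle


def assign_round_robin(clients, virtual_macs):
    """Give each client the next virtual MAC in rotation, as a round-robin gateway would."""
    rotation = cycle(virtual_macs)
    return {client: next(rotation) for client in clients}


def fail_forwarder(owners, failed, survivor):
    """Return virtual MAC ownership after the surviving router takes over the failed router's MACs."""
    return {vmac: (survivor if owner == failed else owner) for vmac, owner in owners.items()}


vmacs = ["0007.b400.0101", "0007.b400.0102"]
clients = assign_round_robin(["hostA", "hostB", "hostC", "hostD"], vmacs)
owners = {"0007.b400.0101": "R1", "0007.b400.0102": "R2"}

print(clients)                              # hosts alternate between the two virtual MACs
print(fail_forwarder(owners, "R1", "R2"))   # R2 now answers for both virtual MACs
```

The point of the model is that clients keep the virtual MAC they were given; only the router answering for that MAC changes, which is why the failover appears as load redistribution rather than a full switchover.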
Scenario 3: Cloud Failover with AWS Route 53
A SaaS provider uses AWS Route 53 with failover routing policy across two Regions. They perform a recovery test by marking the primary health check as unhealthy. Route 53's DNS failover takes effect within the TTL (set to 60 seconds). The test reveals that clients with DNS resolvers that ignore TTL (some ISPs) continue to resolve to the failed endpoint for hours. The solution: use Route 53's latency-based routing with health checks, and set TTL to 10 seconds for critical records. They also learned that Route 53 does not support weighted failover for active-passive; instead, they use primary and secondary records with health checks. Common misconfiguration: not associating the health check with the failover record, causing Route 53 to never fail over.
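A rough way to reason about DNS failover time is to add the health-check detection time to the record TTL. The sketch below is a back-of-the-envelope model, not Route 53's documented behavior. The check interval and failure threshold values are assumptions chosen for illustration, and resolvers that ignore TTL are outside the model.

```python
def worst_case_dns_failover_s(check_interval_s: float, failure_threshold: int, ttl_s: float) -> float:
    """Upper-bound estimate: time to declare the endpoint unhealthy plus one full TTL of caching."""
    if check_interval_s <= 0 or failure_threshold < 1 or ttl_s < 0:
        raise ValueError("interval must be > 0, threshold >= 1, TTL >= 0")
    detection = check_interval_s * failure_threshold
    return detection + ttl_s


# Assumed settings: checks every 30 s, 3 consecutive failures required, 60 s TTL.
print(worst_case_dns_failover_s(30, 3, 60))   # 150
# Faster checks and a 10 s TTL shrink the window considerably.
print(worst_case_dns_failover_s(10, 3, 10))   # 40
```

Comparing the two results shows why lowering the TTL alone is not enough: detection time often dominates the total.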
N10-009 Exam Focus: Failover and Recovery Testing
This topic falls under Objective 3.3: “Explain disaster recovery (DR) concepts.” The exam expects you to understand the purpose of failover testing, the differences between active/active and active/passive, and how to interpret test results. Specific exam points include:
- Identify the correct failover mechanism for a given scenario (e.g., active/passive for critical services, active/active for load sharing).
- Know default timers for HSRP (hello 3s, hold 10s), VRRP (advertisement 1s), and GLBP (hello 3s).
- Understand preemption and when to enable it.
- Recognize common testing mistakes, such as not testing stateful failover or ignoring recovery time.
Common Wrong Answers and Why Candidates Choose Them
“Failover testing should be performed during peak hours to simulate real conditions.” This is wrong because testing during peak hours risks impacting production traffic. The exam wants you to know that testing should be done during maintenance windows.
“HSRP and VRRP are interchangeable and have identical timers.” Wrong. VRRP’s default advertisement interval is 1 second, while HSRP’s hello is 3 seconds. Candidates confuse the two because both provide virtual IP redundancy.
“In active/active failover, one device is always idle.” This is the definition of active/passive. Active/active means both devices handle traffic.
“After a failover, the primary device automatically resumes its role when it comes back online.” Only if preemption is enabled. Many candidates assume preemption is default; it is not in HSRP or GLBP.
Specific Numbers and Terms That Appear on the Exam
HSRP group number: 0–255 (HSRP version 1) or 0–4095 (HSRP version 2)
VRRP VRID: 1–255
GLBP group number: 0–1023
FHRP virtual MAC patterns: HSRP 0000.0c07.acXX, VRRP 0000.5e00.01XX, GLBP 0007.b400.XXYY
Convergence time targets: < 1 second for FHRPs, < 10 seconds for routing protocols
Edge Cases the Exam Loves to Test
Multiple VLANs: A single HSRP group cannot span VLANs; each VLAN needs its own group.
Preemption delay: Used to prevent flapping when a recovering device is unstable.
Object tracking: HSRP can track an uplink interface; if the uplink goes down, the router lowers its priority, forcing failover even if the LAN interface is up.
Stateful failover: Requires synchronization of session tables; not all devices support it.
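The object-tracking edge case is easier to see with numbers. The sketch below models the priority decrement and the takeover decision. The default decrement of 10 and the strict "higher priority wins" rule are simplifying assumptions, and the function names are hypothetical.

```python
def effective_priority(configured: int, tracked_up: bool, decrement: int = 10) -> int:
    """Priority after object tracking: reduced by `decrement` while the tracked object is down."""
    return configured if tracked_up else max(configured - decrement, 0)


def standby_takes_over(active_priority: int, standby_priority: int, standby_preempt: bool) -> bool:
    """Assumption: the standby takes over a live active router only if it preempts and is strictly higher."""
    return standby_preempt and standby_priority > active_priority


primary, standby = 110, 100

# Uplink fails with a decrement of 10: priorities tie at 100, so no takeover in this model.
print(standby_takes_over(effective_priority(primary, False, 10), standby, True))   # False

# A decrement of 20 drops the primary to 90, and the preempting standby takes over.
print(standby_takes_over(effective_priority(primary, False, 20), standby, True))   # True

# Without preemption on the standby, tracking alone does not move the active role.
print(standby_takes_over(effective_priority(primary, False, 20), standby, False))  # False
```

When testing tracked failover, check both that the decrement is large enough to drop below the standby's priority and that the standby has preemption enabled.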
How to Eliminate Wrong Answers
If the question mentions “both devices forward traffic simultaneously,” eliminate any answer describing “standby” or “passive.”
If the question asks about “automatic recovery,” check for “preemption” in the answer. If preemption is not mentioned, the answer is likely wrong.
For timer questions, remember: VRRP is faster (1s advertisement) than HSRP (3s hello).
Failover testing validates that backup components activate within acceptable convergence time (e.g., < 1 second for FHRPs).
HSRP default timers: hello 3 seconds, hold 10 seconds. VRRP: advertisement 1 second, master down interval ~3.6 seconds at the default priority of 100.
Preemption is disabled by default in HSRP and GLBP; enable it to allow automatic role recovery.
Active/passive forwards traffic through one device at a time; active/active uses both devices for load sharing.
Stateful failover requires session table synchronization; test it explicitly.
Document baseline state before testing and measure packet loss during failover.
Common exam traps: confusing HSRP/VRRP timers, assuming preemption is default, and misidentifying active/active vs active/passive.
These come up on the exam all the time. Here's how to tell them apart.
Active/Passive Failover
One device handles all traffic; standby is idle.
Simpler configuration; no load balancing logic.
Convergence time can be sub-second with tuned FHRP timers; default timers take several seconds.
Wastes capacity (standby resources unused).
Common with HSRP/VRRP in small to medium networks.
Active/Active Failover
Both devices handle traffic simultaneously.
Requires load balancing algorithm (e.g., round-robin).
Failover causes load redistribution, not full switchover.
Maximizes resource utilization.
Common with GLBP or ECMP routing.
Mistake
Failover testing is only necessary once during initial deployment.
Correct
Failover testing must be performed regularly (e.g., quarterly) because configuration changes, hardware aging, and software updates can break failover mechanisms. Silent failures are common.
Mistake
Active/passive failover provides load balancing.
Correct
Active/passive means one device handles all traffic; the standby is idle. Load balancing requires active/active or a load balancer.
Mistake
HSRP and VRRP use the same hello timer default.
Correct
HSRP hello defaults to 3 seconds; VRRP advertisement interval defaults to 1 second. They are not the same.
Mistake
Preemption is enabled by default in HSRP.
Correct
Preemption is disabled by default in HSRP and GLBP. It must be explicitly configured for the primary to reclaim its role after recovery.
Mistake
Failover testing always requires taking the network down.
Correct
Failover testing can be done without impacting production by using isolated test environments, redundant paths, or scheduled maintenance windows. Many tests are non-intrusive (e.g., shutting down a single interface).
HSRP is Cisco proprietary, while VRRP is an open IETF standard (RFC 5798). Both provide virtual IP redundancy, but HSRP uses an active/standby model with a virtual MAC (0000.0c07.acXX), while VRRP uses a master/backup model with a different virtual MAC (0000.5e00.01XX). Default timers differ: HSRP hello is 3 seconds, VRRP advertisement is 1 second. The exam expects you to know these differences.
Enable preemption if you want the primary router to automatically reclaim its role after recovery. Without preemption, the standby remains active even after the primary comes back online. However, be cautious: preemption can cause flapping if the primary is unstable. Use preemption delay (e.g., 'standby 1 preempt delay minimum 60') to allow the primary to stabilize before taking over.
Use a maintenance window or test on a redundant path. For example, if you have dual uplinks, shut down one interface and verify that traffic fails over to the other. Use continuous ping to measure convergence time. For non-disruptive testing, use simulation tools like GNS3 or EVE-NG to model the network.
Object tracking allows HSRP to monitor an interface (e.g., uplink) and adjust priority if that interface goes down. For example, if the primary's WAN link fails, object tracking lowers its priority, forcing the standby to become active. This ensures failover occurs even if the LAN interface is still up.
GLBP (Gateway Load Balancing Protocol) provides active/active load balancing, where multiple routers share the virtual IP and forward traffic simultaneously. HSRP is active/passive. GLBP uses a single virtual IP but multiple virtual MACs. It supports up to four virtual forwarders per group. The exam may ask you to choose GLBP when load balancing is required.
The purpose is to verify that the backup component takes over seamlessly within the allowed convergence time, that no traffic is lost beyond acceptable thresholds, and that the system returns to normal after recovery. It also uncovers configuration errors, hardware issues, or software bugs that could cause silent failures.
Key metrics include: convergence time (time from failure to traffic restoration), packet loss percentage, recovery time (time to restore primary), and any false positives (unexpected failovers). Also monitor CPU and memory usage on the standby device during failover.