This chapter covers troubleshooting RAID array failures, a critical topic for the CompTIA A+ Core 1 (220-1101) exam under Domain 5.0 (Hardware and Network Troubleshooting), Objective 5.3. RAID (Redundant Array of Independent Disks) combines multiple drives into a single logical unit for performance, redundancy, or both. Exam questions often test your ability to identify failure symptoms, select the correct recovery procedure, and understand which RAID levels provide redundancy and which do not. As a rough estimate, expect 5-7% of the exam to touch on RAID troubleshooting scenarios.
Imagine a multi-story parking garage where each floor is a hard drive. In RAID 0 (striping), cars (data) are split across floors: the front half of every car on floor 1, the back half on floor 2. If floor 2 collapses, every car is incomplete and you lose all data. In RAID 1 (mirroring), you have two identical garages. If one collapses, you still have a complete copy in the other. In RAID 5, each row of parking spaces also includes a 'parity' space that stores a mathematical summary of the cars in that row, and the parity spaces are spread across all the floors rather than kept on one dedicated floor. If any single floor collapses, you can use the parity spaces and the remaining floors to reconstruct the missing data. But if a second floor collapses before you rebuild the first, the parity math fails and you lose everything. The garage manager (RAID controller) monitors for structural damage (disk failures) and triggers an alarm (alert). The repair crew must replace the damaged floor (hot-swap a new disk), and the manager then reconstructs its contents from the surviving floors (rebuild). Until the rebuild finishes, the garage runs without its safety margin (degraded mode), and another failure would be catastrophic. This mirrors how RAID arrays handle disk failures: redundancy protects against single failures, but rebuild times expose windows of vulnerability.
What is RAID and Why Does it Exist?
RAID (Redundant Array of Independent Disks) is a storage virtualization technology that combines multiple physical disk drives into one or more logical units. Its primary goals are:
- Performance improvement (via striping)
- Redundancy (via mirroring or parity)
- Increased capacity (combining smaller drives)
The CompTIA A+ 220-1101 exam covers both software RAID (implemented by the operating system) and hardware RAID (using a dedicated RAID controller). The exam objectives list RAID levels 0, 1, 5, and 10 (1+0); RAID 6 is included in this chapter for comparison. You must know the characteristics of these levels and how to troubleshoot common failure scenarios.
How RAID Works Internally
RAID distributes data across multiple drives using three fundamental techniques:
- Striping: Data is split into blocks (e.g., 64 KB) and written across multiple drives. This improves read/write performance because multiple drives work in parallel. However, striping alone provides no redundancy: if one drive fails, all data is lost.
- Mirroring: Data is written identically to two or more drives. This provides redundancy: if one drive fails, the other(s) have a complete copy. Read performance can improve because reads can be serviced from any mirror, but write performance is slightly degraded because each write must be committed to all mirrors.
- Parity: A mathematical calculation (XOR) is performed on data blocks, and the result is stored on a dedicated parity drive (RAID 4) or distributed across all drives (RAID 5, RAID 6). Parity allows reconstruction of data from a failed drive using the remaining drives and the parity information. RAID 5 can survive a single drive failure; RAID 6 can survive two simultaneous failures.
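The parity idea above can be sketched in a few lines of Python. This is a minimal illustration of single XOR parity over one stripe, not a model of any real controller: the block contents and the `xor_blocks` helper are hypothetical, and real arrays work on much larger blocks and rotate parity across drives.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks from one stripe, one per drive.
d0, d1, d2 = b"RAID", b"five", b"demo"

# The parity block is the XOR of every data block in the stripe.
parity = xor_blocks(d0, d1, d2)

# Simulate losing the drive that held d1: XOR the survivors with the parity.
rebuilt_d1 = xor_blocks(d0, d2, parity)

print(rebuilt_d1 == d1)  # True
```

If two of the three data blocks were missing, the single parity block would not be enough to recover either one, which is why single parity tolerates only one failed drive.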
Key RAID Levels for A+ 220-1101
#### RAID 0 (Striping)
- Minimum drives: 2
- Redundancy: None
- Capacity: Sum of all drives (e.g., 2 x 500 GB = 1 TB usable)
- Performance: Excellent read/write, but failure of any drive destroys the entire array.
- Exam tip: RAID 0 is often chosen for temporary or cache data where speed is critical and data loss is acceptable.
#### RAID 1 (Mirroring)
- Minimum drives: 2
- Redundancy: Yes (single drive failure tolerable)
- Capacity: Smallest drive (e.g., 2 x 500 GB = 500 GB usable)
- Performance: Good reads (can read from either drive), slightly slower writes (must write to both).
- Exam tip: RAID 1 is common for boot drives and critical system volumes.
#### RAID 5 (Striping with Parity)
- Minimum drives: 3
- Redundancy: Yes (single drive failure)
- Capacity: (N-1) * smallest drive (e.g., 3 x 500 GB = 1 TB usable)
- Performance: Good reads, slower writes due to parity calculation.
- Exam tip: RAID 5 is a cost-effective balance of capacity and redundancy, but rebuild times are long on large drives, increasing risk during rebuild.
#### RAID 6 (Striping with Double Parity)
- Minimum drives: 4
- Redundancy: Yes (two drive failures)
- Capacity: (N-2) * smallest drive
- Performance: Slower writes than RAID 5 due to double parity.
- Exam tip: RAID 6 is used when high fault tolerance is needed, e.g., in large enterprise arrays.
#### RAID 10 (1+0): Striping of Mirrors
- Minimum drives: 4 (in pairs)
- Redundancy: Yes (can survive up to one drive per mirror pair)
- Capacity: (N/2) * smallest drive (e.g., 4 x 500 GB = 1 TB usable)
- Performance: Excellent reads and writes (combines mirroring and striping).
- Exam tip: RAID 10 is preferred for high-performance databases and virtual machine stores.
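The capacity rules for the five levels can be checked with a short calculator. This is a sketch that simply encodes the formulas listed above; the function name `usable_capacity` and its parameters are hypothetical, and the RAID 0 case follows the chapter's "sum of all drives" rule, which assumes equal-size drives.

```python
def usable_capacity(level: int, drive_sizes_gb: list[int]) -> int:
    """Return usable capacity in GB using the chapter's formulas."""
    minimums = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}
    if level not in minimums:
        raise ValueError(f"unsupported RAID level: {level}")

    n = len(drive_sizes_gb)
    if n < minimums[level]:
        raise ValueError(f"RAID {level} needs at least {minimums[level]} drives")
    if level == 10 and n % 2 != 0:
        raise ValueError("RAID 10 needs an even number of drives")

    smallest = min(drive_sizes_gb)
    if level == 0:
        return sum(drive_sizes_gb)   # striping only, nothing reserved
    if level == 1:
        return smallest              # every drive holds the same data
    if level == 5:
        return (n - 1) * smallest    # one drive's worth of parity
    if level == 6:
        return (n - 2) * smallest    # two drives' worth of parity
    return (n // 2) * smallest       # RAID 10: half the drives are mirrors


print(usable_capacity(5, [500, 500, 500]))        # 1000
print(usable_capacity(10, [500, 500, 500, 500]))  # 1000
```

Both results match the 1 TB examples given for RAID 5 and RAID 10 above.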
RAID Failure Symptoms and Troubleshooting Steps
When a RAID array fails, the symptoms vary by RAID level and the type of failure. Common indicators include:
- Degraded array: The RAID controller reports one or more drives as failed or missing. The array continues to operate but with reduced performance and reduced or no redundancy.
- Offline array: The array becomes unavailable, often due to multiple drive failures or controller issues.
- Slow performance: During a rebuild, the array may be slow because the controller is busy reconstructing data onto the replacement drive.
- I/O errors: Applications report read/write errors, and the operating system may show disk errors in Event Viewer or system logs.
#### Troubleshooting Steps (General)
Identify the failure: Check the RAID controller's management interface (e.g., HP Smart Storage Administrator, Dell OpenManage, or Intel RST). Look for status indicators: "Failed", "Missing", "Predictive Failure", or "Degraded".
Determine the RAID level: Knowing whether the array has redundancy (RAID 1,5,6,10) vs. no redundancy (RAID 0) dictates recovery options.
Check physical connections: Reseat SATA/SAS cables and power connectors. For HDDs, confirm that the drive spins up; repeated clicking or grinding usually indicates mechanical failure.
Replace the failed drive: If the drive is confirmed bad, replace it with a compatible drive of equal or larger capacity. For hot-swappable drives, you can replace without powering down.
Initiate rebuild: After replacement, most controllers begin rebuilding the array automatically; if not, start the rebuild from the management utility. Monitor progress; rebuild times depend on drive size, controller speed, and array load.
Verify data integrity: After rebuild, run a file system check (e.g., chkdsk on Windows) or verify data with checksums.
Backup: Always maintain current backups. RAID is not a backup—it protects against drive failure, not accidental deletion, malware, or disasters.
RAID Controller Configuration and Verification Commands
For hardware RAID, the controller's BIOS or management utility is used. Common commands (for reference, not exhaustive):
- Windows: Use `diskpart` to view disk status, but RAID management is typically via vendor tools.
- Linux: Use `mdadm` for software RAID. Examples:
  - `mdadm --detail /dev/md0` shows RAID status
  - `cat /proc/mdstat` shows current RAID status and rebuild progress
- macOS: Disk Utility can show RAID status.
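For Linux software RAID, the status text in `/proc/mdstat` can be scanned for degraded arrays. The sketch below parses a hand-written sample that imitates the usual layout, where a status such as `[UU_]` marks a missing member with an underscore. The sample text and the `degraded_arrays` helper are illustrative assumptions; real output varies by kernel version and array type.

```python
import re

# Hand-written sample imitating /proc/mdstat; not captured from a real system.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks super 1.2 [2/2] [UU]

md1 : active raid5 sdc1[0] sdd1[1] sde1[3](F)
      2096128 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text: str) -> list[str]:
    """Return the names of arrays whose member status contains '_'."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)  # remember which array we are in
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current and "_" in status.group(3):
            degraded.append(current)
    return degraded


print(degraded_arrays(SAMPLE_MDSTAT))  # ['md1']
```

On a live system the same function could be fed the contents of `/proc/mdstat`, though `mdadm --detail` remains the more reliable source for a single array.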
How RAID Interacts with Other Technologies
Hot Spare: A dedicated standby drive that automatically replaces a failed drive in a RAID array. Reduces downtime.
SSD vs. HDD: SSDs have no moving parts, so they are less prone to mechanical failure. However, they have limited write endurance. Whether TRIM commands reach SSDs in an array depends on the controller or software RAID implementation, and over-provisioning helps offset write wear.
Backup: RAID does not replace backup. A RAID array can be lost due to controller failure, multiple simultaneous drive failures, or firmware bugs.
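Snapshot: Some storage systems layer snapshots on top of RAID for point-in-time recovery. This is usually a feature of the volume manager, file system, or storage array rather than of basic RAID itself.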
Common RAID Failure Scenarios on the Exam
Scenario 1: A user reports that a RAID 0 array is no longer accessible. The technician finds one drive has failed. Answer: The entire array is lost. Recovery requires restoring from backup.
Scenario 2: A RAID 5 array shows "Degraded" status. One drive has failed. Answer: Replace the failed drive; the array will rebuild automatically. Data is still accessible.
Scenario 3: A RAID 10 array has two failed drives, both in the same mirror pair. Answer: The array is lost. If the two failed drives are in different mirror pairs, the array remains functional.
Scenario 4: A RAID 5 array fails during rebuild because a second drive fails. Answer: Data is lost. This highlights the importance of hot spares and quick replacement.
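The four scenarios reduce to one question: after these drives fail, does the level's redundancy still cover the data? The sketch below encodes that check. The `array_survives` function is hypothetical, and for RAID 10 it assumes drives (0, 1), (2, 3), and so on form the mirror pairs, which is a simplification of how real controllers assign pairs.

```python
def array_survives(level: int, failed: set[int], total_drives: int) -> bool:
    """Return True if the array stays online after the given drives fail.

    Drives are numbered 0..total_drives-1. For RAID 10 this sketch assumes
    consecutive drives (0,1), (2,3), ... are the mirror pairs.
    """
    if level == 0:
        return len(failed) == 0             # no redundancy at all
    if level == 1:
        return len(failed) < total_drives   # one surviving copy is enough
    if level == 5:
        return len(failed) <= 1             # single parity
    if level == 6:
        return len(failed) <= 2             # double parity
    if level == 10:
        pairs = [{i, i + 1} for i in range(0, total_drives, 2)]
        # Lost only if some mirror pair has both of its drives failed.
        return not any(pair <= failed for pair in pairs)
    raise ValueError(f"unsupported RAID level: {level}")


print(array_survives(0, {1}, 2))      # Scenario 1: False
print(array_survives(5, {2}, 3))      # Scenario 2: True
print(array_survives(10, {0, 1}, 4))  # Scenario 3, same pair: False
print(array_survives(10, {0, 2}, 4))  # Scenario 3, different pairs: True
print(array_survives(5, {0, 1}, 4))   # Scenario 4: False
```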
Default Values and Timers
Rebuild priority: Configurable from low to high. High priority speeds rebuild but degrades performance.
Rebuild rate: Many controllers let you set the share of resources devoted to rebuilding; values in the 10-50% range are typical.
Hot spare activation: Immediate upon failure detection.
Predictive failure analysis: Some controllers (e.g., HPE Smart Array) raise predictive failure alerts based on S.M.A.R.T. data.
Exam-First Summary
For 220-1101, know:
RAID 0 = striping, no redundancy, 2+ drives, capacity = sum
RAID 1 = mirroring, redundancy, 2+ drives, capacity = smallest
RAID 5 = striping with parity, 3+ drives, capacity = (N-1)*smallest
RAID 6 = double parity, 4+ drives, capacity = (N-2)*smallest
RAID 10 = mirroring + striping, 4+ drives, capacity = (N/2)*smallest
Symptoms of failure: degraded, offline, slow performance, I/O errors.
Troubleshooting steps: identify failure, check connections, replace drive, rebuild, verify.
Common wrong answers: thinking RAID 0 provides redundancy, or that RAID 5 can survive two failures.
Identify Failure Symptoms
Begin by gathering information from the user and checking system logs. Common symptoms include: the array is not visible in the OS, applications report read/write errors, or the system is unusually slow. On Windows, check Event Viewer under System logs for disk errors. On Linux, examine `/var/log/messages` or use `dmesg`. The RAID controller's management interface (e.g., HP Smart Storage Administrator, Dell OpenManage) will show the array status as 'Degraded', 'Failed', or 'Offline'. Note the RAID level and which drive(s) are indicated as failed. If the array is still functional but degraded, proceed carefully—another failure could cause total data loss.
Determine RAID Level and Redundancy
Knowing the RAID level is critical because it determines recovery options. If the array is RAID 0, any single drive failure destroys all data, and no recovery is possible without a backup. If it's RAID 1, 5, 6, or 10, the array can survive one or more failures (depending on level). Check the RAID controller configuration or the OS disk management utility. For software RAID on Linux, `mdadm --detail /dev/md0` shows the RAID level and status. On Windows, use `diskpart` with `list volume`, then `select volume` and `detail volume` for the volume in question. Document the exact model and capacity of the failed drive(s) to ensure replacement compatibility.
Check Physical Connections and Drive Status
Before replacing a drive, verify physical connections. Reseat SATA/SAS cables and power connectors. Ensure the drive is receiving power (listen for spin-up sound). If the drive is clicking or making unusual noises, it is likely mechanically failed. For SSDs, there are no moving parts, but they can fail due to NAND wear or controller issues. Check S.M.A.R.T. data if available (e.g., using `smartctl` on Linux). If the drive appears healthy but the array is degraded, the issue might be a loose cable or a faulty backplane. Swap cables or ports to rule out a bad connection.
Replace Failed Drive and Initiate Rebuild
If the drive is confirmed failed, replace it with a compatible drive. For hot-swappable drives, you can replace without powering down—the controller detects the new drive automatically. For non-hot-swap, power down the system, replace the drive, and power on. Once the new drive is installed, the controller should automatically start rebuilding the array. If not, initiate the rebuild manually via the controller management interface. Monitor rebuild progress; on Linux, `cat /proc/mdstat` shows percentage complete. Rebuild time depends on drive size and controller speed—a 1 TB drive can take several hours. During rebuild, the array is vulnerable to another failure.
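A back-of-the-envelope estimate shows why rebuilds take hours: the controller has to write the full capacity of the replacement drive. The sketch below divides capacity by a sustained rebuild throughput. The `estimate_rebuild_hours` function is hypothetical, and the 100 MB/s figure is an illustrative assumption rather than a value from any controller's documentation; real rebuilds are often slower because the array keeps serving normal I/O.

```python
def estimate_rebuild_hours(drive_size_tb: float, rebuild_mb_per_s: float) -> float:
    """Best-case rebuild time: data to write divided by sustained throughput."""
    if rebuild_mb_per_s <= 0:
        raise ValueError("rebuild throughput must be positive")
    size_mb = drive_size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / rebuild_mb_per_s / 3600


# Assumed sustained rebuild throughput of 100 MB/s (illustrative only).
print(f"{estimate_rebuild_hours(1, 100):.1f} h")  # 2.8 h
print(f"{estimate_rebuild_hours(4, 100):.1f} h")  # 11.1 h
```

The time scales linearly with drive size, so larger drives mean a proportionally longer window in which a second failure can occur.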
Verify Data Integrity and Restore if Needed
After the rebuild completes, verify that the array is healthy and data is accessible. Check the array status in the controller interface; it should show 'Optimal' or 'Normal'. Run a file system integrity check: on Windows, use `chkdsk /f` on the volume; on Linux, use `fsck` on the unmounted file system. If the array was RAID 0 or suffered multiple failures, data recovery may be impossible without a backup. In that case, restore from the most recent backup. Always document the failure and steps taken. Consider implementing a hot spare to reduce future downtime. Finally, remind the user that RAID is not a backup and that they should maintain regular backups.
Enterprise Scenario 1: Database Server with RAID 10
A large e-commerce company runs its transactional database on a server with a hardware RAID 10 array of 8 x 1 TB SSDs. The array provides both high performance (striping across four mirror pairs) and redundancy (each mirror pair can survive one failure). One day, the monitoring system alerts that the array is degraded—a drive in one mirror pair has failed. The technician logs into the Dell PERC controller interface, identifies the failed slot, and hot-swaps the drive with a new SSD of the same model and firmware. The controller automatically starts rebuilding the mirror. During the rebuild, the array is still fully accessible but with reduced redundancy for that mirror pair. The rebuild takes about 2 hours for a 1 TB drive. The technician verifies the array status and schedules a maintenance window to check the remaining drives. The key lesson: RAID 10 is resilient but requires quick replacement to avoid a second failure that could take down the array.
Enterprise Scenario 2: File Server with RAID 5
A mid-size company uses a RAID 5 array of 4 x 4 TB HDDs for file storage. The array offers a good balance of capacity (12 TB usable) and single-disk fault tolerance. A drive fails, and the array enters degraded mode. The IT staff orders a replacement, but due to a shipping delay it arrives 3 days later, so the array runs without redundancy the whole time. Once the new drive is installed and the rebuild starts, a second drive fails. The array goes offline, and all data is lost. The company had no recent backup, so the loss is permanent. This scenario illustrates the danger of RAID 5 with large drives: rebuild times can be 10+ hours, and the array is vulnerable for the entire degraded period. A hot spare would have started the rebuild immediately and shortened that window. The lesson: for critical data, consider RAID 6 or RAID 10, and always have a backup.
Enterprise Scenario 3: Hypervisor Host with Software RAID 1
A small business runs a hypervisor host on a server with a software RAID 1 (mirror) of two 500 GB SSDs for the boot volume. The RAID is managed by the motherboard's Intel RST (Rapid Storage Technology), which works only with operating systems that have an RST driver. One SSD fails, and the server continues to run on the remaining drive. The administrator notices the system is slower (reads are now from a single drive) and sees a warning in Intel RST. They power down, replace the failed SSD, and boot up. Intel RST then rebuilds the mirror onto the new drive. This is a straightforward recovery. The lesson: RAID 1 is simple and reliable for boot drives, but software RAID may have limitations (e.g., cannot hot-swap if the motherboard doesn't support it).
The 220-1101 exam tests RAID troubleshooting under Objective 5.3: 'Given a scenario, troubleshoot and diagnose problems with storage drives and RAID arrays.' You must be able to identify the correct RAID level based on description, interpret failure symptoms, and choose the appropriate recovery step. Expect scenario-based questions where you are told the RAID level and the failure, and you must predict the outcome or next step.
Common Wrong Answers and Why Candidates Choose Them
'RAID 0 provides redundancy' — Candidates confuse 'striping' with 'mirroring'. RAID 0 has no redundancy; it is purely for performance. The exam will describe a RAID 0 failure and ask what to do. The correct answer is 'restore from backup', but many choose 'replace the failed drive' because they think RAID always protects data.
'RAID 5 can survive two drive failures' — RAID 5 can only survive one failure. RAID 6 can survive two. Candidates may misremember the parity overhead.
'A RAID 10 array with two failed drives is always lost' — RAID 10 can survive multiple failures as long as no mirror pair loses both drives. The exam may describe a scenario where two drives fail in different pairs; the array remains functional.
'During a rebuild, the array is offline' — In most RAID configurations, the array remains online and accessible during rebuild, though performance is degraded. The exam may ask about availability during rebuild.
Specific Numbers and Terms That Appear on the Exam
Minimum drives: RAID 0 (2), RAID 1 (2), RAID 5 (3), RAID 6 (4), RAID 10 (4)
Usable capacity formulas: RAID 0 = sum, RAID 1 = smallest, RAID 5 = (N-1)*smallest, RAID 6 = (N-2)*smallest, RAID 10 = (N/2)*smallest
Terms: 'striping', 'mirroring', 'parity', 'degraded', 'rebuild', 'hot spare', 'hot swappable'
Failure types: 'failed drive', 'missing drive', 'predictive failure'
Edge Cases the Exam Loves
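A RAID 5 array with a failed drive and a second drive showing 'predictive failure': confirm that a current backup exists, replace the failed drive first, let the rebuild complete, and then replace the predictive failure drive as soon as possible. Pulling the predictive failure drive before the rebuild finishes would remove a second member and take the array offline.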
A RAID 0 array where one drive fails and the other is still good — data is lost; do not attempt to rebuild.
A RAID 1 array where both drives fail simultaneously — data is lost unless there is a backup.
Hot spare activation: if a hot spare is present, it automatically replaces the failed drive without user intervention.
How to Eliminate Wrong Answers Using the Underlying Mechanism
When you see a RAID question, first identify the RAID level and its redundancy characteristics. Then ask: does this level have parity or mirroring? How many failures can it tolerate? If the question involves a rebuild, remember that rebuilds happen online and consume I/O resources. If the answer choice says 'the array is offline during rebuild', it's likely wrong (unless it's a non-redundant array). Use the capacity formulas to check if the described usable capacity matches the RAID level. For example, if the question says '4 drives, 2 TB each, usable capacity 6 TB', that's RAID 5 (4-1=3, 3*2=6 TB). If they say '4 drives, 2 TB each, usable capacity 4 TB', that matches RAID 10 (4/2=2, 2*2=4 TB) but also RAID 6 (4-2=2, 2*2=4 TB), so use the fault-tolerance or performance clues in the question to decide between them.
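The capacity check described above can be run in reverse: given the drive count, drive size, and usable capacity from a question, list every level whose formula fits. This is a sketch with a hypothetical `candidate_levels` function that assumes identical drives. It also shows that capacity alone does not always single out one level.

```python
def candidate_levels(drive_count: int, drive_size_tb: int, usable_tb: int) -> list[str]:
    """List the RAID levels whose capacity formula matches the given numbers."""
    # Each entry: (minimum drives, number of drives' worth of usable space).
    formulas = {
        "RAID 0": (2, lambda n: n),
        "RAID 1": (2, lambda n: 1),
        "RAID 5": (3, lambda n: n - 1),
        "RAID 6": (4, lambda n: n - 2),
        "RAID 10": (4, lambda n: n // 2),
    }
    matches = []
    for name, (min_drives, data_drives) in formulas.items():
        if drive_count < min_drives:
            continue
        if name == "RAID 10" and drive_count % 2 != 0:
            continue  # mirror pairs need an even drive count
        if data_drives(drive_count) * drive_size_tb == usable_tb:
            matches.append(name)
    return matches


print(candidate_levels(4, 2, 6))  # ['RAID 5']
print(candidate_levels(4, 2, 4))  # ['RAID 6', 'RAID 10']
```

With four 2 TB drives, 4 TB usable fits both RAID 6 and RAID 10, so a question with those numbers has to be settled by its other clues, such as how many failures the array tolerates.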
RAID 0: striping, no redundancy, minimum 2 drives, capacity = sum of all drives.
RAID 1: mirroring, redundancy, minimum 2 drives, capacity = smallest drive.
RAID 5: striping with parity, single-drive fault tolerance, minimum 3 drives, capacity = (N-1)*smallest drive.
RAID 6: striping with double parity, two-drive fault tolerance, minimum 4 drives, capacity = (N-2)*smallest drive.
RAID 10: stripe of mirrors, can survive one failure per mirror pair, minimum 4 drives, capacity = (N/2)*smallest drive.
When a drive fails in a redundant array, replace it promptly and monitor the rebuild; the array is vulnerable during rebuild.
RAID is not a backup; always maintain separate backups of critical data.
Common troubleshooting steps: identify failure, check connections, replace drive, rebuild, verify integrity.
RAID 5 and RAID 10 come up on the exam all the time. Here's how to tell them apart.
RAID 5
Minimum 3 drives; capacity = (N-1)*smallest drive
Single parity; can survive one drive failure
Slower write performance due to parity calculation
Rebuild times are long, especially with large drives
Cost-effective: provides redundancy with less drive overhead (e.g., 4 drives yield 75% usable capacity)
RAID 10
Minimum 4 drives; capacity = (N/2)*smallest drive
Mirroring; can survive multiple failures as long as no mirror pair loses both drives
Excellent write performance (no parity overhead)
Rebuild times are faster (only mirrors need to be rebuilt)
Higher cost: mirroring leaves only 50% of raw capacity usable (e.g., 4 drives yield the capacity of 2)
Mistake
RAID is a backup solution.
Correct
RAID protects against drive failure, not data corruption, accidental deletion, malware, or disasters. It is not a substitute for regular backups. A RAID array can be lost due to controller failure, multiple drive failures, or firmware bugs.
Mistake
RAID 0 provides data redundancy.
Correct
RAID 0 (striping) has no redundancy. Data is split across drives; if any drive fails, all data is lost. It is used for performance, not protection.
Mistake
RAID 5 can survive two simultaneous drive failures.
Correct
RAID 5 uses single parity and can only survive one drive failure. RAID 6 uses double parity and can survive two failures.
Mistake
During a RAID rebuild, the array is offline and inaccessible.
Correct
In most RAID implementations, the array remains online and accessible during rebuild, though performance may be degraded. The array is vulnerable to another failure during this time.
Mistake
All RAID levels require the same number of drives.
Correct
Different RAID levels have different minimum drive requirements: RAID 0 (2), RAID 1 (2), RAID 5 (3), RAID 6 (4), RAID 10 (4). Using fewer drives than required is not possible.
If a RAID 0 array fails due to a single drive failure, all data is lost because RAID 0 has no redundancy. The only recovery option is to restore from a backup. Replace the failed drive(s) and recreate the array, then restore data. There is no way to rebuild a RAID 0 array after a failure.
A failed drive can be replaced while the system is running if the system supports hot-swappable drives. Most enterprise RAID controllers and many desktop motherboards with RAID support allow you to remove and replace a failed drive without powering down. The controller will detect the new drive and automatically begin rebuilding the array. Always check the manufacturer's documentation to confirm hot-swap capability.
Rebuild time depends on drive size, drive speed, RAID level, controller speed, and system load. For example, rebuilding a 1 TB HDD in a RAID 5 array can take 2-6 hours. SSDs rebuild faster, typically 30 minutes to 2 hours for a 1 TB drive. During rebuild, the array is vulnerable to another failure, so minimize stress on the array.
A hot spare is a drive that is installed in the system and powered on but not actively used until a drive fails. When a failure occurs, the RAID controller automatically uses the hot spare to rebuild the array without manual intervention. A cold spare is an unused drive stored offline; it must be manually installed after a failure. Hot spares reduce downtime.
Drives of different sizes can be mixed in a RAID array, but each drive contributes only as much space as the smallest drive. For example, in a RAID 5 array with three drives of 1 TB, 2 TB, and 2 TB, the usable capacity is (3-1)*1 TB = 2 TB. The extra space on the larger drives is wasted. For best performance and capacity, use identical drives.
A degraded array is one where a drive has failed or is missing, but the array is still operational because of redundancy (mirroring or parity). Performance may be reduced because the controller must reconstruct data on-the-fly from parity or from the remaining mirror. The array should be repaired as soon as possible to restore full redundancy.
Hardware RAID uses a dedicated controller with its own processor and cache, offloading RAID calculations from the CPU. It generally offers better performance, more features (e.g., battery-backed cache), and OS independence. Software RAID uses the system's CPU and is less expensive but can impact performance under heavy load. For the A+ exam, know that both exist and that hardware RAID is more common in servers.