[RAID] Troubleshooting Linux Software RAID
Symptoms
dmesg shows drive errors
dmesg
SMART monitoring reports errors
smartctl -a DEV
Troubleshooting Commands
Check raid status
cat /proc/mdstat
Shows the state of every md array on the system.
The md device line shows the RAID device, its state, and its component devices. The number of component devices should match the number expected in the array; if it does not, there is a fault.
e.g. md0 : active raid5 sde1[0] sdf1[4] sdb1[5] sdd1[2] sdc1[1]
The md config/status line shows the array size, superblock version, RAID level, chunk size, and per-device status.
e.g. 1250241792 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
[5/5] is the number of devices the array expects and the number currently active. [UUUUU] is the status of each device: U for up, _ for down.
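The status map described above can be checked by a script rather than by eye. The following is a minimal sketch, not a definitive tool: the function name degraded_arrays is hypothetical, and it assumes the /proc/mdstat layout shown in the examples above (an "mdN :" device line followed by a status line ending in a [U_] map).

```shell
#!/bin/sh
# Print every md array whose status map contains "_" (a member that is down).
# Reads the file given as $1, defaulting to /proc/mdstat.
# degraded_arrays is an illustrative name, not part of mdadm.
degraded_arrays() {
    awk '
        # Device line, e.g. "md0 : active raid5 sde1[0] ...": remember the name.
        /^md[^ ]* :/ { name = $1 }

        # Status line, e.g. "... [5/4] [UUU_U]": extract the [U_] map.
        match($0, /\[[U_]+\]/) {
            map = substr($0, RSTART, RLENGTH)
            if (index(map, "_") > 0) print name, map
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays with no argument reads the live /proc/mdstat; an empty result means no array reported a down member.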
Identify Block Devices
lsblk
Add device to array
mdadm --manage MD_DEV --add DEV
Remove device from array
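mdadm --manage MD_DEV --remove DEV
Only failed or spare devices can be removed. Mark an active device as failed first with mdadm --manage MD_DEV --fail DEV.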
Stop array
mdadm --stop MD_DEV
Reassemble array
mdadm --assemble --force MD_DEV DEV [DEV ...]
mdadm --assemble --force --scan
Show array
mdadm --detail MD_DEV
Show individual drive information
mdadm --examine DEV
Use Cases
RAID inactive and drives are functioning correctly
Something caused an inconsistency in the array, and it was deactivated out of caution. If the drives are healthy, the array should be intact and can be restarted.
Stop the array
mdadm --stop MD_DEV
Reassemble the array
mdadm --assemble --force --scan
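The stop-and-reassemble sequence above could be wrapped in a small script that prints each command before running it. This is a sketch under stated assumptions: the helper names run and restart_array and the DRY_RUN variable are hypothetical, and only the mdadm invocations already listed on this page are used. By default nothing is executed; the commands are only printed.

```shell
#!/bin/sh
# Print a command, and execute it only when DRY_RUN=0.
# DRY_RUN defaults to 1 so that nothing touches the array by accident.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Stop an inactive array, reassemble it, then show its details.
# Stops at the first command that fails.
restart_array() {
    md_dev=$1
    run mdadm --stop "$md_dev" || return 1
    run mdadm --assemble --force --scan || return 1
    run mdadm --detail "$md_dev"
}
```

Run restart_array /dev/md0 to review the commands first, then repeat with DRY_RUN=0 as root to execute them.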
RAID inactive and a drive has failed
One of the drives has failed and the array has been deactivated. Remove the failed drive from the array and restart the array without it.
Determine which drive has failed. Failed members are marked (F) on the md device line.
cat /proc/mdstat
Remove the faulty drive(s) from the array.
mdadm --manage MD_DEV --remove DEV
Reassemble the array
mdadm --assemble --force --scan
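The "determine which drive has failed" step can be partly automated, because the kernel appends (F) to a faulty member on the md device line of /proc/mdstat. The sketch below assumes that layout; the function name failed_members is hypothetical. A drive that has disappeared from the system entirely is not marked (F); it is simply missing from the device line, so this check does not replace reading the status map.

```shell
#!/bin/sh
# Print "ARRAY /dev/MEMBER" for every member marked (F) in mdstat.
# Reads the file given as $1, defaulting to /proc/mdstat.
# failed_members is an illustrative name, not part of mdadm.
failed_members() {
    awk '
        # Device line, e.g. "md0 : active raid5 sde1[0] sdb1[5](F) ..."
        /^md[^ ]* :/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[[0-9]+\].*$/, "", dev)   # strip "[5](F)" suffix
                    print $1, "/dev/" dev
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

Each output line gives the MD_DEV name and the DEV path to pass to mdadm --manage MD_DEV --remove DEV.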
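Add device to array
Use this to replace a removed drive or to add a spare. If the array is degraded, the new device starts rebuilding immediately; otherwise it is held as a spare. To increase the number of active devices, follow up with mdadm --grow MD_DEV --raid-devices=N.
Add the device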
mdadm --manage MD_DEV --add DEV
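After a device is added to a degraded array, /proc/mdstat normally gains a progress line such as "recovery = 12.6% (...)" while the array rebuilds. The following sketch extracts that percentage. The function name rebuild_progress is hypothetical, and the progress-line format is an assumption based on typical kernel output, so check it against the system at hand.

```shell
#!/bin/sh
# Print "ARRAY PERCENT" for every array with a recovery, resync, or reshape
# in progress. Reads the file given as $1, defaulting to /proc/mdstat.
# rebuild_progress is an illustrative name, not part of mdadm.
rebuild_progress() {
    awk '
        # Device line: remember which array the following lines belong to.
        /^md[^ ]* :/ { name = $1 }

        # Progress line, e.g. "[==>......]  recovery = 12.6% (...) finish=..."
        /(recovery|resync|reshape) *=/ {
            if (match($0, /[0-9.]+%/))
                print name, substr($0, RSTART, RLENGTH)
        }
    ' "${1:-/proc/mdstat}"
}
```

No output means no rebuild is currently reported, either because it has finished or because it has not started.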