[RAID] Troubleshooting Linux Software RAID

Symptoms

  • dmesg shows drive errors
    dmesg

  • SMART monitoring shows errors
    smartctl -a DEV
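
The kernel log is usually noisy, so it can help to filter it down to lines that look like drive errors. The following is a minimal sketch: the function name filter_drive_errors is hypothetical, and the pattern list is an assumption based on common kernel wording, not a complete catalogue of error messages.

```shell
# Hypothetical helper: keep only log lines that look like drive errors.
# Reads log text on stdin. The patterns are assumptions, not an exhaustive list:
#   "I/O error"      block layer read/write failures
#   "medium error"   SCSI sense data for unreadable sectors
#   "failed command" libata reporting a failed ATA command
#   "disk failure"   md reporting that it disabled a member
filter_drive_errors() {
    grep -Ei 'I/O error|medium error|failed command|disk failure'
}
```

For example, dmesg | filter_drive_errors prints only the matching lines. For a quick SMART summary of a single drive, smartctl -H DEV prints the overall health assessment rather than the full report.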

Troubleshooting Commands

  • Check raid status

    • cat /proc/mdstat

      • Shows information about each array.

        • The md device line shows the array device, its state and RAID level, and the component devices. The number of component devices should match the number expected for the array; a missing device indicates a fault.

          e.g. md0 : active raid5 sde1[0] sdf1[4] sdb1[5] sdd1[2] sdc1[1]

        • The md config/status line shows the size, metadata version, layout, and member status.

          e.g. 1250241792 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]

          [5/5] is the number of devices the array expects followed by the number currently active. [UUUUU] shows the status of each device: U for up, _ for down.

  • Identify Block Devices

    • lsblk

  • Add device to array

    • mdadm --manage MD_DEV --add DEV

  • Remove device from array

    • mdadm --manage MD_DEV --remove DEV

  • Stop array

    • mdadm --stop MD_DEV

  • Reassemble array

    • mdadm --assemble --force MD_DEV DEV [DEV ...]

    • mdadm --assemble --force --scan

  • Show array

    • mdadm --detail MD_DEV

  • Show individual drive information

    • mdadm --examine DEV
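
The per-device status field in /proc/mdstat (for example [UUUUU]) can be checked mechanically. The sketch below assumes the usual layout, with the md device line followed by a status line ending in a [U_...] field; the function name degraded_arrays is hypothetical.

```shell
# Hypothetical helper: print the name of every array with a missing member.
# Reads /proc/mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        # Remember which array the following status line belongs to.
        /^md[^ ]* :/ { name = $1 }

        # The status field contains only "U" and "_", so [5/5] is skipped.
        name != "" && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_"))
                print name
            name = ""
        }
    '
}
```

Run it as degraded_arrays < /proc/mdstat. No output means every array reports all members up.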

Use Cases

RAID inactive and drives are functioning correctly

Something caused an inconsistency in the RAID array, and the array has been marked inactive as a precaution. If the drives themselves are healthy, the array can usually be restarted.

  1. Stop the array

    1. mdadm --stop MD_DEV

  2. Reassemble the array

    1. mdadm --assemble --force --scan
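
The two steps above can be wrapped in a small script so they can be reviewed before anything is changed. This is a sketch only: the run and restart_array helpers and the DRY_RUN variable are hypothetical names introduced for illustration, and the mdadm invocations are the ones listed in the steps.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: stop the given array, then force-assemble by scanning.
# Stops early if the array cannot be stopped (for example, still mounted).
restart_array() {
    md_dev=$1
    run mdadm --stop "$md_dev" || return 1
    run mdadm --assemble --force --scan
}
```

DRY_RUN=1 restart_array /dev/md0 prints the two mdadm commands without running them; dropping DRY_RUN executes them, which requires root.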

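RAID inactive and a drive has failed

One of the drives has failed and the array has been marked inactive. Remove the failed drive from the array and restart it.

  1. Determine which drive has failed. Failed members are flagged (F) on the md device line and show as _ in the status field.

    1. cat /proc/mdstat

  2. Remove the faulty drive(s) from the array. mdadm only removes devices that are marked failed or are spares; if the drive is not yet marked failed, mark it first with mdadm --manage MD_DEV --fail DEV.

    1. mdadm --manage MD_DEV --remove DEV

  3. Reassemble the array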

    1. mdadm --assemble --force --scan
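
Step 1 can be automated by looking for the (F) flag that the kernel appends to failed members on the md device line. A minimal sketch, assuming that layout; the function name failed_members is hypothetical.

```shell
# Hypothetical helper: print "ARRAY MEMBER" for every member flagged (F).
# Reads /proc/mdstat-formatted text on stdin.
failed_members() {
    awk '
        /^md[^ ]* :/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    member = $i
                    sub(/\[.*$/, "", member)   # drop the "[0](F)" suffix
                    print $1, member
                }
            }
        }
    '
}
```

Run it as failed_members < /proc/mdstat. The names printed are kernel names without the /dev/ prefix, so prepend /dev/ before passing them to mdadm --manage MD_DEV --remove DEV.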

Add device to array

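Use to add a replacement or spare device to an array. If the array is degraded, the new device is used to rebuild it; otherwise it is held as a spare. Growing the number of active devices is a separate step (mdadm --grow MD_DEV --raid-devices=N).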

  1. Add device to array

    1. mdadm --manage MD_DEV --add DEV
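
When a device is added to a degraded array, the kernel reports rebuild progress in /proc/mdstat on a line containing "recovery = N%" (or "resync = N%" during a resync). The sketch below extracts that percentage; the function name recovery_progress is hypothetical, and the line format is assumed to match what current kernels print.

```shell
# Hypothetical helper: print "ARRAY PERCENT" for each array being rebuilt.
# Reads /proc/mdstat-formatted text on stdin.
recovery_progress() {
    awk '
        # Remember which array the progress line belongs to.
        /^md[^ ]* :/ { name = $1 }

        /(recovery|resync) *= *[0-9.]+%/ {
            match($0, /[0-9.]+%/)              # first number followed by "%"
            print name, substr($0, RSTART, RLENGTH)
        }
    '
}
```

Run it as recovery_progress < /proc/mdstat, or under watch to follow the rebuild. No output means no array is currently recovering.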
