Replace failed disk in MegaRAID array

This tutorial explains how to replace a failed disk in a RAID array.

Important Notice

For systems that are equipped with a RAID array you should consider certain things before there is a problem:

1. Never operate the RAID array near maximum performance

Reason: When a disk fails, the performance of the RAID array drops well below optimal. The contents of the missing disk have to be reconstructed from the parity data (RAID-5), and reads and writes can no longer be distributed across all physical disks. If the remaining storage performance is not sufficient, this may lead to catastrophic problems, for example if several virtual machines are served from this RAID array.

In addition, when the broken disk has been replaced by a new one, the RAID controller rebuilds the array by reading / writing the whole volume. This will put even more load on the degraded system.

2. Avoid using several identical disks from the same production batch

Reason: Sometimes a particular production batch has quality problems, so if one disk fails there is a good chance that another disk from the same batch will fail at roughly the same time. This is even more likely if the RAID array is under heavy load (see #1).

3. Document physical disk locations

Reason: If the RAID array fails, you want to identify and replace the broken disk as fast as possible. But you absolutely do not want to damage the array even further by removing the wrong disk (the one that is still working). So learn the commands to locate a disk, understand the status messages, and label each disk bay properly.

4. Have spare disks ready

Reason: You do not want to run a degraded RAID array any longer than necessary. Ideally you should have several spare disks available. They are not that expensive.

5. Practice the RAID array recovery

Reason: It is much safer to spend a couple of hours with the RAID array before it is deployed than to start your training on an already degraded RAID array having terabytes of valuable customer or company data.

Also, use this opportunity to train a colleague so there will be a backup in case you are unavailable.

Install MegaCli software

You can download the latest release from LSI here: http://www.lsi.com/support/Pages/Download-Search.aspx . Search for "megacli".

There is also an excellent driver repository here: http://www.thomas-krenn.com/de/download.html (in German).

To install the software, unpack the zip file and run the install command according to your OS. For an RPM-based Linux system, the command is usually:

# rpm -Uhv ./MegaCLI/MegaCli_Linux/MegaCli-8.05.71-1.noarch.rpm
Preparing...                ########################################### [100%]
   1:MegaCli                ########################################### [100%]

The software is installed at /opt/MegaRAID/MegaCli/MegaCli, so you either need to include this in your $PATH or use the absolute path name on every invocation.
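One way to avoid typing the absolute path every time is sketched below. It assumes a Bourne-style shell and the install location shown above; putting the lines into a startup file such as ~/.profile is a suggestion, not something the installer does for you.

```shell
# Append the MegaCli directory to PATH unless it is already there.
MEGACLI_DIR=/opt/MegaRAID/MegaCli
case ":$PATH:" in
  *":$MEGACLI_DIR:"*) ;;                        # already present, nothing to do
  *) PATH="$PATH:$MEGACLI_DIR"; export PATH ;;
esac
```

After that, the commands in this tutorial can be shortened to "MegaCli ...". The examples below keep the absolute path so that they work either way.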

Check status of MegaRAID controller and disks

To find out whether there are any failed disks, run the command below. This will return a detailed status of all LSI MegaRAID controllers on this system, including the status of their disks. Look for the "Device Present" section, which looks like this for a RAID array without any problems:

# /opt/MegaRAID/MegaCli/MegaCli -AdpAllInfo -aAll
...
                Device Present
                ================
Virtual Drives    : 2
  Degraded        : 0
  Offline         : 0
Physical Devices  : 5
  Disks           : 4
  Critical Disks  : 0
  Failed Disks    : 0

If one or more virtual drives are listed as "Degraded", a physical disk has problems and should be replaced. Most RAID configurations other than RAID-0 (striping) provide redundancy, so the virtual drive is still accessible and no data has been lost at this point. However, the failed disk should be replaced as soon as possible.
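If you want to run this check regularly (for example from cron), the counters can be evaluated by a small script. The following is a minimal sketch, not an official tool: the function name raid_health is made up for this example, and it assumes the counter lines look exactly like the "Device Present" listing above.

```shell
# Summarize the "Device Present" counters of "MegaCli -AdpAllInfo -aAll".
# Reads the report on stdin, prints every non-zero problem counter followed
# by OK or PROBLEM, and returns exit status 0 (OK) or 1 (PROBLEM).
raid_health() {
  awk -F: '
    BEGIN { bad = 0 }
    # Counter lines look like "  Degraded        : 0"
    /^ *(Degraded|Offline|Critical Disks|Failed Disks) *:/ {
      gsub(/^ +| +$/, "", $1)                 # trim the counter name
      if ($2 + 0 > 0) { bad = 1; printf "%s: %d\n", $1, $2 }
    }
    END { print (bad ? "PROBLEM" : "OK"); exit bad }
  '
}
```

Usage: /opt/MegaRAID/MegaCli/MegaCli -AdpAllInfo -aAll | raid_health

With several controllers the counters appear once per adapter; the sketch reports a problem if any of them is non-zero, without saying which adapter is affected.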

Disable RAID controller sound alarm

The LSI MegaRAID SAS 9260-4i controller will emit a periodic and very annoying beep when there is a problem. It is so loud that you might get a phone call from the data center staff demanding that you turn it off. Here's the command to do that:

# /opt/MegaRAID/MegaCli/MegaCli -AdpSetProp -AlarmSilence -aALL
Adapter 0: Set alarm to Silenced success.

The alarm is silenced for the current problem only. Any status change will make it beep again, and this includes adding a new disk and bringing it online. So you may have to execute this command again after running other commands. It is also possible to completely disable the alarm.
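For completeness, here is a sketch of how the alarm could be switched off permanently and back on. The property names AlarmDsbl, AlarmEnbl and AlarmDsply are what MegaCli is commonly documented to accept; they are not shown elsewhere in this tutorial, so please verify them with "MegaCli -h" on your version. The function names and the MEGACLI variable are made up for this example.

```shell
# Assumption: MEGACLI points at the binary installed above.
MEGACLI=${MEGACLI:-/opt/MegaRAID/MegaCli/MegaCli}

# Show the current alarm setting of all adapters.
alarm_status()  { "$MEGACLI" -AdpGetProp AlarmDsply -aALL; }
# Switch the alarm off permanently (it stays off for future failures too).
alarm_disable() { "$MEGACLI" -AdpSetProp AlarmDsbl -aALL; }
# Switch the alarm back on.
alarm_enable()  { "$MEGACLI" -AdpSetProp AlarmEnbl -aALL; }
```

Keep in mind that a permanently disabled alarm also stays quiet for the next failure, so only do this if you monitor the array by other means.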

Identify failed disk drive

The next step is to figure out which disk has to be replaced. This is critical - you don't want to accidentally pull a good disk drive from an already degraded RAID array. Most RAID configurations can handle one missing disk, but not two.

The MegaRAID controllers keep an event log which can be saved to a file. The log below is in reverse chronological order (older messages at the bottom). Read from the bottom up, this is what happened to the RAID array:

1. The disk returned a sense status of b/00/00 for a command (several older events are not shown here). The event contains the Enclosure Index (252) and the Slot Number (0), which we need in the next steps.
2. As a consequence, physical disk #05 was marked as "failed".
3. This in turn caused the virtual drive this disk is a member of to be marked as "degraded".

# /opt/MegaRAID/MegaCli/MegaCli -AdpEventLog -GetLatest 100 -f events.log -aALL
# more events.log

seqNum: 0x0000af33
Time: Mon Aug 26 08:44:03 2013
Code: 0x000000fb
Class: 2
Locale: 0x01
Event Description: VD 00/1 is now DEGRADED
Event Data:
===========
Target Id: 0

seqNum: 0x0000af32
Time: Mon Aug 26 08:44:03 2013
Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/1 from OPTIMAL(3) to DEGRADED(2)
Event Data:
===========
Target Id: 0
Previous state: 3
New state: 2

seqNum: 0x0000af31
Time: Mon Aug 26 08:44:03 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from ONLINE(18) to FAILED(11)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 24
New state: 17

seqNum: 0x0000af30
Time: Mon Aug 26 08:44:03 2013
Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 05(e0xfc/s0) Path 4433221103000000, CDB: 2e 00 3a 38 1b c7 00 00 01 00, Sense: b/00/00
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
CDB Length: 10
CDB Data:
002e 0000 003a 0038 001b 00c7 0000 0000 0001 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data:
0070 0000 000b 0000 0000 0000 0000 000a 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x0000af2f
Time: Mon Aug 26 08:44:02 2013
Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 05(e0xfc/s0) Path 4433221103000000, CDB: 2e 00 3a 38 1b c7 00 00 01 00, Sense: b/00/00
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
CDB Length: 10
CDB Data:
002e 0000 003a 0038 001b 00c7 0000 0000 0001 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data:
0070 0000 000b 0000 0000 0000 0000 000a 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
...

In this case, the disk in enclosure 252, slot number 0 is the leftmost disk bay in this particular server, but of course it might be different for other servers.
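The event log is long, so it can help to reduce it to the physical-disk state changes. The sketch below assumes the log format shown above (a "Time:" line, then the "Event Description:" line, then "Enclosure Index:" and "Slot Number:" lines); the function name pd_state_changes is made up for this example.

```shell
# Print one line per physical-disk state change found in a saved event log.
# Reads the log on stdin. Output format:
#   <time> | PD <id> | [enclosure:slot] | <old state> -> <new state>
pd_state_changes() {
  awk '
    /^Time: / { time = substr($0, 7) }            # text after "Time: "
    /^Event Description: State change on PD / {
      pd = $7; sub(/\(.*/, "", pd)                # "05(e0xfc/s0)" -> "05"
      was = $9; now = $11                         # e.g. ONLINE(18), FAILED(11)
      pending = 1
    }
    pending && /^Enclosure Index: / { encl = $3 }
    pending && /^Slot Number: / {
      printf "%s | PD %s | [%s:%s] | %s -> %s\n", time, pd, encl, $3, was, now
      pending = 0
    }
  '
}
```

Usage: pd_state_changes < events.log

The [enclosure:slot] part is already in the form that the -PhysDrv parameter expects. Note that "e0xfc" in the event description is simply the enclosure number in hexadecimal (0xfc = 252).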

If you did not document your map of enclosures, slot numbers and physical disk locations, it is sometimes possible to find the broken disk with this command:

# /opt/MegaRAID/MegaCli/MegaCli -pdLocate -start -PhysDrv \[252:0\] -aALL
# /opt/MegaRAID/MegaCli/MegaCli -pdLocate -stop -PhysDrv \[252:0\] -aALL

This will attempt to flash the LED of the disk bay. The parameters are the enclosure ID and the slot number. The square brackets must be escaped with a backslash (or the whole argument must be quoted); otherwise the shell may interpret them as a filename pattern instead of passing them to the MegaCli command.

The problem with this command is that the LED of a broken disk might not flash at all, or that it might flash very similarly to the LEDs of the other bays. So you need to be very cautious here.
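If you walk to the server yourself, a small wrapper that blinks the LED for a fixed time and then stops it can be convenient. This is only a sketch: the function name blink_bay and the MEGACLI variable are made up for this example, and it uses quoting instead of backslashes to protect the brackets.

```shell
# Assumption: MEGACLI points at the binary installed above.
MEGACLI=${MEGACLI:-/opt/MegaRAID/MegaCli/MegaCli}

# Blink the bay LED of enclosure $1, slot $2 for $3 seconds (default 30).
# The double quotes around "[E:S]" keep the shell from treating the
# brackets as a filename pattern, so no backslashes are needed.
blink_bay() {
  encl=$1 slot=$2 secs=${3:-30}
  "$MEGACLI" -pdLocate -start -PhysDrv "[$encl:$slot]" -aALL || return 1
  sleep "$secs"
  "$MEGACLI" -pdLocate -stop -PhysDrv "[$encl:$slot]" -aALL
}
```

Usage: blink_bay 252 0 60

If you interrupt the function with Ctrl-C, the stop command is not executed and the LED keeps blinking until you run the -stop command by hand.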

The best method is to identify each disk bay before there is a problem, and to write it down.
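A starting point for such a map is the physical disk list of the controller. The sketch below condenses the output of "MegaCli -PDList -aALL" to one line per disk. It assumes that the output contains the fields "Enclosure Device ID", "Slot Number", "Firmware state" and "Inquiry Data" in this order, which is what MegaCli typically prints, but check it against your own output. The function name pd_map is made up for this example.

```shell
# Condense "MegaCli -PDList -aALL" output (stdin) to one line per disk:
#   [enclosure:slot] <TAB> firmware state <TAB> inquiry data (model, serial)
pd_map() {
  awk '
    function val(s) { sub(/^[^:]*: */, "", s); return s }   # text after "Name: "
    /^Enclosure Device ID:/ { encl  = val($0) }
    /^Slot Number:/         { slot  = val($0) }
    /^Firmware state:/      { state = val($0) }
    /^Inquiry Data:/        { printf "[%s:%s]\t%s\t%s\n", encl, slot, state, val($0) }
  '
}
```

Usage: /opt/MegaRAID/MegaCli/MegaCli -PDList -aALL | pd_map > disk-map.txt

The physical bay position still has to be added by hand, for example by locating each healthy disk once and noting which bay lights up.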

Replace disk

Now that you have found the broken disk, replace it with a new one. If you do not have the same type of disk, it is also possible to use any other disk that has at least the same size. Depending on the RAID controller configuration, the controller will activate the new disk automatically and start the rebuild process.

You can monitor the progress of the rebuild using the event log:

seqNum: 0x0000afa6
Time: Mon Aug 26 23:13:48 2013
Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/1 is now OPTIMAL
Event Data:
===========
Target Id: 0

seqNum: 0x0000afa5
Time: Mon Aug 26 23:13:48 2013
Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/1 from DEGRADED(2) to OPTIMAL(3)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 3

seqNum: 0x0000afa4
Time: Mon Aug 26 23:13:48 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from REBUILD(14) to ONLINE(18)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 20
New state: 24

seqNum: 0x0000afa3
Time: Mon Aug 26 23:13:48 2013
Code: 0x00000064
Class: 0
Locale: 0x02
Event Description: Rebuild complete on PD 05(e0xfc/s0)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0

seqNum: 0x0000afa2
Time: Mon Aug 26 23:13:48 2013
Code: 0x00000063
Class: 0
Locale: 0x02
Event Description: Rebuild complete on VD 00/1
Event Data:
===========
Target Id: 0

seqNum: 0x0000afa1
Time: Mon Aug 26 23:13:21 2013
Code: 0x00000067
Class: -1
Locale: 0x02
Event Description: Rebuild progress on PD 05(e0xfc/s0) is 99.94%(45514s)
Event Data:
===========

...

seqNum: 0x0000af3e
Time: Mon Aug 26 10:42:40 2013
Code: 0x00000067
Class: -1
Locale: 0x02
Event Description: Rebuild progress on PD 05(e0xfc/s0) is 0.99%(473s)
Event Data:
===========

seqNum: 0x0000af3d
Time: Mon Aug 26 10:34:47 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from OFFLINE(10) to REBUILD(14)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 16
New state: 20

seqNum: 0x0000af3c
Time: Mon Aug 26 10:34:47 2013
Code: 0x00000067
Class: -1
Locale: 0x02
Event Description: Rebuild progress on PD 05(e0xfc/s0) is 0.00%(0s)
Event Data:
===========

seqNum: 0x0000af3b
Time: Mon Aug 26 10:34:47 2013
Code: 0x0000006a
Class: 0
Locale: 0x02
Event Description: Rebuild automatically started on PD 05(e0xfc/s0)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0

seqNum: 0x0000af3a
Time: Mon Aug 26 10:34:47 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from UNCONFIGURED_GOOD(0) to OFFLINE(10)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 0
New state: 16

seqNum: 0x0000af39
Time: Mon Aug 26 10:34:47 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from UNCONFIGURED_BAD(1) to UNCONFIGURED_GOOD(0)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 1
New state: 0

seqNum: 0x0000af38
Time: Mon Aug 26 10:34:47 2013
Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 05(e0xfc/s0) Info: enclPd=fc, scsiType=0, portMap=01, sasAddr=4433221103000000,0000000000000000
Event Data:
===========
Device ID: 5
Enclosure Device ID: 252
Enclosure Index: 1
Slot Number: 0
SAS Address 1: 4433221103000000
SAS Address 2: 0

seqNum: 0x0000af37
Time: Mon Aug 26 10:34:47 2013
Code: 0x0000005b
Class: 0
Locale: 0x02
Event Description: Inserted: PD 05(e0xfc/s0)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0

seqNum: 0x0000af36
Time: Mon Aug 26 10:26:10 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from FAILED(11) to UNCONFIGURED_BAD(1)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 17
New state: 1

seqNum: 0x0000af35
Time: Mon Aug 26 10:26:10 2013
Code: 0x000000f8
Class: 0
Locale: 0x02
Event Description: Removed: PD 05(e0xfc/s0) Info: enclPd=fc, scsiType=0, portMap=01, sasAddr=4433221103000000,0000000000000000
Event Data:
===========
Device ID: 5
Enclosure Device ID: 252
Enclosure Index: 1
Slot Number: 0
SAS Address 1: 4433221103000000
SAS Address 2: 0

seqNum: 0x0000af34
Time: Mon Aug 26 10:26:10 2013
Code: 0x00000070
Class: 1
Locale: 0x02
Event Description: Removed: PD 05(e0xfc/s0)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0

The rebuild process can take many hours or even days. In this case, the timestamps in the log show about 12 hours 40 minutes for a 250 GByte volume (10:34:47 to 23:13:48).
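The "Rebuild progress" events contain the percentage done and the elapsed seconds, so a rough estimate of the remaining time is possible. The sketch below takes the newest progress event from a saved event log and extrapolates linearly. That the rebuild rate stays constant is an assumption, and the function name rebuild_eta is made up for this example.

```shell
# Estimate the remaining rebuild time from the newest "Rebuild progress" event.
# Reads a saved event log on stdin (newest events first, as in the listing
# above) and assumes that the rebuild rate stays constant.
rebuild_eta() {
  awk '
    /^Event Description: Rebuild progress on PD / {
      # The last field looks like "0.99%(473s)".
      pct = $NF;  sub(/%.*/, "", pct)                          # percent done
      secs = $NF; sub(/.*\(/, "", secs); sub(/s\)$/, "", secs) # seconds elapsed
      found = 1
      if (pct + 0 > 0) {
        printf "%.2f%% done, about %.1f hours left\n", pct, secs * (100 - pct) / pct / 3600
      } else {
        print "rebuild just started, no estimate yet"
      }
      exit                                  # only the newest event is of interest
    }
    END { if (!found) print "no rebuild progress events found" }
  '
}
```

Usage: rebuild_eta < events.log

For the event "0.99%(473s)" shown above, this estimates about 13.1 hours remaining, which is reasonably close to what the rebuild actually took.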

The new disk should be a brand new one, or at least zeroed.

Handling foreign disks

If you insert a disk that was already used, it is not automatically accepted; instead it is marked as "foreign". This often happens if you remove a disk and put it right back in. Since the disk was not updated by the RAID controller while it was removed (even if only for a few seconds), a full rebuild is required.

Use the commands below:

# /opt/MegaRAID/MegaCli/MegaCli -PDMakeGood -PhysDrv \[252:0\] -aALL
# /opt/MegaRAID/MegaCli/MegaCli -CfgForeign -Clear -aALL
# /opt/MegaRAID/MegaCli/MegaCli -PDHSP -Set -PhysDrv \[252:0\] -aALL

The first command marks the re-inserted disk drive as "good" and usable, the second clears the foreign RAID configuration found on that disk, and the third designates the disk as a "Hot Spare". Since a disk is missing from the RAID array, the hot spare is automatically added to the array and the rebuild process starts.
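Because each of these steps only makes sense if the previous one worked, you may want to chain them. The following sketch runs the three commands in order and stops at the first failure. It assumes that MegaCli reports a failure through a non-zero exit status; the function name readmit_disk and the MEGACLI variable are made up for this example.

```shell
# Assumption: MEGACLI points at the binary installed above.
MEGACLI=${MEGACLI:-/opt/MegaRAID/MegaCli/MegaCli}

# Re-admit a re-inserted disk at enclosure $1, slot $2.
# Runs the three steps in order and stops at the first one that fails.
readmit_disk() {
  drv="[$1:$2]"                                   # quoted below, no backslashes needed
  "$MEGACLI" -PDMakeGood -PhysDrv "$drv" -aALL &&
  "$MEGACLI" -CfgForeign -Clear -aALL &&
  "$MEGACLI" -PDHSP -Set -PhysDrv "$drv" -aALL
}
```

Usage: readmit_disk 252 0

Double-check the enclosure and slot numbers before running this, since the foreign configuration on the disk is discarded.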
