Linux

I've been using Linux/Unix for many years. I've always had a strong interest in technology in general and computing specifically.

These are my opinions. Opinions are like noses: everyone has one, and they all smell.

Enjoy your visit.

Replacing a failed software RAID drive

I referenced these instructions to remind me how to replace a drive.

In my case the output of mdstat looks like this:

# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda2[0] sdb2[1]
33615936 blocks [2/2] [UU]

md1 : active raid1 sda3[2](F) sdb3[1]
2096384 blocks [2/1] [_U]

md0 : active raid1 sda1[0] sdb1[1]
128384 blocks [2/2] [UU]

unused devices: <none>
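
Before touching anything, it can help to confirm exactly which component the kernel has kicked out of the array. This is a side note rather than part of my original notes, but something like the following shows the failed device explicitly:

mdadm --detail /dev/md1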

So I have three RAID 1 arrays, each built from a matching pair of partitions on the two drives, and sda3 is failing. This is the message I received in email:

This is an automatically generated mail message from mdadm
running on host.domain.com

A Fail event had been detected on md device /dev/md1.

Faithfully yours, etc.
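
As an aside, that mail only arrives if mdadm is running in monitor mode. On most distributions that means an mdadm --monitor --scan daemon started at boot plus a destination address in the mdadm configuration file (/etc/mdadm.conf, or /etc/mdadm/mdadm.conf on some distros); the address below is just a placeholder:

MAILADDR root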

The partition tables on both drives, as reported by fdisk, looked like this:

Device Boot Start End Blocks Id System
/dev/sda1 * 1 16 128488+ fd Linux raid autodetect
/dev/sda2 17 4201 33616012+ fd Linux raid autodetect
/dev/sda3 4202 4462 2096482+ fd Linux raid autodetect

Device Boot Start End Blocks Id System
/dev/sdb1 * 1 16 128488+ fd Linux raid autodetect
/dev/sdb2 17 4201 33616012+ fd Linux raid autodetect
/dev/sdb3 4202 4462 2096482+ fd Linux raid autodetect

Disk /dev/md0: 131 MB, 131465216 bytes
Disk /dev/md1: 2146 MB, 2146697216 bytes
Disk /dev/md2: 34.4 GB, 34422718464 bytes

Removing the failed partition(s) and disk:

I used the mdadm command to first mark each of the failing drive's partitions as faulty

mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda3
mdadm --manage /dev/md2 --fail /dev/sda2

and then remove them from their arrays.

mdadm --manage /dev/md0 --remove /dev/sda1
mdadm --manage /dev/md1 --remove /dev/sda3
mdadm --manage /dev/md2 --remove /dev/sda2
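
As a side note, mdadm can also fail and remove a partition in a single invocation; I did not do it that way, but something like this should be equivalent:

mdadm /dev/md1 --fail /dev/sda3 --remove /dev/sda3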

Then I shut down the system

shutdown -h now

and replaced the drive with a new one. Then I tried to reboot, but because the failed drive was the first drive in the SCSI chain, the system refused to boot and gave the message:

No Operating System Present

Adding the new disk to the RAID arrays:

So I ended up having to switch the drives, putting sdb in the sda slot and then proceeding. I used sfdisk to copy the partition table from the surviving drive (now sda) to the new drive (now sdb).

sfdisk -d /dev/sda | sfdisk /dev/sdb
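
Before re-adding the partitions, it is worth eyeballing that the new drive's layout really does match. Not something from my original notes, but a quick sanity check would be:

sfdisk -l /dev/sda
sfdisk -l /dev/sdb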

Add the partitions back into the RAID Arrays:

mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb3
mdadm --manage /dev/md2 --add /dev/sdb2

cat /proc/mdstat
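
A handier way to follow the resync is to poll mdstat, for example with watch (a side note, not something I bothered with at the time):

watch -n 5 cat /proc/mdstat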

I could see the drive rebuilding. When it finished, I hot swapped out sda and repeated the whole process for it, this time without rebooting, since the system uses hot swap drive bays. It worked fine and I ended up with both drives up and running. In hindsight, I could have done the entire replacement without ever rebooting the machine.
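
For the hot swap itself, the kernel usually has to be told to drop the old disk and then rescan the bus for its replacement. The sysfs interface for that looks roughly like the following; the host0 name is an assumption and depends on the controller:

echo 1 > /sys/block/sda/device/delete
echo "- - -" > /sys/class/scsi_host/host0/scan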

Install GRUB on the new hard drive's MBR:

# grub
grub> find /grub/stage1
(hd0,0)
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> find /grub/stage1
(hd0,0)
(hd1,0)

grub> quit

So now I have the boot loader installed on both drives, and the system can boot from either drive on its own.
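
For what it's worth, the session above is legacy GRUB (0.9x). On a system running GRUB 2 the equivalent step would be roughly:

grub-install /dev/sdb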
