Degraded HD in a RAID 5 Array

I have 3 WD 150GB 10,000rpm Raptors in a RAID 5 configuration. Everything seems to be working fine. On restart, the screen that lists the drives says that the center HD of the array is "degraded" but bootable. If I go to "Manage Disks" in MS Vista, it states that the drive (RAID 5 array) is "healthy".

Can I fix this without reformatting, replacing the drive, etc.? Any help will be appreciated.
Thanks.
 
It appears that there was an error with one of the drives, and now the array is in degraded status. A RAID 5 array will still function with one failed drive, which is why it still boots.

First, you should be glad that you weren't running a RAID 0 array. :) Second, I suggest backing up the data immediately. Since this array apparently has your OS, you probably should make an image using a utility such as Ghost or Acronis. Next, I would download and run the WD drive utility, and have it scan the potentially bad drive for errors. If the WD utility says the drive is bad, replace it.

Finally, after replacing the drive or verifying that it is not bad, try to repair the array. You will do this with the RAID controller utility (some controllers have a Windows utility, others only have a utility accessible by pressing a key during boot time).
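
If you'd rather check from the command line, the smartmontools package (smartctl) reads the same SMART health data the vendor utilities use. A quick sketch - the device name here is only an example, and Windows builds of smartmontools name devices differently:

smartctl -H /dev/sda            # the drive's overall SMART health verdict
smartctl -t long /dev/sda       # start an extended offline self-test
smartctl -l selftest /dev/sda   # read the self-test log once it completes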

Good luck.
 
Thanks for your response to my question. Your answer is clear and I now know what to do. Thanks again.
Allen Penrod
 
The above advice is excellent, especially the part about immediately making a back-up image of your current logical drive(s).

RAID 5 needs to be outlawed for anybody other than data center engineers who can monitor it 24/7. The problem is that RAID 5 is *not* fault tolerant in practice, because if the controller card goes bad you frequently lose data on all 3 drives.

As per the scenario here, you have no way of really knowing if it's the drive going bad or the card going bad. If the drive is physically fine, and the card is fine, then why do we have a degraded drive?

Since you have to fix the array anyway, it would be worth the time to migrate back to RAID 1. This will cost you another HD, but you won't be playing Russian roulette like you are now.
 
Hmmm. If you're recommending buying another Raptor to go RAID 1, then considering that the use of Raptors indicates a desire for speed, you might want to consider RAID 1+0 at the cost of 150GB of storage.

I wouldn't recommend having data stored on the same drive/array as the operating system.

If this array isn't the OS array and is just storage, then RAID 5 is fine (more space-efficient, and a good balance of speed and redundancy), since you should already have regular backups (RAID IS NOT BACKUP!). As evidenced by this case, the RAID 5 has done exactly what it is meant to do - provide continued access despite a failure of one of the array elements. This is the point at which you secure your data (i.e. make sure your backups are up to date and valid) and then replace the failed unit and rebuild the array.

If you're worried about controller failure then you don't have backups, and it's the lack of backups, not possible controller failure, that is the real problem.

FWIW, I have a 1.5TB RAID 5 (4 disk array) for data access in my workstation and it's replicated on another 1.5TB RAID 5 offsite.

--
Farmer
http://www.the-farm.net
 
if you're recommending buying another Raptor to go RAID 1, considering the use of Raptors indicates a desire for speed, might want to consider RAID 1+0 at the cost of 150GB of storage.
If there is a desire for "speed", then get real and move to 15k SCSI, because Raptors have only negligible performance increases over vanilla 7200rpm SATA/IDE drives - at a tremendous cost increase.
As evidenced by this case, the RAID 5 has done exactly what it is meant to do - provide continued access despite a failure of one of the array elements.
One network engineer, a good friend of mine, made the same argument and lost his job last year because his SANs got too warm and cost almost a terabyte of lost data on a weekend in between backups. Then there was the AS400 I saw go down about 18 months ago because the RAID 5 controller went south and the tape backups were not readable, the half dozen Compaqs, the SANs over at Photonet, the 30+ RAID 5 cards I lost when I worked at a regional bank, on and on and on. RAID 5 faults have contributed the biggest amount of lost data I've seen in my career, and yet somebody always insists how well it's working because they haven't been hit yet. On the other hand, I've never seen RAID 1 lose a bit of data.

Given the greatly added cost of onboard processing overhead, the price of a RAID 5 card that is actually worth it, and the negligible expense of 4 drives vs. 3, RAID 5 makes no sense for a desktop user. It's essentially a bomb waiting to go off, just like all these consumer NAS devices also using RAID 5.
 
If there is a desire for "speed", then get real and move to 15k SCSI
because Raptors have only negligible performance increases over
vanilla 7200rpm SATA/IDE drives - at a tremendous cost increase.
Sure, but the user already has raptors. The cost increase of going to a 15k SCSI setup compared to 1 more raptor is obviously considerable. So let's work within the framework of the user in question.
As evidenced by this case, the RAID 5 has done exactly what it is meant to do - provide continued access despite a failure of one of the array elements.
One network engineer, a good friend of mine, made the same argument
and lost his job last year because his SANs got too warm and cost
almost a terabyte of lost data on a weekend in between backups.
And how likely do you think it is, for a home user, to generate that much data in between backups?

You come home from a shoot and download your images and then back them up. Or, if you've been working on processing them once you finish, you back them up.

Totally different environments. If you're generating 1TB of data over a weekend you should at the very least be looking at something like RAID 5+0 (although I'd move completely away from traditional RAID solutions there anyway). Again, not a home user.
Then there was the AS400 I saw go down about 18 months ago because the
RAID 5 controller went south and the tape backups were not readable,
Failure of process to verify the tapes. We've all heard horror stories of backups that weren't useful. That's not a RAID failure issue, that's a backup process issue.
the half dozen Compaqs, the SANs over at Photonet, the 30+ RAID 5
cards I lost when I worked at a regional bank, on and on and on. RAID
5 faults have contributed the biggest amount of lost data I've seen
in my career, and yet somebody always insists how well it's working
because they haven't been hit yet. On the other hand, I've never seen
RAID 1 lose a bit of data.
Short of catastrophic destruction of the devices there's no doubt RAID 1 is more tolerant, but from an economic point of view (bear in mind that these days RAID controllers are often on the main board for negligible cost) RAID 5 is perfectly viable for a photographer so long as they understand that ANY RAID is NOT backup. You may have never lost a bit of data from RAID 1, but if you then decide that it means you don't need real backups you're just begging to get slammed with a catastrophic failure.
Given the greatly added cost of onboard processing overhead, the
Negligible on any modern system that someone here is likely to use for Photoshop processing.
price of a RAID 5 card that is actually worth it, and the negligible
Again, good quality controllers are on the top end motherboards. If you're making a real box for decent processing you should spend the extra $100-$200 on such a motherboard anyway.
expense of 4 drives -vs- 3, RAID 5 makes no sense for a desktop user.
Certainly the cost of drives is minimal these days, and since you can put 4 drives into a RAID 5 array, that extra drive goes towards your storage capacity.
It's essentially a bomb waiting to go off just like all these
consumer NAS devices also using RAID 5.
I think we need to agree to disagree. Your points are all valid, but I think you're missing the market we're talking about here. Backups are far and away more important than RAID. RAID in this market is for the pure convenience of being able to avoid retrieving from backups if there is a failure (you let the RAID rebuild for you). Nothing wrong with RAID 1 at all and I highly recommend it, but RAID 5 is a perfectly legitimate option. BOTH require proper backups (offsite) if you really have any desire to protect your data. I would recommend spending money on offsite storage ahead of RAID any day of the week, and implementing a proper strategy for creating, maintaining and verifying those backups.

--
Farmer
http://www.the-farm.net
 
RAID 5 needs to be outlawed for anybody other than data center
engineers who can monitor it 24/7. The problem is that
RAID 5 is *not* fault tolerant in practice, because if the controller card
goes bad you frequently lose data on all 3 drives.
there goes scott spewing nonsense to us, again.

there is no correlation between the raid level and 'controller' or controllers. example: I run raid5 but my 'controllers' are discrete sata ports on my motherboard. I can mix and match all I want. if a controller goes down I simply choose some other pci or pci-e card, edit a simple config, and I'm back up and running again. you are assuming hardware raid and that it's a single card. that is just not always the case, especially with home raid users that use software raid.

what IS true is that you temporarily lose the cluster if your hardware controller card goes down and you have no shelf spares. but I doubt a bad controller will go trashing all the drives in your cluster. very unlikely for that to happen.
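
a quick sketch of why (assuming linux md raid; the UUID below is made up): the 'simple config' is just /etc/mdadm.conf, which identifies the array by UUID rather than by card or port:

# /etc/mdadm.conf
ARRAY /dev/md0 UUID=3aaa0122:29827cfa:5331ad66:ca767371

mdadm --assemble --scan   # reassembles the array on whatever controller sees the disks
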
As per the scenario here, you have no way of really knowing if it's
the drive going bad or the card going bad. If the drive is
physically fine, and the card is fine, then why do we have a degraded
drive?
nonsense. why is it hard to know the diff? if the card can be accessed (if you can send messages to it for its counters or status) then you CAN know the state of the card. and with SMART, you know whether you're getting SMART data at all, and if you are, it's trustable, so you'll know if it's a bad drive or a bad card.
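
for example, with smartmontools installed (the device name is just an example):

smartctl -H /dev/sda   # the drive's own overall health verdict
smartctl -A /dev/sda   # raw attributes: reallocated sectors, pending sectors, temperature, etc

if the drive answers with sane SMART data, the card isn't mangling the bus, and the attributes tell you whether the drive itself is on the way out.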

--
Bryan (pics only: http://www.flickr.com/photos/linux-works )
(pics and more: http://www.netstuff.org ) ~
 
Update: Yesterday, I backed up the complete PC to an external HD. It took many hours and was still running when I went to bed. During the night, we lost power. On attempting bootup, the HD screen said: disk failure, bootable "NO". I restarted and was able to get to the RAID screen. I chose "reset disks", and then it said "rebootable" and I booted up. The middle disk in my RAID 5 array has failed.

I guess I need to replace the drive, recreate the RAID 5 array (losing all data), reinstall Vista and then restore the complete PC from the external drive. Is this the procedure I should use? I don't think I can get to the Complete PC backup on the external drive without using Vista to get there. Thanks.
 
raid and vista?

sorry - I can't help you at all with that. other than to say, don't do that.

I run an 8 drive software raid system; I've posted about it here a few times.

the neat thing is that I can start with just 3 drives, create a raid pack then add in more drives, LIVE (!), while the system is up and running. in fact, I don't even have to dismount the raid pack at all or even mark it read-only. users have full write privs to a pack that is building. it works!

to do this hot OCE (online capacity expansion) thing, you physically add the drive (in my case, to any spare ide, sata, scsi or even firewire/usb connector), then tell the software about it and start the rebuild. it rebuilds in place. about 8 hours later (for my 3TB system) that new drive is part of the cluster and the rescan is complete.
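
with md, that whole dance is a few commands - a sketch, assuming an ext3 filesystem and example device names:

mdadm --add /dev/md0 /dev/sdi1           # introduce the new disk to the array
mdadm --grow /dev/md0 --raid-devices=9   # reshape from 8 to 9 members, live
cat /proc/mdstat                         # watch the reshape progress
resize2fs /dev/md0                       # grow the filesystem once the reshape is done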

I can then yank a drive out, causing a fault and things still work. insert that drive back again and it rebuilds or rechecks. all live.
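
the same failure drill can be done gracefully from the command line (device name is an example):

mdadm --manage /dev/md0 --fail /dev/sdc1     # mark the member faulty
mdadm --manage /dev/md0 --remove /dev/sdc1   # pull it from the array
mdadm --manage /dev/md0 --add /dev/sdc1      # re-add it, and md rebuilds onto it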

once I screwed up the order of the drives when I mixed up some cables, adding new ones to the pack. it freaked out - then I freaked out - I read up a bit on it and found that I just had to do a non-destructive scan and it 'found' all the right drives in the right order and started the rebuild.
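
that scan works because each member records the array UUID and its own slot number in its superblock (device name is an example):

mdadm --examine /dev/sdc1   # shows the array UUID and which slot this disk thinks it holds
mdadm --assemble --scan     # reassembles by superblock, no matter how the cables are ordered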

I've done some bad things to my system and it's still up, kicking and serving 3TB of files.

gentoo ~ # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/hda1             244M  105M  127M  46% /
udev                   10M  204K  9.9M   2% /dev
/dev/hda3              19G  1.4G   17G   8% /var
/dev/hda4              86G  5.8G   76G   8% /usr
shm                   1.5G     0  1.5G   0% /dev/shm
/dev/md/0             3.2T  3.2T   60G  99% /mnt/raid

only 60G left - guess I'll have to add yet ANOTHER drive, soon. ;)

anyway, dump vista and their unreliable raid nonsense. get linux (free), use software raid5 (an 'md' device), don't bother with LVM (it's not needed for this), and get yourself a motherboard or set of sata controller ports, enough to support the drive count you want. I prefer the intel badaxe2 since it has 8 (!!) sata2/300 ports onboard. that's why I have 8 drives in my array ;)
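
creating one is about three commands - a sketch, assuming 3 partitioned disks with example names (note that --create wipes whatever is on them):

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
mkfs.ext3 /dev/md0         # put a filesystem on the new array
mount /dev/md0 /mnt/raid   # and mount it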

but if I needed more, I'd pop in a $20 sata pci controller and there, I have room for 4 more drives.

raid5 is incredibly reliable IF you run the right software.

I had serious doubts about software raid - I've been using hardware controllers for over a decade now (3ware, adaptec, buslogic/mylex, DPT, IBM) - and only now is software raid really EXCEEDING the performance of the hardware controllers, when run on a dedicated pc with decent sata2 controller ports.

oh, and since I run software raid, I can easily get direct access to the SMART data on each of the drives. if I want to do a quick check of the drive temps:

% hdd_temp
/dev/sda: SAMSUNG HD501LJ: 30 C
/dev/sdb: SAMSUNG HD501LJ: 30 C
/dev/sdc: SAMSUNG HD501LJ: 30 C
/dev/sdd: SAMSUNG HD501LJ: 29 C
/dev/sde: SAMSUNG HD501LJ: 31 C
/dev/sdf: SAMSUNG HD501LJ: 32 C
/dev/sdg: SAMSUNG HD501LJ: 30 C
/dev/sdh: SAMSUNG HD501LJ: 31 C

nice. all seems to be running cool.
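
something like this little wrapper around smartctl does that job (a sketch assuming smartmontools is installed - not necessarily the exact script I run):

#!/bin/bash
# print model and temperature for every sd disk
for d in /dev/sd?; do
    model=$(smartctl -i "$d" | awk -F': *' '/Device Model/ {print $2}')
    temp=$(smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10}')
    echo "$d: $model: $temp C"
done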

check out linux md-raid (software raid). email me if you need pointers or help.

I work at a large computer company that has some very high end enterprise storage solutions - so I'm not easily impressed by consumer computer systems. but this software raid stuff is FINALLY enterprise class. it really is.

--
Bryan (pics only: http://www.flickr.com/photos/linux-works )
(pics and more: http://www.netstuff.org ) ~
 
Replacing the failed drive should not destroy the RAID. The whole point of RAID 5 is that if only 1 element has failed, once you replace it, it will rebuild itself.

So replace the failed drive and the array should rebuild. However, until it has been replaced and rebuilt, you have NO fault tolerance and a further failure will mean data loss.

However, you've done exactly the right thing and backed up your system. That way if there are any problems, you can restore it.

There's been some...interesting...advice given in this thread. None of it's wrong, but I think it misses the mark as to your level of use and technical expertise. I'd like to make some suggestions that I think target you as the end user (rather than various power users, professional IT people, geeks (meant in a nice way) and so forth).

From what I can gather, you're using 3 x raptors in RAID 5 for your operating system, giving you 300GB in volume size. This seems excessive and to no real advantage.

Being able to load those few system elements or initial program loads a tiny bit faster, thanks to the inherent striping of RAID 5, isn't going to be noticeable on top of the Raptors' edge over regular drives anyway. The fault tolerance of RAID 5 can be had with RAID 1 (using just 2 drives). Yes, it means only 150GB on your system partition, but that should be more than enough. You want to keep data on a different drive, and your scratch/swap disk on another again, for real performance.

I'd be very tempted, since you have good backups, to look at using 2 x raptors in RAID 1 for your OS, then use the 3rd raptor as a speedy scratch/swap disk (nothing else on it) and then place your data on another drive or another RAID array.

I might be misreading your system setup, in which case please feel free to correct me.

As I've said before, RAID is not backup. Not even the fancy Linux software RAID that is "enterprise" level. It's JUST redundancy, which is CONVENIENCE. You must have real backups, preferably offsite, if you really want to protect your data.

--
Farmer
http://www.the-farm.net
 
As I've said before, RAID is not backup. Not even the fancy Linux
software RAID that is "enterprise" level. It's JUST redundancy, which
is CONVENIENCE. You must have real backups, preferably offsite,
if you really want to protect your data.
not sure if you're taking a stab at my comment or not, but ...

the linux md raid IS enterprise quality. I work in that field - I kind of know it pretty well. I deal with high end solaris boxes all day, and what I'm seeing from my personal home raid setup is comparable in terms of raw manageability and reliability. even on features it competes well (again, the OCE is pretty high end stuff).

it may be overkill for many home situations, but please don't poke fun at it for serious stuff. it competes well and costs 1/10 as much, if even that.

--
Bryan (pics only: http://www.flickr.com/photos/linux-works )
(pics and more: http://www.netstuff.org ) ~
 
I'm only making the distinction that:

1. The user clearly is a user and not a technician. He can't be faulted for that. I give him a lot of credit for understanding the advantages of redundancy and for having already worked out the need for backup. In my books, he's definitely an above average user. Way above! However, setting up and managing a Linux based software RAID adds several levels of technical competency requirements and complexity. Yes, it's a very good solution but, no, it's not for this user, imho.

2. Most importantly, NO level of redundancy is backup, enterprise level or not (and my use of "enterprise" was generically taking a poke at the 93844099830482 supposed enterprise solutions on the market, most of which are rubbish :-) Not specifically at your solution, which from the sound of it is probably in that general class (which, again, would suggest the need for appropriate technical capacity in support of it, just as with any real enterprise solution).

3. The user needs specific support for his current problem. The answer is: a) make a backup (which he's done); then b) install a new drive to replace the failed element (this he is yet to do, and it's where his main query was - I wanted to make sure he realised that he shouldn't need to lose any data by simply replacing a single failed element; the RAID should rebuild itself, but if it doesn't then he has his backup); and c) since he has backups, he has the option of changing his configuration to something that would offer greater performance and security with very little overhead, effort or increased technical capacity.

4. If he's still stuck, or if what I or anyone else is saying isn't making sense, say so and there are people here who will continue to help :-)

--
Farmer
http://www.the-farm.net
 
Farmer, thanks for your suggestions. You are absolutely right about my level of expertise. I read about "software raid" and didn't understand exactly what it is or how to do it...I don't want to anyway. I built this computer (my first one) and have been very pleased with it - until the disk failure. The disk is still in warranty and I will send it back for a new one. From what you have suggested, I should be able to replace the failed disk and the RAID will rebuild itself...and if it doesn't, I have my backup. I assume that I would have to go into the BIOS and choose to "rebuild" the RAID array. If for some reason I lost the OS in this process (like if the Raptors were formatted), I assume I would have to reinstall Vista in order to get to the backup on the external drive. Any specific suggestions would be appreciated.

If I wanted to do as you suggest - use 1 Raptor for the Vista OS and the other 2 in a RAID 0 or RAID 1 - how would I do that (step by step) from where I am now?
Thank you, Allen Penrod
 
Hi Allen,

Firstly, since I don't know your exact configuration etc., I can only offer a guide; I can't give you precise step-by-step instructions. However, you seem to have a general grasp of what's going on, so I suspect you'll know if you're in over your head and can seek help from a local tech if need be.

Depending on your RAID, you shouldn't need to do anything more than swap out the bad drive. Now, if you're waiting for a warranty replacement, that might take a while, so it depends what you want to do in the meantime (buy a new one now and have a 4th when you get your replacement, or wait, etc.).

You may need to instruct your RAID controller to rebuild, but you will have to refer to the documentation, or simply put in the new drive and see what you're prompted with. Again, because you have a backup you are relatively safe (treat your backup like gold, though, as it's all you have).

Yes, if you wanted to restore from backup you'd first have to reinstall your operating system so you could run the backup software to do the restore.

If you wanted to change your configuration to 1 Raptor for the OS and 2 in RAID for data, you'd first make 100% sure your backup was secure. Then you'd change your RAID settings so that you had 1 drive, not in RAID, as your boot drive, and the other 2 in RAID 1 or RAID 0 (RAID 1 is more useful, really; for most people not moving large and constant amounts of data, the speed benefit isn't worth giving up the convenience of redundancy, except for bragging rights on geek - used as a positive term - boards). You would then install your OS to the boot drive and restore from your backup set, placing data where you wanted.

Of course, the other option is to NOT restore from your backup set, or to do only a partial restore: you install Vista, reinstall all the software you want from scratch, and then just restore actual data such as images, documents and such. That gives you the advantages of a clean install but takes time to redo all your applications and settings.

So it all boils down to how much time you have and what you want your final configuration to be. There's nothing particularly wrong with your current setup, but you would be better off with data separate from the OS, and you'd see some performance benefits. Maintaining that backup set is the most critical thing; before you format any drives, you need to be sure you can access that backup and that the information is valid (you can check that from the backup program).

The easiest step is to just replace the drive; the system should sort you out and you're back where you were. The most complex is to reconfigure. Regardless, treat the backup like gold, make sure it's valid, and if you get in over your head, find a very competent, plain speaking tech in your area and seek assistance.

--
Farmer
http://www.the-farm.net
 
I think I'll just swap out the drive and see what happens. WD is sending a new drive and I can return the defective one later-good customer service!
I'll let you know how this comes out.
Thanks again, Allen Penrod
 
For RAID-5, a major job of a "controller" is to perform XOR parity calculations. Without a dedicated controller, the XOR calculations are performed by the host system, resulting in lower performance. RAID-5 is normally good at read performance, but write performance normally suffers. RAID-1+0 gives much better performance if performance is the goal.
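
To see why XOR is all it takes, here is a toy sketch in plain shell arithmetic (two data blocks and one parity block; the values are arbitrary):

# parity is the XOR of the data blocks
d1=$(( 0xA5 )); d2=$(( 0x3C ))
p=$(( d1 ^ d2 ))
# lose d1: XOR of the survivors rebuilds it, since d2 ^ (d1 ^ d2) = d1
rebuilt=$(( p ^ d2 ))
printf 'parity=%02X rebuilt d1=%02X (was %02X)\n' "$p" "$rebuilt" "$d1"

Every write has to update the parity block as well as the data block, which is where the RAID-5 write penalty comes from.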

GaryM
 
Farmer, I swapped the bad drive in the RAID 5 array with a good drive. Started up, went to the RAID screen in the BIOS, where it said the RAID array would be rebuilt in the OS. Booted up normally and Vista rebuilt the drive automatically. Everything is fine now. The drives are warrantied for five years, so when the next one goes, I'll know what to do. Thanks for your help. Allen Penrod
 
