Problems with SATA RAID5

That’s an interesting problem I have with my shiny new hard drives.

Goal: Create a Linux Software RAID5 with four (4) drives, each 2 TB in size.

Hardware used:
ASUS A8N-SLI
2 GB DRAM 333
AMD64 3500+ (2200 MHz)
System HDD: 400GB Hitachi Deskstar 7K400 series Device Model: HDS724040KLAT80
Four SATA HDDs for the data: Seagate Barracuda LP, Device Model: ST32000542AS

Software used:
Ubuntu Linux 10.10
smartmontools
mdadm
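
For reference, building such an array boils down to a single mdadm call. A minimal sketch, assuming the four data drives show up as /dev/sdb through /dev/sde (the post doesn't name them, so these device names are assumptions):

```shell
# Sketch: assemble four 2 TB drives into a software RAID5 (md0).
# Device names /dev/sd[b-e] are assumptions; check yours with lsblk or dmesg.
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Follow the initial sync, then persist the array definition (Ubuntu path)
cat /proc/mdstat
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
```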

This error occurs:
After repeated access to the SATA drives, especially under high transfer load within the RAID5, or simply when querying them via smartctl or mdadm --detail, (at least) one drive becomes unreachable. /var/log/messages then looks like this:

Feb 4 19:46:18 ron kernel: [ 232.080061] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 4 19:46:18 ron kernel: [ 232.080092] ata1.00: failed command: WRITE DMA
Feb 4 19:46:18 ron kernel: [ 232.080119] ata1.00: cmd ca/00:00:00:10:00/00:00:00:00:00/e0 tag 0 dma 131072 out
Feb 4 19:46:18 ron kernel: [ 232.080122] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 4 19:46:18 ron kernel: [ 232.080160] ata1.00: status: { DRDY }
Feb 4 19:46:18 ron kernel: [ 232.080179] ata1: hard resetting link
Feb 4 19:46:28 ron kernel: [ 242.090048] ata1: softreset failed (1st FIS failed)
Feb 4 19:46:28 ron kernel: [ 242.090075] ata1: hard resetting link
Feb 4 19:46:38 ron kernel: [ 252.100047] ata1: softreset failed (1st FIS failed)
Feb 4 19:46:38 ron kernel: [ 252.100075] ata1: hard resetting link
Feb 4 19:47:13 ron kernel: [ 287.110035] ata1: softreset failed (1st FIS failed)
Feb 4 19:47:13 ron kernel: [ 287.110063] ata1: limiting SATA link speed to 1.5 Gbps
Feb 4 19:47:13 ron kernel: [ 287.110070] ata1: hard resetting link
Feb 4 19:47:18 ron kernel: [ 292.320077] ata1: softreset failed (device not ready)
Feb 4 19:47:18 ron kernel: [ 292.320104] ata1: reset failed, giving up
Feb 4 19:47:18 ron kernel: [ 292.320119] ata1.00: disabled
Feb 4 19:47:18 ron kernel: [ 292.320128] ata1.00: device reported invalid CHS sector 0
Feb 4 19:47:18 ron kernel: [ 292.320148] ata1: EH complete
Feb 4 19:47:18 ron kernel: [ 292.320185] sd 7:0:0:0: [sde] Unhandled error code
Feb 4 19:47:18 ron kernel: [ 292.320190] sd 7:0:0:0: [sde] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 4 19:47:18 ron kernel: [ 292.320197] sd 7:0:0:0: [sde] CDB: Write(10): 2a 00 00 00 10 00 00 01 00 00
Feb 4 19:47:18 ron kernel: [ 292.320214] end_request: I/O error, dev sde, sector 4096
Feb 4 19:47:18 ron kernel: [ 292.320236] md/raid:md0: Disk failure on sde, disabling device.
Feb 4 19:47:18 ron kernel: [ 292.320239] md/raid:md0: Operation continuing on 3 devices.
Feb 4 19:47:18 ron kernel: [ 292.320341] sd 7:0:0:0: [sde] Unhandled error code
Feb 4 19:47:18 ron kernel: [ 292.320346] sd 7:0:0:0: [sde] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 4 19:47:18 ron kernel: [ 292.320352] sd 7:0:0:0: [sde] CDB: Write(10): 2a 00 00 00 11 00 00 01 00 00
Feb 4 19:47:18 ron kernel: [ 292.320367] end_request: I/O error, dev sde, sector 4352
Feb 4 19:47:18 ron kernel: [ 292.320432] sd 7:0:0:0: [sde] Unhandled error code
Feb 4 19:47:18 ron kernel: [ 292.320436] sd 7:0:0:0: [sde] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 4 19:47:18 ron kernel: [ 292.320443] sd 7:0:0:0: [sde] CDB: Write(10): 2a 00 00 00 12 00 00 04 00 00
Feb 4 19:47:18 ron kernel: [ 292.320458] end_request: I/O error, dev sde, sector 4608
Feb 4 19:47:18 ron kernel: [ 292.320571] sd 7:0:0:0: [sde] Unhandled error code
Feb 4 19:47:18 ron kernel: [ 292.320576] sd 7:0:0:0: [sde] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 4 19:47:18 ron kernel: [ 292.320582] sd 7:0:0:0: [sde] CDB: Write(10): 2a 00 00 00 16 00 00 02 00 00
Feb 4 19:47:18 ron kernel: [ 292.320596] end_request: I/O error, dev sde, sector 5632
Feb 4 19:47:18 ron kernel: [ 292.320652] sd 7:0:0:0: [sde] Unhandled error code
Feb 4 19:47:18 ron kernel: [ 292.320656] sd 7:0:0:0: [sde] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 4 19:47:18 ron kernel: [ 292.320663] sd 7:0:0:0: [sde] CDB: Write(10): 2a 00 e8 e0 88 00 00 00 08 00
Feb 4 19:47:18 ron kernel: [ 292.320676] end_request: I/O error, dev sde, sector 3907028992
Feb 4 19:47:18 ron kernel: [ 292.320696] end_request: I/O error, dev sde, sector 3907028992
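
When a member drops out like this, the damage can be inspected from the running system; a short sketch (md0 and sde taken from the log above):

```shell
# After a member drops out, check the array and the suspect drive.
cat /proc/mdstat            # a degraded RAID5 shows a failed slot, e.g. [UUU_]
mdadm --detail /dev/md0     # lists which device is marked faulty
smartctl --health /dev/sde  # SMART verdict, if the drive still answers at all
```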

Similar problems are described on the net, but none of the suggested fixes changed anything. Here’s what I can definitely rule out:

Upgraded the firmware on all drives from CC34 to CC35 (had to "force" this, though). Still the same error.
PSU (power supply): Used a separate 700W power supply just for the drives. The test still failed.
SATA cables: Swapping the cables between the non-failing and the failing devices didn’t change anything. The SAME device failed, even with another cable.
SATA controller: Tried three different controllers; the problem persists:
started with NVIDIA nForce 4 (On-Board the A8N-SLI)
Added DAWICONTROL DC-300e RAID (just using the two SATA-Ports here, not the RAID functionality)
Added ASRock 2-Port SATA3 Controller (PCIx)
Turned off NCQ with

echo 1 > /sys/block/sdc/device/queue_depth

– no effect
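
Applying the same queue_depth trick to all four data drives at once might look like this (a sketch; the sd[b-e] names are assumptions):

```shell
# Force queue depth 1 (disables NCQ) on all four data drives; needs root.
# The sd[b-e] device names are assumptions.
for dev in sdb sdc sdd sde; do
    echo 1 > /sys/block/$dev/device/queue_depth
    cat /sys/block/$dev/device/queue_depth    # verify: should read back 1
done
```
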
Turned off disc cache with

hdparm -W0 /dev/sdc

– no effect
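
hdparm can also query the cache state instead of setting it, which is a quick way to confirm the flag took effect:

```shell
# Query the current write-cache state (no argument to -W = read-only query)
hdparm -W /dev/sdc    # reports the current write-caching state (0 after -W0)
```
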
Using the one disc, which always fails, alone with an ext4 file system and all other drives unplugged: working fine???
Using the one disc, which always fails, alone with an ext4 file system and all other drives plugged in: error. This means it is NOT a problem with the mdadm (RAID) software! (Proof: /proc/mdstat sees no RAID at all.) BUT: sometimes this does NOT trigger an error on any attached drive. It can’t even be provoked by copying 1.7 GB files onto the drive and moving them around.
(Not yet tested): Using all disks in a completely different system, using a Live CD.
Using the same system, but with a Live CD (Ubuntu 10.10). The disc on ata6 fails during use of parted. (But md0 was "magically" already reactivated?) (IDENTIFY DEVICE error)
Using the same system, but with a Live CD (Ubuntu 10.10). This time, the "normal" system disc (the one connected via ATA: HDS724040KLAT80) was removed. Error. So it’s not an issue with the IDE device in the system.
Hammering the one disc, which always fails, with a LOT of smartctl --all commands under the Live CD, without any file system or RAID on it: working fine.
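
Such a hammering run is easy to script; a minimal sketch (the device name and query count are assumptions, override via environment):

```shell
#!/bin/sh
# Poll SMART data in a tight loop to stress the drive and its link.
# DEV is an assumption; point it at the suspect drive.
DEV=${DEV:-/dev/sdc}
N=${N:-500}
i=0
ok=0
while [ "$i" -lt "$N" ]; do
    i=$((i + 1))
    if smartctl --all "$DEV" > /dev/null 2>&1; then
        ok=$((ok + 1))
    fi
done
echo "$ok of $N SMART queries succeeded on $DEV"
```

If the link is flaky, the success count drops below N; on a healthy drive every query should come back.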

Conclusion so far
The "one disc" is broken, or at least not usable under Linux. I replaced it with a different ST32000542AS and rebuilt the RAID5. The error triggered by repeated use of "mdadm --detail" is gone.
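
Swapping a broken member works roughly like this (a sketch; md0 and sde are taken from the log above, adjust to your setup):

```shell
# Kick out the broken member (if md hasn't already), then remove it.
mdadm /dev/md0 --fail /dev/sde
mdadm /dev/md0 --remove /dev/sde
# ...power down, swap the physical drive, boot, then add the replacement;
# md rebuilds the parity automatically.
mdadm /dev/md0 --add /dev/sde
watch cat /proc/mdstat    # follow the rebuild progress
```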

As for the "one disc": I put it into an "IcyBox" eSATA/USB3 enclosure and attached it via eSATA to another machine running Windows. I downloaded SeaTools from Seagate and started the tests. The short test passed, and so did the long test. However, the drive then violently disconnected from Windows on its own, so this disc is no longer reliable. I’ll HAVE to replace it.

Another drive in my RAID configuration showed the same problems, so I replaced it with a Western Digital 2 TB. At the time of writing, the RAID has been stable with the new disc.

There’s only one little thing… The drives only ever showed the error when waking up from or going into standby. When I forced the drives to keep spinning, they all worked great. (Except for the initially described drive above, as that one really does look broken…)
One of the drives is still messing things up: it keeps losing the connection to the controller, but it can be recovered with a "soft reset". That way I’m not losing the disc from the RAID, so no resyncing is necessary. But it still clutters the log files and leaves an uneasy feeling.
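
Should such a briefly-dropped disk ever get kicked from the array, a write-intent bitmap (an mdadm feature not used in the post, mentioned here as a suggestion) limits the resync to the blocks written while it was gone:

```shell
# Add an internal write-intent bitmap: if a member drops out briefly,
# a --re-add only resyncs the regions dirtied in the meantime.
mdadm --grow /dev/md0 --bitmap=internal
# After the kicked drive reappears (assumed here as /dev/sde again):
mdadm /dev/md0 --re-add /dev/sde
```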

Even high-quality SATA II cables didn’t make any difference.
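
Forcing the drives to keep going, as described above, can be done with hdparm; a sketch (device names are assumptions, and not every drive honors -B):

```shell
# Keep the drives from spinning down; needs root.
for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    hdparm -S 0 "$dev"      # standby timeout 0 = never enter standby
    hdparm -B 255 "$dev"    # disable APM, where the drive supports it
done
```

Note that these settings are not persistent across power cycles; they have to be reapplied at boot (e.g. via /etc/hdparm.conf on Ubuntu).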

For the web search engines: it is much easier for other users to find this entry if they have more terms to match, so please ignore the following lines:

kernel: [] ata1: hard resetting link
kernel: [] ata2: hard resetting link
kernel: [] ata3: hard resetting link
kernel: [] ata4: hard resetting link
kernel: [] ata5: hard resetting link
kernel: [] ata6: hard resetting link
kernel: [] ata7: hard resetting link
kernel: [] ata8: hard resetting link
kernel: [] ata9: hard resetting link
kernel: [] ata10: hard resetting link
kernel: [] ata11: hard resetting link
