Weird RAID controller problem - advice appreciated


  • Weird RAID controller problem - advice appreciated

    In my main home PC, I have a 3Ware 9650SE-8LPML, to which are attached four 6TB WD Red drives.

    About two weeks ago, it decided that drive 3 had a problem, and that it had to rebuild the RAID. I've left the computer running ever since, and it has never gotten beyond around an 8% or 9% rebuild, before going back to 0% and starting over.

    RAID1.PNG
    According to the SMART data reported by the controller, there is nothing wrong with the drive itself:

    RAID2.PNG
    I wondered if something Windows was doing was preventing the rebuild from completing, so I left the PC for a day on the GRUB bootloader selection screen, which appears after the RAID card has booted. (This PC dual boots into W10 and Ubuntu, using partitions on an SSD that is totally separate from the RAID controller and hooked directly to the computer's motherboard.) But when I came back, it was the same story - rebuilding at 2%, and then back to zero a few hours later. A 6TB drive should take about 1.5 to 2 days to rebuild completely, so something is wrong.

    The problem has to be in either the RAID controller, the mini-SAS to SATA breakout cable, or the drive itself: logically, I can't think of any other possible cause. The card's diagnostics are claiming that the drive is OK; and given that it's two years and 11,000 hours old, and has been properly ventilated throughout those hours, it certainly should be. Just wondered if anyone else has come across this, and what the cause was.

  • #2
    My inclination would be to replace that drive. SMART isn't necessarily the final word as to the actual condition of a drive and since it did report a problem earlier, that should be telling you something.


    • #3
      If I were in the same position I would replace the degraded drive ASAP. If another drive goes bad all data will be lost. You can get a 6 TB 5400 RPM drive for under $200.


      • #4
        I'm inclined to agree. However, everything on that array that I can't afford to lose is backed up elsewhere, and as mini-SAS to SATA breakout cables only cost around $15 (and it would be useful to have a spare in stock anyway), I think I'm going to try replacing that, and reseating the card, before admitting defeat. But if I get the same behavior after doing that, then I agree completely that it's time to replace the drive.


        • #5
          You could try swapping the drives on the end of the breakout cable. Most controllers won't care which drive is connected to which connector. That would at least narrow it down as to whether or not you have a bad breakout cable if a different drive connected to the same connector also registers as "failed."

          There is an off chance that you have a bad power supply or power connector, too.

          In any case, something about the rebuild is failing and causing it to reset from <some percentage completed> to 0%. I'm generally with the others, though, that you have a bad drive.
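
The swap test above boils down to a simple decision rule, sketched here. The function name and port numbering are made up for illustration, and it assumes only the data connectors on the breakout cable are swapped while each drive stays in its bay on its own power lead:

```python
from typing import Optional


def diagnose_swap(failed_port_before: int,
                  suspect_moved_to: int,
                  failed_port_after: Optional[int]) -> str:
    """Interpret the result of moving the suspect drive to another
    breakout-cable connector (and moving a healthy drive onto the old one)."""
    if failed_port_after is None:
        # Nothing is flagged any more: reseating alone may have cured it.
        return "no fault: possibly a marginal connection that reseating fixed"
    if failed_port_after == suspect_moved_to:
        # The fault travelled with the drive (or with its power feed).
        return "fault follows the drive: suspect the drive or its power connector"
    if failed_port_after == failed_port_before:
        # A different, healthy drive now fails on the same connector.
        return "fault stays on the connector: suspect the cable or controller port"
    return "inconclusive: a third port is failing, look at the controller or PSU"


# Suspect drive was on port 3, moved to port 1, and port 1 now shows the failure.
print(diagnose_swap(failed_port_before=3, suspect_moved_to=1, failed_port_after=1))
```

Note that "follows the drive" cannot distinguish the drive itself from the power lead feeding it, since the power connector stays with the bay.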


          • #6
            I would never use drives that big... ever! With RAID you can use multiple smaller drives in a different configuration. That's just too much data to lose. I would also use IT grade SATA drives. Among the 50 or so TMSs that I built, I have lost only half a dozen drives. The OS runs on a RAID 1 SAS pair, and content is on four 2 TB 7200 RPM IT grade drives. Some of those servers (Dell 2950s) are now 8 years old. I have one site running an external RAID array in RAID 10.
            Last edited by Mark Gulbrandsen; 01-27-2020, 11:15 AM.


            • #7
              This is a personal computer, not a server or TMS, and I'm limited in the number of 5.25" drive slots in the case I have to play with, hence using larger drives. As it is not powered up 24/7 (probably around 20-25 hours a week), I didn't think that going to the expense of top-of-the-line enterprise grade drives, which cost roughly double consumer NAS-branded ones, was necessary.

              I'd be surprised if it's a bad power supply - the PSU in question is 1kW, less than two years old, and none of the other symptoms you'd expect to accompany a PSU going out (random BSODs, powered USB devices refusing to light up, etc.) are happening.

              Will try swapping ends of breakout cables - thanks.


              • #8
                Hey, it's your data! I run all IT grade drives at home in the two HP workstations I use here. The OS and data drives are separate in both computers. You don't need gigantic drives for the OS!!

                Mark


                • #9
                  I'm kind of with Mark on this. My bigger concern would be that a 4-drive RAID 5 made up of 6TB disks is probably not a great idea. The issue is that the amount of data that has to be read to rebuild the array approaches, or exceeds, the amount the disks can statistically be expected to read before hitting an unrecoverable read error. Enterprise-grade disks tend to be specified with lower error rates, which reduces the problem. They are often not much more expensive, either. Hitachi Ultrastars, Seagate Exos or Cheetahs, and whatever the Western Digital equivalents are can often be found for similar prices.
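
To put rough numbers on that concern, here is a back-of-envelope sketch, not a prediction for any particular drive. The error rates of 1 in 10^14 and 1 in 10^15 bits are assumed typical datasheet figures for consumer and enterprise drives (they are not from this thread), the function name is made up for illustration, and treating bit errors as independent is a simplification; real drives often do better than their datasheet worst case.

```python
import math


def p_unrecoverable_read(bytes_read: float, ure_per_bit: float) -> float:
    """Probability of at least one unrecoverable read error (URE) while
    reading `bytes_read` bytes, assuming independent errors per bit."""
    bits = bytes_read * 8
    # 1 - (1 - p)**bits, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


TB = 1e12                      # decimal terabyte, as drive vendors count
surviving_drives = 3           # a 4-drive RAID 5 rebuild reads the other three
data_read = surviving_drives * 6 * TB

consumer = p_unrecoverable_read(data_read, 1e-14)    # assumed consumer spec
enterprise = p_unrecoverable_read(data_read, 1e-15)  # assumed enterprise spec

print(f"consumer-class spec:   {consumer:.0%}")
print(f"enterprise-class spec: {enterprise:.0%}")
```

Under those assumptions the consumer figure comes out at roughly a three-in-four chance of hitting at least one unreadable sector during a full rebuild, against around 13% for the enterprise figure.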

                  But not everyone is concerned with reliability or availability. Some just want the storage capacity or read performance. And that is fine.


                  • #10
                    I hear you both, and in principle, I agree. However, in this case, the array is not for long-term storage. It is mainly to hold large MXF framesets for DCP rendering - for a 4K feature, this can easily run to 10TB per movie. The rendering output goes onto another array, using the other four slots on that card (4 x 2TB drives). The bottom line is that I need that much storage capacity (16TB), and I only have four physical 3.5" bays available for it in the computer. Other than buying a bigger PC case (an option) and then spending 10-20 hours pulling all the components out of the old case, reinstalling them in the new one, reconfiguring the arrays, etc. (currently not an option - I simply don't have the time, and likely won't until the summer), using very large drives is the only practical option available to me.

                    Haven't gotten around to playing with the cables yet, but will post an update after I have. Thanks everyone, for suggestions.


                    • #11
                      Like others pointed out: RAID5 with 6TB drives is considered bad business.
                      Realistically, you should consider a RAID6 or at least a RAID10 with such big drives, although I personally would always prefer the RAID6. If performance is important, then use SSDs. Now, you indicated that your configuration is a well-educated trade-off. So, this rant is just to discourage others from building RAID5 or similar low-tolerance RAID arrays with such big drives for production systems. Doing so sets you up for an almost-sure-to-happen disaster somewhere in the (near?) future.
                      Now, if the data on the array isn't worth anything, have you tried to delete the array and recreate it? :P
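
For anyone weighing that trade-off, a quick sketch of usable capacity per RAID level for the four-bay, 6TB case discussed in this thread. These are the standard textbook formulas; the function name is made up for illustration, the 16TB requirement is taken from post #10, and filesystem overhead and TB-versus-TiB differences are ignored.

```python
def usable_tb(level: str, drives: int, size_tb: float) -> float:
    """Usable capacity of an array of identical drives, in TB."""
    if level == "raid5":
        return (drives - 1) * size_tb   # one drive's worth of parity
    if level == "raid6":
        return (drives - 2) * size_tb   # two drives' worth of parity
    if level == "raid10":
        return (drives // 2) * size_tb  # every drive is mirrored
    raise ValueError(f"unsupported RAID level: {level}")


required_tb = 16  # capacity the original poster says he needs
for level in ("raid5", "raid6", "raid10"):
    capacity = usable_tb(level, drives=4, size_tb=6)
    verdict = "enough" if capacity >= required_tb else "too small"
    print(f"{level}: {capacity:.0f} TB usable ({verdict})")
```

With only four bays, RAID6 and RAID10 both top out at 12TB on 6TB drives, so meeting a 16TB requirement with either would take larger drives or more bays.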

                      Realistically, from my experience with hundreds of RAID systems, what you're seeing is most likely a problem with the drive. While the problem could be in the RAID controller (still somewhat likely) or in the cabling (I very much doubt it), I've seen this exact problem before. Your drive probably has a failure that's not indicated by SMART, most likely a bad drive controller, which isn't something SMART is very useful for. Chances are, your drive ejects itself from the RAID from time to time. Since there wasn't an explicit failure condition that the RAID controller detects as a bad drive, it assumes the drive has just been disconnected for a while and re-inserts it into the RAID, triggering a rebuild. This process may go on forever.

                      Letting this run in an endless loop is pretty bad for the remaining drives, by the way, as the extra load on them will only shorten their lifetime and bring a possible failure of one of them closer.
                      Last edited by Marcel Birgelen; 02-01-2020, 12:43 AM. Reason: The Truth is Out There(tm)


                      • #12
                        It turned out to be a bad SATA power connector. Not much clearance between the drive and the side of the case, so it makes a tight turn. Noticed that the 12v - wire had what looked like a slight kink in it. After that was fixed (by adding a right-angle adapter), the array rebuilt in about 15 hours, and has been showing as OK ever since. That was about two weeks ago.
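
For comparison with the 1.5 to 2 day estimate in the first post, the sustained rate each rebuild time implies is just capacity divided by time. A quick sketch, assuming decimal terabytes and a rebuild that touches the whole 6TB drive (the function name is only for illustration):

```python
def rebuild_rate_mb_s(capacity_tb: float, hours: float) -> float:
    """Average write rate in MB/s needed to fill `capacity_tb` in `hours`."""
    bytes_total = capacity_tb * 1e12
    seconds = hours * 3600
    return bytes_total / seconds / 1e6


for label, hours in (("15 hours (actual)", 15),
                     ("1.5 days (estimate)", 36),
                     ("2 days (estimate)", 48)):
    print(f"{label}: {rebuild_rate_mb_s(6, hours):.1f} MB/s")
```

The 15-hour rebuild works out to an average of about 111 MB/s, while the original estimate corresponds to roughly 35 to 46 MB/s.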


                        • #13
                          Yay! Glad to hear that it was an easy fix.
