  • Reinstall RAID to degraded

    Hi all,


    I'm trying to reinstall a 4-HDD RAID, but something is going wrong. After rebuilding /dev/md0 (/data, healthy state), the process stops and /dev/md1 (/opt) is degraded.

    I have tried reinstalling several times, and every time I end up in the same situation.


    When I use “Analyze RAID performance”, it says: RAID performance is good but status is degraded.


    It looks like I can ingest and play films without any issues.


    How can I solve this problem?

    Screenshots are attached.


    Thank you
    Attached Files

  • #2
    Can you go to the Command Line Interface (CLI) and type the following?

    cat /proc/mdstat

    Take a photo, or if you're connected via SSH, copy/paste the output here. This should indicate which hard drive is missing from your md1 RAID. That's the disk you should replace.
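
    For illustration only (the device names and block counts here are made up, not taken from your server), a degraded md1 in /proc/mdstat looks something like this:

    md1 : active raid5 sdd1[3] sdc1[2] sdb1[1]
          62906368 blocks level 5, 64k chunk, algorithm 2 [4/3] [_UUU]

    The [4/3] and [_UUU] show that only three of the four members are active; the position of the underscore tells you which slot dropped out, and a member that failed but is still listed shows up with (F) after its name.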

    • #3
      Here it is.
      Attached Files

      • #4
        Looks like disk "sda", which should be the first disk, has failed and should be replaced by a new one supported by Doremi/Dolby and of at least equal size. Once you've done that, you should be able to rebuild your RAID without degradation.

        Depending on your usage pattern, it's best to replace all your disks every 2 to 5 years.

        • #5
          How old are these drives? Are they all the same? If they are older, I would go with Marcel's suggestion and replace the drive, or all of them. As both md0 and md1 are on the same drive set, it is strange that only md1 can not be repaired. Personally, I would be curious enough to swap the drives around, re-init, and then see what happens.

          • #6
            Those are clearly the original drives from Doremi and likely to be about 10 years old. It's amazing they ran this long. You need to replace all 4 drives at this point. You will lose your content on the server, but you will be starting clean.

            This sequence will permit you to change between a 3- and 4-drive RAID array and will reinitialize the drives into a new RAID.

            MENU / SYSTEM / TERMINAL
            Type: su <ENTER>
            Type: <ENTER> (get this password from Doremi)
            Type: sh /doremi/sbin/reinit_raid.sh <ENTER>
            After a 10 second countdown, enter raid size “4” (assuming a ShowVault or DCP2K4 w/4 drives).
            The script will check disk status, size coherency, shut down services, etc. Let it do its thing until you reach “raid re-initialized”.
            Type: more /proc/mdstat <ENTER>
            Verify md1 = active raid 5 (see the example output below)
            Verify md0 = active raid 5

            MENU / DOREMI APPS / DIAGNOSTIC TOOL / Storage folder
            Verify all 4 hard drives are green.
            At the top of the screen, “/data” and “/opt” will typically be orange. After several minutes a popup will appear: “The RAID partition is now healthy”.
            You may now re-ingest content, but BEWARE THAT ENCRYPTED CONTENT WILL NOT PLAY UNTIL YOU HAVE REBOOTED THE SERVER ONCE!
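
            For reference (a sketch with made-up device names and sizes, not actual output from your server), a healthy 4-drive array in /proc/mdstat shows all members present, roughly:

            md0 : active raid5 sda1[0] sdb1[1] sdc1[2] sdd1[3]
                  62906368 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

            [4/4] and [UUUU] mean all four drives are active; something like [4/3] or [_UUU] means the array is still degraded.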



            • #7
              Originally posted by Brad Miller
              Those are clearly the original drives from Doremi and likely to be about 10 years old. It's amazing they ran this long.
              In March 2020 I was called to a breakdown at a VIP's home theater: a DSS100, bought and installed in 2007, had started beeping and flashing error messages, because one of the exhaust fans on the back of the chassis had become so gunked up with dust and crud that it had actually stopped rotating. After blasting it with a Datavac, it started to spin again, and the server reported that it was perfectly OK.

              All four of the original RAID drives, 400GB units manufactured in 2006, were good. I took a log file and put it through the Dolby analyzer website, which said that the SMART data on all the drives indicated that they were perfectly OK. Not even any reallocated sectors. They had spun for 104,000 hours!

              I took the server out of the rack, gave it a thorough internal clean, a new CMOS battery, updated it and the DSP100's software and firmware to current (after having checked that the NEC iS8-2K projector's TLS certificate was OK), and advised the owner's house manager that she might like to think about a wholesale upgrade, as all of the cinema equipment was 13 years old and no longer supported by the manufacturer. I also strongly advised new drives as a stopgap measure if she planned to keep that server in service. The upgrade (to an NC1201L with an IMS3000 in it) eventually happened in January this year, but the DSS100 continued in service happily for the intervening period, without new drives.

              To say that she got her money's worth out of those drives would be an understatement.

              • #8
                Hard drives can get remarkably old depending on usage pattern; since they're sealed units, they don't collect any dust or gunk on their vital moving parts the way fans do. That being said, a drive that has been in constant use for 10 years is a pretty big liability...

                Hazard-rate-pattern-for-hard-disk-drives-as-a-function-of-operation-time.png

                I think this graph captures the failure rate over time of your average hard drive pretty well, although the infant mortality shown here is somewhat exaggerated. Obviously, you need to compress the graph a bit for drives that are under heavy stress.

                • #9
                  I don't know exactly, but they are older than 6 years.
                  I will take your advice and keep them as backup storage.

                  I changed the order of the HDDs but got the same result.

                  Because of the difference between the statuses of md1 and md0, I decided to format the hard drives on a PC and try to reinstall the RAID again.

                  Now it looks good because md1 is healthy and md0 is rebuilding.

                  I will update after the rebuild. I hope to come back with good news!

                  ** I formatted all the HDDs to NTFS with a Master Boot Record.

                  Thank you all
                  Attached Files

                  • #10
                    Not much point in doing an NTFS format, but at least it's pretty quick.
                    I wipe the MBR either with zap.exe or using dd when trying to reuse a drive that doesn't want to cooperate.
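                    For illustration, the dd route amounts to zeroing the first sector, something like this on a Linux machine (a sketch only; replace /dev/sdX with the actual target drive and triple-check the device name, since dd will overwrite whatever it is pointed at):

                    dd if=/dev/zero of=/dev/sdX bs=512 count=1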
                    Your failing drive should be replaced, not fiddled with and reinitialized. All four replaced... not necessarily. Yes, failures get more likely with advanced age, but there's no definite drop-dead time.
                    With the bad sda, your RAID is in a vulnerable state: after "revival" it will probably die again soon, and then another drive failure will kill the array. Replace sda now and it will be safer; when the next one dies, the show can still go on. It's not likely that two will go at the same time.
                    The advantage of replacing all drives now: 2TB is the smallest enterprise drive easily found, so server storage capacity will increase dramatically if all drives are replaced and the array reinitialized. If you originally got the 4 drives for added storage, just get 3 new 2TB drives and you'll have more storage than with four 1TB ones. Maybe you even have four 400GB or 500GB drives. RAID 5 is equally reliable in 3- or 4-drive arrays.
                    Replacing one drive will not change the RAID capacity: putting a new 2TB drive into an array of 500GB drives, for example, can only use 500GB of the new drive - the rest is wasted and never used. One can upgrade a RAID by replacing all the drives one at a time using repair, but that will not change capacity; the RAID must be completely reinitialized. You can do that through the Diagnostic Tool unless you change the drive count, which needs the terminal commands.
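                    As a rough capacity check (assuming the usual RAID 5 layout, where one drive's worth of space goes to parity): usable space ≈ (number of drives - 1) × size of the smallest drive. Four 1TB drives give about 3TB usable, three 2TB drives about 4TB, and four 2TB drives about 6TB.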

                    • #11
                      I don't think these drives are that old, simply because they are 2TB drives, WD RED (edit: mark BS), and few Doremis came with a set of four 2TB drives. Also, this type was never listed on Doremi's or Dolby's approved drive list (edit: mark BS). So, they most certainly have been upgraded later.

                      When the rebuilding is done, Shai could just shut down the server, take out each drive, check its type and make a note of its manufacturing date, which usually can be read even with the drive in its sled. What we haven't seen for sure so far is whether these four drives are all the same. I have never heard of a Doremi that was shipped with a 4-drive RAID, so, again, it is very likely that at least one drive has been retrofitted later, and it may not necessarily be the same as the other three. And it may be THAT drive that caused the issue.
                      I know of one site that has been operating a Doremi with 3 original RAID-spec'd drives + one consumer-grade drive for many years now without a single hiccup.


                      If someone with knowledge about these things is on site, and Shai has demonstrated he does have some skills, I guess there is no need to replace all drives preemptively just because there was a hiccup in a special situation. However, it would be wise to buy and store at least a single replacement drive on site, and to keep a close eye on that server.


                      From what I see on the terminal window screenshot above, it is more likely that there was a hiccup while the init script was setting up the RAID properties, as md1 has marked one drive as bad and one as a spare, and is using a different RAID scheme than md0, which uses the same drive set. That's just not normal behaviour on a Doremi.
                      Last edited by Carsten Kurz; 05-29-2021, 08:24 AM.

                      • #12
                        Originally posted by Carsten Kurz
                        I don't think these drives are that old, simply because they are 2TB drives, WD RED, and few Doremis came with a set of four 2TB drives. Also, this type was never listed on Doremi's or Dolby's approved drive list. So, they most certainly have been upgraded later.

                        When the rebuilding is done, Shai could just shut down the server, take out each drive and make a note of its manufacturing date, which usually can be read even with the drive in its sled. What we haven't seen for sure so far is whether these four drives are all the same. I have never heard of a Doremi that was shipped with a 4-drive RAID, so, again, it is very likely that at least one drive has been retrofitted later, and it may not necessarily be the same as the other three. And it may be THAT drive that caused the issue.
                        I know of one site that has been operating a Doremi with 3 original RAID-spec'd drives + one consumer-grade drive for many years now without a single hiccup.


                        If someone with knowledge about these things is on site, and Shai has demonstrated he does have some skills, I guess there is no need to replace all drives preemptively just because there was a hiccup. However, it would be wise to buy and store at least a single replacement drive on site, and to keep a close eye on that server.
                        Carsten, they are actually enterprise-grade WD Black drives, and they have the MFG month and year stamped right on them. I have four of those in my NAS that are from Aug 2017... I can't remember the MTBF on these, but it is much higher than on standard drives.

                        • #13
                          Oops, sorry, you are absolutely right, it is a WD Black, and it WAS indeed on Doremi's approved drive list, but it is obsolete now and no longer listed as an active replacement drive.
                          Last edited by Carsten Kurz; 05-29-2021, 08:20 AM.

                          • #14
                            A good way to find out exactly how many hours a drive has done, and how far worn out it is, is to look at it using CrystalDiskInfo. To make the procedure quick and easy, you'll need a SATA-to-USB adapter of some description; or, what I do, which is to have an internal CRU reader in my PC with hot-swap support on the motherboard and an open cartridge that I can slot a drive into quickly and easily. This program will give you the spin-up and hour count of a drive (and, for an SSD, the total amount of data written, which is a more reliable indication of useful life remaining), together with the major SMART parameters and a summary green/amber/red indication of the drive's overall condition.

                            Even if the drive is partitioned and formatted in a way that Windows can't make head or tail of, CrystalDiskInfo will still work on it. Example of output:

                            CrystalDiskInfo_SampleOutput.PNG
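
                            If the drive is sitting in a Linux machine with smartmontools installed, the same figures (power-on hours, reallocated sectors and so on) can also be read in place, along the lines of the following (a sketch; /dev/sda is just a placeholder for whichever drive you want to query):

                            smartctl -a /dev/sda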
                            Last edited by Leo Enticknap; 05-29-2021, 11:59 AM.

                            • #15
                              Originally posted by Shai Skiff
                              I don't know exactly, but they are older than 6 years.
                              I will take your advice and keep them as backup storage.

                              I changed the order of the HDDs but got the same result.

                              Because of the difference between the statuses of md1 and md0, I decided to format the hard drives on a PC and try to reinstall the RAID again.

                              Now it looks good because md1 is healthy and md0 is rebuilding.

                              I will update after the rebuild. I hope to come back with good news!

                              ** I formatted all the HDDs to NTFS with a Master Boot Record.

                              Thank you all
                              Formatting the hard drives on a PC makes no sense at all, unless you zero the entire drive, because then at least the entire drive has been checked for problems. A normal format usually only initializes the start and the end of the drive and does nothing to check whether the drive is working correctly.
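
                              If you do want that full check, one rough way on a Linux machine is to write zeroes across the whole drive and watch for write errors (a sketch only; /dev/sdX is a placeholder for the actual drive, so double-check the device name before running it):

                              dd if=/dev/zero of=/dev/sdX bs=1M status=progress

                              Any I/O error reported during that run is a strong hint the drive should be retired.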

                              Instead of fiddling around with those drives, do yourself a favor and simply replace the drive that was indicated as failed, and consider replacing all four of them if they're indeed more than 6 years old. Swapping them around will usually also achieve nothing, especially since we're not looking at a defective backplane port here.

                              The fact that just one of the RAID arrays fails is simply because it happens: disks often fail progressively. Problems start small, but usually turn ugly pretty fast. If the disk fails to answer a read or write request issued by one array, it is that array that ejects the disk, while the other array on the same drives can still look healthy.

                              Even if the RAID array comes back online for now, expect it to fail again at any time.

                              Also, don't use previously failed drives as backup, unless you want to throw away the data you're storing on them anyway. Using broken and worn-out drives as backup is just another way of saying *hahaha, fuck you* to your data.

                              By the way, in your screenshot, the sdd drive seems to have a yellow alert.
