
Welcome to GotWake.Com

Whether you're a wakeboarder, a geocacher, a snowboarder, my mom, or all or none of the above - hopefully you find something of interest while you're here.

To log a Gotwake signature item, please visit the Tracking Application.

News

Geekin' Out
Posted By: Billy on Wednesday, January 28, 2009 - 01:20 PM
Linux
So - as if I wasn't having enough fun already, with the events and experiences of the past week... last night, my NAS (Network Attached Storage) server decided to fail. Basically just a bunch of large disks in a box with some software, this device holds (I thought, QUITE reliably) every single digital picture, MP3, video editing project, and full backup data for all my other systems. It's a RAID5 array, meaning - I should be able to lose an entire disk without any data loss.
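For anyone wondering why RAID5 can survive losing a disk: the array stores an XOR parity block alongside the data, so any single missing piece can be recomputed from the rest. Here is a toy sketch of that idea with one byte per "disk". The byte values and the `xor_all` helper are made up for illustration; a real array does this per stripe, not per byte.

```shell
# Toy RAID5-style parity: parity is the XOR of the data on the other disks.

# XOR all arguments together and print the result (in decimal).
xor_all() {
  local acc=0 v
  for v in "$@"; do
    acc=$(( acc ^ $v ))
  done
  echo "$acc"
}

d1=0x4A; d2=0x1F; d3=0xC3            # data bytes on three disks (hypothetical)
parity=$(xor_all "$d1" "$d2" "$d3")  # what the fourth disk would hold

# Pretend disk 2 died: rebuild its byte from the survivors plus parity.
rebuilt=$(xor_all "$d1" "$d3" "$parity")

printf 'parity=0x%02X rebuilt_d2=0x%02X\n' "$parity" "$rebuilt"
# prints: parity=0x96 rebuilt_d2=0x1F
```

The rebuilt value matches the original `d2`, which is the whole trick. Lose two disks at once, though, and there is no longer enough information to recover either one.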

I detected the problem as I was working heavily with the box: Managing digital pictures, searching for images for both a wedding slide show and a Panza gallery, and re-indexing our extensive digital music collection, when suddenly... it was just gone. I then noticed a 'beeping' sound I'm unfamiliar with (rare, indeed ;-) coming from the office. Yes, it was my Buffalo Systems 2TB [TeraByte] (that's two thousand Gigs!) Terastation Live - and it was unhappy.

The good news is - countless hours later, there was finally a happy ending. It was certainly an arduous journey, not for the faint of heart. Read below the cut for all the gory details:
It started out innocuous enough: This wasn't actually the first time I'd seen the "System Error 04: Can't Load Kernl" error displayed. It seemed that a reboot had resolved it once - and one other time, I think I had to re-flash the firmware. So, I dusted off those links, refreshed my memory and got up to speed on the latest chat in the forums. Sheesh - seems like there are some significant issues with these products - but then, I guess you see that in every support forum. So, I'd already tried rebooting - no dice. Three times in a row, I kept getting the same error.

The first step seemed simple enough: Re-flash the firmware. After making sure I had the latest version (no updates in over a year - what gives?) - I ran the update. All from my laptop, on the wireless network -- didn't even have to leave the couch and go into the office (yet). The updater found the NAS box, and proceeded to upload the firmware. Next, the system rebooted, and loaded the 'new' firmware. So far, so good. One more reboot, and...

Things went from bad to worse. Now, rather than System Error 04 - I was getting something much worse. The device was booting in "Emergency Mode" - and, upon completion of startup, reported:

No Array Info

Gulp.

Deep breath. I tried flashing the firmware a couple more times - what harm is there in that? No change in results. If only there was some way to get more insight into what the heck is happening in that black box. C'mon, it's running Linux, right?? A few google searches later - and I found acp_commander - my new best friend. A simple java commandline later:

java -jar acp_commander.jar -t GWNAS -o

... and I was in business. Now, I could telnet into the box (I'll worry about migrating that to SSH after I get things running!), and poke around and see what was really happening - or so I thought. I did gain some insight - but realized that there's a whole layer of complexity, when you get right down to the physical devices in a RAID array. So, I decided to follow some of the other possible solutions described to have been successful in the forums: Removing each drive, one at a time, to see if the Array would "come back" - in which case, it might be a simple physical drive failure - even though the system isn't reporting one. Four iterations later - rule out that solution.
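For the "poke around" step, one place to look on a Linux software RAID box is /proc/mdstat, which lists each md array and a status string like [UUUU] (an underscore marks a missing member). The sketch below assumes the NAS uses Linux md and that awk is available on it; the post doesn't say either way, and the `degraded_arrays` function name is just for illustration.

```shell
# Print the name of every md array whose status string shows a missing member.
# Reads /proc/mdstat-formatted text on stdin.
degraded_arrays() {
  awk '
    /^md[0-9]+ :/ { name = $1 }        # array header line, e.g. "md2 : active raid5 ..."
    /\[[U_]+\]/ && name != "" {        # status line ending in e.g. [UU_U]
      if ($NF ~ /_/) print name        # "_" means a member is missing
      name = ""
    }
  '
}
```

On the box itself this would be run as `degraded_arrays < /proc/mdstat`. No output means every array reports all of its members present, which is not the same thing as the data being healthy.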

I really wasn't looking forward to it - but all threads were starting to point towards the need to actually manipulate each drive while attached to another system - which should then 'force' a rebuild. Sounded a bit sketchy - and they strongly recommended "backing up before you begin". How do you back up your "backup" - over 500GB of data? That's a LOT of DVDs... Of course, in an operational world, I could have just hooked up a fat external USB drive, and pulled off as much of the data as I had room for or interest in - but, that's only with a working array.

So that's what I decided to do. One by one, I had to pull the individual (500GB) drives, and plug them into the only other PC I have in the house which has a SATA (Serial ATA) controller: the server hosting this site, my email, etc. For any of you that noticed we were off the air last night (HA! As if... ;-) - now you know why. Alas - I only had a single SATA controller/cable in the server, so - I had to disconnect the internal disk, connect the RAID disk - and then boot from a Knoppix CD (http://www.knoppix.net/). I had also attached a large external USB drive - hoping to leverage that to take any 'backup' images before proceeding. I knew the step I was attempting to take wasn't supposed to impact the actual user data on the device - but rather, the "system" partition and some basic config. [...] On the server, I could see the NAS drive, but not the USB drive [...] which I eventually was able to see, sort of - though the contents were jumbled - as if it didn't know how to interpret the file system. WEIRD. So, after much ado... I switched gears, and tried using a small (5GB) inline-powered USB drive. This actually worked - and I started my first backup command... only to realize I was out of space.

Back to the laptop for a moment, to clean a bunch of garbage off the drive. There - 3GB, that should be plenty - since I'm only expecting to have to back up ~400MB from each of four disks. So, back to the server, with both SATA and USB drives attached - I issued the following command. I always like "dd" - it seems so... 'raw' :)

# dd if=/dev/hda1 of=/media/sda1/image/hd1
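Since the next step destroys the original, it can be worth confirming the image really is a byte-for-byte copy before moving on. A minimal sketch, assuming GNU dd (for `bs=1M`) and the standard `cmp` tool; the `image_and_verify` name is made up for this example, and the post doesn't mention whether a check like this was done.

```shell
# Copy SOURCE (a partition or any file) to IMAGE, then confirm the two match.
# Usage: image_and_verify SOURCE IMAGE
# Returns 0 only if the copy succeeded and is byte-identical to the source.
image_and_verify() {
  local src=$1 img=$2
  dd if="$src" of="$img" bs=1M 2>/dev/null || return 1   # the raw copy
  sync                                                   # flush to the USB drive
  cmp -s "$src" "$img"                                   # silent byte comparison
}
```

With the paths from the command above, that would be `image_and_verify /dev/hda1 /media/sda1/image/hd1`, only moving on to the wipe when it returns success.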

A couple minutes later - it was finished. Now all I had to do was:

# rm -rf *

Of course - that's a pretty scary one. Gotta be pretty confident about 'where you are' when you go running that. And I was. Some time later - I'd completed the process on two of the four disks, and was disconnecting the NAS drive, when... SNAP! A few too many foot/pounds per square inch, and I've broken the SATA connector on the controller cable. DOH!! So, yet another twist and turn down the rabbit hole: Now I was mixing up some epoxy, and conducting 2M (Navy jargon for "Micro/Miniature" - special electronics training for working on 'small stuff') repairs to this stupid piece of plastic. Now then: Do I have the patience to wait "2-24 hours for curing", while I'm mid-recovery, and wondering if all is lost?
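On the subject of being confident about 'where you are' before an rm -rf *: one way to take the guesswork out is to name the directory explicitly and have the command refuse anything suspicious. This is only a sketch of that idea, not what was actually run here; `wipe_contents` is a hypothetical helper, and it relies on the `-mindepth` and `-delete` options found in GNU, BSD and most busybox builds of find.

```shell
# Delete everything inside DIR (but not DIR itself), with a few sanity checks.
# Usage: wipe_contents DIR
wipe_contents() {
  local target
  # An empty argument must never fall through to "the current directory".
  [ -n "$1" ] || { echo "refusing: no directory given" >&2; return 1; }
  # Resolve to a real, absolute path; fails if DIR does not exist.
  target=$(cd "$1" 2>/dev/null && pwd -P) || {
    echo "refusing: not a directory: $1" >&2; return 1; }
  case "$target" in
    /|"$HOME") echo "refusing to wipe $target" >&2; return 1 ;;
  esac
  find "$target" -mindepth 1 -delete
}
```

One difference from the plain command: the shell's `*` skips dotfiles, while this removes hidden files too, which is usually what you want when wiping a partition clean.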

Sort of.

I took a break, and Daphne and I enjoyed a nice dinner, and then watched a little TV (ironically - the Lost season premiere). That was about how my day, and even week, has felt at times. So, a couple of hours later, Daphne is asleep, and I can't wait any longer. I come back in the office, grab the next drive I need to connect, and... SNAP! The connector breaks again, immediately. Bummer. I try a quick test with the broken connector - no dice, the drive doesn't even spin up. So - I try wedging the broken piece in where it belongs, and positioning the drive and cable just so, and... the thing whirs to life. Okay, just two more to go, repeating the process: Backing up the "system" partition, and then... wiping it clean. Rinse, Repeat.

Now I've completed the process with all four drives - and what I was expecting was this: Power up the device, expect it to come up in "Emergency Mode" (but at least configured and on the network) - and then have to re-flash the firmware, in debug mode, forcing an update to the hard disks. Instead, what I got was: Immediate error tones, red FAIL LEDs for all four drives, and:

TFTP Boot Mode: IP address is 192.168.100.150 Server IP address 192.168.100.1

Oy! So, all my configuration appeared to be lost (network, etc) - and it just showed hard failures on all four drives. Nice. This is supposed to be getting better, not worse... but I press on. Surprised as I am, I understand what's needed, and how best to accomplish it. After downloading the needed software, I grabbed my laptop - and fortunately, was able to dig up a CROSSOVER ethernet cable pretty quickly - and connected directly between my laptop and the NAS box ethernet port. Then, a quick IP config, and I was ready to run the program which provided both the TFTP server, as well as the initial version of the firmware. I plugged in, started it up - and almost immediately, the NAS box connected, and started pulling down the firmware. A few minutes later... it rebooted, and came up with some default network configuration. I used the NAS Navigator to change the IP address, reconnected the NAS box to the switch, and... voila. It was, at least, back on the network.

At this point, I repeated the process to enable telnet, and gain root access to the box. I connected to the web admin interface - and re-configured all the relevant settings (hostname, users, etc). But... the system already knew about the array (!!), accurately reporting the total utilization (~400GB out of ~ 1.5TB), as well as all the individual shares. WooHOO!!

Exhausted, but oh-so-satisfied... I went and crawled into bed. It had been nearly a 12 hour ordeal - but in the end, it all panned out, and data integrity was maintained. This is a very good thing, given the strategic nature of this device and data. But - while I've got the capacity to do so, I may copy everything off to this fat 750GB external USB hard disk I have sitting around gathering dust. That means I've got RAID5 nested with an offline RAID1 mirror - so, aside from the challenge of keeping them updated - at least I've got a pretty good confidence that I shouldn't lose any of the data. Knock on wood, since I don't actually have any offsite recovery - it's all local.
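As for the challenge of keeping the external copy updated, a small mirror script run every so often would cover it. A rough sketch, assuming rsync is installed (with a plain `cp -a` fallback if it isn't); the `mirror_share` name and the example paths are hypothetical.

```shell
# Copy the contents of SRC into DST, preserving permissions and timestamps.
# Usage: mirror_share SRC DST
mirror_share() {
  local src=$1 dst=$2
  [ -d "$src" ] || { echo "no such source: $src" >&2; return 1; }
  mkdir -p "$dst" || return 1
  if command -v rsync >/dev/null 2>&1; then
    rsync -a "$src"/ "$dst"/     # only transfers files that changed
  else
    cp -a "$src"/. "$dst"/       # fallback: recopies everything each time
  fi
}
```

Something like `mirror_share /mnt/array/photos /mnt/usbdisk/photos` would do it. Adding rsync's `--delete` option makes the copy an exact mirror, at the cost of also propagating accidental deletions.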

Whew! What a long, strange trip it's been. I've since made some more 'upgrades' to the NAS box: Enabled SSH and configured it to start at boot time, and be monitored/restarted automatically. A few other tweaks to resolve time discrepancies, add additional tools, install a full 'ps' tool, etc. All in all, as 'deep' as the process got - it all worked out, and I learned a fair amount in the process. I'm chalking that up as a win-win: score one for the team!
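For the curious, the "monitored/restarted automatically" part can be as simple as a check that runs from cron every few minutes. This is a generic sketch of the idea rather than the exact setup on the box: `ensure_running` is a made-up helper, and the pidfile and sshd paths in the usage line are assumptions that vary by system.

```shell
# Run the given start command unless the PID recorded in PIDFILE is still alive.
# Usage: ensure_running PIDFILE START_COMMAND [ARGS...]
ensure_running() {
  local pidfile=$1
  shift
  # kill -0 sends no signal; it only tests whether the process exists.
  if [ -s "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
    return 0            # already running, nothing to do
  fi
  "$@"                  # not running (or no pidfile): start it
}
```

A cron entry calling `ensure_running /var/run/sshd.pid /usr/sbin/sshd` would then bring SSH back if it ever dies. It uses only shell builtins plus cat, which helps on a stripped-down box without a full 'ps'.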

 
