I woke up this morning and saw some unusual events in my inbox. Sampa servers have lots of self-monitoring and self-healing, but when things are not well I get notified. This morning, our server with the friendly name SADCWS01 was not responding.
This was our first server. The very first server I bought for Sampa almost 4 years ago. Every once in a while, a server reboots itself to try to address some memory or connectivity issue and doesn't come back on its own. It actually happens once every 3 months or so.
I saw the issue at 8:00 AM this morning. The first order of business was to try to reboot it remotely, but that didn't work. I could have asked the datacenter to reboot the server, but I had a feeling things were not right, so I drove to our datacenter, attached a monitor to the server and rebooted it manually.
Uh-oh! The boot screen says something ugly, like "your hard disk is not operating under normal parameters, we recommend you back it up...". Gosh, it happened.
The day every entrepreneur, every ops person, and every IT person fears. We lost a hard disk.
I absolutely knew this day would come, and I designed the Sampa software and architecture to survive this, but I knew it would take time to bring everything back online.
This was about 9:45 AM and I was sitting on the floor of the datacenter, which is very cold and very noisy, pondering my options. Do I take the server to the office, buy a new HD, replace it, and restore the data? Do I take the good HD out (each server has 2 HDs that back up to each other) and restore on another server? Do I try to kick the server a bit (maybe it would magically come back online)?
Well, I decided I didn't want to "save" that server. Once a server presents a problem like this, you'd better just get rid of it and move on. You never know if the HD issue was caused by the HD itself, a bad controller, a bad cable, a power supply surge, etc.
So, I went to Hard-Drives Northwest, just a few blocks from the datacenter, bought a SATA enclosure with a USB connector and went back to the datacenter.
The first issue was that this server was at the bottom of a stack of 5 servers. I couldn't remove it without re-arranging all the other servers, so I shut down all the other servers so as not to cause another HD issue on another server by some clumsy act. Then, I carefully started to remove this server from the stack, while pushing back the other servers on top of it. It probably took me more than a minute to do that very slowly, always checking that no cables were getting squished. At the last inch I pushed a bit too much and all the servers fell hard on the shelf. Crap! It was just a 1-inch drop, but still.
I open the "dead server", remove the hard disk, put it in the enclosure so it becomes an external HD and connect it to another server. Go check the screen and YES! I have the backup data and it seems good. Let's start the restore.
By now it is 11:00 AM. I was being very meticulous about it so as not to screw up anything. We are talking about 10,000 customers here. Their blog posts, their pictures, the baby milestones. My own blog (the one you are reading right now) was on that server, but worst of all, my wife's blog, where she writes about our kids, is on that server.
After 15 minutes of restoring the data I'm able to calculate how long it's going to take. Nothing short of 6 hours to copy everything to a new server (about 200GB of data) which means copying will be done by 5 PM today.
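For the curious, the estimate is simple extrapolation. Here is a rough Python sketch of the arithmetic, not the actual tool I used. The function names are made up for illustration, it assumes the copy runs at a steady rate, and the 8 GB sample figure is hypothetical (only the 200 GB and 6 hours come from today).

```python
def estimate_total_hours(copied_gb: float, elapsed_minutes: float, total_gb: float) -> float:
    """Extrapolate the total copy time from progress so far, assuming a steady rate."""
    if copied_gb <= 0 or elapsed_minutes <= 0:
        raise ValueError("need some measured progress before estimating")
    rate_gb_per_min = copied_gb / elapsed_minutes
    return total_gb / rate_gb_per_min / 60


def implied_rate_mb_per_s(total_gb: float, hours: float) -> float:
    """Average throughput implied by moving total_gb in the given time (1 GB = 1000 MB)."""
    return total_gb * 1000 / (hours * 3600)


if __name__ == "__main__":
    # Hypothetical sample: 8 GB copied in the first 15 minutes of a 200 GB restore.
    print(f"estimated total: {estimate_total_hours(8, 15, 200):.2f} hours")
    # The numbers from this restore: about 200 GB in about 6 hours.
    print(f"implied rate: {implied_rate_mb_per_s(200, 6):.1f} MB/s")
```

A real copy rarely runs at a constant rate (lots of small files slow things down), so a number like this is a ballpark at best.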
As of this writing, 65% is completed. Once I restore the data onto the new server I still need to make the DNS changes to make sure the sites point to the new server. Good thing our system expires DNS entries in 1h, and not the default 72h of most sites.
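To see why the short expiry matters, here is a small Python sketch of the worst case. It assumes resolvers honor the expiry time and that someone cached the old address right before I make the change. The function name and the date are placeholders; only the 1h and 72h values come from above.

```python
from datetime import datetime, timedelta


def latest_stale_lookup(change_time: datetime, ttl_hours: float) -> datetime:
    """Latest moment a resolver that cached the old record just before
    the change could still hand out the old address."""
    return change_time + timedelta(hours=ttl_hours)


if __name__ == "__main__":
    # Placeholder date; 5 PM is when the copy is expected to finish.
    change = datetime(2000, 1, 1, 17, 0)
    for ttl in (1, 72):
        stale_until = latest_stale_lookup(change, ttl)
        print(f"TTL {ttl}h: old address may linger until {stale_until}")
```

With a 1h expiry the worst case is an extra hour of people landing on the dead server; with 72h it would be three days.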
I can't say that I'm happy with the way things turned out today, but I feel relieved I was able to restore the data. Still, about 10% of our customers are not able to access their websites today and they don't even know why (our blog is also on that server).