One of the things I have learned in the last 7 years of working on Web Services is that you either have an awesome strategy to avoid downtime on your service or you have a poor service (lots of downtime). There is not much in between.
There is scheduled downtime, like when you are upgrading the service or replacing the network router, and there is unscheduled downtime, like when the disk fills up, the network router fails, or the software has a memory leak that reaches the breaking point.
You address unscheduled downtime with redundancy in your servers and network gear, good testing, and good-quality server software overall.
Scheduled downtime is a matter of good planning and great execution.
In the last twelve months, we have had some unscheduled downtime, mostly related to a blackout in Seattle in December and other network-related issues. We had just one partial downtime (only some sites were unavailable) when we rolled out a new DNS server.
Scheduled downtime we have at least 3 times a week. That is because we upgrade our software furiously. Yesterday, we deployed new features for Facebook. Today we deployed even more features for Facebook, and so on.
I've been trying to reduce the scheduled downtime period to a minimum. It used to take 3-5 minutes of downtime, then I did a fix to bring it just under 2 minutes, but the number of sites grew so much that even with the fix it took about 3-4 minutes.
Today I did a new fix to how we deploy things, and the results were spectacular: Under 25 seconds!!!
This is fast enough that a visitor browsing the site will never see an error page, just a slower page. Assuming requests arrive uniformly during the deployment window, a visitor who hits the site in that window would wait about 12 seconds on average for the page to display. Not bad. Heck, some services take 12 seconds to display a page even without downtime!
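As a rough sanity check on that average, here is a minimal sketch of the arithmetic. It assumes a request that arrives during the window is simply held until the service comes back, and that arrivals are uniform over the window. The function names are made up for illustration and are not part of our deployment code.

```python
import random


def average_wait(downtime_seconds: float) -> float:
    """Expected wait for a request arriving uniformly at random in the window.

    A request arriving t seconds into the window waits (downtime_seconds - t),
    so the mean over t in [0, downtime_seconds] is downtime_seconds / 2.
    """
    return downtime_seconds / 2


def simulated_average_wait(downtime_seconds: float, samples: int = 100_000) -> float:
    """Estimate the same average by sampling uniform arrival times."""
    rng = random.Random(0)  # fixed seed so the estimate is repeatable
    total = 0.0
    for _ in range(samples):
        arrival = rng.uniform(0, downtime_seconds)
        total += downtime_seconds - arrival  # time left until service returns
    return total / samples


if __name__ == "__main__":
    print(average_wait(25))
    print(simulated_average_wait(25))
```

For a 25-second window the closed form gives 12.5 seconds, which lines up with the roughly 12 seconds quoted above for a window of just under 25 seconds.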
I'm the Co-founder & CTO of