Marcelo Calbucci

Monday, July 6, 2009

A Concrete Tip to Deal with Datacenter Failures

In case you missed it over the weekend, there was a fire at the Fisher Plaza building in Seattle between Thursday and Friday. That building houses a few Datacenters, like Internap and AdHost.

 

I can't say I'm an expert on Datacenters, redundancy, systems operations, etc., but I can say that I've been managing services and servers for about 8 years, at Microsoft, at Sampa and on my own personal servers (yes, I actually have 3 personal servers).

 

I was reading Jeremy Irish's post on the effect on Groundspeak and remembered one of the painful lessons I learned a while ago.

 

ALWAYS SET YOUR DNS TTL TO 15 MINUTES OR LESS.

 

If you don't know what it means, you can stop reading now.

 

The problem with DNS is that most DNS servers come with a default TTL of 24 hours, which means that if you move your server to a different location or you move your site to a different server, some of your users won't be able to access your site for up to 24 hours, because their resolvers keep handing out the old cached address until it expires. Now, this default of 24h has been around for about 20 years or so and it's just outdated.
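To make the arithmetic concrete, here is a small Python sketch of that worst case. The timeline is hypothetical (the date and the "cached one minute before the move" scenario are assumptions for illustration), and it models only a single resolver that honors the TTL exactly.

```python
from datetime import datetime, timedelta


def stale_until(cached_at: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a resolver may keep serving a record it cached at `cached_at`."""
    return cached_at + timedelta(seconds=ttl_seconds)


# Hypothetical timeline: a resolver caches the old address one minute
# before the site moves to a new server.
moved_at = datetime(2009, 7, 6, 12, 0)
cached_at = moved_at - timedelta(minutes=1)

for label, ttl in (("24 hours", 24 * 60 * 60), ("15 minutes", 15 * 60)):
    window = stale_until(cached_at, ttl) - moved_at
    print(f"TTL {label}: old address may be served for up to {window} after the move")
```

The gap between the two windows is the whole argument: the shorter the TTL, the sooner every cache is forced to ask again.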

 

All my domains have a TTL of 60 minutes or 15 minutes. I've actually seen some hosting companies (like eNom) using a TTL of 3 minutes, which I think is great.

 

The biggest justifications for a longer TTL are usually two-fold: to reduce your DNS server load, and to improve user experience by reducing latency and adding resilience in case of DNS failure.

 

DNS servers can host millions of domains and serve billions of requests per day, so the perf argument for the server simply doesn't hold up. DNS is also the fastest protocol on the web (because it uses UDP), and there is almost no latency any user could ever notice, not even if you are resolving a dozen domains for the same page. Finally, if your DNS is down *and* your backup DNS is down, it's most likely that your site is down as well (if both run on the same server).

 

I find that smaller TTLs also help with all kinds of issues, like moving servers, adding servers, removing servers, etc. So, stop procrastinating and go check the TTL of the domains you own and change it to 15 minutes!
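If you want to script that check, here is a minimal Python sketch. It assumes the `dig` command-line tool is installed and parses its `+noall +answer` output; the function names, the sample records and the `192.0.2.10` address are made up for illustration. Ask one of the domain's authoritative nameservers directly, since a caching resolver reports the remaining countdown rather than the configured TTL.

```python
import subprocess

MAX_TTL_SECONDS = 15 * 60  # the 15-minute ceiling recommended in this post


def parse_answer_ttls(dig_output: str) -> list[tuple[str, int, str]]:
    """Return (name, ttl_seconds, record_type) for each answer line."""
    records = []
    for line in dig_output.splitlines():
        fields = line.split()
        # Answer lines look like: name TTL class type data...
        if len(fields) >= 5 and not line.startswith(";") and fields[1].isdigit():
            records.append((fields[0], int(fields[1]), fields[3]))
    return records


def too_long(records, limit=MAX_TTL_SECONDS):
    """Keep only the records whose TTL exceeds the limit."""
    return [record for record in records if record[1] > limit]


def query_ttls(domain: str, nameserver: str):
    """Run dig against the given nameserver (requires dig and network access)."""
    result = subprocess.run(
        ["dig", f"@{nameserver}", "+noall", "+answer", domain, "A"],
        capture_output=True, text=True, check=True,
    )
    return parse_answer_ttls(result.stdout)


# Illustrative sample in dig's answer format, not real data.
SAMPLE = """\
example.com.      86400  IN  A  192.0.2.10
www.example.com.  900    IN  A  192.0.2.10
"""

for name, ttl, rtype in too_long(parse_answer_ttls(SAMPLE)):
    print(f"{name} {rtype}: TTL is {ttl}s, above the {MAX_TTL_SECONDS}s limit")
```

For a live check you would call `query_ttls()` with your own domain and one of its nameservers, then pass the result to `too_long()`.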

 

I also recommend an expiration of 24h and a retry of 1h.
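Many DNS control panels and zone files want these values in seconds. Here is a tiny Python sketch of the conversion, assuming the "expiration" and "retry" above refer to the expire and retry timers of the zone's SOA record; the helper name and the dictionary layout are only illustrative.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(duration: str) -> int:
    """Convert a short duration such as '15m' or '24h' to seconds."""
    value, unit = duration[:-1], duration[-1].lower()
    return int(value) * UNIT_SECONDS[unit]


# The values recommended in this post.
recommended = {"ttl": "15m", "retry": "1h", "expire": "24h"}

for field, duration in recommended.items():
    print(f"{field}: {duration} = {to_seconds(duration)} seconds")
```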

 

 

 

 
