Sep 19, 2012

Database Failure Is Not the Biggest Availability Problem

Baron's post on the GitHub database outage has triggered a number of excellent articles about the pros and cons of automatic database failover. In the spirit of Peter Zaitsev's article "The Math of Automated Failover," it seems like a good time to point out that database failure is usually not the biggest source of downtime for websites or indeed applications in general. The real culprit is maintenance.

Here is a simple table showing availability numbers out to five nines and what they mean in terms of monthly downtime.

Uptime     Downtime per 30-Day Month
0.9        3 days
0.99       07:12:00
0.999      00:43:12
0.9999     00:04:20
0.99999    00:00:26
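
For readers who want to reproduce the table, here is a minimal sketch of the arithmetic in Python. The function name `monthly_downtime` is purely illustrative and not from any library; it simply multiplies the length of a 30-day month by the fraction of time the service is allowed to be down.

```python
from datetime import timedelta


def monthly_downtime(uptime: float, days: int = 30) -> timedelta:
    """Allowed downtime in a month of `days` days for an uptime fraction."""
    total_seconds = days * 24 * 60 * 60          # 2,592,000 s in a 30-day month
    down_seconds = total_seconds * (1.0 - uptime)
    return timedelta(seconds=round(down_seconds))  # round to whole seconds


if __name__ == "__main__":
    for uptime in (0.9, 0.99, 0.999, 0.9999, 0.99999):
        print(uptime, monthly_downtime(uptime))
```

Exact arithmetic gives 259.2 seconds for 0.9999, so values rounded to whole seconds can differ from a hand-rounded table by a second or so.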


Now let's do some math. We start with Peter's suggested number that the DBMS fails once a year. Let's also say you take a while to wake up (because it's the middle of the night and you don't like automatic failover), figure out what happened, and run your failover procedure. You are back online in an hour. Amortized over the year, an hour of downtime is 5 minutes per month. That is 5 of the 43,200 minutes in a 30-day month, so overall availability is about 99.988%, close to 99.99% or four nines.

Five minutes per month is small potatoes compared to the time for planned maintenance. Let's say you allow yourself a one-hour maintenance window each month for DBMS schema changes, database version upgrades, and other work that takes the DBMS fully offline from applications. Total downtime is now 65 minutes per month, so real availability in this simple (and conservative) example is about 99.85%, well below 99.9% or less than three nines. Maintenance accounts for 60 of those 65 minutes, over 90% of the downtime. The real key to improved availability is to be able to maintain the DBMS without taking applications offline.
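
The comparison above can be checked with a few lines of Python. This is only a sketch under the assumptions used in this post (one one-hour failure per year, one one-hour maintenance window per month, a 30-day month); the variable and function names are made up for illustration.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def availability(downtime_minutes: float) -> float:
    """Fraction of the month the service is up, given downtime in minutes."""
    return 1.0 - downtime_minutes / MINUTES_PER_MONTH


# Assumption: one 1-hour unplanned outage per year, amortized per month.
failure_minutes = 60 / 12
# Assumption: one 1-hour planned maintenance window every month.
maintenance_minutes = 60
total_minutes = failure_minutes + maintenance_minutes

if __name__ == "__main__":
    print(f"failures only:     {availability(failure_minutes):.4%}")
    print(f"with maintenance:  {availability(total_minutes):.4%}")
    print(f"maintenance share: {maintenance_minutes / total_minutes:.1%}")
```

Changing `maintenance_minutes` to match your own maintenance window shows how quickly planned work dominates the total.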

We have been very focused on the maintenance problem in Tungsten. Database replication is a good start for enabling rolling maintenance, where you work on one replica at a time. In Tungsten the magic sauce is an intervening connectivity layer that can transparently switch connections between DBMS servers while applications are running. You can take DBMS servers offline and upgrade safely without bothering users. Managing planned failover in this way is an easier problem to solve than providing bombproof automatic failover, I am happy to say. It is also considerably more valuable for many users.

4 comments:

Andy said...

Not all downtime is equal. With maintenance you can schedule it at 3am when there are hardly any users. Failures, on the other hand, tend to happen during peak hours when the load is heaviest.

So even though maintenance accounts for most of the downtime, failure still causes the biggest business impact.

Robert Hodges said...

@Andy, that's been the traditional view. It is still true for many applications, especially those that serve businesses. However, it is much less true for consumer apps as well as businesses that serve users spread across many timezones. 3am Pacific is 12 noon in central Europe. Finally, some maintenance operations on a DBMS can result in prolonged downtime, not just an hour or two. That's no longer feasible for many applications.

Anonymous said...

Robert, I agree. In my experience a lot of people are incredibly careless and irresponsible when they ask for high availability solutions without first doing some hard thinking about why. (Not pointing at GitHub here, just the general population.)

- We need an HA solution that will automatically keep services up when they fail.
- OK, what kinds of failures are you looking to fix?
- Uh.... I don't know -- I don't even know what kinds of failures are possible! I never thought about that.

I've had variations of this conversation with a lot of people. If I were an unscrupulous vendor, I'd just sell them something and THEY WOULD BUY IT AND RUN IT WITHOUT A SECOND THOUGHT (I've seen it happen). This boggles my mind.

For something so important, why do people neglect basic due diligence?

The most memorable conversation I ever had about HA was like this.

- We need as much uptime as we can get.
- OK, what does that mean -- five nines?
- No, we need at least six nines.
- OK, got it [me: giving up, attacking a different way]. How much downtime is acceptable every month?
- An hour or two.

Incredible.

Robert Hodges said...

@Anonymous, thanks for the comment. I have been there for similar conversations. Hours per month does seem the easiest metric for people to grasp. That's why I chose it for the table in the article.