Friday, September 21, 2012

Data Fabrics and Other Tales: Percona Live and MySQL Connect

The fall conference season is starting.  I will be doing a number of talks including a keynote on "future proofing" MySQL through the use of data fabrics.  Data fabrics allow you to build durable, long-lasting systems that take advantage of MySQL's strengths today but also evolve to solve future problems using fast-changing cloud and big data technologies.  The talk brings together ideas that Ed Archibald (our CTO) and I have been working on for over two decades.  I'm looking forward to rolling them out to a larger crowd.

Here are the talks in calendar order.  The first two are at MySQL Connect 2012 in San Francisco on September 30th:
MySQL Connect is an add-on to Oracle Open World.  You know the conference is big if they have to use 5-digit codes to keep track of talk titles.  It's almost worth the price of admission to look at Larry Ellison's boat.  Well maybe not quite, but you get the idea.

Next up is the Percona Live MySQL Conference in New York on October 2nd:
Percona has been doing an amazing job of organizing conferences.  This will be my fifth one.  The previous four were great.  If you are in the New York area and like MySQL, sign up now.  This conference is the single best way to get up to speed on state-of-the-art MySQL usage.

Wednesday, September 19, 2012

Database Failure Is Not the Biggest Availability Problem

There have been a number of excellent articles about the pros and cons of automatic database failover triggered by Baron's post on the GitHub database outage.  In the spirit of Peter Zaitsev's article "The Math of Automated Failover," it seems like a good time to point out that database failure is usually not the biggest source of downtime for websites or indeed applications in general.  The real culprit is maintenance.

Here is a simple table showing availability numbers out to 5 nines and what they mean in terms of monthly down-time.

Uptime      Downtime per 30-Day Month
0.9         72:00:00 (3 days)
0.99        07:12:00
0.999       00:43:12
0.9999      00:04:20
0.99999     00:00:26


Now let's do some math.  We start with Peter's suggested number that the DBMS fails once a year.   Let's also say you take a while to wake up (because it's the middle of the night and you don't like automatic failover), figure out what happened, and run your failover procedure.  You are back online in an hour. Amortized over the year, an hour of downtime is 5 minutes per month.  Overall availability is close to 99.99%, or four nines.  

Five minutes per month is small potatoes compared to the time for planned maintenance.  Let's say you allow yourself a one-hour maintenance window each month for DBMS schema changes, database version upgrades, and other work that takes the DBMS fully offline from applications.  Real availability in this simple (and conservative) example is well below 99.9% or less than three nines. Maintenance accounts for over 90% of the downtime.   The real key to improved availability is to be able to maintain the DBMS without taking applications offline.  
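The arithmetic above can be sketched in a few lines.  The 5- and 65-minute figures come from the failover and maintenance scenarios just described; the class and method names are only illustrative:

```java
class Availability {
    // Minutes in a 30-day month: 30 * 24 * 60 = 43,200
    static final double MONTH_MINUTES = 30 * 24 * 60;

    // Monthly downtime in minutes implied by an availability fraction
    static double downtimeMinutes(double availability) {
        return MONTH_MINUTES * (1.0 - availability);
    }

    // Availability fraction implied by monthly downtime in minutes
    static double availability(double downtimeMinutes) {
        return 1.0 - downtimeMinutes / MONTH_MINUTES;
    }

    public static void main(String[] args) {
        // One failure per year, one hour to recover: ~5 minutes/month amortized
        System.out.printf("Failover only:    %.4f%n", availability(5));
        // Add a one-hour monthly maintenance window: 65 minutes/month total
        System.out.printf("With maintenance: %.4f%n", availability(65));
    }
}
```

Running this prints roughly 0.9999 for failover alone and 0.9985 once the maintenance window is included, which is where the "well below three nines" figure comes from.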

We have been very focused on the maintenance problem in Tungsten.  Database replication is a good start for enabling rolling maintenance, where you work on one replica at a time.  In Tungsten the magic sauce is an intervening connectivity layer that can transparently switch connections between DBMS servers while applications are running.  You can take DBMS servers offline and upgrade safely without bothering users.  Planned failover of this kind is an easier problem to solve than bombproof automatic failover, I am happy to say.  It is also considerably more valuable for many users. 

Monday, September 17, 2012

Automated Database Failover Is Weird but not Evil

GitHub had a recent outage due to malfunctioning automatic MySQL failover.  Having worked on this problem for several years, I felt sympathy but not much need to comment.  Then Baron Schwartz wrote a short post entitled "Is automated failover the root of all evil?"  OK, that seems worth a comment:  it's not.  Great title, though.

Selecting automated database failover involves a trade-off between keeping your site up 24x7 and making things worse by having software do the thinking when humans are not around.  When comparing outcomes of wetware vs. software it is worth remembering that humans are not at their best when woken up at 3:30am.  Humans go on vacations, or their cell phones run out of power.  Humans can commit devastating unforced errors due to inexperience.  For these and other reasons, automated failover is the right choice for many sites even if it is not perfect. 

Speaking of perfection, it is common to hear claims that automated database failover can never be trusted (such as this example).  For the most part such claims apply to particular implementations, not database failover in general.  Even so, it is undoubtedly true that failover is complex and hard to get right.  Here is a short list of things I have learned about failover from working on Tungsten and how I learned them.  Tungsten clusters are master/slave, but you would probably derive similar lessons from most other types of clusters. 

1. Fail over once and only once.  Tungsten does so by electing a coordinator that makes decisions for the whole cluster.  There are other approaches, but you need an algorithm that is provably sound.  Good clusters stop when they cannot maintain the pre-conditions required for soundness, to which Baron's article alludes.  (We got this right more or less from the start through work on other systems and reading lots of books about distributed systems.) 

2. Do not fail over unless the DBMS is truly dead.  The single best criterion for failure seems to be whether the DBMS server will accept a TCP/IP connection.  Tests that look for higher brain function, such as running a SELECT, tend to generate false positives due to transient load problems like running out of connections or slow server responses.  Failing over due to load is very bad as it can take down the entire cluster in sequence as load shifts to the remaining hosts.  (We learned this through trial and error.) 
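A minimal version of such a connection probe might look like the following.  This is not Tungsten's actual check; the host, port, and timeout values are purely illustrative:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Liveness probe in the spirit described above: the DBMS is considered
// alive if it accepts a TCP connection within a short timeout. No query
// is run, so transient load problems do not trigger false failovers.
class TcpLivenessCheck {
    static boolean isAlive(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;  // Connection accepted: server process is reachable
        } catch (IOException e) {
            return false; // Refused or timed out: candidate for failover
        }
    }

    public static void main(String[] args) {
        // Probe a hypothetical MySQL server on the default port
        System.out.println(isAlive("127.0.0.1", 3306, 2000));
    }
}
```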

3. Stop if failover will not work or better yet don't even start.  For example, Tungsten will not fail over if it does not have up-to-date slaves available.  Tungsten will also try to get back to the original pre-failover state when failover fails, though that does not always work.  We get credit for trying, I suppose.  (We also learned this through trial and error.) 

4. Keep it simple.  People often ask why Tungsten does not resubmit transactions that are in-flight when a master failover occurs.  The reason is that there are many reasons why resubmission does not work on a new master and it is difficult to predict when such failures will occur.  Imagine you were dependent on a temp table, for example.  Resubmitting just creates more ways for failover to fail.  Tungsten therefore lets connections break and puts the responsibility on apps to retry failed transactions.  (We learned this from previous products that did not work.) 
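The application-side retry that point 4 puts on apps could be sketched roughly as follows.  The exception class, attempt limit, and backoff values are hypothetical, not part of Tungsten:

```java
import java.util.function.Supplier;

// Sketch of an application retry loop: when a connection breaks during
// failover, the app reruns the whole transaction from the top against
// the (new) master rather than expecting the cluster to replay it.
class RetryingExecutor {
    static class TransientFailureException extends RuntimeException {
        TransientFailureException(String msg) { super(msg); }
    }

    static <T> T withRetry(Supplier<T> transaction, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; ; attempt++) {
            try {
                return transaction.get(); // Run the entire transaction again
            } catch (TransientFailureException e) {
                if (attempt >= maxAttempts) throw e; // Give up after maxAttempts
                try {
                    Thread.sleep(backoffMillis * attempt); // Simple linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Simulate a transaction that fails twice (mid-failover) then succeeds
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new TransientFailureException("connection broken");
            return "committed";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```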

Even if you start out with such principles firmly in mind, new failover mechanisms tend to encounter a lot of bugs.  They are hard to find and fix because failover is not easy to test.  Yet the real obstacle to getting automated failover right is not so much bugs but the unexpected nature of the problems clusters encounter.  There is a great quote from J.B.S. Haldane about the nature of the universe that also gives a flavor of the mind-bending nature of distributed programming: 
My own suspicion is that the universe is not only queerer than we suppose, but queerer than we can suppose. 
I can't count the number of times where something misbehaved in a way that would never have occurred to me without seeing it happen in a live system.  That is why mature clustering products can be pretty good while young ones, however well-designed, are not.  The problem space is just strange. 

My sympathy for the GitHub failures and everyone involved is therefore heartfelt.  Anyone who has worked on failover knows the guilt of failing to anticipate problems as well as the sense of enlightenment that comes from understanding why they occur.  Automated failover is not evil.  But it is definitely weird. 

Monday, September 3, 2012

Life in the Amazon Jungle

In late 2011 I attended a lecture by John Wilkes on Google compute clusters, which link thousands of commodity computers into huge task processing systems.  At this scale hardware faults are common.  Google puts a lot of effort into making failures harmless by managing hardware efficiently and using fault-tolerant application programming models.  This is not just good for application up-time.  It also allows Google to operate on cheaper hardware with higher failure rates, hence offers a competitive advantage in data center operation.

It's becoming apparent we all have to think like Google to run applications successfully in the cloud.  At Continuent we run our IT and an increasing amount of QA and development on Amazon Web Services (AWS).   During the months of July and August 2012 at least 3 of our EC2 instances were decommissioned or otherwise degraded due to hardware problems.  One of the instances hosted our main website www.continuent.com.

In Amazon failures are common and may occur with no warning.  You have minimal ability to avoid them or in some cases even to understand the root causes.  To survive in this environment, applications need to obey a new law of the jungle.  Here are the rules as I understand them. 

First, build clusters of redundant services.  The www.continuent.com failure brought our site down for a couple of hours until we could switch to a backup instance.  Redundant means up and ready to handle traffic now, not after a bridge call to decide what to do.  We protect our MySQL servers by replicating data cross-region using Tungsten, but the website is an Apache server that runs on a separate EC2 instance.  Lesson learned.  Make everything a cluster and load balance traffic onto individual services so applications do not have to do anything special to connect.  

Second, make applications fault-tolerant.  Remote services can fail outright, respond very slowly, or hang.  To live through these problems apply time-honored methods to create loosely coupled systems that degrade gracefully during service outages and repair themselves automatically when service returns.  Here are some of my favorites.  
  1. If your application has a feature that depends on a remote service and that service fails, turn off the feature but don't stop the whole application.  
  2. Partition features so that your applications operate where possible on data copies.  Learn how to build caches that do not have distributed consistency problems.  
  3. Substitute message queues for synchronous network calls.  
  4. Set timeouts on network calls to prevent applications from hanging indefinitely.  In Java you usually do this by putting the calls in a separate thread.  
  5. Use thread pools to limit calls to remote services so that your application does not explode when those services are unavailable or fail to respond quickly. 
  6. Add auto-retry so that applications reconnect to services when they are available again. 
  7. Add catch-all exception handling to deal with unexpected errors from failed services.  In Java this means catching RuntimeException or even Throwable to ensure it gets properly handled. 
  8. Build in monitoring to report problems quickly and help you understand failures you have not previously seen.  
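Several of these rules (timeouts via a separate thread, bounded thread pools, and catch-all exception handling) combine naturally in one small sketch.  The remote call and the fallback value here are invented for illustration:

```java
import java.util.concurrent.*;

// Guarded remote call: a bounded pool issues the call, Future.get()
// enforces a timeout, and a catch-all turns any unexpected failure
// into a safe fallback instead of hanging or crashing the app.
class GuardedRemoteCall {
    // Small fixed pool: at most 4 in-flight remote calls, so a dead
    // service cannot tie up every thread in the application
    static final ExecutorService pool = Executors.newFixedThreadPool(4);

    static String fetchWithTimeout(Callable<String> remoteCall, long timeoutMillis) {
        Future<String> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // Interrupt the hung call
            return "fallback";   // Degrade gracefully instead of blocking
        } catch (Exception e) {  // Catch-all: unexpected service errors
            return "fallback";
        }
    }

    public static void main(String[] args) {
        // A simulated remote service that hangs longer than we will wait
        String result = fetchWithTimeout(() -> {
            Thread.sleep(5000);
            return "real data";
        }, 200);
        System.out.println(result); // The caller never blocks for 5 seconds
        pool.shutdownNow();
    }
}
```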
Third, revel in failure.  Netflix takes this to an extreme with Chaos Monkey, which introduces failures in running systems.  Another approach is to build scaffolding into applications so operations fail randomly.  We use that technique (among others) to test clusters.  In deployed database clusters I like to check regularly that any node can become the master and that you can recover any failed node.  However, this is just the beginning.  There are many, many ways that things can fail.  It is better to provoke the problems yourself than have them occur for the first time when something bad happens in Amazon.  

There is nothing new about these suggestions.  That said, the Netflix approach exposes the difference between cloud operation and traditional enterprise computing.  If you play the game, applications will stay up 24x7 in this rougher landscape and you can tap into the flexible cost structure and rapid scaling of Amazon.  The shift feels similar to using database transactions or eliminating GOTO statements--just something we all need to do in order to build better systems.  There are big benefits to running in the cloud but you really need to step up your game to get them.  

Scaling Databases Using Commodity Hardware and Shared-Nothing Design