Sep 3, 2012

Life in the Amazon Jungle

In late 2011 I attended a lecture by John Wilkes on Google compute clusters, which link thousands of commodity computers into huge task processing systems.  At this scale hardware faults are common.  Google puts a lot of effort into making failures harmless by managing hardware efficiently and using fault-tolerant application programming models.  This is not just good for application up-time.  It also allows Google to operate on cheaper hardware with higher failure rates, hence offers a competitive advantage in data center operation.

It's becoming apparent we all have to think like Google to run applications successfully in the cloud.  At Continuent we run our IT and an increasing amount of QA and development on Amazon Web Services (AWS).   During the months of July and August 2012 at least 3 of our EC2 instances were decommissioned or otherwise degraded due to hardware problems.  One of the instances hosted our main website

In Amazon failures are common and may occur with no warning.  You have minimal ability to avoid them or some cases even understand the root causes.  To survive in this environment, applications need to obey a new law of the jungle.  Here are the rules as I understand them. 

First, build clusters of redundant services.  The failure brought our site down for a couple of hours until we could switch to a backup instance.  Redundant means up and ready to handle traffic now, not after a bridge call to decide what to do.  We protect our MySQL servers by replicating data cross-region using Tungsten, but the website is an Apache server that runs on a separate EC2 instance.  Lesson learned.  Make everything a cluster and load balance traffic onto individual services so applications do not have to do anything special to connect.  

Second, make applications fault-tolerant.  Remote services can fail outright, respond very slowly, or hang.  To live through these problems apply time-honored methods to create loosely coupled systems that degrade gracefully during service outages and repair themselves automatically when service returns.  Here are some of my favorites.  
  1. If your application has a feature that depends on a remote service and that service fails, turn off the feature but don't stop the whole application.  
  2. Partition features so that your applications operate where possible on data copies.  Learn how to build caches that do not have distributed consistency problems.  
  3. Substitute message queues for synchronous network calls.  
  4. Set timeouts on network calls to prevent applications from hanging indefinitely.  In Java you usually do this by putting the calls in a separate thread.  
  5. Use thread pools to limit calls to remote services so that your application does not explode when those services are unavailable or fail to respond quickly. 
  6. Add auto-retry so that applications reconnect to services when they are available again. 
  7. Add catch-all exception handling to deal with unexpected errors from failed services.  In Java this means catching RuntimeException or even Throwable to ensure it gets properly handled. 
  8. Build in monitoring to report problems quickly and help you understand failures you have not previously seen.  
Third, revel in failure.  Netflix takes this to an extreme with Chaos Monkey, which introduces failures in running systems.  Another approach is to build scaffolding into applications so operations fail randomly.  We use that technique (among others) to test clusters.  In deployed database clusters I like to check regularly that any node can become the master and that you can recover any failed node.  However, this is just the beginning.  There are many, many ways that things can fail.  It is better to provoke the problems yourself than have them occur for the first time when something bad happens in Amazon.  

There is nothing new about these suggestions.  That said, the Netflix approach exposes the difference between cloud operation and traditional enterprise computing.  If you play the game applications will stay up 24x7 in this rougher landscape and you can tap into the flexible cost structure and rapid scaling of Amazon.  The shift feels similar to using database transactions or eliminating GOTO statements--just something we all need to do in order to build better systems.  There are big benefits to running in the cloud but you really need to step up your game to get them.  


Brian Moon said...

Actually, this is not an Amazon thing. If you went inside any reliable server infrastructure you would find the same thing. We don't use AWS, but we replicate data two different geographic locations. Everything is a cluster with load balancing. We queue and cache 3rd party services to ensure up time. This is not new. AWS did not cause this to be needed. This has always been needed if you want to stay up. We learned this in 2000. I fear we are failing the new generation since so many people are still learning this.

Robert Hodges said...

@Brian, that's my point. Anyone operating at scale has seen these issues already and dealt with them or suffered for it. However, Amazon is flakey by design so even small businesses cannot evade building in availability. You can't push it off onto the ops team any more.