The Scale-Out Blog: 2012

Dec 27, 2012

The MySQL Community: Beleaguered or Better than Ever?

The MariaDB Foundation announcement spawned some interesting commentary about the state of open source databases. One recent headline cited the "beleaguered MySQL community." Beleaguered is a delightful adjective. The OED tells us that it means beset, invested, or besieged. Much as I like the word, I do not think it is an accurate or useful description of the MySQL community. This article and others like it miss the point of what is happening to MySQL and its users.

Let's start by disproving the notion that the MySQL community is beleaguered. I don't know everyone who uses MySQL, but in my job I talk to numerous companies that have made sizable investments in MySQL and stand to lose big if they are wrong. They do not seem especially nervous.

1. Nobody seriously questions MySQL viability. I have yet to meet a manager with a substantial business on MySQL who is deeply worried about it disappearing or being ruined by Oracle. They are too busy working on software upgrades or keeping their sites running. The future of MySQL is well down the list of problems keeping them awake at night.

2. MySQL meets or beats the immediate alternatives. There is of course discussion about dropping MySQL for PostgreSQL but it is mostly idle talk. I'm sure some companies have switched (actually in both directions), but I have not seen a single customer migrate a working business app from MySQL to PostgreSQL. Once you get past the religion, it's clear MySQL and PostgreSQL are just too similar to supplant each other easily: reliable, row-based stores with single threaded SQL query engines that handle a few terabytes of data at most. Companies need far stronger reasons to switch to something new, especially given the large ecosystem and deep pool of MySQL expertise.

3. MySQL is not the only game in town. Virtually every large web site I know uses at least one NoSQL store alongside MySQL. Column stores are increasingly common for data warehouses. Production Hadoop clusters are no longer a novelty. On the surface this might look like a failure of MySQL. What's really happening in many cases is that small businesses that started on MySQL are now large, profitable enterprises that require more than just economical OLTP. This is a mark of success, not a deficiency.

If this is what beleaguered looks like I can't wait to see something that's actually successful.

Turning the argument around, can we say that the MySQL community is better than it was? In at least one important way, yes. The community is now multi-polar. MySQL long benefitted from having a large community of open source users to find bugs, help focus development direction, and construct a wide range of robust tools like language bindings. However, innovation on MySQL itself was largely gated by a single company: MySQL AB. Multiple groups are now competing to improve MySQL, and it's a very good thing for users. Let me count the ways.

There are three major versions of MySQL: Oracle, Percona, and MariaDB, not to mention cloud-only versions like Amazon RDS. There are at least four companies working directly on major upgrades to replication: Continuent, Oracle, Codership, and Monty Program. Oracle is continuing to make improvements in InnoDB like online schema change and multi-core scaling, efforts that are complemented by Percona's persistent focus on all aspects of performance. Aside from Amazon RDS, all of this work is available in open source, and there is an unusual degree of sharing across otherwise competitive groups. I could keep going for a while but to be frank there's so much it's hard to track all the improvements or give them their proper due.

The MySQL community is therefore competitive in a way that did not exist a few years ago. That's good, because innovation in data management is no longer centered around the web-facing applications that MySQL helped enable. Businesses are grappling with massive data volumes that far exceed the capacity of single DBMS servers while simultaneously moving to Amazon or VMware. There is a whole new set of problems such as deploying in unstable cloud environments, adjusting to polyglot persistence, managing sharded data effectively, distributing data across multiple regions, and enabling real-time analytics on MySQL transactions. As a group, the MySQL community is well-positioned to address them.

If there is a problem, it is how to keep a strong multi-polar community going for as long as possible. Competition creates uncertainty for users, because change is a given. Pointy-haired bosses have to make decisions with incomplete information or even reverse them later. Competition is hard for vendors, because it is more difficult to make money in efficient markets. Competition even strikes against the vanity of community contributors, who have to try harder to get recognition. It is clear there will be pressures to make the community less competitive. They won't necessarily be from Oracle, which thrives on competition.

This gets back to the MariaDB Foundation reference that started this article. Anything that ensures long-term competitiveness and vitality of MySQL is good. Foundations in general seem well suited to this task. At Continuent we have already had some discussions about joining. So far we are undecided, for reasons that are somewhat similar to Peter Zaitsev's comments on this subject. If the MariaDB Foundation helps maintain a stable multi-polar community, we're in.

Dec 10, 2012

Slides from Percona Live London and a Request

Percona hosted another excellent Percona Live conference this past December 3-4 in London. It was my pleasure to deliver 3 talks including the first keynote following Peter Zaitsev. Percona does a great job of organizing these conferences--this year's London conference was well attended and in an excellent location in Kensington. My thanks to the entire Percona team for putting this together.

Here are the slides for my talks in case you would like to see them.

Keynote: Future-Proofing MySQL for the World-Wide Data Revolution -- Covering the greatly exaggerated death of MySQL and design patterns for robust MySQL systems that can last for decades

Talk: Why, What, and How of Data Warehouses for MySQL -- Why you need a data warehouse for MySQL, some standard choices, and how to move data in real time from MySQL to Vertica. There was even a demo of sysbench data replicating into Vertica automatically.

Talk: Multi-Master, Multi-Site MySQL Databases Made Easy with Continuent Tungsten -- How to build clusters that span multiple sites using multi-master and primary/DR techniques. I did this with Giuseppe Maxia, who did a couple of great demos along the way, including one that I found kind of terrifying to do in front of an audience. (It involved killing a lot of database servers, something Giuseppe does for a living.)

Speaking of talks, there are many conferences with database tracks coming up in the next few months. If you have not done a talk on MySQL before I would encourage you to think about submitting for upcoming conferences like future Percona Live events, OSCON, O'Reilly Strata, or one of the many local meet-ups. There are many people doing interesting things with open source databases. It's great to hear your stories, so speak up!

Sep 21, 2012

Data Fabrics and Other Tales: Percona Live and MySQL Connect

The fall conference season is starting. I will be doing a number of talks including a keynote on "future proofing" MySQL through the use of data fabrics. Data fabrics allow you to build durable, long-lasting systems that take advantage of MySQL's strengths today but also evolve to solve future problems using fast-changing cloud and big data technologies. The talk brings together ideas that Ed Archibald (our CTO) and I have been working on for over two decades. I'm looking forward to rolling them out to a larger crowd.

Here are the talks in calendar order. The first two are at MySQL Connect 2012 in San Francisco on September 30th:

CON11591 - Managing Worldwide Data with MySQL and Continuent Tungsten (Tech talk)
CON9319 - Replicating from MySQL to Oracle Database and Back Again (Tech talk)

MySQL Connect is an add-on to Oracle Open World. You know the conference is big if they have to use 5-digit codes to keep track of talk titles. It's almost worth the price of admission to look at Larry Ellison's boat. Well maybe not quite, but you get the idea.

Next up is the Percona Live MySQL Conference in New York on October 2nd:

Future-Proofing MySQL for the Worldwide Data Revolution (Keynote)
Solving Large-Scale Database Administration with Tungsten (Tech talk with Neil Armitage)

Percona has been doing an amazing job of organizing conferences. This will be my fifth one. The previous four were great. If you are in the New York area and like MySQL, sign up now. This conference is the single best way to get up to speed on state-of-the-art MySQL usage.

Sep 19, 2012

Database Failure Is Not the Biggest Availability Problem

There have been a number of excellent articles about the pros and cons of automatic database failover triggered by Baron's post on the GitHub database outage. In the spirit of Peter Zaitsev's article "The Math of Automated Failover," it seems like a good time to point out that database failure is usually not the biggest source of downtime for websites or indeed applications in general. The real culprit is maintenance.

Here is a simple table showing availability numbers out to 5 nines and what they mean in terms of monthly down-time.

Uptime	Downtime per 30-Day Month
0.9	3 days
0.99	07:12:00
0.999	00:43:12
0.9999	00:04:20
0.99999	00:00:26
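
The figures in the table follow directly from multiplying the unavailable fraction by the length of a 30-day month. Here is a minimal Python sketch of that calculation; the function name monthly_downtime_seconds is just an illustrative choice, and the results agree with the table to within a second of rounding.

```python
from datetime import timedelta

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 30-day month, as in the table


def monthly_downtime_seconds(uptime: float) -> float:
    """Return the downtime in seconds per 30-day month for an uptime fraction."""
    if not 0.0 <= uptime <= 1.0:
        raise ValueError("uptime must be a fraction between 0 and 1")
    return (1.0 - uptime) * SECONDS_PER_MONTH


if __name__ == "__main__":
    # Print one row per availability level, rounded to whole seconds.
    for uptime in (0.9, 0.99, 0.999, 0.9999, 0.99999):
        seconds = round(monthly_downtime_seconds(uptime))
        print(f"{uptime}\t{timedelta(seconds=seconds)}")
```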

Now let's do some math. We start with Peter's suggested number that the DBMS fails once a year. Let's also say you take a while to wake up (because it's the middle of the night and you don't like automatic failover), figure out what happened, and run your failover procedure. You are back online in an hour. Amortized over the year an hour of downtime is 5 minutes per month. Overall availability is close to 99.99% or four nines.

Five minutes per month is small potatoes compared to the time for planned maintenance. Let's say you allow yourself a one-hour maintenance window each month for DBMS schema changes, database version upgrades, and other work that takes the DBMS fully offline from applications. Real availability in this simple (and conservative) example is well below 99.9% or less than three nines. Maintenance accounts for over 90% of the downtime. The real key to improved availability is to be able to maintain the DBMS without taking applications offline.
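
To make the arithmetic concrete, here is a small Python sketch using the same assumptions as the text: one DBMS failure per year costing an hour, plus a one-hour maintenance window every 30-day month. The variable names are mine, not part of any tool.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def availability(downtime_minutes: float) -> float:
    """Fraction of the month the service is up, given minutes of downtime."""
    return 1.0 - downtime_minutes / MINUTES_PER_MONTH


# One 1-hour unplanned outage per year, amortized over 12 months.
failure_minutes = 60 / 12
# One 1-hour planned maintenance window per month.
maintenance_minutes = 60

unplanned_only = availability(failure_minutes)
combined = availability(failure_minutes + maintenance_minutes)
maintenance_share = maintenance_minutes / (failure_minutes + maintenance_minutes)

if __name__ == "__main__":
    print(f"failures only:        {unplanned_only:.4%}")
    print(f"with maintenance:     {combined:.4%}")
    print(f"share due to upkeep:  {maintenance_share:.1%}")
```

With these inputs, failures alone leave availability just short of four nines, while adding the maintenance window pushes it below three nines, with maintenance responsible for 60 of the 65 minutes of monthly downtime.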

We have been very focused on the maintenance problem in Tungsten. Database replication is a good start for enabling rolling maintenance where you work on one replica at a time. In Tungsten the magic sauce is an intervening connectivity layer that can transparently switch connections between DBMS servers while applications are running. You can take DBMS servers offline and upgrade safely without bothering users. Managing planned failover in this way is easier to solve than providing bombproof automatic failover, I am happy to say. It is also considerably more valuable for many users.

Sep 17, 2012

Automated Database Failover Is Weird but not Evil

GitHub had a recent outage due to malfunctioning automatic MySQL failover. Having worked on this problem for several years I felt sympathy but not much need to comment. Then Baron Schwartz wrote a short post entitled "Is automated failover the root of all evil?" OK, that seems worth a comment: it's not. Great title, though.

Selecting automated database failover involves a trade-off between keeping your site up 24x7 and making things worse by having software do the thinking when humans are not around. When comparing outcomes of wetware vs. software it is worth remembering that humans are not at their best when woken up at 3:30am. Humans go on vacations, or their cell phones run out of power. Humans can commit devastating unforced errors due to inexperience. For these and other reasons, automated failover is the right choice for many sites even if it is not perfect.

Speaking of perfection, it is common to hear claims that automated database failover can never be trusted (such as this example). For the most part such claims apply to particular implementations, not database failover in general. Even so, it is undoubtedly true that failover is complex and hard to get right. Here is a short list of things I have learned about failover from working on Tungsten and how I learned them. Tungsten clusters are master/slave, but you would probably derive similar lessons from most other types of clusters.

1. Fail over once and only once. Tungsten does so by electing a coordinator that makes decisions for the whole cluster. There are other approaches, but you need an algorithm that is provably sound. Good clusters stop when they cannot maintain the pre-conditions required for soundness, to which Baron's article alludes. (We got this right more or less from the start through work on other systems and reading lots of books about distributed systems.)

2. Do not fail over unless the DBMS is truly dead. The single best criterion for failure seems to be whether the DBMS server will accept a TCP/IP connection. Tests that look for higher brain function, such as running a SELECT, tend to generate false positives due to transient load problems like running out of connections or slow server responses. Failing over due to load is very bad as it can take down the entire cluster in sequence as load shifts to the remaining hosts. (We learned this through trial and error.)

3. Stop if failover will not work or better yet don't even start. For example, Tungsten will not fail over if it does not have up-to-date slaves available. Tungsten will also try to get back to the original pre-failover state when failover fails, though that does not always work. We get credit for trying, I suppose. (We also learned this through trial and error.)

4. Keep it simple. People often ask why Tungsten does not resubmit transactions that are in-flight when a master failover occurs. The reason is that there are many reasons why resubmission does not work on a new master and it is difficult to predict when such failures will occur. Imagine you were dependent on a temp table, for example. Resubmitting just creates more ways for failover to fail. Tungsten therefore lets connections break and puts the responsibility on apps to retry failed transactions. (We learned this from previous products that did not work.)
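
As an illustration of the second rule, here is a minimal Python sketch of a liveness check that only asks whether the server accepts a TCP/IP connection. It is not Tungsten code. The port default of 3306, the timeout, and the idea of requiring several consecutive refusals before declaring the server dead are all my assumptions for the example.

```python
import socket
import time


def accepts_tcp_connection(host: str, port: int = 3306,
                           timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def looks_dead(host: str, port: int = 3306, attempts: int = 3,
               timeout: float = 2.0, pause: float = 1.0) -> bool:
    """Declare the server dead only if every connection attempt fails.

    Repeating the probe is an assumption of this sketch; it is meant to
    avoid reacting to a single transient hiccup.
    """
    for attempt in range(attempts):
        if accepts_tcp_connection(host, port, timeout):
            return False
        if attempt < attempts - 1:
            time.sleep(pause)
    return True
```

Note what the check deliberately leaves out: it never logs in or runs a query, so a server that is merely slow or out of connections at the SQL level still counts as alive.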

Even if you start out with such principles firmly in mind, new failover mechanisms tend to encounter a lot of bugs. They are hard to find and fix because failover is not easy to test. Yet the real obstacle to getting automated failover right is not so much bugs but the unexpected nature of the problems clusters encounter. There is a great quote from J.B.S. Haldane about the nature of the universe that also gives a flavor of the mind-bending nature of distributed programming:

My own suspicion is that the universe is not only queerer than we suppose, but queerer than we can suppose.

I can't count the number of times where something misbehaved in a way that would never have occurred to me without seeing it happen in a live system. That is why mature clustering products can be pretty good while young ones, however well-designed, are not. The problem space is just strange.

My sympathy for the GitHub failures and everyone involved is therefore heartfelt. Anyone who has worked on failover knows the guilt of failing to anticipate problems as well as the sense of enlightenment that comes from understanding why they occur. Automated failover is not evil. But it is definitely weird.

Sep 3, 2012

Life in the Amazon Jungle

In late 2011 I attended a lecture by John Wilkes on Google compute clusters, which link thousands of commodity computers into huge task processing systems. At this scale hardware faults are common. Google puts a lot of effort into making failures harmless by managing hardware efficiently and using fault-tolerant application programming models. This is not just good for application up-time. It also allows Google to operate on cheaper hardware with higher failure rates, hence offers a competitive advantage in data center operation.

It's becoming apparent we all have to think like Google to run applications successfully in the cloud. At Continuent we run our IT and an increasing amount of QA and development on Amazon Web Services (AWS). During the months of July and August 2012 at least 3 of our EC2 instances were decommissioned or otherwise degraded due to hardware problems. One of the instances hosted our main website www.continuent.com.

In Amazon failures are common and may occur with no warning. You have minimal ability to avoid them or in some cases even understand the root causes. To survive in this environment, applications need to obey a new law of the jungle. Here are the rules as I understand them.

First, build clusters of redundant services. The www.continuent.com failure brought our site down for a couple of hours until we could switch to a backup instance. Redundant means up and ready to handle traffic now, not after a bridge call to decide what to do. We protect our MySQL servers by replicating data cross-region using Tungsten, but the website is an Apache server that runs on a separate EC2 instance. Lesson learned. Make everything a cluster and load balance traffic onto individual services so applications do not have to do anything special to connect.

Second, make applications fault-tolerant. Remote services can fail outright, respond very slowly, or hang. To live through these problems apply time-honored methods to create loosely coupled systems that degrade gracefully during service outages and repair themselves automatically when service returns. Here are some of my favorites.

If your application has a feature that depends on a remote service and that service fails, turn off the feature but don't stop the whole application.
Partition features so that your applications operate where possible on data copies. Learn how to build caches that do not have distributed consistency problems.
Substitute message queues for synchronous network calls.
Set timeouts on network calls to prevent applications from hanging indefinitely. In Java you usually do this by putting the calls in a separate thread.
Use thread pools to limit calls to remote services so that your application does not explode when those services are unavailable or fail to respond quickly.
Add auto-retry so that applications reconnect to services when they are available again.
Add catch-all exception handling to deal with unexpected errors from failed services. In Java this means catching RuntimeException or even Throwable to ensure it gets properly handled.
Build in monitoring to report problems quickly and help you understand failures you have not previously seen.
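
Several of these methods fit together in a few lines. The text mentions Java; the sketch below uses Python instead, and every name in it (call_with_timeout, call_with_retry, feature_or_fallback, the pool size, the backoff values) is a hypothetical choice for illustration rather than anything from a real library. It combines a bounded thread pool, a timeout on each call, auto-retry, catch-all exception handling, and turning a feature off when its service is down.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# A small, bounded pool caps how many remote calls can be in flight at once.
_POOL = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, *args, timeout_seconds=2.0):
    """Run func in the pool and stop waiting after timeout_seconds."""
    future = _POOL.submit(func, *args)
    # Raises a timeout error if the call does not finish in time.
    return future.result(timeout=timeout_seconds)


def call_with_retry(func, *args, attempts=3, timeout_seconds=2.0,
                    backoff_seconds=0.5):
    """Retry a failing call, doubling the pause between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return call_with_timeout(func, *args,
                                     timeout_seconds=timeout_seconds)
        except Exception as error:  # catch-all: failed services fail oddly
            last_error = error
            if attempt < attempts - 1:
                time.sleep(backoff_seconds * (2 ** attempt))
    raise last_error


def feature_or_fallback(func, *args, fallback=None, **retry_options):
    """Return fallback (feature off) instead of failing the whole app."""
    try:
        return call_with_retry(func, *args, **retry_options)
    except Exception:
        return fallback
```

One caveat with this approach: the timeout only stops the caller from waiting. A call that hangs still occupies a pool thread until it returns, which is exactly why the pool is kept small and bounded.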

Third, revel in failure. Netflix takes this to an extreme with Chaos Monkey, which introduces failures in running systems. Another approach is to build scaffolding into applications so operations fail randomly. We use that technique (among others) to test clusters. In deployed database clusters I like to check regularly that any node can become the master and that you can recover any failed node. However, this is just the beginning. There are many, many ways that things can fail. It is better to provoke the problems yourself than have them occur for the first time when something bad happens in Amazon.
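
Here is one way the scaffolding idea might look in Python: a decorator that makes an operation fail at random with a chosen probability. This is a sketch of the general technique only, not the scaffolding we use in Tungsten, and the names fail_randomly and InjectedFault are invented for the example.

```python
import functools
import random


class InjectedFault(RuntimeError):
    """Raised by the scaffolding instead of running the real operation."""


def fail_randomly(probability, rng=random):
    """Decorator: make the wrapped operation fail with the given probability.

    rng may be any object with a random() method, for example a seeded
    random.Random instance when a test run needs to be repeatable.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@fail_randomly(0.1)
def write_to_replica(row):
    """Stand-in for a real operation; fails about one call in ten."""
    return f"stored {row}"
```

Setting the probability to 0.0 disables the fault entirely and 1.0 makes the operation fail every time, which is handy for checking that the error path itself works.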

There is nothing new about these suggestions. That said, the Netflix approach exposes the difference between cloud operation and traditional enterprise computing. If you play the game applications will stay up 24x7 in this rougher landscape and you can tap into the flexible cost structure and rapid scaling of Amazon. The shift feels similar to using database transactions or eliminating GOTO statements--just something we all need to do in order to build better systems. There are big benefits to running in the cloud but you really need to step up your game to get them.

Aug 5, 2012

Is Synchronous Data Replication over WAN Really a Viable Strategy?

Synchronous data replication over long distances has the sort of seductive appeal that often characterizes bad ideas. Why wouldn't you want every local credit card transaction simultaneously stored on the other side of the planet far away from earthquakes, storms and human foolishness? The answer is simple: conventional SQL applications interact poorly with synchronous replication over wide area networks (WANs).

I spent a couple of years down the synchronous replication rabbit hole in an earlier Continuent product. It was one of those experiences that make you a sadder but wiser person. This article digs into some of the problems with synchronous replication and shows why another approach, asynchronous multi-master replication, is currently a better way to manage databases connected by long-haul networks.

Synchronous Replication between Sites

The most obvious problem with any form of synchronous replication is the hit on application performance. Every commit requires a round-trip to transfer and acknowledge receipt at the remote site, which in turn reduces single-thread transaction rates to the number of pings per second between sites. As a nice article by Aaron Brown demonstrates, you can show the effect easily using MySQL semi-synchronous replication between hosts in Amazon regions. Aaron's experiment measured 11.5 transactions per second between hosts with 85 millisecond latency, or about 100 times less than single-thread performance on a local master. At that latency you would theoretically expect transaction throughput of ~11.8 transactions per second (1000 / 85 ≈ 11.8), so the agreement between practice and theory is very close. It's great when science works out like this.
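
The theoretical bound is simple enough to write down. This Python sketch assumes, as the text does, that a single session can commit at most once per network round trip; the function name is an illustrative choice.

```python
def max_commits_per_second(round_trip_ms: float) -> float:
    """Upper bound on single-session commit rate when every commit
    waits for one network round trip of round_trip_ms milliseconds."""
    if round_trip_ms <= 0:
        raise ValueError("round trip time must be positive")
    return 1000.0 / round_trip_ms


if __name__ == "__main__":
    for latency_ms in (1, 85, 200):
        rate = max_commits_per_second(latency_ms)
        print(f"{latency_ms:>4} ms round trip -> {rate:.1f} commits/sec")
```

At 85 milliseconds the bound comes out just under 12 commits per second, and doubling the round trip time halves the rate.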

You might argue that applications could tolerate the slow rate assuming it were at least constant. Sadly that's not the case for real systems. Network response varies enormously between sites in ways that are quite easy to demonstrate.

To illustrate variability I set up an Amazon m1.small instance in the us-east-1 region (Virginia) and ran 24 hours of ping tests to instances in us-west-2 (Oregon) and ap-southeast-1 (Singapore). As the following graph shows, during a 4 hour period ping times to Singapore remain within a band but vary up to 10%. Ping times to Oregon on the other hand hover around 100ms but spike up randomly to almost 200ms. During these times, synchronous replication throughput would be cut by 50% to approximately 5 transactions per second.

Amazon ping times from us-east-1 to us-west-2 and ap-southeast-1 (240 minute interval)

Moreover, it's not just a question of network traffic. Remote VMs also become busy, which slows their response. To demonstrate, I ran two-minute sysbench CPU tests to saturate processing on the us-west-2 instance while observing the ping rate. Here is the command:

sysbench --test=cpu --num-threads=10 --max-time=120 --cpu-max-prime=1000000 run

As the next graph illustrates, CPU load has an unpredictable but substantial effect on ping times. As it happens, the ping variation in the previous graph may be due to resource contention on the underlying physical host. (Or it might really be network traffic--you never really know with Amazon.)

Effect of sysbench runs on ping times to US-West (20 minute interval)

Slow response on resource-bound systems is a problem that is familiar to anyone with experience with distributed systems, including systems where everything is on a single LAN. You cannot even count on clock ticks being delivered accurately within various types of virtual machines. The timing delays are magnified in WANs, as they already have high latency to begin with. Between busy hosts and network latency, it's reasonable to expect that at some point most systems would at least briefly experience single session transaction rates of 1 transaction per second or less. Even with parallelized replication you would see substantial backups on the originating DBMS servers as commits begin to freeze.

To complete the tale of woe, failures of various kinds can cause remote hosts to stop responding at all for periods of time that vary from seconds to days. Amazon is generally quite reliable but had two outages in the Virginia data center in June 2012 alone that brought down applications for hours to days. If you replicate synchronously to a host affected by such an outage, your application just stops and you no longer store transactions at all, let alone securely. You need to turn off synchronous replication completely to stay available.

So is synchronous replication really impossible between sites? Given the problems I just described it would be silly to set up MySQL semi-synchronous replication over a WAN for a real application. However, there are other ways to implement synchronous replication. Let's look at two of them.

First, there is Galera, which uses a distributed protocol called certification-based replication to agree on commit order between all cluster nodes combined with execution of non-conflicting transactions in parallel. Certification-based replication is a great algorithm in many ways, but Galera comes with some important practical limitations. First it replicates rows rather than statements. The row approach handles large transactions poorly, especially over distances, due to the large size of change sets. Also, not all workloads parallelize well, since transactions that conflict in any way must be fully serialized. Overall DBMS throughput may therefore reduce to the single-session throughput discussed above at unexpected times due to variations in workload. Finally, full multi-master mode between sites (as opposed to master/slave) is likely to be very problematic as nodes drop out of the cluster due to transient communication failures and require expensive reprovisioning. This is a general problem with group communications, which Galera depends on to order transactions.
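
For readers who have not met certification before, here is a toy Python sketch of the general idea. It is emphatically not Galera's implementation: the class names, the use of plain string keys for rows, and the rule of rejecting a conflicting transaction outright are all simplifying assumptions. The point is only that every node sees transactions in the same order and applies the same deterministic test, so all nodes reach the same verdict without further negotiation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transaction:
    snapshot_seqno: int    # last committed seqno the transaction had seen
    write_set: frozenset   # keys (for example row identifiers) it modified


@dataclass
class Certifier:
    seqno: int = 0                                    # last committed seqno
    last_writer: dict = field(default_factory=dict)   # key -> seqno of last write

    def certify(self, txn: Transaction) -> bool:
        """Accept txn only if no key it wrote was changed after its snapshot."""
        for key in txn.write_set:
            if self.last_writer.get(key, 0) > txn.snapshot_seqno:
                return False          # conflict: a later commit touched this key
        self.seqno += 1               # assign the next position in commit order
        for key in txn.write_set:
            self.last_writer[key] = self.seqno
        return True
```

Two transactions that start from the same snapshot and touch different rows both pass and could be applied in parallel. Two that touch the same row cannot both pass, which is where workload conflicts start to eat into throughput.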

Second, there are theoretical approaches that claim many of the benefits of synchronous replication without killing throughput or availability. One example is the Calvin system developed by Daniel Abadi and others, which seeks to achieve both strong transaction consistency and high throughput when operating across sites. The secret sauce in Calvin is that it radically changes the programming model to replicate what amount to transaction requests while forcing actual transaction processing to be under control of the Calvin transaction manager, which determines transaction order in advance across nodes. That should at least in principle reduce some of the unpredictability you may see in systems like Galera that do not constrain transaction logic. Unfortunately it also means a major rewrite for most existing applications. Calvin is also quite speculative. It will be some time before this approach is available for production systems and we can see whether it is widely applicable.

There's absolutely a place for synchronous replication in LANs, but given the current state of the art it's hard to see how most applications can use it effectively to link DBMS servers over WAN links. In fact, the main issue with synchronous replication is the unpredictability it introduces into applications that must work with slow and unreliable networks. This is one of the biggest lessons I have learned at Continuent.

The Alternative: Asynchronous Multi-Master Replication

So what are the alternatives? If you need to build applications that are available 24x7 with high throughput and rarely, if ever, lose data, you should consider high-speed local clusters linked by asynchronous multi-master replication between sites. Here is a typical architecture, which is incidentally a standard design pattern for Tungsten.

Local clusters linked by asynchronous, multi-master replication

The big contrast between synchronous and asynchronous replication between sites is that while both have downsides, you can minimize asynchronous multi-master problems using techniques that work now. Let's look at how async multi-master meets requirements and the possible optimizations.

Performance. Asynchronous replication solves the WAN performance problem as completely as possible. To the extent that you use synchronous or near-synchronous replication technologies it is on local area networks, which are extremely fast and reliable, so application blocking is minimal. Meanwhile, long-haul replication can be improved by compression as well as parallelization, because WANs offer good bandwidth even if there is high end-to-end latency.
Data loss. Speedy local replication, including synchronous and "near synchronous" methods, minimizes data loss due to storage failures and configuration errors. Somewhat surprisingly you do not need fully synchronous replication for most systems even at the local level--that's a topic for a future blog article--but replication does need to be quite fast to ensure local replicas are up-to-date. Actually, one of the big issues for avoiding local data loss is to configure systems carefully (think sync_binlog=1 for MySQL, for example).
Availability. Async multi-master systems have the delightful property that if anything interrupts transaction flow between sites, replication just stops and then resumes when the problem is corrected. There's no failover and no moving parts. This is a major strength of the multi-master model.
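
The stop-and-resume behavior is easy to picture with a toy model. The Python sketch below is a deliberately simplified, one-directional link, and the class and attribute names are made up for illustration; a multi-master setup would have one such link in each direction. Local commits never wait on the WAN, and the remote copy catches up from where it left off once the link returns.

```python
class AsyncReplicaLink:
    """Toy model of one direction of asynchronous replication."""

    def __init__(self):
        self.master_log = []   # ordered transactions committed locally
        self.replica = []      # transactions applied at the remote site
        self.link_up = True

    def commit(self, txn):
        """Commit locally right away; shipping happens as a separate step."""
        self.master_log.append(txn)
        self.ship()

    def ship(self):
        """Send everything the replica has not applied yet, if the link is up.

        Returns the number of transactions shipped.
        """
        if not self.link_up:
            return 0
        pending = self.master_log[len(self.replica):]
        self.replica.extend(pending)
        return len(pending)


if __name__ == "__main__":
    link = AsyncReplicaLink()
    link.commit("t1")
    link.link_up = False          # WAN outage: commits keep succeeding locally
    link.commit("t2")
    link.commit("t3")
    link.link_up = True           # outage over: replication simply resumes
    print(link.ship(), link.replica)
```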

So what are the downsides? Nothing comes for free, and there are at least two obvious issues.

Applicability. Not every application is compatible with asynchronous multi-master. You will need to do work on most existing applications to implement multi-master and ensure you got it right. I touched on some of the MySQL issues in an earlier article. If multi-master is not a possibility, you may need the other main approach to cross-site replication: a system-of-record design where applications update data on a single active site at any given time with other sites present for disaster recovery. (Tungsten also does this quite well, I should add.)
Data access. While you might not lose data it's also quite likely you might not be able to access it for a while either. It's rare to lose a site completely but not uncommon for sites to be inaccessible for hours to days. The nice thing is that with a properly constructed multi-master application you will at least know that the data will materialize on all sites once the problem is solved. Meanwhile, relax and work on something else until the unavailable site comes back.

In the MySQL community local clusters linked by asynchronous multi-master are an increasingly common architecture for credit card payment gateways, which I mentioned at the beginning of this article. This is a telling point in favor of asynchronous cross-site replication, as credit card processors have a low tolerance for lost data.

Also, a great deal of current innovation in distributed data management is headed in the direction of asynchronous mechanisms. NoSQL systems (such as Cassandra) tend to use asynchronous replication between sites. There is interesting research afoot, for example in Joe Hellerstein's group at UC Berkeley, to make asynchronous replication more efficient by accurately inferring cases where no synchronization is necessary. Like other research, this work is quite speculative, but the foundations are in use in operational systems today.

For now the challenge is to make the same mechanisms that NoSQL systems have jumped on work equally well for relational databases like MySQL. We have been working on this problem for the last couple of years at Continuent. I am confident we are well on the way to solutions that are as good as the best NoSQL offerings for distributed data management.

Jun 3, 2012

MySQL to Vertica Replication, Part 2: Setup and Operation

As described in the first article of this series, Tungsten Replicator can replicate data from MySQL to Vertica in real-time. We use a new batch loading feature that applies transactions to data warehouses in very large blocks using COPY or LOAD DATA INFILE commands. This second and concluding article walks through the details of setting up and testing MySQL to Vertica replication.

To keep the article reasonably short, I assume that readers are conversant with MySQL, Tungsten, and Vertica. Basic replication setup is not hard if you follow all the steps described here, but of course there are variations in every setup. For more information on Tungsten check out the Tungsten Replicator project at code.google.com as well as current Tungsten commercial documentation at Continuent.

Now let's get replication up and running!

What Is the Topology?

In this exercise we will set up Tungsten master/slave replication to move data from MySQL 5.1 to Vertica 5.1. Master/slave is the simplest topology to set up because you don't have to mix settings for different DBMS types in each service. To keep things simple, we will install Tungsten directly on the MySQL and Vertica hosts, which are named db1 and db2 respectively. Here is a diagram:

There are of course many other possible configurations. You can run replicators on separate hosts to reduce load on the DBMS servers. You can with a little patience set up direct replication using a single Tungsten replication service, which results in fewer processes to manage. Finally, you can use both direct as well as master/slave topologies to publish data from Tungsten 1.5 clusters. Tungsten clusters provide availability and scaling on the MySQL side.

MySQL to Vertica Replication, Part 1: Enabling Real-Time Analytics with Tungsten

Real-time analytics allow companies to react rapidly to changing business conditions. Online ad services process click-through data to maximize ad impressions. Retailers analyze sales patterns to identify micro-trends and move inventory to meet them. The common theme is speed: moving lots of information without delay from operational systems to fast data warehouses that can feed reports back to users as quickly as possible.

Real-time data publishing is a classic example of a big data replication problem. In this two-part article I will describe recent work on Tungsten Replicator to move data out of MySQL into Vertica at high speed with minimal load on DBMS servers. This feature is known as batch loading. Batch loading enables not only real-time analytics but also any other application that depends on moving data efficiently from MySQL into a data warehouse.

The first article works through the overall solution starting with replication problems for real-time analytics through a description of how Tungsten adapts real-time replication to data warehouses. If you are in a hurry to set up, just skim this article and jump straight to the implementation details in the follow-on article.

Replication Challenges for Real-Time Analytics

To understand some of the difficulties of replicating to a data warehouse, imagine a hosted intrusion detection service that collects access log data from across the web and generates security alerts as well as threat assessments for users. The architecture for this application follows a pattern that is increasingly common in businesses that have to analyze large quantities of incoming data.

Access log entries arrive through data feeds, whereupon an application server checks them to look for suspicious activity and commits results into a front-end DBMS tier of sharded MySQL servers. The front-end tier optimizes for a MySQL sweet spot, namely fast processing of a lot of small transactions.

If You Must Deploy Multi-Master Replication, Read This First

An increasing number of organizations run applications that depend on MySQL multi-master replication between remote sites. I have worked on several such implementations recently. This article summarizes the lessons from those experiences that seem most useful when deploying multi-master on existing as well as new applications.

Let's start by defining terms. Multi-master replication means that applications update the same tables on different masters, and the changes replicate automatically between those masters. Remote sites mean that the masters are separated by a wide area network (WAN), which implies high average network latency of 100ms or more. WAN network latency is also characterized by a long tail, ranging from seconds due to congestion to hours or even days if a ship runs over the wrong undersea cable.

With the definitions in mind we can proceed to the lessons. The list is not exhaustive but includes a few insights that may not be obvious if you are new to multi-master topologies. Also, I have omitted issues like monitoring replication, using InnoDB to make slaves crash-safe, or provisioning new nodes. If you use master/slave replication, you are likely familiar with these topics already.

1. Use the Right Replication Technology and Configure It Properly

The best overall tool for MySQL multi-master replication between sites is Tungsten. The main reason for this assertion is that Tungsten uses a flexible, asynchronous, point-to-point, master/slave replication model that handles a wide variety of topologies such as star replication or all-to-all. Even so, you have to configure Tungsten properly. The following topology is currently my favorite:

All-to-all topology. Each master replicates directly to every other master. This handles prolonged network outages or replication failures well, because one or more masters can drop out without breaking replication between the remaining masters or requiring reconfiguration. When the broken master(s) return, replication just resumes on all sides. All-to-all does not work well if you have a large number of masters.
Updates are not logged on slaves. This keeps master binlogs simple, which is helpful for debugging, and eliminates the possibility of loops. It also requires some extra configuration if the masters have their own slaves, as would be the case in a Tungsten Enterprise cluster.
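
To get a rough feel for why all-to-all stops being attractive as the number of masters grows, the sketch below counts replication streams. It assumes one one-way stream for every ordered pair of masters, which is my simplification; the function name is made up for illustration and is not part of Tungsten.

```python
def all_to_all_streams(masters: int) -> int:
    """Count one-way replication streams when every master feeds every other master."""
    if masters < 1:
        raise ValueError("need at least one master")
    return masters * (masters - 1)


for n in (2, 3, 4, 8):
    print(n, "masters ->", all_to_all_streams(n), "streams")
```

Under this assumption the stream count grows with the square of the number of masters, so each added site adds more to configure and monitor than the one before it.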

There are many ways to set up multi-master replication, and the right choice varies according to the number of masters, whether you have local clustering, or other considerations. Giuseppe Maxia has described many topologies, for example here, and the Tungsten Cookbook has even more details.

One approach you should treat with special caution is MySQL circular replication. In topologies of three or more nodes, circular replication results in broken systems if one of the masters fails. Also, you should be wary of any kind of synchronous multi-master replication across sites that are separated by more than 50 kilometers (i.e. 1-2ms latency). Synchronous replication makes a siren-like promise of consistency but the price you pay is slow performance under normal conditions and broken replication when WAN links go down.
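
The circular replication problem is easy to see in a toy model. The sketch below is only an illustration under my own assumptions: masters are letters, links are one-way (source, destination) pairs, and a failed master neither receives nor forwards changes. None of the names come from MySQL or Tungsten.

```python
def reachable(origin, links, failed):
    """Return the set of live masters that receive a change made on origin."""
    seen = {origin}
    frontier = [origin]
    while frontier:
        node = frontier.pop()
        for src, dst in links:
            # Follow a link only if it starts here and ends on a live, unseen master.
            if src == node and dst not in failed and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen - {origin}


masters = ["A", "B", "C"]
ring = [("A", "B"), ("B", "C"), ("C", "A")]                  # circular topology
mesh = [(s, d) for s in masters for d in masters if s != d]  # all-to-all topology

print("ring, B down:", sorted(reachable("A", ring, failed={"B"})))
print("mesh, B down:", sorted(reachable("A", mesh, failed={"B"})))
```

In the ring, a change made on A has to pass through B to reach C, so losing B cuts C off. In the all-to-all case A still reaches C directly.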

2. Use Row-Based Replication to Avoid Data Drift

Replication depends on deterministic updates--a transaction that changes 10 rows on the original master should change exactly the same rows when it executes against a replica. Unfortunately many SQL statements that are deterministic in master/slave replication are non-deterministic in multi-master topologies. Consider the following example, which gives a 10% raise to employees in department #35.

UPDATE emp SET salary = salary * 1.1 WHERE dep_id = 35;

If all masters add employees, then the number of employees who actually get the raise will vary depending on whether such additions have replicated to all masters. Your servers will very likely become inconsistent with statement replication. The fix is to enable row-based replication using binlog-format=row in my.cnf. Row replication transfers the exact row updates from each master to the others and eliminates ambiguity.
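
Here is a small simulation of that drift. It is a sketch with invented data, not a model of the MySQL binlog: each master is a dictionary of employee rows, "statement" mode re-runs the UPDATE on the other master, and "row" mode copies the changed row images instead. Salaries are integers so the 10% raise stays exact.

```python
import copy


def raise_dept(table, dep_id):
    """Run the UPDATE locally and return the row images it changed."""
    changed = {}
    for emp_id, row in table.items():
        if row["dep_id"] == dep_id:
            row["salary"] = row["salary"] * 11 // 10  # 10% raise in integer math
            changed[emp_id] = dict(row)
    return changed


def simulate(mode):
    """Return the final state of both masters for 'statement' or 'row' mode."""
    m1 = {1: {"dep_id": 35, "salary": 1000}}
    m2 = copy.deepcopy(m1)
    # Master 2 hires employee 2; the insert has not reached master 1 yet.
    new_row = {"dep_id": 35, "salary": 2000}
    m2[2] = dict(new_row)
    # Master 1 gives the raise before it has seen the new hire.
    changed = raise_dept(m1, 35)
    if mode == "statement":
        raise_dept(m2, 35)  # master 2 re-executes the SQL text
    else:
        for emp_id, row in changed.items():
            m2[emp_id] = dict(row)  # master 2 applies master 1's row images
    # The insert finally replicates to master 1.
    m1[2] = dict(new_row)
    return m1, m2


for mode in ("statement", "row"):
    m1, m2 = simulate(mode)
    print(mode, "consistent" if m1 == m2 else "DRIFT", m1, m2)
```

With statement replication the new hire gets the raise on one master only. With row replication both masters end up with the same rows.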

3. Prevent Key Collisions on INSERTs

For applications that use auto-increment keys, MySQL offers a useful trick to ensure that such keys do not collide between masters using the auto-increment-increment and auto-increment-offset parameters in my.cnf. The following example ensures that auto-increment keys start at 1 and increment by 4 to give values like 1, 5, 9, etc. on this server.

server-id=1
auto-increment-offset = 1
auto-increment-increment = 4
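
To see why these settings keep keys apart, the sketch below models the arithmetic for four masters, each with its own offset and a shared increment of 4. It assumes empty tables and ignores the finer points of how MySQL picks the next value, so treat it as an illustration of the idea rather than a description of server internals.

```python
from itertools import islice


def auto_increment_keys(offset, increment):
    """Yield keys spaced the way the offset/increment pair spaces them."""
    key = offset
    while True:
        yield key
        key += increment


masters = 4
first_keys = {
    server: list(islice(auto_increment_keys(server, masters), 5))
    for server in range(1, masters + 1)
}

for server, keys in first_keys.items():
    print("server", server, keys)
```

Each server draws from its own residue class modulo 4, so no two masters can generate the same key.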

This works so long as your applications use auto-increment keys faithfully. However, any table that either does not have a primary key or where the key is not an auto-increment field is suspect. You need to hunt them down and ensure the application generates a proper key that does not collide across masters, for example using UUIDs or by putting the server ID into the key. Here is a query on the MySQL information schema to help locate tables that do not have an auto-increment primary key.

SELECT t.table_schema, t.table_name
FROM information_schema.tables t
WHERE NOT EXISTS
  (SELECT * FROM information_schema.columns c
   WHERE t.table_schema = c.table_schema
     AND t.table_name = c.table_name
     AND c.column_key = 'PRI'
     AND c.extra = 'auto_increment');
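
For tables that turn up in this search, the application has to generate collision-free keys itself. The sketch below shows the two options mentioned above, UUIDs and keys that embed the server ID. The function names and the key layout are hypothetical; pick whatever format suits your schema.

```python
import uuid


def uuid_key() -> str:
    """Random UUID; a collision across masters is vanishingly unlikely."""
    return str(uuid.uuid4())


def server_scoped_key(server_id: int, local_sequence: int) -> str:
    """Combine the server ID with a counter that is unique on that server."""
    return f"{server_id}-{local_sequence}"


print(uuid_key())
print(server_scoped_key(1, 42))
print(server_scoped_key(2, 42))
```

The second form stays readable and sortable per server, but it only works if every master really has a distinct server ID.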

4. Beware of Semantic Conflicts in Applications

Neither Tungsten nor MySQL native replication can resolve conflicts, though we are starting to design this capability for Tungsten. You need to avoid them in your applications. Here are a few tips as you go about this.

First, avoid obvious conflicts. These include inserting data with the same keys on different masters (described above), updating rows in two places at once, or deleting rows that are updated elsewhere. Any of these can cause errors that will break replication or cause your masters to become out of sync. The good news is that many of these problems are not hard to detect and eliminate using properly formatted transactions. The bad news is that these are the easy conflicts. There are others that are much harder to address.

For example, accounting systems need to generate unbroken sequences of numbers for invoices. A common approach is to use a table that holds the next invoice number and increment it in the same transaction that creates a new invoice. Another accounting example is reports that need to read the value of accounts consistently, for example at monthly close. Neither example works off-the-shelf in a multi-master system with asynchronous replication, as they both require some form of synchronization to ensure global consistency across masters. These and other such cases may force substantial application changes. Some applications simply do not work with multi-master topologies for this reason.
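
The invoice number case can be reduced to a few lines. In this sketch, which uses made-up data structures rather than real tables, each site holds its own copy of the next-invoice counter and both sites bill a customer before either change has replicated.

```python
def issue_invoice(site):
    """Take the next invoice number from the site's local counter."""
    number = site["next_invoice"]
    site["next_invoice"] += 1
    site["invoices"].append(number)
    return number


# Both sites start from the same replicated state.
site_a = {"next_invoice": 100, "invoices": []}
site_b = {"next_invoice": 100, "invoices": []}

a = issue_invoice(site_a)
b = issue_invoice(site_b)
print("site A issued", a, "and site B issued", b)
```

Both sites hand out the same number. Spacing the counters apart avoids the duplicate but leaves gaps in the sequence, which is why this kind of requirement needs real synchronization.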

5. Remove Triggers or Make Them Harmless

Triggers are a bane of replication. They conflict with row replication if they run by accident on the slave. They can also create strange conflicts due to weird behavior/bugs (like this) or other problems like needing definer accounts present. MySQL native replication turns triggers off on slaves when using row replication, which is a very nice feature that prevents a lot of problems.

Tungsten on the other hand cannot suppress slave-side triggers. You must instead alter each trigger to add an IF statement that prevents the trigger from running on the slave. The technique is described in the Tungsten Cookbook. It is actually quite flexible and has some advantages for cleaning up data because you can also suppress trigger execution on the master.

You should regard all triggers with suspicion when moving to multi-master. If you cannot eliminate triggers, at least find them, look at them carefully to ensure they do not generate conflicts, and test them very thoroughly before deployment. Here's a query to help you hunt them down:

SELECT trigger_schema, trigger_name
FROM information_schema.triggers;

6. Have a Plan for Sorting Out Mixed Up Data

Master/slave replication has its discontents, but at least sorting out messed up replicas is simple: re-provision from another slave or the master. Not so with multi-master topologies--you can easily get into a situation where all masters have transactions you need to preserve and the only way to sort things out is to track down differences and update masters directly. Here are some thoughts on how to do this.

Ensure you have tools to detect inconsistencies. Tungsten has built-in consistency checking with the 'trepctl check' command. You can also use the Percona Toolkit pt-table-checksum to find differences. Be forewarned that neither of these works especially well on large tables and may give false results if more than one master is active when you run them.
Consider relaxing foreign key constraints. I love foreign keys because they keep data in sync. However, they can also create problems for fixing messed up data, because the constraints may break replication or make it difficult to go table-by-table when synchronizing across masters. There is an argument for being a little more relaxed in multi-master settings.
Switch masters off if possible. Fixing problems is a lot easier if you can quiesce applications on all but one master.
Know how to fix data. Being handy with SQL is very helpful for fixing up problems. I find SELECT INTO OUTFILE and LOAD DATA INFILE quite handy for moving changes between masters. Don't forget SET SESSION SQL_LOG_BIN=0 to keep changes from being logged and breaking replication elsewhere. There are also various synchronization tools like pt-table-sync, but I do not know enough about them to make recommendations.

At this point it's probably worth mentioning commercial support. Unless you are a replication guru, it is very comforting to have somebody to call when you are dealing with messed up masters. Even better, expert advice early on can help you avoid problems in the first place.

(Disclaimer: My company sells support for Tungsten so I'm not unbiased. That said, commercial outfits really earn their keep on problems like this.)

7. Test Everything

Cutting corners on testing for multi-master can really hurt. This article has described a lot of things to look for, so put together a test plan and check for them. Here are a few tips on procedure:

Set up a realistic pre-prod test with production data snapshots.
Have a way to reset your test environment quickly from a single master, so you can get back to a consistent state to restart testing.
Run tests on all masters, not just one. You never know if things are properly configured everywhere until you try.
Check data consistency after tests. Quiesce your applications and run a consistency check to compare tables across masters.
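
For small test data sets, the last step can be as simple as comparing per-table checksums once the applications are quiet. The sketch below is a toy version of that idea and is not how 'trepctl check' or pt-table-checksum work. It assumes each master's tables have already been fetched into memory as lists of row tuples.

```python
import hashlib

MODULUS = 2 ** 256


def table_checksum(rows):
    """Order-independent checksum: hash each row, then add the digests."""
    total = 0
    for row in rows:
        digest = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        total = (total + int.from_bytes(digest, "big")) % MODULUS
    return total


def differing_tables(master_a, master_b):
    """Return the names of tables whose contents differ between two masters."""
    names = sorted(set(master_a) | set(master_b))
    return [
        name for name in names
        if table_checksum(master_a.get(name, [])) != table_checksum(master_b.get(name, []))
    ]


site1 = {"emp": [(1, "ann"), (2, "bob")], "dept": [(35,)]}
site2 = {"emp": [(2, "bob"), (1, "ann")], "dept": [(36,)]}
print(differing_tables(site1, site2))
```

Adding the digests rather than hashing the rows in sequence means row order does not matter, so the two masters do not have to return rows in the same order.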

It is tempting to take shortcuts or slack off, so you'll need to find ways to improve your motivation. If it helps, picture yourself explaining to the people you work for why your DBMS servers have conflicting data with broken replication, and the problem is getting worse because you cannot take applications offline to fix things. It is a lot easier to ask for more time to test. An even better approach is to hire great QA people and give them time to do the job right.

Summary

Before moving to a multi-master replication topology you should ask yourself whether the trouble is justified. You can get many of the benefits of multi-master with system-of-record architectures with a lot less heartburn. That said, an increasing number of applications do require full multi-master across multiple sites. If you operate one of them, I hope this article is helpful in getting you deployed or improving what you already have.

Tungsten does a pretty good job of multi-master replication already, but I am optimistic we can make it much better. There is a wealth of obvious features around conflict resolution, data repair, and up-front detection of problems that will make life better for Tungsten users and reduce our support load. Plus I believe we can make it easier for developers to write applications that run on multi-master DBMS topologies. You will see more about how we do this in future articles on this blog.

Apr 22, 2012

Replication Is Bad for MySQL Temp Tables

Experienced MySQL DBAs know that temp tables cause major problems for MySQL replication. It turns out the converse is also true: replication can cause major problems for temporary tables.

In a recent customer engagement we enabled Tungsten Replicator with a MySQL application that originally ran on a server that did not use replication. QA promptly discovered reports that previously ran in 10 seconds were now running in as many minutes. It turned out that the reports used temp tables to assemble data, and these were being written into the master binlog. This created bloated binlogs and extremely slow reports. We fixed the problem by enabling row replication (i.e., binlog-format=row in my.cnf).

A common DBA response to temp table problems is to try to eliminate them completely, as suggested in the excellent High Performance MySQL, 3rd Edition. (See p. 502.) Elimination is a good philosophy when applications use temp tables to generate updates. However, it does not work for reporting. Temp tables allow you to stage data for complex reports across a series of transactions, then pull the final results into a report writer like JasperReports. This modular approach is easy to implement and maintain afterwards. Eliminating temp tables in such cases can create an unmaintainable mess.

The real solution with report temp tables is to keep them out of the master binlog. Here is a list of common ways to do so. Let me know if you know others.

* Turn off binlog updates. Issue 'SET SESSION SQL_LOG_BIN=0' when generating reports. The downside is that it requires SUPER privilege to set. Also, if you make a code mistake and update normal tables with this setting, your changes will not be replicated.

* Use a non-replicated database. Configure the master my.cnf with binlog-ignore-db as follows to ignore any update (including on temp tables) that is issued when database 'scratch' is the default database:

binlog_ignore_db = scratch

This approach does not require special privileges. However, coding errors or connection pool misconfigurations are obvious liabilities. Your application must either connect to the scratch database or issue an explicit use command. Otherwise, temp table operations will be logged, as in the following example:

use not_scratch;
create temporary table scratch.report1_temp(name varchar(256), entry_time date, exit_time date);

* Use a slave with the binlog disabled. Remove the log-bin option from my.cnf. This works well if you have extra reporting slaves that are caught up. However, it may not work if the reports must be fully up-to-date or you need the ability to promote the slave quickly to a master, in which case the binlog must be enabled.

* Use row replication. You can set row replication at the session level using 'SET SESSION binlog_format=row', which requires SUPER privilege, or overall by setting binlog-format in my.cnf. In this case CREATE TEMPORARY TABLE and updates on temp tables do not appear in the binlog at all. The downside of enabling row replication fully is that it can lead to bloated logs and blocked servers if you have very large transactions. SQL operations like DELETE that affect multiple rows are stored far more compactly in statement replication. Also, reloading mysqldump files can be very slow in row replication compared to statement replication, which can handle block inserts generated by the --extended-insert option.
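
The catch with the non-replicated database option is worth spelling out. The sketch below models the rule as I understand it for statement-based logging: only the session's default database is compared against the ignore list, and database names written inside the statement are not consulted. It assumes a default database has been selected, and the function name is mine, not MySQL's.

```python
def statement_is_logged(default_db: str, ignored_dbs: set) -> bool:
    """Model of binlog-ignore-db under statement-based logging.

    Only the default database is checked; the table named in the
    statement plays no part in the decision.
    """
    return default_db not in ignored_dbs


ignored = {"scratch"}

# CREATE TEMPORARY TABLE scratch.report1_temp ... issued after "use scratch"
print(statement_is_logged("scratch", ignored))

# The same statement issued after "use not_scratch"
print(statement_is_logged("not_scratch", ignored))
```

The second call is the trap from the example above: the table lives in scratch, but the statement is still logged because the default database is not_scratch.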

The proper solution to keep replication from hurting your use of temp tables will vary depending on your application as well as the way you run your site. For my money, though, this is a good example of where row replication really helps and deserves a closer look.

MySQL could use some feature improvements in the area of temp tables and replication. First, I find it surprising that mixed mode replication does not fully suppress temp table binlog updates. Only row replication does so. Second, it would be great to have a CREATE TABLE option to suppress logging particular tables to the binlog. This would allow applications to make the logging decision at schema design time. Finally, global options to suppress binlogging of specific table types, such as temp tables and MEMORY tables, would be useful. Perhaps we will see some of these in future MySQL releases.

Apr 14, 2012

Oracle Missed at MySQL User Conference...Not!

The MySQL UC this past week was the best in years. Percona did an outstanding job of organizing the main Percona Live event that ran Tuesday through Thursday. About 1000 people attended, which is up from the 800 or so at the O'Reilly-run conference in 2011. There were also excellent follow-on events on Friday for MariaDB/SkySQL, Drizzle, and Sphinx.

What made this conference different was the renewed energy around MySQL and the number of companies using it.

Big web properties like Facebook, Twitter, Google, and Craigslist continue to anchor the MySQL community and drive innovation from others through a combination of funding, encouragement, and patches.
Many new companies we have not heard from before like Pinterest, BigDoor, Box.net, and Constant Contact talked about their experience building major new applications on MySQL.
The vendor exhibition hall at Percona Live was hopping. Every vendor I spoke to had a great show and plans to return next year. There is great innovation around MySQL from many creative companies. I'm very proud my company, Continuent, is a part of this.
The demand for MySQL expertise was completely out of hand. So many talks ended with "...and we are hiring" that it became something of a joke. The message board was likewise packed with help wanted ads.

When Oracle acquired Sun Microsystems a couple of years ago, it triggered a lot of uncertainty about the future of MySQL. This concern turns out to be unfounded. Oracle does excellent engineering work, especially on InnoDB, but had no involvement either official or unofficial at the conference. This was actually a good thing.

By not participating, Oracle helped demonstrate that MySQL is no longer dependent on any single vendor and has taken on a real life of its own driven by the people who use it. MySQL fans owe Oracle a vote of thanks for not attending this year. Next year I hope they will be back to join the fun.

p.s., It has come to my attention since writing this article that 800 may not be the correct attendance figure for the O'Reilly 2011 conference. The 1000 figure is from Percona. Speaking as an attendee, the two conferences seemed about the same size. Please feel free to comment if you have accurate numbers.

Apr 3, 2012

Solving the Cloud Database Memory Conundrum

Cloud databases have a memory problem. Continuent has been doing a lot of Amazon deployments lately, and it is becoming apparent that memory utilization in those environments is more than just an inconvenience. In this article I would like to discuss the memory problem that we see in customer implementations and some new features of Tungsten Enterprise that help alleviate it for MySQL.

The Cloud Memory Problem and Database Arrays

As I discussed in a recent article about prefetch, the amount of RAM allocated to the InnoDB buffer pool is one of the principal determinants of MySQL performance. The speed difference between using a page in the buffer pool vs. getting it from storage is commonly about 100:1 on RAIDed disk. The other determinant is working set size, i.e., the percentage of pages that need to be memory-resident for adequate performance. Working set size is variable and depends on your query patterns as well as the level of indexing on tables. These two variables set a limit to the amount of data you can manage and still access information quickly. Here is a table that shows the relationship.

	Max GB of manageable storage
Buffer Pool Size (GB)	5% resident	10% resident	25% resident	50% resident
15	300	150	60	30
30	600	300	120	60
60	1200	600	240	120
120	2400	1200	480	240

The largest instance currently available from Amazon EC2 is a Quadruple Extra Large, which offers 68GB of RAM. Let's assume we allocate 85% of that to the buffer pool, which is about 58GB. We'll be generous and say 60GB so it matches my table. Assuming our application performs reasonably with 10% of pages resident we can then manage a maximum of 600GB of stored data per server.

The fact that EC2 instances are rather small understates the difficulty with RAM in Amazon. You also need more of it. Amazon EBS is slower than typical on-premise storage, such as your friendly direct-attached RAID, and has highly variable performance as Baron Schwartz pointed out. You might therefore be inclined to double the working set ratio to 20%, since I/O is so costly you need a bigger working set to minimize reads from storage. Quadruple Extra Large instances then max out at 300GB of managed data. I'm not forgetting SSDs, which are slowly appearing in cloud environments. They alter the working set ratios but introduce other issues, for example related to cost and capacity. It's also not clear how much faster they really are in shared environments like Amazon.

This simple storage math makes it obvious that managing data in the cloud requires a mental shift from fewer but larger DBMS servers to groups of much smaller servers, which we can describe as database arrays. Many small SaaS businesses generate a Terabyte of customer data in a year, hence would provision 4 new 300GB servers annually. Mobile and market automation apps generate many Terabytes per year. I recently did a capacity calculation for a customer where we hit 50 servers without even considering replicas to ensure availability.
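
The storage math is simple enough to put in a few lines. The sketch below reproduces the table and the server count above; it assumes a terabyte is 1000GB, and the function names are mine.

```python
def max_manageable_gb(buffer_pool_gb: int, resident_percent: int) -> float:
    """Data a server can hold if resident_percent of it must fit in the buffer pool."""
    return buffer_pool_gb * 100 / resident_percent


def servers_needed(data_gb: int, per_server_gb: int) -> int:
    """Round up, since a partly filled server still has to be provisioned."""
    return -(-data_gb // per_server_gb)


for pool in (15, 30, 60, 120):
    print(pool, [max_manageable_gb(pool, pct) for pct in (5, 10, 25, 50)])

print("Quadruple Extra Large at 20% resident:", max_manageable_gb(60, 20), "GB")
print("Servers for 1TB at 300GB each:", servers_needed(1000, 300))
```

Changing the resident percentage is the quickest way to see how sensitive the server count is to slow storage.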

Database arrays therefore address memory by partitioning storage. Many applications already partition into shards for other reasons. In this case, the main question is how the applications find data easily. Locating "virtualized databases" is a non-trivial problem, especially when you consider that servers may fail, move, or go offline for maintenance. The more servers you have, the more likely the special cases become. This is where Tungsten Enterprise comes in.

Tungsten Enterprise 1.5 Features for Cloud Database Arrays

Tungsten Enterprise builds clusters by taking a set of off-the-shelf DBMS servers linked by replication and making them look like a single virtualized DBMS server to applications. We call this server a data service. The key to the single-DBMS illusion is the Tungsten Connector, a fast proxy that sits between applications and database hosts. It routes connections into the right type of server, for example a master for writes or a slave for reads. It also allows you to switch masters without stopping applications or losing data.

Tungsten Enterprise already offers several nice properties for cloud operation. The Connector is software-only and does not use VIPs, which are not supported in Amazon anyway. Also, the fact that you can switch master and slaves without stopping applications makes it possible to "escape" slow EBS volumes. That said, previous Tungsten versions limited connections to a single data service per connector, which made database arrays hard to support properly.

Tungsten Enterprise 1.5 adds two features to help with constructing and managing arrays. First, our colleague Jeff Mace created a great Ruby-based install program called tpm that can install complex cluster topologies from the command line. If you have used the tungsten-installer program on Tungsten Replicator, you have an idea what is on the way. (If not, Jeff will be talking about installations at the Percona Live MySQL Conference on April 11.) Efficient command-line installations make it easy to set up and manage a lot of clusters in parallel.

Second, thanks to work by Ed Archibald and Gilles Rayrat, the Connector now supports multi-service connectivity. Connectors previously kept track of cluster state by connecting to one of a list of managers for the local data service. We extended this to allow Connectors to track state in multiple data services. We developed the feature to support disaster recovery sites, which are another Tungsten 1.5 feature. As it turns out, though, multi-service connectivity is extremely useful for database arrays in the cloud. The following diagram compares single and multi-data service connectivity.

As with a single data service the Connector receives state changes providing location of the data service master and slaves as well as whether each is online or offline. The Connector uses a simple routing scheme based on logins to route SQL connections to the correct server for whatever it is doing. Beyond that, Tungsten Enterprise takes care of switching masters and load balancing reads that are necessary to use an array efficiently.

Setting up Multi-Service Connectivity

Multi-service access is trivial to set up. This is documented more fully in the forthcoming Tungsten Enterprise 1.5 documentation, but here is the basic idea.

First, each Tungsten Enterprise installation contains a file called dataservices.properties in the cluster-home/conf directory. The Connector uses this file to locate data service managers in order to get cluster state changes. To enable multi-service connectivity, just ensure there are entries for the managers of each data service in your array before starting the local Connector service.

cluster1=pr1c1.bigdata.com,pr2c1.bigdata.com,pr3c1.bigdata.com
cluster2=pr1c2.bigdata.com,pr2c2.bigdata.com,pr3c2.bigdata.com
cluster3=pr1c3.bigdata.com,pr2c3.bigdata.com,pr3c3.bigdata.com
...

Second, the Connector uses a file called user.map located in tungsten-connect/conf to define logins. Edit this file and add a different login for each cluster to user.map. Save the file and the connector will pick it up automatically.

# user password service
cluster1 secret cluster1
cluster2 secret cluster2
cluster3 secret cluster3
...

That's about it. Applications need to make one change, namely to use a different login for each data service. For some applications, that is quite simple and can be done in a few minutes. I found to my surprise that some customer applications even do it already, which is very convenient. For others, it may require more extensive changes. The good news is that improvements are on the way.
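
On the application side, the change might look something like the sketch below. Everything in it is an assumption made for illustration: the connector host name, the idea of placing customers on clusters by ID modulo the cluster count, and the function name. Only the login names and password come from the user.map example above.

```python
# Logins mirror the user.map example; the connector host name is hypothetical.
CLUSTERS = ["cluster1", "cluster2", "cluster3"]
CONNECTOR_HOST = "connector.example.com"


def connection_params(customer_id: int) -> dict:
    """Pick the data service for a customer and return the login to use."""
    service = CLUSTERS[customer_id % len(CLUSTERS)]
    # The Connector routes on the login, so the host stays the same for every service.
    return {"host": CONNECTOR_HOST, "user": service, "password": "secret"}


for customer in (3, 4, 5):
    print(customer, connection_params(customer))
```

Applications that already keep a lookup table of customer-to-database assignments would use that table in place of the modulo.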

What's Next

Tungsten Enterprise 1.5 is a solid step toward enabling large-scale data management using MySQL database arrays. However, we can do much more. Obviously, it would be easier if the Connector routed to the right DBMS server using the database name on the connect string or by detecting use commands in SQL. That's on the way already.

More generally, the whole point of Tungsten is to make off-the-shelf DBMS work better by adding redundancy and optimizing use of resources without forcing users to migrate or rearrange data. Memory is a key resource in cloud environments and deserves special attention. Tungsten connectivity looks like a powerful complement to the caching that already occurs within databases. It's also a natural extension of the work we have been doing to speed up replication.

Looking to the future, we are investigating more intelligent load balancing that sends queries to servers where the pages they need are most likely to be already in the buffer pool. This type of optimization can double or triple effective buffer pool size in the best cases. Query caching also offers potential low-hanging fruit. Finally, management parallelization is critical for array operation. I foresee a long list of Tungsten improvements that are going to be fun to work on and that go a long way to solving the cloud database memory conundrum.

p.s., Continuent has a number of talks scheduled on Tungsten-related topics at the Percona Live MySQL Conference and the SkySQL MariaDB Solutions Day April 10-13 in Santa Clara. For more on Tungsten Enterprise and EC2 in particular, check out Ed Archibald's talk on April 11. Giuseppe Maxia and I will also touch on these issues in our talk at the SkySQL/MariaDB Solutions Day.