Sunday, October 24, 2010

Parallel Replication on MySQL: Report from the Trenches

Single-threaded apply is one of the big downsides of MySQL's built-in replication, as Baron Schwartz pointed out a couple of days ago.  While a master can process dozens of updates at once, slaves must apply them one after the other on a single thread.  Add in disk I/O, and the result is very slow performance indeed.  The obvious answer is parallel apply, namely writing multiple non-conflicting updates to the slave at once.

I have spent the last few months implementing parallel apply for Tungsten 2.0, which we are now testing at customer sites.  In this article I would like to describe how Tungsten's parallel apply works as well as some of the lessons that have become apparent through the implementation.

There are a couple of big challenges in parallel apply.  There is of course the practical problem of separating transactions into parallel streams, for example splitting them by database.  This is known as sharding.   Row updates are easy enough but MySQL also has statement replication.  Transactions with statements require parsing, and there are ambiguous cases.  If that's not enough, features like LOAD DATA INFILE have a complex implementation in the binlog and require specialized logic to shard correctly.  In addition, parallel apply of any kind has a lot of corner cases that you have to solve completely or risk unpredictable failures.  Here's an example:  skipping transactions on the slave.  You have to wait for the event, but what if some of the threads are already past it when you ask to skip?  How do you synchronize access to the list of transactions to skip without creating a choke point for threads?  

The next challenge is performance.  Parallel replication offers a rich choice of ways to lower throughput, not raise it.  Multiple disk logs are the best I have found so far, as they can convert sequential reads and writes on the disk log to random I/O when more replication threads contend for different parts of the disk.  Implementing multiple queues in memory is far faster and simpler but limits the queue sizes.  Another excellent way to slow things down is to try to parallelize SQL transactions with a lot of dependencies, which means you end up effectively serialized *and* paying the extra cost of parsing transactions and synchronizing threads.  In this case it can be better to keep everything sequential but use block commit to apply 50 or 100 transactions simultaneously on the slave.

With all that said, the parallel apply problem is still quite tractable, but you need to pick your battles carefully.  Tungsten's parallel apply implementation has a very clear problem focus:  speeding up slave updates for multi-tenant applications that have a high degree of natural partitioning and concurrent updates across customers.  This is not as limiting as it might sound to readers unfamiliar with MySQL.  SaaS applications for the most part follow the multi-tenant model on MySQL, with each customer assigned to a particular database.  So do large ISPs or cloud providers that host customers on shared servers using separate databases.

Tungsten parallel apply is based on automatic sharding of transactions.   The following diagram shows the parallel apply algorithm conceptually. 
Tungsten Parallel Apply
Tungsten has a flexible architecture based on replication pipelines, described in a previous article on this blog.  To recap the model, pipelines are divided into stages, which represent processing steps.  Each stage consists of an extract-filter-apply loop with symmetric interfaces and identical processing logic for each stage.  The parallel apply implementation builds on replication pipelines as follows:
  1. A new filter called EventMetadataFilter automatically parses incoming transactions to figure out which database(s) they affect.  This is simple for row updates but involves parsing for statements and specialized extract handling for odd-ball operations like LOAD DATA INFILE. 
  2. The shard ID is assigned from the database name. This is glommed into the EventMetadataFilter but will shortly be broken out into a separate filter so that it is possible to support alternate shard assignment algorithms. 
  3. There is a new kind of in-memory buffer between stages called a ParallelQueue that supports multiple queues that feed the final apply stage.   Stages have a corresponding extension to allow them to have multiple threads, which must match the number of parallel queues or you get an error. 
  4. The ParallelQueue implementation calls a new component called a Partitioner to assign transactions a partition number (i.e., a parallel queue).  You can substitute different algorithms by providing different partitioner implementations.  The default implementation uses a configuration file called shard.list to map shards to queues.  Unless you say otherwise it hashes on the shard ID to make this assignment.
Extensions #1 and #2 run on the master, while #3 and #4 run on the slave.  I really like diagrams, so here is a picture of the fully implemented parallel apply architecture.  The master replicator extracts, assigns the shard, and logs each transaction.  The slave replicator fetches transactions, logs them locally, then applies in parallel.
Full Master/Slave Architecture for Parallel Apply
So how does this work?  Pretty well actually.  Local lab tests indicate that parellel apply roughly doubles throughput on a multi-database TPC-B benchmark we use for testing.   We should be able to publish some real-world performance numbers in the near future, but so far things look quite promising.  During the implementation a number of interesting issues have arisen, which I would like to discuss now.

The first issue is the ratio between parallel apply threads and shards.  While it might seem obvious to have a thread per shard, in real deployments the situation is not so clear.  For one thing actual deployments in SaaS and ISP situations often have hundreds or even thousands of databases, which has a number of practical consequences for implementation.  Less obviously, spreading transactions thinly across a lot of queues means fewer opportunities to use block commit, hence more work for slave servers and less overall throughput.  Performance optimization is a very uncertain matter, so Tungsten lets users configure the ratio.

Dependencies between shards are yet another issue.  While I mentioned that Tungsten is designed for applications with "a high degree of natural partitioning," dependencies between databases as well as individual transactions do occur and cannot be ignored.  For example, many SaaS applications have reference data that are used by all customer databases.  Even if parallel SQL works here, applications may get sick from seeing updates appear in the wrong order.  Or you could have global operations like CREATE USER that affect all databases.  Or you might not be able to tell which shard a piece of SQL belongs to.  Tungsten allows users to declare reference databases and automatically serializes these databases as well as global or "don't know" cases. 

There are also numerous issues around startup and shutdown.  Remember how MySQL replication slaves will not restart after unclean shutdown with open temp tables?  (If not, take a quick break and read this now.  You'll thank me later.)  Parallel apply introduces similar issues, because you have multiple threads all updating different positions in the database.  Tungsten handles crash recovery by tracking the apply position of each queue in InnoDB and then recommencing from that point on restart in each queue.  I am putting finishing touches on clean shutdown, which ensures that all queues are empty, much like automatically checking that temp tables are closed on MySQL.  

In short, over the last few months Tungsten has climbed a fair distance up a pretty big hill to get parallel apply to work.  The flexibility of the replicator architecture, particularly pipelines, has been very helpful as it is quite easy to extend.  The parallelization algorithm builds on terrific work by other colleagues at Continuent, especially Stephane Giron and Linas Virbalas.  They have both put enormous effort into building up MySQL and PostgreSQL replication capabilities.

Here are a couple of parting thoughts about parallelization based on the experience so far.

Thought number one:  parallel replication is not magic.  To use parallel apply effectively, applications need to play nice:  mostly short transactions and not too many dependencies between shards are the biggest requirements to see a substantial boost in throughput.  For example, if you let one user write 50M statements to the binlog in a single transaction, things are going to get kind of quiet on the slave no matter what you do.  Also, you can forget about MyISAM or other non-transactional engines.  As I have written before, these engines offer a number of opportunities for databases to get messed up or out-of-sync even using conventional MySQL replication.  Tungsten's block commit and parallel apply increase the window for problems significantly.  If you are still using MyISAM for replicated data, it's time to man up and convert to InnoDB. 

Thought number two: The long-term answer to effective parallel replication is to change how MySQL works by interleaving transactions within the binlog along the lines suggested by Kristian Nielsen and others.  MySQL currently completely serializes transactions to the binlog, an accomplishment that makes slave apply logic a lot simpler.   Tungsten parallel apply then has to undo this good work and recreate streams of non-conflicting updates, which is complex and does not help all workloads.

It is doubtful that replicating interleaved transactions will be less complex than handling a serial binlog as it stands today.  There is also some heavy lifting inside MySQL to get to an interleaved binlog.  However, interleaved transactions would have the advantage that transactions for any workload would be parallelized, which would widen the scope of benefits to users.  I'm happy to see that Kristian and other people are now working this feature for future releases of MySQL.

Meanwhile, we have a workable solution for Tungsten and are pushing it forward as quickly as we can.  Contact Continuent if you would like to test it out.

Saturday, October 16, 2010

MySQL Disaster Recovery With Tungsten

Disaster recovery (DR) is not the first thing most DBAs think of when putting up a new database application.   However, it's one of the top issues for people using the data--what happens if the site goes down and everything disappears?   So even if DR is not the first issue in every deployment, it is a very high priority as soon as your application is the least bit successful.

At the database level DR has a fairly simple solution:  keep copies of data on a backup site that is up-to-date at all times.  This article explains the architecture for MySQL DR with Tungsten and a couple of key features that make it work, namely floating IP addresses and global transation IDs.  We will dig into those at the end.

First a bit of introduction.  Tungsten manages clusters of off-the-shelf database connected by master/slave replication.  There are replication and management services on each host with automated policies to handle failover as well as low-level tasks like recognizing new cluster members.  There is a simple management client that lets you log into any host and manage all nodes in the cluster.  Tungsten also has connectivity options to let applications find databases easily.  However, for this article we are going to focus on the database only and how you solve the problem of ensuring your data are protected. 

DR Setup

To implement disaster recovery, you actually create two clusters--one on your main site and one on a backup site which we will henceforth call the DR site.  It looks like the following picture.

Standard Main/DR Architecture with Backups
Here is an outline of the setup.  There are additional details of course but those are covered in Tungsten documentation and support procedures.  The goal here is to give you a sense of how things work at the top level. 
  1. Main site.  Set up the main site cluster as a master/slave pair with a floating IP address on the master.  Enable automatic policy mode so that in the event of a master failure the local slave will immediately take over.  Set up backups and run them on the slave on a regular basis. 
  2. DR site.  Next, set up the DR cluster by provisioning both databases with a recent backup from the main cluster.  Configure it identically to the main site with a master IP address and with backups but with two exceptions.  First, use manual policy mode so that the cluster does not try to fail over. Second, do not start replication automatically.  Instead, manually configure the DR master to be a slave of the main site master using the master floating IP address and start services.  Set up backups on this site as well. 
Handling Failures

At the end of setup you have a main site with a cluster and a DR site with a cluster that slaves efficiently off the main site master.  Both sites have regular backups.  As long as there are no failures, you operate both sites and everything is fine.  Let us now consider a couple of different types of failures and how to handle them.

Let's suppose the main site master fails.  Tungsten will automatically fail over to the main site slave and move the master floating IP address.  The DR site relay slave TCP/IP connection to the master will then break, or more accurately time out.  When the relay slave reconnects to the floating IP,  it will have shifted to the new master and replication to the DR site will continue without any human intervention.
Failed Master on Main Site
This protocol is handy because failures are not the only reason that the main site master may move.  You can also move masters for maintenance or upgrades.  Tungsten has a switch command that makes this very easy to do.  The floating IP moves as before and the DR site continues to receive updates properly after it reconnects. 

If you lose the main site, you initiate a site switch procedure.  At the database level this consists of running a script to "unconfigure" your DR relay slave node so that it becomes a master again and then reload the configuration.  When the node comes up as a master it will then automatically install its own master floating IP address.  The commands are simple and run in a few seconds.  In most cases it will take a lot longer to switch applications properly than switch databases, because you have to change DNS entries, start and/or reconfigure applications, and potentially activate other resources to have a functioning system. 

In fact, the real problem with site failover at the database level is not so much failing over but getting the main site back in operation without losing too much data and with as little interruption to users as possible.  You first need to check for any transactions that did not make it off the main site and apply them to the DR site master.  In MySQL you can do this by carefully applying transactions from the main site binlog.  You can help yourself considerably by including a step in the site failover process where you fence (i.e., turn off) the old site as quickly as possible by shutting down applications and taking applications offline.   The fewer extra transactions on the main site, the simpler it is to clean up.

Next, you need to get the master site resynchronized with the slave.  If there are more than a few differences, you will probably just restore the main site master and slave from local backups, then manually configure them to make the main site master a relay slave of the DR site.    If you have large databases, you may want to look at SAN or NAS products like NetAPP that offer snapshot capabilities.  I have been working lately with NetApp; the snap restore command is really impressive for rolling back file system state quickly. 

DR Site Operation and Main Site Recovery
Once the main site is caught up, you can switch applications back the main site by taking a short outage to move applications.  This step is not fully transparent, but unlike the original DR failover, you get to pick the time that is least inconvenient for your users.  Also, you can use Tungsten features like consistency checks to verify that data are consistent across sites. 

Underlying Tungsten Features to Enable DR

As promised at the beginning, here is a look at the Tungsten features that make DR work.  First, there is automated failover with floating IP address management.  Tungsten uses a rules engine combined with group communications to manage failover and master floating IPs efficiently.  The rules take care of many of the weird failure cases as well as handling tasks like automatically making slave servers readonly, etc.  Setting up DR without floating IP addresses is more complex because it means your relay slave needs to know when the main site master moves for any reason.

As useful as floating IP addresses are, Tungsten has a much more important feature that underlies the entire DR site architecture:  global transaction IDs.   Unlike native MySQL replication, Tungsten assigns a global ID or seqno to each transaction as it is read from the binlog.  Tungsten replicator processes track position using the seqno values rather than the file name and offset used by MySQL slaves.  Here is a picture that illustrates how the replicator log works.
Global IDs, Epoch Numbers, and Backups
As already mentioned, the Tungsten master replicator assigns the seqno to each transaction as it is extracted.  Tungsten slave replicators always use the seqno to request the next event from the master.  This means that you can switch the master without worrying whether slaves will lose track of their positions, because they will just ask for the seqno from the new master.

The other important feature of global IDs is that they make backups fungible across different databases and even sites.  Tungsten marks the database with the current seqno and epoch number.  As long as your backup (or file system snapshot) is transactionally consistent, you can load it on any server and bring it back online as a slave.  The new slave will connect to and catch up with the master, wherever it happens to be.  This makes database recovery both simple and very flexible.

The phrase "transactional consistency" brings up another issue.  To make the disaster recovery architecture work reliably I strongly recommend you switch to InnoDB or another fully transactional engine.  MyISAM does not have a place in this architecture--there are just too many ways to end up with corrupt data and a massive outage.

There is one final aspect of Global IDs in Tungsten that is worth mentioning.  What if the master log is corrupted or a slave from a different cluster accidentally logs into the wrong master?  In both cases the slave could get bad data if it just asked for the next seqno without some how checking that the master and slave logs are somehow consistent.  This would at best lead to errors and in the worst case to badly messed up data.

Tungsten deals with log consistency problems using epoch numbers. Whenever the master goes online it sets a new epoch number, which works like a parity check on the sequence number.   Each time a slave connects to the master, it offers the last seqno it received along with the epoch number.  If the values match the same seqno/epoch number in the master log, we assume the logs have the same master and proceed.  Otherwise, we assume somebody is confused and do not allow the slave to fetch transactions. 


DR site setup is complex and this article obviously glosses over a lot of details even for databases.  One final bit of advice is that whatever you do, test the daylights out of it before deploying.  Site failures may be karmic but dealing with them is certainly not.  Site failover is a really bad time to find out you don't have the password to your DNS provider handy or that you have a network configuration problem on the DR site.  One customer I know put all the computers from his main site and DR site in a pile on his conference room table and tested (and retested and retested and retested) until he was completely satisfied with the results.  That is the soul of true disaster recovery.

Scaling Databases Using Commodity Hardware and Shared-Nothing Design