Oct 24, 2010

Parallel Replication on MySQL: Report from the Trenches

Single-threaded apply is one of the big downsides of MySQL's built-in replication, as Baron Schwartz pointed out a couple of days ago.  While a master can process dozens of updates at once, slaves must apply them one after the other on a single thread.  Add in disk I/O, and the result is very slow performance indeed.  The obvious answer is parallel apply, namely writing multiple non-conflicting updates to the slave at once.

I have spent the last few months implementing parallel apply for Tungsten 2.0, which we are now testing at customer sites.  In this article I would like to describe how Tungsten's parallel apply works as well as some of the lessons that have become apparent through the implementation.

There are a couple of big challenges in parallel apply.  There is of course the practical problem of separating transactions into parallel streams, for example splitting them by database.  This is known as sharding.   Row updates are easy enough but MySQL also has statement replication.  Transactions with statements require parsing, and there are ambiguous cases.  If that's not enough, features like LOAD DATA INFILE have a complex implementation in the binlog and require specialized logic to shard correctly.  In addition, parallel apply of any kind has a lot of corner cases that you have to solve completely or risk unpredictable failures.  Here's an example:  skipping transactions on the slave.  You have to wait for the event, but what if some of the threads are already past it when you ask to skip?  How do you synchronize access to the list of transactions to skip without creating a choke point for threads?  
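To make the skip problem concrete, here is a minimal sketch in Java of one way to share a skip list across apply threads without turning it into a choke point.  It is illustrative only and not Tungsten's actual code: a skip request is refused once the slave has already committed past the event, and a concurrent set lets each apply thread check and consume requests without a global lock.

```java
import java.util.concurrent.ConcurrentSkipListSet;

// Illustrative sketch, not Tungsten's implementation: a shared skip list that
// parallel apply threads consult without serializing on a single lock.
public class SkipPolicy {
    // Sequence numbers the administrator asked the slave to skip.
    private final ConcurrentSkipListSet<Long> seqnosToSkip =
            new ConcurrentSkipListSet<Long>();

    // Called from the admin interface.  Refuses the request if some apply
    // thread has already committed at or beyond the event in question.
    public boolean requestSkip(long seqno, long lowestUncommittedSeqno) {
        if (seqno < lowestUncommittedSeqno) {
            return false;
        }
        return seqnosToSkip.add(seqno);
    }

    // Called by each apply thread before it applies a transaction.  remove()
    // is atomic, so exactly one thread honors a given skip request and stale
    // entries do not accumulate.
    public boolean shouldSkip(long seqno) {
        return seqnosToSkip.remove(seqno);
    }
}
```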

The next challenge is performance.  Parallel replication offers a rich choice of ways to lower throughput, not raise it.  Multiple disk logs are the best I have found so far, as they can convert sequential reads and writes on the disk log to random I/O when multiple replication threads contend for different parts of the disk.  Implementing multiple queues in memory is far faster and simpler but limits the queue sizes.  Another excellent way to slow things down is to try to parallelize SQL transactions with a lot of dependencies, which means you end up effectively serialized *and* paying the extra cost of parsing transactions and synchronizing threads.  In this case it can be better to keep everything sequential but use block commit to group 50 or 100 transactions into a single commit on the slave.
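To illustrate the block commit alternative, here is a hedged sketch assuming a plain JDBC connection and a list of already-serialized transactions.  It is not Tungsten's implementation; it only shows the principle of paying for one commit (and one fsync on the slave) per block rather than one per transaction.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Minimal block commit sketch: execute many transactions, commit once per block.
public class BlockCommitApplier {
    private final Connection conn;
    private final int blockSize;   // e.g., 50 or 100

    public BlockCommitApplier(Connection conn, int blockSize) throws SQLException {
        this.conn = conn;
        this.blockSize = blockSize;
        conn.setAutoCommit(false); // the applier decides when to commit
    }

    // Applies a batch of serialized transactions, committing once per block.
    public void applyBlock(List<String> transactions) throws SQLException {
        int inBlock = 0;
        try (Statement stmt = conn.createStatement()) {
            for (String sql : transactions) {
                stmt.execute(sql);
                if (++inBlock >= blockSize) {
                    conn.commit();  // one commit covers up to blockSize transactions
                    inBlock = 0;
                }
            }
        }
        conn.commit();              // flush whatever is left in the final block
    }
}
```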

With all that said, the parallel apply problem is still quite tractable, but you need to pick your battles carefully.  Tungsten's parallel apply implementation has a very clear problem focus:  speeding up slave updates for multi-tenant applications that have a high degree of natural partitioning and concurrent updates across customers.  This is not as limiting as it might sound to readers unfamiliar with MySQL.  SaaS applications for the most part follow the multi-tenant model on MySQL, with each customer assigned to a particular database.  So do large ISPs or cloud providers that host customers on shared servers using separate databases.

Tungsten parallel apply is based on automatic sharding of transactions.   The following diagram shows the parallel apply algorithm conceptually. 
Tungsten Parallel Apply
Tungsten has a flexible architecture based on replication pipelines, described in a previous article on this blog.  To recap the model, pipelines are divided into stages, which represent processing steps.  Each stage consists of an extract-filter-apply loop with symmetric interfaces and identical processing logic.  The parallel apply implementation builds on replication pipelines as follows:
  1. A new filter called EventMetadataFilter automatically parses incoming transactions to figure out which database(s) they affect.  This is simple for row updates but involves parsing for statements and specialized extract handling for odd-ball operations like LOAD DATA INFILE. 
  2. The shard ID is assigned from the database name. This is glommed into the EventMetadataFilter but will shortly be broken out into a separate filter so that it is possible to support alternate shard assignment algorithms. 
  3. There is a new kind of in-memory buffer between stages called a ParallelQueue that supports multiple queues that feed the final apply stage.   Stages have a corresponding extension to allow them to have multiple threads, which must match the number of parallel queues or you get an error. 
  4. The ParallelQueue implementation calls a new component called a Partitioner to assign each transaction a partition number (i.e., a parallel queue).  You can substitute different algorithms by providing different partitioner implementations.  The default implementation uses a configuration file called shard.list to map shards to queues.  Unless you say otherwise it hashes on the shard ID to make this assignment, as sketched just after this list.
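As a concrete illustration of the default assignment in step 4, here is a small Java sketch of hashing a shard ID onto a fixed number of queues.  The class and method names are invented for the example and are not Tungsten's actual Partitioner interface.

```java
// Illustrative hash-based partitioner: a given shard always lands on the same
// queue, so updates within a shard stay in order while different shards
// proceed in parallel.
public class HashPartitioner {
    private final int queueCount;

    public HashPartitioner(int queueCount) {
        this.queueCount = queueCount;
    }

    // Maps a shard ID (here, the database name) onto one of the parallel queues.
    public int assignQueue(String shardId) {
        // Mask off the sign bit; Math.abs(Integer.MIN_VALUE) is still negative.
        return (shardId.hashCode() & 0x7fffffff) % queueCount;
    }
}
```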
Extensions #1 and #2 run on the master, while #3 and #4 run on the slave.  I really like diagrams, so here is a picture of the fully implemented parallel apply architecture.  The master replicator extracts, assigns the shard, and logs each transaction.  The slave replicator fetches transactions, logs them locally, then applies in parallel.
Full Master/Slave Architecture for Parallel Apply
So how does this work?  Pretty well actually.  Local lab tests indicate that parallel apply roughly doubles throughput on a multi-database TPC-B benchmark we use for testing.  We should be able to publish some real-world performance numbers in the near future, but so far things look quite promising.  During the implementation a number of interesting issues have arisen, which I would like to discuss now.

The first issue is the ratio between parallel apply threads and shards.  While it might seem obvious to have a thread per shard, in real deployments the situation is not so clear.  For one thing, SaaS and ISP deployments often have hundreds or even thousands of databases, which has a number of practical consequences for the implementation.  Less obviously, spreading transactions thinly across a lot of queues means fewer opportunities to use block commit, hence more work for slave servers and less overall throughput.  Performance optimization is a very uncertain matter, so Tungsten lets users configure the ratio.

Dependencies between shards are yet another issue.  While I mentioned that Tungsten is designed for applications with "a high degree of natural partitioning," dependencies between databases as well as individual transactions do occur and cannot be ignored.  For example, many SaaS applications have reference data that are used by all customer databases.  Even if parallel SQL works here, applications may get sick from seeing updates appear in the wrong order.  Or you could have global operations like CREATE USER that affect all databases.  Or you might not be able to tell which shard a piece of SQL belongs to.  Tungsten allows users to declare reference databases and automatically serializes these databases as well as global or "don't know" cases. 
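To show what the serialized cases can look like in code, here is an illustrative routing sketch.  The "#GLOBAL" marker, the -1 return convention, and the class names are assumptions for the example rather than Tungsten's actual API; the point is simply that global, unknown, and declared reference-database shards force a barrier where every queue drains before the event applies on its own.

```java
import java.util.Set;

// Illustrative only: route most shards to a hashed queue, but serialize global,
// unknown, or declared reference-database shards.
public class ShardRouter {
    private final int queueCount;
    private final Set<String> serializedShards; // e.g., shared reference databases

    public ShardRouter(int queueCount, Set<String> serializedShards) {
        this.queueCount = queueCount;
        this.serializedShards = serializedShards;
    }

    // Returns -1 to mean: drain every parallel queue, then apply this event alone.
    public int route(String shardId) {
        if (shardId == null                   // could not tell which database it touches
                || "#GLOBAL".equals(shardId)  // e.g., CREATE USER affects all databases
                || serializedShards.contains(shardId)) {
            return -1;
        }
        return (shardId.hashCode() & 0x7fffffff) % queueCount;
    }
}
```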

There are also numerous issues around startup and shutdown.  Remember how MySQL replication slaves will not restart after unclean shutdown with open temp tables?  (If not, take a quick break and read this now.  You'll thank me later.)  Parallel apply introduces similar issues, because you have multiple threads all updating different positions in the database.  Tungsten handles crash recovery by tracking the apply position of each queue in InnoDB and then restarting each queue from its recorded position.  I am putting finishing touches on clean shutdown, which ensures that all queues are empty, much like automatically checking that temp tables are closed on MySQL.
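Here is a sketch of the per-queue restart point idea, assuming a small InnoDB table keyed by queue number; the table and column names are invented for the example.  The essential property is that the position update commits in the same transaction as the applied data, so after a crash each queue resumes exactly where it left off.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative per-queue position tracking for crash recovery.  Assumes an
// InnoDB table: CREATE TABLE apply_position (queue_id INT PRIMARY KEY, seqno BIGINT).
public class QueuePositionTracker {
    private final Connection conn; // autocommit off; the applier commits the block

    public QueuePositionTracker(Connection conn) {
        this.conn = conn;
    }

    // Record the last applied sequence number for this queue inside the same
    // transaction as the applied data, so the two commit (or roll back) together.
    public void markApplied(int queueId, long seqno) throws SQLException {
        String sql = "INSERT INTO apply_position (queue_id, seqno) VALUES (?, ?) "
                + "ON DUPLICATE KEY UPDATE seqno = VALUES(seqno)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, queueId);
            ps.setLong(2, seqno);
            ps.executeUpdate();
        }
    }

    // On restart, each queue resumes from the position it last committed.
    public long restartPoint(int queueId) throws SQLException {
        String sql = "SELECT seqno FROM apply_position WHERE queue_id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, queueId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : -1; // -1: start from the beginning
            }
        }
    }
}
```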

In short, over the last few months Tungsten has climbed a fair distance up a pretty big hill to get parallel apply to work.  The flexibility of the replicator architecture, particularly pipelines, has been very helpful as it is quite easy to extend.  The parallelization algorithm builds on terrific work by other colleagues at Continuent, especially Stephane Giron and Linas Virbalas.  They have both put enormous effort into building up MySQL and PostgreSQL replication capabilities.

Here are a couple of parting thoughts about parallelization based on the experience so far.

Thought number one:  parallel replication is not magic.  To use parallel apply effectively, applications need to play nice:  mostly short transactions and not too many dependencies between shards are the biggest requirements to see a substantial boost in throughput.  For example, if you let one user write 50M statements to the binlog in a single transaction, things are going to get kind of quiet on the slave no matter what you do.  Also, you can forget about MyISAM or other non-transactional engines.  As I have written before, these engines offer a number of opportunities for databases to get messed up or out-of-sync even using conventional MySQL replication.  Tungsten's block commit and parallel apply increase the window for problems significantly.  If you are still using MyISAM for replicated data, it's time to man up and convert to InnoDB. 

Thought number two: The long-term answer to effective parallel replication is to change how MySQL works by interleaving transactions within the binlog along the lines suggested by Kristian Nielsen and others.  MySQL currently completely serializes transactions to the binlog, an accomplishment that makes slave apply logic a lot simpler.   Tungsten parallel apply then has to undo this good work and recreate streams of non-conflicting updates, which is complex and does not help all workloads.

It is doubtful that replicating interleaved transactions will be less complex than handling a serial binlog as it stands today.  There is also some heavy lifting inside MySQL to get to an interleaved binlog.  However, interleaved transactions would have the advantage that transactions for any workload would be parallelized, which would widen the scope of benefits to users.  I'm happy to see that Kristian and other people are now working on this feature for future releases of MySQL.

Meanwhile, we have a workable solution for Tungsten and are pushing it forward as quickly as we can.  Contact Continuent if you would like to test it out.

1 comment:

Giorgio said...

Thanks for the article.
So this means that all servers must be sized to handle the entire write load and not just a subset?