Mar 22, 2011

Parallel Replication Using Shards Is the Only Workable Approach for SQL

There have been a couple of recent blog articles (here and here) asking for parallel replication based on something other than schemas.  These articles both focus on the problem of parallelizing updates within a single MySQL schema.  I read these with great interest, not least because they both mentioned Tungsten (thanks!), and also because both found our schema-based parallelization approach too limited.  It is therefore worth a short article explaining exactly what the Tungsten approach is and why we chose it.

First of all, Tungsten does not exactly use schema-based parallel replication.  Tungsten is actually based on what I call the serialized shard model of replication.  We assign global transaction IDs to all transactions, which means that for any particular set of transactions we can always figure out the correct serialization and apply them in the right order.  This is true even if the transactions travel across independent replication paths or if we have master failover.
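To make the idea concrete, here is a minimal sketch, with invented class and method names rather than Tungsten's real ones, of how a global transaction ID lets you recover the one correct serial order from events that arrive over independent replication paths:

```java
import java.util.*;

// Hypothetical replication event carrying a global transaction ID (seqno)
// and a shard ID. Tungsten's real event classes differ; this is only the idea.
class ReplEvent {
    final long seqno;      // global transaction ID, assigned once at the master
    final String shardId;  // shard this transaction belongs to
    final String sql;      // payload, simplified to a single statement

    ReplEvent(long seqno, String shardId, String sql) {
        this.seqno = seqno;
        this.shardId = shardId;
        this.sql = sql;
    }
}

public class Reserialize {
    // Events gathered from any mix of replication paths can always be put
    // back into the single correct order by sorting on the global seqno.
    static List<ReplEvent> reserialize(Collection<ReplEvent> fromAnyPaths) {
        List<ReplEvent> ordered = new ArrayList<>(fromAnyPaths);
        ordered.sort(Comparator.comparingLong(e -> e.seqno));
        return ordered;
    }

    public static void main(String[] args) {
        List<ReplEvent> mixed = Arrays.asList(
            new ReplEvent(3, "customer_b", "UPDATE orders SET ..."),
            new ReplEvent(1, "customer_a", "INSERT INTO orders ..."),
            new ReplEvent(2, "customer_a", "UPDATE orders SET ..."));
        for (ReplEvent e : reserialize(mixed)) {
            System.out.println(e.seqno + " [" + e.shardId + "] " + e.sql);
        }
    }
}
```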

Second, we assign a shard ID to every transaction.  Shards are streams of transactions that execute correctly when applied by themselves in serial order.  Shards are typically independent of each other, which means transactions in different shards can execute in parallel without deadlocking or corrupting data.  This is the case, for example, when each shard contains the data for a single customer in a multi-tenant application.  We also have a notion of "critical shards," which are shards that contain global data, such as shared currency rate tables.  Updates in critical shards cause full serialization across all shards.
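As an illustration only (the class and shard names are invented, and real apply threads would coordinate on seqnos rather than print messages), a dispatcher in this model routes each transaction to a per-shard queue and turns a critical shard into a barrier:

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative dispatcher, not Tungsten's implementation: ordinary shards get
// their own apply queue and run in parallel; a transaction on a critical shard
// forces a barrier, so everything already queued commits first and the
// critical transaction runs alone.
public class ShardDispatcher {
    private final Map<String, BlockingQueue<String>> shardQueues = new HashMap<>();
    private final Set<String> criticalShards = new HashSet<>(Arrays.asList("#GLOBAL"));

    void dispatch(String shardId, String tx) throws InterruptedException {
        if (criticalShards.contains(shardId)) {
            drainAllShards();   // full serialization across all shards
            applyAlone(tx);
        } else {
            shardQueues
                .computeIfAbsent(shardId, k -> new LinkedBlockingQueue<>())
                .put(tx);       // parallel path: independent per-shard queue
        }
    }

    private void drainAllShards() {
        // A real implementation blocks until every apply thread has committed
        // up to the critical transaction's seqno; here we just mark the barrier.
        System.out.println("barrier: waiting for all shard queues to drain");
    }

    private void applyAlone(String tx) {
        System.out.println("applying critical transaction by itself: " + tx);
    }

    public static void main(String[] args) throws InterruptedException {
        ShardDispatcher d = new ShardDispatcher();
        d.dispatch("customer_a", "UPDATE accounts SET ...");
        d.dispatch("#GLOBAL", "UPDATE currency_rates SET ...");  // triggers the barrier
    }
}
```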

You can define shards in a variety of ways, but as a practical matter identifying durable shards inside individual MySQL schemas is hard for most applications, especially if there are constraints between tables or you have large transactions.   Many SQL applications tend to make most of their updates to a small number of very large tables, which makes finding stable dividing lines even more difficult.  Schemas are therefore a natural unit of sharding, and Tungsten uses these by default.
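A rough sketch of the default rule, with invented names and assuming the replicator already knows which schemas a transaction touched: the schema name becomes the shard ID, and anything that cannot be pinned to a single schema falls back to a critical shard.

```java
import java.util.Set;

// Hypothetical shard assignment for schema-based sharding. The marker value
// and method names are invented; only the decision rule matters here.
public class SchemaShardKey {
    static final String CRITICAL_SHARD = "#UNKNOWN";  // forces full serialization

    static String shardIdFor(Set<String> schemasTouched) {
        return schemasTouched.size() == 1
            ? schemasTouched.iterator().next()  // clean single-schema transaction
            : CRITICAL_SHARD;                   // cross-schema or unidentifiable
    }

    public static void main(String[] args) {
        System.out.println(shardIdFor(Set.of("customer_a")));            // customer_a
        System.out.println(shardIdFor(Set.of("billing", "reporting")));  // #UNKNOWN
    }
}
```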

Schema-based sharding seems pretty limiting, but for current SQL databases it is really the only approach that works.  Here are some of the important reasons, which should give you a flavor of the issues.

* Restart.  To handle failures you need to mark the exact restart point on each replication apply thread, or you will either repeat or miss transactions.  This requires precise and repeatable serialization on each thread, which is exactly what the serialized shard model gives you (the sketch after this list shows the bookkeeping).

* Deadlocks.  If there are conflicts between updates you will quickly hit deadlocks.  This is especially true because one of the biggest single-thread replication optimizations is block commit, where you commit dozens of successive transactions at once (also shown in that sketch)--it can raise throughput by 100% in some cases.  Deadlocks, on the other hand, can reduce effective throughput to zero in pathological cases.  Shard-based execution avoids deadlocks.

* Ordering.  SQL gives you a lot of ways to shoot yourself in the foot through bad transaction ordering.  You can't write to a table before creating it.  You can't delete a row before it is inserted.  Violating these rules does not just lead to invalid data but also causes errors that stop replication.  The workarounds are either unreliable and slow (conflict resolution) or impractical for most applications (make everything an insert).  To avoid this you need to observe serialization very carefully.

* Throughput.  SQL transactions in real systems vary tremendously in duration, which tends to result in individual long transactions blocking simpler parallelization schemes that use in-memory distribution of updates.  In the Tungsten model we can solve this by letting shard progress vary (by hours potentially), something that is only possible with a well-defined serialization model that deals with dependencies between parallel update streams.  I don't know of another approach that deals with this problem.
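To tie the Restart and Deadlocks points together, here is a hedged sketch of what a per-shard apply thread might do; the restart-point table and its columns are invented for illustration, but the point is that the batched transactions and the new restart seqno commit atomically, so a crash can neither repeat nor skip work.

```java
import java.sql.*;
import java.util.List;

// Sketch of block commit with a durable per-shard restart point. The
// shard_restart_point table and its columns are invented; what matters is
// that the applied statements and the updated seqno commit together.
public class ShardApplier {
    private final Connection conn;
    private final String shardId;

    ShardApplier(Connection conn, String shardId) {
        this.conn = conn;
        this.shardId = shardId;
    }

    void applyBlock(List<String> statements, long lastSeqno) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            for (String sql : statements) {
                st.execute(sql);  // dozens of successive transactions in one batch
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE shard_restart_point SET seqno = ? WHERE shard_id = ?")) {
                ps.setLong(1, lastSeqno);   // record exactly where to restart
                ps.setString(2, shardId);
                ps.executeUpdate();
            }
            conn.commit();  // block commit: one commit for many transactions
        } catch (SQLException e) {
            conn.rollback();  // restart point stays at the previous seqno
            throw e;
        }
    }
}
```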

If you mess up the solution to any of the foregoing problems, chances are good you will irreparably corrupt data, which leads to replication going completely off the rails.  Then you reprovision your slave(s).  The databases that most need parallel replication are very large, so this is a multi-hour or even multi-day process.  It makes for unpleasant calls with customers when you tell them they need to do this.

I don't spend a lot of time worrying that Tungsten parallel replication is not well suited to the single big schema problem.  So far, the only ways I can think of making it work scalably require major changes to the DBMS or the applications that use it.  In many cases your least costly alternative may be to use SSDs to boost slave I/O performance.

My concerns about Tungsten's model lie in a different area.  The serialized shard model is theoretically sound--it has essentially the same semantics as causally dependent messaging in distributed systems.  However, if we fail to identify shards correctly (and don't know we failed), we will have crashes and corrupt data.  I want Tungsten either to work properly or to tell users it won't work and degrade gracefully to full serialization.  If we can't do one of these two things for every conceivable sequence of transactions, that's a serious problem.

So, to get back to my original point, serialized shards are the best model for parallel replication in SQL databases as we find them today.  I suspect that if you look at some of the other incipient designs for parallel replication on MySQL, you will find that they follow this model in the end, if not at first.  In fact, I would think the next step is to add MySQL features that make sharded replication more effective.  The Drizzle team seems to be thinking along these lines already.

1 comment:

Linas Virbalas said...

RH> Updates in critical shards cause full serialization across all shards.

I'd like to emphasize this sentence and give it some weight, as it really means you need to wait for all the parallel threads to reach the seqno in the critical shard, block, commit in the critical shard, and then resume.  This flattens out the parallelization, so it needs to be avoided as much as the architecture allows.