Feb 26, 2012

Solving Replication Lag with Tungsten Slave Prefetch

Slave prefetch is an increasingly popular technique for speeding up native MySQL replication, with several tools already published to enable it, such as mk-slave-prefetch and Replication Booster.  Tungsten Replicator is now joining the fray.   This article explains how our implementation works, how to install and tune it, and how well it performs compared to unaided MySQL native replication as well as Tungsten parallel replication.

Understanding Slave Prefetch

Slow reads from storage are the principle reason for lagging MySQL replication.   This seems paradoxical since at first glance the lag is caused by delayed updates.  The explanation is due to the way DBMS engines like MySQL handle transactions.

As experienced DBMS users know, database servers do not operate directly on storage.  Instead, they  load pages from storage into the buffer cache, which is to say into memory.   For example, an UPDATE statement on an InnoDB table may read a secondary index to find rows to update, after which it reads the rows themselves using the primary key index.  Since updates can affect index entries, the update may also need to traverse additional indexes to find entries that need to be changed.   All of this happens before the DBMS engine can perform the write operation, which for a variety of reasons is very fast.  The difference in update speed can be on  the order of 100:1 when required pages are already in cache vs. fetched from storage.   In the latter case, single-threaded slaves as we find in MySQL replication as well as non-parallel Tungsten replication spend most of their time waiting for reads to complete.  Replication becomes grindingly slow.

Slave prefetch solves replication lag by preloading pages into the buffer cache before they are needed for updates.  Our implementation is specifically designed to speed up replication for Tungsten single-threaded slaves.  It takes advantage of specialized features of Tungsten, including replication pipelines as well as parallel apply and SQL parsing capabilities that are already built into the replicator.  

The following diagram illustrates how Tungsten prefetch works using the example of an UPDATE statement on the standard sysbench table sbtest, which is like the Drosophila Fly of MySQL performance testing.   The idea is to pull the query from the slave THL before the slave can execute it, fetch any affected rows, and then use the fetched data to construct further queries to load secondary index pages.   There is a primary key on column id and a secondary index on column k. 

Secondary index loading is the trickiest part of prefetching.  Select too few index pages and prefetch becomes ineffective, because it misses too many pages needed for the UPDATE.  Select too many or the wrong indexes, and you can easily drag down the MySQL server.   We will return to this issue in a later section as we discuss performance as well as tuning.  

Tungsten can prefetch pages for any of the following types of statements: 
  1. INSERT [INTO] table SELECT ...
  2. UPDATE table SET ... WHERE ...
  3. DELETE FROM table WHERE ...
  4. Any row-based table operation, i.e., inserts, updates, and deletes. 
Tungsten prefetch is implemented by a specialized pipeline that downloads transactions from a Tungsten slave, then coordinates their execution so that the prefetch queries run anywhere from 5 seconds to a few minutes ahead of the corresponding slave updates.  The prefetch applier uses parallel tasks (typically 15 or more) to run many prefetch queries in parallel.   The coordination and parallel apply help ensure that prefetch queries run far enough ahead of time to help the slave pipeline but not so quickly that the fetched pages drop out of the buffer pool before they can be used.

Finally, the Tungsten pre-fetch implementation supports single-threaded Tungsten slaves running against MySQL.  We may extend it to parallel Tungsten slaves at a later time.  Support for vanilla MySQL replication is further out.   

Installing Tungsten Prefetch
Tungsten slave prefetch generally runs in a separate replicator process that connects to the Tungsten slave and downloads transactions from its THL.   The first step is to set up normal master/slave replication using procedures from the Tungsten Replicator install docs or from the cookbook.

Next, you install prefetch in a separate release directory.   Here is an example of a standard one-line install command.   Note:  Prefetch support is very new.  You must use at least Tungsten 2.0.5 build 548, available here.

You can install a replicator with prefetch enabled using a one-line command like the following example:   

$ tools/tungsten-installer --master-slave -a \
  --home-directory=$TUNGSTEN_HOME \
  --prefetch-enabled=true \
  --prefetch-schema=tungsten_mysvc \
  --prefetch-max-time-ahead=180 \
  --prefetch-min-time-ahead=5 \
  --prefetch-sleep-time=500 \
  --cluster-hosts=logos2 \
  --datasource-port=3306 \
  --datasource-user=tungsten \
  --datasource-password=secret \
  --master-host=logos2 \
  --master-thl-port=2112 \
  --thl-port=2114 \
  --rmi-port=10002 \
  --service-name=prefetch \
  --channels=20 \

Notice the following specialized settings.
  • --prefetch-schema has the name of the schema containing the slave's trep_commit_seqno table (i.e., its current position).  Tungsten prefetch selects from this table in order to track the current apply position of the slave pipeline.   
  • --prefetch-max-time-ahead and --prefetch-min-time-ahead define the window for executing prefetch queries.  In this case it's at least 5 seconds ahead of the current slave position but no more than 180 seconds.  
  • --prefetch-sleep-time option sets the number of milliseconds to sleep when prefetching gets too far ahead.  
  • --master-host and --master-thl-port locate the slave for which we are prefetching.  The use of "master"in this context is confusing but does work. 
Your new prefetch replicator should pop up and start working right away.  Use 'trepctl -port 10002 status' to see progress and as well as its overall position in the log.  We will come back to monitoring shortly.

Tungsten Prefetch Performance

So how does Tungsten prefetch actually perform?  We have run a number of benchmarks to compare Tungsten prefetch with unaided native MySQL replication.  The benchmark is exactly the same one we used for parallel replication, which was described in an earlier article on this blog.  The benchmark measures the amount of time for a slave to apply an hour's worth of transactions spread evenly across 30 schemas.  Since the setup is described in detail elsewhere, I will skip further details here. 

For prefetch, we considered only the I/O bound case with 68Gb dataset size and 10Gb InnoDB buffer pool, as this is the only case where prefetching offers a benefit.  Here is a graph showing the result.  Both DBMS servers are starting with empty buffer pools.

As this graph shows, prefetch considerably improves slave performance.   For instance, prefetch truncates buffer pool loading effects to a few minutes compared the hour or so we see for unaided MySQL replication.  Throughput at the steady state is also quite good, as shown by the flat spots on the graph.   Overall Tungsten prefetch exceeded MySQL throughput by slightly more than three times.  This result is consistent over a number of tests including tests on a single schema run by my colleague Giuseppe Maxia.

For comparison, Tungsten parallel replication handles the same workload in about 51 minutes, or 4.4 times faster than MySQL and about 50% faster than prefetch.

As we see, however,  prefetch performance can be quite variable, especially on real workloads.  There are two reasons for this.
  1. Prefetching itself puts load in the system.  For example, secondary index queries can be very expensive, especially for low cardinality indexes (i.e., with few distinct values).  There is a sizable hit on start-up as the prefetch fetches pages for these indexes. Tungsten caches slow prefetch queries so that they do not repeat too often, as there is a new hit each time the prefetch operation repeats. 
  2. Workloads are themselves variable, which means that the pre-fetcher's "guess" of required pages may be more or less good depending on the circumstances. 
There are two interesting anomalies on the prefetch performance. The first is the large dip 10 or so minutes into the run.  The second is the jump at the tail.

I do not know the reason for the first dip in the graph.  However, the jump at the end is likely the effect of prefetch reaching the end of the transactions.  For the final few minutes the slave is running with most indexes fully loaded in cache but without the additional performance impact of prefetching itself.  From this and other tests it appears the Tungsten prefetcher is still somewhat too aggressive and may be pre-fetching more often than required.  This may also account for some of the dip as well, but more investigation is required.

Tuning Prefetch

The performance results illustrate the important of monitoring and if necessary tuning prefetch for better results.  To do this effectively it helps to understand how prefetch works internally.  Here is an illustration of the prefetch pipeline.

Prefetch is largely controlled by the PrefetchStore, which coordinates when to release transactions to the parallel apply queue, and the PrefetchApplier, which converts transactions in to prefetch queries and executes them.   To get a general sense of prefetch status, you can run the following command to see the state of the PrefetchStore.  As shown by the highlighted variables the prefetcher is currently about 7 seconds ahead of the slave position, which is at the low end of a desired range of 3 to 180 seconds ahead.   So far 195,841 transactions have been processed.  The prefetcher skipped 231 to keep ahead of the slave.   These indicate that the prefetcher is working correctly but could perhaps use a little tuning of query caching or number of parallel channels to help keep up.

$ trepctl status -name stores
Processing status command (stores)...
NAME                VALUE
----                -----
allowAll          : false
eventCount        : 195841
interval          : 1000
maxSize           : 100
maxTimeAhead      : 180
minTimeAhead      : 3
name              : prefetch-queue
prefetchEvents    : 195842
prefetchLatency   : 168160.291     
prefetchRatio     : 0.999          
prefetchState     : active
prefetchTimeAhead : 6.709          
skippedEvents     : 231
slaveCatalogSchema: tungsten_test
slaveLatency      : 168167.000     
sleepTime         : 500
storeClass        : com.continuent.tungsten.replicator.prefetch.PrefetchStore
storeSize         : 100
timeActive        : 766.037        
timeSleeping      : 0.000          
url               : jdbc:mysql:thin://
NAME                VALUE
----                -----
criticalPartition : -1
discardCount      : 0
eventCount        : 195740
maxSize           : 100
name              : parallel-queue
partitionerClass  : com.continuent.tungsten.replicator.storage.parallel.RoundRobinPartitioner
queues            : 20
serializationCount: 0
serialized        : false
stopRequested     : false
store.queueSize.0 : 80
store.queueSize.1 : 86
store.queueSize.10: 78
store.queueSize.11: 0
store.queueSize.12: 100
store.queueSize.13: 31
store.queueSize.14: 60
store.queueSize.15: 1
store.queueSize.16: 33
store.queueSize.17: 42
store.queueSize.18: 72
store.queueSize.19: 91
store.queueSize.2 : 1
store.queueSize.3 : 0
store.queueSize.4 : 45
store.queueSize.5 : 18
store.queueSize.6 : 7
store.queueSize.7 : 0
store.queueSize.8 : 77
store.queueSize.9 : 46
storeClass        : com.continuent.tungsten.replicator.storage.parallel.ParallelQueueStore
storeSize         : 868
syncEnabled       : true
syncInterval      : 10000
Finished status command (stores)...

The second store listing shows the parallel store.   You can see that prefetch queries are distributed across 20 channels.  Prefetch uses in-memory parallelization, which limits the number transactions that can be processed but avoids extra writes to disk.

There is currently no simple way to view statistics on PrefetchAppliers during execution.  The prefetchers dump statistics to the replicator log, trepsvc.log, whenever they go offline.   You can use these numbers to adjust query cache parameters, which are defined in the static properties file for the prefetch replication service as shown below.  

# Slow query cache parameters.  Slow queries include those with large
# numbers of rows or poor selectivity, where selectivity is the
# fraction of the rows selected.  Any query that exceeds these is
# not repeated for the number of seconds in the cache duration property.  

One of the most useful parameters is the retention period slowQueryCacheDuration.  It prevents large queries from being repeated, in this case more than once every 600 seconds.  For now there is no magic wand to set proper values.  You have to experiment if the default values do not suit you.

Prefetch versus Parallel Replication versus Yet Other Performance Improvements

Of the techniques available in Tungsten parallel replication offers provably the best overall performance.  This is because it optimizes all I/O exactly (not just buffer cache loading) without the guesswork that is inevitable in prefetching.   If your application is suitable for parallelization, it is normally the way to go if you are concerned about speed.

That said, not all applications can use parallel replication.  Many applications consist of a single database schema, which is difficult to parallelize.  Also, older Tungsten replicators are still single-threaded.   Some of the motivation for prefetch work at Continuent is to speed up commercial Tungsten replication while we integrate parallel replication.  Prefetching is a very powerful tool in these circumstances and is not just limited to disk storage, which we use for most of our benchmarks.  Yoshinori Matsunobu has shown the prefetching also helps when using SSDs.  

In my previous post on parallel replication benchmarking I indicated in the conclusion that some combination of prefetch and parallel replication might be good for certain workloads.   Domas disagreed with that opinion.  At this point I am more inclined to agree with Domas' position for the reasons cited a couple of paragraphs above.   We may still test it but at this point I would rather spend the effort making parallel replication itself faster.

Speaking of replication performance, the next major improvement we have in store is batch loading.   This loads transactions using LOAD DATA INFILE as well as COPY (for non-MySQL servers).  It is especially effective for data warehouse loading, where it can boost load speeds enormously.  I will post more on this with numbers in the near future. 

No comments: