Oct 31, 2011

Benchmarking Tungsten Parallel Replication

Tungsten parallel apply on slaves, or parallel replication for short, has been available for about a year.   Until recently we did not have many formal benchmarks of its performance.  Fortunately the excellent Percona Live Conference in London accepted my talk on Tungsten parallel replication (slides available here), so Giuseppe Maxia and I finally allocated a block of time for systematic performance testing.

In a nutshell, the results were quite good. In the best cases Tungsten parallel apply out-performs single-threaded native replication by about 4.5 to 1.  Both Giuseppe and I have verified this using slightly different test methodologies, which helps avoid dumb counting mistakes.  Our results also match field tests at a customer site over the previous summer, so we regard them as fairly robust.  In the remainder of this article I would like to expand a bit on the details of the benchmarks as well as the results.  The results shown here are from my tests.

Benchmark Test Design

Both Giuseppe and I used a similar testbed for replication testing:    
  • HP Proliant server, dual Xeon L5520 CPUs with hyper-threading enabled, 72Gb of RAM
  • 1TB HP Smart Array RAID 1+0 
  • Centos 5.6
  • XFS file system
  • MySQL 5.1.57 with InnoDB buffer pool set to 10Gb and using O_DIRECT purge method 
  • Tungsten Replicator 2.0.5 build 347  
For convenience we use MySQL sandbox to set up a master with two slaves, as shown in the following diagram.  It turns out that for measuring replication throughput there is no reason to set up on separate hosts, as the master does little or nothing during the test and we only operate one slave at a time.  

The Tungsten Slave is configured as described in a previous article in this blog, except that there are 30 channels instead of 10.   The exact installation command is given at the end of this article.

The test run uses sysbench to spread transactions evenly across 30 databases of identical size, then measure time to process them.  This is also known as a replication catch-up test. 
  1. Load all MySQL servers with an identical dataset consisting of 30 databases pre-populated with data from sysbench.  Giuseppe wrote a new tool called Large Data Generator that is very helpful for capturing and loading such datasets.  
  2. With the slaves shut down, store the master binlog start position and then run 30 sysbench oltp test processes against the master to update and read from all schemas simultaneously for one hour.  
  3. Start the MySQL slave from the stored master binlog position and measure time to process the sysbench transactions. Shut down the MySQL slave at the end of the test. 
  4. Start the Tungsten slave from the stored master binlog position and measure time to process the sysbench transactions using Tungsten Replicator with 30 channels (i.e. threads). 
Test Results

Database performance is substantially different depending on whether data are fully resident in the buffer pool (cache-resident) or largely read from disk (I/O-bound).  Tungsten parallel replication over 30 databases varies from 1.8 to 4.5 depending on which case you look at, as shown in the following table.   Processing times are in minutes (m).

Test Scenario
Data Size
MySQL Slave
Tungsten Slave

Let's look at the results in detail.  In the cache-resident test the base dataset is relatively small and loads fully into the buffer cache within a minute or two.  Both MySQL and Tungsten slaves complete in well under an hour.  Here is a graph showing throughput as measured in bytes of binlog processed per 10 second increment.  

Cache-Resident Slave Catch-Up - MySQL vs. Tungsten Replicator, 30 Databases
In the cache-resident case there are virtually no reads from disk as all data are fully resident in the InnoDB buffer pool.  Tungsten Replicator is faster because multiple writes can occur in parallel but the speed-up versus native replication is not especially large.  Note that Tungsten processes around 40Mb every 10 seconds or about 1Gb of binlog every four minutes.

With I/O bound workloads, on the other hand, we see a profound difference in performance.  Tungsten Replicator is at least 6x slower than in the cache-resident case, but still processes updates faster than the master (51 minutes on the slave vs. 60 minutes on the master).   Buffer cache loading is correspondingly fast and Tungsten reaches steady-state performance within about 20 minutes.  MySQL native replication on the other hand is far slower. The slave not only does not catch up, but it would quickly lag far behind  the master under this workload.  It takes about 90 minutes for native replication even to achieve steady state performance after buffer pool loading.
I/O-Bound Slave Catch-Up - MySQL vs. Tungsten Replicator, 30 Databases
Overall we can safely say that single-threaded native replication is likely non-workable in the I/O-bound case without going to some combination of SSDs and/or slave pre-fetch.  

Further Improvements and Caveats

The current results of parallel replication benchmarks on Tungsten are gratifying especially when you consider that two years ago Tungsten Replicator performance was around 10% of the speed of MySQL replication.  Nevertheless, these benchmarks are not the final word.  It is clear there is room for optimization as we observe that Tungsten processes the cache-bound binlog at least 6 times faster than the I/O bound workload.  Much of the difference seems to be time spent reading from disk.   If this could be improved, Tungsten would go even faster.

During the London conference Yoshinori Matsunobu published some excellent performance results using slave pre-fetch, which has encouraged us to build pre-fetch into Tungsten as well.   I am curious to see if we can further boost throughput by adding pre-fetching on each parallel thread, though other people at the conference such as Domas Mituzas were not optimistic.  Either way, I am certain we will improve performance, if not using pre-fetch then with other tricks like batching inserts.

Finally, some caveats.  Our sysbench load is nice because it is evenly distributed across schemas of exactly the same size.  Most application workloads do not behave this way, though some do come very close.  The slides for my talk discuss practical issues in maximizing performance in real applications.  I suspect that a combination of parallelization with pre-fetch will in fact turn out to be a very good solution for a wide variety of workloads.

Fine Print

If you would like to repeat our results (or attack them as fraudulent), here are some parameters that may help.  The database settings in the MySQL sandbox instances are as follows:

Next, here is the sysbench command used to generate load on each schema.  We run 30 of these simultaneously varying the database name for each invocation.  This example is for the I/O-bound case.

sysbench --test=oltp --db-driver=mysql --mysql-db=${db} \  --mysql-user=msandbox --mysql-password=msandbox \
  --mysql-host= --mysql-port=33306 \
  --oltp-read-only=off --oltp-table-size=10000000 \
  --oltp-index-updates=4 --oltp-non-index-updates=2 \
  --max-requests=200000 \
  --max-time=3600 --num-threads=5 run

The replicator configuration is given in the slides for the talk, but here it is again.   Options in red are required for sandboxes.  Production installations are therefore simpler than what is shown here.

tools/tungsten-installer tools/tungsten-installer --direct -a \
  --service-name=parallel --native-slave-takeover \
  --master-host= --master-port=33306  \
  --master-user=msandbox --master-password=msandbox  \
  --slave-host= --slave-port=33307  \
  --slave-user=msandbox --slave-password=msandbox  \
  --home-directory=/opt/continuent \
  --property=replicator.store.parallel-queue.maxOfflineInterval=5 \
  --svc-parallelization-type=disk --buffer-size=100 \
  --channels=30 --thl-port=2115 --rmi-port=10010 \
  --skip-validation-check=MySQLPermissionsCheck \
  --skip-validation-check=MySQLApplierServerIDCheck \

To equal the results shown above you will also need to assign databases explicitly to channels in the shard.list file.   Otherwise, databases will be assigned  channels using a hashing function, which tends to result in somewhat uneven distributions.  Look in the comments of the shard.list file for instructions on how to do this. 

Finally, all of our tests depend on two excellent tools from Giuseppe Maxia:   MySQL Sandbox and the new Large Data Generator program in the Tungsten Toolbox.  Once you get the hang of them you will become completely addicted as they make test setup both reliable as well as quick.  

Oct 1, 2011

Open Source Hardware

Back in 2010 I stopped buying test servers from Dell and began building them from components using Intel i7 processors, X58-based mother boards, and modular power supplies from Ultra.  It was a good way to learn about hardware.  Besides, it was getting old to pay for Dell desktop systems with Windows, which I would then wipe off when installing Linux.  Between the educational value of understanding the systems better, selecting the exact components I wanted, and being able to fix problems quickly, it has been one of the best investments I have ever made.  And it didn't cost any more than equivalent Dell servers.

For this reason, a couple of recent articles about computer hardware caught my attention.  First, Dell is losing business as companies like Facebook build their own customized servers.  Open source database performance experts like Peter Zaitsev have been talking about full-stack optimization including hardware for years.  Google built their original servers using off-the-shelf parts.  Vertical integration of applications and hardware has since gone mainstream.  If you deploy the same application(s) on many machines, balancing characteristics like cost, performance, and power utilization is no longer a specialist activity but a necessity of business.  It's not just cutting out the Microsoft tax but many other optimizations as well.

Second, developments in hardware itself are making custom systems more attractive to a wide range of users.  A recent blog post by Bunnie Huang describes how decreases in the slope of CPU clock speed increase over time mean you can get better cost/performance by building optimized, application-specific systems now than waiting for across-the-board improvements.  Stable standards also drive down the difficulty of rolling your own.  Components on mid-range servers are sufficiently standardized it is easier to build basic systems from components than to put together a bicycle from scratch.  Try building your own wheels sometime if you don't believe this.

Easily customizable hardware has important consequences.  At a business level, Dell and other mainline hardware vendors will adapt to lower margins, but the market for generic, mid-range appliances has evaporated.  Starting around 2005 there was a wave of companies trying to sell open source databases, memcached, and datamarts on custom hardware.   Most seem to have moved away from hardware, like Schooner,  or folded entirely (like Gear6 and Kickfire).  The long-term market for such appliances, to the extent it exists, is in the cloud.

The other consequence is potentially far more significant.  The traditional walls that encapsulated hardware and software design are breaking down.  Big web properties or large ISPs like Rackspace run lean design teams that integrate hardware with open source software deployment.  This not just a matter of software engineers learning about hardware or vice-versa.  It is the tip of a much bigger iceberg.  Facebook recently started the Open Compute Project, which is a community-based effort to design server infrastructure.   In their own words:
By releasing Open Compute Project technologies as open hardware, our goal is to develop servers and data centers following the model traditionally associated with open source software projects. That’s where you come in.
Facebook and others are opening up data center design.  Gamers have been building their own systems for years.  Assuming Bunnie's logic is correct, open hardware will apply to wide range of devices from phones up to massive clusters.  Community-based, customized system designs are no longer an oddity but part of a broad movement that will change the way all of us think about building and deploying applications on any kind of physical hardware.  It will upset current companies but also create opportunities for new kinds of businesses.  The "cloud" is not the only revolution in computing.  Open source hardware has arrived.