Friday, November 18, 2011

Replicating Data Now and Then with Tungsten

What do cruise ship management software and data warehouses have in common?  One answer:  they both depend on intermittent data replication.   Large vessels collect data to share with a home base whenever connectivity permits.  If there is no connection, they just wait until later.  Data warehouses also do not replicate constantly.  Instead, it is often far faster to pool updates and load them in a single humongous batch using SQL COPY commands or native loaders.  Replicating updates in this way is sometimes known as batch replication.  Tungsten Replicator supports it quite easily.

To illustrate we will consider a Tungsten master/slave configuration.  (Sample setup instructions here.)  In this example MySQL-based web sales data upload to a data warehouse.   The master receives constant updates, which then apply at controlled intervals on the slave.


The first step is to turn off the slave replicator.  Login to the prod2 host and execute the following command.

$ trepctl offline

The prod2 slave will disconnect from the master as well as the data warehouse.  Updates now accumulate on the master.  We can turn on the slave to fetch and apply them all, then go offline again using one of three methods.  The first method uses the current sequence number on the master.  Here are sample commands to fetch and apply all transactions from the master up to the current master position.

$ trepctl -host prod1 status |grep appliedLastSeqno
appliedLastSeqno       : 19600
$ trepctl online -seqno 19600
$ trepctl wait -state OFFLINE -limit 300

As you can see, the first command locates the master sequence number.  The second command tells the slave to go online and replicate to sequence number 19600.  Finally the third command waits until either slave is back in the offline state or 300 seconds elapse, whichever comes first.  This is not strictly necessary for replication but is very handy for scripts, as it eliminates a potentially awkward polling loop.  

The second method is to use the MySQL binlog position on the master.  The idea is the same as the previous example.  We get the master binlog position, then tell the slave to apply transactions to that point and go offline.  Here's an example: 

$ mysql -utungsten -psecret -hprod1 -e 'show master status'
+------------------+----------+--------------+------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.002023 | 92395851 |              |                  |
+------------------+----------+--------------+------------------+
$ trepctl online -event mysql-bin.002023:0000000092395851
$ trepctl wait -state OFFLINE -limit 300

Note in this example that you must pad the binlog offset out to 16 digits, which means you must add the extra zeros shown in bold.  Tungsten compares native replication IDs as strings, so that we can handle other databases besides MySQL.  This normally a minor inconvenience, unless you don't know the trick.  In that case it could be a bit of a head-scratcher.  

There is a final way to implement batch replication using Tungsten's built-in heartbeat mechanism.   With this method we insert a named heartbeat event on the master, then ask the slave to replicate until the heartbeat appears.  Here's an example: 

$ trepctl -host prod1 heartbeat -name batch1
$ trepctl online -heartbeat batch1
$ trepctl wait -state OFFLINE -limit 300

This method is perhaps the simplest of all, because there is no need to check for either sequence numbers or binlog offsets on the master.  The only downside is that you must have a master and a slave replicator to use it.  It does not work with direct replication, in which a single replicator moves data from the master DBMS to the slave.  (This limitation will be removed in the future when Issue 228 is fixed.) 

When using any of these techniques, we may want to know whether Tungsten will really go offline at the correct point.  Fortunately, there's a simple way to find out.  The trepctl status command shows pending requests to go offline.  Let's say you check status after requesting the slave to replicate to a heartbeat as in the previous example.   

$ trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.002023:0000000104369615;37978
appliedLastSeqno       : 220126
appliedLatency         : 470.589
...
offlineRequests        : Offline at heartbeat event: batch1
...
state                  : ONLINE
timeInStateSeconds     : 2.436
uptimeSeconds          : 1742.0
Finished status command...

It is simple to see from the status output that Tungsten will go offline when it sees a heartbeat named batch1.

As this article shows, the trepctl online and trepctl wait commands make it very simple to implement batch replication.  You can simplify still further by wrapping the commands in a short script written in your favorite scripting language.   Either way you have a handy solution to a problem that affects a diverse set of applications.

This is not the end of Tungsten features to enable batch replication.  Tungsten has a new applier that can submit transactions using CSV files, which is critical to load transactions quickly into data warehouses.  We have been testing it out with Vertica, where early results show that it improves load performance by a factor of 100 or more in some cases.  I will describe this new feature in an upcoming article.  

Wednesday, November 16, 2011

Why So Many Proprietary Rewrites of MySQL and InnoDB?

Every couple of weeks or so I get marketing email from a Continuent competitor advertising a closed-source clone of MySQL. It is said to be pumped up on illegal substances and therefore the solution to all my problems.  I like this sort of spam because it makes it easier to track what the neighbors are up to.  However it does bring up a question.  Why are so many companies offering what amount to proprietary replacements of MySQL?  This does not mean alternative builds like Percona or MariaDB.  It means products like ClustrixSchooner, or Xeround, which replace MySQL entirely, or like ScaleDB, or Tokutek, which replace InnoDB.  

There's nothing wrong with proprietary software, of course.  And there is nothing wrong with rewriting things to make them better.  The rewrites are a tribute to the vitality of the MySQL marketplace and in some cases quite clever as well.  However, the proprietary offerings tend to obscure an important truth about MySQL.  Most businesses that run on open source software have problems with MySQL management, not with MySQL itself.

Here is a simple example.  Say you have 2 Terabytes in MySQL 5.1.  How do you upgrade from MySQL 5.1 to 5.5 without incurring an application outage?  This is a big problem for 24x7 web-facing applications.  You don't need to rewrite MySQL to do zero-downtime upgrades.  MySQL with InnoDB already works fine.  You just need a way to shift connections transparently to a new master database, upgrade the old master, and shift back when you are done.  Similar reasoning applies for slave provisioning, automated failover, spreading load over replicas to improve performance, or operating across multiple sites.

At Continuent we concluded a number of years ago that you don't need to change MySQL to manage data effectively.  We therefore designed Tungsten Enterprise, Continuent's commercial clustering solution, to work with unaltered MySQL. Tungsten Enterprise uses master/slave replication (i.e., my favorite replicator), distributed management, and transparent connectivity to make a set of standard MySQL or PostgreSQL servers look like a single highly available DBMS that distributes load across all replicas.   This architecture has tremendous advantages, because it complements the strengths of MySQL itself.   Here are a few of the principle benefits.

Transparency.  Everything runs on standard MySQL from initial development to large-scale deployment.  Application code runs the same way on a dev laptop or production.  Application bugs in production are reproducible on the laptop.  Standard MySQL configuration and tuning also work, because this is standard MySQL.

No lock-in.  Don't like Tungsten Enterprise?  Use something else or revert back to simple MySQL.  There's no need to change your database or migrate data.

Data integrity.  InnoDB has had years to shake out bugs, especially those involving data corruption.  There are still a few but they do not typically show up unless there is a bad hardware failure or you configure your system incorrectly.  (Hint #1: don't use MyISAM.)  Do you really want to give this up for a new store implementation?

Scalability.  MySQL performance is improving constantly, driven by competition between builds, an active community, investment from Oracle and large web properties like Facebook.  SSDs are also increasingly affordable and make a lot of performance problems evaporate.  As MySQL improves in this and other areas, you get the benefits.  The trick is to have a way to upgrade.  I mentioned the MySQL 5.1 to 5.5 upgrade problem for precisely this reason.

Operational costs.  There is a deep pool of administrators and administrative tools for MySQL.  Thanks to books like High Performance MySQL, abundant talks, and a wealth of community resources as well as consulting, there is little mystery about how things work.  I probably don't even need to discuss license costs.

Viability.  MySQL is not going anywhere.  Oracle is continuing to invest in the core database, and Percona, MariaDB and most important Microsoft will ensure Oracle stays on its toes.  At Continuent we do our best to keep our friends at Oracle competitive on replication.  Innovation on open source MySQL will continue for years to come.  (Psst, MySQL guys at Oracle are welcome to come work for us. :)

Given the number of advantages that off-the-shelf MySQL confers, the real question is why our approach is not more popular.  Actually it is.  For all the marketing attention generated by proprietary MySQL or InnoDB rewrites, many hundreds of billions of transactions per day run on unaltered MySQL.  Switching to proprietary versions of MySQL is a substantial wrench for most businesses, because the economics run so strongly in favor of open source DBMS.   However, the open source tools for managing MySQL are by-and-large inadequate, in part because some of the problems turn out to be rather difficult to solve.   

When we started to rethink database clustering at Continuent back in 2007, we therefore focused on solving the problems outside MySQL that make data management hard.  That includes building fast replication with global transaction IDs, so you can fail over easily to up-to-date live replicas.  It includes building distributed, rule-based management that has simple primitives like "recover" to fix a broken slave.  It includes speedy, transparent connectivity that can spread reads intelligently across multiple servers and reroute connections transparently to allow maintenance without halting applications.  Finally, it includes simplifying management so that users don't spend much time worrying about their data.  These capabilities are now very robust and help customers handle hundreds of millions of transactions per day.  

It is obvious off-the-shelf MySQL (and PostgreSQL too) are already very good and continuing to get better.  For most users there is no need to migrate to proprietary offerings that give up the leverage conferred by open source databases.  Tungsten Enterprise solves the difficult problems that are critical to building businesses on standard MySQL.   If you are building new systems based on MySQL or scaling old ones you should look hard at what we have done.   

Sunday, November 13, 2011

I Really Dislike Anonymous Attacks

If you are interested in NoSQL databases (or maybe not) perhaps you have seen the anonymous "warning" about using MongoDB.   It concludes with the following pious request:

  Please take this warning seriously.

Now there are a lot of great resources about data management on the web but the aforementioned rant is not one of them.  If you plan to write technical articles and have people take them seriously, here are a few tips.
  1. Sign your name.  Readers are more impressed when they see you are not afraid to stand behind your words. 
  2. Explain what problem you were trying to solve.  Otherwise uncharitable readers might think you just started pumping information into a new database without thinking about possible consequences and now want to blame somebody else for your bad decision.  
  3. Explain how you could do better.  Not all designs work out, so propose alternatives.  Readers love to see authors demonstrate that they are not discouraged by adversity.  
As for most of the points made by the anonymous author, all I can say is: well, duh! 

MongoDB behavior with respect to global write locking and transaction durability is obvious from the official documentation.  These features are not my cup of tea, but it's also not as if 10gen is hiding them either.  Moreover, most people understand that new DBMS implementations have problems, not least of all losing data now and then.  You usually pick them because they have features that make it worth putting up with the immaturity.  I am not an expert on MongoDB, but I can say from experience it is amazingly easy to load JSON objects into it.  The up-front usability alone demonstrates excellent engineering.  I am sure for this reason that there are many other good features.  

Monday, October 31, 2011

Benchmarking Tungsten Parallel Replication

Tungsten parallel apply on slaves, or parallel replication for short, has been available for about a year.   Until recently we did not have many formal benchmarks of its performance.  Fortunately the excellent Percona Live Conference in London accepted my talk on Tungsten parallel replication (slides available here), so Giuseppe Maxia and I finally allocated a block of time for systematic performance testing.

In a nutshell, the results were quite good. In the best cases Tungsten parallel apply out-performs single-threaded native replication by about 4.5 to 1.  Both Giuseppe and I have verified this using slightly different test methodologies, which helps avoid dumb counting mistakes.  Our results also match field tests at a customer site over the previous summer, so we regard them as fairly robust.  In the remainder of this article I would like to expand a bit on the details of the benchmarks as well as the results.  The results shown here are from my tests.

Benchmark Test Design

Both Giuseppe and I used a similar testbed for replication testing:    
  • HP Proliant server, dual Xeon L5520 CPUs with hyper-threading enabled, 72Gb of RAM
  • 1TB HP Smart Array RAID 1+0 
  • Centos 5.6
  • XFS file system
  • MySQL 5.1.57 with InnoDB buffer pool set to 10Gb and using O_DIRECT purge method 
  • Tungsten Replicator 2.0.5 build 347  
For convenience we use MySQL sandbox to set up a master with two slaves, as shown in the following diagram.  It turns out that for measuring replication throughput there is no reason to set up on separate hosts, as the master does little or nothing during the test and we only operate one slave at a time.  



The Tungsten Slave is configured as described in a previous article in this blog, except that there are 30 channels instead of 10.   The exact installation command is given at the end of this article.

The test run uses sysbench to spread transactions evenly across 30 databases of identical size, then measure time to process them.  This is also known as a replication catch-up test. 
  1. Load all MySQL servers with an identical dataset consisting of 30 databases pre-populated with data from sysbench.  Giuseppe wrote a new tool called Large Data Generator that is very helpful for capturing and loading such datasets.  
  2. With the slaves shut down, store the master binlog start position and then run 30 sysbench oltp test processes against the master to update and read from all schemas simultaneously for one hour.  
  3. Start the MySQL slave from the stored master binlog position and measure time to process the sysbench transactions. Shut down the MySQL slave at the end of the test. 
  4. Start the Tungsten slave from the stored master binlog position and measure time to process the sysbench transactions using Tungsten Replicator with 30 channels (i.e. threads). 
Test Results

Database performance is substantially different depending on whether data are fully resident in the buffer pool (cache-resident) or largely read from disk (I/O-bound).  Tungsten parallel replication over 30 databases varies from 1.8 to 4.5 depending on which case you look at, as shown in the following table.   Processing times are in minutes (m).

Test Scenario
Rows/Db
Data Size
MySQL Slave
Tungsten Slave
Ratio
Cache-resident
10K
430Mb
30m 
17m
1.8
I/O-Bound
10M
68Gb
228m
51m
4.5

Let's look at the results in detail.  In the cache-resident test the base dataset is relatively small and loads fully into the buffer cache within a minute or two.  Both MySQL and Tungsten slaves complete in well under an hour.  Here is a graph showing throughput as measured in bytes of binlog processed per 10 second increment.  

Cache-Resident Slave Catch-Up - MySQL vs. Tungsten Replicator, 30 Databases
In the cache-resident case there are virtually no reads from disk as all data are fully resident in the InnoDB buffer pool.  Tungsten Replicator is faster because multiple writes can occur in parallel but the speed-up versus native replication is not especially large.  Note that Tungsten processes around 40Mb every 10 seconds or about 1Gb of binlog every four minutes.

With I/O bound workloads, on the other hand, we see a profound difference in performance.  Tungsten Replicator is at least 6x slower than in the cache-resident case, but still processes updates faster than the master (51 minutes on the slave vs. 60 minutes on the master).   Buffer cache loading is correspondingly fast and Tungsten reaches steady-state performance within about 20 minutes.  MySQL native replication on the other hand is far slower. The slave not only does not catch up, but it would quickly lag far behind  the master under this workload.  It takes about 90 minutes for native replication even to achieve steady state performance after buffer pool loading.
I/O-Bound Slave Catch-Up - MySQL vs. Tungsten Replicator, 30 Databases
Overall we can safely say that single-threaded native replication is likely non-workable in the I/O-bound case without going to some combination of SSDs and/or slave pre-fetch.  


Further Improvements and Caveats

The current results of parallel replication benchmarks on Tungsten are gratifying especially when you consider that two years ago Tungsten Replicator performance was around 10% of the speed of MySQL replication.  Nevertheless, these benchmarks are not the final word.  It is clear there is room for optimization as we observe that Tungsten processes the cache-bound binlog at least 6 times faster than the I/O bound workload.  Much of the difference seems to be time spent reading from disk.   If this could be improved, Tungsten would go even faster.

During the London conference Yoshinori Matsunobu published some excellent performance results using slave pre-fetch, which has encouraged us to build pre-fetch into Tungsten as well.   I am curious to see if we can further boost throughput by adding pre-fetching on each parallel thread, though other people at the conference such as Domas Mituzas were not optimistic.  Either way, I am certain we will improve performance, if not using pre-fetch then with other tricks like batching inserts.

Finally, some caveats.  Our sysbench load is nice because it is evenly distributed across schemas of exactly the same size.  Most application workloads do not behave this way, though some do come very close.  The slides for my talk discuss practical issues in maximizing performance in real applications.  I suspect that a combination of parallelization with pre-fetch will in fact turn out to be a very good solution for a wide variety of workloads.

Fine Print

If you would like to repeat our results (or attack them as fraudulent), here are some parameters that may help.  The database settings in the MySQL sandbox instances are as follows:
default-storage-engine=InnoDB
innodb-additional-mem-pool-size=100M
innodb-flush-method=O_DIRECT
innodb-log-buffer-size=4M
innodb-log-file-size=50M
innodb-thread-concurrency=0
innodb_buffer_pool_size=10G
innodb_file_format=barracuda
innodb_file_per_table=1
innodb_flush_log_at_trx_commit=2
innodb_strict_mode=1log-bin=mysql-binmax-connections=500
max_allowed_packet=48M
skip_slave_start
sync_binlog=0

Next, here is the sysbench command used to generate load on each schema.  We run 30 of these simultaneously varying the database name for each invocation.  This example is for the I/O-bound case.

sysbench --test=oltp --db-driver=mysql --mysql-db=${db} \  --mysql-user=msandbox --mysql-password=msandbox \
  --mysql-host=127.0.0.1 --mysql-port=33306 \
  --oltp-read-only=off --oltp-table-size=10000000 \
  --oltp-index-updates=4 --oltp-non-index-updates=2 \
  --max-requests=200000 \
  --max-time=3600 --num-threads=5 run

The replicator configuration is given in the slides for the talk, but here it is again.   Options in red are required for sandboxes.  Production installations are therefore simpler than what is shown here.

tools/tungsten-installer tools/tungsten-installer --direct -a \
  --service-name=parallel --native-slave-takeover \
  --master-host=127.0.0.1 --master-port=33306  \
  --master-user=msandbox --master-password=msandbox  \
  --slave-host=127.0.0.1 --slave-port=33307  \
  --slave-user=msandbox --slave-password=msandbox  \
  --home-directory=/opt/continuent \
  --property=replicator.store.parallel-queue.maxOfflineInterval=5 \
  --svc-parallelization-type=disk --buffer-size=100 \
  --channels=30 --thl-port=2115 --rmi-port=10010 \
  --skip-validation-check=MySQLPermissionsCheck \
  --skip-validation-check=MySQLApplierServerIDCheck \
  --start-and-report

To equal the results shown above you will also need to assign databases explicitly to channels in the shard.list file.   Otherwise, databases will be assigned  channels using a hashing function, which tends to result in somewhat uneven distributions.  Look in the comments of the shard.list file for instructions on how to do this. 

Finally, all of our tests depend on two excellent tools from Giuseppe Maxia:   MySQL Sandbox and the new Large Data Generator program in the Tungsten Toolbox.  Once you get the hang of them you will become completely addicted as they make test setup both reliable as well as quick.  

Saturday, October 1, 2011

Open Source Hardware

Back in 2010 I stopped buying test servers from Dell and began building them from components using Intel i7 processors, X58-based mother boards, and modular power supplies from Ultra.  It was a good way to learn about hardware.  Besides, it was getting old to pay for Dell desktop systems with Windows, which I would then wipe off when installing Linux.  Between the educational value of understanding the systems better, selecting the exact components I wanted, and being able to fix problems quickly, it has been one of the best investments I have ever made.  And it didn't cost any more than equivalent Dell servers.

For this reason, a couple of recent articles about computer hardware caught my attention.  First, Dell is losing business as companies like Facebook build their own customized servers.  Open source database performance experts like Peter Zaitsev have been talking about full-stack optimization including hardware for years.  Google built their original servers using off-the-shelf parts.  Vertical integration of applications and hardware has since gone mainstream.  If you deploy the same application(s) on many machines, balancing characteristics like cost, performance, and power utilization is no longer a specialist activity but a necessity of business.  It's not just cutting out the Microsoft tax but many other optimizations as well.

Second, developments in hardware itself are making custom systems more attractive to a wide range of users.  A recent blog post by Bunnie Huang describes how decreases in the slope of CPU clock speed increase over time mean you can get better cost/performance by building optimized, application-specific systems now than waiting for across-the-board improvements.  Stable standards also drive down the difficulty of rolling your own.  Components on mid-range servers are sufficiently standardized it is easier to build basic systems from components than to put together a bicycle from scratch.  Try building your own wheels sometime if you don't believe this.

Easily customizable hardware has important consequences.  At a business level, Dell and other mainline hardware vendors will adapt to lower margins, but the market for generic, mid-range appliances has evaporated.  Starting around 2005 there was a wave of companies trying to sell open source databases, memcached, and datamarts on custom hardware.   Most seem to have moved away from hardware, like Schooner,  or folded entirely (like Gear6 and Kickfire).  The long-term market for such appliances, to the extent it exists, is in the cloud.

The other consequence is potentially far more significant.  The traditional walls that encapsulated hardware and software design are breaking down.  Big web properties or large ISPs like Rackspace run lean design teams that integrate hardware with open source software deployment.  This not just a matter of software engineers learning about hardware or vice-versa.  It is the tip of a much bigger iceberg.  Facebook recently started the Open Compute Project, which is a community-based effort to design server infrastructure.   In their own words:
By releasing Open Compute Project technologies as open hardware, our goal is to develop servers and data centers following the model traditionally associated with open source software projects. That’s where you come in.
Facebook and others are opening up data center design.  Gamers have been building their own systems for years.  Assuming Bunnie's logic is correct, open hardware will apply to wide range of devices from phones up to massive clusters.  Community-based, customized system designs are no longer an oddity but part of a broad movement that will change the way all of us think about building and deploying applications on any kind of physical hardware.  It will upset current companies but also create opportunities for new kinds of businesses.  The "cloud" is not the only revolution in computing.  Open source hardware has arrived.  

Thursday, September 29, 2011

Quick Installation of Replication from MySQL to MongoDB

Proof-of-concept Tungsten support for MongoDB arrived last May, when I posted about our hackathon effort to replicate from MySQL to MongoDB.  That code then lay fallow for a few months while we worked on other things like parallel replication, but the period of idleness has ended.  Earlier this week I checked in fixes to Tungsten Replicator to add one-line installation support for MongoDB slaves.

MySQL to MongoDB replication will be officially supported in the Tungsten Replicator 2.0.5 build, which will be available in a few weeks.  However, you can try out MySQL to MongoDB replication right now.  Here is a quick how-to using my lab hosts logos1 for the MySQL master and logos2 for the MongoDB slave. 

1. Download the latest development build of Tungsten Replicator.   See the nightly builds page for S3 URLs.

$ cd /tmp
$ wget --no-check-certificate https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.0.5-332.tar.gz

2. Untar and cd into the release. 

$ tar -xzf tungsten-replicator-2.0.5-332.tar.gz
$ cd tungsten-replicator-2.0.5-332

3. Install a MySQL master replicator on a host that has MySQL installed and is configured to use row replication, i.e. binlog_format=row.  Note that you need to enable the colnames and pkey filters.  These add column names to row updates and eliminate update and delete query columns other than those corresponding to the primary key, respectively. Last but not least, ensure strings are converted to Unicode rather than transported as raw bytes, which we have to do in homogeneous MySQL replication to finesse character set issues.  

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mysql \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos1 \
  --mysql-use-bytes-for-string=false \
  --svc-extractor-filters=colnames,pkey \
  --svc-parallelization-type=disk --start-and-report

4. Finally, install a MongoDB slave.  Before you do this, ensure mongod 1.8.x is up and running on the host as described in the original blog post on MySQL to MongoDB replication.   My mongod is running on the default port of 27017, so there is no --slave-port option necessary. 

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mongodb \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos2 \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --svc-parallelization-type=disk --start-and-report

That's it.  You test replication by logging into MySQL on the master, adding a row to a table, and confirming it reaches the slave.   First the SQL commands: 

$ mysql -utungsten -psecret -hlogos1 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
...
mysql> create table bar(id1 int primary key, data varchar(30));
Query OK, 0 rows affected (0.15 sec)

mysql> insert into bar values(1, 'hello from mysql');
Query OK, 1 row affected (0.00 sec)

Now check the contents of MongoDB:  

$ mongo logos2:27017/test
MongoDB shell version: 1.8.3
connecting to: logos2:27017/test
system.indexes
> db.bar.find()
{ "_id" : ObjectId("4e85269484aef8fcae4b0010"), "id1" : "1", "data" : "hello from mysql" }

Voila!  We may still have bugs, but at least MySQL to MongoDB replication is now easy to install.   

Speaking of bugs, I have been fixing problems as they pop up in testing.  The most significant improvement is a feature I call auto-indexing on MongoDB slaves.  MongoDB materializes collections automatically when you put in the first update, but it does nothing about indexes.  My first TPC-B runs processed less than 100 transactions per second on the MongoDB slave, which is pretty pathetic. The bottleneck is due to MongoDB update operations of the form 'db.account.findAndModify(myquery,mydoc)'.  You must index properties used in the query or things will be very slow.   

Auto-indexing cures the update bottleneck by ensuring that there is an index corresponding to the SQL primary key for any table that we update.  MongoDB makes this logic very easy to implement--you can issue a command like 'db.account.ensureIndex({account_id:1})' to create an index.  What's really cool is that MongoDB will do this even if the collection is not yet materialized--e.g., before you load data.   It seems to be another example of how MongoDB collections materialize whenever you refer to them, which is a very useful feature.  

TPC-B updates into MongoDB are now running at over 1000 transactions per second on my test hosts. I plan to fix more bugs and goose up performance still further over the next few weeks.  Through MongoDB we are unlearning assumptions within Tungsten that are necessary to work with non-relational databases.  It's great preparation for big game hunting next year:  replication to HBase and Cassandra.  

Thursday, September 8, 2011

What's Next for Tungsten Replicator

As Giuseppe Maxia recently posted we released Tungsten Replicator 2.0.4 this week.  It has a raft of bug fixes and new features of which one-line installations are the single biggest improvement.  I set up replicators dozens of times a day and having a single command for standard cluster topologies is a huge step forward.  Kudos to Jeff Mace for getting this nailed down.

So what's next?  You can get see what we are up to in general by looking at our issues list.  We cannot do everything at once, but here are the current priorities for Tungsten Replicator 2.0.5.
  • Parallel replication speed and robustness.  I'm currently working on eliminating choke points in performance (like this one) as well as eliminating corner cases that cause the replicator to require manual intervention, such as aging out logs that are still needed by slaves.  
  • Multi-master replication.  This includes better support for system of record architectures, many masters to one slave, and replication between the same databases on different sites.  Stephane Giron nailed a key MyISAM multi-master bug for the last release.  We will continue to polish this as we work through our current projects.   
  • Better installations for more types of databases.  Jeff recently hacked in support for PostgreSQL as well as Oracle slaves, and we are contemplating addition of MongoDB support.  Heterogeneous replication is getting simpler to set up.  
  • Filter usability.  Giuseppe has a list of improvements for filters, which are one of the most powerful Tungsten Replicator features but not as easy for non-developers to use as we would like.  Better installation support is first on the list followed by ability to load and unload dynamically.  
  • Data warehouse loading.  We have a design for fast data warehouse loading that I hope we'll be able to implement in the next few weeks.  Linas Virbalas has also been working on this problem along with a number of other heterogeneous projects for customers.  
This is a lot of work and not everything will necessarily be finished when 2.0.5 goes out.  However, I hope we'll make progress on all of them.  In case you are wondering how we pick things, replicator development is largely driven by customer projects.   If you have something you need in the replicator, please contact Continuent.

After this build we will... Er, let's get 2.0.5 done first.  Suffice it to say we have a long list of useful and interesting features to discuss in future blog articles.

Tuesday, September 6, 2011

The Inimitable Mr. Steven Jobs

There have been countless articles praising Steve Jobs since he announced his retirement from Apple on August 25th.  Most either catalogue Steve Job's many triumphs or assess the impact of his creativity on society.  Those are entertaining topics but not especially useful.  A more practical question is why Steve Jobs is so good at creating new products and whether the rest of us can imitate him.

Steve Job's best work seems to follow a repeated pattern.  Let's call it the Apple pattern, though of course it could just as well be the Pixar pattern or Next pattern:
  1. See the whole picture of some crucial human/technology interaction and recognize gaps.  
  2. Design products to fill those gaps that combine artistic sensibility and innovative technology.
  3. Get a large organization to implement designs in a way that makes the end result like the handiwork of a single highly-focused craftsman. 
    Two things about the pattern seem particularly striking.  First, Steve Jobs is a complete package.  I have been in the tech industry for over three decades and have met people who did one or at most two of these things at the level necessary to create products that move large markets.  Almost nobody does all three.  The fact that Steve is excellent in all areas simultaneously may be a root cause behind his long run of successes.

    Second, Job's ability to drive implementation teams is extraordinary.  Maybe it's just the manager in me, but I find his ability to pick the right people to run teams and to keep those teams pointed in a clear direction without product-destroying compromises quite remarkable.  This is far harder than generating ideas in the first place.  The heart of the Apple pattern as as much about understanding people as technology--not just users but the creators as well.  I have never heard Jobs make pronouncements on team management, but there is an excellent talk from Ed Catmull of Pixar that summarizes the tensions quite well.

    Steve Jobs is commonly compared to great inventors like Edison, Ford, and Disney.  When thinking about imitation, another parallel seems more illuminating:  John Churchill, Duke of Marlborough and hands-down the greatest English general of all time.


    A possible Jobs ancestor?
    Marlborough possessed a seldom equalled ability to see war as an integrated whole across geography and branches of arms, devise unexpected strategies to exploit the weaknesses of his enemies, and execute them flawlessly in the difficult conditions of early 18th Century campaigns.  Execution extended from handling fractious allies down to the painstaking work to ensure his men had proper meals after each day's march.  In other words:  analogous problem-solving abilities to Steve Jobs, translated into the field of warfare.   The parallel extends to the lavish praise of contemporaries and later historians.  Winston Churchill famously described Marlborough as follows.  
    He commanded the armies of Europe against France for ten campaigns. He fought four great battles and many important actions ... He never fought a battle that he did not win, nor besieged a fortress that he did not take ... He quitted war invincible.
    Grand problem-solvers like Marlborough and Jobs are sufficiently rare they tend to be one-offs who change society but leave no obvious successors.  English military superiority on the Continent waned after Marlborough's retirement.  Something similar will likely befall Apple after Jobs, current happy talk about product pipelines and cash position notwithstanding.  It is simply not possible to imitate Jobs by committee, which is effectively what will happen once he is completely absent.  The driving force is gone.

    That said, we can all imitate Steve Jobs, albeit on a smaller scale.  Many highly successful products start with a single person who conceives the idea and drives at least the first couple of iterations to completion.  Seeing the whole problem, applying innovative designs to solve it, and managing the team to get it done is a fundamental pattern that applies across a wide range of endeavors.  Here is just one of many examples.

    Many years ago at Sybase I worked for a manager named Mark Deppe.  Early in the 1990s Mark learned that Wall Street firms were patching together crude publish/subscribe messaging applications to move data between financial systems in order to speed up trades.   He recognized that there was a much better way to do this using log-based data replication and built the Sybase Replication Server product.  The Rep Server went on to generate hundreds of millions of dollars in sales.  It still sells well today, over 15 years later.  Mark was a great architect but also a great builder of teams.  He paid as much or more attention to hiring and managing people as he did to technology.  He trusted the people he hired, and he gave them the freedom and support to do great work.  At the same time Mark was also incredibly attentive to detail and did all the project management for the first releases himself.  Years later he said it was too important a task to hand off to anyone else.

    Mark Deppe was the best technical manager I ever worked for.  I have consciously imitated his best practices for many years.  Looking back it seems I was unconsciously imitating the Apple design pattern.  But perhaps that was not a complete coincidence.   Before joining Sybase Mark was at Apple where he worked with (guess who?) Steve Jobs.

    ------------------
    NOTE:  After this article was published I found the flow hard to understand and edited it a week or so later to make it more readable.  The argument is the same as before.

    Tuesday, August 30, 2011

    Practical Multi-Master Replication using Shard Filters

    Earlier this month I published an article on this blog describing the system of record approach to multi-master replication.  As mentioned in that article my colleagues and I at Continuent have been working on improving Tungsten to make system of record design patterns easier to implement.  This article describes how to set up system of record using Tungsten Replicator shard filters, which are a new feature in Tungsten 2.0.4.  By doing so we will create a multi-master configuration that avoids replication loops and transaction conflicts.  On top of that, it is quite easy to set up.

    There are many possible system of record patterns depending on how many schemas are shared and across how many masters.  The following diagram shows three of them.  In contrast to many so-called MySQL multi-master implementations, all masters are live and accept updates.  (By contrast, schemes such as MySQL-MHA make extra masters read-only.  Don't be fooled!)  

    For today's exercise we will implement the basic system of record.  Once you understand this you can quickly set up other multi-master scenarios.  

    Defining Shard Master Locations

    The first step is to tell Tungsten where each shard is mastered.  By mastered we mean it is the one master that receives application updates on that shard, whereas all other masters have copies only or may not even contain the shard at all.  Tungsten uses a variant of CSV (comma-separated format) where the first line contains column names.  You can have any amount of whitespace between entries.  Create a file called shards.map with your favorite editor and type in the following lines. 

    shard_id master critical
    tungsten_nyc nyc false
    tungsten_sjc sjc false
    acme         sjc     false
    pinnacle        nyc     false

    The first column is the name of the shard.  This must be unique--because a shard can only live on one master.  The next column is the "home" master for the shard.  This is the one and only master that should receive shard updates.  The third column defines whether the shard is critical and requires full serialization.  It will be linked to parallel replication in a later release.  

    It turns out you do not need to add entries for Tungsten catalog schemas such as tungsten_nyc.  Tungsten Replicator will create them automatically.  They are shown here for completeness only.  

    Creating Replication Services

    Next we need to define services to replicate bi-directionally between DBMS servers and set options to filter shards using the ShardFilter class, which is new in Tungsten 2.0.4.  The shard filter helps ensure that shards replicate from their home masters only and not from other locations.   If you do not know what replication services are, you can find a description of them in this article.  

    Multi-master replication is easy to mis-configure, so to prevent accidents we will tell the shard filter to generate an error any time it processes a shard it has never seen before.  The replication service will immediately fail, which signals that we have to update shard definitions.  This is the safest way to implement system of record or any multi-master configuration for that matter.  It is generally easier to restart replication after correcting the configuration than to mix up data, which can lead to major outages.  

    The first step is to set replication services for each master.  These read the binlog and make transactions available to slave replication services.  Here are the commands.   Note that the sjc master is on host logos1, while the nyc master is on logos2.  The remaining examples use these names consistently. 

    # Define common master settings. 
    COMMON_MASTER_OPTS="--datasource-user=tungsten --datasource-password=secret \
     --home-directory=/opt/continuent --svc-parallelization-type=disk \
     --svc-extractor-filters=shardfilter \
     --property=replicator.filter.shardfilter.unknownShardPolicy=error"

    # Set up sjc master. 
    tools/tungsten-installer --master-slave -a --master-host=logos1 \ 
    --cluster-hosts=logos1 --service-name=sjc $COMMON_MASTER_OPTS --start-and-report 

    # Set up nyc master. 
    tools/tungsten-installer --master-slave -a --master-host=logos2 \
    --cluster-hosts=logos2 --service-name=nyc $COMMON_MASTER_OPTS --start-and-report 

    The --svc-extractor-filters option adds shard filtering immediately after event extraction.  The unknownShardPolicy=error setting will cause the masters to die if they process an undefined shard.  

    Now we can define the remote slave services for sjc and nyc.  These are special slaves that write transactions onto another master as opposed to a normal slave.  We would like slave services to error out on unknown shards as well.  Also (and this is important) we want them to enforce shard homes.  Here are the commands to create the services and start each one.  

    COMMON_SLAVE_OPTS="--release-directory=/opt/continuent/tungsten \
    --service-type=remote --allow-bidi-unsafe=true --svc-parallelization-type=disk \
    --svc-applier-filters=shardfilter \
    --property=replicator.filter.shardfilter.unknownShardPolicy=error \
    --property=replicator.filter.shardfilter.enforceHome=true"

    # Set up sjc remote slave. 
    tools/configure-service -C -a --host=logos2 \
    --local-service-name=nyc --role=slave \
    --datasource=logos2 --master-host=logos1 $COMMON_SLAVE_OPTS sjc
    $trepctl -host logos2 -service sjc start

    # Set up nyc remote slave. 
    tools/configure-service -C -a --host=logos1 \
    --local-service-name=sjc --role=slave \
    --datasource=logos1 --master-host=logos2 $COMMON_SLAVE_OPTS nyc
    trepctl -host logos1 -service nyc start

    The --svc-applier-filters option adds shard filtering before applying to the DBMS.  The unknownShardPolicy=error setting will cause the slaves to die if they process an undefined shard.  Finally, the enforceHome=true option means that each slave will drop any transaction that lives on a different service from that slave's master.  

    At the end of this procedure, your services should be online and read to run.  Use 'trepctl services' to make sure.  

    Loading Shard Definitions

    To make shard definitions take effect, you must load the shard.map contents into each replication service.  You can do this any time the replicator is running but after loading new definitions you must put the replicator online again.  Here are the commands to load the shard maps onto each of the four replication services.   For each replication service, you must delete the old definitions, reload new ones, and get the replicator to go online again. 

    #!/bin/bash
    MAP=shard.map
    for host in logos1 logos2
    do
      for service in sjc nyc
      do
        trepctl -host $host -service $service shard -deleteAll
        trepctl -host $host -service $service shard -insert < $MAP
        trepctl -host $host -service $service offline
        trepctl -host $host -service $service wait -state OFFLINE
        trepctl -host $host -service $service online
      done
    done

    This looks a little clunky and will be reduced to a single command instead of five in a later release.  I put it in a script to make it quicker to run.  The good news is that there is just one shard map that works for all replication services, regardless of location or role.  

    Once you finish this step, you can go to any replication service and list the shards it knows about.  Let's pick a service and demonstrate: 

    $ trepctl -host logos1 -service sjc shard -list
    shard_id master critical
    tungsten_nyc nyc false
    tungsten_sjc sjc false
    acme sjc false
    pinnacle nyc false

    With this we are ready to start processing some transactions. 

    Multi-Master Operation

    At this point we have multi-master replication enabled between hosts logos1 and logos2.  You can try it out.  Let's add the acme database to the sjc master on logos1 as an example. 

    mysql -utungsten -psecret -hlogos1 
    mysql> create database acme;
    mysql> use acme
    mysql> create table foo (id int);
    mysql> insert into foo values(1);

    We can see that all of these commands replicate over to the logos2 server quite easily with the following command: 

    mysql -utungsten -psecret -hlogos2 -e 'select * from acme.foo'
    +------+
    | id   |
    +------+
    |    1 | 
    +------+

    That seems pretty simple.  In fact it is.  You can go over to logos2 and enter transactions for pinnacle in the same way.  Data replicate back and forth.  There are no replication loops.  There are also no conflicts.  

    Adding a New Shard

    So what happens when we add a new shard?  The simplest way to see is to create a database using a schema name that does not exist in the shard map.   Let's try to create a database named superior on the nyc master.  

    mysql -utungsten -psecret -hlogos2 -e 'create database superior'

    Now check the status of the nyc master replication service.  We see it has failed with an error due to the unknown shard.   (Tungsten parses the create database command and assigns it the shard ID "superior.") 

    $ trepctl -host logos2 -service nyc status
    Processing status command...
    NAME                     VALUE
    ----                     -----
    ...
    pendingError           : Stage task failed: binlog-to-q
    pendingErrorCode       : NONE
    pendingErrorEventId    : mysql-bin.000157:0000000000002475;1287
    pendingErrorSeqno      : 8
    pendingExceptionMessage: Rejected event from unknown shard: seqno=8 shard ID=superior
    ...
    state                  : OFFLINE:ERROR...
    Finished status command...

    This problem is quite easy to fix.  We just open up the shard.map file and add a row for superior so that the file contents look like the following: 

    shard_id master critical
    tungsten_nyc nyc false
    tungsten_sjc sjc false
    acme         sjc     false
    pinnacle        nyc     false
    superior        nyc     false

    Reload the shard.map file as shown previously and you will be back in business.  Incidentally, if you do not want the superior database to be replicated to other masters, you can also specify this in the rules.  Just give superior the special master name #LOCAL as in the following example and it will not replicate outside the nyc service. 

    superior        #LOCAL  false

    In fact, #LOCAL means that any schema named superior will not replicate outside the service in which it is defined.  You can have an unshared schema named superior on every master.  

    Where to Next?

    The shard support described in this article is now part of Tungsten 2.0.4 and will appear in the official build when it is finally ready.  You can try it out right now using one of our handy nightly builds.  

    We plan to build out shard filtering quite a bit from the current base.  One immediate fix is to put in a check so that if an application commits shard updates on the wrong DBMS server, the master replication service on that server will detect it and fail.  This will tell you there's a problem immediately rather than letting you wallow in blissful ignorance while your data become hopelessly mixed up.  We will also simplify the commands to update shards while replicators are online. 

    Longer term we will be adding features to propagate shard definitions through replication itself.  Stay tuned for more work in this area.  If you want to help fund work to enable your own applications, please get in contact with me at Continent.  I can think of at least a dozen ways to make our multi-master support better but it's always nicer to spend the effort on features that enable real systems.  In the meantime, I hope you find multi-master with shard filtering useful and look forward to your feedback.  

    Scaling Databases Using Commodity Hardware and Shared-Nothing Design