Thursday, September 29, 2011

Quick Installation of Replication from MySQL to MongoDB

Proof-of-concept Tungsten support for MongoDB arrived last May, when I posted about our hackathon effort to replicate from MySQL to MongoDB.  That code then lay fallow for a few months while we worked on other things like parallel replication, but the period of idleness has ended.  Earlier this week I checked in fixes to Tungsten Replicator to add one-line installation support for MongoDB slaves.

MySQL to MongoDB replication will be officially supported in the Tungsten Replicator 2.0.5 build, which will be available in a few weeks.  However, you can try out MySQL to MongoDB replication right now.  Here is a quick how-to using my lab hosts logos1 for the MySQL master and logos2 for the MongoDB slave. 

1. Download the latest development build of Tungsten Replicator.   See the nightly builds page for S3 URLs.

$ cd /tmp
$ wget --no-check-certificate https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.0.5-332.tar.gz

2. Untar and cd into the release. 

$ tar -xzf tungsten-replicator-2.0.5-332.tar.gz
$ cd tungsten-replicator-2.0.5-332

3. Install a MySQL master replicator on a host that has MySQL installed and is configured to use row replication, i.e. binlog_format=row.  Note that you need to enable the colnames and pkey filters.  These add column names to row updates and eliminate update and delete query columns other than those corresponding to the primary key, respectively. Last but not least, ensure strings are converted to Unicode rather than transported as raw bytes, which we have to do in homogeneous MySQL replication to finesse character set issues.  

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mysql \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos1 \
  --mysql-use-bytes-for-string=false \
  --svc-extractor-filters=colnames,pkey \
  --svc-parallelization-type=disk --start-and-report

4. Finally, install a MongoDB slave.  Before you do this, ensure mongod 1.8.x is up and running on the host as described in the original blog post on MySQL to MongoDB replication.   My mongod is running on the default port of 27017, so there is no --slave-port option necessary. 

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mongodb \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos2 \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --svc-parallelization-type=disk --start-and-report

That's it.  You test replication by logging into MySQL on the master, adding a row to a table, and confirming it reaches the slave.   First the SQL commands: 

$ mysql -utungsten -psecret -hlogos1 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
...
mysql> create table bar(id1 int primary key, data varchar(30));
Query OK, 0 rows affected (0.15 sec)

mysql> insert into bar values(1, 'hello from mysql');
Query OK, 1 row affected (0.00 sec)

Now check the contents of MongoDB:  

$ mongo logos2:27017/test
MongoDB shell version: 1.8.3
connecting to: logos2:27017/test
system.indexes
> db.bar.find()
{ "_id" : ObjectId("4e85269484aef8fcae4b0010"), "id1" : "1", "data" : "hello from mysql" }

Voila!  We may still have bugs, but at least MySQL to MongoDB replication is now easy to install.   

Speaking of bugs, I have been fixing problems as they pop up in testing.  The most significant improvement is a feature I call auto-indexing on MongoDB slaves.  MongoDB materializes collections automatically when you put in the first update, but it does nothing about indexes.  My first TPC-B runs processed less than 100 transactions per second on the MongoDB slave, which is pretty pathetic. The bottleneck is due to MongoDB update operations of the form 'db.account.findAndModify(myquery,mydoc)'.  You must index properties used in the query or things will be very slow.   

Auto-indexing cures the update bottleneck by ensuring that there is an index corresponding to the SQL primary key for any table that we update.  MongoDB makes this logic very easy to implement--you can issue a command like 'db.account.ensureIndex({account_id:1})' to create an index.  What's really cool is that MongoDB will do this even if the collection is not yet materialized--e.g., before you load data.   It seems to be another example of how MongoDB collections materialize whenever you refer to them, which is a very useful feature.  

TPC-B updates into MongoDB are now running at over 1000 transactions per second on my test hosts. I plan to fix more bugs and goose up performance still further over the next few weeks.  Through MongoDB we are unlearning assumptions within Tungsten that are necessary to work with non-relational databases.  It's great preparation for big game hunting next year:  replication to HBase and Cassandra.  

Thursday, September 8, 2011

What's Next for Tungsten Replicator

As Giuseppe Maxia recently posted we released Tungsten Replicator 2.0.4 this week.  It has a raft of bug fixes and new features of which one-line installations are the single biggest improvement.  I set up replicators dozens of times a day and having a single command for standard cluster topologies is a huge step forward.  Kudos to Jeff Mace for getting this nailed down.

So what's next?  You can get see what we are up to in general by looking at our issues list.  We cannot do everything at once, but here are the current priorities for Tungsten Replicator 2.0.5.
  • Parallel replication speed and robustness.  I'm currently working on eliminating choke points in performance (like this one) as well as eliminating corner cases that cause the replicator to require manual intervention, such as aging out logs that are still needed by slaves.  
  • Multi-master replication.  This includes better support for system of record architectures, many masters to one slave, and replication between the same databases on different sites.  Stephane Giron nailed a key MyISAM multi-master bug for the last release.  We will continue to polish this as we work through our current projects.   
  • Better installations for more types of databases.  Jeff recently hacked in support for PostgreSQL as well as Oracle slaves, and we are contemplating addition of MongoDB support.  Heterogeneous replication is getting simpler to set up.  
  • Filter usability.  Giuseppe has a list of improvements for filters, which are one of the most powerful Tungsten Replicator features but not as easy for non-developers to use as we would like.  Better installation support is first on the list followed by ability to load and unload dynamically.  
  • Data warehouse loading.  We have a design for fast data warehouse loading that I hope we'll be able to implement in the next few weeks.  Linas Virbalas has also been working on this problem along with a number of other heterogeneous projects for customers.  
This is a lot of work and not everything will necessarily be finished when 2.0.5 goes out.  However, I hope we'll make progress on all of them.  In case you are wondering how we pick things, replicator development is largely driven by customer projects.   If you have something you need in the replicator, please contact Continuent.

After this build we will... Er, let's get 2.0.5 done first.  Suffice it to say we have a long list of useful and interesting features to discuss in future blog articles.

Scaling Databases Using Commodity Hardware and Shared-Nothing Design