Saturday, May 14, 2011

Introducing MySQL to MongoDB Replication

The last article on this blog described our planned MySQL to MongoDB replication hackathon at the recent Open DB Camp in Sardinia.  Well, it worked, and the code is now checked into the Tungsten Replicator project.   This article describes exactly what we did to write the code and set up replication.  You can view it as a kind of cookbook both for implementing new database types in Tungsten as well as setting up replication to MongoDB.

The Team

MySQL to MongoDB replication was a group effort with three people:  Flavio Percoco, Stephane Giron, and me.  Flavio has worked on MongoDB for a couple of years and is extremely well-informed both on database setup as well as application design.  Stephane Giron is a replication engineer at Continuent and has done a substantial amount of the work on data extraction from MySQL, especially row replication.  I work on the core execution framework as well as performance.

Getting Started with MongoDB

There were a couple of talks on MongoDB during the first morning of Open DB camp (Saturday May 7th), which Stephane and I dutifully attended to raise our consciousness.  We got cracking on implementation around 2pm that afternoon.   The first step was to bring up MongoDB 1.8.1 and study its habits with help from Flavio.

MongoDB is definitely easy to set up.  You get binary builds from the MongoDB download page.  Here is a minimal set of commands to unpack MongoDB 1.8.1 and start the mongod using directory data to hold tables.  

$ tar -xvzf mongodb-osx-x86_64-1.8.1.tgz
$ cd mongodb-osx-x86_64-1.8.1
$ mkdir data
$ bin/mongo --dbpath data
(... messages ...)

You connect to mongod using the mongo client.  Here's an example of connecting and creating a table with a single row.

$ bin/mongo localhost:27017
MongoDB shell version: 1.8.1
connecting to: localhost:27017/test
> use mydb
switched to db mydb
> db.test.insert({"test": "test value", "anumber" : 5 })                
> db.test.find()
{ "_id" : ObjectId("4dce9a4f3d6e186ffccdd4bb"), "test" : "test value", "anumber" : 5 }
> exit

This is schema-less programming in action.  You just insert BSON documents (BSON =  Binary JSON) into collections, which is Mongolese for tables.  MongoDB creates the collection for you as soon as you put something in it. The automatic materialization is quite addictive once you get used to it, which takes about 5 minutes.

The MongoDB client language is really handy.  It is based on JavaScript.  There are what seem to be some non-Javascript commands like "show dbs" to show databases or "show collections" to list the tables.  Everything else is object-oriented and easy to understand.  For example, to find all the records in collection test, as we saw above, you just connect to the database and issue a command on the local db object.  Collections appear as properties of db, and operations on the collection are methods.

It helps that the MongoDB folks provide very accessible documentation, for example a SQL to MongoDB translation chart.   I put together a little practice program using the MongoDB Java driver to insert, referring to the Javadoc for the class library when in doubt about API calls.  There are also a couple of very helpful examples, like this one, included with the driver.

All told, setup and orientation took us about 45 minutes.  It helped enormously that Flavio is a MongoDB expert, which minimized flail considerably.

Implementing Basic Replication from MySQL to MongoDB

After setup we proceeded to implement replication.  Here is an overview of the replicator pipeline to move data from MySQL to MongoDB.  Pipelines are message processing flows within the replicator.

Direct pipelines move data from DBMS to another within a single replicator.  They are already a standard part of Tungsten Replicator and most of the code shown above already exists, except for the parts shown in red.   Before we started, we therefore needed to set up a replicator with a direct pipeline.

We first built the code according to the instructions on the Tungsten project wiki, uploaded the binary to our test host, and configured the replicator.   First, we ran the Tungsten configure script to set defaults for the MySQL server (user name, extract method, etc.).   Next we ran the configure-service command to set up the direct pipeline configuration file.   Both commands together look like the following:

./configure-service -C --role=direct mongodb

The second command created a file called tungsten-replicator/conf/ with all the information about the direct pipeline implementation but of course nothing yet about MongoDB.

Now we could start the implementation.   To move data to MongoDB, we needed two new components:
  1. A Tungsten RawApplier to apply row updates to MongoDB.  RawApplier is the basic interface you implement to create an applier to a database.  
  2. A Tungsten Filter to stick column names on row updates after extracting from MySQL.  MySQL row replication does not do this automatically, which makes it difficult to construct JSON at the other end because you do not have the right property names. 
To get started on the applier I implemented a very simple class named MongoApplier that could take an insert from MySQL, turn it into a BSON document, and add it to an equivalently named database and collection in MongoDB.  I added this to the replicator code tree, then built and uploaded tungsten-replicator.jar.  (Use 'ant dist' in the replicator directory to build the JAR.)

To start using the new MongoDB applier, we needed to edit the service properties file to use this component instead of the standard MySQL applier that configuration adds by default.  To do this, you can open up with your favorite editor.  Add the following properties at the bottom of the APPLIERS section.

# MongoDB applier.  You must specify a connection string for the server.
# This currently supports only a single server.

Next, you need to fix up the direct pipeline so that the last stage uses the new applier.  We located the direct pipeline definition (around line 208 in the properties file) and set the applier to mongodb as shown in the following example.

# Write from parallel queue to database.

We then started the replicator using 'replicator start.'  At that point we could do the following on MySQL:

mysql> create table foo(id int primary key, msg varchar(35));
Query OK, 0 rows affected (0.05 sec)
mysql> insert into foo values(1, 'hello from MySQL!');
Query OK, 1 row affected (0.00 sec)

...And within a second we could see the following over in MongoDB:

> show collections
{ "_id" : ObjectId("4dc55e45ad90a25b9b57909d"), "1" : "1", "2" : "hello from MySQL!" }

This kind of progress was very encouraging.  It took roughly 2 hours to get to move the first inserts across.  Compared to replicating to a new SQL database like Oracle that's lightning fast.  However, there were still no property names because we were not adding column names to row updates.

Meanwhile, Stephane had finished the column name filter (ColumnNameFilter) and checked it in.  I rebuilt and refreshed the replicator code, then edited as follows to add the filter.  First put in the filter definition in the FILTERS section:

# Column name filter.  Adds column name metadata to row updates.  This is
# required for MySQL row replication if you have logic that requires column
# names.

Next, make the first stage of the direct pipeline use the filter:

# Extract from binlog into queue.

We then restarted the replicator.  Thereupon, we started to see inserts like the following, complete with property names:

{ "_id" : ObjectId("4dc77bacad9092bd1aef046d"), "id" : "25", "data" : "data value" }

That was better, much better!   To this point we had put in exactly 2 hours and 45 minutes wall clock time.  It was enough to prove the point and more than enough for a demo the next day.   The hackathon was a rousing success.

Further Development

Over the next couple of days I rounded out the MongoApplier to add support for UPDATE and DELETE operations, as well as to implement restart.  The full implementation is now checked in on, so you can repeat our experiences by downloading code and building yourself or by grabbing one of the Tungsten nightly builds.

Restart is an interesting topic.  Tungsten uses a table to store the sequence number of the last transaction it applied.  We do this by creating an equivalent collection in MongoDB, which is updated after each commit.  There is a problem in that MongoDB does not have transactions.  Each update is effectively auto-commit, much like MyISAM table type on MySQL.  This means that while Tungsten can restart properly after a clean shutdown, slave replication is not crash safe.  Lack of atomic transactions is a bigger issue with MongoDB and other NoSQL databases that goes far beyond replication.  For now, this is just how Tungsten's MongoDB support works.

Speaking of things that don't work, the current implementation is a prototype only.  We have not tested it with more than a few data types.  It only works with a single MongoDB daemon.  It does not set keys properly or specify indexes on tables.  There are no guarantees about performance, except to say that if you had more than a small amount of data it would be quite slow.  (OK, that's a guarantee after all.)


Overall all the hackathon was a great success, not to mention lots of fun.  It went especially well because we had a relatively small problem and three people (Stephane, Flavio, and Robert) with complementary skills that we could combine easily for a quick solution.  That seems to be a recipe for succeeding on future hackathons.

From a technical point of view, it helped that MongoDB is schema-less.  Unlike SQL databases, just adding a document materializes the table in MongoDB.  This made our applier implementation almost trivially easy, because processing row updates takes only a few dozen lines of Java code in total.  It also explains why a lot of people are quite attached to the NoSQL programming model.

I am looking forward to learning a lot more about MongoDB and other NoSQL databases.  It would take two or three weeks of work to get our prototype to work with real applications.  Also, it looks as if we can implement replication going from MongoDB to MySQL.  According to Flavio there is a way to search the transaction log of MongoDB as a regular collection.  By appropriately transforming BSON objects back to SQL tuples, we can offer replication back to MySQL.

There are many other lessons about MongoDB and NoSQL in general but it seems best to leave them for a future article when I have more experience and actually know what I'm talking about.  Meanwhile, you are welcome to try out our newest Tungsten replication feature.

Tuesday, May 3, 2011

Time for a SQL/NoSQL Group Hug

European Open Database Camp 2011 is this weekend in the hills above Cagliari, Sardinia.  In honor of the increasing number sites that use both NoSQL and SQL databases, I am going to be running a MySQL to NoSQL Hackathon to prototype Tungsten Replicator support for transferring data from MySQL to MongoDB.  The conference will have at least one well-informed MongoDB expert, so we should have enough critical mass to get this done.  It helps that I'll be completely jet-lagged after flying in from the US and unable to sleep anyway.  

Over the past year SQL vs. NoSQL rants have started to abate as people get down to the practical work to make both types of systems work most effectively, often within the same site if not the same application.   I have spoken to a small but growing number of users who want to move data between MySQL and Cassandra, HBase, MongoDB, and others.  (Here's a typical example from StackOverflow.)  Some of them are even willing to pay money for Tungsten Replicator implementations.  You know there is a real need when that happens.  I expect such requests will grow as more applications implement NoSQL stores and have to set up feeds to or from SQL.   Heterogeneous replication in this sense is a good proxy for solution maturity, and at least a couple of the NoSQL stores seem to be getting there.

The MySQL to MongoDB hackathon is our first crack at this problem.  There is no guarantee the results will be especially usable, because Tungsten will not have a built-in solution for SQL to JSON mapping.  (I'm still noodling about that problem--suggestions welcome.)  Still, we will start to understand the problem space better and pave the way for more robust solutions over the course of the next year or so.  I'm looking forward to learning more a lot more about non-SQL stores along the way. 

If you are attending the conference you can hear about our results in person.  You can also tune into the #tungsten IRC channel at starting Friday May 5th around 10pm GMT.  I'll post regular updates over the weekend.  Feel free to drop by on IRC to see how we are doing or even to help out.  We will not single-handedly end the SQL/NoSQL wars, but at least we can help them learn how to share. 

Monday, May 2, 2011

Tungsten Supports Logical Replication on PostgreSQL

Thanks to my colleague Linas Virbalas, Tungsten Replicator has just taken the next step to support full logical replication for PostgreSQL.  Linas posted an article today on his new blog describing PostgreSQL logical replication using SLONY triggers.  I saw a demo of his implementation and was really impressed.   For more information you should read the article, which provides an excellent description of how Tungsten replicates from SLONY logs.

It is pretty exciting whenever Tungsten replicates data to or from a new DBMS type, but PostgreSQL logical replication is really special.  Tungsten Replicator has been able to manage native warm standby and log streaming replication more than a year using a script-based plugin.  This is fine for offering copies for failover but rather limited for problems beyond simple availability.  Logical replication opens up a whole new set of solutions that include multi-master replication, heterogeneous data transfer, and large-scale read scaling on replicas.  It is also a key building block for advanced clustering capabilities like zero-downtime upgrade as well as for building multi-tenant systems for SaaS applications.  After years of working on these problems for MySQL we are glad to finally attack them head on for PostgreSQL as well.

There is still a lot of work to achieve fully functional, easy-to-use PostgreSQL logical replication, but the work that Linas described gives Tungsten a clear path forward.  I expect we will make rapid progress, because so many of the other parts of Tungsten Replicator are already in place.  Meanwhile, Linas has put together a very readable blog that should make interesting reading for years to come.

Scaling Databases Using Commodity Hardware and Shared-Nothing Design