May 14, 2011

Introducing MySQL to MongoDB Replication

The last article on this blog described our planned MySQL to MongoDB replication hackathon at the recent Open DB Camp in Sardinia.  Well, it worked, and the code is now checked into the Tungsten Replicator project.   This article describes exactly what we did to write the code and set up replication.  You can view it as a kind of cookbook both for implementing new database types in Tungsten as well as setting up replication to MongoDB.

The Team

MySQL to MongoDB replication was a group effort with three people:  Flavio Percoco, Stephane Giron, and me.  Flavio has worked on MongoDB for a couple of years and is extremely well-informed both on database setup as well as application design.  Stephane Giron is a replication engineer at Continuent and has done a substantial amount of the work on data extraction from MySQL, especially row replication.  I work on the core execution framework as well as performance.

Getting Started with MongoDB

There were a couple of talks on MongoDB during the first morning of Open DB camp (Saturday May 7th), which Stephane and I dutifully attended to raise our consciousness.  We got cracking on implementation around 2pm that afternoon.   The first step was to bring up MongoDB 1.8.1 and study its habits with help from Flavio.

MongoDB is definitely easy to set up.  You get binary builds from the MongoDB download page.  Here is a minimal set of commands to unpack MongoDB 1.8.1 and start the mongod using directory data to hold tables.  

$ tar -xvzf mongodb-osx-x86_64-1.8.1.tgz
$ cd mongodb-osx-x86_64-1.8.1
$ mkdir data
$ bin/mongo --dbpath data
(... messages ...)


You connect to mongod using the mongo client.  Here's an example of connecting and creating a table with a single row.

$ bin/mongo localhost:27017
MongoDB shell version: 1.8.1
connecting to: localhost:27017/test
> use mydb
switched to db mydb
> db.test.insert({"test": "test value", "anumber" : 5 })                
> db.test.find()
{ "_id" : ObjectId("4dce9a4f3d6e186ffccdd4bb"), "test" : "test value", "anumber" : 5 }
> exit


This is schema-less programming in action.  You just insert BSON documents (BSON =  Binary JSON) into collections, which is Mongolese for tables.  MongoDB creates the collection for you as soon as you put something in it. The automatic materialization is quite addictive once you get used to it, which takes about 5 minutes.

The MongoDB client language is really handy.  It is based on JavaScript.  There are what seem to be some non-Javascript commands like "show dbs" to show databases or "show collections" to list the tables.  Everything else is object-oriented and easy to understand.  For example, to find all the records in collection test, as we saw above, you just connect to the database and issue a command on the local db object.  Collections appear as properties of db, and operations on the collection are methods.

It helps that the MongoDB folks provide very accessible documentation, for example a SQL to MongoDB translation chart.   I put together a little practice program using the MongoDB Java driver to insert, referring to the Javadoc for the class library when in doubt about API calls.  There are also a couple of very helpful examples, like this one, included with the driver.

All told, setup and orientation took us about 45 minutes.  It helped enormously that Flavio is a MongoDB expert, which minimized flail considerably.

Implementing Basic Replication from MySQL to MongoDB

After setup we proceeded to implement replication.  Here is an overview of the replicator pipeline to move data from MySQL to MongoDB.  Pipelines are message processing flows within the replicator.


Direct pipelines move data from DBMS to another within a single replicator.  They are already a standard part of Tungsten Replicator and most of the code shown above already exists, except for the parts shown in red.   Before we started, we therefore needed to set up a replicator with a direct pipeline.

We first built the code according to the instructions on the Tungsten project wiki, uploaded the binary to our test host, and configured the replicator.   First, we ran the Tungsten configure script to set defaults for the MySQL server (user name, extract method, etc.).   Next we ran the configure-service command to set up the direct pipeline configuration file.   Both commands together look like the following:

./configure
./configure-service -C --role=direct mongodb

The second command created a file called tungsten-replicator/conf/static-mongodb.properties with all the information about the direct pipeline implementation but of course nothing yet about MongoDB.

Now we could start the implementation.   To move data to MongoDB, we needed two new components:
  1. A Tungsten RawApplier to apply row updates to MongoDB.  RawApplier is the basic interface you implement to create an applier to a database.  
  2. A Tungsten Filter to stick column names on row updates after extracting from MySQL.  MySQL row replication does not do this automatically, which makes it difficult to construct JSON at the other end because you do not have the right property names. 
To get started on the applier I implemented a very simple class named MongoApplier that could take an insert from MySQL, turn it into a BSON document, and add it to an equivalently named database and collection in MongoDB.  I added this to the replicator code tree, then built and uploaded tungsten-replicator.jar.  (Use 'ant dist' in the replicator directory to build the JAR.)

To start using the new MongoDB applier, we needed to edit the service properties file to use this component instead of the standard MySQL applier that configuration adds by default.  To do this, you can open up static-mongodb.properties with your favorite editor.  Add the following properties at the bottom of the APPLIERS section.

# MongoDB applier.  You must specify a connection string for the server.
# This currently supports only a single server.
replicator.applier.mongodb=com.continuent.tungsten.replicator.applier.MongoApplier
replicator.applier.mongodb.connectString=localhost:27017


Next, you need to fix up the direct pipeline so that the last stage uses the new applier.  We located the direct pipeline definition (around line 208 in the properties file) and set the applier to mongodb as shown in the following example.

# Write from parallel queue to database.
replicator.stage.d-pq-to-dbms=com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask
replicator.stage.d-pq-to-dbms.extractor=parallel-q-extractor
replicator.stage.d-pq-to-dbms.applier=mongodb
replicator.stage.d-pq-to-dbms.filters=mysqlsessions
replicator.stage.d-pq-to-dbms.taskCount=${replicator.global.apply.channels}
replicator.stage.d-pq-to-dbms.blockCommitRowCount=${replicator.global.buffer.size}

We then started the replicator using 'replicator start.'  At that point we could do the following on MySQL:

mysql> create table foo(id int primary key, msg varchar(35));
Query OK, 0 rows affected (0.05 sec)
mysql> insert into foo values(1, 'hello from MySQL!');
Query OK, 1 row affected (0.00 sec)


...And within a second we could see the following over in MongoDB:

> show collections
foo
system.indexes
> db.foo.find();
{ "_id" : ObjectId("4dc55e45ad90a25b9b57909d"), "1" : "1", "2" : "hello from MySQL!" }


This kind of progress was very encouraging.  It took roughly 2 hours to get to move the first inserts across.  Compared to replicating to a new SQL database like Oracle that's lightning fast.  However, there were still no property names because we were not adding column names to row updates.

Meanwhile, Stephane had finished the column name filter (ColumnNameFilter) and checked it in.  I rebuilt and refreshed the replicator code, then edited static-mongodb.properties as follows to add the filter.  First put in the filter definition in the FILTERS section:

# Column name filter.  Adds column name metadata to row updates.  This is
# required for MySQL row replication if you have logic that requires column
# names.
replicator.filter.colnames=com.continuent.tungsten.replicator.filter.ColumnNameFilter

Next, make the first stage of the direct pipeline use the filter:

# Extract from binlog into queue.
replicator.stage.d-binlog-to-q=com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask
replicator.stage.d-binlog-to-q.extractor=mysql
replicator.stage.d-binlog-to-q.filters=colnames,pkey
replicator.stage.d-binlog-to-q.applier=queue

We then restarted the replicator.  Thereupon, we started to see inserts like the following, complete with property names:

> db.foo.find()
{ "_id" : ObjectId("4dc77bacad9092bd1aef046d"), "id" : "25", "data" : "data value" }

That was better, much better!   To this point we had put in exactly 2 hours and 45 minutes wall clock time.  It was enough to prove the point and more than enough for a demo the next day.   The hackathon was a rousing success.

Further Development

Over the next couple of days I rounded out the MongoApplier to add support for UPDATE and DELETE operations, as well as to implement restart.  The full implementation is now checked in on code.google.com, so you can repeat our experiences by downloading code and building yourself or by grabbing one of the Tungsten nightly builds.

Restart is an interesting topic.  Tungsten uses a table to store the sequence number of the last transaction it applied.  We do this by creating an equivalent collection in MongoDB, which is updated after each commit.  There is a problem in that MongoDB does not have transactions.  Each update is effectively auto-commit, much like MyISAM table type on MySQL.  This means that while Tungsten can restart properly after a clean shutdown, slave replication is not crash safe.  Lack of atomic transactions is a bigger issue with MongoDB and other NoSQL databases that goes far beyond replication.  For now, this is just how Tungsten's MongoDB support works.

Speaking of things that don't work, the current implementation is a prototype only.  We have not tested it with more than a few data types.  It only works with a single MongoDB daemon.  It does not set keys properly or specify indexes on tables.  There are no guarantees about performance, except to say that if you had more than a small amount of data it would be quite slow.  (OK, that's a guarantee after all.)

Epilog

Overall all the hackathon was a great success, not to mention lots of fun.  It went especially well because we had a relatively small problem and three people (Stephane, Flavio, and Robert) with complementary skills that we could combine easily for a quick solution.  That seems to be a recipe for succeeding on future hackathons.

From a technical point of view, it helped that MongoDB is schema-less.  Unlike SQL databases, just adding a document materializes the table in MongoDB.  This made our applier implementation almost trivially easy, because processing row updates takes only a few dozen lines of Java code in total.  It also explains why a lot of people are quite attached to the NoSQL programming model.

I am looking forward to learning a lot more about MongoDB and other NoSQL databases.  It would take two or three weeks of work to get our prototype to work with real applications.  Also, it looks as if we can implement replication going from MongoDB to MySQL.  According to Flavio there is a way to search the transaction log of MongoDB as a regular collection.  By appropriately transforming BSON objects back to SQL tuples, we can offer replication back to MySQL.

There are many other lessons about MongoDB and NoSQL in general but it seems best to leave them for a future article when I have more experience and actually know what I'm talking about.  Meanwhile, you are welcome to try out our newest Tungsten replication feature.

5 comments:

alex.zaitsev said...

Hi Robert,

This is very interesting and cool. Hopefully you could do the same for Cassandra eventually :)

Speaking of MongoDB, I just visited the MongoMoscow conference today and was a little bit disappointed. It is still quite limited and immature, though I know many companies are using MongoDB in production already. They lack a lot of features and have just basic implementation of others. When MongoDB engineers were talking about functionality, I kept catching myself that this was done in RDBMS long ago and the same can be done with Tungsten Replicator, for example (I mean complex topologies with sharding), or dbShards and similar solutions. So unlike Cassanrda this is not impressive at all.

Cheers,
Sasha

Robert Hodges said...

Hi Sasha! It was really fun. You are definitely right about the immaturity of the non-RDBMS offerings. At this point it's really hard to tell which, if any are going to make it. From what I have seen so far the lack of atomic transactions is a real issue for replication and operations like bulk load not to mention a lot of application updates. It probably won't be long before some form of transactions goes back in.

Meanwhile, you to give the MongoDB guys props for making their DBMS very easy to set up and use.

Gijs Mollema said...

After attending Devoxx 2011 and attending a presentation from 10Gen I have written a little blog comparing MongoDB to rdbm databases. Just as additional info:) Please read the following url and let me know if you have any suggestions \ additions.

http://blog.iprofs.nl/2011/11/25/is-mongodb-a-good-alternative-to-rdbms-databases-like-oracle-and-mysql/

Cheers,
Gijs Mollema

Anonymous said...

I'd love to try this out but can't get it to work.

Running this: ./configure-service -C -a --role=direct --master-datasource=localhost --datasource-type=mongodb --datasource-user=root --datasource-password= --skip-validation-check=InstallerMasterSlaveCheck --svc-parallelization-type=disk mongodb

I get the following error:
ERROR >> localhost >> Unable to find a database type class for mongodb
ERROR >> localhost >> There was a problem parsing the arguments

Running this:
./configure-service -C --role=direct mongodb

I get the following error:
ERROR >> localhost >> You must specify a datasource alias or datasource configuration information
ERROR >> localhost >> There was a problem parsing the arguments

Robert Hodges said...

@Anonymous, did you try out the commands in the follow-on blog article at http://scale-out-blog.blogspot.com/2011/09/quick-installation-of-replication-from.html? This shows correct commands. For now direct pipelines do not work properly with MongoDB at least through the standard tungsten-installer. You have to do a master/slave setup.