Thursday, September 29, 2011

Quick Installation of Replication from MySQL to MongoDB

Proof-of-concept Tungsten support for MongoDB arrived last May, when I posted about our hackathon effort to replicate from MySQL to MongoDB.  That code then lay fallow for a few months while we worked on other things like parallel replication, but the period of idleness has ended.  Earlier this week I checked in fixes to Tungsten Replicator to add one-line installation support for MongoDB slaves.

MySQL to MongoDB replication will be officially supported in the Tungsten Replicator 2.0.5 build, which will be available in a few weeks.  However, you can try out MySQL to MongoDB replication right now.  Here is a quick how-to using my lab hosts logos1 for the MySQL master and logos2 for the MongoDB slave. 

1. Download the latest development build of Tungsten Replicator.   See the nightly builds page for S3 URLs.

$ cd /tmp
$ wget --no-check-certificate https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.0.5-332.tar.gz

2. Untar and cd into the release. 

$ tar -xzf tungsten-replicator-2.0.5-332.tar.gz
$ cd tungsten-replicator-2.0.5-332

3. Install a MySQL master replicator on a host that has MySQL installed and is configured to use row replication, i.e. binlog_format=row.  Note that you need to enable the colnames and pkey filters.  These add column names to row updates and eliminate update and delete query columns other than those corresponding to the primary key, respectively. Last but not least, ensure strings are converted to Unicode rather than transported as raw bytes, which we have to do in homogeneous MySQL replication to finesse character set issues.  

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mysql \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos1 \
  --mysql-use-bytes-for-string=false \
  --svc-extractor-filters=colnames,pkey \
  --svc-parallelization-type=disk --start-and-report

4. Finally, install a MongoDB slave.  Before you do this, ensure mongod 1.8.x is up and running on the host as described in the original blog post on MySQL to MongoDB replication.   My mongod is running on the default port of 27017, so there is no --slave-port option necessary. 

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mongodb \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos2 \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --svc-parallelization-type=disk --start-and-report

That's it.  You test replication by logging into MySQL on the master, adding a row to a table, and confirming it reaches the slave.   First the SQL commands: 

$ mysql -utungsten -psecret -hlogos1 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
...
mysql> create table bar(id1 int primary key, data varchar(30));
Query OK, 0 rows affected (0.15 sec)

mysql> insert into bar values(1, 'hello from mysql');
Query OK, 1 row affected (0.00 sec)

Now check the contents of MongoDB:  

$ mongo logos2:27017/test
MongoDB shell version: 1.8.3
connecting to: logos2:27017/test
system.indexes
> db.bar.find()
{ "_id" : ObjectId("4e85269484aef8fcae4b0010"), "id1" : "1", "data" : "hello from mysql" }

Voila!  We may still have bugs, but at least MySQL to MongoDB replication is now easy to install.   

Speaking of bugs, I have been fixing problems as they pop up in testing.  The most significant improvement is a feature I call auto-indexing on MongoDB slaves.  MongoDB materializes collections automatically when you put in the first update, but it does nothing about indexes.  My first TPC-B runs processed less than 100 transactions per second on the MongoDB slave, which is pretty pathetic. The bottleneck is due to MongoDB update operations of the form 'db.account.findAndModify(myquery,mydoc)'.  You must index properties used in the query or things will be very slow.   

Auto-indexing cures the update bottleneck by ensuring that there is an index corresponding to the SQL primary key for any table that we update.  MongoDB makes this logic very easy to implement--you can issue a command like 'db.account.ensureIndex({account_id:1})' to create an index.  What's really cool is that MongoDB will do this even if the collection is not yet materialized--e.g., before you load data.   It seems to be another example of how MongoDB collections materialize whenever you refer to them, which is a very useful feature.  

TPC-B updates into MongoDB are now running at over 1000 transactions per second on my test hosts. I plan to fix more bugs and goose up performance still further over the next few weeks.  Through MongoDB we are unlearning assumptions within Tungsten that are necessary to work with non-relational databases.  It's great preparation for big game hunting next year:  replication to HBase and Cassandra.  

11 comments:

eeb said...

Hey Robert,
Thanks a lot for this blog! It is very informative and interesting.

I have a a question about the MySQL to MongoDB replication project. How do you see handling the replication of a highly normalized database? For example, if I have an OLTP system and I insert a record in my Orders table, more than likely it is going to consist of columns full of foreign keys. When this is replicated to mongoDB, these keys have no meaning. My own thinking on the problem is to denormalize the record and create a single entity in mongo containing the denormalized data, so the Orders collection in mongoDB would actually contain say a product name instead of the key to row in the Product table. What do you think?

Robert Hodges said...

You are most welcome. You have your fingers on a really interesting problem. I also assume that most apps will organize data into documents exactly as you describe. This creates an interesting mapping problem.

First, you have to find the right document for updates on parts of the document, which means that (I imagine) you'll need to chase backwards up foreign key relationships to locate the primary key of the document you actually need to update in MongoDB. This needs to be done on the MySQL side before updates reach the MongoDB slave.

Second, there's the question of keys. As you say there are reasons for using other types key schemes--auto-increment keys in MongoDB don't look very scalable due to the global write lock in the current implementation. However, there's another more practical problem in that if you change key schemes for replicated data you may end up with an expensive and complex key mapping operation when moving data. To solve this you should decide which system really "owns" the object being replicated. If it's actually MySQL you should probably keep the MySQL keys as extra fields to aid replication.

Tungsten has filters, which are a powerful way of changing transactions and adding metadata. We'll be adding new filters in the future to deal with these and other issues.

Sushil Jain said...

Greetings,

Thanks a ton for publishing very informative blog on heterogeneous replication.

I was trying and following the procedure in my lab and sucessfully done with setting up "MySQL Master Replicator" on one host i.e. "mysqlpc" and slave replicator on another host i.e. "mongopc" but now I am stuck at, I guess, very last point.

Some information:

I am using MySQL 5.6.x, tungsten replicator 2.0.7 and mongodb 2.4.1
Both machines are 32 bit and CentOS installed on them.

Service on both machines are online and I am not getting any kind of error.
In "/opt/continuent/test_db" dir, I am also getting log files.
But there is no data in "Mongo database".

And when I run the replicator on slave("mongopc"), I am getting following warnings :

#1) 32-bit architecture not recommended for DBMS nodes

#2) Currently unable to check for the THL schema in mongodb.

I tried hard to find what's wrong with the configuration but failed.

Could you help me to get this issue fixed ?

Thanks & Regards,
Sushil Jain

Sushil Jain said...

Greetings,

Thanks a ton for publishing very informative blog on heterogeneous replication.

I was trying and following the procedure in my lab and sucessfully done with setting up "MySQL Master Replicator" on one host i.e. "mysqlpc" and slave replicator on another host i.e. "mongopc" but now I am stuck at, I guess, very last point.

Some information:

I am using MySQL 5.6.x, tungsten replicator 2.0.7 and mongodb 2.4.1
Both machines are 32 bit and CentOS installed on them.

Service on both machines are online and I am not getting any kind of error.
In "/opt/continuent/test_db" dir, I am also getting log files.
But there is no data in "Mongo database".

And when I run the replicator on slave("mongopc"), I am getting following warnings :

#1) 32-bit architecture not recommended for DBMS nodes

#2) Currently unable to check for the THL schema in mongodb.

I tried hard to find what's wrong with the configuration but failed.

Could you help me to get this issue fixed ?

Thanks & Regards,
Sushil Jain

Robert Hodges said...

@Sushil, you are most welcome.

Please ensure you have row replication enabled. The MongoDB throws away statements, because only row updates can be translated to BSON. You can confirm that statements or rows are generated using 'thl list' on either the master or slave.

Sushil Jain said...

Thank for the very quick response.

I've already configured
"binlog-format=row" in "my.cnf", but still its not working.
I unable to figure out what I am doing wrong or missing.

Robert Hodges said...

@Sushil, it would be easier to follow this on the Tungsten Replicator discussion group (https://groups.google.com/forum/?fromgroups#!forum/tungsten-replicator-discuss).

Meanwhile, I believe this is a problem where you probably want to turn on debug logging for the MongoDB applier. You can do this by adding the following lines to the replicator log4j.properties file and restarting the process:

log4j.logger.com.continuent.tungsten.replicator.applier.MongoApplier=DEBUG, stdout
log4j.additivity.com.continuent.tungsten.replicator.applier.MongoApplier=false

With luck you'll see something in the messages that helps. Among other things, I would look carefully at the transactions to make sure you are not somehow filtering out data you want or that you don't have an application that's secretly sending statements instead of row updates. That happens all too often, I'm sad to say. :(

Sushil Jain said...

I've submitted an issue(#548) on the URL provided by you.

I checked into "trepsvc.log" and found some kind of NullPointerException.
I've also attached the log file.

Could you please take a look and tell me what's wrong ?

:Sushil Jain

Alex Willemsma said...

Hi Robert,

I'm hoping to use Tungsten Replicator to replicate some MySQL data to MongoDB, and I've gotten everything set up and replication is working great, however I noticed that all the data replicated to MongoDB is stored as strings. It doesn't seem to matter whether it be dates, integers, floats or whatever in MySQL, when it gets to mongo it's a string. For most cases I can work around this, but I just discovered that I need to be able to query (in Mongo) based on some of these dates, and obviously that requires them to be actually stored correctly, strings won't do it.

Am I doing something wrong (possibly not including some filter?), or is this known behaviour? If it's known behaviour, do you estimate it would be difficult to fix, or are there good reasons this hasn't been implemented?

Thanks!

Robert Hodges said...

@Alex, your analysis is correct. We currently store data as strings rather than JSON types. It would not be especially hard to add support for correct typing but nobody so far has asked for it.

What kind of application are you developing? If it's a company would you be interesting in sponsoring some development for this to help move it along?

Robert Hodges said...

@Alex, p.s., Try submitting this question on the tungsten-replicator-discuss group on Google. It's easier to correspond there. (https://groups.google.com/forum/#!forum/tungsten-replicator-discuss)

Scaling Databases Using Commodity Hardware and Shared-Nothing Design