Friday, January 6, 2012

Tungsten on the Beach--LA MySQL Meetup on Jan 11, 2012

It is my pleasure to announce that I will be presenting on Tungsten Replicator next Wednesday, January 11th at the Los Angeles MySQL Meetup. The presentation title is Fast, Flexible, and Fun--The Tungsten Replicator Magical Mystery Tour. This talk is going to be fun for two reasons.

First, it's a great opportunity to meet people in the LA MySQL community and talk about my favorite replication software. Tungsten is like a Swiss Army Knife for data replication.  It solves a wide range of problems involving HA, scaling, and data movement.   The presentation gives a quick intro to the replicator, then surveys how to use the most interesting features, including parallel slave apply, multi-master replication, transaction filtering, and replicating to MongoDB, Oracle, or data warehouses.  I'll even show you how to grab the GPL V2 sources from code.google.com and code up your own replicator extensions using Java or Javascript.

Second, the talk venue is in Santa Monica about 10 blocks up from the ocean.  Who doesn't like beaches?    I certainly do.  See you next week!

p.s.,  Thanks to Joe Devon and the other LA MySQL Meetup folks for the kind invitation.

Friday, November 18, 2011

Replicating Data Now and Then with Tungsten

What do cruise ship management software and data warehouses have in common?  One answer:  they both depend on intermittent data replication.   Large vessels collect data to share with a home base whenever connectivity permits.  If there is no connection, they just wait until later.  Data warehouses also do not replicate constantly.  Instead, it is often far faster to pool updates and load them in a single humongous batch using SQL COPY commands or native loaders.  Replicating updates in this way is sometimes known as batch replication.  Tungsten Replicator supports it quite easily.

To illustrate we will consider a Tungsten master/slave configuration.  (Sample setup instructions here.)  In this example MySQL-based web sales data upload to a data warehouse.   The master receives constant updates, which then apply at controlled intervals on the slave.


The first step is to turn off the slave replicator.  Login to the prod2 host and execute the following command.

$ trepctl offline

The prod2 slave will disconnect from the master as well as the data warehouse.  Updates now accumulate on the master.  We can turn on the slave to fetch and apply them all, then go offline again using one of three methods.  The first method uses the current sequence number on the master.  Here are sample commands to fetch and apply all transactions from the master up to the current master position.

$ trepctl -host prod1 status |grep appliedLastSeqno
appliedLastSeqno       : 19600
$ trepctl online -seqno 19600
$ trepctl wait -state OFFLINE -limit 300

As you can see, the first command locates the master sequence number.  The second command tells the slave to go online and replicate to sequence number 19600.  Finally the third command waits until either slave is back in the offline state or 300 seconds elapse, whichever comes first.  This is not strictly necessary for replication but is very handy for scripts, as it eliminates a potentially awkward polling loop.  

The second method is to use the MySQL binlog position on the master.  The idea is the same as the previous example.  We get the master binlog position, then tell the slave to apply transactions to that point and go offline.  Here's an example: 

$ mysql -utungsten -psecret -hprod1 -e 'show master status'
+------------------+----------+--------------+------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.002023 | 92395851 |              |                  |
+------------------+----------+--------------+------------------+
$ trepctl online -event mysql-bin.002023:0000000092395851
$ trepctl wait -state OFFLINE -limit 300

Note in this example that you must pad the binlog offset out to 16 digits, which means you must add the extra zeros shown in bold.  Tungsten compares native replication IDs as strings, so that we can handle other databases besides MySQL.  This normally a minor inconvenience, unless you don't know the trick.  In that case it could be a bit of a head-scratcher.  

There is a final way to implement batch replication using Tungsten's built-in heartbeat mechanism.   With this method we insert a named heartbeat event on the master, then ask the slave to replicate until the heartbeat appears.  Here's an example: 

$ trepctl -host prod1 heartbeat -name batch1
$ trepctl online -heartbeat batch1
$ trepctl wait -state OFFLINE -limit 300

This method is perhaps the simplest of all, because there is no need to check for either sequence numbers or binlog offsets on the master.  The only downside is that you must have a master and a slave replicator to use it.  It does not work with direct replication, in which a single replicator moves data from the master DBMS to the slave.  (This limitation will be removed in the future when Issue 228 is fixed.) 

When using any of these techniques, we may want to know whether Tungsten will really go offline at the correct point.  Fortunately, there's a simple way to find out.  The trepctl status command shows pending requests to go offline.  Let's say you check status after requesting the slave to replicate to a heartbeat as in the previous example.   

$ trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.002023:0000000104369615;37978
appliedLastSeqno       : 220126
appliedLatency         : 470.589
...
offlineRequests        : Offline at heartbeat event: batch1
...
state                  : ONLINE
timeInStateSeconds     : 2.436
uptimeSeconds          : 1742.0
Finished status command...

It is simple to see from the status output that Tungsten will go offline when it sees a heartbeat named batch1.

As this article shows, the trepctl online and trepctl wait commands make it very simple to implement batch replication.  You can simplify still further by wrapping the commands in a short script written in your favorite scripting language.   Either way you have a handy solution to a problem that affects a diverse set of applications.

This is not the end of Tungsten features to enable batch replication.  Tungsten has a new applier that can submit transactions using CSV files, which is critical to load transactions quickly into data warehouses.  We have been testing it out with Vertica, where early results show that it improves load performance by a factor of 100 or more in some cases.  I will describe this new feature in an upcoming article.  

Wednesday, November 16, 2011

Why So Many Proprietary Rewrites of MySQL and InnoDB?

Every couple of weeks or so I get marketing email from a Continuent competitor advertising a closed-source clone of MySQL. It is said to be pumped up on illegal substances and therefore the solution to all my problems.  I like this sort of spam because it makes it easier to track what the neighbors are up to.  However it does bring up a question.  Why are so many companies offering what amount to proprietary replacements of MySQL?  This does not mean alternative builds like Percona or MariaDB.  It means products like ClustrixSchooner, or Xeround, which replace MySQL entirely, or like ScaleDB, or Tokutek, which replace InnoDB.  

There's nothing wrong with proprietary software, of course.  And there is nothing wrong with rewriting things to make them better.  The rewrites are a tribute to the vitality of the MySQL marketplace and in some cases quite clever as well.  However, the proprietary offerings tend to obscure an important truth about MySQL.  Most businesses that run on open source software have problems with MySQL management, not with MySQL itself.

Here is a simple example.  Say you have 2 Terabytes in MySQL 5.1.  How do you upgrade from MySQL 5.1 to 5.5 without incurring an application outage?  This is a big problem for 24x7 web-facing applications.  You don't need to rewrite MySQL to do zero-downtime upgrades.  MySQL with InnoDB already works fine.  You just need a way to shift connections transparently to a new master database, upgrade the old master, and shift back when you are done.  Similar reasoning applies for slave provisioning, automated failover, spreading load over replicas to improve performance, or operating across multiple sites.

At Continuent we concluded a number of years ago that you don't need to change MySQL to manage data effectively.  We therefore designed Tungsten Enterprise, Continuent's commercial clustering solution, to work with unaltered MySQL. Tungsten Enterprise uses master/slave replication (i.e., my favorite replicator), distributed management, and transparent connectivity to make a set of standard MySQL or PostgreSQL servers look like a single highly available DBMS that distributes load across all replicas.   This architecture has tremendous advantages, because it complements the strengths of MySQL itself.   Here are a few of the principle benefits.

Transparency.  Everything runs on standard MySQL from initial development to large-scale deployment.  Application code runs the same way on a dev laptop or production.  Application bugs in production are reproducible on the laptop.  Standard MySQL configuration and tuning also work, because this is standard MySQL.

No lock-in.  Don't like Tungsten Enterprise?  Use something else or revert back to simple MySQL.  There's no need to change your database or migrate data.

Data integrity.  InnoDB has had years to shake out bugs, especially those involving data corruption.  There are still a few but they do not typically show up unless there is a bad hardware failure or you configure your system incorrectly.  (Hint #1: don't use MyISAM.)  Do you really want to give this up for a new store implementation?

Scalability.  MySQL performance is improving constantly, driven by competition between builds, an active community, investment from Oracle and large web properties like Facebook.  SSDs are also increasingly affordable and make a lot of performance problems evaporate.  As MySQL improves in this and other areas, you get the benefits.  The trick is to have a way to upgrade.  I mentioned the MySQL 5.1 to 5.5 upgrade problem for precisely this reason.

Operational costs.  There is a deep pool of administrators and administrative tools for MySQL.  Thanks to books like High Performance MySQL, abundant talks, and a wealth of community resources as well as consulting, there is little mystery about how things work.  I probably don't even need to discuss license costs.

Viability.  MySQL is not going anywhere.  Oracle is continuing to invest in the core database, and Percona, MariaDB and most important Microsoft will ensure Oracle stays on its toes.  At Continuent we do our best to keep our friends at Oracle competitive on replication.  Innovation on open source MySQL will continue for years to come.  (Psst, MySQL guys at Oracle are welcome to come work for us. :)

Given the number of advantages that off-the-shelf MySQL confers, the real question is why our approach is not more popular.  Actually it is.  For all the marketing attention generated by proprietary MySQL or InnoDB rewrites, many hundreds of billions of transactions per day run on unaltered MySQL.  Switching to proprietary versions of MySQL is a substantial wrench for most businesses, because the economics run so strongly in favor of open source DBMS.   However, the open source tools for managing MySQL are by-and-large inadequate, in part because some of the problems turn out to be rather difficult to solve.   

When we started to rethink database clustering at Continuent back in 2007, we therefore focused on solving the problems outside MySQL that make data management hard.  That includes building fast replication with global transaction IDs, so you can fail over easily to up-to-date live replicas.  It includes building distributed, rule-based management that has simple primitives like "recover" to fix a broken slave.  It includes speedy, transparent connectivity that can spread reads intelligently across multiple servers and reroute connections transparently to allow maintenance without halting applications.  Finally, it includes simplifying management so that users don't spend much time worrying about their data.  These capabilities are now very robust and help customers handle hundreds of millions of transactions per day.  

It is obvious off-the-shelf MySQL (and PostgreSQL too) are already very good and continuing to get better.  For most users there is no need to migrate to proprietary offerings that give up the leverage conferred by open source databases.  Tungsten Enterprise solves the difficult problems that are critical to building businesses on standard MySQL.   If you are building new systems based on MySQL or scaling old ones you should look hard at what we have done.   

Sunday, November 13, 2011

I Really Dislike Anonymous Attacks

If you are interested in NoSQL databases (or maybe not) perhaps you have seen the anonymous "warning" about using MongoDB.   It concludes with the following pious request:

  Please take this warning seriously.

Now there are a lot of great resources about data management on the web but the aforementioned rant is not one of them.  If you plan to write technical articles and have people take them seriously, here are a few tips.
  1. Sign your name.  Readers are more impressed when they see you are not afraid to stand behind your words. 
  2. Explain what problem you were trying to solve.  Otherwise uncharitable readers might think you just started pumping information into a new database without thinking about possible consequences and now want to blame somebody else for your bad decision.  
  3. Explain how you could do better.  Not all designs work out, so propose alternatives.  Readers love to see authors demonstrate that they are not discouraged by adversity.  
As for most of the points made by the anonymous author, all I can say is: well, duh! 

MongoDB behavior with respect to global write locking and transaction durability is obvious from the official documentation.  These features are not my cup of tea, but it's also not as if 10gen is hiding them either.  Moreover, most people understand that new DBMS implementations have problems, not least of all losing data now and then.  You usually pick them because they have features that make it worth putting up with the immaturity.  I am not an expert on MongoDB, but I can say from experience it is amazingly easy to load JSON objects into it.  The up-front usability alone demonstrates excellent engineering.  I am sure for this reason that there are many other good features.  

Monday, October 31, 2011

Benchmarking Tungsten Parallel Replication

Tungsten parallel apply on slaves, or parallel replication for short, has been available for about a year.   Until recently we did not have many formal benchmarks of its performance.  Fortunately the excellent Percona Live Conference in London accepted my talk on Tungsten parallel replication (slides available here), so Giuseppe Maxia and I finally allocated a block of time for systematic performance testing.

In a nutshell, the results were quite good. In the best cases Tungsten parallel apply out-performs single-threaded native replication by about 4.5 to 1.  Both Giuseppe and I have verified this using slightly different test methodologies, which helps avoid dumb counting mistakes.  Our results also match field tests at a customer site over the previous summer, so we regard them as fairly robust.  In the remainder of this article I would like to expand a bit on the details of the benchmarks as well as the results.  The results shown here are from my tests.

Benchmark Test Design

Both Giuseppe and I used a similar testbed for replication testing:    
  • HP Proliant server, dual Xeon L5520 CPUs with hyper-threading enabled, 72Gb of RAM
  • 1TB HP Smart Array RAID 1+0 
  • Centos 5.6
  • XFS file system
  • MySQL 5.1.57 with InnoDB buffer pool set to 10Gb and using O_DIRECT purge method 
  • Tungsten Replicator 2.0.5 build 347  
For convenience we use MySQL sandbox to set up a master with two slaves, as shown in the following diagram.  It turns out that for measuring replication throughput there is no reason to set up on separate hosts, as the master does little or nothing during the test and we only operate one slave at a time.  



The Tungsten Slave is configured as described in a previous article in this blog, except that there are 30 channels instead of 10.   The exact installation command is given at the end of this article.

The test run uses sysbench to spread transactions evenly across 30 databases of identical size, then measure time to process them.  This is also known as a replication catch-up test. 
  1. Load all MySQL servers with an identical dataset consisting of 30 databases pre-populated with data from sysbench.  Giuseppe wrote a new tool called Large Data Generator that is very helpful for capturing and loading such datasets.  
  2. With the slaves shut down, store the master binlog start position and then run 30 sysbench oltp test processes against the master to update and read from all schemas simultaneously for one hour.  
  3. Start the MySQL slave from the stored master binlog position and measure time to process the sysbench transactions. Shut down the MySQL slave at the end of the test. 
  4. Start the Tungsten slave from the stored master binlog position and measure time to process the sysbench transactions using Tungsten Replicator with 30 channels (i.e. threads). 
Test Results

Database performance is substantially different depending on whether data are fully resident in the buffer pool (cache-resident) or largely read from disk (I/O-bound).  Tungsten parallel replication over 30 databases varies from 1.8 to 4.5 depending on which case you look at, as shown in the following table.   Processing times are in minutes (m).

Test Scenario
Rows/Db
Data Size
MySQL Slave
Tungsten Slave
Ratio
Cache-resident
10K
430Mb
30m 
17m
1.8
I/O-Bound
10M
68Gb
228m
51m
4.5

Let's look at the results in detail.  In the cache-resident test the base dataset is relatively small and loads fully into the buffer cache within a minute or two.  Both MySQL and Tungsten slaves complete in well under an hour.  Here is a graph showing throughput as measured in bytes of binlog processed per 10 second increment.  

Cache-Resident Slave Catch-Up - MySQL vs. Tungsten Replicator, 30 Databases
In the cache-resident case there are virtually no reads from disk as all data are fully resident in the InnoDB buffer pool.  Tungsten Replicator is faster because multiple writes can occur in parallel but the speed-up versus native replication is not especially large.  Note that Tungsten processes around 40Mb every 10 seconds or about 1Gb of binlog every four minutes.

With I/O bound workloads, on the other hand, we see a profound difference in performance.  Tungsten Replicator is at least 6x slower than in the cache-resident case, but still processes updates faster than the master (51 minutes on the slave vs. 60 minutes on the master).   Buffer cache loading is correspondingly fast and Tungsten reaches steady-state performance within about 20 minutes.  MySQL native replication on the other hand is far slower. The slave not only does not catch up, but it would quickly lag far behind  the master under this workload.  It takes about 90 minutes for native replication even to achieve steady state performance after buffer pool loading.
I/O-Bound Slave Catch-Up - MySQL vs. Tungsten Replicator, 30 Databases
Overall we can safely say that single-threaded native replication is likely non-workable in the I/O-bound case without going to some combination of SSDs and/or slave pre-fetch.  


Further Improvements and Caveats

The current results of parallel replication benchmarks on Tungsten are gratifying especially when you consider that two years ago Tungsten Replicator performance was around 10% of the speed of MySQL replication.  Nevertheless, these benchmarks are not the final word.  It is clear there is room for optimization as we observe that Tungsten processes the cache-bound binlog at least 6 times faster than the I/O bound workload.  Much of the difference seems to be time spent reading from disk.   If this could be improved, Tungsten would go even faster.

During the London conference Yoshinori Matsunobu published some excellent performance results using slave pre-fetch, which has encouraged us to build pre-fetch into Tungsten as well.   I am curious to see if we can further boost throughput by adding pre-fetching on each parallel thread, though other people at the conference such as Domas Mituzas were not optimistic.  Either way, I am certain we will improve performance, if not using pre-fetch then with other tricks like batching inserts.

Finally, some caveats.  Our sysbench load is nice because it is evenly distributed across schemas of exactly the same size.  Most application workloads do not behave this way, though some do come very close.  The slides for my talk discuss practical issues in maximizing performance in real applications.  I suspect that a combination of parallelization with pre-fetch will in fact turn out to be a very good solution for a wide variety of workloads.

Fine Print

If you would like to repeat our results (or attack them as fraudulent), here are some parameters that may help.  The database settings in the MySQL sandbox instances are as follows:
default-storage-engine=InnoDB
innodb-additional-mem-pool-size=100M
innodb-flush-method=O_DIRECT
innodb-log-buffer-size=4M
innodb-log-file-size=50M
innodb-thread-concurrency=0
innodb_buffer_pool_size=10G
innodb_file_format=barracuda
innodb_file_per_table=1
innodb_flush_log_at_trx_commit=2
innodb_strict_mode=1log-bin=mysql-binmax-connections=500
max_allowed_packet=48M
skip_slave_start
sync_binlog=0

Next, here is the sysbench command used to generate load on each schema.  We run 30 of these simultaneously varying the database name for each invocation.  This example is for the I/O-bound case.

sysbench --test=oltp --db-driver=mysql --mysql-db=${db} \  --mysql-user=msandbox --mysql-password=msandbox \
  --mysql-host=127.0.0.1 --mysql-port=33306 \
  --oltp-read-only=off --oltp-table-size=10000000 \
  --oltp-index-updates=4 --oltp-non-index-updates=2 \
  --max-requests=200000 \
  --max-time=3600 --num-threads=5 run

The replicator configuration is given in the slides for the talk, but here it is again.   Options in red are required for sandboxes.  Production installations are therefore simpler than what is shown here.

tools/tungsten-installer tools/tungsten-installer --direct -a \
  --service-name=parallel --native-slave-takeover \
  --master-host=127.0.0.1 --master-port=33306  \
  --master-user=msandbox --master-password=msandbox  \
  --slave-host=127.0.0.1 --slave-port=33307  \
  --slave-user=msandbox --slave-password=msandbox  \
  --home-directory=/opt/continuent \
  --property=replicator.store.parallel-queue.maxOfflineInterval=5 \
  --svc-parallelization-type=disk --buffer-size=100 \
  --channels=30 --thl-port=2115 --rmi-port=10010 \
  --skip-validation-check=MySQLPermissionsCheck \
  --skip-validation-check=MySQLApplierServerIDCheck \
  --start-and-report

To equal the results shown above you will also need to assign databases explicitly to channels in the shard.list file.   Otherwise, databases will be assigned  channels using a hashing function, which tends to result in somewhat uneven distributions.  Look in the comments of the shard.list file for instructions on how to do this. 

Finally, all of our tests depend on two excellent tools from Giuseppe Maxia:   MySQL Sandbox and the new Large Data Generator program in the Tungsten Toolbox.  Once you get the hang of them you will become completely addicted as they make test setup both reliable as well as quick.  

Saturday, October 1, 2011

Open Source Hardware

Back in 2010 I stopped buying test servers from Dell and began building them from components using Intel i7 processors, X58-based mother boards, and modular power supplies from Ultra.  It was a good way to learn about hardware.  Besides, it was getting old to pay for Dell desktop systems with Windows, which I would then wipe off when installing Linux.  Between the educational value of understanding the systems better, selecting the exact components I wanted, and being able to fix problems quickly, it has been one of the best investments I have ever made.  And it didn't cost any more than equivalent Dell servers.

For this reason, a couple of recent articles about computer hardware caught my attention.  First, Dell is losing business as companies like Facebook build their own customized servers.  Open source database performance experts like Peter Zaitsev have been talking about full-stack optimization including hardware for years.  Google built their original servers using off-the-shelf parts.  Vertical integration of applications and hardware has since gone mainstream.  If you deploy the same application(s) on many machines, balancing characteristics like cost, performance, and power utilization is no longer a specialist activity but a necessity of business.  It's not just cutting out the Microsoft tax but many other optimizations as well.

Second, developments in hardware itself are making custom systems more attractive to a wide range of users.  A recent blog post by Bunnie Huang describes how decreases in the slope of CPU clock speed increase over time mean you can get better cost/performance by building optimized, application-specific systems now than waiting for across-the-board improvements.  Stable standards also drive down the difficulty of rolling your own.  Components on mid-range servers are sufficiently standardized it is easier to build basic systems from components than to put together a bicycle from scratch.  Try building your own wheels sometime if you don't believe this.

Easily customizable hardware has important consequences.  At a business level, Dell and other mainline hardware vendors will adapt to lower margins, but the market for generic, mid-range appliances has evaporated.  Starting around 2005 there was a wave of companies trying to sell open source databases, memcached, and datamarts on custom hardware.   Most seem to have moved away from hardware, like Schooner,  or folded entirely (like Gear6 and Kickfire).  The long-term market for such appliances, to the extent it exists, is in the cloud.

The other consequence is potentially far more significant.  The traditional walls that encapsulated hardware and software design are breaking down.  Big web properties or large ISPs like Rackspace run lean design teams that integrate hardware with open source software deployment.  This not just a matter of software engineers learning about hardware or vice-versa.  It is the tip of a much bigger iceberg.  Facebook recently started the Open Compute Project, which is a community-based effort to design server infrastructure.   In their own words:
By releasing Open Compute Project technologies as open hardware, our goal is to develop servers and data centers following the model traditionally associated with open source software projects. That’s where you come in.
Facebook and others are opening up data center design.  Gamers have been building their own systems for years.  Assuming Bunnie's logic is correct, open hardware will apply to wide range of devices from phones up to massive clusters.  Community-based, customized system designs are no longer an oddity but part of a broad movement that will change the way all of us think about building and deploying applications on any kind of physical hardware.  It will upset current companies but also create opportunities for new kinds of businesses.  The "cloud" is not the only revolution in computing.  Open source hardware has arrived.  

Thursday, September 29, 2011

Quick Installation of Replication from MySQL to MongoDB

Proof-of-concept Tungsten support for MongoDB arrived last May, when I posted about our hackathon effort to replicate from MySQL to MongoDB.  That code then lay fallow for a few months while we worked on other things like parallel replication, but the period of idleness has ended.  Earlier this week I checked in fixes to Tungsten Replicator to add one-line installation support for MongoDB slaves.

MySQL to MongoDB replication will be officially supported in the Tungsten Replicator 2.0.5 build, which will be available in a few weeks.  However, you can try out MySQL to MongoDB replication right now.  Here is a quick how-to using my lab hosts logos1 for the MySQL master and logos2 for the MongoDB slave. 

1. Download the latest development build of Tungsten Replicator.   See the nightly builds page for S3 URLs.

$ cd /tmp
$ wget --no-check-certificate https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.0.5-332.tar.gz

2. Untar and cd into the release. 

$ tar -xzf tungsten-replicator-2.0.5-332.tar.gz
$ cd tungsten-replicator-2.0.5-332

3. Install a MySQL master replicator on a host that has MySQL installed and is configured to use row replication, i.e. binlog_format=row.  Note that you need to enable the colnames and pkey filters.  These add column names to row updates and eliminate update and delete query columns other than those corresponding to the primary key, respectively. Last but not least, ensure strings are converted to Unicode rather than transported as raw bytes, which we have to do in homogeneous MySQL replication to finesse character set issues.  

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mysql \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos1 \
  --mysql-use-bytes-for-string=false \
  --svc-extractor-filters=colnames,pkey \
  --svc-parallelization-type=disk --start-and-report

4. Finally, install a MongoDB slave.  Before you do this, ensure mongod 1.8.x is up and running on the host as described in the original blog post on MySQL to MongoDB replication.   My mongod is running on the default port of 27017, so there is no --slave-port option necessary. 

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mongodb \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos2 \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --svc-parallelization-type=disk --start-and-report

That's it.  You test replication by logging into MySQL on the master, adding a row to a table, and confirming it reaches the slave.   First the SQL commands: 

$ mysql -utungsten -psecret -hlogos1 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
...
mysql> create table bar(id1 int primary key, data varchar(30));
Query OK, 0 rows affected (0.15 sec)

mysql> insert into bar values(1, 'hello from mysql');
Query OK, 1 row affected (0.00 sec)

Now check the contents of MongoDB:  

$ mongo logos2:27017/test
MongoDB shell version: 1.8.3
connecting to: logos2:27017/test
system.indexes
> db.bar.find()
{ "_id" : ObjectId("4e85269484aef8fcae4b0010"), "id1" : "1", "data" : "hello from mysql" }

Voila!  We may still have bugs, but at least MySQL to MongoDB replication is now easy to install.   

Speaking of bugs, I have been fixing problems as they pop up in testing.  The most significant improvement is a feature I call auto-indexing on MongoDB slaves.  MongoDB materializes collections automatically when you put in the first update, but it does nothing about indexes.  My first TPC-B runs processed less than 100 transactions per second on the MongoDB slave, which is pretty pathetic. The bottleneck is due to MongoDB update operations of the form 'db.account.findAndModify(myquery,mydoc)'.  You must index properties used in the query or things will be very slow.   

Auto-indexing cures the update bottleneck by ensuring that there is an index corresponding to the SQL primary key for any table that we update.  MongoDB makes this logic very easy to implement--you can issue a command like 'db.account.ensureIndex({account_id:1})' to create an index.  What's really cool is that MongoDB will do this even if the collection is not yet materialized--e.g., before you load data.   It seems to be another example of how MongoDB collections materialize whenever you refer to them, which is a very useful feature.  

TPC-B updates into MongoDB are now running at over 1000 transactions per second on my test hosts. I plan to fix more bugs and goose up performance still further over the next few weeks.  Through MongoDB we are unlearning assumptions within Tungsten that are necessary to work with non-relational databases.  It's great preparation for big game hunting next year:  replication to HBase and Cassandra.  

Scaling Databases Using Commodity Hardware and Shared-Nothing Design