Thursday, September 29, 2011

Quick Installation of Replication from MySQL to MongoDB

Proof-of-concept Tungsten support for MongoDB arrived last May, when I posted about our hackathon effort to replicate from MySQL to MongoDB.  That code then lay fallow for a few months while we worked on other things like parallel replication, but the period of idleness has ended.  Earlier this week I checked in fixes to Tungsten Replicator to add one-line installation support for MongoDB slaves.

MySQL to MongoDB replication will be officially supported in the Tungsten Replicator 2.0.5 build, which will be available in a few weeks.  However, you can try out MySQL to MongoDB replication right now.  Here is a quick how-to using my lab hosts logos1 for the MySQL master and logos2 for the MongoDB slave. 

1. Download the latest development build of Tungsten Replicator.   See the nightly builds page for S3 URLs.

$ cd /tmp
$ wget --no-check-certificate https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.0.5-332.tar.gz

2. Untar and cd into the release. 

$ tar -xzf tungsten-replicator-2.0.5-332.tar.gz
$ cd tungsten-replicator-2.0.5-332

3. Install a MySQL master replicator on a host that has MySQL installed and is configured to use row replication, i.e. binlog_format=row.  Note that you need to enable the colnames and pkey filters.  The colnames filter adds column names to row updates, and the pkey filter eliminates update and delete query columns other than those corresponding to the primary key.  Last but not least, ensure strings are converted to Unicode rather than transported as raw bytes.  (Raw bytes are what we have to use in homogeneous MySQL replication to finesse character set issues.)

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mysql \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos1 \
  --mysql-use-bytes-for-string=false \
  --svc-extractor-filters=colnames,pkey \
  --svc-parallelization-type=disk --start-and-report
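
To make the effect of the two filters concrete, here is a minimal Python sketch.  It is not Tungsten code: the function names, the dictionary layout, and the sample table metadata are all assumptions made for illustration.  It shows column names being attached to positional row values, and the key of an update being reduced to the primary key column.

```python
# Illustrative only: these helpers mimic the *effect* of the colnames and
# pkey filters described above. All names and the data layout are hypothetical.

def add_column_names(column_names, values):
    """colnames-style step: pair positional row values with column names."""
    return dict(zip(column_names, values))

def keep_primary_key(named_row, pkey_columns):
    """pkey-style step: drop every key column except the primary key."""
    return {name: named_row[name] for name in pkey_columns}

# Hypothetical metadata for a table bar(id1 int primary key, data varchar(30)).
columns = ["id1", "data"]
pkey = ["id1"]

# Assume a row update carries a "before" image (used to locate the row) and
# an "after" image (the new values), both as positional values.
before = add_column_names(columns, [1, "hello from mysql"])
after = add_column_names(columns, [1, "hello again"])

query = keep_primary_key(before, pkey)  # what to match on the slave
document = after                        # what to write on the slave

print(query)
print(document)
```

With names attached and the key trimmed to the primary key, the slave side has what it needs to build a query and a document without knowing anything about the MySQL schema.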

4. Finally, install a MongoDB slave.  Before you do this, ensure mongod 1.8.x is up and running on the host as described in the original blog post on MySQL to MongoDB replication.   My mongod is running on the default port of 27017, so there is no --slave-port option necessary. 

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mongodb \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos2 \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --svc-parallelization-type=disk --start-and-report

That's it.  You can test replication by logging into MySQL on the master, adding a row to a table, and confirming it reaches the slave.   First the SQL commands: 

$ mysql -utungsten -psecret -hlogos1 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
...
mysql> create table bar(id1 int primary key, data varchar(30));
Query OK, 0 rows affected (0.15 sec)

mysql> insert into bar values(1, 'hello from mysql');
Query OK, 1 row affected (0.00 sec)

Now check the contents of MongoDB:  

$ mongo logos2:27017/test
MongoDB shell version: 1.8.3
connecting to: logos2:27017/test
system.indexes
> db.bar.find()
{ "_id" : ObjectId("4e85269484aef8fcae4b0010"), "id1" : "1", "data" : "hello from mysql" }

Voila!  We may still have bugs, but at least MySQL to MongoDB replication is now easy to install.   

Speaking of bugs, I have been fixing problems as they pop up in testing.  The most significant improvement is a feature I call auto-indexing on MongoDB slaves.  MongoDB materializes collections automatically when you put in the first update, but it does nothing about indexes.  My first TPC-B runs processed fewer than 100 transactions per second on the MongoDB slave, which is pretty pathetic.  The bottleneck is due to MongoDB update operations of the form 'db.account.findAndModify(myquery,mydoc)'.  You must index the properties used in the query or things will be very slow.   
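
The cost of an unindexed query is easy to see in a toy model.  The Python sketch below is not MongoDB: it uses a plain list as a stand-in for a collection and a dictionary as a stand-in for an index (a real index is a tree rather than a hash table, so the indexed cost is small but not literally one step).  The collection size and field names are made up for illustration.

```python
# Illustrative only: a toy collection, not MongoDB. It counts how many
# documents must be examined to find one account with and without an index.

accounts = [{"account_id": i, "balance": 0} for i in range(10000)]

def find_by_scan(docs, account_id):
    """Unindexed lookup: examine documents one by one until a match."""
    examined = 0
    for doc in docs:
        examined += 1
        if doc["account_id"] == account_id:
            return doc, examined
    return None, examined

# A dictionary keyed on account_id stands in for an index.
index = {doc["account_id"]: doc for doc in accounts}

def find_by_index(idx, account_id):
    """Indexed lookup: go straight to the matching document."""
    doc = idx.get(account_id)
    return doc, (1 if doc is not None else 0)

_, scanned = find_by_scan(accounts, 9999)
_, looked_up = find_by_index(index, 9999)
print(scanned, looked_up)
```

Every replicated update pays the lookup cost, so a scan that grows with the size of the collection quickly dominates apply time.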

Auto-indexing cures the update bottleneck by ensuring that there is an index corresponding to the SQL primary key for any table that we update.  MongoDB makes this logic very easy to implement--you can issue a command like 'db.account.ensureIndex({account_id:1})' to create an index.  What's really cool is that MongoDB will do this even if the collection is not yet materialized--e.g., before you load data.   It seems to be another example of how MongoDB collections materialize whenever you refer to them, which is a very useful feature.  
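
The idea can be sketched in a few lines of Python.  This is an assumption-laden illustration rather than the replicator's actual implementation: the AutoIndexer class and its method are hypothetical, and the sketch only builds the mongo shell command text instead of talking to a server.

```python
# Illustrative only: a sketch of the auto-indexing idea, not Tungsten's code.
# It produces the ensureIndex command once per collection and then stays quiet.

class AutoIndexer:
    def __init__(self):
        # Collections for which an index command has already been produced.
        self._indexed = set()

    def command_for(self, collection, pkey_columns):
        """Return an ensureIndex command the first time a collection is seen,
        and None on every later call for the same collection."""
        if collection in self._indexed:
            return None
        self._indexed.add(collection)
        keys = ",".join(f"{col}:1" for col in pkey_columns)
        return f"db.{collection}.ensureIndex({{{keys}}})"

indexer = AutoIndexer()
first = indexer.command_for("account", ["account_id"])
second = indexer.command_for("account", ["account_id"])
print(first)
print(second)
```

Remembering which collections have been handled keeps the check cheap on the hot path, since only the first update to each table triggers any index work.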

TPC-B updates into MongoDB are now running at over 1000 transactions per second on my test hosts.  I plan to fix more bugs and goose up performance still further over the next few weeks.  Working with MongoDB is forcing us to unlearn assumptions built into Tungsten, which is necessary if we are to work with non-relational databases.  It's great preparation for big game hunting next year:  replication to HBase and Cassandra.  

Thursday, September 8, 2011

What's Next for Tungsten Replicator

As Giuseppe Maxia recently posted, we released Tungsten Replicator 2.0.4 this week.  It has a raft of bug fixes and new features, of which one-line installations are the single biggest improvement.  I set up replicators dozens of times a day, and having a single command for standard cluster topologies is a huge step forward.  Kudos to Jeff Mace for getting this nailed down.

So what's next?  You can see what we are up to in general by looking at our issues list.  We cannot do everything at once, but here are the current priorities for Tungsten Replicator 2.0.5.
  • Parallel replication speed and robustness.  I'm currently working on eliminating choke points in performance (like this one) as well as eliminating corner cases that cause the replicator to require manual intervention, such as aging out logs that are still needed by slaves.  
  • Multi-master replication.  This includes better support for system of record architectures, many masters to one slave, and replication between the same databases on different sites.  Stephane Giron nailed a key MyISAM multi-master bug for the last release.  We will continue to polish this as we work through our current projects.   
  • Better installations for more types of databases.  Jeff recently hacked in support for PostgreSQL as well as Oracle slaves, and we are contemplating addition of MongoDB support.  Heterogeneous replication is getting simpler to set up.  
  • Filter usability.  Giuseppe has a list of improvements for filters, which are one of the most powerful Tungsten Replicator features but not as easy for non-developers to use as we would like.  Better installation support is first on the list followed by ability to load and unload dynamically.  
  • Data warehouse loading.  We have a design for fast data warehouse loading that I hope we'll be able to implement in the next few weeks.  Linas Virbalas has also been working on this problem along with a number of other heterogeneous projects for customers.  
This is a lot of work and not everything will necessarily be finished when 2.0.5 goes out.  However, I hope we'll make progress on all of them.  In case you are wondering how we pick things, replicator development is largely driven by customer projects.   If you have something you need in the replicator, please contact Continuent.

After this build we will... Er, let's get 2.0.5 done first.  Suffice it to say we have a long list of useful and interesting features to discuss in future blog articles.

Tuesday, September 6, 2011

The Inimitable Mr. Steven Jobs

There have been countless articles praising Steve Jobs since he announced his retirement from Apple on August 25th.  Most either catalogue Steve Jobs's many triumphs or assess the impact of his creativity on society.  Those are entertaining topics but not especially useful.  A more practical question is why Steve Jobs is so good at creating new products and whether the rest of us can imitate him.

Steve Jobs's best work seems to follow a repeated pattern.  Let's call it the Apple pattern, though of course it could just as well be the Pixar pattern or NeXT pattern:
  1. See the whole picture of some crucial human/technology interaction and recognize gaps.  
  2. Design products to fill those gaps that combine artistic sensibility and innovative technology.
  3. Get a large organization to implement designs in a way that makes the end result look like the handiwork of a single highly-focused craftsman. 
    Two things about the pattern seem particularly striking.  First, Steve Jobs is a complete package.  I have been in the tech industry for over three decades and have met people who did one or at most two of these things at the level necessary to create products that move large markets.  Almost nobody does all three.  The fact that Steve is excellent in all areas simultaneously may be a root cause behind his long run of successes.

    Second, Jobs's ability to drive implementation teams is extraordinary.  Maybe it's just the manager in me, but I find his ability to pick the right people to run teams and to keep those teams pointed in a clear direction without product-destroying compromises quite remarkable.  This is far harder than generating ideas in the first place.  The heart of the Apple pattern is as much about understanding people as technology--not just users but the creators as well.  I have never heard Jobs make pronouncements on team management, but there is an excellent talk from Ed Catmull of Pixar that summarizes the tensions quite well.

    Steve Jobs is commonly compared to great inventors like Edison, Ford, and Disney.  When thinking about imitation, another parallel seems more illuminating:  John Churchill, Duke of Marlborough and hands-down the greatest English general of all time.


    A possible Jobs ancestor?
    Marlborough possessed a seldom equalled ability to see war as an integrated whole across geography and branches of arms, devise unexpected strategies to exploit the weaknesses of his enemies, and execute them flawlessly in the difficult conditions of early 18th Century campaigns.  Execution extended from handling fractious allies down to the painstaking work to ensure his men had proper meals after each day's march.  In other words:  analogous problem-solving abilities to Steve Jobs, translated into the field of warfare.   The parallel extends to the lavish praise of contemporaries and later historians.  Winston Churchill famously described Marlborough as follows.  
    He commanded the armies of Europe against France for ten campaigns. He fought four great battles and many important actions ... He never fought a battle that he did not win, nor besieged a fortress that he did not take ... He quitted war invincible.
    Grand problem-solvers like Marlborough and Jobs are sufficiently rare that they tend to be one-offs who change society but leave no obvious successors.  English military superiority on the Continent waned after Marlborough's retirement.  Something similar will likely befall Apple after Jobs, current happy talk about product pipelines and cash position notwithstanding.  It is simply not possible to imitate Jobs by committee, which is effectively what will happen once he is completely absent.  The driving force is gone.

    That said, we can all imitate Steve Jobs, albeit on a smaller scale.  Many highly successful products start with a single person who conceives the idea and drives at least the first couple of iterations to completion.  Seeing the whole problem, applying innovative designs to solve it, and managing the team to get it done is a fundamental pattern that applies across a wide range of endeavors.  Here is just one of many examples.

    Many years ago at Sybase I worked for a manager named Mark Deppe.  Early in the 1990s Mark learned that Wall Street firms were patching together crude publish/subscribe messaging applications to move data between financial systems in order to speed up trades.   He recognized that there was a much better way to do this using log-based data replication and built the Sybase Replication Server product.  The Rep Server went on to generate hundreds of millions of dollars in sales.  It still sells well today, over 15 years later.  Mark was a great architect but also a great builder of teams.  He paid as much or more attention to hiring and managing people as he did to technology.  He trusted the people he hired, and he gave them the freedom and support to do great work.  At the same time Mark was also incredibly attentive to detail and did all the project management for the first releases himself.  Years later he said it was too important a task to hand off to anyone else.

    Mark Deppe was the best technical manager I ever worked for.  I have consciously imitated his best practices for many years.  Looking back it seems I was unconsciously imitating the Apple design pattern.  But perhaps that was not a complete coincidence.   Before joining Sybase Mark was at Apple where he worked with (guess who?) Steve Jobs.

    ------------------
    NOTE:  After this article was published I found the flow hard to understand and edited it a week or so later to make it more readable.  The argument is the same as before.

    Scaling Databases Using Commodity Hardware and Shared-Nothing Design