The Scale-Out Blog: Open Source

Feb 7, 2014

Fun with MySQL and Hadoop at SCaLE 12X

It's my pleasure to be presenting at SCaLE 12X on the subject of real-time data loading from MySQL to Hadoop. This is the first public talk on work at Continuent that enables Tungsten Replicator to move transactions from MySQL to HDFS (Hadoop Distributed File System). I will explain how replication to Hadoop works, how to set it up, and offer a few words on constructing views of MySQL data using tools like Hive.

As usual with our replication work, everything we are doing on Hadoop replication is open source. Builds and documentation will be publicly available by the 21st of February, which is when the talk happens. Hadoop support is already in testing with Continuent customers, and we are confident that we can already handle basic loading cases. That said, Hadoop is a complex beast with lots of use cases, and we need feedback from the community on how to make Tungsten loading support better. My colleagues and I plan to do a lot of talks about Hadoop to help community users get up to speed.

Here is a tiny taste of what MySQL to Hadoop loading looks like. Most MySQL users are familiar with sysbench. Have you ever wondered what sysbench tables would look like in Hadoop? Let's use the following sysbench command to apply transactions to table db01.sbtest:

sysbench --test=oltp --db-driver=mysql --mysql-host=logos1 --mysql-db=db01 \
    --mysql-user=tungsten --mysql-password=secret \
    --oltp-read-only=off --oltp-table-size=10000 \
    --oltp-index-updates=4 --oltp-non-index-updates=2 --max-requests=200000 \
    --max-time=900 --num-threads=5 run

This results in rows that look like the following in MySQL:

mysql> select * from sbtest where id = 2841\G
*************************** 1. row ***************************
 id: 2841
  k: 2
  c: 958856489-674262868-320369638-679749255-923517023-47082008-646125665-898439458-1027227482-602181769
pad: qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt

After replication into Hadoop with Tungsten, we can crunch the log records using a couple of HiveQL queries to generate a point-in-time snapshot of the sbtest table on HDFS. By a point-in-time snapshot, I mean a table that contains not only inserted data but also the results of subsequent update and delete operations on each row up to a particular point in time. We can now run the same query to see the data:

hive> select * from sbtest where id = 2841;
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
Job 0: Map: 1   Cumulative CPU: 0.74 sec   HDFS Read: 901196 HDFS Write: 158 SUCCESS
Total MapReduce CPU Time Spent: 740 msec
OK
2841 2 958856489-674262868-320369638-679749255-923517023-47082008-646125665-898439458-1027227482-602181769 qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt
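
The snapshot idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how Tungsten or the HiveQL queries actually do it. The record layout (seqno, op, id, row) and the function name are hypothetical and were chosen for the example: replay the change records in commit order, keep the latest row image for each key, and drop keys that were deleted.

```python
def snapshot(log_records, up_to_seqno):
    """Replay change records in order; return table state as of up_to_seqno.

    Each record is a dict with hypothetical fields:
      seqno - position of the change in the log
      op    - "I" (insert), "U" (update) or "D" (delete)
      id    - primary key of the affected row
      row   - row image after the change (None for deletes)
    """
    table = {}
    for rec in sorted(log_records, key=lambda r: r["seqno"]):
        if rec["seqno"] > up_to_seqno:
            break  # ignore changes after the requested point in time
        if rec["op"] == "D":
            table.pop(rec["id"], None)  # a delete removes the row
        else:
            table[rec["id"]] = rec["row"]  # insert/update: latest image wins
    return table


# Made-up sample log, loosely modelled on sbtest rows.
log = [
    {"seqno": 1, "op": "I", "id": 2841, "row": {"k": 1, "pad": "aaa"}},
    {"seqno": 2, "op": "I", "id": 2842, "row": {"k": 5, "pad": "bbb"}},
    {"seqno": 3, "op": "U", "id": 2841, "row": {"k": 2, "pad": "aaa"}},
    {"seqno": 4, "op": "D", "id": 2842, "row": None},
]

early = snapshot(log, 2)   # both rows present, before the update and delete
latest = snapshot(log, 4)  # only row 2841 remains, with its updated k
```

Choosing a different up_to_seqno gives the table as it stood at that point in the log, which is what makes the snapshot "point-in-time".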

Tungsten does a lot more than just move transaction data, of course. It also provides tools to generate Hive schema, performs transformations on columns to make them match the limited HiveQL datatypes, and arranges data in a way that allows you to generate materialized views for analytic usage (like the preceding example) with minimal difficulty.
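
To give a feel for what matching column types involves, here is a rough Python sketch. The mapping table, the function names, and the column definitions are assumptions made up for illustration; they are not Tungsten's actual rules or tools. The sketch maps a MySQL column type to a plausible Hive type and builds a simple CREATE TABLE statement from the result.

```python
import re

# Hypothetical mapping for illustration only; not Tungsten's actual rules.
MYSQL_TO_HIVE = {
    "tinyint": "TINYINT",
    "smallint": "SMALLINT",
    "mediumint": "INT",
    "int": "INT",
    "bigint": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "char": "STRING",
    "varchar": "STRING",
    "text": "STRING",
    "datetime": "TIMESTAMP",
    "timestamp": "TIMESTAMP",
}


def hive_type(mysql_type):
    """Return a Hive type for a MySQL type such as 'CHAR(120)'.

    Length and modifiers (e.g. 'unsigned') are ignored; unknown types
    fall back to STRING.
    """
    match = re.match(r"\s*([a-z]+)", mysql_type.lower())
    name = match.group(1) if match else ""
    return MYSQL_TO_HIVE.get(name, "STRING")


def hive_ddl(table, columns):
    """Build a minimal Hive CREATE TABLE statement from (name, type) pairs."""
    cols = ", ".join(f"{name} {hive_type(mtype)}" for name, mtype in columns)
    return f"CREATE TABLE {table} ({cols})"


# Column definitions below are assumed for the example.
ddl = hive_ddl("sbtest", [("id", "int(10) unsigned"), ("k", "int(10)"),
                          ("c", "char(120)"), ("pad", "char(60)")])
```

A real converter also has to deal with details this sketch skips, such as unsigned ranges, character sets, and binary data.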

If you want to learn more about how Tungsten does all of this magic, please attend the talk. I hope to see you in Los Angeles.

P.S. If you cannot attend SCaLE 12X, we will have a Continuent webinar on the same subject the following week. (Sign up here.)

Jan 10, 2014

Why I Love Open Source

A few days ago Anders Karlsson wrote "Some myths on Open Source, the way I see it." Anders' article is mostly focused on exploding the idea that open source magically creates high quality code. Sad to say, you do not have to look very far to see that he is right.

While I largely agree with Anders' points, there is far more that could be said on this subject, especially on the benefits of open source. I love working on open source software. Here are three reasons that are especially important to me.

1.) Open source is a great way to disseminate technology to users. In the best cases, it is this easy to get open source products up and running:

$ sudo apt-get install software-i-want-to-use

A lot of software companies (mine included) open source their software because it gets product into the hands of people who might pay money for it later. The strategy worked brilliantly for MySQL AB as Anders pointed out. MongoDB is repeating the tactic with what looks like equal success. There has been a lot of pointless argument over the years about whether MySQL and MongoDB are "real databases." Being easy to get is just as critical to adoption as features like transactions and scalable performance.

Open source is therefore even better for users, who can quickly decide if something works for them and provide feedback through communities about problems as well as suggested improvements. To the extent open source software has high quality, it originates in the tight feedback loop between software producers and their user communities. That in turn leads to faster innovation with fewer deviations from real user needs. In olden days we called this getting the requirements right. Open source projects often do it extraordinarily well.

2.) Open source allows like-minded communities of developers to create products that would otherwise never happen. Linux became a dominant operating system in large part through the staggering scale of contributions enabled by exceptionally well-managed open source development. Linus Torvalds recently pointed out that Linux kernel releases have patches from a thousand contributors or more. Thanks to the wide range of contributions, Linux operates on everything from tiny ARM processors to servers with over 200 cores. The development effort underlying the Linux ecosystem is huge when you include the kernel and all the packages that install over it. It dwarfs any comparable operating system effort I can think of.

At the other end of the spectrum there are small but incredibly useful projects like Apache Curator. The Curator project currently has 8 project members, mostly from different companies, who collaborate to make Apache ZooKeeper vastly easier to program. I doubt libraries like Curator would even exist without open source licenses and infrastructure like distributed source code management. Neither would ZooKeeper, for that matter.

Not every line of open source code is excellent or even above average. (I'm looking at you, Hadoop.) That said, open source projects are not so much about code but communities of developers who understand and are interested in solving a specific problem. Besides direct feedback from real users, this is the other prerequisite for creating truly great products. Clean code is helpful but not necessary.

3.) Open source means your creations can never be taken away from you. In many creative endeavors work belongs to the people who employ you. It effectively disappears when you change jobs. Putting code on GitHub or code.google.com breaks that bond. Knowing that anything you create will always be accessible removes any hesitation about revealing your best ideas. I believe this is one of the drivers behind the flowering of creativity that infuses so many open source projects.

At the same time working on open source software is not all peaches and cream. Building successful businesses on open source is hard, which limits the opportunities to work on it for a living.

For instance, if most of the value of your product is in the software itself, there is not much motivation for users to pay you. I think that's one reason mobile apps are by and large paid or at least not open source. You need to find a business model that brings in enough money over time to fund the sort of concentrated engineering necessary to build robust software. Successful open source businesses often depend on finding the right markets or achieving network effects, and not all software can fit the pattern.

The good news is that once you get the economics right it really wrong-foots your closed source competitors. Red Hat has built a great business packaging and supporting open source for enterprises. They see open source as a competitive advantage that extends their market reach and speeds up innovation. An increasing number of companies producing DBMS software take the same view as they try to disrupt data management. Outside of enterprise software, Valve Software is attacking proprietary gaming platforms through open source.

It's great to see the growing number of businesses based on open source development. When the model works it is incredibly satisfying. I guess this is a fourth reason why I love working on open source software.
