As usual with our replication work, everything we are doing on Hadoop replication is open source. Builds and documentation will be publicly available by the 21st of February, which is when the talk happens. Hadoop support is already in testing with Continuent customers, and we are confident that we can handle basic loading cases already. That said, Hadoop is a complex beast with lots of use cases, and we need feedback from the community on how to make Tungsten loading support better. My colleagues and I plan to give a lot of talks about Hadoop to help community users get up to speed.
Here is a tiny taste of what MySQL to Hadoop loading looks like. Most MySQL users are familiar with sysbench. Have you ever wondered what sysbench tables would look like in Hadoop? Let's use the following sysbench command to apply transactions to table db01.sbtest:
sysbench --test=oltp --db-driver=mysql --mysql-host=logos1 --mysql-db=db01 \
  --mysql-user=tungsten --mysql-password=secret \
  --oltp-read-only=off --oltp-table-size=10000 \
  --oltp-index-updates=4 --oltp-non-index-updates=2 --max-requests=200000 \
  --max-time=900 --num-threads=5 run

This results in rows that look like the following in MySQL:

mysql> select * from sbtest where id = 2841\G
*************************** 1. row ***************************
 id: 2841
  k: 2
  c: 958856489-674262868-320369638-679749255-923517023-47082008-646125665-898439458-1027227482-602181769
pad: qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt

After replication into Hadoop with Tungsten, we can crunch the log records using a couple of HiveQL queries to generate a point-in-time snapshot of the sbtest table on HDFS. By a point-in-time snapshot, I mean a table that contains not only the inserted data but also reflects the subsequent update and delete operations on each row up to a particular point in time. We can now run the same query in Hive to see the data:

hive> select * from sbtest where id = 2841;
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
Job 0: Map: 1   Cumulative CPU: 0.74 sec   HDFS Read: 901196 HDFS Write: 158 SUCCESS
Total MapReduce CPU Time Spent: 740 msec
OK
2841    2       958856489-674262868-320369638-679749255-923517023-47082008-646125665-898439458-1027227482-602181769    qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt

Tungsten does a lot more than just move transaction data, of course. It also provides tools to generate Hive schema, performs transformations on columns to match the more limited set of HiveQL datatypes, and arranges data in a way that allows you to generate materialized views for analytic usage (like the preceding example) with minimal difficulty.
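The snapshot idea is easy to illustrate in miniature. The sketch below is in Python rather than HiveQL, and the log record layout (a sequence number, an operation code, a key, and row values) is invented for illustration; Tungsten's actual record format differs. The core logic is the same, though: replay change records in order, keep the latest version of each row, and drop rows whose last operation is a delete.

```python
# Toy sketch of materializing a point-in-time snapshot from a row
# change log. The (seqno, op, id, row) layout is hypothetical.

def snapshot(log_records, up_to_seqno):
    """Replay change records up to a sequence number.

    Inserts and updates keep the latest row version; deletes
    remove the row from the snapshot entirely.
    """
    rows = {}
    for seqno, op, row_id, row in sorted(log_records):
        if seqno > up_to_seqno:
            break
        if op in ("I", "U"):       # insert or update: latest version wins
            rows[row_id] = row
        elif op == "D":            # delete: row vanishes from the snapshot
            rows.pop(row_id, None)
    return rows

log = [
    (1, "I", 2841, {"k": 1, "c": "old"}),
    (2, "U", 2841, {"k": 2, "c": "new"}),
    (3, "I", 2842, {"k": 5, "c": "x"}),
    (4, "D", 2842, None),
]

print(snapshot(log, up_to_seqno=4))
# {2841: {'k': 2, 'c': 'new'}}
```

In HiveQL the same effect comes from grouping the log records by primary key, picking the highest sequence number per key, and filtering out keys whose final operation is a delete.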
If you want to learn more about how Tungsten does all of this magic, please attend the talk. I hope to see you in Los Angeles.
p.s., If you cannot attend SCaLE 12X, we will have a Continuent webinar on the same subject the following week. (Sign up here.)