Nov 18, 2011

Replicating Data Now and Then with Tungsten

What do cruise ship management software and data warehouses have in common?  One answer:  they both depend on intermittent data replication.   Large vessels collect data to share with a home base whenever connectivity permits.  If there is no connection, they just wait until later.  Data warehouses also do not replicate constantly.  Instead, it is often far faster to pool updates and load them in a single humongous batch using SQL COPY commands or native loaders.  Replicating updates in this way is sometimes known as batch replication.  Tungsten Replicator supports it quite easily.

To illustrate we will consider a Tungsten master/slave configuration.  (Sample setup instructions here.)  In this example MySQL-based web sales data upload to a data warehouse.   The master receives constant updates, which then apply at controlled intervals on the slave.

The first step is to turn off the slave replicator.  Login to the prod2 host and execute the following command.

$ trepctl offline

The prod2 slave will disconnect from the master as well as the data warehouse.  Updates now accumulate on the master.  We can turn on the slave to fetch and apply them all, then go offline again using one of three methods.  The first method uses the current sequence number on the master.  Here are sample commands to fetch and apply all transactions from the master up to the current master position.

$ trepctl -host prod1 status |grep appliedLastSeqno
appliedLastSeqno       : 19600
$ trepctl online -seqno 19600
$ trepctl wait -state OFFLINE -limit 300

As you can see, the first command locates the master sequence number.  The second command tells the slave to go online and replicate to sequence number 19600.  Finally the third command waits until either slave is back in the offline state or 300 seconds elapse, whichever comes first.  This is not strictly necessary for replication but is very handy for scripts, as it eliminates a potentially awkward polling loop.  

The second method is to use the MySQL binlog position on the master.  The idea is the same as the previous example.  We get the master binlog position, then tell the slave to apply transactions to that point and go offline.  Here's an example: 

$ mysql -utungsten -psecret -hprod1 -e 'show master status'
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
| mysql-bin.002023 | 92395851 |              |                  |
$ trepctl online -event mysql-bin.002023:0000000092395851
$ trepctl wait -state OFFLINE -limit 300

Note in this example that you must pad the binlog offset out to 16 digits, which means you must add the extra zeros shown in bold.  Tungsten compares native replication IDs as strings, so that we can handle other databases besides MySQL.  This normally a minor inconvenience, unless you don't know the trick.  In that case it could be a bit of a head-scratcher.  

There is a final way to implement batch replication using Tungsten's built-in heartbeat mechanism.   With this method we insert a named heartbeat event on the master, then ask the slave to replicate until the heartbeat appears.  Here's an example: 

$ trepctl -host prod1 heartbeat -name batch1
$ trepctl online -heartbeat batch1
$ trepctl wait -state OFFLINE -limit 300

This method is perhaps the simplest of all, because there is no need to check for either sequence numbers or binlog offsets on the master.  The only downside is that you must have a master and a slave replicator to use it.  It does not work with direct replication, in which a single replicator moves data from the master DBMS to the slave.  (This limitation will be removed in the future when Issue 228 is fixed.) 

When using any of these techniques, we may want to know whether Tungsten will really go offline at the correct point.  Fortunately, there's a simple way to find out.  The trepctl status command shows pending requests to go offline.  Let's say you check status after requesting the slave to replicate to a heartbeat as in the previous example.   

$ trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.002023:0000000104369615;37978
appliedLastSeqno       : 220126
appliedLatency         : 470.589
offlineRequests        : Offline at heartbeat event: batch1
state                  : ONLINE
timeInStateSeconds     : 2.436
uptimeSeconds          : 1742.0
Finished status command...

It is simple to see from the status output that Tungsten will go offline when it sees a heartbeat named batch1.

As this article shows, the trepctl online and trepctl wait commands make it very simple to implement batch replication.  You can simplify still further by wrapping the commands in a short script written in your favorite scripting language.   Either way you have a handy solution to a problem that affects a diverse set of applications.

This is not the end of Tungsten features to enable batch replication.  Tungsten has a new applier that can submit transactions using CSV files, which is critical to load transactions quickly into data warehouses.  We have been testing it out with Vertica, where early results show that it improves load performance by a factor of 100 or more in some cases.  I will describe this new feature in an upcoming article.  


Jonimatix said...

Great Article!
Once data is in Vertica, and users can create reports / dashboards, is there any security mechanism in Vertica to allow / disallow users parts of the information?


Robert Hodges said...

Vertica has built-in security but the kind of end-user security to which you are referring tends to be built into end-user business intelligence tools. Here's just one of many examples:

This applies to Business Objects but the same principles apply to other tools. Another approach is to build the authorization into the applications that call reports.