Sunday, June 3, 2012

MySQL to Vertica Replication, Part 2: Setup and Operation

As described in the first article of this series, Tungsten Replicator can replicate data from MySQL to Vertica in real-time.  We use a new batch loading feature that applies transactions to data warehouses in very large blocks using COPY or LOAD DATA INFILE commands.  This second and concluding article walks through the details of setting up and testing MySQL to Vertica replication.

To keep the article reasonably short, I assume that readers are conversant with MySQL, Tungsten, and Vertica.  Basic replication setup is not hard if you follow all the steps described here, but of course there are variations in every setup.  For more information on Tungsten check out the Tungsten Replicator project at code.google.com site well as current Tungsten commercial documentation at Continuent

Now let's get replication up and running!  

What Is the Topology? 

In this exercise we will set up Tungsten master/slave replication to move data from MySQL 5.1 to Vertical 5.1.  Master/slave is the simplest topology to set up because you don't have to mix settings for different DBMS types in each service.  To keep keep things simple, we will install Tungsten directly on the MySQL and Vertica hosts, which are named db1 and db2 respectively.  Here is a diagram:


There are of course many other possible configurations.  You can run replicators on separate hosts to reduce load on the DBMS servers.  You can with a little patience set up direct replication using a Tungsten single replication service, which results in fewer processes to manage.  Finally, you can use both direct as well as master/slave topologies to publish data from Tungsten 1.5 clusters.  Tungsten clusters provide availability and scaling on the MySQL side.

MySQL to Vertica Replication, Part 1: Enabling Real-Time Analytics with Tungsten

Real-time analytics allow companies to react rapidly to changing business conditions.   Online ad services process click-through data to maximize ad impressions.  Retailers analyze sales patterns to identify micro-trends and move inventory to meet them.  The common theme is speed: moving lots of information without delay from operational systems to fast data warehouses that can feed reports back to users as quickly as possible.

Real-time data publishing is a classic example of a big data replication problem.  In this two-part article I will describe recent work on Tungsten Replicator to move data out of MySQL into Vertica at high speed with minimal load on DBMS servers.  This feature is known as batch loading.  Batch loading enables not only real-time analytics but also any other application that depends on moving data efficiently from MySQL into a data warehouse.

The first article works through the overall solution starting with replication problems for real-time analytics through a description of how Tungsten adapts real-time replication to data warehouses.  If you are in a hurry to set up, just skim this article and jump straight to the implementation details in the follow-on article.

Replication Challenges for Real-Time Analytics

To understand some of the difficulties of replicating to a data warehouse, imagine a hosted intrusion detection service that collects access log data from across the web and generates security alerts as well as threat assessments for users.  The architecture for this application follows a pattern that is increasingly common in businesses that have to analyze large quantities of incoming data.  


Access log entries arrive through data feeds, whereupon an application server checks them to look for suspicious activity and commits results into a front-end DBMS tier of sharded MySQL servers.  The front-end tier optimizes for a MySQL sweet spot, namely fast processing of a lot of small transactions.

Scaling Databases Using Commodity Hardware and Shared-Nothing Design