Real-time analytics allow companies to react rapidly to changing business conditions. Online ad services process click-through data to maximize ad impressions. Retailers analyze sales patterns to identify micro-trends and move inventory to meet them. The common theme is speed: moving lots of information without delay from operational systems to fast data warehouses that can feed reports back to users as quickly as possible.
Real-time data publishing is a classic example of a big data replication problem. In this two-part article I will describe recent work on Tungsten Replicator to move data out of MySQL into Vertica at high speed with minimal load on DBMS servers. This feature is known as batch loading. Batch loading enables not only real-time analytics but also any other application that depends on moving data efficiently from MySQL into a data warehouse.
The first article works through the overall solution, starting with the replication challenges posed by real-time analytics and ending with a description of how Tungsten adapts real-time replication to data warehouses. If you are in a hurry to set up, just skim this article and jump straight to the implementation details in the follow-on article.
Replication Challenges for Real-Time Analytics
To understand some of the difficulties of replicating to a data warehouse, imagine a hosted intrusion detection service that collects access log data from across the web and generates security alerts as well as threat assessments for users. The architecture for this application follows a pattern that is increasingly common in businesses that have to analyze large quantities of incoming data.
Access log entries arrive through data feeds, whereupon an application server scans them for suspicious activity and commits the results into a front-end DBMS tier of sharded MySQL servers. The front-end tier plays to a MySQL sweet spot: fast processing of large numbers of small transactions.
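To make that workload concrete, here is a minimal sketch of the front-end write path in Python using the mysql-connector-python driver. The shard hosts, credentials, access_log table, and hash-on-source-IP routing are illustrative assumptions, not details of Tungsten Replicator or of the service described above.

# Sketch of the sharded MySQL front-end pattern: route each access log
# entry to a shard and commit it as a single small transaction.
# Hosts, schema, and routing scheme are hypothetical.
import hashlib
import mysql.connector

SHARD_HOSTS = ["mysql-shard0.example.com", "mysql-shard1.example.com"]

def shard_for(source_ip):
    """Pick a shard by hashing the entry's source IP."""
    digest = hashlib.md5(source_ip.encode()).hexdigest()
    return SHARD_HOSTS[int(digest, 16) % len(SHARD_HOSTS)]

def record_entry(source_ip, url, suspicious):
    """Commit one access log entry as one small transaction --
    the workload shape the front-end tier is optimized for."""
    conn = mysql.connector.connect(
        host=shard_for(source_ip), user="app", password="secret",
        database="ids")
    try:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO access_log (source_ip, url, suspicious) "
            "VALUES (%s, %s, %s)",
            (source_ip, url, suspicious))
        conn.commit()  # many such commits per second across the shards
    finally:
        conn.close()

record_entry("203.0.113.7", "/admin/login", True)

Each entry touches exactly one shard and commits immediately, which keeps transactions short and lets the tier scale horizontally; it is precisely this stream of small commits that the replication pipeline must later consolidate for the data warehouse.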