The Scale-Out Blog: Creating robust applications using open source databases and commodity hardware. By Robert Hodges.

<div class="MsoNormal"><b>An Ending and a Beginning: VMware Has Acquired Continuent</b> (2014-10-29)</div>
<div class="MsoNormal">
As of today, Continuent is part of VMware. We are absolutely
over the moon about it.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
You can read more about the news on the <a href="http://blogs.vmware.com/vcloud/2014/10/vmware-acquires-continuent.html">VMware vCloud blog</a> by
Ajay Patel, our new boss. There's also an official post on our <a href="http://continuent-tungsten.blogspot.com/2014/10/vmware-acquires-continuent.html">Continuent company blog</a>. In a nutshell, the Continuent team is joining the VMware Cloud Services Division. We will continue to
improve, sell, and support our Tungsten products and work on innovative
integration into VMware's product line.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
So why do I feel exhilarated about joining VMware? There are three reasons.</div>
<div class="MsoNormal">
<br /></div>
<ol style="margin-left: .25in;">
<li style="margin-bottom: 6.0pt;">Continuent is joining a world-class company that
is the leader in virtualization and cloud infrastructure solutions. Even
better, VMware understands the value of data to businesses and shares our vision of managing an
integrated fabric of standard DBMS platforms, both in public clouds and in
local data centers. It is a great home to advance our work for many years to
come.</li>
<li style="margin-bottom: 6.0pt;">We can continue to support our existing users
and make Tungsten even better. I know many of you have made big decisions to
adopt Continuent technology that would affect your careers if they turned out
badly. We now have more resources and a mandate to grow our product line. We
will be able to uphold our commitments to you and your businesses.</li>
<li>It's
a great outcome for our team, which has worked for many years to make
Continuent Tungsten technology successful. This includes our investors at Aura in
Helsinki, who have been dogged in their support throughout our journey.</li>
</ol>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Speaking of the Continuent team… I am so proud of what all of
you have achieved. Today we are starting
a new chapter in our work together. See you at VMware!</div>
<div class="MsoNormal"><b>Exorcising the CAP Demon</b> (2014-10-06)</div>
Computer science is like an enormous tool box you can rummage through whenever you have a problem to solve. Most of the tools are sturdy and practical, like algorithms for B-trees. Some are also elegant, like consistent hashing in Dynamo. Finally, there are some tools that you never quite figure out even after years of reflection. That piece of steel you are looking at could be Excalibur. Or it could be a rusty knife.<br />
<br />
The <a href="http://en.wikipedia.org/wiki/CAP_theorem">CAP theorem</a> falls into the last category, at least for me. It was a major topic in the blogosphere a few years ago and Google Trends shows <a href="https://www.google.com/trends/explore#q=cap%20theorem&cmpt=q">steadily increasing interest in the term</a> since 2010. It's not my goal to explain CAP fully--a good informal description is <a href="http://ksat.me/a-plain-english-introduction-to-cap-theorem/">here</a> or you can just read the proof yourself. Instead I would like to talk about how I understand and use the CAP theorem today as well as how that understanding might evolve in the future.<br />
<br />
In a nutshell CAP puts a limit on how distributed database systems trade off data consistency and system availability. Eric Brewer originated the theorem as a conjecture in the late 1990s. Seth Gilbert and Nancy Lynch supplied a <a href="http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf">proof of the conjecture</a> in 2002. Brewer <a href="http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed">described it as follows in 2012</a>:<br />
<blockquote class="tr_bq">
<i>The CAP theorem states that any networked shared-data system can have at most two of three desirable properties:</i><br />
<ul>
<li><i>consistency (C) equivalent to having a single up-to-date copy of the data;</i></li>
<li><i>high availability (A) of that data (for updates); and</i></li>
<li><i>tolerance to network partitions (P).</i></li>
</ul>
</blockquote>
My initial problem in understanding CAP was relating the proof to what happens in the real world, which is not especially easy. Network partitions are an example. Here's how the Gilbert/Lynch proof defines them in Section 2.3.<br />
<blockquote class="tr_bq">
<i>When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost. (And any pattern of message loss can be modeled as a temporary partition separating the communicating nodes at the exact instant the message is lost.)</i></blockquote>
<div>
So does this include an asymmetric communication failure? That's where a process on one host can see and send messages to a process on another host but the reverse is not true. This happens all the time in group communications for reasons that range from application software bugs to bad cabling and everything in between. Do you model the asymmetry as a sequence of temporary partitions? It's of course possible. But it feels a bit like using <a href="http://en.wikipedia.org/wiki/Deferent_and_epicycle">Ptolemaic astronomy with epicycles</a>. </div>
<div>
<br /></div>
<div>
Other people have made similar observations. Eric Brewer even wrote about the "nuances" of partitions in his 2012 retrospective. There are analogous problems with the other terms. There was enough public disagreement about their meaning that I wrote a <a href="http://scale-out-blog.blogspot.com/2012/04/disproving-cap-theorem.html">"disproof" of CAP</a> a few years back as an April Fools' Day joke. It depended on not being able to distinguish CA and CP choices in real systems. </div>
<div>
<br /></div>
<div>
That confusion is not a problem with the CAP theorem itself. Nobody has seriously challenged the proof. Instead, it's a matter of what logicians refer to as <a href="http://en.wikipedia.org/wiki/Interpretation_(logic)">interpretation</a>, which links a logical model to some domain of discourse so that you can draw valid conclusions about that domain. If you want to reason about real-world systems using the CAP theorem you must first ensure your systems really match the model. Otherwise it's like using a map of Oregon to drive between New York and Boston. The core difficulty is that the CAP theorem proof assumes binary properties whereas in reality properties like availability operate on a sliding scale. </div>
<div>
<br />
My other issue with CAP evaluation is what you might call a suitability problem. There are a lot of issues with operating distributed systems, and the 3-way trade-off is irrelevant to many of them. For instance, what happens when the network is behaving and you don't have to make pesky choices between availability and consistency? Let's look at some examples. </div>
<div>
<br /></div>
<div>
CAP defines consistency as <a href="http://en.wikipedia.org/wiki/Linearizability#Linearizability_versus_serializability">linearizability</a>, which means that transactions on different replicas look as if they all happened one after another in a single place, in a single unbroken series. Imagine driving around to different automated teller machines at a bank and making changes to your account balance or checking it. No matter which teller machine you visit next, it knows exactly what happened before and has the right balance. Or imagine a shopping cart on a website like Zappos.com. No matter how you jump around the website to select clothing, or even if you fold up your laptop and fly to Paris, the items in your shopping cart remain consistent without duplicate or missing selections. </div>
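The ATM example can be made concrete with a toy sketch. The class below is my own illustration, not code from any real system: because every deposit, withdrawal, and read passes through one lock-protected copy, all clients observe a single unbroken history of balances, which is the essence of linearizability.

```python
import threading

class LinearizableAccount:
    """Toy model of the ATM example: every operation goes through a
    single lock-protected copy, so operations appear to take effect
    atomically, one after another, at a single point."""

    def __init__(self, balance=0):
        self._balance = balance
        self._lock = threading.Lock()

    def deposit(self, amount):
        with self._lock:
            self._balance += amount
            return self._balance

    def withdraw(self, amount):
        with self._lock:
            self._balance -= amount
            return self._balance

    def read(self):
        with self._lock:
            return self._balance

# No matter which "teller machine" (thread) you visit next,
# it sees the effect of every earlier operation.
account = LinearizableAccount(100)
account.deposit(50)
account.withdraw(30)
print(account.read())  # 120
```

Distributed systems achieve the same single-copy illusion with far more machinery (consensus, replication protocols), but the observable guarantee is the one this single lock provides.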
<div>
<br /></div>
<div>
You might say, well, not all systems work that way. You would be right, and that's exactly the point. Real distributed systems do not always try to ensure linearizability. It turns out that many people, most particularly the end users who ultimately pay for computer systems, conclude they don't really care so much about consistency of the sort CAP promises. Here are two different types of reasons: </div>
<div>
<br /></div>
<div>
<b>1. Linearized consistency is expensive.</b> Keeping active replicas up to date requires round-trip messages between hosts, which can increase transaction commit times by an order of magnitude or more. Users are allergic to slow response, regardless of any other benefits that slowness might bring. Daniel Abadi pointed out this latency problem some time ago in <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">a great blog post on CAP</a> that is still excellent reading today. </div>
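Back-of-the-envelope arithmetic shows why. The figures below are illustrative assumptions of my own, not measurements: a commit confined to one host versus the same commit waiting on a single synchronous round trip to a replica in another data center.

```python
# Illustrative figures only; actual latencies vary widely.
local_commit_ms = 1.0   # commit confined to one host
replica_rtt_ms = 40.0   # one cross-datacenter round trip

# A synchronous replica acknowledgment adds at least one full RTT.
sync_commit_ms = local_commit_ms + replica_rtt_ms

slowdown = sync_commit_ms / local_commit_ms
print(f"{slowdown:.0f}x slower")  # 41x slower
```

Even one round trip per commit dominates local processing time, which is why synchronous replication across regions is such a hard sell for latency-sensitive applications.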
<div>
<br /></div>
<div>
<b>2. Linearized consistency is irrelevant for many applications. </b>Consider a measurement from a household thermometer or a text message from a cell phone. There is only one of each generated in a single location. Your servers either get them or they don't. Multiple copies are just that: replicas of the same thing. Conflicts don't exist. </div>
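Because each such record is generated exactly once in one place, replication reduces to idempotent delivery: store the record under its unique identifier and duplicates collapse harmlessly. A minimal sketch (the identifiers and record shape are invented for illustration):

```python
# Each measurement carries a globally unique id; applying it twice is a
# no-op, so replicas can receive the same record any number of times
# without any possibility of conflict.
store = {}

def apply(record):
    store.setdefault(record["id"], record)

m = {"id": "thermo-42:2014-02-17T20:00", "temp_c": 19.5}
apply(m)
apply(m)  # duplicate delivery from a second replica
print(len(store))  # 1
```

There is nothing to reconcile and no "latest writer" to pick, so the consistency half of the CAP trade-off simply never comes into play for this data.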
<div>
<br />
The share of immutable data from analytic systems like Hadoop and object stores like Amazon S3 is increasing rapidly, which means that there is an increasing number of applications for which CAP is not the only or even a major design consideration. It might be in the guts of the system but it's just one of many problems at that level and there may be multiple choices. The <a href="http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html">original Hadoop architecture</a> actually ignored CAP for one critical part of the system--the NameNode, which maps HDFS file names to storage, was a single point of failure.<br />
<br /></div>
<div>
Which brings us back to understanding CAP at a practical level. Is it Excalibur or just the rusty knife? At this point it feels like another tool in the toolbox that you use at the right time, albeit carefully. Imagine a band saw that does not have a very good guard on the blade. Here are my personal instructions for safe use. </div>
<div>
<br /></div>
<div>
<b>1. Use it for suitable problems</b>. The CAP theorem applies to a very specific problem involving systems that want to remain consistent and available across multiple networked hosts. If you design clusters or distributed databases, this is a relatively big deal. The trade-offs are real and you have to think about them. </div>
<div>
<br /></div>
<div>
For instance at <a href="http://www.continuent.com/">Continuent</a> we have some problems where the theorem is directly applicable. We build clusters that implement failover. We have to consider how to establish consensus while keeping the cluster available even when members lose messages or respond slowly. The CAP theorem guides you to manage this kind of problem rather than try to solve it using techniques that will not work, such as adding timeouts on messages. (Continuent Tungsten clusters are generally CP, in case you are wondering.) </div>
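The practical consequence for cluster designers is that membership decisions must rest on explicit majorities rather than timeouts. Here is a minimal sketch of the majority test, a deliberate simplification of what real consensus protocols such as Paxos or Raft do:

```python
def has_quorum(reachable_members, cluster_size):
    """A partition side may act (e.g., elect a new master) only if it
    can see a strict majority of the cluster. Since two disjoint sides
    cannot both hold a majority, at most one side can proceed."""
    return reachable_members >= cluster_size // 2 + 1

# In a 5-node cluster split 3/2, only the 3-node side keeps quorum.
print(has_quorum(3, 5), has_quorum(2, 5))  # True False
```

The minority side must stop accepting writes (sacrificing availability for consistency), which is what "generally CP" means in practice.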
<div>
<br /></div>
<div>
<b>2. Avoid CAP where it does not obviously apply</b>. It is a tricky theorem to interpret correctly, and many applications are concerned with unrelated problems. I work a lot on transactional replication. There are no CAP issues in Tungsten Replicator. At the other end of the spectrum if you build systems that link multiple stores using replication, you likely have multiple CAP choices under the covers. That's a common pattern in complex applications. </div>
<div>
<br /></div>
<div>
It is therefore important to look with a jaundiced eye upon any product that claims to "beat CAP," like <a href="http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html">this widely read article</a>. This is just marketing hype. If your application matches the CAP theorem model, it applies and you are subject to the limitations. If the limitations don't seem to make sense you have not evaded them. You are either working on a problem to which CAP is not relevant or you made implicit CAP choices of which you are not aware. It is easy to make a fool of yourself by asserting otherwise. </div>
<div>
<br /></div>
<div>
<b>3. Other tools are important too.</b> CAP of course does not even cover all trade-offs in clusters. There are also many issues to consider when building distributed data systems that actually work. Latency, durability of data, monitoring, automation, reliability, ability to do zero-downtime maintenance, and security are critical. <u><i>Especially</i></u> security. That looks like the next big problem for a lot of existing distributed systems. </div>
<div>
<br />
Beyond these, don't stop thinking about CAP. It is one of those ideas that gets under your skin and really bugs you. In addition to Eric Brewer's 2012 article, Seth Gilbert and Nancy Lynch wrote a <a href="http://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf">follow-up perspective on the implications of CAP</a>, so even the originators are continuing to consider the problems. The long term value of CAP is that it has focused attention on a set of difficult data management problems and led to numerous productive ideas about how to manage them. The resulting evolution is not nearly finished. We will all continue to worry this bone for many years to come. </div>
<div class="MsoNormal"><b>No Hadoop Fun for Me at SCaLE 12X :(</b> (2014-02-20)</div>
I blogged a couple of weeks ago about <a href="http://scale-out-blog.blogspot.com/2014/02/fun-with-mysql-and-hadoop-at-scale-12x.html">my upcoming MySQL/Hadoop talk at SCaLE 12X</a>. Unfortunately I had to cancel. A few days after writing the article I came down with an eye problem that is fixed but prevents me from flying anywhere for a few weeks. That's a pity, as I was definitely looking forward to attending the conference and explaining how Tungsten replicates transactions from MySQL into HDFS.<br />
<br />
Meanwhile, we are still moving at full steam with Hadoop-related work at Continuent, which is the basis for the next major replication release, <a href="https://code.google.com/p/tungsten-replicator/">Tungsten Replicator</a> 3.0.0. Binary builds and documentation will go up in a few days. There will also be many more public talks about Hadoop support, starting in April at <a href="http://www.percona.com/live/mysql-conference-2014">Percona Live 2014</a>. I hope you'll consider attending one of our talks there. It's a great conference.<br />
<br />
Since my SCaLE 12X talk won't be happening, I would like to repeat the invitation to attend the <a href="http://www.continuent.com/news/live-webinars/1389-real-time-data-loading-from-mysql-to-hadoop">Continuent webinar on loading from MySQL to Hadoop</a> on Thursday, February 27th. It's essentially the same talk, but no airplanes are involved.<br />
<br />
<div class="MsoNormal"><b>Why Aren't All Data Immutable?</b> (2014-02-17)</div>
Over the last few years there has been an increasing interest in immutable data management. This is a big change from the traditional update-in-place approach many database systems use today, where new values delete old values, which are then lost. With immutable data you record everything, generally using methods that append data from successive transactions rather than replacing them. In some DBMS types you can access the older values, while in others the system transparently uses the old values to solve useful problems like implementing eventual consistency.<br />
<br />
Baron Schwartz <a href="http://www.xaprb.com/blog/">recently pointed out</a> that it can be hard to get decent transaction processing performance based on append-only methods like <a href="http://guide.couchdb.org/draft/btree.html">append-only B-trees</a>. This is not a very strong argument against immutable data per se. Immutable data are already in wide use. It is actually surprising they have not made deeper inroads into online transaction processing, which is widely handled by relational DBMS servers like MySQL and Oracle. <br />
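The update-in-place versus append-only distinction is easy to state in code. The hypothetical store below (names and structure are mine, purely for illustration) never overwrites: each write appends a new version, and old values stay readable, which is exactly what enables history-based techniques like eventual-consistency reconciliation.

```python
from collections import defaultdict

class AppendOnlyStore:
    """Updates append versions instead of overwriting in place."""

    def __init__(self):
        self._versions = defaultdict(list)  # key -> [v1, v2, ...]

    def put(self, key, value):
        self._versions[key].append(value)   # never deletes old values

    def get(self, key):
        return self._versions[key][-1]      # current value

    def history(self, key):
        return list(self._versions[key])    # nothing is ever lost

s = AppendOnlyStore()
s.put("balance", 100)
s.put("balance", 70)
print(s.get("balance"), s.history("balance"))  # 70 [100, 70]
```

The performance question Baron raises is about making `put` fast on disk-backed structures at OLTP rates, not about whether keeping the history is worthwhile.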
<br />
<b>Immutable Data Are Now Economically Feasible</b><br />
<br />
One reason for the popularity of update-in-place approaches is simple: storage used to be really expensive. This is no longer the case. Many applications can now afford to store the entire DBMS transaction log almost indefinitely. To illustrate, look at storage costs in Amazon Web Services. Applications running in Amazon have API-level access to practically unlimited replicated, long-term storage through services like <a href="http://aws.amazon.com/s3/">S3</a> and <a href="http://aws.amazon.com/glacier/">Glacier</a>. Amazon conveniently publishes prices that serve as good proxies for storage costs in general. Using these numbers, I worked up a <a href="https://docs.google.com/spreadsheet/ccc?key=0AtfIlRf7_aA6dFRfYWhkUGp6VEJJMWdnT3RDeUJLTEE&usp=sharing">simple spreadsheet</a> that shows the cost of storing 7 years of transactions for a made-up business application. <br />
<br />
To start with, assume our sample app generates one thousand transactions per second at 1,000 bytes per transaction. This is not exceedingly busy by some standards but is relatively high for business systems that handle human-generated transactions. The main place you see numbers approaching this level is SaaS businesses that handle many customers on a single system. Our sample system generates about 205,591 gigabytes of data over seven years. <br />
<br />
<table border="1" cellpadding="3" cellspacing="0">
<tbody>
<tr><th>Xacts/Sec</th><th>Bytes/Xact</th><th>Bytes/Sec</th><th>GB Generated in 1 Hour</th><th>GB Generated in 1 Day</th><th>GB Generated in 1 Month</th><th>GB Generated in 1 Year</th><th>GB Generated in 7 Years</th></tr>
<tr><td>1,000</td><td>1,000</td><td>1,000,000</td><td>3.35</td><td>80.47</td><td>2,447.52</td><td>29,370.19</td><td>205,591.32</td></tr>
</tbody></table>
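As a check on the arithmetic, the 7-year total can be reproduced in a few lines (assuming 365-day years and binary gigabytes, which is how the totals in the table work out):

```python
XACTS_PER_SEC = 1_000
BYTES_PER_XACT = 1_000
SECONDS_PER_YEAR = 86_400 * 365  # 365-day years
GIB = 2**30                      # binary gigabytes

bytes_in_7_years = XACTS_PER_SEC * BYTES_PER_XACT * SECONDS_PER_YEAR * 7
gb_in_7_years = bytes_in_7_years / GIB
print(f"{gb_in_7_years:,.2f} GB")  # 205,591.32 GB
```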
<br />
Amazon storage costs vary from $0.011/GB/month for Glacier to $0.09/GB/month for S3 with full redundancy. (These are numbers for the US-West region as of 29 December 2013.) Annual storage costs for 7 years of data are pretty hefty if you store uncompressed data. However, if you factor in compression--for example, MySQL binlogs tend to compress around 90% in my experience--things start to look a lot better.<br />
<br />
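The annual cost figures are easy to reproduce. The sketch below uses the two per-GB rates quoted above (Glacier and S3 Standard; the Reduced Redundancy rate is not quoted in the text, so it is omitted) and models compression as a simple multiplier on the stored volume. The numbers are carried over from the example, not current AWS pricing.

```python
GB_7_YEARS = 205_591.32  # 7-year volume from the sizing table
MONTHLY_RATE = {"Glacier": 0.011, "S3 Standard": 0.09}  # $/GB/month

def annual_cost(service, compression):
    """Annual cost of holding the full 7-year archive at a given
    compression ratio (0.9 means the data shrinks by 90%)."""
    stored_gb = GB_7_YEARS * (1 - compression)
    return stored_gb * MONTHLY_RATE[service] * 12

print(f"${annual_cost('Glacier', 0.0):,.2f}")      # $27,138.05
print(f"${annual_cost('Glacier', 0.9):,.2f}")      # $2,713.81
print(f"${annual_cost('S3 Standard', 0.0):,.2f}")  # $222,038.63
```

At 90% compression the Glacier figure drops to a few thousand dollars a year, which is the point of the table below.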
<table border="1" cellpadding="3" cellspacing="0">
<tbody>
<tr><th></th><th colspan="7">Annual cost to store 7 years of data at different levels of compression</th></tr>
<tr><th></th><th>0%</th><th>20%</th><th>40%</th><th>60%</th><th>70%</th><th>80%</th><th>90%</th></tr>
<tr><td>Glacier</td><td>$27,138.05</td><td>$21,710.44</td><td>$16,282.83</td><td>$10,855.22</td><td>$8,141.42</td><td>$5,427.61</td><td>$2,713.81</td></tr>
<tr><td>S3 Reduced Redundancy</td><td>$177,630.90</td><td>$142,104.72</td><td>$106,578.54</td><td>$71,052.36</td><td>$53,289.27</td><td>$35,526.18</td><td>$17,763.09</td></tr>
<tr style="height: 16px;"><td style="border-bottom-color: rgb(204, 204, 204); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(204, 204, 204); border-right-style: solid; border-right-width: 1px; direction: ltr; padding-bottom: 0px; padding-left: 3px; padding-right: 3px; padding-top: 0px; text-align: left; vertical-align: bottom;">S3 Standard</td><td style="border-bottom-color: rgb(204, 204, 204); border-bottom-style: solid; border-bottom-width: 1px; border-right-color: rgb(204, 204, 204); border-right-style: solid; border-right-width: 1px; padding-bottom: 0px; padding-left: 3px; padding-right: 3px; padding-top: 0px; text-align: right; vertical-align: bottom;">$222,038.63</td><td style="border-bottom-color: rgb(204, 204, 204); border-bottom-style: solid; border-bottom-width: 1px; border-right-color: rgb(204, 204, 204); border-right-style: solid; border-right-width: 1px; padding-bottom: 0px; padding-left: 3px; padding-right: 3px; padding-top: 0px; text-align: right; vertical-align: bottom;">$177,630.90</td><td style="border-bottom-color: rgb(204, 204, 204); border-bottom-style: solid; border-bottom-width: 1px; border-right-color: rgb(204, 204, 204); border-right-style: solid; border-right-width: 1px; padding-bottom: 0px; padding-left: 3px; padding-right: 3px; padding-top: 0px; text-align: right; vertical-align: bottom;">$133,223.18</td><td style="border-bottom-color: rgb(204, 204, 204); border-bottom-style: solid; border-bottom-width: 1px; border-right-color: rgb(204, 204, 204); border-right-style: solid; border-right-width: 1px; padding-bottom: 0px; padding-left: 3px; padding-right: 3px; padding-top: 0px; text-align: right; vertical-align: bottom;">$88,815.45</td><td style="border-bottom-color: rgb(204, 204, 204); border-bottom-style: solid; border-bottom-width: 1px; border-right-color: rgb(204, 204, 204); border-right-style: solid; 
border-right-width: 1px; padding-bottom: 0px; padding-left: 3px; padding-right: 3px; padding-top: 0px; text-align: right; vertical-align: bottom;">$66,611.59</td><td style="border-bottom-color: rgb(204, 204, 204); border-bottom-style: solid; border-bottom-width: 1px; border-right-color: rgb(204, 204, 204); border-right-style: solid; border-right-width: 1px; padding-bottom: 0px; padding-left: 3px; padding-right: 3px; padding-top: 0px; text-align: right; vertical-align: bottom;">$44,407.73</td><td style="border-bottom-color: rgb(204, 204, 204); border-bottom-style: solid; border-bottom-width: 1px; border-right-color: rgb(204, 204, 204); border-right-style: solid; border-right-width: 1px; padding-bottom: 0px; padding-left: 3px; padding-right: 3px; padding-top: 0px; text-align: right; vertical-align: bottom;">$22,203.86</td></tr>
</tbody></table>
<br />
The raw costs still look hefty to the untrained eye, but we need to factor in the <i><u>real</u></i> expense of operating this type of system. Here's a typical cost structure for a 3-node cluster (to ensure HA) with labor costs factored in and 7 years of data preserved. I have put in generously small IT overhead costs, including software development, since the code has to come from somewhere. Under these assumptions long-term storage costs are less than 10% of the yearly cost of operation. <br />
<br />
<table cellpadding="0" cellspacing="0" dir="ltr" style="font-family: arial,sans,sans-serif; font-size: 13px; table-layout: fixed;"><colgroup><col width="160"></col><col width="145"></col><col width="120"></col><col width="120"></col><col width="120"></col></colgroup><tbody>
<tr style="height: 16px;"><td style="border-bottom: 1px solid #000000; border-left: 1px solid #ccc; border-right: 1px solid #ccc; border-top: 1px solid #ccc; color: black; direction: ltr; font-weight: bold; padding: 0px 3px 0px 3px; text-align: left; vertical-align: bottom;">Component</td><td style="border-bottom: 1px solid #000000; border-right: 1px solid #ccc; border-top: 1px solid #ccc; color: black; direction: ltr; font-weight: bold; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;">Cost</td><td style="border-bottom: 1px solid #000000; border-right: 1px solid transparent; border-top: 1px solid #ccc; color: black; direction: ltr; font-weight: bold; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;">Percentage</td><td colspan="2" rowspan="1" style="border-bottom: 1px solid #000000; border-right: 1px solid #ccc; border-top: 1px solid #ccc; color: black; direction: ltr; font-weight: bold; padding: 0px 3px 0px 3px; text-align: left; vertical-align: bottom;">Notes</td></tr>
<tr style="height: 16px;"><td style="border-bottom: 1px solid #ccc; border-left: 1px solid #ccc; border-right: 1px solid #ccc; color: black; direction: ltr; padding: 0px 3px 0px 3px; text-align: left; vertical-align: bottom;">3 i2.4xlarge instances</td><td style="border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;">$46,306.68</td><td style="border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;">20.09%</td><td colspan="2" rowspan="1" style="border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; direction: ltr; padding: 0px 3px 0px 3px; text-align: left; vertical-align: bottom;">(Heavy utilization reserved, 1 yr. term)</td></tr>
<tr style="height: 16px;"><td style="border-bottom: 1px solid #ccc; border-left: 1px solid #ccc; border-right: 1px solid #ccc; color: black; direction: ltr; padding: 0px 3px 0px 3px; text-align: left; vertical-align: bottom;">3 support licenses</td><td style="border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;">$15,000.00</td><td style="border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;">6.51%</td><td colspan="2" rowspan="1" style="border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; direction: ltr; padding: 0px 3px 0px 3px; text-align: left; vertical-align: bottom;">(Support subscription costs * 3x) </td></tr>
<tr style="height: 16px;"><td style="border-bottom: 1px solid #ccc; border-left: 1px solid #ccc; border-right: 1px solid #ccc; color: black; direction: ltr; padding: 0px 3px 0px 3px; text-align: left; vertical-align: bottom;">Raw dbadmin labor</td><td style="border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;">$12,000.00</td><td style="border-bottom: 1px solid #ccc; border-right: 1px solid transparent; color: black; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;">5.21%</td><td colspan="2" rowspan="1" style="border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; direction: ltr; padding: 0px 3px 0px 3px; text-align: left; vertical-align: bottom;">(1 FTE/30 DBMS servers @ 120K per)</td></tr>
<tr style="height: 16px;"><td style="background-color: white; border-bottom: 1px solid #ccc; border-left: 1px solid #ccc; border-right: 1px solid #ccc; color: black; direction: ltr; font-family: arial,sans,sans-serif; font-size: 100.0%; font-style: normal; font-weight: normal; overflow: hidden; padding: 0px 3px 0px 3px; text-align: left; text-decoration: none; vertical-align: bottom; vertical-align: bottom; white-space: normal;">Software dev/QA</td><td style="background-color: white; border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; font-family: arial,sans,sans-serif; font-size: 100.0%; font-style: normal; font-weight: normal; overflow: hidden; padding: 0px 3px 0px 3px; text-align: right; text-decoration: none; vertical-align: bottom; vertical-align: bottom; white-space: normal;">$120,000.00</td><td style="background-color: white; border-bottom: 1px solid #ccc; border-right: 1px solid transparent; color: black; font-family: arial,sans,sans-serif; font-size: 100.0%; font-style: normal; font-weight: normal; overflow: hidden; padding: 0px 3px 0px 3px; text-align: right; text-decoration: none; vertical-align: bottom; vertical-align: bottom; white-space: normal;">52.06%</td><td colspan="2" rowspan="1" style="background-color: white; border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; direction: ltr; font-family: arial,sans,sans-serif; font-size: 100.0%; font-style: normal; font-weight: normal; overflow: hidden; padding: 0px 3px 0px 3px; text-decoration: none; vertical-align: bottom; vertical-align: bottom; white-space: normal;">(10 FTE/30 DBMS servers @ 120K per)</td></tr>
<tr style="height: 16px;"><td style="background-color: white; border-bottom: 1px solid #ccc; border-left: 1px solid #ccc; border-right: 1px solid #ccc; color: black; direction: ltr; font-family: arial,sans,sans-serif; font-size: 100.0%; font-style: normal; font-weight: normal; overflow: hidden; padding: 0px 3px 0px 3px; text-align: left; text-decoration: none; vertical-align: bottom; vertical-align: bottom; white-space: normal;">Misc. overhead costs</td><td style="background-color: white; border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; font-family: arial,sans,sans-serif; font-size: 100.0%; font-style: normal; font-weight: normal; overflow: hidden; padding: 0px 3px 0px 3px; text-align: right; text-decoration: none; vertical-align: bottom; vertical-align: bottom; white-space: normal;">$15,000.00</td><td style="background-color: white; border-bottom: 1px solid #ccc; border-right: 1px solid transparent; color: black; font-family: arial,sans,sans-serif; font-size: 100.0%; font-style: normal; font-weight: normal; overflow: hidden; padding: 0px 3px 0px 3px; text-align: right; text-decoration: none; vertical-align: bottom; vertical-align: bottom; white-space: normal;">6.51%</td><td colspan="2" rowspan="1" style="background-color: white; border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; direction: ltr; font-family: arial,sans,sans-serif; font-size: 100.0%; font-style: normal; font-weight: normal; overflow: hidden; padding: 0px 3px 0px 3px; text-align: left; text-decoration: none; vertical-align: bottom; vertical-align: bottom; white-space: normal;">($5K per server)</td></tr>
<tr style="height: 16px;"><td style="border-bottom: 1px solid #000000; border-left: 1px solid #ccc; border-right: 1px solid #ccc; color: black; direction: ltr; padding: 0px 3px 0px 3px; text-align: left; vertical-align: bottom;">S3 Storage</td><td style="border-bottom: 1px solid #000000; border-right: 1px solid #ccc; color: black; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;">$22,203.86</td><td style="border-bottom: 1px solid #000000; border-right: 1px solid transparent; color: black; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;">9.63%</td><td colspan="2" rowspan="1" style="border-bottom: 1px solid #000000; border-right: 1px solid #ccc; color: black; direction: ltr; padding: 0px 3px 0px 3px; text-align: left; vertical-align: bottom;">(7 years of data, 90% compression)</td></tr>
<tr style="height: 16px;"><td style="border-bottom: 1px solid #ccc; border-left: 1px solid #ccc; border-right: 1px solid #ccc; color: black; direction: ltr; padding: 0px 3px 0px 3px; text-align: left; vertical-align: bottom;"><b>Total</b></td><td style="border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; color: black; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;"><b>$230,510.54</b></td><td style="border-bottom: 1px solid #ccc; border-right: 1px solid transparent; color: black; padding: 0px 3px 0px 3px; text-align: right; vertical-align: bottom;"><b>100.00%</b></td><td colspan="2" rowspan="1" style="border-bottom: 1px solid #ccc; border-right: 1px solid #ccc; padding: 0px 3px 0px 3px; vertical-align: bottom;"></td></tr>
</tbody></table>
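As a sanity check, the breakdown above can be reproduced with a few lines of arithmetic. The dollar figures below are the ones from the table; the point is simply that storage comes in under 10% of the total.

```python
# Yearly operating costs for the hypothetical 3-node cluster above.
costs = {
    "3 i2.4xlarge instances": 46306.68,
    "3 support licenses": 15000.00,
    "Raw dbadmin labor": 12000.00,
    "Software dev/QA": 120000.00,
    "Misc. overhead costs": 15000.00,
    "S3 Storage": 22203.86,
}
total = sum(costs.values())
for name, cost in costs.items():
    print(f"{name:25s} ${cost:12,.2f} {cost / total:7.2%}")
print(f"{'Total':25s} ${total:12,.2f}")
```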
<br />
Long-term storage costs for base transaction data can be far lower if any of the following hold:<br />
<ul>
<li>You generate fewer transactions per second or they are smaller. Many business apps produce far fewer transactions than my example. </li>
<li>You don't keep data for the full 7 years. Some of the analytic users I work with just keep a couple of years. </li>
<li>You are already paying archiving costs for backups, in which case the additional storage cost becomes a wash if you can stop using a separate backup system.</li>
<li>You add more external costs to the picture--running a real business that generates this level of transactions often takes far more people than are shown in my projection. </li>
</ul>
In these cases long-term storage costs could be in the 1-2% range as a percentage of IT operating costs. Over time storage costs will decrease--<a href="http://blog.dshr.org/2012/10/storage-will-be-lot-less-free-than-it.html">though the rate of decline is hard to predict</a>--so each year the number of systems able to afford preservation of complete transaction histories will correspondingly increase. This is particularly true for business transactions, which tend to be human generated and subject to upper growth limits once businesses are fully automated. If you push data into Glacier, economically feasible retention periods can run to decades. This is far longer than most businesses (or more particularly their lawyers) even want to keep information around. <br />
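To see why falling prices stretch feasible retention periods, consider a toy projection. Every number in it (data growth, starting price, rate of decline) is invented for illustration and is not an AWS quote.

```python
# Toy projection: cost of retaining an ever-growing archive while the
# per-TB price declines. All numbers are invented for illustration.
yearly_new_tb = 10     # new data added each year
price_per_tb = 30.0    # year-0 archival price, $/TB/year
decline = 0.15         # assumed yearly price decline

total_tb, cost = 0, []
for year in range(10):
    total_tb += yearly_new_tb
    cost.append(total_tb * price_per_tb * (1 - decline) ** year)

# Retained data grows 10x over the decade, but yearly storage spend
# grows far more slowly because prices keep falling.
print([round(c, 2) for c in cost])
```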
<br />
There are still reasons for wanting an update-in-place model for OLTP systems, for example to hold as much of your working set as possible in RAM or on fast SSDs so that response times stay low. But storage cost alone is no longer a major factor for a wide range of applications. This development is already affecting data management technology profoundly. Doug Cutting <a href="http://www.youtube.com/watch?feature=player_embedded&v=_WwuZI6AhN8">has pointed out</a> on numerous occasions that the downward cost trajectory of commodity storage was a key driver in the development of <a href="http://hadoop.apache.org/">Hadoop</a>. <br />
<br />
<b>Users Want Immutable Data</b><br />
<br />
Many organizations already keep long transaction histories to feed analytics by loading them into traditional data warehouses based on Teradata, Vertica, and the like. As soon as a practical method appeared to keep such data more economically, businesses began to adopt it quickly. That "method" is Hadoop. <br />
<br />
Hadoop has a fundamentally different approach to data management from relational and even many NoSQL systems. For one thing, immutable data are fundamental. The default processing model is that you write data but rarely change it once written. To illustrate, the <a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual">HiveQL SQL dialect</a> does not even have UPDATE or DELETE statements. Instead, you overwrite entire tables or parts of them to make changes. This works because Hadoop organizes storage on cheap commodity hardware (HDFS) and provides a workable way to access data programmatically (MapReduce). <br />
<br />
Hadoop changes the data management cost model in other ways besides utilizing commodity hardware efficiently. With Hadoop you don't necessarily define <i>any</i> data structures up front. Instead, you store transactions in native form and write programs to interpret them later on. If you need structure for efficient queries, you add it through MapReduce and perhaps store the result as a materialized view to make other queries more efficient. Hadoop eliminates a lot of the up-front effort (and risk) required to get transactions into a data warehouse. Instead, it defers those costs until you actually need to run specific analytics. Moreover, by storing native transaction formats, you can answer new questions years later. That is a very powerful benefit. <br />
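The schema-on-read idea can be sketched in miniature: raw records go in as-is, and structure is imposed only at query time. This is my own toy illustration, not Hive or Tungsten code, and the record fields are invented.

```python
import json

# Raw transactions stored as-is (schema-on-read): no table defined up front.
raw_log = [
    '{"type": "order",  "id": 1, "amount": 250.0, "region": "EU"}',
    '{"type": "refund", "id": 2, "amount": -40.0, "region": "EU"}',
    '{"type": "order",  "id": 3, "amount": 125.0, "region": "US"}',
]

def revenue_by_region(lines):
    """Interpret the raw records at query time, long after they were written."""
    totals = {}
    for line in lines:
        rec = json.loads(line)  # structure is imposed only now
        totals[rec["region"]] = totals.get(rec["region"], 0.0) + rec["amount"]
    return totals

print(revenue_by_region(raw_log))  # {'EU': 210.0, 'US': 125.0}
```

A question nobody thought to ask when the data were written (say, refunds by region) can be answered later by writing a different interpretation function over the same raw log.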
<br />
I have been working a lot with Hadoop over the last few months. It's a bear to use because it consists of a set of loosely integrated and rapidly evolving projects with weak documentation and lots of bugs. Even with these difficulties, the rising level of Hadoop adoption for analytics shows the underlying model has legs and that users want it. As Floyd Strimling pointed out a while ago on Twitter <a href="https://twitter.com/PlatenReport/status/408636716056969217">this genie is not going back in the bottle</a>. HDFS is becoming the default storage mechanism for vast quantities of data. <br />
<br />
<b>Immutable Data Management Looks Like a Good Bet</b><br />
<br />
One of the basic problems in discussing immutable data management is that there are different kinds of immutable data that persist at different timescales. Baron has a point that Couchbase, Datomic, NuoDB, or whatever new DBMS implementation you choose are in some ways recapitulating solutions that existing RDBMS implementations reached long ago. But I also think that's not necessarily the right comparison when talking about immutable data, especially when you start to think about long retentions.<br />
<br />
The fact is that Oracle, MySQL, PostgreSQL, and the like do not utilize distributed commodity storage effectively and they certainly do not enable storage of the long tail transaction histories that many businesses clearly want for analytics. The best way to do that is to replicate transactions into HDFS and work on them there. That is hard even for MySQL, which has flexible and economical replication options. (We are working on making it easier to do at Continuent but that's another article. :) <br />
<br />
In my opinion a more useful criticism of the arriviste competitors of traditional OLTP systems is that they don't go <i><u>far enough</u></i> with immutable data and risk being outflanked by real-time transaction handling built on top of HDFS. Hadoop real-time work on projects like <a href="https://spark.incubator.apache.org/">Apache Spark</a> is for the time being focused on analytics, but OLTP support cannot be far behind. Moreover, there is a window to build competitors to HDFS that gets smaller as Hadoop becomes more entrenched. This seems more interesting than building stores that offer only incremental improvements over existing RDBMS implementations.<br />
<br />
Immutable data now permeate IT due to decreasing storage costs coupled with requirements for analytic processing. It's like the famous quote from <a href="http://en.wikiquote.org/wiki/William_Gibson">William Gibson</a>:<br />
<blockquote class="tr_bq">
The future is already here--it's just not very evenly distributed.</blockquote>
If you look at the big picture the arguments for database management based on immutable data seem pretty strong. It is hard to believe it won't be a persistent trend in DBMS design. Over the long term <i><u>mutable</u></i> data look increasingly like a special case rather than the norm. Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com3tag:blogger.com,1999:blog-768233104244702633.post-54913699533847404982014-02-07T17:07:00.000-08:002014-02-07T17:55:14.199-08:00Fun with MySQL and Hadoop at SCaLE 12XIt's my pleasure to be presenting at <a href="https://www.socallinuxexpo.org/scale12x">SCaLE 12X</a> on the subject of <a href="https://www.socallinuxexpo.org/scale12x/presentations/real-time-data-loading-mysql-hadoop">real-time data loading from MySQL to Hadoop</a>. This is the first public talk on work at Continuent that enables <a href="https://code.google.com/p/tungsten-replicator/">Tungsten Replicator</a> to move transactions from MySQL to HDFS (Hadoop Distributed File System). I will explain how replication to Hadoop works, how to set it up, and offer a few words on constructing views of MySQL data using tools like <a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual">Hive</a>. <br />
<br />
As usual, everything we are doing on Hadoop replication is open source. Builds and documentation will be publicly available by the 21st of February, which is when the talk happens. Hadoop support is already in testing with Continuent customers, and we have confidence that we can handle basic loading cases already. That said, Hadoop is a complex beast with lots of use cases, and we need feedback from the community on how to make Tungsten loading support better. My colleagues and I plan to do a lot of talks about Hadoop to help community users get up to speed.<br />
<br />
Here is a tiny taste of what MySQL to Hadoop loading looks like. Most MySQL users are familiar with <a href="https://launchpad.net/ubuntu/+source/sysbench">sysbench</a>. Have you ever wondered what sysbench tables would look like in Hadoop? Let's use the following sysbench command to apply transactions to table db01.sbtest:<br />
<pre><code><b>sysbench --test=oltp --db-driver=mysql --mysql-host=logos1 --mysql-db=db01 \
--mysql-user=tungsten --mysql-password=secret \
--oltp-read-only=off --oltp-table-size=10000 \
--oltp-index-updates=4 --oltp-non-index-updates=2 --max-requests=200000 \
--max-time=900 --num-threads=5 run</b></code></pre>
This results in rows that look like the following in MySQL:<br />
<pre><code><b>mysql> select * from sbtest where id = 2841\G
*************************** 1. row ***************************
id: 2841
k: 2
c: 958856489-674262868-320369638-679749255-923517023-47082008-646125665-898439458-1027227482-602181769
pad: qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt</b></code></pre>
After replication into Hadoop with Tungsten, we can crunch the log records using a couple of HiveQL queries to generate a point-in-time snapshot of the sbtest table on HDFS. By a point-in-time snapshot, I mean a table that contains not only inserted data but also reflects the results of subsequent update and delete operations on each row up to a particular point in time. We can now run the same query to see the data:
<br />
<pre><code><b>hive> select * from sbtest where id = 2841;
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
Job 0: Map: 1 Cumulative CPU: 0.74 sec HDFS Read: 901196 HDFS Write: 158 SUCCESS
Total MapReduce CPU Time Spent: 740 msec
OK
2841 2 958856489-674262868-320369638-679749255-923517023-47082008-646125665-898439458-1027227482-602181769 qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt</b></code></pre>
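The snapshot step can be sketched in miniature: given an append-only log of insert/update/delete records, materializing the table means keeping the last surviving image of each row up to a chosen point. This sketch is mine, with invented data, and stands in for what Tungsten actually does with HiveQL.

```python
# Each log record: (sequence_no, operation, row_id, row_image_or_None).
log = [
    (1, "INSERT", 2841, {"k": 1, "c": "original"}),
    (2, "UPDATE", 2841, {"k": 2, "c": "updated"}),
    (3, "INSERT", 7,    {"k": 5, "c": "doomed"}),
    (4, "DELETE", 7,    None),
]

def snapshot(log, up_to):
    """Materialize row state as of sequence number `up_to`."""
    rows = {}
    for seq, op, row_id, image in sorted(log):
        if seq > up_to:
            break
        if op == "DELETE":
            rows.pop(row_id, None)
        else:  # INSERT or UPDATE: keep the latest row image
            rows[row_id] = image
    return rows

# Row 7 was inserted and later deleted, so only row 2841 survives,
# and it carries the image from its most recent update.
print(snapshot(log, up_to=4))
```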
Tungsten does a lot more than just move transaction data, of course. It also provides tools to generate Hive schema, performs transformations on columns to make them match the limited HiveQL datatypes, and arranges data in a way that allows you generate materialized views for analytic usage (like the preceding example) with minimal difficulty.<br />
<br />
If you want to learn more about how Tungsten does all of this magic, please attend the talk. I hope to see you in Los Angeles.<br />
<br />
p.s., If you cannot attend SCaLE 12X, we will have a Continuent webinar on the same subject the following week. (Sign up <a href="http://www.continuent.com/news/live-webinars/1389-real-time-data-loading-from-mysql-to-hadoop">here</a>.)Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com0tag:blogger.com,1999:blog-768233104244702633.post-55025712148985897832014-01-10T09:17:00.001-08:002014-01-10T09:17:02.024-08:00Why I Love Open SourceAnders Karlsson wrote about <a href="http://karlssonondatabases.blogspot.com/2014/01/some-myths-on-open-source-way-i-see-it.html">Some myths on Open Source, the way I see it</a> a few days ago. Anders' article is mostly focused on exploding the idea that open source magically creates high quality code. It is sad to say you do not have to look very far to see how true this is. <br />
<br />
While I largely agree with Anders' points, there is far more that could be said on this subject, especially on the benefits of open source. I love working on open source software. Here are three reasons that are especially important to me.<br />
<br />
<b>1.) Open source is a great way to disseminate technology to users. </b> In the best cases, it is this easy to get open source products up and running:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b>$ sudo apt-get install software-i-want-to-use</b></span><br />
<br />
A lot of software companies (<a href="http://www.continuent.com/">mine</a> included) open source their software because it gets product into the hands of people who might pay money for it later. The strategy worked brilliantly for MySQL AB as Anders pointed out. MongoDB is repeating the tactic with what looks like equal success. There has been a lot of pointless argument over the years about whether MySQL or MongoDB are "real databases." Being easy to get is just as critical to adoption as features like transactions and scalable performance.<br />
<br />
Open source is therefore even better for users, who can quickly decide if something works for them and provide feedback through communities about problems as well as suggested improvements. To the extent open source software has high quality, it originates in the tight feedback loop between software producers and their user communities. That in turn leads to faster innovation with fewer deviations from real user needs. In olden days we called this getting the requirements right. Open source projects often do it extraordinarily well.<br />
<br />
<b>2.) Open source allows like-minded communities of developers to create products that would otherwise never happen.</b> Linux became a dominant operating system in large part through the staggering scale of contributions enabled by exceptionally well-managed open source development. Linus Torvalds <a href="http://www.youtube.com/watch?v=84Sx0E13gAo&list=PLbzoR-pLrL6oVRP4F6Nz6K2DSkYzNramt&index=11">recently pointed out</a> that Linux kernel releases have patches from a thousand contributors or more. Thanks to the wide range of contributions, Linux operates on everything from tiny ARM processors to servers with over 200 cores. The development effort underlying the Linux ecosystem is huge when you include the kernel and all the packages that install over it. It dwarfs any comparable operating system effort I can think of. <br />
<br />
At the other end of the spectrum there are small but incredibly useful projects like <a href="http://curator.apache.org/">Apache Curator</a>. The Curator project currently has 8 project members, mostly from different companies, who collaborate to make <a href="http://zookeeper.apache.org/">Apache ZooKeeper</a> vastly easier to program. I doubt libraries like Curator would even exist without open source licenses and infrastructure like distributed source code management. Neither would ZooKeeper, for that matter. <br />
<br />
Not every line of open source code is excellent or even above average. (I'm looking at you, <a href="http://hadoop.apache.org/">Hadoop</a>.) That said, open source projects are not so much about code but communities of developers who understand and are interested in solving a specific problem. Besides direct feedback from real users, this is the other prerequisite for creating truly great products. Clean code is helpful but not necessary.<br />
<br />
<b>3.) Open source means your creations can never be taken away from you.</b> In many creative endeavors work belongs to the people who employ you. It effectively disappears when you change jobs. Putting code on GitHub or code.google.com breaks that bond. Knowing that anything you create will always be accessible removes any hesitation about revealing your best ideas. I believe this is one of the drivers behind the flowering of creativity that infuses so many open source projects.<br />
<br />
At the same time working on open source software is not all peaches and cream. Building successful businesses on open source is hard, which limits the opportunities to work on it for a living. <br />
<br />
For instance, if most of the value of your product is in the software itself there is not much motivation for users to pay you. I think that's one reason mobile apps are by-and-large for pay or at least not open source. You need to find a business model that brings in enough money over time to fund the sort of concentrated engineering necessary to build robust software. Successful open source businesses often depend on finding the right markets or achieving network effects, and not all software can fit the pattern.<br />
<br />
The good news is that once you get the economics right it really wrong-foots your closed source competitors. <a href="http://www.redhat.com/">RedHat</a> has built a great business packaging and supporting open source for enterprises. They see open source as a competitive advantage that extends their market reach and speeds up innovation. An increasing number of companies producing DBMS software take the same view as they try to disrupt data management. Outside of enterprise software <a href="http://www.valvesoftware.com/">Valve Software</a> is <a href="http://readwrite.com/2014/01/08/steam-machines-valve-ps4-xbox-one#awesm=~oswBFoc7a4e18v">attacking proprietary gaming platforms</a> through open source.<br />
<br />
It's great to see the growing number of businesses based on open source development. When the model works it is incredibly satisfying. I guess this is a fourth reason why I love working on open source software.Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com2tag:blogger.com,1999:blog-768233104244702633.post-76663238115315094742013-03-28T13:34:00.000-07:002013-03-28T13:46:21.028-07:00See You at Percona Live 2013!<a href="http://www.percona.com/live/mysql-conference-2013">Percona Live 2013</a> is coming up fast. This is hands-down the best MySQL conference of the year, attended by a lot of people I really respect. Check the <a href="http://www.percona.com/live/mysql-conference-2013/program/speakers">speaker list</a> if you need some of their names. I will also be doing two talks myself.
<br />
<ul>
<li>9am Wednesday 24 April - <a href="http://www.percona.com/live/mysql-conference-2013/sessions/keynote-how-mysql-can-thrive-world-massive-data-hype">Keynote: How MySQL Can Thrive in the World of Massive Data Hype</a>. NoSQL solutions are oversold, but this is no reason for complacency in the MySQL community. There are new challenges in data management, and we need to solve them or become irrelevant. I will show some of the advances Continuent has on tap for MySQL-based applications and also point back to problems our experience shows must be solved within MySQL itself. </li>
<li>1pm Wednesday 24 April - <a href="http://www.percona.com/live/mysql-conference-2013/sessions/state-art-mysql-multi-master-replication">Session: State of the Art for MySQL Multi-Master Replication</a>. This talk will explain the fundamentals of multi-master operation and then trace the trade-offs of Tungsten, Galera, and other solutions. Thanks to excellent work on several products there is a lot of excitement about multi-master in 2013. My goal is to help listeners understand what applications are possible now as well as what we have the potential to achieve in the future. </li>
</ul>
<div>
I hope you will attend these talks. I am looking forward to meeting old friends at the conference and making new ones. </div>
<div>
<br /></div>
<div>
Incidentally, Percona Live sent me an email yesterday saying that you can get a 15% discount on the registration price using the code <b>KeySQL</b> when you sign up. At Continuent we are also offering free passes to customers who give us the best quotes about our software. However you get there, I really recommend this conference. </div>
Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com0tag:blogger.com,1999:blog-768233104244702633.post-33397767328665308342013-02-19T09:14:00.000-08:002013-02-19T09:14:47.285-08:00Data Fabric Design Patterns: Fabric ConnectorThis article is the third in a series on <a href="http://scale-out-blog.blogspot.com/2013/02/introducing-data-fabric-design-for.html">data fabric design</a> and introduces the fabric connector service design pattern. The <a href="http://scale-out-blog.blogspot.com/2013/02/data-fabric-design-patterns.html">previous article in this series</a> introduced the transactional data service design pattern, which defines individual data stores and is the building block for data fabrics based on SQL databases. The fabric connector builds on transactional data services and is another basic building block of fabric architecture. <br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>Description and Responsibilities</b></div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br />
Fabric connectors make a collection of DBMS servers look like a single server. The fabric connector presents what appears to be a data service API to applications. It routes each request to an appropriate physical server for whatever task the application is performing, hiding the fact that a data fabric can consist of dozens or even hundreds of servers. Applications cannot tell the difference between talking to the fabric connector and talking to a real DBMS server. We call this property <i>transparency</i>.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br />
Here are the responsibilities of a fabric connector. I will use the term <i>proxying</i> to refer to the first of these, and <i>routing responsibilities</i> to refer to the remaining three. </div>
</div>
<ol>
<li>Expose a data service interface to applications.</li>
<li>Route each application query to an appropriate DBMS server.</li>
<li>Balance load by distributing queries across multiple replicas, if available.</li>
<li>Switch to another server following a failure or if the DBMS becomes unavailable due to maintenance.</li>
</ol>
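The routing responsibilities above can be sketched in a few lines of code. The sketch below is purely illustrative: the class names, fields, and host names are invented for this example and are not taken from any real connector product.

```python
# Illustrative sketch only: Directory, FabricConnector, and all host names
# are hypothetical, not part of any real fabric connector implementation.

class Directory:
    """Minimal in-memory stand-in for the connector's directory information."""
    def __init__(self, servers):
        self.servers = servers  # dicts: host, role, state, latency (seconds)

    def online(self, role):
        return [s for s in self.servers
                if s['state'] == 'online' and s['role'] == role]

class FabricConnector:
    """Routes each request to an appropriate server (responsibility 2),
    balances reads across replicas (3), and skips offline servers (4)."""
    def __init__(self, directory):
        self.directory = directory

    def route(self, read_only=False):
        if read_only:
            # Prefer the most up-to-date online slave for reads.
            slaves = self.directory.online('slave')
            if slaves:
                return min(slaves, key=lambda s: s['latency'])['host']
        # Writes, and reads with no usable slave, go to the master.
        masters = self.directory.online('master')
        if not masters:
            raise RuntimeError('no online master available')
        return masters[0]['host']

connector = FabricConnector(Directory([
    {'host': 'prodg21', 'role': 'master', 'state': 'online',  'latency': 0.0},
    {'host': 'prodg22', 'role': 'slave',  'state': 'online',  'latency': 0.5},
    {'host': 'prodg23', 'role': 'slave',  'state': 'offline', 'latency': 9.0},
]))
```

With this topology, writes route to prodg21 and reads to prodg22; the offline prodg23 is never chosen. Responsibility 1 (exposing a data service interface) is what the rest of the article is about and is deliberately omitted here.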
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
The following diagram shows the logical components of a fabric connector. The fabric connector sits between applications, transactional data services, and a fabric directory service. These are greyed out, as they are not part of the pattern.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQCAGlWqvar0iJ-GGctLFcfDhurG2rgZBVY7uMWs0dD3A8vy7xHI8rCpszz04wtXnzyEPpes0KExztlhwWeEW3An5B00TtKY4pgABynpziHEw2RxuGJj4dMF7yu2sFnXEwX6I2ocpg6Mg/s1600/Fabric-Connector.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQCAGlWqvar0iJ-GGctLFcfDhurG2rgZBVY7uMWs0dD3A8vy7xHI8rCpszz04wtXnzyEPpes0KExztlhwWeEW3An5B00TtKY4pgABynpziHEw2RxuGJj4dMF7yu2sFnXEwX6I2ocpg6Mg/s400/Fabric-Connector.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fabric Connector Design Pattern</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
</div>
</div>
</div>
</div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Fabric connectors contain two logical components. The <i>proxy</i> is responsible for routing queries and responses between applications and underlying data services. This can be a library layer, a separate server process, or a TCP/IP load balancer--anything that provides a transparent indirection layer. The <i>directory information </i>contains rules to route SQL queries correctly to the actual location of data. There is a <i>notification protocol</i> that permits connectors to receive updates about the fabric topology and confirm that they have responded to them.<br />
<br /></div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>Motivation</b></div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br />
Connecting to data is a problem in large systems. Sharded data sets spread data across multiple services. Data services have different roles, such as master or slave. Services fail or go offline for maintenance. Services change roles, such as a master switching to a slave. Shards move between services to balance load and use storage more efficiently. Within short periods of time there may be significant variations in load across data services. Adding routing logic directly to applications in these cases adds complexity and can lead to a tangled mess for administrators.<br />
<br />
The fabric connector design pattern encapsulates logic to route connections from the application to DBMS servers. Hiding connection logic helps keep applications simple. It allows independent testing and tuning of the connection rules. That way you can have some assurance the logic actually works. You can also modify fabric behavior without modifying applications, for example to redistribute load more evenly across replicas. <br />
<br />
<b>Related Design Patterns</b><br />
<br />
The fabric connector design pattern manages single application connections to data services, for example a <a href="http://scale-out-blog.blogspot.com/2013/02/data-fabric-design-patterns.html">transactional data service</a>. Transparency is the leitmotif of this pattern. It provides needed encapsulation for other data fabric design patterns and is particularly critical for sharded as well as fault tolerant data services. These will be covered in future articles on data fabric design.<br />
<br />
There are also other design patterns for data access. Here are two that should not be confused with fabric connectors.<br />
<ul>
<li><i>Federated query</i>. Federated query splits a SQL query into sub-queries that it routes to multiple underlying data services, then returns the results. Sharding products like <a href="http://www.dbshards.com/">DbShards</a> and <a href="http://code.google.com/p/shard-query/">shard-query</a> implement this pattern. It requires complex parsing, query optimization, and aggregation logic to do correctly and has varying levels of transparency. </li>
<li><i>MapReduce</i>. <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a> is a procedure for breaking queries into pieces that can run in parallel across large numbers of hosts by splitting the query into map operations to fetch data followed by reduce operations to aggregate results. It can work on any distributed data set, not just SQL. MapReduce implementations often eschew SQL features like joins and also can have a very different programming model from SQL. Their use is often non-transparent to SQL applications.</li>
</ul>
Finally, there is a very important pattern for the <i>fabric directory service</i>. This is a <a href="http://en.wikipedia.org/wiki/Directory_service">directory service</a> that maintains information about the desired topology of the fabric and its actual state. It can be implemented in forms ranging from a shared configuration file to network services in a distributed configuration manager like <a href="http://zookeeper.apache.org/">ZooKeeper</a>.<br />
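The shared-configuration-file form of the fabric directory service is easy to picture concretely. The sketch below is a hypothetical illustration: the JSON layout, field names, and host names are all invented, and a real directory service (e.g. one built on ZooKeeper) would add notification and consensus on top of this.

```python
# Sketch of a fabric directory service as a shared configuration file, the
# simplest form mentioned above. Layout and field names are invented.
import json
import os
import tempfile

topology = {
    "version": 7,                          # bumped on each reconfiguration
    "desired": {"master": "prodg21"},      # what the fabric should look like
    "actual": {                            # what it looks like right now
        "prodg21": {"role": "master", "state": "online"},
        "prodg22": {"role": "slave",  "state": "online"},
    },
}

# The fabric writes the file; each connector re-reads it (or is notified)
# and routes traffic according to the 'actual' section.
path = os.path.join(tempfile.mkdtemp(), "fabric.json")
with open(path, "w") as f:
    json.dump(topology, f)

with open(path) as f:
    view = json.load(f)
master = next(h for h, s in view["actual"].items() if s["role"] == "master")
```

The split between desired and actual state is the essential idea: reconfiguration is the process of driving the actual topology toward the desired one while connectors follow along.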
<br />
I hope to add more complete descriptions for the latter three design patterns at some point in the future. For the current article, we will stick to simple connectivity. <br />
<br /></div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>Detailed Behavior</b></div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br />
Fabric connectors are conceptually simple: route requests from the application to a server, then transfer results back. Actual behavior can be quite complex. To give some perspective on the problem, here is a short Perl program for a SaaS application that logs order detail information in a table named sale, then reads the same data back. We will use the sample program to illustrate the responsibilities of this design pattern in detail.<br />
<pre>use DBI;

# Connect to server.
$dbh = DBI->connect("DBI:mysql:test;host=prodg23", "app", "s3cr3t5")
    || die "Could not connect to database: $DBI::errstr";

# Insert order using a transaction.
$dbh->{'AutoCommit'} = 0;
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
          VALUES(2331, 9959, 353009, 24.99)");
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
          VALUES(2331, 9959, 268122, 59.05)");
$dbh->commit();

# Select the order back with an auto-commit read.
$dbh->{'AutoCommit'} = 1;
$sth = $dbh->prepare("SELECT * FROM sale WHERE order_id=2331");
$sth->execute();
while ( $href = $sth->fetchrow_hashref ) {
    print "id      : $$href{id} \n";
    print "order_id: $$href{order_id} \n";
    print "cust_id : $$href{cust_id} \n";
    print "sku     : $$href{sku} \n";
    print "amount  : $$href{amount} \n";
}

# Disconnect from server.
$dbh->disconnect();</pre>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
</div>
The first responsibility of the fabric connector design pattern is to provide a transparent interface for underlying data services. That means that our Perl program has to work as written--no extra changes. Here are just a few things a connector needs to do:<br />
<ol>
<li>Implement the DBMS connection protocol fully or pass it transparently to an underlying server. This includes handling authentication handshakes as well as setting base session context like client character sets. </li>
<li>Handle all standard features of query invocation and response, including submitting queries, returning automatically generated keys, and handling all supported datatypes in results. </li>
<li>Respect transaction boundaries so that the INSERT statements on the sales table are enclosed in a transaction in the DBMS and the SELECT statement is auto-commit (i.e., a single-statement transaction.) </li>
<li>Read back data written to the sales table. </li>
</ol>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
In addition to handling API protocols, fabric connectors need to avoid slowing down transaction processing as a result of proxying. Properly written connectors for the most part add minimal overhead, but there are at least two cases where this may not hold for some implementations (such as network proxies). The first is establishing connections, a relatively expensive operation that occurs constantly in languages like PHP that do not use connection pools. The second is short primary-key lookup queries on small datasets, which tend to be memory-resident in the server and hence very quick to serve.</div>
<div>
<br />
One common reaction is to see such overhead as a serious problem and avoid the whole fabric connector approach. Yet the "tax" applications pay for proxying is not the whole story on performance. Fabric connectors can boost throughput by an order of magnitude by distributing load intelligently across replicas. To understand the real application overhead of a connector you therefore need to measure with a properly sized data set and take into account load-balancing effects. Test results on small data sets that reside fully in memory with no load balancing tend to be very misleading. </div>
<div>
<br /></div>
<div>
The remaining fabric connector design pattern responsibilities are closely related: route requests accurately to the correct service, load-balance queries across replicas within a service, and route around replicas that are down due to maintenance or failure. We call these routing responsibilities. They require information about the fabric topology, which is maintained in the connector's directory information. Here is a diagram of typical directory organization. </div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1kU6uT1ask-eX3_VS_HVHbfxOEM2fbVjiHBEWlT7j1MP9sI1b3ZLdKs5eNZQbJlPihg_zN2QI86zy_yobhKbUPJbMYxnY1k_btj67oTbL3SpO6nFCMtt8kQHLreQEKWeztwfQvKqfSME/s1600/directory-service.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="351" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1kU6uT1ask-eX3_VS_HVHbfxOEM2fbVjiHBEWlT7j1MP9sI1b3ZLdKs5eNZQbJlPihg_zN2QI86zy_yobhKbUPJbMYxnY1k_btj67oTbL3SpO6nFCMtt8kQHLreQEKWeztwfQvKqfSME/s400/directory-service.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fabric Directory Service Organization</td></tr>
</tbody></table>
<div>
Let's start with the responsibility to route requests to data services. A simple fabric connector implementation allows connections using a logical server name, such as group2, which the connector would translate to an actual DBMS server and port, such as prodg23:3306. A better fabric connector would allow applications to use a customer name like "walmart" that matches what the application is doing. The connector would look up the location of the customer's data and connect automatically to the right server and even the right DBMS schema. This is especially handy for SaaS applications, which often shard data by customer name or some other simple identifier. </div>
<div>
<br /></div>
<div>
We could then change our program as follows to connect to the local host and look for the "walmart" schema. Under the covers, the fabric connector will connect to the prodg23 server and use the actual schema for that customer's data. </div>
<div>
<br /></div>
<div>
<pre style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">use DBI;
# Connect to customer data.
$dbh = DBI->connect("DBI:mysql:walmart;host=localhost", "app", "s3cr3t5")
    || die "Could not connect to database: $DBI::errstr";
</pre>
</div>
<div>
This is a modest change that is very easy to explain and implement, and a small price to pay for avoiding the complex logic otherwise needed to locate the correct server and schema for each customer's data. </div>
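The lookup hiding behind that connect string can be sketched very simply. The mapping below is hypothetical: the shard_map contents, schema names, and the locate function are invented to illustrate the translation from customer name to physical location.

```python
# Hypothetical sketch of the lookup behind the "walmart" connect string: the
# connector translates a customer name into the physical server and schema
# holding that customer's data. All shard_map contents are invented.

shard_map = {
    "walmart": {"host": "prodg23", "port": 3306, "schema": "cust_0042"},
    "acme":    {"host": "prodg21", "port": 3306, "schema": "cust_0007"},
}

def locate(customer):
    """Translate a logical customer name into (host, port, schema)."""
    try:
        loc = shard_map[customer]
    except KeyError:
        raise LookupError("no shard registered for customer %r" % customer)
    return (loc["host"], loc["port"], loc["schema"])
```

In a real fabric the map would live in the directory service rather than in application code, so that shards can move without touching the application.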
<div>
<br /></div>
<div>
The next responsibility is to distribute queries across replicas. This requires additional directory information, such as the DBMS server role (master vs. slave), current status (online or offline), and other relevant information like slave latency or log position. There are many ways to use this information effectively. Here are a few of the more interesting things we can do.</div>
<div>
<ol>
<li><b>Slave load balancing</b>. Allow applications to request a read-only connection, then route to the most up-to-date slave. This works well for applications such as <a href="http://drupal.org/node/310071">Drupal 7</a>, a web content management system. Drupal 7 is <i>slave-enabled</i>, which means that it can use separate connections for read-only queries that can run on a replica. Many applications tuned to work with MySQL have similar features. </li>
<li><b>Session load balancing</b>. Track the log position for each application session and dispatch reads to slaves when they are caught up with the last write of the session. This is a good technique for SaaS applications that have large numbers of users spread across many schemas. It is one of the most effective scaling techniques for master/slave topologies. </li>
<li><b>Partitioning</b>. Split requests by schema across a number of multi-master data services. SQL requests for schema 1 go to server 1, requests for schema 2 to server 2, etc. Besides distributing load across replicas this technique also helps avoid deadlocks, which can become common in multi-master topologies if applications simultaneously update a small set of tables across multiple replicas. </li>
</ol>
</div>
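Session load balancing in particular is worth making concrete. The sketch below is illustrative only: the Session class, route_read function, and log positions are invented, and a real implementation would obtain slave positions from the directory service rather than as a function argument.

```python
# Sketch of the session load balancing technique described above: each session
# remembers the log position of its last write, and a read may go to a slave
# only if that slave has applied at least that position. Names are invented.

class Session:
    def __init__(self):
        self.last_write_pos = 0   # log position of this session's last commit

def route_read(session, slave_positions):
    """slave_positions: dict of slave host -> applied log position."""
    caught_up = [host for host, pos in sorted(slave_positions.items())
                 if pos >= session.last_write_pos]
    if caught_up:
        return caught_up[0]       # any caught-up slave may serve the read
    return "master"               # otherwise fall back to the master

session = Session()
session.last_write_pos = 120      # the session just committed at position 120
```

A session that has not written recently can almost always read from a slave, which is why this algorithm scales so well when users are spread across many schemas.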
<div>
Recalling our sample program, we could imagine a connector using session load balancing to write the sales table transaction to the master DBMS server, then sending the SELECT to a slave if it happened to be caught up for customer "walmart." No program changes are required for this behavior. </div>
<div>
<br /></div>
<div>
The final responsibility is to route traffic around offline replicas. This gets a bit complicated. We need not only state information but an actual <i>state model </i>for DBMS servers. There also needs to be a procedure to tell fabric connectors about a pending change as well as wait for them to reconfigure themselves. Returning to our sample program, it should be possible to execute the following transaction: </div>
<div>
<br /></div>
<div>
<pre style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">$dbh->{'AutoCommit'} = 0;
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
          VALUES(2331, 9959, 353009, 24.99)");
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
          VALUES(2331, 9959, 268122, 59.05)");
$dbh->commit();</pre>
</div>
<div>
<br /></div>
<div>
then failover to a new master and execute:</div>
<div>
<br /></div>
<div>
<pre style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">$dbh->{'AutoCommit'} = 1;
$sth = $dbh->prepare("SELECT * FROM sale WHERE order_id=2331");
$sth->execute();
</pre>
<pre style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">...</pre>
</div>
<div>
<br />
To do this properly we need to ensure that the connector responds to updates in a timely fashion. We would not want to change fabric topology or take a DBMS server offline while connectors were still using it. The notification protocol that updates connector directory information has to ensure reconfiguration does not proceed until connectors are ready.<br />
<br />
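The "wait until connectors are ready" requirement amounts to a simple acknowledgment protocol, sketched below. The TopologyCoordinator class and its method names are invented for illustration; a real fabric would implement this over a network service with timeouts for unresponsive connectors.

```python
# Sketch of the notification protocol described above: the fabric announces a
# new topology version and must not reconfigure until every connector has
# acknowledged it. Class and method names are invented for illustration.

class TopologyCoordinator:
    def __init__(self, connectors):
        self.version = 1
        self.acks = {c: 1 for c in connectors}

    def announce(self, new_version):
        """Publish a pending topology change to all connectors."""
        self.version = new_version

    def acknowledge(self, connector):
        """A connector reports it has reconfigured for the current version."""
        self.acks[connector] = self.version

    def safe_to_reconfigure(self):
        # Proceed only once every connector has acted on the new topology.
        return all(v == self.version for v in self.acks.values())

coord = TopologyCoordinator(["conn-a", "conn-b"])
coord.announce(2)
coord.acknowledge("conn-a")
ready_early = coord.safe_to_reconfigure()   # conn-b has not acknowledged yet
coord.acknowledge("conn-b")
ready_late = coord.safe_to_reconfigure()
```

The version check is what prevents the fabric from taking a server offline while some connector still routes traffic to it.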
Does every fabric connector have to work exactly this way? Not at all. So far, we have only been talking about responsibilities. There are many ways to implement them. To start with, fabric connectors do not even need to handle SQL. This is interesting in two ways. <br />
<br />
First, you can skip using the Perl DBI completely and use a specialized interface to connect to the fabric. We will see an example of this shortly. Second, the underlying store does not even need to be a SQL database at all. You can use the fabric connector design pattern for other types of stores, such as key-value stores that use the <a href="https://github.com/memcached/memcached/blob/master/doc/protocol.txt">memcached protocol</a>. This series of articles focuses on SQL databases, but the fabric connector design pattern is very general. <br />
<br />
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="font-weight: normal; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b><b>Implementations</b></b></div>
</div>
</div>
</div>
</div>
<br />
Here are a couple of off-the-shelf implementations that illustrate quite different ways to implement the fabric connector design pattern. <br />
<br />
1. <b>Tungsten Connector</b>. <a href="https://docs.continuent.com/wiki/display/TEDOC/Using+the+Tungsten+Connector">Tungsten Connector</a> is a Java proxy developed by <a href="http://www.continuent.com/">Continuent</a> that sits between applications and clusters of MySQL or PostgreSQL servers. It implements the MySQL and PostgreSQL network protocols faithfully, so that it appears to applications like a DBMS server. <br />
<br />
Tungsten Connector gets directory information from <a href="http://www.continuent.com/solutions">Tungsten clusters</a>. Tungsten clusters use a simple distributed consensus algorithm to keep directory data consistent across nodes even when there are failures or network outages--connectors can receive topology updates from any node in the cluster through a protocol that also ensures each connector acts on it when the cluster reconfigures itself. In this sense, Tungsten clusters implement the fabric directory service pattern described earlier.<br />
<br />
The directory information allows the connector to switch connections transparently between servers in the event of planned or even some unplanned failovers. It can also load balance reads automatically using <a href="https://docs.continuent.com/wiki/display/TEDOC/Reading+From+Slave+Data+Sources">a variety of policies</a> including the slave load balancing and session load balancing techniques described above. <br />
<br />
The big advantage of the network proxy approach is the high level of transparency for all applications. Here is a sample session with the out-of-the-box mysql utility that is part of MySQL distributions. In this sample, we check the DBMS host name using the MySQL <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">show variables</span> command. Meanwhile, a planned cluster failover occurs, followed by an unplanned failover.<br />
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<pre><code>mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | prodg23 |
+---------------+---------+
1 row in set (0.00 sec)
</code></pre>
<pre><code><b>(Planned failover to prodg21 to permit upgrade on prodg23)
</b>mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | prodg21 |
+---------------+---------+
1 row in set (0.01 sec)
</code></pre>
<pre><code><b>(Unplanned failover to prodg22)
</b>mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | prodg22 |
+---------------+---------+
1 row in set (4.82 sec)
</code></pre>
As this example shows, the session continues uninterrupted as the location of the server switches. These changes occur transparently to applications. The downside is that there is some network overhead due to the extra network hop through the Tungsten Connector, though of course load balancing of reads can more than repay the extra latency cost. Also, this type of connector is hard to build because of the complexity of the MySQL network API as well as the logic to transfer connections seamlessly between servers.<br />
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
2. <b>Gizzard</b>. <a href="https://github.com/twitter/gizzard/blob/master/README.markdown">Gizzard</a> is an open source sharding framework developed by Twitter to manage links between Twitter users. The proxy part of the design pattern is implemented by middleware servers, which accept requests from clients using <a href="http://thrift.apache.org/">Thrift</a>, a language-independent set of tools for building distributed services. For more on a particular application built on Gizzard, look at descriptions of Twitter's <a href="http://engineering.twitter.com/2010/05/introducing-flockdb.html">FlockDB</a> service. Gizzard servers give applications a simple API for data services, which fulfills the proxy responsibility of the fabric connector design pattern. <br />
<br />
Gizzard servers get directory information using <a href="https://github.com/twitter/gizzmo">gizzmo</a>. Gizzmo is a simple command line tool that maintains persistent copies of the Gizzard cluster topology and takes care of propagating changes out to individual Gizzard servers. For more on how Gizzmo works, look <a href="https://github.com/twitter/gizzmo/wiki/How-to-use-Gizzmo">here</a>. Using this information, Gizzard servers can locate data, route around down servers, and handle distribution of queries to replicas, which are the final three responsibilities of the fabric connector design pattern. <br />
<br />
The Gizzard architecture lacks the generality of the Tungsten Connector, because it requires clients to use a specific interface rather than general-purpose SQL APIs. It also introduces an extra network hop. On the other hand, it works extremely well for its intended use case of tracking relationships between Twitter users. This is because Gizzard deals with a simplified problem and also scales out through many Gizzard servers. As with the Tungsten Connector, the network hop expense pays for itself through the ability to load-balance across multiple replicas. <br />
<br />
Gizzard is a nice example of how the fabric connector design pattern does not have to be specifically about SQL. Gizzard clients do not use SQL, so the underlying store could be anything. Gizzard is specifically designed to work with a range of DBMS types. <br />
<br />
<b>Fabric Connector Implementation Trade-Offs</b><br />
<br />
General-purpose fabric connectors like the one used for Tungsten are hard to implement for a variety of reasons. This approach is really only practical if you have a lot of resources at your disposal or are doing it as a business venture like Continuent. You can still roll your own implementations. The Gizzard architecture nicely illustrates some of the trade-offs necessary to do so.<br />
<br />
1. <b>General vs. particular data service interfaces</b>. Implementing a simple data service interface, for example using thrift, eliminates the complexity of DBMS interfaces like those of MySQL or PostgreSQL. Rather than a thrift server you can also use a library within applications themselves. This takes out the network hop.<br />
<br />
2. <b>Automatic vs. manual failover</b>. Automatic failover requires connectors to respond to fabric topology changes in real time, which is a hard problem with a lot of corner cases. (Look <a href="http://en.wikipedia.org/wiki/Consensus_(computer_science)">here</a> if you disagree.) You can simplify things considerably by minimizing automated administration and instead orchestrate changes through scripts. <br />
<br />
3. <b>Generic vs. application-specific semantics</b>. Focusing on a particular application allows you to add features that are important for particular use cases. Gizzard, for example, supports shard migration; to make that tractable to implement, Gizzard requires a simple update model in which transactions can be applied in any order.<br />
<br />
These and other simplifications make the fabric connector design pattern much easier to implement correctly. You can make the same sort of trade-offs for most applications. <br />
<br />
<b>Implementations to Avoid</b><br />
<b><br /></b>
Here are a couple of implementations for the fabric connector design pattern that you should avoid or at least approach warily. <br />
<br />
1. Virtual IP Addresses (VIPs). VIPs allow hosts to listen for traffic on multiple IP addresses. They are commonly used in many failover schemes, such as those built on programs like heartbeat. VIPs do not have the intelligence to fulfill fabric connector responsibilities like load-balancing queries across replicas. They are subject to nasty split-brains, a subject I covered in detail as part of <a href="http://scale-out-blog.blogspot.com/2011/01/virtual-ip-addresses-and-their.html">an earlier article on this blog</a>. Finally, VIPs are not available in Amazon and other popular cloud environments. VIPs do not seem like a good implementation choice for data fabrics. <br />
<br />
2. SQL proxies. There are a number of software packages that solve the problem of proxying SQL queries, such as <a href="http://pgfoundry.org/projects/pgbouncer">PgBouncer</a> or <a href="http://dev.mysql.com/doc/refman/5.6/en/mysql-proxy.html">MySQL Proxy</a>. Many of them do this quite well, which means that they fulfill the first responsibility of the fabric connector design pattern. The problem is that they do not have a directory service. This means they do not fulfill the next three responsibilities to route queries effectively, at least out of the box. <br />
<br />
Unlike VIPs, SQL proxies can be a good starting point for fabric implementations. You need to add the directory information and notification protocol to make them work. It is definitely quite doable for specific cases, especially if you make the sort of trade-offs that Gizzard illustrates.<br />
<br />
<b>Conclusion and Takeaways</b></div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
<div>
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b><br /></b>The fabric connector design pattern reduces the complexity of applications by encapsulating the logic required to connect to servers in a data fabric. There is a tremendous benefit to putting this logic in a separate layer that you can test and tune independently. Fabric connectors are more common than they appear at first because many applications implement the responsibilities within libraries or middleware servers that include embedded session management. Fabric connectors do not have to expose SQL interfaces or any other DBMS-specific interface, for that matter. <br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Fault-tolerant and sharded data service design patterns depend on fabric connectors to work properly and avoid polluting applications with complex logic to locate data. Products that implement these design patterns commonly include fabric connector implementations as well. You can evaluate them by finding out how well they fulfill the design pattern responsibilities. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
Off-the-shelf fabric connectors have the advantage that they are more general than something you can develop easily for yourself. If you decide to write your own fabric connector, you will need to consider trade-offs like reducing automation or simplifying APIs in order to make the problem easier to solve. Regardless of the approach, you should allow ample time: the responsibilities are complicated and must be implemented with care. Fabric connectors that only work 99% of the time are not much use in production environments. <br />
<br />
One final point about fabric connectors: automated failover can make them harder to implement and increases the risk that the connector will write to the wrong replica. The difficulty of managing connectivity is one of the reasons many data management experts are very cautious about automatic failover. This problem <a href="http://scale-out-blog.blogspot.com/2012/09/automated-database-failover-is-weird.html">is tractable in my opinion</a>, but it is definitely a practical consideration in system design.<br />
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
My next article on data fabrics will cover the fault-tolerant data service design pattern. This design pattern depends on the fabric connector design pattern to hide replicas. I hope you will continue reading to find out about it. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
</div>
</div>
</div>
Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com0tag:blogger.com,1999:blog-768233104244702633.post-18677567508859733242013-02-06T21:56:00.000-08:002013-02-19T09:53:05.606-08:00
<b>Data Fabric Design Patterns: Transactional Data Service</b><br />
<br />
This article is the second in a series on data fabric design and introduces the transactional data service design pattern. The <a href="http://scale-out-blog.blogspot.com/2013/02/introducing-data-fabric-design-for.html">previous article in this series</a> introduced data fabrics, which are collections of off-the-shelf DBMS servers that applications can connect to like a single server. They are implemented from data fabric design patterns, which are reusable arrangements of DBMS servers, replication, and connectivity. With this article we begin to look at individual design patterns in detail.<br />
<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>Description and Responsibilities</b></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
The transactional data service is a basic building block of data fabric architectures. A transactional data service is a DBMS server that processes transactions submitted by applications and stores data safely. Transactional data services have the following responsibilities:</div>
<ul>
<li>Store data transactionally and recover data faithfully to the last full transaction following failure. </li>
<li>Provide a network-accessible application interface for accessing data. </li>
<li>Provide a reliable and reasonably quick method for backup and restore. </li>
<li>Maintain an accessible, serialized log of transactions. This enables replication between services. </li>
</ul>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
The following diagram illustrates the moving parts of a transactional data service. In future diagrams we will just use the standard database symbol for the entire transactional data service, but for now we need to be able to see the contents. </div>
<div class="separator" style="clear: both; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: center;">
</div>
<div class="separator" style="clear: both; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnYfzyvwipWqyTzAq5PPwVPCX2eNT-ZJq2LPob0A7K49u8A5QLeipSnJMg-kF-AuMjom1wamBdN8MTe-KmqBOjpXULabmYraoT9_iNAg8sz5hRmYoorU0_u26aBumB6McDCMWtWzGKNfs/s1600/Transactional-Data-Service.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="158" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnYfzyvwipWqyTzAq5PPwVPCX2eNT-ZJq2LPob0A7K49u8A5QLeipSnJMg-kF-AuMjom1wamBdN8MTe-KmqBOjpXULabmYraoT9_iNAg8sz5hRmYoorU0_u26aBumB6McDCMWtWzGKNfs/s400/Transactional-Data-Service.jpg" width="400" /></a></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>Motivation</b></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Durable storage of transactions is the most fundamental responsibility of database systems. It is difficult to build reliable applications if stored data can disappear or become corrupted because transactions were not committed before a crash. Both problems can cause data loss. Moreover, they can break replication links very badly if the DBMS server comes up in an inconsistent state, for example with some updates committed but others randomly rolled back. This condition affects not only the one server but potentially many others throughout the fabric. <br />
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
The transactional data service therefore focuses on storing data safely and recovering to the last committed transaction after a restart. With this basic capability we can construct more complex services knowing that individual changes are unlikely to disappear or be recorded inconsistently. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>Detailed Behavior</b></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Let's look in detail at the properties required for a successful transactional data service. Somewhat surprisingly, an off-the-shelf SQL DBMS does not necessarily fit the pattern, though it comes close. It is important to understand the differences. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
The transactional store keeps data from getting lost and is the basis for recovery throughout the fabric. Transactional stores support commit and rollback with multi-statement transactions. In theory the transactional data service responsibility for data persistence matches MySQL/InnoDB and PostgreSQL behavior, both of which commit transactions safely in serial order. However, the reality is not quite that simple. <br />
<br />
Most DBMS allow applications to ignore transactions under certain conditions. This results in <i>wormholes</i>, which are violations in serial ordering of data. There are a number of table definition options in SQL that undo transactional consistency.<br />
<ul>
<li>(MySQL) <a href="http://dev.mysql.com/doc/refman/5.5/en/myisam-storage-engine.html">MyISAM table type</a>. MyISAM tables ignore transactions and commit immediately, even if the application later tries to roll back. The tables may also become corrupted by server crashes. </li>
<li>(MySQL) <a href="http://dev.mysql.com/doc/refman/5.5/en/memory-storage-engine.html">Memory table type</a>. These tables are maintained in memory only. They disappear on restart. </li>
<li>(PostgreSQL) <a href="http://www.postgresql.org/docs/9.2/static/sql-createtable.html">UNLOGGED tables</a>. Such tables are not logged and disappear on crash or unclean shutdown (thanks Frederico). </li>
</ul>
All of these allow data to disappear or become corrupted after a crash. However, there is a more subtle problem. If applications depend on these tables, transaction results may then depend on when the server last crashed or restarted, which in turn makes updates across replicas non-deterministic. Random updates create problems for data replication, which depends on replicas behaving identically when transactions are applied. It is important to avoid application dependencies on any feature that creates wormholes, or you may not be able to use other design patterns.<br />
<br />
So is data loss always bad? Surprisingly, no. In some cases transactional stores can lose data <u>provided</u> that they do so by dropping the last transactions in serial order. It's as if the data just reverted to an earlier point in time. To understand why this might be OK, imagine three servers linked into a chain by asynchronous replication. <br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOhPsbEc15lv9eK8o1wnmxWZXijhMPwGu6LMGsVe3kR4Wac9RTMb6n3b61geTI-iPnb8uKyVZDxEd2TqMRLh1td3CsLU3XOqlgx1PL5y88qalNWN72Ysozh-YUMZTZ_SLUBMSW1g39e7I/s1600/Controlled-Data-Loss.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="110" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOhPsbEc15lv9eK8o1wnmxWZXijhMPwGu6LMGsVe3kR4Wac9RTMb6n3b61geTI-iPnb8uKyVZDxEd2TqMRLh1td3CsLU3XOqlgx1PL5y88qalNWN72Ysozh-YUMZTZ_SLUBMSW1g39e7I/s400/Controlled-Data-Loss.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Data loss is sometimes not a big deal</td></tr>
</tbody></table>
It is bad to lose data on the first server, especially if those data are lost before replicating transactions to the downstream replicas. However, data loss on the last server is fine. Assuming that server stores the replication restart point transactionally, it will just re-apply the missing transactions and catch up. This is exactly what happens when you restore a backup on a slave in master/slave replication. <br />
<br />
Data loss on the second server also may not be a problem. It should restart replication and generate identical transactions for itself as well as for replication to the last server. In both cases we assume that replication handles the loss correctly and can replay missing transactions from logs. If so, you can not only tolerate such losses but even depend on recovering from them automatically.<br />
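The recovery behavior is easier to see in code. The following Python sketch uses SQLite as a stand-in for the transactional store (all table and variable names are illustrative): because the applied data and the replication position commit in the same transaction, replay after a crash or data loss is idempotent.

```python
import sqlite3

# A replica that records its replication position in the same transaction
# as the applied data, so a crash can never separate the two.
replica = sqlite3.connect(":memory:")
replica.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE restart_point (id INTEGER PRIMARY KEY CHECK (id = 1), seqno INTEGER);
    INSERT INTO restart_point VALUES (1, 0);
""")

# Upstream replication log: (seqno, statement) pairs in serial order.
log = [
    (1, "INSERT INTO accounts VALUES (1, 100)"),
    (2, "UPDATE accounts SET balance = 150 WHERE id = 1"),
]

def apply_from(replica, log):
    """Re-apply any transactions past the stored restart point."""
    (seqno,) = replica.execute("SELECT seqno FROM restart_point").fetchone()
    for s, stmt in log:
        if s <= seqno:
            continue  # already applied before the restart
        replica.execute(stmt)
        replica.execute("UPDATE restart_point SET seqno = ?", (s,))
        replica.commit()  # data and position land atomically

apply_from(replica, log)
apply_from(replica, log)  # a restart simply catches up; no duplicates
print(replica.execute("SELECT balance FROM accounts").fetchone()[0])  # 150
```

If the restart point were stored non-transactionally, the second call could double-apply transactions, which is exactly the kind of divergence the design pattern guards against.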
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Turning to the next responsibility of the transactional data service, the application interface may obviously include SQL via MySQL or PostgreSQL wire protocols. However, any consistent interface that is accessible over a network will do. The <a href="https://github.com/memcached/memcached/blob/master/doc/protocol.txt">memcached protocol</a> is also perfectly acceptable. Subsets of SQL such as stored procedures work quite well. Transactional data services are more general than SQL DBMS servers in this sense. Full SQL or even a subset is not a requirement. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Backup and restore are critical for data fabrics as they enable provisioning of new services as well as recovery of services that fail. You restore a backup and then let the service catch up using replication. Data fabrics can get along fine using a range of options from logical dumps of the DBMS (mysqldump or pgdump) to file system snapshots. Fabric backups just need to be transactionally consistent and movable to other hosts. <br />
<br />
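As a small illustration of a transactionally consistent, online backup, the Python sketch below uses SQLite's backup API as a stand-in for a fabric DBMS. The principle is the same for mysqldump with --single-transaction or a storage snapshot: the copy reflects a single consistent point in time and can be moved to another host.

```python
import sqlite3

# Source "data service" with some committed data.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
src.execute("INSERT INTO accounts VALUES (1, 100)")
src.commit()

# Online, transactionally consistent copy via sqlite3.Connection.backup
# (Python 3.7+); the source stays live while pages are copied.
dst = sqlite3.connect(":memory:")
src.backup(dst)

print(dst.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0])  # 100
```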
Note that the backup required by the transactional data service design pattern is a less general form of backup than most businesses really require. Businesses may need to recover data after accidental deletion or to keep copies of information for many years for legal reasons. You can therefore use a general-purpose backup solution like <a href="http://www.zmanda.com/">Zmanda</a> or <a href="http://www.pgbarman.org/">Barman</a> provided it meets the fabric design pattern requirements. There's no need to do things twice. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Finally, the replication log is a serialized list of transactions to replicate to other hosts. Serialization enables the transactions to be replayed on another host and result in an identical copy. Generally speaking, data fabrics require <i>logical replication</i>, which applies changes to replicas using SQL statements on a live server. This is because other design patterns depend on being able to access and even write to the transactional data service when it is acting as a slave. Binary replication methods like disk block replication, such as <a href="http://www.drbd.org/">DRBD</a>, do not meet this requirement and therefore are of limited use in data fabrics. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>Implementation</b></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
You can implement the transactional data service design pattern with any DBMS that meets the pattern responsibilities. That said, implementation details are very important. As we have seen, ensuring that DBMS servers live up to the responsibility to store transactions safely is a little harder than one might think. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>1. MySQL</b>. MySQL with InnoDB engine is generally a good choice. It has stable SQL APIs and a wide range of capable client libraries. However, MySQL must be correctly configured to maintain proper transactional guarantees. Here are three properties that should be in your my.cnf file to help ensure MySQL lives up to its responsibilities: </div>
<div>
<br /></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"># Ensure durable flush to storage on transaction commit. </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">innodb_flush_log_at_trx_commit=1</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"># Synchronize binlog with committed transactions. </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">sync_binlog=1</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"># Use InnoDB as default storage engine. (Unnecessary for MySQL 5.5 and above.) </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">default-storage-engine=InnoDB</span></div>
<div>
<br /></div>
<div>
There are a variety of good backup mechanisms for MySQL databases, including mysqldump (with --single-transaction, useful only for small data sets), <a href="http://www.percona.com/software/percona-xtrabackup">Percona XtraBackup</a>, and file system snapshots. Snapshots are especially good when using NetApp or other capable storage. NetApp snapshots can be restored in seconds and cost little in terms of performance overhead. </div>
<div>
<br /></div>
<div>
<b>2. PostgreSQL</b>. PostgreSQL with a trigger-based replication log, for example from Londiste or SLONY, and pgdump for backups is another good choice. PostgreSQL has unusually good trigger support, at least for DML changes, and permits users to write triggers in a number of languages. Be aware that PostgreSQL triggers do not capture DDL statements like CREATE TABLE, though.<br />
<br />
PostgreSQL is fully transactional out of the box and triggers create a fully serialized replication log. It does not have the problem that MySQL does with potentially unsafe table types like MyISAM. However, you need to set a couple of parameters to ensure safe operation. These ensure transactions are committed down to the storage level and prevent catastrophic corruption of the database and/or the WAL (write-ahead log). Since PostgreSQL defaults to these values, you mostly need to avoid turning them off.<br />
<div>
<br /></div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">fsync = on # turns forced synchronization on or off</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">synchronous_commit = on # immediate fsync at commit</span><br />
<br />
Like MySQL, PostgreSQL SQL and APIs are stable and well-known. Pgdump also loads and restores data without difficulty for smallish data sets. For larger data sets file system snapshots work very well. </div>
<div>
<br />
Regardless of the DBMS type you choose, it is important to avoid application-level features that introduce wormholes, such as the PostgreSQL unlogged tables mentioned in the previous section. Generally speaking, you should only skip transactions if there is a very strong reason for doing so.<br />
<br /></div>
<div>
Do other database types work for this design pattern? Of course. You can also use a commercial DBMS like Oracle. Oracle fulfills the pattern responsibilities quite well, but is a bit more heavyweight than users want, particularly when operating in the cloud. </div>
<div>
<br />
<b>And Hardware Implementation, Too...</b><br />
<br />
Even with a properly configured DBMS server you are still not completely out of the woods for data durability. Database servers generally use an <a href="http://linux.die.net/man/2/fsync">fsync()</a> or similar system call to flush data to storage. Unfortunately storage controller cards may just cache the data to be written in local RAM and return. That can fool the DBMS server into thinking transactions are safely stored when they actually are still sitting in memory on a controller card. The "committed" data will then vaporize in a host crash, which in turn can fatally corrupt both MySQL <i>and</i> PostgreSQL stores if you are very unlucky. Just a few bad blocks can cause very serious problems. <br />
<br />
Fortunately there is a cure to make data vaporization less likely. On local storage you can invest in <a href="http://serverfault.com/questions/114547/what-is-a-raid-controller-bbu-for">RAID with a battery-backed cache (BBU)</a>, which keeps power on for the cache even if the host fails completely. SANs and network attached storage tend to have this capability built in. (But check the specifications!) Battery backed cache also tends to be fast, since controllers can safely return from an fsync() operation as soon as the data to be written are in the on-board cache. Without this feature writes to storage can be painfully slow. <br />
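You can probe for this behavior without trusting the vendor's specifications. The Python sketch below, in the spirit of tools like pg_test_fsync, times repeated fsync() calls; the iteration count and write size are arbitrary assumptions for illustration.

```python
import os
import tempfile
import time

# Rough fsync latency probe: write a block, then time how long the OS
# takes to report it flushed to stable storage.
def fsync_latency(iterations=50, block=b"x" * 512):
    fd, path = tempfile.mkstemp()
    try:
        timings = []
        for _ in range(iterations):
            os.write(fd, block)
            start = time.perf_counter()
            os.fsync(fd)  # request a forced flush to storage
            timings.append(time.perf_counter() - start)
        return sum(timings) / len(timings)
    finally:
        os.close(fd)
        os.unlink(path)

avg = fsync_latency()
# On spinning disks expect on the order of milliseconds per flush; a few
# microseconds usually means a volatile cache acknowledged the flush,
# not the storage media itself.
print(f"average fsync latency: {avg * 1e6:.1f} microseconds")
```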
<br />
One interesting question is how to handle cloud environments like Amazon. You just do not know how the storage actually works. (That's really the point of a cloud, after all.) Amazon provides SLAs for performance (example: <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html">EBS provisioned IOPS</a>), but there do not seem to be any SLAs about storage consistency. There is a lot to learn here and lessons that apply to Amazon may not necessarily apply to others. I suspect this will prompt some rethinking about data consistency--it's an interesting "what if" for transaction processing to suppose you cannot trust the underlying storage capabilities.<br />
<br />
Data loss occurs sooner or later in virtually all systems, but the good news is that you can make it uncommon. For more information check out data sources like <a href="http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/1449314287">this</a> and <a href="http://www.amazon.com/PostgreSQL-High-Performance-Gregory-Smith/dp/184951030X">this</a>. Also, other fabric design patterns like the Fault-Tolerant Data Service keep applications running when failures do occur and can even minimize the effects of data loss. See the upcoming article on that design pattern for more information. <br />
<br /></div>
<div>
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>Implementations to Avoid</b></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Here are two examples that do not meet the transactional data service design pattern responsibilities or at least not fully. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<b>1. MySQL with MyISAM table type. </b> MyISAM does not support transactions and is not crash safe. You will lose data or incur downtime fixing problems. MyISAM does not belong in data fabrics. <br />
<b><span class="Apple-style-span" style="font-weight: normal;"><br /></span></b></div>
<div>
<b><span class="Apple-style-span">2. PostgreSQL with streaming replication.</span><span class="Apple-style-span" style="font-weight: normal;"> <a href="http://www.postgresql.org/docs/9.2/static/warm-standby.html#STREAMING-REPLICATION">Streaming replication</a> replicates log updates in real-time and has the added benefit of permitting queries on replicas. However, streaming replication does not allow you to write to replicas. It therefore does not support online schema maintenance or multi-master replication. It also does not help with heterogeneous replication. This makes streaming replication an unattractive choice for data fabrics, even though it is far simpler and works better for ensuring high availability than logical replication solutions like SLONY. </span></b></div>
<div>
<b><span class="Apple-style-span" style="font-weight: normal;"><br /></span></b></div>
<div>
<b><span class="Apple-style-span" style="font-weight: normal;">How do NoSQL stores fare in this design pattern? Let's pick on <a href="http://www.mongodb.org/">MongoDB</a>. MongoDB supports atomic commit to single BSON documents but does not support transactions across multiple documents. (Unless you think that the <a href="http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/">hacky two-phase commit</a> proposed by the MongoDB manual is an answer.) Atomic transactions are not one of the reasons people tend to choose NoSQL systems, so this is not surprising. </span></b><b><span class="Apple-style-span" style="font-weight: normal;">It means that MongoDB cannot handle the responsibilities of this design pattern. </span></b></div>
<div>
<b><span class="Apple-style-span" style="font-weight: normal;"><br /></span></b></div>
<div>
<div>
<b>Conclusion and Takeaways</b></div>
<div>
<b><br /></b>
The transactional data service design pattern can be implemented with a carefully configured SQL database. As we have seen, however, there are a number of details about what "carefully configured" really means. <br />
<br />
It is a good idea to use the transactional data service design pattern even if you are not planning to implement a data fabric. Systems grow. This pattern gives you the flexibility to build out later by adding other fabric design patterns, for example to introduce cross-site operation using the Multi-Site Data Server pattern. It also protects your data at multiple levels that include transactions as well as regular backups. Nobody really likes losing data if it can be avoided.<br />
<br />
Another important point: the transactional data service design pattern keeps your entire fabric working. Losing small amounts of data is typically just an inconvenience for users, especially if it does not occur too often. Broken replication, on the other hand, caused by replicas that diverge or become corrupt after failures, can lead to time-consuming administration and significant downtime to repair. The fabric is a network of servers. Poor configuration on one server can cause problems for many others. <br />
<br />
Finally, the biggest error people make with this design pattern is to neglect backups. There's something inherently human about it: backups are painful to test so most of us don't. My first and biggest IT mistake involved a problem with backups. It nearly caused a large medical records company to lose a week of data entry records. Put in backups and test them regularly to ensure they work. This is a theme that will recur in later articles. <br />
<br />
Speaking of which, the next article in this series will cover the <a href="http://scale-out-blog.blogspot.com/2013/02/data-fabric-design-patterns-fabric.html">fabric connector design pattern</a>. Stay tuned! </div>
</div>
</div>
Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com4tag:blogger.com,1999:blog-768233104244702633.post-63863680279550707592013-02-05T09:23:00.000-08:002013-02-06T23:09:21.907-08:00
<b>Introducing Data Fabric Design for Commodity SQL Databases</b><br />
<br />
Data management is undergoing a revolution. Many businesses now depend on data sets that vastly exceed the capacity of DBMS servers. Applications operate 24x7 in complex cloud environments using small and relatively unreliable VMs. Managers need to act on new information from those systems in real-time. Users want constant and speedy access to their data in locations across the planet. <br />
<br />
It is tempting to think popular SQL databases like MySQL and PostgreSQL have no place in this new world. They manage small quantities of data, lack scalability features like parallel query, and have weak availability models. One reaction is to discard them and adopt alternatives like Cassandra or MongoDB. Yet open source SQL databases have tremendous strengths: simplicity, robust transaction support, lightning fast operation, flexible APIs, and broad communities of users familiar with their operation. The question is how to design SQL systems that can meet the new requirements for data management. <br />
<br />
This article introduces an answer to that question: <i><u>data fabric design</u></i>. Data fabrics arrange off-the-shelf DBMS servers so that applications can connect to them as if they were a single database server. Under the covers a data fabric consists of a network of servers linked by specialized connectivity and data replication. Connectivity routes queries transparently from applications to DBMS servers. Replication creates replicas to ensure fault tolerance, distribute data across locations, and move data into and out of other DBMS types. The resulting lattice of servers can handle very large data sets and meet many other requirements as well.<br />
<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Data fabric design is a big topic, so I am going to spread the discussion over several articles. This first article provides a definition of data fabric architecture and introduces a set of design patterns to create successful data fabrics. In the follow-on articles I will explore each design pattern in detail. The goal is to make it possible for anyone with a background in database and application construction to design data management systems that operate not only today but far into the future. At the very least you should understand the issues behind building these systems. <br />
<br />
Some readers may see data fabric design as just another reaction to NoSQL. This would be a mistake. Building large systems out of small, reliable parts is a robust engineering approach that derives from ground-breaking work by <a href="http://www.amazon.com/Transaction-Processing-Concepts-Techniques-Management/dp/1558601902">Jim Gray</a>, <a href="http://cs.brown.edu/courses/cs227/papers/weaker/cidr07p15.pdf">Pat Helland</a>, and others dating back to the 1970s. Data fabrics consist of DBMS servers that you can look at and touch, whereas NoSQL systems tend to build storage, replication, and access into a single distributed system. It is an open question which approach is more complex or difficult to use. There are trade-offs and many systems actually require both of them. You can read this article and those that follow it, then make up your own mind about the proper balance. <br />
<br />
<b>Acknowledgements</b><br />
<br />
The data fabric concept is largely based on practical experience on <a href="http://www.continuent.com/solutions">Continuent Tungsten</a>. I am indebted to <a href="http://www.continuent.com/">Continuent</a> as well as our customers for the opportunity to work on this problem. I am likewise indebted to Ed Archibald, Continuent CTO, with whom I have worked for many years. Ed among other things came up with the data fabric moniker. Our interest in this topic goes back to shared experiences with <a href="http://manuals.sybase.com/onlinebooks/group-cnarc/cng1110e/osref/@Generic__BookTextView/281;hf=0">Sybase OpenServer</a> and <a href="http://www.sybase.com/products/businesscontinuity/replicationserver">Sybase Replication Server</a>, which were ground-breaking products in the fields of connectivity and replication. Two decades later we are still applying the insights gained from working on them. <br />
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>What Is a Data Fabric Architecture?</b><br />
<b><br /></b></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Let's start the discussion of data fabrics with a practical problem. We want to design a SaaS application for <a href="http://en.wikipedia.org/wiki/Customer_relationship_management">customer relationship management</a> that will support up to a million users using a commodity open source DBMS like MySQL. What are the requirements?</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Obviously, our system must be able to scale over time to hundreds of app servers operating on hundreds of terabytes of data. It must hide failures, maintenance, and schema upgrades on individual DBMS hosts. It must permit data to distribute across geographic regions. It must deliver and accept transactions from NoSQL stores, data warehouses, and commercial DBMS in real time. It must allow smooth technology upgrade and replacement. Finally, it must look as much like a single DBMS server to applications as possible. Here's a picture of what we want: </div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div class="separator" style="clear: both; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td style="text-align: center;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZz80lygBT-pANwTXMCd9UjNKEUCdZ_hTxB8yiNpArgYU70x0TIB7ZJa0BtaseAa-srOnfENGjIfwRFdNw6jDskoiu59O6-R91wW1AjCSkYe3NeRGUU0LEcvz4xLN6WnHg_54BlcCDZT8/s1600/Data-Fabric-Intro.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="192" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZz80lygBT-pANwTXMCd9UjNKEUCdZ_hTxB8yiNpArgYU70x0TIB7ZJa0BtaseAa-srOnfENGjIfwRFdNw6jDskoiu59O6-R91wW1AjCSkYe3NeRGUU0LEcvz4xLN6WnHg_54BlcCDZT8/s320/Data-Fabric-Intro.jpg" style="cursor: move;" width="320" /></a></div>
</td></tr>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Conceptual Data Fabric</div>
</td></tr>
</tbody></table>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
The last requirement is especially important and goes to the essential nature of data fabric architecture. The fabric may contain dozens or even hundreds of servers but encapsulates their locations and number. Applications connect to the fabric the same way they connect to individual DBMS servers, a property we call <i>transparency</i>. Transparency permits developers to build applications using a DBMS on a laptop and push out code to production through increasingly capable test environments without changes in behavior. This is a potentially confusing requirement, so let's look at a couple of examples.<br />
<br />
Transparency <i>does not</i> mean that you get access to all data all the time from everywhere. Say our sample application stores customer data across many servers. The following SQL query, which lists regions whose SaaS users have sales to their own customers greater than $100,000, would typically not work:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mysql> select region, count(cust_id), sum(sales) as sales_total </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> from customer group by region having sales_total > 100000;</span><br />
<br />
This is not as big a problem as it sounds, because SaaS applications for the most part operate on data for a single SaaS user at a time and do not ask for data across users. Moreover, it is easy to understand that you need to do something special for this particular query, such as connecting to all servers explicitly or loading transactions into a data warehouse. Most SaaS designers are pretty comfortable with this limitation, which just makes explicit something that you knew anyway. <br />
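To make this concrete, here is a minimal sketch of the scatter-gather work the application must do itself for a cross-shard aggregate. The shard lists and the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">regions_over</span> function are hypothetical stand-ins for separate MySQL servers, not part of any fabric product:<br />
<br />

```python
# Sketch: a cross-shard aggregate must be computed by the application,
# since the fabric will not run it for you. Shards here are plain
# in-memory lists standing in for separate MySQL servers (an assumption
# for illustration; a real fabric would hold one DBMS per shard).

from collections import defaultdict

# Each shard holds rows for a subset of SaaS users: (region, cust_id, sales)
SHARDS = [
    [("sfo", 1, 60000.0), ("nyc", 2, 80000.0)],
    [("sfo", 3, 70000.0), ("lon", 4, 20000.0)],
]

def regions_over(threshold):
    """Scatter the aggregate to every shard, then gather and merge."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for shard in SHARDS:                 # "scatter": query each server
        for region, cust_id, sales in shard:
            totals[region] += sales
            counts[region] += 1
    # "gather": apply the HAVING clause on the merged results
    return {r: (counts[r], totals[r]) for r in totals if totals[r] > threshold}
```

With the sample rows above, only the 'sfo' region crosses $100,000 (two customers totaling 130,000), which is exactly what the GROUP BY ... HAVING query would return if all rows lived on one server.<br />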
<br />
On the other hand, transparency <i>does</i> mean that your application talks to what looks like a single server for individual SaaS user data. For instance, the following sequence of commands for a single customer to insert a row and get the generated auto-increment key back <i><u>must</u></i> work in a fabric just as it does on a single DBMS server.<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mysql> insert into customer(name, region, sales) values('bob', 'sfo', 10035.0);</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Query OK, 1 row affected (0.00 sec)</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mysql> select last_insert_id();</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">+------------------+</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">| last_insert_id() |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">+------------------+</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">| 1 |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">+------------------+</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">1 row in set (0.00 sec)</span><br />
<br />
Selecting the last inserted ID is a standard idiom for adding rows to tables with synthetic keys in MySQL. It is baked into widely used libraries like PHP mysqli. Change it and you break thousands of applications. This could happen if the fabric switched a DBMS connection across servers between these two commands. When operating on data for a single SaaS user, fabric transparency needs to be as close to perfect as possible. <br />
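A toy model shows what is at stake. The <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">FakeServer</span> and <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">PinnedSession</span> classes below are hypothetical; they only illustrate that last_insert_id() state lives on one server, so a router that hopped servers between the two statements would hand back the wrong key:<br />
<br />

```python
# Sketch: why the fabric must pin a session to one server. Each fake
# "server" tracks its own last generated key, much as MySQL tracks
# last_insert_id() per connection. Class names are hypothetical,
# not a real fabric API.

class FakeServer:
    def __init__(self, name):
        self.name = name
        self.next_id = 1
        self.last_insert_id = 0

    def insert(self):
        # generate a synthetic key and remember it for this server
        self.last_insert_id = self.next_id
        self.next_id += 1

class PinnedSession:
    """Routes every statement in a session to the same server."""
    def __init__(self, servers):
        self.server = servers[0]   # pinned once, at connect time

    def insert(self):
        self.server.insert()

    def last_id(self):
        # correct only because we never switched servers mid-session
        return self.server.last_insert_id

servers = [FakeServer("db1"), FakeServer("db2")]
session = PinnedSession(servers)
session.insert()
```

Had the session been switched to db2 between insert() and last_id(), it would have seen 0 rather than the key it just generated, which is exactly the breakage the transparency requirement forbids.<br />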
<br />
Any architecture that meets the preceding requirements including transparency is a data fabric. The next topic is what is actually inside a fabric architecture. <br />
<br />
<b>What Are Data Fabrics Made of? </b><br />
<b><br /></b>
Combining individual components into lattices is hardly a new idea. Such compositions are common in fields from <a href="http://en.wikipedia.org/wiki/File:Double-Helix-Bridge.jpg">bridge-building</a> to art. One of my favorite examples is the famous arabesques of the <a href="http://en.wikipedia.org/wiki/Alhambra">Alhambra</a> in Granada, which combine simple motifs into patterns then combine those to create still more complex patterns that cover walls and ceilings throughout the palace. The resulting compositions are works of stunning beauty. </div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-bottom: 0.5em; margin-left: auto; margin-right: auto; padding-bottom: 6px; padding-left: 6px; padding-right: 6px; padding-top: 6px; text-align: center;"><tbody>
<tr><td style="text-align: center;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<a href="https://s3.amazonaws.com/blog-pictures/01-data-fabric/Alhambra-Arabesque-Atauriques.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="265" src="https://s3.amazonaws.com/blog-pictures/01-data-fabric/Alhambra-Arabesque-Atauriques.jpg" style="cursor: move;" width="400" /></a></div>
</td></tr>
<tr><td class="tr-caption" style="font-size: 13px; padding-top: 4px; text-align: center;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Detail of Arabesque from Alhambra, Spain (Source: Wikipedia)</div>
</td></tr>
</tbody></table>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Arabesque construction is far from random. Arabesques combine plant-like elements into geometric patterns. Only certain elements are allowed--there are typically no human representations--and only certain patterns work well. Arabesques are also expressed in a particular medium such as stone, plaster, tiles, or paint. <br />
<br />
Much as arabesques do, data fabrics combine very specific elements in a medium consisting of the following logical parts: </div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
</div>
<ol>
<li>A partitioned network of recoverable transactional stores (DBMS servers) connected by reliable messaging (replication). Partitioned means that not every service exchanges information with every other, in the same way that DBMS servers for different applications may be separate silos. </li>
<li>A routing layer that connects applications to relevant data based on simple hints provided by the application itself, such as the name of a schema, a primary key, or whether the current user transaction is a set of writes or an auto-commit read. </li>
</ol>
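The routing idea in the second element can be sketched in a few lines. The topology table and the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">route</span> function below are invented for illustration; a real fabric maintains this metadata dynamically rather than in a hard-coded dict:<br />
<br />

```python
# Sketch of hint-based routing: pick a server from the schema name and
# whether the statement writes. The topology below is made up for
# illustration; a real fabric would discover it from live metadata.

TOPOLOGY = {
    "crm_west": {"master": "db-west-1", "replicas": ["db-west-2"]},
    "crm_east": {"master": "db-east-1", "replicas": ["db-east-2"]},
}

def route(schema, is_write):
    """Return the server that should handle this statement."""
    service = TOPOLOGY[schema]
    if is_write:
        return service["master"]        # writes always go to the master
    return service["replicas"][0]       # autocommit reads may use a replica
```

Note that the application supplies only hints it already has (the schema name, whether it is writing); it never names a physical server, which is what keeps the fabric transparent.<br />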
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Stores, replication, and routing logic are the fundamental elements of fabric implementations. These are powerful tools for building very large applications. Let's take a quick tour, as their particular properties are critical for building systems that actually work. <br />
<br />
<i>Transactional stores</i> are<i> DBMS servers</i> that apply changes as atomic units that either commit or roll back as a whole. Transactional stores convert these changes into a <i>serial history</i>, which orders concurrent transactions in its log so that they can replay as if they had executed one after the other all by themselves. Serial ordering enables <i>recovery</i>, the procedure that brings data back cleanly to the last committed transaction after a crash or restart. These properties are at the heart of all relational DBMSs today and are necessary for the fabric to operate. Non-transactional stores are close to useless in fabrics. Use them at your peril. <br />
<br />
<i>Reliable messaging </i>transmits changes between stores without losing them or applying them twice. The usual way to do this is through <i>replication</i>, which moves transactions automatically on commit and applies them to replicas. Replication needs to support transactions, apply them in serial order, and recover just as transactional stores do. Replication systems that fail to meet any of these three requirements do not cut the mustard and do not belong in data fabrics. <br />
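The "no loss, no double apply" requirement is usually met by recording a sequence number transactionally alongside the data, so a crash between apply and acknowledgment cannot double-apply a transaction. Here is a minimal in-memory sketch of the idea; the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Replica</span> class is illustrative, not Tungsten's implementation:<br />
<br />

```python
# Sketch: exactly-once apply on a replica by tracking the last applied
# sequence number. In a real system the seqno is persisted in the same
# transactional store as the data, so apply-plus-advance is atomic.
# These structures are in-memory stand-ins for illustration only.

class Replica:
    def __init__(self):
        self.rows = []
        self.last_seqno = -1   # persisted with the data in a real store

    def apply(self, seqno, row):
        if seqno <= self.last_seqno:
            return False       # duplicate delivery after a restart: skip
        # in one transaction: apply the change and advance the seqno
        self.rows.append(row)
        self.last_seqno = seqno
        return True

replica = Replica()
replica.apply(0, "insert bob")
replica.apply(1, "insert sue")
replica.apply(1, "insert sue")   # redelivered after a crash: ignored
```

The redelivered transaction is silently skipped, which is what lets the sender retry after any failure without risking a double apply.<br />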
<br />
Data replication implementations differ in at least three major ways. There is not really a "right approach" to replication--different methods work better for some applications than others.<br />
<br />
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<i>1. Synchronous vs. asynchronous.</i><b> </b>Synchronous replication moves transactions before reporting commit back to applications. It minimizes problems with data loss if a replica fails but may slow or block applications if replication is slow. It can also lead to deadlocks. Asynchronous replication moves transactions after commit. It does not block applications but leads to latency between replicas and may result in data loss if there is a failure before transactions replicate from a particular replica. </div>
</div>
</div>
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
</div>
</div>
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<i>2. Master-master vs. master/slave. </i>Master-master replication allows updates on any replica. It requires conflict management, typically either through locking in advance or fixing up after the fact (also known as <i>conflict resolution</i>). Master/slave replication requires applications to use a single master for updates. It works well with asynchronous replication, which is easier to implement. </div>
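As a concrete example of fixing up after the fact, here is a sketch of last-write-wins resolution, one common (and lossy) conflict resolution policy. The tuple layout and the site-id tie-breaker are assumptions for illustration, not a description of any particular product:<br />
<br />

```python
# Sketch: "fixing up after the fact" with last-write-wins. Each version
# of a row is modeled as (timestamp, site_id, value); the newest wins,
# with site_id as a deterministic tie-breaker. This layout is an
# assumption for illustration; real systems vary widely here.

def resolve(local, remote):
    """Return the surviving version of a conflicting row."""
    # Python tuple comparison: timestamp first, then site_id, so two
    # sites always agree on the winner even when timestamps tie.
    return max(local, remote)

local = (100, "sfo", "bob@old.example.com")
remote = (105, "nyc", "bob@new.example.com")
winner = resolve(local, remote)
```

The losing update is simply discarded, which is why last-write-wins is called lossy; policies that merge or queue conflicts for review avoid the loss at the cost of more complexity.<br />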
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<i>3. Log-based vs. trigger-based</i>. Log-based replication reads the database journal or some representation of it like the MySQL binlog. Log-based replication has lower impact on DBMS servers but can be very hard to implement correctly. Trigger-based replication uses triggers to capture transaction content, typically using transfer tables. Trigger-based replication is simpler to implement but adds load to DBMS servers, can add unwanted serialization points to transactions, and may not handle certain types of changes, such as DDL statements. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
The routing layer directs application DBMS connections to specific copies of data. An application sees what appears to be a <i>single session</i> for each connection with consistent settings, such as character set encodings and session variables. Underneath, the fabric may actually switch the connection across DBMS servers at opportune times. Transparency is absolutely critical. As mentioned previously, even tiny changes in session behavior break libraries that applications depend on to access data. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
There are many methods to route connections as well as transactions. As with replication, there is no "best" way. It all depends on your application and the type of environment in which you are operating. Here are four common approaches. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<i>1. Gateways. </i> A gateway is a proxy process that sits between the application and DBMS server. The gateway looks like a server to applications. It establishes connections on behalf of the applications, hence can perform very flexible transformations on data at the cost of some performance due to the double network hop the gateway introduces. Gateways are hard to implement initially, but good ones are relatively easy to program once they work properly. </div>
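At its core a gateway is just a process that accepts client connections and relays traffic to a backend it selects, which is where the extra network hop comes from. The toy below forwards a single request over plain sockets; a real MySQL gateway would of course speak the MySQL wire protocol and juggle many concurrent sessions:<br />
<br />

```python
# Sketch: the skeleton of a gateway proxy. It accepts one client
# connection, opens its own connection to the backend it chose, and
# relays one request and one response. The uppercasing backend is a
# stand-in for a DBMS; nothing here is a real gateway product.

import socket
import threading

def serve_backend(server_sock):
    """Stand-in DBMS: answer one request with an uppercased echo."""
    conn, _ = server_sock.accept()
    with conn:
        conn.sendall(conn.recv(1024).upper())

def serve_gateway(gw_sock, backend_addr):
    """Accept one client, connect to the backend, relay both ways."""
    conn, _ = gw_sock.accept()
    with conn, socket.create_connection(backend_addr) as backend:
        backend.sendall(conn.recv(1024))   # client -> backend
        conn.sendall(backend.recv(1024))   # backend -> client

def listen():
    s = socket.socket()
    s.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    s.listen(1)
    return s

backend_sock = listen()
gateway_sock = listen()
threading.Thread(target=serve_backend, args=(backend_sock,)).start()
threading.Thread(target=serve_gateway,
                 args=(gateway_sock, backend_sock.getsockname())).start()

# The client talks only to the gateway, never to the backend directly.
with socket.create_connection(gateway_sock.getsockname()) as client:
    client.sendall(b"select 1")
    reply = client.recv(1024)
```

The client is unaware of the backend's address, which is exactly the encapsulation a fabric gateway provides; the price is the double hop visible in serve_gateway.<br />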
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<i>2. IP Routers. </i> Routing software switches IP packet paths between servers. It has the lowest overhead and potentially highest transparency but requires substantial effort to implement correctly. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<i>3. Library wrappers. </i>Library wrappers re-implement standard interfaces like JDBC or Perl DBI, then route application requests through underlying connectors like the MySQL or PostgreSQL JDBC driver. Wrappers are relatively easy to implement and have excellent performance but do not handle traffic that goes outside the library. Compiled versions can have nasty library dependencies that introduce new forms of <a href="http://en.wikipedia.org/wiki/Dependency_hell">RPM hell</a> for users if you try to use them generally. <br />
<i><br /></i>
<i>4. Custom routing layers.</i> Any interface that your applications use to access data can implement routing logic. For instance, SOAP or JSON servers work perfectly well for this purpose. Internal data access libraries can also implement routing. This approach is specific to single applications, and like library wrappers does not cover other means of access. </div>
</div>
</div>
<br />
The biggest constraint for any routing method is to keep it simple. Fat routing layers with complex logic tend to introduce bugs as well as change database semantics. This in turn violates the requirement for transparency. <br />
<br />
Fabrics use the preceding elements over and over. As we build up data fabric designs it is helpful to use consistent notation for data stores, replication, and routing. The following summarizes the main notation that you will see in diagrams.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDjfiyj_X7Lt1HBsycai1ajqMx_ZmzuanEK7sFeSK_7WB_s6LiTjhJw5NewIiVs9aXfhe8rMfATzbkakVmoDj-ijk3AJMQB8_udxYiJ9uKaAPbaWxP636qrUabJOy87he8F0Z0uJGQqao/s1600/Data-Fabric-Notation.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="186" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDjfiyj_X7Lt1HBsycai1ajqMx_ZmzuanEK7sFeSK_7WB_s6LiTjhJw5NewIiVs9aXfhe8rMfATzbkakVmoDj-ijk3AJMQB8_udxYiJ9uKaAPbaWxP636qrUabJOy87he8F0Z0uJGQqao/s400/Data-Fabric-Notation.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Data Fabric Notation</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
At this point we understand that data fabrics consist of a specific set of elements. However, we have not provided any organization yet. The next question is therefore how to arrange them into real systems.<br />
<br />
<b>What Are the Design Patterns for Data Fabrics?</b><br />
<br />
Having a model of fabric elements is not the same as an actual implementation. We need to define how the elements are composed into a complete system that can be deployed. This is where design patterns come in.<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br />
<a href="http://en.wikipedia.org/wiki/Design_pattern">Design patterns</a> are reusable solutions that combine flexibly with each other to create large-scale architectures. In the case of data fabrics, design patterns offer guidelines to organize fabric elements in ways that work well for large data sets spread over multiple geographic locations. They serve the same function as the geometric arrangements that organize individual motifs in an arabesque.<br />
<br />
Data fabric design patterns arrange <i>data services. </i> A data service is an abstraction for a transactional store with a well-defined API. Design patterns either link data services in some way or create more capable data services from simpler ones. The ability to create new services in this way is called <a href="http://en.wikipedia.org/wiki/Service_composability_principle">service composability</a>. Composability is a fundamental attribute of fabric design patterns. It is the reason that fabric designs can handle very large data sets flexibly. Service composition is analogous to the patterns-within-patterns layout that allows arabesques to cover an entire building without seeming repetitive or boring.<br />
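Composability is easy to see when every service exposes the same API. In this sketch a sharded service is assembled from fault-tolerant services, which are themselves assembled from plain stores; all class names are illustrative, not a fabric product's API:<br />
<br />

```python
# Sketch of service composability: every service answers the same
# read(key) API, so services can nest. Integer keys are used so the
# shard choice (hash(key) % shards) is deterministic; everything here
# is an in-memory stand-in for illustration.

class Store:
    def __init__(self, data, alive=True):
        self.data, self.alive = data, alive

    def read(self, key):
        if not self.alive:
            raise ConnectionError("store down")
        return self.data[key]

class FaultTolerantService:
    """Same read() API, backed by redundant stores."""
    def __init__(self, stores):
        self.stores = stores

    def read(self, key):
        for store in self.stores:
            try:
                return store.read(key)
            except ConnectionError:
                continue           # fail over to the next replica
        raise ConnectionError("no live replica")

class ShardedService:
    """Same read() API again, backed by shards of any service type."""
    def __init__(self, shards):
        self.shards = shards

    def read(self, key):
        return self.shards[hash(key) % len(self.shards)].read(key)

# Compose: a sharded service whose shards are fault-tolerant services.
shard0 = FaultTolerantService([Store({}, alive=False), Store({0: "alpha"})])
shard1 = FaultTolerantService([Store({1: "beta"})])
fabric = ShardedService([shard0, shard1])
```

Because ShardedService neither knows nor cares that its shards are themselves composites, the same pattern extends to multi-site services wrapping sharded ones, and so on; that nesting is what lets fabrics grow without a redesign.<br />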
<br />
There are six design patterns that are particularly useful for SQL databases. The following diagram shows them in an <i><u><span class="Apple-style-span" style="color: blue;">italicized blue font</span></u></i> and illustrates how they organize the implementation of our sample CRM system. <br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEje8OXUThjwJHQ_-BGEycpUJxZbI1IgDLsKhrNgfETERW1AvWAFebmW9MXE3irOWgJsrhOuso-TLEkvHwlTtbmLkLf34xKierafHJjOzVTnlEuorioZVign9IpT3a-fdJX4GEhq_C7isng/s1600/Full-Data-Fabric.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEje8OXUThjwJHQ_-BGEycpUJxZbI1IgDLsKhrNgfETERW1AvWAFebmW9MXE3irOWgJsrhOuso-TLEkvHwlTtbmLkLf34xKierafHJjOzVTnlEuorioZVign9IpT3a-fdJX4GEhq_C7isng/s640/Full-Data-Fabric.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fully Implemented Data Fabric</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
</div>
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div class="separator" style="clear: both; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: center;">
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<div class="separator" style="clear: both; text-align: center;">
</div>
Here is a short description of each fabric design pattern. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<ul>
<li><b>Transactional Data Service</b> - A transactional store that can restart without losing data following failure. It has a well-defined API, a mechanism for backup, and a transaction log for replication. This is the building block for all other design patterns. </li>
<li><b>Fabric Connector</b> -- A routing method that hides the location of databases and enables data services consisting of multiple DBMS servers to look like a single server to applications. </li>
<li><b>Fault-Tolerant Data Service</b> -- A data service that protects against failures and enables zero-downtime maintenance using a set of redundant DBMS servers. </li>
<li><b>Sharded Data Service</b> -- A data service that handles large data sets by dividing them into shards distributed across a number of underlying data services. </li>
<li><b>Multi-Site Data Service</b> -- A data service that enables data to spread across multiple geographic locations using data services on each site linked by replication. </li>
<li><b>Real-Time Data Bridge</b> -- A link that enables real-time replication of data between heterogeneous data services, for example between MySQL and a data warehouse like Vertica. </li>
</ul>
You can implement fabric design patterns in many different ways. This means there are many possible implementations of data fabrics as well. As long as you meet the assumptions of each particular design pattern, the result is likely to work well. We will delve into some of the variations in the next few articles.<br />
<br />
<b>A Little Bit about Naming</b><br />
<br />
You may have encountered the term "data fabric" in products (<a href="http://nysetechnologies.nyx.com/en/data-technology/data-fabric-6-0">example here</a> or <a href="http://www.tervela.com/tervela-data-fabric">here</a>) or as a concept (<a href="http://wiki.tangosol.com/display/COH32UG/Provide+a+Queryable+Data+Fabric">a nice example here</a>). This is because the ideas behind fabrics are very general and apply to many types of data management systems, perhaps most notably <a href="http://en.wikipedia.org/wiki/Storage_area_network">storage area networks</a>. The name data fabric is very natural and seems to have occurred to many people independently, especially after storage vendors popularized it during the 1990s. <br />
<br />
In our case the term "data fabric" always means a data fabric architecture as I defined it a couple of sections back. <br />
<b><br /></b>
<b>Conclusion and More</b><br />
<br />
This has been a longish introduction, but it establishes the terminology and sets the stage for looking at specific design patterns used to build data fabrics. Follow-on articles will look at individual design patterns in detail. <br />
<br />
One final thought: there are no doubt other ways of factoring design patterns for data fabrics. This particular set of patterns works very well for SQL databases, and I have seen them used successfully in many large systems over the years. I would argue they work well for other DBMS types as well but welcome your comments on other approaches. <br />
<br />
p.s., The next article covering the <a href="http://scale-out-blog.blogspot.com/2013/02/data-fabric-design-patterns.html">transactional data service</a> is now available. </div>
</div>
Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com8tag:blogger.com,1999:blog-768233104244702633.post-57762280942467857352013-01-13T23:19:00.003-08:002013-01-13T23:19:47.639-08:00Replicating from MySQL to Amazon RDSThere have been a number of comments that <a href="http://aws.amazon.com/rds/">Amazon RDS</a> does not allow users access to MySQL replication capabilities (for example <a href="https://forums.aws.amazon.com/thread.jspa?messageID=178291&#178291">here</a> and <a href="https://forums.aws.amazon.com/message.jspa?messageID=221905#221909">here</a>). This is a pity. Replication is one of the great strengths of MySQL and the lack of it is a show-stopper for many users. As of the latest build of <a href="http://code.google.com/p/tungsten-replicator/">Tungsten Replicator</a> half of this problem is on the way to being solved. You can now set up real-time replication from an external MySQL master into an Amazon RDS instance.<br />
<br />
In the remainder of this article I will explain how to set up Tungsten replication to an Amazon RDS slave, then add a few thoughts about why this feature is useful along with some suggestions for improvement. To keep the article reasonably short I assume you understand the basics of installing Tungsten Replicator. If you need more information, check out the <a href="https://docs.continuent.com/wiki/display/TEDOC/Tungsten+Documentation+Home">online documentation</a>.<br />
<br />
<b>Readying an RDS Test Instance</b><br />
<br />
Amazon RDS is Amazon's on-demand relational database. RDS supports several database types including MySQL, which I will use for this demonstration. Launching a new instance is simple. Log in to the <a href="https://console.aws.amazon.com/console/home">Amazon AWS Console</a> using an account that has RDS enabled, then switch to the <a href="https://console.aws.amazon.com/rds">Amazon RDS Console</a>. Press the Launch a DB Instance button, whereupon a screen like the following appears:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4h1YtBBMWJGS5iaF7x9d68BGXa8xiLtm8dYzfGQclXoQUvVTvMHXHzBpPckGkiG_4W8U7wj4FqvXRCO-4-oxvgpNlEHjrEo0i0C-0z4-CGlU7E9UYOE5y3K7ej3GpCZj6Ysk8HbR6bbA/s1600/select-a-database-aws.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="411" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4h1YtBBMWJGS5iaF7x9d68BGXa8xiLtm8dYzfGQclXoQUvVTvMHXHzBpPckGkiG_4W8U7wj4FqvXRCO-4-oxvgpNlEHjrEo0i0C-0z4-CGlU7E9UYOE5y3K7ej3GpCZj6Ysk8HbR6bbA/s640/select-a-database-aws.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">RDS Database Selection</td></tr>
</tbody></table>
Press the Select button for MySQL Community Edition, which starts the configuration window. (You can also replicate into Oracle if you are up for a challenge. If you do this, please post what you did to get it to work!) Next fill out properties for MySQL.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkaTXZbvhZkHDrLfR0N9uNtPUAIsTkrCJhyCPgpY3fwMUi9SjqzOjijpqAa9_MSa1nAkiD5aBcbc_-lJlRB5-GM-Cxe4H8Uuhlbz0I3EMrh34wMrHOnHNi3fXCfFFABCWUms5mqubVC5A/s1600/configure-instance-details.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="411" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkaTXZbvhZkHDrLfR0N9uNtPUAIsTkrCJhyCPgpY3fwMUi9SjqzOjijpqAa9_MSa1nAkiD5aBcbc_-lJlRB5-GM-Cxe4H8Uuhlbz0I3EMrh34wMrHOnHNi3fXCfFFABCWUms5mqubVC5A/s640/configure-instance-details.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">MySQL Instance Configuration</td></tr>
</tbody></table>
Among other things, you create a master login and password. Note these carefully as you will need them to configure replication. Then continue for another couple of screens, at which point you can launch your instance. It takes about 10 minutes for new instances to spin up, after which you can see the instance properties in the AWS RDS Console. Here's a screen shot of my test instance. <br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMY-NPq88Pr0diaovwrJNsGJuVvId9wA5xOc6_dK2ygCa8bHA4XWSU8-k5mphiNXZDL16B7Gr0Q0eOur9jRlpZ5T-0rzhUh8-3d6vCBUKKfwxDa8ON4057VNWu7voodTUq27hGsy-S1Y8/s1600/rds-management-console.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="411" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMY-NPq88Pr0diaovwrJNsGJuVvId9wA5xOc6_dK2ygCa8bHA4XWSU8-k5mphiNXZDL16B7Gr0Q0eOur9jRlpZ5T-0rzhUh8-3d6vCBUKKfwxDa8ON4057VNWu7voodTUq27hGsy-S1Y8/s640/rds-management-console.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">AWS RDS Console</td></tr>
</tbody></table>
Once the instance is up, test access. This is of course necessary to prove that the instance is running properly and that we can log in from a remote location. Note the host name in the Endpoint field. This is a DNS entry for the new MySQL instance. Using it and the master login, we can now fire up the mysql client from a remote host where we plan to run replication. <br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ mysql -utungsten -p -htest.c4villnbpuq1.us-west-1.rds.amazonaws.com</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Enter password: </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Welcome to the MySQL monitor. Commands end with ; or \g.</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Your MySQL connection id is 2125</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Server version: 5.5.27-log Source distribution</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mysql> show databases;</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">+--------------------+</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">| Database |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">+--------------------+</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">| information_schema |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">| innodb |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">| mysql |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">| performance_schema |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">| test |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">+--------------------+</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">5 rows in set (0.01 sec)</span><br />
<br />
This looks quite good. We are now ready to install and start replication. <br />
<br />
Important note! If you have trouble connecting, you may need to tweak your Amazon security group settings to open up ports, especially if you are replicating from non-Amazon locations. Amazon has a very cool feature that can guess your originating host IP and offer a <a href="http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing">CIDR address</a> that covers the address range from which you are operating. I used this when configuring my security groups.<br />
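To make the CIDR suggestion concrete, here is a small self-contained shell sketch of the containment test a security group rule performs: it checks whether an originating IP address falls inside a CIDR block. The addresses are documentation examples (RFC 5737), not real hosts.

```shell
#!/bin/sh
# Sketch: check whether an originating IP address falls inside a CIDR block.
# This is the containment test a security group rule performs.

ip_to_int() {
  # Convert a dotted-quad IPv4 address to a 32-bit integer.
  old_ifs=$IFS; IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

in_cidr() {
  # Usage: in_cidr IP NETWORK/PREFIX; succeeds if IP is inside the block.
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

in_cidr 203.0.113.17 203.0.113.0/24 && echo inside || echo outside   # prints "inside"
```

A /24 block like the one above admits any host in the same first three octets, which is typically what Amazon's guessed CIDR covers.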
<br />
<b>Setting Up Tungsten Replication</b><br />
<br />
It is possible to set up replication from a MySQL master directly to Amazon RDS using a single Tungsten Replicator process. However, it is more versatile and simpler to set up two replicators: one to read from the MySQL master and another to apply transactions to the Amazon RDS slave. I will demonstrate this configuration.<br />
<br />
We will assume you have a MySQL master already running and that it meets <a href="https://docs.continuent.com/wiki/display/TEDOC/System+Requirements">prerequisites for running Tungsten</a>. Let's now grab the Tungsten code and install a master replicator. You can get fresh builds from the <a href="http://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/index.html">Tungsten Replicator builds page</a>. <br />
<br />
We will take a recent replicator build that contains the RDS changes, which are documented in <a href="http://code.google.com/p/tungsten-replicator/issues/detail?id=425">Issue 425</a>. Use 2.0.7 build 177 or later. The main improvement is a non-privileged slave mode that avoids invoking any of the operations forbidden by Amazon. Among other things, Tungsten normally uses commands like 'SET SESSION SQL_LOG_BIN=0' to suppress writing to the binlog when it applies data on a slave. This command requires the SUPER privilege and hence fails for underprivileged RDS logins.<br />
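To see why the non-privileged mode matters, you can try the statement yourself against an RDS login. This fragment is illustrative only; the exact error text varies by MySQL version.

```shell
# Illustrative fragment: on a normal slave Tungsten suppresses binlogging
# with a statement like this, which requires the SUPER privilege.
# Against an RDS master login it is rejected with something like:
mysql -utungsten -p -htest.c4villnbpuq1.us-west-1.rds.amazonaws.com \
  -e "SET SESSION SQL_LOG_BIN=0"
# ERROR 1227 (42000): Access denied; you need ... SUPER privilege(s) ...
```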
<br />
Unpack the code and install the master replicator in /opt/continuent. This is no different from installing a normal Tungsten master. My master host is logos1. Here are sample commands to pull the code and set up the master. The example shows the minimum options; if you have MySQL installed in a non-standard location or otherwise differ from a stock installation, you may need to add options.<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mkdir ~/staging</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">cd ~/staging</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">wget --no-check-certificate https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.0.7-177.tar.gz</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">tar -xf </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">tungsten-replicator-2.0.7-177.tar.gz</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">tungsten-replicator-2.0.7-177/tools/tungsten-installer \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --master-slave \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --master-host=logos1 \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --datasource-user=tungsten \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --datasource-password=your_password \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --service-name=aws \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --home-directory=/opt/continuent \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --cluster-hosts=logos1 \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --start-and-report</span><br />
<br />
Next, set up the slave replicator. For convenience I am going to install the slave on a separate host, named logos2, to avoid port clashes between the two replicators. If you install on the same host, you will need to install into a different release directory and use the --rmi-port and --thl-port options to avoid port overlaps. Here is the command to set up the Amazon RDS slave. Note that the tungsten-installer program can install code across hosts, which is an extremely useful feature.<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">tungsten-replicator-2.0.7-177/tools</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">/tungsten-installer \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --master-slave \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --cluster-hosts=logos2 \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --master-host=logos1 \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --datasource-host=test.c4villnbpuq1.us-west-1.rds.amazonaws.com \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --datasource-user=tungsten \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --datasource-password=your_password \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --service-name=aws \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --slave-privileged-updates=false \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --home-directory=/opt/continuent \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --skip-validation-check=InstallerMasterSlaveCheck \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --skip-validation-check=MySQLPermissionsCheck \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --start-and-report</span><br />
<br />
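If you instead co-locate both replicators on a single host, the slave install might look like the following sketch. The second home directory and the port numbers are hypothetical, chosen only to avoid clashing with the master replicator's defaults.

```shell
# Hypothetical same-host layout: second replicator in its own home directory,
# with RMI and THL ports moved off the defaults used by the master replicator.
tungsten-replicator-2.0.7-177/tools/tungsten-installer \
  --master-slave \
  --cluster-hosts=logos1 \
  --master-host=logos1 \
  --datasource-host=test.c4villnbpuq1.us-west-1.rds.amazonaws.com \
  --datasource-user=tungsten \
  --datasource-password=your_password \
  --service-name=aws \
  --slave-privileged-updates=false \
  --home-directory=/opt/continuent2 \
  --rmi-port=10010 \
  --thl-port=2113 \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --skip-validation-check=MySQLPermissionsCheck \
  --start-and-report
```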
You may see a few warnings during the RDS installation because tungsten-installer cannot verify some settings on the Amazon RDS host. These can be ignored. If everything went well, you now have two replicators up and running. You can check the status of the master and slave using the trepctl command, as in:<br />
<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">/opt/continuent/tungsten/tungsten-replicator/bin/trepctl -host logos1 status</span></div>
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">/opt/continuent/tungsten/tungsten-replicator/bin/trepctl -host logos2 status</span></div>
</div>
<div>
<br /></div>
Both replicators should report that they are online. Now complete the exercise by proving that replication works end to end. We start by logging into the local MySQL instance, creating a new table in the test schema, and adding data.<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ mysql -uroot test</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Welcome to the MySQL monitor. Commands end with ; or \g.</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Your MySQL connection id is 231488</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Server version: 5.5.21-rel25.1-log Percona Server with XtraDB (GPL), Release rel25.1, Revision 234</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mysql> create table foo(id int primary key);</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Query OK, 0 rows affected (0.24 sec)</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mysql> insert into foo values (256);</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Query OK, 1 row affected (0.00 sec)</span><br />
<div>
<br /></div>
<div>
Now log in to the Amazon RDS instance and look for table foo. </div>
<div>
<br /></div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ mysql -utungsten -p -htest.c4villnbpuq1.us-west-1.rds.amazonaws.com test</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Enter password: </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Reading table information for completion of table and column names</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">You can turn off this feature to get a quicker startup with -A</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Welcome to the MySQL monitor. Commands end with ; or \g.</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Your MySQL connection id is 2161</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Server version: 5.5.27-log Source distribution</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mysql> select * from foo;</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">+-----+</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">| id |</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">+-----+</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">| 256 |</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">+-----+</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">1 row in set (0.01 sec)</span></div>
</div>
<div>
<br /></div>
<div>
Mission accomplished! We have real-time replication enabled from a MySQL master to Amazon RDS. At this point you can replicate more or less normally. There are some obvious limitations because Amazon RDS is locked down and does not grant our login full privileges. </div>
<div>
<ol>
<li>Temp table replication may not work. Tungsten depends on being able to issue commands of the form "set @@session.pseudo_thread_id=23531" and the like. This prevents clashes between temp tables of the same name on different master sessions. You may need to enable row replication on the master, which suppresses temp table replication. (For other approaches, see my previous <a href="http://scale-out-blog.blogspot.com/2012/04/replication-is-bad-for-mysql-temp.html">article on temp tables and the binlog</a>.) </li>
<li>Any command that requires the SUPER privilege will not work. As an obvious example, you will not be able to grant SUPER privilege to new accounts. Such commands will break replication. </li>
<li>All replicated commands go into the binlog, which is a potential performance drag and may slow down Amazon RDS slaves. Parallel replication may not help in this case, since committing to the binlog is a serialization point that blocks other transactions. This problem may be cured if Amazon picks up group commit fixes from MySQL 5.6 and/or MariaDB. </li>
</ol>
</div>
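If you hit the temp table limitation, switching the master to row-based logging is the simplest workaround. Here is a sketch; the my.cnf path varies by distribution and is an assumption.

```shell
# Switch the running master to row-based logging (affects new sessions only;
# requires SUPER, which we do have on our own master):
mysql -uroot -p -e "SET GLOBAL binlog_format='ROW'"

# Make the change permanent by adding this under the [mysqld] section of
# my.cnf (path varies, e.g. /etc/my.cnf or /etc/mysql/my.cnf):
#   binlog_format = ROW
```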
All things considered, however, these are minor inconveniences. Most applications should be able to replicate without difficulties, especially if the master transaction rate is not too high. <br />
<br />
<b>Configuring SSL for Connections to RDS</b><br />
<br />
In the previous demonstration I used a master host running outside Amazon. This means my test transactions traveled across the Internet, where they were visible to all and sundry along the way. To illustrate, we can run tcpdump and watch traffic as it goes by. <br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ sudo tcpdump -A -vvv -s 256 host test.c4villnbpuq1.us-west-1.rds.amazonaws.com</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">...</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">logos2.46657 > ec2-54-241-56-140.us-west-1.compute.amazonaws.com.mysql: Flags [P.], cksum 0xb0f7 (incorrect -> 0x995d), seq 3012:3073, ack 3412, win 94, options [nop,nop,TS val 89630514 ecr 74223746], length 61</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">E..qw>@.@......n6.8..A..4..L...f...^.......</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">.W.2.l..9....<b>insert into foo values (256) /* ___SERVICE___ = [aws] */</b></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">22:55:36.316049 IP (tos 0x8, ttl 51, id 26226, offset 0, flags [DF], proto TCP (6), length 63)</span><br />
<br />
If we were handling confidential data, exposing traffic like this to possible evildoers would be a serious problem. Fortunately, Amazon RDS supports SSL encrypted connections from clients. Here is how to use it with Tungsten. <br />
<br />
First, you need to get the Amazon RDS certificate, which is used to sign certificates for individual RDS instances. <br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mkdir /opt/continuent/certs</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">cd /opt/continuent/certs</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">wget https://rds.amazonaws.com/doc/mysql-ssl-ca-cert.pem</span><br />
<br />
Next, you need to create a trust store that Java can access, containing the certificates of the signing authorities you trust. For this you will need the <a href="http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/keytool.html">Java keytool utility</a>, which is included in the JDK. If your production hosts have only the Java runtime, generate the store on another host with a JDK and copy it over. I used the password "secret" in this example. <br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ keytool -import -alias rds -file mysql-ssl-ca-cert.pem -keystore truststore</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Enter keystore password: </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Re-enter new password: </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Owner: CN=aws.amazon.com/rds/, OU=RDS, O=Amazon.com, L=Seattle, ST=Washington, C=US</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Issuer: CN=aws.amazon.com/rds/, OU=RDS, O=Amazon.com, L=Seattle, ST=Washington, C=US</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">...</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Trust this certificate? [no]: yes</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Certificate was added to keystore</span><br />
<div>
<br /></div>
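Before wiring the trust store into the replicator, it is worth verifying its contents. A quick check with keytool, using the example password from above:

```shell
# List the certificates in the trust store to confirm the RDS CA was imported.
keytool -list -keystore /opt/continuent/certs/truststore -storepass secret
# Expect a single trustedCertEntry with the Amazon RDS owner/issuer shown earlier.
```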
<div>
We now need to tell the slave replicator about the truststore file using Java VM options. On the slave host, edit /opt/continuent/tungsten/tungsten-replicator/conf/wrapper.conf and add the extra options shown in <b><u>bold face</u></b>. </div>
<div>
<br /></div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"># Java Additional Parameters</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">wrapper.java.additional.1=-Dreplicator.home.dir=../../tungsten-replicator/</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">wrapper.java.additional.2=-Dreplicator.log.dir=../../tungsten-replicator/log</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">wrapper.java.additional.3=-Dcom.sun.management.jmxremote</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><u><b>wrapper.java.additional.4=-Djavax.net.ssl.trustStore=/opt/continuent/certs/truststore</b></u></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><u><b>wrapper.java.additional.5=-Djavax.net.ssl.trustStorePassword=secret</b></u></span></div>
</div>
<div>
<br /></div>
<div>
The last step is to enable SSL encryption when applying data. We need to set an extra URL option on the drizzle JDBC driver to turn on SSL. For this we edit the static properties file that configures the replication service; in my example this file is located at /opt/continuent/tungsten/tungsten-replicator/conf/static-aws.properties. Open the file, look for the section that starts with APPLIERS, and add an additional urlOptions line as shown below. </div>
<div>
<br /></div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">############</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"># APPLIERS #</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">############</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">replicator.applier.dbms=com.continuent.tungsten.replicator.applier.MySQLDrizzleApplier</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">replicator.applier.dbms.host=${replicator.global.db.host}</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">replicator.applier.dbms.port=${replicator.global.db.port}</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">replicator.applier.dbms.user=${replicator.global.db.user}</span></div>
<div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">replicator.applier.dbms.password=${replicator.global.db.password}</span></div>
</div>
</div>
</div>
<div>
<div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b><u>replicator.applier.dbms.urlOptions=?useSSL=true</u></b></span></div>
</div>
</div>
</div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">replicator.applier.dbms.ignoreSessionVars=autocommit</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">replicator.applier.dbms.getColumnMetadataFromDB=true</span></div>
</div>
<div>
<br /></div>
<div>
Restart the replicator process (/opt/continuent/tungsten/tungsten-replicator/bin/replicator restart) and you will now be using SSL encryption. If we look back at the tcpdump output, it now looks like garbage, as the following example shows. </div>
<div>
<br /></div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">...</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">23:46:52.370904 IP (tos 0x0, ttl 64, id 5899, offset 0, flags [DF], proto TCP (6), length 105)</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> logos2.44717 > ec2-54-241-56-140.us-west-1.compute.amazonaws.com.mysql: Flags [P.], cksum 0xb0ef (incorrect -> 0xd834), seq 3087:3140, ack 4838, win 102, options [nop,nop,TS val 89938121 ecr 74992771], length 53</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">E..i..@.@.r....n6.8......zSR.b.....f.......</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">.\X..xL.....07..|..x.)...T..888.H/...iz...^.W8....'......<span class="Apple-tab-span" style="white-space: pre;"> </span>..J</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">...</span></div>
<div>
</div>
</div>
<div>
<br /></div>
<div>
This is much better. We are replicating to Amazon RDS, and the transactions are safe from prying eyes. If you have gotten this far, you are ready to try your own applications. </div>
<div>
<br /></div>
<div>
<b>Benefits of Replication into Amazon RDS</b></div>
<div>
<br /></div>
<div>
Amazon RDS is convenient thanks to its quick and simple setup, but the lack of replication is a severe limitation when building systems that need more than a single MySQL instance. In particular, it is hard to integrate RDS into architectures that consist of more than Amazon RDS itself. Adding the ability to replicate in real time into RDS therefore has a number of benefits, the most obvious being the ability to extend existing systems.<br />
<br /></div>
<div>
First, Amazon RDS can offer a quick way to add read capacity to existing MySQL applications. This is especially useful if you have a cluster, such as Tungsten, which handles your transaction processing and overall HA. You can now add Amazon RDS read slaves that you discard when no longer needed. Tungsten Replicator has a number of other useful features like the ability to read from a group of nodes, not just one, that make such topologies easy to set up and maintain. Clusters other than Tungsten will likewise benefit from this feature. </div>
<div>
<br />
Second, Amazon RDS is suitable for applications that do not need 24x7 high availability (its limitations include slow failover, no online maintenance, and no cross-cloud capabilities). You can now pull data from other sources and send it to Amazon RDS slaves for processing, which amounts to extending overall processing capacity. For example, you could use RDS to run back-office tasks using transactions replicated in from MySQL masters. Tungsten Replicator also replicates data from Oracle, so that is an additional source of transactions. </div>
<div>
<br /></div>
<div>
There are of course other ways to replicate data to and from RDS, for example using batch ETL tools like <a href="http://www.talend.com/">Talend</a>. However, these are not real-time and often require application changes to add timestamp columns or otherwise mark transactions that need to be extracted. Log-based replication as implemented by Tungsten is fast and has minimal impact on applications or MySQL itself. </div>
<div>
<br /></div>
<div>
<b>Thoughts about Further Improvements for Amazon RDS Replication</b></div>
<div>
<br />
On our side, i.e., at Continuent, we need to do more testing, add documentation, and fix problems as they arise. In the next few days we are starting a beta test with the customer who originally requested this feature after hacking it together for themselves. RDS also has some interesting provisioning capabilities that I would like to understand better. We are also adding options that eliminate the need for manual configuration of security settings. This will keep us busy for a while.<br />
<br />
Other improvements depend on changes to RDS itself. An obvious and huge improvement would be to permit replication <u><i>out</i></u> of Amazon RDS. Unfortunately, Tungsten needs a login that has the REPLICATION SLAVE privilege so that we can download binlog data. That privilege is not yet available to Amazon users. Once it is, Tungsten extraction will also work in very short order. We do not actually need other commands like START/STOP SLAVE or FLUSH LOGS; we just need the ability to issue a <a href="http://dev.mysql.com/doc/internals/en/replication-protocol.html#com-binlog-dump">COM_BINLOG_DUMP</a> command from a client connection and receive binlog records. I am sure other products would use this capability as well. (RDS developers, if you are listening, here's an easy way to extend your product's usability significantly...)</div>
<div>
<br /></div>
<div>
Replication is such a valuable feature of MySQL that Amazon RDS feels somewhat crippled without it. For this reason I would imagine that Amazon <i>will</i> open up additional capabilities in the future. Until then we will polish up replication from MySQL masters to Amazon RDS slaves, awaiting a time when we can add more features. </div>
<div>
<br /></div>
<div>
In the meantime, I hope you will try the new replication to Amazon RDS. As noted in this article you can grab the latest builds and try it yourself. Please report your experiences through either Continuent Support if you are a customer or the <a href="http://groups.google.com/group/tungsten-replicator-discuss">Tungsten discussion list</a> if you use the <a href="http://code.google.com/p/tungsten-replicator/">open source Tungsten Replicator</a>. I look forward to your feedback and suggestions for making Amazon RDS support better. </div>
Robert Hodges (http://www.blogger.com/profile/05379726998057344092)

<b>Tungsten University</b> (2013-01-09)

We have started a new series of webinars at <a href="http://www.continuent.com/">Continuent</a> that we call Tungsten University. They provide education on <a href="http://www.continuent.com/solutions">Tungsten clustering and replication</a> in handy one-hour chunks. These are not sales pitches. Our goal is to provide accessible education about setting up and operating Tungsten without any marketing fluff. <br />
<div>
<br />
<div>
The first Tungsten University webinar entitled "<a href="http://www.continuent.com/news/live-webinars">Configure & provision Tungsten clusters</a>" will take place on Thursday January 17th at 10:00 PST. It will show you how to set up a cluster in Amazon EC2. There will be a repeat on January 22nd at 15:00 GMT. We usually record webinars, so you can look at them later as well. </div>
</div>
<div>
<br /></div>
<div>
You do not have to be a customer to attend these webinars, just interested in Tungsten. I hope users of our <a href="http://code.google.com/p/tungsten-replicator/">open source Tungsten Replicator</a> will attend, since we will have a number of presentations on replication. Here are some of the future webinar topics we are considering: </div>
<div>
<ul>
<li>Setting up, deploying, and upgrading Tungsten Replicator </li>
<li>Setting up multi-master and fan-in replication topologies</li>
<li>Configuring Tungsten Connector for transparent SQL routing and load balancing</li>
<li>Replication tips and tricks (such as how to improve performance and fix broken replicators)</li>
<li>Implementing zero-downtime maintenance and schema upgrade</li>
<li>Replicating between MySQL and Oracle</li>
<li>Loading MySQL data into a Vertica data warehouse</li>
</ul>
</div>
<div>
There will be more titles out soon, so watch our <a href="http://www.continuent.com/news/live-webinars">webinar list</a> and the announcements on the <a href="http://continuent-tungsten.blogspot.com/">Continuent Tungsten</a> blog. If there are topics that you would like to hear about please suggest them as comments on this blog or the official Continuent blog. </div>
<div>
<br /></div>
<div>
Meanwhile, I am looking forward to doing some of the replicator presentations and attending the talks on clustering. I mostly write code for the replicator, so talks about other parts of Tungsten tend to be learning experiences. The <a href="https://docs.continuent.com/wiki/display/TEDOC/Using+the+Tungsten+Connector">Tungsten Connector</a> is particularly interesting because it makes off-the-shelf database replicas look like a single DBMS server and transparently switches connections between them. It is our secret sauce for creating master/slave clusters. If you have not seen Tungsten Connector before, I recommend attending that talk when it comes up. I'm still amazed how well it functions even after working with it for a number of years. </div>
Robert Hodges (http://www.blogger.com/profile/05379726998057344092)

<b>Questions about MariaDB JDBC Driver</b> (2013-01-01)

The <a href="http://blog.mariadb.org/monty-program-skysql-release-the-mariadb-client-library-for-c-and-mariadb-client-library-for-java-applications/">recent release of the MariaDB client libraries</a> has prompted <a href="http://www.xaprb.com/blog/2012/12/26/the-state-of-mysql-client-libraries/">questions</a> about their purpose as well as their provenance. Colin Charles posted that <a href="http://www.bytebot.net/blog/archives/2012/12/31/a-few-great-weeks-for-mariadb">some of these would be answered in the very near future</a>. I have a couple of specific questions about the MariaDB JDBC driver, which I hope will be addressed at that time. <div>
<br /></div>
<div>
<div>
1.) <u>What is really in the MariaDB JDBC driver and how exactly does it differ from the drizzle JDBC driver?</u> What, if any, relation is there to Connector/J code? There is a <a href="https://mariadb.atlassian.net/browse/CONJ">JIRA project</a> but it contains only four bugs, hence is not very informative. The <a href="http://bazaar.launchpad.net/~maria-captains/mariadb-java-client/trunk/files">launchpad bzr history</a> shows detailed check-ins but not overall intent. </div>
<div>
<br /></div>
<div>
2.) <u>Why relicense from BSD to LGPL?</u> I have checked the class headers and so far as attributions are concerned everything seems to be done quite properly. However, the license change appears to prevent those of us currently using the drizzle JDBC driver from transferring code changes back to the drizzle driver. If so, that seems a little unneighborly. </div>
<div>
<br /></div>
<div>
<div>
Here is some background on the relationship between the drivers. The <a href="https://launchpad.net/mariadb-java-client">MariaDB JDBC client</a> is a fork of the <a href="https://github.com/krummas/DrizzleJDBC">BSD-licensed drizzle JDBC driver</a> originally developed by <a href="http://developian.blogspot.com/">Marcus Eriksson</a>, who continues to maintain the code. According to the bzr change history the code forked after <a href="http://bazaar.launchpad.net/~maria-captains/mariadb-java-client/trunk/revision/253">rev 253</a>, which was 24 April 2011. There are still many similarities in the Java classes. For instance, a number of classes in the org.mariadb.jdbc.internal.common package differ by little other than licensing headers and package names. The MariaDB code is now up to rev 375 and includes substantial changes that appear to be designed to bring the MariaDB JDBC driver closer to the capabilities of the <a href="http://dev.mysql.com/downloads/connector/j/">MySQL Connector/J driver</a>. </div>
</div>
<div>
<br /></div>
<div>
At <a href="http://www.continuent.com/">Continuent</a> we have a lively interest in the drizzle JDBC driver, as we adopted it for <a href="http://code.google.com/p/tungsten-replicator/">Tungsten Replicator</a> some time ago. The code had fewer bugs than Connector/J, which was attractive. More importantly, Marcus kindly accepted a patch from my colleague Stephane Giron (working as a Continuent employee) that made it easy for us to send queries using binary data rather than the usual Unicode data required by the JDBC standard. This fix allows Tungsten to replicate codesets and binary data correctly. We have since contributed a few other patches. Our modest contribution in part reflects the quality of the base code. </div>
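The codeset problem behind that patch is easy to reproduce without any driver at all. This is not Tungsten's or the drizzle driver's actual code path, just a generic pure-Python illustration of why a binary path matters: forcing bytes through the Unicode decode/encode round trip that a strings-only driver requires can silently corrupt payloads that are invalid in the assumed charset.

```python
# Binary payload containing bytes that are not valid UTF-8.
payload = bytes([0x00, 0xff, 0xfe, 0x80])

# A driver that only accepts Unicode strings must decode the bytes first;
# with a lossy decode, every invalid byte becomes U+FFFD (the replacement
# character) and the original payload can no longer be recovered.
as_text = payload.decode("utf-8", errors="replace")
round_tripped = as_text.encode("utf-8")

assert round_tripped != payload  # the data was corrupted in transit
print(payload, "->", round_tripped)
```

A driver that accepts raw bytes end to end sidesteps this entirely, which is why the binary-statement patch mattered for replicating arbitrary codesets.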
<div>
<br /></div>
<div>
<div>
While waiting for answers I would like to commend Marcus as well as other drizzle contributors for their work. We are particularly indebted to Marcus for starting and continuing the drizzle JDBC project. Tungsten Replicator users have applied many trillions of transactions using the drizzle driver. If the MariaDB JDBC driver gains wide acceptance, the rest of the MySQL community owes Marcus Eriksson substantial thanks as well. </div>
</div>
</div>
Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com2tag:blogger.com,1999:blog-768233104244702633.post-7768615589760462582012-12-27T09:25:00.002-08:002012-12-27T09:25:17.423-08:00The MySQL Community: Beleaguered or Better than Ever?<div>
The <a href="https://blog.mariadb.org/mariadb-foundation-to-safeguard-leading-open-source-database/">MariaDB Foundation</a> announcement spawned some interesting commentary about the state of open source databases. One recent headline cited the "<a href="http://www.infoworld.com/d/open-source-software/the-mariadb-foundation-turning-point-mysql-209168">beleaguered MySQL community</a>." Beleaguered is a delightful adjective. The OED tells us that it means beset, invested, or besieged. Much as I like the word, I do not think it is an accurate or useful description of the MySQL community. This article and <a href="http://gigaom.com/cloud/facebook-trapped-in-mysql-fate-worse-than-death/">others like it</a> miss the point of what is happening to MySQL and its users.<br />
<br /></div>
<div>
Let's start by disproving the notion that the MySQL community is beleaguered. I don't know everyone who uses MySQL, but in my job I talk to numerous companies that have made sizable investments in MySQL and stand to lose big if they are wrong. They do not seem especially nervous.<br />
<br /></div>
<div>
1. <b>Nobody seriously questions MySQL viability.</b> I have yet to meet a manager with a substantial business on MySQL who is deeply worried about it disappearing or being ruined by Oracle. They are too busy working on software upgrades or keeping their sites running. The future of MySQL is well down the list of problems keeping them awake at night. </div>
<div>
<br /></div>
<div>
2. <b>MySQL meets or beats the immediate alternatives. </b> There is of course discussion about dropping MySQL for PostgreSQL, but it is mostly <a href="http://developers.slashdot.org/story/12/08/18/0152237/is-mysql-slowly-turning-closed-source">idle talk</a>. I'm sure some companies have switched (actually in both directions), but I have not seen a single customer migrate a working business app from MySQL to PostgreSQL. Once you get past the religion, it's clear MySQL and PostgreSQL are just too similar to supplant each other easily: reliable, row-based stores with single-threaded SQL query engines that handle a few terabytes of data at most. Companies need far stronger reasons to switch to something new, especially given the large ecosystem and deep pool of MySQL expertise. </div>
<div>
<br />
3. <b>MySQL is not the only game in town.</b> Virtually every large web site I know uses at least one NoSQL store alongside MySQL. Column stores are increasingly common for data warehouses. Production Hadoop clusters are no longer a novelty. On the surface this might look like a failure of MySQL. What's really happening in many cases is that small businesses that started on MySQL are now large, profitable enterprises that require more than just economical OLTP. This is a mark of success, not a deficiency. <br />
<br />
If this is what beleaguered looks like I can't wait to see something that's actually successful. <br />
<br />
Turning the argument around, can we say that the MySQL community is better than it was? In at least one important way, yes. The community is now <u>multi-polar</u>. MySQL long benefitted from having a large community of open source users to find bugs, help focus development direction, and construct a wide range of robust tools like language bindings. However, innovation on MySQL itself was largely gated by a single company: MySQL AB. Multiple groups are now competing to improve MySQL, and it's a very good thing for users. Let me count the ways. <br />
<br />
There are three major versions of MySQL: <a href="http://www.mysql.com/">Oracle</a>, <a href="http://www.percona.com/software/percona-server">Percona</a>, and <a href="https://mariadb.org/">MariaDB</a>, not to mention cloud-only versions like <a href="http://aws.amazon.com/rds/">Amazon RDS</a>. There are at least four companies working directly on major upgrades to replication: <a href="https://docs.continuent.com/wiki/display/TEDOC/Tungsten+Replication+Guide">Continuent</a>, <a href="http://dev.mysql.com/tech-resources/articles/mysql-5.6-replication.html">Oracle</a>, <a href="http://codership.com/">Codership</a>, and <a href="https://kb.askmonty.org/en/multi-source-replication/">Monty Program</a>. Oracle is continuing to make improvements in InnoDB like <a href="http://code.openark.org/blog/mysql/state-of-inndb-online-ddl-in-mysql-5-6-9-rc-good-news-included">online schema change</a> and multi-core scaling, efforts that are complemented by Percona's <a href="http://www.mysqlperformanceblog.com/">persistent focus on all aspects of performance</a>. Aside from Amazon RDS, all of this work is available in open source, and there is an unusual degree of sharing across otherwise competitive groups. I could keep going for a while, but to be frank there is so much that it's hard to track all the improvements or give them their proper due.<br />
<br />
The MySQL community is therefore competitive in a way that did not exist a few years ago. That's good, because innovation in data management is no longer centered on the web-facing applications that MySQL helped enable. Businesses are grappling with massive data volumes that far exceed the capacity of single DBMS servers while simultaneously moving to Amazon or VMware. There is a whole new set of problems such as deploying in unstable cloud environments, adjusting to <a href="http://martinfowler.com/articles/nosql-intro.pdf">polyglot persistence</a>, managing sharded data effectively, distributing data across multiple regions, and enabling real-time analytics on MySQL transactions. As a group, the MySQL community is well-positioned to address them. <br />
<br />
If there is a problem, it is how to keep a strong multi-polar community going for as long as possible. Competition creates uncertainty for users, because change is a given. <a href="http://en.wikipedia.org/wiki/Pointy-haired_Boss">Pointy-haired bosses</a> have to make decisions with incomplete information or even reverse them later. Competition is hard for vendors, because it is more difficult to make money in efficient markets. Competition even strikes against the vanity of community contributors, who have to try harder to get recognition. It is clear there will be pressures to make the community less competitive. They won't necessarily be from Oracle, which thrives on competition.<br />
<br /></div>
<div>
This gets back to the MariaDB Foundation reference that started this article. Anything that ensures long-term competitiveness and vitality of MySQL is good. Foundations in general seem well suited to this task. At Continuent we have already had some discussions about joining. So far we are undecided, for reasons that are somewhat similar to <a href="http://www.mysqlperformanceblog.com/2012/12/19/percona-and-the-mariadb-foundation/">Peter Zaitsev's comments on this subject</a>. If the MariaDB Foundation helps maintain a stable multi-polar community, we're in. </div>
Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com11tag:blogger.com,1999:blog-768233104244702633.post-2715466543073633462012-12-10T18:18:00.000-08:002012-12-10T18:18:20.814-08:00Slides from Percona Live London and a RequestPercona hosted another excellent <a href="http://www.percona.com/live/london-2012/">Percona Live conference this past December 3-4</a> in London. It was my pleasure to deliver 3 talks including the first keynote following Peter Zaitsev. Percona does a great job of organizing these conferences--this year's London conference was well attended and in an excellent location in Kensington. My thanks to the entire Percona team for putting this together.<br />
<br />
Here are the slides for my talks in case you would like to see them.<br />
<br />
Keynote: <a href="https://continuent.box.com/s/a2rsn68rggu6snak4fxn">Future-Proofing MySQL for the World-Wide Data Revolution</a> -- Covering the greatly exaggerated death of MySQL and design patterns for robust MySQL systems that can last for decades<br />
<br />
Talk: <a href="https://continuent.box.com/s/vhvp15jkfy1ckpxds798">Why, What, and How of Data Warehouses for MySQL</a> -- Why you need a data warehouse for MySQL, some standard choices, and how to move data in real time from MySQL to Vertica. There was even a demo of sysbench data replicating into Vertica automatically. <br />
<br />
Talk: <a href="https://continuent.box.com/s/xhnn9n9l0d0557da4464">Multi-Master, Multi-Site MySQL Databases Made Easy with Continuent Tungsten</a> -- How to build clusters that span multiple sites using multi-master and primary/DR techniques. I did this with <a href="http://datacharmer.blogspot.com/">Giuseppe Maxia</a>, who did a couple of great demos along the way, including one that I found kind of terrifying to do in front of an audience. (It involved killing a lot of database servers, something Giuseppe does for a living.)<br />
<br />
Speaking of talks, there are many conferences with database tracks coming up in the next few months. If you have not done a talk on MySQL before I would encourage you to think about submitting for upcoming conferences like future <a href="http://www.percona.com/live/conferences">Percona Live events</a>, <a href="http://www.oscon.com/oscon2012">OSCON</a>, <a href="http://strataconf.com/strata2013">O'Reilly Strata</a>, or one of the many local meet-ups. There are many people doing interesting things with open source databases. It's great to hear your stories, so speak up!Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com0tag:blogger.com,1999:blog-768233104244702633.post-2379112238587917892012-09-21T11:47:00.000-07:002012-09-21T13:04:19.850-07:00Data Fabrics and Other Tales: Percona Live and MySQL ConnectThe fall conference season is starting. I will be doing a number of talks including a keynote on <a href="http://www.percona.com/live/nyc-2012/sessions/future-proofing-mysql-world-wide-data-revolution">"future proofing" MySQL through the use of data fabrics</a>. Data fabrics allow you to build durable, long-lasting systems that take advantage of MySQL's strengths today but also evolve to solve future problems using fast-changing cloud and big data technologies. The talk brings together ideas that Ed Archibald (our CTO) and I have been working on for over two decades. I'm looking forward to rolling them out to a larger crowd. <br />
<br />
Here are the talks in calendar order. The first two are at <a href="http://www.oracle.com/mysqlconnect/learn/agenda/index.html">MySQL Connect 2012</a> in San Francisco on September 30th: <br />
<ul>
<li><a href="https://oracleus.activeevents.com/connect/sessionDetail.ww?SESSION_ID=11591">CON11591 - Managing Worldwide Data with MySQL and Continuent Tungsten</a> (Tech talk)</li>
<li><a href="https://oracleus.activeevents.com/connect/sessionDetail.ww?SESSION_ID=9319">CON9319 - Replicating from MySQL to Oracle Database and Back Again</a> (Tech talk)</li>
</ul>
<a href="http://www.oracle.com/mysqlconnect/index.html">MySQL Connect</a> is an add-on to Oracle Open World. You know the conference is big if they have to use 5-digit codes to keep track of talk titles. It's almost worth the price of admission <a href="http://oracle-magician.blogspot.com/2011/10/my-review-oracle-openworld-2011.html">to look at Larry Ellison's boat</a>. Well maybe not quite, but you get the idea.<br />
<br />
Next up is the <a href="http://www.percona.com/live/nyc-2012/">Percona Live MySQL Conference in New York</a> on October 2nd:
<br />
<ul>
<li><a href="http://www.percona.com/live/nyc-2012/sessions/future-proofing-mysql-world-wide-data-revolution">Future-Proofing MySQL for the Worldwide Data Revolution</a> (Keynote) </li>
<li><a href="http://www.percona.com/live/nyc-2012/sessions/solving-large-scale-database-administration-continuent-tungsten">Solving Large-Scale Database Administration with Tungsten</a> (Tech talk with Neil Armitage) </li>
</ul>
Percona has been doing an amazing job of organizing conferences. This will be my fifth one. The previous four were great. If you are in the New York area and like MySQL, sign up now. This conference is the single best way to get up to speed on state-of-the-art MySQL usage.Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com0tag:blogger.com,1999:blog-768233104244702633.post-27767299938321949952012-09-19T16:09:00.001-07:002012-09-19T16:09:42.912-07:00Database Failure Is Not the Biggest Availability ProblemThere have been a number of excellent articles about the pros and cons of automatic database failover triggered by <a href="http://www.xaprb.com/blog/2012/09/17/is-automated-failover-the-root-of-all-evil/">Baron's post</a> on the <a href="https://github.com/blog/1261-github-availability-this-week">GitHub database outage</a>. In the spirit of Peter Zaitsev's article "<a href="http://www.mysqlperformanceblog.com/2012/09/18/the-math-of-automated-failover/">The Math of Automated Failover</a>," it seems like a good time to point out that database failure is usually not the biggest source of downtime for websites or indeed applications in general. The real culprit is maintenance. <br />
<div>
<br /></div>
<div>
Here is a simple table showing availability numbers out to 5 nines and what they mean in terms of monthly downtime. <br />
<br />
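The monthly figures follow directly from the definition: downtime is (1 − availability) × the minutes in a month. A quick sketch of the arithmetic, assuming a 30-day month:

```python
# Monthly downtime implied by each availability level,
# assuming a 30-day (43,200-minute) month.
MINUTES_PER_MONTH = 30 * 24 * 60

def monthly_downtime_minutes(availability_pct):
    """Return minutes of downtime per month for a given availability %."""
    return (1 - availability_pct / 100.0) * MINUTES_PER_MONTH

for label, pct in [("two nines", 99.0), ("three nines", 99.9),
                   ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {monthly_downtime_minutes(pct):.2f} minutes/month")
```

Each added nine cuts the allowed downtime by a factor of ten, which is why five nines leaves well under a minute per month.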
<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:AllowPNG/>
</o:OfficeDocumentSettings>
</xml><![endif]-->
<!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:TrackMoves/>
<w:TrackFormatting/>
<w:PunctuationKerning/>
<w:ValidateAgainstSchemas/>
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:DoNotPromoteQF/>
<w:LidThemeOther>EN-US</w:LidThemeOther>
<w:LidThemeAsian>JA</w:LidThemeAsian>
<w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
<w:DontGrowAutofit/>
<w:SplitPgBreakAndParaMark/>
<w:EnableOpenTypeKerning/>
<w:DontFlipMirrorIndents/>
<w:OverrideTableStyleHps/>
<w:UseFELayout/>
</w:Compatibility>
<m:mathPr>
<m:mathFont m:val="Cambria Math"/>
<m:brkBin m:val="before"/>
<m:brkBinSub m:val="--"/>
<m:smallFrac m:val="off"/>
<m:dispDef/>
<m:lMargin m:val="0"/>
<m:rMargin m:val="0"/>
<m:defJc m:val="centerGroup"/>
<m:wrapIndent m:val="1440"/>
<m:intLim m:val="subSup"/>
<m:naryLim m:val="undOvr"/>
</m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
DefSemiHidden="true" DefQFormat="false" DefPriority="99"
LatentStyleCount="276">
<w:LsdException Locked="false" Priority="0" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Normal"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="heading 1"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 2"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 3"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 4"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 5"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 6"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 7"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 8"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 9"/>
<w:LsdException Locked="false" Priority="39" Name="toc 1"/>
<w:LsdException Locked="false" Priority="39" Name="toc 2"/>
<w:LsdException Locked="false" Priority="39" Name="toc 3"/>
<w:LsdException Locked="false" Priority="39" Name="toc 4"/>
<w:LsdException Locked="false" Priority="39" Name="toc 5"/>
<w:LsdException Locked="false" Priority="39" Name="toc 6"/>
<w:LsdException Locked="false" Priority="39" Name="toc 7"/>
<w:LsdException Locked="false" Priority="39" Name="toc 8"/>
<w:LsdException Locked="false" Priority="39" Name="toc 9"/>
<w:LsdException Locked="false" Priority="35" QFormat="true" Name="caption"/>
<w:LsdException Locked="false" Priority="10" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Title"/>
<w:LsdException Locked="false" Priority="1" Name="Default Paragraph Font"/>
<w:LsdException Locked="false" Priority="11" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtitle"/>
<w:LsdException Locked="false" Priority="22" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Strong"/>
<w:LsdException Locked="false" Priority="20" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Emphasis"/>
<w:LsdException Locked="false" Priority="59" SemiHidden="false"
UnhideWhenUsed="false" Name="Table Grid"/>
<w:LsdException Locked="false" UnhideWhenUsed="false" Name="Placeholder Text"/>
<w:LsdException Locked="false" Priority="1" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="No Spacing"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 1"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 1"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 1"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 1"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 1"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 1"/>
<w:LsdException Locked="false" UnhideWhenUsed="false" Name="Revision"/>
<w:LsdException Locked="false" Priority="34" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="List Paragraph"/>
<w:LsdException Locked="false" Priority="29" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Quote"/>
<w:LsdException Locked="false" Priority="30" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Quote"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 1"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 1"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 1"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 1"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 1"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 1"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 1"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 1"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 2"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 2"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 2"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 2"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 2"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 2"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 2"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 2"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 2"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 2"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 2"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 2"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 2"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 2"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 3"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 3"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 3"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 3"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 3"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 3"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 3"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 3"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 3"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 3"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 3"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 3"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 3"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 3"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 4"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 4"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 4"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 4"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 4"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 4"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 4"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 4"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 4"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 4"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 4"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 4"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 4"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 4"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 5"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 5"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 5"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 5"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 5"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 5"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
</w:LatentStyles>
</xml><![endif]-->
<table border="0" cellpadding="0" cellspacing="0" class="MsoNormalTable" style="border-collapse: collapse; margin-left: -1.25pt; mso-padding-alt: 0in 5.4pt 0in 5.4pt; mso-yfti-tbllook: 1184; width: 258px;">
<tbody>
<tr style="height: 15.0pt; mso-yfti-firstrow: yes; mso-yfti-irow: 0;">
<td nowrap="" style="background: black; border-bottom: solid windowtext 1.0pt; border-left: solid black 1.0pt; border-right: none; border-top: solid black 1.0pt; height: 15.0pt; mso-border-bottom-alt: solid windowtext .5pt; mso-border-left-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 84.0pt;" valign="bottom" width="84">
<div align="right" class="MsoNormal" style="text-align: right;">
<b><span style="color: white; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">Uptime<o:p></o:p></span></b></div>
</td>
<td nowrap="" style="background: black; border-bottom: solid windowtext 1.0pt; border-left: none; border-right: solid black 1.0pt; border-top: solid black 1.0pt; height: 15.0pt; mso-border-bottom-alt: solid windowtext .5pt; mso-border-right-alt: solid black .5pt; mso-border-top-alt: solid black .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 174.0pt;" valign="bottom" width="174">
<div align="right" class="MsoNormal" style="text-align: right;">
<b><span style="color: white; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">Downtime per 30-Day Month<o:p></o:p></span></b></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 1;">
<td nowrap="" style="border-top: none; border: solid windowtext 1.0pt; height: 15.0pt; mso-border-alt: solid windowtext .5pt; mso-border-top-alt: solid windowtext .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 84.0pt;" valign="bottom" width="84">
<div align="right" class="MsoNormal" style="text-align: right;">
<span style="color: black; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">0.9<o:p></o:p></span></div>
</td>
<td nowrap="" style="border-bottom: solid windowtext 1.0pt; border-left: none; border-right: solid windowtext 1.0pt; border-top: none; height: 15.0pt; mso-border-alt: solid windowtext .5pt; mso-border-left-alt: solid windowtext .5pt; mso-border-top-alt: solid windowtext .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 174.0pt;" valign="bottom" width="174">
<div align="right" class="MsoNormal" style="text-align: right;">
<span style="color: black; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">3 days<o:p></o:p></span></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 2;">
<td nowrap="" style="border-top: none; border: solid windowtext 1.0pt; height: 15.0pt; mso-border-alt: solid windowtext .5pt; mso-border-top-alt: solid windowtext .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 84.0pt;" valign="bottom" width="84">
<div align="right" class="MsoNormal" style="text-align: right;">
<span style="color: black; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">0.99<o:p></o:p></span></div>
</td>
<td nowrap="" style="border-bottom: solid windowtext 1.0pt; border-left: none; border-right: solid windowtext 1.0pt; border-top: none; height: 15.0pt; mso-border-alt: solid windowtext .5pt; mso-border-left-alt: solid windowtext .5pt; mso-border-top-alt: solid windowtext .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 174.0pt;" valign="bottom" width="174">
<div align="right" class="MsoNormal" style="text-align: right;">
<span style="color: black; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">07:12:00<o:p></o:p></span></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 3;">
<td nowrap="" style="border-top: none; border: solid windowtext 1.0pt; height: 15.0pt; mso-border-alt: solid windowtext .5pt; mso-border-top-alt: solid windowtext .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 84.0pt;" valign="bottom" width="84">
<div align="right" class="MsoNormal" style="text-align: right;">
<span style="color: black; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">0.999<o:p></o:p></span></div>
</td>
<td nowrap="" style="border-bottom: solid windowtext 1.0pt; border-left: none; border-right: solid windowtext 1.0pt; border-top: none; height: 15.0pt; mso-border-alt: solid windowtext .5pt; mso-border-left-alt: solid windowtext .5pt; mso-border-top-alt: solid windowtext .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 174.0pt;" valign="bottom" width="174">
<div align="right" class="MsoNormal" style="text-align: right;">
<span style="color: black; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">00:43:12<o:p></o:p></span></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 4;">
<td nowrap="" style="border-top: none; border: solid windowtext 1.0pt; height: 15.0pt; mso-border-alt: solid windowtext .5pt; mso-border-top-alt: solid windowtext .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 84.0pt;" valign="bottom" width="84">
<div align="right" class="MsoNormal" style="text-align: right;">
<span style="color: black; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">0.9999<o:p></o:p></span></div>
</td>
<td nowrap="" style="border-bottom: solid windowtext 1.0pt; border-left: none; border-right: solid windowtext 1.0pt; border-top: none; height: 15.0pt; mso-border-alt: solid windowtext .5pt; mso-border-left-alt: solid windowtext .5pt; mso-border-top-alt: solid windowtext .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 174.0pt;" valign="bottom" width="174">
<div align="right" class="MsoNormal" style="text-align: right;">
<span style="color: black; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">00:04:20<o:p></o:p></span></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 5; mso-yfti-lastrow: yes;">
<td nowrap="" style="border-top: none; border: solid windowtext 1.0pt; height: 15.0pt; mso-border-alt: solid windowtext .5pt; mso-border-top-alt: solid windowtext .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 84.0pt;" valign="bottom" width="84">
<div align="right" class="MsoNormal" style="text-align: right;">
<span style="color: black; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">0.99999<o:p></o:p></span></div>
</td>
<td nowrap="" style="border-bottom: solid windowtext 1.0pt; border-left: none; border-right: solid windowtext 1.0pt; border-top: none; height: 15.0pt; mso-border-alt: solid windowtext .5pt; mso-border-left-alt: solid windowtext .5pt; mso-border-top-alt: solid windowtext .5pt; padding: 0in 5.4pt 0in 5.4pt; width: 174.0pt;" valign="bottom" width="174">
<div align="right" class="MsoNormal" style="text-align: right;">
<span style="color: black; font-family: Calibri; mso-fareast-font-family: "Times New Roman";">00:00:26<o:p></o:p></span></div>
</td>
</tr>
</tbody></table>
<br />
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Now let's do some math. We start with Peter's suggested number that the DBMS fails once a year. Let's also say you take a while to wake up (because it's the middle of the night and you don't like automatic failover), figure out what happened, and run your failover procedure. You are back online in an hour. Amortized over the year, an hour of downtime is 5 minutes per month. Overall availability is close to 99.99% or four nines. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Five minutes per month is small potatoes compared to the time for planned maintenance. Let's say you allow yourself a one-hour maintenance window each month for DBMS schema changes, database version upgrades, and other work that takes the DBMS fully offline from applications. Real availability in this simple (and conservative) example is well below 99.9% or less than three nines. Maintenance accounts for over 90% of the downtime. The real key to improved availability is to be able to maintain the DBMS without taking applications offline. </div>
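The arithmetic above is easy to check. Here is a quick sketch in Python, using the figures assumed in the example (one unplanned hour per year, one planned maintenance hour per month):

```python
def downtime_per_month(availability, minutes_per_month=30 * 24 * 60):
    """Minutes of downtime allowed per 30-day month at a given availability."""
    return (1.0 - availability) * minutes_per_month

minutes_per_month = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# One unplanned hour per year, amortized: 5 minutes per month.
unplanned = 60.0 / 12

# One-hour planned maintenance window each month.
planned = 60.0

availability = 1.0 - (unplanned + planned) / minutes_per_month
print(round(availability * 100, 2))                     # 99.85: below three nines
print(round(planned / (unplanned + planned) * 100))     # 92: maintenance dominates
```

The same function reproduces the table above: `downtime_per_month(0.999)` gives 43.2 minutes, or 00:43:12.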
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
We have been very focused on the maintenance problem in Tungsten. Database replication is a good start for enabling rolling maintenance where you work on one replica at a time. In Tungsten the magic sauce is an intervening connectivity layer that can transparently switch connections between DBMS servers while applications are running. You can take DBMS servers offline and upgrade safely without bothering users. Planned failover of this kind is, I am happy to say, an easier problem to solve than bombproof automatic failover. It is also considerably more valuable for many users. </div>
</div>
Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com4tag:blogger.com,1999:blog-768233104244702633.post-84873253532500513212012-09-17T22:13:00.000-07:002012-09-17T22:22:35.175-07:00Automated Database Failover Is Weird but not EvilGithub had a <a href="https://github.com/blog/1261-github-availability-this-week">recent outage</a> due to malfunctioning automatic MySQL failover. Having worked on this problem for several years I felt sympathy but not much need to comment. Then Baron Schwartz wrote a short post entitled "<a href="http://www.xaprb.com/blog/2012/09/17/is-automated-failover-the-root-of-all-evil/">Is automated failover the root of all evil?</a>" OK, that seems worth a comment: it's not. Great title, though.<br />
<div>
<br />
<div>
Selecting automated database failover involves a trade-off between keeping your site up 24x7 and making things worse by having software do the thinking when humans are not around. When comparing outcomes of wetware vs. software it is worth remembering that humans are not at their best when woken up at 3:30am. Humans go on vacations, or their cell phones run out of power. Humans can commit devastating unforced errors due to inexperience. For these and other reasons, automated failover is the right choice for many sites even if it is not perfect. </div>
<div>
<br /></div>
<div>
Speaking of perfection, it is common to hear claims that automated database failover can never be trusted (such as <a href="http://support.rightscale.com/06-FAQs/FAQ_0059_-_How_do_I_set_up_autofailover_on_MySQL_databases%3F">this example</a>). For the most part such claims apply to particular implementations, not database failover in general. Even so, it is undoubtedly true that failover is complex and hard to get right. Here is a short list of things I have learned about failover from working on <a href="http://www.continuent.com/solutions/overview">Tungsten</a> and how I learned them. Tungsten clusters are master/slave, but you would probably derive similar lessons from most other types of clusters. </div>
<div>
<br /></div>
<div>
1. <b>Fail over once and only once</b>. Tungsten does so by electing a coordinator that makes decisions for the whole cluster. There are other approaches, but you need an algorithm that is provably sound. Good clusters stop when they cannot maintain the pre-conditions required for soundness, to which Baron's article alludes. (We got this right more or less from the start through work on other systems and reading lots of books about distributed systems.) </div>
<div>
<br /></div>
<div>
2. <b>Do not fail over unless the DBMS is truly dead</b>. The single best criterion for failure seems to be whether the DBMS server will accept a TCP/IP connection. Tests that look for higher brain function, such as running a SELECT, tend to generate false positives due to transient load problems like running out of connections or slow server responses. Failing over due to load is very bad as it can take down the entire cluster in sequence as load shifts to the remaining hosts. (We learned this through trial and error.) </div>
<div>
<br /></div>
<div>
3. <b>Stop if failover will not work, or better yet don't even start.</b> For example, Tungsten will not fail over if it does not have up-to-date slaves available. Tungsten will also try to get back to the original pre-failover state when failover fails, though that does not always work. We get credit for trying, I suppose. (We also learned this through trial and error.) </div>
<div>
<br /></div>
<div>
4. <b>Keep it simple.</b> People often ask why Tungsten does not resubmit transactions that are in-flight when a master failover occurs. The reason is that there are many reasons why resubmission does not work on a new master and it is difficult to predict when such failures will occur. Imagine you were dependent on a temp table, for example. Resubmitting just creates more ways for failover to fail. Tungsten therefore lets connections break and puts the responsibility on apps to retry failed transactions. (We learned this from previous products that did not work.) </div>
<div>
<br /></div>
<div>
Even if you start out with such principles firmly in mind, new failover mechanisms tend to encounter a lot of bugs. They are hard to find and fix because failover is not easy to test. Yet the real obstacle to getting automated failover right is not so much bugs as the unexpected nature of the problems clusters encounter. There is a great quote from <a href="http://en.wikipedia.org/wiki/J._B._S._Haldane">J.B.S. Haldane</a> about the nature of the universe that also gives a flavor of the mind-bending nature of distributed programming: </div>
<blockquote class="tr_bq">
<i>My own suspicion is that the universe is not only queerer than we suppose, but queerer than we can suppose. </i></blockquote>
<div>
I can't count the number of times where something misbehaved in a way that would never have occurred to me without seeing it happen in a live system. That is why mature clustering products can be pretty good while young ones, however well-designed, are not. The problem space is just strange. </div>
<div>
<br /></div>
<div>
My sympathy for the Github failures and everyone involved is therefore heartfelt. Anyone who has worked on failover knows the guilt of failing to anticipate problems as well as the sense of enlightenment that comes from understanding why they occur. Automated failover is not evil. But it is definitely weird. </div>
</div>
Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com2tag:blogger.com,1999:blog-768233104244702633.post-7219357862936845332012-09-03T11:12:00.000-07:002012-09-03T11:12:04.987-07:00Life in the Amazon JungleIn late 2011 I attended a lecture by <a href="http://www.e-wilkes.com/john/work.html">John Wilkes</a> on Google compute clusters, which link thousands of commodity computers into huge task processing systems. At this scale hardware faults are common. Google puts a lot of effort into making failures harmless by managing hardware efficiently and using fault-tolerant application programming models. This is not just good for application up-time. It also allows Google to operate on cheaper hardware with higher failure rates, hence offers a competitive advantage in data center operation.<br />
<div>
<br /></div>
<div>
It's becoming apparent we all have to think like Google to run applications successfully in the cloud. At <a href="http://www.continuent.com/">Continuent</a> we run our IT and an increasing amount of QA and development on <a href="http://aws.amazon.com/">Amazon Web Services (AWS)</a>. During the months of July and August 2012 at least 3 of our EC2 instances were decommissioned or otherwise degraded due to hardware problems. One of the instances hosted our main website <a href="http://www.continuent.com/">www.continuent.com</a>. <br />
<br />
In Amazon failures are common and may occur with no warning. You have minimal ability to avoid them or in some cases even to understand the root causes. To survive in this environment, applications need to obey a new law of the jungle. Here are the rules as I understand them. </div>
<div>
<br /></div>
<div>
First, <u>build clusters of redundant services</u>. The www.continuent.com failure brought our site down for a couple of hours until we could switch to a backup instance. Redundant means up and ready to handle traffic now, not after a bridge call to decide what to do. We protect our MySQL servers by replicating data cross-region using Tungsten, but the website is an Apache server that runs on a separate EC2 instance. Lesson learned. Make <i>everything</i> a cluster and load balance traffic onto individual services so applications do not have to do anything special to connect. </div>
<div>
<br /></div>
<div>
Second, <u>make applications fault-tolerant</u>. Remote services can fail outright, respond very slowly, or hang. To live through these problems apply time-honored methods to create loosely coupled systems that degrade gracefully during service outages and repair themselves automatically when service returns. Here are some of my favorites. </div>
<div>
<div>
<ol>
<li>If your application has a feature that depends on a remote service and that service fails, turn off the feature but don't stop the whole application. </li>
<li>Partition features so that your applications operate where possible on data copies. Learn how to build caches that do not have distributed consistency problems. </li>
<li>Substitute message queues for synchronous network calls. </li>
<li>Set timeouts on network calls to prevent applications from hanging indefinitely. In Java you usually do this by putting the calls in a separate thread. </li>
<li>Use thread pools to limit calls to remote services so that your application does not explode when those services are unavailable or fail to respond quickly. </li>
<li>Add auto-retry so that applications reconnect to services when they are available again. </li>
<li>Add catch-all exception handling to deal with unexpected errors from failed services. In Java this means catching RuntimeException or even Throwable to ensure it gets properly handled. </li>
<li>Build in monitoring to report problems quickly and help you understand failures you have not previously seen. </li>
</ol>
</div>
</div>
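Several of the items above can be combined in a few lines. The sketch below shows the retry and catch-all ideas; it is a minimal illustration, not a production library, and per-call timeouts belong in the underlying client (for example a socket timeout) so a hung service cannot block the caller forever:

```python
import time

def call_with_retry(operation, attempts=3, delay=0.5):
    """Bound the number of tries against a remote service, catch all
    failures rather than letting them kill the application, and retry
    so the caller reconnects automatically when service returns."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as e:  # catch-all: remote services fail in odd ways
            last_error = e
            time.sleep(delay)
    raise last_error
```

A caller wraps any remote call, e.g. `call_with_retry(lambda: fetch_profile(user_id))`, and decides at that one place whether a final failure should disable the feature or propagate.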
<div>
Third, <u>revel in failure</u>. Netflix takes this to an extreme with <a href="http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html">Chaos Monkey</a>, which introduces failures in running systems. Another approach is to build scaffolding into applications so operations fail randomly. We use that technique (among others) to test clusters. In deployed database clusters I like to check regularly that any node can become the master and that you can recover any failed node. However, this is just the beginning. There are many, many ways that things can fail. It is better to provoke the problems yourself than have them occur for the first time when something bad happens in Amazon. </div>
<div>
<br /></div>
<div>
There is nothing new about these suggestions. That said, the Netflix approach exposes the difference between cloud operation and traditional enterprise computing. If you play the game, applications will stay up 24x7 in this rougher landscape and you can tap into the flexible cost structure and rapid scaling of Amazon. The shift feels similar to using database transactions or eliminating GOTO statements--just something we all need to do in order to build better systems. There are big benefits to running in the cloud but you really need to step up your game to get them. </div>
Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com2tag:blogger.com,1999:blog-768233104244702633.post-38101852616679187562012-08-05T14:27:00.002-07:002012-08-05T14:27:40.818-07:00Is Synchronous Data Replication over WAN Really a Viable Strategy?Synchronous data replication over long distances has the sort of seductive appeal that often characterizes bad ideas. Why wouldn't you want every local credit card transaction simultaneously stored on the other side of the planet far away from earthquakes, storms, and human foolishness? The answer is simple: conventional SQL applications interact poorly with synchronous replication over wide area networks (WANs). <br />
<br />
I spent a couple of years down the synchronous replication rabbit hole in an earlier <a href="http://www.continuent.com/">Continuent</a> product. It was one of those experiences that make you a sadder but wiser person. This article digs into some of the problems with synchronous replication and shows why another approach, asynchronous multi-master replication, is currently a better way to manage databases connected by long-haul networks. <br />
<br />
<b>Synchronous Replication between Sites</b><br />
<br />
The most obvious problem with any form of synchronous replication is the hit on application performance. Every commit requires a round-trip to transfer and acknowledge receipt at the remote site, which in turn reduces single-thread transaction rates to at most the number of round trips per second between sites. As a <a href="http://blog.9minutesnooze.com/performance-mysql-replication-high-latency/">nice article by Aaron Brown demonstrates</a>, you can show the effect easily using <a href="http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html">MySQL semi-synchronous replication</a> between hosts in Amazon regions. Aaron's experiment measured 11.5 transactions per second between hosts with 85 milliseconds of latency, about <u style="font-style: italic;">100 times less</u> than single-thread performance against a local master. At that latency you would theoretically expect transaction throughput of ~11.7 transactions per second (1000 / 85 = 11.7), so the agreement between practice and theory is very close. It's great when science works out like this. <br />
<br />
You might argue that applications could tolerate the slow rate assuming it were at least constant. Sadly that's not the case for real systems. Network response varies enormously between sites in ways that are quite easy to demonstrate. <br />
<br />
To illustrate variability I set up an Amazon m1.small instance in the us-east-1 region (Virginia) and ran 24 hours of ping tests to instances in us-west-2 (Oregon) and ap-southeast-1 (Singapore). As the following graph shows, during a 4 hour period ping times to Singapore remain within a band but vary up to 10%. Ping times to Oregon on the other hand hover around 100ms but spike up randomly to almost 200ms. During these times, synchronous replication throughput would be cut by 50% to approximately 5 transactions per second. <br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6OlcQCOwmDzrzWeSW8V0mBZsPNkExRlRCk5uQuM6I__tJANoW4rEeyoWf7bhg5pxhnLTObgnqW-Yq-3mvtohqcfDbOiZHdklLoXTpkRw2hHpnOp0uYVXrXNS3JgsPCaD8-uVND9bsfy4/s1600/Amazon-ping-times-from-us-east.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="299" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6OlcQCOwmDzrzWeSW8V0mBZsPNkExRlRCk5uQuM6I__tJANoW4rEeyoWf7bhg5pxhnLTObgnqW-Yq-3mvtohqcfDbOiZHdklLoXTpkRw2hHpnOp0uYVXrXNS3JgsPCaD8-uVND9bsfy4/s640/Amazon-ping-times-from-us-east.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Amazon ping times from us-east-1 to us-west-2 and ap-southeast-1 (240 minute interval)</td></tr>
</tbody></table>
Moreover, it's not just a question of network traffic. Remote VMs also become busy, which slows their response. To demonstrate, I ran two-minute sysbench CPU tests to saturate processing on the us-west-2 instance while observing the ping rate. Here is the command: <br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">sysbench --test=cpu --num-threads=10 --max-time=120 --cpu-max-prime=1000000 run</span><br />
<br />
As the next graph illustrates, CPU load has an unpredictable but substantial effect on ping times. As it happens, the ping variation in the previous graph may be due to resource contention on the underlying physical host. (Or it might really be network traffic--you never really know with Amazon.) <br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnd4HXmEAlEhsoBL_gKWya9V0qEWkE-_Risiy7HdZnHhta-Nwxr0djMO-bP5Y79AtuTbKi1_p18uGxDilLDxu-JaJrUfva_PCn7rvMsFNipt52DS-pjrQdKXtAxMG7fe9aCmlkbgd2xb4/s1600/Amazon-ping-with-sysbench-on-us-west.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnd4HXmEAlEhsoBL_gKWya9V0qEWkE-_Risiy7HdZnHhta-Nwxr0djMO-bP5Y79AtuTbKi1_p18uGxDilLDxu-JaJrUfva_PCn7rvMsFNipt52DS-pjrQdKXtAxMG7fe9aCmlkbgd2xb4/s640/Amazon-ping-with-sysbench-on-us-west.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Effect of sysbench runs on ping times to US-West (20 minute interval)</td></tr>
</tbody></table>
Slow response on resource-bound systems is a problem that is familiar to anyone with experience with distributed systems, including systems where everything is on a single LAN. You cannot even count on clock ticks being delivered accurately within various types of virtual machines. The timing delays are magnified in WANs, as they already have high latency to begin with. Between busy hosts and network latency, it's reasonable to expect that at some point most systems would at least briefly experience single session transaction rates of <i><u>1 transaction per second or less</u></i>. Even with parallelized replication you would see substantial backups on the originating DBMS servers as commits begin to freeze. <br />
<br />
To complete the tale of woe, failures of various kinds can cause remote hosts to stop responding at all for periods of time that vary from seconds to days. Amazon is generally quite reliable but had <a href="http://gigaom.com/cloud/some-of-amazon-web-services-are-down-again/">two outages in the Virginia data center in June 2012 alone</a> that brought applications down for anywhere from hours to days. If you replicate synchronously to a host affected by such an outage, your application just stops and you no longer store transactions at all, let alone securely. You need to turn off synchronous replication completely to stay available. <br />
<br />
So is synchronous replication really impossible between sites? Given the problems I just described it would be silly to set up MySQL semi-synchronous replication between over WAN for a real application. However, there are other ways to implement synchronous replication. Let's look at two of them. <br />
<br />
First, there is <a href="http://codership.com/products/mysql_galera">Galera</a>, which uses a distributed protocol called certification-based replication to agree on commit order between all cluster nodes combined with execution of non-conflicting transactions in parallel. Certification-based replication is a great algorithm in many ways, but Galera comes with some important practical limitations. First it replicates rows rather than statements. The row approach handles large transactions poorly, especially over distances, due to the large size of change sets. Also, not all workloads parallelize well, since transactions that conflict in any way must be fully serialized. Overall DBMS throughput may therefore reduce to the single-session throughput discussed above at unexpected times due to variations in workload. Finally, full multi-master mode between sites (as opposed to master/slave) is likely to be very problematic as nodes drop out of the cluster due to transient communication failures and require expensive reprovisioning. This is a general problem with group communications, which Galera depends on to order transactions. <br />
<br />
Second, there are theoretical approaches that claim many of the benefits of synchronous replication without killing throughput or availability. One example is the <a href="http://dbmsmusings.blogspot.com/2012/05/if-all-these-new-dbms-technologies-are.html">Calvin system developed by Daniel Abadi and others</a>, which seeks to achieve both strong transaction consistency and high throughput when operating across sites. The secret sauce in Calvin is that it radically changes the programming model to replicate what amount to transaction requests while forcing actual transaction processing to be under control of the Calvin transaction manager, which fixes transaction ordering in advance across nodes. That should at least in principle reduce some of the unpredictability you may see in systems like Galera that do not constrain transaction logic. Unfortunately it also means a major rewrite for most existing applications. Calvin is also quite speculative. It will be some time before this approach is available for production systems and we can see whether it is widely applicable. <br />
<br />
There's absolutely a place for synchronous replication in LANs, but given the current state of the art it's hard to see how most applications can use it effectively to link DBMS servers over WAN links. In fact, the main issue with synchronous replication is the unpredictability it introduces into applications that must work with slow and unreliable networks. This is one of the biggest lessons I have learned at Continuent. <br />
<br />
<b>The Alternative: Asynchronous Multi-Master Replication</b><br />
<br />
So what are the alternatives? If you need to build applications that are available 24x7 with high throughput and rarely, if ever, lose data, you should consider high-speed local clusters linked by asynchronous multi-master replication between sites. Here is a typical architecture, which is incidentally a standard design pattern for <a href="http://www.continuent.com/solutions/overview">Tungsten</a>. <br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg86Wz4J7cD3Gd8-L1PQGwmnop5Ys4eEPPNi-svJC4w2GbjB6J5IVPy9j6gO9PTGi-mD3MU1blMKNRWdrzrHvuqef5c_Z9y7DuE5sUPVgRUSdh4srRK7k8W-QMxrFq0MjhcQ7FSgkA3khI/s1600/cluster-with-multi-master.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="141" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg86Wz4J7cD3Gd8-L1PQGwmnop5Ys4eEPPNi-svJC4w2GbjB6J5IVPy9j6gO9PTGi-mD3MU1blMKNRWdrzrHvuqef5c_Z9y7DuE5sUPVgRUSdh4srRK7k8W-QMxrFq0MjhcQ7FSgkA3khI/s400/cluster-with-multi-master.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Local clusters linked by asynchronous, multi-master replication</td></tr>
</tbody></table>
The big contrast between synchronous and asynchronous replication between sites is that while both have downsides, you can minimize asynchronous multi-master problems using techniques that work now. Let's look at how async multi-master meets the requirements and at the possible optimizations.<br />
<ol>
<li><b>Performance</b>. Asynchronous replication solves the WAN performance problem as completely as possible. To the extent that you use synchronous or near-synchronous replication technologies, it is on local area networks, which are extremely fast and reliable, so application blocking is minimal. Meanwhile, long-haul replication can be improved by compression as well as parallelization, because WANs offer good bandwidth even if there is high end-to-end latency. </li>
<li><b>Data loss</b>. Speedy local replication, including synchronous and "near synchronous" methods, minimizes data loss due to storage failures and configuration errors. Somewhat surprisingly, you do not need fully synchronous replication for most systems even at the local level--that's a topic for a future blog article--but replication does need to be quite fast to ensure local replicas are up-to-date. Actually, one of the big issues for avoiding local data loss is to configure systems carefully (think sync_binlog=1 for MySQL, for example). </li>
<li><b>Availability</b>. Async multi-master systems have the delightful property that if anything interrupts transaction flow between sites, replication just stops and then resumes when the problem is corrected. There's no failover and no moving parts. This is a major strength of the multi-master model. </li>
</ol>
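The availability point is worth a concrete illustration. The toy Python model below is my own sketch (not Tungsten code): a local commit only appends to a durable log and returns, while a background thread ships entries to the remote site, so a WAN outage stalls replication but never blocks commits.

```python
import queue
import threading

# Toy model of asynchronous cross-site replication. A commit appends to a
# local log and returns at once; a background thread ships log entries to
# the remote replica whenever the "WAN" is up.
class AsyncReplicator:
    def __init__(self):
        self.log = queue.Queue()
        self.remote = []               # stands in for the remote replica
        self.wan_up = threading.Event()
        self.wan_up.set()
        threading.Thread(target=self._ship, daemon=True).start()

    def commit(self, txn):
        self.log.put(txn)              # returns immediately; no WAN round trip

    def _ship(self):
        while True:
            txn = self.log.get()
            self.wan_up.wait()         # replication stalls here; commits do not
            self.remote.append(txn)
            self.log.task_done()

r = AsyncReplicator()
r.wan_up.clear()                       # simulate a WAN outage
for i in range(3):
    r.commit(i)                        # all commits succeed instantly
r.wan_up.set()                         # link restored; replication resumes
r.log.join()                           # wait until the backlog drains
print(r.remote)                        # [0, 1, 2]
```

Nothing was lost during the outage; the transactions simply arrived at the remote site later, which is exactly the behavior described above.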
So what are the downsides? Nothing comes for free, and there are at least two obvious issues. <br />
<ol>
<li><b>Applicability</b>. Not every application is compatible with asynchronous multi-master. You will need to do work on most existing applications to implement multi-master and ensure you get it right. I touched on some of the MySQL issues <a href="http://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html">in an earlier article.</a> If multi-master is not a possibility, you may need the other main approach to cross-site replication: a <a href="http://scale-out-blog.blogspot.com/2011/08/system-of-record-approach-to-multi.html">system-of-record design</a> where applications update data on a single active site at any given time, with other sites present for disaster recovery. (Tungsten also does this quite well, I should add.) </li>
<li><b>Data access</b>. While you might not lose data, it's quite likely you will not be able to access it for a while. It's rare to lose a site completely but not uncommon for sites to be inaccessible for hours to days. The nice thing is that with a properly constructed multi-master application you will at least know that the data will materialize on all sites once the problem is solved. Meanwhile, relax and work on something else until the unavailable site comes back. </li>
</ol>
In the MySQL community, local clusters linked by asynchronous multi-master replication are an increasingly common architecture for credit card payment gateways, which I mentioned at the beginning of this article. This is a telling point in favor of asynchronous cross-site replication, as credit card processors have a low tolerance for lost data. <br />
<div>
<br /></div>
<div>
Also, a great deal of current innovation in distributed data management is headed in the direction of asynchronous mechanisms. NoSQL systems (such as Cassandra) tend to use asynchronous replication between sites. There is interesting research afoot, for example in <a href="https://databeta.wordpress.com/2011/04/08/bud-bloom-under-development/">Joe Hellerstein's group at UC Berkeley</a>, to make asynchronous replication more efficient by accurately inferring cases where no synchronization is necessary. Like other research, this work is quite speculative, but the foundations are in use in operational systems today. </div>
<div>
<br /></div>
<div>
For now the challenge is to make the same mechanisms that NoSQL systems have jumped on work equally well for relational databases like MySQL. We have been working on this problem for the last couple of years at Continuent. I am confident we are well on the way to solutions that are as good as the best NoSQL offerings for distributed data management. </div>Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com5tag:blogger.com,1999:blog-768233104244702633.post-80270426002172494472012-06-03T23:58:00.001-07:002012-06-03T23:58:33.278-07:00MySQL to Vertica Replication, Part 2: Setup and OperationAs described in the <a href="http://scale-out-blog.blogspot.com/2012/06/mysql-to-vertica-replication-part-1.html">first article of this series</a>, Tungsten Replicator can replicate data from MySQL to Vertica in real-time. We use a new batch loading feature that applies transactions to data warehouses in very large blocks using COPY or LOAD DATA INFILE commands. This second and concluding article walks through the details of setting up and testing MySQL to Vertica replication. <br />
<br />
<div>
To keep the article reasonably short, I assume that readers are conversant with MySQL, Tungsten, and Vertica. Basic replication setup is not hard if you follow all the steps described here, but of course there are variations in every setup. For more information on Tungsten check out the <a href="http://code.google.com/p/tungsten-replicator/">Tungsten Replicator project at code.google.com</a> as well as the current <a href="https://docs.continuent.com/wiki/display/TEDOC/Tungsten+Enterprise+Documentation+Home">Tungsten commercial documentation at Continuent</a>. </div>
<div>
<br /></div>
<div>
Now let's get replication up and running! </div>
<div>
<br />
<b>What Is the Topology? </b><br />
<br />
In this exercise we will set up Tungsten master/slave replication to move data from MySQL 5.1 to Vertica 5.1. Master/slave is the simplest topology to set up because you don't have to mix settings for different DBMS types in each service. To keep things simple, we will install Tungsten directly on the MySQL and Vertica hosts, which are named db1 and db2 respectively. Here is a diagram:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhT2xYyBJlDbKGnBej7sGQTSAymAnzmx6PdqH7Dmvk3P9VVPi-eXc5MXeOVyEO24vd0TK644AjrfacKdEL2gsp5oRb8IUaQDc4e0bU-xJyRUogqkHHRSXATIg8JcdlToR6N_OKyA38ZRjc/s1600/mysql-to-vertica-master-slave.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="113" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhT2xYyBJlDbKGnBej7sGQTSAymAnzmx6PdqH7Dmvk3P9VVPi-eXc5MXeOVyEO24vd0TK644AjrfacKdEL2gsp5oRb8IUaQDc4e0bU-xJyRUogqkHHRSXATIg8JcdlToR6N_OKyA38ZRjc/s400/mysql-to-vertica-master-slave.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: -webkit-auto;">
<br /></div>
There are of course many other possible configurations. You can run replicators on separate hosts to reduce load on the DBMS servers. With a little patience you can set up direct replication using a single Tungsten replication service, which results in fewer processes to manage. Finally, you can use both direct and master/slave topologies to publish data from Tungsten 1.5 clusters. Tungsten clusters provide availability and scaling on the MySQL side. <br />
<a name='more'></a><br /></div>
<div>
<b>Preparing MySQL</b></div>
<div>
<br /></div>
<div>
To replicate heterogeneously, MySQL servers need to enable row-based replication. You therefore need to use MySQL 5.1 or higher. Tungsten supports all popular builds of MySQL 5.1 and 5.5, so you can pick your favorite. </div>
<div>
<br /></div>
<div>
Batch replication prints values into CSV files, so mismatches in character sets and timezones between MySQL, the OS platform, and Vertica will result in corrupted strings and/or dates. Settling on the UTF8 character set as the server default and GMT as the default timezone avoids these problems, and getting this right is essential for heterogeneous replication to work properly. </div>
<div>
<br /></div>
<div>
Here are sample my.cnf parameters to enable the recommended settings. </div>
<div>
<br /></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"># Use row replication. </span></div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">binlog-format=row</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"># Server timezone is GMT. </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">default-time-zone='+00:00'</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"># Tables default to UTF8. </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">character-set-server=utf8</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">collation-server=utf8_general_ci</span></div>
</div>
<div>
<br /></div>
<div>
Restart MySQL to pick up new settings. Beyond this you need to meet the usual requirements for installing Tungsten, such as defining a 'tungsten' user and ensuring replication is properly enabled. Check the Tungsten docs for more information. </div>
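Before restarting, you can confirm the fragment above actually made it into your configuration. The following is an illustrative sketch of my own (not part of Tungsten): it parses a my.cnf-style fragment and reports any recommended settings that are missing or mismatched.

```python
# Settings recommended above for heterogeneous replication.
REQUIRED = {
    "binlog-format": "row",
    "default-time-zone": "'+00:00'",
    "character-set-server": "utf8",
    "collation-server": "utf8_general_ci",
}

def check_mycnf(text):
    """Return the recommended settings that are missing or mismatched."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and bare options
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return [k for k, v in REQUIRED.items() if settings.get(k) != v]

sample = """
# Use row replication.
binlog-format=row
default-time-zone='+00:00'
character-set-server=utf8
collation-server=utf8_general_ci
"""
print(check_mycnf(sample))   # [] -> nothing missing
```

An empty list means the fragment matches the recommendations; any listed key needs attention before you restart MySQL.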
<div>
<br /></div>
<div>
<b>Preparing Vertica</b></div>
<div>
<br /></div>
<div>
Next, spin up Vertica. I used <a href="http://www.vertica.com/community">Vertica Community Edition</a> version 5.1.1 for these demos, but any recent production version should do. There is no special configuration for the Vertica instance at this point--unlike MySQL, Tungsten works fine with Vertica's default settings. </div>
<div>
<br /></div>
<div>
Once Vertica is started and you have created a database, you will need to login and create a schema to hold Tungsten catalogs. Here is an example: </div>
<div>
<br /></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ vsql -Udbadmin -wsecret bigdata</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Welcome to vsql, the Vertica Analytic Database v5.1.1-0 interactive terminal.</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Type: \h for help with SQL commands</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> \? for help with vsql commands</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> \g or terminate with semicolon to execute query</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> \q to quit</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata=> create schema tungsten_mysql2vertica;</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">CREATE SCHEMA</span></div>
<div>
<br />
Note the location of the JDBC driver in your Vertica release directory. For Vertica 5.1.1 it is in /opt/vertica/java/lib and has a name like vertica_5.1.1_jdk_5.jar. <br />
<br /></div>
<div>
<b>Downloading and Installing Tungsten</b></div>
<div>
<br /></div>
<div>
With MySQL and Vertica running, you can now install Tungsten. Let's first download and unpack the software in a staging directory. You should do this on the MySQL host as well as the Vertica host.<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">db1$ mkdir ~/staging</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">db1</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ cd ~/staging</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">db1</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ wget --no-check-certificate https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.0.6-667.tar.gz</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">db1</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ tar -xf tungsten-replicator-2.0.6-667.tar.gz</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">db1</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$ cd </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">tungsten-replicator-2.0.6-667</span><br />
<br />
Note that we use build 667 or later. This is necessary to pick up recent fixes to ensure compatibility with Vertica Version 5 JDBC drivers. You'll need to download from the <a href="http://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/index.html">Tungsten nightly build page</a> for now.<br />
<br />
Now set up the MySQL master. On the MySQL host, run the following installation command:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">db1$ tools/tungsten-installer --master-slave -a \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --service-name=mysql2vertica \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --master-host=db1 \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --cluster-hosts=db1 \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --datasource-user=msandbox \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --datasource-password=msandbox \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --home-directory=/opt/continuent \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --buffer-size=1000 \</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --java-file-encoding=UTF8 \</b></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --java-user-timezone=GMT \</b></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --mysql-use-bytes-for-string=false \</b></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --svc-extractor-filters=colnames,pkey \</b></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --property=replicator.filter.pkey.addPkeyToInserts=true \</b></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --property=replicator.filter.pkey.addColumnsToDeletes=true \</b></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --start-and-report</span><br />
<div>
<br /></div>
<div>
This command has some special settings, highlighted in bold, to help with heterogeneous replication. </div>
<div>
<ol>
<li>The Java VM file encoding and timezone are UTF8 and GMT respectively. Standardizing is essential to avoid corrupting data in batch loads. </li>
<li>String values are translated to UTF8 rather than passed to slaves as bytes. </li>
<li>We insert filters to add column names and primary key information to row events, including INSERT operations. Both are required for batch loading to work correctly. </li>
</ol>
At the end of a successful master replicator installation you should see the following message: <br />
<br /></div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">INFO >> db1 >> .....</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Processing services command...</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">NAME VALUE</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">---- -----</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">appliedLastSeqno: 0</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">appliedLatency : 0.973</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">role : master</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">serviceName : mysql2vertica</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">serviceType : local</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">started : true</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">state : ONLINE</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Finished services command...</span></div>
<div>
<br /></div>
<div>
Next, let's turn to the Vertica slave. Before installing Tungsten on the Vertica host, we must copy the Vertica JDBC driver into the replicator lib directory. Assuming you are in the unpacked Tungsten release, you can do this with a command like the following: </div>
<div>
<br /></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">db2$ cp /opt/vertica/java/lib/vertica_5.1.1_jdk_5.jar tungsten-replicator/lib</span></div>
<div>
<br /></div>
<div>
Now run the installation command for Vertica, which looks like the following:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">db2$ tools/tungsten-installer --master-slave -a \</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --service-name=mysql2vertica \</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --cluster-hosts=db2 \</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --master-host=db1 \</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --datasource-type=vertica \</b></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --datasource-user=dbadmin \</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --datasource-password=secret \</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --datasource-port=5433 \</b></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --batch-enabled=true \</b></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --batch-load-template=vertica \</b></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --vertica-dbname=bigdata \</b></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --buffer-size=25000 \</b></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --java-file-encoding=UTF8 \</b></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><b> --java-user-timezone=GMT \</b></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --skip-validation-check=InstallerMasterSlaveCheck \</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> --start-and-report</span></div>
<div>
<br /></div>
<div>
Vertica settings are fairly simple. Values that are different from a standard MySQL configuration are highlighted. JVM file encoding and timezones should match MySQL. Note also the very large buffer-size value. This means that Tungsten can commit up to 25,000 MySQL transactions in a single block on Vertica. I have tested values up to 100,000 without problems. </div>
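To see why the large buffer matters, consider the arithmetic of batch commits. The sketch below is an illustration of the effect of --buffer-size, not Tungsten's actual implementation: grouping row-change events into blocks of buffer-size means 60,000 events load in three bulk COPY rounds instead of 60,000 individual commits.

```python
def commit_blocks(events, buffer_size=25000):
    """Group events into the blocks a batch applier would commit together."""
    return [events[i:i + buffer_size]
            for i in range(0, len(events), buffer_size)]

# 60,000 MySQL row changes become just three commit blocks on Vertica.
blocks = commit_blocks(list(range(60000)), buffer_size=25000)
print([len(b) for b in blocks])   # [25000, 25000, 10000]
```

The per-commit overhead on a column store like Vertica is high, so amortizing it over tens of thousands of rows is where batch loading gets its speed.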
<div>
<br /></div>
<div>
If the installation is successful, a message like the following appears at the end. </div>
<div>
<br /></div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">INFO >> db2 >> .</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Processing services command...</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">NAME VALUE</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">---- -----</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">appliedLastSeqno: -1</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">appliedLatency : -1.0</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">role : slave</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">serviceName : mysql2vertica</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">serviceType : local</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">started : true</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">state : ONLINE</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Finished services command...</span></div>
</div>
<div>
<br /></div>
<div>
<b>Testing MySQL to Vertica Replication</b></div>
<div>
<br /></div>
<div>
We can check liveness quickly using a heartbeat on the MySQL master replicator. </div>
<div>
<br /></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">db1$ trepctl heartbeat</span></div>
<div>
<br /></div>
<div>
If everything is working, we then see the following on the Vertica slave replicator. </div>
<div>
<br /></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">db2$ trepctl services</span></div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Processing services command...</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">NAME VALUE</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">---- -----</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">appliedLastSeqno: 1</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">appliedLatency : 0.266</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">role : slave</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">serviceName : mysql2vertica</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">serviceType : local</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">started : true</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">state : ONLINE</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Finished services command...</span></div>
</div>
<div>
<br /></div>
<div>
Next, let's try to replicate something. On MySQL, we need a table to hold some data, which we create as follows: </div>
<div>
<br /></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mysql -uroot test</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">...</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mysql> create table simple_tab(id int primary key, </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> f_data varchar(100)) default charset=utf8;</span></div>
<div>
<br /></div>
<div>
Now let's create the same table in Vertica, plus a staging table. We must also create a schema 'test' on Vertica. </div>
<div>
<br /></div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">vsql -Udbadmin -wsecret bigdata</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Welcome to vsql, the Vertica Analytic Database v5.1.1-0 interactive terminal.</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Type: \h for help with SQL commands</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> \? for help with vsql commands</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> \g or terminate with semicolon to execute query</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> \q to quit</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata=> create schema test;</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">CREATE SCHEMA</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata=> create table test.simple_tab(</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata(> id int,</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata(> f_data varchar(100)</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata(> );</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">CREATE TABLE</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata=> </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata=> create table test.stage_xxx_simple_tab (</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata(> tungsten_seqno int,</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata(> tungsten_opcode char(1),</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata(> id int,</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata(> f_data varchar(100),</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata(> tungsten_row_id int);</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">CREATE TABLE</span></div>
</div>
<div>
<br /></div>
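The staging table columns hint at how batch apply works: Tungsten loads each block of changes into the staging table with COPY, then merges it into the base table. The Python sketch below mimics one plausible merge strategy--delete every key referenced in the batch, then re-insert rows marked 'I'--though the SQL Tungsten actually generates may differ in detail.

```python
# Toy version of a staging-table merge (an assumption for illustration, not
# Tungsten's exact logic). Each staging row carries a seqno, an opcode
# ('I' = insert, 'D' = delete), the row data, and a row id for ordering.
def apply_batch(base, staging):
    """base: dict keyed by primary key; staging: list of staging rows."""
    rows = sorted(staging,
                  key=lambda r: (r["tungsten_seqno"], r["tungsten_row_id"]))
    # Step 1: delete every key referenced in the batch (old versions go away).
    for r in rows:
        base.pop(r["id"], None)
    # Step 2: re-insert the final version of each inserted/updated row.
    for r in rows:
        if r["tungsten_opcode"] == "I":
            base[r["id"]] = r["f_data"]
    return base

base = {1: "hello!"}
staging = [
    {"tungsten_seqno": 2, "tungsten_opcode": "D", "id": 1,
     "f_data": None, "tungsten_row_id": 1},
    {"tungsten_seqno": 2, "tungsten_opcode": "I", "id": 1,
     "f_data": "hello, vertica!", "tungsten_row_id": 2},
    {"tungsten_seqno": 3, "tungsten_opcode": "I", "id": 2,
     "f_data": "new row", "tungsten_row_id": 3},
]
print(apply_batch(base, staging))  # {1: 'hello, vertica!', 2: 'new row'}
```

Note how an UPDATE arrives as a delete/insert pair with the same key; the delete-then-insert ordering makes the merge idempotent within a block.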
<div>
Finally, we can try to move a row from one table to the other. Login to MySQL and insert a row: </div>
<div>
<br /></div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mysql> insert into test.simple_tab values(1, 'hello!');</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Query OK, 1 row affected (0.00 sec)</span></div>
</div>
<div>
<br /></div>
<div>
If we configured things properly, we will now see the following on the Vertica side. </div>
<div>
<br /></div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">bigdata=> select * from test.simple_tab;</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> id | f_data </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">----+--------</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> 1 | hello!</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">(1 row)</span></div>
</div>
<div>
<br /></div>
<div>
At this point our topology is ready to start full-on data loading. </div>
<div>
<br /></div>
<div>
<b>Troubleshooting</b></div>
<div>
<br /></div>
<div>
It is common to run into errors when setting up heterogeneous replication--it's complicated. Don't forget to look at the replicator logs if 'trepctl status' does not show a meaningful error message. Here are three of the more common problems. </div>
<div>
<ol>
<li>MySQL writes to databases, whereas Vertica has a single database with schemas. If you write to database 'test' in MySQL it goes to schema 'test' on Vertica. It's easy to get confused if you are jumping back and forth. </li>
<li>Staging tables are easy to mess up. If you get the definition wrong, the Vertica slave will fail. In that case, drop the staging table and recreate it correctly. Then put the replicator back online. We plan to offer an automated tool to create staging tables in the future, but for now it is a little bit painful. </li>
<li>If you see dates that are off by a couple of hours, you likely did not configure Java timezones correctly. Make sure you have this correctly set. Refer to <a href="http://code.google.com/p/tungsten-replicator/wiki/Replicator_Batch_Loading#Time_Zones">this</a> for more information. </li>
</ol>
</div>
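The off-by-a-couple-of-hours symptom in the last item comes from writing a wall-clock timestamp under one timezone and reading it back under another. A minimal demonstration using fixed offsets (an illustration only, not Tungsten code):

```python
from datetime import datetime, timedelta, timezone

# The same wall-clock string, as it might appear in a MySQL DATETIME column.
wall_clock = datetime(2012, 6, 3, 12, 0, 0)

# Interpreted by a writer running at UTC+2 versus a reader assuming GMT.
as_utc_plus_2 = wall_clock.replace(tzinfo=timezone(timedelta(hours=2)))
as_gmt = wall_clock.replace(tzinfo=timezone.utc)

# The two interpretations disagree by exactly two hours.
skew = as_gmt - as_utc_plus_2
print(skew)   # 2:00:00
```

Standardizing both the MySQL server and the replicator JVMs on GMT, as shown in the installation commands above, removes the ambiguity entirely.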
<div>
It is often very useful to look at the CSV files to debug problems. By default they are located in the directory /tmp/staging/staging0. You can change the CSV file location in the Tungsten static-svc.properties file that controls the replication service configuration. Generally speaking, if there is a problem with loading, you just fix it and put the replicator back online. <br />
<br />
If you run into problems that look like a product issue, feel free to log a bug. The issue tracking system for replication is located <a href="http://code.google.com/p/tungsten-replicator/issues/list">here</a>. Before logging a bug, though, make sure it's really a Tungsten problem. Check out Giuseppe Maxia's <a href="http://datacharmer.blogspot.com/2011/12/how-to-submit-good-database-bug-report.html">hints on proper bug reporting</a>. You can submit questions on the <a href="https://groups.google.com/forum/?fromgroups#!forum/tungsten-replicator-discuss">Tungsten Replicator Discuss</a> group as well. And if you really get stuck, <a href="http://www.continuent.com/">Continuent</a> offers commercial support. </div>
<div>
<br /></div>
<div>
<b>Conclusion</b></div>
<div>
<br /></div>
<div>
This two-part series has provided a short introduction to setting up MySQL to Vertica replication. You can solve the real-time analytics problem used as an example in the articles as well as countless others that require loading data from MySQL to Vertica. </div>
<div>
<br /></div>
<div>
For more information about the detailed design of Tungsten batch loading, check out <a href="http://code.google.com/p/tungsten-replicator/wiki/Replicator_Batch_Loading">this wiki article</a>. More complete information will be posted in the Tungsten commercial documentation at <a href="http://www.continuent.com/">www.continuent.com</a> in the near future. Meanwhile, enjoy replicating to Vertica and send us feedback on the <a href="https://groups.google.com/forum/?fromgroups#!forum/tungsten-replicator-discuss">Tungsten Replicator Discuss</a> group. I look forward to hearing about your experiences. </div>Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com13tag:blogger.com,1999:blog-768233104244702633.post-47538733708039503302012-06-03T23:41:00.000-07:002012-06-04T00:00:09.806-07:00MySQL to Vertica Replication, Part 1: Enabling Real-Time Analytics with TungstenReal-time analytics allow companies to react rapidly to changing business conditions. Online ad services process click-through data to maximize ad impressions. Retailers analyze sales patterns to identify micro-trends and move inventory to meet them. The common theme is speed: moving lots of information without delay from operational systems to fast data warehouses that can feed reports back to users as quickly as possible. <br />
<br />
Real-time data publishing is a classic example of a <a href="http://en.wikipedia.org/wiki/Big_data">big data</a> replication problem. In this two-part article I will describe recent work on <a href="http://code.google.com/p/tungsten-replicator/">Tungsten Replicator</a> to move data out of MySQL into <a href="http://www.vertica.com/">Vertica</a> at high speed with minimal load on DBMS servers. This feature is known as <i>batch loading</i>. Batch loading enables not only real-time analytics but also any other application that depends on moving data efficiently from MySQL into a data warehouse. <br />
<br />
The first article works through the overall solution, starting with the replication problems posed by real-time analytics and ending with a description of how Tungsten adapts real-time replication to data warehouses. If you are in a hurry to set up, just skim this article and jump straight to the implementation details in the <a href="http://scale-out-blog.blogspot.com/2012/06/mysql-to-vertica-replication-part-2.html">follow-on article</a>. <br />
<br />
<b>Replication Challenges for Real-Time Analytics</b><br />
<br />
To understand some of the difficulties of replicating to a data warehouse, imagine a hosted intrusion detection service that collects access log data from across the web and generates security alerts as well as threat assessments for users. The architecture for this application follows a pattern that is increasingly common in businesses that have to analyze large quantities of incoming data. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwmW2GcFs_yeBOp5iG06eyWRDJIadLY0iyUzlhXyZ7jXct5TBl8JcIJL0dsMmAjTC7dZc4vDM8GW7S0NQCd_0SsWaFlpicTspb5ajYpNBaR_cae5ERXbLt_JHyPdGsCiec3qp28biNDU4/s1600/sharded-mysql-to-vertica.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="168" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwmW2GcFs_yeBOp5iG06eyWRDJIadLY0iyUzlhXyZ7jXct5TBl8JcIJL0dsMmAjTC7dZc4vDM8GW7S0NQCd_0SsWaFlpicTspb5ajYpNBaR_cae5ERXbLt_JHyPdGsCiec3qp28biNDU4/s400/sharded-mysql-to-vertica.jpg" width="400" /></a></div>
<br />
Access log entries arrive through data feeds, whereupon an application server checks them to look for suspicious activity and commits results into a front-end DBMS tier of sharded MySQL servers. The front-end tier optimizes for a MySQL sweet spot, namely fast processing of a lot of small transactions.<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br />
<a name='more'></a><br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Next, data feeds from MySQL as quickly as possible into a Vertica cluster that generates reports for users. Vertica is a popular column store with data compression, advanced projections (essentially materialized views) and built-in redundancy. (For more on Vertica origins and column stores in general, <a href="http://db.csail.mit.edu/projects/cstore/">read this</a>.) The back-end DBMS tier optimizes for a Vertica sweet spot, namely fast parallel load and quick query performance. </div>
<div>
<br /></div>
<div>
There are many challenges in building any system that must scale to high numbers of transactions. Replicating from MySQL to Vertica is an especially thorny issue. Here is a short list of problems to overcome. </div>
<div>
<ol>
<li>Intrusion detection generates a lot of data. This type of application can generate aggregate peak rates of 100,000 updates <u style="font-style: italic;">per second</u> into the front-end DBMS tier. </li>
<li>Data warehouses handle normal SQL commands like INSERT, UPDATE or DELETE very inefficiently. You need to use batch loading methods like the Vertica COPY command rather than submitting individual transactions as they appear in the MySQL binlog. </li>
<li>Real applications generate not only INSERT but also UPDATE and DELETE operations. You need to apply these in the correct order during batch loading or the data warehouse will quickly become inconsistent. </li>
<li>Both DBMS tiers are very busy, and whatever replication technique you use needs to reduce load as much as possible on both sides of the fence. </li>
</ol>
</div>
<div>
Until recently there were two obvious options for moving data between MySQL and Vertica. </div>
<div>
<ol>
<li>Use an <a href="http://en.wikipedia.org/wiki/Extract,_transform,_load">ETL</a> tool like <a href="http://www.talend.com/index.php">Talend</a> to post batches extracted from MySQL to Vertica. </li>
<li>Write your own scripts to scrape data out of the binlog, process them with a fast scripting language like Perl, and load the result into Vertica. </li>
</ol>
ETL tools put load on MySQL to scan for changes and often require application changes, for example to add timestamps to detect updates. Home-grown tools are difficult to maintain and, among other limitations, deal poorly with corner cases unless very carefully tested. Both approaches also add latency to replication, which detracts from the real-time delivery goal. <br />
<br />
The summary, then, is that there is no simple way to provide anything like real-time reports to users when large volumes of data are involved. ETL and home-grown solutions tend to fall down on real-time transfer as well as the extra load they impose on already busy servers. That's where Tungsten comes in. </div>
<div>
<br /></div>
<b>Developing Tungsten Batch Loading for Data Warehouses</b><br />
<br />
Our first crack at replicating to data warehouses applied MySQL transactions to <a href="http://www.greenplum.com/">Greenplum</a> using the same approach used for MySQL--connect with a JDBC driver and apply row changes in binlog order as fast as possible. It was functionally correct but not very usable. Like many data warehouses, Greenplum processes individual SQL statements around 100 times slower than MySQL. To populate data at a reasonable speed you need to dump changes to <a href="http://tools.ietf.org/html/rfc4180">CSV</a> and insert them in batches using gpload, an extremely fast parallel loader for Greenplum. <br />
<br />
We did not add gpload support at that time, because it was obviously a major effort and we did not understand the implementation very well. However, I spent the next couple of months thinking about how to add CSV-based batch loading to Tungsten. The basic idea was to turn on MySQL row replication on the master and then apply updates to the data warehouse as follows: <br />
<ol>
<li>Accumulate a large number of transactions as rows in open CSV files. </li>
<li>Load the files to staging tables. </li>
<li>Merge the staging table contents into the base tables using SQL. </li>
</ol>
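The three steps above can be sketched in Python. This is an illustrative model of step 1 only; the function name, file layout, and null handling are my own simplifications, not Tungsten's actual implementation:

```python
import csv
import os
import tempfile

def accumulate(events, staging_dir):
    """Append row-change events to per-table CSV files.

    events: iterable of (table, seqno, opcode, pk, data) tuples, where
    opcode is 'I' for insert or 'D' for delete. Each file is named after
    the schema-qualified table, mirroring the naming described later in
    this article. (Real Tungsten CSV formats differ in detail.)
    """
    writers, files = {}, {}
    row_id = 0
    for table, seqno, opcode, pk, data in events:
        if table not in writers:
            f = open(os.path.join(staging_dir, table), "w", newline="")
            files[table] = f
            writers[table] = csv.writer(f, quoting=csv.QUOTE_ALL)
        row_id += 1  # ordering column used later when merging staging rows
        writers[table].writerow(
            [seqno, opcode, pk, "null" if data is None else data, row_id])
    for f in files.values():
        f.close()

# Accumulate one insert and one delete for test.simple_tab.
staging_dir = tempfile.mkdtemp()
accumulate([("test.simple_tab", 64087, "I", 17, "Some data to be inserted"),
            ("test.simple_tab", 64088, "D", 18, None)], staging_dir)
with open(os.path.join(staging_dir, "test.simple_tab"), newline="") as f:
    csv_rows = list(csv.reader(f))
```

Steps 2 and 3 then load each finished file into a staging table and merge it with SQL, as described below.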
When a customer showed up needing fast replication into Vertica from MySQL we were therefore ready to develop batch loading and dived right in. It looked like a few weeks of work to get something ready for production deployment, but that estimate turned out to be quite optimistic. The implementation in fact took a good bit longer because of the complexities of CSV formats used by different DBMS services, problems with timezones, differences in SQL load command semantics, and the fact that when we started out we did not have an easy-to-setup method to test heterogeneous replication. Plus we needed to take time to create a proper installation. <br />
<br />
That said, most of the work was SMOP, or a simple matter of programming. After <strike>a few weeks</strike> six months we had fast, functional batch loading for Vertica as well as working implementations for MySQL and PostgreSQL. Batch loading applies MySQL row updates in very large groups to Vertica using CSV files and Vertica COPY commands. The following diagram shows direct replication using a single pipeline to apply transactions from a Tungsten master replicator to Vertica. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFxfR1azcIoYqZY4RSMbdkutsuKKlZ8-qiA2KgTiz2H6qQb6JtAr6LeJUmJlE_T6m7EwMLuupeYw28n73mJmCkGZKk0XRaJZhB-KReA45jaueFfTX1rUR8sRE_NX2aBBpQInAe7ee_Vlc/s1600/set-based-mysql-to-vertica.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="267" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFxfR1azcIoYqZY4RSMbdkutsuKKlZ8-qiA2KgTiz2H6qQb6JtAr6LeJUmJlE_T6m7EwMLuupeYw28n73mJmCkGZKk0XRaJZhB-KReA45jaueFfTX1rUR8sRE_NX2aBBpQInAe7ee_Vlc/s400/set-based-mysql-to-vertica.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: -webkit-auto;">
<br /></div>
Tungsten replication operates more or less normally up to the point where we apply to Vertica. This is the job of a new applier class called <a href="http://code.google.com/p/tungsten-replicator/source/browse/trunk/replicator/src/java/com/continuent/tungsten/replicator/applier/batch/SimpleBatchApplier.java">SimpleBatchApplier</a>. It implements the CSV loading as follows. <br />
<br />
First, as new transactions arrive Tungsten writes them to CSV files named after the Vertica tables to which they apply. For instance, say we have updates for a table simple_tab in schema test with the following format (slightly truncated from the vsql \d output):<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> Schema | Table | Column | Type | Size | </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">--------+------------+-----------------+--------------+------+</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> test | simple_tab | id | int | 8 | </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> test | simple_tab | </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">f_data | varchar(100) | 100</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> | </span><br />
<br />
The updates go into file test.simple_tab. Here is an example of the data in the CSV file.<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">"64087","I","17","Some data to be inserted","1"</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">"64088","I","18","Some more data to be inserted","2"</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">"64088","D","0",null,"3"</span><br />
<div>
<br /></div>
The CSV file includes a Tungsten seqno (global transaction ID), an operation code (I for insert, D for delete), and the primary key. For inserts, we have additional columns containing data. Deletes just contain nulls for those columns. The last column is a row number, which allows us to order the rows when they are loaded into Vertica.<br />
<br />
Tungsten keeps writing transactions until it reaches the block commit maximum (for example 25,000 transactions). It then closes each CSV file and loads the contents into a staging table that has the base name plus a prefix, here "stage_xxx_." The staging table format mimics the CSV file columns. For example, the previous example might have a staging table like the following:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Schema | Table | Column | Type | Size | </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">--------+----------------------+-----------------+--------------+------+</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> test | stage_xxx_simple_tab | tungsten_seqno | int | 8 |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">test </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> | stage_xxx_simple_tab | tungsten_opcode | char(1) | 1 |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">test </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> | stage_xxx_simple_tab | id | int | 8 |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">test </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> | stage_xxx_simple_tab | f_data | varchar(100) | 100 |</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">test </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> | stage_xxx_simple_tab | tungsten_row_id | int | 8 |</span><br />
<br />
Finally, Tungsten applies the deletes and inserts to table test.simple_tab by executing SQL commands like the following:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">DELETE FROM test.simple_tab WHERE id IN</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> (SELECT id FROM </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">test</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">.stage_xxx_simple_tab</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> WHERE tungsten_opcode = 'D');</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">INSERT INTO </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">test</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">.simple_tab(id, f_data)</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> SELECT id, f_data </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> FROM </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">test</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">.stage_xxx_simple_tab AS stage_a</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> WHERE tungsten_opcode='I' AND tungsten_row_id IN</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">(</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">SELECT MAX(tungsten_row_id) </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> FROM </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">test</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">.stage_xxx_simple_tab GROUP BY id);</span><br />
<div>
<br /></div>
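To see the merge mechanics in isolation, here is a sketch that exercises the same delete-then-insert pattern against SQLite stand-ins for the base and staging tables. Table and column names follow the example above, but the SQL is adapted to SQLite and is not the actual Vertica template:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE simple_tab (id INTEGER, f_data TEXT)")
cur.execute("""CREATE TABLE stage_xxx_simple_tab (
    tungsten_seqno INTEGER, tungsten_opcode TEXT,
    id INTEGER, f_data TEXT, tungsten_row_id INTEGER)""")

# Base table starts with one row that the change set will delete.
cur.execute("INSERT INTO simple_tab VALUES (17, 'old data')")

# Staging rows: delete id 17; insert, delete, and re-insert id 18.
cur.executemany("INSERT INTO stage_xxx_simple_tab VALUES (?,?,?,?,?)", [
    (64087, 'D', 17, None, 1),
    (64088, 'I', 18, 'first version', 2),
    (64089, 'D', 18, None, 3),
    (64090, 'I', 18, 'second version', 4),
])

# Apply deletes first, then the last insert per key, as in the article.
cur.execute("""DELETE FROM simple_tab WHERE id IN
    (SELECT id FROM stage_xxx_simple_tab WHERE tungsten_opcode = 'D')""")
cur.execute("""INSERT INTO simple_tab (id, f_data)
    SELECT id, f_data FROM stage_xxx_simple_tab
    WHERE tungsten_opcode = 'I' AND tungsten_row_id IN
        (SELECT MAX(tungsten_row_id) FROM stage_xxx_simple_tab
         GROUP BY id)""")
```

After the merge, only the final insert for id 18 survives: the pre-existing row 17 is gone, the intermediate version of row 18 is never applied, and the result matches a serial replay of the changes.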
Simple, right? The SQL commands are actually generated from templates that specify the SQL to execute when connecting to Vertica, to load a CSV file into the staging table, and to merge changes from the staging table to the base (i.e., real) table. You can find the template files in directory tungsten-replicator/samples/scripts/batch. The template file format is documented <a href="http://code.google.com/p/tungsten-replicator/wiki/Replicator_Batch_Loading#Connect,_Copy,_and_Load_Scripts">here</a>.<br />
<br />
Tungsten MySQL to Vertica replication is currently in field testing. The performance on the MySQL side is excellent, as you would expect with asynchronous replication. On the Vertica side we find that batch loading operates far faster than using JDBC interfaces. Tungsten has a block commit feature that allows you to commit very large numbers of transactions at once. Tests show that Tungsten easily commits around 20,000 transactions per block using CSV files. <br />
<br />
We added a <a href="http://code.google.com/p/tungsten-replicator/source/browse/trunk/replicator/src/java/com/continuent/tungsten/replicator/applier/batch/VerticaStreamBatchApplier.java">specialized batch loader class</a> to perform CSV uploads to Vertica from other hosts, which further reduces the load on Vertica servers. (It still needs <a href="http://code.google.com/p/tungsten-replicator/issues/detail?id=338">a small fix</a> to work with Vertica 5 JDBC but works with Vertica 4.) Taken together, the new Vertica replication features look as if they will be very successful for implementing real-time analytics. Reading the binlog on MySQL minimizes master overhead and fetches exactly the rows that have changed within seconds of being committed. Batch loading on Vertica takes advantage of parallel load, again reducing overhead in the reporting tier. <br />
<br />
<b>A New Replication Paradigm: Set-Based Apply</b><br />
<br />
Batch loading is significant for reasons other than conveniently moving data between MySQL and Vertica. Batch loading is also the beginning of a new model for replication. I would like to expand on this briefly as it will likely be a theme in future work on Tungsten.<br />
<br />
Up until this time, Tungsten Replicator has followed the principle of rigorously applying transactions to replicas in serial order without any deviations whatsoever. If you INSERT and then UPDATE a row, it always works because Tungsten applies them to the slave in the same order. This consistency is one of the reasons for the success of Tungsten overall, as serialization short-cuts usually end up hitting weird corner cases and are also hard to test. However, the serialized apply model is horribly inefficient on data warehouses, because single SQL statements execute very slowly. <br />
<br />
The SQL-based procedure for updating replicas that we saw in the previous section is based on a model that I call <i>set-based apply</i>. It works by treating the changes in a group of transactions as an <i>ordered set</i> (actually a relation) consisting of insert and delete operations. The algorithm is easiest to explain with an example. The following diagram shows how three row operations on table t in the MySQL binlog morph to four changes, of which we actually apply only the last two. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8SFmWy7QZtw3wxoOAIIIez4oJ64o0bzaMO3iEd1oaaV86lp96tmCVD3kG3nT7exwz4d-DV0fW9TWwWWwXiCtIQnh3EyRIgOSvBPQ6TjJ6Mx22OeH4SQ7CiJ42ZMtWz1lChs7QCn_6TTo/s1600/set-based-apply.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="73" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8SFmWy7QZtw3wxoOAIIIez4oJ64o0bzaMO3iEd1oaaV86lp96tmCVD3kG3nT7exwz4d-DV0fW9TWwWWwXiCtIQnh3EyRIgOSvBPQ6TjJ6Mx22OeH4SQ7CiJ42ZMtWz1lChs7QCn_6TTo/s400/set-based-apply.jpg" width="400" /></a></div>
<br />
Set-based apply merges the ordered change set to the base table using the following rules:<br />
<ol>
<li>Delete any rows from the base table where there is a change set DELETE for the primary key <i>and</i> the first operation on that key is not an INSERT. This deletes any rows that previously existed. </li>
<li>Apply the last INSERT on each key <i>provided</i> it is not followed by a DELETE. This inserts any row that was not later deleted. </li>
</ol>
This is a form of logical reduction using a combination of staging tables and CSV loading as described in the previous section. The rules are implemented as SQL queries. Taken together these two rules apply changes in a way that is identical to applying them straight from the binlog. Using SQL is not the most efficient approach but is relatively simple to implement and easy for users to understand. <br />
<br />
Set-based apply offers interesting capabilities because sets, particularly relations, have powerful mathematical properties. We can use set theory to reason about and develop optimized handling to solve problems like conflict resolution in multi-master systems. I will get back to this topic in a future post. <br />
<br />
Meanwhile, there are obvious ways to speed up the apply process for data warehouses by performing more of the set reduction in Tungsten and less in SQL. We can also take advantage of existing Tungsten parallelization capabilities. I predict that this will offer the same sort of efficiency gains for data warehouse loading as <a href="http://scale-out-blog.blogspot.com/2011/10/benchmarking-tungsten-parallel.html">Tungsten parallel apply provides for I/O-bound MySQL slaves</a>. Log-based replication is simply a very good way of handling real-time loading and there are lots of ways to optimize it provided we follow a sound processing model. <br />
<br />
<b>Conclusion</b><br />
<br />
This first article on enabling real-time analytics explained how Tungsten loads data in real time from MySQL to Vertica. The focus has been allowing users to serve up reports quickly on MySQL-based data, but Tungsten replication obviously applies to many other problems involving data warehouses. <br />
<br />
In the next article I will turn from dry theory to practical details. We will walk through the details of configuring MySQL to Vertica replication, so that you can try setting up real-time data loading yourself.<br />
<br />
P.S. If optimized batch loading seems like something you can help solve, Continuent is hiring. This is just one of a number of cutting-edge problems we are working on.Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com9tag:blogger.com,1999:blog-768233104244702633.post-54179884335136028322012-04-30T00:57:00.000-07:002012-04-30T07:09:24.499-07:00If You *Must* Deploy Multi-Master Replication, Read This FirstAn increasing number of organizations run applications that depend on MySQL multi-master replication between remote sites. I have worked on several such implementations recently. This article summarizes the lessons from those experiences that seem most useful when deploying multi-master on existing as well as new applications. <br />
<br />
Let's start by defining terms. <i>Multi-master replication</i> means that applications update the same tables on different masters, and the changes replicate automatically between those masters. <i>Remote sites</i> mean that the masters are separated by a wide area network (WAN), which implies high average network latency of 100ms or more. WAN network latency is also characterized by a <a href="http://en.wikipedia.org/wiki/Long_Tail">long tail</a>, ranging from seconds due to congestion to hours or even days if a ship runs over the wrong undersea cable. <br />
<br />
With the definitions in mind we can proceed to the lessons. The list is not exhaustive but includes a few insights that may not be obvious if you are new to multi-master topologies. Also, I have omitted issues like monitoring replication, using InnoDB to make slaves crash-safe, or provisioning new nodes. If you use master/slave replication, you are likely familiar with these topics already. <br />
<br />
<b>1. Use the Right Replication Technology and Configure It Properly</b><br />
<br />
The best overall tool for MySQL multi-master replication between sites is <a href="http://code.google.com/p/tungsten-replicator/">Tungsten</a>. The main reason for this assertion is that Tungsten uses a flexible, asynchronous, point-to-point, master/slave replication model that handles a wide variety of topologies such as star replication or all-to-all. Even so, you have to configure Tungsten properly. The following topology is currently my favorite:<br />
<ul>
<li><u>All-to-all topology</u>. Each master replicates directly to every other master. This handles prolonged network outages or replication failures well, because one or more masters can drop out without breaking replication between the remaining masters or requiring reconfiguration. When the broken master(s) return, replication just resumes on all sides. All-to-all does not work well if you have a large number of masters. </li>
<li><u>Updates are not logged on slaves</u>. This keeps master binlogs simple, which is helpful for debugging, and eliminates the possibility of loops. It also requires some extra configuration if the masters have their own slaves, as would be the case in a <a href="http://www.continuent.com/solutions/overview">Tungsten Enterprise cluster</a>. </li>
</ul>
There are many ways to set up multi-master replication, and the right choice varies according to the number of masters, whether you have local clustering, or other considerations. <a href="http://datacharmer.blogspot.com/">Giuseppe Maxia</a> has described many topologies, for example <a href="http://datacharmer.blogspot.com/2011/11/replication-multiple-masters-stars.html">here</a>, and the Tungsten Cookbook <a href="http://code.google.com/p/tungsten-replicator/wiki/TRCMultiMasterInstallation#Multi-Master_Installation">has even more details</a>. <br />
<br />
One approach to treat with special caution is <a href="http://onlamp.com/onlamp/2006/04/20/advanced-mysql-replication.html">MySQL circular replication</a>. In topologies of three or more nodes, circular replication results in broken systems if one of the masters fails. Also, you should be wary of any kind of synchronous multi-master replication across sites that are separated by more than 50 kilometers (i.e., 1-2ms latency). Synchronous replication makes a siren-like promise of consistency, but the price you pay is slow performance under normal conditions and broken replication when WAN links go down. <br />
<br />
<b>2. Use Row-Based Replication to Avoid Data Drift</b><br />
<br />
Replication depends on deterministic updates--a transaction that changes 10 rows on the original master should change exactly the same rows when it executes against a replica. Unfortunately many SQL statements that are deterministic in master/slave replication are non-deterministic in multi-master topologies. Consider the following example, which gives a 10% raise to employees in department #35. <br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> UPDATE emp SET salary = salary * 1.1 WHERE dep_id = 35;</span><br />
<br />
If all masters add employees, then the number of employees who actually get the raise will vary depending on whether such additions have replicated to all masters. Your servers will very likely become inconsistent with statement replication. The fix is to enable row-based replication using <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">binlog-format=row</span> in my.cnf. Row replication transfers the exact row updates from each master to the others and eliminates ambiguity. <br />
<br />
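A toy simulation, with made-up data structures rather than MySQL itself, shows why re-executing the statement drifts while replaying row images does not:

```python
def apply_statement(emp, dep_id=35, factor=1.1):
    """Statement-based: each server re-runs the UPDATE and recomputes
    the affected row set, so the result depends on local state."""
    for row in emp.values():
        if row['dep_id'] == dep_id:
            row['salary'] = round(row['salary'] * factor, 2)

def apply_rows(emp, after_images):
    """Row-based: apply the exact after-images captured on the master."""
    for emp_id, salary in after_images:
        emp[emp_id]['salary'] = salary

# State when the UPDATE runs: the other master has inserted employee 2
# locally, but that insert has not replicated yet.
master = {1: {'dep_id': 35, 'salary': 100.0}}
other = {1: {'dep_id': 35, 'salary': 100.0},
         2: {'dep_id': 35, 'salary': 100.0}}

stmt_copy = {k: dict(v) for k, v in other.items()}
apply_statement(master)
apply_statement(stmt_copy)            # statement replication re-executes
images = [(k, v['salary']) for k, v in master.items()]
apply_rows(other, images)             # row replication replays exact changes
```

Statement replication gives the unreplicated employee a raise it never received on the originating master; row replication touches only the rows the master actually changed.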
<b>3. Prevent Key Collisions on INSERTs</b><br />
<br />
For applications that use auto-increment keys, MySQL offers a useful trick to ensure that such keys do not collide between masters using the auto-increment-increment and auto-increment-offset parameters in my.cnf. The following example ensures that auto-increment keys start at 1 and increment by 4 to give values like 1, 5, 9, etc. on this server.<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">server-id=1</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">auto-increment-offset = 1</span><br />
<div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">auto-increment-increment = 4</span></div>
</div>
<div>
<br /></div>
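The offset/increment scheme is easy to sanity-check. This hypothetical sketch generates each master's key stream and confirms the streams never collide:

```python
def key_stream(offset, increment, count):
    """Auto-increment keys a server generates with the given
    auto-increment-offset and auto-increment-increment settings."""
    return [offset + i * increment for i in range(count)]

# Four masters, increment 4, offsets 1 through 4: disjoint key spaces.
streams = {offset: key_stream(offset, 4, 1000) for offset in (1, 2, 3, 4)}
```

Each master owns one residue class modulo the increment, so no two masters can ever generate the same key.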
<div>
This works so long as your applications use auto-increment keys faithfully. However, any table that either does not have a primary key or where the key is not an auto-increment field is suspect. You need to hunt them down and ensure the application generates a proper key that does not collide across masters, for example using UUIDs or by putting the server ID into the key. Here is a query on the MySQL information schema to help locate tables that do not have an auto-increment primary key. </div>
<div>
<br /></div>
<div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">SELECT t.table_schema, t.table_name </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> FROM information_schema.tables t </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> WHERE t.table_type = 'BASE TABLE' AND NOT EXISTS </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> (SELECT * FROM information_schema.columns c</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> WHERE t.table_schema = c.table_schema </span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> AND t.table_name = c.table_name</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> AND c.column_key = 'PRI'</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> AND c.extra = 'auto_increment')</span></div>
<div>
<br /></div>
<div>
<b>4. Beware of Semantic Conflicts in Applications</b></div>
</div>
<br />
Neither Tungsten nor MySQL native replication can resolve conflicts, though we are starting to design this capability for Tungsten. You need to avoid them in your applications. Here are a few tips as you go about this. <br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
First, avoid obvious conflicts. These include inserting data with the same keys on different masters (described above), updating rows in two places at once, or deleting rows that are updated elsewhere. Any of these can cause errors that will break replication or cause your masters to become out of sync. The good news is that many of these problems are not hard to detect and eliminate using properly formatted transactions. The bad news is that these are the easy conflicts. There are others that are much harder to address. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
For example, accounting systems need to generate unbroken sequences of numbers for invoices. A common approach is to use a table that holds the next invoice number and increment it in the same transaction that creates a new invoice. Another accounting example is reports that need to read the value of accounts consistently, for example at monthly close. Neither example works off-the-shelf in a multi-master system with asynchronous replication, as they both require some form of synchronization to ensure global consistency across masters. These and other such cases may force substantial application changes. Some applications simply do not work with multi-master topologies for this reason. </div>
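To make the invoice example concrete, here is the usual single-server pattern with hypothetical table names. SELECT ... FOR UPDATE serializes writers on one server only; with two active masters, both can hand out the same invoice number, and the collision surfaces only when replication applies the remote transaction:

```sql
-- Safe on one master: the row lock serializes concurrent invoice creation.
START TRANSACTION;
SELECT next_value INTO @n FROM invoice_seq FOR UPDATE;
UPDATE invoice_seq SET next_value = next_value + 1;
INSERT INTO invoice (invoice_no, customer) VALUES (@n, 'ACME');
COMMIT;
-- Unsafe across masters: an identical transaction on a second master can
-- obtain the same @n at the same time, because the lock is local to each server.
```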
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b><br /></b></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<b>5. Remove Triggers or Make Them Harmless</b></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Triggers are a bane of replication. They conflict with row replication if they accidentally run on the slave. They can also create strange conflicts due to odd behavior or outright bugs (like <a href="http://bugs.mysql.com/bug.php?id=64089">this one</a>), or other problems such as requiring definer accounts to be present. MySQL native replication turns <a href="http://dev.mysql.com/doc/refman/5.5/en/replication-features-triggers.html">triggers off on slaves when using row replication</a>, which is a very nice feature that prevents a lot of problems. </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
Tungsten on the other hand cannot suppress slave-side triggers. You must instead alter each trigger to add an IF statement that prevents the trigger from running on the slave. The technique is <a href="http://code.google.com/p/tungsten-replicator/wiki/TRCAdministration#Tungsten_Limitations">described in the Tungsten Cookbook</a>. It is actually quite flexible and has some advantages for cleaning up data because you can also suppress trigger execution on the master. </div>
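The guard looks roughly like this. It is a sketch rather than the exact Cookbook recipe, and it assumes the Tungsten applier connects as a dedicated account named 'tungsten' (USER() reports the connecting client, unlike CURRENT_USER(), which can report the trigger definer):

```sql
DELIMITER //
CREATE TRIGGER accounts_bi BEFORE INSERT ON accounts
FOR EACH ROW
BEGIN
  -- Run the trigger body for ordinary clients only; skip it when the
  -- replicator's service account is applying replicated rows.
  IF SUBSTRING_INDEX(USER(), '@', 1) <> 'tungsten' THEN
    SET NEW.last_updated = NOW();
  END IF;
END//
DELIMITER ;
```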
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
You should regard all triggers with suspicion when moving to multi-master. If you cannot eliminate triggers, at least find them, look at them carefully to ensure they do not generate conflicts, and test them very thoroughly before deployment. Here's a query to help you hunt them down: </div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">SELECT trigger_schema, trigger_name </span></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> FROM information_schema.triggers;</span></div>
<div>
<br /></div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
</div>
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
</div>
<b>6. Have a Plan for Sorting Out Mixed Up Data</b><br />
<br />
Master/slave replication has its discontents, but at least sorting out messed-up replicas is simple: re-provision from another slave or the master. Not so with multi-master topologies: you can easily get into a situation where all masters have transactions you need to preserve, and the only way to sort things out is to track down differences and update masters directly. Here are some thoughts on how to do this.<br />
<ol>
<li><u>Ensure you have tools to detect inconsistencies</u>. Tungsten has built-in consistency checking with the 'trepctl check' command. You can also use the <a href="http://www.percona.com/doc/percona-toolkit/2.1/pt-table-checksum.html">Percona Toolkit pt-table-checksum</a> to find differences. Be forewarned that neither of these works especially well on large tables and may give false results if more than one master is active when you run them. </li>
<li><u>Consider relaxing foreign key constraints</u>. I love foreign keys because they keep data in sync. However, they can also create problems for fixing messed up data, because the constraints may break replication or make it difficult to go table-by-table when synchronizing across masters. There is an argument for being a little more relaxed in multi-master settings. </li>
<li><u>Switch masters off if possible</u>. Fixing problems is a lot easier if you can quiesce applications on all but one master. </li>
<li><u>Know how to fix data</u>. Being handy with SQL is very helpful for fixing up problems. I find <a href="http://dev.mysql.com/doc/refman/5.5/en/select-into.html">SELECT INTO OUTFILE</a> and <a href="http://dev.mysql.com/doc/refman/5.5/en/load-data.html">LOAD DATA INFILE</a> quite handy for moving changes between masters. Don't forget SET SESSION SQL_LOG_BIN=0 to keep changes from being logged and breaking replication elsewhere. There are also various synchronization tools like <a href="http://www.percona.com/doc/percona-toolkit/2.1/pt-table-sync.html">pt-table-sync</a>, but I do not know enough about them to make recommendations. </li>
</ol>
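A minimal repair session might look like the following, with a hypothetical table and key range; note that SET SESSION SQL_LOG_BIN requires SUPER privilege:

```sql
-- On the master that holds the correct rows:
SELECT * INTO OUTFILE '/tmp/accounts_fix.csv'
  FROM accounts WHERE acct_id BETWEEN 100 AND 200;

-- On the master being repaired, after copying the file over:
SET SESSION SQL_LOG_BIN = 0;  -- keep the fix out of the binlog
DELETE FROM accounts WHERE acct_id BETWEEN 100 AND 200;
LOAD DATA INFILE '/tmp/accounts_fix.csv' INTO TABLE accounts;
SET SESSION SQL_LOG_BIN = 1;
```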
At this point it's probably worth mentioning commercial support. Unless you are a replication guru, it is very comforting to have somebody to call when you are dealing with messed up masters. Even better, expert advice early on can help you avoid problems in the first place. <br />
<br />
(Disclaimer: <a href="http://www.continuent.com/">My company</a> sells support for Tungsten so I'm not unbiased. That said, commercial outfits really earn their keep on problems like this.)<br />
<br />
<b>7. Test Everything</b><br />
<br />
Cutting corners on testing for multi-master can really hurt. This article has described a lot of things to look for, so put together a test plan and check for them. Here are a few tips on procedure:<br />
<ol>
<li>Set up a realistic pre-prod test with production data snapshots. </li>
<li>Have a way to reset your test environment quickly from a single master, so you can get back to a consistent state to restart testing. </li>
<li>Run tests on all masters, not just one. You never know if things are properly configured everywhere until you try. </li>
<li>Check data consistency after tests. Quiesce your applications and run a consistency check to compare tables across masters. </li>
</ol>
It is tempting to take shortcuts or slack off, so you'll need to find ways to improve your motivation. If it helps, picture yourself explaining to the people you work for why your DBMS servers have conflicting data with broken replication, and the problem is getting worse because you cannot take applications offline to fix things. It is a lot easier to ask for more time to test. An even better approach is to hire great QA people and give them time to do the job right.<br />
<br />
<b>Summary</b><br />
<br />
Before moving to a multi-master replication topology you should ask yourself whether the trouble is justified. You can get many of the benefits of multi-master from <a href="http://scale-out-blog.blogspot.com/2011/08/system-of-record-approach-to-multi.html">system-of-record architectures</a> with a lot less heartburn. That said, an increasing number of applications do require full multi-master operation across multiple sites. If you operate one of them, I hope this article is helpful in getting you deployed or improving what you already have.<br />
<br />
Tungsten does a pretty good job of multi-master replication already, but I am optimistic we can make it much better. There is a wealth of obvious features around conflict resolution, data repair, and up-front detection of problems that will make life better for Tungsten users and reduce our support load. Plus I believe we can make it easier for developers to write applications that run on multi-master DBMS topologies. You will see more about how we do this in future articles on this blog.Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com6tag:blogger.com,1999:blog-768233104244702633.post-63454463346205276662012-04-22T19:46:00.001-07:002012-04-22T19:46:56.164-07:00Replication Is Bad for MySQL Temp TablesExperienced MySQL DBAs know that temp tables cause major problems for MySQL replication. It turns out the converse is also true: replication can cause major problems for temporary tables. <br />
<br />
In a recent customer engagement we enabled <a href="http://code.google.com/p/tungsten-replicator/">Tungsten Replicator</a> with a MySQL application that originally ran on a server that did not use replication. QA promptly discovered that reports which previously ran in 10 seconds were now taking as many minutes. It turned out that the reports used temp tables to assemble data, and these were being written into the master binlog. This created bloated binlogs and extremely slow reports. We fixed the problem by enabling row replication (i.e., binlog-format=row in my.cnf). <br />
<br />
A common DBA response to temp table problems is to try to eliminate them completely, as suggested in the excellent <a href="http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/1449314287">High Performance MySQL, 3rd Edition</a>. (See p. 502.) Elimination is a good philosophy when applications use temp tables to generate updates. However, it does not work for reporting. Temp tables allow you to stage data for complex reports across a series of transactions, then pull the final results into a report writer like <a href="http://jasperforge.org/projects/jasperreports">JasperReports</a>. This modular approach is easy to implement and maintain afterwards. Eliminating temp tables in such cases can create an unmaintainable mess. <br />
<br />
The real solution with report temp tables is to keep them out of the master binlog. Here is a list of common ways to do so. Let me know if you know others.<br />
<br />
<b>* Turn off binlog updates.</b> Issue 'SET SESSION SQL_LOG_BIN=0' when generating reports. The downside is that it requires SUPER privilege to set. Also, if you make a code mistake and update normal tables with this setting, your changes will not be replicated.<br />
<br />
<b>* Use a non-replicated database. </b>Configure the master my.cnf with binlog-ignore-db as follows to ignore any update (including on temp tables) that is issued when database 'scratch' is the default database: <br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">binlog_ignore_db = scratch</span><br />
<div>
<br /></div>
<div>
This approach does not require special privileges. However, coding errors or connection pool misconfigurations are obvious liabilities. Your application must either connect to the scratch database or issue an explicit use command. Otherwise, temp table operations <i>will</i> be logged, as in the following example:</div>
<div>
<br /></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">use not_scratch;</span></div>
<div>
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">create temporary table scratch.report1_temp(name varchar(256), entry_time date, exit_time date);</span></div>
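For comparison, the same temp table creation is kept out of the binlog when scratch is the default database, because binlog-ignore-db filters on the session's default database:

```sql
use scratch;
create temporary table report1_temp(name varchar(256), entry_time date, exit_time date);
-- Not logged: the default database is now scratch.
```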
<div>
<br /></div>
<div>
<b>* Use a slave with the binlog disabled. </b>Remove the log-bin option from my.cnf. This works well if you have extra reporting slaves that are caught up. However, it may not work if the reports must be fully up-to-date or you need the ability to promote the slave quickly to a master, in which case the binlog must be enabled. </div>
<div>
<br /></div>
<div>
<b>* Use row replication. </b> You can set row replication at the session level using 'SET SESSION binlog_format=row', which requires SUPER privilege, or globally by setting binlog-format in my.cnf. In this case CREATE TEMPORARY TABLE and updates on temp tables do not appear in the binlog at all. The downside of enabling row replication everywhere is that it can lead to bloated logs and <a href="http://scale-out-blog.blogspot.com/2012/02/agony-of-big-transactions-in-mysql.html">blocked servers if you have very large transactions</a>. SQL operations like DELETE that affect multiple rows are stored far more compactly in statement replication. Also, reloading mysqldump files can be very slow under row replication compared to statement replication, which can handle the multi-row inserts generated by the --extended-insert option. <br />
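If only the reporting connection should switch formats, the session-level toggle looks like this (requires SUPER; the report table and query are hypothetical):

```sql
SET SESSION binlog_format = 'ROW';
-- Temp table DDL and updates now stay out of the binlog for this session.
CREATE TEMPORARY TABLE report_stage AS
  SELECT name, entry_time FROM visits WHERE entry_time >= '2012-01-01';
```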
<div>
<br /></div>
<div>
The proper solution to keep replication from hurting your use of temp tables will vary depending on your application as well as the way you run your site. For my money, though, this is a good example of where row replication really helps and deserves a closer look. </div>
<div>
<br /></div>
<div>
MySQL could use some feature improvements in the area of temp tables and replication. First, I find it surprising that <a href="http://dev.mysql.com/doc/refman/5.1/en/binary-log-mixed.html">mixed mode replication does not fully suppress temp table binlog updates</a>; only row replication does so. Second, it would be great to have a CREATE TABLE option to suppress logging of particular tables to the binlog. This would allow applications to make the logging decision at schema design time. Finally, global options to suppress binlogging of specific table types, such as temp tables and MEMORY tables, would be useful. Perhaps we will see some of these in future MySQL releases. </div>
<br />
</div>Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com15tag:blogger.com,1999:blog-768233104244702633.post-52824681883300177892012-04-14T14:04:00.000-07:002012-04-18T17:14:39.902-07:00Oracle Missed at MySQL User Conference...Not!The MySQL UC this past week was the best in years. Percona did an outstanding job of organizing the main <a href="http://www.percona.com/live/mysql-conference-2012/">Percona Live event</a> that ran Tuesday through Thursday. About 1000 people attended, which is up from the 800 or so at the O'Reilly-run conference in 2011. There were also excellent follow-on events on Friday for <a href="http://www.skysql.com/events/mysql-solutions-day/schedule">MariaDB/SkySQL</a>, <a href="http://www.drizzle.org/content/drizzle-day-fri-13-apr-2012-santa-clara">Drizzle</a>, and <a href="http://sphinxsearch.com/conference2012/schedule.html">Sphinx</a>. <br />
<div>
<br /></div>
<div>
What made this conference different was the renewed energy around MySQL and the number of companies using it. </div>
<div>
<ol>
<li>Big web properties like Facebook, Twitter, Google, and Craigslist continue to anchor the MySQL community and drive innovation from others through a combination of funding, encouragement, and patches. </li>
<li>Many new companies we have not heard from before like <a href="http://www.percona.com/live/mysql-conference-2012/sessions/scaling-pinterest">Pinterest</a>, <a href="http://www.percona.com/live/mysql-conference-2012/sessions/building-high-volume-reporting-system-amazon-using-mysql-tungsten-and-vertica">BigDoor,</a> <a href="http://www.percona.com/live/mysql-conference-2012/sessions/one-many-story-sharding-box">Box.net</a>, and <a href="http://www.skysql.com/events/mysql-solutions-day/schedule">Constant Contact</a> talked about their experience building major new applications on MySQL. </li>
<li>The vendor exhibition hall at Percona Live was hopping. Every vendor I spoke to had a great show and plans to return next year. There is great innovation around MySQL from many creative companies. I'm very proud my company, <a href="http://www.continuent.com/">Continuent</a>, is a part of this. </li>
<li>The demand for MySQL expertise was completely out of hand. So many talks ended with "...and we are hiring" that it became something of a joke. The message board was likewise packed with help wanted ads. </li>
</ol>
</div>
<div>
When Oracle acquired Sun Microsystems a couple of years ago, it triggered a lot of uncertainty about the future of MySQL. This concern turns out to be unfounded. Oracle does excellent engineering work, especially on InnoDB, but had no involvement, official or unofficial, at the conference. This was actually a good thing. </div>
<div>
<br /></div>
<div>
By not participating, Oracle helped demonstrate that MySQL is no longer dependent on any single vendor and has taken on a real life of its own driven by the people who use it. MySQL fans owe Oracle a vote of thanks for not attending this year. Next year I hope they will be back to join the fun. <br />
<br />
<i>p.s., It has come to my attention since writing this article that 800 may not be correct attendance for the O'Reilly 2011 conference. The 1000 figure is from Percona. Speaking as an attendee they seemed about the same size. Please feel free to comment if you have accurate numbers. </i></div>Robert Hodgeshttp://www.blogger.com/profile/05379726998057344092noreply@blogger.com4