The Scale-Out Blog

Oct 29, 2014

An Ending and a Beginning: VMware Has Acquired Continuent

As of today, Continuent is part of VMware. We are absolutely over the moon about it.

You can read more about the news on the VMware vCloud blog by Ajay Patel, our new boss. There’s also an official post on our Continuent company blog. In a nutshell the Continuent team is joining the VMware Cloud Services Division. We will continue to improve, sell, and support our Tungsten products and work on innovative integration into VMware’s product line.

So why do I feel exhilarated about joining VMware? There are three reasons.

1. Continuent is joining a world-class company that is the leader in virtualization and cloud infrastructure solutions. Even better, VMware understands the value of data to businesses. They share our vision of managing an integrated fabric of standard DBMS platforms, both in public clouds and in local data centers. It is a great home to advance our work for many years to come.

2. We can continue to support our existing users and make Tungsten even better. I know many of you have made big decisions to adopt Continuent technology that would affect your careers if they turned out badly. We now have more resources and a mandate to grow our product line. We will be able to uphold our commitments to you and your businesses.

3. It’s a great outcome for our team, which has worked for many years to make Continuent Tungsten technology successful. This includes our investors at Aura in Helsinki, who have been dogged in their support throughout our journey.

Speaking of the Continuent team…I am so proud of what all of you have achieved. Today we are starting a new chapter in our work together. See you at VMware!

Oct 6, 2014

Exorcising the CAP Demon

Computer science is like an enormous tool box you can rummage through whenever you have a problem to solve. Most of the tools are sturdy and practical, like algorithms for B-trees. Some are also elegant, like consistent hashing in Dynamo. Finally there are some tools that you never quite figure out even after years of reflection. That piece of steel you are looking at could be Excalibur. Or it could be a rusty knife.

The CAP theorem falls into the last category, at least for me. It was a major topic in the blogosphere a few years ago and Google Trends shows steadily increasing interest in the term since 2010. It's not my goal to explain CAP fully--a good informal description is here or you can just read the proof yourself. Instead I would like to talk about how I understand and use the CAP theorem today as well as how that understanding might evolve in the future.

In a nutshell CAP puts a limit on how distributed database systems trade off data consistency and system availability. Eric Brewer originated the theorem as a conjecture in the late 1990s. Seth Gilbert and Nancy Lynch supplied a proof of the conjecture in 2002. Brewer described it as follows in 2012:

The CAP theorem states that any networked shared-data system can have at most two of three desirable properties:

consistency (C) equivalent to having a single up-to-date copy of the data;

high availability (A) of that data (for updates); and

tolerance to network partitions (P).

My initial problem in understanding CAP was relating the proof to what happens in the real world, which is not especially easy. Network partitions are an example. Here's how the Gilbert/Lynch proof defines them in Section 2.3.

When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost. (And any pattern of message loss can be modeled as a temporary partition separating the communicating nodes at the exact instant the message is lost.)

So does this include an asymmetric communication failure? That's where a process on one host can see and send messages to a process on another host but the reverse is not true. This happens all the time in group communications for reasons that range from application software bugs to bad cabling and everything in between. Do you model the asymmetry as a sequence of temporary partitions? It's of course possible. But it feels a bit like using Ptolemaic astronomy with epicycles.

Other people have made similar observations. Eric Brewer even wrote about the "nuances" of partitions in his 2012 retrospective. There are analogous problems with the other terms. There was enough public disagreement about their meaning that I wrote a "disproof" of CAP a few years back as an April Fools' Day joke. It depended on not being able to distinguish CA and CP choices in real systems.

That confusion is not a problem with the CAP theorem itself. Nobody has seriously challenged the proof. Instead, it's a matter of what logicians refer to as interpretation, which links a logical model to some domain of discourse so that you can draw valid conclusions about that domain. If you want to reason about real-world systems using the CAP theorem you must first ensure your systems really match the model. Otherwise it's like using a map of Oregon to drive between New York and Boston. The core difficulty is that the CAP theorem proof assumes binary properties whereas in reality properties like availability operate on a sliding scale.

My other issue with CAP evaluation is what you might call a suitability problem. There are a lot of issues with operating distributed systems, and the 3-way trade-off is irrelevant to many of them. For instance, what happens when the network is behaving and you don't have to make pesky choices between availability and consistency? Let's look at some examples.

CAP defines consistency as linearizability, which means that transactions on different replicas look as if they all happened at once in a single place in a single unbroken series. Imagine driving around to different automated teller machines at a bank and making changes to your account balance or checking it. No matter which teller machine you visit next, it knows exactly what happened before and has the right balance amount. Or imagine a shopping cart on a website like Zappos.com. No matter how you jump around the website to select clothing or even if you fold up your laptop and fly to Paris, the items in your shopping cart remain consistent without duplicate or missing selections.

You might say, well, not all systems work that way. You would be right, and that's exactly the point. Real distributed systems do not always try to ensure linearizability. It turns out that many people, most particularly end users who ultimately pay for computer systems, conclude they don't really care so much about consistency of the sort CAP promises. Here are two different types of reasons:

1. Linearized consistency is expensive. Keeping active replicas up to date requires round trip messages between hosts, which can increase transaction commit times by an order of magnitude or more. Users are allergic to slow response, regardless of any other benefits that slowness might bring. Daniel Abadi pointed out this latency problem some time ago in a great blog post on CAP that is still excellent reading today.

2. Linearized consistency is irrelevant for many applications. Consider a measurement from a household thermometer or a text message from a cell phone. There is only one of each generated in a single location. Your servers either get them or they don't. Multiple copies are just that: replicas of the same thing. Conflicts don't exist.
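The second point can be made concrete with a minimal sketch. It assumes each record carries a globally unique ID; the `Reading` type and `merge` function are hypothetical names used only for illustration. Because a record never changes after it is created, combining replicas is just a union keyed by ID, and the order in which copies arrive makes no difference.

```python
from typing import Dict, Iterable, NamedTuple


class Reading(NamedTuple):
    """An immutable measurement; record_id is assumed to be globally unique."""
    record_id: str
    sensor: str
    celsius: float


def merge(*replicas: Iterable[Reading]) -> Dict[str, Reading]:
    """Union replicas keyed by record ID. Duplicate copies collapse into one."""
    merged: Dict[str, Reading] = {}
    for replica in replicas:
        for reading in replica:
            # A record never changes after creation, so any copy is as good
            # as any other; keep the first one seen.
            merged.setdefault(reading.record_id, reading)
    return merged


r1 = Reading("a1", "kitchen", 20.5)
r2 = Reading("a2", "kitchen", 21.0)

replica_x = [r1, r2]
replica_y = [r2]  # this replica never received r1

# Merging in either order yields the same set of records.
assert merge(replica_x, replica_y) == merge(replica_y, replica_x)
```

A server that missed a record simply has fewer entries; it never holds a different version of the same one, so there is nothing to reconcile.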

The share of immutable data from analytic systems like Hadoop and object stores like Amazon S3 is increasing rapidly, which means that there is an increasing number of applications for which CAP is not the only or even a major design consideration. It might be in the guts of the system but it's just one of many problems at that level and there may be multiple choices. The original Hadoop architecture actually ignored CAP for one critical part of the system--the NameNode, which maps HDFS file names to storage, was a single point of failure.

Which brings us back to understanding CAP at a practical level. Is it Excalibur or just the rusty knife? At this point it feels like another tool in the toolbox that you use at the right time, albeit carefully. Imagine a band saw that does not have a very good guard on the blade. Here are my personal instructions for safe use.

1. Use it for suitable problems. The CAP theorem applies to a very specific problem involving systems that want to remain consistent and available across multiple networked hosts. If you design clusters or distributed databases, this is a relatively big deal. The trade-offs are real and you have to think about them.

For instance at Continuent we have some problems where the theorem is directly applicable. We build clusters that implement failover. We have to consider how to establish consensus while keeping the cluster available even when members lose messages or respond slowly. The CAP theorem guides you to manage this kind of problem rather than try to solve it using techniques that will not work, such as adding timeouts on messages. (Continuent Tungsten clusters are generally CP, in case you are wondering.)

2. Avoid CAP where it does not obviously apply. It is a tricky theorem to interpret correctly, and many applications are concerned with unrelated problems. I work a lot on transactional replication. There are no CAP issues in Tungsten Replicator. At the other end of the spectrum if you build systems that link multiple stores using replication, you likely have multiple CAP choices under the covers. That's a common pattern in complex applications.

It is therefore important to look with a jaundiced eye upon any product that claims to "beat CAP," like this widely read article. This is just marketing hype. If your application matches the CAP theorem model, it applies and you are subject to the limitations. If the limitations don't seem to make sense you have not evaded them. You are either working on a problem to which CAP is not relevant or you made implicit CAP choices of which you are not aware. It is easy to make a fool of yourself by asserting otherwise.

3. Other tools are important too. CAP of course does not even cover all trade-offs in clusters. There are also many issues to consider when building distributed data systems that actually work. Latency, durability of data, monitoring, automation, reliability, ability to do zero-downtime maintenance, and security are critical. Especially security. That looks like the next big problem for a lot of existing distributed systems.

Beyond these, don't stop thinking about CAP. It is one of those ideas that gets under your skin and really bugs you. In addition to Eric Brewer's 2012 article, Seth Gilbert and Nancy Lynch wrote a follow-up perspective on the implications of CAP, so even the originators are continuing to consider the problems. The long term value of CAP is that it has focused attention on a set of difficult data management problems and led to numerous productive ideas about how to manage them. The resulting evolution is not nearly finished. We will all continue to worry this bone for many years to come.

Feb 20, 2014

No Hadoop Fun for Me at SCaLE 12X :(

I blogged a couple of weeks ago about my upcoming MySQL/Hadoop talk at SCaLE 12X. Unfortunately I had to cancel. A few days after writing the article I came down with an eye problem that is fixed but prevents me from flying anywhere for a few weeks. That's a pity as I was definitely looking forward to attending the conference and explaining how Tungsten replicates transactions from MySQL into HDFS.

Meanwhile, we are still moving at full steam with Hadoop-related work at Continuent, which is the basis for the next major replication release, Tungsten Replicator 3.0.0. Binary builds and documentation will go up in a few days. There will also be many more public talks about Hadoop support, starting in April at Percona Live 2014. I hope you'll consider attending one of our talks there. It's a great conference.

Since my SCaLE 12X talk won't be happening I would like to repeat the invitation to attend the Continuent webinar on loading from MySQL to Hadoop on Thursday February 27th. It's essentially the same talk, but no airplanes are involved.

Feb 17, 2014

Why Aren't All Data Immutable?

Over the last few years there has been an increasing interest in immutable data management. This is a big change from the traditional update-in-place approach many database systems use today, where new values delete old values, which are then lost. With immutable data you record everything, generally using methods that append data from successive transactions rather than replacing them. In some DBMS types you can access the older values, while in others the system transparently uses the old values to solve useful problems like implementing eventual consistency.

Baron Schwartz recently pointed out that it can be hard to get decent transaction processing performance based on append-only methods like append-only B-trees. This is not a very strong argument against immutable data per se. Immutable data are already in wide use. It is actually surprising they have not made deeper inroads into online transaction processing, which is widely handled by relational DBMS servers like MySQL and Oracle.

Immutable Data Are Now Economically Feasible

One reason for the popularity of update-in-place approaches is simple: storage used to be really expensive. This is no longer the case. Many applications can now afford to store the entire DBMS transaction log almost indefinitely. To illustrate, look at storage costs in Amazon Web Services. Applications running in Amazon have API-level access to practically unlimited replicated, long-term storage through services like S3 and Glacier. Amazon conveniently publishes prices that serve as good proxies for storage costs in general. Using these numbers, I worked up a simple spread sheet that shows the cost of storing 7 years of transactions for a made-up business application.

To start with, assume our sample app generates one thousand transactions per second at 1,000 bytes per transaction. This is not exceedingly busy by some standards but is relatively high for business systems that handle human-generated transactions. The main place you see numbers approaching this level is SaaS businesses that handle many customers on a single system. Our sample system generates about 205,591 gigabytes of data over seven years.

Xacts/Sec	Bytes/Xact	Bytes/Sec	GB Generated in 1 Hour	GB Generated in 1 Day	GB Generated in 1 Month	GB Generated in 1 Year	GB Generated in 7 Years
1,000	1,000	1,000,000	3.35	80.47	2,447.52	29,370.19	205,591.32
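
The figures in the table can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions that the text does not state but that match the published numbers: a gigabyte is taken as 2^30 bytes, and a month is one twelfth of a 365-day year. The function and constant names are illustrative.

```python
BYTES_PER_GB = 2 ** 30        # assumption: binary gigabytes match the table
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
DAYS_PER_YEAR = 365           # assumption: leap days are ignored


def gb_generated(xacts_per_sec: float, bytes_per_xact: float, seconds: float) -> float:
    """Gigabytes of transaction data produced over the given number of seconds."""
    return xacts_per_sec * bytes_per_xact * seconds / BYTES_PER_GB


per_hour = gb_generated(1000, 1000, SECONDS_PER_HOUR)
per_day = gb_generated(1000, 1000, SECONDS_PER_DAY)
per_year = per_day * DAYS_PER_YEAR
per_month = per_year / 12
seven_years = per_year * 7

for label, value in [
    ("1 hour", per_hour),
    ("1 day", per_day),
    ("1 month", per_month),
    ("1 year", per_year),
    ("7 years", seven_years),
]:
    print(f"GB generated in {label}: {value:,.2f}")
```

Changing the two arguments to `gb_generated` is the quickest way to see how the totals move for a smaller or busier application.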

Amazon storage costs vary from $0.011/GB/month for Glacier to $0.09/GB/month for S3 with full redundancy. (These are numbers for the US-West region as of 29 December 2013.) Annual storage costs for 7 years of data are pretty hefty if you store uncompressed data. However, if you factor in compression--for example MySQL binlogs tend to compress around 90% in my experience--things start to look a lot better.

	Annual cost to store 7 years of data at different levels of compression
	0%	20%	40%	60%	70%	80%	90%
Glacier	$27,138.05	$21,710.44	$16,282.83	$10,855.22	$8,141.42	$5,427.61	$2,713.81
S3 Reduced Redundancy	$177,630.90	$142,104.72	$106,578.54	$71,052.36	$53,289.27	$35,526.18	$17,763.09
S3 Standard	$222,038.63	$177,630.90	$133,223.18	$88,815.45	$66,611.59	$44,407.73	$22,203.86
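
The cost table follows from one formula: stored gigabytes, reduced by the compression ratio, times the monthly price, times twelve. The sketch below uses the Glacier and S3 Standard prices quoted in the text. The Reduced Redundancy price of $0.072 is not quoted anywhere above; it is inferred from the table and should be treated as an assumption.

```python
GB_STORED = 205_591.32  # seven years of transactions, from the text above

# Dollars per GB per month. Glacier and S3 Standard are quoted in the text;
# the Reduced Redundancy price is inferred from the table (an assumption).
PRICES = {
    "Glacier": 0.011,
    "S3 Reduced Redundancy": 0.072,
    "S3 Standard": 0.09,
}

COMPRESSION_LEVELS = (0.0, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9)


def annual_cost(gb: float, price_per_gb_month: float, compression: float) -> float:
    """Yearly cost of holding `gb` after shrinking it by `compression` (0.0 to 1.0)."""
    return gb * (1.0 - compression) * price_per_gb_month * 12


for name, price in PRICES.items():
    row = [annual_cost(GB_STORED, price, level) for level in COMPRESSION_LEVELS]
    print(name, " ".join(f"${cost:,.2f}" for cost in row))
```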

The raw costs still look hefty to the untrained eye, but we need to factor in the real expense of operating this type of system. Here's a typical cost structure for a 3 node cluster (to ensure HA) with labor costs factored in and preserving 7 years of data. I have put in generously small IT overhead costs including software development, since the code has to come from somewhere. Under these assumptions long-term storage costs are less than 10% of the yearly cost of operation.

Component	Cost	Percentage	Notes
3 i2.4xlarge instances	$46,306.68	20.09%	(Heavy utilization reserved, 1 yr. term)
3 support licenses	$15,000.00	6.51%	(Support subscription costs * 3x)
Raw dbadmin labor	$12,000.00	5.21%	(1 FTE/30 DBMS servers @ 120K per)
Software dev/QA	$120,000.00	52.06%	(10 FTE/30 DBMS servers @ 120K per)
Misc. overhead costs	$15,000.00	6.51%	($5K per server)
S3 Storage	$22,203.86	9.63%	(7 years of data, 90% compression)
Total	$230,510.54	100.00%
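
For readers who want to plug in their own numbers, here is a small sketch that rebuilds the percentages from the component costs in the table. The `labor_cost` helper is a hypothetical name; it only encodes the "FTE per 30 DBMS servers @ 120K" rule from the Notes column, applied to the 3 servers in this cluster.

```python
def labor_cost(fte_per_30_servers: float, salary: float = 120_000, servers: int = 3) -> float:
    """Share of staff cost attributable to this cluster, per the Notes column."""
    return fte_per_30_servers * salary * servers / 30


COSTS = {
    "3 i2.4xlarge instances": 46_306.68,
    "3 support licenses": 15_000.00,
    "Raw dbadmin labor": labor_cost(1),     # 1 FTE per 30 servers
    "Software dev/QA": labor_cost(10),      # 10 FTE per 30 servers
    "Misc. overhead costs": 3 * 5_000.00,   # $5K per server
    "S3 Storage": 22_203.86,                # 7 years of data, 90% compression
}

total = sum(COSTS.values())
shares = {name: 100.0 * cost / total for name, cost in COSTS.items()}

for name, cost in COSTS.items():
    print(f"{name:<24} ${cost:>12,.2f} {shares[name]:6.2f}%")
print(f"{'Total':<24} ${total:>12,.2f}")
```

Replacing the storage line with the Glacier figure, or raising the staffing ratios, shows how quickly storage shrinks as a share of the total.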

Long-term storage costs for base transaction data can be far lower if any of the following hold:

You generate fewer transactions per second or they are smaller. Many business apps produce far fewer transactions than my example.
You don't keep data for the full 7 years. Some of the analytic users I work with just keep a couple of years.
You are already paying archiving costs for backups, in which case the additional storage cost becomes a wash if you can stop using a separate backup system.
You add more external costs to the picture--running a real business that generates this level of transactions often takes far more people than are shown in my projection.

In these cases long-term storage costs could be in the 1-2% range as a percentage of IT operating costs. Over time storage costs will decrease--though the rate of decline is hard to predict--so each year the number of systems able to afford preservation of complete transaction histories will correspondingly increase. This is particularly true for business transactions, which tend to be human generated and subject to upper growth limits once businesses are fully automated. If you push data into Glacier, economically feasible retention periods can run to decades. This is far longer than most businesses (or more particularly their lawyers) even want to keep information around.

There are still reasons for wanting an update-in-place model for OLTP systems, for example to keep as much of your working set as possible in RAM or on fast SSDs to keep response time low. But storage cost alone is no longer a major factor for a wide range of applications. This development is already affecting data management technology profoundly. Doug Cutting has pointed out on numerous occasions that the downward cost trajectory of commodity storage was a key driver in the development of Hadoop.

Users Want Immutable Data

Many organizations already keep long transaction histories to feed analytics by loading them into traditional data warehouses based on Teradata, Vertica, and the like. As soon as a practical method appeared to keep such data more economically, businesses began to adopt it quickly. That "method" is Hadoop.

Hadoop has a fundamentally different approach to data management from relational and even many NoSQL systems. For one thing, immutable data are fundamental. The default processing model is that you write data but rarely change it once written. To illustrate, the HiveQL SQL dialect does not even have UPDATE or DELETE statements. Instead, you overwrite entire tables or parts of them to make changes. This works because Hadoop organizes storage on cheap commodity hardware (HDFS) and provides a workable way to access data programmatically (MapReduce).

Hadoop changes the data management cost model in other ways besides utilizing commodity hardware efficiently. With Hadoop you don't necessarily define *any* data structures up front. Instead, you store transactions in native form and write programs to interpret them later on. If you need structure for efficient queries you add it through MapReduce and perhaps store it as a materialized view to make other queries more efficient. Hadoop eliminates a lot of the up-front effort (and risk) required to get transactions into a data warehouse. Instead, it defers those costs until you actually need to run specific analytics. Moreover by storing native transaction formats, you can answer new questions years later. That is a very powerful benefit.
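
The idea of storing first and interpreting later can be shown in miniature. The sketch below is plain Python rather than MapReduce, and it assumes the "native form" is one JSON document per line; the field names and both functions are hypothetical. No schema is declared when the records are stored, and each question supplies its own structure at read time.

```python
import json
from typing import Dict, List

# Raw transactions kept exactly as they arrived; no schema was declared.
RAW_LOG: List[str] = [
    '{"order_id": 1, "customer": "alice", "total": 30.0, "coupon": "SPRING"}',
    '{"order_id": 2, "customer": "bob", "total": 12.5}',
    '{"order_id": 3, "customer": "alice", "total": 7.5}',
]


def revenue_by_customer(lines: List[str]) -> Dict[str, float]:
    """The question we knew about on day one."""
    totals: Dict[str, float] = {}
    for line in lines:
        record = json.loads(line)
        totals[record["customer"]] = totals.get(record["customer"], 0.0) + record["total"]
    return totals


def coupons_used(lines: List[str]) -> List[str]:
    """A question asked later: the field was stored but never modeled."""
    records = (json.loads(line) for line in lines)
    return [record["coupon"] for record in records if "coupon" in record]


print(revenue_by_customer(RAW_LOG))
print(coupons_used(RAW_LOG))
```

The second function works only because the raw records were kept whole. A warehouse schema designed around the first question would likely have dropped the coupon field on the way in.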

I have been working a lot with Hadoop over the last few months. It's a bear to use because it consists of a set of loosely integrated and rapidly evolving projects with weak documentation and lots of bugs. Even with these difficulties, the rising level of Hadoop adoption for analytics shows the underlying model has legs and that users want it. As Floyd Strimling pointed out a while ago on Twitter this genie is not going back in the bottle. HDFS is becoming the default storage mechanism for vast quantities of data.

Immutable Data Management Looks Like a Good Bet

One of the basic problems in discussing immutable data management is that there are different kinds of immutable data that persist at different timescales. Baron has a point that Couchbase, Datomic, NuoDB, or whatever new DBMS implementation you choose are in some ways recapitulating solutions that existing RDBMS implementations reached long ago. But I also think that's not necessarily the right comparison when talking about immutable data, especially when you start to think about long retentions.

The fact is that Oracle, MySQL, PostgreSQL, and the like do not utilize distributed commodity storage effectively and they certainly do not enable storage of the long tail transaction histories that many businesses clearly want for analytics. The best way to do that is to replicate transactions into HDFS and work on them there. That is hard even for MySQL, which has flexible and economical replication options. (We are working on making it easier to do at Continuent but that's another article. :)

In my opinion a more useful criticism of the arriviste competitors of traditional OLTP systems is that they don't go far enough with immutable data and risk being outflanked by real-time transaction handling built on top of HDFS. Hadoop real-time work on projects like Apache Spark is for the time being focused on analytics but OLTP support cannot be far behind. Moreover, there is a window to build competitors to HDFS that gets smaller as Hadoop becomes more entrenched. This seems more interesting than building stores that offer only incremental improvements over existing RDBMS implementations.

Immutable data now permeate IT due to decreasing storage costs coupled with requirements for analytic processing. It's like the famous quote from William Gibson:

The future is already here--it's just not very evenly distributed.

If you look at the big picture the arguments for database management based on immutable data seem pretty strong. It is hard to believe it won't be a persistent trend in DBMS design. Over the long term mutable data look increasingly like a special case rather than the norm.

Feb 7, 2014

Fun with MySQL and Hadoop at SCaLE 12X

It's my pleasure to be presenting at SCaLE 12X on the subject of real-time data loading from MySQL to Hadoop. This is the first public talk on work at Continuent that enables Tungsten Replicator to move transactions from MySQL to HDFS (Hadoop Distributed File System). I will explain how replication to Hadoop works, how to set it up, and offer a few words on constructing views of MySQL data using tools like Hive.

As usual with replication everything we are doing on Hadoop replication is open source. Builds and documentation will be publicly available by the 21st of February, which is when the talk happens. Hadoop support is already in testing with Continuent customers, and we have confidence that we can handle basic loading cases already. That said, Hadoop is a complex beast with lots of use cases, and we need feedback from the community on how to make Tungsten loading support better. My colleagues and I plan to do a lot of talks about Hadoop to help community users get up to speed.

Here is a tiny taste of what MySQL to Hadoop loading looks like. Most MySQL users are familiar with sysbench. Have you ever wondered what sysbench tables would look like in Hadoop? Let's use the following sysbench command to apply transactions to table db01.sbtest:

sysbench --test=oltp --db-driver=mysql --mysql-host=logos1 --mysql-db=db01 \
    --mysql-user=tungsten --mysql-password=secret \
    --oltp-read-only=off --oltp-table-size=10000 \
    --oltp-index-updates=4 --oltp-non-index-updates=2 --max-requests=200000 \
    --max-time=900 --num-threads=5 run

This results in rows that look like the following in MySQL:

mysql> select * from sbtest where id = 2841\G
*************************** 1. row ***************************
 id: 2841
  k: 2
  c: 958856489-674262868-320369638-679749255-923517023-47082008-646125665-898439458-1027227482-602181769
pad: qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt

After replication into Hadoop with Tungsten, we can crunch the log records using a couple of HiveQL queries to generate a point-in-time snapshot of the sbtest table on HDFS. By a point-in-time snapshot, I mean a table that contains not only inserted data but also shows the results of subsequent update and delete operations on each row up to a particular point in time. We can now run the same query to see the data:

hive> select * from sbtest where id = 2841;
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
Job 0: Map: 1   Cumulative CPU: 0.74 sec   HDFS Read: 901196 HDFS Write: 158 SUCCESS
Total MapReduce CPU Time Spent: 740 msec
OK
2841 2 958856489-674262868-320369638-679749255-923517023-47082008-646125665-898439458-1027227482-602181769 qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt

Tungsten does a lot more than just move transaction data, of course. It also provides tools to generate Hive schema, performs transformations on columns to make them match the limited HiveQL datatypes, and arranges data in a way that allows you to generate materialized views for analytic usage (like the preceding example) with minimal difficulty.
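
To make the point-in-time snapshot idea concrete, here is a minimal sketch in Python. It is not how Tungsten or Hive implements the materialization; the `Change` record layout and the `snapshot` function are hypothetical and exist only to show the logic of replaying a change log up to a chosen position.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Change(NamedTuple):
    """One log record. Field names are hypothetical, not Tungsten's format."""
    seqno: int            # position of the change in the log
    op: str               # "I" insert, "U" update, "D" delete
    key: int              # primary key of the affected row
    row: Optional[dict]   # full row image after the change; None for deletes


def snapshot(log: Iterable[Change], as_of: int) -> Dict[int, dict]:
    """Replay changes with seqno <= as_of and return the surviving rows by key."""
    table: Dict[int, dict] = {}
    for change in sorted(log, key=lambda c: c.seqno):
        if change.seqno > as_of:
            break
        if change.op == "D":
            table.pop(change.key, None)
        else:
            # Inserts and updates both carry the full row image.
            table[change.key] = change.row
    return table


log = [
    Change(1, "I", 2841, {"id": 2841, "k": 1}),
    Change(2, "I", 2842, {"id": 2842, "k": 7}),
    Change(3, "U", 2841, {"id": 2841, "k": 2}),
    Change(4, "D", 2842, None),
]

print(snapshot(log, as_of=2))  # state before the update and delete
print(snapshot(log, as_of=4))  # state after the whole log is applied
```

Because the log itself is never modified, the same data can produce a snapshot for any earlier position simply by changing `as_of`.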

If you want to learn more about how Tungsten does all of this magic, please attend the talk. I hope to see you in Los Angeles.

p.s., If you cannot attend SCaLE 12X, we will have a Continuent webinar on the same subject the following week. (Sign up here.)

Jan 10, 2014

Why I Love Open Source

Anders Karlsson wrote about Some myths on Open Source, the way I see it a few days ago. Anders' article is mostly focused on exploding the idea that open source magically creates high quality code. It is sad to say you do not have to look very far to see how true this is.

While I largely agree with Anders' points, there is far more that could be said on this subject, especially on the benefits of open source. I love working on open source software. Here are three reasons that are especially important to me.

1.) Open source is a great way to disseminate technology to users. In the best cases, it is this easy to get open source products up and running:

$ sudo apt-get install software-i-want-to-use

A lot of software companies (mine included) open source their software because it gets product into the hands of people who might pay money for it later. The strategy worked brilliantly for MySQL AB as Anders pointed out. MongoDB is repeating the tactic with what looks like equal success. There has been a lot of pointless argument over the years about whether MySQL or MongoDB are "real databases." Being easy to get is just as critical to adoption as features like transactions and scalable performance.

Open source is therefore even better for users, who can quickly decide if something works for them and provide feedback through communities about problems as well as suggested improvements. To the extent open source software has high quality, it originates in the tight feedback loop between software producers and their user communities. That in turn leads to faster innovation with fewer deviations from real user needs. In olden days we called this getting the requirements right. Open source projects often do it extraordinarily well.

2.) Open source allows like-minded communities of developers to create products that would otherwise never happen. Linux became a dominant operating system in large part through the staggering scale of contributions enabled by exceptionally well-managed open source development. Linus Torvalds recently pointed out that Linux kernel releases have patches from a thousand contributors or more. Thanks to the wide range of contributions, Linux operates on everything from tiny ARM processors to servers with over 200 cores. The development effort underlying the Linux ecosystem is huge when you include the kernel and all the packages that install over it. It dwarfs any comparable operating system effort I can think of.

At the other end of the spectrum there are small but incredibly useful projects like Apache Curator. The Curator project currently has 8 project members, mostly from different companies, who collaborate to make Apache ZooKeeper vastly easier to program. I doubt libraries like Curator would even exist without open source licenses and infrastructure like distributed source code management. Neither would ZooKeeper, for that matter.

Not every line of open source code is excellent or even above average. (I'm looking at you, Hadoop.) That said, open source projects are not so much about code but communities of developers who understand and are interested in solving a specific problem. Besides direct feedback from real users, this is the other prerequisite for creating truly great products. Clean code is helpful but not necessary.

3.) Open source means your creations can never be taken away from you. In many creative endeavors work belongs to the people who employ you. It effectively disappears when you change jobs. Putting code on GitHub or code.google.com breaks that bond. Knowing that anything you create will always be accessible removes any hesitation about revealing your best ideas. I believe this is one of the drivers behind the flowering of creativity that infuses so many open source projects.

At the same time working on open source software is not all peaches and cream. Building successful businesses on open source is hard, which limits the opportunities to work on it for a living.

For instance, if most of the value of your product is in the software itself there is not much motivation for users to pay you. I think that's one reason mobile apps are by-and-large for pay or at least not open source. You need to find a business model that brings in enough money over time to fund the sort of concentrated engineering necessary to build robust software. Successful open source businesses often depend on finding the right markets or achieving network effects, and not all software can fit the pattern.

The good news is that once you get the economics right it really wrong-foots your closed source competitors. Red Hat has built a great business packaging and supporting open source for enterprises. They see open source as a competitive advantage that extends their market reach and speeds up innovation. An increasing number of companies producing DBMS software take the same view as they try to disrupt data management. Outside of enterprise software Valve Software is attacking proprietary gaming platforms through open source.

It's great to see the growing number of businesses based on open source development. When the model works it is incredibly satisfying. I guess this is a fourth reason why I love working on open source software.

Mar 28, 2013

See You at Percona Live 2013!

Percona Live 2013 is coming up fast. This is hands-down the best MySQL conference of the year, attended by a lot of people I really respect. Check the speaker list if you need some of their names. I will also be doing two talks myself.

9am Wednesday 24 April - Keynote: How MySQL Can Thrive in the World of Massive Data Hype. NoSQL solutions are oversold, but this is no reason for complacency in the MySQL community. There are new challenges in data management, and we need to solve them or become irrelevant. I will show some of the advances Continuent has on tap for MySQL-based applications and also point back to problems our experience shows must be solved within MySQL itself.
1pm Wednesday 24 April - Session: State of the Art for MySQL Multi-Master Replication. This talk will explain the fundamentals of multi-master operation and then trace the trade-offs of Tungsten, Galera, and other solutions. Thanks to excellent work on several products there is a lot of excitement about multi-master in 2013. My goal is to help listeners understand what applications are possible now as well as what we have the potential to achieve in the future.

I hope you will attend these talks. I am looking forward to meeting old friends at the conference and making new ones.

Incidentally, Percona Live sent me an email yesterday that you can get a 15% discount on the registration price using the code KeySQL when you sign up. At Continuent we are also offering free passes to customers who give us the best quotes about our software. However you get there, I really recommend this conference.

Feb 19, 2013

Data Fabric Design Patterns: Fabric Connector

This article is the third in a series on data fabric design and introduces the fabric connector service design pattern. The previous article in this series introduced the transactional data service design pattern, which defines individual data stores and is the building block for data fabrics based on SQL databases. The fabric connector builds on transactional data services and is another basic building block of fabric architecture.

Description and Responsibilities

Fabric connectors make a collection of DBMS servers look like a single server. The fabric connector presents what appears to be a data service API to applications. It routes each request to an appropriate physical server for whatever task the application is performing, hiding the fact that a data fabric can consist of dozens or even hundreds of servers. Applications cannot tell the difference between talking to the fabric connector and talking to a real DBMS server. We call this property transparency.

Here are the responsibilities of a fabric connector. I will use the phrase proxying to refer to the first of these, and routing responsibilities to refer to the remaining three.

Expose a data service interface to applications.
Route each application query to an appropriate DBMS server.
Balance load by distributing queries across multiple replicas, if available.
Switch to another server following a failure or if the DBMS becomes unavailable due to maintenance.

The following diagram shows the logical components of a fabric connector. The fabric connector sits between applications, transactional data services, and a fabric directory service. These are greyed out, as they are not part of the pattern.

[Figure: Fabric Connector Design Pattern]

Fabric connectors contain two logical components. The proxy is responsible for routing queries and responses between applications and underlying data services. This can be a library layer, a separate server process, or a TCP/IP load balancer--anything that provides a transparent indirection layer. The directory information contains rules to route SQL queries correctly to the actual location of data. There is a notification protocol that permits connectors to receive updates about the fabric topology and confirm that they have responded to them.

Motivation

Connecting to data is a problem in large systems. Sharded data sets spread data across multiple services. Data services have different roles, such as master or slave. Services fail or go offline for maintenance. Services change roles, such as a master switching to a slave. Shards move between services to balance load and use storage more efficiently. Within short periods of time there may be significant variations in load across data services. Adding routing logic directly to applications in these cases adds complexity and can lead to a tangled mess for administrators.

The fabric connector design pattern encapsulates logic to route connections from the application to DBMS servers. Hiding connection logic helps keep applications simple. It allows independent testing and tuning of the connection rules. That way you can have some assurance the logic actually works. You can also modify fabric behavior without modifying applications, for example to redistribute load more evenly across replicas.

Related Design Patterns

The fabric connector design pattern manages single application connections to data services, for example a transactional data service. Transparency is the leitmotif of this pattern. It provides needed encapsulation for other data fabric design patterns and is particularly critical for sharded as well as fault tolerant data services. These will be covered in future articles on data fabric design.

There are also other design patterns for data access. Here are two that should not be confused with fabric connectors.

Federated query. Federated query splits a SQL query into sub-queries that it routes to multiple underlying data services, then returns the results. Sharding products like DbShards and shard-query implement this pattern. It requires complex parsing, query optimization, and aggregation logic to do correctly and has varying levels of transparency.
MapReduce. MapReduce is a procedure for breaking queries into pieces that can run in parallel across large numbers of hosts by splitting the query into map operations to fetch data followed by reduce operations to aggregate results. It can work on any distributed data set, not just SQL. MapReduce implementations often eschew SQL features like joins and also can have a very different programming model from SQL. Their use is often non-transparent to SQL applications.

Finally, there is a very important pattern for the fabric directory service. This is a directory service that maintains information about the desired topology of the fabric and its actual state. It can be implemented in forms ranging from a shared configuration file to network services in a distributed configuration manager like ZooKeeper.

I hope to add more complete descriptions for the latter three design patterns at some point in the future. For the current article, we will stick to simple connectivity.

Detailed Behavior

Fabric connectors are conceptually simple: route a request from the application to a server, then transfer the results back. Actual behavior can be quite complex. To give some perspective on the problem, here is a short Perl program for a SaaS application that logs order detail information in a table named sale, then reads the same data back. We will use the sample program to illustrate the responsibilities of this design pattern in detail.

use DBI;

# Connect to server. 
$dbh = DBI->connect("DBI:mysql:test;host=prodg23", "app", "s3cr3t5"
            ) || die "Could not connect to database: $DBI::errstr";

# Insert order using a transaction. 
$dbh->{'AutoCommit'} = 0;
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 353009, 24.99)");
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 268122, 59.05)");
$dbh->commit();

# Select order back with an auto-commit read
$dbh->{'AutoCommit'} = 1;
$sth = $dbh->prepare("SELECT * FROM sale WHERE order_id=2331");
$sth->execute();
while( $href = $sth->fetchrow_hashref ) {
  print "id      : $$href{id} \n";
  print "order_id: $$href{order_id} \n";
  print "cust_id : $$href{cust_id} \n";
  print "sku     : $$href{sku} \n";
  print "amount  : $$href{amount} \n";
}

# Disconnect from server. 
$dbh->disconnect();

The first responsibility of the fabric connector design pattern is to provide a transparent interface for underlying data services. That means that our Perl program has to work as written--no extra changes. Here are just a few things a connector needs to do:

Implement the DBMS connection protocol fully or pass it transparently to an underlying server. This includes handling authentication handshakes as well as setting base session context like client character sets.
Handle all standard features of query invocation and response, including submitting queries, returning automatically generated keys, and handling all supported datatypes in results.
Respect transaction boundaries so that the INSERT statements on the sale table are enclosed in a transaction in the DBMS and the SELECT statement is auto-commit (i.e., a single-statement transaction).
Read back data written to the sale table.

In addition to handling API protocols, fabric connectors need to avoid slowing down transaction processing as a result of proxying. Properly written connectors for the most part add minimal overhead, but there are at least two instances where this may not be the case for some implementations (such as network proxies). The first is establishing connections, a relatively expensive operation that occurs constantly in languages like PHP that do not use connection pools. The second is short primary key-lookup queries on small datasets, which tend to be memory-resident in the server and hence have quick access.

One common reaction is to see such overhead as a serious problem and avoid the whole fabric connector approach. Yet the "tax" applications pay for proxying is not the whole story on performance. Fabric connectors can boost throughput by an order of magnitude by distributing load intelligently across replicas. To understand the real application overhead of a connector you therefore need to measure with a properly sized data set and take into account load-balancing effects. Test results on small data sets that reside fully in memory with no load balancing tend to be very misleading.

The remaining fabric connector design pattern responsibilities are closely related: route requests accurately to the correct service, load-balance queries across replicas within a service, and route around replicas that are down due to maintenance or failure. We call these routing responsibilities. They require information about the fabric topology, which is maintained in the connector's directory information. Here is a diagram of typical directory organization.

[Figure: Fabric Directory Service Organization]

Let's start with the responsibility to route requests to data services. A simple fabric connector implementation allows connections using a logical server name, such as group2, which the connector would translate to an actual DBMS server and port, such as prodg23:3306. A better fabric connector would allow applications to use a customer name like "walmart" that matches what the application is doing. The connector would look up the location of customer data and connect automatically to the right server and even DBMS schema. This is especially handy for SaaS applications, which often shard data by customer name or some other simple identifier.

We could then change our program as follows to connect to the local host and look for the "walmart" schema. Under the covers, the fabric connector will connect to the prodg23 server and use the actual schema for that customer's data.

use DBI;
# Connect to customer data.  
$dbh = DBI->connect("DBI:mysql:walmart;host=localhost", "app", "s3cr3t5"
            ) || die "Could not connect to database: $DBI::errstr";

This is a modest change that is very easy to explain and implement. It is a small price to pay for omitting complex logic to locate the correct server and schema that contains the data for this customer.
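
The lookup that happens under the covers can be sketched roughly as follows. This is a minimal illustration in Python, not code from any real connector. The DIRECTORY table, the Location type, the resolve function, and the physical schema name cust_0042 are all hypothetical; only the logical name "walmart" and the server prodg23:3306 come from the text above.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Physical location of one customer's data (hypothetical structure)."""
    host: str
    port: int
    schema: str


# Hypothetical directory information: logical name -> physical location.
# A real connector would receive this from the fabric directory service.
DIRECTORY = {
    "walmart": Location(host="prodg23", port=3306, schema="cust_0042"),
}


def resolve(logical_name: str) -> Location:
    """Translate the name an application connects to into a server and schema."""
    try:
        return DIRECTORY[logical_name]
    except KeyError:
        raise LookupError(f"no directory entry for {logical_name!r}") from None


location = resolve("walmart")
print(f"{location.host}:{location.port}/{location.schema}")
```

Keeping the mapping in one table means that moving a customer to another server is a directory change only. The application keeps connecting to "walmart".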

The next responsibility is to distribute load across replicas. This requires additional directory information, such as the DBMS server role (master vs. slave), current status (online or offline), and other relevant information like slave latency or log position. There are many ways to use this information effectively. Here are a few of the more interesting things we can do.

Slave load balancing. Allow applications to request a read-only connection, then route to the most up-to-date slave. This works well for applications such as Drupal 7, which is an application for website content management. Drupal 7 is slave-enabled, which means that it can use separate connections for read-only queries that can run on a replica. Many applications tuned to work with MySQL have similar features.
Session load balancing. Track the log position for each application session and dispatch reads to slaves when they are caught up with the last write of the session. This is a good technique for SaaS applications that have large numbers of users spread across many schemas. It is one of the most effective scaling algorithms for master/slave topologies.
Partitioning. Split requests by schema across a number of multi-master data services. SQL requests for schema 1 go to server 1, requests for schema 2 to server 2, etc. Besides distributing load across replicas this technique also helps avoid deadlocks, which can become common in multi-master topologies if applications simultaneously update a small set of tables across multiple replicas.
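
To make the first of these techniques concrete, here is a small Python sketch of slave load balancing. It assumes the directory exposes role, status, and latency for each server. The Server class, the pick_read_server function, the latency figures, and the 5-second cutoff are invented for illustration and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    role: str         # "master" or "slave"
    online: bool
    latency_s: float  # how far the replica lags the master, in seconds


def pick_read_server(servers, max_latency_s=5.0):
    """Choose the most up-to-date online slave, falling back to the master."""
    slaves = [
        s for s in servers
        if s.role == "slave" and s.online and s.latency_s <= max_latency_s
    ]
    if slaves:
        return min(slaves, key=lambda s: s.latency_s)
    for s in servers:
        if s.role == "master" and s.online:
            return s
    raise RuntimeError("no online server available")


servers = [
    Server("prodg23", "master", True, 0.0),
    Server("prodg21", "slave", True, 0.4),
    Server("prodg22", "slave", True, 2.5),
]
print(pick_read_server(servers).name)
```

Falling back to the master when no slave qualifies keeps read-only connections working during maintenance, at the cost of extra load on the master.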

Recalling our sample program, we could imagine a connector using session load balancing to write the sale table transaction to the master DBMS server, then sending the SELECT to a slave if it happened to be caught up for customer "walmart." No program changes are required for this behavior.
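
A rough Python sketch of that decision follows. It assumes log positions are comparable integers and that the connector learns each slave's applied position from the directory. The SessionRouter class and its method names are hypothetical, and the log positions are made up.

```python
class SessionRouter:
    """Route reads to a slave only if it has applied the session's last write."""

    def __init__(self, master, slave_positions):
        self.master = master
        # slave name -> last log position that slave has applied
        self.slave_positions = dict(slave_positions)
        # session id -> log position of that session's most recent write
        self.last_write = {}

    def record_write(self, session_id, log_position):
        self.last_write[session_id] = log_position

    def update_slave(self, slave, applied_position):
        self.slave_positions[slave] = applied_position

    def route_read(self, session_id):
        needed = self.last_write.get(session_id, 0)
        for slave, applied in self.slave_positions.items():
            if applied >= needed:
                return slave
        return self.master  # no slave has caught up yet


router = SessionRouter("prodg23", {"prodg21": 100})
router.record_write("walmart-session", 120)   # the two INSERTs commit at position 120
print(router.route_read("walmart-session"))   # slave is behind, so read from master
router.update_slave("prodg21", 120)
print(router.route_read("walmart-session"))   # slave has caught up
```

Sessions that have not written anything are never forced onto the master, which is why the technique scales well when many users are spread across schemas.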

The final responsibility is to route traffic around offline replicas. This gets a bit complicated. We need not only state information but an actual state model for DBMS servers. There also needs to be a procedure to tell fabric connectors about a pending change as well as wait for them to reconfigure themselves. Returning to our sample program, it should be possible to execute the following transaction:

$dbh->{'AutoCommit'} = 0;
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 353009, 24.99)");
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 268122, 59.05)");
$dbh->commit();

then fail over to a new master and execute:

$dbh->{'AutoCommit'} = 1;
$sth = $dbh->prepare("SELECT * FROM sale WHERE order_id=2331");
$sth->execute();

...

To do this properly we need to ensure that the connector responds to updates in a timely fashion. We would not want to change fabric topology or take a DBMS server offline while connectors were still using it. The notification protocol that updates connector directory information has to ensure reconfiguration does not proceed until connectors are ready.
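
The acknowledgement requirement can be sketched as follows. This is a deliberately simplified, in-process Python model. A real protocol would run over the network and need timeouts and retries, none of which are shown. The Connector and Directory classes, the reachable flag, and the version counter are assumptions made for the example.

```python
class Connector:
    """Holds a copy of the directory information and acknowledges updates."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.topology_version = 0
        self.master = None

    def notify(self, version, master):
        """Apply a topology update; return the version as an acknowledgement."""
        if not self.reachable:
            return None  # no acknowledgement
        self.master = master
        self.topology_version = version
        return version


class Directory:
    """Announces changes and reports whether it is safe to proceed."""

    def __init__(self, connectors):
        self.connectors = list(connectors)
        self.version = 0

    def announce_master(self, new_master):
        """Return True only if every connector acknowledged the new topology."""
        self.version += 1
        acks = [c.notify(self.version, new_master) for c in self.connectors]
        return all(ack == self.version for ack in acks)


connectors = [Connector("conn1"), Connector("conn2")]
directory = Directory(connectors)
if directory.announce_master("prodg21"):
    print("all connectors ready; safe to take the old master offline")
```

The point of the return value is that the caller decides whether to continue. If any connector fails to acknowledge, the old master should stay online.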

Does every fabric connector have to work exactly this way? Not at all. So far, we have only been talking about responsibilities. There are many ways to implement them. To start with, fabric connectors do not even need to handle SQL. This is interesting in two ways.

First, you can skip using the Perl DBI completely and use a specialized interface to connect to the fabric. We will see an example of this shortly. Second, the underlying store does not even need to be a SQL database at all. You can use the fabric connector design pattern for other types of stores, such as key-value stores that use the memcached protocol. This series of articles focuses on SQL databases, but the fabric connector design pattern is very general.

Implementations

Here are a couple of off-the-shelf implementations that illustrate quite different ways to implement the fabric connector design pattern.

1. Tungsten Connector. Tungsten Connector is a Java proxy developed by Continuent that sits between applications and clusters of MySQL or PostgreSQL servers. It implements the MySQL and PostgreSQL network protocols faithfully, so that it appears to applications like a DBMS server.

Tungsten Connector gets directory information from Tungsten clusters. Tungsten clusters use a simple distributed consensus algorithm to keep directory data consistent across nodes even when there are failures or network outages--connectors can receive topology updates from any node in the cluster through a protocol that also ensures each connector acts on it when the cluster reconfigures itself. In this sense, Tungsten clusters implement the fabric directory service pattern described earlier.

The directory information allows the connector to switch connections transparently between servers in the event of planned or even some unplanned failovers. It can also load balance reads automatically using a variety of policies including the slave load balancing and session load balancing techniques described above.

The big advantage of the network proxy approach is the high level of transparency for all applications. Here is a sample session with the out-of-the-box mysql utility that is part of MySQL distributions. In this sample, we check the DBMS host name using the MySQL show variables command. Meanwhile, a planned cluster failover occurs, followed by an unplanned failover.

mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | prodg23 |
+---------------+---------+
1 row in set (0.00 sec)

(Planned failover to prodg21 to permit upgrade on prodg23)
mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | prodg21 |
+---------------+---------+
1 row in set (0.01 sec)

(Unplanned failover to prodg22)
mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | prodg22 |
+---------------+---------+
1 row in set (4.82 sec)

As this example shows, the session continues uninterrupted as the location of the server switches. These changes occur transparently to applications. The downside is that there is some network overhead due to the extra network hop through the Tungsten Connector, though of course load balancing of reads can more than repay the extra latency cost. Also, this type of connector is hard to build because of the complexity of the MySQL network API as well as the logic to transfer connections seamlessly between servers.

2. Gizzard. Gizzard is open source sharding software developed by Twitter to manage links between Twitter users. The proxy part of the design pattern is implemented by middleware servers, which accept requests from clients using thrift, a language-independent set of tools for building distributed services. For more on a particular application built on Gizzard, look at descriptions of Twitter's FlockDB service. Gizzard servers give applications a simple API for data services, which fulfills the proxy responsibility of the fabric connector design pattern.

Gizzard servers get directory information using gizzmo. Gizzmo is a simple command line tool that maintains persistent copies of the Gizzard cluster topology and takes care of propagating changes out to individual Gizzard servers. For more on how Gizzmo works, look here. Using this information, Gizzard servers can locate data, route around down servers, and handle distribution of queries to replicas, which are the final three responsibilities of the fabric connector design pattern.

The Gizzard architecture lacks the generality of the Tungsten Connector, because it requires clients to use a specific interface rather than general-purpose SQL APIs. It also introduces an extra network hop. On the other hand, it works extremely well for its intended use case of tracking relationships between Twitter users. This is because Gizzard deals with a simplified problem and also allows scalability through many Gizzard servers. As with the Tungsten Connector, the network hop expense pays for itself through the ability to load-balance across multiple replicas.

Gizzard is a nice example of how the fabric connector design pattern does not have to be specifically about SQL. Gizzard clients do not use SQL, so the underlying store could be anything. Gizzard is specifically designed to work with a range of DBMS types.

Fabric Connector Implementation Trade-Offs

General-purpose fabric connectors like the one used for Tungsten are hard to implement for a variety of reasons. This approach is really only practical if you have a lot of resources at your disposal or are doing it as a business venture like Continuent. You can still roll your own implementations. The Gizzard architecture nicely illustrates some of the trade-offs necessary to do so.

1. General vs. particular data service interfaces. Implementing a simple data service interface, for example using thrift, eliminates the complexity of DBMS interfaces like those of MySQL or PostgreSQL. Rather than a thrift server you can also use a library within applications themselves. This takes out the network hop.

2. Automatic vs. manual failover. Automatic failover requires connectors to respond to fabric topology changes in real time, which is a hard problem with a lot of corner cases. (Look here if you disagree.) You can simplify things considerably by minimizing automated administration and instead orchestrate changes through scripts.

3. Generic vs. application-specific semantics. Focusing on a particular application allows you to add features that are important for particular use cases. Gizzard supports shard migration. To make this tractable to implement, Gizzard requires a simple update model in which transactions can be applied in any order.
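
To see what an update model that tolerates any ordering can look like, consider the Python sketch below. It is not Gizzard code. It assumes one common way of getting order independence: every update carries a timestamp, and the store keeps the value with the newest timestamp for each key. The key names and timestamps are invented.

```python
from itertools import permutations


def apply_update(store, update):
    """Apply one update, keeping only the newest value per key."""
    key, value, timestamp = update
    current = store.get(key)
    # Assumes timestamps for the same key are distinct; ties would need a rule.
    if current is None or timestamp > current[1]:
        store[key] = (value, timestamp)


def final_state(updates):
    store = {}
    for update in updates:
        apply_update(store, update)
    return store


updates = [
    ("follows:1:2", "active", 10),
    ("follows:1:2", "removed", 20),
    ("follows:3:1", "active", 15),
]

# Every possible arrival order produces the same final state.
states = [final_state(order) for order in permutations(updates)]
print(all(state == states[0] for state in states))
```

Because arrival order does not matter, a shard can be copied to a new server while writes continue, and late or repeated updates do no harm.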

These and other simplifications make the fabric connector design pattern much easier to implement correctly. You can make the same sort of trade-offs for most applications.

Implementations to Avoid

Here are a couple of implementations for the fabric connector design pattern that you should avoid or at least approach warily.

1. Virtual IP Addresses (VIPs). VIPs allow hosts to listen for traffic on multiple IP addresses. They are commonly used in many failover schemes, such as programs like heartbeat. VIPs do not have the intelligence to fulfill fabric connector responsibilities like load-balancing queries across replicas. They are subject to nasty split-brains, a subject I covered in detail as part of an earlier article on this blog. Finally, VIPs are not available in Amazon and other popular cloud environments. VIPs do not seem like a good implementation choice for data fabrics.

2. SQL proxies. There are a number of software packages that solve the problem of proxying SQL queries, such as PgBouncer or MySQL Proxy. Many of them do this quite well, which means that they fulfill the first responsibility of the fabric connector design pattern. The problem is that they do not have a directory service. This means they do not fulfill the next three responsibilities to route queries effectively, at least out of the box.

Unlike VIPs, SQL proxies can be a good starting point for fabric implementations. You need to add the directory information and notification protocol to make them work. It is definitely quite doable for specific cases, especially if you make the sort of trade-offs that Gizzard illustrates.

Conclusion and Takeaways

The fabric connector design pattern reduces the complexity of applications by encapsulating the logic required to connect to servers in a data fabric. There is a tremendous benefit to putting this logic in a separate layer that you can test and tune independently. Fabric connectors are more common than they appear at first because many applications implement the responsibilities within libraries or middleware servers that include embedded session management. Fabric connectors do not have to expose SQL interfaces or any other DBMS-specific interface, for that matter.

Fault-tolerant and sharded data service design patterns depend on fabric connectors to work properly and avoid polluting applications with complex logic to locate data. Products that implement these design patterns commonly include fabric connector implementations as well. You can evaluate them by finding out how well they fulfill the design pattern responsibilities.

Off-the-shelf fabric connectors have the advantage that they are more general than something you can develop easily for yourself. If you decide to write your own fabric connector, you will need to consider some of the trade-offs like reducing automation or simplifying APIs in order to make the problem easier to solve. Regardless of the approach, you should allow time. The responsibilities are complicated and must be implemented with care. Fabric connectors that only work 99% of the time are not much use in production environments.

One final point about fabric connectors. Automated failover can make fabric connectors harder to implement and increase the risk that the fabric connector may write to the wrong replica. The difficulty of managing connectivity is one of the reasons many data management experts are very cautious about automatic failover. This problem is tractable in my opinion, but it is definitely a practical consideration in system design.

My next article on data fabrics will cover the fault-tolerant data service design pattern. This design pattern depends on the fabric connector design pattern to hide replicas. I hope you will continue reading to find out about it.

Feb 6, 2013

Data Fabric Design Patterns: Transactional Data Service

This article is the second in a series on data fabric design and introduces the transactional data service design pattern. The previous article in this series introduced data fabrics, which are collections of off-the-shelf DBMS servers that applications can connect to like a single server. They are implemented from data fabric design patterns, which are reusable arrangements of DBMS servers, replication, and connectivity. With this article we begin to look at individual design patterns in detail.

Description and Responsibilities

The transactional data service is a basic building block of data fabric architectures. A transactional data service is a DBMS server that processes transactions submitted by applications and stores data safely. Transactional data services have the following responsibilities:

Store data transactionally and recover data faithfully to the last full transaction following failure.
Provide a network-accessible application interface for accessing data
Provide a reliable and reasonably quick method for backup and restore.
Maintain an accessible, serialized log of transactions. This enables replication between services.

The following diagram illustrates the moving parts of a transactional data service. In future diagrams we will just use the standard database symbol for the entire transactional data service, but for now we need to be able to see the contents.

Motivation

Durable storage of transactions is the most fundamental responsibility of database systems. It is difficult to build reliable applications if stored data can disappear or become corrupted because transactions were not committed before a crash. Both problems can cause data loss. Moreover, they can break replication links very badly if the DBMS server comes up in an inconsistent state, for example with some updates committed but others randomly rolled back. This condition affects not only the one server but potentially many others throughout the fabric.

The transactional data service therefore focuses on storing data safely and recovering to the last committed transaction after a restart. With this basic capability we can construct more complex services knowing that individual changes are unlikely to disappear or be recorded inconsistently.

Detailed Behavior

Let's look in detail at the properties required for a successful transactional data service. Somewhat surprisingly, an off-the-shelf SQL DBMS does not necessarily fit the pattern, though it comes close. It is important to understand the differences.

The transactional store keeps data from getting lost and is the basis for recovery throughout the fabric. Transactional stores support commit and rollback with multi-statement transactions. In theory the transactional data service responsibility for data persistence matches MySQL/InnoDB and PostgreSQL behavior, both of which commit transactions safely in serial order. However, the reality is not quite that simple.

Most DBMS allow applications to ignore transactions under certain conditions. This results in wormholes, which are violations in serial ordering of data. There are a number of table definition options in SQL that undo transactional consistency.

(MySQL) MyISAM table type. MyISAM tables ignore transactions and commit immediately, even if the application later tries to roll back. The tables may also become corrupted by server crashes.
(MySQL) Memory table type. These tables are maintained in memory only. They disappear on restart.
(PostgreSQL) UNLOGGED tables. Such tables are not logged and disappear on crash or unclean shutdown (thanks Frederico).

All of these allow data to disappear or become corrupted after a crash. However, there is a more subtle problem. If applications depend on these tables, transaction results may then depend on when the server last crashed or restarted, which in turn makes updates across replicas non-deterministic. Random updates create problems for data replication, which depends on replicas behaving identically when transactions are applied. It is important to avoid application dependencies on any feature that creates wormholes or you might not be able to use other design patterns.
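
One practical way to look for such dependencies on MySQL is to list tables by storage engine. The Python sketch below shows the idea under some assumptions: the query text reads MySQL's information_schema.tables view, but the find_unsafe_tables helper and the sample rows are hypothetical, and actually running the query against a server is left out.

```python
# Engines from the list above that can create wormholes.
UNSAFE_ENGINES = {"myisam", "memory"}

# Query to run against a MySQL server to list each table with its engine.
# On older servers you may want to exclude system schemas such as mysql.
ENGINE_QUERY = (
    "SELECT table_schema, table_name, engine "
    "FROM information_schema.tables "
    "WHERE table_type = 'BASE TABLE'"
)


def find_unsafe_tables(rows):
    """Return the (schema, table, engine) rows that use a non-transactional engine."""
    return [
        (schema, table, engine)
        for schema, table, engine in rows
        if engine is not None and engine.lower() in UNSAFE_ENGINES
    ]


# Made-up result rows standing in for the output of ENGINE_QUERY.
sample_rows = [
    ("test", "sale", "InnoDB"),
    ("test", "session_cache", "MEMORY"),
    ("test", "audit_log", "MyISAM"),
]
for schema, table, engine in find_unsafe_tables(sample_rows):
    print(f"{schema}.{table} uses {engine}")
```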

So is data loss always bad? Surprisingly, no. In some cases transactional stores can lose data provided that they do so by dropping the last transactions in serial order. It's as if the data just reverted to an earlier point in time. To understand why this might be OK, imagine three servers linked into a chain by asynchronous replication.

[Figure: Data loss is sometimes not a big deal]

It is bad to lose data on the first server, especially if those data are lost before replicating transactions to the downstream replicas. However, data loss on the last server is fine. Assuming that server stores the replication restart point transactionally, it will just re-apply the missing transactions and catch up. This is exactly what happens when you restore a backup on a slave in master/slave replication.

Data loss on the second server also may not be a problem. It should restart replication and should generate identical transactions for itself as well as for replication to the last server. In both cases we assume that replication will handle these cases correctly and can replay missing transactions from logs. If so, you can not only tolerate such losses but even depend on recovering from them automatically.
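
The catch-up behavior on the last server can be modeled in a few lines of Python. The sketch assumes the replica stores its restart position together with its data, so that losing the most recent transactions also rewinds the position. The Replica class and the log contents are invented for the example.

```python
import copy


class Replica:
    """A replica whose restart position is stored with its data."""

    def __init__(self):
        self.data = {}
        self.position = 0  # sequence number of the last applied transaction

    def apply_from(self, log):
        """Apply every logged transaction newer than the restart position."""
        for seqno, key, value in log:
            if seqno > self.position:
                self.data[key] = value
                self.position = seqno


# Serialized upstream log: (sequence number, key, value).
log = [(1, "a", 10), (2, "b", 20), (3, "a", 30), (4, "c", 40)]

replica = Replica()
replica.apply_from(log[:2])
earlier_state = copy.deepcopy(replica)  # consistent state as of transaction 2

replica.apply_from(log)                 # fully caught up
expected = dict(replica.data)

# A crash drops the last two transactions in serial order: the replica
# reverts to the earlier state, position included, then replays the log.
replica = earlier_state
replica.apply_from(log)
print(replica.data == expected)
```

If the position were stored separately and survived the crash while the data did not, the replica would skip transactions 3 and 4 and silently diverge. That is why the restart point has to be stored transactionally.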

Turning to the next responsibility of the transactional data service, the application interface may obviously include SQL via MySQL or PostgreSQL wire protocols. However, any consistent interface that is accessible over a network will do. The memcached protocol is also perfectly acceptable. Subsets of SQL such as stored procedures work quite well. Transactional data services are more general than SQL DBMS servers in this sense. Full SQL or even a subset is not a requirement.

Backup and restore are critical for data fabrics as they enable provisioning of new services as well as recovery of services that fail. You restore a backup and then let the service catch up using replication. Data fabrics can get along fine using a range of options from logical dumps of the DBMS (mysqldump or pg_dump) to file system snapshots. Fabric backups just need to be transactionally consistent and movable to other hosts.

Note that the backup required by the transactional data service design pattern is a less general form of backup than most businesses really require. Businesses may need to recover data after accidental deletion or to keep copies of information for many years for legal reasons. You can therefore use a general-purpose backup solution like Zmanda or Barman provided it meets the fabric design pattern requirements. There's no need to do things twice.

Finally, the replication log is a serialized list of transactions to replicate to other hosts. Serialization enables the transactions to be replayed on another host and result in an identical copy. Generally speaking, data fabrics require logical replication, which applies changes to replicas using SQL statements on a live server. This is because other design patterns depend on being able to access and even write to the transactional data service when it is acting as a slave. Binary replication methods like disk block replication, such as DRBD, do not meet this requirement and therefore are of limited use in data fabrics.

Implementation

You can implement the transactional data service design pattern with any DBMS that meets the pattern responsibilities. That said, implementation details are very important. As we have seen, ensuring that DBMS servers live up to the responsibility to store transactions safely is a little harder than one might think.

1. MySQL. MySQL with InnoDB engine is generally a good choice. It has stable SQL APIs and a wide range of capable client libraries. However, MySQL must be correctly configured to maintain proper transactional guarantees. Here are three properties that should be in your my.cnf file to help ensure MySQL lives up to its responsibilities:

# Ensure durable flush to storage on transaction commit.

innodb_flush_log_at_trx_commit=1

# Synchronize binlog with committed transactions.

sync_binlog=1

# Use InnoDB as default storage engine. (Unnecessary for MySQL 5.5 and above, where InnoDB is already the default. The older default-table-type spelling was removed in 5.5.)

default-storage-engine=InnoDB
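If you manage many servers, it can help to check these settings automatically. The following is a rough sketch that parses my.cnf-style text with Python's standard configparser and reports durability settings that differ from the values above. The function name and the choice to flag a missing setting as a problem are my own assumptions; the sketch does not know the server's built-in defaults, and it only reads the [mysqld] section.

```python
import configparser

# Durability settings recommended in the text above.
REQUIRED = {
    "innodb_flush_log_at_trx_commit": "1",
    "sync_binlog": "1",
}

def check_durability(my_cnf_text):
    """Return a list of human-readable problems found in the [mysqld] section."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # my.cnf allows bare options such as skip-networking
        strict=False,          # tolerate repeated options
        interpolation=None,    # treat % literally
    )
    parser.read_string(my_cnf_text)
    settings = {}
    if parser.has_section("mysqld"):
        for key, value in parser.items("mysqld"):
            # MySQL accepts dashes and underscores interchangeably in option names.
            settings[key.replace("-", "_")] = value
    problems = []
    for name, expected in REQUIRED.items():
        actual = settings.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected}, found {actual}")
    return problems

sample = """
[mysqld]
innodb_flush_log_at_trx_commit=2
sync-binlog=1
"""
problems = check_durability(sample)
```

A check like this only tells you what the file says. The running server may have been changed at runtime, so treat it as a first line of defense rather than proof of safe operation.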

There are a variety of good backup mechanisms for MySQL databases, including mysqldump (with --single-transaction, useful only for small data sets), Percona XtraBackup, and file system snapshots. Snapshots are especially good when using NetApp or other capable storage. NetApp snapshots can be restored in seconds and cost little in terms of performance overhead.

2. PostgreSQL. PostgreSQL with a trigger-based replication log, for example from Londiste or SLONY, and pg_dump for backups is another good choice. PostgreSQL has unusually good trigger support, for DML changes at least, and permits users to write triggers in a number of languages. Be aware that PostgreSQL triggers do not capture DDL statements like CREATE TABLE, though.

PostgreSQL is fully transactional out of the box and triggers create a fully serialized replication log. It does not have the problem that MySQL does with potentially unsafe table types like MyISAM. However, you need to set a couple of parameters to ensure safe operation. These ensure transactions are committed down to the storage level and prevent catastrophic corruption of the database and/or the WAL (write-ahead log). Since PostgreSQL defaults to these values, you mostly need to avoid turning them off.

fsync = on # turns forced synchronization on or off
synchronous_commit = on # immediate fsync at commit
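Since the defaults are already safe, the useful check here is whether somebody has explicitly turned one of these settings off. Below is a small sketch along those lines. The parser is deliberately simplified and is an assumption of mine, not PostgreSQL's own parsing logic: it only understands the "name = value" form, ignores include directives, and would be confused by a # character inside a quoted value.

```python
OFF_WORDS = {"off", "false", "no", "0"}
WATCHED = ("fsync", "synchronous_commit")

def parse_postgresql_conf(text):
    """Parse simple 'name = value' lines, dropping comments and quotes."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # strip trailing comments
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue   # sketch: skip the 'name value' form and include directives
        settings[name.strip().lower()] = value.strip().strip("'").lower()
    return settings

def unsafe_overrides(text):
    """Return watched settings that were explicitly switched off."""
    settings = parse_postgresql_conf(text)
    return {
        name: settings[name]
        for name in WATCHED
        if settings.get(name) in OFF_WORDS
    }

sample = """
fsync = on                  # turns forced synchronization on or off
synchronous_commit = off    # someone traded safety for speed
"""
overrides = unsafe_overrides(sample)
```

An empty or untouched file yields no warnings, which matches the advice above: you mostly need to avoid turning these settings off.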

Like MySQL, PostgreSQL SQL and APIs are stable and well-known. Dumps taken with pg_dump also load and restore without difficulty for smallish data sets. For larger data sets file system snapshots work very well.

Regardless of the DBMS type you choose, it is important to avoid application-level features that introduce wormholes, such as the PostgreSQL unlogged tables mentioned in the previous section. Generally speaking, you should only skip transactions if there is a very strong reason for doing so.

Do other database types work for this design pattern? Of course. You can also use a commercial DBMS like Oracle. Oracle fulfills the pattern responsibilities quite well, but is a bit more heavyweight than users want, particularly when operating in the cloud.

And Hardware Implementation, Too...

Even with a properly configured DBMS server you are still not completely out of the woods for data durability. Database servers generally use an fsync() or similar system call to flush data to storage. Unfortunately storage controller cards may just cache the data to be written in local RAM and return. That can fool the DBMS server into thinking transactions are safely stored when they actually are still sitting in memory on a controller card. The "committed" data will then vaporize in a host crash, which in turn can fatally corrupt both MySQL and PostgreSQL stores if you are very unlucky. Just a few bad blocks can cause very serious problems.

Fortunately there is a cure to make data vaporization less likely. On local storage you can invest in RAID with a battery-backed cache (BBU), which keeps power on for the cache even if the host fails completely. SANs and network attached storage tend to have this capability built in. (But check the specifications!) Battery backed cache also tends to be fast, since controllers can safely return from an fsync() operation as soon as the data to be written are in the on-board cache. Without this feature writes to storage can be painfully slow.
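For readers who have not used fsync() directly, here is what the call looks like from an application's point of view. This is a minimal sketch in Python, not how any DBMS actually implements its commit path, and the function and file names are hypothetical.

```python
import os
import tempfile

def durable_append(path, data):
    """Append bytes to path and ask the OS to flush them to the storage device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(data)
        while view:                       # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # block until the device reports the write done
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "commit.log")   # hypothetical log file name
    durable_append(log_path, b"txn 1 committed\n")
    durable_append(log_path, b"txn 2 committed\n")
    with open(log_path, "rb") as f:
        recovered = f.read()
```

Note what the code cannot tell you: when os.fsync() returns, the operating system has done its part, but a controller without battery-backed cache may still be holding the data in volatile memory. That gap is exactly the problem described above.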

One interesting question is how to handle cloud environments like Amazon. You just do not know how the storage actually works. (That's really the point of a cloud, after all.) Amazon provides SLAs for performance (example: EBS provisioned IOPS), but there do not seem to be any SLAs about storage consistency. There is a lot to learn here and lessons that apply to Amazon may not necessarily apply to others. I suspect this will prompt some rethinking about data consistency--it's an interesting "what if" for transaction processing to suppose you cannot trust the underlying storage capabilities.

Data loss occurs sooner or later in virtually all systems, but the good news is that you can make it uncommon. For more information check out data sources like this and this. Also, other fabric design patterns like the Fault-Tolerant Data Service keep applications running when failures do occur and can even minimize the effects of data loss. See the upcoming article on that design pattern for more information.

Implementations to Avoid

Here are two examples that do not meet the transactional data service design pattern responsibilities or at least not fully.

1. MySQL with MyISAM table type. MyISAM does not support transactions and is not crash safe. You will lose data or incur downtime fixing problems. MyISAM does not belong in data fabrics.

2. PostgreSQL with streaming replication. Streaming replication replicates log updates in real-time and has the added benefit of permitting queries on replicas. However, streaming replication does not allow you to write to replicas. It therefore does not support online schema maintenance or multi-master replication. It also does not help with heterogeneous replication. Streaming replication is therefore an unattractive choice, even though it is far simpler and works better for ensuring high availability than logical replication solutions like SLONY.

How do NoSQL stores fare in this design pattern? Let's pick on MongoDB. MongoDB supports atomic commit to single BSON documents but does not support transactions across multiple documents. (Unless you think that the hacky two-phase commit proposed by the MongoDB manual is an answer.) Atomic transactions are not one of the reasons people tend to choose NoSQL systems, so this is not surprising. It means that MongoDB cannot handle the responsibilities of this design pattern.

Conclusion and Takeaways

The transactional data service design pattern can be implemented with a carefully configured SQL database. As we have seen, however, there are a number of details about what "carefully configured" really means.

It is a good idea to use the transactional data service design pattern even if you are not planning to implement a data fabric. Systems grow. This pattern gives you the flexibility to build out later by adding other fabric design patterns, for example to introduce cross-site operation using the Multi-Site Data Server pattern. It also protects your data at multiple levels that include transactions as well as regular backups. Nobody really likes losing data if it can be avoided.

Another important point: the transactional data service design pattern keeps your entire fabric working. Losing small amounts of data is typically just an inconvenience for users, especially if it does not occur too often. Broken replication, on the other hand, caused by replicas that diverge or become corrupt after failures, can lead to time-consuming administration and significant downtime to repair. The fabric is a network of servers. Poor configuration on one server can cause problems for multiple others.

Finally, the biggest error people make with this design pattern is to neglect backups. There's something inherently human about it: backups are painful to test so most of us don't. My first and biggest IT mistake involved a problem with backups. It nearly caused a large medical records company to lose a week of data entry records. Put in backups and test them regularly to ensure they work. This is a theme that will recur in later articles.
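Testing a backup does not have to be elaborate. The sketch below shows the basic shape of an automated check: take a backup, open it as if restoring, and confirm that it is intact and contains the rows you expect. It uses Python's sqlite3 module purely as a stand-in for a real DBMS, and the table name and expected counts are hypothetical. With MySQL or PostgreSQL you would restore the dump or snapshot to a scratch server and run similar queries there.

```python
import os
import sqlite3
import tempfile

def take_backup(source, backup_path):
    """Copy a live database to backup_path using sqlite3's online backup."""
    dest = sqlite3.connect(backup_path)
    try:
        source.backup(dest)
    finally:
        dest.close()

def verify_backup(backup_path, expected_counts):
    """Open the backup and check integrity plus expected row counts."""
    conn = sqlite3.connect(backup_path)
    try:
        if conn.execute("PRAGMA integrity_check").fetchone()[0] != "ok":
            return False
        for table, expected in expected_counts.items():
            # Table names come from our own trusted dict, never from user input.
            actual = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
            if actual != expected:
                return False
        return True
    finally:
        conn.close()

live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, note TEXT)")
live.executemany("INSERT INTO records (note) VALUES (?)", [("a",), ("b",), ("c",)])
live.commit()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "backup.db")
    take_backup(live, path)
    backup_ok = verify_backup(path, {"records": 3})
    backup_stale = verify_backup(path, {"records": 4})
```

The second call simulates a backup that is missing data: the check fails, which is what you want to find out on a quiet afternoon rather than during a recovery.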

Speaking of which, the next article in this series will cover the fabric connector design pattern. Stay tuned!