Feb 5, 2013

Introducing Data Fabric Design for Commodity SQL Databases

Data management is undergoing a revolution.  Many businesses now depend on data sets that vastly exceed the capacity of DBMS servers.  Applications operate 24x7 in complex cloud environments using small and relatively unreliable VMs.  Managers need to act on new information from those systems in real-time. Users want constant and speedy access to their data in locations across the planet.

It is tempting to think popular SQL databases like MySQL and PostgreSQL have no place in this new world.  They manage small quantities of data, lack scalability features like parallel query, and have weak availability models.  One reaction is to discard them and adopt alternatives like Cassandra or MongoDB.  Yet open source SQL databases have tremendous strengths:  simplicity, robust transaction support, lightning fast operation, flexible APIs, and broad communities of users familiar with their operation.  The question is how to design SQL systems that can meet the new requirements for data management.

This article introduces an answer to that question:  data fabric design.  Data fabrics arrange off-the-shelf DBMS servers so that applications can connect to them as if they were a single database server.  Under the covers a data fabric consists of a network of servers linked by specialized connectivity and data replication.  Connectivity routes queries transparently from applications to DBMS servers.  Replication creates replicas to ensure fault tolerance, distribute data across locations, and move data into and out of other DBMS types.  The resulting lattice of servers can handle very large data sets and meet many other requirements as well.

Data fabric design is a big topic, so I am going to spread the discussion over several articles.  This first article provides a definition of data fabric architecture and introduces a set of design patterns to create successful data fabrics.  In the follow-on articles I will explore each design pattern in detail.  The goal is to make it possible for anyone with a background in database and application construction to design data management systems that operate not only today but far into the future.  At the very least you should understand the issues behind building these systems.

Some readers may see data fabric design as just another reaction to NoSQL.  This would be a mistake.  Building large systems out of small, reliable parts is a robust engineering approach that derives from ground-breaking work by Jim GreyPat Helland, and others dating back to the 1970s.  Data fabrics consist of DBMS servers that you can look at and touch, whereas NoSQL systems tend to build storage, replication, and access into a single distributed system.  It is an open question which approach is more complex or difficult to use.  There are trade-offs and many systems actually require both of them.  You can read this article and those that follow it, then make up your own mind about the proper balance.

Acknowledgements

The data fabric concept is largely based on practical experience on Continuent Tungsten.  I am indebted to Continuent as well as our customers for the opportunity to work on this problem.  I am likewise indebted to Ed Archibald, Continuent CTO, with whom I have worked for many years.  Ed among other things came up with the data fabric moniker.  Our interest in this topic goes back to shared experiences with Sybase OpenServer and Sybase Replication Server, which were ground-breaking products in the fields of connectivity and replication.  Two decades later we are still applying the insights gained from working on them.

What Is a Data Fabric Architecture?

Let's start the discussion of data fabrics with a practical problem.  We want to design a SaaS application for customer relationship management that will support up to a million users using a commodity open source DBMS like MySQL.  What are the requirements?

Obviously, our system must be able to scale over time to hundreds of app servers operating on hundreds of terabytes data.  It must hide failures, maintenance, and schema upgrades on individual DBMS hosts.  It must permit data to distribute across geographic regions.  It must deliver and accept transactions from NoSQL, data warehouses, and commercial DBMS in real time.  It must allow smooth technology upgrade and replacement.  Finally, it must look as much like a single DBMS server to applications as possible.  Here's a picture of what we want: 
Conceptual Data Fabric
The last requirement is especially important and goes to the essential nature of data fabric architecture.  The fabric may contain dozens or even hundreds of servers but encapsulates their locations and number.  Applications connect to the fabric the same way they connect to individual DBMS servers, a property we call transparency.  Transparency permits developers to build applications using a DBMS on a laptop and push out code to production through increasingly capable test environments without changes in behavior.   This is a potentially confusing requirement so let's look at a couple of examples.

Transparency does not mean that you get access to all data all the time from everywhere.  Say our sample application stores customer data across many servers.  The following SQL query to list regions with SaaS users who have sales to their own customers greater than $100,000 would typically not work:

mysql> select region, count(cust_id), sum(sales) as sales_total 
  from customer group by region having sales_total > 100000;

This is not as big a problem as it sounds, as SaaS applications for the most part operate within a single SaaS user at a time and do not ask for data across users.  Moreover, it is easy to understand that you need to do something special for this particular query, such as connect to all servers explicitly or load transactions into a data warehouse. Most SaaS designers are pretty comfortable with this limitation, which just makes explicit something that you know anyway.

On the other hand transparency does mean that your application talks to what looks like a single server for individual SaaS user data.  For instance the following sequence of commands on a single customer to insert a row and get the generated auto-increment key back must work in a fabric just as it does in a single DBMS server.

mysql> insert into customer(name, region, sales) values('bob', 'sfo', 10035.0);
Query OK, 1 row affected (0.00 sec)

mysql> select last_insert_id();
+------------------+
| last_insert_id() |
+------------------+
|                1 |
+------------------+
1 row in set (0.00 sec)

Selecting the last inserted ID is a standard idiom for adding rows to table with synthetic keys in MySQL.  It is baked into widely used libraries like PHP mysqli.  Change it and you break thousands of applications.  This could happen if the fabric switched a DBMS connection across servers between these two commands.  When operating on data for a single SaaS user fabric transparency needs to be as close to perfect as possible.

Any architecture that meets the preceding requirements including transparency is a data fabric.  The next topic is what is actually inside a fabric architecture.

What Are Data Fabrics Made of?  

Combining individual components into lattices is hardly a new idea.  Such compositions are common in fields from bridge-building to art.  One of my favorite examples is the famous arabesques of the Alhambra in Granada, which combine simple motifs into patterns then combine those to create still more complex patterns that cover walls and ceilings throughout the palace.  The resulting compositions are works of stunning beauty.  

Detail of Arabesque from Alhambra, Spain (Source: Wikipedia)
Arabesque construction is far from random.  Arabesques combine plant-like elements into geometric patterns.  Only certain elements are allowed--there are typically no human representations--and only certain patterns work well.  Arabesques are also expressed in a particular medium such as stone, plaster, tiles, or paint.

Much as arabesques do, data fabrics combine very specific elements in a medium consisting of the following logical parts: 
  1. A partitioned network of recoverable transactional stores (DBMS servers) connected by reliable messaging (replication).  Partitioned means that not every service exchanges information with every other, in the same way that DBMS servers for different applications may be separate silos.  
  2. A routing layer that connects applications to relevant data based on simple hints provided by the application itself, such as the name of a schema, a primary key, or whether the current user transaction is a set of writes or an auto-commit read.  
Stores, replication, and routing logic are the fundamental elements of fabric implementations.  These are powerful tools for building very large applications.  Let's take a quick tour, as their particular properties are critical for building systems that actually work.

Transactional stores are DBMS servers that apply changes as atomic units that either commit or roll back as a whole.  Transactional stores convert these changes into a serial history, which orders concurrent transactions in its log so that they can replay as if they had executed one after the other all by themselves.  Serial ordering enables recovery, which is the procedure to bring data back cleanly to the last committed transaction after a crash or restart.  These properties are at the heart of all relational DBMS systems today and are necessary for the fabric to operate.  Non-transactional stores are close to useless in fabrics.  Use them at your peril.

Reliable messaging transmits changes between stores without losing them or applying them twice.  The usual way to do this is through replication, which moves transactions automatically on commit and applies them to replicas.   Replication needs to support transactions, apply them in serial order, and recover just as transactional stores do.  Replication systems that fail to meet any of these three requirements do not cut the mustard and do not belong in data fabrics.

Data replication implementations differ in at least three major ways.  There is not really a "right approach" to replication--different methods work better for some applications than others.

1. Synchronous vs. asynchronous.  Synchronous replication moves transactions before reporting commit back to applications.  It minimizes problems with data loss if a replica fails but may slow or block applications if replication is slow.  It can also lead to deadlocks.  Asynchronous replication moves transactions after commit.  It does not block applications but leads to latency between replicas and may result in data loss if there is a failure before transactions replicate from a particular replica. 

2. Master-master vs. master/slave.  Master-master replication allows updates on any replica.  It requires conflict management, typically either through locking in advance or fixing up after the fact (also known as conflict resolution).  Master/slave replication requires applications to use a single master for updates.  It works well with asynchronous replication, which is easier to implement.  

3. Log-based vs. trigger-based.  Log-based replication reads the database journal or some representation of it like the MySQL binlog.  Log-based replication has lower impact on DBMS servers but can be very hard to implement correctly.  Trigger-based replication uses triggers to capture transaction content, typically using transfer tables.  Trigger-based replication is simpler to implement but adds load to DBMS servers, can add unwanted serialization points to transactions, and may not handle certain types of changes, such as DDL data.  

The routing layer directs application DBMS connections to specific copies of data.  An application sees what appears to be a single session for each connection with consistent settings, such as character set encodings and session variables.  Underneath the fabric may actually switch the connection across DBMS servers at opportune times.  Transparency is absolutely critical.  As mentioned previously, even tiny changes in session behavior breaks libraries that applications depend on to access data.  

There are many methods to route connections as well as transactions.  As with replication, there is no "best" way.  It all depends on your application and the type of environment in which you are operating.  Here are four common approaches. 

1. Gateways.  A gateway is a proxy process that sits between the application and DBMS server.  The gateway looks like a server to applications.  It establishes connections on behalf of the applications, hence can perform very flexible transformations on data at the cost of some performance due to the double network hop the gateway introduces.  Gateways are hard to implement initially, but good ones are relatively easy to program once they work properly.  

2. IP Routers.  Routing software switches IP packet paths between servers.  It has the lowest overheard and potentially highest transparency but require substantial effort to implement correctly.  

3. Library wrappers.  Library wrappers re-implement standard interfaces like JDBC or Perl DBI, then route application requests through underlying connectors like the MySQL or PostgreSQL JDBC driver.    Wrappers are relatively easy to implement and have excellent performance but do not handle traffic that goes outside the library.  Compiled versions can have nasty library dependencies that introduce new forms of RPM hell for users if you try to use them generally.

4. Custom routing layers.  Any interface that your applications use to access data can implement routing logic.  For instance, SOAP or JSON servers work perfectly well for this purpose.   Internal data access libraries can also implement routing.  This approach is specific to single applications, and like library wrappers does not cover other means of access.  

The biggest constraint for any routing method is to keep it simple.  Fat routing layers with complex logic tend to introduce bugs as well as change database semantics.  This in turn violates the requirement for transparency.

Fabrics use the preceding elements over and over.  As we build up data fabric designs it is helpful to use consistent notation for data stores, replication, and routing.  The following summarizes the main notation that you will see in diagrams.
Data Fabric Notation
At this point we understand that data fabrics consist of a specific set of elements.  However, we have not provided any organization yet.  The next question is therefore how to arrange them into real systems.

What Are the Design Patterns for Data Fabrics?

Having a model of fabric elements is not the same as an actual implementation.  We need to define how the elements are composed into a complete system that can be deployed.  This is where design patterns come in.

Design patterns are reusable solutions that combine flexibly with each other to create large-scale architectures.  In the case of data fabrics, design patterns offer guidelines to organize fabric elements in ways that work well for large data sets spread over multiple geographic locations.  They serve the same function as the geometric arrangements that organize individual motifs in an arabesque.

Data fabric design patterns arrange data services.  A data service is an abstraction for a transactional store with a well-defined API.  Design patterns either link data services in some way or create more capable data services from simpler ones.  The ability to create new services in this way is called service composability.  Composability is a fundamental attribute of fabric design patterns.  It is the reason that fabric designs can handle very large data sets flexibly.  Service composition is analogous to the patterns-within-patterns layout that allows arabesques to cover an entire building without seeming repetitive or boring.

There are six design patterns that are particularly useful for SQL databases.  The following diagram shows them using italized blue font and illustrates how they organize the implementation of our sample CRM system.
Fully Implemented Data Fabric
Here is a short description of each fabric design pattern.  
  • Transactional Data Service - A transactional store that can restart without losing data following failure.  It has a well-defined API, a mechanism for backup, and a transaction log for replication.   This is the building block for all other design patterns.  
  • Fabric Connector -- A routing method that hides the location of databases and enables data services consisting of multiple DBMS servers to look like a single server to applications.  
  • Fault-Tolerant Data Service -- A data service that protects against failures and enables zero-downtime maintenance using a set of redundant DBMS servers.  
  • Sharded Data Service -- A data service that handles large data sets by dividing them into shards distributed across a number of underlying data services
  • Multi-Site Data Service -- A data service that enables data to spread across multiple geographic locations using data services on each site linked by replication
  • Real-Time Data Bridge -- A link that enables real-time replication of data between heterogeneous data services, for example between MySQL and a data warehouse like Vertica.  
You can implement fabric design patterns in many different ways.  This means there are many possible implementations of data fabrics as well.  As long as you meet the assumptions of each particular design pattern, the result is likely to work well.  We will delve into some of the variations in the next few articles.

A Little Bit about Naming

You may have encountered the term "data fabric" in products (example here or here) or as a concept (a nice example here).  This is because the ideas behind fabrics are very general and apply to many types of data management systems, perhaps most notably storage area networks.  The name data fabric is very natural and seems to have occurred to many people independently, especially after storage vendors popularized it during the 1990s.

In our case the term "data fabric" always means a data fabric architecture as I defined it a couple of sections back.

Conclusion and More

This has been a longish introduction but introduces terminology and sets the stage for looking at specific design patterns used to build data fabrics.  Follow-on articles will look at individual design patterns in detail.

One final thought:  there are no doubt other ways of factoring design patterns for data fabrics.  This particular set of patterns works very well for SQL databases, and I have seen them used successfully in many large systems over the years.  I would argue they work well for other DBMS types as well but welcome your comments on other approaches.

p.s., The next article covering the transactional data service is now available.  

8 comments:

Justin Swanhart said...

"Transparency does not mean that you get access to all data all the time from everywhere."

Actually it can mean that :)

I am the primary maintainer and creator of Shard-Query: (http://code.google.com/p/shard-query)

I've been working on Shard-Query for three years now. It is designed to run exactly the type of query you mention (even with the HAVING clause) over sharded data sets. I call this "distributed set processing" or DSP.

Right now Shard-Query doesn't rely on replication for data-distribution of tables which must be duplicated between nodes (dimension tables). It instead "broadcasts" changes to those tables (it should use XA for this but MySQL XA is broken so it doesn't).

I was talking to Giuseppe at FOSDEM about using Tungsten for such data distribution instead.

Justin Swanhart said...

The second fabric problem (last_insert_id()) is possible to solve as long as the proxy providing the abstraction collects the last_insert_id() after sending the insert. Then, the proxy replaces calls to last_insert_id() in the SQL statement with the constant value that was collected previously. Shard-Query handles this partially right now, as the last_insert_id() from each shard that is inserted into is returned by the insert statement.

Robert Hodges said...

@Justin, Thanks for the comments and for reminding me about shard-query.

#1, My point was really that in a data fabric it's OK if you have to go somewhere else to get some queries answered, as long as they are not especially pervasive. Shard-query looks quite interesting. There is certainly a range of federated queries that can be automated productively, much as you are doing. One question--how does shard query deal with a failure on one of the shards?

#2 Absolutely, we have done this same trick before on last_insert_id(). Temp tables are another SQL feature that create headaches. We handle them in our connectivity by "freezing" the connection and not allowing switches after they are created on a session.

Justin Swanhart said...

If an error occurs on a shard, or a shard can not be contacted, then the query fails. It is the end user's responsibility to make shards HA via Galera, MySQL replication, DRBD or other means.

I talked with Henrik at FOSDEM about this and we agreed that returning partial results during a cluster partition is acceptable to some users and it should be optional, so I'll be adding a result that allows partial results to be returned if one or more shards can not be contacted, but will still result in error if any of the shards return an error (like missing column).

Justin Swanhart said...

Shard-Query doesn't support temporary tables, but supports subqueries in the from clause, and they are automatically indexed for good performance. So that is the workaround with Shard-Query. It is conceivable to map temporary tables to concrete tables and replicate them to all the shards, but the complexity in such a solution is pretty high.

Robert Hodges said...

Thanks for the answers, especially on failures. It sounds as if you may be interested in some of the specific design patterns like the sharded data service. I look forward to your comments on it.

Patrick Callaghan said...

Hi, have you ever thought of using Cassandra for this ? It pretty much solves all the data fabric services at its core.

Robert Hodges said...

Cassandra solves some interesting problems related to horizontal scaling and replication but you give up a lot. Simple examples are full featured queries (i.e., SQL including joins) and transactions. That rules out using Cassandra for many business apps.