Thursday, March 28, 2013

See You at Percona Live 2013!

Percona Live 2013 is coming up fast.  This is hands-down the best MySQL conference of the year, attended by a lot of people I really respect.  Check the speaker list if you need some of their names.  I will also be doing two talks myself.
  • 9am Wednesday 24 April - Keynote:  How MySQL Can Thrive in the World of Massive Data Hype.  NoSQL solutions are oversold, but this is no reason for complacency in the MySQL community.  There are new challenges in data management, and we need to solve them or become irrelevant.   I will show some of the advances Continuent has on tap for MySQL-based applications and also point back to problems our experience shows must be solved within MySQL itself.  
  • 1pm Wednesday 24 April - Session:  State of the Art for MySQL Multi-Master Replication.  This talk will explain the fundamentals of multi-master operation and then trace the trade-offs of Tungsten, Galera, and other solutions.  Thanks to excellent work on several products there is a lot of excitement about multi-master in 2013.  My goal is to help listeners understand what applications are possible now as well as what we have the potential to achieve in the future.  
I hope you will attend these talks.  I am looking forward to meeting old friends at the conference and making new ones.  

Incidentally, Percona Live sent me an email yesterday that you can get a 15% discount on the registration price using  the code KeySQL when you sign up.  At Continuent we are also offering free passes to customers who give us the best quotes about our software.  However you get there, I really recommend this conference.  

Tuesday, February 19, 2013

Data Fabric Design Patterns: Fabric Connector

This article is the third in a series on data fabric design and introduces the fabric connector service design pattern.  The previous article in this series introduced the transactional data service design pattern, which defines individual data stores and is the building block for data fabrics based on SQL databases.  The fabric connector builds on transactional data services and is another basic building block of fabric architecture.

Description and Responsibilities

Fabric connectors make a collection of DBMS servers look like a single server.  The fabric connector presents what appears to be a data service API to applications.  It routes each request to an appropriate physical server for whatever task the application is performing, hiding the fact that a data fabric can consist of dozens or even hundreds of servers.  Applications cannot tell the difference between talking to the fabric connector and talking to a real DBMS server.  We call this property transparency.

Here are the responsibilities of a fabric connector.  I will use the phrase proxying to refer to the first of these, and routing responsibilities to refer to the remaining three.  
  1. Expose a data service interface to applications.
  2. Route each application query to an appropriate DBMS server.
  3. Balance load by distributing queries across multiple replicas, if available.
  4. Switch to another server following a failure or if the DBMS becomes unavailable due to maintenance.
The following diagram shows the logical components of a fabric connector.  The fabric connector sits between applications, transactional data services, and a fabric directory service.  These are greyed out, as they are not part of the pattern.
Fabric Connector Design Pattern
Fabric connectors contain two logical components.  The proxy is responsible for routing queries and responses between applications and underlying data services.  This can be a library layer, a separate server process, or a TCP/IP load balancer--anything that provides a transparent indirection layer.  The directory information contains rules to route SQL queries correctly to the actual location of data.  There is a notification protocol that permits connectors to receive updates about the fabric topology and confirm that they have responded to them.


Connecting to data is a problem in large systems.  Sharded data sets spread data across multiple services.  Data services have different roles, such as master or slave.  Services fail or go offline for maintenance.  Services change roles, such as a master switching to a slave.  Shards move between services to balance load and use storage more efficiently.  Within short periods of time there may be significant variations in load across data services. Adding routing logic directly to applications in these cases adds complexity and can lead to a tangled mess for administrators.

The fabric connector design pattern encapsulates logic to route connections from the application to DBMS servers.  Hiding connection logic helps keep applications simple.  It allows independent testing and tuning of the connection rules.  That way you can have some assurance the logic actually works.  You can also modify fabric behavior without modifying applications, for example to redistribute load more evening across replicas.

Related Design Patterns

The fabric connector design pattern manages single application connections to data services, for example a transactional data service.  Transparency is the leitmotif of this pattern.  It provides needed encapsulation for other data fabric design patterns and is particularly critical for sharded as well as fault tolerant data services.  These will be covered in future articles on data fabric design.

There are also other design patterns for data access.  Here are two that should not be confused with fabric connectors.
  • Federated query.  Federated query splits a SQL query into sub-queries that it routes to multiple underlying data services, then returns the results.  Sharding products like DbShards and shard-query implement this pattern.  It requires complex parsing, query optimization, and aggregation logic to do correctly and has varying levels of transparency. 
  • MapReduce.  MapReduce is a procedure for breaking queries into pieces that can run in parallel across large numbers of hosts by splitting the query into map operations to fetch data followed by reduce operations to aggregate results.  It can work on any distributed data set, not just SQL.  MapReduce implementations often eschew SQL features like joins and also can have a very different programming model from SQL.  Their use is often non-transparent to SQL applications.
Finally, there is a very important pattern for the fabric directory service.  This is a directory service that maintains information about the desired topology of the fabric and its actual state.  It can be implemented in forms ranging from a shared configuration file to network services in a distributed configuration manager like ZooKeeper.

I hope to add more complete descriptions for the latter three design patterns at some point in the future.  For the current article, we will stick to simple connectivity.

Detailed Behavior

Fabric connectors are conceptually simple:  route request from application to server, then transfer results back.  Actual behavior can be quite complex. To give some perspective on the problem, here is a short Perl program for a SaaS application that logs order detail information in a table named sale, then reads the same data back.  We will use the sample program to illustrate the responsibilities of this design pattern in detail.

use DBI;
# Connect to server. 
$dbh = DBI->connect("DBI:mysql:test;host=prodg23", "app", "s3cr3t5"
            ) || die "Could not connect to database: $DBI::errstr";

# Insert order using a transaction. 
$dbh->{'AutoCommit'} = 0;
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 353009, 24.99)");
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 268122, 59.05)");

# Select order back with an auto-commit read
$dbh->{'AutoCommit'} = 1;
$sth = $dbh->prepare("SELECT * FROM sale WHERE order_id=2331");
while( $href = $sth->fetchrow_hashref ) {
  print "id      : $$href{id} \n";
  print "order_id: $$href{order_id} \n";
  print "cust_id : $$href{cust_id} \n";
  print "sku     : $$href{sku} \n";
  print "amount  : $$href{amount} \n";

# Disconnect from server. 
The first responsibility of the fabric connector design pattern is to provide a transparent interface for underlying data services.  That means that our Perl program has to work as written--no extra changes.  Here are just a few things a connector needs to do:
  1. Implement the DBMS connection protocol fully or pass it transparently to an underlying server.  This includes handling authentication handshakes as well as setting base session context like client character sets. 
  2. Handle all standard features of query invocation and response, including submitting queries,  returning automatically generated keys, and handling all supported datatypes in results.  
  3. Respect transaction boundaries so that the INSERT statements on the sales table are enclosed in a transaction in the DBMS and the SELECT statement is auto-commit (i.e., a single-statement transaction.) 
  4. Read back data written to the sales table.  
In addition to handling APIs protocols, fabric connectors need to avoid slowing down transaction processing as a result of proxying.  Properly written connectors for the most part add minimal overhead, but there are at least two instances where this may not be the case for some implementations (such as network proxies).  The first is establishing connections, a relatively expensive operation that occurs constantly in languages like PHP that do not use connection pools.  The second is short primary key-lookup queries on small datasets, which tend to be memory-resident in the server and hence have quick access.

One common reaction is to see such overhead as a serious problem and avoid the whole fabric connector approach.  Yet the "tax" applications pay for proxying is not the whole story on performance.  Fabric connectors can boost throughput by an order of magnitude by distributing load intelligently across replicas. To understand the real application overhead of a connector you therefore need to measure with a properly sized data set and take into account load-balancing effects.  Test results on small data sets that reside fully in memory with no load balancing tend to be very misleading.  

The remaining fabric connector design pattern responsibilities are closely related:  route requests accurately to the correct service, load-balance queries across replicas within a service, and route around replicas that are down due to maintenance or failure.  We call these routing responsibilities.  They require information about the fabric topology, which is maintained in the connector's directory information.  Here is a diagram of typical directory organization.  
Fabric Directory Service Organization
Let's start with the responsibility to route requests to data services.  A simple fabric connector implementation allows connections using a logical server name, such as group2, which the connector would translate to an actual DBMS server and port, such as prodg23:3306.  A better fabric connector would allow applications use a customer name like "walmart" that matches what the application is doing.  The connector would look up the location of customer data and connect automatically to the right server and even DBMS schema.  This is especially handy for SaaS applications, which often shard data by customer name or some other simple identifier. 

We could then change our program as follows to connect to the local host and look for the "walmart" schema.  Under the covers, the fabric connector will connect to the prodg23 server and use the actual schema for that customer's data. 

use DBI;
# Connect to customer data.  
$dbh = DBI->connect("DBI:mysql:walmart;host=localhost", "app", "s3cr3t5"
            ) || die "Could not connect to database: $DBI::errstr";

This is a modest change that is very easy to explain and implement.  It is a small price to pay for omitting complex logic to locate the correct server and schema that contains the data for this customer. 

The next responsibility is to distribute data across replicas.   This requires additional directory information, such as the DBMS server role (master vs. slave), current status (online or offline), and other relevant information like slave latency or log position.  There are many ways to use this information effectively.   Here are a few of the more interesting things we can do.
  1. Slave load balancing.  Allow applications to request a read-only connection, then route to the most up-to-date slave.  This works well for applications such as Drupal 7, which is an application for website content management.  Drupal 7 is slave-enabled, which means that it can use separate connections for read-only queries that can run on a replica.  Many applications tuned to work with MySQL have similar features. 
  2. Session load balancing.  Track the log position for each application session and dispatch reads to slaves when they are caught up with the last write of the session.  This is a good technique for SaaS applications that have large numbers of users spread across many schemas.  It is one of the most effectively scaling algorithms for master/slave topologies.     
  3. Partitioning.  Split requests by schema across a number of multi-master data services.  SQL requests for schema 1 go to server 1, requests for schema 2 to server 2, etc.  Besides distributing load across replicas this technique also helps avoid deadlocks, which can become common in multi-master topologies if applications simultaneously update a small set of tables across multiple replicas.  
Recalling our sample program, we could imagine a connector using session load balancing to write the sales table transaction to the master DBMS server, then sending the SELECT to a slave if it happened to be caught up for customer "walmart."  No program changes are required for this behavior.  

The final responsibility is to route traffic around offline replicas.  This gets a bit complicated.  We need not only state information but an actual state model for DBMS servers.  There also needs to be a procedure to tell fabric connectors about a pending change as well as wait for them to reconfigure themselves.  Returning to our sample program, it should be possible to execute the following transaction: 

$dbh->{'AutoCommit'} = 0;
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 353009, 24.99)");
$dbh->do("INSERT INTO sale(order_id, cust_id, sku, amount) \
   VALUES(2331, 9959, 268122, 59.05)");

then failover to a new master and execute:

$dbh->{'AutoCommit'} = 1;
$sth = $dbh->prepare("SELECT * FROM sale WHERE order_id=2331");

To do this properly we need to ensure that the connecter responds to updates in a timely fashion.  We would not want to change fabric topology or take a DBMS server offline while connectors were still using it.  The notification protocol that updates connector directory information has to ensure reconfiguration does not proceed until connectors are ready.

Does every fabric connector have to work exactly this way?  Not at all.  So far, we have only been talking about responsibilities.  There are many ways to implement them.  To start with, fabric connectors do not even need to handle SQL.  This is interesting in two ways.

First, you can skip using the Perl DBI completely and use a specialized interface to connect to the fabric. We will see an example of this shortly.  Second, the underlying store does not even need to be a SQL database at all.  You can use the fabric connector design pattern for other types of stores, such as key-value stores that use the memcached protocol.  This series of articles focuses on SQL databases, but the fabric connector design pattern is very general.


Here are a couple of off-the-shelf implementations that illustrate quite different ways to implement the fabric connector design pattern.

1. Tungsten Connector.  Tungsten Connector is a Java proxy developed by Continuent  that sits between applications and clusters of MySQL or PostgreSQL servers.  It implements the MySQL and PostgreSQL network protocols faithfully, so that it appears to applications like a DBMS server.

Tungsten Connector gets directory information from Tungsten clusters.  Tungsten clusters use a simple distributed consensus algorithm to keep directory data consistent across nodes even when there are failures or network outages--connectors can receive topology updates from any node in the cluster through a protocol that also ensures each connector acts on it when the cluster reconfigures itself.  In this sense, Tungsten clusters implement the fabric directory service pattern described earlier.

The directory information allows the connector to switch connections transparently between servers in the event of planned or even some unplanned failovers.   It can also load balance reads automatically using a variety of polices including the slave load balancing and session load balancing techniques described above.

The big advantage of the network proxy approach is the high level of transparency for all applications.  Here is a sample session with the out-of-the-box mysql utility that is part of MySQL distributions.  In this sample, we check the DBMS host name using the MySQL show variables command.   Meanwhile, a planned cluster failover occurs, followed by an unplanned failover.

mysql> show variables like 'hostname';
| Variable_name | Value   |
| hostname      | prodg23 |
1 row in set (0.00 sec)
(Planned failover to prodg21 to permit upgrade on prodg23)
mysql> show variables like 'hostname';
| Variable_name | Value   |
| hostname      | prodg21 |
1 row in set (0.01 sec)
(Unplanned failure to prodg22)
mysql> show variables like 'hostname';
| Variable_name | Value   |
| hostname      | prodg22 |
1 row in set (4.82 sec)
As this example shows, the session continues uninterrupted as the location of the server switches.  These changes occur transparently to applications.  The downside is that there is some network overhead due to the extra network hop through the Tungsten Connector, though of course load balancing of reads can more than repay the extra latency cost.  Also, this type of connector is hard to build because of the complexity of the MySQL network API as well as the logic to transfer connections seamlessly between servers.

2. Gizzard.  Gizzard is an open source sharding software developed by Twitter to manage links between Twitter users.  The proxy part of the design pattern is implemented by middleware servers, which accept requests from clients using thrift, a language-independent set of tools for building distributed services.  For more on a particular application built on Gizzard, look at descriptions of Twitter's FlockDB service.  Gizzard servers give applications a simple API for data services, which fulfills the proxy responsibility of the fabric connector design pattern.

Gizzard servers get directory information using gizzmo.  Gizzmo is a simple command line tool that maintains persistent copies of the Gizzard cluster topology and takes care of propagating changes out to individual Gizzard servers.  For more on how Gizzmo works, look here.  Using this information, Gizzard servers can locate data, route around down servers, and handle distribution of queries to replicas, which are the final three responsibilities of the fabric connector design pattern.

The Gizzard architecture lacks the generality of the Tungsten Connector, because it requires clients to use a specific interface rather than general-purpose SQL APIs.  It also introduces an extra network hop.  On the other hand, it works extremely well for its intended use case of tracking relationships between Twitter users.  This is because Gizzard deals with a simplified problem and also allows scalability through many Gizzard servers.  Like the Tungsten Connector the network hop expense pays for itself due to the ability to load-balance across multiple replicas.

Gizzard is a nice example of how the fabric connector design pattern does not have to be specifically about SQL.  Gizzard clients do not use SQL, so the underlying store could be anything.  Gizzard is specifically designed to work with a range of DBMS types.

Fabric Connector Implementation Trade-Offs

General-purpose fabric connectors like the one used for Tungsten are hard to implement for a variety of reasons.  This approach is really only practical if you have a lot of resources at your disposal or are doing it as a business venture like Continuent.  You can still roll your own implementations.  The Gizzard architecture nicely illustrates some of the trade-offs necessary to do so.

1. General vs. particular data service interfaces.  Implementing a simple data service interface, for example using thrift, eliminates the complexity of DBMS interfaces like those of MySQL or PostgreSQL.  Rather than a thrift server you can also use a library within applications themselves.  This takes out the network hop.

2. Automatic vs. manual failover.  Automatic failover requires connectors to respond to fabric topology changes in real time, which is a hard problem with a lot of corner cases.  (Look here if you disagree.) You can simplify things considerably by minimizing automated administration and instead orchestrate changes through scripts.

3. Generic vs. application-specific semantics.  Focusing on a particular application allows you to add features that are important for particular use cases.  Gizzard supports shard migration.  To make it tractable to implement Gizzard requires a simple update model in which transactions can be applied in any order.

These and other simplifications make the fabric connector design pattern much easier to implement correctly.  You can make the same sort of trade-offs for most applications.

Implementations to Avoid

Here are a couple of implementations for the fabric connector design pattern that you should avoid or at least approach warily.

1. Virtual IP Addresses (VIPs).  VIPs allow hosts to listen for traffic on multiple IP addresses.  They are commonly used in many failover schemes, such as programs like heartbeat.  VIPs do not have the intelligence to fulfill fabric connector responsibilities like load-balancing queries across replicas.  They are subject to nasty split-brains, a subject I covered in detail as part of an earlier article on this blog.   Finally, VIPs are not available in Amazon and other popular cloud environment.  VIPs do not seem like a good implementation choice for data fabrics.

2. SQL proxies.  There are a number of software packages that solve the problem of proxying SQL queries, such as PgBouncer or MySQL Proxy.  Many of them do this quite well, which means that they fulfill the first responsibility of the fabric connector design pattern.  The problem is that they do not have a directory service.  This means they do not fulfill the next three responsibilities to route queries effectively, at least out of the box.

Unlike VIPs, SQL proxies can be a good starting point for fabric implementations.  You need to add the directory information and notification protocol to make them work.  It is definitely quite doable for specific cases, especially if you make the sort of trade-offs that Gizzard illustrates.

Conclusion and Takeaways

The fabric connector design pattern reduces the complexity of applications by encapsulating the logic required to connect to servers in a data fabric.  There is a tremendous benefit to putting this logic in a separate layer that you can test and tune independently.  Fabric connectors are more common than they appear at first because many applications implement the responsibilities within libraries or middleware servers that include embedded session management.  Fabric connectors do not have to expose SQL interfaces or any other DBMS-specific interface, for that matter.

Fault-tolerant and sharded data service design patterns depend on fabric connectors to work properly and avoid polluting applications with complex logic to locate data.  Products that implement these design patterns commonly include fabric connector implementations as well.  You can evaluate them by finding out how well they fulfill the design pattern responsibilities. 

Off-the-shelf fabric connectors have the advantage that they are more general than something you can develop easily for yourself.  If you decide to write your own fabric connector, you will need to consider some of the trade-offs like reducing automation or simplifying APIs in order to make the problem easier to solve.  Regardless of the approach, you should allow time.  The responsibilities are complicated and must be implemented with care.  Fabric connectors that only work 99% of the time of are not much use in production environments.

One final point about fabric connectors.  Automated failover can make fabric connectors harder to implement and increase the risk that the fabric connector may write to the wrong replica.  The difficulty of managing connectivity is one of the reasons many data management experts are very cautious about automatic failover.  This problem is tractable in my opinion, but it is definitely a practical consideration in system design.

My next article on data fabrics will cover the fault-tolerant data service design pattern.  This design pattern depends on the fabric connector design pattern to hide replicas.  I hope you will continue reading to find out about it.  

Wednesday, February 6, 2013

Data Fabric Design Patterns: Transactional Data Service

This article is the second in a series on data fabric design and introduces the transactional data service design pattern.  The previous article in this series introduced data fabrics, which are collections of off-the-shelf DBMS servers that applications can connect to like a single server.  They are implemented from data fabric design patterns, which are reusable arrangements of DBMS servers, replication, and connectivity.  With this article we begin to look at individual design patterns in detail.

Description and Responsibilities

The transactional data service is a basic building block of data fabric architectures.  A transactional data service is a DBMS server that processes transactions submitted by applications and stores data safely.  Transactional data services have the following responsibilities:
  • Store data transactionally and recover data faithfully to the last full transaction following failure.  
  • Provide a network-accessible application interface for accessing data
  • Provide a reliable and reasonably quick method for backup and restore. 
  • Maintain an accessible, serialized log of transactions.  This enables replication between services.  
The following diagram illustrates the moving parts of a transactional data service.   In future diagrams we will just use the standard database symbol for the entire transactional data service, but for now we need to be able to see the contents.  

Durable storage of transactions is the most fundamental responsibility of database systems.  It is difficult to build reliable applications if stored data can disappear or become corrupted because transactions were not committed before a crash.  Both problems can cause data loss.  Moreover, they can break replication links very badly if the DBMS server comes up in an inconsistent state, for example with some updates committed but others randomly rolled back.  This condition affects not only the one server but potentially many others throughout the fabric.

The transactional data service therefore focuses on storing data safely and recovering to the last committed transaction after a restart.  With this basic capability we can construct more complex services knowing that individual changes are unlikely to disappear or be recorded inconsistently.  

Detailed Behavior

Let's look in detail at the properties required for a successful transactional data service.  Somewhat surprisingly, an off-the-shelf SQL DBMS does not necessarily fit the pattern, though it comes close.  It is important to understand the differences. 

The transactional store keeps data from getting lost and is the basis for recovery throughout the fabric.  Transactional stores support commit and rollback with multi-statement transactions.  In theory the transaction data service responsibility for data persistence matches MySQL/InnoDB and PostgreSQL behavior, both of which commit transactions safely in serial order.  However, the reality is not quite that simple.

Most DBMS allow applications to ignore transactions under certain conditions.  This results in  wormholes, which are violations in serial ordering of data.  There are a number of table definition options in SQL that undo transactional consistency.
  • (MySQL) MyISAM table type.  MyISAM tables ignore transactions and commit immediately, even if the application later tries to roll back.  The tables may also become corrupted by server crashes.  
  • (MySQL) Memory table type.  These tables are maintained in memory only.  They disappear on restart.  
  • (PostgreSQL) UNLOGGED tables.  Such tables are not logged and disappear on crash or unclean shutdown (thanks Frederico).  
All of these allow data to disappear or become corrupted after a crash. However, there is a more subtle problem.  If applications depend on these tables, transaction results may then depend on when the server last crashed or restarted, which in turn makes updates across replicas non-deterministic.  Random updates create problems for data replication, which depends on replicas behaving identically when transactions are applied.  It is important to avoid application dependencies on any feature that creates wormholes or you might not be able to use other design patterns.

So is data loss always bad?  Surprisingly, no.  In some cases transactional stores can lose data provided that they do so by dropping the last transactions in serial order.   It's as if the data just reverted to an earlier point in time.  To understand why this might be OK, imagine three servers linked into a chain by asynchronous replication.
Data loss is sometimes not a big deal
It is bad to lose data on the first server, especially if those data are lost before replicating transactions to the downstream replicas.  However, data loss on the last server is fine.  Assuming that server stores the replication restart point transactionally, it will just re-apply the missing transactions and catch up.  This is exactly what happens when you restore a backup in slave in master/slave replication.

Data loss on the second server also may not be a problem.  It should restart replication and should generate identical transactions for itself as well as for replication to the last server.  In both cases we assume that replication will handle these cases correctly and can replay missing transactions from logs.  If so, you can not only tolerate such losses but even depend on recovering from them automatically.

Turning to the next responsibility of the transactional data service, the application interface may obviously include SQL via MySQL or PostgreSQL wire protocols.  However, any consistent interface that is accessible over a network will do.  The memcached protocol is also perfectly acceptable.  Subsets of SQL such as stored procedures work quite well.  Transactional data services are more general than SQL DBMS servers in this sense.  Full SQL or even a subset is not a requirement.  

Backup and restore are critical for data fabrics as they enable provisioning of new services as well as recovery of services that fail.  You restore a backup and then let the service catch up using replication.  Data fabrics can get along fine using a range of options from logical dumps of the DBMS (mysqldump or pgdump) to file system snapshots.  Fabric backups just need to be transactionally consistent and movable to other hosts.

Note that the backup required by the transaction data service design pattern is a less general form of backup than most businesses really require.  Businesses may need to recover data after accidental deletion or to keep copies of information for many years for legal reasons.  You can therefore use a general-purpose backup solution like Zmanda or Barman provided it meets the fabric design pattern requirements.  There's no need to do things twice.  

Finally, the replication log is a serialized list of transactions to replicate to other hosts.  Serialization enables the transactions to be replayed on another host and result in an identical copy.  Generally speaking, data fabrics require logical replication, which applies changes to replicas using SQL statements on a live server.  This is because other design patterns depend on being able to access and even write to the transactional data service when it is acting as a slave.   Binary replication methods like disk block replication, such as DRBD, do not meet this requirement and therefore are of limited use in data fabrics.  


You can implement the transactional data service design pattern with any DBMS that meets the pattern responsibilities.  That said, implementation details are very important.  As we have seen, ensuring that DBMS servers live up to the responsibility to store transactions safely is a little harder than one might think.  

1. MySQL.  MySQL with InnoDB engine is generally a good choice.  It has stable SQL APIs and a wide range of capable client libraries.  However, MySQL must be correctly configured to maintain proper transactional guarantees.  Here are three properties that should be in your my.cnf file to help ensure MySQL lives up to its responsibilities: 

# Ensure durable flush to storage on transaction commit.   
# Synchronize binlog with committed transactions. 
# Use InnoDB as default storage engine.  (Unnecessary for MySQL 5.5 and above.) 

There are a variety of good backup mechanisms for MySQL databases, including mysqldump (with --single-transaction, useful only for small data sets), Percona XtraBackup, and file system snapshots. Snapshots are especially good when using NetApp or other capable storage.  NetApp snapshots can be restored in seconds and cost little in terms of performance overhead.  

2. PostgreSQL.  PostgreSQL with a trigger-based replication log, for example from Londiste or SLONY, and pgdump for backups is another good choice.  PostgreSQL has unusually good trigger support for DML changes at least, and permits users to encode them in a number of languages.  Be aware the PostgreSQL triggers do not capture DDL statements like CREATE TABLE, though.

PostgreSQL is fully transactional out of the box and triggers create a fully serialized replication log.  It does not have the problem that MySQL does with potentially unsafe table types like MyISAM.  However, you need to set a couple of parameters to ensure safe operation.  These ensure transactions are committed down to the storage level and prevent catastrophic corruption of the database and/or the WAL (write-ahead log).  Since PostgreSQL defaults to these values, you mostly need to avoid turning them off.

fsync = on                             # turns forced synchronization on or off
synchronous_commit = on                # immediate fsync at commit

Like MySQL, PostgreSQL SQL and APIs are stable and well-known.  Pgdump also loads and restores data without difficulty for smallish data sets.  For larger data sets file system snapshots work very well.  

Regardless of the DBMS type you choose, it is important to avoid application-level features that introduce wormholes, such as the PostgreSQL unlogged tables mentioned in the previous section.  Generally speaking, you should only skip transactions if there is a very strong reason for doing so.

Do other database types work for this design pattern?  Of course.  You can also use a commercial DBMS like Oracle.  Oracle fulfills the pattern responsibilities quite well, but is a bit more heavyweight than users want, particular when operating in the cloud.  

And Hardware Implementation, Too...

Even with a properly configured DBMS server you are still not completely out of the woods for data durability.  Database servers generally use an fsync() or similar system call to flush data to storage.  Unfortunately storage controller cards may just cache the data to be written in local RAM and return.  That can fool the DBMS server into thinking transactions are safely stored when they actually are still sitting in memory on a controller card.  The "committed" data will then vaporize in a host crash, which in turn can fatally corrupt both MySQL and PostgreSQL stores if you are very unlucky.  Just a few bad blocks can cause very serious problems.

Fortunately there is a cure to make data vaporization less likely.  On local storage you can invest in RAID with a battery-backed cache (BBU), which keeps power on for the cache even if the host fails completely.  SANs and network attached storage tend to have this capability built in.  (But check the specifications!)  Battery backed cache also tends to be fast, since controllers can safely return from an fsync() operation as soon as the data to be written are in the on-board cache.  Without this feature writes to storage can be painfully slow.

One interesting question is how to handle cloud environments like Amazon.  You just do not know how the storage actually works.  (That's really the point of a cloud, after all.)  Amazon provides SLAs for performance (example:  EBS provisioned IOPS), but there do not seem to be any SLAs about storage consistency.  There is a lot to learn here and lessons that apply to Amazon may not necessarily apply to others.  I suspect this will prompt some rethinking about data consistency--it's an interesting "what if" for transaction processing to suppose you cannot trust the underlying storage capabilities.

Data loss occurs sooner or later in virtually all systems, but the good new is that you can make it uncommon.  For more information check out data sources like this and this.  Also, other fabric design patterns like the Fault-Tolerant Data Service keep applications running when failures do occur and can even minimize the effects of data loss.  See the upcoming article on that design pattern for more information.

Implementations to Avoid

Here are two examples that do not meet the transactional data service design pattern responsibilities or at least not fully.  

1. MySQL with MyISAM table type.  MyISAM does not support transactions and is not crash safe.   You will lose data or incur downtime fixing problems.  MyISAM does not belong in data fabrics.

2. PostgreSQL with streaming replication.  Streaming replication replicates log updates in real-time and has the added benefit of permitting queries on replicas.  However, streaming replication does not allow you to write to replicates.  It therefore does not support online schema maintenance or multi-master replication.  It also does not help with heterogeneous replication.  Streaming replication is therefore an unattractive choice, even though it is far simpler and works better for ensuring high availability than logical replication solutions like SLONY.  

How do NoSQL stores fare in this design pattern?  Let's pick on MongoDB.  MongoDB supports atomic commit to single BSON documents but does not support transactions across multiple documents.  (Unless you think that the hacky two-phase commit proposed by the MongoDB manual is an answer.)  Atomic transactions are not one of the reasons people tend to choose NoSQL systems, so this is not surprising. It means that MongoDB cannot handle the responsibilities of this design pattern.  

Conclusion and Takeaways

The transactional data service design pattern can be implemented with a carefully configured SQL database.  As we have seen, however, there are a number of details about what "carefully configured" really means.

It is a good idea to use the transactional data service design pattern even if you are not planning to implement a data fabric.  Systems grow.  This pattern gives you the flexibility to build out later by adding other fabric design patterns, for example to introduce cross-site operation using the Multi-Site Data Server pattern.  It also protects your data at multiple levels that include transactions as well as regular backups.  Nobody really likes losing data if it can be avoided.

Another important point: the transactional data service design pattern keeps your entire fabric working.  Losing small amounts of data is typically just an inconvenience for users, especially if it does not occur too often.  Broken replication on the other hand due to replicas that diverge or become corrupt after failures can lead to time-consuming administration and significant downtime to repair.  The fabric is a network of servers.  Poor configuration on one server can cause problems for multiple others.

Finally, the biggest error people make with this design pattern is to neglect backups.  There's something inherently human about it:  backups are painful to test so most of us don't.  My first and biggest IT mistake involved a problem with backups.  It nearly caused a large medical records company to lose a week of data entry records.  Put in backups and test them regularly to ensure they work.  This is a theme that will recur in later articles.

Speaking of which, the next article in this series will cover the fabric connector design pattern.  Stay tuned! 

Tuesday, February 5, 2013

Introducing Data Fabric Design for Commodity SQL Databases

Data management is undergoing a revolution.  Many businesses now depend on data sets that vastly exceed the capacity of DBMS servers.  Applications operate 24x7 in complex cloud environments using small and relatively unreliable VMs.  Managers need to act on new information from those systems in real-time. Users want constant and speedy access to their data in locations across the planet.

It is tempting to think popular SQL databases like MySQL and PostgreSQL have no place in this new world.  They manage small quantities of data, lack scalability features like parallel query, and have weak availability models.  One reaction is to discard them and adopt alternatives like Cassandra or MongoDB.  Yet open source SQL databases have tremendous strengths:  simplicity, robust transaction support, lightning fast operation, flexible APIs, and broad communities of users familiar with their operation.  The question is how to design SQL systems that can meet the new requirements for data management.

This article introduces an answer to that question:  data fabric design.  Data fabrics arrange off-the-shelf DBMS servers so that applications can connect to them as if they were a single database server.  Under the covers a data fabric consists of a network of servers linked by specialized connectivity and data replication.  Connectivity routes queries transparently from applications to DBMS servers.  Replication creates replicas to ensure fault tolerance, distribute data across locations, and move data into and out of other DBMS types.  The resulting lattice of servers can handle very large data sets and meet many other requirements as well.

Data fabric design is a big topic, so I am going to spread the discussion over several articles.  This first article provides a definition of data fabric architecture and introduces a set of design patterns to create successful data fabrics.  In the follow-on articles I will explore each design pattern in detail.  The goal is to make it possible for anyone with a background in database and application construction to design data management systems that operate not only today but far into the future.  At the very least you should understand the issues behind building these systems.

Some readers may see data fabric design as just another reaction to NoSQL.  This would be a mistake.  Building large systems out of small, reliable parts is a robust engineering approach that derives from ground-breaking work by Jim GreyPat Helland, and others dating back to the 1970s.  Data fabrics consist of DBMS servers that you can look at and touch, whereas NoSQL systems tend to build storage, replication, and access into a single distributed system.  It is an open question which approach is more complex or difficult to use.  There are trade-offs and many systems actually require both of them.  You can read this article and those that follow it, then make up your own mind about the proper balance.


The data fabric concept is largely based on practical experience on Continuent Tungsten.  I am indebted to Continuent as well as our customers for the opportunity to work on this problem.  I am likewise indebted to Ed Archibald, Continuent CTO, with whom I have worked for many years.  Ed among other things came up with the data fabric moniker.  Our interest in this topic goes back to shared experiences with Sybase OpenServer and Sybase Replication Server, which were ground-breaking products in the fields of connectivity and replication.  Two decades later we are still applying the insights gained from working on them.

What Is a Data Fabric Architecture?

Let's start the discussion of data fabrics with a practical problem.  We want to design a SaaS application for customer relationship management that will support up to a million users using a commodity open source DBMS like MySQL.  What are the requirements?

Obviously, our system must be able to scale over time to hundreds of app servers operating on hundreds of terabytes data.  It must hide failures, maintenance, and schema upgrades on individual DBMS hosts.  It must permit data to distribute across geographic regions.  It must deliver and accept transactions from NoSQL, data warehouses, and commercial DBMS in real time.  It must allow smooth technology upgrade and replacement.  Finally, it must look as much like a single DBMS server to applications as possible.  Here's a picture of what we want: 
Conceptual Data Fabric
The last requirement is especially important and goes to the essential nature of data fabric architecture.  The fabric may contain dozens or even hundreds of servers but encapsulates their locations and number.  Applications connect to the fabric the same way they connect to individual DBMS servers, a property we call transparency.  Transparency permits developers to build applications using a DBMS on a laptop and push out code to production through increasingly capable test environments without changes in behavior.   This is a potentially confusing requirement so let's look at a couple of examples.

Transparency does not mean that you get access to all data all the time from everywhere.  Say our sample application stores customer data across many servers.  The following SQL query to list regions with SaaS users who have sales to their own customers greater than $100,000 would typically not work:

mysql> select region, count(cust_id), sum(sales) as sales_total 
  from customer group by region having sales_total > 100000;

This is not as big a problem as it sounds, as SaaS applications for the most part operate within a single SaaS user at a time and do not ask for data across users.  Moreover, it is easy to understand that you need to do something special for this particular query, such as connect to all servers explicitly or load transactions into a data warehouse. Most SaaS designers are pretty comfortable with this limitation, which just makes explicit something that you know anyway.

On the other hand transparency does mean that your application talks to what looks like a single server for individual SaaS user data.  For instance the following sequence of commands on a single customer to insert a row and get the generated auto-increment key back must work in a fabric just as it does in a single DBMS server.

mysql> insert into customer(name, region, sales) values('bob', 'sfo', 10035.0);
Query OK, 1 row affected (0.00 sec)

mysql> select last_insert_id();
| last_insert_id() |
|                1 |
1 row in set (0.00 sec)

Selecting the last inserted ID is a standard idiom for adding rows to table with synthetic keys in MySQL.  It is baked into widely used libraries like PHP mysqli.  Change it and you break thousands of applications.  This could happen if the fabric switched a DBMS connection across servers between these two commands.  When operating on data for a single SaaS user fabric transparency needs to be as close to perfect as possible.

Any architecture that meets the preceding requirements including transparency is a data fabric.  The next topic is what is actually inside a fabric architecture.

What Are Data Fabrics Made of?  

Combining individual components into lattices is hardly a new idea.  Such compositions are common in fields from bridge-building to art.  One of my favorite examples is the famous arabesques of the Alhambra in Granada, which combine simple motifs into patterns then combine those to create still more complex patterns that cover walls and ceilings throughout the palace.  The resulting compositions are works of stunning beauty.  

Detail of Arabesque from Alhambra, Spain (Source: Wikipedia)
Arabesque construction is far from random.  Arabesques combine plant-like elements into geometric patterns.  Only certain elements are allowed--there are typically no human representations--and only certain patterns work well.  Arabesques are also expressed in a particular medium such as stone, plaster, tiles, or paint.

Much as arabesques do, data fabrics combine very specific elements in a medium consisting of the following logical parts: 
  1. A partitioned network of recoverable transactional stores (DBMS servers) connected by reliable messaging (replication).  Partitioned means that not every service exchanges information with every other, in the same way that DBMS servers for different applications may be separate silos.  
  2. A routing layer that connects applications to relevant data based on simple hints provided by the application itself, such as the name of a schema, a primary key, or whether the current user transaction is a set of writes or an auto-commit read.  
Stores, replication, and routing logic are the fundamental elements of fabric implementations.  These are powerful tools for building very large applications.  Let's take a quick tour, as their particular properties are critical for building systems that actually work.

Transactional stores are DBMS servers that apply changes as atomic units that either commit or roll back as a whole.  Transactional stores convert these changes into a serial history, which orders concurrent transactions in its log so that they can replay as if they had executed one after the other all by themselves.  Serial ordering enables recovery, which is the procedure to bring data back cleanly to the last committed transaction after a crash or restart.  These properties are at the heart of all relational DBMS systems today and are necessary for the fabric to operate.  Non-transactional stores are close to useless in fabrics.  Use them at your peril.

Reliable messaging transmits changes between stores without losing them or applying them twice.  The usual way to do this is through replication, which moves transactions automatically on commit and applies them to replicas.   Replication needs to support transactions, apply them in serial order, and recover just as transactional stores do.  Replication systems that fail to meet any of these three requirements do not cut the mustard and do not belong in data fabrics.

Data replication implementations differ in at least three major ways.  There is not really a "right approach" to replication--different methods work better for some applications than others.

1. Synchronous vs. asynchronous.  Synchronous replication moves transactions before reporting commit back to applications.  It minimizes problems with data loss if a replica fails but may slow or block applications if replication is slow.  It can also lead to deadlocks.  Asynchronous replication moves transactions after commit.  It does not block applications but leads to latency between replicas and may result in data loss if there is a failure before transactions replicate from a particular replica. 

2. Master-master vs. master/slave.  Master-master replication allows updates on any replica.  It requires conflict management, typically either through locking in advance or fixing up after the fact (also known as conflict resolution).  Master/slave replication requires applications to use a single master for updates.  It works well with asynchronous replication, which is easier to implement.  

3. Log-based vs. trigger-based.  Log-based replication reads the database journal or some representation of it like the MySQL binlog.  Log-based replication has lower impact on DBMS servers but can be very hard to implement correctly.  Trigger-based replication uses triggers to capture transaction content, typically using transfer tables.  Trigger-based replication is simpler to implement but adds load to DBMS servers, can add unwanted serialization points to transactions, and may not handle certain types of changes, such as DDL data.  

The routing layer directs application DBMS connections to specific copies of data.  An application sees what appears to be a single session for each connection with consistent settings, such as character set encodings and session variables.  Underneath the fabric may actually switch the connection across DBMS servers at opportune times.  Transparency is absolutely critical.  As mentioned previously, even tiny changes in session behavior breaks libraries that applications depend on to access data.  

There are many methods to route connections as well as transactions.  As with replication, there is no "best" way.  It all depends on your application and the type of environment in which you are operating.  Here are four common approaches. 

1. Gateways.  A gateway is a proxy process that sits between the application and DBMS server.  The gateway looks like a server to applications.  It establishes connections on behalf of the applications, hence can perform very flexible transformations on data at the cost of some performance due to the double network hop the gateway introduces.  Gateways are hard to implement initially, but good ones are relatively easy to program once they work properly.  

2. IP Routers.  Routing software switches IP packet paths between servers.  It has the lowest overheard and potentially highest transparency but require substantial effort to implement correctly.  

3. Library wrappers.  Library wrappers re-implement standard interfaces like JDBC or Perl DBI, then route application requests through underlying connectors like the MySQL or PostgreSQL JDBC driver.    Wrappers are relatively easy to implement and have excellent performance but do not handle traffic that goes outside the library.  Compiled versions can have nasty library dependencies that introduce new forms of RPM hell for users if you try to use them generally.

4. Custom routing layers.  Any interface that your applications use to access data can implement routing logic.  For instance, SOAP or JSON servers work perfectly well for this purpose.   Internal data access libraries can also implement routing.  This approach is specific to single applications, and like library wrappers does not cover other means of access.  

The biggest constraint for any routing method is to keep it simple.  Fat routing layers with complex logic tend to introduce bugs as well as change database semantics.  This in turn violates the requirement for transparency.

Fabrics use the preceding elements over and over.  As we build up data fabric designs it is helpful to use consistent notation for data stores, replication, and routing.  The following summarizes the main notation that you will see in diagrams.
Data Fabric Notation
At this point we understand that data fabrics consist of a specific set of elements.  However, we have not provided any organization yet.  The next question is therefore how to arrange them into real systems.

What Are the Design Patterns for Data Fabrics?

Having a model of fabric elements is not the same as an actual implementation.  We need to define how the elements are composed into a complete system that can be deployed.  This is where design patterns come in.

Design patterns are reusable solutions that combine flexibly with each other to create large-scale architectures.  In the case of data fabrics, design patterns offer guidelines to organize fabric elements in ways that work well for large data sets spread over multiple geographic locations.  They serve the same function as the geometric arrangements that organize individual motifs in an arabesque.

Data fabric design patterns arrange data services.  A data service is an abstraction for a transactional store with a well-defined API.  Design patterns either link data services in some way or create more capable data services from simpler ones.  The ability to create new services in this way is called service composability.  Composability is a fundamental attribute of fabric design patterns.  It is the reason that fabric designs can handle very large data sets flexibly.  Service composition is analogous to the patterns-within-patterns layout that allows arabesques to cover an entire building without seeming repetitive or boring.

There are six design patterns that are particularly useful for SQL databases.  The following diagram shows them using italized blue font and illustrates how they organize the implementation of our sample CRM system.
Fully Implemented Data Fabric
Here is a short description of each fabric design pattern.  
  • Transactional Data Service - A transactional store that can restart without losing data following failure.  It has a well-defined API, a mechanism for backup, and a transaction log for replication.   This is the building block for all other design patterns.  
  • Fabric Connector -- A routing method that hides the location of databases and enables data services consisting of multiple DBMS servers to look like a single server to applications.  
  • Fault-Tolerant Data Service -- A data service that protects against failures and enables zero-downtime maintenance using a set of redundant DBMS servers.  
  • Sharded Data Service -- A data service that handles large data sets by dividing them into shards distributed across a number of underlying data services
  • Multi-Site Data Service -- A data service that enables data to spread across multiple geographic locations using data services on each site linked by replication
  • Real-Time Data Bridge -- A link that enables real-time replication of data between heterogeneous data services, for example between MySQL and a data warehouse like Vertica.  
You can implement fabric design patterns in many different ways.  This means there are many possible implementations of data fabrics as well.  As long as you meet the assumptions of each particular design pattern, the result is likely to work well.  We will delve into some of the variations in the next few articles.

A Little Bit about Naming

You may have encountered the term "data fabric" in products (example here or here) or as a concept (a nice example here).  This is because the ideas behind fabrics are very general and apply to many types of data management systems, perhaps most notably storage area networks.  The name data fabric is very natural and seems to have occurred to many people independently, especially after storage vendors popularized it during the 1990s.

In our case the term "data fabric" always means a data fabric architecture as I defined it a couple of sections back.

Conclusion and More

This has been a longish introduction but introduces terminology and sets the stage for looking at specific design patterns used to build data fabrics.  Follow-on articles will look at individual design patterns in detail.

One final thought:  there are no doubt other ways of factoring design patterns for data fabrics.  This particular set of patterns works very well for SQL databases, and I have seen them used successfully in many large systems over the years.  I would argue they work well for other DBMS types as well but welcome your comments on other approaches.

p.s., The next article covering the transactional data service is now available.  

Sunday, January 13, 2013

Replicating from MySQL to Amazon RDS

There have been a number of comments that Amazon RDS does not allow users access to MySQL replication capabilities (for example here and here).  This is a pity. Replication is one of the great strengths of MySQL and the lack of it is a show-stopper for many users.  As of the latest build of Tungsten Replicator half of this problem is on the way to being solved. You can now set up real-time replication from an external MySQL master into an Amazon RDS instance.

In the remainder of this article I will explain how to set up Tungsten replication to an Amazon RDS slave, then add a few thoughts about why this feature is useful along with some suggestions for improvement.   To keep the article reasonably short I assume you understand the basics of installing Tungsten Replicator.  If you need more information, check out the online documentation.

Readying an RDS Test Instance

Amazon RDS is Amazon's on-demand relational database.  RDS supports several database types including MySQL, which I will use for this demonstration.  Launching a new instance is simple.  Login to the Amazon AWS Console using an account that has RDS enabled, then switch to the Amazon RDS Console.  Press the Launch a DB Instance button, whereupon a screen like the following appears:
RDS Database Selection
Press the Select button for MySQL Community Edition, which starts the configuration window.  (You can also replicate into Oracle if you are up for a challenge.  If you do this, please post what you did to get it to work!)  Next fill out properties for MySQL.
MySQL Instance Configuration
Among other things, you create a master login and password.  Note these carefully as you will need them to configure replication.  Then continue for another couple of screens, at which point you can launch your instance.  It takes about 10 minutes for new instances to spin up, after which you can see the instance properties in the AWS RDS Console.  Here's a screen shot of my test instance.
AWS RDS Console
Once the instance is up, test access.  This is of course necessary to prove the instance is running properly and that we can login from a remote location.  Note the host name in the Endpoint field.  This is a DNS entry for the new MySQL instance.  Using that and the master login, we can now fire up the mysql client from a remote host where we plan to run replication.

$ mysql -utungsten -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2125
Server version: 5.5.27-log Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
| Database           |
| information_schema |
| innodb             |
| mysql              |
| performance_schema |
| test               |
5 rows in set (0.01 sec)

This looks quite good.  We are now ready to install and start replication.

Important note!  If you have trouble connecting you may need to tweak your Amazon security group settings to open up ports, especially if you are replicating from non-Amazon locations.  Amazon has a very cool feature that can guess your originating host IP and offer a CIDR address that covers the port range from which you are operating.  I used this when configuring my security groups.

Setting Up Tungsten Replication

It is possible to set up replication from a MySQL master directly to Amazon RDS using a single Tungsten Replicator process.  However, it is more versatile and simpler to set up two replicators:  one to read from the MySQL master, and another replicator to apply to the Amazon RDS slave.  I will therefore demonstrate this configuration.

We will assume you have a MySQL master already running and that it meets prerequisites for running Tungsten.  Let's now grab the Tungsten code and install a master replicator.  You can get fresh builds from the Tungsten Replicator builds page.

We will take a recent replicator build that contains the RDS changes, which are documented in Issue 425.  Use 2.0.7 build 177 or later. The main improvement is to add a non-privileged slave mode that avoids invoking any of the operations forbidden by Amazon.  Among other things Tungsten normally uses commands like 'SET SESSION SQL_LOG_BIN=0' to suppress writing to the binlog when it applies data on a slave.  This command requires SUPER privilege, hence causes problems for underprivileged RDS logins.

Unpack the code and install the master replicator in /opt/continuent.  This is no different from installing a normal Tungsten master.  My master host is logos1.  Here are sample commands to pull the code and set up the master.  The example shows the minimum options--if you have MySQL installed in a non-standard location or otherwise differ from a stock installation you may need to add additional options.

mkdir ~/staging
cd staging
wget --no-check-certificate
tar -xf tungsten-replicator-2.0.7-177.tar.gz
tungsten-replicator-2.0.7-177/tools/tungsten-installer \
  --master-slave  \
  --master-host=logos1 \
  --datasource-user=tungsten  \
  --datasource-password=your_passord  \
  --service-name=aws \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos1 \

Next, set up the slave replicator.  For convenience I am going to install the slave on a separate host, named logos2, to avoid port clashes between the two replicators.  If you install on the same host, you'll need to install into another release directory location and use the --rmi-port and --thl-port options to avoid port overlaps.  Here is the command to set up the Amazon RDS slave.  Note that the tungsten-installer program can install code between hosts, which is an extremely useful feature.

tungsten-replicator-2.0.7-177/tools/tungsten-installer \
  --master-slave  \
  --cluster-hosts=logos2 \
  --master-host=logos1 \ \
  --datasource-user=tungsten  \
  --datasource-password=your_password  \
  --service-name=aws \
  --slave-privileged-updates=false \
  --home-directory=/opt/continuent \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --skip-validation-check=MySQLPermissionsCheck \

You may see a few warnings during the RDS installation, as the tungsten-installer cannot verify some settings on the Amazon RDS host.  These can be ignored.  If everything goes well, you now have two replicators up and running.  You can check the status of the master and slave using the trepctl command, as in:

/opt/continuent/tungsten/tungsten-replicator/bin/trepctl -host logos1 status
/opt/continuent/tungsten/tungsten-replicator/bin/trepctl -host logos2 status

Both replicators should report that they are online.  Now complete the exercise by proving that replication works between the replicators.  We start by logging into the local MySQL instance, creating a new table in the test schema, and adding data.

$ mysql -uroot test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 231488
Server version: 5.5.21-rel25.1-log Percona Server with XtraDB (GPL), Release rel25.1, Revision 234

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create table foo(id int primary key);
Query OK, 0 rows affected (0.24 sec)

mysql> insert into foo values (256);
Query OK, 1 row affected (0.00 sec)

Now login to the Amazon RDS instance and look for table foo.  

$ mysql -utungsten -p test
Enter password: 
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2161
Server version: 5.5.27-log Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from foo;
| id  |
| 256 |
1 row in set (0.01 sec)

Mission accomplished!  We have real-time replication enabled from a MySQL master to Amazon RDS.  At this point you can replicate more or less normally.  There are some obvious limitations due to the fact that Amazon RDS is locked down and does not grant our login full privileges.  
  1. Temp table replication may not work.  Tungsten depends on being able to issue commands of the form "set @@session.pseudo_thread_id=23531" and the like.  This prevents clashes between temp tables of the same name on different master sessions.  You may need to enable row replication on the master, which suppresses temp table replication.  (For other approaches, see my previous article on temp tables and the binlog.) 
  2. Any command that requires SUPER privilege will not work.  As an obvious example, you will not be able to grant SUPER privilege to new accounts.  Such commands will break replication. 
  3. All replicated commands go into the binlog, which is a potential performance drag and may slow down Amazon RDS slaves.  Parallel replication may not help in this case, since committing to the binlog is a serialization point that blocks other transactions.  This problem may be cured if Amazon picks up group commit fixes from MySQL 5.6 and/or MariaDB.  
All things considered, however, these are minor inconveniences.  Most applications should be able to replicate without difficulties, especially if the master transaction rate is not too high.

Configuring SSL for Connections to RDS

In the previous demonstration I used a master host running outside Amazon.  This means my test transactions traveled across the Internet, where they were visible to all and sundry along the way.  To illustrate, we can run tcpdump and watch traffic as it goes by.

$ sudo tcpdump -A -vvv -s 256 host
logos2.46657 > Flags [P.], cksum 0xb0f7 (incorrect -> 0x995d), seq 3012:3073, ack 3412, win 94, options [nop,nop,TS val 89630514 ecr 74223746], length 61
.W.2.l..9....insert into foo values (256) /* ___SERVICE___ = [aws] */
22:55:36.316049 IP (tos 0x8, ttl 51, id 26226, offset 0, flags [DF], proto TCP (6), length 63)

If we were handling confidential data, exposing traffic like this to possible evildoers would be a serious problem.  Fortunately, Amazon RDS supports SSL encrypted connections from clients.  Here is how to use it with Tungsten.

First, you need to get the Amazon RDS certificate, which is used to sign certificates for individual RDS instances.

mkdir /opt/continuent/certs
cd /opt/continuent/certs

Next, you need to create a trust store that Java can access containing the certificates of signing authorities whom you trust.  For this you will need the Java keytool utility, which is included in the JDK.  If you are just using the Java runtime in production, you will need to generate the store on another host, then copy it over to your test hosts.  I used the password "secret" in this example.

$ keytool -import -alias rds -file mysql-ssl-ca-cert.pem -keystore truststore
Enter keystore password:  
Re-enter new password: 
Owner:, OU=RDS,, L=Seattle, ST=Washington, C=US
Issuer:, OU=RDS,, L=Seattle, ST=Washington, C=US
Trust this certificate? [no]:  yes
Certificate was added to keystore

We now need to tell the slave replicator about the truststore file using Java VM options.  On the slave host, edit /opt/continuent/tungsten/tungsten-replicator/conf/wrapper.conf and add the extra options shown in bold face.  

# Java Additional Parameters

The last step is to enable SSL encryption when applying data.  We need to set an extra URL option on the drizzle JDBC driver to turn on SSL. For this we need to edit the file that configures replication.  In my example this file is located in /opt/continuent/tungsten/tungsten-replicator/conf/  Open the file and look for the section that starts with APPLIERS. Edit the text to add additional urlOptions line as shown below.  



Restart the replicator process (/opt/continuent/tungsten/tungsten-replicator/bin/replicator restart) and you will now be using SSL encryption.  If we now look back at the tcpdump outout, it looks like garbage as the following example shows. 

23:46:52.370904 IP (tos 0x0, ttl 64, id 5899, offset 0, flags [DF], proto TCP (6), length 105)
    logos2.44717 > Flags [P.], cksum 0xb0ef (incorrect -> 0xd834), seq 3087:3140, ack 4838, win 102, options [nop,nop,TS val 89938121 ecr 74992771], length 53
.\X..xL.....07..|..x.)...T..888.H/...iz...^.W8....'...... ..J

This is much better.  We are replicating to an Amazon RDS, and the transactions are safe from prying eyes.  If you have gotten this far you are ready to try your own applications.  

Benefits of Replication into Amazon RDS

Amazon RDS is convenient thanks to its quick and simple setup, but the lack of replication is a severe limitation in building systems that need more than a single MySQL instance.  In particular it makes it hard to integrate RDS into systems that consist of more than Amazon RDS itself.  Adding the ability to replicate in real-time into RDS therefore has a number of benefits.  The most obvious include using RDS to extend existing systems.

First, Amazon RDS can offer a quick way to add read capacity to existing MySQL applications.  This is especially useful if you have a cluster, such as Tungsten, which handles your transaction processing and overall HA.  You can now add Amazon RDS read slaves that you discard when no longer needed.  Tungsten Replicator has a number of other useful features like the ability to read from a group of nodes, not just one, that make such topologies easy to set up and maintain.  Clusters other than Tungsten will likewise benefit from this feature. 

Second, Amazon RDS is suitable for applications that do not need 24x7 high availability (limitations include slow failover, no online maintenance, no cross-cloud capabilities, etc.)  You can now pull data from other sources and send them to Amazon RDS slaves for processing, which amounts to extending overall processing capacity.  For example, you could use RDS to run back-office tasks using transactions replicated in from MySQL masters.  Tungsten Replicator also replicates data from Oracle, so this is an additional source of transactions.  

There are of course other ways to replicate data to and from RDS, for example using batch ETL tools like Talend.  However, these are not real-time and often require application changes to add timestamp columns or otherwise mark transactions that need to be extracted.  Log-based replication as implemented by Tungsten is fast and has minimal impact on applications or MySQL itself.  

Thoughts about Further Improvements for Amazon RDS Replication

On our side, i.e., at Continuent, we need to do more testing, add documentation, and fix problems as they arise.  We are starting a beta test with one of our customers in the next few days, who incidentally was the same customer who requested this feature in the first place after hacking it for themselves.  RDS also has some interesting provisioning capabilities that I would like to understand better.  We are also adding options that eliminate the need for manual configuration of security settings.  This will keep us busy for a while.

Other improvements depend on changes to RDS itself.  An obvious and huge improvement would be to permit replication out of Amazon RDS.  Unfortunately, Tungsten needs a login that has REPLICATION_SLAVE privilege so that we can download binlog data.  That privilege is not yet available to Amazon users.  Once it is, Tungsten extraction will also work in very short order. We actually don't need other commands like START/STOP SLAVE or FLUSH LOGS--just ability to issue a COM_BINLOG_EVENT from a client connection and receive binlog records.  I am sure other products would use this capability as well.  (RDS developers, if you are listening here's an easy way to extend your product usability significantly...)

Replication is such a valuable feature of MySQL that Amazon RDS feels somewhat crippled without it.  For this reason I would imagine that Amazon will open up additional capabilities in the future.  Until then we will polish up replication from MySQL masters to Amazon RDS slaves, awaiting a time when we can add more features.  

In the meantime, I hope you will try the new replication to Amazon RDS.  As noted in this article you can grab the latest builds and try it yourself.  Please report your experiences through either Continuent Support if you are a customer or the Tungsten discussion list if you use the open source Tungsten Replicator.  I look forward to your feedback and suggestions for making Amazon RDS support better. 

Scaling Databases Using Commodity Hardware and Shared-Nothing Design