Showing posts with label Drizzle. Show all posts
Showing posts with label Drizzle. Show all posts

Jan 1, 2013

Questions about MariaDB JDBC Driver

The recent release of the MariaDB client libraries has prompted questions about their purpose as well as provenance.  Colin Charles posted that some of these would be answered in the very near future.  I have a couple of specific questions about the MariaDB JDBC driver, which I hope will be addressed at that time.  

1.) What is really in the MariaDB JDBC driver and how exactly does it differ from the drizzle JDBC driver?  What, if any, relation is there to Connector/J code?  There is a JIRA project but it contains only four bugs, hence is not very informative.  The launchpad bzr history shows detailed check-ins but not overall intent. 

2.) Why relicense from BSD to LGPL?  I have checked the class headers and so far as attributions are concerned everything seems to be done quite properly.  However, the license change appears to prevent those of us currently using the drizzle JDBC driver from transferring code changes back to the drizzle driver.  If so, that seems a little unneighborly.  

Here is some background on the relationship between the drivers.  The MariaDB JDBC client is a fork of the BSD-licensed drizzle JDBC driver originally developed by Marcus Eriksson, who continues to maintain the code.  According to the bzr change history the code forked after rev 253, which was 24 April 2011.  There are still many similarities in the Java classes.  For instance, a number of classes in the org.mariadb.jdbc.internal.common package differ by little other than licensing headers and package names.  The MariaDB code is now up to rev 375 and includes substantial changes that appear to be designed to bring the MariaDB JDBC driver closer to the capabilities of the MySQL Connector/J driver.  

At Continuent we have a lively interest in the drizzle JDBC driver, as we adopted it for Tungsten Replicator some time ago.  The code had fewer bugs than Connector/J, which was attractive.  More importantly, Marcus kindly accepted a patch from my colleague Stephane Giron (working as a Continuent employee) that made it easy for us to send queries using binary data rather than the usual Unicode data required by the JDBC standard.  This fix allows Tungsten to replicate codesets and binary data correctly.  We have since contributed a few other patches.  Our modest contribution in part reflects the quality of the base code. 

While waiting for answers I would like to commend Marcus as well as other drizzle contributors for their work.  We are particularly indebted to Marcus for starting and continuing the drizzle JDBC project.  Tungsten Replicator users have applied many trillions of transactions using the drizzle driver.  If the MariaDB JDBC driver gains wide acceptance, the rest of the MySQL community owes Marcus Eriksson substantial thanks as well.  

Aug 5, 2012

Is Synchronous Data Replication over WAN Really a Viable Strategy?

Synchronous data replication over long distances has the sort of seductive appeal that often characterizes bad ideas.  Why wouldn't you want every local credit card transaction simultaneously stored on the other side of the planet far away from earthquake, storms and human foolishness?  The answer is simple: conventional SQL applications interact poorly with synchronous replication over wide area networks (WANs).

I spent a couple of years down the synchronous replication rabbit hole in an earlier Continuent product.  It was one of those experiences that make you a sadder but wiser person.  This article digs into some of the problems with synchronous replication and shows why another approach, asynchronous multi-master replication, is currently a better way to manage databases connected by long-haul networks.

Synchronous Replication between Sites

The most obvious problem with any form of synchronous replication is the hit on application performance.  Every commit requires a round-trip to transfer and acknowledge receipt at the remote site, which in turn reduces single-thread transaction rates to the number of pings per second between sites.  As a nice article by Aaron Brown demonstrates, you can show the effect easily using MySQL semi-synchronous replication between hosts in Amazon regions.  Aaron's experiment measured 11.5 transactions per second, or about 100 times less than single-thread performance on a local master between hosts with 85 millisecond latency.  At that rate you would theoretically expect transaction throughput of ~11.7 transactions per second (1000 / 85 = 11.7), so the agreement between practice and theory is very close.  It's great when science works out like this.

You might argue that applications could tolerate the slow rate assuming it were at least constant.  Sadly that's not the case for real systems.  Network response varies enormously between sites in ways that are quite easy to demonstrate.

To illustrate variability I set up an Amazon m1.small instance in the us-east-1 region (Virginia) and ran 24 hours of ping tests to instances in us-west-2 (Oregon) and ap-southeast-1 (Singapore).   As the following graph shows, during a 4 hour period ping times to Singapore remain within a band but vary up to 10%.  Ping times to Oregon on the other hand hover around 100ms but spike up randomly to almost 200ms.  During these times, synchronous replication throughput would be cut by 50% to approximately 5 transactions per second.

Amazon ping times from us-east-1 to us-west-2 and ap-southeast-1 (240 minute interval)
Moreover, it's not just a question of network traffic.  Remote VMs also become busy, which slows their response.  To demonstrate,  I ran two-minute sysbench CPU tests to saturate processing on the us-west-2 instance while observing the ping rate.  Here is the command:

sysbench --test=cpu --num-threads=10 --max-time=120 --cpu-max-prime=1000000 run

As the next graph illustrates, CPU load has a unpredictable but substantial effect on ping times.  As it happens, the ping variation in the previous graph may be due to resource contention on the underlying physical host.   (Or it might really be network traffic--you never really know with Amazon.)

Effect of sysbench runs on ping times to US-West (20 minute interval)
Slow response on resource-bound systems is a problem that is familiar to anyone with experience with distributed systems, including systems where everything is on a single LAN.  You cannot even count on clock ticks being delivered accurately within various types of virtual machines.  The timing delays are magnified in WANs, as they already have high latency to begin with.  Between busy hosts and network latency, it's reasonable to expect that at some point most systems would at least briefly experience single session transaction rates of 1 transaction per second or less.  Even with parallelized replication you would see substantial backups on the originating DBMS servers as commits begin to freeze.

To complete the tale of woe, failures of various kinds can cause remote hosts to stop responding at all for periods of time that vary from seconds to days.  Amazon is generally quite reliable but had two outages in the Virginia data center in June 2012 alone that brought down applications from hours to days.  If you replicate synchronously to a host affected by such an outage, your application just stops and you no longer store transactions at all, let alone securely.  You need to turn off synchronous replication completely to stay available.

So is synchronous replication really impossible between sites?  Given the problems I just described it would be silly to set up MySQL semi-synchronous replication between over WAN for a real application.  However, there are other ways to implement synchronous replication.  Let's look at two of them.

First, there is Galera, which uses a distributed protocol called certification-based replication to agree on commit order between all cluster nodes combined with execution of non-conflicting transactions in parallel.  Certification-based replication is a great algorithm in many ways, but Galera comes with some important practical limitations.  First it replicates rows rather than statements.  The row approach handles large transactions poorly, especially over distances, due to the large size of change sets.  Also, not all workloads parallelize well, since transactions that conflict in any way must be fully serialized.   Overall DBMS throughput may therefore reduce to the single-session throughput discussed above at unexpected times due to variations in workload.  Finally, full multi-master mode between sites (as opposed to master/slave) is likely to be very problematic as nodes drop out of the cluster due to transient communication failures and require expensive reprovisioning.  This is a general problem with group communications, which Galera depends on to order transactions.

Second, there are theoretical approaches that claim many of the benefits of synchronous replication without killing throughput or availability.  One example is the Calvin system developed by Daniel Abadi and others, which seeks to achieve both strong transaction consistency and high throughput when operating across sites.  The secret sauce in Calvin is that it radically changes the programming model to replicate what amount to transaction requests while forcing actual transaction processing to be under control of the Calvin transaction manager, which orders transaction order in advance across nodes.   That should at least in principle reduce some of the unpredictability you may see in systems like Galera that do not constrain transaction logic.  Unfortunately it also means a major rewrite for most existing applications.  Calvin is also quite speculative.  It will be some time before this approach is available for production systems and we can see whether it is widely applicable.

There's absolutely a place for synchronous replication in LANs, but given the current state of the art it's hard to see how most applications can use effectively it to link DBMS servers over WAN links.  In fact, the main issue with synchronous replication is the unpredictability it introduces into applications that must work with slow and unreliable networks.  This is one of the biggest lessons I have learned at Continuent.

The Alternative:  Asynchronous Multi-Master Replication

So what are the alternatives?  If you need to build applications that are available 24x7 with high throughput and rarely, if ever, lose data, you should consider high-speed local clusters linked by asynchronous multi-master replication between sites.  Here is a typical architecture, which is incidentally a standard design pattern for Tungsten.

Local clusters linked by asynchronous, multi-master replication
The big contrast between synchronous and and asynchronous replication between sites is that while both have downsides, you can minimize asynchronous multi-master problems using techniques that work now.  Let's look at how async multi-master meets requirements and the possible optimizations.
  1. Performance. Asynchronous replication solves WAN performance problem as completely as possible.  To the extent that you use synchronous or near-synchronous replication technologies it is on local area networks, which are extremely fast and reliable, so application blocking is minimal.  Meanwhile, long-haul replication can be improved by compression as well as parallelization, because WANs offer good bandwidth even if there is high end-to-end latency. 
  2. Data loss.  Speedy local replication, including synchronous and "near synchronous" methods, minimizes of data loss due to storage failures and configuration errors.   Somewhat surprisingly you do not need fully synchronous replication for most systems even at the local level--that's a topic for a future blog article--but replication does need to be quite fast to ensure local replicas are up-to-date.  Actually, one of the big issues for avoiding local data loss is to configure systems carefully (think sync_binlog=1 for MySQL, for example).  
  3. Availability.  Async multi-master systems have the delightful property that anything interrupts transaction flow between sites, replication just stops and then resumes when the problem is corrected.  There's no failover and no moving parts.  This is a major strength of the multi-master model.  
So what are the downsides?  Nothing comes for free, and there are at least two obvious issues.
  1. Applicability.  Not every application is compatible with asychronous multi-master.  You will need to do work on most existing applications to implement multi-master and ensure you got it right.  I touched on some of the MySQL issues in an earlier article.  If multi-master is not a possibility, you may need the other main approach to cross-site replication:  a system-of-record design where applications update data on a single active site at any given time with other sites present for disaster recovery.  (Tungsten also does this quite well, I should add.) 
  2. Data access.  While you might not lose data it's also quite likely you might not be able to access it for a while either.  It's rare to lose a site completely but not uncommon for sites to be inaccessible for hours to days.  The nice thing is that with a properly constructed multi-master application you will at least know that the data will materialize on all sites once the problem is solved.  Meanwhile, relax and work on something else until the unavailable site comes back. 
In the MySQL community local clusters linked by asynchronous multi-master are an increasingly common architecture for credit card payment gateways, which I mentioned at the beginning of this article.  This is a telling point in favor of asynchronous cross-site replication, as credit card processors have a low tolerance for lost data.

Also, a great deal of current innovation in distributed data management is headed in the direction of asynchronous mechanisms.  NoSQL systems (such as Cassandra) tend to use asynchronous replication between sites.  There is interesting research afoot, for example in Joe Hellerstein's group at UC Berkeley, to make asynchronous replication more efficient by accurately inferring cases where no synchronization is necessary.  Like other research, this work is quite speculative, but the foundations are in use in operational systems today.  

For now the challenge is to make the same mechanisms that NoSQL systems have jumped on work equally well for relational databases like MySQL.  We have been working on this problem for the last couple of years at Continuent.  I am confident we are well on the way to solutions that are as good as the best NoSQL offerings for distributed data management.  

Apr 1, 2012

Disproving the CAP Theorem

Since the famous conjecture by Eric Brewer and proof by Nancy Lynch et al., CAP has given the world countless learned discussions about distributed systems and many a well-funded start-up.  Yet who truly understands what CAP means?  Even a cursory survey of the blogosphere shows profound disagreement about the meaning of terms like CP, AP, and CA in real systems.  Those who disagree on CAP include some of the most illustrious personages of the database community.

We can therefore state with some confidence that CAP is confusing. Yet this observation itself raises deeper questions.  Is CAP merely confusing?  Or is it the case that as with other initially accepted but now doubtful ideas like the Copernican model, evolution, and continental drift, that CAP is actually not correct?  Thoughtful readers will agree this question has not received anywhere near the level of scientific scrutiny it deserves.

Fortunately for science private citizens like me have been forging ahead without regard to the opinions of so-called experts or even common sense.  My work on CAP relies on two trusted analytic tools of database engineers over the legal drinking age: formal logic and beer.  Given the nature of the problem we should obviously use a minimum of the former and a maximum of the latter.  We have established that CAP is confusing.  To understand why we must now deepen our confusion and study its habits carefully.  Other investigators have used this approach with great success.

Let us begin by translating the terms of CAP into the propositional calculus.  The terms C (consistency), A (availability) and P (partition tolerance) can be used to state the famous "two out of three" of CAP using logical implication as shown below.

(1) A and P => not C
(2) P and C => not A
(3) C and A => not P

So far so good.  We can now dispense briefly with logic and turn to confusion.  It seems there is difficulty distinguishing the difference between CA and CP systems, i.e., that they are therefore equivalent. This is a key insight, which we can express formally as follows:

(4) C and A <=> C and P

which further reduces to

(5) A <=> P

In short our confusion has led us directly to the invaluable result that A and P, hence availability and partition tolerance, are exactly equivalent!  I am sure you share my excitement at the direction this work is taking.  We can now through a trivial substitution of A for P in equation 2 above reveal the following:

(6) A and C => not A
(7) C => (A => not A)

We have just shown that consistency implies that any system that is available is also unavailable simultaneously.  This is an obvious contradiction, which means the vast logical edifice on which CAP relies crumbles like a soggy nacho.  Considering the amount of beer consumed at the average database conference it is surprising nobody thought of this before.

At this point we can now raise the conversation up a level from looking for spare change under the table and comment on the greater meaning of our results in the real world.  Which is the following: Given the way most of us programmers write software it's a wonder CAP is an issue at all. Honestly, I can't even get calendar programs to send invitations to each other across time zones.  I plan to bring the combustible analytic capabilities of logic and beer to bear on the mystery of time at a later date.  For now we can just speculate it is due to a mistaken design based on CAP.

Aug 18, 2011

So Where's the Fall MySQL Community Conference?

Last week Percona announced plans to sponsor the Percona MySQL Conference in Santa Clara in April 2012.  It is meant to replace the O'Reilly conferences of previous years.  The announcement led to some reasonable questions, for example from Giuseppe Maxia.  These and other online posts initiated a thoughtful exchange of views about the pros and cons of Percona's conference announcement by various members of the MySQL community.

Not everyone agrees with what Percona is doing.  However, you have to give Peter, Baron, and others at Percona credit for taking the risk to organize a replacement conference.  It's a big financial commitment to rent space in Santa Clara.  The conference will be a huge amount of work, much of it quite thankless.  Even so I imagine it might be tempting for some people to try to organize an anti-conference just to spite Percona, much as Percona did at the MySQL Conference in 2009.  This would be a mistake.

Here is a better suggestion.  Why not put the energy into organizing a Fall 2012 MySQL Community Conference in Europe?   Make it somewhere pleasant like Barcelona or Verona and everybody will want to go.  Two solid conferences in different locations at opposite ends of the calendar would benefit the MySQL community at all levels and make it possible for more people to attend.  For those of you who want to grow the community, it's time to stand up and be counted.  Let's get another conference off the ground.

Mar 22, 2011

Parallel Replication Using Shards Is the Only Workable Approach for SQL

There have been a couple of recent blog articles (here and here) asking for parallel replication based on something other than schemas.  These articles both focus on the problem of parallelizing updates within a single MySQL schema.  I read these with great interest, not least because they both mentioned Tungsten (thanks!) and also found that our schema-based parallelization approach is too limited.  It is therefore worth a short article explaining exactly what the Tungsten approach is and why we chose it.

First of all, Tungsten does not exactly use schema-based parallel replication.  Tungsten is actually based on what I call the serialized shard model of replication.  We assign global transaction IDs to all transactions, which means that for any particular set of transactions we can always figure out the correct serialization and apply in the right order.  This is true even if the transactions travel across independent replication paths or if we have master failover.

Second, we assign a shard ID to all transactions.  Shards are independent streams of transactions that execute correctly when applied by themselves in serial order.  Shards are typically independent, which means transactions in different shards can execute in parallel without deadlocking or corrupting data.  This is the case when each shard contains data for a single customer in a multi-tenant application.  We also have a notion of "critical shards," which are shards that contain global data, such as shared currency rate tables.  Updates in critical shards cause full serialization across all shards.  

You can define shards in a variety of ways, but as a practical matter identifying durable shards inside individual MySQL schemas is hard for most applications, especially if there are constraints between tables or you have large transactions.   Many SQL applications tend to make most of their updates to a small number of very large tables, which makes finding stable dividing lines even more difficult.  Schemas are therefore a natural unit of sharding, and Tungsten uses these by default.

Schema-based sharding seems pretty limiting, but for current SQL databases it is really the only approach that works.  Here are some important reasons that give you a flavor of the issues.

* Restart.  To handle failures you need to mark the exact restart point on each replication apply thread or you will either repeat or miss transactions.  This requires precise and repeatable serialization on each thread, which you get with the serialized shard model.

* Deadlocks.  If there are conflicts between updates you will quickly hit deadlocks.  This is especially true because one of the biggest single thread replication optimizations is block commit, where you commit dozens of success transactions at once--it can raise throughput by 100% in some cases.  Deadlocks on the other hand can reduce effective throughput to zero in pathological cases.   Shard-based execution avoids deadlocks.

* Ordering.  SQL gives you a lot of ways to shoot yourself in the foot through bad transaction ordering.  You can't write to a table before creating it.  You can't delete a row before it is inserted.  Violating these rules does not just lead to invalid data but also causes errors that stop replication.  The workarounds are either unreliable and slow (conflict resolution) or impractical for most applications (make everything an insert).  To avoid this you need to observe serialization very carefully.

* Throughput.  SQL transactions in real systems vary tremendously in duration, which tends to result in individual long transactions blocking simpler parallelization schemes that use in-memory distribution of updates.  In the Tungsten model we can solve this by letting shard progress vary (by hours potentially), something that is only possible with a well-defined serialization model that deals with dependencies between parallel update streams.  I don't know of another approach that deals with this problem.

If you mess up the solution to any of the foregoing problems, chances are good you will irreparably corrupt data, which leads to replication going completely off the rails.  Then you reprovision your slave(s).  The databases that most need parallel replication are very large, so this is a multi-hour or even multi-day process.  It makes for unpleasant calls with customers when you tell them they need to do this.

I don't spend a lot of time worrying that Tungsten parallel replication is not well suited to the single big schema problem.  So far, the only ways I can think of making it work scalably require major changes to the DBMS or the applications that use it.  In many cases your least costly alternative may be to use SSDs to boost slave I/O performance.

My concerns about Tungsten's model lie in a different area.  The serialized shard model is theoretically sound--it has essentially the same semantics as causally dependent messaging in distributed systems.  However, if we fail to identify shards correctly (and don't know we failed) we will have crashes and corrupt data.  I want Tungsten either to work properly or tell users it won't work and degrade gracefully to full serialization.  If we can't do one of these two for every conceivable sequence of transactions that's a serious problem.

So, to get back to my original point, serialized shards are the best model for parallel replication in SQL databases as we find them today.  I suspect if you look at some of the other incipient designs for parallel replication on MySQL you will find that they follow this model in the end if not at first.  I would think in fact that the next step is to add MySQL features that make sharded replication more effective.  The drizzle team seems to be thinking along these lines already.

Jan 30, 2011

Virtual IP Addresses and Their Discontents for Database Availability

Virtual IP addresses or VIPs are commonly used to enable database high availability.   A standard failover design uses an active/passive DBMS server pair connected by replication and watched by a cluster manager.  The active database listens on a virtual IP address; applications use it for database connections instead of the normal host IP address. Should the active database fail, the cluster manager promotes the passive database server and shifts the floating IP address to the newly promoted host.  Application connections break and then reconnect to the VIP again, which points them to the new database.
VIP-Based Database Availability Design
Virtual IP addresses are enticing because they are completely transparent to applications--no changes to database API behavior, no changes to connection strings, etc.  While virtual IP addresses seem simple, they depend on arcane TCP/IP behavior that is not especially well understood and not always consistent across different TCP/IP implementations.  Unless properly managed, virtual IP addresses can induce problems that run the gamut from simple lack of availability to split-brain, which can in turn lead to unrepairable data corruption. 

This blog article describes in some detail how virtual IP addresses work, specifically the behavior of Address Resolution Protocol (ARP) which is a core part of the TCP/IP stack that maps IP numbers to hardware addresses and permits VIPs to move.  We will then dig into how split-brain arises as a consequence of ARP behavior.  Finally we will look at ways to avoid split-brain or at least make it less dangerous when it occurs.

Note: the examples below use MySQL, in part because mysqld has a convenient way to show the server host name.  However you can set up the same test scenarios with PostgreSQL or most other DBMS implementations for that matter. 

What is a Virtual IP Address? 

Network interface cards (NICs) typically bind to a single IP address in TCP/IP networks.   However, you can also tell the NIC to listen on extra addresses using the Linux ifconfig command.  Such addresses are called virtual IP addresses or VIPs for short.

Let's say for example you have an existing interface eth0 listening on port 192.168.128.114.  Here's the configuration of that interface as shown by the ifconfig command:
saturn# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 08:00:27:ce:5f:8e  
          inet addr:192.168.128.114  Bcast:192.168.128.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fece:5f8e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:10057681 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8902384 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:6941125495 (6.9 GB)  TX bytes:6305062533 (6.3 GB)
You can allow the eth0 interface to accept traffic for another address 192.168.128.130 using the following simple command. 
saturn# ifconfig eth0:0 192.168.128.130 up
This command tells the TCP/IP stack to accept packets directed to IP address 192.168.128.130 as well as the original address 192.168.128.114.   This means that if you are running a MySQL server on the host users can connect using either of the following commands:
mysql -utungsten -psecret -h192.168.128.114 (or)
mysql -utungsten -psecret -h192.168.128.130
Finally, you can move VIPs from host to host very easily by dropping them on the first host and binding to them on a second host, as shown in the following example:
saturn# ifconfig eth0:0 192.168.128.130 down
...
neptune# ifconfig eth0:0 192.168.128.130 up
Meanwhile, virtual IP addresses behave in every other way like standard IP addresses.  You can put them in DNS, applications can connect to them, etc.  VIP-based high availability is therefore a minimally invasive implementation for most applications--about the only requirement is that applications need to reconnect if their connection breaks.

How Virtual IP Addresses Work

To understand the weaknesses of virtual IP addresses for database high availability, it helps to understand exactly how the TCP/IP stack actually routes messages between IP addresses, including those that correspond to VIPs.   The following diagram shows moving parts of the TCP/IP stack in a typical active/passive database set-up with a single client host.  In this diagram host saturn currently has virtual IP address 192.168.128.130. Neptune is the other database host.  Mercury is the client host.


Applications direct TCP/IP packets using IP addresses, which in IPV4 are four byte numbers.  The IP destination address is written into the header by the IP layer of the sending host and read by the IP layer on the receiving host.

However, this is not enough to get packets from one host to another.  At the hardware level within a single LAN, network interfaces use MAC addresses to route messages over physical connections like Ethernet.   MAC addresses are 48-bit addresses that are typically burned into the NIC firmware or set in a configuration file if you are running a virtual machine.  To forward a packet from host mercury to saturn, the link layer writes in the proper MAC address, in this case 08:00:27:ce:5f:8e.  The link layer on host saturn accepts this packet, since it corresponds to its MAC address.  It forwards the packet up into the IP layer for acceptance and further processing.

Yet how does host mercury figure out which MAC address to use for particular IP addresses?  This is the job of the ARP cache, which maintains an up-to-date mapping between IP and MAC addresses.  You can view the ARP cache contents using the arp command, as shown in the following example.
mercury# arp -an
? (192.168.128.41) at 00:25:00:44:f3:ce [ether] on eth0
? (192.168.128.1) at 00:0f:cc:74:64:5c [ether] on eth0
Note that the ARP cache does not have a mapping for the VIP address 192.168.128.130.  Let's say we now make a client connection to the MySQL server at the other end of the VIP address on mercury:
# mysql -utungsten -psecret -h192.168.128.130
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 33826
...
mysql>
To route traffic, host mercury gets the IP address to MAC address mapping using an ARP request.  You can watch this happen in real time using the tcpdump command to track ARP traffic.
mercury# tcpdump -n -i eth0 arp 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
09:37:43.755081 ARP, Request who-has 192.168.128.130 tell 192.168.128.110, length 28
09:37:43.755360 ARP, Reply 192.168.128.130 is-at 08:00:27:ce:5f:8e, length 46
Now if you look at the ARP cache again on host mercury, you will see a proper mapping for the VIP in mercury's ARP cache:
# arp -an
? (192.168.128.130) at 08:00:27:ce:5f:8e [ether] on eth0
? (192.168.128.41) at 00:25:00:44:f3:ce [ether] on eth0
? (192.168.128.1) at 00:0f:cc:74:64:5c [ether] on eth0
Now if you go back and look at the picture (or still recall the details), you will notice that the MAC address maps to the NIC on host saturn.  This is exactly what we expect since saturn is listening on the corresponding virtual IP address 192.168.128.130. 

Virtual IP Addresses and Split-Brain

Most real problems with VIPs appear when you try to move them.  The reason is simple:  TCP/IP does not stop you from having multiple hosts listening on the same virtual IP address.  For instance, let's say you issue the following command on host neptune without first dropping the virtual IP address on saturn.  
neptune# ifconfig eth0:0 192.168.128.130 up
Let's further clear the ARP mapping for the virtual IP on mercury using the handy arp -d command and reconnect to MySQL.
mercury # arp -d 192.168.128.130
root@logos1:~# mysql -utungsten -psecret -h192.168.128.130
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 294
...
mysql>
So far so good--we logged into MySQL and everything appears normal.  But in fact it is not normal at all.  If you run tcpdump and watch the ARP requests during this login, you see the following:
# tcpdump -n -i eth0 arp 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
09:59:32.643518 ARP, Request who-has 192.168.128.130 tell 192.168.128.110, length 28
09:59:32.643768 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
09:59:32.643793 ARP, Reply 192.168.128.130 is-at 08:00:27:ce:5f:8e, length 46 
This is not not just bad--it's very bad.  Both saturn and neptune responded to mercury's ARP request!  Mercury can pick only one for the mapping; which one it picks depends on timing as well as the exact implementation of the TCP/IP stack.  In other words we have a race condition and the winner is essentially random.

You can demonstrate the randomness for yourself with a simple experiment.  Let's create a test script named mysql-arp-flush.sh, which clears the ARP cache entry for the VIP and then connects to MySQL, all in a loop.  
#!/bin/bash
for i in {1..5}; 
do 
  arp -d 192.168.128.130
  sleep 1
  mysql -utungsten -psecret -h192.168.128.130 -N \
    -e "show variables like 'host%'"
done
If you run the script you'll see results like the following.  Note the random switch to Neptune on the fourth connection.  
# ./mysql-arp-flush.sh
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | neptune |
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
At this point you have successfully created a split-brain.  If you use database replication and both databases are open for writes, as would be the default case with MySQL replication, Tungsten, or any of the PostgreSQL replication solutions like Londiste, your applications will randomly connect to each DBMS server.  Your data will quickly become irreparably mixed up.  All you can do is hope that the problem will be discovered quickly. 

A Half-Hearted Solution using Gratuitous ARP

You might think that it would be handy if the ARP protocol provided a way to get around split-brain problems by invalidating client host ARP caches.  In fact, there is such a feature in ARP--it's called gratuitous ARP.  While useful, it is not a solution for split-brain issues.  Let's look closely to see why. 

Gratuitous ARP works by sending an unsolicited ARP response to let hosts on the LAN know that an IP address mapping has changed.  On Linux systems you can issue the arping command as shown below to generate a gratuitous ARP response. 
neptune# arping -q -c 3 -A -I eth0 192.168.128.130
This tells host neptune to send 3 ARP reply messages with its MAC address for the VIP address.  If we look at tcpdump output again, we see the following:
# tcpdump -n -i eth0 arp 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
11:02:27.154279 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
11:02:28.159291 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
11:02:29.162403 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
Linux hosts that receive the response will generally then update their ARP caches, though as we will see, there are some important exceptions.   But first, we need to show the effect of gratuitous ARP on MySQL connections.  Let's start with the following ARP cache contents on host mercury.  This shows an existing mapping to the MAC address of host neptune, which is what we expect from the previous arping command on neptune. 
mercury# arp -an
? (192.168.128.130) at 08:00:27:68:cd:7d [ether] on eth0
? (192.168.128.41) at 00:25:00:44:f3:ce [ether] on eth0
? (192.168.128.1) at 00:0f:cc:74:64:5c [ether] on eth0
Next, we run a loop that connects to MySQL and prints the host name every second.  The loop code is shown below and stored in a script named mysql-no-arp-flush.sh.  Unlike the previous script this does not release the ARP cache mapping between connections to MySQL.
#!/bin/bash
for i in {1..30}; 
do 
  sleep 1
  mysql -utungsten -psecret -h192.168.128.130 -N \
    -e "show variables like 'host%'"
done
While the test script is running is running, we run an arping command from saturn. 
saturn# arping -q -c 3 -A -I eth0 192.168.128.130
What we see in the MySQL output is the following.  Once the gratuitous ARP is received, mercury switches its connection from neptune to saturn and stays there, at least for the time being.
mercury# ./mysql-no-arp-flush.sh
+----------+---------+
| hostname | neptune | 
+----------+---------+
+----------+---------+
| hostname | neptune | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
There is one more interesting property of gratuitous ARP responses.  If you issue one during a session, it will cause client sessions to switch between hosts without waiting for a timeout.  Here's an example.  First login with MySQL and see which host we are on.
root@logos1:~# mysql -utungsten -psecret -h192.168.128.130
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 33853
mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | neptune | 
+---------------+---------+
1 row in set (0.00 sec)
Now issue an arping command on saturn using a separate window.
saturn# arping -q -c 3 -A -I eth0 192.168.128.130
Finally, go back and check the host name again in the MySQL session.   The session switches over to the other server, which you see at the client level as a lost connection followed by a reconnect.
mysql> show variables like 'hostname'; 
ERROR 2006 (HY000): MySQL server has gone away No connection. Trying to reconnect... Connection id:   33854 
Current database: *** NONE ***  
+---------------+--------+ 
| Variable_name | Value  | 
+---------------+--------+ 
| hostname      | saturn |  
+---------------+--------+ 
1 row in set (0.00 sec)
So is gratuitous ARP the solution to virtual IP split-brain?  It announces that there is a mapping change, which can make failover work much more quickly.  This is useful in its own right.  However, it does not solve split-brain. 

First, not all TCP/IP stacks even recognize gratuitous ARP responses.  Second, gratuitous ARP only takes effect on hosts that actually have a current mapping in their ARP cache.  Other hosts will  wait until they actually need the mapping and then issue a new ARP request.  Finally, ARP mappings automatically time out after a few minutes.   In that case the host will issue a new ARP request, which as in the two preceding cases brings us right back to the split-brain scenario we were trying to cure.

Avoiding Virtual IP Split -Brains

Avoiding a VIP-induced split-brain not a simple problem.  The best approach is combination of sound cluster management, amelioration, and paranoia. 

Proper cluster management is the first line of defense.  VIPs are an example of a unique resource in the system that only one host may hold at a time.   An old saying that has been attributed to everyone from Genghis Khan to Larry Ellison sums up the problem succinctly:
It is not enough to succeed.  All others must fail.  
The standard technique to implement this policy is called STONITH, which stands for "Shoot the other node in the head."  Basically it means that before one node acquires the virtual IP address the cluster manager must make every effort to ensure no other node has it, using violent means if necessary.   Moving the VIP thus has the following steps.
  1. The cluster manager executes a procedure to drop the VIP on all other hosts (for example using ssh or by cutting off power).  Once these procedures are complete, the cluster manager executes a command to assign the VIP to the new owner. 
  2. Isolated nodes automatically release their VIP.  "Isolated" is usually defined as being cut off from the cluster manager and unable to access certain agreed-upon network resources such as routers or public DNS servers. 
  3. In cases of doubt, everybody stops.  For most systems it is far better to be unavailable than to mix up data randomly.  
Cluster managers like Tungsten and Pacemaker handle this kind of process very well.   PaceMaker for example has a number of specialized hooks to cut power or otherwise use extreme violence to ensure proper fencing of databases.  Tungsten has fewer such hooks but has a much richer set of operations for databases and also has a wide set of connectivity options for HA besides using VIPs.

Incidentally, you want to be very wary about re-inventing the wheel, especially when it comes to DBMS clustering and high availability.  Clustering has a lot of non-obvious corner cases; even the "easy" problems like planned failover are quite hard to implement correctly.  You are almost always better off using something that already exists instead of trying to roll your own solution.

Amelioration is the next line of defense, namely to make split-brain situations less dangerous when they actually occur.  Failover using shared disks or non-writable slaves (e.g., with DRBD or PostgreSQL streaming replication) have a degree of protection because it is somewhat harder to have multiple databases open for writes.  However, it is definitely possible and the cluster manager is your best bet to prevent this.  However, when using MySQL with either native or Tungsten replication, databases are open and therefore susceptible to data corruption, unless you ensure slaves are not writable.

Fortunately, this is very easy to do in MySQL.   To make a database readonly to all accounts other than those with SUPER privilege, just issue the following commands to make the server readonly and ensure the setting is in effect:
neptune# mysql -uroot -e "set global read_only=1"
neptune# mysql -uroot -e "show variables like 'read_only'"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | ON    | 
+---------------+-------+
This protects you not just in cases of actual failover but also from administrative mistakes or software failures that switch the VIP by accident.   Many cluster managers implement read-only slaves.   Tungsten clustering has explicit support for read-only slaves.  Other cluster managers like MMM and Pacemaker can do the same.

Lastly paranoia is always a good thing.  You should test the daylights out of clusters that depend on VIPs before deployment, and also check regularly afterwards to ensure there are no unexpected writes on slaves.  Regular checks of logs are a good idea.  Another good way to check for problems in MySQL master/slave setups is to run consistency checks.   Tungsten Replicator has built-in consistency checking designed for exactly this purpose.  You can also run Maatkit mk-table-checksum at regular intervals.   Another best practice is to "flip" masters and slaves on a regular basis to ensure switch and failover procedures work properly.   Don't avoid trouble--look for it!

Conclusion and Note on Sources

Virtual IP addresses are a convenient way to set up database high availability but can lead to very severe split-brain situations if used incorrectly.   To deploy virtual IP addresses without problems you must first of all understand how they work and second use a sound cluster management approach that avoids split-brain and minimizes the impact if it does occur.  As with all problems of this kind you need to test the implementation thoroughly before deployment as well as regularly during operations.  This will help avoid nasty surprises and corrupt data that are otherwise all but inevitable.

Finally it is worth talking a bit about sources.  I wrote this article because I could not find a single location that explained virtual IP addresses in a way that drew out the consequences of their behavior for database failover.  That said, there are a couple of good general sources for information on Internet tools and high availability:
Beyond that you can look at general networking sources like Radia Perlman's Interconnections, Second Edition or Internetworking with TCP/IP by Douglas Comer.  These are more high-level.  If you get really desperate for details, try the RFCs, for example RFC-826, which is the original specification for ARP.  Some of them are surprisingly good reads even 30 years after the fact.

Jan 25, 2011

Tungsten Replicator Overview Webinar

On Thursday January 27th at 10am PST I will doing a webinar on Tungsten Replicator together with my colleague Giuseppe Maxia.  The title is "What MySQL Replication Cannot Do.  And How to Get Around It."  Basically it is a nuts and bolts description of Tungsten Replicator capabilities like multi-master replication, failover, parallel apply, and using replication for zero-downtime upgrade.  If you have ever wanted an in-depth look at the Tungsten Replicator this is a good opportunity. 

During 2010 we implemented an amazing number of new replication features ranging from pipelines early in the year to fast disk logs, multiple replication services per process, bi-directional replication, and parallel apply by the end.  We will be building out all of these in the coming year and releasing increasingly capable features into open source as well. 

This presentation is part of Continuent's regular webinar series which means we will also talk a bit about commercial products and services at the end.  However, it's mostly cool replication stuff.   You can sign up on the Continuent webinar page.  Hope to see you there.

Dec 12, 2010

Interested in Sponsoring Tungsten Open Source Features?

Over the last few months I have been pleasantly surprised by the number of people using open source builds of Tungsten.  My company, Continuent, has therefore started to offer support for open source users and will likely expand these services to meet demand.

There have also been a number of requests to add specific features to open source builds, especially for replication. We have added a few already but are now considering pushing even more features into open source if we can find sponsors.  These add to a number of great features already in open source like global transaction IDs, MySQL 5.0/5.1, basic drizzle replication, transaction filtering, and many others. 

Do you have special replication or clustering features you would like to see added to Tungsten? Specialized MySQL to PostgreSQL replication?  Management and monitoring commands?  Cool parallel replication problems?  High-performance logging?  Weird multi-master topologies?  Talk to us about sponsoring new open source features.  We're happy to do projects that solve interesting problems, benefit the open source databases community, and help grow Tungsten as a product. 

Visit the Continuent website or send email directly to robert dot hodges at continuent dot com.  

Nov 7, 2010

It's All about the Team

Earlier this week Giuseppe Maxia blogged about joining Continuent as Director of QA.  Creating high quality systems for distributed data management is a hard but fascinating problem.  I have been hooked on it myself for many years.  Guiseppe brings the creativity as well as humor our team needs to nail this problem completely.  I'm therefore delighted to know he will be focused on it.

That said, I'm even happier for another reason.  Beyond solving any single problem, Giuseppe strengthens an already strong team.  Ed Catmull of Pixar gave a great speech a few years ago about managing creative teams and why successful companies eventually fail.  Among other things he asked the question whether it is the idea or the people who implement it that count most.  His conclusion:  great teams implement good ideas to build great products.  But even more important, great teams can turn bad ideas into good ones, then go on to build great products from those ideas too.  Pixar has proved this many times over.

I believe strongly in the power of great teams to create great products.  Giuseppe, welcome to the team. 

Oct 24, 2010

Parallel Replication on MySQL: Report from the Trenches

Single-threaded apply is one of the big downsides of MySQL's built-in replication, as Baron Schwartz pointed out a couple of days ago.  While a master can process dozens of updates at once, slaves must apply them one after the other on a single thread.  Add in disk I/O, and the result is very slow performance indeed.  The obvious answer is parallel apply, namely writing multiple non-conflicting updates to the slave at once.

I have spent the last few months implementing parallel apply for Tungsten 2.0, which we are now testing at customer sites.  In this article I would like to describe how Tungsten's parallel apply works as well as some of the lessons that have become apparent through the implementation.

There are a couple of big challenges in parallel apply.  There is of course the practical problem of separating transactions into parallel streams, for example splitting them by database.  This is known as sharding.   Row updates are easy enough but MySQL also has statement replication.  Transactions with statements require parsing, and there are ambiguous cases.  If that's not enough, features like LOAD DATA INFILE have a complex implementation in the binlog and require specialized logic to shard correctly.  In addition, parallel apply of any kind has a lot of corner cases that you have to solve completely or risk unpredictable failures.  Here's an example:  skipping transactions on the slave.  You have to wait for the event, but what if some of the threads are already past it when you ask to skip?  How do you synchronize access to the list of transactions to skip without creating a choke point for threads?  

The next challenge is performance.  Parallel replication offers a rich choice of ways to lower throughput, not raise it.  Multiple disk logs are the best I have found so far, as they can convert sequential reads and writes on the disk log to random I/O when more replication threads contend for different parts of the disk.  Implementing multiple queues in memory is far faster and simpler but limits the queue sizes.  Another excellent way to slow things down is to try to parallelize SQL transactions with a lot of dependencies, which means you end up effectively serialized *and* paying the extra cost of parsing transactions and synchronizing threads.  In this case it can be better to keep everything sequential but use block commit to apply 50 or 100 transactions simultaneously on the slave.

With all that said, the parallel apply problem is still quite tractable, but you need to pick your battles carefully.  Tungsten's parallel apply implementation has a very clear problem focus:  speeding up slave updates for multi-tenant applications that have a high degree of natural partitioning and concurrent updates across customers.  This is not as limiting as it might sound to readers unfamiliar with MySQL.  SaaS applications for the most part follow the multi-tenant model on MySQL, with each customer assigned to a particular database.  So do large ISPs or cloud providers that host customers on shared servers using separate databases.

Tungsten parallel apply is based on automatic sharding of transactions.   The following diagram shows the parallel apply algorithm conceptually. 
Tungsten Parallel Apply
Tungsten has a flexible architecture based on replication pipelines, described in a previous article on this blog.  To recap the model, pipelines are divided into stages, which represent processing steps.  Each stage consists of an extract-filter-apply loop with symmetric interfaces and identical processing logic for each stage.  The parallel apply implementation builds on replication pipelines as follows:
  1. A new filter called EventMetadataFilter automatically parses incoming transactions to figure out which database(s) they affect.  This is simple for row updates but involves parsing for statements and specialized extract handling for odd-ball operations like LOAD DATA INFILE. 
  2. The shard ID is assigned from the database name. This is glommed into the EventMetadataFilter but will shortly be broken out into a separate filter so that it is possible to support alternate shard assignment algorithms. 
  3. There is a new kind of in-memory buffer between stages called a ParallelQueue that supports multiple queues that feed the final apply stage.   Stages have a corresponding extension to allow them to have multiple threads, which must match the number of parallel queues or you get an error. 
  4. The ParallelQueue implementation calls a new component called a Partitioner to assign transactions a partition number (i.e., a parallel queue).  You can substitute different algorithms by providing different partitioner implementations.  The default implementation uses a configuration file called shard.list to map shards to queues.  Unless you say otherwise it hashes on the shard ID to make this assignment.
Extensions #1 and #2 run on the master, while #3 and #4 run on the slave.  I really like diagrams, so here is a picture of the fully implemented parallel apply architecture.  The master replicator extracts, assigns the shard, and logs each transaction.  The slave replicator fetches transactions, logs them locally, then applies in parallel.
Full Master/Slave Architecture for Parallel Apply
So how does this work?  Pretty well actually.  Local lab tests indicate that parellel apply roughly doubles throughput on a multi-database TPC-B benchmark we use for testing.   We should be able to publish some real-world performance numbers in the near future, but so far things look quite promising.  During the implementation a number of interesting issues have arisen, which I would like to discuss now.

The first issue is the ratio between parallel apply threads and shards.  While it might seem obvious to have a thread per shard, in real deployments the situation is not so clear.  For one thing actual deployments in SaaS and ISP situations often have hundreds or even thousands of databases, which has a number of practical consequences for implementation.  Less obviously, spreading transactions thinly across a lot of queues means fewer opportunities to use block commit, hence more work for slave servers and less overall throughput.  Performance optimization is a very uncertain matter, so Tungsten lets users configure the ratio.

Dependencies between shards are yet another issue.  While I mentioned that Tungsten is designed for applications with "a high degree of natural partitioning," dependencies between databases as well as individual transactions do occur and cannot be ignored.  For example, many SaaS applications have reference data that are used by all customer databases.  Even if parallel SQL works here, applications may get sick from seeing updates appear in the wrong order.  Or you could have global operations like CREATE USER that affect all databases.  Or you might not be able to tell which shard a piece of SQL belongs to.  Tungsten allows users to declare reference databases and automatically serializes these databases as well as global or "don't know" cases. 

There are also numerous issues around startup and shutdown.  Remember how MySQL replication slaves will not restart after unclean shutdown with open temp tables?  (If not, take a quick break and read this now.  You'll thank me later.)  Parallel apply introduces similar issues, because you have multiple threads all updating different positions in the database.  Tungsten handles crash recovery by tracking the apply position of each queue in InnoDB and then recommencing from that point on restart in each queue.  I am putting finishing touches on clean shutdown, which ensures that all queues are empty, much like automatically checking that temp tables are closed on MySQL.  


In short, over the last few months Tungsten has climbed a fair distance up a pretty big hill to get parallel apply to work.  The flexibility of the replicator architecture, particularly pipelines, has been very helpful as it is quite easy to extend.  The parallelization algorithm builds on terrific work by other colleagues at Continuent, especially Stephane Giron and Linas Virbalas.  They have both put enormous effort into building up MySQL and PostgreSQL replication capabilities.

Here are a couple of parting thoughts about parallelization based on the experience so far.

Thought number one:  parallel replication is not magic.  To use parallel apply effectively, applications need to play nice:  mostly short transactions and not too many dependencies between shards are the biggest requirements to see a substantial boost in throughput.  For example, if you let one user write 50M statements to the binlog in a single transaction, things are going to get kind of quiet on the slave no matter what you do.  Also, you can forget about MyISAM or other non-transactional engines.  As I have written before, these engines offer a number of opportunities for databases to get messed up or out-of-sync even using conventional MySQL replication.  Tungsten's block commit and parallel apply increase the window for problems significantly.  If you are still using MyISAM for replicated data, it's time to man up and convert to InnoDB. 

Thought number two: The long-term answer to effective parallel replication is to change how MySQL works by interleaving transactions within the binlog along the lines suggested by Kristian Nielsen and others.  MySQL currently completely serializes transactions to the binlog, an accomplishment that makes slave apply logic a lot simpler.   Tungsten parallel apply then has to undo this good work and recreate streams of non-conflicting updates, which is complex and does not help all workloads.

It is doubtful that replicating interleaved transactions will be less complex than handling a serial binlog as it stands today.  There is also some heavy lifting inside MySQL to get to an interleaved binlog.  However, interleaved transactions would have the advantage that transactions for any workload would be parallelized, which would widen the scope of benefits to users.  I'm happy to see that Kristian and other people are now working this feature for future releases of MySQL.

Meanwhile, we have a workable solution for Tungsten and are pushing it forward as quickly as we can.  Contact Continuent if you would like to test it out.

Apr 23, 2010

MySQL Conference Slides and Thoughts on State of the Dolphin

I did two talks on replication and clustering at the recent MySQL Conference in Santa Clara.  Thanks to all of you who attended as well as the fine O'Reilly folks who organized everything.  Slides are posted on the talk descriptions at the following URLs: 

Conferences like the MySQL UC are fun because you get to see all your virtual pals in the flesh and have a beer with them.  This is one of the fundamental open source bonding experiences.  Unfortunately the taps for draft beer stopped working at the bar, and Tungsten is in the middle of a big crunch to get parallel replication working.  I didn't get to hang around a lot this year.  A few things still stood out compared to 2009. 

First of all, long-term effects of the Oracle acquisition are clear.   Edward Screven's keynote on "The State of the Dolphin" was sensible and boring.  It seemed a telling metaphor for life in the community going forward.  Oracle is going to do an adequate job of MySQL engineering and better than adequate for Windows.  This is of course "adequate" in the same way that the word applies to products like Microsoft Word. 

An adequate MySQL is probably the worst possible outcome for the groups trying to build businesses on alternative builds.  It looks like an effective way for Oracle to neutralize competitive threats from below for a few years to come.   On the other hand, it's good for most users, who won't be greatly inclined to switch unless Oracle tries to soak them for big licensing fees.  At least one conference attendee, a licensee of other Oracle products, mentioned that had already happened.  He's a MariaDB fan now. 

Second, solid state devices (SSDs) are for real.   Andreas Bechtolsheim gave a great talk on the coming SSD revolution at the 2009 MySQL Conference.  It sounded good.  At the 2010 conference we started to see some real test results.  The hype on SSDs is completely justified. 

There was an excellent panel talk sponsored by Fusion-IO that presented some very compelling results including 10x throughput improvements that allowed one of the companies doing the testing to pull out and repurpose 75% of their hosts.  PCI-based Fusion-IO cards have a 300- to 400X price differential compared to basic rotating disk, but the cost is likely to drop pretty quickly as the technology matures and more competitors enter the field.  Much cheaper SATA alternatives like the Intel X-25 are already starting to flood the low-end market.  Anybody building database systems has to have a plan that accounts for SSDs now

Third, innovation is continuing apace but the problems (and solutions) are moving away from MySQL.   Mark Callaghan really put his finger on it at his Ignite MySQL talk when he said, "In 3 years MySQL won't be the default DBMS for high-scale applications."  New system investment is going into applications that handle big data, have to utilize new hardware efficiently to operate economically, and require multi-tenancy.  These are good targets for Drizzle, PBXT, Tungsten, and other new projects working to make names for themselves.   We all have to raise our game or MySQL will start to become irrelevant.   It's going to be an interesting year.  :)

Mar 28, 2010

New Tungsten Software Releases for MySQL and PostgreSQL

I would like to announce a couple of new Tungsten versions available for your database clustering enjoyment.  As most readers of this blog are aware, Tungsten allows users to create highly available data services that include replicated copies, distributed management, and application connectivity using unaltered open source databases.   We are continually improving the software and have a raft of new features coming out this year.  

First, there is a new Tungsten 1.2.3 maintenance release available in both commercial as well as open source editions.  You can get access to the commercial version on the Continuent website, while the open source version is available on SourceForge

 The Tungsten 1.2.3 release focuses on improvements for MySQL users including the following:
  • Transparent session consistency for multi-tenant applications.  This allows applications that follow some simple conventions like sharding tenant data by database to get automatic read scaling to slaves without making code changes.
  • A greatly improved script for purging history on Tungsten Replicator. 
  • Fixes to binlog extraction to handle enum and set data types correctly. 
By far the biggest improvement in this release is Tungsten product documentation, including major rewrites for the guides covering management and connectivity.  Even the Release Notes are better.  If you want to find out how Tungsten works, start with the new Tungsten Concepts and Administration Guide

Second, there's a new Tungsten 1.3 release coming out soon.  Commercial versions are already in use at selected customer sites, and you can build the open source version by downloading code from SVN on SourceForge

The Tungsten 1.3 release sports major feature additions in the following areas: 
  • A new replicator architecture that allows you to manage non-Tungsten replication and also to configure very flexible replication flows to use multi-core systems more effectively and implement complex replication topologies.  The core processing loop for replication can now cycle through 700,000 events per second on my laptop--it's really quick. 
  • Much improved support for PostgreSQL warm standby clustering as well as provisional management of new PostgreSQL 9 features like streaming replication and hot standby.  
  • Replication support for just about everything in the MySQL binlog:  large transactions, unsigned characters, session variables, various permutations of character sets and binary data, and ability to download binlog files through the MySQL client protocol.  If you can put it in the binlog we can replicate it.  
We also have provisional support for Drizzle thanks to Markus Ericsson, plus a raft of other improvements.  This has been a huge amount of work all around, so I hope you'll enjoy the results.

P.s., Contact Continuent if you want to be a beta test site for Tungsten 1.3. 

Jan 27, 2010

Tungsten 1.2.2 Release is Out - Faster, More Stable, More Fun

Release 1.2.2 of Tungsten Clustering is available on SourceForge as well as through the Continuent website.  The release contains mostly bug fixes in the open source version but there are also two very significant improvements of interest to all users.
  • The manager and monitoring capabilities of Tungsten are completely integrated on the same group communications channel.  This fixes a number of problems that caused data sources not to show up properly in older versions.  
  • We are officially supporting a new Tungsten Connector capability for MySQL called pass-through mode, which allows us to proxy connections by transferring network blocks directly rather than translating native request protocol to JDBC calls.  Our tests show that it speeds up throughput by as much as 200% in some cases. 
The commercial version has additional features like PostgreSQL warm standby clustering, add-on rules to manage master virtual IP addresses and other niceties.   If you are serious about replication and clustering it is worth a look.

This is a good time to give a couple of reminders for Tungsten users.  First, Tungsten is distributed as a single build that integrates replication, management, monitoring, and connectivity.   The old Tungsten Replicator and Myosotis builds are going away.   Second, we have a single set of docs on the Continuent website that covers both open source and commercial distributions.

With that, enjoy the new release.  If you are using the open source edition, please post your experiences in the Tungsten community forums or write a blog article.  We would love to hear from you.

P.s., We have added Drizzle support thanks to a patch from Marcus Eriksson but it's not in 1.2.2.  For that you need to build directly from the SVN trunk.  Drizzle support will be out in binary builds as part of Tungsten version 1.3.

Oct 31, 2009

Replicating from MySQL to Drizzle and Beyond

Drizzle is one of the really great pieces of technology to emerge from the MySQL diaspora--a lightweight, scalable, and pluggable database for web applications. I am therefore delighted that Marcus Erikkson has published a patch to Tungsten that allows replication from MySQL to Drizzle. He's also working on implementing Drizzle-to-Drizzle support, which will be very exciting.

Marcus has submitted the patch to us and I have reviewed the code. It's quite supportable, so I plan to integrate it as soon as we are done with our next Tungsten release, which will post around 5 November. You will be able to build and run it using our new community builds.

This brings up a question--what about replicating from MySQL to PostgreSQL? What about other databases? I get the PostgreSQL replication question fairly often but it may be a while before our in-house team can implement plug-in support for it. Anybody want to submit a patch in the meantime? Post in the Tungsten forums if you have ideas and need help to get the work done. Tungsten Replicator code is very modular and it is not hard to add new database support.

Meanwhile, go Marcus!!

Sep 1, 2009

The Future of Database Clustering

Baron Schwartz started a good discussion about MMM use cases that quickly veered into an argument about clustering in general. As Florian Haas put it on his blog, this is not just an issue of DRBD vs. MySQL Replication. Is a database cluster something you cobble together through bits and pieces like MMM? Or is it something integrated that we can really call a cluster? This is the core question that will determine the future of clustering for open source databases.

I have a strong personal interest in this question, because Tungsten clustering, which I designed, is betting that the answer is changing in two fundamental ways. First, the problems that clustering solves are evolving, which will in turn will lead to significant changes in off-the-shelf clusters. Second, for most users the new clusters will be far better than solutions built from a bunch of individual pieces.

To see why, let's start with some history of the people who use open source databases and why they have been interested in clustering over the last decade or so. Open source databases have a wide range of users, but there are a couple of particularly significant groups. Small- to medium-sized business applications like content management systems are a very large segment. Large web facing applications like Facebook or GameSpot are another. Then there are a lot of custom applications that are somewhere in between--too big to fit on a single database dual- or quad-core server but completely satisfied with the processing power of 2 to 4 servers.

For a long time all of these groups of users introduced clusters for two main reasons: ensuring availability and raising performance. Spreading processing across a cluster of smaller commodity machines was a good solution to both requirements and explains the enormous popularity of MySQL Replication as well as many less-than-successful attempts to implement multi-master clustering. However the state of the art has evolved in a big way in the last couple of years.

The reason for change is simple: hardware. Multi-core architectures, cheap DRAM, and flash memory are changing not just the cost of databases but the fundamental assumptions of database computing. Pull out your dog-eared copy of Transaction Processing by Gray and Reuter, and have a look at the 1991 price/performance trade-offs for memory inside the front cover. Then look at any recent graph of DRAM and flash memory prices (like this one). For example, within a couple of years it will be practical to have even relatively large databases on SSDs. Assuming reasonable software support random reads and writes to "disk" will approach main memory speeds. Dirt-cheap disk archives are already spread across the Internet. The old graph of costs down to off-line tape has collapsed.

Moreover, open source databases are also starting to catch up with the hardware. In the MySQL community both MySQL 5.4 and Drizzle are focused on multi-core scaling. PostgreSQL has been working on this problem for years as well. Commercial vendors like Schooner are pushing the boundaries with custom appliances that integrate new hardware better than most users can do it themselves and add substantial database performance improvements to boot.

With better multi-core utilization plus cheap memory and SSDs, the vast majority of users will be able to run applications with adequate performance on a single database host rather than the 2 to 4 nodes of yore. In other words, performance scaling is rapidly becoming a non-issue for a larger and larger group of users. These user don't need infinite performance any more than they need infinite features in a word processing program. What's already there is enough, or will be within the next year or two.

Performance is therefore receding as a motivation for clustering. Meanwhile, here are three needs that will drive database clustering of open source SQL databases over the next few years.
  1. Availability. Keeping databases alive has always been the number one concern for open source database users, even back in the days when hosts and databases were less capable. This is not a guess. I have talked to hundreds of them since early 2006. Moreover most users just don't have the time to cover all the corner cases themselves and want something that just works without a lot of integration and configuration.
  2. Data protection. Losing data is really bad. For most users nirvana is verified, up-to-the-minute copies of data without having to worry a great deal about how it happens. Off-site protection is pretty big too. Talk to any DBA if you don't believe how important this problem is.
  3. Hardware utilization. With the dropping cost of hardware, concerns about up-front hardware investment are becoming somewhat outdated. Operational costs are a different matter. Let's look at power consumption and assume a dual CPU host drawing 250W, which we double to allow for cooling and other overhead. Using recent industrial electricity rates of 13.51 cents per kilowatt/hour in California you get an electric bill of around $600 per year. Electricity is just one part of operational expenses, which add up very quickly. (Thanks to an alert reader for correcting my math in the original post.)
We will continue to see database clusters in the future: in fact lots of them. But off-the-shelf clusters that meet the newer requirements in an efficient and cost-effective way for open source databases are going to look quite different from tightly coupled master/master or shared disk clusters like Postgres-R and RAC. Instead, we will see clusters based for the most part on far more scalable master/slave replication and with features that give them many of the same cluster benefits but cover a wider range of needs. To the extent that other approaches remain viable in the mass market, they will need to cover these needs as well.
  • Simple management and monitoring - The biggest complaint about clustering is that it's complicated. That's a solvable problem or should be once you can work with master/slave methods instead of more complex approaches. You can use group communications to auto-discover and auto-provision databases. You can control failover using simple, configurable policies based on business rules. You can schedule recurring tasks like backups using job management queues. You can have installations that pop up and just work.
  • Fast, flexible replication - Big servers create big update loads and overwhelm single-threaded slaves. We either need parallel database replication or disk-level approaches like the proposed PostgreSQL 8.5 log-streaming/hot standby or DRBD. Synchronous replication is a requirement for many users. Cross-site replication is increasingly common as well. Finally, replication methods will need to be pluggable, because different replication methods have different strengths; replication itself is just one part of the clustering solution, which for the most part is the same regardless of the replication type.
  • Top-to-bottom data protection - Simple backup integration is a good start, but the list of needs is far longer: off-site data storage, automatic data consistency checks, and data repair are on the short list of necessary features. Most clustering and replication frameworks offer little or nothing in this area even though replica provisioning is often closely tied to backups. Yet for many users integrated data protection will be the single biggest benefit of the new clustering approach.
  • Partition management - In the near future most applications will fit on a single database server, but most organizations have multiple applications while ISPs run many thousands of them. There need to be ways to assign specific databases to partitions and then allow applications to locate them transparently. This type of large-scale sharding is the problem that remains when single application databases can run on a single host.
  • Cloud and virtualized operation - In the long run virtualization is the simplest cure for hardware utilization problems--far easier and more transparent than other approaches. A large number of applications now run on virtual machines at ISPs or in cloud environments like Amazon for this reason. To operate in virtual environments, database clusters must be software only, have simple installation, and make very minimal assumptions about resources. Also, they need to support seamless database provisioning to as capacity needs rise and fall, for example adding new VMs or provisioning an existing 4 core VM to a larger 8-core VM with more memory as demand shifts.
  • Transparent application access - Applications need to be able to connect to clusters seamlessly using accustomed APIs and without SQL changes. This is actually easier to do on databases that use simple master/slave or disk block methods rather than more complex clustering implementations. (Case in point: porting existing applications to MySQL Cluster.) Also, the application access needs to be able to handle simple performance-based routing, such as directing reports or backups to a replica database. The performance scaling that most users now need is just not that complicated.
  • Open source - For a variety of reasons closed approaches to clustering are doomed to insignificance in the open source database markets. The base clustering components have to be open source as some of them will depend on extensions of existing open source technology down to the level of storage and database log changes. You also need the feedback loops and distribution that open source provides to create mass-market solutions.
What I have just described is exactly what we are building with Tungsten. Tungsten is aimed at the increasingly large number of applications that can run on a single database. We can help with database performance too, of course, but we recognize that over time other issues will loom larger for most users. The technical properties described above are tractable to implement and we have a number of them already with more on the way in the near future. Master/slave clustering is not just feasible--it works, and works well for a wide range of users.

Still, I don't want anyone to mistake my point. There are many applications for which performance is a very serious problem or whose other needs cannot possibly be met by off-the-shelf software. Facebook and other large sites will continue to use massive, custom-built MySQL clusters as well as non-SQL approaches that push the state of the art for scaling and availability. Analytics and reporting will continue to require ever larger databases with parallel query and automatic partitioning of data as Aster and GreenPlum do. There are specialized applications like Telco provisioning that really do require a tightly coupled cluster and where it's worth the effort to rewrite the application so it works well in such an environment. These are all special cases at the high end of the market.

Mainstream users need something that's a lot simpler and frankly more practical to deliver as an off-the-shelf cluster. Given the choice between combining a number of technologies like MMM, backups of various flavors, cron jobs, Maatkit, etc., a lot of people are just going to choose something that pops up and works. The hardware capability shift and corresponding database improvements are tilting the field to clustering solutions like Tungsten that are practical to implement, cover the real needs of users, and are fully integrated. I'm betting that for a sizable number of users this is the future of database clustering.

p.s., We have had a long summer of work on Tungsten, which is why this blog has not been as active as in some previous months. We are working on getting a full clustering solution out in open source during the week of September 7th. For more information check out full documentation of open source and commercial products here.