Sunday, January 30, 2011

Virtual IP Addresses and Their Discontents for Database Availability

Virtual IP addresses, or VIPs, are commonly used to enable database high availability.  A standard failover design uses an active/passive DBMS server pair connected by replication and watched by a cluster manager.  The active database listens on a virtual IP address; applications use it for database connections instead of the normal host IP address.  Should the active database fail, the cluster manager promotes the passive database server and shifts the virtual IP address to the newly promoted host.  Application connections break and then reconnect to the VIP, which now points them to the new database.
VIP-Based Database Availability Design
Virtual IP addresses are enticing because they are completely transparent to applications--no changes to database API behavior, no changes to connection strings, etc.  While virtual IP addresses seem simple, they depend on arcane TCP/IP behavior that is not especially well understood and not always consistent across different TCP/IP implementations.  Unless properly managed, virtual IP addresses can induce problems that run the gamut from simple lack of availability to split-brain, which can in turn lead to irreparable data corruption.

This blog article describes in some detail how virtual IP addresses work, specifically the behavior of the Address Resolution Protocol (ARP), a core part of the TCP/IP stack that maps IP addresses to hardware addresses and permits VIPs to move.  We will then dig into how split-brain arises as a consequence of ARP behavior.  Finally we will look at ways to avoid split-brain or at least make it less dangerous when it occurs.

Note: the examples below use MySQL, in part because mysqld has a convenient way to show the server host name.  However, you can set up the same test scenarios with PostgreSQL or most other DBMS implementations, for that matter.

What is a Virtual IP Address? 

Network interface cards (NICs) typically bind to a single IP address in TCP/IP networks.   However, you can also tell the NIC to listen on extra addresses using the Linux ifconfig command.  Such addresses are called virtual IP addresses or VIPs for short.

Let's say for example you have an existing interface eth0 listening on IP address 192.168.128.114.  Here's the configuration of that interface as shown by the ifconfig command:
saturn# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 08:00:27:ce:5f:8e  
          inet addr:192.168.128.114  Bcast:192.168.128.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fece:5f8e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:10057681 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8902384 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:6941125495 (6.9 GB)  TX bytes:6305062533 (6.3 GB)
You can allow the eth0 interface to accept traffic for another address 192.168.128.130 using the following simple command. 
saturn# ifconfig eth0:0 192.168.128.130 up
This command tells the TCP/IP stack to accept packets directed to IP address 192.168.128.130 as well as the original address 192.168.128.114.   This means that if you are running a MySQL server on the host, users can connect using either of the following commands:
mysql -utungsten -psecret -h192.168.128.114 (or)
mysql -utungsten -psecret -h192.168.128.130
Finally, you can move VIPs from host to host very easily by dropping them on the first host and binding to them on a second host, as shown in the following example:
saturn# ifconfig eth0:0 192.168.128.130 down
...
neptune# ifconfig eth0:0 192.168.128.130 up
In every other way, virtual IP addresses behave like standard IP addresses.  You can put them in DNS, applications can connect to them, etc.  VIP-based high availability is therefore a minimally invasive implementation for most applications--about the only requirement is that applications must reconnect if their connection breaks.
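About that reconnect requirement: a simple client-side retry loop is usually enough.  Here is a minimal bash sketch; the helper name and retry limits are our own for illustration, not part of any particular client library.

```shell
#!/bin/bash
# Retry a command until it succeeds or the attempt limit is reached.
# Usage: with_retry <max_attempts> <command> [args...]
with_retry() {
  local max=$1; shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1   # back off briefly before reconnecting
  done
}

# Example: keep retrying a connection to the VIP.
# with_retry 5 mysql -utungsten -psecret -h192.168.128.130 -e "select 1"
```

After a failover the first few attempts fail while the VIP moves, then a later attempt lands on the newly promoted master.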

How Virtual IP Addresses Work

To understand the weaknesses of virtual IP addresses for database high availability, it helps to understand exactly how the TCP/IP stack actually routes messages between IP addresses, including those that correspond to VIPs.   The following diagram shows the moving parts of the TCP/IP stack in a typical active/passive database set-up with a single client host.  In this diagram host saturn currently has virtual IP address 192.168.128.130, neptune is the other database host, and mercury is the client host.


Applications direct TCP/IP packets using IP addresses, which in IPv4 are four-byte numbers.  The IP destination address is written into the header by the IP layer of the sending host and read by the IP layer on the receiving host.

However, this is not enough to get packets from one host to another.  At the hardware level within a single LAN, network interfaces use MAC addresses to route messages over physical connections like Ethernet.   MAC addresses are 48-bit addresses that are typically burned into the NIC firmware or set in a configuration file if you are running a virtual machine.  To forward a packet from host mercury to saturn, the link layer writes in the proper MAC address, in this case 08:00:27:ce:5f:8e.  The link layer on host saturn accepts this packet, since it corresponds to its MAC address.  It forwards the packet up into the IP layer for acceptance and further processing.

Yet how does host mercury figure out which MAC address to use for particular IP addresses?  This is the job of the ARP cache, which maintains an up-to-date mapping between IP and MAC addresses.  You can view the ARP cache contents using the arp command, as shown in the following example.
mercury# arp -an
? (192.168.128.41) at 00:25:00:44:f3:ce [ether] on eth0
? (192.168.128.1) at 00:0f:cc:74:64:5c [ether] on eth0
Note that the ARP cache does not have a mapping for the VIP address 192.168.128.130.  Let's say we now make a client connection to the MySQL server at the other end of the VIP address on mercury:
# mysql -utungsten -psecret -h192.168.128.130
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 33826
...
mysql>
To route traffic, host mercury gets the IP address to MAC address mapping using an ARP request.  You can watch this happen in real time using the tcpdump command to track ARP traffic.
mercury# tcpdump -n -i eth0 arp 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
09:37:43.755081 ARP, Request who-has 192.168.128.130 tell 192.168.128.110, length 28
09:37:43.755360 ARP, Reply 192.168.128.130 is-at 08:00:27:ce:5f:8e, length 46
Now if you look at the ARP cache again on host mercury, you will see a proper mapping for the VIP in mercury's ARP cache:
# arp -an
? (192.168.128.130) at 08:00:27:ce:5f:8e [ether] on eth0
? (192.168.128.41) at 00:25:00:44:f3:ce [ether] on eth0
? (192.168.128.1) at 00:0f:cc:74:64:5c [ether] on eth0
Now if you go back and look at the picture (or still recall the details), you will notice that the MAC address maps to the NIC on host saturn.  This is exactly what we expect since saturn is listening on the corresponding virtual IP address 192.168.128.130. 

Virtual IP Addresses and Split-Brain

Most real problems with VIPs appear when you try to move them.  The reason is simple:  TCP/IP does not stop you from having multiple hosts listening on the same virtual IP address.  For instance, let's say you issue the following command on host neptune without first dropping the virtual IP address on saturn.  
neptune# ifconfig eth0:0 192.168.128.130 up
Let's further clear the ARP mapping for the virtual IP on mercury using the handy arp -d command and reconnect to MySQL.
mercury # arp -d 192.168.128.130
mercury# mysql -utungsten -psecret -h192.168.128.130
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 294
...
mysql>
So far so good--we logged into MySQL and everything appears normal.  But in fact it is not normal at all.  If you run tcpdump and watch the ARP requests during this login, you see the following:
# tcpdump -n -i eth0 arp 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
09:59:32.643518 ARP, Request who-has 192.168.128.130 tell 192.168.128.110, length 28
09:59:32.643768 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
09:59:32.643793 ARP, Reply 192.168.128.130 is-at 08:00:27:ce:5f:8e, length 46 
This is not just bad--it's very bad.  Both saturn and neptune responded to mercury's ARP request!  Mercury can pick only one for the mapping; which one it picks depends on timing as well as the exact implementation of the TCP/IP stack.  In other words we have a race condition and the winner is essentially random.

You can demonstrate the randomness for yourself with a simple experiment.  Let's create a test script named mysql-arp-flush.sh, which clears the ARP cache entry for the VIP and then connects to MySQL, all in a loop.  
#!/bin/bash
for i in {1..5}; 
do 
  arp -d 192.168.128.130
  sleep 1
  mysql -utungsten -psecret -h192.168.128.130 -N \
    -e "show variables like 'host%'"
done
If you run the script you'll see results like the following.  Note the random switch to neptune on the fourth connection.
# ./mysql-arp-flush.sh
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | neptune |
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
At this point you have successfully created a split-brain.  If you use database replication and both databases are open for writes, as would be the default case with MySQL replication, Tungsten, or any of the PostgreSQL replication solutions like Londiste, your applications will randomly connect to each DBMS server.  Your data will quickly become irreparably mixed up.  All you can do is hope that the problem will be discovered quickly. 

A Half-Hearted Solution using Gratuitous ARP

You might think that it would be handy if the ARP protocol provided a way to get around split-brain problems by invalidating client host ARP caches.  In fact, there is such a feature in ARP--it's called gratuitous ARP.  While useful, it is not a solution for split-brain issues.  Let's look closely to see why. 

Gratuitous ARP works by sending an unsolicited ARP response to let hosts on the LAN know that an IP address mapping has changed.  On Linux systems you can issue the arping command as shown below to generate a gratuitous ARP response. 
neptune# arping -q -c 3 -A -I eth0 192.168.128.130
This tells host neptune to send 3 ARP reply messages with its MAC address for the VIP address.  If we look at tcpdump output again, we see the following:
# tcpdump -n -i eth0 arp 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
11:02:27.154279 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
11:02:28.159291 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
11:02:29.162403 ARP, Reply 192.168.128.130 is-at 08:00:27:68:cd:7d, length 46
Linux hosts that receive the response will generally then update their ARP caches, though as we will see, there are some important exceptions.   But first, we need to show the effect of gratuitous ARP on MySQL connections.  Let's start with the following ARP cache contents on host mercury.  This shows an existing mapping to the MAC address of host neptune, which is what we expect from the previous arping command on neptune. 
mercury# arp -an
? (192.168.128.130) at 08:00:27:68:cd:7d [ether] on eth0
? (192.168.128.41) at 00:25:00:44:f3:ce [ether] on eth0
? (192.168.128.1) at 00:0f:cc:74:64:5c [ether] on eth0
Next, we run a loop that connects to MySQL and prints the host name every second.  The loop code is shown below and stored in a script named mysql-no-arp-flush.sh.  Unlike the previous script this does not release the ARP cache mapping between connections to MySQL.
#!/bin/bash
for i in {1..30}; 
do 
  sleep 1
  mysql -utungsten -psecret -h192.168.128.130 -N \
    -e "show variables like 'host%'"
done
While the test script is running, we run an arping command from saturn.
saturn# arping -q -c 3 -A -I eth0 192.168.128.130
What we see in the MySQL output is the following.  Once the gratuitous ARP is received, mercury switches its connection from neptune to saturn and stays there, at least for the time being.
mercury# ./mysql-no-arp-flush.sh
+----------+---------+
| hostname | neptune | 
+----------+---------+
+----------+---------+
| hostname | neptune | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
+----------+---------+
| hostname | saturn  | 
+----------+---------+
There is one more interesting property of gratuitous ARP responses.  If you issue one during a session, it will cause client sessions to switch between hosts without waiting for a timeout.  Here's an example.  First login with MySQL and see which host we are on.
mercury# mysql -utungsten -psecret -h192.168.128.130
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 33853
mysql> show variables like 'hostname';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| hostname      | neptune | 
+---------------+---------+
1 row in set (0.00 sec)
Now issue an arping command on saturn using a separate window.
saturn# arping -q -c 3 -A -I eth0 192.168.128.130
Finally, go back and check the host name again in the MySQL session.   The session switches over to the other server, which you see at the client level as a lost connection followed by a reconnect.
mysql> show variables like 'hostname'; 
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    33854
Current database: *** NONE ***

+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| hostname      | saturn |
+---------------+--------+
1 row in set (0.00 sec)
So is gratuitous ARP the solution to virtual IP split-brain?  It announces that there is a mapping change, which can make failover work much more quickly.  This is useful in its own right.  However, it does not solve split-brain. 

First, not all TCP/IP stacks even recognize gratuitous ARP responses.  Second, gratuitous ARP only takes effect on hosts that actually have a current mapping in their ARP cache.  Other hosts will wait until they actually need the mapping and then issue a new ARP request.  Finally, ARP mappings automatically time out after a few minutes.  In that case the host will issue a new ARP request, which, as in the two preceding cases, brings us right back to the split-brain scenario we were trying to cure.
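On Linux you can see how quickly untouched ARP entries go stale by inspecting the kernel's neighbor-table sysctls.  The helper function name below is ours, but the sysctls are standard Linux ones.

```shell
#!/bin/bash
# Show the Linux neighbor (ARP) cache timing parameters that determine
# how long a mapping stays valid before the kernel re-probes it.
show_arp_timeouts() {
  # Seconds before an unused entry becomes eligible for garbage collection.
  sysctl net.ipv4.neigh.default.gc_stale_time
  # Base interval (ms) after which a confirmed entry must be revalidated.
  sysctl net.ipv4.neigh.default.base_reachable_time_ms
}

# show_arp_timeouts
```

Once an entry goes stale, the next connection triggers a fresh ARP request--and a fresh chance to lose the race described above.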

Avoiding Virtual IP Split-Brain

Avoiding a VIP-induced split-brain is not a simple problem.  The best approach is a combination of sound cluster management, amelioration, and paranoia.

Proper cluster management is the first line of defense.  VIPs are an example of a unique resource in the system that only one host may hold at a time.   An old saying that has been attributed to everyone from Genghis Khan to Larry Ellison sums up the problem succinctly:
It is not enough to succeed.  All others must fail.  
The standard technique to implement this policy is called STONITH, which stands for "Shoot the other node in the head."  Basically it means that before one node acquires the virtual IP address the cluster manager must make every effort to ensure no other node has it, using violent means if necessary.   Moving the VIP thus has the following steps.
  1. The cluster manager executes a procedure to drop the VIP on all other hosts (for example using ssh or by cutting off power).  Once these procedures are complete, the cluster manager executes a command to assign the VIP to the new owner. 
  2. Isolated nodes automatically release their VIP.  "Isolated" is usually defined as being cut off from the cluster manager and unable to access certain agreed-upon network resources such as routers or public DNS servers. 
  3. In cases of doubt, everybody stops.  For most systems it is far better to be unavailable than to mix up data randomly.  
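The steps above can be sketched as a shell procedure.  This is an illustrative outline only--the host names, the power_off fencing helper, and the error handling are assumptions for this sketch, and a real cluster manager does considerably more.

```shell
#!/bin/bash
# Illustrative STONITH-style VIP move: fence every other host before
# binding the VIP on the new master.  Not a production procedure.
VIP=192.168.128.130
NEW_MASTER=neptune
OTHER_HOSTS="saturn"

fence_host() {
  local host=$1
  # Best effort: drop the VIP remotely...
  ssh root@"$host" "ifconfig eth0:0 $VIP down" ||
    # ...and if the host is unreachable, cut its power instead.
    # (power_off is a hypothetical site-specific helper, e.g. a PDU script.)
    power_off "$host" || return 1
}

move_vip() {
  local host
  for host in $OTHER_HOSTS; do
    # If we cannot prove the other node lost the VIP, stop: being
    # unavailable is better than split-brain.
    fence_host "$host" || { echo "fencing $host failed; aborting" >&2; return 1; }
  done
  ssh root@"$NEW_MASTER" "ifconfig eth0:0 $VIP up" &&
    ssh root@"$NEW_MASTER" "arping -q -c 3 -A -I eth0 $VIP"  # announce the move
}
```

Note the ordering: the VIP is assigned to the new owner only after every fencing step has succeeded, and any doubt aborts the whole operation.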
Cluster managers like Tungsten and Pacemaker handle this kind of process very well.   Pacemaker, for example, has a number of specialized hooks to cut power or otherwise use extreme violence to ensure proper fencing of databases.  Tungsten has fewer such hooks but has a much richer set of operations for databases, as well as a wide set of connectivity options for HA besides VIPs.

Incidentally, you want to be very wary about re-inventing the wheel, especially when it comes to DBMS clustering and high availability.  Clustering has a lot of non-obvious corner cases; even the "easy" problems like planned failover are quite hard to implement correctly.  You are almost always better off using something that already exists instead of trying to roll your own solution.

Amelioration is the next line of defense, namely making split-brain situations less dangerous when they actually occur.  Failover using shared disks or non-writable slaves (e.g., with DRBD or PostgreSQL streaming replication) has a degree of protection because it is somewhat harder to have multiple databases open for writes.  It is definitely possible, however, and the cluster manager is your best bet to prevent it.  When using MySQL with either native or Tungsten replication, databases are open and therefore susceptible to data corruption unless you ensure slaves are not writable.

Fortunately, this is very easy to do in MySQL.   To make a database read-only to all accounts other than those with SUPER privilege, just issue the following commands to make the server read-only and verify the setting is in effect:
neptune# mysql -uroot -e "set global read_only=1"
neptune# mysql -uroot -e "show variables like 'read_only'"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | ON    | 
+---------------+-------+
This protects you not just in cases of actual failover but also from administrative mistakes or software failures that switch the VIP by accident.   Many cluster managers support read-only slaves: Tungsten clustering has explicit support for them, and others like MMM and Pacemaker can do the same.

Lastly, paranoia is always a good thing.  You should test the daylights out of clusters that depend on VIPs before deployment, and also check regularly afterwards to ensure there are no unexpected writes on slaves.  Regular checks of logs are a good idea.  Another good way to check for problems in MySQL master/slave setups is to run consistency checks.   Tungsten Replicator has built-in consistency checking designed for exactly this purpose.  You can also run Maatkit mk-table-checksum at regular intervals.   Another best practice is to "flip" masters and slaves on a regular basis to ensure switch and failover procedures work properly.   Don't avoid trouble--look for it!
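A periodic check for unexpected writability can be sketched as a small cron job in bash.  The credentials, host list, and helper name here are assumptions for illustration.

```shell
#!/bin/bash
# Paranoia check suitable for cron: warn if any slave is unexpectedly
# writable.  Host list and credentials are assumptions for this sketch.
SLAVES="neptune"

check_read_only() {
  local host=$1 value
  value=$(mysql -utungsten -psecret -h"$host" -N \
            -e "show variables like 'read_only'" | awk '{print $2}')
  if [ "$value" != "ON" ]; then
    echo "WARNING: slave $host is writable (read_only=$value)" >&2
    return 1
  fi
}

# Run from cron, e.g. every 5 minutes:
# for host in $SLAVES; do check_read_only "$host"; done
```

Wiring the warning into your monitoring system turns a silent split-brain into an alert you see within minutes instead of days.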

Conclusion and Note on Sources

Virtual IP addresses are a convenient way to set up database high availability but can lead to very severe split-brain situations if used incorrectly.   To deploy virtual IP addresses without problems you must first of all understand how they work and second use a sound cluster management approach that avoids split-brain and minimizes the impact if it does occur.  As with all problems of this kind you need to test the implementation thoroughly before deployment as well as regularly during operations.  This will help avoid nasty surprises and corrupt data that are otherwise all but inevitable.

Finally it is worth talking a bit about sources.  I wrote this article because I could not find a single location that explained virtual IP addresses in a way that drew out the consequences of their behavior for database failover.  That said, there are a couple of good general sources for information on Internet tools and high availability:
Beyond that you can look at general networking sources like Radia Perlman's Interconnections, Second Edition or Internetworking with TCP/IP by Douglas Comer.  These are more high-level.  If you get really desperate for details, try the RFCs, for example RFC-826, which is the original specification for ARP.  Some of them are surprisingly good reads even 30 years after the fact.

Tuesday, January 25, 2011

Tungsten Replicator Overview Webinar

On Thursday January 27th at 10am PST I will be doing a webinar on Tungsten Replicator together with my colleague Giuseppe Maxia.  The title is "What MySQL Replication Cannot Do.  And How to Get Around It."  Basically it is a nuts-and-bolts description of Tungsten Replicator capabilities like multi-master replication, failover, parallel apply, and using replication for zero-downtime upgrade.  If you have ever wanted an in-depth look at the Tungsten Replicator, this is a good opportunity.

During 2010 we implemented an amazing number of new replication features ranging from pipelines early in the year to fast disk logs, multiple replication services per process, bi-directional replication, and parallel apply by the end.  We will be building out all of these in the coming year and releasing increasingly capable features into open source as well. 

This presentation is part of Continuent's regular webinar series which means we will also talk a bit about commercial products and services at the end.  However, it's mostly cool replication stuff.   You can sign up on the Continuent webinar page.  Hope to see you there.

Tuesday, January 11, 2011

Fixing Replication with Replication

A couple of days ago I ran into a Tungsten Replicator case where several MySQL tables became corrupted on slaves and needed to be restored from the master.   We identified the tables that had problems fairly quickly using Tungsten Replicator's consistency checks.  However, that led to another problem:  how to restore the slave tables efficiently from the master.  The MySQL server in question processes around 10M transactions per day--there is virtually no downtime.  Though the tables were not large, we could not be sure whether they were in use.

Fortunately, you can use a simple MySQL trick to get all the rows of a table to replicate through to slaves.  The idea is to dump the table, delete the rows, then reload it again.  The delete and subsequent reload replicate out to slaves, after which everything is consistent again.  Let's say we have a table called tpcb.history that needs to be fixed.  Login with mysql and run the following commands:
BEGIN;
SELECT * FROM tpcb.history 
  INTO OUTFILE '/tmp/tpcb.history.dmp' FOR UPDATE;
DELETE FROM tpcb.history;
LOAD DATA INFILE '/tmp/tpcb.history.dmp' REPLACE
  INTO TABLE tpcb.history FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';
COMMIT;
You can do the reload several ways in MySQL, but this particular code has some advantages over other approaches, such as using LOCK TABLES.  First, it uses a transaction, so if something goes wrong the changes roll back and you do not lose your data.  Second, the SELECT ... FOR UPDATE locks your data and ensures serialization.  You can run this while applications are running without problems.

This seems useful enough that I put together a simple script called reload-table.sh with a README and checked them into the Tungsten Replicator codeline on SourceForge.net.  You can refresh the same table shown above using the following command:
[sudo] ./reload-table.sh -u tungsten -p secret -t tpcb.history
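The core of such a wrapper can be sketched as a bash function feeding the SQL shown earlier to mysql.  This is our illustrative sketch, not the actual script in the Tungsten codeline.

```shell
#!/bin/bash
# Sketch of a table-reload wrapper (not the actual reload-table.sh):
# dump, delete, and reload a table in one transaction so the delete
# and reload replicate out to slaves.
reload_table() {
  local user=$1 pass=$2 table=$3
  local dump="/tmp/${table}.dmp"
  mysql -u"$user" -p"$pass" <<EOF
BEGIN;
SELECT * FROM $table INTO OUTFILE '$dump' FOR UPDATE;
DELETE FROM $table;
LOAD DATA INFILE '$dump' REPLACE INTO TABLE $table
  FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
COMMIT;
EOF
}

# Example: reload_table tungsten secret tpcb.history
```

Because everything runs inside one transaction, a failure part way through rolls back cleanly rather than leaving the table half-deleted.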
I tested the reload using Tungsten 1.3.1 on MySQL 5.1 with statement replication.  However, it would work equally well with row replication.  Moreover, you can do the same trick in MySQL replication, as this involves base replication capabilities that are directly equivalent.  There are a few caveats:  you need to use InnoDB (or another transactional engine), large tables may be a problem, and you would need to tread carefully in cases where tables contain referential constraints.  Finally, it would be wise to save the master table somewhere else before running the script.
