Oct 16, 2010

MySQL Disaster Recovery With Tungsten

Disaster recovery (DR) is not the first thing most DBAs think of when putting up a new database application.   However, it's one of the top issues for people using the data--what happens if the site goes down and everything disappears?   So even if DR is not the first issue in every deployment, it is a very high priority as soon as your application is the least bit successful.

At the database level DR has a fairly simple solution: keep an up-to-date copy of your data on a backup site at all times. This article explains the architecture for MySQL DR with Tungsten and a couple of key features that make it work, namely floating IP addresses and global transaction IDs. We will dig into those at the end.

First, a bit of introduction. Tungsten manages clusters of off-the-shelf databases connected by master/slave replication. There are replication and management services on each host with automated policies to handle failover as well as low-level tasks like recognizing new cluster members. There is a simple management client that lets you log into any host and manage all nodes in the cluster. Tungsten also has connectivity options to let applications find databases easily. However, for this article we are going to focus on the database only and how you solve the problem of ensuring your data are protected.

DR Setup

To implement disaster recovery, you actually create two clusters--one on your main site and one on a backup site, which we will henceforth call the DR site. The architecture looks like the following picture.

Standard Main/DR Architecture with Backups
Here is an outline of the setup.  There are additional details of course but those are covered in Tungsten documentation and support procedures.  The goal here is to give you a sense of how things work at the top level. 
  1. Main site.  Set up the main site cluster as a master/slave pair with a floating IP address on the master.  Enable automatic policy mode so that in the event of a master failure the local slave will immediately take over.  Set up backups and run them on the slave on a regular basis. 
  2. DR site. Next, set up the DR cluster by provisioning both databases with a recent backup from the main cluster. Configure it identically to the main site, including a master floating IP address and regular backups, but with two exceptions. First, use manual policy mode so that the cluster does not try to fail over on its own. Second, do not start replication automatically. Instead, manually configure the DR master to be a slave of the main site master using the main site's master floating IP address, then start services. (A sketch of this layout follows the list.)
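To make the layout concrete, here is a minimal sketch in Python of the two clusters and the settings that matter for DR. The field names (members, master_vip, policy, and so on) are purely illustrative and are not actual Tungsten configuration properties; the point is simply that the DR master replicates from the main site's floating IP while its own cluster stays in manual policy mode.

    # Illustrative only: these keys are not real Tungsten property names.
    MAIN_SITE = {
        "members": ["db1.main", "db2.main"],   # master/slave pair
        "master_vip": "10.0.1.100",            # floating IP follows the master
        "policy": "automatic",                 # cluster fails over on its own
        "backups": {"run_on": "slave", "schedule": "daily"},
    }

    DR_SITE = {
        "members": ["db1.dr", "db2.dr"],
        "master_vip": "10.0.2.100",            # installed only after site failover
        "policy": "manual",                    # never fails over on its own
        # The DR master is a relay slave of the main site's floating IP,
        # so it follows the main master wherever it moves.
        "replicate_from": MAIN_SITE["master_vip"],
        "backups": {"run_on": "slave", "schedule": "daily"},
    }

    if __name__ == "__main__":
        print("DR master pulls from:", DR_SITE["replicate_from"])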
Handling Failures

At the end of setup you have a main site with a cluster and a DR site with a cluster that slaves efficiently off the main site master.  Both sites have regular backups.  As long as there are no failures, you operate both sites and everything is fine.  Let us now consider a couple of different types of failures and how to handle them.

Let's suppose the main site master fails. Tungsten will automatically fail over to the main site slave and move the master floating IP address. The DR site relay slave's TCP/IP connection to the master will then break, or more accurately time out. When the relay slave reconnects to the floating IP, the address will already point to the new master, and replication to the DR site will continue without any human intervention.
Failed Master on Main Site
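If it helps to see the reconnect behavior spelled out, here is a minimal sketch in Python, assuming an illustrative address and port rather than Tungsten's actual ones. The relay slave does not care which physical host answers; it only cares that something is listening on the floating IP.

    import socket
    import time

    MASTER_VIP = "10.0.1.100"   # floating IP; it moves with the master on failover
    PORT = 2112                 # illustrative port, not necessarily what Tungsten uses

    def reconnect_to_master():
        """Simplified relay-slave behavior: when the connection to the floating
        IP breaks or times out, keep retrying. Whichever host answers is the
        current master, and replication resumes where it left off."""
        while True:
            try:
                return socket.create_connection((MASTER_VIP, PORT), timeout=30)
            except (socket.timeout, OSError):
                time.sleep(5)   # failover in progress; the VIP will reappear shortly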
This protocol is handy because failures are not the only reason that the main site master may move.  You can also move masters for maintenance or upgrades.  Tungsten has a switch command that makes this very easy to do.  The floating IP moves as before and the DR site continues to receive updates properly after it reconnects. 

If you lose the main site, you initiate a site switch procedure. At the database level this consists of running a script to "unconfigure" your DR relay slave node so that it becomes a master again and then reloading the configuration. When the node comes up as a master it will automatically install its own master floating IP address. The commands are simple and run in a few seconds. In most cases it will take a lot longer to switch applications properly than to switch databases, because you have to change DNS entries, start and/or reconfigure applications, and potentially activate other resources to have a functioning system.
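For readers who have not worked with floating IP addresses, here is a rough sketch of what "installing" one amounts to. Tungsten performs this step for you when the promoted node comes online as a master, so treat it as background rather than part of the procedure; the address and interface names are made up.

    import subprocess

    DR_MASTER_VIP = "10.0.2.100"   # illustrative address for the DR site's master
    INTERFACE = "eth0"             # illustrative network interface

    def install_floating_ip():
        """Bind the master floating IP to this host with the standard Linux
        iproute2 tool. Tungsten does the equivalent automatically when the
        promoted node comes up as a master."""
        subprocess.check_call(
            ["ip", "addr", "add", DR_MASTER_VIP + "/24", "dev", INTERFACE])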

In fact, the real problem with site failover at the database level is not so much failing over as getting the main site back in operation without losing too much data and with as little interruption to users as possible. You first need to check for any transactions that did not make it off the main site and apply them to the DR site master. In MySQL you can do this by carefully applying transactions from the main site binlog. You can help yourself considerably by including a step in the site failover process where you fence (i.e., turn off) the old site as quickly as possible by shutting down applications and taking databases offline. The fewer extra transactions on the main site, the simpler it is to clean up.
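As a sketch of that binlog cleanup step, the standard mysqlbinlog and mysql client tools can replay the stranded transactions onto the DR master. The file name, start position, and host below are placeholders, and in a real recovery you would inspect the mysqlbinlog output by hand before applying any of it.

    import subprocess

    BINLOG_FILE = "/var/lib/mysql/mysql-bin.000123"   # placeholder binlog file
    START_POS = 107                                   # first event not yet on the DR site
    DR_MASTER = "db1.dr"                              # placeholder host name

    def replay_stranded_transactions():
        """Pipe the tail of the old master's binlog into the new (DR) master.
        Equivalent to: mysqlbinlog --start-position=N FILE | mysql -h DR_MASTER"""
        dump = subprocess.Popen(
            ["mysqlbinlog", "--start-position=%d" % START_POS, BINLOG_FILE],
            stdout=subprocess.PIPE)
        subprocess.check_call(["mysql", "-h", DR_MASTER], stdin=dump.stdout)
        dump.stdout.close()
        dump.wait()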

Next, you need to get the main site resynchronized with the DR site. If there are more than a few differences, you will probably just restore the main site master and slave from local backups, then manually configure them to make the main site master a relay slave of the DR site. If you have large databases, you may want to look at SAN or NAS products like NetApp that offer snapshot capabilities. I have been working lately with NetApp; the snap restore command is really impressive for rolling back file system state quickly.

DR Site Operation and Main Site Recovery
Once the main site is caught up, you can take a short outage and switch applications back to the main site. This step is not fully transparent, but unlike the original DR failover, you get to pick the time that is least inconvenient for your users. Also, you can use Tungsten features like consistency checks to verify that data are consistent across sites.

Underlying Tungsten Features to Enable DR

As promised at the beginning, here is a look at the Tungsten features that make DR work. First, there is automated failover with floating IP address management. Tungsten uses a rules engine combined with group communications to manage failover and master floating IPs efficiently. The rules take care of many of the weird failure cases as well as routine tasks like automatically making slave servers read-only. Setting up DR without floating IP addresses is more complex because it means your relay slave needs to know whenever the main site master moves, for any reason.

As useful as floating IP addresses are, Tungsten has a much more important feature that underlies the entire DR site architecture:  global transaction IDs.   Unlike native MySQL replication, Tungsten assigns a global ID or seqno to each transaction as it is read from the binlog.  Tungsten replicator processes track position using the seqno values rather than the file name and offset used by MySQL slaves.  Here is a picture that illustrates how the replicator log works.
Global IDs, Epoch Numbers, and Backups
As already mentioned, the Tungsten master replicator assigns the seqno to each transaction as it is extracted.  Tungsten slave replicators always use the seqno to request the next event from the master.  This means that you can switch the master without worrying whether slaves will lose track of their positions, because they will just ask for the seqno from the new master.
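A tiny illustration of the difference, with made-up values: a Tungsten slave's position is a single number that means the same thing against any master, while a native MySQL slave's position is tied to one particular master's binlog files.

    # What a Tungsten slave conceptually offers when it (re)connects:
    tungsten_position = {"after_seqno": 182345}

    # What a native MySQL slave must track instead; the file name and offset
    # are only meaningful on the master that wrote that particular binlog.
    mysql_native_position = {"file": "mysql-bin.000123", "offset": 4711}

    print(tungsten_position, mysql_native_position)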

The other important feature of global IDs is that they make backups fungible across different databases and even sites.  Tungsten marks the database with the current seqno and epoch number.  As long as your backup (or file system snapshot) is transactionally consistent, you can load it on any server and bring it back online as a slave.  The new slave will connect to and catch up with the master, wherever it happens to be.  This makes database recovery both simple and very flexible.
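Here is a rough sketch of why that works, with invented metadata fields standing in for whatever the backup actually records. The essential point is that a seqno/epoch pair, unlike a binlog file name and offset, is meaningful on every server in both sites.

    backup_metadata = {"taken_on": "db2.main",  # where the backup was made
                       "seqno": 182345,         # last transaction it contains
                       "epoch": 17}             # epoch of the master that produced it

    def restore_datafiles():
        # Placeholder: untar a backup or roll back a filesystem snapshot.
        print("restoring data files ...")

    def start_replication(after_seqno, epoch):
        # Placeholder: bring the replicator online at the recorded position.
        print("replicating after seqno %d (epoch %d)" % (after_seqno, epoch))

    def provision_slave(metadata):
        """Load the backup on any host in either site, then start replication
        from the recorded position; the new slave catches up from whichever
        master currently holds the floating IP."""
        restore_datafiles()
        start_replication(metadata["seqno"], metadata["epoch"])

    provision_slave(backup_metadata)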

The phrase "transactional consistency" brings up another issue.  To make the disaster recovery architecture work reliably I strongly recommend you switch to InnoDB or another fully transactional engine.  MyISAM does not have a place in this architecture--there are just too many ways to end up with corrupt data and a massive outage.
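A quick way to find offending tables is to query information_schema, for example via the stock mysql command-line client as sketched below; converting a table is then a matter of ALTER TABLE ... ENGINE=InnoDB. The sketch assumes you can connect as a user with the necessary privileges.

    import subprocess

    # List MyISAM tables outside the system schemas; these are the candidates
    # for conversion to InnoDB before relying on this DR architecture.
    QUERY = ("SELECT table_schema, table_name FROM information_schema.tables "
             "WHERE engine = 'MyISAM' "
             "AND table_schema NOT IN ('mysql', 'information_schema')")

    def list_myisam_tables(host="localhost"):
        out = subprocess.check_output(["mysql", "-h", host, "-e", QUERY])
        return out.decode().strip()

    if __name__ == "__main__":
        print(list_myisam_tables() or "No MyISAM tables found.")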

There is one final aspect of global IDs in Tungsten that is worth mentioning. What if the master log is corrupted, or a slave from a different cluster accidentally logs into the wrong master? In both cases the slave could get bad data if it simply asked for the next seqno without checking that the master and slave logs are consistent with each other. This would at best lead to errors and at worst to badly messed up data.

Tungsten deals with log consistency problems using epoch numbers. Whenever the master goes online it sets a new epoch number, which works like a parity check on the sequence number. Each time a slave connects to the master, it offers the last seqno it received along with that seqno's epoch number. If the pair matches the corresponding entry in the master log, we assume the logs share the same history and proceed. Otherwise, we assume somebody is confused and do not allow the slave to fetch transactions.
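Here is a simplified sketch of that check. The master log is modeled as a plain mapping from seqno to epoch number, which is not how the real log is stored, but it shows the decision the master makes when a slave connects.

    def accept_slave(master_log, slave_seqno, slave_epoch):
        """Serve the slave only if the master's log records the same epoch for
        the slave's last applied seqno; otherwise the two logs come from
        different histories and the slave must not fetch anything."""
        return master_log.get(slave_seqno) == slave_epoch

    # The master set epoch 7 the last time it went online.
    master_log = {100: 6, 101: 7, 102: 7}
    print(accept_slave(master_log, 101, 7))   # True:  same history, proceed
    print(accept_slave(master_log, 101, 3))   # False: slave from a different history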

Conclusion

DR site setup is complex and this article obviously glosses over a lot of details even for databases.  One final bit of advice is that whatever you do, test the daylights out of it before deploying.  Site failures may be karmic but dealing with them is certainly not.  Site failover is a really bad time to find out you don't have the password to your DNS provider handy or that you have a network configuration problem on the DR site.  One customer I know put all the computers from his main site and DR site in a pile on his conference room table and tested (and retested and retested and retested) until he was completely satisfied with the results.  That is the soul of true disaster recovery.
