The Scale-Out Blog: 2010

Dec 12, 2010

Interested in Sponsoring Tungsten Open Source Features?

Over the last few months I have been pleasantly surprised by the number of people using open source builds of Tungsten. My company, Continuent, has therefore started to offer support for open source users and will likely expand these services to meet demand.

There have also been a number of requests to add specific features to open source builds, especially for replication. We have added a few already but are now considering pushing even more features into open source if we can find sponsors. These add to a number of great features already in open source like global transaction IDs, MySQL 5.0/5.1 support, basic Drizzle replication, transaction filtering, and many others.

Do you have special replication or clustering features you would like to see added to Tungsten? Specialized MySQL to PostgreSQL replication? Management and monitoring commands? Cool parallel replication problems? High-performance logging? Weird multi-master topologies? Talk to us about sponsoring new open source features. We're happy to do projects that solve interesting problems, benefit the open source databases community, and help grow Tungsten as a product.

Visit the Continuent website or send email directly to robert dot hodges at continuent dot com.

Nov 7, 2010

It's All about the Team

Earlier this week Giuseppe Maxia blogged about joining Continuent as Director of QA. Creating high quality systems for distributed data management is a hard but fascinating problem. I have been hooked on it myself for many years. Giuseppe brings the creativity as well as humor our team needs to nail this problem completely. I'm therefore delighted to know he will be focused on it.

That said, I'm even happier for another reason. Beyond solving any single problem, Giuseppe strengthens an already strong team. Ed Catmull of Pixar gave a great speech a few years ago about managing creative teams and why successful companies eventually fail. Among other things he asked the question whether it is the idea or the people who implement it that count most. His conclusion: great teams implement good ideas to build great products. But even more important, great teams can turn bad ideas into good ones, then go on to build great products from those ideas too. Pixar has proved this many times over.

I believe strongly in the power of great teams to create great products. Giuseppe, welcome to the team.

Oct 24, 2010

Parallel Replication on MySQL: Report from the Trenches

Single-threaded apply is one of the big downsides of MySQL's built-in replication, as Baron Schwartz pointed out a couple of days ago. While a master can process dozens of updates at once, slaves must apply them one after the other on a single thread. Add in disk I/O, and the result is very slow performance indeed. The obvious answer is parallel apply, namely writing multiple non-conflicting updates to the slave at once.

I have spent the last few months implementing parallel apply for Tungsten 2.0, which we are now testing at customer sites. In this article I would like to describe how Tungsten's parallel apply works as well as some of the lessons that have become apparent through the implementation.

There are a couple of big challenges in parallel apply. There is of course the practical problem of separating transactions into parallel streams, for example splitting them by database. This is known as sharding. Row updates are easy enough but MySQL also has statement replication. Transactions with statements require parsing, and there are ambiguous cases. If that's not enough, features like LOAD DATA INFILE have a complex implementation in the binlog and require specialized logic to shard correctly. In addition, parallel apply of any kind has a lot of corner cases that you have to solve completely or risk unpredictable failures. Here's an example: skipping transactions on the slave. You have to wait for the event, but what if some of the threads are already past it when you ask to skip? How do you synchronize access to the list of transactions to skip without creating a choke point for threads?

The next challenge is performance. Parallel replication offers a rich choice of ways to lower throughput, not raise it. Multiple disk logs are the best I have found so far, as they can convert sequential reads and writes on the disk log to random I/O when more replication threads contend for different parts of the disk. Implementing multiple queues in memory is far faster and simpler but limits the queue sizes. Another excellent way to slow things down is to try to parallelize SQL transactions with a lot of dependencies, which means you end up effectively serialized *and* paying the extra cost of parsing transactions and synchronizing threads. In this case it can be better to keep everything sequential but use block commit to apply 50 or 100 transactions simultaneously on the slave.
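To make block commit concrete, here is a minimal sketch. It assumes an in-memory SQLite database standing in for the MySQL slave, and the function name apply_with_block_commit is hypothetical rather than part of Tungsten. Transactions are still applied strictly in order, but the commit is issued once per block instead of once per transaction.

```python
import sqlite3


def apply_with_block_commit(conn, transactions, block_size=50):
    """Apply transactions in order, committing once per block.

    Each transaction is a list of (sql, params) pairs. Returns the
    number of commits issued.
    """
    commits = 0
    pending = 0
    cur = conn.cursor()
    for statements in transactions:
        for sql, params in statements:
            cur.execute(sql, params)
        pending += 1
        if pending >= block_size:
            conn.commit()          # one commit covers the whole block
            commits += 1
            pending = 0
    if pending:
        conn.commit()              # flush the final partial block
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
txns = [[("INSERT INTO t (id, v) VALUES (?, ?)", (i, "x"))] for i in range(120)]
commits = apply_with_block_commit(conn, txns, block_size=50)
rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

The trade-off is that a failure in the middle of a block rolls back the whole block, so the replicator has to restart from the last committed block boundary.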

With all that said, the parallel apply problem is still quite tractable, but you need to pick your battles carefully. Tungsten's parallel apply implementation has a very clear problem focus: speeding up slave updates for multi-tenant applications that have a high degree of natural partitioning and concurrent updates across customers. This is not as limiting as it might sound to readers unfamiliar with MySQL. SaaS applications for the most part follow the multi-tenant model on MySQL, with each customer assigned to a particular database. So do large ISPs or cloud providers that host customers on shared servers using separate databases.

Tungsten parallel apply is based on automatic sharding of transactions. The following diagram shows the parallel apply algorithm conceptually.

Tungsten Parallel Apply

Tungsten has a flexible architecture based on replication pipelines, described in a previous article on this blog. To recap the model, pipelines are divided into stages, which represent processing steps. Each stage consists of an extract-filter-apply loop with symmetric interfaces and identical processing logic for each stage. The parallel apply implementation builds on replication pipelines as follows:

1. A new filter called EventMetadataFilter automatically parses incoming transactions to figure out which database(s) they affect. This is simple for row updates but involves parsing for statements and specialized extract handling for odd-ball operations like LOAD DATA INFILE.
2. The shard ID is assigned from the database name. This is glommed into the EventMetadataFilter but will shortly be broken out into a separate filter so that it is possible to support alternate shard assignment algorithms.
3. There is a new kind of in-memory buffer between stages called a ParallelQueue that supports multiple queues that feed the final apply stage. Stages have a corresponding extension to allow them to have multiple threads, which must match the number of parallel queues or you get an error.
4. The ParallelQueue implementation calls a new component called a Partitioner to assign transactions a partition number (i.e., a parallel queue). You can substitute different algorithms by providing different partitioner implementations. The default implementation uses a configuration file called shard.list to map shards to queues. Unless you say otherwise it hashes on the shard ID to make this assignment.
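The shard assignment and partitioning steps above can be sketched in a few lines of Python. This illustrates the idea only and is not Tungsten's code: the names shard_id_for, HashPartitioner, and UNKNOWN_SHARD are hypothetical, the overrides dictionary merely stands in for whatever shard.list contains, and CRC32 is an arbitrary choice of stable hash.

```python
import zlib

UNKNOWN_SHARD = "#UNKNOWN"  # hypothetical marker for "don't know" cases


def shard_id_for(databases):
    """Return the shard ID for a transaction given the databases it touches."""
    # Assumption: one database means a shard named after it; anything
    # else (none, or several) is treated as unknown.
    if len(databases) == 1:
        return databases[0]
    return UNKNOWN_SHARD


class HashPartitioner:
    """Hypothetical partitioner mapping a shard ID to one of N queues."""

    def __init__(self, num_queues, overrides=None):
        if num_queues < 1:
            raise ValueError("need at least one queue")
        self.num_queues = num_queues
        # Explicit shard -> queue assignments, standing in for shard.list.
        self.overrides = dict(overrides or {})
        for shard, queue in self.overrides.items():
            if not 0 <= queue < num_queues:
                raise ValueError("queue out of range for shard %s" % shard)

    def partition(self, shard_id):
        if shard_id in self.overrides:
            return self.overrides[shard_id]
        # crc32 is stable across runs, unlike Python's built-in str hash.
        return zlib.crc32(shard_id.encode("utf-8")) % self.num_queues


partitioner = HashPartitioner(4, overrides={"reference": 0})
queue = partitioner.partition(shard_id_for(["customer_17"]))
```

Because the hash is deterministic, every transaction for a given shard lands on the same queue, which preserves ordering within a shard while letting different shards proceed independently.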

Extensions #1 and #2 run on the master, while #3 and #4 run on the slave. I really like diagrams, so here is a picture of the fully implemented parallel apply architecture. The master replicator extracts, assigns the shard, and logs each transaction. The slave replicator fetches transactions, logs them locally, then applies in parallel.

Full Master/Slave Architecture for Parallel Apply

So how does this work? Pretty well actually. Local lab tests indicate that parallel apply roughly doubles throughput on a multi-database TPC-B benchmark we use for testing. We should be able to publish some real-world performance numbers in the near future, but so far things look quite promising. During the implementation a number of interesting issues have arisen, which I would like to discuss now.

The first issue is the ratio between parallel apply threads and shards. While it might seem obvious to have a thread per shard, in real deployments the situation is not so clear. For one thing actual deployments in SaaS and ISP situations often have hundreds or even thousands of databases, which has a number of practical consequences for implementation. Less obviously, spreading transactions thinly across a lot of queues means fewer opportunities to use block commit, hence more work for slave servers and less overall throughput. Performance optimization is a very uncertain matter, so Tungsten lets users configure the ratio.

Dependencies between shards are yet another issue. While I mentioned that Tungsten is designed for applications with "a high degree of natural partitioning," dependencies between databases as well as individual transactions do occur and cannot be ignored. For example, many SaaS applications have reference data that are used by all customer databases. Even if parallel SQL works here, applications may get sick from seeing updates appear in the wrong order. Or you could have global operations like CREATE USER that affect all databases. Or you might not be able to tell which shard a piece of SQL belongs to. Tungsten allows users to declare reference databases and automatically serializes these databases as well as global or "don't know" cases.
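One way to picture the serialization rule is as a barrier: ordinary shard events accumulate into a group that can be applied in parallel, while an event for a reference database or an unknown shard is applied alone after everything before it has finished. The sketch below assumes events are simple (seqno, shard_id) tuples; plan_batches and the shard names are hypothetical, and a real replicator would enforce this with thread synchronization rather than by building lists.

```python
def plan_batches(events, critical_shards):
    """Group (seqno, shard_id) events into ordered batches.

    A "parallel" batch may be spread across apply queues. A "serial"
    batch holds exactly one event that must run alone, acting as a
    barrier between the batches on either side.
    """
    batches = []
    current = []
    for seqno, shard in events:
        if shard in critical_shards:
            if current:
                batches.append(("parallel", current))
                current = []
            batches.append(("serial", [(seqno, shard)]))
        else:
            current.append((seqno, shard))
    if current:
        batches.append(("parallel", current))
    return batches


events = [(1, "a"), (2, "b"), (3, "reference"), (4, "a"), (5, "#UNKNOWN"), (6, "b")]
plan = plan_batches(events, critical_shards={"reference", "#UNKNOWN"})
```

With the sample input, events 1 and 2 can run concurrently, event 3 runs alone, event 4 follows, event 5 runs alone, and event 6 comes last.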

There are also numerous issues around startup and shutdown. Remember how MySQL replication slaves will not restart after unclean shutdown with open temp tables? (If not, take a quick break and read this now. You'll thank me later.) Parallel apply introduces similar issues, because you have multiple threads all updating different positions in the database. Tungsten handles crash recovery by tracking the apply position of each queue in InnoDB and then recommencing from that point on restart in each queue. I am putting finishing touches on clean shutdown, which ensures that all queues are empty, much like automatically checking that temp tables are closed on MySQL.
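A rough sketch of the restart logic, assuming each queue records the seqno of the last transaction it committed: on restart the log is re-read from just after the lowest recorded position, and each queue discards anything at or below its own position. The function names are hypothetical and the re-read-and-skip approach is an assumption made for illustration, not a description of Tungsten internals.

```python
def restart_seqno(committed):
    """Return the first seqno to re-read from the log after a crash.

    committed maps queue number -> last seqno that queue committed.
    """
    return min(committed.values()) + 1


def should_apply(queue, seqno, committed):
    """True if this queue has not yet committed the given seqno."""
    return seqno > committed[queue]


# Positions as they might be recorded in the database at crash time.
positions = {0: 105, 1: 98, 2: 110}
first = restart_seqno(positions)
```

Storing each position in the same transaction as the data it describes is what makes this safe, since the position and the applied updates either both survive a crash or both roll back.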

In short, over the last few months Tungsten has climbed a fair distance up a pretty big hill to get parallel apply to work. The flexibility of the replicator architecture, particularly pipelines, has been very helpful as it is quite easy to extend. The parallelization algorithm builds on terrific work by other colleagues at Continuent, especially Stephane Giron and Linas Virbalas. They have both put enormous effort into building up MySQL and PostgreSQL replication capabilities.

Here are a couple of parting thoughts about parallelization based on the experience so far.

Thought number one: parallel replication is not magic. To use parallel apply effectively, applications need to play nice: mostly short transactions and not too many dependencies between shards are the biggest requirements to see a substantial boost in throughput. For example, if you let one user write 50M statements to the binlog in a single transaction, things are going to get kind of quiet on the slave no matter what you do. Also, you can forget about MyISAM or other non-transactional engines. As I have written before, these engines offer a number of opportunities for databases to get messed up or out-of-sync even using conventional MySQL replication. Tungsten's block commit and parallel apply increase the window for problems significantly. If you are still using MyISAM for replicated data, it's time to man up and convert to InnoDB.

Thought number two: The long-term answer to effective parallel replication is to change how MySQL works by interleaving transactions within the binlog along the lines suggested by Kristian Nielsen and others. MySQL currently completely serializes transactions to the binlog, an accomplishment that makes slave apply logic a lot simpler. Tungsten parallel apply then has to undo this good work and recreate streams of non-conflicting updates, which is complex and does not help all workloads.

It is doubtful that replicating interleaved transactions will be less complex than handling a serial binlog as it stands today. There is also some heavy lifting inside MySQL to get to an interleaved binlog. However, interleaved transactions would have the advantage that transactions for any workload would be parallelized, which would widen the scope of benefits to users. I'm happy to see that Kristian and other people are now working on this feature for future releases of MySQL.

Meanwhile, we have a workable solution for Tungsten and are pushing it forward as quickly as we can. Contact Continuent if you would like to test it out.

Oct 16, 2010

MySQL Disaster Recovery With Tungsten

Disaster recovery (DR) is not the first thing most DBAs think of when putting up a new database application. However, it's one of the top issues for people using the data--what happens if the site goes down and everything disappears? So even if DR is not the first issue in every deployment, it is a very high priority as soon as your application is the least bit successful.

At the database level DR has a fairly simple solution: keep copies of data on a backup site that is up-to-date at all times. This article explains the architecture for MySQL DR with Tungsten and a couple of key features that make it work, namely floating IP addresses and global transaction IDs. We will dig into those at the end.

First a bit of introduction. Tungsten manages clusters of off-the-shelf databases connected by master/slave replication. There are replication and management services on each host with automated policies to handle failover as well as low-level tasks like recognizing new cluster members. There is a simple management client that lets you log into any host and manage all nodes in the cluster. Tungsten also has connectivity options to let applications find databases easily. However, for this article we are going to focus on the database only and how you solve the problem of ensuring your data are protected.

DR Setup

To implement disaster recovery, you actually create two clusters--one on your main site and one on a backup site which we will henceforth call the DR site. It looks like the following picture.

Standard Main/DR Architecture with Backups

Here is an outline of the setup. There are additional details of course but those are covered in Tungsten documentation and support procedures. The goal here is to give you a sense of how things work at the top level.

Main site. Set up the main site cluster as a master/slave pair with a floating IP address on the master. Enable automatic policy mode so that in the event of a master failure the local slave will immediately take over. Set up backups and run them on the slave on a regular basis.
DR site. Next, set up the DR cluster by provisioning both databases with a recent backup from the main cluster. Configure it identically to the main site with a master IP address and with backups but with two exceptions. First, use manual policy mode so that the cluster does not try to fail over. Second, do not start replication automatically. Instead, manually configure the DR master to be a slave of the main site master using the master floating IP address and start services. Set up backups on this site as well.

Handling Failures

At the end of setup you have a main site with a cluster and a DR site with a cluster that slaves efficiently off the main site master. Both sites have regular backups. As long as there are no failures, you operate both sites and everything is fine. Let us now consider a couple of different types of failures and how to handle them.

Let's suppose the main site master fails. Tungsten will automatically fail over to the main site slave and move the master floating IP address. The DR site relay slave TCP/IP connection to the master will then break, or more accurately time out. When the relay slave reconnects to the floating IP, it will have shifted to the new master and replication to the DR site will continue without any human intervention.

Failed Master on Main Site

This protocol is handy because failures are not the only reason that the main site master may move. You can also move masters for maintenance or upgrades. Tungsten has a switch command that makes this very easy to do. The floating IP moves as before and the DR site continues to receive updates properly after it reconnects.

If you lose the main site, you initiate a site switch procedure. At the database level this consists of running a script to "unconfigure" your DR relay slave node so that it becomes a master again and then reloading the configuration. When the node comes up as a master it will then automatically install its own master floating IP address. The commands are simple and run in a few seconds. In most cases it will take a lot longer to switch applications properly than to switch databases, because you have to change DNS entries, start and/or reconfigure applications, and potentially activate other resources to have a functioning system.

In fact, the real problem with site failover at the database level is not so much failing over but getting the main site back in operation without losing too much data and with as little interruption to users as possible. You first need to check for any transactions that did not make it off the main site and apply them to the DR site master. In MySQL you can do this by carefully applying transactions from the main site binlog. You can help yourself considerably by including a step in the site failover process where you fence (i.e., turn off) the old site as quickly as possible by shutting down applications and taking applications offline. The fewer extra transactions on the main site, the simpler it is to clean up.

Next, you need to get the master site resynchronized with the slave. If there are more than a few differences, you will probably just restore the main site master and slave from local backups, then manually configure them to make the main site master a relay slave of the DR site. If you have large databases, you may want to look at SAN or NAS products like NetApp that offer snapshot capabilities. I have been working lately with NetApp; the snap restore command is really impressive for rolling back file system state quickly.

DR Site Operation and Main Site Recovery

Once the main site is caught up, you can switch applications back to the main site by taking a short outage to move applications. This step is not fully transparent, but unlike the original DR failover, you get to pick the time that is least inconvenient for your users. Also, you can use Tungsten features like consistency checks to verify that data are consistent across sites.

Underlying Tungsten Features to Enable DR

As promised at the beginning, here is a look at the Tungsten features that make DR work. First, there is automated failover with floating IP address management. Tungsten uses a rules engine combined with group communications to manage failover and master floating IPs efficiently. The rules take care of many of the weird failure cases as well as handling tasks like automatically making slave servers readonly, etc. Setting up DR without floating IP addresses is more complex because it means your relay slave needs to know when the main site master moves for any reason.

As useful as floating IP addresses are, Tungsten has a much more important feature that underlies the entire DR site architecture: global transaction IDs. Unlike native MySQL replication, Tungsten assigns a global ID or seqno to each transaction as it is read from the binlog. Tungsten replicator processes track position using the seqno values rather than the file name and offset used by MySQL slaves. Here is a picture that illustrates how the replicator log works.

Global IDs, Epoch Numbers, and Backups

As already mentioned, the Tungsten master replicator assigns the seqno to each transaction as it is extracted. Tungsten slave replicators always use the seqno to request the next event from the master. This means that you can switch the master without worrying whether slaves will lose track of their positions, because they will just ask for the seqno from the new master.

The other important feature of global IDs is that they make backups fungible across different databases and even sites. Tungsten marks the database with the current seqno and epoch number. As long as your backup (or file system snapshot) is transactionally consistent, you can load it on any server and bring it back online as a slave. The new slave will connect to and catch up with the master, wherever it happens to be. This makes database recovery both simple and very flexible.

The phrase "transactional consistency" brings up another issue. To make the disaster recovery architecture work reliably I strongly recommend you switch to InnoDB or another fully transactional engine. MyISAM does not have a place in this architecture--there are just too many ways to end up with corrupt data and a massive outage.

There is one final aspect of Global IDs in Tungsten that is worth mentioning. What if the master log is corrupted or a slave from a different cluster accidentally logs into the wrong master? In both cases the slave could get bad data if it just asked for the next seqno without somehow checking that the master and slave logs are consistent. This would at best lead to errors and in the worst case to badly messed up data.

Tungsten deals with log consistency problems using epoch numbers. Whenever the master goes online it sets a new epoch number, which works like a parity check on the sequence number. Each time a slave connects to the master, it offers the last seqno it received along with the epoch number. If the values match the same seqno/epoch number in the master log, we assume the logs have the same master and proceed. Otherwise, we assume somebody is confused and do not allow the slave to fetch transactions.
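The seqno and epoch handshake can be modeled with a toy log. This is a sketch under stated assumptions, not Tungsten's implementation: the MasterLog class and its methods are hypothetical, and using the seqno of the first event after going online as the epoch number is a choice made here for illustration.

```python
class MasterLog:
    """Toy replicator log that records an epoch number with every seqno."""

    def __init__(self):
        self.events = {}       # seqno -> epoch number stored with that event
        self.epoch = None
        self.next_seqno = 0

    def go_online(self):
        # Assumption: the new epoch number is the next seqno to be written.
        self.epoch = self.next_seqno

    def append(self):
        seqno = self.next_seqno
        self.events[seqno] = self.epoch
        self.next_seqno += 1
        return seqno, self.epoch

    def handshake(self, last_seqno, last_epoch):
        """Return the next seqno to send, or raise if the logs disagree."""
        if last_seqno not in self.events or self.events[last_seqno] != last_epoch:
            raise ValueError("seqno/epoch mismatch: slave log does not match master")
        return last_seqno + 1


master = MasterLog()
master.go_online()
for _ in range(3):
    master.append()            # seqno 0..2, epoch 0
master.go_online()             # master restarts; epoch becomes 3
master.append()                # seqno 3, epoch 3
next_for_slave = master.handshake(2, 0)
```

A slave that last saw seqno 2 with epoch 0 is accepted and resumes at seqno 3. A slave from another cluster that offers seqno 3 with epoch 0 is rejected, because this master recorded epoch 3 for that seqno.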

Conclusion

DR site setup is complex and this article obviously glosses over a lot of details even for databases. One final bit of advice is that whatever you do, test the daylights out of it before deploying. Site failures may be karmic but dealing with them is certainly not. Site failover is a really bad time to find out you don't have the password to your DNS provider handy or that you have a network configuration problem on the DR site. One customer I know put all the computers from his main site and DR site in a pile on his conference room table and tested (and retested and retested and retested) until he was completely satisfied with the results. That is the soul of true disaster recovery.

Apr 23, 2010

MySQL Conference Slides and Thoughts on State of the Dolphin

I did two talks on replication and clustering at the recent MySQL Conference in Santa Clara. Thanks to all of you who attended as well as the fine O'Reilly folks who organized everything. Slides are posted on the talk descriptions at the following URLs:

Clustering for the Masses - A Gentle Introduction to Tungsten for MySQL

Not Your Grandpa’s Replication-The New Wave of MySQL Replication and How It Helps Your Applications (A collaborative talk with Jay Pipes)

Conferences like the MySQL UC are fun because you get to see all your virtual pals in the flesh and have a beer with them. This is one of the fundamental open source bonding experiences. Unfortunately the taps for draft beer stopped working at the bar, and Tungsten is in the middle of a big crunch to get parallel replication working. I didn't get to hang around a lot this year. A few things still stood out compared to 2009.

First of all, long-term effects of the Oracle acquisition are clear. Edward Screven's keynote on "The State of the Dolphin" was sensible and boring. It seemed a telling metaphor for life in the community going forward. Oracle is going to do an adequate job of MySQL engineering and better than adequate for Windows. This is of course "adequate" in the same way that the word applies to products like Microsoft Word.

An adequate MySQL is probably the worst possible outcome for the groups trying to build businesses on alternative builds. It looks like an effective way for Oracle to neutralize competitive threats from below for a few years to come. On the other hand, it's good for most users, who won't be greatly inclined to switch unless Oracle tries to soak them for big licensing fees. At least one conference attendee, a licensee of other Oracle products, mentioned that had already happened. He's a MariaDB fan now.

Second, solid state devices (SSDs) are for real. Andreas Bechtolsheim gave a great talk on the coming SSD revolution at the 2009 MySQL Conference. It sounded good. At the 2010 conference we started to see some real test results. The hype on SSDs is completely justified.

There was an excellent panel talk sponsored by Fusion-IO that presented some very compelling results including 10x throughput improvements that allowed one of the companies doing the testing to pull out and repurpose 75% of their hosts. PCI-based Fusion-IO cards have a 300- to 400X price differential compared to basic rotating disk, but the cost is likely to drop pretty quickly as the technology matures and more competitors enter the field. Much cheaper SATA alternatives like the Intel X-25 are already starting to flood the low-end market. Anybody building database systems has to have a plan that accounts for SSDs now.

Third, innovation is continuing apace but the problems (and solutions) are moving away from MySQL. Mark Callaghan really put his finger on it at his Ignite MySQL talk when he said, "In 3 years MySQL won't be the default DBMS for high-scale applications." New system investment is going into applications that handle big data, have to utilize new hardware efficiently to operate economically, and require multi-tenancy. These are good targets for Drizzle, PBXT, Tungsten, and other new projects working to make names for themselves. We all have to raise our game or MySQL will start to become irrelevant. It's going to be an interesting year. :)

Apr 20, 2010

Customized Data Movement with Tungsten Replicator Pipelines

Have you ever run into a problem where MySQL replication did 95% of what you needed but not the remaining 5% to solve a real problem? Hacking the binlog is always a possibility, but it typically looks like this example. Not a pretty sight. Wouldn't it be easier if replication were a bunch of building blocks you could recombine to create custom replicator processes?

Tungsten 1.3 has a new feature called pipelines that allows you to do exactly that. A pipeline consists of one or more stages that tie together generic components to extract, filter, store, and apply events, which is Tungsten parlance for transactions. Each stage has a processing thread, so multi-stage pipelines can process data independently and without blocking. The stages also take care of important but tedious issues like remembering the transactional state of each stage so Tungsten can restart without forgetting events or applying them twice.

Here is a picture of how a pipeline is put together.

When Tungsten Replicator starts it loads a pipeline corresponding to its role, for example master or slave. The preceding picture shows a slave pipeline consisting of two stages. The first stage pulls replicated events over the network from a master Tungsten Replicator and stores them in a local transaction history log, which we call the THL. The second stage extracts the stored events and applies them to the database. This pipeline is analogous to the I/O and SQL threads on a MySQL slave.

Where Tungsten departs from MySQL and most other replicators in a big way is that pipelines, hence the replication flows, are completely configurable. The configuration is stored in file replicator.properties. Here are the property settings to create the slave pipeline. Note how the role is the name of a pipeline. This determines which pipeline to run when the replicator goes online.

# Replicator role.  

replicator.role=slave

...

# Generic pipelines.
replicator.pipelines=master,slave,direct

...
# Slave pipeline has two stages:  extract from remote THL to local THL; 
# extract from local THL and apply to DBMS. 
replicator.pipeline.slave=remote-to-thl,thl-to-dbms
replicator.pipeline.slave.stores=thl
replicator.pipeline.slave.syncTHLWithExtractor=false

replicator.stage.remote-to-thl=com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask
replicator.stage.remote-to-thl.extractor=thl-remote
replicator.stage.remote-to-thl.applier=thl-local

replicator.stage.thl-to-dbms=com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask
replicator.stage.thl-to-dbms.extractor=thl-local
replicator.stage.thl-to-dbms.applier=mysql
replicator.stage.thl-to-dbms.filters=mysqlsessions

The syntax is not beautiful but it is quite flexible. Here is what this definition means.

This replicator knows about three pipelines named master, slave, and direct.
The slave pipeline has two stages called remote-to-thl and thl-to-dbms and a store called thl. It has a property named syncTHLWithExtractor which must be set to false for slaves. (We need to change that name to something like 'isMaster'.)
The remote-to-thl stage extracts from thl-remote. This extractor reads events over the network from a remote replicator. The stage applies to thl-local, which is an applier that writes events to the local transaction history log.
The thl-to-dbms stage pulls events from the local log and applies them to the database. Note that in addition to an applier and extractor, there is also a filter named mysqlsessions. This filter looks at events and modifies them to generate a pseudo-session ID, which is necessary to avoid problems with temporary tables when applying transactions from multiple sessions. It is just one of a number of filters that Tungsten provides.

Components like appliers, filters, extractors, and stores have individual configuration elsewhere in the replicator.properties file. Here's an example of configuration for a MySQL binlog extractor. (Note that Tungsten 1.3 can now read binlogs directly as files or relay them from a master server.)


# MySQL binlog extractor properties.  
replicator.extractor.mysql=com.continuent.tungsten.replicator.extractor.mysql.MySQLExtractor
replicator.extractor.mysql.binlog_dir=/var/log/mysql
replicator.extractor.mysql.binlog_file_pattern=mysql-bin
replicator.extractor.mysql.host=logos1-u1
replicator.extractor.mysql.port=3306
replicator.extractor.mysql.user=${replicator.global.db.user}
replicator.extractor.mysql.password=${replicator.global.db.password}
replicator.extractor.mysql.parseStatements=true

# When using relay logs we download from the master into binlog_dir.  This 
# is used for off-board replication. 
#replicator.extractor.mysql.useRelayLogs=false

The thing that makes pipelines really flexible is that the interfaces are completely symmetric. Components to extract events from MySQL binlog or from a transaction history log have identical APIs. Similarly, the APIs to apply events are the same whether storing events in a log or applying to a slave. Pipelines can tie together practically any sequence of extract, filter, and apply operations you can think of.
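The symmetry is easier to see in code. The following is a minimal sketch, not the Tungsten API: every class and function name is hypothetical, and events are plain dictionaries. Any object with an extract() method can feed a stage and any object with an apply() method can terminate one, so the same loop serves for log-to-log, log-to-database, or binlog-to-nowhere.

```python
class ListExtractor:
    """Extractor that reads events from an in-memory list."""

    def __init__(self, events):
        self._it = iter(events)

    def extract(self):
        return next(self._it, None)    # None signals end of input


class ListApplier:
    """Applier that stores events, standing in for a log or a database."""

    def __init__(self):
        self.applied = []

    def apply(self, event):
        self.applied.append(event)


class DummyApplier:
    """Applier that throws every event away."""

    def apply(self, event):
        pass


def drop_test_schema(event):
    """Example filter: returning None discards the event."""
    return None if event["schema"] == "test" else event


def run_stage(extractor, filters, applier):
    """One extract-filter-apply loop; returns the number of events applied."""
    count = 0
    while True:
        event = extractor.extract()
        if event is None:
            break
        for f in filters:
            event = f(event)
            if event is None:
                break
        if event is not None:
            applier.apply(event)
            count += 1
    return count


events = [{"seqno": 1, "schema": "app"},
          {"seqno": 2, "schema": "test"},
          {"seqno": 3, "schema": "app"}]
thl = ListApplier()                       # stage 1: remote source to local log
stored = run_stage(ListExtractor(events), [], thl)
dbms = ListApplier()                      # stage 2: local log to database
applied = run_stage(ListExtractor(thl.applied), [drop_test_schema], dbms)
```

Swapping ListApplier for DummyApplier in the first stage gives the equivalent of a pipeline that reads events and discards them, without touching run_stage at all.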

Here are diagrams of a couple of useful single-stage pipelines.

The "dummy" pipeline reads events directly from MySQL binlogs and just throws them away. This sounds useless but in fact it is rather convenient. You can use the dummy pipeline to check whether your binlogs are good. If you add filters you can also use a dummy pipeline to report on what is in the binlog. Finally, you can use it as a quick and non-intrusive check to see if Tungsten can handle the data in your binlog--a nice way to ensure you can migrate smoothly.

Here's the dummy pipeline definition:

# Generic pipelines. 
replicator.pipelines=master,slave,direct,dummy
...
# Dummy pipeline has single stage that writes from binlog to bit-bucket. 
replicator.pipeline.dummy=binlog-to-dummy
replicator.pipeline.dummy.autoSync=true

replicator.stage.binlog-to-dummy=com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask
replicator.stage.binlog-to-dummy.extractor=mysql
replicator.stage.binlog-to-dummy.applier=dummy
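
Stage properties are tied to a pipeline purely by name, so a mistyped stage name is easy to overlook. The Python sketch below is a hypothetical helper, not part of Tungsten. It assumes only the property naming pattern visible in the listing above, plus the assumption that a pipeline with several stages lists them separated by commas, and it reports any stage that lacks an extractor or applier property.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_stage_components(props, pipeline):
    """Return 'stage.component' names that the pipeline needs but lacks."""
    stages = props.get("replicator.pipeline." + pipeline, "")
    missing = []
    for stage in (s.strip() for s in stages.split(",") if s.strip()):
        for component in ("extractor", "applier"):
            if f"replicator.stage.{stage}.{component}" not in props:
                missing.append(f"{stage}.{component}")
    return missing


# Sample text for the sketch: one consistent definition, one with a typo.
GOOD = """
replicator.pipeline.dummy=binlog-to-dummy
replicator.stage.binlog-to-dummy.extractor=mysql
replicator.stage.binlog-to-dummy.applier=dummy
"""
BAD = GOOD.replace("binlog-to-dummy.applier", "binlog-to-dumy.applier")

print(missing_stage_components(parse_properties(GOOD), "dummy"))
print(missing_stage_components(parse_properties(BAD), "dummy"))
```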

The "direct" pipeline fetches events directly from a master MySQL server using client log requests over the network and applies them immediately to a slave. I use this pipeline to test master-to-slave performance, but it's also very handy for transferring a set of SQL updates from the binlog of any master to any slave on the network. For instance, you can transfer upgrade commands very efficiently out of the binlog of a successfully upgraded MySQL server to other servers on the network. You can also use it to "rescue" transactions that are stuck in the binlog of a failed master. That is starting to be genuinely useful.

The definition of the direct pipeline is already in the default replicator.properties.mysql template that comes with Tungsten 1.3, so it is not necessary to repeat it here. You can just download the software (open source version is here) and have a look at it yourself. There's almost more documentation than people can bear--look here to find a full set. Version 1.3 docs will be posted shortly on the website and are already available for commercial customers. As usual you can also view the source code on SourceForge.net.

Pipelines belong to a set of major feature improvements to Tungsten to support SaaS and large enterprise deployments. Some of the other features include fast event logging directly to disk (no more posting events in InnoDB), low-latency WAN transfer, multi-master replication support, and parallel replication. Stay tuned!

Mar 28, 2010

New Tungsten Software Releases for MySQL and PostgreSQL

I would like to announce a couple of new Tungsten versions available for your database clustering enjoyment. As most readers of this blog are aware, Tungsten allows users to create highly available data services that include replicated copies, distributed management, and application connectivity using unaltered open source databases. We are continually improving the software and have a raft of new features coming out this year.

First, there is a new Tungsten 1.2.3 maintenance release available in both commercial as well as open source editions. You can get access to the commercial version on the Continuent website, while the open source version is available on SourceForge.

The Tungsten 1.2.3 release focuses on improvements for MySQL users including the following:

Transparent session consistency for multi-tenant applications. This allows applications that follow some simple conventions like sharding tenant data by database to get automatic read scaling to slaves without making code changes.
A greatly improved script for purging history on Tungsten Replicator.
Fixes to binlog extraction to handle enum and set data types correctly.

By far the biggest improvement in this release is Tungsten product documentation, including major rewrites for the guides covering management and connectivity. Even the Release Notes are better. If you want to find out how Tungsten works, start with the new Tungsten Concepts and Administration Guide.

Second, there's a new Tungsten 1.3 release coming out soon. Commercial versions are already in use at selected customer sites, and you can build the open source version by downloading code from SVN on SourceForge.

The Tungsten 1.3 release sports major feature additions in the following areas:

A new replicator architecture that allows you to manage non-Tungsten replication and also to configure very flexible replication flows to use multi-core systems more effectively and implement complex replication topologies. The core processing loop for replication can now cycle through 700,000 events per second on my laptop--it's really quick.
Much improved support for PostgreSQL warm standby clustering as well as provisional management of new PostgreSQL 9 features like streaming replication and hot standby.
Replication support for just about everything in the MySQL binlog: large transactions, unsigned characters, session variables, various permutations of character sets and binary data, and ability to download binlog files through the MySQL client protocol. If you can put it in the binlog we can replicate it.

We also have provisional support for Drizzle thanks to Marcus Eriksson, plus a raft of other improvements. This has been a huge amount of work all around, so I hope you'll enjoy the results.

P.s., Contact Continuent if you want to be a beta test site for Tungsten 1.3.

Mar 22, 2010

Replication and More Replication at 2010 MySQL Conference

Database replication is still interesting after all these years. Two of my talks focused on replication technology were accepted for the upcoming MySQL 2010 Conference. Here are the summaries.

Clustering for the Masses - A Gentle Introduction to Tungsten for MySQL

Not Your Grandpa’s Replication-The New Wave of MySQL Replication and How It Helps Your Applications

The first talk is a solo presentation covering Tungsten, which creates highly available and scalable database clusters using vanilla MySQL databases linked by flexible replication. I'll describe how it works and some cool things you can do like zero-downtime upgrades and session-based performance scaling. If you want to know how Tungsten can help you, this is a good time to find out.

The second talk is a joint effort with Jay Pipes covering issues like big data that are driving replication technology and the solutions to these problems available to MySQL users. We'll lay out our vision of where things are going to try to help you pick the right technology for your next project. Jay and I are also soliciting input on this talk from the Drizzle community among others. If you are interested check out the thread on drizzle-discuss or post to this blog.

Finally, I'll be around for much of the MySQL conference, so if you are interested in Tungsten or data replication in general or just want to hang out, please look me up. See you in Santa Clara!

Tungsten and PostgreSQL 9 at PG-East Conference

My Continuent colleagues Linas Virbalas and Alex Alexander will be giving a talk entitled Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming Replication later this week at the PG-East Conference in Philadelphia. I saw the demo last week and it's quite impressive. You can flip the master and slaves for maintenance, open slaves for reads, fail over automatically, etc. It's definitely worth attending if you are in Philly this week.

Looking beyond the conference, we plan to be ready to support Tungsten clusters on PostgreSQL 9 as soon as it goes into production. Everything we have seen so far indicates that the new log streaming and hot standby features are going to be real hits. They not only help applications, but from a clustering perspective queryable slaves with minimal replication lag are also a lot easier to manage. Alex and Linas will have more to say about that during their presentation.

Meanwhile, I'm sorry to miss the PG-East conference but wish everyone who will be attending a great time. See you later this year at PG-West!

Jan 28, 2010

MariaDB is Thinking about Fixing MySQL Replication and You Can Help

In case you have not noticed, MariaDB is joining the list of projects thinking about how to improve MySQL replication. The discussion thread starts here on the maria-developers mailing list.

This discussion was jointly started by Monty Program, Codership, and Continuent (my employer) in an effort to push the state of the art beyond features offered by the current MySQL replication. Now that things are starting to die down with the Oracle acquisition, we can get back to the job of making the MySQL code base substantially better. The first step in that effort is to get a discussion going to develop our understanding of the replication problems we think are most important and outline a strategy to solve them.

Speaking as a developer on Tungsten, my current preference would be to improve the existing MySQL replication. I suspect this would also be the preference of most current MySQL users. However, there are also more radical approaches on the table, for example from our friends at Codership, who are developing an innovative form of multi-master replication based on group communications and transaction certification. That's a good thing, as we want a range of contrasting ideas that take full advantage of the creativity in the community on the topic of replication.

If you have interest in improving MySQL replication please join the MariaDB project and contribute your thoughts. It should be an interesting conversation.

Jan 27, 2010

Tungsten 1.2.2 Release is Out - Faster, More Stable, More Fun

Release 1.2.2 of Tungsten Clustering is available on SourceForge as well as through the Continuent website. The release contains mostly bug fixes in the open source version but there are also two very significant improvements of interest to all users.

The manager and monitoring capabilities of Tungsten are completely integrated on the same group communications channel. This fixes a number of problems that caused data sources not to show up properly in older versions.
We are officially supporting a new Tungsten Connector capability for MySQL called pass-through mode, which allows us to proxy connections by transferring network blocks directly rather than translating native request protocol to JDBC calls. Our tests show that it speeds up throughput by as much as 200% in some cases.

The commercial version has additional features like PostgreSQL warm standby clustering, add-on rules to manage master virtual IP addresses and other niceties. If you are serious about replication and clustering it is worth a look.

This is a good time to give a couple of reminders for Tungsten users. First, Tungsten is distributed as a single build that integrates replication, management, monitoring, and connectivity. The old Tungsten Replicator and Myosotis builds are going away. Second, we have a single set of docs on the Continuent website that covers both open source and commercial distributions.

With that, enjoy the new release. If you are using the open source edition, please post your experiences in the Tungsten community forums or write a blog article. We would love to hear from you.

P.s., We have added Drizzle support thanks to a patch from Marcus Eriksson but it's not in 1.2.2. For that you need to build directly from the SVN trunk. Drizzle support will be out in binary builds as part of Tungsten version 1.3.

Jan 17, 2010

What's in Your Binlog?

Over the last couple of months I have run into a number of replication problems where I needed to run reports on MySQL binlogs to understand what sort of updates servers were processing as well as to compute peak and average throughput. It seems that not even Maatkit has a simple tool to report on binlog contents, so I wrote a quick Perl script called binlog-analyze.pl to summarize output from mysqlbinlog, which is the standard tool to dump binlogs to text.

The script operates as a filter with the following syntax:

Usage: ./binlog-analyze.pl [-h] [-q] [-v]
Options:
  -h : Print help
  -q : Suppress excess output
  -v : Print verbosely for debugging

To get a report, you just run mysqlbinlog on a binlog file and pipe the results into binlog-analyze.pl. Here is a typical invocation and its output. The -q option keeps the output as short as possible.

$ mysqlbinlog /var/lib/mysql/mysql-bin.001430 | ./binlog-analyze.pl -q
===================================
| SUMMARY INFORMATION             |
===================================
Server Version    : 5.0.89
Binlog Version    : 4
Duration          : 1:03:37 (3817s)

===================================
| SUMMARY STATISTICS              |
===================================
Lines Read        :        17212685
Events            :         3106006
Bytes             :      1073741952
Queries           :         2235077
Xacts             :          817575
Max. Events/Second:            5871.00
Max. Bytes/Second :         1990077.00
Max. Event Bytes  :          524339
Avg. Events/Second:             813.73
Avg. Bytes/Second :          281305.20
Avg. Queries/Sec. :             585.56
Avg. Xacts/Sec.   :             214.19
Max. Events Time  :         9:01:02

===================================
| EVENT COUNTS                    |
===================================
Execute_load_query   :           10
Intvar               :        53160
Query                :      2235077
Rotate               :            1
Start                :            1
User_var             :          182
Xid                  :       817575

===================================
| SQL STATEMENT COUNTS            |
===================================
begin                :       817585
create temp table    :            0
delete               :        31781
insert               :           20
insert into          :       411266
select into          :            0
update               :       633857

There are lots of things to see in the report, so here are a few examples. For one thing, peak update rates generate 5871 events and close to 2 MB of log output per second. That's loaded but not enormously so--MySQL replication can easily dump over 10,000 events per second into the binlog using workhorse 4-core machines. The application(s) connected to the database execute a large number of fast, short transactions--typical of data logging operations, for example storing session data. We can also see from the Execute_load_query events that somebody executed MySQL LOAD DATA INFILE commands. That's interesting to me because we are just putting them into Tungsten and need to look out for them in user databases.
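
The average rates in the report follow directly from the totals and the duration. As a quick sanity check, this small Python sketch (separate from the Perl script, and using only the numbers printed in the report) recomputes them by dividing each total by the 3817-second duration.

```python
# Totals and duration copied from the sample report.
duration_s = 1 * 3600 + 3 * 60 + 37  # Duration 1:03:37
totals = {
    "Events": 3106006,
    "Bytes": 1073741952,
    "Queries": 2235077,
    "Xacts": 817575,
}

# Average rate = total count / elapsed seconds, rounded like the report.
averages = {name: round(total / duration_s, 2) for name, total in totals.items()}

for name, avg in averages.items():
    print(f"Avg. {name}/Second: {avg:.2f}")
```

The results match the Avg. lines in the report, which is a handy way to confirm how the script defines its averages.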

To interpret the binlog report most effectively, you need to understand MySQL binlog event types. MySQL replication developers have kindly provided a very helpful description of the MySQL binlog format that is not hard to read. You'll need to refer to it if you get very deeply into binlog analysis. It certainly beats reading the MySQL replication code, which is a bit of a thicket.

Anyway, I hope this script proves useful. As you may have noted from the URL the script is checked into the Tungsten project on SourceForge and will be part of future releases. I plan to keep tweaking it regularly to add features and fix bugs. Incidentally, if you see any bugs let me know. There are without doubt a couple left.

Jan 2, 2010

Exploring SaaS Architectures and Database Clustering

Software-as-a-Service (SaaS) is one of the main growth areas in modern database applications. This topic has become a correspondingly important focus for Tungsten, not least of all because new SaaS applications make heavy use of open source databases like MySQL and PostgreSQL that Tungsten supports.

This blog article introduces a series of essays on database architectures for SaaS and how we are adapting Tungsten to enable them more easily. I plan to focus especially on problems of replication and clustering relevant to SaaS—what are the problems, what are the common design patterns to solve them, and how to deploy and operate the solutions. I will also discuss how to make replication and clustering work better for these cases—either using Tungsten features that already exist or features we are designing.

I hope everything you read will be solid, useful stuff. However, I will also discuss problems where we are in effect thinking out loud about on-going design issues, so you may also see some ideas that are half-baked or flat-out wrong. Please do me the kindness of pointing out how they can be improved.

Now let's get started. The most important difference between SaaS applications and ordinary apps is multi-tenancy. SaaS applications are typically designed from the ground up to run multiple tenants (i.e., customers) on shared software and hardware. One popular design pattern is to have users share applications but keep each tenant's data stored in a separate database, spreading the tenant databases over multiple servers as the number of tenants grows.
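
As a rough illustration of the one-database-per-tenant pattern, the Python sketch below keeps a catalog that maps each tenant to a server and a database. Everything in it is made up for the example: the host names, the tenant_ database prefix, and the rule of placing new tenants on the least-loaded server. None of it is a Tungsten feature.

```python
class TenantCatalog:
    """Tracks which server holds each tenant's database (hypothetical design)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.placement = {}  # tenant name -> server name

    def add_tenant(self, tenant):
        """Place a new tenant on the server that holds the fewest tenants."""
        counts = {server: 0 for server in self.servers}
        for server in self.placement.values():
            counts[server] += 1
        # min() returns the first server on ties, so placement is deterministic.
        server = min(self.servers, key=lambda s: counts[s])
        self.placement[tenant] = server
        return server

    def locate(self, tenant):
        """Return (server, database) for a tenant; one database per tenant."""
        return self.placement[tenant], f"tenant_{tenant}"


catalog = TenantCatalog(["db1", "db2"])
for name in ("acme", "globex", "initech"):
    catalog.add_tenant(name)
print(catalog.locate("globex"))
```

The application asks the catalog where a tenant lives before connecting, which is what lets tenant databases spread across servers as the tenant count grows.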

Multi-tenancy has a number of important impacts on database architecture. I'm going to mention just three, but they are all significant. First of all, multi-tenant databases tend to evolve into complex topologies. Here's a simple example that shows how a successful SaaS application quickly grows from a single, harmless DBMS server to five servers linked by replication with rapid growth beyond.

In the beginning, the application has tenant data stored in separate databases plus an extra database for the list of tenants as well as data shared by every application. In accounting applications, for example, the shared information would include items like currency exchange and VAT rates that are identical for each tenant. Everything fits into a single DBMS server and life is good.

Now business booms and more tenants join, so soon we split the single server into three—a server for the shared data plus two tenant servers. We add replication to move the shared data into tenant databases.

Meanwhile business booms still more. Tenants want to run reports, which have a tendency to hammer the tenant servers. We set up separate analytics servers with optimized hardware and alternative indexing on the schema, plus more replication to load data dynamically from tenant databases.

And this is just the beginning of additional servers as the SaaS adds more customers and invents new services. It is not uncommon for successful SaaS vendors to run 20 or more DBMS servers, especially when you count slave copies maintained for failover and consider that many SaaS vendors also operate multiple sites. At some point in this evolution the topology, including replication as well as management of the databases, is no longer manually maintainable. As we say in baseball, Welcome to the Bigs.

Complex topologies with multiple DBMS servers lead to a second significant SaaS issue: failures. Just having a lot of servers already means failures are a bigger problem than when you run a single DBMS instance. To show why, let's say individual DBMS servers fail in a way that requires you to do something about it on average once a year, a number that reliability engineers call Mean Time Between Failures (MTBF). Here is a simple table that shows how often we can expect a failure to occur somewhere in the system as the number of hosts grows. (Supply your own numbers. These are just plausible samples.)

Number of DBMS Hosts    Days Between Failures
 1                      365
 2                      182.5
 4                       91.3
 8                       45.6
16                       22.8
32                       11.4
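
The table is simple arithmetic. If each host fails on average once every 365 days and hosts fail independently, then N hosts produce a failure somewhere roughly every 365/N days. Here is a short Python sketch of that calculation, with the 365-day MTBF as the assumed input; the table rounds the results to one decimal place.

```python
def days_between_failures(hosts, mtbf_days=365.0):
    """Expected days between failures across a group of independent hosts."""
    return mtbf_days / hosts


for hosts in (1, 2, 4, 8, 16, 32):
    print(hosts, days_between_failures(hosts))
```

Plug in your own MTBF estimate as mtbf_days to see how quickly the interval shrinks for your fleet.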

Failures are not just more common with more DBMS hosts, but more difficult to handle. Consider what happens in the example architecture when a tenant data server fails and has to be replaced with a standby copy. The replacement must not only replicate correctly from the shared data server, but the analytic server must also be reconfigured to replicate correctly as well. This is not a simple problem. There's currently no replication product for open source databases that handles failures in these topologies without sooner or later becoming confused and/or leading to extended downtime.

There is a third significant SaaS problem: operations on tenants. This includes provisioning new tenants or moving tenants from one database server to another without requiring extended downtime or application reconfiguration. Backing up and restoring individual tenants is another common problem. The one-database-per-tenant model is popular in part because it makes these operations much easier.

Tenant operations are tractable when you just have a few customers. In the same way that failures become more common with more hosts, tenant operations become more common as tenants multiply. It is therefore critical to automate them as well as make the impact on other tenants as small as possible.

Complex topologies, failures, and tenant operations are just three of the issues that make SaaS database architectures interesting as well as challenging to design and deploy. It is well worth thinking about how we can improve database clustering and replication to handle SaaS. That is exactly what we are working on with Tungsten. I hope you will follow me as we dive more deeply into SaaS problems and solutions over the next few months.

P.s., If you run a SaaS and are interested in working with us on these features, please contact me at Continuent. I'm not hard to find.