Scaling Databases Using Commodity Hardware and Shared-Nothing Design


Saturday, June 20, 2009

When SANs Go Bad

They sometimes go bad in completely unpredictable ways. Here's a problem I have now seen twice in production situations. A host boots up nicely and mounts file systems from the SAN. At some point a SAN switch fails (e.g., through a Fibre Channel controller fault) in such a way that the SAN goes away but the file system still appears visible to applications.

This kind of problem is an example of a Byzantine fault where a system does not fail cleanly but instead starts to behave in a completely arbitrary manner. It seems that you can get into a state where the in-memory representation of the file system inodes is intact but the underlying storage is non-responsive. The non-responsive file system in turn can make operating system processes go a little crazy. They continue to operate but show bizarre failures or hang. The result is problems that may not be diagnosed or even detected for hours.

What to do about this type of failure? Here are some ideas.
  1. Be careful what you put on the SAN. Log files and other local data should not go onto the SAN. Use local files with syslog instead. Think about it: your application is sick and trying to write a log message to tell you about it on a non-responsive file system. In fact, if you have a robust scale-out architecture, don't use a SAN at all. Use database replication and/or DRBD instead to protect your data.
  2. Test the SAN configuration carefully, especially failover scenarios. What happens when the host fails over from one access path to another? What happens when another host picks up the LUN from a "failed" host? Do you have fencing properly enabled?
  3. Actively look for SAN failures. Write test files to each mounted file system and read them back as part of your regular monitoring. That way you know that the file system is fully "live."
The last idea gets at a core issue with SAN failures--they are rare, so they are not the first thing people think of when there is a problem. The first time this happened on one of my systems it was around 4am. It took a really long time to figure out what was going on. We didn't exactly feel like geniuses when we finally checked the file system.
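
Here is a minimal sketch of the write-and-read-back check from idea 3, in Python. The function names, the probe file name, and the 10-second timeout are illustrative assumptions, not part of any particular monitoring product. The probe runs in a separate thread because I/O against a dead SAN tends to hang rather than fail, so the caller needs a timeout to report the hang instead of getting stuck itself.

```python
import os
import threading
import uuid


def _write_and_read_back(directory, result):
    """Write a unique token to a probe file, force it to storage, read it back."""
    token = uuid.uuid4().hex.encode("ascii")
    path = os.path.join(directory, ".fs_probe_%d" % os.getpid())  # hypothetical name
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, token)
            # fsync is the step that forces a round trip to the underlying
            # storage; the read below may be served from the page cache.
            os.fsync(fd)
        finally:
            os.close(fd)
        with open(path, "rb") as probe:
            result["ok"] = probe.read() == token
        os.remove(path)
    except OSError as exc:
        result["error"] = str(exc)


def probe_filesystem(directory, timeout_seconds=10.0):
    """Return "OK", "HUNG", or "FAILED: <reason>" for one mounted file system."""
    result = {}
    worker = threading.Thread(
        target=_write_and_read_back, args=(directory, result), daemon=True
    )
    worker.start()
    worker.join(timeout_seconds)
    if worker.is_alive():
        # The I/O never returned: the mount looks alive but storage is not answering.
        return "HUNG"
    if result.get("ok"):
        return "OK"
    return "FAILED: " + result.get("error", "read-back mismatch")
```

A monitoring job could call probe_filesystem() once per mount point and raise an alert on anything other than "OK".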

SANs are great technology, but there is an increasingly large "literature" of SAN failures on the net, such as this overview from Arjen Lentz and this example of a typical failure. You need to design mission-critical systems with SAN failures in mind. Otherwise you may want to consider avoiding SAN use entirely.

Wednesday, June 17, 2009

Lots of New Tungsten Builds--Get 'Em While They're Hot

There is a raft of new Tungsten open source builds available for your replication and clustering pleasure. Over the last couple of days we uploaded new binary builds for Tungsten Replicator, Tungsten Connector, Tungsten Monitor, and Tungsten SQL Router. These contain the features described in my previous blog article. They also include more bug fixes than I had expected (36 on Tungsten Replicator alone), thanks to a debugging fest over the last few days that knocked off a bunch of issues. You can pick up the builds on the Tungsten download page. Docs are posted on the Tungsten wiki.

If you have questions, see problems with the builds, or just want to tell us how great they are, please post on the community forums or on the tungsten-discuss mailing list.

Our next open source release will be the Tungsten Manager, which is long overdue to join the family of regular builds. We are doing some polishing work on the state machine processing and group communications, after which the Manager will go out along with documentation on how to use it.

Wednesday, June 10, 2009

Tungsten Development News - Lots of New Features!

Articles on this blog have been pretty scanty of late for a simple reason--we have been 100% heads-down in Tungsten code since the recent MySQL Conference. The result has been a number of excellent improvements that are already in Subversion and will appear as open source builds over the next couple of weeks.

Tungsten has a simple goal: create highly available, performant database clusters from unaltered commodity databases. The clusters should be simple to manage and look as much like a single database as possible to applications. Over the last two months we completed the integration of individual Tungsten components necessary to make this happen.

Full integration is a big step forward and finally gets us to the ease-of-use we were seeking. Imagine you want to add a slave database to the cluster. There's no management procedure any more--you just turn it on. Managers in the cluster automatically detect the new slave and add it as a data source. That's the way we want every component to work from top to bottom--either on or off, end of story. It was really nice to see it start to work a few weeks ago.

We are now ready to start pushing builds out to the Tungsten SourceForge.net project. Here is a selection of the features:

Tungsten Replicator -- API support for seamless failover, certification on Solaris, better Windows support, testing against MariaDB, and many other improvements like flush events for seamless failover. There are already 26 fixes in JIRA and I expect more before we post the build.

Tungsten SQL Router -- Pluggable load balancing with session consistency support. Session consistency means users see their own writes but can read changes by other users from a slave. It works using a single database connection, which is an important step toward eliminating application changes in order to scale on master/slave clusters.

Tungsten Manager -- Directory-based management model that allows you to view and manage both JMX-enabled services and regular operating system processes that follow the familiar LSB pattern of 'service name start/stop/restart'. The managers use group communications and can broadcast commands across multiple hosts, handle failures, and automatically detect new services as they come online.

Tungsten Monitor -- Improved monitoring of replicator status including slave latency, which is necessary to guide SQL Router load balancing features like session consistency.
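
To make session consistency a little more concrete, here is a toy sketch of the routing decision in Python. It is not the SQL Router implementation: the class, its method names, and the use of a simple sequence number to track how far each slave has applied are all assumptions made for illustration. The idea is only that a session's reads may go to a slave once that slave has caught up with the session's own last write; otherwise they go to the master.

```python
class SessionConsistentRouter:
    """Toy router: writes go to the master, reads go to a slave only when safe."""

    def __init__(self, slaves):
        self.master_seqno = 0                         # last write committed on master
        self.applied = {name: 0 for name in slaves}   # last write applied per slave
        self.last_write = {}                          # last write issued per session

    def write(self, session_id):
        """Record a write for this session; writes always go to the master."""
        self.master_seqno += 1
        self.last_write[session_id] = self.master_seqno
        return "master"

    def slave_applied(self, slave, seqno):
        """Called by monitoring (hypothetical hook) as a slave catches up."""
        self.applied[slave] = seqno

    def read(self, session_id):
        """Pick a slave that has already applied this session's last write."""
        needed = self.last_write.get(session_id, 0)
        for slave in sorted(self.applied):
            if self.applied[slave] >= needed:
                return slave
        # No slave is current enough for this session, so fall back to the master.
        return "master"
```

In this sketch a session that has never written can read from any slave right away, while a session that just wrote is pinned to the master until a slave catches up.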

There's a lot going on with Tungsten right now, in fact far too many things to mention even in a longish post like this one. One of my current code projects is to implement built-in backup and restore for Tungsten Replicator. I am planning on supporting slave auto-provisioning: a new slave comes up, restores the latest backup, and starts replicating. All you have to do is turn the slave on. (More of that on/off stuff--it's kind of an obsession for us at this point.)

Integrating backup/restore is the final big feature for Tungsten Replicator 1.0--after this we plan to turn attention to parallel replication and are already discussing how this might work with several potential customers. Feel free to contact me through this blog or better yet post on the community forums parallel replication topic to join the conversation.

One final bit of news: we are starting to work seriously on Tungsten PostgreSQL integration thanks to a new partnership between Continuent and 2ndQuadrant. This work is commercially focused for now but will lead to additional open source features in the not too distant future. Keep watching this space... :)

p.s., We also had a nice refit on the community website. Check it out.

Wednesday, May 13, 2009

Continuent is Joining the Open Database Alliance

Maybe it's a sense of shared adversity, but recent MySQL meetings have had this "we're all in it together" feeling. Today Monty Widenius announced the Open Database Alliance: the community feeling is starting to look like a real business entity.

The Open Database Alliance is appealing at multiple levels. First, it's good for the companies that join--a steadier flow of business and ability to offer bigger solutions by combining with partners. Second, it's good for users: first rate software, services, and support without vendor lock-in. Third, the parties are going to be excellent.

Sometimes you have to think hard before signing up for partnerships. But this one looks like a no-brainer. Count us in!

p.s., Stay tuned for Tungsten certification against MariaDB. If you have tried the Tungsten Replicator already with MariaDB, please post your experiences on the Tungsten Replicator Forum.

Wednesday, April 29, 2009

Overcoming MySQL-to-Oracle Culture Shock

Migrating from Oracle to MySQL is not easy. A few weeks ago Baron Schwartz summarized the culture shock in 50 things to know before migrating Oracle to MySQL. It's a great article but as you read through the comments it's easy to forget that culture shock can run the other way.

For example, try building horizontally scaled systems. Oracle has excellent "small" database editions like SE and SE1. However, they lack built-in replication of the type provided by MySQL. Even simple and effective deployment patterns like master-master replication do not exist. The usual approach in the Oracle world is to use RAC + Enterprise Edition features like Streams and DataGuard. That's great for large enterprises, but it's not a good method for smaller businesses and start-ups.

We have been working for some time on a better answer. We are now opening up for general beta testing a commercial extension to our Tungsten Replicator to address replication for Oracle. The new extension adds a process to read Oracle redo logs but otherwise fits neatly into the overall replicator design. It works on Linux Oracle Editions from XE to EE.

Implementing Oracle replication has been a long and arduous effort. Oracle has a huge feature set and a correspondingly elaborate log. It is far more challenging to read than the MySQL binlog. We currently handle basic data types as well as DDL statements. Large object types and XML are on the way. The implementation is a step-by-step process and one that needs to be guided by close work with customers.

On the other hand, Oracle has the features to make advanced replication really work. Most Oracle DBAs know about supplemental logging, which among other things adds keys to data so you can identify updated rows unambiguously. However, there are also far more interesting features like flashback queries, which allow you to see the state of the database at earlier points in time. It makes generating SQL from log entries much easier because we can see the state of system catalogs as of the exact time each update occurred. Flashback query was not on Baron's list or the comments that followed, but it is one of the truly great features of Oracle databases.
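
For readers who have not used it, a flashback query is an ordinary SELECT with an AS OF clause added. The small Python helper below only builds the SQL text for an AS OF SCN query. The function name, table name, and SCN value are made up for illustration, and this is not a description of how the Tungsten Oracle extension itself issues queries.

```python
def flashback_select(table, scn, columns=("*",)):
    """Return Oracle SQL that reads `table` as of system change number `scn`."""
    column_list = ", ".join(columns)
    # AS OF SCN asks Oracle for the rows as they existed at that change number.
    return "SELECT {} FROM {} AS OF SCN {:d}".format(column_list, table, scn)


# Hypothetical table and SCN, purely for illustration.
print(flashback_select("hr.employees", 1234567, columns=("employee_id", "salary")))
```

Oracle also accepts AS OF TIMESTAMP in the same position when you want to name a point in time rather than an SCN.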

If you are interested in alternatives for existing Oracle replication, I would like to encourage you to contact us at Continuent. We are looking for customers who want to work closely with us to build out economical Oracle replication support. MySQL has shown over the years the power of lightweight, simple-to-use replication. It's going to be a pleasure to make it work on Oracle.

Finally, there needs to be a list of 50 things you need to know about migrating from MySQL to Oracle. Open source databases are popular not just because they offer free downloads. Simplicity of operation, replication, and support for incremental scale-out patterns are among the strengths of MySQL. It takes some thought and effort to translate them into Oracle.

p.s., Since I wrote this article Robert Treat obligingly started the MySQL to Oracle 50 things list. Several people chipped in to get it up to 50.

Sunday, April 26, 2009

Tungsten Replicator Build 1.0.1 Available

A new build of the Tungsten Replicator is now available. As you probably know from reading this blog Tungsten Replicator provides advanced open source replication for MySQL. There is also a commercial extension to support Oracle. Tungsten Replicator 1.0.1 includes a number of important improvements.
  • Much better performance -- Current benchmark results show throughput of up to 650 inserts per second using a single slave apply thread. We are well on the way to our goal of 1000 inserts per second.
  • Simplified management -- Replicator administration has been largely reduced to two commands: online and offline. There is an option to go online automatically at startup, which further simplifies operation and makes it easy for the replicator to operate as a service.
  • Easy-to-use consistency checks -- You just type trepctl check database.tablename.
  • Lots of bug fixes and small improvements -- Check the release notes in file README.UPGRADE.
We also have some great features on tap for the next couple of releases. An integrated flush operation to simplify failover, built-in backup/restore, and parallel replication are just a few. I'm particularly excited about parallel replication, as it has the potential to boost throughput into the 1000s of updates per second and to support sharding as well. You can track development progress on the Tungsten Replicator JIRA list.
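
Why would parallel replication help with both throughput and sharding? One common approach, sketched below in Python, is to partition the event stream by a shard key so that each apply thread works through its own queue while events for any one shard stay in order. This is only an illustration of the general idea, not the Tungsten design, which is still under discussion. The function names and the choice of database name as shard key are assumptions.

```python
import zlib


def assign_channel(shard_key, channels):
    """Map a shard key to an apply channel using a stable hash."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(shard_key.encode("utf-8")) % channels


def partition_events(events, channels):
    """Split (seqno, shard_key) events into per-channel queues.

    Events with the same shard key always land in the same queue, in their
    original order, so each queue can be applied by its own thread.
    """
    queues = [[] for _ in range(channels)]
    for seqno, shard_key in events:
        queues[assign_channel(shard_key, channels)].append((seqno, shard_key))
    return queues


# Hypothetical event stream: sequence number plus the database it touches.
events = [(1, "db_a"), (2, "db_b"), (3, "db_a"), (4, "db_c")]
for channel, queue in enumerate(partition_events(events, 2)):
    print(channel, queue)
```

The catch is that transactions spanning more than one shard need extra coordination, which this sketch ignores.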

For more information check out the Tungsten Replicator community pages. You can grab binary downloads or look at source code on the Tungsten project on SourceForge.net. The 1.0.1 build is a considerable improvement over the previous beta releases and I hope you will try it out. We look forward to your feedback.

Friday, April 24, 2009

MySQL Conference Impressions and Slides

"Interesting" was probably the most overused word at the MySQL Conference that just ended yesterday. Everyone is waiting to find out more about the Oracle acquisition of Sun. As a community we need to find some synonyms or things will become very tiresome. Personally I vote for intriguing.

Here are slides for my presentations at the MySQL Conference as well as the parallel Percona Performance is Everything Conference. Thanks to everyone who attended as well as to the organizers. You had wonderful ideas and suggestions.


Finally, some short impressions on the conference. The two most intriguing trends were clouds and advances in hardware, especially memory and SSDs. These are altering the economics of computing in fundamental ways, changing business costs as well as the performance trade-offs in many of the basic algorithms for data management. Combined with the ferment of projects spinning off from MySQL and others, they are fueling an incredible burst of creative thinking about databases.

By comparison, Oracle consuming Sun is merely interesting.