The Scale-Out Blog: 2008

Dec 22, 2008

Tungsten Replicator Beta-3 Is Available

Our fiendish plot to provide advanced open source replication for MySQL and Oracle took another step forward yesterday. The Tungsten Replicator beta-3 build is available for download on our Forge site. This build is fully open source. Tungsten Replicator provides heterogeneous replication from MySQL to Oracle, seamless failover from a master to one of several slaves, event checksums, event filtering hooks, and a number of other useful replication features for MySQL and Oracle previously not offered outside of commercial products.

The beta-3 build has a number of important improvements:

MySQL 5.1 row replication is largely working. We are still seeing problems with datetime and timestamp replication but most other datatypes work.
The state machine model used by the replicator has been upgraded substantially. State transitions are processed cleanly and a number of race conditions have been eliminated.
You can now control master/slave failover remotely without any need to change local properties files. Planned failover passes our current tests, which means it's time to write new ones.
Multiple replicators can read and write from the same database. This is a prerequisite for fan-in and circular replication.
Numerous other bug fixes and small improvements are in the build.

For a full list of the build contents, read the Forge release notes or check the JIRA change log.

Our next build, beta-4, is scheduled for 5 January. It will tackle a number of key issues including corner cases for unplanned failure, scheduled table checksums, and heartbeat events. Plus we will continue to fix bugs and improve overall usability. Build contents are shown in the JIRA roadmap, a feature that delights the inner pointy-haired boss in all of us.

Please visit our community website and take the replicator out for a spin. It's improving quite rapidly. We take your feedback quite seriously--any reasonable request for capabilities will be considered and implemented if feasible.

In my next posts I will describe two key capabilities of Tungsten Replicator in more detail: replicating from MySQL to Oracle and filtering events. Stay tuned!

Dec 1, 2008

Don't Shy Away from MySQL 5.1!

MySQL 5.1 is GA. Let the fear and loathing begin. In a recent post Monty describes a number of problems that he feels should have prevented a GA declaration at this time. I like Monty's forthrightness immensely and his words have strongly influenced our work to develop the Tungsten Replicator. That said, I must respectfully disagree with his opinion.

It's hard to comment on overall quality of 5.1, though I have yet to hit any bugs personally after using it intermittently for almost a year. However, we have done a lot of work with MySQL row replication. Monty points out several bugs in the row replication implementation. Frankly, they would not hold me back. Row replication has so many advantages in eliminating strange corner cases from statement replication that it outweighs a few bugs. The MySQL 5.1 manual sums it up accurately:

Advantages of row-based replication:

All changes can be replicated. This is the safest form of replication.

Beyond issues like provable correctness, row replication is simply more flexible than statement replication. Heterogeneous replication is an obvious example. Our own Tungsten Replicator can replicate statements from MySQL 5.0 to Oracle. That's great if you use completely vanilla SQL and stick to int and varchar datatypes. For real applications, however, you need a data structure that transfers datatypes accurately and is easy to morph across schema differences. Similar reasoning applies when using replication for application upgrades. Finally, row replication is the only viable path for implementing parallel slave update, which is increasingly necessary on multi-core hosts. I can't speak directly for Mats Kindahl and other members of the replication team, but there's no doubt they see row replication as the foundation to solve a number of key problems.
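To make the point about statement replication corner cases concrete, here is a toy Python sketch. It is not MySQL or Tungsten code: the dictionaries standing in for tables and the `run_statement` helper are assumptions made up for illustration. The helper plays the role of a non-deterministic statement, for example one that calls UUID(). Re-running the statement on a slave produces different data, while copying the row images from the master does not.

```python
import uuid


def run_statement(table):
    """Stand-in for a non-deterministic statement such as
    UPDATE t SET token = UUID(): every row gets a fresh value."""
    for key in table:
        table[key] = str(uuid.uuid4())


# Master and both slaves start out with identical contents.
master = {1: "", 2: ""}
stmt_slave = dict(master)
row_slave = dict(master)

# The master executes the statement once.
run_statement(master)

# Statement replication: the slave re-executes the statement text,
# so it generates its own values and the copies diverge.
run_statement(stmt_slave)

# Row replication: the slave applies the row images the master produced.
for key, value in list(master.items()):
    row_slave[key] = value

print("statement slave matches master:", stmt_slave == master)
print("row slave matches master:", row_slave == master)
```

The row-based slave always ends up identical to the master because it never re-evaluates anything; it only copies results.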

For these and other reasons our team at Continuent has devoted quite a bit of effort to reading row updates in MySQL 5.1 binlogs. Obviously, we have some uses in mind that go well beyond simple MySQL to MySQL data transfer. However, I would not shy away from MySQL 5.1 if I were using native replication. Instead, I would be testing row replication today to see what problems it solves for me. Congratulations to the MySQL team for getting this feature out the door.

Nov 17, 2008

Announcing Tungsten Replicator Beta for MySQL

Pluggable open source replication has arrived, at least in beta form. Today we are releasing Tungsten Replicator 1.0 Beta-1 with support for MySQL. This release is the next step in bringing advanced data replication capabilities to open source and has many improvements and bug fixes. It also (finally) has complete documentation. I would like to focus on an interesting feature that is fully developed in this build: pluggable replication.

I have blogged about our goals for Tungsten Replicator quite a bit, for instance here and here. We want the Replicator to be platform-independent and database-neutral. We also want it to be as flexible as possible, so that our users can:

Support new databases easily
Filter and transform SQL events flexibly
Replicate between databases and applications, messaging systems, or files that you don't traditionally combine with replication

It was clear from the start we needed to factor the design cleanly. The result was an architecture where the main moving parts are interchangeable plug-ins. Here's a picture:

There are three main types of plug-ins in Tungsten Replicator.

Extractors remove data from a source, usually a database.
Appliers put the events in a target, usually a database.
Filters transform or drop events after extraction or before application.

This sounds pretty simple and it is. But it turns out to be amazingly flexible. I'll just give one example.

Say you are using Memcached to hold pages for a media application. The media database is loaded from a "dumb" 3rd party feed piped in through mysql. Normally you would set up some sort of mechanism within the feed that connects to the database and then updates Memcached accordingly. Okay, that works. However, your feed processor just got a lot more complicated. Now there's a better way. You can write an Applier that converts SQL events from the database to Memcached calls to invalidate corresponding pages. Then you can write a Filter that throws away any SQL events you don't want to see. Voila! Problem solved. Because it works off the database log, this approach works no matter how you load the database. That's even better.
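The Memcached scenario can be sketched in a few lines of Python. This is only an illustration of the extractor/filter/applier split, not the Tungsten Replicator plug-in API (which is Java): the class names, the event dictionaries, and the plain dict standing in for Memcached are all hypothetical.

```python
class ListExtractor:
    """Hypothetical extractor: yields events from an in-memory list
    that stands in for the database log."""

    def __init__(self, events):
        self.events = events

    def extract(self):
        yield from self.events


class SchemaFilter:
    """Hypothetical filter: drops events from schemas we do not want."""

    def __init__(self, wanted_schema):
        self.wanted_schema = wanted_schema

    def filter(self, event):
        return event if event["schema"] == self.wanted_schema else None


class CacheInvalidatingApplier:
    """Hypothetical applier: instead of writing to a database, it removes
    the cached page for each changed row. A dict stands in for Memcached."""

    def __init__(self, cache):
        self.cache = cache

    def apply(self, event):
        self.cache.pop("page:%s" % event["page_id"], None)


def replicate(extractor, filters, applier):
    """Pass each extracted event through the filters, then apply it."""
    for event in extractor.extract():
        for f in filters:
            event = f.filter(event)
            if event is None:
                break  # a filter dropped the event
        if event is not None:
            applier.apply(event)


cache = {"page:1": "<html>old</html>", "page:2": "<html>old</html>"}
events = [
    {"schema": "media", "page_id": 1, "sql": "UPDATE pages SET title = 'x' WHERE id = 1"},
    {"schema": "audit", "page_id": 2, "sql": "INSERT INTO log VALUES (2)"},
]
replicate(ListExtractor(events), [SchemaFilter("media")], CacheInvalidatingApplier(cache))
print(sorted(cache))  # ['page:2']
```

Only the page touched by the "media" event is invalidated; the "audit" event is thrown away by the filter before it reaches the applier.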

Tungsten Beta has a number of other interesting features beyond pluggable replication. Our next builds will support MySQL row replication fully and have much better heterogeneous replication. I'm going to cover these in future blog posts. Incidentally, MySQL 5.1 row replication is a highly enabling feature for many data integration problems. If you have not checked it out already, I hope our replication will motivate you to do so in the very near future.

Meanwhile, please download the build and take it out for a spin. Builds, documentation, bug tracking, wikis and much more are available on our community site. Have fun!

Oct 30, 2008

Why Is Solaris Missing the Party?

I just spent several hours in a fruitless quest to figure out if there's a way to run Solaris 10 on Amazon. Fruitless is the right word because "real" Solaris operating systems do not seem to be supported other than through QEMU emulation, which looks a bit shaky. So far there's only OpenSolaris.

Why is this a problem? Our company, Continuent, is moving at full speed onto Amazon S3 and EC2. We have a virtual organization with developers spread out from California to Lithuania. Amazon solves a really fundamental problem for us. We can have development machines that everyone can reach easily and activate or deactivate at will. Scott in Santa Cruz does not have to call Seppo in Helsinki just to get a host rebooted. (This is where globalization starts to go really bad.) We are also developing software like Tungsten Replicator that needs to run in cloud environments. Being on Amazon makes sense at multiple levels.

The fly in the ointment is that many of our customers use Solaris 9 and 10. The OpenSolaris instances on Amazon are essentially useless. OpenSolaris is so different from production Solaris that tests give little or no useful information. I have never heard of a customer deploying on it. So we are stuck on the old model of keeping machines in house. As a result Linux and even Windows look cheaper and involve far less hassle as development platforms.

It feels as if Sun is really missing the boat on this one. If I were working for the Solaris team, getting Solaris for Intel available on Amazon would be at the top of my list. If enough IT people start to make the calculations we are, the future of Solaris is not going to be very bright. That would be a pity both for Sun and for a lot of users.

Oct 29, 2008

Simple is Beautiful

Last week I attended an incredibly intense conference in Lalandia, Denmark: Miracle Oracle Open World. According to Mogens Norgaard, the organizer, the conference devotes 80% of the time to intense discussions of Oracle databases and 80% of the time to drinking. During the festivities you get this dim mental image of what it would have been like if Vikings had access to 16-core machines and advanced database software. But I digress.

Anyway, Lalandia is located on just that kind of spare, beautiful coast that clears the mind to look for fundamental truths. And sure enough, a talk by Carel-Jan Engel nailed one of them: simplicity is the key to availability.

At some level we all understand the idea. The more components you have in a system the more likely it is one or more of them will fail either because of a defect or an administrative error. The trouble is we don't act on our intuitions. Carel-Jan showed the Oracle MAA (Maximum Availability Architecture), which looks like this in the marketing pictures:

MAA is the recommended way to create a highly available system using RAC and Data Guard. And suddenly it hits you--there are a lot of moving parts. In seeking redundancy, the authors of the design have created tremendous complexity and hence opportunities for failures. It's an example of what Jeremiah Wilton once allegedly described as "design for maximum failability." I don't know if Jeremiah really said that but it describes the problem pretty well.

And this was Carel-Jan's point. Availability is not something you just purchase and roll in the door on wheels. You get it by engineering very simple systems that have few points of failure. In the Oracle world it often means buying Oracle SE instead of RAC. And running it on standard hardware linked together with replication. Plus, of course, changing your applications so they work within the limitations of the rest of the system. Want to stay available without losing data? Keep the rate of updates low. Performance overload? Partition data into separate systems. You get the idea.

In short, keep it really simple, like this:

This is simple availability. It's very beautiful. Open source database communities have understood this idea for a long time. My goal is to write software to make it work better for them and for Oracle users as well.

Oct 17, 2008

Getting Smart about the New World of PostgreSQL Replication

Robert Treat and I had some back and forth emails a few weeks ago about explaining database replication to customers. Replication is totally cool but it is also completely confusing to a lot of people. The basic concepts are not widely understood. Plus PostgreSQL does not help by giving you a wide range of methods, often with poorly documented trade-offs.

Based on our conversation I put together a talk for PG West in Portland called Getting Smart about the New World of PostgreSQL Replication. It explains basic concepts and surveys five replication approaches. Press the title and you can possess the slides yourself.

Robert and I had talked about putting together a joint talk about replication. Consider this a first cut. I'm up for iterating a few times to get a solid tutorial.

Meanwhile at Continuent we should be able to replicate data from Oracle or MySQL into PostgreSQL using Tungsten Replicator within about two weeks or so. I'm waiting for one more check-in to enable writing plug-ins that apply SQL to new databases. Sad to say, reading data out of PostgreSQL is going to take a little longer. Stay tuned...

Sep 24, 2008

Amazon or Google--Which Is More Interesting?

Brian Aker has a great post about how he finds Amazon more interesting than Google, because they have addicting services but no framework lock-in. I couldn't agree more with his conclusion, though for somewhat different reasons.

The contrast between Amazon and Google has intrigued me for a long time. The fact that Amazon is exposing basic infrastructure to build business systems has enormous advantages if that's what you are building. Google on the other hand has been a lot more oriented toward end users. Their services seem more useful to individual consumers.

I got really interested in Amazon services when SQS first appeared. It was clear somebody understood that service-based systems require messaging for integration as well as workflow processing. With messaging, "safe" storage, availability zones, and rapid setup of virtual machines, you can solve some mighty big problems. I can't see how to do this with on-line spreadsheets and free email.

OK, that's kind of a cheap shot. Still, Google services don't match Amazon by any stretch of the imagination when it comes to building scalable, general-purpose applications.

There's also an implicit difference between the Google and Amazon approaches. When you write software services for money there's usually a business plan somewhere or you don't do it for very long. Business plans in turn require you make some assumptions about the environments you are using like how much they will cost, what features they have now, and what they will have in the future. It's really important that these assumptions be reasonably stable or you can't make much progress.

Amazon may not be very open about how things work, but at least they are reasonably open about their plans. Now think about how many Google services are marked "BETA."

In fact, with Google I'm not even very sure about what the term "beta" means. This is not just a problem for me. It's a problem with how Google interacts with the world that will hurt the company in the long run.

In the end I would be willing to go a bit further than Brian. If you write backend systems of any kind, Amazon is more than just interesting. I would be willing to bet that 20 years from now we will look back and say that Amazon provided the model that made the dubious idea once called utility computing really work.

p.s., Don't get me wrong about Google. I googled all the links for this article, which is written on Blogspot, another really nice end user service. Amazon might be the cat's meow but Google is a verb.

Open Source Databases at Oracle Open World

Open source databases still have a very long way to go to catch up to Oracle. I was at Oracle Open World touring the exhibits on Tuesday. Just for fun I asked everyone I met whether they used open source databases or saw demand for them in their businesses. The answer almost universally went like this: "No."

One simple reason explains much of the Oracle dominance as well as the inertia of many companies in switching to something else. A huge number of enterprise applications like Siebel or SAP run on Oracle. MySQL and PostgreSQL applications on the other hand are either custom code or belong to an area where open source is truly dominant, such as web site content management. Even when more applications run on open source, most companies will adopt them as supplements to existing systems, not as wholesale replacements. Oracle and other commercial databases will continue to rule enterprises for a very long time.

A lot of the focus in open source database development is on matching capabilities of commercial databases. What many open source users really need is the ability to integrate. That in turn depends on features like heterogeneous replication as well as bulk loading. These are not on the road maps of most open source database projects. However, they will be one of the factors that eventually enables open source to break out into a much bigger arena.

Sep 19, 2008

Tungsten Replicator 1.0 Alpha Is Released

The 1.0 Alpha of Tungsten Replicator is out. Actually it's been out since Tuesday but it's been a busy week. Binary downloads are available here.

The Alpha release offers basic statement replication for MySQL 5.0 on Linux, Solaris, MacOSX, and Windows platforms. The setup is very simple, and there are procedures for master failover as well as performing consistency checks. If you work at it, you'll find bugs. That's a promise, not a threat. Please log them in the project JIRA. We gladly accept feature requests, too.

Meanwhile, the beta version is in development. Among other nice features we will offer support for user-written SQL event extractors and appliers, MySQL row replication support, lots of testing, and much more.

Sep 15, 2008

Bringing Open Source Replication to the Oracle World

Replication is one of the most useful but also one of the most arcane database technologies. Every real database has it in some form. Despite ubiquity, replication is complex to use and in the case of commercial databases quite expensive to boot.

We aim to change that. On Tuesday we will be announcing replication support for Oracle. Oracle replication will be based on our open source Tungsten Replicator, which is currently available in an alpha version for MySQL. Our goal is to provide replication that is accessible and usable by a wide range of users, especially those running lower-cost Oracle editions.

It's not a coincidence that we chose to implement MySQL and Oracle replication at the same time. MySQL has revolutionized the simplicity and accessibility of databases in general and replication in particular. For example, MySQL users have created cost-effective read scaling solutions using master/slave replication for years. MySQL replication is not free of problems, but there is no question MySQL AB, helped by the community, got a lot of the basics really right.

On the other hand, Oracle replication products offer state-of-the-art solutions for availability, heterogeneous replication, application upgrade, and other problems, albeit for high-end users. For example, Oracle Streams and Golden Gate TDM offer very advanced solutions to the problem of data migration with minimal downtime. The big problem with these solutions is not capabilities but administrative complexity and cost.

Our initial cut at merging the two worlds is focused on creating a simple and usable database replication product that handles the following use cases for small to medium Oracle installations:

Basic data availability using extra copies of databases locally and off-site
Scaling reads using the MySQL read-scaling model
Performing zero-downtime upgrades and migrations using database replicas
Heterogeneous data migration between Oracle and MySQL as well as PostgreSQL (initially one-way only).

The big technical feature is that replication will work on all editions of Oracle, not just Enterprise Edition. We expect to help Oracle users build economical new systems on the scale-out model as well as off-load existing Oracle servers to avoid having to upgrade to more expensive licensing.

An early adopter version will be available toward the end of the month. The Oracle redo log extractor is commercial but all other capabilities are open source, so you can replicate from MySQL up to Oracle freely. We are now looking for some select users who can really help propel the software forward. If you would like to try out Oracle replication, contact me at Continuent.

Sep 14, 2008

Java Service Wrapper Is Very Handy

If you write network services using Java, you should look into the Java Service Wrapper (JSW). The JSW turns Java programs from weak delicate creatures easily killed by an errant Ctrl-C into robust network services that boot up automatically, ignore most signals, and restart automatically following crashes. It's free for open source programs and has very reasonable licensing fees for commercial software.

We use JSW on several of our projects including the Tungsten Replicator and the Tungsten Connector. I just checked in a new project on our Tungsten Commons site with an Ant script that automatically copies the open source versions of JSW into a project directory with a conventional layout including bin and lib directories. Check it out here if you would like an example of how to automate addition of JSW wrappers to your own Java projects.

Sep 11, 2008

MySQL 5.0 to 4.1 "Down-Version" Replication using Tungsten

A couple of months ago Mark Callaghan mentioned it would be very nice to have a replication product that could transfer data from newer to older versions of MySQL. Ever since then I have been interested in trying it with our new Tungsten Replicator. Today I finally got the chance.

I have a couple of CentOS5 virtual machines running on my Mac that I have been using to test the latest Tungsten Replicator 0.9.1 build. I happen to have MySQL 5.0.22 (the antiquated version that comes with CentOS5) on one VM. I set up MySQL 4.1.22 on the other CentOS5 VM and tried to make it a slave of the 5.0 server using MySQL replication. The result was the following error message:

080911 15:25:13 [ERROR] Master reported an unrecognized MySQL version. Note that 4.1 slaves can't replicate a 5.0 or newer master.

This message was highly satisfactory. MySQL replication is not supposed to work down-version from 5.0 to 4.1.

Now to try it with the Tungsten Replicator. I followed the Tungsten Replicator manual instructions with the MySQL 5.0 host as master and the MySQL 4.1 host as the slave. It turns out the set-up is identical for both versions, which made this part very fast. I then issued the standard commands to bring up the master:

trepsvc start
trep_ctl.sh configure
trep_ctl.sh goOnline
trep_ctl.sh goMaster

followed by commands to start the slave:

trepsvc start
trep_ctl.sh configure
trep_ctl.sh goOnline

Now it was time to fire up mysql against the master database and enter some data.

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 25 to server version: 5.0.22-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create table foobar13 (id int, data varchar(25));
Query OK, 0 rows affected (0.12 sec)

mysql> insert into foobar13 values(1, 'first!!!');
Query OK, 1 row affected (0.00 sec)

However, over on the slave, nothing showed up. OK, I know we have never tested against MySQL 4.1, but what's up? Well, in the slave replicator log the following message appeared:

INFO   | jvm 1    | 2008/09/11 23:05:06 | 2008-09-11 23:05:06,536 FATAL tungsten.replicator.NodeManager You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SCHEMA IF NOT EXISTS tungsten' at line 1

Oops! The replicator tried to issue a CREATE SCHEMA command to create its catalog database. CREATE SCHEMA was only introduced in MySQL 5.0.2. Change this to CREATE DATABASE and run Ant to build and redeploy the code. Restart the slave and check the logs. They look clean this time. Now login to the slave database with mysql and look for the foobar13 table:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 20 to server version: 4.1.22-standard-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select * from foobar13;
+------+----------+
| id   | data     |
+------+----------+
|    1 | first!!! |
+------+----------+

Cool, it worked. Replication from MySQL 5.0 to MySQL 4.1 successfully demonstrated.
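The fix described above simply switches the catalog statement to CREATE DATABASE. For readers who want to see the version dependence spelled out, here is a small Python sketch of an alternative: choosing the statement from the server version string. It is not Tungsten code; the `parse_version` and `catalog_ddl` helpers are hypothetical names, and the 5.0.2 cutoff is the one mentioned above.

```python
def parse_version(version_string):
    """Turn a server version such as '4.1.22-standard-log' into (4, 1, 22)."""
    numeric = version_string.split("-")[0]
    return tuple(int(part) for part in numeric.split("."))


def catalog_ddl(version_string, name="tungsten"):
    """Pick the statement used to create the replicator's catalog database.

    CREATE SCHEMA is only used on servers at or above 5.0.2;
    older servers get the equivalent CREATE DATABASE form."""
    if parse_version(version_string) >= (5, 0, 2):
        return "CREATE SCHEMA IF NOT EXISTS %s" % name
    return "CREATE DATABASE IF NOT EXISTS %s" % name


print(catalog_ddl("5.0.22-log"))
print(catalog_ddl("4.1.22-standard-log"))
```

Always issuing CREATE DATABASE, as the fix above does, is simpler because that form works on both versions; the version check is only worth it when no common syntax exists.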

We will have a much improved Tungsten Replicator 1.0 alpha build ready in a couple of days that includes this fix and many others. By the way, we are working on getting heterogeneous replication to work as well. I'll have a lot more to say about that in future posts.

Sep 1, 2008

Continuent Community Site for Database Scale-Out

Our goal at Continuent is to be the go-to guys for database scale-out. Last Thursday we opened up a new community site for scale-out software at http://community.continuent.com. The site is driven by Joomla and has a number of very nice additions like Fireboard Forums and Mediawikis for each project. The first day or two was a bit bumpy as we nailed down some final issues, but most features are now working. We hope the result will be a nice place to meet other people who are interested in database scale-out and share ideas as well as software.

As you will see when visiting the community site, we have a variety of projects that we collectively call the Tungsten Scale-Out Stack. We have had this idea for a while now that it's not enough to have just one or two singing and dancing products--that's too narrow to solve scale-out problems. Instead you want a set of technologies that combine to create a wide variety of solutions.

Our effort last week included posting initial code for the Tungsten Replicator. We are actively testing, fixing bugs, and adding more features. However, there are a number of other projects on the site. I will talk about them on this blog in the future as each one is quite interesting.

Meanwhile, if you have a project that you might like to post on our site, let me know. We are actively looking for scale-out technology for MySQL, PostgreSQL, and commercial databases like Oracle. Just post on the end of this blog and I will see it.

Aug 28, 2008

Answering Monty's Challenge: Advanced Replication for MySQL

Today Continuent is publishing the Tungsten Replicator, which provides advanced open source master/slave replication for MySQL. Publishing code is the first step to creating a robust alternative to current MySQL replication and will be followed by similar support for Oracle, PostgreSQL, and many other databases.

We started with master/slave replication on MySQL for a very simple reason: we know it well. And we know that while MySQL replication has many wonderful features like simple set-up, it also has many deficiencies that have persisted for a long time. Monty Widenius, a widely respected MySQL engineer, summarized some of the key problems last April:

- replication is not fail safe
- no synchronous options
- no checking consistency option
- setup and resync of slave is complicated
- single thread on the slave
- no multi-master
- only InnoDB synchronizes with the replication (binary) log

These issues are well-known to the MySQL community. Monty laid down a challenge, but we all know the community can write software that solves it. However, there’s a much bigger challenge out there. There are highly capable replication products produced by commercial vendors like Golden Gate, Quest, Oracle, Sybase, and others. They handle high availability, performance scaling, upgrade, heterogeneous replication, cross site clustering—you name it. Why aren’t these capabilities available in an open source product? Why doesn’t that open source product have the ease-of-use and accessibility MySQL is famous for?

The Tungsten Replicator is designed to answer that challenge. Here’s the initial feature set:

Simple set-up procedure
Master/slave replication of one, some, or all databases
MySQL statement replication
Proper handling of master failover in presence of multiple slaves
Checksums on replication events
Table consistency check mechanism
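As a rough illustration of what a checksum on a replication event buys you, here is a minimal Python sketch. It does not show how Tungsten Replicator computes or stores checksums; the event layout, the `make_event` and `verify_event` helpers, and the choice of CRC32 are all assumptions for the example.

```python
import zlib


def _payload(seqno, sql):
    """Serialize the fields covered by the checksum."""
    return ("%d:%s" % (seqno, sql)).encode("utf-8")


def make_event(seqno, sql):
    """Build an event on the master side and attach a CRC32 checksum."""
    return {"seqno": seqno, "sql": sql, "checksum": zlib.crc32(_payload(seqno, sql))}


def verify_event(event):
    """Recompute the checksum on the slave side and compare."""
    return zlib.crc32(_payload(event["seqno"], event["sql"])) == event["checksum"]


event = make_event(1, "insert into foobar13 values(1, 'first!!!')")

# Simulate corruption in transit: one character of the statement changes.
corrupted = dict(event, sql="insert into foobar13 values(2, 'first!!!')")

print(verify_event(event), verify_event(corrupted))  # True False
```

A slave that verifies each event before applying it can stop on a damaged event instead of silently writing bad data.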

And here’s the roadmap:

Group communications-based management
Oracle support
PostgreSQL support
MySQL row replication
Heterogeneous replication
Multi-master via bi-directional replication with conflict resolution
Semi-synchronous replication
Parallel update on slaves to increase performance
Proxying support to reduce or eliminate application changes

We are implementing all of these features in a way that abstracts out platform and database differences. The architecture is not just database-neutral--by making it possible to extract from one database type and push to another we lay a cornerstone for heterogeneous data transfer.

Tungsten Replicator is available on our community website at http://community.continuent.com. Stop by and check it out. The code is in the early stages but will mature very rapidly. You can help us guide it forward. We are looking forward to answering Monty’s challenge and going much further. We are looking forward to creating something that brings powerful replication within the reach of every database user.

Aug 6, 2008

Drizzle is Cool but Confusing

Brian Aker's Drizzle post was the most interesting news to emerge during OSCON 2008. In case you have been on vacation, Drizzle is a stripped down version of MySQL for horizontally scaled web applications and Cloud Computing. Full-blown SQL databases are often overkill here, a point of view espoused by this blog among others.

It's easy to get excited about Drizzle. Brian, Monty, and others define the problem space very clearly and list some intriguing feature ideas on the Drizzle wiki. Just one example: sharding across multiple nodes, which is key to scaling massive reads and writes. From a technical perspective, it sounds cool.

Still, there's a dark side for Sun's database business. In addition to unfinished product versions and storage engines, there have now been at least three announced forks of the MySQL code in the last few months. It is thought-provoking that some of the most respected MySQL engineers inside and outside Sun are working on an alternative to the flagship product. This is the prelude to a classic trap that scuttled Informix among others in the 1990s. Even in the best case enterprise users will find it confusing.

Drizzle illustrates a problem with open source dialectics that has been developing since before the Sun acquisition--there's a big difference between open source to drive technology versus open source to market enterprise products. MySQL is a big tent with multiple products and business models uncomfortably rolled into one. There's no reason not to split them up into separate offerings with appropriate open source models for their respective markets. Other database vendors do this. However, Sun is running out of time to get the marketing right.

Meanwhile, for techies looking at large web applications or for Cloud developers, Drizzle is not confusing at all. It's time to download the code and see what's up. It could be really cool.

Jul 13, 2008

Myosotis Connector: a Fast SQL Proxy for MySQL and PostgreSQL

SQL proxies have been very much in the news lately, especially for open source databases. MySQL Proxy and PG-Pool are just two examples. Here is another proxy you should look at: Myosotis.

Myosotis is a 'native-client' to JDBC proxy for MySQL and PostgreSQL clients. We originally developed it to allow clients to attach to our Java-based middleware clusters without using a JDBC driver. Myosotis parses the native wire protocol request from the client, issues a corresponding JDBC call, and returns the results back to the client. As you can probably infer, it's written in Java. "Myosotis" incidentally is the scientific name for "Forget-Me-Not," a humble but strikingly beautiful flower.

Myosotis is still rather simple but it already has a couple of very interesting features. First, it works for both MySQL and PostgreSQL. That's a good start. Wire protocols are very time-consuming to implement. Another feature is that Myosotis is really fast. This deserves explanation and some proof.

As other people have discovered, proxying is very CPU-intensive. It also involves a lot of concurrency, since a proxy may have to manage hundreds or even thousands of connections. Java is already fast in single threads--after a few runs through method invocations, the JVM has compiled the bytecodes down to native machine code. In addition, Java uses multiple CPUs relatively efficiently. Myosotis uses a thread per connection. Java automatically schedules these on all CPUs and optimizes memory access in multi-core environments.

We can show Myosotis throughput empirically using Bristlecone, an open source test framework we wrote to measure performance of database clusters. We test proxy throughput by issuing do-nothing queries as quickly as possible with varying numbers of threads. The following run compares Myosotis against a uni/cluster 2007.1 process (a much more complex commercial middleware clustering software) and MySQL Proxy 0.6.1 running without Lua scripts. The proxy test environment is a Dell SC 1425 with 4 cores running CentOS5 and MySQL 5.1.23.

The results are striking. Myosotis gets between 3000 and 3500 queries per second when 8 threads are simultaneously running queries. To demonstrate processor scaling, run htop when the Myosotis Connector is being tested. You see something like this--a nice distribution across 4 cores.

Myosotis is a very simple proxy now but it has the foundation to create something great. We have big plans for Myosotis--it's a key part of our Tungsten architecture for database scale-out, which we will be rolling out later in the summer. The next step is to add routing logic so that we can implement load balancing and failover. We'll be doing that over the next few months. Meanwhile, if you want to see how fast Java proxies for SQL can be, check us out at http://myosotis.continuent.org.

p.s., If you want to repeat the test shown here on your own proxy, download Bristlecone and try it out. I used the ReadSimpleScenario test, which is specifically designed to check middleware latency.
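If you just want a feel for the measurement before installing anything, here is a minimal sketch of the idea: a fixed number of threads call a do-nothing query in a loop and we count completions per second. This is not Bristlecone code and does not reproduce ReadSimpleScenario; the function name measure_throughput and its parameters are made up for illustration, and run_query stands in for whatever call sends a trivial statement through your proxy.

```python
import threading
import time


def measure_throughput(run_query, num_threads, duration_s=1.0):
    """Return completed calls per second across num_threads workers.

    run_query is any zero-argument callable (hypothetical stand-in for
    sending a do-nothing query such as "SELECT 1" through a proxy).
    """
    counts = [0] * num_threads                    # one slot per worker, no lock needed
    barrier = threading.Barrier(num_threads + 1)  # workers plus the main thread
    stop = threading.Event()

    def worker(slot):
        barrier.wait()                            # start all workers together
        while not stop.is_set():
            run_query()
            counts[slot] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()

    barrier.wait()                                # release the workers
    started = time.perf_counter()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - started
    return sum(counts) / elapsed
```

Call it once per thread count (1, 2, 4, 8 and so on) and plot the results to get a curve like the one described above. Python threads only overlap while they wait on I/O, so this harness suits a client that blocks on the network; it says nothing about how the proxy itself uses its cores.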

Jul 2, 2008

What's Your Favorite Database Replication Feature?

Replication is one of the most flexible technologies available for databases. We are implementing a new open-source, database-neutral replication product that works with MySQL, Oracle, and PostgreSQL. Naturally we've done a lot of thinking about the feature set. It's tough to pick any single feature as the most important, but one that really stands out is optional statement replication. Here's why.

Database replication products tend to replicate row changes and DDL. However, Mark Callaghan has a great example of why you want to replicate statements as well--it enables Maatkit distributed consistency checking to work. If you dissect the mk-table-checksum --replicate command you will see that it uses a nice trick. The SQL queries generate checksums into the master table and then replicate as statements rather than row updates out to slaves. That way the slaves recompute the checksum locally at the same point in the overall transaction history. Very elegant!
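To see why the distinction matters, here is a toy sketch using two in-memory SQLite databases as "master" and "slave". It is not Maatkit's actual SQL; the table names, the crc32 helper, and the function names are assumptions made for illustration. The point is only that shipping the checksum statement makes the slave compute over its own rows, while shipping the resulting row just copies the master's answer.

```python
import sqlite3
import zlib

# The checksum is computed inside the database, over whatever rows are local.
CHECKSUM_SQL = """
    REPLACE INTO checksums (tbl, cnt, crc)
    SELECT 'items', COUNT(*), COALESCE(SUM(crc32(id || ':' || val)), 0)
    FROM items
"""


def make_server(rows):
    """Create an in-memory database holding the given (id, val) rows."""
    db = sqlite3.connect(":memory:")
    db.create_function("crc32", 1, lambda text: zlib.crc32(text.encode()))
    db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, val TEXT)")
    db.execute("CREATE TABLE checksums "
               "(tbl TEXT PRIMARY KEY, cnt INTEGER, crc INTEGER)")
    db.executemany("INSERT INTO items VALUES (?, ?)", rows)
    return db


def stored_checksum(db):
    return db.execute(
        "SELECT cnt, crc FROM checksums WHERE tbl = 'items'").fetchone()


def replicate_as_statement(master, slave):
    """Ship the SQL text: each server computes its own checksum."""
    master.execute(CHECKSUM_SQL)
    slave.execute(CHECKSUM_SQL)


def replicate_as_row(master, slave):
    """Ship the resulting row: the slave only stores the master's result."""
    master.execute(CHECKSUM_SQL)
    row = master.execute("SELECT tbl, cnt, crc FROM checksums").fetchone()
    slave.execute("REPLACE INTO checksums VALUES (?, ?, ?)", row)
```

With a slave whose data has drifted, replicate_as_row leaves identical checksum rows on both servers and hides the problem, while replicate_as_statement leaves different rows that a later comparison can flag.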

Replicated consistency checks are a wonderful feature for large systems that can't afford to stop in order to compare tables between servers. However, you cannot use it if your database cannot replicate statements. As Mark points out, not even all MySQL engines do this. The proposed replication additions for PostgreSQL won't support it either.

Optional statement replication is really the best kind of feature: it is useful on its own, but also enables features like consistency checking and other nice administrative tricks. We're going to put a "worm-hole" in our replication engine that allows applications to invoke statement replication at the SQL level. Can you guess how we are going to do it? If not, you'll have to wait until we release. :)

So what's your favorite database replication feature?

Jun 29, 2008

The Economist Gets It Right: After Bill

Print journalism is still hanging in there. Ludwig Siegele has a great article about Bill Gates' departure from Microsoft in the latest edition of the Economist. It's the most balanced presentation of the problems and opportunities in front of Microsoft I have seen in quite some time.

For those of you who don't follow the Economist regularly, Ludwig is one of the top technical journalists at the magazine. He covered Silicon Valley during the dot-com bubble and is now working on a book that promises to be pretty interesting. Meanwhile, I hope he has time to keep cranking out more articles on technology like this one.

Jun 25, 2008

Cloudcamp San Francisco: SQL or SimpleDB?

One of the best discussions at Tuesday's CloudCamp San Francisco was "SQL or SimpleDB - Who will win?" Cloud computing is part of a fundamental shift in computer operations propelled by virtualization of hosts and disk storage. We were already starting to argue about SimpleDB as the camp started when the person sitting next to me astutely jumped up and proposed it as a topic for discussion.

The argument against SQL goes something like this. Many applications handle very simple objects using only primary key look-ups. Hashtable-based datastores like SimpleDB and BigTable handle that model and also partition data automatically. This simpler data model maps better to object models in scripting languages, many of which deal in objects that are essentially associative arrays. Typing issues? Let the application figure it out. MapReduce processing permits huge increases in parallelism, provided you have a problem like document indexing for which it is well-suited. Finally, both SimpleDB and BigTable have an availability model that automatically deals with failures of database nodes. Availability is almost always an add-on for SQL databases.
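For anyone who has not used one of these stores, the model can be sketched in a few lines. This is a toy, not a description of how SimpleDB or BigTable work internally; the PartitionedStore class and its methods are hypothetical. Each item is an untyped bag of attributes found by primary key, and a hash of the key decides which partition holds it.

```python
import hashlib


class PartitionedStore:
    """Toy key/value store that spreads items over partitions by key hash."""

    def __init__(self, num_partitions):
        # Each dict stands in for a separate storage node.
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, key):
        # Use a stable hash so a key always maps to the same partition.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def put(self, key, attributes):
        # No schema and no type checks: the application owns those problems.
        self.partitions[self._partition_for(key)][key] = dict(attributes)

    def get(self, key):
        # Primary-key look-up only; returns None when the key is absent.
        return self.partitions[self._partition_for(key)].get(key)
```

Note what is missing: there are no joins, no secondary queries, and nothing that stops one item storing a price as a number and the next as a string.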

There's no doubt the question of SimpleDB vs. SQL is well-posed. Cloud computing is just another way of organizing operations. It does not make it any easier to build SQL clusters or in fact do things that SQL databases don't already do on LANs. The real issue is between programming models.

That said, I think we have heard these arguments before. There are ample reasons why just about every innovation in data management in the last 20 years has ended up being folded back into relational databases. First, "SQL" is a mass of features ranging from data model to programming APIs and conventions to tools that have taken decades to develop. Those features are there because at some point some application really needed them.

Second, programming in objects and eliminating impedance mismatch was the promise of object-oriented databases. However, it turns out that trapping data in objects is not so great when you decide to use data for other purposes. SQL makes data first class, hence easily accessible for new applications. This is a core idea behind the relational model. Also, "typeless" storage systems are really hard to maintain over time, because they put the onus of dealing with versions on applications. Such systems may scale well over large quantities of data. However, they don't scale well over complexity of data.

Third, SQL databases like MySQL and PostgreSQL run in any data center. SimpleDB only runs in Amazon Web Services. For the time being at least there's a major lock-in problem, though CouchDB and Hadoop show that it may not persist for too long.

So what's the resolution? Well, this question is nowhere near settled and my account does not nearly do justice to the SimpleDB point of view. Still, I think there are two things going on here that actually don't have too much to do with cloud computing per se. To begin with, there are new classes of applications like web-scale indexing that need massive parallelization to operate efficiently. Conventional SQL databases just don't work here. It's not all that different from the way that large-scale data analytics are pushing people to consider column storage. However, there's another issue. I think we are seeing a reaction against complexity. Commercial databases are just overkill for many applications.

CloudCamp was full of interesting ideas, but my takeaway was quite basic. Cloud Computing needs lightweight SQL databases that are baked into the stack. This sounds a lot like MySQL, but MySQL is not simple any more. We need a simple relational database that partitions data across hosts and has built-in availability along the lines of SimpleDB's eventual consistency. As far as I know it does not exist yet. So who is building that database?

Jun 19, 2008

Webinar: The Coolest Scale-Out Projects on the Planet

My company Continuent sponsors Continuent.org, an open source site that contains some of the coolest scale-out projects around. You may have heard of Sequoia, which implements middleware clustering of any database that has a JDBC driver. However, Sequoia is really just the beginning.

We have several other projects that offer interesting scale-out technologies. Myosotis implements fast SQL proxying, Hedera provides wrappers for group communications, and Bristlecone has tools for performance testing of scale-out architectures. This summer we will add projects for database neutral master/slave replication as well as cluster management. In short, there's a lot to look at.

If you would like a closer look at Continuent.org, I'm doing a Webex webinar to talk about the overall technology stack and project roadmaps. It's scheduled for Thursday June 26th at 10am EDT. You can sign up here to see what's going on.

Don't worry if you miss the presentation--I'll post slides here and will be doing a series of blog entries on each of the projects in the coming weeks.

Jun 2, 2008

PostgreSQL Gets Religion About Replication

The PostgreSQL community is getting really serious about replication. On Thursday May 29th, Tom Lane issued a manifesto concerning database replication on behalf of the PostgreSQL core team to the pgsql-hackers mailing list. Tom's post basically said that lack of easy-to-use, built-in replication is a significant obstacle to wider adoption of PostgreSQL and proposed a technical solution based on log shipping, which is already a well-developed and useful feature.

What was the reaction? The post generated close to 140 responses within the next two days, with a large percentage of the community weighing in. It's one of the most significant announcements on the list in recent history. There is pent up demand for this feature and within a few hours people were already deep into the details of the implementation.

The basic idea comes from an excellent presentation by Takahiro Itagaki and Masao Fujii of NTT at PGCon 2008 in Ottawa. They have developed a system that replicates database log records synchronously to a standby database. The standby can recover quickly and without data loss, which makes it a good availability solution. The core team manifesto proposes to integrate this into the PostgreSQL core and add the ability to open the standby for reads.

So, is this the end of the story on replication? I don't think so. There's no question that synchronous log shipping with reads would be a great feature. Basic availability is the first problem users run into when setting up production systems and this feature looks considerably better than alternatives for other databases like MySQL. It will help if NTT donates their code to the community, but still the whole effort will take considerable time. Adding the ability to open a standby for reads is at least a version out (read: up to 2 years).

More importantly, log shipping is most useful for availability. It does not help you replicate across database versions (nice for upgrades), between different databases, from a master to large numbers of slaves, or bi-directionally between databases. Finally, it's a less than ideal solution for clustering data between sites, something that is rapidly becoming one of the most important overall uses of replication. For these and other cases you need logical replication, which turns log records into SQL statements and applies them using a client.
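As a rough sketch of what "turns log records into SQL statements and applies them using a client" means, assume the log has already been decoded into simple change records. The record layout (table, op, key, new) and the function names are invented for this example and do not come from PostgreSQL or any replication product. SQLite plays the part of the target database.

```python
import sqlite3


def to_sql(change):
    """Translate one decoded row change into (sql_text, parameters).

    Table and column names are interpolated directly, so they must come
    from a trusted source such as the database's own log.
    """
    table, op = change["table"], change["op"]
    if op == "insert":
        cols = list(change["new"])
        marks = ", ".join("?" for _ in cols)
        sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({marks})"
        return sql, [change["new"][c] for c in cols]
    if op == "update":
        sets = ", ".join(f"{c} = ?" for c in change["new"])
        where = " AND ".join(f"{c} = ?" for c in change["key"])
        params = list(change["new"].values()) + list(change["key"].values())
        return f"UPDATE {table} SET {sets} WHERE {where}", params
    if op == "delete":
        where = " AND ".join(f"{c} = ?" for c in change["key"])
        return f"DELETE FROM {table} WHERE {where}", list(change["key"].values())
    raise ValueError(f"unknown operation: {op}")


def apply_changes(conn, changes):
    """Apply changes in log order through an ordinary client connection."""
    for change in changes:
        sql, params = to_sql(change)
        conn.execute(sql, params)
    conn.commit()
```

Because the slave only ever sees plain SQL arriving over a normal connection, it can be a different version or even a different database product, which is exactly what physical log shipping cannot offer.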

I'm therefore starting an effort to get logical replication hooks included as a parallel effort. If you are interested in this let me know. Meanwhile, stay tuned. Tom's message represents a real change of heart for the PostgreSQL community. Accepting the importance of replication opens up the doors for a new round of innovation in scale-out based on PostgreSQL. It could not come at a better time.

May 15, 2008

Tungsten Scale-Out Stack Presentation from MySQL Conference

There have been a number of requests for copies of the slides to the Tungsten Scale-Out Stack talk I gave at the MySQL Conference in April. Here they are courtesy of the nice folks at O'Reilly who organized the conference.

Tungsten is our codename for a set of technologies to raise database performance and availability using scale-out. In the database world scale-out is a term of art that means spreading data across servers on multiple systems. With data in multiple places you are less subject to failures--when one copy crashes you just use the others. Similarly, if your application runs a lot of queries, you can spread them over different machines, which makes for faster and more stable response times.
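The "spread queries over copies and skip the ones that crashed" idea can be sketched as a tiny router. This is an illustration only, not Tungsten code; the ReplicaRouter class and its method names are hypothetical, and a real router would also have to detect failures on its own rather than being told.

```python
import itertools


class ReplicaRouter:
    """Hand out replicas round-robin, skipping any marked as down."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.down = set()
        self._cycle = itertools.cycle(self.replicas)

    def mark_down(self, name):
        self.down.add(name)

    def mark_up(self, name):
        self.down.discard(name)

    def next_replica(self):
        # Try each replica at most once per call before giving up.
        for _ in range(len(self.replicas)):
            candidate = next(self._cycle)
            if candidate not in self.down:
                return candidate
        raise RuntimeError("no replica available")
```

An application would call next_replica() before each read and send the query to whichever copy comes back.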

So database scale-out sounds great (and is too), but getting it to work properly is harder than you would think. Along with practical issues like management, there are theoretical barriers. Let's say you are creating a product catalog service using database replicas on different hosts. Applications connect to any replica to get information. Your manager, a guy with pointed hair, tells you to make sure of the following:

1. The catalog service is always available.
2. The service keeps working even if you get a network partition between hosts.
3. The copies are always consistent (e.g., you can go to any copy and get the same data).

Here's an ugly surprise. It turns out your data service can only have two of the three properties at any given time, a result that was proven only recently and is now called the CAP Principle. If you want to be available and handle network partitions, you must accept that data will sometimes be inconsistent. Your manager is going to be very disappointed.
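A toy simulation makes the trade-off concrete. It is an illustration under simplifying assumptions, not a proof, and the Replica and Catalog classes are invented for this example. Two replicas normally copy every write to each other. When the network is partitioned, the service either keeps accepting writes and lets the copies diverge, or refuses writes and stops being available.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class Catalog:
    """Two replicas that copy each write to the peer when they can reach it."""

    def __init__(self, prefer_availability):
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False
        self.prefer_availability = prefer_availability

    def write(self, replica, key, value):
        peer = self.b if replica is self.a else self.a
        if self.partitioned:
            if not self.prefer_availability:
                # Stay consistent by giving up availability.
                raise RuntimeError("write refused: peer unreachable")
            # Stay available by accepting the write locally only.
            replica.data[key] = value
            return
        replica.data[key] = value
        peer.data[key] = value

    def read(self, replica, key):
        return replica.data.get(key)
```

Set partitioned to True and write through one replica: with prefer_availability on, the two copies return different answers for the same key; with it off, the write raises an error instead.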

That's where we get back to Tungsten and the Scale-Out Stack. We realized a while back that you can't think in terms of a single product or even family of products to solve scale-out in a general way. It's better to design a flexible set of technologies with different strengths and weaknesses that users choose based on what's important to them. If you need to cluster over a WAN, use master/slave replication. If you don't want master failures, use synchronous replication in middleware.

Read the slides to learn more about the thinking. Database scale-out is a fascinating problem and we are looking forward to making it much easier to handle. Please stay tuned! I'll be writing more about this in the weeks and months to come.

May 7, 2008

What Else Would Oracle Say?

This just in. In a long interview on Linux Voices, Oracle's Linux architect Edward Screven comments on the MySQL/Sun acquisition.

...we just don’t care. I mean, we don’t see MySQL very often, again, in competitive deals. It’s out there, but it’s not very often that a database sales rep comes back and says, “I had to compete for the business against MySQL.”

To be fair the question is about how the MySQL acquisition affects Linux. But it seems really hard to believe Oracle does not care about MySQL. This is the same company that bought InnoDB. There is no doubt that Oracle is watching developments at Sun very carefully. The interesting problem for Oracle is not simply that Sun now has MySQL. It is that Sun owns or backs a portfolio of open source databases. And there are plenty of companies besides Sun that are working to make those databases full-featured, highly available, and very scalable. Like my company, Continuent, for example.

With a small number of additional acquisitions, Sun could control the open source end of the market fully. Now that has to be a little disquieting.

May 4, 2008

MySQL, Sun, and the Future of Open Source Databases

So what's it like now that Sun owns MySQL? The executive summary: a little weird. I was at the MySQL User Conference a couple of weeks ago and had a chance to talk with a lot of people in the community as well as many MySQL folks. Marten Mickos is now the head of database products at Sun. It's not very hard to figure out what Sun will do with MySQL products for the near future--pretty much what MySQL was doing already.

The real question for a lot of people is what will happen with databases like PostgreSQL and Derby. Sun has invested heavily in both of them, and PostgreSQL in particular is now quite fast. With the MySQL acquisition, Sun has an opportunity to run the table with multiple offerings that cover both enterprise applications as well as web and embedded. However, that would mean cutting down the MySQL roadmap to concentrate, for example, on scale-out rather than scale-up. It would also require thinking big to combine with other vendors in order to disrupt the market leader Oracle. Done right, there's a chance to upend the industry in a way that has not occurred since Microsoft muscled into databases in the early 1990s using code bought from Sybase.

Based on talks from people like Rich Green and Marten Mickos, it's hard to see this happening. Sun is taking a hands-off approach to MySQL *and* giving MySQL management control of overall database strategy. A disruptive change therefore seems unlikely. In fact, the more likely result is stagnation, now that MySQL no longer has to fight for its existence. The MySQL roadmap is still pretty diffuse and there has been little product movement since the 2007 User Conference. MySQL 5.1 is still not out the door. Falcon is likely to show up ready for production use around the time the Boeing Dreamliner rolls out. MySQL is still working on multiple storage engines (2 new ones plus NDB and MyISAM, to name a couple.) There's not even a glimmer of a date for cool new replication features like a pluggable replication interface. In short, not much evidence for radical changes of any kind.

Also, there must be the awful temptation to focus on vertical scaling so that MySQL can work on Sun hardware with large numbers of cores. I asked Marten Mickos specifically about the choice between scaling up and scaling out but didn't get a very clear answer. Personally I think for MySQL to concentrate very hard on vertical scaling would be a strategic error. The community that made MySQL great is into commodity hardware and scale-out in a big way. First rate support for highly scaled SMP architectures is going to be a long slog that will compromise delivery of many other features.

Given all of this it's hard not to see innovation, particularly in problems like scale-out, shifting away from MySQL to other databases as well as middleware. This would be a great time for the PostgreSQL community to get really serious about data replication. MySQL won't fade--it's already a great database. But there's likely to be a crowd of people in the MySQL community eying other solutions. It's going to be interesting to see what they come up with.