Sunday, June 29, 2008

The Economist Gets It Right: After Bill

Print journalism is still hanging in there. Ludwig Siegele has a great article about Bill Gates' departure from Microsoft in the latest edition of the Economist. It's the most balanced presentation of the problems and opportunities in front of Microsoft I have seen in quite some time.

For those of you who don't follow the Economist regularly, Ludwig is one of the top technical journalists at the magazine. He covered Silicon Valley during the dot-com bubble and is now working on a book that promises to be pretty interesting. Meanwhile, I hope he has time to keep cranking out more articles on technology like this one.

Wednesday, June 25, 2008

Cloudcamp San Francisco: SQL or SimpleDB?

One of the best discussions at Tuesday's CloudCamp San Francisco was "SQL or SimpleDB - Who will win?" Cloud computing is part of a fundamental shift in computer operations propelled by virtualization of hosts and disk storage. We were already starting to argue about SimpleDB as the camp started when the person sitting next me astutely jumped up and proposed it as a topic for discussion.

The argument against SQL goes something like this. Many applications handle very simple objects using only primary key look-ups. Hashtable-based datastores like SimpleDB and BigTable handle that model and also partition data automatically. This simpler data model maps better to object models in scripting languages, many of which deal in objects that are essentially associative arrays. Typing issues? Let the application figure it out. MapReduce processing permits huge increases in parallelism, provide you have a problem like document indexing for which it is well-suited. Finally, both SimpleDB and BigTable have an availability model that automatically deals with failures of databases nodes. Availability is almost always an add-on for SQL databases.

There's no doubt the question of SimpleDB vs. SQL is well-posed. Cloud computing is just another way of organizing operations. It does not make it any easier to build SQL clusters or in fact do things that SQL databases don't already do on LANs. The real issue is between programming models.

That said, I think we have heard these arguments before. There are ample reasons why just about every innovation in data management in the last 20 years has ended up being folded back into relational databases. First, "SQL" is a mass of features ranging from data model to programming APIs and conventions to tools that have taken decades to develop. Those features are there because at some point some application really needed them.

Second, programming in objects and eliminating impedance mis-match was the promise of object-oriented databases. However, it turns out that trapping data in objects is not so great when you decide to use data for other purposes. SQL makes data first class, hence easily accessible for new applications. This is a core idea behind the relational model. Also, "typeless" storage systems are really hard to maintain over time, because they put the onus of dealing with versions on applications. Such systems may scale well over large quantities of data. However, they don't scale well over complexity of data.

Third, SQL databases like MySQL and PostgreSQL run in any data center. SimpleDB only runs in Amazon Web Services. For the time being at least there's a major lock-in problem, though CouchDB and Hadoop show that it may not persist for all too long.

So what's the resolution? Well, this question is nowhere near settled and my account does not nearly do justice to the SimpleDB point of view. Still, I think there are two things going on here that actually don't have too much to do with cloud computing per se. To begin with, there are new classes of applications like web-scale indexing that need massive parallelization to operate efficiently. Conventional SQL databases just don't work here. It's not all that different from the way that large-scale data analytics are pushing people to consider column storage. However, there's another issue. I think we are seeing a reaction against complexity. Commercial databases are just overkill for many applications.

CloudCamp was full of interesting ideas, but my takeaway was quite basic. Cloud Computing needs lightweight SQL databases that are baked into the stack. This sounds a lot like MySQL, but MySQL is not simple any more. We need a simple relational database that partitions data across hosts and has built-in availability along the lines of SimpleDB's eventual consistency. As far as I know it does not exist yet. So who is building that database?

Thursday, June 19, 2008

Webinar: The Coolest Scale-Out Projects on the Planet

My company Continuent sponsors, an open source site that contains some of the coolest scale-out projects around. You may have heard of Sequoia, which implements middleware clustering of any database that has a JDBC driver. However, Sequoia is really just the beginning.

We have several other projects that offer interesting scale-out technologies. Myosotis implements fast SQL proxying, Hedera provides wrappers for group communications, and Bristlecone has tools for performance testing of scale-out architectures. This summer we will add projects for database neutral master/slave replication as well as cluster management. In short, there's a lot to look at.

If you would like a closer look at, I'm doing a Webex webinar to talk about the overall technology stack and project roadmaps. It's scheduled for Thursday June 26th at 10am EDT. You can sign up here to see what's going on.

Don't worry if you miss the presentation--I'll post slides here and will be doing a series of blog entries on each of the projects in the coming weeks.

Monday, June 2, 2008

PostgreSQL Gets Religion About Replication

The PostgreSQL community is getting really serious about replication. On Thursday May 29th, Tom Lane issued a manifesto concerning database replication on behalf of the PostgreSQL core team to the pgsql-hackers mailing list. Tom's post basically said that lack of easy-to-use, built-in replication is a significant obstacle to wider adoption of PostgreSQL and proposed a technical solution based on log shipping, which is already a well-developed and useful feature.

What was the reaction? The post generated close to 140 responses within the next two days, with a large percentage of the community weighing in. It's one of the most significant announcements on the list in recent history. There is pent up demand for this feature and within a few hours people were already deep into the details of the implementation.

The basic idea comes from an excellent presentation by Takahiro Itagaki and Masao Fujii of NTT at PGCon 2008 in Ottawa. They have developed a system that replicates database log records synchronously to a standby database. The standby can recover quickly and without data loss, which makes it a good availability solution. The core team manifesto proposes to integrate this into the PostgreSQL core and add the ability to open the standby for reads.

So, is this the end of the story on replication? I don't think so. There's no question that synchronous log shipping with reads would be a great feature. Basic availability is the first problem users run into when setting up production systems and this feature looks considerably better than alternatives for other databases like MySQL. It will help if NTT donates their code to the community, but still the whole effort will take considerable time. Adding the ability to open a standby for reads is at least a version out (read: up to 2 years).

More importantly, log shipping is most useful for availability. It does not help you replicate across database versions (nice for upgrades), between different databases, from a master to large numbers of slaves, or bi-directionally between databases. Finally, it's a less than ideal solution for clustering data between sites, something that is rapidly becoming one of the most important overall uses of replication. For these and other cases you need logical replication, which turns log records into SQL statements and applies them using a client.

I'm therefore starting an effort to get logical replication hooks included as a parallel effort. If you are interested in this let me know. Meanwhile, stay tuned. Tom's message represents a real change of heart for the PostgreSQL community. Accepting the important of replication opens up the doors for a new round of innovation in scale-out based on PostgreSQL. It could not come at a better time.

Scaling Databases Using Commodity Hardware and Shared-Nothing Design