Jun 25, 2008

Cloudcamp San Francisco: SQL or SimpleDB?

One of the best discussions at Tuesday's CloudCamp San Francisco was "SQL or SimpleDB - Who will win?" Cloud computing is part of a fundamental shift in computer operations propelled by virtualization of hosts and disk storage. We were already starting to argue about SimpleDB as the camp started when the person sitting next me astutely jumped up and proposed it as a topic for discussion.

The argument against SQL goes something like this. Many applications handle very simple objects using only primary key look-ups. Hashtable-based datastores like SimpleDB and BigTable handle that model and also partition data automatically. This simpler data model maps better to object models in scripting languages, many of which deal in objects that are essentially associative arrays. Typing issues? Let the application figure it out. MapReduce processing permits huge increases in parallelism, provide you have a problem like document indexing for which it is well-suited. Finally, both SimpleDB and BigTable have an availability model that automatically deals with failures of databases nodes. Availability is almost always an add-on for SQL databases.

There's no doubt the question of SimpleDB vs. SQL is well-posed. Cloud computing is just another way of organizing operations. It does not make it any easier to build SQL clusters or in fact do things that SQL databases don't already do on LANs. The real issue is between programming models.

That said, I think we have heard these arguments before. There are ample reasons why just about every innovation in data management in the last 20 years has ended up being folded back into relational databases. First, "SQL" is a mass of features ranging from data model to programming APIs and conventions to tools that have taken decades to develop. Those features are there because at some point some application really needed them.

Second, programming in objects and eliminating impedance mis-match was the promise of object-oriented databases. However, it turns out that trapping data in objects is not so great when you decide to use data for other purposes. SQL makes data first class, hence easily accessible for new applications. This is a core idea behind the relational model. Also, "typeless" storage systems are really hard to maintain over time, because they put the onus of dealing with versions on applications. Such systems may scale well over large quantities of data. However, they don't scale well over complexity of data.

Third, SQL databases like MySQL and PostgreSQL run in any data center. SimpleDB only runs in Amazon Web Services. For the time being at least there's a major lock-in problem, though CouchDB and Hadoop show that it may not persist for all too long.

So what's the resolution? Well, this question is nowhere near settled and my account does not nearly do justice to the SimpleDB point of view. Still, I think there are two things going on here that actually don't have too much to do with cloud computing per se. To begin with, there are new classes of applications like web-scale indexing that need massive parallelization to operate efficiently. Conventional SQL databases just don't work here. It's not all that different from the way that large-scale data analytics are pushing people to consider column storage. However, there's another issue. I think we are seeing a reaction against complexity. Commercial databases are just overkill for many applications.

CloudCamp was full of interesting ideas, but my takeaway was quite basic. Cloud Computing needs lightweight SQL databases that are baked into the stack. This sounds a lot like MySQL, but MySQL is not simple any more. We need a simple relational database that partitions data across hosts and has built-in availability along the lines of SimpleDB's eventual consistency. As far as I know it does not exist yet. So who is building that database?

4 comments:

Anonymous said...

Drizzle :)

Anonymous said...

Let me first say that the relational model was actually designed for this. If you recall, it discusses the fact that users should be unconcerned with issues like storage implementation. The usual example is the idea that interacting with a view or a table should be indistinguishable from the users standpoint, bot there's no reason this can't be taken further to mean knowing which database your getting data from. This is what SQL/MED is supposed to bring to the table, but we're still quite a ways off from seeing any useful implementations of it.

OTOH, if your programming model involves application interacting with a database through stored procedures, this is what pl/proxy gives you (or tries to at least) today. All of the multi-server complexity is hidden to the programmer, who just calls the database functions, and they do all the work; if your application talks to 1 or 100 databases, it makes no difference to you as a programmer.

Robert Hodges said...

Indeed, that separation of concerns is exactly why it's so hard to give up SQL. It's still amazing to me how clearly Codd understood this issue.

Your point about stored procs gets to the heart of the matter--pl/proxy gives you local SQL access (within the procedure) while putting availability on top in a non-SQL model. However, that gets back to a point made in the post, which is that availability tends to be an add-on for SQL. (It's unclear how SQL/MED deals with this problem but I only just started reading about it.)

So I'm curious whether you can build availability directly into the relational model the way that BigTable does. The fact that multiple copies of data cannot be guaranteed to be consistent flies in the face of dearly held SQL notions like read consistency. A SQL implementation would therefore look like ACID but with at least a couple letters missing. It's hard to see how you can square such an availability model with transactions. Maybe something like PL/Proxy is in fact the only way.

Marius said...

How about ScimoreDB?

For clustering uses functional data shipment. Data partitioning, execution and merging done in parallel. Distributed query described by own language (T-DQL) which is auto generated from SQL.