One of the best discussions at Tuesday's CloudCamp San Francisco was "SQL or SimpleDB - Who will win?" Cloud computing is part of a fundamental shift in computer operations propelled by virtualization of hosts and disk storage. We were already starting to argue about SimpleDB as the camp started when the person sitting next me astutely jumped up and proposed it as a topic for discussion.
The argument against SQL goes something like this. Many applications handle very simple objects using only primary key look-ups. Hashtable-based datastores like SimpleDB and BigTable handle that model and also partition data automatically. This simpler data model maps better to object models in scripting languages, many of which deal in objects that are essentially associative arrays. Typing issues? Let the application figure it out. MapReduce processing permits huge increases in parallelism, provide you have a problem like document indexing for which it is well-suited. Finally, both SimpleDB and BigTable have an availability model that automatically deals with failures of databases nodes. Availability is almost always an add-on for SQL databases.
There's no doubt the question of SimpleDB vs. SQL is well-posed. Cloud computing is just another way of organizing operations. It does not make it any easier to build SQL clusters or in fact do things that SQL databases don't already do on LANs. The real issue is between programming models.
That said, I think we have heard these arguments before. There are ample reasons why just about every innovation in data management in the last 20 years has ended up being folded back into relational databases. First, "SQL" is a mass of features ranging from data model to programming APIs and conventions to tools that have taken decades to develop. Those features are there because at some point some application really needed them.
Second, programming in objects and eliminating impedance mis-match was the promise of object-oriented databases. However, it turns out that trapping data in objects is not so great when you decide to use data for other purposes. SQL makes data first class, hence easily accessible for new applications. This is a core idea behind the relational model. Also, "typeless" storage systems are really hard to maintain over time, because they put the onus of dealing with versions on applications. Such systems may scale well over large quantities of data. However, they don't scale well over complexity of data.
Third, SQL databases like MySQL and PostgreSQL run in any data center. SimpleDB only runs in Amazon Web Services. For the time being at least there's a major lock-in problem, though CouchDB and Hadoop show that it may not persist for all too long.
So what's the resolution? Well, this question is nowhere near settled and my account does not nearly do justice to the SimpleDB point of view. Still, I think there are two things going on here that actually don't have too much to do with cloud computing per se. To begin with, there are new classes of applications like web-scale indexing that need massive parallelization to operate efficiently. Conventional SQL databases just don't work here. It's not all that different from the way that large-scale data analytics are pushing people to consider column storage. However, there's another issue. I think we are seeing a reaction against complexity. Commercial databases are just overkill for many applications.
CloudCamp was full of interesting ideas, but my takeaway was quite basic. Cloud Computing needs lightweight SQL databases that are baked into the stack. This sounds a lot like MySQL, but MySQL is not simple any more. We need a simple relational database that partitions data across hosts and has built-in availability along the lines of SimpleDB's eventual consistency. As far as I know it does not exist yet. So who is building that database?
Percona Toolkit 2.2.15 is now available
2 days ago