Jan 2, 2010

Exploring SaaS Architectures and Database Clustering

Software-as-a-Service (SaaS) is one of the main growth areas in modern database applications. This topic has become a correspondingly important focus for Tungsten, not least because new SaaS applications make heavy use of open source databases like MySQL and PostgreSQL that Tungsten supports.

This blog article introduces a series of essays on database architectures for SaaS and how we are adapting Tungsten to enable them more easily.  I plan to focus especially on problems of replication and clustering relevant to SaaS—what are the problems, what are the common design patterns to solve them, and how to deploy and operate the solutions. I will also discuss how to make replication and clustering work better for these cases—either using Tungsten features that already exist or features we are designing.

I hope everything you read will be solid, useful stuff. However, I will also discuss problems where we are in effect thinking out loud about ongoing design issues, so you may see some ideas that are half-baked or flat-out wrong. Please do me the kindness of pointing out how they can be improved.

Now let's get started. The most important difference between SaaS applications and ordinary apps is multi-tenancy. SaaS applications are typically designed from the ground up to run multiple tenants (i.e., customers) on shared software and hardware. One popular design pattern is to have users share applications but keep each tenant's data stored in a separate database, spreading the tenant databases over multiple servers as the number of tenants grows. 
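To make the one-database-per-tenant pattern concrete, here is a minimal sketch in Python of how an application might route each request to the right tenant database. The host names, schema names, and the hard-coded dictionary are all invented for illustration; in practice the mapping would live in a catalog like the shared database described below.

    # Minimal sketch of one-database-per-tenant routing.  The dictionary stands
    # in for a tenant catalog that would normally live in the shared database.
    # All host, schema, and credential values here are invented.
    import MySQLdb  # or psycopg2 for PostgreSQL

    TENANT_CATALOG = {
        # tenant_id: (host, schema)
        "acme":    ("tenant-db1.example.com", "tenant_acme"),
        "globex":  ("tenant-db1.example.com", "tenant_globex"),
        "initech": ("tenant-db2.example.com", "tenant_initech"),
    }

    def connect_for_tenant(tenant_id, user="app", passwd="secret"):
        """Open a connection to the server and schema holding this tenant's data."""
        host, schema = TENANT_CATALOG[tenant_id]
        return MySQLdb.connect(host=host, user=user, passwd=passwd, db=schema)

    # Every request carries a tenant id and is routed accordingly.
    conn = connect_for_tenant("acme")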

Multi-tenancy has a number of important impacts on database architecture. I'm going to mention just three, but they are all significant. First of all, multi-tenant databases tend to evolve into complex topologies. Here's a simple example that shows how a successful SaaS application quickly grows from a single, harmless DBMS server to five servers linked by replication, with more growth to come.

In the beginning, the application has tenant data stored in separate databases plus an extra database for the list of tenants as well as data shared by every tenant. In accounting applications, for example, the shared information would include items like currency exchange rates and VAT rates that are identical for each tenant. Everything fits into a single DBMS server and life is good.
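As a rough illustration (the table and column names are mine, not from any particular product), the shared database holds little more than the tenant catalog plus the reference data every tenant reads:

    # Hypothetical shared-database schema: the tenant catalog plus reference
    # data, such as currency and VAT rates, that is identical for every tenant.
    SHARED_SCHEMA = [
        """CREATE TABLE tenants (
             tenant_id VARCHAR(32)  PRIMARY KEY,
             db_host   VARCHAR(255) NOT NULL,  -- server holding this tenant
             db_name   VARCHAR(64)  NOT NULL,  -- schema holding this tenant
             status    VARCHAR(16)  NOT NULL   -- e.g. active, migrating
           )""",
        """CREATE TABLE currency_rates (
             currency  CHAR(3)       NOT NULL,
             rate_date DATE          NOT NULL,
             usd_rate  DECIMAL(18,6) NOT NULL,
             PRIMARY KEY (currency, rate_date)
           )""",
        """CREATE TABLE vat_rates (
             country   CHAR(2)      NOT NULL,
             rate_date DATE         NOT NULL,
             rate_pct  DECIMAL(5,2) NOT NULL,
             PRIMARY KEY (country, rate_date)
           )""",
    ]

    def create_shared_schema(conn):
        """Create the shared tables on a freshly provisioned shared database."""
        cur = conn.cursor()
        for ddl in SHARED_SCHEMA:
            cur.execute(ddl)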

Now business booms and more tenants join, so soon we split the single server into three—a server for the shared data plus two tenant servers. We add replication to move the shared data into tenant databases. 
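Tungsten aims to manage this kind of wiring for you, but it is worth seeing what the bare-bones version looks like with native MySQL replication. In the sketch below the host names and credentials are invented, the binary log coordinates that CHANGE MASTER TO normally needs are omitted for brevity, and each tenant server is assumed to set a replicate-do-db filter in my.cnf so that only the shared schema replicates.

    # Bare-bones sketch of pointing each tenant server at the shared-data
    # master with native MySQL replication.  Host names and credentials are
    # invented, and binary log coordinates are omitted for brevity.  Each
    # tenant server would also set replicate-do-db = shared in my.cnf so that
    # only the shared schema is replicated, leaving tenant schemas untouched.
    import MySQLdb

    def subscribe_to_shared_master(tenant_host, master_host,
                                   repl_user="repl", repl_password="secret"):
        conn = MySQLdb.connect(host=tenant_host, user="admin", passwd="secret")
        cur = conn.cursor()
        cur.execute("CHANGE MASTER TO MASTER_HOST=%s, MASTER_USER=%s, "
                    "MASTER_PASSWORD=%s",
                    (master_host, repl_user, repl_password))
        cur.execute("START SLAVE")

    for host in ("tenant-db1.example.com", "tenant-db2.example.com"):
        subscribe_to_shared_master(host, "shared-db.example.com")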

Meanwhile business booms still more. Tenants want to run reports, which have a tendency to hammer the tenant servers. We set up separate analytics servers with optimized hardware and alternative indexing on the schema, plus more replication to load data dynamically from tenant databases.
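One common form of alternative indexing is simply extra indexes on the analytics copy that would be too expensive to maintain on the busy tenant masters. A made-up example (the table and index names are invented):

    # Hypothetical reporting indexes added only on the analytics server's copy
    # of a tenant schema; the tenant masters never carry them.
    REPORTING_INDEXES = [
        "ALTER TABLE tenant_acme.invoices ADD INDEX ix_inv_date (invoice_date)",
        "ALTER TABLE tenant_acme.invoices ADD INDEX ix_inv_cust (customer_id, invoice_date)",
    ]

    def add_reporting_indexes(analytics_conn):
        cur = analytics_conn.cursor()
        for stmt in REPORTING_INDEXES:
            cur.execute(stmt)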

And this is just the beginning, as the SaaS adds more customers and invents new services. It is not uncommon for successful SaaS vendors to run 20 or more DBMS servers, especially when you count slave copies maintained for failover and consider that many SaaS vendors also operate multiple sites. At some point in this evolution the topology, along with its replication and day-to-day management, can no longer be maintained by hand. As we say in baseball, Welcome to the Bigs.

Complex topologies with multiple DBMS servers lead to a second significant SaaS issue: failures. Just having a lot of servers means failures are a bigger problem than when you run a single DBMS instance. To show why, let's say an individual DBMS server fails in a way that requires you to do something about it on average once a year, an interval reliability engineers call the Mean Time Between Failures (MTBF). Here is a simple table that shows how often we can expect some server in the group to fail. (Supply your own numbers. These are just plausible samples.)

Number of DBMS Hosts    Days Between Failures
1                       365
2                       182.5
4                       91.3
8                       45.6
16                      22.8
32                      11.4
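The arithmetic is straightforward: if each host fails independently about once a year, a group of N hosts produces a failure roughly every 365/N days. A couple of lines of Python reproduce the table:

    # Expected days between failures across a group of N hosts, assuming each
    # host fails independently about once per year (an MTBF of one year).
    MTBF_DAYS = 365.0

    for hosts in (1, 2, 4, 8, 16, 32):
        print("%2d hosts: a failure roughly every %5.1f days" % (hosts, MTBF_DAYS / hosts))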

Failures are not just more common with more DBMS hosts; they are also more difficult to handle. Consider what happens in the example architecture when a tenant data server fails and has to be replaced with a standby copy. The replacement must not only replicate correctly from the shared data server, but the analytics server must also be reconfigured to replicate correctly from the replacement. This is not a simple problem. There's currently no replication product for open source databases that handles failures in these topologies without sooner or later becoming confused and/or causing extended downtime.
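To see why, here is a sketch of the steps a single tenant-server failover implies in the example topology. It is only a dry-run plan printer with invented host names, not Tungsten code or anyone's actual failover procedure, but it shows how many pieces have to be reconfigured consistently while applications wait:

    # Dry-run sketch of replacing a failed tenant server in the example
    # topology.  Host names are invented; the point is the number of moving
    # parts that must be updated consistently and in order.
    def plan_tenant_failover(failed, standby, shared_master, analytics):
        steps = [
            "Wait for %s to apply everything it received from %s" % (standby, failed),
            "Promote %s and update the tenant catalog to point at it" % standby,
            "Repoint %s to replicate shared data from %s" % (standby, shared_master),
            "Find a binlog position on %s equivalent to where %s stopped" % (standby, analytics),
            "Repoint %s to replicate tenant data from %s at that position" % (analytics, standby),
        ]
        for n, step in enumerate(steps, 1):
            print("%d. %s" % (n, step))

    plan_tenant_failover(failed="tenant-db1", standby="tenant-db1-standby",
                         shared_master="shared-db", analytics="analytics-db1")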

There is a third significant SaaS problem:  operations on tenants. This includes provisioning new tenants or moving tenants from one database server to another without requiring extended downtime or application reconfiguration.  Backing up and restoring individual tenants is another common problem. The one-database-per-tenant model is popular in part because it makes these operations much easier.
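The sketch below shows why: with a schema per tenant, backing up or moving one tenant reduces to dumping a single schema, loading it elsewhere, and updating the catalog. Host names, credentials, and paths are invented, and a real move would also have to deal with changes made while the copy is in flight.

    # Rough sketch of backing up one tenant and restoring it on another server
    # with mysqldump/mysql.  Host names, credentials, and paths are invented;
    # a real tenant move also has to handle writes made during the copy.
    import subprocess

    def dump_tenant(tenant_db, source_host, outfile, user="admin", password="secret"):
        with open(outfile, "w") as f:
            subprocess.check_call(
                ["mysqldump", "-h", source_host, "-u", user, "-p" + password,
                 "--single-transaction", "--databases", tenant_db],
                stdout=f)

    def load_tenant(target_host, infile, user="admin", password="secret"):
        # The dump includes CREATE DATABASE/USE statements, so loading it
        # recreates the tenant schema on the target server.
        with open(infile) as f:
            subprocess.check_call(
                ["mysql", "-h", target_host, "-u", user, "-p" + password],
                stdin=f)

    # Move tenant_acme from tenant-db1 to tenant-db2, then update the catalog
    # so the application routes acme's requests to the new server.
    dump_tenant("tenant_acme", "tenant-db1.example.com", "/tmp/tenant_acme.sql")
    load_tenant("tenant-db2.example.com", "/tmp/tenant_acme.sql")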

Tenant operations are tractable when you just have a few customers. In the same way that failures become more common with more hosts, tenant operations become more common as tenants multiply. It is therefore critical to automate them and to keep their impact on other tenants as small as possible.

Complex topologies, failures, and tenant operations are just three of the issues that make SaaS database architectures interesting as well as challenging to design and deploy.  It is well worth thinking about how we can improve database clustering and replication to handle SaaS.  That is exactly what we are working on with Tungsten.  I hope you will follow me as we dive more deeply into SaaS problems and solutions over the next few months. 

P.S. If you run a SaaS and are interested in working with us on these features, please contact me at Continuent. I'm not hard to find.
