Monday, February 17, 2014

Why Aren't All Data Immutable?

Over the last few years there has been an increasing interest in immutable data management. This is a big change from the traditional update-in-place approach many database systems use today, where new values delete old values, which are then lost. With immutable data you record everything, generally using methods that append data from successive transactions rather than replacing them.  In some DBMS types you can access the older values, while in others the system transparently uses the old values to solve useful problems like implementing eventual consistency.

Baron Schwartz recently pointed out that it can be hard to get decent transaction processing performance based on append-only methods like append-only B-trees.  This is not a very strong argument against immutable data per se.  Immutable data are already in wide use.   It is actually surprising they have not made deeper inroads into online transaction processing, which is widely handled by relational DBMS servers like MySQL and Oracle.

Immutable Data Are Now Economically Feasible

One reason for the popularity of update-in-place approaches is simple: storage used to be really expensive. This is no longer the case.  Many applications can now afford to store the entire DBMS transaction log almost indefinitely. To illustrate, look at storage costs in Amazon Web Services. Applications running in Amazon have API-level access to practically unlimited replicated, long-term storage through services like S3 and Glacier. Amazon conveniently publishes prices that serve as good proxies for storage costs in general.  Using these numbers, I worked up a simple spread sheet that shows the cost of storing 7 years of transactions for a made-up business application.

To start with, assume our sample app generates one thousand transactions per second at 1,000 bytes per transaction.  This is not exceedingly busy by some standards but is relatively high for business systems that handle human-generated transactions.  The main place you see numbers approaching this level is SaaS businesses that handle many customers on a single system.   Our sample system generates about 205,591 gigabytes of data over seven years.

Xacts/SecBytes/XactBytes/SecGB Generated in 1 HourGB Generated in 1 DayGB Generated in 1 MonthGB Generated in 1 YearGB Generated in 7 Years
1,0001,0001,000,0003.3580.472,447.5229,370.19205,591.32

Amazon storage costs vary from $0.011/Gb/month for Glacier to $0.09/Gb/month for S3 with full redundancy. (These are numbers for the US-West region as of 29 December 2013.) Annual storage costs for 7 years of data are pretty hefty if you store uncompressed data. However, if you factor in compression--for example MySQL binlogs tend to compress around 90% in my experience--things start to look a lot better.

Annual cost to store 7 years of data at different levels of compression
0%20%40%60%70%80%90%
Glacier$27,138.05$21,710.44$16,282.83$10,855.22$8,141.42$5,427.61$2,713.81
S3 Reduce Redundancy$177,630.90$142,104.72$106,578.54$71,052.36$53,289.27$35,526.18$17,763.09
S3 Standard$222,038.63$177,630.90$133,223.18$88,815.45$66,611.59$44,407.73$22,203.86

The raw costs still look hefty to the untrained eye, but we need to factor in the real expense of operating this type of system.  Here's a typical cost structure for a 3 node cluster (to ensure HA) with labor costs factored in and preserving 7 years of data.  I have put in generously small IT overhead costs including software development, since the code has to come from somewhere. Under these assumptions long-term storage costs are less 10% of the yearly cost of operation.

ComponentCostPercentageNotes
3 i2.4xlarge instances$46,306.6820.09%(Heavy utilization reserved, 1 yr. term)
3 support licenses$15,000.006.51%(Support subscription costs * 3x)
Raw dbadmin labor$12,000.005.21%(1 FTE/30 DBMS servers @ 120K per)
Software dev/QA$120,000.0052.06%(10 FTE/30 DBMS servers @ 120K per)
Misc. overhead costs$15,000.006.51%($5K per server)
S3 Storage$22,203.869.63%(7 years of data, 90% compression)
Total$230,510.54100.00%

Long storage costs for base transaction data can be far lower if any of the following hold:
  • You generate fewer transactions per second or they are smaller.  Many business apps produce far fewer transactions than my example. 
  • You don't keep data for the full 7 years.  Some of the analytic users I work with just keep a couple of years. 
  • You are already paying archiving costs for backups, in which case the additional storage cost becomes a wash if you can stop using a separate backup system.
  • You add more external costs to the picture--running a real business that generates this level of transactions often takes far more people than are shown in my projection. 
In these cases long term storage costs could be in the 1-2% range as a percentage of IT operating costs. Over time storage costs will decrease--though the rate of decline is hard to predict--so each year the number systems able to afford preservation of complete transaction histories will corresponding increase. This is particularly true for business transactions, which tend to be human generated and subject to upper growth limits once businesses are fully automated.  If you push data into Glacier, economically feasible retention periods can run to decades.  This is far longer than most businesses (or more particularly their lawyers) even want to keep information around.

There are still reasons for wanting an update-in-place model for OLTP systems, for example to keep as much of your working set as possible in RAM or on fast SSDs to keep response time low.  But storage cost alone is no longer a major factor for a wide range of applications.  This development is already affecting data management technology profoundly.  Doug Cutting has pointed out on numerous occasions that the downward cost trajectory of commodity storage was a key driver in the development of Hadoop.

Users Want Immutable Data

Many organizations already keep long transaction histories to feed analytics by loading them into traditional data warehouses based on Teradata, Vertica, and the like.  As soon as a practical method appeared to keep such data more economically, businesses began to adopt it quickly.  That "method" is Hadoop.

Hadoop has a fundamentally different approach to data management from relational and even many NoSQL systems.  For one thing, immutable data are fundamental.  The default processing model is that you write data but rarely change it once written.  To illustrate, the HiveQL SQL dialect does not even have UPDATE or DELETE statements.  Instead, you overwrite entire tables or parts of them to make changes.  This works because Hadoop organizes storage on cheap commodity hardware (HDFS) and provides a workable way to access data programmatically (MapReduce).

Hadoop changes the data management cost model in other ways besides utilizing commodity hardware efficiently.  With Hadoop you don't necessary define *any* data structures up front.  Instead, you store transactions in native form and write programs to interpret them later on.  If you need structure for efficient queries you add it through MapReduce and perhaps store it as a materialized view to make other queries more efficient.  Hadoop eliminates a lot of the up-front effort (and risk) required to get transactions into a data warehouse.  Instead, it defers those costs until you actually need to run specific analytics.  Moreover by storing native transaction formats, you can answer new questions years later.  That is a very powerful benefit.

I have been working a lot with Hadoop over the last few months.  It's a bear to use because it consists of a set of loosely integrated and rapidly evolving projects with weak documentation and lots of bugs. Even with these difficulties, the rising level of Hadoop adoption for analytics shows the underlying model has legs and that users want it.  As Floyd Strimling pointed out a while ago on Twitter this genie is not going back in the bottle.  HDFS is becoming the default storage mechanism for vast quantities of data.

Immutable Data Management Looks Like a Good Bet

One of the basic problems in discussing immutable data management is that there are different kinds of immutable data that persist at different timescales.  Baron has a point that Couchbase, Datanomic, NuoDB, or whatever new DBMS implementation you choose are in some ways recapitulating solutions that existing RDBMS implementations reached long ago.  But I also think that's not necessarily the right comparison when talking about immutable data, especially when you start to think about long retentions.

The fact is that Oracle, MySQL, PostgreSQL, and the like do not utilize distributed commodity storage effectively and they certainly do not enable storage of the long tail transaction histories that many businesses clearly want for analytics.  The best way to do that is to replicate transactions into HDFS and work on them there.  That is hard even for MySQL, which has flexible and economical replication options.  (We are working on making it easier to do at Continuent but that's another article. :)

In my opinion a more useful criticism of the arriviste competitors of traditional OLTP systems is that they don't go far enough with immutable data and risk being outflanked by real-time transaction handling built on top of HDFS. Hadoop real-time work on projects like Apache Spark is for the time being is focused on analytics but OLTP support cannot be far behind.  Moreover, there is a window to build competitors to HDFS that gets smaller as Hadoop becomes more entrenched.  This seems more interesting than building stores that offer only incremental improvements over existing RDBMS implementations.

Immutable data now permeate IT due to decreasing storage costs coupled with requirements for analytic processing. It's like the famous quote from William Gibson:
The future is already here--it's just not very evenly distributed.
If you look at the big picture the arguments for database management based on immutable data seem pretty strong.  It is hard to believe it won't be a persistent trend in DBMS design.  Over the long term mutable data look increasingly like a special case rather than the norm.

3 comments:

Justin Swanhart said...

Immutable data is fine for stuff that doesn't have to be updated, which by definition makes it immutable.

Using an immutable store for data which actually is updated can be costly because it is generally necessary to retrieve many versions of a row when processing queries, which can reduce performance.

It is possible to create an append-only log-structured table in any database. You can then use your own version and pruning logic, and maintain summary tables that contain only the newest picture of rows and other options.

This allows you to roll your own "flash back query" for example.

Because the tables are log structured, you don't need separate log tables for the changes to the tables themselves, so maintaining materialized views and rolling things forward/backward is relatively straightforward.

Robert Young said...

Accounting systems, since at least the first RDBMS versions (post-COBOL), are in fact based on immutable data. Prior to that, one would "close the month", which would wipe out the month-to-close ledger/journal/register (different names for the same thing) transaction "files", and post the aggregates as line items that month's entry in the yearly "file".

With RDBMS based accounting, clients had the option (they could continue to destructively close if they wanted). The problem was/is that RBAR processing each and every time is a pain. With SSDs, multi-processor, large RAM machines, it is getting feasible to store only the transaction table(s) and dispense with aggregates.

Anyone who uses a MVCC database knows how slow keeping everything around can be.

Robert Hodges said...

@Robert, Regarding MVCC there's no question you want to keep only a subset of data visible for most OLTP operations. That's the only way to make reads work efficiently when you have a small working set relative to the log.

Still, I think history may prove Pat Helland right, when he said that current DBMS contents are a cache of the tail of the log (http://blogs.msdn.com/b/pathelland/archive/2007/06/14/accountants-don-t-use-erasers.aspx). One obvious way to make this work is using RAM and SSD, which have also dropped in price quite substantially. I'm not an internals expert but the MVCC view could work like a secondary index in that it's derived rather than primary. The economics are going in a direction where people will keep trying this until they figure out how to get it to work. That's really my point.

Scaling Databases Using Commodity Hardware and Shared-Nothing Design