Jan 10, 2014

Why I Love Open Source

Anders Karlsson wrote about Some myths on Open Source, the way I see it a few days ago.  Anders' article is mostly focused on exploding the idea that open source magically creates high-quality code.  Sad to say, you do not have to look very far to see how true this is.

While I largely agree with Anders' points, there is far more that could be said on this subject, especially on the benefits of open source. I love working on open source software. Here are three reasons that are especially important to me.

1.) Open source is a great way to disseminate technology to users.  In the best cases, it is this easy to get open source products up and running:

$ sudo apt-get install software-i-want-to-use

A lot of software companies (mine included) open source their software because it gets product into the hands of people who might pay money for it later.  The strategy worked brilliantly for MySQL AB, as Anders pointed out, and MongoDB is repeating the tactic with what looks like equal success.  There has been a lot of pointless argument over the years about whether MySQL or MongoDB are "real databases."  Being easy to get is just as critical to adoption as features like transactions and scalable performance.

Open source is therefore even better for users, who can quickly decide if something works for them and provide feedback through communities about problems as well as suggested improvements.  To the extent open source software has high quality, it originates in the tight feedback loop between software producers and their user communities.  That in turn leads to faster innovation with fewer deviations from real user needs.  In olden days we called this getting the requirements right.  Open source projects often do it extraordinarily well.

2.) Open source allows like-minded communities of developers to create products that would otherwise never happen.   Linux became a dominant operating system in large part through the staggering scale of contributions enabled by exceptionally well-managed open source development.  Linus Torvalds recently pointed out that Linux kernel releases have patches from a thousand contributors or more.  Thanks to the wide range of contributions, Linux operates on everything from tiny ARM processors to servers with over 200 cores.  The development effort underlying the Linux ecosystem is huge when you include the kernel and all the packages that install over it. It dwarfs  any comparable operating system effort I can think of.

At the other end of the spectrum there are small but incredibly useful projects like Apache Curator.  The Curator project currently has 8 project members, mostly from different companies, who collaborate to make Apache ZooKeeper vastly easier to program.  I doubt libraries like Curator would even exist without open source licenses and infrastructure like distributed source code management.  Neither would ZooKeeper, for that matter.

Not every line of open source code is excellent or even above average.  (I'm looking at you, Hadoop.)  That said, open source projects are not so much about code as about communities of developers who understand and are interested in solving a specific problem.  Besides direct feedback from real users, this is the other prerequisite for creating truly great products.  Clean code is helpful but not necessary.

3.) Open source means your creations can never be taken away from you.  In many creative endeavors work belongs to the people who employ you.  It effectively disappears when you change jobs.  Putting code on GitHub or code.google.com breaks that bond.  Knowing that anything you create will always be accessible removes any hesitation about revealing your best ideas.  I believe this is one of the drivers behind the flowering of creativity that infuses so many open source projects.

At the same time working on open source software is not all peaches and cream.  Building successful businesses on open source is hard, which limits the opportunities to work on it for a living.

For instance, if most of the value of your product is in the software itself there is not much motivation for users to pay you.  I think that's one reason mobile apps are by and large paid, or at least not open source.  You need to find a business model that brings in enough money over time to fund the sort of concentrated engineering necessary to build robust software.  Successful open source businesses often depend on finding the right markets or achieving network effects, and not all software can fit the pattern.

The good news is that once you get the economics right it really wrong-foots your closed source competitors.  Red Hat has built a great business packaging and supporting open source for enterprises.  They see open source as a competitive advantage that extends their market reach and speeds up innovation.  An increasing number of companies producing DBMS software take the same view as they try to disrupt data management.  Outside of enterprise software, Valve Software is attacking proprietary gaming platforms through open source.

It's great to see the growing number of businesses based on open source development. When the model works it is incredibly satisfying. I guess this is a fourth reason why I love working on open source software.

Feb 6, 2013

Data Fabric Design Patterns: Transactional Data Service

This article is the second in a series on data fabric design and introduces the transactional data service design pattern.  The previous article in this series introduced data fabrics, which are collections of off-the-shelf DBMS servers that applications can connect to like a single server.  They are implemented from data fabric design patterns, which are reusable arrangements of DBMS servers, replication, and connectivity.  With this article we begin to look at individual design patterns in detail.

Description and Responsibilities

The transactional data service is a basic building block of data fabric architectures.  A transactional data service is a DBMS server that processes transactions submitted by applications and stores data safely.  Transactional data services have the following responsibilities:
  • Store data transactionally and recover data faithfully to the last full transaction following failure.
  • Provide a network-accessible application interface for accessing data.
  • Provide a reliable and reasonably quick method for backup and restore.
  • Maintain an accessible, serialized log of transactions.  This enables replication between services.
The following diagram illustrates the moving parts of a transactional data service.   In future diagrams we will just use the standard database symbol for the entire transactional data service, but for now we need to be able to see the contents.  
Motivation

Durable storage of transactions is the most fundamental responsibility of database systems.  It is difficult to build reliable applications if stored data can disappear or become corrupted because transactions were not committed before a crash.  Both problems can cause data loss.  Moreover, they can break replication links very badly if the DBMS server comes up in an inconsistent state, for example with some updates committed but others randomly rolled back.  This condition affects not only the one server but potentially many others throughout the fabric.

The transactional data service therefore focuses on storing data safely and recovering to the last committed transaction after a restart.  With this basic capability we can construct more complex services knowing that individual changes are unlikely to disappear or be recorded inconsistently.  

Detailed Behavior

Let's look in detail at the properties required for a successful transactional data service.  Somewhat surprisingly, an off-the-shelf SQL DBMS does not necessarily fit the pattern, though it comes close.  It is important to understand the differences. 

The transactional store keeps data from getting lost and is the basis for recovery throughout the fabric.  Transactional stores support commit and rollback with multi-statement transactions.  In theory the transaction data service responsibility for data persistence matches MySQL/InnoDB and PostgreSQL behavior, both of which commit transactions safely in serial order.  However, the reality is not quite that simple.
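
For illustration, here is a minimal multi-statement transaction against a hypothetical account table.  Both statements become durable together at COMMIT, or neither takes effect at all:

BEGIN;
INSERT INTO account (account_id, balance) VALUES (1, 100);
UPDATE account SET balance = balance - 10 WHERE account_id = 1;
COMMIT;  -- or ROLLBACK; to undo both statements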

Most DBMS allow applications to ignore transactions under certain conditions.  This results in  wormholes, which are violations in serial ordering of data.  There are a number of table definition options in SQL that undo transactional consistency.
  • (MySQL) MyISAM table type.  MyISAM tables ignore transactions and commit immediately, even if the application later tries to roll back.  The tables may also become corrupted by server crashes.  
  • (MySQL) Memory table type.  These tables are maintained in memory only.  They disappear on restart.  
  • (PostgreSQL) UNLOGGED tables.  Such tables are not logged and disappear on crash or unclean shutdown (thanks Frederico).  
All of these allow data to disappear or become corrupted after a crash. However, there is a more subtle problem.  If applications depend on these tables, transaction results may then depend on when the server last crashed or restarted, which in turn makes updates across replicas non-deterministic.  Random updates create problems for data replication, which depends on replicas behaving identically when transactions are applied.  It is important to avoid application dependencies on any feature that creates wormholes or you might not be able to use other design patterns.
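
On MySQL you can audit for such tables with a quick query against information_schema.  This is just a sketch that assumes everything outside the system schemas should be InnoDB; MyISAM and Memory tables will show up in the results:

SELECT table_schema, table_name, engine
FROM information_schema.tables
WHERE engine <> 'InnoDB'
AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');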

So is data loss always bad?  Surprisingly, no.  In some cases transactional stores can lose data provided that they do so by dropping the last transactions in serial order.   It's as if the data just reverted to an earlier point in time.  To understand why this might be OK, imagine three servers linked into a chain by asynchronous replication.
[Diagram: data loss is sometimes not a big deal]
It is bad to lose data on the first server, especially if those data are lost before replicating transactions to the downstream replicas.  However, data loss on the last server is fine.  Assuming that server stores the replication restart point transactionally, it will just re-apply the missing transactions and catch up.  This is exactly what happens when you restore a backup on a slave in master/slave replication.

Data loss on the second server also may not be a problem.  It should restart replication and should generate identical transactions for itself as well as for replication to the last server.  In both cases we assume that replication will handle these cases correctly and can replay missing transactions from logs.  If so, you can not only tolerate such losses but even depend on recovering from them automatically.

Turning to the next responsibility of the transactional data service, the application interface may obviously include SQL via MySQL or PostgreSQL wire protocols.  However, any consistent interface that is accessible over a network will do.  The memcached protocol is also perfectly acceptable.  Subsets of SQL such as stored procedures work quite well.  Transactional data services are more general than SQL DBMS servers in this sense.  Full SQL or even a subset is not a requirement.  

Backup and restore are critical for data fabrics as they enable provisioning of new services as well as recovery of services that fail.  You restore a backup and then let the service catch up using replication.  Data fabrics can get along fine using a range of options from logical dumps of the DBMS (mysqldump or pg_dump) to file system snapshots.  Fabric backups just need to be transactionally consistent and movable to other hosts.

Note that the backup required by the transactional data service design pattern is a less general form of backup than most businesses really require.  Businesses may need to recover data after accidental deletion or to keep copies of information for many years for legal reasons.  You can therefore use a general-purpose backup solution like Zmanda or Barman provided it meets the fabric design pattern requirements.  There's no need to do things twice.

Finally, the replication log is a serialized list of transactions to replicate to other hosts.  Serialization enables the transactions to be replayed on another host and result in an identical copy.  Generally speaking, data fabrics require logical replication, which applies changes to replicas using SQL statements on a live server.  This is because other design patterns depend on being able to access and even write to the transactional data service when it is acting as a slave.   Binary replication methods like disk block replication, such as DRBD, do not meet this requirement and therefore are of limited use in data fabrics.  
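
On MySQL, for example, the binlog serves as exactly this kind of serialized log.  You can inspect it with the stock mysqlbinlog utility; with row-based logging the --verbose option decodes row events into commented pseudo-SQL (the log file path below is just an example):

$ mysqlbinlog --verbose /var/lib/mysql/mysql-bin.000001 | less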

Implementation

You can implement the transactional data service design pattern with any DBMS that meets the pattern responsibilities.  That said, implementation details are very important.  As we have seen, ensuring that DBMS servers live up to the responsibility to store transactions safely is a little harder than one might think.  

1. MySQL.  MySQL with InnoDB engine is generally a good choice.  It has stable SQL APIs and a wide range of capable client libraries.  However, MySQL must be correctly configured to maintain proper transactional guarantees.  Here are three properties that should be in your my.cnf file to help ensure MySQL lives up to its responsibilities: 

# Ensure durable flush to storage on transaction commit.   
innodb_flush_log_at_trx_commit=1
# Synchronize binlog with committed transactions. 
sync_binlog=1
# Use InnoDB as default storage engine.  (Unnecessary for MySQL 5.5 and above.) 
default-table-type=InnoDB

There are a variety of good backup mechanisms for MySQL databases, including mysqldump (with --single-transaction, useful only for small data sets), Percona XtraBackup, and file system snapshots. Snapshots are especially good when using NetApp or other capable storage.  NetApp snapshots can be restored in seconds and cost little in terms of performance overhead.  
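
As a sketch, a transactionally consistent logical backup with mysqldump looks like the following.  The --single-transaction option takes a consistent InnoDB snapshot without locking, and --master-data=2 records the binlog position as a comment so a restored service can catch up via replication:

$ mysqldump --single-transaction --master-data=2 --all-databases > fabric-backup.sql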

2. PostgreSQL.  PostgreSQL with a trigger-based replication log, for example from Londiste or SLONY, and pg_dump for backups is another good choice.  PostgreSQL has unusually good trigger support, for DML changes at least, and permits users to encode triggers in a number of languages.  Be aware that PostgreSQL triggers do not capture DDL statements like CREATE TABLE, though.

PostgreSQL is fully transactional out of the box and triggers create a fully serialized replication log.  It does not have the problem that MySQL does with potentially unsafe table types like MyISAM.  However, you need to set a couple of parameters to ensure safe operation.  These ensure transactions are committed down to the storage level and prevent catastrophic corruption of the database and/or the WAL (write-ahead log).  Since PostgreSQL defaults to these values, you mostly need to avoid turning them off.

fsync = on                             # turns forced synchronization on or off
synchronous_commit = on                # immediate fsync at commit
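
You can confirm the live values from psql at any time:

SHOW fsync;
SHOW synchronous_commit;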

Like MySQL's, PostgreSQL's SQL and APIs are stable and well-known.  pg_dump also dumps and restores data without difficulty for smallish data sets.  For larger data sets file system snapshots work very well.
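
Here is a sketch of a consistent logical backup and restore using pg_dump's custom format; the database name mydb is hypothetical, and the target database must already exist before pg_restore runs:

$ pg_dump --format=custom --file=mydb.dump mydb
$ pg_restore --dbname=mydb mydb.dump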

Regardless of the DBMS type you choose, it is important to avoid application-level features that introduce wormholes, such as the PostgreSQL unlogged tables mentioned in the previous section.  Generally speaking, you should only skip transactions if there is a very strong reason for doing so.

Do other database types work for this design pattern?  Of course.  You can also use a commercial DBMS like Oracle.  Oracle fulfills the pattern responsibilities quite well, but is a bit more heavyweight than users want, particularly when operating in the cloud.

And Hardware Implementation, Too...

Even with a properly configured DBMS server you are still not completely out of the woods for data durability.  Database servers generally use an fsync() or similar system call to flush data to storage.  Unfortunately storage controller cards may just cache the data to be written in local RAM and return.  That can fool the DBMS server into thinking transactions are safely stored when they actually are still sitting in memory on a controller card.  The "committed" data will then vaporize in a host crash, which in turn can fatally corrupt both MySQL and PostgreSQL stores if you are very unlucky.  Just a few bad blocks can cause very serious problems.

Fortunately there is a cure to make data vaporization less likely.  On local storage you can invest in RAID with a battery-backed cache (BBU), which keeps power on for the cache even if the host fails completely.  SANs and network attached storage tend to have this capability built in.  (But check the specifications!)  Battery backed cache also tends to be fast, since controllers can safely return from an fsync() operation as soon as the data to be written are in the on-board cache.  Without this feature writes to storage can be painfully slow.

One interesting question is how to handle cloud environments like Amazon.  You just do not know how the storage actually works.  (That's really the point of a cloud, after all.)  Amazon provides SLAs for performance (example:  EBS provisioned IOPS), but there do not seem to be any SLAs about storage consistency.  There is a lot to learn here and lessons that apply to Amazon may not necessarily apply to others.  I suspect this will prompt some rethinking about data consistency--it's an interesting "what if" for transaction processing to suppose you cannot trust the underlying storage capabilities.

Data loss occurs sooner or later in virtually all systems, but the good news is that you can make it uncommon.  For more information check out data sources like this and this.  Also, other fabric design patterns like the Fault-Tolerant Data Service keep applications running when failures do occur and can even minimize the effects of data loss.  See the upcoming article on that design pattern for more information.

Implementations to Avoid

Here are two examples that do not meet the transactional data service design pattern responsibilities or at least not fully.  

1. MySQL with MyISAM table type.  MyISAM does not support transactions and is not crash safe.   You will lose data or incur downtime fixing problems.  MyISAM does not belong in data fabrics.

2. PostgreSQL with streaming replication.  Streaming replication replicates log updates in real-time and has the added benefit of permitting queries on replicas.  However, streaming replication does not allow you to write to replicas.  It therefore does not support online schema maintenance or multi-master replication.  It also does not help with heterogeneous replication.  Streaming replication is therefore an unattractive choice, even though it is far simpler and works better for ensuring high availability than logical replication solutions like SLONY.

How do NoSQL stores fare in this design pattern?  Let's pick on MongoDB.  MongoDB supports atomic commit to single BSON documents but does not support transactions across multiple documents.  (Unless you think that the hacky two-phase commit proposed by the MongoDB manual is an answer.)  Atomic transactions are not one of the reasons people tend to choose NoSQL systems, so this is not surprising. It means that MongoDB cannot handle the responsibilities of this design pattern.  
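
The distinction is easy to see in the mongo shell.  In this sketch (with a hypothetical accounts collection), the first update is atomic because both modifiers touch a single document; the transfer across two documents below it is not, so a crash in between leaves the books unbalanced:

> // Atomic: both modifiers apply to one document together or not at all
> db.accounts.update({_id: 1}, {$inc: {balance: -100}, $push: {history: "debit"}})
> // Not atomic: two independent operations with no transaction around them
> db.accounts.update({_id: 1}, {$inc: {balance: -100}})
> db.accounts.update({_id: 2}, {$inc: {balance: 100}})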

Conclusion and Takeaways

The transactional data service design pattern can be implemented with a carefully configured SQL database.  As we have seen, however, there are a number of details about what "carefully configured" really means.

It is a good idea to use the transactional data service design pattern even if you are not planning to implement a data fabric.  Systems grow.  This pattern gives you the flexibility to build out later by adding other fabric design patterns, for example to introduce cross-site operation using the Multi-Site Data Server pattern.  It also protects your data at multiple levels that include transactions as well as regular backups.  Nobody really likes losing data if it can be avoided.

Another important point: the transactional data service design pattern keeps your entire fabric working.  Losing small amounts of data is typically just an inconvenience for users, especially if it does not occur too often.  Broken replication, on the other hand, caused by replicas that diverge or become corrupt after failures, can lead to time-consuming administration and significant downtime to repair.  The fabric is a network of servers.  Poor configuration on one server can cause problems for multiple others.

Finally, the biggest error people make with this design pattern is to neglect backups.  There's something inherently human about it:  backups are painful to test so most of us don't.  My first and biggest IT mistake involved a problem with backups.  It nearly caused a large medical records company to lose a week of data entry records.  Put in backups and test them regularly to ensure they work.  This is a theme that will recur in later articles.

Speaking of which, the next article in this series will cover the fabric connector design pattern.  Stay tuned! 

Jan 6, 2012

Tungsten on the Beach--LA MySQL Meetup on Jan 11, 2012

It is my pleasure to announce that I will be presenting on Tungsten Replicator next Wednesday, January 11th at the Los Angeles MySQL Meetup. The presentation title is Fast, Flexible, and Fun--The Tungsten Replicator Magical Mystery Tour. This talk is going to be fun for two reasons.

First, it's a great opportunity to meet people in the LA MySQL community and talk about my favorite replication software. Tungsten is like a Swiss Army Knife for data replication.  It solves a wide range of problems involving HA, scaling, and data movement.   The presentation gives a quick intro to the replicator, then surveys how to use the most interesting features, including parallel slave apply, multi-master replication, transaction filtering, and replicating to MongoDB, Oracle, or data warehouses.  I'll even show you how to grab the GPL V2 sources from code.google.com and code up your own replicator extensions using Java or Javascript.

Second, the talk venue is in Santa Monica about 10 blocks up from the ocean.  Who doesn't like beaches?    I certainly do.  See you next week!

p.s.,  Thanks to Joe Devon and the other LA MySQL Meetup folks for the kind invitation.

Sep 29, 2011

Quick Installation of Replication from MySQL to MongoDB

Proof-of-concept Tungsten support for MongoDB arrived last May, when I posted about our hackathon effort to replicate from MySQL to MongoDB.  That code then lay fallow for a few months while we worked on other things like parallel replication, but the period of idleness has ended.  Earlier this week I checked in fixes to Tungsten Replicator to add one-line installation support for MongoDB slaves.

MySQL to MongoDB replication will be officially supported in the Tungsten Replicator 2.0.5 build, which will be available in a few weeks.  However, you can try out MySQL to MongoDB replication right now.  Here is a quick how-to using my lab hosts logos1 for the MySQL master and logos2 for the MongoDB slave. 

1. Download the latest development build of Tungsten Replicator.   See the nightly builds page for S3 URLs.

$ cd /tmp
$ wget --no-check-certificate https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.0.5-332.tar.gz

2. Untar and cd into the release. 

$ tar -xzf tungsten-replicator-2.0.5-332.tar.gz
$ cd tungsten-replicator-2.0.5-332

3. Install a MySQL master replicator on a host that has MySQL installed and is configured to use row replication, i.e. binlog_format=row.  Note that you need to enable the colnames and pkey filters.  These add column names to row updates and eliminate update and delete query columns other than those corresponding to the primary key, respectively.  Last but not least, ensure strings are converted to Unicode rather than transported as raw bytes, as we do in homogeneous MySQL replication to finesse character set issues.

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mysql \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos1 \
  --mysql-use-bytes-for-string=false \
  --svc-extractor-filters=colnames,pkey \
  --svc-parallelization-type=disk --start-and-report

4. Finally, install a MongoDB slave.  Before you do this, ensure mongod 1.8.x is up and running on the host as described in the original blog post on MySQL to MongoDB replication.   My mongod is running on the default port of 27017, so there is no --slave-port option necessary. 

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mongodb \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos2 \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --svc-parallelization-type=disk --start-and-report

That's it.  You test replication by logging into MySQL on the master, adding a row to a table, and confirming it reaches the slave.   First the SQL commands: 

$ mysql -utungsten -psecret -hlogos1 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
...
mysql> create table bar(id1 int primary key, data varchar(30));
Query OK, 0 rows affected (0.15 sec)

mysql> insert into bar values(1, 'hello from mysql');
Query OK, 1 row affected (0.00 sec)

Now check the contents of MongoDB:  

$ mongo logos2:27017/test
MongoDB shell version: 1.8.3
connecting to: logos2:27017/test
system.indexes
> db.bar.find()
{ "_id" : ObjectId("4e85269484aef8fcae4b0010"), "id1" : "1", "data" : "hello from mysql" }

Voila!  We may still have bugs, but at least MySQL to MongoDB replication is now easy to install.   

Speaking of bugs, I have been fixing problems as they pop up in testing.  The most significant improvement is a feature I call auto-indexing on MongoDB slaves.  MongoDB materializes collections automatically when you put in the first update, but it does nothing about indexes.  My first TPC-B runs processed less than 100 transactions per second on the MongoDB slave, which is pretty pathetic. The bottleneck is due to MongoDB update operations of the form 'db.account.findAndModify(myquery,mydoc)'.  You must index properties used in the query or things will be very slow.   

Auto-indexing cures the update bottleneck by ensuring that there is an index corresponding to the SQL primary key for any table that we update.  MongoDB makes this logic very easy to implement--you can issue a command like 'db.account.ensureIndex({account_id:1})' to create an index.  What's really cool is that MongoDB will do this even if the collection is not yet materialized--e.g., before you load data.   It seems to be another example of how MongoDB collections materialize whenever you refer to them, which is a very useful feature.  
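
As a quick sketch in the mongo shell (collection and property names from the example above), the index can go in before any data is loaded:

> // Create the index up front; the collection materializes on first reference
> db.account.ensureIndex({account_id: 1})
> // The index is visible even though the collection holds no documents yet
> db.account.getIndexes()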

TPC-B updates into MongoDB are now running at over 1000 transactions per second on my test hosts. I plan to fix more bugs and goose up performance still further over the next few weeks.  Through MongoDB we are unlearning assumptions within Tungsten that are necessary to work with non-relational databases.  It's great preparation for big game hunting next year:  replication to HBase and Cassandra.  

Sep 8, 2011

What's Next for Tungsten Replicator

As Giuseppe Maxia recently posted we released Tungsten Replicator 2.0.4 this week.  It has a raft of bug fixes and new features of which one-line installations are the single biggest improvement.  I set up replicators dozens of times a day and having a single command for standard cluster topologies is a huge step forward.  Kudos to Jeff Mace for getting this nailed down.

So what's next?  You can see what we are up to in general by looking at our issues list.  We cannot do everything at once, but here are the current priorities for Tungsten Replicator 2.0.5.
  • Parallel replication speed and robustness.  I'm currently working on eliminating choke points in performance (like this one) as well as eliminating corner cases that cause the replicator to require manual intervention, such as aging out logs that are still needed by slaves.  
  • Multi-master replication.  This includes better support for system of record architectures, many masters to one slave, and replication between the same databases on different sites.  Stephane Giron nailed a key MyISAM multi-master bug for the last release.  We will continue to polish this as we work through our current projects.   
  • Better installations for more types of databases.  Jeff recently hacked in support for PostgreSQL as well as Oracle slaves, and we are contemplating addition of MongoDB support.  Heterogeneous replication is getting simpler to set up.  
  • Filter usability.  Giuseppe has a list of improvements for filters, which are one of the most powerful Tungsten Replicator features but not as easy for non-developers to use as we would like.  Better installation support is first on the list, followed by the ability to load and unload filters dynamically.  
  • Data warehouse loading.  We have a design for fast data warehouse loading that I hope we'll be able to implement in the next few weeks.  Linas Virbalas has also been working on this problem along with a number of other heterogeneous projects for customers.  
This is a lot of work and not everything will necessarily be finished when 2.0.5 goes out.  However, I hope we'll make progress on all of them.  In case you are wondering how we pick things, replicator development is largely driven by customer projects.   If you have something you need in the replicator, please contact Continuent.

After this build we will... Er, let's get 2.0.5 done first.  Suffice it to say we have a long list of useful and interesting features to discuss in future blog articles.