Wednesday, January 21, 2009

Encrypting Replicated Data

Open source projects have a great way of pulling in original ideas. Take encrypting replicated SQL: it's a big deal when you are are not sure about network security. This problem comes up very quickly when you transmit data over WAN links or in shared/cloud environments.

I have been procrastinating on implementation of encryption in the Tungsten Replicator because it felt as if we were going to have to do some surgery for it to work correctly. (Another hypothesis is that I'm lazy.) However, this morning I was talking via Skype to Mark Stenbäck, an engineer from Finland whom we met through a recent conference. Mark had a great idea. Why not just write an event filter? An excellent thought...Here's how it could work.

Filters transform or alter SQL events in the Tungsten Replicator. Let's say that you are replicating using MySQL 5.0. All SQL events contain SQL statements, which look like the following in our transaction history storage. The field 'SQL(0)' is the statement. Everything else is metadata.
SEQ# = 97
- TIME = 2009-01-12 10:01:46.0
- EVENTID = 000282:20633
- SOURCEID = centos5d
- STATUS = 2
- SCHEMA = mats
The SQL statement is contained in a simple data structure that is easy to manipulate within filter code. Now suppose you write the following Java classes:
  • StatementEncryptionFilter -- Looks at the event and encrypts the statement, using secret key (2-way) encryption.
  • StatementDecryptionFilter -- Looks at the event and decrypts the statement using 2-way decryption.
Implementing encryption itself is not very difficult in Java. You can follow the cryptographic examples in the Java documentation to get something working quickly. You then add the filters into the Tungsten file. The configuration might look like the following:
# Pre-storage filter(s).  Runs on events before they are stored
# in the transaction history log.

# Post-storage filter(s). Runs on events after pulling them from
# storage and before handing off for application.

# Encrypt/decrypt filter definitions.
There are some subtleties here--when there are multiple filters you need to make sure that the encryption filter is the last one called and similarly that the decryption filter is called first. Otherwise the SQL statements will look like garbage when other filters try to read them.

Another subtlety is row replication, which uses lists of objects to contain replicated values. These are harder to encrypt. In this case it might make sense to alter the Tungsten Replicator architecture to add a configurable storage filter that serializes and deserializes events in the transaction history log. This would be a nice feature addition. It's now JIRA TREP-199--we should be able to implement it within a couple of builds.

To get back to the topic of open source, Mark's suggestion is an excellent example of the power of working with a diverse group of people with similar interests. His idea seems straightforward to implement and also leads to a simple but powerful improvement to the replicator design.

Do you have any ultra-cool replication ideas? Post them on the Tungsten Replicator mailing list or just implement them directly. I'm going to set up a project for extensions to the replicator. Contributions are welcome--they will be open source and free to everyone. We'll gladly extend the Replicator to help make your ideas work. Meanwhile I'll keep posting other neat ideas for extensions. Please come on down and join the fun.


Mark said...

Data encryption through event filter is tempting in its apparent simplicity: basically once the filter is active the data is encrypted before it is written and decrypted right after it has been read.

Ideally the filter can be turned on and off as needed and it should work without making any changes to table structure. But when this is given, couple of more or less tricky issues surface:

Exceeding maximum column size: if the encryption changes the length of string by making it longer, the insert is likely to fail. So, the encrypted string must be as long or shorter than the original string. One possibility could be to combine string compression (zip) with encryption.

Encrypting number values: Numbers can be encrypted, but the resulting encrypted value might contain strings and numbers, which would not sit well with integer / decimal based columns. Perhaps the encrypted value could be pure numerical value, like the original value, but the encryption probably would not be very strong.

Confused stored procedures: Because the encryption/decryption filter is between database and application, stored procedures are likely to be in trouble when attempting access and process encrypted data.

Well, just to name couple of potential issues. There are ways to go around these, but more effectice solutions might become more invasive. For example, the filter might store the encrypted data in a kind of shadow table, but as a solution it feels messy.

Or as Buckminster Fuller once put it: "When I am working on a problem, I never think about beauty. I only think about how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong."

However, by having the encryption / decryption filter available when creating new database, the tables can be designed so that they are suitable for storing encrypted data: this way the table structure won't give any hints about what the data is really about.

One more thing: encrypting everything is rarely needed and it is likely to affect performance, so user should be able to specify which tables/columns to encrypt and which ones not.

All and all I think this is fascinating topic.

Robert Hodges said...

The length issue is an interesting one. We do have occasional limits, for example when storing events in particular DBMS types.

I should point out is that the filter described in the post does not actually affect applications. It protects data moving over the network and stored outside the database. Stored procedures would therefore work fine.

Finally there's a tricky issue with changing encryption settings, for example to update passwords. Done incorrectly it would force you to initialize the whole cluster from scratch.

J (Encrypted Flash Drive Guy) said...

Mark did a great job but I have a problem while setting up p2p transactional replication between two databases and to encrypt the SSN in this database. I have configured replication successfully and encrypted the SSN using a symmetric key. Now please tell me the solution to be able to encrypt the SSN on one server and replicate it to the subscriber then decrypt the SSN on the subscriber. Is there a better way to handle encryption other than a symmetric key encrypted by a certificate?

Scaling Databases Using Commodity Hardware and Shared-Nothing Design