Sunday, February 5, 2012

The Agony of Big Transactions in the MySQL Binlog

Databases fail in interesting ways.  About a week ago I was called in on a support case for Tungsten Enterprise.  We were getting failures in which the master MySQL 5.1.50 server would run out of connections, block applications for 5 to 10 minutes and eventually trigger automated failover to a slave.  Running out of database connections is a classic symptom of blocking on a shared resource inside the server.  The blocked transactions hold onto their connections, which are quickly exhausted if new transactions constantly arrive from applications.  So where was the hold-up?

Our first clue was to notice that Tungsten Replicator was processing a huge transaction at the time of one of the failovers.  To find out more I dug into the MySQL binlog using the handy mysqlbinlog utility and looked at transactions committed around the time of the failure.  The following pattern quickly became apparent across hundreds of transactions.   The timestamps show that the commit occurred over 8 minutes after the first statement.

#120201 11:57:59 server id 313819  end_log_pos 215822862        Query   thread_id=55165463      exec_time=0     error_code=0
SET TIMESTAMP=1328119079/*!*/;
# at 215822862
#120201 11:49:23 server id 313819  end_log_pos 215823092        Query   thread_id=55165463      exec_time=1     error_code=0
SET TIMESTAMP=1328118563/*!*/;
(SQL statement)
# at 215823092
#120201 11:57:59 server id 313819  end_log_pos 215823119        Xid = 32444278496

The root cause turned out to be simple.  The large transaction we found in the Tungsten log resulted in over 1.3Gb of data in the MySQL binlog.  Mysqld took several minutes to write this into the log.  Meanwhile, since  transactions apply serially into the binlog, every other transaction had to wait instead of just committing and releasing its connection.  New transactions quickly consumed the spare connections up to the max_connections limit.  Problem explained.

The MySQL binlog is a wonderful thing but the fact that it is a serial resource has important consequences for busy servers that use replication.  

First, if you run out of connections, look at the binlog.   Tungsten makes large transactions fairly easy to detect because they are fragmented in the replicator log.   We noticed a transaction with 1300 fragments, which tipped us off to the problem.  However, you can also find the problem by looking at the binlog directly as I did above.  Or you can use any of several tools to analyze the binlog and look for periods of time with no transactions committed.

Second, big transactions are a problem for MySQL.  Admin operations or application bulk loads through web APIs can easily create huge numbers of updates.   Break them up into pieces and do a chunk at a time.  This is also better for slaves, which may end up lagging badly when large transactions reach them.  Increasing max_connections to a high number is incidentally not a solution.  It just makes problems with transactions less visible and will not help anyway if you have high transaction throughput.

Third, if you have updates or deletes that affect many rows, use statement replication.  Row replication is a great thing.  We use it regularly in Tungsten deployments, especially for heterogeneous replication, e.g., from MySQL to Oracle.  However, we have also seen customers write 5Gb or more into the binlog without even realizing it.  Statement replication can write as little as a single statement.  This flexibility is one of the outstanding features of MySQL replication.

A possible fourth suggestion is to put the binlog on faster storage.  However, it is unclear whether that would help--the binlogs in this case were already on fast, separate disk storage with large caches and sync_binlog=0.  It seems MySQL writes large transactions rather slowly into the binlog for reasons that may not have much to do with storage speed.  I hope to understand why that is at some later time.  Perhaps somebody reading this article can suggest a reason.


Sheeri K. Cabral said...

It seems like this was a statement-based log, because you took out something and replaced it with (SQL Statement) but was the SQL statement really over 1 Gb?

Or was this row-based binary logging/replication?

Robert Hodges said...

Hi Sheeri! The workload is largely statements with some RBR to avoid repeating large transactions on slaves. As far as I know the large transaction in question consisted of many individual statements, which allows Tungsten to fragment. The really enormous transactions I have seen (like the 5 GB example) were however RBR.

sbester said...

I saw this when using huge LOAD DATA INFILE imports.. all other commits are blocked while the binlog gets written.

Scaling Databases Using Commodity Hardware and Shared-Nothing Design