Last night I gave a presentation on Tungsten Replicator to the SFO MySQL MeetUp. It was really fun--lots of excellent questions and ideas from an audience that knows MySQL in and out. Here are slides from the talk from our trusty S3 document repository.
There were many things to think about on the BART ride home to Berkeley but here are a couple that really struck me, in part because I think we can do something about them.
First, database replication needs to expose as much monitoring data as possible. The kind of data you get or can infer from SHOW SLAVE STATUS is just the beginning. This came through really strongly from people like Erin who are using load-balancers and other automated techniques for distributing load to replicas using decision criteria like availability and latency of slaves. We have heard this from people like Peter Zaitsev as well--it's starting to sink in. :)
Second, it's critical to think through the database failover problem fully. There seem to be two axes to this problem. The first is what you might call call a vertical axis where you think about a single component in the system--here the database. You have to cover not just swapping replication flow but also enabling/disabling writes, triggers, and batch jobs, as well as application specific tasks. (Lots of good comments from the audience here as well.)
The other failover "axis" is horizontal where you think about the ensemble of databases, proxies, applications, and utilities that make up the cluster. The issues to cover here including sending commands in parallel, ensuring everyone receives them, and dealing with a wide range of possible failure conditions ranging from network partitions to failure of individual OS-level commands. We plan to unveil a solution shortly in the form of the Tungsten Manager, which uses group communications to attack this problem. I can't wait to get feedback on that.
p.s., Baron Schwartz yesterday posted a very interesting article on cache issues with failover on InnoDB. Just another example of how broad the failover problem really is.
pt-kill: How it Works
5 hours ago