Feb 1, 2009

Simple HA with PostgreSQL Point-In-Time Recovery

Point-in-time recovery or PITR is one of my favorite PostgreSQL features. Most database servers have a transaction log for recovery. (MySQL is one of the few exceptions; more on that shortly.) However, PostgreSQL PITR allows and even encourages users to manipulate the logs to perform some very useful tasks:

* Creating a bit-for-bit backup of a server without halting
* Restoring a server to a specific point in time
* Creating a warm standby server

The third task is especially interesting because it's so common. One of the most pronounced trends in computing is the decreasing cost of computing power through Moore's law improvements and virtualization. Standby solutions nowadays look like a good investment compared to the cost having a server down or worse yet losing a lot of data.

PostgreSQL PITR makes warm standby relatively easy to set up. You enable log archiving, which copies the completed WAL files to an archive directory, then transfer those files to a host where another PostgreSQL server is in continuous recovery mode. This is easier to understand with a picture, so here it is.


Two key points: first, warm standby is based on log files, so as long as you do not mess up the setup, your recovered data will always be consistent. Second, failover is really fast, because your standby server is in continuous recovery mode and generally will be on the last file or two, hence almost ready to go at all times.

There are some details but they are not very hard. The first step is to enable archiving. This copies log buffers into an archive directory from the standard pg_xlogs directory. Archiving is easy to enable in postgresql.conf, as shown by the following example:
archive_mode = on               # enable archiving
archive_command = 'cp -v %p /data/pgsql/archives/%f'
archive_timeout = 300 # timeout to close buffers
Next you need to transfer the files as they appear in the archive directory. For that you can use rsync, as shown below. Put it in a cron job on the standby so that files transfer automatically between hosts every few minutes.
rsync -avz --remove-sent-files prod1:/data/pgsql/archives/ \
/data/pgsql/archives/ > /dev/null
Finally, you copy the files into the pg_xlogs directory of a PostgreSQL server that is running recovery. This is the most complicated part. To make a long story short, you set up a carbon copy of server data on the standby and then create a file called recovery.conf with a command to copy the files from the archive directory to pg_xlogs. That command is usually pg_standby, a handy little program written by Simon Riggs of 2nd Quadrant. Here's a recovery.conf file with typical pg_standby configuration.
restore_command = '/usr/local/pgsql/bin/pg_standby
-l -d -s 2 -t /tmp/pgsql.trigger /data/pgsql/archives
%f %p %r 2>>standby.log'
Basically pg_standby blocks until it either finds the next log file requested by the server or sees a file named /tmp/pgsql.trigger. Then recovery is done and your standby server pops up and is open for business.

The PostgreSQL docs provide a good overview of warm standby. However, you need to get the details exactly right for recovery to work without losing data. I omitted these from the description but wrote a script called standby.sh to illustrate. It avoids some common gotchas in setup such as not cleaning up directories on the standby server or leaving the pg_standby trigger file lying around for unsuspecting programs to trip over. I'm sure there are better ways but this works for me.

Warm standby is thus a standard HA technique used in many PostgreSQL installations. It should be used in many more. PostgreSQL 8.4 is adding "hot standby", which allows users to open up the standby server for reads, similar to Oracle Active Data Guard, a pricey but very flexible availability solution for Oracle.

There's also an effort afoot to add synchronous replication of log changes to the standby, but it looks as if that will be delayed until PostgreSQL 8.5. If you really want to protect against data loss you can replicate the pg_logs directory using DRBD, then mount and copy any missing log files during failover.

So as you can see, PostgreSQL has a really nice solution to warm standby that is fast and ensures data consistency with options to eliminate data loss almost completely. Can MySQL do this? The answer is "kind of." The problem is that MySQL servers do not have a transaction log. Logging is the responsibility of individual storage engines. Some, like MyISAM, do not even have a log. The closest equivalent in MySQL is the binlog, which is used for replication.

In fact, a lot of people use MySQL master/master replication to implement warm standby. You can in theory reduce data loss by copying master binlogs over to another host and having that server read from them to distributed "missed" events to the slave(s). This is complicated and there are some two-phase commit problems with MySQL binlogs that mean sooner or later there will be data inconsistencies. Most MySQL users either just accept the chance of data loss or use DRBD to sync the entire master over to another host.

On balance, the PostgreSQL warm standby solution using PITR compares very favorably to MySQL as well as many commercial databases. Two things that would help are better management and proxies to help redirect reads to hot standby servers, once these are available. Our development team is working on both of these. As usual, stay tuned...

P.S., In response to some good comments from readers the comparison between MySQL and PostgreSQL replication deserves a fuller treatment in a separate article. PostgreSQL gives you a bit-for-bit copy of data, whereas MySQL makes a logical copy using SQL. PostgreSQL warm standby is therefore transparent to applications, whereas MySQL replication has some prerequisites for SQL schema and operations that applications must observe.

32 comments:

Paolo said...

Very interesting article, thanks!

Mikiya Okuno said...

Hi,

It appears that you have a wrong idea about MySQL master/master replication, which is hardly used for warm standby.

MySQL can implement a worm standby by master/slave replication just like the picture above. If you set sync_binlog=1, then binlog is synced with innodb WAL. The content of binlog is transferred to the slave immediately/continuously. The slave has almost same data to the master. However, network transfer is asynchronous, so there is a small chance to lose one ore few committed transactions, which are written to binlog but not transferred to the slave server.

Is file transfer from "The master instance" to "Standby" synchronous? Then, it may have more robust than MySQL replication a little bit.

Mark Callaghan said...

Archiving the log every N seconds where N >> 10 isn't something I would recommend to provide HA as you will lose the transactions from the last N/2 seconds on failure. And if that were acceptable than many of the problems with MySQL replication can be eliminated.

Sandty Weston said...

Wow, that is some pretty cool stuff dude!

RT
www.internet-privacy.us.tc

Robert Hodges said...

@Mikiya

Actually master/master *is* a form of warm standby--normally there's a VIP (virtual IP) for the master so only one database can be written at once. You can optionally read the slave but writing creates data consistency problems. The slave is therefore not fully available.

The interesting distinction, which I alluded to but did not follow up on at the end is that PostgreSQL warm standby is bit-for-bit, i.e., physical replication. You get identical data on the standby. Applications in theory cannot tell this is happening. Outside of PostgreSQL bugs (there have been a couple) it's completely transparent.

Robert Hodges said...

@Mark,

As Mikuya points out MySQL transfer is continuous, which is a nice feature. It's interesting to ask under what circumstances the slave I/O thread would be delayed, but in general MySQL will lose less data. In the example I provided, the WAL buffers full or not are dumped every 300 seconds, which by your math leads to average data loss of *at least* 150 seconds. As mentioned in the article you can eliminate this with DRBD protocol C on the pg_xlogs directory, but the solution is starting to get more complex in order to guarantee full transaction durability.

For this reason it's a real pity synchronous log shipping appears to be postponed to PostgreSQL 8.5.

Unknown said...

@Robert

Yes, on a MySQL master-to-master, you can write both servers without data consistencies problems (but better not at the same time, as you said, except for purposely adapted applications that just does writes that does not depends on other data state).

Anyway if you really need both servers to handle the load, then you will loose on a degraded situation.

Yes also, synchronous repl is a very much expected planned feature for PostgreSQL. Though for many practical cases (everyone is not a bank), having at least just a native "best effort" continuous replication would be very much more useful (something like slony but native and without this runtime performances penalty and those limitations).

Dave Edwards said...

On The Scale-Out Blog, Robert Hodges illustrates simple HA with PostgreSQL point-in-time recovery.

Log Buffer #134

Anonymous said...

archive_mode = on is no longer a valid option. Once you set archive_command archiving is turned on.

Mikiya Okuno said...

FYI,

MySQL 6.0 will have a semi-synchronous replication feature, which waits ack from the slave before completion of each commit: https://code.launchpad.net/~hezx/mysql-server/semi-sync-replication/

Robert Hodges said...

@Anonymous
Is this an 8.4 feature change? archive_mode is supported in 8.3.
Robert

Robert Hodges said...

@Mikiya,
Thanks for pointing this out. Semi-synchronous replication will be a great MySQL feature when it is finally available. It's possible some of the 3rd party MySQL builds (e.g., Percona or OurDelta) will integrate the Google semi-sync patches and get there even faster.
Cheers, Robert

Anonymous said...

Let me correct that statement....

archive_mode is not valid for Postgres Plus 8.3 from EnterpriseDB. If you add the parameter you will get "unrecognized configuration parameter archive_mode"

Anonymous said...

Question on the rsync delete... pg_standby will delete any old archive log files from the standby side automatically. So when rsync runs from source to destination again, those files will not be present on the standby and will be shipped again. If you use the --delete option like you do it will delete files that haven't been shipped yet, no?

I guess I'm confused which servers you have as source & destination in the rsync command. Is prod: the primary or standby server?

Robert Hodges said...

On the rsync command--oops I think you got me. This rsync command runs on the standby host. However, the option should be --remove-sent-files not --delete. That way files are removed from the primary once they are transferred. I'm going to correct the article.

Anonymous said...

Thanks for the explanation and the script! I've got this mostly working, but a quick question. Should archive_mode be turned on for the standby. Assuming the directories and config are the same on the production and standby machines, that would mean that the standby is set to dump its archives in the same directory it is pulling the production archives into, thereby creating name collision.

prod
postgres data: /mnt/postgresql
postgres archives: /mnt/postgresql_archives

standby
postgres data: /mnt/postgresql
postgres archives: /mnt/postgresql_archives

I'm assuming that archive mode should be turned off for the standby. If that is the case, then you probably want to turn it back on when you fail over, no?

Anyhow, I'm coming from mysql, so therefore a bit of postgresql newbie.

Thanks

Anonymous said...

very nice, and a "must read" for users interested in this topic (...most of postgresql admins should..;-) ).

Seems that I'll come back regularly.....

René Leonhardt said...

2 days ago Bruce Momjian wrote that the Hot Standby feature will not make it into 8.4 but maybe into 8.5 (page 15):
http://momjian.us/main/blogs/pgblog.html#March_1_2009

In the weekly news there is a pending patch for Hot Standby
http://www.postgresql.org/community/weeklynews/pwn20090301

The todo list entry is unchanged
http://wiki.postgresql.org/wiki/Todo#Point-In-Time_Recovery_.28PITR.29

So will Hot Standby make it into 8.4 or not? :)

Robert Hodges said...

@ReneL

Hot standby will not make it into 8.4. Look at Heiki's post on pgsql-hackers from 27 Feb: http://archives.postgresql.org/pgsql-hackers/2009-02/msg01345.php. I would guess it will be 8.5 at this point along with WAL streaming.

Cheers, Robert

Robert Hodges said...

@Anonymous
I would recommend turning on archive_mode for the standby so that when you fail over the server comes up immediately with archiving enabled. To prevent confusion it is probably a good idea to archive to a different directory. In addition, don't forget to turn off rsync so that you are no longer pulling from the old primary host. That is another potential source of heads-scratching.

Cheers, Robert

Bruce said...

I wonder if rsync will work around the problems I've encountered with NFS. It seems with nfs, pg_standby grabs the file before it is fully available and causes postgres to error out thinking a corrupt WAL segment is received. It is odd, because the Postgres docs specify NFS as a solution for remoting the WAL segments and NFS is a synchrounous protocol copy. It seems unlikely the data isn't there when it attempts to read the file.

It was suggested I copy the files to a different directory than the one pg_standby uses and move them over from there (or copy them with some extension to avoid having pg_standby see it) and then rename on the standby side. I'll probably try that solution first.

Robert Hodges said...

Hi Bruce,

We are doing some work to provide commercial support for PostgreSQL WAL shipping and this is one of the questions we will be looking at. I'll post when I have an answer.

Cheers, Robert

Robert Hodges said...

@Bruce, I should have asked--on which OS and environment did you see this problem? Thx, Robert

Bruce said...

We're using Postgres 8.3 on CentOS 5. Primary and backup are on common subnet with gig link.

Wish there were a viable replication solution available. We've tried Slony and Mammoth. Slony isn't suitable being trigger based (created too many issues with the schema churn we have in our project right now) and Mammoth is beta for PG 8.3 and didn't work well at all.

Robert Hodges said...

@Bruce

Have you tried Londiste? We are working on a commercial product based on Londiste and think rather highly of it.

Cheers, Robert

Anonymous said...

Ooo, after doing this on Postgres and MySQL I'll take MySQL replication and clustering any day over PostgreSQL. I can't believe your statements on the comparison if you really knew MySQL. I love postgres, I don't ever want to use MySQL. But MySQL replication and clustering is nothing short of a wonderful dream where all I have is jealousy after having to use it. PostgreSQL replication, at best, is a series of hacks. And I hate to say that but its nothing short of true.

Robert Hodges said...

@Anonymous
You need to look more closely at MySQL replication--it's not all peaches and cream. Just to mention one very basic problem, MySQL does not have a reliable method to handle master failure by promoting out of a group of slaves. There's a longer list at the following blog post:

http://scale-out-blog.blogspot.com/2008/08/answering-montys-challenge-advanced.html

Also I would not describe PostgreSQL PITR has a hack. It's very solidly engineered.

mason said...

FYI -- Hot Stanby feature is now available in PostgreSQL 8.5alpha3. See: http://www.postgresql.org/about/news.1172

Unknown said...

I carried out the steps mentioned in this post, but I am stuck at last point where it complaints WAL file SYSID is different that pg_control SYSID, probably because the WAL files are coming from a different machine.
Any suggestion how to resolve this error.
BTW: I am using 8.4.1

Data Recovery said...

Nice Post about "Simple HA with PostgreSQL Point-In-Time Recovery" Thanks for sharing the great stuff.

Anonymous said...

You could use rsync straight from archive_command on the primary node, therefore avoiding the need for a cron job and the possible extra delay in delivering the WAL files to the secondary node. This way the WAL files are sent as soon as they're available.
I've been using this for a while and it works fine.

Since PostgreSQL 9.0 was released not long ago, I recommend everyone to check the new features related to HA.

CDP.

Revti said...

Personally I don't like master-master replication as it results a lot of locks due to read-write conflict.