Re: Mount options for Ext3?
From | Tom Lane |
---|---|
Subject | Re: Mount options for Ext3? |
Date | |
Msg-id | 1260.1043463535@sss.pgh.pa.us |
In response to | Re: Mount options for Ext3? (Kevin Brown <kevin@sysexperts.com>) |
Responses |
Re: Mount options for Ext3?
(Kevin Brown <kevin@sysexperts.com>)
Re: Mount options for Ext3? (pgsql.spam@vinz.nl) |
List | pgsql-performance |
Kevin Brown <kevin@sysexperts.com> writes:
> I was presuming that when a savepoint occurs, a marker is written to
> the log indicating which transactions had been committed to the data
> files, and that this marker was paid attention to during database
> startup.

Not quite. The marker says that all datafile updates described by log entries before point X have been flushed to disk by the checkpoint, and therefore, if we need to restart, we need only replay log entries occurring after the last checkpoint's point X.

This has nothing directly to do with which transactions are committed or not committed. If we based checkpoint behavior on that, we'd need to maintain an indefinitely large amount of WAL log to cope with long-running transactions.

The actual checkpoint algorithm is

    take note of current logical end of WAL (this will be point X)
    write() all dirty buffers in shared buffer arena
    sync() to ensure that above writes, as well as previous ones, are on disk
    put checkpoint record referencing point X into WAL; write and fsync WAL
    update pg_control with new checkpoint record, fsync it

Since pg_control is what's examined after restart, the checkpoint is effectively committed when the pg_control write hits disk. At any instant before that, a crash would result in replaying from the prior checkpoint's point X. The algorithm is correct if and only if the pg_control write hits disk after all the other writes mentioned.

The key assumption we are making about the filesystem's behavior is that writes scheduled by the sync() will occur before the pg_control write that's issued after it. People have occasionally faulted this algorithm by quoting the sync() man page, which saith (in the Gospel According To HP)

    The writing, although scheduled, is not necessarily complete upon return from sync.

This, however, is not a problem in itself. What we need to know is whether the filesystem will allow writes issued after the sync() to complete before those "scheduled" by the sync().
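For readers who want the five checkpoint steps in concrete form, here is a minimal sketch in Python. It is a toy model, not PostgreSQL's actual code: the function names `checkpoint` and `redo_start`, the `CKPT` record layout, the 16-byte control-file format, and the dict of dirty pages are all hypothetical and chosen only for illustration. Only `os.write`, `os.sync`, and `os.fsync` correspond to the system calls named in the message, and `os.sync` is available on Unix-like systems only.

```python
import os
import struct


def checkpoint(data_dir, dirty_pages, wal_path, control_path):
    """Toy checkpoint that follows the five steps listed in the message."""
    # 1. Take note of the current logical end of WAL: this is point X.
    point_x = os.path.getsize(wal_path)

    # 2. write() all dirty buffers (modelled as a dict: file name -> bytes).
    for name, payload in dirty_pages.items():
        fd = os.open(os.path.join(data_dir, name),
                     os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, payload)
        finally:
            os.close(fd)

    # 3. sync() so that the writes above, and earlier ones, are scheduled.
    os.sync()

    # 4. Append a checkpoint record referencing point X to WAL; fsync WAL.
    fd = os.open(wal_path, os.O_WRONLY | os.O_APPEND)
    try:
        record_offset = os.lseek(fd, 0, os.SEEK_END)
        os.write(fd, b"CKPT" + struct.pack("<Q", point_x))
        os.fsync(fd)
    finally:
        os.close(fd)

    # 5. Update the control file with the new checkpoint record; fsync it.
    #    The checkpoint only counts once this write is on disk.
    fd = os.open(control_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, struct.pack("<QQ", record_offset, point_x))
        os.fsync(fd)
    finally:
        os.close(fd)
    return point_x


def redo_start(control_path):
    """After a restart, read the control file to find where replay begins."""
    with open(control_path, "rb") as f:
        _record_offset, point_x = struct.unpack("<QQ", f.read(16))
    return point_x
```

The sketch takes the same ordering for granted that the message identifies as the key assumption: nothing in it verifies that the writes scheduled in step 3 reach disk before the control-file write in step 5. If a crash happens before step 5 completes, `redo_start` still returns the previous checkpoint's point X, which matches the recovery behavior described above.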
> So suppose the marker makes it to the log but not all of the data the
> marker refers to makes it to the data files. Then the system crashes.

I think that this analysis is not relevant to what we're doing.

regards, tom lane