Discussion: fsync data directory after DB crash
I found that starting from version 9.5, PostgreSQL will do fsync on the entire data directory after DB crash. Here's a question: if I have FPW = on, why is this step still necessary?
On Tue, Jul 18, 2023 at 04:50:25PM +0800, Pandora wrote:
> I found that starting from version 9.5, PostgreSQL will do fsync on
> the entire data directory after DB crash. Here's a question: if I
> have FPW = on, why is this step still necessary?

Yes, see around the call of SyncDataDirectory() in xlog.c:
 * - There might be data which we had written, intending to fsync it, but
 *   which we had not actually fsync'd yet. Therefore, a power failure in
 *   the near future might cause earlier unflushed writes to be lost, even
 *   though more recent data written to disk from here on would be
 *   persisted. To avoid that, fsync the entire data directory.
--
Michael
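To make the quoted comment concrete, here is a rough Python sketch of what "fsync the entire data directory" amounts to: walk the tree, open each regular file, and fsync it, then fsync each directory. It is only an illustration of the idea, not PostgreSQL's C implementation; the name fsync_tree is hypothetical, and it assumes a POSIX system where a directory can be opened read-only and fsync'd.

```python
import os


def fsync_tree(root):
    """Open and fsync every regular file and directory under root.

    Returns the number of fsync() calls issued. Illustrative only.
    """
    synced = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            # Open read-write: some systems refuse fsync() on a read-only fd.
            fd = os.open(path, os.O_RDWR)
            try:
                os.fsync(fd)
                synced += 1
            finally:
                os.close(fd)
        # Sync the directory itself so that its entries are durable too.
        dfd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(dfd)
            synced += 1
        finally:
            os.close(dfd)
    return synced
```

Every file costs an open(), an fsync() and a close() even when nothing is dirty, which is why the cost grows with the number of files rather than with the amount of unflushed data.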
On Wed, Jul 19, 2023 at 12:41 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Jul 18, 2023 at 04:50:25PM +0800, Pandora wrote:
> > I found that starting from version 9.5, PostgreSQL will do fsync on
> > the entire data directory after DB crash. Here's a question: if I
> > have FPW = on, why is this step still necessary?
>
> Yes, see around the call of SyncDataDirectory() in xlog.c:
> * - There might be data which we had written, intending to fsync it, but
> *   which we had not actually fsync'd yet. Therefore, a power failure in
> *   the near future might cause earlier unflushed writes to be lost, even
> *   though more recent data written to disk from here on would be
> *   persisted. To avoid that, fsync the entire data directory.

FTR, there was some discussion and experimental patches that would add recovery_init_sync_method=none and recovery_init_sync_method=wal, which are based on the OP's observation plus an idea for how to make it work even without FPWs enabled:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKgj%2BSN6z91nVmOmTv2KYrG7VnAGdTkWdSjbOPghdtooQ%40mail.gmail.com#576caccf21cb6c3e883601fceb28d36b

Only recovery_init_sync_method=syncfs actually went in from that thread. It works better for some setups (systems where opening squillions of files just to perform a no-op fsync() is painfully expensive).
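To get a feel for the trade-off described above, the following Python sketch counts how many regular files a per-file fsync pass would have to open and how many distinct file systems hold them, which is roughly the number of syncfs() calls that would cover the same files. The function name sync_cost is hypothetical and the sketch does not call syncfs() itself. It also does not follow symlinks, so tablespaces reached through symlinks are not counted.

```python
import os
import stat


def sync_cost(root):
    """Return (files_to_open, filesystems) for a directory tree.

    files_to_open: regular files a per-file fsync pass would have to open.
    filesystems:   distinct st_dev values seen, i.e. roughly how many
                   syncfs() calls would cover the same files.
    """
    files = 0
    devices = {os.stat(root).st_dev}
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if stat.S_ISREG(st.st_mode):
                files += 1
                devices.add(st.st_dev)
    return files, len(devices)
```

On a data directory with a very large number of relation files, the first number can be in the millions while the second is usually a handful at most.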
Yes, I saw the usage of syncfs in PG14, but it is recommended to use it on Linux 5.8 or higher. If my OS version is lower than 5.8, can I still enable it?
------------------ Original message ------------------
From: "Thomas Munro"<thomas.munro@gmail.com>;
Sent: Wednesday, July 19, 2023, 9:37 AM
To: "Michael Paquier"<michael@paquier.xyz>;
Cc: "Pandora"<yeyukui@qq.com>; "pgsql-general"<pgsql-general@lists.postgresql.org>;
Subject: Re: fsync data directory after DB crash
> On Tue, Jul 18, 2023 at 04:50:25PM +0800, Pandora wrote:
> > I found that starting from version 9.5, PostgreSQL will do fsync on
> > the entire data directory after DB crash. Here's a question: if I
> > have FPW = on, why is this step still necessary?
>
> Yes, see around the call of SyncDataDirectory() in xlog.c:
> * - There might be data which we had written, intending to fsync it, but
> * which we had not actually fsync'd yet. Therefore, a power failure in
> * the near future might cause earlier unflushed writes to be lost, even
> * though more recent data written to disk from here on would be
> * persisted. To avoid that, fsync the entire data directory.
FTR there was some discussion and experimental patches that would add
recovery_init_sync_method=none and recovery_init_sync_method=wal,
which are based on the OP's observation + an idea for how to make it
work even without FPWs enabled:
https://www.postgresql.org/message-id/flat/CA%2BhUKGKgj%2BSN6z91nVmOmTv2KYrG7VnAGdTkWdSjbOPghdtooQ%40mail.gmail.com#576caccf21cb6c3e883601fceb28d36b
Only recovery_init_sync_method=syncfs actually went in from that
thread. It works better for some setups (systems where opening
squillions of files just to perform a no-op fsync() is painfully
expensive).
On Wed, Jul 19, 2023 at 2:09 PM Pandora <yeyukui@qq.com> wrote:
> Yes, I saw the usage of syncfs in PG14, but it is recommended to use it on Linux 5.8 or higher. If my OS version is lower than 5.8, can I still enable it?

Nothing stops you from enabling it; syncfs() is fairly ancient and should work. It just doesn't promise to report errors before Linux 5.8, which is why we don't recommend it, so you have to figure out the risks. One way to think about the risks: all we do is log the errors, but you could probably also check the kernel logs for errors.

The edge cases around writeback failure are a tricky subject. If the reason we are running crash recovery is because we experienced an I/O error and PANIC'd before, then it's possible for recovery_init_sync_method=fsync to succeed while there is still phantom data in the page cache masquerading as "clean" (i.e. it will never be sent to the disk by Linux). So at least in some cases, it's no better than older Linux's syncfs for our purposes.

(I think the comment that Michael quoted assumes the default FreeBSD caching model: cached data stays dirty until it's transferred to disk or the file system is force-removed. The Linux model is different: cached data stays dirty until the kernel has attempted to transfer it to disk just once, and then it'll report an error to user space one time (or, in older versions, sometimes fewer), and it is undefined (i.e. depends on the file system) whether the affected data is forgotten from cache or still present as phantom data that is bogusly considered clean. The reason this probably isn't a bigger deal than it sounds may be that "transient" I/O failures are probably rare: it's more likely that a system with failing storage just completely self-destructs and you never reach these fun edge cases. But as database hackers, we try to think about this stuff... perhaps one day soon we'll be able to just go around this particular molehill with direct I/O.)
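For anyone who wants to check which side of the 5.8 line a host is on before switching to syncfs, a small Python sketch follows. The helper names are hypothetical, and it only compares the upstream version number in the release string, so it cannot tell whether a distribution has backported the error-reporting change to an older kernel.

```python
import platform
import re


def kernel_at_least(release, major, minor):
    """True if a release string such as '5.4.0-150-generic' is >= major.minor."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(m.group(1)), int(m.group(2))) >= (major, minor)


def syncfs_reports_errors(release=None):
    """Per the thread: syncfs() only promises to report errors from Linux 5.8 on.

    With no argument, the running kernel's release string is used, which is
    only meaningful on Linux.
    """
    if release is None:
        release = platform.release()
    return kernel_at_least(release, 5, 8)
```

A False result does not mean syncfs is unusable, only that write-back errors may go unreported by the call, so the kernel log is the place to look.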