Thread: fsync data directory after DB crash


fsync data directory after DB crash

From:
"Pandora"
Date:
I found that starting from version 9.5, PostgreSQL will do fsync on the entire data directory after DB crash. Here's a question: if I have FPW = on, why is this step still necessary?

Re: fsync data directory after DB crash

From:
Michael Paquier
Date:
On Tue, Jul 18, 2023 at 04:50:25PM +0800, Pandora wrote:
> I found that starting from version 9.5, PostgreSQL will do fsync on
> the entire data directory after DB crash. Here's a question: if I
> have FPW = on, why is this step still necessary?

Yes, see around the call of SyncDataDirectory() in xlog.c:
 * - There might be data which we had written, intending to fsync it, but
 *   which we had not actually fsync'd yet.  Therefore, a power failure in
 *   the near future might cause earlier unflushed writes to be lost, even
 *   though more recent data written to disk from here on would be
 *   persisted.  To avoid that, fsync the entire data directory.
--
Michael

Attachments

Re: fsync data directory after DB crash

From:
Thomas Munro
Date:
On Wed, Jul 19, 2023 at 12:41 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Jul 18, 2023 at 04:50:25PM +0800, Pandora wrote:
> > I found that starting from version 9.5, PostgreSQL will do fsync on
> > the entire data directory after DB crash. Here's a question: if I
> > have FPW = on, why is this step still necessary?
>
> Yes, see around the call of SyncDataDirectory() in xlog.c:
>  * - There might be data which we had written, intending to fsync it, but
>  *   which we had not actually fsync'd yet.  Therefore, a power failure in
>  *   the near future might cause earlier unflushed writes to be lost, even
>  *   though more recent data written to disk from here on would be
>  *   persisted.  To avoid that, fsync the entire data directory.

FTR there was some discussion and experimental patches that would add
recovery_init_sync_method=none and recovery_init_sync_method=wal,
which are based on the OP's observation + an idea for how to make it
work even without FPWs enabled:


https://www.postgresql.org/message-id/flat/CA%2BhUKGKgj%2BSN6z91nVmOmTv2KYrG7VnAGdTkWdSjbOPghdtooQ%40mail.gmail.com#576caccf21cb6c3e883601fceb28d36b

Only recovery_init_sync_method=syncfs actually went in from that
thread.  It works better for some setups (systems where opening
squillions of files just to perform a no-op fsync() is painfully
expensive).
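
For reference, the variant that was committed is configured like this (a minimal postgresql.conf sketch; syncfs requires Linux and PostgreSQL 14 or later):

```
# postgresql.conf (PostgreSQL 14+)
# How to sync the data directory at the start of crash recovery.
# 'fsync' is the default; 'syncfs' is Linux-only and syncs each
# filesystem containing the data directory, WAL, and tablespaces
# instead of opening and fsync'ing every file individually.
recovery_init_sync_method = syncfs
```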



Re: fsync data directory after DB crash

From:
"Pandora"
Date:
Yes, I saw the usage of syncfs in PG14, but it is recommended to use it on Linux 5.8 or higher. If my OS version is lower than 5.8, can I still enable it?


Re: fsync data directory after DB crash

From:
Thomas Munro
Date:
On Wed, Jul 19, 2023 at 2:09 PM Pandora <yeyukui@qq.com> wrote:
> Yes, I saw the usage of syncfs in PG14, but it is recommended to use it on Linux 5.8 or higher. If my OS version is
> lower than 5.8, can I still enable it?

Nothing stops you from enabling it; syncfs() is fairly ancient and
should work.  It just doesn't promise to report errors before Linux
5.8, which is why we don't recommend it there, so you have to weigh
the risks yourself.  One way to think about the risks: all we do with
such errors is log them, and you could probably also check the kernel
logs for errors yourself.
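
A sketch of that kernel-log check (the grep patterns are illustrative, not an exhaustive list of what Linux prints on writeback failure; the exact messages depend on the driver and file system):

```shell
# Scan the kernel log for storage errors after a crash.  dmesg may
# require elevated privileges on systems with kernel.dmesg_restrict=1.
dmesg 2>/dev/null | grep -iE 'i/o error|buffer i/o|blk_update_request' \
    || echo "no matching kernel errors found (or dmesg not readable)"
```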

The edge cases around writeback failure are a tricky subject.  If the
reason we are running crash recovery is that we experienced an I/O
error and PANIC'd before, then it's possible for
recovery_init_sync_method=fsync to succeed while there is still
phantom data in the page cache masquerading as "clean" (i.e. it will
never be sent to the disk by Linux).  So at least in some cases, it's
no better than older Linux's syncfs for our purposes.

(I think the comment that Michael quoted assumes the default FreeBSD
caching model: cached data stays dirty until it's transferred to disk
or the file system is force-removed.  The Linux model is different:
cached data stays dirty until the kernel has attempted to transfer it
to disk just once; then it reports an error to user space one time
(or, in older versions, sometimes fewer), and it is undefined (i.e.
depends on the file system) whether the affected data is forgotten
from cache or still present as phantom data that is bogusly considered
clean.  The reason this probably isn't a bigger deal than it sounds
may be that "transient" I/O failures are rare -- it's more likely that
a system with failing storage just completely self-destructs and you
never reach these fun edge cases.  But as database hackers, we try to
think about this stuff... perhaps one day soon we'll be able to just
go around this particular molehill with direct I/O.)