Thread: Re: fdatasync performance problem with large number of DB files


Re: fdatasync performance problem with large number of DB files

From:
Thomas Munro
Date:
On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa@vmware.com> wrote:
> By the way, there is a common case where we could skip the fsync: a standby that is
> already fsync-ed, generated by pg_rewind/pg_basebackup.  The state of those standbys
> is surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the pgdata directory is
> fsync-ed again during startup when starting those pg instances.  We could ask users
> not to fsync during pg_rewind & pg_basebackup, but we probably want to just fsync
> some files in pg_rewind (see [1]), so better to let the startup process skip the
> unnecessary fsync?  As to the solution, using a GUC or writing something to a file
> like backup_label(?) does not seem to be a good idea, since:
> 1. With a GUC, we still expect fsync after real crash recovery, so we would need to
>    reset the GUC and also specify pgoptions in the pg_ctl command.
> 2. We could write some hint information to a file like backup_label(?) in
>    pg_rewind/pg_basebackup, but people might copy the pgdata directory and then we
>    still need the fsync.
> The only simple solution I can think of is to let the user touch a file to hint the
> startup process, before starting the pg instance.

As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up).  It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough.  Hmm.  What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata?  Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives.  Or something like that...

I'm not too keen to invent clever new schemes for PG14, though.  This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities.  Is
that useful enough?  In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea.  We already have a
running-with-scissors mode you could use for that: fsync=off.



Re: fdatasync performance problem with large number of DB files

From:
Fujii Masao
Date:

On 2021/03/17 12:45, Thomas Munro wrote:
> On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> On 2021/03/16 8:15, Thomas Munro wrote:
>>> I don't want to add a hypothetical sync_after_crash=none, because it
>>> seems like generally a bad idea.  We already have a
>>> running-with-scissors mode you could use for that: fsync=off.
>>
>> I heard that some backup tools sync the database directory when restoring it.
>> I guess that those who use such tools might want the option to disable such
>> startup sync (i.e., sync_after_crash=none) because it's not necessary.
> 
> Hopefully syncfs() will return quickly in that case, without doing any work?

Yes, in Linux.

> 
>> They can skip that sync by fsync=off. But if they just want to skip only that
>> startup sync and make subsequent recovery (or standby server) work with
>> fsync=on, they would need to shutdown the server after that startup sync
>> finishes, enable fsync, and restart the server. In this case, since the server
>> is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
>> would not be performed. This procedure is tricky. So IMO supporting
>> sync_after_crash=none would be helpful for this case and simple.
> 
> I still do not like this footgun :-)  However, perhaps I am being
> overly dogmatic.  Consider the change in d8179b00, which decided that
> I/O errors in this phase should be reported at LOG level rather than
> ERROR.  In contrast, my "sync_after_crash=wal" mode (which I need to
> rebase over this) will PANIC in this case, because any syncing will be
> handled through the usual checkpoint codepaths.
> 
> Do you think it would be OK to commit this feature with just "fsync"
> and "syncfs", and then to continue to consider adding "none" as a
> possible separate commit?

+1. The "syncfs" feature is useful whether we also support the "none" mode or not.
It's a good idea to commit "syncfs" in advance.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: fdatasync performance problem with large number of DB files

From:
Paul Guo
Date:
About the syncfs patch, my first impression of the GUC name sync_after_crash
is that it is a boolean type. Not sure about other people's feeling. Do you think
it would be better to rename it to a clearer name like sync_method_after_crash?


Re: fdatasync performance problem with large number of DB files

From:
Thomas Munro
Date:
On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:
> About the syncfs patch, my first impression on the guc name sync_after_crash
> is that it is a boolean type. Not sure about other people's feeling. Do you guys think
> It is better to rename it to a clearer name like sync_method_after_crash or others?

Works for me.  Here is a new version like that, also including the
documentation change discussed with Fujii-san, and a couple of
cosmetic changes.

Attachments

Re: fdatasync performance problem with large number of DB files

From:
Thomas Munro
Date:
On Wed, Mar 17, 2021 at 11:42 PM Paul Guo <paulguo@gmail.com> wrote:
> I just quickly reviewed the patch (the code part). It looks good. My only
> concern or question is whether syncfs() works in do_syncfs() when the fd
> was opened via a symlink - I'm not 100% sure.

Alright, let me try to prove that it works the way we want with an experiment.

I'll make a directory with a file in it, and create a symlink to it in
another filesystem:

tmunro@x1:~/junk$ mkdir my_wal_dir
tmunro@x1:~/junk$ touch my_wal_dir/foo
tmunro@x1:~/junk$ ln -s /home/tmunro/junk/my_wal_dir /dev/shm/my_wal_dir_symlink
tmunro@x1:~/junk$ ls /dev/shm/my_wal_dir_symlink/
foo

Now I'll write a program that repeatedly dirties the first block of
foo, and calls syncfs() on the containing directory that it opened
using the symlink:

tmunro@x1:~/junk$ cat test.c
#define _GNU_SOURCE

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main()
{
    int symlink_fd, file_fd;

    symlink_fd = open("/dev/shm/my_wal_dir_symlink", O_RDONLY);
    if (symlink_fd < 0) {
        perror("open1");
        return EXIT_FAILURE;
    }

    file_fd = open("/home/tmunro/junk/my_wal_dir/foo", O_RDWR);
    if (file_fd < 0) {
        perror("open2");
        return EXIT_FAILURE;
    }

    for (int i = 0; i < 4; ++i) {
        if (pwrite(file_fd, "hello world", 10, 0) != 10) {
            perror("pwrite");
            return EXIT_FAILURE;
        }
        if (syncfs(symlink_fd) < 0) {
            perror("syncfs");
            return EXIT_FAILURE;
        }
        sleep(1);
    }
    return EXIT_SUCCESS;
}
tmunro@x1:~/junk$ cc test.c
tmunro@x1:~/junk$ ./a.out

While that's running, to prove that it does what we want it to do,
I'll first find out where foo lives on the disk:

tmunro@x1:~/junk$ /sbin/xfs_bmap my_wal_dir/foo
my_wal_dir/foo:
    0: [0..7]: 242968520..242968527

Now I'll trace the writes going to block 242968520, and start the program again:

tmunro@x1:~/junk$ sudo btrace /dev/nvme0n1p2 | grep 242968520
259,0    4       93     4.157000669 724924  A   W 244019144 + 8 <- (259,2) 242968520
259,0    2      155     5.158446989 718635  A   W 244019144 + 8 <- (259,2) 242968520
259,0    7       23     6.163765728 724924  A   W 244019144 + 8 <- (259,2) 242968520
259,0    7       30     7.169112683 724924  A   W 244019144 + 8 <- (259,2) 242968520



Re: fdatasync performance problem with large number of DB files

From:
Fujii Masao
Date:

On 2021/03/18 19:19, Thomas Munro wrote:
> On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:
>> About the syncfs patch, my first impression on the guc name sync_after_crash
>> is that it is a boolean type. Not sure about other people's feeling. Do you guys think
>> It is better to rename it to a clearer name like sync_method_after_crash or others?
> 
> Works for me.  Here is a new version like that, also including the
> documentation change discussed with Fujii-san, and a couple of
> cosmetic changes.

Thanks for updating the patch!

+        database cluster that did not shut down cleanly, including copies
+        created with pg_basebackup.

"pg_basebackup" should be "<application>pg_basebackup</application>"?

+        while ((de = ReadDir(dir, "pg_tblspc")))

The comment of SyncDataDirectory() says "Errors are logged but not
considered fatal". So ReadDirExtended() with LOG level should be used
here, instead?

Isn't it better to call CHECK_FOR_INTERRUPTS() in this loop?

+    fd = open(path, O_RDONLY);

For current use, it's not harmful to use open() and close(). But isn't
it safer to use OpenTransientFile() and CloseTransientFile(), instead?
Because do_syncfs() may be used for other purpose in the future.

+    if (syncfs(fd) < 0)
+        elog(LOG, "syncfs failed for %s: %m", path);

According to the message style guide, this message should be something
like "could not sync filesystem for \"%s\": %m"?

I confirmed that no error was reported when crash recovery started with
syncfs, on old Linux. I should also confirm that no error is reported in that
case on Linux 5.8+, but I don't have that environment, so I've not tested
this feature on Linux 5.8+.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: fdatasync performance problem with large number of DB files

From:
Bruce Momjian
Date:
On Thu, Mar 18, 2021 at 11:19:13PM +1300, Thomas Munro wrote:
> On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:
> > About the syncfs patch, my first impression on the guc name sync_after_crash
> > is that it is a boolean type. Not sure about other people's feeling. Do you guys think
> > It is better to rename it to a clearer name like sync_method_after_crash or others?
> 
> Works for me.  Here is a new version like that, also including the
> documentation change discussed with Fujii-san, and a couple of
> cosmetic changes.

Are we sure we want to use the word "crash" here?  I don't remember
seeing it used anywhere else in our user interface.  I guess it is
"crash recovery".

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: fdatasync performance problem with large number of DB files

From:
Bruce Momjian
Date:
On Thu, Mar 18, 2021 at 09:54:11AM -0400, Bruce Momjian wrote:
> On Thu, Mar 18, 2021 at 11:19:13PM +1300, Thomas Munro wrote:
> > On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:
> > > About the syncfs patch, my first impression on the guc name sync_after_crash
> > > is that it is a boolean type. Not sure about other people's feeling. Do you guys think
> > > It is better to rename it to a clearer name like sync_method_after_crash or others?
> > 
> > Works for me.  Here is a new version like that, also including the
> > documentation change discussed with Fujii-san, and a couple of
> > cosmetic changes.
> 
> Are we sure we want to use the word "crash" here?  I don't remember
> seeing it used anywhere else in our user interface.  I guess it is
> "crash recovery".

Maybe call it "recovery_sync_method"?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: fdatasync performance problem with large number of DB files

From:
Fujii Masao
Date:

On 2021/03/18 23:03, Bruce Momjian wrote:
>> Are we sure we want to use the word "crash" here?  I don't remember
>> seeing it used anywhere else in our user interface.

We have GUC restart_after_crash.


>> I guess it is "crash recovery".
>
> Maybe call it "recovery_sync_method"?
+1. This name sounds good to me. Or recovery_init_sync_method, because that
sync happens in the initial phase of recovery.

Another idea from a different angle is data_directory_sync_method. If we adopt
this, we can easily extend this feature to other use cases (other than the sync
at the beginning of recovery) without changing the name.
I'm not sure if such cases actually exist, though.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: fdatasync performance problem with large number of DB files

From:
Thomas Munro
Date:
On Fri, Mar 19, 2021 at 5:50 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> On 2021/03/18 23:03, Bruce Momjian wrote:
> >> Are we sure we want to use the word "crash" here?  I don't remember
> >> seeing it used anywhere else in our user interface.
>
> We have GUC restart_after_crash.
>
> >> I guess it is "crash recovery".
> >
> > Maybe call it "recovery_sync_method"?
> +1. This name sounds good to me. Or recovery_init_sync_method, because that
> sync happens in the initial phase of recovery.

Yeah, I was trying to fit the existing pattern
{restart,remove_temp_files}_after_crash.  But
recovery_init_sync_method also sounds good to me.  I prefer the
version with "init"... without "init", people might get the wrong idea
about what this controls, so let's try that.  Done in the attached
version.

> Another idea from different angle is data_directory_sync_method. If we adopt
> this, we can easily extend this feature for other use cases (other than sync at
> the beginning of recovery) without changing the name.
> I'm not sure if such cases actually exist, though.

I can't imagine what -- it's like using a sledge hammer to crack a nut.

(I am aware of a semi-related idea: use the proposed fsinfo() Linux
system call to read the filesystem-wide error counter at every
checkpoint to see if anything bad happened that Linux forgot to tell
us about through the usual channels.  That's the same internal
mechanism used by syncfs() to report errors, but last I checked it
hadn't been committed yet.  I don't think that'd share anything with
this code, though.)

From your earlier email:

On Fri, Mar 19, 2021 at 2:12 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> +        database cluster that did not shut down cleanly, including copies
> +        created with pg_basebackup.
>
> "pg_basebackup" should be "<application>pg_basebackup</application>"?

Fixed.

> +               while ((de = ReadDir(dir, "pg_tblspc")))
>
> The comment of SyncDataDirectory() says "Errors are logged but not
> considered fatal". So ReadDirExtended() with LOG level should be used
> here, instead?

Fixed.

> Isn't it better to call CHECK_FOR_INTERRUPTS() in this loop?

How could this be useful?

> +       fd = open(path, O_RDONLY);
>
> For current use, it's not harmful to use open() and close(). But isn't
> it safer to use OpenTransientFile() and CloseTransientFile(), instead?

Ok, done, for consistency with other code.

> Because do_syncfs() may be used for other purpose in the future.

I hope not :-)

> +       if (syncfs(fd) < 0)
> +               elog(LOG, "syncfs failed for %s: %m", path);
>
> According to the message style guide, this message should be something
> like "could not sync filesystem for \"%s\": %m"?

Fixed.

> I confirmed that no error was reported when crash recovery started with
> syncfs, in old Linux. I should also confirm that no error is reported in that
> case in Linux 5.8+, but I don't have that environement. So I've not tested
> this feature in Linux 5.8+....

I have a Linux 5.10 system.  Here's a trace of SyncDataDirectory on a
system that has two tablespaces and has a symlink for pg_wal:

[pid 3861224] lstat("pg_wal", {st_mode=S_IFLNK|0777, st_size=11, ...}) = 0
[pid 3861224] openat(AT_FDCWD, ".", O_RDONLY) = 4
[pid 3861224] syncfs(4)                 = 0
[pid 3861224] close(4)                  = 0
[pid 3861224] openat(AT_FDCWD, "pg_tblspc", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
[pid 3861224] fstat(4, {st_mode=S_IFDIR|0700, st_size=32, ...}) = 0
[pid 3861224] getdents64(4, 0x55627e18fb60 /* 4 entries */, 32768) = 112
[pid 3861224] openat(AT_FDCWD, "pg_tblspc/16406", O_RDONLY) = 5
[pid 3861224] syncfs(5)                 = 0
[pid 3861224] close(5)                  = 0
[pid 3861224] openat(AT_FDCWD, "pg_tblspc/16407", O_RDONLY) = 5
[pid 3861224] syncfs(5)                 = 0
[pid 3861224] close(5)                  = 0
[pid 3861224] getdents64(4, 0x55627e18fb60 /* 0 entries */, 32768) = 0
[pid 3861224] close(4)                  = 0
[pid 3861224] openat(AT_FDCWD, "pg_wal", O_RDONLY) = 4
[pid 3861224] syncfs(4)                 = 0
[pid 3861224] close(4)                  = 0

To see how it looks when syncfs() fails, I added a fake implementation
that fails with EUCLEAN on every second call, and then the output
looks like this:

...
1616111356.663 startup 3879272 LOG:  database system was interrupted; last known up at 2021-03-19 12:48:33 NZDT
1616111356.663 startup 3879272 LOG:  could not sync filesystem for "pg_tblspc/16406": Structure needs cleaning
1616111356.663 startup 3879272 LOG:  could not sync filesystem for "pg_wal": Structure needs cleaning
1616111356.663 startup 3879272 LOG:  database system was not properly shut down; automatic recovery in progress
...

A more common setup with no tablespaces and pg_wal as a non-symlink looks like:

[pid 3861448] lstat("pg_wal", {st_mode=S_IFDIR|0700, st_size=316, ...}) = 0
[pid 3861448] openat(AT_FDCWD, ".", O_RDONLY) = 4
[pid 3861448] syncfs(4)                 = 0
[pid 3861448] close(4)                  = 0
[pid 3861448] openat(AT_FDCWD, "pg_tblspc", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
[pid 3861448] fstat(4, {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0
[pid 3861448] getdents64(4, 0x55764beb0b60 /* 2 entries */, 32768) = 48
[pid 3861448] getdents64(4, 0x55764beb0b60 /* 0 entries */, 32768) = 0
[pid 3861448] close(4)

The trace for the alternative fsync() mode is (unsurprisingly) much longer.

Thanks for the reviews!

PS: For illustration/discussion, I've also attached a "none" patch.  I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.

Attachments

Re: fdatasync performance problem with large number of DB files

From:
Thomas Munro
Date:
On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> PS: For illustration/discussion, I've also attached a "none" patch.  I
> also couldn't resist rebasing my "wal" mode patch, which I plan to
> propose for PG15 because there is not enough time left for this
> release.

Erm... I attached the wrong version by mistake.  Here's a better one.
(Note: I'm not expecting any review of the 0003 patch in this CF, I
just wanted to share the future direction I'd like to consider for
this problem.)

Attachments

Re: fdatasync performance problem with large number of DB files

From:
Fujii Masao
Date:

On 2021/03/19 10:37, Thomas Munro wrote:
> On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> PS: For illustration/discussion, I've also attached a "none" patch.  I
>> also couldn't resist rebasing my "wal" mode patch, which I plan to
>> propose for PG15 because there is not enough time left for this
>> release.
> 
> Erm... I attached the wrong version by mistake.  Here's a better one.

Thanks for updating the patch! It looks good to me!
I have one minor comment for the patch.

+        elog(LOG, "could not open %s: %m", path);
+        return;
+    }
+    if (syncfs(fd) < 0)
+        elog(LOG, "could not sync filesystem for \"%s\": %m", path);

Since these are neither internal errors nor low-level debug messages,
ereport() should be used for them rather than elog()?  For example,

        ereport(LOG,
                (errcode_for_file_access(),
                 errmsg("could not open \"%s\": %m", path)))

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: fdatasync performance problem with large number of DB files

From:
Fujii Masao
Date:

On 2021/03/19 11:22, Fujii Masao wrote:
> 
> 
> On 2021/03/19 10:37, Thomas Munro wrote:
>> On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>>> PS: For illustration/discussion, I've also attached a "none" patch.  I
>>> also couldn't resist rebasing my "wal" mode patch, which I plan to
>>> propose for PG15 because there is not enough time left for this
>>> release.
>>
>> Erm... I attached the wrong version by mistake.  Here's a better one.

0002 patch looks good to me. Thanks!
I have minor comments.

- * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
- * all potential filesystem, depending on recovery_init_sync_method setting.
+ * Issue fsync recursively on PGDATA and all its contents, issue syncfs for
+ * all potential filesystem, or do nothing, depending on
+ * recovery_init_sync_method setting.

The comment in SyncDataDirectory() should be updated so that
it mentions "none" method, as the above?

+        This is only safe if all buffered data is known to have been flushed
+        to disk already, for example by a tool such as
+        <application>pg_basebackup</application>.  It is not a good idea to

Isn't it better to add something like "without <literal>--no-sync</literal>"
to the "pg_basebackup" part? That would prevent users from assuming that
pg_basebackup always ensures the sync, whatever options are specified.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: fdatasync performance problem with large number of DB files

From:
Thomas Munro
Date:
On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> Thanks for updating the patch! It looks good to me!
> I have one minor comment for the patch.
>
> +               elog(LOG, "could not open %s: %m", path);
> +               return;
> +       }
> +       if (syncfs(fd) < 0)
> +               elog(LOG, "could not sync filesystem for \"%s\": %m", path);
>
> Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than
> elog()?  For example,

Fixed.

I'll let this sit until tomorrow to collect any other feedback or
objections, and then push the 0001 patch
(recovery_init_sync_method=syncfs).

On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> 0002 patch looks good to me. Thanks!
> I have minor comments.

Ok, I made the changes you suggested.  Let's see if anyone else would
like to vote for or against the concept of the 0002 patch
(recovery_init_sync_method=none).

Attachments

Re: fdatasync performance problem with large number of DB files

From:
Fujii Masao
Date:

On 2021/03/19 14:28, Thomas Munro wrote:
> On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> Thanks for updating the patch! It looks good to me!
>> I have one minor comment for the patch.
>>
>> +               elog(LOG, "could not open %s: %m", path);
>> +               return;
>> +       }
>> +       if (syncfs(fd) < 0)
>> +               elog(LOG, "could not sync filesystem for \"%s\": %m", path);
>>
>> Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than
>> elog()?  For example,
> 
> Fixed.

Thanks! LGTM.

> I'll let this sit until tomorrow to collect any other feedback or
> objections, and then push the 0001 patch
> (recovery_init_sync_method=syncfs).

Understood.

> On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> 0002 patch looks good to me. Thanks!
>> I have minor comments.
> 
> Ok, I made the changes you suggested.

Thanks! LGTM.

> Let's see if anyone else would
> like to vote for or against the concept of the 0002 patch
> (recovery_init_sync_method=none).

Agreed. I also want to hear more opinion about the setting "none".

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: fdatasync performance problem with large number of DB files

From:
David Steele
Date:
On 3/19/21 1:28 AM, Thomas Munro wrote:
> On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> Thanks for updating the patch! It looks good to me!
>> I have one minor comment for the patch.
>>
>> +               elog(LOG, "could not open %s: %m", path);
>> +               return;
>> +       }
>> +       if (syncfs(fd) < 0)
>> +               elog(LOG, "could not sync filesystem for \"%s\": %m", path);
>>
>> Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than
>> elog()?  For example,
> 
> Fixed.
> 
> I'll let this sit until tomorrow to collect any other feedback or
> objections, and then push the 0001 patch
> (recovery_init_sync_method=syncfs).

I had a look at the patch and it looks good to me.

Should we mention in the docs that the contents of non-standard symlinks
will not be synced, i.e. anything other than tablespaces and pg_wal? Or
can we point them to some docs saying not to do that (if such docs exist)?

> On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> 0002 patch looks good to me. Thanks!
>> I have minor comments.
> 
> Ok, I made the changes you suggested.  Let's see if anyone else would
> like to vote for or against the concept of the 0002 patch
> (recovery_init_sync_method=none).

It worries me that this needs to be explicitly "turned off" after the 
initial recovery. Seems like something of a foot gun.

Since we have not offered this functionality before I'm not sure we 
should rush to introduce it now. For backup solutions that do their own 
syncing, syncfs() should provide excellent performance so long as the 
file system is not shared, which is something the user can control (and 
is noted in the docs).

Regards,
-- 
-David
david@pgmasters.net



Re: fdatasync performance problem with large number of DB files

From:
Thomas Munro
Date:
Thanks Justin and David.  Replies to two emails inline:

On Sat, Mar 20, 2021 at 12:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Fri, Mar 19, 2021 at 06:28:46PM +1300, Thomas Munro wrote:
> > +++ b/doc/src/sgml/config.sgml
>
> > +        <productname>PostgreSQL</productname> will recursively open and fsync
> > +        all files in the data directory before crash recovery begins.  This
>
> Maybe it should say "data, tablespace, and WAL directories", or just "critical
> directories".

Fair point.  Here's what I went with:

        When set to <literal>fsync</literal>, which is the default,
        <productname>PostgreSQL</productname> will recursively open and
        synchronize all files in the data directory before crash
recovery
        begins.  The search for files will follow symbolic links for the WAL
        directory and each configured tablespace (but not any other symbolic
        links).

> > +     {
> > +             {"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
> > +                     gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
> > +             },
>
> "and tablespaces and WAL"

I feel like that's getting too detailed for the GUC description?

On Sat, Mar 20, 2021 at 2:55 AM David Steele <david@pgmasters.net> wrote:
> I had a look at the patch and it looks good to me.

Thanks!

> Should we mention in the docs that the contents of non-standard symlinks
> will not be synced, i.e. anything other than tablespaces and pg_wal? Or
> can we point them to some docs saying not to do that (if such exists)?

Good idea.  See above for the adjustment I went with to describe the
traditional behaviour, and then I also made a similar change to the
syncfs part, which, I hope, manages to convey that the new mode
matches the existing policy on symlinks:

        On Linux, <literal>syncfs</literal> may be used instead, to ask the
        operating system to synchronize the whole file systems that contain the
        data directory, the WAL file and each tablespace (but not any other
        file systems that may be reachable through symbolic links).

I thought about adding some text along the lines that such symlinks
are not expected, but I think you're right that what we really need is
a good place to point to.  I mean, generally you can't mess around
with the files managed by PostgreSQL and expect everything to keep
working correctly, but it wouldn't hurt to make an explicit statement
about symlinks and where they're allowed (or maybe there is one
already and I failed to find it).  There are hints though, like
pg_basebackup's documentation which tells you it won't follow or
preserve them in general, but... hmm, it also contemplates various
special subdirectories (pg_dynshmem, pg_notify, pg_replslot, ...) that
might be symlinks without saying why.

> > Ok, I made the changes you suggested.  Let's see if anyone else would
> > like to vote for or against the concept of the 0002 patch
> > (recovery_init_sync_method=none).
>
> It worries me that this needs to be explicitly "turned off" after the
> initial recovery. Seems like something of a foot gun.
>
> Since we have not offered this functionality before I'm not sure we
> should rush to introduce it now. For backup solutions that do their own
> syncing, syncfs() should provide excellent performance so long as the
> file system is not shared, which is something the user can control (and
> is noted in the docs).

Thanks.  I'm leaving the 0002 patch "on ice" until someone can explain
how you're supposed to use it without putting a hole in your foot.

I pushed the 0001 patch.  Thanks to all who reviewed.  Of course,
further documentation improvement patches are always welcome.

(One silly thing I noticed is that our comments generally think
"filesystem" is one word, but our documentation always has a space;
this patch followed the local convention in both cases!)



Re: fdatasync performance problem with large number of DB files

From:
David Steele
Date:
On 3/19/21 7:16 PM, Thomas Munro wrote:
> Thanks Justin and David.  Replies to two emails inline:
> 
> Fair point.  Here's what I went with:
> 
>          When set to <literal>fsync</literal>, which is the default,
>          <productname>PostgreSQL</productname> will recursively open and
>          synchronize all files in the data directory before crash
> recovery
>          begins.  The search for files will follow symbolic links for the WAL
>          directory and each configured tablespace (but not any other symbolic
>          links).
> 

+1

> I thought about adding some text along the lines that such symlinks
> are not expected, but I think you're right that what we really need is
> a good place to point to.  I mean, generally you can't mess around
> with the files managed by PostgreSQL and expect everything to keep
> working correctly

WRT symlinks, I'm not sure that's fair to say. From PG's perspective
it's just a dir/file, after all. Other than pg_wal, I have seen
pg_stat/pg_stat_tmp sometimes symlinked, plus config files and the log dir.

pgBackRest takes a pretty liberal approach here. We preserve all
dir/file symlinks no matter where they appear and allow all of them to
be remapped on restore.

> but it wouldn't hurt to make an explicit statement
> about symlinks and where they're allowed (or maybe there is one
> already and I failed to find it).  

I couldn't find it either, and I would be in favor of it. For instance,
pgBackRest forbids tablespaces inside PGDATA, and when people complain
(more often than you might imagine) we can just point to the code/docs.

> There are hints though, like
> pg_basebackup's documentation which tells you it won't follow or
> preserve them in general, but... hmm, it also contemplates various
> special subdirectories (pg_dynshmem, pg_notify, pg_replslot, ...) that
> might be symlinks without saying why.

Right, pg_dynshmem is another one that I've seen symlinked. Some things 
are nice to have on fast storage. pg_notify and pg_replslot are similar 
since they get written to a lot in certain configurations.

>> It worries me that this needs to be explicitly "turned off" after the
>> initial recovery. Seems like something of a foot gun.
>>
>> Since we have not offered this functionality before I'm not sure we
>> should rush to introduce it now. For backup solutions that do their own
>> syncing, syncfs() should provide excellent performance so long as the
>> file system is not shared, which is something the user can control (and
>> is noted in the docs).
> 
> Thanks.  I'm leaving the 0002 patch "on ice" until someone can explain
> how you're supposed to use it without putting a hole in your foot.

+1

> (One silly thing I noticed is that our comments generally think
> "filesystem" is one word, but our documentation always has a space;
> this patch followed the local convention in both cases!)

Personally I prefer "file system".

Regards,
-- 
-David
david@pgmasters.net



Re: fdatasync performance problem with large number of DB files

From:
Fujii Masao
Date:

On 2021/06/04 23:39, Justin Pryzby wrote:
> You said switching to SIGHUP "would have zero effect"; but actually, it allows
> an admin whose DB took a long time in recovery/startup to change the parameter
> without shutting down the service.  This mitigates the downtime if it crashes
> again.  I think that's at least 50% of how this feature might end up being
> used.

Yes, it would have an effect when the server is automatically restarted
after a crash while restart_after_crash is enabled. At least for me, +1 to
your proposed change.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: fdatasync performance problem with large number of DB files

From:
Thomas Munro
Date:
On Fri, Jun 18, 2021 at 1:11 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Thomas, could you comment on this ?

Sorry, I missed that.  It is initially a confusing proposal, but after
trying it out (that is: making recovery_init_sync_method PGC_SIGHUP
and testing a scenario where you want to make the next crash use it
that way and without the change), I agree.  +1 from me.



Re: fdatasync performance problem with large number of DB files

From:
Thomas Munro
Date:
On Tue, Jun 22, 2021 at 5:01 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jun 18, 2021 at 1:11 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Thomas, could you comment on this ?
>
> Sorry, I missed that.  It is initially a confusing proposal, but after
> trying it out (that is: making recovery_init_sync_method PGC_SIGHUP
> and testing a scenario where you want to make the next crash use it
> that way and without the change), I agree.  +1 from me.

... and pushed.