Обсуждение: Re: fdatasync performance problem with large number of DB files
On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa@vmware.com> wrote: > By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup. > The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the > pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync > during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better > let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like > backup_label(?) does not seem to be good ideas since > 1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptionsin pg_ctl command. > 2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might > copy the pgdata directory and then we still need fsync. > The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance. As a thought experiment only, I wonder if there is a way to make your touch-a-special-signal-file scheme more reliable and less dangerous (considering people might copy the signal file around or otherwise screw this up). It seems to me that invalidation is the key, and "unlink the signal file after the first crash recovery" isn't good enough. Hmm What if the file contained a fingerprint containing... let's see... checkpoint LSN, hostname, MAC address, pgdata path, ... (add more seasoning to taste), and then also some flags to say what is known to be fully fsync'd already: the WAL, pgdata but only as far as changes up to the checkpoint LSN, or all of pgdata? Then you could be conservative for a non-match, but skip the extra work in some common cases like pg_basebackup, as long as you trust the fingerprint scheme not to produce false positives. Or something like that... I'm not too keen to invent clever new schemes for PG14, though. This sync_after_crash=syncfs scheme is pretty simple, and has the advantage that it's very cheap to do it extra redundant times assuming nothing else is creating new dirty kernel pages in serious quantities. Is that useful enough? In particular it avoids the dreaded "open 1,000,000 uncached files over high latency network storage" problem. I don't want to add a hypothetical sync_after_crash=none, because it seems like generally a bad idea. We already have a running-with-scissors mode you could use for that: fsync=off.
On 2021/03/17 12:45, Thomas Munro wrote: > On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> On 2021/03/16 8:15, Thomas Munro wrote: >>> I don't want to add a hypothetical sync_after_crash=none, because it >>> seems like generally a bad idea. We already have a >>> running-with-scissors mode you could use for that: fsync=off. >> >> I heard that some backup tools sync the database directory when restoring it. >> I guess that those who use such tools might want the option to disable such >> startup sync (i.e., sync_after_crash=none) because it's not necessary. > > Hopefully syncfs() will return quickly in that case, without doing any work? Yes, in Linux. > >> They can skip that sync by fsync=off. But if they just want to skip only that >> startup sync and make subsequent recovery (or standby server) work with >> fsync=on, they would need to shutdown the server after that startup sync >> finishes, enable fsync, and restart the server. In this case, since the server >> is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync >> would not be performed. This procedure is tricky. So IMO supporting >> sync_after_crash=none would be helpful for this case and simple. > > I still do not like this footgun :-) However, perhaps I am being > overly dogmatic. Consider the change in d8179b00, which decided that > I/O errors in this phase should be reported at LOG level rather than > ERROR. In contrast, my "sync_after_crash=wal" mode (which I need to > rebase over this) will PANIC in this case, because any syncing will be > handled through the usual checkpoint codepaths. > > Do you think it would be OK to commit this feature with just "fsync" > and "syncfs", and then to continue to consider adding "none" as a > possible separate commit? +1. "syncfs" feature is useful whether we also support "none" mode or not. It's good idea to commit "syncfs" in advance. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
About the syncfs patch, my first impression on the guc name sync_after_crash is that it is a boolean type. Not sure about other people's feeling. Do you guys think It is better to rename it to a clearer name like sync_method_after_crash or others?
On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote: > About the syncfs patch, my first impression on the guc name sync_after_crash > is that it is a boolean type. Not sure about other people's feeling. Do you guys think > It is better to rename it to a clearer name like sync_method_after_crash or others? Works for me. Here is a new version like that, also including the documentation change discussed with Fujii-san, and a couple of cosmetic changes.
Вложения
On Wed, Mar 17, 2021 at 11:42 PM Paul Guo <paulguo@gmail.com> wrote: > I just quickly reviewed the patch (the code part). It looks good. Only > one concern > or question is do_syncfs() for symlink opened fd for syncfs() - I'm > not 100% sure. Alright, let me try to prove that it works the way we want with an experiment. I'll make a directory with a file in it, and create a symlink to it in another filesystem: tmunro@x1:~/junk$ mkdir my_wal_dir tmunro@x1:~/junk$ touch my_wal_dir/foo tmunro@x1:~/junk$ ln -s /home/tmunro/junk/my_wal_dir /dev/shm/my_wal_dir_symlink tmunro@x1:~/junk$ ls /dev/shm/my_wal_dir_symlink/ foo Now I'll write a program that repeatedly dirties the first block of foo, and calls syncfs() on the containing directory that it opened using the symlink: tmunro@x1:~/junk$ cat test.c #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> int main() { int symlink_fd, file_fd; symlink_fd = open("/dev/shm/my_wal_dir_symlink", O_RDONLY); if (symlink_fd < 0) { perror("open1"); return EXIT_FAILURE; } file_fd = open("/home/tmunro/junk/my_wal_dir/foo", O_RDWR); if (file_fd < 0) { perror("open2"); return EXIT_FAILURE; } for (int i = 0; i < 4; ++i) { if (pwrite(file_fd, "hello world", 10, 0) != 10) { perror("pwrite"); return EXIT_FAILURE; } if (syncfs(symlink_fd) < 0) { perror("syncfs"); return EXIT_FAILURE; } sleep(1); } return EXIT_SUCCESS; } tmunro@x1:~/junk$ cc test.c tmunro@x1:~/junk$ ./a.out While that's running, to prove that it does what we want it to do, I'll first find out where foo lives on the disk: tmunro@x1:~/junk$ /sbin/xfs_bmap my_wal_dir/foo my_wal_dir/foo: 0: [0..7]: 242968520..242968527 Now I'll trace the writes going to block 242968520, and start the program again: tmunro@x1:~/junk$ sudo btrace /dev/nvme0n1p2 | grep 242968520 259,0 4 93 4.157000669 724924 A W 244019144 + 8 <- (259,2) 242968520 259,0 2 155 5.158446989 718635 A W 244019144 + 8 <- (259,2) 242968520 259,0 7 23 6.163765728 724924 A W 244019144 + 8 <- (259,2) 242968520 259,0 7 30 7.169112683 724924 A W 244019144 + 8 <- (259,2) 242968520
On 2021/03/18 19:19, Thomas Munro wrote: > On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote: >> About the syncfs patch, my first impression on the guc name sync_after_crash >> is that it is a boolean type. Not sure about other people's feeling. Do you guys think >> It is better to rename it to a clearer name like sync_method_after_crash or others? > > Works for me. Here is a new version like that, also including the > documentation change discussed with Fujii-san, and a couple of > cosmetic changes. Thanks for updating the patch! + database cluster that did not shut down cleanly, including copies + created with pg_basebackup. "pg_basebackup" should be "<application>pg_basebackup</application>"? + while ((de = ReadDir(dir, "pg_tblspc"))) The comment of SyncDataDirectory() says "Errors are logged but not considered fatal". So ReadDirExtended() with LOG level should be used here, instead? Isn't it better to call CHECK_FOR_INTERRUPTS() in this loop? + fd = open(path, O_RDONLY); For current use, it's not harmful to use open() and close(). But isn't it safer to use OpenTransientFile() and CloseTransientFile(), instead? Because do_syncfs() may be used for other purpose in the future. + if (syncfs(fd) < 0) + elog(LOG, "syncfs failed for %s: %m", path); According to the message style guide, this message should be something like "could not sync filesystem for \"%s\": %m"? I confirmed that no error was reported when crash recovery started with syncfs, in old Linux. I should also confirm that no error is reported in that case in Linux 5.8+, but I don't have that environement. So I've not tested this feature in Linux 5.8+.... Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Thu, Mar 18, 2021 at 11:19:13PM +1300, Thomas Munro wrote: > On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote: > > About the syncfs patch, my first impression on the guc name sync_after_crash > > is that it is a boolean type. Not sure about other people's feeling. Do you guys think > > It is better to rename it to a clearer name like sync_method_after_crash or others? > > Works for me. Here is a new version like that, also including the > documentation change discussed with Fujii-san, and a couple of > cosmetic changes. Are we sure we want to use the word "crash" here? I don't remember seeing it used anywhere else in our user interface. I guess it is "crash recovery". -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Thu, Mar 18, 2021 at 09:54:11AM -0400, Bruce Momjian wrote: > On Thu, Mar 18, 2021 at 11:19:13PM +1300, Thomas Munro wrote: > > On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote: > > > About the syncfs patch, my first impression on the guc name sync_after_crash > > > is that it is a boolean type. Not sure about other people's feeling. Do you guys think > > > It is better to rename it to a clearer name like sync_method_after_crash or others? > > > > Works for me. Here is a new version like that, also including the > > documentation change discussed with Fujii-san, and a couple of > > cosmetic changes. > > Are we sure we want to use the word "crash" here? I don't remember > seeing it used anywhere else in our user interface. I guess it is > "crash recovery". Maybe call it "recovery_sync_method"? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On 2021/03/18 23:03, Bruce Momjian wrote: >> Are we sure we want to use the word "crash" here? I don't remember >> seeing it used anywhere else in our user interface. We have GUC restart_after_crash. > I guess it is >> "crash recovery". > > Maybe call it "recovery_sync_method" +1. This name sounds good to me. Or recovery_init_sync_method, because that sync happens in the initial phase of recovery. Another idea from different angle is data_directory_sync_method. If we adopt this, we can easily extend this feature for other use cases (other than sync at the beginning of recovery) without changing the name. I'm not sure if such cases actually exist, though. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Fri, Mar 19, 2021 at 5:50 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > On 2021/03/18 23:03, Bruce Momjian wrote: > >> Are we sure we want to use the word "crash" here? I don't remember > >> seeing it used anywhere else in our user interface. > > We have GUC restart_after_crash. > > > I guess it is > >> "crash recovery". > > > > Maybe call it "recovery_sync_method" > +1. This name sounds good to me. Or recovery_init_sync_method, because that > sync happens in the initial phase of recovery. Yeah, I was trying to fit the existing pattern {restart,remove_temp_files}_after_crash. But recovery_init_sync_method also sounds good to me. I prefer the version with "init"... without "init", people might get the wrong idea about what this controls, so let's try that. Done in the attached version. > Another idea from different angle is data_directory_sync_method. If we adopt > this, we can easily extend this feature for other use cases (other than sync at > the beginning of recovery) without changing the name. > I'm not sure if such cases actually exist, though. I can't imagine what -- it's like using a sledge hammer to crack a nut. (I am aware of a semi-related idea: use the proposed fsinfo() Linux system call to read the filesystem-wide error counter at every checkpoint to see if anything bad happened that Linux forgot to tell us about through the usual channels. That's the same internal mechanism used by syncfs() to report errors, but last I checked it hadn't been committed yet. I don't think that's share anything with this code though.) From your earlier email: On Fri, Mar 19, 2021 at 2:12 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > + database cluster that did not shut down cleanly, including copies > + created with pg_basebackup. > > "pg_basebackup" should be "<application>pg_basebackup</application>"? Fixed. > + while ((de = ReadDir(dir, "pg_tblspc"))) > > The comment of SyncDataDirectory() says "Errors are logged but not > considered fatal". So ReadDirExtended() with LOG level should be used > here, instead? Fixed. > Isn't it better to call CHECK_FOR_INTERRUPTS() in this loop? How could this be useful? > + fd = open(path, O_RDONLY); > > For current use, it's not harmful to use open() and close(). But isn't > it safer to use OpenTransientFile() and CloseTransientFile(), instead? Ok, done, for consistency with other code. > Because do_syncfs() may be used for other purpose in the future. I hope not :-) > + if (syncfs(fd) < 0) > + elog(LOG, "syncfs failed for %s: %m", path); > > According to the message style guide, this message should be something > like "could not sync filesystem for \"%s\": %m"? Fixed. > I confirmed that no error was reported when crash recovery started with > syncfs, in old Linux. I should also confirm that no error is reported in that > case in Linux 5.8+, but I don't have that environement. So I've not tested > this feature in Linux 5.8+.... I have a Linux 5.10 system. Here's a trace of SyncDataDirectory on a system that has two tablespaces and has a symlink for pg_wal: [pid 3861224] lstat("pg_wal", {st_mode=S_IFLNK|0777, st_size=11, ...}) = 0 [pid 3861224] openat(AT_FDCWD, ".", O_RDONLY) = 4 [pid 3861224] syncfs(4) = 0 [pid 3861224] close(4) = 0 [pid 3861224] openat(AT_FDCWD, "pg_tblspc", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4 [pid 3861224] fstat(4, {st_mode=S_IFDIR|0700, st_size=32, ...}) = 0 [pid 3861224] getdents64(4, 0x55627e18fb60 /* 4 entries */, 32768) = 112 [pid 3861224] openat(AT_FDCWD, "pg_tblspc/16406", O_RDONLY) = 5 [pid 3861224] syncfs(5) = 0 [pid 3861224] close(5) = 0 [pid 3861224] openat(AT_FDCWD, "pg_tblspc/16407", O_RDONLY) = 5 [pid 3861224] syncfs(5) = 0 [pid 3861224] close(5) = 0 [pid 3861224] getdents64(4, 0x55627e18fb60 /* 0 entries */, 32768) = 0 [pid 3861224] close(4) = 0 [pid 3861224] openat(AT_FDCWD, "pg_wal", O_RDONLY) = 4 [pid 3861224] syncfs(4) = 0 [pid 3861224] close(4) = 0 To see how it looks when syncfs() fails, I added a fake implementation that fails with EUCLEAN on every second call, and then the output looks like this: ... 1616111356.663 startup 3879272 LOG: database system was interrupted; last known up at 2021-03-19 12:48:33 NZDT 1616111356.663 startup 3879272 LOG: could not sync filesystem for "pg_tblspc/16406": Structure needs cleaning 1616111356.663 startup 3879272 LOG: could not sync filesystem for "pg_wal": Structure needs cleaning 1616111356.663 startup 3879272 LOG: database system was not properly shut down; automatic recovery in progress ... A more common setup with no tablespaces and pg_wal as a non-symlink looks like: [pid 3861448] lstat("pg_wal", {st_mode=S_IFDIR|0700, st_size=316, ...}) = 0 [pid 3861448] openat(AT_FDCWD, ".", O_RDONLY) = 4 [pid 3861448] syncfs(4) = 0 [pid 3861448] close(4) = 0 [pid 3861448] openat(AT_FDCWD, "pg_tblspc", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4 [pid 3861448] fstat(4, {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0 [pid 3861448] getdents64(4, 0x55764beb0b60 /* 2 entries */, 32768) = 48 [pid 3861448] getdents64(4, 0x55764beb0b60 /* 0 entries */, 32768) = 0 [pid 3861448] close(4) The alternative fsync() mode is (unsurprisingly) much longer. Thanks for the reviews! PS: For illustration/discussion, I've also attached a "none" patch. I also couldn't resist rebasing my "wal" mode patch, which I plan to propose for PG15 because there is not enough time left for this release.
Вложения
On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote: > PS: For illustration/discussion, I've also attached a "none" patch. I > also couldn't resist rebasing my "wal" mode patch, which I plan to > propose for PG15 because there is not enough time left for this > release. Erm... I attached the wrong version by mistake. Here's a better one. (Note: I'm not expecting any review of the 0003 patch in this CF, I just wanted to share the future direction I'd like to consider for this problem.)
Вложения
On 2021/03/19 10:37, Thomas Munro wrote: > On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> PS: For illustration/discussion, I've also attached a "none" patch. I >> also couldn't resist rebasing my "wal" mode patch, which I plan to >> propose for PG15 because there is not enough time left for this >> release. > > Erm... I attached the wrong version by mistake. Here's a better one. Thanks for updating the patch! It looks good to me! I have one minor comment for the patch. + elog(LOG, "could not open %s: %m", path); + return; + } + if (syncfs(fd) < 0) + elog(LOG, "could not sync filesystem for \"%s\": %m", path); Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()?For example, ereport(LOG, (errcode_for_file_access(), errmsg("could not open \"%s\": %m", path))) Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021/03/19 11:22, Fujii Masao wrote: > > > On 2021/03/19 10:37, Thomas Munro wrote: >> On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote: >>> PS: For illustration/discussion, I've also attached a "none" patch. I >>> also couldn't resist rebasing my "wal" mode patch, which I plan to >>> propose for PG15 because there is not enough time left for this >>> release. >> >> Erm... I attached the wrong version by mistake. Here's a better one. 0002 patch looks good to me. Thanks! I have minor comments. - * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for - * all potential filesystem, depending on recovery_init_sync_method setting. + * Issue fsync recursively on PGDATA and all its contents, issue syncfs for + * all potential filesystem, or do nothing, depending on + * recovery_init_sync_method setting. The comment in SyncDataDirectory() should be updated so that it mentions "none" method, as the above? + This is only safe if all buffered data is known to have been flushed + to disk already, for example by a tool such as + <application>pg_basebackup</application>. It is not a good idea to Isn't it better to add something like "without <literal>--no-sync</literal>" to "pg_basebackup" part? Which would prevent users from misunderstanding that pg_basebackup always ensures that whatever options are specified. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > Thanks for updating the patch! It looks good to me! > I have one minor comment for the patch. > > + elog(LOG, "could not open %s: %m", path); > + return; > + } > + if (syncfs(fd) < 0) > + elog(LOG, "could not sync filesystem for \"%s\": %m", path); > > Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()?For example, Fixed. I'll let this sit until tomorrow to collect any other feedback or objections, and then push the 0001 patch (recovery_init_sync_method=syncfs). On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > 0002 patch looks good to me. Thanks! > I have minor comments. Ok, I made the changes you suggested. Let's see if anyone else would like to vote for or against the concept of the 0002 patch (recovery_init_sync_method=none).
Вложения
On 2021/03/19 14:28, Thomas Munro wrote: > On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> Thanks for updating the patch! It looks good to me! >> I have one minor comment for the patch. >> >> + elog(LOG, "could not open %s: %m", path); >> + return; >> + } >> + if (syncfs(fd) < 0) >> + elog(LOG, "could not sync filesystem for \"%s\": %m", path); >> >> Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()?For example, > > Fixed. Thanks! LGTM. > I'll let this sit until tomorrow to collect any other feedback or > objections, and then push the 0001 patch > (recovery_init_sync_method=syncfs). Understood. > On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> 0002 patch looks good to me. Thanks! >> I have minor comments. > > Ok, I made the changes you suggested. Thanks! LGTM. > Let's see if anyone else would > like to vote for or against the concept of the 0002 patch > (recovery_init_sync_method=none). Agreed. I also want to hear more opinion about the setting "none". Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 3/19/21 1:28 AM, Thomas Munro wrote: > On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> Thanks for updating the patch! It looks good to me! >> I have one minor comment for the patch. >> >> + elog(LOG, "could not open %s: %m", path); >> + return; >> + } >> + if (syncfs(fd) < 0) >> + elog(LOG, "could not sync filesystem for \"%s\": %m", path); >> >> Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()?For example, > > Fixed. > > I'll let this sit until tomorrow to collect any other feedback or > objections, and then push the 0001 patch > (recovery_init_sync_method=syncfs). I had a look at the patch and it looks good to me. Should we mention in the docs that the contents of non-standard symlinks will not be synced, i.e. anything other than tablespaces and pg_wal? Or can we point them to some docs saying not to do that (if such exists)? > On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> 0002 patch looks good to me. Thanks! >> I have minor comments. > > Ok, I made the changes you suggested. Let's see if anyone else would > like to vote for or against the concept of the 0002 patch > (recovery_init_sync_method=none). It worries me that this needs to be explicitly "turned off" after the initial recovery. Seems like something of a foot gun. Since we have not offered this functionality before I'm not sure we should rush to introduce it now. For backup solutions that do their own syncing, syncfs() should provide excellent performance so long as the file system is not shared, which is something the user can control (and is noted in the docs). Regards, -- -David david@pgmasters.net
Thanks Justin and David. Replies to two emails inline: On Sat, Mar 20, 2021 at 12:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Fri, Mar 19, 2021 at 06:28:46PM +1300, Thomas Munro wrote: > > +++ b/doc/src/sgml/config.sgml > > > + <productname>PostgreSQL</productname> will recursively open and fsync > > + all files in the data directory before crash recovery begins. This > > Maybe it should say "data, tablespace, and WAL directories", or just "critical > directories". Fair point. Here's what I went with: When set to <literal>fsync</literal>, which is the default, <productname>PostgreSQL</productname> will recursively open and synchronize all files in the data directory before crash recovery begins. The search for files will follow symbolic links for the WAL directory and each configured tablespace (but not any other symbolic links). > > + { > > + {"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS, > > + gettext_noop("Sets the method for synchronizing the data directory before crash recovery."), > > + }, > > "and tablespaces and WAL" I feel like that's getting too detailed for the GUC description? On Sat, Mar 20, 2021 at 2:55 AM David Steele <david@pgmasters.net> wrote: > I had a look at the patch and it looks good to me. Thanks! > Should we mention in the docs that the contents of non-standard symlinks > will not be synced, i.e. anything other than tablespaces and pg_wal? Or > can we point them to some docs saying not to do that (if such exists)? Good idea. See above for the adjustment I went with to describe the traditional behaviour, and then I also made a similar change to the syncfs part, which, I hope, manages to convey that the new mode matches the existing policy on symlinks: On Linux, <literal>syncfs</literal> may be used instead, to ask the operating system to synchronize the whole file systems that contain the data directory, the WAL file and each tablespace (but not any other file systems that may be reachable through symbolic links). I thought about adding some text along the lines that such symlinks are not expected, but I think you're right that what we really need is a good place to point to. I mean, generally you can't mess around with the files managed by PostgreSQL and expect everything to keep working correctly, but it wouldn't hurt to make an explicit statement about symlinks and where they're allowed (or maybe there is one already and I failed to find it). There are hints though, like pg_basebackup's documentation which tells you it won't follow or preserve them in general, but... hmm, it also contemplates various special subdirectories (pg_dynshmem, pg_notify, pg_replslot, ...) that might be symlinks without saying why. > > Ok, I made the changes you suggested. Let's see if anyone else would > > like to vote for or against the concept of the 0002 patch > > (recovery_init_sync_method=none). > > It worries me that this needs to be explicitly "turned off" after the > initial recovery. Seems like something of a foot gun. > > Since we have not offered this functionality before I'm not sure we > should rush to introduce it now. For backup solutions that do their own > syncing, syncfs() should provide excellent performance so long as the > file system is not shared, which is something the user can control (and > is noted in the docs). Thanks. I'm leaving the 0002 patch "on ice" until someone can explain how you're supposed to use it without putting a hole in your foot. I pushed the 0001 patch. Thanks to all who reviewed. Of course, further documentation improvement patches are always welcome. (One silly thing I noticed is that our comments generally think "filesystem" is one word, but our documentation always has a space; this patch followed the local convention in both cases!)
On 3/19/21 7:16 PM, Thomas Munro wrote: > Thanks Justin and David. Replies to two emails inline: > > Fair point. Here's what I went with: > > When set to <literal>fsync</literal>, which is the default, > <productname>PostgreSQL</productname> will recursively open and > synchronize all files in the data directory before crash > recovery > begins. The search for files will follow symbolic links for the WAL > directory and each configured tablespace (but not any other symbolic > links). > +1 > I thought about adding some text along the lines that such symlinks > are not expected, but I think you're right that what we really need is > a good place to point to. I mean, generally you can't mess around > with the files managed by PostgreSQL and expect everything to keep > working correctly WRT to symlinks I'm not sure that's fair to say. From PG's perspective it's just a dir/file after all. Other than pg_wal I have seen pg_stat/pg_stat_tmp sometimes symlinked, plus config files, and the log dir. pgBackRest takes a pretty liberal approach here. Were preserve all dir/file symlinks no matter where they appear and allow all of them to be remapped on restore. > but it wouldn't hurt to make an explicit statement > about symlinks and where they're allowed (or maybe there is one > already and I failed to find it). I couldn't find it either and I would be in favor of it. For instance, pgBackRest forbids tablespaces inside PGDATA and when people complain (more often then you might imagine) we can just point to the code/docs. > There are hints though, like > pg_basebackup's documentation which tells you it won't follow or > preserve them in general, but... hmm, it also contemplates various > special subdirectories (pg_dynshmem, pg_notify, pg_replslot, ...) that > might be symlinks without saying why. Right, pg_dynshmem is another one that I've seen symlinked. Some things are nice to have on fast storage. pg_notify and pg_replslot are similar since they get written to a lot in certain configurations. >> It worries me that this needs to be explicitly "turned off" after the >> initial recovery. Seems like something of a foot gun. >> >> Since we have not offered this functionality before I'm not sure we >> should rush to introduce it now. For backup solutions that do their own >> syncing, syncfs() should provide excellent performance so long as the >> file system is not shared, which is something the user can control (and >> is noted in the docs). > > Thanks. I'm leaving the 0002 patch "on ice" until someone can explain > how you're supposed to use it without putting a hole in your foot. +1 > (One silly thing I noticed is that our comments generally think > "filesystem" is one word, but our documentation always has a space; > this patch followed the local convention in both cases!) Personally I prefer "file system". Regards, -- -David david@pgmasters.net
On 2021/06/04 23:39, Justin Pryzby wrote: > You said switching to SIGHUP "would have zero effect"; but, actually it allows > an admin who's DB took a long time in recovery/startup to change the parameter > without shutting down the service. This mitigates the downtime if it crashes > again. I think that's at least 50% of how this feature might end up being > used. Yes, it would have an effect when the server is automatically restarted after crash when restart_after_crash is enabled. At least for me +1 to your proposed change. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Fri, Jun 18, 2021 at 1:11 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > Thomas, could you comment on this ? Sorry, I missed that. It is initially a confusing proposal, but after trying it out (that is: making recovery_init_sync_method PGC_SIGHUP and testing a scenario where you want to make the next crash use it that way and without the change), I agree. +1 from me.
On Tue, Jun 22, 2021 at 5:01 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Jun 18, 2021 at 1:11 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > Thomas, could you comment on this ? > > Sorry, I missed that. It is initially a confusing proposal, but after > trying it out (that is: making recovery_init_sync_method PGC_SIGHUP > and testing a scenario where you want to make the next crash use it > that way and without the change), I agree. +1 from me. ... and pushed.