Discussion: fdatasync(2) on macOS
Hello hackers,

While following along with the nearby investigation into weird
cross-version Apple toolchain issues that confuse configure, I noticed
that the newer buildfarm Macs say:

checking for fdatasync... (cached) yes

That's a bit strange because there's no man page and no declaration:

checking whether fdatasync is declared... (cached) no

That's no obstacle for us, because c.h does:

#if defined(HAVE_FDATASYNC) && !HAVE_DECL_FDATASYNC
extern int fdatasync(int fildes);
#endif

So... does this unreleased function flush drive caches? We know that
fsync(2) doesn't, based on Apple's advice[1] for database hackers to
call fcntl(fd, F_FULLFSYNC, 0) instead. We do that.

Speaking as an armchair Internet Unix detective, my guess is: no. In
the source[2] we can see that there is a real system call table entry
and VFS support, so there is *something* wired up to this lever. On
the other hand, it shares code with fsync(2), and I suppose that
fdatasync(2) isn't going to do *more* than fsync(2). But who knows?
Not only is it unreleased, but below VNOP_FSYNC() you reach closed
source file system code.

That was fun, but now I'm asking myself: do we really want to use an
IO synchronisation facility that's not declared by the vendor? I see
that our declaration goes back 20 years to 33cc5d8a, which introduced
fdatasync(2). The discussion from the time[3] makes it clear that the
OS support was very patchy and thin back then.

Just by the way, another fun thing I learned about libSystem while
reading up on Big Sur changes is that the system libraries are no
longer on the file system. dlopen() is magical.

[1] https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/fsync.2.html
[2] https://github.com/apple/darwin-xnu/blob/d4061fb0260b3ed486147341b72468f836ed6c8f/bsd/vfs/vfs_syscalls.c#L7708
[3] https://www.postgresql.org/message-id/flat/200102171805.NAA24180%40candle.pha.pa.us
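For readers who have not met the macOS flush call mentioned in the message above, here is a minimal sketch of the idea. The helper name flush_to_media is made up for illustration and this is not PostgreSQL's actual code; the fallback to fsync(2) when the fcntl fails is also an assumption about what a caller might want.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL's code): ask the OS to push a
 * file's data all the way to stable storage.
 *
 * Where F_FULLFSYNC is defined (macOS), try fcntl(fd, F_FULLFSYNC, 0),
 * the call that Apple's fsync(2) man page recommends.  If that fails,
 * for example on a file system that does not support it, fall back to
 * plain fsync(2), which may leave data in the drive's write cache.
 *
 * Returns 0 on success, -1 on error with errno set.
 */
int
flush_to_media(int fd)
{
#ifdef F_FULLFSYNC
	if (fcntl(fd, F_FULLFSYNC, 0) != -1)
		return 0;
	/* Assumption: a weaker flush is better than none, so fall through. */
#endif
	return fsync(fd);
}
```

On platforms without F_FULLFSYNC the helper is just fsync(2), so it says nothing about what fdatasync(2) does on macOS; it only illustrates the documented path.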
On Fri, Jan 15, 2021 at 7:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> That was fun, but now I'm asking myself: do we really want to use an
> IO synchronisation facility that's not declared by the vendor?

I should add, the default wal_sync_method is open_datasync, not
fdatasync. I'm pretty suspicious of that too: neither O_SYNC nor
O_DSYNC appears as a documented flag for open(2) and the numbers look
suspicious. Perhaps they only define them to support aio_fsync(2).
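As a rough illustration of what an open_datasync-style method relies on, the sketch below opens a file with O_DSYNC where the headers define it and falls back to O_SYNC otherwise. The names open_dsync and print_sync_flags are hypothetical, and the code makes no claim about whether a given OS honors the flags; it only shows how they are requested and how to print their numeric values for inspection.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a file for writing so that each write(2)
 * is requested to be synchronous, preferring O_DSYNC over O_SYNC.
 * Whether the flag has any effect is up to the OS.
 */
int
open_dsync(const char *path)
{
	int			flags = O_WRONLY | O_CREAT;

#if defined(O_DSYNC)
	flags |= O_DSYNC;
#elif defined(O_SYNC)
	flags |= O_SYNC;
#endif
	return open(path, flags, 0600);
}

/* Print the flag values so they can be compared across platforms. */
void
print_sync_flags(void)
{
#ifdef O_SYNC
	printf("O_SYNC  = 0x%x\n", (unsigned int) O_SYNC);
#endif
#ifdef O_DSYNC
	printf("O_DSYNC = 0x%x\n", (unsigned int) O_DSYNC);
#endif
}
```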
Thomas Munro <thomas.munro@gmail.com> writes:
> While following along with the nearby investigation into weird
> cross-version Apple toolchain issues that confuse configure, I noticed
> that the newer buildfarm Macs say:
> checking for fdatasync... (cached) yes
> That's a bit strange because there's no man page and no declaration:

Yeah, it's been there but undeclared for a long time. Who knows why.

> So... does this unreleased function flush drive caches? We know that
> fsync(2) doesn't, based on Apple's advice[1] for database hackers to
> call fcntl(fd, F_FULLFSYNC, 0) instead. We do that.

pg_test_fsync results make it clear that fdatasync is the same or a shade
faster than fsync, which is pretty much what you'd expect. On my
late-model Macbook Pro:

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     14251.416 ops/sec      70 usecs/op
        fdatasync                         25345.103 ops/sec      39 usecs/op
        fsync                             24677.445 ops/sec      41 usecs/op
        fsync_writethrough                   41.519 ops/sec   24085 usecs/op
        open_sync                         14188.903 ops/sec      70 usecs/op

and on an old Mac mini with spinning rust:

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      2536.535 ops/sec     394 usecs/op
        fdatasync                          4602.192 ops/sec     217 usecs/op
        fsync                              4600.365 ops/sec     217 usecs/op
        fsync_writethrough                   12.135 ops/sec   82404 usecs/op
        open_sync                          2506.674 ops/sec     399 usecs/op

So it's not a no-op, but on the other hand it's not succeeding in getting
bits down to the platter. I'm not inclined to dike it out, but it does
seem problematic that we're defaulting to open_datasync, which is also
not getting bits down to the platter.

I have a vague recollection that we discussed changing the default
wal_sync_method for Darwin years ago, but I don't recall why we
didn't pull the trigger. These results certainly suggest that
we oughta.

			regards, tom lane
On Fri, Jan 15, 2021 at 12:55:52PM -0500, Tom Lane wrote:
> > So... does this unreleased function flush drive caches? We know that
> > fsync(2) doesn't, based on Apple's advice[1] for database hackers to
> > call fcntl(fd, F_FULLFSYNC, 0) instead. We do that.
>
> pg_test_fsync results make it clear that fdatasync is the same or a shade
> faster than fsync, which is pretty much what you'd expect. On my
> late-model Macbook Pro:
>
> Compare file sync methods using two 8kB writes:
> (in wal_sync_method preference order, except fdatasync is Linux's default)
>         open_datasync                     14251.416 ops/sec      70 usecs/op
>         fdatasync                         25345.103 ops/sec      39 usecs/op
>         fsync                             24677.445 ops/sec      41 usecs/op
>         fsync_writethrough                   41.519 ops/sec   24085 usecs/op
>         open_sync                         14188.903 ops/sec      70 usecs/op
>
> and on an old Mac mini with spinning rust:
>
> Compare file sync methods using two 8kB writes:
> (in wal_sync_method preference order, except fdatasync is Linux's default)
>         open_datasync                      2536.535 ops/sec     394 usecs/op
>         fdatasync                          4602.192 ops/sec     217 usecs/op
>         fsync                              4600.365 ops/sec     217 usecs/op
>         fsync_writethrough                   12.135 ops/sec   82404 usecs/op
>         open_sync                          2506.674 ops/sec     399 usecs/op
>
> So it's not a no-op, but on the other hand it's not succeeding in getting
> bits down to the platter. I'm not inclined to dike it out, but it does
> seem problematic that we're defaulting to open_datasync, which is also
> not getting bits down to the platter.
>
> I have a vague recollection that we discussed changing the default
> wal_sync_method for Darwin years ago, but I don't recall why we
> didn't pull the trigger. These results certainly suggest that
> we oughta.

Is this with an SSD? We used to be able to know something wasn't
flushing to durable storage because magnetic disk was so slow you could
tell from the numbers, but with SSDs, it might be harder to guess.
Maybe time to use:

	https://brad.livejournal.com/2116715.html
	diskchecker.pl

or find a way to automate that test.

--
  Bruce Momjian <bruce@momjian.us>  https://momjian.us
  EnterpriseDB                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
On Sat, Jan 16, 2021 at 6:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So it's not a no-op, but on the other hand it's not succeeding in getting
> bits down to the platter. I'm not inclined to dike it out, but it does
> seem problematic that we're defaulting to open_datasync, which is also
> not getting bits down to the platter.

Hmm, OK, from these times it does appear that O_SYNC and O_DSYNC
actually do something then. It's baffling that they are undocumented.
It might be possible to use dtrace on a SIP-disabled Mac to trace the
IOs with this script, to see if the B_FUA flag is being set, which
might make open_datasync better than fdatasync (if it's being sent and
not ignored), but again, who knows?!:

https://github.com/apple/darwin-xnu/blob/master/bsd/dev/dtrace/scripts/io.d

> I have a vague recollection that we discussed changing the default
> wal_sync_method for Darwin years ago, but I don't recall why we
> didn't pull the trigger. These results certainly suggest that
> we oughta.

No strong preference here, at least without more information. It's
unsettling that two of our wal_sync_methods are based on half-released
phantom operating system features, but there doesn't seem to be much
we can do about that other than try to understand what they do. I see
that the idea of defaulting to fsync_writethrough was discussed a
decade ago and rejected[1].

I'm not entirely sure how it manages to be so slow.

It looks like the reliability section of our manual could use a spring
clean[2]. It's still talking about IDE and platters, instead of
modern stuff like NVMe, cloud/network storage and FUA flags.

[1] https://www.postgresql.org/message-id/flat/AANLkTik261QWc9kGv6acZz2h9ZrQy9rKQC8ow5U1tAaM%40mail.gmail.com
[2] https://www.postgresql.org/docs/13/wal-reliability.html
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sat, Jan 16, 2021 at 6:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I have a vague recollection that we discussed changing the default
>> wal_sync_method for Darwin years ago, but I don't recall why we
>> didn't pull the trigger. These results certainly suggest that
>> we oughta.

> No strong preference here, at least without more information. It's
> unsettling that two of our wal_sync_methods are based on half-released
> phantom operating system features, but there doesn't seem to be much
> we can do about that other than try to understand what they do. I see
> that the idea of defaulting to fsync_writethrough was discussed a
> decade ago and rejected[1].
> [1] https://www.postgresql.org/message-id/flat/AANLkTik261QWc9kGv6acZz2h9ZrQy9rKQC8ow5U1tAaM%40mail.gmail.com

Ah, thanks for doing the archaeology on that. Re-reading that old
thread, it seems like the two big arguments against making it
safe-by-default were

(1) other platforms weren't safe-by-default either. Perhaps the
state of the art is better now, though?

(2) we don't want to force exceedingly-expensive defaults on people
who may be uninterested in reliable storage. That seemed like a
shaky argument then and it still does now. Still, I see the point
that suddenly degrading performance by orders of magnitude would
be a PR disaster.

			regards, tom lane
On Mon, Jan 18, 2021 at 5:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (1) other platforms weren't safe-by-default either. Perhaps the
> state of the art is better now, though?

Generally the answer seems to be yes, but there are still some systems
out there that don't send flushes when volatile write cache is
enabled. Probably still including Macs, by the admission of their man
page. The numbers I saw would put a little M1 Air at the upper range
of super expensive server storage if they included or didn't need a
flush to survive power loss, but then that's a consumer device with a
battery so it doesn't really fit into the usual way we think about
database server storage and power loss...

> (2) we don't want to force exceedingly-expensive defaults on people
> who may be uninterested in reliable storage. That seemed like a
> shaky argument then and it still does now. Still, I see the point
> that suddenly degrading performance by orders of magnitude would
> be a PR disaster.

(Purely as a matter of curiosity, I wonder why the latency is so high
for F_FULLFSYNC. Wild speculation: APFS is said to be a bit like ZFS,
but it's also said to avoid the data journaling of HFS+... so perhaps
it lacks an equivalent of ZFS's ZIL (a thing like WAL) that allows
synchronous writes to avoid having to flush out a new tree and uber
block (in ZFS lingo "spa_sync()"). It might be possible to see this
with tools like iosnoop (or the underlying io:::start dtrace probe),
if you overwrite a single block and then fcntl(F_FULLFSYNC). Your 12
ops/sec on spinning rust would have to be explained by something like
that, and is significantly slower than the speeds I see on my spinning
rust ZFS system that manages something like disk rotation speed.)

Anyway, my purpose in this thread was to flag our usage of the
undocumented system call and open flags; that is, "how we talk to the
OS", not "how the OS talks to the disk". That turned out to be
already well known and not as new as I first thought, so I'm not
planning to pursue this Mac stuff any further, despite my curiosity...
On Mon, Jan 18, 2021 at 4:39 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Jan 16, 2021 at 6:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > So it's not a no-op, but on the other hand it's not succeeding in getting
> > bits down to the platter. I'm not inclined to dike it out, but it does
> > seem problematic that we're defaulting to open_datasync, which is also
> > not getting bits down to the platter.
>
> Hmm, OK, from these times it does appear that O_SYNC and O_DSYNC
> actually do something then. It's baffling that they are undocumented.

I was digging through Apple sources again trying to learn something
about bug report #16827, and I spotted one extra detail that I wanted
to share in this thread about these undocumented system interfaces,
just for the record. It appears that as of macOS 11.2/XNU 7195.83.3,
their vn_write() doesn't treat O_DSYNC any differently than O_SYNC:

    /*
     * Treat synchronous mounts and O_FSYNC on the fd as equivalent.
     *
     * XXX We treat O_DSYNC as O_FSYNC for now, since we can not delay
     * XXX the non-essential metadata without some additional VFS work;
     * XXX the intent at this point is to plumb the interface for it.
     */
    if ((fp->fp_glob->fg_flag & (O_FSYNC | O_DSYNC)) ||
        (vp->v_mount && (vp->v_mount->mnt_flag & MNT_SYNCHRONOUS))) {
            ioflag |= IO_SYNC;
    }