Discussion: Lowering the default wal_blocksize to 4K
Hi,

I've mentioned this to a few people before, but forgot to start an actual thread. So here we go:

I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from the current 8192. The reasons are:

a) We don't gain much from a blocksize above 4096, as we already write all the pending WAL data in one go (except when at the tail of wal_buffers). We *do* incur more overhead for page headers, but compared to the actual WAL data it is not a lot (~0.29% of space is page headers with 8192 vs ~0.59% with 4096).

b) Writing 8KB when we have to flush a partially filled buffer can substantially increase write amplification. In a transactional workload, this will often double the write volume.

Currently disks mostly have 4096 bytes as their "sector size". Sometimes that's exposed directly; sometimes they can also write in 512-byte units, but that internally requires a read-modify-write operation.

For some example numbers, I ran a very simple insert workload with a varying number of clients, with both a wal_blocksize=4096 and a wal_blocksize=8192 cluster, and measured the number of bytes written before/after. The table was recreated before each run, followed by a checkpoint and then the benchmark. Here I ran the inserts for only 15s each, because the results don't change meaningfully with longer runs.

With XLOG_BLCKSZ=8192:

clients    tps      disk bytes written
      1      667     81296
      2      739     89796
      4     1446     89208
      8     2858     90858
     16     5775     96928
     32    11920    115351
     64    23686    135244
    128    46001    173390
    256    88833    239720
    512   146208    335669

With XLOG_BLCKSZ=4096:

clients    tps      disk bytes written
      1      751     46838
      2      773     47936
      4     1512     48317
      8     3143     52584
     16     6221     59097
     32    12863     73776
     64    25652     98792
    128    48274    133330
    256    88969    200720
    512   146298    298523

This is on a not-that-fast NVMe SSD (Samsung SSD 970 PRO 1TB).

It's IMO quite interesting that even at the higher client counts, the number of bytes written doesn't reach parity.
On a stripe of two very fast SSDs:

With XLOG_BLCKSZ=8192:

clients    tps      disk bytes written
      1    23786    2893392
      2    38515    4683336
      4    63436    4688052
      8   106618    4618760
     16   177905    4384360
     32   254890    3890664
     64   297113    3031568
    128   299878    2297808
    256   308774    1935064
    512   292515    1630408

With XLOG_BLCKSZ=4096:

clients    tps      disk bytes written
      1    25742    1586748
      2    43578    2686708
      4    62734    2613856
      8   116217    2809560
     16   200802    2947580
     32   269268    2461364
     64   323195    2042196
    128   317160    1550364
    256   309601    1285744
    512   292063    1103816

It's fun to see how the total number of writes *decreases* at higher concurrency, because it becomes more likely that pages are filled completely.

One thing I noticed is that our auto-configuration of wal_buffers leads to different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem great.

Performing the same COPY workload (1024 files, split across N clients) for both settings shows no performance difference, but a very slight increase in total bytes written (about 0.25%, which is roughly what I'd expect).

Personally I'd say the slight increase in WAL volume is more than outweighed by the increase in throughput and decrease in bytes written.

There's an alternative approach we could take, which is to write in 4KB increments, while keeping 8KB pages. With the current format that's not obviously a bad idea. But given there aren't really advantages in 8KB WAL pages, it seems we should just go for 4KB?

Greetings,

Andres Freund
On Mon, Oct 9, 2023 at 04:08:05PM -0700, Andres Freund wrote:
> There's an alternative approach we could take, which is to write in 4KB
> increments, while keeping 8KB pages. With the current format that's not
> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> pages, it seems we should just go for 4KB?

How do we handle shorter maximum row lengths and shorter maximum index
entry lengths?

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.
Hi,

On 2023-10-09 19:26:54 -0400, Bruce Momjian wrote:
> On Mon, Oct 9, 2023 at 04:08:05PM -0700, Andres Freund wrote:
> > There's an alternative approach we could take, which is to write in 4KB
> > increments, while keeping 8KB pages. With the current format that's not
> > obviously a bad idea. But given there aren't really advantages in 8KB WAL
> > pages, it seems we should just go for 4KB?
>
> How do we handle shorter maximum row lengths and shorter maximum index
> entry lengths?

The WAL blocksize shouldn't influence either, unless we have a bug somewhere.

Greetings,

Andres Freund
On Mon, Oct 9, 2023 at 04:36:20PM -0700, Andres Freund wrote:
> Hi,
>
> On 2023-10-09 19:26:54 -0400, Bruce Momjian wrote:
> > On Mon, Oct 9, 2023 at 04:08:05PM -0700, Andres Freund wrote:
> > > There's an alternative approach we could take, which is to write in 4KB
> > > increments, while keeping 8KB pages. With the current format that's not
> > > obviously a bad idea. But given there aren't really advantages in 8KB WAL
> > > pages, it seems we should just go for 4KB?
> >
> > How do we handle shorter maximum row lengths and shorter maximum index
> > entry lengths?
>
> The WAL blocksize shouldn't influence either, unless we have a bug somewhere.

Oh, good point.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.
Andres Freund <andres@anarazel.de> writes:
> There's an alternative approach we could take, which is to write in 4KB
> increments, while keeping 8KB pages. With the current format that's not
> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> pages, it seems we should just go for 4KB?

Seems like that's doubling the overhead of WAL page headers. Do we need
to try to skinny those down?

regards, tom lane
Hi,

On 2023-10-09 23:16:30 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > There's an alternative approach we could take, which is to write in 4KB
> > increments, while keeping 8KB pages. With the current format that's not
> > obviously a bad idea. But given there aren't really advantages in 8KB WAL
> > pages, it seems we should just go for 4KB?
>
> Seems like that's doubling the overhead of WAL page headers. Do we need
> to try to skinny those down?

I think the overhead is small, and we are wasting so much space in other places, that I am not worried about the proportional increase in page header space usage at this point, particularly compared to the savings in overall write rate and the increase in TPS. There are other areas where we can save much more space, if we want to focus on that.

I was thinking we should perhaps do the opposite, namely getting rid of short page headers. The overhead in the "byte position" <-> LSN conversion due to the differing space is worse than the gain. Or do something in between - having the system ID in the header adds a useful crosscheck, but I'm far less convinced that having segment and block size in there, as 32-bit numbers no less, is worthwhile. After all, if the system ID matches, it's not likely that the xlog block or segment size differ.

Greetings,

Andres Freund
On Tue, 10 Oct 2023 at 01:08, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I've mentioned this to a few people before, but forgot to start an actual
> thread. So here we go:
>
> I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from
> the current 8192.

Seems like a good idea.

> It's IMO quite interesting that even at the higher client counts, the number
> of bytes written don't reach parity.
>
> It's fun to see how the total number of writes *decreases* at higher
> concurrency, because it becomes more likely that pages are filled completely.

With higher client counts and short transactions I think it is not too unexpected to see commit_delay+commit_siblings configured. Did you measure the impact of this change on such configurations?

> One thing I noticed is that our auto-configuration of wal_buffers leads to
> different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem
> great.

Hmm.

> Performing the same COPY workload (1024 files, split across N clients) for
> both settings shows no performance difference, but a very slight increase in
> total bytes written (about 0.25%, which is roughly what I'd expect).
>
> Personally I'd say the slight increase in WAL volume is more than outweighed
> by the increase in throughput and decrease in bytes written.

Agreed.

> There's an alternative approach we could take, which is to write in 4KB
> increments, while keeping 8KB pages. With the current format that's not
> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> pages, it seems we should just go for 4KB?

It is not just the disk overhead of blocks: we also maintain some other data (currently in the form of XLogRecPtrs) in memory for each WAL buffer, the overhead of which will also increase when we increase the number of XLog pages per MB of WAL that we cache.

Additionally, highly concurrent workloads with transactions that write a high multiple of XLOG_BLCKSZ bytes to WAL may start to see increased overhead due to the ~0.25% additional WAL getting written and a doubling of the number of XLog pages being touched (both initialization, and the smaller memcpy for records that would now cross an extra page boundary).

However, for all of these issues I doubt that they actually matter much in the grand scheme of things, so I definitely wouldn't mind moving to 4KiB XLog pages.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-10-09 23:16:30 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> There's an alternative approach we could take, which is to write in 4KB
>>> increments, while keeping 8KB pages. With the current format that's not
>>> obviously a bad idea. But given there aren't really advantages in 8KB WAL
>>> pages, it seems we should just go for 4KB?
>>
>> Seems like that's doubling the overhead of WAL page headers. Do we need
>> to try to skinny those down?
>
> I think the overhead is small, and we are wasting so much space in other
> places, that I am not worried about the proportional increase page header
> space usage at this point, particularly compared to saving in overall write
> rate and increase in TPS. There's other areas we can save much more space, if
> we want to focus on that.
>
> I was thinking we should perhaps do the opposite, namely getting rid of short
> page headers. The overhead in the "byte position" <-> LSN conversion due to
> the differing space is worse than the gain. Or do something inbetween - having
> the system ID in the header adds a useful crosscheck, but I'm far less
> convinced that having segment and block size in there, as 32bit numbers no
> less, is worthwhile. After all, if the system id matches, it's not likely that
> the xlog block or segment size differ.

Hmm. I don't think we should remove those checks, as I can see people that would want to change their XLog block size with e.g. pg_reset_wal. But I think we can relatively easily move the segsize/blocksize checks to a different place in the normal page header, which would reduce the number of bytes we'd have to store elsewhere.

We could move segsize/blocksize into the xlp_info flags: 12 of the 16 bits are currently unused. Using 4 of these bits for segsize (indicating 2^N MB; the currently accepted values are N=0..10 for 1 MB ... 1024 MB) and 4 (or 3) for blcksz (as we currently support 1..64 kB blocks, i.e. 2^{0..6} kB) would remove the need for 2 of the 3 fields in the large xlog block header.

After that we'll only have the system ID left from the extended header, which we could store across 2 pages in the (current) alignment losses of xlp_rem_len - even pages the upper half, uneven pages the lower half of the ID. This should allow for enough integrity checks without further increasing the size of XLogPageHeader in most installations.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
Hi,

On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:
> > On 2023-10-09 23:16:30 -0400, Tom Lane wrote:
> >> Andres Freund <andres@anarazel.de> writes:
> >>> There's an alternative approach we could take, which is to write in 4KB
> >>> increments, while keeping 8KB pages. With the current format that's not
> >>> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> >>> pages, it seems we should just go for 4KB?
> >>
> >> Seems like that's doubling the overhead of WAL page headers. Do we need
> >> to try to skinny those down?
> >
> > I think the overhead is small, and we are wasting so much space in other
> > places, that I am not worried about the proportional increase page header
> > space usage at this point, particularly compared to saving in overall write
> > rate and increase in TPS. There's other areas we can save much more space, if
> > we want to focus on that.
> >
> > I was thinking we should perhaps do the opposite, namely getting rid of short
> > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > the differing space is worse than the gain. Or do something inbetween - having
> > the system ID in the header adds a useful crosscheck, but I'm far less
> > convinced that having segment and block size in there, as 32bit numbers no
> > less, is worthwhile. After all, if the system id matches, it's not likely that
> > the xlog block or segment size differ.
>
> Hmm. I don't think we should remove those checks, as I can see people
> that would want to change their XLog block size with e.g.
> pg_reset_wal.

I don't think that's something we need to address in every physical segment. For one, there's no option to do so. But more importantly, if they don't change the xlog block size, we'll just accept random WAL as well. If somebody goes to the trouble of writing a custom tool, they can live with the consequences of that potentially causing breakage. Particularly if the checks wouldn't meaningfully prevent that anyway.

> After that we'll only have the system ID left from the extended
> header, which we could store across 2 pages in the (current) alignment
> losses of xlp_rem_len - even pages the upper half, uneven pages the
> lower half of the ID. This should allow for enough integrity checks
> without further increasing the size of XLogPageHeader in most
> installations.

I doubt that that's a good idea - what if there's just a single page in a segment? And there aren't earlier segments? That's not a rare case, IME.

Greetings,

Andres Freund
On Wed, Oct 11, 2023 at 12:29 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> > On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:
> > > I was thinking we should perhaps do the opposite, namely getting rid of short
> > > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > > the differing space is worse than the gain. Or do something inbetween - having
> > > the system ID in the header adds a useful crosscheck, but I'm far less
> > > convinced that having segment and block size in there, as 32bit numbers no
> > > less, is worthwhile. After all, if the system id matches, it's not likely that
> > > the xlog block or segment size differ.
> >
> > Hmm. I don't think we should remove those checks, as I can see people
> > that would want to change their XLog block size with e.g.
> > pg_reset_wal.
>
> I don't think that's something we need to address in every physical
> segment. For one, there's no option to do so. But more importantly, if they
> don't change the xlog block size, we'll just accept random WAL as well. If
> somebody goes to the trouble of writing a custom tool, they can live with the
> consequences of that potentially causing breakage. Particularly if the checks
> wouldn't meaningfully prevent that anyway.

How about this idea: put the system ID etc. into the new record Robert is proposing for the redo point, and also into the checkpoint record, so that it's at both ends of the to-be-replayed range. That just leaves the WAL segments in between. If you find yourself writing a new record that would go in the first usable byte of a segment, insert a new special system ID (etc.) record that will be checked during replay.

For segments that start with XLP_FIRST_IS_CONTRECORD, don't worry about it: those already form part of a chain of verification (xlp_rem_len, xl_crc) that started on the preceding page, so it seems almost impossible to accidentally replay from a segment that came from another system.
Hi,

On 2023-10-11 14:39:12 +1300, Thomas Munro wrote:
> On Wed, Oct 11, 2023 at 12:29 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> > > On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:
> > > > I was thinking we should perhaps do the opposite, namely getting rid of short
> > > > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > > > the differing space is worse than the gain. Or do something inbetween - having
> > > > the system ID in the header adds a useful crosscheck, but I'm far less
> > > > convinced that having segment and block size in there, as 32bit numbers no
> > > > less, is worthwhile. After all, if the system id matches, it's not likely that
> > > > the xlog block or segment size differ.
> > >
> > > Hmm. I don't think we should remove those checks, as I can see people
> > > that would want to change their XLog block size with e.g.
> > > pg_reset_wal.
> >
> > I don't think that's something we need to address in every physical
> > segment. For one, there's no option to do so. But more importantly, if they
> > don't change the xlog block size, we'll just accept random WAL as well. If
> > somebody goes to the trouble of writing a custom tool, they can live with the
> > consequences of that potentially causing breakage. Particularly if the checks
> > wouldn't meaningfully prevent that anyway.
>
> How about this idea: Put the system ID etc into the new record Robert
> is proposing for the redo point, and also into the checkpoint record,
> so that it's at both ends of the to-be-replayed range.

I think that's a very good idea.

> That just leaves the WAL segments in between. If you find yourself writing
> a new record that would go in the first usable byte of a segment, insert a
> new special system ID (etc) record that will be checked during replay.

I don't see how we can do that without incurring a lot of overhead though. This determination would need to happen in ReserveXLogInsertLocation(), while holding the spinlock. That's one of the most contended bits of code in postgres. The whole reason that we have this "byte pos" to LSN conversion stuff is to make the spinlock-protected part of ReserveXLogInsertLocation() as short as possible.

> For segments that start with XLP_FIRST_IS_CONTRECORD, don't worry about it:
> those already form part of a chain of verification (xlp_rem_len, xl_crc)
> that started on the preceding page, so it seems almost impossible to
> accidentally replay from a segment that came from another system.

But I think we might just be OK with logic similar to this, even for the non-contrecord case. If recovery starts in one segment where we have verified the sysid, xlog block size, etc., and we encounter a WAL record starting on the first "content byte" of a segment, we can still verify that the prev LSN is correct, etc. Sure, if you try hard you could come up with a scenario where you could mislead such a check, but we don't need to protect against intentional malice here, just against accidents.

Greetings,

Andres Freund
On Wed, 11 Oct 2023 at 01:29, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> > On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:
> > > On 2023-10-09 23:16:30 -0400, Tom Lane wrote:
> > >> Andres Freund <andres@anarazel.de> writes:
> > >>> There's an alternative approach we could take, which is to write in 4KB
> > >>> increments, while keeping 8KB pages. With the current format that's not
> > >>> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> > >>> pages, it seems we should just go for 4KB?
> > >>
> > >> Seems like that's doubling the overhead of WAL page headers. Do we need
> > >> to try to skinny those down?
> > >
> > > I think the overhead is small, and we are wasting so much space in other
> > > places, that I am not worried about the proportional increase page header
> > > space usage at this point, particularly compared to saving in overall write
> > > rate and increase in TPS. There's other areas we can save much more space, if
> > > we want to focus on that.
> > >
> > > I was thinking we should perhaps do the opposite, namely getting rid of short
> > > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > > the differing space is worse than the gain. Or do something inbetween - having
> > > the system ID in the header adds a useful crosscheck, but I'm far less
> > > convinced that having segment and block size in there, as 32bit numbers no
> > > less, is worthwhile. After all, if the system id matches, it's not likely that
> > > the xlog block or segment size differ.
> >
> > Hmm. I don't think we should remove those checks, as I can see people
> > that would want to change their XLog block size with e.g.
> > pg_reset_wal.
>
> I don't think that's something we need to address in every physical
> segment. For one, there's no option to do so.

Not block size, but xlog segment size is modifiable with pg_resetwal, and could thus reasonably change across restarts. Apart from more practical concerns around compile-time options requiring you to swap out binaries, I don't really see why the xlog block size couldn't be changed with pg_resetwal in a securely shut-down cluster, as one does with the WAL segment size.

> But more importantly, if they
> don't change the xlog block size, we'll just accept random WAL as well. If
> somebody goes to the trouble of writing a custom tool, they can live with the
> consequences of that potentially causing breakage. Particularly if the checks
> wouldn't meaningfully prevent that anyway.

I don't understand what you mean by "we'll just accept random WAL as well". We do significant validation in XLogReaderValidatePageHeader to make sure that all pages of WAL are sufficiently well-formed that they can securely be read by the available infrastructure with the least chance of misreading data. There is currently no chance that we read WAL from WAL segments that contain correct data for different segment or block sizes. That includes WAL from segments created before a pg_resetwal changed the WAL segment size.

If this "custom tool" refers to the typo-ed name of pg_resetwal, that is hardly a custom tool; it is shipped with PostgreSQL and you can find the sources under src/bin/pg_resetwal.

> > After that we'll only have the system ID left from the extended
> > header, which we could store across 2 pages in the (current) alignment
> > losses of xlp_rem_len - even pages the upper half, uneven pages the
> > lower half of the ID. This should allow for enough integrity checks
> > without further increasing the size of XLogPageHeader in most
> > installations.
>
> I doubt that that's a good idea - what if there's just a single page in a
> segment? And there aren't earlier segments? That's not a rare case, IME.

Then we'd still have 50% of a system ID which we can check against for any corruption. I agree that it increases the chance of conflicts, but it's still strictly better than nothing at all.

An alternative solution would be to write the first two pages of a WAL segment regardless of contents, so that we essentially never have access to only the first page during crash recovery. Physical replication's recovery wouldn't be able to read ahead, but I consider that less problematic.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Tue, Oct 10, 2023 at 7:29 PM Andres Freund <andres@anarazel.de> wrote:
> > Hmm. I don't think we should remove those checks, as I can see people
> > that would want to change their XLog block size with e.g.
> > pg_reset_wal.
>
> I don't think that's something we need to address in every physical
> segment. For one, there's no option to do so. But more importantly, if they
> don't change the xlog block size, we'll just accept random WAL as well. If
> somebody goes to the trouble of writing a custom tool, they can live with the
> consequences of that potentially causing breakage. Particularly if the checks
> wouldn't meaningfully prevent that anyway.

I'm extremely confused about what both of you are saying.

Matthias is referring to pg_reset_wal, which I assume means pg_resetwal. But it has no option to change the WAL block size. It does have an option to change the WAL segment size, but that's not the same thing. And even if pg_resetwal did have an option to change the WAL block size, it removes all WAL from pg_wal when it runs, so you wouldn't normally end up trying to replay WAL from before the operation, because it would have been removed. You might still have those files around in an archive or something, but the primary doesn't replay from the archive. You might have standbys, but I would assume they would have to be rebuilt after changing the WAL block size on the master, unless you were trying to follow some probably-too-clever procedure to avoid a standby rebuild. So I'm really kind of lost as to what the scenario is that Matthias has in mind.

But Andres's response doesn't make any sense to me either. What in the world does "if they don't change the xlog block size, we'll just accept random WAL as well" mean? Neither having nor not having a check that the block size hasn't changed causes us to "just accept random WAL". To "accept random WAL," we'd have to remove all of the sanity checks, which nobody is proposing and nobody would accept.

But if we do want to keep those cross-checks, why not take what Thomas proposed a little further and move all of xlp_sysid, xlp_seg_size, and xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO? Then long and short page headers would become identical. We'd lose the ability to recheck those values for every new segment, but it seems quite unlikely that any of these values would change in the middle of replay. If they did, would xl_prev and xl_crc be sufficient to catch that? I think Andres says in a later email that they would be, and I think I'm inclined to agree. False xl_prev matches don't seem especially unlikely, but xl_crc seems like it should be effective.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Oct 12, 2023 at 9:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But if we do want to keep those cross-checks, why not take what Thomas
> proposed a little further and move all of xlp_sysid, xlp_seg_size, and
> xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO? Then long and short page
> headers would become identical.

FTR that's exactly what I was trying to say.

> We'd lose the ability to recheck those
> values for every new segment, but it seems quite unlikely that any of
> these values would change in the middle of replay. If they did, would
> xl_prev and xl_crc be sufficient to catch that? I think Andres says in
> a later email that they would be, and I think I'm inclined to agree.
> False xl_prev matches don't seem especially unlikely, but xl_crc seems
> like it should be effective.

Right, it is strong enough, and covers the common case where a record crosses the segment boundary. That leaves only the segments where a record starts exactly on the first usable byte of a segment, which is why I was trying to think of a way to cover that case too. I suggested we could notice and insert a new record at that place, but Andres suggests it would be too expensive and not worth worrying about.
On Thu, Oct 12, 2023 at 9:27 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Oct 12, 2023 at 9:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > But if we do want to keep those cross-checks, why not take what Thomas
> > proposed a little further and move all of xlp_sysid, xlp_seg_size, and
> > xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO? Then long and short page
> > headers would become identical.
>
> FTR that's exactly what I was trying to say.

And to be extra double explicit, the point of that is to kill the 'div' instruction that Andres was complaining about, because now the division depends only on compile-time constants, so it can be done with multiplication and bit-swizzling tricks. For example, when X is a variable I get:

  *a = n / X;
    0x0000000000000003 <+3>:     mov    %rdi,%rax
    0x0000000000000006 <+6>:     xor    %edx,%edx
    0x0000000000000008 <+8>:     divq   0x0(%rip)        # 0xf <f+15>
    0x0000000000000011 <+17>:    mov    %rax,(%rsi)
  *b = n % X;
    0x000000000000000f <+15>:    xor    %edx,%edx
    0x0000000000000014 <+20>:    mov    %rdi,%rax
    0x0000000000000017 <+23>:    divq   0x0(%rip)        # 0x1e <f+30>
    0x000000000000001e <+30>:    mov    %rdx,(%rcx)

... but when it's the constant 8192 - 24 I get:

  *a = n / X;
    0x0000000000000000 <+0>:     movabs $0x2018120d8a279db7,%rax
    0x000000000000000d <+13>:    mov    %rdi,%rdx
    0x0000000000000010 <+16>:    shr    $0x3,%rdx
    0x0000000000000014 <+20>:    mul    %rdx
    0x0000000000000017 <+23>:    shr    $0x7,%rdx
    0x000000000000001b <+27>:    mov    %rdx,(%rsi)
  *b = n % X;
    0x000000000000001e <+30>:    imul   $0x1fe8,%rdx,%rdx
    0x0000000000000025 <+37>:    sub    %rdx,%rdi
    0x0000000000000028 <+40>:    mov    %rdi,(%rcx)
Hi,

On 2023-10-11 16:05:02 -0400, Robert Haas wrote:
> On Tue, Oct 10, 2023 at 7:29 PM Andres Freund <andres@anarazel.de> wrote:
> > > Hmm. I don't think we should remove those checks, as I can see people
> > > that would want to change their XLog block size with e.g.
> > > pg_reset_wal.
> >
> > I don't think that's something we need to address in every physical
> > segment. For one, there's no option to do so. But more importantly, if they
> > don't change the xlog block size, we'll just accept random WAL as well. If
> > somebody goes to the trouble of writing a custom tool, they can live with the
> > consequences of that potentially causing breakage. Particularly if the checks
> > wouldn't meaningfully prevent that anyway.
>
> I'm extremely confused about what both of you are saying.
>
> Matthias is referring to pg_reset_wal, which I assume means
> pg_resetwal. But it has no option to change the WAL block size. It
> does have an option to change the WAL segment size, but that's not the
> same thing. And even if pg_resetwal did have an option to change the
> WAL segment size, it removes all WAL from pg_wal when it runs, so you
> wouldn't normally end up trying to replay WAL from before the
> operation because it would have been removed. You might still have
> those files around in an archive or something, but the primary doesn't
> replay from the archive. You might have standbys, but I would assume
> they would have to be rebuilt after changing the WAL block size on the
> master, unless you were trying to follow some probably-too-clever
> procedure to avoid a standby rebuild. So I'm really kind of lost as to
> what the scenario is that Matthias has in mind.

I think the question is what the point of the cross-checks in long page headers is. It's pretty easy to see the point of the xlp_sysid check: make it less likely to accidentally replay WAL from a different system. It's much less clear what the point of xlp_seg_size and xlp_xlog_blcksz is - after all, they are also in ControlFileData, and the xlp_sysid check ties the control file and WAL file together.

> But Andres's response doesn't make any sense to me either. What in the world
> does "if they don't change the xlog block size, we'll just accept random WAL
> as well" mean? Neither having or not having a check that the block size
> hasn't change causes us to "just accept random WAL". To "accept random WAL,"
> we'd have to remove all of the sanity checks, which nobody is proposing and
> nobody would accept.

Let me rephrase my point: If somebody uses a modified pg_resetwal to change the xlog block size, then tries to replay WAL from before that change, and is unlucky enough that the LSN looked for in a segment is the start of a valid record both before and after the pg_resetwal invocation, then yes, we might not catch that anymore if we remove the block size check. But the much, much more common case is that the block size was *not* changed, in which case we *already* don't catch that pg_resetwal was invoked. ISTM that the xlp_seg_size and xlp_xlog_blcksz checks in long page headers are a belt-and-suspenders check that is very unlikely to ever catch a mistake that wouldn't otherwise be caught.

> But if we do want to keep those cross-checks, why not take what Thomas
> proposed a little further and move all of xlp_sysid, xlp_seg_size, and
> xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO?

I think that's what Thomas was proposing. Thinking about it a bit more, I'm not sure that having the data both in the checkpoint record itself and in XLOG_CHECKPOINT_REDO buys much. But it's also pretty much free, so ...

> Then long and short page headers would become identical.

Which would make the code more efficient...

> We'd lose the ability to recheck those values for every new segment, but it
> seems quite unlikely that any of these values would change in the middle of
> replay.

I guess the most likely scenario would be a replica that has some local WAL files (e.g. due to pg_basebackup -X ...), but accidentally configures a restore_command pointing to the wrong archive. In that case recovery could start up successfully, as the checkpoint/redo records have sensible contents, but we then wouldn't necessarily notice that the subsequent files aren't from the correct system. Of course you'd need to be quite unlucky and have LSNs that match between the systems. The most likely path I can think of is an idle system with archive_timeout.

As outlined above, I don't think xlp_seg_size and xlp_xlog_blcksz buy us anything, but the protection from xlp_sysid is a bit more meaningful. So a compromise position could be to include xlp_sysid in the page header, possibly in a "chopped up" manner, as Matthias suggested.

> If they did, would xl_prev and xl_crc be sufficient to catch that? I think
> Andres says in a later email that they would be, and I think I'm inclined to
> agree. False xl_prev matches don't seem especially unlikely, but xl_crc
> seems like it should be effective.

I think it'd be an entirely tolerable risk. Even if a WAL file were falsely replayed, we'd still notice the problem soon after. I think once such a mistake was made, it'd be inadvisable to continue using the cluster anyway.

Greetings,

Andres Freund
Hi,

On 2023-10-11 16:09:21 +0200, Matthias van de Meent wrote:
> On Wed, 11 Oct 2023 at 01:29, Andres Freund <andres@anarazel.de> wrote:
> > > After that we'll only have the system ID left from the extended
> > > header, which we could store across 2 pages in the (current) alignment
> > > losses of xlp_rem_len - even pages the upper half, uneven pages the
> > > lower half of the ID. This should allow for enough integrity checks
> > > without further increasing the size of XLogPageHeader in most
> > > installations.
> >
> > I doubt that that's a good idea - what if there's just a single page in
> > a segment? And there aren't earlier segments? That's not a rare case,
> > IME.
>
> Then we'd still have 50% of a system ID which we can check against for
> any corruption. I agree that it increases the chance of conflicts, but
> it's still strictly better than nothing at all.

A fair point - I somehow disregarded that all bits in the system id are
equally meaningful.

Greetings,

Andres Freund
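[For readers following along: a minimal sketch of the split-sysid scheme
Matthias describes - half of the 64-bit system identifier per page,
alternating by page number. The function names are hypothetical, not actual
PostgreSQL code.]

```c
#include <stdint.h>

/*
 * Hypothetical illustration only: store 32 bits of the 64-bit system
 * identifier per WAL page, even pages carrying the upper half and odd
 * pages the lower half, so the short page header need not grow.
 */
static uint32_t
sysid_fragment(uint64_t sysid, uint64_t pageno)
{
	if (pageno % 2 == 0)
		return (uint32_t) (sysid >> 32);	/* even page: upper half */
	else
		return (uint32_t) (sysid & 0xFFFFFFFFu);	/* odd page: lower half */
}

/* On replay, validate whichever half this page is expected to carry. */
static int
sysid_fragment_ok(uint32_t stored, uint64_t sysid, uint64_t pageno)
{
	return stored == sysid_fragment(sysid, pageno);
}
```

Even a lone first page of a segment would then still carry 32 checkable
bits of the sysid, which is the "strictly better than nothing" point above.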
On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> That leaves only the segments where a record starts exactly on the
> first usable byte of a segment, which is why I was trying to think of
> a way to cover that case too. I suggested we could notice and insert
> a new record at that place. But Andres suggests it would be too
> expensive and not worth worrying about.

Hmm. Even in that case, xl_prev has to match. It's not like it's the wild
west. Sure, it's not nearly as good of a cross-check, but it's something.
It seems to me that it's not worth worrying very much about xlp_seg_size or
xlp_blcksz changing undetected in that scenario - if you're doing that kind
of advanced magic, you need to be careful enough to not mess it up, and if
we still cross-check once per checkpoint cycle, that's pretty good. I do
worry a bit about the sysid changing under us, though. It's not that hard
to get your WAL archives mixed up, and it'd be nice to catch that right
away.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Oct 11, 2023 at 6:11 PM Andres Freund <andres@anarazel.de> wrote:
> I think the question is what the point of the crosschecks in long page
> headers is. It's pretty easy to see what the point of the xlp_sysid check
> is - make it less likely to accidentally replay WAL from a different
> system. It's much less clear what the point of xlp_seg_size and
> xlp_xlog_blcksz is - after all, they are also in ControlFileData and the
> xlp_sysid check tied the control file and WAL file together.

Yeah, fair.

> Let me rephrase my point:
>
> If somebody uses a modified pg_resetwal to change the xlog block size,
> then tries to replay WAL from before that change, and is unlucky enough
> that the LSN looked for in a segment is the start of a valid record both
> before/after the pg_resetwal invocation, then yes, we might not catch
> that anymore if we remove the block size check. But the much, much more
> common case is that the block size was *not* changed, in which case we
> *already* don't catch that pg_resetwal was invoked.

Hmm. Should we invent a mechanism just for that?

> ISTM that the xlp_seg_size and xlp_xlog_blcksz checks in long page
> headers are a belt-and-suspenders check that is very unlikely to ever
> catch a mistake that wouldn't otherwise be caught.

I think that's probably right.

> I think that's what Thomas was proposing. Thinking about it a bit more,
> I'm not sure that having the data both in the checkpoint record itself
> and in XLOG_CHECKPOINT_REDO buys much. But it's also pretty much free,
> so ...

Yes. To me, having it in the redo record seems considerably more valuable,
because that's where we're going to begin replay, so we should catch most
problems straight off. To escape detection at that point, you need to not
just be pointed at the wrong WAL archive, but actually have files of
diverse origin mixed together in the same WAL archive. That's a
less-likely error, and we still have some ways of catching it if it
happens.

> Which would make the code more efficient...

Right.

> As outlined above, I don't think xlp_seg_size and xlp_xlog_blcksz buy us
> anything, but the protection by xlp_sysid is a bit more meaningful. So a
> compromise position could be to include xlp_sysid in the page header,
> possibly in a "chopped up" manner, as Matthias suggested.

I'm not that keen on the idea of storing the upper half and lower half in
alternate pages. That seems to me to add code complexity and cognitive
burden with little increased likelihood of catching real problems. I'm not
completely opposed to the idea if somebody wants to make it happen, but I
bet it would be better to either store the whole thing or just cut it in
half and store, say, the low-order bits.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, 12 Oct 2023 at 16:36, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > That leaves only the segments where a record starts exactly on the
> > first usable byte of a segment, which is why I was trying to think of
> > a way to cover that case too. I suggested we could notice and insert
> > a new record at that place. But Andres suggests it would be too
> > expensive and not worth worrying about.
>
> Hmm. Even in that case, xl_prev has to match. It's not like it's the
> wild west. Sure, it's not nearly as good of a cross-check, but it's
> something. It seems to me that it's not worth worrying very much about
> xlp_seg_size or xlp_blcksz changing undetected in that scenario - if
> you're doing that kind of advanced magic, you need to be careful
> enough to not mess it up, and if we still cross-check once per
> checkpoint cycle that's pretty good. I do worry a bit about the sysid
> changing under us, though. It's not that hard to get your WAL archives
> mixed up, and it'd be nice to catch that right away.
This reminds me that xlp_tli is not being used to its full potential right
now either. We only check that it's not going backwards, but there is at
least one not-very-hard-to-hit way to get postgres to silently replay on
the wrong timeline. [1]

[1] https://www.postgresql.org/message-id/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw@mail.gmail.com

-- 
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
On Thu, Oct 12, 2023 at 9:57 AM Ants Aasma <ants@cybertec.at> wrote:
> This reminds me that xlp_tli is not being used to its full potential
> right now either. We only check that it's not going backwards, but there
> is at least one not-very-hard-to-hit way to get postgres to silently
> replay on the wrong timeline. [1]
>
> [1] https://www.postgresql.org/message-id/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw@mail.gmail.com

Maybe I'm missing something, but that seems mostly unrelated. What you're
discussing there is the server's ability to figure out when it ought to
perform a timeline switch. In other words, the server settles on the wrong
TLI and therefore opens and reads from the wrong filename. But here, we're
talking about the case where the server is correct about the TLI and LSN
and hence opens exactly the right file on disk, but the contents of the
file on disk aren't what they're supposed to be due to a procedural error.
Said differently, I don't see how anything we could do with xlp_tli would
actually fix the problem discussed in that thread. That can detect a
situation where the TLI of the file doesn't match the TLI of the pages
inside the file, but it doesn't help with the case where the server
decided to read the wrong file in the first place.

But this does make me wonder whether storing xlp_tli and xlp_pageaddr in
every page is really worth the bit-space. That takes 12 bytes, plus any
padding it forces us to incur, but the actual entropy content of those 12
bytes must be quite low. In normal cases, probably 7 or so of those bytes
are going to consist entirely of zero bits (TLI < 256, LSN % 8k == 0,
LSN < 2^40). We could probably find a way of jumbling the LSN, TLI, and
maybe some other stuff into an 8-byte quantity, or even perhaps a 4-byte
quantity, that would do about as good a job catching problems as what we
have now (e.g. LSN_HIGH32 ^ LSN_LOW32 ^ BITREVERSE(TLI)).
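[For readers following along: a minimal sketch of the folding Robert
suggests above. None of these names (bitreverse32, page_ident) exist in
PostgreSQL; this is purely illustrative.]

```c
#include <stdint.h>

/* Plain 32-bit bit reversal, standing in for BITREVERSE. */
static uint32_t
bitreverse32(uint32_t v)
{
	v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);
	v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);
	v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);
	v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);
	return (v >> 16) | (v << 16);
}

/*
 * Fold the 64-bit page address and the timeline ID into one 32-bit
 * identity value. Reversing the TLI pushes its (usually tiny) value into
 * the high bits, where the page-aligned LSN has the least entropy, so the
 * three inputs don't trample the same bit positions.
 */
static uint32_t
page_ident(uint64_t lsn, uint32_t tli)
{
	uint32_t	high = (uint32_t) (lsn >> 32);
	uint32_t	low = (uint32_t) lsn;

	return high ^ low ^ bitreverse32(tli);
}
```

A mismatch in any of the inputs flips bits in the stored value, which is
all the check needs; recovering the original LSN/TLI from it is not a goal.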
In the event of a mismatch, the value actually stored in the page header
would be harder for humans to understand, but I'm not sure that really
matters here. Users should mostly be concerned with whether a WAL file
matches the cluster where they're trying to replay it; forensics on
misplaced or corrupted WAL files should be comparatively rare.

-- 
Robert Haas
EDB: http://www.enterprisedb.com