Lowering the default wal_blocksize to 4K

From: Andres Freund

Hi,

I've mentioned this to a few people before, but forgot to start an actual
thread. So here we go:

I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from
the current 8192.  The reason is that

a) We don't gain much from a blocksize above 4096, as we already write all
   the pending WAL data in one go (except when at the tail of wal_buffers).
   We *do* incur more overhead for page headers, but compared to the actual
   WAL data it is not a lot (~0.29% of space is page headers with 8192 vs
   0.59% with 4096; see the arithmetic sketch below this list).

b) Writing 8KB when we have to flush a partially filled buffer can
   substantially increase write amplification. In a transactional workload,
   this will often double the write volume.
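
A quick sketch of where those header-overhead percentages come from,
assuming the usual 24-byte short page header (the longer header on a
segment's first page is ignored here):

#include <stdio.h>

/*
 * Back-of-the-envelope only: 24 bytes is sizeof(XLogPageHeaderData) on
 * typical 64-bit platforms.
 */
int
main(void)
{
    const double hdr = 24.0;

    printf("XLOG_BLCKSZ=8192: %.2f%% page header overhead\n",
           100.0 * hdr / 8192.0);   /* -> 0.29% */
    printf("XLOG_BLCKSZ=4096: %.2f%% page header overhead\n",
           100.0 * hdr / 4096.0);   /* -> 0.59% */
    return 0;
}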

Currently disks mostly have 4096 bytes as their "sector size". Sometimes
that's exposed directly; sometimes they can also write in 512-byte units, but
that internally requires a read-modify-write operation.


For some example numbers, I ran a very simple insert workload with a varying
number of clients against both a wal_blocksize=4096 and a wal_blocksize=8192
cluster, and measured the number of bytes written before/after.  The table was
recreated before each run, followed by a checkpoint and the benchmark. Here I
ran the inserts only for 15s each, because the results don't change
meaningfully with longer runs.


With XLOG_BLCKSZ=8192

clients         tps    disk bytes written
1         667         81296
2         739         89796
4        1446         89208
8        2858         90858
16        5775         96928
32       11920        115351
64       23686        135244
128       46001        173390
256       88833        239720
512      146208        335669


With XLOG_BLCKSZ=4096

clients         tps    disk bytes written
1         751         46838
2         773         47936
4        1512         48317
8        3143         52584
16        6221         59097
32       12863         73776
64       25652         98792
128       48274        133330
256       88969        200720
512      146298        298523


This is on a not-that-fast NVMe SSD (Samsung SSD 970 PRO 1TB).


It's IMO quite interesting that even at the higher client counts, the number
of bytes written doesn't reach parity.


On a stripe of two very fast SSDs:

With XLOG_BLCKSZ=8192

clients         tps    disk bytes written
1       23786        2893392
2       38515        4683336
4       63436        4688052
8      106618        4618760
16      177905        4384360
32      254890        3890664
64      297113        3031568
128      299878        2297808
256      308774        1935064
512      292515        1630408


With XLOG_BLCKSZ=4096

clients         tps    disk bytes written
1       25742        1586748
2       43578        2686708
4       62734        2613856
8      116217        2809560
16      200802        2947580
32      269268        2461364
64      323195        2042196
128      317160        1550364
256      309601        1285744
512      292063        1103816

It's fun to see how the total number of writes *decreases* at higher
concurrency, because it becomes more likely that pages are filled completely.


One thing I noticed is that our auto-configuration of wal_buffers leads to
different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem
great.
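
For reference, the auto-tuning logic works in XLOG_BLCKSZ-sized pages rather
than in bytes, which is where the divergence comes from. Paraphrased from
XLOGChooseNumBuffers() in xlog.c:

/*
 * Paraphrase of XLOGChooseNumBuffers(): the result is a *page* count, so
 * the same count means half as many bytes with 4K pages, and the floor
 * of 8 pages shrinks from 64kB to 32kB.
 */
static int
XLOGChooseNumBuffers(void)
{
    int         xbuffers;

    xbuffers = NBuffers / 32;
    if (xbuffers > (wal_segment_size / XLOG_BLCKSZ))
        xbuffers = (wal_segment_size / XLOG_BLCKSZ);
    if (xbuffers < 8)
        xbuffers = 8;
    return xbuffers;
}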


Performing the same COPY workload (1024 files, split across N clients) for
both settings shows no performance difference, but a very slight increase in
total bytes written (about 0.25%, which is roughly what I'd expect).


Personally I'd say the slight increase in WAL volume is more than outweighed
by the increase in throughput and decrease in bytes written.


There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

Greetings,

Andres Freund



Re: Lowering the default wal_blocksize to 4K

From: Bruce Momjian

On Mon, Oct  9, 2023 at 04:08:05PM -0700, Andres Freund wrote:
> There's an alternative approach we could take, which is to write in 4KB
> increments, while keeping 8KB pages. With the current format that's not
> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> pages, it seems we should just go for 4KB?

How do we handle shorter maximum row lengths and shorter maximum index
entry lengths?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Lowering the default wal_blocksize to 4K

From: Andres Freund

Hi,

On 2023-10-09 19:26:54 -0400, Bruce Momjian wrote:
> On Mon, Oct  9, 2023 at 04:08:05PM -0700, Andres Freund wrote:
> > There's an alternative approach we could take, which is to write in 4KB
> > increments, while keeping 8KB pages. With the current format that's not
> > obviously a bad idea. But given there aren't really advantages in 8KB WAL
> > pages, it seems we should just go for 4KB?
> 
> How do we handle shorter maximum row lengths and shorter maximum index
> entry lengths?

The WAL blocksize shouldn't influence either, unless we have a bug somewhere.

Greetings,

Andres Freund



Re: Lowering the default wal_blocksize to 4K

From: Bruce Momjian

On Mon, Oct  9, 2023 at 04:36:20PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2023-10-09 19:26:54 -0400, Bruce Momjian wrote:
> > On Mon, Oct  9, 2023 at 04:08:05PM -0700, Andres Freund wrote:
> > > There's an alternative approach we could take, which is to write in 4KB
> > > increments, while keeping 8KB pages. With the current format that's not
> > > obviously a bad idea. But given there aren't really advantages in 8KB WAL
> > > pages, it seems we should just go for 4KB?
> > 
> > How do we handle shorter maximum row lengths and shorter maximum index
> > entry lengths?
> 
> The WAL blocksize shouldn't influence either, unless we have a bug somewhere.

Oh, good point.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Lowering the default wal_blocksize to 4K

From: Tom Lane

Andres Freund <andres@anarazel.de> writes:
> There's an alternative approach we could take, which is to write in 4KB
> increments, while keeping 8KB pages. With the current format that's not
> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> pages, it seems we should just go for 4KB?

Seems like that's doubling the overhead of WAL page headers.  Do we need
to try to skinny those down?

            regards, tom lane



Re: Lowering the default wal_blocksize to 4K

From: Andres Freund

Hi,

On 2023-10-09 23:16:30 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > There's an alternative approach we could take, which is to write in 4KB
> > increments, while keeping 8KB pages. With the current format that's not
> > obviously a bad idea. But given there aren't really advantages in 8KB WAL
> > pages, it seems we should just go for 4KB?
> 
> Seems like that's doubling the overhead of WAL page headers.  Do we need
> to try to skinny those down?

I think the overhead is small, and we are wasting so much space in other
places that I am not worried about the proportional increase in page header
space usage at this point, particularly compared to the savings in overall write
rate and increase in TPS. There are other areas we can save much more space, if
we want to focus on that.

I was thinking we should perhaps do the opposite, namely getting rid of short
page headers. The overhead in the "byte position" <-> LSN conversion due to
the differing space is worse than the gain. Or do something in between - having
the system ID in the header adds a useful crosscheck, but I'm far less
convinced that having segment and block size in there, as 32bit numbers no
less, is worthwhile. After all, if the system id matches, it's not likely that
the xlog block or segment size differ.
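
For anyone who hasn't stared at that code: here's a condensed version of
XLogBytePosToRecPtr() from xlog.c. The branch for a segment's first page,
with its larger header, is what uniform page headers would eliminate,
leaving a single div/mod by the per-page usable size:

/*
 * Condensed from XLogBytePosToRecPtr() in xlog.c.  "Usable" bytes
 * exclude page headers; a segment's first page has a longer header
 * than the rest, hence the extra branch.
 */
static XLogRecPtr
XLogBytePosToRecPtr(uint64 bytepos)
{
    uint64      fullsegs = bytepos / UsableBytesInSegment;
    uint64      bytesleft = bytepos % UsableBytesInSegment;
    uint32      seg_offset;
    XLogRecPtr  result;

    if (bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD)
    {
        /* fits on first page of segment */
        seg_offset = bytesleft + SizeOfXLogLongPHD;
    }
    else
    {
        /* account for the segment's first page with its long header */
        seg_offset = XLOG_BLCKSZ;
        bytesleft -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
        seg_offset += (bytesleft / UsableBytesInPage) * XLOG_BLCKSZ +
            bytesleft % UsableBytesInPage + SizeOfXLogShortPHD;
    }

    XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, wal_segment_size, result);
    return result;
}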

Greetings,

Andres Freund



Re: Lowering the default wal_blocksize to 4K

From: Matthias van de Meent

On Tue, 10 Oct 2023 at 01:08, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I've mentioned this to a few people before, but forgot to start an actual
> thread. So here we go:
>
> I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from
> the current 8192.

Seems like a good idea.

> It's IMO quite interesting that even at the higher client counts, the number
> of bytes written don't reach parity.
>
> It's fun to see how the total number of writes *decreases* at higher
> concurrency, because it becomes more likely that pages are filled completely.

With higher client counts and short transactions I think it is not too
unexpected to see commit_delay+commit_siblings configured. Did you
measure the impact of this change on such configurations?

> One thing I noticed is that our auto-configuration of wal_buffers leads to
> different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem
> great.

Hmm.

> Performing the same COPY workload (1024 files, split across N clients) for
> both settings shows no performance difference, but a very slight increase in
> total bytes written (about 0.25%, which is roughly what I'd expect).
>
> Personally I'd say the slight increase in WAL volume is more than outweighed
> by the increase in throughput and decrease in bytes written.

Agreed.

> There's an alternative approach we could take, which is to write in 4KB
> increments, while keeping 8KB pages. With the current format that's not
> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> pages, it seems we should just go for 4KB?

It is not just the disk overhead of blocks, but we also maintain some
other data (currently in the form of XLogRecPtrs) in memory for each
WAL buffer, the overhead of which will also increase when we increase
the number of XLog pages per MB of WAL that we cache.
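
For a sense of scale (illustrative numbers only, assuming the xlblocks
array of one 8-byte XLogRecPtr per buffered page):

#include <stdio.h>

/*
 * Illustration: halving the page size doubles the per-buffer
 * bookkeeping for a given wal_buffers size, e.g. 16kB -> 32kB
 * for wal_buffers = 16MB.
 */
int
main(void)
{
    const long wal_buffers = 16L * 1024 * 1024;     /* example: 16MB */

    for (int blcksz = 8192; blcksz >= 4096; blcksz /= 2)
        printf("XLOG_BLCKSZ=%d: %ld entries, %ld bytes\n",
               blcksz, wal_buffers / blcksz, 8 * (wal_buffers / blcksz));
    return 0;
}
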
Additionally, highly concurrent workloads with transactions that write
a high multiple of XLOG_BLCKSZ bytes to WAL may start to see increased
overhead due to the 0.25% additional WAL getting written and a doubling
of the number of XLog pages being touched (both initialization and the
smaller memcpy for records that would now cross an extra page
boundary).

However, for all of these issues I doubt that they actually matter
much in the grand scheme of things, so I definitely wouldn't mind
moving to 4KiB XLog pages.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Lowering the default wal_blocksize to 4K

From: Matthias van de Meent

On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-10-09 23:16:30 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> There's an alternative approach we could take, which is to write in 4KB
>>> increments, while keeping 8KB pages. With the current format that's not
>>> obviously a bad idea. But given there aren't really advantages in 8KB WAL
>>> pages, it seems we should just go for 4KB?
>>
>> Seems like that's doubling the overhead of WAL page headers.  Do we need
>> to try to skinny those down?
>
> I think the overhead is small, and we are wasting so much space in other
> places that I am not worried about the proportional increase in page header
> space usage at this point, particularly compared to the savings in overall write
> rate and increase in TPS. There are other areas we can save much more space, if
> we want to focus on that.
>
> I was thinking we should perhaps do the opposite, namely getting rid of short
> page headers. The overhead in the "byte position" <-> LSN conversion due to
> the differing space is worse than the gain. Or do something in between - having
> the system ID in the header adds a useful crosscheck, but I'm far less
> convinced that having segment and block size in there, as 32bit numbers no
> less, is worthwhile. After all, if the system id matches, it's not likely that
> the xlog block or segment size differ.

Hmm. I don't think we should remove those checks, as I can see people
that would want to change their XLog block size with e.g.
pg_reset_wal.
But I think we can relatively easily move segsize/blocksize checks to
a different place in the normal page header, which would reduce the
number of bytes we'd have to store elsewhere.

We could move segsize/blocksize into the xlp_info flags: 12 of the 16
bits are currently unused. We could use 4 of these bits for segsize
(indicating 2^N MB; currently accepted values are N=0..10, for 1 MB ...
1024 MB) and 4 (or 3) for blcksz (as we currently support 1..64 kB
blocks, i.e. 2^{0..6} kB). This would remove the need for 2 of the 3
fields in the large xlog block header.
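
A rough sketch of such an encoding (all names hypothetical; none of these
macros exist today):

/*
 * Hypothetical: stash log2(segment size in MB) and log2(block size in
 * kB) in 8 of the 12 currently-unused xlp_info bits.
 */
#define XLP_SEGSIZE_SHIFT   8     /* bits 8..11:  segment size = 2^N MB */
#define XLP_BLCKSZ_SHIFT    12    /* bits 12..15: block size   = 2^N kB */

#define XLPInfoEncodeSizes(seg_mb_log2, blk_kb_log2) \
    (((uint16) (seg_mb_log2) << XLP_SEGSIZE_SHIFT) | \
     ((uint16) (blk_kb_log2) << XLP_BLCKSZ_SHIFT))

#define XLPInfoGetSegSize(xlp_info) \
    ((1024L * 1024) << (((xlp_info) >> XLP_SEGSIZE_SHIFT) & 0x0F))
#define XLPInfoGetBlcksz(xlp_info) \
    (1024 << (((xlp_info) >> XLP_BLCKSZ_SHIFT) & 0x0F))

The existing per-page validation could then check these 8 bits on every
page, instead of the two 32-bit fields that only exist in long headers.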

After that we'll only have the system ID left from the extended
header, which we could store across 2 pages in the (current) alignment
losses of xlp_rem_len - even pages the upper half, odd pages the
lower half of the ID. This should allow for enough integrity checks
without further increasing the size of XLogPageHeader in most
installations.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Lowering the default wal_blocksize to 4K

From: Andres Freund

Hi,

On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:
> > On 2023-10-09 23:16:30 -0400, Tom Lane wrote:
> >> Andres Freund <andres@anarazel.de> writes:
> >>> There's an alternative approach we could take, which is to write in 4KB
> >>> increments, while keeping 8KB pages. With the current format that's not
> >>> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> >>> pages, it seems we should just go for 4KB?
> >>
> >> Seems like that's doubling the overhead of WAL page headers.  Do we need
> >> to try to skinny those down?
> >
> > I think the overhead is small, and we are wasting so much space in other
> > places that I am not worried about the proportional increase in page header
> > space usage at this point, particularly compared to the savings in overall write
> > rate and increase in TPS. There are other areas we can save much more space, if
> > we want to focus on that.
> >
> > I was thinking we should perhaps do the opposite, namely getting rid of short
> > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > the differing space is worse than the gain. Or do something in between - having
> > the system ID in the header adds a useful crosscheck, but I'm far less
> > convinced that having segment and block size in there, as 32bit numbers no
> > less, is worthwhile. After all, if the system id matches, it's not likely that
> > the xlog block or segment size differ.
> 
> Hmm. I don't think we should remove those checks, as I can see people
> that would want to change their XLog block size with e.g.
> pg_reset_wal.

I don't think that's something we need to address in every physical
segment. For one, there's no option to do so. But more importantly, if they
don't change the xlog block size, we'll just accept random WAL as well. If
somebody goes to the trouble of writing a custom tool, they can live with the
consequences of that potentially causing breakage. Particularly if the checks
wouldn't meaningfully prevent that anyway.


> After that we'll only have the system ID left from the extended
> header, which we could store across 2 pages in the (current) alignment
> losses of xlp_rem_len - even pages the upper half, odd pages the
> lower half of the ID. This should allow for enough integrity checks
> without further increasing the size of XLogPageHeader in most
> installations.

I doubt that that's a good idea - what if there's just a single page in a
segment? And there aren't earlier segments? That's not a rare case, IME.

Greetings,

Andres Freund



Re: Lowering the default wal_blocksize to 4K

From: Thomas Munro

On Wed, Oct 11, 2023 at 12:29 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> > On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:
> > > I was thinking we should perhaps do the opposite, namely getting rid of short
> > > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > > the differing space is worse than the gain. Or do something in between - having
> > > the system ID in the header adds a useful crosscheck, but I'm far less
> > > convinced that having segment and block size in there, as 32bit numbers no
> > > less, is worthwhile. After all, if the system id matches, it's not likely that
> > > the xlog block or segment size differ.
> >
> > Hmm. I don't think we should remove those checks, as I can see people
> > that would want to change their XLog block size with e.g.
> > pg_reset_wal.
>
> I don't think that's something we need to address in every physical
> segment. For one, there's no option to do so. But more importantly, if they
> don't change the xlog block size, we'll just accept random WAL as well. If
> somebody goes to the trouble of writing a custom tool, they can live with the
> consequences of that potentially causing breakage. Particularly if the checks
> wouldn't meaningfully prevent that anyway.

How about this idea: Put the system ID etc into the new record Robert
is proposing for the redo point, and also into the checkpoint record,
so that it's at both ends of the to-be-replayed range.  That just
leaves the WAL segments in between.  If you find yourself writing a
new record that would go in the first usable byte of a segment, insert
a new special system ID (etc) record that will be checked during
replay.  For segments that start with XLP_FIRST_IS_CONTRECORD, don't
worry about it: those already form part of a chain of verification
(xlp_rem_len, xl_crc) that started on the preceding page, so it seems
almost impossible to accidentally replay from a segment that came from
another system.



Re: Lowering the default wal_blocksize to 4K

From: Andres Freund

Hi,

On 2023-10-11 14:39:12 +1300, Thomas Munro wrote:
> On Wed, Oct 11, 2023 at 12:29 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> > > On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:
> > > > I was thinking we should perhaps do the opposite, namely getting rid of short
> > > > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > > > the differing space is worse than the gain. Or do something in between - having
> > > > the system ID in the header adds a useful crosscheck, but I'm far less
> > > > convinced that having segment and block size in there, as 32bit numbers no
> > > > less, is worthwhile. After all, if the system id matches, it's not likely that
> > > > the xlog block or segment size differ.
> > >
> > > Hmm. I don't think we should remove those checks, as I can see people
> > > that would want to change their XLog block size with e.g.
> > > pg_reset_wal.
> >
> > I don't think that's something we need to address in every physical
> > segment. For one, there's no option to do so. But more importantly, if they
> > don't change the xlog block size, we'll just accept random WAL as well. If
> > somebody goes to the trouble of writing a custom tool, they can live with the
> > consequences of that potentially causing breakage. Particularly if the checks
> > wouldn't meaningfully prevent that anyway.
> 
> How about this idea: Put the system ID etc into the new record Robert
> is proposing for the redo point, and also into the checkpoint record,
> so that it's at both ends of the to-be-replayed range.

I think that's a very good idea.


> That just leaves the WAL segments in between.  If you find yourself writing
> a new record that would go in the first usable byte of a segment, insert a
> new special system ID (etc) record that will be checked during replay.

I don't see how we can do that without incurring a lot of overhead though. This
determination would need to happen in ReserveXLogInsertLocation(), while
holding the spinlock. Which is one of the most contended bits of code in
postgres.  The whole reason that we have this "byte pos" to LSN conversion
stuff is to make the spinlock-protected part of ReserveXLogInsertLocation() as
short as possible.
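
To make the constraint concrete, here's the relevant part, condensed from
ReserveXLogInsertLocation() in xlog.c - only plain byte-position arithmetic
happens under the spinlock, and the byte-pos-to-LSN conversions are deferred
until after it is released:

/* condensed; types and globals as in xlog.c */
SpinLockAcquire(&Insert->insertpos_lck);

startbytepos = Insert->CurrBytePos;
endbytepos = startbytepos + size;
prevbytepos = Insert->PrevBytePos;
Insert->CurrBytePos = endbytepos;
Insert->PrevBytePos = startbytepos;

SpinLockRelease(&Insert->insertpos_lck);

/* the (comparatively) expensive conversions happen outside the lock */
*StartPos = XLogBytePosToRecPtr(startbytepos);
*EndPos = XLogBytePosToEndRecPtr(endbytepos);
*PrevPtr = XLogBytePosToRecPtr(prevbytepos);

Anything extra in there - like "does this record land on the first usable
byte of a segment?" - is paid for by every WAL insertion in the system.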


> For segments that start with XLP_FIRST_IS_CONTRECORD, don't worry about it:
> those already form part of a chain of verification (xlp_rem_len, xl_crc)
> that started on the preceding page, so it seems almost impossible to
> accidentally replay from a segment that came from another system.

But I think we might just be ok with logic similar to this, even for the
non-contrecord case. If recovery starts in one segment where we have verified
sysid, xlog block size etc and we encounter a WAL record starting on the first
"content byte" of a segment, we can still verify that the prev LSN is correct
etc.  Sure, if you try hard you could come up with a scenario where you could
mislead such a check, but we don't need to protect against intentional malice
here, just against accidents.

Greetings,

Andres Freund



Re: Lowering the default wal_blocksize to 4K

From: Matthias van de Meent

On Wed, 11 Oct 2023 at 01:29, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> > On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:
> > > On 2023-10-09 23:16:30 -0400, Tom Lane wrote:
> > >> Andres Freund <andres@anarazel.de> writes:
> > >>> There's an alternative approach we could take, which is to write in 4KB
> > >>> increments, while keeping 8KB pages. With the current format that's not
> > >>> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> > >>> pages, it seems we should just go for 4KB?
> > >>
> > >> Seems like that's doubling the overhead of WAL page headers.  Do we need
> > >> to try to skinny those down?
> > >
> > > I think the overhead is small, and we are wasting so much space in other
> > > places that I am not worried about the proportional increase in page header
> > > space usage at this point, particularly compared to the savings in overall write
> > > rate and increase in TPS. There are other areas we can save much more space, if
> > > we want to focus on that.
> > >
> > > I was thinking we should perhaps do the opposite, namely getting rid of short
> > > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > > the differing space is worse than the gain. Or do something in between - having
> > > the system ID in the header adds a useful crosscheck, but I'm far less
> > > convinced that having segment and block size in there, as 32bit numbers no
> > > less, is worthwhile. After all, if the system id matches, it's not likely that
> > > the xlog block or segment size differ.
> >
> > Hmm. I don't think we should remove those checks, as I can see people
> > that would want to change their XLog block size with e.g.
> > pg_reset_wal.
>
> I don't think that's something we need to address in every physical
> segment. For one, there's no option to do so.

Not block size, but xlog segment size is modifiable with pg_resetwal,
and could thus reasonably change across restarts. Apart from more
practical concerns around compile-time options requiring you to swap
out binaries, I don't really see why xlog block size couldn't be
changed with pg_resetwal in a securely shutdown cluster as one does
with the WAL segment size.

> But more importantly, if they
> don't change the xlog block size, we'll just accept random WAL as well. If
> somebody goes to the trouble of writing a custom tool, they can live with the
> consequences of that potentially causing breakage. Particularly if the checks
> wouldn't meaningfully prevent that anyway.

I don't understand what you mean by "we'll just accept random WAL
as well". We do significant validation in XLogReaderValidatePageHeader
to make sure that all pages of WAL are sufficiently formatted so that
they can securely be read by the available infrastructure with the
least chance of misreading data. There is no chance currently that we
read WAL from WAL segments that contain correct data for different
segment or block sizes. That includes WAL from segments created before
a pg_resetwal changed the WAL segment size.

If this "custom tool" refers to the typo-ed name of pg_resetwal, that
is hardly a custom tool, it is shipped with PostgreSQL and you can
find the sources under src/bin/pg_resetwal.

> > After that we'll only have the system ID left from the extended
> > header, which we could store across 2 pages in the (current) alignment
> > losses of xlp_rem_len - even pages the upper half, odd pages the
> > lower half of the ID. This should allow for enough integrity checks
> > without further increasing the size of XLogPageHeader in most
> > installations.
>
> I doubt that that's a good idea - what if there's just a single page in a
> segment? And there aren't earlier segments? That's not a rare case, IME.

Then we'd still have 50% of a system ID which we can check against for
any corruption. I agree that it increases the chance of conflicts, but
it's still strictly better than nothing at all.
An alternative solution would be to write the first two pages of a WAL
segment regardless of contents, so that we essentially never only have
access to the first page during crash recovery. Physical replication's
recovery wouldn't be able to read ahead, but I consider that less
problematic.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Lowering the default wal_blocksize to 4K

From: Robert Haas

On Tue, Oct 10, 2023 at 7:29 PM Andres Freund <andres@anarazel.de> wrote:
> > Hmm. I don't think we should remove those checks, as I can see people
> > that would want to change their XLog block size with e.g.
> > pg_reset_wal.
>
> I don't think that's something we need to address in every physical
> segment. For one, there's no option to do so. But more importantly, if they
> don't change the xlog block size, we'll just accept random WAL as well. If
> somebody goes to the trouble of writing a custom tool, they can live with the
> consequences of that potentially causing breakage. Particularly if the checks
> wouldn't meaningfully prevent that anyway.

I'm extremely confused about what both of you are saying.

Matthias is referring to pg_reset_wal, which I assume means
pg_resetwal. But it has no option to change the WAL block size. It
does have an option to change the WAL segment size, but that's not the
same thing. And even if pg_resetwal did have an option to change the
WAL segment size, it removes all WAL from pg_wal when it runs, so you
wouldn't normally end up trying to replay WAL from before the
operation because it would have been removed. You might still have
those files around in an archive or something, but the primary doesn't
replay from the archive. You might have standbys, but I would assume
they would have to be rebuilt after changing the WAL block size on the
master, unless you were trying to follow some probably-too-clever
procedure to avoid a standby rebuild. So I'm really kind of lost as to
what the scenario is that Matthias has in mind.

But Andres's response doesn't make any sense to me either. What in the
world does "if they don't change the xlog block size, we'll just
accept random WAL as well" mean? Neither having or not having a check
that the block size hasn't change causes us to "just accept random
WAL". To "accept random WAL," we'd have to remove all of the sanity
checks, which nobody is proposing and nobody would accept.

But if we do want to keep those cross-checks, why not take what Thomas
proposed a little further and move all of xlp_sysid, xlp_seg_size, and
xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO? Then long and short page
headers would become identical. We'd lose the ability to recheck those
values for every new segment, but it seems quite unlikely that any of
these values would change in the middle of replay. If they did, would
xl_prev and xl_crc be sufficient to catch that? I think Andres says in
a later email that they would be, and I think I'm inclined to agree.
False xl_prev matches don't seem especially unlikely, but xl_crc seems
like it should be effective.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Lowering the default wal_blocksize to 4K

From: Thomas Munro

On Thu, Oct 12, 2023 at 9:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But if we do want to keep those cross-checks, why not take what Thomas
> proposed a little further and move all of xlp_sysid, xlp_seg_size, and
> xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO? Then long and short page
> headers would become identical.

FTR that's exactly what I was trying to say.

> We'd lose the ability to recheck those
> values for every new segment, but it seems quite unlikely that any of
> these values would change in the middle of replay. If they did, would
> xl_prev and xl_crc be sufficient to catch that? I think Andres says in
> a later email that they would be, and I think I'm inclined to agree.
> False xl_prev matches don't seem especially unlikely, but xl_crc seems
> like it should be effective.

Right, it is strong enough, and covers the common case where a record
crosses the segment boundary.

That leaves only the segments where a record starts exactly on the
first usable byte of a segment, which is why I was trying to think of
a way to cover that case too.  I suggested we could notice and insert
a new record at that place.  But Andres suggests it would be too
expensive and not worth worrying about.



Re: Lowering the default wal_blocksize to 4K

From: Thomas Munro

On Thu, Oct 12, 2023 at 9:27 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Oct 12, 2023 at 9:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > But if we do want to keep those cross-checks, why not take what Thomas
> > proposed a little further and move all of xlp_sysid, xlp_seg_size, and
> > xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO? Then long and short page
> > headers would become identical.
>
> FTR that's exactly what I was trying to say.

And to be extra double explicit, the point of that is to kill the
'div' instruction that Andres was complaining about, because now the
division depends only on compile-time constants, so it can be done with
multiplication and bit-swizzling tricks.  For example, when X is a
variable I get:

        *a = n / X;
   0x0000000000000003 <+3>:    mov    %rdi,%rax
   0x0000000000000006 <+6>:    xor    %edx,%edx
   0x0000000000000008 <+8>:    divq   0x0(%rip)        # 0xf <f+15>
   0x0000000000000011 <+17>:    mov    %rax,(%rsi)

         *b = n % X;
   0x000000000000000f <+15>:    xor    %edx,%edx
   0x0000000000000014 <+20>:    mov    %rdi,%rax
   0x0000000000000017 <+23>:    divq   0x0(%rip)        # 0x1e <f+30>
   0x000000000000001e <+30>:    mov    %rdx,(%rcx)


... but when it's the constant 8192 - 24 I get:

        *a = n / X;
   0x0000000000000000 <+0>:    movabs $0x2018120d8a279db7,%rax
   0x000000000000000d <+13>:    mov    %rdi,%rdx
   0x0000000000000010 <+16>:    shr    $0x3,%rdx
   0x0000000000000014 <+20>:    mul    %rdx
   0x0000000000000017 <+23>:    shr    $0x7,%rdx
   0x000000000000001b <+27>:    mov    %rdx,(%rsi)

         *b = n % X;
   0x000000000000001e <+30>:    imul   $0x1fe8,%rdx,%rdx
   0x0000000000000025 <+37>:    sub    %rdx,%rdi
   0x0000000000000028 <+40>:    mov    %rdi,(%rcx)
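
(Roughly the function compiled above, for anyone who wants to reproduce
it - the file name and build flags are mine, not from the tree:)

#include <stdint.h>

/*
 * Toggle the #if to switch X between a runtime variable (forces divq)
 * and a compile-time constant (lets gcc use multiply-and-shift).
 * Build with: gcc -O2 -c divtest.c && objdump -d divtest.o
 */
#if 0
uint64_t X;                          /* runtime value: emits divq */
#else
#define X ((uint64_t) (8192 - 24))   /* constant: mul + shifts */
#endif

void
f(uint64_t n, uint64_t *a, uint64_t *b)
{
    *a = n / X;
    *b = n % X;
}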



Re: Lowering the default wal_blocksize to 4K

From: Andres Freund

Hi,

On 2023-10-11 16:05:02 -0400, Robert Haas wrote:
> On Tue, Oct 10, 2023 at 7:29 PM Andres Freund <andres@anarazel.de> wrote:
> > > Hmm. I don't think we should remove those checks, as I can see people
> > > that would want to change their XLog block size with e.g.
> > > pg_reset_wal.
> >
> > I don't think that's something we need to address in every physical
> > segment. For one, there's no option to do so. But more importantly, if they
> > don't change the xlog block size, we'll just accept random WAL as well. If
> > somebody goes to the trouble of writing a custom tool, they can live with the
> > consequences of that potentially causing breakage. Particularly if the checks
> > wouldn't meaningfully prevent that anyway.
>
> I'm extremely confused about what both of you are saying.
>
> Matthias is referring to pg_reset_wal, which I assume means
> pg_resetwal. But it has no option to change the WAL block size. It
> does have an option to change the WAL segment size, but that's not the
> same thing. And even if pg_resetwal did have an option to change the
> WAL segment size, it removes all WAL from pg_wal when it runs, so you
> wouldn't normally end up trying to replay WAL from before the
> operation because it would have been removed. You might still have
> those files around in an archive or something, but the primary doesn't
> replay from the archive. You might have standbys, but I would assume
> they would have to be rebuilt after changing the WAL block size on the
> master, unless you were trying to follow some probably-too-clever
> procedure to avoid a standby rebuild. So I'm really kind of lost as to
> what the scenario is that Matthias has in mind.

I think the question is what the point of the crosschecks in long page headers
is. It's pretty easy to see what the point of the xlp_sysid check is - make it
less likely to accidentally replay WAL from a different system.  It's much
less clear what the point of xlp_seg_size and xlp_xlog_blcksz is - after all,
they are also in ControlFileData, and the xlp_sysid check ties the control file
and WAL file together.


> But Andres's response doesn't make any sense to me either. What in the world
> does "if they don't change the xlog block size, we'll just accept random WAL
> as well" mean? Neither having or not having a check that the block size
> hasn't change causes us to "just accept random WAL". To "accept random WAL,"
> we'd have to remove all of the sanity checks, which nobody is proposing and
> nobody would accept.

Let me rephrase my point:

If somebody uses a modified pg_resetwal to change the xlog block size, then
tries to replay WAL from before that change, and is unlucky enough that the
LSN looked for in a segment is the start of a valid record both before/after
the pg_resetwal invocation, then yes, we might not catch that anymore if we
remove the block size check. But the much much more common case is that the
block size was *not* changed, in which case we *already* don't catch that
pg_resetwal was invoked.

ISTM that the xlp_seg_size and xlp_xlog_blcksz checks in long page headers are
a belt and suspenders check that is very unlikely to ever catch a mistake that
wouldn't otherwise be caught.


> But if we do want to keep those cross-checks, why not take what Thomas
> proposed a little further and move all of xlp_sysid, xlp_seg_size, and
> xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO?

I think that's what Thomas was proposing.  Thinking about it a bit more I'm
not sure that having the data both in the checkpoint record itself and in
XLOG_CHECKPOINT_REDO buys much. But it's also pretty much free, so ...


> Then long and short page headers would become identical.

Which would make the code more efficient...


> We'd lose the ability to recheck those values for every new segment, but it
> seems quite unlikely that any of these values would change in the middle of
> replay.

I guess the most likely scenario would be a replica that has some local WAL
files (e.g. due to pg_basebackup -X ...), but accidentally configures a
restore_command pointing to the wrong archive. In that case recovery could
start up successfully as the checkpoint/redo records have sensible contents,
but we then wouldn't necessarily notice that the subsequent files aren't from
the correct system.  Of course you need to be quite unlucky and have LSNs that
match between the systems. The most likely path I can think of is an idle
system with archive_timeout.

As outlined above, I don't think xlp_seg_size, xlp_xlog_blcksz buy us
anything, but that the protection by xlp_sysid is a bit more meaningful. So a
compromise position could be to include xlp_sysid in the page header, possibly
in a "chopped up" manner, as Matthias suggested.


> If they did, would xl_prev and xl_crc be sufficient to catch that? I think
> Andres says in a later email that they would be, and I think I'm inclined to
> agree.  False xl_prev matches don't seem especially unlikely, but xl_crc
> seems like it should be effective.

I think it'd be an entirely tolerable risk. Even if a WAL file were falsely
replayed, we'd still notice the problem soon after. I think once such a
mistake was made, it'd be inadvisable to continue using the cluster anyway.

Greetings,

Andres Freund



Re: Lowering the default wal_blocksize to 4K

From: Andres Freund

Hi,

On 2023-10-11 16:09:21 +0200, Matthias van de Meent wrote:
> On Wed, 11 Oct 2023 at 01:29, Andres Freund <andres@anarazel.de> wrote:
> > > After that we'll only have the system ID left from the extended
> > > header, which we could store across 2 pages in the (current) alignment
> > > losses of xlp_rem_len - even pages the upper half, odd pages the
> > > lower half of the ID. This should allow for enough integrity checks
> > > without further increasing the size of XLogPageHeader in most
> > > installations.
> >
> > I doubt that that's a good idea - what if there's just a single page in a
> > segment? And there aren't earlier segments? That's not a rare case, IME.
> 
> Then we'd still have 50% of a system ID which we can check against for
> any corruption. I agree that it increases the chance of conflicts, but
> it's still strictly better than nothing at all.

A fair point - I somehow disregarded that all bits in the system id are
equally meaningful.

Greetings,

Andres Freund



Re: Lowering the default wal_blocksize to 4K

From: Robert Haas

On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> That leaves only the segments where a record starts exactly on the
> first usable byte of a segment, which is why I was trying to think of
> a way to cover that case too.  I suggested we could notice and insert
> a new record at that place.  But Andres suggests it would be too
> expensive and not worth worrying about.

Hmm. Even in that case, xl_prev has to match. It's not like it's the
wild west. Sure, it's not nearly as good of a cross-check, but it's
something. It seems to me that it's not worth worrying very much about
xlp_seg_size or xlp_blcksz changing undetected in that scenario - if
you're doing that kind of advanced magic, you need to be careful
enough to not mess it up, and if we still cross-check once per
checkpoint cycle that's pretty good. I do worry a bit about the sysid
changing under us, though. It's not that hard to get your WAL archives
mixed up, and it'd be nice to catch that right away.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Lowering the default wal_blocksize to 4K

From: Robert Haas

On Wed, Oct 11, 2023 at 6:11 PM Andres Freund <andres@anarazel.de> wrote:
> I think the question is what the point of the crosschecks in long page headers
> is. It's pretty easy to see what the point of the xlp_sysid check is - make it
> less likely to accidentally replay WAL from a different system.  It's much
> less clear what the point of xlp_seg_size and xlp_xlog_blcksz is - after all,
> they are also in ControlFileData, and the xlp_sysid check ties the control file
> and WAL file together.

Yeah, fair.

> Let me rephrase my point:
>
> If somebody uses a modified pg_resetwal to change the xlog block size, then
> tries to replay WAL from before that change, and is unlucky enough that the
> LSN looked for in a segment is the start of a valid record both before/after
> the pg_resetwal invocation, then yes, we might not catch that anymore if we
> remove the block size check. But the much much more common case is that the
> block size was *not* changed, in which case we *already* don't catch that
> pg_resetwal was invoked.

Hmm. Should we invent a mechanism just for that?

> ISTM that the xlp_seg_size and xlp_xlog_blcksz checks in long page headers are
> a belt and suspenders check that is very unlikely to ever catch a mistake that
> wouldn't otherwise be caught.

I think that's probably right.

> I think that's what Thomas was proposing.  Thinking about it a bit more I'm
> not sure that having the data both in the checkpoint record itself and in
> XLOG_CHECKPOINT_REDO buys much. But it's also pretty much free, so ...

Yes. To me, having it in the redo record seems considerably more
valuable. Because that's where we're going to begin replay, so we
should catch most problems straight off. To escape detection at that
point, you need to not just be pointed at the wrong WAL archive, but
actually have files of diverse origin mixed together in the same WAL
archive. That's a less-likely error, and we still have some ways of
catching it if it happens.

> Which would make the code more efficient...

Right.

> As outlined above, I don't think xlp_seg_size, xlp_xlog_blcksz buy us
> anything, but that the protection by xlp_sysid is a bit more meaningful. So a
> compromise position could be to include xlp_sysid in the page header, possibly
> in a "chopped up" manner, as Matthias suggested.

I'm not that keen on the idea of storing the upper half and lower half
in alternate pages. That seems to me to add code complexity and
cognitive burden with little increased likelihood of catching real
problems. I'm not completely opposed to the idea if somebody wants to
make it happen, but I bet it would be better to either store the whole
thing or just cut it in half and store, say, the low-order bits.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Lowering the default wal_blocksize to 4K

From: Ants Aasma

On Thu, 12 Oct 2023 at 16:36, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > That leaves only the segments where a record starts exactly on the
> > first usable byte of a segment, which is why I was trying to think of
> > a way to cover that case too.  I suggested we could notice and insert
> > a new record at that place.  But Andres suggests it would be too
> > expensive and not worth worrying about.
>
> Hmm. Even in that case, xl_prev has to match. It's not like it's the
> wild west. Sure, it's not nearly as good of a cross-check, but it's
> something. It seems to me that it's not worth worrying very much about
> xlp_seg_size or xlp_blcksz changing undetected in that scenario - if
> you're doing that kind of advanced magic, you need to be careful
> enough to not mess it up, and if we still cross-check once per
> checkpoint cycle that's pretty good. I do worry a bit about the sysid
> changing under us, though. It's not that hard to get your WAL archives
> mixed up, and it'd be nice to catch that right away.

This reminds me that xlp_tli is not being used to its full potential right
now either. We only check that it's not going backwards, but there is at
least one not very hard to hit way to get postgres to silently replay on the
wrong timeline. [1]

[1] https://www.postgresql.org/message-id/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw@mail.gmail.com

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

Re: Lowering the default wal_blocksize to 4K

From: Robert Haas

On Thu, Oct 12, 2023 at 9:57 AM Ants Aasma <ants@cybertec.at> wrote:
> This reminds me that xlp_tli is not being used to its full potential right
> now either. We only check that it's not going backwards, but there is at
> least one not very hard to hit way to get postgres to silently replay on
> the wrong timeline. [1]
>
> [1] https://www.postgresql.org/message-id/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw@mail.gmail.com

Maybe I'm missing something, but that seems mostly unrelated. What
you're discussing there is the server's ability to figure out when it
ought to perform a timeline switch. In other words, the server settles
on the wrong TLI and therefore opens and reads from the wrong
filename. But here, we're talking about the case where the server is
correct about the TLI and LSN and hence opens exactly the right file
on disk, but the contents of the file on disk aren't what they're
supposed to be due to a procedural error.

Said differently, I don't see how anything we could do with xlp_tli
would actually fix the problem discussed in that thread. That can
detect a situation where the TLI of the file doesn't match the TLI of
the pages inside the file, but it doesn't help with the case where the
server decided to read the wrong file in the first place.

But this does make me wonder whether storing xlp_tli and xlp_pageaddr
in every page is really worth the bit-space. That takes 12 bytes plus
any padding it forces us to incur, but the actual entropy content of
those 12 bytes must be quite low. In normal cases probably 7 or so of
those bytes are going to consist entirely of zero bits (TLI < 256,
LSN%8k == 0, LSN < 2^40). We could probably find a way of jumbling
the LSN, TLI, and maybe some other stuff into an 8-byte quantity or
even perhaps a 4-byte quantity that would do about as good a job
catching problems as what we have now (e.g.
LSN_HIGH32^LSN_LOW32^BITREVERSE(TLI)). In the event of a mismatch, the
value actually stored in the page header would be harder for humans to
understand, but I'm not sure that really matters here. Users should
mostly be concerned with whether a WAL file matches the cluster where
they're trying to replay it; forensics on misplaced or corrupted WAL
files should be comparatively rare.
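
For concreteness, a sketch of such a 4-byte jumble (nothing like this
exists today; the names are invented):

#include <stdint.h>

/* Classic bit-twiddling 32-bit reversal. */
static inline uint32_t
bitreverse32(uint32_t x)
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
    return (x << 16) | (x >> 16);
}

/*
 * Hypothetical page identifier: reversing the TLI moves its entropy
 * (almost always in the low bits) away from the LSN's low bits, which
 * are mostly zero for page addresses.
 */
static inline uint32_t
xlp_ident(uint64_t lsn, uint32_t tli)
{
    return (uint32_t) (lsn >> 32) ^ (uint32_t) lsn ^ bitreverse32(tli);
}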

--
Robert Haas
EDB: http://www.enterprisedb.com