Thread: Initdb-time block size specification


Initdb-time block size specification

From: David Christensen
Date:
Hi,

As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
variable block sizes; basically instead of BLCKSZ being a compile-time
constant, a single set of binaries can support all of the block sizes
Pg can support, using the value stored in pg_control as the basis.
(Possible future plans would be to make this something even more
dynamic, such as configured per tablespace, but this is out of scope;
this just sets up the infrastructure for this.)

Whereas we have traditionally used BLCKSZ to indicate the compile-time-selected
block size, this commit adjusts things so the cluster block size can be
selected at initdb time.

In order to code for this, we introduce a few new defines:

- CLUSTER_BLOCK_SIZE is the block size for this cluster itself.  This is not
valid until BlockSizeInit() has been called in the given backend, which we do as
early as possible by parsing the ControlFile and using the blcksz field.

- MIN_BLOCK_SIZE and MAX_BLOCK_SIZE are the limits for the selectable block
size. It is required that CLUSTER_BLOCK_SIZE is a power of 2 between these two
constants.

- DEFAULT_BLOCK_SIZE is the moral equivalent of BLCKSZ; it is the built-in
default value.  This is used in a few places that just needed a buffer of an
arbitrary size, but the dynamic value CLUSTER_BLOCK_SIZE should almost always be
used instead.

- CLUSTER_RELSEG_SIZE is used instead of RELSEG_SIZE, since internally we are
storing the segment size in terms of number of blocks.  RELSEG_SIZE is still
kept, but is used in terms of the number of blocks of DEFAULT_BLOCK_SIZE;
CLUSTER_RELSEG_SIZE scales appropriately (and is the only thing used internally)
to keep the same target total segment size regardless of block size.

This patch uses a precalculated table to store the block size itself, as well as
additional derived values that have traditionally been compile-time
constants (for example, MaxHeapTuplesPerPage).  The traditional macro names are
kept so code that doesn't care about the block size should not need to change;
however, the definitions of these have changed (see the CalcXXX() routines in
blocksize.h for details).
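
For illustration, a minimal sketch of what such a table-backed macro scheme
could look like (names and layout here are illustrative only, not the patch's
actual definitions):

    #include <stdint.h>

    /* Sketch only: derived values are computed once per block size and
     * fetched through a pointer that BlockSizeInit() aims at the right
     * table slot. */
    typedef struct BlockSizeCalcs
    {
        uint32_t    block_size;         /* backs CLUSTER_BLOCK_SIZE */
        uint32_t    max_heap_tuples;    /* backs MaxHeapTuplesPerPage */
        uint32_t    relseg_size;        /* backs CLUSTER_RELSEG_SIZE */
    } BlockSizeCalcs;

    extern const BlockSizeCalcs *ClusterCalcs;   /* set by BlockSizeInit() */

    #define CLUSTER_BLOCK_SIZE    (ClusterCalcs->block_size)
    #define MaxHeapTuplesPerPage  (ClusterCalcs->max_heap_tuples)
    #define CLUSTER_RELSEG_SIZE   (ClusterCalcs->relseg_size)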

A new function, BlockSizeInit(), populates the appropriate values based on the
target block size.  This should be called as early as possible in any code that
utilizes block sizes.  This patch adds the call in the appropriate place in the
handful of src/bin/ programs that used BLCKSZ, so this caveat mainly impacts new
code.
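
In a src/bin program, the call sequence could look roughly like this (a sketch
only: get_controlfile() is the existing src/common helper, the BlockSizeInit()
signature is assumed, and DataDir stands for the program's data-directory
argument):

    #include "common/controldata_utils.h"

    /* Initialize block-size-derived values from pg_control before
     * touching any pages. */
    bool        crc_ok;
    ControlFileData *cf = get_controlfile(DataDir, &crc_ok);

    if (cf == NULL || !crc_ok)
        pg_fatal("could not read control file");
    BlockSizeInit(cf->blcksz);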

Code which had previously used BLCKSZ should likely be able to get away with
changing these instances to CLUSTER_BLOCK_SIZE, unless you're using a structure
allocated on the stack.  In these cases, the compiler will complain about a
dynamically sized structure.  The solution is to utilize an expression with
MAX_BLOCK_SIZE instead of BLCKSZ, ensuring enough stack space is allocated for
the maximum size. This does require using CLUSTER_BLOCK_SIZE or an expression
based on it when actually using this structure, so in practice more stack space
may be allocated than used; as long as there is plenty of stack this should have
no specific impact on the code.
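
As a sketch of that stack pattern (illustrative only):

    /* Reserve the compile-time maximum on the stack, but only ever
     * operate on the cluster's actual block size. */
    char        buf[MAX_BLOCK_SIZE];    /* constant, so legal on the stack */

    memset(buf, 0, CLUSTER_BLOCK_SIZE); /* use only the active size */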

Initial (basic) performance testing shows only minor changes with the pgbench -S
benchmark, though this is obviously an area that will need considerable
testing/verification across multiple workloads.

Thanks,

David

Attachments

Re: Initdb-time block size specification

From: Tomas Vondra
Date:
On 6/30/23 19:35, David Christensen wrote:
> Hi,
> 
> As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
> variable block sizes; basically instead of BLCKSZ being a compile-time
> constant, a single set of binaries can support all of the block sizes
> Pg can support, using the value stored in pg_control as the basis.
> (Possible future plans would be to make this something even more
> dynamic, such as configured per tablespace, but this is out of scope;
> this just sets up the infrastructure for this.)
> 
> Whereas we had traditionally used BLCKSZ to indicate the compile-time selected
> block size, this commit adjusted things so the cluster block size can be
> selected at initdb time.
> 
> In order to code for this, we introduce a few new defines:
> 
> - CLUSTER_BLOCK_SIZE is the blocksize for this cluster itself.  This is not
> valid until BlockSizeInit() has been called in the given backend, which we do as
> early as possible by parsing the ControlFile and using the blcksz field.
> 
> - MIN_BLOCK_SIZE and MAX_BLOCK_SIZE are the limits for the selectable block
> size. It is required that CLUSTER_BLOCK_SIZE is a power of 2 between these two
> constants.
> 
> - DEFAULT_BLOCK_SIZE is the moral equivalent of BLCKSZ; it is the built-in
> default value.  This is used in a few places that just needed a buffer of an
> arbitrary size, but the dynamic value CLUSTER_BLOCK_SIZE should almost always be
> used instead.
> 
> - CLUSTER_RELSEG_SIZE is used instead of RELSEG_SIZE, since internally we are
> storing the segment size in terms of number of blocks.  RELSEG_SIZE is still
> kept, but is used in terms of the number of blocks of DEFAULT_BLOCK_SIZE;
> CLUSTER_RELSEG_SIZE scales appropriately (and is the only thing used internally)
> to keep the same target total segment size regardless of block size.
> 

Do we really want to prefix the option with CLUSTER_? That seems to just
add a lot of noise into the patch, and I don't see much value in this
rename. I'd prefer keeping BLCKSZ and tweak just the couple places that
need "default" to use BLCKSZ_DEFAULT or something like that.

But more importantly, I'd say we use CAPITALIZED_NAMES for compile-time
values, so after making this an initdb-time parameter we should abandon
that (just like fc49e24fa69a did for segment sizes). Perhaps something
like cluster_block_size would work ... (yes, that touches all the places
too).

> This patch uses a precalculated table to store the block size itself, as well as
> additional derived values that have traditionally been compile-time
> constants (example: MaxHeapTuplesPerPage).  The traditional macro names are kept
> so code that doesn't care about it should not need to change, however the
> definition of these has changed (see the CalcXXX() routines in blocksize.h for
> details).
> 
> A new function, BlockSizeInit() populates the appropriate values based on the
> target block size.  This should be called as early as possible in any code that
> utilizes block sizes.  This patch adds this in the appropriate place on the
> handful of src/bin/ programs that used BLCKSZ, so this caveat mainly impacts new
> code.
> 
> Code which had previously used BLCKSZ should likely be able to get away with
> changing these instances to CLUSTER_BLOCK_SIZE, unless you're using a structure
> allocated on the stack.  In these cases, the compiler will complain about a
> dynamically sized structure.  The solution is to utilize an expression with
> MAX_BLOCK_SIZE instead of BLCKSZ, ensuring enough stack space is allocated for
> the maximum size. This does require using CLUSTER_BLOCK_SIZE or an expression
> based on it when actually using this structure, so in practice more stack space
> may be allocated than used; as long as there is plenty of stack this should
> have no specific impact on the code.
> 
> Initial (basic) performance testing shows only minor changes with the pgbench -S
> benchmark, though this is obviously an area that will need considerable
> testing/verification across multiple workloads.
> 

I wonder how to best evaluate/benchmark this. AFAICS what we want to
measure is the extra cost of making the values dynamic (which means the
compiler can't just optimize them out). I'd say a "pgbench -S" seems
like a good test.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initdb-time block size specification

From: David Christensen
Date:
On Fri, Jun 30, 2023 at 1:14 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> Do we really want to prefix the option with CLUSTER_? That seems to just
> add a lot of noise into the patch, and I don't see much value in this
> rename. I'd prefer keeping BLCKSZ and tweak just the couple places that
> need "default" to use BLCKSZ_DEFAULT or something like that.
>
> But more importantly, I'd say we use CAPITALIZED_NAMES for compile-time
> values, so after making this initdb-time parameter we should abandon
> that (just like fc49e24fa69a did for segment sizes). Perhaps something
> like cluster_block_size would work ... (yes, that touches all the places
> too).

Yes, I can see that being an equivalent change; thanks for the pointer
there. The "cluster_block_size" approach could definitely work, though
since it's currently just a #define for GetBlockSize(), maybe we just
replace it with the equivalent instead. I was mainly trying to make it
something that was conceptually similar and easy to reason about
without getting bogged down in the details locally, but I can see that
ALL_CAPS does have a specific meaning.  Also, eliminating the BLCKSZ
symbol meant it was easier to catch anything which depended on that
value.  If we wanted to keep BLCKSZ, I'd be okay with that vs the
CLUSTER_BLOCK_SIZE approach at this point; it could help to make the
patch smaller.

> > Initial (basic) performance testing shows only minor changes with the pgbench -S
> > benchmark, though this is obviously an area that will need considerable
> > testing/verification across multiple workloads.
> >
>
> I wonder how to best evaluate/benchmark this. AFAICS what we want to
> measure is the extra cost of making the values dynamic (which means the
> compiler can't just optimize them out). I'd say a "pgbench -S" seems
> like a good test.

Yep, I tested 100 runs apiece with pgbench -S at scale factor 100,
default settings, for optimized builds of the same base commit with and
without the patch; I saw a TPS reduction of around 1% in that case, but
I do think we'd want to look at different workloads; I presume the
largest impact would be seen when it's all in shared_buffers and no IO
is required.

Thanks,

David



Re: Initdb-time block size specification

From: Andres Freund
Date:
Hi,

On 2023-06-30 12:35:09 -0500, David Christensen wrote:
> As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
> variable block sizes; basically instead of BLCKSZ being a compile-time
> constant, a single set of binaries can support all of the block sizes
> Pg can support, using the value stored in pg_control as the basis.
> (Possible future plans would be to make this something even more
> dynamic, such as configured per tablespace, but this is out of scope;
> this just sets up the infrastructure for this.)

I am extremely doubtful this is a good idea. For one it causes a lot of churn,
but more importantly it turns currently cheap code paths into more expensive
ones.

Changes like

> -#define BufHdrGetBlock(bufHdr)    ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> +#define BufHdrGetBlock(bufHdr)    ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE))

Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is
actually variable.

I am fairly certain this is going to be causing substantial performance
regressions.  I think we should reject this even if we don't immediately find
them, because it's almost guaranteed to cause some.
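
To illustrate the kind of cost at issue (a side-by-side illustration, not code
from the patch; cluster_block_size stands in for whatever global would back the
macro): with a compile-time power-of-two BLCKSZ the compiler strength-reduces
the multiply to a shift, while a run-time value requires a load of the global
plus a real integer multiply on every call:

    /* constant known at compile time: buf_id * 8192 becomes buf_id << 13 */
    Block a = (Block) (BufferBlocks + (Size) buf_id * 8192);

    /* run-time value: memory load of the global plus an integer multiply */
    Block b = (Block) (BufferBlocks + (Size) buf_id * cluster_block_size);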


Besides this, I've not really heard any convincing justification for needing
this in the first place.

Greetings,

Andres Freund



Re: Initdb-time block size specification

From: David Christensen
Date:
On Fri, Jun 30, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-06-30 12:35:09 -0500, David Christensen wrote:
> > As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
> > variable block sizes; basically instead of BLCKSZ being a compile-time
> > constant, a single set of binaries can support all of the block sizes
> > Pg can support, using the value stored in pg_control as the basis.
> > (Possible future plans would be to make this something even more
> > dynamic, such as configured per tablespace, but this is out of scope;
> > this just sets up the infrastructure for this.)
>
> I am extremely doubtful this is a good idea. For one it causes a lot of churn,
> but more importantly it turns currently cheap code paths into more expensive
> ones.
>
> Changes like
>
> > -#define BufHdrGetBlock(bufHdr)       ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > +#define BufHdrGetBlock(bufHdr)       ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE))
>
> Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is
> actually variable.

Correct; that is mainly a notational device which would be easy enough
to change (and presumably would follow along the lines of the commit
Tomas pointed out above).

> I am fairly certain this is going to be causing substantial performance
> regressions.  I think we should reject this even if we don't immediately find
> them, because it's almost guaranteed to cause some.

What would be considered substantial? Some overhead would be expected,
but I think having an actual patch to evaluate lets us see what
potential there is. Seems like this will likely be optimized as an
offset stored in a register, so wouldn't expect huge changes here.
(There may be other approaches I haven't thought of in terms of
getting this.)

> Besides this, I've not really heard any convincing justification for needing
> this in the first place.

Doing this would open up experiments in larger block sizes, so we
would be able to have larger indexable tuples, say, or be able to
store data types that are larger than currently supported for tuple
row limits without dropping to toast (native vector data types come to
mind as a candidate here).  We've had 8k blocks for a long time while
hardware has improved over 20+ years, and it would be interesting to
see how tuning things would open up additional avenues for performance
without requiring packagers to make a single choice on this regardless
of use-case.  (The fact that we allow compiling this at a different
value suggests there is thought to be some utility in having this be
something other than the default value.)

I just think it's one of those things that is hard to evaluate without
actually having something specific, which is why we have this patch
now.

Thanks,

David



Re: Initdb-time block size specification

From: Andres Freund
Date:
Hi,

On 2023-06-30 14:09:55 -0500, David Christensen wrote:
> On Fri, Jun 30, 2023 at 1:14 PM Tomas Vondra
> > I wonder how to best evaluate/benchmark this. AFAICS what we want to
> > measure is the extra cost of making the values dynamic (which means the
> > compiler can't just optimize them out). I'd say a "pgbench -S" seems
> > like a good test.
> 
> Yep, I tested 100 runs apiece with pgbench -S at scale factor 100,
> default settings for optimized builds of the same base commit with and
> without the patch; saw a reduction of TPS around 1% in that case, but
> I do think we'd want to look at different workloads; I presume the
> largest impact would be seen when it's all in shared_buffers and no IO
> is required.

I think pgbench -S indeed isn't a good workload - the overhead for it is much
more in context switches and instantiating executor state etc than code that
is affected by the change.

And indeed. Comparing e.g. TPC-H, I see *massive* regressions.  Some queries
are the same, but others regress by up to 70% (although more commonly around
10-20%).

That's larger than I thought, which makes me suspect that there's some bug in
the new code.

Interestingly, repeating the benchmark with a larger work_mem setting, the
regressions are still quite present, but smaller. I suspect the planner
chooses smarter plans which move bottlenecks more towards hashjoin code etc,
which won't be affected by this change.


IOW, you seriously need to evaluate analytics queries before this is worth
looking at further.

Greetings,

Andres Freund



Re: Initdb-time block size specification

From: Andres Freund
Date:
Hi,

On 2023-06-30 15:05:54 -0500, David Christensen wrote:
> > I am fairly certain this is going to be causing substantial performance
> > regressions.  I think we should reject this even if we don't immediately find
> > them, because it's almost guaranteed to cause some.
> 
> What would be considered substantial? Some overhead would be expected,
> but I think having an actual patch to evaluate lets us see what
> potential there is.

Anything beyond 1-2%, although even that imo is a hard sell.


> > Besides this, I've not really heard any convincing justification for needing
> > this in the first place.
> 
> Doing this would open up experiments in larger block sizes, so we
> would be able to have larger indexable tuples, say, or be able to
> store data types that are larger than currently supported for tuple
> row limits without dropping to toast (native vector data types come to
> mind as a candidate here).

You can do experiments today with the compile-time option, which does not
require regressing performance for everyone.


> We've had 8k blocks for a long time while hardware has improved over 20+
> years, and it would be interesting to see how tuning things would open up
> additional avenues for performance without requiring packagers to make a
> single choice on this regardless of use-case.

I suspect you're going to see more benefits from going to a *lower* setting
than a higher one. Some practical issues aside, plenty of storage hardware
these days would allow to get rid of FPIs if you go to 4k blocks (although it
often requires explicit sysadmin action to reformat the drive into that mode
etc).  But obviously that's problematic from the "postgres limits" POV.


If we really wanted to do this - but I don't think we do - I'd argue for
working on the buildsystem support to build the postgres binary multiple
times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
exec's the relevant "real" binary based on the pg_control value.  I really
don't see us ever wanting to make BLCKSZ runtime configurable within one
postgres binary. There's just too much intrinsic overhead associated with
that.
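
A rough sketch of such a wrapper (paths and naming here are hypothetical;
get_controlfile() is the existing src/common helper; error handling omitted):

    /* Sketch: read blcksz from pg_control, then exec the matching
     * "real" binary, e.g. "postgres-8kB". */
    bool        crc_ok;
    ControlFileData *cf = get_controlfile(datadir, &crc_ok);
    char        real_binary[MAXPGPATH];

    snprintf(real_binary, sizeof(real_binary), "%s/postgres-%ukB",
             bindir, cf->blcksz / 1024);
    execv(real_binary, argv);       /* replaces the wrapper process */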

Greetings,

Andres Freund



Re: Initdb-time block size specification

From: Tomas Vondra
Date:

On 6/30/23 22:05, David Christensen wrote:
> On Fri, Jun 30, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:
>>
>> ...
>>
>> Besides this, I've not really heard any convincing justification for needing
>> this in the first place.
> 
> Doing this would open up experiments in larger block sizes, so we
> would be able to have larger indexable tuples, say, or be able to
> store data types that are larger than currently supported for tuple
> row limits without dropping to toast (native vector data types come to
> mind as a candidate here).  We've had 8k blocks for a long time while
> hardware has improved over 20+ years, and it would be interesting to
> see how tuning things would open up additional avenues for performance
> without requiring packagers to make a single choice on this regardless
> of use-case.  (The fact that we allow compiling this at a different
> value suggests there is thought to be some utility having this be
> something other than the default value.)
> 
> I just think it's one of those things that is hard to evaluate without
> actually having something specific, which is why we have this patch
> now.
> 

But it's possible to evaluate that - you just need to rebuild with a
different configuration option. Yes, allowing that at initdb time is
simpler and allows testing this on systems where rebuilding is not
convenient. And having a binary that can deal with any block size would
be nice too.

In fact, I did exactly that a year ago for a conference, and I spoke
about it at the 2022 unconference too. Not sure if there's a recording
from pgcon, but there is one from the other conference [1][2].

The short story is that the possible gains are significant (say +50%)
for data sets that don't fit into RAM. But that was with block size set
at compile time, the question is what's the impact of making it a
variable instead of a macro ....

[1] https://www.youtube.com/watch?v=mVKpoQxtCXk
[2] https://blog.pgaddict.com/pdf/block-sizes-postgresvision-2022.pdf


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initdb-time block size specification

From: Tomas Vondra
Date:

On 6/30/23 23:11, Andres Freund wrote:
> ...
> 
> If we really wanted to do this - but I don't think we do - I'd argue for
> working on the buildsystem support to build the postgres binary multiple
> times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> exec's the relevant "real" binary based on the pg_control value.  I really
> don't see us ever wanting to make BLCKSZ runtime configurable within one
> postgres binary. There's just too much intrinsic overhead associated with
> that.
> 

I don't quite understand why we shouldn't do this (or at least try to).
IMO the benefits of using smaller blocks were substantial (especially
for 4kB, most likely due to matching the internal SSD page size). The other
benefits (reducing WAL volume) seem rather interesting too.

Sure, there are challenges (e.g. the overhead due to making it dynamic).
No doubt about that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initdb-time block size specification

From: David Christensen
Date:
On Fri, Jun 30, 2023 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:
> And indeed. Comparing e.g. TPC-H, I see *massive* regressions.  Some queries
> are the same, but others regress by up to 70% (although more commonly around
> 10-20%).

Hmm, that is definitely not good.

> That's larger than I thought, which makes me suspect that there's some bug in
> the new code.

Will do a little profiling here to see if I can figure out the
regression. Which build optimization settings are you seeing this
under?

> Interestingly, repeating the benchmark with a larger work_mem setting, the
> regressions are still quite present, but smaller. I suspect the planner
> chooses smarter plans which move bottlenecks more towards hashjoin code etc,
> which won't be affected by this change.

Interesting.

> IOW, you seriously need to evaluate analytics queries before this is worth
> looking at further.

Makes sense, thanks for reviewing.

David



Re: Initdb-time block size specification

From: David Christensen
Date:
On Fri, Jun 30, 2023 at 4:12 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 6/30/23 22:05, David Christensen wrote:
> > On Fri, Jun 30, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:
> >>
> >> ...
> >>
> >> Besides this, I've not really heard any convincing justification for needing
> >> this in the first place.
> >
> > Doing this would open up experiments in larger block sizes, so we
> > would be able to have larger indexable tuples, say, or be able to
> > store data types that are larger than currently supported for tuple
> > row limits without dropping to toast (native vector data types come to
> > mind as a candidate here).  We've had 8k blocks for a long time while
> > hardware has improved over 20+ years, and it would be interesting to
> > see how tuning things would open up additional avenues for performance
> > without requiring packagers to make a single choice on this regardless
> > of use-case.  (The fact that we allow compiling this at a different
> > value suggests there is thought to be some utility having this be
> > something other than the default value.)
> >
> > I just think it's one of those things that is hard to evaluate without
> > actually having something specific, which is why we have this patch
> > now.
> >
>
> But it's possible to evaluate that - you just need to rebuild with a
> different configuration option. Yes, allowing doing that at initdb is
> simpler and allows testing this on systems where rebuilding is not
> convenient. And having a binary that can deal with any block size would
> be nice too.
>
> In fact, I did exactly that a year ago for a conference, and I spoke
> about it at the 2022 unconference too. Not sure if there's recording
> from pgcon, but there is one from the other conference [1][2].
>
> The short story is that the possible gains are significant (say +50%)
> for data sets that don't fit into RAM. But that was with block size set
> at compile time, the question is what's the impact of making it a
> variable instead of a macro ....
>
> [1] https://www.youtube.com/watch?v=mVKpoQxtCXk
> [2] https://blog.pgaddict.com/pdf/block-sizes-postgresvision-2022.pdf

Cool, thanks for the video links; will review.

David



Re: Initdb-time block size specification

From: David Christensen
Date:
On Fri, Jun 30, 2023 at 4:27 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 6/30/23 23:11, Andres Freund wrote:
> > ...
> >
> > If we really wanted to do this - but I don't think we do - I'd argue for
> > working on the buildsystem support to build the postgres binary multiple
> > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> > exec's the relevant "real" binary based on the pg_control value.  I really
> > don't see us ever wanting to make BLCKSZ runtime configurable within one
> > postgres binary. There's just too much intrinsic overhead associated with
> > that.
> >
>
> I don't quite understand why we shouldn't do this (or at least try to).
> IMO the benefits of using smaller blocks were substantial (especially
> > for 4kB, most likely due to matching the internal SSD page size). The other
> benefits (reducing WAL volume) seem rather interesting too.

If it's dynamic, we could also add detection of the best
block size at initdb time, leading to improvements all around, say.
:-)

> Sure, there are challenges (e.g. the overhead due to making it dynamic).
> No doubt about that.

I definitely agree that it seems worth looking into.  If nothing else,
being able to quantify just what/where the overhead comes from may be
instructive for future efforts.

David



Re: Initdb-time block size specification

From: David Christensen
Date:
On Fri, Jun 30, 2023 at 4:17 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-06-30 15:05:54 -0500, David Christensen wrote:
> > > I am fairly certain this is going to be causing substantial performance
> > > regressions.  I think we should reject this even if we don't immediately find
> > > them, because it's almost guaranteed to cause some.
> >
> > What would be considered substantial? Some overhead would be expected,
> > but I think having an actual patch to evaluate lets us see what
> > potential there is.
>
> Anything beyond 1-2%, although even that imo is a hard sell.

I'd agree that that threshold seems like a reasonable target, and
anything much above that would be regressive.

> > > Besides this, I've not really heard any convincing justification for needing
> > > this in the first place.
> >
> > Doing this would open up experiments in larger block sizes, so we
> > would be able to have larger indexable tuples, say, or be able to
> > store data types that are larger than currently supported for tuple
> > row limits without dropping to toast (native vector data types come to
> > mind as a candidate here).
>
> You can do experiments today with the compile time option. Which does not
> require regressing performance for everyone.

Sure, not arguing that this is more performant than the current approach.

> > We've had 8k blocks for a long time while hardware has improved over 20+
> > years, and it would be interesting to see how tuning things would open up
> > additional avenues for performance without requiring packagers to make a
> > single choice on this regardless of use-case.
>
> I suspect you're going to see more benefits from going to a *lower* setting
> than a higher one. Some practical issues aside, plenty of storage hardware
> these days would allow to get rid of FPIs if you go to 4k blocks (although it
> often requires explicit sysadmin action to reformat the drive into that mode
> etc).  But obviously that's problematic from the "postgres limits" POV.
>
>
> If we really wanted to do this - but I don't think we do - I'd argue for
> working on the buildsystem support to build the postgres binary multiple
> times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> exec's the relevant "real" binary based on the pg_control value.  I really
> don't see us ever wanting to make BLCKSZ runtime configurable within one
> postgres binary. There's just too much intrinsic overhead associated with
> that.

You may well be right, but I think if we haven't tried to do that and
measure it, it's hard to say exactly.  There are of course more parts
of the system that care about BLCKSZ than just the backend, plus you'd need
to build other extensions to support each option, so there is a lot
more that would need to change. (That's neither here nor there, as my
approach also involves changing all those places, so change isn't
inherently bad; just saying it's not a trivial solution to merely
iterate over the block size for binaries.)

David



Re: Initdb-time block size specification

From: Tomas Vondra
Date:

On 6/30/23 23:11, Andres Freund wrote:
> Hi,
>
> ...
> 
> I suspect you're going to see more benefits from going to a *lower* setting
> than a higher one. Some practical issues aside, plenty of storage hardware
> these days would allow to get rid of FPIs if you go to 4k blocks (although it
> often requires explicit sysadmin action to reformat the drive into that mode
> etc).  But obviously that's problematic from the "postgres limits" POV.
> 

I wonder what the conditions/options are for disabling FPI. I kinda
assume it'd apply to new drives with 4k sectors, with properly aligned
partitions etc. But I haven't seen any particularly clear confirmation
that's correct.

> 
> If we really wanted to do this - but I don't think we do - I'd argue for
> working on the buildsystem support to build the postgres binary multiple
> times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> exec's the relevant "real" binary based on the pg_control value.  I really
> don't see us ever wanting to make BLCKSZ runtime configurable within one
> postgres binary. There's just too much intrinsic overhead associated with
> that.
> 

How would that work for extensions which may be built for a particular
BLCKSZ value (like pageinspect)?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initdb-time block size specification

From: Bruce Momjian
Date:
On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote:
> 
> 
> On 6/30/23 23:11, Andres Freund wrote:
> > Hi,
> >
> > ...
> > 
> > I suspect you're going to see more benefits from going to a *lower* setting
> > than a higher one. Some practical issues aside, plenty of storage hardware
> > these days would allow to get rid of FPIs if you go to 4k blocks (although it
> > often requires explicit sysadmin action to reformat the drive into that mode
> > etc).  But obviously that's problematic from the "postgres limits" POV.
> > 
> 
> I wonder what are the conditions/options for disabling FPI. I kinda
> assume it'd apply to new drives with 4k sectors, with properly aligned
> partitions etc. But I haven't seen any particularly clear confirmation
> that's correct.

I don't think we have ever had to study this --- we just request the
write to the operating system, and we either get a successful reply or
we go into WAL recovery to reread the pre-image.  We never really care
if the write is atomic, e.g., an 8k write can be done in 2 4kB writes or
4 2kB writes --- we don't care --- we only care if they are all done or
not.

For a 4kB write, to say it is not partially written would be to require
the operating system to guarantee that the 4kB write is not split into
smaller writes which might each be atomic because smaller atomic writes
would not help us.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Initdb-time block size specification

From: Andres Freund
Date:
Hi,

On 2023-06-30 23:27:45 +0200, Tomas Vondra wrote:
> On 6/30/23 23:11, Andres Freund wrote:
> > ...
> > 
> > If we really wanted to do this - but I don't think we do - I'd argue for
> > working on the buildsystem support to build the postgres binary multiple
> > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> > exec's the relevant "real" binary based on the pg_control value.  I really
> > don't see us ever wanting to make BLCKSZ runtime configurable within one
> > postgres binary. There's just too much intrinsic overhead associated with
> > that.
>
> I don't quite understand why we shouldn't do this (or at least try to).
>
> IMO the benefits of using smaller blocks were substantial (especially
>> for 4kB, most likely due to matching the internal SSD page size). The other
> benefits (reducing WAL volume) seem rather interesting too.

Mostly because I think there are bigger gains to be had elsewhere.

IME not a whole lot of storage ships by default with externally visible 4k
sectors, but needs to be manually reformatted [1], which loses all data, so it
has to be done initially.  Then postgres would also need OS-specific trickery
to figure out that indeed the IO stack is entirely 4k (checking the sector size
is not enough).  And you run into the issue that suddenly the #columns and
index-tuple-size limits are lower, which won't make it easier.


I think we should change the default of the WAL blocksize to 4k
though. There's practically no downsides, and it drastically reduces
postgres-side write amplification in many transactional workloads, by only
writing out partially filled 4k pages instead of partially filled 8k pages.
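
As a rough illustration (numbers invented for the example): a synchronous
commit that adds 600 bytes of WAL to an otherwise empty page has to flush the
whole page, so an 8kB WAL page writes roughly 7.4kB of padding where a 4kB page
writes roughly 3.4kB, cutting the padded bytes for small commits about in half.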


> Sure, there are challenges (e.g. the overhead due to making it dynamic).
> No doubt about that.

I don't think the runtime-dynamic overhead is avoidable with reasonable effort
(leaving aside compiling code multiple times and switching between).

If we were to start building postgres for multiple compile-time settings, I
think there are uses other than switching between BLCKSZ, potentially more
interesting. E.g. you can see substantially improved performance by being able
to use SSE4.2 without CPU dispatch (partially because it allows compiler
autovectorization, partially because it allows the compiler to use newer
non-vectorized math instructions (when targeting AVX IIRC), partially because
the dispatch overhead is not insubstantial).  Another example: ARMv8
performance is substantially better if you target ARMv8.1-A instead of
ARMv8.0, due to having atomic instructions instead of LL/SC (it still baffles
me that they didn't do this earlier, ll/sc is just inherently inefficient).

Greetings,

Andres Freund


[1] To see the supported LBA formats: for d in /dev/nvme*n1; do echo "$d:"; sudo nvme id-ns -H $d | grep '^LBA Format'; echo; done



Re: Initdb-time block size specification

From: Andres Freund
Date:
Hi,

On 2023-06-30 23:42:30 +0200, Tomas Vondra wrote:
> I wonder what are the conditions/options for disabling FPI. I kinda
> assume it'd apply to new drives with 4k sectors, with properly aligned
> partitions etc. But I haven't seen any particularly clear confirmation
> that's correct.

Yea, it's not trivial. And the downsides are also substantial from a
replication / crash recovery performance POV - even with reading blocks ahead
of WAL replay, it's hard to beat just sequentially reading nearly all the data
you're going to need.


> On 6/30/23 23:11, Andres Freund wrote:
> > If we really wanted to do this - but I don't think we do - I'd argue for
> > working on the buildsystem support to build the postgres binary multiple
> > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> > exec's the relevant "real" binary based on the pg_control value.  I really
> > don't see us ever wanting to make BLCKSZ runtime configurable within one
> > postgres binary. There's just too much intrinsic overhead associated with
> > that.
>
> How would that work for extensions which may be built for a particular
> BLCKSZ value (like pageinspect)?

I think we'd need to do something similar for extensions, likely loading them
from a path that includes the "subvariant" the server currently is running. Or
alternatively adding a suffix to the filename indicating the
variant. Something like pageinspect.x86-64-v4-4kB.so.

The x86-64-v* stuff is something gcc and clang added a couple years ago, so
that not every project has to define different "baseline" levels. I think it
was done in collaboration with the system-v/linux AMD64 ABI specification group
([1]).
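
Composing such a name could look like this (a sketch; cluster_block_size is a
hypothetical variable, while MAXPGPATH and DLSUFFIX are existing defines):

    char        soname[MAXPGPATH];

    /* e.g. "pageinspect.x86-64-v4-4kB.so" for a 4096-byte cluster */
    snprintf(soname, sizeof(soname), "%s.%s-%ukB%s",
             "pageinspect", "x86-64-v4", cluster_block_size / 1024, DLSUFFIX);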

Greetings,

Andres

[1] https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/master/raw/x86-64-ABI/abi.pdf?job=build
section 3.1.1.



Re: Initdb-time block size specification

From: Tomas Vondra
Date:

On 6/30/23 23:53, Bruce Momjian wrote:
> On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote:
>>
>>
>> On 6/30/23 23:11, Andres Freund wrote:
>>> Hi,
>>>
>>> ...
>>>
>>> I suspect you're going to see more benefits from going to a *lower* setting
>>> than a higher one. Some practical issues aside, plenty of storage hardware
>>> these days would allow to get rid of FPIs if you go to 4k blocks (although it
>>> often requires explicit sysadmin action to reformat the drive into that mode
>>> etc).  But obviously that's problematic from the "postgres limits" POV.
>>>
>>
>> I wonder what are the conditions/options for disabling FPI. I kinda
>> assume it'd apply to new drives with 4k sectors, with properly aligned
>> partitions etc. But I haven't seen any particularly clear confirmation
>> that's correct.
> 
> I don't think we have ever had to study this --- we just request the
> write to the operating system, and we either get a successful reply or
> we go into WAL recovery to reread the pre-image.  We never really care
> if the write is atomic, e.g., an 8k write can be done in 2 4kB writes or
> 4 2kB writes --- we don't care --- we only care if they are all done or
> not.
> 
> For a 4kB write, to say it is not partially written would be to require
> the operating system to guarantee that the 4kB write is not split into
> smaller writes which might each be atomic because smaller atomic writes
> would not help us.
> 

Right, that's the dance we do to protect against torn pages. But Andres
suggested that if you have modern storage and configure it correctly,
writing with 4kB pages would be atomic. So we wouldn't need to do this
FPI stuff, eliminating a pretty significant source of write amplification.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initdb-time block size specification

From: Bruce Momjian
Date:
On Sat, Jul  1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote:
> On 6/30/23 23:53, Bruce Momjian wrote:
> > For a 4kB write, to say it is not partially written would be to require
> > the operating system to guarantee that the 4kB write is not split into
> > smaller writes which might each be atomic because smaller atomic writes
> > would not help us.
> 
> Right, that's the dance we do to protect against torn pages. But Andres
> suggested that if you have modern storage and configure it correctly,
> writing with 4kB pages would be atomic. So we wouldn't need to do this
> FPI stuff, eliminating pretty significant source of write amplification.

I agree the hardware is atomic for 4k writes, but do we know the OS
always issues 4k writes?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Initdb-time block size specification

From: Andres Freund
Date:
Hi,

On 2023-06-30 17:53:34 -0400, Bruce Momjian wrote:
> On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote:
> > On 6/30/23 23:11, Andres Freund wrote:
> > > I suspect you're going to see more benefits from going to a *lower* setting
> > > than a higher one. Some practical issues aside, plenty of storage hardware
> > > these days would allow to get rid of FPIs if you go to 4k blocks (although it
> > > often requires explicit sysadmin action to reformat the drive into that mode
> > > etc).  But obviously that's problematic from the "postgres limits" POV.
> > >
> >
> > I wonder what are the conditions/options for disabling FPI. I kinda
> > assume it'd apply to new drives with 4k sectors, with properly aligned
> > partitions etc. But I haven't seen any particularly clear confirmation
> > that's correct.
>
> I don't think we have ever had to study this --- we just request the
> write to the operating system, and we either get a successful reply or
> we go into WAL recovery to reread the pre-image.  We never really care
> if the write is atomic, e.g., an 8k write can be done in 2 4kB writes or
> 4 2kB writes --- we don't care --- we only care if they are all done or
> not.

Well, that works because we have FPI. This sub-discussion is motivated by
getting rid of FPIs.


> For a 4kB write, to say it is not partially written would be to require
> the operating system to guarantee that the 4kB write is not split into
> smaller writes which might each be atomic because smaller atomic writes
> would not help us.

That's why we're talking about drives with 4k sector size - you *can't* split
the writes below that.

The problem is that, as far as I know, it's not always obvious what block size
is being used on the actual storage level.  It's not even trivial when
operating on a filesystem directly stored on a single block device ([1]). Once
there's things like LVM or disk encryption involved, it gets pretty hairy
([2]).  Once you know all the block devices, it's not too bad, but ...

Greetings,

Andres Freund

[1] On linux I think you need to use stat() to figure out the st_dev for a
file, then look in /proc/self/mountinfo for the block device, then use the
device name to look in /sys/block/$d/queue/physical_block_size.

[2] The above doesn't work because e.g. a device mapper target might only
support 4k sectors, even though the sectors on the underlying storage device
are 512b sectors. E.g. my root filesystem is encrypted, and if you follow the
above recipe (with the added step of resolving the symlink to know the actual
device name), you would see a 4k sector size.  Even though the underlying NVMe
disk only supports 512b sectors.
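
For reference, a minimal C sketch of the recipe in [1], using /sys/dev/block to
skip the mountinfo parsing (Linux-only; as [2] explains, the result can still be
misleading with device mapper, and for a partition the queue/ directory lives on
the parent device):

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>

    /* Return the logical block size of the device backing 'path', or -1. */
    static long
    logical_block_size(const char *path)
    {
        struct stat st;
        char        sysfs[256];
        long        blksz = -1;
        FILE       *f;

        if (stat(path, &st) != 0)
            return -1;
        snprintf(sysfs, sizeof(sysfs),
                 "/sys/dev/block/%u:%u/queue/logical_block_size",
                 major(st.st_dev), minor(st.st_dev));
        if ((f = fopen(sysfs, "r")) != NULL)
        {
            if (fscanf(f, "%ld", &blksz) != 1)
                blksz = -1;
            fclose(f);
        }
        return blksz;
    }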



Re: Initdb-time block size specification

From: Tomas Vondra
Date:
On 7/1/23 00:05, Andres Freund wrote:
> Hi,
> 
> On 2023-06-30 23:27:45 +0200, Tomas Vondra wrote:
>> On 6/30/23 23:11, Andres Freund wrote:
>>> ...
>>>
>>> If we really wanted to do this - but I don't think we do - I'd argue for
>>> working on the buildsystem support to build the postgres binary multiple
>>> times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
>>> exec's the relevant "real" binary based on the pg_control value.  I really
>>> don't see us ever wanting to make BLCKSZ runtime configurable within one
>>> postgres binary. There's just too much intrinsic overhead associated with
>>> that.
>>
>> I don't quite understand why we shouldn't do this (or at least try to).
>>
>> IMO the benefits of using smaller blocks were substantial (especially
>> for 4kB, most likely due to matching the internal SSD page size). The other
>> benefits (reducing WAL volume) seem rather interesting too.
> 
> Mostly because I think there are bigger gains to be had elsewhere.
> 

I think that decision is up to whoever chooses to work on it, especially
if performance is not their primary motivation (IIRC this was discussed
as part of the TDE session).

> IME not a whole lot of storage ships by default with externally visible 4k
> sectors, but needs to be manually reformated [1], which looses all data, so it
> has to be done initially.

I don't see why "you have to configure stuff" would be a reason against
improvements in this area. I don't know how prevalent storage with 4k
sectors is now, but AFAIK it's not hard to get and it's likely to get
yet more common in the future.

FWIW I don't think the benefits of different (lower) page sizes hinge on
4k sectors - it's just that not having to do FPIs would make it even
more interesting.

> Then postgres would also need OS specific trickery
> to figure out that indeed the IO stack is entirely 4k (checking sector size is
> not enough).

I haven't suggested we should be doing that automatically (it would be
nice, but I'd be happy with knowing when it's safe to disable FPW using
the GUC in the config). But knowing when it's safe would make it yet more
interesting to be able to use a different block size at initdb.

> And you run into the issue that suddenly the #column and
> index-tuple-size limits are lower, which won't make it easier.
> 

True. This limit is annoying, but no one is proposing to change the
default page size. initdb would just provide a more convenient way to do
that, but the user would have to check. (I rather doubt many people
actually index such large values).

> 
> I think we should change the default of the WAL blocksize to 4k
> though. There's practically no downsides, and it drastically reduces
> postgres-side write amplification in many transactional workloads, by only
> writing out partially filled 4k pages instead of partially filled 8k pages.
> 

+1 (although in my tests the benefits were much smaller than for BLCKSZ)

> 
>> Sure, there are challenges (e.g. the overhead due to making it dynamic).
>> No doubt about that.
> 
> I don't think the runtime-dynamic overhead is avoidable with reasonable effort
> (leaving aside compiling code multiple times and switching between).
> 
> If we were to start building postgres for multiple compile-time settings, I
> think there are uses other than switching between BLCKSZ, potentially more
> interesting. E.g. you can see substantially improved performance by being able
> to use SSE4.2 without CPU dispatch (partially because it allows compiler
> autovectorization, partially because it allows the compiler to use newer
> non-vectorized math instructions (when targeting AVX IIRC), partially because
> the dispatch overhead is not insubstantial).  Another example: ARMv8
> performance is substantially better if you target ARMv8.1-A instead of
> ARMv8.0, due to having atomic instructions instead of LL/SC (it still baffles
> me that they didn't do this earlier, ll/sc is just inherently inefficient).
> 

Maybe, although I think it depends on what parts of the code this would
affect. If it's sufficiently small/isolated, it'd be possible to have
multiple paths, specialized to a particular page size (pretty common
technique for GPU/SIMD, I believe).
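
One common shape for that (a sketch, not from any posted patch): dispatch once
on the run-time block size so each specialized path sees a compile-time
constant the optimizer can fold into shifts and vectorized loops:

    #include <stddef.h>

    /* Toy per-page work; 'blksz' constant-folds in each specialization. */
    static inline long
    count_nonzero_sized(const char *page, size_t blksz)
    {
        long        n = 0;

        for (size_t off = 0; off < blksz; off += 64)
            n += (page[off] != 0);
        return n;
    }

    long
    count_nonzero(const char *page, size_t blksz)
    {
        switch (blksz)
        {
            case 4096:  return count_nonzero_sized(page, 4096);
            case 8192:  return count_nonzero_sized(page, 8192);
            case 16384: return count_nonzero_sized(page, 16384);
            default:    return count_nonzero_sized(page, blksz);
        }
    }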


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initdb-time block size specification

From: Bruce Momjian
Date:
On Fri, Jun 30, 2023 at 03:51:18PM -0700, Andres Freund wrote:
> > For a 4kB write, to say it is not partially written would be to require
> > the operating system to guarantee that the 4kB write is not split into
> > smaller writes which might each be atomic because smaller atomic writes
> > would not help us.
> 
> That's why were talking about drives with 4k sector size - you *can't* split
> the writes below that.

Okay, good point.

> The problem is that, as far as I know,it's not always obvious what block size
> is being used on the actual storage level.  It's not even trivial when
> operating on a filesystem directly stored on a single block device ([1]). Once
> there's things like LVM or disk encryption involved, it gets pretty hairy
> ([2]).  Once you know all the block devices, it's not too bad, but ...
> 
> Greetings,
> 
> Andres Freund
> 
> [1] On linux I think you need to use stat() to figure out the st_dev for a
> file, then look in /proc/self/mountinfo for the block device, use the name
> of the file to look in /sys/block/$d/queue/physical_block_size.

I just got a new server:

    https://momjian.us/main/blogs/blog/2023.html#June_28_2023

so tested this on my new M.2 NVME storage device:

    $ cat /sys/block/nvme0n1/queue/physical_block_size
    262144

that's 256k, not 4k.

> [2] The above doesn't work because e.g. a device mapper target might only
> support 4k sectors, even though the sectors on the underlying storage device
> are 512b sectors. E.g. my root filesystem is encrypted, and if you follow the
> above recipe (with the added step of resolving the symlink to know the actual
> device name), you would see a 4k sector size.  Even though the underlying NVMe
> disk only supports 512b sectors.

Good point.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Initdb-time block size specification

From: Andres Freund
Date:
On 2023-06-30 18:37:39 -0400, Bruce Momjian wrote:
> On Sat, Jul  1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote:
> > On 6/30/23 23:53, Bruce Momjian wrote:
> > > For a 4kB write, to say it is not partially written would be to require
> > > the operating system to guarantee that the 4kB write is not split into
> > > smaller writes which might each be atomic because smaller atomic writes
> > > would not help us.
> > 
> > Right, that's the dance we do to protect against torn pages. But Andres
> > suggested that if you have modern storage and configure it correctly,
> > writing with 4kB pages would be atomic. So we wouldn't need to do this
> > FPI stuff, eliminating pretty significant source of write amplification.
> 
> I agree the hardware is atomic for 4k writes, but do we know the OS
> always issues 4k writes?

When using a sector size of 4K you *can't* make smaller writes via normal
paths. The addressing unit is in sectors. The details obviously differ between
storage protocol, but you pretty much always just specify a start sector and a
number of sectors to be operated on.

Obviously the kernel could read 4k, modify 512 bytes in-memory, and then write
4k back, but that shouldn't be a danger here.  There might also be debug
interfaces to allow reading/writing in different increments, but that'd not be
something happening during normal operation.



Re: Initdb-time block size specification

From: Bruce Momjian
Date:
On Fri, Jun 30, 2023 at 06:58:20PM -0400, Bruce Momjian wrote:
> I just got a new server:
> 
>     https://momjian.us/main/blogs/blog/2023.html#June_28_2023
> 
> so tested this on my new M.2 NVME storage device:
> 
>     $ /sys/block/nvme0n1/queue/physical_block_size
>     262144
> 
> that's 256k, not 4k.

I have another approach to this.  My storage device has power
protection, so even though it has a 256k physical block size, it should
be fine with 4k write atomicity.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Initdb-time block size specification

From: Andres Freund
Date:
Hi,

On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote:
> > [1] On linux I think you need to use stat() to figure out the st_dev for a
> > file, then look in /proc/self/mountinfo for the block device, use the name
> > of the file to look in /sys/block/$d/queue/physical_block_size.
>
> I just got a new server:
>
>     https://momjian.us/main/blogs/blog/2023.html#June_28_2023
>
> so tested this on my new M.2 NVME storage device:
>
>     $ /sys/block/nvme0n1/queue/physical_block_size
>     262144

Ah, I got the relevant filename wrong. I think it's logical_block_size, not
the physical one (that's the size of addressing).  I didn't realize because the
devices I looked at have the same...

Regards,

Andres



Re: Initdb-time block size specification

From: Tomas Vondra
Date:
On 7/1/23 00:59, Andres Freund wrote:
> On 2023-06-30 18:37:39 -0400, Bruce Momjian wrote:
>> On Sat, Jul  1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote:
>>> On 6/30/23 23:53, Bruce Momjian wrote:
>>>> For a 4kB write, to say it is not partially written would be to require
>>>> the operating system to guarantee that the 4kB write is not split into
>>>> smaller writes which might each be atomic because smaller atomic writes
>>>> would not help us.
>>>
>>> Right, that's the dance we do to protect against torn pages. But Andres
>>> suggested that if you have modern storage and configure it correctly,
>>> writing with 4kB pages would be atomic. So we wouldn't need to do this
>>> FPI stuff, eliminating pretty significant source of write amplification.
>>
>> I agree the hardware is atomic for 4k writes, but do we know the OS
>> always issues 4k writes?
> 
> When using a sector size of 4K you *can't* make smaller writes via normal
> paths. The addressing unit is in sectors. The details obviously differ between
> storage protocol, but you pretty much always just specify a start sector and a
> number of sectors to be operated on.
> 
> Obviously the kernel could read 4k, modify 512 bytes in-memory, and then write
> 4k back, but that shouldn't be a danger here.  There might also be debug
> interfaces to allow reading/writing in different increments, but that'd not be
> something happening during normal operation.

I think it's important to point out that there's a physical and a logical
sector size. The "physical" one is what the drive does internally; the
"logical" one defines what the OS does.

Some drives have 4k physical sectors but only 512B logical sectors.
AFAIK most "old" SATA SSDs do it that way, for compatibility reasons.

New drives may have 4k physical sectors but typically support both 512B
and 4k logical sectors - my nvme SSDs do this, for example.

My understanding is that for drives with 4k physical+logical sectors,
the OS would only issue "full" 4k writes.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initdb-time block size specification

From: Bruce Momjian
Date:
On Fri, Jun 30, 2023 at 04:04:57PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote:
> > > [1] On linux I think you need to use stat() to figure out the st_dev for a
> > > file, then look in /proc/self/mountinfo for the block device, use the name
> > > of the file to look in /sys/block/$d/queue/physical_block_size.
> >
> > I just got a new server:
> >
> >     https://momjian.us/main/blogs/blog/2023.html#June_28_2023
> >
> > so tested this on my new M.2 NVME storage device:
> >
> >     $ /sys/block/nvme0n1/queue/physical_block_size
> >     262144
> 
> Ah, I got the relevant filename wrong. I think it's logical_block_size, not
> physical one (that's the size of addressing).  I didn't realize because the
> devices I looked at have the same...

That one reports 512 _bytes_ for me:

    $ cat /sys/block/nvme0n1/queue/logical_block_size
    512

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Initdb-time block size specification

From: Tomas Vondra
Date:
On 7/1/23 01:16, Bruce Momjian wrote:
> On Fri, Jun 30, 2023 at 04:04:57PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote:
>>>> [1] On linux I think you need to use stat() to figure out the st_dev for a
>>>> file, then look in /proc/self/mountinfo for the block device, use the name
>>>> of the file to look in /sys/block/$d/queue/physical_block_size.
>>>
>>> I just got a new server:
>>>
>>>     https://momjian.us/main/blogs/blog/2023.html#June_28_2023
>>>
>>> so tested this on my new M.2 NVME storage device:
>>>
>>>     $ /sys/block/nvme0n1/queue/physical_block_size
>>>     262144
>>
>> Ah, I got the relevant filename wrong. I think it's logical_block_size, not
>> physical one (that's the size of addressing).  I didn't realize because the
>> devices I looked at have the same...
> 
> That one reports 512 _bytes_ for me:
> 
>     $ cat /sys/block/nvme0n1/queue/logical_block_size
>     512
> 

What does "smartctl -a /dev/nvme0n1" say? There should be something like
this:

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -    4096       0         0
 1 +     512       0         0

which says the drive supports 4k and 512B sectors, and is currently
configured to use 512B sectors.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initdb-time block size specification

From
Bruce Momjian
Date:
On Sat, Jul  1, 2023 at 01:18:34AM +0200, Tomas Vondra wrote:
> What does "smartctl -a /dev/nvme0n1" say? There should be something like
> this:
> 
> Supported LBA Sizes (NSID 0x1)
> Id Fmt  Data  Metadt  Rel_Perf
>  0 -    4096       0         0
>  1 +     512       0         0
> 
> which says the drive supports 4k and 512B sectors, and is currently
> configured to use 512B sectors.

It says:

    Supported LBA Sizes (NSID 0x1)
    Id Fmt  Data  Metadt  Rel_Perf
     0 +     512       0         2
     1 -    4096       0         0

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Initdb-time block size specification

From
Andres Freund
Date:
Hi,

On 2023-06-30 16:28:59 -0500, David Christensen wrote:
> On Fri, Jun 30, 2023 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:
> > And indeed. Comparing e.g. TPC-H, I see *massive* regressions.  Some queries
> > are the same, but others regress by up to 70% (although more commonly around
> > 10-20%).
>
> Hmm, that is definitely not good.
>
> > That's larger than I thought, which makes me suspect that there's some bug in
> > the new code.
>
> Will do a little profiling here to see if I can figure out the
> regression. Which build optimization settings are you seeing this
> under?

gcc 12 with:

meson setup \
  -Doptimization=3 -Ddebug=true \
  -Dc_args="-ggdb -g3 -march=native -mtune=native -fno-plt -fno-semantic-interposition -Wno-array-bounds" \
  -Dc_link_args="-fuse-ld=mold -Wl,--gdb-index,--Bsymbolic" \
  ...

Relevant postgres settings:
-c huge_pages=on -c shared_buffers=12GB -c max_connections=120
-c work_mem=32MB
-c autovacuum=0 # I always do that for comparative benchmarks, too much variance
-c track_io_timing=on

The later run where I saw the smaller regression was with work_mem=1GB.  I
just had forgotten to adjust that.


I had loaded tpch scale 5 before, which is why I just used that.


FWIW, even just "SELECT count(*) FROM lineitem;" shows a substantial
regression.

I disabled parallelism, prewarmed the data and pinned postgres to a single
core to reduce noise. The result is the best of three (variance was low in all
cases).

                 HEAD       patch
index only scan  1896.364   2242.288
seq scan         1586.990   1628.042


A profile shows that 20% of the runtime in the IOS case is in
visibilitymap_get_status():

+   20.50%  postgres.new  postgres.new          [.] visibilitymap_get_status
+   19.54%  postgres.new  postgres.new          [.] ExecInterpExpr
+   14.47%  postgres.new  postgres.new          [.] IndexOnlyNext
+    6.47%  postgres.new  postgres.new          [.] index_deform_tuple_internal
+    4.67%  postgres.new  postgres.new          [.] ExecScan
+    4.12%  postgres.new  postgres.new          [.] btgettuple
+    3.97%  postgres.new  postgres.new          [.] ExecAgg
+    3.92%  postgres.new  postgres.new          [.] _bt_next
+    3.71%  postgres.new  postgres.new          [.] _bt_readpage
+    3.43%  postgres.new  postgres.new          [.] fetch_input_tuple
+    2.87%  postgres.new  postgres.new          [.] index_getnext_tid
+    2.45%  postgres.new  postgres.new          [.] MemoryContextReset
+    2.35%  postgres.new  postgres.new          [.] tts_virtual_clear
+    1.37%  postgres.new  postgres.new          [.] index_deform_tuple
+    1.14%  postgres.new  postgres.new          [.] ExecStoreVirtualTuple
+    1.13%  postgres.new  postgres.new          [.] PredicateLockPage
+    1.12%  postgres.new  postgres.new          [.] int8inc
+    1.04%  postgres.new  postgres.new          [.] ExecIndexOnlyScan
+    0.57%  postgres.new  postgres.new          [.] BufferGetBlockNumber

mostly due to

       │      BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
  2.46 │        lea    -0x60(,%rdx,4),%rcx
       │        xor    %edx,%edx
 59.79 │        div    %rcx


You can't have divisions for this kind of thing in the vicinity of a
performance-critical path. With compile-time constants the compiler can turn
this into shifts, but that's not possible as-is after the change.

While not quite as bad as divisions, the paths with multiplications are also
not going to be ok. E.g.
        return (Block) (BufferBlocks + ((Size) (buffer - 1)) * CLUSTER_BLOCK_SIZE);
is going to be noticeable.


You'd have to turn all of this into shifts (and enforce power of 2 sizes, if
you aren't yet).
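
E.g., for the multiplication case a minimal sketch of the shift-based
version (hypothetical names; assumes the shift amount is derived once at
startup from the power-of-2 block size):

static int cluster_block_size_shift;    /* e.g. 13 for 8192-byte blocks */

static inline char *
buffer_get_block_sketch(char *BufferBlocks, int buffer)
{
    /* same result as BufferBlocks + ((Size) (buffer - 1)) * CLUSTER_BLOCK_SIZE */
    return BufferBlocks + ((size_t) (buffer - 1) << cluster_block_size_shift);
}

Note that the visibility map divisor in the profile above (4 * blcksz -
0x60) is not itself a power of two even when the block size is, so that
path needs more than a plain shift.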


I don't think pre-computed tables are a viable answer FWIW. Even just going
through a memory indirection is going to be noticeable. This stuff is in a
crapton of hot code paths.

Greetings,

Andres Freund



Re: Initdb-time block size specification

From
Peter Eisentraut
Date:
On 01.07.23 00:21, Tomas Vondra wrote:
> Right, that's the dance we do to protect against torn pages. But Andres
> suggested that if you have modern storage and configure it correctly,
> writing with 4kB pages would be atomic. So we wouldn't need to do this
> FPI stuff, eliminating pretty significant source of write amplification.

This work in progress for the Linux kernel was also mentioned at PGCon: 
<https://lwn.net/Articles/933015/>.  Subject to various conditions, the 
kernel would then guarantee atomic writes for blocks larger than the 
hardware's native size.




Re: Initdb-time block size specification

From
David Christensen
Date:
Hi, enclosed is v2 of the variable blocksize patch.  This series is
based atop 9b581c5341.

Preparation phase:

0001 - add utility script for retokenizing all necessary scripts.
This is mainly for my own use in generating 0003, which is a simple
rename/preparation patch to change all symbols from their UPPER_CASE
to lower_case form, with several exceptions in renames.
0002 - add script to harness 0001 and apply to the relevant files in the repo
0003 - capture the effects of 0002 on the repo

The other patches in this series are as follows:

0004 - the "main" variable blocksize patch where the bulk of the code
changes take place - see comments here
0005 - utility functions for fast div/mod operations; basically
Montgomery multiplication
0006 - use fastdiv code in the visibility map, the main place where
this change is required
0007 - (optional) add/use libdivide for division which is license
compatible with other headers we bundle
0008 - (optional) tweaks to libdivide to make compiler/CI happy

I have also replaced multiple instances of division or multiplication
by BLOCKSZ with bitshift operations based on the number of bits in the
underlying blocksize.

The current approach is to replace any affected constant with an inline
switch statement that selects, based on an enum for the blocksize, the
compile-time calculation for that size.  In practice, with -O2 this
generates a simple lookup table inline in the assembly, with the
calculation costs paid at compile time.
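
As a rough illustration of the pattern (hypothetical names and a
placeholder formula, not the actual patch code):

#include <stdint.h>

/* placeholder; the real derived-constant formulas are more involved */
#define CALC_TUPLES_PER_PAGE(blcksz)    ((blcksz) / 64)

typedef enum BlockSizeClass
{
    BSZ_4K, BSZ_8K, BSZ_16K, BSZ_32K
} BlockSizeClass;

static inline uint32_t
tuples_per_page(BlockSizeClass c)
{
    /* every arm folds to a constant; with -O2 this compiles to a lookup table */
    switch (c)
    {
        case BSZ_4K:  return CALC_TUPLES_PER_PAGE(4096);
        case BSZ_8K:  return CALC_TUPLES_PER_PAGE(8192);
        case BSZ_16K: return CALC_TUPLES_PER_PAGE(16384);
        case BSZ_32K: return CALC_TUPLES_PER_PAGE(32768);
    }
    return 0;   /* not reached for valid enum values */
}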

The visibility map was the main hot path affected by the move away from
compile-time sizes in the previous version of this patch.  With the
modified approach in 0005/0006, this issue has been rectified in our
testing.

I have tested a few workloads with this modified patch and have seen
positive results compared to v1.  I look forward to additional
review/testing/feedback.

Thanks,

David

Attachments

Re: Initdb-time block size specification

From
David Christensen
Date:
Enclosed are TPC-H results for 1GB shared_buffers, 64MB work_mem on a
64GB laptop with SSD storage; everything else is default settings.

TL;DR: unpatched version: 17.30 seconds, patched version: 17.15; there
are some slight variations in runtime, but they seem to be within the
noise level at this point.

Attachments

Re: Initdb-time block size specification

From
John Naylor
Date:

On Thu, Aug 31, 2023 at 8:51 AM David Christensen <david.christensen@crunchydata.com> wrote:

> 0005 - utility functions for fast div/mod operations; basically
> Montgomery multiplication

+/*
+ * pg_fastmod - calculates the modulus of a 32-bit number against a constant
+ * divisor without using the division operator
+ */
+static inline uint32 pg_fastmod(uint32 n, uint32 divisor, uint64 fastinv)
+{
+#ifdef HAVE_INT128
+ uint64_t lowbits = fastinv * n;
+ return ((uint128)lowbits * divisor) >> 64;
+#else
+ return n % divisor;
+#endif
+}

Requiring 128-bit arithmetic to avoid serious regression is a non-starter as written. Software that relies on fast 128-bit multiplication has to do backflips to get that working on multiple platforms. But I'm not sure it's necessary -- if the max block number is UINT32_MAX and max block size is UINT16_MAX, can't we just use 64-bit multiplication?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Initdb-time block size specification

From
David Christensen
Date:
> + * pg_fastmod - calculates the modulus of a 32-bit number against a constant
> + * divisor without using the division operator
> + */
> +static inline uint32 pg_fastmod(uint32 n, uint32 divisor, uint64 fastinv)
> +{
> +#ifdef HAVE_INT128
> + uint64_t lowbits = fastinv * n;
> + return ((uint128)lowbits * divisor) >> 64;
> +#else
> + return n % divisor;
> +#endif
> +}
>
> Requiring 128-bit arithmetic to avoid serious regression is a non-starter as written. Software that relies on fast
> 128-bit multiplication has to do backflips to get that working on multiple platforms. But I'm not sure it's necessary --
> if the max block number is UINT32_MAX and max block size is UINT16_MAX, can't we just use 64-bit multiplication?

I was definitely hand-waving the additional implementation needed here
for non-native 128-bit support; the modulus algorithm as presented
requires four times the width of the divisor, so a uint16 implementation
should work on all 64-bit machines.  Certainly open to other ideas or
implementations; this was the one I was able to find initially.  If the
16-bit approach is all that is needed in practice, we can also see about
narrowing the domain and not worry about making this a general-purpose
function.
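
For reference, a sketch of what the narrowed version could look like
(hypothetical names, not yet vetted; it assumes the divisor fits in 16
bits and the numerator in 32, so 48 fractional bits give an exact
remainder and both multiplications fit in 64 bits):

#include <stdint.h>

static inline uint64_t
pg_fastinv16(uint16_t divisor)
{
    /* floor(2^48 / divisor) + 1, precomputed once per divisor */
    return ((UINT64_C(1) << 48) / divisor) + 1;
}

static inline uint16_t
pg_fastmod16(uint32_t n, uint16_t divisor, uint64_t fastinv)
{
    /* low 48 fractional bits of n / divisor */
    uint64_t lowbits = (fastinv * n) & ((UINT64_C(1) << 48) - 1);

    /* scale the fraction back up by the divisor to recover the remainder */
    return (uint16_t) ((lowbits * divisor) >> 48);
}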

Best,

David



Re: Initdb-time block size specification

From
David Christensen
Date:
> I was definitely hand-waving additional implementation here for
> non-native 128 bit support; the modulus algorithm as presented
> requires 4 times the space as the divisor, so a uint16 implementation
> should work for all 64-bit machines.  Certainly open to other ideas or
> implementations, this was the one I was able to find initially.  If
> the 16bit approach is all that is needed in practice we can also see
> about narrowing the domain and not worry about making this a
> general-purpose function.

Here's a patch atop the series which converts to 16-bit uints and
passes regressions, but I don't consider it well-vetted at this point.

David

Attachments

Re: Initdb-time block size specification

From
Robert Haas
Date:
On Thu, Aug 31, 2023 at 2:32 PM David Christensen
<david.christensen@crunchydata.com> wrote:
> Here's a patch atop the series which converts to 16-bit uints and
> passes regressions, but I don't consider it well-vetted at this point.

For what it's worth, my gut reaction to this patch series is similar
to that of Andres: I think it will be a disaster. If the disaster is
not evident to us, that's far more likely to mean that we've failed to
test the right things than it is to mean that there is no disaster.

I don't see that there is a lot of upside, either. I don't think we
have a lot of evidence that changing the block size is really going to
help performance. In fact, my guess is that there are large amounts of
code that are heavily optimized, without the authors even realizing
it, for 8kB blocks, because that's what we've always had. If we had
much larger or smaller blocks, the structure of heap pages or of the
various index AMs used for blocks might no longer be optimal, or might
be less optimal than they are for an 8kB block size. If you use really
large blocks, your blocks may need more internal structure than we
have today in order to avoid CPU inefficiencies. I suspect there's
been so little testing of non-default block sizes that I wouldn't even
count on the code to not be outright buggy.

If we could find a safe way to get rid of full page writes, I would
certainly agree that that was worth considering. I'm not sure that
anything in this thread adds up to that being a reasonable way to go,
but the savings would be massive.

I feel like the proposal here is a bit like deciding to change the
speed limit on all American highways from 65 mph or whatever it is to
130 mph or 32.5 mph and see which way works out best. The whole
infrastructure has basically been designed around the current rules.
The rate of curvature of the roads is appropriate for the speed that
you're currently allowed to drive on them. The vehicles are optimized
for long-term operation at about that speed. The people who drive the
vehicles are accustomed to driving at that speed, and the people who
maintain them are accustomed to the problems that happen when you
drive them at that speed. Just changing the speed limit doesn't change
all that other stuff, and changing all that other stuff is a truly
massive undertaking. Maybe this example somewhat overstates the
difficulties here, but I do think the difficulties are considerable.
The fact that we have 8kB block sizes has affected the thinking of
hundreds of developers over decades in making thousands or tens of
thousands or hundreds of thousands of decisions about algorithm
selection and page format and all kinds of stuff. Even if some other
page size seems to work better in a certain context, it's pretty hard
to believe that it has much chance of being better overall, even
without the added overhead of run-time configuration.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Initdb-time block size specification

From
Tomas Vondra
Date:

On 9/1/23 16:57, Robert Haas wrote:
> On Thu, Aug 31, 2023 at 2:32 PM David Christensen
> <david.christensen@crunchydata.com> wrote:
>> Here's a patch atop the series which converts to 16-bit uints and
>> passes regressions, but I don't consider it well-vetted at this point.
> 
> For what it's worth, my gut reaction to this patch series is similar
> to that of Andres: I think it will be a disaster. If the disaster is
> not evident to us, that's far more likely to mean that we've failed to
> test the right things than it is to mean that there is no disaster.
> 

Perhaps. The block size certainly affects a lot of places - both in
terms of the actual value, and being known (constant) at compile time.

> I don't see that there is a lot of upside, either. I don't think we
> have a lot of evidence that changing the block size is really going to
> help performance.

I don't think that's quite true. We have plenty of empirical evidence
that smaller block sizes bring significant improvements for certain
workloads. And we also have theoretical explanations for why that is.

> In fact, my guess is that there are large amounts of
> code that are heavily optimized, without the authors even realizing
> it, for 8kB blocks, because that's what we've always had. If we had
> much larger or smaller blocks, the structure of heap pages or of the
> various index AMs used for blocks might no longer be optimal, or might
> be less optimal than they are for an 8kB block size. If you use really
> large blocks, your blocks may need more internal structure than we
> have today in order to avoid CPU inefficiencies. I suspect there's
> been so little testing of non-default block sizes that I wouldn't even
> count on the code to not be outright buggy.
> 

Sure, and there are even various places where the page size implies hard
limits (e.g. index key size for btree indexes).

But so what? If that matters for your workload, keep using 8kB ...

> If we could find a safe way to get rid of full page writes, I would
> certainly agree that that was worth considering. I'm not sure that
> anything in this thread adds up to that being a reasonable way to go,
> but the savings would be massive.
> 

That's true, that'd be great. But that's clearly just the next level of
optimization. It doesn't mean that if you can't eliminate FPW for
whatever reason it's worthless.

> I feel like the proposal here is a bit like deciding to change the
> speed limit on all American highways from 65 mph or whatever it is to
> 130 mph or 32.5 mph and see which way works out best. The whole
> infrastructure has basically been designed around the current rules.
> The rate of curvature of the roads is appropriate for the speed that
> you're currently allowed to drive on them. The vehicles are optimized
> for long-term operation at about that speed. The people who drive the
> vehicles are accustomed to driving at that speed, and the people who
> maintain them are accustomed to the problems that happen when you
> drive them at that speed. Just changing the speed limit doesn't change
> all that other stuff, and changing all that other stuff is a truly
> massive undertaking. Maybe this example somewhat overstates the
> difficulties here, but I do think the difficulties are considerable.
> The fact that we have 8kB block sizes has affected the thinking of
> hundreds of developers over decades in making thousands or tens of
> thousands or hundreds of thousands of decisions about algorithm
> selection and page format and all kinds of stuff. Even if some other
> page size seems to work better in a certain context, it's pretty hard
> to believe that it has much chance of being better overall, even
> without the added overhead of run-time configuration.
> 

Except that no one is forcing you to actually go 130mph or 32mph, right?
You make it seem like this patch forces people to use some other page
size, but that's clearly not what it's doing - it gives you the option
to use smaller or larger block, if you chose to. Just like increasing
the speed limit to 130mph doesn't mean you can't keep going 65mph.

The thing is - we *already* allow using a different block size, except
that you have to do a custom build. This just makes it easier.

I don't have strong opinions on how the patch actually does that, and
there certainly can be negative effects of making it dynamic. And yes,
we will have to do more testing with non-default block sizes. But
frankly, that's a gap we probably need to address anyway, considering we
allow changing the block size.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initdb-time block size specification

From
Robert Haas
Date:
On Sat, Sep 2, 2023 at 3:09 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Except that no one is forcing you to actually go 130mph or 32mph, right?
> You make it seem like this patch forces people to use some other page
> size, but that's clearly not what it's doing - it gives you the option
> to use smaller or larger block, if you chose to. Just like increasing
> the speed limit to 130mph doesn't mean you can't keep going 65mph.
>
> The thing is - we *already* allow using a different block size, except
> that you have to do a custom build. This just makes it easier.

Right. Which is worth doing if it doesn't hurt performance and is
likely to be useful to a lot of people, and is not worth doing if it
will hurt performance and be useful to relatively few people. My bet
is on the latter.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Initdb-time block size specification

From
David Christensen
Date:
On Tue, Sep 5, 2023 at 9:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Sep 2, 2023 at 3:09 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > Except that no one is forcing you to actually go 130mph or 32mph, right?
> > You make it seem like this patch forces people to use some other page
> > size, but that's clearly not what it's doing - it gives you the option
> > to use smaller or larger block, if you chose to. Just like increasing
> > the speed limit to 130mph doesn't mean you can't keep going 65mph.
> >
> > The thing is - we *already* allow using a different block size, except
> > that you have to do a custom build. This just makes it easier.
>
> Right. Which is worth doing if it doesn't hurt performance and is
> likely to be useful to a lot of people, and is not worth doing if it
> will hurt performance and be useful to relatively few people. My bet
> is on the latter.

Agreed that this doesn't make sense if there are major performance
regressions; however, the goal here is to evaluate the patch against
other workloads and see whether that is the case. From my localized
testing, things were within acceptable noise levels with the latest
version.

I agree with Tomas' earlier thoughts: we already allow different block
sizes, and if there are baked-in algorithmic assumptions about block
size (which there probably are), then identifying those or places in
the code where we need additional work or test coverage will only
improve things overall for those non-standard block sizes.

Best,

David



Re: Initdb-time block size specification

From
Hannu Krosing
Date:
Something I also asked at this year's Unconference - do we currently
have Build Farm animals testing with different page sizes?

I'd say that testing all sizes from 4KB up (so 4, 8, 16, 32) should be
done at least before each release if not continuously.

-- Cheers

Hannu


On Tue, Sep 5, 2023 at 4:31 PM David Christensen
<david.christensen@crunchydata.com> wrote:
>
> On Tue, Sep 5, 2023 at 9:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Sat, Sep 2, 2023 at 3:09 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> > > Except that no one is forcing you to actually go 130mph or 32mph, right?
> > > You make it seem like this patch forces people to use some other page
> > > size, but that's clearly not what it's doing - it gives you the option
> > > to use smaller or larger block, if you chose to. Just like increasing
> > > the speed limit to 130mph doesn't mean you can't keep going 65mph.
> > >
> > > The thing is - we *already* allow using a different block size, except
> > > that you have to do a custom build. This just makes it easier.
> >
> > Right. Which is worth doing if it doesn't hurt performance and is
> > likely to be useful to a lot of people, and is not worth doing if it
> > will hurt performance and be useful to relatively few people. My bet
> > is on the latter.
>
> Agreed that this doesn't make sense if there are major performance
> regressions; however, the goal here is to evaluate the patch against
> other workloads and see whether that is the case. From my localized
> testing, things were within acceptable noise levels with the latest
> version.
>
> I agree with Tomas' earlier thoughts: we already allow different block
> sizes, and if there are baked-in algorithmic assumptions about block
> size (which there probably are), then identifying those or places in
> the code where we need additional work or test coverage will only
> improve things overall for those non-standard block sizes.
>
> Best,
>
> David
>
>



Re: Initdb-time block size specification

From
Andres Freund
Date:
Hi,

On 2023-09-05 21:52:18 +0200, Hannu Krosing wrote:
> Something I also asked at this year's Unconference - do we currently
> have Build Farm animals testing with different page sizes?

You can check that yourself as easily as anybody else.

Greetings,

Andres Freund



Re: Initdb-time block size specification

From
David Christensen
Date:
On Tue, Sep 5, 2023 at 2:52 PM Hannu Krosing <hannuk@google.com> wrote:
>
> Something I also asked at this year's Unconference - do we currently
> have Build Farm animals testing with different page sizes?
>
> I'd say that testing all sizes from 4KB up (so 4, 8, 16, 32) should be
> done at least before each release if not continuously.
>
> -- Cheers
>
> Hannu

The regression tests currently have a lot of breakage when running
against non-standard block sizes, so I would assume the answer here is
no.  I would expect that we will want to add regression test variants
or otherwise normalize results so they will work with differing block
sizes, but have not done that yet.



Re: Initdb-time block size specification

From
Hannu Krosing
Date:
Sure, I was just hoping that somebody already knew without needing to
specifically check :)

And as I see in David's response, the tests are actually broken for other sizes.

I'll see if I can (convince somebody to) set this up.

Cheers
Hannu

On Tue, Sep 5, 2023 at 10:23 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-09-05 21:52:18 +0200, Hannu Krosing wrote:
> > Something I also asked at this year's Unconference - do we currently
> > have Build Farm animals testing with different page sizes?
>
> You can check that yourself as easily as anybody else.
>
> Greetings,
>
> Andres Freund



Re: Initdb-time block size specification

From
David Christensen
Date:
Hi,

Here is version 3 of the patch series, rebased on 6d0c39a293.  The main change, per John N's suggestion, was to use only the 16-bit unsigned mod/div for the utility routines.  The breakdown of the patch series is the same as the last one, but re-including the descriptions here:

Preparation phase:

0001 - add utility script for retokenizing all necessary scripts.
This is mainly for my own use in generating 0003, which is a simple
rename/preparation patch to change all symbols from their UPPER_CASE
to lower_case form, with several exceptions in renames.
0002 - add script to harness 0001 and apply to the relevant files in the repo
0003 - capture the effects of 0002 on the repo

The other patches in this series are as follows:

0004 - the "main" variable blocksize patch where the bulk of the code
changes take place - see comments here
0005 - utility functions for fast div/mod operations; basically
Montgomery multiplication
0006 - use fastdiv code in the visibility map, the main place where
this change is required
0007 - (optional) add/use libdivide for division which is license
compatible with other headers we bundle
0008 - (optional) tweaks to libdivide to make compiler/CI happy

Best,

David
Attachments