Discussion: Initdb-time block size specification
Hi,

As discussed somewhat at PgCon, enclosed is v1 of a patch to provide variable block sizes; basically instead of BLCKSZ being a compile-time constant, a single set of binaries can support all of the block sizes Pg can support, using the value stored in pg_control as the basis. (Possible future plans would be to make this something even more dynamic, such as configured per tablespace, but this is out of scope; this just sets up the infrastructure for this.)

Whereas we had traditionally used BLCKSZ to indicate the compile-time selected block size, this commit adjusts things so the cluster block size can be selected at initdb time.

In order to code for this, we introduce a few new defines:

- CLUSTER_BLOCK_SIZE is the block size for this cluster itself. This is not valid until BlockSizeInit() has been called in the given backend, which we do as early as possible by parsing the ControlFile and using the blcksz field.

- MIN_BLOCK_SIZE and MAX_BLOCK_SIZE are the limits for the selectable block size. It is required that CLUSTER_BLOCK_SIZE be a power of 2 between these two constants.

- DEFAULT_BLOCK_SIZE is the moral equivalent of BLCKSZ; it is the built-in default value. This is used in a few places that just needed a buffer of an arbitrary size, but the dynamic value CLUSTER_BLOCK_SIZE should almost always be used instead.

- CLUSTER_RELSEG_SIZE is used instead of RELSEG_SIZE, since internally we are storing the segment size in terms of number of blocks. RELSEG_SIZE is still kept, but is expressed in terms of the number of blocks of DEFAULT_BLOCK_SIZE; CLUSTER_RELSEG_SIZE scales appropriately (and is the only thing used internally) to keep the same target total segment size regardless of block size.

This patch uses a precalculated table to store the block size itself, as well as additional derived values that have traditionally been compile-time constants (example: MaxHeapTuplesPerPage).
The traditional macro names are kept so code that doesn't care about it should not need to change, however the definition of these has changed (see the CalcXXX() routines in blocksize.h for details).

A new function, BlockSizeInit(), populates the appropriate values based on the target block size. This should be called as early as possible in any code that utilizes block sizes. This patch adds this in the appropriate place in the handful of src/bin/ programs that used BLCKSZ, so this caveat mainly impacts new code.

Code which had previously used BLCKSZ should likely be able to get away with changing these instances to CLUSTER_BLOCK_SIZE, unless you're using a structure allocated on the stack. In these cases, the compiler will complain about a dynamically sized structure. The solution is to utilize an expression with MAX_BLOCK_SIZE instead of BLCKSZ, ensuring enough stack space is allocated for the maximum size. This also does require using CLUSTER_BLOCK_SIZE or an expression based on it when actually using this structure, so in practice more stack space may be allocated than used; as long as there is plenty of stack this should have no specific impact on code.

Initial (basic) performance testing shows only minor changes with the pgbench -S benchmark, though this is obviously an area that will need considerable testing/verification across multiple workloads.

Thanks,
David
Attachments
On 6/30/23 19:35, David Christensen wrote:
> Hi,
>
> As discussed somewhat at PgCon, enclosed is v1 of a patch to provide variable block sizes; basically instead of BLCKSZ being a compile-time constant, a single set of binaries can support all of the block sizes Pg can support, using the value stored in pg_control as the basis. (Possible future plans would be to make this something even more dynamic, such as configured per tablespace, but this is out of scope; this just sets up the infrastructure for this.)
>
> Whereas we had traditionally used BLCKSZ to indicate the compile-time selected block size, this commit adjusts things so the cluster block size can be selected at initdb time.
>
> In order to code for this, we introduce a few new defines:
>
> - CLUSTER_BLOCK_SIZE is the block size for this cluster itself. This is not valid until BlockSizeInit() has been called in the given backend, which we do as early as possible by parsing the ControlFile and using the blcksz field.
>
> - MIN_BLOCK_SIZE and MAX_BLOCK_SIZE are the limits for the selectable block size. It is required that CLUSTER_BLOCK_SIZE be a power of 2 between these two constants.
>
> - DEFAULT_BLOCK_SIZE is the moral equivalent of BLCKSZ; it is the built-in default value. This is used in a few places that just needed a buffer of an arbitrary size, but the dynamic value CLUSTER_BLOCK_SIZE should almost always be used instead.
>
> - CLUSTER_RELSEG_SIZE is used instead of RELSEG_SIZE, since internally we are storing the segment size in terms of number of blocks. RELSEG_SIZE is still kept, but is expressed in terms of the number of blocks of DEFAULT_BLOCK_SIZE; CLUSTER_RELSEG_SIZE scales appropriately (and is the only thing used internally) to keep the same target total segment size regardless of block size.

Do we really want to prefix the option with CLUSTER_? That seems to just add a lot of noise into the patch, and I don't see much value in this rename.
I'd prefer keeping BLCKSZ and tweaking just the couple of places that need the "default" to use BLCKSZ_DEFAULT or something like that.

But more importantly, I'd say we use CAPITALIZED_NAMES for compile-time values, so after making this an initdb-time parameter we should abandon that (just like fc49e24fa69a did for segment sizes). Perhaps something like cluster_block_size would work ... (yes, that touches all the places too).

> This patch uses a precalculated table to store the block size itself, as well as additional derived values that have traditionally been compile-time constants (example: MaxHeapTuplesPerPage). The traditional macro names are kept so code that doesn't care about it should not need to change, however the definition of these has changed (see the CalcXXX() routines in blocksize.h for details).
>
> A new function, BlockSizeInit(), populates the appropriate values based on the target block size. This should be called as early as possible in any code that utilizes block sizes. This patch adds this in the appropriate place in the handful of src/bin/ programs that used BLCKSZ, so this caveat mainly impacts new code.
>
> Code which had previously used BLCKSZ should likely be able to get away with changing these instances to CLUSTER_BLOCK_SIZE, unless you're using a structure allocated on the stack. In these cases, the compiler will complain about a dynamically sized structure. The solution is to utilize an expression with MAX_BLOCK_SIZE instead of BLCKSZ, ensuring enough stack space is allocated for the maximum size. This also does require using CLUSTER_BLOCK_SIZE or an expression based on it when actually using this structure, so in practice more stack space may be allocated than used; as long as there is plenty of stack this should have no specific impact on code.
> Initial (basic) performance testing shows only minor changes with the pgbench -S benchmark, though this is obviously an area that will need considerable testing/verification across multiple workloads.

I wonder how to best evaluate/benchmark this. AFAICS what we want to measure is the extra cost of making the values dynamic (which means the compiler can't just optimize them out). I'd say a "pgbench -S" seems like a good test.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 30, 2023 at 1:14 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Do we really want to prefix the option with CLUSTER_? That seems to just add a lot of noise into the patch, and I don't see much value in this rename. I'd prefer keeping BLCKSZ and tweaking just the couple of places that need the "default" to use BLCKSZ_DEFAULT or something like that.
>
> But more importantly, I'd say we use CAPITALIZED_NAMES for compile-time values, so after making this an initdb-time parameter we should abandon that (just like fc49e24fa69a did for segment sizes). Perhaps something like cluster_block_size would work ... (yes, that touches all the places too).

Yes, I can see that being an equivalent change; thanks for the pointer there. Definitely "cluster_block_size" could be an approach, though since it's currently just a #define for GetBlockSize(), maybe we just replace it with the equivalent instead. I was mainly trying to make it something that was conceptually similar and easy to reason about without getting bogged down in the details locally, but I can see that ALL_CAPS does have a specific meaning. Also, eliminating the BLCKSZ symbol meant it was easier to catch anything which depended on that value. If we wanted to keep BLCKSZ, I'd be okay with that at this point vs the CLUSTER_BLOCK_SIZE approach; it could help to make the patch smaller.

> > Initial (basic) performance testing shows only minor changes with the pgbench -S benchmark, though this is obviously an area that will need considerable testing/verification across multiple workloads.
>
> I wonder how to best evaluate/benchmark this. AFAICS what we want to measure is the extra cost of making the values dynamic (which means the compiler can't just optimize them out). I'd say a "pgbench -S" seems like a good test.
Yep, I tested 100 runs apiece with pgbench -S at scale factor 100, default settings, for optimized builds of the same base commit with and without the patch; I saw a TPS reduction of around 1% in that case, but I do think we'd want to look at different workloads; I presume the largest impact would be seen when it's all in shared_buffers and no IO is required.

Thanks,
David
Hi,

On 2023-06-30 12:35:09 -0500, David Christensen wrote:
> As discussed somewhat at PgCon, enclosed is v1 of a patch to provide variable block sizes; basically instead of BLCKSZ being a compile-time constant, a single set of binaries can support all of the block sizes Pg can support, using the value stored in pg_control as the basis. (Possible future plans would be to make this something even more dynamic, such as configured per tablespace, but this is out of scope; this just sets up the infrastructure for this.)

I am extremely doubtful this is a good idea. For one it causes a lot of churn, but more importantly it turns currently cheap code paths into more expensive ones.

Changes like

> -#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE))

Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is actually variable.

I am fairly certain this is going to be causing substantial performance regressions. I think we should reject this even if we don't immediately find them, because it's almost guaranteed to cause some.

Besides this, I've not really heard any convincing justification for needing this in the first place.

Greetings,

Andres Freund
On Fri, Jun 30, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-06-30 12:35:09 -0500, David Christensen wrote:
> > As discussed somewhat at PgCon, enclosed is v1 of a patch to provide variable block sizes; basically instead of BLCKSZ being a compile-time constant, a single set of binaries can support all of the block sizes Pg can support, using the value stored in pg_control as the basis. (Possible future plans would be to make this something even more dynamic, such as configured per tablespace, but this is out of scope; this just sets up the infrastructure for this.)
>
> I am extremely doubtful this is a good idea. For one it causes a lot of churn, but more importantly it turns currently cheap code paths into more expensive ones.
>
> Changes like
>
> > -#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE))
>
> Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is actually variable.

Correct; that is mainly a notational device which would be easy enough to change (and presumably would follow along the lines of the commit Tomas pointed out above).

> I am fairly certain this is going to be causing substantial performance regressions. I think we should reject this even if we don't immediately find them, because it's almost guaranteed to cause some.

What would be considered substantial? Some overhead would be expected, but I think having an actual patch to evaluate lets us see what potential there is. It seems like this will likely be optimized as an offset stored in a register, so I wouldn't expect huge changes here. (There may be other approaches I haven't thought of in terms of getting this.)

> Besides this, I've not really heard any convincing justification for needing this in the first place.
Doing this would open up experiments in larger block sizes, so we would be able to have larger indexable tuples, say, or be able to store data types that are larger than current tuple row limits allow without dropping to TOAST (native vector data types come to mind as a candidate here). We've had 8k blocks for a long time while hardware has improved over 20+ years, and it would be interesting to see how tuning things would open up additional avenues for performance without requiring packagers to make a single choice on this regardless of use-case. (The fact that we allow compiling this at a different value suggests there is thought to be some utility in having this be something other than the default value.)

I just think it's one of those things that is hard to evaluate without actually having something specific, which is why we have this patch now.

Thanks,
David
Hi,

On 2023-06-30 14:09:55 -0500, David Christensen wrote:
> On Fri, Jun 30, 2023 at 1:14 PM Tomas Vondra
> > I wonder how to best evaluate/benchmark this. AFAICS what we want to measure is the extra cost of making the values dynamic (which means the compiler can't just optimize them out). I'd say a "pgbench -S" seems like a good test.
>
> Yep, I tested 100 runs apiece with pgbench -S at scale factor 100, default settings for optimized builds of the same base commit with and without the patch; saw a reduction of TPS around 1% in that case, but I do think we'd want to look at different workloads; I presume the largest impact would be seen when it's all in shared_buffers and no IO is required.

I think pgbench -S indeed isn't a good workload - the overhead for it is much more in context switches and instantiating executor state etc than code that is affected by the change.

And indeed. Comparing e.g. TPC-H, I see *massive* regressions. Some queries are the same, but others regress by up to 70% (although more commonly around 10-20%). That's larger than I thought, which makes me suspect that there's some bug in the new code.

Interestingly, repeating the benchmark with a larger work_mem setting, the regressions are still quite present, but smaller. I suspect the planner chooses smarter plans which move bottlenecks more towards hashjoin code etc, which won't be affected by this change.

IOW, you seriously need to evaluate analytics queries before this is worth looking at further.

Greetings,

Andres Freund
Hi,

On 2023-06-30 15:05:54 -0500, David Christensen wrote:
> > I am fairly certain this is going to be causing substantial performance regressions. I think we should reject this even if we don't immediately find them, because it's almost guaranteed to cause some.
>
> What would be considered substantial? Some overhead would be expected, but I think having an actual patch to evaluate lets us see what potential there is.

Anything beyond 1-2%, although even that imo is a hard sell.

> > Besides this, I've not really heard any convincing justification for needing this in the first place.
>
> Doing this would open up experiments in larger block sizes, so we would be able to have larger indexable tuples, say, or be able to store data types that are larger than currently supported for tuple row limits without dropping to toast (native vector data types come to mind as a candidate here).

You can do experiments today with the compile time option. Which does not require regressing performance for everyone.

> We've had 8k blocks for a long time while hardware has improved over 20+ years, and it would be interesting to see how tuning things would open up additional avenues for performance without requiring packagers to make a single choice on this regardless of use-case.

I suspect you're going to see more benefits from going to a *lower* setting than a higher one. Some practical issues aside, plenty of storage hardware these days would allow to get rid of FPIs if you go to 4k blocks (although it often requires explicit sysadmin action to reformat the drive into that mode etc). But obviously that's problematic from the "postgres limits" POV.

If we really wanted to do this - but I don't think we do - I'd argue for working on the buildsystem support to build the postgres binary multiple times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just exec's the relevant "real" binary based on the pg_control value.
I really don't see us ever wanting to make BLCKSZ runtime configurable within one postgres binary. There's just too much intrinsic overhead associated with that.

Greetings,

Andres Freund
On 6/30/23 22:05, David Christensen wrote:
> On Fri, Jun 30, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:
>>
>> ...
>>
>> Besides this, I've not really heard any convincing justification for needing this in the first place.
>
> Doing this would open up experiments in larger block sizes, so we would be able to have larger indexable tuples, say, or be able to store data types that are larger than currently supported for tuple row limits without dropping to toast (native vector data types come to mind as a candidate here). We've had 8k blocks for a long time while hardware has improved over 20+ years, and it would be interesting to see how tuning things would open up additional avenues for performance without requiring packagers to make a single choice on this regardless of use-case. (The fact that we allow compiling this at a different value suggests there is thought to be some utility having this be something other than the default value.)
>
> I just think it's one of those things that is hard to evaluate without actually having something specific, which is why we have this patch now.

But it's possible to evaluate that - you just need to rebuild with a different configuration option. Yes, allowing doing that at initdb is simpler and allows testing this on systems where rebuilding is not convenient. And having a binary that can deal with any block size would be nice too.

In fact, I did exactly that a year ago for a conference, and I spoke about it at the 2022 unconference too. Not sure if there's a recording from PGCon, but there is one from the other conference [1][2].

The short story is that the possible gains are significant (say +50%) for data sets that don't fit into RAM. But that was with the block size set at compile time; the question is what's the impact of making it a variable instead of a macro ...
[1] https://www.youtube.com/watch?v=mVKpoQxtCXk
[2] https://blog.pgaddict.com/pdf/block-sizes-postgresvision-2022.pdf

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/30/23 23:11, Andres Freund wrote:
> ...
>
> If we really wanted to do this - but I don't think we do - I'd argue for working on the buildsystem support to build the postgres binary multiple times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just exec's the relevant "real" binary based on the pg_control value. I really don't see us ever wanting to make BLCKSZ runtime configurable within one postgres binary. There's just too much intrinsic overhead associated with that.

I don't quite understand why we shouldn't do this (or at least try to). IMO the benefits of using smaller blocks were substantial (especially for 4kB, most likely due to matching the internal SSD page size). The other benefits (reducing WAL volume) seem rather interesting too.

Sure, there are challenges (e.g. the overhead due to making it dynamic). No doubt about that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 30, 2023 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:
> And indeed. Comparing e.g. TPC-H, I see *massive* regressions. Some queries are the same, but others regress by up to 70% (although more commonly around 10-20%).

Hmm, that is definitely not good.

> That's larger than I thought, which makes me suspect that there's some bug in the new code.

Will do a little profiling here to see if I can figure out the regression. Which build optimization settings are you seeing this under?

> Interestingly, repeating the benchmark with a larger work_mem setting, the regressions are still quite present, but smaller. I suspect the planner chooses smarter plans which move bottlenecks more towards hashjoin code etc, which won't be affected by this change.

Interesting.

> IOW, you seriously need to evaluate analytics queries before this is worth looking at further.

Makes sense, thanks for reviewing.

David
On Fri, Jun 30, 2023 at 4:12 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 6/30/23 22:05, David Christensen wrote:
> > On Fri, Jun 30, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:
> >>
> >> ...
> >>
> >> Besides this, I've not really heard any convincing justification for needing this in the first place.
> >
> > Doing this would open up experiments in larger block sizes, so we would be able to have larger indexable tuples, say, or be able to store data types that are larger than currently supported for tuple row limits without dropping to toast (native vector data types come to mind as a candidate here). We've had 8k blocks for a long time while hardware has improved over 20+ years, and it would be interesting to see how tuning things would open up additional avenues for performance without requiring packagers to make a single choice on this regardless of use-case. (The fact that we allow compiling this at a different value suggests there is thought to be some utility having this be something other than the default value.)
> >
> > I just think it's one of those things that is hard to evaluate without actually having something specific, which is why we have this patch now.
>
> But it's possible to evaluate that - you just need to rebuild with a different configuration option. Yes, allowing doing that at initdb is simpler and allows testing this on systems where rebuilding is not convenient. And having a binary that can deal with any block size would be nice too.
>
> In fact, I did exactly that a year ago for a conference, and I spoke about it at the 2022 unconference too. Not sure if there's a recording from PGCon, but there is one from the other conference [1][2].
>
> The short story is that the possible gains are significant (say +50%) for data sets that don't fit into RAM. But that was with the block size set at compile time; the question is what's the impact of making it a variable instead of a macro ...
>
> [1] https://www.youtube.com/watch?v=mVKpoQxtCXk
> [2] https://blog.pgaddict.com/pdf/block-sizes-postgresvision-2022.pdf

Cool, thanks for the video links; will review.

David
On Fri, Jun 30, 2023 at 4:27 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 6/30/23 23:11, Andres Freund wrote:
> > ...
> >
> > If we really wanted to do this - but I don't think we do - I'd argue for working on the buildsystem support to build the postgres binary multiple times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just exec's the relevant "real" binary based on the pg_control value. I really don't see us ever wanting to make BLCKSZ runtime configurable within one postgres binary. There's just too much intrinsic overhead associated with that.
>
> I don't quite understand why we shouldn't do this (or at least try to). IMO the benefits of using smaller blocks were substantial (especially for 4kB, most likely due to matching the internal SSD page size). The other benefits (reducing WAL volume) seem rather interesting too.

If it's dynamic, we could also be able to add detection of the best block size at initdb time, leading to improvements all around, say. :-)

> Sure, there are challenges (e.g. the overhead due to making it dynamic). No doubt about that.

I definitely agree that it seems worth looking into. If nothing else, being able to quantify just what/where the overhead comes from may be instructive for future efforts.

David
On Fri, Jun 30, 2023 at 4:17 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-06-30 15:05:54 -0500, David Christensen wrote:
> > > I am fairly certain this is going to be causing substantial performance regressions. I think we should reject this even if we don't immediately find them, because it's almost guaranteed to cause some.
> >
> > What would be considered substantial? Some overhead would be expected, but I think having an actual patch to evaluate lets us see what potential there is.
>
> Anything beyond 1-2%, although even that imo is a hard sell.

I'd agree that that threshold seems like a reasonable target, and anything much above that would be regressive.

> > > Besides this, I've not really heard any convincing justification for needing this in the first place.
> >
> > Doing this would open up experiments in larger block sizes, so we would be able to have larger indexable tuples, say, or be able to store data types that are larger than currently supported for tuple row limits without dropping to toast (native vector data types come to mind as a candidate here).
>
> You can do experiments today with the compile time option. Which does not require regressing performance for everyone.

Sure, not arguing that this is more performant than the current approach.

> > We've had 8k blocks for a long time while hardware has improved over 20+ years, and it would be interesting to see how tuning things would open up additional avenues for performance without requiring packagers to make a single choice on this regardless of use-case.
>
> I suspect you're going to see more benefits from going to a *lower* setting than a higher one. Some practical issues aside, plenty of storage hardware these days would allow to get rid of FPIs if you go to 4k blocks (although it often requires explicit sysadmin action to reformat the drive into that mode etc). But obviously that's problematic from the "postgres limits" POV.
>
> If we really wanted to do this - but I don't think we do - I'd argue for working on the buildsystem support to build the postgres binary multiple times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just exec's the relevant "real" binary based on the pg_control value. I really don't see us ever wanting to make BLCKSZ runtime configurable within one postgres binary. There's just too much intrinsic overhead associated with that.

You may well be right, but I think if we haven't tried to do that and measured it, it's hard to say exactly. There are of course more parts of the system that care about BLCKSZ than the backend, plus you'd need to build other extensions to support each option, so there is a lot more that would need to change. (That's neither here nor there, as my approach also involves changing all those places, so change isn't inherently bad; just saying it's not a trivial solution to merely iterate over the block size for binaries.)

David
On 6/30/23 23:11, Andres Freund wrote:
> Hi,
>
> ...
>
> I suspect you're going to see more benefits from going to a *lower* setting than a higher one. Some practical issues aside, plenty of storage hardware these days would allow to get rid of FPIs if you go to 4k blocks (although it often requires explicit sysadmin action to reformat the drive into that mode etc). But obviously that's problematic from the "postgres limits" POV.

I wonder what the conditions/options for disabling FPI are. I kinda assume it'd apply to new drives with 4k sectors, with properly aligned partitions etc. But I haven't seen any particularly clear confirmation that's correct.

> If we really wanted to do this - but I don't think we do - I'd argue for working on the buildsystem support to build the postgres binary multiple times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just exec's the relevant "real" binary based on the pg_control value. I really don't see us ever wanting to make BLCKSZ runtime configurable within one postgres binary. There's just too much intrinsic overhead associated with that.

How would that work for extensions which may be built for a particular BLCKSZ value (like pageinspect)?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote:
> On 6/30/23 23:11, Andres Freund wrote:
> > Hi,
> >
> > ...
> >
> > I suspect you're going to see more benefits from going to a *lower* setting than a higher one. Some practical issues aside, plenty of storage hardware these days would allow to get rid of FPIs if you go to 4k blocks (although it often requires explicit sysadmin action to reformat the drive into that mode etc). But obviously that's problematic from the "postgres limits" POV.
>
> I wonder what are the conditions/options for disabling FPI. I kinda assume it'd apply to new drives with 4k sectors, with properly aligned partitions etc. But I haven't seen any particularly clear confirmation that's correct.

I don't think we have ever had to study this --- we just request the write from the operating system, and we either get a successful reply or we go into WAL recovery to reread the pre-image. We never really care if the write is atomic, e.g., an 8kB write can be done as two 4kB writes or four 2kB writes --- we don't care --- we only care whether they are all done or not. For a 4kB write, to say it is not partially written would require the operating system to guarantee that the 4kB write is not split into smaller writes which might each be atomic, because smaller atomic writes would not help us.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.
Hi,

On 2023-06-30 23:27:45 +0200, Tomas Vondra wrote:
> On 6/30/23 23:11, Andres Freund wrote:
> > ...
> >
> > If we really wanted to do this - but I don't think we do - I'd argue for working on the buildsystem support to build the postgres binary multiple times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just exec's the relevant "real" binary based on the pg_control value. I really don't see us ever wanting to make BLCKSZ runtime configurable within one postgres binary. There's just too much intrinsic overhead associated with that.
>
> I don't quite understand why we shouldn't do this (or at least try to).
>
> IMO the benefits of using smaller blocks were substantial (especially for 4kB, most likely due to matching the internal SSD page size). The other benefits (reducing WAL volume) seem rather interesting too.

Mostly because I think there are bigger gains to be had elsewhere. IME not a whole lot of storage ships by default with externally visible 4k sectors; it needs to be manually reformatted [1], which loses all data, so it has to be done initially. Then postgres would also need OS-specific trickery to figure out that the IO stack is indeed entirely 4k (checking the sector size is not enough). And you run into the issue that suddenly the #column and index-tuple-size limits are lower, which won't make it easier.

I think we should change the default of the WAL blocksize to 4k though. There's practically no downside, and it drastically reduces postgres-side write amplification in many transactional workloads, by only writing out partially filled 4k pages instead of partially filled 8k pages.

> Sure, there are challenges (e.g. the overhead due to making it dynamic). No doubt about that.

I don't think the runtime-dynamic overhead is avoidable with reasonable effort (leaving aside compiling the code multiple times and switching between builds).
If we were to start building postgres for multiple compile-time settings, I think there are uses other than switching between BLCKSZ, potentially more interesting. E.g. you can see substantially improved performance by being able to use SSE4.2 without CPU dispatch (partially because it allows compiler autovectorization, partially because it allows the compiler to use newer non-vectorized math instructions (when targeting AVX, IIRC), partially because the dispatch overhead is not insubstantial). Another example: ARMv8 performance is substantially better if you target ARMv8.1-A instead of ARMv8.0, due to having atomic instructions instead of LL/SC (it still baffles me that they didn't do this earlier, ll/sc is just inherently inefficient). Greetings, Andres Freund [1] To see the supported LBA formats: for d in /dev/nvme*n1; do echo "$d:"; sudo nvme id-ns -H $d | grep '^LBA Format'; echo; done
Hi, On 2023-06-30 23:42:30 +0200, Tomas Vondra wrote: > I wonder what are the conditions/options for disabling FPI. I kinda > assume it'd apply to new drives with 4k sectors, with properly aligned > partitions etc. But I haven't seen any particularly clear confirmation > that's correct. Yea, it's not trivial. And the downsides are also substantial from a replication / crash recovery performance POV - even with reading blocks ahead of WAL replay, it's hard to beat just sequentially reading nearly all the data you're going to need. > On 6/30/23 23:11, Andres Freund wrote: > > If we really wanted to do this - but I don't think we do - I'd argue for > > working on the buildsystem support to build the postgres binary multiple > > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just > > exec's the relevant "real" binary based on the pg_control value. I really > > don't see us ever wanting to make BLCKSZ runtime configurable within one > > postgres binary. There's just too much intrinsic overhead associated with > > that. > > How would that work for extensions which may be built for a particular > BLCKSZ value (like pageinspect)? I think we'd need to do something similar for extensions, likely loading them from a path that includes the "subvariant" the server is currently running. Or alternatively adding a suffix to the filename indicating the variant. Something like pageinspect.x86-64-v4-4kB.so. The x86-64-v* stuff is something gcc and clang added a couple years ago, so that not every project has to define different "baseline" levels. I think it was done in collaboration with the System V/Linux AMD64 ABI specification group ([1]). Greetings, Andres [1] https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/master/raw/x86-64-ABI/abi.pdf?job=build section 3.1.1.
On 6/30/23 23:53, Bruce Momjian wrote: > On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote: >> >> >> On 6/30/23 23:11, Andres Freund wrote: >>> Hi, >>> >>> ... >>> >>> I suspect you're going to see more benefits from going to a *lower* setting >>> than a higher one. Some practical issues aside, plenty of storage hardware >>> these days would allow to get rid of FPIs if you go to 4k blocks (although it >>> often requires explicit sysadmin action to reformat the drive into that mode >>> etc). But obviously that's problematic from the "postgres limits" POV. >>> >> >> I wonder what are the conditions/options for disabling FPI. I kinda >> assume it'd apply to new drives with 4k sectors, with properly aligned >> partitions etc. But I haven't seen any particularly clear confirmation >> that's correct. > > I don't think we have ever had to study this --- we just request the > write to the operating system, and we either get a successful reply or > we go into WAL recovery to reread the pre-image. We never really care > if the write is atomic, e.g., an 8k write can be done in 2 4kB writes 4 > 2kB writes --- we don't care --- we only care if they are all done or > not. > > For a 4kB write, to say it is not partially written would be to require > the operating system to guarantee that the 4kB write is not split into > smaller writes which might each be atomic because smaller atomic writes > would not help us. > Right, that's the dance we do to protect against torn pages. But Andres suggested that if you have modern storage and configure it correctly, writing with 4kB pages would be atomic. So we wouldn't need to do this FPI stuff, eliminating pretty significant source of write amplification. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jul 1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote: > On 6/30/23 23:53, Bruce Momjian wrote: > > For a 4kB write, to say it is not partially written would be to require > > the operating system to guarantee that the 4kB write is not split into > > smaller writes which might each be atomic because smaller atomic writes > > would not help us. > > Right, that's the dance we do to protect against torn pages. But Andres > suggested that if you have modern storage and configure it correctly, > writing with 4kB pages would be atomic. So we wouldn't need to do this > FPI stuff, eliminating pretty significant source of write amplification. I agree the hardware is atomic for 4k writes, but do we know the OS always issues 4k writes? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Hi, On 2023-06-30 17:53:34 -0400, Bruce Momjian wrote: > On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote: > > On 6/30/23 23:11, Andres Freund wrote: > > > I suspect you're going to see more benefits from going to a *lower* setting > > > than a higher one. Some practical issues aside, plenty of storage hardware > > > these days would allow to get rid of FPIs if you go to 4k blocks (although it > > > often requires explicit sysadmin action to reformat the drive into that mode > > > etc). But obviously that's problematic from the "postgres limits" POV. > > > > > > > I wonder what are the conditions/options for disabling FPI. I kinda > > assume it'd apply to new drives with 4k sectors, with properly aligned > > partitions etc. But I haven't seen any particularly clear confirmation > > that's correct. > > I don't think we have ever had to study this --- we just request the > write to the operating system, and we either get a successful reply or > we go into WAL recovery to reread the pre-image. We never really care > if the write is atomic, e.g., an 8k write can be done in 2 4kB writes 4 > 2kB writes --- we don't care --- we only care if they are all done or > not. Well, that works because we have FPI. This sub-discussion is motivated by getting rid of FPIs. > For a 4kB write, to say it is not partially written would be to require > the operating system to guarantee that the 4kB write is not split into > smaller writes which might each be atomic because smaller atomic writes > would not help us. That's why we're talking about drives with 4k sector size - you *can't* split the writes below that. The problem is that, as far as I know, it's not always obvious what block size is being used on the actual storage level. It's not even trivial when operating on a filesystem directly stored on a single block device ([1]). Once there's things like LVM or disk encryption involved, it gets pretty hairy ([2]). Once you know all the block devices, it's not too bad, but ...
Greetings, Andres Freund [1] On linux I think you need to use stat() to figure out the st_dev for a file, then look in /proc/self/mountinfo for the block device, use the name of the file to look in /sys/block/$d/queue/physical_block_size. [2] The above doesn't work because e.g. a device mapper target might only support 4k sectors, even though the sectors on the underlying storage device are 512b sectors. E.g. my root filesystem is encrypted, and if you follow the above recipe (with the added step of resolving the symlink to know the actual device name), you would see a 4k sector size. Even though the underlying NVMe disk only supports 512b sectors.
On 7/1/23 00:05, Andres Freund wrote: > Hi, > > On 2023-06-30 23:27:45 +0200, Tomas Vondra wrote: >> On 6/30/23 23:11, Andres Freund wrote: >>> ... >>> >>> If we really wanted to do this - but I don't think we do - I'd argue for >>> working on the buildsystem support to build the postgres binary multiple >>> times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just >>> exec's the relevant "real" binary based on the pg_control value. I really >>> don't see us ever wanting to make BLCKSZ runtime configurable within one >>> postgres binary. There's just too much intrinsic overhead associated with >>> that. >> >> I don't quite understand why we shouldn't do this (or at least try to). >> >> IMO the benefits of using smaller blocks were substantial (especially >> for 4kB, most likely due matching the internal SSD page size). The other >> benefits (reducing WAL volume) seem rather interesting too. > > Mostly because I think there are bigger gains to be had elsewhere. > I think that decision is up to whoever chooses to work on it, especially if performance is not their primary motivation (IIRC this was discussed as part of the TDE session). > IME not a whole lot of storage ships by default with externally visible 4k > sectors, but needs to be manually reformated [1], which looses all data, so it > has to be done initially. I don't see why "you have to configure stuff" would be a reason against improvements in this area. I don't know how prevalent storage with 4k sectors is now, but AFAIK it's not hard to get and it's likely to get yet more common in the future. FWIW I don't think the benefits of different (lower) page sizes hinge on 4k sectors - it's just that not having to do FPIs would make it even more interesting. > Then postgres would also need OS specific trickery > to figure out that indeed the IO stack is entirely 4k (checking sector size is > not enough). 
I haven't suggested we should be doing that automatically (would be nice, but I'd be happy with knowing when it's safe to disable FPW using the GUC in config). But knowing when it's safe would make it yet more interesting to be able to use a different page size at initdb. > And you run into the issue that suddenly the #column and > index-tuple-size limits are lower, which won't make it easier. > True. This limit is annoying, but no one is proposing to change the default page size. initdb would just provide a more convenient way to do that, but the user would have to check. (I rather doubt many people actually index such large values). > > I think we should change the default of the WAL blocksize to 4k > though. There's practically no downsides, and it drastically reduces > postgres-side write amplification in many transactional workloads, by only > writing out partially filled 4k pages instead of partially filled 8k pages. > +1 (although in my tests the benefits were much smaller than for BLCKSZ) > >> Sure, there are challenges (e.g. the overhead due to making it dynamic). >> No doubt about that. > > I don't think the runtime-dynamic overhead is avoidable with reasonable effort > (leaving aside compiling code multiple times and switching between). > > If we were to start building postgres for multiple compile-time settings, I > think there are uses other than switching between BLCKSZ, potentially more > interesting. E.g. you can see substantially improved performance by being able > to use SSE4.2 without CPU dispatch (partially because it allows compiler > autovectorization, partially because it allows to compiler to use newer > non-vectorized math instructions (when targetting AVX IIRC), partially because > the dispatch overhead is not insubstantial).
Another example: ARMv8 > performance is substantially better if you target ARMv8.1-A instead of > ARMv8.0, due to having atomic instructions instead of LL/SC (it still baffles > me that they didn't do this earlier, ll/sc is just inherently inefficient). > Maybe, although I think it depends on what parts of the code this would affect. If it's sufficiently small/isolated, it'd be possible to have multiple paths, specialized to a particular page size (pretty common technique for GPU/SIMD, I believe). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jun 30, 2023 at 03:51:18PM -0700, Andres Freund wrote: > > For a 4kB write, to say it is not partially written would be to require > > the operating system to guarantee that the 4kB write is not split into > > smaller writes which might each be atomic because smaller atomic writes > > would not help us. > > That's why were talking about drives with 4k sector size - you *can't* split > the writes below that. Okay, good point. > The problem is that, as far as I know,it's not always obvious what block size > is being used on the actual storage level. It's not even trivial when > operating on a filesystem directly stored on a single block device ([1]). Once > there's things like LVM or disk encryption involved, it gets pretty hairy > ([2]). Once you know all the block devices, it's not too bad, but ... > > Greetings, > > Andres Freund > > [1] On linux I think you need to use stat() to figure out the st_dev for a > file, then look in /proc/self/mountinfo for the block device, use the name > of the file to look in /sys/block/$d/queue/physical_block_size. I just got a new server: https://momjian.us/main/blogs/blog/2023.html#June_28_2023 so tested this on my new M.2 NVME storage device: $ cat /sys/block/nvme0n1/queue/physical_block_size 262144 that's 256k, not 4k. > [2] The above doesn't work because e.g. a device mapper target might only > support 4k sectors, even though the sectors on the underlying storage device > are 512b sectors. E.g. my root filesystem is encrypted, and if you follow the > above recipe (with the added step of resolving the symlink to know the actual > device name), you would see a 4k sector size. Even though the underlying NVMe > disk only supports 512b sectors. Good point. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On 2023-06-30 18:37:39 -0400, Bruce Momjian wrote: > On Sat, Jul 1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote: > > On 6/30/23 23:53, Bruce Momjian wrote: > > > For a 4kB write, to say it is not partially written would be to require > > > the operating system to guarantee that the 4kB write is not split into > > > smaller writes which might each be atomic because smaller atomic writes > > > would not help us. > > > > Right, that's the dance we do to protect against torn pages. But Andres > > suggested that if you have modern storage and configure it correctly, > > writing with 4kB pages would be atomic. So we wouldn't need to do this > > FPI stuff, eliminating pretty significant source of write amplification. > > I agree the hardware is atomic for 4k writes, but do we know the OS > always issues 4k writes? When using a sector size of 4K you *can't* make smaller writes via normal paths. The addressing unit is in sectors. The details obviously differ between storage protocol, but you pretty much always just specify a start sector and a number of sectors to be operated on. Obviously the kernel could read 4k, modify 512 bytes in-memory, and then write 4k back, but that shouldn't be a danger here. There might also be debug interfaces to allow reading/writing in different increments, but that'd not be something happening during normal operation.
On Fri, Jun 30, 2023 at 06:58:20PM -0400, Bruce Momjian wrote: > I just got a new server: > > https://momjian.us/main/blogs/blog/2023.html#June_28_2023 > > so tested this on my new M.2 NVME storage device: > > $ /sys/block/nvme0n1/queue/physical_block_size > 262144 > > that's 256k, not 4k. I have another approach to this. My storage device has power protection, so even though it has a 256k physical block size, it should be fine with 4k write atomicity. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Hi, On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote: > > [1] On linux I think you need to use stat() to figure out the st_dev for a > > file, then look in /proc/self/mountinfo for the block device, use the name > > of the file to look in /sys/block/$d/queue/physical_block_size. > > I just got a new server: > > https://momjian.us/main/blogs/blog/2023.html#June_28_2023 > > so tested this on my new M.2 NVME storage device: > > $ /sys/block/nvme0n1/queue/physical_block_size > 262144 Ah, I got the relevant filename wrong. I think it's logical_block_size, not physical one (that's the size of addressing). I didn't realize because the devices I looked at have the same... Regards, Andres
On 7/1/23 00:59, Andres Freund wrote: > On 2023-06-30 18:37:39 -0400, Bruce Momjian wrote: >> On Sat, Jul 1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote: >>> On 6/30/23 23:53, Bruce Momjian wrote: >>>> For a 4kB write, to say it is not partially written would be to require >>>> the operating system to guarantee that the 4kB write is not split into >>>> smaller writes which might each be atomic because smaller atomic writes >>>> would not help us. >>> >>> Right, that's the dance we do to protect against torn pages. But Andres >>> suggested that if you have modern storage and configure it correctly, >>> writing with 4kB pages would be atomic. So we wouldn't need to do this >>> FPI stuff, eliminating pretty significant source of write amplification. >> >> I agree the hardware is atomic for 4k writes, but do we know the OS >> always issues 4k writes? > > When using a sector size of 4K you *can't* make smaller writes via normal > paths. The addressing unit is in sectors. The details obviously differ between > storage protocol, but you pretty much always just specify a start sector and a > number of sectors to be operated on. > > Obviously the kernel could read 4k, modify 512 bytes in-memory, and then write > 4k back, but that shouldn't be a danger here. There might also be debug > interfaces to allow reading/writing in different increments, but that'd not be > something happening during normal operation. I think it's important to point out that there's a physical and a logical sector size. The "physical" is what the drive does internally, the "logical" defines what the OS does. Some drives have 4k physical sectors but only 512B logical sectors. AFAIK most "old" SATA SSDs do it that way, for compatibility reasons. New drives may have 4k physical sectors but typically support both 512B and 4k logical sectors - my nvme SSDs do this, for example. My understanding is that for drives with 4k physical+logical sectors, the OS would only issue "full" 4k writes.
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jun 30, 2023 at 04:04:57PM -0700, Andres Freund wrote: > Hi, > > On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote: > > > [1] On linux I think you need to use stat() to figure out the st_dev for a > > > file, then look in /proc/self/mountinfo for the block device, use the name > > > of the file to look in /sys/block/$d/queue/physical_block_size. > > > > I just got a new server: > > > > https://momjian.us/main/blogs/blog/2023.html#June_28_2023 > > > > so tested this on my new M.2 NVME storage device: > > > > $ /sys/block/nvme0n1/queue/physical_block_size > > 262144 > > Ah, I got the relevant filename wrong. I think it's logical_block_size, not > physical one (that's the size of addressing). I didn't realize because the > devices I looked at have the same... That one reports 512 _bytes_ for me: $ cat /sys/block/nvme0n1/queue/logical_block_size 512 -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On 7/1/23 01:16, Bruce Momjian wrote: > On Fri, Jun 30, 2023 at 04:04:57PM -0700, Andres Freund wrote: >> Hi, >> >> On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote: >>>> [1] On linux I think you need to use stat() to figure out the st_dev for a >>>> file, then look in /proc/self/mountinfo for the block device, use the name >>>> of the file to look in /sys/block/$d/queue/physical_block_size. >>> >>> I just got a new server: >>> >>> https://momjian.us/main/blogs/blog/2023.html#June_28_2023 >>> >>> so tested this on my new M.2 NVME storage device: >>> >>> $ /sys/block/nvme0n1/queue/physical_block_size >>> 262144 >> >> Ah, I got the relevant filename wrong. I think it's logical_block_size, not >> physical one (that's the size of addressing). I didn't realize because the >> devices I looked at have the same... > > That one reports 512 _bytes_ for me: > > $ cat /sys/block/nvme0n1/queue/logical_block_size > 512 > What does "smartctl -a /dev/nvme0n1" say? There should be something like this: Supported LBA Sizes (NSID 0x1) Id Fmt Data Metadt Rel_Perf 0 - 4096 0 0 1 + 512 0 0 which says the drive supports 4k and 512B sectors, and is currently configured to use 512B sectors. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jul 1, 2023 at 01:18:34AM +0200, Tomas Vondra wrote: > What does "smartctl -a /dev/nvme0n1" say? There should be something like > this: > > Supported LBA Sizes (NSID 0x1) > Id Fmt Data Metadt Rel_Perf > 0 - 4096 0 0 > 1 + 512 0 0 > > which says the drive supports 4k and 512B sectors, and is currently > configures to use 512B sectors. It says: Supported LBA Sizes (NSID 0x1) Id Fmt Data Metadt Rel_Perf 0 + 512 0 2 1 - 4096 0 0 -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Hi, On 2023-06-30 16:28:59 -0500, David Christensen wrote: > On Fri, Jun 30, 2023 at 3:29 PM Andres Freund <andres@anarazel.de> wrote: > >> And indeed. Comparing e.g. TPC-H, I see *massive* regressions. Some queries > > are the same, but others regress by up to 70% (although more commonly around > > 10-20%). > > Hmm, that is definitely not good. > > > That's larger than I thought, which makes me suspect that there's some bug in > > the new code. > > Will do a little profiling here to see if I can figure out the > regression. Which build optimization settings are you seeing this > under? gcc 12 with: meson setup \ -Doptimization=3 -Ddebug=true \ -Dc_args="-ggdb -g3 -march=native -mtune=native -fno-plt -fno-semantic-interposition -Wno-array-bounds" \ -Dc_link_args="-fuse-ld=mold -Wl,--gdb-index,--Bsymbolic" \ ... Relevant postgres settings: -c huge_pages=on -c shared_buffers=12GB -c max_connections=120 -c work_mem=32MB -c autovacuum=0 # I always do that for comparative benchmarks, too much variance -c track_io_timing=on The later run where I saw the smaller regression was with work_mem=1GB. I just had forgotten to adjust that. I had loaded tpch scale 5 before, which is why I just used that. FWIW, even just "SELECT count(*) FROM lineitem;" shows a substantial regression. I disabled parallelism, prewarmed the data and pinned postgres to a single core to reduce noise. The result is the best of three (variance was low in all cases). HEAD patch index only scan 1896.364 2242.288 seq scan 1586.990 1628.042 A profile shows that 20% of the runtime in the IOS case is in visibilitymap_get_status(): + 20.50% postgres.new postgres.new [.] visibilitymap_get_status + 19.54% postgres.new postgres.new [.] ExecInterpExpr + 14.47% postgres.new postgres.new [.] IndexOnlyNext + 6.47% postgres.new postgres.new [.] index_deform_tuple_internal + 4.67% postgres.new postgres.new [.] ExecScan + 4.12% postgres.new postgres.new [.] btgettuple + 3.97% postgres.new postgres.new [.] 
ExecAgg + 3.92% postgres.new postgres.new [.] _bt_next + 3.71% postgres.new postgres.new [.] _bt_readpage + 3.43% postgres.new postgres.new [.] fetch_input_tuple + 2.87% postgres.new postgres.new [.] index_getnext_tid + 2.45% postgres.new postgres.new [.] MemoryContextReset + 2.35% postgres.new postgres.new [.] tts_virtual_clear + 1.37% postgres.new postgres.new [.] index_deform_tuple + 1.14% postgres.new postgres.new [.] ExecStoreVirtualTuple + 1.13% postgres.new postgres.new [.] PredicateLockPage + 1.12% postgres.new postgres.new [.] int8inc + 1.04% postgres.new postgres.new [.] ExecIndexOnlyScan + 0.57% postgres.new postgres.new [.] BufferGetBlockNumber mostly due to │ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk); 2.46 │ lea -0x60(,%rdx,4),%rcx │ xor %edx,%edx 59.79 │ div %rcx You can't have divisions for this kind of thing in the vicinity of a performance-critical path. With compile-time constants the compiler can turn this into shifts, but that's not possible as-is after the change. While not quite as bad as divisions, the paths with multiplications are also not going to be ok. E.g. return (Block) (BufferBlocks + ((Size) (buffer - 1)) * CLUSTER_BLOCK_SIZE); is going to be noticeable. You'd have to turn all of this into shifts (and enforce power of 2 sizes, if you aren't yet). I don't think pre-computed tables are a viable answer FWIW. Even just going through a memory indirection is going to be noticeable. This stuff is in a crapton of hot code paths. Greetings, Andres Freund
On 01.07.23 00:21, Tomas Vondra wrote: > Right, that's the dance we do to protect against torn pages. But Andres > suggested that if you have modern storage and configure it correctly, > writing with 4kB pages would be atomic. So we wouldn't need to do this > FPI stuff, eliminating pretty significant source of write amplification. This work in progress for the Linux kernel was also mentioned at PGCon: <https://lwn.net/Articles/933015/>. Subject to various conditions, the kernel would then guarantee atomic writes for blocks larger than the hardware's native size.
Hi, enclosed is v2 of the variable blocksize patch. This series is based atop 9b581c5341. Preparation phase: 0001 - add utility script for retokenizing all necessary scripts. This is mainly for my own use in generating 0003, which is a simple rename/preparation patch to change all symbols from their UPPER_CASE to lower_case form, with several exceptions in renames. 0002 - add script to harness 0001 and apply to the relevant files in the repo 0003 - capture the effects of 0002 on the repo The other patches in this series are as follows: 0004 - the "main" variable blocksize patch where the bulk of the code changes take place - see comments here 0005 - utility functions for fast div/mod operations; basically Montgomery multiplication 0006 - use fastdiv code in the visibility map, the main place where this change is required 0007 - (optional) add/use libdivide for division, which is license-compatible with other headers we bundle 0008 - (optional) tweaks to libdivide to make compiler/CI happy I have also replaced multiple instances of division or multiplication by BLCKSZ with bitshift operations based on the number of bits in the underlying blocksize. The current approach for this is to replace any affected constant with an inline switch statement based on an enum for the blocksize and the compile-time calculation for that version. In practice with -O2 this generates a simple lookup table inline in the assembly, with the costs for calculating paid at compile time. The visibility map was the main hot path which was affected by the switch from compile-time sizes with the previous version of this patch. With the switch to a modified approach in 0005/0006 this issue has been rectified in our testing. I have tested a few workloads with this modified patch and have seen positive results compared to v1. I look forward to additional review/testing/feedback. Thanks, David
Attachments
- v2-0001-Add-tool-to-retokenize-for-variable-blocksize.patch
- v2-0002-Add-wrapper-for-tokenizing-the-whole-repo.patch
- v2-0005-Add-support-for-fast-non-division-based-div-mod-a.patch
- v2-0003-Capture-rename-of-symbols.patch
- v2-0004-Introduce-initdb-selectable-block-sizes.patch
- v2-0008-Add-pragmas-to-libdivide-to-make-header-check-hap.patch
- v2-0006-Use-fastdiv-code-in-visibility-map.patch
- v2-0007-Use-libdivide-for-fast-division.patch
Enclosed are TPC-H results for 1GB shared_buffers, 64MB work_mem on a 64GB laptop with SSD storage; everything else is default settings. TL;DR: unpatched version: 17.30 seconds, patched version: 17.15; there are some slight variations in runtime, but seems to be within the noise level at this point.
Attachments
On Thu, Aug 31, 2023 at 8:51 AM David Christensen <david.christensen@crunchydata.com> wrote:
> 0005 - utility functions for fast div/mod operations; basically
> montgomery multiplication
+/*
+ * pg_fastmod - calculates the modulus of a 32-bit number against a constant
+ * divisor without using the division operator
+ */
+static inline uint32 pg_fastmod(uint32 n, uint32 divisor, uint64 fastinv)
+{
+#ifdef HAVE_INT128
+ uint64_t lowbits = fastinv * n;
+ return ((uint128)lowbits * divisor) >> 64;
+#else
+ return n % divisor;
+#endif
+}
Requiring 128-bit arithmetic to avoid serious regression is a non-starter as written. Software that relies on fast 128-bit multiplication has to do backflips to get that working on multiple platforms. But I'm not sure it's necessary -- if the max block number is UINT32_MAX and max block size is UINT16_MAX, can't we just use 64-bit multiplication?
> + * pg_fastmod - calculates the modulus of a 32-bit number against a constant > + * divisor without using the division operator > + */ > +static inline uint32 pg_fastmod(uint32 n, uint32 divisor, uint64 fastinv) > +{ > +#ifdef HAVE_INT128 > + uint64_t lowbits = fastinv * n; > + return ((uint128)lowbits * divisor) >> 64; > +#else > + return n % divisor; > +#endif > +} > > Requiring 128-bit arithmetic to avoid serious regression is a non-starter as written. Software that relies on fast 128-bit multiplication has to do backflips to get that working on multiple platforms. But I'm not sure it's necessary -- if the max block number is UINT32_MAX and max block size is UINT16_MAX, can't we just use 64-bit multiplication? I was definitely hand-waving additional implementation here for non-native 128 bit support; the modulus algorithm as presented requires 4 times the space as the divisor, so a uint16 implementation should work for all 64-bit machines. Certainly open to other ideas or implementations, this was the one I was able to find initially. If the 16bit approach is all that is needed in practice we can also see about narrowing the domain and not worry about making this a general-purpose function. Best, David
> I was definitely hand-waving additional implementation here for > non-native 128 bit support; the modulus algorithm as presented > requires 4 times the space as the divisor, so a uint16 implementation > should work for all 64-bit machines. Certainly open to other ideas or > implementations, this was the one I was able to find initially. If > the 16bit approach is all that is needed in practice we can also see > about narrowing the domain and not worry about making this a > general-purpose function. Here's a patch atop the series which converts to 16-bit uints and passes regressions, but I don't consider it well-vetted at this point. David
Attachments
On Thu, Aug 31, 2023 at 2:32 PM David Christensen
<david.christensen@crunchydata.com> wrote:
> Here's a patch atop the series which converts to 16-bit uints and
> passes regressions, but I don't consider well-vetted at this point.

For what it's worth, my gut reaction to this patch series is similar to that of Andres: I think it will be a disaster. If the disaster is not evident to us, that's far more likely to mean that we've failed to test the right things than it is to mean that there is no disaster.

I don't see that there is a lot of upside, either. I don't think we have a lot of evidence that changing the block size is really going to help performance. In fact, my guess is that there are large amounts of code that are heavily optimized, without the authors even realizing it, for 8kB blocks, because that's what we've always had. If we had much larger or smaller blocks, the structure of heap pages or of the various index AMs used for blocks might no longer be optimal, or might be less optimal than they are for an 8kB block size. If you use really large blocks, your blocks may need more internal structure than we have today in order to avoid CPU inefficiencies. I suspect there's been so little testing of non-default block sizes that I wouldn't even count on the code to not be outright buggy.

If we could find a safe way to get rid of full page writes, I would certainly agree that that was worth considering. I'm not sure that anything in this thread adds up to that being a reasonable way to go, but the savings would be massive.

I feel like the proposal here is a bit like deciding to change the speed limit on all American highways from 65 mph or whatever it is to 130 mph or 32.5 mph and see which way works out best. The whole infrastructure has basically been designed around the current rules. The rate of curvature of the roads is appropriate for the speed that you're currently allowed to drive on them. The vehicles are optimized for long-term operation at about that speed. The people who drive the vehicles are accustomed to driving at that speed, and the people who maintain them are accustomed to the problems that happen when you drive them at that speed. Just changing the speed limit doesn't change all that other stuff, and changing all that other stuff is a truly massive undertaking. Maybe this example somewhat overstates the difficulties here, but I do think the difficulties are considerable. The fact that we have 8kB block sizes has affected the thinking of hundreds of developers over decades in making thousands or tens of thousands or hundreds of thousands of decisions about algorithm selection and page format and all kinds of stuff. Even if some other page size seems to work better in a certain context, it's pretty hard to believe that it has much chance of being better overall, even without the added overhead of run-time configuration.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 9/1/23 16:57, Robert Haas wrote:
> On Thu, Aug 31, 2023 at 2:32 PM David Christensen
> <david.christensen@crunchydata.com> wrote:
>> Here's a patch atop the series which converts to 16-bit uints and
>> passes regressions, but I don't consider well-vetted at this point.
>
> For what it's worth, my gut reaction to this patch series is similar
> to that of Andres: I think it will be a disaster. If the disaster is
> not evident to us, that's far more likely to mean that we've failed to
> test the right things than it is to mean that there is no disaster.

Perhaps. The block size certainly affects a lot of places - both in terms of the actual value, and being known (constant) at compile time.

> I don't see that there is a lot of upside, either. I don't think we
> have a lot of evidence that changing the block size is really going to
> help performance.

I don't think that's quite true. We have plenty of empirical evidence that smaller block sizes bring significant improvements for certain workloads. And we also have theoretical explanations for why that is.

> In fact, my guess is that there are large amounts of
> code that are heavily optimized, without the authors even realizing
> it, for 8kB blocks, because that's what we've always had. If we had
> much larger or smaller blocks, the structure of heap pages or of the
> various index AMs used for blocks might no longer be optimal, or might
> be less optimal than they are for an 8kB block size. If you use really
> large blocks, your blocks may need more internal structure than we
> have today in order to avoid CPU inefficiencies. I suspect there's
> been so little testing of non-default block sizes that I wouldn't even
> count on the code to not be outright buggy.

Sure, and there are even various places where the page size implies hard limits (e.g. index key size for btree indexes). But so what? If that matters for your workload, keep using 8kB ...

> If we could find a safe way to get rid of full page writes, I would
> certainly agree that that was worth considering. I'm not sure that
> anything in this thread adds up to that being a reasonable way to go,
> but the savings would be massive.

That's true, that'd be great. But that's clearly just the next level of the optimization. It doesn't mean that if you can't eliminate FPW for whatever reason it's worthless.

> I feel like the proposal here is a bit like deciding to change the
> speed limit on all American highways from 65 mph or whatever it is to
> 130 mph or 32.5 mph and see which way works out best. The whole
> infrastructure has basically been designed around the current rules.
> The rate of curvature of the roads is appropriate for the speed that
> you're currently allowed to drive on them. The vehicles are optimized
> for long-term operation at about that speed. The people who drive the
> vehicles are accustomed to driving at that speed, and the people who
> maintain them are accustomed to the problems that happen when you
> drive them at that speed. Just changing the speed limit doesn't change
> all that other stuff, and changing all that other stuff is a truly
> massive undertaking. Maybe this example somewhat overstates the
> difficulties here, but I do think the difficulties are considerable.
> The fact that we have 8kB block sizes has affected the thinking of
> hundreds of developers over decades in making thousands or tens of
> thousands or hundreds of thousands of decisions about algorithm
> selection and page format and all kinds of stuff. Even if some other
> page size seems to work better in a certain context, it's pretty hard
> to believe that it has much chance of being better overall, even
> without the added overhead of run-time configuration.

Except that no one is forcing you to actually go 130mph or 32mph, right? You make it seem like this patch forces people to use some other page size, but that's clearly not what it's doing - it gives you the option to use smaller or larger blocks, if you choose to. Just like increasing the speed limit to 130mph doesn't mean you can't keep going 65mph.

The thing is - we *already* allow using a different block size, except that you have to do a custom build. This just makes it easier.

I don't have strong opinions on how the patch actually does that, and there certainly can be negative effects of making it dynamic. And yes, we will have to do more testing with non-default block sizes. But frankly, that's a gap we probably need to address anyway, considering we allow changing the block size.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Sep 2, 2023 at 3:09 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Except that no one is forcing you to actually go 130mph or 32mph, right?
> You make it seem like this patch forces people to use some other page
> size, but that's clearly not what it's doing - it gives you the option
> to use smaller or larger block, if you chose to. Just like increasing
> the speed limit to 130mph doesn't mean you can't keep going 65mph.
>
> The thing is - we *already* allow using different block size, except
> that you have to do custom build. This just makes it easier.

Right. Which is worth doing if it doesn't hurt performance and is likely to be useful to a lot of people, and is not worth doing if it will hurt performance and be useful to relatively few people. My bet is on the latter.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Sep 5, 2023 at 9:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Sep 2, 2023 at 3:09 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > Except that no one is forcing you to actually go 130mph or 32mph, right?
> > You make it seem like this patch forces people to use some other page
> > size, but that's clearly not what it's doing - it gives you the option
> > to use smaller or larger block, if you chose to. Just like increasing
> > the speed limit to 130mph doesn't mean you can't keep going 65mph.
> >
> > The thing is - we *already* allow using different block size, except
> > that you have to do custom build. This just makes it easier.
>
> Right. Which is worth doing if it doesn't hurt performance and is
> likely to be useful to a lot of people, and is not worth doing if it
> will hurt performance and be useful to relatively few people. My bet
> is on the latter.

Agreed that this doesn't make sense if there are major performance regressions; however, the goal here is patch evaluation, to measure this against other workloads and see if that is the case. From my localized testing, things were within acceptable noise levels with the latest version.

I agree with Tomas' earlier thoughts: we already allow different block sizes, and if there are baked-in algorithmic assumptions about block size (which there probably are), then identifying those or places in the code where we need additional work or test coverage will only improve things overall for those non-standard block sizes.

Best,

David
Something I also asked at this year's Unconference - do we currently have Build Farm animals testing with different page sizes?

I'd say that testing all sizes from 4KB up (so 4, 8, 16, 32) should be done at least before each release, if not continuously.

-- Cheers

Hannu

On Tue, Sep 5, 2023 at 4:31 PM David Christensen
<david.christensen@crunchydata.com> wrote:
>
> On Tue, Sep 5, 2023 at 9:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Sat, Sep 2, 2023 at 3:09 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> > > Except that no one is forcing you to actually go 130mph or 32mph, right?
> > > You make it seem like this patch forces people to use some other page
> > > size, but that's clearly not what it's doing - it gives you the option
> > > to use smaller or larger block, if you chose to. Just like increasing
> > > the speed limit to 130mph doesn't mean you can't keep going 65mph.
> > >
> > > The thing is - we *already* allow using different block size, except
> > > that you have to do custom build. This just makes it easier.
> >
> > Right. Which is worth doing if it doesn't hurt performance and is
> > likely to be useful to a lot of people, and is not worth doing if it
> > will hurt performance and be useful to relatively few people. My bet
> > is on the latter.
>
> Agreed that this doesn't make sense if there are major performance
> regressions, however the goal here is patch evaluation to measure this
> against other workloads and see if this is the case; from my localized
> testing things were within acceptable noise levels with the latest
> version.
>
> I agree with Tomas' earlier thoughts: we already allow different block
> sizes, and if there are baked-in algorithmic assumptions about block
> size (which there probably are), then identifying those or places in
> the code where we need additional work or test coverage will only
> improve things overall for those non-standard block sizes.
>
> Best,
>
> David
Hi,

On 2023-09-05 21:52:18 +0200, Hannu Krosing wrote:
> Something I also asked at this years Unconference - Do we currently
> have Build Farm animals testing with different page sizes ?

You can check that yourself as easily as anybody else.

Greetings,

Andres Freund
On Tue, Sep 5, 2023 at 2:52 PM Hannu Krosing <hannuk@google.com> wrote:
>
> Something I also asked at this years Unconference - Do we currently
> have Build Farm animals testing with different page sizes ?
>
> I'd say that testing all sizes from 4KB up (so 4, 8, 16, 32) should be
> done at least before each release if not continuously.
>
> -- Cheers
>
> Hannu

The regression tests currently have a lot of breakage when running against non-standard block sizes, so I would assume the answer here is no. I would expect that we will want to add regression test variants or otherwise normalize results so they will work with differing block sizes, but have not done that yet.
Sure, I was just hoping that somebody already knew without needing to specifically check :)

And as I see in David's response, the tests are actually broken for other sizes.

I'll see if I can (convince somebody to) set this up.

Cheers,

Hannu

On Tue, Sep 5, 2023 at 10:23 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-09-05 21:52:18 +0200, Hannu Krosing wrote:
> > Something I also asked at this years Unconference - Do we currently
> > have Build Farm animals testing with different page sizes ?
>
> You can check that yourself as easily as anybody else.
>
> Greetings,
>
> Andres Freund
Hi,
Here is version 3 of the patch series, rebased on 6d0c39a293. The main change, per John N's suggestion, is to use only the 16-bit unsigned mod/div in the utility routines. The breakdown of the patch series is the same as last time, but the descriptions are included again here:
Preparation phase:
0001 - add utility script for retokenizing all necessary scripts.
This is mainly for my own use in generating 0003, which is a simple
rename/preparation patch to change all symbols from their UPPER_CASE
to lower_case form, with several exceptions in renames.
0002 - add script to harness 0001 and apply to the relevant files in the repo
0003 - capture the effects of 0002 on the repo
The other patches in this series are as follows:
0004 - the "main" variable blocksize patch where the bulk of the code
changes take place - see comments here
0005 - utility functions for fast div/mod operations; basically
montgomery multiplication
0006 - use fastdiv code in the visibility map, the main place where
this change is required
0007 - (optional) add/use libdivide for division which is license
compatible with other headers we bundle
0008 - (optional) tweaks to libdivide to make compiler/CI happy
Best,
David
Attachments
- v3-0002-Add-wrapper-for-tokenizing-the-whole-repo.patch
- v3-0005-Add-support-for-fast-non-division-based-div-mod-a.patch
- v3-0001-Add-tool-to-retokenize-for-variable-blocksize.patch
- v3-0004-Introduce-initdb-selectable-block-sizes.patch
- v3-0003-Capture-rename-of-symbols.patch
- v3-0006-Use-fastdiv-code-in-visibility-map.patch
- v3-0007-Use-libdivide-for-fast-division.patch
- v3-0008-Add-pragmas-to-libdivide-to-make-header-check-hap.patch