Обсуждение: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?
Hi, We've had bugs in pg_upgrade where post-upgrade xid horizons weren't correctly set. We've had bugs were indexes were corrupted during replay. The latter can be caught by wal_consistency_checking - but that's pretty expensive. It seems $subject would have a chance of catching some of these bugs, as well as exposing amcheck to a database with a bit more varied content? Depending on the cost it might make sense to do this optionally, via PG_TEST_EXTRA? Greetings, Andres Freund
On Sun, Apr 3, 2022 at 11:53 AM Andres Freund <andres@anarazel.de> wrote: > We've had bugs in pg_upgrade where post-upgrade xid horizons weren't correctly > set. We've had bugs were indexes were corrupted during replay. > > The latter can be caught by wal_consistency_checking - but that's pretty > expensive. > > It seems $subject would have a chance of catching some of these bugs, as well > as exposing amcheck to a database with a bit more varied content? I thought that Andrew Dunstan (CC'd) had a BF animal that did this setup. But I'm not sure if that ever ended up happening. I meant to tell the authors of verify_heapam() (also CC'd) that it really helped with my recent VACUUM project. While the assertions that I wrote in vacuumlazy.c might catch certain bugs like this, verify_heapam() is much more effective in practice. Let's say that an all-visible page (or all-frozen page) has XIDs from before relfrozenxid. Why should the next VACUUM (or any VACUUM) be able to observe the problem? A testing strategy that doesn't rely on these kinds of accidental details to catch bugs is far better than one that does. Definitely all in favor of using verify_heapam() to its full potential. So I'm +1 on your proposal. -- Peter Geoghegan
On Sun, Apr 3, 2022 at 10:10 PM Peter Geoghegan <pg@bowt.ie> wrote: > I meant to tell the authors of verify_heapam() (also CC'd) that it > really helped with my recent VACUUM project. While the assertions that > I wrote in vacuumlazy.c might catch certain bugs like this, > verify_heapam() is much more effective in practice. Yeah, I was very excited about verify_heapam(). There is a lot more stuff that we could check, but a lot of those things would be much more expensive to check. It does a good job, I think, checking all the things that a human being could potentially spot just by looking at an individual page. I love the idea of using it in regression testing in more places. It might find bugs in amcheck, which would be good, but I think it's even more likely to help us find bugs in other code. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Apr 4, 2022 at 7:02 AM Robert Haas <robertmhaas@gmail.com> wrote: > Yeah, I was very excited about verify_heapam(). There is a lot more > stuff that we could check, but a lot of those things would be much > more expensive to check. If anything I understated the value of verify_heapam() with this kind of work before. Better to show just how valuable it is using an example. Let's introduce a fairly blatant bug to lazyvacuum.c. This change makes VACUUM fail to account for the fact that skipping a skippable range with an all-visible page makes it unsafe to advance relfrozenxid: --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -1371,8 +1371,6 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block, else { *skipping_current_range = true; - if (skipsallvis) - vacrel->skippedallvis = true; } return next_unskippable_block; If I run "make check-world", the tests all pass! But when I run pg_amcheck against an affected "regression" database, it will complain about relfrozenxid related corruption in several different tables. > It does a good job, I think, checking all the > things that a human being could potentially spot just by looking at an > individual page. I love the idea of using it in regression testing in > more places. It might find bugs in amcheck, which would be good, but I > think it's even more likely to help us find bugs in other code. I'd really like it if amcheck had HOT chain verification. That's the other area where catching bugs passively with assertions and whatnot is clearly not good enough. -- Peter Geoghegan
> On Apr 4, 2022, at 9:27 AM, Peter Geoghegan <pg@bowt.ie> wrote: > > I'd really like it if amcheck had HOT chain verification. That's the > other area where catching bugs passively with assertions and whatnot > is clearly not good enough. I agree, and was hoping to get around to this in the postgres 15 development cycle. Alas, that did not happen. Worse, Ihave several other projects that will keep me busy for the next few months, at least. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2022-04-04 10:02:37 -0400, Robert Haas wrote: > It does a good job, I think, checking all the things that a human being > could potentially spot just by looking at an individual page. I think there's a few more things that'd be good to check. For example amcheck doesn't verify that HOT chains are reasonable, which can often be spotted looking at an individual page. Which is a bit unfortunate, given how many bugs we had in that area. Stuff to check around that: - target of redirect has HEAP_ONLY_TUPLE, HEAP_UPDATED set - In a valid ctid chain within a page (i.e. xmax = xmin): - tuples have HEAP_UPDATED set - HEAP_ONLY_TUPLE / HEAP_HOT_UPDATED matches across chains elements I think it'd also be good to check for things like visible tuples following invisible ones. Greetings, Andres Freund
On Mon, Apr 4, 2022 at 2:16 PM Andres Freund <andres@anarazel.de> wrote: > On 2022-04-04 10:02:37 -0400, Robert Haas wrote: > > It does a good job, I think, checking all the things that a human being > > could potentially spot just by looking at an individual page. > > I think there's a few more things that'd be good to check. For example amcheck > doesn't verify that HOT chains are reasonable, which can often be spotted > looking at an individual page. Which is a bit unfortunate, given how many bugs > we had in that area. > > Stuff to check around that: > - target of redirect has HEAP_ONLY_TUPLE, HEAP_UPDATED set > - In a valid ctid chain within a page (i.e. xmax = xmin): > - tuples have HEAP_UPDATED set > - HEAP_ONLY_TUPLE / HEAP_HOT_UPDATED matches across chains elements > > I think it'd also be good to check for things like visible tuples following > invisible ones. Interesting. *takes notes* -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Apr 03, 2022 at 11:53:03AM -0700, Andres Freund wrote: > It seems $subject would have a chance of catching some of these bugs, as well > as exposing amcheck to a database with a bit more varied content? Makes sense to me to extend that. > Depending on the cost it might make sense to do this optionally, via > PG_TEST_EXTRA? Yes, it would be good to check the difference in run-time before introducing more. A logical dump of the regression database is no more than 15MB if I recall correctly, so my guess is that most of the runtime is still going to be eaten by the run of pg_regress. -- Michael
Вложения
Hi, On 2022-04-05 08:46:06 +0900, Michael Paquier wrote: > On Sun, Apr 03, 2022 at 11:53:03AM -0700, Andres Freund wrote: > > It seems $subject would have a chance of catching some of these bugs, as well > > as exposing amcheck to a database with a bit more varied content? > > Makes sense to me to extend that. > > > Depending on the cost it might make sense to do this optionally, via > > PG_TEST_EXTRA? > > Yes, it would be good to check the difference in run-time before > introducing more. A logical dump of the regression database is no > more than 15MB if I recall correctly, so my guess is that most of the > runtime is still going to be eaten by the run of pg_regress. On my workstation it takes about 2.39s to run pg_amcheck on a regression database with all thoroughness options enabled. With -j4 it's 0.62s. Without more thorough checking it's 1.24s and 0.30s with -j4. Greetings, Andres Freund
On Mon, Apr 04, 2022 at 05:39:58PM -0700, Andres Freund wrote: > On my workstation it takes about 2.39s to run pg_amcheck on a regression > database with all thoroughness options enabled. With -j4 it's 0.62s. > > Without more thorough checking it's 1.24s and 0.30s with -j4. Okay. That sounds like an argument to enable that by default, with parallelism. -- Michael
Вложения
On 4/3/22 22:10, Peter Geoghegan wrote: > On Sun, Apr 3, 2022 at 11:53 AM Andres Freund <andres@anarazel.de> wrote: >> We've had bugs in pg_upgrade where post-upgrade xid horizons weren't correctly >> set. We've had bugs were indexes were corrupted during replay. >> >> The latter can be caught by wal_consistency_checking - but that's pretty >> expensive. >> >> It seems $subject would have a chance of catching some of these bugs, as well >> as exposing amcheck to a database with a bit more varied content? > I thought that Andrew Dunstan (CC'd) had a BF animal that did this > setup. But I'm not sure if that ever ended up happening. I don't think any of my BF animals do anything special in this area. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com