Discussion: BUG #17245: Index corruption involving deduplicated entries


BUG #17245: Index corruption involving deduplicated entries

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17245
Logged by:          Andrew Gierth
Email address:      andrew@tao11.riddles.org.uk
PostgreSQL version: 14.0
Operating system:   FreeBSD 13.0-RELEASE
Description:

From a report on IRC, from a user running a MediaWiki instance on PG:

select ctid, page_title from "page" WHERE page_title = 'Isokaze' and
page_namespace = 0;

returned two rows rather than the expected 1:

  (181,53) | Ying_Swei
  (722,3)  | Isokaze

Disabling index scans restored the expected output of 1 row.

Inspecting the index revealed this:

000114a0: 2f47 616c 6c65 7279 0000 1000 0320 2860  /Gallery..... (`
000114b0: 1149 736f 6b61 7a65 0000 7600 1800 0000  .Isokaze..v.....
000114c0: b500 0900 0000 d202 3800 0000 0000 0000  ........8.......
000114d0: 0000 c501 2d00 1840 1749 7365 2f51 756f  ....-..@.Ise/Quo

which appears to me to be a deduplicated index entry pointing to heap rows
(181,9) and (722,56). Inspecting the heap showed that both of those ctids
are just pointer entries, to (181,53) and (722,3) respectively, which
explains the 2 row result. What I don't know is how this could have happened
in the first place.

The definition of the affected index is:

CREATE INDEX page_main_title ON mediawiki.page USING btree (page_title
text_pattern_ops) WHERE (page_namespace = 0);

The real entries for 'Ying_Swei' in the index are not in the same page or
even any nearby page, and it seems unlikely that the page_title was
updated.

The original reporter (who I will CC on a followup message) still has a
snapshot of the corrupt data. (A REINDEX naturally fixed the data).


Re: BUG #17245: Index corruption involving deduplicated entries

From
Andrew Gierth
Date:

CC'ing to original reporter.

See report at
https://www.postgresql.org/message-id/17245-ddf06aaf85735f36%40postgresql.org

-- 
Andrew (irc:RhodiumToad)



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sun, Oct 24, 2021 at 4:57 AM Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> CC'ing to original reporter.

I suggest that pg_amcheck be run against the database, to try to
establish the overall extent of the corruption -- including all
indexes and heap relations from the database. This should start with
the original known-corrupt backup, from before the REINDEX. Then
report back the output of pg_amcheck to the list.

See: https://www.postgresql.org/docs/14/app-pgamcheck.html

Please specify --heapallindexed and --parent-check options when
running pg_amcheck, so that it will perform the most thorough possible
checks against indexes. --heapallindexed is particularly important
here, because the issue that Andrew highlighted is probably not
detectable just by looking at the index itself. It's purely an
inconsistency between the index and the table. This class of problem
has many possible causes.
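
For example, something like the following invocation (the database name
here is just a placeholder; adjust connection options as needed):

$ pg_amcheck --heapallindexed --parent-check dbname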

Thanks
-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
"K. R."
Date:
During investigation, `SELECT bt_index_parent_check('mediawiki.page_main_title', true, true);`
displayed no errors.

I have installed the same build of 14.0 in a VM and started it with the
data directory from before the REINDEX. pg_amcheck does report errors,
but these are not related to the index in question (page_main_title):


$ ./pg_amcheck -U postgres --heapallindexed --parent-check --rootdescend azurlane_wiki
btree index "azurlane_wiki.mediawiki.page_redirect_namespace_len":
     ERROR:  posting list contains misplaced TID in index "page_redirect_namespace_len"
     DETAIL:  Index tid=(14,9) posting list offset=110 page lsn=2/2C4F7CD8.
btree index "azurlane_wiki.mediawiki.page_len":
     ERROR:  posting list contains misplaced TID in index "page_len"
     DETAIL:  Index tid=(1,2) posting list offset=34 page lsn=2/2DDA2378.
btree index "azurlane_wiki.mediawiki.transcode_key_idx":
     ERROR:  posting list contains misplaced TID in index "transcode_key_idx"
     DETAIL:  Index tid=(1,9) posting list offset=5 page lsn=2/2B53AED8.

This instance displays the same behaviour with page_main_title:

azurlane_wiki=# select ctid, page_title from mediawiki.page WHERE 
page_title = 'Isokaze' and page_namespace = 0;
   ctid   | page_title
----------+------------
 (181,53) | Ying_Swei
 (722,3)  | Isokaze
(2 rows)

azurlane_wiki=# select ctid, page_title from mediawiki.page WHERE 
page_title = 'Kaga' and page_namespace = 0;
   ctid   | page_title
----------+------------
 (446,32) | Belfast
 (720,53) | Kaga
(2 rows)

After REINDEX'ing the three indices listed by pg_amcheck, there is no 
apparent change to the state of the page_main_title index:

azurlane_wiki=# reindex index mediawiki.page_redirect_namespace_len;
REINDEX
azurlane_wiki=# reindex index mediawiki.page_len;
REINDEX
azurlane_wiki=# reindex index mediawiki.transcode_key_idx;
REINDEX

[xiatian@freebsd pg_amcheck]$ ./pg_amcheck -U postgres --heapallindexed --parent-check --rootdescend azurlane_wiki
[xiatian@freebsd pg_amcheck]$

azurlane_wiki=# select ctid, page_title from mediawiki.page WHERE 
page_title = 'Kaga' and page_namespace = 0;
   ctid   | page_title
----------+------------
 (446,32) | Belfast
 (720,53) | Kaga
(2 rows)

P. S. My previous copy of this e-mail was sent from an invalid address; 
I apologise for the possible noise.

--
K. R.




Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sun, Oct 24, 2021 at 4:40 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
> PostgreSQL version: 14.0
> Operating system:   FreeBSD 13.0-RELEASE

PostgreSQL 14 is the first version where the default wal_sync_method
is fdatasync() on FreeBSD -- though only with FreeBSD 13. See commit
f900a79e. Perhaps that has something to do with the problem seen here.

I have no concrete reason to suspect fdatasync() here -- I'm just
pattern matching. But the inconsistency that Andrew reported looks
like any other scenario where the system's wal_sync_method doesn't
provide the expected guarantees. This looks like generic corruption to
me. Andrew said that the "real entries for 'Ying_Swei' in the index
are not in the same page or even any nearby page". It is reasonable to
suppose that the TID in the index pointing to the wrong tuple in the
heap/table is from an index tuple that has no business being anywhere
near the leaf page that we find it in -- I surmise that it's too far
away in the keyspace for that. (I think that this was Andrew's exact
point, in fact, but I just want to make sure.)

I suspect that there was TID recycling by VACUUM involved here.
Originally, the leaf page image that Andrew examined might well have
actually pointed to the right thing in the heap. Perhaps VACUUM ran,
and recycled the relevant heap TID, while somehow leaving an older
pre-VACUUM leaf page behind (any kind of storage misbehavior could do
this).

There doesn't have to have been a VACUUM, actually. A simple hard
crash (with a misbehaving wal_sync_method or whatever) could easily
lead to the same symptoms. You just need to have heapam "allocate new
heap TIDs" and then forget about having done so. Without the index
structure also forgetting.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sun, Oct 24, 2021 at 2:35 PM K. R. <iijima.yun@koumakan.jp> wrote:
> $ ./pg_amcheck -U postgres --heapallindexed --parent-check --rootdescend azurlane_wiki
> btree index "azurlane_wiki.mediawiki.page_redirect_namespace_len":
>      ERROR:  posting list contains misplaced TID in index "page_redirect_namespace_len"
>      DETAIL:  Index tid=(14,9) posting list offset=110 page lsn=2/2C4F7CD8.
> btree index "azurlane_wiki.mediawiki.page_len":
>      ERROR:  posting list contains misplaced TID in index "page_len"
>      DETAIL:  Index tid=(1,2) posting list offset=34 page lsn=2/2DDA2378.
> btree index "azurlane_wiki.mediawiki.transcode_key_idx":
>      ERROR:  posting list contains misplaced TID in index "transcode_key_idx"
>      DETAIL:  Index tid=(1,9) posting list offset=5 page lsn=2/2B53AED8.

Thanks for getting back to me with that so quickly.

I suspect that these posting list tuples have duplicate TIDs, which is
never supposed to happen -- nbtree expects that heapam (and the
broader system) will never allow a duplicate TID to be present in the
whole index.

It's possible that I'm wrong, and the corrupt posting list TIDs are
actually in the wrong order (i.e. they're all unique, but are somehow
not in perfect TID-wise order). But I doubt it.

> After REINDEX'ing the three indices listed by pg_amcheck, there is no
> apparent change to the state of the page_main_title index:

> azurlane_wiki=# select ctid, page_title from mediawiki.page WHERE
> page_title = 'Kaga' and page_namespace = 0;
>    ctid   | page_title
> ----------+------------
>  (446,32) | Belfast
>  (720,53) | Kaga
> (2 rows)

I'm a little confused.

Do you mean that even a REINDEX isn't enough to stop the index from
giving this particular wrong answer, where it thinks that 'Belfast' is
the same as 'Kaga'? Even a REINDEX won't make it stop doing that? Or,
are you concerned that pg_amcheck doesn't detect a remaining problem
with one index?

Thanks
-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sun, Oct 24, 2021 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Oct 24, 2021 at 2:35 PM K. R. <iijima.yun@koumakan.jp> wrote:
> > $ ./pg_amcheck -U postgres --heapallindexed --parent-check --rootdescend azurlane_wiki
> > btree index "azurlane_wiki.mediawiki.page_redirect_namespace_len":
> >      ERROR:  posting list contains misplaced TID in index "page_redirect_namespace_len"
> >      DETAIL:  Index tid=(14,9) posting list offset=110 page lsn=2/2C4F7CD8.
> > btree index "azurlane_wiki.mediawiki.page_len":
> >      ERROR:  posting list contains misplaced TID in index "page_len"
> >      DETAIL:  Index tid=(1,2) posting list offset=34 page lsn=2/2DDA2378.
> > btree index "azurlane_wiki.mediawiki.transcode_key_idx":
> >      ERROR:  posting list contains misplaced TID in index "transcode_key_idx"
> >      DETAIL:  Index tid=(1,9) posting list offset=5 page lsn=2/2B53AED8.

Does it make any difference if you run "./pg_amcheck -U postgres
--heapallindexed azurlane_wiki" instead?

Typically I would expect this to be less interesting, because you're
asking pg_amcheck to do less checking. But this is not a typical case.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andrew Gierth
Date:
>>>>> "Peter" == Peter Geoghegan <pg@bowt.ie> writes:

 Peter> PostgreSQL 14 is the first version where the default
 Peter> wal_sync_method is fdatasync() on FreeBSD -- though only with
 Peter> FreeBSD 13. See commit f900a79e. Perhaps that has something to
 Peter> do with the problem seen here.

Nope. The database in question was restored from a pg_dumpall backup
three weeks ago and I'm told there have been no crashes or even unclean
restarts since then.

 Peter> I suspect that there was TID recycling by VACUUM involved here.

That also doesn't seem likely from the heap page images I saw, which
suggested a fairly low update rate, HOT updates, and page cleanup rather
than much vacuum activity (in particular there were several LP_REDIRECT
pointers and few free pointers).
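
(For reference, the line-pointer states I mean can be inspected with a
pageinspect query along these lines -- an illustrative sketch, assuming
the pageinspect extension is installed:

select lp, lp_off, lp_flags, t_ctid
from heap_page_items(get_raw_page('mediawiki.page', 181));

where lp_flags values 0, 1, 2, and 3 correspond to LP_UNUSED, LP_NORMAL,
LP_REDIRECT, and LP_DEAD respectively.)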

-- 
Andrew (irc:RhodiumToad)



Re: BUG #17245: Index corruption involving deduplicated entries

From
Alvaro Herrera
Date:
Do you happen to have WAL traffic for this table and index?

-- 
Álvaro Herrera



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sun, Oct 24, 2021 at 5:56 PM Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> Nope. The database in question was restored from a pg_dumpall backup
> three weeks ago and I'm told there have been no crashes or even unclean
> restarts since then.

Anything is possible, but this doesn't look like logical index
corruption. If we assume that it was something like a faulty
deduplication pass, then we have to explain how the spurious TID ended
up in that particular leaf page to begin with. We also have to explain
why it is a valid though totally distinct TID, which is unlikely
to occur by chance.

I find it slightly suspicious that the "correct" heap blocks (the
blocks where the correct heap tuples are located) are close together in
the heap -- 720 (in the case of 'Kaga'), 722 (in the case of
'Isokaze'). Meanwhile, the "incorrect" heap blocks (the blocks where
the non-matching heap tuples are located, which are nevertheless
returned by the index scan) are earlier in the table -- 181 and 446.

> That also doesn't seem likely from the heap page images I saw, which
> suggested a fairly low update rate, HOT updates, and page cleanup rather
> than much vacuum activity (in particular there were several LP_REDIRECT
> pointers and few free pointers).

FWIW Postgres 14 can truncate a heap page's line pointer array during
VACUUM's second pass -- see commit 3c3b8a4b. So a lack of observable
free line pointers is not quite the signal it once was.

What does pg_stat_user_tables say about autovacuum?
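
For example (an illustrative query; the schema-qualified table name is
assumed):

select last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
       vacuum_count, autovacuum_count
from pg_stat_user_tables
where schemaname = 'mediawiki' and relname = 'page';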

--
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
"K. R."
Date:
On 25.10.2021 1:28, Peter Geoghegan wrote:
> Does it make any difference if you run "./pg_amcheck -U postgres
> --heapallindexed azurlane_wiki" instead?
>
> Typically I would expect this to be less interesting, because you're
> asking pg_amcheck to do less checking. But this is not a typical case.

No, there seem to be no differences between the different sets of
arguments:

[xiatian@freebsd pg_amcheck]$ ./pg_amcheck -U postgres --heapallindexed azurlane_wiki
[xiatian@freebsd pg_amcheck]$

> Do you mean that even a REINDEX isn't enough to stop the index from
> giving this particular wrong answer, where it thinks that 'Belfast' is
> the same as 'Kaga'? Even a REINDEX won't make it stop doing that? Or,
> are you concerned that pg_amcheck doesn't detect a remaining problem
> with one index?

The latter. REINDEX fixes the problem with page_main_title. I apologise 
for not being clear enough; I have never looked too deep into 
PostgreSQL's inner workings before and have been using it purely as a 
regular application developer.

> What does pg_stat_user_tables say about autovacuum?

For "mediawiki.page" specifically:

last_vacuum: 2021-10-24 12:21:22.256194+09
last_autovacuum: 2021-10-05 18:57:43.850106+09
last_analyze: 2021-10-24 18:48:04.706226+09
last_autoanalyze: 2021-10-09 19:38:36.441795+09

vacuum_count is 20 (the periodic daily script), autovacuum_count is 1.

Considering the constraints for the index in question, I can tell that 
changes affecting it were made on the 22nd of October and before, and 
then yesterday on the 24th. The first user report regarding the problem 
was on the 23rd, unless, of course, it went unnoticed before.

-- 
K. R.




Re: BUG #17245: Index corruption involving deduplicated entries

From
Andrew Gierth
Date:
>>>>> "Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

 Alvaro> Do you happen to have WAL traffic for this table and index?

Unfortunately there's no WAL archiving, and no available WAL files from
before the state was known to be corrupt. We'll look into enabling
archiving in case the problem recurs or can be reproduced.

-- 
Andrew (irc:RhodiumToad)



Re: BUG #17245: Index corruption involving deduplicated entries

From
"K. R."
Date:
On 25.10.2021 0:52, Peter Geoghegan wrote:
> On Sun, Oct 24, 2021 at 4:40 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> PostgreSQL version: 14.0
>> Operating system:   FreeBSD 13.0-RELEASE
> PostgreSQL 14 is the first version where the default wal_sync_method
> is fdatasync() on FreeBSD -- though only with FreeBSD 13. See commit
> f900a79e. Perhaps that has something to do with the problem seen here.
To clarify: I had been running PostgreSQL 13.3 since the 18th of May 
(upgraded from 13.2) until the upgrade to 14.0.

The REL_13_3 tag seems to include the change from f900a79e (matching the 
commit message: ‘Like commit 576477e73c4, which did the same for Linux, 
back-patch to all supported releases.’) and both the 13.3 package from 
my package manager's cache and the 13.4 package available from FreeBSD 
repositories at the moment use fdatasync() as well:

[root@lincle-backup ~]# postgres --version
postgres (PostgreSQL) 13.3
[root@lincle-backup ~]# postgres --describe-config| grep wal_sync_method
wal_sync_method sighup  Write-Ahead Log / Settings      ENUM    fdatasync       Selects the method used for forcing WAL updates to disk.

[root@lincle-backup ~]# postgres --version
postgres (PostgreSQL) 13.4
[root@lincle-backup ~]# postgres --describe-config| grep wal_sync_method
wal_sync_method sighup  Write-Ahead Log / Settings      ENUM    fdatasync       Selects the method used for forcing WAL updates to disk.

According to the logs I have (daily reports), there have been no crashes 
or unclean shutdowns of PostgreSQL on the server since the database jail 
re-creation time (August of 2014 or so).

Vacuum has been performed daily on all databases since at least around 
2017 via the daily periodic script.

-- 
K. R.




Re: BUG #17245: Index corruption involving deduplicated entries

From
Thomas Munro
Date:
On Mon, Oct 25, 2021 at 10:50 PM K. R. <iijima.yun@koumakan.jp> wrote:
> On 25.10.2021 0:52, Peter Geoghegan wrote:
> > On Sun, Oct 24, 2021 at 4:40 AM PG Bug reporting form
> > <noreply@postgresql.org> wrote:
> >> PostgreSQL version: 14.0
> >> Operating system:   FreeBSD 13.0-RELEASE
> > PostgreSQL 14 is the first version where the default wal_sync_method
> > is fdatasync() on FreeBSD -- though only with FreeBSD 13. See commit
> > f900a79e. Perhaps that has something to do with the problem seen here.
> To clarify: I had been running PostgreSQL 13.3 since the 18th of May
> (upgraded from 13.2) until the upgrade to 14.0.
>
> The REL_13_3 tag seems to include the change from f900a79e (matching the
> commit message: ‘Like commit 576477e73c4, which did the same for Linux,
> back-patch to all supported releases.’) and both the 13.3 package from
> my package manager's cache and the 13.4 package available from FreeBSD
> repositories at the moment use fdatasync() as well:

For the record, commit f900a79e was about preserving pre-existing
behaviour dating back a few years (ie avoiding a change).  The history
is like this:

On ancient FreeBSD releases, PostgreSQL would default to wal_sync_method=fsync.

If built on FreeBSD 11.1+ (July 2017), it would select
wal_sync_method=fdatasync as the default, because the fdatasync()
system call was added and would be detected by the configure script.

If built on FreeBSD 13.0+ (April 2021), it would traditionally have
selected wal_sync_method=open_datasync, because the O_DSYNC flag was
added to <fcntl.h>.  But commit f900a79e changed PostgreSQL 13.3+
(April 2021) and sibling releases to prefer wal_sync_method=fdatasync
still, despite the presence of O_DSYNC.
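
(For completeness, the setting a given server is actually running with
can be checked directly -- an illustrative example, with output matching
the --describe-config results shown earlier:

postgres=# SHOW wal_sync_method;
 wal_sync_method
-----------------
 fdatasync
(1 row)
)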

Given that this system upgraded PostgreSQL 13.2->13.3 on the 18th of
May, and assuming it had installed FreeBSD 13.0 and PostgreSQL 13.2
around the time of FreeBSD 13.0's release on the 13th of April, it
would have been running with wal_sync_method=open_datasync (assuming
default used) in that 5 week window.  I could talk about what exact
effects that would have if I knew which file system we're talking
about here, but it doesn't seem to be relevant to this case, given
that "The database in question was restored from a pg_dumpall backup
_three weeks ago_ and I'm told there have been no crashes or even
unclean restarts since then" (my emphasis).



Re: BUG #17245: Index corruption involving deduplicated entries

From
"K. R."
Date:
On 25.10.2021 23:49, Thomas Munro wrote:
> For the record, commit f900a79e was about preserving pre-existing
> behaviour dating back a few years (ie avoiding a change).  The history
> is like this:
>
> On ancient FreeBSD releases, PostgreSQL would default to wal_sync_method=fsync.
>
> If built on FreeBSD 11.1+ (July 2017), it would select
> wal_sync_method=fdatasync as the default, because the fdatasync()
> system call was added and would be detected by the configure script.
>
> If built on FreeBSD 13.0+ (April 2021), it would traditionally have
> selected wal_sync_method=open_datasync, because the O_DSYNC flag was
> added to <fcntl.h>.  But commit f900a79e changed PostgreSQL 13.3+
> (April 2021) and sibling releases to prefer wal_sync_method=fdatasync
> still, despite the presence of O_DSYNC.

Thank you, this confirms the assumptions I had after reading through the 
sources of different releases and testing the --describe-config output 
from old packages in my package manager's cache.

> Given that this system upgraded PostgreSQL 13.2->13.3 on the 18th of
> May, and assuming it had installed FreeBSD 13.0 and PostgreSQL 13.2
> around the time of FreeBSD 13.0's release on the 13th of April, it
> would have been running with wal_sync_method=open_datasync (assuming
> default used) in that 5 week window.  I could talk about what exact
> effects that would have if I knew which file system we're talking
> about here, but it doesn't seem to be relevant to this case, given
> that "The database in question was restored from a pg_dumpall backup
> _three weeks ago_ and I'm told there have been no crashes or even
> unclean restarts since then" (my emphasis).


In case this might be important later: everything is on ZFS (upgraded 
from FreeBSD's old zfs port to openzfs, as imported in FreeBSD 13.0).

Log entries related to the upgrade:
Oct  5 09:55:27 db pkg[7101]: postgresql13-server-13.3_1 deinstalled
Oct  5 09:55:41 db pkg[7144]: postgresql14-server-14.0 installed

This was preceded by pg_dumpall and followed by initdb with the exact 
same parameters (using unchanged rc.conf 
postgresql_initdb_flags="--encoding=utf-8 --lc-collate=en_GB.UTF-8") and 
the import of the dump using psql. I do not use pg_upgrade since 
different versions of PostgreSQL cannot be installed at the same time in 
FreeBSD using the ports/packages system, and I do not find dump/restore 
much of a chore (plus it solves issues with possible locale changes).

There have been no crashes since; there was one reload (pg_hba.conf 
edits) and several restarts (to snapshot the file structure with the 
corrupted index, plus another enabling WAL archiving today in the morning).

I have postgresql-rum installed, if this matters; it is used by a 
Pleroma instance in a separate database.

-- 
K. R.




Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Mon, Oct 25, 2021 at 2:08 PM K. R. <iijima.yun@koumakan.jp> wrote:
> There have been no crashes since; there was one reload (pg_hba.conf
> edits) and several restarts (to snapshot the file structure with the
> corrupted index, plus another enabling WAL archiving today in the morning).

Thank you for your help.

I wonder if you can show me a page that amcheck reports as having an
incorrect posting list? I am interested in posting list tuples that
are not just pointing to the wrong thing, but actually look wrong
without even looking at the heap.

You must have done this for Andrew already, but note that the
procedure is outlined here:


https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump

For example, the amcheck error with "DETAIL:  Index tid=(14,9)
posting list offset=110 page lsn=2/2C4F7CD8" for the btree index
"azurlane_wiki.mediawiki.page_redirect_namespace_len" has an index TID
of (14,9), meaning that block number 14 from that index is interesting
to me. It could help me with debugging.
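
In SQL terms the recipe amounts to something like this (a sketch,
assuming the pageinspect extension is available; the hex output can be
converted back into a binary page file):

create extension if not exists pageinspect;
select encode(get_raw_page('mediawiki.page_redirect_namespace_len', 14), 'hex');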

I can treat this page image as confidential. You could send it to me privately.

Thanks again
-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Thomas Munro
Date:
On Tue, Oct 26, 2021 at 10:08 AM K. R. <iijima.yun@koumakan.jp> wrote:
> [...] I do not use pg_upgrade since
> different versions of PostgreSQL cannot be installed at the same time in
> FreeBSD using the ports/packages system, and I do not find dump/restore
> much of a chore (plus it solves issues with possible locale changes).

It really would be nice to address that. (FWIW, a few years ago I had a
mostly-working experiment-grade patch for the ports tree that could
install and run multiple versions in parallel, and the bones of a port
of the Debian wrapper scripts and cluster control tools so you can
list, start, stop, clone clusters etc, but I ran out of steam while
contemplating the squillions of other packages that interact with
PostgreSQL...  happy to share the starter bits and pieces I have
off-list if anyone is interested...).  But in this particular case it
significantly reduces the search area for the problem, so that's a
useful deficiency!



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sun, Oct 24, 2021 at 6:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Anything is possible, but this doesn't look like logical index
> corruption. If we assume that it was something like a faulty
> deduplication pass, then we have to explain how the spurious TID ended
> up in that particular leaf page to begin with. We also have to explain
> why it is a valid TID though totally distinct TID, which is unlikely
> to occur by chance.

I discussed the robustness of some of the posting list split code
privately with Heikki earlier today. Heikki suggested that I more
consistently invalidate the insert state that is used across
nbtinsert.c. Although I don't have any reason to believe that that
makes a difference, I still agree with his conclusions.

Attached patch does this. It also makes the special posting split
related call to _bt_binsrch_insert() at the end of _bt_findinsertloc()
more robust. There is a similar defensive elog() within
_bt_swap_posting(), but that only kicks in when the posting offset is
totally out of bounds for the tuple. _bt_binsrch_insert() could still
give wrong answers for some other reason, at least in theory, and so it
seems like a good idea to be extra careful, and throw an error at the
earliest opportunity.

-- 
Peter Geoghegan

Attachments

Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Mon, Oct 25, 2021 at 2:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Oct 25, 2021 at 2:08 PM K. R. <iijima.yun@koumakan.jp> wrote:
> > There have been no crashes since; there was one reload (pg_hba.conf
> > edits) and several restarts (to snapshot the file structure with the
> > corrupted index, plus another enabling WAL archiving today in the morning).
>
> Thank you for your help.

Thank you for sharing page images with me privately -- that was very
helpful. I can see the same basic issue in all 3 of the pages that you
said amcheck reports as corrupt (the 3 pages that you shared). As I
suspected, there are posting lists that contain duplicate TIDs, but
look normal in every other way. This is very subtle.

A few more questions for you, if you don't mind:

1. Can you tell me which tables the 4 known-corrupt indexes
(page_redirect_namespace_len, page_len, transcode_key_idx, and
page_main_title) are defined against?
2. Can you show me the definition of the table or tables, using \d from psql?
3. Do you notice any consistent pattern here? For example, are foreign
keys involved?
4. How do the table or tables tend to get updated?

Thanks again

Executive summary on my progress:

This now looks like it might be related to code on the heapam side,
perhaps in heapam_index_build_range_scan(), which has a history of
getting confused about duplicate TIDs with HOT chains (e.g., see
Alvaro's commit a811ea5bde from late 2020).

Here is why I now suspect heapam_index_build_range_scan():

It disturbed me that even "heapallindexed" amcheck verification did
not detect the originally reported problem, where we see index scans
that give wrong answers (a scan of the page_main_title index shows an
extra non-matching row, per Andrew's original message). We saw that
amcheck does notice a few corrupt posting list tuples in other nearby
indexes, but not this bigger problem. That eventually made me think
about table_index_build_scan()/heapam_index_build_range_scan(), which
is used by amcheck for heapallindexed verification -- most of the
important work is outsourced to that core code. amcheck more or less
performs heapallindexed verification by comparing the tuples that a
REINDEX would see against the actual tuples it has fingerprinted from
the index.

I wonder if heapallindexed verification doesn't detect corruption in
the page_main_title index because "the table seems to agree with the
index", even though the actual details are clearly wrong. In other
words, perhaps even the table_index_build_scan() call inside amcheck's
verify_nbtree.c file says that there should be duplicate TIDs in the
index, even though that's always wrong, by definition. What if
table_index_build_scan() is actually the problem?

If my hypothesis is true, then we might expect problems to not even go
away following a REINDEX. Or maybe they would go away sometimes, but
not other times.

This idea is supported by remarks from Herman Verschooten on Slack
[1], when discussing a similar problem case on Postgres 14 [2]. I'll
quote him directly:

"""
tranman_production=# CREATE INDEX index_freights_on_cmr_received ON
public.freights USING btree (cmr_received) with (deduplicate_items =
off);
CREATE INDEX
tranman_production=# \set VERBOSITY verbose
tranman_production=# update freights set cmr_received=false where id=49632;
ERROR:  XX000: posting list tuple with 20 items cannot be split at offset 168
LOCATION:  _bt_swap_posting, nbtdedup.c:1037
tranman_production=# drop index index_freights_on_cmr_received ;
DROP INDEX
tranman_production=# update freights set cmr_received=false where id=49632;
UPDATE 1
"""

Notice that the problem seems to be confined to one index, which has
deduplication explicitly disabled. We see an error message about a
posting list tuple all the same. I think that that's just a downstream
consequence of duplicate TIDs appearing in the index. It looks like
the corruption exists and is independent of whether the index is
present or not. When the index is present the problem becomes obvious,
but it's still there either way.

The supposed posting list tuple with 20 items that the error message
goes on about is actually just a plain tuple. The number 20 comes from
its IndexTupleData.t_tid offset number. In other words, it's an offset
number from the index tuple's pointed-to heap TID, which has been
misinterpreted as something else by BTreeTupleGetNPosting().
BTreeTupleGetNPosting() expects to never be called against such a
tuple, but it can happen once we assume duplicate TIDs are possible.
(You'd see an assertion failure if asserts were enabled, but of course
they're not enabled here.)

Note that this independently reported "freights" case from Herman has
the same amcheck issue we see with the mediawiki indexes on this
thread: amcheck won't actually complain about the clearly corrupt
index_freights_on_cmr_received index, but will complain about other
indexes on the same table. This time around amcheck complains about
out-of-order index tuples, but I strongly suspect that that's
practically the same condition as the "posting list contains misplaced
TID" error messages we see here. In other words, I think that both
cases just have duplicate heap TIDs, and that the variation in error
messages is totally unrelated to the true underlying problem.

[1] https://postgresteam.slack.com/archives/C0FS3UTAP/p1635161173202100?thread_ts=1635154585.197300&cid=C0FS3UTAP
[2] https://postgr.es/m/8CDB73C1-E3AF-40A6-BA81-8AFE174C6402@verschooten.net
--
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Tue, Oct 26, 2021 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote:
> A few more questions for you, if you don't mind:
>
> 1. Can you tell me which tables the 4 known-corrupt indexes
> (page_redirect_namespace_len, page_len, transcode_key_idx, and
> page_main_title) are defined against?
> 2. Can you show me the definition of the table or tables, using \d from psql?
> 3. Do you notice any consistent pattern here? For example, are foreign
> keys involved?
> 4. How do the table or tables tend to get updated?

Also:

5. Do you use CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY much?

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 26.10.2021 23:36, Peter Geoghegan wrote:
> On Mon, Oct 25, 2021 at 2:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> A few more questions for you, if you don't mind:
> 
> 1. Can you tell me which tables the 4 known-corrupt indexes
> (page_redirect_namespace_len, page_len, transcode_key_idx, and
> page_main_title) are defined against?

page and transcode.

> 2. Can you show me the definition of the table or tables, using \d from psql?

azurlane_wiki=> \d mediawiki.page
                                          Table "mediawiki.page"
       Column       |           Type           | Collation | Nullable |                     Default
--------------------+--------------------------+-----------+----------+-------------------------------------------------
 page_id            | integer                  |           | not null | nextval('mediawiki.page_page_id_seq'::regclass)
 page_namespace     | integer                  |           | not null |
 page_title         | text                     |           | not null |
 page_restrictions  | text                     |           |          |
 page_is_redirect   | smallint                 |           | not null | 0
 page_is_new        | smallint                 |           | not null | 0
 page_random        | double precision         |           | not null | random()
 page_touched       | timestamp with time zone |           | not null |
 page_links_updated | timestamp with time zone |           |          |
 page_latest        | integer                  |           | not null |
 page_len           | integer                  |           | not null |
 page_content_model | text                     |           |          |
 page_lang          | text                     |           |          |
 titlevector        | tsvector                 |           |          |
Indexes:
    "page_pkey" PRIMARY KEY, btree (page_id)
    "name_title" UNIQUE, btree (page_namespace, page_title)
    "page_len" btree (page_len)
    "page_main_title" btree (page_title text_pattern_ops) WHERE page_namespace = 0
    "page_mediawiki_title" btree (page_title text_pattern_ops) WHERE page_namespace = 8
    "page_project_title" btree (page_title text_pattern_ops) WHERE page_namespace = 4
    "page_random" btree (page_random)
    "page_redirect_namespace_len" btree (page_is_redirect, page_namespace, page_len)
    "page_talk_title" btree (page_title text_pattern_ops) WHERE page_namespace = 1
    "page_user_title" btree (page_title text_pattern_ops) WHERE page_namespace = 2
    "page_utalk_title" btree (page_title text_pattern_ops) WHERE page_namespace = 3
    "ts2_page_title" gist (titlevector)
Referenced by:
    TABLE "mediawiki.cu_changes" CONSTRAINT "cu_changes_cuc_page_id_fkey" FOREIGN KEY (cuc_page_id) REFERENCES mediawiki.page(page_id) ON DELETE SET NULL
    TABLE "mediawiki.revision" CONSTRAINT "revision_rev_page_fkey" FOREIGN KEY (rev_page) REFERENCES mediawiki.page(page_id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
Triggers:
    page_deleted AFTER DELETE ON mediawiki.page FOR EACH ROW EXECUTE FUNCTION mediawiki.page_deleted()
    ts2_page_title BEFORE INSERT OR UPDATE ON mediawiki.page FOR EACH ROW EXECUTE FUNCTION mediawiki.ts2_page_title()

azurlane_wiki=> \d mediawiki.transcode
                                       Table "mediawiki.transcode"
          Column          |           Type           | Collation | Nullable |                          Default
--------------------------+--------------------------+-----------+----------+-----------------------------------------------------------
 transcode_id             | integer                  |           | not null | nextval('mediawiki.transcode_transcode_id_seq'::regclass)
 transcode_image_name     | character varying(255)   |           | not null |
 transcode_key            | character varying(48)    |           | not null |
 transcode_error          | text                     |           | not null |
 transcode_time_addjob    | timestamp with time zone |           |          |
 transcode_time_startwork | timestamp with time zone |           |          |
 transcode_time_success   | timestamp with time zone |           |          |
 transcode_time_error     | timestamp with time zone |           |          |
 transcode_final_bitrate  | integer                  |           | not null |
Indexes:
    "transcode_key_idx" btree (transcode_key)
    "transcode_name_key" UNIQUE, btree (transcode_image_name, transcode_key)
    "transcode_time_inx" btree (transcode_time_addjob, transcode_time_startwork, transcode_time_success, transcode_time_error)

> 3. Do you notice any consistent pattern here? For example, are foreign
> keys involved?

It does not look like it:

[xiatian@freebsd ~]$ pg_dump -U azurlane_wiki -d azurlane_wiki -s | grep -i FOREIGN
    ADD CONSTRAINT account_credentials_acd_user_fkey FOREIGN KEY (acd_user) REFERENCES mediawiki.mwuser(user_id) ON DELETE SET NULL;
    ADD CONSTRAINT account_requests_acr_user_fkey FOREIGN KEY (acr_user) REFERENCES mediawiki.mwuser(user_id) ON DELETE SET NULL;
    ADD CONSTRAINT cu_changes_cuc_page_id_fkey FOREIGN KEY (cuc_page_id) REFERENCES mediawiki.page(page_id) ON DELETE SET NULL;
    ADD CONSTRAINT cu_changes_cuc_user_fkey FOREIGN KEY (cuc_user) REFERENCES mediawiki.mwuser(user_id) ON DELETE SET NULL;
    ADD CONSTRAINT cu_log_cul_target_id_fkey FOREIGN KEY (cul_target_id) REFERENCES mediawiki.mwuser(user_id) ON DELETE SET NULL;
    ADD CONSTRAINT cu_log_cul_user_fkey FOREIGN KEY (cul_user) REFERENCES mediawiki.mwuser(user_id) ON DELETE SET NULL;
    ADD CONSTRAINT revision_rev_page_fkey FOREIGN KEY (rev_page) REFERENCES mediawiki.page(page_id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED;
    ADD CONSTRAINT user_newtalk_user_id_fkey FOREIGN KEY (user_id) REFERENCES mediawiki.mwuser(user_id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED;

> 4. How do the table or tables tend to get updated?

Regarding page:
New rows are added rarely, usually once a week or two, in small batches 
(separate transactions).
The page_title field is almost never updated (pages are moved extremely 
rarely).
Updates to other fields, like page_len, happen multiple times daily, 
during page edits, and sometimes happen in large batches but as separate 
transactions.

Regarding transcode:
I assume inserts happen with each file upload, often in large batches 
(separate transactions). Updates happen whenever a transcode job runs, 
which is usually also in batches.

I can check the wiki's recent changes and upload log for more precise data.

> Also:
> 5. Do you use CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY much?

I assume not; `find . -name "*.php" | xargs grep -i "CONCURRENTLY"`
against MediaWiki's includes/ gives no SQL results, only source-code comments.

> If my hypothesis is true, then we might expect problems to not even go
> away following a REINDEX. Or maybe they would go away sometimes, but
> not other times.

REINDEX *seems* to have helped in my case (tested with both 
page_main_title and the three that showed up in pg_amcheck output). I am 
considering writing a script that would SELECT COUNT(*) for each 
page_title to detect the next possible page_main_title index corruption.
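
(Something along these lines is what I have in mind -- a sketch, not a
tested script. Since "name_title" is UNIQUE on (page_namespace,
page_title), any title reported more than once would indicate
corruption; enable_seqscan may need to be off so that the planner
actually reads through the suspect index:

set enable_seqscan = off;
select page_title, count(*)
from mediawiki.page
where page_namespace = 0
group by page_title
having count(*) > 1;
)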

-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Wed, Oct 27, 2021 at 12:00 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> page and transcode.

Thanks!

One more question: Did you use pg_restore with -j (jobs) to make the
restore faster?

Just one more request: Can I have some more pages from the same
indexes? You have already provided me with
"mediawiki.transcode_key_idx.block1.page". But could you now also
provide me with blocks 2, 8, and 16 from the same index? This is a
pretty random choice. I just want to see if they all contain corrupt
posting list tuples, like page 1 (I suspect that they all will).

Note that I pushed some additional hardening for posting list splits
today. This will be in Postgres 14.1. The new code won't fix the
problem, but it will detect problems earlier, limiting the damage.

> > If my hypothesis is true, then we might expect problems to not even go
> > away following a REINDEX. Or maybe they would go away sometimes, but
> > not other times.
>
> REINDEX *seems* to have helped in my case (tested with both
> page_main_title and the three that showed up in pg_amcheck output). I am
> considering writing a script that would SELECT COUNT(*) for each
> page_title to detect the next possible page_main_title index corruption.

I took another look at your 3 page images today. I noticed some new things.

Out of your 3 pages, I saw that 2 pages were the leftmost leaf page in
the indexes (mediawiki.transcode_key_idx.block1.page and
mediawiki.page_len.block1.page), and 1 was a very low block number
(block 14 from mediawiki.page_redirect_namespace_len.block14.page).
Also, the amount of posting list corruption on each page was higher
than I first realized. For example, almost all of the tuples on the
page mediawiki.page_len.block1.page had duplicate TIDs (corruption).

We know that the underlying table (for 2 out of the 3 indexes) is at
least ~720 blocks, so these are not tiny indexes. And yet amcheck
typically complains about the first leaf page it looks at, or not at
all. So the particular indexes I've taken a look at are probably
completely full of corrupt posting lists. It seems as if most indexes
have no detectable corruption at all, while at least a few are almost
totally corrupt -- it's either one extreme or the other.

Another thing that I noticed today is that all 3 of the pages (from 3
indexes) looked like they have probably never had a regular
deduplication pass, even once. It's also unlikely that they were ever
split. I can tell because the limit on posting list size is different
during CREATE INDEX. The pages looked like there were a few posting
list splits, which is normal when you have some non-HOT updates, and
probably not relevant. These pages look like CREATE INDEX was run
recently, despite all the corruption. Plus Herman (CC'd) has said that
sometimes CREATE INDEX doesn't correct his own similar problem on
Postgres 14.

And so this still very much looks like a bug in CREATE INDEX,
somewhere. This suspicion started with the fact that corruption
appeared concentrated in only 2 (really just 1) tables.

I'll take another guess: I wonder if commit 56788d21 ("Allocate
consecutive blocks during parallel seqscans") is somehow causing
parallel CREATE INDEX to produce wrong results. The author (David
Rowley) is CC'd. Does a bug in that commit seem like it might explain
this problem, David? Might parallel workers in a parallel index build
somehow become confused about which worker is supposed to scan which
heap block, leading to duplicate TIDs in the final index?

I notice that the faulty duplicate TIDs tend to have high-ish heap
block offset numbers, and often appear in groups of 3 or even 4
duplicates. Does that seem like it might be consistent with my
(admittedly not strong) theory about commit 56788d21 being behind all
this?

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Wed, Oct 27, 2021 at 6:36 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Just one more request: Can I have some more pages from the same
> indexes? You have already provided me with
> "mediawiki.transcode_key_idx.block1.page". But could you now also
> provide me with blocks 2, 8, and 16 from the same index? This is a
> pretty random choice. I just want to see if they all contain corrupt
> posting list tuples, like page 1 (I suspect that they all will).

Also, page 0 of the table "page" itself would be useful. The same
pageinspect recipe will work with a table.

Many thanks
-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 28.10.2021 4:36, Peter Geoghegan wrote:
> On Wed, Oct 27, 2021 at 12:00 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> One more question: Did you use pg_restore with -j (jobs) to make the
> restore faster?

I am afraid I have not used pg_restore (as far as I know), unless psql 
replicates its behaviour and/or calls it. My typical procedure is 
`pg_dump[all] -U … -d … > file.sql`, followed by `psql -U … -d … < 
file.sql`. It did not seem like psql used any kind of parallel query 
execution, judging by the console output.

> Just one more request: Can I have some more pages from the same
> indexes? You have already provided me with
> "mediawiki.transcode_key_idx.block1.page". But could you now also
> provide me with blocks 2, 8, and 16 from the same index? This is a
> pretty random choice. I just want to see if they all contain corrupt
> posting list tuples, like page 1 (I suspect that they all will).

Despatched in a separate e-mail along with page 0 from "mediawiki.page" 
requested in your follow-up message.

> Note that I pushed some additional hardening for posting list splits
> today. This will be in Postgres 14.1. The new code won't fix the
> problem, but it will detect problems earlier, limiting the damage.

Would you recommend applying the patch to my ‘production’ instance, or
should I keep it as is to see if that undetectable corruption case
(leading to weird SELECT etc. results) shows up again?

> These pages look like CREATE INDEX was run
> recently, despite all the corruption. Plus Herman (CC'd) has said that
> sometimes CREATE INDEX doesn't correct his own similar problem on
> Postgres 14.

CREATE INDEX was run during the import of the 13.3-generated dump; there 
were no schema changes since. The transcode table gets updated every 
time an audio file is uploaded, which happened quite a few times since 
the upgrade.

-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 28.10.2021 7:21, Peter Geoghegan wrote:
> Also, page 0 of the table "page" itself would be useful. The same
> pageinspect recipe will work with a table.

I forgot to mention: I have the data directory from the PostgreSQL 13.3 
instance that ran inside the same jail on the same FreeBSD system since 
May, so if that might be useful for the investigation, I can pull it 
into another VM.

-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
David Rowley
Date:
On Thu, 28 Oct 2021 at 14:36, Peter Geoghegan <pg@bowt.ie> wrote:
> I'll take another guess: I wonder if commit 56788d21 ("Allocate
> consecutive blocks during parallel seqscans") is somehow causing
> parallel CREATE INDEX to produce wrong results. The author (David
> Rowley) is CC'd. Does a bug in that commit seem like it might explain
> this problem, David? Might parallel workers in a parallel index build
> somehow become confused about which worker is supposed to scan which
> heap block, leading to duplicate TIDs in the final index?

I stared at that code for a while and didn't really see how it could
be broken, unless the atomics implementation on that machine were
broken.  Thomas and I had a look at that earlier and, on his FreeBSD
machine, it uses the arch-x86.h implementation of
pg_atomic_fetch_add_u64_impl().

I also noted that pg_atomic_fetch_add_u64() is not really getting much
use -- regress.c is the only other caller.  However, I really doubt
that this is the issue here.

Just to see what it would look like if this was broken, I went and
mocked up such a bug by adding the following code just above "return
page;" at then of table_block_parallelscan_nextpage:

if (page == 1000)
    page = 999;

I then did:

create table b (b int not null, t text not null);
insert into b select x,x::text from generate_series(1,200000)x;
set max_parallel_workers_per_gather=0;
select sum(b), sum(length(t)) from b;
set max_parallel_workers_per_gather=2;
select sum(b), sum(length(t)) from b;

I noted that I get different results between the parallel and
non-parallel query due to page 999 being read twice.  I then created
the following index:

set max_parallel_maintenance_workers = 2;
create index on b(t);

If I run a query to perform an index scan to find any value of "t"
that's on page 999, then I get:

postgres=# select ctid,* from b where t = '185050';
   ctid   |   b    |   t
----------+--------+--------
 (999,54) | 185050 | 185050
 (999,54) | 185050 | 185050
(2 rows)

We just get the same tid twice in the index. That's quite different
from another value of "t" getting into the list of tids for '185050'.

David



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 12:34 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> > Just one more request: Can I have some more pages from the same
> > indexes? You have already provided me with
> > "mediawiki.transcode_key_idx.block1.page". But could you now also
> > provide me with blocks 2, 8, and 16 from the same index? This is a
> > pretty random choice. I just want to see if they all contain corrupt
> > posting list tuples, like page 1 (I suspect that they all will).
>
> Despatched in a separate e-mail along with page 0 from "mediawiki.page"
> requested in your follow-up message.

Thanks!

> > Note that I pushed some additional hardening for posting list splits
> > today. This will be in Postgres 14.1. The new code won't fix the
> > problem, but it will detect problems earlier, limiting the damage.
>
> Would you recommend applying the patch to my ‘production’ instance or
> should keep it as is to see if that undetectable corruption case
> (leading to weird SELECT etc. results) shows up again?

Yes, that's definitely a good idea, if you have a convenient way to
build the REL_14_STABLE branch from source (or you can just
wait for 14.1). The new defensive checks will at least give you more
information about any problems that happen in the B-Tree code. It will
also limit possible harms.

It would be great from my point of view if you had the extra checks
today, because it might lead to better information about what's really
going on here.

> CREATE INDEX was run during the import of the 13.3-generated dump; there
> were no schema changes since. The transcode table gets updated every
> time an audio file is uploaded, which happened quite a few times since
> the upgrade.

Got it. Thanks.

--
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 1:04 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> I forgot to mention: I have the data directory from the PostgreSQL 13.3
> instance that ran inside the same jail on the same FreeBSD system since
> May, so if that might be useful for the investigation, I can pull it
> into another VM.

I know I keep saying it's the last thing, but...I need more pages!

Can you please provide me with additional pages from the heap relation
"page", as follows: 5, 7, 12, 14, 16, 21, 22, 26

Your heap page (page 0 of the "page" heap relation) was very
interesting. I saw that the TID that is incorrectly repeated in the
index "page_len" is marked LP_UNUSED in the heap page. It's one of
only two such TIDs/line pointers on your heap/table page, out of 61
total, so I doubt that this is a coincidence. That's why I'm asking
for more heap pages -- I would like to verify that this is a
consistent thing, and not just a remarkable coincidence.

Now this is looking like a problem in VACUUM (pruning?), not a CREATE
INDEX thing. It looks like somehow an item that should be LP_DEAD ends
up being LP_UNUSED during pruning. I have CC'd Andres, to get his
thoughts on this.

This explanation fits quite well, because:

1. Even with every verification option enabled, amcheck only ever
complains about duplicate TIDs (the exact details vary in
not-very-interesting ways). I would expect to only see this if my new
hypothesis was correct, because in practice, with many workloads,
unused line pointers tend to get reused by later non-HOT updates that
create new versions without changing the values of indexed columns.
Because amcheck doesn't exhaustively verify that the index has no
duplicate TIDs, you're only going to see complaints about duplicates
that happen to be duplicated on all key values, up to and including
the heap TID itself.

2. If this theory was correct, then we'd occasionally notice TIDs that
are recycled by totally unrelated logical rows (or maybe the key
values do change). This is what we see in the original problem report,
where the index seems to think that 'Isokaze' and 'Ying_Swei' are the
same value. The TID originally *did* point to some version that had
the correct value, but since VACUUM never removed the TID from the
index (because it was never made LP_DEAD in the heap by pruning when
it should have been), the "pointed-to value" changed without the index
having the obsolete TID removed.

I am suggesting that 1 and 2 are the same thing, really. They look a
little different, but that's only due to limitations in our tooling.
In addition to the "amcheck cannot reliably detect duplicate TIDs in
an index" issue I mentioned, there are also heapallindexed limitations
to consider. We should not expect the heapallindexed stuff to detect
the case where an index has TIDs that currently point to LP_UNUSED
items in the heap -- heapallindexed verification can only detect the
absence of a needed index tuple, not the presence of a faulty/corrupt
index tuple. That's probably why heapallindexed doesn't detect one
single problem here, even though the original complaint describes
exactly the kind of scenario you might expect heapallindexed to catch.

3. Multiple indexes on the same table "independently" have very
similar corruption.

I noticed that the heap TID (21,41) is duplicated in posting lists in
both the page_redirect_namespace_len index (on block 14, posting list
tuple at page offset 13), as well as the page_len index (on block 1,
posting list tuple at page offset 15). If this was just a coincidence
(if the problem was just in nbtree) then it would be a remarkable
coincidence.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 2:50 AM David Rowley <dgrowleyml@gmail.com> wrote:
> I stared at that code for a while and didn't really see how it could
> be broken, unless the atomics implementation on that machine were
> broken.  Thomas and I had a look at that earlier and, on his FreeBSD
> machine, it uses the arch-x86.h implementation of
> pg_atomic_fetch_add_u64_impl().

Thank you for going through that exercise. I now strongly doubt that
it's CREATE INDEX at all.

My suspicion is now falling on the snapshot scalability work. And some
interaction with VACUUM. Probably only with shared row locks and
concurrency. More on this later.

> We just get the same tid twice in the index. That's quite different
> from another value of "t" getting into the list of tids for '185050'.

FWIW I thought that it might have been possible for the index to
become even more corrupt, due to a lack of defensive measures inside
nbtree.  But I now see that that can't have been it.

Apologies for the noise.
-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 10:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Now this is looking like a problem in VACUUM (pruning?), not a CREATE
> INDEX thing. It looks like somehow an item that should be LP_DEAD ends
> up being LP_UNUSED during pruning. I have CC'd Andres, to get his
> thoughts on this.

Further analysis based on lots more heap pages provided by Kamigishi
Rei (thanks a lot!) strongly supports this theory. The problem looks
like an interaction between the snapshot scalability work, VACUUM, and
shared row locks. I cannot tell for sure if the table in question has
actually allocated MultiXacts in its lifetime, but it has certainly
had quite a few shared locks. Note that the other people that had
similar complaints about Postgres 14 all used foreign keys on affected
tables.

I'm attaching my personal notes on this. They have a little
commentary, but are mostly useful because they outline the exact ways
in which the data is corrupt, which is pretty tedious to put together
manually.

There are some very clear patterns here:

* Most of the heap pages I've looked at have rows that were never
updated or locked. There are usually 2 or 3 such tuples on each heap
page, at least among those known to be corrupt -- 1 or 2 of them
usually tie back to corruption in the index.

* Most individual duplicated-in-index heap TIDs point to heap tuples
that are HEAP_XMAX_KEYSHR_LOCK|HEAP_XMAX_LOCK_ONLY. These heap tuples
have the same xmin and xmax.

* The transaction ID 365637 is very over-represented, appearing in
several corrupt heap tuple headers, located across several heap pages.

* Its "neighbor" transaction ID is 365638, which appears once more. To
me this suggests some kind of confusion with an OldestXmin style
cutoff during VACUUM.

* As suspected, there are a smaller number of TIDs in the index that
point to LP_UNUSED items in the heap -- a form of corruption distinct
from the more noticeable duplicate TIDs (superficially distinct, at
least). These aren't usually duplicated in the index, though they can
be.
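
For reference, the tuple-header patterns above can be pulled out with
the pageinspect extension (a sketch; heap_tuple_infomask_flags()
decodes the raw infomask bits, and the block number is just an
example):

  SELECT lp, lp_flags, t_xmin, t_xmax,
         (heap_tuple_infomask_flags(t_infomask, t_infomask2)).raw_flags
  FROM heap_page_items(get_raw_page('page', 0))
  WHERE t_infomask IS NOT NULL;

Tuples whose raw_flags contain both HEAP_XMAX_KEYSHR_LOCK and
HEAP_XMAX_LOCK_ONLY are the share-locked, lock-only cases described
above; the rows filtered out here (NULL t_infomask) are the LP_UNUSED
and LP_REDIRECT line pointers.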

This all but confirms that the original complaint was in fact just a
result of a TID/item pointer being recycled in the heap "too early".
It also explains why amcheck's heapallindexed option didn't ever
complain about any index in the whole database (only the index
structure itself looked corrupt).

-- 
Peter Geoghegan

Attachments

Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-10-28 10:16:35 -0700, Peter Geoghegan wrote:
> Your heap page (page 0 of the "page" heap relation) was very
> interesting. I saw that the TID that is incorrectly repeated in the
> index "page_len" is marked LP_UNUSED in the heap page. It's one of
> only two such TIDs/line pointers on your heap/table page, out of 61
> total, so I doubt that this is a coincidence. That's why I'm asking
> for more heap pages -- I would like to verify that this is a
> consistent thing, and not just a remarkable coincidence.

This makes me wonder if the issue could be that we're losing writes / that
something is reading old page versions (e.g. due to a filesystem bug). If both
heap and index are vacuumed, but the index write is lost, this'd be what we
see, right?

Another way this could happen is if we got the wrong relation size for either
index or table, and a vacuum scan doesn't scan the whole table or index.


I've not yet read the whole thread, but if not done, it seems like a good idea
to use pg_waldump and grep for changes to the relevant heap / index
pages. That might give us more information about what could have happened.
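
Something along these lines, perhaps (just a sketch -- the segment
names are placeholders, and the tablespace/database/relfilenode triple
and block number need to be substituted once a corrupt page has been
pinned down):

  pg_waldump 000000010000000200000080 0000000100000002000000A0 \
      | grep 'rel 1663/19243/19560 blk 540'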


> Now this is looking like a problem in VACUUM (pruning?), not a CREATE
> INDEX thing. It looks like somehow an item that should be LP_DEAD ends
> up being LP_UNUSED during pruning. I have CC'd Andres, to get his
> thoughts on this.

Hm.

There were a fair bit of changes around the separation between heap and index
vacuuming in 14. I wonder if there's potentially something broken around
repeatedly vacuuming the heap without doing index vacuums or such.

It's also possible that there's something wrong in that darned path that
handles recently-dead tuples.

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-10-28 14:18:32 -0700, Peter Geoghegan wrote:
> Note that the other people that had similar complaints about Postgres 14 all
> used foreign keys on affected tables.

Do you have links to the other reports?


> * The transaction ID 365637 is very over-represented, appearing in
> several corrupt heap tuple headers, located across several heap pages.
> 
> * Its "neighbor" transaction ID is 365638, which appears once more. To
> me this suggests some kind of confusion with an OldestXmin style
> cutoff during VACUUM.

I'm not quite following this bit, could you expand on that?

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Thomas Munro
Date:
On Fri, Oct 29, 2021 at 10:31 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-10-28 10:16:35 -0700, Peter Geoghegan wrote:
> > Your heap page (page 0 of the "page" heap relation) was very
> > interesting. I saw that the TID that is incorrectly repeated in the
> > index "page_len" is marked LP_UNUSED in the heap page. It's one of
> > only two such TIDs/line pointers on your heap/table page, out of 61
> > total, so I doubt that this is a coincidence. That's why I'm asking
> > for more heap pages -- I would like to verify that this is a
> > consistent thing, and not just a remarkable coincidence.
>
> This makes me wonder if the issue could be that we're losing writes / that
> something is reading old page versions (e.g. due to a filesystem bug). If both
> heap and index are vacuumed, but the index write is lost, this'd be what we
> see, right?

Relevant to this hypothesis, it's interesting that FreeBSD 13.0 is the
first release that switched to OpenZFS, a big rebase to a common ZFS
implementation shared with eg Ubuntu (whereas previously multiple
groups targeting different OSes maintained different forks deriving
from OpenSolaris code).  I see there is another suspected related
report[1], and it's on Ubuntu, but filesystem is not mentioned.

[1] https://www.postgresql.org/message-id/flat/8CDB73C1-E3AF-40A6-BA81-8AFE174C6402%40verschooten.net



Re: BUG #17245: Index corruption involving deduplicated entries

From
Pavel Borisov
Date:
One hypothesis just out of thin air:
If a heap page's visibility map bit is wrongly set to all_visible, an
index-only scan can give duplicates (in effect new and old versions of
the same row), with no apparent corruption of either the index or the
heap relation. It just skips checking the visibility of tuples in the
heap. Not sure it's the case now, but I've seen this previously.

It could be diagnosed by deleting the VM file for the relation; the
error (if it is this error) will disappear.
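
A less destructive check than deleting the VM file: the pg_visibility
extension can report both the bit itself and any tuples that
contradict it (a sketch, assuming the extension is available; table
name from the report):

  CREATE EXTENSION IF NOT EXISTS pg_visibility;
  -- the all-visible/all-frozen bits for one block
  SELECT * FROM pg_visibility_map('mediawiki.page', 0);
  -- TIDs of non-visible tuples on pages marked all-visible
  SELECT * FROM pg_check_visible('mediawiki.page');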

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-10-25 09:43:18 +0100, Andrew Gierth wrote:
> >>>>> "Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> 
>  Alvaro> Do you happen to have WAL traffic for this table and index?
> 
> Unfortunately there's no WAL archiving, and no available WAL files from
> before the state was known to be corrupt. We'll look into enabling
> archiving in case the problem recurs or can be reproduced.

Any chance autovacuum logging was enabled and that those logs still exist? Not
quite as good as still having the WAL, but it still might allow us to
reconstruct the index / vacuum scans, and the horizons used.

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 29.10.2021 0:42, Andres Freund wrote:
>> Unfortunately there's no WAL archiving, and no available WAL files from
>> before the state was known to be corrupt. We'll look into enabling
>> archiving in case the problem recurs or can be reproduced.
> 
> Any chance autovacuum logging was enabled and that those logs still exist? Not
> quite as good as still having the WAL, but it still might allow us to
> reconstruct the index / vacuum scans, and the horizons used.

The issue manifested again earlier today *after* a REINDEX followed by 
enabling WAL replica logging on the 24th of October. I saved a snapshot 
of the filesystem holding the data directory. Would that be useful for 
further analysis?

(The daily vacuum runs have not been disabled so there have been at 
least 3 since.)

-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-10-29 00:46:29 +0300, Kamigishi Rei wrote:
> The issue manifested again earlier today *after* a REINDEX followed by
> enabling WAL replica logging on the 24th of October. I saved a snapshot of
> the filesystem holding the data directory. Would that be useful for further
> analysis?

Yes, that's *quite* useful.  I assume you can't just share that snapshot?

Once we've identified an affected heap and index page with the corruption, we
should use pg_waldump to scan for all changes to that table.

Do you have the log file(s) from between the 24th and now? That might give us
a good starting point for the LSN range to scan.

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 2:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Relevant to this hypothesis, it's interesting that FreeBSD 13.0 is the
> first release that switched to OpenZFS, a big rebase to a common ZFS
> implementation shared with eg Ubuntu (whereas previously multiple
> groups targeting different OSes maintained different forks deriving
> from OpenSolaris code).  I see there is another suspected related
> report[1], and it's on Ubuntu, but filesystem is not mentioned.

Anything is possible. But Kamigishi Rei has said that this database
has never had a hard crash or unclean shut down, which I definitely
believe. Also, they are using ECC on a Xeon processor. This is the
kind of hardware that is generally assumed to be very reliable.

Kamigishi Rei has set an exemplary example of how to report a bug to
an open source community. I want to thank him again. Thanks!

A second similar complaint from Herman Verschooten on Slack didn't
mention ZFS at all. A third similar-seeming report on Slack was from
somebody named Brandon Ros, who used Ubuntu (I believe 20.04, like
Herman Verschooten). Also no indication that ZFS was used.

I find it slightly hard to believe that it's ZFS, simply because all 3
complaints involve Postgres 14. And have a lot of common factors. For
example, Herman also used foreign keys -- a lot of users never bother
with them. And like Kamigishi Rei, Herman found that a REINDEX (or was
it VACUUM FULL?) seemingly made the problem go away.

Brandon Ros gave less information (I never talked to him directly), but
he did say that he saw extensive corruption, which this looks just like.
Most of the indexes are not corrupt at all, but a small number
(usually all on the same one or two tables) are very corrupt indeed --
even though the usual issues with collation could be ruled out. I
don't want to overstate my confidence here (I could easily be wrong),
but this all seems to add up to a pattern.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 3:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I find it slightly hard to believe that it's ZFS, simply because all 3
> complaints involve Postgres 14. And have a lot of common factors. For
> example, Herman also used foreign keys -- a lot of users never bother
> with them. And like Kamigishi Rei, Herman found that a REINDEX (or was
> it VACUUM FULL?) seemingly made the problem go away.

I mean it would seem to go away, only to return some time later.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 2:36 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-10-28 14:18:32 -0700, Peter Geoghegan wrote:
> > Note that the other people that had similar complaints about Postgres 14 all
> > used foreign keys on affected tables.
>
> Do you have links to the other reports?

You've already found the other one, so no need to tell you where that
is. A third from Brandon Ros was only ever on Slack, but that didn't
add new information IMV. I haven't talked to Brandon myself (just the
other 2).

BTW, it's possible that Brandon wasn't using foreign keys or shared
locks at all -- I didn't verify that part. I can only say for sure
that that was the case with the other 2 reports.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 2:31 PM Andres Freund <andres@anarazel.de> wrote:
> This makes me wonder if the issue could be that we're losing writes / that
> something is reading old page versions (e.g. due to a filesystem bug). If both
> heap and index are vacuumed, but the index write is lost, this'd be what we
> see, right?

Right, but that just doesn't seem to fit. That was the first question I asked.

> Another way this could happen is if we got the wrong relation size for either
> index or table, and a vacuum scan doesn't scan the whole table or index.

I doubt that, since the heap blocks involved include heap block 0. On
the table/indexes actually affected by this, the indexes are riddled
with corruption. But every other table seems fine (at least as far as
anybody knows).

> I've not yet read the whole thread, but if not done, it seems like a good idea
> to use pg_waldump and grep for changes to the relevant heap / index
> pages. That might give us more information about what could have happened.

I think that there is a fairly high likelihood that that alone will be
enough to diagnose the bug.

> There were a fair bit of changes around the separation between heap and index
> vacuuming in 14. I wonder if there's potentially something broken around
> repeatedly vacuuming the heap without doing index vacuums or such.

I did ask myself that question earlier today, but quickly rejected the
idea. There is very little mechanism involved with that stuff. It's
very hard to imagine what could break. The code for this in
lazy_vacuum() is quite simple.

> It's also possible that there's something wrong in that darned path that
> handles recently-dead tuples.

That sounds much more likely to me.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-10-28 15:23:38 -0700, Peter Geoghegan wrote:
> Anything is possible. But Kamigishi Rei has said that this database
> has never had a hard crash or unclean shut down, which I definitely
> believe. Also, they are using ECC on a Xeon processor. This is the
> kind of hardware that is generally assumed to be very reliable.

That wouldn't protect against e.g. a logic bug in ZFS. Given its copy-on-write
nature corruption could very well manifest as seeing an older version of the
data when re-reading data from disk. Which could very well lead to the type of
corruption we're seeing here.

A few years back I tried to help somebody investigate corruption that turned
out to be caused by something roughly along those lines (IIRC several bugs in
ZFS on linux, although I don't remember the details anymore).

Not saying that that is the most likely explanation, just something worth
checking.


> Kamigishi Rei has been an exemplary example of how to report a bug to
> an open source community. I want to thank him again. Thanks!

+1


> A second similar complaint from Herman Verschooten on Slack didn't
> mention ZFS at all. A third similar-seeming report on Slack was from
> somebody named Brandon Ros, who used Ubuntu (I believe 20.04, like
> Herman Verschooten). Also no indication that ZFS was used.
> 
> I find it slightly hard to believe that it's ZFS, simply because all 3
> complaints involve Postgres 14. And have a lot of common factors. For
> example, Herman also used foreign keys -- a lot of users never bother
> with them. And like Kamigishi Rei, Herman found that a REINDEX (or was
> it VACUUM FULL?) seemingly made the problem go away.

Didn't 14 change the logic for when index vacuums are done? That could cause
previously existing issues to manifest with a higher likelihood.


Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 3:48 PM Andres Freund <andres@anarazel.de> wrote:
> That wouldn't protect against e.g. a logic bug in ZFS.

> Not saying that that is the most likely explanation, just something worth
> checking.

True. It's too early to rule that out. Though note that a full
pg_amcheck of the database mostly didn't complain about anything -- it
was just a handful of indexes, associated with just 2 tables. And this
is mediawiki, which has lots of tables. None of the new heapam
verification functionality found any problems (as with the older
index-matches-table heapallindexed stuff).

> Didn't 14 change the logic for when index vacuums are done? That could cause
> previously existing issues to manifest with a higher likelihood.

I don't follow. The new logic that skips index vacuuming kicks in 1)
in an anti-wraparound vacuum emergency, and 2) when there are very few
LP_DEAD line pointers in the heap. We can rule 1 out, I think, because
the XIDs we see are in the low millions, and our starting point was a
database that was upgraded via a dump and reload.

The second criterion for skipping index vacuuming (the "less than 2% of
heap pages have any LP_DEAD items" thing) might well have been hit on
these tables -- it is after all very common. But I don't see how that
could matter. We're never going to get to a code path inside
vacuumlazy.c that sets LP_DEAD items from VACUUM's dead_tuples array
to LP_UNUSED (how could we reach such a code path without also doing
index vacuuming, given the way things are set up inside lazy_vacuum()?).
We're always going to have the opportunity to do index vacuuming with
any left-behind LP_DEAD line pointers in the next VACUUM -- right
after the later VACUUM successfully returns from
lazy_vacuum_all_indexes().

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 2:36 PM Andres Freund <andres@anarazel.de> wrote:
> > * The transaction ID 365637 is very over-represented, appearing in
> > several corrupt heap tuple headers, located across several heap pages.
> >
> > * Its "neighbor" transaction ID is 365638, which appears once more. To
> > me this suggests some kind of confusion with an OldestXmin style
> > cutoff during VACUUM.
>
> I'm not quite following this bit, could you expand on that?

I think it's odd that we use both a GlobalVisState and an OldestXmin
for VACUUM's first pass over the heap as of Postgres 14 (before the
snapshot scalability stuff, we just had OldestXmin). I have in the
past wondered if that might cause problems. For example, how do we
know that GlobalVisState won't ever slightly disagree with OldestXmin?
For example, about which tuples are dead, rather than recently dead?

This scenario reminds me of the tupgone mess that used to exist inside
the code that's now called lazy_scan_prune() in Postgres 14. Actually, I
probably thought of this when working on removing tupgone (which happened
in commit 8523492d4e).

I am of course just guessing. Perhaps this is unfair.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 2:40 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> One hypothesis just out of thin air:
> If a heap page's visibility map bit is wrongly set to all_visible, an
> index-only scan can give duplicates (in effect new and old versions of
> the same row), with no apparent corruption of either the index or the
> heap relation.

That can't be it, because affected physical index pages directly
contain duplicate heap TIDs (and plenty of them). That's never allowed
to happen, no matter what.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Thomas Munro
Date:
This doesn't look too healthy:

CREATE TABLE page (
  page_id SERIAL NOT NULL,
  page_namespace INT NOT NULL,
  page_title TEXT NOT NULL,
  page_restrictions TEXT DEFAULT NULL,
  page_is_redirect SMALLINT DEFAULT 0 NOT NULL,
  page_is_new SMALLINT DEFAULT 0 NOT NULL,
  page_random FLOAT NOT NULL,
  page_touched TIMESTAMPTZ NOT NULL,
  page_links_updated TIMESTAMPTZ DEFAULT NULL,
  page_latest INT NOT NULL,
  page_len INT NOT NULL,
  page_content_model TEXT DEFAULT NULL,
  page_lang TEXT DEFAULT NULL,
  PRIMARY KEY(page_id)
);

INSERT INTO page (page_namespace, page_title, page_is_redirect,
page_is_new, page_random, page_touched, page_latest, page_len) select
case when random() > 0.5 then 1 else 0 end, random()::text, 0, 0,
42.0, now(), 0, 42 from generate_series(1, 170000);

CREATE INDEX page_main_title ON page USING btree(page_title text_pattern_ops)
WHERE (page_namespace = 0);

postgres=# select count(*) from page;
 count
--------
 170000
(1 row)

postgres=# select count(*) from page where page_namespace = 0;
 count
-------
 84934
(1 row)

postgres=# explain select count(*) from page where page_namespace = 0;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Aggregate  (cost=3179.75..3179.76 rows=1 width=8)
   ->  Index Only Scan using page_main_title on page
(cost=0.42..2967.85 rows=84762 width=0)
(2 rows)



Re: BUG #17245: Index corruption involving deduplicated entries

From
Thomas Munro
Date:
On Fri, Oct 29, 2021 at 12:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> This doesn't look too healthy:

Erm.  Sorry, please ignore, I made a mistake.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

It's not the cause of this problem, but I did find a minor issue: the retry
path in lazy_scan_prune() loses track of the deleted tuple count when
retrying.

The retry codepath also made me wonder if there could be problems if we do
FreezeMultiXactId() multiple times due to retry. I think we can end up
creating multiple multixactids for the same tuple (if the members change,
which is likely in the retry path). But that should be fine, I think.


On 2021-10-28 16:04:44 -0700, Peter Geoghegan wrote:
> > Didn't 14 change the logic for when index vacuums are done? That could cause
> > previously existing issues to manifest with a higher likelihood.
>
> I don't follow. The new logic that skips index vacuuming kicks in 1)
> in an anti-wraparound vacuum emergency, and 2) when there are very few
> LP_DEAD line pointers in the heap. We can rule 1 out, I think, because
> the XIDs we see are in the low millions, and our starting point was a
> database that was upgraded via a dump and reload.

Right.


> The second criterion for skipping index vacuuming (the "less than 2% of
> heap pages have any LP_DEAD items" thing) might well have been hit on
> these tables -- it is after all very common. But I don't see how that
> could matter. We're never going to get to a code path inside
> vacuumlazy.c that sets LP_DEAD items from VACUUM's dead_tuples array
> to LP_UNUSED (how could we reach such a code path without also doing
> index vacuuming, given the way things are set up inside lazy_vacuum()?).
> We're always going to have the opportunity to do index vacuuming with
> any left-behind LP_DEAD line pointers in the next VACUUM -- right
> after the later VACUUM successfully returns from
> lazy_vacuum_all_indexes().

Shrug. It doesn't seem that hard to believe that repeatedly trying to prune
the same page could unearth some bugs. E.g. via the heap_prune_record_unused()
path in heap_prune_chain().

Hm. I assume somebody checked and verified that old_snapshot_threshold is not
in use? Seems unlikely, but wrongly entering that heap_prune_record_unused()
path could certainly cause issues like we're observing.

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 6:19 PM Andres Freund <andres@anarazel.de> wrote:
> It's not the cause of this problem, but I did find a minor issue: the retry
> path in lazy_scan_prune() loses track of the deleted tuple count when
> retrying.

Matthias van de Meent (now my coworker) pointed this out several
months back. I don't see any reason to prefer remembering it to not
remembering it. In any case it's not too important, since the retry
behavior is inherently very rare. But we can change it around later,
if you prefer.

> The retry codepath also made me wonder if there could be problems if we do
> FreezeMultiXactId() multiple times due to retry. I think we can end up
> creating multiple multixactids for the same tuple (if the members change,
> which is likely in the retry path). But that should be fine, I think.

That's not possible, because we fully decide what we're going to do
with the page as soon as we break out of the for() loop. The closest
thing to calling FreezeMultiXactId() that can happen before that point
(before we "commit" to a certain course of action for the page) are
calls made to heap_prepare_freeze_tuple(), from inside the loop.

heap_prepare_freeze_tuple() doesn't actually modify anything -- it
just decides what to do later on, in the "nfrozen > 0" critical
section back in lazy_scan_prune().

> Shrug. It doesn't seem that hard to believe that repeatedly trying to prune
> the same page could unearth some bugs. E.g. via the heap_prune_record_unused()
> path in heap_prune_chain().

I wasn't thinking of lazy_scan_prune() before, when you brought up
index vacuuming -- I was thinking of lazy_vacuum(). But FWIW I'm not
sure what you mean here.

I believe that the goto retry logic inside lazy_scan_prune() can
restart and expect a clean slate. Sure, we're pruning again, but is
that appreciably different to a concurrent opportunistic prune that
could happen at almost any time during the VACUUM, anyway? (Well,
maybe it is in that we get the accounting for things like VACUUM
VERBOSE very slightly wrong, but even that's debatable.)

> Hm. I assume somebody checked and verified that old_snapshot_threshold is not
> in use? Seems unlikely, but wrongly entering that heap_prune_record_unused()
> path could certainly cause issues like we're observing.

We should inquire about settings used, on general principle.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 2:46 PM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> The issue manifested again earlier today *after* a REINDEX followed by
> enabling WAL replica logging on the 24th of October. I saved a snapshot
> of the filesystem holding the data directory. Would that be useful for
> further analysis?

Access to the WAL archive would be great. Now might be a good time to
set "log_min_autovacuum_duration=0" in postgresql.conf -- that
information might be very helpful later on. This will appear in the
system logs for PostgreSQL. We'll be particularly interested in the
"page" table, for the obvious reason.

Also, can you please post output of the following on your server:

"select * from pg_settings where setting != boot_val;"

Thanks
-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Oct 28, 2021 at 6:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> heap_prepare_freeze_tuple() doesn't actually modify anything -- it
> just decides what to do later on, in the "nfrozen > 0" critical
> section back in lazy_scan_prune().

Oh, wait. heap_prepare_freeze_tuple() is advertised as not modifying
anything, but that's kind of a lie, since FreezeMultiXactId() is
called and can sometimes allocate a new MultiXactId. The comments
above heap_prepare_freeze_tuple() never got the memo about this,
following one of the big MultiXact bug fixes, "Rework tuple freezing
protocol" (commit 3b97e6823b). My mistake.

Presumably this isn't very different to a hard crash that also
allocates a MultiXactId that never gets used. It should be even rarer
than that. We hardly ever hit the goto path, with or without freezing
or almost-freezing. I see that coverage.postgresql.org says we don't
hit it even once. So I think that this is fine.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 29.10.2021 1:01, Andres Freund wrote:
>> The issue manifested again earlier today *after* a REINDEX followed by
>> enabling WAL replica logging on the 24th of October. I saved a snapshot of
>> the filesystem holding the data directory. Would that be useful for further
>> analysis?
> Yes, that's *quite* useful.  I assume you can't just share that snapshot?

I am afraid it contains personal data (the mwuser table with e-mail 
addresses, passwords, and so on) for multiple different MediaWiki 
instances' databases. I will look into scrubbing that kind of data out 
later today. I assume dropping the other databases from the cluster 
should be fine and will not affect further analysis?

With the personal data scrubbed I will likely be able to provide SSH 
access (with su/sudo available) to the VM if needed, though this will 
take time (I will need to make a DMZ for that VM). Please inform me if 
this would be desirable.

> Once we identified an affected heap and index page with the corruption, we
> should use pg_waldump to scan for all changes to that table.
> 
> Do you have the log file(s) from between the 24th and now? That might give us
> a good starting point for the LSN range to scan.

There are multiple WAL log files, the first of them with the timestamp 
of Oct 25 09:45.

I am currently moving the snapshot over from my server to the VM I made 
for this investigation. I will look into pg_waldump documentation as 
soon as possible; I have not had to deal with WAL logs before.

P. S. To possibly make some things simpler: I am on #postgresql on 
Libera as Remilia (or IijimaYun in case of disconnects) and am generally 
available from 06:30 UTC to around 21:00 UTC.

-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Alexander Kukushkin
Date:
I think we experienced something similar.

Now a few words about our setup:
- AWS, i3.8xlarge
- Ubuntu 18.04
- ext4
- It is a shared database, with 8 clusters in total
- Size of each cluster ~1TB
- Each cluster produces ~3TB of WAL every day (plenty of UPDATEs, about 90% of which are HOT updates).

Corruption was found on all shards, but the list of affected indexes varies a bit from shard to shard.

Database schema:
- mostly PRIMARY or UNIQUE keys
- a couple of non-unique btree indexes
- plenty of foreign keys

The timeline:
2021-10-11 - we did the major upgrade from 9.6 to 14
2021-10-14 - executed reindexdb -a --concurrently, which finished successfully. In order to speed up reindexing we were using PGOPTIONS="-c maintenance_work_mem=64GB -c max_parallel_maintenance_workers=4"
2021-10-25 - I noticed that some of the indexes are corrupted, and these are mostly UNIQUE indexes on int and/or bigint.

After that, I identified affected indexes with amcheck, found and removed duplicated rows, and ran pg_repack on affected tables. The pg_repack was run with max_parallel_maintenance_workers=0.
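
For anyone facing the same cleanup, the duplicate hunt looked roughly
like this (a sketch with placeholder table/column names; index scans
have to be disabled so the corrupt unique index cannot mask the
duplicates):

  SET enable_indexscan = off;
  SET enable_indexonlyscan = off;
  SET enable_bitmapscan = off;
  SELECT id, count(*) FROM some_table GROUP BY id HAVING count(*) > 1;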

Since we keep an archive of WALs and backups only for the past 6 days, it would not be possible to find the respective files that produced the corruption.

As of today (2021-10-29), amcheck doesn't report any problems.

I hope this information could give you some hints.

Regards,
--
Alexander Kukushkin

Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 29.10.2021 4:50, Peter Geoghegan wrote:

> Access to the WAL archive would be great. Now might be a good time to
> set "log_min_autovacuum_duration=0" in postgresql.conf -- that
> information might be very helpful later on. This will appear in the
> system logs for PostgreSQL. We'll be particularly interested in the
> "page" table, for the obvious reason.

It looks like this parameter is named "log_autovacuum_min_duration" 
instead. Done.

> Also, can you please post output of the following on your server:
> "select * from pg_settings where setting != boot_val;"

Attached; this was done before the above.

-- 
K. R.
Attachments

Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 29.10.2021 1:48, Andres Freund wrote:

> That wouldn't protect against e.g. a logic bug in ZFS. Given its copy-on-write
> nature corruption could very well manifest as seeing an older version of the
> data when re-reading data from disk. Which could very well lead to the type of
> corruption we're seeing here.

> Didn't 14 change the logic for when index vacuums are done? That could cause
> previously existing issues to manifest with a higher likelihood.

Note that I ran 13.2 (that came with open_datasync()) and 13.3 (which 
switched back to fdatasync()) on openzfs for almost a year since 
upgrading to 13.0-RELEASE, with no discernible issues. The 13.3 data 
directory has been kept intact since the upgrade to 14.

-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 29.10.2021 10:55, Kamigishi Rei wrote:
>> Do you have the log file(s) from between the 24th and now? That might 
>> give us
>> a good starting point for the LSN range to scan.
>
> There are multiple WAL log files, the first of them with the timestamp 
> of Oct 25 09:45.


The newly manifested issue is caught by pg_amcheck:

btree index "azurlane_wiki.mediawiki.page_main_title":
     ERROR:  item order invariant violated for index "page_main_title"
     DETAIL:  Lower index tid=(17,157) (points to heap tid=(540,5)) 
higher index tid=(17,158) (points to heap tid=(540,5)) page lsn=2/A019DD78.

The weird part about this is that the WAL archive does not seem to 
contain any data for 157 and 158 above (in 1663/19243/274869 blk 17). 
The last two entries are

rmgr: Btree       len (rec/tot):     53/  4885, tx:    2085600, lsn: 2/A0195AE0, prev 2/A01943F0, desc: INSERT_LEAF off 155, blkref #0: rel 1663/19243/274869 blk 17 FPW
rmgr: Btree       len (rec/tot):     72/    72, tx:    2085602, lsn: 2/A019DD30, prev 2/A019DCF0, desc: INSERT_LEAF off 156, blkref #0: rel 1663/19243/274869 blk 17

The WAL file in data14/pg_wal does not have anything related to 157 and 
158 for this filenode/blk as well.

-- 
K. R.




Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 11:36 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> The newly manifested issue is caught by pg_amcheck:
>
> btree index "azurlane_wiki.mediawiki.page_main_title":
>      ERROR:  item order invariant violated for index "page_main_title"
>      DETAIL:  Lower index tid=(17,157) (points to heap tid=(540,5))
> higher index tid=(17,158) (points to heap tid=(540,5)) page lsn=2/A019DD78.

Great!

I'm not surprised to see that it's the page table, once again. It's
not particularly big, right? Are there other tables that are much
larger?

> The weird part about this is that the WAL archive does not seem to
> contain any data for 157 and 158 above (in 1663/19243/274869 blk 17).
> The last two entries are
>
> rmgr: Btree       len (rec/tot):     53/  4885, tx:    2085600, lsn: 2/A0195AE0, prev 2/A01943F0, desc: INSERT_LEAF off 155, blkref #0: rel 1663/19243/274869 blk 17 FPW
> rmgr: Btree       len (rec/tot):     72/    72, tx:    2085602, lsn: 2/A019DD30, prev 2/A019DCF0, desc: INSERT_LEAF off 156, blkref #0: rel 1663/19243/274869 blk 17
>
> The WAL file in data14/pg_wal does not have anything related to 157 and
> 158 for this filenode/blk as well.

If this were a heap relation then that would be true, because the
offset number of a heap item needs to be stable, at least within a "VACUUM
cycle" (otherwise indexes will point to the wrong things, which would
of course be wrong). However, this relation is a B-Tree index, where
TIDs/page offset numbers are not stable at all.

Almost all individual index tuple inserts onto a B-Tree page put the
new index tuple "between" existing index tuples. This will "shift"
whatever index tuples are to the right of the position of the new
tuple. For example, with "INSERT_LEAF off 156", the atomic insert
operation will shift any existing index tuple at page offset 156 to
page offset 157, and any index tuple that was at page offset 157 will
go to page offset 158. And so on.

We don't physically shift the index tuples themselves, but we do shift
the item ID/line pointer array at the start of the page, so it's not
too expensive.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Alvaro Herrera
Date:
On 2021-Oct-29, Kamigishi Rei wrote:

> The weird part about this is that the WAL archive does not seem to contain
> any data for 157 and 158 above (in 1663/19243/274869 blk 17). The last two
> entries are
> 
> rmgr: Btree       len (rec/tot):     53/  4885, tx:    2085600, lsn: 2/A0195AE0, prev 2/A01943F0, desc: INSERT_LEAF off 155, blkref #0: rel 1663/19243/274869 blk 17 FPW
> rmgr: Btree       len (rec/tot):     72/    72, tx:    2085602, lsn: 2/A019DD30, prev 2/A019DCF0, desc: INSERT_LEAF off 156, blkref #0: rel 1663/19243/274869 blk 17
> 
> The WAL file in data14/pg_wal does not have anything related to 157 and 158
> for this filenode/blk as well.

Hmm, I don't remember precisely how index tuple removal works, but maybe
these 157 and 158 entries existed previously and were truncated away.
It might be useful to have a look at the page header in the page image
contained in the FPW for 2/A0195AE0.

pg_waldump doesn't do it, but I have this patch sitting around ...
probably outdated, but it may be a useful starting point for somebody.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/

Attachments

Re: BUG #17245: Index corruption involving deduplicated entries

From
Alvaro Herrera
Date:
On 2021-Oct-29, Peter Geoghegan wrote:

> However, this relation is a B-Tree index, where TIDs/page offset
> numbers are not stable at all.
> 
> Almost all individual index tuple inserts onto a B-Tree page put the
> new index tuple "between" existing index tuples. This will "shift"
> whatever index tuples are to the right of the position of the new
> tuple.

Bah, of course.  Sorry for the noise.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 11:52 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Bah, of course.  Sorry for the noise.

It would be nice if users could follow what's going on in the index by
looking at pg_waldump. Theoretically you could compensate for every
individual change to the page (i.e. track the current page offset
number for each tuple dynamically), but that's just too tedious to
ever really work.

It would be nice if pg_waldump printed the contents of index tuples, a
little like pg_filedump can. That would actually offer Kamigishi Rei a
practical way of following what's going on at a low level. It's a pity
that we don't have some kind of generic facility for dumping a random
index tuple that doesn't require knowledge of the structure of the
index -- knowledge that isn't easy to get into a tool like pg_waldump. It's
usually possible to see what's in the index without needing to apply
metadata about the "shape" of tuples -- the generic tuple headers show
a lot.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 1:12 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> It looks like this parameter is named "log_autovacuum_min_duration"
> instead. Done.

Great.

> > Also, can you please post output of the following on your server:
> > "select * from pg_settings where setting != boot_val;"
>
> Attached; this was done before the above.

Looks like there are no changes that could possibly be relevant to our
investigation, which is not surprising.

Thanks
-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 1:17 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> Note that I ran 13.2 (that came with open_datasync()) and 13.3 (which
> switched back to fdatasync()) on openzfs for almost a year since
> upgrading to 13.0-RELEASE, with no discernible issues. The 13.3 data
> directory has been kept intact since the upgrade to 14.

I now think that this is very likely to be a bug in Postgres 14. Note
that another user with very similar corruption (same symptoms, similar
workload) happened to be using ext4 on Linux.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 29.10.2021 21:45, Peter Geoghegan wrote:
>> btree index "azurlane_wiki.mediawiki.page_main_title":
>>       ERROR:  item order invariant violated for index "page_main_title"
>>       DETAIL:  Lower index tid=(17,157) (points to heap tid=(540,5))
>> higher index tid=(17,158) (points to heap tid=(540,5)) page lsn=2/A019DD78.
> 
> I'm not surprised to see that it's the page table, once again. It's
> not particularly big, right? Are there other tables that are much
> larger?

This table is 15 MB in size and it contains wiki page titles along with 
their namespace IDs and such. The largest table is mediawiki.text, which 
is 807 MB and holds article text. Here are some of the largest tables:

  log_search            | 11 MB
  imagelinks            | 11 MB
  revision_comment_temp | 13 MB
  page                  | 15 MB
  content               | 15 MB
  slots                 | 19 MB
  watchlist             | 24 MB
  revision_actor_temp   | 25 MB
  image                 | 33 MB
  revision              | 37 MB
  logging               | 39 MB
  pagelinks             | 56 MB
  text                  | 807 MB

-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 11:45 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Oct 29, 2021 at 11:36 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> > The newly manifested issue is caught by pg_amcheck:
> >
> > btree index "azurlane_wiki.mediawiki.page_main_title":
> >      ERROR:  item order invariant violated for index "page_main_title"
> >      DETAIL:  Lower index tid=(17,157) (points to heap tid=(540,5))
> > higher index tid=(17,158) (points to heap tid=(540,5)) page lsn=2/A019DD78.

> I'm not surprised to see that it's the page table, once again. It's
> not particularly big, right? Are there other tables that are much
> larger?

BTW, can you look for autovacuum log output for the "page" table from
the logs? Then share it with the list, or at least share it with me
privately?

Thanks
-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 29.10.2021 22:20, Peter Geoghegan wrote:

> BTW, can you look for autovacuum log output for the "page" table from
> the logs? Then share it with the list, or at least share it with me
> privately?

I am afraid the severity level of autovacuum was too low for the default 
syslog configuration on FreeBSD. I changed it just now to log *.* to a 
separate file and I can see autovacuum messages for other databases 
appear there now.

The page table has not been autovacuum'd since the 5th of October (the 
13.3 to 14 upgrade day). The only tables that underwent autovacuum since 
that time were

        relname        |          last_vacuum          |        last_autovacuum
-----------------------+-------------------------------+-------------------------------
 user_properties       | 2021-10-29 12:31:09.849931+09 | 2021-10-22 12:41:37.727299+09
 cargo_pages           | 2021-10-29 12:31:08.925664+09 | 2021-10-24 21:53:27.063671+09
 querycache_info       | 2021-10-29 12:31:11.49013+09  | 2021-10-28 14:53:44.448064+09
 site_stats            | 2021-10-29 12:31:12.296138+09 | 2021-10-29 00:16:27.778438+09
 job                   | 2021-10-29 12:31:11.897731+09 | 2021-10-30 00:46:30.479516+09
 objectcache           | 2021-10-29 12:31:05.71646+09  | 2021-10-30 01:00:33.519683+09

Should I disable the daily vacuum script?

-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 12:37 PM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> I am afraid the severity level of autovacuum was too low for the default
> syslog configuration on FreeBSD. I changed it just now to log *.* to a
> separate file and I can see autovacuum messages for other databases
> appear there now.

Oh well. It's enabled now, and it doesn't seem to be too hard for you
to reproduce the issue, given time. We can just wait a few more days.

> Should I disable the daily vacuum script?

Please don't change anything. The daily VACUUM script is probably
where the problem actually starts.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 29.10.2021 21:36, Kamigishi Rei wrote:
> btree index "azurlane_wiki.mediawiki.page_main_title":
>     ERROR:  item order invariant violated for index "page_main_title"
>     DETAIL:  Lower index tid=(17,157) (points to heap tid=(540,5)) 
> higher index tid=(17,158) (points to heap tid=(540,5)) page 
> lsn=2/A019DD78.

I forgot to include WAL archive grepped for the "page" table's heap page 
540. Note that 540,5 is mentioned below just before VACUUM. `SELECT h.* 
FROM heap_page_items(get_raw_page('page',540)) h;` returns lp_flags=3 
for lp=5 for the snapshot (though I am not sure if lp=5 here is the same 
5 as in the tid=(540,5) tuple).
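
(For anyone decoding lp_flags in that output -- the values come from
itemid.h: 0 = LP_UNUSED, 1 = LP_NORMAL, 2 = LP_REDIRECT, 3 = LP_DEAD.
A narrower variant of the query above:

  SELECT lp, lp_flags, t_ctid
  FROM heap_page_items(get_raw_page('page', 540));

So lp_flags=3 means that line pointer is LP_DEAD.)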

00000001000000020000008D:
rmgr: Heap        len (rec/tot):     59/  7939, tx:    2013796, lsn: 2/8DA7C498, prev 2/8DA7C470, desc: LOCK off 5: xid 2013796: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap        len (rec/tot):     98/    98, tx:    2013796, lsn: 2/8DAAA0A0, prev 2/8DAA8548, desc: UPDATE off 5 xmax 2013796 flags 0x61 KEYS_UPDATED ; new off 22 xmax 2013796, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     56/    56, tx:          0, lsn: 2/8DABF558, prev 2/8DABF528, desc: PRUNE latestRemovedXid 2013796 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     54/    54, tx:    2013798, lsn: 2/8DABF590, prev 2/8DABF558, desc: LOCK off 22: xid 2013798: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     83/    83, tx:    2013798, lsn: 2/8DABF5C8, prev 2/8DABF590, desc: HOT_UPDATE off 22 xmax 2013798 flags 0x60 ; new off 41 xmax 2013798, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 2/8DACCBB0, prev 2/8DACCB88, desc: PRUNE latestRemovedXid 2013798 nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     59/  7939, tx:    2014289, lsn: 2/8DE8A460, prev 2/8DE8A438, desc: LOCK off 41: xid 2014289: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap        len (rec/tot):     99/    99, tx:    2014289, lsn: 2/8DEB5250, prev 2/8DEB36D8, desc: UPDATE off 41 xmax 2014289 flags 0x60 KEYS_UPDATED ; new off 53 xmax 2014289, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 2/8DEC0420, prev 2/8DEC03F0, desc: PRUNE latestRemovedXid 2014289 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     54/    54, tx:    2014291, lsn: 2/8DEC0460, prev 2/8DEC0420, desc: LOCK off 53: xid 2014291: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     82/    82, tx:    2014291, lsn: 2/8DEC0498, prev 2/8DEC0460, desc: HOT_UPDATE off 53 xmax 2014291 flags 0x60 ; new off 41 xmax 2014291, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 2/8DED6A10, prev 2/8DED69E8, desc: PRUNE latestRemovedXid 2014291 nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     59/  7939, tx:    2014784, lsn: 2/8DFB15D0, prev 2/8DFB1598, desc: LOCK off 41: xid 2014784: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap        len (rec/tot):    100/   100, tx:    2014784, lsn: 2/8DFDAD20, prev 2/8DFD9180, desc: UPDATE off 41 xmax 2014784 flags 0x60 KEYS_UPDATED ; new off 57 xmax 2014784, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 2/8DFE5F90, prev 2/8DFE5F60, desc: PRUNE latestRemovedXid 2014784 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     54/    54, tx:    2014786, lsn: 2/8DFE5FD0, prev 2/8DFE5F90, desc: LOCK off 57: xid 2014786: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     82/    82, tx:    2014786, lsn: 2/8DFE6020, prev 2/8DFE5FD0, desc: HOT_UPDATE off 57 xmax 2014786 flags 0x60 ; new off 41 xmax 2014786, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 2/8DFFA420, prev 2/8DFFA3F8, desc: PRUNE latestRemovedXid 2014786 nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540
000000010000000200000090:
rmgr: Heap2       len (rec/tot):     53/  7937, tx:          0, lsn: 2/90CEF528, prev 2/90CEF4E8, desc: VACUUM nunused 3, blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 2/90CF1448, prev 2/90CEF528, desc: VISIBLE cutoff xid 2014786 flags 0x01, blkref #0: rel 1663/19243/19560 fork vm blk 0, blkref #1: rel 1663/19243/19560 blk 540
000000010000000200000093:
rmgr: Heap        len (rec/tot):     59/  7943, tx:    2045060, lsn: 2/93FF90E8, prev 2/93FF90B0, desc: LOCK off 41: xid 2045060: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
000000010000000200000094:
rmgr: Heap        len (rec/tot):     99/    99, tx:    2045060, lsn: 2/94026D80, prev 2/94025150, desc: UPDATE off 41 xmax 2045060 flags 0x61 KEYS_UPDATED ; new off 5 xmax 2045060, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 2/94031F28, prev 2/94031EF8, desc: PRUNE latestRemovedXid 2045060 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     54/    54, tx:    2045062, lsn: 2/94031F68, prev 2/94031F28, desc: LOCK off 5: xid 2045062: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     83/    83, tx:    2045062, lsn: 2/94031FA0, prev 2/94031F68, desc: HOT_UPDATE off 5 xmax 2045062 flags 0x60 ; new off 22 xmax 2045062, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 2/94043C50, prev 2/94043C28, desc: PRUNE latestRemovedXid 2045062 nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540
000000010000000200000097:
rmgr: Heap        len (rec/tot):     73/  8133, tx:    2065110, lsn: 2/97207F18, prev 2/97207EE0, desc: UPDATE off 8 xmax 2065110 flags 0x00 ; new off 41 xmax 2065110, blkref #0: rel 1663/19243/19560 blk 540 FPW, blkref #1: rel 1663/19243/19560 blk 318
rmgr: Heap2       len (rec/tot):     53/  8109, tx:          0, lsn: 2/97ED2598, prev 2/97ED2558, desc: VACUUM nunused 1, blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 2/97ED4560, prev 2/97ED2598, desc: VISIBLE cutoff xid 2065110 flags 0x01, blkref #0: rel 1663/19243/19560 fork vm blk 0, blkref #1: rel 1663/19243/19560 blk 540
0000000100000002000000A0:
rmgr: Heap        len (rec/tot):     59/  8115, tx:    2085600, lsn: 2/A0165A70, prev 2/A0165A38, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap        len (rec/tot):     54/    54, tx:    2085600, lsn: 2/A018E858, prev 2/A018D7D8, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     73/  8237, tx:    2085600, lsn: 2/A018E890, prev 2/A018E858, desc: UPDATE off 22 xmax 2085600 flags 0x03 KEYS_UPDATED ; new off 21 xmax 2085600, blkref #0: rel 1663/19243/19560 blk 328 FPW, blkref #1: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:    2085602, lsn: 2/A019DC20, prev 2/A019BC08, desc: PRUNE latestRemovedXid 2085600 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540

-- 
K. R.




Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 12:52 PM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> I forgot to include WAL archive grepped for the "page" table's heap page
> 540. Note that 540,5 is mentioned below just before VACUUM. `SELECT h.*
> FROM heap_page_items(get_raw_page('page',540)) h;` returns lp_flags=3
> for lp=5 for the snapshot (though I am not sure if lp=5 here is the same
> 5 as in the tid=(540,5) tuple).

Can you send me the heap page privately, please? As well as the index
leaf block 17 from the corrupt index "page_main_title"?

This looks like exactly what we need! We're close.
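
One way to extract the requested pages, as a sketch (pageinspect's
get_raw_page(), base64-encoded so the raw bytes survive mail):

  SELECT encode(get_raw_page('page', 540), 'base64');
  SELECT encode(get_raw_page('page_main_title', 17), 'base64');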

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This looks like exactly what we need! We're close.

Also, it would be good to see everything that pg_waldump shows about
transaction IDs 2014289, 2014291, 2014784, 2014786, 2045060, 2045062,
2065110, 2085600, and 2085602. You can use "pg_waldump --xid=" to make
this easier.
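
For example (a sketch -- one run per transaction ID, executed from the
directory holding the archived segments):

  pg_waldump --xid=2085600 00000001000000020000008D 0000000100000002000000A0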

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-10-29 22:52:39 +0300, Kamigishi Rei wrote:
> On 29.10.2021 21:36, Kamigishi Rei wrote:
> > btree index "azurlane_wiki.mediawiki.page_main_title":
> >     ERROR:  item order invariant violated for index "page_main_title"
> >     DETAIL:  Lower index tid=(17,157) (points to heap tid=(540,5))
> > higher index tid=(17,158) (points to heap tid=(540,5)) page
> > lsn=2/A019DD78.
> 
> I forgot to include WAL archive grepped for the "page" table's heap page
> 540. Note that 540,5 is mentioned below just before VACUUM. `SELECT h.* FROM
> heap_page_items(get_raw_page('page',540)) h;` returns lp_flags=3 for lp=5
> for the snapshot (though I am not sure if lp=5 here is the same 5 as in the
> tid=(540,5) tuple).

Could you search for btree WAL records before the following records? Most
importantly for the corrupted page in the corrupted index, but other ones
might be interesting as well.

> rmgr: Heap2       len (rec/tot):     53/  7937, tx:          0, lsn: 2/90CEF528, prev 2/90CEF4E8, desc: VACUUM nunused 3, blkref #0: rel 1663/19243/19560 blk 540 FPW
> rmgr: Heap2       len (rec/tot):     53/  8109, tx:          0, lsn: 2/97ED2598, prev 2/97ED2558, desc: VACUUM nunused 1, blkref #0: rel 1663/19243/19560 blk 540 FPW

The most likely-seeming explanation currently is that somehow the index
entries pointing to the heap items removed here weren't removed
themselves, subsequently leading to visible corruption once those TIDs
are reused.

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

There's something odd going on with 540,41; see below.


On 2021-10-29 22:52:39 +0300, Kamigishi Rei wrote:
> rmgr: Heap        len (rec/tot):     98/    98, tx:    2013796, lsn: 2/8DAAA0A0, prev 2/8DAA8548, desc: UPDATE off 5 xmax 2013796 flags 0x61 KEYS_UPDATED ; new off 22 xmax 2013796, blkref #0: rel 1663/19243/19560 blk 540

Does a non-HOT update, from 540,5 to 540,22


> rmgr: Heap2       len (rec/tot):     56/    56, tx:          0, lsn: 2/8DABF558, prev 2/8DABF528, desc: PRUNE latestRemovedXid 2013796 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540

This presumably marks 540,5 dead, given that the removed xid is the same.


> rmgr: Heap        len (rec/tot):     83/    83, tx:    2013798, lsn: 2/8DABF5C8, prev 2/8DABF590, desc: HOT_UPDATE off 22 xmax 2013798 flags 0x60 ; new off 41 xmax 2013798, blkref #0: rel 1663/19243/19560 blk 540

HOT of 540,22 (which was 540,5), now at 540,41.


> rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn:
> 2/8DACCBB0, prev 2/8DACCB88, desc: PRUNE latestRemovedXid 2013798
> nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540

Presumably redirecting 540,22 -> 540,41.


> rmgr: Heap        len (rec/tot):     99/    99, tx:    2014289, lsn:
> 2/8DEB5250, prev 2/8DEB36D8, desc: UPDATE off 41 xmax 2014289 flags 0x60
> KEYS_UPDATED ; new off 53 xmax 2014289, blkref #0: rel 1663/19243/19560 blk
> 540

Non-HOT of 540,41 (which was 540,22, 540,5), now at 540,53.


> rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn:
> 2/8DEC0420, prev 2/8DEC03F0, desc: PRUNE latestRemovedXid 2014289
> nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540

Likely marks 540,41 dead


> rmgr: Heap        len (rec/tot):     54/    54, tx:    2014291, lsn:
> 2/8DEC0460, prev 2/8DEC0420, desc: LOCK off 53: xid 2014291: flags 0x00
> LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
> rmgr: Heap        len (rec/tot):     82/    82, tx:    2014291, lsn:
> 2/8DEC0498, prev 2/8DEC0460, desc: HOT_UPDATE off 53 xmax 2014291 flags 0x60
> ; new off 41 xmax 2014291, blkref #0: rel 1663/19243/19560 blk 540

HOT of 540,53, now at 540,41.

Here I am confused. 540,41 was presumably marked dead in 2/8DEC0420, but not
marked unused? So this shouldn't be possible.

What am I missing?


> rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn:
> 2/8DED6A10, prev 2/8DED69E8, desc: PRUNE latestRemovedXid 2014291
> nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540

Likely redirecting 540,53 -> 540,41


> rmgr: Heap        len (rec/tot):     59/  7939, tx:    2014784, lsn:
> 2/8DFB15D0, prev 2/8DFB1598, desc: LOCK off 41: xid 2014784: flags 0x00
> LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
> FPW
> rmgr: Heap        len (rec/tot):    100/   100, tx:    2014784, lsn:
> 2/8DFDAD20, prev 2/8DFD9180, desc: UPDATE off 41 xmax 2014784 flags 0x60
> KEYS_UPDATED ; new off 57 xmax 2014784, blkref #0: rel 1663/19243/19560 blk
> 540
> rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn:
> 2/8DFE5F90, prev 2/8DFE5F60, desc: PRUNE latestRemovedXid 2014784
> nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540
> rmgr: Heap        len (rec/tot):     54/    54, tx:    2014786, lsn:
> 2/8DFE5FD0, prev 2/8DFE5F90, desc: LOCK off 57: xid 2014786: flags 0x00
> LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
> rmgr: Heap        len (rec/tot):     82/    82, tx:    2014786, lsn:
> 2/8DFE6020, prev 2/8DFE5FD0, desc: HOT_UPDATE off 57 xmax 2014786 flags 0x60
> ; new off 41 xmax 2014786, blkref #0: rel 1663/19243/19560 blk 540

Same deal. I don't understand why this is possible?


> rmgr: Heap        len (rec/tot):     59/  8115, tx:    2085600, lsn:
> 2/A0165A70, prev 2/A0165A38, desc: LOCK off 22: xid 2085600: flags 0x00
> LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
> FPW
> rmgr: Heap        len (rec/tot):     54/    54, tx:    2085600, lsn:
> 2/A018E858, prev 2/A018D7D8, desc: LOCK off 22: xid 2085600: flags 0x00
> LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
> rmgr: Heap        len (rec/tot):     73/  8237, tx:    2085600, lsn:
> 2/A018E890, prev 2/A018E858, desc: UPDATE off 22 xmax 2085600 flags 0x03
> KEYS_UPDATED ; new off 21 xmax 2085600, blkref #0: rel 1663/19243/19560 blk
> 328 FPW, blkref #1: rel 1663/19243/19560 blk 540

This is also odd. Why are we locking the same row twice, in the same
transaction?

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-10-29 13:49:45 -0700, Andres Freund wrote:
> > rmgr: Heap        len (rec/tot):     54/    54, tx:    2014291, lsn:
> > 2/8DEC0460, prev 2/8DEC0420, desc: LOCK off 53: xid 2014291: flags 0x00
> > LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
> > rmgr: Heap        len (rec/tot):     82/    82, tx:    2014291, lsn:
> > 2/8DEC0498, prev 2/8DEC0460, desc: HOT_UPDATE off 53 xmax 2014291 flags 0x60
> > ; new off 41 xmax 2014291, blkref #0: rel 1663/19243/19560 blk 540
> 
> HOT of 540,53, now at 540,41.
> 
> Here I am confused. 540,41 was presumably marked dead in 2/8DEC0420, but not
> marked unused? So this shouldn't be possible.
> 
> What am I missing?

Oh. Likely the issue is that heap2_desc() doesn't print the items that were
marked unused.

I'm considering writing a patch that
1) displays the number of tuples marked unused in HEAP2_PRUNE. This might only
   be possible if no FPW was used
2) if a HEAP2_PRUNE or HEAP2_VACUUM isn't an FPW, display the offsets
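
For 1), roughly something like this could go into heap2_desc()'s PRUNE case
(an untested sketch; the offset arrays are only present in the record when
the block data wasn't replaced by a full-page image):

    /* xlrec is the record's xl_heap_prune payload */
    Size          datalen;
    OffsetNumber *offsets = (OffsetNumber *)
        XLogRecGetBlockData(record, 0, &datalen);

    if (offsets != NULL)
    {
        int         nunused = (int) (datalen / sizeof(OffsetNumber)) -
            2 * xlrec->nredirected - xlrec->ndead;

        appendStringInfo(buf, " nunused %d", nunused);
        /* the redirected/dead/unused offsets can be printed from here too */
    }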


For 15 I think it might be worth explicitly storing the number of offsets
marked unused, rather than inferring it. It's hard to believe that the 16
bits for that would be a relevant overhead, and having that more readily
available seems like a significant improvement in debuggability.

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 29.10.2021 23:18, Andres Freund wrote:
> Could you search for btree WAL records before the following records? Most
> importantly for the corrupted page in the corrupted index, but other ones
> might be interesting as well
> 
>> rmgr: Heap2       len (rec/tot):     53/  7937, tx:          0, lsn:
>> 2/90CEF528, prev 2/90CEF4E8, desc: VACUUM nunused 3, blkref #0: rel
>> 1663/19243/19560 blk 540 FPW
>> rmgr: Heap2       len (rec/tot):     53/  8109, tx:          0, lsn:
>> 2/97ED2598, prev 2/97ED2558, desc: VACUUM nunused 1, blkref #0: rel
>> 1663/19243/19560 blk 540 FPW

These are daily vacuum runs (on the 27th and on the 28th) with dozens of 
preceding VACUUM lines for other tables. Should I try to filter out those?

-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-10-30 00:08:25 +0300, Kamigishi Rei wrote:
> On 29.10.2021 23:18, Andres Freund wrote:
> > Could you search for btree WAL records before the following records? Most
> > importantly for the corrupted page in the corrupted index, but other ones
> > might be interesting as well
> > 
> > > rmgr: Heap2       len (rec/tot):     53/  7937, tx:          0, lsn:
> > > 2/90CEF528, prev 2/90CEF4E8, desc: VACUUM nunused 3, blkref #0: rel
> > > 1663/19243/19560 blk 540 FPW
> > > rmgr: Heap2       len (rec/tot):     53/  8109, tx:          0, lsn:
> > > 2/97ED2598, prev 2/97ED2558, desc: VACUUM nunused 1, blkref #0: rel
> > > 1663/19243/19560 blk 540 FPW
> 
> These are daily vacuum runs (on the 27th and on the 28th) with dozens of
> preceding VACUUM lines for other tables. Should I try to filter out those?

I think it's moot for now, because of the discovery that waldump for PRUNE
doesn't show the items marked unused. I'll try to write a patch to change that.

After that it'd mostly be interesting to see just enough records for btree
vacuums not affecting the target table to establish which indexes were
vacuumed. I.e. only records back until a heap VACUUM for a different table
started, and only a single record for each other index. Does that make sense?

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 1:59 PM Andres Freund <andres@anarazel.de> wrote:
> I'm considering writing a patch that
> 1) displays the number of tuples marked unused in HEAP2_PRUNE. This might only
>    be possible if no FPW was used
> 2) if a HEAP2_PRUNE or HEAP2_VACUUM isn't an FPW, display the offsets

> For 15 I think it might be worth explicitly storing the number of offsets
> marked unused, rather than inferring it. It's hard to believe that the 16
> bits for that would be a relevant overhead, and having that more readily
> available seems like a significant improvement in debuggability.

I like that idea.

Attached is output from pg_filedump for heap page 540, with the page
header LSN reading 2/A019DC60. I've included the output both with and
without the -i option.

This doesn't have any of the real user data, which won't matter to you.

-- 
Peter Geoghegan

Attachments

Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
On 2021-10-29 14:30:48 -0700, Andres Freund wrote:
> I think it's moot for now, because of the discovery that waldump for PRUNE
> doesn't show the items marked unused. I'll try to write a patch to change that.

Attached is the quick version.

Attachments

Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 1:49 PM Andres Freund <andres@anarazel.de> wrote:
> > rmgr: Heap        len (rec/tot):     59/  8115, tx:    2085600, lsn:
> > 2/A0165A70, prev 2/A0165A38, desc: LOCK off 22: xid 2085600: flags 0x00
> > LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
> > FPW
> > rmgr: Heap        len (rec/tot):     54/    54, tx:    2085600, lsn:
> > 2/A018E858, prev 2/A018D7D8, desc: LOCK off 22: xid 2085600: flags 0x00
> > LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
> > rmgr: Heap        len (rec/tot):     73/  8237, tx:    2085600, lsn:
> > 2/A018E890, prev 2/A018E858, desc: UPDATE off 22 xmax 2085600 flags 0x03
> > KEYS_UPDATED ; new off 21 xmax 2085600, blkref #0: rel 1663/19243/19560 blk
> > 328 FPW, blkref #1: rel 1663/19243/19560 blk 540
>
> This is also odd. Why are we locking the same row twice, in the same
> transaction?

My guess is that mediawiki is supposed to be usable with at least
MySQL, Postgres, and SQLite, and so uses SELECT FOR UPDATE in ways
that seem excessive to us.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-10-29 15:06:52 -0700, Peter Geoghegan wrote:
> On Fri, Oct 29, 2021 at 1:49 PM Andres Freund <andres@anarazel.de> wrote:
> > > rmgr: Heap        len (rec/tot):     59/  8115, tx:    2085600, lsn:
> > > 2/A0165A70, prev 2/A0165A38, desc: LOCK off 22: xid 2085600: flags 0x00
> > > LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
> > > FPW
> > > rmgr: Heap        len (rec/tot):     54/    54, tx:    2085600, lsn:
> > > 2/A018E858, prev 2/A018D7D8, desc: LOCK off 22: xid 2085600: flags 0x00
> > > LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
> > > rmgr: Heap        len (rec/tot):     73/  8237, tx:    2085600, lsn:
> > > 2/A018E890, prev 2/A018E858, desc: UPDATE off 22 xmax 2085600 flags 0x03
> > > KEYS_UPDATED ; new off 21 xmax 2085600, blkref #0: rel 1663/19243/19560 blk
> > > 328 FPW, blkref #1: rel 1663/19243/19560 blk 540
> >
> > This is also odd. Why are we locking the same row twice, in the same
> > transaction?
> 
> My guess is that mediawiki is supposed to be usable with at least
> MySQL, Postgres, and SQLite, and so uses SELECT FOR UPDATE in ways
> that seem excessive to us.

I was wondering from a different angle:

The two locks are from the same transaction. As far as I remember that code,
we shouldn't actually end up emitting a separate lock record if the lock
level is the same.

Just briefly looking over the code, the second heap_lock_tuple() should get a
TM_BeingModified from HTSU, which should then end up in the
TransactionIdIsCurrentTransactionId(xwait)) branch, resulting in a goto
out_unlocked.

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 30.10.2021 0:59, Andres Freund wrote:
> On 2021-10-29 14:30:48 -0700, Andres Freund wrote:
>> I think it's moot for now, because of the discovery that waldump for PRUNE
>> doesn't show the items marked unused. I'll try to write a patch to change that.
> 
> Attached is the quick version.

Here are WAL archive entries related to page 540 of mediawiki.page, with 
the patch applied:

# grep "rel 1663/19243/19560 blk 540" ~/waldump-everything.txt
rmgr: Heap        len (rec/tot):     59/  7939, tx:    2013796, lsn: 
2/8DA7C498, prev 2/8DA7C470, desc: LOCK off 5: xid 2013796: flags 0x00 
LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 
540 FPW
rmgr: Heap        len (rec/tot):     98/    98, tx:    2013796, lsn: 
2/8DAAA0A0, prev 2/8DAA8548, desc: UPDATE off 5 xmax 2013796 flags 0x61 
KEYS_UPDATED ; new off 22 xmax 2013796, blkref #0: rel 1663/19243/19560 
blk 540
rmgr: Heap2       len (rec/tot):     56/    56, tx:          0, lsn: 
2/8DABF558, prev 2/8DABF528, desc: PRUNE latestRemovedXid 2013796 
nredirected 0 ndead 1 nunused 0, dead: [5], blkref #0: rel 
1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     54/    54, tx:    2013798, lsn: 
2/8DABF590, prev 2/8DABF558, desc: LOCK off 22: xid 2013798: flags 0x00 
LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     83/    83, tx:    2013798, lsn: 
2/8DABF5C8, prev 2/8DABF590, desc: HOT_UPDATE off 22 xmax 2013798 flags 
0x60 ; new off 41 xmax 2013798, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 
2/8DACCBB0, prev 2/8DACCB88, desc: PRUNE latestRemovedXid 2013798 
nredirected 1 ndead 0 nunused 0, redirected: [22->41], blkref #0: rel 
1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     59/  7939, tx:    2014289, lsn: 
2/8DE8A460, prev 2/8DE8A438, desc: LOCK off 41: xid 2014289: flags 0x00 
LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 
540 FPW
rmgr: Heap        len (rec/tot):     99/    99, tx:    2014289, lsn: 
2/8DEB5250, prev 2/8DEB36D8, desc: UPDATE off 41 xmax 2014289 flags 0x60 
KEYS_UPDATED ; new off 53 xmax 2014289, blkref #0: rel 1663/19243/19560 
blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 
2/8DEC0420, prev 2/8DEC03F0, desc: PRUNE latestRemovedXid 2014289 
nredirected 0 ndead 1 nunused 1, unused: [41], dead: [22], blkref #0: 
rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     54/    54, tx:    2014291, lsn: 
2/8DEC0460, prev 2/8DEC0420, desc: LOCK off 53: xid 2014291: flags 0x00 
LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     82/    82, tx:    2014291, lsn: 
2/8DEC0498, prev 2/8DEC0460, desc: HOT_UPDATE off 53 xmax 2014291 flags 
0x60 ; new off 41 xmax 2014291, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 
2/8DED6A10, prev 2/8DED69E8, desc: PRUNE latestRemovedXid 2014291 
nredirected 1 ndead 0 nunused 0, redirected: [53->41], blkref #0: rel 
1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     59/  7939, tx:    2014784, lsn: 
2/8DFB15D0, prev 2/8DFB1598, desc: LOCK off 41: xid 2014784: flags 0x00 
LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 
540 FPW
rmgr: Heap        len (rec/tot):    100/   100, tx:    2014784, lsn: 
2/8DFDAD20, prev 2/8DFD9180, desc: UPDATE off 41 xmax 2014784 flags 0x60 
KEYS_UPDATED ; new off 57 xmax 2014784, blkref #0: rel 1663/19243/19560 
blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 
2/8DFE5F90, prev 2/8DFE5F60, desc: PRUNE latestRemovedXid 2014784 
nredirected 0 ndead 1 nunused 1, unused: [41], dead: [53], blkref #0: 
rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     54/    54, tx:    2014786, lsn: 
2/8DFE5FD0, prev 2/8DFE5F90, desc: LOCK off 57: xid 2014786: flags 0x00 
LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     82/    82, tx:    2014786, lsn: 
2/8DFE6020, prev 2/8DFE5FD0, desc: HOT_UPDATE off 57 xmax 2014786 flags 
0x60 ; new off 41 xmax 2014786, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 
2/8DFFA420, prev 2/8DFFA3F8, desc: PRUNE latestRemovedXid 2014786 
nredirected 1 ndead 0 nunused 0, redirected: [57->41], blkref #0: rel 
1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     53/  7937, tx:          0, lsn: 
2/90CEF528, prev 2/90CEF4E8, desc: VACUUM nunused 3, blkref #0: rel 
1663/19243/19560 blk 540 FPW
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 
2/90CF1448, prev 2/90CEF528, desc: VISIBLE cutoff xid 2014786 flags 
0x01, blkref #0: rel 1663/19243/19560 fork vm blk 0, blkref #1: rel 
1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     59/  7943, tx:    2045060, lsn: 
2/93FF90E8, prev 2/93FF90B0, desc: LOCK off 41: xid 2045060: flags 0x00 
LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 
540 FPW
rmgr: Heap        len (rec/tot):     99/    99, tx:    2045060, lsn: 
2/94026D80, prev 2/94025150, desc: UPDATE off 41 xmax 2045060 flags 0x61 
KEYS_UPDATED ; new off 5 xmax 2045060, blkref #0: rel 1663/19243/19560 
blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 
2/94031F28, prev 2/94031EF8, desc: PRUNE latestRemovedXid 2045060 
nredirected 0 ndead 1 nunused 1, unused: [41], dead: [57], blkref #0: 
rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     54/    54, tx:    2045062, lsn: 
2/94031F68, prev 2/94031F28, desc: LOCK off 5: xid 2045062: flags 0x00 
LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     83/    83, tx:    2045062, lsn: 
2/94031FA0, prev 2/94031F68, desc: HOT_UPDATE off 5 xmax 2045062 flags 
0x60 ; new off 22 xmax 2045062, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 
2/94043C50, prev 2/94043C28, desc: PRUNE latestRemovedXid 2045062 
nredirected 1 ndead 0 nunused 0, redirected: [5->22], blkref #0: rel 
1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     73/  8133, tx:    2065110, lsn: 
2/97207F18, prev 2/97207EE0, desc: UPDATE off 8 xmax 2065110 flags 0x00 
; new off 41 xmax 2065110, blkref #0: rel 1663/19243/19560 blk 540 FPW, 
blkref #1: rel 1663/19243/19560 blk 318
rmgr: Heap2       len (rec/tot):     53/  8109, tx:          0, lsn: 
2/97ED2598, prev 2/97ED2558, desc: VACUUM nunused 1, blkref #0: rel 
1663/19243/19560 blk 540 FPW
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 
2/97ED4560, prev 2/97ED2598, desc: VISIBLE cutoff xid 2065110 flags 
0x01, blkref #0: rel 1663/19243/19560 fork vm blk 0, blkref #1: rel 
1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     59/  8115, tx:    2085600, lsn: 
2/A0165A70, prev 2/A0165A38, desc: LOCK off 22: xid 2085600: flags 0x00 
LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 
540 FPW
rmgr: Heap        len (rec/tot):     54/    54, tx:    2085600, lsn: 
2/A018E858, prev 2/A018D7D8, desc: LOCK off 22: xid 2085600: flags 0x00 
LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap        len (rec/tot):     73/  8237, tx:    2085600, lsn: 
2/A018E890, prev 2/A018E858, desc: UPDATE off 22 xmax 2085600 flags 0x03 
KEYS_UPDATED ; new off 21 xmax 2085600, blkref #0: rel 1663/19243/19560 
blk 328 FPW, blkref #1: rel 1663/19243/19560 blk 540
rmgr: Heap2       len (rec/tot):     58/    58, tx:    2085602, lsn: 
2/A019DC20, prev 2/A019BC08, desc: PRUNE latestRemovedXid 2085600 
nredirected 0 ndead 1 nunused 1, unused: [22], dead: [5], blkref #0: rel 
1663/19243/19560 blk 540


-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

Amit, this looks to be a parallel vacuum related bug, see below.


Whew, we found the bug, I think. Or at least one that can create exactly this
situation.

The problem is that parallel vacuum doesn't have correct handling for small
indexes, when there are other indexes that are big enough to use parallelism.

CREATE TABLE multiple_indexes(id serial primary key, data text, other serial unique);
CREATE INDEX ON multiple_indexes(data);
INSERT INTO multiple_indexes(data) SELECT 1 FROM generate_series(1, 50000);

\di+ multiple_indexes_*
                                              List of relations
┌────────┬────────────────────────────┬───────┬────────┬──────────────────┬─────────────┬───────────────┬─────────┬─────────────┐
│ Schema │            Name            │ Type  │ Owner  │      Table       │ Persistence │ Access method │  Size   │ Description │
├────────┼────────────────────────────┼───────┼────────┼──────────────────┼─────────────┼───────────────┼─────────┼─────────────┤
│ public │ multiple_indexes_data_idx  │ index │ andres │ multiple_indexes │ permanent   │ btree         │ 336 kB  │             │
│ public │ multiple_indexes_other_key │ index │ andres │ multiple_indexes │ permanent   │ btree         │ 1112 kB │             │
│ public │ multiple_indexes_pkey      │ index │ andres │ multiple_indexes │ permanent   │ btree         │ 1112 kB │             │
└────────┴────────────────────────────┴───────┴────────┴──────────────────┴─────────────┴───────────────┴─────────┴─────────────┘
(3 rows)

DELETE FROM multiple_indexes;

postgres[1253065][1]=# SHOW min_parallel_index_scan_size;
┌──────────────────────────────┐
│ min_parallel_index_scan_size │
├──────────────────────────────┤
│ 512kB                        │
└──────────────────────────────┘
(1 row)

I.e. we have two indexes that are big enough to use parallelism, and one that
isn't.


first heap scan:

VACUUM (VERBOSE,PARALLEL 4) multiple_indexes;
INFO:  00000: vacuuming "public.multiple_indexes"
LOCATION:  lazy_scan_heap, vacuumlazy.c:937

index vacuum:
INFO:  00000: launched 1 parallel vacuum worker for index vacuuming (planned: 1)
LOCATION:  do_parallel_vacuum_or_cleanup, vacuumlazy.c:2768
INFO:  00000: scanned index "multiple_indexes_pkey" to remove 50000 row versions
DETAIL:  CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
LOCATION:  lazy_vacuum_one_index, vacuumlazy.c:3057
INFO:  00000: scanned index "multiple_indexes_other_key" to remove 50000 row versions
DETAIL:  CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
LOCATION:  lazy_vacuum_one_index, vacuumlazy.c:3057

second heap scan:
INFO:  00000: table "multiple_indexes": removed 50000 dead item identifiers in 271 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
LOCATION:  lazy_vacuum_heap_rel, vacuumlazy.c:2384
INFO:  00000: table "multiple_indexes": found 50000 removable, 0 nonremovable row versions in 271 out of 271 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 748
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s.
LOCATION:  lazy_scan_heap, vacuumlazy.c:1668
...


So the two bigger indexes were scanned, but the small one was not.


The reason for this is that

begin_parallel_vacuum() does:
    /*
     * Initialize variables for shared index statistics, set NULL bitmap and
     * the size of stats for each index.
     */
    memset(shared->bitmap, 0x00, BITMAPLEN(nindexes));
    for (int idx = 0; idx < nindexes; idx++)
    {
        if (!can_parallel_vacuum[idx])
            continue;

        /* Set NOT NULL as this index does support parallelism */
        shared->bitmap[idx >> 3] |= 1 << (idx & 0x07);
    }

which causes do_parallel_processing() to skip over the index, because
parallel_stats_for() returns NULL due to checking that bitmap.

And do_serial_processing_for_unsafe_indexes() skips over the index because:
        /*
         * We're only here for the unsafe indexes
         */
        if (parallel_processing_is_safe(indrel, lvshared))
            continue;

but the index is actually not unsafe.


Which leaves us with a severely corrupted index. Unfortunately one that
amcheck at this stage will not recognize as corrupted :(. It'll only get
there later, once new rows are inserted at the same TIDs.
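
At that later point, a check like the one below -- against the repro's
index, assuming the amcheck extension is available -- can be expected to
start reporting errors:

    CREATE EXTENSION IF NOT EXISTS amcheck;
    SELECT bt_index_parent_check('multiple_indexes_data_idx',
                                 heapallindexed => true,
                                 rootdescend => true);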

I haven't yet checked whether this is a bug introduced in 14, or whether it
was possible to hit before as well.


Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-10-29 19:37:40 -0700, Andres Freund wrote:
> Amit, this looks to be a parallel vacuum related bug, see below.
> 
> 
> Whew, we found the bug, I think. Or at least one that can create exactly this
> situation.
> [explanation]
> I haven't yet checked whether this is a bug introduced in 14, or whether it
> was possible to hit before as well.

It looks like a v14 issue. I can't reproduce it in 13, and I think a
change in

commit b4af70cb210393c9c8f41643acf6b213e21178e7
Author: Peter Geoghegan <pg@bowt.ie>
Date:   2021-04-05 13:21:44 -0700

    Simplify state managed by VACUUM.

which changed the logic for which relations are done in the leader:

-        /* Process the indexes skipped by parallel workers */
-        if (shared_indstats == NULL ||
-            skip_parallel_vacuum_index(Irel[i], lps->lvshared))
-            vacuum_one_index(Irel[i], &(vacrelstats->indstats[i]), lps->lvshared,
-                             shared_indstats, vacrelstats->dead_tuples,
-                             vacrelstats);
+        /* Skip already-complete indexes */
+        if (shared_istat != NULL)
+            continue;
+
+        indrel = vacrel->indrels[idx];
+
+        /*
+         * We're only here for the unsafe indexes
+         */
+        if (parallel_processing_is_safe(indrel, lvshared))
+            continue;
+

without realizing that the "shared_indstats == NULL || " piece is
important to handle parallel-safe but too-small indexes correctly.
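
In other words, the leader's serial pass has to cover both cases again --
roughly like this (just a sketch using the v14 names, not the actual fix):

    for (int idx = 0; idx < vacrel->nindexes; idx++)
    {
        Relation    indrel = vacrel->indrels[idx];
        LVSharedIndStats *shared_istat = parallel_stats_for(lvshared, idx);

        /*
         * Workers only handle indexes that have a shared stats slot (big
         * enough for parallelism) *and* are parallel safe.  The leader
         * must pick up everything else, not just the unsafe indexes.
         */
        if (shared_istat != NULL &&
            parallel_processing_is_safe(indrel, lvshared))
            continue;

        /* ... vacuum/cleanup this index in the leader, as before ... */
    }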


Going a bit further than this, ISTM that once we decide to use
parallelism for any index, there's no point not using parallelism for
all the parallel-safe indexes...

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 30.10.2021 5:37, Andres Freund wrote:
> The problem is that parallel vacuum doesn't have correct handling for small
> indexes, when there are other indexes that are big enough to use parallelism.

Index sizes in use (pg_total_relation_size), btree only, with only the 
largest, the smallest, and those reported by pg_amcheck listed:

┌───────────────────────────────────┬──────────────┬────────────┐
│               index               │ index_method │ index_size │
├───────────────────────────────────┼──────────────┼────────────┤
│  pl_backlinks_namespace           │ btree        │ 18 MB      │
│  pl_namespace                     │ btree        │ 16 MB      │
│  wl_user                          │ btree        │ 6416 kB    │
│  …                                │ …            │ …          │
│  page_redirect_namespace_len      │ btree        │ 544 kB     │
│  …                                │ …            │ …          │
│  page_len                         │ btree        │ 488 kB     │
│  …                                │ …            │ …          │
│  transcode_key_idx                │ btree        │ 192 kB     │
│  …                                │ …            │ …          │
│  page_main_title                  │ btree        │ 144 kB     │
│  …                                │ …            │ …          │
│  ir_type_value                    │ btree        │ 8192 bytes │
│  ipc_hex_time                     │ btree        │ 8192 bytes │
└───────────────────────────────────┴──────────────┴────────────┘


-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 30.10.2021 7:12, Andres Freund wrote:

>> Whew, we found the bug, I think. Or at least one that can create exactly this
>> situation.
>> [explanation]
>> I haven't yet checked whether this is a bug introduced in 14, or whether it
>> was possible to hit before as well.
> 
> It looks like a v14 issue.
> 
> Going a bit further than this, ISTM that once we decide to use
> parallelism for any index, there's no point not using parallelism for
> all the parallel-safe indexes...

Probably of use: on Andrew Gierth's suggestion I performed `VACUUM 
VERBOSE mediawiki.page` with the snapshot database:

azurlane_wiki=# vacuum verbose mediawiki.page;
INFO:  vacuuming "mediawiki.page"
INFO:  launched 2 parallel vacuum workers for index vacuuming (planned: 2)
INFO:  scanned index "page_pkey" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s
INFO:  scanned index "page_random" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s
INFO:  scanned index "page_redirect_namespace_len" to remove 251 row 
versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  scanned index "name_title" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s
INFO:  scanned index "ts2_page_title" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.05 s
INFO:  table "page": removed 251 dead item identifiers in 215 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.05 s
INFO:  table "page": found 14 removable, 44250 nonremovable row versions 
in 876 out of 912 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 2097658
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.18 s.
INFO:  vacuuming "pg_toast.pg_toast_19560"
INFO:  table "pg_toast_19560": found 0 removable, 0 nonremovable row 
versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 2097658
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Here is the list of indices associated with mediawiki.page:

┌────────────────┬─────────────────────────────┬──────────────┐
│     table      │            index            │ index_method │
├────────────────┼─────────────────────────────┼──────────────┤
│ mediawiki.page │ page_pkey                   │ btree        │
│ mediawiki.page │ name_title                  │ btree        │
│ mediawiki.page │ page_len                    │ btree        │
│ mediawiki.page │ page_main_title             │ btree        │
│ mediawiki.page │ page_mediawiki_title        │ btree        │
│ mediawiki.page │ page_project_title          │ btree        │
│ mediawiki.page │ page_random                 │ btree        │
│ mediawiki.page │ page_redirect_namespace_len │ btree        │
│ mediawiki.page │ page_talk_title             │ btree        │
│ mediawiki.page │ page_user_title             │ btree        │
│ mediawiki.page │ page_utalk_title            │ btree        │
│ mediawiki.page │ ts2_page_title              │ gist         │
└────────────────┴─────────────────────────────┴──────────────┘

VACUUM lists page_redirect_namespace_len; however, pg_amcheck still 
complains about it afterwards. page_len and page_main_title (and others) 
are not listed by VACUUM.

-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 9:12 PM Andres Freund <andres@anarazel.de> wrote:
> It looks like a v14 issue. I can't reproduce it in 13, and I think a
> change in
>
> commit b4af70cb210393c9c8f41643acf6b213e21178e7
> Author: Peter Geoghegan <pg@bowt.ie>
> Date:   2021-04-05 13:21:44 -0700
>
>     Simplify state managed by VACUUM.
>
> which changed the logic for which relations are done in the leader:

> without realizing that the "shared_indstats == NULL || " piece is
> important to handle parallel-safe but too-small indexes correctly.

This was a simple blunder on my part. I didn't intend to change
anything about the behavior of parallel VACUUM, and so this commit was
just about the only one of mine from Postgres 14 that I didn't
consider as a possibility, even once. Mea culpa.

Attached is a draft patch that fixes the problem.

Also attached is a second patch. This adds assertions to
heap_index_delete_tuples() to catch cases where a heap TID in an index
points to an LP_UNUSED item in the heap (which is what this bug looked
like, mostly). It also checks for certain more or less equivalent
inconsistencies: the case where a heap TID in an index points to a
line pointer that's past the end of the heap page's line pointer
array, and the case where a heap TID in an index points directly to a
heap-only tuple.

Putting these new assertions in heap_index_delete_tuples() makes a lot
of sense. The interlocking rules between an index and a heap relation
are very simple here. They're also uniform across all supported index
access methods (of those that support index tuple deletion, currently
just nbtree, hash, and GiST). This is one of the very few code paths
where there is a buffer lock (not just a pin) on the index page held
throughout checking the heap; there is clearly no question of spurious
assertion failures due to a concurrent vacuum. Note that the index
deletion stuff I worked on in Postgres 14 batches-up TIDs in various
ways, and so in practice heap_index_delete_tuples() will tend to pass
over many of the TIDs from the index page, in many important cases
(not so much with GiST and hash right now, but I expect that to change
in the future).

I want to once again thank Kamigishi Rei for her tireless work on
this. If she didn't go to the trouble of providing us with so much
help, then it would have probably taken significantly longer to figure
out what was going on here. Also want to thank Andres.

-- 
Peter Geoghegan

Attachments

Re: BUG #17245: Index corruption involving deduplicated entries

From
Kamigishi Rei
Date:
On 30.10.2021 21:46, Peter Geoghegan wrote:
> 
> Attached is a draft patch that fixes the problem.

Thank you!

Going to install the patched version on the production system to see how 
it fares (and I expect good things).

Tried against the re-extracted snapshot, looks good (everything still 
needs reindexing, obviously, but that is beside the point right now):

azurlane_wiki=# vacuum verbose mediawiki.page;
INFO:  vacuuming "mediawiki.page"
INFO:  launched 2 parallel vacuum workers for index vacuuming (planned: 2)
INFO:  scanned index "page_len" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  scanned index "page_main_title" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.05 s
INFO:  scanned index "page_mediawiki_title" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  scanned index "page_project_title" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  scanned index "page_talk_title" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  scanned index "page_user_title" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  scanned index "page_utalk_title" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  scanned index "page_pkey" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.07 s
INFO:  scanned index "page_redirect_namespace_len" to remove 251 row 
versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.08 s
INFO:  scanned index "page_random" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.14 s
INFO:  scanned index "name_title" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.19 s
INFO:  scanned index "ts2_page_title" to remove 251 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.04 s
INFO:  table "page": removed 251 dead item identifiers in 215 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  index "page_pkey" now contains 46056 row versions in 131 pages
DETAIL:  190 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "name_title" now contains 46056 row versions in 330 pages
DETAIL:  211 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "page_len" now contains 51848 row versions in 61 pages
DETAIL:  681 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "page_main_title" now contains 3569 row versions in 18 pages
DETAIL:  147 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "page_mediawiki_title" now contains 15 row versions in 2 pages
DETAIL:  0 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "page_project_title" now contains 1 row versions in 2 pages
DETAIL:  0 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "page_random" now contains 46056 row versions in 127 pages
DETAIL:  251 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "page_redirect_namespace_len" now contains 47663 row 
versions in 68 pages
DETAIL:  265 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "page_talk_title" now contains 35 row versions in 2 pages
DETAIL:  0 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "page_user_title" now contains 735 row versions in 6 pages
DETAIL:  6 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "page_utalk_title" now contains 44 row versions in 2 pages
DETAIL:  0 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  index "ts2_page_title" now contains 46056 row versions in 211 pages
DETAIL:  251 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  table "page": found 14 removable, 44250 nonremovable row versions 
in 876 out of 912 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 2097658
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.50 s.
INFO:  vacuuming "pg_toast.pg_toast_19560"
INFO:  table "pg_toast_19560": found 0 removable, 0 nonremovable row 
versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 2097658
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

-- 
K. R.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-10-30 11:46:22 -0700, Peter Geoghegan wrote:
> Attached is a draft patch that fixes the problem.

I think it probably is worth adding an error check someplace that verifies
that problems of this kind will be detected with, uh, less effort.

I think it'd also be good to add a test that specifically verifies that
parallel vacuum doesn't have a bug around "parallel worthy" and not "parallel
worthy" indexes. It's too easy a mistake to make, and because visible
corruption is delayed, it's likely that we won't detect such cases.


> Also attached is a second patch. This adds assertions to
> heap_index_delete_tuples() to catch cases where a heap TID in an index
> points to an LP_UNUSED item in the heap (which is what this bug looked
> like, mostly). It also checks for certain more or less equivalent
> inconsistencies: the case where a heap TID in an index points to a
> line pointer that's past the end of the heap page's line pointer
> array, and the case where a heap TID in an index points directly to a
> heap-only tuple.

ISTM that at least a basic version of this is worth doing as a check throwing
an ERROR, rather than an assertion. It's hard to believe this'd be a
significant portion of the cost of heap_index_delete_tuples(), and I think it
would help catch problems a lot earlier.


Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sat, Oct 30, 2021 at 1:42 PM Andres Freund <andres@anarazel.de> wrote:
> I think it probably is worth adding an error check someplace that verifies
> that problems of this kind will be detected with, uh, less effort.

The draft fix is something I wanted to get out quickly to signal that
the bug is definitely going to be fixed soon. I'll need to carefully
think about this some more on Monday, to make sure that it becomes
much harder to break in roughly the same way once again.

> I think it'd also be good to add a test that specifically verifies that
> parallel vacuum doesn't have a bug around "parallel worthy" and not "parallel
> worthy" indexes. It's too easy a mistake to make, and because visible
> corruption is delayed, it's likely that we won't detect such cases.

I agree.

> ISTM that at least a basic version of this is worth doing as a check throwing
> an ERROR, rather than an assertion. It's hard to believe this'd be a
> significant portion of the cost of heap_index_delete_tuples(), and I think it
> would help catch problems a lot earlier.

I like the idea of doing that, but only on master. I think that we
could do this as an ERROR, provided we avoid adding a new inner loop
to heap_index_delete_tuples() --- we can definitely afford that. And
we can even have a nice, detailed error message that blames a
particular tuple from a specific index page, pointing to a specific
heap TID.

Separately, we should add an assertion that catches cases where a TID
in the index points to an LP_REDIRECT line pointer, which does not
point to a heap tuple with storage. That check has much less practical
value, and necessitates adding a new inner loop, which is why an
assert seems good enough to me. (The patch I've posted doesn't have
any of this LP_REDIRECT business, but my v2 revision will.)

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
On 2021-10-30 15:11:18 -0700, Peter Geoghegan wrote:
> On Sat, Oct 30, 2021 at 1:42 PM Andres Freund <andres@anarazel.de> wrote:
> > I think it probably is worth adding an error check someplace that verifies
> > that problems of this kind will be detected with, uh, less effort.
> 
> The draft fix is something I wanted to get out quickly to signal that
> the bug is definitely going to be fixed soon. I'll need to carefully
> think about this some more on Monday, to make sure that it becomes
> much harder to break in roughly the same way once again.

Makes sense.


> > ISTM that at least a basic version of this is worth doing as a check throwing
> > an ERROR, rather than an assertion. It's hard to believe this'd be a
> > significant portion of the cost of heap_index_delete_tuples(), and I think it
> > would help catch problems a lot earlier.
> 
> I like the idea of doing that, but only on master.

Hm. I wonder if it's not actually good to do something like it in 14, given
that we know of a path to have corrupted indexes out there.


> Separately, we should add an assertion that catches cases where a TID
> in the index points to an LP_REDIRECT line pointer, which does not
> point to a heap tuple with storage.

Is it actually guaranteed that that's impossible to happen and wasn't ever
possible? It's not obvious to me that a LP_REDIRECT pointing to an LP_DEAD
tuple would be a real problem.

Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sat, Oct 30, 2021 at 3:18 PM Andres Freund <andres@anarazel.de> wrote:
> Hm. I wonder if it's not actually good to do something like it in 14, given
> that we know of a path to have corrupted indexes out there.

My concern is that people might not be too happy about errors that
occur due to corruption in tuples that are only vaguely related to
their own incoming inserted tuple.

> > Separately, we should add an assertion that catches cases where a TID
> > in the index points to an LP_REDIRECT line pointer, which does not
> > point to a heap tuple with storage.
>
> Is it actually guaranteed that that's impossible to happen and wasn't ever
> possible? It's not obvious to me that a LP_REDIRECT pointing to an LP_DEAD
> tuple would be a real problem.

As far as I know an LP_REDIRECT item can never point to an LP_DEAD
item. As far as I know it has worked that way since HOT first went in.
There certainly seems to be far fewer problems with a rule that says
it can never happen, and so I think we should introduce that rule (or
clear things up, if you prefer).

ISTM that this is all about TID stability for indexes. The only reason
we have LP_DEAD items in heap pages is to have something that reliably
tells index scans that the heap tuple they're looking for is logically
dead (and so we can't recycle TIDs except during VACUUM). Similarly,
the only reason we have LP_REDIRECT items is so that index scans have
lightweight stable forwarding information from the heap page that
contains the HOT chain. The advantage of LP_REDIRECT items is that
they allow pruning to avoid keeping around the original heap tuple
(the tuple that was not a heap-only tuple) after pruning. Also,
pruning doesn't have to "merge line pointers" such that the new first
member of a HOT chain has the same page offset number as the original
first member had -- imagine how messy that would have to be.

VACUUM only ever deletes index tuples from indexes when their
pointed-to TIDs were found to be LP_DEAD stub line pointers in the
first heap pass. Such an LP_DEAD item represents a (former) whole HOT
chain, which could just be a "single tuple degenerate HOT chain", that
never actually had any heap-only tuples (say because all UPDATEs for
the table will modify indexed columns). If an LP_REDIRECT item points
to an LP_DEAD item, then what is VACUUM supposed to do about it when
it comes time to vacuum indexes? Which TID is it supposed to delete
from indexes? The LP_DEAD item, the LP_REDIRECT item that points to
the LP_DEAD item, or both?

While it's okay if the link from a tuple header in a HOT chain points
to an LP_UNUSED item (that just indicates that the chain is "broken"
at that point), it's not okay if a link from an LP_REDIRECT line
pointer points to an LP_UNUSED item -- that's virtually the same
condition as having a TID from an index point directly to an LP_UNUSED
item, which is of course always wrong. We can do no "last tuple's xmax
== current tuple's xmin" validation during chain traversal when the
"last tuple" was actually just an LP_REDIRECT item. And so we need
another rule for LP_REDIRECT items, to compensate. That rule seems to
be: HOT chains cannot be allowed to "break" between the LP_REDIRECT
item and the first tuple with storage.

The only way it could be okay for an LP_REDIRECT item to point to an
LP_DEAD item would be if you knew for sure that the LP_REDIRECT item
would actually become LP_DEAD at the same time as the LP_DEAD item (so
both get removed from indexes) -- which is a contradiction in terms.
Why wouldn't pruning just mark the LP_REDIRECT item LP_DEAD instead,
while making the would-be LP_DEAD item skip straight to being an
LP_UNUSED item? That approach is strictly better. It just makes sense,
which leads me to believe that we must have always done it that way.
It would be nice to be able to say for sure that we have a simple
rule: "marking a heap-only tuple LP_DEAD is always not just
unnecessary, but wrong, because LP_DEAD items in heap pages are
supposed to have a 1:1 mapping with dead index tuples".

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sun, Oct 31, 2021 at 5:19 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The only way it could be okay for an LP_REDIRECT item to point to an
> LP_DEAD item would be if you knew for sure that the LP_REDIRECT item
> would actually become LP_DEAD at the same time as the LP_DEAD item (so
> both get removed from indexes) -- which is a contradiction in terms.
> Why wouldn't pruning just mark the LP_REDIRECT item LP_DEAD instead,
> while making the would-be LP_DEAD item skip straight to being an
> LP_UNUSED item? That approach is strictly better.

Attached patch adds assertions and comments to
heap_page_prune_execute() that document my understanding of things.
This passes "make check-world" for me.
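
To give a flavor of it (a condensed sketch, not the patch itself; "page"
and "tooff" stand for the function's heap page and the redirect target
offset): when installing a redirect, the target must be a normal line
pointer whose tuple is heap-only -- never LP_DEAD or LP_UNUSED:

    #ifdef USE_ASSERT_CHECKING
        {
            ItemId          tolp = PageGetItemId(page, tooff);
            HeapTupleHeader htup;

            Assert(ItemIdIsNormal(tolp));
            htup = (HeapTupleHeader) PageGetItem(page, tolp);
            Assert(HeapTupleHeaderIsHeapOnly(htup));
        }
    #endif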

How do you feel about doing something like this too?

-- 
Peter Geoghegan

Attachments

Re: BUG #17245: Index corruption involving deduplicated entries

From
Pavel Borisov
Date:
> Attached patch adds assertions and comments to
> heap_page_prune_execute() that document my understanding of things.
> This passes "make check-world" for me.
>
> How do you feel about doing something like this too?

I've applied patch v1, built with assertion checking. Installcheck-world
passes on my machine (OSX) too.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sun, Oct 31, 2021 at 7:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached patch adds assertions and comments to
> heap_page_prune_execute() that document my understanding of things.
> This passes "make check-world" for me.

Attached is a more worked out fix for the bug. It includes a test case
that will lead to an assertion failure in nbtdedup.c (when the fix
itself is absent). The test is loosely based on one from Masahiko's
patch on the dedicated parallel VACUUM thread.

The testing here isn't quite as extensive as we talked about, but only
because it's hard to do that within the confines of the current
design, per my remarks earlier on the other thread. It's not like we
have long to go before 14.1, and so I'd like to commit something like
this in the next couple of days.

I've also combined my heap pruning assertion patch (from the other
thread) with the heap_index_delete_tuples() assertions we talked about
on this thread -- they're very similar, and so might as well just be
handled as a single commit. I can delay committing this second patch
until I hear back about the details. Note that there still aren't any
new defensive ERRORs here, for now.

-- 
Peter Geoghegan

Attachments

Re: BUG #17245: Index corruption involving deduplicated entries

From
Masahiko Sawada
Date:
On Tue, Nov 2, 2021 at 9:14 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sun, Oct 31, 2021 at 7:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Attached patch adds assertions and comments to
> > heap_page_prune_execute() that document my understanding of things.
> > This passes "make check-world" for me.
>
> Attached is a more worked out fix for the bug. It includes a test case
> that will lead to an assertion failure in nbtdedup.c (when the fix
> itself is absent). The test is loosely based on one from Masahiko's
> patch on the dedicated parallel VACUUM thread.

Thank you for updating the patch. The parallel vacuum fix looks good
to me.

Regarding regression tests:

+        * XXX: Actually, there might be no statistics at all -- index AMs
+        * sometimes return NULL from amvacuumcleanup.  That makes it hard to
+        * assert that we really called amvacuumcleanup in the leader from
+        * here.

That's true. We cannot know whether indexes were actually vacuumed by
parallel workers by checking the shared index stats. While I think it's
a good idea to rely on an assertion in nbtdedup.c, I'm a bit concerned
about relying on another module to detect a vacuum bug. If we were
to need to change the code of nbtdedup.c in the future, the premise
that the bug in parallel vacuum leads to the assertion failure might
break. It might not happen in back branches, though.

The problem with this bug is that the leader process fails to vacuum
indexes that must be processed by the leader. So another idea (though
not a comprehensive approach) to detect this issue would be to check,
at the end of do_serial_processing_for_unsafe_indexes(), that the
leader has processed all indexes it is responsible for. That is, we
check that both the indexes whose entries in will_parallel_vacuum[]
are false and the unsafe indexes have been processed by the leader.

> I've also combined my heap pruning assertion patch (from the other
> thread) with the heap_index_delete_tuples() assertions we talked about
> on this thread -- they're very similar, and so might as well just be
> handled as a single commit. I can delay committing this second patch
> until I hear back about the details. Note that there still aren't any
> new defensive ERRORs here, for now.

I'll look at the second patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Mon, Nov 1, 2021 at 9:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for updating the patch. The parallel vacuum fix looks good
> to me.

I pushed the same patch earlier (with small tweaks).

> We cannot know whether indexes were actually vacuumed by
> parallel workers by checking the shared index stats. While I think
> it's a good idea to rely on an assertion in nbtdedup.c, I'm a bit
> concerned about relying on another module to detect a vacuum bug.

I agree that it's not ideal that we're relying on the nbtdedup.c
assertion failure, but there are lots of ways that the test could fail
(with assertions enabled). We're relying on nbtdedup.c, but it's not
too delicate in practice.

> The problem with this bug is that the leader process fails to vacuum
> indexes that must be processed by the leader.

I think that the underlying problem is that there is too much
indirection, for no real benefit. A more explicit state machine
approach would make testing easy, and would minimize divergence
between parallel and serial VACUUM. I think that it would be easier
and less risky overall to do this than it would be to have
comprehensive testing of the existing design.

As I said on the other thread, we should decide *what we're allowed to
do* up front, in the leader, probably when
compute_parallel_vacuum_workers() is called. After that point,
parallel workers (and the parallel leader) don't have to care about
*why* the required/allowed work is a certain way -- they just need to
actually do the work. We can explicitly remember "can [or cannot] have
ambulkdelete() performed by parallel worker", and "can [or cannot]
have amvacuumcleanup() performed by parallel worker" as per-index
immutable state, stored in shared memory. This immutable state is a
little like a query plan.

We can also have mutable state that describes the current status of
each index, like "still needs vacuuming" and "still needs cleanup" --
this tracks the overall progress for each index, plus maybe extra
instrumentation. It will then be easy to notice when invalid state
transitions seem to be necessary -- just throw an error at that point.
Also, we can directly check "Does what I asked for match the work that
parallel workers said they completed using the same data structure?",
at the end of each parallel VACUUM. That nails things down.
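
Concretely, the per-index shared memory state could be shaped something
like this (hypothetical names, standalone sketch -- not actual code):

    #include <stdbool.h>

    /* immutable "plan" for one index, decided up front by the leader */
    typedef struct IndexVacuumPlan
    {
        bool        bulkdel_in_worker_ok;   /* ambulkdelete by a worker? */
        bool        cleanup_in_worker_ok;   /* amvacuumcleanup by a worker? */
    } IndexVacuumPlan;

    /* mutable progress for one index */
    typedef enum IndexVacuumStatus
    {
        INDVAC_INITIAL,
        INDVAC_NEEDS_BULKDELETE,
        INDVAC_NEEDS_CLEANUP,
        INDVAC_COMPLETED
    } IndexVacuumStatus;

    typedef struct IndexVacuumState
    {
        IndexVacuumPlan   plan;     /* never changes once VACUUM starts */
        IndexVacuumStatus status;   /* advanced by whoever does the work;
                                     * anything but a forward transition
                                     * indicates a bug and raises an error */
    } IndexVacuumState;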

BTW, do we really need separate
do_serial_processing_for_unsafe_indexes() and do_parallel_processing()
functions? I think that we could just have one event loop function if
it was structured as a state machine. Parallel workers could see
indexes that can only be processed by the leader, but that would be
okay -- they'd understand not to touch those indexes (without caring
about the true reason).

Right now we really have two separate ambulkdelete (!for_cleanup) and
amvacuumcleanup (for_cleanup) parallel operations -- not a single
parallel VACUUM operation. That may be part of the problem. It would
also be a good idea to have just one parallel VACUUM operation, and to
handle the transition from ambulkdelete to amvacuumcleanup using state
transitions. That's more ambitious than adding more state to shared
memory, but it would be nice if workers (or the leader acting like a
worker) could just consume "a unit of work" in any order that's
convenient (as long as it's also in an order that's correct, based on
state transition rules like "go from ambulkdelete directly to
amvacuumcleanup, but only when we know that it's the last call to
ambulkdelete for the VACUUM operation as a whole").
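
Putting the two sketches together, the single event loop might look
something like the following. PVShared, PVIndexSlot,
pv_vacuum_one_index(), and the claimed flag are all hypothetical;
pg_atomic_test_set_flag() is the real atomics primitive:

static void
pv_bulkdelete_pass(PVShared *shared, bool am_worker)
{
    for (int i = 0; i < shared->nindexes; i++)
    {
        PVIndexSlot *slot = &shared->slots[i];

        /* Workers skip leader-only indexes, without caring why */
        if (am_worker && !slot->plan.bulkdel_by_worker)
            continue;

        /* Claim the index atomically so it is processed exactly once */
        if (!pg_atomic_test_set_flag(&slot->claimed))
            continue;

        pv_vacuum_one_index(slot);
        pv_advance_phase(&slot->status, PV_NEEDS_CLEANUP);
    }
}

The amvacuumcleanup pass would be the same loop driven by the same
per-index state (with the claimed flags re-initialized), entered only
once every index has reached PV_NEEDS_CLEANUP.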

> So, another idea (but
> not a comprehensive approach) to detect this issue would be to
> check, at the end of do_serial_processing_for_unsafe_indexes(),
> whether the leader processed all indexes that it must process. That
> is, we check that both the indexes whose entries in
> will_parallel_vacuum[] are false and the unsafe indexes have been
> processed by the leader.

I don't think that that would help very much. It still spreads out
knowledge of what is supposed to happen, rather than concentrating
that knowledge into just one function.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sun, Oct 31, 2021 at 5:19 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Oct 30, 2021 at 3:18 PM Andres Freund <andres@anarazel.de> wrote:
> > Hm. I wonder if it's not actually good to do something like it in 14, given
> > that we know of a path to have corrupted indexes out there.
>
> My concern is that people might not be too happy about errors that
> occur due to corruption in tuples that are only vaguely related to
> their own incoming inserted tuple.

Attached patch adds ERRORs in the event of detecting
index-tuple-TID-points-to-LP_UNUSED conditions, as well as other
similar conditions -- not just assertions, as before.

I do think that this would be a good idea on HEAD. Still have
reservations about doing that for 14, but you're welcome to try and
change my mind.

Thoughts?

-- 
Peter Geoghegan

Attachments

Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
Hi,

On 2021-11-02 19:03:05 -0700, Peter Geoghegan wrote:
> Attached patch adds ERRORs in the event of detecting
> index-tuple-TID-points-to-LP_UNUSED conditions, as well as other
> similar conditions -- not just assertions, as before.
> 
> I do think that this would be a good idea on HEAD. Still have
> reservations about doing that for 14, but you're welcome to try and
> change my mind.

I'm not quite sure either - I'm worried about follow-on corruption based on
the bug. That'll be much more likely to be detected with something like this
in place. But it's also easy to get some subtlety wrong...



> +static inline void
> +index_delete_check_htid(TM_IndexDeleteOp *delstate,
> +                        Page page, OffsetNumber maxoff,
> +                        ItemPointer htid, TM_IndexStatus *istatus)
> +{
> +    OffsetNumber indexpagehoffnum = ItemPointerGetOffsetNumber(htid);
> +    ItemId        iid;
> +    HeapTupleHeader htup;
> +
> +    Assert(OffsetNumberIsValid(istatus->idxoffnum));
> +
> +    if (indexpagehoffnum > maxoff)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INDEX_CORRUPTED),
> +                 errmsg_internal("heap tid from index tuple (%u,%u) points past end of heap page line pointer array at offset %u of block %u in index \"%s\"",
> +                                 ItemPointerGetBlockNumber(htid),
> +                                 indexpagehoffnum,
> +                                 istatus->idxoffnum, delstate->iblknum,
> +                                 RelationGetRelationName(delstate->irel))));
> +    iid = PageGetItemId(page, indexpagehoffnum);
> +    if (!ItemIdIsUsed(iid))
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INDEX_CORRUPTED),
> +                 errmsg_internal("heap tid from index tuple (%u,%u) points to unused heap page item at offset %u of block %u in index \"%s\"",
> +                                 ItemPointerGetBlockNumber(htid),
> +                                 indexpagehoffnum,
> +                                 istatus->idxoffnum, delstate->iblknum,
> +                                 RelationGetRelationName(delstate->irel))));
> +
> +    if (ItemIdHasStorage(iid))
> +    {
> +        Assert(ItemIdIsNormal(iid));
> +        htup = (HeapTupleHeader) PageGetItem(page, iid);
> +
> +        if (HeapTupleHeaderIsHeapOnly(htup))
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INDEX_CORRUPTED),
> +                     errmsg_internal("heap tid from index tuple (%u,%u) points to heap-only tuple at offset %u of block %u in index \"%s\"",
> +                                     ItemPointerGetBlockNumber(htid),
> +                                     indexpagehoffnum,
> +                                     istatus->idxoffnum, delstate->iblknum,
> +                                     RelationGetRelationName(delstate->irel))));
> +    }

I'd make the error paths unlikely().




> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index db6912e9f..43549be04 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -844,39 +844,115 @@ heap_page_prune_execute(Buffer buffer,
>  {
>      Page        page = (Page) BufferGetPage(buffer);
>      OffsetNumber *offnum;
> -    int            i;
> +    HeapTupleHeader htup PG_USED_FOR_ASSERTS_ONLY;
>  
>      /* Shouldn't be called unless there's something to do */
>      Assert(nredirected > 0 || ndead > 0 || nunused > 0);
>  
>      /* Update all redirected line pointers */
>      offnum = redirected;
> -    for (i = 0; i < nredirected; i++)
> +    for (int i = 0; i < nredirected; i++)
>      {
>          OffsetNumber fromoff = *offnum++;
>          OffsetNumber tooff = *offnum++;
>          ItemId        fromlp = PageGetItemId(page, fromoff);
> +        ItemId        tolp PG_USED_FOR_ASSERTS_ONLY;
> +
> +#ifdef USE_ASSERT_CHECKING

These I'd definitely only do in HEAD.


> +        /*
> +         * The existing items that we set as redirects must be the first tuple
> +         * of a HOT chain that has not been pruned before now (can't be a
> +         * heap-only tuple) when the target item has tuple storage.  When the
> +         * item has no storage, then we must just be maintaining the
> +         * LP_REDIRECT of a HOT chain that has been pruned at least once
> +         * before now.
> +         */
> +        if (!ItemIdIsRedirected(fromlp))
> +        {
> +            Assert(ItemIdHasStorage(fromlp) && ItemIdIsNormal(fromlp));
> +
> +            htup = (HeapTupleHeader) PageGetItem(page, fromlp);
> +            Assert(!HeapTupleHeaderIsHeapOnly(htup));
> +        }

I was a bit confused because I initially read this as saying that the tuple
redirected *to* can't be a heap-only tuple. Perhaps this could be rephrased a
bit ("redirecting item"?).


Greetings,

Andres Freund



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Nov 4, 2021 at 3:07 PM Andres Freund <andres@anarazel.de> wrote:
> I'm not quite sure either - I'm worried about followon corruption based on the
> bug. That'll be much more likely be detected with something like this in
> place.

True, but the additional hardening I added to nbtree in commit
a5213adf is definitely worth something. Sure, tools like amcheck were
disappointing when it came to detecting the parallel VACUUM bug -- I
don't deny that. To some degree the nbtree-side hardening will be a
little like that too. But let's not overstate it; pg_amcheck was still
pretty useful. It has a decent chance of detecting certain problems
with one or more indexes.

> But it's also easy to get some subtlety wrong...

I'm not worried about incorrectly identifying TIDs as corrupt in this
hardening patch. I'm more worried about the patch throwing ERRORs in
cases where it's arguably not really necessary -- which is not a
problem for the nbtree-side hardening we have already. With this new
hardening, the insert or update that ends up throwing an error might be
pretty far removed from the corruption itself. And so it seems like
something that would be better left till Postgres 15.

Of course I don't believe for a second that ignorance is bliss. At the
same time, I don't want to act rashly.

> I'd make the error paths unlikely().

I was under the impression that all ERRORs were unlikely(), in effect.
At least following David Rowley's work in commit 913ec71d -- the
"ereport(ERROR) pg_attribute_cold() shim" patch. But now that I think
about it some more, I imagine that compilers are free to interpret
these any way they want. They're certainly not officially equivalent.
And so it couldn't hurt to add some unlikely()s, for good luck. This
is a pretty tight, performance-sensitive loop, after all.
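
To illustrate, the suggestion amounts to wrapping the corruption tests
themselves, so the hot path stays tight regardless of how the compiler
treats ereport(ERROR). A sketch against the first check in the patch
above:

    if (unlikely(indexpagehoffnum > maxoff))
        ereport(ERROR,
                (errcode(ERRCODE_INDEX_CORRUPTED),
                 errmsg_internal("heap tid from index tuple (%u,%u) points past end of heap page line pointer array at offset %u of block %u in index \"%s\"",
                                 ItemPointerGetBlockNumber(htid),
                                 indexpagehoffnum,
                                 istatus->idxoffnum, delstate->iblknum,
                                 RelationGetRelationName(delstate->irel))));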

> These I'd definitely only do in HEAD.

Works for me. I see the changes to pruneheap.c as developer
documentation more than anything else. It's a nice way to explain what
is possible, without leaving much room for misunderstanding.

> I was a bit confused because I initially read this this as the first tuple
> redirect *to* can't be a heap-only tuple. Perhaps this could be rephrased a
> bit ("redirecting item"?).

That makes sense. What do you think of this?:

        /*
         * The item that we're about to set as an LP_REDIRECT (the 'from'
         * item) must point to an existing item (the 'to' item) that is
         * already a heap-only tuple.  There can be at most one LP_REDIRECT
         * item per HOT chain.
         *
         * We need to keep around an LP_REDIRECT item (after original
         * non-heap-only root tuple gets pruned away) so that it's always
         * possible for VACUUM to easily figure out what TID to delete from
         * indexes when an entire HOT chain becomes dead.  A heap-only tuple
         * can never become LP_DEAD; an LP_REDIRECT item or a regular heap
         * tuple can.
         */
        tolp = PageGetItemId(page, tooff);
        Assert(ItemIdHasStorage(tolp) && ItemIdIsNormal(tolp));
        htup = (HeapTupleHeader) PageGetItem(page, tolp);
        Assert(HeapTupleHeaderIsHeapOnly(htup));

(I changed the first paragraph, not the second. Just showing
everything together for your convenience.)

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Andres Freund
Date:
On 2021-11-04 15:57:42 -0700, Peter Geoghegan wrote:
> That makes sense. What do you think of this?:

WFM.



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Thu, Nov 4, 2021 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
> WFM.

Okay, everything has been pushed.

Thanks
-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Sun, Nov 7, 2021 at 4:28 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> The index in one case is now 448 kB, in the other case is 632 kB, so I suspect
> they're due to this bug (for the 2nd one, a new index created from scratch with
> the same definition is 144 kB).

That seems quite likely. If you REINDEX, everything should be okay
(barring unique violations, perhaps).

> I tried looking for duplicate TIDs like this (which I wrote to search every
> suspect table before I found the above issue) - I'm not sure this is right,
> though.

The problem that we have with detecting this in amcheck is that it
doesn't detect duplicate TIDs, except when they happen to appear in
fully duplicate index tuples. In practice this means that amcheck has
a decent chance of detecting some problems, but cannot detect problems
caused by this particular bug reliably. Even with the heapallindexed
stuff.

> But this doesn't find any issue in that index.  If I'm not wrong, "duplicate
> TIDs" are just one likely symptom of the bug, but not its only consequence.
> I don't know how the TIDs can get out of order, though?

They don't. It's just that amcheck considers two adjacent duplicate
TIDs to involve a "misplaced TID" (in a posting list), which is just a
catch-all term. It's true that you could in theory see the same error
message with a posting list whose TIDs were in entirely the wrong
order, but there is no evidence that that ever happened as a result of
this bug -- and there is no reason to even suspect it. Not that it
makes that much difference, mind you -- corruption is corruption.
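
The check involved is essentially an ordering check over each posting
list's TID array; a duplicate fails it exactly the way an out-of-order
TID does. A simplified sketch of the idea (amcheck's real
implementation is in verify_nbtree.c and reports far more detail):

    /* Posting list TIDs must be unique and in strictly ascending order */
    for (int i = 1; i < nhtids; i++)
    {
        if (ItemPointerCompare(&htids[i - 1], &htids[i]) >= 0)
            ereport(ERROR,
                    (errcode(ERRCODE_INDEX_CORRUPTED),
                     errmsg("posting list contains misplaced TID")));
    }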

> I thought you might like to know about another instance of this in the wild.
> I'll disable those indexes and leave them around for a bit in case there's any
> question about them.

Thanks, but there is no further need for this.


--
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 1:10 AM Alexander Kukushkin <cyberdemn@gmail.com> wrote:
> - Each cluster produces ~3TB of WAL every day (plenty of UPDATEs, about 90% of which are HOT updates).
>
> Corruption was found on all shards, but the list of affected indexes varies a bit from shard to shard.
>
> Database schema:
> - mostly PRIMARY or UNIQUE keys
> - a couple of non-unique btree indexes
> - plenty of foreign keys

You have said elsewhere that you're sure that this isn't the parallel
VACUUM bug, since you know that you didn't run a manual VACUUM, even
once. So I wonder what the issue might be. Since you deleted duplicate
rows from a unique index, there probably weren't very many affected
rows in total. It sounds like a pretty subtle issue to me
(particularly compared to the parallel VACUUM bug, which wasn't all
that subtle when it hit at all).

If I had to guess, I'd guess that it has something to do with the
snapshot scalability work. Specifically, a recently reported issue
involving confusion about the structure of HOT chains during pruning:


https://www.postgresql.org/message-id/flat/20211110192010.ckvfzz352hsba5xf%40alap3.anarazel.de#4c3d9c9988164f5ea3c15999bcf50ce7

Please join in on the other thread if you have anything more to add.

I could easily be wrong about that, though. You upgraded using
pg_upgrade, right? That's certainly a confounding factor here.

-- 
Peter Geoghegan



Re: BUG #17245: Index corruption involving deduplicated entries

From
Alexander Kukushkin
Date:
Hi Peter,

On Fri, 12 Nov 2021 at 03:15, Peter Geoghegan <pg@bowt.ie> wrote:

You have said elsewhere that you're sure that this isn't the parallel
VACUUM bug, since you know that you didn't run a manual VACUUM, even
once. So I wonder what the issue might be. Since you deleted duplicate
rows from a unique index, there probably weren't very many affected
rows in total.

It is a sharded setup, 16 shards in total. The table structure is similar between shards.
Also, more or less the same tables were affected across all shards.
Depending on the table, the number of affected rows ranged from one to a few hundred.
The picture looked similar across all shards, although the set of tables varied a bit from shard to shard.
I resolved the first duplicates manually, but once the rules became clear I simply wrote queries that removed everything but either the first or the last version of each row, based on the timestamp.
 
It sounds like a pretty subtle issue to me
(particularly compared to the parallel VACUUM bug, which wasn't all
that subtle when it hit at all).

If I had to guess, I'd guess that it has something to do with the
snapshot scalability work. Specifically, a recently reported issue
involving confusion about the structure of HOT chains during pruning:

https://www.postgresql.org/message-id/flat/20211110192010.ckvfzz352hsba5xf%40alap3.anarazel.de#4c3d9c9988164f5ea3c15999bcf50ce7

Please join in on the other thread if you have anything more to add.

Unfortunately, not really. I only ran bt_index_check(index => r.oid, heapallindexed => true), because bt_index_parent_check takes a ShareLock on the index and blocks DML, of which we have a lot.
All errors looked similar:
ERROR: heap tuple (314222,250) from table "${table_name}" lacks matching index tuple within index "${table_name}_pkey"
CONTEXT: SQL statement "SELECT bt_index_check(index => r.oid, heapallindexed => true)"
HINT: Retrying verification using the function bt_index_parent_check() might provide a more specific error.
 

I could easily be wrong about that, though. You upgraded using
pg_upgrade, right? That's certainly a confounding factor here.

Correct, pg_upgrade --link, from 9.6 to 14. Any other upgrade method is resource- and time-consuming.
 
--
Alexander Kukushkin