Hi all,
I've discovered bug, when vacuum full fails with error, because it
couldn't find toast chunks deleted by itself. That happens because
cluster_rel() sets OldestXmin, but toast accesses gets snapshot later
and independently. That causes heap_page_prune_opt() to clean chunks,
which rebuild_relation() expects to exist. This bug very rarely
happens on busy systems which actively update toast values. But I
found way to reliably reproduce it using debugger.
*Setup*
CREATE FUNCTION random_string(seed integer, length integer) RETURNS text
AS $$
SELECT substr(
string_agg(
substr(
encode(
decode(
md5(seed::text || '-' || i::text),
'hex'),
'base64'),
1, 21),
''),
1, length)
FROM generate_series(1, (length + 20) / 21) i; $$
LANGUAGE SQL;
CREATE TABLE test (val text);
INSERT INTO test (random_string(1,100000));
*Reproduction steps*
s1-s3 are three parallel PostgreSQL sessions
s3lldb is lldb connected to s1
At first s1 acquires snapshot and holds it.
s1# begin transaction isolation level repeatable read;
s1# select 1;
Then s2 makes multiple updates of our toasted value.
s2# update test set val = random_string(2,100000);
s2# update test set val = random_string(3,100000);
s2# update test set val = random_string(4,100000);
s2# update test set val = random_string(5,100000);
s2# update test set val = random_string(6,100000);
s2# update test set val = random_string(7,100000);
Then s3 starting vacuum full stopping on vacuum_set_xid_limits().
s3lldb# b vacuum_set_xid_limits
s3# vacuum full test;
We pass vacuum_set_xid_limits() making sure old tuple versions made by
s2 would be recently dead for vacuum full.
s3lldb# finish
Then s1 releases snapshot. Then heap_page_prune_opt() called from
toast accessed would cleanup toast chunks, which vacuum full expects
to be recently dead.
s1# commit;
Finally, we continue our vacuum full and get error!
s3lldb# continue
s3#
ERROR: unexpected chunk number 50 (expected 2) for toast value 16429
in pg_toast_16387
Attached patch contains dirty fix of this bug, which just prevents
heap_page_prune_opt() from clean tuples, when it's called from
rebuild_relation(). Actually, it's not something I'm proposing to
commit or even review, it might be just some start point for thoughts.
Any ideas?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company