Thread: v12.0: reindex CONCURRENTLY: lock ShareUpdateExclusiveLock on object 14185/39327/0 is already held


v12.0: reindex CONCURRENTLY: lock ShareUpdateExclusiveLock on object 14185/39327/0 is already held

From: Justin Pryzby
Date: Sat, Oct 12, 2019 09:51:45 PM -0500
I ran into this while trying to trigger the previously-reported segfault. 

CREATE TABLE t(i) AS SELECT * FROM generate_series(1,9);
CREATE INDEX ON t(i);

[pryzbyj@database ~]$ for i in `seq 1 9`; do PGOPTIONS='-cstatement_timeout=9' psql postgres --host /tmp --port 5678 -c "REINDEX INDEX CONCURRENTLY t_i_idx"; done
 
ERROR:  canceling statement due to statement timeout
ERROR:  lock ShareUpdateExclusiveLock on object 14185/47287/0 is already held
[...]

Variations on this seem to leave the locks table (?) or something else in a
Real Bad state, such that I cannot truncate the table or drop it; or at least
commands are unreasonably delayed for minutes, on this otherwise-empty test
cluster.
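
To see what is stuck, here is a rough sketch of the check I run after the
timed-out commands; it assumes only the standard pg_locks view (the
relation and PID will of course differ per run):

-- leftover session-level locks after the cancelled REINDEX CONCURRENTLY;
-- nothing should remain once all sessions have exited
SELECT locktype, relation::regclass AS rel, mode, granted, pid
FROM pg_locks
WHERE mode = 'ShareUpdateExclusiveLock';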

Justin



Re: v12.0: reindex CONCURRENTLY: lock ShareUpdateExclusiveLock on object 14185/39327/0 is already held

From: Michael Paquier
Date:
On Sat, Oct 12, 2019 at 09:51:45PM -0500, Justin Pryzby wrote:
> Variations on this seem to leave the locks table (?) or something else in a
> Real Bad state, such that I cannot truncate the table or drop it; or at least
> commands are unreasonably delayed for minutes, on this otherwise-empty test
> cluster.

I got an assertion failure on that:
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f417a283535 in __GI_abort () at abort.c:79
#2  0x0000564c351f0f4f in ExceptionalCondition (conditionName=0x564c353d0ac8 "SHMQueueEmpty(&(MyProc->myProcLocks[i]))", errorType=0x564c353d09de "FailedAssertion", fileName=0x564c353d09d7 "proc.c", lineNumber=832) at assert.c:54
#3  0x0000564c3504debe in ProcKill (code=0, arg=0) at proc.c:832
#4  0x0000564c3503430e in shmem_exit (code=0) at ipc.c:272
#5  0x0000564c3503413d in proc_exit_prepare (code=0) at ipc.c:194
#6  0x0000564c3503409c in proc_exit (code=0) at ipc.c:107
#7  0x0000564c3506a629 in PostgresMain (argc=1, argv=0x564c35c12ae0, dbname=0x564c35c129d0 "ioltas", username=0x564c35c129b0 "ioltas") at postgres.c:4464
#8  0x0000564c34fb94ed in BackendRun (port=0x564c35c0c6b0) at postmaster.c:4465
#9  0x0000564c34fb8c59 in BackendStartup (port=0x564c35c0c6b0) at postmaster.c:4156
#10 0x0000564c34fb4c7f in ServerLoop () at postmaster.c:1718
#11 0x0000564c34fb44ad in PostmasterMain (argc=3, argv=0x564c35bdefd0) at postmaster.c:1391
#12 0x0000564c34ec0d3d in main (argc=3, argv=0x564c35bdefd0) at main.c:210

This means that the locks held have not actually been released when
the timeout kicked in.  Not sure whether this is an issue only related
to REINDEX CONCURRENTLY, but if that's the case then we are missing a
cleanup step.
--
Michael

Attachments

Re: v12.0: reindex CONCURRENTLY: lock ShareUpdateExclusiveLock on object 14185/39327/0 is already held

From: Justin Pryzby
Date: Fri, Oct 18, 2019 01:26:27 PM -0500
Checking if anybody is working on either of these
https://www.postgresql.org/message-id/20191013025145.GC4475%40telsasoft.com
https://www.postgresql.org/message-id/20191015164047.GA22729%40telsasoft.com

On Sat, Oct 12, 2019 at 09:51:45PM -0500, Justin Pryzby wrote:
> I ran into this while trying to trigger the previously-reported segfault. 
> 
> CREATE TABLE t(i) AS SELECT * FROM generate_series(1,9);
> CREATE INDEX ON t(i);
> 
> [pryzbyj@database ~]$ for i in `seq 1 9`; do PGOPTIONS='-cstatement_timeout=9' psql postgres --host /tmp --port 5678 -c "REINDEX INDEX CONCURRENTLY t_i_idx"; done
 
> ERROR:  canceling statement due to statement timeout
> ERROR:  lock ShareUpdateExclusiveLock on object 14185/47287/0 is already held
> [...]
> 
> Variations on this seem to leave the locks table (?) or something else in a
> Real Bad state, such that I cannot truncate the table or drop it; or at least
> commands are unreasonably delayed for minutes, on this otherwise-empty test
> cluster.

On Tue, Oct 15, 2019 at 11:40:47AM -0500, Justin Pryzby wrote:
> On a badly-overloaded VM, we hit the previously-reported segfault in progress
> reporting.  This left around some *ccold indices.  I tried to drop them but:
> 
> sentinel=# DROP INDEX child.alarms_null_alarm_id_idx1_ccold; -- child.alarms_null_alarm_time_idx_ccold; -- alarms_null_alarm_id_idx_ccold;
> ERROR:  could not find tuple for parent of relation 41351896
> 
> Those are children of relkind=I index on relkind=p table.
> 
> postgres=# CREATE TABLE t(i int) PARTITION BY RANGE(i);
> postgres=# CREATE TABLE t1 PARTITION OF t FOR VALUES FROM (1) TO (100);
> postgres=# INSERT INTO t1 SELECT 1 FROM generate_series(1,99999);
> postgres=# CREATE INDEX ON t(i);
> 
> postgres=# begin; SELECT * FROM t; -- DO THIS IN ANOTHER SESSION
> 
> postgres=# REINDEX INDEX CONCURRENTLY t1_i_idx; -- cancel this one
> ^CCancel request sent
> ERROR:  canceling statement due to user request
> 
> postgres=# \d t1
> ...
>     "t1_i_idx" btree (i)
>     "t1_i_idx_ccold" btree (i) INVALID
> 
> postgres=# SELECT inhrelid::regclass FROM pg_inherits WHERE inhparent='t_i_idx'::regclass;
>  inhrelid
> ----------
>  t1_i_idx
> (1 row)
> 
> Not only can't I DROP the _ccold indexes, but also dropping the table doesn't
> cause them to be dropped, and then I can't even slash dee them anymore:
> 
> jtp=# DROP INDEX t1_i_idx_ccold;
> ERROR:  could not find tuple for parent of relation 290818869
> 
> jtp=# DROP TABLE t; -- does not fail, but ..
> 
> jtp=# \d t1_i_idx_ccold
> ERROR:  cache lookup failed for relation 290818865
> 
> jtp=# SELECT indrelid::regclass, * FROM pg_index WHERE indexrelid='t1_i_idx_ccold'::regclass;
> indrelid       | 290818865
> indexrelid     | 290818869
> indrelid       | 290818865
> [...]
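
For anyone else hitting this, a rough sketch of how I have been finding
the leftover invalid indexes; it assumes only the standard catalogs, so
adjust as needed:

-- invalid indexes (e.g. *_ccold/*_ccnew left behind by an interrupted
-- REINDEX CONCURRENTLY) and the table they claim to index
SELECT c.relname AS index_name,
       i.indrelid::regclass AS table_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;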



Re: v12.0: reindex CONCURRENTLY: lock ShareUpdateExclusiveLock on object 14185/39327/0 is already held

From: Michael Paquier
Date: Sat, Oct 19, 2019 11:41:06 AM +0900
On Fri, Oct 18, 2019 at 01:26:27PM -0500, Justin Pryzby wrote:
> Checking if anybody is working on either of these
> https://www.postgresql.org/message-id/20191013025145.GC4475%40telsasoft.com
> https://www.postgresql.org/message-id/20191015164047.GA22729%40telsasoft.com

FWIW, I have spent an hour or two poking at this over the last couple
of days, so I am not ignoring either report, though not as much as I
would have liked.  My initial look leads me to think that something is
going wrong with the cleanup of the session-level lock on the parent
table taken in the first transaction of the REINDEX CONCURRENTLY.
--
Michael

Attachments

Re: v12.0: reindex CONCURRENTLY: lock ShareUpdateExclusiveLock on object 14185/39327/0 is already held

From: Michael Paquier
Date: Wed, Oct 23, 2019 07:18:33 PM +0900
On Sat, Oct 19, 2019 at 11:41:06AM +0900, Michael Paquier wrote:
> FWIW, I have spent an hour or two poking at this over the last couple
> of days, so I am not ignoring either report, though not as much as I
> would have liked.  My initial look leads me to think that something is
> going wrong with the cleanup of the session-level lock on the parent
> table taken in the first transaction of the REINDEX CONCURRENTLY.

I can confirm that this is an issue with session locks that are not
cleaned up when the backend is in an out-of-transaction state, a state
that can be reached between a transaction commit and the next
transaction start while holding at least one session lock, within a
single command of VACUUM, CIC or REINDEX CONCURRENTLY.  The failure is
actually pretty easy to reproduce if you add an elog(ERROR) after a
CommitTransactionCommand() call and then shut down the connection.  I
am starting a new thread about that.  The problem is larger than it
looks, and has existed for a long time.
--
Michael

Attachments

Re: v12.0: reindex CONCURRENTLY: lock ShareUpdateExclusiveLock on object 14185/39327/0 is already held

From: Michael Paquier
Date: Thu, Oct 24, 2019 11:42:04 AM +0900
On Wed, Oct 23, 2019 at 07:18:33PM +0900, Michael Paquier wrote:
> I can confirm that this is an issue with session locks that are not
> cleaned up when the backend is in an out-of-transaction state, a state
> that can be reached between a transaction commit and the next
> transaction start while holding at least one session lock, within a
> single command of VACUUM, CIC or REINDEX CONCURRENTLY.

Please let me back-pedal a bit on this one after sleeping on it.
Actually, if you look at CIC and VACUUM, those code paths are much
more careful than REINDEX CONCURRENTLY about the position of
CHECK_FOR_INTERRUPTS(), in that their checks happen only within a
transaction context.  In the case of REINDEX CONCURRENTLY and the
failure reported here, the current code is careless: it depends of
course on the timing of statement_timeout, but session locks would
remain behind when an interruption is hit at the beginning of phase 2
or 3 in indexcmds.c.  So the answer is simple: moving the interrupt
checks inside a transaction context solves the problem.  This also
fixes a second issue, as the original code would cause xact.c to
generate some useless warnings.

Please see the attached.  Justin, does it fix your problems regarding
the locks?  For me it does.

> The failure is actually pretty easy to reproduce if you
> add an elog(ERROR) after a CommitTransactionCommand() call and then
> shut down the connection.  I am starting a new thread about that.  The
> problem is larger than it looks, and has existed for a long time.

I am still wondering if we could put more safeguards in this area
though...
--
Michael

Attachments

Re: v12.0: reindex CONCURRENTLY: lock ShareUpdateExclusiveLock on object 14185/39327/0 is already held

From: Justin Pryzby
Date: Wed, Oct 23, 2019 10:08:21 PM -0500
On Thu, Oct 24, 2019 at 11:42:04AM +0900, Michael Paquier wrote:
> Please see the attached.  Justin, does it fix your problems regarding
> the locks?

Confirmed.

Thanks,
Justin



Re: v12.0: reindex CONCURRENTLY: lock ShareUpdateExclusiveLock on object 14185/39327/0 is already held

From: Michael Paquier
Date:
On Wed, Oct 23, 2019 at 10:08:21PM -0500, Justin Pryzby wrote:
> On Thu, Oct 24, 2019 at 11:42:04AM +0900, Michael Paquier wrote:
>> Please see the attached.  Justin, does it fix your problems regarding
>> the locks?
>
> Confirmed.

Okay, committed and back-patched.  I have manually checked all the
interruption points for plain indexes, and it is possible to clean up
the invalid indexes properly (old or new depending on the phase).
Partition indexes have other issues as you reported, but let's see
about that on the other thread.
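
For reference, the manual cleanup I used while checking is just a
sketch; the exact leftover name depends on which phase was interrupted
(_ccold or _ccnew):

-- hypothetical leftover from an interrupted run of the t_i_idx repro;
-- DROP INDEX CONCURRENTLY avoids taking a blocking lock on the table
DROP INDEX CONCURRENTLY t_i_idx_ccold;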
--
Michael

Attachments