Обсуждение: "ERROR: could not open relation with OID 16391" error was encountered when reindexing

Поиск
Список
Период
Сортировка
This issue has been reported in the <pgsql-bugs> list at the link below, but has not received a reply.
https://www.postgresql.org/message-id/18286-f6273332500c2a62%40postgresql.org
Hopefully to get some response from kernel hackers, thanks!

Hi,
When reindex the partitioned table's index and the drop index are executed concurrently, we may encounter the error "could not open relation with OID”.

The reconstruction of the partitioned table's index is completed in multiple transactions and can be simply summarized into the following steps:
1. Obtain the oids of all partition indexes in the ReindexPartitions function, and then commit the transaction to release all locks.
2. Reindex each index in turn
   2.1 Start a new transaction
   2.2 Check whether the index still exists
   2.3 Call the reindex_index function to complete the index rebuilding work
   2.4 Submit transaction

There is no lock between steps 2.2 and 2.3 to protect the heap table and index from being deleted, so whether the heap table still exists is determined in the reindex_index function, but the index is not checked.

One fix I can think of is: after successfully opening the heap table in reindex_index, check again whether the index still exists, Something like this:
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 143fae01eb..21777ec98c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3594,6 +3594,17 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
        if (!heapRelation)
                return;
 
+       /*
+        * Before opening the index, check if the index relation still exists.
+        * If index relation is gone, leave.
+        */
+       if (params->options & REINDEXOPT_MISSING_OK != 0 &&
+               !SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexId)))
+       {
+               table_close(heapRelation, NoLock);
+               return;
+       }
+
        /*
         * Switch to the table owner's userid, so that any index functions are run
         * as that user.  Also lock down security-restricted operations and

The above analysis is based on the latest master branch.

I'm not sure if my idea is reasonable, I hope you can give me some suggestions. Thanks.

Best Regards,


Fei Changhong
Alibaba Cloud Computing Ltd.

Re: "ERROR: could not open relation with OID 16391" error was encountered when reindexing

От
Aleksander Alekseev
Дата:
Hi,

> This issue has been reported in the <pgsql-bugs> list at the link below, but has not received a reply.
> https://www.postgresql.org/message-id/18286-f6273332500c2a62%40postgresql.org
> Hopefully to get some response from kernel hackers, thanks!
>
> Hi,
> When reindex the partitioned table's index and the drop index are executed concurrently, we may encounter the error
"couldnot open relation with OID”. 
>
> The reconstruction of the partitioned table's index is completed in multiple transactions and can be simply
summarizedinto the following steps: 
> 1. Obtain the oids of all partition indexes in the ReindexPartitions function, and then commit the transaction to
releaseall locks. 
> 2. Reindex each index in turn
>    2.1 Start a new transaction
>    2.2 Check whether the index still exists
>    2.3 Call the reindex_index function to complete the index rebuilding work
>    2.4 Submit transaction
>
> There is no lock between steps 2.2 and 2.3 to protect the heap table and index from being deleted, so whether the
heaptable still exists is determined in the reindex_index function, but the index is not checked. 
>
> One fix I can think of is: after successfully opening the heap table in reindex_index, check again whether the index
stillexists, Something like this: 
> diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
> index 143fae01eb..21777ec98c 100644
> --- a/src/backend/catalog/index.c
> +++ b/src/backend/catalog/index.c
> @@ -3594,6 +3594,17 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
>         if (!heapRelation)
>                 return;
>
> +       /*
> +        * Before opening the index, check if the index relation still exists.
> +        * If index relation is gone, leave.
> +        */
> +       if (params->options & REINDEXOPT_MISSING_OK != 0 &&
> +               !SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexId)))
> +       {
> +               table_close(heapRelation, NoLock);
> +               return;
> +       }
> +
>         /*
>          * Switch to the table owner's userid, so that any index functions are run
>          * as that user.  Also lock down security-restricted operations and
>
> The above analysis is based on the latest master branch.
>
> I'm not sure if my idea is reasonable, I hope you can give me some suggestions. Thanks.

Any chance you could provide minimal steps to reproduce the issue on
an empty PG instance, ideally as a script? That's going to be helpful
to reproduce / investigate the issue and also make sure that it's
fixed.

--
Best regards,
Aleksander Alekseev



Thank you for your attention.


Any chance you could provide minimal steps to reproduce the issue on
an empty PG instance, ideally as a script? That's going to be helpful
to reproduce / investigate the issue and also make sure that it's
fixed.

I have provided a python script in the attachment to minimize the reproduction of the issue.

The specific reproduction steps are as follows:
1. Initialize the data
```
DROP TABLE IF EXISTS tbl_part;
CREATE TABLE tbl_part (a integer) PARTITION BY RANGE (a);
CREATE TABLE tbl_part_p1 PARTITION OF tbl_part FOR VALUES FROM (0) TO (10);
CREATE INDEX ON tbl_part(a);
```
2. session1 reindex and gdb break at index.c:3585
```
REINDEX INDEX tbl_part_a_idx;
```
3. session2 drop index succeed

```
DROP INDEX tbl_part_a_idx;
```
4. session1 gdb continue


Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.

Вложения
I have provided a python script in the attachment to minimize the reproduction of the issue.

I'm sorry that I lost the attached script in my last reply, but I've added it in this reply.

You can also use psql to reproduce it with the following steps:
1. Initialize the data
```
DROP TABLE IF EXISTS tbl_part;
CREATE TABLE tbl_part (a integer) PARTITION BY RANGE (a);
CREATE TABLE tbl_part_p1 PARTITION OF tbl_part FOR VALUES FROM (0) TO (10);
CREATE INDEX ON tbl_part(a);
```
2. session1 reindex and gdb break at index.c:3585
```
REINDEX INDEX tbl_part_a_idx;
```
3. session2 drop index succeed

```
DROP INDEX tbl_part_a_idx;
```
4. session1 gdb continue


Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.
Вложения
"=?ISO-8859-1?B?ZmVpY2hhbmdob25n?=" <feichanghong@qq.com> writes:
> 2. session1 reindex and gdb break at index.c:3585

This is extremely nonspecific, as line numbers in our code change
constantly.  Please quote a chunk of code surrounding that
and indicate which line you are trying to stop at.

            regards, tom lane




This is extremely nonspecific, as line numbers in our code change
constantly.  Please quote a chunk of code surrounding that
and indicate which line you are trying to stop at.
Thanks for the suggestion, I've refined the steps below to reproduce:
1. Initialize the data
```
DROP TABLE IF EXISTS tbl_part;
CREATE TABLE tbl_part (a integer) PARTITION BY RANGE (a);
CREATE TABLE tbl_part_p1 PARTITION OF tbl_part FOR VALUES FROM (0) TO (10);
CREATE INDEX ON tbl_part(a);
```
2. session1 reindex and the gdb break after the reindex_index function successfully obtains the heapId, as noted in the code chunk below:

reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 const ReindexParams *params)
{
......
/*
* Open and lock the parent heap relation.  ShareLock is sufficient since
* we only need to be sure no schema or data changes are going on.
*/
heapId = IndexGetRelation(indexId,
 (params->options & REINDEXOPT_MISSING_OK) != 0);
====> gdb break at here
/* if relation is missing, leave */
if (!OidIsValid(heapId))
return;
```
REINDEX INDEX tbl_part_a_idx;
```
3. session2 drop index succeed

```
DROP INDEX tbl_part_a_idx;
```
4. session1 gdb continue


Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.

On Wed, Jan 17, 2024 at 12:54:26AM +0800, feichanghong wrote:
>> This is extremely nonspecific, as line numbers in our code change
>> constantly.  Please quote a chunk of code surrounding that
>> and indicate which line you are trying to stop at.
>
> Thanks for the suggestion, I've refined the steps below to reproduce:

Yeah, thanks for the steps.  I am not surprised that there are still a
few holes in this area.  CONCURRENTLY can behave differently depending
on the step where the old index is getting opened.

For this specific job, I have always wanted a try_index_open() that
would attempt to open the index with a relkind check, perhaps we could
introduce one and reuse it here?
--
Michael

Вложения

For this specific job, I have always wanted a try_index_open() that
would attempt to open the index with a relkind check, perhaps we could
introduce one and reuse it here?

Yes, replacing index_open with try_index_open solves the problem. The
idea is similar to my initial report of "after successfully opening the heap
table in reindex_index, check again whether the index still exists”.
But it should be better to introduce try_index_open.


Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.

Вложения
It has been verified that the patch in the attachment can solve the
above problems. I sincerely look forward to your suggestions!
Вложения
On Wed, Jan 17, 2024 at 04:03:46PM +0800, feichanghong wrote:
> It has been verified that the patch in the attachment can solve the
> above problems. I sincerely look forward to your suggestions!

Thanks for the patch.  I have completely forgotten to update this
thread.  Except for a few comments and names that I've tweaked, this
was OK, so applied and backpatched after splitting things into two:
- One commit for try_index_open().
- Second commit for the fix in reindex_index().

I've looked at the concurrent paths as well, and even if these involve
more relations opened we maintain a session lock on the parent
relations that we manipulate, so I could not see a pattern where the
index would be dropped and where we'd try to open it.  Now, there are
cases where it is possible to deadlock for the concurrent paths, but
that's not new: schema or database level reindexes can also hit that.

This is one of these areas where tests are hard to write now because
we want to stop operations at specific points but we cannot.
--
Michael

Вложения