RE: Random pg_upgrade test failure on drongo

Поиск
Список
Период
Сортировка
От Hayato Kuroda (Fujitsu)
Тема RE: Random pg_upgrade test failure on drongo
Дата
Msg-id TYAPR01MB5866A4E7342088E91362BEF0F5BBA@TYAPR01MB5866.jpnprd01.prod.outlook.com
обсуждение исходный текст
Ответ на RE: Random pg_upgrade test failure on drongo  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Ответы RE: Random pg_upgrade test failure on drongo  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Список pgsql-hackers
Dear hackers,

This email tells an update. The machine drongo failed the test a week ago [1]
and finally got logfiles. PSA files.

## Observed failure

pg_upgrade_server.log is a server log during the pg_upgrade command. According to
it, the TRUNCATE command seemed to be failed due to a "File exists" error.

```
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:  could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
    -- For binary upgrade, preserve pg_largeobject and index relfilenodes
    SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
    SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
    TRUNCATE pg_catalog.pg_largeobject;
...
```

## Analysis

I think it caused due to the STATUS_DELETE_PENDING failure, not related with recent
updates for pg_upgrade.

The file "base/1/2683" is an index file for pg_largeobject_loid_pn_index, and the
output meant that file creation was failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange. Before calling mdcreate(), we surely unlink the file which
have the same name. Below is a trace until unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found Thomas said that [2] pgunlink sometimes could not remove file even if
it returns OK, at that time NTSTATUS is STATUS_DELETE_PENDING. Also, a comment
in pgwin32_open_handle() mentions the same thing:

```
        /*
         * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
         * gone (Windows NT status code is STATUS_DELETE_PENDING).  In that
         * case, we'd better ask for the NT status too so we can translate it
         * to a more Unix-like error.  We hope that nothing clobbers the NT
         * status in between the internal NtCreateFile() call and CreateFile()
         * returning.
         *
```

The definition of STATUS_DELETE_PENDING can be seen in [3]. Based on that, indeed,
open() would be able to fail with STATUS_DELETE_PENDING if the deletion is pending
but it is tried to open.

Another thread [4] also tries the issue while doing rmtree->unlink, and it reties
to remove if it fails with STATUS_DELETE_PENDING. So, should we retry to open when
it fails as well? Anyway, this fix seems out-of-scope for pg_upgrade.

How do you think? Do you have any thoughts about it?

## Acknowledgement

I want to say thanks to Sholk, Vingesh, for helping the analysis.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-11-15%2006%3A16%3A15
[2]: https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com
[3]: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[4]:
https://www.postgresql.org/message-id/flat/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de#6ae5e2ba3dd6e1fd680dcc34eab710d5

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




В списке pgsql-hackers по дате отправления:

Предыдущее
От: Vitaly Davydov
Дата:
Сообщение: Re: How to accurately determine when a relation should use local buffers?
Следующее
От: "Hayato Kuroda (Fujitsu)"
Дата:
Сообщение: RE: Random pg_upgrade test failure on drongo