Обсуждение: Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file
Hello,
I tested the pg_upgrade tool many times on different servers (always Windows server 19, actual subversion may differ) when trying to upgrade an existing database from Postgres 9.6 to Postgres 15 (I tried both the 15.4.2 and 15.7) and was almost all the time faced with this issue during the step “Setting next transaction ID and epoch for new cluster”.
Here’s the version of one of the servers, on which it failed at least three times :
The command I ran is "C:\Program Files\PostgreSQL\15\bin\pg_upgrade.exe" -d "C:\Program Files\PostgreSQL\9.6\data" -D "C:\Program Files\PostgreSQL\15\data" -b "C:\Program Files\PostgreSQL\9.6\bin" -B "C:\Program Files\PostgreSQL\15\bin" -U postgres after having set PGPASSWORD to the correct password.
The issue was either “pg_resetwal: error: could not delete file "pg_wal/000000010000000000000001": Permission denied” or sometimes it was saying that the file could not be found instead of Permission denied. When I look in the directory while it is executing, I can see that the file is there previously, and always removed after the pg_upgrade crashes. I tried to inspect with Process Explorer what processes were using it, always processes from postgres, only one after a fresh install of postgres 15, but I saw that during the execution of pg_upgrade, sometimes two processes were using it.
I suspect that there is some sort of race condition where one process sees that the file exists, does something with it and deletes it, while another process saw the file existing, but upon trying to delete it, it could not find it anymore. I had a look in the code and I believe it happens in the function KillExistingXLOG from line 973 of pg_resetwal.c (https://github.com/postgres/postgres/blob/master/src/bin/pg_resetwal/pg_resetwal.c#L973) though I cannot be entirely sure of the cause.
You can find the logs produced by the pg_upgrade tool attached, with the verbose option.
Thanks in advance for the investigation and I hope to understand better the problem and hopefully see a fix soon as it is complicating the deployment of a major upgrade of our product,
Have a great day,
Thomas
Вложения
On Wed, 2024-05-22 at 18:14 +0200, Waka Ranai wrote: > The command I ran is > "C:\Program Files\PostgreSQL\15\bin\pg_upgrade.exe" -d "C:\Program Files\PostgreSQL\9.6\data" -D "C:\Program Files\PostgreSQL\15\data"-b "C:\Program Files\PostgreSQL\9.6\bin" -B "C:\Program Files\PostgreSQL\15\bin" -U postgres > after having set PGPASSWORD to the correct password. > > The issue was either > “pg_resetwal: error: could not delete file "pg_wal/000000010000000000000001": Permission denied” > or sometimes it was saying that the file could not be found instead of Permission denied. > When I look in the directory while it is executing, I can see that the file is there previously, > and always removed after the pg_upgrade crashes. I tried to inspect with Process Explorer what > processes were using it, always processes from postgres, only one after a fresh install of > postgres 15, but I saw that during the execution of pg_upgrade, sometimes two processes were > using it. Did you make sure that no virus scanner is active on the data directories? Yours, Laurenz Albe
On Wed, 2024-05-22 at 18:14 +0200, Waka Ranai wrote:
> The command I ran is
> "C:\Program Files\PostgreSQL\15\bin\pg_upgrade.exe" -d "C:\Program Files\PostgreSQL\9.6\data" -D "C:\Program Files\PostgreSQL\15\data" -b "C:\Program Files\PostgreSQL\9.6\bin" -B "C:\Program Files\PostgreSQL\15\bin" -U postgres
> after having set PGPASSWORD to the correct password.
>
> The issue was either
> “pg_resetwal: error: could not delete file "pg_wal/000000010000000000000001": Permission denied”
> or sometimes it was saying that the file could not be found instead of Permission denied.
> When I look in the directory while it is executing, I can see that the file is there previously,
> and always removed after the pg_upgrade crashes. I tried to inspect with Process Explorer what
> processes were using it, always processes from postgres, only one after a fresh install of
> postgres 15, but I saw that during the execution of pg_upgrade, sometimes two processes were
> using it.
Did you make sure that no virus scanner is active on the data directories?
Yours,
Laurenz Albe
On Thu, 2024-05-23 at 11:39 +0200, Waka Ranai wrote: > no virus scanner were installed apart from the default one of Microsoft Perhaps the default virus scanner is at fault. To exclude that possibility, disable it for the PostgreSQL data directories. Yours, Laurenz Albe
![image.jpeg](/media/maillist_attaches/pgsql-bugs/2024/05/28/CAP8Vo=_4xK06+vLefRv8wHy9FiEbn7uOFnu2wJQrvGNF7mT5OQ@mail.gmail.com/image.jpeg)
On Thu, 2024-05-23 at 11:39 +0200, Waka Ranai wrote:
> no virus scanner were installed apart from the default one of Microsoft
Perhaps the default virus scanner is at fault. To exclude that possibility,
disable it for the PostgreSQL data directories.
Yours,
Laurenz Albe
Вложения
On Tue, 2024-05-28 at 16:14 +0200, Waka Ranai wrote: > We tested on the aforementioned computer after adding an exception on the pg_wal > folder for the Microsoft default antivirus with > Add-MpPreference -ExclusionPath "C:\Program Files\PostgreSQL\15\data\pg_wal" > but we still faced the same issue, I included the pg_upgrade logs Thanks. I see command: "C:/Program Files/PostgreSQL/15/bin/pg_resetwal" -f -u 536 "C:/Program Files/PostgreSQL/15/data" >> "C:/ProgramFiles/PostgreSQL/15/data/pg_upgrade_output.d/202405> Write-ahead log reset command: "C:/Program Files/PostgreSQL/15/bin/pg_resetwal" -f -x 3466214 "C:/Program Files/PostgreSQL/15/data" >> "C:/ProgramFiles/PostgreSQL/15/data/pg_upgrade_output.d/20> pg_resetwal: error: could not delete file "pg_wal/000000010000000000000001": No such file or directory So it is failing in KillExistingXLOG(): readdir() finds the file, but by the time unlink() is executed, the file is already gone. The file in question is the WAL segment written by WriteEmptyXLOG() in the previous "pg_resetwal" execution. But the previous "pg_resetwal" has exited by the time the next one is started, so it should not be at fault. I found this similar thread: https://postgr.es/m/20090910094211.166C5753FB7%40cvs.postgresql.org The symptoms are the same. I wonder if something like commit 4e2d5efc6a45b1f9f96df42629f6d1c7740e657e would be useful here too. But it cannot be a PostgreSQL process that is holding the file open - the creating process has already exited, and no other PostgreSQL process would read the file. So the fact remains that there is something *outside of PostgreSQL* that opens newly created files. You say you disabled the virus scanner, but can you think of any other software on your system that would do that? Perhaps you can try disabling the virus scanner completely and check if that gets rid of the problem. Yours, Laurenz Albe
On Tue, 2024-05-28 at 16:14 +0200, Waka Ranai wrote:
> We tested on the aforementioned computer after adding an exception on the pg_wal
> folder for the Microsoft default antivirus with
> Add-MpPreference -ExclusionPath "C:\Program Files\PostgreSQL\15\data\pg_wal"
> but we still faced the same issue, I included the pg_upgrade logs
Thanks. I see
command: "C:/Program Files/PostgreSQL/15/bin/pg_resetwal" -f -u 536 "C:/Program Files/PostgreSQL/15/data" >> "C:/Program Files/PostgreSQL/15/data/pg_upgrade_output.d/202405>
Write-ahead log reset
command: "C:/Program Files/PostgreSQL/15/bin/pg_resetwal" -f -x 3466214 "C:/Program Files/PostgreSQL/15/data" >> "C:/Program Files/PostgreSQL/15/data/pg_upgrade_output.d/20>
pg_resetwal: error: could not delete file "pg_wal/000000010000000000000001": No such file or directory
So it is failing in KillExistingXLOG(): readdir() finds the file,
but by the time unlink() is executed, the file is already gone.
The file in question is the WAL segment written by WriteEmptyXLOG() in the
previous "pg_resetwal" execution.
But the previous "pg_resetwal" has exited by the time the next one is started,
so it should not be at fault.
I found this similar thread:
https://postgr.es/m/20090910094211.166C5753FB7%40cvs.postgresql.org
The symptoms are the same.
I wonder if something like commit 4e2d5efc6a45b1f9f96df42629f6d1c7740e657e
would be useful here too. But it cannot be a PostgreSQL process that is
holding the file open - the creating process has already exited, and no
other PostgreSQL process would read the file.
So the fact remains that there is something *outside of PostgreSQL* that
opens newly created files. You say you disabled the virus scanner, but can
you think of any other software on your system that would do that?
Perhaps you can try disabling the virus scanner completely and check if
that gets rid of the problem.
Yours,
Laurenz Albe