Thread: Issue with Postgres process startup after instance restart

Issue with Postgres process startup after instance restart

From
Shishir Joshi
Date:
Hello,
I recently faced an issue with PG 11 where the VM that the PG process was running on got restarted because of a hardware issue. After the VM restart, the Postgres process failed to start on the 1st attempt with the error "LOG:  could not open directory "pg_tblspc/16388/PG_11_201809051": No such file or directory" even though that directory was present. But on the 2nd attempt it started up without issues. There didn't seem to be any disk corruption issues and there were no other errors in the syslog either. Has anyone else faced such an issue or has any ideas on why this could have occurred? 

Re: Issue with Postgres process startup after instance restart

From
Tom Lane
Date:
Shishir Joshi <shishir.joshi@gojek.com> writes:
> I recently faced an issue with PG 11 where the VM that the PG process was
> running on got restarted because of a hardware issue. After the VM restart,
> the Postgres process failed to start on the 1st attempt with the error "*LOG:
>  could not open directory "pg_tblspc/16388/PG_11_201809051": No such file
> or directory*" even though that directory was present. But on the 2nd
> attempt it started up without issues. There didn't seem to be any disk
> corruption issues and there were no other errors in the syslog either. Has
> anyone else faced such an issue or has any ideas on why this could have
> occurred?

Maybe whatever the tablespace is pointing at wasn't mounted yet?
Slow remote mounts are the bane of PG DBAs --- I can recall at least
one famous incident in which someone's database became totally
corrupt because the NFS mount it was on came up after server start,
leading to the server having a mishmash of files on the NFS server
and files on the local disk, now hidden underneath the mount point.

If this is what your issue was, you got very lucky to escape without
damage.  Suggest adapting your PG server start script to make sure the
mounted file system is present before you allow the server to start.
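
One way such a pre-start check could look is sketched below in Python. This is only an illustration, not the script used in this thread: the function name `wait_for_mount`, the timeout values, and the injectable `is_mounted` parameter are all assumptions made for the example. It relies on `os.path.ismount` to decide whether the path is currently a mount point.

```python
import os
import time


def wait_for_mount(mount_point, timeout=60.0, interval=1.0,
                   is_mounted=os.path.ismount):
    """Return True once mount_point is a mount point, False on timeout.

    is_mounted is injectable so the logic can be exercised without a real
    mount; by default it is os.path.ismount.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_mounted(mount_point):
            return True
        if time.monotonic() >= deadline:
            # Give up: the caller should refuse to start the server.
            return False
        time.sleep(interval)
```

A start script could call `wait_for_mount("/mnt/pg_tblspc")` (a hypothetical path) and exit with a non-zero status when it returns False, so that the server is never started on top of an empty mount point.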

            regards, tom lane



Re: Issue with Postgres process startup after instance restart

From
Shishir Joshi
Date:
Hi Tom,
I forgot to mention, but in this case it looks like the mount was completed before the PG process was started up. But we don't have an explicit check for making sure the file system is present in the start script. Thanks for the tip.

Re: Issue with Postgres process startup after instance restart

From
Laurenz Albe
Date:
On Mon, 2020-03-30 at 11:02 +0530, Shishir Joshi wrote:
> I forgot to mention, but in this case it looks like the mount was completed before
> the PG process was started up. But we don't have an explicit check for making
> sure the file system is present in the start script. Thanks for the tip.

If that is an NFS mount, make sure it is "fg", not "bg".

Also, check that your startup script simply fails if the file system is not
mounted yet, rather than automatically running "initdb".
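
A fail-fast check along these lines might be sketched as follows, in Python. It is an illustration under stated assumptions rather than anyone's actual startup script: the function name `preflight_errors` and the wording of the messages are made up for the example. It uses two facts about the data directory layout: an initialized data directory contains a `PG_VERSION` file, and each entry under `pg_tblspc` is a symbolic link to a tablespace location. If either check fails, the script should stop instead of initializing a new cluster.

```python
import os


def preflight_errors(data_dir):
    """Return a list of reasons not to start the server; empty means OK."""
    errors = []
    # An initialized data directory always contains PG_VERSION. If it is
    # missing, the file system is probably not mounted; never run initdb here.
    if not os.path.isfile(os.path.join(data_dir, "PG_VERSION")):
        errors.append(f"{data_dir}: PG_VERSION not found (file system not mounted?)")
        return errors
    tblspc_dir = os.path.join(data_dir, "pg_tblspc")
    if os.path.isdir(tblspc_dir):
        for name in sorted(os.listdir(tblspc_dir)):
            link = os.path.join(tblspc_dir, name)
            # os.path.isdir follows symlinks, so a dangling link fails here.
            if not os.path.isdir(link):
                errors.append(f"{link}: target is missing or not a directory")
    return errors
```

The caller would print the returned messages and exit with a non-zero status if the list is not empty, and only otherwise go on to start the server.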

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com