Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f
Дата
Msg-id 1709793.1644523137@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f  (Noah Misch <noah@leadboat.com>)
Ответы Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f  (Andres Freund <andres@anarazel.de>)
Список pgsql-committers
Noah Misch <noah@leadboat.com> writes:
> On AIX, this pair of commits added 5-10h of runtime to 017_shm.pl.

I poked into this on gcc119 (hoverfly's host).  As best I can tell,
the root of the problem is that "kill 9" isn't instantaneous on this
platform.  In [1] we find this in 017_shm_gnat.log:

2022-02-10 07:08:20.175 UTC [41615696:1] LOG:  starting PostgreSQL 15devel on powerpc-ibm-aix7.2.4.0, compiled by
/opt/IBM/xlc/16.1.0/bin/xlc_r-D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY, 64-bit 
2022-02-10 07:08:20.175 UTC [41615696:2] LOG:  listening on Unix socket "/home/nm/farm/tmp/FTwLLIVrmg/.s.PGSQL.63971"
2022-02-10 07:08:20.182 UTC [34931288:1] LOG:  database system was shut down at 2022-02-10 07:08:20 UTC
2022-02-10 07:08:20.194 UTC [41615696:3] LOG:  database system is ready to accept connections
2022-02-10 07:08:20.380 UTC [26018140:1] FATAL:  lock file "postmaster.pid" already exists
2022-02-10 07:08:20.380 UTC [26018140:2] HINT:  Is another postmaster (PID 41615696) running in data directory
"/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata"?
2022-02-10 12:48:44.665 UTC [18350518:1] LOG:  starting PostgreSQL 15devel on powerpc-ibm-aix7.2.4.0, compiled by
/opt/IBM/xlc/16.1.0/bin/xlc_r-D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY, 64-bit 

41615696 is the postmaster PID that is targeted by 017_shm.pl's first
kill9 call, after which we immediately attempt to start a new postmaster
(26018140), which is complaining that 41615696 is still alive.  Looking
at the code that issues that in miscinit.c, either kill(41615696, 0)
succeeded, or it returned some errno that's neither ESRCH nor EPERM.
AIX doesn't document any kill(2) errors outside of what POSIX says,
so I reject the second theory and conclude that kill() was still seeing
41615696 as alive at this point.

But, once the replacement postmaster fails, we got trouble, because
the old postmaster.pid file (mentioning 41615696) is still there.
017_shm.pl's poll_start sees that "pg_ctl start" failed but then
does
        # Clean up in case the start attempt just timed out or some such.
        $node->stop('fast', fail_ok => 1);
(which I added in 4fdbf9af5).

My hypothesis is that the kill() call in pg_ctl's do_stop() still sees
41615696 as alive at this point, so pg_ctl enters a loop of waiting for
postmaster.pid to go away, which it never will.  PGCTLTIMEOUT later
(~18000 seconds for hoverfly), pg_ctl gives up and the test proceeds.

So it looks to me like the core problem is that pg_ctl's do_stop()
is too trusting: if it once sees the postmaster PID as alive, it
figures that's the end of the story.  We need to modify it to
recheck that during the wait loop.  (do_restart and do_promote
seem similarly naive ... and why are there so many copies of the
wait loop, anyway?)

Another idea is to modify Cluster.pm's kill9 to not return until the
PID is gone according to kill(0).  On the other hand, that'd mask
problems like this, so I'm not entirely enthused about changing
that end of things.

            regards, tom lane

[1]
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2022-02-10%2006%3A29%3A25&stg=recovery-check



В списке pgsql-committers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: pgsql: Use gendef instead of pexports for building windows .def files
Следующее
От: Tom Lane
Дата:
Сообщение: pgsql: Make pg_ctl stop/restart/promote recheck postmaster aliveness.