Re: Perform streaming logical transactions by background workers and parallel apply

From: Masahiko Sawada
Subject: Re: Perform streaming logical transactions by background workers and parallel apply
Date:
Msg-id: CAD21AoDoNOT_5srNYc_iBwEXd1TsKrnJcDg5iq=m7V-PnX8guA@mail.gmail.com
In reply to: Re: Perform streaming logical transactions by background workers and parallel apply  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Mon, May 8, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 8, 2023 at 11:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, May 8, 2023 at 12:52 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Monday, May 8, 2023 11:08 AM Masahiko Sawada <sawada.mshk@gmail.com>
> > >
> > > Hi,
> > >
> > > >
> > > > On Tue, May 2, 2023 at 12:22 PM Amit Kapila <amit.kapila16@gmail.com>
> > > > wrote:
> > > > >
> > > > > On Fri, Apr 28, 2023 at 11:48 AM Masahiko Sawada
> > > > <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > While investigating this issue, I've reviewed the code around
> > > > > > callbacks, worker termination, etc., and found a problem.
> > > > > >
> > > > > > A parallel apply worker calls the before_shmem_exit callbacks in the
> > > > > > following order:
> > > > > >
> > > > > > 1. ShutdownPostgres()
> > > > > > 2. logicalrep_worker_onexit()
> > > > > > 3. pa_shutdown()
> > > > > >
> > > > > > Since the worker is detached during logicalrep_worker_onexit(),
> > > > > > MyLogicalRepWorker->leader_pid is invalid when we call
> > > > > > pa_shutdown():
> > > > > >
> > > > > > static void
> > > > > > pa_shutdown(int code, Datum arg)
> > > > > > {
> > > > > >     Assert(MyLogicalRepWorker->leader_pid != InvalidPid);
> > > > > >     SendProcSignal(MyLogicalRepWorker->leader_pid,
> > > > > >                    PROCSIG_PARALLEL_APPLY_MESSAGE,
> > > > > >                    InvalidBackendId);
> > > > > >
> > > > > > Also, if the parallel apply worker fails shm_toc_lookup() during the
> > > > > > initialization, it raises an error (because of noError = false) but
> > > > > > ends up in a SEGV as MyLogicalRepWorker is still NULL.
> > > > > >
> > > > > > I think that we should not use MyLogicalRepWorker->leader_pid in
> > > > > > pa_shutdown() but instead store the leader's pid to a static variable
> > > > > > before registering pa_shutdown() callback.
> > > > > >
> > > > >
> > > > > Why not simply move the registration of pa_shutdown() to someplace
> > > > > after logicalrep_worker_attach()?
> > > >
> > > > If we do that, the worker won't call dsm_detach() if it raises an
> > > > ERROR in logicalrep_worker_attach(), is that okay? It seems that it's
> > > > not a practical problem since we call dsm_backend_shutdown() in
> > > > shmem_exit(), but if so, why do we need to call it in pa_shutdown()?
> > >
> > > I think the dsm_detach in pa_shutdown was intended to fire on_dsm_detach
> > > callbacks to give each callback a chance to report stats before the stats
> > > system is shut down, following what we do in ParallelWorkerShutdown() (e.g.
> > > sharedfileset.c callbacks cause fd.c to do ReportTemporaryFileUsage(), so we
> > > need to fire that earlier).
> > >
> > > But for parallel apply, we currently only have one on_dsm_detach
> > > callback (shm_mq_detach_callback) which doesn't report extra stats. So the
> > > dsm_detach in pa_shutdown is only used to make it a bit future-proof in case
> > > we add some other on_dsm_detach callbacks in the future which need to report
> > > stats.
> >
> > Makes sense. Given that it's possible that we add other callbacks that
> > report stats in the future, I think it's better not to move the
> > position to register pa_shutdown() callback.
> >
>
> Hmm, what kind of stats do we expect to be collected before we
> register pa_shutdown? I think if required we can register such a
> callback after pa_shutdown. I feel without reordering the callbacks,
> the fix would be a bit complicated as explained in my previous email,
> so I don't think it is worth complicating this code unless really
> required.

Fair point. I agree that the issue can be resolved by carefully
ordering the callback registration.

Another thing I'm concerned about is that since both the leader worker
and the parallel worker detach the DSM before logicalrep_worker_onexit(),
cleanup work that touches the DSM cannot be done in
logicalrep_worker_onexit(). If we need to do something there in the future,
we would need another callback that runs before detaching the DSM.
But I'm fine with that, as it's not a problem for now.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


