Re: walsender performance regression due to logical decoding on standby changes

From: Bharath Rupireddy
Subject: Re: walsender performance regression due to logical decoding on standby changes
Date:
Msg-id: CALj2ACXtHuS9PcP+R2TiMdSx_GepYSfx8n1O0pZSAPZQY2PWJA@mail.gmail.com
In response to: Re: walsender performance regression due to logical decoding on standby changes  (Andres Freund <andres@anarazel.de>)
Responses: Re: walsender performance regression due to logical decoding on standby changes  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: walsender performance regression due to logical decoding on standby changes  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Thu, May 18, 2023 at 1:23 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-05-15 20:09:00 +0530, Bharath Rupireddy wrote:
> > > [1]
> > > max_wal_senders = 100
> > > before regression(ms)                after regression(ms)    v2 patch(ms)
> > > 13394.4013                          14141.2615              13455.2543
> > > Compared with before regression     5.58%                   0.45%
> > >
> > > max_wal_senders = 200
> > > before regression(ms)                after regression(ms)     v2 patch(ms)
> > > 13280.8507                          14597.1173              13632.0606
> > > Compared with before regression     9.91%                   1.64%
> > >
> > > max_wal_senders = 300
> > > before regression(ms)                after regression(ms)     v2 patch(ms)
> > > 13535.0232                          16735.7379              13705.7135
> > > Compared with before regression     23.65%                  1.26%
> >
> > Yes, the numbers with v2 patch look close to where we were before.
> > Thanks for confirming. Just wondering, where is this extra
> > 0.45%/1.64%/1.26% coming from?
>
> We still do more work for each WAL record than before, so I'd expect something
> small. I'd say right now the main overhead with the patch comes from the
> spinlock acquisitions in ConditionVariableBroadcast(), which happen even when
> nobody is waiting.


> > +             ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
> > +             ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
>
> It's not obvious to me that it's worth having two CVs, because it's more
> expensive to find no waiters in two CVs than to find no waiters in one CV.

I disagree. In the tight per-WAL-record recovery loop, WalSndWakeup
wakes up logical walsenders for every WAL record, but it wakes up
physical walsenders only if the applied WAL record causes a TLI
switch. Therefore, the extra cost of a spinlock acquire-release per
WAL record applies only to logical walsenders. On the other hand, if
we were to use a single CV, we would unnecessarily wake up physical
walsenders (if they happen to be sleeping at all) for every WAL
record, which is costly IMO.

I still think separate CVs are a good fit for selective wakeups,
given that TLI-switch WAL records are rare.

> > +      *
> > +      * XXX: When available, WaitEventSetWait() can be replaced with its CV's
> > +      * counterpart.
>
> I don't really understand that XXX - the potential bright future would be to
> add support for CVs into wait event sets, not to replace WES with a CV?

Yes, I meant it and modified that part to:

     * XXX: A desirable future improvement would be to add support for CVs into
     * WaitEventSetWait().

> FWIW, if I just make WalSndWakeup() do nothing, I still see a very small, but
> reproducible, overhead: 1.72s - that's just the cost of the additional
> external function call.

Hm, this is unavoidable.

> If I add a no-waiters fastpath using proclist_is_empty() to
> ConditionVariableBroadcast(), I get 1.77s. So the majority of the remaining
> slowdown indeed comes from the spinlock acquisition in
> ConditionVariableBroadcast().

Acquiring the spinlock for every replayed WAL record seems not great.
In general, this problem exists for the CV infrastructure as a whole
whenever ConditionVariableBroadcast()/CVB() is called in a loop/hot
path. I think this can be demonstrated easily by doing something like
- select
pg_replication_origin_session_setup()/pg_wal_replay_resume()/pg_create_physical_replication_slot()
from generate_series(1, 1000000...); - all of these functions end up
in CVB().

Do you think a fastpath exit from CVB() when there are no waiters is
worth implementing? Something like: add an atomic state variable to
each CV, similar to LWLock's state variable, set it when adding
waiters and reset it when removing them, and have CVB() read the
state atomically for the fastpath (of course, we need memory barriers
here). This might complicate things, as each CV structure gets an
extra state variable (4 bytes), memory barriers, and atomic writes
and reads.

Or, given that with the patch recovery is the only caller of CVB() in
a hot path, maybe we can instead add an atomic waiter count to
WalSndCtl: increment it atomically in WalSndWait() before the wait,
decrement it after the wait, and have WalSndWakeup() read it
atomically as a fastpath to avoid the CVB() calls altogether. IIUC,
this is something Jeff proposed upthread.

Thoughts?

Please find the attached v4 patch addressing the review comment (not
the fastpath one).

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments
