Re: Synchronizing slots from primary to standby

Поиск
Список
Период
Сортировка
От Drouvot, Bertrand
Тема Re: Synchronizing slots from primary to standby
Дата
Msg-id 5a858c63-cd59-4622-a9ca-eebbb7bd59cc@gmail.com
обсуждение исходный текст
Ответ на Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
Hi,

On 11/7/23 11:55 AM, Amit Kapila wrote:
> On Tue, Nov 7, 2023 at 3:51 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> On 10/31/23 10:37 AM, shveta malik wrote:
>>> On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand
>>>>
>>>> Agree with your test case, but in my case I was not using pub/sub.
>>>>
>>>> I was not clear, so when I said:
>>>>
>>>>>> - create logical_slot1 on the primary (and don't start using it)
>>>>
>>>> I meant don't start decoding from it (like using pg_recvlogical() or
>>>> pg_logical_slot_get_changes()).
>>>>
>>>> By using pub/sub the "don't start using it" is not satisfied.
>>>>
>>>> My test case is:
>>>>
>>>> "
>>>> SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, true, true);
>>>> SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, true, true);
>>>> pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f -
>>>> "
>>>>
>>>
>>> Okay, I am able to reproduce it now. Thanks for clarification. I have
>>> tried to change the algorithm as per suggestion by Amit in [1]
>>>
>>> [1]:
https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com
>>
>> Thanks!
>>
>>>
>>> This is not full proof solution but optimization over first one. Now
>>> in any sync-cycle, we take 2 attempts for slots-creation (if any slots
>>> are available to be created). In first attempt, we do not wait
>>> indefinitely on inactive slots, we wait only for a fixed amount of
>>> time and if remote-slot is still behind, then we add that to the
>>> pending list and move to the next slot. Once we are done with first
>>> attempt, in second attempt, we go for the pending ones and now we wait
>>> on each of them until the primary catches up.
>>
>> Aren't we "just" postponing the "issue"? I mean if there is really no activity
>> on, say, the first created slot, then once we move to the second attempt then any newly
>> created slot from that time would wait to be synced forever, no?
>>
> 
> We have to wait at some point in time for such inactive slots and the
> same is true even for manually created slots on standby. Do you have
> any better ideas to deal with it?
> 

What about:

- get rid of the second attempt and the pending_slot_list
- keep the wait_count and PrimaryCatchupWaitAttempt logic

so basically, get rid of:

    /*
     * Now sync the pending slots which were failed to be created in first
     * attempt.
     */
    foreach(cell, pending_slot_list)
    {
        RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell);

        /* Wait until the primary server catches up */
        PrimaryCatchupWaitAttempt = 0;

        synchronize_one_slot(wrconn, remote_slot, NULL);
    }

and the pending_slot_list list.

That way, for each slot that have not been created and synced yet:

- it will be created on the standby
- we will wait up to PrimaryCatchupWaitAttempt attempts
- the slot will be synced or removed on/from the standby

That way an inactive slot on the primary would not "block"
any other slots on the standby.

By "created" here I mean calling ReplicationSlotCreate() (not to be confused
with emitting "ereport(LOG, errmsg("created slot \"%s\" locally", remote_slot->name)); "
which is confusing as mentioned up-thread).

The problem I can see with this proposal is that the "sync" window waiting
for slot activity on the primary is "only" during the PrimaryCatchupWaitAttempt
attempts (as the slot will be dropped/recreated).

If we think this window is too short we could:

- increase it
or
- don't drop the slot once created (even if there is no activity
on the primary during PrimaryCatchupWaitAttempt attempts) so that
the next loop of attempts will compare with "older" LSN/xmin (as compare to
dropping and re-creating the slot). That way the window would be since the
initial slot creation.

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Jonathan S. Katz"
Дата:
Сообщение: Re: 2023-11-09 release announcement draft
Следующее
От: Dean Rasheed
Дата:
Сообщение: Re: Infinite Interval