Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

Поиск
Список
Период
Сортировка
От Melih Mutlu
Тема Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Дата
Msg-id CAGPVpCQZZ5SPtVYSGSBTQEgp0eWV9qEtHcLNAywkd-CJ7M-Umg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (shveta malik <shveta.malik@gmail.com>)
Ответы Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Melih Mutlu <m.melihmutlu@gmail.com>)
Список pgsql-hackers
Hi Shveta and Shi,

Thanks for your investigations.

shveta malik <shveta.malik@gmail.com>, 8 Şub 2023 Çar, 16:49 tarihinde şunu yazdı:
On Tue, Feb 7, 2023 at 8:18 AM shiy.fnst@fujitsu.com
<shiy.fnst@fujitsu.com> wrote:
> I reproduced the problem I reported before with latest patch (v7-0001,
> v10-0002), and looked into this problem. It is caused by a similar reason. Here
> is some analysis for the problem I reported [1].#6.
>
> First, a tablesync worker (worker-1) started for "tbl1", its originname is
> "pg_16398_1". And it exited because of unique constraint. In
> LogicalRepSyncTableStart(), originname in pg_subscription_rel is updated when
> updating table state to DATASYNC, and the origin is created when updating table
> state to FINISHEDCOPY. So when it exited with state DATASYNC , the origin is not
> created but the originname has been updated in pg_subscription_rel.
>
> Then a tablesync worker (worker-2) started for "tbl2", its originname is
> "pg_16398_2". After tablesync of "tbl2" finished, this worker moved to sync
> table "tbl1". In LogicalRepSyncTableStart(), it got the originname of "tbl1" -
> "pg_16398_1", by calling ReplicationOriginNameForLogicalRep(), and tried to drop
> the origin (although it is not actually created before). After that, it called
> replorigin_by_name to get the originid whose name is "pg_16398_1" and the result
> is InvalidOid. Origin won't be created in this case because the sync worker has
> created a replication slot (when it synced tbl2), so the originid was still
> invalid and it caused an assertion failure when calling replorigin_advance().
>
> It seems we don't need to drop previous origin in worker-2 because the previous
> origin was not created in worker-1. I think one way to fix it is to not update
> originname of pg_subscription_rel when setting state to DATASYNC, and only do
> that when setting state to FINISHEDCOPY. If so, the originname in
> pg_subscription_rel will be set at the same time the origin is created.

+1. Update of system-catalog needs to be done carefully and only when
origin is created.

I see that setting originname in the catalog before actually creating it causes issues. My concern with setting originname when setting the state to FINISHEDCOPY is that if worker waits until FINISHEDCOPY to update slot/origin name but it fails somewhere before reaching FINISHEDCOPY and after creating slot/origin, then this new created slot/origin will be left behind. It wouldn't be possible to find and drop them since their names are not stored in the catalog. Eventually, this might also cause hitting the max_replication_slots limit in case of such failures between origin creation and FINISHEDCOPY.

One fix I can think is to update the catalog right after creating a new origin. But this would also require commiting the current transaction to actually persist the originname. I guess this action of commiting the transaction in the middle of initial sync could hurt the copy process.

What do you think? 

Also; working on an updated patch to address your other comments. Thanks again.

Best,
--
Melih Mutlu
Microsoft

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Daniel Gustafsson
Дата:
Сообщение: Re: Missing default value of createrole_self_grant in document
Следующее
От: gkokolatos@pm.me
Дата:
Сообщение: RE: Add LZ4 compression in pg_dump