Re: BUG #18433: Logical replication timeout

From Shlok Kyal
Subject Re: BUG #18433: Logical replication timeout
Date
Msg-id CANhcyEWtED9_UiTsaM_PYmBikpOh1BYxQFvdoWPEJe064vjLeQ@mail.gmail.com
In response to Re: BUG #18433: Logical replication timeout  (Kostiantyn Tomakh <tomahkvt@gmail.com>)
Responses Re: BUG #18433: Logical replication timeout  (Kostiantyn Tomakh <tomahkvt@gmail.com>)
List pgsql-bugs
Hi,

> I was able to reproduce the problem.
> I did it on docker based platform I hope you will be able to reproduce this problem too.

Thanks for providing the detailed steps to reproduce the issue. I was
able to reproduce it with the steps you provided.
I noticed that the increased table size on the subscriber can occur in
all versions up to and including Postgres 13, and I was able to
reproduce that. This is a timing issue, which may be why you are not
seeing it in Postgres 10.

This issue occurs because the tablesync worker exits with an error (due
to an UPDATE command) and is then restarted, as seen in the logs:
2024-05-01 16:26:15.384 GMT [40] LOG:  logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
2024-05-01 16:26:16.994 GMT [40] ERROR:  logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL
2024-05-01 16:26:20.393 GMT [41] LOG:  logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started

The tablesync worker syncs the initial data from the publisher to the
subscriber using the COPY command. In this case, however, it exits
after the copy phase has completed and then restarts, so it performs
the entire copy operation again. This is why the table size increases
on the subscriber.

This issue is not reproducible in Postgres 14 and later versions. It
was mitigated by commit [1], which introduces a new state,
'FINISHEDCOPY'. If the tablesync worker exits after the copy phase has
completed and then restarts, it does not perform the COPY command again
and instead proceeds directly to synchronizing the WAL position between
the tablesync worker and the apply worker.

code:
+   else if (MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY)
+   {
+       /*
+        * The COPY phase was previously done, but tablesync then crashed
+        * before it was able to finish normally.
+        */
+       StartTransactionCommand();
+
+       /*
+        * The origin tracking name must already exist. It was created first
+        * time this tablesync was launched.
+        */
+       originid = replorigin_by_name(originname, false);
+       replorigin_session_setup(originid);
+       replorigin_session_origin = originid;
+       *origin_startpos = replorigin_session_get_progress(false);
+
+       CommitTransactionCommand();
+
+       goto copy_table_done;
+   }

Backpatching commit [1] to Postgres 13 and Postgres 12 will mitigate this issue.
Thoughts?

[1] https://github.com/postgres/postgres/commit/ce0fdbfe9722867b7fad4d3ede9b6a6bfc51fb4e

Thanks and Regards,
Shlok Kyal


