Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: Melih Mutlu
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date:
Msg-id: CAGPVpCRYapb0FBYB+ZKbU+=u6CQfKBeHNeqmGAcb3y7r=upazQ@mail.gmail.com
In reply to: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Peter Smith <smithpb2250@gmail.com>)
Responses: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Peter Smith <smithpb2250@gmail.com>)
List: pgsql-hackers
Hi,


On Wed, 24 May 2023 at 05:59, Peter Smith <smithpb2250@gmail.com> wrote:
Hi, and thanks for the patch! It is an interesting idea.

I have not yet fully read this thread, so below are only my first
impressions after looking at patch 0001. Sorry if some of these were
already discussed earlier.

TBH the patch "reuse-workers" logic seemed more complicated than I had
imagined it might be.

If by the "reuse-workers" patch you mean patch 0001, most of the complexity comes from the refactoring that splits the apply worker and tablesync worker code paths. [1]
If you mean the whole patch set, then I believe it's because reusing replication slots also requires obtaining a proper snapshot each time the worker moves to a new table. [2]
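To illustrate that second point (all helper names below are invented for the sketch; this is not the actual code in the patch), the per-table loop of a reused tablesync worker conceptually has to look something like this:

    /*
     * Conceptual sketch only -- helper names are invented, not the patch's
     * real functions.  The point is where the extra snapshot work appears
     * once the replication slot is reused instead of being recreated.
     */
    for (;;)
    {
        Oid         relid = pick_next_table_to_sync();    /* hypothetical */

        if (!OidIsValid(relid))
            break;                  /* nothing left to sync, worker can stop */

        if (!slot_created)
        {
            /* first table: creating the slot also gives us a snapshot */
            create_slot_with_snapshot(slot_name);          /* hypothetical */
            slot_created = true;
        }
        else
        {
            /* reused slot: a fresh snapshot must be obtained explicitly */
            obtain_new_snapshot(slot_name);                /* hypothetical */
        }

        copy_table(relid);          /* initial COPY under that snapshot */
        wait_for_catchup(relid);    /* catch up to the apply worker's position */
    }

The else branch is the extra machinery that a freshly launched worker simply never needs.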
 

1.
IIUC with patch 0001, each/every tablesync worker (a.k.a. TSW) when it
finishes dealing with one table then goes looking to find if there is
some relation that it can process next. So now every TSW has a loop
where it will fight with every other available TSW over who will get
to process the next relation.

Somehow this seems all backwards to me. Isn't it strange for the TSW
to be the one deciding what relation it would deal with next?

IMO it seems more natural to simply return the finished TSW to some
kind of "pool" of available workers and the main Apply process can
just grab a TSW from that pool instead of launching a brand new one in
the existing function process_syncing_tables_for_apply(). Or, maybe
those "available" workers can be returned to a pool maintained within
the launcher.c code, which logicalrep_worker_launch() can draw from
instead of launching a whole new process?

(I need to read the earlier posts to see if these options were already
discussed and rejected)

I think ([3]) relying on a single apply worker to assign tables to several tablesync workers might bring some overhead; it's possible that some tablesync workers sit idle until the apply worker assigns them something. OTOH yes, the current approach makes tablesync workers race for the next table to sync.
TBF both ways might be worth discussing/investigating more before deciding which way to go.
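Just to make the "race" concrete (the function and helper names below are invented; this is not the patch's actual code), what each tablesync worker does after finishing a table is roughly:

    /*
     * Rough sketch with invented names.  Whichever worker takes the lock
     * first claims the next not-yet-synced relation; the others pick from
     * the remaining ones or exit if nothing is left.
     */
    static Oid
    claim_next_relation(Oid subid)
    {
        Oid         result = InvalidOid;
        List       *not_ready = get_not_ready_relations(subid);  /* hypothetical */
        ListCell   *lc;

        LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
        foreach(lc, not_ready)
        {
            Oid         relid = lfirst_oid(lc);

            if (!relation_claimed_by_other_worker(relid))          /* hypothetical */
            {
                mark_relation_claimed(relid, MyLogicalRepWorker);  /* hypothetical */
                result = relid;
                break;
            }
        }
        LWLockRelease(LogicalRepWorkerLock);

        return result;
    }

So the contention window is just this short claim step, not the whole sync.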
 
2.
AFAIK the thing that identifies a tablesync worker is the fact that
only TSW will have a 'relid'.

But it feels very awkward to me to have a TSW marked as "available"
and yet that LogicalRepWorker must still have some OLD relid field
value lurking (otherwise it will forget that it is a "tablesync"
worker!).

IMO perhaps it is time now to introduce some enum 'type' to the
LogicalRepWorker. Then an "in_use" type=TSW would have a valid 'relid'
whereas an "available" type=TSW would have relid == InvalidOid.

Hmm, relid is updated immediately when the worker moves to a new table, and the time between finishing the sync of one table and finding the next table to sync should be minimal. I'm not sure how keeping the old relid around for such a short time can do any harm.
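That said, if we decide it's worth it, I guess the suggestion would look roughly like the sketch below (illustrative only, these are not the current LogicalRepWorker definitions; the type and macro names are made up and the usual postgres.h types are assumed):

    /*
     * Illustrative sketch, not the existing definitions.  The idea is that a
     * "tablesync" worker stays identifiable even while relid == InvalidOid.
     */
    typedef enum LogicalRepWorkerType
    {
        WORKERTYPE_APPLY,
        WORKERTYPE_TABLESYNC
    } LogicalRepWorkerType;

    typedef struct LogicalRepWorkerSketch
    {
        LogicalRepWorkerType type;      /* what kind of worker this is */
        bool        in_use;             /* is the slot currently doing work? */
        Oid         relid;              /* InvalidOid while "available" */
        /* ... the rest of the existing fields ... */
    } LogicalRepWorkerSketch;

    /* the worker-type check no longer depends on a (possibly stale) relid */
    #define IsTablesyncWorkerSketch(w)  ((w)->type == WORKERTYPE_TABLESYNC)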
 
3.
Maybe I am mistaken, but it seems the benchmark results posted are
only using quite a small/default values for
"max_sync_workers_per_subscription", so I wondered how those results
are affected by increasing that GUC. I think having only very few
workers would cause more sequential processing, so conveniently the
effect of the patch avoiding re-launch might be seen in the best
possible light. OTOH, using more TSW in the first place might reduce
the overall tablesync time because the subscriber can do more work in
parallel.
 
So I'm not quite sure what the goal is here. E.g. if the user doesn't
care much about how long tablesync phase takes then there is maybe no
need for this patch at all. OTOH, I thought if a user does care about
the subscription startup time, won't those users be opting for a much
larger "max_sync_workers_per_subscription" in the first place?
Therefore shouldn't the benchmarking be using a larger number too?

Regardless of how many tablesync workers there are, reusing workers will speed things up whenever a worker gets the chance to sync more than one table. Increasing the number of tablesync workers of course improves tablesync performance. But as long as the sync is not 100% parallel (meaning that the number of sync workers is not equal to the number of tables to sync), reusing workers can bring an additional improvement.

Here are some benchmarks similar to the earlier ones, but with 100 tables and varying numbers of workers:

+--------+-------------+-------------+-------------+------------+
|        | 2 workers   | 4 workers   | 6 workers   | 8 workers  |
+--------+-------------+-------------+-------------+------------+
| master | 2579.154 ms | 1383.153 ms | 1001.559 ms | 911.758 ms |
+--------+-------------+-------------+-------------+------------+
| patch  | 1724.230 ms | 853.894 ms  | 601.176 ms  | 496.395 ms |
+--------+-------------+-------------+-------------+------------+


So yes, increasing the number of workers makes things faster, but reusing workers still brings a further improvement on top of that.
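To put numbers on it: with 2 workers the patch cuts the time from 2579.154 ms to 1724.230 ms (roughly 1.5x faster, a ~33% reduction), and with 8 workers from 911.758 ms to 496.395 ms (roughly 1.8x faster, a ~46% reduction). So the relative gain from reuse holds up, and in these runs even grows, as more workers are added.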
 


 

Best,
--
Melih Mutlu
Microsoft
