Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

Поиск

Список

Период

Сортировка

От	Önder Kalacı
Тема	Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Дата	9 января 2023 г. 21:21:36
Msg-id	CACawEhUjLNq5if+Es6Lz_H3tW2T0m86oDCocaC7RSyQyHpc8Jw@mail.gmail.com обсуждение исходный текст
Ответ на	Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher (Tom Lane <tgl@sss.pgh.pa.us>)
Список	pgsql-hackers

Дерево обсуждения

Hi,

Thank you for the useful comments!

I took a very brief look through this. I'm not too pleased with
this whole line of development TBH. It seems to me that the core
design of execReplication.c and related code is "let's build our
own half-baked executor and much-less-than-half-baked planner,
because XXX". (I'm not too sure what XXX was, really, but apparently
somebody managed to convince people that that is a sane and
maintainable design.)

This provided me with a broad perspective for the whole execReplication.c.

Before your comment, I have not thought about why there is a specific

logic for the execution of logical replication.

I tried to read the initial commit that adds execReplication.c (665d1fad99e7b11678b0d5fa24d2898424243cd6)

and the main relevant mail thread (PostgreSQL: Re: Logical Replication WIP). But, I couldn't find

any references on this decision. Maybe I'm missing something?

Regarding planner, as far as I can speculate, before my patch, there is probably no need for any planner infrastructure.

The reason seems that the logical replication either needs a sequential scan for REPLICA IDENTITY FULL

or an index scan for the primary key / unique index. I'm not suggesting that we shouldn't use planner at all,

just trying to understand the design choices that have been made earlier.

Now this patch has decided that it *will*
use the real planner, or at least portions of it in some cases.
If we're going to do that ISTM we ought to replace all the existing
not-really-a-planner logic, but this has not done that; instead
we have a large net addition to the already very duplicative
replication code, with weird restrictions because it doesn't want
to make changes to the half-baked executor.

That sounds like a one good perspective on the restrictions that this patch adds.

From my perspective, I wanted to fit into the existing execReplication.c, which only

works for primary keys / unique keys. And, if you look closely, the restrictions I suggest

are actually the same/similar restrictions with REPLICA IDENTITY ... USING INDEX.

I hope/assume this is no surprise for you and not too hard to explain to the users.

I think we should either live within the constraints set by this
overarching design, or else nuke execReplication.c from orbit and
start using the real planner and executor. Perhaps the foreign
key enforcement mechanisms could be a model --- although if you
don't want to buy into using SPI as well, you probably should look
at Amit L's work at [1].

This sounds like a good long term plan to me. Are you also suggesting to do that

before this patch?

I think that such a change is a non-trivial / XL project, which could likely not be easily

achievable by myself in a reasonable time frame.

Also ... maybe I am missing something, but is REPLICA IDENTITY FULL
sanely defined in the first place? It looks to me that
RelationFindReplTupleSeq assumes without proof that there is a unique
full-tuple match, but that is only reasonable to assume if there is at
least one unique index (and maybe not even then, if nulls are involved).

In general, RelationFindReplTupleSeq is ok not to match any tuples. So, I'm not sure

if uniqueness is required?

Even if there are multiple matches, RelationFindReplTupleSeq does only one change at

a time. My understanding is that if there are multiple matches on the source, they are

generated as different messages, and each message triggers RelationFindReplTupleSeq.

If there is a unique index, why can't that be chosen as replica identity?
If there isn't, what semantics are we actually providing?

I'm not sure I can fully follow this question. In this patch, I'm trying to allow non-unique

indexes to be used in the subscription. And, the target could have multiple indexes.

So, the semantics is that we automatically allow users to be able to use non-unique

indexes on the subscription side even if the replica identity is full on the source.

The reason (a) we use planner (b) not ask users which index to use, is that

it'd be very inconvenient for any user to pick the indexes among multiple

indexes on the subscription.

If there is a unique index, the expectation is that the user would pick

REPLICA IDENTITY .. USING INDEX or just make it the primary key.

In those cases, this patch would not interfere with the existing logic.

What I'm thinking about is that maybe REPLICA IDENTITY FULL should be
defined as "the subscriber can pick any unique index to match on,
and is allowed to fail if the table has none". Or if "fail" is a bridge
too far for you, we could fall back to the existing seqscan logic.
But thumbing through the existing indexes to find a non-expression unique
index wouldn't require invoking the full planner. Any candidate index
would result in a plan estimated to fetch just one row, so there aren't
likely to be serious cost differences.

Again, maybe I'm missing something in your comments, but this patch deals with

non-unique indexes. That's why we rely on the planner to pick the optimal index

among what we have on the subscription. (In my first iteration of this patch,

I decided to pick the index without planner, but than it seems much nicer to rely

on the planner for obvious reasons of picking the right index)

For example, if you have a unique index on the subscription, the planner already

picks that. But, still, if you could afford to have unique index, you should better

use REPLICA IDENTITY .. USING INDEX or just primary key. I gave this example

for explaining one edge case that many devs could think of.

Lastly, any (auto)-ANALYZE on the target table re-calculates the candidate index on

the subscription. So, hopefully we are not too behind with the statistics for a long time,

and have a good index to use.

Thanks,

Onder KALACI

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Peter Geoghegan
Дата: 09 января 2023 г., 21:16:03
Сообщение: Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

Следующее

От: Nathan Bossart
Дата: 09 января 2023 г., 21:24:08
Сообщение: Re: Allow +group in pg_ident.conf

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

Предыдущее

Следующее