Re: long-standing data loss bug in initial sync of logical replication

From: Andres Freund
Subject: Re: long-standing data loss bug in initial sync of logical replication
Date:
Msg-id: 20231118181257.qtebsnkrzj3icg36@awork3.anarazel.de
In reply to: Re: long-standing data loss bug in initial sync of logical replication  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: long-standing data loss bug in initial sync of logical replication  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
Hi,

On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote:
> > I guess it's not really feasible to just increase the lock level here though
> > :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
> > would perhaps lead to new deadlocks and such? But it also seems quite wrong.
> > 
> 
> If this really is about the lock being too weak, then I don't see why
> would it be wrong?

Sorry, that was badly formulated. The wrong bit is the use of
ShareUpdateExclusiveLock.


> If it's required for correctness, it's not really wrong, IMO. Sure, stronger
> locks are not great ...
> 
> I'm not sure about the risk of deadlocks. If you do
> 
>     ALTER PUBLICATION ... ADD TABLE
> 
> it's not holding many other locks. It essentially just gets a lock on
> the pg_publication catalog, and then the publication row. That's it.
> 
> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
> we're making it conflict with RowExclusive. Which is just DML, and I
> think we need to do that.

From what I can tell it needs to be an AccessExclusiveLock, completely
independent of logical decoding. The way the cache stays coherent is that
catalog modifications conflict with anything that builds cache entries. We have
a few cases where we do use lower-level locks, but for those we have explicit
analysis for why that's OK (see e.g. reloptions.c), or we block until nobody
could have an old view of the catalog (various CONCURRENTLY operations).
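For reference, the lock-level argument can be checked against the table-level
lock conflict matrix documented in the manual's explicit-locking chapter. A
minimal sketch (Python used purely as illustration; the matrix itself is from
the docs) showing that ShareUpdateExclusiveLock does not conflict with the
RowExclusiveLock taken by plain DML, while ShareRowExclusive and
AccessExclusive both do:

```python
# PostgreSQL's table-level lock conflict matrix, transcribed from the
# "Explicit Locking" chapter of the manual.
MODES = [
    "AccessShare", "RowShare", "RowExclusive", "ShareUpdateExclusive",
    "Share", "ShareRowExclusive", "Exclusive", "AccessExclusive",
]

# For each mode, the set of modes it conflicts with.
CONFLICTS = {
    "AccessShare":          {"AccessExclusive"},
    "RowShare":             {"Exclusive", "AccessExclusive"},
    "RowExclusive":         {"Share", "ShareRowExclusive", "Exclusive",
                             "AccessExclusive"},
    "ShareUpdateExclusive": {"ShareUpdateExclusive", "Share",
                             "ShareRowExclusive", "Exclusive",
                             "AccessExclusive"},
    "Share":                {"RowExclusive", "ShareUpdateExclusive",
                             "ShareRowExclusive", "Exclusive",
                             "AccessExclusive"},
    "ShareRowExclusive":    {"RowExclusive", "ShareUpdateExclusive", "Share",
                             "ShareRowExclusive", "Exclusive",
                             "AccessExclusive"},
    "Exclusive":            {"RowShare", "RowExclusive",
                             "ShareUpdateExclusive", "Share",
                             "ShareRowExclusive", "Exclusive",
                             "AccessExclusive"},
    "AccessExclusive":      set(MODES),
}

def conflicts(a: str, b: str) -> bool:
    """True if a holder of lock mode `a` blocks a requester of mode `b`."""
    return b in CONFLICTS[a]

# ALTER PUBLICATION ... ADD TABLE currently takes ShareUpdateExclusive on the
# table; plain DML takes RowExclusive.  They do not conflict, so in-progress
# DML can keep running with a stale view of the publication:
print(conflicts("ShareUpdateExclusive", "RowExclusive"))  # False
# Raising the lock to ShareRowExclusive (or AccessExclusive) makes DML wait:
print(conflicts("ShareRowExclusive", "RowExclusive"))     # True
print(conflicts("AccessExclusive", "RowExclusive"))       # True
```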


> So maybe that's fine? For me, a detected deadlock is better than
> silently missing some of the data.

That certainly is true.


> > We could brute force this in the logical decoding infrastructure, by
> > distributing invalidations from catalog modifying transactions to all
> > concurrent in-progress transactions (like already done for historic catalog
> > snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()).  But I think that'd
> > be a fairly significant increase in overhead.
> > 
> 
> I have no idea what the overhead would be - perhaps not too bad,
> considering catalog changes are not too common (I'm sure there are
> extreme cases). And maybe we could even restrict this only to
> "interesting" catalogs, or something like that? (However I hate those
> weird differences in behavior, it can easily lead to bugs.)
>
> But it feels more like a band-aid than actually fixing the issue.

Agreed.
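To make the hazard concrete, here is a toy model (Python; all names
hypothetical, this is not PostgreSQL code) of the stale-cache problem and of
the brute-force fix discussed above, i.e. distributing a catalog-modifying
transaction's invalidations to all concurrent in-progress transactions, along
the lines of SnapBuildDistributeNewCatalogSnapshot():

```python
# Toy model: a decoder lazily caches per-relation publication membership
# (roughly what the output plugin's relation sync cache does).  If a
# concurrent ALTER PUBLICATION ... ADD TABLE does not invalidate that entry
# for already in-progress transactions, their rows are silently dropped.

class Decoder:
    def __init__(self, published):
        self.published = set(published)   # "catalog": tables in the publication
        self.cache = {}                   # per-relation cache, built lazily
        self.in_progress = {}             # xid -> queued invalidations

    def is_published(self, rel):
        # Build (and keep) a cache entry on first use.
        if rel not in self.cache:
            self.cache[rel] = rel in self.published
        return self.cache[rel]

    def begin(self, xid):
        self.in_progress[xid] = []

    def alter_publication_add(self, rel, distribute=False):
        self.published.add(rel)
        if distribute:
            # The brute-force fix: queue the invalidation into every
            # concurrent in-progress transaction as well.
            for invals in self.in_progress.values():
                invals.append(rel)

    def commit(self, xid, changed_rels):
        # Replay invalidations queued for this transaction, then decode,
        # emitting only rows of published relations.
        for rel in self.in_progress.pop(xid):
            self.cache.pop(rel, None)
        return [rel for rel in changed_rels if self.is_published(rel)]

# Without distribution: the stale "not published" entry survives and the
# in-progress transaction's changes are lost.
d = Decoder(published=[])
d.begin(1)
d.is_published("t")                       # builds stale cache entry: False
d.alter_publication_add("t", distribute=False)
print(d.commit(1, ["t"]))                 # [] -- data loss

# With distribution, the stale entry is dropped before decoding:
d = Decoder(published=[])
d.begin(1)
d.is_published("t")
d.alter_publication_add("t", distribute=True)
print(d.commit(1, ["t"]))                 # ['t']
```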

Greetings,

Andres Freund


