Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From:        Tomas Vondra
Subject:     Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Msg-id:      8948e4a2-fec7-8a61-9ace-2537cee3ef6f@enterprisedb.com
In reply to: Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
             (Amit Kapila <amit.kapila16@gmail.com>)
Responses:   Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
             (Amit Kapila <amit.kapila16@gmail.com>)
List:        pgsql-hackers
On 11/21/23 14:16, Amit Kapila wrote:
> On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I decided to do some stress-testing of the built-in logical replication,
>> as part of the sequence decoding work. And I soon ran into an undetected
>> deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(
>>
>> The attached bash script triggers that in a couple of seconds for me.
>> The script looks complicated, but most of the code is waiting for sync
>> to complete, catchup, and that sort of thing.
>>
>> What the script does is pretty simple:
>>
>> 1) initialize two clusters, set them as publisher/subscriber pair
>>
>> 2) create some number of tables, add them to the publication and wait
>>    for the sync to complete
>>
>> 3) start two pgbench runs in the background, modifying the publication
>>    (one removes+adds all tables in a single transaction, one does that
>>    with a transaction per table)
>>
>> 4) run refresh.sh which does ALTER SUBSCRIPTION ... REFRESH PUBLICATION
>>    in a loop (now that I think about it, could be another pgbench
>>    script, but well ...)
>>
>> 5) some consistency checks, but the lockup happens earlier so this does
>>    not really matter
>>
>> After a small number of refresh cycles (for me it's usually a couple
>> dozen), we end up with a couple of stuck locks (I shortened the backend
>> type string a bit, for formatting reasons):
>>
>> test=# select a.pid, classid, objid, backend_type, query
>>          from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
>>         where not granted;
>>
>>    pid   | classid | objid |   backend_type   |        query
>> ---------+---------+-------+------------------+----------------------
>>  2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s
>>                                                 REFRESH PUBLICATION
>>  2691837 |    6100 | 16785 | tablesync worker |
>>  2691936 |    6100 | 16785 | tablesync worker |
>> (3 rows)
>>
>> All these backends wait for 6100/16785, which is the subscription row in
>> pg_subscription. The tablesync workers are requesting AccessShareLock,
>> the client backend however asks for AccessExclusiveLock.
>>
>> The entry is currently locked by:
>>
>> test=# select a.pid, mode, backend_type from pg_locks l
>>          join pg_stat_activity a on (a.pid = l.pid)
>>         where classid=6100 and objid=16785 and granted;
>>
>>    pid   |      mode       |           backend_type
>> ---------+-----------------+----------------------------------
>>  2690477 | AccessShareLock | logical replication apply worker
>> (1 row)
>>
>> But the apply worker is not waiting for any locks, so what's going on?
>>
>> Well, the problem is the apply worker is waiting for a notification from
>> the tablesync workers that the relation is synced, which happens through
>> updating the pg_subscription_rel row. And that wait happens in
>> wait_for_relation_state_change, which simply checks the row in a loop,
>> with a sleep by WaitLatch().
>>
>> Unfortunately, the tablesync workers can't update the row because the
>> client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
>> sneaked in, and waits for an AccessExclusiveLock. So the tablesync
>> workers are stuck in the queue and can't proceed.
>>
>> The client backend can't proceed, because it's waiting for a lock held
>> by the apply worker.
>>
>
> It seems there is some inconsistency in what you have written for
> client backends/tablesync worker vs. apply worker. The above text
> seems to be saying that the client backend and tablesync worker are
> waiting on a "subscription row in pg_subscription" and the apply
> worker is operating on "pg_subscription_rel". So, if that is true then
> they shouldn't get stuck.
>
> I think here the client backend and tablesync worker seem to be
> blocked on a lock on pg_subscription_rel.
>

Not really, they are all locking the subscription. All the locks are on
classid=6100, which is pg_subscription:

test=# select 6100::regclass;
    regclass
-----------------
 pg_subscription
(1 row)

The thing is, the tablesync workers call UpdateSubscriptionRelState,
which locks the pg_subscription catalog at the very beginning:

    LockSharedObject(SubscriptionRelationId, ...);

So that's the issue. I haven't explored why it's done this way, and
there's no comment explaining why locking the subscription is needed ...
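To make that concrete, the function starts out roughly like this
(paraphrased from src/backend/catalog/pg_subscription.c and abbreviated,
so the exact code may differ between versions):

    void
    UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
                               XLogRecPtr sublsn)
    {
        Relation    rel;
        HeapTuple   tup;

        /* This is the lock the tablesync workers queue up on. */
        LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);

        /* Only then is pg_subscription_rel opened and the row updated. */
        rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
        ...
    }

So every state update has to get an AccessShareLock on the subscription
object first, and with the AccessExclusiveLock request from the REFRESH
sitting in the queue ahead of it, the worker just waits.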
>> The tablesync workers can't proceed because their lock request is stuck
>> behind the AccessExclusiveLock request.
>>
>> And the apply worker can't proceed, because it's waiting for a status
>> update from the tablesync workers.
>>
>
> This part is not clear to me because
> wait_for_relation_state_change()->GetSubscriptionRelState() seems to
> be releasing the lock while closing the relation. Am I missing
> something?
>

I think you're missing the fact that GetSubscriptionRelState() acquires
and releases the lock on pg_subscription_rel, but that's not the lock
causing the issue. The problem is the lock on the pg_subscription row.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
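PS: For anyone trying to follow the wait logic,
wait_for_relation_state_change() is essentially the following polling
loop (paraphrased from src/backend/replication/logical/tablesync.c and
simplified, so treat it as a sketch rather than the exact source):

    static bool
    wait_for_relation_state_change(Oid relid, char expected_state)
    {
        char        state;

        for (;;)
        {
            XLogRecPtr  statelsn;

            CHECK_FOR_INTERRUPTS();

            /* Acquires and releases a lock on pg_subscription_rel. */
            state = GetSubscriptionRelState(MyLogicalRepWorker->subid,
                                            relid, &statelsn);

            if (state == SUBREL_STATE_UNKNOWN)
                break;

            if (state == expected_state)
                return true;

            /* Sleep until the latch is set, or up to 1 second. */
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
            ResetLatch(MyLatch);
        }

        return false;
    }

Note that no lock on pg_subscription_rel is held across the WaitLatch()
sleep; the AccessShareLock on the pg_subscription entry held by the
apply worker is a different lock entirely, and that's the one the other
processes are queued behind.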