Discussion: [HACKERS] logical replication - possible remaining problem
I am not sure whether what I found here amounts to a bug; I might be doing something dumb.

During the last few months I did tests by running pgbench over logical replication. Earlier emails have details.

The basic form of that now works well (and the fix has been committed), but as I looked over my testing program I noticed one change I made to it, already many weeks ago:

In the cleanup during startup (pre-flight check, you might say) and also before the end, instead of

  echo "delete from pg_subscription;" | psql -qXp $port2 -- (1)

I changed that (as I say, many weeks ago) to:

  echo "delete from pg_subscription;
  delete from pg_subscription_rel;
  delete from pg_replication_origin; " | psql -qXp $port2 -- (2)

This occurs (2x) inside the bash function clean_pubsub(), in the main test script pgbench_detail2.sh.

This change was an effort to make sure I arrive at a 'clean' start (and end) state which would always be the same.

All my more recent testing (and that of Mark, I have to assume) was thus done with (2).

Now, looking at the script again, I am thinking that it would be reasonable to expect that after issuing

  delete from pg_subscription;

the other 2 tables are /also/ cleaned, automatically, as a consequence. (Is this reasonable? This is really the main question of this email.)

So I removed the latter two delete statements again, and ran the tests again with the form in (1).

I have established that (after a number of successful cycles) the test stops succeeding, with repetitions of the following in the replica log:

  2017-06-07 22:10:29.057 CEST [2421] LOG: logical replication apply worker for subscription "sub1" has started
  2017-06-07 22:10:29.057 CEST [2421] ERROR: could not find free replication state slot for replication origin with OID 11
  2017-06-07 22:10:29.057 CEST [2421] HINT: Increase max_replication_slots and try again.
  2017-06-07 22:10:29.058 CEST [2061] LOG: worker process: logical replication worker for subscription 29235 (PID 2421) exited with exit code 1

When I manually 'clean up' by doing

  delete from pg_replication_origin;

then, and only then, does the session finish and succeed ('replica ok').

So to me it looks as if there is an omission of pg_replication_origin cleanup when pg_subscription is deleted. Does that make sense?

All this is probably vague and I am only posting in the hope that Petr (or someone else) perhaps immediately understands what goes wrong, even with this limited amount of info.

In the meantime I will try to dig up more detailed info...

thanks,

Erik Rijkers
Erik Rijkers wrote:

> Now, looking at the script again I am thinking that it would be reasonable
> to expect that after issuing
> delete from pg_subscription;
>
> the other 2 tables are /also/ cleaned, automatically, as a consequence. (Is
> this reasonable? this is really the main question of this email).

I don't think it's reasonable to expect that the system recovers automatically from what amounts to catalog corruption. You should be using the DDL that removes subscriptions instead.

-- 
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-06-07 23:18, Alvaro Herrera wrote:
> Erik Rijkers wrote:
>
>> Now, looking at the script again I am thinking that it would be
>> reasonable to expect that after issuing
>> delete from pg_subscription;
>>
>> the other 2 tables are /also/ cleaned, automatically, as a
>> consequence. (Is this reasonable? this is really the main question
>> of this email).
>
> I don't think it's reasonable to expect that the system recovers
> automatically from what amounts to catalog corruption. You should be
> using the DDL that removes subscriptions instead.

You're right, that makes sense. Thanks.
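
The DDL that removes subscriptions is DROP SUBSCRIPTION. As a minimal sketch of how the cleanup step in clean_pubsub() might use it instead of the catalog deletes, the helper below prints one DROP SUBSCRIPTION IF EXISTS statement per subscription name. The function name drop_subscriptions_sql is hypothetical and is not part of the original test script; the subscription name sub1 and the variable $port2 are taken from the thread. By default DROP SUBSCRIPTION also tries to drop the associated replication slot on the publisher, so the publisher generally has to be reachable when it runs.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original pgbench_detail2.sh):
# print a DROP SUBSCRIPTION statement for each subscription name given.
# Names are emitted as double-quoted SQL identifiers, so they are matched
# case-sensitively; embedded double quotes are doubled to keep the
# identifier well-formed.
drop_subscriptions_sql() {
  for sub in "$@"; do
    escaped=$(printf '%s' "$sub" | sed 's/"/""/g')
    printf 'DROP SUBSCRIPTION IF EXISTS "%s";\n' "$escaped"
  done
}

# Example use, assuming $port2 is the subscriber's port as in the thread:
#   drop_subscriptions_sql sub1 | psql -qXp "$port2"
```

Generating the statements separately from running them keeps the helper easy to check without a running server; psql reads the statements from stdin one at a time, so each DROP runs as its own command.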
Hi,

On 07/06/17 22:49, Erik Rijkers wrote:
> I am not sure whether what I found here amounts to a bug, I might be
> doing something dumb.
>
> During the last few months I did tests by running pgbench over logical
> replication. Earlier emails have details.
>
> The basic form of that now works well (and the fix has been committed)
> but as I looked over my testing program I noticed one change I made to
> it, already many weeks ago:
>
> In the cleanup during startup (pre-flight check you might say) and also
> before the end, instead of
>
> echo "delete from pg_subscription;" | psql -qXp $port2 -- (1)
>
> I changed that (as I say, many weeks ago) to:
>
> echo "delete from pg_subscription;
> delete from pg_subscription_rel;
> delete from pg_replication_origin; " | psql -qXp $port2 -- (2)
>
> This occurs (2x) inside the bash function clean_pubsub(), in main test
> script pgbench_detail2.sh
>
> This change was an effort to ensure to arrive at a 'clean' start (and
> end-) state which would always be the same.
>
> All my more recent testing (and that of Mark, I have to assume) was thus
> done with (2).
>
> Now, looking at the script again I am thinking that it would be
> reasonable to expect that after issuing
> delete from pg_subscription;
>
> the other 2 tables are /also/ cleaned, automatically, as a consequence.
> (Is this reasonable? this is really the main question of this email).
>

Hmm, they are not cleaned automatically. Deleting from system catalogs manually like this never propagates to related tables; we don't use FKs there.

> So I removed the latter two delete statements again, and ran the tests
> again with the form in (1)
>
> I have established that (after a number of successful cycles) the test
> stops succeeding with in the replica log repetitions of:
>
> 2017-06-07 22:10:29.057 CEST [2421] LOG: logical replication apply
> worker for subscription "sub1" has started
> 2017-06-07 22:10:29.057 CEST [2421] ERROR: could not find free
> replication state slot for replication origin with OID 11
> 2017-06-07 22:10:29.057 CEST [2421] HINT: Increase
> max_replication_slots and try again.
> 2017-06-07 22:10:29.058 CEST [2061] LOG: worker process: logical
> replication worker for subscription 29235 (PID 2421) exited with exit
> code 1
>
> when I manually 'clean up' by doing:
> delete from pg_replication_origin;
>

Yeah, because you consumed all the origins (I am still not a huge fan of how that limit works, but that's a separate discussion).

> then, and only then, does the session finish and succeed ('replica ok').
>
> So to me it looks as if there is an omission of
> pg_replication_origin-cleanup when pg_subscription is deleted.
>

There is no omission; the origin is not supposed to be deleted automatically unless you use DROP SUBSCRIPTION.

-- 
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
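
Origins left behind by manual catalog deletes can be listed before deciding how to clean them up. The sketch below is only an illustration under stated assumptions: the function name orphan_origins_sql is hypothetical, and the query assumes that a subscription's origin is named pg_<subscription OID>, which is the naming PostgreSQL 10 uses for apply workers and should be treated as an assumption on other versions. It prints a query that lists origins with no matching row in pg_subscription.

```shell
#!/bin/sh
# Hypothetical helper: print a query that lists replication origins which
# look like subscription origins (assumed name pattern: pg_<subscription OID>)
# but have no corresponding row left in pg_subscription.
orphan_origins_sql() {
  cat <<'SQL'
SELECT o.roident, o.roname
  FROM pg_replication_origin o
 WHERE o.roname ~ '^pg_[0-9]+$'
   AND NOT EXISTS (SELECT 1
                     FROM pg_subscription s
                    WHERE o.roname = 'pg_' || s.oid::text);
SQL
}

# Example use, assuming $port2 is the subscriber's port as in the thread:
#   orphan_origins_sql | psql -qXp "$port2"
# Each origin reported can then be removed with the supported function
# rather than by deleting from the catalog:
#   SELECT pg_replication_origin_drop('<roname>');
```

Using pg_replication_origin_drop() keeps the cleanup within the supported interface, in line with the advice in the thread not to edit the catalogs directly.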