Обсуждение: replication_slots usability issue
-Hackers,
Working on 9.6 today (unsure if fixed in newer versions). Had an issue where the wal was 280G despite max_wal_size being 8G. Found out there were stale replication slots from a recent base backup. I went to drop the replication slots and found that since the wal_level was set to minimal vs replica or higher, I couldn't drop the replication slot. Clearly that makes sense for creating a replication slot but it seems like an artificial limitation for dropping them.
JD
-- Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc *** A fault and talent of mine is to tell it exactly how it is. *** PostgreSQL centered full stack support, consulting and development. Advocate: @amplifypostgres || Learn: https://postgresconf.org ***** Unless otherwise stated, opinions are my own. *****
On October 29, 2018 1:31:56 PM EDT, "Joshua D. Drake" <jd@commandprompt.com> wrote: >-Hackers, > > >Working on 9.6 today (unsure if fixed in newer versions). Had an issue >where the wal was 280G despite max_wal_size being 8G. Found out there >were stale replication slots from a recent base backup. I went to drop >the replication slots and found that since the wal_level was set to >minimal vs replica or higher, I couldn't drop the replication slot. >Clearly that makes sense for creating a replication slot but it seems >like an artificial limitation for dropping them. Uh, huh? How did you manage to start a server with existing slots with that configuration? It should have errored out atstart... Andres Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 2018-Oct-29, Joshua D. Drake wrote: > -Hackers, > > > Working on 9.6 today (unsure if fixed in newer versions). Had an issue where > the wal was 280G despite max_wal_size being 8G. Found out there were stale > replication slots from a recent base backup. I went to drop the replication > slots and found that since the wal_level was set to minimal vs replica or > higher, I couldn't drop the replication slot. Clearly that makes sense for > creating a replication slot but it seems like an artificial limitation for > dropping them. This sounds closely related to https://www.postgresql.org/message-id/20180508143725.mn3ivlyvgpul6ovr%40alvherre.pgsql (commit a1f680d962ff) wherein we made it possible to drop a slot in single-user mode. Seems worth fixing. Send a patch? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-10-29 16:02:18 -0300, Alvaro Herrera wrote: > On 2018-Oct-29, Joshua D. Drake wrote: > > > -Hackers, > > > > > > Working on 9.6 today (unsure if fixed in newer versions). Had an issue where > > the wal was 280G despite max_wal_size being 8G. Found out there were stale > > replication slots from a recent base backup. I went to drop the replication > > slots and found that since the wal_level was set to minimal vs replica or > > higher, I couldn't drop the replication slot. Clearly that makes sense for > > creating a replication slot but it seems like an artificial limitation for > > dropping them. > > This sounds closely related to > https://www.postgresql.org/message-id/20180508143725.mn3ivlyvgpul6ovr%40alvherre.pgsql > (commit a1f680d962ff) wherein we made it possible to drop a slot in > single-user mode. > > Seems worth fixing. Send a patch? I don't think this quite is the problem. ISTM the issue is rather that StartupReplicationSlots() *needs* to check whether wal_level > minimal, and doesn't. So you can create a slot, shutdown, change wal_level, startup. A slot exists but won't work correctly. Greetings, Andres Freund
On 10/29/18 11:26 AM, Andres Freund wrote: > > On October 29, 2018 1:31:56 PM EDT, "Joshua D. Drake" <jd@commandprompt.com> wrote: >> -Hackers, >> >> >> Working on 9.6 today (unsure if fixed in newer versions). Had an issue >> where the wal was 280G despite max_wal_size being 8G. Found out there >> were stale replication slots from a recent base backup. I went to drop >> the replication slots and found that since the wal_level was set to >> minimal vs replica or higher, I couldn't drop the replication slot. >> Clearly that makes sense for creating a replication slot but it seems >> like an artificial limitation for dropping them. > Uh, huh? How did you manage to start a server with existing slots with that configuration? It should have errored out atstart... Well, this is the recovery.conf: standby_mode = 'on' recovery_target = 'immediate' primary_slot_name = 'testing_db01' primary_conninfo = 'user=replication passfile=/var/lib/postgresql/.pgpass host=db01 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any' recovery_target_action = 'promote' The machine came up clean and the only reason I noticed the problem is that it ran out of disk space. I cleared enough disk space to get it to come up again and noticed that there were replication slots that were identical to the primary. JD -- Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc *** A fault and talent of mine is to tell it exactly how it is. *** PostgreSQL centered full stack support, consulting and development. Advocate: @amplifypostgres || Learn: https://postgresconf.org ***** Unless otherwise stated, opinions are my own. *****
On Mon, Oct 29, 2018 at 12:13:04PM -0700, Andres Freund wrote: > I don't think this quite is the problem. ISTM the issue is rather that > StartupReplicationSlots() *needs* to check whether wal_level > minimal, > and doesn't. So you can create a slot, shutdown, change wal_level, > startup. A slot exists but won't work correctly. It seems to me that what we are looking for is just to complain at startup if we find any slot data and if trying to start up with wal_level = minimal. Er... At the same time, shouldn't RestoreSlotFromDisk() *not* use PANIC if more slots are found in pg_replslot than max_replication_slots can handle. A FATAL is fine at startup, PANIC blows up a core file, which is clearly overdoing it if the goal is to give a recommendation at the end. -- Michael
Вложения
On 2018-10-30 11:51:09 +0900, Michael Paquier wrote: > On Mon, Oct 29, 2018 at 12:13:04PM -0700, Andres Freund wrote: > > I don't think this quite is the problem. ISTM the issue is rather that > > StartupReplicationSlots() *needs* to check whether wal_level > minimal, > > and doesn't. So you can create a slot, shutdown, change wal_level, > > startup. A slot exists but won't work correctly. > > It seems to me that what we are looking for is just to complain at > startup if we find any slot data and if trying to start up with > wal_level = minimal. Right, we really should just call CheckSlotRequirements() before doing so. I'll make it so, once I'm actually awake and had some coffee. > Er... At the same time, shouldn't RestoreSlotFromDisk() *not* use PANIC > if more slots are found in pg_replslot than max_replication_slots can > handle. A FATAL is fine at startup, PANIC blows up a core file, which > is clearly overdoing it if the goal is to give a recommendation at the > end. I can't get particularly excited about this. I guess we can change it, but I'd only do so in master. Greetings, Andres Freund
On 10/30/18 10:52 AM, Andres Freund wrote: > On 2018-10-30 11:51:09 +0900, Michael Paquier wrote: >> On Mon, Oct 29, 2018 at 12:13:04PM -0700, Andres Freund wrote: >>> I don't think this quite is the problem. ISTM the issue is rather that >>> StartupReplicationSlots() *needs* to check whether wal_level > minimal, >>> and doesn't. So you can create a slot, shutdown, change wal_level, >>> startup. A slot exists but won't work correctly. >> It seems to me that what we are looking for is just to complain at >> startup if we find any slot data and if trying to start up with >> wal_level = minimal. > Right, we really should just call CheckSlotRequirements() before doing > so. I'll make it so, once I'm actually awake and had some coffee. Why not just disable the slot and report an INFO: line? JD > > >> Er... At the same time, shouldn't RestoreSlotFromDisk() *not* use PANIC >> if more slots are found in pg_replslot than max_replication_slots can >> handle. A FATAL is fine at startup, PANIC blows up a core file, which >> is clearly overdoing it if the goal is to give a recommendation at the >> end. > I can't get particularly excited about this. I guess we can change it, > but I'd only do so in master. > > > Greetings, > > Andres Freund > -- Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc *** A fault and talent of mine is to tell it exactly how it is. *** PostgreSQL centered full stack support, consulting and development. Advocate: @amplifypostgres || Learn: https://postgresconf.org ***** Unless otherwise stated, opinions are my own. *****
On 2018-10-30 11:02:04 -0700, Joshua D. Drake wrote: > On 10/30/18 10:52 AM, Andres Freund wrote: > > On 2018-10-30 11:51:09 +0900, Michael Paquier wrote: > > > On Mon, Oct 29, 2018 at 12:13:04PM -0700, Andres Freund wrote: > > > > I don't think this quite is the problem. ISTM the issue is rather that > > > > StartupReplicationSlots() *needs* to check whether wal_level > minimal, > > > > and doesn't. So you can create a slot, shutdown, change wal_level, > > > > startup. A slot exists but won't work correctly. > > > It seems to me that what we are looking for is just to complain at > > > startup if we find any slot data and if trying to start up with > > > wal_level = minimal. > > Right, we really should just call CheckSlotRequirements() before doing > > so. I'll make it so, once I'm actually awake and had some coffee. > > Why not just disable the slot and report an INFO: line? Because afterwards there'd be a slot with corrupted contents. Especially bad for logical slots. Which might look ok, but crash in weird ways. - Andres
On Tue, Oct 30, 2018 at 10:52:54AM -0700, Andres Freund wrote: > On 2018-10-30 11:51:09 +0900, Michael Paquier wrote: >> Er... At the same time, shouldn't RestoreSlotFromDisk() *not* use PANIC >> if more slots are found in pg_replslot than max_replication_slots can >> handle. A FATAL is fine at startup, PANIC blows up a core file, which >> is clearly overdoing it if the goal is to give a recommendation at the >> end. > > I can't get particularly excited about this. This gets only used by the startup process to restore slot information, so for example if you have an instance with two slots, shut it down, reduce max_replication_slots to 1, and restart then you generate a core dump, which is a bit overdoing it to simply let the user know that he just needs to bump up one parameter and to give him a recommendation to do so :) For example: 2018-10-31 11:04:08 JST [9014]: [11-1] db=,user=,app=,client= DEBUG: starting up replication slots 2018-10-31 11:04:08 JST [9014]: [12-1] db=,user=,app=,client= DEBUG: restoring replication slot from "pg_replslot/popo3/state" 2018-10-31 11:04:08 JST [9014]: [13-1] db=,user=,app=,client= DEBUG: restoring replication slot from "pg_replslot/popo2/state" 2018-10-31 11:04:08 JST [9014]: [14-1] db=,user=,app=,client= PANIC: too many replication slots active before shutdown 2018-10-31 11:04:08 JST [9014]: [15-1] db=,user=,app=,client= HINT: Increase max_replication_slots and try again. > I guess we can change it, but I'd only do so in master. I won't fight hard for a back-patch, now I think that we should back-patch a fix. Generating a core file if a state is unexpected is fine because you expect the system to react so, being able to trigger one based on a parameter switch just to give a recommendation is not in my opinion. For example with the attached the system reacts as follows: 2018-10-31 11:07:23 JST [10063]: [14-1] db=,user=,app=,client= FATAL: too many replication slots active before shutdown 2018-10-31 11:07:23 JST [10063]: [15-1] db=,user=,app=,client= HINT: Increase max_replication_slots and try again. 2018-10-31 11:07:23 JST [10061]: [6-1] db=,user=,app=,client= LOG: startup process (PID 10063) exited with exit code 1 2018-10-31 11:07:23 JST [10061]: [7-1] db=,user=,app=,client= LOG: aborting startup due to startup process failure -- Michael
Вложения
On 2018-10-30 10:52:54 -0700, Andres Freund wrote: > On 2018-10-30 11:51:09 +0900, Michael Paquier wrote: > > On Mon, Oct 29, 2018 at 12:13:04PM -0700, Andres Freund wrote: > > > I don't think this quite is the problem. ISTM the issue is rather that > > > StartupReplicationSlots() *needs* to check whether wal_level > minimal, > > > and doesn't. So you can create a slot, shutdown, change wal_level, > > > startup. A slot exists but won't work correctly. > > > > It seems to me that what we are looking for is just to complain at > > startup if we find any slot data and if trying to start up with > > wal_level = minimal. > > Right, we really should just call CheckSlotRequirements() before doing > so. I'll make it so, once I'm actually awake and had some coffee. And done. Thanks for the report JD. - Andres
HI Andres, On Wed, Oct 31, 2018 at 03:48:02PM -0700, Andres Freund wrote: > And done. Thanks for the report JD. Shouldn't we also switch the PANIC to a FATAL in RestoreSlotFromDisk()? I don't mind doing so myself if you agree with the change, only on HEAD as you seemed to disagree about changing that on back-branches. Also, from 691d79a which you just committed: + ereport(FATAL, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("logical replication slots \"%s\" exists, but wal_level < logical", + NameStr(cp.slotdata.name)), I can see one grammar mistake here, as you refer to only one slot here. The error messages should read: "logical replication slot \"%s\" exists, but wal_level < logical" and: "physical replication slot \"%s\" exists, but wal_level < replica" Thanks, -- Michael
Вложения
On 2018-11-01 09:34:05 +0900, Michael Paquier wrote: > HI Andres, > > On Wed, Oct 31, 2018 at 03:48:02PM -0700, Andres Freund wrote: > > And done. Thanks for the report JD. > > Shouldn't we also switch the PANIC to a FATAL in > RestoreSlotFromDisk()? That has absolutely nothing to do with the issue at hand though, so I don't think it'd have made much sense to do it at the same time. Nor do I think it's particularly important. > I don't mind doing so myself if you agree with the change, only on > HEAD as you seemed to disagree about changing that on back-branches. Cool. And yes, I don't think a cosmetic log level adjustment that could affect people's scripts should be backpatched without need. Even if not particularly likely to break something. > Also, from 691d79a which you just committed: > + ereport(FATAL, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("logical replication slots \"%s\" exists, but wal_level < logical", > + NameStr(cp.slotdata.name)), > I can see one grammar mistake here, as you refer to only one slot here. > The error messages should read: > "logical replication slot \"%s\" exists, but wal_level < logical" > and: > "physical replication slot \"%s\" exists, but wal_level < replica" Darnit. Fixed. Thanks. Greetings, Andres Freund
On Thu, Nov 01, 2018 at 10:54:23AM -0700, Andres Freund wrote: > On 2018-11-01 09:34:05 +0900, Michael Paquier wrote: > That has absolutely nothing to do with the issue at hand though, so I > don't think it'd have made much sense to do it at the same time. Nor do > I think it's particularly important. Thanks for the confirmation. >> I don't mind doing so myself if you agree with the change, only on >> HEAD as you seemed to disagree about changing that on back-branches. > > Cool. And yes, I don't think a cosmetic log level adjustment that could > affect people's scripts should be backpatched without need. Even if not > particularly likely to break something. No issues with your arguments. I did the change this way. >> Also, from 691d79a which you just committed: >> + ereport(FATAL, >> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), >> + errmsg("logical replication slots \"%s\" exists, but wal_level < logical", >> + NameStr(cp.slotdata.name)), >> I can see one grammar mistake here, as you refer to only one slot here. >> The error messages should read: >> "logical replication slot \"%s\" exists, but wal_level < logical" >> and: >> "physical replication slot \"%s\" exists, but wal_level < replica" > > Darnit. Fixed. Thanks. And thanks for fixing that. -- Michael
Вложения
On 01/11/2018 18:54, Andres Freund wrote:> >> Also, from 691d79a which you just committed: >> + ereport(FATAL, >> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), >> + errmsg("logical replication slots \"%s\" exists, but wal_level < logical", >> + NameStr(cp.slotdata.name)), >> I can see one grammar mistake here, as you refer to only one slot here. >> The error messages should read: >> "logical replication slot \"%s\" exists, but wal_level < logical" >> and: >> "physical replication slot \"%s\" exists, but wal_level < replica" > > Darnit. Fixed. Thanks. > Since we are fixing this message, shouldn't the hint for logical slot say "Change wal_level to be logical or higher" rather than "replica or higher" :) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018-11-02 15:51:34 +0100, Petr Jelinek wrote: > On 01/11/2018 18:54, Andres Freund wrote:> > >> Also, from 691d79a which you just committed: > >> + ereport(FATAL, > >> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > >> + errmsg("logical replication slots \"%s\" exists, but wal_level < logical", > >> + NameStr(cp.slotdata.name)), > >> I can see one grammar mistake here, as you refer to only one slot here. > >> The error messages should read: > >> "logical replication slot \"%s\" exists, but wal_level < logical" > >> and: > >> "physical replication slot \"%s\" exists, but wal_level < replica" > > > > Darnit. Fixed. Thanks. > > > > Since we are fixing this message, shouldn't the hint for logical slot > say "Change wal_level to be logical or higher" rather than "replica or > higher" :) Gah. I really wasn't firing on all cylinders here. Darned jetlag (let's just assume that's what it was). Greetings, Andres Freund