Обсуждение: canceling autovacuum task woes
I am running into a lot of customer situations where the customer reports that "canceling autovacuum task" shows up in the logs, and it's unclear whether this is happening often enough to matter, and even more unclear what's causing it. Me: So, do you know what table it's getting cancelled on? Customer: Nope. Me: Are you running any DDL commands anywhere in the cluster? Customer: Nope, absolutely none. Me: Well you've got to be running something somewhere or it wouldn't be having a lock conflict. Customer: OK, well I don't know of any. What should I do? It would be awfully nice if the process that does the cancelling would provide the same kind of reporting that we do for a deadlock: the relevant lock tag, the PID of the process sending the cancel, and the query string. Personally, I'm starting to have a sneaky suspicion that there is something actually broken here - that is, that there are lock conflicts involve here other than the obvious one (SHARE UPDATE EXCLUSIVE on the table) that are allowing autovac to get cancelled more often than we realize. But whether that's true or not, the current logging is wholly inadequate. Thoughts? Anybody else have this problem? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote: > I am running into a lot of customer situations where the customer > reports that "canceling autovacuum task" shows up in the logs, and > it's unclear whether this is happening often enough to matter, and > even more unclear what's causing it. > > Me: So, do you know what table it's getting cancelled on? > Customer: Nope. > Me: Are you running any DDL commands anywhere in the cluster? > Customer: Nope, absolutely none. > Me: Well you've got to be running something somewhere or it wouldn't > be having a lock conflict. > Customer: OK, well I don't know of any. What should I do? > > It would be awfully nice if the process that does the cancelling would > provide the same kind of reporting that we do for a deadlock: the > relevant lock tag, the PID of the process sending the cancel, and the > query string. > > Personally, I'm starting to have a sneaky suspicion that there is > something actually broken here - that is, that there are lock > conflicts involve here other than the obvious one (SHARE UPDATE > EXCLUSIVE on the table) that are allowing autovac to get cancelled > more often than we realize. But whether that's true or not, the > current logging is wholly inadequate. Very, very, very quick guess: The relation extension lock? > Thoughts? Anybody else have this problem? I have seen spuriously high occurances of that message before, but I never really investigated it. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 12-07-24 01:48 PM, Robert Haas wrote: > I am running into a lot of customer situations where the customer > reports that "canceling autovacuum task" shows up in the logs, and > it's unclear whether this is happening often enough to matter, and > even more unclear what's causing it. Could autovacuum be compacting a lot of space at the end of the table. This is described in the thread http://archives.postgresql.org/message-id/4D8DF88E.7080205@Yahoo.com > Me: So, do you know what table it's getting cancelled on? > Customer: Nope. > Me: Are you running any DDL commands anywhere in the cluster? > Customer: Nope, absolutely none. > Me: Well you've got to be running something somewhere or it wouldn't > be having a lock conflict. > Customer: OK, well I don't know of any. What should I do? > > It would be awfully nice if the process that does the cancelling would > provide the same kind of reporting that we do for a deadlock: the > relevant lock tag, the PID of the process sending the cancel, and the > query string. > > Personally, I'm starting to have a sneaky suspicion that there is > something actually broken here - that is, that there are lock > conflicts involve here other than the obvious one (SHARE UPDATE > EXCLUSIVE on the table) that are allowing autovac to get cancelled > more often than we realize. But whether that's true or not, the > current logging is wholly inadequate. > > Thoughts? Anybody else have this problem? >
On Tue, Jul 24, 2012 at 2:11 PM, Steve Singer <ssinger@ca.afilias.info> wrote: > On 12-07-24 01:48 PM, Robert Haas wrote: >> I am running into a lot of customer situations where the customer >> reports that "canceling autovacuum task" shows up in the logs, and >> it's unclear whether this is happening often enough to matter, and >> even more unclear what's causing it. > > Could autovacuum be compacting a lot of space at the end of the table. This > is described > in the thread > http://archives.postgresql.org/message-id/4D8DF88E.7080205@Yahoo.com You (and Andres) may well be right, but I think the way we find out is to add some better logging. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 07/24/2012 01:48 PM, Robert Haas wrote: > I am running into a lot of customer situations where the customer > reports that "canceling autovacuum task" shows up in the logs, and > it's unclear whether this is happening often enough to matter, and > even more unclear what's causing it. > > Me: So, do you know what table it's getting cancelled on? > Customer: Nope. > Me: Are you running any DDL commands anywhere in the cluster? > Customer: Nope, absolutely none. > Me: Well you've got to be running something somewhere or it wouldn't > be having a lock conflict. > Customer: OK, well I don't know of any. What should I do? > > It would be awfully nice if the process that does the cancelling would > provide the same kind of reporting that we do for a deadlock: the > relevant lock tag, the PID of the process sending the cancel, and the > query string. > +1 for more information on this. cheers andrew
> On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote: > > I am running into a lot of customer situations where the customer > > reports that "canceling autovacuum task" shows up in the logs, and > > it's unclear whether this is happening often enough to matter, and > > even more unclear what's causing it. > > > > Me: So, do you know what table it's getting cancelled on? > > Customer: Nope. > > Me: Are you running any DDL commands anywhere in the cluster? > > Customer: Nope, absolutely none. > > Me: Well you've got to be running something somewhere or it wouldn't > > be having a lock conflict. > > Customer: OK, well I don't know of any. What should I do? > > > > It would be awfully nice if the process that does the cancelling would > > provide the same kind of reporting that we do for a deadlock: the > > relevant lock tag, the PID of the process sending the cancel, and the > > query string. Hm, autovacuum is supposed to set an errcontext callback that would tell you the table name it's working on at the time of the cancel. So if even that is missing, something strange is going on. No objections to the general idea of adding more info about the process blocked on autovacuum. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Alvaro Herrera's message of mar jul 24 15:30:49 -0400 2012: > > > On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote: > > > I am running into a lot of customer situations where the customer > > > reports that "canceling autovacuum task" shows up in the logs, and > > > it's unclear whether this is happening often enough to matter, and > > > even more unclear what's causing it. > > > > > > Me: So, do you know what table it's getting cancelled on? > > > Customer: Nope. > Hm, autovacuum is supposed to set an errcontext callback that would tell > you the table name it's working on at the time of the cancel. So if > even that is missing, something strange is going on. Yep, it says: ERROR: canceling autovacuum task CONTEXT: automatic vacuum of table "alvherre.public.foo" So at least that part seems pilot error more than anything else. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Yep, it says: > > ERROR: canceling autovacuum task > CONTEXT: automatic vacuum of table "alvherre.public.foo" > > So at least that part seems pilot error more than anything else. Yeah, you're right. So you do get the table name. But you don't get the cause, which is what you really need to understand why it's happening. Attached is a patch that adds some more detail. Here's an example of what the output looks like: LOG: sending cancel to blocking autovacuum PID 21595 DETAIL: Process 21618 waits for AccessExclusiveLock on relation 27863 of database 16384 STATEMENT: drop table if exists pgbench_accounts ERROR: canceling autovacuum task CONTEXT: automatic vacuum of table "rhaas.public.pgbench_accounts" I think that's a lot more useful than just getting those last two lines... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Вложения
Excerpts from Robert Haas's message of mar jul 24 15:52:23 -0400 2012: > On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: > > Yep, it says: > > > > ERROR: canceling autovacuum task > > CONTEXT: automatic vacuum of table "alvherre.public.foo" > > > > So at least that part seems pilot error more than anything else. > > Yeah, you're right. So you do get the table name. But you don't get > the cause, which is what you really need to understand why it's > happening. Attached is a patch that adds some more detail. Here's an > example of what the output looks like: > > LOG: sending cancel to blocking autovacuum PID 21595 > DETAIL: Process 21618 waits for AccessExclusiveLock on relation 27863 > of database 16384 > STATEMENT: drop table if exists pgbench_accounts > ERROR: canceling autovacuum task > CONTEXT: automatic vacuum of table "rhaas.public.pgbench_accounts" Looks great. Are you considering backpatching this? -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Excerpts from Robert Haas's message of mar jul 24 15:52:23 -0400 2012: >> On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera >> <alvherre@commandprompt.com> wrote: >> > Yep, it says: >> > >> > ERROR: canceling autovacuum task >> > CONTEXT: automatic vacuum of table "alvherre.public.foo" >> > >> > So at least that part seems pilot error more than anything else. >> >> Yeah, you're right. So you do get the table name. But you don't get >> the cause, which is what you really need to understand why it's >> happening. Attached is a patch that adds some more detail. Here's an >> example of what the output looks like: >> >> LOG: sending cancel to blocking autovacuum PID 21595 >> DETAIL: Process 21618 waits for AccessExclusiveLock on relation 27863 >> of database 16384 >> STATEMENT: drop table if exists pgbench_accounts >> ERROR: canceling autovacuum task >> CONTEXT: automatic vacuum of table "rhaas.public.pgbench_accounts" > > Looks great. Are you considering backpatching this? Well, that would certainly make MY life easier. I am not sure whether it would be in line with project policy, however. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Yeah, you're right. So you do get the table name. But you don't get > the cause, which is what you really need to understand why it's > happening. Attached is a patch that adds some more detail. Uh, what's the added dependency on pgstat.h for? Looks sane to the eyeball otherwise. regards, tom lane
On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Yeah, you're right. So you do get the table name. But you don't get >> the cause, which is what you really need to understand why it's >> happening. Attached is a patch that adds some more detail. > > Uh, what's the added dependency on pgstat.h for? Looks sane to the > eyeball otherwise. Woops, that was leftovers from some earlier silliness that I (mostly) removed before posting. New version attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Вложения
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: >> Looks great. Are you considering backpatching this? > Well, that would certainly make MY life easier. I am not sure whether > it would be in line with project policy, however. +1 for a backpatch. Otherwise it'll be years before we gain any information about the unexpected cancels that you think exist. However, after looking some more at deadlock.c, I wonder whether (a) this patch gives sufficient detail, and (b) whether there isn't a problem that's obvious by inspection. It appears to me that as the blocking_autovacuum_proc stuff is coded, it will finger an AV proc as needing to be killed even though it may be several graph edges out from the current proc. This means that with respect to (a), the connection from the process doing the kill to the AV proc may be inadequately documented by this patch, and with respect to (b), there might well be cases where we found an AV proc somewhere in the graph traversal but it's not actually guilty of blocking the current process ... especially not after the queue reorderings that we may have done. I think I'd be happier with that code if it restricted its AV targets to procs that *directly* block the current process, which not incidentally would make this amount of log detail sufficient. regards, tom lane
On Jul 24, 2012, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera >> <alvherre@commandprompt.com> wrote: >>> Looks great. Are you considering backpatching this? > >> Well, that would certainly make MY life easier. I am not sure whether >> it would be in line with project policy, however. > > +1 for a backpatch. Otherwise it'll be years before we gain any > information about the unexpected cancels that you think exist OK, great. > However, after looking some more at deadlock.c, I wonder whether > (a) this patch gives sufficient detail, and (b) whether there isn't a > problem that's obvious by inspection. It appears to me that as the > blocking_autovacuum_proc stuff is coded, it will finger an AV proc as > needing to be killed even though it may be several graph edges out from > the current proc. This means that with respect to (a), the connection > from the process doing the kill to the AV proc may be inadequately > documented by this patch, and with respect to (b), there might well be > cases where we found an AV proc somewhere in the graph traversal but > it's not actually guilty of blocking the current process ... especially > not after the queue reorderings that we may have done. I think I'd be > happier with that code if it restricted its AV targets to procs that > *directly* block the current process, which not incidentally would make > this amount of log detail sufficient. Uggh. Well, that certainly sounds like something that could cause spurious cancels - or excessively fast ones, since presumablyif we limit it to things that directly block the current process, you'll always allow the full deadlock_timeoutbefore nuking the autovac worker. So +1 for changing that. Does an edge in this context mean any lock, or just an ungranted one? I assume the latter, which still leaves the questionof where the edges are coming from in the first place. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Jul 24, 2012, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... This means that with respect to (a), the connection >> from the process doing the kill to the AV proc may be inadequately >> documented by this patch, and with respect to (b), there might well be >> cases where we found an AV proc somewhere in the graph traversal but >> it's not actually guilty of blocking the current process ... especially >> not after the queue reorderings that we may have done. I think I'd be >> happier with that code if it restricted its AV targets to procs that >> *directly* block the current process, which not incidentally would make >> this amount of log detail sufficient. > Uggh. Well, that certainly sounds like something that could cause spurious cancels - or excessively fast ones, since presumablyif we limit it to things that directly block the current process, you'll always allow the full deadlock_timeoutbefore nuking the autovac worker. So +1 for changing that. I think something as simple as the attached would do the trick. I've verified that this still allows the expected cancel cases, though of course it's harder to prove that it gives any benefit. > Does an edge in this context mean any lock, or just an ungranted one? I assume the latter, which still leaves the questionof where the edges are coming from in the first place. The deadlock code follows "hard" edges, which mean "A wants a lock that B has already got", as well as "soft" edges, which mean "A wants a lock that B also wants, and B is ahead of A in the queue for it". We don't kill autovacs that are the end of soft edges, which I think is good, but the fact that we follow them at all makes it a little unclear what we might reach recursively. Your point about determinacy of the timeout is probably even a better argument for only allowing the direct blockee to wield the knife. regards, tom lane diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c index 288186a6ceebf40d5a43638c7be5be35c12dd3b3..6620a3d571d769f75e6ae3fbbe4a0ed2cf271720 100644 *** a/src/backend/storage/lmgr/deadlock.c --- b/src/backend/storage/lmgr/deadlock.c *************** FindLockCycleRecurse(PGPROC *checkProc, *** 527,551 **** if ((proclock->holdMask & LOCKBIT_ON(lm)) && (conflictMask & LOCKBIT_ON(lm))) { - /* - * Look for a blocking autovacuum. There can be more than - * one in the deadlock cycle, in which case we just pick a - * random one. We stash the autovacuum worker's PGPROC so - * that the caller can send a cancel signal to it, if - * appropriate. - * - * Note we read vacuumFlags without any locking. This is - * OK only for checking the PROC_IS_AUTOVACUUM flag, - * because that flag is set at process start and never - * reset; there is logic elsewhere to avoid canceling an - * autovacuum that is working for preventing Xid - * wraparound problems (which needs to read a different - * vacuumFlag bit), but we don't do that here to avoid - * grabbing ProcArrayLock. - */ - if (pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) - blocking_autovacuum_proc = proc; - /* This proc hard-blocks checkProc */ if (FindLockCycleRecurse(proc, depth + 1, softEdges, nSoftEdges)) --- 527,532 ---- *************** FindLockCycleRecurse(PGPROC *checkProc, *** 559,565 **** return true; } ! /* If no deadlock, we're done looking at this proclock */ break; } } --- 540,575 ---- return true; } ! ! /* ! * No deadlock here, but see if this proc is an autovacuum ! * that is directly hard-blocking our own proc. If so, ! * report it so that the caller can send a cancel signal ! * to it, if appropriate. If there's more than one such ! * proc (probably not possible given that autovacuums all ! * take similar lock types), it's indeterminate which one ! * will be reported. ! * ! * We don't touch autovacuums that are indirectly blocking ! * us; it's up to the direct blockee to take action. This ! * rule simplifies understanding the behavior and ensures ! * that an autovacuum won't be canceled with less than ! * deadlock_timeout grace period. ! * ! * Note we read vacuumFlags without any locking. This is ! * OK only for checking the PROC_IS_AUTOVACUUM flag, ! * because that flag is set at process start and never ! * reset. There is logic elsewhere to avoid canceling an ! * autovacuum that is working to prevent XID wraparound ! * problems (which needs to read a different vacuumFlag ! * bit), but we don't do that here to avoid grabbing ! * ProcArrayLock. ! */ ! if (checkProc == MyProc && ! pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) ! blocking_autovacuum_proc = proc; ! ! /* We're done looking at this proclock */ break; } }
On Tue, 2012-07-24 at 16:14 -0400, Robert Haas wrote: > On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> Yeah, you're right. So you do get the table name. But you don't get > >> the cause, which is what you really need to understand why it's > >> happening. Attached is a patch that adds some more detail. > > > > Uh, what's the added dependency on pgstat.h for? Looks sane to the > > eyeball otherwise. > > Woops, that was leftovers from some earlier silliness that I (mostly) > removed before posting. > > New version attached. > The detail message should have a period at the end.
On Mon, Aug 13, 2012 at 11:59 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On Tue, 2012-07-24 at 16:14 -0400, Robert Haas wrote: >> On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > Robert Haas <robertmhaas@gmail.com> writes: >> >> Yeah, you're right. So you do get the table name. But you don't get >> >> the cause, which is what you really need to understand why it's >> >> happening. Attached is a patch that adds some more detail. >> > >> > Uh, what's the added dependency on pgstat.h for? Looks sane to the >> > eyeball otherwise. >> >> Woops, that was leftovers from some earlier silliness that I (mostly) >> removed before posting. >> >> New version attached. >> > > The detail message should have a period at the end. Oops. Fixed, sorry it took me so long to get to this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company