Discussion: Are we accepting cancel interrupts too often?


Are we accepting cancel interrupts too often?

From: Tom Lane
Date:
Presently, the RESUME_INTERRUPTS() and END_CRIT_SECTION() macros implicitly
do a CHECK_FOR_INTERRUPTS(); that is, if a cancel request arrived during
the interrupt-free section it will be serviced immediately upon exit
from the section.
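
For reference, the macros under discussion have roughly the following shape (a
simplified sketch based on the description above, not the exact miscadmin.h
code; the variable names follow the real code, but details are elided):

    extern volatile bool InterruptPending;
    extern volatile uint32 InterruptHoldoffCount;
    extern volatile uint32 CritSectionCount;

    extern void ProcessInterrupts(void);

    #define CHECK_FOR_INTERRUPTS() \
        do { \
            if (InterruptPending) \
                ProcessInterrupts(); \
        } while (0)

    /* The point at issue: leaving a holdoff or critical section also
     * services any pending cancel/die request immediately. */
    #define RESUME_INTERRUPTS() \
        do { \
            InterruptHoldoffCount--; \
            if (InterruptHoldoffCount == 0 && CritSectionCount == 0 && \
                InterruptPending) \
                ProcessInterrupts(); \
        } while (0)

    #define END_CRIT_SECTION() \
        do { \
            CritSectionCount--; \
            if (InterruptHoldoffCount == 0 && CritSectionCount == 0 && \
                InterruptPending) \
                ProcessInterrupts(); \
        } while (0)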

It strikes me that this is a really bad idea.  There are lots of places
where we release one lock then acquire another, and are not expecting to
lose control in between.  The original concept of the query-cancel
facility was that we'd accept cancels only at *explicit*
CHECK_FOR_INTERRUPTS points.  What we actually have at the moment is
that cancels could be accepted in a very wide variety of places, and
I don't believe we've considered the consequences at each such place.
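
To make the hazard concrete, a hypothetical call site (the lock names below
are invented for illustration) now behaves like this:

    /* Hypothetical sequence; SomeLock and SomeOtherLock are invented names. */
    SpinRelease(SomeLock);       /* the implicit ProcessInterrupts() can fire
                                  * here and elog(ERROR) its way out ...     */
    SpinAcquire(SomeOtherLock);  /* ... so control may never reach this line,
                                  * though the surrounding code assumes it
                                  * cannot lose control in between.          */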

I am inclined to remove the ProcessInterrupts calls from
RESUME_INTERRUPTS and END_CRIT_SECTION.  Does anyone object?
        regards, tom lane


Re: Are we accepting cancel interrupts too often?

From: Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I started to look at when this nice code was added to determine if this
> was part of the original design or added later and found you wrote it
> yourself, so I guess we don't have to ask anyone to make sure there
> isn't something we are missing.

As far as I can recall my thinking at the time, it went like so:
"We *should* be able to accept a cancel interrupt anywhere we are not
actually in the midst of modifying shared-memory data structures,
because after all the database system is supposed to be robust against
crashes, and those could happen anyplace".

But the fallacy in equating a cancel to a crash is that we have rather
extensive logic for coping with a crash (including reinitializing shared
memory from scratch).  A cancel will only provoke elog cleanup, which is
not nearly as thorough.  For example, it's not obvious that shared
memory structures that are protected by different locks couldn't get out
of sync.


BTW, I spent some time yesterday trying to use this worry to explain my
latest favorite bugaboo, the duplicate-rows complaints we've gotten from
a few people.  It is easy to see that a cancel being accepted at the
right place (exit from the first WriteBuffer in heap_update) could leave
an updated tuple created and its buffer marked dirty, while the old
tuple's buffer is not yet marked dirty and might therefore be discarded
unwritten.  (The WAL entry is correct but will never be consulted unless
there's a crash.)  However, this scenario doesn't seem to explain the
failures because the cancel would lead to transaction abort, so the
updated tuple should never be considered good anyway.  Back to the
drawing board...
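
In outline, the sequence being described is (an annotated sketch of the tail
end of heap_update, not the exact code):

    /* Annotated sketch of the scenario above, not the exact heap_update code */
    WriteBuffer(newbuf);   /* new tuple's buffer is now marked dirty; if a
                            * cancel is accepted on exit from this call ...  */
    WriteBuffer(buffer);   /* ... the old tuple's buffer never gets marked
                            * dirty, and that page could later be discarded
                            * unwritten.                                     */
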
        regards, tom lane


Re: Are we accepting cancel interrupts too often?

From: Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I started to look at when this nice code was added to determine if this
> > was part of the original design or added later and found you wrote it
> > yourself, so I guess we don't have to ask anyone to make sure there
> > isn't something we are missing.
> 
> As far as I can recall my thinking at the time, it went like so:
> "We *should* be able to accept a cancel interrupt anywhere we are not
> actually in the midst of modifying shared-memory data structures,
> because after all the database system is supposed to be robust against
> crashes, and those could happen anyplace".
> 
> But the fallacy in equating a cancel to a crash is that we have rather
> extensive logic for coping with a crash (including reinitializing shared
> memory from scratch).  A cancel will only provoke elog cleanup, which is
> not nearly as thorough.  For example, it's not obvious that shared
> memory structures that are protected by different locks couldn't get out
> of sync.
> 

Yes, I saw the RESUME_INTERRUPTS in SpinLockRelease().  It seems very
aggressive to allow a query cancel there.
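
Assuming the release path ends the way the macro comments suggest, its shape
would be roughly this (a simplified sketch; the real lock bookkeeping is
elided and the argument list may differ):

    /* Simplified sketch of the release path referred to above. */
    void
    SpinLockRelease(SPINLOCK lockid)
    {
        /* ... drop the underlying lock ... */
        RESUME_INTERRUPTS();    /* with the implicit check, a pending query
                                 * cancel can be serviced right here        */
    }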

> 
> BTW, I spent some time yesterday trying to use this worry to explain my
> latest favorite bugaboo, the duplicate-rows complaints we've gotten from
> a few people.  It is easy to see that a cancel being accepted at the
> right place (exit from the first WriteBuffer in heap_update) could leave
> an updated tuple created and its buffer marked dirty, while the old
> tuple's buffer is not yet marked dirty and might therefore be discarded
> unwritten.  (The WAL entry is correct but will never be consulted unless
> there's a crash.)  However, this scenario doesn't seem to explain the
> failures because the cancel would lead to transaction abort, so the
> updated tuple should never be considered good anyway.  Back to the
> drawing board...

I thought we were seeing duplicates in 7.1, which didn't have this code.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Are we accepting cancel interrupts too often?

From: Bruce Momjian
Date:
Tom Lane wrote:
> Presently, the RESUME_INTERRUPTS() and END_CRIT_SECTION() macros implicitly
> do a CHECK_FOR_INTERRUPTS(); that is, if a cancel request arrived during
> the interrupt-free section it will be serviced immediately upon exit
> from the section.
> 
> It strikes me that this is a really bad idea.  There are lots of places
> where we release one lock then acquire another, and are not expecting to
> lose control in between.  The original concept of the query-cancel
> facility was that we'd accept cancels only at *explicit*
> CHECK_FOR_INTERRUPTS points.  What we actually have at the moment is
> that cancels could be accepted in a very wide variety of places, and
> I don't believe we've considered the consequences at each such place.
> 
> I am inclined to remove the ProcessInterrupts calls from
> RESUME_INTERRUPTS and END_CRIT_SECTION.  Does anyone object?

I read the nice description of RESUME_INTERRUPTS at the top of
miscadmin.c and I agree that the idea of allowing a CHECK_FOR_INTERRUPTS
call is not the same as making the call.

I started to look at when this nice code was added to determine if this
was part of the original design or added later and found you wrote it
yourself, so I guess we don't have to ask anyone to make sure there
isn't something we are missing.

Looking at CHECK_FOR_INTERRUPTS calls, those are all in safe places,
while the RESUME_INTERRUPTS are not in obviously safe places, so I agree
with your suggested change.
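
For contrast, a typical explicit check sits at a point where no locks are held
and no shared-memory update is half done, so aborting out via elog(ERROR) is
known to be safe.  A generic, hypothetical loop (not a specific call site from
the tree) looks like this:

    /* Generic sketch of an explicit, known-safe cancel point; the loop and
     * process_one_tuple() are hypothetical, not code from the tree.        */
    for (;;)
    {
        CHECK_FOR_INTERRUPTS();     /* no locks held, nothing half-updated in
                                     * shared memory: safe to elog(ERROR)    */

        if (!process_one_tuple())   /* hypothetical per-tuple work function */
            break;
    }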

---------------------------------------------------------------------------
Revision 1.77, Sun Jan 14 05:08:16 2001 UTC (11 months, 2 weeks ago) by tgl
Changes since 1.76: +73 -36 lines

Restructure backend SIGINT/SIGTERM handling so that 'die' interrupts
are treated more like 'cancel' interrupts: the signal handler sets a
flag that is examined at well-defined spots, rather than trying to cope
with an interrupt that might happen anywhere.  See pghackers discussion
of 1/12/01.



--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Are we accepting cancel interrupts too often?

From: Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I thought we were seeing duplicates in 7.1, which didn't have this code.

No, the query-cancel stuff is the same in 7.1.
        regards, tom lane