Обсуждение: pika buildfarm member failure on isolationCheck tests

Поиск

Список

Период

Сортировка

pika buildfarm member failure on isolationCheck tests

От

Rémi Zara

Дата:

18 июня 2011 г., 15:57:47

Hi,

Pika failed at the isolationCheck phase, hitting an assertion:

TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid,
oldSerXidControl->tailXid))",File: "predicate.c", Line: 991) 

see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&dt=2011-06-17%2015%3A45%3A30

It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm
server.Maybe that should be fixed ? 

Regards,

Rémi Zara

Re: pika buildfarm member failure on isolationCheck tests

От

Andrew Dunstan

Дата:

18 июня 2011 г., 21:21:52


On 06/18/2011 11:57 AM, Rémi Zara wrote:
> Hi,
>
> Pika failed at the isolationCheck phase, hitting an assertion:
>
> TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid,
oldSerXidControl->tailXid))",File: "predicate.c", Line: 991)
 
>
> see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&dt=2011-06-17%2015%3A45%3A30
>
> It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm
server.Maybe that should be fixed ?
 
>
>

Change committed. See 
<https://github.com/PGBuildFarm/client-code/commit/dd49c04>. It will be 
in the next release.

cheers

andrew

Re: pika buildfarm member failure on isolationCheck tests

От

Robert Haas

Дата:

20 июня 2011 г., 01:10:10

On Sat, Jun 18, 2011 at 11:57 AM, Rémi Zara <remi_zara@mac.com> wrote:
> Pika failed at the isolationCheck phase, hitting an assertion:
>
> TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid,
oldSerXidControl->tailXid))",File: "predicate.c", Line: 991) 
>
> see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&dt=2011-06-17%2015%3A45%3A30
>
> It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm
server.Maybe that should be fixed ? 

Is this an open item for 9.1beta3?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: pika buildfarm member failure on isolationCheck tests

От

Dan Ports

Дата:

20 июня 2011 г., 09:13:19

On Sun, Jun 19, 2011 at 09:10:02PM -0400, Robert Haas wrote:
> Is this an open item for 9.1beta3?

Yes. I've put it on the list.

The SxactGlobalXmin and its refcount were getting out of sync with the
actual transaction state. This is timing-dependent but I can reproduce it
fairly reliably under concurrent workloads.

It looks the problem comes from the change a couple days ago that
removed the SXACT_FLAG_ROLLED_BACK flag and changed the
SxactIsRolledBack checks to SxactIsDoomed. That's the correct thing to
do everywhere else, but gets us in trouble here. We shouldn't be
ignoring doomed transactions in SetNewSxactGlobalXmin, because they're
not eligible for cleanup until the associated backend aborts the
transaction and calls ReleasePredicateLocks.

However, it isn't as simple as just removing the SxactIsDoomed check
from SetNewSxactGlobalXmin. ReleasePredicateLocks calls
SetNewSxactGlobalXmin *before* it releases the aborted transaction's
SerializableXact (it pretty much has to, because we must drop and
reacquire SerializableXactHashLock in between). But we don't want the
aborted transaction included in the SxactGlobalXmin computation.

It seems like we do need that SXACT_FLAG_ROLLED_BACK after all, even
though it's only set for this brief interval. We need to be able to
distinguish a transaction that's just been marked for death (doomed)
from one that's already called ReleasePredicateLocks.

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/

Re: pika buildfarm member failure on isolationCheck tests

От

"Kevin Grittner"

Дата:

20 июня 2011 г., 14:41:03

Dan Ports <drkp@csail.mit.edu> wrote:
> It looks the problem comes from the change a couple days ago that
> removed the SXACT_FLAG_ROLLED_BACK flag and changed the
> SxactIsRolledBack checks to SxactIsDoomed. That's the correct
> thing to do everywhere else, but gets us in trouble here. We
> shouldn't be ignoring doomed transactions in
> SetNewSxactGlobalXmin, because they're not eligible for cleanup
> until the associated backend aborts the transaction and calls
> ReleasePredicateLocks.
> 
> However, it isn't as simple as just removing the SxactIsDoomed
> check from SetNewSxactGlobalXmin. ReleasePredicateLocks calls
> SetNewSxactGlobalXmin *before* it releases the aborted
> transaction's SerializableXact (it pretty much has to, because we
> must drop and reacquire SerializableXactHashLock in between). But
> we don't want the aborted transaction included in the
> SxactGlobalXmin computation.
> 
> It seems like we do need that SXACT_FLAG_ROLLED_BACK after all,
> even though it's only set for this brief interval. We need to be
> able to distinguish a transaction that's just been marked for
> death (doomed) from one that's already called
> ReleasePredicateLocks.
I just looked this over and I agree.  It looks to me like we can set
this flag in ReleasePredicateLocks (along with the doomed flag) and
test it only in SetNewSxactGlobalXmin (instead of the doomed flag). 
Fundamentally, the problem is that while it's enough to know that a
transaction *will be* canceled in order to ignore it during
detection of rw-conflicts and dangerous structures, it's not enough
when calculating the new serializable xmin -- we need to know that a
transaction actually *has been* canceled to ignore it there.
Essentially, the fix is to revert a very small part of
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=264a6b127a918800d9f8bac80b5f4a8a8799d0f1
Either of us could easily fix this, but since you have the test
environment which can reproduce the problem, it's probably best that
you do it.
Since Heikki took the trouble to put the flags in a nice logical
order, we should probably put this right after the doomed flag.
-Kevin

Re: pika buildfarm member failure on isolationCheck tests

От

Dan Ports

Дата:

21 июня 2011 г., 02:16:13

While testing the fix for this one, I found another bug. Patches for
both are attached.

The first patch addresses this bug by re-adding SXACT_FLAG_ROLLED_BACK,
in a more limited form than its previous incarnation.

We need to be able to distinguish transactions that have already
called ReleasePredicateLocks and are thus eligible for cleanup from
those that have been merely marked for abort by other
backends. Transactions that are ROLLED_BACK are excluded from
SxactGlobalXmin calculations, but those that are merely DOOMED need to
be included.

Also update a couple of assertions to ensure we only try to clean up
ROLLED_BACK transactions.


The second patch fixes a bug in PreCommit_CheckForSerializationFailure.
This function checks whether there's a dangerous structure of the form
     far ---> near ---> me
where neither the "far" or "near" transactions have committed. If so,
it aborts the "near" transaction by marking it as DOOMED. However, that
transaction might already be PREPARED. We need to check whether that's
the case and, if so, abort the transaction that's trying to commit
instead.

One of the prepared_xacts regression tests actually hits this bug.
I removed the anomaly from the duplicate-gids test so that it fails in
the intended way, and added a new test to check serialization failures
with a prepared transaction.

Dan

--
Dan R. K. Ports              MIT CSAIL                http://drkp.net/

Вложения

Re: pika buildfarm member failure on isolationCheck tests

От

Heikki Linnakangas

Дата:

21 июня 2011 г., 12:02:13

On 21.06.2011 05:18, Dan Ports wrote:
> The first patch addresses this bug by re-adding SXACT_FLAG_ROLLED_BACK,
> in a more limited form than its previous incarnation.
>
> We need to be able to distinguish transactions that have already
> called ReleasePredicateLocks and are thus eligible for cleanup from
> those that have been merely marked for abort by other
> backends. Transactions that are ROLLED_BACK are excluded from
> SxactGlobalXmin calculations, but those that are merely DOOMED need to
> be included.
>
> Also update a couple of assertions to ensure we only try to clean up
> ROLLED_BACK transactions.

Thanks, committed.

> The second patch fixes a bug in PreCommit_CheckForSerializationFailure.
> This function checks whether there's a dangerous structure of the form
>       far --->  near --->  me
> where neither the "far" or "near" transactions have committed. If so,
> it aborts the "near" transaction by marking it as DOOMED. However, that
> transaction might already be PREPARED. We need to check whether that's
> the case and, if so, abort the transaction that's trying to commit
> instead.

Yep, committed. All the other places where we set the DOOMED flag seem 
to handle that already.

In the long term, I'm not sure this is the best way to handle this. It 
feels a bit silly to set the flag, release the SerializableXactHashLock, 
and reacquire it later to remove the transaction from the hash table. 
Surely that could be done in some more straightforward way. But I don't 
feel like fiddling with it this late in the release cycle.

> One of the prepared_xacts regression tests actually hits this bug.
> I removed the anomaly from the duplicate-gids test so that it fails in
> the intended way, and added a new test to check serialization failures
> with a prepared transaction.

Hmm, I have ran "make installcheck" with 
default_transaction_isolation='serializable' earlier. I wonder why I 
didn't see that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: pika buildfarm member failure on isolationCheck tests

От

Dan Ports

Дата:

22 июня 2011 г., 05:28:48

On Tue, Jun 21, 2011 at 03:01:48PM +0300, Heikki Linnakangas wrote:
> Thanks, committed.

Thanks.

> In the long term, I'm not sure this is the best way to handle this. It 
> feels a bit silly to set the flag, release the SerializableXactHashLock, 
> and reacquire it later to remove the transaction from the hash table. 
> Surely that could be done in some more straightforward way. But I don't 
> feel like fiddling with it this late in the release cycle.

Yes, I suspect it can be done better. The reason it's tricky is a lock
ordering issue; part of releasing a SerializableXact has to be done
while holding SerializableXactHashLock and part has to be done without
it (because it involves taking partition locks). Reworking the order of
these things might work, but would require some careful thought since
most of the code is shared with the non-abort cleanup paths. And yes,
it's definitely the time for that.

I've been meaning to take another look at this part of the code anyway,
with an eye towards performance. SerializableXactHashLock can be a
bottleneck on certain in-memory workloads.

> > One of the prepared_xacts regression tests actually hits this bug.
> > I removed the anomaly from the duplicate-gids test so that it fails in
> > the intended way, and added a new test to check serialization failures
> > with a prepared transaction.
> 
> Hmm, I have ran "make installcheck" with 
> default_transaction_isolation='serializable' earlier. I wonder why I 
> didn't see that.

The affected test was being run at SERIALIZABLE already, so that
wouldn't have made a difference. One reason I didn't notice an issue
when I looked at that test before before, is that it was intended to
fail anyway, just with a different error. I didn't think too hard about
which error would take precedence.

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/

Re: pika buildfarm member failure on isolationCheck tests

От

Dan Ports

Дата:

22 июня 2011 г., 05:42:34

On Wed, Jun 22, 2011 at 01:31:11AM -0400, Dan Ports wrote:
> Yes, I suspect it can be done better. The reason it's tricky is a lock
> ordering issue; part of releasing a SerializableXact has to be done
> while holding SerializableXactHashLock and part has to be done without
> it (because it involves taking partition locks). Reworking the order of
> these things might work, but would require some careful thought since
> most of the code is shared with the non-abort cleanup paths. And yes,
> it's definitely the time for that.

...by which I mean it's definitely *not* the time for that, of course.

Dan

-- 
Dan R. K. Ports              MIT CSAIL                http://drkp.net/

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: pika buildfarm member failure on isolationCheck tests

Вложения