Обсуждение: pika buildfarm member failure on isolationCheck tests
Hi, Pika failed at the isolationCheck phase, hitting an assertion: TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid, oldSerXidControl->tailXid))",File: "predicate.c", Line: 991) see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&dt=2011-06-17%2015%3A45%3A30 It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm server.Maybe that should be fixed ? Regards, Rémi Zara
On 06/18/2011 11:57 AM, Rémi Zara wrote: > Hi, > > Pika failed at the isolationCheck phase, hitting an assertion: > > TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid, oldSerXidControl->tailXid))",File: "predicate.c", Line: 991) > > see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&dt=2011-06-17%2015%3A45%3A30 > > It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm server.Maybe that should be fixed ? > > Change committed. See <https://github.com/PGBuildFarm/client-code/commit/dd49c04>. It will be in the next release. cheers andrew
On Sat, Jun 18, 2011 at 11:57 AM, Rémi Zara <remi_zara@mac.com> wrote: > Pika failed at the isolationCheck phase, hitting an assertion: > > TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid, oldSerXidControl->tailXid))",File: "predicate.c", Line: 991) > > see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&dt=2011-06-17%2015%3A45%3A30 > > It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm server.Maybe that should be fixed ? Is this an open item for 9.1beta3? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jun 19, 2011 at 09:10:02PM -0400, Robert Haas wrote: > Is this an open item for 9.1beta3? Yes. I've put it on the list. The SxactGlobalXmin and its refcount were getting out of sync with the actual transaction state. This is timing-dependent but I can reproduce it fairly reliably under concurrent workloads. It looks the problem comes from the change a couple days ago that removed the SXACT_FLAG_ROLLED_BACK flag and changed the SxactIsRolledBack checks to SxactIsDoomed. That's the correct thing to do everywhere else, but gets us in trouble here. We shouldn't be ignoring doomed transactions in SetNewSxactGlobalXmin, because they're not eligible for cleanup until the associated backend aborts the transaction and calls ReleasePredicateLocks. However, it isn't as simple as just removing the SxactIsDoomed check from SetNewSxactGlobalXmin. ReleasePredicateLocks calls SetNewSxactGlobalXmin *before* it releases the aborted transaction's SerializableXact (it pretty much has to, because we must drop and reacquire SerializableXactHashLock in between). But we don't want the aborted transaction included in the SxactGlobalXmin computation. It seems like we do need that SXACT_FLAG_ROLLED_BACK after all, even though it's only set for this brief interval. We need to be able to distinguish a transaction that's just been marked for death (doomed) from one that's already called ReleasePredicateLocks. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
Dan Ports <drkp@csail.mit.edu> wrote: > It looks the problem comes from the change a couple days ago that > removed the SXACT_FLAG_ROLLED_BACK flag and changed the > SxactIsRolledBack checks to SxactIsDoomed. That's the correct > thing to do everywhere else, but gets us in trouble here. We > shouldn't be ignoring doomed transactions in > SetNewSxactGlobalXmin, because they're not eligible for cleanup > until the associated backend aborts the transaction and calls > ReleasePredicateLocks. > > However, it isn't as simple as just removing the SxactIsDoomed > check from SetNewSxactGlobalXmin. ReleasePredicateLocks calls > SetNewSxactGlobalXmin *before* it releases the aborted > transaction's SerializableXact (it pretty much has to, because we > must drop and reacquire SerializableXactHashLock in between). But > we don't want the aborted transaction included in the > SxactGlobalXmin computation. > > It seems like we do need that SXACT_FLAG_ROLLED_BACK after all, > even though it's only set for this brief interval. We need to be > able to distinguish a transaction that's just been marked for > death (doomed) from one that's already called > ReleasePredicateLocks. I just looked this over and I agree. It looks to me like we can set this flag in ReleasePredicateLocks (along with the doomed flag) and test it only in SetNewSxactGlobalXmin (instead of the doomed flag). Fundamentally, the problem is that while it's enough to know that a transaction *will be* canceled in order to ignore it during detection of rw-conflicts and dangerous structures, it's not enough when calculating the new serializable xmin -- we need to know that a transaction actually *has been* canceled to ignore it there. Essentially, the fix is to revert a very small part of http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=264a6b127a918800d9f8bac80b5f4a8a8799d0f1 Either of us could easily fix this, but since you have the test environment which can reproduce the problem, it's probably best that you do it. Since Heikki took the trouble to put the flags in a nice logical order, we should probably put this right after the doomed flag. -Kevin
While testing the fix for this one, I found another bug. Patches for both are attached. The first patch addresses this bug by re-adding SXACT_FLAG_ROLLED_BACK, in a more limited form than its previous incarnation. We need to be able to distinguish transactions that have already called ReleasePredicateLocks and are thus eligible for cleanup from those that have been merely marked for abort by other backends. Transactions that are ROLLED_BACK are excluded from SxactGlobalXmin calculations, but those that are merely DOOMED need to be included. Also update a couple of assertions to ensure we only try to clean up ROLLED_BACK transactions. The second patch fixes a bug in PreCommit_CheckForSerializationFailure. This function checks whether there's a dangerous structure of the form far ---> near ---> me where neither the "far" or "near" transactions have committed. If so, it aborts the "near" transaction by marking it as DOOMED. However, that transaction might already be PREPARED. We need to check whether that's the case and, if so, abort the transaction that's trying to commit instead. One of the prepared_xacts regression tests actually hits this bug. I removed the anomaly from the duplicate-gids test so that it fails in the intended way, and added a new test to check serialization failures with a prepared transaction. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
Вложения
On 21.06.2011 05:18, Dan Ports wrote: > The first patch addresses this bug by re-adding SXACT_FLAG_ROLLED_BACK, > in a more limited form than its previous incarnation. > > We need to be able to distinguish transactions that have already > called ReleasePredicateLocks and are thus eligible for cleanup from > those that have been merely marked for abort by other > backends. Transactions that are ROLLED_BACK are excluded from > SxactGlobalXmin calculations, but those that are merely DOOMED need to > be included. > > Also update a couple of assertions to ensure we only try to clean up > ROLLED_BACK transactions. Thanks, committed. > The second patch fixes a bug in PreCommit_CheckForSerializationFailure. > This function checks whether there's a dangerous structure of the form > far ---> near ---> me > where neither the "far" or "near" transactions have committed. If so, > it aborts the "near" transaction by marking it as DOOMED. However, that > transaction might already be PREPARED. We need to check whether that's > the case and, if so, abort the transaction that's trying to commit > instead. Yep, committed. All the other places where we set the DOOMED flag seem to handle that already. In the long term, I'm not sure this is the best way to handle this. It feels a bit silly to set the flag, release the SerializableXactHashLock, and reacquire it later to remove the transaction from the hash table. Surely that could be done in some more straightforward way. But I don't feel like fiddling with it this late in the release cycle. > One of the prepared_xacts regression tests actually hits this bug. > I removed the anomaly from the duplicate-gids test so that it fails in > the intended way, and added a new test to check serialization failures > with a prepared transaction. Hmm, I have ran "make installcheck" with default_transaction_isolation='serializable' earlier. I wonder why I didn't see that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Jun 21, 2011 at 03:01:48PM +0300, Heikki Linnakangas wrote: > Thanks, committed. Thanks. > In the long term, I'm not sure this is the best way to handle this. It > feels a bit silly to set the flag, release the SerializableXactHashLock, > and reacquire it later to remove the transaction from the hash table. > Surely that could be done in some more straightforward way. But I don't > feel like fiddling with it this late in the release cycle. Yes, I suspect it can be done better. The reason it's tricky is a lock ordering issue; part of releasing a SerializableXact has to be done while holding SerializableXactHashLock and part has to be done without it (because it involves taking partition locks). Reworking the order of these things might work, but would require some careful thought since most of the code is shared with the non-abort cleanup paths. And yes, it's definitely the time for that. I've been meaning to take another look at this part of the code anyway, with an eye towards performance. SerializableXactHashLock can be a bottleneck on certain in-memory workloads. > > One of the prepared_xacts regression tests actually hits this bug. > > I removed the anomaly from the duplicate-gids test so that it fails in > > the intended way, and added a new test to check serialization failures > > with a prepared transaction. > > Hmm, I have ran "make installcheck" with > default_transaction_isolation='serializable' earlier. I wonder why I > didn't see that. The affected test was being run at SERIALIZABLE already, so that wouldn't have made a difference. One reason I didn't notice an issue when I looked at that test before before, is that it was intended to fail anyway, just with a different error. I didn't think too hard about which error would take precedence. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
On Wed, Jun 22, 2011 at 01:31:11AM -0400, Dan Ports wrote: > Yes, I suspect it can be done better. The reason it's tricky is a lock > ordering issue; part of releasing a SerializableXact has to be done > while holding SerializableXactHashLock and part has to be done without > it (because it involves taking partition locks). Reworking the order of > these things might work, but would require some careful thought since > most of the code is shared with the non-abort cleanup paths. And yes, > it's definitely the time for that. ...by which I mean it's definitely *not* the time for that, of course. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/