Discussion: Some bogus results from prairiedog
According to
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
prairiedog saw a crash in "make check" on the 9.4 branch earlier tonight;
but there's not a lot of evidence as to why in the buildfarm report,
because the postmaster log file is truncated well before where things got
interesting.  Fortunately, I was able to capture a copy of check.log
before it got overwritten by the next run.  I find that the place where
the webserver report stops matches this section of check.log:

[53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx on test_range_gist using gist (ir);
[53cd99bb.134a:159] LOG: statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g;
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG: statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
[53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420');
[53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580');

The ^@'s represent nul bytes, which I find runs of elsewhere in the file
as well.  I think they are an artifact of OS X buffering policy caused by
multiple processes writing into the same file without any interlocks.
Perhaps we ought to consider making buildfarm runs use the logging
collector by default?  But in any case, it seems uncool that either the
buildfarm log-upload process, or the buildfarm web server, is unable to
cope with log files containing nul bytes.

Anyway, to cut to the chase, the crash seems to be from this:

TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)

which the postmaster shortly later says this about:

[53cd99b6.130e:2] LOG: server process (PID 5230) was terminated by signal 6: Abort trap
[53cd99b6.130e:3] DETAIL: Failed process was running: ROLLBACK PREPARED 'foo1';
[53cd99b6.130e:4] LOG: terminating any other active server processes

So there is still something rotten in the fastpath lock logic.

			regards, tom lane
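[As an illustration of the logging-collector suggestion above, here is a minimal sketch of the postgresql.conf settings a buildfarm animal could set so that a single collector process serializes all backend output instead of every backend appending to the same stderr file.  The parameter names are the standard GUCs; the directory and filename pattern are only examples.]

    # hypothetical buildfarm override (values are illustrative)
    logging_collector = on
    log_destination = 'stderr'
    log_directory = 'log'
    log_filename = 'check-%Y-%m-%d_%H%M%S.log'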
On 07/22/2014 12:24 AM, Tom Lane wrote:
> According to
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
> prairiedog saw a crash in "make check" on the 9.4 branch earlier tonight;
> but there's not a lot of evidence as to why in the buildfarm report,
> because the postmaster log file is truncated well before where things got
> interesting.  Fortunately, I was able to capture a copy of check.log
> before it got overwritten by the next run.  I find that the place where
> the webserver report stops matches this section of check.log:
>
> [53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx on test_range_gist using gist (ir);
> [53cd99bb.134a:159] LOG: statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000)g;
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG: statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
> [53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420');
> [53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580');
>
> The ^@'s represent nul bytes, which I find runs of elsewhere in the file
> as well.  I think they are an artifact of OS X buffering policy caused by
> multiple processes writing into the same file without any interlocks.
> Perhaps we ought to consider making buildfarm runs use the logging
> collector by default?  But in any case, it seems uncool that either the
> buildfarm log-upload process, or the buildfarm web server, is unable to
> cope with log files containing nul bytes.

The data is there, just click on the "check" stage link at the top of
the page to see it in raw form.

cheers

andrew
On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> According to
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
> prairiedog saw a crash in "make check" on the 9.4 branch earlier tonight;
> but there's not a lot of evidence as to why in the buildfarm report,
> because the postmaster log file is truncated well before where things got
> interesting.  Fortunately, I was able to capture a copy of check.log
> before it got overwritten by the next run.  I find that the place where
> the webserver report stops matches this section of check.log:
>
> [53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx on test_range_gist using gist (ir);
> [53cd99bb.134a:159] LOG: statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000)g;
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG: statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
> [53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420');
> [53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580');
>
> The ^@'s represent nul bytes, which I find runs of elsewhere in the file
> as well.  I think they are an artifact of OS X buffering policy caused by
> multiple processes writing into the same file without any interlocks.
> Perhaps we ought to consider making buildfarm runs use the logging
> collector by default?  But in any case, it seems uncool that either the
> buildfarm log-upload process, or the buildfarm web server, is unable to
> cope with log files containing nul bytes.
>
> Anyway, to cut to the chase, the crash seems to be from this:
>
> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)
>
> which the postmaster shortly later says this about:
>
> [53cd99b6.130e:2] LOG: server process (PID 5230) was terminated by signal 6: Abort trap
> [53cd99b6.130e:3] DETAIL: Failed process was running: ROLLBACK PREPARED 'foo1';
> [53cd99b6.130e:4] LOG: terminating any other active server processes
>
> So there is still something rotten in the fastpath lock logic.

Gosh, that sucks.

The inconstancy of this problem would seem to suggest some kind of
locking bug rather than a flat-out concurrency issue, but it looks to
me like everything relevant is marked volatile.  But ... maybe the
problem could be in s_lock.h?  That machine is powerpc, and powerpc's
definition of S_UNLOCK looks like this:

#ifdef USE_PPC_LWSYNC
#define S_UNLOCK(lock)	\
do \
{ \
	__asm__ __volatile__ ("	lwsync \n"); \
	*((volatile slock_t *) (lock)) = 0; \
} while (0)
#else
#define S_UNLOCK(lock)	\
do \
{ \
	__asm__ __volatile__ ("	sync \n"); \
	*((volatile slock_t *) (lock)) = 0; \
} while (0)
#endif   /* USE_PPC_LWSYNC */

I believe Andres recently reported that this isn't enough to prevent
compiler reordering; for that, the __asm__ __volatile__ would need to
be augmented with ::: "memory".  If that's correct, then the compiler
could decide to issue the volatile store before the lwsync or sync.
Then, the CPU could decide to perform the volatile store to the
spinlock before performing the update to FastPathStrongRelationLocks.
That would explain this problem.
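[For concreteness, a sketch of what the augmented PPC macro might look like if the only change is adding the clobber described above; this is illustrative, not the committed change.]

#ifdef USE_PPC_LWSYNC
#define S_UNLOCK(lock)	\
do \
{ \
	/* the "memory" clobber keeps the compiler from hoisting the store above the barrier */ \
	__asm__ __volatile__ ("	lwsync \n" ::: "memory"); \
	*((volatile slock_t *) (lock)) = 0; \
} while (0)
#else
#define S_UNLOCK(lock)	\
do \
{ \
	__asm__ __volatile__ ("	sync \n" ::: "memory"); \
	*((volatile slock_t *) (lock)) = 0; \
} while (0)
#endif   /* USE_PPC_LWSYNC */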
The other possible explanation (at least, AFAICS) is that
lock_twophase_postcommit() is trying to release a strong lock count
that it doesn't actually hold.  That would indicate a
strong-lock-count handling bug upstream someplace - e.g. that the
count got released when the transaction was prepared, or somesuch.  I
certainly can't rule that out, although I don't know exactly where to
look.  We could add an assertion to AtPrepare_Locks(), just before
setting locallock->holdsStrongLockCount = FALSE, that
locallock->holdsStrongLockCount is true if and only if the lock tag
and mode suggest that it ought to be.  But that's really just
guessing wildly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
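[A rough sketch of what such a check could look like inside AtPrepare_Locks()'s loop over the local lock table.  Only the one-directional form of the invariant is asserted here, since that much clearly has to hold; the exact placement and condition are assumptions for illustration, not a worked-out patch.]

	/*
	 * Hypothetical sanity check: if this backend bumped the strong-lock
	 * counter for this entry, the tag and mode had better be ones that
	 * conflict with fast-path locking in the first place.
	 */
	Assert(!locallock->holdsStrongLockCount ||
		   ConflictsWithRelationFastPath(&locallock->tag.lock,
										 locallock->tag.mode));

	locallock->holdsStrongLockCount = FALSE;	/* existing code */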
On 07/22/2014 10:55 AM, Andrew Dunstan wrote:
>
> On 07/22/2014 12:24 AM, Tom Lane wrote:
>> According to
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
>> prairiedog saw a crash in "make check" on the 9.4 branch earlier tonight;
>> but there's not a lot of evidence as to why in the buildfarm report,
>> because the postmaster log file is truncated well before where things got
>> interesting.  Fortunately, I was able to capture a copy of check.log
>> before it got overwritten by the next run.  I find that the place where
>> the webserver report stops matches this section of check.log:
>>
>> [53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx on test_range_gist using gist (ir);
>> [53cd99bb.134a:159] LOG: statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g;
>> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG: statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
>> [53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420');
>> [53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580');
>>
>> The ^@'s represent nul bytes, which I find runs of elsewhere in the file
>> as well.  I think they are an artifact of OS X buffering policy caused by
>> multiple processes writing into the same file without any interlocks.
>> Perhaps we ought to consider making buildfarm runs use the logging
>> collector by default?  But in any case, it seems uncool that either the
>> buildfarm log-upload process, or the buildfarm web server, is unable to
>> cope with log files containing nul bytes.
>
> The data is there, just click on the "check" stage link at the top of
> the page to see it in raw form.

I have made a change in the upload receiver app to escape nul bytes in
the main log field too.  This will operate prospectively.  Using the
logging collector would be a larger change, but we can look at it if
this isn't sufficient.

cheers

andrew
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Anyway, to cut to the chase, the crash seems to be from this:
>> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)
>> So there is still something rotten in the fastpath lock logic.

> Gosh, that sucks.

> The inconstancy of this problem would seem to suggest some kind of
> locking bug rather than a flat-out concurrency issue, but it looks to
> me like everything relevant is marked volatile.

I don't think that you need any big assumptions about machine-specific
coding issues to spot the problem.  The assert in question is here:

	/*
	 * Decrement strong lock count.  This logic is needed only for 2PC.
	 */
	if (decrement_strong_lock_count
		&& ConflictsWithRelationFastPath(&lock->tag, lockmode))
	{
		uint32		fasthashcode = FastPathStrongLockHashPartition(hashcode);

		SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
		Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
		FastPathStrongRelationLocks->count[fasthashcode]--;
		SpinLockRelease(&FastPathStrongRelationLocks->mutex);
	}

and it sure looks to me like that
"ConflictsWithRelationFastPath(&lock->tag" is looking at the tag of the
shared-memory lock object you just released.  If someone else had managed
to recycle that locktable entry for some other purpose, the
ConflictsWithRelationFastPath call might incorrectly return true.

I think s/&lock->tag/locktag/ would fix it, but maybe I'm missing
something.

			regards, tom lane
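[Spelled out, the s/&lock->tag/locktag/ change would presumably leave the block looking roughly like this, with the test applied to the caller-supplied tag in LockRefindAndRelease rather than the shared hash-table entry; a sketch, not the committed patch.]

	/*
	 * Decrement strong lock count.  This logic is needed only for 2PC.
	 *
	 * Use the caller's locktag, not the shared LOCK entry's tag, since
	 * the shared entry may already have been recycled by another backend.
	 */
	if (decrement_strong_lock_count
		&& ConflictsWithRelationFastPath(locktag, lockmode))
	{
		uint32		fasthashcode = FastPathStrongLockHashPartition(hashcode);

		SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
		Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
		FastPathStrongRelationLocks->count[fasthashcode]--;
		SpinLockRelease(&FastPathStrongRelationLocks->mutex);
	}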
On Tue, Jul 22, 2014 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Anyway, to cut to the chase, the crash seems to be from this:
>>> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)
>>> So there is still something rotten in the fastpath lock logic.
>
>> Gosh, that sucks.
>
>> The inconstancy of this problem would seem to suggest some kind of
>> locking bug rather than a flat-out concurrency issue, but it looks to
>> me like everything relevant is marked volatile.
>
> I don't think that you need any big assumptions about machine-specific
> coding issues to spot the problem.

I don't think that I'm making what could be described as big
assumptions; I think we should fix and back-patch the PPC64 spinlock
change.  But...

> The assert in question is here:
>
>     /*
>      * Decrement strong lock count.  This logic is needed only for 2PC.
>      */
>     if (decrement_strong_lock_count
>         && ConflictsWithRelationFastPath(&lock->tag, lockmode))
>     {
>         uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);
>
>         SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
>         Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
>         FastPathStrongRelationLocks->count[fasthashcode]--;
>         SpinLockRelease(&FastPathStrongRelationLocks->mutex);
>     }
>
> and it sure looks to me like that
> "ConflictsWithRelationFastPath(&lock->tag" is looking at the tag of the
> shared-memory lock object you just released.  If someone else had managed
> to recycle that locktable entry for some other purpose, the
> ConflictsWithRelationFastPath call might incorrectly return true.
>
> I think s/&lock->tag/locktag/ would fix it, but maybe I'm missing
> something.

...this is probably the real cause of the failures we've actually been
seeing.  I'll go back-patch that change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jul 22, 2014 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> and it sure looks to me like that
>> "ConflictsWithRelationFastPath(&lock->tag" is looking at the tag of the
>> shared-memory lock object you just released.  If someone else had managed
>> to recycle that locktable entry for some other purpose, the
>> ConflictsWithRelationFastPath call might incorrectly return true.
>>
>> I think s/&lock->tag/locktag/ would fix it, but maybe I'm missing
>> something.

> ...this is probably the real cause of the failures we've actually been
> seeing.  I'll go back-patch that change.

For the archives' sake --- I took another look at the two previous
instances we'd seen of FastPathStrongRelationLocks assertion failures:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-03-25%2003%3A15%3A03
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00

Neither of these tripped at the bug site in LockRefindAndRelease, but
that's not so surprising, because there was no Assert there at the time.
An occurrence of the bug would have silently left a negative entry in the
FastPathStrongRelationLocks->count[] array, which would have triggered an
assertion by the next process unlucky enough to use the same array entry.
In the prairiedog case, the assert happened in a test running concurrently
with the prepared_xacts test (which is presumably where the bug occurred),
so that this seems a highly plausible explanation.  In the rover_firefly
case, the assert happened quite a bit later than prepared_xacts; but it
might've been just luck that that particular array entry (out of 1024)
wasn't re-used for awhile.  So it's not certain that this one bug explains
all three cases, but it seems well within the bounds of plausibility that
it does.  Also, the narrow width of the race condition window (from
LWLockRelease(partitionLock) to an inlined test in the next line) is
consistent with the very low failure rate we've seen in the buildfarm.

BTW, I wonder whether it would be interesting for testing purposes to
have a build option that causes SpinLockRelease and LWLockRelease to do
a sched_yield before returning.  One could assume that that would greatly
increase the odds of detecting this type of bug.

			regards, tom lane
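[For the spinlock half of that idea, a minimal sketch of what such a build option might look like in src/include/storage/spin.h.  The YIELD_AFTER_RELEASE symbol is invented purely for illustration, and LWLockRelease would need an analogous, somewhat more invasive change of its own.]

#ifdef YIELD_AFTER_RELEASE			/* hypothetical build option */
#include <sched.h>

/*
 * Yield the CPU immediately after dropping the spinlock, so that code
 * which (incorrectly) touches protected state after the release is far
 * more likely to lose the race and trip an assertion during testing.
 */
#define SpinLockRelease(lock) \
	do { \
		S_UNLOCK(lock); \
		sched_yield(); \
	} while (0)
#else
#define SpinLockRelease(lock)	S_UNLOCK(lock)		/* current definition */
#endif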