Re: FOR KEY LOCK foreign keys
From        | Noah Misch
Subject     | Re: FOR KEY LOCK foreign keys
Date        |
Msg-id      | 20110716171121.GB2047@tornado.leadboat.com
In reply to | Re: FOR KEY LOCK foreign keys (Alvaro Herrera <alvherre@commandprompt.com>)
Responses   | Re: FOR KEY LOCK foreign keys
List        | pgsql-hackers
On Fri, Jul 15, 2011 at 07:01:26PM -0400, Alvaro Herrera wrote:
> Excerpts from Noah Misch's message of mié jul 13 01:34:10 -0400 2011:
>
> > coypu failed during the run of the test due to a different session being chosen
> > as the deadlock victim. We can now vary deadlock_timeout to prevent this; see
> > attached fklocks-tests-deadlock_timeout.patch. This also makes the tests much
> > faster on a default postgresql.conf.
>
> I applied your patch, thanks. I couldn't reproduce the failures without
> it, even running only the three new tests in a loop a few dozen times.

It's probably more likely to crop up on a loaded system. I did not actually
reproduce it myself. However, if you swap the timeouts, the opposite session
finds the deadlock. From there, I'm convinced that the right timing
perturbations could yield the symptom coypu exhibited.

> > crake failed when it reported waiting on the first step of an existing isolation
> > test ("two-ids.spec"). I will need to look into that further.
>
> Actually, there are four failures in tests other than the two fixed by
> your patch. These are:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2011-07-12%2022:32:02
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2011-07-14%2016:27:00
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pitta&dt=2011-07-15%2015:00:08
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2011-07-15%2018:32:02

Thanks for summarizing. These all boil down to lock waits not anticipated by
the test specs. Having pondered this, I've been able to come up with just one
explanation. If autovacuum runs VACUUM during the test and finds that it can
truncate dead space from the end of a relation, it will acquire an
AccessExclusiveLock. When I decrease autovacuum_naptime to 1s, I do see plenty
of pg_type and pg_attribute truncations during a test run. When I sought to
reproduce this, what I first saw instead was an indefinite test suite hang.
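For anyone who wants to watch the truncation locks happen, a query along these
lines (a sketch only; column names per pg_locks in 9.1) will show the
autovacuum worker's AccessExclusiveLock while it truncates the relation tail:

```
-- Sketch: poll for AccessExclusiveLocks during a test run with
-- autovacuum_naptime = 1s.  Hits on pg_type/pg_attribute here are the
-- autovacuum truncation phase described above.
SELECT l.pid, l.mode, l.granted, c.relname
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
WHERE l.mode = 'AccessExclusiveLock';
```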
That turned out to arise from an unrelated thinko -- I assumed that backend IDs
were stable for the life of the backend, but they're only stable for the life
of a pgstat snapshot. This fell down when a backend older than one of the test
backends exited during the test:

4199 2011-07-16 03:33:28.733 EDT DEBUG:  forked new backend, pid=23984 socket=8
23984 2011-07-16 03:33:28.737 EDT LOG:  statement: SET client_min_messages = warning;
23984 2011-07-16 03:33:28.739 EDT LOG:  statement: SELECT i FROM pg_stat_get_backend_idset() t(i) WHERE pg_stat_get_backend_pid(i) = pg_backend_pid()
23985 2011-07-16 03:33:28.740 EDT DEBUG:  autovacuum: processing database "postgres"
4199 2011-07-16 03:33:28.754 EDT DEBUG:  forked new backend, pid=23986 socket=8
23986 2011-07-16 03:33:28.754 EDT LOG:  statement: SET client_min_messages = warning;
4199 2011-07-16 03:33:28.755 EDT DEBUG:  server process (PID 23985) exited with exit code 0
23986 2011-07-16 03:33:28.755 EDT LOG:  statement: SELECT i FROM pg_stat_get_backend_idset() t(i) WHERE pg_stat_get_backend_pid(i) = pg_backend_pid()
4199 2011-07-16 03:33:28.766 EDT DEBUG:  forked new backend, pid=23987 socket=8
23987 2011-07-16 03:33:28.766 EDT LOG:  statement: SET client_min_messages = warning;
23987 2011-07-16 03:33:28.767 EDT LOG:  statement: SELECT i FROM pg_stat_get_backend_idset() t(i) WHERE pg_stat_get_backend_pid(i) = pg_backend_pid()

This led isolationtester to initialize backend_ids = {1,2,2}, making us unable
to detect lock waits correctly. That's also consistent with the symptoms Rémi
Zara just reported.
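The stable alternative is to key on PIDs rather than pgstat backend IDs. A
sketch of the idea (not the actual patch -- the recorded PID below is a
hypothetical value):

```
-- Record each test connection's PID once at connection time...
SELECT pg_backend_pid();

-- ...then check for lock waits by PID.  In 9.1, pg_stat_activity exposes
-- the PID as "procpid" and a boolean "waiting" that is driven by the
-- pgstat_report_waiting() calls discussed below.
SELECT waiting
FROM pg_stat_activity
WHERE procpid = 23984;  -- hypothetical recorded PID
```

Unlike a backend ID, a PID never changes for the life of the backend, so a
concurrent backend exiting mid-test cannot shift the mapping.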
With that fixed, I was able to reproduce the failure due to
autovacuum-truncate-induced transient waiting using this recipe:

- autovacuum_naptime = 1s
- src/test/isolation/Makefile changed to pass --use-existing during installcheck
- Run 'make installcheck' in a loop
- A concurrent session running this in a loop:
    CREATE TABLE churn (a int, b int, c int, d int, e int, f int, g int, h int);
    DROP TABLE churn;

That yields a steady stream of vacuum truncations, and an associated lock wait
generally capsized the suite within 5-10 runs. Frankly, I have some difficulty
believing that this mechanic alone produced all four failures you cite above;
I suspect I'm still missing some more-frequent cause. Any other theories on
which system background activities can cause a transient lock wait? It would
have to produce a "pgstat_report_waiting(true)" call, so I believe that
excludes all LWLock and lighter contention.

In any event, I have attached a patch that fixes the problems I have described
here. To ignore autovacuum, it only recognizes a wait when one of the backends
under test holds a conflicting lock. (It occurs to me that perhaps we should
expose a pg_lock_conflicts(lockmode_held text, lockmode_req text) function to
simplify this query -- this is a fairly common monitoring need.)

With that change in place, my setup survived through about fifty suite runs at
a time. The streak would end when session 2 would unexpectedly detect a
deadlock that session 1 should have detected. The session 1 deadlock_timeout I
chose, 20ms, proved too aggressive. When session 2 is to issue the command that
completes the deadlock, it must do so before session 1 runs the deadlock
detector. Since we burn 10ms just noticing that the previous statement has
blocked, that left only 10ms to issue the next statement. This patch bumps the
figure from 20ms to 100ms; hopefully that will be enough for even a
decently-loaded virtual host.
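The pg_lock_conflicts() function mentioned above does not exist today; this is
just a sketch of how one could prototype it in SQL, embedding the table-level
lock conflict matrix from the documentation (untested, and using $1/$2 rather
than parameter names for portability to older SQL-function rules):

```
-- Hypothetical helper: does a held table-lock mode conflict with a
-- requested one?  The VALUES list transcribes the standard conflict
-- matrix; note ShareLock does not conflict with itself.
CREATE FUNCTION pg_lock_conflicts(lockmode_held text, lockmode_req text)
RETURNS boolean LANGUAGE sql IMMUTABLE AS $$
  SELECT $2 = ANY (matrix.conflicts) FROM (VALUES
    ('AccessShareLock',          ARRAY['AccessExclusiveLock']),
    ('RowShareLock',             ARRAY['ExclusiveLock','AccessExclusiveLock']),
    ('RowExclusiveLock',         ARRAY['ShareLock','ShareRowExclusiveLock',
                                       'ExclusiveLock','AccessExclusiveLock']),
    ('ShareUpdateExclusiveLock', ARRAY['ShareUpdateExclusiveLock','ShareLock',
                                       'ShareRowExclusiveLock','ExclusiveLock',
                                       'AccessExclusiveLock']),
    ('ShareLock',                ARRAY['RowExclusiveLock','ShareUpdateExclusiveLock',
                                       'ShareRowExclusiveLock','ExclusiveLock',
                                       'AccessExclusiveLock']),
    ('ShareRowExclusiveLock',    ARRAY['RowExclusiveLock','ShareUpdateExclusiveLock',
                                       'ShareLock','ShareRowExclusiveLock',
                                       'ExclusiveLock','AccessExclusiveLock']),
    ('ExclusiveLock',            ARRAY['RowShareLock','RowExclusiveLock',
                                       'ShareUpdateExclusiveLock','ShareLock',
                                       'ShareRowExclusiveLock','ExclusiveLock',
                                       'AccessExclusiveLock']),
    ('AccessExclusiveLock',      ARRAY['AccessShareLock','RowShareLock',
                                       'RowExclusiveLock','ShareUpdateExclusiveLock',
                                       'ShareLock','ShareRowExclusiveLock',
                                       'ExclusiveLock','AccessExclusiveLock'])
  ) AS matrix(held, conflicts)
  WHERE matrix.held = $1;
$$;
```

A real implementation would presumably live in C next to the server's own
LockConflicts[] table rather than duplicating it in SQL.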
We should keep it as low as is reasonable, because it contributes directly to
the isolation suite runtime: each addition to deadlock_timeout slows the suite
by 12x that amount. With this patch in its final form, I have completed 180+
suite runs without a failure. In the absence of better theories on the cause
of the buildfarm failures, we should give the buildfarm a whirl with this
patch.

I apologize for the quantity of errata this change is entailing.

Thanks,
nm

--
Noah Misch                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments