Discussion: Updating multiple bool values crashes backend
Sean Kelly (S.Kelly@ncl.ac.uk) reports a bug with a severity of 1
The lower the number the more severe it is.

Short Description
Updating multiple bool values crashes backend

Long Description
Bug best described by an example, see below. Tested on two systems:

  Intel Pentium III 600 128Mb RAM Linux 2.2.17
  AMD K6 350 96Mb RAM Linux 2.2.16

both PostgreSQL 7.0.2

Sample Code
users=> select username,added from users_tbl where username like 'neta%';
 username | added
----------+-------
 neta1    | f
 neta2    | f
 neta3    | f
 neta4    | f
(4 rows)

users=> update users_tbl set added=TRUE where username like 'neta%';
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>\q

bash$ tail ~postgres/server.log
Server process (pid 23747) exited with status 11 at Tue Oct 24 13:52:29 2000
Terminating any active server processes...
Server processes were terminated at Tue Oct 24 13:52:29 2000
Reinitializing shared memory and semaphores
The Data Base System is starting up
DEBUG: Data Base System is starting up at Tue Oct 24 13:52:29 2000
DEBUG: Data Base System was interrupted being in production at Tue Oct 24 13:51:22 2000
DEBUG: Data Base System is in production state at Tue Oct 24 13:52:29 2000

No file was uploaded with this report
> Short Description
> Updating multiple bool values crashes backend

I cannot reproduce this example with 7.0.2 on my Linux-2.2.16 laptop.
We will need more details and a reproducible example to help out...

- Thomas
Here is a debug (level 2) output - does this help any more?... If not,
what should I provide for you in terms of debugging?...

FindExec: found "/usr/local/postgres/bin/postgres" using argv[0]
binding ShmemCreate(key=52e2c1, size=1104896)
DEBUG: Data Base System is starting up at Wed Oct 25 13:15:17 2000
DEBUG: Data Base System was shut down at Wed Oct 25 13:14:49 2000
DEBUG: Data Base System is in production state at Wed Oct 25 13:15:17 2000
proc_exit(0)
shmem_exit(0)
exit(0)
/usr/local/postgres/bin/postmaster: reaping dead processes...
/usr/local/postgres/bin/postmaster: ServerLoop: handling reading 5
/usr/local/postgres/bin/postmaster: ServerLoop: handling reading 5
/usr/local/postgres/bin/postmaster: ServerLoop: handling writing 5
/usr/local/postgres/bin/postmaster: BackendStartup: pid 28428 user sean db users socket 5
/usr/local/postgres/bin/postmaster child[28428]: starting with (/usr/local/postgres/bin/postgres -d2 -v131072 -p users )
FindExec: found "/usr/local/postgres/bin/postgres" using argv[0]
started: host=localhost user=sean database=users
InitPostgres
StartTransactionCommand
query: SELECT usesuper FROM pg_user WHERE usename = 'sean'
ProcessQuery
CommitTransactionCommand
StartTransactionCommand
query: delete from dialup_tbl where username='sib';
ProcessQuery
CommitTransactionCommand
StartTransactionCommand
query: delete from users_tbl where username='sib';
ProcessQuery
query: SELECT oid FROM "dialup_tbl" WHERE "username" = $1 FOR UPDATE OF "dialup_tbl"
/usr/local/postgres/bin/postmaster: reaping dead processes...
/usr/local/postgres/bin/postmaster: CleanupProc: pid 28428 exited with status 11
Server process (pid 28428) exited with status 11 at Wed Oct 25 13:15:35 2000
Terminating any active server processes...
Server processes were terminated at Wed Oct 25 13:15:35 2000
Reinitializing shared memory and semaphores
shmem_exit(0)
binding ShmemCreate(key=52e325, size=1104896)
/usr/local/postgres/bin/postmaster: ServerLoop: handling reading 5
/usr/local/postgres/bin/postmaster: ServerLoop: handling reading 5
DEBUG: Data Base System is starting up at Wed Oct 25 13:15:35 2000
DEBUG: Data Base System was interrupted being in production at Wed Oct 25 13:15:17 2000
/usr/local/postgres/bin/postmaster: ServerLoop: handling writing 5
The Data Base System is starting up
/usr/local/postgres/bin/postmaster: ServerLoop: handling writing 5
DEBUG: Data Base System is in production state at Wed Oct 25 13:15:35 2000
proc_exit(0)
shmem_exit(0)
exit(0)
/usr/local/postgres/bin/postmaster: reaping dead processes...

Thanks,

--
Sean Kelly <S.Kelly@ncl.ac.uk>
"If 99% is good enough, then gravity will not work for 14 minutes every day."
> Here is a debug (level 2) output - does this help any
> more?... If not, what should I provide for you in terms of
> debugging?...

Hmm. I didn't find the update/select combination you specified in your
problem statement in your debugging output, but it likely wouldn't have
helped anyway.

Your initial problem statement was of the form "if you do this, then if
you do that, you will get a server crash". That problem statement is
simple and testable, *if* it is accompanied by a reproducible example.
We are not likely to be able to help track down a problem if we cannot
reproduce it. So I would suggest the following:

1) dump your database using pg_dump or pg_dumpall.
2) reload your database using the dump from (1).
3) show that the problem is reproducible for you.
4) distill the scenario down to the fundamental elements, if possible.
5) file a problem report, and be ready to send an example including
   schema and data.

The problem for us is that your problem statement is not reproducible
here. So you will need to show how *you* can reproduce it using fresh
data for us to be able to help. There are other causes of database
failure (e.g. bad server memory) which we have no control over and which
are less likely to be the cause if you can get a reproducible case.

Just remember that "reproducible" doesn't necessarily mean that you can
get it to happen more than once on the same database. Ideally it means
that you can create a new database and demonstrate the same problem.

Hope this helps.

- Thomas
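Steps 1 to 3 above can be sketched as a shell dry run. This is only an illustration, not a procedure taken from the thread: the scratch database name users_repro and the dump file name are hypothetical, while pg_dump, createdb and psql are the standard client programs and users is the database name from the report. The function prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Dry-run sketch: print the commands that copy a database into a fresh
# scratch database, so the crash can be retried on freshly loaded data.
# The "<db>_repro" default and the dump file name are hypothetical.
print_repro_steps() {
    src_db=$1
    scratch_db=${2:-"${src_db}_repro"}
    dump_file="${src_db}.dump"

    echo "pg_dump ${src_db} > ${dump_file}"     # 1) dump the database
    echo "createdb ${scratch_db}"               # 2) create a fresh database
    echo "psql ${scratch_db} < ${dump_file}"    #    and reload the dump into it
    echo "psql ${scratch_db}"                   # 3) retry the failing statement there
}

print_repro_steps users
```

Loading into a separate database rather than dropping the original keeps the failing data around in case the problem does not follow the dump.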
pgsql-bugs@postgreSQL.org writes:
> users=> update users_tbl set added=TRUE where username like 'neta%';
> pqReadData() -- backend closed the channel unexpectedly.

> bash$ tail ~postgres/server.log
> Server process (pid 23747) exited with status 11 at Tue Oct 24 13:52:29 2000

This backend crash should have left a core file in your database
directory (PGDATA/base/users/core). Can you provide a backtrace
from that corefile using gdb?

regards, tom lane
On Wed, 25 Oct 2000 12:52:44 -0400, Tom Lane said:

> pgsql-bugs@postgreSQL.org writes:
> > users=> update users_tbl set added=TRUE where username like 'neta%';
> > pqReadData() -- backend closed the channel unexpectedly.
>
> > bash$ tail ~postgres/server.log
> > Server process (pid 23747) exited with status 11 at Tue Oct 24 13:52:29 2000
>
> This backend crash should have left a core file in your database
> directory (PGDATA/base/users/core). Can you provide a backtrace
> from that corefile using gdb?

No core there ... any other suggestions?

With respect to GCC errors, '11' normally indicates a hardware problem -
could this be the case? One of the machines I tested it on was brand new
hardware...

Thanks,

--
Sean Kelly <S.Kelly@ncl.ac.uk>
"If 99% is good enough, then gravity will not work for 14 minutes every day."
Sean Kelly <S.Kelly@ncl.ac.uk> writes:
>> This backend crash should have left a core file in your database
>> directory (PGDATA/base/users/core). Can you provide a backtrace
>> from that corefile using gdb?

> No core there ... any other suggestions?

You probably started the postmaster with a ulimit setting that prevents
coredumps (ulimit -c 0 or something like that, see your ulimit man page).
On some Unixen, this ulimit setting is the default for anything started
from a system boot script. Restart the postmaster with ulimit -c
unlimited, either by starting it by hand or adding a ulimit call to the
boot script. Then reproduce the crash to get a core file.

> With respect to GCC errors, '11' normally indicates a hardware
> problem

Uh, whoever told you that? Signal 11 is SIGSEGV on most Unixen,
and that just means the program tried to dereference an invalid
pointer. Almost certainly, we're looking at some software bug
here, not a hardware failure.

regards, tom lane
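Both points above can be checked from a shell. The sketch below is illustrative only and its function names are hypothetical. It interprets the value printed by ulimit -c, and it decodes a status such as 11 on the assumption (not confirmed in the thread) that the postmaster logs the raw wait status, using the traditional Unix encoding where the low 7 bits hold the signal number and bit 0x80 marks a core dump. Under that assumption, a status of 11 means signal 11 with no core written, which would match the missing core file.

```shell
#!/bin/sh
# Interpret the value printed by `ulimit -c` (the core file size limit).
describe_core_limit() {
    case "$1" in
        0)         echo "core dumps disabled" ;;
        unlimited) echo "core dumps enabled (no size limit)" ;;
        *)         echo "core dumps enabled (limit: $1 blocks)" ;;
    esac
}

# Split a raw wait status into signal number and core-dump flag.
# Assumes the traditional encoding: low 7 bits = signal, 0x80 = core dumped.
# Only meaningful for processes killed by a signal (low 7 bits non-zero).
decode_wait_status() {
    sig=$(( $1 & 127 ))
    if [ $(( $1 & 128 )) -ne 0 ]; then
        echo "signal ${sig}, core dumped"
    else
        echo "signal ${sig}, no core dump"
    fi
}

describe_core_limit "$(ulimit -c)"   # limit in effect for this shell
decode_wait_status 11                # prints "signal 11, no core dump"
```

If the first line reports that core dumps are disabled, running ulimit -c unlimited in the same shell (or boot script) that starts the postmaster lifts the limit for the backends it spawns.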
On Wed, 25 Oct 2000 14:14:22 -0400, Tom Lane said:

> > No core there ... any other suggestions?
>
> You probably started the postmaster with a ulimit setting that prevents
> coredumps (ulimit -c 0 or something like that, see your ulimit man page).
> On some Unixen, this ulimit setting is the default for anything started
> from a system boot script. Restart the postmaster with ulimit -c
> unlimited, either by starting it by hand or adding a ulimit call to the
> boot script. Then reproduce the crash to get a core file.

Ok, I sorted that ... I now have a 2Mb core file. Can you explain how to
'backtrace' it with gdb ... I'm not really a developer and haven't played
with gdb much ... ever ... I've stuck the core file at
http://www.randomfx.net/core.html if you need it.

As someone suggested, I 'pg_dump'ed the database, 'dropdb'ed and
'createdb'ed it, before reloading. After reloading the results were the
same. I tried this on both the machines running 7.0.2 with the same
results.

> > With respect to GCC errors, '11' normally indicates a hardware
> > problem
>
> Uh, whoever told you that? Signal 11 is SIGSEGV on most Unixen,
> and that just means the program tried to dereference an invalid
> pointer. Almost certainly, we're looking at some software bug
> here, not a hardware failure.

One example can be found on http://www.bitwizard.nl/sig11/

Thanks for your time and help,

--
Sean Kelly <S.Kelly@ncl.ac.uk>
"If 99% is good enough, then gravity will not work for 14 minutes every day."
Sean Kelly <S.Kelly@ncl.ac.uk> writes:
> Ok, I sorted that ... I now have a 2Mb core file. Can you
> explain how to 'backtrace' it with gdb ...

	gdb /path/to/postgres-executable /path/to/core-file
	bt
	quit

and send the results. Hopefully there will be at least function names
in the display --- if it's all numbers then don't bother sending it :-(

> I've stuck the core
> file at http://www.randomfx.net/core.html if you need it.

Thanks, but it's pretty much useless to anyone who doesn't have the
exact same executable and same system platform as you.

>>>>> With respect to GCC errors, '11' normally indicates a hardware
>>>>> problem
>>
>> Uh, whoever told you that?

> One example can be found on http://www.bitwizard.nl/sig11/

Hmph. bitwizard may think that flaky hardware is a normal state of
affairs, but I don't. Perhaps he buys his machines from incompetent
manufacturers.

regards, tom lane
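For readers who would rather not type into an interactive gdb session, the same three steps can be wrapped in a small shell function. This is a sketch under assumptions: the function name backtrace_core is hypothetical, and it relies on the -batch and -ex options of current gdb releases, which may not exist in a gdb as old as the one used in this thread.

```shell
#!/bin/sh
# Print a backtrace from a core file without an interactive gdb session.
# Usage: backtrace_core /path/to/postgres-executable /path/to/core-file
backtrace_core() {
    if [ ! -f "$1" ] || [ ! -f "$2" ]; then
        echo "usage: backtrace_core <executable> <core-file>" >&2
        return 2
    fi
    # -batch makes gdb exit once its commands have run; -ex bt runs "bt".
    gdb -batch -ex bt "$1" "$2"
}
```

As noted above, the executable passed in must be the exact binary that produced the core file, otherwise the backtrace will not resolve to meaningful function names.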
Sean Kelly <S.Kelly@ncl.ac.uk> writes:
> (gdb) bt
> #0 0x8115eb2 in ri_BuildQueryKeyFull ()
> #1 0x8115dc2 in RI_FKey_keyequal_upd ()
> #2 0x8096d7c in DeferredTriggerSaveEvent ()

Hmm. There wasn't any mention of foreign keys for this table in your
bug report, now was there?

At a guess, you've run into the known bug that foreign key triggers
don't track renames of referenced tables. Did you rename a table that
is a foreign-key referencer or referencee of this one? If so, rename
it back, or drop and reload both tables. (The crash is fixed for
7.0.3, though actually tracking the renames is further downstream.)

regards, tom lane
On Thu, 26 Oct 2000 10:14:16 -0400, Tom Lane said:

> gdb /path/to/postgres-executable /path/to/core-file
> bt
> quit

[postgres@nis-master] ~ 132: gdb bin/postgres data/base/users/core
This GDB was configured as "i386-slackware-linux"...
Core was generated by `/usr/local/postgres/bin/postgres localhost s'.
Program terminated with signal 11, Segmentation fault.
..
[SNIP: Loading symbols...]
..
#0 0x8115eb2 in ri_BuildQueryKeyFull ()
(gdb) bt
#0 0x8115eb2 in ri_BuildQueryKeyFull ()
#1 0x8115dc2 in RI_FKey_keyequal_upd ()
#2 0x8096d7c in DeferredTriggerSaveEvent ()
#3 0x8096016 in ExecARUpdateTriggers ()
#4 0x809c617 in ExecReplace ()
#5 0x809c256 in ExecutePlan ()
#6 0x809b8f3 in ExecutorRun ()
#7 0x80eb46a in ProcessQueryDesc ()
#8 0x80eb4d0 in ProcessQuery ()
#9 0x80ea153 in pg_exec_query_dest ()
#10 0x80ea033 in pg_exec_query ()
#11 0x80eaeec in PostgresMain ()
#12 0x80d565a in DoBackend ()
#13 0x80d523a in BackendStartup ()
#14 0x80d45ee in ServerLoop ()
#15 0x80d407c in PostmasterMain ()
#16 0x80ab115 in main ()
#17 0x400f9577 in __libc_start_main () from /lib/libc.so.6
(gdb) quit

There we go :)

Thanks,

--
Sean Kelly <S.Kelly@ncl.ac.uk>
"If 99% is good enough, then gravity will not work for 14 minutes every day."
On Thu, 26 Oct 2000 11:27:22 -0400, Tom Lane said:

> Sean Kelly <S.Kelly@ncl.ac.uk> writes:
> > (gdb) bt
> > #0 0x8115eb2 in ri_BuildQueryKeyFull ()
> > #1 0x8115dc2 in RI_FKey_keyequal_upd ()
> > #2 0x8096d7c in DeferredTriggerSaveEvent ()
>
> Hmm. There wasn't any mention of foreign keys for this table in your
> bug report, now was there?
>
> At a guess, you've run into the known bug that foreign key triggers
> don't track renames of referenced tables. Did you rename a table that
> is a foreign-key referencer or referencee of this one? If so, rename
> it back, or drop and reload both tables. (The crash is fixed for
> 7.0.3, though actually tracking the renames is further downstream.)

Ah ha!.... oldname_tbl referenced the primary key in users_tbl, and
oldname_tbl was renamed newname_tbl. Is this the bug you mean?...

When you say drop/reload both tables do you mean both users_tbl and
newname_tbl?...

Thanks, I think it's nearly sorted now :)

--
Sean Kelly <S.Kelly@ncl.ac.uk>
"If 99% is good enough, then gravity will not work for 14 minutes every day."
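The situation described above can be written down as a small SQL script, printed here by a shell function so that it can be reviewed before being fed to psql in a scratch database. Everything beyond the table names is a guess: the column definitions are hypothetical, the real schema was never posted in the thread, and the script has not been checked against 7.0.2, so it may or may not trigger the crash.

```shell
#!/bin/sh
# Emit a hypothetical SQL script that mimics the reported situation:
# a table referencing users_tbl is renamed, then users_tbl is updated.
# Column definitions are guesses; only the table names come from the thread.
print_rename_repro_sql() {
    cat <<'EOF'
CREATE TABLE users_tbl (username text PRIMARY KEY, added bool DEFAULT 'f');
CREATE TABLE oldname_tbl (username text REFERENCES users_tbl);
INSERT INTO users_tbl (username) VALUES ('neta1');
INSERT INTO users_tbl (username) VALUES ('neta2');
ALTER TABLE oldname_tbl RENAME TO newname_tbl;
UPDATE users_tbl SET added = TRUE WHERE username LIKE 'neta%';
EOF
}

print_rename_repro_sql
```

A script of this shape is the kind of self-contained example, with schema and data, that was asked for earlier in the thread.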
Do we need a TODO here?

> Sean Kelly <S.Kelly@ncl.ac.uk> writes:
> > (gdb) bt
> > #0 0x8115eb2 in ri_BuildQueryKeyFull ()
> > #1 0x8115dc2 in RI_FKey_keyequal_upd ()
> > #2 0x8096d7c in DeferredTriggerSaveEvent ()
>
> Hmm. There wasn't any mention of foreign keys for this table in your
> bug report, now was there?
>
> At a guess, you've run into the known bug that foreign key triggers
> don't track renames of referenced tables. Did you rename a table that
> is a foreign-key referencer or referencee of this one? If so, rename
> it back, or drop and reload both tables. (The crash is fixed for
> 7.0.3, though actually tracking the renames is further downstream.)
>
> regards, tom lane
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Do we need a TODO here?

None that we haven't got already, AFAIK.

regards, tom lane
Could it also be the result of a Cluster operation? I've seen strange
things related to functions/triggers on tables that I've clustered.

>> At a guess, you've run into the known bug that foreign key triggers
>> don't track renames of referenced tables. Did you rename a table that
>> is a foreign-key referencer or referencee of this one? If so, rename
>> it back, or drop and reload both tables. (The crash is fixed for
>> 7.0.3, though actually tracking the renames is further downstream.)
>>
>> regards, tom lane
>>
>
>
>--
>  Bruce Momjian                        |  http://candle.pha.pa.us
>  pgman@candle.pha.pa.us               |  (610) 853-3000
>  +  If your life is a hard drive,     |  830 Blythe Avenue
>  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>
>
On Fri, 27 Oct 2000 11:55:24 -0700, Darcy Buskermolen said:

> Could it also be the result of a Cluster operation? I've seen strange
> things related to functions/triggers on tables that I've clustered.

Personally for me it turned out to be, as Tom said, the renaming of a
table involving foreign keys. I renamed the table back to what it was,
dropped it, and then recreated the new one with new foreign keys.

Thanks,

--
Sean Kelly <S.Kelly@ncl.ac.uk>
"If 99% is good enough, then gravity will not work for 14 minutes every day."