Обсуждение: Unhelpful debug tools on OS X :-(
Dunno if anyone's noticed, but buildfarm member jackal has been crashing intermittently in the "cluster" regression test. I found out that I can reproduce that here on a Mac Mini (apparently the same or nearly same hardware/software as jackal) --- just repeat the parallel tests often enough, and it'll fail once in awhile. I've got a core file but gdb is being no help at all: mini:~/pgsql/src/test/regress tgl$ gdb /Users/tgl/testversion/bin/postmaster /cores/core.23609 GNU gdb 6.3.50-20050815 (Apple version gdb-563) (Wed Jul 19 05:10:58 GMT 2006) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-apple-darwin"...Reading symbols for shared libraries .. done Core was generated by `/Users/tgl/testversion/bin/postmaster'. Core file contained no thread-specific data (gdb) bt #0 0x00000000 in ?? () (gdb) Any suggestions how to extract some info out of this? regards, tom lane
Tom Lane wrote: > Dunno if anyone's noticed, but buildfarm member jackal has been crashing > intermittently in the "cluster" regression test. I found out that I can > reproduce that here on a Mac Mini (apparently the same or nearly same > hardware/software as jackal) --- just repeat the parallel tests often > enough, and it'll fail once in awhile. I've got a core file but gdb > is being no help at all: > > mini:~/pgsql/src/test/regress tgl$ gdb /Users/tgl/testversion/bin/postmaster /cores/core.23609 > GNU gdb 6.3.50-20050815 (Apple version gdb-563) (Wed Jul 19 05:10:58 GMT 2006) > Copyright 2004 Free Software Foundation, Inc. > GDB is free software, covered by the GNU General Public License, and you are > welcome to change it and/or distribute copies of it under certain conditions. > Type "show copying" to see the conditions. > There is absolutely no warranty for GDB. Type "show warranty" for details. > This GDB was configured as "i386-apple-darwin"...Reading symbols for shared libraries .. done > > Core was generated by `/Users/tgl/testversion/bin/postmaster'. > Core file contained no thread-specific data > > (gdb) bt > #0 0x00000000 in ?? () > (gdb) > > Any suggestions how to extract some info out of this? no idea on that one - but clownfish managed to fail the cluster test once too lately: http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=clownfish&dt=2007-04-09%2023:03:03 Stefan
Tom Lane wrote: > Any suggestions how to extract some info out of this? Does OS X have the catchsegv tool? If you can run postmaster with that, you should get a backtrace when it crashes. Unless it has the same problem as gdb, of course.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> Any suggestions how to extract some info out of this? > Does OS X have the catchsegv tool? No, but I suddenly remembered about CrashReporter, and sure enough it's catching these crashes: Exception: EXC_BAD_ACCESS (0x0001) Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x00000010 Thread 0 Crashed: 0 postmaster 0x001af4ef smgrextend + 12 (smgr.c:485) 1 postmaster 0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278) 2 postmaster 0x000bdc22 cluster_rel + 850 (cluster.c:806) 3 postmaster 0x000be119 cluster + 160 (cluster.c:220) 4 postmaster 0x001b74a8 PortalRunUtility + 233 (palloc.h:84) 5 postmaster 0x001b7784 PortalRunMulti + 237 (pquery.c:1271) 6 postmaster 0x001b80ae PortalRun + 918 (pquery.c:813) 7 postmaster 0x001b2afd exec_simple_query + 656 (postgres.c:965) 8 postmaster 0x001b4b0c PostgresMain + 5628 (postgres.c:3507) 9 postmaster 0x00183973 ServerLoop + 2828 (postmaster.c:2614) 10 postmaster 0x00184b1e PostmasterMain + 2794 (postmaster.c:972) 11 postmaster 0x00130f8e main + 1236 (main.c:188) 12 postmaster 0x00001e86 _start + 216 13 postmaster 0x00001dad start + 41 So it looks like this has got something to do with the MVCC-safe cluster changes, which is not too surprising considering it started happening around about then. Off to have a look ... regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> Tom Lane wrote: >>> Any suggestions how to extract some info out of this? > >> Does OS X have the catchsegv tool? > > No, but I suddenly remembered about CrashReporter, and sure enough it's > catching these crashes: > > Exception: EXC_BAD_ACCESS (0x0001) > Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x00000010 > > Thread 0 Crashed: > 0 postmaster 0x001af4ef smgrextend + 12 (smgr.c:485) > 1 postmaster 0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278) > 2 postmaster 0x000bdc22 cluster_rel + 850 (cluster.c:806) > 3 postmaster 0x000be119 cluster + 160 (cluster.c:220) > 4 postmaster 0x001b74a8 PortalRunUtility + 233 (palloc.h:84) > 5 postmaster 0x001b7784 PortalRunMulti + 237 (pquery.c:1271) > 6 postmaster 0x001b80ae PortalRun + 918 (pquery.c:813) > 7 postmaster 0x001b2afd exec_simple_query + 656 (postgres.c:965) > 8 postmaster 0x001b4b0c PostgresMain + 5628 (postgres.c:3507) > 9 postmaster 0x00183973 ServerLoop + 2828 (postmaster.c:2614) > 10 postmaster 0x00184b1e PostmasterMain + 2794 (postmaster.c:972) > 11 postmaster 0x00130f8e main + 1236 (main.c:188) > 12 postmaster 0x00001e86 _start + 216 > 13 postmaster 0x00001dad start + 41 > > So it looks like this has got something to do with the MVCC-safe cluster > changes, which is not too surprising considering it started happening > around about then. Off to have a look ... I've been looking at the code for a few minutes as well, but haven't found an explanation for that yet. But I did notice that we're not fsyncing the newly written relation like we should. There's a comment raw_heap_insert:/* * Now write the page. We say isTemp = true even if it's not a * temp table,because there's no need for smgr to schedule an * fsync for this write; we'll do it ourselves before committing. */smgrextend(state->rs_new_rel->rd_smgr,state->rs_blockno, (char *) page, true); That's copy-pasted from tablecmds.c. But unlike in tablecmds.c, end_heap_rewrite only fsyncs the new file if we're not WAL-logging. Proposed fix: Index: src/backend/access/heap/rewriteheap.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/rewriteheap.c,v retrieving revision 1.1 diff -c -r1.1 rewriteheap.c *** src/backend/access/heap/rewriteheap.c 8 Apr 2007 01:26:27 -0000 1.1 --- src/backend/access/heap/rewriteheap.c 17 Apr 2007 20:50:05 -0000 *************** *** 272,282 **** } /* ! * If not WAL-logging, must fsync before commit. We use heap_sync ! * to ensure that the toast table gets fsync'd too. */ ! if (!state->rs_use_wal) ! heap_sync(state->rs_new_rel); /* Deleting the context frees everything */ MemoryContextDelete(state->rs_cxt); --- 272,284 ---- } /* ! * Must fsync before commit, even if we've WAL-logged the changes, ! * because we've written pages outside the buffer manager. See comments! * in copy_relation_data in commands/tablecmds.c for more information. ! * ! * We use heap_sync to ensure that the toast table gets fsync'd too. */ ! heap_sync(state->rs_new_rel); /* Deleting the context frees everything */ MemoryContextDelete(state->rs_cxt); BTW: In tablecmds.c the new relation is fsynced with smgrimmedsync, not heap_sync. How about the toast table, it goes through shared buffers as usual, right? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
I wrote: > Thread 0 Crashed: > 0 postmaster 0x001af4ef smgrextend + 12 (smgr.c:485) > 1 postmaster 0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278) > 2 postmaster 0x000bdc22 cluster_rel + 850 (cluster.c:806) > 3 postmaster 0x000be119 cluster + 160 (cluster.c:220) AFAICS, a crash inside smgrextend must indicate it was passed a junk SMgrRelation, which implies that the problem has to trace back to the new heap's rd_smgr getting closed since the rewriteheap operation started. I think the only possible path for that is an sinval queue overflow and consequent SI reset during the test. That would explain why it's such an intermittent failure, but it is curious that Darwin seems much more prone to this than other platforms. I wonder if Apple has tweaked the scheduler in a way that allows backends to not be given any cycles for long periods ... Anyway, patch committed; we'll see if things don't get more stable. Memo to self: run some tests with CLOBBER_CACHE_ALWAYS after the remaining major patches have landed. regards, tom lane
Heikki Linnakangas <heikki@enterprisedb.com> writes: > But I did notice that we're not fsyncing the newly written relation like > we should. Good point, but doesn't the analogy to copy_relation_data say that we should sync if not rd_istemp? (This is my fault BTW; your original patch kept the data in shared buffers, so it wasn't subject to the problem.) > BTW: In tablecmds.c the new relation is fsynced with smgrimmedsync, not > heap_sync. That's okay since that routine is just copying the one table. TOAST is handled via recursion of ATExecSetTableSpace. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> But I did notice that we're not fsyncing the newly written relation like >> we should. > > Good point, but doesn't the analogy to copy_relation_data say that we > should sync if not rd_istemp? Right. I don't know why I didn't copy that if-statement. It seems to be bedtime for me... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com