Thread: vacuum analyze corrupts database
We're running into a problem with vacuum analyze. We first noticed it with
Postgres 7.2.1; now it appears again with 7.3.2. After nightly maintenance the
application stops functioning. It looks like corruption occurs on (at least)
one table. I was able to narrow the problem down by dropping all but one
column from the table in question. What I have now is a table with a single
varchar column and just over three hundred records in it. Running vacuum
analyze on this table results in some sort of corruption.

I'm running Pg 7.3.2 on Solaris. There are no explicit settings for vacuum
in postgresql.conf.

Attached is the dump of this table. The problem can be easily reproduced.
Load the dump into your database; it'll create table "nla".

syncdb=# select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;
 count
-------
   121
(1 row)

syncdb=# vacuum analyze nla;
VACUUM

-- --- now repeat the same query ---
syncdb=# select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset:
WARNING:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
Failed.

Now, if I reconnect and delete the nla record from pg_statistic I can run the
query again. The problem could be data-related: when I poked around and
updated all records beginning with "sync:///FUNCTREE" to something different,
the problem went away.

Could someone from the Postgres team comment on this, please.
Mike.
Attachments
Michael Brusser <michael@synchronicity.com> writes:
> Attached is the dump of this table. The problem can be easily reproduced.

Not as easily as all that --- your example works fine for me, under 7.2.4,
7.3.3, or CVS tip. I speculate that there may be some necessary factor in
build options or runtime environment. How exactly did you configure
Postgres, what compiler did you use, are you using a non-C locale or
non-default multibyte encoding, etc?

It would help to know where the crash is occurring; could you get a debugger
backtrace please?

			regards, tom lane
On Fri, 23 May 2003 10:33:17 -0400, Michael Brusser
<michael@synchronicity.com> wrote:
>syncdb=# vacuum analyze nla;
>VACUUM
>
>-- --- now repeat the same query ---
>syncdb=# select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;
>server closed the connection unexpectedly
>        This probably means the server terminated abnormally
>        before or while processing the request.

Works for me:

regression=# VACUUM VERBOSE ANALYSE nla;
INFO:  --Relation public.nla--
INFO:  Pages 3: Changed 0, Empty 0; Tup 323: Vac 0, Keep 0, UnUsed 0.
        Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  Analyzing public.nla
VACUUM
regression=# select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;
 count
-------
   121
(1 row)

regression=# SELECT version();
                            version
----------------------------------------------------------------
 PostgreSQL 7.3.2 on i686-pc-linux-gnu, compiled by GCC 2.7.2.1
(1 row)

I did the VACUUM - SELECT cycle almost 20 times on my development system and
then several times on a second machine, all without any problem:

                           version
-------------------------------------------------------------
 PostgreSQL 7.3 on i586-pc-linux-gnu, compiled by GCC 2.95.2

Are you able to reproduce the problem on any other computer?

Servus
 Manfred
> Not as easily as all that --- your example works fine for me, under
> 7.2.4, 7.3.3, or CVS tip. I speculate that there may be some necessary
> factor in build options or runtime environment. How exactly did you
> configure Postgres, what compiler did you use, are you using a non-C
> locale or non-default multibyte encoding, etc?
>
> It would help to know where the crash is occurring; could you get a
> debugger backtrace please?
>
> 			regards, tom lane

This is good news. What you said here made me do something I should've done
already - I loaded this table into another instance of the database and I
could not reproduce the problem!

Trying to analyze the situation, I think this is what it looks like. The two
databases were built a few days apart with no changes to the code, makefiles,
build options, etc. Vacuum analyze on the same table very consistently
creates the problem on one db, and works fine on the other. I use the same
instance of postmaster against each db, so I guess whatever makes the
difference must be somewhere under the $PGDATA directory.

The database that has the problem underwent some testing. Some data was
created and deleted while QA guys tested our application. The testing was
manual, which makes it difficult to reproduce the exact case.

I've attached the config options we use and the database log. If you're
willing to look further I could tar/zip the db as well, but let me know if
there's something else I could try first.

What I did this time was:

select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;
vacuum analyze nla;
\encoding
select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;

I also tried to drop and recreate the database - same result.

Thank you.
Mike.
------------------------------------------------------
tomkins% pg_config --configure
'--srcdir=/home/synch/build/relipg21/src/vendor/postgresql/732'
'--prefix=/home/synch/build/relipg21/obj/SunOS_58_CC_60/postgresql-732/install'
'--enable-locale' '--enable-multibyte=UNICODE' '--with-tcl' '--without-tk'
'--with-tclconfig=/home/synch/build/relipg21/sbin/SunOS_58_CC_60'
'--with-tkconfig=/home/synch/build/relipg21/sbin/SunOS_58_CC_60'
'--with-includes=/home/synch/build/relipg21/obj/SunOS_58_CC_60/postgresql-732/../tcl/include /home/synch/build/relipg21/obj/SunOS_58_CC_60/postgresql-732/../tk/include'
'--with-libraries=/home/synch/build/relipg21/sbin/SunOS_58_CC_60'
'--without-readline' '--without-zlib'
'--bindir=/home/synch/build/relipg21/sbin/SunOS_58_CC_60'
'--libdir=/home/synch/build/relipg21/sbin/SunOS_58_CC_60'
'CC=gcc -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
------------------------------------------------------
tomkins% uname -a
SunOS tomkins 5.8 Generic_108528-19 sun4u sparc SUNW,Ultra-80

tomkins% locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
------------------------------------------------------
Attachments
Michael Brusser <michael@synchronicity.com> writes:
> I should've done already - I loaded this table into another
> instance of the database and I could not reproduce the problem!

Hm. The log looks like the crash is occurring during planning (since you get
a "parse tree" message but no "ProcessQuery"). I'm guessing that some
selectivity-estimation operation that tries to use the pg_statistic entries
generated by VACUUM ANALYZE is failing. Can't narrow it down further than
that though. Again, how about a debugger stack trace? Also, the pg_stats
rows for the table would be interesting.

> If you're willing to look farther I could tar/zip the db as well.

Waste of time, likely, since I don't have a Solaris system to install it on.

			regards, tom lane
> ... Again, how about a debugger stack trace?
> Also, the pg_stats rows for the table would be interesting.

Tom, thanks for the feedback. This time I started postgres directly and ran
'analyze nla;' followed by the 'select ... where ... like ...' query with a
debugger attached - I got this trace:

=================================================================
(dbx) signal SEGV (no mapping at the fault address) in pfree at 0x2535e4
pfree+0x1c:     ld      [%o1 + 0x4], %o0
(dbx) where 30
=>[1] pfree(0x489420, 0xffbee890, 0x489420, 0xffbee880, 0x489628, 0xffbee888), at 0x2535e4
  [2] convert_to_scalar(0x489078, 0x19, 0xffbee890, 0x489008, 0x488fc0, 0x413), at 0x1fc6b4
  [3] scalarineqsel(0x484608, 0x42a, 0x0, 0x488a88, 0x489078, 0x19), at 0x1f94e4
  [4] scalarltsel(0xffbeea30, 0x1f9864, 0x413, 0x413, 0xb, 0xff0000), at 0x1f9a00
  [5] DirectFunctionCall4(0x1f9864, 0x484608, 0x42a, 0x4892d0, 0x0, 0x488a08), at 0x242de8
  [6] prefix_selectivity(0x484608, 0x488a88, 0x488e28, 0xffbeec08, 0xffbeec23, 0xffbeec90), at 0x1fe8fc
  [7] patternsel(0xffbeed38, 0x0, 0x42e9b0, 0x0, 0x0, 0xffbeed10), at 0x1f9e3c
  [8] likesel(0xffbeed38, 0xffbeed38, 0x1f9ff4, 0x20, 0x0, 0x28), at 0x1fa004
  [9] OidFunctionCall4(0x71b, 0x484608, 0x4bd, 0x488b70, 0x1, 0x492c18), at 0x2447c0
  [10] restriction_selectivity(0x484608, 0x4bd, 0x488b70, 0x1, 0x0, 0xffbeee80), at 0x169720
  [11] clauselist_selectivity(0x484608, 0x488d98, 0x1, 0x0, 0x0, 0xff141c4c), at 0x14d990
  [12] restrictlist_selectivity(0x484608, 0x488d50, 0x1, 0x20, 0x0, 0xff18e9d4), at 0x14d7f4
  [13] set_baserel_size_estimates(0x484608, 0x488be0, 0x1000000, 0x10, 0xc0, 0x488a08), at 0x1503d4
  [14] set_plain_rel_pathlist(0x484608, 0x488be0, 0x484a10, 0x0, 0x0, 0x0), at 0x14cb94
  [15] set_base_rel_pathlists(0x484608, 0x0, 0x1, 0x0, 0x0, 0x488cb0), at 0x14cb58
  [16] make_one_rel(0x484608, 0x484608, 0xffbef268, 0xffbef260, 0x4885d8, 0x28), at 0x14ca00
  [17] subplanner(0x484608, 0x0, 0x0, 0x0, 0x29, 0x492c18), at 0x15c600
  [18] query_planner(0x484608, 0x0, 0x0, 0x0, 0x4, 0x0), at 0x15c4f0
  [19] grouping_planner(0x484608, 0xbff00000, 0x0, 0x5, 0x0, 0xff141c4c), at 0x15e4c8
  [20] subquery_planner(0x484608, 0xbff00000, 0x0, 0x2b14c0, 0x225e0, 0xff18e9d4), at 0x15cc28
  [21] planner(0x484608, 0x0, 0x0, 0xa, 0x4354, 0x0), at 0x15c8d0
  [22] pg_plan_query(0x484608, 0x1, 0x0, 0xffbef590, 0x0, 0x0), at 0x1a5afc
  [23] pg_exec_query_string(0x4844e8, 0x1, 0x42e8a0, 0x2b8f6c, 0x73, 0x484500), at 0x1a618c
  [24] PostgresMain(0x2, 0x323428, 0x323440, 0x400, 0x3251c4, 0x175), at 0x1a8038
  [25] main(0x2, 0xffbef7b4, 0xffbef7c0, 0x2ea310, 0x0, 0x0), at 0x1282e0
===================================================================

At some point I also ran this case under truss; here's the tail of the
truss log:

... ... ...
read(0, " s e l e c t   c o u n".., 1024)    = 68
write(2, " D E B U G :   S t a r".., 32)     = 32
write(2, " L O G :   q u e r y :".., 82)     = 82
write(2, " L O G :   p a r s e ".., 1246)    = 1246
write(2, " L O G :   r e w r i t".., 1258)   = 1258
llseek(24, 81920, SEEK_SET)                  = 81920
read(24, "\0\0\0\0\0\0\010\0\0\001".., 8192) = 8192
llseek(25, 49152, SEEK_SET)                  = 49152
read(25, "\0\0\0\0\0\0\010\0\0\001".., 8192) = 8192
brk(0x00499AE8)                              = 0
brk(0x004D9AE8)                              = 0
open64("/home/tmichael/sean/postgresql/base/100423/16645", O_RDWR) = 32
read(32, "\0\0\0\0\0 C 6E0\0\0\0\t".., 8192) = 8192
read(32, "\0\0\0\001 Y 7C8\0\0\0 '".., 8192) = 8192
open64("/home/tmichael/sean/postgresql/base/100423/16408", O_RDWR) = 33
llseek(33, 24576, SEEK_SET)                  = 24576
read(33, "\0\0\0\001 Y17 h\0\0\0 '".., 8192) = 8192
llseek(24, 57344, SEEK_SET)                  = 57344
read(24, "\0\0\0\0\0\0\010\0\0\001".., 8192) = 8192
    Incurred fault #6, FLTBOUNDS  %pc = 0x002541D4
      siginfo: SIGSEGV SEGV_MAPERR addr=0x0161F3EC
    Received signal #11, SIGSEGV [default]
      siginfo: SIGSEGV SEGV_MAPERR addr=0x0161F3EC
        *** process killed ***
===========================================

I am attaching the dump of the pg_statistic table; I believe the record for
the nla table is the last one in the file.
Thank you; please let me know what else I can do.
Mike.
Attachments
Michael Brusser <michael@synchronicity.com> writes:
> (dbx) where 30
> =>[1] pfree(0x489420, 0xffbee890, 0x489420, 0xffbee880, 0x489628,
> 0xffbee888), at 0x2535e4
>   [2] convert_to_scalar(0x489078, 0x19, 0xffbee890, 0x489008, 0x488fc0,
> 0x413), at 0x1fc6b4
>   [3] scalarineqsel(0x484608, 0x42a, 0x0, 0x488a88, 0x489078, 0x19), at
> 0x1f94e4

Hmm. Are you running the database in a non-C locale? (pg_controldata would
tell you.) If so, this pfree is trying to pfree one of three strings that
were filled with strxfrm(). I am wondering if strxfrm() on your platform
sometimes writes more bytes than it is supposed to. I have seen vsnprintf()
overrun its output buffer on some flavors of Solaris (according to
FAQ_Solaris, the 64-bit libc in Solaris 7 had such a problem). Could there
be a similar bug in their strxfrm?

			regards, tom lane
> Hmm. Are you running the database in a non-C locale?
> (pg_controldata would tell you.)

Here's the output of pg_controldata:

pg_control version number:            72
Catalog version number:               200211021
Database cluster state:               in production
pg_control last modified:             Sun May 25 18:38:06 2003
Current log file ID:                  0
Next log file segment:                2
Latest checkpoint location:           0/15EF7A8
Prior checkpoint location:            0/15ED2D8
Latest checkpoint's REDO location:    0/15EF7A8
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        47
Latest checkpoint's NextXID:          3563
Latest checkpoint's NextOID:          118086
Time of latest checkpoint:            Sun May 25 18:38:02 2003
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               Floating point
Maximum length of locale name:        128
LC_COLLATE:                           en_US.ISO8859-1
LC_CTYPE:                             en_US.ISO8859-1

-----------------------------
Locale settings on the host:

tomkins% locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
-----------------------------

> I am wondering if strxfrm() on your platform sometimes writes more bytes
> than it is supposed to. I have seen vsnprintf() overrun its output
> buffer on some flavors of Solaris (according to FAQ_Solaris, the 64-bit
> libc in Solaris 7 had such a problem). Could there be a similar bug in
> their strxfrm?

We're on Solaris 8. I'll try to find information on strxfrm bugs, but do you
rule out any problems in the Postgres code? Is there a good explanation for
why the same table loaded into another Postgres installation on the same
host can be analyzed without any problems? Also, in my database I can
drop/create the database, load the table and reproduce the error. Not sure
what to make of this.

Mike.
Michael Brusser <michael@synchronicity.com> writes:
> but do you rule out any problems in Postgres code?

Never ... but when you're the only one reporting a problem, a local issue is
something to consider.

> Is there a good explanation to why the same table loaded into another
> Postgres installation on the same host can be analyzed without any
> problems?

Well, first thing I'd ask is whether the other installation is using the
same locale settings.

			regards, tom lane
> Well, first thing I'd ask is whether the other installation is using
> the same locale settings.

Oh, yes, I am very sure of that.
Mike
On Wed, May 28, 2003 at 09:04:18AM -0400, Michael Brusser wrote:
> > Well, first thing I'd ask is whether the other installation is using
> > the same locale settings.
> Oh, yes, I am very sure of that.

Note that while the machine reports the C locale, pg_controldata actually
said the database was en_US.iso8859-1 or something like that. Make sure
pg_controldata reports the same on the other machine...

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"The eagle never lost so much time as when he submitted to learn from the
crow." (William Blake, quoted by Nobody)