Обсуждение: what could cause this PANIC on enterprice 7.3.4 db?
Usually db works fine & stable months, but couple days ago it was PANIC, and i have no ideas why this happen. Nov 5 20:22:30 monstr postgres[16071]: [1] LOG: connection received: host=[local] Nov 5 20:22:30 monstr postgres[16071]: [2] LOG: connection authorized: user=ant database=tele Nov 5 20:22:42 monstr postgres[16071]: [3] PANIC: open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file or directory ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Nov 5 20:22:42 monstr postgres[16071]: [4-1] LOG: statement: select [...] skip the log of failed query (it was just usual query) Nov 5 20:22:42 monstr postgres[1109]: [1] LOG: server process (pid 16071) was terminated by signal 6 Nov 5 20:22:42 monstr postgres[1109]: [2] LOG: terminating any other active server processes Nov 5 20:22:42 monstr postgres[15913]: [3-1] WARNING: Message from PostgreSQL backend: Nov 5 20:22:42 monstr postgres[1412]: [143-1] WARNING: Message from PostgreSQL backend: Nov 5 20:22:42 monstr postgres[15913]: [3-2] ^IThe Postmaster has informed me that some other backend Nov 5 20:22:42 monstr postgres[1412]: [143-2] ^IThe Postmaster has informed me that some other backend Nov 5 20:22:42 monstr postgres[15913]: [3-3] ^Idied abnormally and possibly corrupted shared memory. Nov 5 20:22:42 monstr postgres[1412]: [143-3] ^Idied abnormally and possibly corrupted shared memory. Nov 5 20:22:42 monstr postgres[15913]: [3-4] ^II have rolled back the current transaction and am Nov 5 20:22:42 monstr postgres[1412]: [143-4] ^II have rolled back the current transaction and am Nov 5 20:22:42 monstr postgres[15913]: [3-5] ^Igoing to terminate your database system connection and exit. Nov 5 20:22:42 monstr postgres[1412]: [143-5] ^Igoing to terminate your database system connection and exit. Nov 5 20:22:42 monstr postgres[15913]: [3-6] ^IPlease reconnect to the database system and repeat your query. Nov 5 20:22:42 monstr postgres[1412]: [143-6] ^IPlease reconnect to the database system and repeat your query. Nov 5 20:22:42 monstr postgres[1109]: [3] LOG: all server processes terminated; reinitializing shared memory and semaphores Nov 5 20:22:42 monstr postgres[16077]: [4] LOG: database system was interrupted at 2003-11-05 20:17:48 EET Nov 5 20:22:42 monstr postgres[16077]: [5] LOG: checkpoint record is at 14/E956482C Nov 5 20:22:42 monstr postgres[16077]: [6] LOG: redo record is at 14/E956482C; undo record is at 0/0; shutdown FALSE Nov 5 20:22:42 monstr postgres[16077]: [7] LOG: next transaction id: 13580788; next oid: 190057632 Nov 5 20:22:42 monstr postgres[16077]: [8] LOG: database system was not properly shut down; automatic recovery in progress Nov 5 20:22:42 monstr postgres[16077]: [9] LOG: redo starts at 14/E956486C Nov 5 20:22:42 monstr postgres[16077]: [10] LOG: ReadRecord: record with zero length at 14/E95ACA14 Nov 5 20:22:42 monstr postgres[16077]: [11] LOG: redo done at 14/E95AC9F0 Nov 5 20:22:45 monstr postgres[16077]: [12] LOG: database system is ready so db was restarted succesfully itself but all applications that had static connection (it's tcl apps) where fell the hard disk is fine (might be, because i've used to check it at night time from cron like this: cat /dev/hda > /dev/null), so i have no idea why pg didn't find this file (see error from log). Does someone? system is: Linux 2.4.18-5 (RH) Thank you, Andriy Tkachuk. -- http://www.imt.com.ua
sorry , guys, for my english i meant "enterprise db" in subj, not enterprice :) On Fri, 7 Nov 2003, Andriy Tkachuk wrote: > Usually db works fine & stable months, but couple days ago > it was PANIC, and i have no ideas why this happen. > > Nov 5 20:22:30 monstr postgres[16071]: [1] LOG: connection received: host=[local] > Nov 5 20:22:30 monstr postgres[16071]: [2] LOG: connection authorized: user=ant database=tele > Nov 5 20:22:42 monstr postgres[16071]: [3] PANIC: open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file ordirectory > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > Nov 5 20:22:42 monstr postgres[16071]: [4-1] LOG: statement: select > [...] skip the log of failed query (it was just usual query) > Nov 5 20:22:42 monstr postgres[1109]: [1] LOG: server process (pid 16071) was terminated by signal 6 > Nov 5 20:22:42 monstr postgres[1109]: [2] LOG: terminating any other active server processes > Nov 5 20:22:42 monstr postgres[15913]: [3-1] WARNING: Message from PostgreSQL backend: > Nov 5 20:22:42 monstr postgres[1412]: [143-1] WARNING: Message from PostgreSQL backend: > Nov 5 20:22:42 monstr postgres[15913]: [3-2] ^IThe Postmaster has informed me that some other backend > Nov 5 20:22:42 monstr postgres[1412]: [143-2] ^IThe Postmaster has informed me that some other backend > Nov 5 20:22:42 monstr postgres[15913]: [3-3] ^Idied abnormally and possibly corrupted shared memory. > Nov 5 20:22:42 monstr postgres[1412]: [143-3] ^Idied abnormally and possibly corrupted shared memory. > Nov 5 20:22:42 monstr postgres[15913]: [3-4] ^II have rolled back the current transaction and am > Nov 5 20:22:42 monstr postgres[1412]: [143-4] ^II have rolled back the current transaction and am > Nov 5 20:22:42 monstr postgres[15913]: [3-5] ^Igoing to terminate your database system connection and exit. > Nov 5 20:22:42 monstr postgres[1412]: [143-5] ^Igoing to terminate your database system connection and exit. > Nov 5 20:22:42 monstr postgres[15913]: [3-6] ^IPlease reconnect to the database system and repeat your query. > Nov 5 20:22:42 monstr postgres[1412]: [143-6] ^IPlease reconnect to the database system and repeat your query. > Nov 5 20:22:42 monstr postgres[1109]: [3] LOG: all server processes terminated; reinitializing shared memory and semaphores > Nov 5 20:22:42 monstr postgres[16077]: [4] LOG: database system was interrupted at 2003-11-05 20:17:48 EET > Nov 5 20:22:42 monstr postgres[16077]: [5] LOG: checkpoint record is at 14/E956482C > Nov 5 20:22:42 monstr postgres[16077]: [6] LOG: redo record is at 14/E956482C; undo record is at 0/0; shutdown FALSE > Nov 5 20:22:42 monstr postgres[16077]: [7] LOG: next transaction id: 13580788; next oid: 190057632 > Nov 5 20:22:42 monstr postgres[16077]: [8] LOG: database system was not properly shut down; automatic recovery in progress > Nov 5 20:22:42 monstr postgres[16077]: [9] LOG: redo starts at 14/E956486C > Nov 5 20:22:42 monstr postgres[16077]: [10] LOG: ReadRecord: record with zero length at 14/E95ACA14 > Nov 5 20:22:42 monstr postgres[16077]: [11] LOG: redo done at 14/E95AC9F0 > Nov 5 20:22:45 monstr postgres[16077]: [12] LOG: database system is ready > > > so db was restarted succesfully itself but all applications that had > static connection (it's tcl apps) where fell > > the hard disk is fine (might be, because i've used to check it > at night time from cron like this: cat /dev/hda > /dev/null), > so i have no idea why pg didn't find this file (see error from log). > Does someone? > > system is: Linux 2.4.18-5 (RH) > > Thank you, > Andriy Tkachuk. > > -- > http://www.imt.com.ua > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > С уважением,Андрей Ткачук. http://www.imt.com.ua
Andriy Tkachuk <ant@imt.com.ua> writes: > Nov 5 20:22:42 monstr postgres[16071]: [3] PANIC: open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file ordirectory Could we see ls -l /usr/local/pgsql/data/pg_clog/ regards, tom lane
shure, Tom [10:49]/2:ant@monstr:~>sudo ls -al /usr/local/pgsql/data/pg_clog total 40 drwx------ 2 pgsql postgres 4096 Nov 7 03:28 . drwx------ 6 pgsql root 4096 Oct 23 10:45 .. -rw------- 1 pgsql postgres 32768 Nov 10 10:47 000D [10:49]/2:ant@monstr:~>date Mon Nov 10 10:49:50 EET 2003 [10:49]/2:ant@monstr:~>uname -sr Linux 2.4.18-5custom On Fri, 7 Nov 2003, Tom Lane wrote: > Andriy Tkachuk <ant@imt.com.ua> writes: > > Nov 5 20:22:42 monstr postgres[16071]: [3] PANIC: open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file ordirectory > > Could we see ls -l /usr/local/pgsql/data/pg_clog/ > > regards, tom lane > regards,Andriy Tkachuk -- http://www.imt.com.ua
Andriy Tkachuk <ant@imt.com.ua> writes: > On Fri, 7 Nov 2003, Tom Lane wrote: >> Andriy Tkachuk <ant@imt.com.ua> writes: >>> Nov 5 20:22:42 monstr postgres[16071]: [3] PANIC: open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file ordirectory >> >> Could we see ls -l /usr/local/pgsql/data/pg_clog/ > [10:49]/2:ant@monstr:~>sudo ls -al /usr/local/pgsql/data/pg_clog > total 40 > drwx------ 2 pgsql postgres 4096 Nov 7 03:28 . > drwx------ 6 pgsql root 4096 Oct 23 10:45 .. > -rw------- 1 pgsql postgres 32768 Nov 10 10:47 000D Okay, given that the file the code was trying to access is nowhere near the current or past set of valid transaction numbers, it's pretty clear that what you have is a corrupted transaction number in some tuple's header. The odds are that not only the transaction number is affected; usually when we see something like this, anywhere from dozens to hundreds of bytes have been replaced by garbage data. In the cases I've been able to study in the past, the cause seemed to be faulty hardware or possibly kernel bugs --- for instance someone recently reported a case where a whole kilobyte of a Postgres file had been replaced with what seemed to be part of a mail message. I'd ascribe that to either a disk drive writing a sector at the wrong place, or the kernel getting confused about which buffer held which file. So I'd recommend running some hardware diagnostics and checking to see if there are errata available for your kernel. As far as cleaning up the immediate damage is concerned, you'll probably want to use pg_filedump or some such tool to get a better feeling for the extent of the damage. There are descriptions of this process in the archives --- try looking for recent references to pg_filedump. regards, tom lane