Обсуждение: what could cause this PANIC on enterprice 7.3.4 db?

Поиск
Список
Период
Сортировка

what could cause this PANIC on enterprice 7.3.4 db?

От
Andriy Tkachuk
Дата:
Usually db works fine & stable months, but couple days ago
it was PANIC, and i have no ideas why this happen.

Nov  5 20:22:30 monstr postgres[16071]: [1] LOG:  connection received: host=[local]
Nov  5 20:22:30 monstr postgres[16071]: [2] LOG:  connection authorized: user=ant database=tele
Nov  5 20:22:42 monstr postgres[16071]: [3] PANIC:  open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file or
directory

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Nov  5 20:22:42 monstr postgres[16071]: [4-1] LOG:  statement: select
[...] skip the log of failed query (it was just usual query)
Nov  5 20:22:42 monstr postgres[1109]: [1] LOG:  server process (pid 16071) was terminated by signal 6
Nov  5 20:22:42 monstr postgres[1109]: [2] LOG:  terminating any other active server processes
Nov  5 20:22:42 monstr postgres[15913]: [3-1] WARNING:  Message from PostgreSQL backend:
Nov  5 20:22:42 monstr postgres[1412]: [143-1] WARNING:  Message from PostgreSQL backend:
Nov  5 20:22:42 monstr postgres[15913]: [3-2] ^IThe Postmaster has informed me that some other backend
Nov  5 20:22:42 monstr postgres[1412]: [143-2] ^IThe Postmaster has informed me that some other backend
Nov  5 20:22:42 monstr postgres[15913]: [3-3] ^Idied abnormally and possibly corrupted shared memory.
Nov  5 20:22:42 monstr postgres[1412]: [143-3] ^Idied abnormally and possibly corrupted shared memory.
Nov  5 20:22:42 monstr postgres[15913]: [3-4] ^II have rolled back the current transaction and am
Nov  5 20:22:42 monstr postgres[1412]: [143-4] ^II have rolled back the current transaction and am
Nov  5 20:22:42 monstr postgres[15913]: [3-5] ^Igoing to terminate your database system connection and exit.
Nov  5 20:22:42 monstr postgres[1412]: [143-5] ^Igoing to terminate your database system connection and exit.
Nov  5 20:22:42 monstr postgres[15913]: [3-6] ^IPlease reconnect to the database system and repeat your query.
Nov  5 20:22:42 monstr postgres[1412]: [143-6] ^IPlease reconnect to the database system and repeat your query.
Nov  5 20:22:42 monstr postgres[1109]: [3] LOG:  all server processes terminated; reinitializing shared memory and
semaphores
Nov  5 20:22:42 monstr postgres[16077]: [4] LOG:  database system was interrupted at 2003-11-05 20:17:48 EET
Nov  5 20:22:42 monstr postgres[16077]: [5] LOG:  checkpoint record is at 14/E956482C
Nov  5 20:22:42 monstr postgres[16077]: [6] LOG:  redo record is at 14/E956482C; undo record is at 0/0; shutdown FALSE
Nov  5 20:22:42 monstr postgres[16077]: [7] LOG:  next transaction id: 13580788; next oid: 190057632
Nov  5 20:22:42 monstr postgres[16077]: [8] LOG:  database system was not properly shut down; automatic recovery in
progress
Nov  5 20:22:42 monstr postgres[16077]: [9] LOG:  redo starts at 14/E956486C
Nov  5 20:22:42 monstr postgres[16077]: [10] LOG:  ReadRecord: record with zero length at 14/E95ACA14
Nov  5 20:22:42 monstr postgres[16077]: [11] LOG:  redo done at 14/E95AC9F0
Nov  5 20:22:45 monstr postgres[16077]: [12] LOG:  database system is ready


so db was restarted succesfully itself but all applications that had
static connection (it's tcl apps) where fell

the hard disk is fine (might be, because i've used to check it
at night time from cron like this: cat /dev/hda > /dev/null),
so i have no idea why pg didn't find this file (see error from log).
Does someone?

system is: Linux 2.4.18-5 (RH)

Thank you, Andriy Tkachuk.

--
http://www.imt.com.ua


Re: what could cause this PANIC on enterprise 7.3.4 db?

От
Andriy Tkachuk
Дата:
sorry , guys, for my english

i meant "enterprise db" in subj, not enterprice :)

On Fri, 7 Nov 2003, Andriy Tkachuk wrote:

> Usually db works fine & stable months, but couple days ago
> it was PANIC, and i have no ideas why this happen.
>
> Nov  5 20:22:30 monstr postgres[16071]: [1] LOG:  connection received: host=[local]
> Nov  5 20:22:30 monstr postgres[16071]: [2] LOG:  connection authorized: user=ant database=tele
> Nov  5 20:22:42 monstr postgres[16071]: [3] PANIC:  open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file
ordirectory
 
>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Nov  5 20:22:42 monstr postgres[16071]: [4-1] LOG:  statement: select
> [...] skip the log of failed query (it was just usual query)
> Nov  5 20:22:42 monstr postgres[1109]: [1] LOG:  server process (pid 16071) was terminated by signal 6
> Nov  5 20:22:42 monstr postgres[1109]: [2] LOG:  terminating any other active server processes
> Nov  5 20:22:42 monstr postgres[15913]: [3-1] WARNING:  Message from PostgreSQL backend:
> Nov  5 20:22:42 monstr postgres[1412]: [143-1] WARNING:  Message from PostgreSQL backend:
> Nov  5 20:22:42 monstr postgres[15913]: [3-2] ^IThe Postmaster has informed me that some other backend
> Nov  5 20:22:42 monstr postgres[1412]: [143-2] ^IThe Postmaster has informed me that some other backend
> Nov  5 20:22:42 monstr postgres[15913]: [3-3] ^Idied abnormally and possibly corrupted shared memory.
> Nov  5 20:22:42 monstr postgres[1412]: [143-3] ^Idied abnormally and possibly corrupted shared memory.
> Nov  5 20:22:42 monstr postgres[15913]: [3-4] ^II have rolled back the current transaction and am
> Nov  5 20:22:42 monstr postgres[1412]: [143-4] ^II have rolled back the current transaction and am
> Nov  5 20:22:42 monstr postgres[15913]: [3-5] ^Igoing to terminate your database system connection and exit.
> Nov  5 20:22:42 monstr postgres[1412]: [143-5] ^Igoing to terminate your database system connection and exit.
> Nov  5 20:22:42 monstr postgres[15913]: [3-6] ^IPlease reconnect to the database system and repeat your query.
> Nov  5 20:22:42 monstr postgres[1412]: [143-6] ^IPlease reconnect to the database system and repeat your query.
> Nov  5 20:22:42 monstr postgres[1109]: [3] LOG:  all server processes terminated; reinitializing shared memory and
semaphores
> Nov  5 20:22:42 monstr postgres[16077]: [4] LOG:  database system was interrupted at 2003-11-05 20:17:48 EET
> Nov  5 20:22:42 monstr postgres[16077]: [5] LOG:  checkpoint record is at 14/E956482C
> Nov  5 20:22:42 monstr postgres[16077]: [6] LOG:  redo record is at 14/E956482C; undo record is at 0/0; shutdown
FALSE
> Nov  5 20:22:42 monstr postgres[16077]: [7] LOG:  next transaction id: 13580788; next oid: 190057632
> Nov  5 20:22:42 monstr postgres[16077]: [8] LOG:  database system was not properly shut down; automatic recovery in
progress
> Nov  5 20:22:42 monstr postgres[16077]: [9] LOG:  redo starts at 14/E956486C
> Nov  5 20:22:42 monstr postgres[16077]: [10] LOG:  ReadRecord: record with zero length at 14/E95ACA14
> Nov  5 20:22:42 monstr postgres[16077]: [11] LOG:  redo done at 14/E95AC9F0
> Nov  5 20:22:45 monstr postgres[16077]: [12] LOG:  database system is ready
>
>
> so db was restarted succesfully itself but all applications that had
> static connection (it's tcl apps) where fell
>
> the hard disk is fine (might be, because i've used to check it
> at night time from cron like this: cat /dev/hda > /dev/null),
> so i have no idea why pg didn't find this file (see error from log).
> Does someone?
>
> system is: Linux 2.4.18-5 (RH)
>
> Thank you,
>   Andriy Tkachuk.
>
> --
> http://www.imt.com.ua
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

С уважением,Андрей Ткачук.

http://www.imt.com.ua


Re: what could cause this PANIC on enterprice 7.3.4 db?

От
Tom Lane
Дата:
Andriy Tkachuk <ant@imt.com.ua> writes:
> Nov  5 20:22:42 monstr postgres[16071]: [3] PANIC:  open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file
ordirectory
 

Could we see ls -l /usr/local/pgsql/data/pg_clog/
        regards, tom lane


Re: what could cause this PANIC on enterprise 7.3.4 db?

От
Andriy Tkachuk
Дата:
shure, Tom

[10:49]/2:ant@monstr:~>sudo ls -al /usr/local/pgsql/data/pg_clog
total 40
drwx------    2 pgsql    postgres     4096 Nov  7 03:28 .
drwx------    6 pgsql    root         4096 Oct 23 10:45 ..
-rw-------    1 pgsql    postgres    32768 Nov 10 10:47 000D

[10:49]/2:ant@monstr:~>date
Mon Nov 10 10:49:50 EET 2003

[10:49]/2:ant@monstr:~>uname -sr
Linux 2.4.18-5custom


On Fri, 7 Nov 2003, Tom Lane wrote:

> Andriy Tkachuk <ant@imt.com.ua> writes:
> > Nov  5 20:22:42 monstr postgres[16071]: [3] PANIC:  open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file
ordirectory
 
>
> Could we see ls -l /usr/local/pgsql/data/pg_clog/
>
>             regards, tom lane
>

regards,Andriy Tkachuk
--
http://www.imt.com.ua


Re: what could cause this PANIC on enterprise 7.3.4 db?

От
Tom Lane
Дата:
Andriy Tkachuk <ant@imt.com.ua> writes:
> On Fri, 7 Nov 2003, Tom Lane wrote:
>> Andriy Tkachuk <ant@imt.com.ua> writes:
>>> Nov  5 20:22:42 monstr postgres[16071]: [3] PANIC:  open of /usr/local/pgsql/data/pg_clog/0040 failed: No such file
ordirectory
 
>> 
>> Could we see ls -l /usr/local/pgsql/data/pg_clog/

> [10:49]/2:ant@monstr:~>sudo ls -al /usr/local/pgsql/data/pg_clog
> total 40
> drwx------    2 pgsql    postgres     4096 Nov  7 03:28 .
> drwx------    6 pgsql    root         4096 Oct 23 10:45 ..
> -rw-------    1 pgsql    postgres    32768 Nov 10 10:47 000D

Okay, given that the file the code was trying to access is nowhere near
the current or past set of valid transaction numbers, it's pretty clear
that what you have is a corrupted transaction number in some tuple's
header.  The odds are that not only the transaction number is affected;
usually when we see something like this, anywhere from dozens to
hundreds of bytes have been replaced by garbage data.

In the cases I've been able to study in the past, the cause seemed to
be faulty hardware or possibly kernel bugs --- for instance someone
recently reported a case where a whole kilobyte of a Postgres file had
been replaced with what seemed to be part of a mail message.  I'd
ascribe that to either a disk drive writing a sector at the wrong place,
or the kernel getting confused about which buffer held which file.
So I'd recommend running some hardware diagnostics and checking to see
if there are errata available for your kernel.

As far as cleaning up the immediate damage is concerned, you'll probably
want to use pg_filedump or some such tool to get a better feeling for
the extent of the damage.  There are descriptions of this process in the
archives --- try looking for recent references to pg_filedump.
        regards, tom lane