Re: bug, bad memory, or bad disk?

From: Amit Kapila
Subject: Re: bug, bad memory, or bad disk?
Msg-id: 00a601ce0b85$f1fcb730$d5f62590$@kapila@huawei.com
In reply to: bug, bad memory, or bad disk?  (Ben Chobot <bench@silentmedia.com>)
List: pgsql-general
On Friday, February 15, 2013 1:33 AM Ben Chobot wrote:

> 2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1] ERROR:  invalid memory alloc request size 1968078400
> 2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1] ERROR:  invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641
> 2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1] ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1] ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1] ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1] ERROR:  invalid memory alloc request size 1968078400
> 2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1] ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1] ERROR:  invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627
> 2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1] ERROR:  invalid page header in block 2368 of relation pg_tblspc/16435/PG_9.1_201105231/188417/60418945

> There didn't seem to be much correlation to which files were affected,
> and this was a critical server, so once we realized a simple reindex
> wasn't going to solve things, we shut it down and brought up a slave as
> the new master db.

> While that seemed to fix these issues, we soon noticed problems with
> missing clog files. The missing clogs were outside the range of the
> existing clogs, so we tried using dummy clog files. It didn't help, and
> running pg_check we found that one block of one table was definitely
> corrupt. Worse, that corruption had spread to all our replicas.

Can you check whether the corrupted block belongs to one of the relations
mentioned in your errors? This is just to reconfirm.

> I know this is a little sparse on details, but my questions are:

> 1. What kind of fault should I be looking to fix? Because it spread to
> all the replicas, both those that stream and those that replicate by
> replaying wals in the wal archive, I assume it's not a storage issue.
> (My understanding is that streaming replicas stream their changes from
> memory, not from wals.)

Streaming replicas stream their changes from the WAL.

> 2. Is it possible that the corruption that was on the master got
> replicated to the slaves when I tried to cleanly shut down the master
> before bringing up a new slave as the new master and switching the other
> slaves over to replicating from that?

At shutdown, the master sends all remaining WAL (up to the shutdown
checkpoint) to the standbys.

I think there are two issues in your mail:
1. Access to corrupted blocks. There are two questions here: how the
block got corrupted on the master, and why the corruption was replicated
to the other servers.

The corrupted block can be replicated through WAL, because WAL contains
backup copies of blocks (full-page images) when full_page_writes=on, which
is the default configuration.
So I think the main remaining question is how the block(s) got corrupted
on the master. For that, some more information is required, such as what
kind of operations are performed on the relation that has the corrupted
block.
If we drop and recreate that relation, does the problem remain?
Is there any chance that the block got corrupted due to a hardware
problem?

2. Missing clog files. How did you find that clog files were missing: did
some operation fail, or was it just an observation?
Do you see any problems in the system because of it?
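
To make the full-page-image point in item 1 concrete, here is a rough toy
model in Python. It is only a sketch under stated assumptions, not how
PostgreSQL is implemented: the Server class, the function names, the
16-byte page size and block number 7 are all made up for illustration.
It shows that if a page is already damaged in the primary's memory when
its full-page image is written to WAL, replay on a standby copies the
damage, even though the standby's own storage is healthy.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 16  # toy value (assumption); real PostgreSQL pages default to 8192 bytes


@dataclass
class Server:
    """Hypothetical toy server: maps block number -> page contents."""
    pages: dict = field(default_factory=dict)


def log_full_page_image(wal, block_no, page):
    """Append a complete copy of the page to the toy WAL.

    This stands in for the backup block that is logged when
    full_page_writes is on.
    """
    wal.append((block_no, bytes(page)))


def replay(wal, standby):
    """Toy standby replay: each full-page image overwrites the standby's block."""
    for block_no, image in wal:
        standby.pages[block_no] = image


def simulate():
    good = b"A" * PAGE_SIZE
    primary = Server({7: good})
    standby = Server({7: good})
    wal = []

    # Hypothetical fault: the page is damaged in the primary's memory
    # before the change is logged.
    primary.pages[7] = b"\x00" * PAGE_SIZE

    # The damaged page image goes into WAL and is then replayed.
    log_full_page_image(wal, 7, primary.pages[7])
    replay(wal, standby)
    return primary, standby, good


if __name__ == "__main__":
    primary, standby, good = simulate()
    print("standby matches primary:", standby.pages[7] == primary.pages[7])
    print("standby still has good page:", standby.pages[7] == good)
```

In this model the standby ends up with the same bad page as the primary,
which is why corruption reaching every replica does not by itself rule
out a hardware fault on the master.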

With Regards,
Amit Kapila.
