Discussion: Problem with data corruption and psql memory usage

Problem with data corruption and psql memory usage

From
Gerhard Wiesinger
Date:
Hello!

I'm new to PostgreSQL and I did an import of about 2.8 million rows
with normal INSERT commands.

The config was (differences from the default config):
listen_addresses = '*'
temp_buffers = 20MB                    # min 800kB
work_mem = 20MB                                # min 64kB
maintenance_work_mem = 32MB            # min 1MB
fsync = off                            # turns forced synchronization on or off
full_page_writes = off
wal_buffers = 20MB
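
Of the settings listed above, fsync and full_page_writes are the two that trade crash safety for speed: the PostgreSQL documentation warns that turning either of them off can lead to unrecoverable corruption after an operating-system crash or power failure. What follows is a minimal sketch of a check that flags them in a postgresql.conf fragment. The helper names parse_conf and risky_settings are hypothetical, and the parser is deliberately simple (it does not handle a '#' inside a quoted value).

```python
# Settings that weaken crash safety when switched off (assumption: only
# these two are checked; other settings are ignored by this sketch).
RISKY_WHEN_OFF = {"fsync", "full_page_writes"}
OFF_VALUES = {"off", "false", "no", "0"}


def parse_conf(text):
    """Parse 'name = value  # comment' lines into a dict (simplified)."""
    settings = {}
    for raw in text.splitlines():
        # Drop trailing comments; a '#' inside quotes is not handled here.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip().lower()] = value.strip().strip("'").lower()
    return settings


def risky_settings(text):
    """Return the sorted names of durability settings that are turned off."""
    settings = parse_conf(text)
    return sorted(n for n in RISKY_WHEN_OFF if settings.get(n) in OFF_VALUES)


if __name__ == "__main__":
    sample = """
    listen_addresses = '*'
    fsync = off                 # turns forced synchronization on or off
    full_page_writes = off
    wal_buffers = 20MB
    """
    print(risky_settings(sample))
```

Run against the configuration quoted above, the check reports both fsync and full_page_writes.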

It crashed with a core dump (ulimit -c 0):
LOG:  server process (PID 12720) was terminated by signal 11
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

Afterwards I got the following error messages:
WARNING:  index "table_pkey" contains 2572948 row versions, but table contains 2572949 row versions
HINT:  Rebuild the index with REINDEX.

LOG:  server process (PID 13794) was terminated by signal 11
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.

LOG:  could not fsync segment 0 of relation 1663/16386/42726: Input/output error
ERROR:  storage sync failed on magnetic disk: Input/output error

ERROR:  could not access status of transaction 808464434
DETAIL:  Could not open file "pg_clog/0303": No such file or directory.

Afterwards I got:
ERROR:  could not access status of transaction 5526085
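
The two transaction IDs in these errors can be checked with a little arithmetic. The sketch below assumes PostgreSQL's default 8 kB block size, 2 status bits per transaction, 32 pages per pg_clog segment file, and a little-endian machine; the function names are made up for illustration. It computes which pg_clog segment a transaction ID maps to and shows the four bytes that make up the ID.

```python
import struct

# Assumed defaults: 4 transaction statuses per byte, 8192-byte pages,
# 32 pages per pg_clog segment file.
XACTS_PER_BYTE = 4
BLCKSZ = 8192
PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = XACTS_PER_BYTE * BLCKSZ * PAGES_PER_SEGMENT


def clog_segment(xid):
    """Name of the pg_clog segment file that holds the status of xid."""
    return "%04X" % (xid // XACTS_PER_SEGMENT)


def xid_bytes(xid):
    """The 4 bytes of xid as stored on a little-endian machine."""
    return struct.pack("<I", xid)


if __name__ == "__main__":
    for xid in (808464434, 5526085):
        print(xid, clog_segment(xid), xid_bytes(xid))
```

For 808464434 this gives segment 0303, matching the file name in the error, and the bytes read as the ASCII text "2000"; for 5526085 the bytes are "ERT" followed by a zero byte. Transaction IDs that decode to printable text commonly indicate that a tuple header was overwritten with unrelated data rather than that a clog file genuinely went missing, although this alone does not say where the bad bytes came from.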

There were also some coredumps afterwards where I have a stack trace:
#0  0x0807d241 in heap_deform_tuple ()
#1  0x08095b8c in toast_delete ()
#2  0x0809432e in heap_delete ()
#3  0x0814bfa4 in ExecutorRun ()
#4  0x081d7ece in FreeQueryDesc ()
#5  0x081d80c1 in FreeQueryDesc ()
#6  0x081d8979 in PortalRun ()
#7  0x081d4480 in pg_parse_query ()
#8  0x081d5a57 in PostgresMain ()
#9  0x081ad4fe in ClosePostmasterPorts ()
#10 0x081ae307 in PostmasterMain ()
#11 0x0816dec0 in main ()

So my questions are:
1.) Are my settings too aggressive (fsync=off, full_page_writes=off)?
2.) Should PostgreSQL still recover after a core dump with these 2
settings in effect, or is data corruption normal with these settings?
3.) Any ideas about the reason for the core dumps?

Write access came from only one session at a time. From other sessions
I only ran select count(*) from table.

Afterwards I cleaned up the tables, did a pg_dumpall and restore with a
fresh initdb, and removed these 2 settings, and everything went fine.

I also had a problem with psql:
psql < file.sql
=> psql took around 2 GB of virtual memory, with heavy swapping. After
pressing Ctrl-C and restarting it, it worked well. Any ideas?

The machine is stable, so I would say that a hardware failure is not the
problem.

The PostgreSQL version is 8.2.3 on FC6.

Thank you for the answer.

Ciao,
Gerhard

--
http://www.wiesinger.com/

Re: Problem with data corruption and psql memory usage

From
Tom Lane
Date:
Gerhard Wiesinger <lists@wiesinger.com> writes:
> LOG:  could not fsync segment 0 of relation 1663/16386/42726: Input/output
> error

[ raised eyebrow... ]  I think your machine is flakier than you believe.
This error is particularly damning, but the general pattern of weird
failures all over the place seems to me to fit the idea of hardware
problems much better than any other explanation.  FC6 and PG 8.2.3 are
both pretty darn stable for most people, so there's *something* wrong
with your installation, and unstable hardware is the first thing that
comes to mind.

            regards, tom lane

Re: Problem with data corruption and psql memory usage

From
Gerhard Wiesinger
Date:
Hello Tom!

I don't think this is a hardware problem. The machine has been running
24/7 for around 4 years without any problems, with daily backups of GBs
of data to it, uptimes until the next kernel security patch, etc.

The only problem I could believe is:
I'm running the FC7 test packages of postgresql on FC6 and maybe there is
a slight glibc library conflict or some other incompatibility.

Ciao,
Gerhard

--
http://www.wiesinger.com/


On Wed, 9 May 2007, Tom Lane wrote:

> Gerhard Wiesinger <lists@wiesinger.com> writes:
>> LOG:  could not fsync segment 0 of relation 1663/16386/42726: Input/output
>> error
>
> [ raised eyebrow... ]  I think your machine is flakier than you believe.
> This error is particularly damning, but the general pattern of weird
> failures all over the place seems to me to fit the idea of hardware
> problems much better than any other explanation.  FC6 and PG 8.2.3 are
> both pretty darn stable for most people, so there's *something* wrong
> with your installation, and unstable hardware is the first thing that
> comes to mind.
>
>             regards, tom lane

Re: Problem with data corruption and psql memory usage

From
Tom Lane
Date:
Gerhard Wiesinger <lists@wiesinger.com> writes:
> The only problem I could believe is:
> I'm running the FC7 test packages of postgresql in FC6 and maybe there is
> a slight glibc library conflict or any other incompatibility.

Hmm, I'd be suspicious of that too.  You'd be well advised to take the FC7
SRPM and rebuild it locally to ensure you don't have any conflicts of
that sort.

            regards, tom lane

Re: Problem with data corruption and psql memory usage

From
Scott Marlowe
Date:
On Wed, 2007-05-09 at 11:18, Gerhard Wiesinger wrote:
> Hello Tom!
>
> I don't think this is a hardware problem. Machine runs 24/7 for around 4
> years without any problems, daily backup with GBs of data to it,
> uptimes to the next kernel security patch, etc.
>
> The only problem I could believe is:
> I'm running the FC7 test packages of postgresql in FC6 and maybe there is
> a slight glibc library conflict or any other incompatibility.

While I agree with Tom that you should look at recompiling the FC7
packages for FC6, hardware does break in strange ways sometimes.  A piece
of dust in just the right place, a bit of heat sink compound that
finally migrated onto a circuit trace.

I'd test the hardware to be sure.

Re: Problem with data corruption and psql memory usage

From
Gerhard Wiesinger
Date:
Hello Tom,

A late answer, but an answer :-) :
In the end, it was a very strange hardware problem: a very small part
of the RAM was defective, but the kernel never crashed.

I also saw very strange behavior when verifying rpm packages with rpm
-V. At first I suspected the hard disk. But then I flushed the OS caches:
echo 3 > /proc/sys/vm/drop_caches
and rpm -V was correct. => RAM issue.
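
The same diagnostic idea (read a file, drop the page cache, read it again, and compare) can be sketched in a few lines. This is only an illustration under stated assumptions: it is Linux-only, drop_page_cache needs root, and the function names are hypothetical. If the two checksums differ while the file has not been modified, the cached copy in RAM or the read path is suspect rather than the data on disk.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file by reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drop_page_cache():
    """Equivalent of 'echo 3 > /proc/sys/vm/drop_caches' (Linux, root only)."""
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def reads_are_stable(path, between_reads=drop_page_cache):
    """Return True if two reads of path, separated by between_reads(), agree."""
    first = sha256_of(path)
    between_reads()  # by default, force the second read to come from disk
    second = sha256_of(path)
    return first == second
```

A call such as reads_are_stable("/usr/bin/psql") would be one way to try it (the path is only an example). A single matching pair proves little, since a defect in a small region of RAM only shows up when the data happens to land there, so a dedicated memory tester remains the more reliable check.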

memtest86+ very quickly showed defective RAM.

So PostgreSQL didn't have any issue :-)

Ciao,
Gerhard

--
http://www.wiesinger.com/


On Wed, 9 May 2007, Gerhard Wiesinger wrote:

> Hello Tom!
>
> I don't think this is a hardware problem. Machine runs 24/7 for around 4
> years without any problems, daily backup with GBs of data to it, uptimes to
> the next kernel security patch, etc.
>
> The only problem I could believe is:
> I'm running the FC7 test packages of postgresql in FC6 and maybe there is a
> slight glibc library conflict or any other incompatibility.
>
> Ciao,
> Gerhard
>
> --
> http://www.wiesinger.com/