Discussion: Frustration

Frustration

From:
Michael Simms
Date:
Hi

I will admit I am getting reeeeeallllly frustrated right now. Currently
PostgreSQL is crashing approximately once every 5 minutes on me.

template1=> select version();
version                                                            
-------------------------------------------------------------------
PostgreSQL 6.5.2 on i686-pc-linux-gnu, compiled by gcc egcs-2.91.66
(1 row)

I am not doing anything except very basic commands, things like inserts
and updates, and nothing involving too many expressions.

Now, I know nobody can debug anything from what I have just said, but I
cannot get a better set of bug reports. I CAN'T get postgres to send out any debug output.

For example, I start it using:

/usr/bin/postmaster -o "-F -S 10240" -d 3 -S -N 512 -B 3000 -D/var/lib/pgsql/data -o -F > /tmp/postmasterout 2> /tmp/postmastererr


Note that in there I have -d 3, and I redirect the output (this is under /bin/sh) to /tmp.

Now, after repeated backend crashes, I have:

[postgres@home bin]$ cat /tmp/postmastererr 
FindExec: found "/usr/bin/postgres" using argv[0]
binding ShmemCreate(key=52e2c1, size=31684608)
[postgres@home bin]$ cat /tmp/postmasterout 
[postgres@home bin]$ 

Or exactly NOTHING

This is out-of-the-box 6.5.2: no changes made, except in the config to make it
install into the right place.

I just need to get some debug output, so I can actually report something. Am I
doing something very dumb, or SHOULD there be debug output here that isn't appearing?

I am about ready to pull my hair out over this. I NEED to have a stable
database, and a crash EVERY five minutes is not helping me at all {:-(

Also, I seem to remember that someone posted here that when one backend
crashed, it shouldn't close the other backends any more. Well, mine does.

NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.

I am getting this about every five minutes. I wish I knew what was causing it.
Even if the backend recovered and just performed the query again, that
would be enough; having to check whether the database has crashed EVERY TIME
I start or finish performing a query is a huge overhead.

I can appreciate that the backend that crashed cannot do this, but the others
surely can! Roll back and start again, instead of rolling back and panicking.

Apologies if I sound a bit stressed right now; I was under the impression
I had tested my system, and so I opened it to the public, and now it
is blowing up in my face BADLY.

If someone can tell me WHAT I am doing wrong with getting the debug info,
please please do! I am just watching it blow up again as we speak, and I
must get SOMETHING fixed ASAP.
                ~Michael


Re: [HACKERS] Frustration

From:
Tom Lane
Date:
Michael Simms <grim@argh.demon.co.uk> writes:
> Now, I know nobody can debug anything from what I have just said, but
> I cannot get a better set of bug reports. I CAN'T get postgres to send
> out any debug output.

> /usr/bin/postmaster -o "-F -S 10240" -d 3 -S -N 512 -B 3000 -D/var/lib/pgsql/data -o -F > /tmp/postmasterout 2> /tmp/postmastererr

Don't use the -S switch (the second one, not the one inside -o).

Looking in postmaster.c, I see that causes it to redirect stdout/stderr
to /dev/null (probably not too hot an idea, but that's doubtless been
like that for a *long* time).  Instead launch with something like
nohup postmaster switches... </dev/null >logfile 2>errfile &

Good luck figuring out where the real problem is...
        regards, tom lane


Re: [HACKERS] Frustration

From:
Michael Simms
Date:
> Good luck figuring out where the real problem is...
> 
>             regards, tom lane

Well, thanks to tom, I know what was wrong, and I have found the problem,
or one of them at least...

FATAL: s_lock(0c9ef824) at bufmgr.c:1106, stuck spinlock. Aborting.

Okee, that segment of code is, well, it's some deep down internals that
are as clear as mud to me.

Anyone in the know have an idea what this does?

Just to save you looking, it is included below.

One question: does PostgreSQL Inc have a 'normal person' support
level? I ask because I was planning on getting some of the commercial
support, and whilst it is a reasonable price to pay for corporations or
people with truckloads of money, I am a humble developer with more
expenses than income, and $600 is just way out of my league {:-(

If not, fair enough; I just thought I'd ask, because the support I have had from
this list is excellent and I wanted to provide some payback to the
development group.
            ~Michael

/*
 * WaitIO -- Block until the IO_IN_PROGRESS flag on 'buf' is cleared.
 *
 *      Because IO_IN_PROGRESS conflicts are expected to be rare, there is
 *      only one BufferIO lock in the entire system.  All processes block
 *      on this semaphore when they try to use a buffer that someone else
 *      is faulting in.  Whenever a process finishes an IO and someone is
 *      waiting for the buffer, BufferIO is signaled (SignalIO).  All
 *      waiting processes then wake up and check to see if their buffer is
 *      now ready.  This implementation is simple, but efficient enough if
 *      WaitIO is rarely called by multiple processes simultaneously.
 *
 *      ProcSleep atomically releases the spinlock and goes to sleep.
 *
 *      Note: there is an easy fix if the queue becomes long.  Save the id
 *      of the buffer we are waiting for in the queue structure.  That way
 *      signal can figure out which proc to wake up.
 */
#ifdef HAS_TEST_AND_SET
static void
WaitIO(BufferDesc *buf, SPINLOCK spinlock)
{
    SpinRelease(spinlock);
    S_LOCK(&(buf->io_in_progress_lock));
    S_UNLOCK(&(buf->io_in_progress_lock));
    SpinAcquire(spinlock);
}


Re: [HACKERS] Frustration

From:
Tom Lane
Date:
Michael Simms <grim@argh.demon.co.uk> writes:
> Well, thanks to tom, I know what was wrong, and I have found the problem,
> or one of them at least...
> FATAL: s_lock(0c9ef824) at bufmgr.c:1106, stuck spinlock. Aborting.
> Okee, that segment of code is, well, it's some deep down internals that
> are as clear as mud to me.

Hmph.  Apparently, some backend was waiting for some other backend to
finish reading a page in or writing it out, and gave up after deciding
it had waited an unreasonable amount of time (~ 1 minute, which does
seem plenty long enough).  Probably, the I/O did in fact finish, but
the waiting backend didn't get the word for some reason.
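
Roughly, the check that produces that FATAL message has the shape sketched
below. This is only an illustration, not the actual s_lock.c code: the retry
delay, the timeout value, and the use of a compiler test-and-set builtin are
all assumptions made here for the example.

/*
 * Sketch only: spin on a test-and-set lock, and if it hasn't come free
 * after a generous timeout, report a stuck spinlock and abort the backend.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef volatile int slock_t;

#define SPIN_DELAY_USEC   10000L             /* 10 ms between retries (assumed) */
#define SPIN_TIMEOUT_USEC (60L * 1000000L)   /* give up after ~1 minute (assumed) */

static void
s_lock_sketch(slock_t *lock, const char *file, int line)
{
    long    waited = 0;

    /* __sync_lock_test_and_set returns the old value: nonzero means still held */
    while (__sync_lock_test_and_set(lock, 1))
    {
        if (waited >= SPIN_TIMEOUT_USEC)
        {
            fprintf(stderr, "FATAL: s_lock(%p) at %s:%d, stuck spinlock. Aborting.\n",
                    (void *) lock, file, line);
            abort();                         /* kills this backend */
        }
        usleep(SPIN_DELAY_USEC);
        waited += SPIN_DELAY_USEC;
    }
}

int
main(void)
{
    slock_t lock = 0;

    s_lock_sketch(&lock, __FILE__, __LINE__);   /* free lock: acquired at once */
    printf("lock acquired\n");
    return 0;
}

The point is simply that a waiter whose lock is never released will
eventually hit that abort().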

Is it possible that there's something wrong with the spinlock code on
your hardware?  There are a bunch of different spinlock implementations
(assembly code for various hardware) in include/storage/s_lock.h and
backend/storage/buffer/s_lock.c.  Some of 'em might not be as well
tested as others.  But you're on PC hardware, right?  I would've thought
that flavor of the code would be pretty well wrung out.

Another likely explanation is that there's something wrong in
bufmgr.c's logic for setting and releasing the io_in_progress lock ---
but a quick look doesn't show any obvious error, and I would have
thought we'd have found out about any such problem long since.
Since we're not being buried in reports of stuck-spinlock errors,
I'm guessing there is some platform-specific problem on your machine.
No good ideas what it is if it isn't a spinlock failure.

(Finally, are you sure this is the *only* indication of trouble in
the logs?  If a backend crashed while holding the spinlock, the other
ones would eventually die with complaints like this, but that wouldn't
make the spinlock code be at fault...)
        regards, tom lane


RE: [HACKERS] Frustration

From:
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
> Sent: Friday, September 24, 1999 11:27 PM
> To: Michael Simms
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Frustration
>
>
> Michael Simms <grim@argh.demon.co.uk> writes:
> > Well, thanks to tom, I know what was wrong, and I have found
> > the problem,
> > or one of them at least...
> > FATAL: s_lock(0c9ef824) at bufmgr.c:1106, stuck spinlock. Aborting.
> > Okee, that segment of code is, well, it's some deep down internals that
> > are as clear as mud to me.
>
> Hmph.  Apparently, some backend was waiting for some other backend to
> finish reading a page in or writing it out, and gave up after deciding
> it had waited an unreasonable amount of time (~ 1 minute, which does
> seem plenty long enough).  Probably, the I/O did in fact finish, but
> the waiting backend didn't get the word for some reason.
>

[snip]

>
> Another likely explanation is that there's something wrong in
> bufmgr.c's logic for setting and releasing the io_in_progress lock ---
> but a quick look doesn't show any obvious error, and I would have
> thought we'd have found out about any such problem long since.
> Since we're not being buried in reports of stuck-spinlock errors,
> I'm guessing there is some platform-specific problem on your machine.
> No good ideas what it is if it isn't a spinlock failure.
>

Unlike other spinlocks, the io_in_progress spinlock is a per-buffer-page
spinlock, and ProcReleaseSpins() doesn't release it.
If an error (in md.c in most cases) occurred while holding the spinlock,
the spinlock would necessarily freeze.

Michael Simms says
    ERROR:  cannot read block 641 of server
occurred before the spinlock stuck abort.

Probably it is an original cause of the spinlock freeze.
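
As a self-contained sketch of that failure mode (this is not bufmgr.c; the
helper names and the longjmp stand-in for elog(ERROR) are made up for
illustration): the backend takes the per-page lock, the read fails, the
error path unwinds past the release, and the lock stays held for every
other backend that later touches the page.

/* Sketch: lock is taken, the "read" fails, the error path longjmps out
 * (as elog(ERROR) does), and the lock is never released. */
#include <setjmp.h>
#include <stdio.h>

typedef volatile int slock_t;

static slock_t io_in_progress_lock;    /* stands in for the per-page lock */
static jmp_buf elog_error_context;     /* stands in for elog(ERROR) recovery */

static void
fake_elog_error(const char *msg)
{
    fprintf(stderr, "ERROR:  %s\n", msg);
    longjmp(elog_error_context, 1);    /* unwind, like elog(ERROR) */
}

static void
start_page_io(int read_fails)
{
    while (__sync_lock_test_and_set(&io_in_progress_lock, 1))
        ;                              /* acquire the per-page spinlock */

    if (read_fails)
        fake_elog_error("cannot read block 641 of server");
        /* the lock is still held here, and nothing releases it on this path */

    __sync_lock_release(&io_in_progress_lock);
}

int
main(void)
{
    if (setjmp(elog_error_context) == 0)
        start_page_io(1);              /* simulate the md.c read failure */

    /* ProcReleaseSpins() knows nothing about this lock, so it stays held: */
    printf("io_in_progress_lock = %d (still held)\n", io_in_progress_lock);
    return 0;
}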

However, I don't understand the following status of his machine.

Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hda3              1109780    704964    347461  67% /
/dev/hda1                33149      6140     25297  20% /boot
/dev/hdc1              9515145   3248272   5773207  36% /home
/dev/hdb1               402852    154144    227903  40% /tmp
/dev/sda1   30356106785018642307    43892061535609608   0 100%
/var/lib/pgsql

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: [HACKERS] Frustration

From:
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> Unlike other spinlocks, the io_in_progress spinlock is a per-buffer-page
> spinlock, and ProcReleaseSpins() doesn't release it.
> If an error (in md.c in most cases) occurred while holding the spinlock,
> the spinlock would necessarily freeze.

Oooh, good point.  Shouldn't this be fixed?  If we don't fix it, then
a disk I/O error will translate to an installation-wide shutdown and
restart as soon as some backend tries to touch the locked page (as
indeed was happening to Michael).  That seems a tad extreme.

> Michael Simms says
>     ERROR:  cannot read block 641 of server
> occurred before the spinlock stuck abort.
> Probably it is an original cause of the spinlock freeze.

I seem to have missed the message containing that bit of info,
but it certainly suggests that your diagnosis is correct.

> However I don't understand the following status of his machine.
> /dev/sda1   30356106785018642307    43892061535609608   0 100%

Now that we know the root problem was disk driver flakiness, I think
we can write that off as Not Our Fault ;-)
        regards, tom lane


RE: [HACKERS] Frustration

From:
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Monday, September 27, 1999 10:20 PM
> To: Hiroshi Inoue
> Cc: Michael Simms; pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Frustration 
> 
> 
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > Unlike other spinlocks, the io_in_progress spinlock is a per-buffer-page
> > spinlock, and ProcReleaseSpins() doesn't release it.
> > If an error (in md.c in most cases) occurred while holding the spinlock,
> > the spinlock would necessarily freeze.
> 
> Oooh, good point.  Shouldn't this be fixed?  If we don't fix it, then

Yes, it's on TODO:
* spinlock stuck problem when elog(FATAL) and elog(ERROR) inside bufmgr

I would try to fix it.
> a disk I/O error will translate to an installation-wide shutdown and
> restart as soon as some backend tries to touch the locked page (as
> indeed was happening to Michael).  That seems a tad extreme.
> 
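
Roughly the shape of the fix I have in mind, as a self-contained sketch only
(the names here are made up, and this is not a patch): remember which
buffer's io_in_progress lock this backend is holding, and release it from
the same error-cleanup path that already calls ProcReleaseSpins().

#include <stdio.h>

typedef volatile int slock_t;

typedef struct
{
    int     blocknum;
    slock_t io_in_progress_lock;
} FakeBufferDesc;

/* buffer this backend is currently faulting in, if any */
static FakeBufferDesc *InProgressBuf = NULL;

static void
start_io(FakeBufferDesc *buf)
{
    while (__sync_lock_test_and_set(&buf->io_in_progress_lock, 1))
        ;                              /* acquire the per-page lock */
    InProgressBuf = buf;
}

static void
finish_io(FakeBufferDesc *buf)
{
    __sync_lock_release(&buf->io_in_progress_lock);
    InProgressBuf = NULL;
}

/* Called from the elog(ERROR)/elog(FATAL) cleanup path, alongside
 * ProcReleaseSpins(): a failed I/O must not leave the per-page lock
 * stuck for every other backend. */
static void
abort_buffer_io(void)
{
    if (InProgressBuf != NULL)
        finish_io(InProgressBuf);
}

int
main(void)
{
    FakeBufferDesc buf = {641, 0};

    start_io(&buf);
    /* ... simulated md.c read failure; elog(ERROR) would fire here ... */
    abort_buffer_io();                 /* cleanup releases the lock */

    printf("io_in_progress_lock after cleanup = %d\n", buf.io_in_progress_lock);
    return 0;
}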

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp