Обсуждение: performance enhancements for PostgreSQL

Поиск
Список
Период
Сортировка

performance enhancements for PostgreSQL

От
"Johnson, Shaunn"
Дата:

Howdy:

I have a Linux server running PostgreSQL 7.2.1. with about 1 Gig of
memory.  The proc speed is about 1.14 GHz. 

I'm getting more and more concerned about how often
the database gets used and the days (like today) where I
wonder if buying more memory rather than buying a 2nd CPU
was such a great idea.

In an effort to enhance / streamline performance, I've done
the following:

* memory upgrade from 512M to 1G
* move RAID5 to scsi drives (10K RPM)
* set up cron script to vacuum database weekly
* set number of client connects ( i.e., /usr/bin/postmaster -i -B 128 -N 64 -d 4 )

But I'm at the point now that I can't kill some jobs.  Yes, I know
I shouldn't use 'kill' in any forceful way, but just a kill seems to do
nothing (or, if it is doing something, it's not fast enough
for the user community and it's stopping production).

I got this from a co-worker:

[snip]

the server is blocking on access to the metadata tables.  not even
logins are being processed.  I'm not sure what caused the
problem, but I think a database restart is the best course,
which I have been trying to do.  Do not kill -9, as it will corrupt  the WAL.

[/snip]

I see some errors in the messages file regarding the RAID drives
(the filesystem where the  database lives)

[snip error]

Nov 18 12:33:51 hmp kernel: scsi : aborting command due to timeout : pid 0, scsi
2, channel 2, id 0, lun 1 Write (10) 20 11 d2 76 9f 00 00 40 00
Nov 18 12:33:51 hmp kernel: scsi : aborting command due to timeout : pid 0, scsi
2, channel 2, id 0, lun 1 Write (10) 20 11 d2 76 df 00 00 08 00
Nov 18 12:33:51 hmp kernel: scsi : aborting command due to timeout : pid 0, scsi
2, channel 2, id 0, lun 1 Write (10) 20 11 d2 76 ef 00 00 08 00
Nov 18 12:33:52 hmp kernel: scsi : aborting command due to timeout : pid 0, scsi
2, channel 2, id 0, lun 1 Write (10) 20 11 d2 77 27 00 00 40 00
Nov 18 12:33:52 hmp kernel: scsi : aborting command due to timeout : pid 0, scsi
2, channel 2, id 0, lun 1 Write (10) 20 11 d2 77 67 00 00 40 00
Nov 18 12:33:53 hmp kernel: scsi : aborting command due to timeout : pid 0, scsi

[/snip error]

my questions are:

* have can I figure out how access to the metadata tables
   are being stopped? (my guess is the error on the scsi drive, but ... )
* how to restart PostgreSQL without running the risk of corrupting data?
* what are the benefits to adding a 2nd CPU over, say, more memory?

I'm sorry that I don't have enough information at this time ... I'm getting
swamped by users as I type this.

Thanks!

-X

Re: performance enhancements for PostgreSQL

От
Doug McNaught
Дата:
"Johnson, Shaunn" <SJohnson6@bcbsm.com> writes:

> 2, channel 2, id 0, lun 1 Write (10) 20 11 d2 77 67 00 00 40 00
> Nov 18 12:33:53 hmp kernel: scsi : aborting command due to timeout : pid 0,
> scsi
>
> [/snip error]
>
>
> my questions are:
>
> * have can I figure out how access to the metadata tables
>    are being stopped? (my guess is the error on the scsi drive, but ... )

Good guess.

> * how to restart PostgreSQL without running the risk of corrupting data?

If you're getting hardware errors you may have to kill it hard, or
just reboot.  WAL should keep you from losing much data.

> * what are the benefits to adding a 2nd CPU over, say, more memory?

I wouldn't worry about either one before you FIX YOUR DISK DRIVE!

If one of your drives is timing out on writes, that's going to slow the
whole thing down tremendously.  Check cabling and termination first.

-Dpiug

Re: performance enhancements for PostgreSQL

От
"scott.marlowe"
Дата:
On Mon, 18 Nov 2002, Johnson, Shaunn wrote:

> Howdy:
>
> I have a Linux server running PostgreSQL 7.2.1. with about 1 Gig of
> memory.  The proc speed is about 1.14 GHz.
>
> I'm getting more and more concerned about how often
> the database gets used and the days (like today) where I
> wonder if buying more memory rather than buying a 2nd CPU
> was such a great idea.
>
> In an effort to enhance / streamline performance, I've done
> the following:
>
> * memory upgrade from 512M to 1G
> * move RAID5 to scsi drives (10K RPM)
> * set up cron script to vacuum database weekly
> * set number of client connects ( i.e., /usr/bin/postmaster -i -B 128 -N 64
> -d 4 )

You could probably allocate WAY more buffer blocks than that.  I run 4000
on most of my medium weight machines, and with a gig of ram you could get
away with quite a bit more, but you'll need to increads shmmax and shmall
to go very high.   But I don't think that's your major problem.

More important, don't start the postmaster that way.  Edit the
$PGDATA/postgresql.conf file, then use the pg_ctl command to start and
stop it.

> But I'm at the point now that I can't kill some jobs.  Yes, I know
> I shouldn't use 'kill' in any forceful way, but just a kill seems to do
> nothing (or, if it is doing something, it's not fast enough
> for the user community and it's stopping production).

You can kill individual backends pretty safely, it's the postmaster you
can't kill -9 safely.  If you see a postgres child running away with all
your memory etc... you can kill -9 that pid pretty safely.

> I got this from a co-worker:
>
> [snip]
>
> the server is blocking on access to the metadata tables.  not even
> logins are being processed.  I'm not sure what caused the
> problem, but I think a database restart is the best course,
> which I have been trying to do.  Do not kill -9, as it will corrupt  the
> WAL.

If you want to shut down the server and it doesn't seem to respond to
pg_ctl stop, try 'pg_ctl -m fast stop' and see if that works.

> I see some errors in the messages file regarding the RAID drives
> (the filesystem where the  database lives)
>
> [snip error]
>
> Nov 18 12:33:51 hmp kernel: scsi : aborting command due to timeout : pid 0,
> scsi
> 2, channel 2, id 0, lun 1 Write (10) 20 11 d2 76 9f 00 00 40 00
> Nov 18 12:33:51 hmp kernel: scsi : aborting command due to timeout : pid 0,
> scsi
> 2, channel 2, id 0, lun 1 Write (10) 20 11 d2 76 df 00 00 08 00
> Nov 18 12:33:51 hmp kernel: scsi : aborting command due to timeout : pid 0,
> scsi
> 2, channel 2, id 0, lun 1 Write (10) 20 11 d2 76 ef 00 00 08 00
> Nov 18 12:33:52 hmp kernel: scsi : aborting command due to timeout : pid 0,
> scsi
> 2, channel 2, id 0, lun 1 Write (10) 20 11 d2 77 27 00 00 40 00
> Nov 18 12:33:52 hmp kernel: scsi : aborting command due to timeout : pid 0,
> scsi
> 2, channel 2, id 0, lun 1 Write (10) 20 11 d2 77 67 00 00 40 00
> Nov 18 12:33:53 hmp kernel: scsi : aborting command due to timeout : pid 0,
> scsi

That looks like a dead drive in your RAID array.  What does 'cat
/proc/mdstat' say about the drive (are you using linux software raid, or a
hardware controller?)

> * have can I figure out how access to the metadata tables
>    are being stopped? (my guess is the error on the scsi drive, but ... )

Sounds like processes are hanging, and users are just trying to reconnect
over and over and you're running out of connections.  This is a symptom,
not the problem, which is your machine is having issues.

> * how to restart PostgreSQL without running the risk of corrupting data?
> * what are the benefits to adding a 2nd CPU over, say, more memory?

pg_ctl -m fast stop

> I'm sorry that I don't have enough information at this time ... I'm getting
> swamped by users as I type this.

Well, good luck.  Write back to let us know how things are going.


Re: performance enhancements for PostgreSQL

От
Mike Mascari
Дата:
scott.marlowe wrote:
> On Mon, 18 Nov 2002, Johnson, Shaunn wrote:
>
>>But I'm at the point now that I can't kill some jobs.  Yes, I know
>>I shouldn't use 'kill' in any forceful way, but just a kill seems to do
>>nothing (or, if it is doing something, it's not fast enough
>>for the user community and it's stopping production).
>

Kill -9 the postmaster and all postgres processes. This doesn't
solve your problem though.

> You can kill individual backends pretty safely, it's the postmaster you
> can't kill -9 safely.

Feel free to kill -9 the postmaster process. Be sure to also
kill all postgres processes. Shared memory segments won't be
released though.

>>the server is blocking on access to the metadata tables.  not even
>>logins are being processed.  I'm not sure what caused the
>>problem, but I think a database restart is the best course,
>>which I have been trying to do.  Do not kill -9, as it will corrupt  the
>>WAL.

False. Feel free to kill -9 the postmaster and all backends.

>
> If you want to shut down the server and it doesn't seem to respond to
> pg_ctl stop, try 'pg_ctl -m fast stop' and see if that works.

> Sounds like processes are hanging, and users are just trying to reconnect
> over and over and you're running out of connections.  This is a symptom,
> not the problem, which is your machine is having issues.

Agreed.

>>* how to restart PostgreSQL without running the risk of corrupting data?
>>* what are the benefits to adding a 2nd CPU over, say, more memory?
>
> pg_ctl -m fast stop

If that doesn't work (I doubt it will):

killall -9 postmaster
killall -9 postgres

Killing the postmaster and postgres processes with -9 will not
corrupt your data. A flaky HD/SCSI controller/device driver
might though. The whole point of an ACID compliant database is
that you can yank the power and it will not result in
corruption. I'd guess that postgres processes are not responding
to signals because they are trapped in a kernel call for I/O in
a device driver that is either buggy itself, or isn't robust
enough to handle I/O errors from a bad device - SCSI controller,
HD, etc.

Mike Mascari
mascarm@mascari.com