Discussion: Major Performance decrease after some hours

Major Performance decrease after some hours

From
"Peter Bauer"
Date:
Hi all,

I have a Tomcat application running against PostgreSQL 8.1.4 which
performs about 10000 inserts/deletes and updates on a database every
2-4 minutes. After some hours of load testing, top reports 0.0% idle,
6-7% system load and a load average of 32, 31, 28, and many statements
fail with exceptions like:
An I/O error occured while sending to the backend.

There are 10-15 postmaster processes running which use all the CPU power.
A restart of Tomcat and then PostgreSQL results in the same situation.
Some postgres processes are in "DELETE waiting" or "SELECT waiting" state.
VACUUM completes in about 1-2 seconds, is run via cron every minute,
and reports that the FSM settings are high enough.

The situation only gets better if Tomcat is stopped and PostgreSQL is
restarted -> 90% idle.

The Tomcat application talks to another application on the server
which uses a lot of CPU cycles while Tomcat starts up, and my
observation is that after this startup phase with 0.0% idle,
PostgreSQL can't recover from the situation.

Dual Xeon server machines (Dell PowerEdge 2850) using Heartbeat1. Each
server has only one hard disk; no RAID configuration is used.
- Each Cluster has 4 drbd Devices, one for the PostgreSQL data
- Two of these clusters are using the same PostgreSQL installation to
share the data, the database can be moved from one cluster to the
other in case of failure
- OS: Debian Sarge with postgresql 8.1.4
- Two cronjobs are configured to perform a "vacuumdb --all" every
minute and a "vacuumdb --all --analyze" every 23 minutes
- There are 3 applications using the PostgreSQL installation, each
with their own database.
- The main application is based on Tomcat 4.1-30; it is partly a
web application, partly a terminal login protocol based on HTTP and
XML, so the database access is of course done via JDBC
- A cronjob is configured to perform a pg_dump of the main database
every 4 hours


Any ideas are welcome.

thx,
Peter

Re: Major Performance decrease after some hours

From
"Peter Bauer"
Date:
2006/10/1, Peter Bauer <peter.m.bauer@gmail.com>:
> [...]

Hi again,

I decreased the load of the test and the system has been stable for
some hours now, no exceptions so far, top says 80-90% idle.
The question is: what are the reasons for this behaviour? Is a
hardware upgrade required? How can I check how bad the condition of
the database is?

thx,
Peter

Re: Major Performance decrease after some hours

From
Chris Mair
Date:
Hi,

a few random questions...

> > i have a Tomcat application with PostgreSQL 8.1.4 running which
> > performs about 10000 inserts/deletes every 2-4 minutes and updates on
> > a database and after some hours of loadtesting the top output says
> > 0.0% idle, 6-7% system load, load average 32, 31, 28

Who is responsible for these high load values?
Do you see many postmaster processes in top?


>  and there are
> > many exceptions at statement execution like:
> > An I/O error occured while sending to the backend.

Is this from the application?
How many connections does the application open in parallel?
Using JDBC, I guess?


> > - Each Cluster has 4 drbd Devices, one for the PostgreSQL data
> > - Two of these clusters are using the same PostgreSQL installation to
> > share the data, the database can be moved from one cluster to the
> > other in case of failure

Just to be 100% sure: just one server at a time runs PostgreSQL on
that shared data disk, right?


Bye,
Chris.


Re: Major Performance decrease after some hours

From
MaXX
Date:
Peter Bauer wrote:
[...]
> There are 10-15 postmaster processes running which use all the CPU power.
> A restart of tomcat and then postgresql results in the same situation.
> Some postgres processes are in DELETE waiting or SELECT waiting.
> VACUUM runs through in just about 1-2 seconds and is run via cron
> every minute and reports that the fsm setting are high enough.

Pure speculation: are you sure you aren't vacuuming too aggressively?
The DELETE waiting and SELECT waiting sound to me like they are waiting
for a lock that another vacuum is holding. You said that you spawn a
vacuum process from cron every minute; that may be fine when the server
isn't too loaded, but when vacuums begin to take longer, cron will spawn
more and more vacuums, killing your machine.

I'd add a little check to the cron script to see whether a vacuum
process is already running, or use an independent script instead of
cron, e.g.:

#!/bin/sh
# run one vacuum at a time, a minute apart, instead of letting cron
# start overlapping ones
while true
do
    vacuumdb --all
    sleep 60
done
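
If the cron approach is kept, the "already running" check could be a
small query like this (just a sketch: it assumes stats_command_string
is on so current_query is visible, and it would be run e.g. via
psql -tA before starting vacuumdb):

    -- number of backends currently executing a VACUUM
    SELECT count(*)
      FROM pg_stat_activity
     WHERE current_query LIKE 'VACUUM%';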

[...]

HTH,
--
MaXX

Re: Major Performance decrease after some hours

From
"Peter Bauer"
Date:
2006/10/1, Chris Mair <chrisnospam@1006.org>:
> Hi,
>
> a few random question...
>
> > > i have a Tomcat application with PostgreSQL 8.1.4 running which
> > > performs about 10000 inserts/deletes every 2-4 minutes and updates on
> > > a database and after some hours of loadtesting the top output says
> > > 0.0% idle, 6-7% system load, load average 32, 31, 28
>
> Who is responsible for this high load values?
> Do you see many postmaster processes at the top?

yes, there are about 10 postmaster processes in top which eat up all
of the CPU cycles in roughly equal parts.

> >  and there are
> > > many exceptions at statement execution like:
> > > An I/O error occured while sending to the backend.
>
> Is this from the application?

Yes, this happens when the SQL statements are executed.

> How many connections does the application open in parallel?

30-40; one Tomcat runs on the same machine as PostgreSQL, another runs
on the other cluster and uses the database over the network.

> Using JDBC, I guess?

yes, postgresql-8.1-407.jdbc3 is used

> > > - Each Cluster has 4 drbd Devices, one for the PostgreSQL data
> > > - Two of these clusters are using the same PostgreSQL installation to
> > > share the data, the database can be moved from one cluster to the
> > > other in case of failure
>
> Just to be 100% sure: just one server at a time runs PostgreSQL on
> that shared data disk, right?

One server of a cluster runs the database and Tomcat, the other one
just runs Tomcat, so they share this common database.

> Bye,
> Chris.

thx,
Peter

Re: Major Performance decrease after some hours

From
"Matthew T. O'Connor"
Date:
MaXX wrote:
>> There are 10-15 postmaster processes running which use all the CPU
>> power.
>> A restart of tomcat and then postgresql results in the same situation.
>> Some postgres processes are in DELETE waiting or SELECT waiting.
>> VACUUM runs through in just about 1-2 seconds and is run via cron
>> every minute and reports that the fsm setting are high enough.
>
> Pure speculation: are you sure you aren't vacuuming too agressively?

Have you tried autovacuum?  You can manually tweak the settings so that
your hot spot tables are vacuumed aggressively.  One of the main upsides
to autovacuum is that it saves cycles, that is, it only vacuums when you
need it.
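
For what it's worth, a sketch of how per-table tuning might look on
8.1 (the table name "jobs" and the threshold values are made up purely
for illustration):

    -- 8.1 also needs stats_start_collector = on, stats_row_level = on
    -- and autovacuum = on in postgresql.conf
    INSERT INTO pg_autovacuum
        (vacrelid, enabled, vac_base_thresh, vac_scale_factor,
         anl_base_thresh, anl_scale_factor, vac_cost_delay, vac_cost_limit)
    VALUES ('jobs'::regclass, true, 200, 0.1, 100, 0.05, -1, -1);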

Re: Major Performance decrease after some hours

From
Tom Lane
Date:
"Peter Bauer" <peter.m.bauer@gmail.com> writes:
> yes, there are about 10 postmaster processes in top which eat up all
> of the CPU cycles at equal parts.

What are these processes doing exactly --- can you show us the queries
they're executing?  It might be worth attaching to a few of them with
gdb to get stack traces, to see if there's any commonality about the
traces:
    $ gdb /path/to/postgres PID-of-backend
    gdb> bt
    gdb> quit
    are you sure? y
    $ <repeat with other backends, and/or same backend after a few sec>

            regards, tom lane

Re: Major Performance decrease after some hours

From
"Peter Bauer"
Date:
2006/10/1, MaXX <bs139412@skynet.be>:
> Peter Bauer wrote:
> > 2006/10/1, MaXX <bs139412@skynet.be>:
> >> Peter Bauer wrote:
> >> [...]
> >> > There are 10-15 postmaster processes running which use all the CPU
> >> power.
> >> > A restart of tomcat and then postgresql results in the same situation.
> >> > Some postgres processes are in DELETE waiting or SELECT waiting.
> >> > VACUUM runs through in just about 1-2 seconds and is run via cron
> >> > every minute and reports that the fsm setting are high enough.
> >>
> >> Pure speculation: are you sure you aren't vacuuming too agressively?
> >
> > The previous configuration was to vacuum every 10 minutes. This resulted
> > in 60-70% system load (hard disk, I suppose), so I thought more vacuums
> > would be a good idea.
> It won't hurt to ensure that you will never have 2 vacuum processes
> running at the same time...
> >> The DELETE waiting and SELECT waiting sound to me like they are waiting
> >> for a lock that another vacuum is holding. You've said that you spawn a
> >> vacumm process by cron every minute, that may be fine when the server
> >> isn't too loaded. But when vacuums begins to take longer cron will spawn
> >> more and more vacuums killing your machine.
> > When the postmaster processes eat up all CPU cycles, the vacuum still
> > only takes a few seconds and there are no postgres VACUUM processes
> > hanging around, so I don't think they hamper each other.
> > I have the impression that the database "condition" is getting worse
> > over time, so the queries take longer and longer, and at some point the
> > I/O exceptions start because the data cannot be written to disk in
> > time. I just don't know how to pinpoint this bad "condition" or
> > whatever gets worse over time.
>
> You can connect to your server with pgAdmin and go to "Tools -> Server
> Status" to identify which process is causing trouble on a Sunday ;-)
> (you may have to turn on some logging options in postgresql.conf). I find
> it more friendly than psql, but it's a matter of taste.
> You are running pg on a Linux platform, so try systat and iostat. (I'm
> running FreeBSD; I suppose those two are available on Linux.)
>
> Long-running transactions are not your friends either... I once made a typo
> in a script which prevented commits from happening... not good...

This was the first thing we checked when these problems appeared. The
only locking is done in functions, which should not be a problem. Of
course it is possible that there are bugs which prevent some
transactions from being committed, but it's a pretty big application,
so I would need some advice for finding and debugging such problems.
Is it possible to check which transactions are currently running,
which thread or Java process runs them, and for how long?
I will attach a logfile to the reply to Tom's mail which contains all
SQL statements executed from the start of the load test to the point
the problems began, so maybe you can find something in it.
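
Regarding "which queries are running and for how long": a query like
the following against pg_stat_activity might already help (a sketch
using the 8.1 column names; it needs stats_command_string = on, and it
shows database backends rather than Java threads):

    SELECT procpid, usename, client_addr, query_start, current_query
      FROM pg_stat_activity
     ORDER BY query_start;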

thx,
Peter

Re: Major Performance decrease after some hours

From
"Peter Bauer"
Date:
2006/10/1, Matthew T. O'Connor <matthew@zeut.net>:
> [...]

Autovacuum is enabled and runs in addition to the configured cronjob
vacuums. I will attach logfiles to the reply to Tom's mail, so please
have a look at whether autovacuum can be reconfigured to replace the
cronjobs.

thx,
Peter

Re: Major Performance decrease after some hours

From
Tom Lane
Date:
"Peter Bauer" <peter.m.bauer@gmail.com> writes:
> Attached you can find the postgresql logfiles and a logfile which
> contains all SQL statements executed in the relevant time together
> with the exceptions thrown. I also attached a file with all used
> PL/pgSQL functions. Since we were not able to find a problem for so
> long, I think it makes no sense to filter the information because we
> are probably not looking at the right spots.

I hope the pgsql-general moderators are not going to approve an 8MB
message :-(.  I'll hang onto my copy though, just in case it proves
useful.

>> It might be worth attaching to a few of them with
>> gdb to get stack traces,

> Should this be done when the problems are appearing or during normal
> operation or both to compare the results?

When the problems are appearing.

            regards, tom lane

Re: Major Performance decrease after some hours

From
"Peter Bauer"
Date:
2006/10/2, Tom Lane <tgl@sss.pgh.pa.us>:
> "Peter Bauer" <peter.m.bauer@gmail.com> writes:
> > Attached you can find the postgresql logfiles and a logfile which
> > contains alls SQL statements executed in the relevant time together
> > with the excpetions thrown. I also attached a file with all used
> > Pl/pgSQL functions. Since we were not able to find a problem for so
> > long, i think it makes no sense to filter the information because we
> > are probably not looking on the right spots.
>
> I hope the pgsql-general moderators are not going to approve an 8MB
> message :-(.  I'll hang onto my copy though, just in case it proves
> useful.

Sorry, I put the files on http://dagobert.apus.co.at/pglogfiles/ so
please use those.

> >> It might be worth attaching to a few of them with
> >> gdb to get stack traces,
>
> > Should this be done when the problems are appearing or during normal
> > operation or both to compare the results?
>
> When the problems are appearing.

OK, I will try to reproduce it and post the stack traces.

thx,
Peter

Re: Major Performance decrease after some hours

From
Ray Stell
Date:
On Sun, Oct 01, 2006 at 12:55:51PM +0200, MaXX wrote:
>
> Pure speculation: are you sure you aren't vacuuming too agressively?
> The DELETE waiting and SELECT waiting sound to me like they are waiting
> for a lock that another vacuum is holding.

How would one determine the lock situation definitively?  Is there
an internal mechanism that can be queried?

Re: Major Performance decrease after some hours

From
Tom Lane
Date:
Ray Stell <stellr@cns.vt.edu> writes:
> How would one determine the lock situation definitively?  Is there
> an internal mechanism that can be queried?

pg_locks view.
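
For example, something along these lines (just a sketch using the 8.1
column names) lists which backends hold or wait for locks on which
relation:

    SELECT l.pid, c.relname, l.mode, l.granted
      FROM pg_locks l
      LEFT JOIN pg_class c ON c.oid = l.relation
     ORDER BY l.granted, l.pid;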

            regards, tom lane

Re: Major Performance decrease after some hours

From
"Peter Bauer"
Date:
Hi all,

inspired by the recent posting "Weird disk write load caused by
PostgreSQL?" I increased work_mem from 1 MB to 7 MB and did some load
testing with vacuum every 10 minutes. The system load (hard disk) went
down and everything was very stable at 80% idle for nearly 24 hours!
I am currently performing some pgbench runs to evaluate the hardware
and configuration for the system, but I think the biggest problems are
solved so far.
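
For reference, the change itself is a one-liner; a sketch (in 8.1
work_mem is given in KB, so 7 MB is 7168, and it can be tried per
session before editing postgresql.conf):

    SHOW work_mem;        -- old value: 1024, i.e. 1 MB
    SET work_mem = 7168;  -- ~7 MB, for this session only
    -- permanent change: work_mem = 7168 in postgresql.conf, then reload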

thx everybody,
Peter

2006/10/2, Tom Lane <tgl@sss.pgh.pa.us>:
> [...]

Re: Major Performance decrease after some hours

From
Alexander Staubo
Date:
It appears to me that work_mem is a more significant configuration
option than previously assumed by many PostgreSQL users, myself
included. As with many database optimizations, it's an obscure
problem to diagnose because you generally only observe it through I/O
activity.

One possibility would be to log a warning whenever work_mem is
exceeded (or exceeded by a certain ratio). I would also love a couple
of new statistics counters tracking the amount of work memory used
and the amount of work memory that has spilled over into pgsql_tmp.

Alexander.

On Oct 5, 2006, at 10:48 , Peter Bauer wrote:

> [...]


Re: Major Performance decrease after some hours

From
"Peter Bauer"
Date:
I finished a little benchmarking on our server and the results are
quite curious.
With the numbers from http://sitening.com/tools/postgresql-benchmark/
in mind, I ran
./pgbench -i pgbench
and then performed some pgbench tests, for example
./pgbench -c 1 -t 1000 -s 1 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 50.703609 (including connections establishing)
tps = 50.709265 (excluding connections establishing)

So our server with two 3.00 GHz Xeon CPUs and 2 GB of RAM has about 5%
of the performance of the server described in the article!

I did some tests on a Xen machine running on my workstation and the
results are about 400-500 tps, which seems quite reasonable.

I also tried to disable drbd and put the data directory elsewhere, but
the performance was the same.

Any ideas?

thx,
Peter


2006/10/5, Alexander Staubo <alex@purefiction.net>:
> [...]

Re: Major Performance decrease after some hours

From
"Peter Bauer"
Date:
I forgot to mention that top does not show a noticeable increase in
CPU or system load during the pgbench runs (postmaster has 4-8% CPU).
Shouldn't the machine be busy during such a test?

thx,
Peter

2006/10/5, Peter Bauer <peter.m.bauer@gmail.com>:
> [...]

Re: Major Performance decrease after some hours

From
Alexander Staubo
Date:
If you are on Linux, I recommend iostat(1) and vmstat(8) over top.

Iostat will report I/O transfer statistics; it's how I discovered
that work_mem buffers were spilling over to disk files. For vmstat,
look in particular at the load (i.e., how many processes are competing
for the scheduler) in the first field ("r") and how many processes
are blocked on I/O waits ("b").

Alexander.

On Oct 5, 2006, at 14:35 , Peter Bauer wrote:

> [...]


Re: Major Performance decrease after some hours

From
"Peter Bauer"
Date:
It seems that the machine doesn't really care about the pgbench run. I did a
pgbench -c 10 -t 10000 -s 10 pgbench
and here is the output of "vmstat 1 100", which was started a few
seconds before pgbench:
vmstat 1 100
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0  10236 1240952 111324 553908    0    0     3     2    1     3  5  3 92  0
 0  0  10236 1240944 111324 553908    0    0     0     0  167    48  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0    20  169    52  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  165    44  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  164    44  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  165    47  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  165    47  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  164    49  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  165    40  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  165    41  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  165    45  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  165    48  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  164    42  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  165    45  0  0 100  0
 0  0  10236 1240936 111332 553908    0    0     0     0  165    41  0  0 100  0
 0  2  10236 1237688 111332 550256    0    0     0 10976  449  1793 13  1 86  0
 0  2  10236 1237688 111332 550256    0    0     0  1928  345  3206  2  1 97  0
 0  2  10236 1237680 111340 550256    0    0     0  1969  352  3012  4  1 95  0
 0  2  10236 1237600 111340 550336    0    0     0  2096  353  2986  2  0 98  0
 0  2  10236 1237608 111340 550336    0    0     0  1932  351  2985  1  2 97  0
 0  2  10236 1237600 111340 550336    0    0     0  1836  350  3097  4  1 95  0
 0  2  10236 1237600 111340 550336    0    0     0  1852  353  2971  3  1 95  0
 4  1  10236 1237600 111340 550336    0    0     0  1975  372  2682  3  0 97  0
 0  2  10236 1237608 111340 550336    0    0     0  2056  372  2458  2  1 96  0
 0  2  10236 1237600 111340 550336    0    0     0  2028  377  2360  1  1 98  0
 0  2  10236 1237588 111340 550336    0    0     0  2000  372  2630  3  1 95  0
 0  2  10236 1237588 111340 550336    0    0     0  2044  372  2326  3  0 97  0
 0  2  10236 1237588 111340 550336    0    0     0  1976  372  2171  2  0 98  0
 0  2  10236 1237588 111340 550336    0    0     0  1972  383  2275  4  1 95  0
 1  2  10236 1237588 111340 550336    0    0     0  1924  382  2500  3  1 95  0
 0  2  10236 1237588 111340 550336    0    0     0  1804  372  2798  3  1 96  0
 0  2  10236 1237588 111340 550336    0    0     0  1900  374  2974  6  2 92  0
 0  2  10236 1237588 111340 550336    0    0     0  1820  361  2871  2  2 95  0
 0  2  10236 1237576 111340 550336    0    0     0  1876  366  2762  4  1 95  0
 0  2  10236 1237576 111340 550336    0    0     0  1904  370  2724  3  0 96  0
 0  2  10236 1237576 111340 550336    0    0     0  1972  378  2585  6  0 93  0
 6  1  10236 1237576 111340 550336    0    0     0  1800  371  2838  1  1 98  0
 0  2  10236 1237576 111340 550336    0    0     0  1792  362  2826  3  1 96  0
 0  2  10236 1237576 111340 550344    0    0     0  1804  362  3068  3  0 96  0
 1  1  10236 1237560 111340 550360    0    0     0  1936  373  2718  2  1 96  0
 6  1  10236 1237552 111340 550360    0    0     0  1788  365  2447  4  1 95  0
 0  1  10236 1237552 111340 550360    0    0     0  1804  368  2362  5  2 93  0
 0  2  10236 1237544 111340 550368    0    0     0  1908  370  2434  3  4 93  0
 0  2  10236 1237544 111340 550368    0    0     0  1848  369  2360  4  1 94  0
 0  2  10236 1237504 111340 550408    0    0     0  1796  358  2655  3  1 96  0
 0  2  10236 1237496 111340 550416    0    0     0  1988  374  2491  4  1 95  0
 0  2  10236 1237488 111340 550424    0    0     0  1960  372  2111  2  1 97  0
 0  2  10236 1237488 111340 550424    0    0     0  1760  360  2433  4  1 95  0
 0  2  10236 1237488 111340 550424    0    0     0  1944  374  2064  2  1 97  0
 0  2  10236 1237496 111340 550424    0    0     0  1868  373  2169  3  0 97  0
 4  1  10236 1237476 111340 550432    0    0     0  1868  372  2170  3  1 96  0

I did the same on the Xen machine:
vmstat 1 100
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  0      4   3964   5916  99288    0    0     1    21   34    53  0  0 99  0
 0  0      4   3964   5916  99288    0    0     0     0   50    20  0  0 100  0
 0  0      4   3964   5916  99288    0    0     0     0   38    20  0  0 100  0
 0  0      4   3964   5916  99288    0    0     0     0   33    18  0  0 100  0
 0  0      4   3964   5916  99288    0    0     0     0   28    20  0  1 99  0
 0  0      4   3964   5924  99288    0    0     0    28   50    25  0  0 100  0
 0  0      4   3964   5924  99288    0    0     0     0   24    19  0  0 100  0
 0  0      4   3968   5924  99288    0    0     0     0   37    20  0  0 100  0
 0  0      4   3968   5924  99288    0    0     0     0   50    24  0  0 100  0
 0  0      4   3772   5924  99292    0    0     0     0   33    29  0  0 100  0
 0  0      4   3748   5924  99292    0    0     0     0   31    19  0  0 100  0
 0  0      4   3744   5924  99292    0    0     0     0   43    25  0  0 100  0
 1  0     12   2560   5496 100048    0    0    44 17004 1896   812 26 10 56  8
 2  0     16   2388   5080  95152    0    0   180 25788 2505 11372 54 19  2 25
 1  1     16   2576   5084  94956    0    0     0  6824 1215 23437 50 19  1 31
 2  0     16   2668   5084  94872    0    0     0  6588 1188 22923 51 12  2 35
 1  1     16   2600   5088  94840    0    0     0  8664 1701 22326 51 10  2 37
 0  1     16   2432   5100  94968    0    0     0  5492 1183  9985 26  6  2 66
 0  1     16   2464   5112  95048    0    0     0  2404  495  5670 18  5  1 76
 1  0     16   2596   5112  94980    0    0    32  6036 1082 21986 42 16  2 41
 1  0     16   2244   5124  95268    0    0     0  7740 1526 20645 37 14  2 48
 2  1     16   2540   5108  95064    0    0     0  7016 1343 18769 46 12  2 41
 0  1     16   2752   5108  94668    0    0     0  5244 1165  8660 16  6  1 77
 0  2     16   2780   5116  94668    0    0     8   648   80    95  0  1  0 99
 4  0     16   2736   5140  94716    0    0     0  1160  363  2556  9  2  1 88
 1  0     16   2268   5148  95036    0    0     8  5112  853 21498 67 13  2 18
 2  1     16   2788   5048  94676    0    0     8  7876 1535 21278 65 12  2 21
 0  1     16   2764   5060  94788    0    0     0  5372 1203  7024 21  6  1 72
 1  0     16   2648   5076  94932    0    0    12  3112  596 10241 24  9  2 65
 1  0     16   2728   5056  94772    0    0    24  6152 1142 19822 47 16  4 34
 1  0     16   2504   5068  95068    0    0     0  7196 1387 20474 50 17  2 32
 1  1     16   2652   5064  94844    0    0     8  7108 1018 17572 56  8  3 33
 0  1     16   2568   5076  94916    0    0     0  4460 1003  5825 14  1  1 84
 0  1     16   2572   5076  94924    0    0     0   704   52    90  0  0  2 98
 1  0     16   2300   5096  95116    0    0     0  3688  762 14174 52  9  2 37
 1  1     16   2436   5080  95080    0    0     0  7256 1407 19964 66 21  1 12
 1  0     16   2640   5092  94904    0    0     0  6504 1223 19809 69 15  1 15
 0  1     16   2528   5104  94960    0    0     0  4964 1165  4573 18  3  1 78
 1  1     16   2332   5116  95064    0    0     0  2492  521  7197 15  5  1 79
 1  1     16   2580   5080  94900    0    0     0  5076  863 19775 60 14  2 24
 2  0     16   2728   5020  94732    0    0     0  7636 1533 19246 51 15  1 33
 0  0     16   2484   5032  94980    0    0     0  6068 1200 16340 50 16  2 32
 1  0     16   2316   5044  95036    0    0     0  3940  934  3570 10  5  3 82
 0  2     16   2788   5024  94628    0    0     0   640   63    46  0  0  0 100
 0  1     16   2876   5060  94636    0    0     0  1212  402   602  2  0  3 95
 1  1     16   2580   5072  94876    0    0     0  6440 1255 17866 68 11  2 19
 1  0     16   2636   5084  94788    0    0     0  7188 1405 18753 66 16  1 17
 0  1     16   2580   5084  94828    0    0     0   544  116  2877 12  1  0 87
 2  1     16   2536   5104  94908    0    0     0  2968  656  5413 26  5  2 67

The only difference I see is that there is always at least one process
waiting for I/O.

thx,
Peter

2006/10/5, Alexander Staubo <alex@purefiction.net>:
> [...]

Re: Major Performance decrease after some hours

From
Tom Lane
Date:
"Peter Bauer" <peter.m.bauer@gmail.com> writes:
> tps = 50.703609 (including connections establishing)
> tps = 50.709265 (excluding connections establishing)

That's about what you ought to expect for a single transaction stream
running on honest disk hardware (i.e., disks that don't lie about write
completion).  You can't commit a transaction more often than once per
disk revolution, because you have to wait for the current WAL file
endpoint to pass under the heads again.  If there are multiple clients
then "ganging" concurrent commits is possible, but you tested only one.

The benchmark you reference might have been done on disks with a
battery-backed write cache.  Or it might have been just plain unsafe
(i.e., the equivalent of fsync off, but in hardware :-()

            regards, tom lane

Re: Major Performance decrease after some hours

From
"Peter Bauer"
Date:
Hi all,

2006/10/5, Tom Lane <tgl@sss.pgh.pa.us>:
> "Peter Bauer" <peter.m.bauer@gmail.com> writes:
> > tps = 50.703609 (including connections establishing)
> > tps = 50.709265 (excluding connections establishing)
>
> That's about what you ought to expect for a single transaction stream
> running on honest disk hardware (ie, disks that don't lie about write
> complete).  You can't commit a transaction more often than once per disk
> revolution, because you have to wait for the current WAL file endpoint
> to pass under the heads again.  If there are multiple clients then
> "ganging" concurrent commits is possible, but you tested only one.
>
> The benchmark you reference might have been done on disks with battery
> backed write cache.  Or it might have been just plain unsafe (ie, the
> equivalent of fsync off, but in hardware :-()

You are right. I performed the pgbench tests on another machine with
the same hardware but a kernel which supports the onboard Dell RAID
controller with battery-backed write cache, and the result is about
400 tps. We will see how much difference this makes in practice, but
at least I know where the "problem" was.

thx,
Peter