Обсуждение: Huge spikes in number of connections doing "PARSE"

Поиск
Список
Период
Сортировка

Huge spikes in number of connections doing "PARSE"

От
hubert depesz lubaczewski
Дата:
hi,
i have a server with pg 8.3.11.

Every now and then, for no apparent reason (io normal, cpu usage normal,
web traffic normal) i see hundreds of connections in "PARSE" state.

to give some perspective - we're doing in the lowest traffic of the day
around 3000 transactions per second, with ~ 500-600 connections.

So. every now and then (couple of times per day at most). I see hundreds
(800-900) of connections in "PARSE" state.

I did notice one thing.

we do log output of ps axo user,pid,ppid,pgrp,%cpu,%mem,rss,lstart,nice,nlwp,sgi_p,cputime,tty,wchan:25,args
every 15 seconds or so.

And based on its output, I was able to get stats of "wchan" of all PARSE
pg processes when the problem was logged.
Results:

805 x semtimedop
 10 x stext

Any ideas on what could be wrong? Machine was definitely not loaded most
of the times it happened.

The problem usually goes away in ~ 10-15 seconds.

Any hints? Suggestions?

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Huge spikes in number of connections doing "PARSE"

От
Noah Misch
Дата:
On Wed, Mar 09, 2011 at 09:38:07PM +0100, hubert depesz lubaczewski wrote:
> So. every now and then (couple of times per day at most). I see hundreds
> (800-900) of connections in "PARSE" state.
>
> I did notice one thing.
>
> we do log output of ps axo user,pid,ppid,pgrp,%cpu,%mem,rss,lstart,nice,nlwp,sgi_p,cputime,tty,wchan:25,args
> every 15 seconds or so.
>
> And based on its output, I was able to get stats of "wchan" of all PARSE
> pg processes when the problem was logged.
> Results:
>
> 805 x semtimedop
>  10 x stext
>
> Any ideas on what could be wrong? Machine was definitely not loaded most
> of the times it happened.
>
> The problem usually goes away in ~ 10-15 seconds.

Would you have your monitoring process detect this condition and capture stack
traces, preferably from a gdb with access to debug information, of several of
these processes?  That will probably make the specific contention point clear.

Re: Huge spikes in number of connections doing "PARSE"

От
hubert depesz lubaczewski
Дата:
On Fri, Mar 11, 2011 at 03:03:55AM -0500, Noah Misch wrote:
> On Wed, Mar 09, 2011 at 09:38:07PM +0100, hubert depesz lubaczewski wrote:
> > So. every now and then (couple of times per day at most). I see hundreds
> > (800-900) of connections in "PARSE" state.
> >
> > I did notice one thing.
> >
> > we do log output of ps axo user,pid,ppid,pgrp,%cpu,%mem,rss,lstart,nice,nlwp,sgi_p,cputime,tty,wchan:25,args
> > every 15 seconds or so.
> >
> > And based on its output, I was able to get stats of "wchan" of all PARSE
> > pg processes when the problem was logged.
> > Results:
> >
> > 805 x semtimedop
> >  10 x stext
> >
> > Any ideas on what could be wrong? Machine was definitely not loaded most
> > of the times it happened.
> >
> > The problem usually goes away in ~ 10-15 seconds.
>
> Would you have your monitoring process detect this condition and capture stack
> traces, preferably from a gdb with access to debug information, of several of
> these processes?  That will probably make the specific contention point clear.

unfortunately debug was not enabled on this server, and changing
binaries would be rather complicated as it's production environment.

i'm not a c programmer, can you tell me how to get stack trace (assuming
it makes any sense without debug enabled) without damaging the process
in any way?

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Huge spikes in number of connections doing "PARSE"

От
Noah Misch
Дата:
On Fri, Mar 11, 2011 at 04:13:52PM +0100, hubert depesz lubaczewski wrote:
> On Fri, Mar 11, 2011 at 03:03:55AM -0500, Noah Misch wrote:
> > On Wed, Mar 09, 2011 at 09:38:07PM +0100, hubert depesz lubaczewski wrote:
> > > So. every now and then (couple of times per day at most). I see hundreds
> > > (800-900) of connections in "PARSE" state.
> > >
> > > I did notice one thing.
> > >
> > > we do log output of ps axo user,pid,ppid,pgrp,%cpu,%mem,rss,lstart,nice,nlwp,sgi_p,cputime,tty,wchan:25,args
> > > every 15 seconds or so.
> > >
> > > And based on its output, I was able to get stats of "wchan" of all PARSE
> > > pg processes when the problem was logged.
> > > Results:
> > >
> > > 805 x semtimedop
> > >  10 x stext
> > >
> > > Any ideas on what could be wrong? Machine was definitely not loaded most
> > > of the times it happened.
> > >
> > > The problem usually goes away in ~ 10-15 seconds.
> >
> > Would you have your monitoring process detect this condition and capture stack
> > traces, preferably from a gdb with access to debug information, of several of
> > these processes?  That will probably make the specific contention point clear.
>
> unfortunately debug was not enabled on this server, and changing
> binaries would be rather complicated as it's production environment.

Understood.  Not a critical problem, most likely.

> i'm not a c programmer, can you tell me how to get stack trace (assuming
> it makes any sense without debug enabled) without damaging the process
> in any way?

  gdb -ex=bt /path/to/bin/postgres $pid </dev/null

I've used this on production systems to debug issues like this one, and I've
never observed damage.  The exact effect of debugger attach/detach may be
OS/kernel-dependent, so it's hard to make categorical guarantees of safety.

nm

Re: Huge spikes in number of connections doing "PARSE"

От
hubert depesz lubaczewski
Дата:
On Fri, Mar 11, 2011 at 11:13:43AM -0500, Noah Misch wrote:
>   gdb -ex=bt /path/to/bin/postgres $pid </dev/null

thanks. added cronjob that will get stacktrace of random 10 backend in
parse state, when we'll have more than 100 of them.

It will take some time to gather some more evidence though.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Huge spikes in number of connections doing "PARSE"

От
Gurjeet Singh
Дата:
On Fri, Mar 11, 2011 at 11:13 AM, Noah Misch <noah@leadboat.com> wrote:

  gdb -ex=bt /path/to/bin/postgres $pid </dev/null

I've used this on production systems to debug issues like this one, and I've
never observed damage.  The exact effect of debugger attach/detach may be
OS/kernel-dependent, so it's hard to make categorical guarantees of safety.

At least on Windows NT and derivatives (until I last checked a few years ago),  detaching a debugger kills the debugged process.

Regards,
--
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: Huge spikes in number of connections doing "PARSE"

От
hubert depesz lubaczewski
Дата:
On Fri, Mar 11, 2011 at 11:13:43AM -0500, Noah Misch wrote:
>   gdb -ex=bt /path/to/bin/postgres $pid </dev/null

hi
so, let me remind what's what.

I wrote a script, that every 15 seconds, checks system for Pg backends in
"PARSE" state. If there are more than 100 of them, script randombly chooses
10 of them, and runs "gdb -batch -quiet -ex=bt /usr/bin/postgres PID" on
them.

Over the weekend I got 2125 such stack traces logged, but only 60 of them
happened when we had such huge unexpected spikes (this db server is quite
busy), with over 400 parsing backends.

These 60 were summarized, and output is available here:
http://www.depesz.com/various/locks.summary.txt

as you can seem, in 48 cases backend process was in semop(), which relates
directly to my previous findings with ps/wchan.

summary format is:
     11 0x00000031884d4665 in recv () from /lib64/libc.so.6
        #0  0x00000031884d4665 in recv () from /lib64/libc.so.6
        #1  0x00000000005366d7 in secure_read ()
        #2  0x000000000053d204 in ?? ()
        #3  0x000000000053d607 in pq_getbyte ()
        #4  0x00000000005afa6d in PostgresMain ()
        #5  0x00000000005857b4 in ?? ()
        #6  0x000000000058643a in PostmasterMain ()
        #7  0x000000000053efde in main ()

means that there were 11 stack traces exacly like the one shown next to number "11"

Any ideas based on the stack traces in the file ( the file itself is 20kB, so I
didn't want to put it in email )

Best regards,

depesz


--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Huge spikes in number of connections doing "PARSE"

От
Martijn van Oosterhout
Дата:
On Mon, Mar 14, 2011 at 07:49:46PM +0100, hubert depesz lubaczewski wrote:
> These 60 were summarized, and output is available here:
> http://www.depesz.com/various/locks.summary.txt
>
> as you can seem, in 48 cases backend process was in semop(), which relates
> directly to my previous findings with ps/wchan.

It's unfortunate you don't have debug symbols enabled, which makes
these traces somewhat unreliable. So you get odd things like index_open
calling index_close.

The common factor seems to be lots of index locks. Do you have very
many indexes?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Вложения

Re: Huge spikes in number of connections doing "PARSE"

От
hubert depesz lubaczewski
Дата:
On Mon, Mar 14, 2011 at 10:21:27PM +0100, Martijn van Oosterhout wrote:
> The common factor seems to be lots of index locks. Do you have very
> many indexes?

$ select count(*) from pg_class  where relkind = 'i';
 count
-------
   450
(1 row)

$ select count(*) from pg_class  where relkind = 'r';
 count
-------
   201
(1 row)

not sure if that's many.



--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Huge spikes in number of connections doing "PARSE"

От
Noah Misch
Дата:
On Mon, Mar 14, 2011 at 07:49:46PM +0100, hubert depesz lubaczewski wrote:
> I wrote a script, that every 15 seconds, checks system for Pg backends in
> "PARSE" state. If there are more than 100 of them, script randombly chooses
> 10 of them, and runs "gdb -batch -quiet -ex=bt /usr/bin/postgres PID" on
> them.
>
> Over the weekend I got 2125 such stack traces logged, but only 60 of them
> happened when we had such huge unexpected spikes (this db server is quite
> busy), with over 400 parsing backends.
>
> These 60 were summarized, and output is available here:
> http://www.depesz.com/various/locks.summary.txt

Thanks; that's exactly what I wanted.

> Any ideas based on the stack traces in the file ( the file itself is 20kB, so I
> didn't want to put it in email )

The common theme is contention over updating the shared lock table, with the
lockers targeting system catalogs.  Such requests normally get satisfied from
backend-local caches, greatly reducing the amount of lock activity.  That isn't
happening, indicating that either your backend lifespan is very short or your
caches are getting reset.  More likely the latter.

What is a typical lifespan for a backend in this system?  What sort of
connection pooling are you using, if any?

Do you create temporary objects or perform other DDL regularly?  In PostgreSQL
8.3, when any one backend-local local cache fell too far out of date, all
backends would take a full cache reset.  That seems like the most credible
explanation for your symptoms.  PostgreSQL 8.4 brought major changes to this
area (commit fad153ec45299bd4d4f29dec8d9e04e2f1c08148), so upgrading might
eliminate the spikes you're seeing.

Regardless of PostgreSQL version, constraining the number of concurrent backends
at the connection pool level will make these events less frequent and costly.
You'll almost always be better off running 200 backends and letting 200
additional queries wait than running 400 backends.

nm

Re: Huge spikes in number of connections doing "PARSE"

От
Noah Misch
Дата:
On Mon, Mar 14, 2011 at 10:21:27PM +0100, Martijn van Oosterhout wrote:
> On Mon, Mar 14, 2011 at 07:49:46PM +0100, hubert depesz lubaczewski wrote:
> > These 60 were summarized, and output is available here:
> > http://www.depesz.com/various/locks.summary.txt
> >
> > as you can seem, in 48 cases backend process was in semop(), which relates
> > directly to my previous findings with ps/wchan.
>
> It's unfortunate you don't have debug symbols enabled, which makes
> these traces somewhat unreliable. So you get odd things like index_open
> calling index_close.

That trace is legitimate.  If index_open()'s call into the relcache needs to
consult the catalogs, it will in turn scan pg_class_oid_index.

Re: Huge spikes in number of connections doing "PARSE"

От
hubert depesz lubaczewski
Дата:
On Tue, Mar 15, 2011 at 12:13:21AM -0400, Noah Misch wrote:
> What is a typical lifespan for a backend in this system?  What sort of
> connection pooling are you using, if any?

quite long, but:

we have n (~ 40 i think) web servers. each webserver has it's own
pgbouncer (in session pooling).

application connects to pgbouncer, issues (usually) single query, and
disconnects.

pgbouncers are set to close oder connections, if unused, after (i think
2 minutes. (i don't actually have access there).

generally, through the day we see from 400 to 800 connections, mostly
idle, but sometimes it goes much higher (like 1400), and then the
connections are mostly parsing.

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Huge spikes in number of connections doing "PARSE"

От
Tom Lane
Дата:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Tue, Mar 15, 2011 at 12:13:21AM -0400, Noah Misch wrote:
>> What is a typical lifespan for a backend in this system?  What sort of
>> connection pooling are you using, if any?

> quite long, but:

> we have n (~ 40 i think) web servers. each webserver has it's own
> pgbouncer (in session pooling).

> application connects to pgbouncer, issues (usually) single query, and
> disconnects.

> pgbouncers are set to close oder connections, if unused, after (i think
> 2 minutes. (i don't actually have access there).

> generally, through the day we see from 400 to 800 connections, mostly
> idle, but sometimes it goes much higher (like 1400), and then the
> connections are mostly parsing.

If the backends are "mostly idle" then it'd be a good idea to set a
smaller maximum limit on their number.  Idle backends can be a
performance risk pre-8.4 because of the cache reset problem Noah pointed
to.  You'd be better off having pgbouncer delay incoming queries until
there's a free backend to pass them to.

            regards, tom lane

Re: Huge spikes in number of connections doing "PARSE"

От
hubert depesz lubaczewski
Дата:
On Tue, Mar 15, 2011 at 10:20:38AM -0400, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > On Tue, Mar 15, 2011 at 12:13:21AM -0400, Noah Misch wrote:
> >> What is a typical lifespan for a backend in this system?  What sort of
> >> connection pooling are you using, if any?
>
> > quite long, but:
>
> > we have n (~ 40 i think) web servers. each webserver has it's own
> > pgbouncer (in session pooling).
>
> > application connects to pgbouncer, issues (usually) single query, and
> > disconnects.
>
> > pgbouncers are set to close oder connections, if unused, after (i think
> > 2 minutes. (i don't actually have access there).
>
> > generally, through the day we see from 400 to 800 connections, mostly
> > idle, but sometimes it goes much higher (like 1400), and then the
> > connections are mostly parsing.
>
> If the backends are "mostly idle" then it'd be a good idea to set a
> smaller maximum limit on their number.  Idle backends can be a
> performance risk pre-8.4 because of the cache reset problem Noah pointed
> to.  You'd be better off having pgbouncer delay incoming queries until
> there's a free backend to pass them to.

while I understand it, and will pass this information to client, one
thing bugs me - one the problem happens only sometimes, with relatively
low count of backends?
we saw database running normally with 1200 connections, yet sometimes,
it has 500 connections, and then something happens and they all end up
locking in this parse mode.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Huge spikes in number of connections doing "PARSE"

От
Tom Lane
Дата:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> while I understand it, and will pass this information to client, one
> thing bugs me - one the problem happens only sometimes, with relatively
> low count of backends?
> we saw database running normally with 1200 connections, yet sometimes,
> it has 500 connections, and then something happens and they all end up
> locking in this parse mode.

Well, if they're all busy, you're not going to get this cache reset
problem.  It's idle backends that have fallen behind that create the
problem.

            regards, tom lane

Re: Huge spikes in number of connections doing "PARSE"

От
hubert depesz lubaczewski
Дата:
On Tue, Mar 15, 2011 at 10:58:52AM -0400, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > while I understand it, and will pass this information to client, one
> > thing bugs me - one the problem happens only sometimes, with relatively
> > low count of backends?
> > we saw database running normally with 1200 connections, yet sometimes,
> > it has 500 connections, and then something happens and they all end up
> > locking in this parse mode.
>
> Well, if they're all busy, you're not going to get this cache reset
> problem.  It's idle backends that have fallen behind that create the
> problem.

thanks.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Huge spikes in number of connections doing "PARSE"

От
Dimitri Fontaine
Дата:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> we have n (~ 40 i think) web servers. each webserver has it's own
> pgbouncer (in session pooling).

In some cases I have found useful to have those webserver's pgbouncer
connect to another pgbouncer on the database host.  But if your problem
is tied to real active connection, I don't see what it would solve.
Still, if you're searching ideas…

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

Re: Huge spikes in number of connections doing "PARSE"

От
hubert depesz lubaczewski
Дата:
On Sat, Apr 30, 2011 at 08:55:09PM +0200, Dimitri Fontaine wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > we have n (~ 40 i think) web servers. each webserver has it's own
> > pgbouncer (in session pooling).
>
> In some cases I have found useful to have those webserver's pgbouncer
> connect to another pgbouncer on the database host.  But if your problem
> is tied to real active connection, I don't see what it would solve.
> Still, if you're searching ideas…

thanks. we'll probably look into it, but the situaion kind of stalled
now.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/