Discussion: One source of constant annoyance identified
...or so it seems: I configured our webserver not to use persistent connections et voilà - no more 200+MB backends! I knew there was supposed to be some bug in PHP, so users of Apache/PHP are discouraged from using persistent connections. As we are using ColdFusion/IIS on Win2k Server with ODBC, I never suspected that there would be a similar issue with my configuration.

Now we have just switched off the persistent connection option (which was a true winner for our Oracle DB, performance-wise) and noticed the lack of those giant backends we had learned to fear and endure before. There's still the odd 250MB backend lingering around for some time, but it's not four or five of them any more, which is a big gain when there's 250MB of swap in use more often than not.

Are there any known issues concerning persistent connections apart from the pg_pconnect thing with PHP? Is anyone running the same combination of *nix/Postgres + IIS/ColdFusion?

Regards,

Markus
"Markus Wollny" <Markus.Wollny@computec.de> writes: > [ turned off persistent connections ] > Now there's still the odd 250MB backend lingering around for some time, > but it's not four or five of them any more, wich is a big gain when > there's 250MB swap around more often than not. So you haven't really solved the problem --- somewhere there is a query being issued that ramps the backend up to a lot of memory. All you've done is ensured that the backend won't hang around very long. The persistent connection isn't really at fault, except in that it causes backends to keep being used after their memory usage has become bloated. Although this might be enough to eliminate your immediate performance problem, you should keep looking. Starting lots more backends than you need to is a performance hit, so turning off persistent connections is really only a stopgap not a desirable answer. And I'm still wondering whether you've exposed a fixable memory leak bug. We need to identify exactly what query is causing the backends to eat memory. regards, tom lane
Markus Wollny wrote:
>
> Are there any known issues concerning persistent connections apart from
> the pg_pconnect thing with PHP? Is anyone running the same combination
> of *nix/Postgres + IIS/ColdFusion?

PostgreSQL 7.1 on i386-unknown-openbsd2.9, compiled by GCC 2.95.3
ColdFusion 4.5.1 SP2 on Windows NT 4 SP6a, ODBC 7.01.00.06

But it is all super light load; all the queries return 1 row, or they are cached on the ColdFusion side. The only issue I ever had was that I needed to explicitly cast text columns to varchar, or else the query would return an error (which I have been unable to reproduce with later ODBC drivers).

If you are running CF MX, try the JDBC driver. I haven't finished testing it, but so far it is looking good.

Jochem

--
Jochem van Dieten
Team Macromedia Volunteer for ColdFusion
http://www.macromedia.com/support/forums/team_macromedia/
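For readers hitting the same error with older ODBC drivers, here is a minimal sketch of the cast workaround Jochem describes. The table and column names are purely illustrative, not from his setup:

    -- Hypothetical table with a column of type text:
    CREATE TEMP TABLE notes (id integer, body text);

    -- Casting the text column to varchar worked around the ODBC error:
    SELECT id, body::varchar AS body FROM notes;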
Hi!

I see - so the beast is still roaming, but the current settings sort of keep it at bay for a while... I'd surely like to find the specific query or queries which cause this behaviour; all I could find out as yet was the database for the query. If I switch on query_stats and query output in the logfile, I get about 300-500MB of log per hour - even during low-load times. Scanning through these logs is like searching for a needle in a haystack. I can indeed identify those queries with exceptionally high execution times, but there's no mention of memory usage anywhere - so I'm still left in the dark. I can get the PIDs from top output and scan through the log, taking down execution times, but then there's nothing much out of the ordinary - mostly below 0.05 seconds or much, much less. Again I cannot guess at any correlation between execution time and memory usage.

As one backend processes lots of queries in sequence, I find it easy to imagine that no single one of these queries is in any way extraordinary, but that there might be some issue with freeing up memory that belonged to the predecessor when the backend begins a new query, thus grabbing more and more memory during its lifetime. With the current tools I know of, I shall have a hard time finding the error - if I ever do. I cannot claim however to really know a lot :) How would you, being "real DBAs" and knowing the innards of this particular DB probably much better than your very own, go about this task of tracking down the memory-eater?

Regards,

Markus

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Thursday, June 27, 2002 16:41
> To: Markus Wollny
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] One source of constant annoyance identified
>
> [full reply snipped - quoted in Tom's mail above]
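One way to make those logs easier to correlate with top output - a sketch using the logging options as I remember them from the 7.2 postgresql.conf (treat the exact option names as assumptions to verify against your own config file):

    # postgresql.conf: prefix each log line with the backend PID and a
    # timestamp, and log every query, so a bloated PID seen in top can
    # be matched to the statements it has been executing.
    log_pid = true
    log_timestamp = true
    debug_print_query = true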
On Thu, 27 Jun 2002, Tom Lane wrote:

> So you haven't really solved the problem --- somewhere there is a query
> being issued that ramps the backend up to a lot of memory. All you've
> done is ensured that the backend won't hang around very long. The
> persistent connection isn't really at fault, except in that it causes
> backends to keep being used after their memory usage has become bloated.

Yeah, but if the queries after that are not using all of the mapped memory, that should be swapped out fairly quickly because the machine is short on memory. Same for memory leaks; if you're losing a lot of memory, you'd think there would be a fair number of pages you never touch that could then be swapped out.

The bloated processes, at least from the top fragment I saw, appear to have a working set of 200-250 MB; basically the entire data space is resident. So what's touching all of those pages often enough that they don't get swapped?

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
Hi!

It seems I found one of the queries which suck up memory as if there were terabytes available.

If a user starts a search for e.g. "Ich brauche Mitleid" on one of our websites (site-ID is 43 in this example), we construct our select like this:

    select MESSAGE.BOARD_ID
         , MESSAGE.THREAD_ID
         , MESSAGE.MESSAGE_ID
         , MESSAGE.TITLE
         , MESSAGE.USER_ID
         , USERS.LOGIN
         , USERS.STATUS
         , USERS.RIGHTS
         , to_char(MESSAGE.CREATED,'DD.MM.YY hh24:mi') as DATUM
         , MESSAGE.COUNT_REPLY
         , (select count(*)
              from CT_COM_USER_THREAD_FOLLOW
             where USER_ID = '484387'
               and thread_id = MESSAGE.THREAD_ID) as TFUID
      from CT_COM_BOARD_MESSAGE MESSAGE
         , CT_COM_USER USERS
         , CT_COM_BOARD_RULES READRULE
         , CT_COM_SITE_BOARDS SITE
     where SITE.SITE_ID = '43'
       and (    lower(MESSAGE.TEXT)  like '%ich%'
             or lower(MESSAGE.TEXT)  like 'ich%'
             or lower(MESSAGE.TEXT)  like '%ich'
             or lower(MESSAGE.TITLE) like '%ich%'
             or lower(MESSAGE.TITLE) like 'ich%'
             or lower(MESSAGE.TITLE) like '%ich' )
       and (    lower(MESSAGE.TEXT)  like '%brauche%'
             or lower(MESSAGE.TEXT)  like 'brauche%'
             or lower(MESSAGE.TEXT)  like '%brauche'
             or lower(MESSAGE.TITLE) like '%brauche%'
             or lower(MESSAGE.TITLE) like 'brauche%'
             or lower(MESSAGE.TITLE) like '%brauche' )
       and (    lower(MESSAGE.TEXT)  like '%mitleid%'
             or lower(MESSAGE.TEXT)  like 'mitleid%'
             or lower(MESSAGE.TEXT)  like '%mitleid'
             or lower(MESSAGE.TITLE) like '%mitleid%'
             or lower(MESSAGE.TITLE) like 'mitleid%'
             or lower(MESSAGE.TITLE) like '%mitleid' )
       and MESSAGE.STATE_ID = 0
       and MESSAGE.USER_ID = USERS.USER_ID
       and USERS.STATUS > 0
       and SITE.BOARD_ID = MESSAGE.BOARD_ID
       and READRULE.BOARD_ID = MESSAGE.BOARD_ID
       and READRULE.RULE_ID = 1
       and READRULE.VALUE <= '5'
     order by MESSAGE.LAST_REPLY desc

Now I think it's the bit with the LIKEs that kills us, especially as the database refuses to create an index on MESSAGE.TEXT for being too big or whatever - search me, but it just wouldn't do it (the field is of type varchar with a maximum length of 10,000 characters). This query is a true killer, taking over 2 minutes to complete while eating up more than a quarter of a gig of memory. Oracle wasn't too bothered about this one, but PostgreSQL seems to act very differently...

Now as far as I have searched through the docs and the archives, there's this full-text-search method provided in contrib (http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/fulltextindex/?only_with_tag=REL7_2_STABLE); is this an equivalent of the ConText cartridge provided by Oracle? This lack of full-text search might be the main issue in our attempts to migrate from Oracle to PostgreSQL, so to me it looks like it might just be the saving straw. Please feel free to correct me if I'm wrong... I never had much experience with this CVS system and so far left it to those ambitious enough to tinker with the innards of their projects, but right now it seems like I am forced to risk a try... Unfortunately, from my point of view, this thing lacks a bit in terms of documentation - do I need to recompile the whole of PostgreSQL or just this bit? How would I go about installing it, up to the point of actually running it and making use of it on two columns (TITLE and TEXT in the MESSAGE table)? Or am I completely misled concerning this full-text-search option - and there's some cheaper way to speed things up without reducing functionality?

Regards,

Markus
LIKE '%dfjdsklfdfjdklfjds' is what is killing you.

LIKE 'sdfdklf%' can be indexed, but the leading wildcard forces an index search.

Markus Wollny wrote:
> Hi!
>
> It seems I found one of the queries which suck up memory as if there
> were terabytes available.

--
P. J. "Josh" Rovero                Sonalysts, Inc.
Email: rovero@sonalysts.com        www.sonalysts.com
215 Parkway North                  Work: (860)326-3671 or 442-4355
Waterford CT 06385
***********************************************************************
Should have said: "leading % forces a sequential scan". Sorry.

P.J. "Josh" Rovero wrote:
> LIKE '%dfjdsklfdfjdklfjds' is what is killing you.
>
> LIKE 'sdfdklf%' can be indexed, but the leading wildcard
> forces an index search.

--
P. J. "Josh" Rovero                Sonalysts, Inc.
Email: rovero@sonalysts.com        www.sonalysts.com
215 Parkway North                  Work: (860)326-3671 or 442-4355
Waterford CT 06385
***********************************************************************
Curt Sampson <cjs@cynic.net> writes:
> On Thu, 27 Jun 2002, Tom Lane wrote:
>> So you haven't really solved the problem --- somewhere there is a query
>> being issued that ramps the backend up to a lot of memory. All you've
>> done is ensured that the backend won't hang around very long. The
>> persistent connection isn't really at fault, except in that it causes
>> backends to keep being used after their memory usage has become bloated.

> Yeah, but if the queries after that are not using all of the mapped
> memory, that should be swapped out fairly quickly because the
> machine is short on memory.

And the swapping activity is exactly the problem, isn't it?

In any case, we can't make much progress until we identify the query that is making the backend's address space grow. (Markus, you don't happen to have SORT_MEM set to a large value, do you?)

			regards, tom lane
> And the swapping activity is exactly the problem, isn't it?

Yupp, I guess so; once swapping is reduced, there should be not only much more fast memory available to the queries but also more processing time.

> In any case, we can't make much progress until we identify the query
> that is making the backend's address space grow.

This here is one of them (see full text in my mail from 13:25):

> -----Original Message-----
> From: Markus Wollny
> Sent: Friday, June 28, 2002 13:25
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] One source of constant annoyance identified
>
> Hi!
>
> It seems I found one of the queries which suck up memory as if there
> were terabytes available.
>
> [full search query snipped - quoted in the 13:25 mail above]

> (Markus, you don't happen to have SORT_MEM set to a large value,
> do you?)

Not at all - recommendations are 2-4% of available RAM AFAIK. On a 1GB machine this would be 21000 to 42000KB. Nevertheless I reduced it from these values to a meagre sort_mem = 8192, that's 8MB, but to no avail. Reducing this value any further doesn't really make too much sense, does it?

Anyway, I think that one of our main problems is the lack of options for full-text indexing. I'm currently trying to find out how to get FTI (from CVS/contrib) up and running, because I hope that this will solve 90% of my problems, if not more :)

Regards,

Markus
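As a side note on the setting Tom asked about: sort_mem is measured in kilobytes and is a ceiling per sort operation, not per server, so several concurrent sorts can each claim that much. A minimal sketch of adjusting it without a restart, assuming a 7.2 server:

    -- Per-session override; postgresql.conf supplies the default.
    SET sort_mem = 8192;   -- 8 MB per sort operation
    SHOW sort_mem;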
On Fri, Jun 28, 2002 at 07:40:46AM -0400, P.J. Josh Rovero wrote:
> LIKE '%dfjdsklfdfjdklfjds' is what is killing you.
>
> LIKE 'sdfdklf%' can be indexed, but the leading wildcard
> forces an index search.

You mean it forces a sequential scan, of course.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
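To make the distinction concrete, a small sketch (the index name is hypothetical; note also that in 7.2 a btree index can only serve LIKE when the cluster was initdb'ed in the C locale, which, as emerges later in this thread, is not the case here):

    -- Hypothetical index on the title column:
    CREATE INDEX idx_message_title ON ct_com_board_message (title);

    -- Anchored pattern: an index scan is possible (C locale only).
    SELECT message_id FROM ct_com_board_message
     WHERE title LIKE 'mitleid%';

    -- Leading %: every row must be examined, so a sequential scan.
    SELECT message_id FROM ct_com_board_message
     WHERE title LIKE '%mitleid%';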
"Markus Wollny" <Markus.Wollny@computec.de> writes: > lower(MESSAGE.TEXT) like '%ich%' > or lower(MESSAGE.TEXT) like 'ich%' > or lower(MESSAGE.TEXT) like '%ich' Is whoever wrote this under the misimpression that % can't match zero characters? You could reduce the number of LIKE tests by a factor of 3, because the foo% and %foo tests are completely redundant. But, back to the problem at hand --- it seems like a fair bet that we must have a memory leak in lower() or LIKE or both. Did you build with locale or multibyte (or both) enabled? If so, what locale and what database encoding are you using, respectively? regards, tom lane
Hi!

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Friday, June 28, 2002 17:03
> To: Markus Wollny
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] One source of constant annoyance identified
>
> "Markus Wollny" <Markus.Wollny@computec.de> writes:
> >            lower(MESSAGE.TEXT) like '%ich%'
> >         or lower(MESSAGE.TEXT) like 'ich%'
> >         or lower(MESSAGE.TEXT) like '%ich'
>
> Is whoever wrote this under the misimpression that % can't match zero
> characters? You could reduce the number of LIKE tests by a factor of 3,
> because the foo% and %foo tests are completely redundant.

Wasn't me :) I think there might be the odd generous wastage of processing time still in the code, just because we could afford it under Oracle. We intend to implement this very bit using regular expressions, as we hope that this will improve performance a bit. So we might get away without using LIKE at all in this particular case. We cannot, however, remove LIKE completely from every bit of code.

> But, back to the problem at hand --- it seems like a fair bet that
> we must have a memory leak in lower() or LIKE or both. Did you build
> with locale or multibyte (or both) enabled? If so, what locale and
> what database encoding are you using, respectively?

Compilation options were:

    --enable-locale
    --enable-recode
    --enable-multibyte
    --with-perl
    --enable-odbc
    --enable-syslog

Environment variables RC_LANG/LC_CTYPE are set to de_DE@euro, and the encoding for the databases is SQL_ASCII. As we are hosting German websites and communities, we need to sort data containing the characters ÄÖÜäöüß in the correct context (AÄ, OÖ, UÜ, sß), so I figured we'd need locale support.

Would it be worth a try recompiling without multibyte enabled? Can I dump/reimport the current DB afterwards?

Regards,

Markus
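A quick way to check whether the locale collation is actually in effect - a sketch with a throwaway table (names and data are illustrative):

    CREATE TEMP TABLE umlaut_test (s varchar(20));
    INSERT INTO umlaut_test VALUES ('Apfel');
    INSERT INTO umlaut_test VALUES ('Äpfel');
    INSERT INTO umlaut_test VALUES ('Zebra');

    -- Under LC_COLLATE=de_DE@euro, 'Äpfel' sorts next to 'Apfel';
    -- under LC_COLLATE=C it sorts by byte value, after 'Zebra'.
    SELECT s FROM umlaut_test ORDER BY s;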
Hi,

On Fri, Jun 28, 2002 at 05:30:58PM +0200, Markus Wollny wrote:
>
> Compilation options were:
>
>     --enable-locale
>     --enable-recode
>     --enable-multibyte
>     --with-perl
>     --enable-odbc
>     --enable-syslog
>
> Environment variables RC_LANG/LC_CTYPE are set to de_DE@euro, and the
> encoding for the databases is SQL_ASCII. As we are hosting German
> websites and communities, we need to sort data containing the
> characters ÄÖÜäöüß in the correct context (AÄ, OÖ, UÜ, sß), so I
> figured we'd need locale support.

AFAIK the compile options are OK, but did you run the initdb command with the proper LC_ALL env set? It should be de_DE or de_DE@euro.

BTW: I found a wrong i18n source file on my Linux distribution (Slackware 7.x). After fixing this and compiling the new locale, everything worked fine. It's corrected in the newer Slackware.

-tb
--
Thomas Beutin                                  tb@laokoon.IN-Berlin.DE
Beam me up, Scotty. There is no intelligent life down in Redmond.
Hi!

> -----Original Message-----
> From: Thomas Beutin
> Sent: Fri, June 28, 2002 19:43
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] One source of constant annoyance identified
>
> AFAIK the compile options are OK, but did you run the initdb command
> with the proper LC_ALL env set? It should be de_DE or de_DE@euro.

For the initdb option "--locale=" I found in the documentation: "If this option is not specified, the locale is inherited from the environment that initdb runs in." So I presume that if I didn't explicitly set it to anything other than my current environment, which I didn't, it should be okay.

> BTW: I found a wrong i18n source file on my Linux distribution
> (Slackware 7.x). After fixing this and compiling the new locale,
> everything worked fine. It's corrected in the newer Slackware.

I use SuSE 7.3, a German distro, so I cannot imagine that some serious problem with German localization would have gone unnoticed. Of course I did not use the PostgreSQL version provided with the distro - I got the latest stable release 7.2.1 and compiled it myself. So I guess that's not the source of the problem.

Regards,

Markus
"Markus Wollny" <Markus.Wollny@computec.de> writes: > For initdb-option "--locale=" I found in the documentation "If this > option is not specified, the locale is inherited from the environment > that initdb runs in." So I presume that if I didn't explicitly set it to > anything other than my current environment, which I didn't, it should be > okay. If you're not certain about this, compile contrib/pg_controldata and see what it says the locale values in your pg_control file are. regards, tom lane
On Fri, 28 Jun 2002, Tom Lane wrote:

> > Yeah, but if the queries after that are not using all of the mapped
> > memory, that should be swapped out fairly quickly because the
> > machine is short on memory.
>
> And the swapping activity is exactly the problem, isn't it?

That particular swapping activity would not be a problem. Memory that's not used gets paged out and that's the end of it. The problem is that something is *using* that memory, so it's not being paged out, or if it does get paged out, it gets paged back in again.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson <cjs@cynic.net> writes:
> On Fri, 28 Jun 2002, Tom Lane wrote:
>> And the swapping activity is exactly the problem, isn't it?

> That particular swapping activity would not be a problem. Memory
> that's not used gets paged out and that's the end of it. The problem
> is that something is *using* that memory, so it's not being paged
> out, or if it does get paged out, it gets paged back in again.

Yeah --- but typical implementations of malloc are very pager-unfriendly; they tend to traverse data structures that consist of a word or two at the head of each randomly-sized chunk of data or former-now-freed data. PG adds its own layer of not-very-paging-friendly allocation logic on top of whatever sins your local malloc may commit.

Bottom line is that a PG backend that's swollen to a couple hundred MB is trouble. Don't assume it'll play nice with the swapper; it won't.

			regards, tom lane
Hi!

I did as you suggested - and the locale was indeed not set correctly, it seems. Here's the output:

    pg_control version number:            71
    [...]
    LC_COLLATE:                           C
    LC_CTYPE:                             de_DE@euro

So LC_COLLATE wasn't set correctly; I dumped all databases, recompiled (without multibyte and recode support this time, just to reduce possible sources of failure), did initdb again and reimported the dumpfile. The pg_controldata output is now:

    pg_control version number:            71
    [...]
    LC_COLLATE:                           de_DE@euro
    LC_CTYPE:                             de_DE@euro

Configure options were:

    ./configure --prefix=/opt/pgsql/ --with-perl --enable-odbc --enable-locale --enable-syslog

So that's most definitely fixed. It didn't do anything for me, though - there are still several giant backends, as you can see from the top output:

    12:29pm  up 1 day, 21:04,  3 users,  load average: 1.44, 1.13, 1.18
    85 processes: 81 sleeping, 4 running, 0 zombie, 0 stopped
    CPU0 states: 87.3% user,  9.2% system,  0.0% nice,  3.1% idle
    CPU1 states: 34.0% user,  6.0% system,  0.0% nice, 59.5% idle
    CPU2 states: 47.3% user,  1.4% system,  0.0% nice, 50.5% idle
    CPU3 states: 44.6% user,  7.3% system,  0.0% nice, 47.3% idle
    Mem:  1029400K av, 1002272K used,   27128K free,  0K shrd,  5352K buff
    Swap: 2097136K av,   13160K used, 2083976K free            842792K cached

     PID  USER     PRI NI SIZE  RSS  SHARE STAT %CPU %MEM TIME COMMAND
     9269 postgres  10  0 257M  257M  256M S    96.2 25.5 0:54 postmaster
     9264 postgres   9  0 250M  250M  249M S     1.9 24.9 0:42 postmaster
     9375 postgres   9  0 127M  127M  126M S     1.2 12.6 0:05 postmaster
     9372 postgres   9  0 106M  106M  105M S     3.9 10.5 0:26 postmaster
     9312 postgres   9  0 68900  67M 67616 S     1.8  6.6 0:02 postmaster
     9379 postgres   9  0 41144  40M 39880 S     2.8  3.9 0:04 postmaster
     9377 postgres   9  0 39276  38M 38104 S     0.0  3.8 0:14 postmaster
     9380 postgres   9  0 34548  33M 33388 S     0.0  3.3 1:14 postmaster
     9381 postgres   9  0 32140  31M 30760 S     0.0  3.1 0:26 postmaster
     9373 postgres   9  0 24068  23M 23064 S     0.0  2.3 0:01 postmaster
     9353 postgres   9  0 22224  21M 20332 S     7.8  2.1 0:37 postmaster
     9371 postgres   9  0 22036  21M 21032 S     0.0  2.1 0:01 postmaster
     9382 postgres   9  0 17840  17M 16652 S     0.0  1.7 0:01 postmaster
     9268 postgres  12  0 16104  15M 15036 R     0.7  1.5 0:16 postmaster
     9352 postgres   9  0 15584  15M 14076 S     0.0  1.5 0:19 postmaster
     9351 postgres  19  0 14644  14M 13080 R    36.9  1.4 0:16 postmaster
     9313 postgres   9  0 14068  13M 13036 S     5.1  1.3 0:14 postmaster
     9374 postgres   9  0 13996  13M 12976 S     3.1  1.3 0:14 postmaster
     9354 postgres   9  0 13116  12M 12088 S     1.3  1.2 0:03 postmaster
     9376 postgres   9  0 12744  12M 11732 S     0.9  1.2 0:02 postmaster
     9452 postgres   9  0 12356  12M 11208 S     0.0  1.1 0:10 postmaster
     9370 postgres   9  0 12300  12M 11332 S     1.8  1.1 0:02 postmaster
     9355 postgres   9  0 11768  11M 10756 S     0.0  1.1 0:00 postmaster
     9524 postgres   9  0 10744  10M  9740 S     3.6  1.0 0:00 postmaster
     9476 postgres   9  0 10020 9.8M  9044 S     1.9  0.9 0:00 postmaster
     9602 postgres  18  0  9472 9468  8356 R    58.0  0.9 0:03 postmaster
     9535 postgres   9  0  9368 9364  8216 S     0.7  0.9 0:00 postmaster

Any more ideas?

Regards,

Markus

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Friday, June 28, 2002 22:42
> To: Markus Wollny
> Cc: Thomas Beutin; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] One source of constant annoyance identified
>
> If you're not certain about this, compile contrib/pg_controldata and
> see what it says the locale values in your pg_control file are.
>
>			regards, tom lane
"Markus Wollny" <Markus.Wollny@computec.de> writes: > Now I think it's the bit with the LIKEs that kills us, That's what I thought too, but I've tried and failed to reproduce any memory leak with lower/LIKE and the same configuration options that you used. It might be that some other part of the query is the problem, or maybe I'm not duplicating the setup correctly. Could I trouble you for the exact schemas of the tables used by the problem query? (The output of pg_dump -s would be the best thing to send.) regards, tom lane
Hello!

Sorry I took so long - I have attached the schema as requested.

Actually, it seems to be very often the case that certain operations suck up more than 25% of available memory and processing capacity. I managed to customize and install the full-text-index option (fti) from the contrib directory (substrings are at least 3 characters in length for us, and I defined a list of stopwords not to be included). Right now I have started filling the fti table with the substrings; I tried using the supplied Perl script, but the results are quite dissatisfactory because it doesn't exclude any stopwords, nor does it limit itself to alphanumeric characters - you may be able to imagine what happens when there are loads of kiddies posting "!!!!!!!!!!!!!!!!!!" (ad nauseam) and suchlike in quite a lot of postings. So I used the already implemented trigger to execute the fti function:

    update ct_com_board_message
       set state_id = 0
     where state_id = 0
       and to_char(last_reply, 'yyyymmdd') between '20020301' and '20020830';

I took a quick look at top: even this humble query causes memory and processor load like a giant: 266M RAM, 38.3% processor time, 26.4% memory usage. Okay, it's calling the trigger for each row, which in turn inserts some new tuples into ct_com_board_fti, but is it expected to cause so much load?

Regards,

Markus

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Sunday, June 30, 2002 21:34
> To: Markus Wollny
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] One source of constant annoyance identified
>
> That's what I thought too, but I've tried and failed to reproduce any
> memory leak with lower/LIKE and the same configuration options that
> you used. It might be that some other part of the query is the
> problem, or maybe I'm not duplicating the setup correctly. Could I
> trouble you for the exact schemas of the tables used by the problem
> query? (The output of pg_dump -s would be the best thing to send.)
>
>			regards, tom lane
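For readers unfamiliar with contrib/fulltextindex: once the trigger has populated the fti table, searches join it back to the base table instead of scanning with LIKE. A sketch assuming ct_com_board_fti follows the contrib README layout (a string column plus an id column holding the base row's oid - verify against the actual trigger definition):

    -- Find messages containing both 'brauche' and 'mitleid' via the
    -- substring table (column names string/id are assumptions):
    SELECT m.message_id, m.title
      FROM ct_com_board_message m,
           ct_com_board_fti f1,
           ct_com_board_fti f2
     WHERE f1.string = 'brauche'
       AND f2.string = 'mitleid'
       AND f1.id = f2.id
       AND m.oid = f1.id;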
"Markus Wollny" <Markus.Wollny@computec.de> writes: > Sorry I took so long - I attached the schema as asked. Thanks. But I'm still unable to reproduce the memory bloat you see on SELECTs. This seems very peculiar. You said you were running SuSE 7.3 --- how recent is that? Which glibc version is it running? (I've been reduced to speculating about memory leakage inside libc, which is a pretty long shot but...) > So used the already implemented trigger to > execute the fti-function: > update ct_com_board_message > set state_id=3D0 > where state_id=3D0 > and to_char(last_reply, 'yyyymmdd') between '20020301' and '20020830'; > I took a quick look at top: Even this humble query causes memory- and > processor-load like a giant: 266M RAM, 38.3% processor time, 26.4% > memory usage. Okay, it's calling the trigger for each row which in turn > inserts some new tuples into ct_com_board_fti, but is it expected to > cause so much load? Wouldn't surprise me. Since you're using an AFTER trigger, the pending trigger events have to be saved up for commit time, so the list of pending events is going to grow quite large. (How many rows do you have in ct_com_board_message, anyway? How many did that query try to update?) This however does not explain your problem with SELECT, since selects don't fire triggers. Could I see the output of EXPLAIN for that problem SELECT on your machine? regards, tom lane
On Tue, 2 Jul 2002, Tom Lane wrote:

> "Markus Wollny" <Markus.Wollny@computec.de> writes:
> > Sorry I took so long - I have attached the schema as requested.
>
> [...]
>
> Wouldn't surprise me. Since you're using an AFTER trigger, the pending
> trigger events have to be saved up for commit time, so the list of
> pending events is going to grow quite large. (How many rows do you
> have in ct_com_board_message, anyway? How many did that query try to
> update?)

Whoa, that's what I was trying to remember. I had problems at one time when loading a large amount of data into a table with a txtidx type column. It might not have been a memory problem I had, though; it could just have been slow loading. I was loading with COPY in a transaction and ended up just doing the COPY outside of a transaction. It still took a while, but then it's only a low-powered machine.

If that wasn't the process footprint growing huge, then that problem was occurring for me when doing selects. I can't remember what it was that I did that fixed it, though. I wonder if it's in the list's archives, since the issue was raised here.

> This however does not explain your problem with SELECT, since
> selects don't fire triggers.
>
> Could I see the output of EXPLAIN for that problem SELECT on your
> machine?

--
Nigel J. Andrews
Director
---
Logictree Systems Limited
Computer Consultants
Hello!

Thank you very much for your efforts - we appreciate that very much :)

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Wednesday, July 3, 2002 00:28
> To: Markus Wollny
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] One source of constant annoyance identified
>
> Thanks. But I'm still unable to reproduce the memory bloat you see on
> SELECTs. This seems very peculiar. You said you were running SuSE 7.3
> --- how recent is that? Which glibc version is it running? (I've been
> reduced to speculating about memory leakage inside libc, which is a
> pretty long shot but...)

I agree - it is a long shot: SuSE 7.3 ships kernel 2.4.10 and glibc 2.2.4; it was released in October 2001. I tried using SuSE 8.0, released in late April this year, but I was more than unhappy with some of the new "features" - and it seems that hardly anybody has switched to 8.0 for server usage as yet. Generally SuSE 7.3 (which is probably by far the most popular distro in Germany) is considered quite stable and current enough for server usage. Would it really be worth the hassle of updating glibc to 2.2.5?

> > I took a quick look at top: even this humble query causes memory and
> > processor load like a giant: 266M RAM, 38.3% processor time, 26.4%
> > memory usage. Okay, it's calling the trigger for each row, which in
> > turn inserts some new tuples into ct_com_board_fti, but is it
> > expected to cause so much load?
>
> Wouldn't surprise me. Since you're using an AFTER trigger, the pending
> trigger events have to be saved up for commit time, so the list of
> pending events is going to grow quite large.

Okay, so there are indeed situations when this kind of backend size can be considered normal. That's some sort of relief :)

> (How many rows do you have in ct_com_board_message, anyway? How many
> did that query try to update?) This however does not explain your
> problem with SELECT, since selects don't fire triggers.

Currently there are 362,154 rows in ct_com_board_message and 85,101 rows in ct_com_user. I don't know if this can be considered a lot; we expect that to grow at an accelerating rate during the next months, so more than a million rows can be expected within the next 12 months or so. We scarcely use any database-specific features like triggers as yet; it's 99.99% pure SQL - tables, indexes, data, selects, inserts, updates, deletes, no "fancy stuff". We'll have to get to know PostgreSQL better before using anything specific; we never needed to get too deep into Oracle either, which was an advantage when porting the whole thing over to PostgreSQL, but it seems that the latter calls for a bit more attention and know-how...

> Could I see the output of EXPLAIN for that problem SELECT on your
> machine?
We are currently working on a new version of the search I gave you before, but here's another one that's causing similar symptoms - extra long running time and most probably huge backends (it's 10:20 am, far from peak time, and we already have ~280MB swap and six backends >100MB). This one generates an overview of all the forum threads in one board, which has 41,624 messages, 2,971 of them thread-starters (FATHER_ID=0):

    select MESSAGE.BOARD_ID
         , MESSAGE.THREAD_ID
         , MESSAGE.FATHER_ID
         , MESSAGE.MESSAGE_ID
         , MESSAGE.USER_ID
         , MESSAGE.USER_LOGIN as LOGIN
         , MESSAGE.USER_STATUS as STATUS
         , MESSAGE.USER_RIGHTS as RIGHTS
         , MESSAGE.TITLE
         , MESSAGE.COUNT_REPLY as COUNT_REPLY
         , to_char(MESSAGE.LAST_REPLY,'DD.MM.YY hh24:mi') as LAST_REPLY
         , round((date_part('epoch',CURRENT_TIMESTAMP)
                - date_part('epoch',MESSAGE.LAST_REPLY))/60) as diff_posting
         , to_char(MESSAGE.CREATED,'DD.MM.YY hh24:mi') as DATUM
         , (select count(*)
              from CT_COM_USER_THREAD_FOLLOW
             where USER_ID = '295798'
               and thread_id = MESSAGE.THREAD_ID) as TFUID
      from CT_COM_BOARD_MESSAGE MESSAGE
     where (0=0)
       and MESSAGE.BOARD_ID = 10
       and MESSAGE.FATHER_ID = 0
       and MESSAGE.STATE_ID = 0
     order by MESSAGE.LAST_REPLY desc

    Sort  (cost=30695.27..30695.27 rows=7693 width=154)
          (actual time=9745.94..9751.58 rows=4663 loops=1)
      ->  Index Scan using idx_bm_show_topics on ct_com_board_message message
            (cost=0.00..30198.72 rows=7693 width=154)
            (actual time=111.56..9549.99 rows=4663 loops=1)
            SubPlan
              ->  Aggregate  (cost=5.83..5.83 rows=1 width=0)
                    (actual time=0.91..0.91 rows=1 loops=4663)
                    ->  Index Scan using idx_user_thread_follow on ct_com_user_thread_follow
                          (cost=0.00..5.83 rows=1 width=0)
                          (actual time=0.88..0.88 rows=0 loops=4663)
    Total runtime: 9835.57 msec

I do hope you can make anything of this...

Regards,

Markus
Okay, now it's even more annoying...

We just upgraded the server from 1GB to 2GB of RAM because it kept swapping out about 300MB. I updated postgresql.conf accordingly, setting

    max_connections = 190 (this is the minimum we need to satisfy webservers and backend jobs)
    shared_buffers = 60000 (that's 468.75MB; I took this value assuming a recommended value of 25% of RAM)
    sort_mem = 80000 (that's 78.125MB; recommended value is 4% of RAM, equalling 82MB)

Now top output sorted by memory usage turns out to be:

    2:43pm  up 1:34,  3 users,  load average: 3.29, 2.25, 2.08
    123 processes: 118 sleeping, 5 running, 0 zombie, 0 stopped
    CPU0 states: 60.5% user,  3.1% system,  0.0% nice, 35.4% idle
    CPU1 states: 47.3% user, 15.3% system,  0.0% nice, 36.4% idle
    CPU2 states: 51.0% user, 17.1% system,  0.0% nice, 31.3% idle
    CPU3 states: 75.4% user,  4.0% system,  0.0% nice, 20.0% idle
    Mem:  2061560K av, 2054268K used,    7292K free,  0K shrd, 13924K buff
    Swap: 2097136K av,    7788K used, 2089348K free           1825104K cached

     PID  USER     PRI NI SIZE  RSS  SHARE STAT %CPU %MEM TIME COMMAND
     1652 postgres   9  0 469M  469M  467M S     0.0 23.3 2:21 postmaster
     2041 postgres   9  0 431M  431M  429M S     1.2 21.4 1:32 postmaster
     1588 postgres   9  0 411M  411M  410M S    15.6 20.4 1:25 postmaster
     1597 postgres   9  0 289M  289M  287M S     0.1 14.3 0:31 postmaster
     1849 postgres   9  0 259M  259M  258M S     0.0 12.9 0:14 postmaster
     2046 postgres   9  0 239M  239M  238M S     0.0 11.9 0:25 postmaster
     1973 postgres   9  0 172M  172M  171M S     0.0  8.5 0:21 postmaster
     2142 postgres   9  0 128M  128M  127M S     0.0  6.3 0:04 postmaster
     2156 postgres   9  0 116M  116M  114M S     0.0  5.7 0:02 postmaster
     1598 postgres   9  0 86548  84M 84752 S     1.7  4.1 2:36 postmaster
     1608 postgres   9  0 60932  59M 59356 S     0.0  2.9 2:09 postmaster
     1582 postgres   9  0 57624  56M 55444 S     0.0  2.7 1:58 postmaster
     1609 postgres   9  0 56408  55M 55164 S     8.3  2.7 2:05 postmaster
     1766 postgres   9  0 45248  44M 43824 S     0.0  2.1 1:09 postmaster
     2139 postgres   9  0 45276  44M 43892 S     0.0  2.1 0:01 postmaster
     2045 postgres   9  0 41500  40M 39820 S     1.7  2.0 2:43 postmaster
     1610 postgres   9  0 41336  40M 40172 S     0.0  2.0 0:53 postmaster
     2044 postgres  14  0 38328  37M 37056 R    25.0  1.8 0:33 postmaster
     1881 postgres   9  0 34936  34M 33676 S    17.2  1.6 0:29 postmaster
     2042 postgres   9  0 33144  32M 31920 S     3.7  1.6 1:13 postmaster
     1679 postgres   9  0 32516  31M 31288 S     0.0  1.5 0:05 postmaster
     1678 postgres   9  0 31996  31M 30812 S     0.0  1.5 0:19 postmaster
     1653 postgres   9  0 29424  28M 28180 S     0.0  1.4 0:05 postmaster
     2048 postgres   9  0 27772  27M 26556 S     0.0  1.3 0:09 postmaster
     1802 postgres   9  0 26676  26M 25504 S     0.0  1.2 0:06 postmaster
     2211 postgres   9  0 25940  25M 24592 S     0.0  1.2 0:02 postmaster
     2047 postgres  14  0 25588  24M 23532 R     8.0  1.2 0:55 postmaster
     2065 postgres  11  0 25584  24M 24404 S     3.9  1.2 0:04 postmaster
     1980 postgres  11  0 24584  24M 22864 S     3.5  1.1 0:07 postmaster
     1872 postgres   9  0 23908  23M 22800 S     0.0  1.1 0:04 postmaster
     2068 postgres   9  0 21352  20M 20188 S     0.0  1.0 0:01 postmaster
     2138 postgres   9  0 20928  20M 19644 S     0.0  1.0 0:06 postmaster
     1983 postgres   9  0 20544  20M 19344 S     0.0  0.9 0:04 postmaster
     2342 postgres   9  0 20352  19M 13772 S     0.0  0.9 0:23 postmaster
     2357 postgres  15  0 20260  19M 18532 R    93.1  0.9 1:34 postmaster
     2204 postgres   9  0 19816  19M 18244 S     0.0  0.9 0:01 postmaster
     2199 postgres   9  0 16840  16M 15452 S     0.0  0.8 0:00 postmaster
     2207 postgres   9  0 16784  16M 15512 S     0.0  0.8 0:00 postmaster
     2050 postgres   9  0 15880  15M 14136 S     0.0  0.7 0:06 postmaster
     2200 postgres   9  0 15568  15M 14080 S     0.0  0.7 0:00 postmaster
     2301 postgres   9  0 15076  14M 13940 S     0.0  0.7 0:00 postmaster
     2236 postgres   9  0 14132  13M 12824 S     0.0  0.6 0:00 postmaster
     2346 postgres   9  0 14080  13M 12972 S     0.0  0.6 0:09 postmaster
     2347 postgres   9  0 14064  13M 12960 S     0.0  0.6 0:09 postmaster
     2205 postgres   9  0 13904  13M 12412 S     0.0  0.6 0:00 postmaster
     2339 postgres  11  0 12660  12M 11448 S    36.1  0.6 0:27 postmaster

which is not at all good...

And still the major problem in finding the cause of all this is to identify the query which causes these huge backends. At the moment I tend to suspect that there certainly are several queries which need a bit of RAM - not that much, however - and the backend fails to free up the memory used by processed queries. What puzzles me here is that all these 400MB+ backend processes are marked as idle (ps output) respectively sleeping, so they don't actually do much. But how come a sleeping process grabs 20% of 2GB of RAM and 15% of processing time?

Another fact that hints at backends just eating memory without reason is that when I stop the database, processes keep lingering - kill just doesn't help, I have to kill -9 the last of the lot to get rid of them before restarting the database:

    dunkles:/var/lib/pgsql/data/base # /etc/init.d/postgresql stop
    Shutting down PostgreSQL done
    dunkles:/var/lib/pgsql/data/base # ps ax|grep post
      863 pts/1    S    0:00 login -- postgres
      885 pts/2    S    0:00 login -- postgres
     1552 pts/0    S    0:01 /opt/pgsql/bin/postmaster -i -D/var/lib/pgsql/data/base
     1553 pts/0    S    0:00 postgres: stats buffer process
     1554 pts/0    S    0:01 postgres: stats collector process
     1650 pts/0    S    0:00 postgres: postgres template1 212.123.109.25 idle
     1681 pts/0    S    0:00 postgres: postgres abo 212.123.109.25 idle
     1682 pts/0    S    0:00 postgres: postgres bluebox 212.123.109.25 idle
     1683 pts/0    S    0:00 postgres: postgres kidszone 212.123.109.25 idle
     1684 pts/0    S    0:00 postgres: postgres mcv 212.123.109.25 idle
     1685 pts/0    S    0:00 postgres: postgres mpo 212.123.109.25 idle
     1686 pts/0    S    0:00 postgres: postgres nzone 212.123.109.25 idle
     1687 pts/0    S    0:00 postgres: postgres pcaction 212.123.109.25 idle
     1688 pts/0    S    0:00 postgres: postgres pcgames 212.123.109.25 idle
     1689 pts/0    S    0:00 postgres: postgres phppgadmin 212.123.109.25 idle
     1690 pts/0    S    0:00 postgres: postgres pszone 212.123.109.25 idle
     1691 pts/0    S    0:00 postgres: postgres saturn 212.123.109.25 idle
     1693 pts/0    S    0:00 postgres: postgres template1 212.123.109.25 idle
     1780 pts/0    S    0:00 postgres: postgres template1 212.123.108.149 idle
     1781 pts/0    S    0:00 postgres: postgres template1 212.123.108.149 idle
     1784 pts/0    S    0:00 postgres: postgres pcgames 212.123.108.149 idle

This just has some sort of Windows look & feel to it - processes not responding any more.

Now I have set back the original values from before the RAM upgrade: shared_buffers = 32768 (256MB) and sort_mem = 51200 (50MB). The limit for backend size now seems to be some 265-odd MB (about four to six of those around at any given time, as always), and top reports:

    3:11pm  up 2:03,  3 users,  load average: 3.24, 1.65, 1.81
    163 processes: 156 sleeping, 7 running, 0 zombie, 0 stopped
    CPU0 states: 81.3% user, 15.5% system,  0.0% nice,  2.2% idle
    CPU1 states: 67.1% user, 26.0% system,  0.0% nice,  6.3% idle
    CPU2 states: 69.2% user, 21.0% system,  0.0% nice,  9.3% idle
    CPU3 states: 65.4% user, 20.2% system,  0.0% nice, 13.3% idle
    Mem:  2061560K av, 2025392K used,   36168K free,  0K shrd, 13108K buff
    Swap: 2097136K av,    7732K used, 2089404K free           1750556K cached

Which isn't quite as bad any more - although it cannot be the optimum performance for this machine, I can't imagine.

Now what on earth can I do to get this DB running - and when I say "running" I don't mean "sort of crawling uphill"? Are there any OS patches or environment settings I need to consider? Are there known memory leaks? It just doesn't seem to matter whichever settings I pick in postgresql.conf; the database just eats up any available memory in any case. I cannot use it for production purposes this way. I am quite despairing right now - and I am definitely running out of time. Which bits of Linux (glibc, kernel, whatnot) does PostgreSQL 7.2.1 need, and in which version, to run smoothly? As I mentioned before, we don't use any "fancy" features of the database like foreign keys or triggers (except one); it's just basic functionality that seems to fall down on us...

I am willing to try almost anything - but I need to squeeze more performance out of this thing and I need some hints on which tools to use to identify the problem. And unfortunately I need all this as soon as possible... Hints and help are very, very much appreciated. Thank you!

Regards,

Markus
Hello!

Scanning Google for strange backend behaviour of PostgreSQL, I stumbled over the following bug report:

http://postgresql.linux.cz/mhonarc/pgsql-ports/1999-01/msg00001.html

It's referring to a very old version of PostgreSQL (6.0), but could it be that we suffer from the very same phenomenon? I do notice "connection reset by peer" messages in the logfile - I always thought that this happened when a user terminated his request or a connection broke for other (mostly harmless) reasons. Now to me it seems quite likely that our backends sort of "go to sleep" too, hogging memory and (for some reason) processing time as well. Has the mentioned bug report been addressed since then?

Regards,

Markus
"Markus Wollny" <Markus.Wollny@computec.de> writes: > http://postgresql.linux.cz/mhonarc/pgsql-ports/1999-01/msg00001.html > Has the mentioned bug report been adressed since then? I don't think it's a bug. TCP is not designed to detect connection failure instantly, and it's definitely not designed to waste network resources when neither side of a connection is sending anything. In the scenario described in the report, the backend is sitting idle and there's no way to know that the client side thinks the connection has failed. In recent releases we set the keepalive feature on, which means that after a sufficiently long timeout the server side will probe to see if the client is still there, and will then discover that the connection is dead. Whereupon the backend will receive a kernel error report and will exit. But up to that moment, there's no reason to think that anything is wrong. regards, tom lane
On Wed, Jul 03, 2002 at 03:23:39PM +0200, Markus Wollny wrote:
> Okay, now it's even more annoying...
>
> We just upgraded the server from 1GB to 2GB of RAM because it kept
> swapping out about 300MB.

Well, you've solved the swapping problem.

>  PID  USER     PRI NI SIZE  RSS  SHARE STAT %CPU %MEM TIME COMMAND
>  1652 postgres   9  0 469M  469M  467M S     0.0 23.3 2:21 postmaster
>  2041 postgres   9  0 431M  431M  429M S     1.2 21.4 1:32 postmaster
>  1588 postgres   9  0 411M  411M  410M S    15.6 20.4 1:25 postmaster
>  1597 postgres   9  0 289M  289M  287M S     0.1 14.3 0:31 postmaster

That %CPU refers to the average CPU over the lifetime of the process. For example, that first process has done nearly two and a half minutes of work. That's either a lot of queries or one really big one. Now I can imagine that many queries mapping in all of shared memory. Note that RSS is around the size of the shared memory you have. So all those 460M processes seem to be all the same memory, all shared. What exactly is the problem here? You're not swapping.

> And still the major problem in finding the cause of all this is to
> identify the query which causes these huge backends. [...] What puzzles
> me here is that all these 400MB+ backend processes are marked as idle
> (ps output) respectively sleeping, so they don't actually do much. But
> how come a sleeping process grabs 20% of 2GB of RAM and 15% of
> processing time?

My theory is that it's not using RAM at all; it's simply mapping the shared memory in, which inflates the RSS. If you reduce the amount of shared memory, does the RSS of the processes go down too?

> Another fact that hints at backends just eating memory without reason
> is that when I stop the database, processes keep lingering - kill just
> doesn't help, I have to kill -9 the last of the lot to get rid of them
> before restarting the database:

Never kill -9. Recipe for disaster. If you want those processes to die, perhaps you should stop the processes that are accessing the DB. They're the ones that are idle. Are you using persistent connections at all? If so, how long for? Note that pg_ctl has various stop modes: smart, fast and immediate. Make sure you're using the one you want.

>  1689 pts/0    S    0:00 postgres: postgres phppgadmin 212.123.109.25 idle
>  1690 pts/0    S    0:00 postgres: postgres pszone 212.123.109.25 idle
>  1691 pts/0    S    0:00 postgres: postgres saturn 212.123.109.25 idle
>  1693 pts/0    S    0:00 postgres: postgres template1 212.123.109.25 idle
>  1780 pts/0    S    0:00 postgres: postgres template1 212.123.108.149 idle
>  1781 pts/0    S    0:00 postgres: postgres template1 212.123.108.149 idle
>  1784 pts/0    S    0:00 postgres: postgres pcgames 212.123.108.149 idle

Note that these are not the processes you listed above. The process times don't match. You also have quite a lot of connections to template1.

>     3:11pm  up 2:03,  3 users,  load average: 3.24, 1.65, 1.81

Look at that load average. You have a whole CPU idle. CPU power is not your limitation. No swap, so that's not the problem. You really need to work out what is slow.

> Which isn't quite as bad any more - although it cannot be the optimum
> performance for this machine, I can't imagine.

So is it actually slow, or are you looking for some ethereal "faster"?

> Now what on earth can I do to get this DB running - and when I say
> "running" I don't mean "sort of crawling uphill"? Are there any

Look. 90% of performance improvement comes from modifying queries. Tweaking the config settings can really only account for so much. There is nothing in any of your messages where you have provided anything that we can use to help you. There are no optimal settings; they depend entirely on what your queries are.

> OS patches or environment settings I need to consider? Are there known
> memory leaks? It just doesn't seem to matter whichever settings I pick
> in postgresql.conf; the database just eats up any available memory in
> any case. I cannot use it for production purposes this way. I am quite
> despairing right now - and I am definitely running out of time. Which
> bits of Linux (glibc, kernel, whatnot) does PostgreSQL 7.2.1 need, and
> in which version, to run smoothly? As I mentioned before, we don't use
> any "fancy" features of the database like foreign keys or triggers
> (except one); it's just basic functionality that seems to fall down on
> us...

By the way, you do notice the little figure in your top output labeled "cached" that is around 1.7GB? I do believe that means the kernel is using that much memory for disk cache. You are not running out of memory by any stretch of the imagination (unless one of your queries is using a lot, but it appears to be getting freed at the end).

> I am willing to try almost anything - but I need to squeeze more
> performance out of this thing and I need some hints on which tools to
> use to identify the problem. And unfortunately I need all this as soon
> as possible... Hints and help are very, very much appreciated. Thank
> you!

Work out where your bottleneck is. It's not memory, and it doesn't appear to be CPU. With that much cache I can't imagine it's disk transfer rate either. So the only explanation is the individual queries.

As for hints, the straightforward method is the best. Find whatever action is the slowest and profile it. How much in the front end, how much in the database, how much in client latency. Only once you understand where the time is going can you do any meaningful optimisation. Tweaking config settings will not help you here.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
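A concrete way to do the per-query half of that profiling, since the server here is 7.2: EXPLAIN ANALYZE runs the statement and annotates every plan node with actual times and row counts. The query below is only an illustration:

    -- Compare estimated and actual costs for one suspect statement:
    EXPLAIN ANALYZE
    SELECT count(*)
      FROM ct_com_board_message
     WHERE board_id = 10
       AND father_id = 0;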
Martijn van Oosterhout <kleptog@svana.org> writes:
> My theory is that it's not using RAM at all; it's simply mapping the
> shared memory in, which inflates the RSS.

Interesting idea. By rights the whole shared memory segment should be charged against a backend's SIZE all the time, since it's certainly all there in the address space. But if for some reason a page of shmem isn't charged until first touched, then indeed that could explain Markus' observation --- and, perhaps, my inability to duplicate it on other platforms. (I tried on Red Hat 7.2, which you'd think would be fairly close to his SuSE release, but maybe it's different on just this point. HPUX seems to completely ignore shared segments in both SIZE and RSS, but it's way out in left field on a number of fronts...)

			regards, tom lane
On Wed, 3 Jul 2002 10:34:53 +0200, "Markus Wollny" <Markus.Wollny@computec.de> wrote:
> This one generates an overview of all the forum threads in one board,
> which has 41,624 messages, 2,971 of them thread-starters (FATHER_ID=0):
[long SQL statement follows]

Markus, that's strange. Your explain says:

> Sort  (cost=30695.27..30695.27 rows=7693 width=154)
>       (actual time=9745.94..9751.58 rows=4663 loops=1)

You report 2,971 thread-starters, but the plan actually returns 4,663 rows.

Anyway, that's not my point here. I'd like you to do an EXPLAIN ANALYZE of another SQL statement which IMHO does the same. From the schema you posted I see that (user_id, thread_id) is the primary key, or at least a unique index, on CT_COM_USER_THREAD_FOLLOW, so the sub-select

> (select count(*)
>    from CT_COM_USER_THREAD_FOLLOW
>   where USER_ID = '295798' and thread_id = MESSAGE.THREAD_ID) as TFUID

can only give 0 or 1. So following my first rule of thumb, "Avoid subselects; use joins wherever possible", I'd write:

    select MESSAGE.BOARD_ID
         , MESSAGE.THREAD_ID
         , MESSAGE.FATHER_ID
         , MESSAGE.MESSAGE_ID
         , MESSAGE.USER_ID
         , MESSAGE.USER_LOGIN as LOGIN
         , MESSAGE.USER_STATUS as STATUS
         , MESSAGE.USER_RIGHTS as RIGHTS
         , MESSAGE.TITLE
         , MESSAGE.COUNT_REPLY as COUNT_REPLY
         , to_char(MESSAGE.LAST_REPLY,'DD.MM.YY hh24:mi') as LAST_REPLY
         , round((date_part('epoch',CURRENT_TIMESTAMP)
                - date_part('epoch',MESSAGE.LAST_REPLY))/60) as diff_posting
         , to_char(MESSAGE.CREATED,'DD.MM.YY hh24:mi') as DATUM
         , count(TH.THREAD_ID) as TFUID
      from CT_COM_BOARD_MESSAGE MESSAGE
      left join CT_COM_USER_THREAD_FOLLOW TH
             on (TH.USER_ID = '295798' and TH.thread_id = MESSAGE.THREAD_ID)
     where (0=0)
       and MESSAGE.BOARD_ID = 10
       and MESSAGE.FATHER_ID = 0
       and MESSAGE.STATE_ID = 0
     order by MESSAGE.LAST_REPLY desc;

Could you EXPLAIN ANALYZE this and post the result? And please cross-check whether it returns the same result set as your original query.

Servus
 Manfred
On Wed, 03 Jul 2002 21:09:24 +0200, I wrote:
> select MESSAGE.BOARD_ID
>      , [...]
>      , count(TH.THREAD_ID) as TFUID

Oops! COUNT won't work unless you add GROUP BY <all other fields> at the end of the query. I had

    CASE WHEN th.thread_id IS NULL THEN 0 ELSE 1 END

here and thoughtlessly replaced it by COUNT(...) for brevity. How foolish!

Servus
 Manfred
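Putting the correction together: the select-list entry becomes a CASE expression and no GROUP BY is needed, since no aggregate remains. A sketch of the repaired query, under the same schema assumptions as Manfred's version above:

    select MESSAGE.BOARD_ID
         , MESSAGE.THREAD_ID
         , MESSAGE.FATHER_ID
         , MESSAGE.MESSAGE_ID
         , MESSAGE.USER_ID
         , MESSAGE.USER_LOGIN as LOGIN
         , MESSAGE.USER_STATUS as STATUS
         , MESSAGE.USER_RIGHTS as RIGHTS
         , MESSAGE.TITLE
         , MESSAGE.COUNT_REPLY as COUNT_REPLY
         , to_char(MESSAGE.LAST_REPLY,'DD.MM.YY hh24:mi') as LAST_REPLY
         , round((date_part('epoch',CURRENT_TIMESTAMP)
                - date_part('epoch',MESSAGE.LAST_REPLY))/60) as diff_posting
         , to_char(MESSAGE.CREATED,'DD.MM.YY hh24:mi') as DATUM
         -- 1 if this user follows the thread, else 0:
         , CASE WHEN TH.THREAD_ID IS NULL THEN 0 ELSE 1 END as TFUID
      from CT_COM_BOARD_MESSAGE MESSAGE
      left join CT_COM_USER_THREAD_FOLLOW TH
             on (TH.USER_ID = '295798' and TH.thread_id = MESSAGE.THREAD_ID)
     where MESSAGE.BOARD_ID = 10
       and MESSAGE.FATHER_ID = 0
       and MESSAGE.STATE_ID = 0
     order by MESSAGE.LAST_REPLY desc;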