Обсуждение: FW: query pg_stat_ssl hang 100%cpu

Поиск
Список
Период
Сортировка

FW: query pg_stat_ssl hang 100%cpu

От
"James Pang (chaolpan)"
Дата:

Looks like an abnormal case.

 

From: James Pang (chaolpan)
Sent: Wednesday, September 6, 2023 9:41 AM
To: pgsql-performance@lists.postgresql.org
Subject: query pg_stat_ssl hang 100%cpu

 

Hi,

    PGv14.8, OS RHEL8, no SSL enabled in this database, we have a lot of client sessions who check it’s ssl state by  query, all other sessions got done very quickly, but only 1 session hang there in 100% cpu tens of hours, even pg_terminate_backend does not make it stopped either.  It looks like abnormal.

 

   select ssl from pg_stat_ssl where pid=pg_backend_pid();

    

testdb=#  select pid,usename,application_name,query_start,xact_start,state_change,wait_event_type,state,query from pg_stat_activity where pid=1245344;

   pid   | usename |    application_name    |         query_start          |          xact_start           |         state_change         | wait_event_type |

state  |                         query

---------+---------+------------------------+------------------------------+-------------------------------+------------------------------+-----------------+

--------+--------------------------------------------------------

1245344 | test    | PostgreSQL JDBC Driver | 2023-09-03 02:36:23.40238+00 | 2023-09-03 02:36:23.402331+00 | 2023-09-03 02:36:23.40238+00 |                 |

active | select ssl from pg_stat_ssl where pid=pg_backend_pid()

(1 row)

 

testdb=#  select pid,usename,application_name,query_start,xact_start,state_change,wait_event_type,state,query from pg_stat_activity where pid=1245344;

   pid   | usename |    application_name    |         query_start          |          xact_start           |         state_change         | wait_event_type |

state  |                         query

---------+---------+------------------------+------------------------------+-------------------------------+------------------------------+-----------------+

--------+--------------------------------------------------------

1245344 | test    | PostgreSQL JDBC Driver | 2023-09-03 02:36:23.40238+00 | 2023-09-03 02:36:23.402331+00 | 2023-09-03 02:36:23.40238+00 |                 |

active | select ssl from pg_stat_ssl where pid=pg_backend_pid()

(1 row)

 

testdb=#  select pid,usename,application_name,query_start,xact_start,state_change,wait_event_type,state,query from pg_stat_activity where pid=1245344;

   pid   | usename |    application_name    |         query_start          |          xact_start           |         state_change         | wait_event_type |

state  |                         query

---------+---------+------------------------+------------------------------+-------------------------------+------------------------------+-----------------+

--------+--------------------------------------------------------

1245344 | test    | PostgreSQL JDBC Driver | 2023-09-03 02:36:23.40238+00 | 2023-09-03 02:36:23.402331+00 | 2023-09-03 02:36:23.40238+00 |                 |

active | select ssl from pg_stat_ssl where pid=pg_backend_pid()

(1 row)

 

    PID      USER        PR  NI    VIRT    RES    SHR      S  %CPU  %MEM     TIME+   nMaj nMin WCHAN         COMMAND

1245344 postgres  20   0   32.5g  12468  12164 R  99.5     0.0         4219:12    0     1343     -             postgres: test testdb 10.250.193.40(48282) BIND

 

#0  ensure_record_cache_typmod_slot_exists (typmod=0) at typcache.c:1714

#1  0x000000000091185b in assign_record_type_typmod (tupDesc=<optimized out>, tupDesc@entry=0x27bc738) at typcache.c:2001

#2  0x000000000091df03 in internal_get_result_type (funcid=<optimized out>, call_expr=<optimized out>, rsinfo=<optimized out>,

    resultTypeId=<optimized out>, resultTupleDesc=0x7ffc9dff8cd0) at funcapi.c:393

#3  0x000000000091e263 in get_expr_result_type (expr=expr@entry=0x2792798, resultTypeId=resultTypeId@entry=0x7ffc9dff8ccc,

    resultTupleDesc=resultTupleDesc@entry=0x7ffc9dff8cd0) at funcapi.c:230

#4  0x00000000006a2fa5 in ExecInitFunctionScan (node=node@entry=0x273afa8, estate=estate@entry=0x269e948, eflags=eflags@entry=16) at nodeFunctionscan.c:370

#5  0x000000000069084e in ExecInitNode (node=node@entry=0x273afa8, estate=estate@entry=0x269e948, eflags=eflags@entry=16) at execProcnode.c:255

#6  0x000000000068a96d in InitPlan (eflags=16, queryDesc=0x273b2d8) at execMain.c:936

#7  standard_ExecutorStart (queryDesc=0x273b2d8, eflags=16) at execMain.c:263

#8  0x00007f67c2821d5d in pgss_ExecutorStart (queryDesc=0x273b2d8, eflags=<optimized out>) at pg_stat_statements.c:965

#9  0x00000000007fc226 in PortalStart (portal=portal@entry=0x26848b8, params=params@entry=0x0, eflags=eflags@entry=0, snapshot=snapshot@entry=0x0)

    at pquery.c:514

#10 0x00000000007fa27f in exec_bind_message (input_message=0x7ffc9dff90d0) at postgres.c:1995

#11 PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffc9dff9370, dbname=<optimized out>, username=<optimized out>) at postgres.c:4552

#12 0x000000000077a4ea in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4537

#13 BackendStartup (port=<optimized out>) at postmaster.c:4259

#14 ServerLoop () at postmaster.c:1745

#15 0x000000000077b363 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x256abc0) at postmaster.c:1417

#16 0x00000000004fec63 in main (argc=5, argv=0x256abc0) at main.c:209

 

Thanks,

 

James

Re: FW: query pg_stat_ssl hang 100%cpu

От
Michael Paquier
Дата:
On Thu, Sep 07, 2023 at 01:35:00AM +0000, James Pang (chaolpan) wrote:
>     PGv14.8, OS RHEL8, no SSL enabled in this database, we have a
>     lot of client sessions who check it's ssl state by  query, all
>     other sessions got done very quickly, but only 1 session hang
>     there in 100% cpu tens of hours, even pg_terminate_backend does
>     not make it stopped either.  It looks like abnormal.
>
>    select ssl from pg_stat_ssl where pid=pg_backend_pid();

This is hard to act on without more details or even a reproducible and
self-contained test case.  Even a java script based on the JDBC driver
would be OK for me, for example, if it helps digging into what you are
seeing.

> #0  ensure_record_cache_typmod_slot_exists (typmod=0) at typcache.c:1714
> #1  0x000000000091185b in assign_record_type_typmod (tupDesc=<optimized out>, tupDesc@entry=0x27bc738) at
typcache.c:2001
> #2  0x000000000091df03 in internal_get_result_type (funcid=<optimized out>, call_expr=<optimized out>,
rsinfo=<optimizedout>, 
>     resultTypeId=<optimized out>, resultTupleDesc=0x7ffc9dff8cd0) at funcapi.c:393
> #3  0x000000000091e263 in get_expr_result_type (expr=expr@entry=0x2792798,
resultTypeId=resultTypeId@entry=0x7ffc9dff8ccc,
>     resultTupleDesc=resultTupleDesc@entry=0x7ffc9dff8cd0) at funcapi.c:230
> #4  0x00000000006a2fa5 in ExecInitFunctionScan (node=node@entry=0x273afa8, estate=estate@entry=0x269e948,
eflags=eflags@entry=16)at nodeFunctionscan.c:370 
> #5  0x000000000069084e in ExecInitNode (node=node@entry=0x273afa8, estate=estate@entry=0x269e948,
eflags=eflags@entry=16)at execProcnode.c:255 
> #6  0x000000000068a96d in InitPlan (eflags=16, queryDesc=0x273b2d8) at execMain.c:936
> #7  standard_ExecutorStart (queryDesc=0x273b2d8, eflags=16) at execMain.c:263
> #8  0x00007f67c2821d5d in pgss_ExecutorStart (queryDesc=0x273b2d8, eflags=<optimized out>) at
pg_stat_statements.c:965
> #9  0x00000000007fc226 in PortalStart (portal=portal@entry=0x26848b8, params=params@entry=0x0, eflags=eflags@entry=0,
snapshot=snapshot@entry=0x0)
>     at pquery.c:514
> #10 0x00000000007fa27f in exec_bind_message (input_message=0x7ffc9dff90d0) at postgres.c:1995
> #11 PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffc9dff9370, dbname=<optimized out>, username=<optimized
out>)at postgres.c:4552 
> #12 0x000000000077a4ea in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4537
> #13 BackendStartup (port=<optimized out>) at postmaster.c:4259
> #14 ServerLoop () at postmaster.c:1745
> #15 0x000000000077b363 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x256abc0) at postmaster.c:1417
> #16 0x00000000004fec63 in main (argc=5, argv=0x256abc0) at main.c:209

This stack is referring to a code path where we are checking that some
of the type-related data associated to a record is around, but this
does not say exactly where the loop happens, so...  Are we looking on
a loop in the function execution itself from which the information of
pg_stat_ssl is retrieved (aka pg_stat_get_activity())?  Or is the
type cache somewhat broken because of the extended query protocol?
That's not really possible to see any evidence based on the
information provided, though it provides a few hits that can help.
FWIW, I've not heard about an issue like that in the field.

The first thing I would do is update to 14.9, which is the latest
version of Postgres available for this major version.
--
Michael

Вложения

RE: FW: query pg_stat_ssl hang 100%cpu

От
"James Pang (chaolpan)"
Дата:
Yes, this backend has been always on same call stack hours , no changed, and 100% cpu.

-----Original Message-----
From: Michael Paquier <michael@paquier.xyz>
Sent: Thursday, September 7, 2023 12:05 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

On Thu, Sep 07, 2023 at 01:35:00AM +0000, James Pang (chaolpan) wrote:
>     PGv14.8, OS RHEL8, no SSL enabled in this database, we have a
>     lot of client sessions who check it's ssl state by  query, all
>     other sessions got done very quickly, but only 1 session hang
>     there in 100% cpu tens of hours, even pg_terminate_backend does
>     not make it stopped either.  It looks like abnormal.
>
>    select ssl from pg_stat_ssl where pid=pg_backend_pid();

This is hard to act on without more details or even a reproducible and self-contained test case.  Even a java script
basedon the JDBC driver would be OK for me, for example, if it helps digging into what you are seeing. 

> #0  ensure_record_cache_typmod_slot_exists (typmod=0) at
> typcache.c:1714
> #1  0x000000000091185b in assign_record_type_typmod
> (tupDesc=<optimized out>, tupDesc@entry=0x27bc738) at typcache.c:2001
> #2  0x000000000091df03 in internal_get_result_type (funcid=<optimized out>, call_expr=<optimized out>,
rsinfo=<optimizedout>, 
>     resultTypeId=<optimized out>, resultTupleDesc=0x7ffc9dff8cd0) at
> funcapi.c:393
> #3  0x000000000091e263 in get_expr_result_type (expr=expr@entry=0x2792798,
resultTypeId=resultTypeId@entry=0x7ffc9dff8ccc,
>     resultTupleDesc=resultTupleDesc@entry=0x7ffc9dff8cd0) at
> funcapi.c:230
> #4  0x00000000006a2fa5 in ExecInitFunctionScan
> (node=node@entry=0x273afa8, estate=estate@entry=0x269e948,
> eflags=eflags@entry=16) at nodeFunctionscan.c:370
> #5  0x000000000069084e in ExecInitNode (node=node@entry=0x273afa8,
> estate=estate@entry=0x269e948, eflags=eflags@entry=16) at
> execProcnode.c:255
> #6  0x000000000068a96d in InitPlan (eflags=16, queryDesc=0x273b2d8) at
> execMain.c:936
> #7  standard_ExecutorStart (queryDesc=0x273b2d8, eflags=16) at
> execMain.c:263
> #8  0x00007f67c2821d5d in pgss_ExecutorStart (queryDesc=0x273b2d8,
> eflags=<optimized out>) at pg_stat_statements.c:965
> #9  0x00000000007fc226 in PortalStart (portal=portal@entry=0x26848b8, params=params@entry=0x0, eflags=eflags@entry=0,
snapshot=snapshot@entry=0x0)
>     at pquery.c:514
> #10 0x00000000007fa27f in exec_bind_message
> (input_message=0x7ffc9dff90d0) at postgres.c:1995
> #11 PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffc9dff9370,
> dbname=<optimized out>, username=<optimized out>) at postgres.c:4552
> #12 0x000000000077a4ea in BackendRun (port=<optimized out>,
> port=<optimized out>) at postmaster.c:4537
> #13 BackendStartup (port=<optimized out>) at postmaster.c:4259
> #14 ServerLoop () at postmaster.c:1745
> #15 0x000000000077b363 in PostmasterMain (argc=argc@entry=5,
> argv=argv@entry=0x256abc0) at postmaster.c:1417
> #16 0x00000000004fec63 in main (argc=5, argv=0x256abc0) at main.c:209

This stack is referring to a code path where we are checking that some of the type-related data associated to a record
isaround, but this does not say exactly where the loop happens, so...  Are we looking on a loop in the function
executionitself from which the information of pg_stat_ssl is retrieved (aka pg_stat_get_activity())?  Or is the type
cachesomewhat broken because of the extended query protocol? 
That's not really possible to see any evidence based on the information provided, though it provides a few hits that
canhelp. 
FWIW, I've not heard about an issue like that in the field.

The first thing I would do is update to 14.9, which is the latest version of Postgres available for this major version.
--
Michael



RE: FW: query pg_stat_ssl hang 100%cpu

От
"James Pang (chaolpan)"
Дата:

   Yes, this backend has been always on same call stack tens of hours  and 100% cpu there.   It's still hang there now,
butI can not reproduce that in other similar environment.  I found this query start a transaction "xact_start" from "
2023-09-0302:36:23" from pg_stat_activity, no idea why.    

-----Original Message-----
From: Michael Paquier <michael@paquier.xyz>
Sent: Thursday, September 7, 2023 12:05 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

On Thu, Sep 07, 2023 at 01:35:00AM +0000, James Pang (chaolpan) wrote:
>     PGv14.8, OS RHEL8, no SSL enabled in this database, we have a
>     lot of client sessions who check it's ssl state by  query, all
>     other sessions got done very quickly, but only 1 session hang
>     there in 100% cpu tens of hours, even pg_terminate_backend does
>     not make it stopped either.  It looks like abnormal.
>
>    select ssl from pg_stat_ssl where pid=pg_backend_pid();

This is hard to act on without more details or even a reproducible and self-contained test case.  Even a java script
basedon the JDBC driver would be OK for me, for example, if it helps digging into what you are seeing. 

> #0  ensure_record_cache_typmod_slot_exists (typmod=0) at
> typcache.c:1714
> #1  0x000000000091185b in assign_record_type_typmod
> (tupDesc=<optimized out>, tupDesc@entry=0x27bc738) at typcache.c:2001
> #2  0x000000000091df03 in internal_get_result_type (funcid=<optimized out>, call_expr=<optimized out>,
rsinfo=<optimizedout>, 
>     resultTypeId=<optimized out>, resultTupleDesc=0x7ffc9dff8cd0) at
> funcapi.c:393
> #3  0x000000000091e263 in get_expr_result_type (expr=expr@entry=0x2792798,
resultTypeId=resultTypeId@entry=0x7ffc9dff8ccc,
>     resultTupleDesc=resultTupleDesc@entry=0x7ffc9dff8cd0) at
> funcapi.c:230
> #4  0x00000000006a2fa5 in ExecInitFunctionScan
> (node=node@entry=0x273afa8, estate=estate@entry=0x269e948,
> eflags=eflags@entry=16) at nodeFunctionscan.c:370
> #5  0x000000000069084e in ExecInitNode (node=node@entry=0x273afa8,
> estate=estate@entry=0x269e948, eflags=eflags@entry=16) at
> execProcnode.c:255
> #6  0x000000000068a96d in InitPlan (eflags=16, queryDesc=0x273b2d8) at
> execMain.c:936
> #7  standard_ExecutorStart (queryDesc=0x273b2d8, eflags=16) at
> execMain.c:263
> #8  0x00007f67c2821d5d in pgss_ExecutorStart (queryDesc=0x273b2d8,
> eflags=<optimized out>) at pg_stat_statements.c:965
> #9  0x00000000007fc226 in PortalStart (portal=portal@entry=0x26848b8, params=params@entry=0x0, eflags=eflags@entry=0,
snapshot=snapshot@entry=0x0)
>     at pquery.c:514
> #10 0x00000000007fa27f in exec_bind_message
> (input_message=0x7ffc9dff90d0) at postgres.c:1995
> #11 PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffc9dff9370,
> dbname=<optimized out>, username=<optimized out>) at postgres.c:4552
> #12 0x000000000077a4ea in BackendRun (port=<optimized out>,
> port=<optimized out>) at postmaster.c:4537
> #13 BackendStartup (port=<optimized out>) at postmaster.c:4259
> #14 ServerLoop () at postmaster.c:1745
> #15 0x000000000077b363 in PostmasterMain (argc=argc@entry=5,
> argv=argv@entry=0x256abc0) at postmaster.c:1417
> #16 0x00000000004fec63 in main (argc=5, argv=0x256abc0) at main.c:209

This stack is referring to a code path where we are checking that some of the type-related data associated to a record
isaround, but this does not say exactly where the loop happens, so...  Are we looking on a loop in the function
executionitself from which the information of pg_stat_ssl is retrieved (aka pg_stat_get_activity())?  Or is the type
cachesomewhat broken because of the extended query protocol? 
That's not really possible to see any evidence based on the information provided, though it provides a few hits that
canhelp. 
FWIW, I've not heard about an issue like that in the field.

The first thing I would do is update to 14.9, which is the latest version of Postgres available for this major version.
--
Michael



Re: FW: query pg_stat_ssl hang 100%cpu

От
Thomas Munro
Дата:
> #0  ensure_record_cache_typmod_slot_exists (typmod=0) at typcache.c:1714

Are you able to print out the value of global variable
RecordCacheArrayLen?  I wonder if this loop in
ensure_record_cache_typmod_slot_exists() is not terminating:

        int32       newlen = RecordCacheArrayLen * 2;

        while (typmod >= newlen)
            newlen *= 2;



RE: FW: query pg_stat_ssl hang 100%cpu

От
"James Pang (chaolpan)"
Дата:
(gdb) bt
#0  ensure_record_cache_typmod_slot_exists (typmod=0) at typcache.c:1714
#1  0x000000000091185b in assign_record_type_typmod (tupDesc=<optimized out>, tupDesc@entry=0x27bc738) at
typcache.c:2001
#2  0x000000000091df03 in internal_get_result_type (funcid=<optimized out>, call_expr=<optimized out>,
rsinfo=<optimizedout>,
 
    resultTypeId=<optimized out>, resultTupleDesc=0x7ffc9dff8cd0) at funcapi.c:393
#3  0x000000000091e263 in get_expr_result_type (expr=expr@entry=0x2792798,
resultTypeId=resultTypeId@entry=0x7ffc9dff8ccc,
    resultTupleDesc=resultTupleDesc@entry=0x7ffc9dff8cd0) at funcapi.c:230
#4  0x00000000006a2fa5 in ExecInitFunctionScan (node=node@entry=0x273afa8, estate=estate@entry=0x269e948,
eflags=eflags@entry=16)
    at nodeFunctionscan.c:370
#5  0x000000000069084e in ExecInitNode (node=node@entry=0x273afa8, estate=estate@entry=0x269e948,
eflags=eflags@entry=16)
    at execProcnode.c:255
#6  0x000000000068a96d in InitPlan (eflags=16, queryDesc=0x273b2d8) at execMain.c:936
#7  standard_ExecutorStart (queryDesc=0x273b2d8, eflags=16) at execMain.c:263
#8  0x00007f67c2821d5d in pgss_ExecutorStart (queryDesc=0x273b2d8, eflags=<optimized out>) at pg_stat_statements.c:965
#9  0x00000000007fc226 in PortalStart (portal=portal@entry=0x26848b8, params=params@entry=0x0, eflags=eflags@entry=0,
    snapshot=snapshot@entry=0x0) at pquery.c:514
#10 0x00000000007fa27f in exec_bind_message (input_message=0x7ffc9dff90d0) at postgres.c:1995
#11 PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffc9dff9370, dbname=<optimized out>, username=<optimized out>)
    at postgres.c:4552
#12 0x000000000077a4ea in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4537
#13 BackendStartup (port=<optimized out>) at postmaster.c:4259
#14 ServerLoop () at postmaster.c:1745
#15 0x000000000077b363 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x256abc0) at postmaster.c:1417
#16 0x00000000004fec63 in main (argc=5, argv=0x256abc0) at main.c:209
(gdb) p  RecordCacheArrayLen
$3 = 0

-----Original Message-----
From: Thomas Munro <thomas.munro@gmail.com> 
Sent: Thursday, September 7, 2023 5:31 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: Michael Paquier <michael@paquier.xyz>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

> #0  ensure_record_cache_typmod_slot_exists (typmod=0) at 
> typcache.c:1714

Are you able to print out the value of global variable RecordCacheArrayLen?  I wonder if this loop in
ensure_record_cache_typmod_slot_exists() is not terminating:

        int32       newlen = RecordCacheArrayLen * 2;

        while (typmod >= newlen)
            newlen *= 2;

RE: FW: query pg_stat_ssl hang 100%cpu

От
"James Pang (chaolpan)"
Дата:
#0  ensure_record_cache_typmod_slot_exists (typmod=0) at typcache.c:1714
#1  0x000000000091185b in assign_record_type_typmod (tupDesc=<optimized out>, tupDesc@entry=0x27bc738) at
typcache.c:2001
#2  0x000000000091df03 in internal_get_result_type (funcid=<optimized out>, call_expr=<optimized out>,
rsinfo=<optimizedout>,
 
    resultTypeId=<optimized out>, resultTupleDesc=0x7ffc9dff8cd0) at funcapi.c:393
#3  0x000000000091e263 in get_expr_result_type (expr=expr@entry=0x2792798,
resultTypeId=resultTypeId@entry=0x7ffc9dff8ccc,
    resultTupleDesc=resultTupleDesc@entry=0x7ffc9dff8cd0) at funcapi.c:230
#4  0x00000000006a2fa5 in ExecInitFunctionScan (node=node@entry=0x273afa8, estate=estate@entry=0x269e948,
eflags=eflags@entry=16)at nodeFunctionscan.c:370
 
#5  0x000000000069084e in ExecInitNode (node=node@entry=0x273afa8, estate=estate@entry=0x269e948,
eflags=eflags@entry=16)at execProcnode.c:255
 
#6  0x000000000068a96d in InitPlan (eflags=16, queryDesc=0x273b2d8) at execMain.c:936
#7  standard_ExecutorStart (queryDesc=0x273b2d8, eflags=16) at execMain.c:263
#8  0x00007f67c2821d5d in pgss_ExecutorStart (queryDesc=0x273b2d8, eflags=<optimized out>) at pg_stat_statements.c:965
#9  0x00000000007fc226 in PortalStart (portal=portal@entry=0x26848b8, params=params@entry=0x0, eflags=eflags@entry=0,
snapshot=snapshot@entry=0x0)
    at pquery.c:514
#10 0x00000000007fa27f in exec_bind_message (input_message=0x7ffc9dff90d0) at postgres.c:1995
#11 PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffc9dff9370, dbname=<optimized out>, username=<optimized out>)
atpostgres.c:4552
 
#12 0x000000000077a4ea in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4537
#13 BackendStartup (port=<optimized out>) at postmaster.c:4259
#14 ServerLoop () at postmaster.c:1745
#15 0x000000000077b363 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x256abc0) at postmaster.c:1417
#16 0x00000000004fec63 in main (argc=5, argv=0x256abc0) at main.c:209
(gdb) p  RecordCacheArrayLen
$1 = 0
(gdb) p  RecordCacheArrayLen
$2 = 0
(gdb) p  RecordCacheArrayLen
$3 = 0
-----Original Message-----
From: Thomas Munro <thomas.munro@gmail.com> 
Sent: Thursday, September 7, 2023 5:31 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: Michael Paquier <michael@paquier.xyz>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

> #0  ensure_record_cache_typmod_slot_exists (typmod=0) at 
> typcache.c:1714

Are you able to print out the value of global variable RecordCacheArrayLen?  I wonder if this loop in
ensure_record_cache_typmod_slot_exists() is not terminating:

        int32       newlen = RecordCacheArrayLen * 2;

        while (typmod >= newlen)
            newlen *= 2;

Re: FW: query pg_stat_ssl hang 100%cpu

От
Thomas Munro
Дата:
On Thu, Sep 7, 2023 at 9:38 PM James Pang (chaolpan) <chaolpan@cisco.com> wrote:
> (gdb) p  RecordCacheArrayLen
> $3 = 0

Clearly this system lacks check against wrapping around, but it must
be hard work to allocate billions of typmods...

Or maybe if in an earlier call we assigned RecordCacheArray but the
allocation of RecordIdentifierArray failed (a clue would be that it is
still NULL), we never manage to assign RecordCacheArrayLen a non-zero
value?  But it must be unlikely for a small allocation to fail...



RE: FW: query pg_stat_ssl hang 100%cpu

От
"James Pang (chaolpan)"
Дата:
Yes, checked the server history logs, we found when the backend starting hang there, operating system has out of
memorythat may lead to the allocation of RecordIdentierArray failed.   
 

Thanks,

James

-----Original Message-----
From: Thomas Munro <thomas.munro@gmail.com> 
Sent: Thursday, September 7, 2023 6:01 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: Michael Paquier <michael@paquier.xyz>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

On Thu, Sep 7, 2023 at 9:38 PM James Pang (chaolpan) <chaolpan@cisco.com> wrote:
> (gdb) p  RecordCacheArrayLen
> $3 = 0

Clearly this system lacks check against wrapping around, but it must be hard work to allocate billions of typmods...

Or maybe if in an earlier call we assigned RecordCacheArray but the allocation of RecordIdentifierArray failed (a clue
wouldbe that it is still NULL), we never manage to assign RecordCacheArrayLen a non-zero value?  But it must be
unlikelyfor a small allocation to fail...
 

Re: FW: query pg_stat_ssl hang 100%cpu

От
Thomas Munro
Дата:
On Thu, Sep 7, 2023 at 10:31 PM James Pang (chaolpan)
<chaolpan@cisco.com> wrote:
>    Yes, checked the server history logs, we found when the backend starting hang there, operating system has out of
memorythat may lead to the allocation of RecordIdentierArray failed. 

Can you please print out RecordCacheArray and RecordIdentifierArray?
For that theory, the first one must be non-NULL and the second one
NULL.  That would lead to the infinite loop, I think.



RE: FW: query pg_stat_ssl hang 100%cpu

От
"James Pang (chaolpan)"
Дата:
#0  ensure_record_cache_typmod_slot_exists (typmod=0) at typcache.c:1714
#1  0x000000000091185b in assign_record_type_typmod (tupDesc=<optimized out>, tupDesc@entry=0x27bc738) at
typcache.c:2001
#2  0x000000000091df03 in internal_get_result_type (funcid=<optimized out>, call_expr=<optimized out>,
rsinfo=<optimizedout>,
 
    resultTypeId=<optimized out>, resultTupleDesc=0x7ffc9dff8cd0) at funcapi.c:393
#3  0x000000000091e263 in get_expr_result_type (expr=expr@entry=0x2792798,
resultTypeId=resultTypeId@entry=0x7ffc9dff8ccc,
    resultTupleDesc=resultTupleDesc@entry=0x7ffc9dff8cd0) at funcapi.c:230
#4  0x00000000006a2fa5 in ExecInitFunctionScan (node=node@entry=0x273afa8, estate=estate@entry=0x269e948,
eflags=eflags@entry=16)at nodeFunctionscan.c:370
 
#5  0x000000000069084e in ExecInitNode (node=node@entry=0x273afa8, estate=estate@entry=0x269e948,
eflags=eflags@entry=16)at execProcnode.c:255
 
#6  0x000000000068a96d in InitPlan (eflags=16, queryDesc=0x273b2d8) at execMain.c:936
#7  standard_ExecutorStart (queryDesc=0x273b2d8, eflags=16) at execMain.c:263
#8  0x00007f67c2821d5d in pgss_ExecutorStart (queryDesc=0x273b2d8, eflags=<optimized out>) at pg_stat_statements.c:965
#9  0x00000000007fc226 in PortalStart (portal=portal@entry=0x26848b8, params=params@entry=0x0, eflags=eflags@entry=0,
snapshot=snapshot@entry=0x0)
    at pquery.c:514
#10 0x00000000007fa27f in exec_bind_message (input_message=0x7ffc9dff90d0) at postgres.c:1995
#11 PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffc9dff9370, dbname=<optimized out>, username=<optimized out>)
atpostgres.c:4552
 
#12 0x000000000077a4ea in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4537
#13 BackendStartup (port=<optimized out>) at postmaster.c:4259
#14 ServerLoop () at postmaster.c:1745
#15 0x000000000077b363 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x256abc0) at postmaster.c:1417
#16 0x00000000004fec63 in main (argc=5, argv=0x256abc0) at main.c:209
(gdb) p RecordCacheArray
$1 = (TupleDesc *) 0x7f5fac365d90
(gdb) p RecordIdentifierArray
$2 = (uint64 *) 0x0

-----Original Message-----
From: Thomas Munro <thomas.munro@gmail.com> 
Sent: Thursday, September 7, 2023 6:36 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: Michael Paquier <michael@paquier.xyz>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

On Thu, Sep 7, 2023 at 10:31 PM James Pang (chaolpan) <chaolpan@cisco.com> wrote:
>    Yes, checked the server history logs, we found when the backend starting hang there, operating system has out of
memorythat may lead to the allocation of RecordIdentierArray failed.
 

Can you please print out RecordCacheArray and RecordIdentifierArray?
For that theory, the first one must be non-NULL and the second one NULL.  That would lead to the infinite loop, I
think.

Re: FW: query pg_stat_ssl hang 100%cpu

От
Thomas Munro
Дата:
On Thu, Sep 7, 2023 at 10:39 PM James Pang (chaolpan)
<chaolpan@cisco.com> wrote:
> (gdb) p RecordCacheArray
> $1 = (TupleDesc *) 0x7f5fac365d90
> (gdb) p RecordIdentifierArray
> $2 = (uint64 *) 0x0

Hah, yeah that's it, and you've been extremely unlucky to hit it.
ensure_record_cache_typmod_slot_exists() should be more careful about
cleaning up on allocation failure, to avoid this state.



Re: FW: query pg_stat_ssl hang 100%cpu

От
Thomas Munro
Дата:
On Thu, Sep 7, 2023 at 10:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Sep 7, 2023 at 10:39 PM James Pang (chaolpan)
> <chaolpan@cisco.com> wrote:
> > (gdb) p RecordCacheArray
> > $1 = (TupleDesc *) 0x7f5fac365d90
> > (gdb) p RecordIdentifierArray
> > $2 = (uint64 *) 0x0
>
> Hah, yeah that's it, and you've been extremely unlucky to hit it.
> ensure_record_cache_typmod_slot_exists() should be more careful about
> cleaning up on allocation failure, to avoid this state.

I think the lazy fix would be to re-order those allocations.  A
marginally more elegant fix would be to merge the arrays, as in the
attached.  Thoughts?

Вложения

Re: FW: query pg_stat_ssl hang 100%cpu

От
Michael Paquier
Дата:
On Thu, Sep 7, 2023 at 10:39 PM James Pang (chaolpan)
<chaolpan(at)cisco(dot)com> wrote:
> (gdb) p RecordCacheArray
> $1 = (TupleDesc *) 0x7f5fac365d90
> (gdb) p RecordIdentifierArray
> $2 = (uint64 *) 0x0

Oh, yeah, this stack is broken.  You have been really unlucky to hit
that.  This can randomly cause any session to get stuck, and no need
for the extended query protocol here.

(I am not sure how, but my email server has somewhat not been able to
get the previous messages from James.  Anyway.)

On Fri, Sep 08, 2023 at 11:45:51AM +1200, Thomas Munro wrote:
> I think the lazy fix would be to re-order those allocations.  A
> marginally more elegant fix would be to merge the arrays, as in the
> attached.  Thoughts?

So, ensure_record_cache_typmod_slot_exists() would allocate the
initial RecordCacheArray and if it fails on the second one it keeps
RecordCacheArrayLen at 0.  When coming back again to this code path,
the second part of the routine causes an infinite loop because the
allocation has been done, but the tracked length is 0.  Fun.

This is broken since 4b93f57 where the second array has been
introduced.  Getting that fixed before 11 is EOL is nice as it was
introduced there, so good timing.

There is a repalloc_extended(), but I cannot get excited to use
MCXT_ALLOC_NO_OOM in this code path if there is a simpler method to
avoid this issue with a single allocation for the all information set.

+static RecordCacheArrayEntry * RecordCacheArray = NULL;

pgindent is annoyed by that..  typedefs.list has been updated in your
patch, so I guess that you missed one extra indentation after this is
refreshed.

Note that RememberToFreeTupleDescAtEOX() does something similar to the
type cache, and it uses the same approach as your patch.

+1 to your proposal of using a struct for the entries in the cache.
--
Michael

Вложения

Re: FW: query pg_stat_ssl hang 100%cpu

От
Thomas Munro
Дата:
On Fri, Sep 8, 2023 at 2:48 PM Michael Paquier <michael@paquier.xyz> wrote:
> +1 to your proposal of using a struct for the entries in the cache.

Cool, I'll push/back-patch after 16.0.  Even though this seems
simple enough, it's an extremely low probability failure and I'd
rather keep out of REL_16_STABLE's way...



Re: FW: query pg_stat_ssl hang 100%cpu

От
Michael Paquier
Дата:
On Mon, Sep 11, 2023 at 10:37:34AM +1200, Thomas Munro wrote:
> Cool, I'll push/back-patch after 16.0.  Even though this seems
> simple enough, it's an extremely low probability failure and I'd
> rather keep out of REL_16_STABLE's way...

+1.
--
Michael

Вложения

Re: FW: query pg_stat_ssl hang 100%cpu

От
Thomas Munro
Дата:
On Mon, Sep 11, 2023 at 10:47 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Sep 11, 2023 at 10:37:34AM +1200, Thomas Munro wrote:
> > Cool, I'll push/back-patch after 16.0.  Even though this seems
> > simple enough, it's an extremely low probability failure and I'd
> > rather keep out of REL_16_STABLE's way...
>
> +1.

Done.