Thread: Pg stuck at 100% cpu, for multiple days


Pg stuck at 100% cpu, for multiple days

From:
hubert depesz lubaczewski
Date:
Hi,
We hit a problem with Pg 12.6 (I know, we should upgrade, but that will
take a long time to prepare).

Anyway - it's 12.6 on aarch64.
A couple of days ago a replication slot was started, and it now seems to
be stuck.

=# select * from pg_stat_activity  where pid = 22697 \gx
─[ RECORD 1 ]────┬──────────────────────────────────────────────────────────
datid            │ 16591
datname          │ canvas
pid              │ 22697
usesysid         │ 16505
usename          │ <CENSORED>
application_name │ PostgreSQL JDBC Driver
client_addr      │ <CENSORED>
client_hostname  │ [null]
client_port      │ 43160
backend_start    │ 2021-08-18 02:12:05.758297+00
xact_start       │ [null]
query_start      │ 2021-08-18 02:12:05.772271+00
state_change     │ 2021-08-18 02:12:05.773428+00
wait_event_type  │ [null]
wait_event       │ [null]
state            │ active
backend_xid      │ [null]
backend_xmin     │ [null]
query            │ SELECT COUNT(1) FROM pg_publication WHERE pubname = 'cdc'
backend_type     │ walsender

=# select pg_current_wal_lsn(), pg_size_pretty( pg_current_wal_lsn() - sent_lsn), * from pg_stat_replication where pid = 22697 \gx
─[ RECORD 1 ]──────┬──────────────────────────────
pg_current_wal_lsn │ 1B14/718EA0B8
pg_size_pretty     │ 290 GB
pid                │ 22697
usesysid           │ 16505
usename            │ <CENSORED>
application_name   │ PostgreSQL JDBC Driver
client_addr        │ <CENSORED>
client_hostname    │ [null]
client_port        │ 43160
backend_start      │ 2021-08-18 02:12:05.758297+00
backend_xmin       │ [null]
state              │ streaming
sent_lsn           │ 1ACC/D8689A8
write_lsn          │ 1ACC/D527BD8
flush_lsn          │ 1ACC/C97DF48
replay_lsn         │ 1ACC/C97DF48
write_lag          │ 00:00:00.257041
flush_lag          │ 00:00:01.26665
replay_lag         │ 00:00:01.26665
sync_priority      │ 0
sync_state         │ async
reply_time         │ 1999-12-21 03:15:13.449225+00

top shows the process using 100% of cpu. I tried strace'ing, but strace doesn't
show *anything* - it just sits there.
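
(Which fits: a loop spinning purely in userspace makes no syscalls, so
strace has nothing to report; a sampling profiler, e.g. perf top -p 22697,
would show the hot function instead.)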

Got backtrace:

~# gdb --pid=22697 --batch -ex backtrace
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448
1448    ./build/../src/backend/utils/hash/dynahash.c: No such file or directory.
#0  hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448
#1  0x0000aaaac3042060 in RelfilenodeMapInvalidateCallback (arg=<optimized out>, relid=105496194) at
./build/../src/backend/utils/cache/relfilenodemap.c:64
#2  0x0000aaaac3033aa4 in LocalExecuteInvalidationMessage (msg=0xffff9b66eec8) at
./build/../src/backend/utils/cache/inval.c:595
#3  0x0000aaaac2ec8274 in ReorderBufferExecuteInvalidations (rb=0xaaaac326bb00 <errordata>, txn=0xaaaac326b998
<formatted_start_time>, txn=0xaaaac326b998 <formatted_start_time>) at
./build/../src/backend/replication/logical/reorderbuffer.c:2149
#4  ReorderBufferCommit (rb=0xaaaac326bb00 <errordata>, xid=xid@entry=2668396569, commit_lsn=187650393290540,
end_lsn=<optimized out>, commit_time=commit_time@entry=683222349268077, origin_id=origin_id@entry=0,
origin_lsn=origin_lsn@entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1770
#5  0x0000aaaac2ebd314 in DecodeCommit (xid=2668396569, parsed=0xffffdd90f7e0, buf=0xffffdd90f960, ctx=0xaaaaf5d396a0)
at ./build/../src/backend/replication/logical/decode.c:640
#6  DecodeXactOp (ctx=ctx@entry=0xaaaaf5d396a0, buf=0xffffdd90f960, buf@entry=0xffffdd90f9c0) at
./build/../src/backend/replication/logical/decode.c:248
#7  0x0000aaaac2ebd42c in LogicalDecodingProcessRecord (ctx=0xaaaaf5d396a0, record=0xaaaaf5d39938) at
./build/../src/backend/replication/logical/decode.c:117
#8  0x0000aaaac2ecfdfc in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2840
#9  0x0000aaaac2ed2228 in WalSndLoop (send_data=send_data@entry=0xaaaac2ecfd98 <XLogSendLogical>) at
./build/../src/backend/replication/walsender.c:2189
#10 0x0000aaaac2ed2efc in StartLogicalReplication (cmd=0xaaaaf5d175a8) at
./build/../src/backend/replication/walsender.c:1133
#11 exec_replication_command (cmd_string=cmd_string@entry=0xaaaaf5c0eb00 "START_REPLICATION SLOT cdc LOGICAL
1A2D/4B3640 (\"proto_version\" '1', \"publication_names\" 'cdc')") at
./build/../src/backend/replication/walsender.c:1549
#12 0x0000aaaac2f258a4 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaaf5c78cd8, dbname=<optimized out>,
username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4257
#13 0x0000aaaac2eac338 in BackendRun (port=0xaaaaf5c68070, port=0xaaaaf5c68070) at
./build/../src/backend/postmaster/postmaster.c:4484
#14 BackendStartup (port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4167
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#16 0x0000aaaac2ead364 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
./build/../src/backend/postmaster/postmaster.c:1398
#17 0x0000aaaac2c3ca5c in main (argc=5, argv=0xaaaaf5c07720) at ./build/../src/backend/main/main.c:228

The thing is - I can't close it with pg_terminate_backend(), and I'd
rather not kill -9, as it will, I think, close all other connections,
and this is a prod server.
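
For the record, what I tried (it reports the signal as sent, but nothing
happens - a hypothetical illustration of the attempt):

=# SELECT pg_terminate_backend(22697);
-- returns true (SIGTERM delivered), yet the process keeps using 100% cpu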

The other end of the connection was something in Kubernetes, and it no
longer exists.
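
For reference, how much WAL the (now orphaned) slot is pinning can be
checked with something like this - a minimal query, assuming the slot is
named 'cdc' as in the START_REPLICATION command in the backtrace:

=# SELECT slot_name, active, active_pid,
          pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
     FROM pg_replication_slots
    WHERE slot_name = 'cdc';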

Is there anything we could do about it?

Best regards,

depesz




Re: Pg stuck at 100% cpu, for multiple days

From:
Vijaykumar Jain
Date:

On Mon, 30 Aug 2021 at 19:15, hubert depesz lubaczewski <depesz@depesz.com> wrote:
> Hi,
> We hit a problem with Pg 12.6 (I know, we should upgrade, but that will
> take a long time to prepare).
>
> The other end of the connection was something in Kubernetes, and it no
> longer exists.


related? i was kind of expecting debezium when i saw cdc and java, but

DBZ-1596 Fix open resources while initializing publication by Naros · Pull
Request #1120 · debezium/debezium (github.com)
<https://github.com/debezium/debezium/pull/1120>
[DBZ-1596] AWS RDS Performance Insights screwed a little by non-closed
statement in "SELECT COUNT(1) FROM pg_publication" - Red Hat Issue Tracker
<https://issues.redhat.com/browse/DBZ-1596>

I may be wrong, but just in case, FYI.

--
Thanks,
Vijay
Mumbai, India

Re: Pg stuck at 100% cpu, for multiple days

From:
hubert depesz lubaczewski
Date:
On Mon, Aug 30, 2021 at 07:57:26PM +0530, Vijaykumar Jain wrote:
> related? i was kind of expecting debezium when i saw cdc and java, but
> DBZ-1596 Fix open resources while initializing publication by Naros · Pull
> Request #1120 · debezium/debezium (github.com)
> <https://github.com/debezium/debezium/pull/1120>
> [DBZ-1596] AWS RDS Performance Insights screwed a little by non-closed
> statement in "SELECT COUNT(1) FROM pg_publication" - Red Hat Issue Tracker
> <https://issues.redhat.com/browse/DBZ-1596>
> I may be wrong, but just in case, FYI.

I'll pass this information to the people that use the tool that fetches
data from the DB, but it still makes me ask: why does Pg end up in such
a place, where it doesn't do any syscalls, doesn't accept
pg_terminate_backend(), and is using 100% of cpu?

Best regards,

depesz




Re: Pg stuck at 100% cpu, for multiple days

From:
Joe Conway
Date:
On 8/30/21 10:36 AM, hubert depesz lubaczewski wrote:
> Anyway - it's 12.6 on aarch64. A couple of days ago a replication
> slot was started, and it now seems to be stuck.

> #0  hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448
> #1  0x0000aaaac3042060 in RelfilenodeMapInvalidateCallback (arg=<optimized out>, relid=105496194) at
> ./build/../src/backend/utils/cache/relfilenodemap.c:64
> #2  0x0000aaaac3033aa4 in LocalExecuteInvalidationMessage (msg=0xffff9b66eec8) at
> ./build/../src/backend/utils/cache/inval.c:595
> #3  0x0000aaaac2ec8274 in ReorderBufferExecuteInvalidations (rb=0xaaaac326bb00 <errordata>, txn=0xaaaac326b998
> <formatted_start_time>, txn=0xaaaac326b998 <formatted_start_time>) at
> ./build/../src/backend/replication/logical/reorderbuffer.c:2149
> #4  ReorderBufferCommit (rb=0xaaaac326bb00 <errordata>, xid=xid@entry=2668396569, commit_lsn=187650393290540,
> end_lsn=<optimized out>, commit_time=commit_time@entry=683222349268077, origin_id=origin_id@entry=0,
> origin_lsn=origin_lsn@entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1770
> #5  0x0000aaaac2ebd314 in DecodeCommit (xid=2668396569, parsed=0xffffdd90f7e0, buf=0xffffdd90f960,
> ctx=0xaaaaf5d396a0) at ./build/../src/backend/replication/logical/decode.c:640
> #6  DecodeXactOp (ctx=ctx@entry=0xaaaaf5d396a0, buf=0xffffdd90f960, buf@entry=0xffffdd90f9c0) at
> ./build/../src/backend/replication/logical/decode.c:248
> #7  0x0000aaaac2ebd42c in LogicalDecodingProcessRecord (ctx=0xaaaaf5d396a0, record=0xaaaaf5d39938) at
> ./build/../src/backend/replication/logical/decode.c:117
> #8  0x0000aaaac2ecfdfc in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2840
> #9  0x0000aaaac2ed2228 in WalSndLoop (send_data=send_data@entry=0xaaaac2ecfd98 <XLogSendLogical>) at
> ./build/../src/backend/replication/walsender.c:2189
> #10 0x0000aaaac2ed2efc in StartLogicalReplication (cmd=0xaaaaf5d175a8) at
> ./build/../src/backend/replication/walsender.c:1133
> #11 exec_replication_command (cmd_string=cmd_string@entry=0xaaaaf5c0eb00 "START_REPLICATION SLOT cdc LOGICAL
> 1A2D/4B3640 (\"proto_version\" '1', \"publication_names\" 'cdc')") at
> ./build/../src/backend/replication/walsender.c:1549
> #12 0x0000aaaac2f258a4 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaaf5c78cd8, dbname=<optimized out>,
> username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4257
> #13 0x0000aaaac2eac338 in BackendRun (port=0xaaaaf5c68070, port=0xaaaaf5c68070) at
> ./build/../src/backend/postmaster/postmaster.c:4484
> #14 BackendStartup (port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4167
> #15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
> #16 0x0000aaaac2ead364 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
> ./build/../src/backend/postmaster/postmaster.c:1398
> #17 0x0000aaaac2c3ca5c in main (argc=5, argv=0xaaaaf5c07720) at ./build/../src/backend/main/main.c:228
> 
> The thing is - I can't close it with pg_terminate_backend(), and I'd
> rather not kill -9, as it will, I think, close all other connections,
> and this is a prod server.

> still makes me ask: why does Pg end up in such a place, where it
> doesn't do any syscalls, doesn't accept pg_terminate_backend(), and
> is using 100% of cpu?
src/backend/utils/hash/dynahash.c:1448 is in the middle of a while loop, 
which is apparently not exiting.

There is no check for interrupts in there and it is a fairly tight loop 
which would explain both symptoms.
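
To make that concrete, here is a small standalone sketch (an illustration,
not the actual dynahash code) of how "SIGTERM only sets a flag" plus a
tight loop with no interrupt check produces exactly these symptoms:

#include <signal.h>
#include <stddef.h>

typedef struct Elem { struct Elem *link; } Elem;

/* PostgreSQL-style signal handling: the handler only sets a flag, which
 * is honored the next time CHECK_FOR_INTERRUPTS() runs. */
static volatile sig_atomic_t got_sigterm = 0;

static void handle_sigterm(int sig) { (void) sig; got_sigterm = 1; }

int main(void)
{
    signal(SIGTERM, handle_sigterm);

    /* A two-element hash chain corrupted into a cycle. */
    Elem a, b;
    a.link = &b;
    b.link = &a;

    /* Pure userspace pointer chasing: no syscalls (nothing for strace to
     * show) and no interrupt check (got_sigterm is never consulted), so
     * the loop spins at 100% cpu forever, immune to
     * pg_terminate_backend(). */
    volatile unsigned long visited = 0;
    for (Elem *cur = &a; cur != NULL; cur = cur->link)
        visited++;

    return 0;   /* never reached */
}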

As to how it got that way, I have to assume data corruption or a bug of 
some sort. I would repost the details to hackers for better visibility.

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: Pg stuck at 100% cpu, for multiple days

From:
hubert depesz lubaczewski
Date:
On Mon, Aug 30, 2021 at 11:04:33AM -0400, Joe Conway wrote:
> src/backend/utils/hash/dynahash.c:1448 is in the middle of a while loop,
> which is apparently not exiting.
> There is no check for interrupts in there and it is a fairly tight loop
> which would explain both symptoms.
> As to how it got that way, I have to assume data corruption or a bug of some
> sort. I would repost the details to hackers for better visibility.

Thanks. Done.

Best regards,

depesz