Unexplained rapid growth in memory usage of idle backends

Hi,

We have an application that has a preforked worker pool architecture and opens a relatively large (70+) number of
persistent connections to PostgreSQL 14, and have `max_connections` set to 200 accordingly.

This has worked well enough in many deployments, and the backends corresponding to these connections mostly sit idle.
Occasionally they are hit with a query, and those workloads are almost entirely PL/pgSQL stored functions. These
functions mostly just aggregate output from various tables, and occasionally apply some business logic; no temporary
tables, no runaway recursive CTEs, nothing baroque.
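
To give a rough idea of the shape, they look something like this (the function and table names below are invented
purely for illustration, not taken from our schema):

    -- Illustrative only: a made-up function of the same general character as ours.
    CREATE OR REPLACE FUNCTION route_stats(p_customer_id integer)
    RETURNS TABLE (route_id integer, call_count bigint, avg_duration numeric)
    LANGUAGE plpgsql STABLE AS $$
    BEGIN
        RETURN QUERY
        SELECT c.route_id, count(*), avg(c.duration_secs)
          FROM calls c
         WHERE c.customer_id = p_customer_id
           AND c.started_at > now() - interval '1 hour'
         GROUP BY c.route_id;
    END;
    $$;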

Occasionally, about once every 2 months, we'll get a slow-motion implosion over a period of about 24 hours, where the
resident memory size of some of the backends shoots up from a reasonable few hundred MB to several gigs -- and will keep
growing, e.g.:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3932733 postgres  20   0  637740 327124 323840 R  10.7   0.5   0:00.11 postgres
 782930 postgres  20   0 3151360   2.8g 551704 S   9.7   4.5 221:43.17 postgres
 782839 postgres  20   0 2999896   2.7g 551832 S   8.7   4.3 220:54.55 postgres
3932734 postgres  20   0  636128 338168 336096 R   8.7   0.5   0:00.09 postgres
3932735 postgres  20   0  636128 319824 317768 R   8.7   0.5   0:00.09 postgres
 782851 postgres  20   0 3142152   2.8g 551872 S   7.8   4.5 221:22.62 postgres
 782855 postgres  20   0 3155144   2.8g 551828 S   7.8   4.5 221:38.59 postgres
 782897 postgres  20   0 3148808   2.8g 551724 S   7.8   4.5 220:05.94 postgres
 783106 postgres  20   0 3152416   2.8g 551912 S   7.8   4.5 221:24.40 postgres
 783168 postgres  20   0 2992592   2.7g 551968 S   7.8   4.2 220:36.06 postgres
 782843 postgres  20   0 3146772   2.8g 551944 R   4.9   4.5 221:38.46 postgres
 782778 postgres  20   0 3150956   2.8g 551852 R   3.9   4.5 220:49.88 postgres
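
I haven't yet managed to capture a memory-context breakdown from one of the bloated backends while it's in this state.
My understanding, untested on my side so far, is that on 14 the context tree of an affected backend can be dumped to
the server log by PID, and that a session can inspect its own contexts directly, along these lines:

    -- Run as superuser: writes the memory context tree of the given backend
    -- (PID taken from top or pg_stat_activity) to the server log.
    SELECT pg_log_backend_memory_contexts(782930);

    -- A session's own memory contexts, largest first.
    SELECT name, parent, total_bytes, used_bytes
      FROM pg_backend_memory_contexts
     ORDER BY total_bytes DESC
     LIMIT 15;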

It would be only natural to expect a cornucopia of heavyweight queries there, but there aren't any. `pg_stat_activity`
just reveals an active WalSender (for streaming replication) and the `pg_stat_activity` query (WHERE state <> 'idle')
itself. Once in a while, I'll catch a short-lived query in _one_ of these, if I run it often enough.
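
For concreteness, the check is roughly this (give or take the column list):

    -- What I run to look for active work; it essentially only ever shows the
    -- walsender and this query itself.
    SELECT pid, usename, client_addr, backend_type, state,
           now() - query_start AS running_for,
           left(query, 60)     AS query
      FROM pg_stat_activity
     WHERE state <> 'idle';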

The state of affairs deteriorates until either

1) the grim OOM reaper comes:

[19063737.017400] Out of memory: Killed process 536356 (postgres) total-vm:21703068kB, anon-rss:20804636kB, file-rss:4kB, shmem-rss:534896kB, UID:107 pgtables:42384kB oom_score_adj:0
[19063739.149563] oom_reaper: reaped process 536356 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:534896kB,

or

2) the client application is restarted, closing the persistent connections and terminating their corresponding
backends. This releases the memory back to the OS and all is well again; a more surgical version of the same remedy is
sketched just below.
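
As a stopgap, I assume a single bloated backend could also be put down by PID rather than bouncing the entire client
application, e.g.:

    -- Terminate one backend; only that worker's connection has to be re-established.
    SELECT pg_terminate_backend(782930);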

For reference:

----- postgresql.auto.conf -----
listen_addresses = '*'
timezone = 'UTC'
max_connections = '200'
effective_cache_size = '2GB'
work_mem = '128MB'
wal_level = 'replica'
fsync = 'on'
synchronous_commit = 'off'
checkpoint_completion_target = '0.75'
checkpoint_warning = '15min'
autovacuum = 'on'
autovacuum_freeze_max_age = '200000000'
vacuum_freeze_min_age = '10000'
vacuum_freeze_table_age = '1000000000'
primary_conninfo = 'user=replication passfile=''/root/.pgpass'' channel_binding=prefer host=xxx.xxx.xxx.xxx port=5432'
promote_trigger_file = '/var/lib/postgresql/14/main/failover_trig'
primary_slot_name = 'replication_db5'
log_min_duration_statement = '-1'
max_worker_processes = '4'
max_parallel_workers_per_gather = '4'
shared_buffers = '512MB'
--------------------------------
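
If it's useful for comparison, the memory-relevant values actually in effect on a live backend can be checked with
something like:

    -- Effective per-backend memory settings and where they come from.
    SELECT name, setting, unit, source
      FROM pg_settings
     WHERE name IN ('work_mem', 'hash_mem_multiplier', 'temp_buffers',
                    'shared_buffers', 'max_connections')
     ORDER BY name;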

And:

----- free -----
               total        used        free      shared  buff/cache   available
Mem:        65837856     2623820    47176960      567988    16037076    61989160
Swap:        1000444           0     1000444
----------------

Lastly, I will reiterate that there is no evidence of a slow-brewing memory leak. Memory usage seems more or less
steady-state in general, rising and falling in expected ways with peak and off-peak workloads. Then, suddenly, some of
the backends go into the aforementioned nosedive.

I have considered using a connection pooler to limit the overall memory footprint and blast radius, and while this
might address the problem from an operational point of view, it does not really resolve the essential question: why is
this happening in the first place, seemingly out of nowhere?

I will also say that while the client application does call a lot of stored functions, they are all rather
circumscribed in scope, in support of real-time routing decisions. These are not expensive reporting queries as might be
issued from an API or a user interface of some kind, for example. The client IPs on the problematic backends above
correspond to the client application, not to any other connected clients.

I'm at an utter loss as to how to troubleshoot or prevent this. Any insight would be deeply appreciated!

-- Alex

--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800



