Discussion: Memory leak on subquery as scalar operand
There seems to be a memory leak in PostgreSQL for Linux (doesn't happen on Windows). At least in versions 12 to 14 (and not in 9.5). It can be reproduced with the following SQL script on an empty database.
* When repeated, the Linux process starts using more and more RAM until the OOM killer kills it.
* 'pg_backend_memory_contexts' doesn't show any increase.
* (gdb) p MemoryContextStats(TopMemoryContext) - doesn't show any increase.
CREATE TABLE IF NOT EXISTS public.leak_test
(
id integer NOT NULL,
CONSTRAINT leak_test_pkey PRIMARY KEY (id)
);
INSERT INTO leak_test(id)
SELECT id
FROM generate_series(1,100000) id
ON CONFLICT DO NOTHING;
SELECT
id,
(SELECT count(*) FROM leak_test x WHERE x.id=l.id) as x_result,
(SELECT count(*) FROM leak_test y WHERE y.id=l.id) as y_result
/* Leaks memory around 80 kB on each query, but only if two sub-queries are used. */
FROM leak_test l;
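The growth is easiest to see by sampling the backend's resident set size between executions. Below is a minimal sketch of such an observer (not part of the original report): it reads VmRSS from /proc, so it is Linux-only, and the commented driver loop with a psycopg2 connection is an assumption — any client that keeps a single session open (so the backend PID stays stable) works the same way.

```python
import re

def rss_kb(pid):
    """Resident set size of a process in kB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return int(re.search(r"VmRSS:\s*(\d+)", f.read()).group(1))

# Hypothetical driver loop (requires a running server and the table above;
# psycopg2 and the connection string are assumptions):
#
#   conn = psycopg2.connect("dbname=postgres")
#   cur = conn.cursor()
#   cur.execute("SELECT pg_backend_pid()")
#   backend_pid = cur.fetchone()[0]
#   for i in range(200):
#       cur.execute(TWO_SUBQUERY_SELECT)      # the SELECT shown above
#       cur.fetchall()
#       print(i, rss_kb(backend_pid))         # climbs steadily on affected builds
```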
Regards
Lauri
Lauri Laanmets <pcspets@gmail.com> writes:
> There seems to be a memory leak in PostgreSQL for Linux (doesn't happen on
> Windows). At least in versions 12 to 14 (and not in 9.5). It can be
> reproduced with the following SQL script on an empty database.

Thanks for the test case! Unfortunately, I can't see any ill effects
with it in currently released versions (I tried 12, 14, 15, and HEAD).
Which minor releases did you try? Do you have any non-default
settings or build options?

regards, tom lane
I have tested it on the version 12 that comes with Ubuntu 20.04.5 LTS. Then I upgraded to "PostgreSQL 14.5 (Ubuntu 14.5-2.pgdg20.04+2) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit" from "deb http://apt.postgresql.org/pub/repos/apt focal-pgdg main", but the behaviour is the same - if my example query is repeatedly executed in an open session, the postgres process starts using about 1 MB more RAM after roughly every 14 executions, until the OOM killer kills it.
I haven't changed any default settings and it can be reproduced also by pgAdmin Query Tool that executes the query under 'postgres' user. Although it's a bit tiresome to make so many executions manually.
It is interesting that it doesn't happen on Windows. Might be that it's caused by the OS or system in general but I have no idea what to try or test next.
The machine is: Linux 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Regards
Lauri
On Mon, 31 Oct 2022 at 15:49, Tom Lane (<tgl@sss.pgh.pa.us>) wrote:
Lauri Laanmets <pcspets@gmail.com> writes:
> There seems to be a memory leak in PostgreSQL for Linux (doesn't happen on
> Windows). At least in versions 12 to 14 (and not in 9.5). It can be
> reproduced with the following SQL script on an empty database.
Thanks for the test case! Unfortunately, I can't see any ill effects
with it in currently released versions (I tried 12, 14, 15, and HEAD).
Which minor releases did you try? Do you have any non-default
settings or build options?
regards, tom lane
On Mon, 31 Oct 2022 at 23:03, Lauri Laanmets <pcspets@gmail.com> wrote:
> There seems to be a memory leak in PostgreSQL for Linux (doesn't happen on
> Windows). At least in versions 12 to 14 (and not in 9.5). It can be
> reproduced with the following SQL script on an empty database.
>
> * When repeated, Linux process starts using more and more RAM until OOM-killer will kill it.
> * 'pg_backend_memory_contexts' doesn't show any increase.
> * (gdb) p MemoryContextStats(TopMemoryContext) - doesn't show any increase.

Does it still OOM if you do:

SET jit TO off;

?

David
David Rowley <dgrowleyml@gmail.com> writes:
> Does it still OOM if you do:
> SET jit TO off; ?

Ah, bingo --- I can reproduce the behavior if I use a JIT-enabled
build. So this is related to the known problems with leakage of
compiled objects. It's pretty odd though that it requires two
sub-selects to cause the problem.

regards, tom lane
On Tue, 1 Nov 2022 at 13:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's pretty odd though that it requires two
> sub-selects to cause the problem.

Perhaps that's just what it takes to bump the costs above the JIT threshold.

David
David Rowley <dgrowleyml@gmail.com> writes:
> On Tue, 1 Nov 2022 at 13:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's pretty odd though that it requires two
>> sub-selects to cause the problem.

> Perhaps that's just what it takes to bump the costs above the JIT threshold.

I see JIT being invoked either way:

regression=# explain verbose SELECT
  id,
  (SELECT count(*) FROM leak_test x WHERE x.id=l.id) as x_result,
  (SELECT count(*) FROM leak_test y WHERE y.id=l.id) as y_result
FROM leak_test l;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Seq Scan on public.leak_test l  (cost=0.00..865943.00 rows=100000 width=20)
   Output: l.id, (SubPlan 1), (SubPlan 2)
   SubPlan 1
     ->  Aggregate  (cost=4.31..4.32 rows=1 width=8)
           Output: count(*)
           ->  Index Only Scan using leak_test_pkey on public.leak_test x  (cost=0.29..4.31 rows=1 width=0)
                 Output: x.id
                 Index Cond: (x.id = l.id)
   SubPlan 2
     ->  Aggregate  (cost=4.31..4.32 rows=1 width=8)
           Output: count(*)
           ->  Index Only Scan using leak_test_pkey on public.leak_test y  (cost=0.29..4.31 rows=1 width=0)
                 Output: y.id
                 Index Cond: (y.id = l.id)
 JIT:
   Functions: 12
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(17 rows)

regression=# explain verbose SELECT
  id,
  (SELECT count(*) FROM leak_test x WHERE x.id=l.id) as x_result
FROM leak_test l;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Seq Scan on public.leak_test l  (cost=0.00..433693.00 rows=100000 width=12)
   Output: l.id, (SubPlan 1)
   SubPlan 1
     ->  Aggregate  (cost=4.31..4.32 rows=1 width=8)
           Output: count(*)
           ->  Index Only Scan using leak_test_pkey on public.leak_test x  (cost=0.29..4.31 rows=1 width=0)
                 Output: x.id
                 Index Cond: (x.id = l.id)
 JIT:
   Functions: 7
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(11 rows)

Maybe the different "inlining" choice makes a difference?

regards, tom lane
On Tue, 1 Nov 2022 at 17:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe the different "inlining" choice makes a difference?

The single subquery version also crashes for me, so perhaps it's just
the amount of memory that's being used and when the OOM killer is
triggering.

It crashes even when I set jit_inline_above_cost and
jit_optimize_above_cost above the query's cost.

postgres=# set jit_optimize_above_cost=10000000;
SET
postgres=# set jit_inline_above_cost=10000000;
SET
postgres=# set jit=0;
SET
postgres=# explain verbose SELECT
  id,
  (SELECT count(*) FROM leak_test y WHERE y.id=l.id) as y_result
FROM leak_test l;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Seq Scan on public.leak_test l  (cost=0.00..941723.86 rows=112965 width=12)
   Output: l.id, (SubPlan 1)
   SubPlan 1
     ->  Aggregate  (cost=8.31..8.32 rows=1 width=8)
           Output: count(*)
           ->  Index Only Scan using leak_test_pkey on public.leak_test y  (cost=0.29..8.31 rows=1 width=0)
                 Output: y.id
                 Index Cond: (y.id = l.id)
(8 rows)

postgres=# set jit=1;
SET
postgres=# explain verbose SELECT
  id,
  (SELECT count(*) FROM leak_test y WHERE y.id=l.id) as y_result
FROM leak_test l;
server closed the connection unexpectedly

David
David Rowley <dgrowleyml@gmail.com> writes:
> The single subquery version also crashes for me, so perhaps it's just
> the amount of memory that's being used and when the OOM killer is
> triggering.
> It crashes even when I set jit_inline_above_cost and
> jit_optimize_above_cost above the query's cost.

Hmm, maybe we're not seeing the same thing? For me, the behavior
seems similar to what the OP reported: there's a per-query leakage
but it's less than 100kB per query. It'd take more than a handful of
repetitions to get to an OOM failure. This is with LLVM 13.0.1 on
RHEL 8.6.

Also, as far as I can see there is no leak with the single-subquery
version. The process's reported VIRT consumption bounces around a
fair amount, but it doesn't go above 300MB, even after ~10min in a
tight plpgsql DO loop. VIRT bounces around a lot with the two-subquery
version too, actually, but there does seem to be a general uptrend
there. I did not have the patience to wait for actual OOM; it looked
like it'd take a good long while, tens of minutes at least.

If the behavior varies across LLVM versions, as is now seeming a bit
likely, it might be their bug not ours.

regards, tom lane
Hi,

On 2022-11-01 00:52:16 -0400, Tom Lane wrote:
> David Rowley <dgrowleyml@gmail.com> writes:
> > The single subquery version also crashes for me, so perhaps it's just
> > the amount of memory that's being used and when the OOM killer is
> > triggering.
> > It crashes even when I set jit_inline_above_cost and
> > jit_optimize_above_cost above the query's cost.
>
> Hmm, maybe we're not seeing the same thing?

I can't reproduce that either.

> For me, the behavior seems similar to what the OP reported: there's a
> per-query leakage but it's less than 100kB per query. It'd take more than a
> handful of repetitions to get to an OOM failure. This is with LLVM 13.0.1
> on RHEL 8.6.

This I can reproduce. Here's an updated patchset addressing this. This
query, for some reason, leaks a lot more aggressively than what I've
seen in the past, so I needed to reduce the amount of time until an
llvm context is recycled substantially. A bit of benchmarking showed no
negative consequences of going to 100 uses till recycling, even with
absurd settings (i.e. jit_*_cost = 0), but did show impact on lower
values.

Greetings,

Andres Freund
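The recycling approach described above — reuse one LLVM context across many compilations, but tear it down after a fixed number of uses so accumulated garbage (types, in LLVM's case) gets freed — can be sketched in a language-agnostic way. This is an illustration of the heuristic only, not the actual PostgreSQL/LLVM code; the class and names are invented, and 100 is the recycle threshold mentioned in the message.

```python
REUSE_MAX = 100  # recycle threshold; 100 uses is the value discussed above

class RecyclingContext:
    """Reuse an expensive context, but recreate it every REUSE_MAX uses so
    per-context garbage cannot accumulate without bound."""

    def __init__(self, factory):
        self._factory = factory  # callable that creates a fresh underlying context
        self._ctx = None
        self._uses = 0

    def get(self):
        # Recreate the context on first use and after every REUSE_MAX uses;
        # dropping the old context releases everything allocated within it.
        if self._ctx is None or self._uses >= REUSE_MAX:
            self._ctx = self._factory()
            self._uses = 0
        self._uses += 1
        return self._ctx
```

Picking the threshold trades recreation overhead against peak memory, which is why the thread later floats exposing it as a tunable.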
Attachments
> Does it still OOM if you do:
> SET jit TO off; ?
Indeed, it doesn't leak if JIT is off. Thank you for the workaround hint!
Regards
Lauri
On Tue, 1 Nov 2022 at 02:44, Tom Lane (<tgl@sss.pgh.pa.us>) wrote:
David Rowley <dgrowleyml@gmail.com> writes:
> Does it still OOM if you do:
> SET jit TO off; ?
Ah, bingo --- I can reproduce the behavior if I use a JIT-enabled
build. So this is related to the known problems with leakage of
compiled objects. It's pretty odd though that it requires two
sub-selects to cause the problem.
regards, tom lane
On Mon, Oct 31, 2022 at 10:51:32PM -0700, Andres Freund wrote:
> Hi,
>
> On 2022-11-01 00:52:16 -0400, Tom Lane wrote:
> > David Rowley <dgrowleyml@gmail.com> writes:
> > > The single subquery version also crashes for me, so perhaps it's just
> > > the amount of memory that's being used and when the OOM killer is
> > > triggering.
> > > It crashes even when I set jit_inline_above_cost and
> > > jit_optimize_above_cost above the query's cost.
> >
> > Hmm, maybe we're not seeing the same thing?
>
> I can't reproduce that either.

@David: does the query really crash as you described - with neither jit
inlining nor optimization? Is it due to OOM or something else? Does it
crash with/without Andres' patches? What version were you testing?

--
Justin
> On 1 Nov 2022, at 06:51, Andres Freund <andres@anarazel.de> wrote:
>> David Rowley <dgrowleyml@gmail.com> writes:
>> For me, the behavior seems similar to what the OP reported: there's a
>> per-query leakage but it's less than 100kB per query. It'd take more than a
>> handful of repetitions to get to an OOM failure. This is with LLVM 13.0.1
>> on RHEL 8.6.
>
> This I can reproduce. Here's an updated patchset addressing this. This query,
> for some reason, leaks a lot more aggressively than what I've seen in the
> past, so I needed to reduce the amount of time until an llvm context is
> recycled substantially. A bit of benchmarking showed no negative consequences
> of going to 100 uses till recycling, even with absurd settings
> (i.e. jit_*_cost = 0), but did show impact on lower values.

I had a look at this and I concur with the findings in this thread. I didn't
do any benchmarking, but running various tests I was unable to trigger an OOM.
Will do more testing and stressing of it.

A few small comments on the patchset:

From reading it seems that patches 0002 and 0003 can be committed regardless
of the other patches in this series. Were they included because they were
found while looking at this, or is there a deeper connection I'm missing?

0004: The commit message states: "That incurs some overhead, so only do so
after 10000 JITed queries.", but I fail to see how that's implemented. There
is currently recreation after 100 reuses; was the intention to have a
different number here, or is this just a leftover from an earlier patch
version?

+ /*
+  * The LLVM Context used by this JIT context. An LLVM context is reused
+  * across many compilations, but occasionally reset to prevent it using
+  * too much memory due to more and more types accumulating.
+  */
+ LLVMContextRef llvm_context;

llvm_context is added as a member in LLVMJitContext but is never set or read;
the static llvmjit.c:llvm_context is still used for everything. Is this a
leftover, or was the plan to move this to LLVMJitContext?

+ * FIXME: should split the handling of llvm_triple / llvm_layout out
+ * of llvm_create_types() - that doesn't need to be redone.

Agreed, that seems wasteful. AFAICT there would not be any reason to recreate
this once set?

+ /*
+  * Consider as cleaned up even if we skip doing so below, that way we can
+  * verify the tracking is correct (see llvm_shutdown()).
+  */
+ llvm_jit_context_in_use_count--;

Since this doesn't actually release even if llvm_jit_context_in_use_count goes
to zero here, this hunk in llvm_release_context is seemingly a bit at odds
with the following from jit/README: "If it is desirable to release resources
earlier, jit_release_context() can be used". If we want to verify the usage
tracking in llvm_shutdown, it's hard to see how we could do much else, but
maybe a note in the README that not all resources are guaranteed to be
released could be in order?

Regarding llvm_release_context we currently have this:

static void
llvm_release_context(JitContext *context)
{
    LLVMJitContext *llvm_context = (LLVMJitContext *) context;

..which shadows the "static LLVMContextRef llvm_context;" declared in this
patchset; we should probably rename the local var in llvm_release_context.
(This is referred to in the 0004 commit message.)

The attached 0005 is a WIP attempt to address a few of the FIXMEs in this
patchset on top of your 0001-0004.

--
Daniel Gustafsson
Attachments
> On 2 May 2023, at 15:33, Daniel Gustafsson <daniel@yesql.se> wrote:
> The attached 0005 is a WIP attempt to address a few of the FIXMEs in this
> patchset on top of your 0001-0004.

I've attached a rebased v5 which has the above fixes as well as a few smaller
ones missed in v4.

Testing with the reports linked to from the open item, it's clear that while
there is still a continuous rise in resident size, this slows it down
dramatically, with the size levelling out, which unpatched master doesn't do.
Reducing LLVMJIT_LLVM_CONTEXT_REUSE_MAX even further pushes the levelling off
further as well, making me wonder if this is worth a GUC, given that we're
stuck with the destructive LLVM linker for the foreseeable future?

--
Daniel Gustafsson
Attachments
Attached is a v6 rebase of this patchset, which curbs a memory leak in llvmjit
by explicitly using an LLVMContextRef for types, dropped and recreated at
intervals to free unused types.

The attached graph plots the memory usage of a backend continuously running
the query from the OP in this thread. The patched version (running with all
JIT costs at zero to get all inlining etc) levels off in memory usage, where
master just continues to grow until terminated.

There is more to do on llvmjit memory usage, but this is clearly a win for
queries which otherwise accumulate type leaks, potentially ending with an OOM.

0001 and 0002 are tangentially related, but are mainly of cleanup character.
0003 contains the LLVMContextRef work which is the meat of the patchset.

I think it would be good to get this in early in the v17 cycle such that we
have time to revisit the heuristic if need be. Thoughts?

--
Daniel Gustafsson
Attachments
> On 13 Sep 2023, at 15:24, Daniel Gustafsson <daniel@yesql.se> wrote:
> I think it would be good to get this in early in the v17 cycle such that we
> have time to revisit the heuristic if need be.

With an offlist +1 from Andres, and another couple of rounds of read-throughs
and jit_above_cost=0 check-worlds, I went ahead and applied this to master.
The buildfarm has green builds for LLVM-enabled builds on Linux, FreeBSD and
macOS.

--
Daniel Gustafsson
On 2023-09-27 14:07:16 +0200, Daniel Gustafsson wrote:
> > On 13 Sep 2023, at 15:24, Daniel Gustafsson <daniel@yesql.se> wrote:
> >
> > I think it would be good to get this in early in the v17 cycle such that we
> > have time to revisit the heuristic if need be.
>
> With an offlist +1 from Andres, and another couple of rounds of read-throughs
> and jit_above_cost=0 check-worlds I went ahead and applied this to master. The
> buildfarm has green builds for LLVM enabled builds on Linux, FreeBSD and macOS.

Thanks a lot for working on this!
> On 27 Sep 2023, at 14:07, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 13 Sep 2023, at 15:24, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>> I think it would be good to get this in early in the v17 cycle such that we
>> have time to revisit the heuristic if need be.
>
> With an offlist +1 from Andres, and another couple of rounds of read-throughs
> and jit_above_cost=0 check-worlds I went ahead and applied this to master. The
> buildfarm has green builds for LLVM enabled builds on Linux, FreeBSD and macOS.

This, as well as the prerequisite backpatch of bab150045bd9 to 12 and 13, has
now been backpatched to all supported branches. By now there have been green
builds by LLVM-enabled animals in the buildfarm for all branches.

--
Daniel Gustafsson