v12 "won't fix" item regarding memory leak in "ATTACH PARTITIONwithout AEL"; (or, relcache ref counting)

From: Justin Pryzby
Date: Mon, Feb 24, 2020
This links to a long thread, from which I've tried to quote some of the
most important mails, below.
https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Won.27t_Fix

I wondered if there's an effort to pursue a resolution for v13?

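For context on the subject line: "without AEL" refers to the v12 change
(commit 898e5e329, discussed below) that lowered the lock ATTACH PARTITION
takes on the parent table from ACCESS EXCLUSIVE to SHARE UPDATE EXCLUSIVE.
A minimal sketch of the behavior that change enables, and of why a pinned
relcache entry can now see its PartitionDesc replaced underneath it (the
table names are mine, echoing the example quoted further down):

    -- Two tables matching the shape of the example quoted later on.
    CREATE TABLE parent (a text, b int) PARTITION BY LIST (a);
    CREATE TABLE child (a text, b int);

    -- Session 1: hold the parent open with only ACCESS SHARE.
    BEGIN;
    SELECT count(*) FROM parent;

    -- Session 2: on v12 this needs only SHARE UPDATE EXCLUSIVE on the
    -- parent (before v12 it needed ACCESS EXCLUSIVE), so it does not
    -- block behind session 1, meaning the parent's PartitionDesc can
    -- change while another backend still has the relcache entry pinned.
    ALTER TABLE parent ATTACH PARTITION child FOR VALUES IN ('x');
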
On Fri, Apr 12, 2019 at 11:42:24AM -0400, Tom Lane wrote in <31027.1555083744@sss.pgh.pa.us>:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Wed, Apr 10, 2019 at 05:03:21PM +0900, Amit Langote wrote:
> >> The problem lies in all branches that have partitioning, so it should be
> >> listed under Older Bugs, right?  You may have noticed that I posted
> >> patches for all branches down to 10.
> 
> > I have noticed.  The message from Tom upthread outlined that an open
> > item should be added, but this is not one.  That's what I wanted to
> > emphasize.  Thanks for making it clearer.
> 
> To clarify a bit: there's more than one problem here.  Amit added an
> open item about pre-existing leaks related to rd_partcheck.  (I'm going
> to review and hopefully push his fix for that today.)  However, there's
> a completely separate leak associated with mismanagement of rd_pdcxt,
> as I showed in [1], and it doesn't seem like we have consensus about
> what to do about that one.  AFAIK that's a new bug in 12 (caused by
> 898e5e329) and so it ought to be in the main open-items list.
> 
> This thread has discussed a bunch of possible future changes like
> trying to replace copying of relcache-provided data structures
> with reference-counting, but I don't think any such thing could be
> v12 material at this point.  We do need to fix the newly added
> leak though.
> 
>             regards, tom lane
> 
> [1] https://www.postgresql.org/message-id/10797.1552679128%40sss.pgh.pa.us
> 
> 

On Fri, Mar 15, 2019 at 05:41:47PM -0400, Robert Haas wrote in
<CA+Tgmoa5rT+ZR+Vv+q1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw@mail.gmail.com>:
> On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > More to the point, we turned *one* rebuild = false situation into
> > a bunch of rebuild = true situations.  I haven't studied it closely,
> > but I think a CCA animal would probably see O(N^2) rebuild = true
> > invocations in a query with N partitions, since each time
> > expand_partitioned_rtentry opens another child table, we'll incur
> > an AcceptInvalidationMessages call which leads to forced rebuilds
> > of the previously-pinned rels.  In non-CCA situations, almost always
> > nothing happens with the previously-examined relcache entries.
> 
> That's rather unfortunate.  I start to think that clobbering the cache
> "always" is too hard a line.
> 
> > I agree that copying data isn't great.  What I don't agree is that this
> > solution is better.  In particular, copying data out of the relcache
> > rather than expecting the relcache to hold still over long periods
> > is the way we've done things everywhere else, cf RelationGetIndexList,
> > RelationGetStatExtList, RelationGetIndexExpressions,
> > RelationGetIndexPredicate, RelationGetIndexAttrBitmap,
> > RelationGetExclusionInfo, GetRelationPublicationActions.  I don't care
> > for a patch randomly deciding to do things differently on the basis of an
> > unsupported-by-evidence argument that it might cost too much to copy the
> > data.  If we're going to make a push to reduce the amount of copying of
> > that sort that we do, it should be a separately (and carefully) designed
> > thing that applies to all the relcache substructures that have the issue,
> > not one-off hacks that haven't been reviewed thoroughly.
> 
> That's not an unreasonable argument.  On the other hand, if you never
> try new stuff, you lose opportunities to explore things that might
> turn out to be better and worth adopting more widely.
> 
> I am not very convinced that it makes sense to lump something like
> RelationGetIndexAttrBitmap in with something like
> RelationGetPartitionDesc.  RelationGetIndexAttrBitmap is returning a
> single Bitmapset, whereas the data structure RelationGetPartitionDesc
> is vastly larger and more complex.  You can't say "well, if it's best
> to copy 32 bytes of data out of the relcache every time we need it, it
> must also be right to copy 10k or 100k of data out of the relcache
> every time we need it."
> 
> There is another difference as well: there's a good chance that
> somebody is going to want to mutate a Bitmapset, whereas they had
> BETTER NOT think that they can mutate the PartitionDesc.  So returning
> an uncopied Bitmapset is kinda risky in a way that returning an
> uncopied PartitionDesc is not.
> 
> If we want an at-least-somewhat unified solution here, I think we need
> to bite the bullet and make the planner hold a reference to the
> relcache throughout planning.  (The executor already keeps it open, I
> believe.) Otherwise, how is the relcache supposed to know when it can
> throw stuff away and when it can't?  The only alternative seems to be
> to have each subsystem hold its own reference count, as I did with the
> PartitionDirectory stuff, which is not something we'd want to scale
> out.
> 
> > I especially don't care for the much-less-than-half-baked kluge of
> > chaining the old rd_pdcxt onto the new one and hoping that it will go away
> > at a suitable time.
> 
> It will go away at a suitable time, but maybe not at the soonest
> suitable time.  It wouldn't be hard to improve that, though.  The
> easiest thing to do, I think, would be to have an rd_oldpdcxt and
> stuff the old context there.  If there already is one there, parent
> the new one under it.  When RelationDecrementReferenceCount reduces
> the reference count to zero, destroy anything found in rd_oldpdcxt.
> With that change, things get destroyed at the earliest time at which
> we know the old things aren't referenced, instead of the earliest time
> at which they are not referenced + an invalidation arrives.
> 
> > regression=# create table parent (a text, b int) partition by list (a);
> > CREATE TABLE
> > regression=# create table child (a text, b int);
> > CREATE TABLE
> > regression=# do $$
> > regression$# begin
> > regression$# for i in 1..10000000 loop
> > regression$# alter table parent attach partition child for values in ('x');
> > regression$# alter table parent detach partition child;
> > regression$# end loop;
> > regression$# end $$;
> 
> Interesting example.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 

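To make concrete where the memory from that loop goes: each rebuild of the
parent's partition descriptor lives in its own memory context, and with the
chaining discussed above, old copies can survive for a long time, in the
worst case the whole session.  The query below is purely illustrative:
pg_backend_memory_contexts only appeared in v14 (which already carries the
fix noted at the end of this thread), and the context name "partition
descriptor" is my assumption about how those contexts are labeled.

    -- Count partition-descriptor contexts in the current backend
    -- (assumed context name; ident assumed to be the relation name).
    SELECT name, ident, count(*) AS n_contexts,
           sum(total_bytes) AS total_bytes
    FROM pg_backend_memory_contexts
    WHERE name = 'partition descriptor'
    GROUP BY name, ident;
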
On Sun, Apr 14, 2019 at 03:29:26PM -0400, Tom Lane wrote in <27380.1555270166@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I agree that copying data isn't great.  What I don't agree is that this
> >> solution is better.
> 
> > I am not very convinced that it makes sense to lump something like
> > RelationGetIndexAttrBitmap in with something like
> > RelationGetPartitionDesc.  RelationGetIndexAttrBitmap is returning a
> > single Bitmapset, whereas the data structure RelationGetPartitionDesc
> > is vastly larger and more complex.  You can't say "well, if it's best
> > to copy 32 bytes of data out of the relcache every time we need it, it
> > must also be right to copy 10k or 100k of data out of the relcache
> > every time we need it."
> 
> I did not say that.  What I did say is that they're both correct
> solutions.  Copying large data structures is clearly a potential
> performance problem, but that doesn't mean we can take correctness
> shortcuts.
> 
> > If we want an at-least-somewhat unified solution here, I think we need
> > to bite the bullet and make the planner hold a reference to the
> > relcache throughout planning.  (The executor already keeps it open, I
> > believe.) Otherwise, how is the relcache supposed to know when it can
> > throw stuff away and when it can't?
> 
> The real problem with that is that we *still* won't know whether it's
> okay to throw stuff away or not.  The issue with these subsidiary
> data structures is exactly that we're trying to reduce the lock strength
> required for changing one of them, and as soon as you make that lock
> strength less than AEL, you have the problem that that value may need
> to change in relcache entries despite them being pinned.  The code I'm
> complaining about is trying to devise a shortcut solution to exactly
> that situation ... and it's not a good shortcut.
> 
> > The only alternative seems to be to have each subsystem hold its own
> > reference count, as I did with the PartitionDirectory stuff, which is
> > not something we'd want to scale out.
> 
> Well, we clearly don't want to devise a separate solution for every
> subsystem, but it doesn't seem out of the question to me that we could
> build a "reference counted cache sub-structure" mechanism and apply it
> to each of these relcache fields.  Maybe it could be unified with the
> existing code in the typcache that does a similar thing.  Sure, this'd
> be a fair amount of work, but we've done it before.  Syscache entries
> didn't use to have reference counts, for example.
> 
> BTW, the problem I have with the PartitionDirectory stuff is exactly
> that it *isn't* a reference-counted solution.  If it were, we'd not
> have this problem of not knowing when we could free rd_pdcxt.
> 
> >> I especially don't care for the much-less-than-half-baked kluge of
> >> chaining the old rd_pdcxt onto the new one and hoping that it will go away
> >> at a suitable time.
> 
> > It will go away at a suitable time, but maybe not at the soonest
> > suitable time.  It wouldn't be hard to improve that, though.  The
> > easiest thing to do, I think, would be to have an rd_oldpdcxt and
> > stuff the old context there.  If there already is one there, parent
> > the new one under it.  When RelationDecrementReferenceCount reduces
> > the reference count to zero, destroy anything found in rd_oldpdcxt.
> 
> Meh.  While it seems likely that that would mask most practical problems,
> it still seems like covering up a wart with a dirty bandage.  In
> particular, that fix doesn't fix anything unless relcache reference counts
> go to zero pretty quickly --- which I'll just note is directly contrary
> to your enthusiasm for holding relcache pins longer.
> 
> I think that what we ought to do for v12 is have PartitionDirectory
> copy the data, and then in v13 work on creating real reference-count
> infrastructure that would allow eliminating the copy steps with full
> safety.  The $64 question is whether that really would cause unacceptable
> performance problems.  To look into that, I made the attached WIP patches.
> (These are functionally complete, but I didn't bother for instance with
> removing the hunk that 898e5e329 added to relcache.c, and the comments
> need work, etc.)  The first one just changes the PartitionDirectory
> code to do that, and then the second one micro-optimizes
> partition_bounds_copy() to make it somewhat less expensive, mostly by
> collapsing lots of small palloc's into one big one.
> 
> What I get for test cases like [1] is
> 
> single-partition SELECT, hash partitioning:
> 
> N       tps, HEAD       tps, patch
> 2       11426.243754    11448.615193
> 8       11254.833267    11374.278861
> 32      11288.329114    11371.942425
> 128     11222.329256    11185.845258
> 512     11001.177137    10572.917288
> 1024    10612.456470    9834.172965
> 4096    8819.110195     7021.864625
> 8192    7372.611355     5276.130161
> 
> single-partition SELECT, range partitioning:
> 
> N       tps, HEAD       tps, patch
> 2       11037.855338    11153.595860
> 8       11085.218022    11019.132341
> 32      10994.348207    10935.719951
> 128     10884.417324    10532.685237
> 512     10635.583411    9578.108915
> 1024    10407.286414    8689.585136
> 4096    8361.463829     5139.084405
> 8192    7075.880701     3442.542768
> 
> Now certainly these numbers suggest that avoiding the copy could be worth
> our trouble, but these results are still several orders of magnitude
> better than where we were two weeks ago [2].  Plus, this is an extreme
> case that's not really representative of real-world usage, since the test
> tables have neither indexes nor any data.  In practical situations the
> baseline would be lower and the dropoff less bad.  So I don't feel bad
> about shipping v12 with these sorts of numbers and hoping for more
> improvement later.
> 
>             regards, tom lane
> 
> [1] https://www.postgresql.org/message-id/3529.1554051867%40sss.pgh.pa.us
> 
> [2] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512BAC60%40g01jpexmbkw24
> 

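(Tom's numbers above come from the test case in [1]; the sketch below is my
guess at the general shape of such a benchmark, a hash-partitioned table
with no data or indexes plus a single-partition SELECT driven by pgbench,
not the actual script from that mail.)

    -- Parent with N = 1024 hash partitions, no data, no indexes.
    CREATE TABLE hp (id int, val text) PARTITION BY HASH (id);
    SELECT format(
             'CREATE TABLE hp_%s PARTITION OF hp FOR VALUES WITH (MODULUS 1024, REMAINDER %s)',
             i, i)
    FROM generate_series(0, 1023) AS i
    \gexec

    -- pgbench script (select.sql), one single-partition lookup per transaction:
    --   \set id random(1, 100000)
    --   SELECT * FROM hp WHERE id = :id;
    -- Run for example:  pgbench -n -f select.sql -T 30 -c 1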

On Wed, May 01, 2019 at 01:09:07PM -0400, Robert Haas wrote in
<CA+Tgmob-cska+-WUC7T-G4zkSJp7yum6M_bzYd4YFzwQ51qiow@mail.gmail.com>:
> On Wed, May 1, 2019 at 11:59 AM Andres Freund <andres@anarazel.de> wrote:
> > The message I'm replying to is marked as an open item. Robert, what do
> > you think needs to be done here before release?  Could you summarize,
> > so we then can see what others think?
> 
> The original issue on this thread was that hyrax started running out
> of memory when it hadn't been doing so previously.  That happened
> because, for complicated reasons, commit
> 898e5e3290a72d288923260143930fb32036c00c resulted in the leak being
> hit lots of times in CLOBBER_CACHE_ALWAYS builds instead of just once.
> Commits 2455ab48844c90419714e27eafd235a85de23232 and
> d3f48dfae42f9655425d1f58f396e495c7fb7812 fixed that problem.
> 
> In the email at issue, Tom complains about two things.  First, he
> complains about the fact that I try to arrange things so that relcache
> data remains valid for as long as required instead of just copying it.
> Second, he complains about the fact that repeated ATTACH and DETACH
> PARTITION operations can produce a slow session-lifespan memory leak.
> 
> I think it's reasonable to fix the second issue for v12.  I am not
> sure how important it is, because (1) the leak is small, (2) it seems
> unlikely that anyone would execute enough ATTACH/DETACH PARTITION
> operations in one backend for the leak to amount to anything
> significant, and (3) if a relcache flush ever happens when you don't
> have the relation open, all of the "leaked" memory will be un-leaked.
> However, I believe that we could fix it as follows.  First, invent
> rd_pdoldcxt and put the first old context there; if that pointer is
> already in use, then parent the new context under the old one.
> Second, in RelationDecrementReferenceCount, if the refcount hits 0,
> nuke rd_pdoldcxt and set the pointer back to NULL.  With that change,
> you would only keep the old PartitionDesc around until the ref count
> hits 0, whereas at present, you keep the old PartitionDesc around
> until an invalidation happens while the ref count is 0.
> 
> I think the first issue is not v12 material.  Tom proposed to fix it
> by copying all the relevant data out of the relcache, but his own
> performance results show a pretty significant hit, and AFAICS he
> hasn't pointed to anything that's actually broken by the current
> approach.  What I think should be done is actually generalize the
> approach I took in this patch, so that instead of the partition
> directory holding a reference count, the planner or executor holds
> one.  Then not only could people who want information about the
> PartitionDesc avoid copying stuff from the relcache, but maybe other
> things as well.  I think that would be significantly better than
> continuing to double-down on the current copy-everything approach,
> which really only works well in a world where a table can't have all
> that much metadata, which is clearly not true for PostgreSQL any more.
> I'm not sure that Tom is actually opposed to this approach -- although
> I may have misunderstood his position -- but where we disagree, I
> think, is that I see what I did in this commit as a stepping-stone
> toward a better world, and he sees it as something that should be
> killed with fire until that better world has fully arrived.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 
> 

On Tue, Jun 11, 2019 at 01:57:16PM -0400, Tom Lane wrote in <18286.1560275836@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Jun 6, 2019 at 2:48 AM Amit Langote <amitlangote09@gmail.com> wrote:
> >> Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch,
> >> which seems to fix this issue for me.
> 
> > Yeah, that looks right.  I think my patch was full of fuzzy thinking
> > and inadequate testing; thanks for checking it over and coming up with
> > the right solution.
> 
> > Anyone else want to look/comment?
> 
> I think the existing code is horribly ugly and this is even worse.
> It adds cycles to RelationDecrementReferenceCount which is a hotspot
> that has no business dealing with this; the invariants are unclear;
> and there's no strong reason to think there aren't still cases where
> we accumulate lots of copies of old partition descriptors during a
> sequence of operations.  Basically you're just doubling down on a
> wrong design.
> 
> As I said upthread, my current inclination is to do nothing in this
> area for v12 and then try to replace the whole thing with proper
> reference counting in v13.  I think the cases where we have a major
> leak are corner-case-ish enough that we can leave it as-is for one
> release.
> 
>             regards, tom lane
> 
> 

On Wed, Jun 12, 2019 at 03:11:56PM -0400, Tom Lane wrote in <3800.1560366716@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Jun 11, 2019 at 1:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> > I think the change is responsive to your previous complaint that the
> > timing of stuff getting freed is not very well pinned down.  With this
> > change, it's much more tightly pinned down: it happens when the
> > refcount goes to 0.  That is definitely not perfect, but I think that
> > it is a lot easier to come up with scenarios where the leak
> accumulates because no cache flush happens while the refcount is 0
> > than it is to come up with scenarios where the refcount never reaches
> > 0.  I agree that the latter type of scenario probably exists, but I
> > don't think we've come up with one yet.
> 
> I don't know why you think that's improbable, given that the changes
> around PartitionDirectory-s cause relcache entries to be held open much
> longer than before (something I've also objected to on this thread).
> 
> >> As I said upthread, my current inclination is to do nothing in this
> >> area for v12 and then try to replace the whole thing with proper
> >> reference counting in v13.  I think the cases where we have a major
> >> leak are corner-case-ish enough that we can leave it as-is for one
> >> release.
> 
> > Is this something you're planning to work on yourself?
> 
> Well, I'd rather farm it out to somebody else, but ...
> 
> > Do you have a
> > design in mind?  Is the idea to reference-count the PartitionDesc?
> 
> What we discussed upthread was refcounting each of the various
> large sub-objects of relcache entries, not just the partdesc.
> I think if we're going to go this way we should bite the bullet and fix
> them all.  I really want to burn down RememberToFreeTupleDescAtEOX() in
> particular ... it seems likely to me that that's also a source of
> unpleasant memory leaks.
> 
>             regards, tom lane
> 
> 



On Mon, Feb 24, 2020 at 7:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> This links to a long thread, from which I've tried to quote some of the
> most important mails, below.
> https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Won.27t_Fix
>
> I wondered if there's an effort to pursue a resolution for v13?

I think commit 5b9312378 in the master branch fixes this.  The commit
message mentions it like this:

    ...
    Also, fix things so that old copies of a relcache partition descriptor
    will be dropped when the cache entry's refcount goes to zero.  In the
    previous coding it was possible for such copies to survive for the
    lifetime of the session, as I'd complained of in a previous discussion.
    (This management technique still isn't perfect, but it's better than
    before.)
    ...
    ...Although this is a pre-existing
    problem, no back-patch: the patch seems too invasive to be safe to
    back-patch, and the bug it fixes is a corner case that seems
    relatively unlikely to cause problems in the field.

    Discussion: https://postgr.es/m/CA+HiwqFUzjfj9HEsJtYWcr1SgQ_=iCAvQ=O2Sx6aQxoDu4OiHw@mail.gmail.com
    Discussion: https://postgr.es/m/CA+TgmoY3bRmGB6-DUnoVy5fJoreiBJ43rwMrQRCdPXuKt4Ykaw@mail.gmail.com

Thanks,
Amit