Обсуждение: Partitionwise join fails under GEQO
If you run the core regression tests with geqo_threshold = 2 (injected, say, via ALTER SYSTEM SET), you will notice the earlier tests showing some join ordering differences, which is unsurprising ... but when it gets to partition_join, that test just dumps core. Investigation shows that the crash is due to a corrupt EquivalenceClass member. It gets that way because add_child_join_rel_equivalences is unaware of the planner's memory allocation policies, and feels free to mess with the EC data structures during join planning. When GEQO's transient memory context is thrown away after one round of GEQO planning, some still-referenced EC data goes with it, resulting in a crash during the next round. I band-aided around that with the attached patch, which is enough to prevent the crash, but it's entirely unsatisfactory as a production solution. The problem is that each GEQO round will re-add the same child EC members, since add_child_join_rel_equivalences has absolutely no concept that the members it wants might be there already. Since GEQO tends to use a lot of rounds, this can run to significant memory bloat, not to mention a pretty serious speed hit while EC operations thumb through all of those duplicate expressions. Now that I've seen this, I wonder whether add_child_join_rel_equivalences might not be making duplicate EC entries even without GEQO. Is there any guarantee that it's not called repeatedly on related join-rel sets? So I think that in addition to this, we need some check to see if the derived EC child member is already there. It's likely that checking for an em_relids match would be sufficient to eliminate irrelevant members quickly. regards, tom lane diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index b68a5a0ec7..8507cc8ef4 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -138,6 +138,8 @@ process_equivalence(PlannerInfo *root, *em2; ListCell *lc1; + Assert(CurrentMemoryContext == root->planner_cxt); + /* Should not already be marked as having generated an eclass */ Assert(restrictinfo->left_ec == NULL); Assert(restrictinfo->right_ec == NULL); @@ -2253,6 +2255,8 @@ add_child_rel_equivalences(PlannerInfo *root, Relids child_relids = child_rel->relids; int i; + Assert(CurrentMemoryContext == root->planner_cxt); + /* * EC merging should be complete already, so we can use the parent rel's * eclass_indexes to avoid searching all of root->eq_classes. @@ -2380,6 +2384,7 @@ add_child_join_rel_equivalences(PlannerInfo *root, Relids top_parent_relids = child_joinrel->top_parent_relids; Relids child_relids = child_joinrel->relids; Bitmapset *matching_ecs; + MemoryContext oldcontext; int i; Assert(IS_JOIN_REL(child_joinrel) && IS_JOIN_REL(parent_joinrel)); @@ -2387,6 +2392,8 @@ add_child_join_rel_equivalences(PlannerInfo *root, /* We need consider only ECs that mention the parent joinrel */ matching_ecs = get_eclass_indexes_for_relids(root, top_parent_relids); + oldcontext = MemoryContextSwitchTo(root->planner_cxt); + i = -1; while ((i = bms_next_member(matching_ecs, i)) >= 0) { @@ -2486,6 +2493,8 @@ add_child_join_rel_equivalences(PlannerInfo *root, } } } + + MemoryContextSwitchTo(oldcontext); }
On Mon, Oct 5, 2020 at 6:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > If you run the core regression tests with geqo_threshold = 2 > (injected, say, via ALTER SYSTEM SET), you will notice the > earlier tests showing some join ordering differences, which > is unsurprising ... but when it gets to partition_join, that > test just dumps core. > > Investigation shows that the crash is due to a corrupt EquivalenceClass > member. It gets that way because add_child_join_rel_equivalences > is unaware of the planner's memory allocation policies, and feels > free to mess with the EC data structures during join planning. > When GEQO's transient memory context is thrown away after one round > of GEQO planning, some still-referenced EC data goes with it, > resulting in a crash during the next round. > > I band-aided around that with the attached patch, which is enough > to prevent the crash, but it's entirely unsatisfactory as a production > solution. The problem is that each GEQO round will re-add the same > child EC members, since add_child_join_rel_equivalences has absolutely > no concept that the members it wants might be there already. Since > GEQO tends to use a lot of rounds, this can run to significant memory > bloat, not to mention a pretty serious speed hit while EC operations > thumb through all of those duplicate expressions. > > Now that I've seen this, I wonder whether add_child_join_rel_equivalences > might not be making duplicate EC entries even without GEQO. Is there any > guarantee that it's not called repeatedly on related join-rel sets? build_child_join_rel() is the only caller of add_child_join_rel_equivalences(). Before the first calls the later, the first has an Assert 899 Assert(!find_join_rel(root, joinrel->relids)); This means that for a given child join rel this function is called only once implying that add_child_join_rel_equivalences() is called only once for a given child join rel. Initially I thought that this is enough to not add duplicate child members. But to me use of bms_overlap() in that function looks doubtful. Thinking allowed, let's say there's an EC member with em_relids = {1, 2} 1, 2 being relids of two partitioned tables. Let's assume that there's a third partitioned table with relid = 3. When a child join rel between children, say 6 and 7, of 1 and 2 resp was produces the EC memeber with em_relids = {1,2} was translated to produce and EC memeber with em_relids = {6, 7} and em_is_child = true. But when we will add a child join rel between (6, 7, 8) of a join between (1, 2, 3), bms_overlap({1, 2}, {1, 2, 3}) will return true and we will add one more member with em_relids = {6, 7}. So your suspicion is true. I think, using bms_is_subset(top_parent_relids, em->em_relids) should fix that problem. In addition, adding an Assert to make sure that we are not adding multiple instances of same child EC member should suffice. Thumbing through all the child EC members to eliminate duplicates would be much costlier when a huge number of partitions are involved. On a related but different subject, I have been thinking on-off about the need for expression translation while planning partitions.The translated expressions differ only in the leaf-vars and even for most of the partitioned tables really only the varno (assuming that partitioned table and partitions will have same layouts when a large number of partitions are created automatically.) If we can introduce a new type of (polymorphic) Var node which represents not just the parent Var but also child Vars as well. The whole need for expression translation vanishes, reducing memory requirements by many folds. Such a Var node will indicate which varnos it covers and for a group of varnos which varattno it represents. In most of the cases where parent and the child share the same varattno, all the varnos form a single set. Whole Var references pose a problem since parent's whole var reference translates to a RowExpr containing child Var references, but probably using varattno = 0 for children will suffice. As far as my thoughts go, this will work in planner but when translating plan tree into execution tree we will have to produce different Execution time Var nodes. That should be fine since we will be dealing with only a single plan. If we could fix that somehow, well and good. -- Best Wishes, Ashutosh Bapat
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes: > On Mon, Oct 5, 2020 at 6:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Now that I've seen this, I wonder whether add_child_join_rel_equivalences >> might not be making duplicate EC entries even without GEQO. Is there any >> guarantee that it's not called repeatedly on related join-rel sets? > build_child_join_rel() is the only caller of > add_child_join_rel_equivalences(). Before the first calls the later, > the first has an Assert > 899 Assert(!find_join_rel(root, joinrel->relids)); > This means that for a given child join rel this function is called > only once implying that add_child_join_rel_equivalences() is called > only once for a given child join rel. Initially I thought that this is > enough to not add duplicate child members. But to me use of > bms_overlap() in that function looks doubtful. Yeah. As soon as you get into three-or-more-way joins, this code will be called again, mentioning some of the same relids as at the lower join level. > On a related but different subject, I have been thinking on-off about > the need for expression translation while planning partitions.The > translated expressions differ only in the leaf-vars and even for most > of the partitioned tables really only the varno (assuming that > partitioned table and partitions will have same layouts when a large > number of partitions are created automatically.) If we can introduce a > new type of (polymorphic) Var node which represents not just the > parent Var but also child Vars as well. The whole need for expression > translation vanishes, reducing memory requirements by many folds. Actually, the reason I have been looking at equivclass.c is that I've been attacking the problem from a different direction. I'm fairly unexcited about making the definition of Var looser as you suggest here --- I actually think that it's dangerously imprecise already, due to not accounting for the effects of outer joins. However, we could avoid the growth in eclass members for large partition sets if we simply didn't store child eclass members, instead translating on-the-fly when we need to deal with child rels. I have a patch about half done, but it won't be possible to determine the true performance implications of that idea until it's all done. More later. Either approach would mean that add_child_join_rel_equivalences goes away entirely, or at least doesn't need to store new em_is_child entries anymore, causing the memory bloat issue to become moot. So maybe we should just fix the wrong-context issue for now, and live with the GEQO bloat in the back branches. regards, tom lane
On Mon, Oct 5, 2020 at 11:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes: > > On Mon, Oct 5, 2020 at 6:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Now that I've seen this, I wonder whether add_child_join_rel_equivalences > >> might not be making duplicate EC entries even without GEQO. Is there any > >> guarantee that it's not called repeatedly on related join-rel sets? > > > build_child_join_rel() is the only caller of > > add_child_join_rel_equivalences(). Before the first calls the later, > > the first has an Assert > > 899 Assert(!find_join_rel(root, joinrel->relids)); > > This means that for a given child join rel this function is called > > only once implying that add_child_join_rel_equivalences() is called > > only once for a given child join rel. Initially I thought that this is > > enough to not add duplicate child members. But to me use of > > bms_overlap() in that function looks doubtful. > > Yeah. As soon as you get into three-or-more-way joins, this code will > be called again, mentioning some of the same relids as at the lower > join level. > > > On a related but different subject, I have been thinking on-off about > > the need for expression translation while planning partitions.The > > translated expressions differ only in the leaf-vars and even for most > > of the partitioned tables really only the varno (assuming that > > partitioned table and partitions will have same layouts when a large > > number of partitions are created automatically.) If we can introduce a > > new type of (polymorphic) Var node which represents not just the > > parent Var but also child Vars as well. The whole need for expression > > translation vanishes, reducing memory requirements by many folds. > > Actually, the reason I have been looking at equivclass.c is that I've > been attacking the problem from a different direction. I'm fairly > unexcited about making the definition of Var looser as you suggest > here --- I actually think that it's dangerously imprecise already, > due to not accounting for the effects of outer joins. However, > we could avoid the growth in eclass members for large partition sets > if we simply didn't store child eclass members, instead translating > on-the-fly when we need to deal with child rels. I have a patch > about half done, but it won't be possible to determine the true > performance implications of that idea until it's all done. More > later. If we translate more than ones, every time someone comes searching for an EC member, we will leak memory in the planner memory context, which anyway gets bloated because of the huge number of translations even when done only once. So that needs to be done rather carefully. Of course we could save child EC members separately in the same EC and search that using hash or something to avoid multiple instances of the same child EC member. But this memory bloat isn't unique to ECs, though it shows up mostly in ECs. We translate all the expressions, TLEs, quals and so on. That consumes a lot of memory. The number of joins grows exponentially with the number of relations being joined and so does this child joins. So we need some way to avoid expression translations, where the translated expressions just differ in leaf var nodes, in order to support a large number of partitions. > > Either approach would mean that add_child_join_rel_equivalences > goes away entirely, or at least doesn't need to store new em_is_child > entries anymore, causing the memory bloat issue to become moot. > So maybe we should just fix the wrong-context issue for now, and > live with the GEQO bloat in the back branches. Yes, I agree with that. For now your patch fixing the wrong context issue is good enough. -- Best Wishes, Ashutosh Bapat
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes: > On Mon, Oct 5, 2020 at 11:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... we could avoid the growth in eclass members for large partition sets >> if we simply didn't store child eclass members, instead translating >> on-the-fly when we need to deal with child rels. I have a patch >> about half done, but it won't be possible to determine the true >> performance implications of that idea until it's all done. More >> later. > If we translate more than ones, every time someone comes searching for > an EC member, we will leak memory in the planner memory context, which > anyway gets bloated because of the huge number of translations even > when done only once. So that needs to be done rather carefully. I'm not terribly concerned about that. There's not a "huge number" of such translations to be done --- it's more like one per index. (Also, we could very possibly build the translations in a temp context that gets reset every so often, if it becomes an issue.) I am a bit worried about whether we'll be spending a lot more cycles to do the added translations; though some of that should be bought back by having fewer EC members to compare to. In any event, testing a working patch will be a lot more illuminating than speculation. >> Either approach would mean that add_child_join_rel_equivalences >> goes away entirely, or at least doesn't need to store new em_is_child >> entries anymore, causing the memory bloat issue to become moot. >> So maybe we should just fix the wrong-context issue for now, and >> live with the GEQO bloat in the back branches. > Yes, I agree with that. For now your patch fixing the wrong context > issue is good enough. Done for now. regards, tom lane
On Mon, Oct 5, 2020 at 2:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Actually, the reason I have been looking at equivclass.c is that I've > been attacking the problem from a different direction. I'm fairly > unexcited about making the definition of Var looser as you suggest > here --- I actually think that it's dangerously imprecise already, > due to not accounting for the effects of outer joins. However, > we could avoid the growth in eclass members for large partition sets > if we simply didn't store child eclass members, instead translating > on-the-fly when we need to deal with child rels. I have a patch > about half done, but it won't be possible to determine the true > performance implications of that idea until it's all done. More > later. The thing that seems a bit dumb about the current system is that we got to a lot of trouble to generate separte tlists and quals for every inheritance child, but then in the final plan, what difference does it make? I see that scan nodes, unlike joins etc., don't replace Var nodes with INNER_VAR/OUTER_VAR, but it seems like a qual attached to a Seq Scan has only one possible relation to which it can refer, so what's the point of having separate lists with different varnos in each? There is an issue if column numbering is different, because then varattnos need to differ even at execution time, but it seems like the only possible use of the varnos in these places is during planning. By execution time, we can't really care any more, at least not as far as I understand it. Just to be clear, this is not a vote against whatever you want to do to improve things in this area. It's just an observation that we seem to create an awful lot of bulky data structures when planning a query like SELECT * FROM many_children, and then don't do much with them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Oct 7, 2020 at 12:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes: > > On Mon, Oct 5, 2020 at 11:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> ... we could avoid the growth in eclass members for large partition sets > >> if we simply didn't store child eclass members, instead translating > >> on-the-fly when we need to deal with child rels. I have a patch > >> about half done, but it won't be possible to determine the true > >> performance implications of that idea until it's all done. More > >> later. +1 to this idea. We've seen mainly get_eclass_for_sort_expr() become a bottleneck with large partition sets and getting rid of it would be really great. > > If we translate more than ones, every time someone comes searching for > > an EC member, we will leak memory in the planner memory context, which > > anyway gets bloated because of the huge number of translations even > > when done only once. So that needs to be done rather carefully. > > I'm not terribly concerned about that. There's not a "huge number" > of such translations to be done --- it's more like one per index. > (Also, we could very possibly build the translations in a temp > context that gets reset every so often, if it becomes an issue.) Yeah, the memory bloat from this should be minimal. AFAICS, the only place that demands a child expression from a given matched EC thus needing a translation is createplan.c, which I would expect to be a place that doesn't spend a lot of memory because there's no exploratory processing. > I am a bit worried about whether we'll be spending a lot more cycles > to do the added translations; though some of that should be bought > back by having fewer EC members to compare to. In any event, testing > a working patch will be a lot more illuminating than speculation. Agreed. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA%40mail.gmail.com
On Thu, Oct 8, 2020 at 2:58 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Wed, Oct 7, 2020 at 12:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes: > > > On Mon, Oct 5, 2020 at 11:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> ... we could avoid the growth in eclass members for large partition sets > > >> if we simply didn't store child eclass members, instead translating > > >> on-the-fly when we need to deal with child rels. I have a patch > > >> about half done, but it won't be possible to determine the true > > >> performance implications of that idea until it's all done. More > > >> later. > > +1 to this idea. We've seen mainly get_eclass_for_sort_expr() become > a bottleneck with large partition sets and getting rid of it would be > really great. > > [1] https://www.postgresql.org/message-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA%40mail.gmail.com Oops, I linked this thread but forgot to write why. Well, I had meant to say that I had unsuccessfully tried to implement this idea as a PoC back when David had started the linked discussion to address the same problem. -- Amit Langote EDB: http://www.enterprisedb.com