Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

Поиск
Список
Период
Сортировка
От Andy Fan
Тема Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Дата
Msg-id CAKU4AWp-O6goWhtVggUd1Lh=x90W5OMazUh0s71KkRC1u5dKJA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] Re: Improve OR conditions on joined columns (commonstar schema problem)  (Noah Misch <noah@leadboat.com>)
Список pgsql-hackers


On Mon, Mar 18, 2019 at 8:09 AM Noah Misch <noah@leadboat.com> wrote:
On Tue, Oct 02, 2018 at 10:53:40AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Mon, Oct 01, 2018 at 09:32:10PM -0400, Tom Lane wrote:
> >> FWIW, my problem with this patch is that I remain unconvinced of the basic
> >> correctness of the transform (specifically the unique-ification approach).
> >> Noah's points would be important to address if we were moving the patch
> >> towards commit, but I don't see much reason to put effort into it until
> >> we can think of a way to prove whether that works.
>
> > Not even effort to fix the assertion failures I reported?
>
> If it seemed relevant to the proof-of-correctness problem, I would look
> into it, but ... 
I put some hours into theoretical study of the proof, and I didn't find any
holes.  When the planner removes "l0 LEFT JOIN r1", it does that because
there's one output row per l0.ctid regardless of the rows of r1.  Hence,
deduplicating on (l0.ctid) is equivalent to deduplicating on (l0.ctid,
r1.ctid).  In the bad FULL JOIN case, (l0.ctid, r1.ctid) would be sufficient
as a key, but we're out of luck because removing the join makes us have only
l0.ctid for some tuples and only r1.ctid for other tuples.

If PostgreSQL ever gets inner join removal, it would be harder to preserve
this optimization in this form.  At that point, perhaps we'd cost the path
that retains the join for the benefit of $SUBJECT.  Given the performance test
results, $SUBJECT may already need a cost-based decision on whether to use it.


As for the proof-of-correctness problem, I think the below strategy should be
easy to understand.

SELECT * FROM any_v WHERE (A OR B OR C).
=>
SELECT * FROM any_v WHERE A
UNION ALL
SELECT * FROM any_v WHERE B AND !A
UNION ALL
SELECT * FROM any_v WHERE C AND !A AND !(B AND !A);
where !(Expr) means  (NOT (expr) OR (EXPR) IS NULL)
In this way, there is no duplicated data at the first. Oracle uses a similar
way like this[1].

- A shared planner info idea.

About the planning time case, suppose we have a query below.

SELECT * FROM t1, t2, ... t20  WHERE (t1.a = 1 or t2.a = 1) and xxx;

With the current strategy, t3 .. t20 would be planned many times, including base
relation like (t3.. t20) and joinrel like (t{3, 4}, t(3,4,5}, t(3,4,5,6}) and
so on.  I guess all these relations would get the same result at every time. we
don't need to do that at every section of the Union ALL case.  So an idea of
PlannerInfo A can access the planning result of PlannerInfo B comes, this idea
doesn't come in my mind for the first time,  many cost based transformations may
need this.


Then I have the following POC in my mind.

PlannerInfo
{

  PlannerInfo *shared_planner_info;
  Relids  or_relids; // this field can be improved later.
};

With this way, if we find a 'relids' is not related to or_relids. We can grab
the planning result from shared_planner_info (in this or-union case, we can set
that after we plan the first section of UNION).


Then I found this one would not work after planning time partition prune since
the relid would change. For example:

P (P1, P2) PARTITION BY A
Q (Q1, Q2) PARTITION BY A

SELECT * FROM t, p, q where (t.a = 1 or p.a = 1).  So in the un-transformed
case, Q2 would have relid = 7.  After the transform, Q2 probably has relid
= 6.

So basically, I have 4 questions now.
1. Would investing on the shared_planner_info idea be a good idea?
2. Without planning time prune, shall the above design work?
3. Currently I want to use relid to identify a resultset and pathlists which
   have the planning time prune issue, should we consider other methods?
4. Any other suggestion about to resolve the 'duplicated planning case' besides
   the shared_planner_info method?


--
Best Regards

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: PG 14 release notes, first draft
Следующее
От: Tom Lane
Дата:
Сообщение: Re: PG 14 release notes, first draft