Re: Performance improvement for joins where outer side is unique

Поиск
Список
Период
Сортировка
От David Rowley
Тема Re: Performance improvement for joins where outer side is unique
Дата
Msg-id CAApHDvrgFwypvZmWKrgTZSy=f+sewvmfP_GL0h6Pod0cQYAKng@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Performance improvement for joins where outer side is unique  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Ответы Re: Performance improvement for joins where outer side is unique  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Список pgsql-hackers
On 3 February 2015 at 22:23, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hi, I had a look on this patch. Although I haven't understood
whole the stuff and all of the related things, I will comment as
possible.


Great, thank you for taking the time to look and review the patch.
 
Performance:

I looked on the performance gain this patch gives. For several
on-memory joins, I had gains about 3% for merge join, 5% for hash
join, and 10% for nestloop (@CentOS6), for simple 1-level joins
with aggregation similar to what you mentioned in previous
mail like this.

=# SELECT count(*) FROM t1 JOIN t2 USING (id);

 Of course, the gain would be trivial when returning many tuples,
or scans go to disk.


That's true, but joins where rows are filtered after the join condition will win here too:

For example select * from t1 inner join t2 on t1.id=t2.id where t1.value > t2.value;

Also queries with a GROUP BY clause. (Since grouping is always performed after the join :-( )

I haven't measured the loss by additional computation when the
query is regarded as not a "single join".


I think this would be hard to measure, but likely if it is measurable then you'd want to look at just planning time, rather than planning and execution time.
 

Explain representation:

I don't like that the 'Unique Join:" occupies their own lines in
the result of explain, moreover, it doesn't show the meaning
clearly. What about the representation like the following? Or,
It might not be necessary to appear there.

   Nested Loop ...
     Output: ....
 -   Unique Jion: Yes
       -> Seq Scan on public.t2 (cost = ...
 -     -> Seq Scan on public.t1 (cost = ....
 +     -> Seq Scan on public.t1 (unique) (cost = ....



Yeah I'm not too big a fan of this either. I did at one evolution of the patch I had "Unique Left Join" in the join node's line in the explain output, but I hated that more and changed it just to be just in the VERBOSE output, and after I did that I didn't want to change the join node's line only when in verbose mode.  I do quite like that it's a separate item for the XML and JSON explain output. That's perhaps quite useful when the explain output must be processed by software.

I'm totally open for better ideas on names, but I just don't have any at the moment.
 
Coding:

The style looks OK. Could applied on master.
It looks to work as you expected for some cases.

Random comments follow.

- Looking specialjoin_is_unique_join, you seem to assume that
  !delay_upper_joins is a sufficient condition for not being
  "unique join".  The former simplly indicates that "don't
  commute with upper OJs" and the latter seems to indicate that
  "the RHS is unique", Could you tell me what kind of
  relationship is there between them?

The rationalisation around that are from the (now changed) version of join_is_removable(), where the code read:

/*
* Must be a non-delaying left join to a single baserel, else we aren't
* going to be able to do anything with it.
*/
if (sjinfo->jointype != JOIN_LEFT ||
sjinfo->delay_upper_joins)
return false;

I have to admit that I didn't go and investigate why delayed upper joins cannot be removed by left join removal code, I really just assumed that we're just unable to prove that a join to such a relation won't match more than one outer side's row. I kept this to maintain that behaviour as I assumed it was there for a good reason.

 

- The naming "unique join" seems a bit obscure for me, but I
  don't have no better idea:( However, the member name
  "has_unique_join" seems to be better to be "is_unique_join".


Yeah, I agree with this. I just can't find something short enough that means "based on the join condition, the inner side of the join will never produce more than 1 row for any single outer row". Unique join was the best I came up with. I'm totally open for better ideas.

I agree that is_unique_join is better than has_unique_join. I must have just copied the has_eclass_joins struct member without thinking too hard about it. I've now changed this in my local copy of the patch.

- eclassjoin_is_unique_join involves seemingly semi-exhaustive
  scan on ec members for every element in joinlist. Even if not,
  it might be thought to be a bit too large for the gain. Could
  you do the equivelant things by more small code?


I'd imagine some very complex queries could have many equivalence classes and members, though I hadn't really thought that this number would be so great that processing here would suffer much. There's quite a few fast paths out, like the test to ensure that both rels are mentioned in ec_relids. Though for a query which is so complex to have a great number of eclass' and members, likely there would be quite a few relations involved and planning would be slower anyway. I'm not quite sure how else I could write this and still find the unique join cases each time. We can't really just give up half way through looking.
 
Thanks again for the review.

Regards

David Rowley

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Gierth
Дата:
Сообщение: Re: Reduce pinning in btree indexes
Следующее
От: Haribabu Kommi
Дата:
Сообщение: Re: Providing catalog view to pg_hba.conf file - Patch submission