Обсуждение: Fixing DISTINCT ON for duplicate keys

Поиск
Список
Период
Сортировка

Fixing DISTINCT ON for duplicate keys

От
Tom Lane
Дата:
I looked into this trouble report:
http://archives.postgresql.org/pgsql-sql/2008-07/msg00123.php

The problem is that by the time we get to transformDistinctClause(),
any duplicate entries in the ORDER BY list have been eliminated
(see addTargetToSortList).  But transformDistinctClause expects
a one-for-one match of the two lists, and so it complains.

Clearly, duplicate DISTINCT ON items are just as redundant as duplicate
ORDER BY items are, and so it seems that suppressing them is a
reasonable thing to do.  But I'm thinking that as long as we're touching
this old code, there are some other things that should be fixed:

* There's not really any semantic significance to the ordering of the
DISTINCT ON list anyway, so it would be reasonable to rearrange the
ordering of the list to match the ORDER BY list, rather than making
the user do it.

* It's really bletcherous that the code physically modifies the
user-given ORDER BY.  This damage is visible in stored rules ---
they don't come out the same way you wrote them.  While I don't
mind the idea of dropping redundant entries, adding ORDER BY entries
that the user never wrote seems bogus.  It overconstrains the query,
in a way that doesn't matter given our current implementation but
could matter in the future.

What I am thinking we could do about the latter is modify the querytree
semantics a bit.  Instead of insisting that the transformed
distinctClause be equal to a prefix of the sortClause, allow either
one to be a prefix of the other.  Then the planner simply takes the
longer one as its internal sort-order target.  With this rule, the
sortClause stays as what the user wrote (less any duplicate keys).
The parser is required to remove any duplicate keys from the
distinctClause and rearrange it if needed so that it has a common
prefix with the sortClause (or throw error if this is not possible).
This would be invisible to the user in plain SELECT DISTINCT, and
in SELECT DISTINCT ON would mean that the list is dumped in a
"canonical" order that matches ORDER BY, but isn't changed in any
semantic way.

Now this is probably too big a change to be prudent to back-patch.
Is it worth coming up with a second patch that just tries to get
transformDistinctClause to remove duplicates?  Since the problem
has existed for a very long time (at least back to 7.0 according
to my testing) with no prior reports, it doesn't seem very important
to fix.  I'm a bit worried about putting a patch into only the
back branches --- it would go out with little testing and so the
odds of introducing a fresh problem seem uncomfortably high compared
to the benefit.

Comments?
        regards, tom lane


Re: Fixing DISTINCT ON for duplicate keys

От
David Fetter
Дата:
On Thu, Jul 31, 2008 at 01:38:37PM -0400, Tom Lane wrote:
> I looked into this trouble report:
> http://archives.postgresql.org/pgsql-sql/2008-07/msg00123.php
> 
> Comments?

It seems to me that DISTINCT ON is just a special case of the more
general windowing functions
<http://archives.postgresql.org/pgsql-hackers/2008-07/msg01277.php>
Harada-san has been working on.  Could these use the same machinery?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Fixing DISTINCT ON for duplicate keys

От
Tom Lane
Дата:
David Fetter <david@fetter.org> writes:
> It seems to me that DISTINCT ON is just a special case of the more
> general windowing functions
> <http://archives.postgresql.org/pgsql-hackers/2008-07/msg01277.php>
> Harada-san has been working on.  Could these use the same machinery?

Perhaps at some point we could deprecate DISTINCT ON in favor of using
windowing functions, but I'd not want to weigh down the windowing work
with a mandate that it be bug-compatible with DISTINCT ON.
        regards, tom lane


Re: Fixing DISTINCT ON for duplicate keys

От
David Fetter
Дата:
On Thu, Jul 31, 2008 at 03:10:35PM -0400, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > It seems to me that DISTINCT ON is just a special case of the more
> > general windowing functions
> > <http://archives.postgresql.org/pgsql-hackers/2008-07/msg01277.php>
> > Harada-san has been working on.  Could these use the same
> > machinery?
> 
> Perhaps at some point we could deprecate DISTINCT ON in favor of
> using windowing functions, but I'd not want to weigh down the
> windowing work with a mandate that it be bug-compatible with
> DISTINCT ON.

Good point.  I was just thinking of going the other way, namely fixing
any misbehaviors of DISTINCT ON via machinery to be used by the
windowing functions. :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate