Обсуждение: Ordering of data on calls to user defined aggregate.

Поиск
Список
Период
Сортировка

Ordering of data on calls to user defined aggregate.

От
Tim Hart
Дата:
Short version ( for the busy folk).

Is there any guarantee of the ordering of data on calls to a user
defined aggregate. (postgresql 7.2.1)

Longer version ( for those that are curious or need more info ).

I was goofing around tonight and decided to create a concat_with_and
aggregate for varchar. My desired effect: if I have a table like this:

create table foo( id bigint, fk bigint, name varchar);

with data
id    fk    name
1    1    A
2    1       B
3    2    E
4    3    D
5    3    C

and a query like this:

select fk, concat_with_and(name) names from foo group by fk;

then the results would look like:
fk    names
1    A and B
2    E
3    D and C

I had no trouble creating the aggregate function or the supporting
(state transition) function:

create function concat_with_and(varchar, varchar) returns varchar as
'select case when $1 is null then $2 else $1 || \' and \' || $2 end;'
language sql;

I did not use a final function.

Everything worked like a charm. Then I decided to take it one step
further. What if I wanted all the names sorted in alphabetical order
( not across records, but within the record)?

fk    names
1    A and B
2    E
3    C and D

So I tried a query like this:

select fk, concat_with_and(name) from ( select fk, name from foo order
by fk, name) sub_select group by fk;

 From just eyeballing the first 10 to 12 pages of the results, all but 2
records had the names in alphabetical order. So I removed the subselect
and ran the query again - this time paying attention to the ordering
within names. Very few entries in the 'names' column were in
alphabetical order at all.

I tried to develop an example dataset to include here that would
illustrate cases where it doesn't order them alphabetically. I was
unable to do so. I'm experimenting with a dataset of about 15000 records
for my queries.

Where I'm coming from:
I know there are several other ways to achieve the results I was looking
for above. I have no real need for this. I was just looking for an
excuse to write my own aggregate function. What I found intriguing were
the possibilities - and the opportunity to increase my understanding of
what goes on under the hood.

questions:
 From a veteran's point of view, is the approach above elegant or a gross
hack? Somewhere in between?

Am I correct in assuming that the 'group by' clause in my second query
affects the ordering of the subquery? Any idea why it only affected a
small percentage of them?

Could I introduce a final function to solve my problem? The state
transition function simply collects all the entries. The final function
would order them and delimit them with 'and'. In practice, if this were
required I'd hang it up & write a pl/pgsql function and skip the
aggregation all together. The extra work begins to clutter the idea. I'd
prefer not to force every 'concat_with_and' aggregation to sort the data
as well.

Any other comments?


Re: Ordering of data on calls to user defined aggregate.

От
"Joel Burton"
Дата:
Tim Hart <timjhart@shaw.ca> said:

> So I tried a query like this:
>
> select fk, concat_with_and(name) from ( select fk, name from foo order
> by fk, name) sub_select group by fk;
>
>  From just eyeballing the first 10 to 12 pages of the results, all but 2
> records had the names in alphabetical order. So I removed the subselect
> and ran the query again - this time paying attention to the ordering
> within names. Very few entries in the 'names' column were in
> alphabetical order at all.

Hmmm... in my (small) test case, they were all alphabetized.

I didn't think that subquery sort orders were guaranteed, though, so perhaps it's okay that yours weren't.

Can you try with GROUP BY fk, name in the subquery? That works, too, on my small test case, and that should be
guaranteedbehavior in a subquery. Let's see how that works with your data set. 

- J.

Re: Ordering of data on calls to user defined aggregate.

От
Tim Hart
Дата:
On Saturday, May 18, 2002, at 03:13  PM, Joel Burton wrote:

> Tim Hart <timjhart@shaw.ca> said:
>
>> So I tried a query like this:
>>
>> select fk, concat_with_and(name) from ( select fk, name from foo order
>> by fk, name) sub_select group by fk;
>>
>>  From just eyeballing the first 10 to 12 pages of the results, all
>> but 2
>> records had the names in alphabetical order. So I removed the subselect
>> and ran the query again - this time paying attention to the ordering
>> within names. Very few entries in the 'names' column were in
>> alphabetical order at all.
>
> Hmmm... in my (small) test case, they were all alphabetized.
>
> I didn't think that subquery sort orders were guaranteed, though, so
> perhaps it's okay that yours weren't.
>
> Can you try with GROUP BY fk, name in the subquery? That works, too, on
> my small test case, and that should be guaranteed behavior in a
> subquery. Let's see how that works with your data set.
>
> - J.
>
>
I'm not quite sure how you're expecting me to modify the subquery.
Probably my tired brain cells... ;)

But yes - I've been unable to reproduce the issue with a smaller
dataset. I didn't want to spend all that extra time creating a
reproduce-able case  if someone could definitively say - as you did
above - that subquery sort orders were not guaranteed. Given that the
outer 'group by' has no way of knowing that the result set it's being
handed is effectively grouped, I'm guessing that the group-by algorithm
in this case is probably the source of my ills. No biggie.


Re: Ordering of data on calls to user defined aggregate.

От
Tom Lane
Дата:
Tim Hart <timjhart@shaw.ca> writes:
> Short version ( for the busy folk).
> Is there any guarantee of the ordering of data on calls to a user
> defined aggregate. (postgresql 7.2.1)

None whatever.

> So I tried a query like this:

> select fk, concat_with_and(name) from ( select fk, name from foo order
> by fk, name) sub_select group by fk;

The reason this isn't very reliable can be seen by looking at what it
does.  In existing releases you have to grovel through EXPLAIN VERBOSE
output to see what the sort keys are, but CVS tip is friendlier:

srf=# explain
srf-# select fk, my_agg(name) from ( select fk, name from foo order
srf(# by fk, name) sub_select group by fk;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Aggregate  (cost=122.16..129.66 rows=100 width=40)
   ->  Group  (cost=122.16..127.16 rows=1000 width=40)
         ->  Sort  (cost=122.16..124.66 rows=1000 width=40)
               Sort Key: fk
               ->  Subquery Scan sub_select  (cost=69.83..72.33 rows=1000 width=40)
                     ->  Sort  (cost=69.83..72.33 rows=1000 width=40)
                           Sort Key: fk, name
                           ->  Seq Scan on foo  (cost=0.00..20.00 rows=1000 width=40)
(8 rows)

The outer query has no idea that the inner query's output is already
sorted, so it re-sorts ... using only the specified GROUP BY key.
Unless your system's qsort is stable (which most are not), this will
not preserve the by-name ordering within groups of the same fk value.

If you weren't grouping at the outer level then the inner query's sort
order would get the job done for you.  In the presence of outer
grouping I'm not sure there's a clean way to do it.  I can think of
a hack for the case where fk/name pairs are unique:

select fk, my_agg(DISTINCT name) from foo group by fk;

This relies on the assumption that the aggregate code will implement
DISTINCT by means of sort/unique processing, which seems unlikely to
break anytime soon.  But it won't help if you want the aggregate to see
multiple values of the same name for the same fk.

There is some talk of reimplementing grouped aggregation using hash
tables, which'd eliminate the upper SORT step and thereby give the
behavior you want.  I dunno how soon anyone will get around to it
though.

            regards, tom lane

Re: Ordering of data on calls to user defined aggregate.

От
Tim Hart
Дата:
On Saturday, May 18, 2002, at 03:56  PM, Tom Lane wrote:

> Tim Hart <timjhart@shaw.ca> writes:
>> Short version ( for the busy folk).
>> Is there any guarantee of the ordering of data on calls to a user
>> defined aggregate. (postgresql 7.2.1)
>
> None whatever.

Reasonable.

> The outer query has no idea that the inner query's output is already
> sorted, so it re-sorts ... using only the specified GROUP BY key.

Kind of what I figured. I've been in a discussion with Joel Burton where
I stated the same assumption. Makes sense to me.

<snippage>

> grouping I'm not sure there's a clean way to do it.  I can think of
> a hack for the case where fk/name pairs are unique:
>
> select fk, my_agg(DISTINCT name) from foo group by fk;
>
> This relies on the assumption that the aggregate code will implement
> DISTINCT by means of sort/unique processing, which seems unlikely to
> break anytime soon.  But it won't help if you want the aggregate to see
> multiple values of the same name for the same fk.

That actually works for me. But, as you say, it makes some assumptions.
It's also not very clear from the the SQL what the intent is. If I
really had a need to do this, I'd probably implement this as a function:

select fk, get_sorted_delimited_list(fk, ' and ' ) as names from foo
order by names;

>
> There is some talk of reimplementing grouped aggregation using hash
> tables, which'd eliminate the upper SORT step and thereby give the
> behavior you want.  I dunno how soon anyone will get around to it
> though.

No worries. I can't really say I *want* that behavior. Can't say I
*don't* want it either. I was just after a deeper understanding of the
current behavior. You've provided it, and am grateful. :)