Обсуждение: Order by in a sub query when aggregating the main query

Поиск
Список
Период
Сортировка

Order by in a sub query when aggregating the main query

От
Federico
Дата:
Hi all,

I have a question related to the order by clause used in a subquery of
a main query that uses one or more aggregation functions with a group
by.
A basic example of the type of query in question is the following (see
below for the actual query):

  select w, array_agg(x)
  from (
    select v, v / 10 as w
    from pg_catalog.generate_series(25, 0, -1) as t(v)
    order by v
  ) as t(x)
  group by w

This query will return an ordered array as specified by the order by
clause.in the subquery.
Can this behaviour be relied upon?
From what I could find from searching in SQL the order by in a
subquery could be ignored by the engines, but I've found that
postgresql will always respect it.

The context of the question is the updated reflection logic that will
be introduced in version 2 of SQLAlchemy, that makes use of orderby in
subqueries to, for example, match column index of a constraint with
the column name of a table. This query and other similar one return
the correct result, and they seem stable in their output (ie the CI is
not randomly failing because the order has changed). For more
information this potential issue with the current query is traket in
the issue https://github.com/sqlalchemy/sqlalchemy/issues/8561
Below is the full query that will be used in sqlalchemy to reflect
constraints given the constraint type and on a list of table oids:

  select
    attr.conrelid,
    array_agg(attr.attname) as cols,
    attr.conname,
    min(attr.description) as description
  from (
    select
      con.conrelid as conrelid,
      con.conname as conname,
      con.description as description,
      pg_catalog.pg_attribute.attname as attname
    from pg_catalog.pg_attribute
    join (
      select
        pg_catalog.pg_constraint.conrelid as conrelid,
        pg_catalog.pg_constraint.conname as conname,
        unnest(pg_catalog.pg_constraint.conkey) as attnum,
        generate_subscripts(pg_catalog.pg_constraint.conkey,
%(generate_subscripts_1)s) as ord,
        pg_catalog.pg_description.description as description
      from pg_catalog.pg_constraint
      left outer join pg_catalog.pg_description on
        pg_catalog.pg_description.objoid = pg_catalog.pg_constraint.oid
      where
        pg_catalog.pg_constraint.contype = :contype
        and pg_catalog.pg_constraint.conrelid in (:oids)
      ) as con on
      pg_catalog.pg_attribute.attnum = con.attnum
      and pg_catalog.pg_attribute.attrelid = con.conrelid
    order by con.conname, con.ord
  ) as attr
  group by attr.conrelid, attr.conname
  order by attr.conrelid, attr.conname

The other reflection queries that use order by in subqueries are
similar to the above, I can post them here if they may prove useful.

Thank you
Federico



Re: Order by in a sub query when aggregating the main query

От
Tom Lane
Дата:
Federico <cfederico87@gmail.com> writes:
> A basic example of the type of query in question is the following (see
> below for the actual query):

>   select w, array_agg(x)
>   from (
>     select v, v / 10 as w
>     from pg_catalog.generate_series(25, 0, -1) as t(v)
>     order by v
>   ) as t(x)
>   group by w

> This query will return an ordered array as specified by the order by
> clause.in the subquery.
> Can this behaviour be relied upon?

No, not really.  It might always work given a particular set of
circumstances.  As long as the planner chooses to do the outer
query's grouped aggregation as a HashAgg, there'd be no reason
for it to reshuffle the subquery output before feeding that to
array_agg.  However, if it decided that sort-group-and-aggregate
was better, it'd insert a sort by w above the subquery, and then
you'd lose any certainty of the ordering by v continuing to hold.
(Maybe the sort by w would be stable for equal keys, but that's
not guaranteed.)

What you really ought to do is write

  select w, array_agg(x order by x)
  from ...

to be in the clear per SQL standard.  I think that right now that'd
incur additional sorting overhead, which is annoying.  But work is
ongoing to recognize when the input is already correctly sorted
for an aggregate, so it should get better in PG 16 or so.

            regards, tom lane



Re: Order by in a sub query when aggregating the main query

От
Federico
Дата:
Understood, thanks for the explanation.
I'll work on updating the queries used by sqlalchemy to do array_agg(x
order by x) instead of the order by in the subquery.

> I think that right now that'd
> incur additional sorting overhead, which is annoying.  But work is
> ongoing to recognize when the input is already correctly sorted
> for an aggregate, so it should get better in PG 16 or so.

Nice to know, hopefully it's too bad for this use case

Thanks, Federico Caselli

On Sun, 25 Sept 2022 at 00:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Federico <cfederico87@gmail.com> writes:
> > A basic example of the type of query in question is the following (see
> > below for the actual query):
>
> >   select w, array_agg(x)
> >   from (
> >     select v, v / 10 as w
> >     from pg_catalog.generate_series(25, 0, -1) as t(v)
> >     order by v
> >   ) as t(x)
> >   group by w
>
> > This query will return an ordered array as specified by the order by
> > clause.in the subquery.
> > Can this behaviour be relied upon?
>
> No, not really.  It might always work given a particular set of
> circumstances.  As long as the planner chooses to do the outer
> query's grouped aggregation as a HashAgg, there'd be no reason
> for it to reshuffle the subquery output before feeding that to
> array_agg.  However, if it decided that sort-group-and-aggregate
> was better, it'd insert a sort by w above the subquery, and then
> you'd lose any certainty of the ordering by v continuing to hold.
> (Maybe the sort by w would be stable for equal keys, but that's
> not guaranteed.)
>
> What you really ought to do is write
>
>   select w, array_agg(x order by x)
>   from ...
>
> to be in the clear per SQL standard.  I think that right now that'd
> incur additional sorting overhead, which is annoying.  But work is
> ongoing to recognize when the input is already correctly sorted
> for an aggregate, so it should get better in PG 16 or so.
>
>                         regards, tom lane



Re: Order by in a sub query when aggregating the main query

От
Federico
Дата:
I've changed the code to use order by in the aggregate and it seems
there are no noticeable changes in the query performance.
Thanks for the help.

Best,
Federico Caselli

On Sun, 25 Sept 2022 at 00:30, Federico <cfederico87@gmail.com> wrote:
>
> Understood, thanks for the explanation.
> I'll work on updating the queries used by sqlalchemy to do array_agg(x
> order by x) instead of the order by in the subquery.
>
> > I think that right now that'd
> > incur additional sorting overhead, which is annoying.  But work is
> > ongoing to recognize when the input is already correctly sorted
> > for an aggregate, so it should get better in PG 16 or so.
>
> Nice to know, hopefully it's too bad for this use case
>
> Thanks, Federico Caselli
>
> On Sun, 25 Sept 2022 at 00:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Federico <cfederico87@gmail.com> writes:
> > > A basic example of the type of query in question is the following (see
> > > below for the actual query):
> >
> > >   select w, array_agg(x)
> > >   from (
> > >     select v, v / 10 as w
> > >     from pg_catalog.generate_series(25, 0, -1) as t(v)
> > >     order by v
> > >   ) as t(x)
> > >   group by w
> >
> > > This query will return an ordered array as specified by the order by
> > > clause.in the subquery.
> > > Can this behaviour be relied upon?
> >
> > No, not really.  It might always work given a particular set of
> > circumstances.  As long as the planner chooses to do the outer
> > query's grouped aggregation as a HashAgg, there'd be no reason
> > for it to reshuffle the subquery output before feeding that to
> > array_agg.  However, if it decided that sort-group-and-aggregate
> > was better, it'd insert a sort by w above the subquery, and then
> > you'd lose any certainty of the ordering by v continuing to hold.
> > (Maybe the sort by w would be stable for equal keys, but that's
> > not guaranteed.)
> >
> > What you really ought to do is write
> >
> >   select w, array_agg(x order by x)
> >   from ...
> >
> > to be in the clear per SQL standard.  I think that right now that'd
> > incur additional sorting overhead, which is annoying.  But work is
> > ongoing to recognize when the input is already correctly sorted
> > for an aggregate, so it should get better in PG 16 or so.
> >
> >                         regards, tom lane