Обсуждение: `order by random()` makes select-list `random()` invocations deterministic

Поиск
Список
Период
Сортировка

`order by random()` makes select-list `random()` invocations deterministic

От
"Dian Fay"
Дата:
In Postgres 16.1, running the following query:

```
select
  gs,
  gs + random() * 100 - 50 as gs2,
  random() * 100 - 50 as r1,
  random() * 100 - 50 as r2,
  random() * 100 - 50 as r3
from generate_series(0, 10) as gs
order by random();
```

Every `random()` invocation in the select list uses a single consistent
value within each returned row. Remove the `order by random()` and
values become randomized as expected, but it gets a bit stranger:

- order by any of the `rN` values and `gs2` uses a different random
  value, but all the `rN` values are identical;
- order by `gs2` and all random values are distinct again.



Re: `order by random()` makes select-list `random()` invocations deterministic

От
"David G. Johnston"
Дата:
On Wednesday, February 28, 2024, Dian Fay <di@nmfay.com> wrote:

Every `random()` invocation in the select list uses a single consistent
value within each returned row. Remove the `order by random()` and
values become randomized as expected, but it gets a bit stranger:

 
This seems to be one of those “won’t fix” bugs that stems from the parser being a bit too liberal in what it accepts as valid SQL.


The fact that the gs involving expression is seen differently than the ones not involving gs doesn’t surprise me.

David J.

Re: `order by random()` makes select-list `random()` invocations deterministic

От
"Dian Fay"
Дата:
On Thu Feb 29, 2024 at 12:53 AM EST, David G. Johnston wrote:
> On Wednesday, February 28, 2024, Dian Fay <di@nmfay.com> wrote:
> >
> >
> > Every `random()` invocation in the select list uses a single consistent
> > value within each returned row. Remove the `order by random()` and
> > values become randomized as expected, but it gets a bit stranger:
> >
> >
> This seems to be one of those “won’t fix” bugs that stems from the parser
> being a bit too liberal in what it accepts as valid SQL.
>
>
> https://www.postgresql.org/message-id/CAKFQuwZ3-XGfcS%2BCLTAYvPx3ARYjUxv%2B%3DYL8sOicV0nda%3DT5cA%40mail.gmail.com
>
> The fact that the gs involving expression is seen differently than the ones
> not involving gs doesn’t surprise me.
>
> David J.

Thanks, that makes sense! Any thoughts on whether it's worth a
cautionary note in the `order by` and/or random function docs since
`order by random()` is a fairly well attested solution and I'm not the
first person to run into this quirk?



"Dian Fay" <di@nmfay.com> writes:
> Thanks, that makes sense! Any thoughts on whether it's worth a
> cautionary note in the `order by` and/or random function docs since
> `order by random()` is a fairly well attested solution and I'm not the
> first person to run into this quirk?

The question is where to put the caution, because it's not like this
is somehow specific to random().

I believe what's fundamentally happening here is that ordinarily,
functions in the targetlist are evaluated during the final projection
step that occurs after the scan/join phase.  So for example:

regression=# explain (verbose, costs off) select f1, random(), random() from int4_tbl;
            QUERY PLAN
----------------------------------
 Seq Scan on public.int4_tbl
   Output: f1, random(), random()
(2 rows)

regression=# select f1, random(), random() from int4_tbl;
     f1      |       random        |       random
-------------+---------------------+---------------------
           0 |  0.6623584085865575 |  0.8611211203466376
      123456 |  0.7371018974566144 |   0.707800598912321
     -123456 |  0.5652778572736816 |  0.2684184354600243
  2147483647 | 0.47747487098101504 |   0.531163579706837
 -2147483647 |  0.3829707208069777 | 0.11094005213737002
(5 rows)

The two textually distinct occurrences of random() are evaluated
separately.  However, if you use that same function as a sort key:

regression=# explain (verbose, costs off) select f1, random(), random() from int4_tbl order by random();
               QUERY PLAN
-----------------------------------------
 Result
   Output: f1, (random()), (random())
   ->  Sort
         Output: f1, (random())
         Sort Key: (random())
         ->  Seq Scan on public.int4_tbl
               Output: f1, random()
(7 rows)

regression=# select f1, random(), random() from int4_tbl order by random();
     f1      |       random        |       random
-------------+---------------------+---------------------
     -123456 | 0.39761234612097884 | 0.39761234612097884
      123456 |  0.7227920193261217 |  0.7227920193261217
           0 |  0.7628733460552672 |  0.7628733460552672
  2147483647 |  0.8745622687164281 |  0.8745622687164281
 -2147483647 |   0.929020084273001 |   0.929020084273001
(5 rows)

Reading that explain plan requires a certain amount of expertise,
but the key point is that the extra parens around the upper-level
"random()" expressions indicate that the value is being copied up
from the subplan rather than being evaluated afresh.  So the need
to include "random()" in what passes through the sort step results
in the final projection just re-using that value in both places,
because the targetlist entries (and subexpressions thereof) are
matched literally against what is coming out of the sort step without
consideration for whether those expressions are volatile and what
should happen if they are.  The same would happen with a GROUP BY
expression.

So that's why it happens like that from an implementation standpoint,
but that doesn't inform us much about what we might like to do
differently.  I think the current behavior arose from wanting to
support SQL92 syntax like

    select random() from mytab order by 1;

Here it should surprise nobody if the output appears in sorted order
--- in fact, I bet people would say it's a bug if it doesn't.
However, SQL99 got rid of that syntax and would have us write

    select random() from mytab order by random();

Here it's very much less obvious whether the two occurrences of
random() are meant to denote the same value.  PG has historically
taken the position that they are, mainly because the SQL92 behavior
clearly has some use and you can't readily get that with the SQL99
syntax unless you read it this way.  However, having taken that
position, it's hard to argue that

    select random(), random() from mytab order by random();

shouldn't result in all three instances meaning the same value.

There's an argument certainly that we should require you to write
it differently if you want that behavior, probably by having just
one random() instance in a sub-select.  But I doubt we'd get a
lot of kudos for changing a behavior that's stood for a couple of
decades.

Maybe we could write something like "An expression or subexpression in
the SELECT list that matches an ORDER BY or GROUP BY item is taken to
represent the same value that was sorted or grouped by, even when the
(sub)expression is volatile".  I'm not sure where to put this though.
(TBH, I'm also not very sure that we honor that in absolutely every
case ...)

            regards, tom lane