Обсуждение: Aggregates (last/first) not behaving

Поиск

Список

Период

Сортировка

Aggregates (last/first) not behaving

От

Wes Devauld

Дата:

22 июля 2010 г., 06:51:56

I have PostgreSQL 8.3.9 [PostgreSQL 8.3.9 on i386-apple-darwin10.3.0, compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5646) (dot 1)]

and the custom first and last aggregates from:

http://wiki.postgresql.org/wiki/First_(aggregate)

http://wiki.postgresql.org/wiki/Last_(aggregate)

I have a simple table, of two columns. The first is a timestamp and is the primary key, the second is an integer. I've loaded the table up with values, one for every minute, for a whole year. Some SQL to recreate the table and the aggregates can be retrieved from:

http://blog.devauld.ca/aggregate_test.zip (File is approximately 180KB)

Now when I try to make use of the first and last aggregates, I get:

# select first(t), last(t) from test group by extract(day from t);

first | last

---------------------+---------------------

2009-01-01 00:00:00 | 2009-01-01 17:02:00

2009-01-02 10:07:00 | 2009-01-02 10:06:00

2009-01-03 20:15:00 | 2009-01-03 20:14:00

2009-01-04 00:00:00 | 2009-01-04 23:59:00

2009-01-05 00:00:00 | 2009-01-05 23:59:00

2009-01-06 16:31:00 | 2009-01-06 16:30:00

2009-01-07 00:00:00 | 2009-01-07 23:49:00

2009-01-08 11:09:00 | 2009-01-08 11:42:00

2009-01-09 11:08:00 | 2009-01-09 00:51:00

2009-01-10 11:33:00 | 2009-01-10 23:37:00

2009-01-11 13:05:00 | 2009-01-11 23:59:00

2009-01-12 23:55:00 | 2009-01-12 23:47:00

2009-01-13 01:50:00 | 2009-01-13 23:36:00

2009-01-14 23:55:00 | 2009-01-14 23:41:00

2009-01-15 00:47:00 | 2009-01-15 23:40:00

2009-01-16 00:29:00 | 2009-01-16 23:38:00

2009-01-17 00:09:00 | 2009-01-17 23:37:00

2009-01-18 23:48:00 | 2009-01-18 23:37:00

2009-01-19 23:56:00 | 2009-01-19 23:39:00

2009-01-20 07:14:00 | 2009-01-20 23:36:00

2009-01-21 23:40:00 | 2009-01-21 23:41:00

2009-01-22 02:57:00 | 2009-01-22 23:40:00

2009-01-23 23:56:00 | 2009-01-23 23:38:00

2009-01-24 09:34:00 | 2009-01-24 23:37:00

2009-01-25 23:50:00 | 2009-01-25 23:37:00

2009-01-26 23:48:00 | 2009-01-26 23:39:00

2009-01-27 06:36:00 | 2009-01-27 23:37:00

2009-01-28 23:59:00 | 2009-01-28 23:41:00

2009-01-29 16:12:00 | 2009-01-29 23:40:00

2009-01-30 21:11:00 | 2009-01-30 23:39:00

2009-01-31 20:12:00 | 2009-01-31 16:20:00

(31 rows)

For some reason the aggregates are not falling into the proper group. I can't blame timezones as the results are all over the map, and first/last relationship is broken as in some cases 'last' is chronologically before 'first'

If I explicitly retrieve the values for midnight each day:

# select t, v from test where extract(hour from t) = 0 and extract(minute from t) = 0;

t | v

---------------------+-------

2009-01-01 00:00:00 | 0

2009-01-02 00:00:00 | 1440

2009-01-03 00:00:00 | 2880

2009-01-04 00:00:00 | 4320

2009-01-05 00:00:00 | 5760

2009-01-06 00:00:00 | 7200

2009-01-07 00:00:00 | 8640

2009-01-08 00:00:00 | 10080

2009-01-09 00:00:00 | 11520

2009-01-10 00:00:00 | 12960

2009-01-11 00:00:00 | 14400

2009-01-12 00:00:00 | 15840

2009-01-13 00:00:00 | 17280

2009-01-14 00:00:00 | 18720

2009-01-15 00:00:00 | 20160

2009-01-16 00:00:00 | 21600

2009-01-17 00:00:00 | 23040

2009-01-18 00:00:00 | 24480

2009-01-19 00:00:00 | 25920

2009-01-20 00:00:00 | 27360

2009-01-21 00:00:00 | 28800

2009-01-22 00:00:00 | 30240

2009-01-23 00:00:00 | 31680

2009-01-24 00:00:00 | 33120

2009-01-25 00:00:00 | 34560

2009-01-26 00:00:00 | 36000

2009-01-27 00:00:00 | 37440

2009-01-28 00:00:00 | 38880

2009-01-29 00:00:00 | 40320

2009-01-30 00:00:00 | 41760

2009-01-31 00:00:00 | 43200

(31 rows)

I get back the values for which I am seeking. The pain is in finding the last record in the day before. I would have thought that grouping by date_trunc on month would have yeilded similar results to above:

# select first(t), first(v) from test group by date_trunc('day', t);

first | first

---------------------+-------

2009-01-01 00:00:00 | 0

2009-01-02 10:07:00 | 2047

2009-01-03 20:15:00 | 4095

2009-01-04 00:00:00 | 4320

2009-01-05 00:00:00 | 5760

2009-01-06 17:33:00 | 8253

2009-01-07 16:56:00 | 9656

2009-01-08 17:28:00 | 11128

2009-01-09 21:14:00 | 12794

2009-01-10 05:47:00 | 13307

2009-01-11 16:42:00 | 15402

2009-01-12 16:30:00 | 16830

2009-01-13 20:14:00 | 18494

2009-01-14 23:59:00 | 20159

2009-01-15 22:17:00 | 21497

2009-01-16 23:57:00 | 23037

2009-01-17 18:32:00 | 24152

2009-01-18 20:15:00 | 25695

2009-01-19 07:58:00 | 26398

2009-01-20 22:16:00 | 28696

2009-01-21 17:31:00 | 29851

2009-01-22 16:37:00 | 31237

2009-01-23 23:59:00 | 33119

2009-01-24 21:13:00 | 34393

2009-01-25 22:17:00 | 35897

2009-01-26 16:42:00 | 37002

2009-01-27 16:30:00 | 38430

2009-01-28 16:52:00 | 39892

2009-01-29 23:59:00 | 41759

2009-01-30 10:19:00 | 42379

2009-01-31 14:58:00 | 44098

(31 rows)

Looking at the plan:

# explain select first(t), first(v) from test group by date_trunc('day', t);

QUERY PLAN

-----------------------------------------------------------------------

GroupAggregate (cost=5010.56..6238.16 rows=44640 width=12)

-> Sort (cost=5010.56..5122.16 rows=44640 width=12)

Sort Key: (date_trunc('day'::text, t))

-> Seq Scan on test (cost=0.00..800.00 rows=44640 width=12)

(4 rows)

Now, if I have a much smaller data set:

# delete from test where t > '2009-01-03'; vacuum analyze;

# select first(t), first(v) from test group by date_trunc('day', t);

first | first

---------------------+-------

2009-01-01 00:00:00 | 0

2009-01-03 00:00:00 | 2880

2009-01-02 00:00:00 | 1440

(3 rows)

# explain select first(t), first(v) from test group by date_trunc('day', t);

QUERY PLAN

---------------------------------------------------------------

HashAggregate (cost=73.62..124.04 rows=2881 width=12)

-> Seq Scan on test (cost=0.00..52.01 rows=2881 width=12)

(2 rows)

So, is there a way I can set up my query to deterministically return the same results each time? Is there another way to get 'first of the month' or 'last for the day' aggregates that work reliably on varying data set sizes? Can I force the optimizer to not use a GroupAggregate?

Any help would be appreciated,

-W

Re: Aggregates (last/first) not behaving

От

Richard Huxton

Дата:

22 июля 2010 г., 08:45:07

On 22/07/10 07:37, Wes Devauld wrote:
> I have PostgreSQL 8.3.9 [PostgreSQL 8.3.9 on i386-apple-darwin10.3.0,
> compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc.
> build 5646) (dot 1)]
>
> and the custom first and last aggregates from:
> http://wiki.postgresql.org/wiki/First_(aggregate)
> http://wiki.postgresql.org/wiki/Last_(aggregate)
>
> I have a simple table, of two columns.  The first is a timestamp and is
> the primary key, the second is an integer.  I've loaded the table up
> with values, one for every minute, for a whole year.  Some SQL to
> recreate the table and the aggregates can be retrieved from:
>
> http://blog.devauld.ca/aggregate_test.zip (File is approximately 180KB)
>
> Now when I try to make use of the first and last aggregates, I get:
>
> # select first(t), last(t) from test group by extract(day from t);
>          first        |        last
> ---------------------+---------------------
>   2009-01-01 00:00:00 | 2009-01-01 17:02:00
>   2009-01-02 10:07:00 | 2009-01-02 10:06:00
>   2009-01-03 20:15:00 | 2009-01-03 20:14:00
[snip]
> For some reason the aggregates are not falling into the proper group.  I
> can't blame timezones as the results are all over the map, and
> first/last relationship is broken as in some cases 'last' is
> chronologically before 'first'

They all seem grouped properly (by day) to me. Unless I've missed something.

The first/last aggregates aren't ordered in any way. They are "first 
value I happened to find" and "last value I happened to find".

If you want the earliest/latest timestamp from each day, use min() and 
max().

--   Richard Huxton  Archonet Ltd

Re: Aggregates (last/first) not behaving

От

Wes Devauld

Дата:

22 июля 2010 г., 17:13:13

I believe I lost the flavour of what I'm doing when I constructed this example. I'm not interested in the timepoint as much as the value that is attached to it. I need to be able to find the last chronological record for a given day.

I can get the value for which I am looking in two steps:

select max(t) as t into table last_of_day from test group by extract(day from t);

select last_of_day.t, test.v from last_of_day, test where last_of_day.t = test.t;

I was fairly happy when first() and last() were discovered, as these two steps could be merged, until the table grows too large and the query planner decides to sort the results before they are aggregated.

I was searching for a way to keep using last() and keeping the extraction to a single step, although the more I fight with it, the less I think that it is worth it. If you have any further suggestions, I would appreciate hearing them.

-W

On Thu, Jul 22, 2010 at 2:44 AM, Richard Huxton <dev@archonet.com> wrote:

They all seem grouped properly (by day) to me. Unless I've missed something.

The first/last aggregates aren't ordered in any way. They are "first value I happened to find" and "last value I happened to find".

If you want the earliest/latest timestamp from each day, use min() and max().

--
Richard Huxton
Archonet Ltd

Re: Aggregates (last/first) not behaving

От

Tim Landscheidt

Дата:

22 июля 2010 г., 19:03:30

Wes Devauld <wes@devauld.ca> wrote:

> I believe I lost the flavour of what I'm doing when I constructed this
> example.  I'm not interested in the timepoint as much as the value that is
> attached to it.  I need to be able to find the last chronological record for
> a given day.

> I can get the value for which I am looking in two steps:

> select max(t) as t into table last_of_day from test group by extract(day
> from t);
> select last_of_day.t, test.v from last_of_day, test where last_of_day.t =
> test.t;

> I was fairly happy when first() and last() were discovered, as these two
> steps could be merged, until the table grows too large and the query planner
> decides to sort the results before they are aggregated.

> I was searching for a way to keep using last() and keeping the extraction to
> a single step, although the more I fight with it, the less I think that it
> is worth it.  If you have any further suggestions, I would appreciate
> hearing them.
> [...]

Perhaps you could /concisely/ rephrase your problem. Finding
the first/last value per group with/without window functions
is a common problem, and there are lots of solutions to it.
But few people will wade through lots of text to find out
what's bothering you.
 For example, you can query the "last" values per day along
the lines of (untested):

| SELECT EXTRACT(day FROM t), v
|   FROM test
|   WHERE t IN (SELECT MAX(t) FROM test
|                 GROUP BY EXTRACT(day FROM t));

Obviously, this doesn't "keep using last()", so I don't know
whether it's good or bad for you.

Tim

Re: Aggregates (last/first) not behaving

От

Richard Huxton

Дата:

22 июля 2010 г., 19:24:58

On 22/07/10 16:50, Wes Devauld wrote:
>
> I was searching for a way to keep using last() and keeping the extraction to
> a single step, although the more I fight with it, the less I think that it
> is worth it.  If you have any further suggestions, I would appreciate
> hearing them.

You can certainly do it in a single query. I've commented out the 
event_date_idx below because it's far from guaranteed it'll be useful to 
you.

BEGIN;

DROP TABLE IF EXISTS events;

CREATE TABLE events (    e_id   SERIAL,    e_ts   timestamp(0) without time zone,    PRIMARY KEY (e_id)
);

INSERT INTO events (e_ts)
SELECT '2010-01-01 01:01:01'::timestamp without time zone
+ i * '1 minute'::interval
FROM generate_series(0,999999) i;

-- CREATE INDEX event_date_idx ON events ((e_ts::date),e_ts);
CREATE INDEX events_ts_idx ON events (e_ts);

-- EXPLAIN ANALYSE
SELECT    e.e_id,    e.e_ts,    minmax.tgt_day
FROM (    SELECT        (e_ts::date) AS tgt_day,         min(e_ts) as first_ts,         max(e_ts) as last_ts    FROM
   events    GROUP BY 1
 
) AS minmax
JOIN events e
ON (e.e_ts = minmax.first_ts) OR (e.e_ts = minmax.last_ts)
ORDER BY e_ts
;

COMMIT;

Using the real, windowing versions of first/last in 8.4+ will still 
require sorting the whole table (AFAICT) so isn't likely to be much 
improvement over a self-join here.

--   Richard Huxton  Archonet Ltd

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Aggregates (last/first) not behaving

Aggregates (last/first) not behaving

Re: Aggregates (last/first) not behaving

Re: Aggregates (last/first) not behaving

Re: Aggregates (last/first) not behaving

Re: Aggregates (last/first) not behaving