Обсуждение: Select ranges based on sequential breaks

Поиск
Список
Период
Сортировка

Select ranges based on sequential breaks

От
Mike Toews
Дата:
Hi,

I'm having difficulty constructing a query that will find breaks where
data change in a time-series. I've done some searching for this too, but
I haven't found anything.

Here is my example situation, consider my source table:
date     bin
2009-01-01     red
2009-01-02     red
2009-01-03     blue
2009-01-04     blue
2009-01-05     blue
2009-01-06     red
2009-01-07     blue
2009-01-08     blue
2009-01-09     red
2009-01-10     red


I would like to get the first and last of each consecutive series based
on column "bin". My result for the table would look like:
first     last     bin
2009-01-01     2009-01-02     red
2009-01-03     2009-01-05     blue
2009-01-06     2009-01-06     red
2009-01-07     2009-01-08     blue
2009-01-09     2009-01-10     red


This is easy to compute using a spreadsheet or in R, but how would I do
this with SQL? I'm using 8.3. Advice is appreciated.

Thanks,

-Mike

Re: Select ranges based on sequential breaks

От
David Wilson
Дата:
On Mon, Jun 15, 2009 at 2:23 PM, Mike Toews<mwtoews@sfu.ca> wrote:
> Hi,
>
> I'm having difficulty constructing a query that will find breaks where data
> change in a time-series. I've done some searching for this too, but I
> haven't found anything.
>
> Here is my example situation, consider my source table:
> date    bin
> 2009-01-01      red
> 2009-01-02      red
> 2009-01-03      blue
> 2009-01-04      blue
> 2009-01-05      blue
> 2009-01-06      red
> 2009-01-07      blue
> 2009-01-08      blue
> 2009-01-09      red
> 2009-01-10      red
>
>
> I would like to get the first and last of each consecutive series based on
> column "bin". My result for the table would look like:
> first   last    bin
> 2009-01-01      2009-01-02      red
> 2009-01-03      2009-01-05      blue
> 2009-01-06      2009-01-06      red
> 2009-01-07      2009-01-08      blue
> 2009-01-09      2009-01-10      red
>
>
> This is easy to compute using a spreadsheet or in R, but how would I do this
> with SQL? I'm using 8.3. Advice is appreciated.

(Written in email and untested- also, someone will probably provide a
better way, I hope, but this should at least work)

select date as first,
(select date from table t3 where t3.date<(select date from table t5
where t5.date>t1.date and t5.bin<>t1.bin order by date asc limit 1)
order by date desc limit 1) as last,
bin
from table t1 where (select bin from table t2 where t2.date<t1.order
order by date desc limit 1)<>t1.bin;

Ugly, and I'm pretty sure there's a much better way, but my brain is
failing me right now- hopefully this'll at least get you started,
though.

--
- David T. Wilson
david.t.wilson@gmail.com

Re: Select ranges based on sequential breaks

От
Joel Nothman
Дата:
Hi Mike,

I happened upon your query, which is related to some stuff I've been
playing with.

Firstly, David's solution below doesn't work. I haven't yet tried to work
out why.

Secondly, I was hoping to be able to solve your problem nicely with
Postgres 8.4's window functions [1,2], which can provide access to
data in sequentially-related rows.

Given the following setup:
CREATE TABLE foo (dt date, bin varchar(4));
INSERT INTO foo VALUES ('2009-01-01', 'red'), ('2009-01-02', 'red'),
('2009-01-03', 'blue'), ('2009-01-04', 'blue'), ('2009-01-05', 'blue'),
('2009-01-06', 'red'), ('2009-01-07', 'blue'), ('2009-01-08', 'blue'),
('2009-01-09', 'red'), ('2009-01-10', 'red');

I had hoped the following would suffice:
SELECT first_value(dt) OVER w, last_value(dt) OVER w, bin FROM foo WINDOW
w AS (ORDER BY dt PARTITION BY bin);

Apparently, this is bad syntax. ORDER BY cannot precede PARTITION BY in a
WINDOW declaration, and yet I wanted a grouping of date-consecutive bins,
which (PARTITION BY bin ORDER BY dt) would not give me.

I was able to produce the required result with:

SELECT MIN(dt) AS first, MAX(dt) AS last, MAX(bin) AS bin
      FROM (SELECT dt, bin,
                   CASE WHEN newbin IS NULL
                   THEN 0
                   ELSE SUM(newbin) OVER (ORDER BY dt)
                   END AS binno
              FROM (SELECT *,
                           (bin !=3D lag(bin, 1) OVER (ORDER BY dt))::int A=
S
newbin
                      FROM foo
                   ) AS newbins
              ) AS binnos
GROUP BY binno
ORDER BY first;

This relies on a middle step in which I create an enumeration of the bins
in sequence:
SELECT dt, bin, CASE WHEN newbin IS NULL THEN 0 ELSE sum(newbin) OVER
(ORDER BY dt) END AS binno FROM (SELECT *, (bin !=3D lag(bin, 1) OVER (ORDE=
R
BY dt))::int AS newbin FROM foo) AS newbins;
         dt     | bin  | binno
------------+------+-------
     2009-01-01 | red  |     0
     2009-01-02 | red  |     0
     2009-01-03 | blue |     1
     2009-01-04 | blue |     1
     2009-01-05 | blue |     1
     2009-01-06 | red  |     2
     2009-01-07 | blue |     3
     2009-01-08 | blue |     3
     2009-01-09 | red  |     4
     2009-01-10 | red  |     4

I would hope there is a neater way to do this with window functions.

The best way to solve your problem may be with PL/SQL, which is also good
at dealing with sequences (it's not as complicated as it looks!):

CREATE TYPE bindates AS (first date, last date, bin varchar(5));
CREATE OR REPLACE FUNCTION bindates()
RETURNS SETOF bindates
AS $$
DECLARE
      row record;
      res bindates;
BEGIN
      FOR row IN SELECT * FROM foo ORDER BY dt
      LOOP
        IF res.bin IS NULL OR res.bin !=3D row.bin THEN
          IF res.bin IS NOT NULL THEN
            RETURN NEXT res;
          END IF;
          res.bin :=3D row.bin;
          res.first :=3D row.dt;
        END IF;
        res.last :=3D row.dt;
      END LOOP;
      IF res.bin IS NOT NULL THEN
        RETURN NEXT res;
      END IF;
END;
$$ LANGUAGE plpgsql;


Finally, I'll try to sort out David's solution. Once we correct his typo
(t1.order -> t1.date) and add an 'ORDER BY first' to the end, we get:
       first    |    last    | bin
------------+------------+------
     2009-01-03 | 2009-01-05 | blue
     2009-01-06 | 2009-01-06 | red
     2009-01-07 | 2009-01-08 | blue
     2009-01-09 |            | red

This includes correct output, but it fails on both edge cases. The
non-appearance of the first row is due to the WHERE clause on the main
SELECT statement:
WHERE (SELECT bin
           FROM foo t2
           WHERE t2.dt < t1.dt
           ORDER BY dt DESC LIMIT 1) <> t1.bin

If we drop this WHERE clause, we get:
       first    |    last    | bin
------------+------------+------
     2009-01-01 | 2009-01-02 | red
     2009-01-02 | 2009-01-02 | red
     2009-01-03 | 2009-01-05 | blue
     2009-01-04 | 2009-01-05 | blue
     2009-01-05 | 2009-01-05 | blue
     2009-01-06 | 2009-01-06 | red
     2009-01-07 | 2009-01-08 | blue
     2009-01-08 | 2009-01-08 | blue
     2009-01-09 |            | red
     2009-01-10 |            | red

We can therefore get the result including the first row by selecting from
this table with 'GROUP BY last, bin'.

And we can hack in a value for those final NULLs as a special case. The
following statement works:

SELECT MIN(first),
           CASE WHEN last IS NULL THEN (SELECT MAX(dt) FROM foo) ELSE last
END,
           bin
    FROM (
      SELECT dt AS first,
             (SELECT dt
              FROM foo t3
              WHERE t3.dt < (
                  SELECT dt
                  FROM foo t5
                  WHERE t5.dt > t1.dt
                    AND t5.bin <> t1.bin
                  ORDER BY dt ASC
                  LIMIT 1)
              ORDER BY dt DESC
              LIMIT 1) AS last,
              bin
      FROM foo t1
) t0
GROUP BY last, bin
ORDER BY last;


Finally, what's efficient? With 1,000,000 random rows, we get:
Enumeration: 13s
PL/SQL: 12s
Modified David: minutes.

[I used the following to create test data:
CREATE OR REPLACE FUNCTION make_random(n int) RETURNS SETOF foo AS $$
import random
for i in xrange(n):
      m =3D random.randint(1,12)
      d =3D random.randint(1,28)
      b =3D random.choice(('red', 'blue'))
      yield '2009-%d-%d' % (m, d), b
$$ LANGUAGE plpythonu;
DELETE FROM foo; INSERT INTO foo (SELECT * FROM make_random(1000000));]

I hope that helps you in considering various approaches to the problem.

- Joel


[1] http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html
[2] http://developer.postgresql.org/pgdocs/postgres/functions-window.html


On Tue, 16 Jun 2009 06:06:16 +1000, David Wilson
<david.t.wilson@gmail.com> wrote:

> On Mon, Jun 15, 2009 at 2:23 PM, Mike Toews<mwtoews@sfu.ca> wrote:
>> Hi,
>>
>> I'm having difficulty constructing a query that will find breaks where
>> data
>> change in a time-series. I've done some searching for this too, but I
>> haven't found anything.
>>
>> Here is my example situation, consider my source table:
>> date =A0 =A0bin
>> 2009-01-01 =A0 =A0 =A0red
>> 2009-01-02 =A0 =A0 =A0red
>> 2009-01-03 =A0 =A0 =A0blue
>> 2009-01-04 =A0 =A0 =A0blue
>> 2009-01-05 =A0 =A0 =A0blue
>> 2009-01-06 =A0 =A0 =A0red
>> 2009-01-07 =A0 =A0 =A0blue
>> 2009-01-08 =A0 =A0 =A0blue
>> 2009-01-09 =A0 =A0 =A0red
>> 2009-01-10 =A0 =A0 =A0red
>>
>>
>> I would like to get the first and last of each consecutive series based
>> on
>> column "bin". My result for the table would look like:
>> first =A0 last =A0 =A0bin
>> 2009-01-01 =A0 =A0 =A02009-01-02 =A0 =A0 =A0red
>> 2009-01-03 =A0 =A0 =A02009-01-05 =A0 =A0 =A0blue
>> 2009-01-06 =A0 =A0 =A02009-01-06 =A0 =A0 =A0red
>> 2009-01-07 =A0 =A0 =A02009-01-08 =A0 =A0 =A0blue
>> 2009-01-09 =A0 =A0 =A02009-01-10 =A0 =A0 =A0red
>>
>>
>> This is easy to compute using a spreadsheet or in R, but how would I do
>> this
>> with SQL? I'm using 8.3. Advice is appreciated.
>
> (Written in email and untested- also, someone will probably provide a
> better way, I hope, but this should at least work)
>
> select date as first,
> (select date from table t3 where t3.date<(select date from table t5
> where t5.date>t1.date and t5.bin<>t1.bin order by date asc limit 1)
> order by date desc limit 1) as last,
> bin
> from table t1 where (select bin from table t2 where t2.date<t1.order
> order by date desc limit 1)<>t1.bin;
>
> Ugly, and I'm pretty sure there's a much better way, but my brain is
> failing me right now- hopefully this'll at least get you started,
> though.

Re: Select ranges based on sequential breaks

От
Joel Nothman
Дата:
Hi Mike,

I happened upon your query, which is related to some stuff I've been
playing with.

Firstly, David's solution below doesn't work. I haven't yet tried to work
out why.

Secondly, I was hoping to be able to solve your problem nicely with
Postgres 8.4's window functions [1,2], which provide functions relating.

Given the following setup:
CREATE TABLE foo (dt date, bin varchar(4));
INSERT INTO foo VALUES ('2009-01-01', 'red'), ('2009-01-02', 'red'),
('2009-01-03', 'blue'), ('2009-01-04', 'blue'), ('2009-01-05', 'blue'),
('2009-01-06', 'red'), ('2009-01-07', 'blue'), ('2009-01-08', 'blue'),
('2009-01-09', 'red'), ('2009-01-10', 'red');

I had hoped the following would suffice:
SELECT first_value(dt) OVER w, last_value(dt) OVER w, bin FROM foo WINDOW
w AS (ORDER BY dt PARTITION BY bin);

Apparently, this is bad syntax. ORDER BY cannot precede PARTITION BY in a
WINDOW declaration, and yet I wanted a grouping of date-consecutive bins,
which (PARTITION BY bin ORDER BY dt) would not give me.

I was able to produce the required result with:

SELECT MIN(dt) AS first, MAX(dt) AS last, MAX(bin) AS bin
      FROM (SELECT dt, bin,
                   CASE WHEN newbin IS NULL
                   THEN 0
                   ELSE SUM(newbin) OVER (ORDER BY dt)
                   END AS binno
              FROM (SELECT *,
                           (bin != lag(bin, 1) OVER (ORDER BY dt))::int AS
newbin
                      FROM foo
                   ) AS newbins
              ) AS binnos
GROUP BY binno
ORDER BY first;

This relies on a middle step in which I create an enumeration of the bins
in sequence:
SELECT dt, bin, CASE WHEN newbin IS NULL THEN 0 ELSE sum(newbin) OVER
(ORDER BY dt) END AS binno FROM (SELECT *, (bin != lag(bin, 1) OVER (ORDER
BY dt))::int AS newbin FROM foo) AS newbins;
         dt     | bin  | binno
------------+------+-------
     2009-01-01 | red  |     0
     2009-01-02 | red  |     0
     2009-01-03 | blue |     1
     2009-01-04 | blue |     1
     2009-01-05 | blue |     1
     2009-01-06 | red  |     2
     2009-01-07 | blue |     3
     2009-01-08 | blue |     3
     2009-01-09 | red  |     4
     2009-01-10 | red  |     4

I would hope there is a neater way to do this with window functions.

The best way to solve your problem may be with PL/SQL, which is also good
at dealing with sequences (it's not as complicated as it looks!):

CREATE TYPE bindates AS (first date, last date, bin varchar(5));
CREATE OR REPLACE FUNCTION bindates()
RETURNS SETOF bindates
AS $$
DECLARE
      row record;
      res bindates;
BEGIN
      FOR row IN SELECT * FROM foo ORDER BY dt
      LOOP
        IF res.bin IS NULL OR res.bin != row.bin THEN
          IF res.bin IS NOT NULL THEN
            RETURN NEXT res;
          END IF;
          res.bin := row.bin;
          res.first := row.dt;
        END IF;
        res.last := row.dt;
      END LOOP;
      IF res.bin IS NOT NULL THEN
        RETURN NEXT res;
      END IF;
END;
$$ LANGUAGE plpgsql;


Finally, I'll try to sort out David's solution. Once we correct his typo
(t1.order -> t1.date) and add an 'ORDER BY first' to the end, we get:
       first    |    last    | bin
------------+------------+------
     2009-01-03 | 2009-01-05 | blue
     2009-01-06 | 2009-01-06 | red
     2009-01-07 | 2009-01-08 | blue
     2009-01-09 |            | red

This includes correct output, but it fails on both edge cases. The
non-appearance of the first row is due to the WHERE clause on the main
SELECT statement:
WHERE (SELECT bin
           FROM foo t2
           WHERE t2.dt < t1.dt
           ORDER BY dt DESC LIMIT 1) <> t1.bin

If we drop this WHERE clause, we get:
       first    |    last    | bin
------------+------------+------
     2009-01-01 | 2009-01-02 | red
     2009-01-02 | 2009-01-02 | red
     2009-01-03 | 2009-01-05 | blue
     2009-01-04 | 2009-01-05 | blue
     2009-01-05 | 2009-01-05 | blue
     2009-01-06 | 2009-01-06 | red
     2009-01-07 | 2009-01-08 | blue
     2009-01-08 | 2009-01-08 | blue
     2009-01-09 |            | red
     2009-01-10 |            | red

We can therefore get the result including the first row by selecting from
this table with 'GROUP BY last, bin'.

And we can hack in a value for those final NULLs as a special case. The
following statement works:

SELECT MIN(first),
           CASE WHEN last IS NULL THEN (SELECT MAX(dt) FROM foo) ELSE last
END,
           bin
    FROM (
      SELECT dt AS first,
             (SELECT dt
              FROM foo t3
              WHERE t3.dt < (
                  SELECT dt
                  FROM foo t5
                  WHERE t5.dt > t1.dt
                    AND t5.bin <> t1.bin
                  ORDER BY dt ASC
                  LIMIT 1)
              ORDER BY dt DESC
              LIMIT 1) AS last,
              bin
      FROM foo t1
) t0
GROUP BY last, bin
ORDER BY last;


Finally, what's efficient? With 1,000,000 random rows, we get:
Enumeration: 13s
PL/SQL: 12s
Modified David: minutes.

[I used the following to create test data:
CREATE OR REPLACE FUNCTION make_random(n int) RETURNS SETOF foo AS $$
import random
for i in xrange(n):
      m = random.randint(1,12)
      d = random.randint(1,28)
      b = random.choice(('red', 'blue'))
      yield '2009-%d-%d' % (m, d), b
$$ LANGUAGE plpythonu;
DELETE FROM foo; INSERT INTO foo (SELECT * FROM make_random(1000000));]

I hope that helps you in considering various approaches to the problem.

- Joel


[1] http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html
[2] http://developer.postgresql.org/pgdocs/postgres/functions-window.html


On Tue, 16 Jun 2009 06:06:16 +1000, David Wilson
<david.t.wilson@gmail.com> wrote:

> On Mon, Jun 15, 2009 at 2:23 PM, Mike Toews<mwtoews@sfu.ca> wrote:
>> Hi,
>>
>> I'm having difficulty constructing a query that will find breaks where
>> data
>> change in a time-series. I've done some searching for this too, but I
>> haven't found anything.
>>
>> Here is my example situation, consider my source table:
>> date    bin
>> 2009-01-01      red
>> 2009-01-02      red
>> 2009-01-03      blue
>> 2009-01-04      blue
>> 2009-01-05      blue
>> 2009-01-06      red
>> 2009-01-07      blue
>> 2009-01-08      blue
>> 2009-01-09      red
>> 2009-01-10      red
>>
>>
>> I would like to get the first and last of each consecutive series based
>> on
>> column "bin". My result for the table would look like:
>> first   last    bin
>> 2009-01-01      2009-01-02      red
>> 2009-01-03      2009-01-05      blue
>> 2009-01-06      2009-01-06      red
>> 2009-01-07      2009-01-08      blue
>> 2009-01-09      2009-01-10      red
>>
>>
>> This is easy to compute using a spreadsheet or in R, but how would I do
>> this
>> with SQL? I'm using 8.3. Advice is appreciated.
>
> (Written in email and untested- also, someone will probably provide a
> better way, I hope, but this should at least work)
>
> select date as first,
> (select date from table t3 where t3.date<(select date from table t5
> where t5.date>t1.date and t5.bin<>t1.bin order by date asc limit 1)
> order by date desc limit 1) as last,
> bin
> from table t1 where (select bin from table t2 where t2.date<t1.order
> order by date desc limit 1)<>t1.bin;
>
> Ugly, and I'm pretty sure there's a much better way, but my brain is
> failing me right now- hopefully this'll at least get you started,
> though.

Re: Select ranges based on sequential breaks

От
Scott Marlowe
Дата:
On Mon, Jun 15, 2009 at 12:23 PM, Mike Toews<mwtoews@sfu.ca> wrote:
> This is easy to compute using a spreadsheet or in R, but how would I do this
> with SQL? I'm using 8.3. Advice is appreciated.

FYI (and I'm no expert in this area) R is available as a pl for
postgres, look for pl/R or plR

Re: Select ranges based on sequential breaks

От
Mike Toews
Дата:
Hi Joel,

Window functions appear to be the best solution for this style of
problem, and I'm looking forward to their applications. However, I'm
sticking with 8.3 for at least a year, so I'm not able to explore this
solution yet. For now, I can only post-process the output in a non-SQL
environment. I also need to do other fun stuff, like cumulative sums,
which is also challenging with SQL, but much easier and intuitive with R.

An excellent book that I recently stumbled on is /Joe Celko's SQL for
Smarties/ (recommended by someone on this list), which is a heavy read,
but has an amazing depth to ANSI SQL. Of particular interest is "Chapter
24: Regions, Runs, Gaps, Sequences, and Series". I'm slowly digesting this.

-Mike

Joel Nothman wrote:
> Hi Mike,
>
> I happened upon your query, which is related to some stuff I've been
> playing with.
>
> Firstly, David's solution below doesn't work. I haven't yet tried to work
> out why.
>
> Secondly, I was hoping to be able to solve your problem nicely with
> Postgres 8.4's window functions [1,2], which can provide access to
> data in sequentially-related rows.
>
> Given the following setup:
> CREATE TABLE foo (dt date, bin varchar(4));
> INSERT INTO foo VALUES ('2009-01-01', 'red'), ('2009-01-02', 'red'),
> ('2009-01-03', 'blue'), ('2009-01-04', 'blue'), ('2009-01-05', 'blue'),
> ('2009-01-06', 'red'), ('2009-01-07', 'blue'), ('2009-01-08', 'blue'),
> ('2009-01-09', 'red'), ('2009-01-10', 'red');
>
> I had hoped the following would suffice:
> SELECT first_value(dt) OVER w, last_value(dt) OVER w, bin FROM foo WINDOW
> w AS (ORDER BY dt PARTITION BY bin);
>
> Apparently, this is bad syntax. ORDER BY cannot precede PARTITION BY in a
> WINDOW declaration, and yet I wanted a grouping of date-consecutive bins,
> which (PARTITION BY bin ORDER BY dt) would not give me.
>
> I was able to produce the required result with:
>
> SELECT MIN(dt) AS first, MAX(dt) AS last, MAX(bin) AS bin
>       FROM (SELECT dt, bin,
>                    CASE WHEN newbin IS NULL
>                    THEN 0
>                    ELSE SUM(newbin) OVER (ORDER BY dt)
>                    END AS binno
>               FROM (SELECT *,
>                            (bin !=3D lag(bin, 1) OVER (ORDER BY dt))::int A=
> S
> newbin
>                       FROM foo
>                    ) AS newbins
>               ) AS binnos
> GROUP BY binno
> ORDER BY first;
>
> This relies on a middle step in which I create an enumeration of the bins
> in sequence:
> SELECT dt, bin, CASE WHEN newbin IS NULL THEN 0 ELSE sum(newbin) OVER
> (ORDER BY dt) END AS binno FROM (SELECT *, (bin !=3D lag(bin, 1) OVER (ORDE=
> R
> BY dt))::int AS newbin FROM foo) AS newbins;
>          dt     | bin  | binno
> ------------+------+-------
>      2009-01-01 | red  |     0
>      2009-01-02 | red  |     0
>      2009-01-03 | blue |     1
>      2009-01-04 | blue |     1
>      2009-01-05 | blue |     1
>      2009-01-06 | red  |     2
>      2009-01-07 | blue |     3
>      2009-01-08 | blue |     3
>      2009-01-09 | red  |     4
>      2009-01-10 | red  |     4
>
> I would hope there is a neater way to do this with window functions.
>
> The best way to solve your problem may be with PL/SQL, which is also good
> at dealing with sequences (it's not as complicated as it looks!):
>
> CREATE TYPE bindates AS (first date, last date, bin varchar(5));
> CREATE OR REPLACE FUNCTION bindates()
> RETURNS SETOF bindates
> AS $$
> DECLARE
>       row record;
>       res bindates;
> BEGIN
>       FOR row IN SELECT * FROM foo ORDER BY dt
>       LOOP
>         IF res.bin IS NULL OR res.bin !=3D row.bin THEN
>           IF res.bin IS NOT NULL THEN
>             RETURN NEXT res;
>           END IF;
>           res.bin :=3D row.bin;
>           res.first :=3D row.dt;
>         END IF;
>         res.last :=3D row.dt;
>       END LOOP;
>       IF res.bin IS NOT NULL THEN
>         RETURN NEXT res;
>       END IF;
> END;
> $$ LANGUAGE plpgsql;
>
>
> Finally, I'll try to sort out David's solution. Once we correct his typo
> (t1.order -> t1.date) and add an 'ORDER BY first' to the end, we get:
>        first    |    last    | bin
> ------------+------------+------
>      2009-01-03 | 2009-01-05 | blue
>      2009-01-06 | 2009-01-06 | red
>      2009-01-07 | 2009-01-08 | blue
>      2009-01-09 |            | red
>
> This includes correct output, but it fails on both edge cases. The
> non-appearance of the first row is due to the WHERE clause on the main
> SELECT statement:
> WHERE (SELECT bin
>            FROM foo t2
>            WHERE t2.dt < t1.dt
>            ORDER BY dt DESC LIMIT 1) <> t1.bin
>
> If we drop this WHERE clause, we get:
>        first    |    last    | bin
> ------------+------------+------
>      2009-01-01 | 2009-01-02 | red
>      2009-01-02 | 2009-01-02 | red
>      2009-01-03 | 2009-01-05 | blue
>      2009-01-04 | 2009-01-05 | blue
>      2009-01-05 | 2009-01-05 | blue
>      2009-01-06 | 2009-01-06 | red
>      2009-01-07 | 2009-01-08 | blue
>      2009-01-08 | 2009-01-08 | blue
>      2009-01-09 |            | red
>      2009-01-10 |            | red
>
> We can therefore get the result including the first row by selecting from
> this table with 'GROUP BY last, bin'.
>
> And we can hack in a value for those final NULLs as a special case. The
> following statement works:
>
> SELECT MIN(first),
>            CASE WHEN last IS NULL THEN (SELECT MAX(dt) FROM foo) ELSE last
> END,
>            bin
>     FROM (
>       SELECT dt AS first,
>              (SELECT dt
>               FROM foo t3
>               WHERE t3.dt < (
>                   SELECT dt
>                   FROM foo t5
>                   WHERE t5.dt > t1.dt
>                     AND t5.bin <> t1.bin
>                   ORDER BY dt ASC
>                   LIMIT 1)
>               ORDER BY dt DESC
>               LIMIT 1) AS last,
>               bin
>       FROM foo t1
> ) t0
> GROUP BY last, bin
> ORDER BY last;
>
>
> Finally, what's efficient? With 1,000,000 random rows, we get:
> Enumeration: 13s
> PL/SQL: 12s
> Modified David: minutes.
>
> [I used the following to create test data:
> CREATE OR REPLACE FUNCTION make_random(n int) RETURNS SETOF foo AS $$
> import random
> for i in xrange(n):
>       m =3D random.randint(1,12)
>       d =3D random.randint(1,28)
>       b =3D random.choice(('red', 'blue'))
>       yield '2009-%d-%d' % (m, d), b
> $$ LANGUAGE plpythonu;
> DELETE FROM foo; INSERT INTO foo (SELECT * FROM make_random(1000000));]
>
> I hope that helps you in considering various approaches to the problem.
>
> - Joel
>
>
> [1] http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html
> [2] http://developer.postgresql.org/pgdocs/postgres/functions-window.html
>
>
> On Tue, 16 Jun 2009 06:06:16 +1000, David Wilson
> <david.t.wilson@gmail.com> wrote:
>
>


Re: Select ranges based on sequential breaks

От
Scott Marlowe
Дата:
On Mon, Jun 22, 2009 at 12:41 PM, Mike Toews<mwtoews@sfu.ca> wrote:
> Hi Joel,
> An excellent book that I recently stumbled on is /Joe Celko's SQL for
> Smarties/ (recommended by someone on this list), which is a heavy read, but
> has an amazing depth to ANSI SQL. Of particular interest is "Chapter 24:
> Regions, Runs, Gaps, Sequences, and Series". I'm slowly digesting this.

I need to buy a fourth or fifth copy (they keep growing feet / I keep
forgetting who I loaned them to) of that book.

Re: Select ranges based on sequential breaks

От
Mike Toews
Дата:
Scott Marlowe wrote:
> On Mon, Jun 15, 2009 at 12:23 PM, Mike Toews<mwtoews@sfu.ca> wrote:
>
>> This is easy to compute using a spreadsheet or in R, but how would I do this
>> with SQL? I'm using 8.3. Advice is appreciated.
>>
>
> FYI (and I'm no expert in this area) R is available as a pl for
> postgres, look for pl/R or plR
>
FYI, here is how I implement ranges on sequential breaks in R. Sorry, I
haven't meddled with plR yet, although I'm experience with both R and
postgres. This is all R code:

# Randomly sampled bins: "red", "blue"
dat <- data.frame(date=seq(as.Date("2009-01-01"), by="days", length.out=20))
dat$bin <- factor(sample(c("red","blue"), 10, replace=TRUE,
prob=c(0.4,0.6)))

# Determine where the rows are different; 1=different rows, 0=same rows
dat$breaks <- ifelse(dat$bin != c(TRUE,
as.character(dat$bin[-nrow(dat)])), 1, 0)

# Determine where the continuous parts are:
dat$part <- factor(cumsum(dat$breaks))

# Results vary due to random sampling
print(dat)


... and on the SQL side, simple aggregates like min(), max(etc) can be
used with "GROUP BY part" to determine the start/end dates, length of
duration, etc.

-Mike



Re: Select ranges based on sequential breaks

От
"Joel Nothman"
Дата:
On Tue, 23 Jun 2009 04:41:44 +1000, Mike Toews <mwtoews@sfu.ca> wrote:
> Window functions appear to be the best solution for this style of
> problem, and I'm looking forward to their applications. However, I'm
> sticking with 8.3 for at least a year, so I'm not able to explore this
> solution yet. For now, I can only post-process the output in a non-SQL
> environment. I also need to do other fun stuff, like cumulative sums,
> which is also challenging with SQL, but much easier and intuitive with R.

As a largely procedural programmer, the PL/SQL solution is quite appealing
to me, and would be similarly simple to calculate cumulative sums. The
integration of SELECT statements within PL/SQL also seems much tighter
than with other PL languages. Unfortunately, one can't send a cursor or a
set of results directly as a PL argument.

I'm having a skim through Celko's chapter 24, but it doesn't seem to be
close to my needs either.

On Tue, 23 Jun 2009 08:05:14 +1000, Mike Toews <mwtoews@sfu.ca> wrote:
...
> # Determine where the rows are different; 1=different rows, 0=same rows
> dat$breaks <- ifelse(dat$bin != c(TRUE,
> as.character(dat$bin[-nrow(dat)])), 1, 0)
>
> # Determine where the continuous parts are:
> dat$part <- factor(cumsum(dat$breaks))

Yes, as far as I can tell, this is almost identical to my WINDOW-based
solution in finding when there is a change, marking it with 0 or 1 and the
using cumulative sum to number the partitions. This could be similarly
done in PL/SQL but it seemed more sensible to just do the whole thing
rather than using GROUP BY after enumeration.

- Joel