Обсуждение: Query to find contiguous ranges on a column

Поиск
Список
Период
Сортировка

Query to find contiguous ranges on a column

От
Peter Hunsberger
Дата:
Given a column of data resembling the following:

col
2
3
4
5
11
12
13
14
15
16
17
18
19
23
32
33
34
35
36
37

I need a query to find the contiguous ranges within this column, in
this case returning the result set:

start, end
2, 5
11, 19
23, 23
32, 37

I have one solution that joins the table against itself and does
(among other things) a subselect looking "not exists col +1" and "not
exists col -1" on the two instances of the table to find the start and
end.  This is, as you might guess, is not very efficient (my actual
data is some 6 million+ rows) and I'm guessing there has to be
something more efficient with windowing or possibly grouping on min
and max (though I can't see how to make sure they are part of a
contiguous set).  Anyone have any ideas?

--
Peter Hunsberger

Re: Query to find contiguous ranges on a column

От
Tim Landscheidt
Дата:
Peter Hunsberger <peter.hunsberger@gmail.com> wrote:

> [...]
> I have one solution that joins the table against itself and does
> (among other things) a subselect looking "not exists col +1" and "not
> exists col -1" on the two instances of the table to find the start and
> end.  This is, as you might guess, is not very efficient (my actual
> data is some 6 million+ rows) and I'm guessing there has to be
> something more efficient with windowing or possibly grouping on min
> and max (though I can't see how to make sure they are part of a
> contiguous set).  Anyone have any ideas?

You can either use a PL/pgSQL function ("SETOF TEXT" just
for the convenience of the example):

| CREATE FUNCTION SummarizeRanges () RETURNS SETOF TEXT AS $$
| DECLARE
|   CurrentFirst INT;
|   CurrentLast INT;
|   CurrentRecord RECORD;
| BEGIN
|   FOR CurrentRecord IN SELECT col FROM t ORDER BY col LOOP
|     IF CurrentFirst IS NULL THEN
|       CurrentFirst := CurrentRecord.col;
|       CurrentLast  := CurrentRecord.col;
|     ELSIF CurrentRecord.col = CurrentLast + 1 THEN
|       CurrentLast := CurrentRecord.col;
|     ELSE
|       RETURN NEXT CurrentFirst || ', ' || CurrentLast;
|       CurrentFirst := CurrentRecord.col;
|       CurrentLast := CurrentRecord.col;
|     END IF;
|   END LOOP;
|   IF CurrentFirst IS NOT NULL THEN
|     RETURN NEXT CurrentFirst || ', ' || CurrentLast;
|   END IF;
|   RETURN;
| END;
| $$ LANGUAGE plpgsql;

or a recursive query (which I always find very hard to com-
prehend):

| WITH RECURSIVE RecCols (LeftBoundary, Value) AS
|   (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t)
|    UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1)
|   SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols
|     GROUP BY LeftBoundary
|     ORDER BY LeftBoundary;

Could you run both against your data set and find out which
one is faster for your six million rows?

Tim

Re: Query to find contiguous ranges on a column

От
Thomas Kellerer
Дата:
Peter Hunsberger wrote on 13.10.2009 23:23:
> I need a query to find the contiguous ranges within this column, in
> this case returning the result set:
>
> start, end
> 2, 5
> 11, 19
> 23, 23
> 32, 37
>
> I have one solution that joins the table against itself and does
> (among other things) a subselect looking "not exists col +1" and "not
> exists col -1" on the two instances of the table to find the start and
> end.  This is, as you might guess, is not very efficient (my actual
> data is some 6 million+ rows) and I'm guessing there has to be
> something more efficient with windowing or possibly grouping on min
> and max (though I can't see how to make sure they are part of a
> contiguous set).  Anyone have any ideas?

This is the best I can put together right now.

Not very nice, but currently I can't think of a better solution:

select *
from
(
  select soi as start_of_interval,
         case
           when soi is not null and eoi is null then lead(eoi) over()
           when soi is not null and eoi is not null then eoi
           else null
         end as end_of_interval
  from (
      select case
               when col - (lag(col,1,col + 1) over (order by col)) - 1 <> 0 then col
               else null
             end as soi,
             case
               when col - (lead(col,1,col + 1) over (order by col)) + 1 <> 0 then col
               else null
             end as eoi
      from numbers
  ) t1
  where t1.soi is not null
     or t1.eoi is not null
) t2
where t2.start_of_interval is not null
  and t2.end_of_interval is not null;

The outer-most select is needed to get rid of the "empty" rows. I couldn't find a way to push that into one of the
sub-queries.

The execution plan doesn't look too bad (probably better than your plan with a self join and a subselect), but it does
sortthe whole table which might be a problem.  

Regards
Thomas





Re: Query to find contiguous ranges on a column

От
Peter Hunsberger
Дата:
On Tue, Oct 13, 2009 at 5:12 PM, Tim Landscheidt <tim@tim-landscheidt.de> wrote:
> Peter Hunsberger <peter.hunsberger@gmail.com> wrote:
>
> You can either use a PL/pgSQL function ("SETOF TEXT" just
> for the convenience of the example):

That works well, takes about 20 seconds to do the 6M+ rows

>
> or a recursive query (which I always find very hard to com-
> prehend):
>
> | WITH RECURSIVE RecCols (LeftBoundary, Value) AS
> |   (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t)
> |    UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1)
> |   SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols
> |     GROUP BY LeftBoundary
> |     ORDER BY LeftBoundary;
>
> Could you run both against your data set and find out which
> one is faster for your six million rows?
>

Turns out the server is v 8.3, looks like I need to get them to
upgrade it so I get recursive and windowing :-(.  If this happens any
time soon I'll let you know the results.

Many thanks.


--
Peter Hunsberger

Re: Query to find contiguous ranges on a column

От
Tim Landscheidt
Дата:
Peter Hunsberger <peter.hunsberger@gmail.com> wrote:

> [...]
>> or a recursive query (which I always find very hard to com-
>> prehend):

>> | WITH RECURSIVE RecCols (LeftBoundary, Value) AS
>> |   (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t)
>> |    UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1)
>> |   SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols
>> |     GROUP BY LeftBoundary
>> |     ORDER BY LeftBoundary;

>> Could you run both against your data set and find out which
>> one is faster for your six million rows?

> Turns out the server is v 8.3, looks like I need to get them to
> upgrade it so I get recursive and windowing :-(.  If this happens any
> time soon I'll let you know the results.

> Many thanks.

After some tests with a data set of 7983 rows (and 1638 ran-
ges): Don't! :-) The recursive solution seems to be more
than double as slow as the iterative. I'll take it to -per-
formance.

Tim

Re: Query to find contiguous ranges on a column

От
Peter Hunsberger
Дата:
On Wed, Oct 14, 2009 at 4:50 PM, Tim Landscheidt <tim@tim-landscheidt.de> wrote:
> Peter Hunsberger <peter.hunsberger@gmail.com> wrote:
>
> After some tests with a data set of 7983 rows (and 1638 ran-
> ges): Don't! :-) The recursive solution seems to be more
> than double as slow as the iterative. I'll take it to -per-
> formance.
>

Interesting, I've never liked recursive on Oracle but performance is
usually reasonable... Thanks for the heads up...


--
Peter Hunsberger