Обсуждение: Count of records in a row

Поиск
Список
Период
Сортировка

Count of records in a row

От
Robert James
Дата:
I have a table of event_id, event_time.  Many times, several events
happen in a row.  I'd like a query which replaces all of those events
with a single record, showing the count.

Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
D,1; A,2; D,2; B,1; C,2

How can I do that?


Re: Count of records in a row

От
David Johnston
Дата:
Robert James wrote
> I have a table of event_id, event_time.  Many times, several events
> happen in a row.  I'd like a query which replaces all of those events
> with a single record, showing the count.
>
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
>
> How can I do that?

<Theory Only>

Window functions are going to be your friend.

To solve the grouping problem I would assign the first row's value a group
value of zero (0).  Using the "lag(...)" window function and an
appropriately defined frame you conditionally add one (1) to the prior row's
group value if the value of lag(1) does not equal the current row's value.
The result should be a new column where all sequential duplicates share the
same group number.

Distinct will give you a lookup relation for which letter belongs to which
group
Group By + Count on the group will give you counts

Use string_agg(...) to condense the above into single row/column

HTH

David J.




--
View this message in context: http://postgresql.1045698.n5.nabble.com/Count-of-records-in-a-row-tp5775363p5775365.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Count of records in a row

От
Rémi Cura
Дата:
Hey,
I tried something very similar to compute generalized union of numeric range (which was innapropriate, anyway).

My conclusion were that it's not possible using windows function as you need either a memory (windows function are not allowed in update) or iterations to propagate information (windows functions cannot be nested).
There may be a theoretical possibility of success using windows function and recursive CTE.
(see end of this mail for a taste to this kind of solution)

But it is immensely easier and sometimes mandatory to use instead 
a plpgsql function using cursor (or cursors).

It would be something like that in plpgsql : 

cursor on table of letter ordered
accum = 0;
loop on rows of table ordered
if letter = previous letter, new_id = accum
else accum ++ ; new_id = accum

old letter = new_letter
new letter = next letter;
end of loop,

Cheers,
Rémi-C

Piste for solving it with windows function and recursive CTE :

--defining test env :
drop table if exists test_grouping;
create table test_grouping
(id serial
,letter text
--,previous_letter text
,for_computation int
--,previous_for_computation INT
);
INSERT INTO test_grouping (letter) VALUEs 
('A'), ('A'),('A'),('A'),('B'),('C'),('A'),('D'),('A'),('A'),('D'),('D'),('B'),('C'),('C' );
UPDATE test_grouping set for_computation=id;

SELECT *
FROM test_grouping;

--this query gives the result, but it needs to be iterated using a recursive CTE (not done here):
--you can do it manually by executing it several times
WITH computation AS (
SELECT id
, letter 
, for_computation,
lag( letter, 1,NULL) over w,
 CASE 
WHEN  lag( letter, 1,NULL) over w = letter 
THEN 
lag( for_computation, 1,NULL) over w 
--NULL
ELSE
id
END AS new_id,
(SELECT count(*) over ())
FROM test_grouping AS tg
WINDOW w AS (ORDER BY id ASC ROWS 1 preceding)
ORDER BY tg.id ASC
)
UPDATE test_grouping AS tg SET for_computation = new_id FROM computation AS c WHERE tg.id=c.id
RETURNING tg.*



2013/10/22 David Johnston <polobo@yahoo.com>
Robert James wrote
> I have a table of event_id, event_time.  Many times, several events
> happen in a row.  I'd like a query which replaces all of those events
> with a single record, showing the count.
>
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
>
> How can I do that?

<Theory Only>

Window functions are going to be your friend.

To solve the grouping problem I would assign the first row's value a group
value of zero (0).  Using the "lag(...)" window function and an
appropriately defined frame you conditionally add one (1) to the prior row's
group value if the value of lag(1) does not equal the current row's value.
The result should be a new column where all sequential duplicates share the
same group number.

Distinct will give you a lookup relation for which letter belongs to which
group
Group By + Count on the group will give you counts

Use string_agg(...) to condense the above into single row/column

HTH

David J.




--
View this message in context: http://postgresql.1045698.n5.nabble.com/Count-of-records-in-a-row-tp5775363p5775365.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Count of records in a row

От
Robert James
Дата:
On 10/22/13, Rémi Cura <remi.cura@gmail.com> wrote:
> But it is immensely easier and sometimes mandatory to use instead
> a plpgsql function using cursor (or cursors).
>
> It would be something like that in plpgsql :
>
> cursor on table of letter ordered
> accum = 0;
> loop on rows of table ordered
>
> if letter = previous letter, new_id = accum
> else accum ++ ; new_id = accum
>
> old letter = new_letter
> new letter = next letter;
>
> end of loop,

Shouldn't it be possible to do that with a FOR loop without a cursor?

It might be that procedural is the way to go.  But I still believe
that relational algebra can handle this, even without a window
function.  Something like:

SELECT event e, COUNT(
    SELECT event oe ... WHERE oe.event_time > e.event_time AND NOT EXISTS (
             SELECT event te WHERE te.event_time > e.event_time AND
te.event_time < oe.event_time))

.


Re: Count of records in a row

От
Rémi Cura
Дата:
Hey,
when using a for you implicitly use a cursor (I think), 
so this is the same, use FOR if you like it more.
It should be *very* fast to write ! 

As I wrote, relational algebra can handle it, but it is not practically feasible :

If you just execute 3 times the query I wrote, you will have your answer.
It is 3 times because the biggest sequence is A A A A.
That's the problem, your number of execution depends on the max size of sequence.

The problems boils down to this : the answer for one row depends on the answer of the previous row, the row before , etc.

You could succeed with ordering by id in a windows function, and in this window function order by new_id and putting null to the end, but such nested windows functions calls are not allowed.

Nevertheless if you find something purely relational please keep me posted !

Cheers,

Rémi-C



2013/10/22 Robert James <srobertjames@gmail.com>
On 10/22/13, Rémi Cura <remi.cura@gmail.com> wrote:
> But it is immensely easier and sometimes mandatory to use instead
> a plpgsql function using cursor (or cursors).
>
> It would be something like that in plpgsql :
>
> cursor on table of letter ordered
> accum = 0;
> loop on rows of table ordered
>
> if letter = previous letter, new_id = accum
> else accum ++ ; new_id = accum
>
> old letter = new_letter
> new letter = next letter;
>
> end of loop,

Shouldn't it be possible to do that with a FOR loop without a cursor?

It might be that procedural is the way to go.  But I still believe
that relational algebra can handle this, even without a window
function.  Something like:

SELECT event e, COUNT(
    SELECT event oe ... WHERE oe.event_time > e.event_time AND NOT EXISTS (
             SELECT event te WHERE te.event_time > e.event_time AND
te.event_time < oe.event_time))

.

Re: Count of records in a row

От
Merlin Moncure
Дата:
On Tue, Oct 22, 2013 at 8:16 AM, Rémi Cura <remi.cura@gmail.com> wrote:
> Hey,
> when using a for you implicitly use a cursor (I think),
> so this is the same, use FOR if you like it more.
> It should be *very* fast to write !
>
> As I wrote, relational algebra can handle it, but it is not practically
> feasible :
>
> If you just execute 3 times the query I wrote, you will have your answer.
> It is 3 times because the biggest sequence is A A A A.
> That's the problem, your number of execution depends on the max size of
> sequence.
>
> The problems boils down to this : the answer for one row depends on the
> answer of the previous row, the row before , etc.
>
> You could succeed with ordering by id in a windows function, and in this
> window function order by new_id and putting null to the end, but such nested
> windows functions calls are not allowed.
>
> Nevertheless if you find something purely relational please keep me posted !





CREATE TYPE count_same_t AS
(
  item TEXT,
  count_item INT
);

CREATE OR REPLACE FUNCTION count_same_internal(state count_same_t,
item TEXT) RETURNS count_same_t AS
$$
BEGIN
  state.count_item := CASE WHEN item = state.item THEN
state.count_item + 1  ELSE 1 END;
  state.item := item;
  RETURN state;
END;
$$ LANGUAGE PLPGSQL;

CREATE FUNCTION count_same_count(state count_same_t) RETURNS INT AS
$$
BEGIN
  RETURN state.count_item;
END;
$$ LANGUAGE PLPGSQL;

CREATE AGGREGATE count_same(TEXT)
(
  SFUNC=count_same_internal,
  STYPE=count_same_t,
  FINALFUNC=count_same_count,
  INITCOND='(,)'
);

WITH testdata as (select s, chr((floor(random() * 3))::int + 65) as v
from generate_series(1,50) s)
SELECT s, v, count_same(v) OVER(order by s) from testdata;

 s  | v | count_same
----+---+------------
           1 | A |          1
  2 | B |          1
  3 | A |          1
  4 | A |          2
  5 | C |          1
  6 | A |          1
  7 | C |          1
  8 | C |          2
  9 | C |          3
 10 | C |          4

/snip


merlin :-D


Re: Count of records in a row

От
hubert depesz lubaczewski
Дата:
On pon, paź 21, 2013 at 08:38:52 -0400, Robert James wrote:
> I have a table of event_id, event_time.  Many times, several events
> happen in a row.  I'd like a query which replaces all of those events
> with a single record, showing the count.
>
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
> How can I do that?

"A" or other letters don't really match your schema description.
Please provide sample schema (as in: create table ...), sample data, and
expected output.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/


Re: Count of records in a row

От
Rémi Cura
Дата:
héhé, 
nice snipping Merlin !

I guess you are almost there, output is still wrong  (should be) (
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
)

I don't understand enough to make the modifications =)

Cheers,
Rémi-C


2013/10/22 hubert depesz lubaczewski <depesz@depesz.com>
On pon, paź 21, 2013 at 08:38:52 -0400, Robert James wrote:
> I have a table of event_id, event_time.  Many times, several events
> happen in a row.  I'd like a query which replaces all of those events
> with a single record, showing the count.
>
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
> How can I do that?

"A" or other letters don't really match your schema description.
Please provide sample schema (as in: create table ...), sample data, and
expected output.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Count of records in a row

От
Merlin Moncure
Дата:
On Tue, Oct 22, 2013 at 8:41 AM, Rémi Cura <remi.cura@gmail.com> wrote:
> héhé,
> nice snipping Merlin !
>
> I guess you are almost there, output is still wrong  (should be) (
>> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
>> D,1; A,2; D,2; B,1; C,2
> )
>
> I don't understand enough to make the modifications =)

oh right -- whoops.

CREATE TYPE count_same_t AS
(
  item TEXT,
  item_group INT
);

CREATE OR REPLACE FUNCTION count_same_internal(state count_same_t,
item TEXT) RETURNS count_same_t AS
$$
BEGIN
  state.item_group := CASE WHEN item = state.item THEN
state.item_group  ELSE state.item_group + 1 END;
  state.item := item;
  RETURN state;
END;
$$ LANGUAGE PLPGSQL;

CREATE OR REPLACE FUNCTION count_same_count(state count_same_t) RETURNS INT AS
$$
BEGIN
  RETURN state.item_group;
END;
$$ LANGUAGE PLPGSQL;

CREATE AGGREGATE count_same(TEXT)
(
  SFUNC=count_same_internal,
  STYPE=count_same_t,
  FINALFUNC=count_same_count,
  INITCOND='(,0)'
);

WITH testdata as (select s, chr((floor(random() * 3))::int + 65) as v
from generate_series(1,50) s)
select v, count(*)  from (SELECT s, v, count_same(v) OVER(order by s)
grp from testdata) q
GROUP BY v, grp order by grp;

 v | count
---+-------
 A |     1
 B |     1
 A |     1
 B |     2
 C |     1
 A |     1
 C |     2
 A |     1
 C |     2
 A |     3
 B |     3
 A |     1
 B |     1

/snip

aside: learn the technique.  custom aggregates may seem awkward and
weird at first, but they can be used to solve all sorts of wonderful
problems.

merlin


Re: Count of records in a row

От
Rémi Cura
Дата:

Thanks for this good example Merlin !

I didn't know you could use variable inside custom aggregates, and this allow to solve the problem! 

In my own problem I couldn't use aggregates because
_as it output at most one row, it would have mean a lots of useless computation (as in this example I guess, (please correct me if it's not the case) : we do N computations of aggregate , each "using" at most N rows)
_I couldn't cheat with arrays because of cost of serialization/deserialization

I'll keep in mind this custom aggregate use and try to learn more about it.
Cheers,
Rémi-C

Re: Count of records in a row

От
Merlin Moncure
Дата:
On Tue, Oct 22, 2013 at 9:09 AM, Rémi Cura <remi.cura@gmail.com> wrote:
>
> Thanks for this good example Merlin !
>
> I didn't know you could use variable inside custom aggregates, and this
> allow to solve the problem!
>
> In my own problem I couldn't use aggregates because
> _as it output at most one row, it would have mean a lots of useless
> computation (as in this example I guess, (please correct me if it's not the
> case) :

That is not the case.  With the approach above what you 'pay' vs
standard loop is basically one pl/pgsql function call per output row.
(you can do it in straight sql, but when with pl/pgsql to leverage
cached function parsing).  What you 'get' is a more general function
because the loop structure is in the query itself as well as the
output structure.  This cleanly separates action from the data.
Usually, the mechanics of executing the aggregate are not a huge part
of query execution time.  Actually, the worst case is when the
aggregate is trivial but no matter what it's O(n).

I'm not clear what on the issue is with your particular case, since
you didn't post it :-).   Maybe post some extra detail?

merlin


Re: Count of records in a row

От
Rémi Cura
Дата:
Thanks again for the precision !

I still don't understand perfectly. We call the aggregate n times, and each time we compute the aggregate, using (potentially) n rows, thus becoming (at most) O(n*n).

With a standard loop, I loop n times, and each times I only need the current row plus the previous row which I put in memory, thus O(n).



I posted a lot about my issue, but only about the fraction of the problem I was blocked by, and I get no conclusive answer.

My problem was to find a good way to have a plpgsql function taking set of rows as input and returning a set of rows.
I worked on range, and I wanted a polymorphic function (working with any range).

Aggregates only returns one row at most, array are dangerous with big data, temp table have to be created/deleted and have to be used in the same session, cursors arn't well supported by accessing library, view can't be written, mat view weren't available.

Anyway I solved it using cursors, not optimal but works ! 

Cheers,

Rémi-C


2013/10/22 Merlin Moncure <mmoncure@gmail.com>
On Tue, Oct 22, 2013 at 9:09 AM, Rémi Cura <remi.cura@gmail.com> wrote:
>
> Thanks for this good example Merlin !
>
> I didn't know you could use variable inside custom aggregates, and this
> allow to solve the problem!
>
> In my own problem I couldn't use aggregates because
> _as it output at most one row, it would have mean a lots of useless
> computation (as in this example I guess, (please correct me if it's not the
> case) :

That is not the case.  With the approach above what you 'pay' vs
standard loop is basically one pl/pgsql function call per output row.
(you can do it in straight sql, but when with pl/pgsql to leverage
cached function parsing).  What you 'get' is a more general function
because the loop structure is in the query itself as well as the
output structure.  This cleanly separates action from the data.
Usually, the mechanics of executing the aggregate are not a huge part
of query execution time.  Actually, the worst case is when the
aggregate is trivial but no matter what it's O(n).

I'm not clear what on the issue is with your particular case, since
you didn't post it :-).   Maybe post some extra detail?

merlin

Re: Count of records in a row

От
Elliot
Дата:
On 2013-10-21 20:38, Robert James wrote:
> I have a table of event_id, event_time.  Many times, several events
> happen in a row.  I'd like a query which replaces all of those events
> with a single record, showing the count.
>
> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
> D,1; A,2; D,2; B,1; C,2
>
> How can I do that?
>
>

It looks like you already found a solution, but here's one with a CTE. I
cobbled this together from an older query I had for doing something
similar, for which I unfortunately lost the original source of this
approach. Also, this implies that there is something that gives an
ordering to these rows (in this case, the field "i").

create temp table data (i int, val char);

insert into data (val, i)
values
('A',1),
('A',2),
('A',3),
('B',4),
('C',5),
('A',6),
('D',7),
('A',8),
('A',9),
('D',10),
('D',11),
('B',12),
('C',13),
('C',14)
;

with x
as
(
   select i,
          row_number() over () as xxx,
          val,
          row_number() over (partition by val order by i asc)
            - row_number() over () as d
   from data
   order by i
)
select val,
        count(*)
from x
group by d,
          val
order by min(i)
;



Re: Count of records in a row

От
Merlin Moncure
Дата:
On Tue, Oct 22, 2013 at 10:01 AM, Elliot <yields.falsehood@gmail.com> wrote:
> On 2013-10-21 20:38, Robert James wrote:
>>
>> I have a table of event_id, event_time.  Many times, several events
>> happen in a row.  I'd like a query which replaces all of those events
>> with a single record, showing the count.
>>
>> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
>> D,1; A,2; D,2; B,1; C,2
>>
>> How can I do that?
>>
>>
>
> It looks like you already found a solution, but here's one with a CTE. I
> cobbled this together from an older query I had for doing something similar,
> for which I unfortunately lost the original source of this approach. Also,
> this implies that there is something that gives an ordering to these rows
> (in this case, the field "i").
>
> create temp table data (i int, val char);
>
> insert into data (val, i)
> values
> ('A',1),
> ('A',2),
> ('A',3),
> ('B',4),
> ('C',5),
> ('A',6),
> ('D',7),
> ('A',8),
> ('A',9),
> ('D',10),
> ('D',11),
> ('B',12),
> ('C',13),
> ('C',14)
> ;
>
> with x
> as
> (
>   select i,
>          row_number() over () as xxx,
>          val,
>          row_number() over (partition by val order by i asc)
>            - row_number() over () as d
>   from data
>   order by i
> )
> select val,
>        count(*)
> from x
> group by d,
>          val
> order by min(i)

wow, that's really clever.

merlin


Re: Count of records in a row

От
Rémi Cura
Дата:
Hmm exactly what I was thinking !

Thank you a lot, I spend many hours thinking about this and this solution is very nice.

Cheers,
Rémi-C


2013/10/22 Merlin Moncure <mmoncure@gmail.com>
On Tue, Oct 22, 2013 at 10:01 AM, Elliot <yields.falsehood@gmail.com> wrote:
> On 2013-10-21 20:38, Robert James wrote:
>>
>> I have a table of event_id, event_time.  Many times, several events
>> happen in a row.  I'd like a query which replaces all of those events
>> with a single record, showing the count.
>>
>> Eg: Take A,A,A,B,C,A,D,A,A,D,D,B,C,C and return: A,3; B,1; C,1; A,1;
>> D,1; A,2; D,2; B,1; C,2
>>
>> How can I do that?
>>
>>
>
> It looks like you already found a solution, but here's one with a CTE. I
> cobbled this together from an older query I had for doing something similar,
> for which I unfortunately lost the original source of this approach. Also,
> this implies that there is something that gives an ordering to these rows
> (in this case, the field "i").
>
> create temp table data (i int, val char);
>
> insert into data (val, i)
> values
> ('A',1),
> ('A',2),
> ('A',3),
> ('B',4),
> ('C',5),
> ('A',6),
> ('D',7),
> ('A',8),
> ('A',9),
> ('D',10),
> ('D',11),
> ('B',12),
> ('C',13),
> ('C',14)
> ;
>
> with x
> as
> (
>   select i,
>          row_number() over () as xxx,
>          val,
>          row_number() over (partition by val order by i asc)
>            - row_number() over () as d
>   from data
>   order by i
> )
> select val,
>        count(*)
> from x
> group by d,
>          val
> order by min(i)

wow, that's really clever.

merlin


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Count of records in a row

От
Merlin Moncure
Дата:
On Tue, Oct 22, 2013 at 9:43 AM, Rémi Cura <remi.cura@gmail.com> wrote:
> Thanks again for the precision !
>
> I still don't understand perfectly. We call the aggregate n times, and each
> time we compute the aggregate, using (potentially) n rows, thus becoming (at
> most) O(n*n).
>
> With a standard loop, I loop n times, and each times I only need the current
> row plus the previous row which I put in memory, thus O(n).

For posterity, the above is incorrect.  Since the aggregate is ordered
through the window function, it gets executed exactly once per output
row.  It behaves exactly like a loop.  You know this because there is
no array in the aggregate state.

merlin


Re: Count of records in a row

От
Rémi Cura
Дата:
OK,
just out of pure curiosity,
is it always the case or is it due to this particular aggregate?

Cheers,
Rémi-C


2013/10/22 Merlin Moncure <mmoncure@gmail.com>
On Tue, Oct 22, 2013 at 9:43 AM, Rémi Cura <remi.cura@gmail.com> wrote:
> Thanks again for the precision !
>
> I still don't understand perfectly. We call the aggregate n times, and each
> time we compute the aggregate, using (potentially) n rows, thus becoming (at
> most) O(n*n).
>
> With a standard loop, I loop n times, and each times I only need the current
> row plus the previous row which I put in memory, thus O(n).

For posterity, the above is incorrect.  Since the aggregate is ordered
through the window function, it gets executed exactly once per output
row.  It behaves exactly like a loop.  You know this because there is
no array in the aggregate state.

merlin

Re: Count of records in a row

От
Robert James
Дата:
Wow, this is an excellent discussion - and I must admit, a bit beyond
my abilities.  Is there a consensus as to the best approach to adopt?
Is Elliot's the best?

On 10/22/13, Rémi Cura <remi.cura@gmail.com> wrote:
> OK,
> just out of pure curiosity,
> is it always the case or is it due to this particular aggregate?
>
> Cheers,
> Rémi-C
>
>
> 2013/10/22 Merlin Moncure <mmoncure@gmail.com>
>
>> On Tue, Oct 22, 2013 at 9:43 AM, Rémi Cura <remi.cura@gmail.com> wrote:
>> > Thanks again for the precision !
>> >
>> > I still don't understand perfectly. We call the aggregate n times, and
>> each
>> > time we compute the aggregate, using (potentially) n rows, thus
>> > becoming
>> (at
>> > most) O(n*n).
>> >
>> > With a standard loop, I loop n times, and each times I only need the
>> current
>> > row plus the previous row which I put in memory, thus O(n).
>>
>> For posterity, the above is incorrect.  Since the aggregate is ordered
>> through the window function, it gets executed exactly once per output
>> row.  It behaves exactly like a loop.  You know this because there is
>> no array in the aggregate state.
>>
>> merlin
>>
>


Re: Count of records in a row

От
Merlin Moncure
Дата:
> 2013/10/22 Merlin Moncure <mmoncure@gmail.com>
>> > With a standard loop, I loop n times, and each times I only need the
>> > current
>> > row plus the previous row which I put in memory, thus O(n).
>>
>> For posterity, the above is incorrect.  Since the aggregate is ordered
>> through the window function, it gets executed exactly once per output
>> row.  It behaves exactly like a loop.  You know this because there is
>> no array in the aggregate state.
>>
> just out of pure curiosity,
> is it always the case or is it due to this particular aggregate?

It is always the case.  Generally speaking, aggregates, especially
user defined aggregates, are run once per input row.   In this case
the main utility of window functions is to order the aggregate
execution calls and (especially) allow intermediate output per input
row, instead of per aggregate grouping.

On Tue, Oct 22, 2013 at 6:01 PM, Robert James <srobertjames@gmail.com> wrote:
> Wow, this is an excellent discussion - and I must admit, a bit beyond
> my abilities.  Is there a consensus as to the best approach to adopt?
> Is Elliot's the best?

For this *specific* problem, I would give Elliot's (extremely clever)
query the nod on the basis that it does not require any supporting
infrastructure, which is always nice.  That being said, once you start
getting the mojo of user defined aggregates + window functions it
starts to become clear that it's a cleaner way of doing many types of
things that are normally handled by loops.

merlin


Re: Count of records in a row

От
Rémi Cura
Дата:
Ok thanks for this precision Merlin.
Seems like aggregates are way more powerful than I thought.

Obviously I need a lot more reading about custom aggregates before fully understanding it.

Elliot's query is pure SQL so obviously very cool ! 

It could be improved at the margin, and aggregates/function are certainly faster on big data. 
But if you have no specific needs I would say Elliot is easier and more universal.


Cheers & thanks all for this good discussion.

Rémi-C


2013/10/23 Merlin Moncure <mmoncure@gmail.com>
> 2013/10/22 Merlin Moncure <mmoncure@gmail.com>
>> > With a standard loop, I loop n times, and each times I only need the
>> > current
>> > row plus the previous row which I put in memory, thus O(n).
>>
>> For posterity, the above is incorrect.  Since the aggregate is ordered
>> through the window function, it gets executed exactly once per output
>> row.  It behaves exactly like a loop.  You know this because there is
>> no array in the aggregate state.
>>
> just out of pure curiosity,
> is it always the case or is it due to this particular aggregate?

It is always the case.  Generally speaking, aggregates, especially
user defined aggregates, are run once per input row.   In this case
the main utility of window functions is to order the aggregate
execution calls and (especially) allow intermediate output per input
row, instead of per aggregate grouping.

On Tue, Oct 22, 2013 at 6:01 PM, Robert James <srobertjames@gmail.com> wrote:
> Wow, this is an excellent discussion - and I must admit, a bit beyond
> my abilities.  Is there a consensus as to the best approach to adopt?
> Is Elliot's the best?

For this *specific* problem, I would give Elliot's (extremely clever)
query the nod on the basis that it does not require any supporting
infrastructure, which is always nice.  That being said, once you start
getting the mojo of user defined aggregates + window functions it
starts to become clear that it's a cleaner way of doing many types of
things that are normally handled by loops.

merlin

Re: Count of records in a row

От
Robert James
Дата:
On 10/22/13, Elliot <yields.falsehood@gmail.com> wrote:
> It looks like you already found a solution, but here's one with a CTE. I
> cobbled this together from an older query I had for doing something
> similar, for which I unfortunately lost the original source of this
> approach. Also, this implies that there is something that gives an
> ordering to these rows (in this case, the field "i").
>
> create temp table data (i int, val char);
>
> insert into data (val, i)
> values
> ('A',1),
> ('A',2),
> ('A',3),
> ('B',4),
> ('C',5),
>
> with x
> as
> (
>    select i,
>           row_number() over () as xxx,
>           val,
>           row_number() over (partition by val order by i asc)
>             - row_number() over () as d
>    from data
>    order by i
> )
> select val,
>         count(*)
> from x
> group by d,
>           val
> order by min(i)
> ;

Elliot - Thanks for this great solution; I've tested in on my data and
it gives great results.

I'd like to understand your code.  I believe I understand most of it.
Can you explain what 'd' is?

And this clause "row_number() over (partition by val order by i asc) -
row_number() over () as d"?

(Hey, while I'm at it, is there a descriptive name for "x" too?)

Thanks


Re: Count of records in a row

От
Elliot
Дата:
On 2013-10-24 17:09, Robert James wrote:
> On 10/22/13, Elliot <yields.falsehood@gmail.com> wrote:
>> It looks like you already found a solution, but here's one with a CTE. I
>> cobbled this together from an older query I had for doing something
>> similar, for which I unfortunately lost the original source of this
>> approach. Also, this implies that there is something that gives an
>> ordering to these rows (in this case, the field "i").
>>
>> create temp table data (i int, val char);
>>
>> insert into data (val, i)
>> values
>> ('A',1),
>> ('A',2),
>> ('A',3),
>> ('B',4),
>> ('C',5),
>>
>> with x
>> as
>> (
>>     select i,
>>            row_number() over () as xxx,
>>            val,
>>            row_number() over (partition by val order by i asc)
>>              - row_number() over () as d
>>     from data
>>     order by i
>> )
>> select val,
>>          count(*)
>> from x
>> group by d,
>>            val
>> order by min(i)
>> ;
> Elliot - Thanks for this great solution; I've tested in on my data and
> it gives great results.
>
> I'd like to understand your code.  I believe I understand most of it.
> Can you explain what 'd' is?
>
> And this clause "row_number() over (partition by val order by i asc) -
> row_number() over () as d"?
>
> (Hey, while I'm at it, is there a descriptive name for "x" too?)
>
> Thanks
Glad I could help. It's easier to understand if you break apart the CTE.
I'm also moving around the order by i to clean this up a little. Sorry
for the formatting.

Running this:
    select i,
           val,
           row_number() over (partition by val order by i asc) as class_i,
           row_number() over (order by i asc) as overall_i,
           row_number() over (partition by val order by i asc)
             - row_number() over () as d
    from data

Yields this:
i    val    class_i    overall_i    d
1    A    1    1    0
2    A    2    2    0
3    A    3    3    0
4    B    1    4    -3
5    C    1    5    -4
6    A    4    6    -2
7    D    1    7    -6
8    A    5    8    -3
9    A    6    9    -3
10    D    2    10    -8
11    D    3    11    -8
12    B    2    12    -10
13    C    2    13    -11
14    C    3    14    -11

class_i counts the row number within a class and overall_i counts the
overall row number in the sequence. Here's just one class extracted to
emphasize that:

i    val    class_i    overall_i    d
1    A    1    1    0
2    A    2    2    0
3    A    3    3    0
6    A    4    6    -2
8    A    5    8    -3
9    A    6    9    -3

Within a given consecutive run of a particular class the difference
between class_i and overall_i will always be the same (because they're
both increasing by the same amount) but that difference between runs
will always be different (because each run starts the sequences at
different offsets). "d" is the difference of the two. Because that value
segments the runs, all that needs to be done is group by it and count
the items in the group to get the length of the runs.

The xxx column was junk left over from copying and pasting and
verifying. Apologies :). This is a cleaned up version:

with x
as
(
   select i,
          val,
          row_number() over (partition by val order by i asc)
            - row_number() over (order by i asc) as d
   from data
)
select val,
        count(*)
from x
group by d,
          val
order by min(i)
;



Re: Count of records in a row

От
Robert James
Дата:
Ingenious!

I actually think, however, there was a subtle bug in, though I see you fixed it.

The line:
              - row_number() over () as d
needs to be:
              - row_number() over (order by i asc) as d

I discovered this when working your code into my application.  I got
very, very weird results - with one order of columns in the select, I
got the correct answer, but with another one I didn't.  After much
debugging, I realized that the original version ("- row_number over
()") wasn't defined! So, depending on how I wrote the select
statement, Postgres could pick different orders!

But I see your cleaned up version already fixed this!

On 10/25/13, Elliot <yields.falsehood@gmail.com> wrote:
> Glad I could help. It's easier to understand if you break apart the CTE.
> I'm also moving around the order by i to clean this up a little. Sorry
> for the formatting.
>
> Running this:
>     select i,
>            val,
>            row_number() over (partition by val order by i asc) as class_i,
>            row_number() over (order by i asc) as overall_i,
>            row_number() over (partition by val order by i asc)
>              - row_number() over () as d
>     from data
>
> Yields this:
> i    val    class_i    overall_i    d
> 1    A    1    1    0
> 2    A    2    2    0
> 3    A    3    3    0
> 4    B    1    4    -3
> 5    C    1    5    -4
> 6    A    4    6    -2
> 7    D    1    7    -6
> 8    A    5    8    -3
> 9    A    6    9    -3
> 10    D    2    10    -8
> 11    D    3    11    -8
> 12    B    2    12    -10
> 13    C    2    13    -11
> 14    C    3    14    -11
>
> class_i counts the row number within a class and overall_i counts the
> overall row number in the sequence. Here's just one class extracted to
> emphasize that:
>
> i    val    class_i    overall_i    d
> 1    A    1    1    0
> 2    A    2    2    0
> 3    A    3    3    0
> 6    A    4    6    -2
> 8    A    5    8    -3
> 9    A    6    9    -3
>
> Within a given consecutive run of a particular class the difference
> between class_i and overall_i will always be the same (because they're
> both increasing by the same amount) but that difference between runs
> will always be different (because each run starts the sequences at
> different offsets). "d" is the difference of the two. Because that value
> segments the runs, all that needs to be done is group by it and count
> the items in the group to get the length of the runs.
>
> The xxx column was junk left over from copying and pasting and
> verifying. Apologies :). This is a cleaned up version:
>
> with x
> as
> (
>    select i,
>           val,
>           row_number() over (partition by val order by i asc)
>             - row_number() over (order by i asc) as d
>    from data
> )
> select val,
>         count(*)
> from x
> group by d,
>           val
> order by min(i)
> ;
>
>