Обсуждение: Suggestion for aggregate function
I have an idea for an aggregate function (actually a pair) that would be very useful. It's something I've wanted very frequently with Oracle and other databases and while it's possible to implement in SQL it's hard to do efficiently. Whereas it would be really easy for the database to do it efficiently. lookup_min(column1,column2) lookup_max(column1,column2) would return the value of column2 (or one of the values in the case of duplicates) where column1 is the minimum/maximum value. Ie, it would have an accumulator that stores two values, the minimum/maximum value found so far, and the value of column2 for that record. So it would be possible to say for example: select min(column1),lookup_min(column1,column2) from tab to do the equivalent of: select column1,column2 where column1=(select min(column1) from tab) limit 1 except it would be way more efficient. (Especially if there's an index on column1 and postgres were taught to use indexes for min/max, but that's a different story.) I'm not sure on the names, perhaps someone has a better idea? I would be interested in doing this myself, it sounds like a fairly straightforward thing to implement and would be a useful first project. However I'm really a bit bewildered by the number of steps aggregate functions seem to have to go through to store accumulator data. It seems like they're going to a lot of effort to store the accumulator data in a database internal data-type. Is there something I can read to catch up on what these data structures are for and how to use them? -- greg
On Fri, Jan 17, 2003 at 13:39:11 -0500, Greg Stark <gsstark@mit.edu> wrote: > > So it would be possible to say for example: > > select min(column1),lookup_min(column1,column2) from tab > > to do the equivalent of: > > select column1,column2 where column1=(select min(column1) from tab) limit 1 > > except it would be way more efficient. (Especially if there's an index on > column1 and postgres were taught to use indexes for min/max, but that's a > different story.) The following will be more efficient than your function if there is a usable index on column1: select column1,column2 from tab order by column 1 limit 1
Bruno Wolff III <bruno@wolff.to> writes: > On Fri, Jan 17, 2003 at 13:39:11 -0500, > Greg Stark <gsstark@mit.edu> wrote: > > > > So it would be possible to say for example: > > > > select min(column1),lookup_min(column1,column2) from tab > > > > to do the equivalent of: > > > > select column1,column2 where column1=(select min(column1) from tab) limit 1 As several people have pointed out this example isn't sufficiently complex to make rule out various other reasonably efficient SQL implementations. If you're unconvinced that this function would be handy consider a more complex query: SELECT item.*, store.*, x.lowest_price FROM item, store, ( SELECT item_id, min(price) AS lowest_price, lookup_min(price,store_id) AS lowest_price_store FROM items_for_sale WHERE item_category = ? GROUP BY item_id) AS x WHERE item.item_id = x.item_id AND store.store_id = x.store_id There's really no reason for the database to have to do more than one scan of items_for_sale with one nested_loops lookup of item and store. Ideally if there's an index on items_for_sale on item_id, price it should be able to use it too, but that's unlikely. Currently to write this I think you would have to join against items_for_sale twice, once to group by item_id and get the least price, then again to lookup the store. SELECT item_id, min(store_id) FROM items_for_sale, ( SELECT min(price) AS lowest_price FROM items_for_sale WHERE item_category = ? GROUP BY item_id ) AS xWHERE items_for_sale.item_id = x.item_id AND items_for_sale.price = x.lowest_price -- greg
On 17 Jan 2003 15:12:58 -0500, Greg Stark <gsstark@mit.edu> wrote: >SELECT item.*, store.*, x.lowest_price > FROM item, store, ( > SELECT item_id, > min(price) AS lowest_price, > lookup_min(price,store_id) AS lowest_price_store > FROM items_for_sale > WHERE item_category = ? > GROUP BY item_id) AS x > WHERE item.item_id = x.item_id > AND store.store_id = x.store_id > >There's really no reason for the database to have to do more than one scan of >items_for_sale with one nested_loops lookup of item and store. Greg, we already have this feature, just the syntax is a bit different :-) SELECT item.*, store.*, x.lowest_price FROM item, store, ( SELECT DISTINCT ON (item_id) item_id, price ASlowest_price, store_id AS lowest_price_store FROM items_for_sale WHERE item_category = ? ORDER BY item_id, price) AS x WHERE item.item_id = x.item_id AND store.store_id = x.lowest_price_store; > Ideally if >there's an index on items_for_sale on item_id, price it should be able to use >it too, but that's unlikely. ServusManfred
Greg Stark <gsstark@mit.edu> writes: >> select min(column1),lookup_min(column1,column2) from tab One small problem is that we only support single-argument aggregates. As of 7.3 this is no longer wired into the system catalog layout, but it's still wired into various internal datastructures. Anyone interested in trying to fix it? regards, tom lane
Manfred Koizar <mkoi-pg@aon.at> writes: > Greg, we already have this feature, just the syntax is a bit different :-) > > SELECT DISTINCT ON (item_id) item_id, > price AS lowest_price, > store_id AS lowest_price_store > FROM items_for_sale > WHERE item_category = ? > ORDER BY item_id, price Neat! I hadn't seen this. I would have liked to have had that feature on Oracle! (Please don't tell me I did, I went through such pains to work around not having it.) Would this query be efficient if there's an index on item_id, price ? That is, would it know to do an index scan and be able to skip to the next item_id in the index as soon as a price was found? -- greg
On 17 Jan 2003 19:08:06 -0500, Greg Stark <gsstark@mit.edu> wrote: >Would this query be efficient if there's an index on item_id, price ? That is, >would it know to do an index scan Yes, at least to avoid the sort step. > and be able to skip to the next item_id in >the index as soon as a price was found? I don't think so. Look at how the index scan retrieves all rows: => EXPLAIN ANALYZE -> SELECT DISTINCT ON (item) item, price, store FROM sale ORDER BY item, price; NOTICE: QUERY PLAN: Unique (cost=0.00..412.24 rows=1024 width=12) (actual time=0.93..549.95 rows=101 loops=1) -> Index Scan using s_x1on sale (cost=0.00..386.64 rows=10240 width=12) (actual time=0.90..399.52 rows=10240loops=1) Total runtime: 551.55 msec EXPLAIN => DROP INDEX s_x1; DROP => EXPLAIN ANALYZE -> SELECT DISTINCT ON (item) item, price, store FROM sale ORDER BY item, price; NOTICE: QUERY PLAN: Unique (cost=845.48..871.08 rows=1024 width=12) (actual time=941.83..1152.25 rows=101 loops=1) -> Sort (cost=845.48..845.48rows=10240 width=12) (actual time=941.71..1061.93 rows=10240 loops=1) -> Seq Scan onsale (cost=0.00..163.40 rows=10240 width=12) (actual time=0.37..273.41 rows=10240 loops=1) Total runtime: 1304.63 msec ServusManfred
Greg Stark <gsstark@MIT.EDU> writes: > Manfred Koizar <mkoi-pg@aon.at> writes: > > > Greg, we already have this feature, just the syntax is a bit different :-) > > > > SELECT DISTINCT ON (item_id) item_id, > > price AS lowest_price, > > store_id AS lowest_price_store > > FROM items_for_sale > > WHERE item_category = ? > > ORDER BY item_id, price > > Neat! I hadn't seen this. Ok, so I still think DISTINCT ON is the neatest thing since sliced bread. But it strikes me as a bit of an odd syntax. It's very similar to GROUP BY except where all the fields are implicitly aggregated using a peculiar aggregate function that grabs the first value according to the order by expression. I'm using this already for lots of queries, it's very handy. But I'm finding it awkward in one situation -- when I also want other aggregate values other than the first value according to the sort. Consider the above query if I also wanted to know the maximum and average prices per item. Along with the store that had the maximum and minimum prices and the total number of stores that stock the item. With DISTINCT ON I would have to do two queries to get the maximum and minimum along with the relevant stores, and then do a third query with GROUP BY to get the average and total number of stores. What would be useful is something like SELECT item_id, first(price) as min_price, first(store_id) as min_store, avg(price) as avg_price, last(price)as max_price, last(store_id) as min_store, count(distinct store_id) as num_storesFROM (SELECT * FROM items_for_saleORDER BY item_id, store_id) GROUP BY store_id This gives the benefits of DISTINCT ON but makes it easier to combine with GROUP BY. -- greg
Greg Stark <gsstark@mit.edu> writes: > What would be useful is something like > SELECT item_id, > first(price) as min_price, first(store_id) as min_store, > avg(price) as avg_price, > last(price) as max_price, last(store_id) as min_store, > count(distinct store_id) as num_stores > FROM (SELECT * FROM items_for_sale ORDER BY item_id, store_id) > GROUP BY store_id Write it yourself --- both first() and last() are trivial to code as user-defined aggregates. regards, tom lane