Обсуждение: Avoiding double-counting in aggregates with more than one join?

Поиск
Список
Период
Сортировка

Avoiding double-counting in aggregates with more than one join?

От
Paul Jungwirth
Дата:
Hi All,

I've noticed in the past that doing aggregates while joining to more
than one table can sometimes give you unintended results. For example,
suppose I have three tables: products, sales, and resupplies. In sales I
track what I sell, and in resupplies I track my own purchases to
increase inventory. Both have a foreign key to products. Now I want to
run a report showing the total dollars sold for each product versus the
total dollars spent for each product. I could try this:

     SELECT  p.id,
             SUM(s.price * s.qty) AS total_sold,
             SUM(r.price * r.qty) AS total_spent
     FROM    products p
     LEFT OUTER JOIN sales s
     ON      s.product_id = p.id
     LEFT OUTER JOIN resupplies r
     ON      r.product_id = p.id
     GROUP BY p.id
     ;

That seems pretty safe, but actually I get bad answers,
for example if some product has this data:

     sales
     -----
     sold 1 @ $2/ea

     resupplies
     ----------
     bought 1 @ $1/eq
     bought 2 @ $1/ea

Then pre-grouping I have this:

     p.id | s.qty | s.price | r.qty | r.price
     -----+-------+---------+-------+--------
        1 |     1 |      $2 |     1 |      $1
        1 |     1 |      $2 |     2 |      $1

You can see the problem is that I'm going to double-count my sales.
What I really want is this:

     p.id | s.qty | s.price | r.qty | r.price
     -----+-------+---------+-------+--------
        1 |     1 |      $2 |     1 |      $1
        1 |       |         |     2 |      $1

In the past I've always handled these situations by aggregating each
table separately
and only then joining things together:

     WITH
     s AS (
       SELECT  product_id,
               SUM(price * qty) AS total_sold
       FROM    sales
       GROUP BY product_id) s
     ),
     r AS (
       SELECT  product_id,
               SUM(price * qty) AS total_spent
       FROM    resupplies
       GROUP BY product_id) r
     )
     SELECT  p.id,
             COALESCE(s.total_sold, 0),
             COALESCE(r.total_spent, 0)
     FROM    products p
     LEFT OUTER JOIN s
     ON      s.product_id = p.id
     LEFT OUTER JOIN r
     ON      r.product_id = p.id
     ;

Since I've guaranteed that each CTE includes at most one row per product,
this is safe from double-counting errors.

But is there a better way? My approach feels verbose
and harder to read. Also you have to type COALESCE a lot. :-)
Is there some different way of doing things I haven't thought of yet?

Also I wonder about the performance merging all these subqueries together.
Would the final merging be any faster if I had an ORDER BY in each CTE?

It seems like this pattern comes up a lot;
what have others done about it?

Thanks,
Paul


Re: Avoiding double-counting in aggregates with more than one join?

От
"David G. Johnston"
Дата:
On Fri, Nov 18, 2016 at 10:16 AM, Paul Jungwirth <pj@illuminatedcomputing.com> wrote:
But is there a better way?

​Nothing that would be more readable nor likely more performant.

When performing aggregation it is necessary to limit the scope of the query to only whatever it is you are calculating.  Since you wish to compute two things you need two separate parts ​plus a third to combine them.

​If performance is a concern you should move the aggregation queries directly to the main query instead of using the optimization fencing CTE.

SELECT
FROM products
LEFT JOIN (
SELECT sum()
)​ s USING (product_id)
LEFT JOIN (
SELECT sum()
) r USING (product_id)

​If the second "scope" doesn't need to be calculated but simply informs the one-and-only aggregate you should use SEMI JOIN (EXISTS) instead of a INNER/LEFT JOIN​.  But that isn't what you have here.

David J.