Обсуждение: SUM and MAX over different periods - months over several years?
I have a tricky piece of SQL that's bothering me. I have a table called illness - it's oncology so these are all tumours. Let's say we have cervical and lung and some months we might have more lung tumours than cervical and other months it's vice versa! We record them as follows (full DDL and DML at end of post) nature_of_illness VARCHAR(25), created_at TIMESTAMP (UTC) - these dates and times are essentially random - tests can be returned by international labs at any time of the day or night. Now, what I want is to find which illness was reported most in a given month and in a given year. The records returned should look like nature_of_illness year month count (count has to be the MAX out of all illnesses for a given month) lung 2017 January 53 - i.e. in January 2017, the most reported illness was lung tumours of which there were 53 cervix 2017 February 45 .. &c... I have an added constraint - this must work using "old" SQL - i.e. with MySQL 5.6 as well as PostgreSQL 10 - no Windows/Analytics functions &c. I have got this far: -- SELECT DISTINCT(i_data), il_mc) -- commented out pieces show other failed attempts! -- FROM -- ( SELECT c_year, -- c_month, -- just provides a number - how do I convert an ::int (say 1) to 'January' CASE WHEN c_month = 1 THEN 'January' WHEN c_month = 2 THEN 'February' -- then March... &c... END AS the_month, -- SELECT to_char(to_timestamp (c_month::text, 'MM'), 'TMmon') -- tried variants of this, want better than CASE MAX(month_count) AS il_mc FROM ( SELECT nature_of_illness as illness, EXTRACT(YEAR FROM created_at) AS c_year, EXTRACT(MONTH FROM created_at) AS c_month, COUNT(EXTRACT(MONTH FROM created_at)) AS month_count FROM illness GROUP BY illness, EXTRACT(YEAR FROM created_at), EXTRACT(MONTH FROM created_at) ORDER BY EXTRACT(MONTH FROM created_at) ) t1 GROUP BY c_year, c_month ORDER BY c_year, c_month which gives (sample data DML below) c_year the_month il_mc 2017 January 1 2017 February 3 2018 January 5 2018 February 3 These figures are correct with the sample data. What I'm finding difficult is to include the most common illness name text for a given month! And what do I do in the case of ties (as for January 2017 in sample data)? I imagine that my ties scenario will require Analytic functions? DENSE_RANK or similar? I'd be grateful for a non-Analytic solution and an Analytic one. Any discussion, references, URLs or other helpful data most appreciated - I really want to grasp what's going on here! I think I need to JOIN on my derived table(s) but can't figure it! If anything extra is required, please let me know! Rgs, Pól DDL and DML ------------------------- CREATE table illness (nature_of_illness VARCHAR(25), created_at TIMESTAMP); INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40'); INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40'); INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40'); INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40'); INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40'); INSERT INTO illness VALUES ('Lung', '2018-01-03 17:50:32'); INSERT INTO illness VALUES ('Lung', '2018-02-03 17:50:32'); INSERT INTO illness VALUES ('Lung', '2018-02-03 17:50:32'); INSERT INTO illness VALUES ('Lung', '2018-02-03 17:50:32'); INSERT INTO illness VALUES ('Cervix', '2018-02-03 17:50:32'); INSERT INTO illness VALUES ('Cervix', '2017-01-03 15:45:40'); -- one of each for Jan 2017 INSERT INTO illness VALUES ('Lung', '2017-01-03 17:50:32'); -- one of each for Jan 2017 - dealing with ties? INSERT INTO illness VALUES ('Lung', '2017-02-03 17:50:32'); INSERT INTO illness VALUES ('Lung', '2017-02-03 17:50:32'); INSERT INTO illness VALUES ('Lung', '2017-02-03 17:50:32'); INSERT INTO illness VALUES ('Cervix', '2017-02-03 17:50:32');
Now, what I want is to find which illness was reported most in a given
month and in a given year.
Basically, you want to computed an ordered listing of all illnesses counts grouped by year-month, then return the first one of them (in the case of ties are you expecting to return two records or one)?
SELECT DISTINCT ON (year, month) year, month, count_of_illness, illness
FROM (
SELECT year, month, count(*) AS count_of_illness, illness
FROM ...
GROUP BY year, month
ORDER BY 1, 2, 3 DESC
) grp
This will return exactly one record, "the first" for each year/month combination in your data. First is determined by the sort in the subquery.
If you need to return multiple records in the case of ties you either, more of less, self-join on (year, month, count) or use something like dense_rank() OVER (partition by year, month order by count_of_illness desc) to assign a rank of 1 to all highest count items and then add a "where dense_rank = 1" filter to the query.
David J.
Hi, and thanks for taking the time to reply I used this (helped by your original query): SELECT t3.c_year AS "Year", t3.c_month AS "Month", t3.il_mc AS "Tumour count", t4.ill_nat AS "Type" FROM ( SELECT c_year, c_month, il_mc FROM ( SELECT c_year, c_month, MAX(month_count) AS il_mc FROM ( SELECT nature_of_illness as illness, EXTRACT(YEAR FROM created_at) AS c_year, EXTRACT(MONTH FROM created_at) AS c_month, COUNT(EXTRACT(MONTH FROM created_at)) AS month_count FROM illness GROUP BY illness, c_year, c_month ORDER BY c_year, c_month ) AS t1 GROUP BY c_year, c_month ) AS t2 ) AS t3 JOIN ( SELECT EXTRACT(YEAR FROM created_at) AS t_year, EXTRACT(MONTH FROM created_at) AS t_month, nature_of_illness AS ill_nat, COUNT(nature_of_illness) AS ill_cnt FROM illness GROUP BY t_year, t_month, nature_of_illness ORDER BY t_year, t_month, nature_of_illness ) AS t4 ON t3.c_year = t4.t_year AND t3.c_month = t4.t_month AND t3.il_mc = t4.ill_cnt and got this as a result: SELECT t3.c_year AS "Year", t3.c_month AS "Month", t3.il_mc AS "Tumour count", t4.ill_nat AS "Type" FROM ( SELECT c_year, c_month, il_mc FROM ( SELECT c_year, c_month, MAX(month_count) AS il_mc FROM ( SELECT nature_of_illness as illness, EXTRACT(YEAR FROM created_at) AS c_year, EXTRACT(MONTH FROM created_at) AS c_month, COUNT(EXTRACT(MONTH FROM created_at)) AS month_count FROM illness GROUP BY illness, c_year, c_month ORDER BY c_year, c_month ) AS t1 GROUP BY c_year, c_month ) AS t2 ) AS t3 JOIN ( SELECT EXTRACT(YEAR FROM created_at) AS t_year, EXTRACT(MONTH FROM created_at) AS t_month, nature_of_illness AS ill_nat, COUNT(nature_of_illness) AS ill_cnt FROM illness GROUP BY t_year, t_month, nature_of_illness ORDER BY t_year, t_month, nature_of_illness ) AS t4 ON t3.c_year = t4.t_year AND t3.c_month = t4.t_month AND t3.il_mc = t4.ill_cnt > This will return exactly one record, "the first" for each year/month > combination in your data. First is determined by the sort in the subquery. Not what I required - there's no point in having ties randomly returning. > If you need to return multiple records in the case of ties you either, more > of less, self-join on (year, month, count) or use something like > dense_rank() OVER (partition by year, month order by count_of_illness desc) > to assign a rank of 1 to all highest count items and then add a "where > dense_rank = 1" filter to the query. Can't use DENSE_RANK() - MySQL 5.6 doesn't support it :-( Thanks again for your help! Rgs, Pól... > David J.