Обсуждение: Crosstab Confusion
I'm really trying to understand how the tablefunc crosstab function works, to no avail. I have a table that looks like this: customer_id integer date timestamp with time zone amount numeric(10,4) There are rows in this table every-time a customer gets charged an amount, which is multiple times per day. I would like to get a result like this: customer_id,day1,day2,day3,(…) 1,400.00,500.01,123.00,(…) So, one row for each customer id and a column for every day in the current month. Anyone used crosstab for something likethis? Thanks for your help, A. -- www.sherman.ca / +1-613-797-6819 / +1-646-233-3400 "When the burning husks of your startups warm the last of your bones, remember I told you so." - Zed
I'm flying blind here since I have nothing that looks like the structure you described to experiment on, but based on somecrosstabs I have set up this should get you started. Explanatory notes follow.<br /><br />SELECT pivot.*<br /> FROM crosstab('<br/> --row header, column header, cell value<br /> SELECT customer_id, extract(day FROM date), sum(amount)<br/> WHERE extract(month FROM date) = 1 --desired month, 1=Jan, 2=Feb, etc.<br /> FROM your_table<br /> GROUP BY 1, 2<br /> ORDER BY 1, 2<br /> ','<br /> --list of column headers<br /> SELECT 1, 2, 3, […] 31<br/> ') <br /> --list of column names for final result set<br /> --does not have to match column names from crosstab()section!<br /> pivot(customer_id integer, day1 numeric(10,4), day2 numeric(10,4), day3 numeric(10,4), […] day31numeric(10,4))<br />;<br /><br />You basically have three parts:<br /><br />1) SELECT query in the form (row header,column header, cell value). In this case it is an aggregate query so that you can sum the transactions over a givenday.<br /><br />2) List of column headers. If you want, this can SELECT from another table, so you can have a tablewith rows 1, 2, 3, etc and use it to select the days from the month instead of listing them manually.<br /><br />3)List of output columns, which follows "pivot" in the text above. Note that "pivot" is an arbitrary name. You can usefoo, bar, or whatever, but that will be the name of the table which must be used to reference the columns in the top SELECTlist.<br /><br />Note that my somewhat bizarre indenting / end of line structure is designed so that, when using pgAdmin,I can very quickly select and execute just the value query or column header query, which lets me experiment as Ibuild the crosstab.<br /><br />Note, finally, that the list of output columns (following "pivot") should be one more thanthe number of columns in section 2 (technically, you can circumvent this limitation, but it's complicated and seems unnecessaryfor your use case). That could be problematic when running this query in the middle of the month and the numberof value rows per customer doesn't match the number expected in the list of column headers and list of output columns.I don't know of any way to make the list of output columns vary dynamically, but perhaps someone else here does?Otherwise, you would have to either (a) make 31 different crosstab queries and choose which one to run based on theday of the month, or (b) come up with some way to "pad" the value table created in part 1 so that it lists customer_id,day, 0 for all days which have no customer transactions. (Perhaps create a dummy customer_id that has no transactionsfor all days in the month, which should be enough to trick the crosstab function into thinking it has somethingto fill the last columns.<br /><br />--Lee<br /><div class="gmail_quote"><br /></div>-- <br />Lee Hachadoorian<br/>PhD Student, Geography<br />Program in Earth & Environmental Sciences<br />CUNY Graduate Center<br />
On 2010-02-01, at 11:34 , Lee Hachadoorian wrote: > You basically have three parts: > > 1) SELECT query in the form (row header, column header, cell value). In this case it is an aggregate query so that youcan sum the transactions over a given day. > > 2) List of column headers. If you want, this can SELECT from another table, so you can have a table with rows 1, 2, 3,etc and use it to select the days from the month instead of listing them manually. > > 3) List of output columns, which follows "pivot" in the text above. Note that "pivot" is an arbitrary name. You can usefoo, bar, or whatever, but that will be the name of the table which must be used to reference the columns in the top SELECTlist. Wow that's an incredibly complete response! I'm not getting any data in my rows though. This query produces the data: SELECT cust_id as customer, date_trunc('day', date) AS day, SUM(billed_duration) AS minutes FROM master_cdr WHERE date >= '2010-01-01' GROUP BY 1,2 ORDER BY 1,2; Which looks like: customer | day | minutes ----------+---------------------+--------- 1 | 2010-01-01 00:00:00 | 1110 1 | 2010-01-03 00:00:00 | 60 1 | 2010-01-26 00:00:00 | 23010 1 | 2010-01-27 00:00:00 | 17910 2 | 2010-01-01 00:00:00 | 60 2 | 2010-01-02 00:00:00 | 30 2 | 2010-01-04 00:00:00 | 26310 etc, etc, etc But this query: -- clients by day SELECT pivot.* FROM crosstab( 'SELECT cust_id as customer, date_trunc(''day'', date) AS day, SUM(billed_duration) ASminutes FROM master_cdr WHERE date >= ''2010-01-01'' GROUP BY 1,2 ORDER BY 1,2', 'select * from day_of_month' ) pivot ( customer integer, day1 numeric(10,4), day2 numeric(10,4), (…) day31 numeric(10,4) ) ORDER BY customer; Gives me a table that looks right but all values are null for the days. Something simple maybe? Thanks, A. -- www.sherman.ca / +1-613-797-6819 / +1-646-233-3400 "When the burning husks of your startups warm the last of your bones, remember I told you so." - Zed
The output column data type (day1, day2, etc.) is supposed to match the value data type. I used numeric(10,4) because that's what your original post specified, but the billed_duration column in your most recent post looks like it might be integer? (Or is it defined as numeric(10,4), but you never enter noninteger values?)
What's the output of the category query by itself? I forgot to include ORDER BY 1 at the end of the category query. (The order should match the order of output columns, but I think without it you wouldn't get NULL values, just the values would be in the wrong columns.) I assume day_of_month has only one column, but I would suggest naming it explicitly instead of using *. And is the day_of_month column defined in the same format as date_trunc('day', date)? They must successfully pass an "equals" test to get included in the right crosstab cell. If a category value in the source query doesn't match any value produced by the category query, I think the crosstab function just throws out that row, which could lead to a table with the correct structure but all NULLs.
I didn't think about this before, but you can also pull the category headers from your source table like this:
'SELECT DISTINCT date_trunc(''day'', date) AS day WHERE date >= ''2010-01-01'' ORDER BY 1'
Let me know if this gets you anywhere.
--Lee
--
Lee Hachadoorian
PhD Student, Geography
Program in Earth & Environmental Sciences
CUNY Graduate Center
What's the output of the category query by itself? I forgot to include ORDER BY 1 at the end of the category query. (The order should match the order of output columns, but I think without it you wouldn't get NULL values, just the values would be in the wrong columns.) I assume day_of_month has only one column, but I would suggest naming it explicitly instead of using *. And is the day_of_month column defined in the same format as date_trunc('day', date)? They must successfully pass an "equals" test to get included in the right crosstab cell. If a category value in the source query doesn't match any value produced by the category query, I think the crosstab function just throws out that row, which could lead to a table with the correct structure but all NULLs.
I didn't think about this before, but you can also pull the category headers from your source table like this:
'SELECT DISTINCT date_trunc(''day'', date) AS day WHERE date >= ''2010-01-01'' ORDER BY 1'
Let me know if this gets you anywhere.
--Lee
On Mon, Feb 1, 2010 at 11:52 AM, Adam Sherman <adam@sherman.ca> wrote:
On 2010-02-01, at 11:34 , Lee Hachadoorian wrote:Wow that's an incredibly complete response!
> You basically have three parts:
>
> 1) SELECT query in the form (row header, column header, cell value). In this case it is an aggregate query so that you can sum the transactions over a given day.
>
> 2) List of column headers. If you want, this can SELECT from another table, so you can have a table with rows 1, 2, 3, etc and use it to select the days from the month instead of listing them manually.
>
> 3) List of output columns, which follows "pivot" in the text above. Note that "pivot" is an arbitrary name. You can use foo, bar, or whatever, but that will be the name of the table which must be used to reference the columns in the top SELECT list.
I'm not getting any data in my rows though. This query produces the data:
SELECT
cust_id as customer,
date_trunc('day', date) AS day,
SUM(billed_duration) AS minutes
FROM master_cdr
WHERE date >= '2010-01-01'
GROUP BY 1,2
ORDER BY 1,2;
Which looks like:
customer | day | minutes
----------+---------------------+---------
1 | 2010-01-01 00:00:00 | 1110
1 | 2010-01-03 00:00:00 | 60
1 | 2010-01-26 00:00:00 | 23010
1 | 2010-01-27 00:00:00 | 17910
2 | 2010-01-01 00:00:00 | 60
2 | 2010-01-02 00:00:00 | 30
2 | 2010-01-04 00:00:00 | 26310
etc, etc, etc
But this query:
-- clients by day
SELECT pivot.* FROM crosstab(
'SELECT
cust_id as customer,
date_trunc(''day'', date) AS day,
SUM(billed_duration) AS minutes
FROM master_cdr
WHERE date >= ''2010-01-01''GROUP BY 1,2'select * from day_of_month'
ORDER BY 1,2',
) pivot (
customer integer,day1 numeric(10,4),(…)
day2 numeric(10,4),
day31 numeric(10,4)
)
ORDER BY customer;
Gives me a table that looks right but all values are null for the days.
Something simple maybe?
Thanks,
A.
--
www.sherman.ca / +1-613-797-6819 / +1-646-233-3400
"When the burning husks of your startups warm the last of your bones, remember I told you so." - Zed
--
Lee Hachadoorian
PhD Student, Geography
Program in Earth & Environmental Sciences
CUNY Graduate Center
On 2010-02-01, at 14:22 , Lee Hachadoorian wrote: > The output column data type (day1, day2, etc.) is supposed to match the value data type. I used numeric(10,4) because that'swhat your original post specified, but the billed_duration column in your most recent post looks like it might be integer?(Or is it defined as numeric(10,4), but you never enter noninteger values?) Actually, the query I was running is: SELECT cust_id as customer, date_trunc(''day'', date) AS day, SUM(billed_duration)/60.0::numeric(10,4) AS minutes billed_duration is an integer. Make sense? > What's the output of the category query by itself? I forgot to include ORDER BY 1 at the end of the category query. (Theorder should match the order of output columns, but I think without it you wouldn't get NULL values, just the valueswould be in the wrong columns.) I assume day_of_month has only one column, but I would suggest naming it explicitlyinstead of using *. And is the day_of_month column defined in the same format as date_trunc('day', date)? Theymust successfully pass an "equals" test to get included in the right crosstab cell. If a category value in the sourcequery doesn't match any value produced by the category query, I think the crosstab function just throws out that row,which could lead to a table with the correct structure but all NULLs. Right, my list of columns weren't equal to the truncated date. Using your suggested query to generate the columns fixed theproblem! Now, is there a way to generate the labels? Otherwise I have to adjust the query for th number of days returned. Such nice output though! Awesome! Thanks, A. -- www.sherman.ca / +1-613-797-6819 / +1-646-233-3400 "When the burning husks of your startups warm the last of your bones, remember I told you so." - Zed
On Mon, Feb 1, 2010 at 3:11 PM, Adam Sherman <adam@sherman.ca> wrote:
Not that I know of, but I have confirmed that as long as the category SQL matches the output column list, you can have output columns with no data in them. In order to have your list of column headers match the source SQL, I would recommend going back to the extract() function I first recommended to extract the day of month as an integer, and then generate a 31 number series for your category headers. It would look like this:Actually, the query I was running is:SUM(billed_duration)/60.0::numeric(10,4) AS minutes
SELECT
cust_id as customer,
date_trunc(''day'', date) AS day,
billed_duration is an integer. Make sense?
If billed_duration is an integer, sum(billed_duration) will be int or bigint. I would just define the output columns as bigint (day1 bigint, day2 bigint, etc.). Although, formatting it in the source SQL (I see you are dividing by 60 and casting to numeric) saves you from having to format 31 output columns.
Right, my list of columns weren't equal to the truncated date. Using your suggested query to generate the columns fixed the problem!
Now, is there a way to generate the labels? Otherwise I have to adjust the query for th number of days returned.
SELECT pivot.* FROM crosstab(
'SELECT
cust_id as customer,
extract(day from date)::integer AS day,
SUM(billed_duration) AS minutes
FROM master_cdr
WHERE extract(month FROM date) = 1
GROUP BY 1,2
ORDER BY 1,2',
'select day from generate_series(1,31) day'ORDER BY 1,2',
) pivot (
customer integer,
day1 bigint,
day2 bigint,
(…)day2 bigint,
day31 bigint
)
ORDER BY customer;
For half-over months or months with fewer than 31 days, the final columns of the crosstab should just be blank.
PS: The way I have constructed it, I would avoid using WHERE date >= ''2010-01-01''. If data from February gets into the table, it will aggregate data from, e.g. Jan 2 and Feb 2 as both being part of "Day 2".
--Lee
--
Lee Hachadoorian
PhD Student, Geography
Program in Earth & Environmental Sciences
CUNY Graduate Center