Discussion: [noob] How to optimize this double pivot query?


[noob] How to optimize this double pivot query?

From:
Robert Buck
Date:
I have two tables that contain key-value data that I want to combine in pivoted form into a single result set. They are related to two separate tables.

The tables are: test_results, test_variables, metric_def, metadata_key. The latter two tables are enum-like tables, basic descriptors of data stored in other tables. The former two tables are basically key-value tables (with ids as well); these k-v tables are related to the latter two tables via foreign keys.

The following SQL takes about 11 seconds to run on a high-end laptop. The largest table is about 54k records, pretty puny.

Can someone provide a hint as to why this is so slow? Again, I am a noob to SQL, so the SQL is probably poorly written.

Thanks in advance,

Bob

select

    t.id_name,
    max(t.begin_time) as begin_time,
    max(t.end_time) as end_time,

    max(case when (m.id_name = 'package-version') then v.value end) as package_version,
    max(case when (m.id_name = 'database-vendor') then v.value end) as database_vendor,
    max(case when (m.id_name = 'bean-name') then v.value end) as bean_name,
    max(case when (m.id_name = 'request-distribution') then v.value end) as request_distribution,
    max(case when (m.id_name = 'ycsb-workload') then v.value end) as ycsb_workload,
    max(case when (m.id_name = 'record-count') then v.value end) as record_count,
    max(case when (m.id_name = 'transaction-engine-count') then v.value end) as transaction_engine_count,
    max(case when (m.id_name = 'transaction-engine-maxmem') then v.value end) as transaction_engine_maxmem,
    max(case when (m.id_name = 'storage-manager-count') then v.value end) as storage_manager_count,
    max(case when (m.id_name = 'test-instance-count') then v.value end) as test_instance_count,
    max(case when (m.id_name = 'operation-count') then v.value end) as operation_count,
    max(case when (m.id_name = 'update-percent') then v.value end) as update_percent,
    max(case when (m.id_name = 'thread-count') then v.value end) as thread_count,

    max(case when (d.id_name = 'tps') then r.value end) as tps,
    max(case when (d.id_name = 'Memory') then r.value end) as memory,
    max(case when (d.id_name = 'DiskWritten') then r.value end) as disk_written,
    max(case when (d.id_name = 'PercentUserTime') then r.value end) as percent_user,
    max(case when (d.id_name = 'PercentCpuTime') then r.value end) as percent_cpu,
    max(case when (d.id_name = 'UserMilliseconds') then r.value end) as user_milliseconds,
    max(case when (d.id_name = 'YcsbUpdateLatencyMicrosecs') then r.value end) as update_latency,
    max(case when (d.id_name = 'YcsbReadLatencyMicrosecs') then r.value end) as read_latency,
    max(case when (d.id_name = 'Updates') then r.value end) as updates,
    max(case when (d.id_name = 'Deletes') then r.value end) as deletes,
    max(case when (d.id_name = 'Inserts') then r.value end) as inserts,
    max(case when (d.id_name = 'Commits') then r.value end) as commits,
    max(case when (d.id_name = 'Rollbacks') then r.value end) as rollbacks,
    max(case when (d.id_name = 'Objects') then r.value end) as objects,
    max(case when (d.id_name = 'ObjectsCreated') then r.value end) as objects_created,
    max(case when (d.id_name = 'FlowStalls') then r.value end) as flow_stalls,
    max(case when (d.id_name = 'NodeApplyPingTime') then r.value end) as node_apply_ping_time,
    max(case when (d.id_name = 'NodePingTime') then r.value end) as node_ping_time,
    max(case when (d.id_name = 'ClientCncts') then r.value end) as client_connections,
    max(case when (d.id_name = 'YcsbSuccessCount') then r.value end) as success_count,
    max(case when (d.id_name = 'YcsbWarnCount') then r.value end) as warn_count,
    max(case when (d.id_name = 'YcsbFailCount') then r.value end) as fail_count

from test as t

    left join test_results as r on r.test_id = t.id
    left join test_variables as v on v.test_id = t.id
    left join metric_def as d on d.id = r.metric_def_id
    left join metadata_key as m on m.id = v.metadata_key_id

group by t.id_name

;

"GroupAggregate  (cost=5.87..225516.43 rows=926 width=81)"
"  ->  Nested Loop Left Join  (cost=5.87..53781.24 rows=940964 width=81)"
"        ->  Nested Loop Left Join  (cost=1.65..1619.61 rows=17235 width=61)"
"              ->  Index Scan using test_uc on test t  (cost=0.00..90.06 rows=926 width=36)"
"              ->  Hash Right Join  (cost=1.65..3.11 rows=19 width=29)"
"                    Hash Cond: (m.id = v.metadata_key_id)"
"                    ->  Seq Scan on metadata_key m  (cost=0.00..1.24 rows=24 width=21)"
"                    ->  Hash  (cost=1.41..1.41 rows=19 width=16)"
"                          ->  Index Scan using test_variables_test_id_idx on test_variables v  (cost=0.00..1.41 rows=19 width=16)"
"                                Index Cond: (test_id = t.id)"
"        ->  Hash Right Join  (cost=4.22..6.69 rows=55 width=28)"
"              Hash Cond: (d.id = r.metric_def_id)"
"              ->  Seq Scan on metric_def d  (cost=0.00..1.71 rows=71 width=20)"
"              ->  Hash  (cost=3.53..3.53 rows=55 width=16)"
"                    ->  Index Scan using test_results_test_id_idx on test_results r  (cost=0.00..3.53 rows=55 width=16)"
"                          Index Cond: (test_id = t.id)"

Re: [noob] How to optimize this double pivot query?

From:
"David Johnston"
Date:

From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org] On Behalf Of Robert Buck
Sent: Monday, October 01, 2012 8:47 PM
To: pgsql-sql@postgresql.org
Subject: [SQL] [noob] How to optimize this double pivot query?

 

I have two tables that contain key-value data that I want to combine in pivoted form into a single result set. They are related to two separate tables.

The tables are: test_results, test_variables, metric_def, metadata_key. The latter two tables are enum-like tables, basic descriptors of data stored in other tables. The former two tables are basically key-value tables (with ids as well); these k-v tables are related to the latter two tables via foreign keys.

The following SQL takes about 11 seconds to run on a high-end laptop. The largest table is about 54k records, pretty puny.

Can someone provide a hint as to why this is so slow? Again, I am a noob to SQL, so the SQL is probably poorly written.

 

Your query, while maybe not great, isn’t the cause of your problem.  It is the table schema, specifically the “key-value” aspect, that is killing you.

 

You may want to try:

 

SELECT *

FROM (SELECT id FROM …) id_master

NATURAL LEFT JOIN (SELECT id, field_value AS … FROM … WHERE fieldtype = ‘’) f1

NATURAL LEFT JOIN (SELECT id, field_value AS … FROM … WHERE fieldtype = ‘’) f2

[repeat one left join for every field; though you will then need to decide if/how to deal with NULL – not that you are currently doing anything special anyway…]

 

Mainly the above avoids the use of “max()” and instead uses direct joins between the relevant tables.  I have no clue whether that will improve things but if you are going to lie in this bed you should at least try different positions.
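
For concreteness, here is roughly what that pattern could look like with the table and column names from the original post (a sketch only, showing two of the metadata keys; every remaining key and metric would get its own join of the same shape):

SELECT *
FROM (SELECT id, id_name, begin_time, end_time FROM test) id_master
NATURAL LEFT JOIN (SELECT v.test_id AS id, v.value AS package_version
                   FROM test_variables v
                   JOIN metadata_key m ON m.id = v.metadata_key_id
                   WHERE m.id_name = 'package-version') f1
NATURAL LEFT JOIN (SELECT v.test_id AS id, v.value AS database_vendor
                   FROM test_variables v
                   JOIN metadata_key m ON m.id = v.metadata_key_id
                   WHERE m.id_name = 'database-vendor') f2
-- ...and so on, one join per key/metric...
;

Each per-key subquery plays the role of one of the max(case ...) columns in the original query; the NATURAL joins match on the only shared column name, id.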

 

The better option is to educate yourself on better ways of constructing the tables so that you do not have to write this kind of god-awful query.  In some cases key-value has merit but usually only when done in moderation.  Not for the entire database.  You likely should simply have a table that looks like the result of the query below.

 

As a second (not necessarily mutually exclusive) alternative: install and use the hstore extension.

 

David J.

 



Re: [noob] How to optimize this double pivot query?

From:
Robert Buck
Date:
So as you can probably glean, the tables store performance metric data. The reason I chose to use k-v is simply to avoid having to create an additional column every time a new metric type comes along. So those were the two options I thought of: straight k-v and a column for every value type.

Are there other, better options worth considering that you could point me towards, ones that support storing metrics with an unbounded number of metric types, as in my case?

Bob



Re: [noob] How to optimize this double pivot query?

From:
David Johnston
Date:
Two issues...
1. I do not know how you acquire the data or who controls how/what is generated
2. I do not know the primary means of using said data

If you capture a new metric you generally have to change quite a few things to actually use it, so you might as well just add a column as well.  If you want to be able to at least capture unknown input and store it somewhere, that is where the hstore extension comes in.  In effect you should store unknown data there until you decide to update the schema to actually make use of it.  In situations where you want to use it without altering the schema, you normally simply list the unknowns, and as such should output a row-like structure.

The query you provide will need to be updated in the same way a physical table would be.  So just use a table.  Or do not provide a consolidated/wide query.  If you want both, then deal with the performance hit once per id and create a materialized view - basically, insert the results of the query into a physical table and query that table for live usage.  This is a cache and comes with all the benefits and downsides thereof.
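
A bare-bones sketch of that cache idea, assuming the wide pivot query from earlier in the thread has been wrapped in a view (the names test_wide_report and test_summary are made up for illustration):

-- build the cache once from the wide pivot query
create table test_summary as select * from test_wide_report;

-- refresh it periodically, e.g. after each test run
truncate test_summary;
insert into test_summary select * from test_wide_report;

Reports then read from test_summary and only pay the pivot cost at refresh time.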

David J.




Re: [noob] How to optimize this double pivot query?

From:
Thomas Kellerer
Date:
Robert Buck, 02.10.2012 03:13:
> So as you can probably glean, the tables store performance metric
> data. The reason I chose to use k-v is simply to avoid having to
> create an additional column every time a new metric type come along.
> So those were the two options I thought of, straight k-v and column
> for every value type.
>
> Are there other better options worth considering that you could point
> me towards that supports storing metrics viz. with an unbounded
> number of metric types in my case?
>

Have a look at the hstore module. It's exactly meant for that scenario, with the added
benefit that you can index on that column, and looking up key names and their values
is then blazingly fast.

That, combined with the tablefunc module (which lets you do pivot queries), might
make your queries substantially more readable (and maybe faster as well).
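
A rough sketch of what that can look like (the table and column names here are invented; the GIN index is what makes the key lookups fast):

create extension hstore;

-- one row per test, all variable/metric key-value pairs in one hstore column
create table test_kv (
    test_id integer primary key references test (id),
    metrics hstore
);

create index test_kv_metrics_idx on test_kv using gin (metrics);

insert into test_kv
values (1, '"tps"=>"1234", "thread-count"=>"32"'::hstore);

-- pull out a single key, or filter on a key/value pair
select metrics -> 'tps' as tps from test_kv;
select * from test_kv where metrics @> '"thread-count"=>"32"';

A crosstab query, discussed later in the thread, could then read its key-value rows out of the hstore column (e.g. via each(metrics)) instead of the two k-v tables.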

Regards
Thomas







Re: [noob] How to optimize this double pivot query?

From:
Samuel Gendler
Date:


On Mon, Oct 1, 2012 at 11:46 PM, Thomas Kellerer <spam_eater@gmx.net> wrote:

That, combined with the tablefunc module (which lets you do pivot queries), might
make your queries substantially more readable (and maybe faster as well).


I would think that using the crosstab functions in tablefunc would solve the problem without needing a complete change of structure. I've built crosstabs over a whole lot more than 54K rows in far, far less time (and resulting in more than 35 columns, too) than the 11 seconds that was quoted here, without feeling the need to deal with hstore or similar.  In fact, wouldn't hstore actually make it more difficult to build a crosstab query than the schema that he has in place now?

--sam

Re: [noob] How to optimize this double pivot query?

From:
Robert Buck
Date:
Hi Samuel,

Thank you. This may be a bit of a stretch for you, but would it be possible for me to peek at a sanitized version of your crosstab query, as a good example of how to do this for this noob?

This will be pretty common in my case. The biggest tables will get much larger as they are raw metrics feeds, which at some point need to be fed through reporting engines to analyze and spot regressions.

Lastly, am I simply using the wrong tech for data feeds and analytics? The first cut of this used flat files and R, and though it scoured thousands of files it was much faster than the SQL I wrote here. The big goal was to get this off disk and into a database, but as it's highly variable, very sparse metric data, this is why I chose k-v. SQL databases are internally more politically acceptable, though I am personally agnostic on the matter. In the end it would be nice to report directly off a database, but so long as I can transform to csv I can always perform reporting and analytics in R, and optionally map and reduce natively in Ruby. Sane? Ideas? This is early on, and I am willing to adjust course and find a better way if suggestions indicate such. I've heard a couple of options so far.

Best regards,

Bob


Re: [noob] How to optimize this double pivot query?

From:
Samuel Gendler
Date:



OK, you owe me.  I think I just spent more than an hour writing this up ;-)  

Given the numbers of rows you are talking about, I can't think of any good reason why the database shouldn't do what you need it to do pretty effectively/quickly.

Some questions before I provide crosstab samples - your example query has a hardcoded set of keys that it is looking for, but maybe that was code-generated.  There are multiple forms of the crosstab function, some of which rely on a constant column count and others which can generate the set of columns based on another query.  There are complications in that second form relating to race conditions with regard to new keys showing up between the query for the columns and the query for the data, so it is important to understand that and either structure queries accordingly or make sure you execute in a transaction with sufficient transaction isolation to prevent the race condition from appearing.

For crosstab queries, you generally want a query that returns results in the following form:

row_name | key | value

where row_name would be a date or some other label that all values that should share a row will have in common.  Key is the field that defines a column in the final row.  And value is the value to go in that column.  If it is the case that you always want ALL keys, then you can simply do a left join to ensure that you get a row for every key, regardless of whether there is an actual value.  You can use COALESCE to turn nulls resulting from left joins that don't match into '0' values.  If there is the potential for multiple rows with the same key and row_name, you need to use an aggregate function (like max or avg) to compress all rows into a single row.

The crosstab(text sql) form of the function requires that it gets a value for every possible column in every single row.  It makes some naive assumptions about the presence of a value for every spot in the matrix.  It also cannot include extra columns as plain payload in the final cross tab result.  For greater flexibility, you really want to use the crosstab(text sql, text sql) form of the function.  This allows you to specify the set of columns that should appear via the 2nd query.  When processing the results, it makes fewer naive assumptions about the presence of data for every column, so it correctly handles missing data (and, I assume, data for columns that didn't appear in the 2nd query).  This eliminates the need for left joins to get empty rows, though you do still need to use aggregates to combine multiple rows into one row.  

Your 2nd query would be a query that simply returns the distinct set of key names from both of your key/value tables.  You want them in a consistent and repeatable order, so order by id_name:

select id_name from (
    select distinct m.id_name from metadata_key m 
    union 
    select distinct d.id_name from metric_def d
) q order by id_name

The nested union query is required based on what I can glean of your schema from your examples, since you are getting keys from multiple tables and we need them all in a single column in a consistent order.

Now you just need a query that returns 

row_name | extra_col1 | extra_col2 | key | max(value)

extra_col1 and extra_col2 are going to be the begin_time and end_time from your test table, but you are going to have to do the union thing again in order to extract results from multiple tables into a single column, and you'll have to do the subquery thing in order to get the rows in consistent order again:

select row_name, begin_time, end_time, category, value from
(
select t.id_name as row_name, max(t.begin_time) as begin_time, max(t.end_time) as end_time, 
          m.id_name as category, max(v.value) as value
from test t, metadata_key m, test_variables v 
where v.test_id = t.id and m.id = v.metadata_key_id group by 1, 4

union all -- use 'union all' to prevent the database from trying to de-dupe rows

select t.id_name as row_name, max(t.begin_time) as begin_time, max(t.end_time) as end_time, 
         d.id_name as category, max(r.value) as value
from test t, test_results r, metric_def d
where r.test_id = t.id and d.id = r.metric_def_id group by 1, 4
) q 
order by row_name, category 

order by is important, as the crosstab processor is naive - it processes a row until a new row_name is found, then moves on.  If it encounters the same row_name later on, it will create a new row with the same name rather than going back and adding columns to the existing row. I don't think category order is important in this case, since it is doing category lookup by name rather than position, but all of my example code does order categories, too, probably just to be safe.

Now there is one final bit of weirdness related to the crosstab functions.  You must specify the structure of the returned results when you call it, because it returns a setof record.

So you either call it like this:

select * from crosstab(sql1, sql2) as myresult(rowname text, begin_time timestamp, end_time timestamp, col1 int, col2 int, col3 int, col4 int, col5 int, ...)

or if you know that you will have exactly 35 columns every single time, you can declare an alias to the function which specifies the structure of the returned records, which will allow you to call it simply as:

select * from crosstab35(sql, sql);

The documentation for the tablefunc module (http://www.postgresql.org/docs/9.1/static/tablefunc.html) explains how to set up the aliases you might want if you have a consistent column set.  If you don't have a consistent column set, you have to first query for the names of all possible columns.  Then construct your two queries for the crosstab function, along with the 'as' clause, using the names that were returned from your initial call to limit the results in the eventual crosstab execution.  In other words, you need to make sure that the 2nd sql statement in the crosstab call cannot possibly return more columns than the set of columns you define in the 'as' clause, so you must either execute both queries in a transaction with at least REPEATABLE READ isolation semantics, or else include a where clause in your column query that limits the columns returned to the same set of columns you got in the first query - select column_name from table where column_name in (...).

Basically, the flow looks like this:

select all possible column names from database.

build data query for crosstab func.
build column query for crosstab func (should be the same or similar to the earlier query for column names).
build 'as' clause for crosstab func - I just name my columns a1, a2, a3, a4, etc, knowing that they will be in the same order as my column names, since my column names are often not strings that would be legal column names in a query.

execute query "select * from crosstab(" + data_query + "," + column_query + ")" + as_clause
process crosstab result
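
Putting those pieces together against the schema from the original post, the final call could look something like this (a sketch: it assumes the values and keys are text, that begin/end times are timestamps, and it spells out only three of the key columns - in practice the column definition list is generated by code from the key-name query, ideally inside a REPEATABLE READ transaction so no new keys can appear between the two steps):

select * from crosstab(
    $$ select row_name, begin_time, end_time, category, value from (
           select t.id_name as row_name, max(t.begin_time) as begin_time,
                  max(t.end_time) as end_time,
                  m.id_name as category, max(v.value) as value
           from test t, metadata_key m, test_variables v
           where v.test_id = t.id and m.id = v.metadata_key_id
           group by 1, 4
           union all
           select t.id_name, max(t.begin_time), max(t.end_time),
                  d.id_name, max(r.value)
           from test t, test_results r, metric_def d
           where r.test_id = t.id and d.id = r.metric_def_id
           group by 1, 4
       ) q order by row_name, category $$,
    $$ select id_name from (
           select m.id_name from metadata_key m
           union
           select d.id_name from metric_def d
       ) q order by id_name $$
) as ct(row_name text, begin_time timestamp, end_time timestamp,
        a1 text, a2 text, a3 text);  -- ...continue a4, a5, ... one column per key, in key-name order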

I'll admit, it's a little difficult to set up correctly the first time you do it.  But execution is actually pretty efficient, though it is only doing pretty much exactly what you'd do if you were to pivot the data manually while iterating over the resultset for the data query.  If you're working in a dynamically typed language that lets you easily write code that has efficient hash maps and the like, and you only have this one use case, it might actually be easier just to query for the raw data and pivot it yourself, where you'll have much more flexibility in dealing with column names and such. But if you have a wide variety of uses, it's probably worthwhile to become familiar with the crosstab functionality, since it is pretty simple to use once you get the hang of it. I've got fairly boilerplate code that deals with things like generating the record structure, so I can usually add a new bit of crosstab functionality to an app pretty darn quickly these days.

One thing that's a drag is that the queries are just strings as far as the database is concerned, so depending upon your client library, you may not be able to use variable escaping best practices. Basically, jdbc won't allow you to use prepared statement variables within a string, so you have to manually escape any user-provided input while constructing the query as a raw string, rather than using '?' in your query.  The same goes for trying to slip a crosstab query through hibernate's SQLQuery interface (I thought that might be a workaround, but it wasn't), so you are forced to construct the entire query string manually.

--sam


Re: [noob] How to optimize this double pivot query?

From:
Samuel Gendler
Date:
One last comment - an alternative solution to this is to use array_agg(), which will simply add each value that matches a group by clause to an array.  If you are careful to structure your query with left joins so that every grouping will have the same number of rows, then you can get a quick and dirty crosstab like this

select row_name, array_agg(value) from test t left join value_table v on t.id = v.id group by 1;

Obviously, you'll want to deal with your potential for duplicate rows via max() or avg() in a subquery, rather than joining directly to the table, but you should get the idea from that example.  You can also use coalesce to convert nulls to some other value, if required.

Since the crosstab functions already require you to do all that work with regard to determining column names and building up the record structure, using array_agg can be every bit as effective, since it basically requires the same process.  First query for all possible names, then issue a query that will cause the values to be processed by array_agg in column order, then iterate over results, getting each array value and associating it with a particular name.
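
As a concrete sketch against the metadata side of the schema in this thread (hypothetical, and treating values as text): the cross join guarantees one array element per key for every test, the subquery collapses duplicate rows with max(), and coalesce fills the gaps:

select t.id_name,
       max(t.begin_time) as begin_time,
       max(t.end_time)   as end_time,
       array_agg(coalesce(v.value, '0') order by m.id_name) as metadata_values
from test t
cross join metadata_key m
left join (select test_id, metadata_key_id, max(value) as value
           from test_variables
           group by test_id, metadata_key_id) v
       on v.test_id = t.id and v.metadata_key_id = m.id
group by t.id_name;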

Your result will look  like this:

id_name, start_time, end_time, array_of_values

That may or may not be convenient for you, depending upon how you are using the resultset you get back. You'll still need to play all the same games with regard to unioning multiple queries together to pivot data from multiple tables into the same row.






Re: [noob] How to optimize this double pivot query?

From:
Robert Buck
Date:
Thank you, Samuel.

I am trying some of this out right now...

This is great information.

Thanks so much. This is a huge help.

Bob

On Tue, Oct 2, 2012 at 4:09 PM, Samuel Gendler <sgendler@ideasculptor.com> wrote:
One last comment - an alternative solution to this is to use array_agg(), which will simply add each value that matches a group by clause to an array.  If you are careful to structure your query with left joins so that every grouping will have the same number of rows, then you can get a quick and dirty crosstab like this

select row_name, array_agg(value) from test t left join value_table v on t.id = v.id group by 1;

Obviously, you'll want to deal with your potential for duplicate rows via max() or avg() in a subquery, rather than joining directly to the table, but you should get the idea from that example.  You can also use coalesce to convert nulls to some other value, if required.

Since the crosstab functions already require you to do all that work with regard to determining column names and building up the record structure, using array_agg can be every bit as effective, since it basically requires the same process.  First query for all possible names, then issue a query that will cause the values to be processed by array_agg in column order, then iterate over results, getting each array value and associating it with a particular name.

Your result will look  like this:

id_name, start_time, end_time, array_of_values

That may or may not be convenient for you, depending upon how you are using the resultset you get back. You'll still need to play all the same games with regard to unioning multiple queries together to pivot data from multiple tables into the same row.



On Tue, Oct 2, 2012 at 12:57 PM, Samuel Gendler <sgendler@ideasculptor.com> wrote:


On Tue, Oct 2, 2012 at 2:45 AM, Robert Buck <buck.robert.j@gmail.com> wrote:
Hi Samuel 

Thank you. This may be a bit of a stretch for you, but would it be possible for me to peek at a sanitized version of your cross tab query, for a good example on how to do this for this noob?

This will be pretty common in my case. The biggest tables will get much larger as they are raw metrics feeds, which at some point need to be fed through reporting engines to analyze and spot regressions.

Lastly, am I simply using the wrong tech for data feeds and analytics? The first cut of this used flat files and R and though it scoured thousands of files was much faster than the SQL I wrote here. The big goal was to get this off disk and into a database, but as its highly variable, very sparse, metric data, this is why I chose k-v. SQL databases are internally more politically acceptable, though I am personally agnostic on the matter. In the end it would be nice to directly report off a database, but so long as I can transform to csv I can always perform reporting and analytics in R, and optionally map and reduce natively in Ruby. Sane? Ideas? This is early on, and willing to adjust course and find a better way if suggestions indicate such. I've heard a couple options so far.

OK, you owe me.  I think I just spent more than an hour writing this up ;-)  

Given the numbers of rows you are talking about, I can't think of any good reason why the database shouldn't do what you need it to do pretty effectively/quickly.

Some questions before I provide crosstab samples - your example query has a hardcoded set of keys that it is looking for, but maybe that was code-generated.  There are multiple forms of the crosstab function, some of which rely on a constant column count and others which can generate the set of columns based on another query.  There are complications in that second form relating to race conditions with regard to new keys showing up between the query for the columns and the query for the data, so it is important to understand that and either structure queries accordingly or make sure you execute in a transaction with sufficient transaction isolation to prevent the race condition from appearing.

For crosstab queries, you generally want a query that returns results in the following form:

row_name | key | value

where row_name would be a date or some other label that all values that should share a row will have in common.  Key is the field that defines a column in the final row.  And value is the value to go in that column.  If it is the case that you always want ALL keys, then you can simply do a left join to ensure that you get a row for every key, regardless of whether there is an actual value.  You can use COALESCE to turn nulls resulting from left joins that don't match into '0' values.  If there is the potential for multiple rows with the same key and row_name, you need to use an aggregate function (like max or avg) to compress all rows into a single row.

The crosstab(text sql) form of the function requires that it gets a value for every possible column in every single row.  It makes some naive assumptions about the presence of a value for every spot in the matrix.  It also cannot include extra columns as plain payload in the final cross tab result.  For greater flexibility, you really want to use the crosstab(text sql, text sql) form of the function.  This allows you to specify the set of columns that should appear via the 2nd query.  When processing the results, it makes fewer naive assumptions about the presence of data for every column, so it correctly handles missing data (and, I assume, data for columns that didn't appear in the 2nd query).  This eliminates the need for left joins to get empty rows, though you do still need to use aggregates to combine multiple rows into one row.  
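
One prerequisite worth mentioning, in case it isn't already set up: crosstab() lives in the tablefunc contrib module, so it has to be installed in the database before any of this will work. On 9.1 or later that's a single statement (older releases run the contrib tablefunc.sql script instead):

-- install the tablefunc module that provides crosstab() (PostgreSQL 9.1+)
CREATE EXTENSION tablefunc;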

Your 2nd query would be a query that simply returns the distinct set of key names from both of your key/value tables.  You want them in a consistent and repeatable order, so order by id_name:

select id_name from (
    select distinct m.id_name from metadata_key m 
    union 
    select distinct d.id_name from metric_def d
) q order by id_name

The nested union query is required based on what I can glean of your schema from your examples, since you are getting keys from multiple tables and we need them all in a single column in a consistent order.

Now you just need a query that returns 

row_name | extra_col1 | extra_col2 | key | max(value)

extra_col1 and extra_col2 are going to be the begin_time and end_time from your test table, but you are going to have to do the union thing again in order to extract results from multiple tables into a single column, and you'll have to do the subquery thing in order to get the rows in consistent order again:

select row_name, begin_time, end_time, category, value from
(
select t.id_name as row_name, max(t.begin_time) as begin_time, max(t.end_time) as end_time, 
          m.id_name as category, max(v.value) as value
from test t, metadata_key m, test_variables v 
where v.test_id = t.id and m.id = v.metadata_key_id group by 1, 4

union all -- use 'union all' to prevent the database from trying to de-dupe rows

select t.id_name as row_name, max(t.begin_time) as begin_time, max(t.end_time) as end_time, 
         d.id_name as category, max(r.value) as value
from test t, test_results r, metric_def d
where t.id = r.test_id and d.id = r.metric_def_id group by 1, 4
) q 
order by row_name, category 

order by is important, as the crosstab processor is naive - it processes a row until a new row_name is found, then moves on.  If it encounters the same row_name later on, it will create a new row with the same name rather than going back and adding columns to the existing row. I don't think category order is important in this case, since it is doing category lookup by name rather than position, but all of my example code does order categories, too, probably just to be safe.

Now there is one final bit of weirdness related to the crosstab functions.  You must specify the structure of the returned results when you call it, because it returns a setof record.

So you either call it like this:

select * from crosstab(sql1, sql2) as myresult(rowname text, begin_time timestamp, end_time timestamp, col1 int, col2 int, col3 int, col4 int, col5 int, ...)

or if you know that you will have exactly 35 columns every single time, you can declare an alias to the function which specifies the structure of the returned records, which will allow you to call it simply as:

select * from crosstab35(sql, sql);

The documentation for the tablefunc module (http://www.postgresql.org/docs/9.1/static/tablefunc.html) explains how to set up the aliases you might want if you have a consistent column set.  If you don't have a consistent column set, you have to first query for the names of all possible columns.  Then construct your two queries for the crosstab function, along with the 'as' clause, using the names that were returned from your initial call to limit the results in the eventual crosstab execution.  In other words, you need to make sure that the 2nd sql statement in the crosstab call cannot possibly return more columns than the set of columns you define in the 'as' clause, so you must either execute both queries in a transaction with at least REPEATABLE READ isolation semantics, or else include a where clause in your column query that limits the columns returned to the same set of columns you got in the first query - select column_name from table where column_name in (...).

Basically, the flow looks like this:

select all possible column names from database.

build data query for crosstab func.
build column query for crosstab func (should be the same or similar to the earlier query for column names).
build 'as' clause for crosstab func - I just name my columns a1, a2, a3, a4, etc, knowing that they will be in the same order as my column names, since my column names are often not strings that would be legal column names in a query.

execute query "select * from crosstab($$" + data_query + "$$, $$" + column_query + "$$) " + as_clause  (the embedded queries have to be passed as text literals, so quote them; dollar-quoting saves you from escaping any quotes inside them)
process crosstab result
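
Putting those steps together, here is a rough sketch of what the final statement might look like against the schema in this thread, trimmed down to just three of the keys so the 'as' clause stays short. Everything here is illustrative: the column types are guesses (the values look like free-form text), and the where clauses pin the category set so it is guaranteed to match the 'as' clause, per the note above:

select *
from crosstab(
    -- data query: row_name | begin_time | end_time | category | value
    $$
    select row_name, begin_time, end_time, category, value from (
        select t.id_name as row_name, max(t.begin_time) as begin_time, max(t.end_time) as end_time,
               m.id_name as category, max(v.value) as value
        from test t, metadata_key m, test_variables v
        where v.test_id = t.id and m.id = v.metadata_key_id
          and m.id_name in ('package-version', 'database-vendor')
        group by 1, 4
        union all
        select t.id_name, max(t.begin_time), max(t.end_time),
               d.id_name, max(r.value)
        from test t, test_results r, metric_def d
        where t.id = r.test_id and d.id = r.metric_def_id
          and d.id_name in ('tps')
        group by 1, 4
    ) q order by row_name, category
    $$,
    -- category query: one row per pivoted column, same order as the 'as' clause
    $$
    select id_name from (
        select distinct id_name from metadata_key where id_name in ('package-version', 'database-vendor')
        union
        select distinct id_name from metric_def where id_name in ('tps')
    ) k order by id_name
    $$
) as ct(row_name        text,
        begin_time      timestamp,
        end_time        timestamp,
        database_vendor text,        -- value columns listed in the category
        package_version text,        -- query's alphabetical order
        tps             text);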

I'll admit, it's a little difficult to set up correctly the first time you do it.  But execution is actually pretty efficient, though it is only doing pretty much exactly what you'd do if you were to pivot the data manually while iterating over the resultset for the data query.  If you're working in a dynamically typed language that lets you easily write code that has efficient hash maps and the like, and you only have this one use case, it might actually be easier just to query for the raw data and pivot it yourself, where you'll have much more flexibility in dealing with column names and such. But if you have a wide variety of uses, it's probably worthwhile to become familiar with the crosstab functionality, since it is pretty simple to use once you get the hang of it. I've got fairly boilerplate code that deals with things like generating the record structure, so I can usually add a new bit of crosstab functionality to an app pretty darn quickly these days.

One thing that's a drag is that the queries are just strings as far as the database is concerned, so depending upon your client library, you may not be able to use variable escaping best practices. Basically, JDBC won't allow you to use prepared statement variables within a string, so you have to manually escape any user-provided input while constructing the query as a raw string, rather than using '?' in your query.  The same goes for trying to slip a crosstab query through Hibernate's SQLQuery interface (I thought that might be a workaround, but it wasn't), so you are forced to construct the entire query string manually.

--sam




Re: [noob] How to optimize this double pivot query?

From
Robert Buck
Date
Samuel,

You asked:

Some questions before I provide crosstab samples - your example query has a hardcoded set of keys that it is looking for, but maybe that was code-generated.  There are multiple forms of the crosstab function, some of which rely on a constant column count and others which can generate the set of columns based on another query.  There are complications in that second form relating to race conditions with regard to new keys showing up between the query for the columns and the query for the data, so it is important to understand that and either structure queries accordingly or make sure you execute in a transaction with sufficient transaction isolation to prevent the race condition from appearing.

The query was hand written from examples I found on the web. The list here was hand written too. This will run in Ruby, so I can template as much as needed on that side. I can also run this in a defined transaction boundary rather than using auto-commit. Right now I am thumbing through your email and trying it out.

Bob


Re: [noob] How to optimize this double pivot query?

From
Samuel Gendler
Date


On Wed, Oct 3, 2012 at 9:31 AM, Robert Buck <buck.robert.j@gmail.com> wrote:

The query was hand written from examples I found on the web. The list here was hand written too. This will run in Ruby, so I can template as much as needed on that side. I can also run this in a defined transaction boundary rather than using auto-commit. Right now I am thumbing through your email and trying it out.

Actually, it isn't sufficient just to run within a defined transaction, since the default isolation level of a transaction in postgresql doesn't guarantee that you'll see the same rows in two queries during the same transaction.  If another transaction commits between your queries, you'll see the new rows in the 2nd query.  You need your transaction isolation set to at least 'repeatable read' (which in postgresql already prevents phantom reads), or 'serializable' to be safest.  The default isolation level is 'read committed', which says you'll only see rows that have been committed, but there's no guarantee that two queries will get the same values if another transaction commits a change between them.

For details, see this:


In general, I find that it is easier to set up my crosstab processing to generate consistent column sets by including a where clause in all but the first query, such that the where clause always matches the first query.  That won't work if column categories can be deleted, since you'll still get a different column set from one query to the next, but the vast majority of my apps don't allow actual row deletion, or it is at least incredibly rare and only done by administrative users, not end-users of the app.  Generally, rows representing domain entities get marked as disabled, so rows going away isn't usually a factor for me.  And it means I don't have to worry that someone who comes along and re-uses my code will forget to set the transaction isolation correctly when running the crosstab query and get wacky results.
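
A hypothetical sketch of that pinning, using the key-name query from earlier in the thread (the literal list would be whatever names the first query actually returned):

-- category query pinned to the key names fetched in step 1, so the column set
-- cannot drift between the two statements even without serializable isolation
select id_name from (
    select distinct id_name from metadata_key
    union
    select distinct id_name from metric_def
) q
where id_name in ('package-version', 'database-vendor', 'tps')
order by id_name;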

Note that if you do use serializable transactions to do this, you should use a read-only transaction to cut down on locking overhead and the possibility that other transactions will fail due to serialization errors while your crosstab is running.
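
If you do go the serializable route, both the isolation level and the read-only property can be declared when the transaction is opened; a minimal sketch, with the two queries elided since they appear above:

-- run both steps against one snapshot; read only cuts down on the locking and
-- serialization overhead mentioned above
begin transaction isolation level serializable, read only;
-- 1. query for the current set of key names (these define the output columns)
-- 2. build and execute the crosstab(...) call using exactly those names
commit;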

This all seems terribly complex when describing it, but it really is pretty easy once you've done it once.  The column-count/serializable thing applies to using array_agg to build a crosstab, too, since you still need to ensure that your arrays all have the same number of entries in the same order.

--sam