Discussion: distinct values and count over window function

distinct values and count over window function

From:
Rodrigo Rosenfeld Rosas
Date:
Hello, this is my first post to this particular list (I've also posted to the performance list a while ago to report a bug).

I generate complex dynamic queries and paginate them using "LIMIT 50". I also need to compute some statistics over the results, and I'd like to know whether I can avoid extra queries, or how to do so without much of a performance hit.

The statistic that bothers me most is a distinct count. For example, suppose a query would return more than 50 distinct values while the paginated query brings back only the first 50 rows.

When the user asks for the statistics, the application should show counts per distinct value for certain columns, including values not present in the current results page.

I noticed that there is a plan for PG 9.2 to include a query_to_json function, which would indeed help me get this done when used together with a window function.
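To illustrate, here is a rough sketch of the window-function part (the "items" table and its columns are made up). Window aggregates are computed over the full result set before LIMIT applies, so the counts cover all matching rows, not just the returned page:

SELECT i.*,
       count(*) OVER ()                      AS total_rows,       -- all matching rows
       count(*) OVER (PARTITION BY category) AS per_value_count   -- rows sharing this row's category
FROM items i
WHERE price > 100              -- stand-in for the complex dynamic conditions
ORDER BY id
LIMIT 50;

The catch is that the per-value counts only show up for the values that happen to appear on the returned page, which is why aggregating the full list into a single JSON column would help.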

Our application is currently running PG 9.1 (it was rolled back from 9.2 due to a performance bug that seems to be fixed now but is not yet released). Is there any way I could avoid running another query to get this kind of statistics?

The queries can become really complex and take almost 5s to complete in some cases, which would mean that performing two queries with those same conditions would take about 10s. I could run them at different times, but then the user would have to wait 5s just for the statistics calculation. This is what I'd like to avoid.

This application is a clustered web application, so I'd rather stick with OFFSET and LIMIT than rely on client library support for paginating results using cursors or something like that.

Any help is appreciated!

Thanks,
Rodrigo.

Re: distinct values and count over window function

From:
"David Johnston"
Date:
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Rodrigo Rosenfeld Rosas
Sent: Tuesday, November 27, 2012 4:23 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] distinct values and count over window function


Conceptually, you could execute and cache (via a temporary or permanent table) the results of the unlimited query, then use that cache both to calculate your statistics and to run your subsequent LIMIT/OFFSET queries.

In 9.1+ you can put INSERT inside a WITH/CTE, so you can perform multiple actions like this in a single statement without resorting to a function, though a function may have value as well.
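Something along these lines, with made-up table names, sketches the idea; the data-modifying CTE (9.1+) fills the cache and returns the first page in a single statement, and the statistics then read the cache instead of re-running the expensive query:

BEGIN;

-- "items" and "partial_results" are illustrative names.
CREATE TEMP TABLE partial_results (LIKE items) ON COMMIT DROP;

WITH cached AS (
    INSERT INTO partial_results
    SELECT * FROM items
    WHERE price > 100          -- stand-in for the complex dynamic conditions
    RETURNING *
)
SELECT * FROM cached ORDER BY id LIMIT 50;

-- Statistics come from the cached rows, within the same transaction:
SELECT category, count(*) FROM partial_results GROUP BY category;

COMMIT;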

David J.




Re: distinct values and count over window function

From:
Rodrigo Rosenfeld Rosas
Date:
On 27-11-2012 19:50, David Johnston wrote:
> Conceptually, you could execute and cache (via a temporary or permanent
> table) the results of the unlimited query, then use that cache both to
> calculate your statistics and to run your subsequent LIMIT/OFFSET queries.
>
> In 9.1+ you can put INSERT inside a WITH/CTE, so you can perform multiple
> actions like this in a single statement without resorting to a function,
> though a function may have value as well.
>
> David J.

Indeed David, thanks! Temporary tables seem like a good enough
solution. I tested with a complex query here, which takes about 300ms
on my dev machine.

I was implementing pagination by issuing two similar queries for the
first page, one of them being a count(*) query. Together they took about
500ms, while they complete in around 350ms if I use a temporary table
within a transaction, created with "CREATE TEMP TABLE partial_results ON COMMIT DROP AS".
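Roughly, the pattern looks like this (the SELECT body is just a stand-in for the real dynamic query):

BEGIN;

-- The temp table caches the unlimited result set and vanishes at COMMIT.
CREATE TEMP TABLE partial_results ON COMMIT DROP AS
SELECT * FROM items            -- hypothetical table standing in for the real query
WHERE price > 100;

SELECT count(*) FROM partial_results;                         -- total for the pager

SELECT * FROM partial_results ORDER BY id LIMIT 50 OFFSET 0;  -- first page

COMMIT;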

I'd love to keep those partial results around for longer, but that would
complicate things a lot. I'd need a way to ensure each query uses a
distinct name for its results table, and I'd also have to implement a
system that drops such tables when the memory they use passes some
caching threshold or when some timeout has expired.
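Something along the lines of the following sketch (all names are made up) is what I have in mind: a registry of cache tables, plus a periodic job that drops the stale ones.

-- Hypothetical sketch: each cache table gets a unique name and a row in
-- a registry; a periodic job drops tables older than some timeout.
CREATE TABLE IF NOT EXISTS result_cache_registry (
    table_name text PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Periodic cleanup, e.g. run from cron:
DO $$
DECLARE
    stale record;
BEGIN
    FOR stale IN SELECT table_name FROM result_cache_registry
                 WHERE created_at < now() - interval '15 minutes'
    LOOP
        EXECUTE format('DROP TABLE IF EXISTS %I', stale.table_name);
        DELETE FROM result_cache_registry
         WHERE table_name = stale.table_name;
    END LOOP;
END
$$;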

Are there any explicit caching mechanisms like the one described above
built into PostgreSQL?

Anyway, I won't have time to look into this until Monday at the earliest,
so for now I'll stick with the temp table strategy described above, due
to a tight deadline.

Thanks a lot!

Best,
Rodrigo.