Discussion: Many, many materialised views - Performance?

Many, many materialised views - Performance?

From
Toby Corkindale
Date:
Hi,
I've discovered previously that Postgres doesn't perform so well in some
areas once you have hundreds of thousands of small tables.

I'm wondering if materialised views will fare better, or if they too
create a lot of fluff in pg_catalog and many files on-disk?

-Toby


Re: Many, many materialised views - Performance?

From
Alban Hertroys
Date:
On Oct 8, 2013, at 9:36, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:

> Hi,
> I've discovered previously that Postgres doesn't perform so well in some areas once you have hundreds of thousands of
> small tables.
>
> I'm wondering if materialised views will fare better, or if they too create a lot of fluff in pg_catalog and many
> files on-disk?


A materialised view is basically a view turned into a table, with some fluff around it to keep the data it contains
up-to-date when the underlying data gets modified. From the 9.3 documentation it appears that this step isn't done
automatically yet, but instead you have to issue a REFRESH MATERIALIZED VIEW command (meaning it's not much fluff).
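
For illustration, a minimal sketch of that manual refresh cycle in 9.3 syntax. The table "sales" and the view "sales_by_month" are hypothetical names chosen for the example, not anything from the original poster's schema:

```sql
-- Hypothetical base table, for illustration only.
CREATE TABLE sales (
    sold_on  date    NOT NULL,
    factory  integer NOT NULL,
    amount   numeric NOT NULL
);

-- The materialised view stores the query result on disk, like a table.
CREATE MATERIALIZED VIEW sales_by_month AS
SELECT date_trunc('month', sold_on) AS month,
       factory,
       sum(amount) AS total_amount
FROM sales
GROUP BY 1, 2;

-- The stored result does not follow changes to sales; it has to be
-- rebuilt explicitly.
REFRESH MATERIALIZED VIEW sales_by_month;
```

Until the REFRESH is run again, queries against sales_by_month keep returning the result as of the last refresh.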

One of the main purposes of materialized views is to have differently organised versions of the same data available (to
all sessions and for a longer time than, say, temporary tables) that are, for example, more convenient/performant for
reporting.
In many cases, materialized views are a denormalization of your data and often grouped and aggregated.

Having hundreds of thousands of materialized views is going to hurt catalog performance just as much as having that
many tables, with the (manual) maintenance of keeping the data up-to-date added to that.
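
One way to see how much each kind of relation contributes to the catalog is to count the rows in pg_class. This is only a sketch; "some_matview" in the second query is a hypothetical name standing in for any existing materialised view:

```sql
-- Count catalog entries per relation kind:
-- 'r' = ordinary table, 'v' = view, 'm' = materialized view, 'i' = index.
SELECT relkind, count(*) AS relations
FROM pg_catalog.pg_class
GROUP BY relkind
ORDER BY relations DESC;

-- A materialised view has its own storage file, just as a table does.
-- some_matview is a hypothetical name.
SELECT pg_relation_filepath('some_matview');
```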

Whether that improves or deteriorates performance depends on how you plan to use them. I can say though that it's
unusual to have hundreds of thousands of them; for what purpose do you intend to use them?

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



Re: Many, many materialised views - Performance?

From
John R Pierce
Date:
On 10/8/2013 12:36 AM, Toby Corkindale wrote:
> I've discovered previously that Postgres doesn't perform so well in
> some areas once you have hundreds of thousands of small tables.

I'm having a hard time envisioning a database that would need so many
different record types.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: Many, many materialised views - Performance?

From
Toby Corkindale
Date:
On 08/10/13 19:58, Alban Hertroys wrote:
> On Oct 8, 2013, at 9:36, Toby Corkindale
> <toby.corkindale@strategicdata.com.au> wrote:
>
>> Hi, I've discovered previously that Postgres doesn't perform so
>> well in some areas once you have hundreds of thousands of small
>> tables.
>>
>> I'm wondering if materialised views will fare better, or if they
>> too create a lot of fluff in pg_catalog and many files on-disk?
>
>
> A materialised view is basically a view turned into a table, with
> some fluff around it to keep the data it contains up-to-date when the
> underlying data gets modified. From the 9.3 documentation it appears
> that this step isn't done automatically yet, but instead you have to
> issue a REFRESH MATERIALIZED VIEW command (meaning it's not much
> fluff).
>
> One of the main purposes of materialized views is to have differently
> organised versions of the same data available (to all sessions and
> for a longer time than, say, temporary tables) that are, for example,
> more convenient/performant for reporting. In many cases, materialized
> views are a denormalization of your data and often grouped and
> aggregated.
>
> Having hundreds of thousands of materialized views is going to hurt
> catalog performance just as much as having that many tables, with the
> (manual) maintenance of keeping the data up-to-date added to that.
>
> Whether that improves or deteriorates performance depends on how you
> plan to use them. I can say though that it's unusual to have hundreds
> of thousands of them; for what purpose do you intend to use them?

Hi Alban,
I had wondered if that was the case -- that they'd be implemented
similarly to tables under the hood.

In this instance, we have a lot of queries that build certain aggregate
results, which are very slow. The queries were initially all implemented
as views, but then we started doing a type of materialising of our own,
turning them into tables with CREATE TABLE AS SELECT ....
This does make the results very fast to access now, but the side effect
is a vast number of (very small) tables.

It would be better to use built-in materialised views, because it's a
standard way to do it, but it sounds like it won't solve the
too-many-tables-in-the-system problem.

Realistically we need to go back and use a different approach
altogether, but you know how it is with long-running production systems.
Significant changes can be hard to push through.

Toby


Re: Many, many materialised views - Performance?

From
Kevin Grittner
Date:
Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:

> In this instance, we have a lot of queries that build certain aggregate
> results, which are very slow. The queries were initially all implemented
> as views, but then we started doing a type of materialising of our own,
> turning them into tables with CREATE TABLE AS SELECT ....
> This does make the results very fast to access now, but the side effect
> is a vast number of (very small) tables.

If you have multiple tables with identical layout but different
subsets of the data, you will probably get better performance by
putting them into a single table with indexes which allow you to
quickly search the smaller sets within the table.
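
A rough sketch of what such a combined table might look like. All names here (aggregate_results, result_set, the key columns, the 'foo_2013_10' label) are assumptions made up for the example:

```sql
-- Hypothetical combined table replacing many small tables that share
-- the same layout.
CREATE TABLE aggregate_results (
    result_set    text             NOT NULL,  -- which former table the row came from
    id_key_1      integer          NOT NULL,
    id_key_2      integer          NOT NULL,
    result_value  double precision NOT NULL,
    PRIMARY KEY (result_set, id_key_1, id_key_2)
);

-- The primary key index leads with result_set, so a lookup of one
-- subset does not need to scan the others.
SELECT id_key_1, id_key_2, result_value
FROM aggregate_results
WHERE result_set = 'foo_2013_10';
```

This trades hundreds of thousands of catalog entries for one table and one index, at the cost of an extra column in every row.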

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Many, many materialised views - Performance?

From
Alban Hertroys
Date:
On Oct 9, 2013, at 4:08, Kevin Grittner <kgrittn@ymail.com> wrote:

> Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
>
>> In this instance, we have a lot of queries that build certain aggregate
>> results, which are very slow. The queries were initially all implemented
>> as views, but then we started doing a type of materialising of our own,
>> turning them into tables with CREATE TABLE AS SELECT ....
>> This does make the results very fast to access now, but the side effect
>> is a vast number of (very small) tables.
>
> If you have multiple tables with identical layout but different
> subsets of the data, you will probably get better performance by
> putting them into a single table with indexes which allow you to
> quickly search the smaller sets within the table.


I was thinking just that while reading Toby's message. For example, you could put the results of several related
aggregations into a single materialized view, if they share the same key columns (year, month, factory or something
similar).
I'm not sure the new built-in materialized views can be updated like that though, unless you manage to combine those
aggregations into a single monster-query, but that will probably not perform well...
What we tend to do at work (no PostgreSQL, unfortunately) is to use external tools to combine those aggregated results
and store that back into the database (which we often need to do anyway, as we deal with several databases on several
servers).

Additionally, if you have that many tables, it sounds like you partitioned your data.
With aggregated results, the need for partitioning is much less (or perhaps it isn't even needed at all). And perhaps
you don't even need the data from all partitions; say if you have monthly partitions of data, do you really need
aggregated results from 5 years ago?

That said, users excel in finding data to request that you thought they wouldn't need.

Which brings me to another question: Do your users really need the data from all those views or do they only think they
need that?

Frequently, users create elaborate Excel sheets and then request tons of data to fill them, while what they're really
interested in is the _result_ of that Excel sheet. If you can provide them with that, they're happy and you can rest
assured that they're at least using correct results. Plus, it removes some of _this_ burden from your database.

I've seen users who're busy creating sheets like that for 2 weeks, twice a year, to create data that I can prepare for
them in a couple of days into a report that takes 2 minutes to load (which is long, but not compared to their 2 weeks).

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



Re: Many, many materialised views - Performance?

From
Bill Moran
Date:
On Tue, 8 Oct 2013 19:08:45 -0700 (PDT) Kevin Grittner <kgrittn@ymail.com> wrote:

> > In this instance, we have a lot of queries that build certain aggregate
> > results, which are very slow. The queries were initially all implemented
> > as views, but then we started doing a type of materialising of our own,
> > turning them into tables with CREATE TABLE AS SELECT ....
> > This does make the results very fast to access now, but the side effect
> > is a vast number of (very small) tables.

I missed the start of this thread, so apologies if my suggestion is off-base.

When there are lots of tables, I've seen performance improvements from
distributing the tables through schemas.  It seems to improve name
resolution performance.
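
As a sketch of that idea only: the schema and table names below are hypothetical, and whether this helps in a given system would need to be measured.

```sql
-- Hypothetical layout: spread the small tables over several schemas
-- instead of keeping them all in one.
CREATE SCHEMA results_a;
CREATE SCHEMA results_b;

CREATE TABLE results_a.foo_result (
    id_key_1      integer,
    id_key_2      integer,
    foo_result_x  double precision
);
CREATE TABLE results_b.bar_result (
    id_key_1      integer,
    id_key_2      integer,
    bar_result_y  double precision
);

-- A schema-qualified name is looked up in the named schema directly,
-- without walking search_path.
SELECT * FROM results_a.foo_result;

-- Alternatively, keep search_path short for unqualified names.
SET search_path = results_a, public;
```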

--
Bill Moran <wmoran@potentialtech.com>


Re: Many, many materialised views - Performance?

From
Toby Corkindale
Date:
On 09/10/13 21:05, Alban Hertroys wrote:
> On Oct 9, 2013, at 4:08, Kevin Grittner <kgrittn@ymail.com> wrote:
>
>> Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
>>
>>> In this instance, we have a lot of queries that build certain
>>> aggregate results, which are very slow. The queries were
>>> initially all implemented as views, but then we started doing a
>>> type of materialising of our own, turning them into tables with
>>> CREATE TABLE AS SELECT .... This does make the results very fast
>>> to access now, but the side effect is a vast number of (very
>>> small) tables.
>>
>> If you have multiple tables with identical layout but different
>> subsets of the data, you will probably get better performance by
>> putting them into a single table with indexes which allow you to
>> quickly search the smaller sets within the table.
>
>
> I was thinking just that while reading Toby's message. For example,
> you could put the results of several related aggregations into a
> single materialized view, if they share the same key columns (year,
> month, factory or something similar). I'm not sure the new built-in
> materialized views can be updated like that though, unless you manage
> to combine those aggregations into a single monster-query, but that
> will probably not perform well... What we tend to do at work (no
> PostgreSQL, unfortunately) is to use external tools to combine those
> aggregated results and store that back into the database (which we
> often need to do anyway, as we deal with several databases on several
> servers).

Thanks for the suggestions, all.
As I noted in an earlier email -- we're aware that the schema could be
better designed, but making large changes is tricky in production systems.

Many of the tables are actually unique, but only in the sense that you
have various (common) identifier fields, and then a few (unique)
aggregate-results per table.

eg:
int id_key_1, int id_key_2, .., float FooBarXResult

I suspect the correct way to handle this would actually be a table that
looked like:

int id_key_1, int id_key_2, .., text result_name, float result_value

Although that would in turn make other queries more verbose, for
example, currently one can do:

select *
from FooResult
join BarResult using (id_key_1, id_key_2)
where FooResultX > 0.9 and BarResultY < 0.1;

I guess that turns into something like this:
select id_key_1, id_key_2,
        a.result_value as FooResultX, b.result_value as BarResultY
from AllResults a
join AllResults b using (id_key_1, id_key_2)
where a.result_name = 'FooResultX'
and a.result_value > 0.9
and b.result_name = 'BarResultY'
and b.result_value < 0.1;

So it's all do-able, but it does look nicer to separate things into
their own tables with named columns.
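
One possible way to keep the single-table layout without a self-join per result is conditional aggregation. This is a sketch only, assuming the AllResults layout described above holds at most one row per (id_key_1, id_key_2, result_name):

```sql
-- Pivot the name/value rows back into named columns with CASE, then
-- filter on the pivoted values in HAVING.
SELECT id_key_1, id_key_2,
       max(CASE WHEN result_name = 'FooResultX' THEN result_value END) AS FooResultX,
       max(CASE WHEN result_name = 'BarResultY' THEN result_value END) AS BarResultY
FROM AllResults
WHERE result_name IN ('FooResultX', 'BarResultY')
GROUP BY id_key_1, id_key_2
HAVING max(CASE WHEN result_name = 'FooResultX' THEN result_value END) > 0.9
   AND max(CASE WHEN result_name = 'BarResultY' THEN result_value END) < 0.1;
```

A query like this could be wrapped in an ordinary view per result combination, which would keep the named columns for the application without adding more tables.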

> Additionally, if you have that many tables, it sounds like you
> partitioned your data. With aggregated results, the need for
> partitioning is much less (or perhaps it isn't even needed at all).
> And perhaps you don't even need the data from all partitions; say if
> you have monthly partitions of data, do you really need aggregated
> results from 5 years ago?

You're correct, we do have partitioned tables due to the amount of data
in the system, but that's for just the non-aggregated data.
Those tables perform just fine!
It's the hundreds of thousands of views and tables with just a few rows
in them that worry me.. :)


> That said, users excel in finding data to request that you thought
> they wouldn't need.
>
> Which brings me to another question: Do your users really need the
> data from all those views or do they only think they need that?

Ah, indeed, users have not individually requested each of these many
thousands of tables and views. They are part of a large application, and
the results from all of those are required by it.

If I rewrote the application today, I'd be looking at doing things very
differently, knowing how it would eventually scale.