Thread: Statistics Data archiving with Postgres

Statistics Data archiving with Postgres

From: Pascal Cohen
Date:
Hello
We are developing an application and would like to compute statistics on
its usage in order:
- to better understand what is used most in our application, so that we
can model our load test scenarios as accurately as possible;
- to provide other departments with information on how the application
is used.

The problem is that our application is currently read-mostly, while
statistics logging is a write-mostly process. On top of that, collecting
stats will generate a huge volume of data (because a very fine
granularity is mandatory).
We would like to avoid, as much as possible, any interference between
the stats collection and the main application.

We have looked through the Postgres documentation and found several
ideas:
- We have decided to isolate the stats in a dedicated schema.
- We have looked at polymorphism (partitioning via table inheritance) in
order to split our stats tables into smaller ones that we could "detach"
when they get old; see the sketch right after this list.
- We have looked at fsync tuning, or better at asynchronous commit,
since these data are not critical.
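To make the idea concrete, here is a rough sketch of what we have in
mind (the stats.events table and its columns below are just
placeholders):

CREATE SCHEMA stats;

-- Parent table: the application only ever inserts into stats.events.
CREATE TABLE stats.events (
    id          bigserial,
    created_at  timestamptz NOT NULL DEFAULT now(),
    event_type  text        NOT NULL,
    payload     text
);

-- One child table per month. The CHECK constraints let
-- constraint_exclusion skip irrelevant children when querying by date.
CREATE TABLE stats.events_2008_06 (
    CHECK (created_at >= DATE '2008-06-01'
       AND created_at <  DATE '2008-07-01')
) INHERITS (stats.events);

CREATE TABLE stats.events_2008_07 (
    CHECK (created_at >= DATE '2008-07-01'
       AND created_at <  DATE '2008-08-01')
) INHERITS (stats.events);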

But we have been facing several questions/problems:

Polymorphism and ORM question:
- First, as we are using an ORM tool on top of PG, the rules we defined
for the polymorphism returned 0 rows after an insert, because the last
rule applied was generally not the one that actually made the insert. In
our case we know that only a single rule will match, so as a hack we
prefixed the name of the active rule with "zzz" so that it is applied
last. That is very hacky, but at least Hibernate is happy with it; a
sketch follows below.
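For illustration, here is roughly what that rule setup looks like, again
with the hypothetical stats.events tables from the sketch above. The
routing rules are conditional and only one of them matches a given row;
the "zzz" prefix on the rule for the current partition is the hack
mentioned above, so that it is applied last and the row count of its
insert is the command status Hibernate sees:

-- Route June rows to the June child table.
CREATE RULE events_insert_2008_06 AS
ON INSERT TO stats.events
WHERE NEW.created_at >= DATE '2008-06-01'
  AND NEW.created_at <  DATE '2008-07-01'
DO INSTEAD
    INSERT INTO stats.events_2008_06 VALUES (NEW.*);

-- Current partition: the "zzz" prefix makes the rule name sort last, so
-- the insert it performs is the one whose row count is reported back
-- ("INSERT 0 1" instead of "INSERT 0 0").
CREATE RULE zzz_events_insert_2008_07 AS
ON INSERT TO stats.events
WHERE NEW.created_at >= DATE '2008-07-01'
  AND NEW.created_at <  DATE '2008-08-01'
DO INSTEAD
    INSERT INTO stats.events_2008_07 VALUES (NEW.*);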

One or several databases, one or several servers?
- In such a case could we store both our application content and the
stats in the same database? Would it be better to use two databases in
the same cluster, or should we even have two different dedicated
servers?
- If we want fsync tuning, I suppose we need two separate servers. I
read that asynchronous commit can be set per transaction. Is there a way
to say that a given cluster or given tables use asynchronous commit by
default, perhaps with triggers? (See the sketch below.)
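As far as we can tell, synchronous_commit can be set per cluster
(postgresql.conf), per database, per role or per transaction, but not
directly per table; a sketch of those knobs, with made-up role and
database names (fsync itself can only be changed cluster-wide in
postgresql.conf):

-- Per transaction (the finest granularity):
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO stats.events (event_type, payload)
VALUES ('page_view', '/home');
COMMIT;

-- Per role: every session of a dedicated stats-writing role commits
-- asynchronously by default (role name is hypothetical).
ALTER ROLE stats_writer SET synchronous_commit = off;

-- Per database: useful if the stats live in their own database.
ALTER DATABASE statsdb SET synchronous_commit = off;

-- Per cluster: synchronous_commit = off in postgresql.conf.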

In any case we would like to archive the old collected data on slow file
storage, but we would like to avoid our database reaching terabytes just
for data collection concerns. Maybe this is a bad idea; anyway, if it is
not too bad, we have more questions:
With polymorphism we can dump some of the child tables regularly. But
polymorphism has been seen as a bit complex, so we are also studying
simpler approaches with fewer but larger stats tables.
We have studied the simple pg_dump command with only the data, but we
would need to dump only a part of the table. Thus we have looked at the
COPY command, which seems interesting in our case. Is there any
experience or feedback on that command? A rough sketch of what we have
in mind follows.
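(Table names, file paths and dates below are only placeholders; COPY
accepts an arbitrary query since 8.2, and COPY ... TO a server-side file
requires superuser rights, whereas psql's \copy writes on the client
side.)

-- Dump only part of a table: COPY can take a SELECT instead of a table.
COPY (SELECT * FROM stats.events
      WHERE created_at < DATE '2008-01-01')
TO '/archive/stats/events_2007.csv' WITH CSV;

-- With per-month child tables, archiving a whole month is simpler:
COPY stats.events_2007_12
TO '/archive/stats/events_2007_12.csv' WITH CSV;

-- Then "detach" the old partition from the parent and drop it.
ALTER TABLE stats.events_2007_12 NO INHERIT stats.events;
DROP TABLE stats.events_2007_12;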

Sorry, there are many questions; our problem is a bit wide because there
are several concerns:
- Polymorphism or not
- One or several DB clusters or servers
- Fsync/asynchronous commit
- Rule limitations
- Use of COPY
But to sum up, we would like to collect statistics (write-mostly tables,
high volume, data not critical) on the usage of an application backed by
a read-mostly DB, with the least impact on this DB's performance. And we
would also like to be able to archive the old collected data outside the
DB.

Thanks for any help!

Pascal


Re: Statistics Data archiving with Postgres

From: Craig Ringer
Date:
Pascal Cohen wrote:

> But to sum up, we would like to collect statistics (write-mostly
> tables, high volume, data not critical) on the usage of an application
> backed by a read-mostly DB, with the least impact on this DB's
> performance. And we would also like to be able to archive the old
> collected data outside the DB.

[Just throwing ideas around - there's probably a much better and already
well-established way to do all this]:

Does the stats data need to be handled within the same transactional
scope as the real app data? In other words, do you care if a stats entry
is generated for a transaction that then rolls back?

If you're OK with that, then another possible area to investigate is the
use of IPC messaging to a separate stats collector daemon outside of,
and running asynchronously with, PostgreSQL. Many of the PostgreSQL
procedural languages can communicate outside the server with mechanisms
like UNIX sockets, IP networking, signals, named pipes, and perhaps even
things like shared memory if the Pg backend's use of it doesn't
interfere. Alternatively, you could write your stats collector client as
a C add-in to Pg, which would probably let you minimize its performance
cost in exchange for more development and debugging time, plus a higher
risk to server stability.
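[As a minimal sketch of that idea, assuming PL/PythonU is installed: a
function that sends a fire-and-forget UDP datagram to an external
collector. The address, port and payload format are made up, and opening
a socket per call is deliberately naive.]

CREATE OR REPLACE FUNCTION stats_notify(event text) RETURNS void AS $$
    # Hypothetical example: send the event text as a UDP datagram to an
    # external stats collector; UDP never blocks on a slow collector.
    import socket
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.sendto(event, ('127.0.0.1', 9999))
    finally:
        s.close()
$$ LANGUAGE plpythonu;

-- The application (or a trigger) would then just call:
-- SELECT stats_notify('page_view /home');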

Ideally the part of the logging/stats code running inside the Pg backend
would do as little work as possible to record the message for later
processing. The external stats collector/processor would asynchronously
process the messages it receives from all Pg backends and record them in
your preferred format (perhaps inserting them into a separate
write-optimised Pg database on another host).

Come to think of it, having stats recorded for transactions that roll
back is probably desirable, rather than any sort of downside.

--
Craig Ringer