Moving to pgsql-hackers, as that's a more appropriate place for this
discussion.
On 27.11.2014 11:26, Maxim Boguk wrote:
>
> FWIW, I got curious and checked why we decided not to implement this
> while reworking the stats in 9.3, as keeping an is_dirty flag seems as a
> rather straightforward and simple optimization.
>
> Turns out it's actually considerably more complex, because one of the
> sources of statistics updates are (surprise surprise) autovacuum
> workers. So the whole flow may look like this (in >= 9.3):
>
> 1) launcher requests a fresh stats file (dbs list only)
> 2) worker is started for a particular DB (by launcher)
> 3) the worker requests a stats file (with details for the DB)
>
> Now, for (nearly) idle databases the worker does nothing and simply
> exits 99.9% of the time.
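If I understand the proposal correctly, the short-circuit would look
roughly like this (a toy sketch in Python for brevity; the names
'is_dirty', 'request_stats_file' and 'FakeCollector' are invented for
illustration and are not the actual pgstat.c / autovacuum.c API):

```python
# Toy model of the proposed is_dirty short-circuit in the autovacuum
# worker path. All names are invented; the real logic lives in
# src/backend/postmaster/ (autovacuum.c, pgstat.c).

def autovacuum_worker(db, collector):
    if not collector.is_dirty(db):
        # (nearly) idle database: skip requesting a fresh stats file
        # and exit immediately, saving the stats file write
        return "skipped"
    stats = collector.request_stats_file(db)  # the expensive part
    return "vacuumed"

class FakeCollector:
    def __init__(self, dirty_dbs):
        self.dirty = set(dirty_dbs)
    def is_dirty(self, db):
        return db in self.dirty
    def request_stats_file(self, db):
        return {}

c = FakeCollector({"busy_db"})
print(autovacuum_worker("idle_db", c))   # -> skipped
print(autovacuum_worker("busy_db", c))   # -> vacuumed
```

The whole question, of course, is what is_dirty() should actually test.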
While it may sound trivial, I believe it's actually more complicated
than just adding an 'is_dirty' flag.
The first question is what 'nearly idle' means. Is that a number of
modified rows? A number of tuples read from tables? Because this would
affect autovacuum processes as well as regular backends (i.e. reading
from pg_stat_user_tables might return stale data). I don't see a
definition that's obviously correct :-(
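To make that concrete, here are two hypothetical candidate predicates
(the threshold values and function names are entirely made up, which is
exactly the problem) that disagree on the same database:

```python
# Two hypothetical "is this database nearly idle?" predicates. The
# counters resemble what the stats collector tracks per database;
# the thresholds are arbitrary.

def idle_by_writes(n_tup_modified, threshold=0):
    return n_tup_modified <= threshold

def idle_by_reads(n_tup_returned, threshold=0):
    return n_tup_returned <= threshold

# A database with heavy reads but no writes:
db = {"n_tup_modified": 0, "n_tup_returned": 1_000_000}

print(idle_by_writes(db["n_tup_modified"]))  # True  -> skip stats writes
print(idle_by_reads(db["n_tup_returned"]))   # False -> skipping would make
                                             # pg_stat_* views go stale
```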
Secondly, even a 'nearly idle' database needs a vacuum/analyze from time
to time, so there needs to be some logic to detect that. The logic for
updating the is_dirty flag (or maybe for reading its value) has to take
that into account.
The only way I can think of is duplicating the 'needs vacuum / analyze'
detection into the collector, and setting the flag to 'true' when
there's at least one object in need of autovacuum. This is however much
more complex than a simple is_dirty flag, and I'm not sure it's really
worth it ...
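For reference, the detection that would have to be duplicated into the
collector is essentially the threshold test from
relation_needs_vacanalyze() in autovacuum.c, sketched here with the
default GUC values (simplified: the real code also handles per-table
reloptions, reltuples == 0, and a few other cases):

```python
# Simplified sketch of the autovacuum threshold test (compare
# relation_needs_vacanalyze() in autovacuum.c). Defaults shown:
# autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2,
# autovacuum_analyze_threshold = 50, autovacuum_analyze_scale_factor = 0.1.

def needs_vacuum(reltuples, dead_tuples,
                 base_thresh=50, scale_factor=0.2):
    return dead_tuples > base_thresh + scale_factor * reltuples

def needs_analyze(reltuples, changed_tuples,
                  base_thresh=50, scale_factor=0.1):
    return changed_tuples > base_thresh + scale_factor * reltuples

# A 10000-row table needs 50 + 0.2 * 10000 = 2050 dead tuples:
print(needs_vacuum(10000, 2000))   # False
print(needs_vacuum(10000, 2100))   # True
```

So the collector would need per-table reltuples and dead/changed tuple
counts plus the relevant GUC values, just to decide whether to set the
flag.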
> And if there are a lot of idle/low-activity DBs in the cluster,
> is_dirty tracking would be beneficial on both pre-9.3 and post-9.3
> versions (and in 9.3+ it would be especially effective because
> is_dirty would be tracked on a per-DB basis).
I don't think so. First, there are practical issues making it way more
complex than you assume (see my explanation above).
Second, let's do a bit of math. With 5MB of stats, 10 databases and 1s
naptime, the old (pre-9.3) implementation writes ~50MB/s. With the new
implementation, this drops to ~5MB/s, because only stats for that
particular database need to be written.
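In numbers (assuming the stats are spread evenly across the databases,
i.e. ~0.5MB per database, and one stats request per database per
naptime interval):

```python
# Back-of-the-envelope I/O estimate for the scenario above: 5MB of
# stats, 10 databases, 1s naptime. Assumes an even split of the stats
# across databases.

stats_total_mb = 5.0
n_databases = 10
naptime_s = 1.0

requests_per_s = n_databases / naptime_s            # 10 requests/second

old_mb_per_s = requests_per_s * stats_total_mb      # whole file, every time
new_mb_per_s = requests_per_s * (stats_total_mb / n_databases)

print(old_mb_per_s)   # 50.0 MB/s written (pre-9.3)
print(new_mb_per_s)   # 5.0 MB/s written (9.3+)
```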
That's a huge improvement. And even if such an I/O load is an issue
(which I'd say is unlikely), you can move the pg_stat_tmp directory to a
tmpfs filesystem. Problem solved.
Also, you're the first person I've seen with such a low naptime value
(which is really the culprit here), so most people don't run into this
issue at all.
And I'm still not convinced this is the best way to deal with the
workload you described (tables that grow / get deleted quickly).
I'm not going to spend more time hacking on this, as I doubt it's worth
the effort. Feel free to propose a patch, of course.
regards
Tomas