Discussion: pg_stat_tmp


pg_stat_tmp

From: Tim Kane
Date:
Hi all,

The past few days I’ve been encountering the following error, followed by a full db restart and recovery


2013-12-16 07:12:53 GMT LOG:  could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device


This occurs at a time of moderate load, during the same set of operations each morning.
Interestingly, when I execute this manually at any other time of day, the process completes normally.

I presume that the pg_stat_tmp location is system-wide and likely is not impacted by temp_tablespaces.
The root partition, where PostgreSQL is installed, does not have a lot of disk space available (4 GB).

My first instinct here is to symlink pg_stat_tmp to another disk with a little more room to breathe; however, I’m surprised that pgstat.tmp would grow to be so large in the first place – possibly there is something else at play here.

I should note that this issue has only started occurring recently; no major changes have been introduced.

Thanks for any advice on this.

Tim

Re: pg_stat_tmp

From: Magnus Hagander
Date:
On Mon, Dec 16, 2013 at 2:57 PM, Tim Kane <tim.kane@gmail.com> wrote:
Hi all,

The past few days I’ve been encountering the following error, followed by a full db restart and recovery


2013-12-16 07:12:53 GMT LOG:  could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device


This occurs at a time of moderate load, during the same set of operations each morning.
Interestingly, when I execute this manually at any other time of day, the process completes normally.

I presume that the pg_stat_tmp location is system-wide and likely is not impacted by temp_tablespaces.
The root partition, where PostgreSQL is installed, does not have a lot of disk space available (4 GB).

My first instinct here is to symlink pg_stat_tmp to another disk with a little more room to breathe; however, I’m surprised that pgstat.tmp would grow to be so large in the first place – possibly there is something else at play here.

I should note that this issue has only started occurring recently; no major changes have been introduced.

Thanks for any advice on this.


You can control the location of the file with the parameter stats_temp_directory.

The size is most likely because the directory needs space for >1 copy of the file, since it's written to a separate file and then rename()d over the existing one for atomicity. That said, even with that it shouldn't grow *that* large, unless you've got a huge number of tables. Certainly not into the gigabyte range. So it could be that it's just the first indicator that fires, but that there's actually something else using up your space. But setting stats_temp_directory is a good first way to find that out.
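
A minimal sketch of that change (the path below is illustrative; the directory must exist and be writable by the postgres user, and the setting takes effect on a reload rather than a restart):

# postgresql.conf
stats_temp_directory = '/mnt/bigdisk/pg_stat_tmp'

-- then, from psql, reload and verify:
SELECT pg_reload_conf();
SHOW stats_temp_directory;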
 

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: pg_stat_tmp

From: Jeff Janes
Date:
On Mon, Dec 16, 2013 at 5:57 AM, Tim Kane <tim.kane@gmail.com> wrote:
Hi all,

The past few days I’ve been encountering the following error, followed by a full db restart and recovery


2013-12-16 07:12:53 GMT LOG:  could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device

Is that the only thing in the logs?  pg_stat_tmp problems should not bring down your database.  But problems with pg_xlog running out of space certainly can--but they should also be logged.
 


This occurs at a time of moderate load, during the same set of operations each morning.
Interestingly, when I execute this manually at any other time of day, the process completes normally.

I presume that the pg_stat_tmp location is system-wide and likely is not impacted by temp_tablespaces.
The root partition, where PostgreSQL is installed, does not have a lot of disk space available (4 GB).

My first instinct here is to symlink pg_stat_tmp to another disk with a little more room to breathe; however, I’m surprised that pgstat.tmp would grow to be so large in the first place – possibly there is something else at play here.

We don't know how large it is getting!  If pg_stat_tmp shares the same partition as pg_xlog, base (as in the default configuration), and pg_log, then any of those things could be filling up the partition, and pg_stat_tmp could just be the canary, not the culprit.

Anyway, you don't need to use a symlink; you could just change stats_temp_directory to point someplace else.

Your best bet is to run "du" or something similar to figure out where your space is actually going.
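
For illustration, something along these lines, where the data directory path is an assumption and should be adjusted to the actual layout:

# per-directory usage inside the cluster directory, smallest to largest
du -sh /var/lib/postgresql/9.2/main/* | sort -h

# overall free space on the root partition
df -h /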
 
Cheers,

Jeff

Re: pg_stat_tmp

From: Tim Kane
Date:

Thanks Jeff, Magnus

Thanks for the suggestions.
This morning the same issue occurred, but this time it also complained that it failed to write to pg_clog (on previous days, the only failure was in writing to pg_stat_tmp).


2013-12-17 07:13:04 GMT DETAIL:  Could not write to file "pg_clog/0004" at offset 212992: No space left on device.
2013-12-17 07:13:05 GMT ERROR:  could not access status of transaction 0
2013-12-17 07:13:05 GMT DETAIL:  Could not write to file "pg_clog/0004" at offset 212992: No space left on device.
2013-12-17 07:13:06 GMT ERROR:  could not access status of transaction 0
2013-12-17 07:13:06 GMT DETAIL:  Could not write to file "pg_clog/0004" at offset 212992: No space left on device.
2013-12-17 07:13:07 GMT LOG:  could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
2013-12-17 07:13:07 GMT LOG:  could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
2013-12-17 07:13:08 GMT LOG:  could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
2013-12-17 07:13:08 GMT LOG:  could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on dev2013-12-17 07:25:15 GMT WARNING:  terminating connection because of crash of another server process
2013-12-17 07:25:15 GMT DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2013-12-17 07:25:15 GMT HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2013-12-17 07:25:16 GMT LOG:  all server processes terminated; reinitializing
2013-12-17 07:25:17 GMT LOG:  database system was interrupted; last known up at 2013-12-17 07:08:22 GMT
2013-12-17 07:25:17 GMT LOG:  database system was not properly shut down; automatic recovery in progress
2013-12-17 07:25:17 GMT LOG:  redo starts at 5BC/CF3F9888
2013-12-17 07:25:18 GMT LOG:  could not open file "pg_xlog/00000001000005BC000000D6" (log file 1468, segment 214): No such file or directory
2013-12-17 07:25:18 GMT LOG:  redo done at 5BC/D5FFFEC0
2013-12-17 07:25:18 GMT LOG:  last completed transaction was at log time 2013-12-17 07:25:12.989653+00
2013-12-17 07:25:19 GMT LOG:  database system is ready to accept connections
2013-12-17 07:25:19 GMT LOG:  autovacuum launcher started



It’s never failed to write pg_xlog segments though – or at least, it’s never reported a failure.

For now, I’ve altered the pg_stat_tmp path and we’ll see what happens tomorrow – ideally though, yes, I’d like to see this happening at runtime and get a better picture of what’s blowing out here.



Further to this, however, I notice that the base directory is consuming more disk than I would have expected it to (all our relations are stored in tablespaces assigned to other disks).
Looking for the larger culprits, I note the following files consuming 4.4 GB:

9.2/main/base/74641/2260957
9.2/main/base/74641/2260957.1
9.2/main/base/74641/2260957.2
9.2/main/base/74641/2260957.3
9.2/main/base/74641/2260957.4

I notice the timestamps are just shy of the time we experience that pg_stat_tmp issue.

-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.1
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.2
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.3
-rw------- 1 postgres postgres  328466432 Dec 17 06:56 2260957.4


They appear to be orphaned OIDs, though I’m not sure this is definitive:

clone=# select pg_relation_filepath(2260957);
 pg_relation_filepath
----------------------

(1 row)



I probably would have thought these to be temporary files – except that I have temp_tablespaces set elsewhere.

Are these truly orphaned files that PostgreSQL no longer needs (possibly due to a crash)? Is there a mechanism to scan the PGDATA directories to locate such files?
FYI, we’re running 9.2.4.


Thanks again,

Tim




Re: pg_stat_tmp

From: Jeff Janes
Date:
On Tue, Dec 17, 2013 at 3:25 AM, Tim Kane <tim.kane@gmail.com> wrote:

Thanks Jeff, Magnus

Thanks for the suggestions.
This morning the same issue occurred, but this time it also complained that it failed to write to pg_clog (on previous days, the only failure was in writing to pg_stat_tmp).

 
...
 
2013-12-17 07:13:08 GMT LOG:  could not write temporary statistics file "pg_stat_tmp/pgstat.tmp": No space left on dev2013-12-17 07:25:15 GMT WARNING:  terminating connection because of crash of another server process

 ...
 


It’s never failed to write pg_xlog segments though – or at least, it’s never reported a failure.

If your log file is on the same partition as all the other stuff, you are probably losing log messages because there is no room to log them.  The mangled line above tends to support this.  Logging stopped for 12 minutes, until the crash freed up some space so it could resume.
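
As a quick check of whether those directories really do share one filesystem, something like the following can be run; the paths are illustrative rather than taken from this system:

# show which filesystem each directory lives on, and how full it is
df -h /var/lib/postgresql/9.2/main/pg_xlog \
      /var/lib/postgresql/9.2/main/pg_log \
      /var/lib/postgresql/9.2/main/base \
      /var/lib/postgresql/9.2/main/pg_stat_tmp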

 

For now, I’ve altered the pg_stat_tmp path and we’ll see what happens tomorrow – ideally though, yes, I’d like to see this happening at runtime and get a better picture of what’s blowing out here.



Further to this, however, I notice that the base directory is consuming more disk than I would have expected it to (all our relations are stored in tablespaces assigned to other disks).
Looking for the larger culprits, I note the following files consuming 4.4 GB:

9.2/main/base/74641/2260957
9.2/main/base/74641/2260957.1
9.2/main/base/74641/2260957.2
9.2/main/base/74641/2260957.3
9.2/main/base/74641/2260957.4

I notice the timestamps are just shy of the time we experience that pg_stat_tmp issue.

-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.1
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.2
-rw------- 1 postgres postgres 1073741824 Dec 17 06:56 2260957.3
-rw------- 1 postgres postgres  328466432 Dec 17 06:56 2260957.4


They appear to be orphaned OIDs, though I’m not sure this is definitive:

clone=# select pg_relation_filepath(2260957);

The transaction that was making the table probably never committed, so its entry never becomes visible.  I don't know of any good way of finding out what the name of an uncommitted object will eventually be, which is rather annoying when trying to monitor massive loads that happen inside a large transaction.

(And what you should really be doing is "select relname from pg_class where relfilenode = 2260957"; the relfilenode is often the same as the OID, but not always.)
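
A sketch of that lookup, plus the reverse mapping; the table name below is a placeholder:

-- map a file in base/<database-oid>/ back to a relation via its relfilenode;
-- an empty result is consistent with an orphaned file or an uncommitted load
SELECT oid::regclass AS relation, relname
FROM pg_class
WHERE relfilenode = 2260957;

-- the reverse direction: the on-disk filenode for a known relation,
-- which also covers cases where relfilenode differs from the OID
SELECT pg_relation_filenode('my_table');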

Cheers,

Jeff