Discussion: Errors on missing pg_subtrans/ files with 9.3
G'day list. Didn't get any interest in pgsql-general, thought I'd try my luck here, which perhaps would be more fitting in case I've stumbled upon an edge case issue or something... I've recently upgraded a number of servers from PostgreSQL 9.2.5 to 9.3.1 and have started getting the following errors every couple of hours, along with some failed transactions. I have been unable to track down any rhyme or reason for the errors yet, so I figured I'd check with the mailing list to see if I've potentially hit a bug. The errors look like the following:

ERROR: could not access status of transaction 4179979
Could not open file "pg_subtrans/003F": No such file or directory.

Indeed, pg_subtrans/003F does not exist on the file system. I have found similar errors mentioned before, a number of years ago, in a thread starting here, but couldn't find any real resolution:

http://www.postgresql.org/message-id/20111109150225.076c2f38@list.ru

A few details:

- the new database cluster was created by dumping global settings using pg_dumpall and then dumping each database individually. Several of the databases within this cluster have PostGIS installed and I wanted to do a full PostGIS upgrade, which necessitated individual dumps.
- this is the only database in a cluster of 14 databases that is exhibiting the problem, and it is sporadic at best. This database is fairly write-intensive and has been up since November 6th. In that time, we've had 17 such errors.
- the pg_subtrans files mentioned in the log file do not exist on the file system.
- I have rebuilt the cluster several times, as well as cleaned out the pg_statistic table and run `VACUUM FULL ANALYZE` on every database in case that was affecting it, based on some similar threads back from the PostgreSQL 8.4 days, but there was no effect and the errors still occur.

Anyone have any suggestions or ideas? Thankfully these are dev systems, so I have some room to experiment and can post more details as necessary.

The following are the log file entries from one of the errors. Some details have been sanitized for privacy reasons, but the overall lines themselves are accurate.

Nov 10 10:14:02 dev-server postgres[29835]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 4179979
Nov 10 10:14:02 dev-server postgres[29835]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/003F": No such file or directory.
Nov 10 10:14:02 dev-server postgres[29835]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 10 10:14:02 dev-server postgres[29835]: [4-4] user=dev,db=dev STATEMENT: insert into raw (url, id) values ($1, $2)
Nov 10 10:14:02 dev-server postgres[29839]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 4179979
Nov 10 10:14:02 dev-server postgres[29839]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/003F": No such file or directory.
Nov 10 10:14:02 dev-server postgres[29839]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 10 10:14:02 dev-server postgres[29839]: [4-4] user=dev,db=dev STATEMENT: update listings set value_a=$1 where id=$2
Nov 10 10:14:02 dev-server postgres[29827]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 4179979
Nov 10 10:14:02 dev-server postgres[29827]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/003F": No such file or directory.
Nov 10 10:14:02 dev-server postgres[29827]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"

I can provide additional details like my postgresql.conf if it would help, although it's fairly standard outside of tuning as provided by pgtune.

Cheers and thanks, List.
Hi,

On 2013-11-12 10:56:55 -0500, J Smith wrote:
> G'day list. Didn't get any interest in pgsql-general, thought I'd try
> my luck here, which perhaps would be more fitting in case I've
> stumbled upon an edge case issue or something...

Normally the bug report form / the -bugs mailing list is the right one for things like this.

> ERROR: could not access status of transaction 4179979
> Could not open file "pg_subtrans/003F": No such file or directory.

A couple of questions:
* Did you have any OS or hardware crashes?
* Do you run with fsync=off or full_page_writes=off?
* Could you post the pg_controldata output?
* Is it always the same xid?
* Does SELECT count(*) FROM pg_prepared_xacts; return 0?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
* J Smith (dark.panda+lists@gmail.com) wrote:
> I've recently upgraded a number of servers from PostgreSQL 9.2.5 to
> 9.3.1 and have started getting the following errors every couple of
> hours along with some failed transactions.

How was this upgrade done? If you used pg_upgrade, what version of the pg_upgrade code did you use? As I recall, there was a bug in older versions which could manifest in this way...

http://wiki.postgresql.org/wiki/20110408pg_upgrade_fix

Thanks,

Stephen
On 2013-11-12 11:25:03 -0500, Stephen Frost wrote:
> * J Smith (dark.panda+lists@gmail.com) wrote:
> > I've recently upgraded a number of servers from PostgreSQL 9.2.5 to
> > 9.3.1 and have started getting the following errors every couple of
> > hours along with some failed transactions.
>
> How was this upgrade done? If you used pg_upgrade, what version of the
> pg_upgrade code did you use? As I recall, there was a bug in older
> versions which could manifest in this way...

He referred to using pg_dumpall/pg_dump. But that bug was erroring out on pg_clog, not pg_subtrans, right?

Getting such errors on pg_subtrans indicates corruption or a bug somewhere: we don't do such lookups if the to-be-looked-up xid is older than TransactionXmin, and we zero out pg_subtrans at restarts.

My gut feeling is that it's related to foreign key locks doing XactLockTableWait()s on too-old transactions or something like that.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
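(For context on the invariant Andres mentions: pg_subtrans lookups are only legal for xids no older than TransactionXmin, because pg_subtrans is zeroed at restart and truncated as the global xmin advances. What follows is a rough, from-memory sketch of the guard in src/backend/access/transam/subtrans.c -- approximate, not a verbatim quote of the 9.3 sources:)

	TransactionId
	SubTransGetTopmostTransaction(TransactionId xid)
	{
		TransactionId parentXid = xid,
					  previousXid = xid;

		/* Can't ask about stuff that might not be around anymore */
		Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));

		while (TransactionIdIsValid(parentXid))
		{
			previousXid = parentXid;
			if (TransactionIdPrecedes(parentXid, TransactionXmin))
				break;
			parentXid = SubTransGetParent(parentXid);	/* reads pg_subtrans */
		}

		return previousXid;
	}

So a lookup that lands on a removed pg_subtrans segment implies some caller passed an xid it shouldn't have -- which is why the suspicion falls on the new foreign-key-lock code rather than ordinary on-disk corruption.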
On Tue, Nov 12, 2013 at 11:25 AM, Stephen Frost <sfrost@snowman.net> wrote: > > How was this upgrade done? If you used pg_upgrade, what version of the > pg_upgrade code did you use? As I recall, there was a bug in older > versions which could exhibit in this way.. > > http://wiki.postgresql.org/wiki/20110408pg_upgrade_fix > As mentioned in the original post... "the new database cluster was created by dumping global settings using pg_dumpall and then dumping each database individually." We haven't been able to use pg_upgrade as we rely heavily on PostGIS and do hard upgrades via pg_dump and postgis_restore.pl when we upgrade. Cheers
G'day Andres.

On Tue, Nov 12, 2013 at 11:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> On 2013-11-12 10:56:55 -0500, J Smith wrote:
>> G'day list. Didn't get any interest in pgsql-general, thought I'd try
>> my luck here, which perhaps would be more fitting in case I've
>> stumbled upon an edge case issue or something...
>
> Normally the bug report form / the -bugs mailing list is the right one for
> things like this.

Ah, okay. I can redirect there if that's the better location. (Assuming it's a bug and not a misconfiguration on my part, although the configuration on these servers has been consistent since 9.0, and we've upgraded through each of the 9.x series while merging together new settings accordingly.)

>> ERROR: could not access status of transaction 4179979
>> Could not open file "pg_subtrans/003F": No such file or directory.
>
> A couple of questions:
> * Did you have any OS or hardware crashes?

Haven't had any issues on this machine.

> * Do you run with fsync=off or full_page_writes=off?

Both fsync and full_page_writes are in their default "on" configurations. I'll post our full postgresql.conf below, as it is quite short.

> * Could you post the pg_controldata output?

Sure, posted below.

> * Is it always the same xid?

Seems to be different every time.

> * Does SELECT count(*) FROM pg_prepared_xacts; return 0?

Yes it does.

Here's the postgresql.conf and pg_controldata dump.

# postgresql.conf

listen_addresses = '*'
wal_level = hot_standby
checkpoint_completion_target = 0.7
max_wal_senders = 3
wal_keep_segments = 64
logging_collector = off
datestyle = 'iso, mdy'
timezone = 'Canada/Eastern'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 100
maintenance_work_mem = 1GB
effective_cache_size = 44GB
work_mem = 320MB
wal_buffers = 4MB
checkpoint_segments = 64
shared_buffers = 15GB
max_connections = 200

# pg_controldata

pg_control version number: 937
Catalog version number: 201306121
Database system identifier: 5942827484423487452
Database cluster state: in production
pg_control last modified: Tue 12 Nov 2013 11:39:07 AM EST
Latest checkpoint location: 30/50F7A180
Prior checkpoint location: 30/4914C428
Latest checkpoint's REDO location: 30/4CB2C450
Latest checkpoint's REDO WAL file: 00000001000000300000004C
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/5138174
Latest checkpoint's NextOID: 3574036
Latest checkpoint's NextMultiXactId: 5762623
Latest checkpoint's NextMultiOffset: 11956824
Latest checkpoint's oldestXID: 1673
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 5138174
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint: Tue 12 Nov 2013 11:35:37 AM EST
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 200
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0

Cheers and thanks for the help all.
* Andres Freund (andres@2ndquadrant.com) wrote:
> He referred to using pg_dumpall/pg_dump. But that bug was erroring out
> on pg_clog, not pg_subtrans, right?

Yeah, that was pg_clog. Obviously responded before really looking at it. :)

> My gut feeling is that it's related to foreign key locks doing
> XactLockTableWait()s on too-old transactions or something like that.

Ugh, that wouldn't be good...

Thanks,

Stephen
* J Smith (dark.panda+lists@gmail.com) wrote: > We haven't been able to use pg_upgrade as we rely heavily on PostGIS > and do hard upgrades via pg_dump and postgis_restore.pl when we > upgrade. Did you also upgrade to PostGIS 2.x as part of this..? Seems like it'd be unrelated, but one never knows. Any chance you could distill this down into a small test case which exhibits the problem? I'm guessing 'no', but figured I'd ask anyway.. Thanks, Stephen
Hi,

On 2013-11-12 11:46:19 -0500, J Smith wrote:
> > * Does SELECT count(*) FROM pg_prepared_xacts; return 0?
>
> Yes it does.

Could you show the output? Do you actually use prepared xacts actively? Do you actively use row level locking? Is there high concurrency in that environment? In short, could you describe your database usage a bit?

Did you restart the database since getting the errors?

> Latest checkpoint's NextXID: 0/5138174
> Latest checkpoint's NextOID: 3574036
> Latest checkpoint's NextMultiXactId: 5762623
> Latest checkpoint's NextMultiOffset: 11956824
> Latest checkpoint's oldestXID: 1673
> Latest checkpoint's oldestXID's DB: 1
> Latest checkpoint's oldestActiveXID: 5138174
> Latest checkpoint's oldestMultiXid: 1

Interesting. Pretty high multixact usage.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 12, 2013 at 11:54 AM, Stephen Frost <sfrost@snowman.net> wrote: > > Did you also upgrade to PostGIS 2.x as part of this..? Seems like it'd > be unrelated, but one never knows. Any chance you could distill this > down into a small test case which exhibits the problem? I'm guessing > 'no', but figured I'd ask anyway.. > Yeah, we upgraded to PostGIS 2.1 as part of the upgrade. In the case of the specific database that is exhibiting the problem, we only really use PostGIS for storage and don't do a lot of spatial querying, as this database is basically for an aggregation service we run. In the case of the transactions that appear to be causing the problems, we don't even use geometry fields at all. Cheers
On Tue, Nov 12, 2013 at 11:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> On 2013-11-12 11:46:19 -0500, J Smith wrote:
>> > * Does SELECT count(*) FROM pg_prepared_xacts; return 0?
>>
>> Yes it does.
>
> Could you show the output? Do you actually use prepared xacts actively?

jay:dev@jagger=# select * from pg_prepared_xacts;
 transaction | gid | prepared | owner | database
-------------+-----+----------+-------+----------
(0 rows)

We don't actively use prepared transactions, no. We use a lot of prepared statements, but no two-phase committing.

> Do you actively use row level locking?

Not to my knowledge, no. I've enabled statement logging for a little while so we can get a snapshot of what our average queries look like, but I'm not seeing any PREPARE TRANSACTION or explicit LOCK statements.

> Is there high concurrency in that environment? In short, could you describe your database usage a bit?

This particular database is used for an aggregation service. There are a handful of Hibernate-based Java processes that manipulate and massage the data from a number of data feeds we receive 24-7. From a quick grepping of the statement log file, we captured the following over the course of approximately 13 minutes:

167452 SELECTs
72388 UPDATEs
6782 INSERTs
2 DELETEs

> Did you restart the database since getting the errors?

Yes, we've restarted several times.

>> Latest checkpoint's NextXID: 0/5138174
>> Latest checkpoint's NextOID: 3574036
>> Latest checkpoint's NextMultiXactId: 5762623
>> Latest checkpoint's NextMultiOffset: 11956824
>> Latest checkpoint's oldestXID: 1673
>> Latest checkpoint's oldestXID's DB: 1
>> Latest checkpoint's oldestActiveXID: 5138174
>> Latest checkpoint's oldestMultiXid: 1
>
> Interesting. Pretty high multixact usage.

Part of our process involves collecting several records and aggregating them within a single transaction. There will usually be a handful of SELECTs to fetch the data, a bit of processing in the Java backend, then a set of either UPDATEs or INSERTs into two tables to store the results of the processing, or in rarer cases a couple of DELETEs. Our Java guy would have more details, as I just have the output of the log files and a high-level view of how our backend processes work, but this is the general gist of things as I understand it.

Cheers
Looks like we got another set of errors overnight. Here's the log file from the errors. (Log file scrubbed slightly to remove private data, but still representative of the problem I believe.)

Nov 13 05:34:34 dev postgres[6084]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6084]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6084]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6084]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6076]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6076]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6076]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6076]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6087]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6087]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6087]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6087]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6086]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6086]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6086]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6086]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6088]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6088]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6088]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6088]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6085]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6085]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6085]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6085]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3

Several processes all seemed to hit the problem at the same moment, and all of them refer to the same transaction ID. Again, the file pg_subtrans/0060 doesn't exist, and the only file that does exist is pg_subtrans/005A, which appears to be a zeroed-out file 245760 bytes in length.

Still don't have a clue as to how I can reproduce the problem. In every case the error occurred in a manner identical to those shown in the log above, the only difference being whether the triggering statement was an UPDATE on table_X or an INSERT on table_Y.

Not sure what direction I should head in now. Perhaps some aggressive logging would help, so we can see the queries surrounding the problems? I could reconfigure things to capture all statements and set up monit or something to send an alert when the problem resurfaces, for instance.

Cheers all.
On Wed, Nov 13, 2013 at 12:29 PM, J Smith <dark.panda+lists@gmail.com> wrote:
> Looks like we got another set of errors overnight. Here's the log file
> from the errors. (Log file scrubbed slightly to remove private data,
> but still representative of the problem I believe.)
>
> Nov 13 05:34:34 dev postgres[6084]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6084]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6084]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6084]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6076]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6076]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6076]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6076]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6087]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6087]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6087]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6087]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6086]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6086]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6086]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6086]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6088]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6088]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6088]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6088]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6085]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6085]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6085]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6085]: [4-4] user=dev,db=dev STATEMENT: update listings set deactivated_at=$1 where id=$2 and lock_version=$3
>
> Several processes all seemed to hit the problem at the same moment,
> and all of them refer to the same transaction ID. Again, the file
> pg_subtrans/0060 doesn't exist, and the only file that does exist is
> pg_subtrans/005A, which appears to be a zeroed-out file 245760 bytes in
> length.
>
> Still don't have a clue as to how I can reproduce the problem. In
> every case the error occurred in a manner identical to those shown in
> the log above, the only difference being whether the triggering
> statement was an UPDATE on table_X or an INSERT on table_Y.
>
> Not sure what direction I should head in now. Perhaps some aggressive
> logging would help, so we can see the queries surrounding the
> problems? I could reconfigure things to capture all statements and set
> up monit or something to send an alert when the problem resurfaces,
> for instance.
>
> Cheers all.

I think what would help the most is if you could arrange to obtain a stack backtrace at the point when the error is thrown. Maybe put a long sleep call in just before the error happens, and when it gets stuck there, attach gdb and run bt full.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 15, 2013 at 3:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > I think what would help the most is if you could arrange to obtain a > stack backtrace at the point when the error is thrown. Maybe put a > long sleep call in just before the error happens, and when it gets > stuck there, attach gdb and run bt full. > That could potentially be doable. Perhaps I could use something like google-coredumper or something similar to have a core dump generated if the error comes up? Part of the problem is that the error is so sporadic that it's going to be tough to say when the next one will occur. For instance, we haven't changed our load on the server, yet the error hasn't occurred since Nov 13, 15:01. I'd also like to avoid blocking on the server with sleep or anything like that unless absolutely necessary, as there are other services we have in development that are using other databases on this cluster. (I can as a matter of last resort, of course, but if google-coredumper can do the job I'd like to give that a shot first.) Any hints on where I could insert something like this? Should I try putting it into the section of elog.c dealing with ENOENT errors, or try to find a spot closer to where the file itself is being opened? I haven't looked at Postgres internals for a while now so I'm not quite sure of the best location for this sort of thing. Cheers
On Fri, Nov 15, 2013 at 4:01 PM, J Smith <dark.panda+lists@gmail.com> wrote: > On Fri, Nov 15, 2013 at 3:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think what would help the most is if you could arrange to obtain a >> stack backtrace at the point when the error is thrown. Maybe put a >> long sleep call in just before the error happens, and when it gets >> stuck there, attach gdb and run bt full. >> > > That could potentially be doable. Perhaps I could use something like > google-coredumper or something similar to have a core dump generated > if the error comes up? Part of the problem is that the error is so > sporadic that it's going to be tough to say when the next one will > occur. For instance, we haven't changed our load on the server, yet > the error hasn't occurred since Nov 13, 15:01. I'd also like to avoid > blocking on the server with sleep or anything like that unless > absolutely necessary, as there are other services we have in > development that are using other databases on this cluster. (I can as > a matter of last resort, of course, but if google-coredumper can do > the job I'd like to give that a shot first.) > > Any hints on where I could insert something like this? Should I try > putting it into the section of elog.c dealing with ENOENT errors, or > try to find a spot closer to where the file itself is being opened? I > haven't looked at Postgres internals for a while now so I'm not quite > sure of the best location for this sort of thing. I'd look for the specific ereport() call that's firing, and put it just before that. (note that setting the error verbosity to 'verbose' will give you the file and line number where the error is happening, which is useful if the message can be generated from more than one place) I'm not familiar with google-coredumper but it sounds like a promising technique. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
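(The ereport() in question appears to come from SlruReportIOError() in src/backend/access/transam/slru.c. Purely as an illustration of the tweak Robert describes -- not a tested patch, and with the surrounding code reproduced from memory -- it could look roughly like this:)

	case SLRU_OPEN_FAILED:
		/*
		 * Debugging aid, illustrative only: pause for ten minutes so gdb
		 * can be attached to this backend and "bt full" run before the
		 * ERROR is raised and the stack unwinds.  pg_usleep() is declared
		 * in miscadmin.h.
		 */
		pg_usleep(600 * 1000000L);
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not access status of transaction %u", xid),
				 errdetail("Could not open file \"%s\": %m.", path)));
		break;

With that in place, attaching to the stuck backend with gdb -p <pid> and running bt full would capture the call chain before the transaction errors out.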
Alright, we'll look into doing that heading into the weekend. Interestingly, we haven't experienced the issue since our main Java developer made some modifications to our backend system. I'm not entirely sure what the changes entail except that it's a one-liner that involves re-SELECTing a table during a transaction. We'll rollback this change and re-compile Postgres with google-coredumper and let it run over the weekend and see where we stand. Cheers On Tue, Nov 19, 2013 at 9:14 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Nov 15, 2013 at 4:01 PM, J Smith <dark.panda+lists@gmail.com> wrote: >> On Fri, Nov 15, 2013 at 3:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I think what would help the most is if you could arrange to obtain a >>> stack backtrace at the point when the error is thrown. Maybe put a >>> long sleep call in just before the error happens, and when it gets >>> stuck there, attach gdb and run bt full. >>> >> >> That could potentially be doable. Perhaps I could use something like >> google-coredumper or something similar to have a core dump generated >> if the error comes up? Part of the problem is that the error is so >> sporadic that it's going to be tough to say when the next one will >> occur. For instance, we haven't changed our load on the server, yet >> the error hasn't occurred since Nov 13, 15:01. I'd also like to avoid >> blocking on the server with sleep or anything like that unless >> absolutely necessary, as there are other services we have in >> development that are using other databases on this cluster. (I can as >> a matter of last resort, of course, but if google-coredumper can do >> the job I'd like to give that a shot first.) >> >> Any hints on where I could insert something like this? Should I try >> putting it into the section of elog.c dealing with ENOENT errors, or >> try to find a spot closer to where the file itself is being opened? I >> haven't looked at Postgres internals for a while now so I'm not quite >> sure of the best location for this sort of thing. > > I'd look for the specific ereport() call that's firing, and put it > just before that. > > (note that setting the error verbosity to 'verbose' will give you the > file and line number where the error is happening, which is useful if > the message can be generated from more than one place) > > I'm not familiar with google-coredumper but it sounds like a promising > technique. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
On Tue, Nov 19, 2013 at 10:16 AM, J Smith <dark.panda+lists@gmail.com> wrote:
> Alright, we'll look into doing that heading into the weekend.
> Interestingly, we haven't experienced the issue since our main Java
> developer made some modifications to our backend system. I'm not
> entirely sure what the changes entail except that it's a one-liner
> that involves re-SELECTing a table during a transaction. We'll
> rollback this change and re-compile Postgres with google-coredumper
> and let it run over the weekend and see where we stand.

Okay, I've patched our Postgres install and added a call to coredumper to output a core dump. We'll let our processes run for the next few days and hopefully we'll get something we can look at.

I've attached the patch just in case anyone would like to make sure I've done it in a sane way and that I've inserted the call in the correct location.

Cheers
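(The attached patch itself isn't reproduced here. Purely to illustrate the kind of call being described -- using google-coredumper's WriteCoreDump() entry point and a hypothetical dump path -- it would be something along these lines:)

	#include <google/coredumper.h>	/* from the google-coredumper library */

	/*
	 * Illustrative only -- not the attached patch.  Placed immediately
	 * before the existing "could not access status of transaction"
	 * ereport() in SlruReportIOError(), this writes a core file for the
	 * current backend without terminating it.  The dump path below is
	 * hypothetical.
	 */
	if (WriteCoreDump("/tmp/pg_subtrans_error.core") != 0)
		elog(LOG, "WriteCoreDump failed");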
Attachments
coredumper worked like a charm. Useful tool, that is... although as a bit of advice, I'd try not to run it on Postgres if your various memory settings are tweaked towards production use -- the core dump that was captured on my server weighed in at 16 GB. Anyways, I've attached both the log entries that caused the error as well as a backtrace using `bt full`. Cheers, hope it helps.
Attachments
Hi,

On 2013-11-24 16:56:26 -0500, J Smith wrote:
> coredumper worked like a charm. Useful tool, that is... although as a
> bit of advice, I'd try not to run it on Postgres if your various
> memory settings are tweaked towards production use -- the core dump
> that was captured on my server weighed in at 16 GB.

> Nov 23 14:38:32 dev postgres[23810]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 13514992
> Nov 23 14:38:32 dev postgres[23810]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/00CE": Success.
> Nov 23 14:38:32 dev postgres[23810]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "dev"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"

Ok, this is helpful. Do you have rather long-running transactions? The transaction that does the foreign key checks has an xid of 10260613, while the row that's getting checked has 13514992.

#4 0x0000000000635dc7 in XactLockTableWait (xid=13514992) at lmgr.c:501
        tag = {locktag_field1 = 13514992, locktag_field2 = 0, locktag_field3 = 0, locktag_field4 = 0, locktag_type = 4 '\004', locktag_lockmethodid = 1 '\001'}
#5 0x0000000000482223 in heap_lock_updated_tuple_rec (rel=0x2b20f050a8d0, tuple=<value optimized out>, ctid=<value optimized out>, xid=10260613, mode=LockTupleKeyShare) at heapam.c:4847

I am not sure whether that's the origin of the problem, but at the very least it seems to me that heap_lock_updated_tuple_rec() is missing several important pieces:

a) do the priorXmax == xmin dance to check we're still following the same ctid chain. Currently we could easily stumble across completely unrelated tuples if a chain element aborted and got vacuumed.

b) Check whether a chain element actually aborted - currently we're only doing that in the HEAP_KEYS_UPDATED case, but that seems wrong (we can't check for committed tho!).

c) (reported separately as well) cope with failure of heap_fetch() to return anything (sketched below).

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
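(To make point (c) concrete: the missing handling amounts to roughly the following, sketched against the 9.3 heap_fetch() signature. This is illustrative only, not the patch that was actually posted:)

	/*
	 * Inside heap_lock_updated_tuple_rec()'s walk of the update chain:
	 * if the next chain member can't be fetched (e.g. its inserter
	 * aborted and the tuple has since been pruned or vacuumed away),
	 * there is nothing left to lock, so report success rather than
	 * operating on an uninitialized tuple.
	 */
	if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
		return HeapTupleMayBeUpdated;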
On Mon, Nov 25, 2013 at 6:47 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi, > > On 2013-11-24 16:56:26 -0500, J Smith wrote: > >> Nov 23 14:38:32 dev postgres[23810]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 13514992 >> Nov 23 14:38:32 dev postgres[23810]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/00CE": Success. >> Nov 23 14:38:32 dev postgres[23810]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "dev"."collection_batches"x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x" > > > Ok, this is helpful. Do you rather longrunning transactions? The > transaction that does foreign key checks has an xid of 10260613, while > the row that's getting checked has 13514992. We did have some long-running transactions, yes. We refactored a bit and removed them and the problem ceased on our end. We ended up reverting our changes for the sake of running this experiment over the weekend and the errors returned. We've since restored our fix and haven't had any problems since, so yeah, long-running transactions appear to be involved. We can continue to experiment if you have any additional tests you'd like us to run. We may have to keep the experiments to running over the weekend, but they're definitely do-able. Cheers
J Smith escribió: > We did have some long-running transactions, yes. We refactored a bit > and removed them and the problem ceased on our end. We ended up > reverting our changes for the sake of running this experiment over the > weekend and the errors returned. We've since restored our fix and > haven't had any problems since, so yeah, long-running transactions > appear to be involved. > > We can continue to experiment if you have any additional tests you'd > like us to run. We may have to keep the experiments to running over > the weekend, but they're definitely do-able. I am working on patches to get these bugs fixed. Would you be up for running a patched binary and see if the errors go away? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Nov 25, 2013 at 11:46 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > J Smith escribió: > >> We did have some long-running transactions, yes. We refactored a bit >> and removed them and the problem ceased on our end. We ended up >> reverting our changes for the sake of running this experiment over the >> weekend and the errors returned. We've since restored our fix and >> haven't had any problems since, so yeah, long-running transactions >> appear to be involved. >> >> We can continue to experiment if you have any additional tests you'd >> like us to run. We may have to keep the experiments to running over >> the weekend, but they're definitely do-able. > > I am working on patches to get these bugs fixed. Would you be up for > running a patched binary and see if the errors go away? Sure, just give me a git commit hash and I can do a test build towards the weekend. I'll include coredumper again just in case things go awry and we can go from there. Cheers
Andres Freund escribió:

> Ok, this is helpful. Do you have rather long-running transactions? The
> transaction that does the foreign key checks has an xid of 10260613, while
> the row that's getting checked has 13514992.

Thanks for the analysis.

> #4 0x0000000000635dc7 in XactLockTableWait (xid=13514992) at lmgr.c:501
>         tag = {locktag_field1 = 13514992, locktag_field2 = 0, locktag_field3 = 0, locktag_field4 = 0, locktag_type = 4 '\004', locktag_lockmethodid = 1 '\001'}
> #5 0x0000000000482223 in heap_lock_updated_tuple_rec (rel=0x2b20f050a8d0, tuple=<value optimized out>, ctid=<value optimized out>, xid=10260613, mode=LockTupleKeyShare) at heapam.c:4847
>
> I am not sure whether that's the origin of the problem, but at the very
> least it seems to me that heap_lock_updated_tuple_rec() is missing
> several important pieces:
>
> a) do the priorXmax == xmin dance to check we're still following the same
> ctid chain. Currently we could easily stumble across completely
> unrelated tuples if a chain element aborted and got vacuumed.

This seems simple to handle by adding the check you propose to the loop. Basically, if the xmax doesn't match the xmin, we reached the end of the chain, there's nothing more to lock, and we can return success without any further work:

	/*
	 * Check the tuple XMIN against prior XMAX, if any.  If we reached
	 * the end of the chain, we're done, so return success.
	 */
	if (TransactionIdIsValid(priorXmax) &&
		!TransactionIdEquals(HeapTupleHeaderGetXmin(mytup.t_data), priorXmax))
	{
		UnlockReleaseBuffer(buf);
		return HeapTupleMayBeUpdated;
	}

> b) Check whether a chain element actually aborted - currently we're
> only doing that in the HEAP_KEYS_UPDATED case, but that seems
> wrong (we can't check for committed tho!).

Let me point out that this is exactly the same code that would be affected by my proposed fix for #8434, which would have this check the updateXid in all cases, not only in KEYS_UPDATED as currently.

I don't understand your comment about "can't check for committed". I mean, if the updating transaction committed, then we need to return failure to the caller, HeapTupleUpdated. This signals to the caller that the future version of the tuple is locked and the whole thing needs to be failed. (In READ COMMITTED isolation level, this would cause EvalPlanQual to fetch the updated version of the tuple and redo the whole thing from there. In REPEATABLE READ and above, the whole operation would be aborted.)

In short, I propose to fix this part with the simple patch I proposed for bug #8434.

> c) (reported separately as well) cope with failure of heap_fetch() to
> return anything.

Proposed patch for this was posted in the other thread.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-11-25 17:10:39 -0300, Alvaro Herrera wrote: > > I am not sure whether that's the origin of the problem but at the very > > least it seems to me that heap_lock_updated_tuple_rec() is missing > > several important pieces: > > a) do the priorXmax==xmin dance to check we're still following the same > > ctid chain. Currently we could easily stumble across completely > > unrelated tuples if a chain element aborted and got vacuumed. > > This seems simple to handle by adding the check you propose to the loop. > Basically if the xmax doesn't match the xmin, we reached the end, > there's nothing more to lock and we can return success without any > further work: Right, that's what we do in other places (notably EvalPlanQualFetch()). > > b) Check whether a chain element actually aborted - currently we're > > only doing that in the HEAP_KEYS_UPDATED updated case, but that seems > > wrong (we can't check for committed tho!). > > Let me point out that this is exactly the same code that would be > affected by my proposed fix for #8434, which would have this check the > updateXid in all cases, not only in KEYS_UPDATED as currently. Hm. I don't see a patch in any of the mails about #8434 although I seem to remember talking with you about one. Maybe that was on IM? > I don't understand your comment about "can't check for committed". I > mean, if the updating transaction committed, then we need to return > failure to caller, HeapTupleUpdated. I mean that in the !KEYS_UPDATED case we don't need to abort if we're only acquiring a key share... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund escribió: > On 2013-11-25 17:10:39 -0300, Alvaro Herrera wrote: > > Let me point out that this is exactly the same code that would be > > affected by my proposed fix for #8434, which would have this check the > > updateXid in all cases, not only in KEYS_UPDATED as currently. > > Hm. I don't see a patch in any of the mails about #8434 although I seem > to remember talking with you about one. Maybe that was on IM? Ah, yeah, that's possible. The patch I proposed back then is attached here. I haven't merged this to latest master; this applies cleanly on top of 16a906f5350 > > I don't understand your comment about "can't check for committed". I > > mean, if the updating transaction committed, then we need to return > > failure to caller, HeapTupleUpdated. > > I mean that in the !KEYS_UPDATED case we don't need to abort if we're > only acquiring a key share... Hm, I think that's correct -- we don't need to abort. But we still need to wait until the updater completes. So this proposed patch is not the full story. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachments
On 2013-11-25 18:06:30 -0300, Alvaro Herrera wrote: > > I mean that in the !KEYS_UPDATED case we don't need to abort if we're > > only acquiring a key share... > > Hm, I think that's correct -- we don't need to abort. But we still need > to wait until the updater completes. So this proposed patch is not the > full story. Hm. Why do we need to wait in that case? Isn't the entire point of KEY SHARE locks *not* having to wait for !KEYS_UPDATED? ISTM in that case we should only check whether the creating transaction has aborted because in that case we don't need to take out a lock? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera escribió: > Andres Freund escribió: > This seems simple to handle by adding the check you propose to the loop. > Basically if the xmax doesn't match the xmin, we reached the end, > there's nothing more to lock and we can return success without any > further work: As mentioned in the thread for bug #8434, the complete working patch for this is attached. > > b) Check whether a chain element actually aborted - currently we're > > only doing that in the HEAP_KEYS_UPDATED updated case, but that seems > > wrong (we can't check for committed tho!). > > Let me point out that this is exactly the same code that would be > affected by my proposed fix for #8434, which would have this check the > updateXid in all cases, not only in KEYS_UPDATED as currently. I posted a patch for this problem in the thread about #8434. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachments
On 2013-11-27 13:57:52 -0300, Alvaro Herrera wrote:
> Per bug report by J Smith in
> CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com
> diagnosed by Andres Freund.

Alvaro, do you see a way this could actually have caused J.'s problems? I thought about a few, but each turned out to not really be possible... I can easily see why we would uselessly wait, or worse, but the pg_subtrans errors I can't really explain with it.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2013-11-24 16:56:26 -0500, J Smith wrote: > coredumper worked like a charm. Useful tool, that is... although as a > bit of advice, I'd try not to run it on Postgres if your various > memory settings are tweaked towards production use -- the core dump > that was captured on my server weighed in at 16 GB. > > Anyways, I've attached both the log entries that caused the error as > well as a backtrace using `bt full`. Do you still have the core file around? If so could you 'p *ShmemVariableCache' and 'p *ControlFile'? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 28, 2013 at 5:15 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> Do you still have the core file around? If so could you 'p
> *ShmemVariableCache' and 'p *ControlFile'?

So sorry, I didn't see this message until just today. Seems it was accidentally archived before hitting my eyeballs. I see that 9.3.2 was released today along with what appears to be some fixes regarding this and similar issues. Sorry if my missing this message held anything up.

We still have the core file kicking around, so here's the output at any rate.

(gdb) p *ShmemVariableCache
$1 = {nextOid = 8036795, oidCount = 6974, nextXid = 10260628, oldestXid = 1673, xidVacLimit = 200001673,
  xidWarnLimit = 2136485320, xidStopLimit = 2146485320, xidWrapLimit = 2147485320, oldestXidDB = 1,
  latestCompletedXid = 10260621}
(gdb) p *ControlFile
$2 = {system_identifier = 5942827484423487452, pg_control_version = 937, catalog_version_no = 201306121,
  state = DB_IN_PRODUCTION, time = 1385234938, checkPoint = 381771294048, prevCheckPoint = 381771293888,
  checkPointCopy = {redo = 381771293992, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = 1 '\001',
    nextXidEpoch = 0, nextXid = 10217377, nextOid = 8035577, nextMulti = 13448295, nextMultiOffset = 32161320,
    oldestXid = 1673, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, time = 1385234938,
    oldestActiveXid = 10217377}, unloggedLSN = 1, minRecoveryPoint = 0, minRecoveryPointTLI = 0,
  backupStartPoint = 0, backupEndPoint = 0, backupEndRequired = 0 '\000', wal_level = 2, MaxConnections = 200,
  max_prepared_xacts = 0, max_locks_per_xact = 64, maxAlign = 8, floatFormat = 1234567, blcksz = 8192,
  relseg_size = 131072, xlog_blcksz = 8192, xlog_seg_size = 16777216, nameDataLen = 64, indexMaxKeys = 32,
  toast_max_chunk_size = 1996, enableIntTimes = 1 '\001', float4ByVal = 1 '\001', float8ByVal = 1 '\001',
  data_checksum_version = 0, crc = 517534097}
(gdb)

Cheers?
Andres, all: We've just run across a case of this exact issue on 9.2.4. I thought it was supposed to be 9.3-only? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Hi, On 2013-12-10 10:38:32 -0800, Josh Berkus wrote: > We've just run across a case of this exact issue on 9.2.4. I thought it > was supposed to be 9.3-only? Could you please describe "this exact issue"? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12/10/2013 10:39 AM, Andres Freund wrote:
> Hi,
>
> On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
>> We've just run across a case of this exact issue on 9.2.4. I thought it
>> was supposed to be 9.3-only?
>
> Could you please describe "this exact issue"?

Fatal errors due to missing pg_subtrans files on creating a new replica. Sequence:

1. Failed over from master to replica #1.
2. Remastered other replicas.
3. Tried to create a new replica.
4. New replica started failing with errors similar to the original report.

If this is fixed in 9.2.6, great, but I didn't get that impression from the commits ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 2013-12-10 10:44:30 -0800, Josh Berkus wrote:
> On 12/10/2013 10:39 AM, Andres Freund wrote:
> > Hi,
> >
> > On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
> >> We've just run across a case of this exact issue on 9.2.4. I thought it
> >> was supposed to be 9.3-only?
> >
> > Could you please describe "this exact issue"?
>
> Fatal errors due to missing pg_subtrans files on creating a new replica.
>
> If this is fixed in 9.2.6, great, but I didn't get that impression from
> the commits ...

I am pretty sure this thread isn't about the bug you're hitting. You get errors during xid assignment, right? If so, upgrade to 9.2.6; that's fixed (it's the bug in whose wake the replication bug was introduced).

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12/10/2013 10:48 AM, Andres Freund wrote: > On 2013-12-10 10:44:30 -0800, Josh Berkus wrote: >> On 12/10/2013 10:39 AM, Andres Freund wrote: >>> Hi, >>> >>> On 2013-12-10 10:38:32 -0800, Josh Berkus wrote: >>>> We've just run across a case of this exact issue on 9.2.4. I thought it >>>> was supposed to be 9.3-only? >>> >>> Could you please describe "this exact issue"? >> >> Fatal errors due to missing pg_subtrans files on creating a new replica. > >> If this is fixed in 9.2.6, great, but I didn't get that impression from >> the commits ... > > I am pretty sure this thread isn't about the bug you're hitting. You get > errors during xid assignment, right? If so, upgrade do 9.2.6, that's > fixed (it's the bug in which wake the replication bug was introduced). I thought that only affected 9.2.5? This machine is 9.2.4, as was the prior master. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2013-12-10 11:12:03 -0800, Josh Berkus wrote: > On 12/10/2013 10:48 AM, Andres Freund wrote: > > On 2013-12-10 10:44:30 -0800, Josh Berkus wrote: > >> On 12/10/2013 10:39 AM, Andres Freund wrote: > >>> Hi, > >>> > >>> On 2013-12-10 10:38:32 -0800, Josh Berkus wrote: > >>>> We've just run across a case of this exact issue on 9.2.4. I thought it > >>>> was supposed to be 9.3-only? > >>> > >>> Could you please describe "this exact issue"? > >> > >> Fatal errors due to missing pg_subtrans files on creating a new replica. > > > >> If this is fixed in 9.2.6, great, but I didn't get that impression from > >> the commits ... > > > > I am pretty sure this thread isn't about the bug you're hitting. You get > > errors during xid assignment, right? If so, upgrade do 9.2.6, that's > > fixed (it's the bug in which wake the replication bug was introduced). > > I thought that only affected 9.2.5? This machine is 9.2.4, as was the > prior master. 9.2.5 tried to fix the issue (errors because of missing pg_subtrans files during xid assignments), but the fix had unintended consequences, namely the HS init bug. 9.2.6 removed those unintended consequences. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services