Discussion: BUG #7500: hot-standby replica crash after an initial rsync
The following bug has been logged on the website:

Bug reference:      7500
Logged by:          Maxim Boguk
Email address:      maxim.boguk@gmail.com
PostgreSQL version: 9.0.8
Operating system:   FreeBSD
Description:

Hi,

Reinitializing the replica after a failover is a procedure I have performed many times on this installation, and I have never had any issues. But today, after following the same pg_start_backup/rsync procedure, the replica does not start, with the following error:

2012-08-21 11:17:07.777 MSK 34329 @ from [vxid: txid:0] []LOG:  entering standby mode
2012-08-21 11:17:07.780 MSK 34329 @ from [vxid:1/0 txid:0] []LOG:  redo starts at 925/2C0004C8
2012-08-21 11:17:07.783 MSK 34329 @ from [vxid:1/0 txid:0] []FATAL:  could not access status of transaction 361788644
2012-08-21 11:17:07.783 MSK 34329 @ from [vxid:1/0 txid:0] []DETAIL:  Could not read from file "pg_multixact/offsets/1590" at offset 114688: Unknown error: 0.
2012-08-21 11:17:07.783 MSK 34329 @ from [vxid:1/0 txid:0] []CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
2012-08-21 11:17:07.791 MSK 34327 @ from [vxid: txid:0] []LOG:  startup process (PID 34329) exited with exit code 1
2012-08-21 11:17:07.791 MSK 34327 @ from [vxid: txid:0] []LOG:  terminating any other active server processes

I have kept all the database files for future investigation.

What should I look into first?
On Tuesday, August 21, 2012 10:08:11 AM maxim.boguk@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      7500
> Logged by:          Maxim Boguk
> Email address:      maxim.boguk@gmail.com
> PostgreSQL version: 9.0.8
> Operating system:   FreeBSD
>
> [full report and log quoted above]
>
> I have kept all the database files for future investigation.
>
> What should I look into first?

Could you reproduce the error with log_error_verbosity=verbose? Or even
better provide a backtrace with gdb?

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> > I have kept all the database files for future investigation.
> >
> > What should I look into first?
>
> Could you reproduce the error with log_error_verbosity=verbose? Or even
> better provide a backtrace with gdb?

Here is the log with log_error_verbosity=verbose:

2012-08-21 14:04:18.973 MSK 33950 @ from [vxid: txid:0] []LOG:  00000: database system was interrupted while in recovery at log time 2012-08-21 08:21:42 MSK
2012-08-21 14:04:18.973 MSK 33950 @ from [vxid: txid:0] []HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2012-08-21 14:04:18.973 MSK 33950 @ from [vxid: txid:0] []LOCATION:  StartupXLOG, xlog.c:5757
2012-08-21 14:04:18.973 MSK 33950 @ from [vxid: txid:0] []LOG:  00000: creating missing WAL directory "pg_xlog/archive_status"
2012-08-21 14:04:18.973 MSK 33950 @ from [vxid: txid:0] []LOCATION:  ValidateXLOGDirectoryStructure, xlog.c:3452
2012-08-21 14:04:19.009 MSK 33950 @ from [vxid: txid:0] []LOG:  00000: entering standby mode
2012-08-21 14:04:19.009 MSK 33950 @ from [vxid: txid:0] []LOCATION:  StartupXLOG, xlog.c:5827
2012-08-21 14:04:19.041 MSK 33951 @ from [vxid: txid:0] []LOG:  00000: streaming replication successfully connected to primary
2012-08-21 14:04:19.041 MSK 33951 @ from [vxid: txid:0] []LOCATION:  libpqrcv_connect, libpqwalreceiver.c:169
2012-08-21 14:04:20.140 MSK 33950 @ from [vxid:1/0 txid:0] []LOG:  00000: redo starts at 925/2C0004C8
2012-08-21 14:04:20.140 MSK 33950 @ from [vxid:1/0 txid:0] []LOCATION:  StartupXLOG, xlog.c:6199
2012-08-21 14:04:20.160 MSK 33950 @ from [vxid:1/0 txid:0] []FATAL:  XX000: could not access status of transaction 361788644
2012-08-21 14:04:20.161 MSK 33950 @ from [vxid:1/0 txid:0] []DETAIL:  Could not read from file "pg_multixact/offsets/1590" at offset 114688: Unknown error: 0.
2012-08-21 14:04:20.161 MSK 33950 @ from [vxid:1/0 txid:0] []CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
2012-08-21 14:04:20.161 MSK 33950 @ from [vxid:1/0 txid:0] []LOCATION:  SlruReportIOError, slru.c:840
2012-08-21 14:04:20.169 MSK 33948 @ from [vxid: txid:0] []LOG:  00000: startup process (PID 33950) exited with exit code 1
2012-08-21 14:04:20.169 MSK 33948 @ from [vxid: txid:0] []LOCATION:  LogChildExit, postmaster.c:2840
2012-08-21 14:04:20.169 MSK 33948 @ from [vxid: txid:0] []LOG:  00000: terminating any other active server processes
2012-08-21 14:04:20.169 MSK 33948 @ from [vxid: txid:0] []LOCATION:  HandleChildCrash, postmaster.c:2674

Kind Regards,
Maksym
On Tue, Aug 21, 2012 at 8:08 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
>> Could you reproduce the error with log_error_verbosity=verbose? Or even
>> better provide a backtrace with gdb?
>
> Here is the log with log_error_verbosity=verbose:
>
> [verbose log quoted above]

I played around a bit with more detailed debug settings and got a more detailed log:

LOG:  00000: database system was interrupted while in recovery at log time 2012-08-21 08:21:42 MSK
HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOCATION:  StartupXLOG, xlog.c:5757
DEBUG:  00000: standby_mode = 'on'
LOCATION:  readRecoveryCommandFile, xlog.c:5269
DEBUG:  00000: primary_conninfo = 'host=10.12.22.19 port=5433 user=replica password=acilper'
LOCATION:  readRecoveryCommandFile, xlog.c:5276
DEBUG:  00000: trigger_file = '/db/data/failover'
LOCATION:  readRecoveryCommandFile, xlog.c:5283
LOG:  00000: entering standby mode
LOCATION:  StartupXLOG, xlog.c:5827
DEBUG:  00000: checkpoint record is at 925/2D41CAA0
LOCATION:  StartupXLOG, xlog.c:5894
DEBUG:  00000: redo record is at 925/2C0004C8; shutdown FALSE
LOCATION:  StartupXLOG, xlog.c:5929
DEBUG:  00000: next transaction ID: 0/1768511856; next OID: 59880938
LOCATION:  StartupXLOG, xlog.c:5933
DEBUG:  00000: next MultiXactId: 361788644; next MultiXactOffset: 822580965
LOCATION:  StartupXLOG, xlog.c:5936
DEBUG:  00000: oldest unfrozen transaction ID: 1569182525, in database 16424
LOCATION:  StartupXLOG, xlog.c:5939
DEBUG:  00000: transaction ID wrap limit is 3716666172, limited by database with OID 16424
LOCATION:  SetTransactionIdLimit, varsup.c:334
DEBUG:  00000: initializing for hot standby
LOCATION:  StartupXLOG, xlog.c:6070
DEBUG:  00000: my backend id is 1
LOCATION:  SharedInvalBackendInit, sinvaladt.c:326
LOG:  00000: redo starts at 925/2C0004C8
LOCATION:  StartupXLOG, xlog.c:6199
DEBUG:  00000: record known xact 1768508078 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508078 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: remove KnownAssignedXid 1768508078
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  KnownAssignedXidsRemove, procarray.c:2869
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508080 rel 1663/16424/17282; tid 725829/8
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo update: rel 1663/16424/17282; tid 725829/8; new 726017/7
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508076 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508076 rel 1663/16424/47760635; tid 3075/136
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/47747; tid 43866/197
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/52546872; tid 44700/276
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0
CONTEXT:  xlog redo log: rel 1663/16424/19211
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508082 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508082 rel 1663/16424/4759107; tid 1098/7
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508076 latestObservedXid 0
CONTEXT:  xlog redo hot_update: rel 1663/16424/47760635; tid 3075/136; new 3075/140
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508080 rel 1663/16424/17282; tid 725918/2
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508082 latestObservedXid 0
CONTEXT:  xlog redo hot_update: rel 1663/16424/4759107; tid 1098/7; new 1098/39
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo update: rel 1663/16424/17282; tid 725918/2; new 725918/6
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/47747; tid 43866/201
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/52546872; tid 44700/276
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.082508+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.082508+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: remove KnownAssignedXid 1768508080
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.082508+04
LOCATION:  KnownAssignedXidsRemove, procarray.c:2869
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/64654; tid 178183/44
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/95419; tid 38650/318
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0
CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
FATAL:  XX000: could not access status of transaction 361788644
DETAIL:  Could not read from file "pg_multixact/offsets/1590" at offset 114688: Unknown error: 0.
CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
LOCATION:  SlruReportIOError, slru.c:840
DEBUG:  00000: shmem_exit(1): 5 callbacks to make

I have some problems debugging the startup process with gdb...
I followed this sequence of commands (and got no useful results):

[pgsql@db1 ~/tmp/postgresql-9.0.8]$ /usr/local/bin/gdb /usr/local/bin/postgres
...
Reading symbols from /usr/local/bin/postgres...done.
(gdb) b exit
Breakpoint 1 at 0x44d38c
(gdb) run -D /db/data_back/
Starting program: /usr/local/bin/postgres -D /db/data_back/

Program received signal SIGUSR1, User defined signal 1.
0x00000008018ddadc in select () from /lib/libc.so.7
(gdb) bt
#0  0x00000008018ddadc in select () from /lib/libc.so.7
#1  0x000000000065641c in ServerLoop () at postmaster.c:1396
#2  0x0000000000655d05 in PostmasterMain (argc=3, argv=0x7fffffffead0) at postmaster.c:1097
#3  0x00000000005d974a in main (argc=3, argv=0x7fffffffead0) at main.c:188
(gdb) cont
Continuing.

Breakpoint 1, 0x000000080186bab0 in exit () from /lib/libc.so.7
(gdb) bt
#0  0x000000080186bab0 in exit () from /lib/libc.so.7
#1  0x000000000067d0a6 in proc_exit (code=1) at ipc.c:137
#2  0x0000000000659933 in ExitPostmaster (status=1) at postmaster.c:4172
#3  0x0000000000657d2d in reaper (postgres_signal_arg=20) at postmaster.c:2354
#4  <signal handler called>
#5  <signal handler called>
#6  0x00000008018ddadc in select () from /lib/libc.so.7
#7  0x000000000065641c in ServerLoop () at postmaster.c:1396
#8  0x0000000000655d05 in PostmasterMain (argc=3, argv=0x7fffffffead0) at postmaster.c:1097
#9  0x00000000005d974a in main (argc=3, argv=0x7fffffffead0) at main.c:188
(gdb) cont
Continuing.
[Inferior 1 (process 98857) exited with code 01]
(gdb) quit
Hi,

On Tuesday, August 21, 2012 02:28:54 PM Maxim Boguk wrote:
> On Tue, Aug 21, 2012 at 8:08 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> >> Could you reproduce the error with log_error_verbosity=verbose? Or even
> >> better provide a backtrace with gdb?
> >
> > Here is the log with log_error_verbosity=verbose:
>
> [detailed debug log quoted above, ending with:]
>
> FATAL:  XX000: could not access status of transaction 361788644
> DETAIL:  Could not read from file "pg_multixact/offsets/1590" at offset
> 114688: Unknown error: 0.
> CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061
> 1768508078
> LOCATION:  SlruReportIOError, slru.c:840

The strange thing here is the "Unknown error: 0" bit. That indicates that
slru_errno isn't properly set...

> I have some problems debugging the startup process with gdb...
> I followed this sequence of commands (and got no useful results):

You're debugging the postmaster that way. The easiest way would be to just
attach to the startup process with gdb -p. Not sure if you can manage that
timing-wise without manually adding sleeps.

Otherwise you probably will have to work with "set follow-fork-mode ask" in
gdb. Getting to the right child isn't easy though.

> (gdb) run -D /db/data_back/
> Starting program: /usr/local/bin/postgres -D /db/data_back/
>
> Program received signal SIGUSR1, User defined signal 1.
> 0x00000008018ddadc in select () from /lib/libc.so.7

"handle SIGUSR1 nostop noprint" is useful to avoid this, btw.

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Andres,

>> I have some problems debugging the startup process with gdb...
>> I followed this sequence of commands (and got no useful results):
>
> You're debugging the postmaster that way. The easiest way would be to just
> attach to the startup process with gdb -p. Not sure if you can manage that
> timing-wise without manually adding sleeps.

It seems impossible without an added sleep; it is too fast for me.
Where should I add the sleep call in the PostgreSQL source? (The process
management part of PostgreSQL is not familiar to me at all.)

> Otherwise you probably will have to work with "set follow-fork-mode ask" in
> gdb. Getting to the right child isn't easy though.

Unfortunately, gdb under FreeBSD doesn't support "set follow-fork-mode".

Kind Regards,
Maksym
On Tuesday, August 21, 2012 03:30:44 PM Maxim Boguk wrote:
> Hi Andres,
>
> It seems impossible without an added sleep; it is too fast for me.
> Where should I add the sleep call in the PostgreSQL source? (The process
> management part of PostgreSQL is not familiar to me at all.)

I would suggest adding it to StartupXLOG() in xlog.c just before the /* REDO */
comment or similar. I would add something akin to

    elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid());
    sleep(10);

Greetings,

Andres
--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 22, 2012 at 6:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I would add something akin to
>
>     elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid());
>     sleep(10);

Hi Andres,

Here it is.

gdb backtrace:

#0  0x0000000800bd1ab0 in exit () from /lib/libc.so.7
#1  0x00000000005d77f6 in proc_exit (code=1) at ipc.c:137
#2  0x0000000000697948 in errfinish (dummy=<optimized out>) at elog.c:501
#3  0x0000000000488956 in SlruReportIOError (ctl=<optimized out>, pageno=<optimized out>, xid=361788644) at slru.c:857
#4  0x000000000048956a in SimpleLruReadPage (ctl=0x95bf80, pageno=176654, write_ok=1 '\001', xid=361788644) at slru.c:427
#5  0x0000000000489f7f in RecordNewMultiXact (multi=361788644, offset=822580965, nxids=2, xids=0x800e5302c) at multixact.c:764
#6  0x000000000048a126 in multixact_redo (lsn=..., record=0x800e53000) at multixact.c:2017
#7  0x0000000000484bea in StartupXLOG () at xlog.c:6277
#8  0x00000000004870a8 in StartupProcessMain () at xlog.c:9335
#9  0x000000000049cab2 in AuxiliaryProcessMain (argc=2, argv=0x7fffffffe450) at bootstrap.c:412
#10 0x00000000005ba926 in StartChildProcess (type=StartupProcess) at postmaster.c:4442
#11 0x00000000005bccb2 in PostmasterMain (argc=5, argv=0x7fffffffea38) at postmaster.c:1093
#12 0x0000000000566f3b in main (argc=5, argv=0x7fffffffea38) at main.c:188

relevant database log:

2012-08-22 07:47:21.330 MSK 18155 @ from [vxid:1/0 txid:0] []LOG:  00000: redo starts at 925/2C0004C8
2012-08-22 07:47:21.330 MSK 18155 @ from [vxid:1/0 txid:0] []LOCATION:  StartupXLOG, xlog.c:6202
2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []FATAL:  XX000: could not access status of transaction 361788644
2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []DETAIL:  Could not read from file "pg_multixact/offsets/1590" at offset 114688: Unknown error: 0.
2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []LOCATION:  SlruReportIOError, slru.c:840

Kind Regards,
Maksym
On Wed, Aug 22, 2012 at 1:50 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote: > > > On Wed, Aug 22, 2012 at 6:08 AM, Andres Freund <andres@2ndquadrant.com>wrote: > >> On Tuesday, August 21, 2012 03:30:44 PM Maxim Boguk wrote: >> > Hi Andres, >> >> I would add something akin to >> >> elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid()); >> sleep(10); >> > > Hi Andres, > > There is it: > > gdb backtrace: > > #0 0x0000000800bd1ab0 in exit () from /lib/libc.so.7 > #1 0x00000000005d77f6 in proc_exit (code=1) at ipc.c:137 > #2 0x0000000000697948 in errfinish (dummy=<optimized out>) at elog.c:501 > #3 0x0000000000488956 in SlruReportIOError (ctl=<optimized out>, > pageno=<optimized out>, xid=361788644) at slru.c:857 > #4 0x000000000048956a in SimpleLruReadPage (ctl=0x95bf80, pageno=176654, > write_ok=1 '\001', xid=361788644) at slru.c:427 > #5 0x0000000000489f7f in RecordNewMultiXact (multi=361788644, > offset=822580965, nxids=2, xids=0x800e5302c) at multixact.c:764 > #6 0x000000000048a126 in multixact_redo (lsn=..., record=0x800e53000) at > multixact.c:2017 > #7 0x0000000000484bea in StartupXLOG () at xlog.c:6277 > #8 0x00000000004870a8 in StartupProcessMain () at xlog.c:9335 > #9 0x000000000049cab2 in AuxiliaryProcessMain (argc=2, > argv=0x7fffffffe450) at bootstrap.c:412 > #10 0x00000000005ba926 in StartChildProcess (type=StartupProcess) at > postmaster.c:4442 > #11 0x00000000005bccb2 in PostmasterMain (argc=5, argv=0x7fffffffea38) at > postmaster.c:1093 > #12 0x0000000000566f3b in main (argc=5, argv=0x7fffffffea38) at main.c:188 > > relevant database log: > 2012-08-22 07:47:21.330 MSK 18155 @ from [vxid:1/0 txid:0] []LOG: 00000: > redo starts at 925/2C0004C8 > 2012-08-22 07:47:21.330 MSK 18155 @ from [vxid:1/0 txid:0] []LOCATION: > StartupXLOG, xlog.c:6202 > 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []FATAL: > XX000: could not access status of transaction 361788644 > 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []DETAIL: > 
Could not read from file "pg_multixact/offsets/1590" at offset 114688: > Unknown error: 0. > 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []CONTEXT: > xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078 > 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []LOCATION: > SlruReportIOError, slru.c:840 > > > Kind Regards, > Maksym > And finally gdb backtrace on build without optimization ( ./configure --prefix=/usr/local/pgsql/tmp/postgresql/ --enable-debug --enable-cassert CFLAGS='-O0' ): #0 0x0000000800d42ab0 in exit () from /lib/libc.so.7 #1 0x00000000006aafc6 in proc_exit (code=1) at ipc.c:137 #2 0x00000000007bcbb1 in errfinish (dummy=0) at elog.c:501 #3 0x00000000004b7065 in SlruReportIOError (ctl=0xad9740, pageno=176654, xid=361788644) at slru.c:836 #4 0x00000000004b6353 in SimpleLruReadPage (ctl=0xad9740, pageno=176654, write_ok=1 '\001', xid=361788644) at slru.c:427 #5 0x00000000004b85dd in RecordNewMultiXact (multi=361788644, offset=822580965, nxids=2, xids=0x800f5302c) at multixact.c:764 #6 0x00000000004b9dd1 in multixact_redo (lsn=..., record=0x800f53000) at multixact.c:2017 #7 0x00000000004aef0c in StartupXLOG () at xlog.c:6277 #8 0x00000000004b446d in StartupProcessMain () at xlog.c:9335 #9 0x00000000004d9e36 in AuxiliaryProcessMain (argc=2, argv=0x7fffffffe8b0) at bootstrap.c:412 #10 0x0000000000683398 in StartChildProcess (type=StartupProcess) at postmaster.c:4442 #11 0x000000000067ef60 in PostmasterMain (argc=5, argv=0x7fffffffea38) at postmaster.c:1093 #12 0x00000000005fbf5a in main (argc=5, argv=0x7fffffffea38) at main.c:188 It seems failing around function SlruPhysicalReadPage at block errno = 0; if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ) { slru_errcause = SLRU_READ_FAILED; slru_errno = errno; close(fd); return false; } It seems that it read less than BLCKSZ bytes. And it is not an error condition for read call (from man 2 read: Note that reading less than size bytes is not an error. ). 
The actual size of the pg_multixact/offsets/1590 file is:

[pgsql@db1 ~/tmp/postgresql-9.0.8]$ ls -la /db/data_back/pg_multixact/offsets/1590
-rw-------  1 pgsql  pgsql  81920 Aug 21 07:46 /db/data_back/pg_multixact/offsets/1590

Now it seems that the (lseek(fd, (off_t) offset, SEEK_SET) < 0) call moves
the seek position beyond the actual file size, and after that the read call
can only read 0 bytes from it.

Does that mean that the file was damaged during rsync?

Kind Regards,
Maksym

What should I examine next?

Kind Regards,
Maksym
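[Editor's sketch] The lseek/read behaviour described above is easy to reproduce outside of PostgreSQL. The following is only an illustration, with the file size and offset chosen to mirror the 81920-byte offsets file and the 114688-byte offset from the log; seeking past EOF succeeds, and the subsequent read returns 0 bytes while leaving errno at 0 — which is exactly why the message says "Unknown error: 0".

```shell
# Sketch of the failure mode: seeking past EOF succeeds, and reading there
# returns 0 bytes with errno untouched -- hence "Unknown error: 0".
f=$(mktemp)
dd if=/dev/zero of="$f" bs=8192 count=10 2>/dev/null    # 81920 bytes, like offsets/1590
# ibs=8192 skip=14 positions the read at byte 14*8192 = 114688, past EOF
bytes=$(dd if="$f" ibs=8192 skip=14 count=1 2>/dev/null | wc -c | tr -d ' ')
echo "read $bytes bytes"    # -> read 0 bytes; a zero-length read is not an error
rm -f "$f"
```

Note that dd itself also exits with status 0 here: a short (or empty) read past EOF is not an error at the syscall level, which is why slru.c has to detect the short read itself and then reports the stale errno of 0.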
Hi Maxim, On Wednesday, August 22, 2012 06:18:14 AM Maxim Boguk wrote: > On Wed, Aug 22, 2012 at 1:50 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote: > > On Wed, Aug 22, 2012 at 6:08 AM, Andres Freund <andres@2ndquadrant.com>wrote: > >> On Tuesday, August 21, 2012 03:30:44 PM Maxim Boguk wrote: > >> > Hi Andres, > >> > >> I would add something akin to > >> > >> elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid()); > >> sleep(10); > > > > Hi Andres, > > > > There is it: > > > > gdb backtrace: > > > > #0 0x0000000800bd1ab0 in exit () from /lib/libc.so.7 > > #1 0x00000000005d77f6 in proc_exit (code=1) at ipc.c:137 > > #2 0x0000000000697948 in errfinish (dummy=<optimized out>) at elog.c:501 > > #3 0x0000000000488956 in SlruReportIOError (ctl=<optimized out>, > > pageno=<optimized out>, xid=361788644) at slru.c:857 > > #4 0x000000000048956a in SimpleLruReadPage (ctl=0x95bf80, pageno=176654, > > write_ok=1 '\001', xid=361788644) at slru.c:427 > > #5 0x0000000000489f7f in RecordNewMultiXact (multi=361788644, > > offset=822580965, nxids=2, xids=0x800e5302c) at multixact.c:764 > > #6 0x000000000048a126 in multixact_redo (lsn=..., record=0x800e53000) at > > multixact.c:2017 > > #7 0x0000000000484bea in StartupXLOG () at xlog.c:6277 > > #8 0x00000000004870a8 in StartupProcessMain () at xlog.c:9335 > > #9 0x000000000049cab2 in AuxiliaryProcessMain (argc=2, > > argv=0x7fffffffe450) at bootstrap.c:412 > > #10 0x00000000005ba926 in StartChildProcess (type=StartupProcess) at > > postmaster.c:4442 > > #11 0x00000000005bccb2 in PostmasterMain (argc=5, argv=0x7fffffffea38) at > > postmaster.c:1093 > > #12 0x0000000000566f3b in main (argc=5, argv=0x7fffffffea38) at > > main.c:188 > > > > relevant database log: > > 2012-08-22 07:47:21.330 MSK 18155 @ from [vxid:1/0 txid:0] []LOG: > > 00000: redo starts at 925/2C0004C8 > > 2012-08-22 07:47:21.330 MSK 18155 @ from [vxid:1/0 txid:0] []LOCATION: > > StartupXLOG, xlog.c:6202 > > 2012-08-22 07:47:21.384 MSK 18155 @ from 
[vxid:1/0 txid:0] []FATAL: > > XX000: could not access status of transaction 361788644 > > 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []DETAIL: > > Could not read from file "pg_multixact/offsets/1590" at offset 114688: > > Unknown error: 0. > > 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []CONTEXT: > > xlog redo create multixact 361788644 offset 822580965: 1768508061 > > 1768508078 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] > > []LOCATION: SlruReportIOError, slru.c:840 > > > > > > Kind Regards, > > Maksym > > And finally gdb backtrace on build without optimization > ( ./configure --prefix=/usr/local/pgsql/tmp/postgresql/ --enable-debug > --enable-cassert CFLAGS='-O0' ): > > #0 0x0000000800d42ab0 in exit () from /lib/libc.so.7 > #1 0x00000000006aafc6 in proc_exit (code=1) at ipc.c:137 > #2 0x00000000007bcbb1 in errfinish (dummy=0) at elog.c:501 > #3 0x00000000004b7065 in SlruReportIOError (ctl=0xad9740, pageno=176654, > xid=361788644) at slru.c:836 > #4 0x00000000004b6353 in SimpleLruReadPage (ctl=0xad9740, pageno=176654, > write_ok=1 '\001', xid=361788644) at slru.c:427 > #5 0x00000000004b85dd in RecordNewMultiXact (multi=361788644, > offset=822580965, nxids=2, xids=0x800f5302c) at multixact.c:764 > #6 0x00000000004b9dd1 in multixact_redo (lsn=..., record=0x800f53000) at > multixact.c:2017 > #7 0x00000000004aef0c in StartupXLOG () at xlog.c:6277 > #8 0x00000000004b446d in StartupProcessMain () at xlog.c:9335 > #9 0x00000000004d9e36 in AuxiliaryProcessMain (argc=2, > argv=0x7fffffffe8b0) at bootstrap.c:412 > #10 0x0000000000683398 in StartChildProcess (type=StartupProcess) at > postmaster.c:4442 > #11 0x000000000067ef60 in PostmasterMain (argc=5, argv=0x7fffffffea38) at > postmaster.c:1093 > #12 0x00000000005fbf5a in main (argc=5, argv=0x7fffffffea38) at main.c:188 Thanks! That already helps. 
> It seems failing around function SlruPhysicalReadPage at block
>
> errno = 0;
> if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
> {
>     slru_errcause = SLRU_READ_FAILED;
>     slru_errno = errno;
>     close(fd);
>     return false;
> }
>
> It seems that it read less than BLCKSZ bytes. And it is not an error
> condition for read call
> (from man 2 read:
> Note that reading less than size bytes is not an error.
> ).

If there are enough bytes available and you're not reading from a pipe or
similar, it shouldn't happen (unless signals happen; those change errno,
though). But you already cleared up why it's happening:

> Actual size of pg_multixact/offsets/1590 file is:
> [pgsql@db1 ~/tmp/postgresql-9.0.8]$ ls -la /db/data_back/pg_multixact/offsets/1590
> -rw-------  1 pgsql  pgsql  81920 Aug 21 07:46 /db/data_back/pg_multixact/offsets/1590
>
> Now it seems that the (lseek(fd, (off_t) offset, SEEK_SET) < 0) call moving
> seek position to outside of an actual file size.
> And after that, read call actually could read only 0 bytes from it.

That does sound plausible, yes.

> Does that mean that file was damaged during rsync?

Not necessarily. When did you initially set up that cluster? Normally the
file should get zeroed out before it's being used. If the cluster was copied
improperly (i.e. no pg_start/stop_backup or such) it could easily happen.
But I would guess that you did it properly?

Could you show pg_controldata output for primary/replica? To see how new
that multixact offset is.

> What next I should examine?

If you could give me a DEBUG3 log from startup till it crashes it would be
helpful. Please #define MULTIXACT_DEBUG on top of multixact.c before doing
so. That's probably going to be too big to go through the mailing list.

If you're feeling adventurous you could use the xlogdump project
(https://github.com/snaga/xlogdump) to see what xlog records happen from the
time it starts up till it crashes.
I don't immediately see a way an XLOG_MULTIXACT_CREATE_ID record can be
generated without a preceding XLOG_MULTIXACT_ZERO_OFF_PAGE zeroing the page.
But most of that should already be visible in the log with the above define.

Did you have any crashes, -m immediate shutdowns, or similar on the primary
or replica?

Greetings,

Andres
--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> > Does that mean that file was damaged during rsync?
>
> Not necessarily. When did you initially set up that cluster? Normally the
> file should get zeroed out before its being used. If the cluster was
> copied improperly (i.e. no pg_start/stop backup or such) it could easily
> happen. But I would guess that you did it properly?

That server was the master before a planned switchover to the replica server
was performed. The procedure:

1) stop the former master
2) trigger replica failover (thus that server becomes the new master)

There is an unusual part here, because of a human error by the system
administrator:

2.1) start the former master
2.2) stop the former master

3) empty the pg_xlog directory on the former master
4) pg_start_backup on the new master
5) re-sync the data files to the former master using:
   rsync -aq --delete --exclude pg_xlog /db/data [former master]:/db/
6) pg_stop_backup on the new master
7) start the former master as a replica...

That procedure has been used more than 10 times in the past without any
problem. Maybe I am missing something in it.

> Could you show pg_controldata output for primary/replica? To see how new
> that multixact offset is.
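[Editor's sketch] The numbered steps above can be written down as a script. This is strictly a dry-run illustration of the procedure as described in the thread — NEW_MASTER, FORMER_MASTER, the 'resync' backup label, and the paths are placeholders, and the commands are printed rather than executed:

```shell
# Dry-run sketch of steps 3-7 above. run() only echoes each command,
# so nothing here touches a real cluster.
run() { echo "+ $*"; }

run ssh FORMER_MASTER 'rm -f /db/data/pg_xlog/*'                      # step 3
run psql -h NEW_MASTER -c "SELECT pg_start_backup('resync')"          # step 4
run rsync -aq --delete --exclude pg_xlog /db/data FORMER_MASTER:/db/  # step 5
run psql -h NEW_MASTER -c "SELECT pg_stop_backup()"                   # step 6
run ssh FORMER_MASTER 'pg_ctl -D /db/data start'                      # step 7 (recovery.conf in place)
```

The ordering matters: pg_start_backup must bracket the data-directory rsync (steps 4-6), since files copied outside backup mode may be caught mid-write.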
>

On the failing replica:

[pgsql@db1 ~/tmp/postgresql-9.0.8]$ pg_controldata /db/data_back
pg_control version number:            903
Catalog version number:               201008051
Database system identifier:           5687760461964967531
Database cluster state:               in archive recovery
pg_control last modified:             Wed Aug 22 13:47:35 2012
Latest checkpoint location:           925/2D41CAA0
Prior checkpoint location:            925/2D41CAA0
Latest checkpoint's REDO location:    925/2C0004C8
Latest checkpoint's TimeLineID:       5
Latest checkpoint's NextXID:          0/1768511856
Latest checkpoint's NextOID:          59880938
Latest checkpoint's NextMultiXactId:  361788644
Latest checkpoint's NextMultiOffset:  822580965
Latest checkpoint's oldestXID:        1569182525
Latest checkpoint's oldestXID's DB:   16424
Latest checkpoint's oldestActiveXID:  1768498919
Time of latest checkpoint:            Tue Aug 21 08:21:42 2012
Minimum recovery ending location:     925/2C0004C8
Backup start location:                925/2C0004C8
Current wal_level setting:            hot_standby
Current max_connections setting:      300
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   2048
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

There is no sense in looking at the master, because it is one day ahead of
that replica. It's a production system.

> > What next I should examine?
>
> If you could give me a DEBUG3 log from startup till it crashes it would
> be helpful. Please #define MULTIXACT_DEBUG ontop of multixact.c before
> doing so. Thats probably going to be to big to go through the
> mailinglist.

I see nothing new after I rebuilt postgres with that define
( /usr/local/pgsql/tmp/postgresql/bin/postgres -d 3 -D /db/data_back/ ).
Just:

LOCATION:  readRecoveryCommandFile, xlog.c:5269
DEBUG:  00000: primary_conninfo = 'host=10.12.22.19 port=5433 user=replica password=acilper'
LOCATION:  readRecoveryCommandFile, xlog.c:5276
DEBUG:  00000: trigger_file = '/db/data/failover'
LOCATION:  readRecoveryCommandFile, xlog.c:5283
LOG:  00000: entering standby mode
LOCATION:  StartupXLOG, xlog.c:5827
DEBUG:  00000: checkpoint record is at 925/2D41CAA0
LOCATION:  StartupXLOG, xlog.c:5894
DEBUG:  00000: redo record is at 925/2C0004C8; shutdown FALSE
LOCATION:  StartupXLOG, xlog.c:5929
DEBUG:  00000: next transaction ID: 0/1768511856; next OID: 59880938
LOCATION:  StartupXLOG, xlog.c:5933
DEBUG:  00000: next MultiXactId: 361788644; next MultiXactOffset: 822580965
LOCATION:  StartupXLOG, xlog.c:5936
DEBUG:  00000: oldest unfrozen transaction ID: 1569182525, in database 16424
LOCATION:  StartupXLOG, xlog.c:5939
DEBUG:  00000: MultiXact: setting next multi to 361788644 offset 822580965
LOCATION:  MultiXactSetNextMXact, multixact.c:1668
DEBUG:  00000: transaction ID wrap limit is 3716666172, limited by database with OID 16424
LOCATION:  SetTransactionIdLimit, varsup.c:334
WARNING:  01000: pid of startup is: 16919, sleeping for 60s
LOCATION:  StartupXLOG, xlog.c:5983
DEBUG:  00000: initializing for hot standby
LOCATION:  StartupXLOG, xlog.c:6073
LOG:  00000: redo starts at 925/2C0004C8
LOCATION:  StartupXLOG, xlog.c:6202
FATAL:  XX000: could not access status of transaction 361788644
DETAIL:  Could not read from file "pg_multixact/offsets/1590" at offset 114688: Unknown error: 0.
CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
LOCATION:  SlruReportIOError, slru.c:840

> If youre feeling adventurous you could use the xlogdump project
> (https://github.com/snaga/xlogdump) to see what xlog records happen from
> the time it starts up till it crashes. I don't immediately see a way a
> XLOG_MULTIXACT_CREATE_ID can be generated without a preceding
> XLOG_MULTIXACT_ZERO_OFF_PAGE zeroing the page.
> But most of that should be already be visible in the log with the above
> define.
>
> Did you have any crashes, -m immediate shutdowns, or similar on the
> primary or replica?

I don't think so. But I'm not sure, because the task was performed by a
system administrator using the provided instructions, and I was informed
only after the replica didn't start.

Kind Regards,
Maksym
On Wednesday, August 22, 2012 12:08:25 PM Maxim Boguk wrote: > > > Does that mean that file was damaged during rsync? > > > > Not necessarily. When did you initially set up that cluster? Normally the > > file > > should get zeroed out before its being used. If the cluster was copied > > improperly (i.e. no pg_start/stop backup or such) it could easily happen. > > But > > I would guess that you did it properly? > > That server was master before planned switchover to replica server was > performed. But it was never started up before, do I understand that correctly? Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
I believe I just hit this same issue, but with PG 9.1.3:

<@:32407> 2012-08-29 10:02:09 UTC LOG: shutting down
<@:32407> 2012-08-29 10:02:09 UTC LOG: database system is shut down
<[unknown]@[unknown]:31687> 2012-08-29 13:34:03 UTC LOG: connection received: host=[local]
<[unknown]@[unknown]:31687> 2012-08-29 13:34:03 UTC LOG: incomplete startup packet
<@:31686> 2012-08-29 13:34:03 UTC LOG: database system was interrupted; last known up at 2012-08-29 13:14:47 UTC
<@:31686> 2012-08-29 13:34:03 UTC LOG: entering standby mode
<@:31686> 2012-08-29 13:34:03 UTC LOG: redo starts at A92/5F000020
<@:31686> 2012-08-29 13:34:03 UTC FATAL: could not access status of transaction 208177034
<@:31686> 2012-08-29 13:34:03 UTC DETAIL: Could not read from file "pg_multixact/offsets/0C68" at offset 131072: Success.
<@:31686> 2012-08-29 13:34:03 UTC CONTEXT: xlog redo create multixact 208177034 offset 1028958730: 1593544329 1593544330
<@:31681> 2012-08-29 13:34:03 UTC LOG: startup process (PID 31686) exited with exit code 1
<@:31681> 2012-08-29 13:34:03 UTC LOG: terminating any other active server processes

This was while attempting to rebuild a hot standby after switching my master
to a new server. In between the shutdown and the attempt to restart:

- The master was put into backup mode.
- The datadir was rsynced over, using rsync -ahhP --delete-before --exclude=postmaster.pid --exclude=pg_xlog
- The master was taken out of backup mode.
- The pg_xlog directory was emptied.
- The pg_xlog directory was rsynced across from the master. This included all the WAL files from before the promotion, throughout backup mode, and a few from after backup mode was left.
- recovery.conf was created (hot_standby=on, primary_conninfo=..., recovery_target_timeline=latest)
- The hot standby was started.

Unfortunately I didn't have the disk space around to keep everything, and am
attempting the rebuild again now.

--
Stuart Bishop <stuart@stuartbishop.net>   http://www.stuartbishop.net/
On Wednesday, August 29, 2012 05:32:31 PM Stuart Bishop wrote:
> I believe I just hit this same issue, but with PG 9.1.3:
>
> <@:32407> 2012-08-29 10:02:09 UTC LOG: shutting down
> <@:32407> 2012-08-29 10:02:09 UTC LOG: database system is shut down
> <[unknown]@[unknown]:31687> 2012-08-29 13:34:03 UTC LOG: connection
> received: host=[local]
> <[unknown]@[unknown]:31687> 2012-08-29 13:34:03 UTC LOG: incomplete
> startup packet
> <@:31686> 2012-08-29 13:34:03 UTC LOG: database system was
> interrupted; last known up at 2012-08-29 13:14:47 UTC
> <@:31686> 2012-08-29 13:34:03 UTC LOG: entering standby mode
> <@:31686> 2012-08-29 13:34:03 UTC LOG: redo starts at A92/5F000020
> <@:31686> 2012-08-29 13:34:03 UTC FATAL: could not access status of
> transaction 208177034
> <@:31686> 2012-08-29 13:34:03 UTC DETAIL: Could not read from file
> "pg_multixact/offsets/0C68" at offset 131072: Success.
> <@:31686> 2012-08-29 13:34:03 UTC CONTEXT: xlog redo create multixact
> 208177034 offset 1028958730: 1593544329 1593544330
> <@:31681> 2012-08-29 13:34:03 UTC LOG: startup process (PID 31686)
> exited with exit code 1
> <@:31681> 2012-08-29 13:34:03 UTC LOG: terminating any other active
> server processes
>
> This was attempting to rebuild a hot standby after switching my master
> to a new server. In between the shutdown and the attempt to restart:
>
> - The master was put into backup mode.
> - The datadir was rsynced over, using rsync -ahhP --delete-before
> --exclude=postmaster.pid --exclude=pg_xlog
> - The master was taken out of backup mode.
> - The pg_xlog directory was emptied
> - The pg_xlog directory was rsynced across from the master. This
> included all the WAL files from before the promotion, throughout
> backup mode, and a few from after backup mode was left.

That's not valid; you cannot easily guarantee that you've not copied files
that were in the process of being written to.
Use a restore_command if you do not want all files to be transferred via
the replication connection. But do that only for files that have been
archived via an archive_command beforehand.

Did you have a backup_label in the rsync'ed datadir? In Maxim's case I
could detect that he had not via line numbers, but I do not see them
here...

Greetings,

Andres
--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
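[Editor's sketch] A minimal recovery.conf along those lines might look like the following. The host, user, and archive path are placeholder values, not taken from this thread, and the restore_command is assumed to fetch only WAL segments previously archived by archive_command on the master:

```
standby_mode = 'on'
primary_conninfo = 'host=NEW_MASTER port=5432 user=replica'
# Fetch only WAL segments already archived by archive_command on the master:
restore_command = 'cp /path/to/wal_archive/%f %p'
recovery_target_timeline = 'latest'
```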
On Wed, Aug 29, 2012 at 10:59 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On Wednesday, August 29, 2012 05:32:31 PM Stuart Bishop wrote: >> I believe I just hit this same issue, but with PG 9.1.3: >> >> <@:32407> 2012-08-29 10:02:09 UTC LOG: shutting down >> <@:32407> 2012-08-29 10:02:09 UTC LOG: database system is shut down >> <[unknown]@[unknown]:31687> 2012-08-29 13:34:03 UTC LOG: connection >> received: host=[local] >> <[unknown]@[unknown]:31687> 2012-08-29 13:34:03 UTC LOG: incomplete >> startup packet >> <@:31686> 2012-08-29 13:34:03 UTC LOG: database system was >> interrupted; last known up at 2012-08-29 13:14:47 UTC >> <@:31686> 2012-08-29 13:34:03 UTC LOG: entering standby mode >> <@:31686> 2012-08-29 13:34:03 UTC LOG: redo starts at A92/5F000020 >> <@:31686> 2012-08-29 13:34:03 UTC FATAL: could not access status of >> transaction 208177034 >> <@:31686> 2012-08-29 13:34:03 UTC DETAIL: Could not read from file >> "pg_multixact/offsets/0C68" at offset 131072: Success. >> <@:31686> 2012-08-29 13:34:03 UTC CONTEXT: xlog redo create multixact >> 208177034 offset 1028958730: 1593544329 1593544330 >> <@:31681> 2012-08-29 13:34:03 UTC LOG: startup process (PID 31686) >> exited with exit code 1 >> <@:31681> 2012-08-29 13:34:03 UTC LOG: terminating any other active >> server processes >> >> This was attempting to rebuild a hot standby after switching my master >> to a new server. In between the shutdown and the attempt to restart: >> >> - The master was put into backup mode. >> - The datadir was rsynced over, using rsync -ahhP --delete-before >> --exclude=postmaster.pid --exclude=pg_xlog >> - The master was taken out of backup mode. >> - The pg_xlog directory was emptied >> - The pg_xlog directory was rsynced across from the master. This >> included all the WAL files from before the promotion, throughout >> backup mode, and a few from after backup mode was left. 
> Thats not valid, you cannot easily guarantee that youve not copied files
> that were in the progress of being written to. Use a recovery_command if
> you do not want all files to be transferred via the replication
> connection. But do that only for files that have been archived via an
> archive_command beforehand.

Ok. I had assumed this was fine, as the docs explicitly tell me to copy
across any unarchived WAL files when doing failover. I think my confusion is
because the docs for building a standby refer to the section on recovering
from a backup, but I have a live server.

I'll just let the WAL files get sucked over the replication connection if
that works - this seems much simpler. I don't think I saw this mentioned in
the docs. I had been assuming enough WAL needed to be available to bring the
DB up to a consistent state before streaming replication would start.

> Did you have a backup label in the rsync'ed datadir? In Maxim's case I
> could detect that he had not via line numbers, but I do not see them
> here...

Yes, the backup_label copied across (confirmed in scrollback from the
rsync).

--
Stuart Bishop <stuart@stuartbishop.net>   http://www.stuartbishop.net/