Discussion: How to continue streaming replication after this error?
Hi,

one of our streaming replicas died with

2014-02-21 05:17:10 UTC PANIC: heap2_redo: unknown op code 32
2014-02-21 05:17:10 UTC CONTEXT: xlog redo UNKNOWN
2014-02-21 05:17:11 UTC LOG: startup process (PID 1060) was terminated by signal 6: Aborted
2014-02-21 05:17:11 UTC LOG: terminating any other active server processes
2014-02-21 05:17:11 UTC WARNING: terminating connection because of crash of another server process
2014-02-21 05:17:11 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2014-02-21 05:17:11 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.

Now, if I try to restart it, I get this:

The PostgreSQL server failed to start. Please check the log output:
2014-02-21 07:42:53 UTC LOG: database system was interrupted while in recovery at log time 2014-02-21 05:02:45 UTC
2014-02-21 07:42:53 UTC HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2014-02-21 07:42:53 UTC LOG: incomplete startup packet
2014-02-21 07:42:53 UTC LOG: entering standby mode
2014-02-21 07:42:53 UTC LOG: redo starts at 11C/B2211778
2014-02-21 07:42:53 UTC FATAL: the database system is starting up
2014-02-21 07:42:54 UTC LOG: consistent recovery state reached at 11C/B4234108
2014-02-21 07:42:54 UTC LOG: database system is ready to accept read only connections
2014-02-21 07:42:54 UTC PANIC: heap2_redo: unknown op code 32
2014-02-21 07:42:54 UTC CONTEXT: xlog redo UNKNOWN
2014-02-21 07:42:54 UTC LOG: startup process (PID 38187) was terminated by signal 6: Aborted
2014-02-21 07:42:54 UTC LOG: terminating any other active server processes

This is 9.3.2. What is the supposed way to continue replication? Or do I need to start from a fresh base backup?

Thanks,
Torsten
On 21/02/14 09:17, Torsten Förtsch wrote:
> one of our streaming replicas died with
>
> 2014-02-21 05:17:10 UTC PANIC: heap2_redo: unknown op code 32
> 2014-02-21 05:17:10 UTC CONTEXT: xlog redo UNKNOWN
> 2014-02-21 05:17:11 UTC LOG: startup process (PID 1060) was terminated
> by signal 6: Aborted
> 2014-02-21 05:17:11 UTC LOG: terminating any other active server processes
> 2014-02-21 05:17:11 UTC WARNING: terminating connection because of
> crash of another server process
> 2014-02-21 05:17:11 UTC DETAIL: The postmaster has commanded this
> server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2014-02-21 05:17:11 UTC HINT: In a moment you should be able to
> reconnect to the database and repeat your command.

Any idea what that means?

I have got a second replica dying with the same symptoms.

Thanks,
Torsten
On Sat, Feb 22, 2014 at 1:21 PM, Torsten Förtsch <torsten.foertsch@gmx.net> wrote:
> On 21/02/14 09:17, Torsten Förtsch wrote:
> > one of our streaming replicas died with
> >
> > 2014-02-21 05:17:10 UTC PANIC: heap2_redo: unknown op code 32
> > 2014-02-21 05:17:10 UTC CONTEXT: xlog redo UNKNOWN
> > 2014-02-21 05:17:11 UTC LOG: startup process (PID 1060) was terminated
> > by signal 6: Aborted
> > 2014-02-21 05:17:11 UTC LOG: terminating any other active server processes
> > 2014-02-21 05:17:11 UTC WARNING: terminating connection because of
> > crash of another server process
> > 2014-02-21 05:17:11 UTC DETAIL: The postmaster has commanded this
> > server process to roll back the current transaction and exit, because
> > another server process exited abnormally and possibly corrupted shared
> > memory.
> > 2014-02-21 05:17:11 UTC HINT: In a moment you should be able to
> > reconnect to the database and repeat your command.
>
> Any idea what that means?
>
> I have got a second replica dying with the same symptoms.

The xlog record seems to be corrupted. Op code 32 represents XLOG_HEAP2_FREEZE_PAGE, and the code to handle it exists.
I don't know why the system is not able to recognize the op code. Can you try running pg_xlogdump on the corrupted WAL file?
Keep the data folder for investigating the problem. As it seems to be some kind of corruption, you need to take a fresh base backup to continue.
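
To make the op code discussion concrete, here is a rough Python model of how a redo routine might pick the operation out of a WAL record's info byte. It is a simplified sketch, not PostgreSQL's actual code: the mask value is taken to be XLOG_HEAP_OPMASK from heapam_xlog.h, only the one op code named in this thread is listed, and the two lookup tables and the function name are hypothetical.

```python
# Mask assumed to match XLOG_HEAP_OPMASK in heapam_xlog.h: it keeps the
# operation bits of the info byte and drops the other flag bits.
XLOG_HEAP_OPMASK = 0x70

# Hypothetical lookup tables for illustration. Only the op code named in
# this thread is filled in; the rest is left out rather than guessed.
OPS_KNOWN_TO_NEWER_BINARY = {0x20: "XLOG_HEAP2_FREEZE_PAGE"}
OPS_KNOWN_TO_OLDER_BINARY = {}  # assumption: no handler for 0x20


def describe_heap2_op(info_byte, known_ops):
    """Return the op name, or the kind of message seen in the logs."""
    op = info_byte & XLOG_HEAP_OPMASK
    name = known_ops.get(op)
    if name is None:
        return "heap2_redo: unknown op code %d" % op
    return name


if __name__ == "__main__":
    # 32 decimal is 0x20, the value reported in the PANIC line.
    print(describe_heap2_op(32, OPS_KNOWN_TO_OLDER_BINARY))
    print(describe_heap2_op(32, OPS_KNOWN_TO_NEWER_BINARY))
```

The point of the sketch is only that the same record can be valid for one binary and unknown to another, depending on which op codes that binary has handlers for.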
Regards,
Hari Babu
Fujitsu Australia
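
For anyone wanting to follow the pg_xlogdump suggestion, the sketch below assembles a command line that starts at the "redo starts at" position from the log and shows only Heap2 records. The WAL directory is a hypothetical placeholder, the helper name is made up for illustration, and the record limit is arbitrary; the -p, -s, -r and -n options are the ones described in the pg_xlogdump documentation.

```python
import subprocess


def build_xlogdump_cmd(wal_dir, start_lsn, rmgr="Heap2", limit=20):
    """Build an argv list for pg_xlogdump (helper name is hypothetical)."""
    return [
        "pg_xlogdump",
        "-p", wal_dir,     # directory holding the WAL segment files
        "-s", start_lsn,   # position to start reading from
        "-r", rmgr,        # show only records of this resource manager
        "-n", str(limit),  # stop after this many records
    ]


if __name__ == "__main__":
    # The path is a placeholder; adjust it to the replica's pg_xlog directory.
    cmd = build_xlogdump_cmd("/path/to/data/pg_xlog", "11C/B2211778")
    print(" ".join(cmd))
    # Uncomment to actually run it on the replica host:
    # subprocess.run(cmd, check=True)
```

Note that a pg_xlogdump binary from the same release as the failing replica may not be able to name the record either, so it is probably worth running the one that matches the master's version.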
On 22/02/14 03:21, Torsten Förtsch wrote:
>> 2014-02-21 05:17:10 UTC PANIC: heap2_redo: unknown op code 32
>> 2014-02-21 05:17:10 UTC CONTEXT: xlog redo UNKNOWN
>> 2014-02-21 05:17:11 UTC LOG: startup process (PID 1060) was terminated
>> by signal 6: Aborted
>> 2014-02-21 05:17:11 UTC LOG: terminating any other active server processes
>> 2014-02-21 05:17:11 UTC WARNING: terminating connection because of
>> crash of another server process
>> 2014-02-21 05:17:11 UTC DETAIL: The postmaster has commanded this
>> server process to roll back the current transaction and exit, because
>> another server process exited abnormally and possibly corrupted shared
>> memory.
>> 2014-02-21 05:17:11 UTC HINT: In a moment you should be able to
>> reconnect to the database and repeat your command.
> Any idea what that means?

Updating the replica to 9.3.3 cured it. The master was already on 9.3.3.

Torsten
On Mon, Feb 24, 2014 at 12:23 PM, Torsten Förtsch <torsten.foertsch@gmx.net> wrote:
> On 22/02/14 03:21, Torsten Förtsch wrote:
> > Any idea what that means?
> Updating the replica to 9.3.3 cured it. The master was already on 9.3.3.

9.3.3 has introduced some new configuration parameters. So you actually need to update a slave before the master, otherwise replication breaks.
Michael
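
The upgrade-order advice can be turned into a small pre-flight check. The following is a minimal sketch that assumes the pre-10 numbering scheme used in this thread (major.major.minor, e.g. 9.3.3) and that the version strings were collected by hand from each server; the function names are hypothetical.

```python
def parse_version(version):
    """Turn a string such as '9.3.3' into a tuple of ints, e.g. (9, 3, 3)."""
    return tuple(int(part) for part in version.split("."))


def standby_is_behind(master_version, standby_version):
    """Return True if the standby runs an older minor release than the master.

    Assumes pre-10 numbering, where the first two components form the
    major version and must match for streaming replication.
    """
    master = parse_version(master_version)
    standby = parse_version(standby_version)
    if master[:2] != standby[:2]:
        raise ValueError("master and standby are on different major versions")
    return standby < master


if __name__ == "__main__":
    # The situation from this thread: master on 9.3.3, replica on 9.3.2.
    if standby_is_behind("9.3.3", "9.3.2"):
        print("standby is older than the master: update the standby")
```

Running such a check before updating the master would flag the combination that caused the PANIC here.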