Thread: Failover Testing Failures: invalid resource manager ID in primary checkpoint record


Failover Testing Failures: invalid resource manager ID in primary checkpoint record

From: Don Seiler
Date:
PostgreSQL 12.13 (PGDG packages) in a streaming replication configuration. pgBackRest 2.43 is used for WAL archiving and database backups to cloud storage.

I'm testing and documenting a DR exercise process where I:
  1. Cleanly shut down PG on the primary.
  2. Promote the PG DR replica.
  3. Place the standby.signal file on the old primary and start it up (this presumes no other configuration needs changing; primary_conninfo etc. were already set). A rough sketch of these commands is below.
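
For illustration, roughly what those steps look like as commands, assuming the Debian/Ubuntu PGDG layout (the cluster name "12 main" and the paths are examples):

# 1. Clean shutdown of the old primary; a fast shutdown writes a final
#    shutdown checkpoint before stopping.
pg_ctlcluster 12 main stop -m fast

# 2. Promote the DR replica (run on the replica host). This ends recovery
#    and switches the replica to a new timeline.
/usr/lib/postgresql/12/bin/pg_ctl promote -D /var/lib/postgresql/12/main

# 3. On the old primary: create standby.signal BEFORE starting the service,
#    so it comes up as a standby rather than read-write.
touch /var/lib/postgresql/12/main/standby.signal
pg_ctlcluster 12 main start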
My hope is that I could just start the old primary as the new replica if it was cleanly shut down prior to promoting the replica. However, when I try to start up that new replica, I'm met with:

LOG:  restored log file "00000002000000B70000005A" from archive
LOG:  invalid resource manager ID in primary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 17660) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
LOG:  database system is shut down


It doesn't appear that any WAL files are missing, since it finds every file it asks for. Am I missing a piece here?

My hope is to avoid having to do a restore to rebuild the new replica.

An aside, for those who may be asking: most of these databases do not have data checksums enabled, so pg_rewind isn't in the picture. Although I'm reading now that we could enable the wal_log_hints parameter as an alternative. I'm leery of the overhead, but if it's the same overhead we'd incur with data checksums, then I guess nothing is lost when we eventually enable them.
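
For what it's worth, a minimal sketch of flipping that parameter (it is restart-only, and needs to be on for the cluster you would later rewind, before the timelines diverge):

# Enable hint-bit WAL logging so pg_rewind can work without data checksums.
psql -c "ALTER SYSTEM SET wal_log_hints = on;"
pg_ctlcluster 12 main restart
psql -Atc "SHOW wal_log_hints;"   # expect: on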

--
Don Seiler
www.seiler.us

Re: Failover Testing Failures: invalid resource manager ID in primary checkpoint record

From: Laurenz Albe
Date:
On Wed, 2023-01-18 at 17:47 -0600, Don Seiler wrote:
> PostgreSQL 12.13 (PGDG packages) in a streaming replication configuration.
> pgBackrest 2.43 used for WAL archiving and DB backups to cloud storage
>
> I'm testing and documenting a DR exercise process where I:
>    1. Cleanly shutdown PG on the primary
>    2. Promote the PG DR replica
>    3. Place the standby.signal file on the old primary and start it up
>       (presumes no other configurations need changing, primary_conninfo
>       etc were already set).
> My hope is I could just start the old primary / new replica if it was
> cleanly shutdown prior to promoting the replica.

Yes, that should work.

> However when I try to start up that new replica, I'm met with:
>
> LOG:  restored log file "00000002000000B70000005A" from archive
> LOG:  invalid resource manager ID in primary checkpoint record
> PANIC:  could not locate a valid checkpoint record
> LOG:  startup process (PID 17660) was terminated by signal 6: Aborted
> LOG:  aborting startup due to startup process failure
> LOG:  database system is shut down

What do you get for "pg_controldata -D datadir | grep checkpoint"?

Can you show the complete log from the point the server started?

Yours,
Laurenz Albe



Re: Failover Testing Failures: invalid resource manager ID in primary checkpoint record

From: Don Seiler
Date:
On Thu, Jan 19, 2023 at 6:44 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:


> What do you get for "pg_controldata -D datadir | grep checkpoint"?

$ /usr/lib/postgresql/12/bin/pg_controldata -D /var/lib/postgresql/12/main | grep checkpoint
Latest checkpoint location:           B7/5A000028
Latest checkpoint's REDO location:    B7/5A000028
Latest checkpoint's REDO WAL file:    00000001000000B70000005A
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:746
Latest checkpoint's NextOID:          16616
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        479
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Wed 18 Jan 2023 11:08:00 PM UTC

 
That 5A file restored from the archive is from timeline 2 on the new primary, while pg_controldata above still points at the timeline 1 checkpoint. The new primary started with 59.partial on TL1 and then 59 on TL2. I can get those logs as well.
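
If it helps, one hedged way to inspect what actually sits at that checkpoint LSN in the restored segment (this assumes the segment is still present in pg_wal; paths are examples):

# Dump a few records starting at the LSN that pg_controldata reports.
# Against the timeline-2 copy of segment 5A this should show whether the
# record there is the timeline-1 shutdown checkpoint the control file expects.
/usr/lib/postgresql/12/bin/pg_waldump \
  --start=B7/5A000028 --limit=5 \
  /var/lib/postgresql/12/main/pg_wal/00000002000000B70000005A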

> Can you show the complete log from the point the server started?

I've sanitized hostnames etc., but it should still be pretty clear. I just noticed that there was actually another startup before what I thought was the initial startup, so I've included the PG log from the point where I shut down the old primary, followed by the startups after it became a replica. I'm going to have to review what Chef might have done. I was relying on Chef to deploy the configs before attempting to restart, but it may have started the service early: that first startup doesn't look like it was as a standby, but then there is a restart that begins requesting files and enters standby mode.

2023-01-18 22:56:18.913 UTC [3716] LOG:  received fast shutdown request
2023-01-18 22:56:18.917 UTC [3716] LOG:  aborting any active transactions
2023-01-18 22:56:18.917 UTC [4029] postgres@dtsdb - [unknown] FATAL:  terminating connection due to administrator command
2023-01-18 22:56:18.917 UTC [4028] postgres@postgres - [unknown] FATAL:  terminating connection due to administrator command
2023-01-18 22:56:18.918 UTC [4029] postgres@dtsdb - [unknown] LOG:  disconnection: session time: 1:16:10.946 user=postgres database=dtsdb host=[local]
2023-01-18 22:56:18.920 UTC [3716] LOG:  background worker "logical replication launcher" (PID 3725) exited with exit code 1
2023-01-18 22:56:18.920 UTC [4028] postgres@postgres - [unknown] LOG:  disconnection: session time: 1:16:10.949 user=postgres database=postgres host=[local]
2023-01-18 22:56:18.922 UTC [3719] LOG:  shutting down
2023-01-18 22:56:18.971 UTC [3719] LOG:  checkpoint starting: shutdown immediate
2023-01-18 22:56:18.989 P00   INFO: archive-push command begin 2.43: [pg_wal/00000001000000B700000058] --archive-async --compress-type=lz4 --exec-id=1220-855f7a50 --log-level-console=info --log-level-file=detail --pg1-path=/var/lib/postgresql/12/main --process-max=3 --repo1-azure-account=<redacted> --repo1-azure-container=backups --repo1-azure-key=<redacted> --repo1-cipher-pass=<redacted> --repo1-cipher-type=aes-256-cbc --repo1-path=/ --repo1-type=azure --spool-path=/var/lib/pg_wal/12/main/spool --stanza=dtsdb
2023-01-18 22:56:19.017 UTC [3719] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 2 recycled; write=0.009 s, sync=0.001 s, total=0.050 s; sync files=0, longest=0.000 s, average=0.000 s; distance=49151 kB, estimate=49151 kB
2023-01-18 22:56:19.059 UTC [1226] [unknown]@[unknown] - [unknown] LOG:  connection received: host=172.18.11.102 port=60598
2023-01-18 22:56:19.149 UTC [4127] replication@[unknown] - dtsdb LOG:  disconnection: session time: 1:15:54.644 user=replication database= host=X.Y.4.52 port=65174
2023-01-18 22:56:19.240 UTC [3851] replication@[unknown] - dtsdb LOG:  disconnection: session time: 1:16:23.520 user=replication database= host=X.Y.6.26 port=26380
2023-01-18 22:56:19.393 P00   INFO: pushed WAL file '00000001000000B700000058' to the archive asynchronously
2023-01-18 22:56:19.393 P00   INFO: archive-push command end: completed successfully (405ms)
2023-01-18 22:56:19.419 UTC [3716] LOG:  database system is shut down
2023-01-18 23:07:52.344 UTC [7251] LOG:  database system was shut down at 2023-01-18 22:56:18 UTC
2023-01-18 23:07:52.355 UTC [7249] LOG:  database system is ready to accept connections
2023-01-18 23:07:52.856 UTC [7261] [unknown]@[unknown] - [unknown] LOG:  connection received: host=X.Y.6.26 port=49638
2023-01-18 23:07:52.862 UTC [7261] replication@[unknown] - [unknown] LOG:  replication connection authorized: user=replication application_name=dtsdb SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
2023-01-18 23:07:52.864 UTC [7261] replication@[unknown] - dtsdb LOG:  disconnection: session time: 0:00:00.007 user=replication database= host=X.Y.6.26 port=49638
2023-01-18 23:07:52.957 UTC [7262] [unknown]@[unknown] - [unknown] LOG:  connection received: host=[local]
2023-01-18 23:07:53.468 UTC [7265] [unknown]@[unknown] - [unknown] LOG:  connection received: host=[local]
2023-01-18 23:07:53.469 UTC [7265] postgres@template1 - [unknown] LOG:  connection authorized: user=postgres database=template1 application_name=psql
2023-01-18 23:07:53.473 UTC [7265] postgres@template1 - psql LOG:  disconnection: session time: 0:00:00.005 user=postgres database=template1 host=[local]
2023-01-18 23:07:53.984 UTC [7270] [unknown]@[unknown] - [unknown] LOG:  connection received: host=[local]
2023-01-18 23:07:53.986 UTC [7270] postgres@template1 - [unknown] LOG:  connection authorized: user=postgres database=template1 application_name=psql
2023-01-18 23:07:53.987 UTC [7270] postgres@template1 - psql LOG:  disconnection: session time: 0:00:00.002 user=postgres database=template1 host=[local]
2023-01-18 23:07:54.307 UTC [7271] [unknown]@[unknown] - [unknown] LOG:  connection received: host=[local]
2023-01-18 23:07:54.308 UTC [7272] [unknown]@[unknown] - [unknown] LOG:  connection received: host=[local]
2023-01-18 23:07:54.309 UTC [7271] postgres@dtsdb - [unknown] LOG:  connection authorized: user=postgres database=dtsdb
2023-01-18 23:07:54.309 UTC [7272] postgres@postgres - [unknown] LOG:  connection authorized: user=postgres database=postgres
2023-01-18 23:07:54.499 UTC [7275] [unknown]@[unknown] - [unknown] LOG:  connection received: host=[local]
2023-01-18 23:07:54.500 UTC [7275] postgres@template1 - [unknown] LOG:  connection authorized: user=postgres database=template1 application_name=psql
2023-01-18 23:07:54.501 UTC [7275] postgres@template1 - psql LOG:  disconnection: session time: 0:00:00.002 user=postgres database=template1 host=[local]
2023-01-18 23:07:54.608 UTC [7306] [unknown]@[unknown] - [unknown] LOG:  connection received: host=[local]
2023-01-18 23:07:54.610 UTC [7306] postgres@postgres - [unknown] LOG:  connection authorized: user=postgres database=postgres application_name=psql
2023-01-18 23:07:54.612 UTC [7306] postgres@postgres - psql LOG:  disconnection: session time: 0:00:00.004 user=postgres database=postgres host=[local]
2023-01-18 23:07:54.670 UTC [7317] [unknown]@[unknown] - [unknown] LOG:  connection received: host=[local]
2023-01-18 23:07:54.672 UTC [7317] postgres@postgres - [unknown] LOG:  connection authorized: user=postgres database=postgres application_name=psql
2023-01-18 23:07:54.684 UTC [7317] postgres@postgres - psql LOG:  disconnection: session time: 0:00:00.014 user=postgres database=postgres host=[local]
2023-01-18 23:07:54.804 UTC [7335] [unknown]@[unknown] - [unknown] LOG:  connection received: host=[local]
2023-01-18 23:07:54.805 UTC [7335] postgres@postgres - [unknown] LOG:  connection authorized: user=postgres database=postgres application_name=psql
2023-01-18 23:07:54.808 UTC [7335] postgres@postgres - psql LOG:  disconnection: session time: 0:00:00.003 user=postgres database=postgres host=[local]
2023-01-18 23:07:54.914 UTC [7336] [unknown]@[unknown] - [unknown] LOG:  connection received: host=172.18.11.102 port=46394
2023-01-18 23:07:57.207 UTC [7567] [unknown]@[unknown] - [unknown] LOG:  connection received: host=127.0.0.1 port=54828
2023-01-18 23:07:57.281 UTC [7578] [unknown]@[unknown] - [unknown] LOG:  connection received: host=[local]
2023-01-18 23:07:57.283 UTC [7578] postgres@postgres - [unknown] LOG:  connection authorized: user=postgres database=postgres application_name=psql
2023-01-18 23:07:57.285 UTC [7578] postgres@postgres - psql LOG:  disconnection: session time: 0:00:00.004 user=postgres database=postgres host=[local]
2023-01-18 23:07:57.309 UTC [7580] [unknown]@[unknown] - [unknown] LOG:  connection received: host=127.0.0.1 port=54830
2023-01-18 23:07:57.377 UTC [7591] [unknown]@[unknown] - [unknown] LOG:  connection received: host=[local]
2023-01-18 23:07:57.378 UTC [7591] postgres@postgres - [unknown] LOG:  connection authorized: user=postgres database=postgres application_name=psql
2023-01-18 23:07:57.381 UTC [7591] postgres@postgres - psql LOG:  disconnection: session time: 0:00:00.004 user=postgres database=postgres host=[local]
2023-01-18 23:07:57.858 UTC [7594] [unknown]@[unknown] - [unknown] LOG:  connection received: host=X.Y.6.26 port=16724
2023-01-18 23:07:57.866 UTC [7594] replication@[unknown] - [unknown] LOG:  replication connection authorized: user=replication application_name=dtsdb SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
2023-01-18 23:07:57.867 UTC [7594] replication@[unknown] - dtsdb LOG:  disconnection: session time: 0:00:00.009 user=replication database= host=X.Y.6.26 port=16724
2023-01-18 23:07:59.937 UTC [7249] LOG:  received fast shutdown request
2023-01-18 23:07:59.942 UTC [7249] LOG:  aborting any active transactions
2023-01-18 23:07:59.942 UTC [7272] postgres@postgres - [unknown] FATAL:  terminating connection due to administrator command
2023-01-18 23:07:59.942 UTC [7271] postgres@dtsdb - [unknown] FATAL:  terminating connection due to administrator command
2023-01-18 23:07:59.943 UTC [7272] postgres@postgres - [unknown] LOG:  disconnection: session time: 0:00:05.634 user=postgres database=postgres host=[local]
2023-01-18 23:07:59.943 UTC [7271] postgres@dtsdb - [unknown] LOG:  disconnection: session time: 0:00:05.635 user=postgres database=dtsdb host=[local]
2023-01-18 23:07:59.945 UTC [7249] LOG:  background worker "logical replication launcher" (PID 7258) exited with exit code 1
2023-01-18 23:07:59.949 UTC [7252] LOG:  shutting down
2023-01-18 23:08:00.082 UTC [7252] LOG:  checkpoint starting: shutdown immediate
2023-01-18 23:08:00.119 UTC [7252] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.010 s, sync=0.001 s, total=0.044 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16384 kB, estimate=16384 kB
2023-01-18 23:08:00.124 UTC [7249] LOG:  database system is shut down
2023-01-18 23:08:00.298 UTC [7844] LOG:  database system was shut down at 2023-01-18 23:08:00 UTC
2023-01-18 23:08:00.312 P00   INFO: archive-get command begin 2.43: [00000002.history, pg_wal/RECOVERYHISTORY] --archive-async --exec-id=7846-ba866c3e --log-level-console=info --log-level-file=detail --pg2-host=dtsdb0-foo.xxx.com --pg1-path=/var/lib/postgresql/12/main --pg2-path=/var/lib/postgresql/12/main --process-max=4 --repo1-azure-account=<redacted> --repo1-azure-container=backups --repo1-azure-key=<redacted> --repo1-cipher-pass=<redacted> --repo1-cipher-type=aes-256-cbc --repo1-path=/ --repo1-type=azure --spool-path=/var/lib/pg_wal/12/main/spool --stanza=dtsdb
2023-01-18 23:08:00.337 P00   INFO: found 00000002.history in the repo1: 12-1 archive
2023-01-18 23:08:00.337 P00   INFO: archive-get command end: completed successfully (27ms)
2023-01-18 23:08:00.339 UTC [7844] LOG:  restored log file "00000002.history" from archive
2023-01-18 23:08:00.350 P00   INFO: archive-get command begin 2.43: [00000003.history, pg_wal/RECOVERYHISTORY] --archive-async --exec-id=7848-0ab50415 --log-level-console=info --log-level-file=detail --pg2-host=dtsdb0-foo.xxx.com --pg1-path=/var/lib/postgresql/12/main --pg2-path=/var/lib/postgresql/12/main --process-max=4 --repo1-azure-account=<redacted> --repo1-azure-container=backups --repo1-azure-key=<redacted> --repo1-cipher-pass=<redacted> --repo1-cipher-type=aes-256-cbc --repo1-path=/ --repo1-type=azure --spool-path=/var/lib/pg_wal/12/main/spool --stanza=dtsdb
2023-01-18 23:08:00.375 P00   INFO: unable to find 00000003.history in the archive
2023-01-18 23:08:00.375 P00   INFO: archive-get command end: completed successfully (26ms)
2023-01-18 23:08:00.377 UTC [7844] LOG:  entering standby mode
2023-01-18 23:08:00.389 P00   INFO: archive-get command begin 2.43: [00000002.history, pg_wal/RECOVERYHISTORY] --archive-async --exec-id=7850-17f50374 --log-level-console=info --log-level-file=detail --pg2-host=dtsdb0-foo.xxx.com --pg1-path=/var/lib/postgresql/12/main --pg2-path=/var/lib/postgresql/12/main --process-max=4 --repo1-azure-account=<redacted> --repo1-azure-container=backups --repo1-azure-key=<redacted> --repo1-cipher-pass=<redacted> --repo1-cipher-type=aes-256-cbc --repo1-path=/ --repo1-type=azure --spool-path=/var/lib/pg_wal/12/main/spool --stanza=dtsdb
2023-01-18 23:08:00.412 P00   INFO: found 00000002.history in the repo1: 12-1 archive
2023-01-18 23:08:00.412 P00   INFO: archive-get command end: completed successfully (24ms)
2023-01-18 23:08:00.414 UTC [7844] LOG:  restored log file "00000002.history" from archive
2023-01-18 23:08:00.435 P00   INFO: archive-get command begin 2.43: [00000002000000B70000005A, pg_wal/RECOVERYXLOG] --archive-async --exec-id=7852-6759d81b --log-level-console=info --log-level-file=detail --pg2-host=dtsdb0-foo.xxx.com --pg1-path=/var/lib/postgresql/12/main --pg2-path=/var/lib/postgresql/12/main --process-max=4 --repo1-azure-account=<redacted> --repo1-azure-container=backups --repo1-azure-key=<redacted> --repo1-cipher-pass=<redacted> --repo1-cipher-type=aes-256-cbc --repo1-path=/ --repo1-type=azure --spool-path=/var/lib/pg_wal/12/main/spool --stanza=dtsdb
2023-01-18 23:08:01.139 P00   INFO: found 00000002000000B70000005A in the archive asynchronously
2023-01-18 23:08:01.140 P00   INFO: archive-get command end: completed successfully (706ms)
2023-01-18 23:08:01.141 UTC [7844] LOG:  restored log file "00000002000000B70000005A" from archive
2023-01-18 23:08:01.275 UTC [7844] LOG:  invalid resource manager ID in primary checkpoint record
2023-01-18 23:08:01.275 UTC [7844] PANIC:  could not locate a valid checkpoint record
2023-01-18 23:08:02.222 UTC [7842] LOG:  startup process (PID 7844) was terminated by signal 6: Aborted
2023-01-18 23:08:02.222 UTC [7842] LOG:  aborting startup due to startup process failure
2023-01-18 23:08:02.223 UTC [7842] LOG:  database system is shut down


--
Don Seiler
www.seiler.us

Re: Failover Testing Failures: invalid resource manager ID in primary checkpoint record

From: Don Seiler
Date:
On Thu, Jan 19, 2023 at 9:50 AM Don Seiler <don@seiler.us> wrote:
> I'm going to have to review what Chef might have done. I was relying on Chef to deploy the configs before attempting to restart, but it may have started the service early

Reviewing the Chef recipe, this does seem to be the case: the code that decides whether or not to place the standby.signal file comes after the recipe already attempts to start the PG service. Another self-inflicted wound, apparently.
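
For reference, a minimal shell sketch of the ordering the recipe needs (the role check is a hypothetical stand-in for however the recipe decides primary vs. replica):

# Decide the role and drop standby.signal FIRST...
if [ "$PG_ROLE" = "replica" ]; then    # $PG_ROLE is a hypothetical variable
    touch /var/lib/postgresql/12/main/standby.signal
fi
# ...and only then start the service. Starting first lets the old primary
# come up read-write and write a new checkpoint past the point where the
# new primary's timeline diverged.
pg_ctlcluster 12 main start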

Don.
--
Don Seiler
www.seiler.us

Re: Failover Testing Failures: invalid resource manager ID in primary checkpoint record

From: Don Seiler
Date:
On Thu, Jan 19, 2023 at 9:57 AM Don Seiler <don@seiler.us> wrote:
> On Thu, Jan 19, 2023 at 9:50 AM Don Seiler <don@seiler.us> wrote:
> > I'm going to have to review what Chef might have done. I was relying on Chef to deploy the configs before attempting to restart, but it may have started the service early
>
> Reviewing the Chef recipe, this does seem to be the case: the code that decides whether or not to place the standby.signal file comes after the recipe already attempts to start the PG service. Another self-inflicted wound, apparently.

Yes, this was exactly the problem. After fixing the order of operations in the Chef recipe, the old primary transitioned nicely into the new replica without any rewind/restore operation. This obviously requires cleanly shutting down the old primary first, which should be the case for planned DR exercises. A true DR emergency would require a rewind or restore if the old primary were revived.
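
For completeness, a hedged sketch of that rewind path for the emergency case (it assumes wal_log_hints or data checksums were already enabled before the divergence; the connection string is an example):

# Run on the stopped old primary: rewind it against the new primary,
# then start it as a standby.
/usr/lib/postgresql/12/bin/pg_rewind \
  --target-pgdata=/var/lib/postgresql/12/main \
  --source-server='host=new-primary.example.com user=postgres dbname=postgres' \
  --progress
touch /var/lib/postgresql/12/main/standby.signal
pg_ctlcluster 12 main start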

Don. 

--
Don Seiler
www.seiler.us