Re: Allow users to choose what happens when recovery target is not reached
From | Euler Taveira |
---|---|
Subject | Re: Allow users to choose what happens when recovery target is not reached |
Date | |
Msg-id | 42f7e161-cbcb-42d8-acc9-3049f2275982@www.fastmail.com |
In response to | Re: Allow users to choose what happens when recovery target is not reached (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
List | pgsql-hackers |
On Sat, Nov 13, 2021, at 10:15 AM, Bharath Rupireddy wrote:
Firstly, the proposed patch adds no new behaviour as such, it just gives the ability that is existing today on v12 and below (prior to commit dc78866 which went into v13 and later).
It reintroduces an awkward behavior [1].
I think performing PITR is the user's wish - whether the primary is available or not, it is completely the user's choice. The user might start the PITR, when the primary is available, thinking that it sends all the WAL files required for achieving recovery target. But imagine a disaster happens and the primary server crashes, say the recovery has replayed a huge bunch of WAL records (a TB may be), and the primary failed without sending the last one or few WAL files, should the PITR target server be failing this case after replaying a huge bunch of WAL records? The user might want the target server to be available instead of FATALly shutting down. This is the exact problem the proposed patch is trying to solve.
Are you archiving on the primary server? You are risking your customer's
business by suggesting such a setup. You should store the WAL files on your
backup server.
It seems your setup has a flaw. You set a recovery target but accept a scenario
that is not what you initially asked for. If it is a real PITR, it is awkward,
as Peter [1] said. You could validate your recovery settings by checking the
timestamp of the last WAL file as a rough approximation of the maximum recovery
target time. The other option is to run pg_waldump to obtain the last commit
timestamp.
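
The two checks above could be scripted roughly as follows. This is only a
sketch under stated assumptions: the helper names are hypothetical, the archive
is assumed to hold uncompressed segments with plain 24-character hexadecimal
names, and the commit lines are assumed to look like the text output of
pg_waldump ("desc: COMMIT <timestamp> <zone>"). Fractional seconds and the time
zone suffix are ignored, so the result is a rough upper bound, not an exact
value.

```python
import re
from datetime import datetime, timezone
from pathlib import Path

# Assumption: archived segments keep their plain 24-hex-digit names
# (no .gz, .partial or .backup suffix).
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")

# Assumption: pg_waldump prints commit records as
# "... desc: COMMIT 2021-11-13 10:15:00.123456 UTC".
COMMIT_DESC = re.compile(r"desc: COMMIT (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def newest_wal_mtime(archive_dir):
    """Return the modification time (UTC) of the newest WAL segment, or None."""
    mtimes = [
        p.stat().st_mtime
        for p in Path(archive_dir).iterdir()
        if p.is_file() and WAL_NAME.match(p.name)
    ]
    if not mtimes:
        return None
    return datetime.fromtimestamp(max(mtimes), tz=timezone.utc)


def last_commit_time(waldump_lines):
    """Return the latest COMMIT timestamp seen in pg_waldump output, or None.

    The value is a naive datetime truncated to whole seconds; it is expressed
    in whatever time zone pg_waldump printed.
    """
    latest = None
    for line in waldump_lines:
        match = COMMIT_DESC.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            if latest is None or stamp > latest:
                latest = stamp
    return latest


def target_is_plausible(target_time, upper_bound):
    """True if the recovery target is not later than the available WAL."""
    return upper_bound is not None and target_time <= upper_bound
```

For the second check, the lines could come from something like
"pg_waldump --rmgr=Transaction <last segment>", captured to a file or a pipe.
If target_is_plausible() returns False, the recovery target cannot be reached
with the WAL files currently at hand.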
If you care about your customer's data, you won't use such an option. Otherwise,
I repeat Julien's question [2]: isn't it better to simply not specify a target
and let the recovery go as far as possible?
As I said earlier, the behaviour is not too dangerous as it is not something new that the patch is proposing, it exists today in v12 and below. In fact, it gives a way out of a "dangerous situation" if the user ever gets stuck in it without wasting recovery cycles and compute resources, by quickly getting the database to be available (of course, the responsibility lies with the user to deal with the missing WAL files).
With your proposal, the user seems to be shooting in the dark. If a FATAL
message was emitted, it means the user missed the target. Even then, the user
can accept the situation, remove the target parameters, and start the server
again. I think promote or even pause might lead to incorrect expectations (if
the user doesn't carefully inspect the log messages).
A disadvantage of this proposal: suppose you have it set to 'promote', start
the recovery, and the server gets promoted before reaching the target. While
inspecting your server configuration, you realize that you were pointing to the
incorrect archive or that the WAL files were not available in time (due to
timing issues). You have no option but to start from scratch.