Thread: Incorrect handling of OOM in WAL replay leading to data loss

Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
Hi all,

A colleague, Ethan Mertz (in CC), has discovered that we don't
correctly handle WAL records that fail because of an OOM when
allocating their required space.  In Ethan's case, we bumped into the
failure after an allocation failure in XLogReadRecordAlloc():
"out of memory while trying to decode a record of length"

As far as I can see, PerformWalRecovery() uses LOG as elevel for its
private callback in the xlogreader when going through ReadRecord(),
which leads to a failure being reported, but recovery considers that
the failure is the end of WAL and decides to abruptly end recovery,
leading to some data loss.

In crash recovery, any records after the OOM would not be replayed.
At a quick glance, it seems to me that this can also impact standbys,
where recovery could stop earlier than it should once a consistent
point has been reached.
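
To make the failure mode concrete, here is a minimal standalone C
sketch (toy code, not the PostgreSQL sources; the record lengths and
the read_next() helper are invented for illustration): a reader that
returns NULL both at end-of-WAL and on allocation failure leaves the
redo loop unable to tell the two apart, so an OOM ends recovery exactly
like a genuine end of WAL.

    #include <stdio.h>
    #include <stdlib.h>

    struct record { size_t len; char *data; };

    /* NULL means "no more records" OR "no memory": the caller can't tell. */
    static struct record *read_next(size_t len, const char **errmsg)
    {
        struct record *rec;

        if (len == 0)
            return NULL;                /* genuine end of WAL */
        rec = malloc(sizeof(*rec));
        if (rec == NULL || (rec->data = malloc(len)) == NULL)
        {
            *errmsg = "out of memory while trying to decode a record";
            free(rec);
            return NULL;                /* OOM: same return value as above */
        }
        rec->len = len;
        return rec;
    }

    int main(void)
    {
        size_t      lens[] = {57, 80, 0};   /* 0 simulates end of WAL */
        const char *errmsg = NULL;

        for (int i = 0; ; i++)
        {
            struct record *rec = read_next(lens[i], &errmsg);

            if (rec == NULL)
            {
                /* Correct at end of WAL, silent data loss on OOM. */
                if (errmsg)
                    printf("LOG:  %s\n", errmsg);
                printf("LOG:  redo done\n");
                break;
            }
            printf("replayed record of length %zu\n", rec->len);
            free(rec->data);
            free(rec);
        }
        return 0;
    }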

Attached is a patch that can be applied on HEAD to inject an error,
then just run the script xlogreader_oom.bash attached, or something
similar, to see the failure in the logs:
LOG:  redo starts at 0/1913CD0
LOG:  out of memory while trying to decode a record of length 57
LOG:  redo done at 0/1917358 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

It also looks like recovery_prefetch may mitigate the issue a bit if
we do a read in non-blocking mode, but that's not a strong guarantee
either, especially if the host is under memory pressure.

A patch is registered in the commit fest to improve the error
detection handling, but as far as I can see it fails to handle the OOM
case and replaces ReadRecord() to use a WARNING in the redo loop:
https://www.postgresql.org/message-id/20200228.160100.2210969269596489579.horikyota.ntt%40gmail.com

Off the top of my head, any solution I can think of needs to add more
information to XLogReaderState, tracking the type of error that
happened close to errormsg_buf, which is where these errors are
tracked; unfortunately, none of that can be backpatched.

Comments?
--
Michael

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi
At Tue, 1 Aug 2023 12:43:21 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> A colleague, Ethan Mertz (in CC), has discovered that we don't
> correctly handle WAL records that fail because of an OOM when
> allocating their required space.  In Ethan's case, we bumped into the
> failure after an allocation failure in XLogReadRecordAlloc():
> "out of memory while trying to decode a record of length"

I believe a database server is not supposed to be executed under such
a memory-constrained environment.

> In crash recovery, any records after the OOM would not be replayed.
> At a quick glance, it seems to me that this can also impact standbys,
> where recovery could stop earlier than it should once a consistent
> point has been reached.

Actually the code assumes that an OOM happens solely because of a
broken record length field.  I believe we put that assumption in
intentionally.

> A patch is registered in the commit fest to improve the error
> detection handling, but as far as I can see it fails to handle the OOM
> case and replaces ReadRecord() to use a WARNING in the redo loop:
> https://www.postgresql.org/message-id/20200228.160100.2210969269596489579.horikyota.ntt%40gmail.com

It doesn't change behavior unrelated to the case where the last record
is followed by zeroed trailing bytes.

> Off the top of my head, any solution I can think of needs to add more
> information to XLogReaderState, tracking the type of error that
> happened close to errormsg_buf, which is where these errors are
> tracked; unfortunately, none of that can be backpatched.

One issue with changing that behavior is that there's not a simple way
to detect a broken record before loading it into memory. We might be
able to implement a fallback mechanism, for example one that loads the
record into an already-allocated buffer (which is smaller than the
specified length) just to verify whether it's corrupted. However, I
question whether it's worth the additional complexity. And I'm not
sure what to do if the first allocation failed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
On Tue, Aug 01, 2023 at 01:51:13PM +0900, Kyotaro Horiguchi wrote:
> I believe a database server is not supposed to be executed under such
> a memory-constrained environment.

I don't really follow this argument.  The backend and the frontends
are reliable on OOM, where we generate ERRORs or even FATALs depending
on the code path involved.  A memory bounded environment is something
that can easily happen if one's not careful enough with the sizing of
the instance.  For example, this error can be triggered on a standby
with read-only queries that put pressure on the host's memory.

> One issue with changing that behavior is that there's not a simple way
> to detect a broken record before loading it into memory. We might be
> able to implement a fallback mechanism, for example one that loads the
> record into an already-allocated buffer (which is smaller than the
> specified length) just to verify whether it's corrupted. However, I
> question whether it's worth the additional complexity. And I'm not
> sure what to do if the first allocation failed.

Perhaps we could rely more on a fallback memory, especially if it is
possible to use that for the header validation.  That seems like a
separate thing, still.
--
Michael

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi
At Tue, 1 Aug 2023 14:03:36 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Tue, Aug 01, 2023 at 01:51:13PM +0900, Kyotaro Horiguchi wrote:
> > I believe a database server is not supposed to be executed under such
> > a memory-constrained environment.
> 
> I don't really follow this argument.  The backend and the frontends
> are reliable on OOM, where we generate ERRORs or even FATALs depending
> on the code path involved.  A memory bounded environment is something
> that can easily happen if one's not careful enough with the sizing of

I didn't mean that OOM should not happen.  I was referring to an
environment where an allocation failure can happen during crash
recovery.  Anyway, I didn't mean that we shouldn't "fix" it.

> the instance.  For example, this error can be triggered on a standby
> with read-only queries that put pressure on the host's memory.

I thought that a failure on a standby results in continuing to retry
reading the next record.  However, I found that there's a case where
the startup process stops in response to OOM [1].

> > One issue with changing that behavior is that there's not a simple way
> > to detect a broken record before loading it into memory. We might be
> > able to implement a fallback mechanism, for example one that loads the
> > record into an already-allocated buffer (which is smaller than the
> > specified length) just to verify whether it's corrupted. However, I
> > question whether it's worth the additional complexity. And I'm not
> > sure what to do if the first allocation failed.
> 
> Perhaps we could rely more on a fallback memory, especially if it is
> possible to use that for the header validation.  That seems like a
> separate thing, still.

Once a record has been read, that much memory has already been
allocated.

Although we may not agree, we could establish a default behavior where
an OOM during recovery immediately triggers an ERROR.  Then, we could
introduce a *GUC* that causes recovery to regard an OOM as an
end-of-recovery error.

regards.

[1] https://www.postgresql.org/message-id/17928-aa92416a70ff44a2%40postgresql.org

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi
At Tue, 01 Aug 2023 15:28:54 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Although we may not agree, we could establish a default behavior where
> an OOM during recovery immediately triggers an ERROR.  Then, we could
> introduce a *GUC* that causes recovery to regard an OOM as an
> end-of-recovery error.

If we do that, the reverse might be preferable.  (OOMs are
end-of-recovery by default; that can be changed to ERROR by a GUC.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Aleksander Alekseev
Hi,

> As far as I can see, PerformWalRecovery() uses LOG as elevel
> [...]
> Off the top of my head, any solution I can think of needs to add more
> information to XLogReaderState, tracking the type of error that
> happened close to errormsg_buf, which is where these errors are
> tracked; unfortunately, none of that can be backpatched.

Probably I'm missing something, but if memory allocation is required
during WAL replay and it fails, wouldn't it be a better solution to
log the error and terminate the DBMS immediately?

Clearly Postgres doesn't have control of the amount of memory
available. It's up to the DBA to resolve the problem and start the
recovery again. If this happens on a replica, it indicates a
misconfiguration of the system and/or lack of the corresponding
configuration options.

Maybe a certain amount of memory should be reserved for WAL replay
and perhaps other needs.  In that case the system should also account
for the overcommitment of the OS - cases where a successful malloc()
doesn't necessarily allocate the required amount of *physical* memory,
as is done on Linux.
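
As a side note on the overcommit point, here is a hedged standalone
sketch (Linux-specific behavior, not PostgreSQL code; the 8 GB figure
is arbitrary): with the default heuristic overcommit a large malloc()
typically succeeds, and only touching the pages afterwards forces the
kernel to back them, which is where the OOM killer may strike; with
vm.overcommit_memory=2 the malloc() itself fails up front instead.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t  sz = (size_t) 8 * 1024 * 1024 * 1024;   /* 8 GB, arbitrary */
        char   *p = malloc(sz);

        if (p == NULL)
        {
            /* Strict accounting (vm.overcommit_memory=2) fails here. */
            fprintf(stderr, "malloc failed up front\n");
            return 1;
        }

        /*
         * Under heuristic overcommit the allocation above usually succeeds;
         * faulting the pages in below is what actually consumes physical
         * memory and may get the process killed instead.
         */
        memset(p, 0, sz);
        puts("all pages materialized");
        free(p);
        return 0;
    }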

-- 
Best regards,
Aleksander Alekseev



Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Jeff Davis
On Tue, 2023-08-01 at 16:14 +0300, Aleksander Alekseev wrote:
> Probably I'm missing something, but if memory allocation is required
> during WAL replay and it fails, wouldn't it be a better solution to
> log the error and terminate the DBMS immediately?

We need to differentiate between:

1. No valid record exists and it must be the end of WAL; LOG and start
up.

2. A valid record exists and we are unable to process it (e.g. due to
OOM); PANIC.

Regards,
    Jeff Davis





Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi
At Tue, 01 Aug 2023 15:28:54 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> I thought that a failure on a standby results in continuing to retry
> reading the next record.  However, I found that there's a case where
> the startup process stops in response to OOM [1].

I've examined the calls to
MemoryContextAllocExtended(..,MCXT_ALLOC_NO_OOM).  In the server
recovery path, XLogDecodeNextRecord() is the only function that uses it.

So, there doesn't seem to be a problem here.  I proceeded to test the
idea of only verifying headers after an allocation failure, and I've
attached a PoC.

- allocate_recordbuf() ensures a minimum of SizeOfXLogRecord bytes
  when it returns false, indicating an allocation failure.

- If allocate_recordbuf() returns false, XLogDecodeNextRecord()
  continues to read pages and perform header checks until total_len is
  reached, but without copying data (except for the logical record
  header, when the first page didn't store the entire header).

- If all relevant WAL pages are consistent, ReadRecord concludes with
  an 'out of memory' ERROR, which then escalates to FATAL.

I believe this approach is sufficient to determine whether the error
is OOM or not. If total_len is corrupted and has an excessively large
value, it's highly unlikely that all subsequent pages for that length
will be consistent.
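
A standalone sketch of the verify-without-buffering idea (toy framing:
a simple length-plus-checksum record and an invented verify_streamed()
helper, not the real WAL page format or the attached PoC): when the
full-size allocation fails, the claimed length can still be walked
through a small fixed buffer to check whether it is self-consistent,
which separates "record too large to buffer" from "corrupted length
field".

    #include <stdio.h>
    #include <stdint.h>

    #define CHUNK 512

    /* Cheap rolling checksum, a stand-in for the real record CRC. */
    static uint32_t fold(uint32_t acc, const unsigned char *buf, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            acc = (acc << 5) + acc + buf[i];
        return acc;
    }

    /*
     * Walk total_len bytes of the stream using only CHUNK bytes of memory;
     * returns 1 if the claimed length and checksum are self-consistent.
     */
    static int verify_streamed(FILE *in, uint64_t total_len, uint32_t expected)
    {
        unsigned char chunk[CHUNK];
        uint32_t    acc = 5381;

        while (total_len > 0)
        {
            size_t  want = total_len < CHUNK ? (size_t) total_len : CHUNK;
            size_t  got = fread(chunk, 1, want, in);

            if (got == 0)
                return 0;       /* stream ended early: the length was bogus */
            acc = fold(acc, chunk, got);
            total_len -= got;
        }
        return acc == expected;
    }

    int main(void)
    {
        FILE   *f = tmpfile();
        const unsigned char payload[] = "pretend this is a WAL record body";
        uint32_t crc = fold(5381, payload, sizeof(payload));

        if (f == NULL)
            return 1;
        fwrite(payload, 1, sizeof(payload), f);
        rewind(f);
        printf("verified: %d\n", verify_streamed(f, sizeof(payload), crc));
        fclose(f);
        return 0;
    }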

Do you have any thoughts on this?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
On Tue, Aug 01, 2023 at 04:39:54PM -0700, Jeff Davis wrote:
> On Tue, 2023-08-01 at 16:14 +0300, Aleksander Alekseev wrote:
> > Probably I'm missing something, but if memory allocation is required
> > during WAL replay and it fails, wouldn't it be a better solution to
> > log the error and terminate the DBMS immediately?
>
> We need to differentiate between:
>
> 1. No valid record exists and it must be the end of WAL; LOG and start
> up.
>
> 2. A valid record exists and we are unable to process it (e.g. due to
> OOM); PANIC.

Yes, still there is a bit more to it.  The origin of the introduction
of palloc(MCXT_ALLOC_NO_OOM) partially comes from this thread, which
reported a problem after we switched from malloc() to palloc() when
xlogreader.c got introduced:
https://www.postgresql.org/message-id/CAHGQGwE46cJC4rJGv+kVMV8g6BxHm9dBR_7_QdPjvJUqdt7m=Q@mail.gmail.com

And the malloc() behavior when replaying WAL records is even older
than that.

At the end, we want to be able to give more options to anybody looking
at WAL records, and let them take decisions based on the error reached
and the state of the system.  For example, it does not make much sense
to fail hard on OOM if replaying records when in standby mode because
we can just loop again.  The same can actually be said when in crash
recovery.  On OOM, the startup process considers that we have an
invalid record now, which is incorrect.  We could fail hard and FATAL
to replay again (sounds like the natural option), or we could loop
over the record that failed its allocation, repeating things.  In any
case, we need to give more information back to the system so as it can
take better decisions on what it should do.
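
A rough sketch of the kind of decision logic this enables (illustrative
only; the WalReadStatus and Action names are invented here and are not
what the attached patches use): once the reader reports why a record
could not be obtained, the redo loop can end recovery only for a
genuine end of WAL, and retry or fail hard on an OOM.

    #include <stdio.h>

    typedef enum
    {
        WAL_READ_OK,
        WAL_READ_END_OF_WAL,    /* no valid record: normal end of recovery */
        WAL_READ_OOM,           /* allocation failed: the record may be fine */
        WAL_READ_CORRUPTED      /* header or CRC checks failed */
    } WalReadStatus;

    typedef enum { ACTION_END_RECOVERY, ACTION_RETRY, ACTION_FATAL } Action;

    static Action decide(WalReadStatus st, int in_standby_mode)
    {
        switch (st)
        {
            case WAL_READ_OOM:
                /* The record may be perfectly valid; never end recovery. */
                return in_standby_mode ? ACTION_RETRY : ACTION_FATAL;
            case WAL_READ_CORRUPTED:
            case WAL_READ_END_OF_WAL:
            default:
                return ACTION_END_RECOVERY;     /* historical behavior */
        }
    }

    int main(void)
    {
        printf("OOM on a standby      -> %d (retry)\n", decide(WAL_READ_OOM, 1));
        printf("OOM in crash recovery -> %d (fatal)\n", decide(WAL_READ_OOM, 0));
        printf("end of WAL            -> %d (end recovery)\n",
               decide(WAL_READ_END_OF_WAL, 0));
        return 0;
    }
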
--
Michael

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
On Wed, Aug 02, 2023 at 01:16:02PM +0900, Kyotaro Horiguchi wrote:
> I believe this approach is sufficient to determine whether the error
> is OOM or not. If total_len is corrupted and has an excessively large
> value, it's highly unlikely that all subsequent pages for that length
> will be consistent.
>
> Do you have any thoughts on this?

This could be more flexible IMO, and actually in some cases
errormsg_fatal may be eaten if using the WAL prefetcher as the error
message is reported with the caller of XLogPrefetcherReadRecord(), no?

Anything that has been discussed on this thread now involves a change
in XLogReaderState that induces an ABI breakage.  For HEAD, we are
likely going in this direction, but if we are going to bite the bullet
we'd better be a bit more aggressive with the integration and report
an error code side-by-side with the error message returned by
XLogPrefetcherReadRecord(), XLogReadRecord() and XLogNextRecord() so
as all of the callers can decide what they want to do on an invalid
record or just an OOM.

Attached is the idea of infrastructure I have in mind, as of 0001,
where this adds an error code to report_invalid_record().  For now
this includes three error codes appended to the error messages
generated that can be expanded if need be: no error, OOM and invalid
data.  The invalid data part may need to be much more verbose, and
could be improved to make this stuff "less scary" as the other thread
proposes, but what I have here would be enough to trigger a different
decision in the startup process if a record cannot be fetched on OOM
or if there's a different reason behind that.

0002 is an example of decision that can be taken in WAL replay if we
see an OOM, based on the error code received.  One argument is that we
may want to improve emode_for_corrupt_record() so as it reacts better
on OOM, upgrading the emode wanted, but this needs more analysis
depending on the code path involved.

0003 is my previous trick to inject an OOM failure at replay.  Reusing
the previous script, this would be enough to prevent an early redo
creating a loss of data.

Note that we have a few other things going in the tree.  As one
example, pg_walinspect would consider an OOM as the end of WAL.  Not
critical, still slightly incorrect as the end of WAL may not have been
reached yet so it can report some incorrect information depending on
what the WAL reader faces.  This could be improved with the additions
of 0001.

Thoughts or comments?
--
Michael

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi
At Tue, 8 Aug 2023 16:29:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Wed, Aug 02, 2023 at 01:16:02PM +0900, Kyotaro Horiguchi wrote:
> > I believe this approach is sufficient to determine whether the error
> > is OOM or not. If total_len is corrupted and has an excessively large
> > value, it's highly unlikely that all subsequent pages for that length
> > will be consistent.
> > 
> > Do you have any thoughts on this?
> 
> This could be more flexible IMO, and actually in some cases
> errormsg_fatal may be eaten if using the WAL prefetcher as the error
> message is reported with the caller of XLogPrefetcherReadRecord(), no?

Right. The goal of my PoC was to detect OOM accurately, or at least
sufficiently so. We need to separately pass the "error code" along
with the message to make it work with the prefetcher.  We could
enclose errormsg and errorcode in a struct.

> Anything that has been discussed on this thread now involves a change
> in XLogReaderState that induces an ABI breakage.  For HEAD, we are
> likely going in this direction, but if we are going to bite the bullet
> we'd better be a bit more aggressive with the integration and report
> an error code side-by-side with the error message returned by
> XLogPrefetcherReadRecord(), XLogReadRecord() and XLogNextRecord() so
> as all of the callers can decide what they want to do on an invalid
> record or just an OOM.

Sounds reasonable.

> Attached is the idea of infrastructure I have in mind, as of 0001,
> where this adds an error code to report_invalid_record().  For now
> this includes three error codes appended to the error messages
> generated that can be expanded if need be: no error, OOM and invalid
> data.  The invalid data part may need to be much more verbose, and
> could be improved to make this stuff "less scary" as the other thread
> proposes, but what I have here would be enough to trigger a different 
> decision in the startup process if a record cannot be fetched on OOM
> or if there's a different reason behind that.

Agreed. This clarifies the basis for decisions at the upper layer
(ReadRecord) and adds flexibility.

> 0002 is an example of decision that can be taken in WAL replay if we
> see an OOM, based on the error code received.  One argument is that we
> may want to improve emode_for_corrupt_record() so as it reacts better
> on OOM, upgrading the emode wanted, but this needs more analysis
> depending on the code path involved.
> 
> 0003 is my previous trick to inject an OOM failure at replay.  Reusing
> the previous script, this would be enough to prevent an early redo
> creating a loss of data.
> 
> Note that we have a few other things going in the tree.  As one
> example, pg_walinspect would consider an OOM as the end of WAL.  Not
> critical, still slightly incorrect as the end of WAL may not have been
> reached yet so it can report some incorrect information depending on
> what the WAL reader faces.  This could be improved with the additions
> of 0001.
> 
> Thoughts or comments?

I like the overall direction. Though, I'm considering enclosing the
errormsg and errorcode in a struct.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
On Tue, Aug 08, 2023 at 05:44:03PM +0900, Kyotaro Horiguchi wrote:
> I like the overall direction. Though, I'm considering enclosing the
> errormsg and errorcode in a struct.

Yes, this suggestion makes sense as it simplifies all the WAL routines
that need to report back a complete error state, and there are four of
them now:
XLogPrefetcherReadRecord()
XLogReadRecord()
XLogNextRecord()
DecodeXLogRecord()

I have spent more time on 0001, polishing it and fixing a few bugs
that I have found while reviewing the whole.  Most of them were
related to mistakes in resetting the error state when expected.  I
have also expanded DecodeXLogRecord() to use an error structure
instead of only an errmsg, giving more consistency.  The error state
now relies on two structures:
+typedef enum XLogReaderErrorCode
+{
+   XLOG_READER_NONE = 0,
+   XLOG_READER_OOM,            /* out-of-memory */
+   XLOG_READER_INVALID_DATA,   /* record data */
+} XLogReaderErrorCode;
+typedef struct XLogReaderError
+{
+   /* Buffer to hold error message */
+   char       *message;
+   bool        message_deferred;
+   /* Error code when filling *message */
+   XLogReaderErrorCode code;
+} XLogReaderError;

I'm kind of happy with this layer, now.

I have also spent some time on finding a more elegant solution for the
WAL replay, relying on the new facility from 0001.  And it happens
that it is easy enough to loop if facing an out-of-memory failure when
reading a record when we are in crash recovery, as the state is
actually close to what a standby does.  The trick is that we should
not change the state and avoid tracking a continuation record.  This
is done in 0002, making replay more robust.  With the addition of the
error injection tweak in 0003, I am able to finish recovery while the
startup process loops if under memory pressure.  As mentioned
previously, there are more code paths to consider, but that's a start
to fix the data loss problems.
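
For illustration, a standalone toy of the loop-on-OOM behavior
described here (not the 0002 patch itself; the reader struct and sizes
are invented): the reader's position is left untouched on an
allocation failure, so the same record is simply fetched again after a
short pause instead of being treated as the end of WAL.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    enum read_status { READ_OK, READ_END, READ_OOM };

    struct reader { long read_ptr; };   /* stand-in for the reader state */

    static enum read_status read_record(struct reader *r, char **rec)
    {
        size_t  len = 1024 * 1024;      /* pretend this is xl_tot_len */
        char   *buf = malloc(len);

        if (buf == NULL)
            return READ_OOM;            /* read_ptr deliberately not advanced */
        r->read_ptr += len;
        *rec = buf;
        return READ_OK;
    }

    int main(void)
    {
        struct reader r = {0};
        char   *rec;

        for (;;)
        {
            enum read_status st = read_record(&r, &rec);

            if (st == READ_OOM)
            {
                fprintf(stderr, "LOG:  out of memory, retrying same record\n");
                sleep(1);               /* give the host a chance to recover */
                continue;               /* same read_ptr, so same record */
            }
            if (st == READ_END)
                break;
            /* ... replay rec here ... */
            free(rec);
            break;                      /* toy: stop after one record */
        }
        return 0;
    }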

Comments are welcome.
--
Michael

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi
At Wed, 9 Aug 2023 15:03:21 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> I have spent more time on 0001, polishing it and fixing a few bugs
> that I have found while reviewing the whole.  Most of them were
> related to mistakes in resetting the error state when expected.  I
> have also expanded DecodeXLogRecord() to use an error structure
> instead of only an errmsg, giving more consistency.  The error state
> now relies on two structures:
> +typedef enum XLogReaderErrorCode
> +{
> +   XLOG_READER_NONE = 0,
> +   XLOG_READER_OOM,            /* out-of-memory */
> +   XLOG_READER_INVALID_DATA,   /* record data */
> +} XLogReaderErrorCode;
> +typedef struct XLogReaderError
> +{
> +   /* Buffer to hold error message */
> +   char       *message;
> +   bool        message_deferred;
> +   /* Error code when filling *message */
> +   XLogReaderErrorCode code;
> +} XLogReaderError;
> 
> I'm kind of happy with this layer, now.

I'm not certain if message_deferred is a property of the error
struct. Callers don't seem to need that information.

The name "XLOG_RADER_NONE" seems too generic. XLOG_READER_NOERROR will
be clearer.

> I have also spent some time on finding a more elegant solution for the
> WAL replay, relying on the new facility from 0001.  And it happens
> that it is easy enough to loop if facing an out-of-memory failure when
> reading a record when we are in crash recovery, as the state is
> actually close to what a standby does.  The trick is that we should
> not change the state and avoid tracking a continuation record.  This
> is done in 0002, making replay more robust.  With the addition of the

0002 shifts the behavior for the OOM case from ending recovery to
retrying at the same record.  If the last record is really corrupted,
the server won't be able to finish recovery. I doubt we are good with
this behavior change.

> error injection tweak in 0003, I am able to finish recovery while the
> startup process loops if under memory pressure.  As mentioned
> previously, there are more code paths to consider, but that's a start
> to fix the data loss problems.

(The file name "xlogreader_oom" is a bit trickeier to type than "hoge"
 or "foo"X( )

> Comments are welcome.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
On Wed, Aug 09, 2023 at 04:13:53PM +0900, Kyotaro Horiguchi wrote:
> I'm not certain if message_deferred is a property of the error
> struct. Callers don't seem to need that information.

True enough, will remove.

> The name "XLOG_RADER_NONE" seems too generic. XLOG_READER_NOERROR will
> be clearer.

Or perhaps just XLOG_READER_NO_ERROR?

> 0002 shifts the behavior for the OOM case from ending recovery to
> retrying at the same record.  If the last record is really corrupted,
> the server won't be able to finish recovery. I doubt we are good with
> this behavior change.

You mean on an incorrect xl_tot_len?  Yes, that could be possible.
Another possibility would be a retry logic with a hardcoded number of
attempts and a delay between each.  Once the infrastructure is in
place, this still deserves more discussion, but we can be flexible.
The immediate FATAL is one choice.
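
A tiny standalone sketch of that bounded-retry idea (the retry count,
delay, and try_alloc() helper here are arbitrary placeholders, not
anything from the patches): attempt the allocation a fixed number of
times with a pause between attempts, and only give up once the budget
is exhausted.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define MAX_ATTEMPTS    5
    #define RETRY_DELAY_SEC 1

    /* Try to allocate len bytes, retrying a few times under memory pressure. */
    static void *try_alloc(size_t len)
    {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++)
        {
            void *buf = malloc(len);

            if (buf != NULL)
                return buf;
            fprintf(stderr, "LOG:  out of memory, attempt %d of %d\n",
                    attempt, MAX_ATTEMPTS);
            if (attempt < MAX_ATTEMPTS)
                sleep(RETRY_DELAY_SEC);
        }
        return NULL;    /* caller decides: here, this would become a FATAL */
    }

    int main(void)
    {
        void *buf = try_alloc(1024 * 1024);

        if (buf == NULL)
        {
            fprintf(stderr, "FATAL:  could not allocate record buffer\n");
            return 1;
        }
        free(buf);
        return 0;
    }
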
--
Michael

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi
At Wed, 9 Aug 2023 16:35:09 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> Or perhaps just XLOG_READER_NO_ERROR?

Looks fine.

> > 0002 shifts the behavior for the OOM case from ending recovery to
> > retrying at the same record.  If the last record is really corrupted,
> > the server won't be able to finish recovery. I doubt we are good with
> > this behavior change.
> 
> You mean on an incorrect xl_tot_len?  Yes, that could be possible.
> Another possibility would be a retry logic with a hardcoded number of
> attempts and a delay between each.  Once the infrastructure is in
> place, this still deserves more discussion, but we can be flexible.
> The immediate FATAL is one choice.

While it's a kind of bug in total, we encountered a case where an
excessively large xl_tot_len actually came from a corrupted
record. [1]

I'm glad to see this infrastructure comes in, and I'm on board with
retrying due to an OOM. However, I think we really need official steps
to wrap up recovery when there is a truly broken, oversized
xl_tot_len.

[1] https://www.postgresql.org/message-id/17928-aa92416a70ff44a2@postgresql.org

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
On Wed, Aug 09, 2023 at 05:00:49PM +0900, Kyotaro Horiguchi wrote:
> Looks fine.

Okay, I've updated the patch in consequence.  I'll look at 0001 again
at the beginning of next week.

> While it's a kind of bug in total, we encountered a case where an
> excessively large xl_tot_len actually came from a corrupted
> record. [1]

Right, I remember this one.  I think that Thomas was pretty much right
that this could be caused because of a lack of zeroing in the WAL
pages.

> I'm glad to see this infrastructure comes in, and I'm on board with
> retrying due to an OOM. However, I think we really need official steps
> to wrap up recovery when there is a truly broken, oversized
> xl_tot_len.

There are a few options on the table, only doable once the WAL reader
provides the error state to the startup process:
1) Retry a few times and FATAL.
2) Just FATAL immediately and don't wait.
3) Retry and hope for the best that the host calms down.
I have not seen this being much of an issue in the field, so
perhaps option 2 with the structure of 0002 and a FATAL when we catch
XLOG_READER_OOM in the switch would be enough.  At least that's enough
for the cases we've seen.  I'll think a bit more about it, as well.

Yeah, agreed.  That's orthogonal to the issue reported by Ethan,
unfortunately, where he was able to trigger the issue of this thread
by manipulating the sizing of a host after producing a record larger
than what the host could afford after the resizing :/
--
Michael

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi
At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> > While it's a kind of bug in total, we encountered a case where an
> > excessively large xl_tot_len actually came from a corrupted
> > record. [1]
> 
> Right, I remember this one.  I think that Thomas was pretty much right
> that this could be caused because of a lack of zeroing in the WAL
> pages.

We have treated every kind of broken data as end-of-recovery, like
incorrect rm_id or prev link including excessively large record length
due to corruption. This patch is going to change the behavior only for
the last one. If you think there can't be non-zero broken data, we
should inhibit proceeding recovery after all non-zero incorrect
data. This seems to be a quite big change in our recovery policy.

> There are a few options on the table, only doable once the WAL reader
> provides the error state to the startup process:
> 1) Retry a few times and FATAL.
> 2) Just FATAL immediately and don't wait.
> 3) Retry and hope for the best that the host calms down.

4) Wrap up recovery then continue to normal operation.

This is the traditional behavior for corrupt WAL data.

> I have not seen this being much of an issue in the field, so
> perhaps option 2 with the structure of 0002 and a FATAL when we catch
> XLOG_READER_OOM in the switch would be enough.  At least that's enough
> for the cases we've seen.  I'll think a bit more about it, as well.
> 
> Yeah, agreed.  That's orthogonal to the issue reported by Ethan,
> unfortunately, where he was able to trigger the issue of this thread
> by manipulating the sizing of a host after producing a record larger
> than what the host could afford after the resizing :/

I'm not entirely certain, but if you were to ask me which is more
probable during recovery - encountering a correct record that's too
lengthy for the server to buffer or stumbling upon a corrupt byte
sequence - I'd bet on the latter.

I'm not sure how often users encounter corrupt WAL data, but I believe
they should have the option to terminate recovery and then switch to
normal operation.

What if we introduced an option to increase the timeline whenever
recovery hits data error? If that option is disabled, the server stops
when recovery detects an incorrect data, except in the case of an
OOM. OOM cause record retry.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi
At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> > > While it's a kind of bug in total, we encountered a case where an
> > > excessively large xl_tot_len actually came from a corrupted
> > > record. [1]
> > 
> > Right, I remember this one.  I think that Thomas was pretty much right
> > that this could be caused because of a lack of zeroing in the WAL
> > pages.
> 
> We have treated every kind of broken data as end-of-recovery, like
> incorrect rm_id or prev link including excessively large record length
> due to corruption. This patch is going to change the behavior only for
> the last one. If you think there can't be non-zero broken data, we
> should inhibit proceeding recovery after all non-zero incorrect
> data. This seems to be a quite big change in our recovery policy.
> 
> > There are a few options on the table, only doable once the WAL reader
> > provides the error state to the startup process:
> > 1) Retry a few times and FATAL.
> > 2) Just FATAL immediately and don't wait.
> > 3) Retry and hope for the best that the host calms down.
> 
> 4) Wrap up recovery then continue to normal operation.
> 
> This is the traditional behavior for corrupt WAL data.
> 
> > I have not seen this being much of an issue in the field, so
> > perhaps option 2 with the structure of 0002 and a FATAL when we catch
> > XLOG_READER_OOM in the switch would be enough.  At least that's enough
> > for the cases we've seen.  I'll think a bit more about it, as well.
> > 
> > Yeah, agreed.  That's orthogonal to the issue reported by Ethan,
> > unfortunately, where he was able to trigger the issue of this thread
> > by manipulating the sizing of a host after producing a record larger
> > than what the host could afford after the resizing :/
> 
> I'm not entirely certain, but if you were to ask me which is more
> probable during recovery - encountering a correct record that's too
> lengthy for the server to buffer or stumbling upon a corrupt byte
> sequence - I'd bet on the latter.

... of course this refers to crash recovery. For replication, we
should keep retrying the current record until the operator commands
promotion.

> I'm not sure how often users encounter corrupt WAL data, but I believe
> they should have the option to terminate recovery and then switch to
> normal operation.
> 
> What if we introduced an option to increase the timeline whenever
> recovery hits data error? If that option is disabled, the server stops
> when recovery detects an incorrect data, except in the case of an
> OOM. OOM cause record retry.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
> At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>> At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in
>>>> While it's a kind of bug in total, we encountered a case where an
>>>> excessively large xl_tot_len actually came from a corrupted
>>>> record. [1]
>>>
>>> Right, I remember this one.  I think that Thomas was pretty much right
>>> that this could be caused because of a lack of zeroing in the WAL
>>> pages.
>>
>> We have treated every kind of broken data as end-of-recovery, like
>> incorrect rm_id or prev link including excessively large record length
>> due to corruption. This patch is going to change the behavior only for
>> the last one. If you think there can't be non-zero broken data, we
>> should inhibit proceeding recovery after all non-zero incorrect
>> data. This seems to be a quite big change in our recovery policy.

Well, per se the report that led to this thread.  We may lose data and
finish with corrupted pages.  I was planning to reply to the other
thread [1] and the patch of Noah anyway, because we have to fix the
detection of OOM vs corrupted records in the allocation path anyway.
The infra introduced by 0001 and something like 0002 that allows the
startup process to take a different path depending on the type of
error are still needed to avoid a too early end-of-recovery, though.

>>> There are a few options on the table, only doable once the WAL reader
>>> provides the error state to the startup process:
>>> 1) Retry a few times and FATAL.
>>> 2) Just FATAL immediately and don't wait.
>>> 3) Retry and hope for the best that the host calms down.
>>
>> 4) Wrap up recovery then continue to normal operation.
>>
>> This is the traditional behavior for corrupt WAL data.

Yeah, we've been doing a pretty bad job in classifying the errors that
can happen doing WAL replay in crash recovery, because we assume that
all the WAL still in pg_wal/ is correct.  That's not an easy problem,
because the record CRC works for all the contents of the record, and
we look at the record header before that.  Another idea may be to have
an extra CRC only for the header itself, that can be used in isolation
as one of the checks in XLogReaderValidatePageHeader().
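
A hedged standalone sketch of that header-only CRC idea (the
toy_record_header layout and the bitwise CRC-32 below are simplified
stand-ins, not the on-disk xl_* structs or PostgreSQL's pg_crc32c):
checksumming just the fixed-size header would let a reader reject a
corrupted length field before attempting to allocate xl_tot_len bytes
for the record body.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct
    {
        uint64_t    tot_len;    /* claimed total record length */
        uint32_t    rmid_info;  /* packed rmgr id + flags (toy) */
        uint32_t    hdr_crc;    /* CRC over the fields above only */
    } toy_record_header;

    static uint32_t crc32_bitwise(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t    crc = 0xffffffff;

        for (size_t i = 0; i < len; i++)
        {
            crc ^= p[i];
            for (int b = 0; b < 8; b++)
                crc = (crc >> 1) ^ (0xedb88320 & -(crc & 1));
        }
        return ~crc;
    }

    /* Validate the header before any large allocation is attempted. */
    static int header_is_sane(const toy_record_header *h)
    {
        return crc32_bitwise(h, offsetof(toy_record_header, hdr_crc)) == h->hdr_crc;
    }

    int main(void)
    {
        toy_record_header h = { .tot_len = 57, .rmid_info = 10 };

        h.hdr_crc = crc32_bitwise(&h, offsetof(toy_record_header, hdr_crc));
        printf("intact header ok:  %d\n", header_is_sane(&h));

        h.tot_len = (uint64_t) 1 << 40;     /* simulate a corrupted length */
        printf("corrupt header ok: %d\n", header_is_sane(&h));
        return 0;
    }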

>>> I have not seen this being much of an issue in the field, so
>>> perhaps option 2 with the structure of 0002 and a FATAL when we catch
>>> XLOG_READER_OOM in the switch would be enough.  At least that's enough
>>> for the cases we've seen.  I'll think a bit more about it, as well.
>>>
>>> Yeah, agreed.  That's orthogonal to the issue reported by Ethan,
>>> unfortunately, where he was able to trigger the issue of this thread
>>> by manipulating the sizing of a host after producing a record larger
>>> than what the host could afford after the resizing :/
>>
>> I'm not entirely certain, but if you were to ask me which is more
>> probable during recovery - encountering a correct record that's too
>> lengthy for the server to buffer or stumbling upon a corrupt byte
>> sequence - I'd bet on the latter.

I don't really believe in chance when it comes to computer science,
facts and a correct detection of such facts are better :)

> ... of course this refers to crash recovery. For replication, we
> should keep retrying the current record until the operator commands
> promotion.

Are you referring to a retry if there is a standby.signal?  I am a
bit confused by this sentence, because we could do a crash recovery,
then switch to archive recovery.  So, I guess that you mean that on
OOM we should retry to retrieve WAL from the local pg_wal/ even in the
case where we are in the crash recovery phase, *before* switching to
archive recovery and a different source, right?  I think that this
argument can go two ways, because it could be more helpful for some to
see a FATAL when we are still in crash recovery, even if there is a
standby.signal.  It does not seem to me that we have a clear
definition about what to do in which case, either.  Now we just fail
and hope for the best when doing crash recovery.

>> I'm not sure how often users encounter corrupt WAL data, but I believe
>> they should have the option to terminate recovery and then switch to
>> normal operation.
>>
>> What if we introduced an option to increase the timeline whenever
>> recovery hits data error? If that option is disabled, the server stops
>> when recovery detects an incorrect data, except in the case of an
>> OOM. OOM cause record retry.

I guess that it depends on how much responsiveness one may want.
Forcing a failure on OOM is at least something that users would be
immediately able to act on when we don't run a standby but just
recover from a crash, while a standby would do what it is designed to
do, aka continue to replay what it can see.  One issue with the
wait-and-continue is that a standby may loop continuously on OOM,
which could be also bad if there's a replication slot retaining WAL on
the primary.  Perhaps that's just OK to keep doing that for a
standby.  At least this makes the discussion easier for the sake of
this thread: just consider the case of crash recovery when we don't
have a standby.
--
Michael

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi
At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> >> We have treated every kind of broken data as end-of-recovery, like
> >> incorrect rm_id or prev link including excessively large record length
> >> due to corruption. This patch is going to change the behavior only for
> >> the last one. If you think there can't be non-zero broken data, we
> >> should inhibit proceeding recovery after all non-zero incorrect
> >> data. This seems to be a quite big change in our recovery policy.
> 
> Well, per se the report that led to this thread.  We may lose data and
> finish with corrupted pages.  I was planning to reply to the other
> thread [1] and the patch of Noah anyway, because we have to fix the
> detection of OOM vs corrupted records in the allocation path anyway.

Does this mean we will address the distinction between an OOM and a
corrupt total record length later on? If that's the case, should we
modify that behavior right now?

> The infra introduced by 0001 and something like 0002 that allows the
> startup process to take a different path depending on the type of
> error are still needed to avoid a too early end-of-recovery, though.

Agreed.

> >> 4) Wrap up recovery then continue to normal operation.
> >> 
> >> This is the traditional behavior for corrupt WAL data.
> 
> Yeah, we've been doing a pretty bad job in classifying the errors that
> can happen doing WAL replay in crash recovery, because we assume that
> all the WAL still in pg_wal/ is correct.  That's not an easy problem,

I'm not quite sure what "correct" means here. I believe xlogreader
runs various checks since the data may be incorrect. Given that can
break for various reasons, during crash recovery, we continue as long
as incoming WAL record remains consistently valid. The problem raised
here is that we can't distinctly identify an OOM from a corrupted
record length field. One reason is we check the data after a part is
loaded, but we can't load all the bytes from the record into memory in
such cases.

> because the record CRC works for all the contents of the record, and
> we look at the record header before that.  Another idea may be to have
> an extra CRC only for the header itself, that can be used in isolation
> as one of the checks in XLogReaderValidatePageHeader().

Sounds reasonable. By using CRC to protect the header part and
allocating a fixed-length buffer for it, I believe we're adopting a
standard approach and can identify OOM and other kind of header errors
with a good degree of certainty.

> >> I'm not entirely certain, but if you were to ask me which is more
> >> probable during recovery - encountering a correct record that's too
> >> lengthy for the server to buffer or stumbling upon a corrupt byte
> >> sequence - I'd bet on the latter.
> 
> I don't really believe in chance when it comes to computer science,
> facts and a correct detection of such facts are better :)

Even now, it seems like we're balancing the risk of potential data
loss against the potential inability to start the server. I meant
that... if I had to choose, I'd lean slightly towards prioritizing
saving the latter, in other words, keeping the current behavior.

> > ... of course this refers to crash recovery. For replication, we
> > should keep retrying the current record until the operator commands
> > promotion.
> 
> Are you referring about a retry if there is a standby.signal?  I am a
> bit confused by this sentence, because we could do a crash recovery,
> then switch to archive recovery.  So, I guess that you mean that on
> OOM we should retry to retrieve WAL from the local pg_wal/ even in the
> case where we are in the crash recovery phase, *before* switching to

Apologies for the confusion. What I was thinking is that an OOM is
more likely to occur in replication downstreams than during server
startup. I also felt for the latter case that such a challenging
environment probably wouldn't let the server enter stable normal
operation.

> archive recovery and a different source, right?  I think that this
> argument can go two ways, because it could be more helpful for some to
> see a FATAL when we are still in crash recovery, even if there is a
> standby.signal.  It does not seem to me that we have a clear
> definition about what to do in which case, either.  Now we just fail
> and hope for the best when doing crash recovery.

Agreed.

> >> I'm not sure how often users encounter corrupt WAL data, but I believe
> >> they should have the option to terminate recovery and then switch to
> >> normal operation.
> >> 
> >> What if we introduced an option to increase the timeline whenever
> >> recovery hits data error? If that option is disabled, the server stops
> >> when recovery detects an incorrect data, except in the case of an
> >> OOM. OOM cause record retry.
> 
> I guess that it depends on how much responsiveness one may want.
> Forcing a failure on OOM is at least something that users would be
> immediately able to act on when we don't run a standby but just
> recover from a crash, while a standby would do what it is designed to
> do, aka continue to replay what it can see.  One issue with the
> wait-and-continue is that a standby may loop continuously on OOM,
> which could be also bad if there's a replication slot retaining WAL on
> the primary.  Perhaps that's just OK to keep doing that for a
> standby.  At least this makes the discussion easier for the sake of
> this thread: just consider the case of crash recovery when we don't
> have a standby.

Yeah, I'm with you on focusing on crash recovery cases; that's what I
intended. Sorry for any confusion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
On Thu, Aug 10, 2023 at 02:47:51PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in
>> On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
>> > At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>> >> We have treated every kind of broken data as end-of-recovery, like
>> >> incorrect rm_id or prev link including excessively large record length
>> >> due to corruption. This patch is going to change the behavior only for
>> >> the last one. If you think there can't be non-zero broken data, we
>> >> should inhibit proceeding recovery after all non-zero incorrect
>> >> data. This seems to be a quite big change in our recovery policy.
>>
>> Well, per se the report that led to this thread.  We may lose data and
>> finish with corrupted pages.  I was planning to reply to the other
>> thread [1] and the patch of Noah anyway, because we have to fix the
>> detection of OOM vs corrupted records in the allocation path anyway.
>
> Does this mean we will address the distinction between an OOM and a
> corrupt total record length later on? If that's the case, should we
> modify that behavior right now?

My apologies if I sounded unclear here.  It seems to me that we should
wrap the patch on [1] first, and get it backpatched.  At least that
makes for less conflicts when 0001 gets merged for HEAD when we are
able to set a proper error code.  (Was looking at it, actually.)

>>>> 4) Wrap up recovery then continue to normal operation.
>>>>
>>>> This is the traditional behavior for corrupt WAL data.
>>
>> Yeah, we've been doing a pretty bad job in classifying the errors that
>> can happen doing WAL replay in crash recovery, because we assume that
>> all the WAL still in pg_wal/ is correct.  That's not an easy problem,
>
> I'm not quite sure what "correct" means here. I believe xlogreader
> runs various checks since the data may be incorrect. Given that can
> break for various reasons, during crash recovery, we continue as long
> as incoming WAL record remains consistently valid. The problem raised
> here is that we can't distinctly identify an OOM from a corrupted total
> record length field. One reason is we check the data after a part is
> loaded, but we can't load all the bytes from the record into memory in
> such cases.

Yep.  Correct means that we end recovery in a consistent state, not
earlier than we should.

>> because the record CRC works for all the contents of the record, and
>> we look at the record header before that.  Another idea may be to have
>> an extra CRC only for the header itself, that can be used in isolation
>> as one of the checks in XLogReaderValidatePageHeader().
>
> Sounds reasonable. By using CRC to protect the header part and
> allocating a fixed-length buffer for it, I believe we're adopting a
> standard approach and can identify OOM and other kind of header errors
> with a good degree of certainty.

Not something that I'd like to cover in this patch set, though..  This
is a problem on its own.

>>>> I'm not entirely certain, but if you were to ask me which is more
>>>> probable during recovery - encountering a correct record that's too
>>>> lengthy for the server to buffer or stumbling upon a corrupt byte
>>>> sequence - I'd bet on the latter.
>>
>> I don't really believe in chance when it comes to computer science,
>> facts and a correct detection of such facts are better :)
>
> Even now, it seems like we're balancing the risk of potential data
> loss against the potential inability to start the server. I meant
> that... if I had to choose, I'd lean slightly towards prioritizing
> saving the latter, in other words, keeping the current behavior.

On OOM, this means data loss and silent corruption.  A failure has the
merit to tell someone that something is wrong, at least, and that
they'd better look at it rather than hope for the best.

>>> ... of course this refers to crash recovery. For replication, we
>>> should keep retrying the current record until the operator commands
>>> promotion.
>>
>> Are you referring to a retry if there is a standby.signal?  I am a
>> bit confused by this sentence, because we could do a crash recovery,
>> then switch to archive recovery.  So, I guess that you mean that on
>> OOM we should retry to retrieve WAL from the local pg_wal/ even in the
>> case where we are in the crash recovery phase, *before* switching to
>
> Apologies for the confusion. What I was thinking is that an OOM is
> more likely to occur in replication downstreams than during server
> startup. I also felt for the latter case that such a challenging
> environment probably wouldn't let the server enter stable normal
> operation.

It depends on what the user does with the host running the cluster.
Both could be impacted.

>>>> I'm not sure how often users encounter corrupt WAL data, but I believe
>>>> they should have the option to terminate recovery and then switch to
>>>> normal operation.
>>>>
>>>> What if we introduced an option to increase the timeline whenever
>>>> recovery hits data error? If that option is disabled, the server stops
>>>> when recovery detects an incorrect data, except in the case of an
>>>> OOM. OOM cause record retry.
>>
>> I guess that it depends on how much responsiveness one may want.
>> Forcing a failure on OOM is at least something that users would be
>> immediately able to act on when we don't run a standby but just
>> recover from a crash, while a standby would do what it is designed to
>> do, aka continue to replay what it can see.  One issue with the
>> wait-and-continue is that a standby may loop continuously on OOM,
>> which could be also bad if there's a replication slot retaining WAL on
>> the primary.  Perhaps that's just OK to keep doing that for a
>> standby.  At least this makes the discussion easier for the sake of
>> this thread: just consider the case of crash recovery when we don't
>> have a standby.
>
> Yeah, I'm with you on focusing on crash recovery cases; that's what I
> intended. Sorry for any confusion.

Okay, so we're on the same page here, keeping standbys as they are and
doing something for the crash recovery case.
--
Michael

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
On Thu, Aug 10, 2023 at 02:59:07PM +0900, Michael Paquier wrote:
> My apologies if I sounded unclear here.  It seems to me that we should
> wrap the patch on [1] first, and get it backpatched.  At least that
> makes for less conflicts when 0001 gets merged for HEAD when we are
> able to set a proper error code.  (Was looking at it, actually.)

Now that Thomas Munro has addressed the original problem so that
xl_tot_len can be trusted (bae868caf22), I am coming back to this
thread.

First, attached is a rebased set:
- 0001 to introduce the new error infra for xlogreader.c with an error
code, so that callers can tell the difference between an OOM and an
invalid record.
- 0002 to tweak the startup process.  Once again, I've taken the
approach to make the startup process behave like a standby on crash
recovery: each time that an OOM is found, we loop and retry.
- 0003 to emulate an OOM failure, that can be used with the script
attached to see that we don't stop recovery too early.

>>> I guess that it depends on how much responsiveness one may want.
>>> Forcing a failure on OOM is at least something that users would be
>>> immediately able to act on when we don't run a standby but just
>>> recover from a crash, while a standby would do what it is designed to
>>> do, aka continue to replay what it can see.  One issue with the
>>> wait-and-continue is that a standby may loop continuously on OOM,
>>> which could be also bad if there's a replication slot retaining WAL on
>>> the primary.  Perhaps that's just OK to keep doing that for a
>>> standby.  At least this makes the discussion easier for the sake of
>>> this thread: just consider the case of crash recovery when we don't
>>> have a standby.
>>
>> Yeah, I'm with you on focusing on crash recovery cases; that's what I
>> intended. Sorry for any confusion.
>
> Okay, so we're on the same page here, keeping standbys as they are and
> doing something for the crash recovery case.

For the crash recovery case, one argument that stood out in my mind is
that causing a hard failure has the disadvantage of forcing users to
redo WAL replay from the last redo position, which may be far away
even if the checkpointer now runs during crash recovery.  What I am
proposing on this thread has the merit of avoiding that.  Anyway, let's
discuss more before settling this point for the crash recovery case.

By the way, anything that I am proposing here cannot be backpatched
because of the infrastructure changes required in xlogreader.c, so I am
going to create a second thread with something that could be
backpatched (yeah, likely FATALs on OOM to stop recovery from doing
something bad)..
--
Michael

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
On Tue, Sep 26, 2023 at 03:48:07PM +0900, Michael Paquier wrote:
> By the way, anything that I am proposing here cannot be backpatched
> because of the infrastructure changes required in xlogreader.c, so I am
> going to create a second thread with something that could be
> backpatched (yeah, likely FATALs on OOM to stop recovery from doing
> something bad)..

The patch set is rebased as an effect of 6b18b3fe2c2f, which switched
the OOMs to fail harder now in xlogreader.c.  The patch set has nothing
new, except that 0001 is now a revert of 6b18b3fe2c2f to switch
xlogreader.c back to using soft errors on OOMs.

If there's no interest in this patch set after the next CF, I'm OK to
drop it.  The state of HEAD is at least correct in the OOM cases now.
--
Michael

Attachments

Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Michael Paquier
On Tue, Oct 03, 2023 at 04:20:45PM +0900, Michael Paquier wrote:
> If there's no interest in this patch set after the next CF, I'm OK to
> drop it.  The state of HEAD is at least correct in the OOM cases now.

I have been thinking about this patch for the last few days, and in
light of 6b18b3fe2c2f I am going to withdraw it for now as this makes
the error layers of the xlogreader more complicated.
--
Michael

Вложения