Thread: Fix race condition in InvalidatePossiblyObsoleteSlot()

Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bertrand Drouvot
Hi hackers,

While working on [1], we discovered (thanks to Alexander for the testing) that a
conflicting active logical slot on a standby could be "terminated" without
leading to an "obsolete" message (see [2]).

Indeed, in case of an active slot we proceed in 2 steps in
InvalidatePossiblyObsoleteSlot():

- terminate the backend holding the slot
- report the slot as obsolete

This is racy because between the two we release the mutex on the slot, which
means that the slot's effective_xmin and effective_catalog_xmin could advance
during that time (leading us to exit the loop).

I think that holding the mutex longer is not an option (given what we'd have to do
while holding it), so the attached proposal is to instead record the effective_xmin and
effective_catalog_xmin that were in use at the time of the backend termination.
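
To give an idea of the shape of it (a simplified sketch only; the attached
patch is authoritative and the variable names may differ): inside the retry
loop, while the slot's mutex is still held and before any termination has
been requested, the current values are captured once, and the subsequent
conflict checks are done against those captured values rather than against
the live slot fields:

    if (!terminated)
    {
        /*
         * Capture the horizons only once, before any backend gets
         * terminated; later iterations keep checking against these
         * captured values even if the slot advances while the mutex
         * is released.
         */
        initial_effective_xmin = s->effective_xmin;
        initial_catalog_effective_xmin = s->effective_catalog_xmin;
    }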

[1]: https://www.postgresql.org/message-id/flat/bf67e076-b163-9ba3-4ade-b9fc51a3a8f6%40gmail.com
[2]: https://www.postgresql.org/message-id/7c025095-5763-17a6-8c80-899b35bd0459%40gmail.com

Looking forward to your feedback,

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bharath Rupireddy
On Mon, Jan 15, 2024 at 1:18 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi hackers,
>
> While working on [1], we discovered (thanks to Alexander for the testing) that a
> conflicting active logical slot on a standby could be "terminated" without
> leading to an "obsolete" message (see [2]).
>
> Indeed, in case of an active slot we proceed in 2 steps in
> InvalidatePossiblyObsoleteSlot():
>
> - terminate the backend holding the slot
> - report the slot as obsolete
>
> This is racy because between the two we release the mutex on the slot, which
> means that the slot's effective_xmin and effective_catalog_xmin could advance
> during that time (leading us to exit the loop).
>
> I think that holding the mutex longer is not an option (given what we'd have to do
> while holding it), so the attached proposal is to instead record the effective_xmin and
> effective_catalog_xmin that were in use at the time of the backend termination.
>
> [1]: https://www.postgresql.org/message-id/flat/bf67e076-b163-9ba3-4ade-b9fc51a3a8f6%40gmail.com
> [2]: https://www.postgresql.org/message-id/7c025095-5763-17a6-8c80-899b35bd0459%40gmail.com
>
> Looking forward to your feedback,

IIUC, the issue is that we terminate the process holding the
replication slot, and the conflict cause that we recorded before
terminating the process changes in the next iteration due to the
advancement in effective_xmin and/or effective_catalog_xmin.

FWIW, test code something like [1] can help detect such race issues, right?

Some comments on the patch:

1.
                 last_signaled_pid = active_pid;
+                terminated = true;
             }

Why is a separate variable needed? Isn't last_signaled_pid != 0 enough
to determine that a process was terminated earlier?

2. If my understanding of the racy behavior is right, can the same
issue happen due to the advancement in restart_lsn? I'm not sure if it
can happen at all, but I think we can rely on previous conflict reason
instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.

3. Is there a way to reproduce this racy issue, maybe by adding some
sleeps or something? If yes, can you share it here, just for the
record, and verify that whatever fix is provided actually works?

[1]
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 52da694c79..d020b038bc 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1352,6 +1352,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 {
        int                     last_signaled_pid = 0;
        bool            released_lock = false;
+       ReplicationSlotInvalidationCause conflict_prev = RS_INVAL_NONE;

        for (;;)
        {
@@ -1417,6 +1418,18 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
                        }
                }

+               /*
+                * Check if the conflict cause recorded previously before we
+                * terminate the process changed now for any reason.
+                */
+               if (conflict_prev != RS_INVAL_NONE &&
+                       last_signaled_pid != 0 &&
+                       conflict_prev != conflict)
+                       elog(PANIC, "conflict cause recorded before terminating process %d has been changed; previous cause: %d, current cause: %d",
+                                last_signaled_pid,
+                                conflict_prev,
+                                conflict);
+
                /* if there's no conflict, we're done */
                if (conflict == RS_INVAL_NONE)
                {
@@ -1499,6 +1512,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
                                        (void) kill(active_pid, SIGTERM);

                                last_signaled_pid = active_pid;
+                               conflict_prev = conflict;
                        }

                        /* Wait until the slot is released. */

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bertrand Drouvot
Hi,

On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
> IIUC, the issue is that we terminate the process holding the
> replication slot, and the conflict cause that we recorded before
> terminating the process changes in the next iteration due to the
> advancement in effective_xmin and/or effective_catalog_xmin.

Thanks for looking at it!

Yeah, and that could lead to no conflict detected anymore (like in the
case [2] reported up-thread).

> FWIW, test code something like [1] can help detect such race issues, right?

I think so and I've added it in v2 attached (except that it uses the new
"terminated" variable, see below), thanks!

> Some comments on the patch:
> 
> 1.
>                  last_signaled_pid = active_pid;
> +                terminated = true;
>              }
> 
> Why is a separate variable needed? Isn't last_signaled_pid != 0 enough
> to determine that a process was terminated earlier?

Yeah, probably. I thought about it but preferred to add a new variable for this
purpose, for clarity and to avoid race conditions (in case future patches "touch"
last_signaled_pid somehow). I was thinking that this part of the code is already
not that easy to follow.

> 2. If my understanding of the racy behavior is right, can the same
> issue happen due to the advancement in restart_lsn?

I'm not sure, as I have never seen it, but it should not hurt to also consider this
"potential" case, so it's done in v2 attached.

> I'm not sure if it
> can happen at all, but I think we can rely on previous conflict reason
> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.

I'm not sure what you mean here. I think we should still keep the "initial" LSN
so that the next check done with it still makes sense. The previous conflict
reason as you're proposing also makes sense to me but for another reason: PANIC
in case the issue still happens (for cases we did not think about, meaning not
covered by what the added previous LSNs are covering).

> 3. Is there a way to reproduce this racy issue, maybe by adding some
> sleeps or something? If yes, can you share it here, just for the
> record, and verify that whatever fix is provided actually works?

Alexander was able to reproduce it on a slow machine and the issue was not there
anymore with v1 in place. I think it's tricky to reproduce as it would need the
slot to advance between the 2 checks.
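
For the record, one crude way to make the window easier to hit by hand would
be to pause between the termination and the re-check, something like the
hypothetical, local-only hack below (not something I'd put in a patch, and the
exact placement is illustrative):

    /*
     * Debug-only hack: widen the race window so the slot has time to
     * advance between the first conflict check and the re-check done
     * after the backend holding the slot has been terminated.
     */
    pg_usleep(2000000L);        /* 2 seconds */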

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Michael Paquier
On Thu, Jan 18, 2024 at 02:20:28PM +0000, Bertrand Drouvot wrote:
> On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
>> I'm not sure if it
>> can happen at all, but I think we can rely on previous conflict reason
>> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.
>
> I'm not sure what you mean here. I think we should still keep the "initial" LSN
> so that the next check done with it still makes sense. The previous conflict
> reason as you're proposing also makes sense to me but for another reason: PANIC
> in case the issue still happens (for cases we did not think about, meaning not
> covered by what the added previous LSNs are covering).

Using a PANIC seems like an overreaction to me for this path.  Sure,
we should not record a conflict reason twice, but wouldn't an
assertion be adapted enough to check that a termination is not
logged twice for this code path run in the checkpointer?

+            if (!terminated)
+            {
+                initial_restart_lsn = s->data.restart_lsn;
+                initial_effective_xmin = s->effective_xmin;
+                initial_catalog_effective_xmin = s->effective_catalog_xmin;
+            }

It seems important to document why we are saving this data here: while
we hold the LWLock for repslots, before we perform any termination, we
want to protect against conflict reports based on incorrect data if the
slot data got changed while the LWLock is temporarily released during a
termination.  No need to mention the pesky standby snapshot records, I
guess, as there could be different reasons why these xmins advance.

>> 3. Is there a way to reproduce this racy issue, maybe by adding some
>> sleeps or something? If yes, can you share it here, just for the
>> record, and verify that whatever fix is provided actually works?
>
> Alexander was able to reproduce it on a slow machine and the issue was not there
> anymore with v1 in place. I think it's tricky to reproduce as it would need the
> slot to advance between the 2 checks.

I'd really want a test for that in the long term.  And an injection
point stuck between the point where ReplicationSlotControlLock is
released then re-acquired when there is an active PID should be
enough, isn't it?  For example just after the kill()?  It seems to me
that we should get the checkpointer stuck, perform a forced upgrade of the
xmins, release the checkpointer and see the effects of the change in
the second loop.  Now, modules/injection_points/ does not allow that,
yet, but I've posted a patch along these lines that can stop a process
and release it using a condition variable (should be 0006 on [1]).  I
was planning to start a new thread with a patch posted for the next CF
to add this kind of facility with a regression test for an old bug,
the patch needs a few hours of love, first.  I should be able to post
that next week.

[1]: https://www.postgresql.org/message-id/CAExHW5uwP9RmCt9bA91bUfKNQeUrosAWtMens4Ah2PuYZv2NRA@mail.gmail.com
--
Michael

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bertrand Drouvot
Hi,

On Thu, Feb 15, 2024 at 02:09:45PM +0900, Michael Paquier wrote:
> On Thu, Jan 18, 2024 at 02:20:28PM +0000, Bertrand Drouvot wrote:
> > On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
> >> I'm not sure if it
> >> can happen at all, but I think we can rely on previous conflict reason
> >> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.
> > 
> > I'm not sure what you mean here. I think we should still keep the "initial" LSN
> > so that the next check done with it still makes sense. The previous conflict
> > reason as you're proposing also makes sense to me but for another reason: PANIC
> > in case the issue still happens (for cases we did not think about, meaning not
> > covered by what the added previous LSNs are covering).
> 
> Using a PANIC seems like an overreaction to me for this path.  Sure, 
> we should not record a conflict reason twice, but wouldn't an
> assertion be adapted enough to check that a termination is not
> logged twice for this code path run in the checkpointer?

Thanks for looking at it!

Agree, using an assertion instead in v3 attached.

> 
> +            if (!terminated)
> +            {
> +                initial_restart_lsn = s->data.restart_lsn;
> +                initial_effective_xmin = s->effective_xmin;
> +                initial_catalog_effective_xmin = s->effective_catalog_xmin;
> +            }
> 
> It seems important to document why we are saving this data here: while
> we hold the LWLock for repslots, before we perform any termination, we
> want to protect against conflict reports based on incorrect data if the
> slot data got changed while the LWLock is temporarily released during a
> termination.

Yeah, comments added in v3.

> >> 3. Is there a way to reproduce this racy issue, maybe by adding some
> >> sleeps or something? If yes, can you share it here, just for the
> >> record, and verify that whatever fix is provided actually works?
> > 
> > Alexander was able to reproduce it on a slow machine and the issue was not there
> > anymore with v1 in place. I think it's tricky to reproduce as it would need the
> > slot to advance between the 2 checks.
> 
> I'd really want a test for that in the long term.  And an injection
> point stuck between the point where ReplicationSlotControlLock is
> released then re-acquired when there is an active PID should be
> enough, isn't it? 

Yeah, that should be enough.

> For example just after the kill()?  It seems to me
> that we should get the checkpointer stuck, perform a forced upgrade of the
> xmins, release the checkpointer and see the effects of the change in
> the second loop.  Now, modules/injection_points/ does not allow that,
> yet, but I've posted a patch along these lines that can stop a process
> and release it using a condition variable (should be 0006 on [1]).  I
> was planning to start a new thread with a patch posted for the next CF
> to add this kind of facility with a regression test for an old bug,
> the patch needs a few hours of love, first.  I should be able to post
> that next week.

Great, that looks like a good idea!

Are we going to fix this on master and 16 stable first and then later on add a
test on master with the injection points?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Michael Paquier
On Thu, Feb 15, 2024 at 11:24:59AM +0000, Bertrand Drouvot wrote:
> Thanks for looking at it!
>
> Agree, using an assertion instead in v3 attached.

And you did not forget the PG_USED_FOR_ASSERTS_ONLY.

> > It seems important to document why we are saving this data here: while
> > we hold the LWLock for repslots, before we perform any termination, we
> > want to protect against conflict reports based on incorrect data if the
> > slot data got changed while the LWLock is temporarily released during a
> > termination.
>
> Yeah, comments added in v3.

The contents look rather OK, I may do some word-smithing for both.

>> For example just after the kill()?  It seems to me
>> that we should get the checkpointer stuck, perform a forced upgrade of the
>> xmins, release the checkpointer and see the effects of the change in
>> the second loop.  Now, modules/injection_points/ does not allow that,
>> yet, but I've posted a patch along these lines that can stop a process
>> and release it using a condition variable (should be 0006 on [1]).  I
>> was planning to start a new thread with a patch posted for the next CF
>> to add this kind of facility with a regression test for an old bug,
>> the patch needs a few hours of love, first.  I should be able to post
>> that next week.
>
> Great, that looks like a good idea!

I've been finally able to spend some time on what I had in mind and
posted it here for the next CF:
https://www.postgresql.org/message-id/ZdLuxBk5hGpol91B@paquier.xyz

You should be able to reuse that the same way I do in 0002 posted on
the thread, where I'm causing the checkpointer to wait, then wake it
up.

> Are we going to fix this on master and 16 stable first and then later on add a
> test on master with the injection points?

Still, the other patch is likely going to take a couple of weeks
before getting into the tree, so I have no objection to fix the bug
first and backpatch, then introduce a test.  Things have proved that
failures could show up in the buildfarm in this area, so more time
running this code across two branches is not a bad concept, either.
--
Michael

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bertrand Drouvot
Hi,

On Mon, Feb 19, 2024 at 03:14:07PM +0900, Michael Paquier wrote:
> On Thu, Feb 15, 2024 at 11:24:59AM +0000, Bertrand Drouvot wrote:
> > Thanks for looking at it!
> > 
> > Agree, using an assertion instead in v3 attached.
> 
> And you did not forget the PG_USED_FOR_ASSERTS_ONLY.

Thanks to the "CompilerWarnings" cirrus test ;-)

> 
> > > It seems important to document why we are saving this data here: while
> > > we hold the LWLock for repslots, before we perform any termination, we
> > > want to protect against conflict reports based on incorrect data if the
> > > slot data got changed while the LWLock is temporarily released during a
> > > termination.
> > 
> > Yeah, comments added in v3.
> 
> The contents look rather OK, I may do some word-smithing for both.

Thanks!

> >> For example just after the kill()?  It seems to me
> >> that we should get the checkpointer stuck, perform a forced upgrade of the
> >> xmins, release the checkpointer and see the effects of the change in
> >> the second loop.  Now, modules/injection_points/ does not allow that,
> >> yet, but I've posted a patch along these lines that can stop a process
> >> and release it using a condition variable (should be 0006 on [1]).  I
> >> was planning to start a new thread with a patch posted for the next CF
> >> to add this kind of facility with a regression test for an old bug,
> >> the patch needs a few hours of love, first.  I should be able to post
> >> that next week.
> > 
> > Great, that looks like a good idea!
> 
> I've been finally able to spend some time on what I had in mind and
> posted it here for the next CF:
> https://www.postgresql.org/message-id/ZdLuxBk5hGpol91B@paquier.xyz
> 
> You should be able to reuse that the same way I do in 0002 posted on
> the thread, where I'm causing the checkpointer to wait, then wake it
> up.

Thanks! I'll look at it.

> > Are we going to fix this on master and 16 stable first and then later on add a
> > test on master with the injection points?
> 
> Still, the other patch is likely going to take a couple of weeks
> before getting into the tree, so I have no objection to fix the bug
> first and backpatch, then introduce a test.  Things have proved that
> failures could show up in the buildfarm in this area, so more time
> running this code across two branches is not a bad concept, either.

Fully agree.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bharath Rupireddy
On Mon, Feb 19, 2024 at 11:44 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> > Yeah, comments added in v3.
>
> The contents look rather OK, I may do some word-smithing for both.

Here are some comments on v3:

1.
+    XLogRecPtr    initial_effective_xmin = InvalidXLogRecPtr;
+    XLogRecPtr    initial_catalog_effective_xmin = InvalidXLogRecPtr;
+    XLogRecPtr    initial_restart_lsn = InvalidXLogRecPtr;

Prefix 'initial_' makes the variable names a bit longer. I think we
can just use effective_xmin, catalog_effective_xmin and restart_lsn;
the code updating them only inside if (!terminated) tells one that they
aren't updated every time.

2.
+            /*
+             * We'll release the slot's mutex soon, so it's possible that
+             * those values change since the process holding the slot has been
+             * terminated (if any), so record them here to ensure we would
+             * report the slot as obsolete correctly.
+             */

This needs a bit more info as to why and how effective_xmin,
catalog_effective_xmin and restart_lsn can move ahead after signaling
a backend and before the signalled backend reports back.

3.
+        /*
+         * Assert that the conflict cause recorded previously before we
+         * terminate the process did not change now for any reason.
+         */
+        Assert(!(conflict_prev != RS_INVAL_NONE && terminated &&
+                 conflict_prev != conflict));

It took a while for me to understand the above condition, can we
simplify it like below using De Morgan's laws for better readability?

Assert((conflict_prev == RS_INVAL_NONE) || !terminated ||
(conflict_prev == conflict));

> > Are we going to fix this on master and 16 stable first and then later on add a
> > test on master with the injection points?
>
> Still, the other patch is likely going to take a couple of weeks
> before getting into the tree, so I have no objection to fix the bug
> first and backpatch, then introduce a test.  Things have proved that
> failures could show up in the buildfarm in this area, so more time
> running this code across two branches is not a bad concept, either.

While I couldn't agree more on getting this fix in, it's worth pulling
in the required injection points patch here and writing the test to
reproduce this race issue.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bertrand Drouvot
Hi,

On Mon, Feb 19, 2024 at 01:45:16PM +0530, Bharath Rupireddy wrote:
> On Mon, Feb 19, 2024 at 11:44 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > > Yeah, comments added in v3.
> >
> > The contents look rather OK, I may do some word-smithing for both.
> 
> Here are some comments on v3:

Thanks for looking at it!

> 1.
> +    XLogRecPtr    initial_effective_xmin = InvalidXLogRecPtr;
> +    XLogRecPtr    initial_catalog_effective_xmin = InvalidXLogRecPtr;
> +    XLogRecPtr    initial_restart_lsn = InvalidXLogRecPtr;
> 
> Prefix 'initial_' makes the variable names a bit longer. I think we
> can just use effective_xmin, catalog_effective_xmin and restart_lsn;
> the code updating them only inside if (!terminated) tells one that they
> aren't updated every time.

I'm not sure about that. I prefer to see meaningful names instead of having
to read the code where they are used.

> 2.
> +            /*
> +             * We'll release the slot's mutex soon, so it's possible that
> +             * those values change since the process holding the slot has been
> +             * terminated (if any), so record them here to ensure we would
> +             * report the slot as obsolete correctly.
> +             */
> 
> This needs a bit more info as to why and how effective_xmin,
> catalog_effective_xmin and restart_lsn can move ahead after signaling
> a backend and before the signalled backend reports back.

I'm not sure of the added value of such extra details in this comment and if
the comment would be easy to maintain. I've the feeling that knowing it's possible
is enough here. Happy to hear what others think about it too.

> 3.
> +        /*
> +         * Assert that the conflict cause recorded previously before we
> +         * terminate the process did not change now for any reason.
> +         */
> +        Assert(!(conflict_prev != RS_INVAL_NONE && terminated &&
> +                 conflict_prev != conflict));
> 
> It took a while for me to understand the above condition, can we
> simplify it like below using De Morgan's laws for better readability?
> 
> Assert((conflict_prev == RS_INVAL_NONE) || !terminated ||
> (conflict_prev == conflict));

I don't have a strong opinion on this, looks like a matter of taste.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Michael Paquier
On Mon, Feb 19, 2024 at 09:49:24AM +0000, Bertrand Drouvot wrote:
> On Mon, Feb 19, 2024 at 01:45:16PM +0530, Bharath Rupireddy wrote:
>> Prefix 'initial_' makes the variable names a bit longer. I think we
>> can just use effective_xmin, catalog_effective_xmin and restart_lsn;
>> the code updating them only inside if (!terminated) tells one that they
>> aren't updated every time.
>
> I'm not sure about that. I prefer to see meaningful names instead of having
> to read the code where they are used.

Prefixing these with "initial_" is fine, IMO.  That shows the
intention that these come from the slot's data before doing the
termination.  So I'm OK with what's been proposed in v3.

>> This needs a bit more info as to why and how effective_xmin,
>> catalog_effective_xmin and restart_lsn can move ahead after signaling
>> a backend and before the signalled backend reports back.
>
> I'm not sure of the added value of such extra details in this comment and if
> the comment would be easy to maintain. I've the feeling that knowing it's possible
> is enough here. Happy to hear what others think about it too.

Documenting that the risk exists rather than all the possible reasons
why this could happen is surely more maintainable.  In short, I'm OK
with what the patch does, just telling that it is possible.

>> +        Assert(!(conflict_prev != RS_INVAL_NONE && terminated &&
>> +                 conflict_prev != conflict));
>>
>> It took a while for me to understand the above condition, can we
>> simplify it like below using De Morgan's laws for better readability?
>>
>> Assert((conflict_prev == RS_INVAL_NONE) || !terminated ||
>> (conflict_prev == conflict));
>
> I don't have a strong opinion on this, looks like a matter of taste.

Both are the same to me, so I have no extra opinion to offer.  ;)
--
Michael

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Michael Paquier
On Tue, Feb 20, 2024 at 08:51:17AM +0900, Michael Paquier wrote:
> Prefixing these with "initial_" is fine, IMO.  That shows the
> intention that these come from the slot's data before doing the
> termination.  So I'm OK with what's been proposed in v3.

I was looking at that a second time, and just concluded that this is
OK, so I've applied that down to 16, wordsmithing a bit the comments.

> Both are the same to me, so I have no extra opinion to offer.  ;)

I've kept this one as-is, though.
--
Michael

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bertrand Drouvot
Hi,

On Tue, Feb 20, 2024 at 02:33:44PM +0900, Michael Paquier wrote:
> On Tue, Feb 20, 2024 at 08:51:17AM +0900, Michael Paquier wrote:
> > Prefixing these with "initial_" is fine, IMO.  That shows the
> > intention that these come from the slot's data before doing the
> > termination.  So I'm OK with what's been proposed in v3.
> 
> I was looking at that a second time, and just concluded that this is
> OK, so I've applied that down to 16, wordsmithing a bit the comments.

Thanks!
FWIW, I've started to write a POC regarding the test we mentioned up-thread.

The POC test is based on what has been submitted by Michael in [1]. The POC test
seems to work fine and it seems that nothing more is needed in [1] (at some point
I thought I would need the ability to wake up multiple "wait" injection points
in sequence but that was not necessary).

I'll polish and propose my POC test once [1] is pushed (unless you're curious
about it before).

[1]: https://www.postgresql.org/message-id/flat/ZdLuxBk5hGpol91B%40paquier.xyz

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

От
Bertrand Drouvot
Дата:
Hi,

On Tue, Feb 20, 2024 at 04:03:53PM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Feb 20, 2024 at 02:33:44PM +0900, Michael Paquier wrote:
> > On Tue, Feb 20, 2024 at 08:51:17AM +0900, Michael Paquier wrote:
> > > Prefixing these with "initial_" is fine, IMO.  That shows the
> > > intention that these come from the slot's data before doing the
> > > termination.  So I'm OK with what's been proposed in v3.
> > 
> > I was looking at that a second time, and just concluded that this is
> > OK, so I've applied that down to 16, wordsmithing a bit the comments.
> 
> Thanks!
> FWIW, I've started to write a POC regarding the test we mentioned up-thread.
> 
> The POC test is based on what has been submitted by Michael in [1]. The POC test
> seems to work fine and it seems that nothing more is needed in [1] (at some point
> I thought I would need the ability to wake up multiple "wait" injection points
> in sequence but that was not necessary).
> 
> I'll polish and propose my POC test once [1] is pushed (unless you're curious
> about it before).

Though [1] mentioned up-thread is not pushed yet, I'm sharing the POC patch now
(see the attached).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Michael Paquier
On Mon, Feb 26, 2024 at 02:01:45PM +0000, Bertrand Drouvot wrote:
> Though [1] mentioned up-thread is not pushed yet, I'm sharing the POC patch now
> (see the attached).

I have looked at what you have here.

First, in a build where 818fefd8fd is included, this makes the test
script a lot slower.  Most of the logic is quick, but we're spending
10s or so checking that catalog_xmin has advanced.  Could it be
possible to make that faster?

A second issue is the failure mode when 818fefd8fd is reverted.  The
test is getting stuck when we are waiting on the standby to catch up,
until a timeout decides to kick in to fail the test, and all the
previous tests pass.  Could it be possible to make that more
responsive?  I assume that in the failure mode we would get an
incorrect conflict_reason for injection_inactiveslot, succeeding in
checking the failure.

+    my $terminated = 0;
+    for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+    {
+        if ($node_standby->log_contains(
+            'terminating process .* to release replication slot \"injection_activeslot\"', $logstart))
+        {
+            $terminated = 1;
+            last;
+        }
+        usleep(100_000);
+    }
+    ok($terminated, 'terminating process holding the active slot is logged with injection point');

The LOG exists when we are sure that the startup process is waiting
in the injection point, so this loop could be replaced with something
like:
+   $node_standby->wait_for_event('startup', 'TerminateProcessHoldingSlot');
+   ok( $node_standby->log_contains('terminating process .* .. ', 'termin .. ';)

Nit: the name of the injection point should be
terminate-process-holding-slot rather than
TerminateProcessHoldingSlot, to be consistent with the other ones.
--
Michael

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bertrand Drouvot
Hi,

On Tue, Mar 05, 2024 at 09:42:20AM +0900, Michael Paquier wrote:
> On Mon, Feb 26, 2024 at 02:01:45PM +0000, Bertrand Drouvot wrote:
> > Though [1] mentioned up-thread is not pushed yet, I'm sharing the POC patch now
> > (see the attached).
> 
> I have looked at what you have here.

Thanks!

> First, in a build where 818fefd8fd is included, this makes the test
> script a lot slower.  Most of the logic is quick, but we're spending
> 10s or so checking that catalog_xmin has advanced.  Could it be
> possible to make that faster?

Yeah, v2 attached changes this. It moves the injection point after the process
has been killed so that another process can decode at will (without the need
to wait for a walsender timeout) to reach LogicalConfirmReceivedLocation().
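
Roughly speaking, the injection point now sits right after the termination
step, something like the abbreviated sketch below (the attached v2 is what
counts; the surrounding code is elided here):

    /* ... the backend holding the slot has just been signalled ... */
    last_signaled_pid = active_pid;
    terminated = true;

    /* let the test hold the invalidation at this exact spot */
    INJECTION_POINT("terminate-process-holding-slot");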

> A second issue is the failure mode when 818fefd8fd is reverted.  The
> test is getting stuck when we are waiting on the standby to catch up,
> until a timeout decides to kick in to fail the test, and all the
> previous tests pass.  Could it be possible to make that more
> responsive?  I assume that in the failure mode we would get an
> incorrect conflict_reason for injection_inactiveslot, succeeding in
> checking the failure.

I tried to simulate a revert of 818fefd8fd (replacing "!terminated" by "1 == 1"
before the initial_* assignments). The issue is that then the new ASSERT is
triggered leading to the standby shutdown. So, I'm not sure how to improve this
case.
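
In other words, the simulated revert was a local change along these lines
(shown only to make the tested modification explicit):

    -       if (!terminated)
    +       if (1 == 1)             /* simulate a revert of 818fefd8fd */
            {
                    initial_restart_lsn = s->data.restart_lsn;
                    initial_effective_xmin = s->effective_xmin;
                    initial_catalog_effective_xmin = s->effective_catalog_xmin;
            }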

> +    my $terminated = 0;
> +    for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
> +    {
> +        if ($node_standby->log_contains(
> +            'terminating process .* to release replication slot \"injection_activeslot\"', $logstart))
> +        {
> +            $terminated = 1;
> +            last;
> +        }
> +        usleep(100_000);
> +    }
> +    ok($terminated, 'terminating process holding the active slot is logged with injection point');
> 
> The LOG exists when we are sure that the startup process is waiting
> in the injection point, so this loop could be replaced with something
> like:
> +   $node_standby->wait_for_event('startup', 'TerminateProcessHoldingSlot');
> +   ok( $node_standby->log_contains('terminating process .* .. ', 'termin .. ';)
> 

Yeah, now that wait_for_event() is there, let's use it: done in v2.

> Nit: the name of the injection point should be
> terminate-process-holding-slot rather than
> TerminateProcessHoldingSlot, to be consistent with the other ones. 

Done in v2.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Michael Paquier
On Tue, Mar 05, 2024 at 10:17:03AM +0000, Bertrand Drouvot wrote:
> On Tue, Mar 05, 2024 at 09:42:20AM +0900, Michael Paquier wrote:
>> First, in a build where 818fefd8fd is included, this makes the test
>> script a lot slower.  Most of the logic is quick, but we're spending
>> 10s or so checking that catalog_xmin has advanced.  Could it be
>> possible to make that faster?
>
> Yeah, v2 attached changes this. It moves the injection point after the process
> has been killed so that another process can decode at will (without the need
> to wait for a walsender timeout) to reach LogicalConfirmReceivedLocation().

Ah, OK.  Indeed that's much faster this way.

> I tried to simulate a revert of 818fefd8fd (replacing "!terminated" by "1 == 1"
> before the initial_* assignments).

Yeah.  I can see how this messes up the calculation of the
conditions, which is enough from my perspective, even if we don't have
any sanity checks in 818fefd8fd originally.

> The issue is that then the new ASSERT is
> triggered leading to the standby shutdown. So, I'm not sure how to improve this
> case.

It's been mentioned recently that we are not good at detecting crashes
in the TAP tests.  I am wondering if we should check the status of the
node in the most popular routines of Cluster.pm and die hard, as one
way to make the tests more responsive.  A topic for a different
thread, for sure.

> Done in v2.

Reworded a few things and applied this version.
--
Michael

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bertrand Drouvot
Hi,

On Wed, Mar 06, 2024 at 02:47:55PM +0900, Michael Paquier wrote:
> On Tue, Mar 05, 2024 at 10:17:03AM +0000, Bertrand Drouvot wrote:
> > On Tue, Mar 05, 2024 at 09:42:20AM +0900, Michael Paquier wrote:
> > The issue is that then the new ASSERT is
> > triggered leading to the standby shutdown. So, I'm not sure how to improve this
> > case.
> 
> It's been mentioned recently that we are not good at detecting crashes
> in the TAP tests.  I am wondering if we should check the status of the
> node in the most popular routines of Cluster.pm and die hard, as one
> way to make the tests more responsive.  A topic for a different
> thread, for sure.

Right, somehow out of context here.

> > Done in v2.
> 
> Reworded a few things and applied this version.

Thanks!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Michael Paquier
On Wed, Mar 06, 2024 at 09:17:58AM +0000, Bertrand Drouvot wrote:
> Right, somehow out of context here.

We're not in the green yet, one of my animals has complained:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2024-03-06%2010%3A10%3A03

This is telling us that the xmin horizon is unchanged, and the test
cannot move on with the injection point wake up that would trigger the
following logs:
2024-03-06 20:12:59.039 JST [21143] LOG:  invalidating obsolete replication slot "injection_activeslot"
2024-03-06 20:12:59.039 JST [21143] DETAIL:  The slot conflicted with xid horizon 770.

Not sure what to think about that yet.
--
Michael

Attachments

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bharath Rupireddy
On Wed, Mar 6, 2024 at 4:51 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Mar 06, 2024 at 09:17:58AM +0000, Bertrand Drouvot wrote:
> > Right, somehow out of context here.
>
> We're not in the green yet, one of my animals has complained:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2024-03-06%2010%3A10%3A03
>
> This is telling us that the xmin horizon is unchanged, and the test
> cannot move on with the injection point wake up that would trigger the
> following logs:
> 2024-03-06 20:12:59.039 JST [21143] LOG:  invalidating obsolete replication slot "injection_activeslot"
> 2024-03-06 20:12:59.039 JST [21143] DETAIL:  The slot conflicted with xid horizon 770.
>
> Not sure what to think about that yet.

Windows - Server 2019, VS 2019 - Meson & ninja on my CI setup isn't
happy about that either [1]. It looks like the slot's catalog_xmin on
the standby isn't moving forward.

[1]
https://cirrus-ci.com/task/5132148995260416

[09:11:17.851] 285/285 postgresql:recovery /
recovery/035_standby_logical_decoding               ERROR
553.48s   (exit status 255 or signal 127 SIGinvalid)
[09:11:17.855] >>>
INITDB_TEMPLATE=C:/cirrus/build/tmp_install/initdb-template
enable_injection_points=yes
PG_REGRESS=C:\cirrus\build\src/test\regress\pg_regress.exe
MALLOC_PERTURB_=172
REGRESS_SHLIB=C:\cirrus\build\src/test\regress\regress.dll

PATH=C:/cirrus/build/tmp_install/usr/local/pgsql/bin;C:\cirrus\build\src/test\recovery;C:/cirrus/build/src/test/recovery/test;C:\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX64\x64;C:\BuildTools\MSBuild\Current\bin\Roslyn;C:\Program
Files (x86)\Windows Kits\10\bin\10.0.20348.0\x64;C:\Program Files
(x86)\Windows
Kits\10\bin\x64;C:\BuildTools\\MSBuild\Current\Bin;C:\Windows\Microsoft.NET\Framework64\v4.0.30319;C:\BuildTools\Common7\IDE\;C:\BuildTools\Common7\Tools\;C:\BuildTools\VC\Auxiliary\Build;C:\zstd\zstd-v1.5.2-win64;C:\zlib;C:\lz4;C:\icu;C:\winflexbison;C:\strawberry\5.26.3.1\perl\bin;C:\python\Scripts\;C:\python\;C:\Windows
Kits\10\Debuggers\x64;C:\Program

Files\Git\usr\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\ProgramData\GooGet;C:\Program
Files\Google\Compute Engine\metadata_scripts;C:\Program Files
(x86)\Google\Cloud SDK\google-cloud-sdk\bin;C:\Program
Files\PowerShell\7\;C:\Program Files\Google\Compute
Engine\sysprep;C:\ProgramData\chocolatey\bin;C:\Program
Files\Git\cmd;C:\Program Files\Git\mingw64\bin;C:\Program
Files\Git\usr\bin;C:\Windows\system32\config\systemprofile\AppData\Local\Microsoft\WindowsApps
C:\python\python3.EXE C:\cirrus\build\..\src/tools/testwrap --basedir
C:\cirrus\build --srcdir C:\cirrus\src/test\recovery --testgroup
recovery --testname 035_standby_logical_decoding --
C:\strawberry\5.26.3.1\perl\bin\perl.EXE -I C:/cirrus/src/test/perl -I
C:\cirrus\src/test\recovery
C:/cirrus/src/test/recovery/t/035_standby_logical_decoding.pl
[09:11:17.855] ------------------------------------- 8<
-------------------------------------
[09:11:17.855] stderr:
[09:11:17.855] # poll_query_until timed out executing this query:

[09:11:17.855] # SELECT (SELECT catalog_xmin::text::int - 770 from
pg_catalog.pg_replication_slots where slot_name =
'injection_activeslot') > 0

[09:11:17.855] # expecting this output:

[09:11:17.855] # t

[09:11:17.855] # last actual query output:

[09:11:17.855] # f

[09:11:17.855] # with stderr:

[09:11:17.855] # Tests were run but no plan was declared and
done_testing() was not seen.

[09:11:17.855] # Looks like your test exited with 255 just after 57.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Bertrand Drouvot
Hi,

On Wed, Mar 06, 2024 at 05:45:56PM +0530, Bharath Rupireddy wrote:
> On Wed, Mar 6, 2024 at 4:51 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Wed, Mar 06, 2024 at 09:17:58AM +0000, Bertrand Drouvot wrote:
> > > Right, somehow out of context here.
> >
> > We're not in the green yet, one of my animals has complained:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2024-03-06%2010%3A10%3A03
> >
> > This is telling us that the xmin horizon is unchanged, and the test
> > cannot move on with the injection point wake up that would trigger the
> > following logs:
> > 2024-03-06 20:12:59.039 JST [21143] LOG:  invalidating obsolete replication slot "injection_activeslot"
> > 2024-03-06 20:12:59.039 JST [21143] DETAIL:  The slot conflicted with xid horizon 770.
> >
> > Not sure what to think about that yet.
> 
> Windows - Server 2019, VS 2019 - Meson & ninja on my CI setup isn't
> happy about that either [1]. It looks like the slot's catalog_xmin on
> the standby isn't moving forward.
> 

Thank you both for the report! I did a few tests manually and can see the issue
from time to time. When the issue occurs, the logical decoding was able to
go through the place where LogicalConfirmReceivedLocation() updates the
slot's catalog_xmin before being killed. In that case I can see that the
catalog_xmin is updated to the xid horizon.

Means in a failed test we have something like:

slot's catalog_xmin: 839 and "The slot conflicted with xid horizon 839." 

While when the test is ok you'll see something like:

slot's catalog_xmin: 841 and "The slot conflicted with xid horizon 842."

In the failing test the call to SELECT pg_logical_slot_get_changes() does
not advance the slot's catalog xmin anymore.

To fix this, I think we need a new transaction to decode from the primary before
executing pg_logical_slot_get_changes(). But this transaction has to be replayed
on the standby first by the startup process. Which means we need to wake up
"terminate-process-holding-slot" and that we probably need another injection
point somewhere in this test.

I'll look at it unless you've another idea?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

From: Michael Paquier
On Wed, Mar 06, 2024 at 05:45:56PM +0000, Bertrand Drouvot wrote:
> Thank you both for the report! I did a few tests manually and can see the issue
> from time to time. When the issue occurs, the logical decoding was able to
> go through the place where LogicalConfirmReceivedLocation() updates the
> slot's catalog_xmin before being killed. In that case I can see that the
> catalog_xmin is updated to the xid horizon.

Two buildfarm machines have complained here, and one of them twice in
a row.  That's quite amazing, because a couple of dozen runs done in a
row on the same host as these animals all pass.  The CI did not
complain either (did 2~3 runs there yesterday).

> Means in a failed test we have something like:
>
> slot's catalog_xmin: 839 and "The slot conflicted with xid horizon 839."
>
> While when the test is ok you'll see something like:
>
> slot's catalog_xmin: 841 and "The slot conflicted with xid horizon 842."

Perhaps we should also make the test report the catalog_xmin of the
slot.  That may make debugging a bit easier.

> In the failing test the call to SELECT pg_logical_slot_get_changes() does
> not advance the slot's catalog xmin anymore.

Is that something that we could enforce in the test in a stronger way,
cross-checking the xmin horizon before and after the call?

> To fix this, I think we need a new transaction to decode from the primary before
> executing pg_logical_slot_get_changes(). But this transaction has to be replayed
> on the standby first by the startup process. Which means we need to wake up
> "terminate-process-holding-slot" and that we probably need another injection
> point somewhere in this test.
>
> I'll look at it unless you've another idea?

I am wondering if there is something else lurking here, actually, so
for now I am going to revert the change as it is annoying to get
sporadic failures in the CF bot at this time of the year and there are
a lot of patches under discussion.  Let's give it more time and more
thought, without pressure.
--
Michael

Attachments