Обсуждение: logical replication seems broken

Поиск
Список
Период
Сортировка

logical replication seems broken

От
Erik Rijkers
Дата:
Hello,

I am seeing errors in replication in a test program that I've been running for years with very little change (since
2017,really [1]).
 

The symptom:
HEAD-replication fails (most of the time) when cascading 3 instances (master+2 replicas).

HEAD-replication works with 2 instances (master+replica).

I have also compiled a server on top of 4ad31bb2ef25 (avoiding some recent changes) - and this server runs the same
testprogram without failure; so I think the culprit might be somewhere in those changes.  Or (always possible) there
mightbe something my testing does wrong - but then again, I do this test a few times every week, and it never fails.
 

This weekend I can dig into it some more (make a self-sufficient example) but I thought I'd mention it already. perhaps
oneof you see the light immediately...
 


Erik Rijkers

[1] https://www.postgresql.org/message-id/flat/3897361c7010c4ac03f358173adbcd60%40xs4all.nl



Re: logical replication seems broken

От
Amit Kapila
Дата:
On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> Hello,
>
> I am seeing errors in replication in a test program that I've been running for years with very little change (since
2017,really [1]).
 
>
> The symptom:
> HEAD-replication fails (most of the time) when cascading 3 instances (master+2 replicas).
>
> HEAD-replication works with 2 instances (master+replica).
>
> I have also compiled a server on top of 4ad31bb2ef25 (avoiding some recent changes) - and this server runs the same
testprogram without failure; so I think the culprit might be somewhere in those changes.
 
>

Oh, if you are running this on today's HEAD then the recent commit
bff456d7a0 could be the culprit but not sure without knowing the
details.

>  Or (always possible) there might be something my testing does wrong - but then again, I do this test a few times
everyweek, and it never fails.
 
>

Do you expect anything in particular while running this test, any
expected messages? What is exactly failing?

> This weekend I can dig into it some more (make a self-sufficient example) but I thought I'd mention it already.
perhapsone of you see the light immediately...
 
>

That would be really helpful.

-- 
With Regards,
Amit Kapila.



Re: logical replication seems broken

От
er@xs4all.nl
Дата:
> On 02/12/2021 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
>  
> On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers <er@xs4all.nl> wrote:
> >
> > Hello,
> >
> > I am seeing errors in replication in a test program that I've been running for years with very little change (since
2017,really [1]).
 

Hi,

Here is a test program.  Careful, it deletes stuff.  And it will need some changes:

I compile postgres server versions into directories like:
     $HOME/pg_stuff/pg_installations/pgsql.$project    where project is a name

The attached script (logrep_cascade_bug.sh)  assumes that two such compiled versions are present (on my machine they
arecalled HEAD and head0):
 
     $HOME/pg_stuff/pg_installations/pgsql.HEAD   --> git master as of today - friday 12 febr 2021
     $HOME/pg_stuff/pg_installations/pgsql.head0  --> 3063eb17593c  so that's from 11 febr, before the replication
changes

In the test script, bash variables 'project' (and 'BIN') reflect my set up - so should probably be changed.

The instance from today 12 february ('HEAD') has the bug:
  it keeps endlessly waiting/looping with 'NOK' (=Not OK).
  'Not OK' means: primary not identical to all replicas (replica1 seems ok, but replica2 remains empty)

The instance from yesterday 11 february ('head0') is ok:
  it finishes in 20 s after waiting/looping just 2 or 3 times
  'ok' means: all replicas are identical to primary (as proven by the md5s).

That's all I have for now - I have no deeper idea about what exactly goes wrong.

I hope that helps, let me know when you cannot reproduce the problem.

Erik Rijkers
Вложения

Re: logical replication seems broken

От
Amit Kapila
Дата:
On Fri, Feb 12, 2021 at 10:00 PM <er@xs4all.nl> wrote:
>
> > On 02/12/2021 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers <er@xs4all.nl> wrote:
> > >
> > > Hello,
> > >
> > > I am seeing errors in replication in a test program that I've been running for years with very little change
(since2017, really [1]).
 
>
> Hi,
>
> Here is a test program.  Careful, it deletes stuff.  And it will need some changes:
>

Thanks for sharing the test. I think I have found the problem.
Actually, it was an existing code problem exposed by the commit
ce0fdbfe97. In pgoutput_begin_txn(), we were sometimes sending the
prepare_write ('w') message but then the actual message was not being
sent. This was the case when we didn't found the origin of a txn. This
can happen after that commit because we have now started using origins
for tablesync workers as well and those origins are removed once the
tablesync workers are finished. We might want to change the behavior
related to the origin messages as indicated in the comments but for
now, fixing the existing code.

Can you please test if the attached fixes the problem at your end as well?

-- 
With Regards,
Amit Kapila.

Вложения

Re: logical replication seems broken

От
Erik Rijkers
Дата:
> On 02/13/2021 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Feb 12, 2021 at 10:00 PM <er@xs4all.nl> wrote:
> >
> > > On 02/12/2021 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers <er@xs4all.nl> wrote:
> > > >
> > > > I am seeing errors in replication in a test program that I've been running for years with very little change
(since2017, really [1]).
 
> >
> > Hi,
> >
> > Here is a test program.  Careful, it deletes stuff.  And it will need some changes:
> >
> 
> Thanks for sharing the test. I think I have found the problem.
> Actually, it was an existing code problem exposed by the commit
> ce0fdbfe97. In pgoutput_begin_txn(), we were sometimes sending the
> prepare_write ('w') message but then the actual message was not being
> sent. This was the case when we didn't found the origin of a txn. This
> can happen after that commit because we have now started using origins
> for tablesync workers as well and those origins are removed once the
> tablesync workers are finished. We might want to change the behavior
> related to the origin messages as indicated in the comments but for
> now, fixing the existing code.
> 
> Can you please test if the attached fixes the problem at your end as well?

> [fix_origin_message_1.patch]

I compiled just now a binary from HEAD, and a binary from HEAD+patch

HEAD is still broken; your patch rescues it, so yes, fixed.

Maybe a test (check or check-world) should be added to run a second replica?  (Assuming that would have caught this
bug)


Thanks,

Erik Rijkers
 






> 
> -- 
> With Regards,
> Amit Kapila.



Re: logical replication seems broken

От
vignesh C
Дата:
On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> > On 02/13/2021 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Feb 12, 2021 at 10:00 PM <er@xs4all.nl> wrote:
> > >
> > > > On 02/12/2021 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers <er@xs4all.nl> wrote:
> > > > >
> > > > > I am seeing errors in replication in a test program that I've been running for years with very little change
(since2017, really [1]).
 
> > >
> > > Hi,
> > >
> > > Here is a test program.  Careful, it deletes stuff.  And it will need some changes:
> > >
> >
> > Thanks for sharing the test. I think I have found the problem.
> > Actually, it was an existing code problem exposed by the commit
> > ce0fdbfe97. In pgoutput_begin_txn(), we were sometimes sending the
> > prepare_write ('w') message but then the actual message was not being
> > sent. This was the case when we didn't found the origin of a txn. This
> > can happen after that commit because we have now started using origins
> > for tablesync workers as well and those origins are removed once the
> > tablesync workers are finished. We might want to change the behavior
> > related to the origin messages as indicated in the comments but for
> > now, fixing the existing code.
> >
> > Can you please test if the attached fixes the problem at your end as well?
>
> > [fix_origin_message_1.patch]
>
> I compiled just now a binary from HEAD, and a binary from HEAD+patch
>
> HEAD is still broken; your patch rescues it, so yes, fixed.
>
> Maybe a test (check or check-world) should be added to run a second replica?  (Assuming that would have caught this
bug)
>

+1 for the idea of having a test for this. I have written a test for this.
Thanks for the fix Amit, I could reproduce the issue without your fix
and verified that the issue gets fixed with the patch you shared.
Attached a patch for the same. Thoughts?

Regards,
Vignesh

Вложения

Re: logical replication seems broken

От
Amit Kapila
Дата:
On Mon, Feb 15, 2021 at 11:53 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:
> >
> >
> > I compiled just now a binary from HEAD, and a binary from HEAD+patch
> >
> > HEAD is still broken; your patch rescues it, so yes, fixed.
> >
> > Maybe a test (check or check-world) should be added to run a second replica?  (Assuming that would have caught this
bug)
> >
>
> +1 for the idea of having a test for this. I have written a test for this.
> Thanks for the fix Amit, I could reproduce the issue without your fix
> and verified that the issue gets fixed with the patch you shared.
> Attached a patch for the same. Thoughts?
>

I have slightly modified the comments in the test case to make things
clear. I am planning not to backpatch this because there is no way in
the core code to hit this prior to commit ce0fdbfe97 and we haven't
received any complaints so far. What do you think?

-- 
With Regards,
Amit Kapila.

Вложения

Re: logical replication seems broken

От
vignesh C
Дата:
On Mon, Feb 15, 2021 at 5:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 15, 2021 at 11:53 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:
> > >
> > >
> > > I compiled just now a binary from HEAD, and a binary from HEAD+patch
> > >
> > > HEAD is still broken; your patch rescues it, so yes, fixed.
> > >
> > > Maybe a test (check or check-world) should be added to run a second replica?  (Assuming that would have caught
thisbug)
 
> > >
> >
> > +1 for the idea of having a test for this. I have written a test for this.
> > Thanks for the fix Amit, I could reproduce the issue without your fix
> > and verified that the issue gets fixed with the patch you shared.
> > Attached a patch for the same. Thoughts?
> >
>
> I have slightly modified the comments in the test case to make things
> clear. I am planning not to backpatch this because there is no way in
> the core code to hit this prior to commit ce0fdbfe97 and we haven't
> received any complaints so far. What do you think?

The changes look fine to me.

Regards,
Vignesh



Re: logical replication seems broken

От
er@xs4all.nl
Дата:
> On 2021.02.15. 12:31 Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Feb 15, 2021 at 11:53 AM vignesh C <vignesh21@gmail.com> wrote:
> > On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:
> > > I compiled just now a binary from HEAD, and a binary from HEAD+patch
> > > HEAD is still broken; your patch rescues it, so yes, fixed.
> > > Maybe a test (check or check-world) should be added to run a second replica?  (Assuming that would have caught
thisbug)
 
> > >
> > +1 for the idea of having a test for this. I have written a test for this.
> > Thanks for the fix Amit, I could reproduce the issue without your fix
> > and verified that the issue gets fixed with the patch you shared.
> > Attached a patch for the same. Thoughts?
> >
> 
> I have slightly modified the comments in the test case to make things
> clear. I am planning not to backpatch this because there is no way in
> the core code to hit this prior to commit ce0fdbfe97 and we haven't
> received any complaints so far. What do you think?

My tests indeed run OK with this.

(I haven't tested whether the newly added test actually tests for the problem that was there - I suppose one of you did
that)

Thanks!

Erik Rijkers



Re: logical replication seems broken

От
vignesh C
Дата:
On Mon, Feb 15, 2021 at 6:14 PM <er@xs4all.nl> wrote:
>
>
> > On 2021.02.15. 12:31 Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Mon, Feb 15, 2021 at 11:53 AM vignesh C <vignesh21@gmail.com> wrote:
> > > On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:
> > > > I compiled just now a binary from HEAD, and a binary from HEAD+patch
> > > > HEAD is still broken; your patch rescues it, so yes, fixed.
> > > > Maybe a test (check or check-world) should be added to run a second replica?  (Assuming that would have caught
thisbug)
 
> > > >
> > > +1 for the idea of having a test for this. I have written a test for this.
> > > Thanks for the fix Amit, I could reproduce the issue without your fix
> > > and verified that the issue gets fixed with the patch you shared.
> > > Attached a patch for the same. Thoughts?
> > >
> >
> > I have slightly modified the comments in the test case to make things
> > clear. I am planning not to backpatch this because there is no way in
> > the core code to hit this prior to commit ce0fdbfe97 and we haven't
> > received any complaints so far. What do you think?
>
> My tests indeed run OK with this.
>
> (I haven't tested whether the newly added test actually tests for the problem that was there - I suppose one of you
didthat)
 
>

I could re-create the scenario that you had faced with this test. This
test case is a simplified test of your script, I have removed the use
of pgbench, reduced the number of tables used and simulated the same
problem with the similar replication setup that you had used.

Regards,
Vignesh



Re: logical replication seems broken

От
Amit Kapila
Дата:
On Mon, Feb 15, 2021 at 7:50 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, Feb 15, 2021 at 6:14 PM <er@xs4all.nl> wrote:
> >
> >
> > > On 2021.02.15. 12:31 Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Mon, Feb 15, 2021 at 11:53 AM vignesh C <vignesh21@gmail.com> wrote:
> > > > On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers <er@xs4all.nl> wrote:
> > > > > I compiled just now a binary from HEAD, and a binary from HEAD+patch
> > > > > HEAD is still broken; your patch rescues it, so yes, fixed.
> > > > > Maybe a test (check or check-world) should be added to run a second replica?  (Assuming that would have
caughtthis bug)
 
> > > > >
> > > > +1 for the idea of having a test for this. I have written a test for this.
> > > > Thanks for the fix Amit, I could reproduce the issue without your fix
> > > > and verified that the issue gets fixed with the patch you shared.
> > > > Attached a patch for the same. Thoughts?
> > > >
> > >
> > > I have slightly modified the comments in the test case to make things
> > > clear. I am planning not to backpatch this because there is no way in
> > > the core code to hit this prior to commit ce0fdbfe97 and we haven't
> > > received any complaints so far. What do you think?
> >
> > My tests indeed run OK with this.
> >
> > (I haven't tested whether the newly added test actually tests for the problem that was there - I suppose one of you
didthat)
 
> >
>
> I could re-create the scenario that you had faced with this test. This
> test case is a simplified test of your script, I have removed the use
> of pgbench, reduced the number of tables used and simulated the same
> problem with the similar replication setup that you had used.
>

Yeah, I was also able to see an issue with this test without a patch
and it got fixed after the patch. Pushed!

-- 
With Regards,
Amit Kapila.