Discussion: Apply worker fails if a relation is missing on subscriber even if refresh publication has not been refreshed yet

Hi hackers,

I noticed a behaviour of logical replication that seems unexpected to me, although I am not totally sure.

Let's say a new table is created and added to a publication, but not yet created on the subscriber. Also, "ALTER SUBSCRIPTION ... REFRESH PUBLICATION" has not been called yet.
What I expect in that case is that logical replication continues to work as it did before the new table was created. The new table does not get replicated until "REFRESH PUBLICATION", as stated here [1].
This is indeed how it seems to work, until we insert a row into the new table.

After a new row is inserted into the new table, the apply worker receives this change and tries to apply it. As expected, it fails since the table does not exist on the subscriber yet. The worker then keeps crashing and can't apply any changes for any table.
The obvious way to resolve this is to create the table on the subscriber as well. After that, the apply worker gets back to work, skips the changes for the new table, and moves on to other changes.
Since REFRESH PUBLICATION has not been called yet, no change for the new table will be replicated.
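
For readers who want to try the scenario described above, here is a minimal sketch of the publisher-side steps. The names pub, sub and t_new are hypothetical, and the sketch assumes a subscription sub to publication pub is already replicating other tables.

```sql
-- Run on the PUBLISHER. Assumes publication "pub" already exists and
-- a subscription "sub" on another node is replicating from it
-- (all names here are hypothetical).

-- 1. Create a new table and add it to the existing publication.
CREATE TABLE t_new (id int PRIMARY KEY);
ALTER PUBLICATION pub ADD TABLE t_new;

-- 2. Up to this point the subscriber is reported to keep working.
--    The first change to the new table is what triggers the problem.
INSERT INTO t_new VALUES (1);

-- On the SUBSCRIBER, t_new has not been created and
-- ALTER SUBSCRIPTION sub REFRESH PUBLICATION has not been run.
-- As described above, the apply worker then fails repeatedly because
-- the target table does not exist.
```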

If replication of the new table will not start anyway (until REFRESH PUBLICATION), do we really need that table on the subscriber for the apply worker to work?
AFAIU, a change to the publication does not affect the logical replication setup until the publication is refreshed on the subscriber. If this understanding is correct, then the apply worker should be able to run without needing the new tables.
What do you think?

Also, if you agree, the attached patch attempts to fix this issue.
It relies on the info from pg_subscription_rel, so that the apply worker only applies changes for relations that exist in pg_subscription_rel.
Since new tables won't be in there until the next REFRESH PUBLICATION, missing those tables won't be a problem for existing subscriptions.
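
To see which relations a subscription currently tracks, pg_subscription_rel can be queried on the subscriber. The query below is only an illustrative sketch and is not part of the attached patch; a table added to the publication after the last refresh would be expected to be absent from its output.

```sql
-- Run on the SUBSCRIBER: list the relations each subscription tracks.
-- srsubstate is a one-letter sync state code (for example 'r' = ready).
SELECT s.subname,
       sr.srrelid::regclass AS relation,
       sr.srsubstate
FROM pg_subscription_rel AS sr
JOIN pg_subscription AS s ON s.oid = sr.srsubid
ORDER BY s.subname, relation;
```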


Thanks,
--
Melih Mutlu
Microsoft
On Thu, Dec 22, 2022 at 7:16 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
>
> Hi hackers,
>
> I noticed a behaviour of logical replication that seems unexpected to me, although I am not totally sure.
>
> Let's say a new table is created and added to a publication, but not yet created on the subscriber. Also, "ALTER SUBSCRIPTION ... REFRESH PUBLICATION" has not been called yet.
> What I expect in that case is that logical replication continues to work as it did before the new table was created. The new table does not get replicated until "REFRESH PUBLICATION", as stated here [1].
> This is indeed how it seems to work, until we insert a row into the new table.
>
> After a new row is inserted into the new table, the apply worker receives this change and tries to apply it. As expected, it fails since the table does not exist on the subscriber yet. The worker then keeps crashing and can't apply any changes for any table.
> The obvious way to resolve this is to create the table on the subscriber as well. After that, the apply worker gets back to work, skips the changes for the new table, and moves on to other changes.
> Since REFRESH PUBLICATION has not been called yet, no change for the new table will be replicated.
>
> If replication of the new table will not start anyway (until REFRESH PUBLICATION), do we really need that table on the subscriber for the apply worker to work?
> AFAIU, a change to the publication does not affect the logical replication setup until the publication is refreshed on the subscriber.
>

I also have the same understanding but I think if we skip replicating
some table due to the reason that the corresponding publication has
not been refreshed then it is better to LOG that information instead
of silently skipping it. Along similar lines, personally, I don't see
a very strong reason to not throw the ERROR in the case you mentioned.
Do you have any use case in mind where the user has added a table to
the publication even though she doesn't want it to be replicated? One
thing that came to my mind is that due to some reason after adding a
table to the publication, there is some delay in creating the table on
the subscriber and then refreshing the publication and during that
time user expects replication to proceed smoothly. But for that isn't
it better that the user completes the setup on the subscriber before
performing operations on such a table? Because say there is some error
in the subscriber-side setup that the user misses then it would be a
surprise for a user to not see the table data. In such a case, an
ERROR/LOG information could be helpful for users.

-- 
With Regards,
Amit Kapila.



Hi Amit,

On Fri, Dec 23, 2022 at 09:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I also have the same understanding but I think if we skip replicating
> some table due to the reason that the corresponding publication has
> not been refreshed then it is better to LOG that information instead
> of silently skipping it.
 
By skipping it, I mean that the apply worker does not try to do anything with the changes for the missing table, since the worker simply cannot apply them and can only fail.
But I agree with you about logging it. The patch currently logs such cases as warnings instead of errors.
I can make it LOG instead of WARNING; I just wanted to use something other than ERROR.

> Do you have any use case in mind where the user has added a table to
> the publication even though she doesn't want it to be replicated? One
> thing that came to my mind is that due to some reason after adding a
> table to the publication, there is some delay in creating the table on
> the subscriber and then refreshing the publication and during that
> time user expects replication to proceed smoothly. But for that isn't
> it better that the user completes the setup on the subscriber before
> performing operations on such a table? Because say there is some error
> in the subscriber-side setup that the user misses then it would be a
> surprise for a user to not see the table data. In such a case, an
> ERROR/LOG information could be helpful for users.

I don't really have a specific use case for this. The delay between creating a table on the publisher and then on the subscriber is usually not long enough to hurt anything. It just seems unnecessary to me that the apply worker goes into a failure loop until someone creates the table on the subscriber, even though the table will not be replicated immediately.


Users also shouldn't expect such tables to be replicated if they have not refreshed the publication. That will not happen with or without this change, so I don't think it would be a surprise when they see that their new table has not been replicated yet. The issue will also be visible in the logs, just not as an error.
And if users decide/remember to refresh the publication, they cannot do that anyway while the table is still missing on the subscriber: the REFRESH PUBLICATION command will fail, and users will then see an error in the log.
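
As a rough illustration of the ordering described above, the subscriber-side steps could look like the sketch below. The names t_new and sub are hypothetical, and the table definition is assumed to match the one on the publisher.

```sql
-- Run on the SUBSCRIBER (hypothetical names).

-- 1. Create the table first. Running the refresh before this step
--    is what fails with an error, as described above.
CREATE TABLE t_new (id int PRIMARY KEY);

-- 2. Only now pick up the tables newly added to the publication.
ALTER SUBSCRIPTION sub REFRESH PUBLICATION;
```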
  
Best,
--
Melih Mutlu
Microsoft
On Mon, Dec 26, 2022 at 3:41 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
>
>> Do you have any use case in mind where the user has added a table to
>> the publication even though she doesn't want it to be replicated? One
>> thing that came to my mind is that due to some reason after adding a
>> table to the publication, there is some delay in creating the table on
>> the subscriber and then refreshing the publication and during that
>> time user expects replication to proceed smoothly. But for that isn't
>> it better that the user completes the setup on the subscriber before
>> performing operations on such a table? Because say there is some error
>> in the subscriber-side setup that the user misses then it would be a
>> surprise for a user to not see the table data. In such a case, an
>> ERROR/LOG information could be helpful for users.
>
>
> I don't really have a specific use case for this. The delay between creating a table on the publisher and then on the subscriber is usually not long enough to hurt anything. It just seems unnecessary to me that the apply worker goes into a failure loop until someone creates the table on the subscriber, even though the table will not be replicated immediately.
>

To avoid the failure loop, users can use the disable_on_error
subscription parameter. I see your point, but I am not sure it is
worth changing the current behavior without a specific use case that
we want to address with this change.
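
For reference, a minimal sketch of using that parameter on an existing subscription follows. The subscription name sub is hypothetical, and the sketch assumes a PostgreSQL version that supports disable_on_error (15 or later).

```sql
-- Run on the SUBSCRIBER (hypothetical subscription name "sub").

-- Disable the subscription on an apply error instead of letting the
-- worker restart and fail again in a loop.
ALTER SUBSCRIPTION sub SET (disable_on_error = true);

-- Check whether the subscription is currently enabled.
SELECT subname, subenabled, subdisableonerr
FROM pg_subscription
WHERE subname = 'sub';

-- After fixing the cause (for example, creating the missing table),
-- re-enable the subscription manually.
ALTER SUBSCRIPTION sub ENABLE;
```

With this setting the failure is still reported once as an error, but the subscription then stays disabled until someone intervenes, so changes for other tables are not applied in the meantime either.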

--
With Regards,
Amit Kapila.