Обсуждение: How to debug logical replication error "columns are missing" if they are not

Поиск
Список
Период
Сортировка

How to debug logical replication error "columns are missing" if they are not

От
Thomas Kellerer
Дата:
Hello,

we have a a logical replication from Postgres 11 to Postgres 12.

We did some structural changes to the published tables. The changes to the subscriber where applied about 60 minutes
laterthen those on the publisher. Obviously before the subscriber's tables where synced the replication failed. 

But even though the tables are now completely identical, the subscriber still claims:

    logical replication target relation "public.employee" is missing some replicated columns

I double checked multiple times that the tables are now identical by doing a diff of the "pg_dump -s" outputs - they
areidentical in the CREATE TABLE part. 

The table is listed with state "r" (ready) in pg_subscription_rel

How can I find out which columns it thinks are missing? And how do I convince the subscriber that the tables *are*
identical?

Thanks in advance
Thomas









Re: How to debug logical replication error "columns are missing" if they are not

От
Thomas Kellerer
Дата:
Thomas Kellerer schrieb am 25.08.2020 um 07:32:
> we have a a logical replication from Postgres 11 to Postgres 12.
>
> We did some structural changes to the published tables. The changes
> to the subscriber where applied about 60 minutes later then those on
> the publisher. Obviously before the subscriber's tables where synced
> the replication failed.
>
> But even though the tables are now completely identical, the subscriber still claims:
>
>     logical replication target relation "public.employee" is missing some replicated columns
>
> I double checked multiple times that the tables are now identical by
> doing a diff of the "pg_dump -s" outputs - they are identical in the
> CREATE TABLE part.
>
> The table is listed with state "r" (ready) in pg_subscription_rel
>
> How can I find out which columns it thinks are missing? And how do I
> convince the subscriber that the tables *are* identical?


This is getting stranger:

I dropped the table from the publication.

After refreshing the subscription the error is still shown in the logfile even though that table shouldn't be
replicatedany more. 
Is it possible that the error message mentions an unrelated table?

How can I find out which table really causes the problem?

Thomas




Re: How to debug logical replication error "columns are missing" if they are not

От
Thomas Kellerer
Дата:
Samed YILDIRIM schrieb am 25.08.2020 um 08:09:
> Have you dropped any column during or after main schema changes?
>
> If yes, you should add columns you dropped on publisher side to
> subscriber side. These two tables can be identical now. But, if you
> added a column to publisher, and then dropped it, and you never added
> the column(s) on subscriber side, you would get error. Because, you
> are looking current status of tables, but logs that should be
> consumed and data that should be replicated belong to past.

The change involved dropping two columns and adding a new one.
But the dropped columns were already present on the subscriber

The change essentially merged two date columns into a single daterange column.

So what would have been the correct approach here then?

1. Adding the new column on the subscriber
2. Let replication catch up
3. then drop the old columns on the subscriber

Regards
Thomas




Re: How to debug logical replication error "columns are missing" if they are not

От
Thomas Kellerer
Дата:
Thomas Kellerer schrieb am 25.08.2020 um 07:32:
> we have a a logical replication from Postgres 11 to Postgres 12.
>
> We did some structural changes to the published tables. The changes
> to the subscriber where applied about 60 minutes later then those on
> the publisher. Obviously before the subscriber's tables where synced
> the replication failed.
> But even though the tables are now completely identical, the subscriber still claims:
>   logical replication target relation "public.employee" is missing some replicated columns

So - as explained by Samed - the problem boils down to a setup like this:

The initial table looked like this (on both sides)

    create table test_table (id integer primary key, valid_from date, valid_to date);

Replication is running fine, then we run the following on the publisher

  begin transaction;
    ALTER TABLE test_table ADD COLUMN start_end_date daterange;

    UPDATE test_table SET start_end_date = daterange(valid_from, valid_to, '[]');

    ALTER TABLE test_table ALTER COLUMN start_end_date SET NOT NULL;

    ALTER TABLE test_table
      DROP COLUMN valid_from,
      DROP COLUMN valid_to;
  commit;

If the UPDATE part is removed from the change, everything works smoothly.

So, the UPDATE sends the complete modified row including the to be dropped columns to the subscriber.
As I ran the same steps on the subscriber, the columns were dropped on the subscriber before the UPDATE could be
replayedand thus it kept failing. 

I do understand now why this happens, and that it is a limitation of the current implementation.


However, what I don't understand is, why removing the table from the replication doesn't fix this.
It seems, that if the table is re-added later, the old WAL segments are still considered valid and Postgres tries to
replaythem. 

Which seems a rather strange thing to do to me.
Could anyone enlighten me on that?
Is there a way to mark the no longer needed WAL segments as obsolete?


Regards
Thomas