Обсуждение: Logical replication from 11.x to 12.x and "unique key violations"

Поиск
Список
Период
Сортировка

Logical replication from 11.x to 12.x and "unique key violations"

От
Thomas Kellerer
Дата:
Hello,

I have a strange error when using logical replication between a 11.2 (I know!) source database and a 12.3 target.

If I create the publication with all needed tables (about 50) at once, I get "duplicate key value violates unique
constraintxxx_pkey" errors during the initial replication (when creating the subscription). 

When create the publication only with a few tables, the initial data sync works without problems.
To replicate all tables, I add the tables incrementally to the publication, and refresh the subscription.

If I do it like that (step-by-step) everything works fine. Tables that generated the "duplicate key value" error
previouslywill replicate just fine. The tables are quite small, some of them less then 100 rows. 

The table definitions and primary keys are 100% identical between the two systems.

However all tables where this error occurs, have a varchar(100) column as part of the PK definition (don't ask).

The values in the PK columns are 7-bit ASCII only. Both servers are running CentOS 7.x and use UTF-8 as the encoding.

If this was a glibc version issue between the two operating systems, then I would expect the failure to be consistent,
notdepending on the amount of tables for the initial replication. 

I have seen that there were several bugfixes between 11.2 and 11.8 regarding logical replication (including some index
relatedfixes). But I am unsure if they related to 11.2 being the publisher or subscriber 

Could this indeed be caused by the outdated 11.2 on the publisher side?
Is there anything else I could check to identify the root cause?

As I can replicate the tables without problems, this isn't a big issue for me right now, but I would like to know what
theroot cause is. 

Thanks
Thomas



Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Thomas Kellerer
Дата:
> I have a strange error when using logical replication between a 11.2
> source database and a 12.3 target.
>
> If I create the publication with all needed tables (about 50) at
> once, I get "duplicate key value violates unique constraint xxx_pkey"
> errors during the initial replication (when creating the
> subscription).
>
> When create the publication only with a few tables, the initial data
> sync works without problems. To replicate all tables, I add the
> tables incrementally to the publication, and refresh the
> subscription.
>
> If I do it like that (step-by-step) everything works fine. Tables
> that generated the "duplicate key value" error previously will
> replicate just fine. The tables are quite small, some of them less
> then 100 rows.
>

Any pointers where I should start looking to investigate this?




Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Adrian Klaver
Дата:
On 7/20/20 7:22 AM, Thomas Kellerer wrote:
>> I have a strange error when using logical replication between a 11.2
>> source database and a 12.3 target.
>>
>> If I create the publication with all needed tables (about 50) at
>> once, I get "duplicate key value violates unique constraint xxx_pkey"
>> errors during the initial replication (when creating the
>> subscription).
>>
>> When create the publication only with a few tables, the initial data
>> sync works without problems. To replicate all tables, I add the
>> tables incrementally to the publication, and refresh the
>> subscription.
>>
>> If I do it like that (step-by-step) everything works fine. Tables
>> that generated the "duplicate key value" error previously will
>> replicate just fine. The tables are quite small, some of them less
>> then 100 rows.
>>
> 
> Any pointers where I should start looking to investigate this?

What are the PUBLICATION and SUBSCRIPTION commands being used?

Where is "xxx_pkey" coming from, e.g. sequence?

Where are source and target relative to each other in network/world?

Are there any other errors in log at around the same time that might apply?

Have you looked at 
https://www.postgresql.org/docs/12/logical-replication-architecture.html#LOGICAL-REPLICATION-SNAPSHOT?:

30.5.1. Initial Snapshot

Are the tables heavily used when the subscription is invoked?


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Tom Lane
Дата:
Thomas Kellerer <shammat@gmx.net> writes:
>> I have a strange error when using logical replication between a 11.2
>> source database and a 12.3 target.
>> 
>> If I create the publication with all needed tables (about 50) at
>> once, I get "duplicate key value violates unique constraint xxx_pkey"
>> errors during the initial replication (when creating the
>> subscription).
>> 
>> When create the publication only with a few tables, the initial data
>> sync works without problems. To replicate all tables, I add the
>> tables incrementally to the publication, and refresh the
>> subscription.
>> 
>> If I do it like that (step-by-step) everything works fine. Tables
>> that generated the "duplicate key value" error previously will
>> replicate just fine. The tables are quite small, some of them less
>> then 100 rows.

I have not looked at the code, but it wouldn't surprise me if the initial
replication just copies all the specified tables in some random order.
If there are FK references involved, the replication would have to be
done with referenced tables first, and I bet there's no logic for that.
(Even if there was, it could not cope with circular references or
self-references.)

Best bet might be to not install the subscriber's foreign key
constraints till after the initial sync is done.

            regards, tom lane



Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Adrian Klaver
Дата:
On 7/20/20 10:47 AM, Tom Lane wrote:
> Thomas Kellerer <shammat@gmx.net> writes:
>>> I have a strange error when using logical replication between a 11.2
>>> source database and a 12.3 target.
>>>
>>> If I create the publication with all needed tables (about 50) at
>>> once, I get "duplicate key value violates unique constraint xxx_pkey"
>>> errors during the initial replication (when creating the
>>> subscription).
>>>
>>> When create the publication only with a few tables, the initial data
>>> sync works without problems. To replicate all tables, I add the
>>> tables incrementally to the publication, and refresh the
>>> subscription.
>>>
>>> If I do it like that (step-by-step) everything works fine. Tables
>>> that generated the "duplicate key value" error previously will
>>> replicate just fine. The tables are quite small, some of them less
>>> then 100 rows.
> 
> I have not looked at the code, but it wouldn't surprise me if the initial
> replication just copies all the specified tables in some random order.
> If there are FK references involved, the replication would have to be
> done with referenced tables first, and I bet there's no logic for that.
> (Even if there was, it could not cope with circular references or
> self-references.)
> 
> Best bet might be to not install the subscriber's foreign key
> constraints till after the initial sync is done.

I'm probably missing something, but would that not result in a 'key not 
found' type of error. The OP is seeing "duplicate key value violates 
unique constraint xxx_pkey". To me that indicates a doubling up of at 
least some of the data replication.

> 
>             regards, tom lane
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Tom Lane
Дата:
Adrian Klaver <adrian.klaver@aklaver.com> writes:
> I'm probably missing something, but would that not result in a 'key not 
> found' type of error. The OP is seeing "duplicate key value violates 
> unique constraint xxx_pkey". To me that indicates a doubling up of at 
> least some of the data replication.

[ squint... ]  Sorry, I completely misread the error condition.

Yeah, duplicate keys does seem odd here.  Can you provide a self
contained example?

            regards, tom lane



Re: Logical replication from 11.x to 12.x and "unique key violations"

От
John Ashmead
Дата:
I have had this problem with logical replication on PG 10 repeatedly. In a clean build no problem.

But if I am restarting replication because of some problem I’ve seen problems with rows already present.

My own fix, which has worked in my shop, is to add replica triggers to check for the row being already present.   If it is, they drop the row on the floor.  This lets stuff come in in whatever order it happens to come in.  

Sample code:

if TG_OP = ‘INSERT’ then
select id into id1 from table1 where id = new.id;
if id1 is not null then
— log error for analysis
return null;
end if;
end if;

In an ideal world, this would probably not be necessary.  But it can be tricky to restart replication in an absolutely clean way across all tables and in this case it can be better to allow for a bit of overlap in the rows being replicated. 

FWIW,

John

On Jul 20, 2020, at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Kellerer <shammat@gmx.net> writes:
I have a strange error when using logical replication between a 11.2
source database and a 12.3 target.

If I create the publication with all needed tables (about 50) at
once, I get "duplicate key value violates unique constraint xxx_pkey"
errors during the initial replication (when creating the
subscription).

When create the publication only with a few tables, the initial data
sync works without problems. To replicate all tables, I add the
tables incrementally to the publication, and refresh the
subscription.

If I do it like that (step-by-step) everything works fine. Tables
that generated the "duplicate key value" error previously will
replicate just fine. The tables are quite small, some of them less
then 100 rows.

I have not looked at the code, but it wouldn't surprise me if the initial
replication just copies all the specified tables in some random order.
If there are FK references involved, the replication would have to be
done with referenced tables first, and I bet there's no logic for that.
(Even if there was, it could not cope with circular references or
self-references.)

Best bet might be to not install the subscriber's foreign key
constraints till after the initial sync is done.

regards, tom lane




John Ashmead
139 Montrose Avenue
Rosemont, PA, 19010-1508
(610) 527 9560 
mobile (610) 247 2323






Вложения

Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Thomas Kellerer
Дата:
Adrian Klaver schrieb am 20.07.2020 um 16:45:
> On 7/20/20 7:22 AM, Thomas Kellerer wrote:
>>> I have a strange error when using logical replication between a 11.2
>>> source database and a 12.3 target.
>>>
>>> If I create the publication with all needed tables (about 50) at
>>> once, I get "duplicate key value violates unique constraint xxx_pkey"
>>> errors during the initial replication (when creating the
>>> subscription).
>>>
>>> When create the publication only with a few tables, the initial data
>>> sync works without problems. To replicate all tables, I add the
>>> tables incrementally to the publication, and refresh the
>>> subscription.
>>>
>>> If I do it like that (step-by-step) everything works fine. Tables
>>> that generated the "duplicate key value" error previously will
>>> replicate just fine. The tables are quite small, some of them less
>>> then 100 rows.
>>>
>>
>> Any pointers where I should start looking to investigate this?
>
> What are the PUBLICATION and SUBSCRIPTION commands being used?

Pretty basic:

  create publication some_publication
     for table table1, table2, table3, ...., table50;

  create subscription foo
    connection '...'
    publication some_publication;

The approach that worked:

* create the publication with the first 10 tables
* create the subscription, wait until the initial sync is finished
* alter the publication add the next 10 tables
* refresh the subscription, wait until the sync is finished - repeat until all tables are replicated.

Note that I did not pay attention to foreign key relationships when adding the tables
(I essentially added them in alphabetical order). I first thought that was going to
be a problem, but it seems Postgres can cope with that.

Once the initial sync was through (about a week now), no further problems came up.
It's been running smoothly since then


> Where is "xxx_pkey" coming from, e.g. sequence?

No, as mentioned, those are varchar(20) columns.
The values are generated by the application (no default value defined for the column)

> Where are source and target relative to each other in network/world?

Same datacenter.


> Are there any other errors in log at around the same time that might apply?

No.

> Are the tables heavily used when the subscription is invoked?

Used, but not "heavily" (a few inserts per minute).

And I think the tables where the errors happened, weren't written to at all.
Those were rather small lookup tables (a few hundred rows at most)

There are no triggers on the tables where the errors happened in case that's important.






Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Thomas Kellerer
Дата:
Tom Lane schrieb am 20.07.2020 um 20:04:
> Adrian Klaver <adrian.klaver@aklaver.com> writes:
>> I'm probably missing something, but would that not result in a 'key not
>> found' type of error. The OP is seeing "duplicate key value violates
>> unique constraint xxx_pkey". To me that indicates a doubling up of at
>> least some of the data replication.
>
> [ squint... ]  Sorry, I completely misread the error condition.
>
> Yeah, duplicate keys does seem odd here.  Can you provide a self
> contained example?

I'll try, but this is a production system.
Extracting the necessary anonymous data will be tricky.

Is there any chance the version difference might cause this?
And a slightly outdated 11.x at that?

Thomas



Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Thomas Kellerer
Дата:
John Ashmead schrieb am 20.07.2020 um 20:11:
> I have had this problem with logical replication on PG 10 repeatedly.
> In a clean build no problem.
>
> But if I am restarting replication because of some problem I’ve seen
> problems with rows already present.
>
> My own fix, which has worked in my shop, is to add replica triggers
> to check for the row being already present.   If it is, they drop the
> row on the floor.  This lets stuff come in in whatever order it
> happens to come in.
>
> Sample code:
>
> if TG_OP = ‘INSERT’ then
> select id into id1 from table1 where id = new.id <http://new.id>;
> if id1 is not null then
> — log error for analysis
> return null;
> end if;
> end if;
>
> In an ideal world, this would probably not be necessary.  But it can
> be tricky to restart replication in an absolutely clean way across
> all tables and in this case it can be better to allow for a bit of
> overlap in the rows being replicated.
>

Interesting idea, I'll keep this in mind.

For now everything works fine, so there is no pressing need.

Thomas





Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Tom Lane
Дата:
Thomas Kellerer <shammat@gmx.net> writes:
> Tom Lane schrieb am 20.07.2020 um 20:04:
>> Yeah, duplicate keys does seem odd here.  Can you provide a self
>> contained example?

> I'll try, but this is a production system.
> Extracting the necessary anonymous data will be tricky.

If this is a PG bug, it should be possible to reproduce it with
completely random/generated data.  The key ingredient that you
have and the rest of us don't is the process and timing by which
the primary key values are introduced.

> Is there any chance the version difference might cause this?
> And a slightly outdated 11.x at that?

Hmmm ... I do not recall any recent bug fixes that seem to match
this symptom, but replication isn't a part of the code that
I'm the world's best expert on.  In any case, we do offer as
standard advice that you should reproduce a problem on the latest
minor release before filing a bug report.

            regards, tom lane



Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Thomas Kellerer
Дата:
Tom Lane schrieb am 21.07.2020 um 09:39:
> In any case, we do offer as standard advice that you should reproduce
> a problem on the latest minor release before filing a bug report.

I know ;)

I already told the "powers to be" and it's being addressed
(I also went through the 11.x release notes, but nothing jumped out)


I will try if I can create a stand-alone test, but I suspect some timing
problem which will be hard to replicate (no pun intended)



Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Adrian Klaver
Дата:
On 7/20/20 10:42 PM, Thomas Kellerer wrote:
> Adrian Klaver schrieb am 20.07.2020 um 16:45:
>> On 7/20/20 7:22 AM, Thomas Kellerer wrote:
>>>> I have a strange error when using logical replication between a 11.2


> 
>> Where is "xxx_pkey" coming from, e.g. sequence?
> 
> No, as mentioned, those are varchar(20) columns.
> The values are generated by the application (no default value defined for the column)

Aah I see my mistake I was going off your follow up question not the 
original post. In that original post though you had the PK containing a 
varchar(100) column. Can we see the table schema and the PK definition 
for at least one of the tables that threw an error?




-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Thomas Kellerer
Дата:
Adrian Klaver schrieb am 21.07.2020 um 17:07:
>> No, as mentioned, those are varchar(20) columns.
>> The values are generated by the application (no default value defined for the column)
>
> Aah I see my mistake I was going off your follow up question not the
> original post. In that original post though you had the PK containing
> a varchar(100) column. Can we see the table schema and the PK
> definition for at least one of the tables that threw an error?
>

Sorry about the confusion, some PKs are indeed defined as varchar(100) some as varchar(20) and some as varchar(15)
And I was also wrong about the generation, there is indeed a default value defined using a self-written ID generation
function.
But during replication, that function isn't called, so it shouldn't matter, I guess.

Here are two examples of failing tables:

    CREATE TABLE IF NOT EXISTS emp_status
    (
       emp_status_id   varchar(15)   DEFAULT generate_id('EA') NOT NULL PRIMARY KEY,
       status_name     varchar(20)   NOT NULL UNIQUE
    );

    CREATE TABLE IF NOT EXISTS company
    (
       comp_id     varchar(15)   DEFAULT generate_id('CO') NOT NULL PRIMARY KEY,
       name        varchar(50)   NOT NULL UNIQUE,
       country     varchar(50)   NOT NULL,
       code        varchar(20)   NOT NULL
    );

Both tables only contain only a few rows (less than 10) and e.g. for the status lookup, the log entry was:

LOG:  logical replication table synchronization worker for subscription "foo", table "emp_status" has started
ERROR:  duplicate key value violates unique constraint "emp_status_pkey"
DETAIL:  Key (employee_available_status_id)=(BUJ4XFZ7ATY27EA) already exists.
CONTEXT:  COPY employee_available_status, line 1

Thomas



Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Thomas Kellerer
Дата:
Thomas Kellerer schrieb am 13.07.2020 um 11:52:
> If I create the publication with all needed tables (about 50) at
> once, I get "duplicate key value violates unique constraint xxx_pkey"
> errors during the initial replication (when creating the
> subscription).

Turns out the problem was - once again - between the keyboard and the chair.

The script/job to initialize the database structure (to make sure primary and secondary are in sync) also includes
statementsto setup the lookup tables. So obviously the initial sync would fail for those tables. When I re-started
replicationI truncated all tables without re-running the setup script. 

I wonder if it makes sense to add an "on conflict (pk) do nothing" option to the logical replication features ;)

Sorry for the noise
Thomas




Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Adrian Klaver
Дата:
On 7/21/20 10:36 PM, Thomas Kellerer wrote:
> Adrian Klaver schrieb am 21.07.2020 um 17:07:
>>> No, as mentioned, those are varchar(20) columns.
>>> The values are generated by the application (no default value defined for the column)
>>
>> Aah I see my mistake I was going off your follow up question not the
>> original post. In that original post though you had the PK containing
>> a varchar(100) column. Can we see the table schema and the PK
>> definition for at least one of the tables that threw an error?
>>
> 
> Sorry about the confusion, some PKs are indeed defined as varchar(100) some as varchar(20) and some as varchar(15)
> And I was also wrong about the generation, there is indeed a default value defined using a self-written ID generation
function.
> But during replication, that function isn't called, so it shouldn't matter, I guess.
> 
> Here are two examples of failing tables:
> 
>      CREATE TABLE IF NOT EXISTS emp_status
>      (
>         emp_status_id   varchar(15)   DEFAULT generate_id('EA') NOT NULL PRIMARY KEY,
>         status_name     varchar(20)   NOT NULL UNIQUE
>      );
> 
>      CREATE TABLE IF NOT EXISTS company
>      (
>         comp_id     varchar(15)   DEFAULT generate_id('CO') NOT NULL PRIMARY KEY,
>         name        varchar(50)   NOT NULL UNIQUE,
>         country     varchar(50)   NOT NULL,
>         code        varchar(20)   NOT NULL
>      );
> 
> Both tables only contain only a few rows (less than 10) and e.g. for the status lookup, the log entry was:
> 
> LOG:  logical replication table synchronization worker for subscription "foo", table "emp_status" has started
> ERROR:  duplicate key value violates unique constraint "emp_status_pkey"
> DETAIL:  Key (employee_available_status_id)=(BUJ4XFZ7ATY27EA) already exists.
> CONTEXT:  COPY employee_available_status, line 1

I know your subsequent post explained the problem. Still I'm not fully 
understanding the above. I understand the "duplicate key value violates 
unique constraint "emp_status_pkey" part. What I don't understand is 
where employee_available_status_id and employee_available_status are 
coming from? Or is this a copy/paste issue?

> 
> Thomas
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Logical replication from 11.x to 12.x and "unique key violations"

От
Thomas Kellerer
Дата:
Adrian Klaver schrieb am 22.07.2020 um 16:42:
> Or is this a copy/paste issue?

That.