Re: BUG #18433: Logical replication timeout

From: Kostiantyn Tomakh
Subject: Re: BUG #18433: Logical replication timeout
Date:
Msg-id: CAJP09w7sJEkoO2=DUi-BS_JJpv7dCDcj+kb7V5YCiW7d1Y2KZQ@mail.gmail.com
In reply to: Re: BUG #18433: Logical replication timeout  (Shlok Kyal <shlok.kyal.oss@gmail.com>)
Responses: Re: BUG #18433: Logical replication timeout  (Bruce Momjian <bruce@momjian.us>)
           Re: BUG #18433: Logical replication timeout  (Shlok Kyal <shlok.kyal.oss@gmail.com>)
List: pgsql-bugs
Hello, Shlok Kyal.

I was able to reproduce the problem.
I did it on a Docker-based platform, so I hope you will be able to reproduce this problem too.
1) I ran the docker-compose file with PostgreSQL 10:
docker-compose.yml:
version: '3.3'
services:
  postgres:
    image: postgres:10.21
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5432:5432"
  postgres2:
    image: postgres:10.21
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data2:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5433:5432"
postgresql.conf:
hba_file = '/etc/pg_hba.conf'
listen_addresses = '*'
wal_level = logical
pg_hba.conf:
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host all all 0.0.0.0/0 md5
So I had two PostgreSQL servers.
2) I created a table on the "postgres" server:
CREATE TABLE public.table (
field_1 uuid NOT NULL,
field_2 uuid NOT NULL,
field_3 uuid NOT NULL,
field_4 text NOT NULL,
field_5 text NOT NULL,
field_6 timestamp with time zone NOT NULL,
field_7 timestamp with time zone NOT NULL,
field_8 smallint NOT NULL,
field_9 text NOT NULL,
field_10 text NOT NULL,
field_11 uuid NOT NULL,
field_12 uuid,
field_13 text,
field_14 uuid,
field_15 text NOT NULL,
field_16 integer,
field_17 text NOT NULL,
field_18 uuid,
field_19 character varying(20),
field_20 character varying(29),
field_21 text,
field_22 bigint,
field_23 character varying(20)
);

ALTER TABLE ONLY public.table
ADD CONSTRAINT "PK_table" PRIMARY KEY (field_1);
3) Created a publication on "postgres":
CREATE ROLE replication_role WITH REPLICATION LOGIN PASSWORD 'password';

GRANT ALL PRIVILEGES ON DATABASE "db" TO replication_role;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO replication_role;
CREATE PUBLICATION db_name_public_publication;
ALTER PUBLICATION db_name_public_publication ADD TABLE "table";
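To double-check that the table actually made it into the publication, one can query the pg_publication_tables view on the publisher (a quick sanity check I would suggest; the view exists in PostgreSQL 10 and later):

-- Run on the publisher: list the tables covered by the publication.
SELECT pubname, schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'db_name_public_publication';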
4) I started two Python scripts that generated requests to "postgres":
insert.py:
import psycopg2
from faker import Faker
import uuid
import random
from datetime import datetime, timedelta

# Connect to your PostgreSQL database
conn = psycopg2.connect(
    dbname="db",
    user="user",
    password="password",
    host="127.0.0.1",
    port="5432"
)
cur = conn.cursor()

# Function to generate fake data for one row
def generate_fake_row():
    fake = Faker()
    row = (
        str(uuid.uuid4()),  # field_1 (UUID as string)
        str(uuid.uuid4()),  # field_2 (UUID as string)
        str(uuid.uuid4()),  # field_3 (UUID as string)
        fake.text(),  # field_4
        fake.text(),  # field_5
        fake.date_time_between(start_date='-1y', end_date='now', tzinfo=None),  # field_6
        fake.date_time_between(start_date='-1y', end_date='now', tzinfo=None),  # field_7
        random.randint(0, 100),  # field_8
        fake.text(),  # field_9
        fake.text(),  # field_10
        str(uuid.uuid4()),  # field_11 (UUID as string)
        str(uuid.uuid4()) if random.random() > 0.5 else None,  # field_12 (UUID as string)
        fake.text() if random.random() > 0.5 else None,  # field_13
        str(uuid.uuid4()) if random.random() > 0.5 else None,  # field_14 (UUID as string)
        fake.text(),  # field_15
        random.randint(0, 1000),  # field_16
        fake.text(),  # field_17
        str(uuid.uuid4()) if random.random() > 0.5 else None,  # field_18 (UUID as string)
        fake.random_letter() * 20,  # field_19
        fake.random_letter() * 29,  # field_20
        fake.text() if random.random() > 0.5 else None,  # field_21
        random.randint(0, 1000000),  # field_22
        fake.random_letter() * 20  # field_23
    )
    return row

# Function to insert one row of fake data into the table
def insert_fake_row(row):
    sql = """INSERT INTO public.table (
        field_1, field_2, field_3, field_4, field_5, field_6, field_7, field_8,
        field_9, field_10, field_11, field_12, field_13, field_14, field_15, field_16,
        field_17, field_18, field_19, field_20, field_21, field_22, field_23
    ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
    cur.execute(sql, row)
    conn.commit()

# Insert one row at a time in an independent loop
while True:
    fake_row = generate_fake_row()  # Generate fake data for one row
    insert_fake_row(fake_row)  # Insert the row

# Close the cursor and the connection (won't be executed in an infinite loop)
cur.close()
conn.close()
update.py:
import psycopg2
import random
import time

# Connect to your PostgreSQL database
conn = psycopg2.connect(
    dbname="db",
    user="user",
    password="password",
    host="127.0.0.1",
    port="5432"
)
cur = conn.cursor()

# Function to update any row in the table
def update_random_row():
    # Generate a random value for field_8 and field_16
    new_field_8 = random.randint(0, 100)
    new_field_16 = random.randint(0, 1000)

    # Choose a random row to update
    cur.execute("SELECT field_1 FROM public.table;")
    rows = cur.fetchall()
    random_row = random.choice(rows)[0]

    # Update the chosen row with the new values
    cur.execute("UPDATE public.table SET field_8 = %s, field_16 = %s WHERE field_1 = %s;",
                (new_field_8, new_field_16, random_row))
    conn.commit()

# Infinite loop to continuously update rows
while True:
    update_random_row()
    time.sleep(1)  # Wait for 1 second before the next update

# Close the cursor and the connection (won't be executed in an infinite loop)
cur.close()
conn.close()
5) Created a subscription on postgres2:
CREATE SUBSCRIPTION db_name_public_subscription CONNECTION 'host=postgres port=5432 password=password user=replication_role dbname=db' PUBLICATION db_name_public_publication;

6) Then I got these messages in the logs:
10-postgres2-1  | 2024-05-01 16:34:16.296 GMT [38] LOG:  logical replication apply worker for subscription "db_name_public_subscription" has started
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] LOG: starting logical decoding for slot "db_name_public_subscription"
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] DETAIL: streaming transactions committing after 0/1B5FFC8, reading WAL from 0/1B5FF90
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] LOG: logical decoding found consistent point at 0/1B5FF90
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] DETAIL: There are no running transactions.
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
10-postgres2-1 | 2024-05-01 16:34:16.302 GMT [39] LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
10-postgres-1 | 2024-05-01 16:34:16.366 GMT [40] LOG: logical decoding found consistent point at 0/1B610D8
10-postgres-1 | 2024-05-01 16:34:16.366 GMT [40] DETAIL: There are no running transactions.
10-postgres-1 | 2024-05-01 16:34:16.366 GMT [40] STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription_16391_sync_16385" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] LOG: starting logical decoding for slot "db_name_public_subscription_16391_sync_16385"
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] DETAIL: streaming transactions committing after 0/1B61110, reading WAL from 0/1B610D8
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_16391_sync_16385" LOGICAL 0/1B61110 (proto_version '1', publication_names '"db_name_public_publication"')
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] LOG: logical decoding found consistent point at 0/1B610D8
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] DETAIL: There are no running transactions.
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_16391_sync_16385" LOGICAL 0/1B61110 (proto_version '1', publication_names '"db_name_public_publication"')
10-postgres2-1 | 2024-05-01 16:34:16.479 GMT [39] LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has finished
10-postgres2-1 | 2024-05-01 16:34:16.782 GMT [38] ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL
10-postgres2-1 | 2024-05-01 16:34:16.783 GMT [1] LOG: worker process: logical replication worker for subscription 16391 (PID 38) exited with exit code 1
10-postgres2-1 | 2024-05-01 16:34:21.788 GMT [40] LOG: logical replication apply worker for subscription "db_name_public_subscription" has started
10-postgres-1 | 2024-05-01 16:34:21.791 GMT [41] LOG: starting logical decoding for slot "db_name_public_subscription"
After these logs, the table size on postgres was 50 MB and the table size on postgres2 was 45 MB.
7) When I created the primary key on postgres2, the tables on postgres and postgres2 became the same size:
ALTER TABLE ONLY public.table
ADD CONSTRAINT "PK_table" PRIMARY KEY (field_1);
Result: I had fully consistent data on postgres2 and a normal replication process.
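As the error message itself hints, an alternative to creating the primary key on the subscriber would have been to make the publisher send the whole old row, sketched below (note this increases the WAL volume for updates and deletes on the published table):

-- Run on the publisher ("postgres"): ship the full old row for UPDATE/DELETE,
-- so the subscriber can match rows without a primary key or identity index.
ALTER TABLE public.table REPLICA IDENTITY FULL;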

Then I repeated the steps for PostgreSQL 13.14.

1) I ran the docker-compose file with PostgreSQL 13:
docker-compose.yml:
version: '3.3'
services:
  postgres:
    image: postgres:13.14
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5432:5432"
  postgres2:
    image: postgres:13.14
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data2:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5433:5432"
So again I had two PostgreSQL servers.
postgresql.conf:
hba_file = '/etc/pg_hba.conf'
listen_addresses = '*'
ssl = off
wal_level = logical  
pg_hba.conf:
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host all all all md5

2) I created a table on the "postgres" server:
CREATE TABLE public.table (
field_1 uuid NOT NULL,
field_2 uuid NOT NULL,
field_3 uuid NOT NULL,
field_4 text NOT NULL,
field_5 text NOT NULL,
field_6 timestamp with time zone NOT NULL,
field_7 timestamp with time zone NOT NULL,
field_8 smallint NOT NULL,
field_9 text NOT NULL,
field_10 text NOT NULL,
field_11 uuid NOT NULL,
field_12 uuid,
field_13 text,
field_14 uuid,
field_15 text NOT NULL,
field_16 integer,
field_17 text NOT NULL,
field_18 uuid,
field_19 character varying(20),
field_20 character varying(29),
field_21 text,
field_22 bigint,
field_23 character varying(20)
);

ALTER TABLE ONLY public.table
ADD CONSTRAINT "PK_table" PRIMARY KEY (field_1);
3) Created a publication on "postgres":
CREATE ROLE replication_role WITH REPLICATION LOGIN PASSWORD 'password';

GRANT ALL PRIVILEGES ON DATABASE "db" TO replication_role;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO replication_role;
CREATE PUBLICATION db_name_public_publication;
ALTER PUBLICATION db_name_public_publication ADD TABLE "table";
4) I started two Python scripts that generated requests to "postgres". The scripts insert.py and update.py were described earlier.
5) Created a subscription on postgres2:
CREATE SUBSCRIPTION db_name_public_subscription CONNECTION 'host=postgres port=5432 password=password user=replication_role dbname=db' PUBLICATION db_name_public_publication;

6) Then I got these messages in the logs:
13-postgres-1   | 2024-05-01 16:26:15.348 GMT [40] STATEMENT:  CREATE_REPLICATION_SLOT "db_name_public_subscription" LOGICAL pgoutput NOEXPORT_SNAPSHOT
13-postgres2-1 | 2024-05-01 16:26:15.378 GMT [39] LOG: logical replication apply worker for subscription "db_name_public_subscription" has started
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] LOG: starting logical decoding for slot "db_name_public_subscription"
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] DETAIL: Streaming transactions committing after 0/A0E10F0, reading WAL from 0/A0E1048.
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] LOG: logical decoding found initial starting point at 0/A0E1048
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] DETAIL: Waiting for transactions (approximately 1) older than 66255 to end.
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] LOG: logical decoding found consistent point at 0/A0E10B8
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] DETAIL: There are no running transactions.
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres2-1 | 2024-05-01 16:26:15.384 GMT [40] LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
13-postgres-1 | 2024-05-01 16:26:15.476 GMT [42] LOG: logical decoding found consistent point at 0/A0E45D8
13-postgres-1 | 2024-05-01 16:26:15.476 GMT [42] DETAIL: There are no running transactions.
13-postgres-1 | 2024-05-01 16:26:15.476 GMT [42] STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription_16500_sync_16385" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] LOG: starting logical decoding for slot "db_name_public_subscription_16500_sync_16385"
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] DETAIL: Streaming transactions committing after 0/A0E4610, reading WAL from 0/A0E45D8.
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_16500_sync_16385" LOGICAL 0/A0E4610 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] LOG: logical decoding found consistent point at 0/A0E45D8
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] DETAIL: There are no running transactions.
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_16500_sync_16385" LOGICAL 0/A0E4610 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres2-1 | 2024-05-01 16:26:16.994 GMT [40] ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL
13-postgres-1 | 2024-05-01 16:26:16.994 GMT [42] LOG: could not send data to client: Connection reset by peer
13-postgres-1 | 2024-05-01 16:26:16.994 GMT [42] CONTEXT: slot "db_name_public_subscription_16500_sync_16385", output plugin "pgoutput", in the commit callback, associated LSN 0/A0F0F58
13-postgres-1 | 2024-05-01 16:26:16.994 GMT [42] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_16500_sync_16385" LOGICAL 0/A0E4610 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres2-1 | 2024-05-01 16:26:16.996 GMT [1] LOG: background worker "logical replication worker" (PID 40) exited with exit code 1
13-postgres2-1 | 2024-05-01 16:26:20.393 GMT [41] LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
13-postgres-1 | 2024-05-01 16:26:20.461 GMT [43] LOG: logical decoding found consistent point at 0/A12E068
13-postgres-1 | 2024-05-01 16:26:20.461 GMT [43] DETAIL: There are no running transactions.
After these logs, the table size on postgres was 50 MB, while the table on postgres2 had reached 450 MB and was still growing.
I checked table sizes with this query:
select
    table_schema,
    table_name,
    pg_relation_size('"'||table_schema||'"."'||table_name||'"'),
    pg_size_pretty(pg_relation_size('"'||table_schema||'"."'||table_name||'"'))
from information_schema.tables
where table_schema in ('public', 'saga', 'hangfire')
order by 3 DESC;
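Besides table sizes, the subscriber-side progress can be watched via the pg_stat_subscription view (a monitoring sketch I would add; the view is available in PostgreSQL 10 and later):

-- Run on the subscriber (postgres2): shows the apply and tablesync workers
-- and the last LSN/time each worker received data from the publisher.
SELECT subname, pid, relid, received_lsn, last_msg_receipt_time
FROM pg_stat_subscription;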
Please help to resolve this problem; it would allow us to migrate large tables.
I am looking forward to your answer.
I also attached files with logs to this email.

On Mon, 29 Apr 2024 at 13:14, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
Hi

> 3. This log was got for a table size of 39 GB but we have tables much bigger than this.
>
> Our workflow for the tables size 39 GB
>
> 1) Download schema from the source database instance
>
> 2) Deleted PK, FK, and Indexes  from the table schema(we did it to increase data load, startup process)
> 3)Upload the schema to the destination DB.
> 4) configurated identity replication full at source DB for this table
> 5) Configured Logical replication between source and destination DB this table
> 6) During catchup on this table  process we got the messages in log that you can see above.
>
> 7) We also tried create primary key for this table during catchup state but this process was blocked by logical replication worker so we had to cancel primary key creation process.
>
> I want to highlight that we used this workflow for PostgreSQL 10 and it worked.
>

I tried the steps you provided to reproduce the issue on PostgreSQL 13,
but I was not able to reproduce it. I tried it for tables with size
~120GB and also tried it separately for a table with size ~40GB.

Steps I used:
1. Created a sourceDB and DestinationDB.
2. Populated sourceDB with data:
"create table t1(c1 int PRIMARY KEY, c2 int);"
"insert into t1 values(generate_series(1,1500000000), 2)"
"update t1 set c1 = 0 where c1 = 1"
3. Deleted Primary Key on sourceDB:
"alter table t1 drop constraint t1_pkey"
4. Created schema on destinationDB:
"create table t1(c1 int, c2 int);"
5. Configured Replica identity full on sourceDB:
"ALTER TABLE t1 REPLICA IDENTITY FULL"
6. Configured Logical Replication Setup:
on sourceDB: "create publication pub1_new for all tables"
on destinationDB: "create subscription test1_new connection
'dbname=postgres host=localhost port=5432' publication pub1_new;"

For me the initial catchup was successful and I did not see any
issues in the sourceDB or destinationDB logs.
Can you point out if you did anything different?

Also, for me the initial catchup succeeded even when I did not
configure replica identity full on sourceDB.
Can you point out how your steps differed to get the error related
to replica identity full?

Also, are you running any queries concurrently on sourceDB or destinationDB?

Thanks and Regards,
Shlok Kyal
Attachments
