Logical replication is missing block of rows when sending initial sync?
From:
hubert depesz lubaczewski
Date:
Hi,
we have a situation where we use native logical replication to copy data from pg12 (on Ubuntu Bionic) to pg14 (on Ubuntu Focal). Generally it mostly works.

The process is like this:
1. make all tables in destination, with *just* primary keys. no other indexes, constraints
2. setup logical replication, adding tables in batches of 40, adding a new batch to publication only when all tables are in 'r' state
3. once all tables are replicating, add indexes, fkeys, triggers and the like on recipient.

Every so often we got errors when adding fkeys. Usually I didn't have much time to investigate, but today I had a bit. Unfortunately the replica is gone (I had to re-set the replication). But when I had it there I noticed that a block of rows from one of the tables was missing.

I am not entirely sure about the size of the block, but I checked 20 rows with consecutive ids, and they were all missing from the focal side:

#v+
=# select ctid, xmin, id, created_at, updated_at from c33s16831.subs where id between 19733130 and 19733150 order by 1;
    ctid     │    xmin    │    id    │         created_at         │         updated_at
─────────────┼────────────┼──────────┼────────────────────────────┼────────────────────────────
 (486337,57) │ 2668461379 │ 19733130 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,58) │ 2668461379 │ 19733131 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,59) │ 2668461379 │ 19733132 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,60) │ 2668461379 │ 19733133 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,61) │ 2668461379 │ 19733134 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,62) │ 2668461379 │ 19733135 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,63) │ 2668461379 │ 19733136 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,64) │ 2668461379 │ 19733137 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,65) │ 2668461379 │ 19733138 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,66) │ 2668461379 │ 19733139 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486337,67) │ 2668461379 │ 19733140 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
 (486340,38) │ 2668691763 │ 19733147 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:07.360778
 (486340,58) │ 2668695225 │ 19733146 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:15.227267
 (486340,61) │ 2668711141 │ 19733144 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:43.499749
 (486340,62) │ 2668717802 │ 19733149 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:49.900798
 (486340,63) │ 2668723876 │ 19733141 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:00.749445
 (486340,64) │ 2668753420 │ 19733142 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:22:02.140332
 (486340,66) │ 2668761678 │ 19733150 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:22:19.140231
 (486341,9)  │ 2668726256 │ 19733145 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:06.845051
 (486341,17) │ 2668748364 │ 19733148 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:50.246899
 (486341,18) │ 2668750613 │ 19733143 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:55.863125
(21 rows)
#v-

Last vacuum (from autovacuum) was ~ 3 weeks ago, so it isn't a factor.

I don't know when exactly this batch was added to publication, but I found that the COPY that sent initial sync of the data was logged at:

#v+
2023-10-25 18:48:43.982 UTC,"upguser","dbname",12995,"10.1.207.7:44956",65396265.32c3,6,"COPY",2023-10-25 18:45:57 UTC,74/160580191,0,LOG,00000,"duration: 166355.331 ms  statement: COPY c33s16831.subs TO STDOUT",,,,,,,,,"pg_443910492_sync_443582288_7292150788357777097"
#v-

Unfortunately I don't have a way to repeat the problem. Just noticed that on some dbs that we're upgrading the problem is more common, and on some it generally doesn't happen.
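For reference, a minimal sketch of the setup flow described above (all object and host names here are illustrative, not the real commands used):

```sql
-- On the publisher (pg12): create an empty publication, then add tables
-- in batches.
CREATE PUBLICATION focal14;
ALTER PUBLICATION focal14 ADD TABLE ONLY c33s16831.subs;

-- On the subscriber (pg14): connection string is a placeholder.
CREATE SUBSCRIPTION focal14
    CONNECTION 'host=bionic-host dbname=dbname user=upguser'
    PUBLICATION focal14;

-- On the subscriber: wait until every table reaches state 'r' (ready)
-- before adding the next batch on the publisher.
SELECT srrelid::regclass, srsubstate
FROM pg_subscription_rel
ORDER BY 1;
```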
So, I don't have high hopes re: finding/fixing the issue, but figured I'll let you know that there is such a situation; perhaps it will strike someone as relating to something that could be the problem...

Best regards,

depesz
RE: Logical replication is missing block of rows when sending initial sync?
From:
"Hayato Kuroda (Fujitsu)"
Date:
Dear depesz,

Hi, thanks for reporting.

> 1. make all tables in destination, with *just* primary keys. no other
> indexes, constraints
> 2. setup logical replication, adding tables in batches by 40, adding new
> batch to publication only when all tables are in 'r' state
> 3. once all tables are replicating, add indexes, fkeys, triggers and
> the like on recipient.

I've tried, but I could not reproduce the failure. PSA the script I used. Could you please provide missing information like below?

* Table definitions on both nodes
* GUC settings
* Publication/subscription options
* Server log and output when you met the failure
* A minimal script for reproducing the failure (if you succeeded)
* Is there incoming data on the publisher during the initial sync?
* The method you used to check pg_subscription_rel

> Every so often we got errors when adding fkeys. Usually I didn't have
> much time to investigate, but today had a bit. Unfortunatley the replica
> is gone (I had to re-set the replication).

Could you share the error message you met? I thought you saw something like "Key (XXX)=(YYY) is not present in table ""ZZZ"".", but please paste it here. If so, are tuples really missing at that time?

> But when I had it there
> I noticed that a block of rows from one of the tables was missing.
>
> I am not entirely sure about the size of the block, but I check 20 rows
> with consecutive ids, and they were all missing from focal side:
> ...

(I was not sure how you defined tables and subscriptions, but...) Hmm, logical replication does not copy physical information like ctid, so I'm not sure it is related to us. I felt that the tuples between 19733141 and 19733150 were updated in some way so that their locations changed.

FYI - below workflow is a brief description of initial synchronization:

1. Tablesync worker connects to publisher
2. Starts a transaction on both pub/sub
3. Creates a replication slot on publisher
4. Executes "COPY ... TO STDOUT" on publisher. This means that all tuples are extracted
5. Does BeginCopyFrom() and CopyFrom(); they are the same infrastructure as a normal "COPY FROM" statement
6. Commits the transaction on both pub/sub
7. Catches up changes since the replication slot was created in step 3

> I don't know when exactly this batch was added to publication, but I found that
> COPY that sent initial sync of the data was logged at:
>
> #v+
> 2023-10-25 18:48:43.982 UTC,"upguser","dbname",12995,"10.1.207.7:44956",65396265.32c3,6,"COPY",2023-10-25 18:45:57 UTC,74/160580191,0,LOG,00000,"duration: 166355.331 ms  statement: COPY c33s16831.subs TO STDOUT",,,,,,,,,"pg_443910492_sync_443582288_7292150788357777097"
> #v-

Yeah, I think this copy command was sent from the tablesync worker.

> Unfortunately I don't have a way to repeat the problem. Just noticed
> that on some dbs that we're upgrading the problem is more common, and on
> some - it generally doesn't happen.

Just to confirm - you met the error only when the logical replication was set up from an old cluster to a newer one, right?

> So, I don't have high hopes re: finding/fixing the issue, but figured
> I'll let you know that there is such situation, perhaps it will strike
> someone as something relating to something that could be the problem...

That's great; I had never known about this issue before.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
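The workflow above can be sketched roughly in terms of the commands a tablesync worker issues on the publisher (slot and table names below are made up; the real worker does this over a walsender replication connection, not plain SQL):

```sql
-- Step 2: snapshot-stable read-only transaction on the publisher side.
BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;

-- Step 3: CREATE_REPLICATION_SLOT is a replication-protocol command;
-- USE_SNAPSHOT makes the COPY below see exactly the slot's starting
-- snapshot.
CREATE_REPLICATION_SLOT "pg_16384_sync_16385_700000000000000" LOGICAL pgoutput USE_SNAPSHOT;

-- Step 4: extract all tuples; the subscriber ingests them via CopyFrom().
COPY c33s16831.subs TO STDOUT;

-- Step 6: commit on both sides.
COMMIT;

-- Step 7: stream changes that happened since the slot was created:
-- START_REPLICATION SLOT "pg_16384_sync_16385_700000000000000" LOGICAL 0/0
--     (proto_version '1', publication_names 'focal14')
```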
Attachments
Re: Logical replication is missing block of rows when sending initial sync?
From:
hubert depesz lubaczewski
Date:
On Mon, Oct 30, 2023 at 07:10:35AM +0000, Hayato Kuroda (Fujitsu) wrote:
> * Table definitions on both nodes

Well, the table has 50 columns, 28 indexes, 10 fkeys, is referenced back by 26 tables, and has two simple triggers. Showing it all would be rather complicated and time consuming (legal issues). Used datatypes:

- 14 columns with bigint
- 8 columns with character varying(255)
- 7 columns with timestamp without time zone
- 6 columns with boolean
- 3 columns with text
- 3 columns with integer
- 3 columns with double precision
- 2 columns with character varying(16)
- 1 column with uuid
- 1 column with numeric(6,2)
- 1 column with character varying(5)
- 1 column with character varying

I know it's not a full \d, but I don't think sharing it in full is realistic. Schema on recipient is restored from a schema dump made ~ 5 minutes earlier.

> * GUC settings

Which GUCs?

> * Publication/subscription options

#v+
=# \dRp focal14
                           List of publications
  Name   │  Owner   │ All tables │ Inserts │ Updates │ Deletes │ Truncates
─────────┼──────────┼────────────┼─────────┼─────────┼─────────┼───────────
 focal14 │ postgres │ f          │ t       │ t       │ t       │ t
(1 row)
#v-

and

#v+
=# \dRs focal14
           List of subscriptions
  Name   │  Owner   │ Enabled │ Publication
─────────┼──────────┼─────────┼─────────────
 focal14 │ postgres │ t       │ {focal14}
(1 row)
#v-

> * Server log and output when you met the failure

The failure was just a failure to add an fkey. These logs are gone now; I can't keep a failed host while I'm in the process of preparing boxes for upgrade. Error message was:

#v+
ALTER TABLE ONLY c33s16831.ag_change_records_2023_10 ADD CONSTRAINT fk_rails_0113b4b993 FOREIGN KEY (sub_id) REFERENCES c33s16831.subs(id);
ERROR:  insert or update on table "ag_change_records_2023_10" violates foreign key constraint "fk_rails_0113b4b993"
DETAIL:  Key (sub_id)=(19733141) is not present in table "subs".
#v-

> * A minimal script for reproducing the failure (if you succeeded)

No. I think it's some kind of race condition.
> * Are there incoming data on publisher during the initial sync?

Yes. Pg12 is working as a normal primary server for a cluster that has traffic 24h/day.

> * A method how you checked pg_subscription_rel

#v+
WITH info AS (
    SELECT format( '%s:%s', sr.srsubstate, count(*) ) AS part,
           min( o.ord ) AS ord
    FROM pg_subscription AS s
    JOIN pg_subscription_rel AS sr ON s.oid = sr.srsubid
    JOIN unnest( '{i,d,f,s,r}'::text[] ) WITH ORDINALITY AS o ( s, ord ) ON sr.srsubstate = o.s
    WHERE s.subname = 'focal14'
    GROUP BY sr.srsubstate
)
SELECT string_agg( part, ' ; ' order by ord ) FROM info;
#v-

This query I used while adding tables: add 40 tables to publication, wait for all state counts to be 0, with the exception of "r". Afterwards I used:

#v+
SELECT n.nspname, c.relname
FROM pg_subscription s
JOIN pg_subscription_rel sr ON s.oid = sr.srsubid
JOIN pg_class c ON sr.srrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE s.subname = 'focal14'
ORDER BY n.nspname, c.relname
#v-

and compared it against

#v+
SELECT schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'focal14'
ORDER BY schemaname, tablename
#v-

> Could you share an error message what you met? I thought you saw like
> "Key (XXX)=(YYY) is not present in table "ZZZ".", but please paste here.

#v+
ALTER TABLE ONLY c33s16831.ag_change_records_2023_10 ADD CONSTRAINT fk_rails_0113b4b993 FOREIGN KEY (sub_id) REFERENCES c33s16831.subs(id);
ERROR:  insert or update on table "ag_change_records_2023_10" violates foreign key constraint "fk_rails_0113b4b993"
DETAIL:  Key (sub_id)=(19733141) is not present in table "subs".
#v-

> If so, are tuples really missing at that time?

Yes. I checked, and a whole block of values was missing - like I showed in the previous mail.

> FYI - Below workflow is a brief description for initial synchronization.
> 1. Tablesync worker connects to publisher
> 2. Starts a transaction on both pub/sub
> 3. Creates a replication slot on publisher
> 4. Executes "COPY ... TO STDOUT" on publisher. This means that all tuples are extracted
> 5. Does BeginCopyFrom() and CopyFrom(), they are same infrastructure as normal "COPY FROM" statement.
> 6. Commits the transaction on both pub/sub
> 7. Catches up changes since the replication slot was defined at 4.

I kinda understand it. That's why I thought it's important to report that sometimes, something, somewhere breaks.

Since my project is time sensitive, I have to proceed, and if such a problem happens on my boxes, I just redo the whole sync from scratch. So I don't mind that there is a bug, as I have a workaround, but I think it would be cool to figure out what the potential source could be, and perhaps find a fix.

> > Unfortunately I don't have a way to repeat the problem. Just noticed
> > that on some dbs that we're upgrading the problem is more common, and on
> > some - it generally doesn't happen.
> Just to confirm - You met the error only when the logical replication was set
> up from old clusters to newer, right?

Yes. That's what we use LR for. For upgrades. It happens randomly, on some db clusters more than on others, but all in all, at most I had to rebuild an upgrade box 3 times, and then the problem was gone. Usually (in > 90% of cases) there is no problem.

Best regards,

depesz
Re: Logical replication is missing block of rows when sending initial sync?
From:
Kyotaro Horiguchi
Date:
At Mon, 30 Oct 2023 07:10:35 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in
> I've tried, but I could not reproduce the failure. PSA the script what I did.

I'm not well-versed in the details of logical replication, but does logical replication inherently operate in such a way that it fully maintains relationships between tables? If not, isn't it possible that the issue in question is not about missing referenced data, but merely a temporary delay?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
RE: Logical replication is missing block of rows when sending initial sync?
From:
"Hayato Kuroda (Fujitsu)"
Date:
Dear Horiguchi-san,

> I'm not well-versed in the details of logical replication, but does
> logical replication inherently operate in such a way that it fully
> maintains relationships between tables? If not, isn't it possible that
> the issue in question is not about missing referenced data, but merely
> a temporary delay?

I assumed that the tuples referred to by the FK existed before or during synchronization. In this case, IIUC the status is never 'r' before they are replicated. Thoughts?

Dear Depesz,

Is the above assumption right?

Also, I suggest checking pg_stat_replication on the publisher when you meet the failure again. If pg_stat_replication.replay_lsn is much behind pg_stat_replication.sent_lsn, the replication may be delayed, so the tuples have not appeared yet. pg_stat_subscription on the subscriber has similar info to replay_lsn.

Note that pg_stat_replication may not be updated if feedback from the subscriber is disabled.

https://www.postgresql.org/docs/14/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
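The suggested lag check on the publisher could look like this (a sketch; these pg_stat_replication columns and pg_wal_lsn_diff() are as documented for v12/v14):

```sql
-- On the publisher: how far the subscriber's replay position lags
-- behind what has been sent, per walsender.
SELECT application_name,
       sent_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```

A replay_lag_bytes near zero would indicate the subscriber is fully caught up, which argues against a simple delay.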
Re: Logical replication is missing block of rows when sending initial sync?
From:
hubert depesz lubaczewski
Date:
On Thu, Nov 02, 2023 at 10:17:13AM +0900, Kyotaro Horiguchi wrote:
> At Mon, 30 Oct 2023 07:10:35 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in
> > I've tried, but I could not reproduce the failure. PSA the script what I did.
>
> I'm not well-versed in the details of logical replication, but does
> logical replication inherently operate in such a way that it fully
> maintains relationships between tables? If not, isn't it possible that
> the issue in question is not about missing referenced data, but merely
> a temporary delay?

The problem is that data that appeared *later* was visible on the subscriber. Data that came earlier was visible too. Just some block of data got, for some reason, skipped.

Best regards,

depesz
Re: Logical replication is missing block of rows when sending initial sync?
From:
hubert depesz lubaczewski
Date:
On Thu, Nov 02, 2023 at 03:17:03AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Also, I suggest you that pg_stat_replication can be checked on publisher when you
> meet the failure again. If pg_stat_replication.replay_lsn is much behind
> pg_stat_replication.sent_lsn, the replication may be delayed so that the tuples
> are not appeared. pg_stat_subscription on subscriber has similar infos with replay_lsn.
>
> Note that pg_stat_replication may be not updated if feedbacks from a subscriber
> is disabled.
>
> https://www.postgresql.org/docs/14/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

Whenever the problem happens, lag is virtually nonexistent. We test it by adding new rows to a couple of tables, and verifying that these rows show up on the subscriber. They did.

Again - replication was *working*, with the exception of a block of rows from one table (so far it never happened on more than one) that was missing.

Best regards,

depesz
On Thu, Nov 2, 2023 at 4:53 PM hubert depesz lubaczewski <depesz@depesz.com> wrote:
> The problem is that data that appeared *later* was visible on the
> subscriber. Data that came earlier was visible too. Just some block of
> data got, for some reason, skipped.

Quite strange. I think to narrow down such a problem, the first thing to figure out is whether the data is skipped by the initial sync or by later replication. To find that out, you can check the remote_lsn value in pg_replication_origin_status for the origin used in the initial sync once the relation reaches the 'ready' state. Then, you can try to see on the publisher side, using pg_waldump, whether the missing rows exist before the value of remote_lsn or after it. That can help us narrow down the problem and could give us some clues for the next steps.

--
With Regards,
Amit Kapila.
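Concretely, the check described above might look like this (the pg_waldump segment argument is a placeholder; the origin naming pattern is an assumption - check what actually shows up on your subscriber):

```sql
-- On the subscriber: LSN up to which each origin has applied changes.
-- The subscription's origin is typically named pg_<subscription oid>;
-- tablesync workers use pg_<subscription oid>_<table oid>.
SELECT external_id, remote_lsn, local_lsn
FROM pg_replication_origin_status;

-- Then, on the publisher, inspect WAL around that LSN, e.g.:
--   pg_waldump --start=<remote_lsn> <WAL segment file containing it>
-- and check whether the INSERT records for the missing rows fall
-- before remote_lsn (should have come via the initial COPY) or after
-- it (should have come via the apply stream).
```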
Re: Logical replication is missing block of rows when sending initial sync?
From:
hubert depesz lubaczewski
Date:
On Fri, Nov 03, 2023 at 09:09:12AM +0530, Amit Kapila wrote:
> Quite strange. I think to narrow down such a problem, the first thing
> to figure out is whether the data is skipped by initial sync or later
> replication. To find that out, you can check remote_lsn value in
> pg_replication_origin_status for the origin used in the initial sync
> once the relation reaches the 'ready' state. Then, you can try to see
> on the publisher side using pg_waldump whether the missing rows exist
> before the value of remote_lsn or after it. That can help us to narrow
> down the problem and could give us some clues for the next steps.

I will be prepping another set of clusters to upgrade soon, and will try to get some more data. The window to work on the bad data isn't long, though.

Best regards,

depesz
On 11/3/23 13:04, hubert depesz lubaczewski wrote:
> I will be prepping another set of clusters to upgrade soon, will try to
> get some more data. The window to work on the bad data isn't long,
> though.

I think it'd be interesting to know:

1) Commit LSN for the missing rows (for the xmin).

2) Were the other changes from these transactions replicated correctly?

3) LSNs used for the tablesync slot, catchup etc. I believe those are in the server log.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Logical replication is missing block of rows when sending initial sync?
From:
hubert depesz lubaczewski
Date:
On Fri, Nov 03, 2023 at 01:20:22PM +0100, Tomas Vondra wrote:
> I think it'd be interesting to know:
>
> 1) Commit LSN for the missing rows (for the xmin).
>
> 2) Are the other changes for these transactions that *got* replicated
> correctly?

How would I know?

> 3) LSNs used for the tablesync slot, catchup etc. I believe those are in
> the server log.

I will provide more logs on the next problem.

Best regards,

depesz
Re: Logical replication is missing block of rows when sending initial sync?
From:
hubert depesz lubaczewski
Date:
On Fri, Nov 03, 2023 at 01:22:42PM +0100, hubert depesz lubaczewski wrote:
> I will provide more logs on the next problem.

OK, found another case (maybe two).

First, the error:

#v+
ALTER TABLE ONLY c203s15531.conversation_message_participants ADD CONSTRAINT fk_rails_992a8b9e13 FOREIGN KEY (conversation_message_id) REFERENCES c203s15531.conversation_messages(id);
ERROR:  insert or update on table "conversation_message_participants" violates foreign key constraint "fk_rails_992a8b9e13"
DETAIL:  Key (conversation_message_id)=(431568) is not present in table "conversation_messages".
#v-

With forced seq scans (set enable_indexscan = false; set enable_bitmapscan = false; set enable_indexonlyscan = false;) I got:

(12/bionic):

#v+
=# select count(*) from c203s15531.conversation_messages where id = 431568;
 count
───────
     1
(1 row)

=# select count(*) from c203s15531.conversation_message_participants where conversation_message_id = 431568;
 count
───────
     2
(1 row)
#v-

(14/focal):

#v+
=# select count(*) from c203s15531.conversation_messages where id = 431568;
 count
───────
     0
(1 row)

=# select count(*) from c203s15531.conversation_message_participants where conversation_message_id = 431568;
 count
───────
     2
(1 row)
#v-

So it looks like some rows are missing. Let's see the missing range:

#v+
Pg14# select max(id) from conversation_messages where id < 431568;
  max
────────
 431567
(1 row)

Pg14# select min(id) from conversation_messages where id > 431568;
  min
────────
 431572
(1 row)
#v-

So, the block from 431568 to 431571 is missing.
On pg12, though:

#v+
Pg12# select count(*) from conversation_messages where id between 431568 and 431571;
 count
───────
     4
(1 row)

Pg12# select ctid, xmin, cmin, xmax, cmax, id, created_at from conversation_messages where id between 431568 and 431571;
    ctid    │    xmin    │ cmin │    xmax    │ cmax │   id   │         created_at
────────────┼────────────┼──────┼────────────┼──────┼────────┼────────────────────────────
 (29604,10) │ 3095841913 │    1 │ 3095841995 │    1 │ 431568 │ 2023-11-07 15:12:38.712504
 (29604,11) │ 3095848690 │    2 │ 3095848745 │    2 │ 431569 │ 2023-11-07 15:12:49.5695
 (29604,12) │ 3095884949 │    2 │ 3095885033 │    2 │ 431570 │ 2023-11-07 15:13:37.511025
 (29604,13) │ 3095917833 │    2 │ 3095917890 │    2 │ 431571 │ 2023-11-07 15:14:40.058367
(4 rows)
#v-

In logs on bionic/pg12 I found when this particular table was added to subscription:

#v+
2023-11-07 15:10:19.023 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,29,"idle",2023-11-07 15:10:18 UTC,30/865502097,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s15745.a_g_c_r_2023_10;",,,,,,,,,"psql"
2023-11-07 15:10:19.026 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,30,"idle",2023-11-07 15:10:18 UTC,30/865502098,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s15311.q_q;",,,,,,,,,"psql"
2023-11-07 15:10:19.032 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,31,"idle",2023-11-07 15:10:18 UTC,30/865502099,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s16352.m_2023_06;",,,,,,,,,"psql"
2023-11-07 15:10:19.035 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,32,"idle",2023-11-07 15:10:18 UTC,30/865502100,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s15531.conversation_messages;",,,,,,,,,"psql"
2023-11-07 15:10:19.039 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,33,"idle",2023-11-07 15:10:18 UTC,30/865502101,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s15479.q_s_e_2023_10;",,,,,,,,,"psql"
2023-11-07 15:10:19.043 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,34,"idle",2023-11-07 15:10:18 UTC,30/865502102,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s17062.wiki_pages;",,,,,,,,,"psql"
2023-11-07 15:10:19.046 UTC,"postgres","dbname",7350,"[local]",654a535a.1cb6,35,"idle",2023-11-07 15:10:18 UTC,30/865502103,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c203s15841.a_g_c_r_2023_10;",,,,,,,,,"psql"
#v-

Since it was previously asked - schema:

#v+
        Column         │            Type             │ Collation │ Nullable │                      Default
───────────────────────┼─────────────────────────────┼───────────┼──────────┼────────────────────────────────────────────────────
 id                    │ bigint                      │           │ not null │ nextval('conversation_messages_id_seq'::regclass)
 ..................... │ bigint                      │           │          │
 ..................... │ bigint                      │           │          │
 created_at            │ timestamp without time zone │           │          │
 ..................... │ boolean                     │           │          │
 ..................... │ text                        │           │          │
 ..................... │ text                        │           │          │
 ..................... │ character varying(255)      │           │          │
 ..................... │ character varying(255)      │           │          │
 ..................... │ bigint                      │           │          │
 ..................... │ character varying(255)      │           │          │
 ..................... │ bigint                      │           │          │
 ..................... │ character varying(255)      │           │          │
 ..................... │ text                        │           │          │
 ..................... │ boolean                     │           │          │
 ..................... │ boolean                     │           │          │
 ..................... │ text                        │           │          │
Indexes:
    "conversation_messages_pkey" PRIMARY KEY, btree (id), tablespace "data1"
    "index_conversation_messages_on_...." btree (.........), tablespace "data1"
    "index_conversation_messages_on_......_and_created_at" btree (..........., created_at), tablespace "data1"
Foreign-key constraints:
    "fk_rails_c3c322d882" FOREIGN KEY (conversation_id) REFERENCES conversations(id)
Referenced by:
    TABLE "conversation_message_participants" CONSTRAINT "fk_rails_992a8b9e13" FOREIGN KEY (conversation_message_id) REFERENCES conversation_messages(id)
    TABLE "conversation_batches" CONSTRAINT "fk_rails_d421fc74f4" FOREIGN KEY (root_conversation_message_id) REFERENCES conversation_messages(id)
Publications:
    "cdc"
    "focal14"
Triggers:
    guard_excessive_deletes AFTER DELETE ON conversation_messages REFERENCING OLD TABLE AS oldtbl FOR EACH STATEMENT EXECUTE FUNCTION guard_excessive_updates()
    guard_excessive_updates AFTER UPDATE ON conversation_messages REFERENCING OLD TABLE AS oldtbl FOR EACH STATEMENT EXECUTE FUNCTION guard_excessive_updates()
#v-

There are no dropped attributes. Stats:

#v+
Pg12# select * from pg_stat_user_tables where relid = 'c203s15531.conversation_messages'::regclass \gx
─[ RECORD 1 ]───────┬──────────────────────────────
relid               │ 166420
schemaname          │ c203s15531
relname             │ conversation_messages
seq_scan            │ 14
seq_tup_read        │ 2595888
idx_scan            │ 16059568
idx_tup_fetch       │ 27426090
n_tup_ins           │ 372696
n_tup_upd           │ 38
n_tup_del           │ 0
n_tup_hot_upd       │ 0
n_live_tup          │ 432630
n_dead_tup          │ 0
n_mod_since_analyze │ 7034
last_vacuum         │ [null]
last_autovacuum     │ 2023-08-10 18:06:11.655573+00
last_analyze        │ [null]
last_autoanalyze    │ 2023-11-01 18:56:52.868395+00
vacuum_count        │ 0
autovacuum_count    │ 5
analyze_count       │ 0
autoanalyze_count   │ 91
#v-

I can leave it as is for, let's say, a day. So if there are any queries/data you'd like me to find out, let me know, but afterwards I will have to rebuild the focal from scratch.
For whatever it's worth, we're adding tables to publication using:

#v+
with base as (
    select n.nspname, c.relname, c.relpages
    from pg_class c
    join pg_namespace n on c.relnamespace = n.oid
    left outer join pg_catalog.pg_publication_tables pt
        on pt.pubname = 'focal14' and pt.schemaname = n.nspname and pt.tablename = c.relname
    where c.relkind in ('p', 'r')
      and n.nspname ~ 'schemas|we|care|about'
      and pt.pubname is null
)
select format('ALTER PUBLICATION focal14 ADD TABLE ONLY %I.%I;', nspname, relname)
from base
order by relpages desc
limit 40 \gexec
#v-

Best regards,

depesz
Re: Logical replication is missing block of rows when sending initial sync?
From:
hubert depesz lubaczewski
Date:
> OK. found another case (maybe two).

OK, the other case is real. Different database. Same scenario.

Errors (3, because we have 3 missing fkeys):

#v+
ALTER TABLE ONLY c44s16539.assignment_overrides ADD CONSTRAINT fk_rails_58f8ee369b FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
ERROR:  insert or update on table "assignment_overrides" violates foreign key constraint "fk_rails_58f8ee369b"
DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".

ALTER TABLE ONLY c44s16539.post_policies ADD CONSTRAINT fk_rails_cf2d119863 FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
ERROR:  insert or update on table "post_policies" violates foreign key constraint "fk_rails_cf2d119863"
DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".

ALTER TABLE ONLY c44s16539.submissions ADD CONSTRAINT fk_rails_61cac0823d FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
ERROR:  insert or update on table "submissions" violates foreign key constraint "fk_rails_61cac0823d"
DETAIL:  Key (assignment_id)=(248083) is not present in table "assignments".
#v-

In all cases the problem is a missing row in c44s16539.assignments with id = 248083. This time it looks like only one row is missing:

#v+
Pg14# select max(id) from c44s16539.assignments where id < 248083;
  max
────────
 248082
(1 row)

Pg14# select min(id) from c44s16539.assignments where id > 248083;
  min
────────
 248084
(1 row)
#v-

Schema is complicated. 76 columns. On the pg12 side one column is dropped. 22 indexes, 9 fkeys from this table (assignments) to others, ~ 35 fkeys pointing to this table (but only 3 have rows that reference the missing record).
Some information about the record:

#v+
Pg12# select ctid, xmin, cmin, xmax, cmax, id, created_at, updated_at from c44s16539.assignments where id = 248083 \gx
─[ RECORD 1 ]──────────────────────────
ctid       │ (24192,1)
xmin       │ 472291499
cmin       │ 12
xmax       │ 472291499
cmax       │ 12
id         │ 248083
created_at │ 2023-11-07 15:50:05.506077
updated_at │ 2023-11-07 15:51:28.738893
#v-

Logs for when it was added to replication:

#v+
2023-11-07 15:47:27.898 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,39,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.476 ms",,,,,,,,,"psql"
2023-11-07 15:47:27.899 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,40,"idle",2023-11-07 15:47:27 UTC,6/1209097693,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c247s1094.messages_2022_50;",,,,,,,,,"psql"
2023-11-07 15:47:27.900 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,41,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.493 ms",,,,,,,,,"psql"
2023-11-07 15:47:27.900 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,42,"idle",2023-11-07 15:47:27 UTC,6/1209097694,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c44s16539.assignments;",,,,,,,,,"psql"
2023-11-07 15:47:27.902 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,43,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.936 ms",,,,,,,,,"psql"
2023-11-07 15:47:27.902 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,44,"idle",2023-11-07 15:47:27 UTC,6/1209097695,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c39s5131.messages_2023_20;",,,,,,,,,"psql"
2023-11-07 15:47:27.903 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,45,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.292 ms",,,,,,,,,"psql"
#v-

It is interesting that the time when the row was inserted (created_at) is always "around" the time that the table was added to publication...
Just like with the other case, I can leave it in its current state for a day, so if you need more information from it, let me know.

Best regards,

depesz
On 11/8/23 15:02, hubert depesz lubaczewski wrote:
> OK, the other case is real. Different database. Same scenario.
> [...]
> It is interesting that the time when the row was inserted (created_at)
> is always "around" the time that the table was added to publication...
>
> Just like with the other case, I can leave it in current state for
> a day, so if you'd need more information from it, let me know.

I think it'd be interesting to look at the WAL using pg_waldump, and see
how it relates to the LSN used for the tablesync slot.

I'll try reproducing this locally over the weekend. Should I use the
test_1030.sh script that you shared a week ago, or do I need to do
something more?

One thing that puzzles me is that IIUC the script doesn't generate any
changes while adding the tables to the subscription? I probably should
have paid more attention to the details, instead of assuming there are
concurrent changes - which seemed like there might be some race, or
something. But this seems like there are no changes and the tablesync
just fails to copy some subset of it. That's *really* strange.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
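For reference, the slot LSNs that would need to be compared against the WAL can be seen on the publisher while a sync is running - a sketch; the `pg_%_sync_%` name pattern is an assumption based on how tablesync slots are conventionally named (`pg_<subid>_sync_<relid>_<sysid>`), so adjust it to the actual slot names:

```sql
-- On the publisher: show tablesync slots and their LSN positions,
-- alongside the current WAL insert position for comparison.
select slot_name, restart_lsn, confirmed_flush_lsn, pg_current_wal_lsn()
from pg_replication_slots
where slot_name like 'pg\_%\_sync\_%';
```

These slots are dropped once a table reaches the 'r' (ready) state, so this only shows anything while tables are still syncing.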
Re: Logical replication is missing block of rows when sending initial sync?
From
hubert depesz lubaczewski
Date
On Thu, Nov 09, 2023 at 08:14:46PM +0100, Tomas Vondra wrote:
> I think it'd be interesting to look at the WAL using pg_waldump, and see
> how it relates to the LSN used for the tablesync slot.

Well, I can try to dump and show data, but it will most likely be a lot.
Are you more interested in looking at WAL from around the second when the
data was inserted, or from when the table was added to the publication?

> I'll try reproducing this locally over the weekend. Should I use the
> test_1030.sh script that you shared a week ago, or do I need to do
> something more?

That question is probably not for me, but for Hayato Kuroda.

> One thing that puzzles me is that IIUC the script doesn't generate any
> changes while adding the tables to the subscription? I probably should
> have paid more attention to the details, instead of assuming there are
> concurrent changes - which seemed like there might be some race, or
> something. But this seems like there are no changes and the tablesync
> just fails to copy some subset of it. That's *really* strange.

The script was made by Hayato Kuroda, as far as I can tell. Our situation
is not like this: in our case there is constant traffic on the publisher
from the normally working app.

Best regards,

depesz
Re: Logical replication is missing block of rows when sending initial sync?
From
hubert depesz lubaczewski
Date
On Fri, Nov 10, 2023 at 12:44:17PM +0100, hubert depesz lubaczewski wrote:
> > I think it'd be interesting to look at the WAL using pg_waldump, and see
> > how it relates to the LSN used for the tablesync slot.
>
> well, I can try to dump, and show data, but it will most likely be
> a lot. Are you more interested in looking into wal around the second
> where data was inserted, or when table was added to publication?

I tried to find it based on the relfilenode of the assignments table, and
its ctid, but I can't find it in the waldump :(

I'm probably doing something wrong, or working from wrong assumptions,
but I got 33GB of WAL, dumped it all, found the relfilenode (41441), and
grepped for it together with the block number (ctid is (24192,1), so
I assumed block 24191, 24192, or 24193) - nothing like this exists in the
WAL dump.

Best regards,

depesz
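For reference, pg_waldump prints block references as `blkref #N: rel <spc>/<db>/<relfilenode> [fork <name> ]blk <n>`, so a grep along these lines should catch heap records touching the relation and blocks in question. A sketch - the sample line below mimics the format with made-up LSN and OIDs; in practice you would pipe real pg_waldump output instead of the echo:

```shell
# Grep for WAL records touching relfilenode 41441, blocks 24191-24193.
# The sample line imitates pg_waldump's blkref output format.
sample='rmgr: Heap        len (rec/tot): 54/250, tx: 472291499, lsn: 1C2/5A63D2F0, desc: INSERT off 1, blkref #0: rel 1663/16384/41441 blk 24192'
echo "$sample" | grep -E 'rel [0-9]+/[0-9]+/41441 (fork [a-z]+ )?blk 2419[1-3]'
```

One thing worth double-checking when grepping like this: the relfilenode can differ from the table's OID (and changes on VACUUM FULL / TRUNCATE), so confirm it via pg_relation_filenode() rather than pg_class.oid.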
RE: Logical replication is missing block of rows when sending initial sync?
From
"Hayato Kuroda (Fujitsu)"
Date
Dear Tomas, Depesz,

> > I'll try reproducing this locally over the weekend. Should I use the
> > test_1030.sh script that you shared a week ago, or do I need to do
> > something more?
>
> That question is probably not to me, but to Hayato Kuroda

Just FYI - as I posted, this script could not reproduce the issue. We
could use the same table definition, but I have not tried that yet.
I will let you know if I succeed in reproducing it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On 11/15/23 02:41, Hayato Kuroda (Fujitsu) wrote:
> Just FYI - As I posted, this script could not reproduce the issue. We may
> use the same table definition but I have not tried yet. I will let you
> know if I succeeded to reproduce.

FYI I think I've reproduced the issue (with a different script); I've
started a thread on pgsql-hackers:

https://www.postgresql.org/message-id/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Obviously, I can't be sure it's exactly the same issue, but the symptoms
seem to match what Depesz reported (gaps in data etc.). I don't know what
the root cause or the fix is, but there's more info (and scripts) in the
hackers thread.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Logical replication is missing block of rows when sending initial sync?
From
hubert depesz lubaczewski
Date
On Fri, Nov 17, 2023 at 03:40:35PM +0100, Tomas Vondra wrote:
> FYI I think I've reproduced the issue (with a different script), I've
> started a thread on pgsql-hackers:
>
> https://www.postgresql.org/message-id/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com
>
> Obviously, I can't be sure it's exactly the same issue, but the symptoms
> seem to match what Depesz reported (gaps in data etc.). I don't know
> what the root cause is, or fix. But there's more info (and scripts) in
> the hackers thread.

Thanks a lot. Much appreciated. The upgrade will soon be over for us, but
it's good to know that there might be one less bug in Pg thanks to this :)

Best regards,

depesz