Discussion: Re: Moving delta data faster

Re: Moving delta data faster

From:
Adrian Klaver
Date:
On 4/3/24 13:38, yudhi s wrote:
>   Hi All,
>   It's PostgreSQL version 15.4. We have a requirement in which we will 
> initially move the full table data for 3-4 tables from the source 
> database to the target (i.e. Postgres). The maximum number of rows will 
> be ~10 million in those tables. Subsequently, these rows will be 
> inserted/updated based on the delta of rows that got inserted/updated 
> in the source database. In some cases this changed data can flow to the 
> downstream (i.e. Postgres) database multiple times per day, and in 
> other cases once daily.

What is the source database?

Can it be reached with a FDW?:

https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Can the delta on the source be output as CSV?

> 
>   Want to understand , if we should use upsert(insert on conflict) or 
> merge statements or anything else in such a scenario so as to persist 
> those delta records faster in the target database, while making the 
> system online to the users?
> 
> Regards
> Yudhi

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Moving delta data faster

From:
yudhi s
Date:
On Thu, Apr 4, 2024 at 2:41 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 4/3/24 13:38, yudhi s wrote:
>   Hi All,
>   It's postgresql database version 15.4. We have a requirement in which
> we will be initially moving full table data for 3-4 tables, from source
> database to target(i.e. postgres) . Maximum number of rows will be
> ~10million rows in those tables. Then subsequently these rows will be
> inserted/updated based on the delta number of rows that got
> inserted/updated in the source database. In some cases these changed
> data can flow multiple times per day to the downstream i.e. postgres
> database and in other cases once daily.

What is the source database?

Can it be reached with a FDW?:

https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Can the delta on the source be output as CSV?



Thank you Adrian.

One thing I forgot to mention: this target PostgreSQL database would be on AWS RDS, whereas the source Oracle database is on premises. I don't think we have the FDW extension in place currently, but we can get it. I am still not clear, though: do you mean export the data from the source as CSV and then truncate and import on the target? And as this data will be moved over the network, won't that cause slowness?

The source database here is Oracle. Correct me if I am wrong, but it looks like a foreign data wrapper is like a DB link. Or do you mean writing a query on the target database (which could be an UPSERT or MERGE) that joins to the source table through the DB link/FDW? My question, though, was whether we should use UPSERT or MERGE for comparing and loading the delta records into the target PostgreSQL database. I want to understand which is more performant; I have seen MERGE have performance issues in the past, but I am not very sure about that.
  

Re: Moving delta data faster

From:
Adrian Klaver
Date:
On 4/3/24 20:54, yudhi s wrote:
> On Thu, Apr 4, 2024 at 2:41 AM Adrian Klaver <adrian.klaver@aklaver.com 
> <mailto:adrian.klaver@aklaver.com>> wrote:
> 
>     On 4/3/24 13:38, yudhi s wrote:
>      >   Hi All,
>      >   It's postgresql database version 15.4. We have a requirement in
>     which
>      > we will be initially moving full table data for 3-4 tables, from
>     source
>      > database to target(i.e. postgres) . Maximum number of rows will be
>      > ~10million rows in those tables. Then subsequently these rows
>     will be
>      > inserted/updated based on the delta number of rows that got
>      > inserted/updated in the source database. In some cases these changed
>      > data can flow multiple times per day to the downstream i.e. postgres
>      > database and in other cases once daily.
> 
>     What is the source database?
> 
>     Can it be reached with a FDW?:
> 
>     https://wiki.postgresql.org/wiki/Foreign_data_wrappers
>     <https://wiki.postgresql.org/wiki/Foreign_data_wrappers>
> 
>     Can the delta on the source be output as CSV?
> 
> 
> 
> Thank you Adrian.
> 
> And one thing i forgot to mention this target postgresql database would 
> be on AWS RDS whereas the source Oracle databases is on premise. I think 
> we don't have the FDW extension currently in place but we can get that. 
> I am just not able to understand clearly  though, but do you mean export 
> the data from source using CSV and do truncate and import on target. And 
> as these data will be moved through the network won't that cause slowness?
> 
> The source database here is Oracle database. Correct me if wrong, it 
> looks like foreign data wrapper is like a DB link. Or do you mean 
> writing a query on the target database (which can be UPSERT or MERGE) 
> but will be joining the table from the source database through the 
> DBlink/DDW? But my question was whether we should use UPSERT or MERGE 
> for comparing and loading the delta records to the target postgresql 
> database. Want to understand which is more performant , as I see in the 
> past Merge having performance issues in the past, but not very sure 
> about that.

My motivation was to get some basic information about your setup and 
what you are trying to achieve.

If I understand correctly you have:

1) An Oracle database with tables that you want to copy the complete 
data from to a Postgres database. For this sort of thing COPY 
(https://www.postgresql.org/docs/current/sql-copy.html) on the 
Postgres end, using CSV data generated from the source, is probably the 
quickest bulk load method (see the sketch after point 2).

2) After the initial load you want to do follow up INSERT/UPDATEs based 
on a delta of the source tables relative to the initial load. This is 
still a bit of a mystery to me. How are you determining the delta: a) On 
the source end entirely or b) Target relative to source? Also, what is 
the anticipated size of the delta per transfer?
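
To be concrete, a minimal sketch of that initial load, run through psql 
so it also works against a managed instance such as RDS that has no 
server-side file access (the file name and column layout here are only 
placeholders, since you have not shown the real tables):

\copy target_tab FROM 'orders_full.csv' WITH (FORMAT csv, HEADER)

COPY streams the whole file in one command, which is normally much 
faster than row-by-row INSERTs.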

Additional information needed:

1) Network distance between source and target?

2) Network capacity?

3) Expected load on both source and target servers from other operations?

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Moving delta data faster

From:
yudhi s
Date:

On Thu, Apr 4, 2024 at 10:16 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 4/3/24 20:54, yudhi s wrote:
> On Thu, Apr 4, 2024 at 2:41 AM Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>
>    > Thank you Adrian.
>
> And one thing i forgot to mention this target postgresql database would
> be on AWS RDS whereas the source Oracle databases is on premise. I think
> we don't have the FDW extension currently in place but we can get that.
> I am just not able to understand clearly  though, but do you mean export
> the data from source using CSV and do truncate and import on target. And
> as these data will be moved through the network won't that cause slowness?
>
> The source database here is Oracle database. Correct me if wrong, it
> looks like foreign data wrapper is like a DB link. Or do you mean
> writing a query on the target database (which can be UPSERT or MERGE)
> but will be joining the table from the source database through the
> DBlink/DDW? But my question was whether we should use UPSERT or MERGE
> for comparing and loading the delta records to the target postgresql
> database. Want to understand which is more performant , as I see in the
> past Merge having performance issues in the past, but not very sure
> about that.

My motivation was to get some basic information about your setup and
what you are trying to achieve.

If I understand correctly you have:

1) An Oracle database with tables that you want to copy the complete
data from to a Postgres database. For this sort of thing
COPY(https://www.postgresql.org/docs/current/sql-copy.html) on the
Postgres end using CSV data generated from the source is probably the
quickest bulk load method.

2) After the initial load you want to do follow up INSERT/UPDATEs based
on a delta of the source tables relative to the initial load. This is
still a bit of mystery to me. How are determining the delta: a) On the
source end entirely or b) Target relative to source? Also what is the
anticipated size of the delta per transfer?

Additional information needed:

1) Network distance between source and target?

2) Network capacity?

3) Expected load on both source and target servers from other operations?


Thank you. Actually, I was trying to understand how to handle the delta load after the one-time load is done. The delta changes are planned to be identified based on the primary keys of the tables: if the key is found, the row will be updated; if the key is not found, the row will be inserted.

Basically, a select query on the source database will fetch the data for a certain time interval (based on the latest update timestamp or create timestamp if they are available, or else a full dump) and put it on S3; from S3 it will be picked up and merged into the target Postgres database. As upsert and merge both look similar, I was wondering which we should use here for loading the delta records.
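
Just to illustrate the idea, the source-side extract would be roughly of 
this shape (only a sketch; the table and timestamp column names here are 
made up, not our real schema):

-- Hypothetical delta extract on the Oracle side, driven by a high-water mark;
-- :last_extract_ts is the timestamp recorded after the previous run.
SELECT id, column1, column2
FROM app_schema.source_tab
WHERE update_ts >= :last_extract_ts
   OR create_ts >= :last_extract_ts;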

Re: Moving delta data faster

From:
Adrian Klaver
Date:
On 4/3/24 22:24, yudhi s wrote:
> 
> On Thu, Apr 4, 2024 at 10:16 AM Adrian Klaver <adrian.klaver@aklaver.com 
> <mailto:adrian.klaver@aklaver.com>> wrote:
> 
>     On 4/3/24 20:54, yudhi s wrote:
>      > On Thu, Apr 4, 2024 at 2:41 AM Adrian Klaver
>     <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>      > <mailto:adrian.klaver@aklaver.com
>     <mailto:adrian.klaver@aklaver.com>>> wrote:
>      >
>      >    > Thank you Adrian.
>      >
>      > And one thing i forgot to mention this target postgresql database
>     would
>      > be on AWS RDS whereas the source Oracle databases is on premise.
>     I think
>      > we don't have the FDW extension currently in place but we can get
>     that.
>      > I am just not able to understand clearly  though, but do you mean
>     export
>      > the data from source using CSV and do truncate and import on
>     target. And
>      > as these data will be moved through the network won't that cause
>     slowness?
>      >
>      > The source database here is Oracle database. Correct me if wrong, it
>      > looks like foreign data wrapper is like a DB link. Or do you mean
>      > writing a query on the target database (which can be UPSERT or
>     MERGE)
>      > but will be joining the table from the source database through the
>      > DBlink/DDW? But my question was whether we should use UPSERT or
>     MERGE
>      > for comparing and loading the delta records to the target postgresql
>      > database. Want to understand which is more performant , as I see
>     in the
>      > past Merge having performance issues in the past, but not very sure
>      > about that.
> 
>     My motivation was to get some basic information about your setup and
>     what you are trying to achieve.
> 
>     If I understand correctly you have:
> 
>     1) An Oracle database with tables that you want to copy the complete
>     data from to a Postgres database. For this sort of thing
>     COPY(https://www.postgresql.org/docs/current/sql-copy.html
>     <https://www.postgresql.org/docs/current/sql-copy.html>) on the
>     Postgres end using CSV data generated from the source is probably the
>     quickest bulk load method.
> 
>     2) After the initial load you want to do follow up INSERT/UPDATEs based
>     on a delta of the source tables relative to the initial load. This is
>     still a bit of mystery to me. How are determining the delta: a) On the
>     source end entirely or b) Target relative to source? Also what is the
>     anticipated size of the delta per transfer?
> 
>     Additional information needed:
> 
>     1) Network distance between source and target?
> 
>     2) Network capacity?
> 
>     3) Expected load on both source and target servers from other
>     operations?
> 
> 
> Thank you. Actually I was trying to understand how to cater the delta 
> load after the one time load is done . The delta change in records is 
> planned to be found based on the primary keys on the tables. If it found 
> the key it will update the records if it does not find the keys it will 
> insert the rows.
> 
> Basically the select query from the source database will fetch the data 
> with a certain time interval(based on the latest update timestamp or 
> create timestamp if they are available or else full dump) and put it on 
> S3 and then from the S3 it will be picked and gets merged to the target 
> postgres database. As upsert and merge both were looking similar , so 
> was wondering what we should use here for loading the delta records?

S3 is not a database. You will need to be more specific about '... then 
from the S3 it will be picked and gets merged to the target postgres 
database.'

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Moving delta data faster

From:
yudhi s
Date:

On Thu, Apr 4, 2024 at 9:04 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 4/3/24 22:24, yudhi s wrote:
>
> On Thu, Apr 4, 2024 at 10:16 AM Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:

S3 is not a database. You will need to be more specific about '... then
from the S3 it will be picked and gets merged to the target postgres
database.'


The data from S3 will be dumped into a stage table, and then the upsert/merge will run from that stage table to the actual table.
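
Something like the below is what we have in mind for the S3 --> stage 
step (just a rough sketch; it assumes the aws_s3 extension is enabled on 
the RDS instance and that its IAM role can read the bucket, and the 
bucket/file/region values are placeholders):

-- Load the CSV delta file from S3 into the staging table.
SELECT aws_s3.table_import_from_s3(
           'stage_tab',                  -- staging table
           'id, column1, column2',       -- column list
           '(FORMAT csv, HEADER true)',  -- COPY options
           aws_commons.create_s3_uri('my-bucket', 'delta/orders.csv', 'us-east-1'));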

Re: Moving delta data faster

From:
Greg Sabino Mullane
Date:
So you want advice from this open source community about moving things from a proprietary database to another proprietary database, going through a proprietary storage system along the way? :)

To answer the original question: use MERGE. That's its job, and your table size is very tiny, so I doubt the performance impact will be worth worrying about.
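
For the record, the MERGE itself is only a handful of lines, something 
along these lines (table and column names assumed, keyed on the primary 
key):

MERGE INTO target_tab t
USING stage_tab s ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET column1 = s.column1, column2 = s.column2
WHEN NOT MATCHED THEN
    INSERT (id, column1, column2) VALUES (s.id, s.column1, s.column2);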

Cheers,
Greg

Re: Moving delta data faster

From:
Adrian Klaver
Date:
On 4/4/24 13:42, yudhi s wrote:
> 
> On Thu, Apr 4, 2024 at 9:04 PM Adrian Klaver <adrian.klaver@aklaver.com 
> <mailto:adrian.klaver@aklaver.com>> wrote:
> 
>     On 4/3/24 22:24, yudhi s wrote:
>      >
>      > On Thu, Apr 4, 2024 at 10:16 AM Adrian Klaver
>     <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>      > <mailto:adrian.klaver@aklaver.com
>     <mailto:adrian.klaver@aklaver.com>>> wrote:
> 
>     S3 is not a database. You will need to be more specific about '... then
>     from the S3 it will be picked and gets merged to the target postgres
>     database.'
> 
> 
> The data from S3 will be dumped into the stage table and then the 
> upsert/merge from that table to the actual table.

The S3 --> staging table step would be helped by having the data as CSV 
and then using COPY. The staging --> final table step could be done with 
either ON CONFLICT or MERGE; you would need to test in your situation to 
verify which works better.
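
As a rough sketch of that pipeline (the table, column and file names are 
only assumptions carried over from earlier in the thread):

-- S3/CSV --> staging table, via psql client-side COPY
\copy stage_tab FROM 'orders_delta.csv' WITH (FORMAT csv, HEADER)

-- staging --> final table, ON CONFLICT variant keyed on the primary key
INSERT INTO target_tab (id, column1, column2)
SELECT id, column1, column2
FROM stage_tab
ON CONFLICT (id) DO UPDATE
    SET column1 = EXCLUDED.column1,
        column2 = EXCLUDED.column2;
-- (or the equivalent MERGE joining stage_tab to target_tab on id)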

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Moving delta data faster

From:
veem v
Date:

On Fri, 5 Apr 2024 at 06:10, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

>     S3 is not a database. You will need to be more specific about '... then
>     from the S3 it will be picked and gets merged to the target postgres
>     database.'
>
>
> The data from S3 will be dumped into the stage table and then the
> upsert/merge from that table to the actual table.

The S3 --> staging table would be helped by having the data as CSV and
then using COPY. The staging --> final table step could be done as
either ON CONFLICT or MERGE, you would need to test in your situation to
verify which works better.
 
Just a thought: in case the delta changes are really high (say >30-40% of the total number of rows in the table), could the OP also evaluate a "truncate target table + load target table" strategy? TRUNCATE is transactional in Postgres, so it can be done online without impacting the ongoing read queries, and performance-wise it would be faster than the traditional update/insert/upsert/merge.
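
Roughly like the below sketch (table names assumed). Wrapping it in a 
single transaction means the truncate only becomes visible at COMMIT, 
though TRUNCATE does take an ACCESS EXCLUSIVE lock, so concurrent 
queries on the table will queue until the transaction ends:

BEGIN;
TRUNCATE TABLE target_tab;
INSERT INTO target_tab (id, column1, column2)
SELECT id, column1, column2 FROM stage_tab;
COMMIT;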
 

Re: Moving delta data faster

From:
yudhi s
Date:
Thank you Adrian, Greg and Veem.

I tried writing a small routine to see how the performance differs across these four approaches, i.e. upsert vs. traditional update+insert vs. MERGE vs. truncate+load.

Initially I was thinking the upsert would perform about the same as MERGE, as the logic looks similar, but it seems to be the worst performing of all; I don't know the reason yet. Truncate+load seems to be the best performing of all. I hope I am doing it correctly; please correct me if I'm wrong.

UPSERT approach execution time: 00:00:20.921343
UPSERT approach rows: 1000000

insert/update approach execution time: 00:00:15.53612
insert/update approach update rows : 500000
insert/update approach Insert rows: 500000

MERGE approach execution time: 00:00:14.884623
MERGE approach rows: 1000000

truncate load approach execution time: 00:00:07.428826
truncate load rows: 1000000


********* Routine ********
*************** UPSERT Testcase **********

DROP TABLE source_tab;
DROP TABLE target_tab;

CREATE TABLE source_tab (
    id      SERIAL PRIMARY KEY,
    column1 VARCHAR(100),
    column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
    id      SERIAL PRIMARY KEY,
    column1 VARCHAR(100),
    column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT 'Value ' || i,
       'Value ' || (i * 2)
FROM generate_series(1, 1000000) AS i;

INSERT INTO target_tab (column1, column2)
SELECT 'Value ' || i,
       'Value ' || (i * 2)
FROM generate_series(1, 500000) AS i;

DO $$
DECLARE
    start_time    timestamp;
    end_time      timestamp;
    rows_inserted integer := 0;
    rows_updated  integer := 0;
    rows_upserted integer := 0;
    rows_merged   integer := 0;
BEGIN
    -- Measure performance of UPSERT
    start_time := clock_timestamp();

    INSERT INTO target_tab (id, column1, column2)
    SELECT id, column1, column2
    FROM source_tab
    ON CONFLICT (id) DO UPDATE
        SET column1 = EXCLUDED.column1,
            column2 = EXCLUDED.column2;
    GET DIAGNOSTICS rows_upserted = ROW_COUNT;

    end_time := clock_timestamp();
    RAISE NOTICE 'UPSERT approach execution time: %', end_time - start_time;
    RAISE NOTICE 'UPSERT approach rows: %', rows_upserted;

    ROLLBACK;
END $$;

*************** Traditional Insert+update Testcase **********

DROP TABLE source_tab;
DROP TABLE target_tab;

CREATE TABLE source_tab (
    id      SERIAL PRIMARY KEY,
    column1 VARCHAR(100),
    column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
    id      SERIAL PRIMARY KEY,
    column1 VARCHAR(100),
    column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT 'Value ' || i,
       'Value ' || (i * 2)
FROM generate_series(1, 1000000) AS i;

INSERT INTO target_tab (column1, column2)
SELECT 'Value ' || i,
       'Value ' || (i * 2)
FROM generate_series(1, 500000) AS i;

DO $$
DECLARE
    start_time    timestamp;
    end_time      timestamp;
    rows_inserted integer := 0;
    rows_updated  integer := 0;
    rows_upserted integer := 0;
    rows_merged   integer := 0;
BEGIN
    -- Measure performance of insert/update approach
    start_time := clock_timestamp();

    -- Update existing records
    UPDATE target_tab AS t
    SET column1 = s.column1,
        column2 = s.column2
    FROM source_tab AS s
    WHERE t.id = s.id;
    GET DIAGNOSTICS rows_updated = ROW_COUNT;

    -- Insert new records
    INSERT INTO target_tab (id, column1, column2)
    SELECT s.id, s.column1, s.column2
    FROM source_tab AS s
    LEFT JOIN target_tab AS t ON s.id = t.id
    WHERE t.id IS NULL;
    GET DIAGNOSTICS rows_inserted = ROW_COUNT;

    end_time := clock_timestamp();
    RAISE NOTICE 'insert/update approach execution time: %', end_time - start_time;
    RAISE NOTICE 'insert/update approach update rows : %', rows_updated;
    RAISE NOTICE 'insert/update approach Insert rows: %', rows_inserted;

    ROLLBACK;
END $$;

*************** MERGE Testcase **********

DROP TABLE source_tab;
DROP TABLE target_tab;

CREATE TABLE source_tab (
    id      SERIAL PRIMARY KEY,
    column1 VARCHAR(100),
    column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
    id      SERIAL PRIMARY KEY,
    column1 VARCHAR(100),
    column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT 'Value ' || i,
       'Value ' || (i * 2)
FROM generate_series(1, 1000000) AS i;

INSERT INTO target_tab (column1, column2)
SELECT 'Value ' || i,
       'Value ' || (i * 2)
FROM generate_series(1, 500000) AS i;

DO $$
DECLARE
    start_time    timestamp;
    end_time      timestamp;
    rows_inserted integer := 0;
    rows_updated  integer := 0;
    rows_upserted integer := 0;
    rows_merged   integer := 0;
BEGIN
    start_time := clock_timestamp();

    MERGE INTO target_tab t
    USING source_tab s
       ON t.id = s.id
    WHEN MATCHED THEN
        UPDATE SET column1 = s.column1,
                   column2 = s.column2
    WHEN NOT MATCHED THEN
        INSERT VALUES (s.id, s.column1, s.column2);
    GET DIAGNOSTICS rows_merged = ROW_COUNT;

    end_time := clock_timestamp();
    RAISE NOTICE 'MERGE approach execution time: %', end_time - start_time;
    RAISE NOTICE 'MERGE approach rows: %', rows_merged;

    ROLLBACK;
END $$;

*************** Truncate+load Testcase **********

DROP TABLE source_tab;
DROP TABLE target_tab;

CREATE TABLE source_tab (
    id      SERIAL PRIMARY KEY,
    column1 VARCHAR(100),
    column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
    id      SERIAL PRIMARY KEY,
    column1 VARCHAR(100),
    column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT 'Value ' || i,
       'Value ' || (i * 2)
FROM generate_series(1, 1000000) AS i;

INSERT INTO target_tab (column1, column2)
SELECT 'Value ' || i,
       'Value ' || (i * 2)
FROM generate_series(1, 500000) AS i;

DO $$
DECLARE
    start_time           timestamp;
    end_time             timestamp;
    rows_inserted        integer := 0;
    rows_updated         integer := 0;
    rows_upserted        integer := 0;
    rows_truncate_loaded integer := 0;
BEGIN
    start_time := clock_timestamp();

    TRUNCATE TABLE target_tab;

    INSERT INTO target_tab (id, column1, column2)
    SELECT s.id, s.column1, s.column2
    FROM source_tab AS s;
    GET DIAGNOSTICS rows_truncate_loaded = ROW_COUNT;

    end_time := clock_timestamp();
    RAISE NOTICE 'truncate load approach execution time: %', end_time - start_time;
    RAISE NOTICE 'truncate load rows: %', rows_truncate_loaded;

    ROLLBACK;
END $$;

Re: Moving delta data faster

From:
Adrian Klaver
Date:
On 4/6/24 08:47, yudhi s wrote:
> Thank you Adrian, Greg and Veem.
> 
> I tried writing a small routine to see how the performance differs in 
> these four approaches i.e. Upsert VS traditional update+insert VS Merge 
> vs Truncate+load.
> 
> Initially I was thinking Upsert will perform the same as Merge as the 
> logic looks similar but it seems it's the worst performing among all, 
> not sure why , yet to know the reason though. Truncate+ load seems to be 
> the best performing among all. Hope i am doing it correctly. Please 
> correct me if I'm wrong.

Your original problem description was:

"Then subsequently these rows will be inserted/updated based on the 
delta number of rows that got inserted/updated in the source database. 
In some cases these changed data can flow multiple times per day to the 
downstream i.e. postgres database and in other cases once daily."

If the above is not a hard rule, then yes, up to some point just 
replacing the data en masse would be the simplest/fastest method. You 
could cut a step out by doing something like TRUNCATE target_tab and 
then COPY target_tab FROM 'source.csv', bypassing the INSERT INTO source_tab.
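
Roughly, as a sketch (run from psql with client-side \copy, since an RDS 
instance cannot read a server-side file; the file name is just a 
placeholder):

BEGIN;
TRUNCATE target_tab;
\copy target_tab FROM 'source.csv' WITH (FORMAT csv)
COMMIT;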



-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Moving delta data faster

From:
yudhi s
Date:

On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

Your original problem description was:

"Then subsequently these rows will be inserted/updated based on the
delta number of rows that got inserted/updated in the source database.
In some cases these changed data can flow multiple times per day to the
downstream i.e. postgres database and in other cases once daily."

If the above is not a hard rule, then yes up to some point just
replacing the data in mass would be the simplest/fastest method. You
could cut a step out by doing something like TRUNCATE target_tab and
then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO source_tab.

Yes, actually I didn't realize that TRUNCATE is transactional/online here in Postgres. In other databases like Oracle it means downtime for the read queries on the target table, as the data is gone from the target table after the truncate (until the data load happens) and truncates auto-commit. Thanks Veem for sharing that option.

I also think that truncate will be faster if the changes/delta are large, but if it is a handful of rows, say <5% of the rows in the table, then upsert/merge will perform better. The other downside of the truncate option is that it requires exporting all of the data from the source to the S3 file, which may take longer than bringing just the delta records. Correct me if I'm wrong.

However, I am still not able to understand why the upsert is less performant than merge. Could you throw some light on this, please?

Re: Moving delta data faster

From:
Adrian Klaver
Date:
On 4/6/24 13:04, yudhi s wrote:
> 
> On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver <adrian.klaver@aklaver.com 
> <mailto:adrian.klaver@aklaver.com>> wrote:
> 
> 
>     Your original problem description was:
> 
>     "Then subsequently these rows will be inserted/updated based on the
>     delta number of rows that got inserted/updated in the source database.
>     In some cases these changed data can flow multiple times per day to the
>     downstream i.e. postgres database and in other cases once daily."
> 
>     If the above is not a hard rule, then yes up to some point just
>     replacing the data in mass would be the simplest/fastest method. You
>     could cut a step out by doing something like TRUNCATE target_tab and
>     then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO
>     source_tab.
> 
> Yes, actually i didn't realize that truncate table transactional/online 
> here in postgres. In other databases like Oracle its downtime for the 
> read queries on the target table, as data will be vanished from the 
> target table post truncate(until the data load happens) and those are 
> auto commit. Thanks Veem for sharing that  option.
> 
>   I also think that truncate will be faster if the changes/delta is 
> large , but if its handful of rows like <5%of the rows in the table then 
> Upsert/Merge will be better performant. And also the down side of the 
> truncate option is,  it does ask to bring/export all the data from 
> source to the S3 file which may take longer as compared to bringing just 
> the delta records. Correct me if I'm wrong.

Since you still have not specified how the data is stored in S3 and how 
you propose to move them into Postgres I can't really answer.

> 
> However I am still not able to understand why the upsert is less 
> performant than merge, could you throw some light on this please?
> 

I have no idea how this works in the code, but my suspicion is it is due
to the following:

https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT

"The optional ON CONFLICT clause specifies an alternative action to 
raising a unique violation or exclusion constraint violation error. For 
each individual row proposed for insertion, either the insertion 
proceeds, or, if an arbiter constraint or index specified by 
conflict_target is violated, the alternative conflict_action is taken. 
ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative 
action. ON CONFLICT DO UPDATE updates the existing row that conflicts 
with the row proposed for insertion as its alternative action."

vs this:

"First, the MERGE command performs a join from data_source to 
target_table_name producing zero or more candidate change rows. For each 
candidate change row, the status of MATCHED or NOT MATCHED is set just 
once, after which WHEN clauses are evaluated in the order specified. For 
each candidate change row, the first clause to evaluate as true is 
executed. No more than one WHEN clause is executed for any candidate 
change row."

Where ON CONFLICT attempts the INSERT and then, on failure, does the 
UPDATE for the ON CONFLICT DO UPDATE case. MERGE, on the other hand, 
evaluates the join condition (ON tbl1.fld = tbl2.fld) and then, based on 
MATCHED/NOT MATCHED, takes the appropriate action for the first WHEN 
match. In other words it goes directly to the appropriate action.
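
If you want to see where the time goes, you can run both statements over 
the same data (using the table names from your test) under 
EXPLAIN (ANALYZE, BUFFERS), for example. Note that ANALYZE actually 
executes the DML, so wrap it in a transaction and roll back if you do 
not want to keep the changes:

EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO target_tab (id, column1, column2)
SELECT id, column1, column2 FROM source_tab
ON CONFLICT (id) DO UPDATE
    SET column1 = EXCLUDED.column1,
        column2 = EXCLUDED.column2;

EXPLAIN (ANALYZE, BUFFERS)
MERGE INTO target_tab t
USING source_tab s ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET column1 = s.column1, column2 = s.column2
WHEN NOT MATCHED THEN
    INSERT (id, column1, column2) VALUES (s.id, s.column1, s.column2);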

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Moving delta data faster

From:
yudhi s
Date:

On Sun, Apr 7, 2024 at 2:25 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

I have no idea how this works in the code, but my suspicion is it is due
to the following:

https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT

"The optional ON CONFLICT clause specifies an alternative action to
raising a unique violation or exclusion constraint violation error. For
each individual row proposed for insertion, either the insertion
proceeds, or, if an arbiter constraint or index specified by
conflict_target is violated, the alternative conflict_action is taken.
ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative
action. ON CONFLICT DO UPDATE updates the existing row that conflicts
with the row proposed for insertion as its alternative action."

vs this:

"First, the MERGE command performs a join from data_source to
target_table_name producing zero or more candidate change rows. For each
candidate change row, the status of MATCHED or NOT MATCHED is set just
once, after which WHEN clauses are evaluated in the order specified. For
each candidate change row, the first clause to evaluate as true is
executed. No more than one WHEN clause is executed for any candidate
change row."

Where ON CONFLICT attempts the INSERT then on failure does the UPDATE
for the ON CONFLICT DO UPDATE case. MERGE on the hand evaluates based on
the join condition(ON tbl1.fld =tbl2.fld) and then based on MATCH/NOT
MATCHED takes the appropriate action for the first WHEN match. In other
words it goes directly to the appropriate action.


Thank you Adrian. I think you are spot on about the cause of the upsert being slower than MERGE. Below are the explain plans I captured for both operations. Although the planning time is small for the UPSERT, it operates on the constraint, i.e. the UPDATE has to wait for the INSERT to fail on the primary key for every conflicting record, and thus it takes longer. MERGE is evaluated on the join, i.e. it can directly get the set of rows that have to be updated rather than waiting for the INSERT to fail on the PK constraint.

************** Explain plan for UPSERT *****************

Insert on public.target_tab  (cost=0.00..17353.00 rows=0 width=0) (actual time=19957.569..19957.570 rows=0 loops=1)
  Conflict Resolution: UPDATE
  Conflict Arbiter Indexes: target_tab_pkey
  Tuples Inserted: 500000
  Conflicting Tuples: 500000
  Buffers: shared hit=8545735 written=10094
  ->  Seq Scan on public.source_tab  (cost=0.00..17353.00 rows=1000000 width=29) (actual time=0.006..208.306 rows=1000000 loops=1)
        Output: source_tab.id, source_tab.column1, source_tab.column2
        Buffers: shared hit=7353
Settings: effective_cache_size = '10475192kB', maintenance_io_concurrency = '1', max_parallel_workers = '32', max_parallel_workers_per_gather = '4', search_path = 'public, public, "$user"', temp_buffers = '16MB', work_mem = '8MB'
Query Identifier: -1356019529835809419
Planning:
  Buffers: shared hit=41
Planning Time: 0.199 ms
Execution Time: 19959.261 ms

************** Explain plan for Merge *****************

Merge on public.target_tab t  (cost=17368.00..53460.01 rows=0 width=0) (actual time=14209.966..14209.968 rows=0 loops=1)
  Tuples: inserted=500000 updated=500000
  Buffers: shared hit=5040097 written=10460, temp read=4143 written=4143
  I/O Timings: temp read=26.746 write=68.596
  ->  Hash Left Join  (cost=17368.00..53460.01 rows=1000000 width=35) (actual time=179.233..1332.264 rows=1000000 loops=1)
        Output: t.ctid, s.column1, s.column2, s.id
        Inner Unique: true
        Hash Cond: (s.id = t.id)
        Buffers: shared hit=11029, temp read=4143 written=4143
        I/O Timings: temp read=26.746 write=68.596
        ->  Seq Scan on public.source_tab s  (cost=0.00..17353.00 rows=1000000 width=29) (actual time=0.008..268.506 rows=1000000 loops=1)
              Output: s.column1, s.column2, s.id
              Buffers: shared hit=7353
        ->  Hash  (cost=8676.00..8676.00 rows=500000 width=10) (actual time=178.101..178.102 rows=500000 loops=1)
              Output: t.ctid, t.id
              Buckets: 524288  Batches: 2  Memory Usage: 14824kB
              Buffers: shared hit=3676, temp written=977
              I/O Timings: temp write=5.904
              ->  Seq Scan on public.target_tab t  (cost=0.00..8676.00 rows=500000 width=10) (actual time=0.007..66.441 rows=500000 loops=1)
                    Output: t.ctid, t.id
                    Buffers: shared hit=3676
Settings: effective_cache_size = '10475192kB', maintenance_io_concurrency = '1', max_parallel_workers = '32', max_parallel_workers_per_gather = '4', search_path = 'public, public, "$user"', temp_buffers = '16MB', work_mem = '8MB'
Query Identifier: -2297080081674771467
Planning:
  Buffers: shared hit=85
Planning Time: 0.466 ms
Execution Time: 14212.061 ms