Thread: Copy Bulk Ignore Duplicated


Copy Bulk Ignore Duplicated

From: Leandro Guimarães
Date:
Hi,
 
   I have a scenario with a large table and I'm trying to insert into it via a COPY command from a CSV file.

   Everything works, but sometimes my source .csv file has data that is duplicated in the previously populated table. If I add a check constraint and try to run the COPY command, I get an error that stops the whole insertion.

  I've tried putting the data in a tmp table and filling the main table using DISTINCT, like this (the fields and names are just examples):

INSERT INTO final_table (name, document)
   SELECT DISTINCT name, document
   FROM tmp_TABLE t1
   WHERE NOT EXISTS (
   SELECT 1 FROM final_table t2
   WHERE (t2.name, t2.document)
   IS NOT DISTINCT FROM (t1.name, t1.document))

The problem is that my final_table is a large (and partitioned) table and this query is taking a long time to execute.

Does anyone have any idea (really, guys, anything would be great) how to solve this situation? I need to ignore the duplicates instead of getting an error.

I'm using PostgreSQL 9.4, so I can't use "ON CONFLICT", and upgrading is not an option.

Thanks and Kind Regards!


Leandro Guimarães

Re: Copy Bulk Ignore Duplicated

From: Adrian Klaver
Date:
On 6/14/19 2:04 PM, Leandro Guimarães wrote:
> Hi,
>     I have a scenario with a large table and I'm trying to insert it via 
> a COPY command with a csv file.
> 
>     Everything works, but sometimes my source .csv file has duplicated 
> data in the previously fulfilled table. If I add a check constraint and 
> try to run the COPY command I have an error that stops the whole insertion.
> 
>    I've tried to put the data in a tmp table and fill the main using 
> distinct this way (the fields and names are just examples):
> 
> INSERT INTO final_table values (name, document)
>     SELECT DISTINCT name, document
>     FROM tmp_TABLE t1
>     WHERE NOT EXISTS (
>     SELECT 1 FROM final_table t2
>     WHERE (t2.name, t2.document)
>     IS NOT DISTINCT FROM (t1.name, t1.document))
> 
> The problem is that my final_table is a large (and partitioned) table 
> and this query is taking a long time to execute.
> 
> Someone have any idea (really guys anything would be great) how to solve 
> this situation? I need to ignore duplicates instead to have some error.

pg_bulkload?:

https://ossc-db.github.io/pg_bulkload/pg_bulkload.html
"PARSE_ERRORS = n
     The number of ignored tuples that throw errors during parsing, 
encoding checks, encoding conversion, FILTER function, CHECK constraint 
checks, NOT NULL checks, or data type conversion. Invalid input tuples 
are not loaded and recorded in the PARSE BADFILE. The default is 0. If 
there are equal or more parse errors than the value, already loaded data 
is committed and the remaining tuples are not loaded. 0 means to allow 
no errors, and -1 and INFINITE mean to ignore all errors. "
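
A rough sketch of a control file for this kind of load, based on the
documentation linked above (paths and the table name are placeholders,
and the duplicate-handling keys in particular should be double-checked
against that page):

     OUTPUT = public.final_table        # target table
     INPUT = /path/to/source.csv        # CSV file to load
     TYPE = CSV
     DELIMITER = ","
     WRITER = DIRECT
     PARSE_ERRORS = -1                  # ignore rows failing parse/constraint checks
     DUPLICATE_ERRORS = -1              # ignore rows that violate the unique index
     DUPLICATE_BADFILE = /tmp/dups.csv  # rejected duplicates are written here

It would then be run with something like "pg_bulkload load.ctl -d your_database".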

> 
> I'm using *PostgreSQL 9.4* so I can't use "ON CONFLICT" and upgrade is 
> not an option.
> 
> Thanks and Kind Regards!
> 
> 
> Leandro Guimarães
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Copy Bulk Ignore Duplicated

From: Adrian Klaver
Date:
On 6/14/19 2:04 PM, Leandro Guimarães wrote:
> Hi,
>     I have a scenario with a large table and I'm trying to insert it via 
> a COPY command with a csv file.
> 
>     Everything works, but sometimes my source .csv file has duplicated 
> data in the previously fulfilled table. If I add a check constraint and 
> try to run the COPY command I have an error that stops the whole insertion.
> 
>    I've tried to put the data in a tmp table and fill the main using 
> distinct this way (the fields and names are just examples):
> 
> INSERT INTO final_table values (name, document)
>     SELECT DISTINCT name, document
>     FROM tmp_TABLE t1
>     WHERE NOT EXISTS (
>     SELECT 1 FROM final_table t2
>     WHERE (t2.name, t2.document)
>     IS NOT DISTINCT FROM (t1.name, t1.document))

Hit enter too soon on the previous post.
An alternative to the above query (rough code):

DELETE FROM tmp_TABLE where <check constraint parameters>;
INSERT INTO final_table SELECT * FROM tmp_TABLE;
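
Spelled out with the example column names, that could look something like
this (just a sketch; it assumes name/document are the constrained columns,
and a DISTINCT or prior cleanup of tmp_TABLE is still needed if the temp
table can contain duplicates within itself):

DELETE FROM tmp_TABLE t1
USING final_table t2
WHERE (t2.name, t2.document) IS NOT DISTINCT FROM (t1.name, t1.document);

INSERT INTO final_table (name, document)
SELECT name, document FROM tmp_TABLE;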

> 
> The problem is that my final_table is a large (and partitioned) table 
> and this query is taking a long time to execute.
> 
> Someone have any idea (really guys anything would be great) how to solve 
> this situation? I need to ignore duplicates instead to have some error.
> 
> I'm using *PostgreSQL 9.4* so I can't use "ON CONFLICT" and upgrade is 
> not an option.
> 
> Thanks and Kind Regards!
> 
> 
> Leandro Guimarães
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Copy Bulk Ignore Duplicated

From: Tim Cross
Date:
Leandro Guimarães <leo.guimaraes@gmail.com> writes:

> Hi,
>
>    I have a scenario with a large table and I'm trying to insert it via a
> COPY command with a csv file.
>
>    Everything works, but sometimes my source .csv file has duplicated data
> in the previously fulfilled table. If I add a check constraint and try to
> run the COPY command I have an error that stops the whole insertion.
>
>   I've tried to put the data in a tmp table and fill the main using
> distinct this way (the fields and names are just examples):
>
> INSERT INTO final_table values (name, document)
>    SELECT DISTINCT name, document
>    FROM tmp_TABLE t1
>    WHERE NOT EXISTS (
>    SELECT 1 FROM final_table t2
>    WHERE (t2.name, t2.document)
>    IS NOT DISTINCT FROM (t1.name, t1.document))
>
> The problem is that my final_table is a large (and partitioned) table and
> this query is taking a long time to execute.
>
> Someone have any idea (really guys anything would be great) how to solve
> this situation? I need to ignore duplicates instead to have some error.
>
> I'm using* PostgreSQL 9.4* so I can't use "ON CONFLICT" and upgrade is not
> an option.
>

Explain plan would probably shed some light, but I suspect your
performance is being heavily hit by the sub query. Distinct is an
expensive operation and you are performing it once for every distinct row
in your temp table.

It isn't clear what the primary key is for your final table - name +
document seems suspicious given these seem to be the only two columns
you're inserting as well. You don't indicate what the data types are
either - if document is something like 'text' then using it in a
distinct clause is likely to have a huge performance impact.

The first thing I'd do is to eliminate duplicates from your temp table
as a separate statement or by pre-filtering the CSV before import. I
would then try something like an outer join to identify rows in your
temp table which don't exist in your final table and select from there
to insert into the final table. You don't really need the distinct in
the sub query as all you really need to know is if (name, document)
exists - it doesn't matter if more than one exists (for this test).
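
For illustration only, with the example column names and assuming the key
columns are NOT NULL (and that duplicates inside the temp table have
already been removed), the outer-join form could look like:

INSERT INTO final_table (name, document)
SELECT t1.name, t1.document
FROM tmp_TABLE t1
LEFT JOIN final_table t2
       ON t2.name = t1.name AND t2.document = t1.document
WHERE t2.name IS NULL;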

If you really don't have something more specific for a primary key,
depending on what data type 'document' is and how large it is, you may
find adding a column which is a checksum of your 'document' field a
useful addition. I have done this in the past where I had an application
where name was not unique and we only wanted distinct instances of
'document' (document was a fairly large XML document in this case).
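
Something along these lines, purely as a sketch (it assumes 'document' is
text, so the built-in md5() function applies; the column and index are
made-up names):

ALTER TABLE final_table ADD COLUMN document_md5 text;
UPDATE final_table SET document_md5 = md5(document);
-- then compare/join on the short checksum instead of the full document
CREATE INDEX ON final_table (document_md5);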

--
Tim Cross



Re: Copy Bulk Ignore Duplicated

From: Leandro Guimarães
Date:
Hi Tim, thanks for your answer!

The columns were just examples, but let me explain the database structure; the fields in *bold* are the keys:

*customer_id integer*
*date_time timestamp*
*indicator_id integer*
*element_id integer*
indicator_value double precision

The table is partitioned per day and customer_id (it works great); the problem is just the duplicated-key situation that I'm really worried about.

I populate the database via a Java Application with JDBC.

Maybe this info could help to provide some light!

Thanks Again!

Leandro Guimarães 



On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx@gmail.com> wrote:

Leandro Guimarães <leo.guimaraes@gmail.com> writes:

> Hi,
>
>    I have a scenario with a large table and I'm trying to insert it via a
> COPY command with a csv file.
>
>    Everything works, but sometimes my source .csv file has duplicated data
> in the previously fulfilled table. If I add a check constraint and try to
> run the COPY command I have an error that stops the whole insertion.
>
>   I've tried to put the data in a tmp table and fill the main using
> distinct this way (the fields and names are just examples):
>
> INSERT INTO final_table values (name, document)
>    SELECT DISTINCT name, document
>    FROM tmp_TABLE t1
>    WHERE NOT EXISTS (
>    SELECT 1 FROM final_table t2
>    WHERE (t2.name, t2.document)
>    IS NOT DISTINCT FROM (t1.name, t1.document))
>
> The problem is that my final_table is a large (and partitioned) table and
> this query is taking a long time to execute.
>
> Someone have any idea (really guys anything would be great) how to solve
> this situation? I need to ignore duplicates instead to have some error.
>
> I'm using* PostgreSQL 9.4* so I can't use "ON CONFLICT" and upgrade is not
> an option.
>

Explain plan would probably shed some light, but I suspect your
performance is being heavily hit by the sub query. Distinct is an
expensive operation and you are performing it once for every distinct row
in your temp table.

It isn't clear what the primary key is for your final table - name +
document seems suspicious given these seem to be the only two columns
your inserting as well. You don't indicate what the data types are
either - it document is something like 'text' then using it in a
distinct clause is likely to have huge performance impact.

The first thing I'd do is to eliminate duplicates from your temp table
as a separate statement or by pre-filtering the CSV before import. I
would then try something like an outer join to identify rows in your
temp table which don't exist in your final table and select from there
to insert into the final table. You don't really need the distinct in
the sub query as all you really need to know is if (name, document)
exists - it doesn't matter if more than one exists (for this test).

If you really don't have something more specific for a primary key,
depending on what data type 'document' is and how large it is, you may
find adding a column which is a checksum of your 'document' field a
useful addition. I have done this in the past where I had an application
where name was not unique and we only wanted distinct instances of
'document' (document was a fairly large XML document in this case).

--
Tim Cross


Re: Copy Bulk Ignore Duplicated

From: Leandro Guimarães
Date:
Hi Adrian,
  I'll take a look at pg_bulkload, but I populate the database via a Java application with JDBC.

  I'll try the query you kindly sent to me!

Thanks!
Leandro Guimarães



On Fri, Jun 14, 2019 at 6:59 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/14/19 2:04 PM, Leandro Guimarães wrote:
> Hi,
>     I have a scenario with a large table and I'm trying to insert it via
> a COPY command with a csv file.
>
>     Everything works, but sometimes my source .csv file has duplicated
> data in the previously fulfilled table. If I add a check constraint and
> try to run the COPY command I have an error that stops the whole insertion.
>
>    I've tried to put the data in a tmp table and fill the main using
> distinct this way (the fields and names are just examples):
>
> INSERT INTO final_table values (name, document)
>     SELECT DISTINCT name, document
>     FROM tmp_TABLE t1
>     WHERE NOT EXISTS (
>     SELECT 1 FROM final_table t2
>     WHERE (t2.name, t2.document)
>     IS NOT DISTINCT FROM (t1.name, t1.document))
>
> The problem is that my final_table is a large (and partitioned) table
> and this query is taking a long time to execute.
>
> Someone have any idea (really guys anything would be great) how to solve
> this situation? I need to ignore duplicates instead to have some error.

pg_bulkload?:

https://ossc-db.github.io/pg_bulkload/pg_bulkload.html
"PARSE_ERRORS = n
     The number of ingored tuples that throw errors during parsing,
encoding checks, encoding conversion, FILTER function, CHECK constraint
checks, NOT NULL checks, or data type conversion. Invalid input tuples
are not loaded and recorded in the PARSE BADFILE. The default is 0. If
there are equal or more parse errors than the value, already loaded data
is committed and the remaining tuples are not loaded. 0 means to allow
no errors, and -1 and INFINITE mean to ignore all errors. "

>
> I'm using *PostgreSQL 9.4* so I can't use "ON CONFLICT" and upgrade is
> not an option.
>
> Thanks and Kind Regards!
>
>
> Leandro Guimarães
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Copy Bulk Ignore Duplicated

From: Adrian Klaver
Date:
On 6/14/19 7:24 PM, Leandro Guimarães wrote:
> Hi Tim, thanks for you answer!
> 
> The columns were just examples, but let me explain the database 
> structure, the fields in *bold are the keys*:
> 
> *customer_id integer*
> *date_time timestamp*
> *indicator_id integer*
> *element_id integer*
> indicator_value double precision

Huh, earlier you said you had a check constraint that was causing issues.

Does that also exist or were you referring to the keys above?

Are the keys above formally defined as the PRIMARY KEY?
-
> 
> The table is partitioned per day and customer_id (it works great) the 
> problem is just the duplicated key situation that I'm really worried about.
> 
> I populate the database via a Java Application with JDBC.
> 
> Maybe this info could help to provide some light!
> 
> Thanks Again!
> 
> Leandro Guimarães
> 
> 
> 
> On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx@gmail.com> wrote:
> 
> 
>     Leandro Guimarães <leo.guimaraes@gmail.com> writes:
> 
>      > Hi,
>      >
>      >    I have a scenario with a large table and I'm trying to insert
>     it via a
>      > COPY command with a csv file.
>      >
>      >    Everything works, but sometimes my source .csv file has
>     duplicated data
>      > in the previously fulfilled table. If I add a check constraint
>     and try to
>      > run the COPY command I have an error that stops the whole insertion.
>      >
>      >   I've tried to put the data in a tmp table and fill the main using
>      > distinct this way (the fields and names are just examples):
>      >
>      > INSERT INTO final_table values (name, document)
>      >    SELECT DISTINCT name, document
>      >    FROM tmp_TABLE t1
>      >    WHERE NOT EXISTS (
>      >    SELECT 1 FROM final_table t2
>      >    WHERE (t2.name, t2.document)
>      >    IS NOT DISTINCT FROM (t1.name, t1.document))
>      >
>      > The problem is that my final_table is a large (and partitioned)
>     table and
>      > this query is taking a long time to execute.
>      >
>      > Someone have any idea (really guys anything would be great) how
>     to solve
>      > this situation? I need to ignore duplicates instead to have some
>     error.
>      >
>      > I'm using* PostgreSQL 9.4* so I can't use "ON CONFLICT" and
>     upgrade is not
>      > an option.
>      >
> 
>     Explain plan would probably shed some light, but I suspect your
>     performance is being heavily hit by the sub query. Distinct is an
>     expensive operation and you are performing it once for every
>     distinct row
>     in your temp table.
> 
>     It isn't clear what the primary key is for your final table - name +
>     document seems suspicious given these seem to be the only two columns
>     your inserting as well. You don't indicate what the data types are
>     either - it document is something like 'text' then using it in a
>     distinct clause is likely to have huge performance impact.
> 
>     The first thing I'd do is to eliminate duplicates from your temp table
>     as a separate statement or by pre-filtering the CSV before import. I
>     would then try something like an outer join to identify rows in your
>     temp table which don't exist in your final table and select from there
>     to insert into the final table. You don't really need the distinct in
>     the sub query as all you really need to know is if (name, document)
>     exists - it doesn't matter if more than one exists (for this test).
> 
>     If you really don't have something more specific for a primary key,
>     depending on what data type 'document' is and how large it is, you may
>     find adding a column which is a checksum of your 'document' field a
>     useful addition. I have done this in the past where I had an application
>     where name was not unique and we only wanted distinct instances of
>     'document' (document was a fairly large XML document in this case).
> 
>     --
>     Tim Cross
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Copy Bulk Ignore Duplicated

From: Leandro Guimarães
Date:
Hi Adrian,

   You are right, these fields are in CHECK CONSTRAINTS and they are not formally defined as Primary Keys.

Thanks!
Leandro Guimarães



On Sat, Jun 15, 2019 at 10:45 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/14/19 7:24 PM, Leandro Guimarães wrote:
> Hi Tim, thanks for you answer!
>
> The columns were just examples, but let me explain the database
> structure, the fields in *bold are the keys*:
>
> *customer_id integer*
> *date_time timestamp*
> *indicator_id integer*
> *element_id integer*
> indicator_value double precision

Huh, earlier you said you had a check constraint that was causing issues.

Does that also exist or where you referring to the keys above?

Are the keys above formally defined as the PRIMARY KEY?
-
>
> The table is partitioned per day and customer_id (it works great) the
> problem is just the duplicated key situation that I'm really worried about.
>
> I populate the database via a Java Application with JDBC.
>
> Maybe this info could help to provide some light!
>
> Thanks Again!
>
> Leandro Guimarães
>
>
>
> On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx@gmail.com> wrote:
>
>
>     Leandro Guimarães <leo.guimaraes@gmail.com> writes:
>
>      > Hi,
>      >
>      >    I have a scenario with a large table and I'm trying to insert
>     it via a
>      > COPY command with a csv file.
>      >
>      >    Everything works, but sometimes my source .csv file has
>     duplicated data
>      > in the previously fulfilled table. If I add a check constraint
>     and try to
>      > run the COPY command I have an error that stops the whole insertion.
>      >
>      >   I've tried to put the data in a tmp table and fill the main using
>      > distinct this way (the fields and names are just examples):
>      >
>      > INSERT INTO final_table values (name, document)
>      >    SELECT DISTINCT name, document
>      >    FROM tmp_TABLE t1
>      >    WHERE NOT EXISTS (
>      >    SELECT 1 FROM final_table t2
>      >    WHERE (t2.name, t2.document)
>      >    IS NOT DISTINCT FROM (t1.name, t1.document))
>      >
>      > The problem is that my final_table is a large (and partitioned)
>     table and
>      > this query is taking a long time to execute.
>      >
>      > Someone have any idea (really guys anything would be great) how
>     to solve
>      > this situation? I need to ignore duplicates instead to have some
>     error.
>      >
>      > I'm using* PostgreSQL 9.4* so I can't use "ON CONFLICT" and
>     upgrade is not
>      > an option.
>      >
>
>     Explain plan would probably shed some light, but I suspect your
>     performance is being heavily hit by the sub query. Distinct is an
>     expensive operation and you are performing it once for every
>     distinct row
>     in your temp table.
>
>     It isn't clear what the primary key is for your final table - name +
>     document seems suspicious given these seem to be the only two columns
>     your inserting as well. You don't indicate what the data types are
>     either - it document is something like 'text' then using it in a
>     distinct clause is likely to have huge performance impact.
>
>     The first thing I'd do is to eliminate duplicates from your temp table
>     as a separate statement or by pre-filtering the CSV before import. I
>     would then try something like an outer join to identify rows in your
>     temp table which don't exist in your final table and select from there
>     to insert into the final table. You don't really need the distinct in
>     the sub query as all you really need to know is if (name, document)
>     exists - it doesn't matter if more than one exists (for this test).
>
>     If you really don't have something more specific for a primary key,
>     depending on what data type 'document' is and how large it is, you may
>     find adding a column which is a checksum of your 'document' field a
>     useful addition. I have done this in the past where I had an application
>     where name was not unique and we only wanted distinct instances of
>     'document' (document was a fairly large XML document in this case).
>
>     --
>     Tim Cross
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Copy Bulk Ignore Duplicated

From: Adrian Klaver
Date:
On 6/17/19 8:14 AM, Leandro Guimarães wrote:
> Hi Adrian,
> 
>     You are right, these fields are in CHECK CONSTRAiNTS and they are 
> not formally defined as Primary Keys.

Alright. Two things:

1) If you are thinking of them as keys, why not make them a PK or a 
UNIQUE index?

2) Still not clear to me whether you are looking for duplicated 
information within a row or between rows?
To put it another way, what are the CHECK constraints doing?
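
To illustrate the difference with the column names you posted earlier (the
constraint names and the CHECK condition below are only examples):

-- a CHECK constraint validates values within a single row
ALTER TABLE final_table
    ADD CONSTRAINT indicator_value_non_negative CHECK (indicator_value >= 0);

-- a UNIQUE constraint compares rows against each other, which is what
-- catches duplicated keys
ALTER TABLE final_table
    ADD CONSTRAINT final_table_keys
    UNIQUE (customer_id, date_time, indicator_id, element_id);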


> 
> Thanks!
> Leandro Guimarães
> 
> 
> 
> On Sat, Jun 15, 2019 at 10:45 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> 
>     On 6/14/19 7:24 PM, Leandro Guimarães wrote:
>      > Hi Tim, thanks for you answer!
>      >
>      > The columns were just examples, but let me explain the database
>      > structure, the fields in *bold are the keys*:
>      >
>      > *customer_id integer*
>      > *date_time timestamp*
>      > *indicator_id integer*
>      > *element_id integer*
>      > indicator_value double precision
> 
>     Huh, earlier you said you had a check constraint that was causing
>     issues.
> 
>     Does that also exist or where you referring to the keys above?
> 
>     Are the keys above formally defined as the PRIMARY KEY?
>     -
>      >
>      > The table is partitioned per day and customer_id (it works great)
>     the
>      > problem is just the duplicated key situation that I'm really
>     worried about.
>      >
>      > I populate the database via a Java Application with JDBC.
>      >
>      > Maybe this info could help to provide some light!
>      >
>      > Thanks Again!
>      >
>      > Leandro Guimarães
>      >
>      >
>      >
>      > On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx@gmail.com> wrote:
>      >
>      >
>      >     Leandro Guimarães <leo.guimaraes@gmail.com> writes:
>      >
>      >      > Hi,
>      >      >
>      >      >    I have a scenario with a large table and I'm trying to
>     insert
>      >     it via a
>      >      > COPY command with a csv file.
>      >      >
>      >      >    Everything works, but sometimes my source .csv file has
>      >     duplicated data
>      >      > in the previously fulfilled table. If I add a check constraint
>      >     and try to
>      >      > run the COPY command I have an error that stops the whole
>     insertion.
>      >      >
>      >      >   I've tried to put the data in a tmp table and fill the
>     main using
>      >      > distinct this way (the fields and names are just examples):
>      >      >
>      >      > INSERT INTO final_table values (name, document)
>      >      >    SELECT DISTINCT name, document
>      >      >    FROM tmp_TABLE t1
>      >      >    WHERE NOT EXISTS (
>      >      >    SELECT 1 FROM final_table t2
>      >      >    WHERE (t2.name, t2.document)
>      >      >    IS NOT DISTINCT FROM (t1.name, t1.document))
>      >      >
>      >      > The problem is that my final_table is a large (and
>     partitioned)
>      >     table and
>      >      > this query is taking a long time to execute.
>      >      >
>      >      > Someone have any idea (really guys anything would be
>     great) how
>      >     to solve
>      >      > this situation? I need to ignore duplicates instead to
>     have some
>      >     error.
>      >      >
>      >      > I'm using* PostgreSQL 9.4* so I can't use "ON CONFLICT" and
>      >     upgrade is not
>      >      > an option.
>      >      >
>      >
>      >     Explain plan would probably shed some light, but I suspect your
>      >     performance is being heavily hit by the sub query. Distinct is an
>      >     expensive operation and you are performing it once for every
>      >     distinct row
>      >     in your temp table.
>      >
>      >     It isn't clear what the primary key is for your final table -
>     name +
>      >     document seems suspicious given these seem to be the only two
>     columns
>      >     your inserting as well. You don't indicate what the data
>     types are
>      >     either - it document is something like 'text' then using it in a
>      >     distinct clause is likely to have huge performance impact.
>      >
>      >     The first thing I'd do is to eliminate duplicates from your
>     temp table
>      >     as a separate statement or by pre-filtering the CSV before
>     import. I
>      >     would then try something like an outer join to identify rows
>     in your
>      >     temp table which don't exist in your final table and select
>     from there
>      >     to insert into the final table. You don't really need the
>     distinct in
>      >     the sub query as all you really need to know is if (name,
>     document)
>      >     exists - it doesn't matter if more than one exists (for this
>     test).
>      >
>      >     If you really don't have something more specific for a
>     primary key,
>      >     depending on what data type 'document' is and how large it
>     is, you may
>      >     find adding a column which is a checksum of your 'document'
>     field a
>      >     useful addition. I have done this in the past where I had an
>     application
>      >     where name was not unique and we only wanted distinct
>     instances of
>      >     'document' (document was a fairly large XML document in this
>     case).
>      >
>      >     --
>      >     Tim Cross
>      >
>      >
> 
> 
>     -- 
>     Adrian Klaver
>     adrian.klaver@aklaver.com
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Copy Bulk Ignore Duplicated

From: Adrian Klaver
Date:
On 6/17/19 9:06 AM, Leandro Guimarães wrote:
Please reply to list also.
Ccing list.
> Ugh My bad again.
> 
> They are UNIQUE:
> CONSTRAINT unique_const_value_20190501_45 UNIQUE (customer_id, 
> date_time, indicator_id, element_id),
> 
> I've made a mistake typing "check constraint" before because these are 
> partitioned tables and I have the CHECK CONSTRAINT to partition.
> 
> Is that clear?

Yes.

To get back to the original issue, the problem is that when you COPY in 
new data you may get rows that conflict on the above UNIQUE constraint, 
correct?

> 
> Thanks for your patience!
> Leandro Guimarães
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Copy Bulk Ignore Duplicated

From: Adrian Klaver
Date:
On 6/17/19 9:22 AM, Adrian Klaver wrote:
> On 6/17/19 9:06 AM, Leandro Guimarães wrote:
> Please reply to list also.
> Ccing list.
>> Ugh My bad again.
>>
>> They are UNIQUE:
>> CONSTRAINT unique_const_value_20190501_45 UNIQUE (customer_id, 
>> date_time, indicator_id, element_id),
>>
>> I've made a mistake typing "check constraint" before because these are 
>> partitioned tables and I have the CHECK CONSTRAINT to partition.
>>
>> Is that clear?
> 
> Yes.
> 
> To get back to the original issue, the problem is that when you COPY in 
> new data you may get rows that conflict on the above UNIQUE constraint, 
> correct?

Assuming the above is correct, would not something like below work?:

create table orig_tbl(id integer, fld_1 varchar, fld_2 integer,
    fld_3 varchar, CONSTRAINT u_idx UNIQUE(id, fld_1, fld_2));

\d orig_tbl
                    Table "public.orig_tbl"
  Column |       Type        | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
  id     | integer           |           |          |
  fld_1  | character varying |           |          |
  fld_2  | integer           |           |          |
  fld_3  | character varying |           |          |
Indexes:
     "u_idx" UNIQUE CONSTRAINT, btree (id, fld_1, fld_2)

create table tmp_tbl(id integer, fld_1 varchar, fld_2 integer, fld_3 varchar);

insert into orig_tbl values (1, 'test', 3, 'test'), (2, 'foo', 5, 'bar'), (3, 'cat', 8, 'dog');

select * from orig_tbl ;

  id | fld_1 | fld_2 | fld_3
----+-------+-------+-------
   1 | test  |     3 | test
   2 | foo   |     5 | bar
   3 | cat   |     8 | dog
(3 rows)

insert into tmp_tbl values (1, 'test', 3, 'test'), (4, 'fish', 6, 'bird'), (7, 'rabbit', 8, 'squirrel'), (10, 'plant', 2, 'animal');

select * from tmp_tbl ;

  id | fld_1  | fld_2 |  fld_3
----+--------+-------+----------
   1 | test   |     3 | test
   4 | fish   |     6 | bird
   7 | rabbit |     8 | squirrel
  10 | plant  |     2 | animal
(4 rows)

select * from tmp_tbl AS tt left join orig_tbl AS ot
    on (tt.id, tt.fld_1, tt.fld_2) = (ot.id, ot.fld_1, ot.fld_2)
    where ot.id is null;

  id | fld_1  | fld_2 |  fld_3   | id | fld_1 | fld_2 | fld_3
----+--------+-------+----------+----+-------+-------+-------
   4 | fish   |     6 | bird     |    |       |       |
   7 | rabbit |     8 | squirrel |    |       |       |
  10 | plant  |     2 | animal   |    |       |       |
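
Presumably the remaining step would be to insert just those unmatched rows,
along these lines (this assumes, as in the test data above, that the key
columns contain no NULLs):

insert into orig_tbl
select tt.*
from tmp_tbl AS tt left join orig_tbl AS ot
    on (tt.id, tt.fld_1, tt.fld_2) = (ot.id, ot.fld_1, ot.fld_2)
where ot.id is null;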

> 
>>
>> Thanks for your patience!
>> Leandro Guimarães
>>
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Copy Bulk Ignore Duplicated

From: Leandro Guimarães
Date:
Hi Adrian,

   Yes, that's the problem!

   I'm testing pg_bulkload now, but I'm facing some issues installing it on PostgreSQL 9.4.
Leandro Guimarães



On Mon, Jun 17, 2019 at 1:22 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/17/19 9:06 AM, Leandro Guimarães wrote:
Please reply to list also.
Ccing list.
> Ugh My bad again.
>
> They are UNIQUE:
> CONSTRAINT unique_const_value_20190501_45 UNIQUE (customer_id,
> date_time, indicator_id, element_id),
>
> I've made a mistake typing "check constraint" before because these are
> partitioned tables and I have the CHECK CONSTRAINT to partition.
>
> Is that clear?

Yes.

To get back to the original issue, the problem is that when you COPY in
new data you may get rows that conflict on the above UNIQUE constraint,
correct?

>
> Thanks for your patience!
> Leandro Guimarães
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Copy Bulk Ignore Duplicated

From: Adrian Klaver
Date:
On 6/17/19 10:04 AM, Leandro Guimarães wrote:
> Hi Adrian,
> 
>     Yes, that's the problem!
> 
>     I'm testing now the pg_bulkload but I'm facing some issues to 
> install it on Postgresql 9.4.

The issues would be?

> Leandro Guimarães
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Copy Bulk Ignore Duplicated

From: Leandro Guimarães
Date:
I've installed all the dependencies, but when I try to "make install" in the pg_bulkload folder I get some errors like this:

In file included from pgut/pgut.h:24:0,
                 from pgut/pgut-fe.h:13,
                 from pg_bulkload.c:17:
/usr/include/postgresql/internal/pqexpbuffer.h:149:13: error: old-style parameter declarations in prototyped function definition
 extern void printfPQExpBuffer(PQExpBuffer str, const char *fmt,...) pg_attribute_printf(2, 3);
             ^
pg_bulkload.c:743:1: error: expected ‘{’ at end of input


My OS is Ubuntu Server 14.04 with PostgreSQL 9.4. I've searched about these errors, but the advice is always to change the C code in the files, and I'm not sure about doing that to a tool like pg_bulkload.

Leandro Guimarães



On Mon, Jun 17, 2019 at 3:56 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/17/19 10:04 AM, Leandro Guimarães wrote:
> Hi Adrian,
>
>     Yes, that's the problem!
>
>     I'm testing now the pg_bulkload but I'm facing some issues to
> install it on Postgresql 9.4.

The issues would be?

> Leandro Guimarães
>

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Copy Bulk Ignore Duplicated

From: Adrian Klaver
Date:
On 6/17/19 12:01 PM, Leandro Guimarães wrote:
> I've installed all dependencies, but when I try to "make install" in 
> pg_bulkload folder if have some errors like this:
> 
> In file included from pgut/pgut.h:24:0,
>                   from pgut/pgut-fe.h:13,
>                   from pg_bulkload.c:17:
> /usr/include/postgresql/internal/pqexpbuffer.h:149:13: error: old-style 
> parameter declarations in prototyped function definition
>   extern void printfPQExpBuffer(PQExpBuffer str, const char *fmt,...) 
> pg_attribute_printf(2, 3);
>               ^
> pg_bulkload.c:743:1: error: expected ‘{’ at end of input

Where did you get the source and what version of source?

The most recent should be 3.1.15 from here:

https://github.com/ossc-db/pg_bulkload

> 
> 
> My OS is Ubuntu Server 14.04 and PostgreSQL 9.4. I've searched about 
> theses errors but it ways advice to change the C code in files but I'm 
> not sure to do this in a tool like pg_bulkload.
> 
> Leandro Guimarães
> 
> 
> 
> On Mon, Jun 17, 2019 at 3:56 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> 
>     On 6/17/19 10:04 AM, Leandro Guimarães wrote:
>      > Hi Adrian,
>      >
>      >     Yes, that's the problem!
>      >
>      >     I'm testing now the pg_bulkload but I'm facing some issues to
>      > install it on Postgresql 9.4.
> 
>     The issues would be?
> 
>      > Leandro Guimarães
>      >
> 
>     -- 
>     Adrian Klaver
>     adrian.klaver@aklaver.com
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Copy Bulk Ignore Duplicated

From: "Peter J. Holzer"
Date:
On 2019-06-17 16:01:57 -0300, Leandro Guimarães wrote:
> I've installed all dependencies, but when I try to "make install" in
> pg_bulkload folder if have some errors like this:
>
> In file included from pgut/pgut.h:24:0,
>                  from pgut/pgut-fe.h:13,
>                  from pg_bulkload.c:17:
> /usr/include/postgresql/internal/pqexpbuffer.h:149:13: error: old-style
> parameter declarations in prototyped function definition
>  extern void printfPQExpBuffer(PQExpBuffer str, const char *fmt,...)
> pg_attribute_printf(2, 3);
>              ^
> pg_bulkload.c:743:1: error: expected ‘{’ at end of input
>
>
> My OS is Ubuntu Server 14.04 and PostgreSQL 9.4. I've searched about
> theses errors but it ways advice to change the C code in files but I'm
> not sure to do this in a tool like pg_bulkload.

The error message is misleading. There is no old-style parameter
declaration here. My guess is that the definition for the macro
pg_attribute_printf is missing, and without that definition this is
a syntax error (which the compiler reports although it seems to be
confused about the nature of the error).

In postgresql-9.5 (I don't have 9.4 and am too lazy to download it),
pg_attribute_printf is defined in "c.h".

I guess that this must be explicitly included in 9.4 but is implicitly
included in the versions that the author of pg_bulkload tested.

You could try just to insert

#include "c.h"

before line 17 of pg_bulkload.c.

But ultimately you should report this incompatibility to the author(s)
of pg_bulkload.

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
