Discussion: Removing duplicate records from a bulk upload


Removing duplicate records from a bulk upload

From:
Daniel Begin
Date:

I have just completed the bulk upload of a large database. Some tables have billions of records and no constraints or indexes have been applied yet. About 0.1% of these records may have been duplicated during the upload and I need to remove them before applying constraints.

 

I understand there are (at least) two approaches to get a table without duplicate records…

- Delete duplicate records from the table based on an appropriate select clause;

- Create a new table with the results of a select distinct clause, and then drop the original table.
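
For the second approach, I picture something roughly like the following, where bigtable and bigtable_dedup are just placeholder names and the duplicated rows are assumed to be identical in every column:

-- build a de-duplicated copy of the table, then swap it in
create table bigtable_dedup as
select distinct * from bigtable;

drop table bigtable;
alter table bigtable_dedup rename to bigtable;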

 

What would be the most efficient procedure in PostgreSQL to do the job considering …

- I do not know which records were duplicated;

- There are no indexes applied on the tables yet;

- There are no OIDs on the tables yet;

- The database is currently 1 TB, but I have plenty of disk space.

Daniel

 

Re: Removing duplicate records from a bulk upload

From:
Andy Colson
Date:
On 12/7/2014 9:31 PM, Daniel Begin wrote:
> I have just completed the bulk upload of a large database. Some tables
> have billions of records and no constraints or indexes have been applied
> yet. About 0.1% of these records may have been duplicated during the
> upload and I need to remove them before applying constraints.
>
> I understand there are (at least) two approaches to get a table without
> duplicate records…
>
> -           Delete duplicate records from the table based on an
> appropriate select clause;
>
> -           Create a new table with the results from a select distinct
> clause, and then drop the original table.
>
> What would be the most efficient procedure in PostgreSQL to do the job
> considering …
>
> -           I do not know which records were duplicated;
>
> -           There are no indexes applied on tables yet;
>
> -           There is no OIDS on tables yet;
>
> -           The database is currently 1TB but I have plenty of disk space.
>
> Daniel
>

How would you detect a duplicate?  Is there a single field that would be
duplicated?  Or do you have to test a bunch of different fields?

If it's a single field, you could find dups in a single pass of the table
with:

create index bigtable_key on bigtable(key);
select key, count(*) from bigtable group by key having count(*) > 1;

Save that list, and decide on some way of deleting the dups.
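
For instance, that list could be kept in a temp table so the later delete can join against it (dup_keys is just a made-up name, and key stands for whatever column identifies a duplicate):

-- save the duplicated key values for the clean-up step
create temp table dup_keys as
select key
from bigtable
group by key
having count(*) > 1;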

The index might help the initial select, but it will really help the later
re-query and delete statements.

-Andy


Re: Removing duplicate records from a bulk upload

From:
Andy Colson
Date:
On 12/8/2014 10:30 AM, Andy Colson wrote:
> [previous messages quoted in full; snipped]

I just thought of a more generic way.

1) make a non-unique index on bigtable
2) make a temp table
3) -- copy only the dups
insert into temp
select * from bigtable where (it's a duplicate);

4)
delete from bigtable where key in (select key from temp);

5)
insert into bigtable
select distinct * from temp;


This would minimize the amount of data you have to move around. How well it
works depends on how hard step 3 is to write. The index is not required, but
it would help make both steps 3 and 4 faster.
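
Concretely, assuming a single column called key identifies a duplicate (bigtable, key and dups are stand-in names), steps 3 to 5 might be spelled out roughly like this:

-- 3) copy only the rows whose key value appears more than once
create temp table dups as
select * from bigtable
where key in (select key
              from bigtable
              group by key
              having count(*) > 1);

-- 4) remove every copy of those rows from the big table
delete from bigtable where key in (select key from dups);

-- 5) put back exactly one copy of each
insert into bigtable
select distinct * from dups;

Note that select distinct only collapses rows that are identical in every column; if duplicates could differ outside of key, something like select distinct on (key) * would be needed instead.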

-Andy