Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)

From	Daniel Begin
Subject	Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Date	December 9, 2014, 17:58:25
Msg-id	COL129-DS22F3AC2E7068560F72605B94650@phx.gbl
In reply to	Re: Removing duplicate records from a bulk upload (rationale behind selecting a method) (Daniel Begin <jfd553@hotmail.com>)
Responses	Re: Removing duplicate records from a bulk upload (rationale behind selecting a method) (Marc Mamin <M.Mamin@intershop.de>)
List	pgsql-general


Thanks Tom,
I understand that the rationale behind creating a new table from distinct
records is that, since both approaches need a full table scan, selecting
distinct records is faster (and seems more straightforward) than finding and
deleting duplicates.

Best regards,
Daniel

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
Sent: December-08-14 21:52
To: Scott Marlowe
Cc: Andy Colson; Daniel Begin; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
(rationale behind selecting a method)

Scott Marlowe <scott.marlowe@gmail.com> writes:
> If you're de-duping a whole table, no need to create indexes, as it's
> gonna have to hit every row anyway. Fastest way I've found has been:

> select a,b,c into newtable from oldtable group by a,b,c;

> One pass, done.

> If you want to use less than the whole row, you can use select
> distinct on (col1, col2) * into newtable from oldtable;

Also, the DISTINCT ON method can be refined to control which of a set of
duplicate keys is retained, if you can identify additional columns that
constitute a preference order for retaining/discarding dupes.  See the
"latest weather reports" example in the SELECT reference page.
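
A minimal sketch of that refinement, assuming a hypothetical updated_at
column that ranks duplicates (the newest row per key is kept). The table and
column names are placeholders, not taken from the thread:

```sql
-- Hypothetical names: oldtable, newtable, col1, col2, updated_at.
-- DISTINCT ON keeps the first row of each (col1, col2) group, so the
-- ORDER BY decides which duplicate survives. The DISTINCT ON expressions
-- must match the leftmost ORDER BY expressions.
CREATE TABLE newtable AS
SELECT DISTINCT ON (col1, col2) *
FROM oldtable
ORDER BY col1, col2, updated_at DESC;  -- newest row per key wins
```

Without the extra ORDER BY column, which row of each group is retained is
unpredictable.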

In any case, it's advisable to crank up work_mem while performing this
operation.
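
One way to do this, sketched under the assumption that the de-duplication
runs in a single transaction. The 1GB value is purely illustrative and
should be sized to the memory actually available:

```sql
BEGIN;
-- SET LOCAL limits the change to this transaction; the value is an example.
SET LOCAL work_mem = '1GB';
-- Hypothetical table and column names, following the query quoted above.
CREATE TABLE newtable AS
SELECT a, b, c
FROM oldtable
GROUP BY a, b, c;
COMMIT;
```

The setting reverts automatically at COMMIT or ROLLBACK, so other sessions
and later statements are unaffected.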

            regards, tom lane


