Discussion: Slow trigger on insert: alternative solutions?

Slow trigger on insert: alternative solutions?

From: Guillaume Drolet
Date:
Hello,

I need your help speeding up the procedure I will explain below. I am looking for improvements to my method, or for different approaches/ideas that would help in this matter.

I have a set of DBF files that I load into my database using a plpython function and a call to ogr2ogr (http://www.gdal.org/drv_pg.html). Once in a while, I'll have to load updated versions of these tables to get the latest additions and possible corrections to older versions.

In my plpython script, if a table is loaded for the first time, I first load it empty, then I create a trigger function on insert (executed for each row) that checks for duplicates on each insert. Depending on the type of data I load, the trigger first checks for equality in a subset of columns (between 1 and 3 columns that act like my primary key(s)) and, if a match is found, I then check whether all columns are equal between NEW and the matching row from my table. When this condition is true, I return NULL; otherwise I store the rows (i.e. NEW.* and the matching row(s)) in a new table called "duplicate" for further manual investigation. Here's an example for one table:

CREATE OR REPLACE FUNCTION check_naipf_insert()
  RETURNS trigger AS
' BEGIN
    -- does a row with the same "key" columns already exist?
    IF EXISTS (SELECT 1
               FROM public.naipf
               WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
                 AND etage IS NOT DISTINCT FROM NEW.etage) THEN
      -- yes: is it identical in every column?
      IF EXISTS (SELECT 1
                 FROM public.naipf
                 WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
                   AND etage IS NOT DISTINCT FROM NEW.etage
                   AND type_couv IS NOT DISTINCT FROM NEW.type_couv
                   AND densite IS NOT DISTINCT FROM NEW.densite
                   AND hauteur IS NOT DISTINCT FROM NEW.hauteur
                   AND cl_age IS NOT DISTINCT FROM NEW.cl_age) THEN
        -- exact duplicate of an existing row: skip the insert
        RETURN NULL;
      ELSE
        -- same keys but different attributes: log both versions and skip
        INSERT INTO public.duplic_naipf SELECT NEW.*;
        INSERT INTO public.duplic_naipf
          SELECT *
          FROM public.naipf
          WHERE id_pet_mes IS NOT DISTINCT FROM NEW.id_pet_mes
            AND etage IS NOT DISTINCT FROM NEW.etage;
        RETURN NULL;
      END IF;
    END IF;
    -- no key match: let the insert proceed
    RETURN NEW;
  END; '
  LANGUAGE plpgsql VOLATILE COST 100;

CREATE TRIGGER check_insert_naipf
  BEFORE INSERT
  ON public.pet4_naipf
  FOR EACH ROW
  EXECUTE PROCEDURE check_naipf_insert();

(In this case, duplicate rows that need investigation are rows that may have changed relative to an older version of the DBF file but show no change in what I call their primary keys, although they are not really PKs since I don't want to raise errors at load time.)

Once this is done, ogr2ogr is called a second time to load the data. It is quite fast for small tables (tens of thousands of rows, tens of columns), but for large tables it takes forever. For example, I started loading a table with 3.5 million rows/33 columns last Friday at 3 PM, and now, Monday morning at 9 AM, only some 3 million rows have been loaded.

My question is: what other approaches would make this procedure faster? How is this kind of task usually implemented in PostgreSQL? Would it be better to load everything with no checks and then apply some functions to find duplicate rows (although this would involve more manual work)?

Thanks a lot for your help!




Re: Slow trigger on insert: alternative solutions?

From: Adrian Klaver
Date:
On 03/30/2015 06:04 AM, Guillaume Drolet wrote:
> My question is: what other approaches would make this procedure faster?
> How is this kind of task usually implemented in PostgreSQL? Would it be
> better to load everything with no checks and then apply some functions
> to find duplicate rows (although this would involve more manual work)?

I guess it depends on what the end purpose of the above is. If you are
just trying to keep relatively up-to-date information from the DBF
sources, would it not be easier just to load them into a new table?

So, where existing table is some_dbf_data:

1) CREATE TABLE new_some_dbf_data(...)
2) Dump DBF file into new_some_dbf_data
3) In a transaction, drop (or rename) some_dbf_data and rename
new_some_dbf_data to some_dbf_data, e.g. the rough SQL sketch below.
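
In rough SQL, step 3 might look something like this (only a sketch;
any indexes, constraints and permissions tied to the old table would
need to be handled as well):

BEGIN;
-- throw away the previous version of the table
DROP TABLE IF EXISTS some_dbf_data;
-- put the freshly loaded copy in its place
ALTER TABLE new_some_dbf_data RENAME TO some_dbf_data;
COMMIT;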


Option 2 is what I do for a similar procedure:

1) Dump DBF data into holding table.
2) Use SQL in function(s) to compare the old/new tables and make the
appropriate adjustments. Doing the SQL in bulk is a lot faster than
checking each row, or at least that is what I found. In any case, the
way you are doing it looks to involve 3.5 million inserts with a
trigger action on each, and that is bound to be slow :) A rough sketch
of the bulk approach follows below.
3) This is automated in an external Python script.
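
For your naipf example, the per-row trigger logic could be replaced by
a few set-based statements run once after the load. This is only a
sketch: it assumes the new data has been dumped into a holding table
called naipf_load (a made-up name) with the same columns, and it uses
public.naipf as the main table, as in your trigger's checks:

-- log incoming rows that match an existing row on the "key" columns
-- but differ in at least one of the other columns
INSERT INTO public.duplic_naipf
SELECT l.*
FROM naipf_load l
JOIN public.naipf n
  ON  n.id_pet_mes IS NOT DISTINCT FROM l.id_pet_mes
  AND n.etage      IS NOT DISTINCT FROM l.etage
WHERE  n.type_couv IS DISTINCT FROM l.type_couv
    OR n.densite   IS DISTINCT FROM l.densite
    OR n.hauteur   IS DISTINCT FROM l.hauteur
    OR n.cl_age    IS DISTINCT FROM l.cl_age;

-- log the existing rows they collide with
INSERT INTO public.duplic_naipf
SELECT n.*
FROM public.naipf n
JOIN naipf_load l
  ON  n.id_pet_mes IS NOT DISTINCT FROM l.id_pet_mes
  AND n.etage      IS NOT DISTINCT FROM l.etage
WHERE  n.type_couv IS DISTINCT FROM l.type_couv
    OR n.densite   IS DISTINCT FROM l.densite
    OR n.hauteur   IS DISTINCT FROM l.hauteur
    OR n.cl_age    IS DISTINCT FROM l.cl_age;

-- insert only the rows whose keys are not in the main table at all;
-- exact duplicates and colliding rows are simply not copied over
INSERT INTO public.naipf
SELECT l.*
FROM naipf_load l
WHERE NOT EXISTS (SELECT 1
                  FROM public.naipf n
                  WHERE n.id_pet_mes IS NOT DISTINCT FROM l.id_pet_mes
                    AND n.etage      IS NOT DISTINCT FROM l.etage);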

Option 3

Use the dbf module (https://pypi.python.org/pypi/dbf/0.88.16) to do the
comparisons on the DBF files outside Postgres and only import what has
changed.




--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Slow trigger on insert: alternative solutions?

From: Guillaume Drolet
Date:
Thanks Adrian,

In option 2, when you say "this is automated in an external Python script", do you mean that you use something like psycopg2 to run the queries against the database (e.g. for comparing data in the holding table with the older table)?

Thanks.


Re: Slow trigger on insert: alternative solutions?

From: Adrian Klaver
Date:
On 03/30/2015 01:43 PM, Guillaume Drolet wrote:
> Thanks Adrian,
>
> In option 2, when you say "this is automated in an external Python
> script", do you mean that you use something like psycopg2 to run the
> queries against the database (e.g. for comparing data in the holding
> table with the older table)?

Yes. Basically I use the dbf module I mentioned previously to read the
DBF files, output the data I need, and write it into an in-memory CSV
file, which I then dump into the Postgres database using the psycopg2
COPY functions. The files I am working with are not as large as the
ones you mention, so doing all this in memory is workable. The script
then calls a series of user functions in Postgres to do the comparing
and manipulating. At the time I did this, plpythonu was less featured
than it is now, so doing what I wanted made more sense in an external
script. Also, the script pulls the DBF files from elsewhere and I felt
more comfortable doing that outside the database than in it.
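
As an illustration only, one of those user functions, applied to the
naipf example from earlier in this thread, might look roughly like the
function below. The holding table naipf_load and the function name are
made up, and the body is the same kind of bulk INSERT ... SELECT
already sketched earlier, just wrapped in something the script can call
and get a row count back from:

CREATE OR REPLACE FUNCTION insert_new_naipf() RETURNS bigint AS
$$
DECLARE
  n_inserted bigint;
BEGIN
  -- copy over only the rows whose "key" columns are not in the main
  -- table yet; colliding rows stay in naipf_load for duplicate logging
  INSERT INTO public.naipf
  SELECT l.*
  FROM naipf_load l
  WHERE NOT EXISTS (SELECT 1
                    FROM public.naipf n
                    WHERE n.id_pet_mes IS NOT DISTINCT FROM l.id_pet_mes
                      AND n.etage      IS NOT DISTINCT FROM l.etage);
  GET DIAGNOSTICS n_inserted = ROW_COUNT;
  RETURN n_inserted;
END;
$$ LANGUAGE plpgsql;

-- called from the external script after the COPY, e.g.
-- SELECT insert_new_naipf();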

Though more and more I seem to be using pandas (pandas.pydata.org) to
do data conversions; it saves a lot of the steps described above. In
this case you would still need to get the data out of the DBF files.



--
Adrian Klaver
adrian.klaver@aklaver.com