Thread: Why is creating a GIN table index so much slower than inserting data into an empty table with the same index?


Why is creating a GIN table index so much slower than inserting data into an empty table with the same index?

From: Sergey Burladyan
Example:

select version();
                                          version
--------------------------------------------------------------------------------------------
 PostgreSQL 8.3.6 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian 4.3.3-3) 4.3.3

show maintenance_work_mem ;
 maintenance_work_mem
----------------------
 128MB

create table a (i1 int, i2 int, i3 int, i4 int, i5 int, i6 int);

insert into a select n, n, n, n, n, n from generate_series(1, 100000) as n;
INSERT 0 100000
Time: 570.110 ms

create index arr_gin on a using gin ( (array[i1, i2, i3, i4, i5, i6]) );
CREATE INDEX
Time: 203068.314 ms

truncate a;
drop index arr_gin ;

create index arr_gin on a using gin ( (array[i1, i2, i3, i4, i5, i6]) );
CREATE INDEX
Time: 3.246 ms

insert into a select n, n, n, n, n, n from generate_series(1, 100000) as n;
INSERT 0 100000
Time: 2405.481 ms

select pg_size_pretty(pg_total_relation_size('a')) as total,
       pg_size_pretty(pg_relation_size('a')) as table;
  total  |  table
---------+---------
 9792 kB | 5096 kB


203068.314 ms vs. 2405.481 ms: is this behaviour normal?

Thanks !

--
Sergey Burladyan

Sergey Burladyan <eshkinkot@gmail.com> writes:
> show maintenance_work_mem ;
>  maintenance_work_mem
> ----------------------
>  128MB

> create table a (i1 int, i2 int, i3 int, i4 int, i5 int, i6 int);
> insert into a select n, n, n, n, n, n from generate_series(1, 100000) as n;
> create index arr_gin on a using gin ( (array[i1, i2, i3, i4, i5, i6]) );

[ takes forever ]

Seems the reason this is so awful is that the incoming data is exactly
presorted, meaning that the tree structure that ginInsertEntry() is
trying to build degenerates to a linear list (ie, every new element
becomes the right child of the prior one).  So the processing becomes
O(N^2) up till you reach maintenance_work_mem and flush the tree.  With
a large setting for maintenance_work_mem it gets spectacularly slow.
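
To see the degeneration in isolation, here is a small standalone C sketch
(illustrative only, not GIN's actual ginInsertEntry() code): inserting
already-sorted keys into a naive, unbalanced binary tree turns it into a
right-leaning chain, so the comparison count grows as N*(N-1)/2 and the
depth as N.

#include <stdio.h>
#include <stdlib.h>

typedef struct Node
{
    int          key;
    struct Node *left;
    struct Node *right;
} Node;

static long comparisons = 0;

/* Insert a key into an unbalanced binary search tree, tracking depth. */
static Node *
tree_insert(Node *root, int key, int *maxdepth)
{
    Node  **slot = &root;
    int     depth = 0;

    while (*slot != NULL)
    {
        comparisons++;
        slot = (key < (*slot)->key) ? &(*slot)->left : &(*slot)->right;
        depth++;
    }
    *slot = calloc(1, sizeof(Node));
    (*slot)->key = key;
    if (depth > *maxdepth)
        *maxdepth = depth;
    return root;
}

int
main(void)
{
    Node   *root = NULL;
    int     maxdepth = 0;
    int     n = 20000;      /* scaled down from the 100000-row test case */

    /* Presorted input: every key becomes the right child of the deepest node. */
    for (int i = 1; i <= n; i++)
        root = tree_insert(root, i, &maxdepth);

    /* Expect maxdepth == n - 1 and comparisons == n*(n-1)/2. */
    printf("n = %d, maxdepth = %d, comparisons = %ld\n", n, maxdepth, comparisons);
    return 0;
}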

I think a reasonable solution for this might be to keep an eye on
maxdepth and force a flush if that gets too large (more than a few
hundred, perhaps?).  Something like this:

    /* If we've maxed out our available memory, dump everything to the index */
+   /* Also dump if the tree seems to be getting too unbalanced */
-   if (buildstate->accum.allocatedMemory >= maintenance_work_mem * 1024L)
+   if (buildstate->accum.allocatedMemory >= maintenance_work_mem * 1024L ||
+       buildstate->accum.maxdepth > DEPTH_LIMIT)
    {

The new fast-insert code likely will need a similar defense.
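
For intuition on why such a defense bounds the cost, here is another
throwaway sketch (again a toy model, not the actual build code; DEPTH_LIMIT
is just the hypothetical "few hundred" constant from the fragment above).
Whenever the in-memory tree gets deeper than DEPTH_LIMIT it is discarded,
which here stands in for dumping the accumulated entries to the index; with
sorted input the total work then drops from O(N^2) to roughly
O(N * DEPTH_LIMIT).

#include <stdio.h>
#include <stdlib.h>

#define DEPTH_LIMIT 500     /* hypothetical flush threshold */

typedef struct Node
{
    int          key;
    struct Node *left;
    struct Node *right;
} Node;

static long comparisons = 0;

static void
tree_free(Node *node)
{
    if (node == NULL)
        return;
    tree_free(node->left);
    tree_free(node->right);
    free(node);
}

static Node *
tree_insert(Node *root, int key, int *maxdepth)
{
    Node  **slot = &root;
    int     depth = 0;

    while (*slot != NULL)
    {
        comparisons++;
        slot = (key < (*slot)->key) ? &(*slot)->left : &(*slot)->right;
        depth++;
    }
    *slot = calloc(1, sizeof(Node));
    (*slot)->key = key;
    if (depth > *maxdepth)
        *maxdepth = depth;
    return root;
}

int
main(void)
{
    Node   *root = NULL;
    int     maxdepth = 0;
    long    flushes = 0;

    for (int i = 1; i <= 100000; i++)
    {
        root = tree_insert(root, i, &maxdepth);

        /* The proposed defense: also flush when the tree gets too deep. */
        if (maxdepth > DEPTH_LIMIT)
        {
            tree_free(root);    /* stands in for dumping entries to the index */
            root = NULL;
            maxdepth = 0;
            flushes++;
        }
    }
    tree_free(root);

    /* Expect roughly N/DEPTH_LIMIT flushes and N*DEPTH_LIMIT/2 comparisons. */
    printf("flushes = %ld, comparisons = %ld\n", flushes, comparisons);
    return 0;
}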

Comments?

            regards, tom lane

Re: Why is creating a GIN table index so much slower than inserting data into an empty table with the same index?

From: Heikki Linnakangas
Tom Lane wrote:
> Sergey Burladyan <eshkinkot@gmail.com> writes:
>> show maintenance_work_mem ;
>>  maintenance_work_mem
>> ----------------------
>>  128MB
>
>> create table a (i1 int, i2 int, i3 int, i4 int, i5 int, i6 int);
>> insert into a select n, n, n, n, n, n from generate_series(1, 100000) as n;
>> create index arr_gin on a using gin ( (array[i1, i2, i3, i4, i5, i6]) );
>
> [ takes forever ]
>
> Seems the reason this is so awful is that the incoming data is exactly
> presorted, meaning that the tree structure that ginInsertEntry() is
> trying to build degenerates to a linear list (ie, every new element
> becomes the right child of the prior one).  So the processing becomes
> O(N^2) up till you reach maintenance_work_mem and flush the tree.  With
> a large setting for maintenance_work_mem it gets spectacularly slow.

Yes, this is probably the same issue I bumped into a while ago:

http://archives.postgresql.org/message-id/49350A13.3020105@enterprisedb.com

> I think a reasonable solution for this might be to keep an eye on
> maxdepth and force a flush if that gets too large (more than a few
> hundred, perhaps?).  Something like this:
>
>     /* If we've maxed out our available memory, dump everything to the index */
> +   /* Also dump if the tree seems to be getting too unbalanced */
> -   if (buildstate->accum.allocatedMemory >= maintenance_work_mem * 1024L)
> +   if (buildstate->accum.allocatedMemory >= maintenance_work_mem * 1024L ||
> +       buildstate->accum.maxdepth > DEPTH_LIMIT)
>     {
>
> The new fast-insert code likely will need a similar defense.

I fooled around with a balanced tree, which solved the problem but
unfortunately made the unsorted case slower. Limiting the depth like
that should avoid that problem, so it's worth trying.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> I think a reasonable solution for this might be to keep an eye on
>> maxdepth and force a flush if that gets too large (more than a few
>> hundred, perhaps?).  Something like this:

> I fooled around with a balanced tree, which solved the problem but
> unfortunately made the unsorted case slower.

Yeah, rebalancing the search tree would fix that, but every balanced
tree algorithm I know about is complicated, slow, and needs extra
memory.  It's really unclear that it'd be worth the trouble for a
transient data structure like this one.

            regards, tom lane