Обсуждение: Performance difference between ANY and IN, also array syntax

Поиск
Список
Период
Сортировка

Performance difference between ANY and IN, also array syntax

От
Bart Grantham
Дата:
Hello, all.  I work for a Manhattan ISP and have developed an internal
systems management/housekeeping app on php/postgres 7.4.  I am trying to
speed up some bits with stored procedures and have had great success,
except I've now run into a bit of trouble.  It comes down to this:

# SELECT * FROM connections  WHERE connectee_node_id = ANY (
ARRAY[28543,28542] );
-snip-
Time: 564.899 ms

...versus...

# SELECT * FROM connections  WHERE connectee_node_id IN ( 28543,28542 );
-snip-
Time: 1.410 ms

Why the difference?  I tried explain:

# explain SELECT * FROM connections  WHERE connectee_node_id = ANY (
ARRAY[28543,28542] );
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on connections  (cost=0.00..17963.44 rows=207264 width=33)
   Filter: (connectee_node_id = ANY ('{28543,28542}'::integer[]))

..versus...

# explain SELECT * FROM connections  WHERE connectee_node_id IN (
28543,28542 );
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Index Scan using c_connectee_node_id, c_connectee_node_id on
connections  (cost=0.00..67.28 rows=72 width=33)
   Index Cond: ((connectee_node_id = 28543) OR (connectee_node_id = 28542))


Why filter for one and index for the other?  Is it because an array is
mutable, so it's impossible to operate on it the same way?  I need to
pass in an array to my stored procedure, but having to use ANY in my
select is killing the performance. I'd like to know what I can do to
make ANY perform like IN, or alternatively, could someone fill me in on
the syntax on how to cast an INT[] to a "list of scalar expressions",
which the manual states is the right-hand side to IN.

Also, I have a few bits of confusion about array syntax that perhaps
someone could illuminate for me.  Forgive me, I'm not as fluent in SQL
as other languages.

First, this doesn't work for me: RAISE NOTICE ''DEBUG: %'', _myarray[1];
It seems pretty reasonable to me, but it gives me a 'syntax error at or
near "["'.

Next, I can't seem to declare an array of a user-defined row: _innerrow
my_type%ROWTYPE[];
Is there a syntactical snag I'm tripping over?  Or can you not have
arrays of other than built-in types?

I think it's super-cool that you can extract arbitrary rectangles of
data from a multi-dimentional array, but can you take a vertical slice
from an array of user-defined type by column?  For example:
_mytype[1:5].some_column

And finally, how do you specifcy an entire array index when doing
columns from multi-dim arrays?  Is there something like
_my_multidim_of_ints[*][4] or maybe _my_multidim_of_ints[:][4] ?


Thanks for the help, and thanks for the great database.

Bart G
Logicworks NOC

Re: Performance difference between ANY and IN, also array syntax

От
James Robinson
Дата:
Without anything truly fancy, you could write a proc which dynamically
builds a query string using the IN form out of a array parameter:

You get to do a bunch of string contatenation and you don't get the
luxury of pre-planning, but this technique might work for you. If your
arrays aren't too big, then it might be a winner. I'm sure someone more
knowledgeable may well propose something more elegant.

----
create table test
(
    id int
);

insert into test values(1);
insert into test values(2);
insert into test values(3);


create or replace function dynamic_test(int []) returns setof test as
$$
DECLARE
    query text;
    testrow test;
    ids alias for $1;
    maxidx int := array_upper($1, 1);
    i int;
BEGIN
    query := 'select * from test where id in (';
    -- unroll the array ...
    for i in 1..maxidx loop
        query := query || ids[i];
        if i <> maxidx then
            query := query || ', ';
        end if;
    end loop;
    query := query || ')';

    raise notice 'query: "%"', query;
    -- okay -- eat it now
    for testrow in execute query loop
        return next testrow;
    end loop;

    return;
END;
$$ language plpgsql;

social=# select * from dynamic_test('{2,3}');
NOTICE:  query: "select * from test where id in (2, 3)"
  id
----
   2
   3
(2 rows)

----
----
James Robinson
Socialserve.com


Re: Performance difference between ANY and IN, also array syntax

От
Tom Lane
Дата:
Bart Grantham <bart@logicworks.net> writes:
> It comes down to this:

> # SELECT * FROM connections  WHERE connectee_node_id = ANY (
> ARRAY[28543,28542] );
> [ performance sucks ]

Yeah :-(.  The = ANY (array) notation is new in the last release or so,
and is completely without optimization of any kind.  Turn it into IN
if you can.

> First, this doesn't work for me: RAISE NOTICE ''DEBUG: %'', _myarray[1];
> It seems pretty reasonable to me, but it gives me a 'syntax error at or
> near "["'.

The RAISE statement is more than a few bricks shy of a load --- it
doesn't accept any sort of expression, only simple variable names.
I hope we get around to fixing that for 8.1, but for now, what you
have to do is assign the thing you want to display into a temporary
variable and then use that in the RAISE.

> Next, I can't seem to declare an array of a user-defined row: _innerrow
> my_type%ROWTYPE[];

We don't presently support arrays of composite types.

You seem to be doing amazingly well at hitting multiple weak spots
of PG simultaneously ;-).  Most of this stuff is on the radar, but
I wouldn't recommend holding your breath for a fix, with the exception
of the RAISE issue.  RAISE is way overdue for an overhaul and I hope
we will actually get to it for 8.1.

            regards, tom lane

Re: Performance difference between ANY and IN, also array syntax

От
Ron Mayer
Дата:
Tom Lane wrote:
> Bart Grantham <bart@logicworks.net> writes:
>
>># SELECT * FROM connections  WHERE connectee_node_id = ANY (ARRAY[28543,28542] );
>>[ performance sucks ]
>
> Yeah :-(.  The = ANY (array) notation is new in the last release or so,
> and is completely without optimization of any kind.  Turn it into IN
> if you can.

You could also use the module in contrib/intagg to treat your array
as a table so you can JOIN against it like this:

   select * from connections
            join int_array_enum('{28543,28542}'::int[]) as iae(x)
              on (connectee_node_id = x);

This seems to work very well for me.  I show a real-life example
with a table about the same size as Bart's (about 200K rows) below.

It has the advantage of being able to do all the different join types
whenever it guesses it would be appropriate to do so.   I've never
seen "in" clauses use some of those tricks (though it might be possible
that in clauses can do that as well).  It seems to always think that
int_array_enum will return a 1000 row table (I wonder if there should
eventually be some way to tweak that), so when joining against very
large tables it happily does index scans - and when joining against
smaller tables it'll do merge joins or whatever.





=========================================================================
=========================================================================
=========================================================================
== an example going against a larger table,
== the int_array_enum trick reasonably chooses nested loops and index scans
=========================================================================
=========================================================================
=========================================================================
fli=# explain
  select *
    from lines
   where tlid = ANY ('{125060436,125060437}'::int[]);
                            QUERY PLAN
----------------------------------------------------------------
  Seq Scan on lines  (cost=0.00..26636.03 rows=254229 width=102)
    Filter: (tlid = ANY ('{125060436,125060437}'::integer[]))
(2 rows)

fli=# explain
select *
   from lines
   join int_array_enum('{125060436,125060437}'::int[]) as iae(x)
     on (tlid=x);
fli-# fli-# fli-# fli-#                                       QUERY PLAN
--------------------------------------------------------------------------------------
  Nested Loop  (cost=0.00..4228.99 rows=1000 width=106)
    ->  Function Scan on int_array_enum iae  (cost=0.00..12.50 rows=1000 width=4)
    ->  Index Scan using rtgr_lines__tlid on lines  (cost=0.00..4.20 rows=1 width=102)
          Index Cond: (lines.tlid = "outer".x)
(4 rows)


=========================================================================
=========================================================================
=========================================================================
== an example going against a smaller table
== the int_array_enum trick reasonably chooses a merge join.
=========================================================================
=========================================================================
=========================================================================
fli=# explain analyze
select *
   from rt6
   join int_array_enum('{125060436,125060437}'::int[]) as iae(x)
     on (tlid=x);
fli-# fli-# fli-# fli-#                                                             QUERY PLAN                \

--------------------------------------------------------------------------------------------------------------\
--------------------
  Merge Join  (cost=62.33..1226.41 rows=2848 width=82) (actual time=36.669..36.669 rows=0 loops=1)
    Merge Cond: ("outer".tlid = "inner".x)
    ->  Index Scan using rt6__tlid on rt6  (cost=0.00..1066.65 rows=21881 width=78) (actual time=0.020..21.982 \
rows=10544 loops=1)
    ->  Sort  (cost=62.33..64.83 rows=1000 width=4) (actual time=0.042..0.046 rows=2 loops=1)
          Sort Key: iae.x
          ->  Function Scan on int_array_enum iae  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.020..0.0\
23 rows=2 loops=1)
  Total runtime: 36.770 ms
(7 rows)

======================================================

Re: Performance difference between ANY and IN, also array

От
Alban Hertroys
Дата:
Bart Grantham wrote:
> Hello, all.  I work for a Manhattan ISP and have developed an internal
> systems management/housekeeping app on php/postgres 7.4.  I am trying to
> speed up some bits with stored procedures and have had great success,
> except I've now run into a bit of trouble.  It comes down to this:
>
> # SELECT * FROM connections  WHERE connectee_node_id IN ( 28543,28542 );
> -snip-
> Time: 1.410 ms

If you can divide your data set into ranges of adjacent values, it is
even faster to use series of BETWEEN clauses. Kind of depends on the
amount of processing you need to do to get those ranges. The values
being ordered beforehand helps a great deal.

For example, you could do WHERE value IN (1,2,3,4,6,7,8,9,10) or you
could use WHERE value BETWEEN 1 AND 4 OR value BETWEEN 6 AND 10.

You'll also want to prevent having queries like WHERE value BETWEEN 2
AND 3. There IN (2,3) is probably the better alternative.

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban@magproductions.nl
W: http://www.magproductions.nl