Обсуждение: Why is this doing a seq scan?

Поиск
Список
Период
Сортировка

Why is this doing a seq scan?

От
"Ingram, Bryan"
Дата:
I'm trying to find the nearest locations to a certain point using 2 tables.
One contains the address and zipcodes of the locations and is about 2000
rows, the other contains zipcodes and lat, lon values and has about 1.4M
rows.

I've got indexes on the zip column of both tables and I just need to pull
the lat and lon out of the zips table for each zipcode that happens to be in
the ATMs table.  

There are only about 2000 rows in ATMs, and since both the atms.zip and
zips.zip are indexed, I'm not sure why a seq scan is being performed.

The select is taking anywhere from 30secs to 1min.  it's running on a linux
box w/2 pIII/700s and a raid ..so the machine shouldn't be slowing me down.
I think it's the seq scan but I can't seem to get rid of it.

=> explain select ( point(32.85, -94.55) <@> point(y.lat, y.lon) )  as
distance, x.zip, y.zip, y.lat, y.lon from atms x, zips y where x.zip = y.zip
order by 1 limit 3;
NOTICE:  QUERY PLAN:

Sort  (cost=39164156.66..39164156.66 rows=32338349 width=40) ->  Nested Loop  (cost=0.00..30401394.25 rows=32338349
width=40)      ->  Seq Scan on zips y  (cost=0.00..29558.49 rows=1401749 width=28)       ->  Index Scan using atms_zip
onatms x  (cost=0.00..21.38 rows=23
 
width=12)

Any idea on how to speed this up?

Thanks,
Bryan



Re: Why is this doing a seq scan?

От
Tom Lane
Дата:
Hmm.  Have you VACUUM ANALYZED the tables?  If so, what do you get from
these queries:

select attname,attdisbursion,s.*
from pg_statistic s, pg_attribute a, pg_class c
where starelid = c.oid and attrelid = c.oid and staattnum = attnum
and relname = 'zips';

select attname,attdisbursion,s.*
from pg_statistic s, pg_attribute a, pg_class c
where starelid = c.oid and attrelid = c.oid and staattnum = attnum
and relname = 'atms';

Also it would be useful to see the full declarations of the tables
and their indexes; I'm wondering what datatype the zip columns are,
for example.
        regards, tom lane


RE: Why is this doing a seq scan?

От
"Ingram, Bryan"
Дата:
> Hmm.  Have you VACUUM ANALYZED the tables?  If so, what do 
> you get from
> these queries:

Tom, thanks for the reply, and here is all the info you asked for.

> select attname,attdisbursion,s.*
> from pg_statistic s, pg_attribute a, pg_class c
> where starelid = c.oid and attrelid = c.oid and staattnum = attnum
> and relname = 'zips';

attname
|attdisbursion|starelid|staattnum|staop|stanullfrac|stacommonfrac|stacommonv
al|staloval  |stahival
---------+-------------+--------+---------+-----+-----------+-------------+-
-----------+----------+--------
zip      |           -1|   93920|        1| 1066|          0|
7.13394e-07|01226       |00401     |Y1A6A1  
state    |     0.165522|   93920|        2| 1066|          0|
0.346728|ON          |AB        |YT      
city     |   0.00729095|   93920|        3| 1066|          0|
0.0322854|TORONTO     |          |ZWOLLE  
lat      |   0.00326189|   93920|        4|  672|          0|
0.0153651|51.05       |-123.176  |79.989  
lon      |   0.00326061|   93920|        5|  672|          0|
0.0153594|-114.083333 |-176.31005|144.445 
bestbound|     0.997491|   93920|        6|  672|   0.998605|   0.00107366|2
|2         |98      
(6 rows)


> select attname,attdisbursion,s.*
> from pg_statistic s, pg_attribute a, pg_class c
> where starelid = c.oid and attrelid = c.oid and staattnum = attnum
> and relname = 'atms';

attname
|attdisbursion|starelid|staattnum|staop|stanullfrac|stacommonfrac|stacommonv
al   |staloval                      |stahival        
----------------+-------------+--------+---------+-----+-----------+--------
-----+---------------+------------------------------+----------------
terminal        |           -1|   50904|        1|   97|          0|
0.000433463|6000           |55                            |9433            
district        |    0.0679035|   50904|        2|  664|          0|
0.192024|ARCO California|ARCO Arizona                  |Western New York
name            |  0.000261431|   50904|        3|  664|          0|
0.00130039|Gateway Center |11th & Conger                 |Zionsville      
address         |  0.000261431|   50904|        4|  664|          0|
0.00130039|215 Tecumseh Rd|"402 E Yakima Ave, Suite 1400"|Windham Mall    
city            |   0.00522279|   50904|        5|  664|          0|
0.0238405|Seattle        |Aberdeen                      |Zionsville      
state           |    0.0687854|   50904|        6| 1058|          0|
0.193758|CA             |AK                            |WA              
zip             |  0.000614214|   50904|        7|  664|          0|
0.00303424|92392          |                              |99901           
access          |     0.385091|   50904|        8|  664|          0|
0.579974|WU             |                              |WU              
function        |     0.396416|   50904|        9|  664|          0|
0.589944|FF             |CD                            |FF              
location        |     0.414461|   50904|       10|  664|          0|
0.605548|BR             |BR                            |Rem             
language        |     0.431861|   50904|       11|  664|          0|
0.620286|E              |                              |E               
restricted_hours|     0.886758|   50904|       12|  664|          0|
0.939749|FALSE          |FALSE                         |TRUE            
seasonal        |     0.994812|   50904|       13|  664|          0|
0.997399|FALSE          |FALSE                         |TRUE            
stamps          |     0.621877|   50904|       14|  664|          0|
0.746857|FALSE          |FALSE                         |TRUE            
(14 rows)

> Also it would be useful to see the full declarations of the tables
> and their indexes; I'm wondering what datatype the zip columns are,
> for example.

Table    = atms
+----------------------------------+----------------------------------+-----
--+
|              Field               |              Type                |
Length|
+----------------------------------+----------------------------------+-----
--+
| terminal                         | int4                             |
4 |
| district                         | text                             |
var |
| name                             | text                             |
var |
| address                          | text                             |
var |
| city                             | text                             |
var |
| state                            | char()                           |
2 |
| zip                              | text                             |
var |
| access                           | text                             |
var |
| function                         | text                             |
var |
| location                         | text                             |
var |
| language                         | text                             |
var |
| restricted_hours                 | text                             |
var |
| seasonal                         | text                             |
var |
| stamps                           | text                             |
var |
+----------------------------------+----------------------------------+-----
--+
Index:    atms_zip

thirdfed=> \d zips
Table    = zips
+----------------------------------+----------------------------------+-----
--+
|              Field               |              Type                |
Length|
+----------------------------------+----------------------------------+-----
--+
| zip                              | varchar()                        |
10 |
| state                            | varchar()                        |
3 |
| city                             | varchar()                        |
100 |
| lat                              | float8                           |
8 |
| lon                              | float8                           |
8 |
| bestbound                        | float8                           |
8 |
+----------------------------------+----------------------------------+-----
--+
Indices:  zips_latindex         zips_lonindex         zips_pkey

thirdfed=> \d zips_pkey
Table    = zips_pkey
+----------------------------------+----------------------------------+-----
--+
|              Field               |              Type                |
Length|
+----------------------------------+----------------------------------+-----
--+
| zip                              | varchar()                        |
10 |
+----------------------------------+----------------------------------+-----
--+
thirdfed=> \d atms_zip
Table    = atms_zip
+----------------------------------+----------------------------------+-----
--+
|              Field               |              Type                |
Length|
+----------------------------------+----------------------------------+-----
--+
| zip                              | text                             |
var |
+----------------------------------+----------------------------------+-----
--+
Thanks,
Bryan




Re: Why is this doing a seq scan?

От
Tom Lane
Дата:
"Ingram, Bryan" <BIngram@sixtyfootspider.com> writes:
>> Also it would be useful to see the full declarations of the tables
>> and their indexes; I'm wondering what datatype the zip columns are,
>> for example.

> Table    = atms
> | zip                              | text                             |

> Table    = zips
> | zip                              | varchar()                        |

Ah, there's your problem --- the planner is not very smart about
optimizing cross-datatype comparisons.  Make these columns both text,
or both varchar, and I'll bet you get a more intelligent plan.

Current sources (7.1-to-be) are a little smarter than 7.0 about
cross-data-type joins, but they still don't get this case right.
I have a TODO item about that, but I dunno if it'll get done before
7.1 ...
        regards, tom lane


Re: Why is this doing a seq scan?

От
Tom Lane
Дата:
I said:
> Ah, there's your problem --- the planner is not very smart about
> optimizing cross-datatype comparisons.  Make these columns both text,
> or both varchar, and I'll bet you get a more intelligent plan.

After a little further thought, I realize that the planner may be
handicapped by not realizing it can do a merge or hash join across
datatypes, but even without that problem, this is not going to be
a fast query.  What you've got is
select ... from atms x, zips y where x.zip = y.ziporder by 1 limit 3;

and there just isn't any way to process this without forming the
full join product --- ie, the thing will sit there and form a join
tuple for *every* valid combination of ATM and ZIP in your database,
and then compute the distance to the target point for every one of
those ATMs, and then sort that result, and finally give you only
the top three rows.  A smarter kind of join isn't going to help
all that much; to make this fast, you need to be filtering
using the really selective condition (distance to the target point)
*before* you do the join.

If you are sufficiently interested in the speed of this query to want to
maintain a specialized index for it, I'd suggest looking at an r-tree
index on the location data, and then using a WHERE condition on the
r-tree index to prefilter the rows before you join.  r-trees only work
on boxes and polygons AFAICT --- what would work nicely is to store a
"box" of very small dimensions surrounding the location of each ATM,
index that column, and then use a WHERE test for overlap between that
box column and a box surrounding the target point out to as far as you
think is likely to be interesting.  This gives you back a fairly small
number of candidate ATMs for which you compute the exact distance to
the target, sort, and limit.  Not sure that you need to join to zips
at all if you do it this way.
        regards, tom lane