Discussion: optimize self-join query

optimize self-join query

From:
Ty Busby
Date:
I have a table that stores a very large starting number called epc_start_numeric and a quantity.  I've apparently built
the most inefficient query possible for doing the job I need: find out if any records overlap.  Imagine the
epc_start_numeric + quantity representing a block of numbers.  I need to find out if any of these blocks overlap.

Here is the table:

CREATE TABLE ftp_epc_audit
(
 record_id serial NOT NULL,
 sent_filename text NOT NULL,
 pcid text NOT NULL,
 tsid text NOT NULL,
 user_sec_id text NOT NULL,
 format_set text NOT NULL,
 format text NOT NULL,
 epc_start text NOT NULL,
 quantity integer,
 epc_decimal_start numeric(29)
)
WITH OIDS;
ALTER TABLE ftp_epc_audit OWNER TO postgres;

And the query is currently this:

SELECT '', 0, a.*, '', 0, b.*
FROM ftp_epc_audit_staging a, ftp_epc_audit_staging b
WHERE a.sent_filename <> b.sent_filename
AND a.quantity > 0
AND b.quantity > 0
AND a.locked = 1
AND b.locked = 1
AND (( a.epc_decimal_start BETWEEN b.epc_decimal_start AND b.epc_decimal_start + b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 BETWEEN b.epc_decimal_start AND b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 < b.epc_decimal_start AND a.epc_decimal_start + a.quantity - 1 > b.epc_decimal_start + b.quantity - 1 ))

The column sent_filename is the unique value used so that a record does not find itself.  I've got an index on
sent_filename, quantity, locked and epc_decimal_start.
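
If that is one composite index, its definition would look roughly like the sketch below (the index name, and the guess that it is a single composite index rather than several single-column ones, are mine):

-- Hypothetical definition; the real schema may use separate indexes instead.
CREATE INDEX ftp_epc_audit_staging_idx
    ON ftp_epc_audit_staging (sent_filename, quantity, locked, epc_decimal_start);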

The query runs fine with a very small number of records.  However, I need to process 60,000 records and this is taking
hours.  There must be a fundamental flaw in either the query or my overall design, because it doesn't seem like a
challenging task.

I've also tried a variant of this query which also takes several hours to run through 60,000 records.

SELECT * FROM ftp_epc_audit_staging a
WHERE a.quantity > 0
AND a.locked = 1
AND EXISTS ( SELECT TRUE FROM ftp_epc_audit_staging b WHERE b.locked = 1
AND b.quantity > 0 AND a.sent_filename <> b.sent_filename
AND (( a.epc_decimal_start BETWEEN b.epc_decimal_start AND b.epc_decimal_start + b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 BETWEEN b.epc_decimal_start AND b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 < b.epc_decimal_start AND a.epc_decimal_start + a.quantity - 1 > b.epc_decimal_start + b.quantity - 1 )))

I would really appreciate any thoughts on this.

Re: optimize self-join query

From:
Tom Lane
Date:
Ty Busby <tybusby@gmail.com> writes:
> I have a table that stores a very large starting number called epc_start_numeric and a quantity.  I've apparently
> built the most inefficient query possible for doing the job I need: find out if any records overlap.  Imagine the
> epc_start_numeric + quantity representing a block of numbers.  I need to find out if any of these blocks overlap.

Yeah, overlap is a hard problem.  Basically, Postgres doesn't have any
way to do your query short of comparing each row to each other row,
so the cost goes up as O(N^2).

If you know more than you've let on about the properties of the
intervals, you might be able to improve things.  For instance
if the intervals fall into nonoverlapping buckets then you could
add a constraint that the buckets of the two sides are equal.
Postgres is a lot better with equality join constraints than it
is with range constraints, so it would be able to match up rows
and only do the O(N^2) work within each bucket.
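
A rough sketch of that idea against the original staging table; the bucket width of 1,000,000 and the assumption that no block ever crosses a bucket boundary are purely illustrative:

-- Sketch only: assumes (for illustration) that every block fits entirely
-- inside one fixed-width bucket, so overlapping rows always share a bucket.
SELECT a.*, b.*
FROM ftp_epc_audit_staging a, ftp_epc_audit_staging b
WHERE a.sent_filename <> b.sent_filename
  AND a.locked = 1 AND b.locked = 1
  AND a.quantity > 0 AND b.quantity > 0
  -- the equality join on the bucket is what lets the planner hash/merge-join
  AND floor(a.epc_decimal_start / 1000000) = floor(b.epc_decimal_start / 1000000)
  -- range-overlap test, now evaluated only within each bucket
  AND a.epc_decimal_start < b.epc_decimal_start + b.quantity
  AND b.epc_decimal_start < a.epc_decimal_start + a.quantity;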

In the long run we might have better answers --- Jeff Davis has been
working on range types for years now, and one of the long-range goals
of that is to have smarter support for this type of problem.  But for
now, it's going to be painful.
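
A sketch of what that smarter support could look like, assuming a built-in numeric range type (called numrange here) with a GiST-indexable && overlap operator:

-- Sketch under the assumption that a numrange constructor and the &&
-- (overlaps) operator are available; bounds default to [lower, upper).
SELECT a.record_id, b.record_id
FROM ftp_epc_audit a
JOIN ftp_epc_audit b
  ON a.record_id < b.record_id
 AND numrange(a.epc_decimal_start, a.epc_decimal_start + a.quantity)
  && numrange(b.epc_decimal_start, b.epc_decimal_start + b.quantity);
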
        regards, tom lane


Re: optimize self-join query

From:
Lee Hachadoorian
Date:
On Tue, Oct 25, 2011 at 2:37 PM, Ty Busby <tybusby@gmail.com> wrote:
I have a table that stores a very large starting number called epc_start_numeric and a quantity.  I've apparently built the most inefficient query possible for doing the job I need: find out if any records overlap.  Imagine the epc_start_numeric + quantity representing a block of numbers.  I need to find out if any of these blocks overlap.

Here is the table:

CREATE TABLE ftp_epc_audit
(
 record_id serial NOT NULL,
 sent_filename text NOT NULL,
 pcid text NOT NULL,
 tsid text NOT NULL,
 user_sec_id text NOT NULL,
 format_set text NOT NULL,
 format text NOT NULL,
 epc_start text NOT NULL,
 quantity integer,
 epc_decimal_start numeric(29)
)
WITH OIDS;
ALTER TABLE ftp_epc_audit OWNER TO postgres;

Not having access to your dataset, I created the following test data:

DROP TABLE IF EXISTS ftp_epc_audit;
CREATE TABLE ftp_epc_audit (
 record_id serial NOT NULL,
 quantity integer,
 epc_decimal_start numeric(29)
)
WITH OIDS;
ALTER TABLE ftp_epc_audit ADD PRIMARY KEY(record_id);

INSERT INTO ftp_epc_audit (epc_decimal_start, quantity)
SELECT 
generate_series(5, 10000, 5) AS epc_decimal_start, 
round(ceiling(20 * random()) / ceiling(5 * random()))::int AS quantity
;

The strange final column creates something like a stepped Poisson distribution: most of the values of quantity are < 5, but they can go as high as 20. Since epc_decimal_start increases in steps of five, this ensures we have many records with nonoverlapping ranges and many that overlap one, two, or three adjacent ranges.
 

And the query is currently this:

SELECT '', 0, a.*, '', 0, b.* FROM ftp_epc_audit_staging a,
ftp_epc_audit_staging b
WHERE a.sent_filename <> b.sent_filename
AND a.quantity > 0
AND b.quantity > 0
AND a.locked = 1
AND b.locked = 1
AND (( a.epc_decimal_start BETWEEN b.epc_decimal_start AND b.epc_decimal_start + b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 BETWEEN b.epc_decimal_start AND b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 < b.epc_decimal_start AND a.epc_decimal_start + a.quantity - 1 > b.epc_decimal_start + b.quantity - 1 ))

I rewrote this to run against my pared down 2000 record test table as:

SELECT
*
FROM
ftp_epc_audit a CROSS JOIN ftp_epc_audit b 
WHERE 
a.record_id <> b.record_id AND
(( a.epc_decimal_start BETWEEN b.epc_decimal_start AND b.epc_decimal_start + b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 BETWEEN b.epc_decimal_start AND b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 < b.epc_decimal_start AND a.epc_decimal_start + a.quantity -1 > b.epc_decimal_start + b.quantity - 1 ))
;

The conditions inside AND ((…)) look unnecessarily complex. In fact, the last OR (…) clause can never evaluate to TRUE, but it does hurt query performance.

If x = b.epc_decimal_start
and y = b.epc_decimal_start + b.quantity - 1
and quantity is an integer restricted by an earlier clause to be > 0,
then x ≤ y.
Therefore there is no z such that z < x and z > y.

Testing this on the 2000 record table, dropping that clause alone made the query time drop from ~20 seconds to ~12 seconds. I also confirmed that dropping that clause returns the same number of records.

The final four lines can be further simplified to

…AND
a.epc_decimal_start < b.epc_decimal_start 
AND b.epc_decimal_start < a.epc_decimal_start + a.quantity

This says, for each record, compare only the records with a greater range start, and return any record whose range starts before my range stops. This also returns the same number of records as the previous version and on the 2000 record table runs in ~2.6 seconds.
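
Spelled out in full against the test table, the simplified query becomes:

SELECT
*
FROM
ftp_epc_audit a CROSS JOIN ftp_epc_audit b
WHERE
a.epc_decimal_start < b.epc_decimal_start
AND b.epc_decimal_start < a.epc_decimal_start + a.quantity
;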

There are further speed gains from using a WITH clause so that the range limits are only evaluated once (~2.0 seconds), and in using a window function to ORDER the recordset by range start, then compare the rank instead of the range start (~1.5 seconds). It ends up looking like this:

WITH rank_epc AS (
SELECT 
record_id,
epc_decimal_start AS range_low,
epc_decimal_start + quantity AS range_high,
rank() OVER (ORDER BY epc_decimal_start)
FROM
ftp_epc_audit
)
SELECT
*
FROM
rank_epc a CROSS JOIN rank_epc b WHERE a.rank < b.rank AND b.range_low < a.range_high
;

This ran on the 2000 record table in ~1.5 seconds and on a 20,000 record dataset in 172 seconds.

Finally, depending on your use case and what you want to do once you identify the problematic records, it might be useful to only compare adjacent ranges to see if they overlap. If adjacent ranges are then "fixed" so that they don't overlap anymore, there won't be any left that overlap two or three or more ranges. This can be accomplished with:

WITH rank_epc AS (
SELECT 
record_id,
epc_decimal_start AS range_low,
epc_decimal_start + quantity AS range_high,
rank() OVER (ORDER BY epc_decimal_start)
FROM
ftp_epc_audit
)
SELECT
*
FROM
rank_epc a JOIN rank_epc b ON (a.rank = b.rank - 1 AND b.range_low < a.range_high) 
;

This ran on the 20,000 record dataset in 300 milliseconds.

Hope this helps.

--Lee

--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/

Re: optimize self-join query

From:
Harald Fuchs
Date:
In article <BFC71945-8821-4BC9-8430-A8CACF8F3794@gmail.com>,
Ty Busby <tybusby@gmail.com> writes:

> I have a table that stores a very large starting number called
> epc_start_numeric and a quantity.  I've apparently built the most
> inefficient query possible for doing the job I need: find out if any
> records overlap.  Imagine the epc_start_numeric + quantity
> representing a block of numbers.  I need to find out if any of these
> blocks overlap.

If I understand you correctly, you want to compare numeric intervals.
On PgFoundry you can find an interval type like that called bioseg.
This type is GiST-indexable and thus may speed up your query.

Example:
 CREATE TABLE test2 (
   id serial NOT NULL,
   seg bioseg NOT NULL,
   PRIMARY KEY (id)
 );
 -- Fill test2 with a gazillion of rows
 CREATE INDEX test2_seg_ix ON test2 USING gist (seg);
 SELECT t1.id, t1.seg, t2.id, t2.seg
 FROM test2 t1
 JOIN test2 t2 ON t2.id != t1.id AND t2.seg && t1.seg;

You'll still need a seqscan for t1, but t2 will use an index scan.

You can even define a table constraint to prevent overlaps:
 ALTER TABLE test2 ADD CONSTRAINT test2_seg_ex EXCLUDE USING gist (seg WITH &&);



Re: optimize self-join query

From:
Ty Busby
Date:
Thanks to everyone for all the thoughtful responses.  They have been incredibly helpful.

Ended up optimizing per Lee, and creating buckets per Tom.  Thanks again.

On Oct 25, 2011, at 2:37 PM, Ty Busby wrote:

I have a table that stores a very large starting number called epc_start_numeric and a quantity.  I've apparently built
the most inefficient query possible for doing the job I need: find out if any records overlap.  Imagine the
epc_start_numeric + quantity representing a block of numbers.  I need to find out if any of these blocks overlap.