Discussion: Very specialised query
So, I have a query that has a very great difference between the possible performance and the achieved performance. I was wondering if there was a possibility that Postgres would approach the possible performance by some devious method.

The query returns a list of overlaps between objects. Each object is defined by a start position, an end position, and a foreign key to the thing it is located on. It's genes on chromosomes, in case you were wondering. The relevant parts of the location table are as follows:

release-16.0-preview-14-mar=# \d location
      Table "public.location"
     Column      |  Type   | Modifiers
-----------------+---------+-----------
 end             | integer |
 start           | integer |
 objectid        | integer |
 id              | integer | not null
Indexes:
    "location__object" btree (objectid, id)
    "location__start" btree (start)
    "location_bioseg" gist (bioseg_create(intermine_start, intermine_end))

The table has 3490079 entries with 21875 distinct objectids, although the majority of entries are concentrated on about ten distinct objectids. The entire table is in cache.

The query to run is:

SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = l2.objectid
  AND l1.id <> l2.id
  AND l1.start < l2.end
  AND l2.start < l1.end

The EXPLAIN result:

                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..180896163.93 rows=54169773639 width=8)
   Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end) AND (l2.start < l1.end))
   ->  Seq Scan on location l1  (cost=0.00..78076.79 rows=3490079 width=16)
   ->  Index Scan using location__object on location l2  (cost=0.00..24.43 rows=1369 width=16)
         Index Cond: (l2.objectid = l1.objectid)
(5 rows)

I could get an EXPLAIN ANALYSE, but it would take quite a few hours.
Now, we have a spatial gist index on (start, end), using the bioseg addon, which works great for single overlap lookups, and does speed up the above query if we alter it to have an equivalent meaning but use the bioseg function:

SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = l2.objectid
  AND l1.id <> l2.id
  AND bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end);

The EXPLAIN result:

                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..99982692.10 rows=4875280 width=8)
   Join Filter: ((l1.id <> l2.id) AND (l1.objectid = l2.objectid))
   ->  Seq Scan on location l1  (cost=0.00..78076.79 rows=3490079 width=16)
   ->  Index Scan using location_bioseg on location l2  (cost=0.00..12.92 rows=698 width=16)
         Index Cond: (bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end))
(5 rows)

This query takes about two hours.

Now, it happens that there is an algorithm for calculating overlaps which is really quick. It involves iterating through the table in order of the start variable and keeping a list of ranges which "haven't ended yet". When you read the next range from the table, you first purge all the ranges from the list that end before the beginning of the new range. Then you output a result row for each element in the list combined with the new range, and then you add the new range to the list.

This algorithm just doesn't seem to fit into SQL at all. However, it would reduce two hours down to a few seconds. Any chance of it running in Postgres, or any other tips?

Matthew

--
Hi! You have reached 555-0129. None of us are here to answer the phone and the cat doesn't have opposing thumbs, so his messages are illegible. Please leave your name and message after the beep ...
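[Editor's illustration] The sweep-line algorithm described above can be sketched outside SQL. This is a minimal sketch in Python, purely for clarity: the (id, objectid, start, end) tuple layout is an assumption, ranges are assumed to satisfy start < end, and both orderings of each pair are emitted to match the symmetric output of the SQL query.

```python
from collections import defaultdict

def find_overlaps(rows):
    """Sweep-line overlap finder.

    rows: iterable of (id, objectid, start, end) tuples.
    Returns every ordered pair (id1, id2) of distinct rows on the same
    objectid whose ranges overlap (start1 < end2 AND start2 < end1).
    """
    by_object = defaultdict(list)
    for row in rows:
        by_object[row[1]].append(row)
    pairs = []
    for obj_rows in by_object.values():
        obj_rows.sort(key=lambda r: r[2])   # iterate in order of start
        active = []                         # ranges that "haven't ended yet"
        for rid, _, start, end in obj_rows:
            # Purge ranges that end before the beginning of the new range.
            active = [a for a in active if a[2] > start]
            # Output a pair for each surviving range combined with the new one.
            for other in active:
                pairs.append((other[0], rid))
                pairs.append((rid, other[0]))
            active.append((rid, start, end))
    return pairs
```

On sorted input this is a single pass per objectid, which is why it can turn an hours-long join into seconds when overlaps are sparse.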
Matthew Wakeling <matthew@flymine.org> wrote: > any other tips? I would try adding an index on (objectid, start) and another on (objectid, end) and see how that first query does. Also, if id is a unique identifier, I'd recommend a unique constraint or (better) a primary key definition. Check the system tables to see whether all the existing indexes are actually being used -- if not, drop them. -Kevin
Matthew Wakeling <matthew@flymine.org> writes: > This query takes about two hours. > Now, it happens that there is an algorithm for calculating overlaps which > is really quick. It involves iterating through the table in order of the > start variable and keeping a list of ranges which "haven't ended yet". > When you read the next range from the table, you firstly purge all the > ranges from the list that end before the beginning of the new range. Then, > you output a result row for each element in the list combined with the new > range, then you add the new range to the list. > This algorithm just doesn't seem to fit into SQL at all. No, it doesn't. Have you thought about coding it in plpgsql? I have a feeling that it might be possible to do it using SQL:2003 recursive queries, but the procedural coding is likely to be easier to understand and better-performing. Not to mention that you won't have to wait for 8.4... regards, tom lane
On Thu, 26 Mar 2009, Tom Lane wrote: > No, it doesn't. Have you thought about coding it in plpgsql? *Looks* Nice. So, it looks like I would be able to write a plpgsql function that returns a table equivalent to the query I posted earlier. However, I'd like to eat my cake *and* have it. My intention is to create a view with those results, and then use that view in all sorts of other queries. This will mean things like constraining the chromosome, or even constraining one of the locations. The algorithm I quoted will work great for the simple case of generating *all* overlaps. However, it will not be ideal for when the chromosome is constrained (the constraint needs to be pushed into the query that the algorithm iterates over, rather than filtered after the algorithm runs), and it will be very much less than ideal when one of the locations is constrained (at which point a simple bio_seg index lookup is the fastest way). Is there a way to define these three methods of generating the results and get the planner to choose the fastest one? Matthew -- Beware of bugs in the above code; I have only proved it correct, not tried it. --Donald Knuth
On Thu, 26 Mar 2009, I wrote:
> release-16.0-preview-14-mar=# \d location
>       Table "public.location"
>      Column      |  Type   | Modifiers
> -----------------+---------+-----------
>  end             | integer |
>  start           | integer |
>  objectid        | integer |
>  id              | integer | not null
> Indexes:
>     "location__object" btree (objectid, id)
>     "location__start" btree (start)
>     "location_bioseg" gist (bioseg_create(intermine_start, intermine_end))

So, it would be useful if we could make the location_bioseg index a multi-column index, like this:

CREATE INDEX location_bioseg3 ON location
    USING GIST (objectid, bioseg_create(intermine_start, intermine_end));

However, I get the following error message:

ERROR:  data type integer has no default operator class for access method "gist"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

Is there an operator class for integer for gist indexes that I can use?

Matthew

--
And why do I do it that way? Because I wish to remain sane. Um, actually, maybe I should just say I don't want to be any worse than I already am.
        - Computer Science Lecturer
Matthew Wakeling <matthew@flymine.org> writes: > Is there an operator class for integer for gist indexes that I can use? See contrib/btree_gist. regards, tom lane
You could try adding "AND l2.start > l1.start" to the first query. This will drop the symmetric half of the intersections (the ones that remain are those where l2 is inside or to the right of l1), but you can restore the full result with
"SELECT id1, id2 ... UNION ALL SELECT id2, id1 ...", and it may allow the start index to be used for a "between" scan. For my look-alike test this gives the following:
"  ->  Index Scan using location__start on location l2  (cost=0.00..756.34 rows=37787 width=12)"
"        Index Cond: ((l2.start < l1.end) AND (l2.start > l1.start))"
An index on (objectid, start) would also help, resulting in:
"  ->  Index Scan using lt on location l2  (cost=0.00..0.84 rows=20 width=16)"
"        Index Cond: ((l2.objectid = l1.objectid) AND (l2.start < l1.end) AND (l2.start > l1.start))"
Best regards,
Vitalii Tymchyshyn
On Fri, 27 Mar 2009, Віталій Тимчишин wrote:
> ...an index on (objectid, start) would help...

Definitely.

> You could try adding "AND l2.start > l1.start" to the first query. This will drop symmetric half of intersections (the ones that will remain are l2 inside or to the left of l1), but you can redo results by id1,id2 union all id2, id1 and may allow to use start index for "between"

That's remarkably clever, and should have been obvious. Thanks. Adding the constraint as you say does indeed make the query fast.

However, there seems to be a problem with the planner, in that it does not distinguish between the costs of two alternative plans which have vastly different real costs. Consider the following:

SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 228000093
  AND l2.objectid = 228000093
  AND l1.id <> l2.id
  AND l1.start < l2.end
  AND l1.end > l2.start
  AND l1.start < l2.start
UNION ALL
SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 228000093
  AND l2.objectid = 228000093
  AND l1.id <> l2.id
  AND l1.start < l2.end
  AND l1.end > l2.start
  AND l1.start >= l2.start;

                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..20479179.74 rows=138732882 width=8)
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1  (cost=0.00..55966.07 rows=43277 width=12)
               Index Cond: (objectid = 228000093)
         ->  Index Scan using location_test_obj_start on location l2  (cost=0.00..123.10 rows=4809 width=12)
               Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start < l2.start))
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1  (cost=0.00..55966.07 rows=43277 width=12)
               Index Cond: (objectid = 228000093)
         ->  Index Scan using location_test_obj_start on location l2  (cost=0.00..123.10 rows=4809 width=12)
               Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start >= l2.start))
(13 rows)

Notice the two different index conditions:

(l1.end > l2.start) AND (l1.start < l2.start)   - "between"
(l1.end > l2.start) AND (l1.start >= l2.start)  - open-ended

Both have a cost of (cost=0.00..123.10 rows=4809 width=12).

Postgres estimates these two index scans to be equivalent in cost, where they are actually vastly different in real cost. Shouldn't Postgres favour a "between" index scan over an open-ended one?

Matthew

--
[About NP-completeness] These are the problems that make efficient use of the Fairy Godmother.
        -- Computer Science Lecturer
Matthew Wakeling <matthew@flymine.org> writes: > Notice the two different index conditions: > (l1.end > l2.start) AND (l1.start < l2.start) - "between" > (l1.end > l2.start) AND (l1.start >= l2.start) - open-ended > Both have a cost of (cost=0.00..123.10 rows=4809 width=12) > Postgres estimates these two index scans to be equivalent in cost, where > they are actually vastly different in real cost. Shouldn't Postgres favour > a "between" index scan over an open-ended one? Currently the planner only notices that for a range check that involves comparisons of the same variable expression to two constants (or pseudoconstants anyway). In principle it might be reasonable to have a heuristic that reduces the estimated selectivity in the example above, but it looks to me like it'd make clauselist_selectivity() a lot slower and more complicated. When you see (l1.end > l2.start), how do you know which variable to try to match up against others? And if you try to match both, what do you do when you get matches for both? regards, tom lane
Hi,

On 26 Mar 2009, at 15:30, Matthew Wakeling wrote:
> Now, it happens that there is an algorithm for calculating overlaps which is really quick. It involves iterating through the table in order of the start variable and keeping a list of ranges which "haven't ended yet". When you read the next range from the table, you firstly purge all the ranges from the list that end before the beginning of the new range. Then, you output a result row for each element in the list combined with the new range, then you add the new range to the list.
>
> This algorithm just doesn't seem to fit into SQL at all.

Maybe it's just that I didn't devote enough time to reading your detailed explanation above, but this part sounds like it could be done in an aggregate you'd use in a correlated subquery containing the right ORDER BY, couldn't it?

http://www.postgresql.org/docs/8.3/interactive/sql-createaggregate.html

HTH,
--
dim
Hello,

if your data are mostly static and you have a few main objects,
maybe you can get some gain by defining partial (conditional) indexes for those, plus one for the rest,
and then slicing the query:
create index o_1x on X (start,end,id) where object_id = 1
create index o_2x on X (start,end,id) where object_id = 2
create index o_3x on X (start,end,id) where object_id = 3
create index o_4x on X (start,end,id) where object_id = 4
...
create index o_restx on X (start,end,id) where object_id not in (1,2,3,4..)
I'm not sure that putting all in one index and using the BETWEEN clause
as in my example is the best method though.
Marc Mamin
SELECT
l1.id AS id1,
l2.id AS id2
FROM
location l1,
location l2
WHERE l1.objectid = 1
AND (l2.start BETWEEN l1.start AND l1.end
OR
l1.start BETWEEN l2.start AND l2.end
)
AND l1.start <> l1.end -- if required
AND l2.start <> l2.end -- if required
AND l1.id <> l2.id
UNION ALL
...
WHERE l1.objectid = 2
...
UNION ALL
...
WHERE l1.objectid not in (1,2,3,4..)
On Fri, 27 Mar 2009, Tom Lane wrote: >> Notice the two different index conditions: >> (l1.end > l2.start) AND (l1.start < l2.start) - "between" >> (l1.end > l2.start) AND (l1.start >= l2.start) - open-ended >> Both have a cost of (cost=0.00..123.10 rows=4809 width=12) > Currently the planner only notices that for a range check that involves > comparisons of the same variable expression to two constants (or > pseudoconstants anyway). In principle it might be reasonable to have a > heuristic that reduces the estimated selectivity in the example above, > but it looks to me like it'd make clauselist_selectivity() a lot slower > and more complicated. When you see (l1.end > l2.start), how do you know > which variable to try to match up against others? And if you try to > match both, what do you do when you get matches for both? Those two index conditions are on an index scan on the field l2.start. Therefore, I would expect to only have to take any notice of l2.start when working out selectivity on a range check for this particular plan. When there is an index scan on a different field, then try and match that one up instead. Matthew --
On Fri, 27 Mar 2009, Dimitri Fontaine wrote: > Maybe it's just that I didn't devote enough time to reading your detailed > explanation above, but this part sounds like it could be done in an aggregate > you'd use in a correlated subquery containing the right ORDER BY, couldn't > it? > http://www.postgresql.org/docs/8.3/interactive/sql-createaggregate.html Can you return multiple rows from an aggregate function? Matthew --
On Fri, 27 Mar 2009, Marc Mamin wrote: > if your data are mostly static and you have a few mains objects, > maybe you can have some gain while defining conditional indexes for those plus one for the rest > and then slicing the query: Maybe. I thought about doing that. However, I am not convinced that would be much of a gain, and would require major rewriting of the queries, as you suggest. > WHERE (l2.start BETWEEN l1.start AND l1.end > OR > l1.start BETWEEN l2.start AND l2.end > ) Yes, that's another way to calculate an overlap. However, it turns out to not be that fast. The problem is that OR there, which causes a bitmap index scan, as the leaf of a nested loop join, which can be rather slow. Matthew -- I'd try being be a pessimist, but it probably wouldn't work anyway.
>> WHERE (l2.start BETWEEN l1.start AND l1.end
>>        OR
>>        l1.start BETWEEN l2.start AND l2.end
>> )
>
> Yes, that's another way to calculate an overlap. However, it turns out to not be that fast. The problem is that OR there, which causes a bitmap index scan, as the leaf of a nested loop join, which can be rather slow.

OK, then splitting these checks into two queries with UNION is better. But I have often read that BETWEEN is faster than using two comparison operators. Here I guess that a combined index on (start, end) makes sense:

.. WHERE l2.start BETWEEN l1.start AND l1.end ..
UNION
.. WHERE l1.start BETWEEN l2.start AND l2.end ..

The first clause is equivalent to:

AND l1.start <= l2.end
AND l1.end >= l2.start
AND l1.start <= l2.start

I don't know how you have to deal with the boundary conditions...

Marc Mamin
>> Shouldn't Postgres favour a "between" index scan over an open-ended >> one? On Fri, 27 Mar 2009, Tom Lane wrote: > Currently the planner only notices that for a range check that involves > comparisons of the same variable expression to two constants (or > pseudoconstants anyway). In principle it might be reasonable to have a > heuristic that reduces the estimated selectivity in the example above, > but it looks to me like it'd make clauselist_selectivity() a lot slower > and more complicated. When you see (l1.end > l2.start), how do you know > which variable to try to match up against others? And if you try to > match both, what do you do when you get matches for both? Arguably, having multiple "greater than" constraints on a field should not affect the selectivity much, because the separate constraints will all be throwing away the same set of rows. So having a single "greater than" will halve the number of rows, while two "greater than"s will divide the number of rows by three, etc. That's a vast approximation of course, and should be skewed by the statistics. However, combining a "greater than" with a "less than" has a different effect on selectivity. If the numbers were completely random with identical value spread in each field, then a single "greater than" would halve the number of rows and an additional "less than" would halve the number again. However, in most real life situations the values will not be completely random, but will be very skewed, as in our case. Unless the planner starts collecting statistics on the standard distribution of the difference between two fields, that will be hard to account for though. 
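[Editor's illustration] Matthew's rule of thumb (one "greater than" halves the rows, two divide them by three) can be checked with a quick simulation. This is purely illustrative, using uniform random values, which is exactly the idealised case the paragraph warns does not hold for skewed real data:

```python
import random

random.seed(42)
N = 200_000
a = [random.random() for _ in range(N)]
b = [random.random() for _ in range(N)]
c = [random.random() for _ in range(N)]

# One "greater than" keeps about half the rows...
one = sum(x > y for x, y in zip(a, b)) / N

# ...while two "greater than"s on the same column keep about a third,
# because a is the maximum of three i.i.d. values with probability 1/3.
two = sum(x > y and x > z for x, y, z in zip(a, b, c)) / N

print(one, two)
```

With skewed data (like the start/end positions here) the fractions can be wildly different, which is the whole problem with the planner's estimate.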
Have a look at the following EXPLAIN ANALYSE:

SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 228000093
  AND l2.objectid = 228000093
  AND l1.id <> l2.id
  AND l1.start < l2.end
  AND l1.end > l2.start
  AND l1.start < l2.start
UNION ALL
SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 228000093
  AND l2.objectid = 228000093
  AND l1.id <> l2.id
  AND l1.start < l2.end
  AND l1.end > l2.start
  AND l1.start >= l2.start;

                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..20479179.74 rows=138732882 width=8) (actual time=0.055..2362748.298 rows=258210 loops=1)
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8) (actual time=0.053..627.038 rows=99436 loops=1)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1  (cost=0.00..55966.07 rows=43277 width=12) (actual time=0.025..58.604 rows=43534 loops=1)
               Index Cond: (objectid = 228000093)
         ->  Index Scan using location_test_obj_start on location l2  (cost=0.00..123.10 rows=4809 width=12) (actual time=0.003..0.006 rows=2 loops=43534)
               Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start < l2.start))
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8) (actual time=0.041..2361632.540 rows=158774 loops=1)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1  (cost=0.00..55966.07 rows=43277 width=12) (actual time=0.009..80.814 rows=43534 loops=1)
               Index Cond: (objectid = 228000093)
         ->  Index Scan using location_test_obj_start on location l2  (cost=0.00..123.10 rows=4809 width=12) (actual time=0.012..29.823 rows=21768 loops=43534)
               Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start >= l2.start))
 Total runtime: 2363015.959 ms
(14 rows)

Note again the two leaf index scans that really matter for performance. On one of them, there is a difference, and the other is open-ended:

              Expected rows   Seen rows
 difference            4809           2
 open-ended            4809       21768

The first query fragment takes 700ms to execute, and the second takes about forty minutes. It is effectively doing a nested loop join with hardly any benefit gained from the indexes at all, over a simple index on objectid. I may as well make the query a lot simpler, and do:

SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 228000093
  AND l2.objectid = 228000093
  AND l1.id <> l2.id
  AND l1.start < l2.end
  AND l1.end > l2.start

Unless this particular issue is improved in the planner, I don't think this particular style of query is going to work for us. I know that the database could in theory return a result in about 1400ms. I'll see how close to that I can get with plpgsql.

Matthew

--
First law of computing: Anything can go wro
sig: Segmentation fault.  core dumped.
On Mon, 30 Mar 2009, Marc Mamin wrote: > But I often read that BETWEEN is faster than using 2 comparison operators. http://www.postgresql.org/docs/current/static/functions-comparison.html says otherwise. > a BETWEEN x AND y > > is equivalent to > > a >= x AND a <= y > > There is no difference between the two respective forms apart from the > CPU cycles required to rewrite the first one into the second one > internally. Matthew -- Don't worry! The world can't end today because it's already tomorrow in Australia.
On Mon, 30 Mar 2009, Віталій Тимчишин wrote: > select > case when n == 1 then id1 else id2 end, > case when n == 2 then id1 else id2 end > > from ( > ... a, (values (1),(2)) b(n) Yeah, that's nice. However, it is still the case that we can't trust the database to choose the correct plan. It is currently only choosing the correct plan now by chance, and some time later it may by chance switch to one that takes 40 minutes. I'll certainly add it to my list of possibilities. Matthew -- Jadzia: Don't forget the 34th rule of acquisition: Peace is good for business. Quark: That's the 35th. Jadzia: Oh yes, that's right. What's the 34th again? Quark: War is good for business. It's easy to get them mixed up.
Look, what I meant by "symmetric" is that you don't need the second part of the query, because you will get the same results simply by
select
case when n = 1 then id1 else id2 end,
case when n = 2 then id1 else id2 end
from (
SELECT
l1.id AS id1,
l2.id AS id2
FROM
location l1,
location l2
WHERE
l1.objectid = 228000093
AND l2.objectid = 228000093
AND l1.id <> l2.id
AND l1.start < l2.end
AND l1.end > l2.start
AND l1.start < l2.start) a, (values (1),(2)) b(n)
(I may be missing some border cases, like when l1.start = l2.start and/or l1.end = l2.end, but this can be fixed by adding "=" to the query.)
Look, You can have 4 types of intersections:
a) 1s 2s 2e 1e - 2 inside 1
b) 2s 1s 1e 2e - 1 inside 2 (symmetric to (a), if you have 1,2 from (a) you can generate 2,1 for (b))
c) 1s 2s 1e 2e - 1 to the left of 2
d) 2s 1s 2e 1e - 2 to the left of 1 (symmetric to (c), if you have 1,2 from (c) you can generate 2,1 for (d))
The query above gives you results for (a) and (c) and you don't need any second part - simply add "symmetric" results.
Correct me if I miss something.
Best Regards, Vitalii Tymchyshyn
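[Editor's illustration] Vitalii's symmetry argument can be checked mechanically. A small Python sketch with made-up intervals (the (id, start, end) tuple layout is an assumption), verifying that the half-join with l1.start <= l2.start, once mirrored, reproduces the full overlap join, and that ties in start produce the duplicates discussed later in the thread:

```python
def full_join(rows):
    # The symmetric overlap join: l1.id <> l2.id, l1.start < l2.end, l2.start < l1.end
    return {(a[0], b[0]) for a in rows for b in rows
            if a[0] != b[0] and a[1] < b[2] and b[1] < a[2]}

def half_join(rows):
    # Same join with the extra asymmetric constraint l1.start <= l2.start
    return [(a[0], b[0]) for a in rows for b in rows
            if a[0] != b[0] and a[1] < b[2] and b[1] < a[2] and a[1] <= b[1]]

def mirror(pairs):
    # Restore the dropped half by emitting (id2, id1) for every (id1, id2)
    return [p for pair in pairs for p in (pair, (pair[1], pair[0]))]
```

As a set, the mirrored half-join equals the full join; as a list, pairs with equal starts come out twice, so they need either a DISTINCT or the filter Matthew and Vitalii work out further down.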
Hello Matthew,
Another idea:
Are your objects limited to some smaller ranges of your whole interval?
If so, you may be able to reduce the ranges to search by using an additional table holding the min(start) and max(end) of each object...
Marc Mamin
> Yeah, that's nice.
> However, it is still the case that we can't trust the database to choose the correct plan. It is currently only choosing the correct plan now by chance, and some time later it may by chance switch to one that takes 40 minutes.
What is the bad plan? Is it like the first plan from your first message?
You can sometimes tweak the optimizer to make sure it chooses the correct plan. E.g. when your database fits in memory, you can tweak the page access costs. Also, don't forget to raise the statistics target.
BTW, about aggregates: they can return arrays, but I can't imagine what you could group by on... Maybe windowing functions from 8.4 could help.
Also, if your maximum length (select max(end - start) from location) is low enough, you can try adding some more constraints to make the optimizer happy (give it a more precise row count so that it selects the correct plan).
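[Editor's illustration] The idea behind Vitalii's last suggestion can be sketched with hypothetical data: if maxlen = max(end - start) is known, adding the band constraint "l2.start > l1.start - maxlen" cannot change the result of the overlap join (any overlapping l2 must start within maxlen of l1.start), while giving the planner a two-sided range instead of an open-ended one:

```python
def overlap_pairs(rows, extra=lambda a, b: True):
    # rows are (id, start, end); the basic join is l1.start < l2.end AND l2.start < l1.end
    return {(a[0], b[0]) for a in rows for b in rows
            if a[0] != b[0] and a[1] < b[2] and b[1] < a[2] and extra(a, b)}

rows = [(1, 0, 5), (2, 3, 8), (3, 7, 9), (4, 15, 30)]
maxlen = max(end - start for _, start, end in rows)

# Band constraint: l2.start must lie in (l1.start - maxlen, l1.end)
banded = overlap_pairs(rows, extra=lambda a, b: b[1] > a[1] - maxlen)
assert banded == overlap_pairs(rows)
```

The catch, as Matthew's figures below show, is that this only helps when maxlen is small relative to the key space.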
On Mon, 30 Mar 2009, Marc Mamin wrote:
> Are your objects limited to some smaller ranges of your whole interval?
> If yes you may possibly reduce the ranges to search for while using an additional table with the min(start) max(end) of each object...

No, they aren't. However, even if they were, that would not actually speed up the query at all. We are already looking up in the index by objectid, and that would be an equivalent constraint to limiting by the available range of start/end values.

I'm currently arguing with plpgsql over this problem, but it looks like it will run reasonably fast.

Matthew

--
If you're thinking "Oh no, this lecturer thinks Turing Machines are a feasible method of computation, where's the door?", then you are in luck. There are some there, there, and by the side there. Oxygen masks will not drop from the ceiling...
        -- Computer Science Lecturer
On Mon, 30 Mar 2009, Віталій Тимчишин wrote:
> select
>  case when n == 1 then id1 else id2 end,
>  case when n == 2 then id1 else id2 end
> from (
>  SELECT l1.id AS id1, l2.id AS id2
>  FROM location l1, location l2
>  WHERE ...
>    AND l1.start < l2.start) a, (values (1),(2)) b(n)

It is a nice idea. However, the planner gets the join the wrong way round:

select distinct
    case when n = 1 then id1 else id2 end,
    case when n = 1 then id2 else id1 end
FROM (
    select l1.id AS id1, l2.id AS id2
    FROM location l1, location l2
    WHERE l1.id <> l2.id
      AND l1.objectid = l2.objectid
      AND l1.start <= l2.end
      AND l2.start <= l1.end
      AND l1.start <= l2.start
) AS a, (values (1), (2)) b(n);

                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=7366497963.75..7637346831.94 rows=36113182426 width=12) (actual time=1642178.623..2206678.691 rows=139606782 loops=1)
   ->  Sort  (cost=7366497963.75..7456780919.81 rows=36113182426 width=12) (actual time=1642178.619..1899057.147 rows=166377424 loops=1)
         Sort Key: (CASE WHEN ("*VALUES*".column1 = 1) THEN l1.subjectid ELSE l2.subjectid END), (CASE WHEN ("*VALUES*".column1 = 1) THEN l2.subjectid ELSE l1.subjectid END)
         Sort Method: external merge  Disk: 3903272kB
         ->  Nested Loop  (cost=0.00..592890483.66 rows=36113182426 width=12) (actual time=85.333..984211.011 rows=166377424 loops=1)
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=4) (actual time=0.002..0.008 rows=2 loops=1)
               ->  Nested Loop  (cost=0.00..25596373.62 rows=18056591213 width=8) (actual time=42.684..322743.335 rows=83188712 loops=2)
                     Join Filter: ((l1.subjectid <> l2.subjectid) AND (l1.intermine_start <= l2.intermine_end))
                     ->  Seq Scan on location l1  (cost=0.00..78076.79 rows=3490079 width=16) (actual time=0.008..3629.672 rows=3490079 loops=2)
                     ->  Index Scan using location_test_obj_start on location l2  (cost=0.00..3.89 rows=152 width=16) (actual time=0.005..0.038 rows=25 loops=6980158)
                           Index Cond: ((l2.objectid = l1.objectid) AND (l2.intermine_start <= l1.intermine_end) AND (l1.intermine_start <= l2.intermine_start))
 Total runtime: 2339619.383 ms

The outer nested join has the VALUES as the main loop, and the complicated join as the leaf. So, the complicated overlap-finding join gets run twice. Oh, there's also the great big sort and unique, but I think I can get rid of that.

Matthew

--
Contrary to popular belief, Unix is user friendly. It just happens to be very selective about who its friends are.
        -- Kyle Hearn
> The outer nested join has the VALUES as the main loop, and the complicated join as the leaf. So, the complicated overlap-finding join gets run twice.
That's weird. What do you have as your statistics target? The planner is off by a few orders of magnitude, so increasing it may help.
BTW, one of the constraints is redundant: l1.start <= l2.start implies l1.start <= l2.end, so the latter can be removed, as far as I can see.
> Oh, there's also the great big sort and unique, but I think I can get rid of that.
As far as I can see, duplicates will occur if and only if l1.start == l2.start && l1.end == l2.end.
That can be easily filtered by adding "where n=1 or l1.start != l2.start or l1.end != l2.end" to outer select.
On Wed, 1 Apr 2009, Віталій Тимчишин wrote:
>> The outer nested join has the VALUES as the main loop, and the complicated join as the leaf. So, the complicated overlap-finding join gets run twice.
>
> That's weird. What do you have as statistics target? Planner is incorrect few orders of magnitude, so increasing it may help.

Unfortunately, the statistics are skewed, so increasing the statistics target won't help. The problem is this:

select avg(end - start), stddev_pop(end - start), min(start), max(start) from location;
          avg          |   stddev_pop   | min |   max
-----------------------+----------------+-----+----------
 1716.7503512098150214 | 24935.63375733 |   1 | 61544858
(1 row)

> Oh, there's also the great big sort and unique, but I think I can get rid of that.
>
> As far as I can see, duplicates will occur if and only if l1.start == l2.start && l1.end == l2.end.
> That can be easily filtered by adding "where n=1 or l1.start != l2.start or l1.end != l2.end" to outer select.

Close - duplicates will occur whenever l1.start == l2.start, so you filter them out by adding "where n = 1 OR l1.start <> l2.start".

Matthew

--
Lord grant me patience, and I want it NOW!
On Mon, 30 Mar 2009, Віталій Тимчишин wrote:
> What is the bad plan? Is it like the first plan from your first message?

It's the plan a few messages back. The UNION ALL query I showed effectively got the database to do it both ways round. It's the case that a "between" index scan will return far fewer rows than an open-ended index scan.

> BTW: About aggregates: they can return arrays, but I can't imagine what you can group by on... May be windowing functions from 8.4 could help.

A normal function seems the best way to go about this - they can return multiple rows. So, I have written a plpgsql function to calculate overlaps. It works reasonably quickly where there aren't that many overlaps. However, it seems to go very slowly when there are a large number of rows to return. I am considering recoding it as a C function instead.

1. The docs say that returning multiple rows from plpgsql waits until the whole lot are done before returning any. Does this happen with C functions too?
2. What sort of speedup would I be likely to see?
3. How do I RAISE NOTICE in a C function?

> Also, if your maximum length (select max(end-start) from location) is low enough, you can try adding some more constraints to make the optimizer happy (have it more precise row count to select correct plan).

Alas:

select min(start), max(start), min(end), max(end), max(end - start) from location;
 min |   max    | min |   max    |   max
-----+----------+-----+----------+----------
   1 | 61544858 |   1 | 61545105 | 21512431
(1 row)

Matthew

--
I suppose some of you have done a Continuous Maths course. Yes? Continuous Maths? <menacing stares from audience> Whoah, it was like that, was it!
        -- Computer Science Lecturer
On Wed, 1 Apr 2009, Matthew Wakeling wrote: > So, I have written a plpgsql function to calculate overlaps. It works > reasonably quickly where there aren't that many overlaps. However, it seems > to go very slowly when there are a large number of rows to return. In plpgsql, what happens about memory allocation? It's just, I'm generating and throwing away an awful lot of arrays. When do they get unallocated? Also, if I assign a variable (or an element of an array) to be the contents of another variable (which may be a primitive, array, or row), does it copy the contents or do it by reference? Matthew -- For those of you who are into writing programs that are as obscure and complicated as possible, there are opportunities for... real fun here -- Computer Science Lecturer
Trying to rewrite a plpgsql function in C. How do I do the equivalent of: FOR loc IN SELECT * FROM location ORDER BY objectid, intermine_start, intermine_end LOOP END LOOP; in a C function? Matthew -- I wouldn't be so paranoid if you weren't all out to get me!!
Matthew Wakeling wrote: > Trying to rewrite a plpgsql function in C. > > How do I do the equivalent of: > > FOR loc IN SELECT * FROM location ORDER BY objectid, intermine_start, > intermine_end LOOP > END LOOP; > > in a C function? Please create a new message to the list with a new subject line for a new question. Thanks. -- Craig Ringer