Re: efficient way to do "fuzzy" join

From: Rémi Cura
Subject: Re: efficient way to do "fuzzy" join
Date:
Msg-id: CAJvUf_s-N7eDJ9mSNUdkmNv4rMJ7iz38sa5Xi_=vfXCNR9ftbQ@mail.gmail.com
In reply to: Re: efficient way to do "fuzzy" join  (Andy Colson <andy@squeakycode.net>)
Responses: Re: efficient way to do "fuzzy" join  (Rémi Cura <remi.cura@gmail.com>)
List: pgsql-general



2014-04-12 15:04 GMT+02:00 Andy Colson <andy@squeakycode.net>:
On 04/12/2014 06:29 AM, Rémi Cura wrote:
(please note that this random string function is NOT the good way to
do it; I should generate a random int and then use it as an index into
an array containing all the letters)
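That indexing idea might look like the following sketch (my illustration, not the function from the thread; `string_to_array` with a NULL delimiter splits a string into single characters, and PostgreSQL arrays are 1-based):

```sql
-- Pick one random lowercase letter by using a random int
-- as a 1-based index into an array of all the letters.
SELECT (string_to_array('abcdefghijklmnopqrstuvwxyz', NULL))
       [1 + floor(random() * 26)::int] AS random_letter;
```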

Thanks a lot for this new version! It seems to be slower than your
first solution (no index use, I guess; I gave up after 5 minutes vs. 5
sec for the previous). Moreover, I can't make assumptions about a
fixed interval (2 sec in your example). But I think I see where you
are going.


After some tests, the fastest is using BETWEEN on a range (it is way
faster than using <@, strangely).

Here is the code:
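The code itself is not quoted here, so this is only a sketch of what a BETWEEN-style interval join could look like, with table and column names (`a.t`, `b.t`, `gid`) assumed from the rest of the thread:

```sql
-- Attach each b row to the interval between consecutive a rows,
-- testing plain lower/upper bounds instead of range containment.
SELECT b.gid, b.t, bounds.t_lower, bounds.t_upper
FROM b
JOIN (
    SELECT gid, t AS t_lower,
           lead(t) OVER (ORDER BY t) AS t_upper
    FROM a
) AS bounds
  ON b.t >= bounds.t_lower AND b.t < bounds.t_upper;
```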

Ah, sorry about that.  I got pulled away to work on work stuff.  I was trying to figure out how to use an index on the range query, but I'm not sure it would even work without adding a new column.

I've never had the need for ranges yet, this is the first time I've gotten to play with them.

I would not have thought about between like that, good call.  I'd have never guessed it would be so fast.


If you can't use the fixed interval, then ranges are out.

I was thinking this could be improved:


select b.t,
  (select a.t from a where a.t >= b.t order by a.t asc  limit 1) as mint,
  (select a.t from a where a.t <  b.t order by a.t desc limit 1) as maxt
from b;

It does two selects into a to find the nearest.  Given this:

create table a(t float);


insert into a values (1), (5), (6);

could you write a single query to find the number nearest 3.5?  If so we might cut the work by 50%.
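For the literal question, one single-statement sketch is to order by absolute distance, though note that `abs(t - 3.5)` cannot use a plain b-tree index on `t`, so on a large table this is a sequential scan:

```sql
-- Nearest value to 3.5: order by absolute distance to the target.
SELECT t
FROM a
ORDER BY abs(t - 3.5)
LIMIT 1;
```

With a(t) = {1, 5, 6} this returns 5 (distance 1.5 vs. 2.5 for the others). An index-friendly alternative keeps the two limited probes but wraps them in a UNION ALL and picks the closer of the two candidates, so it still does two index lookups rather than one.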

-Andy

PS: This list prefers you don't top post.

Hey,
the best I can come up with using your original idea is:
------------------------------------------------------------------
--fast-ish: 10 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT lower_b_a.gid AS gid_b, lower_b_a.t AS t_b --, lower_b_a.data AS data_b
    , lower_b_a.gid_l_b     AS gid_a_lower, a1.t AS t_a_lower --, a1.data AS data_a_lower
    , lower_b_a.gid_l_b - 1 AS gid_a_upper, a2.t AS t_a_upper --, a2.data AS data_a_upper
FROM (
        SELECT b.gid, b.t
            , (SELECT gid FROM a WHERE a.t >= b.t ORDER BY a.t ASC LIMIT 1) AS gid_l_b
        FROM b
     ) AS lower_b_a
    LEFT OUTER JOIN a AS a1 ON (a1.gid = lower_b_a.gid_l_b)
    LEFT OUTER JOIN a AS a2 ON (a2.gid = lower_b_a.gid_l_b - 1);
-------------------------------------------------------------------

As you suggested, it doesn't read the table twice, but only once (to find the closest lower value). The closest upper value is found by knowing it is in the row adjacent to the closest lower value.

Yet it is still slower :-/

The way to go seems to be the numrange.
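A numrange version of the join might look like this sketch (my guess at the shape; the `float` values are cast to `numeric` because `numrange` requires numeric bounds):

```sql
-- Build half-open ranges [t, next t) over a, then join b rows that
-- fall inside each range using the containment operator.
SELECT b.gid, b.t, rng.r
FROM (
    SELECT numrange(t::numeric,
                    lead(t) OVER (ORDER BY t)::numeric,
                    '[)') AS r
    FROM a
) AS rng
JOIN b ON b.t::numeric <@ rng.r;
```

As measured earlier in the thread, the BETWEEN form of the same bounds test came out faster than `<@`, so the operator choice matters even when the logic is equivalent.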

Thanks a lot for the help in this optimization !

Cheers,

Rémi-C
