Обсуждение: [pg_trgm] Making similarity(?, ?) < ? use an index

Поиск

Список

Период

Сортировка

[pg_trgm] Making similarity(?, ?) < ? use an index

От

Greg Navis

Дата:

03 июня 2016 г., 13:35:42

Hey!

I'm playing with pg_trgm. It seems that `lhs % rhs` is _almost_ equivalent to `similarity(lhs, rhs) < show_limit()`. The difference that I noticed is that `%` uses a GIN index while `similarity` does not.

```

grn=# \d restaurants

Table "public.restaurants"

Column | Type | Modifiers

--------+------------------------+-----------

city | character varying(255) | not null

Indexes:

"restaurants_city_trgm_idx" gin (city gin_trgm_ops)

grn=# SELECT COUNT(*) FROM restaurants;

count

--------

515475

(1 row)

Time: 45.964 ms

grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE similarity(city, 'warsw') > show_limit();

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------

Seq Scan on restaurants (cost=0.00..11692.81 rows=171825 width=10) (actual time=16.436..665.062 rows=360 loops=1)

Filter: (similarity((city)::text, 'warsw'::text) > show_limit())

Rows Removed by Filter: 515115

Planning time: 0.139 ms

Execution time: 665.105 ms

(5 rows)

Time: 665.758 ms

```

My question is: is it possible to make `similarity` use the index? If not, is there a way to speed up the query above?

Best regards
--

Greg Navis

I help tech companies to scale Heroku-hosted Rails apps.

Free, biweekly scalability newsletter for SaaS CEOs

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Artur Zakirov

Дата:

03 июня 2016 г., 13:53:23

Hello.

As I know 'lhs % rhs' is equivalent to 'similarity(lhs, rhs) >=
show_limit()'.

And so your query should looks like this:

SELECT * FROM restaurants WHERE city % 'warsw';

And it should use index.

On 03.06.2016 13:35, Greg Navis wrote:
> Hey!
>
> I'm playing with pg_trgm. It seems that `lhs % rhs` is _almost_
> equivalent to `similarity(lhs, rhs) < show_limit()`. The difference that
> I noticed is that `%` uses a GIN index while `similarity` does not.
>
> ```
> grn=# \d restaurants
>          Table "public.restaurants"
>  Column |          Type          | Modifiers
> --------+------------------------+-----------
>  city   | character varying(255) | not null
> Indexes:
>     "restaurants_city_trgm_idx" gin (city gin_trgm_ops)
>
> grn=# SELECT COUNT(*) FROM restaurants;
>  count
> --------
>  515475
> (1 row)
>
> Time: 45.964 ms
> grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE similarity(city,
> 'warsw') > show_limit();
>                                                      QUERY PLAN
>
> --------------------------------------------------------------------------------------------------------------------
>  Seq Scan on restaurants  (cost=0.00..11692.81 rows=171825 width=10)
> (actual time=16.436..665.062 rows=360 loops=1)
>    Filter: (similarity((city)::text, 'warsw'::text) > show_limit())
>    Rows Removed by Filter: 515115
>  Planning time: 0.139 ms
>  Execution time: 665.105 ms
> (5 rows)
>
> Time: 665.758 ms
> ```
>
> My question is: is it possible to make `similarity` use the index? If
> not, is there a way to speed up the query above?
>
> Best regards
> --
> Greg Navis
> I help tech companies to scale Heroku-hosted Rails apps.
> Free, biweekly scalability newsletter for SaaS CEOs
> <http://www.gregnavis.com/newsletter/>
>


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Greg Navis

Дата:

03 июня 2016 г., 14:24:13

Artur, thanks for your reply. That's right, `%` does use the index. The goal of using `similarity(lhs, rhs) >= show_limit()` was to replace `show_limit()` with a custom, per-query limit. I noticed that the latter approach does _not_ use the index, hence my question:

grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE city % 'warsw';

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------

Bitmap Heap Scan on restaurants (cost=24.28..1319.36 rows=515 width=10) (actual time=96.081..96.456 rows=400 loops=1)

Recheck Cond: ((city)::text % 'warsw'::text)

Heap Blocks: exact=359

-> Bitmap Index Scan on restaurants_city_gist_trgm_idx (cost=0.00..24.15 rows=515 width=0) (actual time=96.030..96.030 rows=400 loops=1)

Index Cond: ((city)::text % 'warsw'::text)

Planning time: 0.211 ms

Execution time: 96.528 ms

(7 rows)

grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE similarity(city, 'warsw') >= show_limit();

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------

Seq Scan on restaurants (cost=0.00..11692.81 rows=171825 width=10) (actual time=14.520..692.520 rows=400 loops=1)

Filter: (similarity((city)::text, 'warsw'::text) >= show_limit())

Rows Removed by Filter: 515075

Planning time: 0.109 ms

Execution time: 692.560 ms

(5 rows)

If this functionality isn't supported then it might be a good idea for a contribution.

Best regards

On Fri, Jun 3, 2016 at 12:51 PM, Artur Zakirov <a.zakirov@postgrespro.ru> wrote:

Hello.

As I know 'lhs % rhs' is equivalent to 'similarity(lhs, rhs) >= show_limit()'.

And so your query should looks like this:

SELECT * FROM restaurants WHERE city % 'warsw';

And it should use index.

On 03.06.2016 13:35, Greg Navis wrote:
Hey!

I'm playing with pg_trgm. It seems that `lhs % rhs` is _almost_
equivalent to `similarity(lhs, rhs) < show_limit()`. The difference that
I noticed is that `%` uses a GIN index while `similarity` does not.

```
grn=# \d restaurants
Table "public.restaurants"
Column | Type | Modifiers
--------+------------------------+-----------
city | character varying(255) | not null
Indexes:
"restaurants_city_trgm_idx" gin (city gin_trgm_ops)

grn=# SELECT COUNT(*) FROM restaurants;
count
--------
515475
(1 row)

Time: 45.964 ms
grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE similarity(city,
'warsw') > show_limit();
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------
Seq Scan on restaurants (cost=0.00..11692.81 rows=171825 width=10)
(actual time=16.436..665.062 rows=360 loops=1)
Filter: (similarity((city)::text, 'warsw'::text) > show_limit())
Rows Removed by Filter: 515115
Planning time: 0.139 ms
Execution time: 665.105 ms
(5 rows)

Time: 665.758 ms
```

My question is: is it possible to make `similarity` use the index? If
not, is there a way to speed up the query above?

Best regards
--
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Greg Navis

I help tech companies to scale Heroku-hosted Rails apps.

Free, biweekly scalability newsletter for SaaS CEOs

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Artur Zakirov

Дата:

03 июня 2016 г., 15:00:11

Oh, I understand. It is because you want different limits for
restaurants and cinemas?

I see only one solution. It is custom extension, which will create
operator class similar to gin_trgm_ops and will depends on pg_trgm. In
gin_trgm_consistent() you can use your own limit variable.

As I know functions do not use indexes.

Of course I may be wrong. And somebody knows a better solution.

On 03.06.2016 14:24, Greg Navis wrote:
> Artur, thanks for your reply. That's right, `%` does use the index. The
> goal of using `similarity(lhs, rhs) >= show_limit()` was to replace
> `show_limit()` with a custom, per-query limit. I noticed that the latter
> approach does _not_ use the index, hence my question:
>
> grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE city % 'warsw';
>                                                                   QUERY
> PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on restaurants  (cost=24.28..1319.36 rows=515
> width=10) (actual time=96.081..96.456 rows=400 loops=1)
>    Recheck Cond: ((city)::text % 'warsw'::text)
>    Heap Blocks: exact=359
>    ->  Bitmap Index Scan on restaurants_city_gist_trgm_idx
>  (cost=0.00..24.15 rows=515 width=0) (actual time=96.030..96.030
> rows=400 loops=1)
>          Index Cond: ((city)::text % 'warsw'::text)
>  Planning time: 0.211 ms
>  Execution time: 96.528 ms
> (7 rows)
>
> grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE similarity(city,
> 'warsw') >= show_limit();
>                                                      QUERY PLAN
>
> --------------------------------------------------------------------------------------------------------------------
>  Seq Scan on restaurants  (cost=0.00..11692.81 rows=171825 width=10)
> (actual time=14.520..692.520 rows=400 loops=1)
>    Filter: (similarity((city)::text, 'warsw'::text) >= show_limit())
>    Rows Removed by Filter: 515075
>  Planning time: 0.109 ms
>  Execution time: 692.560 ms
> (5 rows)
>
> If this functionality isn't supported then it might be a good idea for a
> contribution.
>
> Best regards
>
> On Fri, Jun 3, 2016 at 12:51 PM, Artur Zakirov <a.zakirov@postgrespro.ru
> <mailto:a.zakirov@postgrespro.ru>> wrote:
>
>     Hello.
>
>     As I know 'lhs % rhs' is equivalent to 'similarity(lhs, rhs) >=
>     show_limit()'.
>
>     And so your query should looks like this:
>
>     SELECT * FROM restaurants WHERE city % 'warsw';
>
>     And it should use index.
>
>
>     On 03.06.2016 13:35, Greg Navis wrote:
>
>         Hey!
>
>         I'm playing with pg_trgm. It seems that `lhs % rhs` is _almost_
>         equivalent to `similarity(lhs, rhs) < show_limit()`. The
>         difference that
>         I noticed is that `%` uses a GIN index while `similarity` does not.
>
>         ```
>         grn=# \d restaurants
>                  Table "public.restaurants"
>          Column |          Type          | Modifiers
>         --------+------------------------+-----------
>          city   | character varying(255) | not null
>         Indexes:
>             "restaurants_city_trgm_idx" gin (city gin_trgm_ops)
>
>         grn=# SELECT COUNT(*) FROM restaurants;
>          count
>         --------
>          515475
>         (1 row)
>
>         Time: 45.964 ms
>         grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE
>         similarity(city,
>         'warsw') > show_limit();
>                                                              QUERY PLAN
>
>
--------------------------------------------------------------------------------------------------------------------
>          Seq Scan on restaurants  (cost=0.00..11692.81 rows=171825 width=10)
>         (actual time=16.436..665.062 rows=360 loops=1)
>            Filter: (similarity((city)::text, 'warsw'::text) > show_limit())
>            Rows Removed by Filter: 515115
>          Planning time: 0.139 ms
>          Execution time: 665.105 ms
>         (5 rows)
>
>         Time: 665.758 ms
>         ```
>
>         My question is: is it possible to make `similarity` use the
>         index? If
>         not, is there a way to speed up the query above?
>
>         Best regards
>         --
>         Greg Navis
>         I help tech companies to scale Heroku-hosted Rails apps.
>         Free, biweekly scalability newsletter for SaaS CEOs
>         <http://www.gregnavis.com/newsletter/>
>
>
>
>     --
>     Artur Zakirov
>     Postgres Professional: http://www.postgrespro.com
>     Russian Postgres Company
>
>
>
>
> --
> Greg Navis
> I help tech companies to scale Heroku-hosted Rails apps.
> Free, biweekly scalability newsletter for SaaS CEOs
> <http://www.gregnavis.com/newsletter/>
>


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

"David G. Johnston"

Дата:

03 июня 2016 г., 15:05:43

On Friday, June 3, 2016, Greg Navis <contact@gregnavis.com> wrote:

Hey!

I'm playing with pg_trgm. It seems that `lhs % rhs` is _almost_ equivalent to `similarity(lhs, rhs) < show_limit()`. The difference that I noticed is that `%` uses a GIN index while `similarity` does not.

```
grn=# \d restaurants
Table "public.restaurants"
Column | Type | Modifiers
--------+------------------------+-----------
city | character varying(255) | not null
Indexes:
"restaurants_city_trgm_idx" gin (city gin_trgm_ops)

grn=# SELECT COUNT(*) FROM restaurants;
count
--------
515475
(1 row)

Time: 45.964 ms
grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE similarity(city, 'warsw') > show_limit();
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Seq Scan on restaurants (cost=0.00..11692.81 rows=171825 width=10) (actual time=16.436..665.062 rows=360 loops=1)
Filter: (similarity((city)::text, 'warsw'::text) > show_limit())
Rows Removed by Filter: 515115
Planning time: 0.139 ms
Execution time: 665.105 ms
(5 rows)

Time: 665.758 ms
```

My question is: is it possible to make `similarity` use the index? If not, is there a way to speed up the query above?

No. Indexing is tied to operators.

I don't know which search terms would work best but I gave this same answer less than a week ago. List searching before asking is appreciated.

David J.

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Greg Navis

Дата:

03 июня 2016 г., 22:13:24

Thanks for answers and sorry for not searching hard enough.

I'm curious ... would it be difficult to modify PostgreSQL so that it'd use the index for `similarity(lhs, rhs) >= show_limit()` too? Or even add `is_similar(lhs, rhs, threshold)` that'd allow to change the threshold on a per-query basis. I might be able to block some time to contribute.

Best regards

Greg Navis

I help tech companies to scale Heroku-hosted Rails apps.

Free, biweekly scalability newsletter for SaaS CEOs

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

"David G. Johnston"

Дата:

03 июня 2016 г., 22:22:12

On Fri, Jun 3, 2016 at 3:13 PM, Greg Navis <contact@gregnavis.com> wrote:

Thanks for answers and sorry for not searching hard enough.

I'm curious ... would it be difficult to modify PostgreSQL so that it'd use the index for `similarity(lhs, rhs) >= show_limit()` too?

Not in a way that would be useful.

Or even add `is_similar(lhs, rhs, threshold)` that'd allow to change the threshold on a per-query basis. I might be able to block some time to contribute.

I can see that being a useful API to add to pg_trgm. While it wouldn't solve your indexing problem - it would at least make using cases that are already un-indexable easier to write and comprehend. The particular problem for the other poster was wanting two different values within the same query - which is impossible in the current setup but would be made possible with such a function.

I'm not sure how much effort the following would take but if we cannot change the tie between indexes and operators maybe we can introduce ternary operators that can be assigned to index opclasses.

Something like:

lhs % rhs # 40 => similarity(lhs, rhs, 70)

lhs % rhs # 70 => similarity(lhs, rhs, 70)

It would have the added benefit of allowing us to add the main ternary operator <?:> instead of convoluted CASE statements for verbose functional forms.

David J.

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Jeff Janes

Дата:

03 июня 2016 г., 22:27:08

On Fri, Jun 3, 2016 at 12:13 PM, Greg Navis <contact@gregnavis.com> wrote:
> Thanks for answers and sorry for not searching hard enough.
>
> I'm curious ... would it be difficult to modify PostgreSQL so that it'd use
> the index for `similarity(lhs, rhs) >= show_limit()` too?

Yes, that would be very difficult. The project has kind of painted
itself into a corner on that.

If it were easy, I doubt we would have added the % operator with the
ugly set_limit() wart in the first place (although I was not around at
the time that was done--maybe there were other considerations).

Cheers,

Jeff

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

"David G. Johnston"

Дата:

03 июня 2016 г., 22:46:37

On Fri, Jun 3, 2016 at 3:27 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Fri, Jun 3, 2016 at 12:13 PM, Greg Navis <contact@gregnavis.com> wrote:
> Thanks for answers and sorry for not searching hard enough.
>
> I'm curious ... would it be difficult to modify PostgreSQL so that it'd use
> the index for `similarity(lhs, rhs) >= show_limit()` too?

Yes, that would be very difficult. The project has kind of painted
itself into a corner on that.

If it were easy, I doubt we would have added the % operator with the
ugly set_limit() wart in the first place (although I was not around at
the time that was done--maybe there were other considerations).

Can you clarify?

As far pg_trgm goes its only option was/is to use a GUC if it wants the benefit of indexing. The set/show limit API is merely a syntactic convenience.

The cleanest API I can come up with giving present limitations is:

SELECT * FROM get_restaurants_by_similarity('warsw', 70)

-- you could make the second parameter optional or disallowed depending on how you want to enforce your selection policy.

The SQL queries in that SQL language function would be:

SET LOCAL .... = 70;

SELECT * FROM restaurants WHERE city % $1;

The later being returned as "SETOF restaurants"

You main problem here, then, is loss of optimization options.

The best solution would depend very much on how you plan to use these queries. You also have an option to execute dynamic SQL within a pl/pgsql function.

David J.

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Tom Lane

Дата:

03 июня 2016 г., 23:02:16

Jeff Janes <jeff.janes@gmail.com> writes:
> On Fri, Jun 3, 2016 at 12:13 PM, Greg Navis <contact@gregnavis.com> wrote:
>> I'm curious ... would it be difficult to modify PostgreSQL so that it'd use
>> the index for `similarity(lhs, rhs) >= show_limit()` too?

> Yes, that would be very difficult. The project has kind of painted
> itself into a corner on that.

Well, the thing that is not easy to change is that index scans require
qualifiers expressed as "indexed_column indexable_operator something".
But there's a lot of flexibility about what "something" is.  You could
imagine doing, say,

    foo ~~ similarity_rhs('bar', 0.01)

where the function just collects its arguments into some composite type
that we provide an indexable operator to compare strings to.

> If it were easy, I doubt we would have added the % operator with the
> ugly set_limit() wart in the first place (although I was not around at
> the time that was done--maybe there were other considerations).

I think that was just bad design.  There's a lot of old stuff in contrib
that hasn't been vetted all that closely.

            regards, tom lane

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Jeff Janes

Дата:

03 июня 2016 г., 23:16:21

On Fri, Jun 3, 2016 at 1:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> On Fri, Jun 3, 2016 at 12:13 PM, Greg Navis <contact@gregnavis.com> wrote:
>>> I'm curious ... would it be difficult to modify PostgreSQL so that it'd use
>>> the index for `similarity(lhs, rhs) >= show_limit()` too?
>
>> Yes, that would be very difficult. The project has kind of painted
>> itself into a corner on that.
>
> Well, the thing that is not easy to change is that index scans require
> qualifiers expressed as "indexed_column indexable_operator something".
> But there's a lot of flexibility about what "something" is.  You could
> imagine doing, say,
>
>         foo ~~ similarity_rhs('bar', 0.01)
>
> where the function just collects its arguments into some composite type
> that we provide an indexable operator to compare strings to.

Oh, interesting.  I was thinking we would need to have the operator
receive the parsed but unevaluated expression "tree" (which I didn't
think was possible to do) and then somehow rewrite that structure.
Collecting it into a composite and passing the composite to the
operator does open up a lot of possibilities I hadn't considered.

Cheers,

Jeff

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Greg Navis

Дата:

04 июня 2016 г., 12:50:53

Thanks for your replies.

Sorry for confusion. Instead of `similarity(lhs, rhs) >= show_limit()`, which of course is completely equivalent to `lhs % rhs`, I wanted to write `similarity(lhs, rhs) >= my_custom_threshold`. It seems that the approach with ternary operators is quite a bit of work. I might have a simpler idea:

pg_trgm also provides `<->` but it seems this operator doesn't use indexes either. It seems the shortest path to per-query thresholds, without compromising the design, is making this operator use the index. Please help me understand whether my reasoning is correct. If it is, I'd appreciate a high-level overview of what needs to be done. I can block a few hours to work on this in the upcoming weeks.

Best regards

Greg Navis

I help tech companies to scale Heroku-hosted Rails apps.

Free, biweekly scalability newsletter for SaaS CEOs

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Jeff Janes

Дата:

04 июня 2016 г., 21:48:18

On Sat, Jun 4, 2016 at 2:50 AM, Greg Navis <contact@gregnavis.com> wrote:
> Thanks for your replies.
>
> Sorry for confusion. Instead of `similarity(lhs, rhs) >= show_limit()`,
> which of course is completely equivalent to `lhs % rhs`, I wanted to write
> `similarity(lhs, rhs) >= my_custom_threshold`. It seems that the approach
> with ternary operators is quite a bit of work. I might have a simpler idea:

The problem is that you would have to build your index on the
expression (similarity(lhs,rhs)), and the then one of the lhs or rhs
would have to be a column, and the other would have to be a constant
string at the time you define the index, which is the same constant
string as specified when you want to use the index.  This could work
as long as there are only a few strings you ever care about checking
similarity to, and you would build an index for each one.  These would
be ordinary btree indexes, not gin indexes.   I've done this before,
only using substring rather than similarity.

> pg_trgm also provides `<->` but it seems this operator doesn't use indexes
> either. It seems the shortest path to per-query thresholds, without
> compromising the design, is making this operator use the index.

To be efficient, the index has to know where to start and stop its
scan.  Such a thing doesn't exists for "a<->b", there is no stopping
point.  It would have to be "a <-> b < c", at which point you are back
to ternary again.

> Please help
> me understand whether my reasoning is correct. If it is, I'd appreciate a
> high-level overview of what needs to be done. I can block a few hours to
> work on this in the upcoming weeks.

I think that you are proposing rewriting the indexing architecture
from scratch.  Doing that in a way that is robust and tested and
backward compatible would  probably take years, not hours, for a
single individual who is not already extensively experienced in the
internals of the code.  And, it is not really clear even what you want
to rewrite it into.

Tom's proposal is to do it in a way that works with the existing
architecture, rather than trying to re-write that architecture.
(Which could probably not be done in a few hours, either, especially
not once you write documentation and upgrade scripts and all the other
stuff that goes into production-quality code.)

I don't know if this would even be appropriate as an addition to
pg_trgm.  We might want to fork that code instead.  That would be a
shame, because the underlying c code would be the fundamentally the
same, but the alternative would be to force people who like % and
set_limit() to carry around the baggage of new operators and types
they have no interest in using, and vice versa.  True, we did just add
several new functions and operators to pg_trgm that many people will
have no interest in, so maybe that is not a big deal.

Cheers,

Jeff

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Tom Lane

Дата:

04 июня 2016 г., 22:49:18

Jeff Janes <jeff.janes@gmail.com> writes:
> I don't know if this would even be appropriate as an addition to
> pg_trgm.  We might want to fork that code instead.  That would be a
> shame, because the underlying c code would be the fundamentally the
> same, but the alternative would be to force people who like % and
> set_limit() to carry around the baggage of new operators and types
> they have no interest in using, and vice versa.  True, we did just add
> several new functions and operators to pg_trgm that many people will
> have no interest in, so maybe that is not a big deal.

It seems to me that the old-style and new-style operators could coexist
just fine; neither one ought to be a large increment of unsharable code.
(Granted, it might take some refactoring to make that so.)  So I think
forking would be a bad approach.

            regards, tom lane

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Greg Navis

Дата:

08 июня 2016 г., 12:16:28

Thanks for the replies.

On Sat, Jun 4, 2016 at 8:48 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Sat, Jun 4, 2016 at 2:50 AM, Greg Navis <contact@gregnavis.com> wrote:
> Thanks for your replies.
>
> Sorry for confusion. Instead of `similarity(lhs, rhs) >= show_limit()`,
> which of course is completely equivalent to `lhs % rhs`, I wanted to write
> `similarity(lhs, rhs) >= my_custom_threshold`. It seems that the approach
> with ternary operators is quite a bit of work. I might have a simpler idea:

The problem is that you would have to build your index on the
expression (similarity(lhs,rhs)), and the then one of the lhs or rhs
would have to be a column, and the other would have to be a constant
string at the time you define the index, which is the same constant
string as specified when you want to use the index. This could work
as long as there are only a few strings you ever care about checking
similarity to, and you would build an index for each one. These would
be ordinary btree indexes, not gin indexes. I've done this before,
only using substring rather than similarity.

> pg_trgm also provides `<->` but it seems this operator doesn't use indexes
> either. It seems the shortest path to per-query thresholds, without
> compromising the design, is making this operator use the index.

To be efficient, the index has to know where to start and stop its
scan. Such a thing doesn't exists for "a<->b", there is no stopping
point. It would have to be "a <-> b < c", at which point you are back
to ternary again.

Sorry, that was sloppy thinking on my end. I was fixated on looking for something simple and saw things that aren't there. ;-) You're absolutely right that we're back to ternary with `a <-> b < c`.

> Please help
> me understand whether my reasoning is correct. If it is, I'd appreciate a
> high-level overview of what needs to be done. I can block a few hours to
> work on this in the upcoming weeks.

I think that you are proposing rewriting the indexing architecture
from scratch. Doing that in a way that is robust and tested and
backward compatible would probably take years, not hours, for a
single individual who is not already extensively experienced in the
internals of the code. And, it is not really clear even what you want
to rewrite it into.

No rewrites, please. ;-) I was just confused. I'm looking for a simple, incremental contribution, not a rewrite.

Tom's proposal is to do it in a way that works with the existing
architecture, rather than trying to re-write that architecture.
(Which could probably not be done in a few hours, either, especially
not once you write documentation and upgrade scripts and all the other
stuff that goes into production-quality code.)

Thanks for the info. I blocked a few hours for open source work every week so whether it takes 10 or 40 hours shouldn't be a problem.

I don't know if this would even be appropriate as an addition to
pg_trgm. We might want to fork that code instead. That would be a
shame, because the underlying c code would be the fundamentally the
same, but the alternative would be to force people who like % and
set_limit() to carry around the baggage of new operators and types
they have no interest in using, and vice versa. True, we did just add
several new functions and operators to pg_trgm that many people will
have no interest in, so maybe that is not a big deal.

Cheers,

Jeff

Would this be a better plan then:

1. Add support for trigram operators.

2. Implement `issimilar(lhs, rhs, threshold)`.

3. Add `issimilar` to the trigram operator classes.

PS Should we move this discussion to pgsql-hackers?

Greg Navis

I help tech companies to scale Heroku-hosted Rails apps.

Free, biweekly scalability newsletter for SaaS CEOs

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Artur Zakirov

Дата:

08 июня 2016 г., 13:12:32

On 08.06.2016 12:16, Greg Navis wrote:
> Would this be a better plan then:
>
> 1. Add support for trigram operators.
> 2. Implement `issimilar(lhs, rhs, threshold)`.
> 3. Add `issimilar` to the trigram operator classes.

I think Tom's proposal with composite type is exelent option. If I
understand correctly it introduce a new function similarity_rhs(). You
can use it as the following:

SELECT * FROM restaurants WHERE city % similarity_rhs('warsw', 0.4);
SELECT * FROM cinemas WHERE name % similarity_rhs('warsw', 0.2);

This is what you need?

If I understand correctly your plan differs from Tom's proposal. And I
am afraid that you will do a waste work.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Greg Navis

Дата:

09 июня 2016 г., 11:57:17

Artur, no worries, I'm not writing any code ;-)

I did the following:

CREATE TYPE trgm_match AS (match TEXT, threshold NUMERIC);

CREATE OR REPLACE FUNCTION trgm_check_match (string TEXT, match trgm_match)

RETURNS bool

AS 'SELECT match.match <-> string <= 1 - match.threshold'

LANGUAGE SQL;

CREATE OPERATOR %(leftarg = text, rightarg = trgm_match, procedure=trgm_check_match);

This allows me to write:

SELECT ('Warsaw' % row('Warsw', 0.3)::trgm_match);

I'm not sure how to make this operator use an index. It seems I need to create an operator class but I'm not sure how. This is how pg_trgm creates its operator class:

-- create the operator class for gist

CREATE OPERATOR CLASS gist_trgm_ops

FOR TYPE text USING gist

OPERATOR 1 % (text, text),

FUNCTION 1 gtrgm_consistent (internal, text, smallint, oid, internal),

FUNCTION 2 gtrgm_union (internal, internal),

FUNCTION 3 gtrgm_compress (internal),

FUNCTION 4 gtrgm_decompress (internal),

FUNCTION 5 gtrgm_penalty (internal, internal, internal),

FUNCTION 6 gtrgm_picksplit (internal, internal),

FUNCTION 7 gtrgm_same (gtrgm, gtrgm, internal),

STORAGE gtrgm;

Should my operator class mimic the one above?

Greg Navis

I help tech companies to scale Heroku-hosted Rails apps.

Free, biweekly scalability newsletter for SaaS CEOs

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Tom Lane

Дата:

09 июня 2016 г., 16:56:14

Greg Navis <contact@gregnavis.com> writes:
> I'm not sure how to make this operator use an index. It seems I need to
> create an operator class but I'm not sure how.

What you'd want to do is add it to the existing operator class and then
teach the class's support functions (mostly, the "consistent" function)
about it.

            regards, tom lane

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Tom Lane

Дата:

09 июня 2016 г., 17:13:01

I wrote:
> Greg Navis <contact@gregnavis.com> writes:
>> I'm not sure how to make this operator use an index. It seems I need to
>> create an operator class but I'm not sure how.

> What you'd want to do is add it to the existing operator class and then
> teach the class's support functions (mostly, the "consistent" function)
> about it.

BTW, you'd probably find this patch instructive:

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=f576b17cd6ba653bdace1f0da9a3b57f4984e460

although it's doing more than just adding one operator.

            regards, tom lane

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Jeff Janes

Дата:

10 июня 2016 г., 19:20:15

On Thu, Jun 9, 2016 at 1:57 AM, Greg Navis <contact@gregnavis.com> wrote:
> Artur, no worries, I'm not writing any code ;-)
>
> I did the following:
>
> CREATE TYPE trgm_match AS (match TEXT, threshold NUMERIC);

I would probably use REAL, not NUMERIC.  But maybe there is good
reason to use NUMERIC.

> CREATE OR REPLACE FUNCTION trgm_check_match (string TEXT, match trgm_match)
>   RETURNS bool
>   AS 'SELECT match.match <-> string <= 1 - match.threshold'
>   LANGUAGE SQL;
> CREATE OPERATOR %(leftarg = text, rightarg = trgm_match,
> procedure=trgm_check_match);
>
> This allows me to write:
>
> SELECT ('Warsaw' % row('Warsw', 0.3)::trgm_match);
>
> I'm not sure how to make this operator use an index. It seems I need to
> create an operator class but I'm not sure how. This is how pg_trgm creates
> its operator class:

I think you should pick a new operator name, not try to reuse %.
Based on Tom's previous comment that forking is probably not a good
idea, you probably want the new operator to co-exist with the existing
one, so it needs a different name.  For example, I picked %% without
giving it a lot of thought for this example below.

>
> -- create the operator class for gist
> CREATE OPERATOR CLASS gist_trgm_ops
> FOR TYPE text USING gist
> AS
>         OPERATOR        1       % (text, text),
>         FUNCTION        1       gtrgm_consistent (internal, text, smallint,
> oid, internal),
>         FUNCTION        2       gtrgm_union (internal, internal),
>         FUNCTION        3       gtrgm_compress (internal),
>         FUNCTION        4       gtrgm_decompress (internal),
>         FUNCTION        5       gtrgm_penalty (internal, internal,
> internal),
>         FUNCTION        6       gtrgm_picksplit (internal, internal),
>         FUNCTION        7       gtrgm_same (gtrgm, gtrgm, internal),
>         STORAGE         gtrgm;
>
> Should my operator class mimic the one above?

Look down a few more stanzas in "contrib/pg_trgm/pg_trgm--1.3.sql",
where it keeps adding new operators and functions.  You will want to
add your own in that method.  All of those could be consolidated into
one CREATE OPERATOR CLASS statement, but you will eventually have to
implement both an upgrade script and an install-from-scratch script,
so that is why they are broken out this way, to make that easier.

For testing, I'd just add "OPERATOR 9 %% (text, trgm_match)" to the
above, then drop and recreate the extension.

Although I would start with gin rather than gist, both because I find
it more useful, and I am more familiar with it.  YMMV of course.

Once you do that, you will probably start getting errors from the
gtrgm_consistent C function (if not others in the list of functions
first) because it is being asked to evaluate a strategy it doesn't
understand.  So then the next step is to teach the C code how to deal
with it.

Cheers,

Jeff

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Jeff Janes

Дата:

10 июня 2016 г., 19:47:34

On Fri, Jun 10, 2016 at 9:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Jun 9, 2016 at 1:57 AM, Greg Navis <contact@gregnavis.com> wrote:
>> Artur, no worries, I'm not writing any code ;-)
>>
>> I did the following:
>>
>> CREATE TYPE trgm_match AS (match TEXT, threshold NUMERIC);
>
> I would probably use REAL, not NUMERIC.  But maybe there is good
> reason to use NUMERIC.
>
>> CREATE OR REPLACE FUNCTION trgm_check_match (string TEXT, match trgm_match)
>>   RETURNS bool
>>   AS 'SELECT match.match <-> string <= 1 - match.threshold'
>>   LANGUAGE SQL;

You will have to somehow prevent this from getting inlined.  If it is
inlined, then it will no longer be
recognized as being an indexable operator.  So maybe use plpgsql as
the language.


>> CREATE OPERATOR %(leftarg = text, rightarg = trgm_match,
>> procedure=trgm_check_match);
>>
>> This allows me to write:
>>
>> SELECT ('Warsaw' % row('Warsw', 0.3)::trgm_match);
>>
>> I'm not sure how to make this operator use an index. It seems I need to
>> create an operator class but I'm not sure how. This is how pg_trgm creates
>> its operator class:
>
> I think you should pick a new operator name, not try to reuse %.
> Based on Tom's previous comment that forking is probably not a good
> idea, you probably want the new operator to co-exist with the existing
> one, so it needs a different name.  For example, I picked %% without
> giving it a lot of thought for this example below.

On second thought, it could use overloading distinguished with
different argument types, so it doesn't need a different name, but I
don't know if it is a good idea to use that overloading.

Cheers,

Jeff

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Tom Lane

Дата:

10 июня 2016 г., 20:37:37

Jeff Janes <jeff.janes@gmail.com> writes:
> On Fri, Jun 10, 2016 at 9:20 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I think you should pick a new operator name, not try to reuse %.

> On second thought, it could use overloading distinguished with
> different argument types, so it doesn't need a different name, but I
> don't know if it is a good idea to use that overloading.

I would vote for overloading; there's no risk of confusion that I can see.

            regards, tom lane

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Greg Navis

Дата:

11 июня 2016 г., 13:47:39

I made some progress but I'm stuck. I'm focused on GiST for now. Please ignore sloppy naming for now.

I made the following changes to pg_trgm--1.2.sql:

CREATE TYPE pg_trgm_match AS (match TEXT, threshold REAL);

CREATE OR REPLACE FUNCTION trgm_check_match(string TEXT, match pg_trgm_match) RETURNS bool AS $$

BEGIN

RETURN match.match <-> string <= 1 - match.threshold;

END;

$$ LANGUAGE plpgsql;

CREATE OPERATOR %%(leftarg = text, rightarg = pg_trgm_match, procedure=trgm_check_match);

ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD

OPERATOR 9 %% (text, pg_trgm_match);

It does indeed make PostgreSQL complain about undefined strategy 9. I added the following define to trgm.h:

#define ThresholdStrategyNumber 9

It seems StrategyNumber is used in gtrgm_consistent and gtrgm_distance.

In gtrgm_consistent, I need change the way `nlimit` is obtained:

nlimit = (strategy == SimilarityStrategyNumber) ?

similarity_threshold : word_similarity_threshold;

I need to add a case for ThresholdStrategyNumber and extract `nlimit` from the argument of `pg_trgm_match`. I'm not sure what to do in `gtrgm_distance`.

My questions:

1a. Is it possible to make `gtrgm_consistent` accept `text` or `pg_trgm_match` as the second argument?

1b. What's the equivalent of `match.match` and `match.threshold` (where `match` is a `pg_trgm_match`) in C?

2. What to do with `gtrgm_distance`?

Thanks for help.

Greg Navis

I help tech companies to scale Heroku-hosted Rails apps.

Free, biweekly scalability newsletter for SaaS CEOs

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Artur Zakirov

Дата:

11 июня 2016 г., 23:40:18

2016-06-11 13:47 GMT+03:00 Greg Navis <contact@gregnavis.com>:

I made some progress but I'm stuck. I'm focused on GiST for now. Please ignore sloppy naming for now.

I made the following changes to pg_trgm--1.2.sql:

CREATE TYPE pg_trgm_match AS (match TEXT, threshold REAL);

CREATE OR REPLACE FUNCTION trgm_check_match(string TEXT, match pg_trgm_match) RETURNS bool AS $$
BEGIN
RETURN match.match <-> string <= 1 - match.threshold;
END;
$$ LANGUAGE plpgsql;

CREATE OPERATOR %%(leftarg = text, rightarg = pg_trgm_match, procedure=trgm_check_match);

ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
OPERATOR 9 %% (text, pg_trgm_match);

You can overload existing % operator:

ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD

OPERATOR 9 % (text, pg_trgm_match);

It does indeed make PostgreSQL complain about undefined strategy 9. I added the following define to trgm.h:

#define ThresholdStrategyNumber 9

It seems StrategyNumber is used in gtrgm_consistent and gtrgm_distance.

In gtrgm_consistent, I need change the way `nlimit` is obtained:

nlimit = (strategy == SimilarityStrategyNumber) ?
similarity_threshold : word_similarity_threshold;

I need to add a case for ThresholdStrategyNumber and extract `nlimit` from the argument of `pg_trgm_match`. I'm not sure what to do in `gtrgm_distance`.

My questions:

1a. Is it possible to make `gtrgm_consistent` accept `text` or `pg_trgm_match` as the second argument?

I think you can change definition of the gtrgm_consistent() in .sql file in CREATE FUNCTION and CREATE OPERATOR CLASS commands to:

gtrgm_consistent(internal,anynonarray,smallint,oid,internal)

But I do not sure that anynonarray is good here.

1b. What's the equivalent of `match.match` and `match.threshold` (where `match` is a `pg_trgm_match`) in C?

After changing the definition you can extract values from composite type in the gtrgm_consistent(). I think the code in the beginning of function may looks like this:

if (strategy == SimilarityStrategyNumber ||

strategy == WordSimilarityStrategyNumber)

{

query = PG_GETARG_TEXT_P(1);

nlimit = (strategy == SimilarityStrategyNumber) ?

similarity_threshold : word_similarity_threshold;

}

else if (strategy == ThresholdStrategyNumber)

{

HeapTupleHeader query_match = PG_GETARG_HEAPTUPLEHEADER(1);

Oid tupType = HeapTupleHeaderGetTypeId(query_match);

int32 tupTypmod = HeapTupleHeaderGetTypMod(query_match);

TupleDesc tupdesc = lookup_rowtype_tupdesc(tupType, tupTypmod);

HeapTupleData tuple;

bool isnull;

tuple.t_len = HeapTupleHeaderGetDatumLength(query_match);

ItemPointerSetInvalid(&(tuple.t_self));

tuple.t_tableOid = InvalidOid;

tuple.t_data = query_match;

query = DatumGetTextP(fastgetattr(&tuple, 1, tupdesc, &isnull));

nlimit = DatumGetFloat4(fastgetattr(&tuple, 2, tupdesc, &isnull));

ReleaseTupleDesc(tupdesc);

}

else

query = PG_GETARG_TEXT_P(1);

After this code you should execute the query using index:

select t,similarity(t,'qwertyu0988') as sml from test_trgm where t % row('qwertyu0988', 0.6)::pg_trgm_match;

I got the query from the regression test. And of course the code need to be checked for bugs.

2. What to do with `gtrgm_distance`?

You do not need to change gtrgm_distance(). It is used only in ORDER BY clause to calculate distances. To calculate distance you do not need threshold.

Thanks for help.
--
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs

Artur Zakirov

Postgres Professional: http://www.postgrespro.com

Russian Postgres Company

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

От

Greg Navis

Дата:

20 июня 2016 г., 17:04:41

Artur, thanks for help. I managed to add the new strategy to the index. Hurray! I also discovered a bug in the process that I reported via the form.

I still have a few questions:

1. Naming - pg_trgm_match, match, threshold, trgm_check_match, ThresholdStrategyNumber - are these good names?

2. I made trgm_check_match IMMUTABLE. Are there any other modifies that should be there?

3. I defined % (text, pg_trgm_match) but didn't provide a commutator and other helper procedures. Which of them should I implement?

4. Can I obtain query and nlimit with less code?

5. The attached patch replaced "res = (*(int *) &tmpsml == *(int *) &nlimit || tmpsml > nlimit);" with "res = (tmpsml >= nlimit);" to fix the bug on my machine. I'm not sure whether that's the long-term fix we want to have. It's just there to help me make progress with trigrams.

Thanks for help.

Cheers

Greg

ᐧ

Вложения

pg_trgm_custom_threshold_in_gist.patch

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: [pg_trgm] Making similarity(?, ?) < ? use an index

Вложения