Обсуждение: Full text search on partial URLs

Поиск
Список
Период
Сортировка

Full text search on partial URLs

От
Zev Benjamin
Дата:
Hi,

I have Postgres full text search set up for my application and it's been
working great!  However, my users would like their searches to turn up
parts of URLs.  For example, they would like a search for "foobar" to
turn up a document that contains the string
"http://example.com/foobar/blah" (and similarly for queries like
"example" and "blah).  With the default dictionaries for host, url, and
url_path, the search query would have to contain the complete host or
url path.

What is the best way to accomplish this?  Should I be looking at
building a custom dictionary that breaks down hosts and urls or is there
something simpler I can do?


Thanks,
Zev


Re: Full text search on partial URLs

От
bricklen
Дата:

On Wed, Nov 6, 2013 at 10:45 AM, Zev Benjamin <zev-pgsql@strangersgate.com> wrote:
Hi,

I have Postgres full text search set up for my application and it's been working great!  However, my users would like their searches to turn up parts of URLs.  For example, they would like a search for "foobar" to turn up a document that contains the string "http://example.com/foobar/blah" (and similarly for queries like "example" and "blah).  With the default dictionaries for host, url, and url_path, the search query would have to contain the complete host or url path.

What is the best way to accomplish this?  Should I be looking at building a custom dictionary that breaks down hosts and urls or is there something simpler I can do?

Re: Full text search on partial URLs

От
Zev Benjamin
Дата:
On 11/06/2013 01:47 PM, bricklen wrote:
>
> On Wed, Nov 6, 2013 at 10:45 AM, Zev Benjamin
> <zev-pgsql@strangersgate.com <mailto:zev-pgsql@strangersgate.com>> wrote:
>
>     Hi,
>
>     I have Postgres full text search set up for my application and it's
>     been working great!  However, my users would like their searches to
>     turn up parts of URLs.  For example, they would like a search for
>     "foobar" to turn up a document that contains the string
>     "http://example.com/foobar/__blah <http://example.com/foobar/blah>"
>     (and similarly for queries like "example" and "blah).  With the
>     default dictionaries for host, url, and url_path, the search query
>     would have to contain the complete host or url path.
>
>     What is the best way to accomplish this?  Should I be looking at
>     building a custom dictionary that breaks down hosts and urls or is
>     there something simpler I can do?
>
>
> Have you looked into trigrams?
> http://www.postgresql.org/docs/current/static/pgtrgm.html

I've looked at it in the context of adding fuzzy search.  But my
understanding is that doing a fuzzy search here would only work if the
query were a significant fraction of, say, the url path.  For example, I
would expect a fuzzy search of "foobar" on "/foobar/x" to return a high
similarity, but a fuzzy search of "foobar" on
"/foobar/some/very/long/path/x" to have a low similarity.

Or are you suggesting using trigrams in a different way?


Zev


Re: Full text search on partial URLs

От
bricklen
Дата:

On Wed, Nov 6, 2013 at 10:53 AM, Zev Benjamin <zev-pgsql@strangersgate.com> wrote:
On 11/06/2013 01:47 PM, bricklen wrote:

On Wed, Nov 6, 2013 at 10:45 AM, Zev Benjamin
<zev-pgsql@strangersgate.com <mailto:zev-pgsql@strangersgate.com>> wrote:

    Hi,

    I have Postgres full text search set up for my application and it's
    been working great!  However, my users would like their searches to
    turn up parts of URLs.  For example, they would like a search for
    "foobar" to turn up a document that contains the string
    "http://example.com/foobar/__blah <http://example.com/foobar/blah>"
    (and similarly for queries like "example" and "blah).  With the
    default dictionaries for host, url, and url_path, the search query
    would have to contain the complete host or url path.

    What is the best way to accomplish this?  Should I be looking at
    building a custom dictionary that breaks down hosts and urls or is
    there something simpler I can do?


Have you looked into trigrams?
http://www.postgresql.org/docs/current/static/pgtrgm.html

I've looked at it in the context of adding fuzzy search.  But my understanding is that doing a fuzzy search here would only work if the query were a significant fraction of, say, the url path.  For example, I would expect a fuzzy search of "foobar" on "/foobar/x" to return a high similarity, but a fuzzy search of "foobar" on "/foobar/some/very/long/path/x" to have a low similarity.

Or are you suggesting using trigrams in a different way?

Yeah, I was thinking more along the lines of allowing wildcard searching, not similarity.

Eg.
CREATE INDEX yourtable_yourcol_gist_fbi ON yourtable using GIST ( yourcol gist_trgm_ops );
select * from yourtable where yourcol ~~ '%foobar%';

Re: Full text search on partial URLs

От
Zev Benjamin
Дата:

On 11/06/2013 02:04 PM, bricklen wrote:
>
> On Wed, Nov 6, 2013 at 10:53 AM, Zev Benjamin
> <zev-pgsql@strangersgate.com <mailto:zev-pgsql@strangersgate.com>> wrote:
>
>     On 11/06/2013 01:47 PM, bricklen wrote:
>
>
>         On Wed, Nov 6, 2013 at 10:45 AM, Zev Benjamin
>         <zev-pgsql@strangersgate.com
>         <mailto:zev-pgsql@strangersgate.com>
>         <mailto:zev-pgsql@__strangersgate.com
>         <mailto:zev-pgsql@strangersgate.com>>> wrote:
>
>              Hi,
>
>              I have Postgres full text search set up for my application
>         and it's
>              been working great!  However, my users would like their
>         searches to
>              turn up parts of URLs.  For example, they would like a
>         search for
>              "foobar" to turn up a document that contains the string
>              "http://example.com/foobar/____blah
>         <http://example.com/foobar/__blah>
>         <http://example.com/foobar/__blah <http://example.com/foobar/blah>>"
>              (and similarly for queries like "example" and "blah).  With the
>              default dictionaries for host, url, and url_path, the
>         search query
>              would have to contain the complete host or url path.
>
>              What is the best way to accomplish this?  Should I be
>         looking at
>              building a custom dictionary that breaks down hosts and
>         urls or is
>              there something simpler I can do?
>
>
>         Have you looked into trigrams?
>         http://www.postgresql.org/__docs/current/static/pgtrgm.__html
>         <http://www.postgresql.org/docs/current/static/pgtrgm.html>
>
>
>     I've looked at it in the context of adding fuzzy search.  But my
>     understanding is that doing a fuzzy search here would only work if
>     the query were a significant fraction of, say, the url path.  For
>     example, I would expect a fuzzy search of "foobar" on "/foobar/x" to
>     return a high similarity, but a fuzzy search of "foobar" on
>     "/foobar/some/very/long/path/__x" to have a low similarity.
>
>     Or are you suggesting using trigrams in a different way?
>
>
> Yeah, I was thinking more along the lines of allowing wildcard
> searching, not similarity.
>
> Eg.
> CREATE INDEX yourtable_yourcol_gist_fbi ON yourtable using GIST (
> yourcol gist_trgm_ops );
> select * from yourtable where yourcol ~~ '%foobar%';
>

Hrm.  That might work.  So the application-level search functionality
would be the union of tsearch and trigram wildcard matching.

If anyone else has other ideas, I'd be interested in hearing them as well.


Thanks,
Zev


Re: Full text search on partial URLs

От
Zev Benjamin
Дата:
On 11/15/2013 07:40 PM, Zev Benjamin wrote:
>
> One problem that I've run into here is that I would also like to
> highlight matched text in my application.  For my existing search
> solution, I do this with ts_headline.  For partial matches, it's
> unfortunately not just a matter of searching for the text and adding the
> appropriate markup because my documents are HTML (the FTS lexer
> helpfully pulls out all the HTML tags so it hasn't been a problem so
> far) and we don't want to accidentally "highlight" some of the
> attributes of the markup.
>
> One way to solve this would be if there were a way to turn a tsvector
> and tsquery pair into a list of the offsets and lengths of the lexemes
> that match.  The highlighting could then be done at the application
> level rather than the database level while still leveraging Postgres's
> FTS functionality.

I've written C functions to implement this and attached them to this
email.  The support files necessary for making a module are available at
https://github.com/zbenjamin/tsearch_extras.  I'm new to the PostgreSQL
code base so any feedback or comments would be greatly appreciated.
Would these be appropriate to submit as patches to PostgreSQL?


Thanks,
Zev


Вложения

Re: Full text search on partial URLs

От
Zev Benjamin
Дата:
On 11/15/2013 07:40 PM, Zev Benjamin wrote:
>
> One problem that I've run into here is that I would also like to
> highlight matched text in my application.  For my existing search
> solution, I do this with ts_headline.  For partial matches, it's
> unfortunately not just a matter of searching for the text and adding the
> appropriate markup because my documents are HTML (the FTS lexer
> helpfully pulls out all the HTML tags so it hasn't been a problem so
> far) and we don't want to accidentally "highlight" some of the
> attributes of the markup.
>
> One way to solve this would be if there were a way to turn a tsvector
> and tsquery pair into a list of the offsets and lengths of the lexemes
> that match.  The highlighting could then be done at the application
> level rather than the database level while still leveraging Postgres's
> FTS functionality.

I've written C functions to implement this and attached them to this
email.  The support files necessary for making a module are available at
https://github.com/zbenjamin/tsearch_extras.  I'm new to the PostgreSQL
code base so any feedback or comments would be greatly appreciated.
Would these be appropriate to submit as patches to PostgreSQL?


Thanks,
Zev

Вложения