Discussion: Multiple word synonyms (maybe?)

Multiple word synonyms (maybe?)

From
Tim van der Linden
Date:
Hi All

I have a question regarding PostgreSQL's full text capabilities and (presumably) the synonym dictionary.

I'm currently implementing FTS on a medical-themed setup which uses domain-specific jargon to denote a bunch of stuff.
A specific request I wish to implement here concerns the jargon synonyms that are heavily used.

Of course, I can simply go ahead and create my own synonym dictionary with a jargon-specific synonym file to feed it.
However, most of the synonyms consist of more than a single word.

The term "heart attack" for example has the following "synonyms":

- Acute MI
- MI
- Myocardial infarction

As far as I understand it, the tokenizer within the PostgreSQL FTS engine splits words on spaces to generate tokens, which
are then proposed to each dictionary. I think it is therefore impossible to have "multi-word synonyms" in this sense, as
multiple words cannot reach the dictionary. The term "heart attack" would be presented as the tokens "heart" and
"attack".

From a technical standpoint I understand FTS is about looking at individual words and lexemizing them ... yet from a
natural language lookup perspective you still wish to tie "Heart attack" to "Acute MI" so when a client searches on one,
the other will turn up as well.

Should I write my own tokenizer to catch all these words and present them as a single token? Or is this completely
outside the realm of FTS (or FTS within PostgreSQL)?

Cheers,
Tim


Re: Multiple word synonyms (maybe?)

From
rob stone
Date:
On Tue, 2015-10-20 at 19:35 +0900, Tim van der Linden wrote:
> Hi All
>
> I have a question regarding PostgreSQL's full text capabilities and
> (presumably) the synonym dictionary.
>
> I'm currently implementing FTS on a medical themed setup which uses
> domain specific jargon to denote a bunch of stuff. A specific request
> I wish to implement here are the jargon synonyms that are heavily
> used.
>
> Of course, I can simply go ahead and create my own synonym dictionary
> with a jargon specific synonym file to feed it. However, most of the
> synonyms are comprised out of more than a single word.
>
> The term "heart attack" for example has the following "synonyms":
>
> - Acute MI
> - MI
> - Myocardial infarction
>
> As far as I understand it, the tokenizer within PostgreSQL FTS engine
> splits words on spaces to generate tokens which are then proposed to
> each dictionary. I think it is therefore impossible to have "multi-
> word synonyms" in this sense as multiple words cannot reach the
> dictionary. The term "heart attack" would be presented as the tokens
> "heart" and "attack".
>
> From a technical standpoint I understand FTS is about looking at
> individual words and lexemizing them ... yet from a natural language
> lookup perspective you still wish to tie "Heart attack" to "Acute MI"
> so when a client searches on one, the other will turn up as well.
>
> Should I write my own tokenizer to catch all these words and present
> them as a single token? Or is this completely outside the realm of
> FTS (or FTS within Postgresql)?
>
> Cheers,
> Tim
>
>


Looking at this from an entirely different perspective, why are you not
using ICD codes to identify patient events?
It is a one to many relationship between patient and their events
identified by the relevant ICD code and date.
Given that MI has several applicable ICD codes you can use a select
along the lines of:-
WHERE icd_code IN (  . . . )


I know it doesn't answer your question!

Cheers,
Rob


Re: Multiple word synonyms (maybe?)

From
Geoff Winkless
Date:
On 20 October 2015 at 11:35, Tim van der Linden <tim@shisaa.jp> wrote:
> Of course, I can simply go ahead and create my own synonym dictionary with a jargon specific synonym file to feed it. However, most of the synonyms are comprised out of more than a single word.

Does the Thesaurus dictionary not do what you want?

http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-THESAURUS

Geoff

Re: Multiple word synonyms (maybe?)

From
Kevin Grittner
Date:
On Tuesday, October 20, 2015 6:05 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> On 20 October 2015 at 11:35, Tim van der Linden <tim@shisaa.jp> wrote:

>> Of course, I can simply go ahead and create my own synonym
>> dictionary with a jargon specific synonym file to feed it. However,
>> most of the synonyms are comprised out of more than a single word.
>
> Does the Thesaurus dictionary not do what you want?
>
> http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-THESAURUS

+1

I had a very similar need for legal terms (e.g., "power of
attorney") and the thesaurus fit that need exactly.

I don't know whether you'll run into the other need I had that
required some special handling for full text search with legal
documents: things like dates, case numbers, and statute cites were
not handled well by default.  What I did there was to pick those
out with regular expression searches, put them into a
space-separated string, cast that to tsvector, assign a higher
weight to such key elements, and concatenate that tsvector with the
one generated from the standard text parser and dictionaries.
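
A minimal sketch of the extraction step described above, in Python. The date and case-number patterns are hypothetical and would need to match the formats in the actual documents. The sketch only builds the space-separated string; the cast to tsvector and the weighting would still happen in PostgreSQL.

```python
import re

# Hypothetical patterns: ISO-style dates and case numbers like "2015CV001234".
# Real formats depend entirely on the documents being indexed.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
CASE_RE = re.compile(r"\b\d{4}[A-Z]{2}\d{6}\b")


def key_elements(text):
    """Return the key elements found in text as one space-separated string."""
    found = DATE_RE.findall(text) + CASE_RE.findall(text)
    # Drop repeats while keeping first-seen order.
    unique = list(dict.fromkeys(found))
    return " ".join(unique)


sample = "Hearing on 2015-10-20 in case 2015CV001234; continued from 2015-10-20."
print(key_elements(sample))
```

On the database side, the resulting string could presumably be combined with the parsed text through something like setweight(key_string::tsvector, 'A') || to_tsvector('english', body), where key_string and body are placeholder names. A direct cast to tsvector does not normalize its input, and characters such as colons and quotes have special meaning there, so the extracted strings may need cleaning first.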

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Multiple word synonyms (maybe?)

From
Tim van der Linden
Date:
On Tue, 20 Oct 2015 21:57:59 +1100
rob stone <floriparob@gmail.com> wrote:
>
> Looking at this from an entirely different perspective, why are you not
> using ICD codes to identify patient events?
> It is a one to many relationship between patient and their events
> identified by the relevant ICD code and date.
> Given that MI has several applicable ICD codes you can use a select
> along the lines of:-
> WHERE icd_code IN (  . . . )
>
>
> I know it doesn't answer your question!

It does indeed not answer my direct question, but it does offer an interesting perspective to be used in one of the
next phases of the medical application.

Thanks for the heads-up!

> Cheers,
> Rob

Cheers,
Tim


Re: Multiple word synonyms (maybe?)

From
Tim van der Linden
Date:
On Tue, 20 Oct 2015 12:02:46 +0100

> Does the Thesaurus dictionary not do what you want?
>
> http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-THESAURUS

Damn, I completely overlooked that one, and it indeed does seem to come very close to what I need in this use case.
Thanks for jolting my memory (also @Kevin) :)

If I am not mistaken, this would be a valid thesaurus file:

acute mi : heart attack
mi : heart attack
myocardial infarction : heart attack

Multiple words on both ends, separated by a colon and each line being functional (a unique phrase linked to its more
generic replacement)?

> Geoff

Cheers,
Tim


Re: Multiple word synonyms (maybe?)

From
Kevin Grittner
Date:
On Tuesday, October 20, 2015 7:56 PM, Tim van der Linden <tim@shisaa.jp> wrote:
> On Tue, 20 Oct 2015 12:02:46 +0100

>> Does the Thesaurus dictionary not do what you want?
>>
>> http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-THESAURUS
>
> Damn, I completely overlooked that one, and it indeed does seem
> to come very close to what I need in this use case.

I have to admit that the name of that dictionary type threw me off
a bit at first.

> If I am not mistaken, this would be a valid thesaurus file:
>
> acute mi : heart attack
> mi : heart attack
> myocardial infarction : heart attack
>
> Multiple words on both ends, separated by a colon and each line
> being functional (a unique phrase linked to its more generic
> replacement)?

It has been a while, but my recollection is that I did something
more like this:

heart attack : heartattack
acute mi : heartattack
mi : heartattack
myocardial infarction : heartattack

If my memory is to be trusted, both the original words (whichever
are actually in the document) and the "invented" synonym
("heartattack") will be in the tsvector/tsquery; this results in
all *matching* but the identical wording being considered a *closer
match*.
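
To make the phrase-to-token idea concrete, here is a toy model in Python. It is not PostgreSQL's thesaurus implementation and says nothing about which other lexemes PostgreSQL keeps in the tsvector. It only shows multi-word phrases from the file above collapsing to one invented token, trying the longest phrase first. The names THESAURUS and normalize are made up for this illustration.

```python
# Toy model only: phrase tuples mapped to one invented token,
# mirroring the thesaurus entries listed above.
THESAURUS = {
    ("heart", "attack"): "heartattack",
    ("acute", "mi"): "heartattack",
    ("mi",): "heartattack",
    ("myocardial", "infarction"): "heartattack",
}
MAX_LEN = max(len(phrase) for phrase in THESAURUS)


def normalize(text):
    """Lowercase, split on whitespace, and replace known phrases."""
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "acute mi" wins over plain "mi".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in THESAURUS:
                out.append(THESAURUS[phrase])
                i += n
                break
        else:
            # No phrase starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


print(normalize("Patient had an acute MI"))
print(normalize("Patient had a heart attack"))
```

Both sample sentences end in the same "heartattack" token, which is what lets a search on one phrasing find documents written with the other.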

As with most things, I encourage you to play around with it a bit
to see what gives the best results for you.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Multiple word synonyms (maybe?)

From
Tim van der Linden
Date:
On Wed, 21 Oct 2015 13:40:38 +0000 (UTC)
Kevin Grittner <kgrittn@ymail.com> wrote:

> > Damn, I completely overlooked that one, and it indeed does seem
> > to come very close to what I need in this use case.
>
> I have to admit that the name of that dictionary type threw me off
> a bit at first.

Indeed :)

> > ...
>
> It has been a while, but my recollection is that I did something
> more like this:
>
> heart attack : heartattack
> acute mi : heartattack
> mi : heartattack
> myocardial infarction : heartattack
>
> If my memory is to be trusted, both the original words (whichever
> are actually in the document) and the "invented" synonym
> ("heartattack") will be in the tsvector/tsquery; this results in
> all *matching* but the identical wording being considered a *closer
> match*.

Hmm, a very helpful insight and it indeed makes sense to convert each phrase into a "single word" mash-up so it can be
lexemized.

> As with most things, I encourage you to play around with it a bit
> to see what gives the best results for you.

Yes indeed and will do!

Thank you very much for your help. If I get this up and running it might offer a nice opportunity to write a small post
about this to expand on my PostgreSQL series...

> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Cheers,
Tim