Thread: Example non-Latin words for text search parser docs?

Example non-Latin words for text search parser docs?

From
Tom Lane
Date:
I'm afraid my English-centricity is showing, but I could use a little
help filling in the missing examples in the table here:
http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html
I'm not sure of a suitable example all-non-ASCII-letters word, and
even less sure of how to represent it in SGML.  (I remember we had
quite a bit of trouble dealing with accented letters in people's names,
for instance.)

            regards, tom lane

Re: Example non-Latin words for text search parser docs?

From
Alvaro Herrera
Date:
Tom Lane wrote:
> I'm afraid my English-centricity is showing, but I could use a little
> help filling in the missing examples in the table here:
> http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html
> I'm not sure of a suitable example all-non-ASCII-letters word,

It's easy to find an example -- I went to the English Wikipedia,
searched for "elephant", then clicked on the Russian link at the left.
It gives you "Слоновые", which I see on my terminal as a series of black
squares :-) so there's not a single Latin letter in it.

http://ru.wikipedia.org/wiki/%D0%A1%D0%BB%D0%BE%D0%BD%D0%BE%D0%B2%D1%8B%D0%B5

In that page they also mention the word "Слон" which looks like "Slon".

> and even less sure of how to represent it in SGML.  (I remember we had
> quite a bit of trouble dealing with accented letters in people's
> names, for instance.)

Yeah, that will prove difficult.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
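
As an aside, the percent-encoded path in the Wikipedia URL above is just the UTF-8 bytes of the Russian word. A minimal Python sketch (the variable names are illustrative, not from the thread) that decodes it and checks that no character falls in the ASCII range:

```python
from urllib.parse import unquote

# Last path segment of the ru.wikipedia.org URL quoted in the message above.
encoded = "%D0%A1%D0%BB%D0%BE%D0%BD%D0%BE%D0%B2%D1%8B%D0%B5"

# unquote() decodes percent-escapes as UTF-8 by default.
word = unquote(encoded)
print(word)

# True only if every character lies outside the ASCII range (code point > 127).
all_non_ascii = all(ord(ch) > 127 for ch in word)
print(all_non_ascii)
```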

Re: Example non-Latin words for text search parser docs?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> and even less sure of how to represent it in SGML.  (I remember we had
>> quite a bit of trouble dealing with accented letters in people's
>> names, for instance.)

> Yeah, that will prove difficult.

This problem largely goes away if we redefine the word categories as
under discussion in the -hackers thread: with any of the proposed
alternatives it'd be pretty easy to make up real words that are easily
representable in SGML.

            regards, tom lane
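
On the question of representing such words in SGML: one generic way to write non-ASCII letters in an ASCII-only SGML or XML source is a numeric character reference. The sketch below uses Python's built-in `xmlcharrefreplace` error handler to produce them. The helper name `to_char_refs` is hypothetical, and the thread does not say which notation the PostgreSQL docs actually ended up using, so treat this purely as an illustration.

```python
def to_char_refs(text):
    """Return text with each non-ASCII character written as &#NNNN;."""
    # The 'xmlcharrefreplace' error handler substitutes a decimal character
    # reference for every character the ASCII codec cannot encode.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_char_refs("Слон"))      # &#1057;&#1083;&#1086;&#1085;
print(to_char_refs("elephant"))  # elephant (pure ASCII passes through unchanged)
```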

Re: Example non-Latin words for text search parser docs?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I'm afraid my English-centricity is showing, but I could use a little
>> help filling in the missing examples in the table here:
>> http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html
>> I'm not sure of a suitable example all-non-ASCII-letters word,

> It's easy to find an example -- I went to the English Wikipedia,
> searched for "elephant", then clicked on the Russian link at the left.
> It gives you "Слоновые", which I see on my terminal as a series of black
> squares :-) so there's not a single Latin letter in it.

Given the just-applied changes in the definition of a "word", we no
longer need a totally-not-ASCII sample word.  But I wonder if anyone
has a better idea than the føø that I made up on the
spot...

            regards, tom lane

Re: Example non-Latin words for text search parser docs?

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> I'm afraid my English-centricity is showing, but I could use a little
> >> help filling in the missing examples in the table here:
> >> http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html
> >> I'm not sure of a suitable example all-non-ASCII-letters word,
>
> > It's easy to find an example -- I went to the English Wikipedia,
> > searched for "elephant", then clicked on the Russian link at the left.
> > It gives you "Слоновые", which I see on my terminal as a series of black
> > squares :-) so there's not a single Latin letter in it.
>
> Given the just-applied changes in the definition of a "word", we no
> longer need a totally-not-ASCII sample word.  But I wonder if anyone
> has a better idea than the føø that I made up on the
> spot...

Actually I was wondering if we should use actual words.  So instead of
"foo" we could use "elephant" for asciiword and "Éléphant" (French) for
word.  And for the hword, "sous-espèces" (which appears on the French
Wikipedia) would do.

--
Alvaro Herrera                         http://www.flickr.com/photos/alvherre/
"La espina, desde que nace, ya pincha" (Proverbio africano)

Re: Example non-Latin words for text search parser docs?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Actually I was wondering if we should use actual words.  So instead of
> "foo" we could use "elephant" for asciiword and "Éléphant" (French) for
> word.  And for the hword, "sous-espèces" (which appears on the French
> Wikipedia) would do.

Hmm ... I see a potential problem with that, which is that if someone
happened to be viewing the page on something that dropped the accents,
or even just made them too small to be easily readable, the examples
wouldn't make any sense at all.

I have no problem with "elephant" as a sample asciiword, but for the
sample non-ascii word I'd suggest something that (a) is clearly not
English and (b) as much as possible, everybody knows has an accent.
At least in large parts of the US, something like "mañana" would
work nicely.

Anyway, feel free to hack on it --- I'm getting a bit weary of looking
at that chapter.

            regards, tom lane

Re: Example non-Latin words for text search parser docs?

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Actually I was wondering if we should use actual words.  So instead of
> > "foo" we could use "elephant" for asciiword and "Éléphant" (French) for
> > word.  And for the hword, "sous-espèces" (which appears on the French
> > Wikipedia) would do.
>
> Hmm ... I see a potential problem with that, which is that if someone
> happened to be viewing the page on something that dropped the accents,
> or even just made them too small to be easily readable, the examples
> wouldn't make any sense at all.
>
> I have no problem with "elephant" as a sample asciiword, but for the
> sample non-ascii word I'd suggest something that (a) is clearly not
> English and (b) as much as possible, everybody knows has an accent.
> At least in large parts of the US, something like "mañana" would
> work nicely.

OK, I went with that.  I also used real Spanish hyphenated words in the
hword examples.  I also changed the domains foo.com to example.com, just
because I'm anal enough to do it.

The hword_asciipart I'm not 100% sure about.  I used this:

    militar in the context político-militar, or postgresql in the
    context postgresql-beta1

What I wanted to emphasize here is that it's the "ascii-ness" of the
part that matters, not that of the complete token.  The reason I'm not
sure about it is that it makes the table wider.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
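
The point that the "ascii-ness" of each part matters, rather than that of the whole token, can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the PostgreSQL parser: the function name and the `non_ascii_part` and `other` labels are hypothetical, only `hword_asciipart` comes from the thread, and parts containing digits (such as `beta1`) are simply lumped under `other`. It assumes Python 3.7 or later for `str.isascii`.

```python
def classify_hyphenated_parts(token):
    """Label each hyphen-separated part by whether it is all-ASCII letters."""
    labeled = []
    for part in token.split("-"):
        if not part.isalpha():
            # Digits, punctuation, or empty parts are outside this sketch.
            kind = "other"
        elif part.isascii():
            # Letters only, all of them ASCII.
            kind = "hword_asciipart"
        else:
            # Letters only, at least one of them non-ASCII.
            kind = "non_ascii_part"
        labeled.append((part, kind))
    return labeled


for sample in ("político-militar", "postgresql-beta1"):
    print(sample, classify_hyphenated_parts(sample))
```

Under this sketch, `militar` is labeled `hword_asciipart` even though the complete token `político-militar` contains a non-ASCII letter.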

Re: Example non-Latin words for text search parser docs?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> The hword_asciipart I'm not 100% sure about.  I used this:
>     militar in the context político-militar, or postgresql in the
>     context postgresql-beta1

Hmm ... I went and looked at the page on developer.postgresql.org,
and it's just as I feared: with slightly bleary morning eyes, the
accents over the i's are not obvious, and so you have to look *real*
close before you get the point of the examples.  It doesn't help that
'politico' with no accent is exactly how the phrase would be spelled
in English, and so it's easy to not see the accent because you're not
expecting one.  The other examples seem alright, but I think that one's
a bad choice.

            regards, tom lane

Re: Example non-Latin words for text search parser docs?

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > The hword_asciipart I'm not 100% sure about.  I used this:
> >     militar in the context político-militar, or postgresql in the
> >     context postgresql-beta1
>
> Hmm ... I went and looked at the page on developer.postgresql.org,
> and it's just as I feared: with slightly bleary morning eyes, the
> accents over the i's are not obvious, and so you have to look *real*
> close before you get the point of the examples.  It doesn't help that
> 'politico' with no accent is exactly how the phrase would be spelled
> in English, and so it's easy to not see the accent because you're not
> expecting one.  The other examples seem alright, but I think that one's
> a bad choice.

Damn.  Ok, I'll search for a different example.  We're making progress
nonetheless ;-)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Example non-Latin words for text search parser docs?

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre@commandprompt.com> writes:
> > > The hword_asciipart I'm not 100% sure about.  I used this:
> > >     militar in the context político-militar, or postgresql in the
> > >     context postgresql-beta1
> >
> > Hmm ... I went and looked at the page on developer.postgresql.org,
> > and it's just as I feared: with slightly bleary morning eyes, the
> > accents over the i's are not obvious, and so you have to look *real*
> > close before you get the point of the examples.  It doesn't help that
> > 'politico' with no accent is exactly how the phrase would be spelled
> > in English, and so it's easy to not see the accent because you're not
> > expecting one.  The other examples seem alright, but I think that one's
> > a bad choice.
>
> Damn.  Ok, I'll search for a different example.  We're making progress
> nonetheless ;-)

How about "lógico-matemática"?

(If that one doesn't work for you, maybe we should look into words in
another language, more different from English.  Maybe Magnus can suggest
hyphenated words with weird letters).

--
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)

Re: Example non-Latin words for text search parser docs?

From
Peter Eisentraut
Date:
On Thursday, 25 October 2007, Tom Lane wrote:
> Hmm ... I went and looked at the page on developer.postgresql.org,
> and it's just as I feared: with slightly bleary morning eyes, the
> accents over the i's are not obvious, and so you have to look *real*
> close before you get the point of the examples.

By that standard, you will have to use non-Latin letters, which might decrease
the usability of the examples much more.  There are not likely to be any
Latin-looking letters that are not ASCII and are not resembling another Latin
letter.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Example non-Latin words for text search parser docs?

From
Alvaro Herrera
Date:
Peter Eisentraut wrote:
> On Thursday, 25 October 2007, Tom Lane wrote:
> > Hmm ... I went and looked at the page on developer.postgresql.org,
> > and it's just as I feared: with slightly bleary morning eyes, the
> > accents over the i's are not obvious, and so you have to look *real*
> > close before you get the point of the examples.
>
> By that standard, you will have to use non-Latin letters, which might decrease
> the usability of the examples much more.  There are not likely to be any
> Latin-looking letters that are not ASCII and are not resembling another Latin
> letter.

I think it would suffice to use an accent over a vowel that's not an i.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Example non-Latin words for text search parser docs?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Peter Eisentraut wrote:
>> On Thursday, 25 October 2007, Tom Lane wrote:
>>> Hmm ... I went and looked at the page on developer.postgresql.org,
>>> and it's just as I feared: with slightly bleary morning eyes, the
>>> accents over the i's are not obvious, and so you have to look *real*
>>> close before you get the point of the examples.
>>
>> By that standard, you will have to use non-Latin letters, which might decrease
>> the usability of the examples much more.  There are not likely to be any
>> Latin-looking letters that are not ASCII and are not resembling another Latin
>> letter.

> I think it would suffice to use an accent over a vowel that's not an i.

Yeah, that would help.  But the real problem with político-militar
is that it looks way too much like the English equivalent --- my first
reaction was "huh, he forgot the 'y'".  I'm after a word that *looks*
not-English.  Alvaro's comment that maybe we need to look to something
besides Spanish seems on point.

            regards, tom lane

Re: Example non-Latin words for text search parser docs?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> How about "lógico-matemática"?

Works for me.

            regards, tom lane