Re: unexpected result from to_tsvector

Поиск
Список
Период
Сортировка
От Artur Zakirov
Тема Re: unexpected result from to_tsvector
Дата
Msg-id 56E94455.6050508@postgrespro.ru
обсуждение исходный текст
Ответ на Re: unexpected result from to_tsvector  (Artur Zakirov <a.zakirov@postgrespro.ru>)
Список pgsql-hackers
I found the discussion about allowing an underscore in emails
http://www.postgresql.org/message-id/200908281359.n7SDxfaf044556@wwwmaster.postgresql.org

That bug report is about recognizing an underscore in the local part of
an email. And is not about recognizing an underscore in a domain name.
But that patch allows an underscore in recognized host names also.

I am not good in RFC, so I put excerpt from Wikipedia
https://en.wikipedia.org/wiki/Email_address:

> The local-part of the email address may use any of these ASCII characters:
>
> Uppercase and lowercase Latin letters (A–Z, a–z) (ASCII: 65–90, 97–122)
> Digits 0 to 9 (ASCII: 48–57)
> These special characters: !#$%&'*+-/=?^_`{|}~ (ASCII: 33, 35–39, 42, 43, 45, 47, 61, 63, 94–96, 123–126)
> Character . (dot, period, full stop), ASCII 46, provided that it is not the first or last character, and provided
alsothat it does not appear consecutively (e.g. John..Doe@example.com is not allowed). 
> Other special characters are allowed with restrictions (they are only allowed inside a quoted string, as described in
theparagraph below, and in addition, a backslash or double-quote must be preceded by a backslash). These characters
are:
> Space and "(),:;<>@[\] (ASCII: 32, 34, 40, 41, 44, 58, 59, 60, 62, 64, 91–93)
> Comments are allowed with parentheses at either end of the local part; e.g. john.smith(comment)@example.com and
(comment)john.smith@example.comare both equivalent to john.smith@example.com. 

and https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_host_names

> The Internet standards (Requests for Comments) for protocols mandate that component hostname labels may contain only
theASCII letters 'a' through 'z' (in a case-insensitive manner),the digits '0' through '9', and the hyphen ('-'). The
originalspecification of hostnames in RFC 952, mandated that labels could not start with a digit or with a hyphen, and
mustnot end with a hyphen. However, a subsequent specification (RFC 1123) permitted hostname labels to start with
digits.No other symbols, punctuation characters, or white space are permitted. 

Hence the valid emails is (I might be wrong):

123-s@sample.com
123_s@sample.com
123@123-sample.com
123@123sample.com

The attached patch allow them to be recognized as a email. But this
patch does not prohibit underscore in recognized host names.

As a result this patch gives the following results with underscores:

=# select * from ts_debug('simple', 'aaa@123_yyy.zzz');
  alias |  description  |      token      | dictionaries | dictionary |
      lexemes
-------+---------------+-----------------+--------------+------------+-------------------
  email | Email address | aaa@123_yyy.zzz | {simple}     | simple     |
{aaa@123_yyy.zzz}
(1 row)

=# select * from ts_debug('simple', '123_yyy.zzz');
  alias | description |    token    | dictionaries | dictionary |
lexemes
-------+-------------+-------------+--------------+------------+---------------
  host  | Host        | 123_yyy.zzz | {simple}     | simple     |
{123_yyy.zzz}
(1 row)

On 14.03.2016 17:45, Artur Zakirov wrote:
> On 14.03.2016 16:22, Shulgin, Oleksandr wrote:
>>
>> Hm...  now that doesn't look all that consistent to me (after applying
>> the patch):
>>
>> =# select ts_debug('simple', 'aaa@123-yyy.zzz');
>>                                   ts_debug
>> ---------------------------------------------------------------------------
>>
>>   (email,"Email
>> address",aaa@123-yyy.zzz,{simple},simple,{aaa@123-yyy.zzz})
>> (1 row)
>>
>> But:
>>
>> =# select ts_debug('simple', 'aaa@123_yyy.zzz');
>>                          ts_debug
>> ---------------------------------------------------------
>>   (asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa})
>>   (blank,"Space symbols",@,{},,)
>>   (uint,"Unsigned integer",123,{simple},simple,{123})
>>   (blank,"Space symbols",_,{},,)
>>   (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
>> (5 rows)
>>
>> One can also see that if we only keep the domain name, the result is
>> similar:
>>
>> =# select ts_debug('simple', '123-yyy.zzz');
>>                         ts_debug
>> -------------------------------------------------------
>>   (host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz})
>> (1 row)
>>
>> =# select ts_debug('simple', '123_yyy.zzz');
>>                        ts_debug
>> -----------------------------------------------------
>>   (uint,"Unsigned integer",123,{simple},simple,{123})
>>   (blank,"Space symbols",_,{},,)
>>   (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
>> (3 rows)
>>
>> But, this only has to do with 123 being recognized as a number, not with
>> the underscore:
>>
>> =# select ts_debug('simple', 'abc_yyy.zzz');
>>                         ts_debug
>> -------------------------------------------------------
>>   (host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz})
>> (1 row)
>>
>> =# select ts_debug('simple', '1abc_yyy.zzz');
>>                         ts_debug
>> -------------------------------------------------------
>>   (host,Host,1abc_yyy.zzz,{simple},simple,{1abc_yyy.zzz})
>> (1 row)
>>
>> In fact, the 123-yyy.zzz domain is not valid either according to the RFC
>> (subdomain can't start with a digit), but since we already allow it,
>> should we not allow 123_yyy.zzz to be recognized as a Host?  Then why
>> not recognize aaa@123_yyy.zzz as an email address?
>>
>> Another option is to prohibit underscore in recognized host names, but
>> this has more breakage potential IMO.
>>
>> --
>> Alex
>>
>
> It seems reasonable to me. I like more first option. But I am not
> confident that we should allow 123_yyy.zzz to be recognized as a Host.
>
> By the way, in this question http://webmasters.stackexchange.com/a/775
> you can see examples of domain names with numbers (but not subdomains).
>
> If there are not objections from others, I will send a new patch today
> later or tomorrow with 123_yyy.zzz recognizing.
>


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: David Rowley
Дата:
Сообщение: Re: Combining Aggregates
Следующее
От: Grzegorz Sampolski
Дата:
Сообщение: Re: pam auth - add rhost item