Thread: Unaccent characters

Unaccent characters

From: Thom Brown
Date:
Hi,

I had a look at the unaccent.rules file and noticed the following
characters aren't properly converted:

ß (U+00DF)  An eszett represents a double-s "SS", but this replaces it
with one "S".  Shouldn't this be replaced with "SS"?

Æ (U+00C6) and æ (U+00E6). These don't have an accent, diacritic, or
anything added to a single Latin character.  Each is simply a ligature of
"A" and "E" or "a" and "e".  If someone has the text "æther", I would
imagine they'd be surprised at it being converted to "ather" instead
of "aether".

Œ (U+0152) and œ (U+0153). Same as above.  This is a ligature of "O"
and "E" or "o" and "e".  Except this time the unaccent module chooses
the 2nd character instead of the 1st, which is confusing.

If these were properly converted it would change the length of the
text, so I'm wondering if that's the reason for not properly
converting them.  Could someone elaborate?
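
To make the length question concrete, here is a minimal sketch in Python of the mapping proposed above. The `RULES` table and the `unaccent_sketch` function are hypothetical illustrations, not the contents of unaccent.rules or the module's actual code, and writing ß as lower-case "ss" is an assumption about the desired case.

```python
# Hypothetical rule table illustrating the proposal; this is NOT the
# real unaccent.rules file, only the five characters discussed here.
RULES = {
    "\u00df": "ss",  # ß: lower-case "ss" is an assumption about case
    "\u00c6": "AE",  # Æ
    "\u00e6": "ae",  # æ
    "\u0152": "OE",  # Œ
    "\u0153": "oe",  # œ
}


def unaccent_sketch(text: str) -> str:
    """Replace each character that has a rule; leave all others untouched."""
    return "".join(RULES.get(ch, ch) for ch in text)


if __name__ == "__main__":
    for word in ("æther", "Straße", "œuvre"):
        result = unaccent_sketch(word)
        # Each multi-character replacement lengthens the text by one.
        print(f"{word} ({len(word)}) -> {result} ({len(result)})")
```

Each of the sample words comes out one character longer than it went in, which is the length change in question.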

--
Thom

Re: Unaccent characters

From: Peter Eisentraut
Date:
On Fri, 2012-04-20 at 09:15 +0100, Thom Brown wrote:
> I had a look at the unaccent.rules file and noticed the following
> characters aren't properly converted:
>
> ß (U+00DF)  An eszett represents a double-s "SS", but this replaces it
> with one "S".  Shouldn't this be replaced with "SS"?

Probably, but it certainly shouldn't be upper case.

> Æ (U+00C6) and æ (U+00E6). These don't have an accent, diacritic, or
> anything added to a single Latin character.  Each is simply a ligature of
> "A" and "E" or "a" and "e".  If someone has the text "æther", I would
> imagine they'd be surprised at it being converted to "ather" instead
> of "aether".

It depends on what the point of this module is supposed to be.  Doing
"unaccenting" usefully depends on language and context.  For example, it
would be very reasonable to map æ to ae, but in a Scandinavian context,
æ is equivalent to ä, which is mapped to a, which is itself
questionable.
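
As a rough illustration of that context dependence, the following Python sketch layers a per-context override on top of a default table. The table names, the "scandinavian" key, and the `unaccent_for` function are all hypothetical, invented for this example; nothing here describes how the module is actually configured.

```python
# Hypothetical default rules: æ expands to "ae", ä drops its umlaut.
DEFAULT_RULES = {
    "\u00e6": "ae",  # æ
    "\u00e4": "a",   # ä
}

# Hypothetical per-context overrides (an assumption for illustration):
# in a Scandinavian context æ is treated like ä, so it maps to "a".
CONTEXT_OVERRIDES = {
    "scandinavian": {"\u00e6": "a"},
}


def unaccent_for(text, context=None):
    """Apply the default rules, with any overrides for the given context."""
    rules = {**DEFAULT_RULES, **CONTEXT_OVERRIDES.get(context, {})}
    return "".join(rules.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(unaccent_for("æther"))                  # default table
    print(unaccent_for("æther", "scandinavian"))  # override applied
```

The same input then yields "aether" under the default table and "ather" under the override, so neither result is right in every context.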

> Œ (U+0152) and œ (U+0153). Same as above.  This is a ligature of "O"
> and "E" or "o" and "e".  Except this time the unaccent module chooses
> the 2nd character instead of the 1st, which is confusing.

That certainly seems wrong.  It's also worth noting that while æ is in
some languages considered a separate letter, œ is generally just a
typographical ligature.