On Tue, 2024-05-14 at 10:51 -0400, Robert Haas wrote:
> The question of which mappings we actually ought to be adding seems
> a lot harder, because it's not altogether clear what it means to
> "remove an accent". The proposed patch adds a whole lot of rules that
> turn tiny little characters into full-sized characters, boldfaced
> and/or italicized and/or otherwise-fancily-printed characters into
> full-sized characters. Only a handful of the changes are actually
> adding rules that specifically *remove an accent*, but there are
> similar rules that already exist, like turning ⅐ into the
> four-character sequence " 1/7" and blocky-looking versions of each
> letter into standard versions and ㍱ into the three-character sequence
> "hPa". So my naive guess would be that we want all of these rules,
> even though you would not guess from the unaccent documentation that
> it's supposed to do stuff like this. But my knowledge of languages
> other than English is very limited, and I am not a user of unaccent
> and never have been, so I am reluctant to make grand pronouncements.
> Does anyone more knowledgeable want to opine?

I am not necessarily more knowledgeable, but I'll opine anyway.

As a German speaker, I wouldn't call the dieresis on "ü" an
accent like the French é, è or ê, even though the current
implementation of unaccent() turns it into a "u".

And while most people would agree that the circumflex on â is an
accent in the French language, I am not sure if it is the same
in Vietnamese.

And I cannot see how ⅐ could be considered an accent...

Perhaps if we invent a function called convert_to_ascii() or so
instead of shoving that into unaccent(), it would make more sense.
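
To illustrate the distinction, here is a rough sketch in Python. It is
not how unaccent works (that is driven by a rules file), and the names
strip_marks() and fold_compat() are made up for this example. The first
only removes combining marks after canonical decomposition; the second
also applies Unicode compatibility decomposition, which is closer to
the "convert to plain characters" behavior discussed above.

```python
import unicodedata


def _drop_combining_marks(s: str) -> str:
    # Category "Mn" (nonspacing mark) covers combining accents,
    # diereses, circumflexes and similar marks.
    return "".join(c for c in s if unicodedata.category(c) != "Mn")


def strip_marks(s: str) -> str:
    """Hypothetical helper: canonical decomposition (NFD), then drop marks."""
    return _drop_combining_marks(unicodedata.normalize("NFD", s))


def fold_compat(s: str) -> str:
    """Hypothetical helper: compatibility decomposition (NFKD), then drop marks."""
    return _drop_combining_marks(unicodedata.normalize("NFKD", s))


if __name__ == "__main__":
    for text in ["ü", "é", "â", "\u2150", "\u3371"]:
        print(repr(text), repr(strip_marks(text)), repr(fold_compat(text)))
```

With this sketch, strip_marks() leaves ⅐ and ㍱ alone, while
fold_compat() expands them. Note that the compatibility form of ⅐ uses
U+2044 FRACTION SLASH rather than an ASCII "/", so the result is still
not pure ASCII and does not match the " 1/7" that the existing rule
produces.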

Yours,
Laurenz Albe