Discussion: to_ascii, or some other form of magic transliteration
I'm working on a problem that I imagine others have had, which basically boils down to having nice unicode display text that users are going to want to search against without typing it correctly.... e.g. let a search for "sma" match "små". It seems like the best way to do this is to find a magic unicode transliteration mapping function, and then save the ASCII transliterations for searching against.

I see there's a function to_ascii, which sounds hopeful. However, when I try to use it, I get back:

ERROR: encoding conversion from UNICODE to ASCII not supported

What is this function for, if not to convert other encodings to ASCII? Is there some other way to do what I'm asking for?
On 9/9/05, Ben <bench@silentmedia.com> wrote:
> I'm working on a problem that I imagine others have had, which basically
> boils down to having nice unicode display text that users are going to
> want to search against without typing it correctly.... e.g. let a search
> for "sma" match "små". It seems like the best way to do this is to find
> a magic unicode transliteration mapping function, and then save the
> ASCII transliterations for searching against.

The simplest solution to this that I've found is to maintain a separate column for an ASCII-ized version of your text. The conversion can be done automatically using a trigger, and I have one in PL/PerlU that I use. It basically boils down to:

1) transform the unicode text to normalization form D
2) strip the combining non-spacing marks

In modern Perls that looks like:

#--------------
use Unicode::Normalize;
# Decompose to NFD: "å" becomes "a" plus a combining ring.
my $txt = NFD(shift());
# Remove every combining mark (\pM), leaving the base letters.
$txt =~ s/\pM//og;
return $txt;
#--------------

Hope that helps!

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
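For readers who do not use Perl, the two steps described in the message (normalization form D, then stripping combining marks) can be sketched in Python. This is only an illustration under stated assumptions, not the trigger Mike actually runs: the function name strip_marks is hypothetical, and the check on the general category starting with "M" is meant to mirror Perl's \pM.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose text to NFD, then drop every combining mark.

    strip_marks is a hypothetical name chosen for this sketch.
    """
    # Step 1: canonical decomposition (normalization form D).
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop characters whose general category is a Mark (Mn, Mc, Me),
    # which is the same set that \pM matches in the Perl snippet.
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


print(strip_marks("sm\u00e5"))  # sma
```

The stripped value would go into the separate search column, and the same function would be applied to the user's search term before comparing.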
Hrm, I must be missing something, because I don't see how this will transliterate to ASCII?

On Sep 10, 2005, at 5:30 AM, Mike Rylander wrote:
> 1) transform unicode text to normal form D
> 2) strip combining non-spacing marks
On 9/10/05, Ben <bench@silentmedia.com> wrote:
> Hrm, I must be missing something, because I don't see how this will
> transliterate to ASCII?

If you want non-Western text to be romanized, you can take a look at Text::Unidecode(1). The chunk of Perl I sent before only strips non-spacing marks (accents, rings, umlauts and such). You may need to strip other character classes if you've got unicode punctuation codepoints in the text to be searched.

For the example you gave, the process is to decompose the character "å" into normalization form D, which gives "a" followed by the unicode non-spacing mark for the ring, and then to remove that mark (the ring diacritic) with the regex s/\pM//sog. That leaves just the ASCII "a", and the text can then be treated as pure ASCII, because it no longer contains any unicode codepoints with an ord() above 127.

You may want to look here(2) for an explanation and examples of Unicode normalization forms.

If you don't need that much functionality (handling arbitrary unicode text), and you're dealing strictly with the Latin1 subset of unicode, you can just create a mapping table or hash to transliterate down to ASCII, as done here(3).

1) http://cpan.uwinnipeg.ca/htdocs/Text-Unidecode/Text/Unidecode.html
2) http://www.unicode.org/unicode/reports/tr15/#Canonical_Composition_Examples
3) http://www.eprints.org/files/eprints2/eprints-2.2/defaultcfg/ArchiveTextIndexingConfig.pm
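The decomposition of "å" and the mapping-table idea mentioned in the message can be combined in a short Python sketch. It is only an illustration under stated assumptions: the names LATIN1_FALLBACK and to_search_key are hypothetical, and the table is deliberately incomplete. It covers a few Latin-1 letters (such as "ø" and "ß") that have no canonical decomposition, so NFD plus mark stripping alone leaves them unchanged.

```python
import unicodedata

# Hypothetical, incomplete table for Latin-1 letters that NFD cannot
# decompose into a base letter plus a combining mark.
LATIN1_FALLBACK = str.maketrans({
    "\u00f8": "o", "\u00d8": "O",    # o with stroke
    "\u00e6": "ae", "\u00c6": "AE",  # ae ligature
    "\u00df": "ss",                  # sharp s
    "\u00f0": "d", "\u00d0": "D",    # eth
    "\u00fe": "th", "\u00de": "Th",  # thorn
})


def to_search_key(text: str) -> str:
    """Strip combining marks, then apply the fallback mapping table."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    return stripped.translate(LATIN1_FALLBACK)


# U+00E5 (a with ring above) decomposes into "a" plus U+030A COMBINING RING ABOVE.
print([f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", "\u00e5")])
# ['U+0061', 'U+030A']

print(to_search_key("sm\u00f8rbr\u00f8d p\u00e5"))  # smorbrod pa
```

Text outside the table and outside the decomposable letters passes through unchanged, so a check such as str.isascii() on the result shows whether further mappings are needed.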