Discussion: Automatic locale detection?
Is it possible to automatically detect the language encoding of incoming data? For instance if Japanese is used, is there a way to know it is Japanese from a bit in the charset, a dictionary-based evaluation or otherwise?
On Sun, Oct 08, 2006 at 12:04:01PM -0700, Matthew Peter wrote:
> Is it possible to automatically detect the language encoding of
> incoming data? For instance if Japanese is used, is there a way to
> know it is Japanese from a bit in the charset, a dictionary-based
> evaluation or otherwise?

While technically possible, do you really want to run the risk of
getting it wrong? Secondly, if you don't know the encoding of your data,
you've got a security problem, since you can't safely escape the data.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
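One way to act on the point above is to refuse to guess: require the sender to declare an encoding and reject bytes that are not valid in it. The following is only a minimal sketch of that idea in Python; the function name `decode_declared` and its parameters are hypothetical and not from this thread.

```python
def decode_declared(data, declared_encoding):
    """Decode bytes using the encoding the sender declared, or fail loudly.

    Hypothetical helper: nothing is guessed here. Invalid bytes and unknown
    encoding names are both reported as ValueError.
    """
    try:
        # errors="strict" makes invalid byte sequences raise instead of
        # being silently replaced or dropped.
        return data.decode(declared_encoding, errors="strict")
    except (UnicodeDecodeError, LookupError) as exc:
        raise ValueError(
            "input is not valid %s" % declared_encoding
        ) from exc
```

Only after the bytes have been decoded this way is the text in a known form that can be escaped safely.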
Matthew Peter wrote:
> Is it possible to automatically detect the language encoding of incoming
> data? For instance if Japanese is used, is there a way to know it is
> Japanese from a bit in the charset, a dictionary-based evaluation or
> otherwise?
>

Have a look at http://www.mozilla.org/projects/intl/chardet.html and
http://chardet.feedparser.org/ for some implementations of this idea.

These detectors are often inaccurate though (and sometimes fail
completely), see the warning at the bottom of
http://chardet.feedparser.org/docs/supported-encodings.html

Regards,
LL
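To illustrate why such guesses can go wrong, here is a rough sketch in Python that simply tries a few candidate encodings and reports every one under which the bytes decode without error. This is not how the detectors linked above work (they are statistical); it is a much cruder trial-decoding approach, and the candidate list and the name `plausible_encodings` are assumptions made for the example.

```python
# Assumed candidate list for Japanese text; a real application would choose its own.
CANDIDATES = ("utf-8", "euc_jp", "shift_jis", "iso2022_jp")


def plausible_encodings(data, candidates=CANDIDATES):
    """Return every candidate encoding under which `data` decodes cleanly.

    A clean decode only means the bytes are *valid* in that encoding,
    not that the sender actually used it.
    """
    matches = []
    for name in candidates:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches
```

For pure ASCII input such as `b"hello"`, all four candidates decode cleanly, so the result is ambiguous. Short or unusual inputs can also be valid in more than one multibyte encoding, which is one reason automatic detection can pick the wrong answer.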