Discussion: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Hi,
After the latest commit in the master branch, I tried to test the Python script. Ironically, I still see that the output from the script is completely different from the content of the unaccent.rules file. Am I missing anything? My testing consisted of the following.
Downloaded the following files:
http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
Executed the Python script as follows:
python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
I am using Python 3.7.1 on Windows 10.
The new status of this patch is: Needs review
On Mon, Feb 11, 2019 at 7:07 AM raam narayana <raam.soft@gmail.com> wrote:
> After the latest commit in master branch, I was trying to test the python script. Ironically I still see that the output from the script is completely different from the unaccent.rules file content. Am I missing anything. My testing includes the following
>
> Downloaded the following files
>
> http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
>
> http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
>
> Executed the below python script
>
> python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
>
> I am using python 3.7.1 and running on Windows 10 Platform
>
> The new status of this patch is: Needs review
Hi Raam,
How does it differ? Can you please share the output you get? I used Python 2.7 on a Mac, exactly those input files, and my output matched Hugh's.
--
Thomas Munro
http://www.enterprisedb.com
On Sun, 10 Feb 2019 at 15:07, raam narayana <raam.soft@gmail.com> wrote:
> Hi,
> After the latest commit in master branch, I was trying to test the python script. Ironically I still see that the output from the script is completely different from the unaccent.rules file content. Am I missing anything? My testing includes the following
> Downloaded the following files
> http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
> http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
> Executed the below python script
> python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
> I am using python 3.7.1 and running on Windows 10 Platform
> The new status of this patch is: Needs review
Hi Raam,
I just ran generate_unaccent_rules.py under two environments, using the data files given above:
- Python 3.4.3 on Linux Mint 17.3 (equivalent to Ubuntu 14.04)
- Python 3.6.7 on Ubuntu 18.04
In both cases, the output was identical to that generated by the program under Python 2.7. So yes, more information would help. Unfortunately I don't have a Windows Python environment readily available, but could set one up if I had to.
Thanks,
Hugh
Hi Hugh,
I tested the script in Python 2.7 and it works perfectly. The problem is in Python 3.7 (and maybe only on Windows, since you were not seeing the issue), where I was getting the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in position 0: character maps to <undefined>
I went through the Python script and found that the stdout encoding is set to UTF-8 only if the Python version is <= 2.
I have made the same change for Python 3 as well. Please find the patch attached. Let me know if it makes sense.
Regards,
Ram.
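The error above can be reproduced without a Windows console. The following is a minimal Python 3 sketch, assuming cp1252 as a stand-in for the legacy code page that stdout would pick up on such a system (the thread does not state which code page was actually in use); `try_encode` is a hypothetical helper, not part of generate_unaccent_rules.py.

```python
# Assumption: cp1252 stands in for the Windows code page used by stdout.
# U+0100 (LATIN CAPITAL LETTER A WITH MACRON) is the character named in
# the traceback.
char = "\u0100"


def try_encode(text, encoding):
    """Return the encoded bytes, or the UnicodeEncodeError that was raised."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


cp1252_result = try_encode(char, "cp1252")
utf8_result = try_encode(char, "utf-8")

# cp1252 has no mapping for U+0100, so encoding fails; UTF-8 can encode it.
print(type(cp1252_result).__name__)  # UnicodeEncodeError
print(utf8_result)                   # b'\xc4\x80'
```

This is the same failure that print() hits when stdout is bound to an encoding that cannot represent a character in the generated rules.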
Attachments
On Tue, Feb 12, 2019 at 02:27:31AM +0530, Ramanarayana wrote:
> I tested the script in python 2.7 and it works perfect. The problem is in
> python 3.7(and may be only in windows as you were not getting the issue)
> and I was getting the following error
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
> position 0: character maps to <undefined>
>
> I went through the python script and found that the stdout encoding is set
> to utf-8 only if python version is <=2.
>
> I have made the same change for python version 3 as well. Please find the
> patch for the same. Let me know if it makes sense
Isn't that because Windows encoding becomes cp1252, utf16 or such?
FWIW, on Debian SID with Python 3.7, I get the correct output, and no
diffs on HEAD. Perhaps it would make sense to use open() on the
different files with encoding='utf-8' to avoid any kind of problems?
--
Michael
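The suggestion to pass an explicit encoding when opening the input files could look roughly like the sketch below. It uses io.open, which accepts an encoding argument on both Python 2 and Python 3. The helper name `read_lines_utf8` and the sample content are illustrative assumptions; the sample is not taken from the real UnicodeData.txt or Latin-ASCII.xml files.

```python
import io
import os
import tempfile


def read_lines_utf8(path):
    """Read a text file as UTF-8 regardless of the platform default."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Illustrative content only: two made-up rule-like lines with non-ASCII text.
sample_text = u"\u0100 \u2192 A ;\n\u00e9 \u2192 e ;\n"

fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Write and read back with the same explicit encoding.
    with io.open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(sample_text)
    lines = read_lines_utf8(sample_path)
finally:
    os.remove(sample_path)

print(len(lines))
```

With the encoding spelled out, the result no longer depends on the locale or code page of the machine running the script.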
Attachments
Hi Michael,
The issue was that the Python script worked under Python 2 but not under Python 3 on Windows. This is because the script writes its final output to stdout, and the stdout encoding is set to UTF-8 only for Python 2, not for Python 3. If no encoding is set for stdout, it takes the encoding from the operating system. The default encodings on Linux and Windows may differ, hence this issue.
Regards,
Ram.
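One way to see which encoding a given environment hands to stdout, and what difference it makes to the bytes that end up in a redirected file, is sketched below. The helper names are hypothetical, and cp1252 is again only an assumed example of a non-UTF-8 default. On Python 3 the default can also be overridden from outside the script with the PYTHONIOENCODING environment variable.

```python
import io
import locale
import sys


def default_encodings():
    """Report the encodings Python picked up from the environment."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "locale": locale.getpreferredencoding(False),
    }


def bytes_written(text, encoding):
    """Return the bytes a text stream with the given encoding would emit."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # release the wrapper without closing the byte buffer
    return data


# The values printed here vary between machines, which is the point.
print(default_encodings())

# The same character produces different bytes under different encodings.
print(bytes_written(u"\u00e9", "cp1252"))  # b'\xe9'
print(bytes_written(u"\u00e9", "utf-8"))   # b'\xc3\xa9'
```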
On Tue, 12 Feb 2019 at 08:54, Ramanarayana <raam.soft@gmail.com> wrote:
> Hi Michael,
> The issue was that the python script was working in python 2 but not in python 3 in Windows. This is because the python script writes the final output to stdout and stdout encoding is set to utf-8 only for python 2 but not python 3. If no encoding is set for stdout it takes the encoding from the Operating system. Default encoding in linux and windows might be different. Hence this issue.
> Regards,
> Ram.
I can't look at this today, but will fire up Windows and Python tomorrow, look at Ram's patch, and see what is going on. I'll also look at how we open the input files, to see if we should supply an encoding. Those input files only make sense as UTF-8 anyway.
Ram, thanks for catching this issue.
Hugh
On Mon, 11 Feb 2019 at 15:57, Ramanarayana <raam.soft@gmail.com> wrote:
> Hi Hugh,
> I tested the script in python 2.7 and it works perfect. The problem is in python 3.7 (and may be only in windows as you were not getting the issue) and I was getting the following error
> UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in position 0: character maps to <undefined>
> I went through the python script and found that the stdout encoding is set to utf-8 only if python version is <=2.
> I have made the same change for python version 3 as well. Please find the patch for the same. Let me know if it makes sense
> Regards,
> Ram
Hi Ram,
I took a look at this, and unfortunately the proposed fix breaks Python 2 (sys.stdout.encoding isn't a writable attribute in Python 2) :-(. I've attached a patch which is compatible with both versions, and have confirmed that the output is identical across Python 2 and 3, and across Windows and Linux once the difference in line endings is accounted for.
I've also opened the Unicode data file in UTF-8 and added a "with" block which ensures we close the file when we are done with it. The change makes the Python 2 compatibility a little more complex (2 blocks to remove), but it's the cleanest I could achieve.
The attached patch goes on top of patch 02 (not on top of the broken, committed 03). I'm hoping that's not a problem. If it is, let me know and I'll factor out the changes.
Please let me know if you have any questions.
Best wishes,
Hugh
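For readers without the attachment, here is one possible shape for a stdout wrapper that works on both Python 2 and Python 3. This is a sketch under assumptions, not the attached patch: `utf8_writer` is a hypothetical helper, and it relies on Python 3 text streams exposing their byte stream as `.buffer`, while Python 2's sys.stdout accepts bytes directly.

```python
import codecs
import io


def utf8_writer(stream):
    """Wrap a stream so that text written to it is encoded as UTF-8.

    If the stream has a .buffer attribute (Python 3 text streams), the
    underlying byte stream is wrapped; otherwise the stream itself is.
    """
    target = getattr(stream, "buffer", stream)
    return codecs.getwriter("utf-8")(target)


# Demonstrated against an in-memory byte stream instead of the real stdout.
raw = io.BytesIO()
out = utf8_writer(raw)
out.write(u"\u0100\tA\n")
encoded = raw.getvalue()
print(encoded)  # b'\xc4\x80\tA\n' on Python 3
```

In a script, the equivalent would be `sys.stdout = utf8_writer(sys.stdout)` near the top, before anything is printed. On Python 3.7 and later, `sys.stdout.reconfigure(encoding="utf-8")` is an alternative, but it is not available on Python 2.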
Attachments
Hi Hugh,
The patch I submitted was tested in both Python 2 and 3, and it worked for me. The single line of code added in the patch runs only in Python 3, so I don't think it can break Python 2. I would like to see the error you got in Python 2. Good to know the reported issue is a valid one on Windows. I tested your patch as well and it is also working fine.
Cheers
Ram 4.0
On Sun, Feb 17, 2019 at 12:45:39PM +0530, Ramanarayana wrote:
> The patch I submitted was tested both in python 2 and 3 and it worked for
> me. The single line of code
> added in the patch runs only in python 3. I dont think it can break
> python2. Would like to see the error you got in python 2. Good to know the
> reported issue is a valid one in windows. I tested your patch as well and
> it is also working fine.
I can see that the commit fest entry associated with this thread has been switched back from "committed" to "Needs Review", with Thomas Munro still listed as committer. The thing is that we have already committed all the bits discussed here, so I am switching the status back to "committed", which reflects the state of the thread.
If you have a set of fixes for what has been pushed regarding Windows and Python 2/3 capabilities, I would suggest creating a new entry with yourself as the author. Starting a new thread would also be nice so that you attract the right audience; this thread, initially about diacritical character support for unaccent, has been used more than enough now. Python 2/3 support for this script is easy enough to check on Linux, and now you are adding Windows to the mix...
Thanks,
--
Michael