Thread: Re: BUG #15548: Unaccent does not remove combining diacritical characters

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From

raam narayana

Date:

February 10, 2019, 23:06:25

Hi,

After the latest commit on the master branch, I tried to test the Python script. I still see that the output from the script is completely different from the contents of the unaccent.rules file. Am I missing anything? My testing included the following.

I downloaded the following files:

http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt

http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml

Then I ran the Python script:

python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules

I am using Python 3.7.1 on Windows 10.

The new status of this patch is: Needs review

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From

Thomas Munro

Date:

February 10, 2019, 23:44:01

On Mon, Feb 11, 2019 at 7:07 AM raam narayana <raam.soft@gmail.com> wrote:
> After the latest commit in master branch, I was trying to test the python script. Ironically I still see that the output from the script is completely different from the unaccent.rules file content. Am I missing anything. My testing includes the following
>
> Downloaded the following files
>
> http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
>
> http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
>
> Executed the below python script
>
> python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
>
> I am using python 3.7.1 and running on Windows 10 Platform
>
> The new status of this patch is: Needs review

Hi Raam,

How does it differ?  Can you please share the output you get?  I used
Python 2.7 on a Mac, exactly those input files, and my output matched
Hugh's.

-- 
Thomas Munro
http://www.enterprisedb.com

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From

Hugh Ranalli

Date:

February 11, 2019, 22:20:42

On Sun, 10 Feb 2019 at 15:07, raam narayana <raam.soft@gmail.com> wrote:

> Hi,
>
> After the latest commit in master branch, I was trying to test the python script. Ironically I still see that the output from the script is completely different from the unaccent.rules file content. Am I missing anything. My testing includes the following
>
> Downloaded the following files
>
> http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
>
> http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
>
> Executed the below python script
>
> python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
>
> I am using python 3.7.1 and running on Windows 10 Platform
>
> The new status of this patch is: Needs review

Hi Raam,

I just ran generate_unaccent_rules.py under two environments, using the data files given above:

- Python 3.4.3 on Linux Mint 17.3 (equivalent to Ubuntu 14.04)

- Python 3.6.7 on Ubuntu 18.04

In both cases, the output was identical to that generated by the program under Python 2.7. So yes, more information would help. Unfortunately I don't have a Windows Python environment readily available, but could set one up if I had to.

Thanks,

Hugh

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From

Ramanarayana

Date:

February 11, 2019, 23:57:31

Hi Hugh,

I tested the script in Python 2.7 and it works perfectly. The problem is in Python 3.7 (and maybe only on Windows, since you were not seeing the issue), where I was getting the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in position 0: character maps to <undefined>

I went through the Python script and found that the stdout encoding is set to UTF-8 only if the Python version is 2 or lower.

I have made the same change for Python 3 as well. Please find the patch attached. Let me know if it makes sense.
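For illustration, the error above can be reproduced without running the script. The sketch below assumes cp1252 as the Windows default encoding (the thread does not say which code page was in effect), and `try_encode` is a hypothetical helper, not part of generate_unaccent_rules.py. Python's cp1252 codec is table-driven, which is why such failures are reported under the name 'charmap'.

```python
# Illustration only: cp1252 is an assumed stand-in for the Windows default encoding.
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# U+0100 LATIN CAPITAL LETTER A WITH MACRON, the character named in the traceback.
a_macron = u"\u0100"

cp1252_result = try_encode(a_macron, "cp1252")  # not representable in cp1252
utf8_result = try_encode(a_macron, "utf-8")     # representable in UTF-8

print(cp1252_result)
print(utf8_result)
```

The character has no slot in cp1252, so any stream that encodes with that code page would fail on the first such rule line, while a UTF-8 stream would not.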

Regards,

Ram.

Cheers
Ram 4.0

Attachments

generate_unaccent_rules-remove-combining-diacritical-accents-03.patch

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From

Michael Paquier

Date:

February 12, 2019, 07:18:19

On Tue, Feb 12, 2019 at 02:27:31AM +0530, Ramanarayana wrote:
> I tested the script in python 2.7 and it works perfect. The problem is in
> python 3.7(and may be only in windows as you were not getting the issue)
> and I was getting the following error
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
> position 0: character maps to <undefined>
>
>  I went through the python script and found that the stdout encoding is set
> to utf-8 only  if python version is <=2.
>
> I have made the same change for python version 3 as well. Please find the
> patch for the same.Let me know if it makes sense

Isn't that because Windows encoding becomes cp1252, utf16 or such?
FWIW, on Debian SID with Python 3.7, I get the correct output, and no
diffs on HEAD.  Perhaps it would make sense to use open() on the
different files with encoding='utf-8' to avoid any kind of problems?
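A minimal sketch of that suggestion, assuming `io.open` is used so that the same call works on Python 2 and Python 3 (on Python 3 it is the same function as the built-in `open`). The helper name `read_lines_utf8` and the throwaway demo file are hypothetical and are not taken from the script.

```python
import io
import os
import tempfile


def read_lines_utf8(path):
    """Read a text file as UTF-8, independent of the platform's default encoding."""
    # The with block closes the file even if reading raises an exception.
    with io.open(path, mode="r", encoding="utf-8") as f:
        return [line.rstrip(u"\n") for line in f]


# Self-contained demonstration using a temporary file with one non-ASCII line.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, mode="w", encoding="utf-8") as f:
    f.write(u"\u0100\tA\n")

demo_lines = read_lines_utf8(demo_path)
os.remove(demo_path)
```

Passing the encoding explicitly means the result no longer depends on the locale or code page of the machine running the script.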
--
Michael

Attachments

signature.asc

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From

Ramanarayana

Date:

February 12, 2019, 16:54:20

Hi Michael,

The issue was that the Python script worked in Python 2 but not in Python 3 on Windows. This is because the script writes its final output to stdout, and the stdout encoding is set to UTF-8 only for Python 2, not for Python 3. If no encoding is set for stdout, Python takes the encoding from the operating system. The default encoding on Linux and Windows may differ, hence this issue.
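To see which defaults apply on a given machine, something like the following can be run under each interpreter. It is only a diagnostic sketch; `describe_output_encoding` is a hypothetical name, and the values it reports depend on the platform, the locale, and whether output is redirected.

```python
import locale
import sys


def describe_output_encoding():
    """Report the encodings Python falls back to when none is given explicitly."""
    return {
        # Encoding attached to the current stdout object (may be None on Python 2
        # when output is piped or redirected).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Locale-derived default; False avoids changing the process locale.
        "preferred": locale.getpreferredencoding(False),
    }


info = describe_output_encoding()
print(info)
```

On Python 3, when output is redirected to a file as in the command earlier in the thread, the locale's preferred encoding is normally the one that applies, which would explain why the same script behaves differently on Linux and Windows.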

Regards,

Ram.

Cheers
Ram 4.0

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From

Hugh Ranalli

Date:

February 12, 2019, 19:21:35

On Tue, 12 Feb 2019 at 08:54, Ramanarayana <raam.soft@gmail.com> wrote:

> Hi Michael,
>
> The issue was that the python script was working in python 2 but not in python 3 in Windows. This is because the python script writes the final output to stdout and stdout encoding is set to utf-8 only for python 2 but not python 3. If no encoding is set for stdout it takes the encoding from the Operating system. Default encoding in linux and windows might be different. Hence this issue.
>
> Regards,
> Ram.

I can't look at this today, but will fire up Windows and Python tomorrow, look at Ram's patch, and see what is going on. I'll also look at how we open the input files, to see if we should supply an encoding. Those input files only make sense in UTF-8 anyway.

Ram, thanks for catching this issue.

Hugh

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From

Hugh Ranalli

Date:

February 17, 2019, 03:51:08

On Mon, 11 Feb 2019 at 15:57, Ramanarayana <raam.soft@gmail.com> wrote:

> Hi Hugh,
>
> I tested the script in python 2.7 and it works perfect. The problem is in python 3.7 (and may be only in windows as you were not getting the issue) and I was getting the following error
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in position 0: character maps to <undefined>
>
> I went through the python script and found that the stdout encoding is set to utf-8 only if python version is <=2.
>
> I have made the same change for python version 3 as well. Please find the patch for the same. Let me know if it makes sense
>
> Regards,
> Ram

Hi Ram,

I took a look at this, and unfortunately the proposed fix breaks Python 2 (sys.stdout.encoding isn't a writable attribute in Python 2) :-(. I've attached a patch which is compatible with both versions, and have confirmed that the output is identical across Python 2 and 3 and across both Windows and Linux. The output on Windows and Linux is identical, once the difference in line endings is accounted for.
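As a rough sketch of how output can be forced to UTF-8 on both interpreters without assigning to `sys.stdout.encoding`: the code below is not the attached patch (its contents are not shown in this thread), and `utf8_writer` and `utf8_stdout` are hypothetical names. It relies on `codecs.getwriter`, which exists in both Python 2 and Python 3.

```python
import codecs
import io
import sys


def utf8_writer(binary_stream):
    """Wrap a byte stream so that text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(binary_stream)


def utf8_stdout():
    """Return a UTF-8 text writer for stdout on either Python 2 or Python 3."""
    # Python 3 exposes the underlying byte stream as sys.stdout.buffer;
    # Python 2's sys.stdout accepts bytes directly, so it is used as-is.
    return utf8_writer(getattr(sys.stdout, "buffer", sys.stdout))


# Demonstration against an in-memory byte stream instead of the real stdout.
sink = io.BytesIO()
utf8_writer(sink).write(u"\u0100\tA\n")
encoded = sink.getvalue()
```

One caveat of this approach: on Python 3 it writes to the underlying byte stream, so newline translation is bypassed and the line endings may differ from what a text-mode stream would produce on Windows.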

I've also opened the Unicode data file in UTF-8 and added a "with" block which ensures we close the file when we are done with it. The change makes the Python 2 compatibility a little more complex (2 blocks to remove), but it's the cleanest I could achieve.
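The remark above that the Windows and Linux outputs match once line endings are accounted for can be checked with a comparison along these lines. This is a sketch with made-up sample bytes; `same_ignoring_line_endings` is a hypothetical helper, and in practice the two byte strings would be read from the generated files in binary mode.

```python
def same_ignoring_line_endings(data_a, data_b):
    """Compare two byte strings, treating CRLF and LF as the same line ending."""
    return data_a.replace(b"\r\n", b"\n") == data_b.replace(b"\r\n", b"\n")


# Made-up sample data: the same two UTF-8 encoded lines with different line endings.
windows_output = b"\xc4\x80\tA\r\n\xc4\x81\ta\r\n"
linux_output = b"\xc4\x80\tA\n\xc4\x81\ta\n"

outputs_match = same_ignoring_line_endings(windows_output, linux_output)
```

Comparing bytes rather than decoded text keeps the check independent of whichever default encoding the comparing machine uses.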

The attached patch goes on top of patch 02 (not on top of the broken, committed 03). I'm hoping that's not a problem. If it is, let me know and I'll factor out the changes.

Please let me know if you have any questions.

Best wishes,

Hugh

Attachments

generate_unaccent_rules-remove-combining-diacritical-accents-04.patch

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From

Ramanarayana

Date:

February 17, 2019, 10:15:39

Hi Hugh,

The patch I submitted was tested in both Python 2 and 3, and it worked for me. The single line of code added in the patch runs only in Python 3, so I don't think it can break Python 2. I would like to see the error you got in Python 2. Good to know that the reported issue is a valid one on Windows. I tested your patch as well, and it also works fine.

Cheers
Ram 4.0

Re: BUG #15548: Unaccent does not remove combining diacritical characters

From

Michael Paquier

Date:

February 18, 2019, 06:36:48

On Sun, Feb 17, 2019 at 12:45:39PM +0530, Ramanarayana wrote:
> The patch I submitted was tested both in python 2 and 3 and it worked for
> me.The single line of code
> added in the patch runs only in python 3. I dont think it can break
> python2. Would like to see the error you got in python 2   Good to know the
> reported issue  is a valid one in windows.I tested your patch as well and
> it is also working fine.

I can see that the commit fest entry associated with this thread has
been switched back from "committed" to "Needs Review" with Thomas
Munro still associated as committer.  The thing is that we have
already committed all the bits discussed here, so I am switching the
status back to "committed", which reflects the state of the thread.
If you have a set of fixes for what has been pushed regarding Windows
and Python 2/3 capabilities, I would suggest creating a new entry with
yourself as the author.  Spawning a new thread would also be nice so
that you attract the correct audience; this thread, initially about
diacritical character support for unaccent, has been used more than
enough now.

Python 2/3 support for this script is easy enough to check on Linux,
and now you are adding Windows in the mix...

Thanks,
--
Michael

Attachments

signature.asc
