Thread: UNICODE encoding and jdbc related issues
Hello All,

I posted this on the general mailing list several days ago without a
response, so am posting here. Does anyone here have any wisdom or
experience they don't mind sharing?

Our production database was created with the default SQL_ASCII encoding.
It appears that some of our users have entered characters above 127
(accented vowels, etc.) into the system. Until recently, none of the
tools we use had a problem with this; everything just worked.

I was testing some reporting tools this past weekend and have been
playing with Jasper Reports[1], a Java-based reporting tool that reads
data from the database via JDBC. When I initially tried to generate
reports, the JDBC connection would crash with the following message:

    org.postgresql.util.PSQLException: Invalid character data was found.

Googling eventually turned up a message on the pgsql-jdbc list detailing
the problem[2]. Basically, Java cannot convert these characters above 127
into Unicode, which Java requires internally.

After some more googling, I found that if I took a recent database dump,
ran it through iconv[3], and then created the database with a unicode
encoding, everything worked.

1. Is there any way to do an iconv-type translation inline in a SQL
statement? i.e. select translate(text_field, unicode) from table... Btw,
set client_encoding=UNICODE does not work in this situation. In fact, the
JDBC driver for postgres seems to do this automatically.

2. I'm really not sure I want to change the encoding of our main database
to Unicode. Is there a performance loss when going to a UNICODE database
encoding? What about sorts, etc.? I'm really worried about unintended
side effects of moving from SQL_ASCII to UNICODE.

3. Is there any other way around this issue? Or are we living dangerously
by trying to store non-ascii data in a database created as ascii encoded?

4. Has anyone else gone through a conversion like this? Are there any
gotchas we should look out for?

Thanks for your time,

-Chris

We are using postgres 7.4.5 on Linux.

[1] http://jasperreports.sourceforge.net/
[2] http://archives.postgresql.org/pgsql-jdbc/2004-10/msg00280.php
[3] iconv -f iso8859-1 -t utf-8 < dbsnapshot.dumpall > dump-utf-8.dumpall

--
Chris Kratz
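For reference, the iconv step in [3] can be reproduced in a few lines of
Java. This is only a sketch: like [3], it assumes every non-ASCII byte in
the dump really is ISO-8859-1, and it reuses the file names from [3].

    import java.io.*;

    // Java equivalent of:
    //   iconv -f iso8859-1 -t utf-8 < dbsnapshot.dumpall > dump-utf-8.dumpall
    // Assumes, as [3] does, that every non-ASCII byte in the dump is ISO-8859-1.
    public class DumpTranscoder {
        public static void main(String[] args) throws IOException {
            Reader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream("dbsnapshot.dumpall"), "ISO-8859-1"));
            Writer out = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream("dump-utf-8.dumpall"), "UTF-8"));
            char[] buf = new char[8192];
            int n;
            // Every ISO-8859-1 byte maps to exactly one Unicode code point,
            // so a plain read/write loop performs the whole conversion.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            out.close();
            in.close();
        }
    }

Depending on the pg_dump version, the converted dump may also contain a
SET client_encoding line that should be updated to match the new database
encoding.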
On Wed, 6 Apr 2005, Chris Kratz wrote:

> Our production database was created with the default SQL_ASCII encoding.
> It appears that some of our users have entered characters above 127
> (accented vowels, etc.) into the system. Until recently, none of the
> tools we use had a problem with this; everything just worked.
>
> I was testing some reporting tools this past weekend and have been
> playing with Jasper Reports[1], a Java-based reporting tool that reads
> data from the database via JDBC. When I initially tried to generate
> reports, the JDBC connection would crash with the following message:
>
>     org.postgresql.util.PSQLException: Invalid character data was found.
>
> Googling eventually turned up a message on the pgsql-jdbc list detailing
> the problem[2]. Basically, Java cannot convert these characters above
> 127 into Unicode, which Java requires internally.
>
> After some more googling, I found that if I took a recent database dump,
> ran it through iconv[3], and then created the database with a unicode
> encoding, everything worked.
>
> 1. Is there any way to do an iconv-type translation inline in a SQL
> statement? i.e. select translate(text_field, unicode) from table...
> Btw, set client_encoding=UNICODE does not work in this situation. In
> fact, the JDBC driver for postgres seems to do this automatically.

You can't do translation inline; how would a driver interpret the results
of SELECT translate(field1, unicode), translate(field2, latin1)?

The driver does SET client_encoding, which does work for all real server
encodings. The problem is that SQL_ASCII is not a real encoding: it
accepts data in any encoding and cannot do conversions to other encodings.
Your db right now could easily contain a mix of encodings.

> 2. I'm really not sure I want to change the encoding of our main
> database to Unicode. Is there a performance loss when going to a UNICODE
> database encoding? What about sorts, etc.? I'm really worried about
> unintended side effects of moving from SQL_ASCII to UNICODE.

You don't need to use unicode, but you must select another encoding. If
you'd like to stick with a single-byte encoding, perhaps LATIN1 would be
appropriate for you.

> 3. Is there any other way around this issue? Or are we living
> dangerously by trying to store non-ascii data in a database created as
> ascii encoded?

You are living dangerously.

> 4. Has anyone else gone through a conversion like this? Are there any
> gotchas we should look out for?

The gotcha here is making sure your other client tools still work against
the new database.

> [3] iconv -f iso8859-1 -t utf-8 < dbsnapshot.dumpall > dump-utf-8.dumpall

I see your data really is LATIN1. Perhaps you should use that as your db
encoding. That should keep your existing client tools happy as well as the
JDBC driver.

Kris Jurka
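Before deciding between LATIN1 and UNICODE, it can help to confirm what
encoding the server has on record for the database. A minimal JDBC sketch;
the connection URL and credentials are placeholders:

    import java.sql.*;

    public class EncodingCheck {
        public static void main(String[] args) throws Exception {
            Class.forName("org.postgresql.Driver");
            // Placeholder URL and credentials.
            Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "password");
            Statement st = con.createStatement();
            // pg_encoding_to_char() turns the stored encoding id into a name.
            ResultSet rs = st.executeQuery(
                    "SELECT pg_encoding_to_char(encoding) FROM pg_database"
                    + " WHERE datname = current_database()");
            if (rs.next()) {
                System.out.println("Server encoding: " + rs.getString(1));
            }
            con.close();
        }
    }

A SQL_ASCII result here means the server is not tracking the encoding at
all, which is exactly the situation described above.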
On Wed, 6 Apr 2005, Kris Jurka wrote:

> > 2. I'm really not sure I want to change the encoding of our main
> > database to Unicode. Is there a performance loss when going to a
> > UNICODE database encoding? What about sorts, etc.? I'm really worried
> > about unintended side effects of moving from SQL_ASCII to UNICODE.
>
> You don't need to use unicode, but you must select another encoding. If
> you'd like to stick with a single-byte encoding, perhaps LATIN1 would be
> appropriate for you.

I've asked this before on the performance list but didn't get any reply.
Is there a substantial performance difference between using SQL_ASCII,
LATIN1, or UNICODE?

> The driver does SET client_encoding, which does work for all real server
> encodings. The problem is that SQL_ASCII is not a real encoding: it
> accepts data in any encoding and cannot do conversions to other
> encodings. Your db right now could easily contain a mix of encodings.

ISTM that when you create a database with SQL_ASCII encoding you decide to
deal with character set issues in the applications. Why is the JDBC driver
dictating how the application handles character set issues?

-Igor
On Wed, 6 Apr 2005, Igor Postelnik wrote:

> I've asked this before on the performance list but didn't get any reply.
> Is there a substantial performance difference between using SQL_ASCII,
> LATIN1, or UNICODE?

Performance where? Backend performance in terms of string comparisons and
sorting is driven by locale, not encoding. You may use the C locale with
the UNICODE encoding, for example, so that should not be an issue.

The JDBC driver always wants data coming back to it in Unicode. If you've
got a unicode db, no conversion is necessary. If you've got a sql_ascii
db, no conversion is possible. If you've got a latin1 db, conversion will
happen, but I don't know what the cost of that is.

> ISTM that when you create a database with SQL_ASCII encoding you decide
> to deal with character set issues in the applications. Why is the JDBC
> driver dictating how the application handles character set issues?

If the only API the JDBC driver provided was ResultSet.getBytes() then
that would be OK (note this is the only API libpq provides). To provide
getString(), the driver must know what encoding the data coming back is
really in. A database encoding of sql_ascii tells us nothing, so we can do
nothing about it.

It has been suggested in the past to allow the real database encoding for
a sql_ascii database to be specified as a URL parameter, but I am of the
opinion that this just masks the problem instead of solving it. Data
should be stored in a correctly encoded database. If you store unicode
data in a sql_ascii database, then things like varchar(N) now count bytes
instead of the characters they should. With sql_ascii there is no
restriction on what data can be entered, and you can get yourself into a
real mess with different clients entering data in different encodings.

Do yourself a favor and pick a real encoding.

Kris Jurka
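To make those two points concrete, here is a small sketch. The
readAsLatin1() helper and its column name are hypothetical, and whether
getBytes() on a text column returns the untranslated backend bytes depends
on the driver version, so treat that part as an assumption; main() just
demonstrates the byte-versus-character distinction behind the varchar(N)
remark.

    import java.io.UnsupportedEncodingException;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class EncodingNotes {

        // Byte-level escape hatch: fetch the raw bytes and decode them with
        // a charset the application trusts. Hypothetical helper; getBytes()
        // on a text column may not bypass the driver's charset handling in
        // every driver version.
        static String readAsLatin1(ResultSet rs, String column)
                throws SQLException, UnsupportedEncodingException {
            byte[] raw = rs.getBytes(column);
            return raw == null ? null : new String(raw, "ISO-8859-1");
        }

        // Why varchar(N) misbehaves in a wrongly encoded db: one character
        // can occupy several bytes in UTF-8, so a byte count is not a
        // character count.
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            String eAcute = "\u00e9"; // 'é'
            System.out.println("characters:  " + eAcute.length());                 // 1
            System.out.println("UTF-8 bytes: " + eAcute.getBytes("UTF-8").length); // 2
        }
    }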