Обсуждение: [Q] UTF-8 testing with Windows/ODBC 8.3.0400
Hello, I am trying to confirm that my program can read UTF-8 characters and cannot seem to figure out what's wrong. I get question marks I am using ODBC PSQLODBC35W.dll My C++ program relies on OTL C++ library to do DB access, and in the Visual Studio debugger I see only question marks '?' for the strings. When I convert the question marks to utf-16 (in memory) just to see if may displaying is wrong but I get '63'. I am using std::string to store the bytestream from varchar column an I think it is ok because I do not need to 'manipulate' the content. To eliminate my program from the consideration I downloaded and installed ODBC Query Tool 0.72 Beta build 23 (from sourcerorge) I get the same question marks on the columns with unicode text. I know that the database has stored it correctly, because I can see everything 'Ok' with PGAdminIII (and things work ok with my PHP PDO access). I do not have PGCLIENTENCODING=UTF8 set in the 'System', instead I just set it per command prompt, however I do not think it matters. I have set ODBC driver options to have max var char length to 75000 and the long var char to 75,000. I cannot figure out what else I might be doing wrong.... as I said, all I need for now it is just to test out that a C++ program via ODBC can get the data. thanks in advance for any help, Vlad -- V S P toreason@fastmail.fm -- http://www.fastmail.fm - Same, same, but different...
V S P wrote: > My C++ program relies on OTL C++ library to do DB access, and in the > Visual Studio debugger I see only question marks '?' for the strings. How would Visual Studio know that the std::string instances in question contain UTF-8 data? std::string is a byte string, not a character string - it could contain text in any encoding (or non-text data) and VC++ has no way of knowing how to interpret it. What it probably does is display anything within the ASCII range, and otherwise display ?s . If you expect to be able to work with those strings as real text, you probably want to use std::wstring instead, and USE APPROPRIATE ENCODING CONVERSION ROUTINES. Note that the width of wchar_t varies from platform to platform, so you'll need to convert to/from UTF-16 for a 2 byte wchar_t, or to/from UTF-32 for a 4-byte wchar_t. (I hate working with unicode and encodings in standard C++ *SO* much - argh! One of the only areas where I really wish I was using Java. If only the QString class from Qt was part of standard C++ ... ). > I am using std::string to store the bytestream from varchar column an I > think it is ok > because I do not need to 'manipulate' the content. True - but VC++ won't be able to understand what's in it, either. > I cannot figure out what else I might be doing wrong.... as I said, all > I need for now it is just to test out that a C++ program via ODBC can > get the data. Your description really isn't adequate to say. It's highly likely that you're retrieving the data from the database fine, but your tools don't know it's UTF-8 and aren't able to work with it correctly. That's mostly a guess with the amount of information you've provided, though. Perhaps you could post a small, self-contained test program and a SQL script to populate a test database? Then post the results of running the program against the database, including the hex values of the bytes returned by the ODBC interface. -- Craig Ringer
Hi thank you for the follow up. Just had a break through... I believe I was able to resolve most of the problems I finally found a post on the net that says: if you want UTF8 in your ODBC-based client program and Postgres is in UTF8 then use the ASCII driver not the Unicode So as soon as switched that things worked (I also kept: set client_encoding='UTF' on my ODBC connections at startup) I used : wchar_t wData[1000]; ::MultiByteToWideChar(CP_UTF8, 0, my_normal_std_string.c_str(), -1, wData, 1000); to convert the read data and display in the debugger Before the switch to the Ascii odbc driver, the above was just showing question marks. So reading UTF8 into ODBC programming using the Ascii driver works perfect. Since I now I understood what was going on, converted most of my strings to wstrings, and then enabled Unicode Version of the PG ODBC driver -- and that works too ! :-) so now I have a #define where I switch between wstrings and strings and of course a few other things, and then I flip the drivers in ODBC datasource and things work (I have tested selects sofar). Three things that I am not still sure about, and may be you can help: a) does Posgtres driver on unixODBC do the same as Windows (that is there is a unicode and non unicode versions of the drivers ? (I am interested in 64 bit linux and 64 bit freebsd ones) b) I noticed that when using the Unicode version (first) and Ascii version (second) the value of SWORD right before SQLVLEN is different (it is 12 on the ascii and -9 on unicode version) -- what does this mean? disp_otrq_x86d 8a4-b90 EXIT SQLDescribeColW with return code 0 (SQL_SUCCESS) HSTMT 013F1BA8 UWORD 11 WCHAR * 0x01A28974 [ 9] "cntr_data" SWORD 512 SWORD * 0x01A28BC4 (9) SWORD * 0x01A28BB8 (-9) SQLULEN * 0x01A28B94 (4096) SWORD * 0x01A28BA0 (0) SWORD * 0x01A28B88 (1) disp_otrq_x86d ab8-498 EXIT SQLDescribeColW with return code 0 (SQL_SUCCESS) HSTMT 013F1C38 UWORD 11 WCHAR * 0x01A28974 [ 9] "cntr_data" SWORD 512 SWORD * 0x01A28BC4 (9) SWORD * 0x01A28BB8 (12) SQLULEN * 0x01A28B94 (4096) SWORD * 0x01A28BA0 (0) SWORD * 0x01A28B88 (1) another question: I have about 6 tables where about 20 fields in each table, 2 fields are 65K long (they are declared as varchar(65000) is this is OK for ODBC drivers, and what if anything I should be setting on them? Thank you again for your follow up, Vlad On Wed, 18 Mar 2009 15:46 +0900, "Craig Ringer" <craig@postnewspapers.com.au> wrote: > V S P wrote: > > > My C++ program relies on OTL C++ library to do DB access, and in the > > Visual Studio debugger I see only question marks '?' for the strings. > > How would Visual Studio know that the std::string instances in question > contain UTF-8 data? std::string is a byte string, not a character string > - it could contain text in any encoding (or non-text data) and VC++ has > no way of knowing how to interpret it. > > What it probably does is display anything within the ASCII range, and > otherwise display ?s . > > If you expect to be able to work with those strings as real text, you > probably want to use std::wstring instead, and USE APPROPRIATE ENCODING > CONVERSION ROUTINES. Note that the width of wchar_t varies from platform > to platform, so you'll need to convert to/from UTF-16 for a 2 byte > wchar_t, or to/from UTF-32 for a 4-byte wchar_t. > > (I hate working with unicode and encodings in standard C++ *SO* much - > argh! One of the only areas where I really wish I was using Java. If > only the QString class from Qt was part of standard C++ ... ). > > > I am using std::string to store the bytestream from varchar column an I > > think it is ok > > because I do not need to 'manipulate' the content. > > True - but VC++ won't be able to understand what's in it, either. > > > I cannot figure out what else I might be doing wrong.... as I said, all > > I need for now it is just to test out that a C++ program via ODBC can > > get the data. > > Your description really isn't adequate to say. It's highly likely that > you're retrieving the data from the database fine, but your tools don't > know it's UTF-8 and aren't able to work with it correctly. That's mostly > a guess with the amount of information you've provided, though. > > Perhaps you could post a small, self-contained test program and a SQL > script to populate a test database? Then post the results of running the > program against the database, including the hex values of the bytes > returned by the ODBC interface. > > -- > Craig Ringer -- V S P toreason@fastmail.fm -- http://www.fastmail.fm - A no graphics, no pop-ups email service
V S P wrote: > Hi thank you for the follow up. > Just had a break through... <snip> > Three things that I am not still sure about, and may be you can help: > > a) does Posgtres driver on unixODBC do the same as Windows (that is > there is a unicode and non unicode versions of the drivers ? > (I am interested in 64 bit linux and 64 bit freebsd ones) > > b) I noticed that when using the Unicode version (first) and > Ascii version (second) the value of SWORD right before SQLVLEN is > different > (it is 12 on the ascii and -9 on unicode version) -- what does this > mean? It's the SQL data type. 12 means SQL_VARCHAR and -9 means SQL_WVARCHAR. > disp_otrq_x86d 8a4-b90 EXIT SQLDescribeColW with return code 0 > (SQL_SUCCESS) > HSTMT 013F1BA8 > UWORD 11 > WCHAR * 0x01A28974 [ 9] "cntr_data" > SWORD 512 > SWORD * 0x01A28BC4 (9) > SWORD * 0x01A28BB8 (-9) > SQLULEN * 0x01A28B94 (4096) > SWORD * 0x01A28BA0 (0) > SWORD * 0x01A28B88 (1) > > > disp_otrq_x86d ab8-498 EXIT SQLDescribeColW with return code 0 > (SQL_SUCCESS) > HSTMT 013F1C38 > UWORD 11 > WCHAR * 0x01A28974 [ 9] "cntr_data" > SWORD 512 > SWORD * 0x01A28BC4 (9) > SWORD * 0x01A28BB8 (12) > SQLULEN * 0x01A28B94 (4096) > SWORD * 0x01A28BA0 (0) > SWORD * 0x01A28B88 (1) > another question: I have about 6 tables where about 20 fields in each > table, > 2 fields are 65K long (they are declared as varchar(65000) is this is OK > for ODBC drivers, and what if anything I should be setting on them? 65K may be OK if you set the max long varchar size > 65000 but please note that most applications/middlewares have their own limitation. For example the limitation of MS Access is around 65000 AFAIR. regards, Hiroshi Inoue
thank you, I have a follow up: I if I declare a field as varchar(30) in the UTF-8 Postgres, I assume it can possibly be more than 30 bytes storage. so when I set the ODBCs (unicode version) 'max long varchar' am I setting it to max 'byte length' of the UTF-8 (as in the dabase) or to '60' (max number of bytes for 30 character string in UTF-16) or to '30' (max number of valid characters) ? And would your answer be the same for the unix Odbc version ? Thank you again for the response, Vlad On Thu, 19 Mar 2009 21:28 +0900, "Hiroshi Inoue" <inoue@tpf.co.jp> wrote: > V S P wrote: > > Hi thank you for the follow up. > > Just had a break through... > > <snip> > > > Three things that I am not still sure about, and may be you can help: > > > > a) does Posgtres driver on unixODBC do the same as Windows (that is > > there is a unicode and non unicode versions of the drivers ? > > (I am interested in 64 bit linux and 64 bit freebsd ones) > > > > b) I noticed that when using the Unicode version (first) and > > Ascii version (second) the value of SWORD right before SQLVLEN is > > different > > (it is 12 on the ascii and -9 on unicode version) -- what does this > > mean? > > It's the SQL data type. 12 means SQL_VARCHAR and -9 means SQL_WVARCHAR. > > > disp_otrq_x86d 8a4-b90 EXIT SQLDescribeColW with return code 0 > > (SQL_SUCCESS) > > HSTMT 013F1BA8 > > UWORD 11 > > WCHAR * 0x01A28974 [ 9] "cntr_data" > > SWORD 512 > > SWORD * 0x01A28BC4 (9) > > SWORD * 0x01A28BB8 (-9) > > SQLULEN * 0x01A28B94 (4096) > > SWORD * 0x01A28BA0 (0) > > SWORD * 0x01A28B88 (1) > > > > > > disp_otrq_x86d ab8-498 EXIT SQLDescribeColW with return code 0 > > (SQL_SUCCESS) > > HSTMT 013F1C38 > > UWORD 11 > > WCHAR * 0x01A28974 [ 9] "cntr_data" > > SWORD 512 > > SWORD * 0x01A28BC4 (9) > > SWORD * 0x01A28BB8 (12) > > SQLULEN * 0x01A28B94 (4096) > > SWORD * 0x01A28BA0 (0) > > SWORD * 0x01A28B88 (1) > > > another question: I have about 6 tables where about 20 fields in each > > table, > > 2 fields are 65K long (they are declared as varchar(65000) is this is OK > > for ODBC drivers, and what if anything I should be setting on them? > > 65K may be OK if you set the max long varchar size > 65000 but please > note that most applications/middlewares have their own limitation. > For example the limitation of MS Access is around 65000 AFAIR. > > regards, > Hiroshi Inoue -- V S P toreason@fastmail.fm -- http://www.fastmail.fm - A fast, anti-spam email service.
I am designing now my application and testing on Windows sofar I would like to make sure that from a 'get go' I am doing the 'right thing' as far as a) freebsd/windows/linux (32 and 64 bit support) b) unicode support my DB in encoded in UTF-8 with the help of this list and others I am now able to use OTL (the C++ library I use for ODBC) and connect to Postgres Ascii ODBC and use std::string (even for utf-8) and Unicode ODBC and use wstring for varchar all on windows/32 bit sofar. My question is: what is the 'accepted' way of accessing UTF-8 encoded db on Linux and Freebsd is it via ODBC unicode (if such thing exists) or ASCII unicode and if it is via 'Unicode' what do I use as a C++ type to store string results? wstring or something else? (because wstring is 4 bytes on unix and 2 on windows) Is pg ODBC Unicode driver on Linux/FreeBSD converting utf-8 into two byte or 4 byte character strings? (and what do most people do in this situation) thank you in advance, Vlad -- V S P toreason@fastmail.fm -- http://www.fastmail.fm - mmm... Fastmail...