Обсуждение: XPATH vs. server_encoding != UTF-8
Hi The current thread about JSON and the ensuing discussion about the XML types' behaviour in non-UTF8 databases made me try out how well XPATH() copes with that situation. The code, at least, looks suspicious - XPATH neither verifies that the server encoding is UTF-8, not does it pass the server encoding on to libxml's xpath functions. So I created a database with encoding ISO-8859-1 (LATIN1), and did (which aclient encoding matching my terminal's settings) CREATE TABLE X (d XML); INSERT INTO X VALUES ('<r a="ä"/>'); i.e, I inserted the XML document <r a="ä"/>, but without using an entity reference for the german Umlaut-A. Then I attempted to extract the length of r's attribute "a" with the XPATH /r/@a, both with the XPath function string-length (which works now! yay!) and with postgres' LENGTH() function. SELECT (XPATH('string-length(/r/@a)', d))[1] AS xpath_length, LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length FROM X; The XPATH() function itself doesn't complain, but libxml does - it expects UTF-8 encoded data, and screams bloody murder when it encounters the ISO-8859-1-encoded Umlaut-A ERROR: could not parse XML document DETAIL: line 1: Input is not proper UTF-8, indicate encoding ! Bytes: 0xE4 0x22 0x2F 0x3E <r a="ä"/> That might seem fine on the surface - we did, after all, error out instead of producing potentially non-sensical results. However, libxml's ability to detect this error relies on it's ability to distinguish between UTF-8 and non-UTF-8 encoded strings. Which, of course, doesn't work in the general case. So for my next try, I deliberately set client_encoding to ISO-8859-1, even though my terminal uses UTF-8, removed all data from table X, and did INSERT INTO X VALUES ('<r a="ä"/>'); again. The effect is that is that X now contains ISO-8859-1 encoded data which *happens* to look like valid UTF-8. After changing the client_encoding back to UTF-8, the value we just inserted looks like that <r a="ä"/> Now I invoked the XPATH query from above again. SELECT (XPATH('string-length(/r/@a)', d))[1] AS xpath_length, LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length FROM X; As predicted, it doesn't raise an error this time, since libxml is unable to distinguish the ISO-8859-1 string '<r a="ä"/' from valid UTF-8. But the result is still wrongs, since the string-length() function counts 'ä' as just one character, when it reality it are of course contains two. xpath_length | pg_length --------------+----------- 1 | 2 The easiest way to fix this would be to make XPATH() flat-out refuse to do anything if the server encoding isn't UTF-8. But that seems a bit harsh - things actually do work correctly as long as the XML document contains only ASCII characters, and existing applications might depend on that. So what I think we should do is tell libxml that the encoding is ASCII if the server encoding isn't UTF-8. With that change, the query above produces ERROR: could not parse XML document DETAIL: encoder error which seems sane. Replacing the data in X with ASCII-only data makes the error go away, and the result is then correct also. DELETE FROM X; INSERT INTO X VALUES ('<r a="a"/>'); SELECT (XPATH('string-length(/r/@a)', d))[1] AS xpath_length, LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length FROM X; gives xpath_length | pg_length --------------+----------- 1 | 1 Proof-of-concept patch attached, but doesn't yet include documentation updates. Comments? Thoughts? Suggestions? best regards, Florian Pflug
Вложения
[Resent with pgsql-hackers re-added to the recipient list. I presume you didn't remove it on purpose] On Jul23, 2011, at 18:11 , Joey Adams wrote: > On Sat, Jul 23, 2011 at 11:49 AM, Florian Pflug <fgp@phlo.org> wrote: >> So what I think we should do is tell libxml that the encoding is ASCII >> if the server encoding isn't UTF-8. With that change, the query above >> produces > > I haven't had time to digest this situation, but there is a function > called pg_encoding_to_char for getting a string representation of the > encoding. However, it might not produce a string that libxml > understands in all cases. > > Would it be better to tell libxml the server encoding, whatever it may be? Ultimately, yes. However, I figured if it was as easy as translating our encoding names to those of libxml, the current code would probably do that instead of converting the XML to UTF-8 before validating it. (Validation and XPATH processing use a different code path there!) I'm also not aware of any actual complaints about XPATH's restriction to UTF-8, and it's not a case that I personally care for, so I'm a bit hesitant to put in the time and energy required to extend it to other encodings. But once I had stumbled over this, I didn't want to ignore it all together, so looked for simple way to make the current behaviour more bullet-proof. The patch accomplishes that, I think, and without any major change in behaviour. You only observe the difference if you indeed have non-UTF-8 XMLs which look like valid UTF-8. > In the JSON encoding discussion, the last idea (the one I was planning > to go with) was to allow non-ASCII characters in any server encoding > (like ä in ISO-8859-1), but not allow non-ASCII escapes (like \u00E4) > unless the server encoding is UTF-8. Yeah, that's how I understood your proposal, and it seems sensible. > I think your patch would more > closely match the opposite: allow any escapes, but only allow ASCII > text if the server encoding is not UTF-8. Yeah, but only for XPATH(). XML input validation uses a different code path, and seems to convert the XML to UTF-8 before verifying it's well-formedness with libxml (as you already discovered previously). The difference between JSON and XML here is that the XML types has to live with libxml's idiosyncrasies and restrictions. If we could make libxml use our encoding and text handling infrastructure, the UTF-8 restrictions would probably not exist. But as it stands, libxml has it's own machinery for dealing with encodings... I wonder, BTW, what happens if you attempt to store an XML containing a character not representable in UNICODE. If the conversion to UTF-8 simply replaces it with a placeholder, we'd be fine, since just a replacement cannot affect the well-formedness of an XML. If OTOH it raised an error, that'd be a bit unfortunate... best regards, Florian Pflug
On lör, 2011-07-23 at 17:49 +0200, Florian Pflug wrote: > The current thread about JSON and the ensuing discussion about the > XML types' behaviour in non-UTF8 databases made me try out how well > XPATH() copes with that situation. The code, at least, looks > suspicious - XPATH neither verifies that the server encoding is UTF-8, > not does it pass the server encoding on to libxml's xpath functions. This issue is on the Todo list, and there are some archive links there.
On Jul23, 2011, at 22:49 , Peter Eisentraut wrote: > On lör, 2011-07-23 at 17:49 +0200, Florian Pflug wrote: >> The current thread about JSON and the ensuing discussion about the >> XML types' behaviour in non-UTF8 databases made me try out how well >> XPATH() copes with that situation. The code, at least, looks >> suspicious - XPATH neither verifies that the server encoding is UTF-8, >> not does it pass the server encoding on to libxml's xpath functions. > > This issue is on the Todo list, and there are some archive links there. Thanks for the pointer, but I think the discussion there doesn't really apply here. First, I didn't suggest (or implement) full support for XPATH() together with server encodings other than UTF-8. My suggested patch simply closes a hole in the implementation of the current behaviour. Instead of relying on libxml to be able to detect that the encoding isn't UTF-8, it relies on it only to detect that the encoding isn't ASCII. Since supported server encodings are supersets of ASCII, the latter is trivial. xml.c also seems to have changed quite a bite since this was last discussed. Tom Lane argued against the proposed patch on the grounds that there are many more places in xml.c which pass strings to libxml without charset conversion. However, looking at it now, it seems that all XML validation goes through xml_parse(), which actually converts the XML to UTF-8. Only XPATH contains a separate code path, and chooses to ignore encoding issues all together. best regards, Florian Pflug
On Jul24, 2011, at 01:25 , Florian Pflug wrote: > On Jul23, 2011, at 22:49 , Peter Eisentraut wrote: > >> On lör, 2011-07-23 at 17:49 +0200, Florian Pflug wrote: >>> The current thread about JSON and the ensuing discussion about the >>> XML types' behaviour in non-UTF8 databases made me try out how well >>> XPATH() copes with that situation. The code, at least, looks >>> suspicious - XPATH neither verifies that the server encoding is UTF-8, >>> not does it pass the server encoding on to libxml's xpath functions. >> >> This issue is on the Todo list, and there are some archive links there. > > Thanks for the pointer, but I think the discussion there doesn't > really apply here. Upon further reflection, I came to realize that it in fact does apply. All the non-XPath related XML *parsing* seems to go through xml_parse(), but we also use libxml to write XML, making XMLELEMENT() and friends equally susceptible to all kinds of encoding trouble. For the fun of it, try the following in a ISO-8859-1 database (which client_encoding correctly set up, so the umlaut-a reaches the backend unharmed) select xmlelement(name "r", xmlattributes('ä' as a)); you get xmlelement -------------------<r a="䀀"/> Well, actually, you only get that about 9 times out of 10. Sometimes you instead get xmlelement ---------------------------<r a="䀁\x01\x01"/> It seems the libxml reads past the terminating zero byte if it's preceeded by an invalid UTF-8 byte sequence (like 0xe4 0x00 in the example above). Ouch! Also, passing encoding ASCII to libxml's parser doesn't prevent it from expanding entity references referring to characters outside the ASCII range. So even with my patch applied you can make XPATH() return wrong results. For example (0xe4 is the unicode codepoint representing umlaut-a) select xpath('/r/@a', '<r a="ä"/>'::xml); gives (*with* my patch applied) xpath -------{ä} So scratch the whole idea. There doesn't seem to be a simple way to make the XML type work sanely in a non-UTF-8 setting :-(. Apart from simple input and output that is, which already seems to work correctly regardless of the server encoding. BTW, for the sake of getting this into the archives just in case someone decides to fix this and stumbles over this thread: It seems to me that the easiest way to fix XML generation in the non-UTF-8 case would be to cease using libxml for emitting XML at all. The only non-trivial use of libxml there is the escaping of attribute values, and we do already have our own escape_xml() function - it just needs to be taught the additional escapes needed for attribute values. (libxml is also used to convert binary values to base64 or hexadecimal notation, but there're no encoding issues there) best regards, Florian Pflug