Обсуждение: XMLATTRIBUTES vs. values of type XML
Hi While reviewing the (now applied) XPATH escaping patches, Radoslaw found one case where the previous failure of XPATH to escape its return value was offset by XMLATTRIBUTES insistence to escape all input values, even if they're already of type XML. To wit, if you do SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES('&'::XML AS "a")) you get xmlelement --------------------<t a="&"/> which is somewhat surprising. Especially since SELECT XMLELEMENT(NAME "t", '&'::XML); gives xmlelement --------------<t>&</t> as one would except. Now, it seems rather unlikely that a real application would actually contain a query similar to the former one - you usually don 't store XML in the attributes of XML nodes. But since XPATH() returns XML, it's not unlikely that an application would do SELECT XMLELEMENT(NAME ..., XMLATTRIBUTES((XPATH(...))[1] AS ...)) As it stands, on 9.2 the values returned by the XPath expression will be escaped twice - once by XPATH and a second time by XMLATTRIBUTES, while on 9.1 the value will be escaped only once. Since this seems to be the most likely situation in which XMLATTRIBUTES would receive an input argument of type XML, and since it's not entirely logical for XMLATTRIBUTES to unconditionally escapes values already of type XML, I propose to change the behaviour of XMLATTRIBUTES as follows. Values not of type XML are be treated as they always have been, i.e. all special characters (<,>,&,",') are replaced by an entity reference (<, >, &, ", '). Values of type XML are assumed to be already escaped, so '&' isn't treated as a special character to avoid double escaping. The other special characters (<,>,",') are escaped as usual. The safety of this relies on the fact that '&' may never appear in a well-formed XML document, except as part of an entity reference. This seems to be the case, except if CDATA sections are involved, which may contain plain ampersands. So the actual escaping rule would have to be a bit more complex than I made it sound above - we'd need to detect CDATA sections, and re-enabled the escaping of '&' until the end of such a section. To actually implement this, I'd remove the use of libxml from the implementation of XMLELEMENT, and instead use our already existing escape_xml() function, enhanced with the ability to handle the partial escaping algorithm outlined above. We already only use libxml to escape attribute values, so doing that isn't a radical departure from xml.c's ways. As an added benefit, all the encoding-related problems of XMLELEMENT and XMLATTRIBUTES would go away once libxml is removed from this code path. So XPATH() would be the only remaining function which breaks in non-UTF-8 databases. Comments? Thoughts? Suggestions? best regards, Florian Pflug
On tis, 2011-07-26 at 22:44 +0200, Florian Pflug wrote: > While reviewing the (now applied) XPATH escaping patches, Radoslaw > found one > case where the previous failure of XPATH to escape its return value > was offset > by XMLATTRIBUTES insistence to escape all input values, even if > they're > already of type XML. > > To wit, if you do > > SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES('&'::XML AS "a")) > > you get > > xmlelement > -------------------- > <t a="&"/> Per SQL standard, the attribute values may not be of type XML, so maybe we should just prohibit it.
On Jul27, 2011, at 16:18 , Peter Eisentraut wrote: > On tis, 2011-07-26 at 22:44 +0200, Florian Pflug wrote: >> While reviewing the (now applied) XPATH escaping patches, Radoslaw >> found one >> case where the previous failure of XPATH to escape its return value >> was offset >> by XMLATTRIBUTES insistence to escape all input values, even if >> they're >> already of type XML. >> >> To wit, if you do >> >> SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES('&'::XML AS "a")) >> >> you get >> >> xmlelement >> -------------------- >> <t a="&"/> > > Per SQL standard, the attribute values may not be of type XML, so maybe > we should just prohibit it. We probably should have, but I think it's too late for that. I don't believe I'm the only one who uses XPATH results as attribute values, and we'd severely break that use-case. You might say the same thing about my proposal, of course, but I believe the risk is much smaller there. Applications would only break if they (a) Pass XML from a source other than a XPath expressionselecting a text or attribute and (b) actually want double-escaping to occur. As a data point, I've written an application with makes heavy use of our XML infrastructure over the last few months (as you might have guessed from the stream of patches ;-)). That application would be pretty much untroubled by the changes to XMLATTRIBUTES I proposed, but would be severely broken if we rejected values of type XML all together. best regards, Florian Pflug
On ons, 2011-07-27 at 19:37 +0200, Florian Pflug wrote: > > Per SQL standard, the attribute values may not be of type XML, so > maybe > > we should just prohibit it. > > We probably should have, but I think it's too late for that. I don't > believe I'm the only one who uses XPATH results as attribute values, > and we'd severely break that use-case. > > You might say the same thing about my proposal, of course, but I > believe > the risk is much smaller there. Applications would only break if they > (a) Pass XML from a source other than a XPath expression selecting > a text or attribute and > (b) actually want double-escaping to occur. Well, offhand I would expect that passing an XML value to XMLATTRIBUTES would behave as in SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&'::XML AS text) AS "a")) which is what it indeed does in 9.1. So if we don't want to restrict this, for backward compatibility, then I would suggest that we fix it to work like it used to. I would be very hesitant about adding another escape mechanism that escapes some things but not others. We already have two or three of those for XML, and it doesn't seem worth adding another one just for this, which is outside the standard and for which a valid workaround exists.
On Jul27, 2011, at 23:08 , Peter Eisentraut wrote: > Well, offhand I would expect that passing an XML value to XMLATTRIBUTES > would behave as in > > SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&'::XML AS text) AS "a")) With both 9.1 and 9.2 this query returns xmlelement --------------------<t a="&"/> i.e. makes the value of "a" represent the *literal* string '&', *not* the literal string '&'. Just to be sure there's no miss-understanding here - is this what you expect? > which is what it indeed does in 9.1. > > So if we don't want to restrict this, for backward compatibility, then I > would suggest that we fix it to work like it used to. There's currently no difference in behaviour between 9.1 and 9.2 there. We've only modified XPATH to always correctly escape it's result in 9.2, so there's only a difference if you pass the result of XPATH() to XMLATTRIBUTES. Which I figured to be the most likely reason for to pass values of type XML to XMLATTRIBUTES, but maybe you disagree there. > I would be very hesitant about adding another escape mechanism that > escapes some things but not others. We already have two or three of > those for XML, and it doesn't seem worth adding another one just for > this, which is outside the standard and for which a valid workaround > exists. What's the workaround if you have a value of type XML, e.g. '&', and want to set an attribute to the value represented by that XML fragment (i.e. '&')? Since we have no XMLUNESCAPE function, I don't see an easy way to do that. Maybe I'm missing something, though. best regards, Florian Pflug
On ons, 2011-07-27 at 23:21 +0200, Florian Pflug wrote: > On Jul27, 2011, at 23:08 , Peter Eisentraut wrote: > > Well, offhand I would expect that passing an XML value to XMLATTRIBUTES > > would behave as in > > > > SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&'::XML AS text) AS "a")) > > With both 9.1 and 9.2 this query returns > > xmlelement > -------------------- > <t a="&"/> > > i.e. makes the value of "a" represent the *literal* string '&', *not* > the literal string '&'. Just to be sure there's no miss-understanding here > - is this what you expect? Well, I expect it to fail. > What's the workaround if you have a value of type XML, e.g. '&', > and want to set an attribute to the value represented by that XML fragment > (i.e. '&')? Since we have no XMLUNESCAPE function, I don't see an easy > way to do that. Maybe I'm missing something, though. It may be worth researching whether the XMLSERIALIZE function is actually doing what it should, or whether it could/should do some unescaping. Unfortunately, in the latest SQL/XML standard the final answer it nested deep in the three other standards, so I don't have an answer right now. But there are plenty of standards in this area, so I'd hope that one of them can give us the right behavior, instead of us making something up.
On Jul28, 2011, at 22:51 , Peter Eisentraut wrote: > On ons, 2011-07-27 at 23:21 +0200, Florian Pflug wrote: >> On Jul27, 2011, at 23:08 , Peter Eisentraut wrote: >>> Well, offhand I would expect that passing an XML value to XMLATTRIBUTES >>> would behave as in >>> >>> SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&'::XML AS text) AS "a")) >> >> With both 9.1 and 9.2 this query returns >> >> xmlelement >> -------------------- >> <t a="&"/> >> >> i.e. makes the value of "a" represent the *literal* string '&', *not* >> the literal string '&'. Just to be sure there's no miss-understanding here >> - is this what you expect? > > Well, I expect it to fail. Now you've lost me. What exactly should fail under what circumstances? >> What's the workaround if you have a value of type XML, e.g. '&', >> and want to set an attribute to the value represented by that XML fragment >> (i.e. '&')? Since we have no XMLUNESCAPE function, I don't see an easy >> way to do that. Maybe I'm missing something, though. > > It may be worth researching whether the XMLSERIALIZE function is > actually doing what it should, or whether it could/should do some > unescaping. I don't see how that could work. It can't unescape anything in e.g. '<t>&</t>', because the result would be a quite useless not-really-XML kind of thing. It could unescape '&' but that kind of content-dependent behaviour seem even worse than my proposed escaping rules for XMLATTRIBUTES. > Unfortunately, in the latest SQL/XML standard the final > answer it nested deep in the three other standards, so I don't have an > answer right now. But there are plenty of standards in this area, so > I'd hope that one of them can give us the right behavior, instead of us > making something up. Which standards to you have in mind there? If you can point me to a place where I can obtain them, I could check if there's something in them which helps. best regards, Florian Pflug
On fre, 2011-07-29 at 11:37 +0200, Florian Pflug wrote: > On Jul28, 2011, at 22:51 , Peter Eisentraut wrote: > > On ons, 2011-07-27 at 23:21 +0200, Florian Pflug wrote: > >> On Jul27, 2011, at 23:08 , Peter Eisentraut wrote: > >>> Well, offhand I would expect that passing an XML value to XMLATTRIBUTES > >>> would behave as in > >>> > >>> SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&'::XML AS text) AS "a")) > >> > >> With both 9.1 and 9.2 this query returns > >> > >> xmlelement > >> -------------------- > >> <t a="&"/> > >> > >> i.e. makes the value of "a" represent the *literal* string '&', *not* > >> the literal string '&'. Just to be sure there's no miss-understanding here > >> - is this what you expect? > > > > Well, I expect it to fail. > > Now you've lost me. What exactly should fail under what circumstances? To me, the best solution still appears to be forbidding passing values of type xml to XMLATTRIBUTES, unless we find an obviously better solution that is not, "I came up with this custom escape function that I tweaked so that it appears to make sense". > > Unfortunately, in the latest SQL/XML standard the final > > answer it nested deep in the three other standards, so I don't have an > > answer right now. But there are plenty of standards in this area, so > > I'd hope that one of them can give us the right behavior, instead of us > > making something up. > > Which standards to you have in mind there? If you can point me to a place > where I can obtain them, I could check if there's something in them > which helps. In SQL/XML 2008, the actual behavior of XMLSERIALIZE is delegated to "XSLT 2.0 and XQuery 1.0 Serialization". I'm not familiar with this latter standard, but it appears to have lots of options and parameters, one of which might help us.
On Aug11, 2011, at 09:16 , Peter Eisentraut wrote: > On fre, 2011-07-29 at 11:37 +0200, Florian Pflug wrote: >> On Jul28, 2011, at 22:51 , Peter Eisentraut wrote: >>> On ons, 2011-07-27 at 23:21 +0200, Florian Pflug wrote: >>>> On Jul27, 2011, at 23:08 , Peter Eisentraut wrote: >>>>> Well, offhand I would expect that passing an XML value to XMLATTRIBUTES >>>>> would behave as in >>>>> >>>>> SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&'::XML AS text) AS "a")) >>>> >>>> With both 9.1 and 9.2 this query returns >>>> >>>> xmlelement >>>> -------------------- >>>> <t a="&"/> >>>> >>>> i.e. makes the value of "a" represent the *literal* string '&', *not* >>>> the literal string '&'. Just to be sure there's no miss-understanding here >>>> - is this what you expect? >>> >>> Well, I expect it to fail. >> >> Now you've lost me. What exactly should fail under what circumstances? > > To me, the best solution still appears to be forbidding passing values > of type xml to XMLATTRIBUTES, unless we find an obviously better > solution that is not, "I came up with this custom escape function that I > tweaked so that it appears to make sense". Hm, OK, I see your point. However, if we simply raise an error in 9.2, and do nothing else, that we make it impossible to use the result of an XPath expression as an XML attribute value. Not just inconvenient, but impossible, so I don't think we can do that. We'd thus need to add a function XMLUNESCAPE(XML) RETURNS TEXT to restore that functionality. Defining a sane behaviour for such a function, however, seems no easier than defining sane behaviour for an XML attribute of already of type XML. The core of the problems remains to define the result of XMLUNESCAPE('<tag>content</tag>'), just as the core of the XMLATTRIBUTES problems is to define XMLELEMENT(... XMLATTRIBUTES('<tag>content</tag>' as a)). Thinking about this further, it seems that we essentially have two distinct classes of XML values. Some are essentially plain text, but might contains entity references, while others are "real" XML fragments which contain at least one tag. That suggests that a sensible behaviour for XMLUNESCAPE might be to return a string with the entity references resolved in the former case, and simply return an error in the latter. To summarize, we'd have XMLUNESCAPE(''::XML) -> 'a' XMLUNESCAPE('a'::XML) -> 'a' XMLUNESCAPE('<'::XML) -> '<' XMLUNESCAPE('<t/>'::XML) -> error To not break applications needlessly, I'd then be inclined to make XMLATTRIBUTES(xml_value as "a") mean XMLATTRIBUTES(XMLUNESCAPE(xml_value) as "a") i.e. throw an error if xml_value contains anything but plain text and entity references. But I could probably also live with not doing that. >>> Unfortunately, in the latest SQL/XML standard the final >>> answer it nested deep in the three other standards, so I don't have an >>> answer right now. But there are plenty of standards in this area, so >>> I'd hope that one of them can give us the right behavior, instead of us >>> making something up. >> >> Which standards to you have in mind there? If you can point me to a place >> where I can obtain them, I could check if there's something in them >> which helps. > > In SQL/XML 2008, the actual behavior of XMLSERIALIZE is delegated to > "XSLT 2.0 and XQuery 1.0 Serialization". I'm not familiar with this > latter standard, but it appears to have lots of options and parameters, > one of which might help us. I'll try to obtain a copy of that. Thanks. best regards, Florian Pflug