Обсуждение: Collations versus user-defined functions
I've thought of another area that AFAICT the current patch fails to address at all: what should happen in user-defined functions? Consider create function my_lt(x text, y text) returns bool as$$ begin return x < y; end$$ language plpgsql; select my_lt('foo', 'bar' collate "de_DE");select my_lt('foo', 'bar' collate "fr_FR"); I think it's at least arguably desirable that the results of the two calls respond to the collation clauses, but it does not look to me like that will happen: plpgsql isn't doing anything to propagate its call-site collation value into expressions it evaluates, and if it did, it'd still get the wrong answer on the second call because it would have cached an expression plan tree containing the collation info from the first call. In SQL-language functions the situation is even worse, because they will behave differently depending on whether or not they get inlined. (I think ... haven't really tested that case.) What do we want to do about this? Making it work the way it seems like it ought to will require a rather substantial investment of effort. It looks to me like the least invasive answer would be to have plpgsql cache different plan trees depending on the collation it receives for its parameters, but that's still a whole lot of work. Does the SQL standard have anything to say on the matter, or is there a precedent in the behavior of TSQL or other DBMSes? regards, tom lane
On Sat, Mar 12, 2011 at 12:17:11PM -0500, Tom Lane wrote: > I've thought of another area that AFAICT the current patch fails to > address at all: what should happen in user-defined functions? <snip> The POLA suggests that the collation derivation of the original query should not be affected by the implementation of a function. In the case of SQL functions this means that the expansion of the function must not change the results. This would mean introducing a CollateNode above the SQL function after expansion, though you may be able to acheive this by doing the collation derivation prior to expansion of the SQL function, but I don't know if that's feasable. (Note the introduced collate node would need to remember the collation state.) Similarly, inside the function the parameters should be considered to be IMPLICIT collation, to avoid strange errors depending on how its called. This means you can't make a set_collation function, but that doesn't seem like a loss to me. > select my_lt('foo', 'bar' collate "de_DE"); > select my_lt('foo', 'bar' collate "fr_FR"); > > I think it's at least arguably desirable that the results of the two > calls respond to the collation clauses, but it does not look to me > like that will happen: plpgsql isn't doing anything to propagate > its call-site collation value into expressions it evaluates, and > if it did, it'd still get the wrong answer on the second call because it > would have cached an expression plan tree containing the collation info > from the first call. I think you need to consider the collation to be a variation of the type. plpgsql makes new plans for each type when dealing with any parameters, this should fit right in. SQL would need a recollate-label node like suggested above. For other languages you just need to provide the info, what they do with it is not your problem. > What do we want to do about this? Making it work the way it seems like > it ought to will require a rather substantial investment of effort. > It looks to me like the least invasive answer would be to have plpgsql > cache different plan trees depending on the collation it receives for > its parameters, but that's still a whole lot of work. Does the SQL > standard have anything to say on the matter, or is there a precedent in > the behavior of TSQL or other DBMSes? I can't help you with other DBs, google isn't finding me anything. But the plpgsql problem should be done already right, given it already handles cached plans for different types. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
On Sat, Mar 12, 2011 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > create function my_lt(x text, y text) returns bool as > $ > begin > return x < y; > end > $ language plpgsql; > > select my_lt('foo', 'bar' collate "de_DE"); > select my_lt('foo', 'bar' collate "fr_FR"); > > I think it's at least arguably desirable that the results of the two > calls respond to the collation clauses, but it does not look to me > like that will happen: plpgsql isn't doing anything to propagate > its call-site collation value into expressions it evaluates, and > if it did, it'd still get the wrong answer on the second call because it > would have cached an expression plan tree containing the collation info > from the first call. > I don't think it's obvious that this is the right behaviour. I think functions should provide the same answer on the same inputs regardless of context unless they're really intended to be volatile. The default collation specified there is not part of the value being passed. If you want to affect the way a plpgsql function orders things in its code you should pass an extra argument for collation and then the plpgsql function should use COLLATE colarg -- though I'm not sure if that works, can you put parameters in COLLATE arguments? I do hope user defined functions return values are marked with implicit/explicit collations based on their arguments though. -- greg
Greg Stark <gsstark@mit.edu> writes: > On Sat, Mar 12, 2011 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> create function my_lt(x text, y text) returns bool as >> $ >> begin >> return x < y; >> end >> $ language plpgsql; >> >> select my_lt('foo', 'bar' collate "de_DE"); >> select my_lt('foo', 'bar' collate "fr_FR"); >> >> I think it's at least arguably desirable that the results of the two >> calls respond to the collation clauses, but it does not look to me >> like that will happen: plpgsql isn't doing anything to propagate >> its call-site collation value into expressions it evaluates, and >> if it did, it'd still get the wrong answer on the second call because it >> would have cached an expression plan tree containing the collation info >> from the first call. > I don't think it's obvious that this is the right behaviour. I'm not sure of that either, but ... > I think > functions should provide the same answer on the same inputs regardless > of context unless they're really intended to be volatile. ... that argument convinces me not at all, because they are *not* the same inputs. The collate clauses are different. If I believed your argument, then the built-in "<" function shouldn't respond to COLLATE either. > If you want to affect the way a plpgsql function orders things in its > code you should pass an extra argument for collation and then the > plpgsql function should use COLLATE colarg -- though I'm not sure if > that works, can you put parameters in COLLATE arguments? No, you cannot, the SQL committee has blown it on that. COLLATE's argument is an identifier not a variable. There is no way to do runtime selection of collation like that. regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes: > On Sat, Mar 12, 2011 at 12:17:11PM -0500, Tom Lane wrote: >> I've thought of another area that AFAICT the current patch fails to >> address at all: what should happen in user-defined functions? > The POLA suggests that the collation derivation of the original query > should not be affected by the implementation of a function. In the case > of SQL functions this means that the expansion of the function must > not change the results. This would mean introducing a CollateNode above > the SQL function after expansion, though you may be able to acheive > this by doing the collation derivation prior to expansion of the SQL > function, but I don't know if that's feasable. CollateExpr as presently defined wouldn't get the job done, but I think it's not a problem because those nodes aren't actually needed at runtime --- collation assignment for operators/functions above the inlined function should have been done when they were parsed, so it won't change as a result of expanding an inlined function. > Similarly, inside the function the parameters should be considered to > be IMPLICIT collation, to avoid strange errors depending on how its > called. Not convinced by this. If we say that that's how it works, then no user-defined function should react to COLLATE in its arguments at all, which seems pretty weird and restrictive --- especially if the COLLATE property is expected to propagate up through the function call so far as the calling expression is concerned. It seems just bizarre to me to say that a function's internal operations don't react to an input collation spec but then its result is thought to still be affected by that. This would actually seem more sensible if we went with something even simpler than the current patch's behavior, namely that COLLATE only affects the operator it is an *immediate* input of, and nothing propagates upward in expressions ever. I remain unconvinced that the SQL spec is calling for propagation ... > I think you need to consider the collation to be a variation of the > type. plpgsql makes new plans for each type when dealing with any > parameters, this should fit right in. Yeah, the same occurred to me a little bit later --- we can actually make that work fairly easily by treating collatable input datatypes as if they were polymorphic. But the question is whether we should. You seem to be arguing above that user-defined functions ought not pay attention to COLLATE specs on their inputs. regards, tom lane
On lör, 2011-03-12 at 12:17 -0500, Tom Lane wrote: > Does the SQL standard have anything to say on the matter, or is there > a precedent in the behavior of TSQL or other DBMSes? I had investigated this issue but the SQL standard doesn't say anything about it. The SQL inlining issue is tricky. Other languages including PL/pgSQL are not supported at the moment.
On Sat, Mar 12, 2011 at 02:46:19PM -0500, Tom Lane wrote: > > Similarly, inside the function the parameters should be considered to > > be IMPLICIT collation, to avoid strange errors depending on how its > > called. > > Not convinced by this. If we say that that's how it works, then no > user-defined function should react to COLLATE in its arguments at all, > which seems pretty weird and restrictive --- especially if the COLLATE > property is expected to propagate up through the function call so > far as the calling expression is concerned. It seems just bizarre to > me to say that a function's internal operations don't react to an > input collation spec but then its result is thought to still be affected > by that. I think I didn't explain myself well. The *state* should be implicit, the actual collation should be whatever the query says. What I was thinking of is the following: CREATE FUNCTION my_english_lt(text, text) RETURNS boolean AS $$ return $1 < $2 COLLATE "en_US" $$; (not sure about the syntax but you get the idea). If you just propegate naively you would get: my_english_lt(x COLLATE "de_DE", y) -> error, conflicting collation my_english_ly(x, y COLLATE "de_DE") -> would work fine Hence my suggestion that on input to the function the parameters would be considered collation "de_DE" state IMPLICIT, so the collation in the function overrides, but if the COLLATE in the function is removed, the implicit collation takes hold. > This would actually seem more sensible if we went with something even > simpler than the current patch's behavior, namely that COLLATE only > affects the operator it is an *immediate* input of, and nothing > propagates upward in expressions ever. I remain unconvinced that the > SQL spec is calling for propagation ... Well, it doesn't say in the general case, but there is under 6.29 <string value function> Syntax rule 4b 4) If <character substring function> CSF is specified, then let DTCVE be the declared type of the <character value expression> immediately contained in CSF. The maximum length, character set, and collation of the declared type DTCSF of CSF are determined as follows: b) The character set and collation of the <character substring function> are those of DTCVE. A similar wording is for the trim function. While obviously it doesn't cover all user defined functions, it seem obviously that once you do propegation for a few builtins you may as well do it for all of them. For the concatination operator is has something similar, though written in a way only a spec committe could come up with. Frankly, without propegation the feature seems entirely useless. Almost all collations are going to be defined by implicit collations attached to columns. If ORDER BY x and ORDER BY x || 'foo' Don't use the same collation then that is a first grade violation of the POLA. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
Martijn van Oosterhout <kleptog@svana.org> writes: > I think I didn't explain myself well. The *state* should be implicit, > the actual collation should be whatever the query says. What I was > thinking of is the following: > CREATE FUNCTION my_english_lt(text, text) RETURNS boolean AS $$ > return $1 < $2 COLLATE "en_US" > $$; > (not sure about the syntax but you get the idea). > If you just propegate naively you would get: > my_english_lt(x COLLATE "de_DE", y) -> error, conflicting collation > my_english_ly(x, y COLLATE "de_DE") -> would work fine > Hence my suggestion that on input to the function the parameters would > be considered collation "de_DE" state IMPLICIT, so the collation in the > function overrides, but if the COLLATE in the function is removed, the > implicit collation takes hold. Oh, I see. Yeah, that should work correctly, because parsing inside the function will be done with Param symbols that act pretty much like Vars --- whatever collation they have is considered implicit. It's important here that we do inlining by splicing completed parsetrees together --- we *don't* do some sort of insert-the-parameters-and-reparse-from-scratch approach. So the collation labelings made inside the function won't change as a result of inlining. regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes: > On Sat, Mar 12, 2011 at 02:46:19PM -0500, Tom Lane wrote: >> This would actually seem more sensible if we went with something even >> simpler than the current patch's behavior, namely that COLLATE only >> affects the operator it is an *immediate* input of, and nothing >> propagates upward in expressions ever. I remain unconvinced that the >> SQL spec is calling for propagation ... > Well, it doesn't say in the general case, but there is under 6.29 > <string value function> Syntax rule 4b > 4) If <character substring function> CSF is specified, then let DTCVE > be the declared type of the <character value expression> immediately > contained in CSF. The maximum length, character set, and collation of > the declared type DTCSF of CSF are determined as follows: > b) The character set and collation of the <character substring > function> are those of DTCVE. > A similar wording is for the trim function. While obviously it doesn't > cover all user defined functions, it seem obviously that once you do > propegation for a few builtins you may as well do it for all of them. > For the concatination operator is has something similar, though written > in a way only a spec committe could come up with. > Frankly, without propegation the feature seems entirely useless. I remain unconvinced, because there are too many corner cases. Should collation propagate up out of a subselect? How about a CTE? You're starting to get into some pretty weird action-at-a-distance situations if so, analogous to the function-input-arguments case that you were just saying should NOT propagate collation. And I still don't see anything in the text of the spec to justify it. My feeling is that the feature would be simple, explainable, and useful if COLLATE only affected the immediately syntactically-containing operator. The rest of this stuff requires a huge amount of mechanism whose behavior will be nothing but surprising, even though it's inflexible as can be (cf Greg's point about not being able to select collation at runtime). I'm not going to say it's the worst piece of language design that's ever come out of the SQL committee, but I'm starting to feel like it's in the top ten. regards, tom lane
On Sat, Mar 12, 2011 at 06:06:33PM -0500, Tom Lane wrote: > I remain unconvinced, because there are too many corner cases. Should > collation propagate up out of a subselect? How about a CTE? You're > starting to get into some pretty weird action-at-a-distance situations > if so, analogous to the function-input-arguments case that you were just > saying should NOT propagate collation. And I still don't see anything > in the text of the spec to justify it. I said don't propegate the collation *state*, the collation should be propegated. We propegate type information out of subqueries, we propegate fieldnames, why not collation information? Once you consider the collation a property of the type it becomes pretty obvious. I'll agree the function-input-arguments is a bit odd, but the issue is not the collation at all, but the collate *state*, which is something quite different. But that's primarily (I think) because the SQL standard doesn't have user defined functions (we'll there's PSM but it doesn't consider the issue AFAICS). If you feel that it shouldn't propegate into functions at all, it's a soluton but I bet we'll get bug reports about it, because its totally non-obvious. We get still complaints about not propegating typmod. > My feeling is that the feature would be simple, explainable, and useful > if COLLATE only affected the immediately syntactically-containing > operator. The rest of this stuff requires a huge amount of mechanism > whose behavior will be nothing but surprising, even though it's > inflexible as can be (cf Greg's point about not being able to select > collation at runtime). I'm not going to say it's the worst piece of > language design that's ever come out of the SQL committee, but I'm > starting to feel like it's in the top ten. I'm going to have to disagree, I think that the solution they've come up with using collations and collation state is quite neat and actually does what people want. I've experimented with it and I haven't found any situation where the results would be surprising. And easy to implement, compared to the planner changes. We don't let people change types at runtime, why would collations be any different? Runtime sorting can be acheived with strxfrm. In any case, you don't need the propegation for COLLATE expressions, because they will be rare. You primarily need it for implicit collation propegation. ISTM that doing collation state propegation for everything except explicit COLLATE expressions is about the most surprising solution of all. What you're suggesting is going to lead to situations where the user sets a non-default collation on every field in every table in the database and depending on the query they will sometimes get the default collation anyway. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
On Mar 13, 2011, at 8:25 AM, Martijn van Oosterhout <kleptog@svana.org> wrote:
What you're suggesting is going to lead to situations where the user
sets a non-default collation on every field in every table in the
database and depending on the query they will sometimes get the default
collation anyway.
Not that I know a lot about this, but I agree there should be some kind of bubbling up here. ISTM that you could think of this as replacing the text type (and maybe others) by a collection of closely related types, and operators like >(text, text) become parametrically polymorphic. I am not entirely convinced that there won't be corner cases when this implicit polymorphism will get it wrong, but if it does we can apply a suitably sized band-aid. I agree with Martijn's analysis that it will get it right a lot more often; and consequently avoid the need for a lot of manual fiddling.
...Robert
Martijn van Oosterhout <kleptog@svana.org> writes: > On Sat, Mar 12, 2011 at 06:06:33PM -0500, Tom Lane wrote: >> I remain unconvinced, because there are too many corner cases. Should >> collation propagate up out of a subselect? How about a CTE? You're >> starting to get into some pretty weird action-at-a-distance situations >> if so, analogous to the function-input-arguments case that you were just >> saying should NOT propagate collation. And I still don't see anything >> in the text of the spec to justify it. > I said don't propegate the collation *state*, the collation should be > propegated. Well, it's exactly that distinction that's bugging me. It seems a bit arbitrary if collation propagates in certain cases where collation state doesn't. I'm concerned in particular that we're going to find ourselves backend into a corner if someone comes up with a different reading of the spec. The proposed implementation will be incapable of propagating collation state across subselect boundaries (because the post-parse scan is going to operate at most one subquery at a time), so if someone convinces us that we should do that, what then? regards, tom lane
On Sun, Mar 13, 2011 at 01:16:36PM -0400, Tom Lane wrote: > > I said don't propegate the collation *state*, the collation should be > > propegated. > > Well, it's exactly that distinction that's bugging me. It seems a bit > arbitrary if collation propagates in certain cases where collation state > doesn't. I'm concerned in particular that we're going to find ourselves > backend into a corner if someone comes up with a different reading of > the spec. The proposed implementation will be incapable of propagating > collation state across subselect boundaries (because the post-parse scan > is going to operate at most one subquery at a time), so if someone > convinces us that we should do that, what then? Hmm, if you suggest the proposed implementation won't do it then I believe you. I would have thought it could be propegated at the same time as the type information, after all the SQL standard considers the collation to be part of the SQL type (like Robert says you have lots of closely related character types, section 4.2.1). The reason you don't want to transport collation state across boundaries is because it makes things go action-at-a-distance, like you say. It is the collation states that control the propegation. The only guidence I can find on this in the SQL standard (since AFAICS neither MySQL nor Transact-SQL describe this clearly in their documentation) is section 4.2.2 and it says that anything with a declared data type is implicit. So function arguments, table columns and the like are implicit. Everything else is by definition derived (after all the type must also be derived). So I'd suggest that collate state EXPLICIT only propegates within the same scope where its visible what its doing, and is only assumed implicit when it hits a point where the datatype is stored, like views, functions, etc. That's my reading of it anyway. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
On 12.03.2011 18:17, Tom Lane wrote: > Does the SQL > standard have anything to say on the matter, or is there a precedent in > the behavior of TSQL or other DBMSes? Tom, SQL standard let it open for implementers. The other DBMS - for which I am/was collation expert - takes afair the database/schema collation for functions/procedures - no individual collation. Just believe me - there is tons of user complain feedback on this topic. I really can't recommend doing it same way. My experience is that users want to use own collations in functions too. Susanne -- Susanne Ebrecht - 2ndQuadrant PostgreSQL Development, 24x7 Support, Training and Services www.2ndQuadrant.com
On sön, 2011-03-13 at 13:16 -0400, Tom Lane wrote: > Well, it's exactly that distinction that's bugging me. It seems a bit > arbitrary if collation propagates in certain cases where collation > state doesn't. I'm concerned in particular that we're going to find > ourselves backend into a corner if someone comes up with a different > reading of the spec. The proposed implementation will be incapable of > propagating collation state across subselect boundaries (because the > post-parse scan is going to operate at most one subquery at a time), > so if someone convinces us that we should do that, what then? Do you have an example of what you have in mind? There are some cases in the SQL standard that the current implementation doesn't cover yet. But then again, we have also moved the type system around a few times over the years as we have gained experience and found the time to write the code.