Обсуждение: Collations versus user-defined functions

Поиск

Список

Период

Сортировка

Collations versus user-defined functions

От

Tom Lane

Дата:

12 марта 2011 г., 16:17:25

I've thought of another area that AFAICT the current patch fails to
address at all: what should happen in user-defined functions?
Consider
create function my_lt(x text, y text) returns bool as$$    begin        return x < y;    end$$ language plpgsql;
select my_lt('foo', 'bar' collate "de_DE");select my_lt('foo', 'bar' collate "fr_FR");

I think it's at least arguably desirable that the results of the two
calls respond to the collation clauses, but it does not look to me
like that will happen: plpgsql isn't doing anything to propagate
its call-site collation value into expressions it evaluates, and
if it did, it'd still get the wrong answer on the second call because it
would have cached an expression plan tree containing the collation info
from the first call.

In SQL-language functions the situation is even worse, because they will
behave differently depending on whether or not they get inlined.
(I think ... haven't really tested that case.)

What do we want to do about this?  Making it work the way it seems like
it ought to will require a rather substantial investment of effort.
It looks to me like the least invasive answer would be to have plpgsql
cache different plan trees depending on the collation it receives for
its parameters, but that's still a whole lot of work.  Does the SQL
standard have anything to say on the matter, or is there a precedent in
the behavior of TSQL or other DBMSes?
        regards, tom lane

Re: Collations versus user-defined functions

От

Martijn van Oosterhout

Дата:

12 марта 2011 г., 17:37:20

On Sat, Mar 12, 2011 at 12:17:11PM -0500, Tom Lane wrote:
> I've thought of another area that AFAICT the current patch fails to
> address at all: what should happen in user-defined functions?

<snip>

The POLA suggests that the collation derivation of the original query
should not be affected by the implementation of a function. In the case
of SQL functions this means that the expansion of the function must
not change the results. This would mean introducing a CollateNode above
the SQL function after expansion, though you may be able to acheive
this by doing the collation derivation prior to expansion of the SQL
function, but I don't know if that's feasable.

(Note the introduced collate node would need to remember the collation state.)

Similarly, inside the function the parameters should be considered to
be IMPLICIT collation, to avoid strange errors depending on how its
called.

This means you can't make a set_collation function, but that doesn't
seem like a loss to me.

>     select my_lt('foo', 'bar' collate "de_DE");
>     select my_lt('foo', 'bar' collate "fr_FR");
>
> I think it's at least arguably desirable that the results of the two
> calls respond to the collation clauses, but it does not look to me
> like that will happen: plpgsql isn't doing anything to propagate
> its call-site collation value into expressions it evaluates, and
> if it did, it'd still get the wrong answer on the second call because it
> would have cached an expression plan tree containing the collation info
> from the first call.

I think you need to consider the collation to be a variation of the
type. plpgsql makes new plans for each type when dealing with any
parameters, this should fit right in.

SQL would need a recollate-label node like suggested above.

For other languages you just need to provide the info, what they do
with it is not your problem.

> What do we want to do about this?  Making it work the way it seems like
> it ought to will require a rather substantial investment of effort.
> It looks to me like the least invasive answer would be to have plpgsql
> cache different plan trees depending on the collation it receives for
> its parameters, but that's still a whole lot of work.  Does the SQL
> standard have anything to say on the matter, or is there a precedent in
> the behavior of TSQL or other DBMSes?

I can't help you with other DBs, google isn't finding me anything. But
the plpgsql problem should be done already right, given it already
handles cached plans for different types.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Collations versus user-defined functions

От

Greg Stark

Дата:

12 марта 2011 г., 17:44:04

On Sat, Mar 12, 2011 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>        create function my_lt(x text, y text) returns bool as
>        $
>                begin
>                        return x < y;
>                end
>        $ language plpgsql;
>
>        select my_lt('foo', 'bar' collate "de_DE");
>        select my_lt('foo', 'bar' collate "fr_FR");
>
> I think it's at least arguably desirable that the results of the two
> calls respond to the collation clauses, but it does not look to me
> like that will happen: plpgsql isn't doing anything to propagate
> its call-site collation value into expressions it evaluates, and
> if it did, it'd still get the wrong answer on the second call because it
> would have cached an expression plan tree containing the collation info
> from the first call.
>

I don't think it's obvious that this is the right behaviour. I think
functions should provide the same answer on the same inputs regardless
of context unless they're really intended to be volatile. The default
collation specified there is not part of the value being passed. If
you want to affect the way a plpgsql function orders things in its
code you should pass an extra argument for collation and then the
plpgsql function should use COLLATE colarg -- though I'm not sure if
that works, can you put parameters in COLLATE arguments?

I do hope user defined functions return values are marked with
implicit/explicit collations based on their arguments though.

--
greg

Re: Collations versus user-defined functions

От

Tom Lane

Дата:

12 марта 2011 г., 18:25:14

Greg Stark <gsstark@mit.edu> writes:
> On Sat, Mar 12, 2011 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>        create function my_lt(x text, y text) returns bool as
>>        $
>>                begin
>>                        return x < y;
>>                end
>>        $ language plpgsql;
>> 
>>        select my_lt('foo', 'bar' collate "de_DE");
>>        select my_lt('foo', 'bar' collate "fr_FR");
>> 
>> I think it's at least arguably desirable that the results of the two
>> calls respond to the collation clauses, but it does not look to me
>> like that will happen: plpgsql isn't doing anything to propagate
>> its call-site collation value into expressions it evaluates, and
>> if it did, it'd still get the wrong answer on the second call because it
>> would have cached an expression plan tree containing the collation info
>> from the first call.

> I don't think it's obvious that this is the right behaviour.

I'm not sure of that either, but ...

> I think
> functions should provide the same answer on the same inputs regardless
> of context unless they're really intended to be volatile.

... that argument convinces me not at all, because they are *not* the
same inputs.  The collate clauses are different.  If I believed your
argument, then the built-in "<" function shouldn't respond to COLLATE
either.

> If you want to affect the way a plpgsql function orders things in its
> code you should pass an extra argument for collation and then the
> plpgsql function should use COLLATE colarg -- though I'm not sure if
> that works, can you put parameters in COLLATE arguments?

No, you cannot, the SQL committee has blown it on that.  COLLATE's
argument is an identifier not a variable.  There is no way to do
runtime selection of collation like that.
        regards, tom lane

Re: Collations versus user-defined functions

От

Tom Lane

Дата:

12 марта 2011 г., 18:46:29

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, Mar 12, 2011 at 12:17:11PM -0500, Tom Lane wrote:
>> I've thought of another area that AFAICT the current patch fails to
>> address at all: what should happen in user-defined functions?

> The POLA suggests that the collation derivation of the original query
> should not be affected by the implementation of a function. In the case
> of SQL functions this means that the expansion of the function must
> not change the results. This would mean introducing a CollateNode above
> the SQL function after expansion, though you may be able to acheive
> this by doing the collation derivation prior to expansion of the SQL
> function, but I don't know if that's feasable.

CollateExpr as presently defined wouldn't get the job done, but I think
it's not a problem because those nodes aren't actually needed at runtime
--- collation assignment for operators/functions above the inlined
function should have been done when they were parsed, so it won't change
as a result of expanding an inlined function.

> Similarly, inside the function the parameters should be considered to
> be IMPLICIT collation, to avoid strange errors depending on how its
> called.

Not convinced by this.  If we say that that's how it works, then no
user-defined function should react to COLLATE in its arguments at all,
which seems pretty weird and restrictive --- especially if the COLLATE
property is expected to propagate up through the function call so
far as the calling expression is concerned.  It seems just bizarre to
me to say that a function's internal operations don't react to an
input collation spec but then its result is thought to still be affected
by that.

This would actually seem more sensible if we went with something even
simpler than the current patch's behavior, namely that COLLATE only
affects the operator it is an *immediate* input of, and nothing
propagates upward in expressions ever.  I remain unconvinced that the
SQL spec is calling for propagation ...

> I think you need to consider the collation to be a variation of the
> type. plpgsql makes new plans for each type when dealing with any
> parameters, this should fit right in.

Yeah, the same occurred to me a little bit later --- we can actually
make that work fairly easily by treating collatable input datatypes
as if they were polymorphic.  But the question is whether we should.
You seem to be arguing above that user-defined functions ought not
pay attention to COLLATE specs on their inputs.
        regards, tom lane

Re: Collations versus user-defined functions

От

Peter Eisentraut

Дата:

12 марта 2011 г., 19:23:45

On lör, 2011-03-12 at 12:17 -0500, Tom Lane wrote:
> Does the SQL standard have anything to say on the matter, or is there
> a precedent in the behavior of TSQL or other DBMSes?

I had investigated this issue but the SQL standard doesn't say anything
about it.

The SQL inlining issue is tricky.  Other languages including PL/pgSQL
are not supported at the moment.

Re: Collations versus user-defined functions

От

Martijn van Oosterhout

Дата:

12 марта 2011 г., 20:41:17

On Sat, Mar 12, 2011 at 02:46:19PM -0500, Tom Lane wrote:
> > Similarly, inside the function the parameters should be considered to
> > be IMPLICIT collation, to avoid strange errors depending on how its
> > called.
>
> Not convinced by this.  If we say that that's how it works, then no
> user-defined function should react to COLLATE in its arguments at all,
> which seems pretty weird and restrictive --- especially if the COLLATE
> property is expected to propagate up through the function call so
> far as the calling expression is concerned.  It seems just bizarre to
> me to say that a function's internal operations don't react to an
> input collation spec but then its result is thought to still be affected
> by that.

I think I didn't explain myself well. The *state* should be implicit,
the actual collation should be whatever the query says. What I was
thinking of is the following:

CREATE FUNCTION my_english_lt(text, text) RETURNS boolean AS $$  return $1 < $2 COLLATE "en_US"
$$;

(not sure about the syntax but you get the idea).

If you just propegate naively you would get:

my_english_lt(x COLLATE "de_DE", y)   -> error, conflicting collation
my_english_ly(x, y COLLATE "de_DE")   -> would work fine

Hence my suggestion that on input to the function the parameters would
be considered collation "de_DE" state IMPLICIT, so the collation in the
function overrides, but if the COLLATE in the function is removed, the
implicit collation takes hold.

> This would actually seem more sensible if we went with something even
> simpler than the current patch's behavior, namely that COLLATE only
> affects the operator it is an *immediate* input of, and nothing
> propagates upward in expressions ever.  I remain unconvinced that the
> SQL spec is calling for propagation ...

Well, it doesn't say in the general case, but there is under 6.29
<string value function> Syntax rule 4b

4) If <character substring function> CSF is specified, then let DTCVE
be the declared type of the <character value expression> immediately
contained in CSF. The maximum length, character set, and collation of
the declared type DTCSF of CSF are determined as follows:

b) The character set and collation of the <character substring
function> are those of DTCVE.

A similar wording is for the trim function. While obviously it doesn't
cover all user defined functions, it seem obviously that once you do
propegation for a few builtins you may as well do it for all of them.
For the concatination operator is has something similar, though written
in a way only a spec committe could come up with.

Frankly, without propegation the feature seems entirely useless. Almost
all collations are going to be defined by implicit collations attached
to columns. If

ORDER BY x

and

ORDER BY x || 'foo'

Don't use the same collation then that is a first grade violation of
the POLA.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Collations versus user-defined functions

От

Tom Lane

Дата:

12 марта 2011 г., 20:54:45

Martijn van Oosterhout <kleptog@svana.org> writes:
> I think I didn't explain myself well. The *state* should be implicit,
> the actual collation should be whatever the query says. What I was
> thinking of is the following:

> CREATE FUNCTION my_english_lt(text, text) RETURNS boolean AS $$
>    return $1 < $2 COLLATE "en_US"
> $$;

> (not sure about the syntax but you get the idea).

> If you just propegate naively you would get:

> my_english_lt(x COLLATE "de_DE", y)   -> error, conflicting collation
> my_english_ly(x, y COLLATE "de_DE")   -> would work fine

> Hence my suggestion that on input to the function the parameters would
> be considered collation "de_DE" state IMPLICIT, so the collation in the
> function overrides, but if the COLLATE in the function is removed, the
> implicit collation takes hold.

Oh, I see.  Yeah, that should work correctly, because parsing inside the
function will be done with Param symbols that act pretty much like Vars
--- whatever collation they have is considered implicit.  It's important
here that we do inlining by splicing completed parsetrees together ---
we *don't* do some sort of insert-the-parameters-and-reparse-from-scratch
approach.  So the collation labelings made inside the function won't
change as a result of inlining.
        regards, tom lane

Re: Collations versus user-defined functions

От

Tom Lane

Дата:

12 марта 2011 г., 22:06:46

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, Mar 12, 2011 at 02:46:19PM -0500, Tom Lane wrote:
>> This would actually seem more sensible if we went with something even
>> simpler than the current patch's behavior, namely that COLLATE only
>> affects the operator it is an *immediate* input of, and nothing
>> propagates upward in expressions ever.  I remain unconvinced that the
>> SQL spec is calling for propagation ...

> Well, it doesn't say in the general case, but there is under 6.29
> <string value function> Syntax rule 4b 

> 4) If <character substring function> CSF is specified, then let DTCVE
> be the declared type of the <character value expression> immediately
> contained in CSF. The maximum length, character set, and collation of
> the declared type DTCSF of CSF are determined as follows:

> b) The character set and collation of the <character substring
> function> are those of DTCVE.

> A similar wording is for the trim function. While obviously it doesn't
> cover all user defined functions, it seem obviously that once you do
> propegation for a few builtins you may as well do it for all of them.
> For the concatination operator is has something similar, though written
> in a way only a spec committe could come up with.

> Frankly, without propegation the feature seems entirely useless.

I remain unconvinced, because there are too many corner cases.  Should
collation propagate up out of a subselect?  How about a CTE?  You're
starting to get into some pretty weird action-at-a-distance situations
if so, analogous to the function-input-arguments case that you were just
saying should NOT propagate collation.  And I still don't see anything
in the text of the spec to justify it.

My feeling is that the feature would be simple, explainable, and useful
if COLLATE only affected the immediately syntactically-containing
operator.  The rest of this stuff requires a huge amount of mechanism
whose behavior will be nothing but surprising, even though it's
inflexible as can be (cf Greg's point about not being able to select
collation at runtime).  I'm not going to say it's the worst piece of
language design that's ever come out of the SQL committee, but I'm
starting to feel like it's in the top ten.
        regards, tom lane

Re: Collations versus user-defined functions

От

Martijn van Oosterhout

Дата:

13 марта 2011 г., 12:25:39

On Sat, Mar 12, 2011 at 06:06:33PM -0500, Tom Lane wrote:
> I remain unconvinced, because there are too many corner cases.  Should
> collation propagate up out of a subselect?  How about a CTE?  You're
> starting to get into some pretty weird action-at-a-distance situations
> if so, analogous to the function-input-arguments case that you were just
> saying should NOT propagate collation.  And I still don't see anything
> in the text of the spec to justify it.

I said don't propegate the collation *state*, the collation should be
propegated.

We propegate type information out of subqueries, we propegate
fieldnames, why not collation information? Once you consider the
collation a property of the type it becomes pretty obvious. I'll agree
the function-input-arguments is a bit odd, but the issue is not the
collation at all, but the collate *state*, which is something quite
different. But that's primarily (I think) because the SQL standard
doesn't have user defined functions (we'll there's PSM but it doesn't
consider the issue AFAICS).

If you feel that it shouldn't propegate into functions at all, it's a
soluton but I bet we'll get bug reports about it, because its totally
non-obvious. We get still complaints about not propegating typmod.

> My feeling is that the feature would be simple, explainable, and useful
> if COLLATE only affected the immediately syntactically-containing
> operator.  The rest of this stuff requires a huge amount of mechanism
> whose behavior will be nothing but surprising, even though it's
> inflexible as can be (cf Greg's point about not being able to select
> collation at runtime).  I'm not going to say it's the worst piece of
> language design that's ever come out of the SQL committee, but I'm
> starting to feel like it's in the top ten.

I'm going to have to disagree, I think that the solution they've come
up with using collations and collation state is quite neat and actually
does what people want. I've experimented with it and I haven't found
any situation where the results would be surprising. And easy to
implement, compared to the planner changes.

We don't let people change types at runtime, why would collations be
any different? Runtime sorting can be acheived with strxfrm.

In any case, you don't need the propegation for COLLATE expressions,
because they will be rare. You primarily need it for implicit collation
propegation. ISTM that doing collation state propegation for everything
except explicit COLLATE expressions is about the most surprising
solution of all.

What you're suggesting is going to lead to situations where the user
sets a non-default collation on every field in every table in the
database and depending on the query they will sometimes get the default
collation anyway.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Collations versus user-defined functions

От

Robert Haas

Дата:

13 марта 2011 г., 16:26:23

On Mar 13, 2011, at 8:25 AM, Martijn van Oosterhout <kleptog@svana.org> wrote:

What you're suggesting is going to lead to situations where the user
sets a non-default collation on every field in every table in the
database and depending on the query they will sometimes get the default
collation anyway.

Not that I know a lot about this, but I agree there should be some kind of bubbling up here. ISTM that you could think of this as replacing the text type (and maybe others) by a collection of closely related types, and operators like >(text, text) become parametrically polymorphic. I am not entirely convinced that there won't be corner cases when this implicit polymorphism will get it wrong, but if it does we can apply a suitably sized band-aid. I agree with Martijn's analysis that it will get it right a lot more often; and consequently avoid the need for a lot of manual fiddling.

...Robert

Re: Collations versus user-defined functions

От

Tom Lane

Дата:

13 марта 2011 г., 17:16:47

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, Mar 12, 2011 at 06:06:33PM -0500, Tom Lane wrote:
>> I remain unconvinced, because there are too many corner cases.  Should
>> collation propagate up out of a subselect?  How about a CTE?  You're
>> starting to get into some pretty weird action-at-a-distance situations
>> if so, analogous to the function-input-arguments case that you were just
>> saying should NOT propagate collation.  And I still don't see anything
>> in the text of the spec to justify it.

> I said don't propegate the collation *state*, the collation should be
> propegated.

Well, it's exactly that distinction that's bugging me.  It seems a bit
arbitrary if collation propagates in certain cases where collation state
doesn't.  I'm concerned in particular that we're going to find ourselves
backend into a corner if someone comes up with a different reading of
the spec.  The proposed implementation will be incapable of propagating
collation state across subselect boundaries (because the post-parse scan
is going to operate at most one subquery at a time), so if someone
convinces us that we should do that, what then?
        regards, tom lane

Re: Collations versus user-defined functions

От

Martijn van Oosterhout

Дата:

13 марта 2011 г., 21:17:28

On Sun, Mar 13, 2011 at 01:16:36PM -0400, Tom Lane wrote:
> > I said don't propegate the collation *state*, the collation should be
> > propegated.
>
> Well, it's exactly that distinction that's bugging me.  It seems a bit
> arbitrary if collation propagates in certain cases where collation state
> doesn't.  I'm concerned in particular that we're going to find ourselves
> backend into a corner if someone comes up with a different reading of
> the spec.  The proposed implementation will be incapable of propagating
> collation state across subselect boundaries (because the post-parse scan
> is going to operate at most one subquery at a time), so if someone
> convinces us that we should do that, what then?

Hmm, if you suggest the proposed implementation won't do it then I
believe you. I would have thought it could be propegated at the same
time as the type information, after all the SQL standard considers the
collation to be part of the SQL type (like Robert says you have lots of
closely related character types, section 4.2.1).

The reason you don't want to transport collation state across
boundaries is because it makes things go action-at-a-distance, like you
say. It is the collation states that control the propegation.

The only guidence I can find on this in the SQL standard (since AFAICS
neither MySQL nor Transact-SQL describe this clearly in their
documentation) is section 4.2.2 and it says that anything with a
declared data type is implicit. So function arguments, table columns
and the like are implicit. Everything else is by definition derived
(after all the type must also be derived).

So I'd suggest that collate state EXPLICIT only propegates within the
same scope where its visible what its doing, and is only assumed
implicit when it hits a point where the datatype  is stored, like views,
functions, etc. That's my reading of it anyway.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Collations versus user-defined functions

От

Susanne Ebrecht

Дата:

14 марта 2011 г., 14:42:24

On 12.03.2011 18:17, Tom Lane wrote:
>    Does the SQL
> standard have anything to say on the matter, or is there a precedent in
> the behavior of TSQL or other DBMSes?

Tom,

SQL standard let it open for implementers.

The other DBMS - for which I am/was collation expert - takes afair the
database/schema collation for functions/procedures - no individual 
collation.

Just believe me - there is tons of user complain feedback on this
topic. I really can't recommend doing it same way. My experience
is that users want to use own collations in functions too.

Susanne

-- 
Susanne Ebrecht - 2ndQuadrant
PostgreSQL Development, 24x7 Support, Training and Services
www.2ndQuadrant.com

Re: Collations versus user-defined functions

От

Peter Eisentraut

Дата:

15 марта 2011 г., 19:44:59

On sön, 2011-03-13 at 13:16 -0400, Tom Lane wrote:
> Well, it's exactly that distinction that's bugging me.  It seems a bit
> arbitrary if collation propagates in certain cases where collation
> state doesn't.  I'm concerned in particular that we're going to find
> ourselves backend into a corner if someone comes up with a different
> reading of the spec.  The proposed implementation will be incapable of
> propagating collation state across subselect boundaries (because the
> post-parse scan is going to operate at most one subquery at a time),
> so if someone convinces us that we should do that, what then?

Do you have an example of what you have in mind?

There are some cases in the SQL standard that the current implementation
doesn't cover yet.  But then again, we have also moved the type system
around a few times over the years as we have gained experience and found
the time to write the code.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Collations versus user-defined functions