Обсуждение: Range types

Поиск
Список
Период
Сортировка

Range types

От
Scott Bailey
Дата:
I had proposed a temporal contrib module earlier and you wanted to see 
support for many range types not just timestamptz. So I had an idea on 
how to implement this but I want to see if you guys thought it was a 
viable idea.

So basically I have an anyrange pseudo type with the functions prev, 
next, last, etc defined. So instead of hard coding range types, we would 
allow the user to define their own range types. Basically if we are able 
to determine the previous and next values of the base types we'd be able 
to define a range type. I'm envisioning in a manner much like defining 
an enum type.

CREATE TYPE periodtz AS RANGE (timestamptz, '1 microsecond'::interval);
CREATE TYPE numrange AS RANGE (numeric(8,2));
-- determine granularity from typmod
CREATE TYPE floatrange AS RANGE (float, '0.000000001'::float);

Or getting really crazy...
CREATE TYPE terms AS ENUM ('2000_F', '2000_W', '2000_S', '2000_Su'...  '2010_F', '2010_W', '2010_S', '2010_Su');
CREATE TYPE termrange AS RANGE (terms);

So basically I have a pg_range table to store the base typeid, a text 
field for the granule value and the granule typeid.

I doubt we would be able to get this in for the 8.5 release, especially 
since I'm still learning C and the Postgres internals. Jeff Davis is 
going to get something in before the next commit fest so we'll have some 
type of temporal/range support. But we wanted to see what direction the 
community felt we should go.

Scott Bailey


Re: Range types

От
Takahiro Itagaki
Дата:
Scott Bailey <artacus@comcast.net> wrote:

> CREATE TYPE periodtz AS RANGE (timestamptz, '1 microsecond'::interval);

What does the second argument mean? Is it a default interval?

> So basically I have a pg_range table to store the base typeid, a text 
> field for the granule value and the granule typeid.

As another approach, what about storing typeid in typmod?
(Oid can be assumed to be stored in int32.)

For example,   CREATE TABLE tbl ( r range(timestamp) );   SELECT '[ 2.0, 3.0 )'::range(float);

There might be some overhead to store typeid for each range instance,
but the typmod approach does not require additinal catalogs and syntax
changes. It can be possible even on 8.4.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center




Re: Range types

От
tomas@tuxteam.de
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, Dec 13, 2009 at 11:49:53PM -0800, Scott Bailey wrote:
> I had proposed a temporal contrib module earlier and you wanted to see 
> support for many range types not just timestamptz [...]

> So basically I have an anyrange pseudo type with the functions prev, next, 
> last, etc defined. So instead of hard coding range types, we would allow 
> the user to define their own range types. Basically if we are able to 
> determine the previous and next values of the base types we'd be able to 
> define a range type. I'm envisioning in a manner much like defining an enum 
> type.
>
> CREATE TYPE periodtz AS RANGE (timestamptz, '1 microsecond'::interval);
> CREATE TYPE numrange AS RANGE (numeric(8,2));
> -- determine granularity from typmod
> CREATE TYPE floatrange AS RANGE (float, '0.000000001'::float);

I might be asking the same as Itagaki is (see below) but... are you just
envisioning ranges on 'discrete' types? What about ranges on floats or
(arbitrary length) strings, where there is no prev/next? Too difficult?
(mind you: I don't know exactly what I'm talking about, but in would be
definitely useful).

On Mon, Dec 14, 2009 at 05:10:24PM +0900, Takahiro Itagaki wrote:
> 
> Scott Bailey <artacus@comcast.net> wrote:
> 
> > CREATE TYPE periodtz AS RANGE (timestamptz, '1 microsecond'::interval);
> 
> What does the second argument mean? Is it a default interval?

I think it's the granularity. That defines how to find the 'next' point
in time. As I understood it, Scott envisions ranges only for discrete
types, i.e. those advancing in well-defined steps. Note that (a) I might
be wrong and (b) there might be a very good reason for doing it this way.

> > So basically I have a pg_range table to store the base typeid, a text 
> > field for the granule value and the granule typeid.
> 
> As another approach, what about storing typeid in typmod?
> (Oid can be assumed to be stored in int32.)
> 
> For example,
>     CREATE TABLE tbl ( r range(timestamp) );
>     SELECT '[ 2.0, 3.0 )'::range(float);
> 
> There might be some overhead to store typeid for each range instance,
> but the typmod approach does not require additinal catalogs and syntax
> changes. It can be possible even on 8.4.

This looks more natural to me too.

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLJgAZBcgs9XrR2kYRAljHAJwJjYV6fHz4qPSY6sXROYZ6pKIlGQCeO4X1
eszUJopVGqcPkXbiHdQOVrs=
=IYQ0
-----END PGP SIGNATURE-----


Re: Range types

От
Robert Haas
Дата:
On Mon, Dec 14, 2009 at 4:06 AM,  <tomas@tuxteam.de> wrote:
>> As another approach, what about storing typeid in typmod?
>> (Oid can be assumed to be stored in int32.)
>>
>> For example,
>>     CREATE TABLE tbl ( r range(timestamp) );
>>     SELECT '[ 2.0, 3.0 )'::range(float);
>>
>> There might be some overhead to store typeid for each range instance,
>> but the typmod approach does not require additinal catalogs and syntax
>> changes. It can be possible even on 8.4.
>
> This looks more natural to me too.

It 's very different than the way we've traditionally used typmod,
though, which Tom described pretty well here:

http://archives.postgresql.org/pgsql-hackers/2009-11/msg01183.php

For example, function signatures ignore typmod, so you'll be able to
write a function that takes a range, but you won't know what kind of
range you're getting.  Pavel proposed changing that, but the problem
is that while you might want to discriminate on the basis of what sort
of range you're getting, you probably DON'T want to discriminate on
the length of the character string being passed in with a varchar
argument, or the number of decimal places in a numeric.

So I think this is going to be awkward.

Also, typid is unsigned and typmod is signed.  Again, awkward.  Maybe
with a big enough crowbar you can make it work, but it seems like it
won't be pretty...

...Robert


Re: Range types

От
tomas@tuxteam.de
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, Dec 14, 2009 at 06:02:04AM -0500, Robert Haas wrote:
> On Mon, Dec 14, 2009 at 4:06 AM,  <tomas@tuxteam.de> wrote:

[...]

> > This looks more natural to me too.
> 
> It 's very different than the way we've traditionally used typmod,
> though, which Tom described pretty well here:
> 
> http://archives.postgresql.org/pgsql-hackers/2009-11/msg01183.php
> 
> For example, function signatures ignore typmod, so you'll be able to
> write a function that takes a range, but you won't know what kind of
> range you're getting [...]

> Also, typid is unsigned and typmod is signed.  Again, awkward [...]

Ugh. I see. Thank you for this very insightful comment.

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLJiVuBcgs9XrR2kYRAp4ZAJsHjzYuVxwaeAUr1ogqRsZOecxdcwCeLfUv
8lZmeY6lb4r+57c6ZdB0J9M=
=0Ips
-----END PGP SIGNATURE-----


Re: Range types

От
Dimitri Fontaine
Дата:
Scott Bailey <artacus@comcast.net> writes:

> So basically I have an anyrange pseudo type with the functions prev, next,
> last, etc defined. So instead of hard coding range types, we would allow the
> user to define their own range types. Basically if we are able to determine
> the previous and next values of the base types we'd be able to define a
> range type. I'm envisioning in a manner much like defining an enum
> type.

It's not clear how to define those functions for the prefix_range
datatype, where '123' represents any text begining with those chars and
'123[4-6]' any text begining with '123' then either 4, 5 or 6.

What's supposed to return SELECT next('123'::prefix_range); ?

Regards,
-- 
dim

PS: as I'm used to use that in the telephony context, the example
contain figures, but it's a text based type and given questions and
reports in pgsql-general, people do use it with text ranges too.


Re: Range types

От
Martijn van Oosterhout
Дата:
On Sun, Dec 13, 2009 at 11:49:53PM -0800, Scott Bailey wrote:
> So basically I have an anyrange pseudo type with the functions prev,
> next, last, etc defined. So instead of hard coding range types, we would
> allow the user to define their own range types. Basically if we are able
> to determine the previous and next values of the base types we'd be able
> to define a range type. I'm envisioning in a manner much like defining
> an enum type.

I find it odd that you could define functions next() and prev() since
that assumes some kind of dicretisation which simply does not exist for
most types I can think of.

It would seem to me the real useful uses of ranges would be the
operations overlaps, disjoint, proceeds, follows, etc, which could all
be defined on any well-ordered type (wherever greater-than and
less-than are well defined). No need to discretise anything.

Do you have any actual usecase for a distretized range type for
timestamp?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Range types

От
Tom Lane
Дата:
Scott Bailey <artacus@comcast.net> writes:
> So basically I have an anyrange pseudo type with the functions prev, 
> next, last, etc defined. So instead of hard coding range types, we would 
> allow the user to define their own range types. Basically if we are able 
> to determine the previous and next values of the base types we'd be able 
> to define a range type. I'm envisioning in a manner much like defining 
> an enum type.

I think array types, not enums, would be a better model.

The real question is how the heck granularity enters into it.  Why
should a range type require that?  I think you are mixing up two
concepts that would be better kept separate.

In particular, the granularity examples you give seem to assume that
the underlying datatype is exact not approximate --- which among other
things will mean that it fails to work for float timestamps.  Since
timestamps are supposedly the main use-case, that's pretty troubling.
        regards, tom lane


Re: Range types

От
Tom Lane
Дата:
Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:
> As another approach, what about storing typeid in typmod?
> (Oid can be assumed to be stored in int32.)

No, it cannot --- you'd be one bit short.  The general rule for typmod
is that all negative values are treated as "unspecified".  Even if you
managed to find and change every place that assumed that, you'd still
have a failure for -1, which is a perfectly legal value of Oid.
        regards, tom lane


Re: Range types

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Dec 14, 2009 at 4:06 AM,  <tomas@tuxteam.de> wrote:
>>> As another approach, what about storing typeid in typmod?
>>> (Oid can be assumed to be stored in int32.)

> It 's very different than the way we've traditionally used typmod,

Aside from the problems already pointed out, there's another: this
definition usurps the ability to attach a typmod to the range's
base type.  Again, you should be looking at arrays not enums for
a reference example.  The typmod of an array feeds through to its
element type.
        regards, tom lane


Re: Range types

От
Scott Bailey
Дата:
Martijn van Oosterhout wrote:
> On Sun, Dec 13, 2009 at 11:49:53PM -0800, Scott Bailey wrote:
>> So basically I have an anyrange pseudo type with the functions prev,  
>> next, last, etc defined. So instead of hard coding range types, we would  
>> allow the user to define their own range types. Basically if we are able  
>> to determine the previous and next values of the base types we'd be able  
>> to define a range type. I'm envisioning in a manner much like defining  
>> an enum type.
> 
> I find it odd that you could define functions next() and prev() since
> that assumes some kind of dicretisation which simply does not exist for
> most types I can think of.

Because intervals (mathematical not SQL) can be open or closed at each 
end point we need to know what the next an previous value would be at 
the specified granularity. And while you can do some operations without 
knowing this, there are many you can't. For instance you could not tell 
whether two [] or () ranges were adjacent, or be able to coalesce an 
array of ranges.

> It would seem to me the real useful uses of ranges would be the
> operations overlaps, disjoint, proceeds, follows, etc, which could all
> be defined on any well-ordered type (wherever greater-than and
> less-than are well defined). No need to discretise anything.
> 
> Do you have any actual usecase for a distretized range type for
> timestamp?




Re: Range types

От
Tom Lane
Дата:
Scott Bailey <artacus@comcast.net> writes:
> Because intervals (mathematical not SQL) can be open or closed at each 
> end point we need to know what the next an previous value would be at 
> the specified granularity. And while you can do some operations without 
> knowing this, there are many you can't. For instance you could not tell 
> whether two [] or () ranges were adjacent, or be able to coalesce an 
> array of ranges.

This statement seems to me to demonstrate that you don't actually
understand the concept of open and closed ranges.  It has nothing
whatsoever to do with assuming that the data type is discrete;
these concepts are perfectly well defined for the reals, for example.
What it is about is whether the inclusion conditions are "< bound"
or "<= bound".
        regards, tom lane


Re: Range types

От
Nathan Boley
Дата:
>> Because intervals (mathematical not SQL) can be open or closed at each
>> end point we need to know what the next an previous value would be at
>> the specified granularity. And while you can do some operations without
>> knowing this, there are many you can't. For instance you could not tell
>> whether two [] or () ranges were adjacent, or be able to coalesce an
>> array of ranges.
>
> This statement seems to me to demonstrate that you don't actually
> understand the concept of open and closed ranges.  It has nothing
> whatsoever to do with assuming that the data type is discrete;
> these concepts are perfectly well defined for the reals, for example.
> What it is about is whether the inclusion conditions are "< bound"
> or "<= bound".

IMHO the first question is whether, for integers, [1,2] UNION [3,5]
should be equal to [1,5]. In math this is certainly true, and defining
'next' seems like a reasonable way to establish this in postgres.

The next question is whether, for floats, [1,3-FLT_EPSILON] UNION
[3,5] should be [1,5].

And the next question is whether, for numeric(6,2), [1,2.99] UNION
[3,5] should be [1,5].

FWIW, I would answer yes, no, yes to those three questions.

-Nathan


Re: Range types

От
Jeff Davis
Дата:
On Mon, 2009-12-14 at 11:25 -0500, Tom Lane wrote:
> Scott Bailey <artacus@comcast.net> writes:
> > Because intervals (mathematical not SQL) can be open or closed at each 
> > end point we need to know what the next an previous value would be at 
> > the specified granularity. And while you can do some operations without 
> > knowing this, there are many you can't. For instance you could not tell 
> > whether two [] or () ranges were adjacent, or be able to coalesce an 
> > array of ranges.
> 
> This statement seems to me to demonstrate that you don't actually
> understand the concept of open and closed ranges.  It has nothing
> whatsoever to do with assuming that the data type is discrete;
> these concepts are perfectly well defined for the reals, for example.
> What it is about is whether the inclusion conditions are "< bound"
> or "<= bound".

Of course you can still define the obvious "contains" and "overlaps"
operators for a continuous range. But there are specific differences in
the API between a discrete range and a continuous range, which is what
Scott was talking about.

1. With discrete ranges, you can change the displayed
inclusivity/exclusivity without changing the value. For instance in the
integer domain, [ 5, 10 ] is the same value as [ 5, 11 ). This works on
both input and output. It is not possible to change the display for
continuous ranges.

2. With discrete ranges, you can get the last point before the range,
the first point in the range, the last point in the range, and the first
point after the range. These are more than enough to describe the range
completely. For continuous ranges, those functions will fail depending
on the inclusivity/exclusivity of the range. Practically speaking, you
would want to have a different set of accessors: start_inclusive(),
start_point(), end_point(), and end_inclusive(). However, those
functions can't be usefully defined on a discrete range.

We can't choose the API for continuous ranges as the API for discrete
ranges as well. If we did, then we would think that [5, 10] and [5, 11)
were not equal, but they are. Similarly, we would think that [5, 10] and
[11, 12] were not adjacent, but they are.

So there are certainly some user-visible API differences between the
two, and I don't think those differences can be hidden.

Regards,Jeff Davis





Re: Range types

От
Jeff Davis
Дата:
On Mon, 2009-12-14 at 09:58 -0500, Tom Lane wrote:
> In particular, the granularity examples you give seem to assume that
> the underlying datatype is exact not approximate --- which among other
> things will mean that it fails to work for float timestamps.  Since
> timestamps are supposedly the main use-case, that's pretty troubling.

Additionally, granularity for timestamps is not particularly useful when
you get to things like "days" and "months" which don't have a clean
algebra.

Is the granule there only to try to support continuous ranges? If so, I
don't think it's necessary if we expose the API differences I outlined
in another email in this thread. Also, that would mean that we don't
need a granule for float, because we can already treat it as discrete*.

Regards,Jeff Davis

*: nextafter() allows you to increment or decrement a double (loosely
speaking), and according to the man page it's part of C99 and
POSIX.1-2001.



Re: Range types

От
Tom Lane
Дата:
Nathan Boley <npboley@gmail.com> writes:
>> This statement seems to me to demonstrate that you don't actually
>> understand the concept of open and closed ranges.

> IMHO the first question is whether, for integers, [1,2] UNION [3,5]
> should be equal to [1,5]. In math this is certainly true, and defining
> 'next' seems like a reasonable way to establish this in postgres.

Well, that's nice to have (not essential) for data types that actually
are discrete.  It's not a sufficient argument for creating a definition
that is flat out broken for non-discrete datatypes.

It might be worth pointing out here that the concept of an open interval
was only invented in the first place for dealing with a continuum.
If you could assume the underlying set is discrete, every open interval
could be replaced with a closed one, using the next or prior value as
the bound instead.  There would be no need for two kinds of interval.

If you are intending to support UNION on intervals, you are going to
need some more-general representation anyway.  (I trust you don't think
we will accept a datatype for which [1,2] UNION [3,5] works but
[1,2] UNION [4,5] throws an error.)  So whether the code can reduce the
result of a union to a single range or has to keep it as two ranges is
only an internal optimization issue anyhow, not something that should
drive an artificial (and infeasible) attempt to force every domain to be
discrete.
        regards, tom lane


Re: Range types

От
Jeff Davis
Дата:
On Mon, 2009-12-14 at 10:00 -0800, Nathan Boley wrote:
> IMHO the first question is whether, for integers, [1,2] UNION [3,5]
> should be equal to [1,5]. In math this is certainly true, and defining
> 'next' seems like a reasonable way to establish this in postgres.

[ you say "yes" ]

Agreed.

> The next question is whether, for floats, [1,3-FLT_EPSILON] UNION
> [3,5] should be [1,5].

[ you say "no" ]

I think this should be true, because all floats between 1 and 5 are
contained. I don't feel too strongly about this, so I would not complain
if floats were treated as continuous.

> And the next question is whether, for numeric(6,2), [1,2.99] UNION
> [3,5] should be [1,5].

[ you say "yes" ]

I almost agree. Unfortunately, typmod isn't really a part of the type,
it just affects input/output. So, we can't really use it that way -- as
Tom points out, typmod is not passed along to functions that take the
value.

But if it were a part of the type, then I would certainly agree.

Regards,Jeff Davis



Re: Range types

От
Alvaro Herrera
Дата:
Jeff Davis wrote:

> So there are certainly some user-visible API differences between the
> two, and I don't think those differences can be hidden.

ISTM you are saying we should have two types of range types.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Range types

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2009-12-14 at 11:25 -0500, Tom Lane wrote:
>> This statement seems to me to demonstrate that you don't actually
>> understand the concept of open and closed ranges.

> Of course you can still define the obvious "contains" and "overlaps"
> operators for a continuous range. But there are specific differences in
> the API between a discrete range and a continuous range, which is what
> Scott was talking about.

Well, if the intention is to invent two different kinds of ranges, with
different operators, for continuous and discrete data types, then fine.
But the original post suggested no such thing, and provided (unworkable)
examples suggesting that the intent was to force every type to be
treated as discrete whether that makes any sense or not.

The main question I would have is how to tell whether the underlying
type is really discrete.  If we allow people to define things like
"range over float4 with 0.000001 step", then somebody will try to do it
--- and file bugs when it doesn't work sanely.  I don't think there is
anything in our API for user-defined types that really tells you whether
it's an exact or approximate type.

(Also, stuff like strings simply doesn't have any sane concept of a
unique next or previous value.  I think the number of types that are
really discrete in this sense is very small, like maybe just ints and
enums.)
        regards, tom lane


Re: Range types

От
Scott Bailey
Дата:
Tom Lane wrote:
> Scott Bailey <artacus@comcast.net> writes:
>> Because intervals (mathematical not SQL) can be open or closed at each 
>> end point we need to know what the next an previous value would be at 
>> the specified granularity. And while you can do some operations without 
>> knowing this, there are many you can't. For instance you could not tell 
>> whether two [] or () ranges were adjacent, or be able to coalesce an 
>> array of ranges.
> 
> This statement seems to me to demonstrate that you don't actually
> understand the concept of open and closed ranges.  It has nothing
> whatsoever to do with assuming that the data type is discrete;
> these concepts are perfectly well defined for the reals, for example.
> What it is about is whether the inclusion conditions are "< bound"
> or "<= bound".

I won't address how you draw your conclusions here. But I find it 
'interesting' that you assume that I don't know what I'm talking about 
rather than assume you don't fully understand what I'm talking about.

Anyhow. For any given range you may be 4 combinations of values. Either 
the first value included in the range '[' or the last value preceding 
the start of the range '('; and the last value included in the range ']' 
or the first value following the end of the range ')'. We aren't going 
to store all four data points so we need to normalize into the most 
common form, a half-open interval [) and store just those two values. 
The first value included in the range and the first value after the end 
of our range.

So lets say you are using a  numeric range to model the high and low 
values of stocks trading on a given day. Now imagine doing this with no 
concept of granularity. You will most likely be given a range [low, 
high] with inclusive end points. So how do you convert that to a 
closed-open interval so you can store it? Is 20.42000000000000000001 the 
next value after 20.42? Probably not. You are going to want to define 
0.01 as the granularity for this (either type or column) so that 20.43 is.

Or again are the ranges [14.0, 22.0] and [22.1, 29.0] adjacent? Maybe, 
maybe not. There is no way to tell w/o knowing the granularity. Perhaps 
the granularity is 0.000000001 and there are a billion values that are 
not included. Or perhaps the granularity is 0.1 and the are adjacent.

Scott


Re: Range types

От
Scott Bailey
Дата:
Tom Lane wrote:
> Scott Bailey <artacus@comcast.net> writes:
>> So basically I have an anyrange pseudo type with the functions prev, 
>> next, last, etc defined. So instead of hard coding range types, we would 
>> allow the user to define their own range types. Basically if we are able 
>> to determine the previous and next values of the base types we'd be able 
>> to define a range type. I'm envisioning in a manner much like defining 
>> an enum type.
> 
> I think array types, not enums, would be a better model.

I was referring to the syntax for how the user actually defined an enum 
not about it's implementation. Basically what I was hoping to get out of 
this thread was whether it was better to allow the user to define their 
own range types by specifying the base type and possibly the granularity  and default inclusiveness of the end points,
orif we should just 
 
provide the types like period and intrange?

Scott


Re: Range types

От
Jeff Davis
Дата:
On Mon, 2009-12-14 at 13:32 -0500, Tom Lane wrote:
> Well, if the intention is to invent two different kinds of ranges, with
> different operators, for continuous and discrete data types, then fine.

Originally I thought most of the use cases were workable as discrete
ranges. If people want continuous ranges, that's certainly doable by
using a slightly different API.

> But the original post suggested no such thing, and provided (unworkable)
> examples suggesting that the intent was to force every type to be
> treated as discrete whether that makes any sense or not.

I agree, we shouldn't say we support continuous types, but force them to
be treated like discrete types.

> The main question I would have is how to tell whether the underlying
> type is really discrete.

We can ask the user to provide a prior() and next() function, and if
they aren't provided, we assume it's continuous.

Also, this range mechanism may be useful to get the meaningful operators
for a range type. For instance, for a range join (e.g. a temporal join),
we could recognize the && as "overlaps" and then find the "strictly left
of" operator if we wanted to do a range merge join. This might be
cleaner than the previous idea to mark operator families as conforming
to a certain convention for strategy numbers.

> (Also, stuff like strings simply doesn't have any sane concept of a
> unique next or previous value.  I think the number of types that are
> really discrete in this sense is very small, like maybe just ints and
> enums.)

I think "countable" is a more accurate word than "discrete". Strings are
discrete but not countable.

Regards,Jeff Davis



Re: Range types

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2009-12-14 at 13:32 -0500, Tom Lane wrote:
>> The main question I would have is how to tell whether the underlying
>> type is really discrete.

> We can ask the user to provide a prior() and next() function, and if
> they aren't provided, we assume it's continuous.

Well, that still leaves us with the problem that Joe Schmo will file
a bug when "create function next(float4) returns float4 as
$$ select $1 + 0.00001 $$" doesn't behave sanely for him.  I'd prefer
not to leave it to the user to decide whether a type is discrete or
not.  The traffic on pgsql-bugs is convincing evidence that a very
large fraction of our user-base doesn't understand that floats are
inexact :-(

> I think "countable" is a more accurate word than "discrete". Strings are
> discrete but not countable.

It's been too long since college math classes for me to be sure whether
"discrete" is really the exact term here.  But I'm even more suspicious
of "countable".  I think a suitable diagonalization argument might show
that strings are countable.  That's getting a bit off-topic though...
        regards, tom lane


Re: Range types

От
Jeff Davis
Дата:
On Mon, 2009-12-14 at 11:10 -0800, Scott Bailey wrote:
> I was referring to the syntax for how the user actually defined an enum 
> not about it's implementation. Basically what I was hoping to get out of 
> this thread was whether it was better to allow the user to define their 
> own range types by specifying the base type and possibly the granularity 
>   and default inclusiveness of the end points, or if we should just 
> provide the types like period and intrange?

To be a little bit more specific about the alternatives:

1. Make a contrib module that creates a variety of range types like
PERIOD, PERIODTZ, INT4RANGE, NUMRANGE, etc. In order to support some of
the other features I intend to work on, such as a range join (e.g.
temporal join), we would need to mark an operator family to know that it
conforms to a certain strategy number convention. See:

http://archives.postgresql.org/pgsql-hackers/2009-10/msg01769.php
http://archives.postgresql.org/pgsql-hackers/2009-10/msg01441.php

2. Implement some kind of ANYRANGE type and associated operators; and
provide a way to define new range types by providing the base type,
difference type (e.g. for TIMESTAMP, the difference type would be
INTERVAL) and a couple support functions. This might be more flexible,
and it would obviate the need for marking operator families.

If the second approach seems promising, we can hammer out a real
proposal and see if it's viable.

Regards,Jeff Davis



Re: Range types

От
Robert Haas
Дата:
On Mon, Dec 14, 2009 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's been too long since college math classes for me to be sure whether
> "discrete" is really the exact term here.  But I'm even more suspicious
> of "countable".  I think a suitable diagonalization argument might show
> that strings are countable.  That's getting a bit off-topic though...

It's actually a dovetailing argument, not a diagonalization argument,
but yes, the set of strings is most certainly countable.

...Robert (former CS theory teaching assistant)


Re: Range types

От
Jeff Davis
Дата:
On Mon, 2009-12-14 at 14:23 -0500, Tom Lane wrote:
> > We can ask the user to provide a prior() and next() function, and if
> > they aren't provided, we assume it's continuous.
> 
> Well, that still leaves us with the problem that Joe Schmo will file
> a bug when "create function next(float4) returns float4 as
> $$ select $1 + 0.00001 $$" doesn't behave sanely for him.  I'd prefer
> not to leave it to the user to decide whether a type is discrete or
> not.  The traffic on pgsql-bugs is convincing evidence that a very
> large fraction of our user-base doesn't understand that floats are
> inexact :-(

I don't know how we can decide such a thing. Do you have any ideas? 

Perhaps we can compromise and restrict the support functions to C? That
might be a high-enough wall, and of course it would keep non-superusers
from confusing the underlying mechanism.

Regards,Jeff Davis



Re: Range types

От
Andrew Dunstan
Дата:

Tom Lane wrote:
>> We can ask the user to provide a prior() and next() function, and if
>> they aren't provided, we assume it's continuous.
>>     
>
> Well, that still leaves us with the problem that Joe Schmo will file
> a bug when "create function next(float4) returns float4 as
> $$ select $1 + 0.00001 $$" doesn't behave sanely for him.  I'd prefer
> not to leave it to the user to decide whether a type is discrete or
> not.  The traffic on pgsql-bugs is convincing evidence that a very
> large fraction of our user-base doesn't understand that floats are
> inexact :-(
>   

Indeed.

>   
>> I think "countable" is a more accurate word than "discrete". Strings are
>> discrete but not countable.
>>     
>
> It's been too long since college math classes for me to be sure whether
> "discrete" is really the exact term here.  But I'm even more suspicious
> of "countable".  I think a suitable diagonalization argument might show
> that strings are countable.  That's getting a bit off-topic though...
>
>             
>   

Right, I don't think strings are any more or less countable than 
integers. (and yes, it's a bit OT).

Surely the issue from our POV is whether, given two distinct members of 
a class, we can ever say there is not any intervening member of the 
class according to some ordering. If we can't then next() and prior() 
make no sense for that class.

cheers

andrew


Re: Range types

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2009-12-14 at 14:23 -0500, Tom Lane wrote:
>> I'd prefer not to leave it to the user to decide whether a type is
>> discrete or not.

> I don't know how we can decide such a thing. Do you have any ideas? 

If the only interesting use-cases are ints and enums, maybe we could
just hard-wire it.
        regards, tom lane


Re: Range types

От
Tom Lane
Дата:
Andrew Dunstan <andrew@dunslane.net> writes:
> Surely the issue from our POV is whether, given two distinct members of 
> a class, we can ever say there is not any intervening member of the 
> class according to some ordering. If we can't then next() and prior() 
> make no sense for that class.

We would also like to be sure that the answer is not machine-dependent.
For example, you can make such an assertion for two normalized floats
that differ by 1 unit in the last place --- but with a different float
implementation the answer could be different, and anyway a lot of the
really serious problems with float ranges would stem from the range
boundary value probably not being exactly what the user thinks it is.

Also, it's not just "some ordering" that's of interest, it's the natural
one for the datatype.  The reason the countability of strings isn't
relevant to us here is that diagonalization or dovetailing counts them
in an ordering that people wouldn't want in practice.

But that brings up something that actually is interesting: should the
range mechanism have a way to identify which btree opclass is considered
to define the type's sort order?  Or is it enough to provide ranges in
the type's default ordering?
        regards, tom lane


Re: Range types

От
Nathan Boley
Дата:
> Right, I don't think strings are any more or less countable than integers.
> (and yes, it's a bit OT).

Well, if I remember my algebra, I think the issue is that integers are
locally finite whereas strings are not ( assuming standard orderings,
of course ).

-Nathan


Re: Range types

От
Tom Lane
Дата:
Scott Bailey <artacus@comcast.net> writes:
> I was referring to the syntax for how the user actually defined an enum 
> not about it's implementation. Basically what I was hoping to get out of 
> this thread was whether it was better to allow the user to define their 
> own range types by specifying the base type and possibly the granularity 
>   and default inclusiveness of the end points, or if we should just 
> provide the types like period and intrange?

If 99% of the usefulness will come from ranges over a fixed set of
datatypes, it might be best to just do that.  That last 1% would be
very expensive to get, when you consider all the infrastructure that
would be involved with an extensible definition.

If we think there's a lot of usefulness for ranges over user-defined
types, then this argument doesn't help ...
        regards, tom lane


Re: Range types

От
Scott Bailey
Дата:
Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> On Mon, 2009-12-14 at 14:23 -0500, Tom Lane wrote:
>>> I'd prefer not to leave it to the user to decide whether a type is
>>> discrete or not.
> 
>> I don't know how we can decide such a thing. Do you have any ideas? 
> 
> If the only interesting use-cases are ints and enums, maybe we could
> just hard-wire it.

I think dates could be added to that list as well. But any 
implementation that doesn't do ranges of timestamptz are non-starters as 
far as I'm concerned. Certainly int64 timestamps and numeric are doable. 
And Jeff's period implementation supports float timestamps. I never use 
float timestamps so I can only assume that he made it work.

Personally, I'd rather just see float timestamps go away. And if the 
range types never supported float or float timestamps, I'd be ok with that.

Scott


Re: Range types

От
Tom Lane
Дата:
Scott Bailey <artacus@comcast.net> writes:
> Tom Lane wrote:
>> If the only interesting use-cases are ints and enums, maybe we could
>> just hard-wire it.

> I think dates could be added to that list as well.

Good point.  Network addresses too probably.

> But any implementation that doesn't do ranges of timestamptz are
> non-starters as far as I'm concerned.

Oh, I quite agree --- I'm just complaining about trying to force
timestamps into a discrete model that they don't fit.  What I was trying
to suggest was that we could hard-wire a mechanism that says ints and a
few other predetermined cases are discrete while everything else is
treated as continuous.

> Personally, I'd rather just see float timestamps go away.

That's more or less irrelevant to my point.  A large fraction of the
datatypes in Postgres do not have discrete behavior.  Reals, numerics,
timestamps, strings, bitstrings, geometrics.  Not to mention arrays and
composites.  Creating an artificial granularity is simply the wrong way
to approach it, even for those types where there's an implementation
artifact that allows you to make it sort-of-work.
        regards, tom lane


Re: Range types

От
tomas@tuxteam.de
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, Dec 14, 2009 at 01:32:08PM -0500, Tom Lane wrote:

[...]

> (Also, stuff like strings simply doesn't have any sane concept of a
> unique next or previous value.

If you are willing to limit the length, then yes, you could consider
them discrete too, but...

>                               I think the number of types that are
> really discrete in this sense is very small, like maybe just ints and
> enums.)

...I do agree that ranges over continuous types are the more
"interesting"[1] (and possibly more useful) beast.

- ---------
[11] Unfortunaltel they could turn out to be "interesting" in the sense    of "may you live in interresting times" ;-)

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLJx3xBcgs9XrR2kYRAn10AJ9f/MQiz45LS7ogsRmMXpawcOWSfgCggkWG
gFev/SS09O+IOO+FB3shav0=
=KwxB
-----END PGP SIGNATURE-----


Re: Range types

От
tomas@tuxteam.de
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, Dec 14, 2009 at 11:09:16AM -0800, Jeff Davis wrote:

[...]

> I think "countable" is a more accurate word than "discrete". Strings are
> discrete but not countable.

Oh, no -- strings (of finite, but arbitrary length) are not discrete --
you can always squeeze one more between two given strings. In this sense
there are quite similar to rational numbers. Can we call them
continuous? -- it depends, it seems that the terminology here isn't
consistent: sometimes the rationals are considered continuous (as per
the property above mentioned), sometimes the reals (which are a much
more monstrous construct!) are referred to as "the continuum".

As Robert points out, they are countable; you'd need infinite length
for them to be more than that (then they would behave a bit like the
reals, Cantor diagonal and all that ;-)

All that said, it's moot: in computers, we can't represent strings of
arbitrary length (PostgreSQL has an upper limit of about 1GB, right?).
The situation is even more restricted with floats (they are much
smaller; thus one could say that they're more "discrete" than strings,
even). Problem with floats is -- the granule is not the "same size" over
the whole range (nasty), and it's all implementation-dependent
(nastier). But given an implementation, the concept of "next" and
"previous" on floats is (if you give me some slack with NANs) mostly
well-defined. Same with strings (up-to) some fixed length.

Still, it seems non-discrete is a useful abstraction?

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLJzqSBcgs9XrR2kYRAoydAJ9uUYt4aTj+BjuJv9XtDIU7UAAFjwCbBBSv
gkw3a6oTqGOoQBHiuZjcJvQ=
=l9YV
-----END PGP SIGNATURE-----


Re: Range types

От
Greg Stark
Дата:
On Tue, Dec 15, 2009 at 7:28 AM,  <tomas@tuxteam.de> wrote:
> The situation is even more restricted with floats (they are much
> smaller; thus one could say that they're more "discrete" than strings,
> even). Problem with floats is -- the granule is not the "same size" over
> the whole range (nasty), and it's all implementation-dependent
> (nastier). But given an implementation, the concept of "next" and
> "previous" on floats is (if you give me some slack with NANs) mostly
> well-defined

In fact, as I only recently found out, one of the design goals of IEEE
floats was specifically that they sort lexicographically and use every
bit pattern. So you can alwys get the "next" float by just
incrementing your float as an 64-bit integer. Yes that raises your
value by a different amount, and it's still useful.

-- 
greg


Re: Range types

От
tomas@tuxteam.de
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, Dec 15, 2009 at 01:09:02PM +0000, Greg Stark wrote:

[...]

> In fact, as I only recently found out, one of the design goals of IEEE
> floats was specifically that they sort lexicographically and use every
> bit pattern. So you can alwys get the "next" float by just
> incrementing your float as an 64-bit integer. Yes that raises your
> value by a different amount, and it's still useful.

Didn't know that -- thanks :-)

(and as Andrew Dunstan pointed out off-list: I was wrong with my bold
assertion that one can squeeze infinitely many (arbitrary length)
strings between two given. This is not always the case).

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLJ5i+Bcgs9XrR2kYRAqtUAJ0VHeUd7q/+Xr9H+Clbr2E0HVV3mgCdFXZM
/EPDk1B+M2uP6/Lqr50Rv4k=
=XICC
-----END PGP SIGNATURE-----


Re: Range types

От
Tom Lane
Дата:
tomas@tuxteam.de writes:
> (and as Andrew Dunstan pointed out off-list: I was wrong with my bold
> assertion that one can squeeze infinitely many (arbitrary length)
> strings between two given. This is not always the case).

Really?  If the string length is unbounded I think you were right.
        regards, tom lane


Re: Range types

От
Tom Lane
Дата:
Greg Stark <gsstark@mit.edu> writes:
> In fact, as I only recently found out, one of the design goals of IEEE
> floats was specifically that they sort lexicographically and use every
> bit pattern. So you can alwys get the "next" float by just
> incrementing your float as an 64-bit integer. Yes that raises your
> value by a different amount, and it's still useful.

There are certainly some low-level numerical analysis situations where
you want to get the "next" float value, but that hardly constitutes
an argument for treating ranges of floats as discrete rather than
continuous.  Normal users of a range datatype aren't going to be
interested in dealing with that sort of inherently machine-specific
behavior.

Even in types where next/previous are well defined, I'm not that
comfortable with having range operations depend on them.  What happens
when one end of your range is INT_MIN or INT_MAX?
        regards, tom lane


Re: Range types

От
Robert Haas
Дата:
On Tue, Dec 15, 2009 at 9:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> In fact, as I only recently found out, one of the design goals of IEEE
>> floats was specifically that they sort lexicographically and use every
>> bit pattern. So you can alwys get the "next" float by just
>> incrementing your float as an 64-bit integer. Yes that raises your
>> value by a different amount, and it's still useful.
>
> There are certainly some low-level numerical analysis situations where
> you want to get the "next" float value, but that hardly constitutes
> an argument for treating ranges of floats as discrete rather than
> continuous.  Normal users of a range datatype aren't going to be
> interested in dealing with that sort of inherently machine-specific
> behavior.

Yeah, I don't think we want to base this feature on something that
arcane.  I also have to say that I'm very skeptical of the argument
that there is only a small list of types people will want this for.  I
don't think it's going to turn out to be all that small.

...Robert


Re: Range types

От
Nicolas Barbier
Дата:
2009/12/15 Tom Lane <tgl@sss.pgh.pa.us>:

> tomas@tuxteam.de writes:
>
>> (and as Andrew Dunstan pointed out off-list: I was wrong with my bold
>> assertion that one can squeeze infinitely many (arbitrary length)
>> strings between two given. This is not always the case).
>
> Really?  If the string length is unbounded I think you were right.

Assuming lexicographical ordering (first different character
determines order; end-of-string is sorted before anything else),
consider the following two strings:

<whatever>

and

<same whatever as before> + the character with the lowest value in
lexicographical ordering.

I don't think it is possible to get anything in between those two strings.

Nicolas


Re: Range types

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> I also have to say that I'm very skeptical of the argument
> that there is only a small list of types people will want this for.

I'm not sure that anyone has argued that.  I did suggest that there
might be a small list of types for which we should provide discrete
behavior (ie, with next/previous functions) and the rest could have
continuous behavior (without that assumption).  But I quite agree
that we want both types of ranges.
        regards, tom lane


Re: Range types

От
Robert Haas
Дата:
On Tue, Dec 15, 2009 at 10:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I also have to say that I'm very skeptical of the argument
>> that there is only a small list of types people will want this for.
>
> I'm not sure that anyone has argued that.  I did suggest that there
> might be a small list of types for which we should provide discrete
> behavior (ie, with next/previous functions) and the rest could have
> continuous behavior (without that assumption).  But I quite agree
> that we want both types of ranges.

Oh, I think you're right.  I guess I'm skeptical that the set for
which discrete treatment is appropriate is a small, fixed set, too.
Unless hard-coding that assumption buys us some really significant
economies, I think we should avoid doing so.

...Robert


Re: Range types

От
"Florian G. Pflug"
Дата:
On 15.12.09 15:52 , Tom Lane wrote:
> tomas@tuxteam.de writes:
>> (and as Andrew Dunstan pointed out off-list: I was wrong with my
>> bold assertion that one can squeeze infinitely many (arbitrary
>> length) strings between two given. This is not always the case).
>
> Really?  If the string length is unbounded I think you were right.

One example is "a" and "aa" (assuming "a" is minimal character in your
alphabet). The general case is the strings A and Aaaaaaa...a I think -
it doesn't get any more exciting than this.

This *is* a bit surprising, since one usually assumes that the ordering
of strings and reals is fairly similar, since both are lexical.

But note that the mapping of strings into the reals this intuition is
based on (simply prefix a the string with "0." and interpret as a real,
or something similar if the alphabet isn't {0,1}) isn't one-to-one - the
strings "1", "10", "100", ... are all mapped to the *same* real number 0.1

So for reals, the statement is reduced to the trivial fact that for
every x there is no y with x < y < x. Which is of course true..

best regards,
Florian Pflug


Re: Range types

От
Tom Lane
Дата:
Nicolas Barbier <nicolas.barbier@gmail.com> writes:
> Assuming lexicographical ordering (first different character
> determines order; end-of-string is sorted before anything else),
> consider the following two strings:
> <whatever>
> and
> <same whatever as before> + the character with the lowest value in
> lexicographical ordering.

> I don't think it is possible to get anything in between those two strings.

OK, point taken.  But in the real world, many locales use
non-lexicographical ordering.

In practice, even if you are willing to grant a maximum string
length, it is tough enough to find a string just a bit greater
than a given string, and damn near impossible to promise that
there will be no strings between.  We learned that the hard way
trying to make LIKE prefix-match indexing work in non-C locales.
So the long and the short of it is that next/previous are not
going to work for string types, maxlength or no maxlength.
        regards, tom lane


Re: Range types

От
Andrew Dunstan
Дата:

Tom Lane wrote:
> So the long and the short of it is that next/previous are not
> going to work for string types, maxlength or no maxlength.
>
>             
>   

Even more importantly, I strongly doubt they would be of the slightest 
practical value.

cheers

andrew


Re: Range types

От
Jeff Davis
Дата:
On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote:
> I'm not sure that anyone has argued that.  I did suggest that there
> might be a small list of types for which we should provide discrete
> behavior (ie, with next/previous functions) and the rest could have
> continuous behavior (without that assumption).  But I quite agree
> that we want both types of ranges.

It seems like we're moving toward treating TIMESTAMP as continuous.

If I'm correct, continuous ranges always need two extra bits of storage
for the exclusivity. But for timestamps, that means 16 bytes (2 x 8-byte
timestamp) turns into 17 bytes, which is really more like 20 or 24 bytes
with alignment.

Considering that these are likely to be used for audit or history
tables, 8 bytes of waste (50%) seems excessive -- especially when
treating them as discrete seems to work pretty well, at least for the
int64 timestamps.

Ideas?

Regards,Jeff Davis



Re: Range types

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> If I'm correct, continuous ranges always need two extra bits of storage
> for the exclusivity. But for timestamps, that means 16 bytes (2 x 8-byte
> timestamp) turns into 17 bytes, which is really more like 20 or 24 bytes
> with alignment.

You probably need some flag bits anyway, so flailing frantically to
avoid that doesn't seem like a profitable use of time.

One pretty obvious use for a flag bit is open-ended ranges, ierange(something, infinity)
You could only do this without a flag bit if the underlying datatype
has an "infinity" value, which not all do.

I'm also wondering what null range boundaries would do.  Maybe that's
the same as the infinity case, or maybe not.
        regards, tom lane


Re: Range types

От
Scott Bailey
Дата:
Jeff Davis wrote:
> On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote:
>> I'm not sure that anyone has argued that.  I did suggest that there
>> might be a small list of types for which we should provide discrete
>> behavior (ie, with next/previous functions) and the rest could have
>> continuous behavior (without that assumption).  But I quite agree
>> that we want both types of ranges.
> 
> It seems like we're moving toward treating TIMESTAMP as continuous.
> 
> If I'm correct, continuous ranges always need two extra bits of storage
> for the exclusivity. But for timestamps, that means 16 bytes (2 x 8-byte
> timestamp) turns into 17 bytes, which is really more like 20 or 24 bytes
> with alignment.
> 
> Considering that these are likely to be used for audit or history
> tables, 8 bytes of waste (50%) seems excessive -- especially when
> treating them as discrete seems to work pretty well, at least for the
> int64 timestamps.

Would it be OK if we handled float timestamp ranges as continuous and 
int64 timestamps discrete? You effectively lose the ability to build 
non-contiguous sets with continuous ranges. Which is integral to the 
work I'm doing (union, intersect, coalesce and minus sets of ranges)

As for the extra bits, would it be better to just require continuous 
ranges to be either [] or [)? But I don't know which would be preferred. 
My inclination would be toward [), but Tom seemed to indicate that 
perhaps [] was the norm.

Scott


Re: Range types

От
David Fetter
Дата:
On Tue, Dec 15, 2009 at 11:31:05AM -0800, Scott Bailey wrote:
> Jeff Davis wrote:
> >On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote:
> 
> Would it be OK if we handled float timestamp ranges as continuous
> and int64 timestamps discrete?

That sounds like a recipe for disaster.  Whatever timestamp ranges
are, float and int64 should be treated the same way so as not to get
"surprises" due to implementation details.

> You effectively lose the ability to build non-contiguous sets with
> continuous ranges. Which is integral to the work I'm doing (union,
> intersect, coalesce and minus sets of ranges)
> 
> As for the extra bits, would it be better to just require continuous
> ranges to be either [] or [)? But I don't know which would be
> preferred. My inclination would be toward [), but Tom seemed to
> indicate that perhaps [] was the norm.

[] makes certain operations--namely the important ones in
calendaring--impossible, or at least incredibly kludgy, to do.  I
think we ought to leave openness at each end up to the user,
independent of the underlying implementation details.

FWIW, I think it would be a good idea to treat timestamps as
continuous in all cases.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Range types

От
Tom Lane
Дата:
David Fetter <david@fetter.org> writes:
> On Tue, Dec 15, 2009 at 11:31:05AM -0800, Scott Bailey wrote:
>> As for the extra bits, would it be better to just require continuous
>> ranges to be either [] or [)? But I don't know which would be
>> preferred. My inclination would be toward [), but Tom seemed to
>> indicate that perhaps [] was the norm.

> [] makes certain operations--namely the important ones in
> calendaring--impossible, or at least incredibly kludgy, to do.  I
> think we ought to leave openness at each end up to the user,
> independent of the underlying implementation details.

Yes.  A range implementation that couldn't support all four cases
of [], [), (], () would be seriously crippled IMO.
        regards, tom lane


Re: Range types

От
Jeff Davis
Дата:
On Tue, 2009-12-15 at 13:15 -0500, Tom Lane wrote:
> You probably need some flag bits anyway, so flailing frantically to
> avoid that doesn't seem like a profitable use of time.

I think "need" and "flailing" are both a little too strong here. The
biggest use case will almost certainly be ranges of timestamps, and most
of those people will have no use for flag bits (or if they do, it might
not be worth an 8-byte-per-value overhead).

> One pretty obvious use for a flag bit is open-ended ranges, ie
>     range(something, infinity)
> You could only do this without a flag bit if the underlying datatype
> has an "infinity" value, which not all do.

True, but frustrating for people whose primary use case is timestamps or
floats, which already have 'infinity'. Also frustrating for people who
don't mind using the min/max integer as a workaround.

> I'm also wondering what null range boundaries would do.  Maybe that's
> the same as the infinity case, or maybe not.

I would prefer to avoid allowing NULL range boundaries for the following
reasons: * it reminds me of MySQL dates with zeros in them * we are not guided by any kind of standard * we'd have to
inventsemantics that are pretty far outside of those    defined for mathematical intervals * we could easily create
moreconfusion or allow subtle traps for   users * we aren't guided by a clear use case (or I haven't seen it) that
isn'tequally solvable (or better) using some other method  * would impose that extra 8-byte storage burden on people
whomay not   need any flag bits
 

Regards,Jeff Davis





Re: Range types

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> I think "need" and "flailing" are both a little too strong here. The
> biggest use case will almost certainly be ranges of timestamps, and most
> of those people will have no use for flag bits (or if they do, it might
> not be worth an 8-byte-per-value overhead).

When the alternatives are a crippled implementation that might not do
what I want at all, or a full-featured implementation that takes another
8 bytes per value, I'll take the second.  The 8 bytes don't matter if it
doesn't solve my problem.

> I would prefer to avoid allowing NULL range boundaries for the following
> reasons:
>   * it reminds me of MySQL dates with zeros in them

If we use that notation to represent an open-ended interval, it seems
perfectly reasonable to me.  And it doesn't require any assumptions
about whether the underlying type has an infinity.

I think it's a seriously bad idea to tell people that they should depend
on min or max values of a datatype to substitute for the lack of
open-ended intervals.  That sort of technique creates unnecessary
implementation dependencies, and magic numbers (especially ones with a
dozen or two digits in them) are bad things for readability in any case.

To take just one example that's pretty near at hand: if type date had
had an exact published max value that applications were hard-wiring into
their code, we'd not have been able to change it to add 'infinity' as a
special value.
        regards, tom lane


Re: Range types

От
Scott Bailey
Дата:
David Fetter wrote:
> On Tue, Dec 15, 2009 at 11:31:05AM -0800, Scott Bailey wrote:
>> Jeff Davis wrote:
>>> On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote:
>> Would it be OK if we handled float timestamp ranges as continuous
>> and int64 timestamps discrete?
> 
> That sounds like a recipe for disaster.  Whatever timestamp ranges
> are, float and int64 should be treated the same way so as not to get
> "surprises" due to implementation details.
> 
>> You effectively lose the ability to build non-contiguous sets with
>> continuous ranges. Which is integral to the work I'm doing (union,
>> intersect, coalesce and minus sets of ranges)
>>
>> As for the extra bits, would it be better to just require continuous
>> ranges to be either [] or [)? But I don't know which would be
>> preferred. My inclination would be toward [), but Tom seemed to
>> indicate that perhaps [] was the norm.
> 
> [] makes certain operations--namely the important ones in
> calendaring--impossible, or at least incredibly kludgy, to do.  I
> think we ought to leave openness at each end up to the user,
> independent of the underlying implementation details.
> 
> FWIW, I think it would be a good idea to treat timestamps as
> continuous in all cases.

Ok, let me give an example of what we can do with the current 
implementations that would not be possible with timestamps if we 
implement as suggested. Jeff's implementation uses a 1 microsecond step 
size or granule. And my implementation uses an interval step size and 
can be configured database wide, but default is 1 second.

The function below takes two period arrays that can have overlapping and 
adjacent elements. It subtracts all values in pa1 that intersect with 
values in pa2. So perhaps pa1 is all of your work shifts for the month 
and pa2 is a combination of your leave and holidays. The result is a 
coalesced non-contiguous set of the times you would actually be working. 
But to do this kind of thing you need to be able to determine prior, 
first, last and next. I need an implementation that can do this for 
timestamps and not just ints and dates.

CREATE OR REPLACE FUNCTION period_minus(   pa1  IN period[],   pa2  IN period[]
) RETURNS period[] AS
$$    SELECT array_agg(prd)    FROM (        SELECT period((t_in).start_time,            MIN((t_out).end_time)) AS prd
     FROM (            SELECT DISTINCT first(p) AS start_time            FROM unnest($1) p            WHERE NOT
contains($2,first(p))            AND NOT contains($1, prior(p))
 
            UNION
            SELECT DISTINCT next(p)            FROM unnest($2) p            WHERE contains($1, next(p))            AND
NOTcontains($2, next(p))        ) t_in        JOIN (            SELECT next(p) AS end_time            FROM unnest($1) p
          WHERE NOT contains($1, next(p))
 
            UNION ALL
            SELECT first(p)            FROM unnest($2) p            WHERE contains($1, first(p))              AND NOT
contains($2,prior(p))        ) t_out ON t_in.start_time < t_out.end_time        GROUP BY t_in.start_time        ORDER
BYt_in.start_time    ) sub;
 
$$ LANGUAGE 'sql' IMMUTABLE STRICT;


Re: Range types

От
Jeff Davis
Дата:
On Tue, 2009-12-15 at 11:49 -0800, David Fetter wrote:
> That sounds like a recipe for disaster.  Whatever timestamp ranges
> are, float and int64 should be treated the same way so as not to get
> "surprises" due to implementation details.

It's a compile-time option that will change the semantics of timestamps
regardless. Anyone who changes between float and int64 timestamps may
experience problems for a number of different reasons. What was unique
before might no longer be unique, for instance.

> FWIW, I think it would be a good idea to treat timestamps as
> continuous in all cases.

I disagree. There is a lot of value in treating timestamp ranges as
discrete.

One big reason is that the ranges can be translated between the
different input/output forms, and there's a canonical form. As we know,
a huge amount of the value in an RDBMS is unifying data from multiple
applications with different conventions.

So, let's say one application uses (] and another uses [). If you are
mixing the data and returning it to the application, you want to be able
to provide the result according to its convention. You can't do that
with a continuous range.

And things get more interesting: if you mix (] and [), then range_union
will produce () and range_intersect will produce []. So now you have all
four conventions floating around the same database.

Another reason you might mix conventions: say you have log data from
several sources, and some sources provide timestamps for an event which
is essentially "instantaneous" and other sources will log the period of
time over which the event occurred, in [) format. To mix the data
coherently, the correct thing to do is call the instantaneous points a
singleton range; but the only way to represent a singleton continuous
range is by using [].

Whenever you mix conventions, you either have to be able to change the
format (which is only possible with discrete ranges) or teach the
application to understand your convention. And if you don't have a
canonical form (which is only possible with discrete ranges), you can't
reliably compare values for equality, or see if they are adjacent.

Saying that discrete ranges are unnecessary is essentially saying that
there's only a use for one convention; or that the conventions will
never be mixed; or that applications will always be smart enough to sort
it out for themselves.

Regards,Jeff Davis



Re: Range types

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Tue, 2009-12-15 at 11:49 -0800, David Fetter wrote:
>> FWIW, I think it would be a good idea to treat timestamps as
>> continuous in all cases.

> I disagree. There is a lot of value in treating timestamp ranges as
> discrete.

> One big reason is that the ranges can be translated between the
> different input/output forms, and there's a canonical form. As we know,
> a huge amount of the value in an RDBMS is unifying data from multiple
> applications with different conventions.

Actually, that is exactly one of the reasons why what you propose is
a *bad* idea.  You want to institutionalize application dependence on
a non-portable implementation detail, namely the granularity of machine
representation of what's in principle a continuous value.  That's one
of the fastest routes to non-unifiable data I can think of.

> So, let's say one application uses (] and another uses [). If you are
> mixing the data and returning it to the application, you want to be able
> to provide the result according to its convention. You can't do that
> with a continuous range.

The above is nonsense.  [1,2) and [1,2] are simply different objects.
A design that assumes that it is always possible to replace one by
the other is broken so badly it's not even worth discussing.

The only reason you'd have applications that fail to handle both open
and closed intervals would be if someone were to create an
implementation that didn't support both from the outset.  Which we
need not and should not do.

> And things get more interesting: if you mix (] and [), then range_union
> will produce () and range_intersect will produce []. So now you have all
> four conventions floating around the same database.

Which is why it's a good idea to support all four...
        regards, tom lane


Re: Range types

От
Tom Lane
Дата:
Scott Bailey <artacus@comcast.net> writes:
> Ok, let me give an example of what we can do with the current 
> implementations that would not be possible with timestamps if we 
> implement as suggested. ...
> The function below takes two period arrays that can have overlapping and 
> adjacent elements. It subtracts all values in pa1 that intersect with 
> values in pa2. So perhaps pa1 is all of your work shifts for the month 
> and pa2 is a combination of your leave and holidays. The result is a 
> coalesced non-contiguous set of the times you would actually be working. 

The proposed problem is certainly soluble without any assumptions
of discreteness.  The answer might not look very much like the way
you chose to code it here, but that's not an argument for adopting
a fundamentally incorrect worldview.  If this were an amazingly
short and beautiful piece of code, it might support your argument,
but it's neither.
        regards, tom lane


Re: Range types

От
Scott Bailey
Дата:
Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> On Tue, 2009-12-15 at 11:49 -0800, David Fetter wrote:
>>> FWIW, I think it would be a good idea to treat timestamps as
>>> continuous in all cases.
> 
>> I disagree. There is a lot of value in treating timestamp ranges as
>> discrete.
> 
>> One big reason is that the ranges can be translated between the
>> different input/output forms, and there's a canonical form. As we know,
>> a huge amount of the value in an RDBMS is unifying data from multiple
>> applications with different conventions.
> 
> Actually, that is exactly one of the reasons why what you propose is
> a *bad* idea.  You want to institutionalize application dependence on
> a non-portable implementation detail, namely the granularity of machine
> representation of what's in principle a continuous value.  That's one
> of the fastest routes to non-unifiable data I can think of.
> 
>> So, let's say one application uses (] and another uses [). If you are
>> mixing the data and returning it to the application, you want to be able
>> to provide the result according to its convention. You can't do that
>> with a continuous range.
> 
> The above is nonsense.  [1,2) and [1,2] are simply different objects.
> A design that assumes that it is always possible to replace one by
> the other is broken so badly it's not even worth discussing.

I don't hear anyone arguing that. But you should be able to convert 
between [1,2], [1,3), (0,3) and (0,2].

> The only reason you'd have applications that fail to handle both open
> and closed intervals would be if someone were to create an
> implementation that didn't support both from the outset.  Which we
> need not and should not do.
> 
>> And things get more interesting: if you mix (] and [), then range_union
>> will produce () and range_intersect will produce []. So now you have all
>> four conventions floating around the same database.
> 
> Which is why it's a good idea to support all four...

I don't understand you then. Where do you suppose we would define the 
inclusiveness for the value? At the type level, at the column level, or 
at the value level? A design that allows values of different 
inclusiveness and offers no means to convert from one to another is 
worthless.


Re: Range types

От
Scott Bailey
Дата:
 > If this were an amazingly
> short and beautiful piece of code, it might support your argument,
> but it's neither.

Well we can't all be arrogant brainiacs.



Re: Range types

От
Tom Lane
Дата:
I wrote:
> The proposed problem is certainly soluble without any assumptions
> of discreteness.

To be concrete, I think it could be approached like this:

Assume the datatype provides a built-in function
period_except(p1 period, p2 period) returns setof period

which can return zero, one, or two rows depending on the inputs:

no rows if p1 is completely contained in p2

one row if p1 partially overlaps p2, for example:
[1,4] except [3,5] returns [1,3)[4,6] except [1,5) returns [5,6]

two rows if p1 properly contains p2, for example
[1,10] except [4,5] returns [1,4) and (5,10][1,10] except [9,10) returns [1,9) and [10,10]

and of course just p1 if p1 and p2 don't overlap at all.

Given such a function it's a simple matter of successively removing each
element of p2[] from the set representing the current members of p1[].
The way that I'd find most natural to code that is a loop, along the
lines of
foreach p2_member in unnest(p2) loop  p1 := array(select period_except(p1_member, p2_member)              from
unnest(p1)p1_member);end loop;
 

But maybe it can be done in a single SQL command.

As this example makes clear, when dealing with continuous intervals you
*must* admit both open and closed intervals, else you don't have a way
to represent the results of "except".  Maybe part of the failure to
communicate here arises from your desire to try to avoid supporting both
kinds of intervals.  But I think you really have to do it if you want to
deal with data that hasn't got any natural granularity.
        regards, tom lane


Re: Range types

От
Jeff Davis
Дата:
On Tue, 2009-12-15 at 17:17 -0500, Tom Lane wrote:
> Actually, that is exactly one of the reasons why what you propose is
> a *bad* idea.  You want to institutionalize application dependence on
> a non-portable implementation detail, namely the granularity of machine
> representation of what's in principle a continuous value.  That's one
> of the fastest routes to non-unifiable data I can think of.

Based on the premise that timestamps are a continuous value and the
granularity/precision is entirely an implementation detail, you're
right. But I disagree with the premise, at least in some cases that I
think are worthwhile. See reference [1].

> The above is nonsense.  [1,2) and [1,2] are simply different objects.
> A design that assumes that it is always possible to replace one by
> the other is broken so badly it's not even worth discussing.

I don't understand this point at all. [1,2) and [1,2] are different
values. Of course they are not interchangeable.

If you think I'm proposing that we drop inclusivity/exclusivity before
telling the application, that's not what I'm proposing at all. I'm
proposing that, at least in some circumstances, it's important to be
able to display the same value in different formats -- e.g. [1, 3) or
[1, 2], depending on what the application expects. Similar to a timezone
adjustment.

Regards,Jeff Davis

[1] "Temporal Data and the Relational Model" by C.J. Date, et al., uses
discrete time throughout the entire book, aside from a brief discussion
at the beginning.



Re: Range types

От
Josh Berkus
Дата:
On 12/15/09 2:40 PM, Scott Bailey wrote:
>> If this were an amazingly
>> short and beautiful piece of code, it might support your argument,
>> but it's neither.
> 
> Well we can't all be arrogant brainiacs.

Tom, Scott,

Let's keep it constructive here.  Thanks!

--Josh



Re: Range types

От
decibel
Дата:
On Dec 15, 2009, at 5:40 PM, Jeff Davis wrote:
> If you think I'm proposing that we drop inclusivity/exclusivity before
> telling the application, that's not what I'm proposing at all. I'm
> proposing that, at least in some circumstances, it's important to be
> able to display the same value in different formats -- e.g. [1, 3) or
> [1, 2], depending on what the application expects. Similar to a timezone
> adjustment.

I think it would help the discussion if you could provide some real examples. I suspect you're thinking of things like
schedulingapps, where it's important to be able to do things like "what's the next available time slot?". There are
probablyways to make that kind of thing easier without resorting to discrete time. 

> [1] "Temporal Data and the Relational Model" by C.J. Date, et al., uses
> discrete time throughout the entire book, aside from a brief discussion
> at the beginning.

I find myself wondering if they were influenced by the technology available at the time...
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Range types

От
decibel
Дата:
On Dec 15, 2009, at 11:34 AM, Jeff Davis wrote:
> On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote:
>> I'm not sure that anyone has argued that.  I did suggest that there
>> might be a small list of types for which we should provide discrete
>> behavior (ie, with next/previous functions) and the rest could have
>> continuous behavior (without that assumption).  But I quite agree
>> that we want both types of ranges.
>
> It seems like we're moving toward treating TIMESTAMP as continuous.
>
> If I'm correct, continuous ranges always need two extra bits of storage
> for the exclusivity. But for timestamps, that means 16 bytes (2 x 8-byte
> timestamp) turns into 17 bytes, which is really more like 20 or 24 bytes
> with alignment.
>
> Considering that these are likely to be used for audit or history
> tables, 8 bytes of waste (50%) seems excessive -- especially when
> treating them as discrete seems to work pretty well, at least for the
> int64 timestamps.
>
> Ideas?

Now that varlena's don't have an enormous fixed overhead, perhaps it's worth looking at using them. Obviously some
operationswould be slower, but for your stated examples of auditing and history, I suspect that you're not going to
noticethe overhead that much. 

I'm not sure if the best way to do this would be to support a varlena timestamp or to take fixed-size timestamps and
convertthem into varlena periods. 
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Range types

От
Christophe Pettus
Дата:
On Dec 15, 2009, at 3:40 PM, Jeff Davis wrote:

> Based on the premise that timestamps are a continuous value and the
> granularity/precision is entirely an implementation detail, you're
> right. But I disagree with the premise, at least in some cases that I
> think are worthwhile.

The argument is, in essence:
DECIMAL is continuous.DECIMAL(10,3) is discrete.

timestamptz in general is a continuous value (unless we're talking  
Planck times :) ).  There is no way for us to guarantee that  
next(timestamptz) will have the same value across all platforms; its  
epsilon is platform dependent.

However, if we specify a scale on timestamptz, it becomes much more  
useful.  Just making up a syntax, if we had timestamptz(milliseconds),  
then it's discrete and we know what next(timestamptz(milliseconds)) is.

But in the current implementation, the only way I can see making that  
work is if we specify a scale for timestamptz, and that strikes me as  
a big change to its semantics.

--
-- Christophe Pettus   xof@thebuild.com



Re: Range types

От
Scott Bailey
Дата:
Tom Lane wrote:
> I wrote:
>> The proposed problem is certainly soluble without any assumptions
>> of discreteness.
> 
> To be concrete, I think it could be approached like this:
> 
> Assume the datatype provides a built-in function
> 
>     period_except(p1 period, p2 period) returns setof period
> 
> which can return zero, one, or two rows depending on the inputs:
> 
> no rows if p1 is completely contained in p2
> 
> one row if p1 partially overlaps p2, for example:
> 
>     [1,4] except [3,5] returns [1,3)
>     [4,6] except [1,5) returns [5,6]
> 
> two rows if p1 properly contains p2, for example
> 
>     [1,10] except [4,5] returns [1,4) and (5,10]
>     [1,10] except [9,10) returns [1,9) and [10,10]
> 
> and of course just p1 if p1 and p2 don't overlap at all.
> 
> Given such a function it's a simple matter of successively removing each
> element of p2[] from the set representing the current members of p1[].
> The way that I'd find most natural to code that is a loop, along the
> lines of
> 
>     foreach p2_member in unnest(p2) loop
>       p1 := array(select period_except(p1_member, p2_member)
>                   from unnest(p1) p1_member);
>     end loop;
> 
> But maybe it can be done in a single SQL command.
> 
> As this example makes clear, when dealing with continuous intervals you
> *must* admit both open and closed intervals, else you don't have a way
> to represent the results of "except".  Maybe part of the failure to
> communicate here arises from your desire to try to avoid supporting both
> kinds of intervals.  But I think you really have to do it if you want to
> deal with data that hasn't got any natural granularity.
> 
>             regards, tom lane

Alright well I'm going to calm down a bit and take a step back. Perhaps 
I'm just too close to the issue and not thinking outside of the box that 
I've built. Let me see if I can make everything work rather than arguing 
why it wont.

Scott


Re: Range types

От
Jeff Davis
Дата:
On Tue, 2009-12-15 at 18:06 -0600, decibel wrote:
> Now that varlena's don't have an enormous fixed overhead, perhaps it's
> worth looking at using them. Obviously some operations would be
> slower, but for your stated examples of auditing and history, I
> suspect that you're not going to notice the overhead that much.

For most varvarlena types, you only get stuck with the full alignment
burden if you get unlucky. In this case, we're moving from 16 bytes to
17, which really means 24 bytes with alignment. Try creating two tables:
 create table foo(i int8, t1 timestamp, t2 timestamp); create table bar(i int8, c "char", t1 timestamp, t2 timestamp);

That extra byte there costs you 8 bytes, every time (on my machine,
anyway).

We're at serious risk of people saying "Ah, this temporal thing is
bloated. I'll try to get by with a single timestamp and save 16 bytes
per record". Or maybe "Why waste the bytes? I'll just store two
timestamps".

Regards,Jeff Davis



Re: Range types

От
Jeff Davis
Дата:
On Tue, 2009-12-15 at 18:03 -0600, decibel wrote:
> I think it would help the discussion if you could provide some real
> examples. I suspect you're thinking of things like scheduling apps,
> where it's important to be able to do things like "what's the next
> available time slot?". There are probably ways to make that kind of
> thing easier without resorting to discrete time.

I did provide some examples:

http://archives.postgresql.org/pgsql-hackers/2009-12/msg01385.php

I think the bottom line is that we can't expect all applications to
understand all 4 inclusivity permutations. Think about how many
applications you've seen with "start" and "end" entry boxes, and you
have to pick up from the surrounding words what inclusivity it's
expecting. If you're running two such applications at once and they
share data (perhaps an entry app and a reporting app), or you're
migrating from one to another, you have a problem if you can't convert
the intervals to the expected format.

I don't see it as "resorting to discrete time". I view discrete
intervals as simpler in many ways: there exists a canonical inclusivity
representation, and you can take that canonical representation and
display it however you want.

We should at least allow the user to specify that their range is
discrete (even if only by a superuser). And the user should also be able
to specify that they don't want null boundaries or extra infinity
boundaries, so that they get back down to the 16-byte representation.

> I find myself wondering if they were influenced by the technology
> available at the time...

No, it's entirely at the logical level. It's also not a very old book
(2002).

For another author that apparently likes to deal with discrete time, how
about "Developing Time-Oriented Database Applications in SQL" by
Snodgrass. It's free to download online, I believe.

Regards,Jeff Davis



Re: Range types

От
Jeff Davis
Дата:
On Tue, 2009-12-15 at 16:08 -0800, Christophe Pettus wrote:
> The argument is, in essence:
> 
>     DECIMAL is continuous.
>     DECIMAL(10,3) is discrete.
> 
> timestamptz in general is a continuous value (unless we're talking  
> Planck times :) ).  There is no way for us to guarantee that  
> next(timestamptz) will have the same value across all platforms; its  
> epsilon is platform dependent.

Not unless you compile with float timestamps. Integer timestamps are
microseconds since the year 2000 (positive or negative), which is
platform-independent.

I'm not arguing that we shouldn't support continuous time at all
(clearly, enough people in this thread seem to like it), but I do want
discrete time ranges. A lot of the temporal database literature is
written assuming discrete time intervals.

> But in the current implementation, the only way I can see making that  
> work is if we specify a scale for timestamptz, and that strikes me as  
> a big change to its semantics.

It's already useful at the microsecond precision level. Also, the
granule could be defined for the range type (as Scott suggested) rather
than the timestamp itself.

Regards,Jeff Davis



Re: Range types

От
Dimitri Fontaine
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:
>     foreach p2_member in unnest(p2) loop
>       p1 := array(select period_except(p1_member, p2_member)
>                   from unnest(p1) p1_member);
>     end loop;
>
> But maybe it can be done in a single SQL command.

Yeah, as soon as you have LATERAL, I think. Without it there's no way to
compose SRF in SQL, AFAIK.

Regards,
-- 
dim


Re: Range types

От
tomas@tuxteam.de
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, Dec 15, 2009 at 04:16:28PM +0100, Nicolas Barbier wrote:

[...]

> <whatever>
> 
> and
> 
> <same whatever as before> + the character with the lowest value in
> lexicographical ordering.
> 
> I don't think it is possible to get anything in between those two strings.

Yes, that was basically Andrew's argumentation. I was taken away too
much by the similarity between strings and (decimal, base N) fractions,
forgetting that the last have lots of unwritten zeros at the end...

But hey, 'twas nice to learn that strings aren't as boring as I first
thought...

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLKN9cBcgs9XrR2kYRAl4LAJ9rHs/mlR3+j+79YOUtNUTCY0JOEwCZAROn
WsIQoT8nCbgCOaDWraH7jVk=
=vjV5
-----END PGP SIGNATURE-----


Re: Range types

От
tomas@tuxteam.de
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, Dec 15, 2009 at 11:49:19AM -0800, David Fetter wrote:
> On Tue, Dec 15, 2009 at 11:31:05AM -0800, Scott Bailey wrote:
> > Jeff Davis wrote:
> > >On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote:
> > 
> > Would it be OK if we handled float timestamp ranges as continuous
> > and int64 timestamps discrete?
> 
> That sounds like a recipe for disaster.  Whatever timestamp ranges
> are, float and int64 should be treated the same way so as not to get
> "surprises" due to implementation details.

This alone would practically preclude discrete -- int and float would
behave quite differently (float's "granules" growing towards the edges
or having to choose a bigger granule for float than for int in the first
place).

[...]

> FWIW, I think it would be a good idea to treat timestamps as
> continuous in all cases.

This would come as a corollary from the above

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLKODXBcgs9XrR2kYRAlpLAJ9nO5f0SHwX8A4CjTn6c/xyZdim1ACdGHTq
Fwn5ygKvCDFGadufOYPGrfA=
=ivCP
-----END PGP SIGNATURE-----


Re: Range types

От
Tom Lane
Дата:
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>     foreach p2_member in unnest(p2) loop
>>         p1 := array(select period_except(p1_member, p2_member)
>>                 from unnest(p1) p1_member);
>>     end loop;
>> 
>> But maybe it can be done in a single SQL command.

> Yeah, as soon as you have LATERAL, I think. Without it there's no way to
> compose SRF in SQL, AFAIK.

Hm, how would you do it with LATERAL?  The problem is not so much
composition as the need for a variable number of rounds of composition.
        regards, tom lane


Re: Range types

От
Jeff Davis
Дата:
On Sun, 2009-12-13 at 23:49 -0800, Scott Bailey wrote:
> So basically I have an anyrange pseudo type with the functions prev, 
> next, last, etc defined. So instead of hard coding range types, we would 
> allow the user to define their own range types. Basically if we are able 
> to determine the previous and next values of the base types we'd be able 
> to define a range type. I'm envisioning in a manner much like defining 
> an enum type.

After an off-list discussion with Scott, I think there may be a solution
here that works for everyone if we don't try so hard to unify the
implementation of discrete and continuous ranges. The API should be very
similar, of course, but the implementation doesn't need to be.

Continuous ranges absolutely require the following information: start,
end, and inclusivity information.

But discrete ranges can instead be stored by counting the number of
granules from the start point. For instance, it could be stored as:
start, num_granules.

That has a lot of benefits for discrete ranges of time. First of all, it
allows the algebra to work reasonably well for the "days" and "months"
part of the interval, so we can allow a granule of 1 day/week/month/year
for a timestamp range. For output of the range, we can then just
multiply the granule by the number of granules, and add that to the
start time; thus avoiding the "incremental addition" problem with date
math. I think this works reasonably well for timestamp/date ranges --
let me know if there is a problem here (aside from timestamptz, which I
address below).

Secondly, in the case of a timestamp range, we can use 7 bytes for
storing the number of granules rather than another full 8-byte
timestamp, leaving one byte for flags to represent NULL boundaries,
infinite boundaries, etc. For timestamps that would still mean that an
interval could be 2000 years long with '1 microsecond' granularity. For
dates, 3 bytes is sufficient for a date range 45000 years long with
granules of '1 day'. That means that we can get back down to a 16 byte
representation for timestamp ranges, or 8 byte representation for date
ranges. There are a few details, like infinite ranges, but those can be
pretty easily solved with flags as well.

There's one problem, and that's for timestamptz ranges with intervals
that include days and months. Timezone adjustments are just not
well-defined for that kind of granule (nor would it be particularly
useful even if it magically worked), so this would have to be blocked
somehow. I think that's a special case, and we could provide the user
with a nice error message telling the user to use a date or timestamp
range instead.

So, the idea is to default to a continuous range type, but if the user
supplies a granule, prior and next functions, and other necessary
details, then it becomes a discrete range type.
* continuous ranges can still have everything that everyone wants,   including flags to indicate special values.*
discreterange granule is specified explicitly, so it's not an   "implementation detail"* discrete ranges can have a
compactrepresentation* discrete ranges would still have room for flags to indicate special   values
 
Comments?

Regards,Jeff Davis



Re: Range types

От
Robert Haas
Дата:
On Wed, Dec 16, 2009 at 12:31 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> There's one problem, and that's for timestamptz ranges with intervals
> that include days and months. Timezone adjustments are just not
> well-defined for that kind of granule (nor would it be particularly
> useful even if it magically worked), so this would have to be blocked
> somehow. I think that's a special case, and we could provide the user
> with a nice error message telling the user to use a date or timestamp
> range instead.

This seems like a fairly special-purpose type.  You'd be targeting it
at people who are very concerned with storing large numbers of these
(so they really care about space consumption) but for some reason
don't need to mix days and months (actually, the current interval
representation stores days, months, and seconds separately).  I
certainly think this might be useful to some people but it doesn't
really sounds like a general range type facility, since it seems to
involve some hacks that are fairly datatype-specific.

...Robert


Re: Range types

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> [ hacky special-case representation for discrete timestamp ranges ]

I'm still not exactly clear on what the use-case is for discrete
timestamp ranges, and I wonder how many people are going to be happy
with a representation that can't handle a range that's open-ended
on the left.

> So, the idea is to default to a continuous range type, but if the user
> supplies a granule, prior and next functions, and other necessary
> details, then it becomes a discrete range type.

Huh?  You're not going to be able to have a special case data
representation for one or two data types at the same time as you have a
function-based datatype-independent concept of a parameterized range
type.  Well, maybe you could have special code paths for just date and
timestamp but it'd be horrid.

More importantly, the notion of a representation granule is still 100%
wishful thinking for any inexact-representation datatype, which is going
to be a severe crimp in getting this accepted for timestamp, let alone
defining it in a way that would allow users to try to apply it to
floats.  Float timestamps might not be the default case anymore but they
are still supported.

I think you should let go of the feeling that you have to shave bytes
off the storage format.  You're creating a whole lot of work for
yourself and a whole lot of user-visible corner cases in return for
what ultimately isn't much.
        regards, tom lane


Re: Range types

От
Scott Bailey
Дата:
Jeff Davis wrote:
> On Sun, 2009-12-13 at 23:49 -0800, Scott Bailey wrote:
>> So basically I have an anyrange pseudo type with the functions prev, 
>> next, last, etc defined. So instead of hard coding range types, we would 
>> allow the user to define their own range types. Basically if we are able 
>> to determine the previous and next values of the base types we'd be able 
>> to define a range type. I'm envisioning in a manner much like defining 
>> an enum type.
> 
> After an off-list discussion with Scott, I think there may be a solution
> here that works for everyone if we don't try so hard to unify the
> implementation of discrete and continuous ranges. The API should be very
> similar, of course, but the implementation doesn't need to be.
> 
> Continuous ranges absolutely require the following information: start,
> end, and inclusivity information.
> 
> But discrete ranges can instead be stored by counting the number of
> granules from the start point. For instance, it could be stored as:
> start, num_granules.
> 
> That has a lot of benefits for discrete ranges of time. First of all, it
> allows the algebra to work reasonably well for the "days" and "months"
> part of the interval, so we can allow a granule of 1 day/week/month/year
> for a timestamp range. For output of the range, we can then just
> multiply the granule by the number of granules, and add that to the
> start time; thus avoiding the "incremental addition" problem with date
> math. I think this works reasonably well for timestamp/date ranges --
> let me know if there is a problem here (aside from timestamptz, which I
> address below).
> 
> Secondly, in the case of a timestamp range, we can use 7 bytes for
> storing the number of granules rather than another full 8-byte
> timestamp, leaving one byte for flags to represent NULL boundaries,
> infinite boundaries, etc. For timestamps that would still mean that an
> interval could be 2000 years long with '1 microsecond' granularity. For
> dates, 3 bytes is sufficient for a date range 45000 years long with
> granules of '1 day'. That means that we can get back down to a 16 byte
> representation for timestamp ranges, or 8 byte representation for date
> ranges. There are a few details, like infinite ranges, but those can be
> pretty easily solved with flags as well.
> 
> There's one problem, and that's for timestamptz ranges with intervals
> that include days and months. Timezone adjustments are just not
> well-defined for that kind of granule (nor would it be particularly
> useful even if it magically worked), so this would have to be blocked
> somehow. I think that's a special case, and we could provide the user
> with a nice error message telling the user to use a date or timestamp
> range instead.
> 
> So, the idea is to default to a continuous range type, but if the user
> supplies a granule, prior and next functions, and other necessary
> details, then it becomes a discrete range type.
> 
>  * continuous ranges can still have everything that everyone wants, 
>    including flags to indicate special values.
>  * discrete range granule is specified explicitly, so it's not an 
>    "implementation detail"
>  * discrete ranges can have a compact representation
>  * discrete ranges would still have room for flags to indicate special 
>    values
>  
> Comments?

As I pointed out off-list, I think the granularity for timestamp range 
should be limited to hours and smaller. Anything larger is asking for 
trouble. And quite honestly if they wanted day granularity, they should 
use date range. Also, I think the granule should be same type as 
returned when subtracting two subtypes. So granule of date range should 
be int not interval. And if user wanted something with month 
granularity, perhaps an enum range of 'YYYYMM' would be better.

Quite honestly the following 3 cases would probably meet 99% of need:
CREATE TYPE period AS RANGE(timestamptz(0), interval '1 s');
CREATE TYPE period AS RANGE(timestamptz(3), interval '1 ms');
CREATE TYPE period AS RANGE(timestamptz, interval '1 us');


Re: Range types

От
Scott Bailey
Дата:
Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> [ hacky special-case representation for discrete timestamp ranges ]
> 
> I'm still not exactly clear on what the use-case is for discrete
> timestamp ranges, and I wonder how many people are going to be happy
> with a representation that can't handle a range that's open-ended
> on the left.

They wouldn't. But the timestamp data would be the anchor, not 
necessarily the start point. As long as we ranges unbounded on both ends 
we'd be ok.


Re: Range types

От
Scott Bailey
Дата:
Ok, silly question here. But how do you determine the length of a 
continuous range? By definition length of [a, b) and (a, b] = b-a. But 
what about (a,b) and [a,b]? Are we saying that because they are 
continuous, the difference between values included in the range and 
those excluded are so infinitesimally small so as not to matter? Thus 
length (a,b) == length [a,b] == length [a,b)? And if that is the case, 
does the inclusiveness of the range really even matter?

And can anyone point me to a reference for working with continuous 
ranges? Google just insists that I spelled contiguous wrong.

Scott


Re: Range types

От
Tom Lane
Дата:
Scott Bailey <artacus@comcast.net> writes:
> As I pointed out off-list, I think the granularity for timestamp range 
> should be limited to hours and smaller. Anything larger is asking for 
> trouble. And quite honestly if they wanted day granularity, they should 
> use date range.

I'm still not real clear on what the expected use-case is for this.
You're evidently envisioning applications where the allowed form of
an interval is constrained, but in the cases I can think of, the
constraints are a lot more convoluted than what you're proposing.
For example, if you're trying to do classroom scheduling, it might be
useful to constrain the periods to start and end on hour boundaries
--- but the next thing you'll want is to have it know that the "next"
slot after 5pm Friday is 8am Monday.  Except on holidays.  And then
there's the fact that my alma mater starts most hour-long classes on
the half hour.

I think that wiring such constraints into the low-level datatype is
doomed to failure.  What you'd be better off with is a function that
finds the "next" period given a current period and some suitable
representation of the off-limits intervals.  The argument for having
granularity wired into the datatype seems to boil down to just space
savings.  I don't find that compelling enough to justify code
contortions and user-visible restrictions on functionality.
        regards, tom lane


Re: Range types

От
Jeff Davis
Дата:
On Wed, 2009-12-16 at 12:42 -0500, Robert Haas wrote:
> On Wed, Dec 16, 2009 at 12:31 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > There's one problem, and that's for timestamptz ranges with intervals
> > that include days and months. Timezone adjustments are just not
> > well-defined for that kind of granule (nor would it be particularly
> > useful even if it magically worked), so this would have to be blocked
> > somehow. I think that's a special case, and we could provide the user
> > with a nice error message telling the user to use a date or timestamp
> > range instead.
> 
> This seems like a fairly special-purpose type.  You'd be targeting it
> at people who are very concerned with storing large numbers of these
> (so they really care about space consumption) but for some reason
> don't need to mix days and months (actually, the current interval
> representation stores days, months, and seconds separately).  I
> certainly think this might be useful to some people but it doesn't
> really sounds like a general range type facility, since it seems to
> involve some hacks that are fairly datatype-specific.

My statement should have read "days or months". In other words, you
can't have a timestamptz range with a granularity of '3 days'. But if
that's your granularity, as Scott says, you should be using a date
range, not a timestamptz range.

Timestamptz ranges are only really useful when you have a granularity
measured in seconds (or some fraction or multiple thereof). Otherwise,
the timezone adjustment doesn't make any sense.

So this isn't a case of limited functionality, just that we need to
inform the user that a timestamptz range with granularity '1 day' or '1
month' makes no sense.

Regards,Jeff Davis



Re: Range types

От
Jeff Davis
Дата:
On Wed, 2009-12-16 at 12:50 -0500, Tom Lane wrote:
> I'm still not exactly clear on what the use-case is for discrete
> timestamp ranges, and I wonder how many people are going to be happy
> with a representation that can't handle a range that's open-ended
> on the left.

Huh? We're miscommunicating somewhere. Discrete ranges are values, and
those values can be displayed a number of different ways. That's one of
the biggest advantages.

The very same discrete range can be displayed as open-open, closed-open,
open-closed, or closed-closed. There are edge cases, like how infinity
is never closed, and overflow conditions. But generally speaking, you
have more options for presenting a discrete range than a continuous
range.

The range [5, 7) is equal to the set {5, 6} and equal to the ranges:
(4,7), (4,6], [5,7), and [5,6]. One application can insert it as [5,7)
and another can read it as (4,6].

That's the use case: the application's preferences don't have to match.
It's OK to mix various representation preferences, because you can
convert between them. The on disk format happens to hint at one
particular canonical form, but doesn't enforce that on anyone.

> Huh?  You're not going to be able to have a special case data
> representation for one or two data types at the same time as you have a
> function-based datatype-independent concept of a parameterized range
> type.  Well, maybe you could have special code paths for just date and
> timestamp but it'd be horrid.

They aren't supposed to be exactly the same API, I said that from the
start. There are API differences between continuous and discrete ranges,
and we shouldn't ignore them.

One important differences is that (barring overflow conditions and
special values) prior, first, last, and next are defined for all
discrete range values, but not for all continuous range values.

For instance, the discrete range [5,7) has prior=4, first=5, last=6,
next=7.

Whereas the continuous range [5,7) has prior=undef, first=5, last=undef,
next=7.

We could define one API, that treats discrete and continuous ranges
differently. But you'll never be able to transform a continuous range to
a different representation, while you can do so with a discrete range.

> More importantly, the notion of a representation granule is still 100%
> wishful thinking for any inexact-representation datatype, which is going
> to be a severe crimp in getting this accepted for timestamp, let alone
> defining it in a way that would allow users to try to apply it to
> floats.  Float timestamps might not be the default case anymore but they
> are still supported.

If the only objection is that a superuser can confuse the system by
poorly defining a range type on a non-default build, I think that
objection can be overcome.

> I think you should let go of the feeling that you have to shave bytes
> off the storage format.  You're creating a whole lot of work for
> yourself and a whole lot of user-visible corner cases in return for
> what ultimately isn't much.

This isn't just to shave bytes. It's also because I like the semantics
of discrete ranges for many cases.

Regards,Jeff Davis



Re: Range types

От
Alvaro Herrera
Дата:
tomas@tuxteam.de wrote:

> (and as Andrew Dunstan pointed out off-list: I was wrong with my bold
> assertion that one can squeeze infinitely many (arbitrary length)
> strings between two given. This is not always the case).

Of course you can do that if you assume lexicographical order, or any
other arbitrary order.  The interesting point is whether there exists
some ordering on which this does not happen.  And in fact there is:
order strings by length first, and then lexicographically.  If you do
this then you have next() and prev() for any given string.  If you use
ASCII only, you have a.next = b, and so on.

There is the argument that some languages do not sort lexicographically
but this is also besides the point -- you only need to find *some* way
to sort the characters in the alphabet.  If you dictate that in your
ordering "á" comes before "à" and both after "a", and all of them before
b, then you know that a.next = á and á.next = à and à.next = b.  (Note
that I have also dictated that there is no other character that sorts
after a and before b, which is perfectly possible because the alphabet
is fixed for any given language.  You could use the complete list of
characters coded in a given set of Unicode planes, or even extant all
planes, to obtain the same result).

Defining strings with this ordering means you can have some useful
ranges like [a-z], but then you cannot meaningfully use ranges for
things like [aardvark - zulu] --- note that in this particular example,
the order is reversed, because zulu comes before aardvark which is
probably not what you want.   zzz.next = aaaa

In short, I think that while it is possible to define ranges of strings,
it is not as useful as one would like.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Range types

От
Martijn van Oosterhout
Дата:
On Tue, Dec 15, 2009 at 04:29:26PM -0800, Jeff Davis wrote:
> On Tue, 2009-12-15 at 18:06 -0600, decibel wrote:
> > Now that varlena's don't have an enormous fixed overhead, perhaps it's
> > worth looking at using them. Obviously some operations would be
> > slower, but for your stated examples of auditing and history, I
> > suspect that you're not going to notice the overhead that much.
>
> For most varvarlena types, you only get stuck with the full alignment
> burden if you get unlucky. In this case, we're moving from 16 bytes to
> 17, which really means 24 bytes with alignment. Try creating two tables:
>
>   create table foo(i int8, t1 timestamp, t2 timestamp);
>   create table bar(i int8, c "char", t1 timestamp, t2 timestamp);

But a period type will take just one or two more bytes if you don't
require alignment. Alignment on a varlena type seems silly anyway,
since you'll be aligning the header byte rather than the content.

In the implementation you may need to copy the content before
processing to satisfy the alignment of the contained type, but that's
just a SMOP.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Range types

От
Martijn van Oosterhout
Дата:
On Wed, Dec 16, 2009 at 10:57:19AM -0800, Scott Bailey wrote:
> Ok, silly question here. But how do you determine the length of a
> continuous range? By definition length of [a, b) and (a, b] = b-a. But
> what about (a,b) and [a,b]? Are we saying that because they are
> continuous, the difference between values included in the range and
> those excluded are so infinitesimally small so as not to matter? Thus
> length (a,b) == length [a,b] == length [a,b)? And if that is the case,
> does the inclusiveness of the range really even matter?

Short answer: Yes

Longer answer: You need to decide on your definition of "length" and
what you usually use is the "measure". And yes, the difference between
the two is so called "measure 0" and thus has no effect on the length.

Note the measure has to be done considering the intervals as intervals
on a real line. The integers by themselves have no measure (they are
countable). So for the "length" of a set of integers you might consider
the count of the set.

http://planetmath.org/encyclopedia/ProofThatTheOuterLebesgueMeasureOfAnIntervalIsItsLength.html
http://en.wikipedia.org/wiki/Outer_measure

As for "continuous", as you use it above is not a way I recognise.
There are contiguous sets, but they are something else.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Range types

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> In short, I think that while it is possible to define ranges of strings,
> it is not as useful as one would like.

Note it is not the *range* that is the problem, it is the assumption
that there's a unique "next" string.  There's no unique next in the
reals or rationals either, but we have no problem considering intervals
over those sets.
        regards, tom lane


Re: Range types

От
Jeff Davis
Дата:
On Wed, 2009-12-16 at 13:59 -0500, Tom Lane wrote:
> The argument for having
> granularity wired into the datatype seems to boil down to just space
> savings.  I don't find that compelling enough to justify code
> contortions and user-visible restrictions on functionality.

The argument (at least from me) is that discrete ranges have better
semantics. The counterargument was that the granularity of a timestamp
is an implementation detail. So I countered by making it explicit.

Space savings is not crucial, but it would be frustrating to needlessly
waste space.

I still have not seen an answer to the problem of changing the
representation of a continuous range. If you have the continuous range
[5, 10], you're pretty much stuck with that representation, even if the
application is expecting things in the form [ ).

>From an application's standpoint, you probably want to get the
information about a range as separate columns (as opposed to parsing a
string). So an application expecting data in [) format might use a query
like:
select ..., first(myperiod), next(myperiod) from mytable;

That gives the application complete information about the range. You can
even make a view over a table like that to make it even more transparent
to the application. It's not entirely unreasonable that many such
applications exist; there are many presentations and tutorials that have
been telling people to use a "start" and "end" column, and assume that
the start is inclusive and the end is exclusive.

If there is some other application that expects data in (] format, you
just use the query:
 select ..., prior(myperiod), last(myperiod) from mytable;

With discrete ranges, that all just works (barring overflow or special
values).

With continuous ranges, first() or next() might fail on some values that
were produced by some other application. Really, for continuous ranges,
you'd need to have a query like:
 select ..., start(myperiod), start_inclusive(myperiod),   end(myperiod), end_inclusive(myperiod) from mytable;

in order to have all of the information. And that means that the
application needs a full implementation of a range type to understand
the inclusivity and produce a correct result.

And to further make the case for allowing user-defined discrete ranges,
what about ip4r? That's a discrete range, and it's user-defined. And
that probably means other useful discrete ranges will be invented, as
well.

If we want to say that there is no discrete TIMESTAMP range by default,
and that the superuser has to define it, that's one thing. But if we say
that the only acceptable base types for discrete ranges will be
hard-wired, that's way too restrictive. If nothing else, it makes some
system types "special" which we have not done very many other places.

Regards,Jeff Davis



Re: Range types

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > In short, I think that while it is possible to define ranges of strings,
> > it is not as useful as one would like.
> 
> Note it is not the *range* that is the problem, it is the assumption
> that there's a unique "next" string.  There's no unique next in the
> reals or rationals either, but we have no problem considering intervals
> over those sets.

Yeah, agreed.  It's easy (I think) to define more useful ranges of
strings if you don't insist in having "next".

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Range types

От
Tom Lane
Дата:
Martijn van Oosterhout <kleptog@svana.org> writes:
> But a period type will take just one or two more bytes if you don't
> require alignment. Alignment on a varlena type seems silly anyway,
> since you'll be aligning the header byte rather than the content.

You might still end up paying the alignment overhead after the field,
of course.  But avoiding embedding the alignment in the type itself
seems worth doing.

One idea that might be interesting to consider in this regard is
force-packing varlena range values.  Consider a text range('a', 'b').
The datums are likely to come in with 4-byte headers requiring
alignment.  If we have the smarts to force them to 1-byte header form
inside the varlena range value, not only do we save bytes right there,
but we don't have to expend cycles to copy them somewhere to re-align
them before we can pass them to the datatype-specific functions.
        regards, tom lane


Re: Range types

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Wed, 2009-12-16 at 12:50 -0500, Tom Lane wrote:
>> I'm still not exactly clear on what the use-case is for discrete
>> timestamp ranges, and I wonder how many people are going to be happy
>> with a representation that can't handle a range that's open-ended
>> on the left.

> Huh? We're miscommunicating somewhere.

Yeah, apparently.  By open-ended I meant -infinity left bound, or null
left bound if you prefer.  Not sure if there's a better term.
        regards, tom lane


Re: Range types

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Wed, 2009-12-16 at 13:59 -0500, Tom Lane wrote:
>> The argument for having
>> granularity wired into the datatype seems to boil down to just space
>> savings.  I don't find that compelling enough to justify code
>> contortions and user-visible restrictions on functionality.

> The argument (at least from me) is that discrete ranges have better
> semantics. The counterargument was that the granularity of a timestamp
> is an implementation detail. So I countered by making it explicit.

Making it explicit doesn't fix the fact that you can't rely on the
arithmetic to be exact.

> I still have not seen an answer to the problem of changing the
> representation of a continuous range. If you have the continuous range
> [5, 10], you're pretty much stuck with that representation, even if the
> application is expecting things in the form [ ).

That is not our problem.  It's the application's problem if it can't
handle the concept.  You might as well be complaining that type numeric
is broken because it can represent values that will fail to fit into
float8 when some application tries to force them into that form.

> And to further make the case for allowing user-defined discrete ranges,
> what about ip4r?

What about it?  I don't have a problem with the concept that next() is
well defined for some datatypes.  I just have a problem with the concept
that it's well-defined for timestamps.  It's not, and I don't believe
that forcing it to have some definition is useful in the real world
(which, as a rule, isn't going to fit the simplifying assumptions you
have to make to make it even sort-of work).
        regards, tom lane


Re: Range types

От
Martijn van Oosterhout
Дата:
On Wed, Dec 16, 2009 at 03:57:44PM -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I still have not seen an answer to the problem of changing the
> > representation of a continuous range. If you have the continuous range
> > [5, 10], you're pretty much stuck with that representation, even if the
> > application is expecting things in the form [ ).
>
> That is not our problem.  It's the application's problem if it can't
> handle the concept.  You might as well be complaining that type numeric
> is broken because it can represent values that will fail to fit into
> float8 when some application tries to force them into that form.

However, it does seem reasonable to allow people to restrict, either by
typmod or a check constraint the kinds of values that can be stored in
a particular column. Then an application can decide which way they want
their intervals to work and have the database enforce it.

(Intermediate values may become a different kind, just as long as the
result being stored it the right kind.)

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Range types

От
Jeff Davis
Дата:
On Wed, 2009-12-16 at 15:46 -0500, Tom Lane wrote:
> > Huh? We're miscommunicating somewhere.
> 
> Yeah, apparently.  By open-ended I meant -infinity left bound, or null
> left bound if you prefer.  Not sure if there's a better term.

But my proposal allowed both of those things with various flag settings
(because it has a flag byte in the latest proposal). I said so
explicitly.

Regards,Jeff Davis



Re: Range types

От
Tom Lane
Дата:
Martijn van Oosterhout <kleptog@svana.org> writes:
> However, it does seem reasonable to allow people to restrict, either by
> typmod or a check constraint the kinds of values that can be stored in
> a particular column. Then an application can decide which way they want
> their intervals to work and have the database enforce it.

Sure --- the range datatype should absolutely provide inquiry functions
that let you determine all the properties of a range, so something like
"CHECK (is_open_on_right(col))" would work for that.  I'm of the opinion
that we must not usurp typmod for range behavior --- the right thing is
to pass that through to the contained type, just as we do with arrays.

(Note that a range over timestamp(0) would eliminate at least some of
the platform dependencies we've been arguing about.  I'm still quite
dubious that "next timestamp" is anything except evidence that you've
misformulated your problem, though.)
        regards, tom lane


Re: Range types

От
Jeff Davis
Дата:
On Wed, 2009-12-16 at 15:57 -0500, Tom Lane wrote:
> Making it explicit doesn't fix the fact that you can't rely on the
> arithmetic to be exact.

Can't rely on what arithmetic to be exact? Int64 timestamps should
clearly work for granules of 1 second.

If the administrator can choose a timestamp format that can't accurately
represent whole seconds, that's not an argument that modeling based on
whole seconds is worthless. We can restrict float timestamps for ranges
as a special case, if we're that worried about people misusing them.

> > I still have not seen an answer to the problem of changing the
> > representation of a continuous range. If you have the continuous range
> > [5, 10], you're pretty much stuck with that representation, even if the
> > application is expecting things in the form [ ).
> 
> That is not our problem.  It's the application's problem if it can't
> handle the concept.  You might as well be complaining that type numeric
> is broken because it can represent values that will fail to fit into
> float8 when some application tries to force them into that form.

...except that we support float8. So if applications like to work float8
float8, we let them.

You're arguing that we should not support discrete time ranges (or even
allow such a type to be defined by a superuser) even though applications
and users may happen to model time that way.

Who are we to argue that all of those people are so wrong that we won't
even support their type? Especially when they may have just finished a
couple books on the subject[1][2] which told them to model it that way?

> > And to further make the case for allowing user-defined discrete ranges,
> > what about ip4r?
> 
> What about it?  I don't have a problem with the concept that next() is
> well defined for some datatypes.

Ok, so we'll allow users to specify user-defined types for discrete
ranges? How should that be specified, and how will that differ from
earlier proposals?

Earlier you proposed that we hard-wire the set of types that are allowed
to be used for discrete ranges:

http://archives.postgresql.org/pgsql-hackers/2009-12/msg01278.php

Regards,Jeff Davis

[1] "Temporal Data and the Relational Model" by C.J. Date, et al.
[2] "Developing Time-Oriented Database Applications in SQL" by Richard
Snodgrass.



Re: Range types

От
Scott Bailey
Дата:
Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>> However, it does seem reasonable to allow people to restrict, either by
>> typmod or a check constraint the kinds of values that can be stored in
>> a particular column. Then an application can decide which way they want
>> their intervals to work and have the database enforce it.
> 
> Sure --- the range datatype should absolutely provide inquiry functions
> that let you determine all the properties of a range, so something like
> "CHECK (is_open_on_right(col))" would work for that.  I'm of the opinion
> that we must not usurp typmod for range behavior --- the right thing is
> to pass that through to the contained type, just as we do with arrays.
> 
> (Note that a range over timestamp(0) would eliminate at least some of
> the platform dependencies we've been arguing about.  I'm still quite
> dubious that "next timestamp" is anything except evidence that you've
> misformulated your problem, though.)
> 
>             regards, tom lane

Well our work is based on over 15 years of temporal research (not by us) 
and numerous books from Snodgrass, Date and Celko; as well as partial 
implementations in other databases. So its not like we took a blue pill 
this weekend and woke up with this hair-brained idea.

I understand your concern. But I think the objections are based more on 
implementation details with float timestamp rather than conceptually.


Re: Range types

От
Jeff Davis
Дата:
On Wed, 2009-12-16 at 13:59 -0500, Tom Lane wrote:
> For example, if you're trying to do classroom scheduling, it might be
> useful to constrain the periods to start and end on hour boundaries
> --- but the next thing you'll want is to have it know that the "next"
> slot after 5pm Friday is 8am Monday.  Except on holidays.  And then
> there's the fact that my alma mater starts most hour-long classes on
> the half hour.

Data types are only a first-level constraint -- a domain of reasonable
values. The class isn't going to start on a fraction-of-a-minute
boundary, so it would be reasonable to reject those values early.

I never suggested that next() should be such a highly business-dependent
function as you suggest above (skipping holidays, etc); it should just
return the next value in the domain (if it's discrete).

Surely you wouldn't suggest that the ipv4 data type's next() function
should skip over addresses that aren't in a valid subnet on your
network. But you seem to think those make useful discrete ranges.

Regards,Jeff Davis



Re: Range types

От
Jeff Davis
Дата:
On Wed, 2009-12-16 at 14:29 +0100, tomas@tuxteam.de wrote:
> This alone would practically preclude discrete -- int and float would
> behave quite differently (float's "granules" growing towards the edges
> or having to choose a bigger granule for float than for int in the first
> place).

It may be an argument for a different range type name, or trying to spot
obviously dangerous things and throw an error.

But I don't think it's an argument to prevent a superuser from defining
a discrete range of whatever he or she wants, as long as they provide
the necessary functions.

Regards,Jeff Davis



Re: Range types

От
tomas@tuxteam.de
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Dec 16, 2009 at 04:45:54PM -0300, Alvaro Herrera wrote:
> tomas@tuxteam.de wrote:
> 
> > (and as Andrew Dunstan pointed out off-list: I was wrong with my bold
> > assertion that one can squeeze infinitely many (arbitrary length)
> > strings between two given. This is not always the case).
> 
> Of course you can do that if you assume lexicographical order, or any
> other arbitrary order.

Hm. On strings (lexicographical order) there's always a "next" (using
the convention 'a' is the first symbol, 'z' the last and . the append
operator):
 next(X) == X . 'a'

(there is no string between X and X . 'a', and X < X . 'a')

But there generally no prev! (unless the last "character" of a string is
precisely 'a')
 prev("bob") == "bazzzzz......"

...and we just left the realm of finte-length strings. Strange structure
this space, has.

> some ordering on which this does not happen.  And in fact there is:
> order strings by length first, and then lexicographically.  If you do
> this then you have next() and prev() for any given string.  If you use
> ASCII only, you have a.next = b, and so on.

Yes, this is similar to the trick of ordering the rationals in a "snake
diagonal" like in [1], to prove their countability. But this sorting
isn't very useful (except in this clever proof).

> There is the argument that some languages do not sort lexicographically
> but this is also besides the point [...]

Totally agree.

> Defining strings with this ordering means you can have some useful
> ranges like [a-z], but then you cannot meaningfully use ranges for
> things like [aardvark - zulu] --- note that in this particular example,
> the order is reversed, because zulu comes before aardvark which is
> probably not what you want.   zzz.next = aaaa

Yep. Its just useful to prove that the set of strings is countable (I
wonder whether our world would be different if we had chosen the
convention of looking up words in this "length+lexicographical" order in
the first place -- it reminds me of how Chinese-speaking people look up
symbols in a dictionary: number of strokes first).

> In short, I think that while it is possible to define ranges of strings,
> it is not as useful as one would like.

Unless you are willing to treat strings as non-discrete, somehow.

Regards
- ----------------
[1] <http://en.wikipedia.org/wiki/Rational_numbers#Properties>

- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLKbiFBcgs9XrR2kYRAnUYAJ0QBbWzNITA1KARLKdPTshFBSp/ZwCeIbin
Lwc/pRBYgKoaccGpn0beyu4=
=nViD
-----END PGP SIGNATURE-----


Re: Range types

От
Dimitri Fontaine
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Dimitri Fontaine <dfontaine@hi-media.com> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>>     foreach p2_member in unnest(p2) loop
>>>         p1 := array(select period_except(p1_member, p2_member)
>>>                 from unnest(p1) p1_member);
>>>     end loop;
>>> 
>>> But maybe it can be done in a single SQL command.
>
>> Yeah, as soon as you have LATERAL, I think. Without it there's no way to
>> compose SRF in SQL, AFAIK.
>
> Hm, how would you do it with LATERAL?  The problem is not so much
> composition as the need for a variable number of rounds of
> composition.

Let's have a try at it:

select p2_member, array_accum(p1) from unnest(p2) as p2_member      lateral (select period_except(p1_member, p2_member)
               from unnest(p1) p1_member) as x(p1);
 

I'm not sure I understand how the explicit looping over unnest(p2) is different
from using lateral, or even if that's what you're talking about when
mentioning variable number of rounds.

Regards,
-- 
dim


Re: Range types

От
Tom Lane
Дата:
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Hm, how would you do it with LATERAL?  The problem is not so much
>> composition as the need for a variable number of rounds of
>> composition.

> Let's have a try at it:

> select p2_member, array_accum(p1)
>   from unnest(p2) as p2_member
>        lateral (select period_except(p1_member, p2_member)
>                   from unnest(p1) p1_member) as x(p1);

I don't think that does it.  Maybe I misunderstand LATERAL, but what
that looks like to me is that each p1 will be separately filtered by
each p2, giving rise to a distinct element in the output.  What we
need is for each p1 to be filtered by *all* p2's, successively
(though in any order).
        regards, tom lane


Re: Range types

От
Andrew Gierth
Дата:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Someone mentioned LATERAL?
>> Tom Lane <tgl@sss.pgh.pa.us> writes:>>> Hm, how would you do it with LATERAL?  The problem is not so much>>>
compositionas the need for a variable number of rounds of>>> composition.
 
>> Let's have a try at it:
>> select p2_member, array_accum(p1)>> from unnest(p2) as p2_member>> lateral (select period_except(p1_member,
p2_member)>>from unnest(p1) p1_member) as x(p1);
 
Tom> I don't think that does it.  Maybe I misunderstand LATERAL, butTom> what that looks like to me is that each p1
willbe separatelyTom> filtered by each p2, giving rise to a distinct element in theTom> output.  What we need is for
eachp1 to be filtered by *all*Tom> p2's, successively (though in any order).
 

Right, that's not a job for LATERAL, though it could be done easily
enough in one statement with a recursive CTE, I think.

-- 
Andrew (irc:RhodiumToad)


Re: Range types

От
Scott Bailey
Дата:
Tom Lane wrote:
> Dimitri Fontaine <dfontaine@hi-media.com> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>> Hm, how would you do it with LATERAL?  The problem is not so much
>>> composition as the need for a variable number of rounds of
>>> composition.
> 
>> Let's have a try at it:
> 
>> select p2_member, array_accum(p1)
>>   from unnest(p2) as p2_member
>>        lateral (select period_except(p1_member, p2_member)
>>                   from unnest(p1) p1_member) as x(p1);
> 
> I don't think that does it.  Maybe I misunderstand LATERAL, but what
> that looks like to me is that each p1 will be separately filtered by
> each p2, giving rise to a distinct element in the output.  What we
> need is for each p1 to be filtered by *all* p2's, successively
> (though in any order).
> 
>             regards, tom lane

That approach will only work if you coalesce your inputs into 
non-contiguous sets (NCS) first. Overlapping ranges would break it in a 
hurry. In addition to two coalesce operations, period_except would be 
calculated 1000x for a pair of 100 element arrays. Original solution, 
while not short was probably a little more elegant than Tom gave credit 
for. In a single pass it pulls out only the data points needed to build 
the resultant NCS without making assumptions that the inputs were coalesced.

I think I'll still be able to do a single pass solution for continuous 
ranges. I just wont be able to do the coalesce operations inline with 
the set operations.

Scott


Re: Range types

От
decibel
Дата:
On Dec 15, 2009, at 6:29 PM, Jeff Davis wrote:
> On Tue, 2009-12-15 at 18:06 -0600, decibel wrote:
>> Now that varlena's don't have an enormous fixed overhead, perhaps it's
>> worth looking at using them. Obviously some operations would be
>> slower, but for your stated examples of auditing and history, I
>> suspect that you're not going to notice the overhead that much.
>
> For most varvarlena types, you only get stuck with the full alignment
> burden if you get unlucky. In this case, we're moving from 16 bytes to
> 17, which really means 24 bytes with alignment. Try creating two tables:

My thought was that many timestamps don't actually need 16 bytes. Jan 1, 2000 certainly doesn't. So if your dates are
closeto the PG epoch, you can get away with far fewer than 8 bytes, which means varlena would be a win. 

*does some math* Actually, we're kinda screwed with microsecond time. Neglecting leap years and what-not, I come up
with8 years as the most you can represent in 6 bytes. The good news is that 7 bytes gets you all the way to 2284 (with
uSprecision), so we're not actually hurting ourselves on storage until 4284 or so. Not everyone needs uS precision, so
itmight be worth looking at a varlena-based timestamp. 

I was actually thinking about storing something like an absolute time and then an interval. That might have been able
tocompact a lot more if you used some kind of modified varlena (you'd want to store how long both the absolute time and
theinterval were). But again, we're rather screwed if you use uS precision. 1 byte header + 7 bytes for absolute gets
us+/- 2284 years from epoch, but 4 bytes for interval only gives us 4294 seconds at uS precision. Maybe still worth it
forthose hour-long meetings. 

But if you switch to second precision, things get a lot more interesting: 1 byte overhead + 3 bytes interval gives you
194days. 4 bytes of 1 second absolute time gets you epoch +/- 136 years. That means you could represent an entire
periodin 8 bytes. 
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net