Discussion: Range types
I had proposed a temporal contrib module earlier and you wanted to see support for many range types, not just timestamptz. So I had an idea on how to implement this, but I want to see if you guys thought it was a viable idea.

So basically I have an anyrange pseudo type with the functions prev, next, last, etc. defined. So instead of hard coding range types, we would allow the user to define their own range types. Basically, if we are able to determine the previous and next values of the base types, we'd be able to define a range type. I'm envisioning it in a manner much like defining an enum type:

CREATE TYPE periodtz AS RANGE (timestamptz, '1 microsecond'::interval);
-- determine granularity from typmod
CREATE TYPE numrange AS RANGE (numeric(8,2));
CREATE TYPE floatrange AS RANGE (float, '0.000000001'::float);

Or getting really crazy...

CREATE TYPE terms AS ENUM ('2000_F', '2000_W', '2000_S', '2000_Su'... '2010_F', '2010_W', '2010_S', '2010_Su');
CREATE TYPE termrange AS RANGE (terms);

So basically I have a pg_range table to store the base typeid, a text field for the granule value, and the granule typeid.

I doubt we would be able to get this in for the 8.5 release, especially since I'm still learning C and the Postgres internals. Jeff Davis is going to get something in before the next commit fest, so we'll have some type of temporal/range support. But we wanted to see what direction the community felt we should go.

Scott Bailey
Scott Bailey <artacus@comcast.net> wrote:

> CREATE TYPE periodtz AS RANGE (timestamptz, '1 microsecond'::interval);

What does the second argument mean? Is it a default interval?

> So basically I have a pg_range table to store the base typeid, a text
> field for the granule value and the granule typeid.

As another approach, what about storing the typeid in the typmod? (An Oid can be assumed to be stored in an int32.) For example:

CREATE TABLE tbl ( r range(timestamp) );
SELECT '[ 2.0, 3.0 )'::range(float);

There might be some overhead to store the typeid for each range instance, but the typmod approach does not require additional catalogs or syntax changes. It could be possible even on 8.4.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
On Sun, Dec 13, 2009 at 11:49:53PM -0800, Scott Bailey wrote:

> I had proposed a temporal contrib module earlier and you wanted to see
> support for many range types not just timestamptz [...]
> So basically I have an anyrange pseudo type with the functions prev, next,
> last, etc defined. So instead of hard coding range types, we would allow
> the user to define their own range types. Basically if we are able to
> determine the previous and next values of the base types we'd be able to
> define a range type. I'm envisioning in a manner much like defining an enum
> type.
>
> CREATE TYPE periodtz AS RANGE (timestamptz, '1 microsecond'::interval);
> CREATE TYPE numrange AS RANGE (numeric(8,2)); -- determine granularity from typmod
> CREATE TYPE floatrange AS RANGE (float, '0.000000001'::float);

I might be asking the same as Itagaki is (see below) but... are you just envisioning ranges on 'discrete' types? What about ranges on floats or (arbitrary-length) strings, where there is no prev/next? Too difficult? (Mind you: I don't know exactly what I'm talking about, but it would definitely be useful.)

On Mon, Dec 14, 2009 at 05:10:24PM +0900, Takahiro Itagaki wrote:

> Scott Bailey <artacus@comcast.net> wrote:
>
>> CREATE TYPE periodtz AS RANGE (timestamptz, '1 microsecond'::interval);
>
> What does the second argument mean? Is it a default interval?

I think it's the granularity. That defines how to find the 'next' point in time. As I understood it, Scott envisions ranges only for discrete types, i.e. those advancing in well-defined steps. Note that (a) I might be wrong and (b) there might be a very good reason for doing it this way.

>> So basically I have a pg_range table to store the base typeid, a text
>> field for the granule value and the granule typeid.
>
> As another approach, what about storing typeid in typmod?
> (Oid can be assumed to be stored in int32.)
>
> For example,
> CREATE TABLE tbl ( r range(timestamp) );
> SELECT '[ 2.0, 3.0 )'::range(float);
>
> There might be some overhead to store typeid for each range instance,
> but the typmod approach does not require additional catalogs and syntax
> changes. It can be possible even on 8.4.

This looks more natural to me too.

Regards
-- tomás
On Mon, Dec 14, 2009 at 4:06 AM, <tomas@tuxteam.de> wrote:
>> As another approach, what about storing typeid in typmod?
>> (Oid can be assumed to be stored in int32.)
>>
>> For example,
>> CREATE TABLE tbl ( r range(timestamp) );
>> SELECT '[ 2.0, 3.0 )'::range(float);
>>
>> There might be some overhead to store typeid for each range instance,
>> but the typmod approach does not require additional catalogs and syntax
>> changes. It can be possible even on 8.4.
>
> This looks more natural to me too.

It's very different from the way we've traditionally used typmod, though, which Tom described pretty well here:

http://archives.postgresql.org/pgsql-hackers/2009-11/msg01183.php

For example, function signatures ignore typmod, so you'll be able to write a function that takes a range, but you won't know what kind of range you're getting. Pavel proposed changing that, but the problem is that while you might want to discriminate on the basis of what sort of range you're getting, you probably DON'T want to discriminate on the length of the character string being passed in with a varchar argument, or the number of decimal places in a numeric. So I think this is going to be awkward.

Also, typid is unsigned and typmod is signed. Again, awkward. Maybe with a big enough crowbar you can make it work, but it seems like it won't be pretty...

...Robert
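The signed/unsigned mismatch Robert mentions can be sketched outside the server (a toy illustration, not PostgreSQL code; the helper name is made up): typmod is a signed int32, so any Oid with the high bit set reinterprets as a negative number, and negative typmods conventionally mean "unspecified".

```python
import struct

def oid_as_typmod(oid: int) -> int:
    """Reinterpret an unsigned 32-bit Oid as a signed int32 typmod."""
    return struct.unpack('i', struct.pack('I', oid))[0]

# Oids below 2**31 survive the round trip...
assert oid_as_typmod(23) == 23

# ...but the upper half of the Oid space maps onto negative typmods,
# which by convention are treated as "no typmod specified".
assert oid_as_typmod(2**31) == -2**31
assert oid_as_typmod(0xFFFFFFFF) == -1  # Oid 4294967295 collides with "unspecified"
```

So even before considering function signatures, a full Oid simply does not round-trip through a typmod without losing the "unspecified" convention.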
On Mon, Dec 14, 2009 at 06:02:04AM -0500, Robert Haas wrote:
> On Mon, Dec 14, 2009 at 4:06 AM, <tomas@tuxteam.de> wrote:
[...]
>> This looks more natural to me too.
>
> It's very different from the way we've traditionally used typmod,
> though, which Tom described pretty well here:
>
> http://archives.postgresql.org/pgsql-hackers/2009-11/msg01183.php
>
> For example, function signatures ignore typmod, so you'll be able to
> write a function that takes a range, but you won't know what kind of
> range you're getting [...]
> Also, typid is unsigned and typmod is signed. Again, awkward [...]

Ugh. I see. Thank you for this very insightful comment.

Regards
-- tomás
Scott Bailey <artacus@comcast.net> writes:
> So basically I have an anyrange pseudo type with the functions prev, next,
> last, etc defined. So instead of hard coding range types, we would allow the
> user to define their own range types. Basically if we are able to determine
> the previous and next values of the base types we'd be able to define a
> range type. I'm envisioning in a manner much like defining an enum
> type.

It's not clear how to define those functions for the prefix_range datatype, where '123' represents any text beginning with those characters and '123[4-6]' any text beginning with '123' then either 4, 5 or 6. What is

SELECT next('123'::prefix_range);

supposed to return?

Regards,
-- dim

PS: as I'm used to using that in the telephony context, the examples contain digits, but it's a text-based type and, given questions and reports on pgsql-general, people do use it with text ranges too.
On Sun, Dec 13, 2009 at 11:49:53PM -0800, Scott Bailey wrote:
> So basically I have an anyrange pseudo type with the functions prev,
> next, last, etc defined. So instead of hard coding range types, we would
> allow the user to define their own range types. Basically if we are able
> to determine the previous and next values of the base types we'd be able
> to define a range type. I'm envisioning in a manner much like defining
> an enum type.

I find it odd that you could define functions next() and prev(), since that assumes some kind of discretisation which simply does not exist for most types I can think of.

It would seem to me the really useful uses of ranges would be the operations overlaps, disjoint, precedes, follows, etc., which could all be defined on any well-ordered type (wherever greater-than and less-than are well defined). No need to discretise anything.

Do you have any actual use case for a discretized range type for timestamp?

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
Scott Bailey <artacus@comcast.net> writes:
> So basically I have an anyrange pseudo type with the functions prev,
> next, last, etc defined. So instead of hard coding range types, we would
> allow the user to define their own range types. Basically if we are able
> to determine the previous and next values of the base types we'd be able
> to define a range type. I'm envisioning in a manner much like defining
> an enum type.

I think array types, not enums, would be a better model.

The real question is how the heck granularity enters into it. Why should a range type require that? I think you are mixing up two concepts that would be better kept separate.

In particular, the granularity examples you give seem to assume that the underlying datatype is exact not approximate --- which among other things will mean that it fails to work for float timestamps. Since timestamps are supposedly the main use-case, that's pretty troubling.

regards, tom lane
Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes: > As another approach, what about storing typeid in typmod? > (Oid can be assumed to be stored in int32.) No, it cannot --- you'd be one bit short. The general rule for typmod is that all negative values are treated as "unspecified". Even if you managed to find and change every place that assumed that, you'd still have a failure for -1, which is a perfectly legal value of Oid. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Dec 14, 2009 at 4:06 AM, <tomas@tuxteam.de> wrote:
>>> As another approach, what about storing typeid in typmod?
>>> (Oid can be assumed to be stored in int32.)
>
> It's very different from the way we've traditionally used typmod,

Aside from the problems already pointed out, there's another: this definition usurps the ability to attach a typmod to the range's base type.

Again, you should be looking at arrays, not enums, for a reference example. The typmod of an array feeds through to its element type.

regards, tom lane
Martijn van Oosterhout wrote:
> On Sun, Dec 13, 2009 at 11:49:53PM -0800, Scott Bailey wrote:
>> So basically I have an anyrange pseudo type with the functions prev,
>> next, last, etc defined. So instead of hard coding range types, we would
>> allow the user to define their own range types. Basically if we are able
>> to determine the previous and next values of the base types we'd be able
>> to define a range type. I'm envisioning in a manner much like defining
>> an enum type.
>
> I find it odd that you could define functions next() and prev() since
> that assumes some kind of discretisation which simply does not exist for
> most types I can think of.

Because intervals (mathematical, not SQL) can be open or closed at each end point, we need to know what the next and previous values would be at the specified granularity. And while you can do some operations without knowing this, there are many you can't. For instance, you could not tell whether two [] or () ranges were adjacent, or be able to coalesce an array of ranges.

> It would seem to me the real useful uses of ranges would be the
> operations overlaps, disjoint, precedes, follows, etc, which could all
> be defined on any well-ordered type (wherever greater-than and
> less-than are well defined). No need to discretise anything.
>
> Do you have any actual use case for a discretized range type for
> timestamp?
Scott Bailey <artacus@comcast.net> writes: > Because intervals (mathematical not SQL) can be open or closed at each > end point we need to know what the next an previous value would be at > the specified granularity. And while you can do some operations without > knowing this, there are many you can't. For instance you could not tell > whether two [] or () ranges were adjacent, or be able to coalesce an > array of ranges. This statement seems to me to demonstrate that you don't actually understand the concept of open and closed ranges. It has nothing whatsoever to do with assuming that the data type is discrete; these concepts are perfectly well defined for the reals, for example. What it is about is whether the inclusion conditions are "< bound" or "<= bound". regards, tom lane
>> Because intervals (mathematical not SQL) can be open or closed at each
>> end point we need to know what the next and previous value would be at
>> the specified granularity. And while you can do some operations without
>> knowing this, there are many you can't. For instance you could not tell
>> whether two [] or () ranges were adjacent, or be able to coalesce an
>> array of ranges.
>
> This statement seems to me to demonstrate that you don't actually
> understand the concept of open and closed ranges. It has nothing
> whatsoever to do with assuming that the data type is discrete;
> these concepts are perfectly well defined for the reals, for example.
> What it is about is whether the inclusion conditions are "< bound"
> or "<= bound".

IMHO the first question is whether, for integers, [1,2] UNION [3,5] should be equal to [1,5]. In math this is certainly true, and defining 'next' seems like a reasonable way to establish this in Postgres.

The next question is whether, for floats, [1,3-FLT_EPSILON] UNION [3,5] should be [1,5].

And the next question is whether, for numeric(6,2), [1,2.99] UNION [3,5] should be [1,5].

FWIW, I would answer yes, no, yes to those three questions.

-Nathan
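Nathan's three questions all reduce to whether a type-specific successor function exists. A hedged sketch of the adjacency test (the helper names and the choice to treat float as continuous are illustrative assumptions, not settled design):

```python
from decimal import Decimal

# Hypothetical per-type successor functions; None means "continuous, no successor".
def next_int(x): return x + 1
def next_numeric_6_2(x): return x + Decimal('0.01')  # granule from typmod (6,2)
next_float = None                                    # treat float as continuous

def coalescible(hi_a, lo_b, next_fn):
    """[.., hi_a] and [lo_b, ..] merge into one range iff next(hi_a) == lo_b."""
    return next_fn is not None and next_fn(hi_a) == lo_b

# integers: [1,2] UNION [3,5] = [1,5]           -> yes
assert coalescible(2, 3, next_int)
# numeric(6,2): [1,2.99] UNION [3,5] = [1,5]    -> yes
assert coalescible(Decimal('2.99'), Decimal('3'), next_numeric_6_2)
# floats: no successor, so no coalescing        -> no
assert not coalescible(2.99, 3.0, next_float)
```

The "yes, no, yes" answers then fall directly out of which types get a next_* function.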
On Mon, 2009-12-14 at 11:25 -0500, Tom Lane wrote:
> Scott Bailey <artacus@comcast.net> writes:
>> Because intervals (mathematical not SQL) can be open or closed at each
>> end point we need to know what the next and previous value would be at
>> the specified granularity. And while you can do some operations without
>> knowing this, there are many you can't. For instance you could not tell
>> whether two [] or () ranges were adjacent, or be able to coalesce an
>> array of ranges.
>
> This statement seems to me to demonstrate that you don't actually
> understand the concept of open and closed ranges. It has nothing
> whatsoever to do with assuming that the data type is discrete;
> these concepts are perfectly well defined for the reals, for example.
> What it is about is whether the inclusion conditions are "< bound"
> or "<= bound".

Of course you can still define the obvious "contains" and "overlaps" operators for a continuous range. But there are specific differences in the API between a discrete range and a continuous range, which is what Scott was talking about.

1. With discrete ranges, you can change the displayed inclusivity/exclusivity without changing the value. For instance in the integer domain, [ 5, 10 ] is the same value as [ 5, 11 ). This works on both input and output. It is not possible to change the display for continuous ranges.

2. With discrete ranges, you can get the last point before the range, the first point in the range, the last point in the range, and the first point after the range. These are more than enough to describe the range completely. For continuous ranges, those functions will fail depending on the inclusivity/exclusivity of the range. Practically speaking, you would want to have a different set of accessors: start_inclusive(), start_point(), end_point(), and end_inclusive(). However, those functions can't be usefully defined on a discrete range.

We can't choose the API for continuous ranges as the API for discrete ranges as well. If we did, then we would think that [5, 10] and [5, 11) were not equal, but they are. Similarly, we would think that [5, 10] and [11, 12] were not adjacent, but they are.

So there are certainly some user-visible API differences between the two, and I don't think those differences can be hidden.

Regards,
Jeff Davis
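Point 1 above amounts to canonicalizing discrete ranges to a single form. A toy sketch over Python ints (assuming a successor of +1; the function name is made up for illustration):

```python
def normalize(lo, hi, lo_inc, hi_inc, next_fn=lambda x: x + 1):
    """Canonicalize a discrete range to half-open [lo, hi) form."""
    if not lo_inc:
        lo = next_fn(lo)   # '(' -> '[' by bumping the lower bound
    if hi_inc:
        hi = next_fn(hi)   # ']' -> ')' by bumping the upper bound
    return (lo, hi)

# [5, 10] and [5, 11) are the same integer range once normalized:
assert normalize(5, 10, True, True) == normalize(5, 11, True, False) == (5, 11)
# ...and (4, 10] normalizes to the same value as well:
assert normalize(4, 10, False, True) == (5, 11)
# Adjacency also becomes a simple equality test: [5,10] next to [11,12]
assert normalize(5, 10, True, True)[1] == normalize(11, 12, True, True)[0]
```

For a continuous base type there is no next_fn, so no such canonical form exists and the inclusivity flags must be stored and exposed, which is exactly the API split described above.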
On Mon, 2009-12-14 at 09:58 -0500, Tom Lane wrote:
> In particular, the granularity examples you give seem to assume that
> the underlying datatype is exact not approximate --- which among other
> things will mean that it fails to work for float timestamps. Since
> timestamps are supposedly the main use-case, that's pretty troubling.

Additionally, granularity for timestamps is not particularly useful when you get to things like "days" and "months", which don't have a clean algebra.

Is the granule there only to try to support continuous ranges? If so, I don't think it's necessary if we expose the API differences I outlined in another email in this thread. Also, that would mean that we don't need a granule for float, because we can already treat it as discrete*.

Regards,
Jeff Davis

*: nextafter() allows you to increment or decrement a double (loosely speaking), and according to the man page it's part of C99 and POSIX.1-2001.
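The nextafter() idea in the footnote can be sketched in Python too (math.nextafter, available since Python 3.9, wraps the same C99 function): the true successor of a float is the next representable double, not some arbitrary granule like x + 0.00001.

```python
import math

x = 1.0
succ = math.nextafter(x, math.inf)   # next representable double above 1.0

assert succ > x
assert math.nextafter(succ, -math.inf) == x   # prior(next(x)) == x
# No double fits between x and succ, so in this sense [a, x] and
# [succ, b] really are adjacent float ranges with nothing in between.
assert (x + succ) / 2 in (x, succ)
```

This is the sense in which doubles are "already discrete": the steps exist, they are just machine-defined rather than user-chosen.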
Nathan Boley <npboley@gmail.com> writes: >> This statement seems to me to demonstrate that you don't actually >> understand the concept of open and closed ranges. > IMHO the first question is whether, for integers, [1,2] UNION [3,5] > should be equal to [1,5]. In math this is certainly true, and defining > 'next' seems like a reasonable way to establish this in postgres. Well, that's nice to have (not essential) for data types that actually are discrete. It's not a sufficient argument for creating a definition that is flat out broken for non-discrete datatypes. It might be worth pointing out here that the concept of an open interval was only invented in the first place for dealing with a continuum. If you could assume the underlying set is discrete, every open interval could be replaced with a closed one, using the next or prior value as the bound instead. There would be no need for two kinds of interval. If you are intending to support UNION on intervals, you are going to need some more-general representation anyway. (I trust you don't think we will accept a datatype for which [1,2] UNION [3,5] works but [1,2] UNION [4,5] throws an error.) So whether the code can reduce the result of a union to a single range or has to keep it as two ranges is only an internal optimization issue anyhow, not something that should drive an artificial (and infeasible) attempt to force every domain to be discrete. regards, tom lane
On Mon, 2009-12-14 at 10:00 -0800, Nathan Boley wrote:
> IMHO the first question is whether, for integers, [1,2] UNION [3,5]
> should be equal to [1,5]. In math this is certainly true, and defining
> 'next' seems like a reasonable way to establish this in postgres.

[ you say "yes" ] Agreed.

> The next question is whether, for floats, [1,3-FLT_EPSILON] UNION
> [3,5] should be [1,5].

[ you say "no" ] I think this should be true, because all floats between 1 and 5 are contained. I don't feel too strongly about this, so I would not complain if floats were treated as continuous.

> And the next question is whether, for numeric(6,2), [1,2.99] UNION
> [3,5] should be [1,5].

[ you say "yes" ] I almost agree. Unfortunately, typmod isn't really a part of the type; it just affects input/output. So we can't really use it that way -- as Tom points out, typmod is not passed along to functions that take the value. But if it were a part of the type, then I would certainly agree.

Regards,
Jeff Davis
Jeff Davis wrote: > So there are certainly some user-visible API differences between the > two, and I don't think those differences can be hidden. ISTM you are saying we should have two types of range types. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Jeff Davis <pgsql@j-davis.com> writes: > On Mon, 2009-12-14 at 11:25 -0500, Tom Lane wrote: >> This statement seems to me to demonstrate that you don't actually >> understand the concept of open and closed ranges. > Of course you can still define the obvious "contains" and "overlaps" > operators for a continuous range. But there are specific differences in > the API between a discrete range and a continuous range, which is what > Scott was talking about. Well, if the intention is to invent two different kinds of ranges, with different operators, for continuous and discrete data types, then fine. But the original post suggested no such thing, and provided (unworkable) examples suggesting that the intent was to force every type to be treated as discrete whether that makes any sense or not. The main question I would have is how to tell whether the underlying type is really discrete. If we allow people to define things like "range over float4 with 0.000001 step", then somebody will try to do it --- and file bugs when it doesn't work sanely. I don't think there is anything in our API for user-defined types that really tells you whether it's an exact or approximate type. (Also, stuff like strings simply doesn't have any sane concept of a unique next or previous value. I think the number of types that are really discrete in this sense is very small, like maybe just ints and enums.) regards, tom lane
Tom Lane wrote:
> Scott Bailey <artacus@comcast.net> writes:
>> Because intervals (mathematical not SQL) can be open or closed at each
>> end point we need to know what the next and previous value would be at
>> the specified granularity. And while you can do some operations without
>> knowing this, there are many you can't. For instance you could not tell
>> whether two [] or () ranges were adjacent, or be able to coalesce an
>> array of ranges.
>
> This statement seems to me to demonstrate that you don't actually
> understand the concept of open and closed ranges. It has nothing
> whatsoever to do with assuming that the data type is discrete;
> these concepts are perfectly well defined for the reals, for example.
> What it is about is whether the inclusion conditions are "< bound"
> or "<= bound".

I won't address how you draw your conclusions here. But I find it 'interesting' that you assume that I don't know what I'm talking about rather than assume you don't fully understand what I'm talking about.

Anyhow. For any given range there may be 4 combinations of values: either the first value included in the range '[' or the last value preceding the start of the range '('; and the last value included in the range ']' or the first value following the end of the range ')'. We aren't going to store all four data points, so we need to normalize into the most common form, a half-open interval [), and store just those two values: the first value included in the range and the first value after the end of our range.

So let's say you are using a numeric range to model the high and low values of stocks trading on a given day. Now imagine doing this with no concept of granularity. You will most likely be given a range [low, high] with inclusive end points. So how do you convert that to a closed-open interval so you can store it? Is 20.42000000000000000001 the next value after 20.42? Probably not. You are going to want to define 0.01 as the granularity for this (either type or column) so that 20.43 is.

Or again, are the ranges [14.0, 22.0] and [22.1, 29.0] adjacent? Maybe, maybe not. There is no way to tell without knowing the granularity. Perhaps the granularity is 0.000000001 and there are a billion values that are not included. Or perhaps the granularity is 0.1 and they are adjacent.

Scott
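The stock example above can be sketched with an explicit granule (a toy model using Python's Decimal; the 0.01 granule and the helper names are assumptions for illustration):

```python
from decimal import Decimal

GRANULE = Decimal('0.01')  # assumed granularity of the numeric range

def to_half_open(low, high):
    """Store an inclusive [low, high] numeric range in half-open [) form."""
    return (low, high + GRANULE)

def adjacent(r1, r2):
    """Half-open ranges are adjacent exactly when one ends where the other begins."""
    return r1[1] == r2[0] or r2[1] == r1[0]

# [20.17, 20.42] stores as [20.17, 20.43): the next value after 20.42 is 20.43.
day = to_half_open(Decimal('20.17'), Decimal('20.42'))
assert day == (Decimal('20.17'), Decimal('20.43'))

# With a 0.01 granule, [14.0, 22.0] and [22.1, 29.0] are NOT adjacent:
a = to_half_open(Decimal('14.0'), Decimal('22.0'))
b = to_half_open(Decimal('22.1'), Decimal('29.0'))
assert not adjacent(a, b)   # next(22.0) is 22.01, not 22.1 -- there's a gap
```

Change GRANULE to Decimal('0.1') and the same two ranges become adjacent, which is exactly the ambiguity being described.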
Tom Lane wrote:
> Scott Bailey <artacus@comcast.net> writes:
>> So basically I have an anyrange pseudo type with the functions prev,
>> next, last, etc defined. So instead of hard coding range types, we would
>> allow the user to define their own range types. Basically if we are able
>> to determine the previous and next values of the base types we'd be able
>> to define a range type. I'm envisioning in a manner much like defining
>> an enum type.
>
> I think array types, not enums, would be a better model.

I was referring to the syntax for how the user actually defines an enum, not about its implementation. Basically, what I was hoping to get out of this thread was whether it was better to allow the user to define their own range types by specifying the base type and possibly the granularity and default inclusiveness of the end points, or if we should just provide the types like period and intrange?

Scott
On Mon, 2009-12-14 at 13:32 -0500, Tom Lane wrote:
> Well, if the intention is to invent two different kinds of ranges, with
> different operators, for continuous and discrete data types, then fine.

Originally I thought most of the use cases were workable as discrete ranges. If people want continuous ranges, that's certainly doable by using a slightly different API.

> But the original post suggested no such thing, and provided (unworkable)
> examples suggesting that the intent was to force every type to be
> treated as discrete whether that makes any sense or not.

I agree; we shouldn't say we support continuous types but then force them to be treated like discrete types.

> The main question I would have is how to tell whether the underlying
> type is really discrete.

We can ask the user to provide a prior() and next() function, and if they aren't provided, we assume it's continuous.

Also, this range mechanism may be useful to get the meaningful operators for a range type. For instance, for a range join (e.g. a temporal join), we could recognize && as "overlaps" and then find the "strictly left of" operator if we wanted to do a range merge join. This might be cleaner than the previous idea to mark operator families as conforming to a certain convention for strategy numbers.

> (Also, stuff like strings simply doesn't have any sane concept of a
> unique next or previous value. I think the number of types that are
> really discrete in this sense is very small, like maybe just ints and
> enums.)

I think "countable" is a more accurate word than "discrete". Strings are discrete but not countable.

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2009-12-14 at 13:32 -0500, Tom Lane wrote:
>> The main question I would have is how to tell whether the underlying
>> type is really discrete.
>
> We can ask the user to provide a prior() and next() function, and if
> they aren't provided, we assume it's continuous.

Well, that still leaves us with the problem that Joe Schmo will file a bug when "create function next(float4) returns float4 as $$ select $1 + 0.00001 $$" doesn't behave sanely for him. I'd prefer not to leave it to the user to decide whether a type is discrete or not. The traffic on pgsql-bugs is convincing evidence that a very large fraction of our user-base doesn't understand that floats are inexact :-(

> I think "countable" is a more accurate word than "discrete". Strings are
> discrete but not countable.

It's been too long since college math classes for me to be sure whether "discrete" is really the exact term here. But I'm even more suspicious of "countable". I think a suitable diagonalization argument might show that strings are countable. That's getting a bit off-topic though...

regards, tom lane
On Mon, 2009-12-14 at 11:10 -0800, Scott Bailey wrote:
> I was referring to the syntax for how the user actually defined an enum
> not about its implementation. Basically what I was hoping to get out of
> this thread was whether it was better to allow the user to define their
> own range types by specifying the base type and possibly the granularity
> and default inclusiveness of the end points, or if we should just
> provide the types like period and intrange?

To be a little bit more specific about the alternatives:

1. Make a contrib module that creates a variety of range types like PERIOD, PERIODTZ, INT4RANGE, NUMRANGE, etc. In order to support some of the other features I intend to work on, such as a range join (e.g. temporal join), we would need to mark an operator family to know that it conforms to a certain strategy number convention. See:

http://archives.postgresql.org/pgsql-hackers/2009-10/msg01769.php
http://archives.postgresql.org/pgsql-hackers/2009-10/msg01441.php

2. Implement some kind of ANYRANGE type and associated operators, and provide a way to define new range types by providing the base type, difference type (e.g. for TIMESTAMP, the difference type would be INTERVAL) and a couple of support functions. This might be more flexible, and it would obviate the need for marking operator families.

If the second approach seems promising, we can hammer out a real proposal and see if it's viable.

Regards,
Jeff Davis
On Mon, Dec 14, 2009 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > It's been too long since college math classes for me to be sure whether > "discrete" is really the exact term here. But I'm even more suspicious > of "countable". I think a suitable diagonalization argument might show > that strings are countable. That's getting a bit off-topic though... It's actually a dovetailing argument, not a diagonalization argument, but yes, the set of strings is most certainly countable. ...Robert (former CS theory teaching assistant)
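The dovetailing Robert refers to can be sketched as a shortlex enumeration: list strings by length, then in order within each length, so every finite string gets a finite index (a toy over a two-letter alphabet):

```python
from itertools import count, product

def all_strings(alphabet='ab'):
    """Enumerate every finite string over the alphabet: '', 'a', 'b', 'aa', ..."""
    for n in count(0):
        for chars in product(alphabet, repeat=n):
            yield ''.join(chars)

gen = all_strings()
first_seven = [next(gen) for _ in range(7)]
assert first_seven == ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Note the enumeration order is shortlex, not the dictionary order people expect for text, which is why countability alone doesn't give strings a useful next()/prior() for range purposes.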
On Mon, 2009-12-14 at 14:23 -0500, Tom Lane wrote:
>> We can ask the user to provide a prior() and next() function, and if
>> they aren't provided, we assume it's continuous.
>
> Well, that still leaves us with the problem that Joe Schmo will file
> a bug when "create function next(float4) returns float4 as
> $$ select $1 + 0.00001 $$" doesn't behave sanely for him. I'd prefer
> not to leave it to the user to decide whether a type is discrete or
> not. The traffic on pgsql-bugs is convincing evidence that a very
> large fraction of our user-base doesn't understand that floats are
> inexact :-(

I don't know how we can decide such a thing. Do you have any ideas?

Perhaps we can compromise and restrict the support functions to C? That might be a high-enough wall, and of course it would keep non-superusers from confusing the underlying mechanism.

Regards,
Jeff Davis
Tom Lane wrote:
>> We can ask the user to provide a prior() and next() function, and if
>> they aren't provided, we assume it's continuous.
>
> Well, that still leaves us with the problem that Joe Schmo will file
> a bug when "create function next(float4) returns float4 as
> $$ select $1 + 0.00001 $$" doesn't behave sanely for him. I'd prefer
> not to leave it to the user to decide whether a type is discrete or
> not. The traffic on pgsql-bugs is convincing evidence that a very
> large fraction of our user-base doesn't understand that floats are
> inexact :-(

Indeed.

>> I think "countable" is a more accurate word than "discrete". Strings are
>> discrete but not countable.
>
> It's been too long since college math classes for me to be sure whether
> "discrete" is really the exact term here. But I'm even more suspicious
> of "countable". I think a suitable diagonalization argument might show
> that strings are countable. That's getting a bit off-topic though...

Right, I don't think strings are any more or less countable than integers. (And yes, it's a bit OT.)

Surely the issue from our POV is whether, given two distinct members of a class, we can ever say there is not any intervening member of the class according to some ordering. If we can't, then next() and prior() make no sense for that class.

cheers

andrew
Jeff Davis <pgsql@j-davis.com> writes: > On Mon, 2009-12-14 at 14:23 -0500, Tom Lane wrote: >> I'd prefer not to leave it to the user to decide whether a type is >> discrete or not. > I don't know how we can decide such a thing. Do you have any ideas? If the only interesting use-cases are ints and enums, maybe we could just hard-wire it. regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes: > Surely the issue from our POV is whether, given two distinct members of > a class, we can ever say there is not any intervening member of the > class according to some ordering. If we can't then next() and prior() > make no sense for that class. We would also like to be sure that the answer is not machine-dependent. For example, you can make such an assertion for two normalized floats that differ by 1 unit in the last place --- but with a different float implementation the answer could be different, and anyway a lot of the really serious problems with float ranges would stem from the range boundary value probably not being exactly what the user thinks it is. Also, it's not just "some ordering" that's of interest, it's the natural one for the datatype. The reason the countability of strings isn't relevant to us here is that diagonalization or dovetailing counts them in an ordering that people wouldn't want in practice. But that brings up something that actually is interesting: should the range mechanism have a way to identify which btree opclass is considered to define the type's sort order? Or is it enough to provide ranges in the type's default ordering? regards, tom lane
> Right, I don't think strings are any more or less countable than integers. > (and yes, it's a bit OT). Well, if I remember my algebra, I think the issue is that integers are locally finite whereas strings are not ( assuming standard orderings, of course ). -Nathan
Scott Bailey <artacus@comcast.net> writes: > I was referring to the syntax for how the user actually defined an enum > not about its implementation. Basically what I was hoping to get out of > this thread was whether it was better to allow the user to define their > own range types by specifying the base type and possibly the granularity > and default inclusiveness of the end points, or if we should just > provide the types like period and intrange? If 99% of the usefulness will come from ranges over a fixed set of datatypes, it might be best to just do that. That last 1% would be very expensive to get, when you consider all the infrastructure that would be involved with an extensible definition. If we think there's a lot of usefulness for ranges over user-defined types, then this argument doesn't help ... regards, tom lane
Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> On Mon, 2009-12-14 at 14:23 -0500, Tom Lane wrote: >>> I'd prefer not to leave it to the user to decide whether a type is >>> discrete or not. > >> I don't know how we can decide such a thing. Do you have any ideas? > > If the only interesting use-cases are ints and enums, maybe we could > just hard-wire it. I think dates could be added to that list as well. But any implementation that doesn't do ranges of timestamptz is a non-starter as far as I'm concerned. Certainly int64 timestamps and numeric are doable. And Jeff's period implementation supports float timestamps. I never use float timestamps so I can only assume that he made it work. Personally, I'd rather just see float timestamps go away. And if the range types never supported float or float timestamps, I'd be ok with that. Scott
Scott Bailey <artacus@comcast.net> writes: > Tom Lane wrote: >> If the only interesting use-cases are ints and enums, maybe we could >> just hard-wire it. > I think dates could be added to that list as well. Good point. Network addresses too probably. > But any implementation that doesn't do ranges of timestamptz are > non-starters as far as I'm concerned. Oh, I quite agree --- I'm just complaining about trying to force timestamps into a discrete model that they don't fit. What I was trying to suggest was that we could hard-wire a mechanism that says ints and a few other predetermined cases are discrete while everything else is treated as continuous. > Personally, I'd rather just see float timestamps go away. That's more or less irrelevant to my point. A large fraction of the datatypes in Postgres do not have discrete behavior. Reals, numerics, timestamps, strings, bitstrings, geometrics. Not to mention arrays and composites. Creating an artificial granularity is simply the wrong way to approach it, even for those types where there's an implementation artifact that allows you to make it sort-of-work. regards, tom lane
On Mon, Dec 14, 2009 at 01:32:08PM -0500, Tom Lane wrote: [...] > (Also, stuff like strings simply doesn't have any sane concept of a > unique next or previous value. If you are willing to limit the length, then yes, you could consider them discrete too, but... > I think the number of types that are > really discrete in this sense is very small, like maybe just ints and > enums.) ...I do agree that ranges over continuous types are the more "interesting"[1] (and possibly more useful) beast. --------- [1] Unfortunately they could turn out to be "interesting" in the sense of "may you live in interesting times" ;-) Regards -- tomás
On Mon, Dec 14, 2009 at 11:09:16AM -0800, Jeff Davis wrote: [...] > I think "countable" is a more accurate word than "discrete". Strings are > discrete but not countable. Oh, no -- strings (of finite, but arbitrary length) are not discrete -- you can always squeeze one more between two given strings. In this sense they are quite similar to rational numbers. Can we call them continuous? -- it depends, it seems that the terminology here isn't consistent: sometimes the rationals are considered continuous (as per the property above mentioned), sometimes the reals (which are a much more monstrous construct!) are referred to as "the continuum". As Robert points out, they are countable; you'd need infinite length for them to be more than that (then they would behave a bit like the reals, Cantor diagonal and all that ;-) All that said, it's moot: in computers, we can't represent strings of arbitrary length (PostgreSQL has an upper limit of about 1GB, right?). The situation is even more restricted with floats (they are much smaller; thus one could say that they're more "discrete" than strings, even). Problem with floats is -- the granule is not the "same size" over the whole range (nasty), and it's all implementation-dependent (nastier). But given an implementation, the concept of "next" and "previous" on floats is (if you give me some slack with NaNs) mostly well-defined. Same with strings up to some fixed length. Still, it seems non-discrete is a useful abstraction? Regards -- tomás
On Tue, Dec 15, 2009 at 7:28 AM, <tomas@tuxteam.de> wrote: > The situation is even more restricted with floats (they are much > smaller; thus one could say that they're more "discrete" than strings, > even). Problem with floats is -- the granule is not the "same size" over > the whole range (nasty), and it's all implementation-dependent > (nastier). But given an implementation, the concept of "next" and > "previous" on floats is (if you give me some slack with NaNs) mostly > well-defined In fact, as I only recently found out, one of the design goals of IEEE floats was specifically that they sort lexicographically and use every bit pattern. So you can always get the "next" float by just incrementing your float as a 64-bit integer. Yes that raises your value by a different amount, and it's still useful. -- greg
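Greg's observation can be demonstrated with a short sketch. The `next_float` helper is a name invented here for illustration; note that the bit-increment trick as written only holds for non-negative finite doubles (negative IEEE values are sign-magnitude, so their bit patterns sort in reverse), and `math.nextafter` needs Python 3.9+:

```python
import math
import struct

def next_float(x):
    """Next representable double after a non-negative finite x, found by
    incrementing its IEEE 754 bit pattern as a 64-bit integer."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<Q", bits + 1))[0]

y = next_float(1.0)
assert y > 1.0
assert y == math.nextafter(1.0, math.inf)   # agrees with the library routine
print(y - 1.0)                    # the "granule" at 1.0 is 2**-52, about 2.2e-16
print(next_float(1e300) - 1e300)  # ...but it is vastly larger at 1e300
```

This is exactly the "raises your value by a different amount" point: the step size between adjacent doubles grows with magnitude.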
On Tue, Dec 15, 2009 at 01:09:02PM +0000, Greg Stark wrote: [...] > In fact, as I only recently found out, one of the design goals of IEEE > floats was specifically that they sort lexicographically and use every > bit pattern. So you can always get the "next" float by just > incrementing your float as a 64-bit integer. Yes that raises your > value by a different amount, and it's still useful. Didn't know that -- thanks :-) (and as Andrew Dunstan pointed out off-list: I was wrong with my bold assertion that one can squeeze infinitely many (arbitrary length) strings between two given. This is not always the case). Regards -- tomás
tomas@tuxteam.de writes: > (and as Andrew Dunstan pointed out off-list: I was wrong with my bold > assertion that one can squeeze infinitely many (arbitrary length) > strings between two given. This is not always the case). Really? If the string length is unbounded I think you were right. regards, tom lane
Greg Stark <gsstark@mit.edu> writes: > In fact, as I only recently found out, one of the design goals of IEEE > floats was specifically that they sort lexicographically and use every > bit pattern. So you can alwys get the "next" float by just > incrementing your float as an 64-bit integer. Yes that raises your > value by a different amount, and it's still useful. There are certainly some low-level numerical analysis situations where you want to get the "next" float value, but that hardly constitutes an argument for treating ranges of floats as discrete rather than continuous. Normal users of a range datatype aren't going to be interested in dealing with that sort of inherently machine-specific behavior. Even in types where next/previous are well defined, I'm not that comfortable with having range operations depend on them. What happens when one end of your range is INT_MIN or INT_MAX? regards, tom lane
On Tue, Dec 15, 2009 at 9:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> In fact, as I only recently found out, one of the design goals of IEEE >> floats was specifically that they sort lexicographically and use every >> bit pattern. So you can alwys get the "next" float by just >> incrementing your float as an 64-bit integer. Yes that raises your >> value by a different amount, and it's still useful. > > There are certainly some low-level numerical analysis situations where > you want to get the "next" float value, but that hardly constitutes > an argument for treating ranges of floats as discrete rather than > continuous. Normal users of a range datatype aren't going to be > interested in dealing with that sort of inherently machine-specific > behavior. Yeah, I don't think we want to base this feature on something that arcane. I also have to say that I'm very skeptical of the argument that there is only a small list of types people will want this for. I don't think it's going to turn out to be all that small. ...Robert
2009/12/15 Tom Lane <tgl@sss.pgh.pa.us>: > tomas@tuxteam.de writes: > >> (and as Andrew Dunstan pointed out off-list: I was wrong with my bold >> assertion that one can squeeze infinitely many (arbitrary length) >> strings between two given. This is not always the case). > > Really? If the string length is unbounded I think you were right. Assuming lexicographical ordering (first different character determines order; end-of-string is sorted before anything else), consider the following two strings: <whatever> and <same whatever as before> + the character with the lowest value in lexicographical ordering. I don't think it is possible to get anything in between those two strings. Nicolas
Robert Haas <robertmhaas@gmail.com> writes: > I also have to say that I'm very skeptical of the argument > that there is only a small list of types people will want this for. I'm not sure that anyone has argued that. I did suggest that there might be a small list of types for which we should provide discrete behavior (ie, with next/previous functions) and the rest could have continuous behavior (without that assumption). But I quite agree that we want both types of ranges. regards, tom lane
On Tue, Dec 15, 2009 at 10:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I also have to say that I'm very skeptical of the argument >> that there is only a small list of types people will want this for. > > I'm not sure that anyone has argued that. I did suggest that there > might be a small list of types for which we should provide discrete > behavior (ie, with next/previous functions) and the rest could have > continuous behavior (without that assumption). But I quite agree > that we want both types of ranges. Oh, I think you're right. I guess I'm skeptical that the set for which discrete treatment is appropriate is a small, fixed set, too. Unless hard-coding that assumption buys us some really significant economies, I think we should avoid doing so. ...Robert
On 15.12.09 15:52, Tom Lane wrote: > tomas@tuxteam.de writes: >> (and as Andrew Dunstan pointed out off-list: I was wrong with my >> bold assertion that one can squeeze infinitely many (arbitrary >> length) strings between two given. This is not always the case). > > Really? If the string length is unbounded I think you were right. One example is "a" and "aa" (assuming "a" is the minimal character in your alphabet). The general case is the strings A and Aaaaaaa...a I think - it doesn't get any more exciting than this. This *is* a bit surprising, since one usually assumes that the ordering of strings and reals is fairly similar, since both are lexical. But note that the mapping of strings into the reals this intuition is based on (simply prefix the string with "0." and interpret as a real, or something similar if the alphabet isn't {0,1}) isn't one-to-one - the strings "1", "10", "100", ... are all mapped to the *same* real number 0.1 So for reals, the statement is reduced to the trivial fact that for every x there is no y with x < y < x. Which is of course true. best regards, Florian Pflug
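Florian's "a" and "aa" example is easy to check mechanically. A small brute-force sketch (the `strings_between` helper is invented here; plain byte-wise lexicographic order over a toy two-letter alphabet where 'a' is minimal):

```python
from itertools import product

def strings_between(lo, hi, alphabet, max_len):
    """All strings over `alphabet`, up to max_len characters, that sort
    strictly between lo and hi under plain lexicographic order."""
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if lo < s < hi:
                found.append(s)
    return found

# With 'a' the minimal character, nothing fits between "a" and "aa":
print(strings_between("a", "aa", "ab", 4))   # -> []
# ...whereas between "a" and "b" new strings can be squeezed indefinitely:
print(strings_between("a", "b", "ab", 2))    # -> ['aa', 'ab']
```

Raising `max_len` never changes the first result: any string strictly above "a" must start with "a" followed by something, and anything nonempty appended already makes it sort at or above "aa".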
Nicolas Barbier <nicolas.barbier@gmail.com> writes: > Assuming lexicographical ordering (first different character > determines order; end-of-string is sorted before anything else), > consider the following two strings: > <whatever> > and > <same whatever as before> + the character with the lowest value in > lexicographical ordering. > I don't think it is possible to get anything in between those two strings. OK, point taken. But in the real world, many locales use non-lexicographical ordering. In practice, even if you are willing to grant a maximum string length, it is tough enough to find a string just a bit greater than a given string, and damn near impossible to promise that there will be no strings between. We learned that the hard way trying to make LIKE prefix-match indexing work in non-C locales. So the long and the short of it is that next/previous are not going to work for string types, maxlength or no maxlength. regards, tom lane
Tom Lane wrote: > So the long and the short of it is that next/previous are not > going to work for string types, maxlength or no maxlength. > > > Even more importantly, I strongly doubt they would be of the slightest practical value. cheers andrew
On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote: > I'm not sure that anyone has argued that. I did suggest that there > might be a small list of types for which we should provide discrete > behavior (ie, with next/previous functions) and the rest could have > continuous behavior (without that assumption). But I quite agree > that we want both types of ranges. It seems like we're moving toward treating TIMESTAMP as continuous. If I'm correct, continuous ranges always need two extra bits of storage for the exclusivity. But for timestamps, that means 16 bytes (2 x 8-byte timestamp) turns into 17 bytes, which is really more like 20 or 24 bytes with alignment. Considering that these are likely to be used for audit or history tables, 8 bytes of waste (50%) seems excessive -- especially when treating them as discrete seems to work pretty well, at least for the int64 timestamps. Ideas? Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > If I'm correct, continuous ranges always need two extra bits of storage > for the exclusivity. But for timestamps, that means 16 bytes (2 x 8-byte > timestamp) turns into 17 bytes, which is really more like 20 or 24 bytes > with alignment. You probably need some flag bits anyway, so flailing frantically to avoid that doesn't seem like a profitable use of time. One pretty obvious use for a flag bit is open-ended ranges, ie range(something, infinity) You could only do this without a flag bit if the underlying datatype has an "infinity" value, which not all do. I'm also wondering what null range boundaries would do. Maybe that's the same as the infinity case, or maybe not. regards, tom lane
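Jeff's 16-to-17-byte arithmetic and Tom's flag-bit suggestion can be illustrated with a little serialization sketch. The flag names and layout below are invented for illustration only, not an actual on-disk format:

```python
import struct

# Hypothetical flag bits packed into one spare byte per range value.
LO_INCLUSIVE = 0x01
HI_INCLUSIVE = 0x02
LO_INFINITE  = 0x04   # open-ended on the left: lower bound is ignored
HI_INFINITE  = 0x08   # open-ended on the right: upper bound is ignored

def pack_range(lo_usec, hi_usec, flags):
    """Serialize two int64 timestamps (microseconds) plus one flag byte."""
    return struct.pack("<qqB", lo_usec, hi_usec, flags)

raw = pack_range(0, 86_400_000_000, LO_INCLUSIVE)   # [midnight, next midnight)
print(len(raw))   # -> 17 (before any alignment padding)
lo, hi, flags = struct.unpack("<qqB", raw)
assert flags & LO_INCLUSIVE and not flags & HI_INCLUSIVE
```

One byte comfortably holds both the exclusivity bits and the open-endedness bits, which is the sense in which the extra bits come "for free" once you pay for the 17th byte at all.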
Jeff Davis wrote: > On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote: >> I'm not sure that anyone has argued that. I did suggest that there >> might be a small list of types for which we should provide discrete >> behavior (ie, with next/previous functions) and the rest could have >> continuous behavior (without that assumption). But I quite agree >> that we want both types of ranges. > > It seems like we're moving toward treating TIMESTAMP as continuous. > > If I'm correct, continuous ranges always need two extra bits of storage > for the exclusivity. But for timestamps, that means 16 bytes (2 x 8-byte > timestamp) turns into 17 bytes, which is really more like 20 or 24 bytes > with alignment. > > Considering that these are likely to be used for audit or history > tables, 8 bytes of waste (50%) seems excessive -- especially when > treating them as discrete seems to work pretty well, at least for the > int64 timestamps. Would it be OK if we handled float timestamp ranges as continuous and int64 timestamps discrete? You effectively lose the ability to build non-contiguous sets with continuous ranges. Which is integral to the work I'm doing (union, intersect, coalesce and minus sets of ranges) As for the extra bits, would it be better to just require continuous ranges to be either [] or [)? But I don't know which would be preferred. My inclination would be toward [), but Tom seemed to indicate that perhaps [] was the norm. Scott
On Tue, Dec 15, 2009 at 11:31:05AM -0800, Scott Bailey wrote: > Jeff Davis wrote: > >On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote: > > Would it be OK if we handled float timestamp ranges as continuous > and int64 timestamps discrete? That sounds like a recipe for disaster. Whatever timestamp ranges are, float and int64 should be treated the same way so as not to get "surprises" due to implementation details. > You effectively lose the ability to build non-contiguous sets with > continuous ranges. Which is integral to the work I'm doing (union, > intersect, coalesce and minus sets of ranges) > > As for the extra bits, would it be better to just require continuous > ranges to be either [] or [)? But I don't know which would be > preferred. My inclination would be toward [), but Tom seemed to > indicate that perhaps [] was the norm. [] makes certain operations--namely the important ones in calendaring--impossible, or at least incredibly kludgy, to do. I think we ought to leave openness at each end up to the user, independent of the underlying implementation details. FWIW, I think it would be a good idea to treat timestamps as continuous in all cases. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> writes: > On Tue, Dec 15, 2009 at 11:31:05AM -0800, Scott Bailey wrote: >> As for the extra bits, would it be better to just require continuous >> ranges to be either [] or [)? But I don't know which would be >> preferred. My inclination would be toward [), but Tom seemed to >> indicate that perhaps [] was the norm. > [] makes certain operations--namely the important ones in > calendaring--impossible, or at least incredibly kludgy, to do. I > think we ought to leave openness at each end up to the user, > independent of the underlying implementation details. Yes. A range implementation that couldn't support all four cases of [], [), (], () would be seriously crippled IMO. regards, tom lane
On Tue, 2009-12-15 at 13:15 -0500, Tom Lane wrote: > You probably need some flag bits anyway, so flailing frantically to > avoid that doesn't seem like a profitable use of time. I think "need" and "flailing" are both a little too strong here. The biggest use case will almost certainly be ranges of timestamps, and most of those people will have no use for flag bits (or if they do, it might not be worth an 8-byte-per-value overhead). > One pretty obvious use for a flag bit is open-ended ranges, ie > range(something, infinity) > You could only do this without a flag bit if the underlying datatype > has an "infinity" value, which not all do. True, but frustrating for people whose primary use case is timestamps or floats, which already have 'infinity'. Also frustrating for people who don't mind using the min/max integer as a workaround. > I'm also wondering what null range boundaries would do. Maybe that's > the same as the infinity case, or maybe not. I would prefer to avoid allowing NULL range boundaries for the following reasons:
* it reminds me of MySQL dates with zeros in them
* we are not guided by any kind of standard
* we'd have to invent semantics that are pretty far outside of those defined for mathematical intervals
* we could easily create more confusion or allow subtle traps for users
* we aren't guided by a clear use case (or I haven't seen it) that isn't equally solvable (or better) using some other method
* it would impose that extra 8-byte storage burden on people who may not need any flag bits
Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > I think "need" and "flailing" are both a little too strong here. The > biggest use case will almost certainly be ranges of timestamps, and most > of those people will have no use for flag bits (or if they do, it might > not be worth an 8-byte-per-value overhead). When the alternatives are a crippled implementation that might not do what I want at all, or a full-featured implementation that takes another 8 bytes per value, I'll take the second. The 8 bytes don't matter if it doesn't solve my problem. > I would prefer to avoid allowing NULL range boundaries for the following > reasons: > * it reminds me of MySQL dates with zeros in them If we use that notation to represent an open-ended interval, it seems perfectly reasonable to me. And it doesn't require any assumptions about whether the underlying type has an infinity. I think it's a seriously bad idea to tell people that they should depend on min or max values of a datatype to substitute for the lack of open-ended intervals. That sort of technique creates unnecessary implementation dependencies, and magic numbers (especially ones with a dozen or two digits in them) are bad things for readability in any case. To take just one example that's pretty near at hand: if type date had had an exact published max value that applications were hard-wiring into their code, we'd not have been able to change it to add 'infinity' as a special value. regards, tom lane
David Fetter wrote: > On Tue, Dec 15, 2009 at 11:31:05AM -0800, Scott Bailey wrote: >> Jeff Davis wrote: >>> On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote: >> Would it be OK if we handled float timestamp ranges as continuous >> and int64 timestamps discrete? > > That sounds like a recipe for disaster. Whatever timestamp ranges > are, float and int64 should be treated the same way so as not to get > "surprises" due to implementation details. > >> You effectively lose the ability to build non-contiguous sets with >> continuous ranges. Which is integral to the work I'm doing (union, >> intersect, coalesce and minus sets of ranges) >> >> As for the extra bits, would it be better to just require continuous >> ranges to be either [] or [)? But I don't know which would be >> preferred. My inclination would be toward [), but Tom seemed to >> indicate that perhaps [] was the norm. > > [] makes certain operations--namely the important ones in > calendaring--impossible, or at least incredibly kludgy, to do. I > think we ought to leave openness at each end up to the user, > independent of the underlying implementation details. > > FWIW, I think it would be a good idea to treat timestamps as > continuous in all cases. Ok, let me give an example of what we can do with the current implementations that would not be possible with timestamps if we implement as suggested. Jeff's implementation uses a 1 microsecond step size or granule. And my implementation uses an interval step size and can be configured database wide, but default is 1 second. The function below takes two period arrays that can have overlapping and adjacent elements. It subtracts all values in pa1 that intersect with values in pa2. So perhaps pa1 is all of your work shifts for the month and pa2 is a combination of your leave and holidays. The result is a coalesced non-contiguous set of the times you would actually be working. 
But to do this kind of thing you need to be able to determine prior, first, last and next. I need an implementation that can do this for timestamps and not just ints and dates.

CREATE OR REPLACE FUNCTION period_minus(
    pa1 IN period[],
    pa2 IN period[]
) RETURNS period[] AS $$
SELECT array_agg(prd)
FROM (
    SELECT period((t_in).start_time, MIN((t_out).end_time)) AS prd
    FROM (
        SELECT DISTINCT first(p) AS start_time
        FROM unnest($1) p
        WHERE NOT contains($2, first(p))
          AND NOT contains($1, prior(p))
        UNION
        SELECT DISTINCT next(p)
        FROM unnest($2) p
        WHERE contains($1, next(p))
          AND NOT contains($2, next(p))
    ) t_in
    JOIN (
        SELECT next(p) AS end_time
        FROM unnest($1) p
        WHERE NOT contains($1, next(p))
        UNION ALL
        SELECT first(p)
        FROM unnest($2) p
        WHERE contains($1, first(p))
          AND NOT contains($2, prior(p))
    ) t_out ON t_in.start_time < t_out.end_time
    GROUP BY t_in.start_time
    ORDER BY t_in.start_time
) sub;
$$ LANGUAGE 'sql' IMMUTABLE STRICT;
On Tue, 2009-12-15 at 11:49 -0800, David Fetter wrote: > That sounds like a recipe for disaster. Whatever timestamp ranges > are, float and int64 should be treated the same way so as not to get > "surprises" due to implementation details. It's a compile-time option that will change the semantics of timestamps regardless. Anyone who changes between float and int64 timestamps may experience problems for a number of different reasons. What was unique before might no longer be unique, for instance. > FWIW, I think it would be a good idea to treat timestamps as > continuous in all cases. I disagree. There is a lot of value in treating timestamp ranges as discrete. One big reason is that the ranges can be translated between the different input/output forms, and there's a canonical form. As we know, a huge amount of the value in an RDBMS is unifying data from multiple applications with different conventions. So, let's say one application uses (] and another uses [). If you are mixing the data and returning it to the application, you want to be able to provide the result according to its convention. You can't do that with a continuous range. And things get more interesting: if you mix (] and [), then range_union will produce () and range_intersect will produce []. So now you have all four conventions floating around the same database. Another reason you might mix conventions: say you have log data from several sources, and some sources provide timestamps for an event which is essentially "instantaneous" and other sources will log the period of time over which the event occurred, in [) format. To mix the data coherently, the correct thing to do is call the instantaneous points a singleton range; but the only way to represent a singleton continuous range is by using []. Whenever you mix conventions, you either have to be able to change the format (which is only possible with discrete ranges) or teach the application to understand your convention. 
And if you don't have a canonical form (which is only possible with discrete ranges), you can't reliably compare values for equality, or see if they are adjacent. Saying that discrete ranges are unnecessary is essentially saying that there's only a use for one convention; or that the conventions will never be mixed; or that applications will always be smart enough to sort it out for themselves. Regards,Jeff Davis
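Jeff's canonical-form argument can be sketched for a discrete base type. The helpers below are invented for illustration (integer ranges, where next() is simply +1): every value is normalized to half-open form for storage and comparison, then rendered back in whichever of the four conventions a client expects:

```python
def canonical(lo, hi, lo_inc, hi_inc):
    """Normalize an integer range to the half-open form [lo, hi), using
    next() = +1 -- which is only possible because the type is discrete."""
    if not lo_inc:
        lo += 1          # (3, ...  becomes  [4, ...
    if hi_inc:
        hi += 1          # ..., 5]  becomes  ..., 6)
    return lo, hi

def render(lo, hi, lo_inc, hi_inc):
    """Display a canonical [lo, hi) range in any of the four conventions."""
    if not lo_inc:
        lo -= 1
    if hi_inc:
        hi -= 1
    return ("[" if lo_inc else "(") + f"{lo},{hi}" + ("]" if hi_inc else ")")

# [1,3), [1,2], and (0,2] all denote the same set of integers {1, 2} ...
assert canonical(1, 3, True, False) == canonical(1, 2, True, True) == canonical(0, 2, False, True)
# ... so a stored value can be shown in whatever convention a client expects:
lo, hi = canonical(1, 2, True, True)
print(render(lo, hi, True, False))   # -> [1,3)
print(render(lo, hi, False, True))   # -> (0,2]
```

Equality and adjacency tests fall out of the canonical form for free; for a continuous type there is no such normalization, which is the crux of the disagreement.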
Jeff Davis <pgsql@j-davis.com> writes: > On Tue, 2009-12-15 at 11:49 -0800, David Fetter wrote: >> FWIW, I think it would be a good idea to treat timestamps as >> continuous in all cases. > I disagree. There is a lot of value in treating timestamp ranges as > discrete. > One big reason is that the ranges can be translated between the > different input/output forms, and there's a canonical form. As we know, > a huge amount of the value in an RDBMS is unifying data from multiple > applications with different conventions. Actually, that is exactly one of the reasons why what you propose is a *bad* idea. You want to institutionalize application dependence on a non-portable implementation detail, namely the granularity of machine representation of what's in principle a continuous value. That's one of the fastest routes to non-unifiable data I can think of. > So, let's say one application uses (] and another uses [). If you are > mixing the data and returning it to the application, you want to be able > to provide the result according to its convention. You can't do that > with a continuous range. The above is nonsense. [1,2) and [1,2] are simply different objects. A design that assumes that it is always possible to replace one by the other is broken so badly it's not even worth discussing. The only reason you'd have applications that fail to handle both open and closed intervals would be if someone were to create an implementation that didn't support both from the outset. Which we need not and should not do. > And things get more interesting: if you mix (] and [), then range_union > will produce () and range_intersect will produce []. So now you have all > four conventions floating around the same database. Which is why it's a good idea to support all four... regards, tom lane
Scott Bailey <artacus@comcast.net> writes: > Ok, let me give an example of what we can do with the current > implementations that would not be possible with timestamps if we > implement as suggested. ... > The function below takes two period arrays that can have overlapping and > adjacent elements. It subtracts all values in pa1 that intersect with > values in pa2. So perhaps pa1 is all of your work shifts for the month > and pa2 is a combination of your leave and holidays. The result is a > coalesced non-contiguous set of the times you would actually be working. The proposed problem is certainly soluble without any assumptions of discreteness. The answer might not look very much like the way you chose to code it here, but that's not an argument for adopting a fundamentally incorrect worldview. If this were an amazingly short and beautiful piece of code, it might support your argument, but it's neither. regards, tom lane
Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> On Tue, 2009-12-15 at 11:49 -0800, David Fetter wrote: >>> FWIW, I think it would be a good idea to treat timestamps as >>> continuous in all cases. > >> I disagree. There is a lot of value in treating timestamp ranges as >> discrete. > >> One big reason is that the ranges can be translated between the >> different input/output forms, and there's a canonical form. As we know, >> a huge amount of the value in an RDBMS is unifying data from multiple >> applications with different conventions. > > Actually, that is exactly one of the reasons why what you propose is > a *bad* idea. You want to institutionalize application dependence on > a non-portable implementation detail, namely the granularity of machine > representation of what's in principle a continuous value. That's one > of the fastest routes to non-unifiable data I can think of. > >> So, let's say one application uses (] and another uses [). If you are >> mixing the data and returning it to the application, you want to be able >> to provide the result according to its convention. You can't do that >> with a continuous range. > > The above is nonsense. [1,2) and [1,2] are simply different objects. > A design that assumes that it is always possible to replace one by > the other is broken so badly it's not even worth discussing. I don't hear anyone arguing that. But you should be able to convert between [1,2], [1,3), (0,3) and (0,2]. > The only reason you'd have applications that fail to handle both open > and closed intervals would be if someone were to create an > implementation that didn't support both from the outset. Which we > need not and should not do. > >> And things get more interesting: if you mix (] and [), then range_union >> will produce () and range_intersect will produce []. So now you have all >> four conventions floating around the same database. > > Which is why it's a good idea to support all four... I don't understand you then. 
Where do you suppose we would define the inclusiveness for the value? At the type level, at the column level, or at the value level? A design that allows values of different inclusiveness and offers no means to convert from one to another is worthless.
> If this were an amazingly > short and beautiful piece of code, it might support your argument, > but it's neither. Well we can't all be arrogant brainiacs.
I wrote: > The proposed problem is certainly soluble without any assumptions > of discreteness. To be concrete, I think it could be approached like this: Assume the datatype provides a built-in function period_except(p1 period, p2 period) returns setof period which can return zero, one, or two rows depending on the inputs: no rows if p1 is completely contained in p2; one row if p1 partially overlaps p2, for example:

	[1,4] except [3,5] returns [1,3)
	[4,6] except [1,5) returns [5,6]

two rows if p1 properly contains p2, for example:

	[1,10] except [4,5] returns [1,4) and (5,10]
	[1,10] except [9,10) returns [1,9) and [10,10]

and of course just p1 if p1 and p2 don't overlap at all. Given such a function it's a simple matter of successively removing each element of p2[] from the set representing the current members of p1[]. The way that I'd find most natural to code that is a loop, along the lines of

	foreach p2_member in unnest(p2) loop
	    p1 := array(select period_except(p1_member, p2_member)
	                from unnest(p1) p1_member);
	end loop;

But maybe it can be done in a single SQL command. As this example makes clear, when dealing with continuous intervals you *must* admit both open and closed intervals, else you don't have a way to represent the results of "except". Maybe part of the failure to communicate here arises from your desire to try to avoid supporting both kinds of intervals. But I think you really have to do it if you want to deal with data that hasn't got any natural granularity. regards, tom lane
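The period_except semantics described above can be modeled in a few lines outside SQL. The following is a minimal Python sketch; the tuple encoding of bounds and all names are illustrative assumptions, not the eventual PostgreSQL API:

```python
def _lo(v, incl):
    # sort key for a lower bound: an inclusive bound starts earlier
    return (v, 0 if incl else 1)

def _hi(v, incl):
    # sort key for an upper bound: an inclusive bound ends later
    return (v, 1 if incl else 0)

def _nonempty(lo, lo_incl, hi, hi_incl):
    return _lo(lo, lo_incl) < _hi(hi, hi_incl)

def period_except(p1, p2):
    """Pieces of p1 not covered by p2 -- zero, one, or two periods.
    A period is a tuple (lo, lo_incl, hi, hi_incl)."""
    lo1, li1, hi1, hc1 = p1
    lo2, li2, hi2, hc2 = p2
    out = []
    # piece of p1 below p2: cap the upper end just under p2's lower bound
    cap = min((hi1, hc1), (lo2, not li2), key=lambda b: _hi(*b))
    if _nonempty(lo1, li1, *cap):
        out.append((lo1, li1) + cap)
    # piece of p1 above p2: floor the lower end just over p2's upper bound
    flo = max((lo1, li1), (hi2, not hc2), key=lambda b: _lo(*b))
    if _nonempty(*flo, hi1, hc1):
        out.append(flo + (hi1, hc1))
    return out
```

Note how the "except" of two closed ranges necessarily produces half-open pieces, which is exactly the argument for supporting all four inclusivity forms.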
On Tue, 2009-12-15 at 17:17 -0500, Tom Lane wrote: > Actually, that is exactly one of the reasons why what you propose is > a *bad* idea. You want to institutionalize application dependence on > a non-portable implementation detail, namely the granularity of machine > representation of what's in principle a continuous value. That's one > of the fastest routes to non-unifiable data I can think of. Based on the premise that timestamps are a continuous value and the granularity/precision is entirely an implementation detail, you're right. But I disagree with the premise, at least in some cases that I think are worthwhile. See reference [1]. > The above is nonsense. [1,2) and [1,2] are simply different objects. > A design that assumes that it is always possible to replace one by > the other is broken so badly it's not even worth discussing. I don't understand this point at all. [1,2) and [1,2] are different values. Of course they are not interchangeable. If you think I'm proposing that we drop inclusivity/exclusivity before telling the application, that's not what I'm proposing at all. I'm proposing that, at least in some circumstances, it's important to be able to display the same value in different formats -- e.g. [1, 3) or [1, 2], depending on what the application expects. Similar to a timezone adjustment. Regards, Jeff Davis [1] "Temporal Data and the Relational Model" by C.J. Date, et al., uses discrete time throughout the entire book, aside from a brief discussion at the beginning.
On 12/15/09 2:40 PM, Scott Bailey wrote: >> If this were an amazingly >> short and beautiful piece of code, it might support your argument, >> but it's neither. > > Well we can't all be arrogant brainiacs. Tom, Scott, Let's keep it constructive here. Thanks! --Josh
On Dec 15, 2009, at 5:40 PM, Jeff Davis wrote: > If you think I'm proposing that we drop inclusivity/exclusivity before > telling the application, that's not what I'm proposing at all. I'm > proposing that, at least in some circumstances, it's important to be > able to display the same value in different formats -- e.g. [1, 3) or > [1, 2], depending on what the application expects. Similar to a timezone > adjustment. I think it would help the discussion if you could provide some real examples. I suspect you're thinking of things like scheduling apps, where it's important to be able to do things like "what's the next available time slot?". There are probably ways to make that kind of thing easier without resorting to discrete time. > [1] "Temporal Data and the Relational Model" by C.J. Date, et al., uses > discrete time throughout the entire book, aside from a brief discussion > at the beginning. I find myself wondering if they were influenced by the technology available at the time... -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Dec 15, 2009, at 11:34 AM, Jeff Davis wrote: > On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote: >> I'm not sure that anyone has argued that. I did suggest that there >> might be a small list of types for which we should provide discrete >> behavior (ie, with next/previous functions) and the rest could have >> continuous behavior (without that assumption). But I quite agree >> that we want both types of ranges. > > It seems like we're moving toward treating TIMESTAMP as continuous. > > If I'm correct, continuous ranges always need two extra bits of storage > for the exclusivity. But for timestamps, that means 16 bytes (2 x 8-byte > timestamp) turns into 17 bytes, which is really more like 20 or 24 bytes > with alignment. > > Considering that these are likely to be used for audit or history > tables, 8 bytes of waste (50%) seems excessive -- especially when > treating them as discrete seems to work pretty well, at least for the > int64 timestamps. > > Ideas? Now that varlena's don't have an enormous fixed overhead, perhaps it's worth looking at using them. Obviously some operations would be slower, but for your stated examples of auditing and history, I suspect that you're not going to notice the overhead that much. I'm not sure if the best way to do this would be to support a varlena timestamp or to take fixed-size timestamps and convert them into varlena periods. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Dec 15, 2009, at 3:40 PM, Jeff Davis wrote: > Based on the premise that timestamps are a continuous value and the > granularity/precision is entirely an implementation detail, you're > right. But I disagree with the premise, at least in some cases that I > think are worthwhile. The argument is, in essence: DECIMAL is continuous. DECIMAL(10,3) is discrete. timestamptz in general is a continuous value (unless we're talking Planck times :) ). There is no way for us to guarantee that next(timestamptz) will have the same value across all platforms; its epsilon is platform dependent. However, if we specify a scale on timestamptz, it becomes much more useful. Just making up a syntax, if we had timestamptz(milliseconds), then it's discrete and we know what next(timestamptz(milliseconds)) is. But in the current implementation, the only way I can see making that work is if we specify a scale for timestamptz, and that strikes me as a big change to its semantics. -- Christophe Pettus xof@thebuild.com
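The DECIMAL analogy above can be made concrete: once a scale is declared, the successor value is well defined; without one it is not. A toy sketch (the function name is illustrative, not proposed syntax):

```python
from decimal import Decimal

def next_at_scale(x, scale):
    """Successor of x once a scale (digits after the point) is fixed.
    Analogous to next(timestamptz(milliseconds)): the step is 10**-scale."""
    step = Decimal(1).scaleb(-scale)  # e.g. scale=3 -> Decimal('0.001')
    return x + step
```

For example, next_at_scale(Decimal("1.250"), 3) is Decimal("1.251"), whereas for an unscaled DECIMAL no such "next" value exists.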
Tom Lane wrote: > I wrote: >> The proposed problem is certainly soluble without any assumptions >> of discreteness. > > To be concrete, I think it could be approached like this: > > Assume the datatype provides a built-in function > > period_except(p1 period, p2 period) returns setof period > > which can return zero, one, or two rows depending on the inputs: > > no rows if p1 is completely contained in p2 > > one row if p1 partially overlaps p2, for example: > > [1,4] except [3,5] returns [1,3) > [4,6] except [1,5) returns [5,6] > > two rows if p1 properly contains p2, for example > > [1,10] except [4,5] returns [1,4) and (5,10] > [1,10] except [9,10) returns [1,9) and [10,10] > > and of course just p1 if p1 and p2 don't overlap at all. > > Given such a function it's a simple matter of successively removing each > element of p2[] from the set representing the current members of p1[]. > The way that I'd find most natural to code that is a loop, along the > lines of > > foreach p2_member in unnest(p2) loop > p1 := array(select period_except(p1_member, p2_member) > from unnest(p1) p1_member); > end loop; > > But maybe it can be done in a single SQL command. > > As this example makes clear, when dealing with continuous intervals you > *must* admit both open and closed intervals, else you don't have a way > to represent the results of "except". Maybe part of the failure to > communicate here arises from your desire to try to avoid supporting both > kinds of intervals. But I think you really have to do it if you want to > deal with data that hasn't got any natural granularity. > > regards, tom lane Alright well I'm going to calm down a bit and take a step back. Perhaps I'm just too close to the issue and not thinking outside of the box that I've built. Let me see if I can make everything work rather than arguing why it wont. Scott
On Tue, 2009-12-15 at 18:06 -0600, decibel wrote: > Now that varlena's don't have an enormous fixed overhead, perhaps it's > worth looking at using them. Obviously some operations would be > slower, but for your stated examples of auditing and history, I > suspect that you're not going to notice the overhead that much. For most varvarlena types, you only get stuck with the full alignment burden if you get unlucky. In this case, we're moving from 16 bytes to 17, which really means 24 bytes with alignment. Try creating two tables: create table foo(i int8, t1 timestamp, t2 timestamp); create table bar(i int8, c "char", t1 timestamp, t2 timestamp); That extra byte there costs you 8 bytes, every time (on my machine, anyway). We're at serious risk of people saying "Ah, this temporal thing is bloated. I'll try to get by with a single timestamp and save 16 bytes per record". Or maybe "Why waste the bytes? I'll just store two timestamps". Regards, Jeff Davis
On Tue, 2009-12-15 at 18:03 -0600, decibel wrote: > I think it would help the discussion if you could provide some real > examples. I suspect you're thinking of things like scheduling apps, > where it's important to be able to do things like "what's the next > available time slot?". There are probably ways to make that kind of > thing easier without resorting to discrete time. I did provide some examples: http://archives.postgresql.org/pgsql-hackers/2009-12/msg01385.php I think the bottom line is that we can't expect all applications to understand all 4 inclusivity permutations. Think about how many applications you've seen with "start" and "end" entry boxes, and you have to pick up from the surrounding words what inclusivity it's expecting. If you're running two such applications at once and they share data (perhaps an entry app and a reporting app), or you're migrating from one to another, you have a problem if you can't convert the intervals to the expected format. I don't see it as "resorting to discrete time". I view discrete intervals as simpler in many ways: there exists a canonical inclusivity representation, and you can take that canonical representation and display it however you want. We should at least allow the user to specify that their range is discrete (even if only by a superuser). And the user should also be able to specify that they don't want null boundaries or extra infinity boundaries, so that they get back down to the 16-byte representation. > I find myself wondering if they were influenced by the technology > available at the time... No, it's entirely at the logical level. It's also not a very old book (2002). For another author that apparently likes to deal with discrete time, how about "Developing Time-Oriented Database Applications in SQL" by Snodgrass. It's free to download online, I believe. Regards, Jeff Davis
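The conversion argument above is mechanical for discrete ranges: normalize to one canonical form, render in whichever convention the application expects. A rough Python sketch over integers with granule 1 (encoding and names are illustrative only):

```python
def canonical(lo, lo_incl, hi, hi_incl, granule=1):
    """Normalize a discrete range to the canonical closed-open form [lo, hi)."""
    if not lo_incl:
        lo += granule  # (4, ... becomes [5, ...
    if hi_incl:
        hi += granule  # ..., 6] becomes ..., 7)
    return (lo, hi)

def render(canon, lo_incl, hi_incl, granule=1):
    """Display a canonical [lo, hi) range in any of the four conventions."""
    lo, hi = canon
    a = lo if lo_incl else lo - granule
    b = hi - granule if hi_incl else hi
    return "%s%d,%d%s" % ("[" if lo_incl else "(", a, b,
                          "]" if hi_incl else ")")
```

So [5,7), (4,6], [5,6], and (4,7) all normalize to (5, 7) and can be rendered back in whichever form the consuming application expects, which is exactly what a continuous range cannot offer.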
On Tue, 2009-12-15 at 16:08 -0800, Christophe Pettus wrote: > The argument is, in essence: > > DECIMAL is continuous. > DECIMAL(10,3) is discrete. > > timestamptz in general is a continuous value (unless we're talking > Planck times :) ). There is no way for us to guarantee that > next(timestamptz) will have the same value across all platforms; its > epsilon is platform dependent. Not unless you compile with float timestamps. Integer timestamps are microseconds since the year 2000 (positive or negative), which is platform-independent. I'm not arguing that we shouldn't support continuous time at all (clearly, enough people in this thread seem to like it), but I do want discrete time ranges. A lot of the temporal database literature is written assuming discrete time intervals. > But in the current implementation, the only way I can see making that > work is if we specify a scale for timestamptz, and that strikes me as > a big change to its semantics. It's already useful at the microsecond precision level. Also, the granule could be defined for the range type (as Scott suggested) rather than the timestamp itself. Regards, Jeff Davis
Tom Lane <tgl@sss.pgh.pa.us> writes: > foreach p2_member in unnest(p2) loop > p1 := array(select period_except(p1_member, p2_member) > from unnest(p1) p1_member); > end loop; > > But maybe it can be done in a single SQL command. Yeah, as soon as you have LATERAL, I think. Without it there's no way to compose SRF in SQL, AFAIK. Regards, -- dim
On Tue, Dec 15, 2009 at 04:16:28PM +0100, Nicolas Barbier wrote: [...] > <whatever> > > and > > <same whatever as before> + the character with the lowest value in > lexicographical ordering. > > I don't think it is possible to get anything in between those two strings. Yes, that was basically Andrew's argumentation. I was taken away too much by the similarity between strings and (decimal, base N) fractions, forgetting that the last have lots of unwritten zeros at the end... But hey, 'twas nice to learn that strings aren't as boring as I first thought... Regards -- tomás
On Tue, Dec 15, 2009 at 11:49:19AM -0800, David Fetter wrote: > On Tue, Dec 15, 2009 at 11:31:05AM -0800, Scott Bailey wrote: > > Jeff Davis wrote: > > >On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote: > > > > Would it be OK if we handled float timestamp ranges as continuous > > and int64 timestamps discrete? > > That sounds like a recipe for disaster. Whatever timestamp ranges > are, float and int64 should be treated the same way so as not to get > "surprises" due to implementation details. This alone would practically preclude discrete -- int and float would behave quite differently (float's "granules" growing towards the edges, or having to choose a bigger granule for float than for int in the first place). [...] > FWIW, I think it would be a good idea to treat timestamps as > continuous in all cases. This would come as a corollary from the above. Regards -- tomás
Dimitri Fontaine <dfontaine@hi-media.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> foreach p2_member in unnest(p2) loop >> p1 := array(select period_except(p1_member, p2_member) >> from unnest(p1) p1_member); >> end loop; >> >> But maybe it can be done in a single SQL command. > Yeah, as soon as you have LATERAL, I think. Without it there's no way to > compose SRF in SQL, AFAIK. Hm, how would you do it with LATERAL? The problem is not so much composition as the need for a variable number of rounds of composition. regards, tom lane
On Sun, 2009-12-13 at 23:49 -0800, Scott Bailey wrote: > So basically I have an anyrange pseudo type with the functions prev, > next, last, etc defined. So instead of hard coding range types, we would > allow the user to define their own range types. Basically if we are able > to determine the previous and next values of the base types we'd be able > to define a range type. I'm envisioning in a manner much like defining > an enum type. After an off-list discussion with Scott, I think there may be a solution here that works for everyone if we don't try so hard to unify the implementation of discrete and continuous ranges. The API should be very similar, of course, but the implementation doesn't need to be. Continuous ranges absolutely require the following information: start, end, and inclusivity information. But discrete ranges can instead be stored by counting the number of granules from the start point. For instance, it could be stored as: start, num_granules. That has a lot of benefits for discrete ranges of time. First of all, it allows the algebra to work reasonably well for the "days" and "months" part of the interval, so we can allow a granule of 1 day/week/month/year for a timestamp range. For output of the range, we can then just multiply the granule by the number of granules, and add that to the start time; thus avoiding the "incremental addition" problem with date math. I think this works reasonably well for timestamp/date ranges -- let me know if there is a problem here (aside from timestamptz, which I address below). Secondly, in the case of a timestamp range, we can use 7 bytes for storing the number of granules rather than another full 8-byte timestamp, leaving one byte for flags to represent NULL boundaries, infinite boundaries, etc. For timestamps that would still mean that an interval could be 2000 years long with '1 microsecond' granularity. For dates, 3 bytes is sufficient for a date range 45000 years long with granules of '1 day'. 
That means that we can get back down to a 16 byte representation for timestamp ranges, or 8 byte representation for date ranges. There are a few details, like infinite ranges, but those can be pretty easily solved with flags as well. There's one problem, and that's for timestamptz ranges with intervals that include days and months. Timezone adjustments are just not well-defined for that kind of granule (nor would it be particularly useful even if it magically worked), so this would have to be blocked somehow. I think that's a special case, and we could provide the user with a nice error message telling the user to use a date or timestamp range instead. So, the idea is to default to a continuous range type, but if the user supplies a granule, prior and next functions, and other necessary details, then it becomes a discrete range type.

* continuous ranges can still have everything that everyone wants, including flags to indicate special values.
* discrete range granule is specified explicitly, so it's not an "implementation detail"
* discrete ranges can have a compact representation
* discrete ranges would still have room for flags to indicate special values

Comments? Regards, Jeff Davis
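The compact layout Jeff describes can be mocked up to check the arithmetic. The field widths below follow his description (8-byte start in microseconds since 2000, 7-byte granule count, one flag byte, 16 bytes total); the flag bits and function names are invented for illustration:

```python
import struct

# hypothetical flag bits for the spare byte
FLAG_LO_INF = 0x01  # unbounded below
FLAG_HI_INF = 0x02  # unbounded above

def pack_range(start_us, num_granules, flags=0):
    """16-byte layout: int64 start + 7-byte granule count + 1 flag byte."""
    assert 0 <= num_granules < 1 << 56  # must fit in 7 bytes
    return (struct.pack("<q", start_us)
            + num_granules.to_bytes(7, "little")
            + bytes([flags]))

def unpack_range(buf):
    (start_us,) = struct.unpack("<q", buf[:8])
    return start_us, int.from_bytes(buf[8:15], "little"), buf[15]
```

With a '1 microsecond' granule, 2**56 microseconds is a bit over 2000 years of span, matching the figure in the message above.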
On Wed, Dec 16, 2009 at 12:31 PM, Jeff Davis <pgsql@j-davis.com> wrote: > There's one problem, and that's for timestamptz ranges with intervals > that include days and months. Timezone adjustments are just not > well-defined for that kind of granule (nor would it be particularly > useful even if it magically worked), so this would have to be blocked > somehow. I think that's a special case, and we could provide the user > with a nice error message telling the user to use a date or timestamp > range instead. This seems like a fairly special-purpose type. You'd be targeting it at people who are very concerned with storing large numbers of these (so they really care about space consumption) but for some reason don't need to mix days and months (actually, the current interval representation stores days, months, and seconds separately). I certainly think this might be useful to some people but it doesn't really sounds like a general range type facility, since it seems to involve some hacks that are fairly datatype-specific. ...Robert
Jeff Davis <pgsql@j-davis.com> writes: > [ hacky special-case representation for discrete timestamp ranges ] I'm still not exactly clear on what the use-case is for discrete timestamp ranges, and I wonder how many people are going to be happy with a representation that can't handle a range that's open-ended on the left. > So, the idea is to default to a continuous range type, but if the user > supplies a granule, prior and next functions, and other necessary > details, then it becomes a discrete range type. Huh? You're not going to be able to have a special case data representation for one or two data types at the same time as you have a function-based datatype-independent concept of a parameterized range type. Well, maybe you could have special code paths for just date and timestamp but it'd be horrid. More importantly, the notion of a representation granule is still 100% wishful thinking for any inexact-representation datatype, which is going to be a severe crimp in getting this accepted for timestamp, let alone defining it in a way that would allow users to try to apply it to floats. Float timestamps might not be the default case anymore but they are still supported. I think you should let go of the feeling that you have to shave bytes off the storage format. You're creating a whole lot of work for yourself and a whole lot of user-visible corner cases in return for what ultimately isn't much. regards, tom lane
Jeff Davis wrote: > On Sun, 2009-12-13 at 23:49 -0800, Scott Bailey wrote: >> So basically I have an anyrange pseudo type with the functions prev, >> next, last, etc defined. So instead of hard coding range types, we would >> allow the user to define their own range types. Basically if we are able >> to determine the previous and next values of the base types we'd be able >> to define a range type. I'm envisioning in a manner much like defining >> an enum type. > > After an off-list discussion with Scott, I think there may be a solution > here that works for everyone if we don't try so hard to unify the > implementation of discrete and continuous ranges. The API should be very > similar, of course, but the implementation doesn't need to be. > > Continuous ranges absolutely require the following information: start, > end, and inclusivity information. > > But discrete ranges can instead be stored by counting the number of > granules from the start point. For instance, it could be stored as: > start, num_granules. > > That has a lot of benefits for discrete ranges of time. First of all, it > allows the algebra to work reasonably well for the "days" and "months" > part of the interval, so we can allow a granule of 1 day/week/month/year > for a timestamp range. For output of the range, we can then just > multiply the granule by the number of granules, and add that to the > start time; thus avoiding the "incremental addition" problem with date > math. I think this works reasonably well for timestamp/date ranges -- > let me know if there is a problem here (aside from timestamptz, which I > address below). > > Secondly, in the case of a timestamp range, we can use 7 bytes for > storing the number of granules rather than another full 8-byte > timestamp, leaving one byte for flags to represent NULL boundaries, > infinite boundaries, etc. For timestamps that would still mean that an > interval could be 2000 years long with '1 microsecond' granularity. 
> For dates, 3 bytes is sufficient for a date range 45000 years long with > granules of '1 day'. That means that we can get back down to a 16 byte > representation for timestamp ranges, or 8 byte representation for date > ranges. There are a few details, like infinite ranges, but those can be > pretty easily solved with flags as well. > > There's one problem, and that's for timestamptz ranges with intervals > that include days and months. Timezone adjustments are just not > well-defined for that kind of granule (nor would it be particularly > useful even if it magically worked), so this would have to be blocked > somehow. I think that's a special case, and we could provide the user > with a nice error message telling the user to use a date or timestamp > range instead. > > So, the idea is to default to a continuous range type, but if the user > supplies a granule, prior and next functions, and other necessary > details, then it becomes a discrete range type. > > * continuous ranges can still have everything that everyone wants, > including flags to indicate special values. > * discrete range granule is specified explicitly, so it's not an > "implementation detail" > * discrete ranges can have a compact representation > * discrete ranges would still have room for flags to indicate special > values > > Comments? As I pointed out off-list, I think the granularity for timestamp range should be limited to hours and smaller. Anything larger is asking for trouble. And quite honestly if they wanted day granularity, they should use date range. Also, I think the granule should be same type as returned when subtracting two subtypes. So granule of date range should be int not interval. And if user wanted something with month granularity, perhaps an enum range of 'YYYYMM' would be better. 
Quite honestly the following 3 cases would probably meet 99% of need: CREATE TYPE period AS RANGE(timestamptz(0), interval '1 s'); CREATE TYPE period AS RANGE(timestamptz(3), interval '1 ms'); CREATE TYPE period AS RANGE(timestamptz, interval '1 us');
Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> [ hacky special-case representation for discrete timestamp ranges ] > > I'm still not exactly clear on what the use-case is for discrete > timestamp ranges, and I wonder how many people are going to be happy > with a representation that can't handle a range that's open-ended > on the left. They wouldn't. But the timestamp data would be the anchor, not necessarily the start point. As long as we support ranges unbounded on both ends we'd be ok.
Ok, silly question here. But how do you determine the length of a continuous range? By definition length of [a, b) and (a, b] = b-a. But what about (a,b) and [a,b]? Are we saying that because they are continuous, the difference between values included in the range and those excluded are so infinitesimally small so as not to matter? Thus length (a,b) == length [a,b] == length [a,b)? And if that is the case, does the inclusiveness of the range really even matter? And can anyone point me to a reference for working with continuous ranges? Google just insists that I spelled contiguous wrong. Scott
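One way to frame Scott's question is in measure-theoretic terms: single endpoints have measure zero, so all four inclusivity variants of a continuous range have the same length, yet they remain distinct sets because membership at the endpoints differs. A toy sketch of that distinction:

```python
def length(lo, hi, lo_incl, hi_incl):
    """Length of a continuous range; endpoint inclusivity contributes
    measure zero, so it doesn't affect the result."""
    return max(hi - lo, 0.0)

def contains(lo, hi, lo_incl, hi_incl, x):
    """Membership, however, does depend on inclusivity."""
    ok_lo = (x >= lo) if lo_incl else (x > lo)
    ok_hi = (x <= hi) if hi_incl else (x < hi)
    return ok_lo and ok_hi
```

So length (a,b) == length [a,b] == length [a,b), but inclusiveness still matters for operations like contains, overlaps, and union -- which is why it can't simply be dropped.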
Scott Bailey <artacus@comcast.net> writes: > As I pointed out off-list, I think the granularity for timestamp range > should be limited to hours and smaller. Anything larger is asking for > trouble. And quite honestly if they wanted day granularity, they should > use date range. I'm still not real clear on what the expected use-case is for this. You're evidently envisioning applications where the allowed form of an interval is constrained, but in the cases I can think of, the constraints are a lot more convoluted than what you're proposing. For example, if you're trying to do classroom scheduling, it might be useful to constrain the periods to start and end on hour boundaries --- but the next thing you'll want is to have it know that the "next" slot after 5pm Friday is 8am Monday. Except on holidays. And then there's the fact that my alma mater starts most hour-long classes on the half hour. I think that wiring such constraints into the low-level datatype is doomed to failure. What you'd be better off with is a function that finds the "next" period given a current period and some suitable representation of the off-limits intervals. The argument for having granularity wired into the datatype seems to boil down to just space savings. I don't find that compelling enough to justify code contortions and user-visible restrictions on functionality. regards, tom lane
On Wed, 2009-12-16 at 12:42 -0500, Robert Haas wrote: > On Wed, Dec 16, 2009 at 12:31 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > There's one problem, and that's for timestamptz ranges with intervals > > that include days and months. Timezone adjustments are just not > > well-defined for that kind of granule (nor would it be particularly > > useful even if it magically worked), so this would have to be blocked > > somehow. I think that's a special case, and we could provide the user > > with a nice error message telling the user to use a date or timestamp > > range instead. > > This seems like a fairly special-purpose type. You'd be targeting it > at people who are very concerned with storing large numbers of these > (so they really care about space consumption) but for some reason > don't need to mix days and months (actually, the current interval > representation stores days, months, and seconds separately). I > certainly think this might be useful to some people but it doesn't > really sounds like a general range type facility, since it seems to > involve some hacks that are fairly datatype-specific. My statement should have read "days or months". In other words, you can't have a timestamptz range with a granularity of '3 days'. But if that's your granularity, as Scott says, you should be using a date range, not a timestamptz range. Timestamptz ranges are only really useful when you have a granularity measured in seconds (or some fraction or multiple thereof). Otherwise, the timezone adjustment doesn't make any sense. So this isn't a case of limited functionality, just that we need to inform the user that a timestamptz range with granularity '1 day' or '1 month' makes no sense. Regards, Jeff Davis
On Wed, 2009-12-16 at 12:50 -0500, Tom Lane wrote: > I'm still not exactly clear on what the use-case is for discrete > timestamp ranges, and I wonder how many people are going to be happy > with a representation that can't handle a range that's open-ended > on the left. Huh? We're miscommunicating somewhere. Discrete ranges are values, and those values can be displayed a number of different ways. That's one of the biggest advantages. The very same discrete range can be displayed as open-open, closed-open, open-closed, or closed-closed. There are edge cases, like how infinity is never closed, and overflow conditions. But generally speaking, you have more options for presenting a discrete range than a continuous range. The range [5, 7) is equal to the set {5, 6} and equal to the ranges: (4,7), (4,6], [5,7), and [5,6]. One application can insert it as [5,7) and another can read it as (4,6]. That's the use case: the application's preferences don't have to match. It's OK to mix various representation preferences, because you can convert between them. The on disk format happens to hint at one particular canonical form, but doesn't enforce that on anyone. > Huh? You're not going to be able to have a special case data > representation for one or two data types at the same time as you have a > function-based datatype-independent concept of a parameterized range > type. Well, maybe you could have special code paths for just date and > timestamp but it'd be horrid. They aren't supposed to be exactly the same API, I said that from the start. There are API differences between continuous and discrete ranges, and we shouldn't ignore them. One important differences is that (barring overflow conditions and special values) prior, first, last, and next are defined for all discrete range values, but not for all continuous range values. For instance, the discrete range [5,7) has prior=4, first=5, last=6, next=7. 
Whereas the continuous range [5,7) has prior=undef, first=5, last=undef, next=7. We could define one API, that treats discrete and continuous ranges differently. But you'll never be able to transform a continuous range to a different representation, while you can do so with a discrete range. > More importantly, the notion of a representation granule is still 100% > wishful thinking for any inexact-representation datatype, which is going > to be a severe crimp in getting this accepted for timestamp, let alone > defining it in a way that would allow users to try to apply it to > floats. Float timestamps might not be the default case anymore but they > are still supported. If the only objection is that a superuser can confuse the system by poorly defining a range type on a non-default build, I think that objection can be overcome. > I think you should let go of the feeling that you have to shave bytes > off the storage format. You're creating a whole lot of work for > yourself and a whole lot of user-visible corner cases in return for > what ultimately isn't much. This isn't just to shave bytes. It's also because I like the semantics of discrete ranges for many cases. Regards, Jeff Davis
tomas@tuxteam.de wrote:
> (and as Andrew Dunstan pointed out off-list: I was wrong with my bold
> assertion that one can squeeze infinitely many (arbitrary length)
> strings between two given. This is not always the case).

Of course you can do that if you assume lexicographical order, or any other arbitrary order. The interesting point is whether there exists some ordering on which this does not happen. And in fact there is: order strings by length first, and then lexicographically. If you do this then you have next() and prev() for any given string. If you use ASCII only, you have a.next = b, and so on.

There is the argument that some languages do not sort lexicographically, but this is also beside the point -- you only need to find *some* way to sort the characters in the alphabet. If you dictate that in your ordering "á" comes before "à" and both after "a", and all of them before b, then you know that a.next = á and á.next = à and à.next = b. (Note that I have also dictated that there is no other character that sorts after a and before b, which is perfectly possible because the alphabet is fixed for any given language. You could use the complete list of characters coded in a given set of Unicode planes, or even all extant planes, to obtain the same result).

Defining strings with this ordering means you can have some useful ranges like [a-z], but then you cannot meaningfully use ranges for things like [aardvark - zulu] --- note that in this particular example, the order is reversed, because zulu comes before aardvark, which is probably not what you want. zzz.next = aaaa

In short, I think that while it is possible to define ranges of strings, it is not as useful as one would like.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
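Alvaro's length-first ("shortlex") ordering can be sketched in Python; the alphabet here is plain ASCII lowercase and the function name is invented for illustration:

```python
# Hypothetical sketch of the "order by length first, then
# lexicographically" idea: under this ordering every string has a
# well-defined successor (and, symmetrically, a predecessor).

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def shortlex_next(s):
    """Successor of s when strings are ordered by length, then lexicographically."""
    chars = list(s)
    for i in reversed(range(len(chars))):
        idx = ALPHABET.index(chars[i])
        if idx + 1 < len(ALPHABET):
            chars[i] = ALPHABET[idx + 1]  # bump this position
            return "".join(chars)
        chars[i] = ALPHABET[0]            # carry: 'z' wraps to 'a'
    return ALPHABET[0] * (len(s) + 1)     # all positions carried: grow by one character

assert shortlex_next("a") == "b"       # a.next = b, as in the text
assert shortlex_next("zzz") == "aaaa"  # zzz.next = aaaa, as in the text
```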
On Tue, Dec 15, 2009 at 04:29:26PM -0800, Jeff Davis wrote: > On Tue, 2009-12-15 at 18:06 -0600, decibel wrote: > > Now that varlena's don't have an enormous fixed overhead, perhaps it's > > worth looking at using them. Obviously some operations would be > > slower, but for your stated examples of auditing and history, I > > suspect that you're not going to notice the overhead that much. > > For most varvarlena types, you only get stuck with the full alignment > burden if you get unlucky. In this case, we're moving from 16 bytes to > 17, which really means 24 bytes with alignment. Try creating two tables: > > create table foo(i int8, t1 timestamp, t2 timestamp); > create table bar(i int8, c "char", t1 timestamp, t2 timestamp); But a period type will take just one or two more bytes if you don't require alignment. Alignment on a varlena type seems silly anyway, since you'll be aligning the header byte rather than the content. In the implementation you may need to copy the content before processing to satisfy the alignment of the contained type, but that's just a SMOP. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
On Wed, Dec 16, 2009 at 10:57:19AM -0800, Scott Bailey wrote:
> Ok, silly question here. But how do you determine the length of a
> continuous range? By definition length of [a, b) and (a, b] = b-a. But
> what about (a,b) and [a,b]? Are we saying that because they are
> continuous, the difference between values included in the range and
> those excluded are so infinitesimally small so as not to matter? Thus
> length (a,b) == length [a,b] == length [a,b)? And if that is the case,
> does the inclusiveness of the range really even matter?

Short answer: Yes.

Longer answer: You need to decide on your definition of "length", and what you usually use is the "measure". And yes, the difference between the two is so-called "measure 0" and thus has no effect on the length. Note the measure has to be done considering the intervals as intervals on the real line. The integers by themselves have no measure (they are countable). So for the "length" of a set of integers you might consider the count of the set.

http://planetmath.org/encyclopedia/ProofThatTheOuterLebesgueMeasureOfAnIntervalIsItsLength.html
http://en.wikipedia.org/wiki/Outer_measure

As for "continuous" as you use it above, that is not a usage I recognise. There are contiguous sets, but they are something else.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
Alvaro Herrera <alvherre@commandprompt.com> writes: > In short, I think that while it is possible to define ranges of strings, > it is not as useful as one would like. Note it is not the *range* that is the problem, it is the assumption that there's a unique "next" string. There's no unique next in the reals or rationals either, but we have no problem considering intervals over those sets. regards, tom lane
On Wed, 2009-12-16 at 13:59 -0500, Tom Lane wrote:
> The argument for having
> granularity wired into the datatype seems to boil down to just space
> savings. I don't find that compelling enough to justify code
> contortions and user-visible restrictions on functionality.

The argument (at least from me) is that discrete ranges have better semantics. The counterargument was that the granularity of a timestamp is an implementation detail. So I countered by making it explicit. Space savings is not crucial, but it would be frustrating to needlessly waste space.

I still have not seen an answer to the problem of changing the representation of a continuous range. If you have the continuous range [5, 10], you're pretty much stuck with that representation, even if the application is expecting things in the form [ ).

From an application's standpoint, you probably want to get the information about a range as separate columns (as opposed to parsing a string). So an application expecting data in [) format might use a query like:

  select ..., first(myperiod), next(myperiod) from mytable;

That gives the application complete information about the range. You can even make a view over a table like that to make it even more transparent to the application. It's not entirely unreasonable that many such applications exist; there are many presentations and tutorials that have been telling people to use a "start" and "end" column, and assume that the start is inclusive and the end is exclusive.

If there is some other application that expects data in (] format, you just use the query:

  select ..., prior(myperiod), last(myperiod) from mytable;

With discrete ranges, that all just works (barring overflow or special values). With continuous ranges, first() or next() might fail on some values that were produced by some other application.
Really, for continuous ranges, you'd need to have a query like:

  select ..., start(myperiod), start_inclusive(myperiod),
         end(myperiod), end_inclusive(myperiod) from mytable;

in order to have all of the information. And that means that the application needs a full implementation of a range type to understand the inclusivity and produce a correct result.

And to further make the case for allowing user-defined discrete ranges, what about ip4r? That's a discrete range, and it's user-defined. And that probably means other useful discrete ranges will be invented, as well.

If we want to say that there is no discrete TIMESTAMP range by default, and that the superuser has to define it, that's one thing. But if we say that the only acceptable base types for discrete ranges will be hard-wired, that's way too restrictive. If nothing else, it makes some system types "special", which we have not done in very many other places.

Regards, Jeff Davis
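A toy model of the continuous case (illustrative Python with invented names, not a proposed API) shows why the extra inclusivity columns are unavoidable:

```python
# Hypothetical sketch: a continuous range must carry inclusivity flags,
# and first()/next() are only partially defined, so the representation
# can't be freely converted the way a discrete one can.

from dataclasses import dataclass

@dataclass
class ContinuousRange:
    start: float
    end: float
    start_inclusive: bool
    end_inclusive: bool

    def first(self):
        """Smallest value in the range: only defined when the left bound is closed."""
        if not self.start_inclusive:
            raise ValueError("no first() for an open left bound on a continuous range")
        return self.start

# An application expecting [) form works fine with [5, 10):
assert ContinuousRange(5.0, 10.0, True, False).first() == 5.0

# ...but (4, 10] produced by some other application has no first():
try:
    ContinuousRange(4.0, 10.0, False, True).first()
except ValueError:
    pass  # there is no smallest real number greater than 4
```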
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > In short, I think that while it is possible to define ranges of strings,
> > it is not as useful as one would like.
>
> Note it is not the *range* that is the problem, it is the assumption
> that there's a unique "next" string. There's no unique next in the
> reals or rationals either, but we have no problem considering intervals
> over those sets.

Yeah, agreed. It's easy (I think) to define more useful ranges of strings if you don't insist on having "next".

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Martijn van Oosterhout <kleptog@svana.org> writes: > But a period type will take just one or two more bytes if you don't > require alignment. Alignment on a varlena type seems silly anyway, > since you'll be aligning the header byte rather than the content. You might still end up paying the alignment overhead after the field, of course. But avoiding embedding the alignment in the type itself seems worth doing. One idea that might be interesting to consider in this regard is force-packing varlena range values. Consider a text range('a', 'b'). The datums are likely to come in with 4-byte headers requiring alignment. If we have the smarts to force them to 1-byte header form inside the varlena range value, not only do we save bytes right there, but we don't have to expend cycles to copy them somewhere to re-align them before we can pass them to the datatype-specific functions. regards, tom lane
Jeff Davis <pgsql@j-davis.com> writes: > On Wed, 2009-12-16 at 12:50 -0500, Tom Lane wrote: >> I'm still not exactly clear on what the use-case is for discrete >> timestamp ranges, and I wonder how many people are going to be happy >> with a representation that can't handle a range that's open-ended >> on the left. > Huh? We're miscommunicating somewhere. Yeah, apparently. By open-ended I meant -infinity left bound, or null left bound if you prefer. Not sure if there's a better term. regards, tom lane
Jeff Davis <pgsql@j-davis.com> writes: > On Wed, 2009-12-16 at 13:59 -0500, Tom Lane wrote: >> The argument for having >> granularity wired into the datatype seems to boil down to just space >> savings. I don't find that compelling enough to justify code >> contortions and user-visible restrictions on functionality. > The argument (at least from me) is that discrete ranges have better > semantics. The counterargument was that the granularity of a timestamp > is an implementation detail. So I countered by making it explicit. Making it explicit doesn't fix the fact that you can't rely on the arithmetic to be exact. > I still have not seen an answer to the problem of changing the > representation of a continuous range. If you have the continuous range > [5, 10], you're pretty much stuck with that representation, even if the > application is expecting things in the form [ ). That is not our problem. It's the application's problem if it can't handle the concept. You might as well be complaining that type numeric is broken because it can represent values that will fail to fit into float8 when some application tries to force them into that form. > And to further make the case for allowing user-defined discrete ranges, > what about ip4r? What about it? I don't have a problem with the concept that next() is well defined for some datatypes. I just have a problem with the concept that it's well-defined for timestamps. It's not, and I don't believe that forcing it to have some definition is useful in the real world (which, as a rule, isn't going to fit the simplifying assumptions you have to make to make it even sort-of work). regards, tom lane
On Wed, Dec 16, 2009 at 03:57:44PM -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > I still have not seen an answer to the problem of changing the
> > representation of a continuous range. If you have the continuous range
> > [5, 10], you're pretty much stuck with that representation, even if the
> > application is expecting things in the form [ ).
>
> That is not our problem. It's the application's problem if it can't
> handle the concept. You might as well be complaining that type numeric
> is broken because it can represent values that will fail to fit into
> float8 when some application tries to force them into that form.

However, it does seem reasonable to allow people to restrict, either by typmod or a check constraint, the kinds of values that can be stored in a particular column. Then an application can decide which way they want their intervals to work and have the database enforce it. (Intermediate values may become a different kind, just as long as the result being stored is the right kind.)

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
On Wed, 2009-12-16 at 15:46 -0500, Tom Lane wrote:
> > Huh? We're miscommunicating somewhere.
>
> Yeah, apparently. By open-ended I meant -infinity left bound, or null
> left bound if you prefer. Not sure if there's a better term.

But my proposal allowed both of those things with various flag settings (because it has a flag byte in the latest proposal). I said so explicitly.

Regards, Jeff Davis
Martijn van Oosterhout <kleptog@svana.org> writes: > However, it does seem reasonable to allow people to restrict, either by > typmod or a check constraint the kinds of values that can be stored in > a particular column. Then an application can decide which way they want > their intervals to work and have the database enforce it. Sure --- the range datatype should absolutely provide inquiry functions that let you determine all the properties of a range, so something like "CHECK (is_open_on_right(col))" would work for that. I'm of the opinion that we must not usurp typmod for range behavior --- the right thing is to pass that through to the contained type, just as we do with arrays. (Note that a range over timestamp(0) would eliminate at least some of the platform dependencies we've been arguing about. I'm still quite dubious that "next timestamp" is anything except evidence that you've misformulated your problem, though.) regards, tom lane
On Wed, 2009-12-16 at 15:57 -0500, Tom Lane wrote:
> Making it explicit doesn't fix the fact that you can't rely on the
> arithmetic to be exact.

Can't rely on what arithmetic to be exact? Int64 timestamps should clearly work for granules of 1 second. If the administrator can choose a timestamp format that can't accurately represent whole seconds, that's not an argument that modeling based on whole seconds is worthless. We can restrict float timestamps for ranges as a special case, if we're that worried about people misusing them.

> > I still have not seen an answer to the problem of changing the
> > representation of a continuous range. If you have the continuous range
> > [5, 10], you're pretty much stuck with that representation, even if the
> > application is expecting things in the form [ ).
>
> That is not our problem. It's the application's problem if it can't
> handle the concept. You might as well be complaining that type numeric
> is broken because it can represent values that will fail to fit into
> float8 when some application tries to force them into that form.

...except that we support float8. So if applications like to work with float8, we let them. You're arguing that we should not support discrete time ranges (or even allow such a type to be defined by a superuser) even though applications and users may happen to model time that way. Who are we to argue that all of those people are so wrong that we won't even support their type? Especially when they may have just finished a couple of books on the subject[1][2] which told them to model it that way?

> > And to further make the case for allowing user-defined discrete ranges,
> > what about ip4r?
>
> What about it? I don't have a problem with the concept that next() is
> well defined for some datatypes.

Ok, so we'll allow users to specify user-defined types for discrete ranges? How should that be specified, and how will that differ from earlier proposals?
Earlier you proposed that we hard-wire the set of types that are allowed to be used for discrete ranges: http://archives.postgresql.org/pgsql-hackers/2009-12/msg01278.php

Regards, Jeff Davis

[1] "Temporal Data and the Relational Model" by C.J. Date, et al.
[2] "Developing Time-Oriented Database Applications in SQL" by Richard Snodgrass.
Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>> However, it does seem reasonable to allow people to restrict, either by
>> typmod or a check constraint the kinds of values that can be stored in
>> a particular column. Then an application can decide which way they want
>> their intervals to work and have the database enforce it.
>
> Sure --- the range datatype should absolutely provide inquiry functions
> that let you determine all the properties of a range, so something like
> "CHECK (is_open_on_right(col))" would work for that. I'm of the opinion
> that we must not usurp typmod for range behavior --- the right thing is
> to pass that through to the contained type, just as we do with arrays.
>
> (Note that a range over timestamp(0) would eliminate at least some of
> the platform dependencies we've been arguing about. I'm still quite
> dubious that "next timestamp" is anything except evidence that you've
> misformulated your problem, though.)
>
> regards, tom lane

Well, our work is based on over 15 years of temporal research (not by us) and numerous books from Snodgrass, Date and Celko, as well as partial implementations in other databases. So it's not like we took a blue pill this weekend and woke up with this hare-brained idea. I understand your concern. But I think the objections are based more on implementation details with float timestamps than on the concept.
On Wed, 2009-12-16 at 13:59 -0500, Tom Lane wrote:
> For example, if you're trying to do classroom scheduling, it might be
> useful to constrain the periods to start and end on hour boundaries
> --- but the next thing you'll want is to have it know that the "next"
> slot after 5pm Friday is 8am Monday. Except on holidays. And then
> there's the fact that my alma mater starts most hour-long classes on
> the half hour.

Data types are only a first-level constraint -- a domain of reasonable values. The class isn't going to start on a fraction-of-a-minute boundary, so it would be reasonable to reject those values early. I never suggested that next() should be such a highly business-dependent function as you suggest above (skipping holidays, etc.); it should just return the next value in the domain (if it's discrete).

Surely you wouldn't suggest that the ipv4 data type's next() function should skip over addresses that aren't in a valid subnet on your network. But you seem to think those make useful discrete ranges.

Regards, Jeff Davis
On Wed, 2009-12-16 at 14:29 +0100, tomas@tuxteam.de wrote:
> This alone would practically preclude discrete -- int and float would
> behave quite differently (float's "granules" growing towards the edges
> or having to choose a bigger granule for float than for int in the first
> place).

It may be an argument for a different range type name, or for trying to spot obviously dangerous things and throw an error. But I don't think it's an argument to prevent a superuser from defining a discrete range of whatever he or she wants, as long as they provide the necessary functions.

Regards, Jeff Davis
On Wed, Dec 16, 2009 at 04:45:54PM -0300, Alvaro Herrera wrote:
> tomas@tuxteam.de wrote:
> > (and as Andrew Dunstan pointed out off-list: I was wrong with my bold
> > assertion that one can squeeze infinitely many (arbitrary length)
> > strings between two given. This is not always the case).
>
> Of course you can do that if you assume lexicographical order, or any
> other arbitrary order.

Hm. On strings (lexicographical order) there's always a "next" (using the convention 'a' is the first symbol, 'z' the last and . the append operator):

  next(X) == X . 'a'

(there is no string between X and X . 'a', and X < X . 'a')

But there is generally no prev! (unless the last "character" of a string is precisely 'a')

  prev("bob") == "bazzzzz......"

...and we just left the realm of finite-length strings. Strange structure this space has.

> some ordering on which this does not happen. And in fact there is:
> order strings by length first, and then lexicographically. If you do
> this then you have next() and prev() for any given string. If you use
> ASCII only, you have a.next = b, and so on.

Yes, this is similar to the trick of ordering the rationals in a "snake diagonal" like in [1], to prove their countability. But this sorting isn't very useful (except in this clever proof).

> There is the argument that some languages do not sort lexicographically
> but this is also beside the point [...]

Totally agree.

> Defining strings with this ordering means you can have some useful
> ranges like [a-z], but then you cannot meaningfully use ranges for
> things like [aardvark - zulu] --- note that in this particular example,
> the order is reversed, because zulu comes before aardvark which is
> probably not what you want. zzz.next = aaaa

Yep. It's just useful to prove that the set of strings is countable (I wonder whether our world would be different if we had chosen the convention of looking up words in this "length+lexicographical" order in the first place -- it reminds me of how Chinese-speaking people look up symbols in a dictionary: number of strokes first).

> In short, I think that while it is possible to define ranges of strings,
> it is not as useful as one would like.

Unless you are willing to treat strings as non-discrete, somehow.

Regards
----------------
[1] <http://en.wikipedia.org/wiki/Rational_numbers#Properties>
-- 
tomás
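tomas's successor function is trivial to write down (Python sketch; `lex_next` is an invented name), and the asymmetry with prev() is visible immediately:

```python
# Hypothetical sketch: under plain lexicographic order, appending the
# smallest symbol gives the immediate successor -- nothing sorts
# strictly between X and X . 'a'. No finite predecessor exists in
# general: prev("bob") would be "bazzz..." continued forever.

def lex_next(s, smallest="a"):
    """Immediate lexicographic successor of s, for an alphabet whose least symbol is `smallest`."""
    return s + smallest

assert "bob" < lex_next("bob") == "boba"
# No candidate falls strictly between "bob" and "boba":
assert not any("bob" < x < "boba" for x in ["bo", "bobb", "bobaa", "boc"])
```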
Tom Lane <tgl@sss.pgh.pa.us> writes: > Dimitri Fontaine <dfontaine@hi-media.com> writes: >> Tom Lane <tgl@sss.pgh.pa.us> writes: >>> foreach p2_member in unnest(p2) loop >>> p1 := array(select period_except(p1_member, p2_member) >>> from unnest(p1) p1_member); >>> end loop; >>> >>> But maybe it can be done in a single SQL command. > >> Yeah, as soon as you have LATERAL, I think. Without it there's no way to >> compose SRF in SQL, AFAIK. > > Hm, how would you do it with LATERAL? The problem is not so much > composition as the need for a variable number of rounds of > composition. Let's have a try at it: select p2_member, array_accum(p1) from unnest(p2) as p2_member lateral (select period_except(p1_member, p2_member) from unnest(p1) p1_member) as x(p1); I'm not sure I understand how the explicit looping over unnest(p2) is different from using lateral, or even if that's what you're talking about when mentioning variable number of rounds. Regards, -- dim
Dimitri Fontaine <dfontaine@hi-media.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> Hm, how would you do it with LATERAL? The problem is not so much >> composition as the need for a variable number of rounds of >> composition. > Let's have a try at it: > select p2_member, array_accum(p1) > from unnest(p2) as p2_member > lateral (select period_except(p1_member, p2_member) > from unnest(p1) p1_member) as x(p1); I don't think that does it. Maybe I misunderstand LATERAL, but what that looks like to me is that each p1 will be separately filtered by each p2, giving rise to a distinct element in the output. What we need is for each p1 to be filtered by *all* p2's, successively (though in any order). regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Someone mentioned LATERAL?

>> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>> Hm, how would you do it with LATERAL? The problem is not so much
>>> composition as the need for a variable number of rounds of
>>> composition.

>> Let's have a try at it:

>> select p2_member, array_accum(p1)
>> from unnest(p2) as p2_member
>> lateral (select period_except(p1_member, p2_member)
>> from unnest(p1) p1_member) as x(p1);

Tom> I don't think that does it. Maybe I misunderstand LATERAL, but
Tom> what that looks like to me is that each p1 will be separately
Tom> filtered by each p2, giving rise to a distinct element in the
Tom> output. What we need is for each p1 to be filtered by *all*
Tom> p2's, successively (though in any order).

Right, that's not a job for LATERAL, though it could be done easily enough in one statement with a recursive CTE, I think.

-- 
Andrew (irc:RhodiumToad)
Tom Lane wrote:
> Dimitri Fontaine <dfontaine@hi-media.com> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>> Hm, how would you do it with LATERAL? The problem is not so much
>>> composition as the need for a variable number of rounds of
>>> composition.
>
>> Let's have a try at it:
>
>> select p2_member, array_accum(p1)
>> from unnest(p2) as p2_member
>> lateral (select period_except(p1_member, p2_member)
>> from unnest(p1) p1_member) as x(p1);
>
> I don't think that does it. Maybe I misunderstand LATERAL, but what
> that looks like to me is that each p1 will be separately filtered by
> each p2, giving rise to a distinct element in the output. What we
> need is for each p1 to be filtered by *all* p2's, successively
> (though in any order).
>
> regards, tom lane

That approach will only work if you coalesce your inputs into non-contiguous sets (NCS) first. Overlapping ranges would break it in a hurry. In addition to two coalesce operations, period_except would be calculated 1000x for a pair of 100-element arrays.

The original solution, while not short, was probably a little more elegant than Tom gave credit for. In a single pass it pulls out only the data points needed to build the resultant NCS, without making assumptions that the inputs were coalesced. I think I'll still be able to do a single-pass solution for continuous ranges; I just won't be able to do the coalesce operations inline with the set operations.

Scott
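For reference, the set-difference semantics being debated (each p1 range filtered by *all* p2 ranges, with inputs coalesced into non-contiguous sets first) can be sketched in Python over half-open [start, end) integer pairs. This is an illustration of the semantics only, not the single-pass algorithm Scott describes:

```python
# Hypothetical sketch: subtract every range in p2 from every range in p1.
# Ranges are half-open [start, end) tuples; inputs are coalesced first,
# as Scott notes the nested-loop formulation requires.

def coalesce(ranges):
    """Merge overlapping/adjacent [start, end) pairs into a sorted non-contiguous set."""
    out = []
    for s, e in sorted(ranges):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))  # extend previous range
        else:
            out.append((s, e))
    return out

def difference(p1, p2):
    """Points covered by p1 but not by p2, as coalesced [start, end) pairs."""
    p2c = coalesce(p2)
    result = []
    for s, e in coalesce(p1):
        for s2, e2 in p2c:
            if e2 <= s or s2 >= e:
                continue                # this subtrahend doesn't overlap
            if s2 > s:
                result.append((s, s2))  # keep the part before the cut
            s = max(s, e2)              # skip past the cut
            if s >= e:
                break
        if s < e:
            result.append((s, e))       # keep whatever remains
    return result

assert coalesce([(1, 3), (2, 5)]) == [(1, 5)]
assert difference([(1, 10)], [(3, 5), (7, 8)]) == [(1, 3), (5, 7), (8, 10)]
```

Without the coalesce step, overlapping subtrahends would emit overlapping "kept" fragments, which is the breakage Scott points out.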
On Dec 15, 2009, at 6:29 PM, Jeff Davis wrote:
> On Tue, 2009-12-15 at 18:06 -0600, decibel wrote:
>> Now that varlena's don't have an enormous fixed overhead, perhaps it's
>> worth looking at using them. Obviously some operations would be
>> slower, but for your stated examples of auditing and history, I
>> suspect that you're not going to notice the overhead that much.
>
> For most varvarlena types, you only get stuck with the full alignment
> burden if you get unlucky. In this case, we're moving from 16 bytes to
> 17, which really means 24 bytes with alignment. Try creating two tables:

My thought was that many timestamps don't actually need 16 bytes. Jan 1, 2000 certainly doesn't. So if your dates are close to the PG epoch, you can get away with far fewer than 8 bytes, which means varlena would be a win.

*does some math* Actually, we're kinda screwed with microsecond time. Neglecting leap years and what-not, I come up with 8 years as the most you can represent in 6 bytes. The good news is that 7 bytes gets you all the way to 2284 (with uS precision), so we're not actually hurting ourselves on storage until 4284 or so. Not everyone needs uS precision, so it might be worth looking at a varlena-based timestamp.

I was actually thinking about storing something like an absolute time and then an interval. That might have been able to compact a lot more if you used some kind of modified varlena (you'd want to store how long both the absolute time and the interval were). But again, we're rather screwed if you use uS precision. 1 byte header + 7 bytes for absolute gets us +/- 2284 years from epoch, but 4 bytes for interval only gives us 4294 seconds at uS precision. Maybe still worth it for those hour-long meetings.

But if you switch to second precision, things get a lot more interesting: 1 byte overhead + 3 bytes interval gives you 194 days. 4 bytes of 1-second absolute time gets you epoch +/- 136 years. That means you could represent an entire period in 8 bytes.

-- 
Jim C. Nasby, Database Architect   jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
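Jim's arithmetic can be sanity-checked in a few lines of Python (illustrative helper names; leap seconds ignored, Julian years assumed):

```python
# Hypothetical back-of-the-envelope check of the byte-width math above.

US_PER_YEAR = 365.25 * 86400 * 1_000_000  # microseconds in a Julian year

def span_in_years(n_bytes):
    """Total span representable by n_bytes of unsigned microsecond ticks, in years."""
    return 2 ** (8 * n_bytes) / US_PER_YEAR

def span_in_days_at_seconds(n_bytes):
    """Total span representable by n_bytes of unsigned 1-second ticks, in days."""
    return 2 ** (8 * n_bytes) / 86400

span_in_years(6)            # roughly 8.9 -- "8 years as the most you can represent in 6 bytes"
span_in_years(7)            # roughly 2283 -- "all the way to 2284" (year count depends on year length used)
span_in_days_at_seconds(3)  # roughly 194 -- "3 bytes interval gives you 194 days"
```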