Обсуждение: Tracking timezone abbreviation removals in the IANA tz database
It turns out that some of the zone abbreviations shown in the IANA timezone database don't correspond to any real-world usage, but were more or less made up by the timezone database crew. And of late, those folk have started to remove these invented abbreviations in favor of just using numeric offsets. For example, part of the 2016f update did this: # Asia/Novosibirsk covers:# 54 RU-NVS Novosibirsk Oblast +# From Stepan Golosunov (2016-05-30): +# http://asozd2.duma.gov.ru/main.nsf/(Spravka)?OpenAgent&RN=1085784-6 +# moves Novosibirsk oblast from UTC+6 to UTC+7. +# From Stepan Golosunov (2016-07-04): +# The law was signed yesterday and published today on +# http://publication.pravo.gov.ru/Document/View/0001201607040064 +Zone Asia/Novosibirsk 5:31:40 - LMT 1919 Dec 14 6:00 - 6:00 - NOVT 1930 Jun 21 # Novosibirsk Time - 7:00 Russia NOV%sT 1991 Mar 31 2:00s - 6:00 Russia NOV%sT 1992 Jan 19 2:00s - 7:00 Russia NOV%sT 1993 May 23 # say Shanks & P. - 6:00 Russia NOV%sT 2011 Mar 27 2:00s - 7:00 - NOVT 2014 Oct 26 2:00s - 6:00 - NOVT + 6:00 - +06 1930 Jun 21 + 7:00 Russia +07/+08 1991 Mar 31 2:00s + 6:00 Russia +06/+07 1992 Jan 19 2:00s + 7:00 Russia +07/+08 1993 May 23 # say Shanks & P. + 6:00 Russia +06/+07 2011 Mar 27 2:00s + 7:00 - +07 2014 Oct 26 2:00s + 6:00 - +06 2016 Jul 24 2:00s + 7:00 - +07 According to the Russia rules, %s expands to nothing for standard time or "S" for summer time, so that the old entries here had the effect of defining NOVT and NOVST as meaning variously UTC+6, UTC+7, or UTC+8 at different times in the past. Now those abbreviations are gone, so far as IANA is concerned. I had thought that this wouldn't really affect us because PG's interpretations of zone abbreviations are driven by the info in timezone/tznames/ rather than the IANA files in timezone/data/. I forgot however that the "dynamic abbreviation" code relies on being able to find matching abbreviations in the IANA data. For example, timezone/tznames/Default lists NOVST Asia/Novosibirsk # Novosibirsk Summer Time (obsolete) but that zone abbreviation now fails entirely: # select '2016-09-02 08:00:00 NOVST'::timestamptz; ERROR: time zone abbreviation "novst" is not used in time zone "Asia/Novosibirsk" (This is also the cause of bug #14307.) So the question is what to do about it. An easy answer is to just delete such entries from the timezone/tznames/ files altogether. If we believe the IANA crew's assumption that nobody uses these abbreviations in the real world, then that seems like it removes useless maintenance overhead for little cost. I'm a little worried though that somebody might have followed IANA's lead and actually started using these abbreviations, in which case we'd get complaints. Another possibility is to keep these entries but get rid of the dynamic zone aliases, reducing them to plain numeric UTC offsets. Probably the value to use would be whatever was shown as the most recent active value in the IANA list ... although that's open to interpretation. For instance, according to the above we'd likely define NOVT as UTC+7, but then logically NOVST ought to be UTC+8, which doesn't match up with the fact that when it actually last appeared in the IANA data it meant UTC+7. So that sounds like it'd be a bit of a mess involving some judgment calls. In the end, part of the reason we've got these files is so that users can make their own decisions about abbreviation meanings. So the ultimate answer to any complaints is going to be "if you think NOVT means thus-and-such then put in an entry that says so". So I'm leaning to the just-remove-it answer for any deleted abbreviation that relies on a dynamic definition. Names that never had more than one UTC offset can remain in the tznames list. Comments? BTW, we should be glad that the IANA crew decided to use ISO not POSIX sign convention for these numeric zone abbreviations, or we'd really have a mess on our hands :-(. Also, the fact that this failure made it into releases shows that I'm missing some now-necessary validation steps in the zone data update procedure. Will add something about that to the timezone/README file. regards, tom lane
I wrote: > So I'm leaning to the just-remove-it answer for any deleted abbreviation > that relies on a dynamic definition. Names that never had more than one > UTC offset can remain in the tznames list. After a bit more thought and consumption of caffeine, I am thinking that that won't really be good enough. It's clear that the IANA crowd intend to continue removing made-up abbreviations, in fact there's a pile of that in their queue right now: http://mm.icann.org/pipermail/tz/2016-August/023941.html The trouble from our perspective is that a lot of those abbreviations have shifted meaning over the years (if they had any real-world usage maybe they'd have stayed more stable...), which means that we are using dynamic abbreviations for them, which means breakage as soon as the tznames list diverges from the IANA database. Which would be manageable if we only shipped those filesets together, but in installations built with --with-system-tzdata (which at least ought to include most vendor distributions of PG), they don't come from the same place. This means that even if we fix removals when we make new releases, other abbreviations may break the next time the vendor updates their tzdata package. We could maybe tolerate individual abbreviations failing like this, but as noted in bug #14307, the pg_timezone_abbrevs view fails altogether if there are any broken abbreviations in the active timezone_abbreviations list. So that makes this a problem for all users whether or not they care about the specific abbreviations affected. This leads me to think that we need to redefine the dynamic abbreviations feature so it's a bit more robust in the face of this type of situation. A really simple change (at least logically, haven't looked at the code) would be to say that a dynamic abbreviation is just a macro for the referenced zone name, that is if we have NOVT Asia/Novosibirsk in the timezone_abbreviations list then writing NOVT in a timezone value is exactly equivalent to writing Asia/Novosibirsk. However, that breaks backwards compatibility (at least for non-broken abbreviations). Our convention up to now has been that if you write a standard-time zone abbreviation then what it means is your local standard-time UTC offset, even if DST is currently prevailing in your zone. For example, DST is currently in force in the USA, so writing "America/New_York" means UTC-4, but "EST" means UTC-5 regardless. Likewise "EDT" means UTC-4 and will still mean that when winter comes. So the idea I'm toying with (again, haven't tried to code this) is to say that *if* we can match the abbreviation to something in the referenced zone then we'll use that, but otherwise we fall back to treating the abbreviation as a macro for the zone name. This would ensure that updates to the IANA data could not break existing timezone_abbreviations entries (at least, not unless IANA were to remove a zone name altogether, but they have never done that AFAIR). An update could cause an abbreviation's meaning to change, but that's true already, in fact it's kind of the whole point. If we were to do that, then perhaps we would not need to remove existing timezone_abbreviations entries even if IANA deems them obsolete. I'd still be inclined to remove them from our sample data files, but users would easily be able to put them back if they wanted. Thoughts? regards, tom lane
I wrote: > So the idea I'm toying with (again, haven't tried to code this) is to say > that *if* we can match the abbreviation to something in the referenced > zone then we'll use that, but otherwise we fall back to treating the > abbreviation as a macro for the zone name. This turned out to require only fairly localized code changes. So I now propose that we leave the tznames files as-is and apply the attached patch (plus documentation changes, which I've not done yet). Any objections? regards, tom lane diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c index 965c3b4..45ba7cd 100644 *** a/src/backend/utils/adt/datetime.c --- b/src/backend/utils/adt/datetime.c *************** static void AdjustFractDays(double frac, *** 56,63 **** int scale); static int DetermineTimeZoneOffsetInternal(struct pg_tm * tm, pg_tz *tzp, pg_time_t *tp); ! static int DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t, const char *abbr, ! pg_tz *tzp, int *isdst); static pg_tz *FetchDynamicTimeZone(TimeZoneAbbrevTable *tbl, const datetkn *tp); --- 56,64 ---- int scale); static int DetermineTimeZoneOffsetInternal(struct pg_tm * tm, pg_tz *tzp, pg_time_t *tp); ! static bool DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t, ! const char *abbr, pg_tz *tzp, ! int *offset, int *isdst); static pg_tz *FetchDynamicTimeZone(TimeZoneAbbrevTable *tbl, const datetkn *tp); *************** overflow: *** 1689,1707 **** * This differs from the behavior of DetermineTimeZoneOffset() in that a * standard-time or daylight-time abbreviation forces use of the corresponding * GMT offset even when the zone was then in DS or standard time respectively. */ int DetermineTimeZoneAbbrevOffset(struct pg_tm * tm, const char *abbr, pg_tz *tzp) { pg_time_t t; /* * Compute the UTC time we want to probe at. (In event of overflow, we'll * probe at the epoch, which is a bit random but probably doesn't matter.) */ ! (void) DetermineTimeZoneOffsetInternal(tm, tzp, &t); ! return DetermineTimeZoneAbbrevOffsetInternal(t, abbr, tzp, &tm->tm_isdst); } --- 1690,1729 ---- * This differs from the behavior of DetermineTimeZoneOffset() in that a * standard-time or daylight-time abbreviation forces use of the corresponding * GMT offset even when the zone was then in DS or standard time respectively. + * (However, that happens only if we can match the given abbreviation to some + * abbreviation that appears in the IANA timezone data. Otherwise, we fall + * back to doing DetermineTimeZoneOffset().) */ int DetermineTimeZoneAbbrevOffset(struct pg_tm * tm, const char *abbr, pg_tz *tzp) { pg_time_t t; + int zone_offset; + int abbr_offset; + int abbr_isdst; /* * Compute the UTC time we want to probe at. (In event of overflow, we'll * probe at the epoch, which is a bit random but probably doesn't matter.) */ ! zone_offset = DetermineTimeZoneOffsetInternal(tm, tzp, &t); ! /* ! * Try to match the abbreviation to something in the zone definition. ! */ ! if (DetermineTimeZoneAbbrevOffsetInternal(t, abbr, tzp, ! &abbr_offset, &abbr_isdst)) ! { ! /* Success, so use the abbrev-specific answers. */ ! tm->tm_isdst = abbr_isdst; ! return abbr_offset; ! } ! ! /* ! * No match, so use the answers we already got from ! * DetermineTimeZoneOffsetInternal. ! */ ! return zone_offset; } *************** DetermineTimeZoneAbbrevOffsetTS(Timestam *** 1715,1733 **** pg_tz *tzp, int *isdst) { pg_time_t t = timestamptz_to_time_t(ts); ! return DetermineTimeZoneAbbrevOffsetInternal(t, abbr, tzp, isdst); } /* DetermineTimeZoneAbbrevOffsetInternal() * * Workhorse for above two functions: work from a pg_time_t probe instant. ! * DST status is returned into *isdst. */ ! static int ! DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t, const char *abbr, ! pg_tz *tzp, int *isdst) { char upabbr[TZ_STRLEN_MAX + 1]; unsigned char *p; --- 1737,1777 ---- pg_tz *tzp, int *isdst) { pg_time_t t = timestamptz_to_time_t(ts); + int zone_offset; + int abbr_offset; + int tz; + struct pg_tm tm; + fsec_t fsec; ! /* ! * If the abbrev matches anything in the zone data, this is pretty easy. ! */ ! if (DetermineTimeZoneAbbrevOffsetInternal(t, abbr, tzp, ! &abbr_offset, isdst)) ! return abbr_offset; ! ! /* ! * Else, break down the timestamp so we can use DetermineTimeZoneOffset. ! */ ! if (timestamp2tm(ts, &tz, &tm, &fsec, NULL, tzp) != 0) ! ereport(ERROR, ! (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), ! errmsg("timestamp out of range"))); ! ! zone_offset = DetermineTimeZoneOffset(&tm, tzp); ! *isdst = tm.tm_isdst; ! return zone_offset; } /* DetermineTimeZoneAbbrevOffsetInternal() * * Workhorse for above two functions: work from a pg_time_t probe instant. ! * On success, return GMT offset and DST status into *offset and *isdst. */ ! static bool ! DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t, const char *abbr, pg_tz *tzp, ! int *offset, int *isdst) { char upabbr[TZ_STRLEN_MAX + 1]; unsigned char *p; *************** DetermineTimeZoneAbbrevOffsetInternal(pg *** 1739,1756 **** *p = pg_toupper(*p); /* Look up the abbrev's meaning at this time in this zone */ ! if (!pg_interpret_timezone_abbrev(upabbr, ! &t, ! &gmtoff, ! isdst, ! tzp)) ! ereport(ERROR, ! (errcode(ERRCODE_CONFIG_FILE_ERROR), ! errmsg("time zone abbreviation \"%s\" is not used in time zone \"%s\"", ! abbr, pg_get_timezone_name(tzp)))); ! ! /* Change sign to agree with DetermineTimeZoneOffset() */ ! return (int) -gmtoff; } --- 1783,1799 ---- *p = pg_toupper(*p); /* Look up the abbrev's meaning at this time in this zone */ ! if (pg_interpret_timezone_abbrev(upabbr, ! &t, ! &gmtoff, ! isdst, ! tzp)) ! { ! /* Change sign to agree with DetermineTimeZoneOffset() */ ! *offset = (int) -gmtoff; ! return true; ! } ! return false; } diff --git a/src/test/regress/expected/timestamptz.out b/src/test/regress/expected/timestamptz.out index 67f26db..2bfc13a 100644 *** a/src/test/regress/expected/timestamptz.out --- b/src/test/regress/expected/timestamptz.out *************** SELECT '2007-12-09 07:30:00 UTC'::timest *** 2603,2605 **** --- 2603,2625 ---- Sun Dec 09 03:00:00 2007 (1 row) + -- + -- Test that the pg_timezone_names and pg_timezone_abbrevs views are + -- more-or-less working. We can't test their contents in any great detail + -- without the outputs changing anytime IANA updates the underlying data, + -- but it seems reasonable to expect at least one entry per major meridian. + -- (At the time of writing, the actual counts are around 38 because of + -- zones using fractional GMT offsets, so this is a pretty loose test.) + -- + select count(distinct utc_offset) >= 24 as ok from pg_timezone_names; + ok + ---- + t + (1 row) + + select count(distinct utc_offset) >= 24 as ok from pg_timezone_abbrevs; + ok + ---- + t + (1 row) + diff --git a/src/test/regress/sql/timestamptz.sql b/src/test/regress/sql/timestamptz.sql index c023095..ce9d1c2 100644 *** a/src/test/regress/sql/timestamptz.sql --- b/src/test/regress/sql/timestamptz.sql *************** SELECT '2007-12-09 07:00:00 UTC'::timest *** 468,470 **** --- 468,481 ---- SELECT '2007-12-09 07:00:01 UTC'::timestamptz AT TIME ZONE 'VET'; SELECT '2007-12-09 07:29:59 UTC'::timestamptz AT TIME ZONE 'VET'; SELECT '2007-12-09 07:30:00 UTC'::timestamptz AT TIME ZONE 'VET'; + + -- + -- Test that the pg_timezone_names and pg_timezone_abbrevs views are + -- more-or-less working. We can't test their contents in any great detail + -- without the outputs changing anytime IANA updates the underlying data, + -- but it seems reasonable to expect at least one entry per major meridian. + -- (At the time of writing, the actual counts are around 38 because of + -- zones using fractional GMT offsets, so this is a pretty loose test.) + -- + select count(distinct utc_offset) >= 24 as ok from pg_timezone_names; + select count(distinct utc_offset) >= 24 as ok from pg_timezone_abbrevs;
On Fri, Sep 2, 2016 at 8:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I had thought that this wouldn't really affect us because PG's > interpretations of zone abbreviations are driven by the info in > timezone/tznames/ rather than the IANA files in timezone/data/. > I forgot however that the "dynamic abbreviation" code relies on > being able to find matching abbreviations in the IANA data. > For example, timezone/tznames/Default lists > > NOVST Asia/Novosibirsk # Novosibirsk Summer Time (obsolete) > > but that zone abbreviation now fails entirely: > > # select '2016-09-02 08:00:00 NOVST'::timestamptz; > ERROR: time zone abbreviation "novst" is not used in time zone "Asia/Novosibirsk" > > (This is also the cause of bug #14307.) > > So the question is what to do about it. > > An easy answer is to just delete such entries from the timezone/tznames/ > files altogether. If we believe the IANA crew's assumption that nobody > uses these abbreviations in the real world, then that seems like it > removes useless maintenance overhead for little cost. I'm a little > worried though that somebody might have followed IANA's lead and actually > started using these abbreviations, in which case we'd get complaints. > > Another possibility is to keep these entries but get rid of the dynamic > zone aliases, reducing them to plain numeric UTC offsets. Probably the > value to use would be whatever was shown as the most recent active value > in the IANA list ... although that's open to interpretation. For > instance, according to the above we'd likely define NOVT as UTC+7, > but then logically NOVST ought to be UTC+8, which doesn't match up > with the fact that when it actually last appeared in the IANA data > it meant UTC+7. So that sounds like it'd be a bit of a mess involving > some judgment calls. > > In the end, part of the reason we've got these files is so that users > can make their own decisions about abbreviation meanings. So the > ultimate answer to any complaints is going to be "if you think NOVT > means thus-and-such then put in an entry that says so". > > So I'm leaning to the just-remove-it answer for any deleted abbreviation > that relies on a dynamic definition. Names that never had more than one > UTC offset can remain in the tznames list. > > Comments? I tried to understand these emails today and mostly failed. I think this stuff needs to be better documented. src/timezones/tznames/README pretty much presumes that you more or less know what's going on: # This directory contains files with timezone sets for PostgreSQL. The problem # is that time zone abbreviations are not unique throughout the world and you # might find out that a time zone abbreviation in the `Default' set collides # with the one you wanted to use. This can be fixed by selecting a timezone # set that defines the abbreviation the way you want it. There might already # be a file here that serves your needs. If not, you can create your own. Of course, it doesn't bother to define the term 'timezone set', a phrase that is not used in any other file in the source tree. Eventually, after some poking around, I figured out that this is described in the documentation under "B.3. Date/Time Configuration Files", but I didn't notice that when I searched the documentation for it. I only figured it out after I actually looked at the contents of "src/timezone/tznames/Default" and saw that it recommended looking at the "Date/Time Support", which I had done, but which I gave up on doing pretty quickly because there was no section whose title included the phrase "time zone". Once I read that section, I ran across this: # The offset is the equivalent offset in seconds from UTC, positive being east # from Greenwich and negative being west. For example, -18000 would be five # hours west of Greenwich, or North American east coast standard time. D # indicates that the zone name represents local daylight-savings time rather # than standard time. Alternatively, a time_zone_name can be given, in which # case that time zone definition is consulted, and the abbreviation's meaning # in that zone is used. I dimly understand that "the abbreviation's meaning in that zone" is referring to an EST vs. EDT type of distinction, but a novice reader of this documentation will probably not understand that from this language. Similarly, your email mentions "dynamic zone aliases" but: [rhaas pgsql]$ git grep -i 'dynamic zone alias' [rhaas pgsql]$ git grep -i 'dynamic zone' [rhaas pgsql]$ git grep -i 'zone alias' [rhaas pgsql]$ I'm guessing that's referring to the same thing as the above documentation paragraph, wherein a "time_zone_name" is given, but it doesn't really refer to the time zone name, but rather the DST or non-DST configuration of that time zone. But I'm not sure I'm right about that. Another thing that's not very well-documented is the use of our own tzdata vs. the system's tzdata. There is a paragraph on it in the installation instructions: https://www.postgresql.org/docs/current/static/install-procedure.html ...but what happens if you use it? Are dynamic zone aliases and time zone sets and such still in the picture, or have you forfeited that stuff by going over to the system's tzdata? Searching the documentation for "time zone" through the web site yields a positively uninspiring set of results: https://www.postgresql.org/search/?u=%2Fdocs%2F9.6%2F&q=time+zone I suggest that if you (or someone else who understands it, if there is any such person) were willing to either improve the existing documentation or maybe even write a whole new chapter on how we do time zone handling, it would be most welcome. I basically feel that such documentation as we already have presumes expert-level knowledge of what's in the IANA files, what's in our configuration files, and how everything fits together; and I doubt many people have that knowledge. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I suggest that if you (or someone else who understands it, if there is > any such person) were willing to either improve the existing > documentation or maybe even write a whole new chapter on how we do > time zone handling, it would be most welcome. Well, I'll put that on my to-do list, but it's a very long list. And TBH, I thought the SGML docs around this were adequate already, so I might not be in the best position to rewrite them to satisfy you. regards, tom lane