Обсуждение: Tracking timezone abbreviation removals in the IANA tz database

Поиск
Список
Период
Сортировка

Tracking timezone abbreviation removals in the IANA tz database

От
Tom Lane
Дата:
It turns out that some of the zone abbreviations shown in the IANA
timezone database don't correspond to any real-world usage, but were
more or less made up by the timezone database crew.  And of late,
those folk have started to remove these invented abbreviations in favor
of just using numeric offsets.  For example, part of the 2016f update
did this:
# Asia/Novosibirsk covers:# 54    RU-NVS    Novosibirsk Oblast
+# From Stepan Golosunov (2016-05-30):
+# http://asozd2.duma.gov.ru/main.nsf/(Spravka)?OpenAgent&RN=1085784-6
+# moves Novosibirsk oblast from UTC+6 to UTC+7.
+# From Stepan Golosunov (2016-07-04):
+# The law was signed yesterday and published today on
+# http://publication.pravo.gov.ru/Document/View/0001201607040064
+Zone Asia/Novosibirsk     5:31:40 -    LMT    1919 Dec 14  6:00
-             6:00    -    NOVT    1930 Jun 21 # Novosibirsk Time
-             7:00    Russia    NOV%sT    1991 Mar 31  2:00s
-             6:00    Russia    NOV%sT    1992 Jan 19  2:00s
-             7:00    Russia    NOV%sT    1993 May 23 # say Shanks & P.
-             6:00    Russia    NOV%sT    2011 Mar 27  2:00s
-             7:00    -    NOVT    2014 Oct 26  2:00s
-             6:00    -    NOVT
+             6:00    -    +06    1930 Jun 21
+             7:00    Russia    +07/+08    1991 Mar 31  2:00s
+             6:00    Russia    +06/+07    1992 Jan 19  2:00s
+             7:00    Russia    +07/+08    1993 May 23 # say Shanks & P.
+             6:00    Russia    +06/+07    2011 Mar 27  2:00s
+             7:00    -    +07    2014 Oct 26  2:00s
+             6:00    -    +06    2016 Jul 24  2:00s
+             7:00    -    +07

According to the Russia rules, %s expands to nothing for standard time or
"S" for summer time, so that the old entries here had the effect of
defining NOVT and NOVST as meaning variously UTC+6, UTC+7, or UTC+8
at different times in the past.  Now those abbreviations are gone,
so far as IANA is concerned.

I had thought that this wouldn't really affect us because PG's
interpretations of zone abbreviations are driven by the info in
timezone/tznames/ rather than the IANA files in timezone/data/.
I forgot however that the "dynamic abbreviation" code relies on
being able to find matching abbreviations in the IANA data.
For example, timezone/tznames/Default lists

NOVST   Asia/Novosibirsk  # Novosibirsk Summer Time (obsolete)

but that zone abbreviation now fails entirely:

# select '2016-09-02 08:00:00 NOVST'::timestamptz;
ERROR:  time zone abbreviation "novst" is not used in time zone "Asia/Novosibirsk"

(This is also the cause of bug #14307.)

So the question is what to do about it.

An easy answer is to just delete such entries from the timezone/tznames/
files altogether.  If we believe the IANA crew's assumption that nobody
uses these abbreviations in the real world, then that seems like it
removes useless maintenance overhead for little cost.  I'm a little
worried though that somebody might have followed IANA's lead and actually
started using these abbreviations, in which case we'd get complaints.

Another possibility is to keep these entries but get rid of the dynamic
zone aliases, reducing them to plain numeric UTC offsets.  Probably the
value to use would be whatever was shown as the most recent active value
in the IANA list ... although that's open to interpretation.  For
instance, according to the above we'd likely define NOVT as UTC+7,
but then logically NOVST ought to be UTC+8, which doesn't match up
with the fact that when it actually last appeared in the IANA data
it meant UTC+7.  So that sounds like it'd be a bit of a mess involving
some judgment calls.

In the end, part of the reason we've got these files is so that users
can make their own decisions about abbreviation meanings.  So the
ultimate answer to any complaints is going to be "if you think NOVT
means thus-and-such then put in an entry that says so".

So I'm leaning to the just-remove-it answer for any deleted abbreviation
that relies on a dynamic definition.  Names that never had more than one
UTC offset can remain in the tznames list.

Comments?

BTW, we should be glad that the IANA crew decided to use ISO not POSIX
sign convention for these numeric zone abbreviations, or we'd really
have a mess on our hands :-(.

Also, the fact that this failure made it into releases shows that I'm
missing some now-necessary validation steps in the zone data update
procedure.  Will add something about that to the timezone/README file.
        regards, tom lane



Re: Tracking timezone abbreviation removals in the IANA tz database

От
Tom Lane
Дата:
I wrote:
> So I'm leaning to the just-remove-it answer for any deleted abbreviation
> that relies on a dynamic definition.  Names that never had more than one
> UTC offset can remain in the tznames list.

After a bit more thought and consumption of caffeine, I am thinking that
that won't really be good enough.  It's clear that the IANA crowd intend
to continue removing made-up abbreviations, in fact there's a pile of
that in their queue right now:
http://mm.icann.org/pipermail/tz/2016-August/023941.html

The trouble from our perspective is that a lot of those abbreviations have
shifted meaning over the years (if they had any real-world usage maybe
they'd have stayed more stable...), which means that we are using dynamic
abbreviations for them, which means breakage as soon as the tznames list
diverges from the IANA database.  Which would be manageable if we only
shipped those filesets together, but in installations built with
--with-system-tzdata (which at least ought to include most vendor
distributions of PG), they don't come from the same place.  This means
that even if we fix removals when we make new releases, other
abbreviations may break the next time the vendor updates their tzdata
package.

We could maybe tolerate individual abbreviations failing like this, but
as noted in bug #14307, the pg_timezone_abbrevs view fails altogether if
there are any broken abbreviations in the active timezone_abbreviations
list.  So that makes this a problem for all users whether or not they care
about the specific abbreviations affected.

This leads me to think that we need to redefine the dynamic abbreviations
feature so it's a bit more robust in the face of this type of situation.

A really simple change (at least logically, haven't looked at the code)
would be to say that a dynamic abbreviation is just a macro for the
referenced zone name, that is if we have

NOVT   Asia/Novosibirsk

in the timezone_abbreviations list then writing NOVT in a timezone value
is exactly equivalent to writing Asia/Novosibirsk.  However, that breaks
backwards compatibility (at least for non-broken abbreviations).  Our
convention up to now has been that if you write a standard-time zone
abbreviation then what it means is your local standard-time UTC offset,
even if DST is currently prevailing in your zone.  For example, DST is
currently in force in the USA, so writing "America/New_York" means UTC-4,
but "EST" means UTC-5 regardless.  Likewise "EDT" means UTC-4 and will
still mean that when winter comes.

So the idea I'm toying with (again, haven't tried to code this) is to say
that *if* we can match the abbreviation to something in the referenced
zone then we'll use that, but otherwise we fall back to treating the
abbreviation as a macro for the zone name.  This would ensure that updates
to the IANA data could not break existing timezone_abbreviations entries
(at least, not unless IANA were to remove a zone name altogether, but they
have never done that AFAIR).  An update could cause an abbreviation's
meaning to change, but that's true already, in fact it's kind of the
whole point.

If we were to do that, then perhaps we would not need to remove existing
timezone_abbreviations entries even if IANA deems them obsolete.  I'd
still be inclined to remove them from our sample data files, but users
would easily be able to put them back if they wanted.

Thoughts?
        regards, tom lane



Re: Tracking timezone abbreviation removals in the IANA tz database

От
Tom Lane
Дата:
I wrote:
> So the idea I'm toying with (again, haven't tried to code this) is to say
> that *if* we can match the abbreviation to something in the referenced
> zone then we'll use that, but otherwise we fall back to treating the
> abbreviation as a macro for the zone name.

This turned out to require only fairly localized code changes.  So
I now propose that we leave the tznames files as-is and apply the
attached patch (plus documentation changes, which I've not done yet).
Any objections?

            regards, tom lane

diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 965c3b4..45ba7cd 100644
*** a/src/backend/utils/adt/datetime.c
--- b/src/backend/utils/adt/datetime.c
*************** static void AdjustFractDays(double frac,
*** 56,63 ****
                  int scale);
  static int DetermineTimeZoneOffsetInternal(struct pg_tm * tm, pg_tz *tzp,
                                  pg_time_t *tp);
! static int DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t, const char *abbr,
!                                       pg_tz *tzp, int *isdst);
  static pg_tz *FetchDynamicTimeZone(TimeZoneAbbrevTable *tbl, const datetkn *tp);


--- 56,64 ----
                  int scale);
  static int DetermineTimeZoneOffsetInternal(struct pg_tm * tm, pg_tz *tzp,
                                  pg_time_t *tp);
! static bool DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t,
!                                       const char *abbr, pg_tz *tzp,
!                                       int *offset, int *isdst);
  static pg_tz *FetchDynamicTimeZone(TimeZoneAbbrevTable *tbl, const datetkn *tp);


*************** overflow:
*** 1689,1707 ****
   * This differs from the behavior of DetermineTimeZoneOffset() in that a
   * standard-time or daylight-time abbreviation forces use of the corresponding
   * GMT offset even when the zone was then in DS or standard time respectively.
   */
  int
  DetermineTimeZoneAbbrevOffset(struct pg_tm * tm, const char *abbr, pg_tz *tzp)
  {
      pg_time_t    t;

      /*
       * Compute the UTC time we want to probe at.  (In event of overflow, we'll
       * probe at the epoch, which is a bit random but probably doesn't matter.)
       */
!     (void) DetermineTimeZoneOffsetInternal(tm, tzp, &t);

!     return DetermineTimeZoneAbbrevOffsetInternal(t, abbr, tzp, &tm->tm_isdst);
  }


--- 1690,1729 ----
   * This differs from the behavior of DetermineTimeZoneOffset() in that a
   * standard-time or daylight-time abbreviation forces use of the corresponding
   * GMT offset even when the zone was then in DS or standard time respectively.
+  * (However, that happens only if we can match the given abbreviation to some
+  * abbreviation that appears in the IANA timezone data.  Otherwise, we fall
+  * back to doing DetermineTimeZoneOffset().)
   */
  int
  DetermineTimeZoneAbbrevOffset(struct pg_tm * tm, const char *abbr, pg_tz *tzp)
  {
      pg_time_t    t;
+     int            zone_offset;
+     int            abbr_offset;
+     int            abbr_isdst;

      /*
       * Compute the UTC time we want to probe at.  (In event of overflow, we'll
       * probe at the epoch, which is a bit random but probably doesn't matter.)
       */
!     zone_offset = DetermineTimeZoneOffsetInternal(tm, tzp, &t);

!     /*
!      * Try to match the abbreviation to something in the zone definition.
!      */
!     if (DetermineTimeZoneAbbrevOffsetInternal(t, abbr, tzp,
!                                               &abbr_offset, &abbr_isdst))
!     {
!         /* Success, so use the abbrev-specific answers. */
!         tm->tm_isdst = abbr_isdst;
!         return abbr_offset;
!     }
!
!     /*
!      * No match, so use the answers we already got from
!      * DetermineTimeZoneOffsetInternal.
!      */
!     return zone_offset;
  }


*************** DetermineTimeZoneAbbrevOffsetTS(Timestam
*** 1715,1733 ****
                                  pg_tz *tzp, int *isdst)
  {
      pg_time_t    t = timestamptz_to_time_t(ts);

!     return DetermineTimeZoneAbbrevOffsetInternal(t, abbr, tzp, isdst);
  }


  /* DetermineTimeZoneAbbrevOffsetInternal()
   *
   * Workhorse for above two functions: work from a pg_time_t probe instant.
!  * DST status is returned into *isdst.
   */
! static int
! DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t, const char *abbr,
!                                       pg_tz *tzp, int *isdst)
  {
      char        upabbr[TZ_STRLEN_MAX + 1];
      unsigned char *p;
--- 1737,1777 ----
                                  pg_tz *tzp, int *isdst)
  {
      pg_time_t    t = timestamptz_to_time_t(ts);
+     int            zone_offset;
+     int            abbr_offset;
+     int            tz;
+     struct pg_tm tm;
+     fsec_t        fsec;

!     /*
!      * If the abbrev matches anything in the zone data, this is pretty easy.
!      */
!     if (DetermineTimeZoneAbbrevOffsetInternal(t, abbr, tzp,
!                                               &abbr_offset, isdst))
!         return abbr_offset;
!
!     /*
!      * Else, break down the timestamp so we can use DetermineTimeZoneOffset.
!      */
!     if (timestamp2tm(ts, &tz, &tm, &fsec, NULL, tzp) != 0)
!         ereport(ERROR,
!                 (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
!                  errmsg("timestamp out of range")));
!
!     zone_offset = DetermineTimeZoneOffset(&tm, tzp);
!     *isdst = tm.tm_isdst;
!     return zone_offset;
  }


  /* DetermineTimeZoneAbbrevOffsetInternal()
   *
   * Workhorse for above two functions: work from a pg_time_t probe instant.
!  * On success, return GMT offset and DST status into *offset and *isdst.
   */
! static bool
! DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t, const char *abbr, pg_tz *tzp,
!                                       int *offset, int *isdst)
  {
      char        upabbr[TZ_STRLEN_MAX + 1];
      unsigned char *p;
*************** DetermineTimeZoneAbbrevOffsetInternal(pg
*** 1739,1756 ****
          *p = pg_toupper(*p);

      /* Look up the abbrev's meaning at this time in this zone */
!     if (!pg_interpret_timezone_abbrev(upabbr,
!                                       &t,
!                                       &gmtoff,
!                                       isdst,
!                                       tzp))
!         ereport(ERROR,
!                 (errcode(ERRCODE_CONFIG_FILE_ERROR),
!                  errmsg("time zone abbreviation \"%s\" is not used in time zone \"%s\"",
!                         abbr, pg_get_timezone_name(tzp))));
!
!     /* Change sign to agree with DetermineTimeZoneOffset() */
!     return (int) -gmtoff;
  }


--- 1783,1799 ----
          *p = pg_toupper(*p);

      /* Look up the abbrev's meaning at this time in this zone */
!     if (pg_interpret_timezone_abbrev(upabbr,
!                                      &t,
!                                      &gmtoff,
!                                      isdst,
!                                      tzp))
!     {
!         /* Change sign to agree with DetermineTimeZoneOffset() */
!         *offset = (int) -gmtoff;
!         return true;
!     }
!     return false;
  }


diff --git a/src/test/regress/expected/timestamptz.out b/src/test/regress/expected/timestamptz.out
index 67f26db..2bfc13a 100644
*** a/src/test/regress/expected/timestamptz.out
--- b/src/test/regress/expected/timestamptz.out
*************** SELECT '2007-12-09 07:30:00 UTC'::timest
*** 2603,2605 ****
--- 2603,2625 ----
   Sun Dec 09 03:00:00 2007
  (1 row)

+ --
+ -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
+ -- more-or-less working.  We can't test their contents in any great detail
+ -- without the outputs changing anytime IANA updates the underlying data,
+ -- but it seems reasonable to expect at least one entry per major meridian.
+ -- (At the time of writing, the actual counts are around 38 because of
+ -- zones using fractional GMT offsets, so this is a pretty loose test.)
+ --
+ select count(distinct utc_offset) >= 24 as ok from pg_timezone_names;
+  ok
+ ----
+  t
+ (1 row)
+
+ select count(distinct utc_offset) >= 24 as ok from pg_timezone_abbrevs;
+  ok
+ ----
+  t
+ (1 row)
+
diff --git a/src/test/regress/sql/timestamptz.sql b/src/test/regress/sql/timestamptz.sql
index c023095..ce9d1c2 100644
*** a/src/test/regress/sql/timestamptz.sql
--- b/src/test/regress/sql/timestamptz.sql
*************** SELECT '2007-12-09 07:00:00 UTC'::timest
*** 468,470 ****
--- 468,481 ----
  SELECT '2007-12-09 07:00:01 UTC'::timestamptz AT TIME ZONE 'VET';
  SELECT '2007-12-09 07:29:59 UTC'::timestamptz AT TIME ZONE 'VET';
  SELECT '2007-12-09 07:30:00 UTC'::timestamptz AT TIME ZONE 'VET';
+
+ --
+ -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
+ -- more-or-less working.  We can't test their contents in any great detail
+ -- without the outputs changing anytime IANA updates the underlying data,
+ -- but it seems reasonable to expect at least one entry per major meridian.
+ -- (At the time of writing, the actual counts are around 38 because of
+ -- zones using fractional GMT offsets, so this is a pretty loose test.)
+ --
+ select count(distinct utc_offset) >= 24 as ok from pg_timezone_names;
+ select count(distinct utc_offset) >= 24 as ok from pg_timezone_abbrevs;

Re: Tracking timezone abbreviation removals in the IANA tz database

От
Robert Haas
Дата:
On Fri, Sep 2, 2016 at 8:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I had thought that this wouldn't really affect us because PG's
> interpretations of zone abbreviations are driven by the info in
> timezone/tznames/ rather than the IANA files in timezone/data/.
> I forgot however that the "dynamic abbreviation" code relies on
> being able to find matching abbreviations in the IANA data.
> For example, timezone/tznames/Default lists
>
> NOVST   Asia/Novosibirsk  # Novosibirsk Summer Time (obsolete)
>
> but that zone abbreviation now fails entirely:
>
> # select '2016-09-02 08:00:00 NOVST'::timestamptz;
> ERROR:  time zone abbreviation "novst" is not used in time zone "Asia/Novosibirsk"
>
> (This is also the cause of bug #14307.)
>
> So the question is what to do about it.
>
> An easy answer is to just delete such entries from the timezone/tznames/
> files altogether.  If we believe the IANA crew's assumption that nobody
> uses these abbreviations in the real world, then that seems like it
> removes useless maintenance overhead for little cost.  I'm a little
> worried though that somebody might have followed IANA's lead and actually
> started using these abbreviations, in which case we'd get complaints.
>
> Another possibility is to keep these entries but get rid of the dynamic
> zone aliases, reducing them to plain numeric UTC offsets.  Probably the
> value to use would be whatever was shown as the most recent active value
> in the IANA list ... although that's open to interpretation.  For
> instance, according to the above we'd likely define NOVT as UTC+7,
> but then logically NOVST ought to be UTC+8, which doesn't match up
> with the fact that when it actually last appeared in the IANA data
> it meant UTC+7.  So that sounds like it'd be a bit of a mess involving
> some judgment calls.
>
> In the end, part of the reason we've got these files is so that users
> can make their own decisions about abbreviation meanings.  So the
> ultimate answer to any complaints is going to be "if you think NOVT
> means thus-and-such then put in an entry that says so".
>
> So I'm leaning to the just-remove-it answer for any deleted abbreviation
> that relies on a dynamic definition.  Names that never had more than one
> UTC offset can remain in the tznames list.
>
> Comments?

I tried to understand these emails today and mostly failed. I think
this stuff needs to be better documented.
src/timezones/tznames/README pretty much presumes that you more or
less know what's going on:

# This directory contains files with timezone sets for PostgreSQL.  The problem
# is that time zone abbreviations are not unique throughout the world and you
# might find out that a time zone abbreviation in the `Default' set collides
# with the one you wanted to use.  This can be fixed by selecting a timezone
# set that defines the abbreviation the way you want it.  There might already
# be a file here that serves your needs.  If not, you can create your own.

Of course, it doesn't bother to define the term 'timezone set', a
phrase that is not used in any other file in the source tree.
Eventually, after some poking around, I figured out that this is
described in the documentation under "B.3. Date/Time Configuration
Files", but I didn't notice that when I searched the documentation for
it.  I only figured it out after I actually looked at the contents of
"src/timezone/tznames/Default" and saw that it recommended looking at
the "Date/Time Support", which I had done, but which I gave up on
doing pretty quickly because there was no section whose title included
the phrase "time zone".

Once I read that section, I ran across this:

# The offset is the equivalent offset in seconds from UTC, positive being east
# from Greenwich and negative being west. For example, -18000 would be five
# hours west of Greenwich, or North American east coast standard time. D
# indicates that the zone name represents local daylight-savings time rather
# than standard time. Alternatively, a time_zone_name can be given, in which
# case that time zone definition is consulted, and the abbreviation's meaning
# in that zone is used.

I dimly understand that "the abbreviation's meaning in that zone" is
referring to an EST vs. EDT type of distinction, but a novice reader
of this documentation will probably not understand that from this
language.

Similarly, your email mentions "dynamic zone aliases" but:

[rhaas pgsql]$ git grep -i 'dynamic zone alias'
[rhaas pgsql]$ git grep -i 'dynamic zone'
[rhaas pgsql]$ git grep -i 'zone alias'
[rhaas pgsql]$

I'm guessing that's referring to the same thing as the above
documentation paragraph, wherein a "time_zone_name" is given, but it
doesn't really refer to the time zone name, but rather the DST or
non-DST configuration of that time zone.  But I'm not sure I'm right
about that.

Another thing that's not very well-documented is the use of our own
tzdata vs. the system's tzdata.  There is a paragraph on it in the
installation instructions:

https://www.postgresql.org/docs/current/static/install-procedure.html

...but what happens if you use it?  Are dynamic zone aliases and time
zone sets and such still in the picture, or have you forfeited that
stuff by going over to the system's tzdata?

Searching the documentation for "time zone" through the web site
yields a positively uninspiring set of results:

https://www.postgresql.org/search/?u=%2Fdocs%2F9.6%2F&q=time+zone

I suggest that if you (or someone else who understands it, if there is
any such person) were willing to either improve the existing
documentation or maybe even write a whole new chapter on how we do
time zone handling, it would be most welcome.  I basically feel that
such documentation as we already have presumes expert-level knowledge
of what's in the IANA files, what's in our configuration files, and
how everything fits together; and I doubt many people have that
knowledge.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Tracking timezone abbreviation removals in the IANA tz database

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> I suggest that if you (or someone else who understands it, if there is
> any such person) were willing to either improve the existing
> documentation or maybe even write a whole new chapter on how we do
> time zone handling, it would be most welcome.

Well, I'll put that on my to-do list, but it's a very long list.
And TBH, I thought the SGML docs around this were adequate already,
so I might not be in the best position to rewrite them to satisfy you.
        regards, tom lane