tz.cpp should parse the compiled form of the TZDB, not the source #1
What advantage would this approach have? One disadvantage I see is that it makes the install process more complicated: now the end user also has to download, build and run …
There are two basic advantages:
For Windows, don't parse the IANA TZDB. Windows has its own database in the registry. A serious application would open "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Time Zones". See http://msdn.microsoft.com/en-gb/library/windows/desktop/ms724253%28v=vs.85%29.aspx
It might be interesting to have a way to do both a direct parse, and a … I am interested in porting this code to Windows. Perhaps there too, dual parsers might be interesting. I'm open to this being a collaborative project (which is one of the reasons I set it up on GitHub). Would you be interested in developing the …
Thanks for the offer, but I have a lot on my plate right now. Adding more responsibilities is not something I can do...
+1. I maintained zoneinfo for Red Hat for several years, and it's incredible how many places there are separate copies of this data. E.g. Java has one (of either of two versions), PHP I think has one, then there was pytz for Python which shipped one, and the corresponding thing for Ruby... We patched them wherever we could to just use system data.

tzdata is very volatile: it changes about monthly (mostly around spring and autumn when governments feel the need to meddle with daylight saving, but zone boundaries also change more often than you would expect). I don't know about the BSDs etc., but on Linux this is a core part of the OS; even glibc depends on tzdata. So I would rather bet on timeliness of OS updates than on timeliness of updates to any bundled data. Few projects have the cycles to issue 10-ish updates a year for something this obscure, let alone timely. It's fairly common that a government suddenly wakes up and realizes, boy oh boy, we have a DST transition in a week, didn't we want to postpone it? Good luck keeping up with this.

As I wrote elsewhere, please please pretty please, just use the compiled system data whenever possible. The binary format is actually easy to parse. I think it's all documented in man tzfile(5). I wrote a parser some time ago for the tzdiff project, see here: https://git.fedorahosted.org/cgit/tzdiff.git/tree/olson.cc

Having said all this, I understand your position. I don't have spare cycles to write this either, unfortunately.

EDIT: A typo.
I can only heavily second @thiagomacieira and @pmachata. The source code of the tzdata is NOT installed by default on GNU/Linux, whereas (as pointed out by @thiagomacieira) the binary is, and @pmachata already explained why bundling the data with your library is a horrible idea. You really have to parse the system tzdata database, which is in the compiled format; there's no way around it.
I'm currently trying to decode the TZif files provided on my Ubuntu computer, and I'm not convinced that using the compiled tzfiles is a good idea. On an OS without updates these files will rapidly become outdated (e.g. Windows).
If this feature gets coded up, it will be an alternative, not a replacement of the way things are done today. There will be a config flag to choose where you get your tzdb from.
@chmike: leap seconds are not part of Unix time. As for the date rules, the last rules are supposed to repeat ad infinitum. You're probably misinterpreting the files.

If the rules exist in binary and source form, I don't see why one would waste CPU cycles in parsing the text form if the binary form exists. If the OS doesn't have the rules in any form at all (e.g., Windows), you should read from the timezone DB it has, instead of shipping your non-updated rules. If the OS has the binary form and doesn't update it, the OS isn't worth using. And I don't know of any OS that ships the source form only. Just don't bundle the rules.

@HowardHinnant: your presentation at CppCon has done some harm. Now people are thinking your code is suitable for deployment in production, replacing all other solutions. It isn't. It needs to parse the binary form, and it needs to parse the Windows timezone DB too (see http://msdn.microsoft.com/en-gb/library/windows/desktop/ms725481%28v=vs.85%29.aspx and http://msdn.microsoft.com/en-gb/library/windows/desktop/ms724253%28v=vs.85%29.aspx)
@thiagomacieira: Well you better get to work and fix this then! Please don't dally. ;-)
And I leave it to each individual to make the decision as to whether these libraries meet their needs or not. If you are not comfortable using either of these libraries, then by all means, please don't.
As I said before, the problem is that you're too well known. People assume that just because it came from you, it will suffice. This came around again because a colleague saw your presentation at CppCon and posted to the Qt development mailing list saying that we should use it. To which I replied saying there was no need, as our deployed solution in QTimeZone is superior, since it reads the Windows DB, the compiled IANA DB, and can also read from ICU's DB via its API. Since my needs are already met, I don't plan on contributing here. But I'm glad you're keeping this open for contribution by someone else who may have similar needs and cannot use Qt.
The compiled tzfile contains precomputed time values relative to 1970-01-01T00:00:00 UTC, with the associated time offset and an isdst flag set to 1 if the offset includes daylight saving time. It is the info obtained with tzset(). There is no general rule like in the source files. Knowledge of leap seconds is needed to correctly interpret a UTC Gregorian date: you get a time of the form 23:59:60 at a leap second. Maybe the leap seconds are stored in TZif2; I didn't check yet. In my file, the table of leap seconds in TZif is empty. The advantage of tzfile is that it is straightforward to use once unpacked. It is stored in big endian, etc.
Please read "UTC" with a grain of salt here. It's not the UTC you're thinking of that got adjusted by leap seconds. It's the current UTC extended backwards in time as if leap seconds hadn't occurred. See https://en.wikipedia.org/wiki/Unix_time#Leap_seconds. |
@chmike Compiled TZ files may or may not contain leap seconds, depending on the way the system is set up. The standard upstream way is to compile the leap-second-aware zoneinfo files to the right/ subtree of the distribution, the leap-second-unaware ones to the posix/ subtree, and have the ./ subtree contain hardlinks to either of these subtrees. Red Hat had ./* hardlink the posix/ (sans leap seconds) subtree, and I suspect others did that as well, because on Linux, NTP typically handles correct time including leap seconds.

Also @chmike, if the rules end in 2017, that means either of two things. Either the rules really end in 2017 and there are no more transitions defined for the given zone; some volatile zones may have rules encoded only for the present year, even if it is known a transition will take place next year as well, because there is no indication of when it will be. Another possibility is that the rule that follows is regular and can be encoded using a POSIX TZ string. You need to read and parse that as well, which is annoying, but at least the format is well-defined. Yet another possibility is that you really mean 2037, which is far enough in the future that it doesn't seem important. It's an arbitrary cut-off, and the source distribution has flags, IIRC, to set it. Due to the Y2038 problem the 32-bit portion of a zoneinfo file can't encode timestamps in the more distant future, but the 64-bit one (which has been shipped for close to a decade now) has no problem.

And finally @chmike, relying on the compiled form has the advantage that your system will be kept up to date by the OS vendor.

@HowardHinnant I appreciate that you don't ship your own zoneinfo data, no irony. But the source form is not distributed by OS vendors; the binary form is. So whoever ends up using your library will necessarily have to ship it themselves, otherwise the library is of no use.
I have no problem if the library is used only locally, but as soon as you end up distributing dependent code to end users, timely support becomes a pain. At the same time, this issue is often not recognized, because people underestimate how volatile this data is, and the timely support ends up being simply absent. If you could at least point out these issues (or perhaps point at this thread) in the documentation, that would be very helpful; we've really seen way, way more data duplication in this area than there should be. And these bugs are somewhat rare, so it's hard to realize you even have this problem. Essentially you'd get a bunch of bug reports every now and then, when one of your customers happens to somehow interface with a country that meddles with this stuff.
1970-01-01T00:00:00 UTC is the POSIX epoch date time by definition as of … It is still unclear to me what to do before 1 Jan 1972 (63072000 …

On 28/09/2015 22:23, Thiago Macieira wrote:

Best regards, Ch. Meessen
@pmachata I checked again and you are fully right. My bad! The time values are until 2037, not 2017. The TZ text rule is indeed stored in the file. I didn't decode it yet, but it is easy to spot with hexdump. I also confirm that the leap seconds are present in the files under the … I also fully agree that relying on automatically updated files is better when possible. When there is no such update, relying on the compiled files for the local time offsets is still OK, because the data is valid until 2037 if the rules don't change. The compiled files are also straightforward to use. The only problem, in the absence of automatic updates, is the leap seconds.
@HowardHinnant Cool, thanks.

@chmike So the point is that the rules actually do change very often. Even if you have rules all the way to 2037, chances are they will become obsolete much more quickly. Depending on the region your customers live in or deal with, it might take some time (e.g. European or US rules seem to be stable), or it might be a surprising spur-of-the-moment thing (e.g. South American countries seem to be more prone to changing DST rules at the last minute, per my experience). Even the US change in, was it 2005?, which was being widely announced a long time before it was effective, ended up surprising all sorts of devices that were clever enough to know about DST, but not clever enough to know it occasionally changes. Even public transportation buses in San Francisco, if my memory serves right, ended up displaying the wrong time for a week.

So if you can't rely on vendor-provided updates, I suspect you really want to make sure you have an update vector yourself, because these things do change, and people underestimate how often they do.
Merged mainstream repo commits
Update: This library still only reads the source format, not the binary format. But now, if linked against curl (https://curl.haxx.se/libcurl/) which is available for Windows and comes pre-installed on Linux and OS X, the library can be configured to update to the latest database automatically. Fully documented here: |
I just did a comparative survey of the compiled tzdata files on macOS using Google's cctz library. I find that 63% of the OS-supplied timezones have errors in either offset, abbreviation, or both when queried with timestamps outside the range of years [1900, 2037]. On a positive note, 37% of the OS-supplied timezones were free of errors. The errors correlated with those timezones which have offset transitions outside the range [1900, 2037]. I have a report that suggests these same errors exist on iOS. By reading the text IANA files, this library is immune to these errors.
Any data from before 1970 is "best effort" and can contain errors. Are you saying that the data compiler is making mistakes? Was that zic? |
I realize that data from before 1970 is "best effort" and can contain errors. But that's not a good rationale for introducing further errors. This library accurately reports all of the data in …

At this time, I do not know for sure if the errors reported to me via cctz's use of my OS's zic-compiled data files are the result of errors in cctz, or errors in zic. I suspect the latter. I also suspect that said errors are fixed in zic, but require either an update or configuration change that Apple has not done. The range of accuracy of the compiled data is suspiciously close to the range of a signed 32-bit count of seconds from 1970. Emphasis: these are all guesses on my part, except for the fact that I've detected the errors. The example that brought this issue to my attention was:
I'd be really surprised if the mistake is in zic: it's created by the same people who maintain the database itself. More than likely, the error is in the decoder tool. I had a similar bug report sent to me on QTimeZone: we failed to parse the name of a timezone (Asia/Barnaul) after tzdata2016d. It's also probably restricted to 32-bit integers, so it simply can't represent dates before 1902. Other tools do work:
Thanks for the …
OS X El Capitan 10.11.6. Yep, …
Looks like zdump on a Mac shows the LMT date at the (time_t)INT_MIN point. Why? I don't know... I've confirmed that the zdump binary on my Mac has both 32- and 64-bit architectures inside the fat binary. The previous output I had pasted came from a 64-bit build on Linux. Looks like they're different sources too:
https://opensource.apple.com/source/system_cmds/system_cmds-496/zdump.tproj/zdump.c |
I understand Apple replacing GNU tools that are under the GPLv3, but why this one? http://www.iana.org/time-zones has a more up-to-date version licensed under the BSD licence... |
Hmm.... time to file a Radar? http://bugreport.apple.com |
@vlovich: Be my guest. :-) My user-experience with filing bug reports with Apple has not been very good. |
I briefly explored that option, but there is at least one issue: the Windows registry does not use the Olson time zone names, but rather e.g. "Central Europe Standard Time".
See CLDR file supplemental/windowsZones.xml |
RE: "If this feature gets coded up, it will be an alternative, not a replacement of the way things are done today. There will be a config flag to choose where you get your tzdb from."

We are fairly new to date time programming, and Howard's libraries are meant to be used by a wide audience, so the learning curve should be a consideration. We are very grateful that the tzdata files are human readable. With low effort, we could understand and reason about them. It was helpful to see that some historical timezone rules were put in inferentially due to incomplete records. Inline comment threads seem to detail unimplemented exceptions and dubious rules, which would come in handy if a customer asks us "why is the date time math wrong over here and nowhere else?". Having the ability to add or modify a rule on short notice is a nice-to-have. We looked at other timezone rule specs as part of our learning process:
We learned more from the human readable tzdata files than all the other sources combined. So, a +1 for keeping the loading of text versions of tzdata as the default behavior. I can see the viewpoint that many would benefit from binary file support, and so would also be very happy with a toggle to read from binary tzdata.

RE: "For Windows, don't parse the IANA TZDB. Windows has its own database in the registry. A serious application would open "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Time Zones". See http://msdn.microsoft.com/en-gb/library/windows/desktop/ms724253%28v=vs.85%29.aspx"

Thiago's point on Windows timezone support is an interesting one and we'll be thinking about it. My belief (no hard facts yet) is that most of our customers store their date time values in databases, and most databases that support timezone regions seem to adhere to the IANA database. So even on Windows, I think our users would predominantly need tzdata support. I'll be contacting customer support to see if we have significant user date time data specified with Windows timezones.
The binary data is the same as the source text, only in a more easily parseable format. Sure, the comments would be missing, but if you need to investigate why something is not what you expected, you can look at the source. For quick information, you can use zdump.

Let me also point out that you don't want to modify the tzdata files in any way. Those are updated by IANA several times a year, so if you made modifications, you'd very quickly diverge from the authoritative source. And besides, what reason would you have for having local changes to your understanding of something that is global?
As I said, the CLDR has a table mapping Windows names to IANA names. But I do agree that the quality of the Windows TZDB is much lower than the IANA DB's. My point was that a serious application could opt not to ship the 3 MB of binary tzdata, as Windows already has most of the information. I believe size considerations would be very prominent for anyone who wants tz.cpp, as opposed to using a full framework like ICU.
This isn't strictly true on macOS, because Apple is using 8-year-old source files to compile an up-to-date text database into the binary database. I also don't see the leap second information in the macOS binary database. They've recently started shipping the version information, which is a positive step. Also, it is more difficult, but not impossible, to retrieve the timezone name from the binary files and provide the ability to iterate over the set of timezones. And there is no distinction between links and time zones in the binary format (which is probably not important for most customers).

On Windows one doesn't have the IANA database at all, so it must be installed (maybe that will change in the future). Either that, or one is forced to admit that time zone names and definitions are not portable across platforms. The only thing this library uses the CLDR for on Windows is to implement …

Fwiw, the text form of the IANA database is about 945 KB and covers an infinite temporal range.

Should this library be standardized, almost certainly this issue will become moot, as the std::lib implementor will supply the database, and whether it is binary or text will become an implementation detail. One of the biggest issues for standardization is whether or not to support …
Given that an application can run for more than ~1.7 months without being restarted, there needs to be a way to refresh the database. Either the implementation does it behind the scenes and thread-safely, or it offers a thread-safe API to do so at the app's discretion. |
I was wondering if it wouldn't be better to provide a lazy …
I've seen that position expressed on the committee (by a single person), but I don't know at this point if that position has widespread support. The status quo (the C API) doesn't support this feature in practice: tz database updates typically require a reboot.
The C API does support it, though it's thread-unsafe: the tzset() function. Granted, it's POSIX, not ISO C, but so is localtime_r, which is required in a multithreaded application anyway.
This API allows lazy initialization. This implementation takes a hybrid approach which is optimized for the text form of the database. It parses the entire database, but omits a post-parse analysis of the data for each time zone until it is actually used. It was experimentally determined that this post-parse analysis was the most expensive step in the parse. This characteristic is actually controllable at the moment with an undocumented flag |
@HowardHinnant: don't listen to these "I have a better tool" guys. They had 50 years to make it right, and it is still pathetic. This library is plain simply awesome; you've properly solved the problem we couldn't solve since the very beginning, namely how to convert from one arbitrary TZ to another. I remember dealing with this problem multiple times. I expect it will lead to propagation of those IANA database copies until the OS guys give up and implement it properly (and uniformly) across all platforms that still live.
Having worked with Howard for a few years now, I can say from experience that this assumption is correct. Everything that Howard produces is well thought out and high quality. |
This has been open for over a year, and there's no fix on the horizon. Closing as "not going to fix." Feel free to re-open this issue if you see value in doing so. |
This is now an option with this commit: a610f08. See https://howardhinnant.github.io/date/tz.html for …
Kudos to Aaron Bishop for driving this effort. |
Adds the necessary files to build Date as part of RTC
I know you're writing only as an example, but given your notoriety, your source code might actually get used elsewhere.
Please parse the compiled form of the database, after zic is done with it. In doing that, you should also set the default DB path to "/usr/share/zoneinfo", which should work just about anywhere that has a /usr directory. This should also eliminate the need to set a minimum year to be kept in memory, as the database is already pre-compiled.