Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bias st.datetimes() towards bug-revealing values such as DST transitions and other oddities #69

Open
tweyter opened this issue May 2, 2015 · 11 comments
Labels
enhancement it's not broken, but we want it to be better

Comments

@tweyter
Copy link

tweyter commented May 2, 2015

pytz.tzinfo.localize will raise a NonExistentTimeError or AmbiguousTimeError exception if it can't resolve the current local time due to the change to/from daylight savings time. This is the source for numerous bugs in software dealing with datetimes in Python. A strategy that selects for these error causing times would help improve the quality of Hypothesis-Datetime.

@tweyter
Copy link
Author

tweyter commented May 2, 2015

As an aside, here's a great write-up about errors when dealing with time.
http://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time
Not everything affects python datetime, but it's informative, nonetheless.

@DRMacIver
Copy link
Member

Forgot to say at the time: Thanks for bringing this up. As per discussion elsewhere, it'll be a little while before I get to this, but I'm definitely going to look into things along these lines.

@tweyter
Copy link
Author

tweyter commented May 4, 2015

You're welcome. I'll try to contribute some code when I get the chance, too. I was just looking at the leap-second issue this morning, and it looks fairly easy to implement. A github version of the time zone database and code can be found here: https://github.com/eggert/tz
It's a GNU/Linux script for implementing all of the time zone issues (what Pytz uses) and it's a bit beyond me, but the list of leap seconds is here: https://github.com/eggert/tz/blob/master/leap-seconds.list
I think all one would have to do is inherit datetime.DatetimeStrategy and replace "def produce_template(self, context, pv):"
Something like this:
base = dt.datetime(1900, 1, 1, 0, 0) # Set TAI epoch
leap_second_time = base + datetime.timedelta(seconds=3550089600) # For the leap-second on July 1, 2012
Then just randomly add or subtract 1-2 seconds and add random microseconds to supply random data that falls around that leap-second.

@DRMacIver DRMacIver added enhancement it's not broken, but we want it to be better help-wanted labels Sep 25, 2015
@Zac-HD Zac-HD self-assigned this May 19, 2017
@Zac-HD
Copy link
Member

Zac-HD commented May 19, 2017

It should be possible to read this list from pytz, which I will try to add to #621 get around to eventually. A more interesting approach would be to draw the timezone first, and then prefer datetimes around the DST transition too...

@Zac-HD Zac-HD removed their assignment Jun 21, 2017
@Zac-HD
Copy link
Member

Zac-HD commented Jun 13, 2018

@pganssle (dateutil maintainer) and I met at Pylondinium and designed a general way to generate nasty datetimes, which works with any source of timezones (tzinfo objects) and I've since confirmed with @DRMacIver that it should work.

Given two naive datetimes as bounds - defaulting to the representable limit if not passed explicitly - and a strategy for Optional[tzinfo], we need to draw a datetime:

The current algorithm:

  • Draw a naive datetime, by drawing each component in order from largest to smallest. This shrinks in the obvious way (taking care that the year goes to 2000 and the months and days to 1 rather than 0). This is what I mean by "draw a naive datetime" below.
  • Then, draw a timezone and add it to the naive value. Retry up to three times before rejecting the draw.

Proposed, "nasty datetimes", algorithm:

  • Draw a timezone T and a naive datetime D.
  • If the timezone is None return D; if UTC return D.astimezone(T) (henceforth D0), with the usual special handling for pytz.
  • If D0 is already "nasty" - at a leap-second, DST transition, imaginary or ambigious time, etc. - return it here. Paul will help write the oracle function for this.
  • Draw a boolean (biased with tuning frequency to be determined), and if False return D0 to ensure shrinking works sensibly and the distribution of nastiness is useful.
  • Now we're in the interesting bit. Draw another datetime and add T; we'll call this D1. Use a second oracle to check if there's a nasty value between D0 and D1 (comparing e.g. offset to UTC, is_dst status, tz_name, checking for leap seconds, etc).
  • If there's nothing nasty here, return D0 (for the last time).
  • If there is something nasty, draw D2 between the naive parts of D0 and D1. Continue this until Dn is nasty, and return that. (This is basically a binary search, with a tradeoff between expected number of draws and the flexibility to discard all but one draw when shrinking.)

Note that this should work with any source of timezones, any bounds the user might choose, and have small overhead in any case where there are no nasty datetimes to find (and 'reasonable' overhead when it does).

If you want to work on this issue, you're entirely welcome - but please contact me first so we avoid can overlapping work or comms and you can ask questions as you go.

@pganssle
Copy link
Contributor

From some minor thinking about it, these are the classes of "nasty" time zones that I think we should try to find for each time zone:

  1. Transitions between STD->DST
  2. Transitions between DST->STD
  3. Changes of base offset
  4. Changes of tzname without a corresponding change in DST
  5. Changes in DST without a corresponding tzname
  6. Changes in DST without a corresponding change to utcoffset
  7. Zones with negative DST (e.g. Europe/Ireland, Namibia during certain periods)
  8. Multiple DST changes in a single year (e.g. Morroco)
  9. Times with offsets that are a non-integer number of minutes or seconds (often occurs pre-1910ish)
  10. Changes in UTC offset by more than 1 hour
  11. Changes in UTC offset that happen at exactly midnight
  12. Changes in UTC offset that result in ambiguous datetimes or missing datetimes outside of 0:00-04:00
  13. Possible special case of 10, 11, and 12: Entire missing or doubled days.

When writing tests for this kind of thing, I tend to hand-select zones that match these edge cases preferably in both the Northern and Southern hemisphere, though I've never found any where this has bitten me.

@Zac-HD
Copy link
Member

Zac-HD commented Jun 5, 2019

Based on a proof-of-concept Paul wrote at PyCon: 27df358. Now we "just" need to implement the search functionality I described last year!

More broadly, I'm thinking that there are actually two distinct things we might want to generate here:

Datetimes that are "weird" in isolation; i.e. exhibit an unusual property listed on the Weirdness enum. We can search toward ambiguous or imaginary times based on dst/utcoffset changes between two endpoints, but for most there's nothing to do beyond recognise it when we see one - which we can use to bail out of the above search early, so it's not totally useless.

Transitions between datetimes. I'm not sure how to to take advantage of this when generating single datetimes, but it would be nice if there was a way to do so... I think this is properly a follow-up issue once we get the first stage working.

@Zac-HD
Copy link
Member

Zac-HD commented Oct 12, 2019

More notes: anyone thinking about leap seconds should go look at @aarchiba's work in nanograv/PINT#549, which includes some lovely new strategies as well as thoughtful consideration of "leap smear".

Unfortunately we can't represent leap seconds with Python's datetime.datetime() type, and this won't be fixed. It would still be useful to make adjacent times more likely though, as the number of TAI seconds elapsed across the interval won't match the number of Unix seconds - but Python effectively does leap-smear via NTP updates, so it's unlikely to be detectable from Python.

pganssle added a commit to pganssle/stdlib-property-tests that referenced this issue Jan 30, 2020
The documented contract of this function is that it is the inverse of
`datetime.isoformat()`.

See GH HypothesisWorks/hypothesis#1401, HypothesisWorks/hypothesis#69
and to a lesser extent HypothesisWorks/hypothesis#2273 for background on
why I have set max_examples so high.
pganssle added a commit to pganssle/stdlib-property-tests that referenced this issue Jan 30, 2020
The documented contract of this function is that it is the inverse of
`datetime.isoformat()`.

See GH HypothesisWorks/hypothesis#1401, HypothesisWorks/hypothesis#69
and to a lesser extent HypothesisWorks/hypothesis#2273 for background on
why I have set max_examples so high.
@auvipy
Copy link

auvipy commented Jun 2, 2022

what about using zoneinfo?

@Zac-HD
Copy link
Member

Zac-HD commented Jun 3, 2022

It's got better handling of the .fold attribute (fixing ambiguous datetimes), but all the other cases are down to the underlying very complicated semantics of timezones themselves rather than which library we use to represent them. See above for the proposed algorigthm and targets.

@Zac-HD Zac-HD changed the title Datetime and Pytz: NonExistentTimeError, AmbiguousTimeError Bias st.datetimes() towards bug-revealing values such as DST transitions and other oddities Jun 3, 2022
@Zac-HD
Copy link
Member

Zac-HD commented Oct 12, 2024

After spending some more time looking into leap seconds, I now think that handling them at all is so rare in Python that biasing towards them is unlikely to be a net improvement in bug-finding power outside of literally astronomical timekeeping code.

If anyone working on that is interested in picking it up, here's some code to get the UTC datetimes of each leap second; you could then sample one and add a random diff from (0, +/- <+1s, +/- 12h). I'd sketched plans to integrate that into st.datetimes(), including transparent shrinking etc., but as above decided against.

diff --git a/hypothesis-python/setup.py b/hypothesis-python/setup.py
index 14c24b9ef..8c96d9dd8 100644
--- a/hypothesis-python/setup.py
+++ b/hypothesis-python/setup.py
@@ -82,7 +82,13 @@ setuptools.setup(
     author_email="david@drmaciver.com",
     packages=setuptools.find_packages(SOURCE),
     package_dir={"": SOURCE},
-    package_data={"hypothesis": ["py.typed", "vendor/tlds-alpha-by-domain.txt"]},
+    package_data={"hypothesis": ["py.typed", "vendor/leap-seconds.list", "vendor/tlds-alpha-by-domain.txt"]},
     url="https://hypothesis.works",
     project_urls={
         "Source": "https://github.com/HypothesisWorks/hypothesis/tree/master/hypothesis-python",
diff --git a/hypothesis-python/src/hypothesis/strategies/_internal/datetime.py b/hypothesis-python/src/hypothesis/strategies/_internal/datetime.py
index 581b3ac3c..5710fabdc 100644
--- a/hypothesis-python/src/hypothesis/strategies/_internal/datetime.py
+++ b/hypothesis-python/src/hypothesis/strategies/_internal/datetime.py
@@ -34,6 +34,18 @@ def is_pytz_timezone(tz):
     return module == "pytz" or module.startswith("pytz.")
 
 
+@lru_cache(maxsize=1)
+def get_leap_seconds() -> tuple[dt.datetime, ...]:
+    """Return a list of UTC datetimes corresponding to each leap second."""
+    traversable = resources.files("hypothesis.vendor") / "tlds-alpha-by-domain.txt"
+    epoch = dt.datetime(1900, 1, 1, tzinfo=dt.timezone.utc)
+    return tuple(
+        epoch + dt.timedelta(seconds=int(line.split()[0]))
+        for line in traversable.read_text(encoding="utf-8").splitlines()
+        if not line.startswith("#")
+    )
+
+
 def replace_tzinfo(value, timezone):
     if is_pytz_timezone(timezone):
         # Pytz timezones are a little complicated, and using the .replace method
diff --git a/tooling/src/hypothesistooling/__main__.py b/tooling/src/hypothesistooling/__main__.py
index 6eb938510..d07914c85 100644
--- a/tooling/src/hypothesistooling/__main__.py
+++ b/tooling/src/hypothesistooling/__main__.py
@@ -363,6 +363,10 @@ def update_vendored_files():
     if fname.read_bytes().splitlines()[1:] != new.splitlines()[1:]:
         fname.write_bytes(new)
 
+    url = "https://hpiers.obspm.fr/iers/bul/bulc/ntp/leap-seconds.list"
+    (vendor / url.split("/")[-1]).write_bytes(requests.get(url).content)
+
     # Always require the most recent version of tzdata - we don't need to worry about
     # pre-releases because tzdata is a 'latest data' package  (unlike pyodide-build).
     # Our crosshair extra is research-grade, so we require latest versions there too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement it's not broken, but we want it to be better
Projects
None yet
Development

No branches or pull requests

5 participants