Bias `st.datetimes()` towards bug-revealing values such as DST transitions and other oddities #69

tweyter · 2015-05-02T08:46:49Z

pytz.tzinfo.localize will raise a NonExistentTimeError or AmbiguousTimeError exception if it can't resolve the current local time due to the change to/from daylight savings time. This is the source for numerous bugs in software dealing with datetimes in Python. A strategy that selects for these error causing times would help improve the quality of Hypothesis-Datetime.

tweyter · 2015-05-02T13:22:26Z

As an aside, here's a great write-up about errors when dealing with time.
http://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time
Not everything affects python datetime, but it's informative, nonetheless.

DRMacIver · 2015-05-04T07:28:50Z

Forgot to say at the time: Thanks for bringing this up. As per discussion elsewhere, it'll be a little while before I get to this, but I'm definitely going to look into things along these lines.

tweyter · 2015-05-04T09:37:43Z

You're welcome. I'll try to contribute some code when I get the chance, too. I was just looking at the leap-second issue this morning, and it looks fairly easy to implement. A github version of the time zone database and code can be found here: https://github.com/eggert/tz
It's a GNU/Linux script for implementing all of the time zone issues (what Pytz uses) and it's a bit beyond me, but the list of leap seconds is here: https://github.com/eggert/tz/blob/master/leap-seconds.list
I think all one would have to do is inherit datetime.DatetimeStrategy and replace "def produce_template(self, context, pv):"
Something like this:
base = dt.datetime(1900, 1, 1, 0, 0) # Set TAI epoch
leap_second_time = base + datetime.timedelta(seconds=3550089600) # For the leap-second on July 1, 2012
Then just randomly add or subtract 1-2 seconds and add random microseconds to supply random data that falls around that leap-second.

Zac-HD · 2017-05-19T08:03:16Z

It should be possible to read this list from pytz, which I will try to ~~add to #621~~ get around to eventually. A more interesting approach would be to draw the timezone first, and then prefer datetimes around the DST transition too...

Zac-HD · 2018-06-13T16:57:59Z

@pganssle (dateutil maintainer) and I met at Pylondinium and designed a general way to generate nasty datetimes, which works with any source of timezones (tzinfo objects) and I've since confirmed with @DRMacIver that it should work.

Given two naive datetimes as bounds - defaulting to the representable limit if not passed explicitly - and a strategy for Optional[tzinfo], we need to draw a datetime:

The current algorithm:

Draw a naive datetime, by drawing each component in order from largest to smallest. This shrinks in the obvious way (taking care that the year goes to 2000 and the months and days to 1 rather than 0). This is what I mean by "draw a naive datetime" below.
Then, draw a timezone and add it to the naive value. Retry up to three times before rejecting the draw.

Proposed, "nasty datetimes", algorithm:

Draw a timezone T and a naive datetime D.
If the timezone is None return D; if UTC return D.astimezone(T) (henceforth D0), with the usual special handling for pytz.
If D0 is already "nasty" - at a leap-second, DST transition, imaginary or ambigious time, etc. - return it here. Paul will help write the oracle function for this.
Draw a boolean (biased with tuning frequency to be determined), and if False return D0 to ensure shrinking works sensibly and the distribution of nastiness is useful.
Now we're in the interesting bit. Draw another datetime and add T; we'll call this D1. Use a second oracle to check if there's a nasty value between D0 and D1 (comparing e.g. offset to UTC, is_dst status, tz_name, checking for leap seconds, etc).
If there's nothing nasty here, return D0 (for the last time).
If there is something nasty, draw D2 between the naive parts of D0 and D1. Continue this until Dn is nasty, and return that. (This is basically a binary search, with a tradeoff between expected number of draws and the flexibility to discard all but one draw when shrinking.)

Note that this should work with any source of timezones, any bounds the user might choose, and have small overhead in any case where there are no nasty datetimes to find (and 'reasonable' overhead when it does).

If you want to work on this issue, you're entirely welcome - but please contact me first so we avoid can overlapping work or comms and you can ask questions as you go.

pganssle · 2018-06-18T17:18:29Z

From some minor thinking about it, these are the classes of "nasty" time zones that I think we should try to find for each time zone:

Transitions between STD->DST
Transitions between DST->STD
Changes of base offset
Changes of tzname without a corresponding change in DST
Changes in DST without a corresponding tzname
Changes in DST without a corresponding change to utcoffset
Zones with negative DST (e.g. Europe/Ireland, Namibia during certain periods)
Multiple DST changes in a single year (e.g. Morroco)
Times with offsets that are a non-integer number of minutes or seconds (often occurs pre-1910ish)
Changes in UTC offset by more than 1 hour
Changes in UTC offset that happen at exactly midnight
Changes in UTC offset that result in ambiguous datetimes or missing datetimes outside of 0:00-04:00
Possible special case of 10, 11, and 12: Entire missing or doubled days.

When writing tests for this kind of thing, I tend to hand-select zones that match these edge cases preferably in both the Northern and Southern hemisphere, though I've never found any where this has bitten me.

Zac-HD · 2019-06-05T14:38:12Z

Based on a proof-of-concept Paul wrote at PyCon: 27df358. Now we "just" need to implement the search functionality I described last year!

More broadly, I'm thinking that there are actually two distinct things we might want to generate here:

Datetimes that are "weird" in isolation; i.e. exhibit an unusual property listed on the Weirdness enum. We can search toward ambiguous or imaginary times based on dst/utcoffset changes between two endpoints, but for most there's nothing to do beyond recognise it when we see one - which we can use to bail out of the above search early, so it's not totally useless.

Transitions between datetimes. I'm not sure how to to take advantage of this when generating single datetimes, but it would be nice if there was a way to do so... I think this is properly a follow-up issue once we get the first stage working.

Zac-HD · 2019-10-12T01:21:31Z

More notes: anyone thinking about leap seconds should go look at @aarchiba's work in nanograv/PINT#549, which includes some lovely new strategies as well as thoughtful consideration of "leap smear".

Unfortunately we can't represent leap seconds with Python's datetime.datetime() type, and this won't be fixed. It would still be useful to make adjacent times more likely though, as the number of TAI seconds elapsed across the interval won't match the number of Unix seconds - but Python effectively does leap-smear via NTP updates, so it's unlikely to be detectable from Python.

The documented contract of this function is that it is the inverse of `datetime.isoformat()`. See GH HypothesisWorks/hypothesis#1401, HypothesisWorks/hypothesis#69 and to a lesser extent HypothesisWorks/hypothesis#2273 for background on why I have set max_examples so high.

auvipy · 2022-06-02T03:57:31Z

what about using zoneinfo?

Zac-HD · 2022-06-03T08:15:53Z

It's got better handling of the .fold attribute (fixing ambiguous datetimes), but all the other cases are down to the underlying very complicated semantics of timezones themselves rather than which library we use to represent them. See above for the proposed algorigthm and targets.

Zac-HD · 2024-10-12T21:08:03Z

After spending some more time looking into leap seconds, I now think that handling them at all is so rare in Python that biasing towards them is unlikely to be a net improvement in bug-finding power outside of literally astronomical timekeeping code.

If anyone working on that is interested in picking it up, here's some code to get the UTC datetimes of each leap second; you could then sample one and add a random diff from (0, +/- <+1s, +/- 12h). I'd sketched plans to integrate that into st.datetimes(), including transparent shrinking etc., but as above decided against.

diff --git a/hypothesis-python/setup.py b/hypothesis-python/setup.py
index 14c24b9ef..8c96d9dd8 100644
--- a/hypothesis-python/setup.py
+++ b/hypothesis-python/setup.py
@@ -82,7 +82,13 @@ setuptools.setup(
     author_email="david@drmaciver.com",
     packages=setuptools.find_packages(SOURCE),
     package_dir={"": SOURCE},
-    package_data={"hypothesis": ["py.typed", "vendor/tlds-alpha-by-domain.txt"]},
+    package_data={"hypothesis": ["py.typed", "vendor/leap-seconds.list", "vendor/tlds-alpha-by-domain.txt"]},
     url="https://hypothesis.works",
     project_urls={
         "Source": "https://github.com/HypothesisWorks/hypothesis/tree/master/hypothesis-python",
diff --git a/hypothesis-python/src/hypothesis/strategies/_internal/datetime.py b/hypothesis-python/src/hypothesis/strategies/_internal/datetime.py
index 581b3ac3c..5710fabdc 100644
--- a/hypothesis-python/src/hypothesis/strategies/_internal/datetime.py
+++ b/hypothesis-python/src/hypothesis/strategies/_internal/datetime.py
@@ -34,6 +34,18 @@ def is_pytz_timezone(tz):
     return module == "pytz" or module.startswith("pytz.")
 
 
+@lru_cache(maxsize=1)
+def get_leap_seconds() -> tuple[dt.datetime, ...]:
+    """Return a list of UTC datetimes corresponding to each leap second."""
+    traversable = resources.files("hypothesis.vendor") / "tlds-alpha-by-domain.txt"
+    epoch = dt.datetime(1900, 1, 1, tzinfo=dt.timezone.utc)
+    return tuple(
+        epoch + dt.timedelta(seconds=int(line.split()[0]))
+        for line in traversable.read_text(encoding="utf-8").splitlines()
+        if not line.startswith("#")
+    )
+
+
 def replace_tzinfo(value, timezone):
     if is_pytz_timezone(timezone):
         # Pytz timezones are a little complicated, and using the .replace method
diff --git a/tooling/src/hypothesistooling/__main__.py b/tooling/src/hypothesistooling/__main__.py
index 6eb938510..d07914c85 100644
--- a/tooling/src/hypothesistooling/__main__.py
+++ b/tooling/src/hypothesistooling/__main__.py
@@ -363,6 +363,10 @@ def update_vendored_files():
     if fname.read_bytes().splitlines()[1:] != new.splitlines()[1:]:
         fname.write_bytes(new)
 
+    url = "https://hpiers.obspm.fr/iers/bul/bulc/ntp/leap-seconds.list"
+    (vendor / url.split("/")[-1]).write_bytes(requests.get(url).content)
+
     # Always require the most recent version of tzdata - we don't need to worry about
     # pre-releases because tzdata is a 'latest data' package  (unlike pyodide-build).
     # Our crosshair extra is research-grade, so we require latest versions there too.

DRMacIver added enhancement it's not broken, but we want it to be better help-wanted labels Sep 25, 2015

Zac-HD self-assigned this May 19, 2017

Zac-HD removed their assignment Jun 21, 2017

Zac-HD removed the help wanted label Feb 20, 2018

pganssle mentioned this issue Dec 13, 2019

Folds and imaginary datetimes in the datetime strategy #2273

Closed

Zac-HD mentioned this issue Apr 13, 2020

Datetime improvements: generate fold=1 and consistent treatment of imaginary times #2392

Merged

Zac-HD mentioned this issue Apr 27, 2020

Improve structure of timezones() strategies #2414

Open

Zac-HD changed the title ~~Datetime and Pytz: NonExistentTimeError, AmbiguousTimeError~~ Bias st.datetimes() towards bug-revealing values such as DST transitions and other oddities Jun 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bias `st.datetimes()` towards bug-revealing values such as DST transitions and other oddities #69

Bias `st.datetimes()` towards bug-revealing values such as DST transitions and other oddities #69

tweyter commented May 2, 2015

tweyter commented May 2, 2015

DRMacIver commented May 4, 2015

tweyter commented May 4, 2015

Zac-HD commented May 19, 2017 •

edited

Loading

Zac-HD commented Jun 13, 2018

pganssle commented Jun 18, 2018

Zac-HD commented Jun 5, 2019

Zac-HD commented Oct 12, 2019 •

edited

Loading

auvipy commented Jun 2, 2022

Zac-HD commented Jun 3, 2022

Zac-HD commented Oct 12, 2024

Bias st.datetimes() towards bug-revealing values such as DST transitions and other oddities #69

Bias st.datetimes() towards bug-revealing values such as DST transitions and other oddities #69

Comments

tweyter commented May 2, 2015

tweyter commented May 2, 2015

DRMacIver commented May 4, 2015

tweyter commented May 4, 2015

Zac-HD commented May 19, 2017 • edited Loading

Zac-HD commented Jun 13, 2018

pganssle commented Jun 18, 2018

Zac-HD commented Jun 5, 2019

Zac-HD commented Oct 12, 2019 • edited Loading

auvipy commented Jun 2, 2022

Zac-HD commented Jun 3, 2022

Zac-HD commented Oct 12, 2024

Bias `st.datetimes()` towards bug-revealing values such as DST transitions and other oddities #69

Bias `st.datetimes()` towards bug-revealing values such as DST transitions and other oddities #69

Zac-HD commented May 19, 2017 •

edited

Loading

Zac-HD commented Oct 12, 2019 •

edited

Loading