feat(python): Add Arrow->Python datetime support #417
Conversation
(Micro-nit: missing second "r" in "Arrow" in the subject.)
Force-pushed 0489c1f to d351480
Nice! Added a bunch of comments.

For zoneinfo vs. dateutil, we could also bump the minimum Python version from 3.8 to 3.9, and then we can always assume zoneinfo is available (Python 3.8 is almost end-of-life, see https://devguide.python.org/versions/).

For timezones, there is one aspect not covered by this PR: the Arrow spec also allows fixed offsets of the form "+XX:XX" or "-XX:XX". Those can be handled by creating a timedelta and constructing a datetime.timezone() from it, I think (see https://docs.python.org/3/library/datetime.html#datetime.timezone). Fixed offsets are not very useful, so it's fine to not (yet) handle them, but maybe add a TODO comment about it, and we should maybe also raise a better error message in _get_tzinfo about it.
if item is None:
    yield item
else:
    yield epoch + timedelta(item)
datetime.date also has a fromtimestamp method that accepts seconds since the epoch, so you could also use that, which seems to be slightly faster:
In [53]: item = 10957
In [54]: epoch = datetime.date(1970, 1, 1)
In [55]: epoch + datetime.timedelta(item)
Out[55]: datetime.date(2000, 1, 1)
In [56]: datetime.date.fromtimestamp(60*60*24*item)
Out[56]: datetime.date(2000, 1, 1)
In [57]: %timeit epoch + datetime.timedelta(item)
164 ns ± 0.569 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [58]: %timeit datetime.date.fromtimestamp(60*60*24*item)
118 ns ± 0.0869 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
And the same for DATE64 if converting the milliseconds to seconds
Nice! I do see in the Python documentation (https://docs.python.org/3/library/datetime.html#datetime.date.fromtimestamp):

"It's common for this to be restricted to years from 1970 through 2038"

...which is ominous (if up to date, which it might not be anymore).
Hmm, good point. But this is true for datetime.datetime.fromtimestamp as well, though...
I changed this to avoid fromtimestamp() for now in both cases (even though it's quite a bit slower) and added some dates that will make a future test fail if this is actually a problem. Optimizing these conversions is probably a good project for some future (less busy) time.
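The epoch-plus-timedelta approach sidesteps the documented 1970-2038 restriction because it is plain integer date arithmetic with no platform `localtime()` involved. A quick sketch (dates chosen arbitrarily for illustration):

```python
from datetime import date, timedelta

epoch = date(1970, 1, 1)

# One date well past the 32-bit time_t horizon (2038) and one before 1970;
# neither is a problem for timedelta arithmetic, while fromtimestamp may be
# restricted on some platforms.
assert epoch + timedelta(days=36500) == date(2069, 12, 7)
assert epoch + timedelta(days=-3653) == date(1960, 1, 1)
```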
python/src/nanoarrow/iterator.py
Outdated
s = parent // scale
us = parent % scale * (1_000_000 // scale)
yield fromtimestamp(s, tz_fromtimestamp).replace(
    microsecond=us, tzinfo=tz
)
I think fromtimestamp should work with fractional seconds?
In [63]: import time
In [64]: time.time()
Out[64]: 1712740039.6300623
In [65]: datetime.datetime.fromtimestamp(time.time())
Out[65]: datetime.datetime(2024, 4, 10, 11, 7, 34, 904224)
So in that case could we do a normal division and just pass that to fromtimestamp?
I think there is a precision issue that kicks in with floating-point representations of timestamps. I changed this to use an epoch plus reusing the duration iterator, although this is quite a bit slower than using fromtimestamp(). Eventually this is a job for C or C++.
python/src/nanoarrow/iterator.py
Outdated
elif unit == "us":
    scale = 1_000_000
elif unit == "ns":
    storage = _scale_and_round_maybe_none(storage, 0.001)
Should we silently discard the nanoseconds? An alternative could be to raise an error if the nanoseconds are not zero, or warn.
The iterator will now emit a LossyConversionWarning when this happens (although maybe eventually we want to make this quieter).
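A minimal sketch of the warn-on-truncation idea. `LossyConversionWarning` here is a stand-in class defined locally; nanoarrow's actual class and the helper name `ns_to_us` are assumptions for illustration:

```python
import warnings


class LossyConversionWarning(UserWarning):
    """Stand-in for nanoarrow's warning class (name assumed)."""


def ns_to_us(values):
    # Truncate nanosecond values to microseconds, warning once per batch
    # if any nanoseconds were actually discarded.
    lossy = any(v is not None and v % 1_000 != 0 for v in values)
    if lossy:
        warnings.warn(
            "nanosecond precision lost during conversion", LossyConversionWarning
        )
    return [None if v is None else v // 1_000 for v in values]
```

Warning once per batch rather than per element keeps the output readable, which is roughly the "make this quieter" direction mentioned above.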
if unit == "s":
    to_us = 1_000_000
For a unit of seconds, it's probably a bit more efficient to pass the seconds directly to timedelta(seconds=..) than to first convert to microseconds, because the timedelta constructor will convert the microseconds back to seconds internally. Don't know if that is worth it, though, because of course the current logic keeps the loop a bit simpler.
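Both spellings produce identical timedeltas, since the constructor normalizes everything to a (days, seconds, microseconds) triple; passing seconds directly just skips one multiply/divide round trip per element:

```python
from datetime import timedelta

s = 1_234
# Equivalent results; timedelta(seconds=...) avoids converting to
# microseconds only to have the constructor divide back down again.
assert timedelta(seconds=s) == timedelta(microseconds=s * 1_000_000)
assert timedelta(seconds=s).seconds == 1_234
```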
I think there is a lot that could be optimized here... this pass is mostly for completeness/correctness. Probably this is a job for C or C++ and the Python C API, where we can do some of these things efficiently.
> I think there is a lot that could be optimized here... this pass is mostly for completeness/correctness. Probably this is a job for C or C++ and the Python C API, where we can do some of these things efficiently.

FWIW, I think it is also nice that this is just in Python (and it's still faster than pyarrow's to_pylist anyway). But it's true the bigger gain will probably be found in moving this to C(ython). At least if we use numpy as the baseline, this specific duration iteration can be improved 10x (1M elements: this PR: 340 ms, pyarrow: 480 ms, this PR but passing seconds directly: 280 ms, numpy: 30 ms).
python/src/nanoarrow/iterator.py
Outdated
days, hours, mins, secs, us = item
yield time(hours, mins, secs, us)
days is just ignored here. Should we assert that it is 0? (The Arrow spec strictly speaking says the value should never exceed the number of seconds/milliseconds/... in one day, so for valid data we can be sure not to have a nonzero days here.)
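As a concrete sketch of the check: split a time64("us") value (microseconds since midnight) with divmod and refuse a nonzero day component rather than silently dropping it. The function name and error message are invented for illustration:

```python
from datetime import time


def us_to_time(value):
    """Convert microseconds-since-midnight (Arrow time64("us")) to datetime.time.

    Illustrative sketch: valid Arrow data never reaches a full day, so we
    check rather than silently discard the overflow.
    """
    days, rem = divmod(value, 86_400_000_000)
    if days != 0:
        raise ValueError(f"time value exceeds one day by {days} day(s)")
    hours, rem = divmod(rem, 3_600_000_000)
    mins, rem = divmod(rem, 60_000_000)
    secs, us = divmod(rem, 1_000_000)
    return time(hours, mins, secs, us)
```

For example, 5_415_123_456 microseconds decodes to 01:30:15.123456, while exactly one day (86_400_000_000) raises.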
I bit the bullet and added a warning system that warns in the case that days != 0 (although it should probably be a bit smarter and emit fewer warnings at some point).
python/src/nanoarrow/iterator.py
Outdated
if tz_string.upper() == "UTC":
    try:
        # Added in Python 3.11
        from datetime import UTC
This is an alias of datetime.timezone.utc (lower case), which is available in older versions as well (timezone.utc has been there since Python 3.2), so we could simplify to use that.
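A quick check of the alias relationship; the try/except shows why the lowercase spelling needs no version gate:

```python
from datetime import timedelta, timezone

# datetime.timezone.utc has existed since Python 3.2; datetime.UTC (3.11+)
# is just an alias for the same singleton object.
try:
    from datetime import UTC  # only on Python 3.11+
    assert UTC is timezone.utc
except ImportError:
    pass  # older Python: timezone.utc still works

assert timezone.utc.utcoffset(None) == timedelta(0)
assert timezone.utc.tzname(None) == "UTC"
```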
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Thank you for the detailed look!
I am not sure that
Good catch! This wasn't too bad to stick into the existing timezone resolver, so I added it plus a test case!
python/src/nanoarrow/iterator.py
Outdated
from datetime import timedelta, timezone

# We can handle UTC without any imports
if re.search(r"^utc$", tz_string, re.IGNORECASE):
Is there a reason you moved to this more complex check compared to tz_string.upper() == "UTC"? Are there cases we would miss?
I changed it back! It was a paper-thin theory that maybe it was less confusing to use re since it was used a few lines later, too.
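For the record, the two checks agree on every realistic input; the only (unlikely) divergence is that `$` in a regex also matches just before a trailing newline:

```python
import re

# The simple comparison and the regex agree for ordinary timezone strings.
for tz_string in ["utc", "UTC", "uTc", "America/New_York", " utc", "utc "]:
    simple = tz_string.upper() == "UTC"
    regex = re.search(r"^utc$", tz_string, re.IGNORECASE) is not None
    assert simple == regex, tz_string

# The one subtle difference: "$" matches before a trailing newline,
# so the regex accepts "utc\n" while the string comparison does not.
assert re.search(r"^utc$", "utc\n", re.IGNORECASE) is not None
assert "utc\n".upper() != "UTC"
```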
python/src/nanoarrow/iterator.py
Outdated
for item in parent:
    if item is None:
        yield None
    else:
        yield (epoch + item).replace(tzinfo=None)
Suggested change:

-for item in parent:
-    if item is None:
-        yield None
-    else:
-        yield (epoch + item).replace(tzinfo=None)
+epoch = epoch.replace(tzinfo=None)
+for item in parent:
+    if item is None:
+        yield None
+    else:
+        yield epoch + item
No need to do the replace each time inside the for loop I think?
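Hoisting the replace out of the loop works because the epoch is the only tz-aware operand; once it is made naive, every sum is naive too. A self-contained sketch (the generator name is invented):

```python
from datetime import datetime, timedelta, timezone


def naive_datetimes(parent, epoch):
    # Strip the tzinfo once, up front, instead of calling
    # .replace(tzinfo=None) on every yielded element.
    epoch = epoch.replace(tzinfo=None)
    for item in parent:
        if item is None:
            yield None
        else:
            yield epoch + item


epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
out = list(naive_datetimes([timedelta(days=1), None], epoch))
```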
This PR adds support for converting Arrow date, time, timestamp, and duration arrays to Python objects.
It is probably faster to use the datetime C API, but the timings seem reasonable.
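For anyone reproducing the kind of comparison discussed in this thread, a minimal timing harness might look like the following. The element count is arbitrary and absolute numbers are machine-dependent; only the ratio between the two strategies is interesting:

```python
import timeit
from datetime import date, timedelta

# Compare the two date-conversion strategies from the review discussion.
days = list(range(20_000))  # stays within 1970-2024, avoiding time_t limits
epoch = date(1970, 1, 1)

t_timedelta = timeit.timeit(
    lambda: [epoch + timedelta(d) for d in days], number=1
)
t_fromtimestamp = timeit.timeit(
    lambda: [date.fromtimestamp(86_400 * d) for d in days], number=1
)
print(f"epoch + timedelta: {t_timedelta:.4f}s, fromtimestamp: {t_fromtimestamp:.4f}s")
```

Note that date.fromtimestamp interprets its argument in local time, which is another reason the epoch-plus-timedelta path is the safer default for timezone-naive data.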