
feat(python): Add Arrow->Python datetime support #417

Merged
paleolimbot merged 24 commits into apache:main from python-convert-datetime on Apr 11, 2024

Conversation

@paleolimbot (Member) commented Apr 5, 2024

This PR adds support for converting Arrow date, time, timestamp, and duration arrays to Python objects.

import pyarrow as pa
import datetime
import zoneinfo
import nanoarrow as na

dt = datetime.datetime.now()
list(na.Array(pa.array([dt])).iter_py())
#> [datetime.datetime(2024, 4, 8, 16, 25, 41, 216438)]

dt_tz = datetime.datetime.now(zoneinfo.ZoneInfo("America/Halifax"))
list(na.Array(pa.array([dt_tz])).iter_py())
#> [datetime.datetime(2024, 4, 8, 16, 29, 7, 226832, tzinfo=zoneinfo.ZoneInfo(key='America/Halifax'))]

tdelta = datetime.timedelta(123, 456, 678)
list(na.Array(pa.array([tdelta])).iter_py())
#> [datetime.timedelta(days=123, seconds=456, microseconds=678)]

just_time = datetime.time(15, 27, 43, 12)
list(na.Array(pa.array([just_time])).iter_py())
#> [datetime.time(15, 27, 43, 12)]

It would probably be faster to use the datetime C API, but the timings seem reasonable:

import pyarrow as pa
import datetime
import zoneinfo
import nanoarrow as na

n = int(1e6)

dt = datetime.datetime.now()
dt_array = pa.array([dt + datetime.timedelta(i) for i in range(n)])
%timeit dt_array.to_pylist()
%timeit list(na.Array(dt_array).iter_py())
#> 805 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#> 804 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

tdelta_array = pa.array([datetime.timedelta(123 + i, 456, 678) for i in range(n)])
%timeit tdelta_array.to_pylist()
#> 574 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(na.Array(tdelta_array).iter_py())
#> 399 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

just_time_array = pa.array([datetime.time(15, 27, 43, i) for i in range(n)])
%timeit just_time_array.to_pylist()
#> 831 ms ± 6.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(na.Array(just_time_array).iter_py())
#> 399 ms ± 856 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

@eddelbuettel (Contributor) commented:

(Micro-nit: Missing second r in Arrow in Subject)

@paleolimbot paleolimbot changed the title feat(python): Add Arow->Python datetime support feat(python): Add Arrow->Python datetime support Apr 6, 2024
@paleolimbot paleolimbot marked this pull request as ready for review April 8, 2024 20:14
@jorisvandenbossche (Member) left a comment

Nice! Added a bunch of comments

For the zoneinfo vs dateutil, we could also bump the minimum Python version from 3.8 to 3.9, and then we can always assume zoneinfo is available (Python 3.8 is almost end-of-life, see https://devguide.python.org/versions/).

For timezones, there is one aspect not covered by this PR. The Arrow spec also allows fixed offsets of the form "+XX:XX" or "-XX:XX". Those can be handled by creating a timedelta and constructing a datetime.timezone() from it, I think (see https://docs.python.org/3/library/datetime.html#datetime.timezone).
Fixed offsets are not very useful, so it's fine to not (yet) handle them, but maybe add a TODO comment about it, and we should maybe also raise a better error message in _get_tzinfo for this case.
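A minimal sketch of handling such a fixed offset, using a hypothetical standalone helper rather than the PR's actual _get_tzinfo resolver:

import re
from datetime import timedelta, timezone

def _fixed_offset_tzinfo(tz_string):
    # Parse "+HH:MM" / "-HH:MM" into a datetime.timezone (illustrative only)
    match = re.fullmatch(r"([+-])(\d{2}):(\d{2})", tz_string)
    if match is None:
        raise ValueError(f"Unrecognized timezone string: '{tz_string}'")
    sign = -1 if match.group(1) == "-" else 1
    hours, minutes = int(match.group(2)), int(match.group(3))
    return timezone(sign * timedelta(hours=hours, minutes=minutes))

print(_fixed_offset_tzinfo("+05:30"))
#> UTC+05:30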

if item is None:
    yield item
else:
    yield epoch + timedelta(item)
Member

datetime.date also has a fromtimestamp method that accepts seconds since epoch, so you could also use that, which seems to be slightly faster:

In [53]: item = 10957

In [54]: epoch = datetime.date(1970, 1, 1)

In [55]: epoch + datetime.timedelta(item)
Out[55]: datetime.date(2000, 1, 1)

In [56]: datetime.date.fromtimestamp(60*60*24*item)
Out[56]: datetime.date(2000, 1, 1)

In [57]: %timeit epoch + datetime.timedelta(item)
164 ns ± 0.569 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [58]: %timeit datetime.date.fromtimestamp(60*60*24*item)
118 ns ± 0.0869 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Member

And the same for DATE64, if converting the milliseconds to seconds first.
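For illustration only (not the PR's code), since DATE64 stores milliseconds since the epoch:

from datetime import date

ms_since_epoch = 946_684_800_000  # DATE64 storage: milliseconds since the epoch
print(date.fromtimestamp(ms_since_epoch // 1_000))  # note: interpreted in local time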

Member Author

Nice! I do see in the Python documentation (https://docs.python.org/3/library/datetime.html#datetime.date.fromtimestamp):

It’s common for this to be restricted to years from 1970 through 2038

...which is ominous (if up-to-date, which it might not be anymore).

Member

Hmm, good point. But this is true for datetime.datetime.fromtimestamp as well, though...

Member Author

I changed this to avoid fromtimestamp() for now in both cases (even though it's quite a bit slower) and added some dates that will make a future test fail if this is actually a problem. Optimizing these conversions is probably a good project for some future (less busy) time.

python/src/nanoarrow/iterator.py (outdated review thread, resolved)
Comment on lines 305 to 309
s = parent // scale
us = parent % scale * (1_000_000 // scale)
yield fromtimestamp(s, tz_fromtimestamp).replace(
    microsecond=us, tzinfo=tz
)
Member

I think fromtimestamp should work with fractional seconds?

In [63]: import time

In [64]: time.time()
Out[64]: 1712740039.6300623

In [65]: datetime.datetime.fromtimestamp(time.time())
Out[65]: datetime.datetime(2024, 4, 10, 11, 7, 34, 904224)

So in that case could we do a normal division and just pass that to fromtimestamp?

Member Author

I think there is a precision issue that kicks in with floating point representations of timestamps. I changed this to use an epoch plus the existing duration iterator, although this is quite a bit slower than using fromtimestamp(). Eventually this is a job for C or C++.
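As a rough sketch, the epoch-plus-timedelta approach described above looks something like this (illustrative names, not the PR's actual iterator code); it stays in integer arithmetic rather than routing the value through a float timestamp:

from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

def timestamp_us_to_datetime(storage_us):
    # storage_us: integer microseconds since the epoch
    return epoch + timedelta(microseconds=storage_us)

print(timestamp_us_to_datetime(1_712_740_039_630_062))
#> 2024-04-10 09:07:19.630062+00:00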

elif unit == "us":
    scale = 1_000_000
elif unit == "ns":
    storage = _scale_and_round_maybe_none(storage, 0.001)
Member

Should we silently discard the nanoseconds? An alternative could be to raise an error if the nanoseconds are not zero, or warn.

Member Author

The iterator will now emit a LossyConversionWarning when this happens (although maybe eventually we want to make this quieter).
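A hedged sketch of the kind of check involved (the PR's actual warning class, LossyConversionWarning, and its helpers may be shaped differently):

import warnings

def ns_to_us(value_ns):
    # Scale nanoseconds to microseconds, warning when precision would be lost
    us, remainder_ns = divmod(value_ns, 1_000)
    if remainder_ns != 0:
        warnings.warn("nanoseconds discarded converting to microsecond precision")
    return us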

Comment on lines +317 to +318
if unit == "s":
    to_us = 1_000_000
Member

For a unit of seconds, it's probably a bit more efficient to pass the seconds directly to timedelta(seconds=..) rather than first converting to microseconds, because the timedelta constructor will otherwise convert the microseconds back to seconds internally.

Don't know if that is worth it though, because of course the current logic makes the loop a bit simpler
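For example (illustrative only), the two constructions are equivalent, but the seconds form skips the intermediate multiplication and the constructor's internal normalization:

from datetime import timedelta

value = 123_456  # storage value for a duration with unit "s"
assert timedelta(seconds=value) == timedelta(microseconds=value * 1_000_000)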

Member Author

I think there is a lot that could be optimized here...this pass is mostly for completeness/correctness. Probably this is a job for C or C++ and the Python C API, where we can do some of these things efficiently.

Member

I think there is a lot that could be optimized here...this pass is mostly for completeness/correctness. Probably this is a job for C or C++ and the Python C API, where we can do some of these things efficiently.

FWIW, I think it is also nice that this is just in Python (and it's still faster than pyarrow to_pylist anyway). But it's true the bigger gain will probably be found in moving this to C(ython). At least with numpy as a baseline, this specific duration iteration can be improved roughly 10x: for 1M elements, this PR: 340 ms, pyarrow: 480 ms, this PR passing seconds directly: 280 ms, numpy: 30 ms.

Comment on lines 273 to 274
days, hours, mins, secs, us = item
yield time(hours, mins, secs, us)
Member

days is just ignored here. Should we assert that it is 0? (the Arrow spec strictly speaking says that the value should never exceed the number of seconds/milliseconds/.. of 1 day, so for valid data we can be sure to not have a day here)

Member Author

I bit the bullet and added a warning system that warns in the case that days != 0 (although it should probably be a bit smarter and emit fewer warnings at some point).
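A minimal sketch of that check, assuming a hypothetical helper (the PR's actual code and warning class may differ):

import warnings
from datetime import time

def to_time(days, hours, mins, secs, us):
    # Per the Arrow spec, a valid time value never spans a full day,
    # so a nonzero days component indicates lossy/invalid input.
    if days != 0:
        warnings.warn("time value out of range for datetime.time; days component dropped")
    return time(hours, mins, secs, us)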

python/src/nanoarrow/iterator.py (outdated review thread, resolved)
if tz_string.upper() == "UTC":
    try:
        # Added in Python 3.11
        from datetime import UTC
Member

This is an alias of datetime.timezone.utc (lower case), which is available in older versions as well, so this could be simplified to use that.
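That is, the same object is reachable on all supported Python versions (timezone.utc has existed since Python 3.2; datetime.UTC is a 3.11+ alias for it):

from datetime import timezone

tz = timezone.utc  # equivalent to datetime.UTC on Python 3.11+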

python/src/nanoarrow/iterator.py (resolved review thread)
python/src/nanoarrow/iterator.py (outdated review thread, resolved)
@paleolimbot (Member Author) commented:

Thank you for the detailed look!

For the zoneinfo vs dateutil, we could also bump the minimum Python version from 3.8 to 3.9

I am not sure that zoneinfo is available on Emscripten/Pyodide (a brief check suggested that dateutil via micropip works but zoneinfo does not).

For timezones, there is one aspect not covered by this PR. The Arrow spec also allows fixed offsets of the form "+XX:XX" or "-XX:XX".

Good catch! This wasn't too bad to stick into the existing timezone resolver so I added it + a test case!

from datetime import timedelta, timezone

# We can handle UTC without any imports
if re.search(r"^utc$", tz_string, re.IGNORECASE):
Member

Is there a reason you moved to this more complex check compared to tz_string.upper() == "UTC"? Are there cases we would miss?

Member Author

I changed it back! It was a paper-thin theory that maybe it was less confusing to use re since it was used a few lines later, too.

Comment on lines 314 to 318
for item in parent:
    if item is None:
        yield None
    else:
        yield (epoch + item).replace(tzinfo=None)
Member

Suggested change
- for item in parent:
-     if item is None:
-         yield None
-     else:
-         yield (epoch + item).replace(tzinfo=None)
+ epoch = epoch.replace(tzinfo=None)
+ for item in parent:
+     if item is None:
+         yield None
+     else:
+         yield epoch + item

No need to do the replace each time inside the for loop I think?


@paleolimbot paleolimbot merged commit 1751bdd into apache:main Apr 11, 2024
11 checks passed
@paleolimbot paleolimbot deleted the python-convert-datetime branch April 11, 2024 17:32
@paleolimbot paleolimbot added this to the nanoarrow 0.5.0 milestone May 22, 2024