
feat(python): Add Arrow->Python datetime support #417

Merged
paleolimbot merged 24 commits into apache:main from python-convert-datetime on Apr 11, 2024

Conversation

@paleolimbot (Member) commented Apr 5, 2024

This PR adds support for converting Arrow date, time, timestamp, and duration arrays to Python objects.

import pyarrow as pa
import datetime
import zoneinfo
import nanoarrow as na

dt = datetime.datetime.now()
list(na.Array(pa.array([dt])).iter_py())
#> [datetime.datetime(2024, 4, 8, 16, 25, 41, 216438)]

dt_tz = datetime.datetime.now(zoneinfo.ZoneInfo("America/Halifax"))
list(na.Array(pa.array([dt_tz])).iter_py())
#> [datetime.datetime(2024, 4, 8, 16, 29, 7, 226832, tzinfo=zoneinfo.ZoneInfo(key='America/Halifax'))]

tdelta = datetime.timedelta(123, 456, 678)
list(na.Array(pa.array([tdelta])).iter_py())
#> [datetime.timedelta(days=123, seconds=456, microseconds=678)]

just_time = datetime.time(15, 27, 43, 12)
list(na.Array(pa.array([just_time])).iter_py())
#> [datetime.time(15, 27, 43, 12)]

It would probably be faster to use the datetime C API, but the timings seem reasonable:

import pyarrow as pa
import datetime
import zoneinfo
import nanoarrow as na

n = int(1e6)

dt = datetime.datetime.now()
dt_array = pa.array([dt + datetime.timedelta(i) for i in range(n)])
%timeit dt_array.to_pylist()
%timeit list(na.Array(dt_array).iter_py())
#> 805 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#> 804 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

tdelta_array = pa.array([datetime.timedelta(123 + i, 456, 678) for i in range(n)])
%timeit tdelta_array.to_pylist()
#> 574 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(na.Array(tdelta_array).iter_py())
#> 399 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

just_time_array = pa.array([datetime.time(15, 27, 43, i) for i in range(n)])
%timeit just_time_array.to_pylist()
#> 831 ms ± 6.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(na.Array(just_time_array).iter_py())
#> 399 ms ± 856 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

@eddelbuettel (Contributor) commented:

(Micro-nit: Missing second r in Arrow in Subject)

@paleolimbot paleolimbot changed the title feat(python): Add Arow->Python datetime support feat(python): Add Arrow->Python datetime support Apr 6, 2024
@paleolimbot paleolimbot marked this pull request as ready for review April 8, 2024 20:14
@jorisvandenbossche (Member) left a comment

Nice! Added a bunch of comments

For the zoneinfo vs dateutil, we could also bump the minimum Python version from 3.8 to 3.9, and then we can always assume zoneinfo is available (Python 3.8 is almost end-of-life, see https://devguide.python.org/versions/).

For timezones, there is one aspect not covered by this PR. The Arrow spec also allows fixed offsets of the form "+XX:XX" or "-XX:XX". Those can be handled by creating a timedelta and constructing a datetime.timezone() from it, I think (see https://docs.python.org/3/library/datetime.html#datetime.timezone).
Fixed offsets are not very useful, so it's fine to not (yet) handle them, but maybe add a TODO comment about it, and we should maybe also raise a better error message in _get_tzinfo for this case.
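A minimal sketch of handling such a fixed offset, using a hypothetical standalone helper rather than the PR's actual _get_tzinfo resolver:

import re
from datetime import timedelta, timezone

def _fixed_offset_tzinfo(tz_string):
    # Parse "+HH:MM" / "-HH:MM" into a datetime.timezone (illustrative only)
    match = re.fullmatch(r"([+-])(\d{2}):(\d{2})", tz_string)
    if match is None:
        raise ValueError(f"Unrecognized timezone string: '{tz_string}'")
    sign = -1 if match.group(1) == "-" else 1
    hours, minutes = int(match.group(2)), int(match.group(3))
    return timezone(sign * timedelta(hours=hours, minutes=minutes))

print(_fixed_offset_tzinfo("+05:30"))
#> UTC+05:30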

if item is None:
    yield item
else:
    yield epoch + timedelta(item)
Member

datetime.date also has a fromtimestamp method that accepts seconds since epoch, so you could also use that, which seems to be slightly faster:

In [53]: item = 10957

In [54]: epoch = datetime.date(1970, 1, 1)

In [55]: epoch + datetime.timedelta(item)
Out[55]: datetime.date(2000, 1, 1)

In [56]: datetime.date.fromtimestamp(60*60*24*item)
Out[56]: datetime.date(2000, 1, 1)

In [57]: %timeit epoch + datetime.timedelta(item)
164 ns ± 0.569 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [58]: %timeit datetime.date.fromtimestamp(60*60*24*item)
118 ns ± 0.0869 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Member

And the same for DATE64, if converting the milliseconds to seconds first.
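For illustration only (not the PR's code), since DATE64 stores milliseconds since the epoch:

from datetime import date

ms_since_epoch = 946_684_800_000  # DATE64 storage: milliseconds since the epoch
print(date.fromtimestamp(ms_since_epoch // 1_000))  # note: interpreted in local time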

Member Author

Nice! I do see in the Python documentation (https://docs.python.org/3/library/datetime.html#datetime.date.fromtimestamp):

It’s common for this to be restricted to years from 1970 through 2038

...which is ominous (if up-to-date, which it might not be anymore).

Member

Hmm, good point. But this is true for datetime.datetime.fromtimestamp as well, though...

Member Author

I changed this to avoid fromtimestamp() for now in both cases (even though it's quite a bit slower) and added some dates that will make a future test fail if this is actually a problem. Optimizing these conversions is probably a good project for some future (less busy) time.

python/src/nanoarrow/iterator.py (outdated review thread, resolved)
Comment on lines 305 to 309
s = parent // scale
us = parent % scale * (1_000_000 // scale)
yield fromtimestamp(s, tz_fromtimestamp).replace(
    microsecond=us, tzinfo=tz
)
Member

I think fromtimestamp should work with fractional seconds?

In [63]: import time

In [64]: time.time()
Out[64]: 1712740039.6300623

In [65]: datetime.datetime.fromtimestamp(time.time())
Out[65]: datetime.datetime(2024, 4, 10, 11, 7, 34, 904224)

So in that case could we do a normal division and just pass that to fromtimestamp?

Member Author

I think there is a precision issue that kicks in with floating point representations of timestamps. I changed this to use an epoch plus the existing duration iterator, although this is quite a bit slower than using fromtimestamp(). Eventually this is a job for C or C++.
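As a rough sketch, the epoch-plus-timedelta approach described above looks something like this (illustrative names, not the PR's actual iterator code); it stays in integer arithmetic rather than routing the value through a float timestamp:

from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

def timestamp_us_to_datetime(storage_us):
    # storage_us: integer microseconds since the epoch
    return epoch + timedelta(microseconds=storage_us)

print(timestamp_us_to_datetime(1_712_740_039_630_062))
#> 2024-04-10 09:07:19.630062+00:00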

elif unit == "us":
    scale = 1_000_000
elif unit == "ns":
    storage = _scale_and_round_maybe_none(storage, 0.001)
Member

Should we silently discard the nanoseconds? An alternative could be to raise an error if the nanoseconds are not zero, or warn.

Member Author

The iterator will now emit a LossyConversionWarning when this happens (although maybe eventually we want to make this quieter).
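A hedged sketch of the kind of check involved (the PR's actual warning class, LossyConversionWarning, and its helpers may be shaped differently):

import warnings

def ns_to_us(value_ns):
    # Scale nanoseconds to microseconds, warning when precision would be lost
    us, remainder_ns = divmod(value_ns, 1_000)
    if remainder_ns != 0:
        warnings.warn("nanoseconds discarded converting to microsecond precision")
    return us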

Comment on lines +317 to +318
if unit == "s":
    to_us = 1_000_000
Member

For a unit of seconds, it's probably a bit more efficient to pass the seconds directly to timedelta(seconds=..) rather than first converting to microseconds, because the timedelta constructor will otherwise convert the microseconds back to seconds internally.

Don't know if that is worth it though, because of course the current logic makes the loop a bit simpler
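For example (illustrative only), the two constructions are equivalent, but the seconds form skips the intermediate multiplication and the constructor's internal normalization:

from datetime import timedelta

value = 123_456  # storage value for a duration with unit "s"
assert timedelta(seconds=value) == timedelta(microseconds=value * 1_000_000)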

Member Author

I think there is a lot that could be optimized here...this pass is mostly for completeness/correctness. Probably this is a job for C or C++ and the Python C API, where we can do some of these things efficiently.

Member

I think there is a lot that could be optimized here...this pass is mostly for completeness/correctness. Probably this is a job for C or C++ and the Python C API, where we can do some of these things efficiently.

FWIW, I think it is also nice that this is just in Python (and it's still faster than pyarrow to_pylist anyway). But it's true the bigger gain will probably be found in moving this to C(ython). At least with numpy as a baseline, this specific duration iteration can be improved roughly 10x: for 1M elements, this PR: 340 ms, pyarrow: 480 ms, this PR passing seconds directly: 280 ms, numpy: 30 ms.

Comment on lines 273 to 274
days, hours, mins, secs, us = item
yield time(hours, mins, secs, us)
Member

days is just ignored here. Should we assert that it is 0? (the Arrow spec strictly speaking says that the value should never exceed the number of seconds/milliseconds/.. of 1 day, so for valid data we can be sure to not have a day here)

Member Author

I bit the bullet and added a warning system that warns in the case that days != 0 (although it should probably be a bit smarter and emit fewer warnings at some point).
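A minimal sketch of that check, assuming a hypothetical helper (the PR's actual code and warning class may differ):

import warnings
from datetime import time

def to_time(days, hours, mins, secs, us):
    # Per the Arrow spec, a valid time value never spans a full day,
    # so a nonzero days component indicates lossy/invalid input.
    if days != 0:
        warnings.warn("time value out of range for datetime.time; days component dropped")
    return time(hours, mins, secs, us)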

python/src/nanoarrow/iterator.py (outdated review thread, resolved)
if tz_string.upper() == "UTC":
    try:
        # Added in Python 3.11
        from datetime import UTC
Member

This is an alias of datetime.timezone.utc (lower case), which is available in older versions as well, so this could be simplified to use that.
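That is, the same object is reachable on all supported Python versions (timezone.utc has existed since Python 3.2; datetime.UTC is a 3.11+ alias for it):

from datetime import timezone

tz = timezone.utc  # equivalent to datetime.UTC on Python 3.11+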

python/src/nanoarrow/iterator.py (resolved review thread)
python/src/nanoarrow/iterator.py (outdated review thread, resolved)
@paleolimbot (Member Author) commented:

Thank you for the detailed look!

For the zoneinfo vs dateutil, we could also bump the minimum Python version from 3.8 to 3.9

I am not sure that zoneinfo is available on Emscripten/Pyodide (a brief check suggested that dateutil via micropip works but zoneinfo does not).

For timezones, there is one aspect not covered by this PR. The Arrow spec also allows fixed offsets of the form "+XX:XX" or "-XX:XX".

Good catch! This wasn't too bad to stick into the existing timezone resolver so I added it + a test case!

from datetime import timedelta, timezone

# We can handle UTC without any imports
if re.search(r"^utc$", tz_string, re.IGNORECASE):
Member

Is there a reason you moved to this more complex check compared to tz_string.upper() == "UTC"? Are there cases we would miss?

Member Author

I changed it back! It was a paper-thin theory that maybe it was less confusing to use re since it was used a few lines later, too.

Comment on lines 314 to 318
for item in parent:
    if item is None:
        yield None
    else:
        yield (epoch + item).replace(tzinfo=None)
Member

Suggested change
- for item in parent:
-     if item is None:
-         yield None
-     else:
-         yield (epoch + item).replace(tzinfo=None)
+ epoch = epoch.replace(tzinfo=None)
+ for item in parent:
+     if item is None:
+         yield None
+     else:
+         yield epoch + item

No need to do the replace each time inside the for loop I think?


@paleolimbot paleolimbot merged commit 1751bdd into apache:main Apr 11, 2024
11 checks passed
@paleolimbot paleolimbot deleted the python-convert-datetime branch April 11, 2024 17:32
@paleolimbot paleolimbot added this to the nanoarrow 0.5.0 milestone May 22, 2024