
[Python] Add conversion to/from Pandas/Python for Month, Day Nano Interval Type #29432

Closed
asfimport opened this issue Aug 31, 2021 · 9 comments

asfimport commented Aug 31, 2021

Now that #10177 has been merged, we should support conversion to and from this type for the standard Python surface areas.

Reporter: Micah Kornfield / @emkornfield
Assignee: Micah Kornfield / @emkornfield

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-13806. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
Note that the existing interval types (Month and DayTime) are also not yet supported; there are not even basic bindings for the types / arrays. So I think a first step would be to add those (with simple conversion based on the raw values?).

For proper conversion to/from Python, the question is also what kind of value to use on the Python side. AFAIK there is not really a Python scalar from the standard library that represents such interval values (datetime.timedelta maps to our Duration type, I think). The dateutil package has a relativedelta object that can be used for this (but it's an external package, and not sure how widely used it is).

For numpy-based conversion, the Months unit could be represented by "timedelta64[M]", as both are a count of number of months (although not a zero-copy conversion, since in numpy it's always 64bit). But for DayTime and MonthDayNano, there is no equivalent (or maybe as a numpy record/struct array?).
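As a rough illustration of the two numpy options mentioned above, the sketch below shows a `timedelta64[M]` array (always 64-bit, so going to or from a 32-bit month count requires a copy) and a structured dtype for month/day/nanosecond triples. The field names `months`, `days`, and `nanoseconds` are assumptions chosen for this example, not an API defined in this thread.

```python
import numpy as np

# Month intervals as a plain count of months. numpy stores timedelta64
# values in 8 bytes, so this cannot share memory with 32-bit month counts.
months = np.array([1, 14, -3], dtype="timedelta64[M]")
month_counts = months.astype("int64").astype("int32")  # explicit copy

# Hypothetical struct layout for month/day/nano intervals:
# two 32-bit fields followed by one 64-bit field (16 bytes per element).
month_day_nano = np.dtype(
    [("months", "<i4"), ("days", "<i4"), ("nanoseconds", "<i8")]
)
intervals = np.array([(1, 15, 0), (0, -2, 500)], dtype=month_day_nano)

# Each field can be read back as an ordinary column.
day_column = intervals["days"]
```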

Micah Kornfield / @emkornfield:
@tswast suggested https://pandas.pydata.org/docs/reference/api/pandas.tseries.offsets.DateOffset.html as a possible type. I didn't look into how fields are stored in it yet. Open to suggestions; if no type really maps well, then a numpy struct seems like a reasonable default to me.

I'll try to tackle conversion of the existing types as well. After reviewing I'll try to make reasonable choices, but let me know if there are strong inclinations. For standard Python types my inclination is to map:

DayMillis to datetime.timedelta (according to the docs it stores days, seconds and microseconds as separate fields). Not sure about the reverse mapping, though.

For numpy-based conversion of months, timedelta64[M] sounds good to me.

For month day nanos, I think if DateOffset doesn't work for numpy, the struct type seems correct to me.  For python, I think maybe just a triple (namedtuple) in the arrow namespace might make sense.
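To make the concern about the reverse mapping concrete, here is a minimal sketch using only the standard library. The helper names are hypothetical. It shows that `datetime.timedelta` normalizes its fields, so a (days, milliseconds) pair does not always survive a round trip unchanged.

```python
from datetime import timedelta


def day_time_to_timedelta(days, milliseconds):
    """Build a timedelta from a (days, milliseconds) pair."""
    return timedelta(days=days, milliseconds=milliseconds)


def timedelta_to_day_time(delta):
    """Split a timedelta back into (days, milliseconds).

    Sub-millisecond precision is truncated in this sketch.
    """
    millis = delta.seconds * 1000 + delta.microseconds // 1000
    return delta.days, millis


# 25 hours expressed purely in milliseconds: timedelta folds the
# first 24 hours into its `days` field.
folded = timedelta_to_day_time(day_time_to_timedelta(0, 90_000_000))

# A pair whose millisecond part is under one day round-trips unchanged.
unchanged = timedelta_to_day_time(day_time_to_timedelta(2, 500))
```

Here `folded` comes back as one day plus one hour of milliseconds rather than the original pair, which is why the reverse mapping needs an explicit decision.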

Tim Swast / @tswast:
Regarding "Python" conversion: in the Python BigQuery client we decided to go with relativedelta for a similar conversion from this data type to a Python object, since dateutil is widely used (including by pandas). See googleapis/python-bigquery#840.

The package appears to be widely used and, from what I can tell from https://github.com/dateutil/dateutil, there are no additional transitive dependencies to worry about.

That said, a namedtuple or dict where the names match the arguments to relativedelta (months, days, microseconds) would be pretty easy to convert to a relativedelta if not.
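A minimal sketch of that idea, assuming a hypothetical `IntervalTuple` whose field names match the `relativedelta` keyword arguments listed above. The `dateutil` import is kept inside the helper so the rest of the example runs without the external package.

```python
from collections import namedtuple

# Hypothetical tuple type; the field names mirror relativedelta's
# `months`, `days` and `microseconds` keyword arguments.
IntervalTuple = namedtuple("IntervalTuple", ["months", "days", "microseconds"])


def to_relativedelta(interval):
    """Convert the tuple to a relativedelta (requires python-dateutil)."""
    from dateutil.relativedelta import relativedelta

    return relativedelta(**interval._asdict())


interval = IntervalTuple(months=14, days=3, microseconds=250)

# Because the names line up, the keyword arguments fall out directly.
kwargs = interval._asdict()
```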

Micah Kornfield / @emkornfield:
Pandas is an optional dependency of Arrow. Given that relativedelta only supports microsecond precision and the to_pylist methods on arrays don't support parameterization, I think I will go with a named tuple approach for pure Python and use DateOffset for pandas conversion. I'll see how hard it would be to optionally import relativedelta from Python to construct the array.

Micah Kornfield / @emkornfield:
So I've started implementation. The approach I'm going to go with:

  • Conversion to Pandas will use DateOffset (object dtype).

  • Conversion to Python will use a named tuple.

  • Inference from Python will detect DateOffset, relativedelta or the named tuple.

  • Conversion from Python (once the type is inferred or provided) will use duck typing that should support timedelta (timedelta is inferred today to be duration and that won't change), DateOffset, relativedelta and the named tuple used for export. It will ignore "absolute" fields in the latter two and also ignore leapdays in relativedelta.

    Please provide feedback if this is reasonable.
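The duck-typing step in the last bullet could look roughly like the sketch below. This is not the pyarrow implementation: `IntervalTriple`, `to_interval_triple`, and the unit table are assumptions made for illustration. It reads only plural, relative attribute names, so singular "absolute" fields such as `year` and the `leapdays` field are never consulted.

```python
from collections import namedtuple
from datetime import timedelta
from types import SimpleNamespace

# Hypothetical stand-in for the exported named tuple.
IntervalTriple = namedtuple("IntervalTriple", ["months", "days", "nanoseconds"])

# Sub-day units folded into nanoseconds. `weeks` is deliberately left out:
# relativedelta derives it from `days`, so reading both would double count.
_NANOS_PER_UNIT = {
    "hours": 3_600_000_000_000,
    "minutes": 60_000_000_000,
    "seconds": 1_000_000_000,
    "milliseconds": 1_000_000,
    "microseconds": 1_000,
    "nanoseconds": 1,
}


def _relative(obj, name):
    """Read a relative field, treating a missing or None value as 0."""
    return int(getattr(obj, name, 0) or 0)


def to_interval_triple(obj):
    """Duck-typed conversion of an interval-like object to a triple."""
    months = 12 * _relative(obj, "years") + _relative(obj, "months")
    days = _relative(obj, "days")
    nanos = sum(
        _relative(obj, unit) * factor for unit, factor in _NANOS_PER_UNIT.items()
    )
    return IntervalTriple(months, days, nanos)


# A plain timedelta exposes days, seconds and microseconds.
from_timedelta = to_interval_triple(timedelta(days=2, seconds=3, microseconds=4))

# A relativedelta-like object: `year` (absolute) and `leapdays` are ignored.
fake_relativedelta = SimpleNamespace(
    years=1, months=2, days=3, hours=1, leapdays=1, year=2021, month=None
)
from_relative = to_interval_triple(fake_relativedelta)
```

The `SimpleNamespace` object only mimics the attribute names of a `relativedelta`; behaviour with real `relativedelta` or `DateOffset` instances would need to be checked against those libraries.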

Joris Van den Bossche / @jorisvandenbossche:
@amol- I moved this back to the 6.0 milestone. It's of course no blocker, but as long as we have the intention to try to get it done for 6.0, I think we can keep the milestone on it.

Joris Van den Bossche / @jorisvandenbossche:
@emkornfield that sounds good to me.

Maybe one note that we could also do this in several PRs (eg already start with plain Python tuples conversion, so we have at least basic bindings for the interval type; cfr ARROW-14018)

Micah Kornfield / @emkornfield:
@jorisvandenbossche I posted a PR for the MonthDayNanos interval. I think it was large enough that I will do a separate PR for the other types (this PR contains a proposal for moving most of the logic to C++ and I didn't want to put too much in it; if this looks good, I think the other interval types probably won't be too bad).

Antoine Pitrou / @pitrou:
Issue resolved by pull request 11302
#11302
