-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3829: [Python] add __arrow_array__ protocol to support third-party array classes in conversion to Arrow #5106
Conversation
…rty array classes in conversion to Arrow
python/pyarrow/tests/test_pandas.py
Outdated
result = pa.array(df['a'].values, type=pa.float64()) | ||
assert result.equals(expected2) | ||
|
||
del pd.arrays.IntegerArray.__arrow_array__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could go in a try-finally to avoid a failure here possibly contaminating other tests.
python/pyarrow/array.pxi
Outdated
@@ -161,7 +161,9 @@ def array(object obj, type=None, mask=None, size=None, from_pandas=None, | |||
else: | |||
c_from_pandas = from_pandas | |||
|
|||
if _is_array_like(obj): | |||
if hasattr(obj, '__arrow_array__'): | |||
return obj.__arrow_array__(type=type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to check for a return value of NotImplemented
and fall back to the default path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Why would you define __arrow_array__
just to return NotImplemented
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think if you don't want to implement this, the user should simply not define __arrow_array__
.
But we should probably add a check to ensure that was is returned from __arrow_array__
is actually a pyarrow Array (that is something that numpy also does).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: if the Pandas series or dataframe column is backed by an Arrow array, does this allow pa.array()
to extract the array without copying?
python/pyarrow/array.pxi
Outdated
@@ -161,7 +161,9 @@ def array(object obj, type=None, mask=None, size=None, from_pandas=None, | |||
else: | |||
c_from_pandas = from_pandas | |||
|
|||
if _is_array_like(obj): | |||
if hasattr(obj, '__arrow_array__'): | |||
return obj.__arrow_array__(type=type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we raise if non-default size
and mask
arguments are passed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that sounds as a good idea.
python/pyarrow/array.pxi
Outdated
@@ -178,7 +180,9 @@ def array(object obj, type=None, mask=None, size=None, from_pandas=None, | |||
mask = values.mask | |||
values = values.data | |||
|
|||
if pandas_api.is_categorical(values): | |||
if hasattr(values, '__arrow_array__'): | |||
return values.__arrow_array__(type=type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: raise if size
or mask
is given?
Yes, that would be the idea (also |
e12ea25
to
e2b10c4
Compare
Updated the PR and added some docs for it. |
docs/source/python/extending2.rst
Outdated
.. _extending: | ||
|
||
Extending pyarrow | ||
================= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that this file can also contain the documentation about creating your own ExtensionType
(documentation which is missing at the moment), so therefore I think extending.rst
would be a good name.
However, we already have extending.rst
which is about using the pyarrow C++ / cython APIs. Anybody an idea for another name for this file? ("extending2" is not meant to keep :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe extending_types.rst
? Or user_defined_types.rst
, since ultimately __arrow_array__
, ExtensionType
, and (eventually) custom Arrow-to-Pandas conversions are all about user-defined types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, renamed to extending_types.rst
(the title can still include "user defined types" once we add documentation about that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original document could also be renamed extending_cpp.rst
or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jorisvandenbossche for taking this up. LGTM from my side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one small issue.
Looks good @jorisvandenbossche! LGTM. I'm looking to add support for
However the returned type would then not be |
@rok at the moment, the protocol is specifically meant to convert to a So I would personally keep it on that, for now. The question is also what would be the exact purpose for extending it? (but I am not very familiar with the tensor part of the library / the message spec). |
@jorisvandenbossche agreed with keeping this as is. |
The main problem is that currently sparse data does not fit in the Arrow tabular format, so in general DataFrames with such sparse columns, can't be converted to Arrow tables. As you mention, the target could be For the Tensors, I don't think there is currently a "catch all" constructor like you have for |
Thanks @jorisvandenbossche, I'll look into that direction and here's an issue I opened for it. |
python/pyarrow/array.pxi
Outdated
"converted with the __arrow_array__ protocol.") | ||
res = obj.__arrow_array__(type=type) | ||
if not isinstance(res, Array): | ||
raise ValueError("The object's __arrow_array__ method does not " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TypeError
, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numpy returns a ValueError in this case, but yes, a TypeError seems more appropriate
Codecov Report
@@ Coverage Diff @@
## master #5106 +/- ##
===========================================
- Coverage 87.62% 65.26% -22.36%
===========================================
Files 1014 497 -517
Lines 145908 67514 -78394
Branches 1437 0 -1437
===========================================
- Hits 127857 44066 -83791
- Misses 17689 23448 +5759
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
https://issues.apache.org/jira/browse/ARROW-3829 & https://issues.apache.org/jira/browse/ARROW-5271. And as illustration for the mailing list discussion (will post there in a bit).