ARROW-3829: [Python] add __arrow_array__ protocol to support third-party array classes in conversion to Arrow #5106

jorisvandenbossche · 2019-08-16T14:09:51Z

https://issues.apache.org/jira/browse/ARROW-3829 & https://issues.apache.org/jira/browse/ARROW-5271. This also serves as an illustration for the mailing list discussion (I will post there in a bit).

ghost · 2019-08-16T19:29:06Z

python/pyarrow/tests/test_pandas.py

+    result = pa.array(df['a'].values, type=pa.float64())
+    assert result.equals(expected2)
+
+    del pd.arrays.IntegerArray.__arrow_array__


Maybe this could go in a try-finally to avoid a failure here possibly contaminating other tests.
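
A minimal sketch of the try-finally pattern suggested here, assuming nothing beyond the standard library. `FakeIntegerArray`, `_arrow_array` and `run_with_temporary_protocol` are hypothetical names used only for illustration; the actual test patches `pd.arrays.IntegerArray` and builds a real pyarrow array.

```python
class FakeIntegerArray:
    """Hypothetical stand-in for pd.arrays.IntegerArray (no pandas needed)."""


def _arrow_array(self, type=None):
    # Placeholder conversion; the real test would build a pyarrow array here.
    return "converted"


def run_with_temporary_protocol(check):
    # Attach the protocol method only for the duration of the check.
    FakeIntegerArray.__arrow_array__ = _arrow_array
    try:
        return check(FakeIntegerArray())
    finally:
        # Runs even if `check` raises, so later tests see an unpatched class.
        del FakeIntegerArray.__arrow_array__
```

With this shape, a failing assertion inside `check` still removes the patched attribute before the exception propagates.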

ghost · 2019-08-16T19:30:21Z

python/pyarrow/array.pxi

@@ -161,7 +161,9 @@ def array(object obj, type=None, mask=None, size=None, from_pandas=None,
    else:
        c_from_pandas = from_pandas

-    if _is_array_like(obj):
+    if hasattr(obj, '__arrow_array__'):
+        return obj.__arrow_array__(type=type)


Would it make sense to check for a return value of NotImplemented and fall back to the default path?

I don't think so. Why would you define __arrow_array__ just to return NotImplemented?

Yes, I think if you don't want to implement this, you should simply not define __arrow_array__.

But we should probably add a check to ensure that what is returned from __arrow_array__ is actually a pyarrow Array (that is something that numpy also does).

pitrou

Question: if the Pandas series or dataframe column is backed by an Arrow array, does this allow pa.array() to extract the array without copying?

pitrou · 2019-08-19T10:04:09Z

python/pyarrow/array.pxi

@@ -161,7 +161,9 @@ def array(object obj, type=None, mask=None, size=None, from_pandas=None,
    else:
        c_from_pandas = from_pandas

-    if _is_array_like(obj):
+    if hasattr(obj, '__arrow_array__'):
+        return obj.__arrow_array__(type=type)


Should we raise if non-default size and mask arguments are passed here?

Yes, that sounds like a good idea.

pitrou · 2019-08-19T10:04:53Z

python/pyarrow/array.pxi

@@ -178,7 +180,9 @@ def array(object obj, type=None, mask=None, size=None, from_pandas=None,
                mask = values.mask
                values = values.data

-        if pandas_api.is_categorical(values):
+        if hasattr(values, '__arrow_array__'):
+            return values.__arrow_array__(type=type)


Same here: raise if size or mask is given?

python/pyarrow/tests/test_pandas.py

jorisvandenbossche · 2019-08-19T10:36:26Z

Question: if the Pandas series or dataframe column is backed by an Arrow array, does this allow pa.array() to extract the array without copying?

Yes, that would be the idea (pa.array(numpy_array) likewise avoids copying the data where possible). In that case __arrow_array__ should return the backing Arrow array (which is what fletcher could do).
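
To illustrate the no-copy case, here is a rough sketch using stand-in classes rather than pyarrow itself. `BackingArray`, `ArrowBackedColumn` and `to_arrow` are hypothetical: the first plays the role of a pyarrow.Array, the second a fletcher-like container, and the third a simplified version of the dispatch step inside pa.array().

```python
class BackingArray:
    """Hypothetical stand-in for a pyarrow.Array that owns the buffers."""

    def __init__(self, values):
        self.values = values


class ArrowBackedColumn:
    """Hypothetical container (in the spirit of fletcher) wrapping an Arrow array."""

    def __init__(self, backing):
        self._backing = backing

    def __arrow_array__(self, type=None):
        # Hand back the wrapped array itself, so no buffers are copied.
        # A real implementation would also need to honour a requested `type`.
        return self._backing


def to_arrow(obj, type=None):
    # Simplified stand-in for the protocol dispatch in pa.array().
    if hasattr(obj, "__arrow_array__"):
        return obj.__arrow_array__(type=type)
    raise TypeError("object does not implement __arrow_array__")
```

Because the method returns the object it already holds, the caller receives the very same array instance, which is what makes the conversion copy-free in this sketch.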

jorisvandenbossche · 2019-08-20T09:47:45Z

Updated the PR and added some docs for it.

jorisvandenbossche · 2019-08-20T09:49:51Z

docs/source/python/extending2.rst

+.. _extending:
+
+Extending pyarrow
+=================


The idea is that this file can also hold the documentation about creating your own ExtensionType (which is missing at the moment), so I think extending.rst would be a good name.
However, we already have extending.rst, which covers using the pyarrow C++ / Cython APIs. Does anybody have an idea for another name for this file? ("extending2" is not meant to stay :))

Maybe extending_types.rst? Or user_defined_types.rst, since ultimately __arrow_array__, ExtensionType, and (eventually) custom Arrow-to-Pandas conversions are all about user-defined types.

Thanks, renamed to extending_types.rst (the title can still include "user defined types" once we add documentation about that)

The original document could also be renamed extending_cpp.rst or something.

xhochy

Thanks @jorisvandenbossche for taking this up. LGTM from my side.

pitrou

LGTM, just one small issue.

python/pyarrow/tests/test_pandas.py

rok · 2019-08-21T16:03:55Z

Looks good @jorisvandenbossche! LGTM.

I'm looking to add support for pd.SparseArray, so setting __arrow_array__ to something like:

def __arrow_array__(self, type=None):
    return pa.SparseTensorCOO.from_numpy(data=self.sp_values, coords=self.sp_index.indices.reshape(1,-1), shape=(self.shape))

However, the returned type would then not be a pa.Array. Do you think it would make sense for the __arrow_array__ interface to support cases where the returned type is not a pa.Array?

jorisvandenbossche · 2019-08-21T16:28:17Z

@rok at the moment, the protocol is specifically meant to convert to a pyarrow.Array (it is also only used internally in the pa.array(..) code).

So I would personally keep it at that, for now. The question is also what the exact purpose of extending it would be (but I am not very familiar with the tensor part of the library / the message spec).

rok · 2019-08-21T16:53:22Z

@jorisvandenbossche agreed with keeping this as is.
My motivation is enabling conversion of sparse columns (pd.SparseArray) contained in an arbitrary pd.DataFrame to pyarrow type (probably pa.SparseTensorCOO).
I'm not sure what the best place to do this would be.

jorisvandenbossche · 2019-08-22T17:58:59Z

My motivation is enabling conversion of sparse columns (pd.SparseArray) contained in an arbitrary pd.DataFrame to pyarrow type (probably pa.SparseTensorCOO).

The main problem is that sparse data currently does not fit in the Arrow tabular format, so in general DataFrames with such sparse columns can't be converted to Arrow tables. As you mention, the target could be pa.SparseTensorCOO, but currently the main target of this protocol is the recordbatch message format, and not the (sparse) tensor message format.

For the Tensors, I don't think there is currently a "catch all" constructor like you have for pa.array? That might be a start if you want to enable conversion of general objects to (sparse) tensors. But I would propose to open a new issue for that if you want to continue the discussion.

rok · 2019-08-22T20:12:02Z

Thanks @jorisvandenbossche, I'll look into that direction and here's an issue I opened for it.

pitrou · 2019-08-27T13:48:50Z

python/pyarrow/array.pxi

+            "converted with the __arrow_array__ protocol.")
+    res = obj.__arrow_array__(type=type)
+    if not isinstance(res, Array):
+        raise ValueError("The object's __arrow_array__ method does not "


TypeError, no?

NumPy raises a ValueError in this case, but yes, a TypeError seems more appropriate.
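
For readers following the review, the two checks discussed in this thread (rejecting mask/size, and validating the return type with a TypeError) might look roughly like the sketch below. It is an assumption-laden illustration and not the code in array.pxi: `Array` is a stand-in class for pyarrow.Array, `handle_arrow_array_protocol` is a hypothetical helper name, and the error messages are illustrative wording.

```python
class Array:
    """Hypothetical stand-in for pyarrow.Array."""


def handle_arrow_array_protocol(obj, type=None, mask=None, size=None):
    # mask and size cannot be forwarded to the object's own conversion,
    # so passing them is rejected instead of being silently ignored.
    if mask is not None or size is not None:
        raise ValueError(
            "mask and size are not supported for objects converted "
            "through __arrow_array__")
    res = obj.__arrow_array__(type=type)
    # A wrong return type is a type problem, hence TypeError.
    if not isinstance(res, Array):
        raise TypeError("__arrow_array__ must return an Array")
    return res
```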

codecov-io · 2019-08-27T20:16:00Z

Codecov Report

Merging #5106 into master will decrease coverage by 22.35%.
The diff coverage is 85.71%.

@@             Coverage Diff             @@
##           master    #5106       +/-   ##
===========================================
- Coverage   87.62%   65.26%   -22.36%     
===========================================
  Files        1014      497      -517     
  Lines      145908    67514    -78394     
  Branches     1437        0     -1437     
===========================================
- Hits       127857    44066    -83791     
- Misses      17689    23448     +5759     
+ Partials      362        0      -362

Impacted Files	Coverage Δ
python/pyarrow/tests/test_pandas.py	`95.26% <100%> (+0.08%)`	⬆️
python/pyarrow/pandas-shim.pxi	`64.35% <70%> (-0.91%)`	⬇️
python/pyarrow/tests/test_array.py	`95.65% <71.42%> (-0.35%)`	⬇️
python/pyarrow/array.pxi	`80.12% <92.3%> (+0.43%)`	⬆️
cpp/src/arrow/util/memory.h	`0% <0%> (-100%)`	⬇️
cpp/src/gandiva/date_utils.h	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/util/memory.cc	`0% <0%> (-100%)`	⬇️
cpp/src/gandiva/decimal_type_util.h	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/filesystem/util_internal.cc	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/compute/logical_type.h	`0% <0%> (-100%)`	⬇️
... and 796 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a40d6b6...bab01f1.

ARROW-3829: [Python] add __arrow_array__ protocol to support third-pa…

c82eb88

…rty array classes in conversion to Arrow

ghost reviewed Aug 16, 2019

ghost approved these changes Aug 16, 2019

pitrou reviewed Aug 19, 2019

jorisvandenbossche added 2 commits August 20, 2019 10:15

compat for older pandas versions

198d699

add validation of additional keywords and return value

e2b10c4

jorisvandenbossche force-pushed the ARROW-3829-array-protocol branch from e12ea25 to e2b10c4 Compare August 20, 2019 08:37

add docs

4861154

jorisvandenbossche commented Aug 20, 2019

ghost approved these changes Aug 20, 2019

xhochy approved these changes Aug 20, 2019

rename to extending_types.rst

8ac304a

pitrou reviewed Aug 20, 2019

use try ... finally

8e70995

pitrou reviewed Aug 27, 2019

ValueError -> TypeError

bab01f1

wesm closed this in 38401a1 Aug 27, 2019

jorisvandenbossche deleted the ARROW-3829-array-protocol branch September 10, 2019 15:35

This was referenced Sep 10, 2019

ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow pandas-dev/pandas#28368

Merged

Serialization / Deserialization of ExtensionArrays pandas-dev/pandas#20612

Open

asfimport mentioned this pull request Sep 24, 2019

[Python] Support protocols to extract Arrow objects from third-party classes #20293

Closed
