ARROW-1673: [Python] Add support for numpy 'bool' type #1199

pcmoritz · 2017-10-13T23:08:15Z

This is currently a workaround until the Arrow tensor supports zero copy of byte-length booleans.

pcmoritz · 2017-10-13T23:26:50Z

cpp/src/arrow/type.h

@@ -328,7 +328,7 @@ class ARROW_EXPORT BooleanType : public FixedWidthType, public NoExtraMeta {
 Status Accept(TypeVisitor* visitor) const override;
 std::string ToString() const override;

- int bit_width() const override { return 1; }


pcmoritz · 2017-10-13

I'm not at all sure if this is the right fix; maybe we need a separate field for the width if the type is contained in a tensor? Standardizing around numpy for tensors seems the way to go.

wesm · 2017-10-13

Boolean in Arrow is 1 bit, so don't make this change. We may need to get creative about dealing with NumPy's metadata.

pcmoritz · 2017-10-13

Fair enough, for our Python serialization we could hack around it by defining a custom serializer. However, for Tensor.from_numpy() we can't do that because the type needs to be fully encoded in the Tensor type. Would it be acceptable to introduce a new type for this? Let me know which solution you prefer.

pcmoritz · 2017-10-13

In particular I'm thinking of introducing a "bool8" type, which is a bool that is encoded as a single byte.

wesm · 2017-10-14

I created https://issues.apache.org/jira/browse/ARROW-1674. This is probably the right way to handle this at the format level. We can separately add a data type in C++. It will be useful to be able to receive numpy.bool_ data with zero copy in Arrow.
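To make the size mismatch in this thread concrete, here is a small NumPy-only sketch. It assumes a bit-packed boolean layout in least-significant-bit order, which is how the Arrow columnar format describes boolean buffers. np.packbits is used purely for illustration and is not how Arrow itself builds its buffers.

```python
import numpy as np

# NumPy stores each bool in its own byte.
flags = np.array([True, False, True, True, False, False, False, True, True])
numpy_bytes = flags.tobytes()  # one byte per value

# A bit-packed layout (1 bit per value, LSB first) needs ceil(9 / 8) bytes.
packed = np.packbits(flags, bitorder="little")

print(len(numpy_bytes), packed.nbytes)

# Going back to one byte per value means materializing a new buffer,
# which is why a 1-bit boolean type cannot wrap numpy.bool_ data zero copy.
restored = np.unpackbits(packed, count=flags.size, bitorder="little").astype(bool)
```

A byte-wide boolean type such as the proposed "bool8" would avoid the packing step, since the in-memory layout would already match NumPy's.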

robertnishihara · 2017-10-17T23:37:45Z

This just passes bool arrays to the custom serializer, right? Does it make sense to register the custom serializer in the default serialization context or no?

pcmoritz · 2017-10-17T23:42:37Z

Magically, this is already taken care of. The custom serializer we already have is generic: it converts the array to nested lists, and the custom deserializer makes a numpy array out of them. Not the most efficient solution, but it fixes the problem until we have the proper one :)

robertnishihara · 2017-10-17T23:57:50Z

Oh, I see. Could be made efficient by having the custom serializer special case bool arrays. But if this is just temporary then no need to.
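A sketch of what such a special case might look like. It is not what this PR does. It rests on the assumption that uint8 arrays already go through an efficient path, so a bool array could be reinterpreted as uint8 on the way in and viewed back as bool on the way out. The helper names are hypothetical.

```python
import numpy as np

def encode_bool_array(arr):
    """Reinterpret a bool array as uint8 without copying (hypothetical helper)."""
    if arr.dtype != np.bool_:
        raise TypeError("expected a bool array")
    # Both dtypes are one byte wide, so view() reuses the same memory.
    return arr.view(np.uint8)

def decode_bool_array(encoded):
    """Reinterpret the uint8 data as bool again, also without copying."""
    return encoded.view(np.bool_)

mask = np.array([[True, False], [False, True]])
encoded = encode_bool_array(mask)
decoded = decode_bool_array(encoded)
```

A real serializer would also need to record that the uint8 payload stands for booleans, so the reader knows to view it back.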

pcmoritz · 2017-10-18T04:40:49Z

+1 this is ready to merge as a workaround for ray-project/ray#1121

wesm · 2017-10-18T13:41:52Z

I haven't looked too deeply, but could you explain how this fix works?

pcmoritz · 2017-10-18T15:38:53Z

Yeah, the switch case I removed makes bool arrays fall through to the default case, which uses the custom serializer. That ends up in the function

arrow/python/pyarrow/serialization.py, line 81 in a043018:

def _serialize_numpy_array(obj):

which converts the array to a nested list upon serialization and back to a numpy array upon deserialization.
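A rough illustration of that round trip, using NumPy only. The function names and the dict layout are made up for this sketch; the actual _serialize_numpy_array in pyarrow may differ in detail.

```python
import numpy as np

def ndarray_to_nested_list(arr):
    """Turn an array into plain Python objects (hypothetical helper)."""
    # tolist() copies every element into a Python object, so this is
    # neither zero copy nor fast, but it works for any dtype, bool included.
    return {"data": arr.tolist(), "dtype": arr.dtype.str}

def nested_list_to_ndarray(payload):
    """Rebuild the array from the nested list and the recorded dtype."""
    return np.array(payload["data"], dtype=np.dtype(payload["dtype"]))

original = np.array([[True, False, True], [False, False, True]])
payload = ndarray_to_nested_list(original)
rebuilt = nested_list_to_ndarray(payload)
```

The cost is one Python object per element in both directions, which is the "slower / not zero copy" trade-off discussed in this thread.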

wesm · 2017-10-18T21:07:08Z

Got it, so the workaround is slower / not zero copy. No big deal. I will work on a proper fix with zero-copy reads in time for 0.8.0.

wesm

+1

pcmoritz force-pushed the ndarray-bool branch 5 times, most recently from 097a78b to 1de11a4, on October 17, 2017 22:52

pcmoritz added 4 commits October 17, 2017 16:28

11c7ed3 add support for numpy 'bool' type
ad4c6b9 change bool width to 1 byte
8fce724 update
14943a0 deploy workaround

pcmoritz force-pushed the ndarray-bool branch from 1de11a4 to 14943a0 on October 17, 2017 23:28

pcmoritz mentioned this pull request Oct 17, 2017 in ARROW-1674: [Format, C++] Add support for byte length booleans in Tensors #1201 (closed)

wesm approved these changes Oct 18, 2017

asfgit closed this in 298e343 Oct 18, 2017

wesm deleted the ndarray-bool branch October 18, 2017 21:08
