Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1673: [Python] Add support for numpy 'bool' type #1199

Closed
wants to merge 4 commits into from

Conversation

pcmoritz
Copy link
Contributor

@pcmoritz pcmoritz commented Oct 13, 2017

This is currently a workaround until the Arrow tensor supports zero copy of byte-length booleans.

@@ -328,7 +328,7 @@ class ARROW_EXPORT BooleanType : public FixedWidthType, public NoExtraMeta {
Status Accept(TypeVisitor* visitor) const override;
std::string ToString() const override;

int bit_width() const override { return 1; }
Copy link
Contributor Author

@pcmoritz pcmoritz Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not at all sure if this is the right fix; maybe we need a separate field for the width if the type is contained in a tensor? Standardizing around numpy for tensors seems the way to go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boolean in Arrow is 1 bit, so don’t make this change. We may need to get creative about dealing with NumPy’s metadata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, for our Python serialization we could hack around it by defining a custom serializer. However, for Tensor.from_numpy() we can't do that because the type needs to be fully encoded in the Tensor type. Would it be acceptable to introduce a new type for this? Let me know which solution you prefer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In particular I'm thinking of introducing a "bool8" type, which is a bool that is encoded as a single byte.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created https://issues.apache.org/jira/browse/ARROW-1674. This is probably the right way to handle this at the format level. We can separately add a data type in C++. It will be useful to be able to receive numpy.bool_ data with zero copy in Arrow

@pcmoritz pcmoritz force-pushed the ndarray-bool branch 5 times, most recently from 097a78b to 1de11a4 Compare October 17, 2017 22:52
@robertnishihara
Copy link
Contributor

This just passes bool arrays to the custom serializer, right? Does it make sense to register the custom serializer in the default serialization context or no?

@pcmoritz
Copy link
Contributor Author

Magically, this is already taken care of. The custom serializer we already have is generic, it will convert the array to nested lists and the custom deserializer will make a numpy array out of it. Not the most efficient solution but it fixes the problem until we have the proper solution :)

@robertnishihara
Copy link
Contributor

Oh, I see. Could be made efficient by having the custom serializer special case bool arrays. But if this is just temporary then no need to.

@pcmoritz
Copy link
Contributor Author

+1 this is ready to merge as a workaround for ray-project/ray#1121

@wesm
Copy link
Member

wesm commented Oct 18, 2017

I haven't looked too deeply, but could you explain how this fix works?

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Oct 18, 2017

Yeah, the switch case I removed makes it fall back to the default, which uses the custom serializer. This will fall back to the function

def _serialize_numpy_array(obj):

which converts the array to a nested list upon serialization and back upon deserialization.

@wesm
Copy link
Member

wesm commented Oct 18, 2017

Got it, so the workaround is slower / not zero copy. No big deal. I will work to get this fixed more properly + zero copy reads in time for 0.8.0

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@asfgit asfgit closed this in 298e343 Oct 18, 2017
@wesm wesm deleted the ndarray-bool branch October 18, 2017 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants