-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1673: [Python] Add support for numpy 'bool' type #1199
Conversation
cpp/src/arrow/type.h
Outdated
@@ -328,7 +328,7 @@ class ARROW_EXPORT BooleanType : public FixedWidthType, public NoExtraMeta { | |||
Status Accept(TypeVisitor* visitor) const override; | |||
std::string ToString() const override; | |||
|
|||
int bit_width() const override { return 1; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not at all sure if this is the right fix; maybe we need a separate field for the width if the type is contained in a tensor? Standardizing around numpy for tensors seems the way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boolean in Arrow is 1 bit, so don’t make this change. We may need to get creative about dealing with NumPy’s metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, for our Python serialization we could hack around it by defining a custom serializer. However, for Tensor.from_numpy() we can't do that because the type needs to be fully encoded in the Tensor type. Would it be acceptable to introduce a new type for this? Let me know which solution you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In particular I'm thinking of introducing a "bool8" type, which is a bool that is encoded as a single byte.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created https://issues.apache.org/jira/browse/ARROW-1674. This is probably the right way to handle this at the format level. We can separately add a data type in C++. It will be useful to be able to receive numpy.bool_
data with zero copy in Arrow
097a78b
to
1de11a4
Compare
1de11a4
to
14943a0
Compare
This just passes bool arrays to the custom serializer, right? Does it make sense to register the custom serializer in the default serialization context or no? |
Magically, this is already taken care of. The custom serializer we already have is generic, it will convert the array to nested lists and the custom deserializer will make a numpy array out of it. Not the most efficient solution but it fixes the problem until we have the proper solution :) |
Oh, I see. Could be made efficient by having the custom serializer special case bool arrays. But if this is just temporary then no need to. |
+1 this is ready to merge as a workaround for ray-project/ray#1121 |
I haven't looked too deeply, but could you explain how this fix works? |
Yeah, the switch case I removed makes it fall back to the default, which uses the custom serializer. This will fall back to the function arrow/python/pyarrow/serialization.py Line 81 in a043018
which converts the array to a nested list upon serialization and back upon deserialization. |
Got it, so the workaround is slower / not zero copy. No big deal. I will work to get this fixed more properly + zero copy reads in time for 0.8.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
This is currently a workaround until the Arrow tensor supports zero copy of byte-length booleans.