Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1674: [Format, C++] Add support for byte length booleans in Tensors #1201

Closed
wants to merge 6 commits into from

Conversation

pcmoritz
Copy link
Contributor

This adds support for byte length booleans in Tensors on the C++ side and also a new logical field in the Boolean flatbuffers message that lets us distinguish between booleans represented by bits and booleans represented by bytes. The use of byte sized booleans is restricted to Tensors.

@pcmoritz pcmoritz changed the title ARROW-1674: Format, C++] Add support for byte length booleans in Tensors ARROW-1674: [Format, C++] Add support for byte length booleans in Tensors Oct 14, 2017
@pcmoritz
Copy link
Contributor Author

This is now ready to review/merge

@@ -49,6 +49,9 @@ struct Type {
/// Boolean as 1 bit, LSB bit-packed ordering
BOOL,

/// Boolean as 1 byte (may only be used in Tensors)
BOOL8,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm -1 on this type as it further complicates the spec. Introducing it will affect more than just the tensors. Any reasoning to also have this on the JAVA side of things, too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t agree that adding metadata expands the “Arrow specification” in a strict sense — we need some way to accommodate this particular logical type (which occurs in NumPy, R, deep learning frameworks, etc), so if adding a new enumeration value won’t do it let’s try to come up with another solution. If this were something more esoteric I would agree, but we have put off dealing with this kind of data for a while (also an issue with Feather format).

Having an additional boolean type metadata does not mean that we have to implement two versions of every algorithm, or it may be that the byte-boolean version would cast to bit-packed (a cast we definitely need to implement — so having it in arrow::compute::Cast would be valuable) and then use the bit packed path.

Copy link
Contributor

@jacques-n jacques-n Oct 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my comments, I'm mostly -1 on this as well. Just because someone else calls a 1 byte value boolean doesn't mean it has to map to arrow's boolean. This is just a uint8 from my perspective.

@wesm
Copy link
Member

wesm commented Oct 15, 2017

@jacques-n @julienledem @kou could you give your thoughts on this particular issue?

What we are running into on the C++ / Python side at least is that Arrow is becoming effectively a "platform for in-memory data management". So we have some overlapping pieces of tech:

  • The Arrow columnar format -- i.e. the memory that can be described by a RecordBatch, and that we document in the Markdown documents in https://github.com/apache/arrow/tree/master/format
  • The C++ libraries, which handle shared memory transport, zero copy reads, and metadata conversions to other runtimes

Zero copy transport and full-fidelity metadata when dealing with other runtimes is important. So if a system hands us a 1GB buffer intended to transport it to and from shared memory with some metadata to describe the contents, it would be good to maximize what we can reasonably describe with the Arrow metadata to minimize the need for copying and conversions. I believe that we need to be able to expand our ability to represent diverse scalar type metadata without necessarily expanding what the "Arrow columnar format" means.

In this particular case, other frameworks which may give memory to the Arrow libraries represent boolean data as a uint8/int8 array of 1's and 0's. So somehow we have to be able to distinguish uint8-as-boolean vs. uint8-as-uint8 (and you can do a zero-copy cast from boolean-uint8 to plain uint8, of course). As this is a metadata-only concern I do not believe it constitutes expanding the definition of what constitutes boolean data according to the Arrow columnar format.

This particular implementation does require (eventually) that Arrow implementations check whether the bit width of received boolean data is 1 -- if not, if they have no special handling for 8-bit boolean, they could simply treat the data as uint8/int8, possibly with some loss of metadata. As a result of this, I would not expect to have integration test this type, as it would not necessarily be reasonable to expect other Arrow libraries to support extra metadata falling outside the primary Arrow specification

@jacques-n
Copy link
Contributor

jacques-n commented Oct 15, 2017

If I understand what you're saying, I think we should be handling this the opposite of the way you're proposing. This is really a byte vector, not a bit vector. If a particular system wants to add metadata so it can know that this is a bit vector, it could do that with the existing custom metadata stuff [1]. It doesn't seem like something that Arrow should need to worry about.

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285

@wesm
Copy link
Member

wesm commented Oct 15, 2017

@jacques-n in general I would agree where “odd logical types” do not fit in any obvious place in our type metadata, the custom_metadata field can be employed. The data here is not a bit vector, it semantically boolean. In this particular case I would argue that over time enough systems will feature native support for a byte-size boolean that having unambiguous metadata would be valuable. If we use custom_metadata, then two systems must be aware of their respective conventions for annotating boolean data, which may cause implementers a headache

@wesm
Copy link
Member

wesm commented Oct 15, 2017

I also want to emphasize that the serialized Flatbuffers metadata and the in-memory C++ object model are not a 1-1 mapping. So we don’t need format changes to solve this problem

@jacques-n
Copy link
Contributor

I'm fine with it being more formal than a KeyValue but feel like it is a property of int metadata, not bool. Systems that don't understand the property can still deal with it that way.

@wesm
Copy link
Member

wesm commented Oct 15, 2017

@jacques-n I see. Our data types in Schema.fbs are all logical anyway, and it's only through interpretation of the metadata that we determine the physical memory representation.

I can implement a solution using custom_metadata to unblock the immediate need for 0.8.0 and perhaps we can take a "wait and see" approach on the Arrow schema having built-in knowledge of 8-bit booleans. This will be useful anyway for the next time we need to add a C++ logical type that isn't part of Schema.fbs. @pcmoritz do you mind me pushing to this branch?

@pcmoritz
Copy link
Contributor Author

Please go ahead! Actually @jacques-n solution of introducing a field to the int types sounds pretty nice from the practical perspective (of other systems being able to read the byte bool as "uint8" even if they don't deal with this case); it feels a little backwards from the specification perspective however. But feel free to implement this using custom_metadata if you prefer.

@pcmoritz
Copy link
Contributor Author

Also then we don't need to introduce a new type on the C++ side which is nice. BOOL8 feels a little strange.

@wesm
Copy link
Member

wesm commented Oct 15, 2017

I'm actually not so sure about that, I think the cleanest way is to have a new data type object so that we have the option to perform dynamic dispatch on byte-boolean data

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Oct 17, 2017

@wesm How do you feel about leaving this unresolved for now and I can implement a custom serializer for pyarrow.serialize that will take care of numpy.bool? This won't fix Tensor.from_numpy and also doesn't help with Feather but it will resolve ray-project/ray#1121

@wesm
Copy link
Member

wesm commented Oct 17, 2017

If you'd like to implement a workaround, that's totally fine. I haven't had much time to code over the last week but hoping to get this done this week. I can open a new PR (probably will need a new JIRA for the custom_metadata plumbing) if you leave your branch around for me to use as a starting point

@pcmoritz
Copy link
Contributor Author

Thanks, sounds good! The workaround is very trivial, so I implemented it in #1199 now. I'll keep this PR around.

@wesm
Copy link
Member

wesm commented Mar 12, 2018

I'm closing this PR until we have a chance to address the underlying issue (distinguishing byte-size boolean vs uint8) in more detail

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants