ARROW-1674: [Format, C++] Add support for byte length booleans in Tensors #1201

pcmoritz · 2017-10-14T22:50:14Z

This adds support for byte length booleans in Tensors on the C++ side and also a new logical field in the Boolean flatbuffers message that lets us distinguish between booleans represented by bits and booleans represented by bytes. The use of byte sized booleans is restricted to Tensors.

pcmoritz · 2017-10-15T02:30:43Z

This is now ready to review/merge

xhochy · 2017-10-15T08:20:12Z

cpp/src/arrow/type.h

@@ -49,6 +49,9 @@ struct Type {
 /// Boolean as 1 bit, LSB bit-packed ordering
 BOOL,

+ /// Boolean as 1 byte (may only be used in Tensors)
+ BOOL8,


I'm -1 on this type as it further complicates the spec. Introducing it will affect more than just the tensors. Any reasoning to also have this on the JAVA side of things, too?

I don’t agree that adding metadata expands the “Arrow specification” in a strict sense — we need some way to accommodate this particular logical type (which occurs in NumPy, R, deep learning frameworks, etc), so if adding a new enumeration value won’t do it let’s try to come up with another solution. If this were something more esoteric I would agree, but we have put off dealing with this kind of data for a while (also an issue with Feather format).

Having an additional boolean type metadata does not mean that we have to implement two versions of every algorithm, or it may be that the byte-boolean version would cast to bit-packed (a cast we definitely need to implement — so having it in arrow::compute::Cast would be valuable) and then use the bit packed path.

Per my comments, I'm mostly -1 on this as well. Just because someone else calls a 1 byte value boolean doesn't mean it has to map to arrow's boolean. This is just a uint8 from my perspective.

wesm · 2017-10-15T16:13:43Z

@jacques-n @julienledem @kou could you give your thoughts on this particular issue?

What we are running into on the C++ / Python side at least is that Arrow is becoming effectively a "platform for in-memory data management". So we have some overlapping pieces of tech:

The Arrow columnar format -- i.e. the memory that can be described by a RecordBatch, and that we document in the Markdown documents in https://github.com/apache/arrow/tree/master/format
The C++ libraries, which handle shared memory transport, zero copy reads, and metadata conversions to other runtimes

Zero copy transport and full-fidelity metadata when dealing with other runtimes is important. So if a system hands us a 1GB buffer intended to transport it to and from shared memory with some metadata to describe the contents, it would be good to maximize what we can reasonably describe with the Arrow metadata to minimize the need for copying and conversions. I believe that we need to be able to expand our ability to represent diverse scalar type metadata without necessarily expanding what the "Arrow columnar format" means.

In this particular case, other frameworks which may give memory to the Arrow libraries represent boolean data as a uint8/int8 array of 1's and 0's. So somehow we have to be able to distinguish uint8-as-boolean vs. uint8-as-uint8 (and you can do a zero-copy cast from boolean-uint8 to plain uint8, of course). As this is a metadata-only concern I do not believe it constitutes expanding the definition of what constitutes boolean data according to the Arrow columnar format.

This particular implementation does require (eventually) that Arrow implementations check whether the bit width of received boolean data is 1 -- if not, if they have no special handling for 8-bit boolean, they could simply treat the data as uint8/int8, possibly with some loss of metadata. As a result of this, I would not expect to have integration test this type, as it would not necessarily be reasonable to expect other Arrow libraries to support extra metadata falling outside the primary Arrow specification

jacques-n · 2017-10-15T18:01:44Z

If I understand what you're saying, I think we should be handling this the opposite of the way you're proposing. This is really a byte vector, not a bit vector. If a particular system wants to add metadata so it can know that this is a bit vector, it could do that with the existing custom metadata stuff [1]. It doesn't seem like something that Arrow should need to worry about.

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285

wesm · 2017-10-15T18:51:50Z

@jacques-n in general I would agree where “odd logical types” do not fit in any obvious place in our type metadata, the custom_metadata field can be employed. The data here is not a bit vector, it semantically boolean. In this particular case I would argue that over time enough systems will feature native support for a byte-size boolean that having unambiguous metadata would be valuable. If we use custom_metadata, then two systems must be aware of their respective conventions for annotating boolean data, which may cause implementers a headache

wesm · 2017-10-15T18:55:40Z

I also want to emphasize that the serialized Flatbuffers metadata and the in-memory C++ object model are not a 1-1 mapping. So we don’t need format changes to solve this problem

jacques-n · 2017-10-15T19:19:51Z

I'm fine with it being more formal than a KeyValue but feel like it is a property of int metadata, not bool. Systems that don't understand the property can still deal with it that way.

wesm · 2017-10-15T19:22:19Z

@jacques-n I see. Our data types in Schema.fbs are all logical anyway, and it's only through interpretation of the metadata that we determine the physical memory representation.

I can implement a solution using custom_metadata to unblock the immediate need for 0.8.0 and perhaps we can take a "wait and see" approach on the Arrow schema having built-in knowledge of 8-bit booleans. This will be useful anyway for the next time we need to add a C++ logical type that isn't part of Schema.fbs. @pcmoritz do you mind me pushing to this branch?

pcmoritz · 2017-10-15T19:26:07Z

Please go ahead! Actually @jacques-n solution of introducing a field to the int types sounds pretty nice from the practical perspective (of other systems being able to read the byte bool as "uint8" even if they don't deal with this case); it feels a little backwards from the specification perspective however. But feel free to implement this using custom_metadata if you prefer.

pcmoritz · 2017-10-15T19:33:19Z

Also then we don't need to introduce a new type on the C++ side which is nice. BOOL8 feels a little strange.

wesm · 2017-10-15T19:38:45Z

I'm actually not so sure about that, I think the cleanest way is to have a new data type object so that we have the option to perform dynamic dispatch on byte-boolean data

pcmoritz · 2017-10-17T20:57:00Z

@wesm How do you feel about leaving this unresolved for now and I can implement a custom serializer for pyarrow.serialize that will take care of numpy.bool? This won't fix Tensor.from_numpy and also doesn't help with Feather but it will resolve ray-project/ray#1121

wesm · 2017-10-17T21:08:36Z

If you'd like to implement a workaround, that's totally fine. I haven't had much time to code over the last week but hoping to get this done this week. I can open a new PR (probably will need a new JIRA for the custom_metadata plumbing) if you leave your branch around for me to use as a starting point

pcmoritz · 2017-10-17T23:30:58Z

Thanks, sounds good! The workaround is very trivial, so I implemented it in #1199 now. I'll keep this PR around.

wesm · 2018-03-12T18:46:45Z

I'm closing this PR until we have a chance to address the underlying issue (distinguishing byte-size boolean vs uint8) in more detail

pcmoritz added 2 commits October 14, 2017 15:47

add support for byte length booleans

caddfce

update

246f994

pcmoritz changed the title ~~ARROW-1674: Format, C++] Add support for byte length booleans in Tensors~~ ARROW-1674: [Format, C++] Add support for byte length booleans in Tensors Oct 14, 2017

pcmoritz added 4 commits October 14, 2017 16:27

add BOOL and BOOL8 to is_tensor_supported

92a5be1

fix valgrind

fcde2f7

fix tensor comparison function

3208546

fix windows build

58ff6eb

xhochy reviewed Oct 15, 2017

View reviewed changes

wesm closed this Mar 12, 2018

asfimport mentioned this pull request Nov 3, 2021

[C++] Add ExtensionType implementation for 8-bit boolean values #17682

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-1674: [Format, C++] Add support for byte length booleans in Tensors #1201

ARROW-1674: [Format, C++] Add support for byte length booleans in Tensors #1201

pcmoritz commented Oct 14, 2017

pcmoritz commented Oct 15, 2017

xhochy Oct 15, 2017

wesm Oct 15, 2017

jacques-n Oct 15, 2017 •

edited

Loading

wesm commented Oct 15, 2017 •

edited

Loading

jacques-n commented Oct 15, 2017 •

edited

Loading

wesm commented Oct 15, 2017

wesm commented Oct 15, 2017

jacques-n commented Oct 15, 2017

wesm commented Oct 15, 2017

pcmoritz commented Oct 15, 2017

pcmoritz commented Oct 15, 2017

wesm commented Oct 15, 2017 •

edited

Loading

pcmoritz commented Oct 17, 2017 •

edited

Loading

wesm commented Oct 17, 2017

pcmoritz commented Oct 17, 2017

wesm commented Mar 12, 2018

ARROW-1674: [Format, C++] Add support for byte length booleans in Tensors #1201

ARROW-1674: [Format, C++] Add support for byte length booleans in Tensors #1201

Conversation

pcmoritz commented Oct 14, 2017

pcmoritz commented Oct 15, 2017

xhochy Oct 15, 2017

Choose a reason for hiding this comment

wesm Oct 15, 2017

Choose a reason for hiding this comment

jacques-n Oct 15, 2017 • edited Loading

Choose a reason for hiding this comment

wesm commented Oct 15, 2017 • edited Loading

jacques-n commented Oct 15, 2017 • edited Loading

wesm commented Oct 15, 2017

wesm commented Oct 15, 2017

jacques-n commented Oct 15, 2017

wesm commented Oct 15, 2017

pcmoritz commented Oct 15, 2017

pcmoritz commented Oct 15, 2017

wesm commented Oct 15, 2017 • edited Loading

pcmoritz commented Oct 17, 2017 • edited Loading

wesm commented Oct 17, 2017

pcmoritz commented Oct 17, 2017

wesm commented Mar 12, 2018

jacques-n Oct 15, 2017 •

edited

Loading

wesm commented Oct 15, 2017 •

edited

Loading

jacques-n commented Oct 15, 2017 •

edited

Loading

wesm commented Oct 15, 2017 •

edited

Loading

pcmoritz commented Oct 17, 2017 •

edited

Loading