ARROW-549: [C++] Add arrow::Concatenate function to combine multiple arrays into a single Array#3746
ARROW-549: [C++] Add arrow::Concatenate function to combine multiple arrays into a single Array#3746bkietz wants to merge 21 commits intoapache:masterfrom
Conversation
wesm
left a comment
There was a problem hiding this comment.
This seems to be on the right track. The test coverage is a bit thin, though. It would be better to return NotImplemented in paths that are untested if tests don't get written for them
cpp/src/arrow/array-test.cc
Outdated
There was a problem hiding this comment.
How about just writing:
CheckConcatenate(type, arrays_as_json, expected_json)
The ConcatenateParam seems unnecessarily elaborate to me.
It doesn't appear that the non-zero offset cases are being tested right now, so in those cases will have to construct and slice the test arrays manually
cpp/src/arrow/array.cc
Outdated
There was a problem hiding this comment.
This implementation needs a more appropriate home than array.cc. We could create a subdirectory of miscellaneous array algorithms. I don't have a strong opinion, @xhochy @pitrou @fsaintjacques @emkornfield thoughts?
There was a problem hiding this comment.
under util/ might make sense for now. We can see how many random algorithms get put here vs just used in kernels (my second choice would be under compute I suppose).
There was a problem hiding this comment.
Algorithms could be grouped under src/arrow/array/algorithm_<category>.{h,cc} with a single convenience header/source src/arrow/algorithm.{h,cc} similar to builders
There was a problem hiding this comment.
If this is all private, we should avoid convenience headers that balloon compile times.
cpp/src/arrow/array.cc
Outdated
cpp/src/arrow/array.cc
Outdated
There was a problem hiding this comment.
I think this should return Invalid if any type is unequal
cpp/src/arrow/array.cc
Outdated
There was a problem hiding this comment.
Seems like the null bitmap is getting concatenated twice here
|
I think you need to define 2^31 - 1 overflow behavior for offsets (lists and binary) |
573be68 to
4ef76e7
Compare
|
@fsaintjacques re property testing |
cpp/src/arrow/buffer.h
Outdated
There was a problem hiding this comment.
Do we want this at the top-level namespace, maybe static Status Buffer::Concatenate(...) or BuffersConcatenate()?
There was a problem hiding this comment.
the renaming seems unnecessary- the overload will not conflict with concatenate(arrays)
There was a problem hiding this comment.
I'm not concerned about conflict. I'm concerned about readability. buffer::Concatenate or Buffer::Concatenate speaks a lot more than just Concatenate.
There was a problem hiding this comment.
IMHO it's more readable to avoid manual name mangling; the argument types are part of what users will read
There was a problem hiding this comment.
Would it help to introduce an alias like:
using BufferVector = std::vector<std::shared_ptr<Buffer>>
I think this would help make the signature easier to read.
cpp/src/arrow/util/concatenate.cc
Outdated
There was a problem hiding this comment.
Ditto, there's no need for this function to depends on internal state. In fact, I'd want to see this function explicitly unit tested.
|
@wesm @fsaintjacques how's this look? |
cpp/src/arrow/util/concatenate.h
Outdated
There was a problem hiding this comment.
I think you could use the ArrayVector alias in the first input argument.
There was a problem hiding this comment.
Great, that will neaten this up quite a bit. I see there's also BufferVector which should be useful too
cpp/src/arrow/array-test.cc
Outdated
There was a problem hiding this comment.
The Mersenne-Twister PRNG engine has been identified as a source of test slowness. See discussion here.
There was a problem hiding this comment.
Alright, I'll switch to std::default_random_engine
cpp/src/arrow/array-test.cc
Outdated
There was a problem hiding this comment.
Another place where the ArrayVector alias could be used.
pitrou
left a comment
There was a problem hiding this comment.
Thanks for doing this. Here are some comments below. In particular, I would welcome a bit more clarity in the implementation.
cpp/src/arrow/array-test.cc
Outdated
There was a problem hiding this comment.
Is it? What if I have sliced a StringArray?
There was a problem hiding this comment.
It's in the spec: the first offset in an offsets buffer is always 0.
cpp/src/arrow/array-test.cc
Outdated
cpp/src/arrow/array-test.cc
Outdated
There was a problem hiding this comment.
I would rather have this in a separate test file. array-test.cc is already too large, and too slow to compile.
There was a problem hiding this comment.
Alright, I'll move it to arrow-concatenate-test.
cpp/src/arrow/buffer.h
Outdated
There was a problem hiding this comment.
Please call this ConcatenateBuffers.
There was a problem hiding this comment.
Should I rename Concatenate(arrays) to ConcatenateArrays as well?
There was a problem hiding this comment.
Yes, you can, though disambiguating requires only one rename ;-)
cpp/src/arrow/util/concatenate.cc
Outdated
There was a problem hiding this comment.
Can you add a comment explaining what this does?
cpp/src/arrow/util/concatenate.cc
Outdated
There was a problem hiding this comment.
Can you add a comment explaining what this does?
- test concatenation of randomly generated array slices - ensure some of those array slices will have offset != 0 - test all primitive types, StringType, ListType, StructType, and DictionaryType - add disabled test for UnionType
vagrind was complaining about uninitialized values due to the way CopyBitmap restores trailing bits; extended CopyBitmap to (optionally) elide that step
- renaming - use memcpy not std::copy in ConcatenateBuffers - concatenating zero buffers does not terminate - refactor for clarity - add comments
14a78f7 to
c1600bc
Compare
|
@pitrou does this address your concerns? |
|
Thanks @bkietz. Can you open a JIRA for concatenation of union arrays? |
Concatenate arrays into a single array