-
Notifications
You must be signed in to change notification settings - Fork 810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make slice work for nested types #389
Conversation
@jorgecarleitao I have the below failures, which are mostly related to I worked on this over the weekend out of curiousity, so I'll continue working on the other tests this week.
|
I think instead of switching of child_data.is_empty we rather need a special case for struct and union types. ListArray for example should work fine with the current logic. The arrays' offset applies to the offsets buffer which indexes into the child buffer, slicing just has to adjust that one offset without having to change the offset of child arrays. The same applies to Binary or StringArrays, which have the same layout as lists. For struct (and probably union too) however we'd need to adjust the offset field of all the child arrays by calling slice on them. |
@jhorstmann I'm still looking into this. Nested list slices currently don't work in the main branch; that's what made me come back to this issue. The primitive in |
@nevi-me can you point me to a testcase and does this only involve arrow or the parquet roundtrip? I'd be happy to take a look. My thoughts are that a |
Isn't it possible to not offset the child data, and instead only offset the In arrow2, I am doing impl<O: Offset> ListArray<O> {
pub fn slice(&self, offset: usize, length: usize) -> Self {
let validity = self.validity.clone().map(|x| x.slice(offset, length));
let offsets = self.offsets.clone().slice(offset, length + 1);
Self {
data_type: self.data_type.clone(),
offsets,
values: self.values.clone(),
validity,
offset: self.offset + offset,
}
}
} note how the |
@jorgecarleitao I assume in your implementation the validity bitmap itself takes care of the offset? That seems like a clean solution. But we then have to ensure that no kernels access the underlying bytes/buffer of the bitmap directly and all access goes through a (chunked) iterator or some other abstraction. |
@@ -85,12 +85,7 @@ impl From<ArrayData> for StructArray { | |||
fn from(data: ArrayData) -> Self { | |||
let mut boxed_fields = vec![]; | |||
for cd in data.child_data() { | |||
let child_data = if data.offset() != 0 || data.len() != cd.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhorstmann @jorgecarleitao the problem was here. We take data
which has children and their own offsets, then slice that data correctly to populate boxed_fields
, but then we still use the data
with the child arrays that aren't offset.
The result's that the struct is correct when looking at it through boxed_fields
(e.g. the print utility uses this), but once you need to do anything with data
, it's as if the child values were never sliced.
The lightbulb turned on while i was trying to figure out why a test for #491 was failing.
There's a change that @bjchambers authored (9f96561) which addressed the child offsets, but my tests were failing because I needed to revert what they did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to make sense -- my main interest was in getting the tests I added (concat on sliced struct arrays) to pass because I had run into problems with that. It seems like the tests are still here and passing, and it sounds like this may be more robust as well (addressing other issues I've hit when slicing StructArray
such as the mentioned equality issues).
For my own curiosity -- If I understand correctly, slicing the child data doesn't actually copy the data, but just creates a reference to the same data with different offsets, so this shouldn't affect performance significantly, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to make sense -- my main interest was in getting the tests I added (concat on sliced struct arrays) to pass because I had run into problems with that. It seems like the tests are still here and passing, and it sounds like this may be more robust as well (addressing other issues I've hit when slicing StructArray such as the mentioned equality issues).
Yes, that's correct. I relied on your tests to verify that my changes make sense.
For my own curiosity -- If I understand correctly, slicing the child data doesn't actually copy the data, but just creates a reference to the same data with different offsets, so this shouldn't affect performance significantly, correct?
Yes. The cost should be negligible as we're cloning a bunch of arcs individually instead of all at once
Codecov Report
@@ Coverage Diff @@
## master #389 +/- ##
=======================================
Coverage 82.60% 82.60%
=======================================
Files 167 167
Lines 45984 45986 +2
=======================================
+ Hits 37984 37986 +2
Misses 8000 8000
Continue to review full report at Codecov.
|
@jorgecarleitao @jhorstmann @bjchambers this is ready for review, and partially fixes #514 |
revert changes made in ARROW-11394 See commit apache@9f96561 Only slice into structs
array.offset() + start + len, | ||
) | ||
}) | ||
mutable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reverses 9f96561
This looks good and makes sense to me. StructArray (and probably union too) are a bit of a special case in how offsets are propagated. Since we currently do not push down offsets into the buffers, the offset of an array applies to the direct buffers of the array (including the validity buffer). That is still the case for StructArray, which has no other buffers, so Arrays are still working consistently in that regard. A test for concatenation of sliced StructArrays would be nice to have. |
What parts of #514 doesn't this address? I have a branch with some tests for equality that I can try on-top of this MR, but it seems (possibly naively) that this should address the struct equality after slicing problem mentioned there. |
@jhorstmann this already exists at https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/concat.rs#L367-L409, it was added by @bjchambers |
@bjchambers I forgot to go back to #514 and add details there. I tried your test case on master and on this PR, and it passed on this PR, but with a slight change to the test case. I made the below change, and the test passed. let a = make_struct(vec![
- None,
+ Some((None, None)),
Some((Some("joe"), Some(1))),
Some((None, Some(2))),
Some((None, None)),
Some((Some("mark"), Some(4))),
Some((Some("doe"), Some(5))),
]); |
@alamb I need this PR to fix a test failure on the map PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed #554 for this change
Looks good to me -- and it sounds like there is additional coverage coming in support of MapArray
👍
Which issue does this PR close?
Closes #554
Rationale for this change
ArrayData::slice()
does not work for nested types, because only theArrayData::buffers
are updated with the new offset and length. This has caused a lot of issues in the past.This blocks us from being able to implement
RecordBatch::slice()
, and has led to creating #381 to sidestep this issue.What changes are included in this PR?
I propose a manual slice where we check if
ArrayData::child_data
is not empty, and then propagate the offset and length down to them. The only trick is with list and largelist, where we have to inspect the offset buffers to determine what the new offset and length should be.Are there any user-facing changes?
No UFC