-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41136: [C#] Recompute null count for sliced arrays on demand #41144
Conversation
public readonly int Offset; | ||
public readonly ArrowBuffer[] Buffers; | ||
public readonly ArrayData[] Children; | ||
public readonly ArrayData Dictionary; // Only used for dictionary type | ||
|
||
public int NullCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this ABI breakage acceptable? I expect most users would access the null count from the Array rather than ArrayData so it might not affect too many users. But as an alternative I could add a new property and mark this one as obsolete, or maybe just always compute the null count in the constructor rather than doing it on demand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a tough one. The problem with ABI breakage is that it makes diamond-shaped dependency graphs unfixable without modifying some other dependency. Computing eagerly is a potential performance regression. Marking the original field as obsolete and defining a new property feels like it's probably the best option. I suspect most uses will go through IArrowArray.NullCount anyway, so the impact on existing source is likely to be small. The only painful part is probably coming up with a new name... .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah OK, maybe a GetNullCount
method instead would make sense, partly because I can't think of a better property name, but then it's also obvious it might do more work than a plain member access.
I guess we also don't necessarily need to deprecate the existing member because there might be cases where users only want the null count if it can be accessed quickly and don't want to pay the cost of computing it, so we could just document that it may return a negative number. And we could remove the readonly
modifier to allow updating it if GetNullCount
is called, which isn't a breaking change according to the rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reverted NullCount
back to a normal member but made it non-readonly and added a GetNullCount
method instead. I've changed most uses of NullCount
to GetNullCount()
, except in the C Data interface implementation which explicitly allows -1 values: https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowArray.null_count
Let me know if you'd prefer to deprecate NullCount
instead though, or have any other suggestions for how to handle this.
{ | ||
if (_nullCount == RecalculateNullCount) | ||
{ | ||
_nullCount = ComputeNullCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might result in duplicate work if multiple threads try to access the null count concurrently, but I don't imagine this is a big concern, and the C++ implementation has the same behaviour.
// Note: Dictionary arrays may be logically null if there is a null in the dictionary values, | ||
// but this isn't accounted for by the IArrowArray.IsNull implementation, | ||
// so we maintain consistency with that behaviour here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The C++ implementation actually has a separate ComputeLogicalNullCount
method on ArrayData
and Array
that handles union arrays, dictionary arrays and run end encoded arrays, and there the null_count
for union arrays always returns 0. The C# UnionArray.IsValid
method depends on the NullCount
value though, so I figured it made sense to compute the actual null count for union arrays.
@@ -111,7 +123,25 @@ public ArrayData Slice(int offset, int length) | |||
length = Math.Min(Length - offset, length); | |||
offset += Offset; | |||
|
|||
return new ArrayData(DataType, length, RecalculateNullCount, offset, Buffers, Children, Dictionary); | |||
int nullCount; | |||
if (NullCount == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would force calculation of the null count on the original array. Is it worth preserving laziness and saying "if we haven't calculated the original null count then mark the sliced null count as requiring calculation"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, yes definitely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was fixed as a side effect of reverting NullCount
back to being a normal int member.
return Length - BitUtility.CountBits(Buffers[0].Span, Offset, Length); | ||
} | ||
|
||
private int ComputeSparseUnionNullCount() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth moving these as statics onto SparseUnionArray and DenseUnionArray respectively so that the type-specific logic remains in one place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is great! I think we should preserve binary compatibility by adding a new property.
@@ -71,7 +71,7 @@ public ArrayDataConcatenationVisitor(IReadOnlyList<ArrayData> arrayDataList, Mem | |||
foreach (ArrayData arrayData in _arrayDataList) | |||
{ | |||
_totalLength += arrayData.Length; | |||
_totalNullCount += arrayData.NullCount; | |||
_totalNullCount += arrayData.GetNullCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could make sense to check NullCount
for -1 here, and then set the final null count to -1 if it was -1 in any array, rather than forcing a computation. But this class has bigger problems as it doesn't account for non-zero offsets anywhere as far as I can see, although it does seem to account for array lengths being less than the full buffer size. Maybe it should throw a NotImplementedException
if it encounters a non-zero offset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you file a separate bug for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep sure, I've made #41164
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f20c1e8. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…apache#41144) ### Rationale for this change See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays. ### What changes are included in this PR? Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`. ### Are these changes tested? Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling. ### Are there any user-facing changes? Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1. * GitHub Issue: apache#41136 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
### Rationale for this change See #41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays. ### What changes are included in this PR? Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`. ### Are these changes tested? Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling. ### Are there any user-facing changes? Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1. * GitHub Issue: #41136 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
…apache#41144) ### Rationale for this change See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays. ### What changes are included in this PR? Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`. ### Are these changes tested? Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling. ### Are there any user-facing changes? Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1. * GitHub Issue: apache#41136 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
…apache#41144) ### Rationale for this change See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays. ### What changes are included in this PR? Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`. ### Are these changes tested? Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling. ### Are there any user-facing changes? Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1. * GitHub Issue: apache#41136 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
…apache#41144) ### Rationale for this change See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays. ### What changes are included in this PR? Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`. ### Are these changes tested? Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling. ### Are there any user-facing changes? Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1. * GitHub Issue: apache#41136 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
…apache#41144) ### Rationale for this change See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays. ### What changes are included in this PR? Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`. ### Are these changes tested? Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling. ### Are there any user-facing changes? Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1. * GitHub Issue: apache#41136 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
…apache#41144) ### Rationale for this change See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays. ### What changes are included in this PR? Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`. ### Are these changes tested? Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling. ### Are there any user-facing changes? Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1. * GitHub Issue: apache#41136 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
Rationale for this change
See #41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays.
What changes are included in this PR?
Adds a new
ArrayData.GetNullCount()
method which computes the actual null count if it is unknown, and changes most uses ofArrayData.NullCount
toGetNullCount()
.Are these changes tested?
Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling.
Are there any user-facing changes?
Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1.