Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41136: [C#] Recompute null count for sliced arrays on demand #41144

Merged
merged 6 commits into from
Apr 12, 2024

Conversation

adamreeve
Copy link
Contributor

@adamreeve adamreeve commented Apr 11, 2024

Rationale for this change

See #41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays.

What changes are included in this PR?

Adds a new ArrayData.GetNullCount() method which computes the actual null count if it is unknown, and changes most uses of ArrayData.NullCount to GetNullCount().

Are these changes tested?

Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling.

Are there any user-facing changes?

Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1.

public readonly int Offset;
public readonly ArrowBuffer[] Buffers;
public readonly ArrayData[] Children;
public readonly ArrayData Dictionary; // Only used for dictionary type

public int NullCount
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ABI breakage acceptable? I expect most users would access the null count from the Array rather than ArrayData so it might not affect too many users. But as an alternative I could add a new property and mark this one as obsolete, or maybe just always compute the null count in the constructor rather than doing it on demand?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a tough one. The problem with ABI breakage is that it makes diamond-shaped dependency graphs unfixable without modifying some other dependency. Computing eagerly is a potential performance regression. Marking the original field as obsolete and defining a new property feels like it's probably the best option. I suspect most uses will go through IArrowArray.NullCount anyway, so the impact on existing source is likely to be small. The only painful part is probably coming up with a new name... .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah OK, maybe a GetNullCount method instead would make sense, partly because I can't think of a better property name, but then it's also obvious it might do more work than a plain member access.

I guess we also don't necessarily need to deprecate the existing member because there might be cases where users only want the null count if it can be accessed quickly and don't want to pay the cost of computing it, so we could just document that it may return a negative number. And we could remove the readonly modifier to allow updating it if GetNullCount is called, which isn't a breaking change according to the rules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted NullCount back to a normal member but made it non-readonly and added a GetNullCount method instead. I've changed most uses of NullCount to GetNullCount(), except in the C Data interface implementation which explicitly allows -1 values: https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowArray.null_count

Let me know if you'd prefer to deprecate NullCount instead though, or have any other suggestions for how to handle this.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 11, 2024
{
if (_nullCount == RecalculateNullCount)
{
_nullCount = ComputeNullCount();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might result in duplicate work if multiple threads try to access the null count concurrently, but I don't imagine this is a big concern, and the C++ implementation has the same behaviour.

Comment on lines +176 to +178
// Note: Dictionary arrays may be logically null if there is a null in the dictionary values,
// but this isn't accounted for by the IArrowArray.IsNull implementation,
// so we maintain consistency with that behaviour here.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C++ implementation actually has a separate ComputeLogicalNullCount method on ArrayData and Array that handles union arrays, dictionary arrays and run end encoded arrays, and there the null_count for union arrays always returns 0. The C# UnionArray.IsValid method depends on the NullCount value though, so I figured it made sense to compute the actual null count for union arrays.

@@ -111,7 +123,25 @@ public ArrayData Slice(int offset, int length)
length = Math.Min(Length - offset, length);
offset += Offset;

return new ArrayData(DataType, length, RecalculateNullCount, offset, Buffers, Children, Dictionary);
int nullCount;
if (NullCount == 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would force calculation of the null count on the original array. Is it worth preserving laziness and saying "if we haven't calculated the original null count then mark the sliced null count as requiring calculation"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, yes definitely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was fixed as a side effect of reverting NullCount back to being a normal int member.

return Length - BitUtility.CountBits(Buffers[0].Span, Offset, Length);
}

private int ComputeSparseUnionNullCount()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth moving these as statics onto SparseUnionArray and DenseUnionArray respectively so that the type-specific logic remains in one place?

Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is great! I think we should preserve binary compatibility by adding a new property.

@@ -71,7 +71,7 @@ public ArrayDataConcatenationVisitor(IReadOnlyList<ArrayData> arrayDataList, Mem
foreach (ArrayData arrayData in _arrayDataList)
{
_totalLength += arrayData.Length;
_totalNullCount += arrayData.NullCount;
_totalNullCount += arrayData.GetNullCount();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could make sense to check NullCount for -1 here, and then set the final null count to -1 if it was -1 in any array, rather than forcing a computation. But this class has bigger problems as it doesn't account for non-zero offsets anywhere as far as I can see, although it does seem to account for array lengths being less than the full buffer size. Maybe it should throw a NotImplementedException if it encounters a non-zero offset?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you file a separate bug for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep sure, I've made #41164

@CurtHagenlocher CurtHagenlocher merged commit f20c1e8 into apache:main Apr 12, 2024
12 checks passed
@CurtHagenlocher CurtHagenlocher removed the awaiting committer review Awaiting committer review label Apr 12, 2024
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Apr 12, 2024
@adamreeve adamreeve deleted the recompute_null_count branch April 12, 2024 03:01
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f20c1e8.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request Apr 15, 2024
…apache#41144)

### Rationale for this change

See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays.

### What changes are included in this PR?

Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`.

### Are these changes tested?

Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling.

### Are there any user-facing changes?

Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1.

* GitHub Issue: apache#41136

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
raulcd pushed a commit that referenced this pull request Apr 29, 2024
### Rationale for this change

See #41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays.

### What changes are included in this PR?

Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`.

### Are these changes tested?

Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling.

### Are there any user-facing changes?

Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1.

* GitHub Issue: #41136

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…apache#41144)

### Rationale for this change

See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays.

### What changes are included in this PR?

Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`.

### Are these changes tested?

Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling.

### Are there any user-facing changes?

Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1.

* GitHub Issue: apache#41136

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
…apache#41144)

### Rationale for this change

See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays.

### What changes are included in this PR?

Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`.

### Are these changes tested?

Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling.

### Are there any user-facing changes?

Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1.

* GitHub Issue: apache#41136

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…apache#41144)

### Rationale for this change

See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays.

### What changes are included in this PR?

Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`.

### Are these changes tested?

Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling.

### Are there any user-facing changes?

Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1.

* GitHub Issue: apache#41136

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…apache#41144)

### Rationale for this change

See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays.

### What changes are included in this PR?

Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`.

### Are these changes tested?

Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling.

### Are there any user-facing changes?

Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1.

* GitHub Issue: apache#41136

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…apache#41144)

### Rationale for this change

See apache#41136. This is required for writing sliced arrays to IPC files and streams. It should also be generally useful too by allowing users to be able to rely on valid null count values for sliced arrays.

### What changes are included in this PR?

Adds a new `ArrayData.GetNullCount()` method which computes the actual null count if it is unknown, and changes most uses of `ArrayData.NullCount` to `GetNullCount()`.

### Are these changes tested?

Yes, I've extended the existing slicing tests to also verify the null count, and added new tests specifically for union arrays, which needed special handling.

### Are there any user-facing changes?

Yes, this is a user-facing behaviour change. I don't expect that it should break any existing code as users are unlikely to rely on the NullCount being -1.

* GitHub Issue: apache#41136

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants