-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-34315: [C++] Correct is_null kernel for Union and RunEndEncoded logical nulls #35036
base: main
Are you sure you want to change the base?
GH-34315: [C++] Correct is_null kernel for Union and RunEndEncoded logical nulls #35036
Conversation
…ded logical nulls
cc @felipecrv (this is very much a draft (still needs more tests, doc comments, applying the same changes to is_valid, etc), but already would like to get your thoughts on the implementation approach) |
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
@@ -102,6 +104,15 @@ Status IsNullExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) { | |||
bit_util::SetBitsTo(out_bitmap, out_span->offset, out_span->length, false); | |||
} | |||
|
|||
const auto t = arr.type->id(); | |||
if (t == Type::SPARSE_UNION) { | |||
union_util::SetLogicalSparseUnionNullBits(arr, out_bitmap, out_span->offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the same pattern as we had for SetNanBits
(used just below, i.e. updating the previously created bitmap), not that this is necessarily the best approach though
// ASSERT_OK(ree_array->ValidateFull()); | ||
// auto expected = | ||
// ArrayFromJSON(boolean(), "[false, false, false, true, true, false, false]"); | ||
// CheckScalarUnary("is_null", ree_array, expected); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is passing, except for the variant that runs it with scalars, because ArraySpan::FillFromScalar
is not implemented for RunEndEncoded cc @felipecrv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I had implemented that 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding it! I'm gonna use it for something I'm doing as well.
cpp/src/arrow/array/data.cc
Outdated
// Populate null count and validity bitmap (only for non-union/null types) | ||
this->null_count = value.is_valid ? 0 : 1; | ||
if (!is_union(type_id) && type_id != Type::NA) { | ||
// Populate null count and validity bitmap (only for non-union/ree/null types) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment can now be removed as the if
checks wrapping the null_count
assignments document the context of each assignment
cpp/src/arrow/array/data.cc
Outdated
} else if (type_id == Type::RUN_END_ENCODED) { | ||
const auto& scalar = checked_cast<const RunEndEncodedScalar&>(value); | ||
this->child_data.resize(2); | ||
auto run_end_type = scalar.run_end_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto&
to elide one refcount increment
const auto t = arr.type->id(); | ||
if (t == Type::SPARSE_UNION) { | ||
SetSparseUnionLogicalNullBits(arr, out_bitmap, out_span->offset); | ||
} else if (t == Type::DENSE_UNION) { | ||
SetDenseUnionLogicalNullBits(arr, out_bitmap, out_span->offset); | ||
} else if (t == Type::RUN_END_ENCODED) { | ||
SetREELogicalNullBits(arr, out_bitmap, out_span->offset); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole block can be moved into the else
above to make it more clear that this only gets called after a complete zeroing of out_bitmap
.
Now we have to know about a logical implication that is not explicit in the code: t \in {UNION, REE} \implies arr.GetNullCount() == 0
. That is what in turn guarantees out_bitmap
is zeroed before the Set*LogicalNullBits
. Too much logic. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The // Input has no nulls => output is entirely false.
also needs to be more nuanced after that move.
cpp/src/arrow/array/data.cc
Outdated
@@ -236,6 +236,19 @@ void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_siz | |||
span->buffers[buffer_index].size = 2 * sizeof(offset_type); | |||
} | |||
|
|||
template <typename ArrowType> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: To be consistent with REE code elsewhere: RunEndType
.
cpp/src/arrow/array/data.cc
Outdated
void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type, | ||
uint64_t* scratch_space) { | ||
// Create a lenght-1 array with the value 1 for a REE scalar | ||
using run_type = typename ArrowType::c_type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
run_end_type
(run_type is ambiguous -- could be the type of the value in the run).
cpp/src/arrow/array/data.cc
Outdated
template <typename ArrowType> | ||
void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type, | ||
uint64_t* scratch_space) { | ||
// Create a lenght-1 array with the value 1 for a REE scalar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
length typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment can be removed after the other changes that clarify the code are made.
cpp/src/arrow/array/data.cc
Outdated
@@ -236,6 +236,19 @@ void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_siz | |||
span->buffers[buffer_index].size = 2 * sizeof(offset_type); | |||
} | |||
|
|||
template <typename ArrowType> | |||
void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FillRunEndsArrayFromScalar
? s/type/run_end_type
cpp/src/arrow/array/data.cc
Outdated
buffer[0] = static_cast<run_type>(1); | ||
auto data_buffer = | ||
std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(run_type)); | ||
auto data = ArrayData::Make(type, 1, {nullptr, data_buffer}, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::move(type)
and std::move(data_buffer)
cpp/src/arrow/array/data.cc
Outdated
auto data_buffer = | ||
std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(run_type)); | ||
auto data = ArrayData::Make(type, 1, {nullptr, data_buffer}, 0); | ||
span->child_data[0].SetMembers(*data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better if the caller passes this->child_data[0]
to this instead of the span
. Making it obvious at the callsite how both child_data
are filled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could derive the scratch_space
parameter from that as well: making you use the child_data[0].scratch_space
instead of the parent's scratch_space
.
cpp/src/arrow/array/data.cc
Outdated
break; | ||
default: | ||
DCHECK_EQ(run_end_type->id(), Type::INT64); | ||
SetRunForScalar<Int64Type>(this, run_end_type, this->scratch_space); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're getting that stack trace you posted above from this new implementation?
I think this will be less risky if you use the this->child_data[0].scratch_space
instead of this->scratch_space
for the run-ends array buffer. I think that's the contract: an ArraySpan
is free to use the scratch_space
it directly owns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're getting that stack trace you posted above from this new implementation?
No, that was with the previous implementation. This last commit was to fix that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But thanks for the comments! Will clean-up
I don't know if it's related or not but there is some danger with the way we handle scratch space (#35581 ) |
cpp/src/arrow/array/data.cc
Outdated
std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(RunEndCType)); | ||
auto data = | ||
ArrayData::Make(std::move(run_end_type), 1, {nullptr, std::move(data_buffer)}, 0); | ||
span->SetMembers(*data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm... ArraySpan
is non-owning, so it won't keep this data
available. Sorry for not noticing this before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to set the members directly. One by one here. That skips allocating and deallocating an ArrayData
which is the point of using ArraySpan
s in the first-place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The run_end_type
should be a pointer coming directly from the parent REE type, so don't pass it as a shared_ptr copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same is done for Dictionary though:
// Populate dictionary data
const auto& dict_scalar = checked_cast<const DictionaryScalar&>(value);
this->child_data.resize(1);
this->child_data[0].SetMembers(*dict_scalar.value.dictionary->data());
Or is that fine because that is data owned by the scalar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is that fine because that is data owned by the scalar?
Yes.
void SetREELogicalNullBits(const ArraySpan& span, uint8_t* out_bitmap, | ||
int64_t out_offset) { | ||
const auto& values = arrow::ree_util::ValuesArray(span); | ||
DCHECK(!is_nested(values.type->id())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, it is not forbidden to have nested REE values. So this should be turned into an error return (perhaps Status::NotImplemented
) at a higher level.
However, instead of refusing to implement this, you could use another strategy:
- call
IsNull
on the values array - REE-decode the REE array comprised of (run ends, run values IsNull)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been meaning to send a PR forbidding nested REEs because it's pointless to run-end encode a values array since all the runs already have length 1 and the run-end arrays contains a strictly increasing sequence of indexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felipecrv I think you're talking about something else? "Nested" here means any type with child fields (e.g. list, struct...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could use another strategy:
That would indeed be a good alternative and would be more robust for whathever type is used for the REE values. I did a quick benchmark comparing both strategies in python:
In [2]: run_lengths = np.random.randint(1, 10, 100_000)
In [3]: run_values = [1, 2, 3, 4, None] * 20000
In [4]: arr = pa.RunEndEncodedArray.from_arrays(run_lengths.cumsum(), run_values)
In [5]: res1 = pc.is_null(arr)
In [6]: res2 = pc.run_end_decode(pa.RunEndEncodedArray.from_arrays(np.asarray(arr.run_ends), pc.is_null(arr.values)))
In [7]: res1.equals(res2)
Out[7]: True
In [8]: %timeit pc.is_null(arr)
309 µs ± 843 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: %timeit pc.run_end_decode(pa.RunEndEncodedArray.from_arrays(np.asarray(arr.run_ends), pc.is_null(arr.values)))
1.07 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This is running with this branch (in release mode), so pc.is_null
is using this PR's implementation, and the other is the python equivalent of what you propose (IIUC).
The alternative seems significantly slower, although I don't know how much of that is due to overhead of going through python several times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea, but the proposal here is to make things correct first ;-) It's also possible that run-end-decoding could be optimized...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on a quick profile, a significant part of the time is spent in the actual RunEndDecode impl, indicating it's not just from the python overhead. But it's also true that this implementation is not necessary optimized ;)
Now maybe the DCHECK(!is_nested(values.type->id()));
isn't actually correct. What it is protecting from is that I am getting the first buffer of the data and assuming this is a validity bitmap. But that's only not true for REE/union, and not for "nested" types in general.
Although for supporting nan_is_null
, that will still be easier to do through recursively calling the is_null kernel. Will take a look at doing it that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou I misunderstood "nested REE" as a REE inside another REE.
auto expected = | ||
ArrayFromJSON(boolean(), "[false, false, false, true, true, false, false]"); | ||
CheckScalarUnary("is_null", ree_array, expected); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test with floating-point run end values and clearing/setting the nan_is_null
option?
DenseUnionArray::Make(*type_ids, *value_offsets, | ||
{dense_field_i64, dense_field_str})); | ||
CheckScalarUnary("is_null", arr2, expected); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add floating-point child union values and test with/without the nan_is_null
option set?
@@ -1637,6 +1637,48 @@ def test_is_null(): | |||
assert result.equals(expected) | |||
|
|||
|
|||
def test_is_null_union(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to add these tests on the Python side? You have similar tests in C++ already.
@@ -236,6 +236,18 @@ void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_siz | |||
span->buffers[buffer_index].size = 2 * sizeof(offset_type); | |||
} | |||
|
|||
template <typename RunEndType> | |||
void FillRunEndsArrayForScalar(ArraySpan* span, const DataType* run_end_type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these changes related for this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, can you perhaps add a test for FillFromScalar
on a run-end-encoded scalar? For example in arrow/array/array_run_end_test.cc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the tests for scalar kernels automatically run the kernel also on actual scalars, and then this ends up creating an array from the scalar. So those changes to FillFromScalar
are needed to be able to run any kernel on a REE scalar.
At the moment we don't actually have any test for FillFromScalar
directly (also not for other types), only through testing the kernels on scalars.
// Populate null count and validity bitmap | ||
if (type_id == Type::NA) { | ||
this->null_count = 1; | ||
} else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) { | |
} else if (!HasValidityBitmap(type_id)) { |
} else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) { | ||
this->null_count = 0; | ||
} else { | ||
this->null_count = value.is_valid ? 0 : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do, but they have the same semantics as dictionary arrays: the top-level validity refers to the indices, regardless of the underlying dictionary values.
} else if (type_id == Type::RUN_END_ENCODED) { | ||
const auto& scalar = checked_cast<const RunEndEncodedScalar&>(value); | ||
this->child_data.resize(2); | ||
auto& run_end_type = scalar.run_end_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: style
auto& run_end_type = scalar.run_end_type(); | |
const auto& run_end_type = scalar.run_end_type(); |
Rationale for this change
Currently the
is_null
kernel always returns all-False for union and run-end-encoded types, since those don't have a top-level validity buffer. Update the kernel to take the logical nulls into account.Are these changes tested?
Yes, both in Python and C++
Are there any user-facing changes?
Yes, this changes (corrects) the behaviour of the
is_null
kernel.pa.compute.is_null()
returns incorrect answer for dense union arrays and segfaults for dense union scalars #34315