GH-34315: [C++] Correct is_null kernel for Union and RunEndEncoded logical nulls #35036

jorisvandenbossche · 2023-04-11T10:39:40Z

Rationale for this change

Currently the is_null kernel always returns all-False for union and run-end-encoded types, since those don't have a top-level validity buffer. Update the kernel to take the logical nulls into account.
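To illustrate what "logical nulls" means for a type without a top-level validity buffer, here is a minimal standalone sketch for a sparse union. It does not use Arrow's real classes: `ToySparseUnion` and `LogicalIsNull` are hypothetical names, and the sketch assumes the type id can be used directly as the child index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a sparse union (not Arrow's real data structures): every child
// has the same length as the union, and type_ids[i] selects which child holds
// slot i. Assumption of this sketch: the type id is also the child index.
struct ToySparseUnion {
  std::vector<int8_t> type_ids;
  // child_valid[c][i] tells whether slot i of child c is valid.
  std::vector<std::vector<bool>> child_valid;
};

// Slot i is logically null when the child selected by type_ids[i] is null at i.
std::vector<bool> LogicalIsNull(const ToySparseUnion& u) {
  std::vector<bool> out(u.type_ids.size(), false);
  for (std::size_t i = 0; i < u.type_ids.size(); ++i) {
    const auto child = static_cast<std::size_t>(u.type_ids[i]);
    assert(child < u.child_valid.size());
    out[i] = !u.child_valid[child][i];
  }
  return out;
}
```

A kernel that only looks at the (absent) top-level validity buffer would report all-False here, which is the behaviour this PR corrects.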

Are these changes tested?

Yes, both in Python and C++

Are there any user-facing changes?

Yes, this changes (corrects) the behaviour of the is_null kernel.

Closes: [Python] pa.compute.is_null() returns incorrect answer for dense union arrays and segfaults for dense union scalars #34315


jorisvandenbossche · 2023-04-11T10:40:03Z

cc @felipecrv (this is very much a draft: it still needs more tests, doc comments, the same changes applied to is_valid, etc., but I would already like to get your thoughts on the implementation approach)

jorisvandenbossche · 2023-04-11T10:56:53Z

cpp/src/arrow/compute/kernels/scalar_validity.cc

@@ -102,6 +104,15 @@ Status IsNullExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
    bit_util::SetBitsTo(out_bitmap, out_span->offset, out_span->length, false);
  }

+  const auto t = arr.type->id();
+  if (t == Type::SPARSE_UNION) {
+    union_util::SetLogicalSparseUnionNullBits(arr, out_bitmap, out_span->offset);


I followed the same pattern as we had for SetNanBits (used just below, i.e. updating the previously created bitmap), though that is not necessarily the best approach.
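The "update the previously created bitmap" pattern can be sketched as a second pass that only ever sets additional bits in an output bitmap that an earlier step has already filled. This is a toy version with hypothetical names (`SetBit`, `GetBit`, `SetNanBitsSketch`), not the actual Arrow helper; it assumes the least-significant-bit-first bit order that Arrow bitmaps use.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Set bit i of a bitmap stored least-significant-bit first.
inline void SetBit(uint8_t* bits, int64_t i) {
  bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
}

// Read bit i of a bitmap stored least-significant-bit first.
inline bool GetBit(const uint8_t* bits, int64_t i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Second pass over an is_null bitmap that an earlier step already filled:
// additionally mark NaN values as null. Bits that are already set stay set.
void SetNanBitsSketch(const std::vector<double>& values, uint8_t* out_bitmap,
                      int64_t out_offset) {
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (std::isnan(values[i])) {
      SetBit(out_bitmap, out_offset + static_cast<int64_t>(i));
    }
  }
}
```

The logical-null helpers for union and run-end-encoded inputs would follow the same shape: take the input, the output bitmap and the output offset, and set bits in place.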


jorisvandenbossche · 2023-04-13T13:35:20Z

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

+//   ASSERT_OK(ree_array->ValidateFull());
+//   auto expected =
+//       ArrayFromJSON(boolean(), "[false, false, false, true, true, false, false]");
+//   CheckScalarUnary("is_null", ree_array, expected);


This test is passing, except for the variant that runs it with scalars, because ArraySpan::FillFromScalar is not implemented for RunEndEncoded. cc @felipecrv

I thought I had implemented that 🤔

Thanks for adding it! I'm gonna use it for something I'm doing as well.

felipecrv · 2023-04-13T14:15:33Z

cpp/src/arrow/array/data.cc

-  // Populate null count and validity bitmap (only for non-union/null types)
-  this->null_count = value.is_valid ? 0 : 1;
-  if (!is_union(type_id) && type_id != Type::NA) {
+  // Populate null count and validity bitmap (only for non-union/ree/null types)


this comment can now be removed as the if checks wrapping the null_count assignments document the context of each assignment

felipecrv · 2023-04-13T14:27:24Z

cpp/src/arrow/array/data.cc

+  } else if (type_id == Type::RUN_END_ENCODED) {
+    const auto& scalar = checked_cast<const RunEndEncodedScalar&>(value);
+    this->child_data.resize(2);
+    auto run_end_type = scalar.run_end_type();


auto& to elide one refcount increment

felipecrv · 2023-04-13T14:40:48Z

cpp/src/arrow/compute/kernels/scalar_validity.cc

+  const auto t = arr.type->id();
+  if (t == Type::SPARSE_UNION) {
+    SetSparseUnionLogicalNullBits(arr, out_bitmap, out_span->offset);
+  } else if (t == Type::DENSE_UNION) {
+    SetDenseUnionLogicalNullBits(arr, out_bitmap, out_span->offset);
+  } else if (t == Type::RUN_END_ENCODED) {
+    SetREELogicalNullBits(arr, out_bitmap, out_span->offset);
+  }
+


This whole block can be moved into the else above to make it more clear that this only gets called after a complete zeroing of out_bitmap.

Now we have to know about a logical implication that is not explicit in the code: t \in {UNION, REE} \implies arr.GetNullCount() == 0. That is what in turn guarantees out_bitmap is zeroed before the Set*LogicalNullBits. Too much logic. 😅

The comment "// Input has no nulls => output is entirely false." also needs to be more nuanced after that move.
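The suggested control flow can be sketched as follows: the logical-null pass lives inside the branch that zeroes the output, so the "output is zeroed first" precondition is visible in the code instead of being implied. This is a toy model with hypothetical names (`ToyArray`, `ToyKind`, `IsNullSketch`), and it assumes the dense union's type ids are also child indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class ToyKind { kPlain, kDenseUnion };

// Toy array, not Arrow's ArraySpan. An empty `validity` means "no top-level
// validity bitmap", which is always the case for the union kind.
struct ToyArray {
  ToyKind kind = ToyKind::kPlain;
  std::size_t length = 0;
  std::vector<bool> validity;
  // Dense-union-only fields: slot i lives at child_valid[type_ids[i]][offsets[i]].
  std::vector<int8_t> type_ids;
  std::vector<int32_t> offsets;
  std::vector<std::vector<bool>> child_valid;
};

// Precondition: `out` is all-false for the slots of `arr`.
void SetDenseUnionLogicalNullBits(const ToyArray& arr, std::vector<bool>* out) {
  for (std::size_t i = 0; i < arr.length; ++i) {
    const auto child = static_cast<std::size_t>(arr.type_ids[i]);
    const auto pos = static_cast<std::size_t>(arr.offsets[i]);
    if (!arr.child_valid[child][pos]) (*out)[i] = true;
  }
}

std::vector<bool> IsNullSketch(const ToyArray& arr) {
  std::vector<bool> out(arr.length, false);
  if (!arr.validity.empty()) {
    // A physical validity bitmap exists: the output is its inverse.
    assert(arr.validity.size() == arr.length);
    for (std::size_t i = 0; i < arr.length; ++i) out[i] = !arr.validity[i];
  } else {
    // No physical nulls, so the output is entirely false at this point.
    // Only in this branch are logical nulls layered on top.
    if (arr.kind == ToyKind::kDenseUnion) SetDenseUnionLogicalNullBits(arr, &out);
  }
  return out;
}
```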

felipecrv · 2023-05-17T12:28:11Z

cpp/src/arrow/array/data.cc

@@ -236,6 +236,19 @@ void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_siz
  span->buffers[buffer_index].size = 2 * sizeof(offset_type);
 }

+template <typename ArrowType>


nitpick: To be consistent with REE code elsewhere: RunEndType.

felipecrv · 2023-05-17T12:28:58Z

cpp/src/arrow/array/data.cc

+void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type,
+                     uint64_t* scratch_space) {
+  // Create a lenght-1 array with the value 1 for a REE scalar
+  using run_type = typename ArrowType::c_type;


run_end_type (run_type is ambiguous -- could be the type of the value in the run).

felipecrv · 2023-05-17T12:29:53Z

cpp/src/arrow/array/data.cc

+template <typename ArrowType>
+void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type,
+                     uint64_t* scratch_space) {
+  // Create a lenght-1 array with the value 1 for a REE scalar


length typo

I think the comment can be removed after the other changes that clarify the code are made.

felipecrv · 2023-05-17T12:31:56Z

cpp/src/arrow/array/data.cc

@@ -236,6 +236,19 @@ void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_siz
  span->buffers[buffer_index].size = 2 * sizeof(offset_type);
 }

+template <typename ArrowType>
+void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type,


FillRunEndsArrayFromScalar? s/type/run_end_type

felipecrv · 2023-05-17T12:33:23Z

cpp/src/arrow/array/data.cc

+  buffer[0] = static_cast<run_type>(1);
+  auto data_buffer =
+      std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(run_type));
+  auto data = ArrayData::Make(type, 1, {nullptr, data_buffer}, 0);


std::move(type) and std::move(data_buffer)

felipecrv · 2023-05-17T12:38:30Z

cpp/src/arrow/array/data.cc

+  auto data_buffer =
+      std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(run_type));
+  auto data = ArrayData::Make(type, 1, {nullptr, data_buffer}, 0);
+  span->child_data[0].SetMembers(*data);


I think it's better if the caller passes this->child_data[0] to this instead of the span. Making it obvious at the callsite how both child_data are filled.

You could derive the scratch_space parameter from that as well: making you use the child_data[0].scratch_space instead of the parent's scratch_space.

felipecrv · 2023-05-17T12:58:18Z

cpp/src/arrow/array/data.cc

+        break;
+      default:
+        DCHECK_EQ(run_end_type->id(), Type::INT64);
+        SetRunForScalar<Int64Type>(this, run_end_type, this->scratch_space);


You're getting that stack trace you posted above from this new implementation?

I think this will be less risky if you use the this->child_data[0].scratch_space instead of this->scratch_space for the run-ends array buffer. I think that's the contract: an ArraySpan is free to use the scratch_space it directly owns.

> You're getting that stack trace you posted above from this new implementation?

No, that was with the previous implementation. This last commit was to fix that

But thanks for the comments! Will clean up.

westonpace · 2023-05-18T13:18:20Z

I don't know if it's related or not but there is some danger with the way we handle scratch space (#35581)

felipecrv · 2023-05-18T15:37:46Z

cpp/src/arrow/array/data.cc

+      std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(RunEndCType));
+  auto data =
+      ArrayData::Make(std::move(run_end_type), 1, {nullptr, std::move(data_buffer)}, 0);
+  span->SetMembers(*data);


Hmmmm... ArraySpan is non-owning, so it won't keep this data available. Sorry for not noticing this before.

Feel free to set the members directly. One by one here. That skips allocating and deallocating an ArrayData which is the point of using ArraySpans in the first-place.

The run_end_type should be a pointer coming directly from the parent REE type, so don't pass it as a shared_ptr copy.

The same is done for Dictionary though:

  // Populate dictionary data
  const auto& dict_scalar = checked_cast<const DictionaryScalar&>(value);
  this->child_data.resize(1);
  this->child_data[0].SetMembers(*dict_scalar.value.dictionary->data());

https://github.com/apache/arrow/blob/fbe5f641d327ee81db00ce5f056940a69f4d8603/cpp/src/arrow/array/data.cc#L325C1-L328

Or is that fine because that is data owned by the scalar?

> Or is that fine because that is data owned by the scalar?

Yes.
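The ownership point in this thread can be sketched with a toy non-owning view: the run-ends child points into scratch space that the child span itself owns, and its members are set one by one without allocating an intermediate owning object. `ToySpan` and `FillRunEndsForScalar` are hypothetical names for this sketch and are only loosely modelled on the ArraySpan idea; the real member layout differs.

```cpp
#include <cstdint>
#include <cstring>

// Toy non-owning view: it stores a raw pointer only, so whatever it points at
// must stay alive for as long as the view is used.
struct ToySpan {
  const uint8_t* data = nullptr;
  int64_t size_bytes = 0;
  int64_t length = 0;
  // Small inline storage that this span itself owns and is free to use.
  uint64_t scratch_space[2] = {0, 0};
};

// Fill `span` as a length-1 run-ends array holding the single run end 1.
// The value is written into the span's own scratch space, so nothing is
// allocated and nothing dangles once this function returns.
template <typename RunEndCType>
void FillRunEndsForScalar(ToySpan* span) {
  static_assert(sizeof(RunEndCType) <= sizeof(ToySpan::scratch_space),
                "run end must fit in the scratch space");
  const RunEndCType one = 1;
  std::memcpy(span->scratch_space, &one, sizeof(one));
  span->data = reinterpret_cast<const uint8_t*>(span->scratch_space);
  span->size_bytes = static_cast<int64_t>(sizeof(RunEndCType));
  span->length = 1;
}
```

One caveat of this toy design: copying a `ToySpan` leaves `data` pointing at the original's scratch space, so the copy is only safe while the original is alive.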


pitrou · 2023-05-31T10:01:57Z

cpp/src/arrow/compute/kernels/scalar_validity.cc

+void SetREELogicalNullBits(const ArraySpan& span, uint8_t* out_bitmap,
+                           int64_t out_offset) {
+  const auto& values = arrow::ree_util::ValuesArray(span);
+  DCHECK(!is_nested(values.type->id()));


AFAIK, it is not forbidden to have nested REE values. So this should be turned into an error return (perhaps Status::NotImplemented) at a higher level.

However, instead of refusing to implement this, you could use another strategy:

1. call IsNull on the values array

2. REE-decode the REE array comprised of (run ends, run values IsNull)
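The two steps above can be sketched on plain vectors. `DecodeIsNull` is a hypothetical helper, not an Arrow function: it takes the run ends together with the per-run is_null results (step 1, computed elsewhere) and expands them to one boolean per logical element (step 2).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// run_ends[k] is the exclusive logical end of run k (strictly increasing).
// values_is_null[k] is the is_null result for the value of run k.
std::vector<bool> DecodeIsNull(const std::vector<int64_t>& run_ends,
                               const std::vector<bool>& values_is_null) {
  assert(run_ends.size() == values_is_null.size());
  std::vector<bool> out;
  if (!run_ends.empty()) out.reserve(static_cast<std::size_t>(run_ends.back()));
  int64_t start = 0;
  for (std::size_t k = 0; k < run_ends.size(); ++k) {
    assert(run_ends[k] > start);
    // Repeat the per-run answer once for every logical element in the run.
    out.insert(out.end(), static_cast<std::size_t>(run_ends[k] - start),
               values_is_null[k]);
    start = run_ends[k];
  }
  return out;
}
```

Because step 1 is just is_null applied to the values array, this approach works for any values type that is_null itself supports.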

I've been meaning to send a PR forbidding nested REEs because it's pointless to run-end encode a values array, since all the runs already have length 1 and the run-ends array contains a strictly increasing sequence of indexes.

@felipecrv I think you're talking about something else? "Nested" here means any type with child fields (e.g. list, struct...).

> you could use another strategy:

That would indeed be a good alternative and would be more robust for whatever type is used for the REE values. I did a quick benchmark comparing both strategies in Python:

In [2]: run_lengths = np.random.randint(1, 10, 100_000)
In [3]: run_values = [1, 2, 3, 4, None] * 20000
In [4]: arr = pa.RunEndEncodedArray.from_arrays(run_lengths.cumsum(), run_values)
In [5]: res1 = pc.is_null(arr)
In [6]: res2 = pc.run_end_decode(pa.RunEndEncodedArray.from_arrays(np.asarray(arr.run_ends), pc.is_null(arr.values)))
In [7]: res1.equals(res2)
Out[7]: True
In [8]: %timeit pc.is_null(arr)
309 µs ± 843 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: %timeit pc.run_end_decode(pa.RunEndEncodedArray.from_arrays(np.asarray(arr.run_ends), pc.is_null(arr.values)))
1.07 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This is running with this branch (in release mode), so pc.is_null is using this PR's implementation, and the other is the python equivalent of what you propose (IIUC).
The alternative seems significantly slower, although I don't know how much of that is due to overhead of going through python several times.

No idea, but the proposal here is to make things correct first ;-) It's also possible that run-end-decoding could be optimized...

Based on a quick profile, a significant part of the time is spent in the actual RunEndDecode impl, indicating it's not just from the Python overhead. But it's also true that this implementation is not necessarily optimized ;)

Now maybe the DCHECK(!is_nested(values.type->id())); isn't actually correct. What it is protecting against is that I am getting the first buffer of the data and assuming this is a validity bitmap. But that assumption only fails for REE/union, not for "nested" types in general.

Although for supporting nan_is_null, that will still be easier to do through recursively calling the is_null kernel. Will take a look at doing it that way.

@pitrou I misunderstood "nested REE" as a REE inside another REE.

pitrou · 2023-05-31T10:02:54Z

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

+  auto expected =
+      ArrayFromJSON(boolean(), "[false, false, false, true, true, false, false]");
+  CheckScalarUnary("is_null", ree_array, expected);
+}


Can you add a test with floating-point run end values and clearing/setting the nan_is_null option?
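What such a test would exercise can be sketched on plain vectors: with floating-point run values, a NaN run counts as null only when nan_is_null is set, while a null run value is null either way. `RunValueIsNull` and `IsNullRunEndEncoded` are hypothetical names for this toy model (C++17, for `std::optional`), not Arrow functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// std::nullopt models a null run value.
bool RunValueIsNull(const std::optional<double>& value, bool nan_is_null) {
  if (!value.has_value()) return true;
  return nan_is_null && std::isnan(*value);
}

// Compute is_null once per run, then repeat it for each element of the run.
std::vector<bool> IsNullRunEndEncoded(
    const std::vector<int64_t>& run_ends,
    const std::vector<std::optional<double>>& run_values, bool nan_is_null) {
  assert(run_ends.size() == run_values.size());
  std::vector<bool> out;
  int64_t start = 0;
  for (std::size_t k = 0; k < run_ends.size(); ++k) {
    const bool is_null = RunValueIsNull(run_values[k], nan_is_null);
    out.insert(out.end(), static_cast<std::size_t>(run_ends[k] - start), is_null);
    start = run_ends[k];
  }
  return out;
}
```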

pitrou · 2023-05-31T10:03:27Z

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

+                       DenseUnionArray::Make(*type_ids, *value_offsets,
+                                             {dense_field_i64, dense_field_str}));
+  CheckScalarUnary("is_null", arr2, expected);
+}


Can you add floating-point child union values and test with/without the nan_is_null option set?

pitrou · 2023-05-31T10:04:21Z

python/pyarrow/tests/test_compute.py

@@ -1637,6 +1637,48 @@ def test_is_null():
    assert result.equals(expected)


+def test_is_null_union():


Is it necessary to add these tests on the Python side? You have similar tests in C++ already.

pitrou · 2023-05-31T10:05:57Z

cpp/src/arrow/array/data.cc

@@ -236,6 +236,18 @@ void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_siz
  span->buffers[buffer_index].size = 2 * sizeof(offset_type);
 }

+template <typename RunEndType>
+void FillRunEndsArrayForScalar(ArraySpan* span, const DataType* run_end_type) {


Are these changes related to this PR?

If so, can you perhaps add a test for FillFromScalar on a run-end-encoded scalar? For example in arrow/array/array_run_end_test.cc.

Yes, the tests for scalar kernels automatically run the kernel also on actual scalars, and then this ends up creating an array from the scalar. So those changes to FillFromScalar are needed to be able to run any kernel on a REE scalar.

At the moment we don't actually have any test for FillFromScalar directly (also not for other types), only through testing the kernels on scalars.

pitrou · 2023-05-31T10:10:23Z

cpp/src/arrow/array/data.cc

+  // Populate null count and validity bitmap
+  if (type_id == Type::NA) {
+    this->null_count = 1;
+  } else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) {


Suggested change

-  } else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) {
+  } else if (!HasValidityBitmap(type_id)) {
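The intent of the suggestion can be sketched with a single predicate that replaces the hand-written list of types. The enum and both functions below are toy stand-ins with hypothetical names (`ToyTypeId`, `ToyHasValidityBitmap`, `NullCountForScalar`), not Arrow's real definitions; the branch order mirrors the diff under review.

```cpp
#include <cstdint>

// Toy subset of type ids; the real enum has many more members.
enum class ToyTypeId { kNA, kInt64, kSparseUnion, kDenseUnion, kRunEndEncoded };

// Types whose arrays carry no top-level validity bitmap.
constexpr bool ToyHasValidityBitmap(ToyTypeId id) {
  switch (id) {
    case ToyTypeId::kNA:
    case ToyTypeId::kSparseUnion:
    case ToyTypeId::kDenseUnion:
    case ToyTypeId::kRunEndEncoded:
      return false;
    default:
      return true;
  }
}

// Null count for a length-1 array built from a scalar.
int64_t NullCountForScalar(ToyTypeId id, bool is_valid) {
  if (id == ToyTypeId::kNA) return 1;         // the null type is always null
  if (!ToyHasValidityBitmap(id)) return 0;    // nulls are only logical here
  return is_valid ? 0 : 1;                    // regular physical validity
}
```

A newly added type without a validity bitmap would then only need to be handled in the predicate, not at every call site.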

pitrou · 2023-05-31T10:10:33Z

cpp/src/arrow/array/data.cc

+  } else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) {
+    this->null_count = 0;
+  } else {
+    this->null_count = value.is_valid ? 0 : 1;


We do, but they have the same semantics as dictionary arrays: the top-level validity refers to the indices, regardless of the underlying dictionary values.

pitrou · 2023-05-31T10:12:22Z

cpp/src/arrow/array/data.cc

+  } else if (type_id == Type::RUN_END_ENCODED) {
+    const auto& scalar = checked_cast<const RunEndEncodedScalar&>(value);
+    this->child_data.resize(2);
+    auto& run_end_type = scalar.run_end_type();


Nit: style

Suggested change

-    auto& run_end_type = scalar.run_end_type();
+    const auto& run_end_type = scalar.run_end_type();
