Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34315: [C++] Correct is_null kernel for Union and RunEndEncoded logical nulls #35036

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Apr 11, 2023

Rationale for this change

Currently the is_null kernel always returns all-False for union and run-end-encoded types, since those don't have a top-level validity buffer. Update the kernel to take the logical nulls into account.

Are these changes tested?

Yes, both in Python and C++

Are there any user-facing changes?

Yes, this changes (corrects) the behaviour of the is_null kernel.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Apr 11, 2023

cc @felipecrv (this is very much a draft (still needs more tests, doc comments, applying the same changes to is_valid, etc), but already would like to get your thoughts on the implementation approach)

@github-actions

This comment was marked as off-topic.

@github-actions

This comment was marked as off-topic.

@@ -102,6 +104,15 @@ Status IsNullExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
bit_util::SetBitsTo(out_bitmap, out_span->offset, out_span->length, false);
}

const auto t = arr.type->id();
if (t == Type::SPARSE_UNION) {
union_util::SetLogicalSparseUnionNullBits(arr, out_bitmap, out_span->offset);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the same pattern as we had for SetNanBits (used just below, i.e. updating the previously created bitmap), not that this is necessarily the best approach though

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 11, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 13, 2023
// ASSERT_OK(ree_array->ValidateFull());
// auto expected =
// ArrayFromJSON(boolean(), "[false, false, false, true, true, false, false]");
// CheckScalarUnary("is_null", ree_array, expected);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is passing, except for the variant that runs it with scalars, because ArraySpan::FillFromScalar is not implemented for RunEndEncoded cc @felipecrv

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I had implemented that 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding it! I'm gonna use it for something I'm doing as well.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 13, 2023
@github-actions github-actions bot removed the awaiting changes Awaiting changes label Apr 13, 2023
// Populate null count and validity bitmap (only for non-union/null types)
this->null_count = value.is_valid ? 0 : 1;
if (!is_union(type_id) && type_id != Type::NA) {
// Populate null count and validity bitmap (only for non-union/ree/null types)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment can now be removed as the if checks wrapping the null_count assignments document the context of each assignment

} else if (type_id == Type::RUN_END_ENCODED) {
const auto& scalar = checked_cast<const RunEndEncodedScalar&>(value);
this->child_data.resize(2);
auto run_end_type = scalar.run_end_type();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto& to elide one refcount increment

Comment on lines 172 to 180
const auto t = arr.type->id();
if (t == Type::SPARSE_UNION) {
SetSparseUnionLogicalNullBits(arr, out_bitmap, out_span->offset);
} else if (t == Type::DENSE_UNION) {
SetDenseUnionLogicalNullBits(arr, out_bitmap, out_span->offset);
} else if (t == Type::RUN_END_ENCODED) {
SetREELogicalNullBits(arr, out_bitmap, out_span->offset);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole block can be moved into the else above to make it more clear that this only gets called after a complete zeroing of out_bitmap.

Now we have to know about a logical implication that is not explicit in the code: t \in {UNION, REE} \implies arr.GetNullCount() == 0. That is what in turn guarantees out_bitmap is zeroed before the Set*LogicalNullBits. Too much logic. 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The // Input has no nulls => output is entirely false. also needs to be more nuanced after that move.

@github-actions github-actions bot added the awaiting review Awaiting review label Apr 13, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 17, 2023
@@ -236,6 +236,19 @@ void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_siz
span->buffers[buffer_index].size = 2 * sizeof(offset_type);
}

template <typename ArrowType>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: To be consistent with REE code elsewhere: RunEndType.

void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type,
uint64_t* scratch_space) {
// Create a lenght-1 array with the value 1 for a REE scalar
using run_type = typename ArrowType::c_type;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_end_type (run_type is ambiguous -- could be the type of the value in the run).

template <typename ArrowType>
void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type,
uint64_t* scratch_space) {
// Create a lenght-1 array with the value 1 for a REE scalar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

length typo

Copy link
Contributor

@felipecrv felipecrv May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment can be removed after the other changes that clarify the code are made.

@@ -236,6 +236,19 @@ void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_siz
span->buffers[buffer_index].size = 2 * sizeof(offset_type);
}

template <typename ArrowType>
void SetRunForScalar(ArraySpan* span, std::shared_ptr<DataType> type,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FillRunEndsArrayFromScalar? s/type/run_end_type

buffer[0] = static_cast<run_type>(1);
auto data_buffer =
std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(run_type));
auto data = ArrayData::Make(type, 1, {nullptr, data_buffer}, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::move(type) and std::move(data_buffer)

auto data_buffer =
std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(run_type));
auto data = ArrayData::Make(type, 1, {nullptr, data_buffer}, 0);
span->child_data[0].SetMembers(*data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better if the caller passes this->child_data[0] to this instead of the span. Making it obvious at the callsite how both child_data are filled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could derive the scratch_space parameter from that as well: making you use the child_data[0].scratch_space instead of the parent's scratch_space.

break;
default:
DCHECK_EQ(run_end_type->id(), Type::INT64);
SetRunForScalar<Int64Type>(this, run_end_type, this->scratch_space);
Copy link
Contributor

@felipecrv felipecrv May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're getting that stack trace you posted above from this new implementation?

I think this will be less risky if you use the this->child_data[0].scratch_space instead of this->scratch_space for the run-ends array buffer. I think that's the contract: an ArraySpan is free to use the scratch_space it directly owns.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're getting that stack trace you posted above from this new implementation?

No, that was with the previous implementation. This last commit was to fix that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But thanks for the comments! Will clean-up

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 17, 2023
@westonpace
Copy link
Member

I don't know if it's related or not but there is some danger with the way we handle scratch space (#35581 )

std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(buffer), sizeof(RunEndCType));
auto data =
ArrayData::Make(std::move(run_end_type), 1, {nullptr, std::move(data_buffer)}, 0);
span->SetMembers(*data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm... ArraySpan is non-owning, so it won't keep this data available. Sorry for not noticing this before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to set the members directly. One by one here. That skips allocating and deallocating an ArrayData which is the point of using ArraySpans in the first-place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The run_end_type should be a pointer coming directly from the parent REE type, so don't pass it as a shared_ptr copy.

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same is done for Dictionary though:

      // Populate dictionary data
      const auto& dict_scalar = checked_cast<const DictionaryScalar&>(value);
      this->child_data.resize(1);
      this->child_data[0].SetMembers(*dict_scalar.value.dictionary->data());

https://github.com/apache/arrow/blob/fbe5f641d327ee81db00ce5f056940a69f4d8603/cpp/src/arrow/array/data.cc#L325C1-L328

Or is that fine because that is data owned by the scalar?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is that fine because that is data owned by the scalar?

Yes.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 23, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 23, 2023
@jorisvandenbossche jorisvandenbossche requested a review from pitrou May 31, 2023 09:25
void SetREELogicalNullBits(const ArraySpan& span, uint8_t* out_bitmap,
int64_t out_offset) {
const auto& values = arrow::ree_util::ValuesArray(span);
DCHECK(!is_nested(values.type->id()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, it is not forbidden to have nested REE values. So this should be turned into an error return (perhaps Status::NotImplemented) at a higher level.

However, instead of refusing to implement this, you could use another strategy:

  • call IsNull on the values array
  • REE-decode the REE array comprised of (run ends, run values IsNull)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been meaning to send a PR forbidding nested REEs because it's pointless to run-end encode a values array since all the runs already have length 1 and the run-end arrays contains a strictly increasing sequence of indexes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felipecrv I think you're talking about something else? "Nested" here means any type with child fields (e.g. list, struct...).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use another strategy:

That would indeed be a good alternative and would be more robust for whathever type is used for the REE values. I did a quick benchmark comparing both strategies in python:

In [2]: run_lengths = np.random.randint(1, 10, 100_000)

In [3]: run_values = [1, 2, 3, 4, None] * 20000

In [4]: arr = pa.RunEndEncodedArray.from_arrays(run_lengths.cumsum(), run_values)

In [5]: res1 = pc.is_null(arr)

In [6]: res2 = pc.run_end_decode(pa.RunEndEncodedArray.from_arrays(np.asarray(arr.run_ends), pc.is_null(arr.values)))

In [7]: res1.equals(res2)
Out[7]: True

In [8]: %timeit pc.is_null(arr)
309 µs ± 843 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit pc.run_end_decode(pa.RunEndEncodedArray.from_arrays(np.asarray(arr.run_ends), pc.is_null(arr.values)))
1.07 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This is running with this branch (in release mode), so pc.is_null is using this PR's implementation, and the other is the python equivalent of what you propose (IIUC).
The alternative seems significantly slower, although I don't know how much of that is due to overhead of going through python several times.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea, but the proposal here is to make things correct first ;-) It's also possible that run-end-decoding could be optimized...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on a quick profile, a significant part of the time is spent in the actual RunEndDecode impl, indicating it's not just from the python overhead. But it's also true that this implementation is not necessary optimized ;)

Now maybe the DCHECK(!is_nested(values.type->id())); isn't actually correct. What it is protecting from is that I am getting the first buffer of the data and assuming this is a validity bitmap. But that's only not true for REE/union, and not for "nested" types in general.

Although for supporting nan_is_null, that will still be easier to do through recursively calling the is_null kernel. Will take a look at doing it that way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou I misunderstood "nested REE" as a REE inside another REE.

auto expected =
ArrayFromJSON(boolean(), "[false, false, false, true, true, false, false]");
CheckScalarUnary("is_null", ree_array, expected);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test with floating-point run end values and clearing/setting the nan_is_null option?

DenseUnionArray::Make(*type_ids, *value_offsets,
{dense_field_i64, dense_field_str}));
CheckScalarUnary("is_null", arr2, expected);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add floating-point child union values and test with/without the nan_is_null option set?

@@ -1637,6 +1637,48 @@ def test_is_null():
assert result.equals(expected)


def test_is_null_union():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to add these tests on the Python side? You have similar tests in C++ already.

@@ -236,6 +236,18 @@ void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_siz
span->buffers[buffer_index].size = 2 * sizeof(offset_type);
}

template <typename RunEndType>
void FillRunEndsArrayForScalar(ArraySpan* span, const DataType* run_end_type) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes related for this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, can you perhaps add a test for FillFromScalar on a run-end-encoded scalar? For example in arrow/array/array_run_end_test.cc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the tests for scalar kernels automatically run the kernel also on actual scalars, and then this ends up creating an array from the scalar. So those changes to FillFromScalar are needed to be able to run any kernel on a REE scalar.

At the moment we don't actually have any test for FillFromScalar directly (also not for other types), only through testing the kernels on scalars.

// Populate null count and validity bitmap
if (type_id == Type::NA) {
this->null_count = 1;
} else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) {
} else if (!HasValidityBitmap(type_id)) {

} else if (is_union(type_id) || type_id == Type::RUN_END_ENCODED) {
this->null_count = 0;
} else {
this->null_count = value.is_valid ? 0 : 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do, but they have the same semantics as dictionary arrays: the top-level validity refers to the indices, regardless of the underlying dictionary values.

} else if (type_id == Type::RUN_END_ENCODED) {
const auto& scalar = checked_cast<const RunEndEncodedScalar&>(value);
this->child_data.resize(2);
auto& run_end_type = scalar.run_end_type();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: style

Suggested change
auto& run_end_type = scalar.run_end_type();
const auto& run_end_type = scalar.run_end_type();

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 31, 2023
bkietz added a commit to bkietz/arrow that referenced this pull request Jul 6, 2023
bkietz added a commit to bkietz/arrow that referenced this pull request Jul 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants