GH-35581: [C++] Store offsets in scalars #36018

bkietz · 2023-06-09T21:03:22Z

ArraySpan contained scratch space inside itself for storing offsets when viewing a scalar as a length=1 array. This could lead to dangling pointers in copies of the ArraySpan since copies' pointers will always refer to the original's scratch space, which may have been destroyed.

This patch moves that scratch space into Scalar itself, which is more conformant to the contract of a view since the scratch space will be valid as long as the Scalar is alive, regardless of how ArraySpans are copied.
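The hazard described above can be sketched with toy types. This is a minimal illustration only: `SelfScratchSpan`, `ToyScalar`, and `ScalarBackedSpan` are hypothetical stand-ins, not Arrow's `ArraySpan` or `Scalar`, and the `mutable` scratch member is an assumption made to keep the sketch short.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration only; not Arrow's real classes.

// Old layout: the view owns the scratch space that its own pointer refers to.
struct SelfScratchSpan {
  const int32_t* offsets = nullptr;
  int32_t scratch[2] = {0, 0};

  void Fill(int32_t value_length) {
    scratch[0] = 0;
    scratch[1] = value_length;
    offsets = scratch;  // points into *this*
  }
};

// New layout: the scalar owns the scratch space; views only point at it.
struct ToyScalar {
  int32_t value_length = 0;
  mutable int32_t scratch[2] = {0, 0};  // assumption: filled when a view is made
};

struct ScalarBackedSpan {
  const int32_t* offsets = nullptr;

  void FillFromScalar(const ToyScalar& scalar) {
    assert(scalar.value_length >= 0);
    scalar.scratch[0] = 0;
    scalar.scratch[1] = scalar.value_length;
    offsets = scalar.scratch;  // valid for as long as the scalar is alive
  }
};

// True if a default copy still points at the ORIGINAL's scratch space,
// which dangles once the original span is destroyed.
inline bool CopyAliasesOriginal() {
  SelfScratchSpan original;
  original.Fill(3);
  SelfScratchSpan copy = original;
  return copy.offsets == original.scratch && copy.offsets != copy.scratch;
}

// True if a copy of a scalar-backed view points at the scalar's storage,
// independent of the lifetime of the span it was copied from.
inline bool CopyPointsAtScalar() {
  ToyScalar scalar;
  scalar.value_length = 3;
  ScalarBackedSpan original;
  original.FillFromScalar(scalar);
  ScalarBackedSpan copy = original;
  return copy.offsets == scalar.scratch && copy.offsets[1] == 3;
}
```

Both helpers only compare pointers while every object is still alive, so the sketch shows the aliasing without ever dereferencing a dangling pointer.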

Closes: [C++] ArraySpan::FillFromScalar is unsafe #35581

bkietz · 2023-06-09T21:04:30Z

@felipecrv

felipecrv · 2023-06-09T22:44:49Z

cpp/CMakePresets.json

@@ -11,6 +11,7 @@
      "hidden": true,
      "generator": "Ninja",
      "cacheVariables": {
+        "CMAKE_EXPORT_COMPILE_COMMANDS": "ON",


I don't think we want to merge this and generate the huge file in CI builds, right?

It's less than a megabyte on my machine; I think it's not expensive and it's worthwhile to enable clang tools by default

I agree that the generated file is not huge, but I also think these presets should enable just what's necessary for development. They serve as documentation and starting points for users. It's easy to enable additional options on top of an existing preset, AFAIK.

But which tools need this in CI? I use this file on my machine for the LSP in my editor.

The presets aren't just for CI builds; as @pitrou says, they're a starting point for users. I would say this is an easy and useful default to provide for users, myself included. I'll move this to a different issue, though, since it's not directly pertinent to offset storage.


pitrou · 2023-06-15T15:27:25Z

This LGTM on the principle.

pitrou · 2023-06-15T15:28:07Z

There are CI failures though.


bkietz · 2023-06-16T13:09:06Z

This PR should probably wait until #35036 is merged

jorisvandenbossche · 2023-07-05T13:32:07Z

> This PR should probably wait until #35036 is merged

It's fine for this to be merged first; I will update my PR afterwards (since I am using the scratch space in the other PR, fixing that here first also makes sense).


bkietz · 2023-07-10T18:27:37Z

@pitrou PTAL

pitrou · 2023-07-10T18:54:52Z

@github-actions crossbow submit -g cpp

github-actions · 2023-07-10T18:57:32Z

Revision: 4ea421a

Submitted crossbow builds: ursacomputing/crossbow @ actions-8dcf3ee3c6

Task	Status
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp
test-debian-11-cpp-amd64
test-debian-11-cpp-i386
test-fedora-35-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-20.04-cpp-minimal-with-formats
test-ubuntu-20.04-cpp-thread-sanitizer
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20

pitrou · 2023-07-11T06:39:42Z

cpp/src/arrow/array/data.cc

    case Type::RUN_END_ENCODED:
-      return 0;
+      return 1;


I merged a bit quickly, why does RUN_END_ENCODED need one buffer here?

Many places in the codebase assume that buffers.size() >= 1, even if buffers[0] == nullptr. When I added test cases which exercised REE scalars those places segfaulted. I thought that requiring buffers.size() >= 1 for REE (as we do for union) was the most expeditious fix

@felipecrv What do you think here? Should we require a one-element buffers vector for REE?

RUN_END_ENCODED doesn't need any buffer, but NA doesn't either, and we return 1 for it here. 🤔

It's a "once you start lying you can't stop lying" kind of problem for GetNumBuffers

FWIW, this was the only place in the codebase which didn't already give REE at least one buffer; the constructors, the builder, ... already did so
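As a rough sketch of the problem in this thread: code that reads slot 0 of the buffers vector unconditionally is only safe if every type reports at least one buffer, even when that slot is null. The types and functions below (`ToyArrayData`, `SlotZeroIsSet`, `ToyNumBuffers`) are hypothetical and do not reproduce Arrow's actual `GetNumBuffers`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins; not Arrow's real classes or functions.
struct ToyBuffer {};

struct ToyArrayData {
  std::vector<std::shared_ptr<ToyBuffer>> buffers;
};

// Written the way the thread describes much calling code: slot 0 is assumed
// to exist, although it may hold nullptr.
inline bool SlotZeroIsSet(const ToyArrayData& data) {
  assert(!data.buffers.empty());  // would fire for a zero-buffer layout
  return data.buffers[0] != nullptr;
}

enum class ToyType { kNull, kRunEndEncoded, kInt32 };

inline int ToyNumBuffers(ToyType type) {
  switch (type) {
    case ToyType::kNull:           // no physical buffer, but slot 0 is reserved
    case ToyType::kRunEndEncoded:  // likewise after this change (was 0)
      return 1;
    case ToyType::kInt32:          // validity bitmap + values
      return 2;
  }
  return 0;
}

// Builds data whose buffer slots all exist but are null.
inline ToyArrayData MakeEmptyData(ToyType type) {
  ToyArrayData data;
  data.buffers.resize(static_cast<std::size_t>(ToyNumBuffers(type)));
  return data;
}
```

With a count of 1, `SlotZeroIsSet(MakeEmptyData(ToyType::kRunEndEncoded))` simply returns false instead of indexing past the end of an empty vector.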


felipecrv · 2023-07-11T15:23:04Z

cpp/src/arrow/array/data.cc

+  ARROW_CHECK_LE(off, length) << "Slice offset (" << off
+                              << ") greater than array length (" << length << ")";


I wonder whether this always performs the int-to-string conversion, even when the check passes.

It shouldn't. The operator<< is only invoked when the assertion fails. Though operator precedence makes me slightly uneasy:

    #define ARROW_CHECK_OR_LOG(condition, level) \
      ARROW_PREDICT_TRUE(condition)              \
          ? ARROW_IGNORE_EXPR(0)                 \
          : ::arrow::util::Voidify() & ARROW_LOG(level) << " Check failed: " #condition " "

We could actually add a test for that, IMHO.
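A test along those lines might look like the sketch below. It rebuilds the ternary-plus-`Voidify` pattern with toy names (`TOY_CHECK`, `ToyVoidify`, `ToyLog`, and `Counted` are all hypothetical), and unlike `ARROW_CHECK` a failed check here only writes to a string stream instead of aborting, so that the failing branch can be observed.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Toy re-creation of the pattern; names are hypothetical.
struct ToyVoidify {
  // operator& binds looser than << but tighter than ?:, so the whole
  // streamed chain becomes its argument and the expression has type void.
  void operator&(std::ostream&) {}
};

inline std::ostringstream& ToyLog() {
  static std::ostringstream log;
  return log;
}

inline int& PayloadEvaluations() {
  static int count = 0;
  return count;
}

// Records each time a streamed operand is actually evaluated.
inline int Counted(int value) {
  ++PayloadEvaluations();
  return value;
}

#define TOY_CHECK(condition) \
  (condition) ? (void)0      \
              : ToyVoidify() & ToyLog() << "Check failed: " #condition " "

inline void CheckPayloadIsLazy() {
  ToyLog().str("");
  PayloadEvaluations() = 0;

  TOY_CHECK(1 <= 2) << "offset " << Counted(7);  // passes: payload skipped
  assert(PayloadEvaluations() == 0);
  assert(ToyLog().str().empty());

  TOY_CHECK(3 <= 2) << "offset " << Counted(7);  // fails: payload evaluated
  assert(PayloadEvaluations() == 1);
  assert(ToyLog().str() == "Check failed: 3 <= 2 offset 7");
}
```

Because the streamed operands sit entirely inside the false branch of the conditional operator, they are evaluated only when the condition fails.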

conbench-apache-arrow · 2023-07-21T18:19:32Z

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 96ac514.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

### Rationale for this change

Bring back the problematic test case of random `case_when` on union(bool, string) type. This case used to fail. However, #36018 already addressed the issue. For more information about how it used to fail, please refer to #15192 (comment).

### What changes are included in this PR?

Bring back the test code.

### Are these changes tested?

Yes, the change is the test.

### Are there any user-facing changes?

No.

* Closes: #15192

Authored-by: zanmato <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>


bkietz requested a review from pitrou June 9, 2023 21:03

github-actions bot added the Component: C++ and awaiting committer review labels Jun 9, 2023

felipecrv requested changes Jun 9, 2023

github-actions bot added the awaiting changes and awaiting change review labels and removed the awaiting committer review and awaiting changes labels Jun 15, 2023

pitrou reviewed Jun 15, 2023

cpp/src/arrow/array/data.cc (outdated, resolved)

pitrou reviewed Jun 15, 2023

cpp/src/arrow/scalar.h (outdated, resolved)

github-actions bot added the awaiting changes and awaiting change review labels and removed the awaiting change review and awaiting changes labels Jun 15, 2023

felipecrv reviewed Jun 15, 2023

cpp/src/arrow/array/data.cc (outdated, resolved)

felipecrv approved these changes Jun 15, 2023

github-actions bot added the awaiting changes label and removed the awaiting change review label Jun 15, 2023

felipecrv approved these changes Jun 16, 2023

bkietz force-pushed the 35581-store-offsets-in-scalars branch from 077168a to 52bfbab Compare June 16, 2023 14:07

github-actions bot added the awaiting change review label and removed the awaiting changes label Jun 16, 2023

bkietz mentioned this pull request Jun 16, 2023

[C++] Export compile commands database by default #36124

Closed

github-actions bot added the awaiting changes label and removed the awaiting change review label Jun 16, 2023

bkietz added 6 commits July 6, 2023 16:52

address review comments (262ee88)

review comments: use c array, revert preset (13fe04c)

extract scalar scratch space to an intermediate class so it isn't in *every* Scalar (3a344aa)

review comments, SetOffsetsForScalar -> OffsetsForScalar (f7aaff9)

work stolen from apache#35036: add support for REE scalars with new scratch space (6be009a)

add tests for FillFromScalar (c2f1e40)

bkietz force-pushed the 35581-store-offsets-in-scalars branch from 64ee6c4 to c2f1e40 Compare July 6, 2023 20:55

bkietz added 5 commits July 7, 2023 11:58

type id overload of HasValidityBitmap, expand zero buffer for empty arrays (eef13d5)

format (2d55d63)

add print statements to spy on strange assert failure (8e4b423)

more debugging; is a compiler getting kZeros wrong? (1dc0c78)

remove debug code (4ea421a)

pitrou merged commit 96ac514 into apache:main Jul 11, 2023

pitrou removed the awaiting change review label Jul 11, 2023

pitrou reviewed Jul 11, 2023

github-actions bot added the awaiting committer review and awaiting changes labels and removed the awaiting committer review label Jul 11, 2023

felipecrv reviewed Jul 11, 2023

cpp/src/arrow/compute/kernels/test_util.cc (outdated, resolved)

felipecrv reviewed Jul 11, 2023

pitrou mentioned this pull request Jul 11, 2023

[C++] Add a test for evaluation of ARROW_CHECK payload #36618

Closed

This was referenced Dec 21, 2023

[C++] "case_when" test failure on random union inputs #15192

Closed

GH-15192: [C++] Bring back case_when tests for union types #39308

Merged

