Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35581: [C++] Store offsets in scalars #36018

Merged
merged 12 commits into from
Jul 11, 2023

Conversation

bkietz
Copy link
Member

@bkietz bkietz commented Jun 9, 2023

ArraySpan contained scratch space inside itself for storing offsets when viewing a scalar as a length=1 array. This could lead to dangling pointers in copies of the ArraySpan since copies' pointers will always refer to the original's scratch space, which may have been destroyed.

This patch moves that scratch space into Scalar itself, which is more conformant to the contract of a view since the scratch space will be valid as long as the Scalar is alive, regardless of how ArraySpans are copied.

@bkietz bkietz requested a review from pitrou June 9, 2023 21:03
@bkietz
Copy link
Member Author

bkietz commented Jun 9, 2023

@felipecrv

@@ -11,6 +11,7 @@
"hidden": true,
"generator": "Ninja",
"cacheVariables": {
"CMAKE_EXPORT_COMPILE_COMMANDS": "ON",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to merge this and generate the huge file in CI builds, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's less than a megabyte on my machine; I think it's not expensive and it's worthwhile to enable clang tools by default

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the generated file is not huge, but I also think these presets should enable just what's necessary for development. They serve as documentation and starting points for users. It's easy to enable additional options on top of an existing preset, AFAIK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But which tools need this in CI? I use this file on my machine for LSP on my editor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The presets aren't just for CI builds, as @pitrou says it's a starting point for users. I would say this is an easy and useful default to provide for users- myself included. I'll move this to a different issue, though, since it's not directly pertinent to offset storage

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpp/src/arrow/array/data.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/data.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/data.cc Outdated Show resolved Hide resolved
cpp/src/arrow/scalar.h Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Jun 15, 2023
cpp/src/arrow/array/data.cc Outdated Show resolved Hide resolved
cpp/src/arrow/scalar.h Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Jun 15, 2023

This LGTM on the principle.

@pitrou
Copy link
Member

pitrou commented Jun 15, 2023

There are CI failures though.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Jun 15, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 15, 2023
@bkietz
Copy link
Member Author

bkietz commented Jun 16, 2023

This PR should probably wait until #35036 is merged

@bkietz bkietz force-pushed the 35581-store-offsets-in-scalars branch from 077168a to 52bfbab Compare June 16, 2023 14:07
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 16, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 16, 2023
@jorisvandenbossche
Copy link
Member

This PR should probably wait until #35036 is merged

It's fine for this to be merged first, will update my PR later afterwards then (since I am using the scratch space in the other PR, fixing that first here also make sense)

@bkietz bkietz force-pushed the 35581-store-offsets-in-scalars branch from 64ee6c4 to c2f1e40 Compare July 6, 2023 20:55
@bkietz
Copy link
Member Author

bkietz commented Jul 10, 2023

@pitrou PTAL

@pitrou
Copy link
Member

pitrou commented Jul 10, 2023

@github-actions crossbow submit -g cpp

@github-actions
Copy link

Revision: 4ea421a

Submitted crossbow builds: ursacomputing/crossbow @ actions-8dcf3ee3c6

Task Status
test-alpine-linux-cpp Github Actions
test-build-cpp-fuzz Github Actions
test-conda-cpp Github Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp Github Actions
test-debian-11-cpp-amd64 Github Actions
test-debian-11-cpp-i386 Github Actions
test-fedora-35-cpp Github Actions
test-ubuntu-20.04-cpp Github Actions
test-ubuntu-20.04-cpp-bundled Github Actions
test-ubuntu-20.04-cpp-minimal-with-formats Github Actions
test-ubuntu-20.04-cpp-thread-sanitizer Github Actions
test-ubuntu-22.04-cpp Github Actions
test-ubuntu-22.04-cpp-20 Github Actions

@pitrou pitrou merged commit 96ac514 into apache:main Jul 11, 2023
@pitrou pitrou removed the awaiting change review Awaiting change review label Jul 11, 2023
case Type::RUN_END_ENCODED:
return 0;
return 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged a bit quickly, why does RUN_END_ENCODED need one buffer here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many places in the codebase assume that buffers.size() >= 1, even if buffers[0] == nullptr. When I added test cases which exercised REE scalars those places segfaulted. I thought that requiring buffers.size() >= 1 for REE (as we do for union) was the most expeditious fix

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felipecrv What do you think here? Should we require a one-element buffers vector for REE?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RUN_END_ENCODED doesn't need any buffer but NA also does not and we return 1 here. 🤔

It's a "once you start lying you can't stop lying" kind of problem for GetNumBuffers

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this was the only place in the codebase which didn't already give REE at least one buffer; the constructors, the builder, ... already did so

@github-actions github-actions bot added awaiting committer review Awaiting committer review awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jul 11, 2023
Comment on lines +133 to +134
ARROW_CHECK_LE(off, length) << "Slice offset (" << off
<< ") greater than array length (" << length << ")";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this has to always perform int to string conversion. Even when the check passes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't. The operator<< is only invoked when the assertion fails. Though operator precedence makes me slightly uneasy:

#define ARROW_CHECK_OR_LOG(condition, level) \
  ARROW_PREDICT_TRUE(condition)              \
  ? ARROW_IGNORE_EXPR(0)                     \
  : ::arrow::util::Voidify() & ARROW_LOG(level) << " Check failed: " #condition " "

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could actually add a test for that, IMHO.

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 96ac514.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

pitrou pushed a commit that referenced this pull request Dec 21, 2023
### Rationale for this change

Bring back the problematic test case of random `case_when` on union(bool, string) type. This case used to fail. However #36018 already addressed the issue.

More information about how it used to fail, please refer to #15192 (comment).

### What changes are included in this PR?

Bring back the test code.

### Are these changes tested?

Yes, the change is the test.

### Are there any user-facing changes?

No.

* Closes: #15192

Authored-by: zanmato <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
…pache#39308)

### Rationale for this change

Bring back the problematic test case of random `case_when` on union(bool, string) type. This case used to fail. However apache#36018 already addressed the issue.

More information about how it used to fail, please refer to apache#15192 (comment).

### What changes are included in this PR?

Bring back the test code.

### Are these changes tested?

Yes, the change is the test.

### Are there any user-facing changes?

No.

* Closes: apache#15192

Authored-by: zanmato <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pache#39308)

### Rationale for this change

Bring back the problematic test case of random `case_when` on union(bool, string) type. This case used to fail. However apache#36018 already addressed the issue.

More information about how it used to fail, please refer to apache#15192 (comment).

### What changes are included in this PR?

Bring back the test code.

### Are these changes tested?

Yes, the change is the test.

### Are there any user-facing changes?

No.

* Closes: apache#15192

Authored-by: zanmato <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] ArraySpan::FillFromScalar is unsafe
4 participants