GH-35344: [C++][Format] Implementation of the LIST_VIEW and LARGE_LIST_VIEW array formats #35345

Merged
merged 91 commits into from
Nov 22, 2023


@felipecrv felipecrv commented Apr 26, 2023

Rationale for this change

Mailing list discussion: https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb

What changes are included in this PR?

Initial implementation of the new format in C++.

Are these changes tested?

Unit tests are being written with every commit that adds new functionality. More needs to be implemented before the (required) integration tests can be written.

Are there any user-facing changes?

A new array format. It should have no impact for users that don't use it.
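
For readers unfamiliar with the format, here is a minimal sketch of the list-view layout: each list is described by an (offset, size) pair into a shared child array, so lists can be stored out of order and may overlap. `ListViewSketch`, `GetList` and `MakeExample` are hypothetical names made up for this illustration and are not part of the Arrow C++ API; the validity bitmap is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a list-view array; NOT the Arrow C++ API.
struct ListViewSketch {
  std::vector<int32_t> offsets;  // start of each list within `values`
  std::vector<int32_t> sizes;    // number of elements in each list
  std::vector<int32_t> values;   // child values shared by all lists
};

// Materialize the i-th list by copying its slice of the child values.
// Assumes the (offset, size) pair has already been validated.
std::vector<int32_t> GetList(const ListViewSketch& lv, std::size_t i) {
  assert(lv.offsets.size() == lv.sizes.size() && i < lv.offsets.size());
  const auto begin = lv.values.begin() + lv.offsets[i];
  return std::vector<int32_t>(begin, begin + lv.sizes[i]);
}

// The lists [[3, 4], [1, 2, 3], []]: stored out of order, and the first
// two lists overlap on the child element 3.
ListViewSketch MakeExample() {
  return ListViewSketch{{2, 0, 0}, {2, 3, 0}, {1, 2, 3, 4}};
}
```

Because the sizes are stored explicitly, offsets do not need to be monotonically increasing as they do in the regular list layout.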

@github-actions

⚠️ GitHub issue #35344 has been automatically assigned in GitHub to PR creator.

@felipecrv
Contributor Author

@bkietz

return rag_.ArrayOf(std::move(type), size, null_probability);
}

// TODO(GH-38656): Use the random array generators from testing/random.h here
Contributor Author

@pitrou I isolated all the random-generation code in this class and removed the complicated List[View]ConcatenationChecker templates.

@felipecrv
Contributor Author

@pitrou

Member

@pitrou pitrou left a comment

Sorry for finding two more nits. Feel free to ping when done!

Comment on lines 273 to 285
if (sizes[position] > 0) {
  // NOTE: Concatenate can be called during IPC reads to append delta
  // dictionaries. Avoid UB on non-validated input by doing the addition in the
  // unsigned domain. (the result can later be validated using
  // Array::ValidateFull)
  const auto displaced_offset = SafeSignedAdd(offsets[position], displacement);
  // displaced_offset>=0 is guaranteed by RangeOfValuesUsed returning the
  // smallest offset of valid and non-empty list-views.
  DCHECK_GE(displaced_offset, 0);
  dst[position] = displaced_offset;
} else {
  // Do nothing to leave dst[position] as 0.
}
Member

I might be misreading, but is it just the same as visit_not_null(i)?

Contributor Author

You're right. I extracted the function from below when I noticed the dup, but forgot to do the reverse-inlining above.

Pushing soon.
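
For context, the offset displacement discussed in this thread can be sketched in isolation roughly as follows. This is only an illustration under stated assumptions: `WrappingAdd` and `DisplaceOffsets` are hypothetical names written for this sketch (the PR itself uses `SafeSignedAdd`), and plain `std::vector` stands in for Arrow buffers.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical helper: performs the addition in the unsigned domain so that
// overflow wraps around instead of being undefined behavior. Converting the
// result back to a signed type is modular on all mainstream compilers
// (and guaranteed to be so since C++20).
template <typename T>
T WrappingAdd(T a, T b) {
  using U = std::make_unsigned_t<T>;
  return static_cast<T>(static_cast<U>(a) + static_cast<U>(b));
}

// Shift the offsets of non-empty list-views by `displacement`.
// Empty list-views keep offset 0, mirroring the else-branch in the review
// snippet. Assumes `offsets` and `sizes` have the same length.
std::vector<int32_t> DisplaceOffsets(const std::vector<int32_t>& offsets,
                                     const std::vector<int32_t>& sizes,
                                     int32_t displacement) {
  std::vector<int32_t> dst(offsets.size(), 0);
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    if (sizes[i] > 0) {
      dst[i] = WrappingAdd(offsets[i], displacement);
    }
  }
  return dst;
}
```

A wrapped result is not a valid offset; as the code comment in the snippet notes, the point is only to avoid UB on non-validated input, leaving detection to a later validation pass.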

cpp/src/arrow/array/concatenate_test.cc (resolved)
cpp/src/arrow/util/list_util.h (outdated, resolved)
@wgtmac wgtmac removed their request for review November 22, 2023 01:40
@pitrou
Member

pitrou commented Nov 22, 2023

@felipecrv
Contributor Author

@felipecrv We'll want to update https://github.com/apache/arrow/blob/main/docs/source/status.rst in a followup PR.

I will be extremely glad to send that PR.

@pitrou pitrou merged commit 8cc71ab into apache:main Nov 22, 2023
34 of 35 checks passed
@pitrou pitrou removed the "awaiting change review" label Nov 22, 2023
@mapleFU
Member

mapleFU commented Nov 22, 2023

bravo 🍺!

@mapleFU
Member

mapleFU commented Nov 22, 2023

I've created an issue about Parquet: #38849


After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 8cc71ab.

There were 5 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 14 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pache#37877)

### Rationale for this change

More details in the draft implementations of this spec:

 - C++: apache#35345
 - Go: apache#37468

### What changes are included in this PR?

 - Some unrelated fixes to the spec text (I can extract these to another PR if necessary)
 - Changes to the spec text
 - Additions to the Flatbuffers specifications of the Arrow format

### Are these changes tested?

N/A.

### Are there any user-facing changes?

Changes in documentation and backwards compatible additions to the format spec.

* Closes: apache#37876

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…E_LIST_VIEW array formats (apache#37468)

### Rationale for this change

Go implementation of apache#35345.

### What changes are included in this PR?

- [x] Add `LIST_VIEW` and `LARGE_LIST_VIEW` to datatype.go
- [x] Add `ListView` and `LargeListView` to list.go
- [x] Add `ListViewType` and `LargeListViewType` to datatype_nested.go
- [x] Add list-view builders
- [x] Implement list-view comparison in compare.go
- [x] String conversion in both directions
- [x] Validation of list-view arrays
- [x] Generation of random list-view arrays
- [x] Concatenation of list-view arrays in concat.go
- [x] JSON serialization/deserialization
- [x] Add data used for tests in `arrdata.go`
- [x] Add Flatbuffer changes
- [x] Add IPC support
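
As a rough illustration of what the "Validation of list-view arrays" item involves, here is a minimal bounds check written in C++ (the language of the original PR) rather than Go. It is a sketch under stated assumptions, not the code from either implementation: `ListViewBoundsAreValid` is a hypothetical name, and the validity bitmap is ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when every (offset, size) pair addresses a range that lies
// entirely inside a child array of `values_length` elements.
bool ListViewBoundsAreValid(const std::vector<int32_t>& offsets,
                            const std::vector<int32_t>& sizes,
                            int64_t values_length) {
  if (offsets.size() != sizes.size()) return false;
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    // Widen to 64 bits so that offset + size cannot overflow.
    const int64_t offset = offsets[i];
    const int64_t size = sizes[i];
    if (offset < 0 || size < 0) return false;
    if (offset + size > values_length) return false;
  }
  return true;
}
```

Unlike the regular list layout, no ordering between consecutive offsets is checked, since list-views may appear in any order and may overlap.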

### Are these changes tested?

Yes. Existing tests are being changed to also cover list-view variations, and new tests focused solely on the list-view format are being added.

### Are there any user-facing changes?

New structs and functions introduced.
* Closes: apache#35344

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…GE_LIST_VIEW array formats (apache#35345)

### Rationale for this change

Mailing list discussion: https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb

### What changes are included in this PR?

Initial implementation of the new format in C++.

### Are these changes tested?

Unit tests are being written with every commit that adds new functionality. More needs to be implemented before the (required) integration tests can be written.

### Are there any user-facing changes?

A new array format. It should have no impact for users that don't use it.
* Closes: apache#35344

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Development

Successfully merging this pull request may close these issues.

[C++][Format] Draft an implementation of the LIST_VIEW array format
5 participants