GH-40517: [C#] Fix writing sliced arrays to IPC format #41197
Merged
CurtHagenlocher
approved these changes
Apr 15, 2024
Awesome!
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit a6cdcd0. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
CurtHagenlocher
pushed a commit
that referenced
this pull request
Apr 16, 2024
### Rationale for this change

Fixes concatenation of union arrays.

### What changes are included in this PR?

* Re-enables union array concatenation tests that were disabled in #41197 after making union array comparisons more thorough in the `ArrowReaderVerifier`
* Updates the union array concatenation logic to account for array lengths when concatenating the type and offset buffers, and fixes how the base offset is calculated.
* Fixes creating the type buffers for the array concatenation tests.

### Are these changes tested?

Yes, this uses the existing `ArrowArrayConcatenatorTests` tests.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.

* GitHub Issue: #41198

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
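The base-offset adjustment mentioned in that commit message can be illustrated with a small, library-independent sketch. This is not the C# `ArrowArrayConcatenator` code: the `DenseUnion` class and `concat_dense_unions` function are hypothetical names, plain Python lists stand in for Arrow buffers, and type ids are assumed to equal child indices. It shows two ideas: only the slots inside each array's logical `[offset, offset + length)` window are copied, and each copied offset is shifted by the number of values already accumulated in the corresponding child.

```python
from dataclasses import dataclass


@dataclass
class DenseUnion:
    """Hypothetical stand-in for a dense union array (not an Arrow type)."""
    type_ids: list   # per slot: which child holds the value
    offsets: list    # per slot: position of the value inside that child
    children: list   # one list of values per child
    offset: int      # logical start of this (possibly sliced) array
    length: int      # logical number of slots


def concat_dense_unions(arrays):
    n_children = len(arrays[0].children)
    out_types, out_offsets = [], []
    out_children = [[] for _ in range(n_children)]
    for arr in arrays:
        # Base offset per child: how many values earlier arrays contributed.
        base = [len(child) for child in out_children]
        lo, hi = arr.offset, arr.offset + arr.length
        # Copy only the logical window of the type and offset buffers.
        for type_id, child_pos in zip(arr.type_ids[lo:hi], arr.offsets[lo:hi]):
            out_types.append(type_id)
            out_offsets.append(base[type_id] + child_pos)
        # Children are appended whole, so existing offsets stay valid.
        for i, child in enumerate(arr.children):
            out_children[i].extend(child)
    return DenseUnion(out_types, out_offsets, out_children, 0, len(out_types))
```

Appending each child in full keeps the sketch short; a more careful version would also trim unreferenced child values.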
raulcd
pushed a commit
that referenced
this pull request
Apr 29, 2024
### Rationale for this change

Fixes writing sliced arrays to IPC files or streams, so that they can be successfully read back in. Previously, writing such data would succeed but then couldn't be read.

### What changes are included in this PR?

* Fixes `BinaryViewArray.GetBytes` to account for the array offset
* Fixes `FixedSizeBinaryArray.GetBytes` to account for the array offset
* Updates `ArrowStreamWriter` so that it writes slices of buffers when required, and handles slicing bitmap arrays by creating a copy if the offset isn't a multiple of 8
* Refactors `ArrowStreamWriter`, making the `ArrowRecordBatchFlatBufferBuilder` class responsible for building a list of field nodes as well as buffers. This was required to avoid having to duplicate logic for handling array types with child data between the `ArrowRecordBatchFlatBufferBuilder` class and the `CreateSelfAndChildrenFieldNodes` method, which I've removed.

Note that after this change, we still write more data than required when writing a slice of a `ListArray`, `BinaryArray`, `ListViewArray`, `BinaryViewArray` or `DenseUnionArray`. When writing a `ListArray` for example, we write slices of the null bitmap and value offsets and write the full values array. Ideally we should write a slice of the values and adjust the value offsets so they start at zero. The C++ implementation for example handles this [here](https://github.com/apache/arrow/blob/18c74b0733c9ff473a211259cf10705b2c9be891/cpp/src/arrow/ipc/writer.cc#L316). I will make a follow-up issue for this once this PR is merged.

### Are these changes tested?

Yes, I've added new unit tests for this.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.

* GitHub Issue: #40517

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
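To illustrate why a bitmap slice needs a copy when the offset isn't a multiple of 8, here is a minimal sketch in Python. It is not the `ArrowStreamWriter` code: `slice_bitmap` is a hypothetical helper, and it assumes the Arrow convention that bit `i` of a bitmap lives in byte `i // 8` at bit position `i % 8` (least-significant bit first). A byte-aligned slice is just a byte range, while an unaligned slice has to be re-packed so that its first bit lands at bit 0 of the output.

```python
def slice_bitmap(bitmap: bytes, offset: int, length: int) -> bytes:
    """Return a bitmap holding bits [offset, offset + length) of `bitmap`."""
    if offset < 0 or length < 0 or offset + length > len(bitmap) * 8:
        raise ValueError("slice is out of range")
    n_bytes = (length + 7) // 8
    if offset % 8 == 0:
        # Byte-aligned: the slice is a plain byte range. Bits past `length`
        # in the last byte are left as they were in the source.
        start = offset // 8
        return bytes(bitmap[start:start + n_bytes])
    # Unaligned: re-pack bit by bit into a fresh buffer starting at bit 0.
    out = bytearray(n_bytes)
    for i in range(length):
        src = offset + i
        if (bitmap[src // 8] >> (src % 8)) & 1:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)
```

For example, slicing 7 bits starting at bit 3 of the two-byte bitmap `0b10110101, 0b00001110` crosses a byte boundary and yields the single byte `0b01010110`.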
### Rationale for this change

Fixes writing sliced arrays to IPC files or streams, so that they can be successfully read back in. Previously, writing such data would succeed but then couldn't be read.

### What changes are included in this PR?

* Fixes `BinaryViewArray.GetBytes` to account for the array offset
* Fixes `FixedSizeBinaryArray.GetBytes` to account for the array offset
* Updates `ArrowStreamWriter` so that it writes slices of buffers when required, and handles slicing bitmap arrays by creating a copy if the offset isn't a multiple of 8
* Refactors `ArrowStreamWriter`, making the `ArrowRecordBatchFlatBufferBuilder` class responsible for building a list of field nodes as well as buffers. This was required to avoid having to duplicate logic for handling array types with child data between the `ArrowRecordBatchFlatBufferBuilder` class and the `CreateSelfAndChildrenFieldNodes` method, which I've removed.

Note that after this change, we still write more data than required when writing a slice of a `ListArray`, `BinaryArray`, `ListViewArray`, `BinaryViewArray` or `DenseUnionArray`. When writing a `ListArray` for example, we write slices of the null bitmap and value offsets and write the full values array. Ideally we should write a slice of the values and adjust the value offsets so they start at zero. The C++ implementation for example handles this [here](https://github.com/apache/arrow/blob/18c74b0733c9ff473a211259cf10705b2c9be891/cpp/src/arrow/ipc/writer.cc#L316). I will make a follow-up issue for this once this PR is merged.

### Are these changes tested?

Yes, I've added new unit tests for this.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
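The follow-up improvement described in the note above (writing only the referenced values of a sliced `ListArray` and shifting the value offsets so they start at zero) can be sketched as follows. This is a language-agnostic illustration in Python rather than the C# or C++ implementation; `rebase_list_slice` is a hypothetical name, and plain lists stand in for the offsets and values buffers.

```python
def rebase_list_slice(value_offsets, values, offset, length):
    """Return (offsets, values) for list slots [offset, offset + length).

    `value_offsets` has one more entry than there are list slots, so a
    slice of `length` lists needs `length + 1` offsets.
    """
    if offset < 0 or length < 0 or offset + length + 1 > len(value_offsets):
        raise ValueError("slice is out of range")
    window = value_offsets[offset:offset + length + 1]
    first, last = window[0], window[-1]
    # Shift offsets so the first list starts at zero.
    new_offsets = [o - first for o in window]
    # Keep only the child values the sliced lists actually reference.
    new_values = values[first:last]
    return new_offsets, new_values
```

With offsets `[0, 2, 5, 5, 9]` over nine values, slicing two lists starting at slot 1 gives offsets `[0, 3, 3]` and only the three values at positions 2 to 4, instead of all nine.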