Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40517: [C#] Fix writing sliced arrays to IPC format #41197

Merged
merged 11 commits into from
Apr 15, 2024

Conversation

adamreeve
Copy link
Contributor

@adamreeve adamreeve commented Apr 15, 2024

Rationale for this change

Fixes writing sliced arrays to IPC files or streams, so that they can be successfully read back in. Previously, writing such data would succeed but then couldn't be read.

What changes are included in this PR?

  • Fixes BinaryViewArray.GetBytes to account for the array offset
  • Fixes FixedSizeBinaryArray.GetBytes to account for the array offset
  • Updates ArrowStreamWriter so that it writes slices of buffers when required, and handles slicing bitmap arrays by creating a copy if the offset isn't a multiple of 8
  • Refactors ArrowStreamWriter, making the ArrowRecordBatchFlatBufferBuilder class responsible for building a list of field nodes as well as buffers. This was required to avoid having to duplicate logic for handling array types with child data between the ArrowRecordBatchFlatBufferBuilder class and the CreateSelfAndChildrenFieldNodes method, which I've removed.

Note that after this change, we still write more data than required when writing a slice of a ListArray, BinaryArray, ListViewArray, BinaryViewArray or DenseUnionArray. When writing a ListArray for example, we write slices of the null bitmap and value offsets and write the full values array. Ideally we should write a slice of the values and adjust the value offsets so they start at zero. The C++ implementation for example handles this here. I will make a follow-up issue for this once this PR is merged.

Are these changes tested?

Yes, I've added new unit tests for this.

Are there any user-facing changes?

Yes, this is a user-facing bug fix.

@adamreeve adamreeve marked this pull request as ready for review April 15, 2024 04:52
Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@CurtHagenlocher CurtHagenlocher merged commit a6cdcd0 into apache:main Apr 15, 2024
10 checks passed
@CurtHagenlocher CurtHagenlocher removed the awaiting review Awaiting review label Apr 15, 2024
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Apr 15, 2024
@adamreeve adamreeve deleted the ipc_slices branch April 15, 2024 20:16
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit a6cdcd0.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

CurtHagenlocher pushed a commit that referenced this pull request Apr 16, 2024
### Rationale for this change

Fixes concatenation of union arrays.

### What changes are included in this PR?

* Re-enables union array concatenation tests that were disabled in #41197 after making union array comparisons more thorough in the `ArrowReaderVerifier`
* Updates the union array concatenation logic to account for array lengths when concatenating the type and offset buffers, and fixes how the base offset is calculated.
* Fixes creating the type buffers for the array concatenation tests.

### Are these changes tested?

Yes, this uses the existing `ArrowArrayConcatenatorTests` tests.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: #41198

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
raulcd pushed a commit that referenced this pull request Apr 29, 2024
### Rationale for this change

Fixes writing sliced arrays to IPC files or streams, so that they can be successfully read back in. Previously, writing such data would succeed but then couldn't be read.

### What changes are included in this PR?

* Fixes `BinaryViewArray.GetBytes` to account for the array offset
* Fixes `FixedSizeBinaryArray.GetBytes` to account for the array offset
* Updates `ArrowStreamWriter` so that it writes slices of buffers when required, and handles slicing bitmap arrays by creating a copy if the offset isn't a multiple of 8
* Refactors `ArrowStreamWriter`, making the `ArrowRecordBatchFlatBufferBuilder` class responsible for building a list of field nodes as well as buffers. This was required to avoid having to duplicate logic for handling array types with child data between the `ArrowRecordBatchFlatBufferBuilder` class and the `CreateSelfAndChildrenFieldNodes` method, which I've removed.

Note that after this change, we still write more data than required when writing a slice of a `ListArray`, `BinaryArray`, `ListViewArray`, `BinaryViewArray` or `DenseUnionArray`. When writing a `ListArray` for example, we write slices of the null bitmap and value offsets and write the full values array. Ideally we should write a slice of the values and adjust the value offsets so they start at zero. The C++ implementation for example handles this [here](https://github.com/apache/arrow/blob/18c74b0733c9ff473a211259cf10705b2c9be891/cpp/src/arrow/ipc/writer.cc#L316). I will make a follow-up issue for this once this PR is merged.

### Are these changes tested?

Yes, I've added new unit tests for this.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: #40517

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…41197)

### Rationale for this change

Fixes writing sliced arrays to IPC files or streams, so that they can be successfully read back in. Previously, writing such data would succeed but then couldn't be read.

### What changes are included in this PR?

* Fixes `BinaryViewArray.GetBytes` to account for the array offset
* Fixes `FixedSizeBinaryArray.GetBytes` to account for the array offset
* Updates `ArrowStreamWriter` so that it writes slices of buffers when required, and handles slicing bitmap arrays by creating a copy if the offset isn't a multiple of 8
* Refactors `ArrowStreamWriter`, making the `ArrowRecordBatchFlatBufferBuilder` class responsible for building a list of field nodes as well as buffers. This was required to avoid having to duplicate logic for handling array types with child data between the `ArrowRecordBatchFlatBufferBuilder` class and the `CreateSelfAndChildrenFieldNodes` method, which I've removed.

Note that after this change, we still write more data than required when writing a slice of a `ListArray`, `BinaryArray`, `ListViewArray`, `BinaryViewArray` or `DenseUnionArray`. When writing a `ListArray` for example, we write slices of the null bitmap and value offsets and write the full values array. Ideally we should write a slice of the values and adjust the value offsets so they start at zero. The C++ implementation for example handles this [here](https://github.com/apache/arrow/blob/18c74b0733c9ff473a211259cf10705b2c9be891/cpp/src/arrow/ipc/writer.cc#L316). I will make a follow-up issue for this once this PR is merged.

### Are these changes tested?

Yes, I've added new unit tests for this.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: apache#40517

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
### Rationale for this change

Fixes concatenation of union arrays.

### What changes are included in this PR?

* Re-enables union array concatenation tests that were disabled in apache#41197 after making union array comparisons more thorough in the `ArrowReaderVerifier`
* Updates the union array concatenation logic to account for array lengths when concatenating the type and offset buffers, and fixes how the base offset is calculated.
* Fixes creating the type buffers for the array concatenation tests.

### Are these changes tested?

Yes, this uses the existing `ArrowArrayConcatenatorTests` tests.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: apache#41198

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
…41197)

### Rationale for this change

Fixes writing sliced arrays to IPC files or streams, so that they can be successfully read back in. Previously, writing such data would succeed but then couldn't be read.

### What changes are included in this PR?

* Fixes `BinaryViewArray.GetBytes` to account for the array offset
* Fixes `FixedSizeBinaryArray.GetBytes` to account for the array offset
* Updates `ArrowStreamWriter` so that it writes slices of buffers when required, and handles slicing bitmap arrays by creating a copy if the offset isn't a multiple of 8
* Refactors `ArrowStreamWriter`, making the `ArrowRecordBatchFlatBufferBuilder` class responsible for building a list of field nodes as well as buffers. This was required to avoid having to duplicate logic for handling array types with child data between the `ArrowRecordBatchFlatBufferBuilder` class and the `CreateSelfAndChildrenFieldNodes` method, which I've removed.

Note that after this change, we still write more data than required when writing a slice of a `ListArray`, `BinaryArray`, `ListViewArray`, `BinaryViewArray` or `DenseUnionArray`. When writing a `ListArray` for example, we write slices of the null bitmap and value offsets and write the full values array. Ideally we should write a slice of the values and adjust the value offsets so they start at zero. The C++ implementation for example handles this [here](https://github.com/apache/arrow/blob/18c74b0733c9ff473a211259cf10705b2c9be891/cpp/src/arrow/ipc/writer.cc#L316). I will make a follow-up issue for this once this PR is merged.

### Are these changes tested?

Yes, I've added new unit tests for this.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: apache#40517

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
### Rationale for this change

Fixes concatenation of union arrays.

### What changes are included in this PR?

* Re-enables union array concatenation tests that were disabled in apache#41197 after making union array comparisons more thorough in the `ArrowReaderVerifier`
* Updates the union array concatenation logic to account for array lengths when concatenating the type and offset buffers, and fixes how the base offset is calculated.
* Fixes creating the type buffers for the array concatenation tests.

### Are these changes tested?

Yes, this uses the existing `ArrowArrayConcatenatorTests` tests.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: apache#41198

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…41197)

### Rationale for this change

Fixes writing sliced arrays to IPC files or streams, so that they can be successfully read back in. Previously, writing such data would succeed but then couldn't be read.

### What changes are included in this PR?

* Fixes `BinaryViewArray.GetBytes` to account for the array offset
* Fixes `FixedSizeBinaryArray.GetBytes` to account for the array offset
* Updates `ArrowStreamWriter` so that it writes slices of buffers when required, and handles slicing bitmap arrays by creating a copy if the offset isn't a multiple of 8
* Refactors `ArrowStreamWriter`, making the `ArrowRecordBatchFlatBufferBuilder` class responsible for building a list of field nodes as well as buffers. This was required to avoid having to duplicate logic for handling array types with child data between the `ArrowRecordBatchFlatBufferBuilder` class and the `CreateSelfAndChildrenFieldNodes` method, which I've removed.

Note that after this change, we still write more data than required when writing a slice of a `ListArray`, `BinaryArray`, `ListViewArray`, `BinaryViewArray` or `DenseUnionArray`. When writing a `ListArray` for example, we write slices of the null bitmap and value offsets and write the full values array. Ideally we should write a slice of the values and adjust the value offsets so they start at zero. The C++ implementation for example handles this [here](https://github.com/apache/arrow/blob/18c74b0733c9ff473a211259cf10705b2c9be891/cpp/src/arrow/ipc/writer.cc#L316). I will make a follow-up issue for this once this PR is merged.

### Are these changes tested?

Yes, I've added new unit tests for this.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: apache#40517

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
### Rationale for this change

Fixes concatenation of union arrays.

### What changes are included in this PR?

* Re-enables union array concatenation tests that were disabled in apache#41197 after making union array comparisons more thorough in the `ArrowReaderVerifier`
* Updates the union array concatenation logic to account for array lengths when concatenating the type and offset buffers, and fixes how the base offset is calculated.
* Fixes creating the type buffers for the array concatenation tests.

### Are these changes tested?

Yes, this uses the existing `ArrowArrayConcatenatorTests` tests.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: apache#41198

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…41197)

### Rationale for this change

Fixes writing sliced arrays to IPC files or streams, so that they can be successfully read back in. Previously, writing such data would succeed but then couldn't be read.

### What changes are included in this PR?

* Fixes `BinaryViewArray.GetBytes` to account for the array offset
* Fixes `FixedSizeBinaryArray.GetBytes` to account for the array offset
* Updates `ArrowStreamWriter` so that it writes slices of buffers when required, and handles slicing bitmap arrays by creating a copy if the offset isn't a multiple of 8
* Refactors `ArrowStreamWriter`, making the `ArrowRecordBatchFlatBufferBuilder` class responsible for building a list of field nodes as well as buffers. This was required to avoid having to duplicate logic for handling array types with child data between the `ArrowRecordBatchFlatBufferBuilder` class and the `CreateSelfAndChildrenFieldNodes` method, which I've removed.

Note that after this change, we still write more data than required when writing a slice of a `ListArray`, `BinaryArray`, `ListViewArray`, `BinaryViewArray` or `DenseUnionArray`. When writing a `ListArray` for example, we write slices of the null bitmap and value offsets and write the full values array. Ideally we should write a slice of the values and adjust the value offsets so they start at zero. The C++ implementation for example handles this [here](https://github.com/apache/arrow/blob/18c74b0733c9ff473a211259cf10705b2c9be891/cpp/src/arrow/ipc/writer.cc#L316). I will make a follow-up issue for this once this PR is merged.

### Are these changes tested?

Yes, I've added new unit tests for this.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: apache#40517

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
### Rationale for this change

Fixes concatenation of union arrays.

### What changes are included in this PR?

* Re-enables union array concatenation tests that were disabled in apache#41197 after making union array comparisons more thorough in the `ArrowReaderVerifier`
* Updates the union array concatenation logic to account for array lengths when concatenating the type and offset buffers, and fixes how the base offset is calculated.
* Fixes creating the type buffers for the array concatenation tests.

### Are these changes tested?

Yes, this uses the existing `ArrowArrayConcatenatorTests` tests.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: apache#41198

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…41197)

### Rationale for this change

Fixes writing sliced arrays to IPC files or streams, so that they can be successfully read back in. Previously, writing such data would succeed but then couldn't be read.

### What changes are included in this PR?

* Fixes `BinaryViewArray.GetBytes` to account for the array offset
* Fixes `FixedSizeBinaryArray.GetBytes` to account for the array offset
* Updates `ArrowStreamWriter` so that it writes slices of buffers when required, and handles slicing bitmap arrays by creating a copy if the offset isn't a multiple of 8
* Refactors `ArrowStreamWriter`, making the `ArrowRecordBatchFlatBufferBuilder` class responsible for building a list of field nodes as well as buffers. This was required to avoid having to duplicate logic for handling array types with child data between the `ArrowRecordBatchFlatBufferBuilder` class and the `CreateSelfAndChildrenFieldNodes` method, which I've removed.

Note that after this change, we still write more data than required when writing a slice of a `ListArray`, `BinaryArray`, `ListViewArray`, `BinaryViewArray` or `DenseUnionArray`. When writing a `ListArray` for example, we write slices of the null bitmap and value offsets and write the full values array. Ideally we should write a slice of the values and adjust the value offsets so they start at zero. The C++ implementation for example handles this [here](https://github.com/apache/arrow/blob/18c74b0733c9ff473a211259cf10705b2c9be891/cpp/src/arrow/ipc/writer.cc#L316). I will make a follow-up issue for this once this PR is merged.

### Are these changes tested?

Yes, I've added new unit tests for this.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: apache#40517

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
### Rationale for this change

Fixes concatenation of union arrays.

### What changes are included in this PR?

* Re-enables union array concatenation tests that were disabled in apache#41197 after making union array comparisons more thorough in the `ArrowReaderVerifier`
* Updates the union array concatenation logic to account for array lengths when concatenating the type and offset buffers, and fixes how the base offset is calculated.
* Fixes creating the type buffers for the array concatenation tests.

### Are these changes tested?

Yes, this uses the existing `ArrowArrayConcatenatorTests` tests.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: apache#41198

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants