
[C#] IPC stream writer should write slices of buffers when writing sliced arrays #40517

Closed
adamreeve opened this issue Mar 13, 2024 · 7 comments

@adamreeve
Contributor

adamreeve commented Mar 13, 2024

Describe the enhancement requested

Currently the C# ArrowStreamWriter always writes the full contents of each buffer. This behaviour differs from the Python/C++ implementation, which writes only slices of the buffers when an array has a nonzero offset or a size in bytes smaller than the buffer length. The difference can be observed by comparing the size of an IPC file written for a whole RecordBatch with the sizes of files written for slices of the same data.

Python:

import pyarrow as pa
import numpy as np


num_rows = 400
rows_per_batch = 100

ints = pa.array(np.arange(0, num_rows, 1, dtype=np.int32))
floats = pa.array(np.arange(0, num_rows / 10.0, 0.1, dtype=np.float32))

all_data = pa.RecordBatch.from_arrays([ints, floats], names=["a", "b"])

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, all_data.schema) as writer:
    writer.write_batch(all_data)
buf = sink.getvalue()
print(f"Size of serialized full batch = {buf.size}")

for offset in range(0, num_rows, rows_per_batch):
    slice = all_data.slice(offset, rows_per_batch)
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, slice.schema) as writer:
        writer.write_batch(slice)
    buf = sink.getvalue()
    print(f"Size of serialized slice at offset {offset} = {buf.size}")

This outputs:

Size of serialized full batch = 3576
Size of serialized slice at offset 0 = 1176
Size of serialized slice at offset 100 = 1176
Size of serialized slice at offset 200 = 1176
Size of serialized slice at offset 300 = 1176

Each serialized slice is about 1/4 the size of the full batch, after accounting for the overhead of metadata.

Doing the same in C#:

using System;
using System.IO;
using System.Linq;
using Apache.Arrow;
using Apache.Arrow.Ipc;

const int numRows = 400;
const int rowsPerBatch = 100;

var allData = new RecordBatch.Builder()
    .Append("a", false, col => col.Int32(array => array.AppendRange(Enumerable.Range(0, numRows))))
    .Append("b", false, col => col.Float(array => array.AppendRange(Enumerable.Range(0, numRows).Select(i => 0.1f * i))))
    .Build();

{
    using var ms = new MemoryStream();
    using var writer = new ArrowFileWriter(ms, allData.Schema, false, new IpcOptions());
    await writer.WriteStartAsync();
    await writer.WriteRecordBatchAsync(allData);
    await writer.WriteEndAsync();

    Console.WriteLine($"Size of serialized full batch = {ms.Length}");
}

for (var offset = 0; offset < allData.Length; offset += rowsPerBatch)
{
    var arraySlices = allData.Arrays
        .Select(arr => ArrowArrayFactory.Slice(arr, offset, rowsPerBatch))
        .ToArray();
    var slice = new RecordBatch(allData.Schema, arraySlices, arraySlices[0].Length);

    using var ms = new MemoryStream();
    using var writer = new ArrowFileWriter(ms, slice.Schema, false, new IpcOptions());
    await writer.WriteStartAsync();
    await writer.WriteRecordBatchAsync(slice);
    await writer.WriteEndAsync();

    Console.WriteLine($"Size of serialized slice at offset {offset} = {ms.Length}");
}

This outputs:

Size of serialized full batch = 3802
Size of serialized slice at offset 0 = 3802
Size of serialized slice at offset 100 = 3802
Size of serialized slice at offset 200 = 3802
Size of serialized slice at offset 300 = 3802

Writing a slice of the data results in the same file size as writing the full data, but we'd like to be able to break IPC data into smaller slices in order to send it over a transport that has a message size limit. We're currently working around this by copying the data after slicing.

From a quick look at the C++ implementation, one complication is dealing with null bitmaps, which need to be copied to ensure the start is aligned with a byte boundary.
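For illustration, here is a minimal sketch of how a sliced validity bitmap might be copied so that it starts on a byte boundary. This is an assumed approach rather than actual writer code; it uses the library's `BitUtility.GetBit` and `ArrowBuffer.BitmapBuilder`, and the helper name is hypothetical:

```csharp
using System;
using Apache.Arrow;

// Hypothetical helper: copy `length` bits of a validity bitmap starting at
// bit `offset` into a new buffer whose first bit is byte-aligned.
static ArrowBuffer CopyBitmapSlice(ReadOnlySpan<byte> bitmap, int offset, int length)
{
    var builder = new ArrowBuffer.BitmapBuilder(length);
    for (var i = 0; i < length; i++)
    {
        // Read bit (offset + i) from the source and append it as bit i of the copy.
        builder.Append(BitUtility.GetBit(bitmap, offset + i));
    }
    return builder.Build();
}
```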

Component(s)

C#

@adamreeve
Contributor Author

I initially thought that the problem was just that the files were bigger than they need to be, and that the array offsets would be correctly round-tripped. But on further investigation, it looks like the IPC files written from sliced arrays are invalid and can't be read back by the C# library or Python library.

Reading these sliced batches back from C# fails with:

System.IO.InvalidDataException
Null count length must be >= 0
   at Apache.Arrow.Ipc.ArrowReaderImplementation.LoadField(MetadataVersion version, RecordBatchEnumerator& recordBatchEnumerator, Field field, FieldNode& fieldNode, ByteBuffer bodyData, IBufferCreator bufferCreator)

and Python displays the arrays as invalid:

<Invalid array: Buffer #0 too small in array of type int32 and length 100: expected at least 13 byte(s), got 0>

@CurtHagenlocher
Contributor

There's code which suggests a null count of -1 was intended to mean "RecalculateNullCount", but apparently nothing actually does the recalculation.

@CurtHagenlocher
Contributor

CurtHagenlocher commented Mar 14, 2024

Your example works if I add a quick hack, e.g.

+            if (nullCount == RecalculateNullCount)
+            {
+                NullCount = Length - CalculateValidCount();
+            }

and

+        public int CalculateValidCount()
+        {
+            switch (DataType)
+            {
+                case FixedWidthType fixedWidthType:
+                    ArrowBuffer nullBuffer = Buffers[0];
+                    return nullBuffer.IsEmpty ? 0 : BitUtility.CountBits(nullBuffer.Span);
+                default:
+                    // TODO:
+                    throw new NotImplementedException();
+            }
+        }

I can finish implementing this over the coming weekend, or someone else could take it on.

@adamreeve
Contributor Author

When you say it works, I guess you mean that the data can be round-tripped and read correctly (at least from .NET), but the files for a slice are still as large as writing the unsliced data?

I should be able to work on fixing this eventually, but I have other more urgent work at the moment so am not sure when I will get around to this.

@CurtHagenlocher
Contributor

Yeah, it was late and I only had time for a quick look. This avoids the exception but ends up producing the wrong results. The entire ArrowBuffer is being serialized instead of just the part needed for the slice, and while the length is recorded correctly, there's (of course) no offset being serialized, so the data read back starts at the beginning of the buffer for each slice -- meaning it's the wrong data.

The root of this problem is that ArrowStreamWriter.ArrowRecordBatchFlatBufferBuilder is serializing entire buffers instead of just the parts of the buffer associated with the slice. I suspect this will not be a trivial fix.
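For a fixed-width array, the slice of the values buffer that actually needs to be written can be computed directly from the array offset and length. A minimal sketch, with a hypothetical helper name (this is not the ArrowRecordBatchFlatBufferBuilder code):

```csharp
using Apache.Arrow;

// Hypothetical sketch: return just the bytes of a fixed-width value buffer
// that an array slice refers to, instead of the whole buffer.
static ArrowBuffer SliceValueBuffer(ArrowBuffer values, int byteWidth, int offset, int length)
{
    var start = offset * byteWidth;      // e.g. byteWidth = 4 for Int32 or Float
    var byteLength = length * byteWidth; // bytes actually covered by the slice
    return new ArrowBuffer(values.Memory.Slice(start, byteLength));
}
```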

@adamreeve
Contributor Author

take

CurtHagenlocher pushed a commit that referenced this issue Apr 15, 2024
### Rationale for this change

Fixes writing sliced arrays to IPC files or streams, so that they can be successfully read back in. Previously, writing such data would succeed but then couldn't be read.

### What changes are included in this PR?

* Fixes `BinaryViewArray.GetBytes` to account for the array offset
* Fixes `FixedSizeBinaryArray.GetBytes` to account for the array offset
* Updates `ArrowStreamWriter` so that it writes slices of buffers when required, and handles slicing bitmap arrays by creating a copy if the offset isn't a multiple of 8
* Refactors `ArrowStreamWriter`, making the `ArrowRecordBatchFlatBufferBuilder` class responsible for building a list of field nodes as well as buffers. This was required to avoid having to duplicate logic for handling array types with child data between the `ArrowRecordBatchFlatBufferBuilder` class and the `CreateSelfAndChildrenFieldNodes` method, which I've removed.

Note that after this change, we still write more data than required when writing a slice of a `ListArray`, `BinaryArray`, `ListViewArray`, `BinaryViewArray` or `DenseUnionArray`. When writing a `ListArray` for example, we write slices of the null bitmap and value offsets and write the full values array. Ideally we should write a slice of the values and adjust the value offsets so they start at zero. The C++ implementation for example handles this [here](https://github.com/apache/arrow/blob/18c74b0733c9ff473a211259cf10705b2c9be891/cpp/src/arrow/ipc/writer.cc#L316). I will make a follow-up issue for this once this PR is merged.
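As a sketch of that follow-up optimization (an assumed approach mirroring the linked C++ code, not something implemented in this PR), the value offsets of a sliced `ListArray` could be rebased so the first offset is zero, which would allow writing only the referenced portion of the values buffer:

```csharp
using System;

// Hypothetical sketch: rebase int32 list offsets for a slice so they start
// at zero, so that only the values between offsets[arrayOffset] and
// offsets[arrayOffset + length] would need to be written.
static int[] RebaseOffsets(ReadOnlySpan<int> offsets, int arrayOffset, int length)
{
    var first = offsets[arrayOffset];
    var rebased = new int[length + 1];
    for (var i = 0; i <= length; i++)
    {
        rebased[i] = offsets[arrayOffset + i] - first;
    }
    return rebased;
}
```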

### Are these changes tested?

Yes, I've added new unit tests for this.

### Are there any user-facing changes?

Yes, this is a user-facing bug fix.
* GitHub Issue: #40517

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Curt Hagenlocher <curt@hagenlocher.org>
@CurtHagenlocher CurtHagenlocher added this to the 17.0.0 milestone Apr 15, 2024
@CurtHagenlocher
Contributor

Issue resolved by pull request #41197

@raulcd raulcd modified the milestones: 17.0.0, 16.1.0 Apr 29, 2024
raulcd pushed a commit that referenced this issue Apr 29, 2024
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
tolleybot pushed a commit to tmct/arrow that referenced this issue May 4, 2024
rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024