DataFrame GetMutableBuffer method and ReadOnlyBuffer issues

As it was mentioned in #6642 DataFrame has a lot of boilerplate code like:

```
DataFrameBuffer<TResult> resultMutableBuffer = DataFrameBuffer<TResult>.GetMutableBuffer(resultBuffer);
resultContainer.Buffers[b] = resultMutableBuffer;
Span<TResult> resultSpan = resultMutableBuffer.Span;
DataFrameBuffer<byte> resultMutableNullBitMapBuffer = DataFrameBuffer<byte>.GetMutableBuffer(resultContainer.NullBitMapBuffers[b]);
resultContainer.NullBitMapBuffers[b] = resultMutableNullBitMapBuffer;
Span<byte> resultNullBitMapSpan = resultMutableNullBitMapBuffer.Span;
```

It looks like it's used for working with readonly memory, when dataframe column is created from the Apache Arrow RecordBatch in

`public static DataFrame FromArrowRecordBatch(RecordBatch recordBatch)`

It's also used in constructor of PrimitiveDataFrameColumn:

`public PrimitiveDataFrameColumn(string name, ReadOnlyMemory<byte> buffer, ReadOnlyMemory<byte> nullBitMap, int length = 0, int nullCount = 0) `

which I think shouldn't be public, as it highly depends on internal implementation of PrimitiveColumn.

There is an issue with FromArrowRecordBatch factory method:

RecordBatch is a disposable object. Apache Arrow by default uses NativeMemoryAllocator to allocate unmanaged memory (for example, this default allocator is used in Spark.Net to create RecorBatch and pass it to DataFrame.FromArrowRecordBatch factory method).
So it's up to a DataFrame to hold the link to the RecordBatch and correctly Dispose it. Or it has to copy the unmanaged readonly memory from the RecordBatch into managed buffers (that exactly what is happening in GetMutableBuffer on attempt to edit data), but in this case we can avoid using ReadOnlyBuffers at all or at least limit it usage to ReadOnlyDataFrame class.

The suggestion is:
1) Avoid using GetMutableBuffer and ReadOnlyBuffers in the DataFrame, copy memory from Apache Arrow Record Batch on DataFrame creation (anyway we have to do on any attempt to edit DataFrame)
2) Introduce ReadOnlyDataFrame with limit set of operation (only readonly like Sort, GroupBy, Filter and etc and other with inPlace set to false). ReadOnlyDataFrame will also implement ML.IDataView. So it will be the way to create ReadOnlyDataFrame from Apache Arrow RecordsBatch without copy operation and use it in ML .Net
3) Make ReadOnlyDataFrameBuffer to implement IDisposable interface. For example, similar to Apache ArrowBuffer:

```
public readonly partial struct ArrowBuffer : IEquatable<ArrowBuffer>, IDisposable
{
    private readonly IMemoryOwner<byte> _memoryOwner;
    private readonly ReadOnlyMemory<byte> _memory;

    public static ArrowBuffer Empty => new ArrowBuffer(Memory<byte>.Empty);

    public ArrowBuffer(ReadOnlyMemory<byte> data)
    {
        _memoryOwner = null;
        _memory = data;
    }

    internal ArrowBuffer(IMemoryOwner<byte> memoryOwner)
    {
        // When wrapping an IMemoryOwner, don't cache the Memory<byte>
        // since the owner may be disposed, and the cached Memory would
        // be invalid.

        _memoryOwner = memoryOwner;
        _memory = Memory<byte>.Empty;
    }

    public ReadOnlyMemory<byte> Memory =>
        _memoryOwner != null ? _memoryOwner.Memory : _memory;
...

public void Dispose()
{
    _memoryOwner?.Dispose();
}
```



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DataFrame GetMutableBuffer method and ReadOnlyBuffer issues #6715

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

DataFrame GetMutableBuffer method and ReadOnlyBuffer issues #6715

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions