Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame GetMutableBuffer method and ReadOnlyBuffer issues #6715

Open
asmirnov82 opened this issue May 29, 2023 · 0 comments
Open

DataFrame GetMutableBuffer method and ReadOnlyBuffer issues #6715

asmirnov82 opened this issue May 29, 2023 · 0 comments
Labels
enhancement New feature or request
Milestone

Comments

@asmirnov82
Copy link
Contributor

As it was mentioned in #6642 DataFrame has a lot of boilerplate code like:

DataFrameBuffer<TResult> resultMutableBuffer = DataFrameBuffer<TResult>.GetMutableBuffer(resultBuffer);
resultContainer.Buffers[b] = resultMutableBuffer;
Span<TResult> resultSpan = resultMutableBuffer.Span;
DataFrameBuffer<byte> resultMutableNullBitMapBuffer = DataFrameBuffer<byte>.GetMutableBuffer(resultContainer.NullBitMapBuffers[b]);
resultContainer.NullBitMapBuffers[b] = resultMutableNullBitMapBuffer;
Span<byte> resultNullBitMapSpan = resultMutableNullBitMapBuffer.Span;

It looks like it's used for working with readonly memory, when dataframe column is created from the Apache Arrow RecordBatch in

public static DataFrame FromArrowRecordBatch(RecordBatch recordBatch)

It's also used in constructor of PrimitiveDataFrameColumn:

public PrimitiveDataFrameColumn(string name, ReadOnlyMemory<byte> buffer, ReadOnlyMemory<byte> nullBitMap, int length = 0, int nullCount = 0)

which I think shouldn't be public, as it highly depends on internal implementation of PrimitiveColumn.

There is an issue with FromArrowRecordBatch factory method:

RecordBatch is a disposable object. Apache Arrow by default uses NativeMemoryAllocator to allocate unmanaged memory (for example, this default allocator is used in Spark.Net to create RecorBatch and pass it to DataFrame.FromArrowRecordBatch factory method).
So it's up to a DataFrame to hold the link to the RecordBatch and correctly Dispose it. Or it has to copy the unmanaged readonly memory from the RecordBatch into managed buffers (that exactly what is happening in GetMutableBuffer on attempt to edit data), but in this case we can avoid using ReadOnlyBuffers at all or at least limit it usage to ReadOnlyDataFrame class.

The suggestion is:

  1. Avoid using GetMutableBuffer and ReadOnlyBuffers in the DataFrame, copy memory from Apache Arrow Record Batch on DataFrame creation (anyway we have to do on any attempt to edit DataFrame)
  2. Introduce ReadOnlyDataFrame with limit set of operation (only readonly like Sort, GroupBy, Filter and etc and other with inPlace set to false). ReadOnlyDataFrame will also implement ML.IDataView. So it will be the way to create ReadOnlyDataFrame from Apache Arrow RecordsBatch without copy operation and use it in ML .Net
  3. Make ReadOnlyDataFrameBuffer to implement IDisposable interface. For example, similar to Apache ArrowBuffer:
public readonly partial struct ArrowBuffer : IEquatable<ArrowBuffer>, IDisposable
{
    private readonly IMemoryOwner<byte> _memoryOwner;
    private readonly ReadOnlyMemory<byte> _memory;

    public static ArrowBuffer Empty => new ArrowBuffer(Memory<byte>.Empty);

    public ArrowBuffer(ReadOnlyMemory<byte> data)
    {
        _memoryOwner = null;
        _memory = data;
    }

    internal ArrowBuffer(IMemoryOwner<byte> memoryOwner)
    {
        // When wrapping an IMemoryOwner, don't cache the Memory<byte>
        // since the owner may be disposed, and the cached Memory would
        // be invalid.

        _memoryOwner = memoryOwner;
        _memory = Memory<byte>.Empty;
    }

    public ReadOnlyMemory<byte> Memory =>
        _memoryOwner != null ? _memoryOwner.Memory : _memory;
...

public void Dispose()
{
    _memoryOwner?.Dispose();
}
@asmirnov82 asmirnov82 added the enhancement New feature or request label May 29, 2023
@ghost ghost added the untriaged New issue has not been triaged label May 29, 2023
@michaelgsharp michaelgsharp added this to the ML.NET Future milestone Jan 24, 2024
@ghost ghost removed the untriaged New issue has not been triaged label Jan 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants