As mentioned in #6642, DataFrame has a lot of boilerplate code. It looks like it's used for working with read-only memory, when a DataFrame column is created from an Apache Arrow RecordBatch in
public static DataFrame FromArrowRecordBatch(RecordBatch recordBatch)
It's also used in the constructor of PrimitiveDataFrameColumn:
public PrimitiveDataFrameColumn(string name, ReadOnlyMemory<byte> buffer, ReadOnlyMemory<byte> nullBitMap, int length = 0, int nullCount = 0)
which I think shouldn't be public, as it depends heavily on the internal implementation of PrimitiveColumn.
There is an issue with the FromArrowRecordBatch factory method:
RecordBatch is a disposable object. By default, Apache Arrow uses NativeMemoryAllocator to allocate unmanaged memory (for example, Spark.Net uses this default allocator to create a RecordBatch and pass it to the DataFrame.FromArrowRecordBatch factory method).
So it is up to the DataFrame to hold a reference to the RecordBatch and dispose of it correctly. Otherwise it has to copy the unmanaged read-only memory from the RecordBatch into managed buffers (which is exactly what happens in GetMutableBuffer on any attempt to edit data). But in that case we can avoid using ReadOnlyBuffers at all, or at least limit their usage to a ReadOnlyDataFrame class.
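To make the second option concrete, here is a minimal sketch of copying memory out of a disposable owner into a managed array, so the copy stays valid after the owner is disposed. It is only an illustration: `ManagedCopySketch`, `CopyToManaged` and `Demo` are hypothetical names, and `MemoryPool<byte>.Shared` stands in for Arrow's allocator, which is not used here.

```csharp
using System;
using System.Buffers;

// Hypothetical helper, not part of Microsoft.Data.Analysis or Apache Arrow.
public static class ManagedCopySketch
{
    // Copies the bytes into a new managed array. The result no longer
    // refers to the source storage, so the source can be released safely.
    public static Memory<byte> CopyToManaged(ReadOnlyMemory<byte> source)
    {
        return source.ToArray();
    }

    public static Memory<byte> Demo()
    {
        Memory<byte> copy;

        // MemoryPool<byte>.Shared is a stand-in for a native allocator.
        // Rent may return more than 4 bytes, hence the explicit slices.
        using (IMemoryOwner<byte> owner = MemoryPool<byte>.Shared.Rent(4))
        {
            owner.Memory.Span.Slice(0, 4).Fill(7);
            copy = CopyToManaged(owner.Memory.Slice(0, 4));
        }

        // The owner is disposed here; the copy is still usable.
        return copy;
    }
}
```

The cost is one full copy per buffer at creation time, which is the same copy GetMutableBuffer would make later on the first edit.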
The suggestion is:
Avoid using GetMutableBuffer and ReadOnlyBuffers in the DataFrame, and copy memory from the Apache Arrow RecordBatch on DataFrame creation (we have to do it anyway on any attempt to edit the DataFrame).
Introduce a ReadOnlyDataFrame with a limited set of operations (only read-only ones like Sort, GroupBy, Filter, etc., and others with inPlace set to false). ReadOnlyDataFrame will also implement ML.IDataView, so it will be the way to create a ReadOnlyDataFrame from an Apache Arrow RecordBatch without a copy operation and use it in ML.NET.
Make ReadOnlyDataFrameBuffer implement the IDisposable interface. For example, similar to the Apache Arrow ArrowBuffer:
    public readonly partial struct ArrowBuffer : IEquatable<ArrowBuffer>, IDisposable
    {
        private readonly IMemoryOwner<byte> _memoryOwner;
        private readonly ReadOnlyMemory<byte> _memory;

        public static ArrowBuffer Empty => new ArrowBuffer(Memory<byte>.Empty);

        public ArrowBuffer(ReadOnlyMemory<byte> data)
        {
            _memoryOwner = null;
            _memory = data;
        }

        internal ArrowBuffer(IMemoryOwner<byte> memoryOwner)
        {
            // When wrapping an IMemoryOwner, don't cache the Memory<byte>
            // since the owner may be disposed, and the cached Memory would
            // be invalid.
            _memoryOwner = memoryOwner;
            _memory = Memory<byte>.Empty;
        }

        public ReadOnlyMemory<byte> Memory =>
            _memoryOwner != null ? _memoryOwner.Memory : _memory;

        ...

        public void Dispose()
        {
            _memoryOwner?.Dispose();
        }
    }
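Applied to DataFrame, the same pattern might look roughly like the sketch below. This is an assumption about the shape of such a type, not the existing ReadOnlyDataFrameBuffer: the class name `DisposableReadOnlyBuffer<T>`, its constructors and its members are all hypothetical.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

// Hypothetical sketch; not the existing Microsoft.Data.Analysis type.
public sealed class DisposableReadOnlyBuffer<T> : IDisposable
    where T : unmanaged
{
    private readonly IMemoryOwner<byte> _owner; // null when the memory is not owned
    private readonly ReadOnlyMemory<byte> _memory;
    private readonly int _byteLength;
    private bool _disposed;

    // Wraps memory owned by someone else; Dispose releases nothing.
    public DisposableReadOnlyBuffer(ReadOnlyMemory<byte> memory)
    {
        _memory = memory;
        _byteLength = memory.Length;
    }

    // Takes ownership of the memory; Dispose releases it.
    public DisposableReadOnlyBuffer(IMemoryOwner<byte> owner, int byteLength)
    {
        _owner = owner ?? throw new ArgumentNullException(nameof(owner));
        _byteLength = byteLength;
    }

    // Typed read-only view; as in ArrowBuffer, the owner's Memory is not cached.
    public ReadOnlySpan<T> Span
    {
        get
        {
            if (_disposed)
                throw new ObjectDisposedException(nameof(DisposableReadOnlyBuffer<T>));

            ReadOnlyMemory<byte> bytes =
                _owner != null ? _owner.Memory.Slice(0, _byteLength) : _memory;
            return MemoryMarshal.Cast<byte, T>(bytes.Span);
        }
    }

    public int Length => Span.Length;

    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;
        _owner?.Dispose();
    }
}
```

A ReadOnlyDataFrame built on buffers like this would itself need to be disposable, so that disposing the frame releases the memory taken from the RecordBatch.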