Replies: 2 comments 2 replies
-
Hi @cino189. This is a good question. It would be great if there was better interoperability between the dotnet Arrow library and ParquetSharp, but that won't be easy to achieve without a lot of work, and it might make sense as a completely separate library.

It's not really correct that Parquet is built on Arrow: the Parquet format existed before the Arrow project, and there are differences in how values are represented in Parquet and in Arrow. The C++ Parquet implementation from the Arrow project that we use has two main ways to read and write Parquet: an Arrow based API for working with Arrow arrays, record batches and tables, and a slightly lower level API for working with raw Parquet data. It's this lower level API that we wrap in ParquetSharp, and better integration with Arrow data would probably require using the Arrow based API instead, which handles conversion to and from the Arrow format.

Decimal values are a good example of why you can't directly write Arrow data to Parquet without some conversion. Arrow stores decimal values according to the endianness of the platform writing the data, but Parquet stores decimal values in big-endian order when using fixed length byte array values. So directly writing Arrow decimal values as Parquet generally won't work; you need to flip the ordering of the bytes. If you really care about performance and want to avoid the overhead of converting to dotnet decimal values, you might want to use the lower level physical column writer directly, for example:

```csharp
Decimal128Array array = ...;
if (array.ByteWidth != 16)
{
    throw new Exception($"Unsupported decimal byte width ({array.ByteWidth})");
}

var columns = new ParquetSharp.Column[]
{
    new Column<decimal>("decimals", LogicalType.Decimal(precision: 29, scale: array.Scale)),
};

using var parquetFile = new ParquetFileWriter(outputPath, columns);
using var rowBatchWriter = parquetFile.AppendRowGroup();
// Use the physical FixedLenByteArray column writer rather than the logical decimal writer
using var colWriter = (ColumnWriter<FixedLenByteArray>) rowBatchWriter.NextColumn();
using (var byteBuffer = new ByteBuffer(1024))
{
    var byteArrayValues = new FixedLenByteArray[array.Length];
    for (var i = 0; i < array.Length; ++i)
    {
        var byteArray = byteBuffer.Allocate(16);
        // The Arrow array offset is in elements, and each Decimal128 value is 16 bytes wide
        var valueSlice = array.ValueBuffer.Span.Slice((array.Offset + i) * 16, 16);
        unsafe
        {
            // Reverse the byte order: Arrow values are little-endian on most platforms,
            // but Parquet fixed length byte array decimals are big-endian
            for (var b = 0; b < 16; ++b)
            {
                *((byte*) byteArray.Pointer + b) = valueSlice[15 - b];
            }
        }
        byteArrayValues[i] = new FixedLenByteArray(byteArray.Pointer);
    }
    colWriter.WriteBatch(byteArrayValues);
}
parquetFile.Close();
```

Just to be clear, I'm not recommending using this code; you're probably better off sticking with the higher level logical writer API unless the conversion overhead really matters.
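For comparison, that higher level path would look something like the following. This is just a minimal sketch, reusing `array` and `outputPath` from above, and assuming the `Decimal128Array.GetValue` helper is available in your Apache.Arrow version and that the values fit in the 29 digit precision a dotnet decimal supports:

```csharp
// Convert the Arrow decimal column to dotnet decimals (nullable to preserve nulls)
var decimalValues = new decimal?[array.Length];
for (var i = 0; i < array.Length; ++i)
{
    decimalValues[i] = array.GetValue(i); // assumed available; returns null for null slots
}

var columns = new ParquetSharp.Column[]
{
    new Column<decimal?>("decimals", LogicalType.Decimal(precision: 29, scale: array.Scale)),
};
using var parquetFile = new ParquetFileWriter(outputPath, columns);
using var rowGroupWriter = parquetFile.AppendRowGroup();
using var decimalWriter = rowGroupWriter.NextColumn().LogicalWriter<decimal?>();
decimalWriter.WriteBatch(decimalValues);
parquetFile.Close();
```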
-
The latest beta release of ParquetSharp, 13.0.0-beta1, has added integration with the Arrow C# library, so you can now write Arrow record batches directly to Parquet. This is documented here: https://github.com/G-Research/ParquetSharp/blob/master/docs/Arrow.md
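Based on the linked documentation, writing a record batch looks roughly like the sketch below. The types live in the `ParquetSharp.Arrow` namespace; treat this as an outline and check the docs for the exact API in the version you are using.

```csharp
using Apache.Arrow;
using ParquetSharp.Arrow;

public static class ArrowParquetExample
{
    // Write an Arrow record batch straight to a Parquet file using the
    // ParquetSharp.Arrow integration added in 13.0.0-beta1.
    public static void WriteBatch(RecordBatch batch, string path)
    {
        using var writer = new FileWriter(path, batch.Schema);
        writer.WriteRecordBatch(batch);
        writer.Close();
    }
}
```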
-
I have an application that leverages the Apache Arrow memory format for analytics. I can save a record batch directly as an Arrow file very quickly, but I am looking at Parquet for archival reasons. I can convert the Arrow memory format to Parquet successfully for primitive types by simply passing a ReadOnlySpan of the specific primitive type to the LogicalColumnWriter, doing something like this:

```csharp
using (var currentColWriter = columnWriter.NextColumn().LogicalWriter<double>())
{
    currentColWriter.WriteBatch(((DoubleArray) array).Values);
}
```

However, for non-primitive types like decimal I need to create an array of decimals by parsing the read-only span and then pass that to the column writer. That means converting bytes to decimals on write, and doing it again on read when I want to read a Parquet file back into the Arrow memory format for consumption.

Considering that Parquet is built on top of Arrow, I am wondering: is there a way to directly convert an Arrow record batch to Parquet and save it in Parquet format rather than in Arrow?
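For reference, the conversion on read that this describes looks roughly like the following with ParquetSharp's logical reader; `inputPath` and the column index are placeholders, and rebuilding the Arrow array from the resulting decimals is a separate step:

```csharp
// Read the decimal column back as dotnet decimals using the logical reader
using var parquetFile = new ParquetFileReader(inputPath);
using var rowGroupReader = parquetFile.RowGroup(0);
var numRows = checked((int) rowGroupReader.MetaData.NumRows);
using var decimalReader = rowGroupReader.Column(0).LogicalReader<decimal?>();
decimal?[] values = decimalReader.ReadAll(numRows);
// values can then be used to rebuild an Arrow Decimal128Array for in-memory analytics
```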