Replies: 2 comments 2 replies
-
Hi @cino189. This is a good question. It would be great if there was better interoperability between the dotnet Arrow library and ParquetSharp, but that won't be easy to achieve without a lot of work, and it might make sense as a completely separate library.

It's not really correct that Parquet is built on Arrow: the Parquet format existed before the Arrow project, and there are differences in how values are represented in Parquet and in Arrow. The C++ Parquet implementation from the Arrow project that we use has two main ways to read and write Parquet: an Arrow based API for working with Arrow arrays, record batches and tables, and a slightly lower level API for working with raw Parquet data. It's this lower level API that we wrap in ParquetSharp, and better integration with Arrow data would probably require using the Arrow based API instead, which handles conversion to and from the Arrow format.

Decimal values are a good example of why you can't directly write Arrow data to Parquet without some conversion. Arrow stores decimal values according to the endianness of the platform writing the data, but Parquet stores decimal values in big-endian order when using fixed length byte array values. So directly writing Arrow decimal values as Parquet generally won't work; you need to flip the ordering of the bytes. If you really care about performance and want to avoid the overhead of converting to dotnet decimal values, you might want to use the lower level physical column writer directly, for example:

```csharp
Decimal128Array array = ...;
if (array.ByteWidth != 16)
{
    throw new Exception($"Unsupported decimal byte width ({array.ByteWidth})");
}

var columns = new ParquetSharp.Column[]
{
    new Column<decimal>("decimals", LogicalType.Decimal(precision: 29, scale: array.Scale)),
};

using var parquetFile = new ParquetFileWriter(outputPath, columns);
using var rowBatchWriter = parquetFile.AppendRowGroup();
// Use the physical FixedLenByteArray column writer rather than the logical decimal writer
using var colWriter = (ColumnWriter<FixedLenByteArray>) rowBatchWriter.NextColumn();
using (var byteBuffer = new ByteBuffer(1024))
{
    var byteArrayValues = new FixedLenByteArray[array.Length];
    for (var i = 0; i < array.Length; ++i)
    {
        var byteArray = byteBuffer.Allocate(16);
        // The Arrow array offset is in elements, and each Decimal128 value is 16 bytes wide
        var valueSlice = array.ValueBuffer.Span.Slice((array.Offset + i) * 16, 16);
        unsafe
        {
            // Reverse the byte order: Arrow values are little-endian on most platforms,
            // but Parquet fixed length byte array decimals are big-endian
            for (var b = 0; b < 16; ++b)
            {
                *((byte*) byteArray.Pointer + b) = valueSlice[15 - b];
            }
        }
        byteArrayValues[i] = new FixedLenByteArray(byteArray.Pointer);
    }
    colWriter.WriteBatch(byteArrayValues);
}
parquetFile.Close();
```

Just to be clear, I'm not recommending using this code; you're probably better off sticking with the higher level logical writer API unless the conversion overhead really matters.
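For comparison, that higher level path would look something like the following. This is just a minimal sketch, reusing `array` and `outputPath` from above, and assuming the `Decimal128Array.GetValue` helper is available in your Apache.Arrow version and that the values fit in the 29 digit precision a dotnet decimal supports:

```csharp
// Convert the Arrow decimal column to dotnet decimals (nullable to preserve nulls)
var decimalValues = new decimal?[array.Length];
for (var i = 0; i < array.Length; ++i)
{
    decimalValues[i] = array.GetValue(i); // assumed available; returns null for null slots
}

var columns = new ParquetSharp.Column[]
{
    new Column<decimal?>("decimals", LogicalType.Decimal(precision: 29, scale: array.Scale)),
};
using var parquetFile = new ParquetFileWriter(outputPath, columns);
using var rowGroupWriter = parquetFile.AppendRowGroup();
using var decimalWriter = rowGroupWriter.NextColumn().LogicalWriter<decimal?>();
decimalWriter.WriteBatch(decimalValues);
parquetFile.Close();
```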
-
The latest beta release of ParquetSharp, 13.0.0-beta1, has added integration with the Arrow C# library, so you can now write Arrow record batches directly to Parquet. This is documented here: https://github.com/G-Research/ParquetSharp/blob/master/docs/Arrow.md
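Based on the linked documentation, writing a record batch looks roughly like the sketch below. The types live in the `ParquetSharp.Arrow` namespace; treat this as an outline and check the docs for the exact API in the version you are using.

```csharp
using Apache.Arrow;
using ParquetSharp.Arrow;

public static class ArrowParquetExample
{
    // Write an Arrow record batch straight to a Parquet file using the
    // ParquetSharp.Arrow integration added in 13.0.0-beta1.
    public static void WriteBatch(RecordBatch batch, string path)
    {
        using var writer = new FileWriter(path, batch.Schema);
        writer.WriteRecordBatch(batch);
        writer.Close();
    }
}
```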
-
I have an application that leverages the Apache Arrow memory format for analytics. I can save a record batch directly as an Arrow file very quickly, but I am looking at Parquet for archival reasons. I can convert the Arrow memory format to Parquet successfully for primitive types by simply passing a ReadOnlySpan of the specific primitive type to the LogicalColumnWriter, doing something like this:

```csharp
using (var currentColWriter = columnWriter.NextColumn().LogicalWriter<double>())
{
    currentColWriter.WriteBatch(((DoubleArray) array).Values);
}
```

However, for non-primitive types like decimal I need to create an array of decimals by parsing the read-only span and then pass that to the column writer. That means converting bytes to decimals on write, and doing it again on read when I want to read a Parquet file back into the Arrow memory format for consumption.

Considering that Parquet is built on top of Arrow, I am wondering: is there a way to directly convert an Arrow record batch to Parquet and save it in Parquet format rather than in Arrow?
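For reference, the conversion on read that this describes looks roughly like the following with ParquetSharp's logical reader; `inputPath` and the column index are placeholders, and rebuilding the Arrow array from the resulting decimals is a separate step:

```csharp
// Read the decimal column back as dotnet decimals using the logical reader
using var parquetFile = new ParquetFileReader(inputPath);
using var rowGroupReader = parquetFile.RowGroup(0);
var numRows = checked((int) rowGroupReader.MetaData.NumRows);
using var decimalReader = rowGroupReader.Column(0).LogicalReader<decimal?>();
decimal?[] values = decimalReader.ReadAll(numRows);
// values can then be used to rebuild an Arrow Decimal128Array for in-memory analytics
```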