Performant streaming of Parquet to Postgres #3641

adriangb · 2023-01-30T19:52:47Z

adriangb
Jan 30, 2023

I couldn't find anything out there so I'm looking into writing something to move data from parquet files (possibly stored in an object store or HTTP) into Postgres as fast as possible. For my use case I have a couple of requirements:

Able to work with larger than memory files without blowing up memory or taking forever
Can be hooked into from Python

My initial attempt was using Polars but unfortunately it is not able to read Parquet files in batches efficiently and converting from Arrow -> Python types (via Polars) and then Python types -> Postgres (via asyncpg) is slow.

So naturally I'm looking at doing this in Rust to speed things up.

Here's my plan, I want to see if it seems viable. I'll have a Rust library that goes from RecordBatch to Postgres' binary format. It'll look something like this:

pub struct PostgresBinaryCopyEncoder {}

impl PostgresBinaryCopyEncoder {
    // Initialize with arrow schema for these RecordBatches
    pub fn new(types: Schema) -> BinaryCopyInWriter {}

    // If this is the first call inlcude headers in the returned data
    // Gather all of the columns via RecordBatch.columns
    // Zip the columns and iterate over them row-wise
    // For each value use the column's type to downcast into a Rust type
    // Use the ToSql trait which is already implemented for Rust values to encode into pg's binary format
    pub fn encode(&mut self, values: &RecordBatch) -> Result<&[u8], Error> {}

   // Get the footers
   pub fn finalize(&mut self) -> Result<&[u8, Error]> {}

And then a Python side that manages the IO so that it can decide to stream the bytes into Postgres directly, write to a file somewhere, read the parquet file from disk, in threads for an async environment, etc. That would look something like this:

from thislib,_rust_stuff import PosgresBinaryCopyEncoder as PosgresBinaryCopyEncoderRust

class PosgresBinaryCopyEncoder:
    def __init__(self, stream: RecordBatchStreamReader) -> None:  ...

   def stream(self) -> Iterable[bytes]:
      schema = self._stream.schema
      encoder = RecordBatchStreamReader(schema)
      while True:
          try:
              batch = self._stream.read_next_batch()
              yield encoder.encoder(batch)
          except StopIteration:
              yield encoder.finalize()

I've already written something to encode Rust types into Postgres types based on rust-postgres-binary-copy. Where I'm a bit confused is on how to iterate over a RecordBatch and get Rust types out of it. Alternatively I guess we could iterate over a RecordBatch and get arrow data out of it but then I'd need to re-implement the postgres_types::ToSql trait for all arrow types. And I expect an extra hop from Arrow -> Rust native type will be quite cheap.

tustvold · 2023-01-31T10:40:16Z

tustvold
Jan 31, 2023
Collaborator

Where I'm a bit confused is on how to iterate over a RecordBatch and get Rust types out of it.

Is the schema known at compile time or is it dynamic?

4 replies

alamb Jan 31, 2023
Collaborator

confused is on how to iterate over a RecordBatch and get Rust types out of it.

What you typically have to do is to downcast each column to its concrete type (see https://docs.rs/arrow/32.0.0/arrow/#type-erasure--trait-objects) and then you can iterate over the native types

So something like

let float_array = array.as_any().downcast_ref::<Float64Array>().unwrap();

/// Iterate over Option<f64> in the array
for maybe_float in float_array.iter() {

}

adriangb Jan 31, 2023
Author

It would not be know at Rust compile time. @alamb yep that’s exactly what’s needed! It ends up being a ton of boilerplate because I have to handle TimestampMillisecondArray and TimestampMicrosecondArray separately. But it’s working!

I’ll report back once I have some performance numbers.

tustvold Jan 31, 2023
Collaborator

You may be able to reduce the boilerplate with the macros under https://docs.rs/arrow-array/latest/arrow_array/macro.downcast_primitive_array.html

adriangb Jan 31, 2023
Author

Amazing, I’ll give that a shot!

adriangb · 2023-02-01T01:27:38Z

adriangb
Feb 1, 2023
Author

So the good news is this works! Here's what I've got so far: https://github.com/adriangb/pgpq
It's not particularly fast and it crashed on some row of the NYC Yellow Cab dataset.
I suspect the way I wrote it is super inefficient and it could be done much better. Just look at all those buffer allocations...

9 replies

adriangb Feb 4, 2023
Author

Thanks for the input. I imagine what I have right now is plenty fast for most used cases given there’s network latency involved, but I am curious just from a learning perspective.

I believe the columnar approach is approximately what I first tried, but it was probably doing it quite suboptimally. If I go that route, I think I have no option but to allocate multiple buffers (one per row) or keep track of every index in a single buffer and then re-assemble the Vec<Vec> or Vec + tracking info to write it out. Right?

With the optimizations that the compiler applies like auto vectorizarion, is it important to keep whatever happens within the loop to a minimum, or does that not matter much? I imagine the more that is going on there the harder it is to optimize. I guess this leads to the choice between looping and putting out &dyn ToSql objects iterating over the downcast array or actually writing those out to the buffer within the loop (which is a couple more function calls, including doing the dynamic dispatch).

I’m not sure I know what bump allocation means. My goal was to allocate a fixed size buffer and flush it to Postgres once it’s full with the idea that the Python side of this could use threads to do parallel work and push the bottleneck on Postgres.

adriangb Feb 4, 2023
Author

I imagine that the copying of data to pass it into Python is also sub-optimal, but I don’t see any way around that. One test I need to do is pure Rust runtime vs mixed runtime.

tustvold Feb 4, 2023
Collaborator

If I go that route, I think I have no option but to allocate multiple buffers (one per row) or keep track of every index in a single buffer and then re-assemble the Vec or Vec + tracking info to write it out

The way the arrow row format works is it does a first parse to work out how long each row is going to be in a columnar fashion. It then creates two buffers, a single buffer to contain all the row data, and a second to contain the offsets to slice this buffer for each row. It can then populate these rows in a columnar fashion as it has already established where each row is "located".

It can then expose an iterator over these rows, by iterating over the offsets and applying this to the row buffer

I’m not sure I know what bump allocation means

If you're appending to say a Vec it will allocate x bytes initially, and then if it needs more it will double the size of this allocation. This will keep reallocating as the backing buffer grows, and many of these reallocations will involve copying all the currently populated data to a new memory location.

adriangb Feb 4, 2023
Author

That makes sense. And you’d even be able to read from the head of the buffer as it’s populated. I’ll give it a shot just to see if I can make it work, it’ll be a fun experiment for me. My guess is that since ultimately I need to allocate a fixed chunk of memory to pass it out to Python I won’t be able to do much better on the Rust side.

adriangb Feb 6, 2023
Author

Thank you for taking the time to explain. I was able to get something like that working in a benchmark: https://github.com/adriangb/pgpq/blob/main/pgpq/benches/columnar_vs_rows.rs

It does indeed seem to be 25-50% faster, at least on the input data and sizes I used. I think most of the gain is from reduced reallocations as you described above.

That said I'm not sure I'll be able to use that technique: I don't actually know the size of each value in the buffer until I call .to_sql() on it which requires an output buffer to write into. The only way I see to get around this is to duplicate or replace logic in the existing ToSql implementations.

adriangb · 2023-02-01T18:41:52Z

adriangb
Feb 1, 2023
Author

Interestingly I think I may have found a bug while working on this: #3646

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performant streaming of Parquet to Postgres #3641

{{title}}

Replies: 3 comments 13 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Performant streaming of Parquet to Postgres #3641

adriangb Jan 30, 2023

Replies: 3 comments · 13 replies

tustvold Jan 31, 2023 Collaborator

alamb Jan 31, 2023 Collaborator

adriangb Jan 31, 2023 Author

tustvold Jan 31, 2023 Collaborator

adriangb Jan 31, 2023 Author

adriangb Feb 1, 2023 Author

adriangb Feb 4, 2023 Author

adriangb Feb 4, 2023 Author

tustvold Feb 4, 2023 Collaborator

adriangb Feb 4, 2023 Author

adriangb Feb 6, 2023 Author

adriangb Feb 1, 2023 Author

adriangb
Jan 30, 2023

Replies: 3 comments 13 replies

tustvold
Jan 31, 2023
Collaborator

alamb Jan 31, 2023
Collaborator

adriangb Jan 31, 2023
Author

tustvold Jan 31, 2023
Collaborator

adriangb Jan 31, 2023
Author

adriangb
Feb 1, 2023
Author

adriangb Feb 4, 2023
Author

adriangb Feb 4, 2023
Author

tustvold Feb 4, 2023
Collaborator

adriangb Feb 4, 2023
Author

adriangb Feb 6, 2023
Author

adriangb
Feb 1, 2023
Author