My understanding is that this crate is optimized for reading into the owning enum `Value`, which encapsulates all types, much like e.g. `serde_json` does.
This design makes it expensive to deserialize into other formats. Specifically, the main challenge is that, given rows of strings, each row has to allocate a new `String` because `Value` owns its `String`, even when the values were already read into a `Vec<u8>` during buffering + decompression (see `Reader` and `Block`). This affects all non-copy types (`String`, `Bytes`, `Fixed` and `Enum`).
Formats such as arrow have their own way of storing strings. Thus, currently, `Value::String(String)` adds a large overhead (I can quantify it if we need to) to interoperate. Specifically, the data path from a `Read` of an uncompressed file with 1 column of strings to `Value` yields an extra allocation per row per column of non-copy types (`String`, `Bytes`, `Fixed` and `Enum`). Usually, it is more efficient to read into a `Vec<u8>` and produce non-owned values from it.
(There are other tricks to avoid re-allocating a `Vec<u8>` per `Block` + decompression, which can be discussed separately.)
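To make the per-row cost concrete, here is a hypothetical, simplified sketch of the owning path; `read_string_owned` and the one-variant `Value` are illustrative names, not the crate's actual code:

```rust
// Simplified illustration of the per-row allocation: the block buffer
// already owns the bytes, yet building an owning `Value::String` forces
// a copy for every row.

#[derive(Debug, PartialEq)]
enum Value {
    String(String),
}

/// Copies `len` bytes out of the already-buffered `block`, even though
/// `block` outlives every decoded value.
fn read_string_owned(block: &[u8], offset: usize, len: usize) -> Value {
    let bytes = block[offset..offset + len].to_vec(); // the extra allocation
    Value::String(String::from_utf8(bytes).unwrap()) // reuses the Vec's buffer
}
```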
One way to address this is to declare a "non-owned" `Value` (e.g. `ValueRef::String(&'a str)`) and have the decoder take the form `decode<'a>(&'a [u8]) -> Result<(ValueRef<'a>, usize)>`, where the `usize` is the number of bytes read.
Since each row block declares its number of bytes upfront, and we already read it into a `buf: Vec<u8>` as part of the buffering + decompression in `Reader`, we can leverage this `buf` to decode into non-owning `ValueRef`s, thereby avoiding the allocation per row per column.
We can offer `impl<'a> From<ValueRef<'a>> for Value` that calls `to_string()` (and the equivalent for other types such as `Bytes`), if the user wishes to remove the lifetime at the cost of an allocation.
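The borrowed variant, a slice-based decoder, and the opt-in conversion back to an owning `Value` could look like this. This is a sketch under assumptions: `ValueRef` and `decode_str` are hypothetical names, `Value` is reduced to a single variant, and the length-prefix decoding is inlined rather than shared with a `zag_i64` helper:

```rust
#[derive(Debug, PartialEq)]
enum Value {
    String(String),
}

/// Borrowed counterpart of `Value`; the other non-copy variants
/// (`Bytes`, `Fixed`, `Enum`) would be added the same way.
#[derive(Debug, PartialEq)]
enum ValueRef<'a> {
    String(&'a str),
}

impl<'a> From<ValueRef<'a>> for Value {
    fn from(v: ValueRef<'a>) -> Self {
        match v {
            // The only allocation happens here, at the user's request.
            ValueRef::String(s) => Value::String(s.to_string()),
        }
    }
}

/// Decode an Avro string (zigzag-varint length, then UTF-8 bytes) from
/// `buf`, borrowing the payload. Returns the value and the bytes consumed.
fn decode_str(buf: &[u8]) -> Result<(ValueRef<'_>, usize), &'static str> {
    // Minimal inline zigzag-varint decode of the length prefix.
    let (mut n, mut shift, mut read) = (0u64, 0u32, 0usize);
    loop {
        let b = *buf.get(read).ok_or("unexpected end of buffer")?;
        n |= u64::from(b & 0x7f) << shift;
        read += 1;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    let len = usize::try_from((n >> 1) as i64 ^ -((n & 1) as i64))
        .map_err(|_| "negative length")?;
    let bytes = buf.get(read..read + len).ok_or("unexpected end of buffer")?;
    let s = std::str::from_utf8(bytes).map_err(|_| "invalid utf-8")?;
    Ok((ValueRef::String(s), read + len))
}
```

The `&str` inside `ValueRef` borrows directly from the block's `buf`, so a format like arrow can copy the bytes into its own string storage without an intermediate `String`.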
This is a pretty large change, though, because we would also need to change the signatures of `zag_i64` and the like to read from `&[u8]` instead of `R: Read`, since they must return how many bytes were consumed in order to advance the `&[u8]`.
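A slice-based `zag_i64` might look like the following; the name matches the crate's reader helpers, but the exact signature here is an assumption:

```rust
/// Decode a zigzag-encoded varint from `buf`, returning the value and
/// the number of bytes consumed so the caller can advance its slice.
fn zag_i64(buf: &[u8]) -> Result<(i64, usize), &'static str> {
    let mut n: u64 = 0;
    let mut shift: u32 = 0;
    for (i, &b) in buf.iter().enumerate() {
        if shift >= 64 {
            return Err("varint overflows i64");
        }
        n |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            // Undo zigzag: (n >> 1) ^ -(n & 1)
            return Ok(((n >> 1) as i64 ^ -((n & 1) as i64), i + 1));
        }
        shift += 7;
    }
    Err("unexpected end of buffer")
}
```

With the consumed count returned, the caller advances with `buf = &buf[read..]` instead of relying on the internal cursor of an `R: Read`.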
I recently did this type of exercise as part of a complete rewrite of the `parquet` crate, parquet2, where this exact same problem emerges: a column is composed of pages (avro "blocks") that are decompressed and decoded individually. It is advantageous to read each page into a `Vec<u8>` before decompressing and decoding it, but we should not perform further allocations when not needed, especially for strings, bytes, etc.
The result of that rewrite was a 5-10x speedup, which is why I am cautiously confident that this effort may be worthwhile here too.
There's the same kind of suboptimal handling in the serde module, e.g. it uses `visit_str` instead of `visit_borrowed_str`.
Looks like this (and possibly other issues) should be ported over to the jira repo.
This is a follow up of #193.