Reduce allocations in reading #195

Open
jorgecarleitao opened this issue Sep 13, 2021 · 1 comment

This is a follow-up of #193.

My understanding is that this crate is optimized for reading into the owning enum Value, which encapsulates all types, much as serde-json does.

This design makes it expensive to deserialize into other formats. Specifically, the main challenge is that, given rows of strings, each row has to allocate a new String because Value owns a String, even when the values were already allocated into a Vec<u8> during buffering + decompression (see Reader and Block). This happens for all non-copy types (String, Bytes, Fixed and Enum).

Formats such as arrow have their own way of storing strings. Thus, currently, using Value::String(String) adds a large overhead to interoperate (I can quantify it if we need to). Specifically, the data path from a Read of an uncompressed file with 1 column of strings to Value is:

Read - [read] -> Vec<u8> - [decompress] -> Vec<u8> - [decode] -> String - [move] -> Value

This yields an extra allocation per row per column of non-copy types (String, Bytes, Fixed and Enum).

Usually, it is more efficient to read into a Vec<u8> and produce non-owned values from it; in this case the path would be:

Read - [read] -> Vec<u8> - [decompress] -> Vec<u8> - [decode] -> &'a str -> Value<'a>

(There are other tricks to avoid re-allocating a Vec<u8> per Block + decompression, which can be discussed separately.)

One way to address this is to declare a "non-owned" Value (e.g. ValueRef::String(&'a str)) and have the decoder be of the form decode<'a>(&'a [u8]) -> Result<(ValueRef<'a>, usize)>, where usize is the number of bytes read.
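A minimal sketch of what that could look like (ValueRef, Schema, decode and the error type here are illustrative names, not existing APIs of this crate; the length prefix is read with a slice-based zag_i64 like the one sketched further below):

```rust
use std::str;

/// Illustrative stand-in error type for this sketch.
#[derive(Debug)]
pub enum DecodeError {
    InvalidUtf8,
    InvalidLength,
    UnexpectedEof,
    Unsupported,
}

/// Illustrative stand-in for the crate's schema; only the variants used below.
pub enum Schema {
    Long,
    String,
    Bytes,
}

/// Hypothetical borrowed counterpart of `Value`; only a few variants shown.
pub enum ValueRef<'a> {
    Null,
    Boolean(bool),
    Long(i64),
    String(&'a str),
    Bytes(&'a [u8]),
}

/// Decode one value from `buf`, returning it together with the number of
/// bytes consumed so that the caller can advance its offset into the block.
pub fn decode<'a>(schema: &Schema, buf: &'a [u8]) -> Result<(ValueRef<'a>, usize), DecodeError> {
    match schema {
        Schema::String => {
            // Length prefix: zig-zag encoded varint, read from the slice
            // (slice-based zag_i64 sketched further below).
            let (len, header) = zag_i64(buf).ok_or(DecodeError::UnexpectedEof)?;
            if len < 0 {
                return Err(DecodeError::InvalidLength);
            }
            let len = len as usize;
            let bytes = buf
                .get(header..header + len)
                .ok_or(DecodeError::UnexpectedEof)?;
            // Borrow directly from the block buffer: no per-row allocation.
            let s = str::from_utf8(bytes).map_err(|_| DecodeError::InvalidUtf8)?;
            Ok((ValueRef::String(s), header + len))
        }
        _ => Err(DecodeError::Unsupported),
    }
}
```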

Since each block declares its number of bytes upfront and we read it into a buf: Vec<u8> as part of the buffering + decompression in Reader, we can leverage this buf and decode into non-owning ValueRefs, thereby avoiding the allocation per row per column.
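On the caller side (continuing the hypothetical sketch above), decoding a whole block could then look roughly like this, with only an offset moving through the already-decompressed buffer:

```rust
/// Hypothetical: decode all rows of a block without allocating per value.
/// `buf` is the decompressed block buffer the reader already holds and
/// `row_count` comes from the block header.
fn decode_block<'a>(
    schema: &Schema,
    buf: &'a [u8],
    row_count: usize,
) -> Result<Vec<ValueRef<'a>>, DecodeError> {
    let mut offset = 0;
    let mut rows = Vec::with_capacity(row_count);
    for _ in 0..row_count {
        let (value, consumed) = decode(schema, &buf[offset..])?;
        offset += consumed;
        rows.push(value);
    }
    Ok(rows)
}
```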

We can offer impl<'a> From<ValueRef<'a>> for Value that calls to_string() (and the equivalent for other types such as Bytes) if the user wishes to remove the lifetime at the cost of an allocation.
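Continuing the sketch, such a conversion could look like the following (the owned Value variants are the crate's existing ones; the ValueRef side is the hypothetical type from above):

```rust
use avro_rs::types::Value; // import path may differ depending on the crate version

/// Hypothetical conversion back to the owning `Value`: the allocation is
/// paid only when the user explicitly asks for an owned value.
impl<'a> From<ValueRef<'a>> for Value {
    fn from(value: ValueRef<'a>) -> Self {
        match value {
            ValueRef::Null => Value::Null,
            ValueRef::Boolean(b) => Value::Boolean(b),
            ValueRef::Long(i) => Value::Long(i),
            ValueRef::String(s) => Value::String(s.to_string()),
            ValueRef::Bytes(b) => Value::Bytes(b.to_vec()),
        }
    }
}
```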

This is a pretty large change, though, because we would also need to change the signatures of zag_i64 and the like to read from &[u8] instead of R: Read, since they must return how many bytes were consumed in order to "advance the pointer" into the &[u8].
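For illustration, a slice-based zag_i64 could look like this (a sketch, not the crate's current implementation; the only difference that matters here is that it reports the number of bytes consumed instead of advancing an R: Read):

```rust
/// Hypothetical slice-based replacement for `zag_i64`: decode a zig-zag
/// encoded varint from `buf` and return the value together with the number
/// of bytes consumed, so callers can advance their offset into the slice.
/// Returns `None` if the buffer ends mid-varint or the value overflows i64.
fn zag_i64(buf: &[u8]) -> Option<(i64, usize)> {
    let mut unsigned: u64 = 0;
    let mut shift = 0u32;
    for (i, &byte) in buf.iter().enumerate() {
        if shift >= 64 {
            // more than 10 bytes: not a valid i64 varint
            return None;
        }
        unsigned |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            // zig-zag decode: (n >> 1) ^ -(n & 1)
            let value = (unsigned >> 1) as i64 ^ -((unsigned & 1) as i64);
            return Some((value, i + 1));
        }
        shift += 7;
    }
    // ran out of input before the final byte of the varint
    None
}
```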

I recently did this type of exercise as part of a complete re-write of the parquet crate, parquet2, where this exact same problem emerges: a column is composed of pages (Avro's "blocks") that are decompressed and decoded individually; it is advantageous to read each page into a Vec<u8> before decompressing and decoding it, but we should not perform further allocations when they are not needed, especially for strings, bytes, etc.

The result of that re-write was a 5-10x speedup, which is why I am cautiously confident that this effort may be worth it here too.

Ten0 commented Apr 17, 2022

There's the same kind of suboptimal code in the serde module, e.g. it uses visit_str instead of visit_borrowed_str.
Looks like this (and possibly other issues) should be ported over to the jira repo.
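For illustration, the difference looks roughly like this in a toy serde Deserializer over a borrowed buffer (all names are made up; only the visit_str vs. visit_borrowed_str distinction is the point):

```rust
use serde::de::{self, Visitor};

/// A deserializer that owns nothing and borrows its input for 'de.
struct BorrowingDeserializer<'de> {
    input: &'de str,
}

impl<'de> de::Deserializer<'de> for BorrowingDeserializer<'de> {
    type Error = de::value::Error;

    fn deserialize_str<V: Visitor<'de>>(self, visitor: V) -> Result<V::Value, Self::Error> {
        // Zero-copy: the visitor may keep a `&'de str` borrowed from the input.
        // A deserializer backed by a transient buffer would call
        // `visitor.visit_str(...)` instead, which only guarantees the data
        // for the duration of the call and therefore forces owned `String`s.
        visitor.visit_borrowed_str(self.input)
    }

    fn deserialize_any<V: Visitor<'de>>(self, visitor: V) -> Result<V::Value, Self::Error> {
        visitor.visit_borrowed_str(self.input)
    }

    // Everything else just falls back to `deserialize_any` in this toy example.
    serde::forward_to_deserialize_any! {
        bool i8 i16 i32 i64 u8 u16 u32 u64 f32 f64 char string bytes byte_buf
        option unit unit_struct newtype_struct seq tuple tuple_struct map
        struct enum identifier ignored_any
    }
}
```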
