My understanding is that this crate is optimized for reading into the owning enum `Value`, which encapsulates all types, much like e.g. `serde_json` does.
This design makes it expensive to deserialize into other formats. Specifically, the main challenge is that, given rows of strings, each row has to allocate a new `String` because `Value` owns its `String`, even when the values were already read into a `Vec<u8>` during buffering + decompression (see `Reader` and `Block`). This affects all non-copy types (`String`, `Bytes`, `Fixed` and `Enum`).
Formats such as arrow have their own way of storing strings. Thus, currently, `Value::String(String)` adds a large overhead (I can quantify it if we need to) to interoperate. Specifically, the data path from a `Read` of an uncompressed file with 1 column of strings to `Value` yields an extra allocation per row per column of non-copy types (`String`, `Bytes`, `Fixed` and `Enum`). Usually, it is more efficient to read into a `Vec<u8>` and produce non-owned values from it.
(There are other tricks to avoid re-allocating a `Vec<u8>` per `Block` + decompression, which can be discussed separately.)
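To make the per-row cost concrete, here is a hypothetical, simplified sketch of the owning path; `read_string_owned` and the one-variant `Value` are illustrative names, not the crate's actual code:

```rust
// Simplified illustration of the per-row allocation: the block buffer
// already owns the bytes, yet building an owning `Value::String` forces
// a copy for every row.

#[derive(Debug, PartialEq)]
enum Value {
    String(String),
}

/// Copies `len` bytes out of the already-buffered `block`, even though
/// `block` outlives every decoded value.
fn read_string_owned(block: &[u8], offset: usize, len: usize) -> Value {
    let bytes = block[offset..offset + len].to_vec(); // the extra allocation
    Value::String(String::from_utf8(bytes).unwrap()) // reuses the Vec's buffer
}
```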
One way to address this is to declare a "non-owned" `Value` (e.g. `ValueRef::String(&'a str)`) and have the decoder take the form `decode<'a>(&'a [u8]) -> Result<(ValueRef<'a>, usize)>`, where the `usize` is the number of bytes read.
Since each row block declares its number of bytes upfront, and we already read it into a `buf: Vec<u8>` as part of the buffering + decompression in `Reader`, we can leverage this `buf` to decode into non-owning `ValueRef`s, thereby avoiding the allocation per row per column.
We can offer `impl<'a> From<ValueRef<'a>> for Value` that calls `to_string()` (and the equivalent for other types such as `Bytes`), if the user wishes to remove the lifetime at the cost of an allocation.
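The borrowed variant, a slice-based decoder, and the opt-in conversion back to an owning `Value` could look like this. This is a sketch under assumptions: `ValueRef` and `decode_str` are hypothetical names, `Value` is reduced to a single variant, and the length-prefix decoding is inlined rather than shared with a `zag_i64` helper:

```rust
#[derive(Debug, PartialEq)]
enum Value {
    String(String),
}

/// Borrowed counterpart of `Value`; the other non-copy variants
/// (`Bytes`, `Fixed`, `Enum`) would be added the same way.
#[derive(Debug, PartialEq)]
enum ValueRef<'a> {
    String(&'a str),
}

impl<'a> From<ValueRef<'a>> for Value {
    fn from(v: ValueRef<'a>) -> Self {
        match v {
            // The only allocation happens here, at the user's request.
            ValueRef::String(s) => Value::String(s.to_string()),
        }
    }
}

/// Decode an Avro string (zigzag-varint length, then UTF-8 bytes) from
/// `buf`, borrowing the payload. Returns the value and the bytes consumed.
fn decode_str(buf: &[u8]) -> Result<(ValueRef<'_>, usize), &'static str> {
    // Minimal inline zigzag-varint decode of the length prefix.
    let (mut n, mut shift, mut read) = (0u64, 0u32, 0usize);
    loop {
        let b = *buf.get(read).ok_or("unexpected end of buffer")?;
        n |= u64::from(b & 0x7f) << shift;
        read += 1;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    let len = usize::try_from((n >> 1) as i64 ^ -((n & 1) as i64))
        .map_err(|_| "negative length")?;
    let bytes = buf.get(read..read + len).ok_or("unexpected end of buffer")?;
    let s = std::str::from_utf8(bytes).map_err(|_| "invalid utf-8")?;
    Ok((ValueRef::String(s), read + len))
}
```

The `&str` inside `ValueRef` borrows directly from the block's `buf`, so a format like arrow can copy the bytes into its own string storage without an intermediate `String`.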
This is a pretty large change, though, because we would also need to change the signatures of `zag_i64` and the like to read from `&[u8]` instead of `R: Read`, since they must return how many bytes were consumed in order to advance the `&[u8]`.
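A slice-based `zag_i64` might look like the following; the name matches the crate's reader helpers, but the exact signature here is an assumption:

```rust
/// Decode a zigzag-encoded varint from `buf`, returning the value and
/// the number of bytes consumed so the caller can advance its slice.
fn zag_i64(buf: &[u8]) -> Result<(i64, usize), &'static str> {
    let mut n: u64 = 0;
    let mut shift: u32 = 0;
    for (i, &b) in buf.iter().enumerate() {
        if shift >= 64 {
            return Err("varint overflows i64");
        }
        n |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            // Undo zigzag: (n >> 1) ^ -(n & 1)
            return Ok(((n >> 1) as i64 ^ -((n & 1) as i64), i + 1));
        }
        shift += 7;
    }
    Err("unexpected end of buffer")
}
```

With the consumed count returned, the caller advances with `buf = &buf[read..]` instead of relying on the internal cursor of an `R: Read`.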
I recently did this type of exercise as part of a complete rewrite of the `parquet` crate, parquet2, where this exact same problem emerges: a column is composed of pages (avro "blocks") that are decompressed and decoded individually. It is advantageous to read each page into a `Vec<u8>` before decompressing and decoding it, but we should not perform further allocations when not needed, especially for strings, bytes, etc.
The result of that rewrite was a 5-10x speedup, which is why I am cautiously confident that this effort may be worthwhile here too.
There's the same kind of suboptimal handling in the serde module, e.g. it uses `visit_str` instead of `visit_borrowed_str`.
Looks like this (and possibly other issues) should be ported over to the jira repo.
This is a follow up of #193.