Reading a very large Parquet file (430 MB gzipped) made up almost entirely of string fields was very slow. After profiling with Instruments on macOS, I noticed that a lot of time is spent in "convert_byte_array", in particular in reserving and allocating through Vec::with_capacity, which is done before String::from_utf8_unchecked.
It seems that using String as the underlying storage is causing this (String uses a Vec for its underlying storage); it also requires copying from the slice into the Vec.
"Field::Str" is a variant of a pub enum, so I am not sure how refactorable the String part is, for example by converting it into a &str (we could then perhaps defer the conversion from &[u8] to Vec until the user really needs a String).
But of course, changing it to &str could result in quite a few interface changes... So I am wondering whether there are already any plans, or a solution on the way, to improve the handling of the "Field::Str" case?
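To make the cost concrete, here is a minimal sketch of the two approaches. It is not the parquet crate's actual code: `copy_to_string` and `borrow_as_str` are hypothetical names, and the sketch uses the checked `String::from_utf8` / `std::str::from_utf8` rather than `from_utf8_unchecked` so that it stays in safe Rust.

```rust
use std::str::Utf8Error;
use std::string::FromUtf8Error;

/// Hypothetical stand-in for the per-value copy described above:
/// one heap allocation plus one memcpy for every string value read.
pub fn copy_to_string(bytes: &[u8]) -> Result<String, FromUtf8Error> {
    let mut buf = Vec::with_capacity(bytes.len()); // allocation per value
    buf.extend_from_slice(bytes); // copy per value
    String::from_utf8(buf) // validates, then reuses the Vec's storage
}

/// Borrowing alternative: validates UTF-8 but neither allocates nor copies.
/// The returned &str points into the caller's buffer.
pub fn borrow_as_str(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}
```

With millions of rows and many string columns, the first function runs once per value, which is consistent with allocation showing up prominently in the profile.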
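As a rough illustration of what deferring the conversion could look like, and of the interface change it implies, here is a sketch using `std::borrow::Cow`. `LazyField` and its methods are hypothetical, not part of the parquet crate; the point is only that the enum gains a lifetime parameter and that the allocation moves to the moment the caller asks for an owned String.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Hypothetical, simplified stand-in for the crate's `Field` enum.
/// The lifetime parameter is the interface change mentioned above.
#[derive(Debug, PartialEq)]
pub enum LazyField<'a> {
    Long(i64),
    Str(Cow<'a, str>),
}

impl<'a> LazyField<'a> {
    /// Builds a string field that borrows from `bytes` without copying.
    pub fn str_from_bytes(bytes: &'a [u8]) -> Result<Self, Utf8Error> {
        Ok(LazyField::Str(Cow::Borrowed(std::str::from_utf8(bytes)?)))
    }

    /// Returns the string contents without allocating, if this is a string field.
    pub fn as_str(&self) -> Option<&str> {
        match self {
            // &Cow<str> -> Cow<str> -> str, then re-borrow as &str.
            LazyField::Str(s) => Some(&**s),
            _ => None,
        }
    }

    /// Allocates only here, when the caller really needs an owned String.
    pub fn into_string(self) -> Option<String> {
        match self {
            LazyField::Str(s) => Some(s.into_owned()),
            _ => None,
        }
    }
}
```

The trade-off is that a field built this way cannot outlive the buffer it borrows from, which is where most of the interface changes would come from.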
Comment from Wong Shek Hei(shekhei) @ 2019-11-27T08:05:42.844+0000:
I have modified Field::Str locally to hold the ByteArray instead, which removes the copying. Reading a 1.5MM, 1000-column gz.parquet file (440 MB) on a MacBook Pro (15-inch, 2019) improved from 3m20s to 2m20s.
The problem is that this changes the signature of the Field::Str variant.
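For readers who want a picture of why holding the byte buffer avoids the copy, here is a sketch under the assumption that the buffer type is a cheaply clonable, reference-counted view into the decoded data. `SharedBytes` and `SharedField` are hypothetical stand-ins written with `std::rc::Rc`; they are not the crate's `ByteArray` or `Field`.

```rust
use std::rc::Rc;
use std::str::Utf8Error;

/// Hypothetical stand-in for a reference-counted byte view.
/// Cloning it bumps a counter instead of copying bytes.
#[derive(Clone, Debug)]
pub struct SharedBytes {
    data: Rc<[u8]>,
    start: usize,
    len: usize,
}

impl SharedBytes {
    pub fn new(data: Rc<[u8]>, start: usize, len: usize) -> Self {
        assert!(start + len <= data.len(), "view out of bounds");
        SharedBytes { data, start, len }
    }

    /// The bytes of this value, still located inside the shared buffer.
    pub fn as_bytes(&self) -> &[u8] {
        &self.data[self.start..self.start + self.len]
    }

    /// UTF-8 validation happens on access; no allocation is needed.
    pub fn as_utf8(&self) -> Result<&str, Utf8Error> {
        std::str::from_utf8(self.as_bytes())
    }
}

/// Hypothetical variant shape: the string field owns a view, not a String.
#[derive(Clone, Debug)]
pub enum SharedField {
    Str(SharedBytes),
}
```

Compared with a &str variant, this keeps the field free of lifetime parameters, at the price of changing the variant's payload type, which is the signature change noted above.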
jorgecarleitao changed the title from "[Parquet] Reading UTF-8/JSON/ENUM field results in a lot of vec allocation" to "Reading UTF-8/JSON/ENUM field results in a lot of vec allocation" on Apr 29, 2021
Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-7252