Reading UTF-8/JSON/ENUM field results in a lot of vec allocation #58

alamb · 2021-04-26T11:24:45Z

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-7252

Reading a very large parquet file (430MB gzipped) whose fields are almost all strings was very slow. After profiling with OS X Instruments, I noticed that a lot of time is spent in "convert_byte_array", in particular in reserving and allocating via Vec::with_capacity, which is done before String::from_utf8_unchecked.

It seems that using String as the underlying storage is causing this (String uses a Vec for its underlying storage); it also requires copying from the slice into the Vec.
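To make the cost concrete, here is a minimal sketch of the two approaches, using only the standard library. The function names are hypothetical and this is not the parquet crate's actual "convert_byte_array" code; it also uses the checked `from_utf8` rather than the unchecked variant mentioned above, to keep the sketch free of `unsafe`.

```rust
/// Hypothetical stand-in for the owning conversion: every value pays for
/// one heap allocation and one copy of its bytes.
fn copy_to_string(raw: &[u8]) -> String {
    let mut buf = Vec::with_capacity(raw.len()); // heap allocation per value
    buf.extend_from_slice(raw); // byte copy per value
    String::from_utf8(buf).expect("valid UTF-8")
}

/// Borrowing alternative: validates the bytes but neither allocates nor
/// copies; the returned &str points into the original slice.
fn borrow_as_str(raw: &[u8]) -> &str {
    std::str::from_utf8(raw).expect("valid UTF-8")
}
```

With millions of string values, the first function allocates once per value, while the second only reinterprets bytes that are already in memory.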

"Field::Str" is part of a pub enum, so I am not sure how refactorable the String part is, for example by converting it into a &str (we could then perhaps defer the conversion from &[u8] to Vec until the user really needs a String).

Of course, changing it to &str could result in quite a few interface changes, so I am wondering whether there are already plans or a solution on the way to improve the handling of the "Field::Str" case?

alamb · 2021-04-26T11:24:46Z

Comment from Wong Shek Hei (shekhei) @ 2019-11-27T08:05:42.844+0000:

I have modified Field::Str locally to hold the ByteArray instead, which removes the copying. Reading a 1.5MM, 1000-column gz.parquet file (440MB) on a MacBook Pro (15-inch, 2019) improved from 3m20s to 2m20s.

However, this changes the signature of the Field::Str variant.
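For illustration only, a rough sketch of what a bytes-holding variant with deferred conversion could look like. `SketchField` is a hypothetical type, not the parquet crate's `Field`, and `Arc<[u8]>` is an assumed stand-in for ByteArray; the actual local change is not shown in this thread.

```rust
use std::str::Utf8Error;
use std::sync::Arc;

/// Hypothetical field type; NOT the parquet crate's `Field`.
#[derive(Debug, Clone)]
enum SketchField {
    /// Holds shared raw bytes instead of an owned String, so building
    /// the field only bumps a reference count.
    Str(Arc<[u8]>),
}

impl SketchField {
    /// Borrow the value as &str without allocating (validates UTF-8 on each call).
    fn as_str(&self) -> Result<&str, Utf8Error> {
        match self {
            SketchField::Str(bytes) => std::str::from_utf8(bytes),
        }
    }

    /// Allocate an owned String only when the caller really needs one.
    fn to_owned_string(&self) -> Result<String, Utf8Error> {
        self.as_str().map(str::to_owned)
    }
}
```

The trade-off is the one noted above: callers that pattern-match on the variant expecting a String would have to change.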

tustvold · 2022-10-27T01:37:39Z

I believe this was closed by #1082

alamb added the arrow Changes to the arrow crate label Apr 26, 2021

alamb added parquet Changes to the parquet crate and removed arrow Changes to the arrow crate labels Apr 26, 2021

jorgecarleitao changed the title ~~[Parquet] Reading UTF-8/JSON/ENUM field results in a lot of vec allocation~~ Reading UTF-8/JSON/ENUM field results in a lot of vec allocation Apr 29, 2021

tustvold closed this as completed Oct 27, 2022
