Reading UTF-8/JSON/ENUM field results in a lot of vec allocation #58

Closed
alamb opened this issue Apr 26, 2021 · 2 comments
Labels
parquet Changes to the parquet crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-7252

Reading a very large Parquet file (430 MB gzipped) made up almost entirely of string fields was very slow. After profiling with macOS Instruments, I noticed that a lot of time is spent in "convert_byte_array", in particular in reserving and allocating memory via Vec::with_capacity, which is done before String::from_utf8_unchecked.

Using String as the underlying storage seems to be the cause (String uses a Vec for its storage); it also requires copying the bytes from the slice into the Vec.

"Field::Str" is a variant of a pub enum, so I am not sure how refactorable the String part is. One option would be to change it to a &str (we could then perhaps defer the conversion from &[u8] to Vec until the user really needs a String).

Of course, changing it to &str could require quite a few interface changes. So I am wondering whether there are already plans, or a solution on the way, to improve the handling of the "Field::Str" case?
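To make the trade-off concrete, here is a minimal sketch of the two approaches, not the parquet crate's actual code. `OwnedField`, `BorrowedField`, `to_owned_field` and `to_borrowed_field` are hypothetical names introduced only for illustration, and the sketch uses the checked `std::str::from_utf8` rather than the unchecked conversion mentioned above.

```rust
use std::str::Utf8Error;

/// Hypothetical stand-in for a field that owns its string data
/// (one heap allocation plus a copy per value).
#[derive(Debug, PartialEq)]
pub enum OwnedField {
    Str(String),
}

/// Hypothetical borrowing alternative: the variant only points
/// into the buffer the bytes were read from.
#[derive(Debug, PartialEq)]
pub enum BorrowedField<'a> {
    Str(&'a str),
}

/// Validates the bytes as UTF-8, then copies them into a new String.
pub fn to_owned_field(bytes: &[u8]) -> Result<OwnedField, Utf8Error> {
    let s = std::str::from_utf8(bytes)?;
    Ok(OwnedField::Str(s.to_owned()))
}

/// Validates the bytes as UTF-8 and borrows them: no allocation, no copy.
pub fn to_borrowed_field(bytes: &[u8]) -> Result<BorrowedField<'_>, Utf8Error> {
    std::str::from_utf8(bytes).map(BorrowedField::Str)
}
```

The borrowed variant ties the field's lifetime to the source buffer, which is where the interface changes would come from: every type that carries the field would need a lifetime parameter.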

 

@alamb alamb added the arrow Changes to the arrow crate label Apr 26, 2021
@alamb
Contributor Author

alamb commented Apr 26, 2021

Comment from Wong Shek Hei(shekhei) @ 2019-11-27T08:05:42.844+0000:

I have modified Field::Str locally to hold the ByteArray instead, which removes the copying. Reading a 1.5MM, 1000-column gz.parquet file (440 MB) on a MacBook Pro (15-inch, 2019) improved from 3m20s to 2m20s.

 

The problem is that this changes the signature of the Field::Str variant.
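As a rough illustration of that idea, the sketch below stores a shared, reference-counted byte range in the variant and only builds a String on request. `SharedBytes` is a hypothetical stand-in written for this example, not the parquet crate's `ByteArray`, and `to_owned_string` is likewise an assumed helper name.

```rust
use std::str::Utf8Error;
use std::sync::Arc;

/// Hypothetical shared byte range (NOT the parquet crate's ByteArray):
/// many values can point into one buffer without copying it.
#[derive(Clone, Debug)]
pub struct SharedBytes {
    data: Arc<[u8]>,
    start: usize,
    len: usize,
}

impl SharedBytes {
    /// Panics if the requested range does not fit inside `data`.
    pub fn new(data: Arc<[u8]>, start: usize, len: usize) -> Self {
        let in_bounds = start
            .checked_add(len)
            .map_or(false, |end| end <= data.len());
        assert!(in_bounds, "range out of bounds");
        SharedBytes { data, start, len }
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.data[self.start..self.start + self.len]
    }

    /// Validates UTF-8 and borrows; no allocation.
    pub fn as_str(&self) -> Result<&str, Utf8Error> {
        std::str::from_utf8(self.as_bytes())
    }
}

/// Hypothetical field enum whose string variant holds shared bytes.
pub enum Field {
    Str(SharedBytes),
}

impl Field {
    /// The copy into a String happens only when the caller asks for it.
    pub fn to_owned_string(&self) -> Result<String, Utf8Error> {
        match self {
            Field::Str(bytes) => bytes.as_str().map(str::to_owned),
        }
    }
}
```

Cloning a `SharedBytes` only bumps a reference count, so no lifetime parameter is needed on `Field`, but callers that pattern-match on `Field::Str(String)` would still break, which is the signature change described above.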

@alamb alamb added parquet Changes to the parquet crate and removed arrow Changes to the arrow crate labels Apr 26, 2021
@jorgecarleitao jorgecarleitao changed the title [Parquet] Reading UTF-8/JSON/ENUM field results in a lot of vec allocation Reading UTF-8/JSON/ENUM field results in a lot of vec allocation Apr 29, 2021
@tustvold
Contributor

I believe this was closed by #1082
