parquet: parallelize over row groups ~3x
#3924
Awesome. On the other side of that, does polars automatically create row groups when writing to parquet now? There isn't an explicit param for it in |
Not by default, but you can ensure polars does. Note that there is no free lunch: polars likes contiguous data, so it may first need to rechunk the data for some operations.
Perfect; I have no idea why I went looking at |
pub enum ParallelStrategy {
    /// Don't parallelize
    None,
    /// Parallelize over the row groups
Is the doc a typo?
Why?
It seems "Parallelize over the row groups" should be the doc comment of RowGroups.
Ah, now I see it. Yup, that's wrong.
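For reference, a minimal sketch of how the enum might read once each doc comment sits on its own variant. Only None and RowGroups are named in this thread; the Columns variant and its comment are assumptions made for illustration, not the merged code.

```rust
/// Sketch only: the `Columns` variant is an assumption, it is not
/// shown in the review excerpt above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParallelStrategy {
    /// Don't parallelize
    None,
    /// Parallelize over the columns (assumed variant)
    Columns,
    /// Parallelize over the row groups
    RowGroups,
}
```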
Can be ~3x faster if a parquet file contains many row groups.
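To illustrate the idea (not the actual polars reader), here is a toy sketch using only the standard library. RowGroup, decode_row_group, read_sequential and read_parallel are hypothetical names, and the "decoding" is just a copy. Each row group is decoded on its own scoped thread and the results are concatenated in the original order; a real reader would more likely hand the row groups to a thread pool than spawn one thread each.

```rust
use std::thread;

/// A toy "row group": one chunk of rows for a single integer column.
/// Hypothetical stand-in, not a polars or parquet type.
type RowGroup = Vec<i64>;

/// Pretend to decode one row group. Real decoding would decompress
/// and deserialize pages; here the values are simply copied.
fn decode_row_group(rg: &RowGroup) -> Vec<i64> {
    rg.clone()
}

/// Decode the row groups one after another.
pub fn read_sequential(row_groups: &[RowGroup]) -> Vec<i64> {
    row_groups.iter().flat_map(decode_row_group).collect()
}

/// Decode every row group on its own scoped thread, then concatenate
/// the results in the original row group order.
pub fn read_parallel(row_groups: &[RowGroup]) -> Vec<i64> {
    thread::scope(|s| {
        // Spawn all decode tasks first so they can run concurrently.
        let handles: Vec<_> = row_groups
            .iter()
            .map(|rg| s.spawn(move || decode_row_group(rg)))
            .collect();
        // Join in spawn order, which preserves the row order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("decode thread panicked"))
            .collect()
    })
}
```

With a single row group there is nothing to split, which is why the gain depends on the file containing many row groups. The final concatenation also hints at the rechunking cost mentioned earlier in the thread.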
Some timings: