parquet: parallelize over row groups ~3x #3924

Merged
merged 1 commit into from
Jul 7, 2022

ritchie46
Member

Can be ~3x faster if a parquet file contains many row groups.

Some timings:

image
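
To illustrate why a file with many row groups can be read faster, here is a minimal sketch of fanning work out over row groups. It is not the code from this PR: `process_row_group` and `process_parallel` are hypothetical names, each row group is modelled as a plain `Vec<i64>`, and one scoped thread per row group stands in for whatever scheduling the reader really uses.

```rust
use std::thread;

/// Stand-in for decoding one row group; here it only sums the values.
fn process_row_group(rows: &[i64]) -> i64 {
    rows.iter().sum()
}

/// Process every row group on its own scoped thread and return the
/// per-group results in the original row-group order.
fn process_parallel(row_groups: &[Vec<i64>]) -> Vec<i64> {
    thread::scope(|s| {
        // Spawn all workers first so they can run concurrently.
        let handles: Vec<_> = row_groups
            .iter()
            .map(|rg| s.spawn(move || process_row_group(rg)))
            .collect();
        // Join in spawn order, which preserves row-group order.
        handles
            .into_iter()
            .map(|h| h.join().expect("row-group worker panicked"))
            .collect()
    })
}
```

With a single row group there is nothing to fan out, which is consistent with the speedup only showing up when the file contains many row groups.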

@github-actions github-actions bot added the python and rust labels Jul 7, 2022
@alexander-beedie
Collaborator

alexander-beedie commented Jul 7, 2022

> Can be ~3x faster if a parquet file contains many row groups.

Awesome. On the other side of that, does polars automatically create row groups when writing to parquet now? There isn't an explicit param for it in write_ipc, but with these kinds of gains on the reading side, might be reasonable to expose one, so a round-trip through parquet could be optimized?

@ritchie46
Member Author

> Awesome. On the other side of that, does polars automatically create row groups when writing to parquet now?

Not by default, but you can ensure polars does. DataFrame.write_parquet accepts a row_group_size argument that dictates the number of rows per rg.

Note that there is no free lunch: polars prefers contiguous data, so it may first need to rechunk the data for some operations.
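
The trade-off described above can be sketched with plain vectors. This is an illustration under stated assumptions, not polars internals: `split_into_row_groups` and `rechunk` are hypothetical helpers, a column is modelled as `&[i64]`, and `row_group_size` is used in the same sense as the `write_parquet` argument (rows per row group).

```rust
/// Split a column into row groups of at most `row_group_size` rows,
/// mimicking how a writer could lay out row groups (illustrative only).
fn split_into_row_groups(values: &[i64], row_group_size: usize) -> Vec<Vec<i64>> {
    assert!(row_group_size > 0, "row_group_size must be non-zero");
    values.chunks(row_group_size).map(|c| c.to_vec()).collect()
}

/// Copy all chunks into one contiguous buffer. This copy is the cost
/// that rechunking adds after reading many separate row groups.
fn rechunk(chunks: &[Vec<i64>]) -> Vec<i64> {
    let total: usize = chunks.iter().map(Vec::len).sum();
    let mut out = Vec::with_capacity(total);
    for chunk in chunks {
        out.extend_from_slice(chunk);
    }
    out
}
```

Ten rows with a `row_group_size` of 4 give three groups (4, 4 and 2 rows); getting back to one contiguous column means copying every value once more.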

@alexander-beedie
Collaborator

alexander-beedie commented Jul 7, 2022

> Not by default, but you can ensure polars does. DataFrame.write_parquet accepts a row_group_size argument that dictates the number of rows per rg.

Perfect; I have no idea why I went looking at write_ipc, hah... 🤦

@ritchie46 ritchie46 merged commit 34f546e into master Jul 7, 2022
@ritchie46 ritchie46 deleted the parquet branch July 7, 2022 09:56
pub enum ParallelStrategy {
    /// Don't parallelize
    None,
    /// Parallelize over the row groups

Contributor

Is the doc a typo?

Member Author

Why?

Contributor

It seems "Parallelize over the row groups" should be the doc comment of the RowGroups variant.

Member Author

Ah, now I see it. Yup, that's wrong.
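
For reference, a sketch of what the corrected doc comments could look like. Only `ParallelStrategy`, `None` and `RowGroups` appear in this thread; the `Columns` variant, its doc text and the derives are assumptions added to make the example complete, so the merged code may differ.

```rust
/// Sketch with each doc comment attached to the variant it describes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ParallelStrategy {
    /// Don't parallelize
    None,
    /// Parallelize over the columns (variant name and doc are assumptions)
    Columns,
    /// Parallelize over the row groups
    RowGroups,
}
```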

3 participants