Add row_group_size argument to Dataset.to_parquet #218
cc @rnyak
@rjzamora looks like some tests are failing.
Right, I guess CI is using a pretty old version of cudf. The necessary feature was exposed in cudf 22.08 (rapidsai/cudf#10980). I guess we could just raise an error when the user tries to use this option with an older cudf release.
I think it's reasonable to require a new enough version of cudf instead of making code changes to accommodate older versions. I'm working on getting the CI image back up to date, hopefully in the next week or so. (It's a slow process since it involves waiting on a lot of container builds.)
That makes sense to me. Thanks @karlhigley!
@rjzamora and @karlhigley, when do you think this PR can be revisited, given the comment about the new cudf version? Thanks.
The CI image has been updated to match our containers, so hopefully the tests will pass now 🤞🏻
Adds a simple `row_group_size` argument to `Dataset.to_parquet`, allowing users to set the maximum number of rows desired in a single Parquet row-group.
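To illustrate what "maximum number of rows in a single row-group" means, here is a minimal pure-Python sketch. It is not the code in this PR or in cudf; the helper name `row_group_bounds` is hypothetical, and it assumes the writer simply cuts the rows into consecutive chunks of at most `row_group_size` rows. A real writer may apply other limits as well (for example, a byte-size cap).

```python
def row_group_bounds(num_rows, row_group_size):
    """Return (start, stop) row offsets for each row-group.

    Hypothetical helper for illustration only: rows are split into
    consecutive chunks, each holding at most `row_group_size` rows.
    """
    if row_group_size <= 0:
        raise ValueError("row_group_size must be a positive integer")
    return [
        (start, min(start + row_group_size, num_rows))
        for start in range(0, num_rows, row_group_size)
    ]


# 10 rows with a cap of 4 rows per row-group: the last group is smaller.
bounds = row_group_bounds(10, 4)
print(bounds)
```

Under this assumption, the cap is an upper bound: every row-group except possibly the last holds exactly `row_group_size` rows.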