Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add row_group_size argument to Dataset.to_parquet #218

Merged
merged 6 commits into from
Mar 16, 2023

Conversation

rjzamora
Copy link
Contributor

Adds simple row_group_size argument to Dataset.to_parquet, allowing users to set the maximum number of rows desired in a single Parquet row-group.

@rjzamora rjzamora added the enhancement New feature or request label Feb 13, 2023
@rjzamora
Copy link
Contributor Author

cc @rnyak

@karlhigley karlhigley added this to the Merlin 23.03 milestone Feb 13, 2023
@rnyak
Copy link
Contributor

rnyak commented Feb 24, 2023

@rjzamora looks like some tests are failing.

@rnyak rnyak requested a review from karlhigley February 24, 2023 00:03
@rjzamora
Copy link
Contributor Author

looks like some tests are failing.

Right, I guess CI is using a pretty old version of cudf - The necessary feature was exposed in cudf in 22.08 (rapidsai/cudf#10980). I guess we could just raise an error when the user tries to use this option with an older cudf release.

@karlhigley
Copy link
Contributor

I think it's reasonable to require a new enough version of cudf instead of making code changes to accommodate older versions. I'm working on getting the CI image back up to date—with any luck, hopefully in the next week or so. (It's a slow process since it involves waiting on a lot of container builds.)

@rjzamora
Copy link
Contributor Author

I think it's reasonable to require a new enough version of cudf instead of making code changes to accommodate older versions. I'm working on getting the CI image back up to date—with any luck, hopefully in the next week or so. (It's a slow process since it involves waiting on a lot of container builds.)

That makes sense to me - Thanks @karlhigley !

@rnyak
Copy link
Contributor

rnyak commented Mar 9, 2023

@rjzamora and @karlhigley when do you think this PR can be revisited based on new cudf version comment? thanks.

@karlhigley
Copy link
Contributor

The CI image has been updated to match our containers, so hopefully the tests will pass now 🤞🏻

@karlhigley karlhigley merged commit 3ff75df into NVIDIA-Merlin:main Mar 16, 2023
@rjzamora rjzamora deleted the row-group-size branch March 16, 2023 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants