-
Notifications
You must be signed in to change notification settings - Fork 7k
[Data] Callback-based stat computation for preprocessors and ValueCounter #56848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a significant and valuable refactoring of the preprocessor statistics computation, moving to a callback-based StatComputationPlan. This greatly improves code organization and modularity. The addition of ValueCounter and new BlockColumnAccessor methods are also great enhancements.
My review focuses on ensuring the new architecture is correctly and consistently applied. I've found a few issues:
- A critical bug in
OneHotEncoder'smax_categorieslogic where it's applied per-batch instead of globally. - A bug in
ValueCounter'szero_factorythat would lead to aKeyError. - A method definition issue in
UniformKBinsDiscretizerthat would cause aTypeError. - Some preprocessors in
scaler.pywere missed in the refactoring and still use the olddataset.aggregatepattern.
Addressing these points will help solidify this excellent refactoring.
| return {x} | ||
|
|
||
|
|
||
| class ValueCounter(AggregateFnV2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this plugged in properly? It's not called in Encoder codepath
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
1dcabb6 to
a5ea0f4
Compare
a650d2c to
ee354a5
Compare
…nter * Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication. * Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends. Signed-off-by: cem <cem@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: cem-anyscale <cem@anyscale.com> Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem <cem@anyscale.com>
ee354a5 to
1ae5182
Compare
…nter (#56848) * Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication. * Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends. <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: cem-anyscale <cem@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
…nter (#56848) * Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication. * Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends. <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: cem-anyscale <cem@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
…nter (ray-project#56848) * Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication. * Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends. <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: cem-anyscale <cem@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: xgui <xgui@anyscale.com>
…nter (#56848) * Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication. * Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends. <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: cem-anyscale <cem@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…nter (ray-project#56848) * Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication. * Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends. <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: cem-anyscale <cem@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…nter (ray-project#56848) * Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication. * Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends. <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: cem-anyscale <cem@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…nter (ray-project#56848) * Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication. * Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends. <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: cem-anyscale <cem@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Why are these changes needed?
Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.