Conversation

@cem-anyscale
Contributor

  • Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication.
  • Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends.
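A minimal sketch of the idea behind the two bullets above, assuming a toy columnar block (a plain dict of lists) rather than Arrow or Pandas. `ToyStatPlan`, `value_counts`, and the stat key `unique_values(color)` are hypothetical names chosen for illustration and are not the classes added in this PR. Each entry pairs a per-block aggregation with a merge step and a callback that turns the merged result into fitted stats:

```python
from collections import Counter
from functools import reduce
from typing import Any, Callable, Dict, Iterable, List

Block = Dict[str, List[Any]]  # toy columnar block: column name -> values


def value_counts(block: Block, column: str) -> Counter:
    """Count occurrences of each value in one column of one block."""
    return Counter(block[column])


class ToyStatPlan:
    """Hypothetical stand-in for a callback-based stat computation plan."""

    def __init__(self) -> None:
        self._entries: List[tuple] = []

    def add(
        self,
        per_block: Callable[[Block], Any],
        merge: Callable[[Any, Any], Any],
        callback: Callable[[Any], Dict[str, Any]],
    ) -> None:
        self._entries.append((per_block, merge, callback))

    def compute(self, blocks: Iterable[Block]) -> Dict[str, Any]:
        blocks = list(blocks)
        stats: Dict[str, Any] = {}
        for per_block, merge, callback in self._entries:
            partials = [per_block(b) for b in blocks]
            if not partials:
                continue  # nothing to aggregate for an empty dataset
            merged = reduce(merge, partials)  # combine partial results across blocks
            stats.update(callback(merged))  # the callback shapes the final stat
        return stats


blocks = [
    {"color": ["red", "blue", "red"]},
    {"color": ["blue", "green", "red"]},
]
plan = ToyStatPlan()
plan.add(
    per_block=lambda b: value_counts(b, "color"),
    merge=lambda a, b: a + b,  # Counter addition sums per-value counts
    callback=lambda counts: {
        "unique_values(color)": {v: i for i, v in enumerate(sorted(counts))}
    },
)
stats = plan.compute(blocks)
print(stats)
```

The point of the callback is that each preprocessor only declares what to aggregate and how to post-process it, while the plan owns the loop over blocks.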

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@cem-anyscale cem-anyscale requested a review from a team as a code owner September 23, 2025 18:51
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and valuable refactoring of the preprocessor statistics computation, moving to a callback-based StatComputationPlan. This greatly improves code organization and modularity. The addition of ValueCounter and the new BlockColumnAccessor methods is also a welcome enhancement.

My review focuses on ensuring the new architecture is correctly and consistently applied. I've found a few issues:

  • A critical bug in OneHotEncoder's max_categories logic where it's applied per-batch instead of globally.
  • A bug in ValueCounter's zero_factory that would lead to a KeyError.
  • A method definition issue in UniformKBinsDiscretizer that would cause a TypeError.
  • Some preprocessors in scaler.py were missed in the refactoring and still use the old dataset.aggregate pattern.

Addressing these points will help solidify this excellent refactoring.
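To illustrate the first point in the list above, here is a small self-contained sketch of why a category cap must be applied to globally merged counts. It does not reproduce the OneHotEncoder code from this PR; `top_k`, `batches`, and the tie-breaking rule are assumptions made for the example:

```python
from collections import Counter
from typing import Dict, Set

batches = [["a", "a", "b"], ["b", "b", "c"], ["c", "c", "c", "a"]]
max_categories = 2


def top_k(counts: Dict[str, int], k: int) -> Set[str]:
    """Keep the k most frequent values; ties are broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {value for value, _ in ranked[:k]}


# Per-batch truncation: each batch keeps its own top-k, and the union can
# exceed the cap and can keep values that are not globally frequent.
per_batch: Set[str] = set()
for batch in batches:
    per_batch |= top_k(Counter(batch), max_categories)

# Global truncation: merge counts from every batch first, then cut once.
total: Counter = Counter()
for batch in batches:
    total.update(batch)
global_top = top_k(total, max_categories)

print(sorted(per_batch), sorted(global_top))
```

With these toy batches the per-batch union contains three categories even though the cap is two, while the global version returns exactly two.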

@ray-gardener ray-gardener bot added the docs ("An issue or change related to documentation") and data ("Ray Data-related issues") labels Sep 23, 2025
return {x}


class ValueCounter(AggregateFnV2):

Is this plugged in properly? It isn't called in the Encoder code path.

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label ("The issue is stale. It will be closed within 7 days unless there are further conversation") Oct 14, 2025
@cem-anyscale cem-anyscale force-pushed the cem/callback_based_stats branch 2 times, most recently from 1dcabb6 to a5ea0f4 Compare October 14, 2025 21:29
@github-actions github-actions bot added the unstale label ("A PR that has been marked unstale. It will not get marked stale again if this label is on it.") and removed the stale label ("The issue is stale. It will be closed within 7 days unless there are further conversation") Oct 15, 2025
@cem-anyscale cem-anyscale added the go label ("add ONLY when ready to merge, run all tests") Oct 15, 2025
@cem-anyscale cem-anyscale force-pushed the cem/callback_based_stats branch 3 times, most recently from a650d2c to ee354a5 Compare October 20, 2025 18:53
cem-anyscale and others added 9 commits October 20, 2025 11:54
…nter

* Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication.
* Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends.

Signed-off-by: cem <cem@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: cem-anyscale <cem@anyscale.com>
@cem-anyscale cem-anyscale force-pushed the cem/callback_based_stats branch from ee354a5 to 1ae5182 Compare October 20, 2025 18:54
@alexeykudinkin alexeykudinkin merged commit 48d8ec2 into master Oct 20, 2025
6 checks passed
@alexeykudinkin alexeykudinkin deleted the cem/callback_based_stats branch October 20, 2025 20:38
kamil-kaczmarek pushed a commit that referenced this pull request Oct 20, 2025
…nter (#56848)

* Updated preprocessors to use a callback-based approach for stat
computation. This improves code organization and reduces duplication.
* Added ValueCounter aggregator and value_counts method to
BlockColumnAccessor. Includes implementations for both Arrow and Pandas
backends.

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem-anyscale <cem@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
kamil-kaczmarek pushed a commit that referenced this pull request Oct 20, 2025
…nter (#56848)

Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem-anyscale <cem@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
…nter (ray-project#56848)

Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem-anyscale <cem@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
…nter (#56848)

Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem-anyscale <cem@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…nter (ray-project#56848)

Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem-anyscale <cem@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…nter (ray-project#56848)

Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem-anyscale <cem@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…nter (ray-project#56848)

Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem-anyscale <cem@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Labels

  • data: Ray Data-related issues
  • docs: An issue or change related to documentation
  • go: add ONLY when ready to merge, run all tests
  • unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

4 participants