[Data] [1/2] - Add aggregators per column type. Numerical, categorial, and vector aggregators #56610

goutamvenkat-anyscale · 2025-09-17T02:56:13Z

Why are these changes needed?

Adds two new aggregators, MissingValuePercentage and ZeroPercentage, which apply to numerical columns.

This is prep work for adding `pd.describe()`-like functionality to Ray Data.
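As a rough illustration of what these two statistics measure, here is a plain-Python sketch of a per-block accumulate / merge / finalize flow. It is not the Ray Data implementation: the function names are hypothetical, `None` stands in for a missing value, and computing the zero percentage over non-null values only is an assumption rather than something the PR description states.

```python
from functools import reduce
from typing import List, Optional, Sequence


def accumulate_missing(values: Sequence[Optional[float]]) -> List[int]:
    """Partial state for one block: [missing_count, total_count]."""
    missing = sum(1 for v in values if v is None)
    return [missing, len(values)]


def accumulate_zeros(values: Sequence[Optional[float]]) -> List[int]:
    """Partial state for one block: [zero_count, non_null_count].

    Assumption: nulls are excluded from the denominator.
    """
    non_null = [v for v in values if v is not None]
    zeros = sum(1 for v in non_null if v == 0)
    return [zeros, len(non_null)]


def merge(a: List[int], b: List[int]) -> List[int]:
    """Combine two partial states by adding them element-wise."""
    return [a[0] + b[0], a[1] + b[1]]


def finalize_percentage(state: List[int]) -> Optional[float]:
    """Turn [hits, total] into a percentage; None when there is no data."""
    hits, total = state
    if total == 0:
        return None
    return 100.0 * hits / total


# Two "blocks" of a single numerical column.
blocks = [[1, None, 0, 4], [0, None]]

missing_state = reduce(merge, (accumulate_missing(b) for b in blocks))
zero_state = reduce(merge, (accumulate_zeros(b) for b in blocks))

missing_pct = finalize_percentage(missing_state)  # 2 of 6 values are missing
zero_pct = finalize_percentage(zero_state)        # 2 of 4 non-null values are zero
```

Keeping the partial state as a pair of counts means blocks can be merged in any order before the percentage is computed.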

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

gemini-code-assist

Code Review

This pull request introduces new aggregators MissingValuePercentage and ZeroPercentage, along with helper functions to generate default aggregators for numerical, categorical, and vector columns. The changes are well-structured and the new features are thoroughly tested. My review focuses on improving documentation, code clarity, and fixing issues in the new tests and build configuration to ensure they run correctly.

python/ray/data/tests/test_dataset_stats.py

python/ray/data/BUILD.bazel

python/ray/data/stats.py

gemini-code-assist · 2025-09-17T02:57:38Z

python/ray/data/stats.py

+        if name not in name_to_type:
+            continue


The check `if name not in name_to_type:` appears to be redundant. The missing_cols check on lines 115-117 already ensures that all columns are present in the schema. You can probably remove this if/continue block for cleaner code.

nit: Just do name_to_type.get(name) and the conditional below will handle it
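A minimal sketch of the suggested pattern: look the type up with `.get()` and let the type-dispatch chain ignore a `None` result, so no separate membership check is needed. The function name and the string type tags below are hypothetical stand-ins for the Arrow types used in stats.py.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins for Arrow numeric types.
NUMERIC_TAGS = {"int64", "float64"}


def categorize(columns: List[str], name_to_type: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split column names into numerical and string columns."""
    numerical: List[str] = []
    strings: List[str] = []
    for name in columns:
        ftype = name_to_type.get(name)  # None when the column is unknown
        if ftype in NUMERIC_TAGS:
            numerical.append(name)
        elif ftype == "string":
            strings.append(name)
        # An unknown column (ftype is None) matches no branch and is skipped.
    return numerical, strings


result = categorize(["a", "b", "missing"], {"a": "int64", "b": "string"})
```

With real Arrow type predicates the chain would need to tolerate `None` explicitly, since this sketch relies on plain equality and set membership.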

python/ray/data/tests/test_dataset_stats.py

python/ray/data/aggregate.py

…al, and vector aggregators Added two new aggregators - MissingValuePercentage and ZeroPercentage which is applicable to numerical columns This is prep work for adding `pd.describe()` like functionality for Ray Data.

cem-anyscale

LGTM; just two minor comments.

python/ray/data/aggregate.py

cem-anyscale · 2025-09-18T23:12:32Z

python/ray/data/aggregate.py

+        if count == 0:
+            return [0, 0]
+
+        arrow_compatible = column_accessor._as_arrow_compatible()


Should we abstract this as well?

goutamvenkat-anyscale · 2025-09-19

I can add this as a follow-up.

alexeykudinkin · 2025-09-22T20:45:42Z

python/ray/data/stats.py

+    Returns:
+        FeatureAggregators containing categorized column names and their aggregators
+    """
+    schema = dataset.schema()


This might trigger execution

alexeykudinkin · 2025-09-22T20:46:47Z

python/ray/data/stats.py

+        if name not in name_to_type:
+            continue


nit: Just do name_to_type.get(name) and the conditional below will handle it

alexeykudinkin · 2025-09-22T20:47:45Z

python/ray/data/stats.py

+        elif pa.types.is_string(ftype):
+            str_columns.append(name)
+            all_aggs.extend(categorical_aggregators(name))
+        elif pa.types.is_list(ftype):
+            vector_columns.append(name)
+            all_aggs.extend(vector_aggregators(name))


Let's abstract these checks into common utils that also handle dictionary-encoded types.


goutamvenkat-anyscale requested a review from a team as a code owner September 17, 2025 02:56

gemini-code-assist bot reviewed Sep 17, 2025


gvspraveen mentioned this pull request Sep 17, 2025

[data] summary API for datasets #56510

Closed

8 tasks

ray-gardener bot added the docs (An issue or change related to documentation) and data (Ray Data-related issues) labels Sep 17, 2025

goutamvenkat-anyscale force-pushed the custom_agg_and_stats branch 2 times, most recently from 7510c67 to 7d562d7 Compare September 17, 2025 17:22

goutamvenkat-anyscale added the go (add ONLY when ready to merge, run all tests) label Sep 17, 2025

goutamvenkat-anyscale force-pushed the custom_agg_and_stats branch 2 times, most recently from 7ae2e7c to 5078abc Compare September 17, 2025 23:20

cem-anyscale reviewed Sep 18, 2025


python/ray/data/aggregate.py (two outdated, resolved comments)

goutamvenkat-anyscale force-pushed the custom_agg_and_stats branch from 5078abc to 90a1cb2 Compare September 18, 2025 20:00

cem-anyscale approved these changes Sep 18, 2025


Merge branch 'master' into custom_agg_and_stats

f645562

richardliaw approved these changes Sep 22, 2025


richardliaw merged commit b3799a5 into ray-project:master Sep 22, 2025
5 checks passed

goutamvenkat-anyscale deleted the custom_agg_and_stats branch September 22, 2025 18:39

alexeykudinkin reviewed Sep 22, 2025


snorkelopstesting2-coder mentioned this pull request Oct 11, 2025

[Data] [1/2] - Add aggregators per column type. Numerical, categorial, and vector aggregators snorkel-marlin-repos/ray-project_ray_pr_56610_d9ea6244-18c1-471f-a03a-e7c02f4c0ccb#1

Merged

snorkelopstesting2-coder mentioned this pull request Oct 11, 2025

[Data] [1/2] - Add aggregators per column type. Numerical, categorial, and vector aggregators snorkel-marlin-repos/ray-project_ray_pr_56610_9b561a1c-fb4c-4cc8-8778-28a09088dea6#1

Merged

snorkelopstesting2-coder mentioned this pull request Oct 22, 2025

[Data] [1/2] - Add aggregators per column type. Numerical, categorial, and vector aggregators snorkel-marlin-repos/ray-project_ray_pr_56610_0e29e16f-2237-4185-a7ba-75b5f72e9a75#1

Merged
