@goutamvenkat-anyscale

Why are these changes needed?

Added two new aggregators, MissingValuePercentage and ZeroPercentage, which are applicable to numerical columns.

This is prep work for adding `pd.describe()`-like functionality for Ray Data.
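
For readers unfamiliar with this kind of aggregator, the following is a minimal pure-Python sketch of the idea, not the code added in this PR. It assumes each block is reduced to a `[matching_rows, counted_rows]` pair, the pairs are summed across blocks, and the percentage is computed at the end. All function names are hypothetical, and the choice to exclude nulls from the zero-percentage denominator is an assumption.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical helper names; this is NOT Ray Data's implementation.
State = List[int]  # [matching_rows, counted_rows]
Block = Sequence[Optional[float]]


def missing_state(block: Block) -> State:
    """Per-block partial state: [null_count, row_count]."""
    if len(block) == 0:
        return [0, 0]
    return [sum(1 for v in block if v is None), len(block)]


def zero_state(block: Block) -> State:
    """Per-block partial state: [zero_count, non_null_count].

    Assumption: nulls are excluded from the denominator.
    """
    non_null = [v for v in block if v is not None]
    return [sum(1 for v in non_null if v == 0), len(non_null)]


def combine(a: State, b: State) -> State:
    """Merge two partial states by summing both counters."""
    return [a[0] + b[0], a[1] + b[1]]


def finalize(state: State) -> Optional[float]:
    """Turn the merged counters into a percentage (None if nothing was counted)."""
    matched, counted = state
    if counted == 0:
        return None
    return 100.0 * matched / counted


def percentage(blocks: Sequence[Block], per_block: Callable[[Block], State]) -> Optional[float]:
    """Reduce every block to a partial state, merge them, then finalize."""
    state: State = [0, 0]
    for block in blocks:
        state = combine(state, per_block(block))
    return finalize(state)


blocks = [[1.0, None, 0.0, 4.0], [None, 0.0]]
missing_pct = percentage(blocks, missing_state)  # 2 nulls out of 6 rows
zero_pct = percentage(blocks, zero_state)        # 2 zeros out of 4 non-null rows
```

Keeping the partial state as two counters rather than a ratio is what makes the per-block results safe to merge in any order.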

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner September 17, 2025 02:56

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces new aggregators MissingValuePercentage and ZeroPercentage, along with helper functions to generate default aggregators for numerical, categorical, and vector columns. The changes are well-structured and the new features are thoroughly tested. My review focuses on improving documentation, code clarity, and fixing issues in the new tests and build configuration to ensure they run correctly.

Comment on lines +133 to +134
if name not in name_to_type:
    continue

medium

The `if name not in name_to_type:` check appears to be redundant. The `missing_cols` check on lines 115-117 already ensures that all columns are present in the schema. You can probably remove this if/continue block for cleaner code.

nit: Just do name_to_type.get(name) and the conditional below will handle it
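
The suggestion above can be sketched roughly as follows. This is a simplified, hypothetical stand-in for the code under review: types are plain strings rather than Arrow types, and `categorize` and `NUMERIC_TYPES` are made-up names. The point is only that `dict.get` returns `None` for an unknown column, so the existing type checks fall through without a separate membership test.

```python
from typing import Dict, List, Tuple

# Hypothetical type tags standing in for real Arrow types.
NUMERIC_TYPES = {"int64", "double"}


def categorize(
    columns: List[str], name_to_type: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Split column names into numeric, string, and skipped groups."""
    numeric: List[str] = []
    strings: List[str] = []
    skipped: List[str] = []
    for name in columns:
        # .get() yields None for a missing column; None matches no branch
        # below, so no explicit `if name not in name_to_type: continue`.
        ftype = name_to_type.get(name)
        if ftype in NUMERIC_TYPES:
            numeric.append(name)
        elif ftype == "string":
            strings.append(name)
        else:
            skipped.append(name)
    return numeric, strings, skipped


result = categorize(["age", "name", "ghost"], {"age": "int64", "name": "string"})
```

Here `"ghost"` is absent from the mapping and simply lands in the skipped group.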

@gvspraveen gvspraveen mentioned this pull request Sep 17, 2025
8 tasks
@ray-gardener ray-gardener bot added the docs (An issue or change related to documentation) and data (Ray Data-related issues) labels Sep 17, 2025
@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the custom_agg_and_stats branch 2 times, most recently from 7510c67 to 7d562d7 Compare September 17, 2025 17:22
@goutamvenkat-anyscale goutamvenkat-anyscale added the go (add ONLY when ready to merge, run all tests) label Sep 17, 2025
@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the custom_agg_and_stats branch 2 times, most recently from 7ae2e7c to 5078abc Compare September 17, 2025 23:20
…al, and vector aggregators


@cem-anyscale cem-anyscale left a comment

LGTM; just two minor comments.

if count == 0:
    return [0, 0]

arrow_compatible = column_accessor._as_arrow_compatible()

should we abstract this as well?

Contributor Author

I can add this as a follow up

@richardliaw richardliaw merged commit b3799a5 into ray-project:master Sep 22, 2025
5 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the custom_agg_and_stats branch September 22, 2025 18:39
Returns:
    FeatureAggregators containing categorized column names and their aggregators
"""
schema = dataset.schema()

This might trigger execution

Comment on lines +153 to +158
elif pa.types.is_string(ftype):
    str_columns.append(name)
    all_aggs.extend(categorical_aggregators(name))
elif pa.types.is_list(ftype):
    vector_columns.append(name)
    all_aggs.extend(vector_aggregators(name))

Let's abstract these checks into common utils that also handle dictionary-encoded types

ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
…, and vector aggregators (ray-project#56610)

## Why are these changes needed?

Added two new aggregators - MissingValuePercentage and ZeroPercentage
which is applicable to numerical columns

This is prep work for adding `pd.describe()` like functionality for Ray
Data.

Signed-off-by: zac <zac@anyscale.com>
