-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[Data] [1/2] - Add aggregators per column type. Numerical, categorial, and vector aggregators #56610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] [1/2] - Add aggregators per column type. Numerical, categorial, and vector aggregators #56610
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces new aggregators MissingValuePercentage and ZeroPercentage, along with helper functions to generate default aggregators for numerical, categorical, and vector columns. The changes are well-structured and the new features are thoroughly tested. My review focuses on improving documentation, code clarity, and fixing issues in the new tests and build configuration to ensure they run correctly.
| if name not in name_to_type: | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Just do name_to_type.get(name) and the conditional below will handle it
7510c67 to
7d562d7
Compare
7ae2e7c to
5078abc
Compare
…al, and vector aggregators Added two new aggregators - MissingValuePercentage and ZeroPercentage which is applicable to numerical columns This is prep work for adding `pd.describe()` like functionality for Ray Data.
5078abc to
90a1cb2
Compare
cem-anyscale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; just two minor comments.
| if count == 0: | ||
| return [0, 0] | ||
|
|
||
| arrow_compatible = column_accessor._as_arrow_compatible() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we abstract this as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add this as a follow up
| Returns: | ||
| FeatureAggregators containing categorized column names and their aggregators | ||
| """ | ||
| schema = dataset.schema() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might trigger execution
| if name not in name_to_type: | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Just do name_to_type.get(name) and the conditional below will handle it
| elif pa.types.is_string(ftype): | ||
| str_columns.append(name) | ||
| all_aggs.extend(categorical_aggregators(name)) | ||
| elif pa.types.is_list(ftype): | ||
| vector_columns.append(name) | ||
| all_aggs.extend(vector_aggregators(name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's abstract common utils also handling dictionary encoded ones
…, and vector aggregators (ray-project#56610) ## Why are these changes needed? Added two new aggregators - MissingValuePercentage and ZeroPercentage which is applicable to numerical columns This is prep work for adding `pd.describe()` like functionality for Ray Data. Signed-off-by: zac <zac@anyscale.com>
…, and vector aggregators (#56610) ## Why are these changes needed? Added two new aggregators - MissingValuePercentage and ZeroPercentage which is applicable to numerical columns This is prep work for adding `pd.describe()` like functionality for Ray Data. Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…, and vector aggregators (ray-project#56610) ## Why are these changes needed? Added two new aggregators - MissingValuePercentage and ZeroPercentage which is applicable to numerical columns This is prep work for adding `pd.describe()` like functionality for Ray Data. Signed-off-by: Marco Stephan <marco@magic.dev>
…, and vector aggregators (#56610) ## Why are these changes needed? Added two new aggregators - MissingValuePercentage and ZeroPercentage which is applicable to numerical columns This is prep work for adding `pd.describe()` like functionality for Ray Data. Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…, and vector aggregators (#56610) ## Why are these changes needed? Added two new aggregators - MissingValuePercentage and ZeroPercentage which is applicable to numerical columns This is prep work for adding `pd.describe()` like functionality for Ray Data. Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
…, and vector aggregators Original PR #56610 by goutamvenkat-anyscale Original: ray-project/ray#56610
…. Numerical, categorial, and vector aggregators Merged from original PR #56610 Original: ray-project/ray#56610
…, and vector aggregators Original PR #56610 by goutamvenkat-anyscale Original: ray-project/ray#56610
…. Numerical, categorial, and vector aggregators Merged from original PR #56610 Original: ray-project/ray#56610
…, and vector aggregators (ray-project#56610) ## Why are these changes needed? Added two new aggregators - MissingValuePercentage and ZeroPercentage which is applicable to numerical columns This is prep work for adding `pd.describe()` like functionality for Ray Data.
…, and vector aggregators Original PR #56610 by goutamvenkat-anyscale Original: ray-project/ray#56610
…. Numerical, categorial, and vector aggregators Merged from original PR #56610 Original: ray-project/ray#56610
…, and vector aggregators (ray-project#56610) ## Why are these changes needed? Added two new aggregators - MissingValuePercentage and ZeroPercentage which is applicable to numerical columns This is prep work for adding `pd.describe()` like functionality for Ray Data.
…, and vector aggregators (ray-project#56610) ## Why are these changes needed? Added two new aggregators - MissingValuePercentage and ZeroPercentage which is applicable to numerical columns This is prep work for adding `pd.describe()` like functionality for Ray Data. Signed-off-by: Future-Outlier <eric901201@gmail.com>
Why are these changes needed?
Added two new aggregators - MissingValuePercentage and ZeroPercentage which is applicable to numerical columns
This is prep work for adding
pd.describe()like functionality for Ray Data.Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.