FEAT-#5925: Enable grouping on categoricals with range-partitioning impl #6862
Conversation
modin/core/dataframe/pandas/utils.py
Outdated
@@ -38,6 +38,8 @@ def concatenate(dfs):
    assert df.columns.equals(dfs[0].columns)
    for i in dfs[0].columns.get_indexer_for(dfs[0].select_dtypes("category").columns):
        columns = [df.iloc[:, i] for df in dfs]
        if not all(isinstance(col.dtype, pandas.CategoricalDtype) for col in columns):
Why?
There was a bug in this function that was never triggered before; I decided to fix it in this PR. I tried to make the fix more explicit by adding comments:
all_categorical_parts_are_empty = None
has_non_categorical_parts = False
for col in columns:
if isinstance(col.dtype, pandas.CategoricalDtype):
if all_categorical_parts_are_empty is None:
all_categorical_parts_are_empty = len(col) == 0
continue
all_categorical_parts_are_empty &= len(col) == 0
else:
has_non_categorical_parts = True
# 'union_categoricals' raises an error if some of the passed values don't have a categorical dtype.
# If that happens, we only want to continue when all parts with categorical dtypes are actually empty.
# This can happen if there was an aggregation that discards categorical dtypes but doesn't
# properly do so for empty partitions.
if has_non_categorical_parts and all_categorical_parts_are_empty:
continue
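For context, a standalone illustration (not part of the PR) of why the early `continue` is needed: `pandas.api.types.union_categoricals`, which this helper relies on, rejects mixed inputs, so a part whose aggregation dropped the categorical dtype would make the call raise.

```python
import pandas
from pandas.api.types import union_categoricals

cat_part = pandas.Series(["a", "b"], dtype="category")
# a part whose aggregation dropped the categorical dtype (e.g. a degenerate/empty result)
plain_part = pandas.Series([1.0, 2.0])

# union_categoricals() only accepts categorical inputs; mixing dtypes raises TypeError,
# hence the early 'continue' above when non-categorical parts are present
try:
    union_categoricals([cat_part.values, plain_part.values])
except TypeError as err:
    print(err)
```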
original_names = df.index.names
df = pandas.concat([df, values])
if kwargs["sort"]:
    # TODO: write search-sorted insertion or sort the result after insertion
Can you provide more details? I can't understand what you want here.
updated the comment:
# TODO: inserting 'values' based on 'searchsorted' result might be more efficient
# in cases with a small number of 'values'
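A rough sketch of that idea (my illustration, not code from the PR), assuming `df` is the already-sorted result and `values` holds the few (sorted) rows to insert:

```python
import numpy as np
import pandas

# 'df' is the sorted groupby result, 'values' holds a few missing groups to insert
df = pandas.DataFrame({"sum": [3, 7]}, index=pandas.Index(["a", "c"], name="key"))
values = pandas.DataFrame({"sum": [np.nan]}, index=pandas.Index(["b"], name="key"))

# find the positions that keep the index sorted instead of sorting the whole concat result
positions = df.index.searchsorted(values.index)
pieces, prev = [], 0
for i, pos in enumerate(positions):
    pieces.append(df.iloc[prev:pos])
    pieces.append(values.iloc[[i]])
    prev = pos
pieces.append(df.iloc[prev:])
result = pandas.concat(pieces)
print(result)  # rows ordered a, b, c without a full sort
```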
@dchigarev very detailed comments, thank you!
# If the aggregation has failed, the result would be empty. Assuming the
# fill value to be `np.NaN` here (this may not always be correct!!!)
So is this also a hack?
Yes, this is a hack; I've added a corresponding note.
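For reference, a small pandas example (not from the PR) of why assuming `np.NaN` is only a heuristic: the correct fill for a missing category depends on the aggregation, e.g. `sum` over an empty group yields 0 while `mean` yields NaN.

```python
import pandas

df = pandas.DataFrame(
    {
        "key": pandas.Categorical(["a"], categories=["a", "b"]),
        "value": [1],
    }
)

# the unobserved category "b" is filled with the aggregation's result on an empty group
print(df.groupby("key", observed=False)["value"].sum())   # a -> 1,   b -> 0
print(df.groupby("key", observed=False)["value"].mean())  # a -> 1.0, b -> NaN
```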
LGTM!
CI on master started to fail after merging this PR: https://github.com/modin-project/modin/actions/runs/7701338108/job/20995335180. I'll prepare a separate PR to revert this.
What do these changes do?
This PR allows grouping on categorical columns using the range-partitioning implementation. The main challenge was to support the proper behavior of the default value of the `groupby(observed=False)` parameter.

`observed=False` as a default value for groupby is deprecated, however it will only be replaced in pandas 3.0, so we should have an implementation for it.

What's `observed=False`?

This parameter includes missing categories in the result index; the missing values are then filled with the default value for the particular aggregation. Consider the example below for more details:
An example of how `observed=False` works
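A minimal pandas illustration of this behavior (a stand-in sketch, not the PR's original example):

```python
import pandas

df = pandas.DataFrame(
    {
        "key": pandas.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "value": [1, 2, 3],
    }
)

# observed=True: only the categories that actually occur in the data
print(df.groupby("key", observed=True)["value"].mean())
# a    1.5
# b    3.0

# observed=False: the unobserved category "c" is added to the result index and
# filled with the aggregation's default for an empty group (NaN for mean)
print(df.groupby("key", observed=False)["value"].mean())
# a    1.5
# b    3.0
# c    NaN
```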
How `observed=False` is implemented in this PR

Groupby itself is always called with the `observed=True` parameter, meaning that the result won't contain missing categories. Then I've added a post-processing procedure for groupby results that determines the missing categories and inserts them into the partitions in the proper order. Two kernels are submitted in order to perform this: an `add_missing_categories_to_groupby()` function that takes all resulting partitions along with some metadata and determines both the missing categories and the fill value that should be used as the aggregation result for these groups. This kernel then determines which missing categories should go to which partitions so that the result remains sorted. As the result of this kernel, a dictionary is returned mapping partition indices to the missing categorical values to be inserted into that partition.
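The actual kernels live in the PR itself; the snippet below is only a simplified, hypothetical sketch of the "which missing category goes to which partition" routing step, with made-up helper names:

```python
import bisect
import pandas

def split_missing_by_partition(missing, partition_last_keys):
    """Route each missing group key to the partition whose key range should hold it,
    so that inserting the keys keeps the result globally sorted (hypothetical helper)."""
    mapping = {}
    for key in missing:
        idx = min(bisect.bisect_left(partition_last_keys, key), len(partition_last_keys) - 1)
        mapping.setdefault(idx, []).append(key)
    return mapping

# partitions of a groupby(observed=True) result, already sorted by group key
parts = [
    pandas.DataFrame({"sum": [3]}, index=pandas.CategoricalIndex(["a"], categories=list("abcd"))),
    pandas.DataFrame({"sum": [7]}, index=pandas.CategoricalIndex(["c"], categories=list("abcd"))),
]
observed = {key for part in parts for key in part.index}
missing = [c for c in parts[0].index.categories if c not in observed]  # ['b', 'd']
last_keys = [part.index[-1] for part in parts]                         # ['a', 'c']
print(split_missing_by_partition(missing, last_keys))                  # {1: ['b', 'd']}
```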
Is it possible to run groupby with the `observed=False` parameter from the beginning and avoid this post-processing step?

It's possible, but then we would need to filter out fake missing values in an additional post-processing stage, which, combined with each kernel now returning much bigger dataframes, makes this implementation slower than the presented one.
How is this parameter handled in other modin's groupby implementations?

In the full-axis implementation, we always pass `observed=False` to the groupby, since we're dealing with a full-column partition there and can leave pandas to deal with it.

In the MapReduce implementation, the map stage is always performed with the `observed=True` parameter, and the reduce stage (the stage where we build a full-column partition) passes `observed=False` to the reduction groupby and gets the proper result.
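A toy sketch (an assumed simplification, not Modin's actual MapReduce kernels) of the map-with-`observed=True` / reduce-with-`observed=False` pattern described above:

```python
import pandas

df = pandas.DataFrame(
    {
        "key": pandas.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "value": [1, 2, 3],
    }
)
chunks = [df.iloc[:2], df.iloc[2:]]

# map stage: observed=True keeps the per-partition results small
mapped = [chunk.groupby("key", observed=True)["value"].sum() for chunk in chunks]

# reduce stage over the full column: observed=False restores the missing category "c"
reduced = pandas.concat(mapped).groupby(level=0, observed=False).sum()
print(reduced)  # a -> 3, b -> 3, c -> 0
```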
Can this be a single full-column kernel performing the post-processing instead of two map kernels?

The overhead of launching two kernels turned out to be much smaller than that of running a full-column operation: the full-column kernel approach was ~1.5x slower than the two map kernels approach.
Performance results
I performed testing on two datasets: the original H2O dataset and a slightly modified one. The results for the modified dataset are quite good, but for the original one they're quite disappointing.
1. Tests on modified H2O dataset
Modifications I made to the dataset:

- The `id3` column is cast from categorical to str. Why? The column has 1_000_000 unique values, which doesn't work very well with the current implementation of categories in modin (see the 2nd problem here). In particular, having a column with so many unique values as a categorical makes each operation much slower because of the overhead of keeping all 1_000_000 categorical values in each partition.
- `id1` only had 10 categorical values, now it has 10_000 categorical values.
- `id2` only had 10 categorical values, now it has 100 categorical values.

There were several test cases; you can read their descriptions in the code:
code I used to measure the results
500mb data, aggregation functions: `grp.agg(["mean", "sum", "max"])`
In this scenario, the only case where the compa-ratio changed its color after enabling `observed=False` is `case2_1`. However, the absolute difference is not that high.

The really sad thing is that in cases where the grouping column doesn't have missing categories (`case3_1` and `case3_2`) we still see an overhead of 20-40% just to ensure that there are no missing categories at the post-processing stage.

It's also worth mentioning that the overheads of `case3_1` and `case4` are almost the same, meaning that the implementation of the post-processing kernels itself doesn't add a lot of overhead; the main overhead comes from submitting an extra post-processing kernel.

1.5gb data, aggregation functions: `grp.agg(["mean", "sum", "max"])`
The relative overhead of `observed=False` has dropped from ~40% on average down to ~15% on average when compared with the 500mb dataset.

500mb data, aggregation functions: `grp.apply(lambda df: df.sum())`
Here we're interested in `case4 - case5`, as here we're manually running the applied func on every missing group in order to compute individual default values.

2. Tests on original H2O dataset
As described above, the original H2O dataset has a categorical column with a lot of unique values (`id3`). The high uniqueness makes modin struggle because of its implementation of categoricals, which stores all unique values in each partition (see the 2nd problem here).

I've run the following test cases on the original H2O data in two scenarios: with `id3` being categorical and with `id3` being a string type.

Script to measure
H2O original data ~500mb, 'id3' is categorical
H2O original data ~500mb, 'id3' is a string type
H2O original data ~500mb, 'id3_cat' vs 'id_str'
In this comparison we see that casting `id3` to an object type makes things much faster. This raises the question of whether we should claim that creating a categorical with high uniqueness is an antipattern for modin.

BTW, for pandas, when comparing `id3_cat` vs `id3_str`, the categorical approach wins.

However, when comparing `modin_id3_str` vs `pandas_id3_cat`, modin is still faster in some cases.

- `flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py`
- `black --check modin/ asv_bench/benchmarks scripts/doc_checker.py`
- `git commit -s`
- `docs/development/architecture.rst` is up-to-date