[dask] Support Dask dataframes with 'category' columns (fixes #3861) #3908
Conversation
@jameslamb Thanks a lot! I love how a one-line simplification fixed a critical issue!
However, I'm afraid the tests are not very reliable. I remember the same problem in the pure pandas tests, where I had to fix the random state to force LightGBM to use the category column for splits:
np.random.seed(42)  # sometimes there is no difference how cols are treated (cat or not cat)
Sorry, I don't understand your comment. How does the random state impact how pandas handles category columns? Is it that in that test you were using such a small data size that sometimes the column was mostly a single categorical level, and so it got dropped by pre-filtering? (https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#enable-feature-pre-filtering-when-creating-dataset)
I mean, with some seeds the LightGBM models have identical output whether a particular column is treated as categorical or not. I think it would be good to ensure that the categorical column in the Dask test has some impact on the LightGBM model.
Ok. I could check the output of trees_to_dataframe().
I think it will be enough to check that the categorical column was used for splits.
OHHHH now I understand what you mean. You're saying it's random whether or not LightGBM decides to even treat it as categorical vs. continuous data? That's surprising to me 😬
I don't think so. I believe some "unlucky" distributions made the cat column be treated as a constant column, so LightGBM dropped it. Or maybe something else caused this behaviour; I haven't investigated those cases. I just remember that my asserts for different outputs with/without the cat column were failing until I fixed the random seed.
This is exactly what I meant in #3908 (comment). And if that's the case, then checking that the column was used for splits (#3908 (comment)) would be enough.
Yeah, if you think it will be easier, sure, please try with trees_to_dataframe().
Thanks! I think they're both equally easy. I'll add the test with trees_to_dataframe() shortly. I'll also update the test models to use …
Ok, I think I got this working! I ended up not needing the … I was really struggling to get LightGBM to use a single categorical feature on every test run because of the small data size, so I changed the approach to "just add multiple categorical columns and make sure that at least one of them was used". This seemed to work! I re-ran the tests several times and I think this approach is reliable.
For classification and ranking --> add 5 category columns, each of which is a random draw from …
For regression --> make sure that the training data from …
For regression tasks with just 100 observations, the random continuous features always overwhelm the categorical columns and get chosen for splits. I saw that even when setting …
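For context, here is a minimal sketch of that pattern using the plain scikit-learn interface rather than the Dask one. The column names, data sizes, and the way the target is built below are illustrative assumptions, not the actual test code.
# A minimal sketch of the approach described above, using the non-Dask
# scikit-learn interface for brevity. Column names, sizes, and the target
# construction are illustrative assumptions, not the actual test code.
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(42)
n = 1_000
X = pd.DataFrame({'num_1': rng.normal(size=n), 'num_2': rng.normal(size=n)})
# add several categorical columns, each a random draw from a few string levels
for i in range(5):
    X[f'cat_{i}'] = pd.Series(rng.choice(list('abcd'), size=n), dtype='category')
# make the target depend on one categorical column so the cat columns matter
y = X['cat_0'].isin(['a', 'b']).astype(int)
clf = lgb.LGBMClassifier(n_estimators=10).fit(X, y)
# assert that at least one of the categorical columns was used for a split
cat_cols = [col for col in X.columns if col.startswith('cat_')]
tree_df = clf.booster_.trees_to_dataframe()
assert tree_df['split_feature'].isin(cat_cols).sum() > 0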
Thanks for enhancing the test! Please check my new comment below
    if dX.dtypes[col].name == 'category'
]
tree_df = dask_classifier.booster_.trees_to_dataframe()
assert tree_df['split_feature'].isin(cat_cols).sum() > 0
I'm not sure this assert is enough to say that at least one of cat_cols was treated as categorical. I believe you should also check the comparison sign to ensure these features were treated as categorical, not as numerical.
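For reference, one way such a check could look, assuming the quoted tree_df and cat_cols from the test code above and that trees_to_dataframe() reports categorical splits with a '==' decision type (a sketch, not the actual test code):
# Sketch of the stronger check suggested here: splits on the 'cat_' columns
# should use the categorical decision type ('==') rather than the numerical '<='.
# Assumes `tree_df` and `cat_cols` are defined as in the quoted test code above.
cat_splits = tree_df[tree_df['split_feature'].isin(cat_cols)]
assert len(cat_splits) > 0
assert (cat_splits['decision_type'] == '==').all()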
My goal was not to check that the columns are treated as categorical features by LightGBM. My goal was to ensure that having pandas "category" columns in your training data does not break LightGBM's Dask estimators (as it currently does on master, documented in #3861).
As far as I can tell, "category" columns will only be automatically treated as categorical features if you set categorical_feature = 'auto':
LightGBM/python-package/lightgbm/basic.py, lines 513 to 514 in 8ef874b:
if categorical_feature == 'auto':  # use cat cols from DataFrame
    categorical_feature = cat_cols_not_ordered
or if you set categorical_feature to the column names / indices in the parameters.
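For illustration, a small sketch of those two paths using the non-Dask interface (the data and column names here are made up, and the exact test wiring is not shown):
# Sketch: two ways a pandas 'category' column ends up treated as categorical.
# Data and column names are illustrative, not taken from the tests.
import numpy as np
import pandas as pd
import lightgbm as lgb

X = pd.DataFrame({
    'num_0': np.random.rand(200),
    'cat_0': pd.Series(np.random.choice(list('abcd'), size=200), dtype='category'),
})
y = np.random.rand(200)
reg = lgb.LGBMRegressor(n_estimators=5)
# 1. rely on the default categorical_feature='auto' to pick up the 'category' dtype
reg.fit(X, y)
# 2. name the categorical columns explicitly
reg.fit(X, y, categorical_feature=['cat_0'])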
The tests like assert_eq(p1_local, p2) already test that the Dask interface is producing a model that is similar to the one produced by the scikit-learn interface.
I can add categorical_feature = [col for col in dX.columns if col.startswith('cat_')] and test that, that's fine. I will try to do that tonight. But it wasn't the intention of this PR or these tests.
I've added these new checks in 07c9dca
Sorry that I didn't get the purpose of this PR!
Thanks a lot for the fix and the test enhancements!
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
This PR fixes #3861.
Root Cause
Thanks to @jmoralez for figuring out that the root issue was pandas "category" columns in data frames! lightgbm.dask._predict_part() did some conversions that did not handle these columns correctly.
Changes in this PR
- Removes code in _predict_part() that converts a chunk of data to a numpy array before passing it through to .predict() on the underlying scikit-learn model. If you call .values on a pandas "category" column, the results are strings, and passing a numpy array with string data into .predict() methods in LightGBM leads to errors (see the small illustration after this list).
- Adds data frames with a "category" column to many of the Dask unit tests.
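A small illustration of that root cause, using only pandas (nothing here beyond what the description above states):
# Converting a data frame with a 'category' column to a numpy array keeps the
# string labels, not numeric codes, which LightGBM's .predict() cannot consume.
import pandas as pd

df = pd.DataFrame({'cat_0': pd.Series(['a', 'b', 'a', 'c'], dtype='category')})
arr = df.values
print(arr.dtype)   # object
print(arr[:, 0])   # ['a' 'b' 'a' 'c'] -- string labels, not numeric codes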
Reproducible example
I've added a reproducible example in #3861 (comment). If you run the code on current master, it will fail with the error described in #3861. If you run that code on this branch, it will succeed 😀.
References
I found a few issues and PRs related to "category" columns in pandas being supported in the scikit-learn interface. Putting them here for anyone who finds this PR from search in the future.
LightGBM/python-package/lightgbm/basic.py, lines 497 to 500 in 8ef874b