dask_cudf: scikit-learn API leads to impossible train-time ValueError("feature_names mismatch") #6268
Comments
FYI, this problem does not occur if I just call fit without the eval_set, so it seems to be a problem with the eval_set itself. However, I confirmed that the dask_cudf frame looked fine before calling fit with the eval_set.
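For reference, a minimal sketch of the two call patterns being contrasted (my reconstruction, not the original snippet; `X_train`, `y_train`, `X_valid`, `y_valid` are assumed dask_cudf collections on a running GPU cluster):

```python
import xgboost as xgb

# Assumed: a dask.distributed client is already connected to a GPU cluster and
# X_train / y_train / X_valid / y_valid are dask_cudf objects (hypothetical names).
model = xgb.dask.DaskXGBRegressor(tree_method="gpu_hist")

# This path works fine:
model.fit(X_train, y_train)

# This path is the one that raises ValueError("feature_names mismatch") in the report:
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
```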
The other odd message is that the DMatrix is empty, and this shows up in the attempted MRE example as well. Neither the training data nor the eval_set is an empty frame. Perhaps some of the early stopping PRs submitted for 1.3.0 already fixed this kind of problem?
As an aside, related to using dask_cudf, the same example sometimes fails in other ways:
As well as:
The latter pickle issue was wilder. It was as if loggers from global scope were being pickled (i.e. when the dask cache was not found and dask went to pickle things). I traced through it in debug mode and saw it happening, but I didn't understand why dask would be doing that.
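For what it's worth, here is a toy sketch (my own guess at the mechanism, not taken from the traceback) of how a module-level logger can end up in a pickled dask payload: cloudpickle serializes functions defined in `__main__` by value, together with the globals they reference.

```python
import logging
import cloudpickle

logger = logging.getLogger("my_app")  # hypothetical global logger

def transform(partition):
    # Referencing the global logger pulls it into the serialized function.
    logger.info("processing partition")
    return partition

# Roughly what dask/cloudpickle does before shipping a task defined in __main__.
payload = cloudpickle.dumps(transform)
restored = cloudpickle.loads(payload)
```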
Also perhaps relevant: the empty DMatrix message seems wrong, in that if one asks for the shape of the computed frame it is some number of rows by 1 column, yet the message still reports the DMatrix as empty.
The empty DMatrix just means it is empty on a specific worker. Dask does not balance the dataset among workers perfectly, so some of them can be starving. On the latest xgboost with regression/classification models you can safely ignore the warning if you don't care about performance at the moment (balanced is better). With ranking/survival models this warning is real.
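A quick way to see whether that is happening (a sketch, assuming a connected client named `client` and a dask_cudf frame named `ddf`, both hypothetical names):

```python
from dask.distributed import wait

ddf = ddf.persist()   # materialize partitions on the workers
wait(ddf)

# Rows per partition: zero-length partitions can leave a worker with an empty DMatrix.
print(ddf.map_partitions(len).compute())

# How many keys each worker is currently holding:
for worker, keys in client.has_what().items():
    print(worker, len(keys))
```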
The mismatched feature name is new to me. I will try to reproduce it on my end.
Looks like something is messing up Python reflection. :-( Any chance you can get the error without PyCharm? As for pickling, I think I need an MRE; sometimes it's hard to reason about why dask is pickling this and that. Thanks for reporting the errors, this will help smooth out the user experience. I understand your frustration, but could you please break the issue into separate ones? It's difficult to track with the comments mixed together.
Yes, I will break this into separate issues once I have some kind of MRE. At this point I'm putting everything in the same issue because the problems may be related, and that could help in finding an MRE.
Actually, I hit the same error randomly with more than one feature:
Again, for no good reason; it doesn't normally fail, and here I'm no longer even using eval_set with dask_cudf. Somehow the feature names still get nuked.
Is it possible that a worker got a dataframe with some features gone? Just guessing.
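One way to check that guess (a sketch, assuming the training frame is called `ddf`): compute each partition's actual column list on the workers and compare them.

```python
import dask

def columns_of(part):
    # Runs on the worker against the concrete cudf partition.
    return list(part.columns)

parts = ddf.to_delayed()
column_lists = dask.compute(*[dask.delayed(columns_of)(p) for p in parts])
print(set(map(tuple, column_lists)))  # should contain exactly one entry
```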
@trivialfis While trying to repro the crash, I was able to reproduce this feature name issue.
Don't let the inner name fool you; that is the name from the other issue, the one about the crash. This gives:
I think I haven't hit it lately because I made sure to always have at least one chunk on each worker (see the sketch below). So maybe it's just a bad cascade of errors.
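Concretely, that workaround amounts to something like this sketch (again assuming the hypothetical `client` and `ddf` names): make sure there are at least as many partitions as workers before training.

```python
n_workers = len(client.scheduler_info()["workers"])
if ddf.npartitions < n_workers:
    ddf = ddf.repartition(npartitions=n_workers)
ddf = ddf.persist()
client.rebalance()  # spread the in-memory partitions across the workers
```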
Hmm, training continuation with an empty matrix...
Same setup as here: #6232 (comment)
When running the same dataset, but using dask_cudf with only one column, I keep hitting this error, and only with dask_cudf:
With the scikit-learn API, this should be an impossible situation to get into during training, yet it always happens.
However, trying to reproduce this outside our application does not lead to the same error; the standalone snippet completes without it.
Still, I'm reporting it, since it clearly should be impossible to hit with the sklearn API we are using.
Inside the application where this example is used we have many more imports, so there may be some conflict, similar to the "dill" issue I posted before. But this seems more relevant to xgboost proper.
While I try to find an MRE, do you have any advice or thoughts?
Also, this only happens on a multi-GPU machine. The exact code in our application runs through just fine with dask_cudf on a single-GPU machine, without such problems.
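Since the original standalone snippet is omitted above, here is a hedged sketch of the kind of reproduction attempt being described: a single-feature dask_cudf frame trained with the dask scikit-learn wrapper and an eval_set on a multi-GPU `LocalCUDACluster`. All names and sizes are made up.

```python
import cudf
import dask_cudf
import numpy as np
import xgboost as xgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()            # one worker per visible GPU
    client = Client(cluster)

    rng = np.random.default_rng(0)
    gdf = cudf.DataFrame(
        {"f0": rng.random(10_000), "y": rng.integers(0, 2, 10_000)}
    )
    ddf = dask_cudf.from_cudf(gdf, npartitions=8)

    X, y = ddf[["f0"]], ddf["y"]
    X_valid, y_valid = X.partitions[:2], y.partitions[:2]   # small held-out slice

    clf = xgb.dask.DaskXGBClassifier(tree_method="gpu_hist", n_estimators=50)
    clf.fit(X, y, eval_set=[(X_valid, y_valid)])
    print(clf.evals_result())
```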
@teju85