[XGBoost4J-Spark] Early stopping and best iteration #6893
Comments
@wbo4958 probably has some insight. |
@candalfigomoro According to the code at https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/XGBoost.java#L253, it looks like "it uses the best iteration + num_early_stopping_rounds". I have no idea how to get the value of the best iteration; it seems we need to support this. |
@wbo4958 |
@wbo4958 If you are interested in the feature, you can use the |
Ok, will add this feature. |
Hi @wbo4958 - I would also like to use this feature in Spark. Were you able to work on it? If not, I can give it a try. |
The XGBoostClassificationModel object has a method called getVersion(), but the documentation says little about it. In my experiments, booster.getVersion() / 2 always returned the latest iteration, even with early stopping, so ((booster.getVersion() / 2) - earlyStoppingRound) gives the bestIteration. Can anyone confirm this, or point out cases where it won't work? @trivialfis or @wbo4958 or @CodingCat? |
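The arithmetic proposed in the comment above can be written out as a small helper. This is only a sketch of that observation, which the commenter reports from experimentation and which is not confirmed in this thread as a documented contract of getVersion(). The class and method names (BestIterationEstimate, fromVersion) are hypothetical, and the booster version is passed in as a plain int so the block has no dependency on XGBoost4J.

```java
/** Sketch of the arithmetic from the comment above; not a confirmed XGBoost4J contract. */
final class BestIterationEstimate {
    private BestIterationEstimate() {}

    /**
     * @param boosterVersion      the value returned by booster.getVersion()
     * @param earlyStoppingRounds the configured number of early stopping rounds
     * @return the estimated best iteration, assuming training was stopped early
     */
    static int fromVersion(int boosterVersion, int earlyStoppingRounds) {
        // Empirical observation from the comment: version / 2 is the latest iteration.
        int lastIteration = boosterVersion / 2;
        // Early stopping fires earlyStoppingRounds iterations after the best one.
        return lastIteration - earlyStoppingRounds;
    }
}
```

The subtraction only makes sense if early stopping actually triggered. If training ran through all configured rounds, the last iteration is not necessarily earlyStoppingRounds past the best one, so the estimate would presumably be off.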
@naveenkb |
I have added bestIteration using |
There's a |
We are in the process of replacing that parameter with more robust |
@naveenkb |
@candalfigomoro Sure. Sorry for the delay. I will raise a PR in a few days. |
Hello, I was going through the parameters of XGBoost4J-Spark. The definition of numEarlyStoppingRounds is as follows: "If non-zero, the training will be stopped after a specified number of consecutive increases in any evaluation metric." But shouldn't it be "the training will be stopped after a specified number of consecutive non-increases (same or decrease) in any evaluation metric"? Also, is there any parameter through which I can set a threshold for early stopping rounds, so that training stops if the evaluation metric doesn't improve by at least the threshold within the early stopping rounds? Thanks, |
This is tricky because some metrics need to be minimized (e.g. MSE) while other metrics need to be maximized (e.g. accuracy). See also the |
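To make the two points above concrete (metric direction and an improvement threshold), here is a minimal, library-independent sketch of an early stopping rule. It is not how XGBoost4J-Spark implements early stopping; the names EarlyStoppingSketch, improves, run, patience and minDelta are all hypothetical, and minDelta in particular is an illustrative threshold, not an existing XGBoost4J-Spark parameter.

```java
/** Illustrative early stopping rule with an explicit direction and threshold. */
final class EarlyStoppingSketch {
    private EarlyStoppingSketch() {}

    /** True if candidate beats best by more than minDelta in the chosen direction. */
    static boolean improves(double candidate, double best, boolean maximize, double minDelta) {
        return maximize ? candidate > best + minDelta : candidate < best - minDelta;
    }

    /**
     * Walks through per-round evaluation scores and stops once `patience`
     * consecutive rounds have passed without a sufficient improvement.
     *
     * @return {bestIteration, lastIteration}, both 0-based
     */
    static int[] run(double[] scores, boolean maximize, int patience, double minDelta) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("scores must not be empty");
        }
        int best = 0;
        int last = 0;
        for (int i = 1; i < scores.length; i++) {
            last = i;
            if (improves(scores[i], scores[best], maximize, minDelta)) {
                best = i; // new best round, the patience window restarts here
            } else if (i - best >= patience) {
                break;    // no sufficient improvement for `patience` rounds
            }
        }
        return new int[] {best, last};
    }
}
```

For a metric that is minimized, such as MSE, a call like `EarlyStoppingSketch.run(new double[] {0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5}, false, 2, 0.0)` stops two rounds after the best one, which mirrors the "best iteration + num_early_stopping_rounds" behaviour discussed in this thread.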
How do you expose the bestIteration and bestScore attained during training? |
|
TODO: Follow up with documents. |
This feature would be very much appreciated for the XGBoost4J (non-Spark) library as well. We have a situation where the evaluation function does not necessarily decrease as the loss decreases. In fact, in our situation the evaluation function can increase when the loss decreases too far. This is deliberate: we use a quantile loss function and a custom evaluation metric to ensure that the loss function does not decrease to zero (if the loss is zero, the predictions are no longer quantiles). The current implementation means that the model returned after the early stopping rounds is far from optimal for many of our models, even though good performance was reached at earlier iterations. |
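Until the best iteration is exposed, one possible workaround for a case like this is to keep the per-round values of the evaluation metric and pick the best round after training. The sketch below assumes those values are already available to the caller as a float array; how they are collected is outside the sketch. BestRound, indexOfBest and treesToKeep are hypothetical names, and the "+ 1" assumes 0-based round indices with one tree added per round.

```java
/** Post-hoc selection of the best round from a recorded evaluation history. */
final class BestRound {
    private BestRound() {}

    /** Returns the 0-based index of the best value; ties keep the earliest round. */
    static int indexOfBest(float[] history, boolean maximize) {
        if (history.length == 0) {
            throw new IllegalArgumentException("history must not be empty");
        }
        int best = 0;
        for (int i = 1; i < history.length; i++) {
            boolean better = maximize
                    ? history[i] > history[best]
                    : history[i] < history[best];
            if (better) {
                best = i;
            }
        }
        return best;
    }

    /** Number of trees to keep at prediction time, assuming one tree per round. */
    static int treesToKeep(float[] history, boolean maximize) {
        return indexOfBest(history, maximize) + 1;
    }
}
```

The result of treesToKeep could then be used as the tree limit at prediction time, which is the use the original question has in mind for treeLimit.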
This has been asked before (e.g. #3140 (comment)), but no answer was ever given.
In XGBoost4J-Spark we can use early stopping by using setNumEarlyStoppingRounds.
When we call .transform(), does it use by default the best iteration (the best number of trees) or the best iteration + num_early_stopping_rounds?
If it uses the best iteration + num_early_stopping_rounds, how can I extract the value of the best iteration so I can set treeLimit to the best iteration?
Thanks