[jvm-packages] group data is only set for training set and is set incorrectly #3097
Comments
Yes, this is a known limitation of the current group data support. The "right" way to fix this is to make the group explicitly available for each row in the input data frame, e.g. via #2749.
It seems #2749 has been merged. Is there an example of how to set the group ID in a DataFrame and pass it to XGBoost?
No, it has not been merged yet. Exposing this in the JVM wrapper would require a little bit of work as well.
I made some local changes like the following; would that work? Also, when looking at the results, I found that for some records the prediction differs for the same input. Is that inherent to distributed training, i.e. is only a random subset of all trees used?
In the master branch of xgboost, users can now supply per-instance group info (like qid); see #3369.
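To illustrate why per-instance group info helps, here is a minimal sketch in plain Python. It is not the XGBoost or xgboost4j-spark API: the `qid` values, the `in_train` mask, and the helper `group_sizes_from_qid` are all hypothetical names made up for this example. It assumes rows are already sorted so that each group is contiguous.

```python
from itertools import groupby


def group_sizes_from_qid(qids):
    """Run-length encode a qid column whose rows are grouped contiguously."""
    return [len(list(run)) for _, run in groupby(qids)]


# Hypothetical per-row group ids for 9 rows in 3 groups.
qid = [10, 10, 10, 11, 11, 12, 12, 12, 12]

# A fixed mask standing in for a random train/test split.
in_train = [True, False, True, True, True, False, True, True, False]

# Because the group id travels with each row, it survives the split.
train_qid = [q for q, keep in zip(qid, in_train) if keep]
test_qid = [q for q, keep in zip(qid, in_train) if not keep]

# Group sizes can then be derived separately for each split.
train_sizes = group_sizes_from_qid(train_qid)
test_sizes = group_sizes_from_qid(test_qid)
```

With the group id attached to every row, each split yields sizes that sum to its own row count, which is the property a single externally supplied group array cannot guarantee.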
When creating Watches, the input data is split randomly into trainMatrix and testMatrix. However, the input groupData is set only on trainMatrix. Moreover, the groupData param describes the original data set, so it no longer fits the split trainMatrix.
https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala#L520
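The mismatch can be sketched with a toy example in plain Python. This is not the code in XGBoost.scala; the group sizes, the `in_train` mask, and the helper `sizes_cover_rows` are assumptions chosen only to show the arithmetic.

```python
# Hypothetical group sizes for the full data set: 3 groups, 9 rows in total.
group_sizes = [3, 2, 4]

# Expand the sizes into one group index per row: [0, 0, 0, 1, 1, 2, 2, 2, 2].
group_of_row = [g for g, size in enumerate(group_sizes) for _ in range(size)]

# A fixed mask standing in for the random train/test split.
in_train = [True, False, True, True, True, False, True, True, False]
train_groups = [g for g, keep in zip(group_of_row, in_train) if keep]


def sizes_cover_rows(sizes, n_rows):
    """Group sizes are only consistent if they sum to the matrix row count."""
    return sum(sizes) == n_rows


# The original sizes describe 9 rows, but the train split holds only 6.
fits_full_data = sizes_cover_rows(group_sizes, len(group_of_row))
fits_train_split = sizes_cover_rows(group_sizes, len(train_groups))
```

Here `fits_full_data` is true while `fits_train_split` is false: the sizes computed on the original data no longer line up with the rows that ended up in the training split, and the test split receives no group information at all.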