2.1.0-SNAPSHOT GPU jars train FAILED on Check failed: device_ == device (0 vs. 3)
#9510
Comments
cc @wbo4958
@wbo4958 Could you please help add some new tests when you are available?
In addition, if the Spark cluster implementation doesn't set the correct device ordinal, we might need to fix the pyspark predict function as well. Edit: I think (Py)Spark doesn't support GPU-based inference; we should just set the device ordinal to CPU. Note: unless a user hacks into the internals of pyspark-xgboost, the device ordinal of the booster returned from training is always 0. This is not optimal since, on a multi-GPU machine, we might create multiple …
That's true, xgboost-pyspark doesn't support GPU prediction, but xgboost-jvm does support GPU prediction. It seems something is wrong there.
Then we need to make sure the booster is running on the CPU when transform is called.
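A minimal sketch of what that could look like from the Python side, assuming the `xgboost.spark` estimator API and that the trained model exposes its `Booster` via `get_booster()`; the `device` parameter spelling follows XGBoost 2.x conventions, and this is an illustration rather than the confirmed fix:

```python
# Hypothetical sketch: reset the trained booster to the CPU before transform.
# train_df / test_df are placeholder Spark DataFrames with a "features" column.
from xgboost.spark import SparkXGBClassifier

classifier = SparkXGBClassifier(device="cuda", num_workers=4)
model = classifier.fit(train_df)

# The booster coming back from GPU training may still carry a CUDA device
# ordinal; resetting it to CPU sidesteps the device_ == device check during
# inference, since pyspark-xgboost only does CPU-based prediction anyway.
booster = model.get_booster()
booster.set_param({"device": "cpu"})

predictions = model.transform(test_df)
```

One caveat with this sketch: mutating the driver-side booster object may not automatically propagate to the serialized model shipped to executors, so the real fix likely belongs inside the library's predict path rather than in user code.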
Looks like this issue is not reproducible with the latest GPU SNAPSHOT jars over the weekend. Let's observe it for one or two more days; if it still doesn't repro, I'll close the issue, thanks! cc @wbo4958
Uploaded driver and executor logs.
Will close the issue if it's no longer reproducible.
Feel free to re-open if there's new information.
Using 2.1.0-SNAPSHOT GPU jars, train failed on:

`src/common/host_device_vector.cu:166: Check failed: device_ == device (0 vs. 3): New device ordinal is different from previous one.`
Driver log:
driver-log.txt
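For context, this check fires when a buffer allocated under one CUDA ordinal is later accessed under a different one. A rough single-process illustration of the same mismatch follows; the 4-GPU machine and the exact trigger path are assumptions (the Spark failure goes through the JVM bindings rather than this Python path), so treat it as a sketch of the failure shape, not a verified reproducer:

```python
# Hypothetical illustration of the ordinal mismatch, assuming a machine
# with at least 4 GPUs: data cached under cuda:0 is later touched under
# cuda:3, which is the shape of the failed check (0 vs. 3).
import numpy as np
import xgboost as xgb

X = np.random.rand(256, 8)
y = np.random.randint(0, 2, size=256)

booster = xgb.train(
    {"tree_method": "hist", "device": "cuda:0"},  # train on ordinal 0
    xgb.DMatrix(X, label=y),
    num_boost_round=10,
)

# Switching the booster to a different ordinal before prediction can trip
# "Check failed: device_ == device (0 vs. 3)" if cached vectors stay on cuda:0.
booster.set_param({"device": "cuda:3"})
booster.inplace_predict(X)
```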