[MGPU] python-gpu/test_large_sizes.py failed. #3794
Comments
@trivialfis Does #3738 solve this issue?
No, this one needs to be solved by handling the changed parameter.
@trivialfis I'm confused, don't lines 100 and 106 create the Booster object from scratch? Which objects are being shared?
Also, should I add this as a blocking issue?
@hcho3 The weight and label from DMatrix are not resharded. To fix this we need to come up with a solution for handling changing parameters, in this case the number of GPUs. I think we can solve it along with the callback issue, so a known-issue entry stating that users currently can't change the number of GPUs in between two training/predict sessions might be more appropriate than a quick fix (resharding everything).
So add an item to "known issues" for the next release, then?
Yes, please.
@trivialfis What would be the work-around for this issue? Should I delete the DMatrix object and re-load it?
Should work, but let me test it.
@hcho3 Tested. Re-creating the DMatrix, e.g. `ag_dtrain = xgb.DMatrix(X, y, nthread=40)`, in between two training sessions works.
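For reference, a minimal sketch of that work-around, assuming the 0.81-era `gpu_hist` tree method and `n_gpus` parameter; the data shapes and round counts here are illustrative, not the actual test code:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(10000, 10)
y = np.random.randint(0, 2, size=10000)

# First training session on a single GPU.
dtrain = xgb.DMatrix(X, y, nthread=40)
bst = xgb.train({'tree_method': 'gpu_hist', 'n_gpus': 1}, dtrain, num_boost_round=10)

# Work-around: drop the old DMatrix and build a fresh one before changing
# the number of GPUs, so its labels and weights are distributed from
# scratch instead of being resharded from the stale single-GPU layout.
del dtrain
dtrain = xgb.DMatrix(X, y, nthread=40)

# Second training session on all available GPUs.
bst = xgb.train({'tree_method': 'gpu_hist', 'n_gpus': -1}, dtrain, num_boost_round=10)
```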
Got it. The release note for 0.81 will contain a short description of the work-around. Thanks!
How does this work-around apply to those of us using the sklearn interface? I mention this because h2o4gpu testing hasn't failed with xgboost errors in a while, but after recent updates we get the "distribution is empty" issue for one test.
During the test, `weight_` and `labels_` inside `MetaInfo` are resharded with a different and non-empty distribution, causing the CHECK inside `reshard` to fail. The cause is pretty simple: the test first runs on a single GPU (xgboost/tests/python-gpu/test_large_sizes.py, line 100 at commit 516457f), then later runs on multiple GPUs (line 106 at the same commit). While running the multi-GPU test, the distribution of `weight_` and `labels_` remains unchanged. Hence, to make the test pass we simply have to remove one of these two tests. This issue is a reminder for me to sort out a better approach for multi-GPU.
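For context, a minimal sketch of the failing pattern the test exercises, again assuming the 0.81-era `gpu_hist` and `n_gpus` parameters (the data is illustrative, not the actual test code):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(10000, 10)
y = np.random.randint(0, 2, size=10000)
dtrain = xgb.DMatrix(X, y)

# Single-GPU session: weight_/labels_ in MetaInfo get distributed
# across one device.
xgb.train({'tree_method': 'gpu_hist', 'n_gpus': 1}, dtrain, num_boost_round=10)

# Multi-GPU session reusing the *same* DMatrix: the cached distribution
# no longer matches the new GPU set, so the CHECK inside reshard fails.
xgb.train({'tree_method': 'gpu_hist', 'n_gpus': -1}, dtrain, num_boost_round=10)
```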