
[MGPU] python-gpu/test_large_sizes.py failed. #3794

Closed
trivialfis opened this issue Oct 15, 2018 · 12 comments
@trivialfis
Member

trivialfis commented Oct 15, 2018

During the test, weight_ and labels_ inside MetaInfo are resharded with a different, non-empty distribution, causing the CHECK inside reshard to fail.

The cause is pretty simple: the test first runs on a single GPU:

eprint("gpu_hist updater 1 gpu")

Then it later runs on multiple GPUs:

eprint("gpu_hist updater all gpus")

While running the multi-GPU test, the distribution of weight_ and labels_ remains unchanged. Hence, to make the test pass we would simply have to remove one of these two tests.

This issue is a reminder for me to sort out a better approach for multi-gpu.
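
A minimal sketch of the failing pattern, for reference. The code below is illustrative, not quoted from the test; the data shape and the n_gpus values are assumptions based on the parameters available in this release. The point is that the same DMatrix is trained first on one GPU and then on all GPUs, so its MetaInfo still carries the single-GPU sharding when the second session starts.

```python
import numpy as np
import xgboost as xgb

# Illustrative data; the real test uses much larger matrices.
X = np.random.rand(10000, 10)
y = np.random.rand(10000)
dtrain = xgb.DMatrix(X, y)

# First session: gpu_hist on a single GPU; weight_/labels_ in the
# DMatrix's MetaInfo get sharded for one device.
xgb.train({'tree_method': 'gpu_hist', 'n_gpus': 1}, dtrain, num_boost_round=10)

# Second session: gpu_hist on all GPUs, reusing the same DMatrix. Its
# MetaInfo still holds the single-GPU distribution, so the CHECK in
# reshard fails here.
xgb.train({'tree_method': 'gpu_hist', 'n_gpus': -1}, dtrain, num_boost_round=10)
```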

@hcho3
Collaborator

hcho3 commented Oct 24, 2018

@trivialfis Does #3738 solve this issue?

@trivialfis
Member Author

No, this one needs to be solved by handling the changed parameter.

@hcho3
Collaborator

hcho3 commented Oct 24, 2018

@trivialfis I'm confused; don't lines 100 and 106 create the Booster object from scratch? Which objects are being shared?

@hcho3
Collaborator

hcho3 commented Oct 24, 2018

Also, should I add this as a blocking issue?

@trivialfis
Member Author

@hcho3 The weight and label from the DMatrix are not resharded. To fix this we need to come up with a solution for handling changing parameters, in this case the number of GPUs. I think we can solve it along with the callback issue, so a known-bug note stating that users currently can't change the number of GPUs between two training/predict sessions might be more appropriate than a quick fix (reshard everything).

@hcho3
Collaborator

hcho3 commented Oct 24, 2018

So add an item to "known issues" for the next release, then?

@trivialfis
Member Author

Yes, please.

@hcho3
Collaborator

hcho3 commented Oct 25, 2018

@trivialfis What would be the work-around for this issue? Should I delete the DMatrix object and re-load it?

@trivialfis
Member Author

Should work, but let me test it.

@trivialfis
Member Author

@hcho3 Tested.
Inserting:

            ag_dtrain = xgb.DMatrix(X, y, nthread=40)

in between the two training sessions in python-gpu/test_large_sizes.py will let the test pass. The above line essentially recreates the DMatrix. I will focus on #3825 and #3795 next. After that we can try to find a solution to these problems. Feel free to let me know if you have any suggestions. :)
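
In context, the workaround looks roughly like the sketch below. Only the DMatrix recreation line above is from the test; the data and the surrounding training calls are illustrative.

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(10000, 10)
y = np.random.rand(10000)

# Single-GPU session.
ag_dtrain = xgb.DMatrix(X, y, nthread=40)
xgb.train({'tree_method': 'gpu_hist', 'n_gpus': 1}, ag_dtrain, num_boost_round=10)

# Recreate the DMatrix before the multi-GPU session so its MetaInfo starts
# from a fresh, unsharded state instead of the single-GPU distribution.
ag_dtrain = xgb.DMatrix(X, y, nthread=40)
xgb.train({'tree_method': 'gpu_hist', 'n_gpus': -1}, ag_dtrain, num_boost_round=10)
```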

@hcho3
Collaborator

hcho3 commented Oct 25, 2018

Got it. The release note for 0.81 will contain a short description of the work-around. Thanks!

@pseudotensor
Contributor

How does this work-around apply to those of us using the sklearn interface?

I mention this because h2o4gpu testing hasn't failed with xgboost errors in a while, but after recent updates we get the "distribution is empty" issue for one test.
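
A sketch of what the pattern looks like through the sklearn wrapper (illustrative code, not the actual h2o4gpu test). Since the wrapper builds its own DMatrix inside each fit call, it isn't obvious where the recreate-the-DMatrix workaround would be applied:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(10000, 10)
y = np.random.rand(10000)

# Single-GPU fit via the sklearn wrapper; extra booster parameters such as
# tree_method and n_gpus are passed through **kwargs.
model = xgb.XGBRegressor(tree_method='gpu_hist', n_gpus=1)
model.fit(X, y)

# A later fit with all GPUs. The wrapper constructs a fresh DMatrix inside
# fit(), so there is no user-held DMatrix to recreate by hand.
model_all = xgb.XGBRegressor(tree_method='gpu_hist', n_gpus=-1)
model_all.fit(X, y)
```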

lock bot locked as resolved and limited conversation to collaborators on Dec 16, 2019