
[MGPU] python-gpu/test_large_sizes.py failed. #3794

Closed
trivialfis opened this issue Oct 15, 2018 · 12 comments
@trivialfis
Member

trivialfis commented Oct 15, 2018

During the test, weight_ and labels_ inside MetaInfo are resharded with a different, non-empty distribution, causing the CHECK inside reshard to fail.

The cause is pretty simple: the test first runs on a single GPU:

eprint("gpu_hist updater 1 gpu")

Then it later runs on multiple GPUs:

eprint("gpu_hist updater all gpus")

While running the multi-GPU test, the distribution of weight_ and labels_ remains unchanged. Hence, to make the test pass we would simply have to remove one of these two tests.

This issue is a reminder for me to sort out a better approach for multi-gpu.
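
A minimal sketch of the failing pattern, for reference. The code below is illustrative, not quoted from the test; the data shape and the n_gpus values are assumptions based on the parameters available in this release. The point is that the same DMatrix is trained first on one GPU and then on all GPUs, so its MetaInfo still carries the single-GPU sharding when the second session starts.

```python
import numpy as np
import xgboost as xgb

# Illustrative data; the real test uses much larger matrices.
X = np.random.rand(10000, 10)
y = np.random.rand(10000)
dtrain = xgb.DMatrix(X, y)

# First session: gpu_hist on a single GPU; weight_/labels_ in the
# DMatrix's MetaInfo get sharded for one device.
xgb.train({'tree_method': 'gpu_hist', 'n_gpus': 1}, dtrain, num_boost_round=10)

# Second session: gpu_hist on all GPUs, reusing the same DMatrix. Its
# MetaInfo still holds the single-GPU distribution, so the CHECK in
# reshard fails here.
xgb.train({'tree_method': 'gpu_hist', 'n_gpus': -1}, dtrain, num_boost_round=10)
```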

@hcho3
Collaborator

hcho3 commented Oct 24, 2018

@trivialfis Does #3738 solve this issue?

@trivialfis
Member Author

No, this one needs to be solved by handling the changed parameter.

@hcho3
Collaborator

hcho3 commented Oct 24, 2018

@trivialfis I'm confused; don't lines 100 and 106 create the Booster object from scratch? Which objects are being shared?

@hcho3
Collaborator

hcho3 commented Oct 24, 2018

Also, should I add this as a blocking issue?

@trivialfis
Member Author

@hcho3 The weight and label from the DMatrix are not resharded. To fix this we need to come up with a solution for handling changing parameters, in this case the number of GPUs. I think we can solve it along with the callback issue, so a known-bug note stating that users currently can't change the number of GPUs between two training/predict sessions might be more appropriate than a quick fix (reshard everything).

@hcho3
Collaborator

hcho3 commented Oct 24, 2018

So add an item to "known issues" for the next release, then?

@trivialfis
Member Author

Yes, please.

@hcho3
Collaborator

hcho3 commented Oct 25, 2018

@trivialfis What would be the work-around for this issue? Should I delete the DMatrix object and re-load it?

@trivialfis
Member Author

Should work, but let me test it.

@trivialfis
Member Author

@hcho3 Tested.
Inserting:

            ag_dtrain = xgb.DMatrix(X, y, nthread=40)

in between the two training sessions in python-gpu/test_large_sizes.py will let the test pass. The above line essentially recreates the DMatrix. I will focus on #3825 and #3795 next. After that we can try to find a solution to these problems. Feel free to let me know if you have any suggestions. :)
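
In context, the workaround looks roughly like the sketch below. Only the DMatrix recreation line above is from the test; the data and the surrounding training calls are illustrative.

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(10000, 10)
y = np.random.rand(10000)

# Single-GPU session.
ag_dtrain = xgb.DMatrix(X, y, nthread=40)
xgb.train({'tree_method': 'gpu_hist', 'n_gpus': 1}, ag_dtrain, num_boost_round=10)

# Recreate the DMatrix before the multi-GPU session so its MetaInfo starts
# from a fresh, unsharded state instead of the single-GPU distribution.
ag_dtrain = xgb.DMatrix(X, y, nthread=40)
xgb.train({'tree_method': 'gpu_hist', 'n_gpus': -1}, ag_dtrain, num_boost_round=10)
```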

@hcho3
Collaborator

hcho3 commented Oct 25, 2018

Got it. The release note for 0.81 will contain a short description of the work-around. Thanks!

@pseudotensor
Contributor

How does this work-around apply to those of us using the sklearn interface?

I mention this because h2o4gpu testing hasn't failed with xgboost errors in a while, but after recent updates we get the "distribution is empty" issue for one test.
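
A sketch of what the pattern looks like through the sklearn wrapper (illustrative code, not the actual h2o4gpu test). Since the wrapper builds its own DMatrix inside each fit call, it isn't obvious where the recreate-the-DMatrix workaround would be applied:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(10000, 10)
y = np.random.rand(10000)

# Single-GPU fit via the sklearn wrapper; extra booster parameters such as
# tree_method and n_gpus are passed through **kwargs.
model = xgb.XGBRegressor(tree_method='gpu_hist', n_gpus=1)
model.fit(X, y)

# A later fit with all GPUs. The wrapper constructs a fresh DMatrix inside
# fit(), so there is no user-held DMatrix to recreate by hand.
model_all = xgb.XGBRegressor(tree_method='gpu_hist', n_gpus=-1)
model_all.fit(X, y)
```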

lock bot locked as resolved and limited conversation to collaborators on Dec 16, 2019