
Dask XGBoost hangs during training with multiple GPU workers #6649

Closed
elaineejiang opened this issue Jan 27, 2021 · 6 comments · Fixed by #6743

@elaineejiang

Hi, I am using XGBoost (v1.1.1) with Dask (v2020.12.0). I have a Dask cluster that connects to remote GPU workers via Kubernetes (v1.14). I've noticed that if I train on multiple GPU workers, the dispatched_train tasks hang on Rabit initialization. For example, this is what the call stack looks like in one of the workers:

Key: dispatched_train-60c6f36b-c5df-4871-9ff1-bf5fc547ccdf-0
File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 890, in _bootstrap self._bootstrap_inner()

File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run()

File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs)

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/threadpoolexecutor.py", line 55, in _worker task.run()

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/_concurrent_futures_thread.py", line 65, in run result = self.fn(*self.args, **self.kwargs)

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/worker.py", line 3425, in apply_function result = function(*args, **kwargs)

File "[...]/ext/public/python/xgboost/1/1/1/dist/lib/python3.7/xgboost/dask.py", line 418, in dispatched_train with RabitContext(rabit_args):

File "[...]/ext/public/python/xgboost/1/1/1/dist/lib/python3.7/xgboost/dask.py", line 82, in __enter__ rabit.init(self.args)

File "[...]/ext/public/python/xgboost/1/1/1/dist/lib/python3.7/xgboost/rabit.py", line 27, in init _LIB.RabitInit(len(arr), arr)

Here is a reproducible example:

# Prereq: Set up Dask cluster and client with >1 GPU workers

import pandas as pd
import numpy as np
import xgboost as xgb
import dask.dataframe as dd

seed = 42
random_state = np.random.RandomState(seed)

# 1000 rows of random data: three feature columns and one target column
df = pd.DataFrame(random_state.random_sample((1000, 4)), columns=['x1', 'x2', 'x3', 'y'])
xcols = ['x1', 'x2', 'x3']
ycol = ['y']

# Convert to Dask collections with 4 partitions each
X, y = (dd.from_pandas(df[xcols], npartitions=4), dd.from_pandas(df[ycol], npartitions=4))

learner = xgb.dask.DaskXGBRegressor(objective='reg:squarederror',
                                    n_estimators=16,
                                    max_depth=8,
                                    learning_rate=0.1,
                                    verbosity=3,
                                    tree_method='gpu_hist')

learner.fit(X, y)
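
(The cluster prerequisite in the first comment line is left abstract on purpose; as one hedged example, a local multi-GPU cluster could be stood up with dask_cuda for reproduction. This is an assumption for local testing, not the remote Kubernetes setup that triggered the hang.)

# Sketch of the "set up Dask cluster and client" prerequisite using dask_cuda.
# Assumes a machine with at least two GPUs; the original report used remote
# GPU workers on a Kubernetes cluster instead.
from dask_cuda import LocalCUDACluster
from distributed import Client

cluster = LocalCUDACluster(n_workers=2)  # one Dask worker per GPU
client = Client(cluster)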

I've noticed this happening with multiple CPU workers as well, although less frequently. The issue could be related to #6604 and #6469, but I tried the patch provided in #6469 and the workers still hung during training. Any ideas on how to resolve this?

@trivialfis
Member

Could you please try 1.3.3?
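
(When upgrading, it is also worth confirming that every Dask worker picks up the new build as well as the client; a minimal sketch using Client.run, assuming `client` is the connected Dask client:)

import xgboost as xgb

def worker_xgb_version():
    # Runs on each worker and reports the XGBoost version it imports.
    import xgboost
    return xgboost.__version__

print("client:", xgb.__version__)
print("workers:", client.run(worker_xgb_version))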

@elaineejiang
Author

@trivialfis Thanks for the quick response! I'm trying to build 1.3.3, but I keep hitting this error:

[...]/xgboost/1/3/3/build/python3.7/include/xgboost/parameter.h(92): error: no instance of function template "dmlc::Parameter<PType>::UpdateAllowUnknown [with PType=xgboost::tree::TrainParam]" matches the argument list
            argument types are: (const xgboost::Args, __nv_bool *)
          detected during:
            instantiation of "xgboost::Args xgboost::XGBoostParameter<Type>::UpdateAllowUnknown(const Container &, __nv_bool *) [with Type=xgboost::tree::TrainParam, Container=xgboost::Args]" 

@hcho3
Collaborator

hcho3 commented Jan 28, 2021

Try running git submodule update --init --recursive.

@elaineejiang
Author

Thanks @hcho3! I was able to build 1.3.3, but I'm still seeing the workers hang on Rabit initialization:

File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 890, in _bootstrap self._bootstrap_inner()

File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run()

File "[...]/ext/public/python/3/7/x/dist/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs)

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/threadpoolexecutor.py", line 55, in _worker task.run()

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/_concurrent_futures_thread.py", line 65, in run result = self.fn(*self.args, **self.kwargs)

File "[...]/ext/public/python/distributed/2020/12/0/dist/lib/python3.7/distributed/worker.py", line 3425, in apply_function result = function(*args, **kwargs)

File "[...]/ext/public/python/xgboost/1/3/3/dist/lib/python3.7/xgboost/dask.py", line 648, in dispatched_train with RabitContext(rabit_args):

File "[...]/ext/public/python/xgboost/1/3/3/dist/lib/python3.7/xgboost/dask.py", line 106, in __enter__ rabit.init(self.args)

File "[...]/ext/public/python/xgboost/1/3/3/dist/lib/python3.7/xgboost/rabit.py", line 27, in init _LIB.RabitInit(len(arr), arr)

@elaineejiang
Author

I was looking for a workaround and saw that the tests for the Dask API always contain these lines:

# Always call this before using distributed module
xgb.rabit.init()
rank = xgb.rabit.get_rank()
world = xgb.rabit.get_world_size()

(from https://github.com/dmlc/xgboost/blob/master/tests/distributed/distributed_gpu.py)
I added a call to rabit.init() at the top of dispatched_train (https://github.com/dmlc/xgboost/blob/v1.3.3/python-package/xgboost/dask.py#L646), and now multi-GPU training works sometimes. I wanted to share this update in case it gives more insight into possible solutions, cc: @hcho3 @trivialfis. Much appreciated!
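
(For concreteness, a rough sketch of that workaround; the function signature and surrounding body are placeholders inferred from the v1.3.3 stack trace above, not the exact upstream code.)

# Hypothetical sketch of the workaround: call rabit.init() before entering
# RabitContext inside dispatched_train (xgboost/dask.py in v1.3.3).
from xgboost import rabit
from xgboost.dask import RabitContext

def dispatched_train(worker_addr, rabit_args, *args):  # real signature elided
    rabit.init()                      # workaround: initialize Rabit eagerly
    with RabitContext(rabit_args):    # original context manager from the stack trace
        ...                           # training proceeds as before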

@trivialfis
Member

Opened an issue in dask/distributed#4485
