Fix Dask XGBoost hanging on rabit initialization during multi-node, multi-GPU training #6677
Conversation
Did you figure out why the worker didn't receive data?
```diff
@@ -818,6 +818,8 @@ def dispatched_train(
     '''
     LOGGER.debug('Training on %s', str(worker_addr))
+    # Initialize rabit without workers first
+    rabit.init()
```
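For context on what the new call does, here is a minimal sketch using xgboost's Python rabit wrapper; my assumption (not stated in this PR) is that initializing rabit without DMLC tracker arguments leaves the process in standalone mode with a world size of 1.

```python
# Hedged sketch: bare rabit initialization with no tracker arguments.
import xgboost as xgb

xgb.rabit.init()                    # no DMLC_TRACKER_* arguments supplied
print(xgb.rabit.get_rank())         # expected: 0
print(xgb.rabit.get_world_size())   # expected: 1, since no peers are discovered
xgb.rabit.finalize()
```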
Please let me know if it makes sense.
I don't quite understand why this helps.
```diff
@@ -839,6 +841,11 @@ def dispatched_train(
                LOGGER.info(msg)
            else:
                local_param[p] = worker.nthreads
+
+    # If worker did not receive input data, return without failing
+    if local_dtrain.num_row() == 0:
```
You should let the training continue; otherwise the cluster can hang waiting for this exited worker during synchronization.
If I let training continue when the worker did not receive input data, I get this error from learner.cc: `Check failed: mparam_.num_feature != 0 (0 vs. 0) : 0 feature is supplied. Are you using raw Booster interface?`
@elaineejiang Thanks for the explanation. In that case, the synchronization between DMatrix instances is failing. During construction, the DMatrix on each worker synchronizes the number of columns in the input data; see xgboost/src/data/simple_dmatrix.cc, line 159 at 72892cc:

```cpp
rabit::Allreduce<rabit::op::Max>(&info_.num_col_, 1);
```
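To make the synchronization requirement concrete, here is a small illustrative sketch (plain Python, not xgboost internals; the helper name is hypothetical) of what the Allreduce over `num_col_` computes and why every worker has to participate.

```python
# Illustration only: each worker contributes its local column count and the
# collective returns the maximum across all workers.
def allreduce_max(values):
    """Stand-in for rabit::Allreduce<op::Max>; every worker must call it."""
    return max(values)

# Three workers received data partitions with 64 columns, one received nothing.
local_num_cols = [64, 64, 64, 0]
global_num_cols = allreduce_max(local_num_cols)
assert global_num_cols == 64

# If the empty worker returns early and never enters the collective, the
# remaining workers block inside it and the cluster hangs, which is why the
# suggestion above is to let training continue on that worker.
```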
I see, thanks @trivialfis. Just to be clear, I'm only seeing this error when I try to run multiple XGBoost learners in parallel via

```python
dask_client.submit(xgb.dask.DaskXGBRegressor([...]).fit(dd, xcols, ycol, wcol))
```

Is this something that's not recommended or not supported?
I think the correct version should be something similar to:

```python
classifier_future = client.submit(classifier.fit, X, y, sample_weight=w, eval_set=[(X, y)])
```

This should work.
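For readers following along, a self-contained version of that pattern might look like the sketch below. The local cluster, synthetic data, and estimator parameters are my own assumptions rather than code from this thread, and as the rest of the discussion shows, the pattern can still hit the synchronization problems described here.

```python
# Hedged sketch: submit the estimator's fit method to the cluster as a future.
import dask.array as da
import xgboost as xgb
from distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    # Synthetic data as dask collections (illustrative only).
    X = da.random.random((1000, 10), chunks=(100, 10))
    y = da.random.randint(0, 2, size=(1000,), chunks=(100,))
    w = da.random.random((1000,), chunks=(100,))

    classifier = xgb.dask.DaskXGBClassifier(n_estimators=10)
    classifier_future = client.submit(
        classifier.fit, X, y, sample_weight=w, eval_set=[(X, y)]
    )
    fitted = classifier_future.result()   # a fitted DaskXGBClassifier
```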
Ah, so sorry -- I typed that out on the fly and got the syntax wrong. I am seeing `Check failed: mparam_.num_feature != 0 (0 vs. 0) : 0 feature is supplied. Are you using raw Booster interface?` with your version. If that's supposed to work, I guess it's just my network issue?
To me, the issue you encountered seems to be a network issue. That's quite problematic, as each cluster has its own quirks, and I don't know what exactly is happening on yours. If I were you, I would start by trying to figure out why the connection is refused.
Agreed that this likely has to do with my specific cluster configuration. I have seen similar issues posted (#6604) that haven't been closed yet, and was wondering if the initial call to rabit.init() could be contributing.
It shouldn't.
I think I can reproduce the issue now with setting …
Hmm, my error seems to be different: Dask is wrongly identifying data as being created by another client. It doesn't apply here.
I see. I saw that the GPU tests (https://github.com/dmlc/xgboost/blob/master/tests/distributed/distributed_gpu.py) call the rabit initializer, which is why I decided to add it, just to see what would happen. However, I don't know why calling it helps.
It might not be working correctly, I think. A better place to look would be the Dask-specific tests. I'm looking into parallel model training; it might take some time, as I vaguely recall there's a similar issue with the Spark package. For now, please train one model at a time.
@trivialfis sounds good, thanks for all the help debugging! Here's a summary for anyone who might be following. The two issues are:

1. Rabit initialization hangs during training on multi-node, multi-GPU setups.
2. Synchronization between DMatrix instances fails, resulting in `Check failed: mparam_.num_feature != 0 (0 vs. 0) : 0 feature is supplied.`
If you are willing to patch xgboost, here is something you can try. Change the `fit` method (the body that begins with `_assert_dask_support()`) to:

```python
    _assert_dask_support()
    args = {k: v for k, v in locals().items() if k != 'self'}
    if self._client is None:  # Don't set the client yourself
        try:
            with distributed.worker_client() as client:
                self.client = client
                return self.client.sync(self._fit_async, **args)
        except ValueError:
            pass
    return self.client.sync(self._fit_async, **args)
```

This workaround uses the client from the worker when `fit` runs inside a submitted task:

```python
def test_parallel_submits(client: "Client"):
    from sklearn.datasets import load_digits

    futures = []
    for i in range(10):
        X_, y_ = load_digits(return_X_y=True)
        X_ += 1.0
        X = client.submit(dd.from_array, X_, chunksize=32)
        y = client.submit(dd.from_array, y_, chunksize=32)
        cls = xgb.dask.DaskXGBClassifier(
            verbosity=1, n_estimators=30, eval_metric="merror"
        )
        f = client.submit(cls.fit, X, y, pure=False)
        futures.append(f)

    classifiers = client.gather(futures)
    assert len(classifiers) == 10
    for cls in classifiers:
        assert cls.get_booster().num_boosted_rounds() == 30
```
Or equivalently, you can set the client and launch training in a local function, as sketched below:
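Here is a hedged sketch of what that "local function" approach might look like; the function name and parameters are illustrative, not code from this thread.

```python
# Sketch: obtain the hosting worker's client inside the task and hand it to
# the estimator before training.
import xgboost as xgb
from distributed import worker_client


def train_on_worker(X, y):
    # Runs inside a Dask task submitted with client.submit(...).
    with worker_client() as client:
        clf = xgb.dask.DaskXGBClassifier(n_estimators=30)
        clf.client = client          # set the client explicitly
        clf.fit(X, y)
        return clf.get_booster()


# Usage, where X and y are dask collections and `client` is the cluster client:
# booster_future = client.submit(train_on_worker, X, y, pure=False)
```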
I've tried the latter example (getting the worker client before launching), but I still hit the same issue.
Thanks for the reply. I think for now one will have to train one model at a time. XGBoost relies on an MPI-like communication framework; if one of the workers is scheduled behind the others, training will hang.
Thanks @trivialfis for looking into this. Would it be possible to add this request to the roadmap?
I will put it there, but I'm not promising it will be delivered. It seems to be a lot of work.
Thanks @trivialfis! Is there an expected release deadline for 1.4.0?
This helped me resolve the issue in #6649. Please let me know if it makes sense.