Continue #6112
RABIT parameter handling
On the current master branch, the RABIT timeout handling test passes only because of an unrelated error:
This exception has nothing to do with the RABIT timeout. Once `rabit_robust` is removed, this assertion will no longer exist and the test will fail, since it is meant to test the timeout facility. I dug into it and found that the parameter "rabit_timeout" is not being passed down into rabit at all. Tracing back the line changes, I found that #5082 removed these lines in `xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala`:
So currently none of the rabit parameters are handled correctly. This PR adds back these 2 lines.
Timeout
The old timeout implementation used an async worker thread to throw the exception, which is incorrect, since exception handling lives in the main thread and an exception thrown on another thread never reaches it. Timeouts are not trivial to implement in a real application: killing a thread is difficult because it can corrupt any state shared with the main thread. In this PR I pass the timeout interval to `poll`, which serves as a stop gate.
Finalization
`rabit_robust` had a pseudo checkpoint operation during shutdown, which formed a global barrier. The barrier is now removed along with `rabit_robust`.
Socket exception
Previously, allreduce checked for out-of-band (OOB) data on the socket; that check is now removed. In most applications, OOB data can be safely ignored.
Other Subtleties
I admit I don't fully understand how those tests were passing before this PR. For example, if there's a sync at shutdown, some Spark failure tests should hang, but they didn't. I suspect the cause is similar to the one described in the section "RABIT parameter handling".
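For reference, the idea from the Timeout section (pass the interval to `poll` and raise the error in the calling thread) can be sketched roughly as follows. This is a minimal illustration assuming POSIX `poll(2)`, not the code in this PR: `WaitReadable` and `RecvOrThrow` are hypothetical names that do not exist in rabit, and the EINTR handling is deliberately simplified.

```cpp
#include <poll.h>
#include <unistd.h>

#include <cerrno>
#include <chrono>
#include <stdexcept>
#include <string>

// Hypothetical helper (not rabit's API): block until `fd` is readable or
// `timeout` elapses. A negative timeout waits indefinitely, as in poll(2).
// Returns true when poll reports an event on fd, false on timeout.
inline bool WaitReadable(int fd, std::chrono::milliseconds timeout) {
  pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;
  for (;;) {
    int ret = poll(&pfd, 1, static_cast<int>(timeout.count()));
    if (ret > 0) return true;   // an event is pending on fd
    if (ret == 0) return false; // the timeout expired first
    if (errno != EINTR) throw std::runtime_error("poll failed");
    // Interrupted by a signal: retry. Simplification: this restarts the
    // full interval instead of subtracting the time already waited.
  }
}

// Hypothetical receive step: the timeout surfaces as an exception thrown in
// the calling thread, so the caller's own try/catch can handle it.
inline std::string RecvOrThrow(int fd, std::chrono::milliseconds timeout) {
  if (!WaitReadable(fd, timeout)) {
    throw std::runtime_error("timed out waiting for peer");
  }
  char buf[256];
  ssize_t n = read(fd, buf, sizeof(buf));
  if (n < 0) throw std::runtime_error("read failed");
  return std::string(buf, static_cast<size_t>(n));
}
```

Because the wait and the throw happen on the same thread, no thread has to be killed and no shared state is left half-updated; the blocking call simply returns once the interval expires.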