-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Drop single point model recovery. #6112
Conversation
1dc2d2d
to
b9d03be
Compare
RABIT parameter handlingCurrently on master branch, the rabit time out handling test is only passing because of an unrelated error:
This exception has nothing to do with RABIT timeout. Once
so currently non of the rabit parameter is handled correctly. This PR adds back these 2 lines. TimeoutThe old implementation of timeout used an async worker thread to throw the exception, which is incorrect as exception is caught by main thread. Timeout is not trivial to implement in real application. Killing a thread is difficult as it violates all the states shared with main thread. In this PR I pass the timeout interval to FinalizationThe Other SubtletiesI admit I don't fully understand how are those tests passing before the PR. For example, if there's a sync at shutdown, some spark failure tests should hang but they didn't. I suspect it's due to similar reason described in section |
@CodingCat @hcho3 @wbo4958 @RAMitchell Please help taking a look. |
ea455ad
to
b12174d
Compare
* Pass rabit params. Currently on master branch, the test is only passing because of an unrelated error: ``` AssertError:buf should have data from resbuf, rabit is configured to keep process running ``` This exception has nothing to do with RABIT timeout. Once `rabit_robust` is removed, this assertion will no longer exist and the test will fail as it's testing timeout facility. I dig into it and found the parameter "rabit_timeout" is not being passed down into rabit at all. Tracing the line changes, I found that dmlc#5082 has removed these lines in `xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala`: ``` val xgbRabitParams = xgbParamsFactory.buildRabitParams.asJava ... tracker.getWorkerEnvs().putAll(xgbRabitParams) ``` so currently non of the rabit parameter is correctly handled.
This reverts commit b12174d.
a5ba56f
to
30a1b0e
Compare
Closing to avoid spamming. |
No description provided.