Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop single point model recovery #6262

Merged
merged 2 commits into from
Oct 21, 2020
Merged

Conversation

trivialfis
Copy link
Member

@trivialfis trivialfis commented Oct 20, 2020

Continue #6112

RABIT parameter handling

Currently on master branch, the rabit time out handling test is only passing because of an unrelated error:

AssertError:buf should have data from resbuf, rabit is configured to keep process running

This exception has nothing to do with RABIT timeout. Once rabit_robust is removed, this
assertion will no longer exist and the test will fail as it's testing timeout facility. I
dug into it and found the parameter "rabit_timeout" is not being passed down into rabit at
all. Tracing back the line changes, I found that #5082 has removed these lines in
xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala:

    val xgbRabitParams = xgbParamsFactory.buildRabitParams.asJava
    ...
            tracker.getWorkerEnvs().putAll(xgbRabitParams)

so currently non of the rabit parameter is handled correctly. This PR adds back these 2 lines.

Timeout

The old implementation of timeout used an async worker thread to throw the exception, which is incorrect as exception is caught by main thread. Timeout is not trivial to implement in real application. Killing a thread is difficult as it violates all the states shared with main thread. In this PR I pass the timeout interval to poll, which serves as a stop gate.

Finalization

The rabit_robust had a pseudo checkpoint operation during shutdown, which formed a global barrier. The barrier is now removed along with rabit_robust.

Socket exception

Previously allreduce checks for out of band data on socket, which is now removed. In most of the applications, OOB data can be safely ignored.

Other Subtleties

I admit I don't fully understand how are those tests passing before the PR. For example, if there's a sync at shutdown, some spark failure tests should hang but they didn't. I suspect it's due to similar reason described in section RABIT parameter handling.

@trivialfis
Copy link
Member Author

cc @chenqin

tests/travis/run_test.sh Outdated Show resolved Hide resolved
* Pass rabit params in JVM package.
* Implement timeout using poll timeout parameter.
* Remove OOB data check.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants