Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Drop single point model recovery. #6112

Closed
wants to merge 16 commits into from

Conversation

trivialfis
Copy link
Member

No description provided.

@trivialfis
Copy link
Member Author

trivialfis commented Oct 18, 2020

RABIT parameter handling

Currently on master branch, the rabit time out handling test is only passing because of an unrelated error:

AssertError:buf should have data from resbuf, rabit is configured to keep process running

This exception has nothing to do with RABIT timeout. Once rabit_robust is removed, this
assertion will no longer exist and the test will fail as it's testing timeout facility. I
dug into it and found the parameter "rabit_timeout" is not being passed down into rabit at
all. Tracing back the line changes, I found that #5082 has removed these lines in
xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala:

    val xgbRabitParams = xgbParamsFactory.buildRabitParams.asJava
    ...
            tracker.getWorkerEnvs().putAll(xgbRabitParams)

so currently non of the rabit parameter is handled correctly. This PR adds back these 2 lines.

Timeout

The old implementation of timeout used an async worker thread to throw the exception, which is incorrect as exception is caught by main thread. Timeout is not trivial to implement in real application. Killing a thread is difficult as it violates all the states shared with main thread. In this PR I pass the timeout interval to poll, which serves as a stop gate.

Finalization

The rabit_robust had a pseudo checkpoint operation during shutdown, which formed a global barrier. The barrier is now removed along with rabit_robust.

Other Subtleties

I admit I don't fully understand how are those tests passing before the PR. For example, if there's a sync at shutdown, some spark failure tests should hang but they didn't. I suspect it's due to similar reason described in section RABIT parameter handling.

@trivialfis
Copy link
Member Author

@CodingCat @hcho3 @wbo4958 @RAMitchell Please help taking a look.

* Pass rabit params.

Currently on master branch, the test is only passing because of an unrelated error:

```
AssertError:buf should have data from resbuf, rabit is configured to keep process running
```

This exception has nothing to do with RABIT timeout.  Once `rabit_robust` is removed, this
assertion will no longer exist and the test will fail as it's testing timeout facility.  I
dig into it and found the parameter "rabit_timeout" is not being passed down into rabit at
all.  Tracing the line changes, I found that dmlc#5082 has removed these lines in
`xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala`:

```
    val xgbRabitParams = xgbParamsFactory.buildRabitParams.asJava
    ...
            tracker.getWorkerEnvs().putAll(xgbRabitParams)
```
so currently non of the rabit parameter is correctly handled.
@trivialfis
Copy link
Member Author

Closing to avoid spamming.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant