Exit when an allreduce/broadcast error causes a timeout #112
split flag config from interval config
@trivialfis can you help review this PR? :)
I haven't really looked into the details yet, but your work is well worth documenting. Could you please start writing user documentation?
Updated the guide doc.
Pretty good overall. Please see the inline questions.
@chenqin Big thanks for your good work.
More details can be found in #105.
Launch an async task in the background when the rabit cluster suffers a collective operation failure (socket connection error, failure of another host, etc.). If the cluster recovers within the timeout threshold, the async task returns; otherwise it triggers the timeout and exits the process.
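The timeout behavior described above can be illustrated with a rough sketch. This is not rabit's actual C++ implementation: `RecoveryWatchdog`, `mark_recovered`, and the `on_timeout` callback are hypothetical names chosen for illustration, and Python is used only to keep the example short. It assumes the caller starts the watchdog when a collective operation fails and signals recovery once the cluster is healthy again.

```python
import threading


class RecoveryWatchdog:
    """Hypothetical helper (not part of rabit's API).

    Waits in a background thread for a recovery signal and calls
    on_timeout if the signal does not arrive within timeout_sec.
    """

    def __init__(self, timeout_sec, on_timeout):
        self._timeout_sec = timeout_sec
        self._on_timeout = on_timeout
        self._recovered = threading.Event()
        self._thread = None

    def start(self):
        """Call when a collective operation (allreduce/broadcast) fails."""
        self._recovered.clear()
        self._thread = threading.Thread(target=self._wait, daemon=True)
        self._thread.start()

    def mark_recovered(self):
        """Call once the cluster has recovered; lets the task return."""
        self._recovered.set()

    def join(self):
        """Block until the background task has finished."""
        if self._thread is not None:
            self._thread.join()

    def _wait(self):
        # Event.wait returns True if the event was set before the
        # timeout elapsed, and False if the timeout expired first.
        if not self._recovered.wait(self._timeout_sec):
            self._on_timeout()
```

In a real worker, `on_timeout` would presumably terminate the process (for example `lambda: os._exit(1)`), while a test can pass a callback that merely records the event. Injecting the callback keeps the exit policy separate from the waiting logic.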
@hcho3 @trivialfis