Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exit when allreduce/broadcast error cause timeout #112

Merged
merged 9 commits into from
Oct 11, 2019
Merged

exit when allreduce/broadcast error cause timeout #112

merged 9 commits into from
Oct 11, 2019

Conversation

chenqin
Copy link
Contributor

@chenqin chenqin commented Oct 2, 2019

More detail can be found in #105

launch async task in background when rabit cluster suffer collective operation failure (socket connection error, other host failure etc). If cluster restore within timeout threshold, async task will return; otherwise it trigger timeout and exit process.

  • support rabit_timeout for user to opt-in, by default this feature is closed.
  • extensive unit tests with error mocking, run through integration test
  • speed up travis ci tests by removing duplicated doc/lint on osx
    @hcho3 @trivialfis

@chenqin

This comment has been minimized.

@chenqin chenqin changed the title [WIP] implement timeout side task implement rabit timeout side thread Oct 4, 2019
@chenqin
Copy link
Contributor Author

chenqin commented Oct 4, 2019

@hcho3 @trivialfis @CodingCat

@chenqin

This comment has been minimized.

@chenqin

This comment has been minimized.

split flag config from interval config
.travis.yml Show resolved Hide resolved
CMakeLists.txt Show resolved Hide resolved
@chenqin chenqin changed the title implement rabit timeout side thread exit when allreduce/broadcast error cause timeout Oct 7, 2019
@chenqin
Copy link
Contributor Author

chenqin commented Oct 8, 2019

@trivialfis can you help review this pr? :)

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't really look into the details yet. But your work is very worth documenting. Could you please start writing user documents?

scripts/travis_script.sh Outdated Show resolved Hide resolved
scripts/travis_script.sh Outdated Show resolved Hide resolved
src/allreduce_base.h Outdated Show resolved Hide resolved
@chenqin
Copy link
Contributor Author

chenqin commented Oct 9, 2019

I haven't really look into the details yet. But your work is very worth documenting. Could you please start writing user documents?

updated guide doc

src/allreduce_robust.cc Show resolved Hide resolved
src/allreduce_robust.h Show resolved Hide resolved
src/allreduce_robust.cc Show resolved Hide resolved
@trivialfis
Copy link
Member

Pretty good overall. Please see inlined questions.

@trivialfis trivialfis merged commit 5d1b613 into dmlc:master Oct 11, 2019
@trivialfis
Copy link
Member

@chenqin Big thanks for your good work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants