add timeout thread to avoid rabit hang forever #105

Closed
chenqin opened this issue Sep 14, 2019 · 1 comment

Comments

@chenqin
Contributor

chenqin commented Sep 14, 2019

We observed some rabit agents hanging forever in production jobs. This is caused by the underlying fault-tolerance assumption that a "failed worker will retry and catch up".

In cases where the scheduler has limited resources or is configured not to launch retry tasks, the failed worker never comes back and the rest of the fleet hangs forever.

Some options:

  • add a heartbeat to the tracker and shut down a worker after a timeout
  • use socket OOB messages to propagate the failure and start a countdown thread, shutting down the agent after a timeout

A heartbeat puts a periodic load on the tracker, which already shows performance issues on large clusters. This might not be the best approach moving forward.

Socket OOB seems more promising. The idea is that allreduce/broadcast/checkpoint operations return an error when connected peers report socket errors. The CheckAndRecover implementation already checks the return type, returns false on error, and resets the links. At that point we might be able to start a singleton thread that sleeps for a configurable time and then exits the program. Only when the tracker sends the worker a recover signal would we terminate that singleton timeout thread.

  /*!
   * \brief if err_type indicates an error
   *         recover links according to the error type reported
   *        if there is no error, return true
   * \param err_type the type of error happening in the system
   * \return true if err_type is kSuccess, false otherwise
   */
  bool CheckAndRecover(ReturnType err_type);
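
The timeout thread described above could be sketched roughly as follows. This is only an illustration of the idea, not rabit code: `TimeoutWatchdog`, `Start`, `Cancel`, and the `on_timeout` callback are hypothetical names. The assumption is that `Start` would be called when `CheckAndRecover` sees an error, `Cancel` when the tracker signals recovery, and that `on_timeout` would terminate the process.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical helper, not part of rabit: runs on_timeout once the timeout
// elapses, unless Cancel() is called first.
class TimeoutWatchdog {
 public:
  TimeoutWatchdog() = default;
  ~TimeoutWatchdog() { Cancel(); }
  TimeoutWatchdog(const TimeoutWatchdog&) = delete;
  TimeoutWatchdog& operator=(const TimeoutWatchdog&) = delete;

  // Starts the countdown. Returns false if a countdown thread already
  // exists (it stays registered until Cancel() is called).
  bool Start(std::chrono::milliseconds timeout,
             std::function<void()> on_timeout) {
    std::lock_guard<std::mutex> lock(mu_);
    if (worker_.joinable()) return false;
    cancelled_ = false;
    worker_ = std::thread([this, timeout, on_timeout] {
      std::unique_lock<std::mutex> lk(mu_);
      // wait_for returns the predicate value: true means cancelled.
      const bool cancelled =
          cv_.wait_for(lk, timeout, [this] { return cancelled_; });
      lk.unlock();
      if (!cancelled) on_timeout();
    });
    return true;
  }

  // Stops the countdown and joins the thread. Must not be called from
  // inside on_timeout, since a thread cannot join itself.
  void Cancel() {
    std::thread t;
    {
      std::lock_guard<std::mutex> lock(mu_);
      cancelled_ = true;
      t = std::move(worker_);
    }
    cv_.notify_all();
    if (t.joinable()) {
      assert(t.get_id() != std::this_thread::get_id());
      t.join();
    }
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool cancelled_ = false;
  std::thread worker_;
};
```

Waiting on a condition variable instead of a plain sleep lets the recover path cancel the countdown immediately rather than waiting out the remaining time. Taking the callback as a parameter keeps the sketch testable; in a real worker it would presumably log the failure and exit.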
@chenqin chenqin changed the title from "[WIP] add timeout to avoid rabit hang forever" to "add timeout thread to avoid rabit hang forever" Oct 4, 2019
@chenqin
Contributor Author

chenqin commented Oct 11, 2019

Merged to master.

@chenqin chenqin closed this as completed Oct 11, 2019