add timeout thread to avoid rabit hang forever #105

chenqin · 2019-09-14T16:02:02Z

We observed some rabit agents hanging forever in production jobs. This is caused by the assumption underlying rabit's fault tolerance: "a failed worker will retry and catch up".

In cases where the scheduler has limited resources, or is configured not to launch retry tasks, the rest of the fleet will hang forever.

Some options:

add a heartbeat to the tracker and shut down the worker after a timeout
use socket OOB messages to propagate the failure and start a countdown thread, shutting down the agent after a timeout

A heartbeat puts a periodic load on the tracker, which already shows performance issues on large clusters. This might not be the best approach moving forward.

Socket OOB seems more promising. The idea: when an allreduce/broadcast/checkpoint operation returns an error because a connected peer reported a socket error, the CheckAndRecover implementation checks the return type, finds it is not kSuccess, and resets the links. At that point we might start a singleton thread that sleeps for a configurable time and then exits the program. Only when the tracker sends the worker a recover signal would we terminate that singleton timeout thread.

  /*!
   * \brief if err_type indicates an error
   *         recover links according to the error type reported
   *        if there is no error, return true
   * \param err_type the type of error happening in the system
   * \return true if err_type is kSuccess, false otherwise
   */
  bool CheckAndRecover(ReturnType err_type);
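
To make the countdown idea concrete, here is a minimal sketch of a cancellable timeout thread in standard C++11. It is only an illustration under stated assumptions: `TimeoutWatchdog`, `Arm`, and `Cancel` are hypothetical names that do not exist in rabit, and where exactly they would be called (for example, arming when CheckAndRecover sees an error and cancelling when the tracker signals recovery) is an assumption, not the merged implementation.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper (not part of rabit): runs `on_timeout` unless
// Cancel() is called within `timeout` after Arm().
class TimeoutWatchdog {
 public:
  TimeoutWatchdog(std::chrono::milliseconds timeout,
                  std::function<void()> on_timeout)
      : timeout_(timeout), on_timeout_(std::move(on_timeout)) {
    assert(static_cast<bool>(on_timeout_));  // callback must be callable
  }

  // Make sure the background thread is joined before destruction.
  ~TimeoutWatchdog() { Cancel(); }

  TimeoutWatchdog(const TimeoutWatchdog&) = delete;
  TimeoutWatchdog& operator=(const TimeoutWatchdog&) = delete;

  // Start the countdown. No-op if a countdown thread already exists,
  // so repeated errors do not spawn more than one thread.
  void Arm() {
    std::lock_guard<std::mutex> lock(mu_);
    if (worker_.joinable()) return;
    cancelled_ = false;
    worker_ = std::thread([this] { Run(); });
  }

  // Stop the countdown (e.g. after a recover signal) and join the thread.
  // Must not be called from inside `on_timeout`.
  void Cancel() {
    std::thread finished;
    {
      std::lock_guard<std::mutex> lock(mu_);
      cancelled_ = true;
      finished = std::move(worker_);
    }
    cv_.notify_all();
    if (finished.joinable()) finished.join();
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(mu_);
    // wait_for returns the predicate value: true if cancelled in time,
    // false if the timeout elapsed first.
    const bool cancelled =
        cv_.wait_for(lock, timeout_, [this] { return cancelled_; });
    lock.unlock();
    if (!cancelled) on_timeout_();
  }

  const std::chrono::milliseconds timeout_;
  const std::function<void()> on_timeout_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool cancelled_ = false;  // guarded by mu_
  std::thread worker_;      // guarded by mu_
};
```

In a worker, the callback would presumably log the failure and terminate the process (for example with `std::exit`), so a stuck agent gives its resources back to the scheduler instead of waiting forever. Using a condition variable rather than a plain sleep lets the recover path cancel the countdown immediately.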

chenqin · 2019-10-11T17:12:38Z

merged to master

chenqin mentioned this issue Sep 20, 2019

[EPIC] Allow failed worker retry in distributed training dmlc/xgboost#4753

Closed

10 tasks

chenqin mentioned this issue Oct 2, 2019

exit when allreduce/broadcast error cause timeout #112

Merged

chenqin changed the title ~~[WIP] add timeout to avoid rabit hang forever~~ add timeout thread to avoid rabit hang forever Oct 4, 2019

chenqin closed this as completed Oct 11, 2019
