[EPIC] Allow failed worker retry in distributed training #4753

chenqin · 2019-08-08T17:25:16Z

Fault recovery in native distributed XGB training (not xgb-spark) have been broken for more than a year. Community have seen various of issue like this dmlc/rabit#63

expose flag to opt-in detailed log in rabit
support bootstrap cache in rabit , expose flag to opt-in rabit bootstrap cache
design doc.pdf
PR candidate: support bootstrap allreduce/broadcast rabit#98
patch native xgb code base and support worker retry
pr remove is_bootstrap parameter rabit#102 cleanup rabit api
pr [rabit_bootstrap_cache ] failed xgb worker recover from other workers #4808 patch xgb worker
pr cover change with gtest unittests, clean up cmake script and code includes rabit#106 unittest rabit, cleanup make file
patch xgb-spark code base and support task retry
design [jvm-packages] retry-able xgboost4j-spark booster #4831
- surface configuration to jvm layer [jvm-packages] expose rabit configuration to user #4861 pr [jvm-packages] update rabit, surface new changes to spark, add parity and failure tests #4966
- short circuit, timeout rabit add timeout thread to avoid rabit hang forever rabit#105
- enable failed foreachparition spark task retry and disable spark checkpoint if user opt-in with configuration "rabit_bootstrap_cache=1"

tracking backlogs work from merged prs

move histogram init before loadcheckpoint. [rabit_bootstrap_cache ] failed xgb worker recover from other workers #4808 (comment)
remove duplicated colmn size check in train/eval datasets
cleanup duplicated rabit cmake build in xgboost pr cover change with gtest unittests, clean up cmake script and code includes rabit#106

trivialfis · 2019-08-12T15:04:18Z

Looks interesting. ;-)

trams · 2019-08-20T21:07:00Z

This looks interesting indeed. One question: "What is native distributed XGB training?" How can I use it? Is it the one which involves a python package + dask?

chenqin · 2019-08-23T17:50:20Z

This looks interesting indeed. One question: "What is native distributed XGB training?" How can I use it? Is it the one which involves a python package + dask?

If you try xgboost with more than one machine. You are already using rabit, we added feature and track changes in progress to xgb layer

https://www.slideshare.net/ChenQin1/scaling-xgboost

trams · 2019-08-23T18:44:58Z

@chenqin, I understand that xgboost uses rabit as AllReduce implementation. I still don't really know what is a native distributed XGB training?
Does xgboost-spark run native distributed XGB training? If yes then why do we add "native" here? What other distributed XGB training do we have?

I am new to the project and I am still discovering different ways to use xgboost so this is an honest question.

chenqin · 2019-08-23T20:05:20Z

@chenqin, I understand that xgboost uses rabit as AllReduce implementation. I still don't really know what is a native distributed XGB training?
Does xgboost-spark run native distributed XGB training? If yes then why do we add "native" here? What other distributed XGB training do we have?

I am new to the project and I am still discovering different ways to use xgboost so this is an honest question.

In some use cases where user don't have complex feature generating needs, user can launch xgboost native worker (c++) without data processing
framework. https://github.com/kubeflow/xgboost-operator

…lc#4753

chenqin · 2019-08-29T05:59:35Z

from description in this thread, yes, dask-xgboost leverage rabit as well.
#2032

hcho3 · 2019-09-19T17:49:04Z

@chenqin It's epic :)

nateagr · 2019-12-11T17:06:46Z

Hi there! Any update on this epic ? Thanks.

chenqin · 2019-12-14T03:20:15Z

Hi there! Any update on this epic ? Thanks.

Yes, it's going (holiday session slow), still miss the final piece of patch XGBoost-spark after we can get current open pr landed.

trivialfis · 2020-09-27T08:16:09Z

I don't think this is possible in short term. Let's stick with fail all strategy for now. We @hcho3 are thinking about redesigning RABIT from scratch.

chenqin mentioned this issue Aug 8, 2019

[Roadmap] XGBoost 1.0.0 Roadmap #4680

Closed

9 tasks

chenqin added a commit to chenqin/xgboost that referenced this issue Aug 29, 2019

workaround booster save/load inconsistency, leave fix to item 3 in dm…

6e1bdca

…lc#4753

chenqin mentioned this issue Sep 4, 2019

[jvm-packages] retry-able xgboost4j-spark booster #4831

Closed

chenqin changed the title ~~Allow failed worker retry in distributed training~~ [EPIC] Allow failed worker retry in distributed training Sep 19, 2019

trivialfis closed this as completed Sep 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[EPIC] Allow failed worker retry in distributed training #4753

[EPIC] Allow failed worker retry in distributed training #4753

chenqin commented Aug 8, 2019 •

edited

Loading

trivialfis commented Aug 12, 2019

trams commented Aug 20, 2019

chenqin commented Aug 23, 2019

trams commented Aug 23, 2019

chenqin commented Aug 23, 2019

chenqin commented Aug 29, 2019

hcho3 commented Sep 19, 2019

nateagr commented Dec 11, 2019

chenqin commented Dec 14, 2019

trivialfis commented Sep 27, 2020

[EPIC] Allow failed worker retry in distributed training #4753

[EPIC] Allow failed worker retry in distributed training #4753

Comments

chenqin commented Aug 8, 2019 • edited Loading

trivialfis commented Aug 12, 2019

trams commented Aug 20, 2019

chenqin commented Aug 23, 2019

trams commented Aug 23, 2019

chenqin commented Aug 23, 2019

chenqin commented Aug 29, 2019

hcho3 commented Sep 19, 2019

nateagr commented Dec 11, 2019

chenqin commented Dec 14, 2019

trivialfis commented Sep 27, 2020

chenqin commented Aug 8, 2019 •

edited

Loading