Rewrite Dask interface. #4819

trivialfis · 2019-08-30T07:00:29Z

This PR rewrites the Dask interface based on dask-xgboost, with added support for evaluation metrics and a slightly different interface.

Implement functional train and predict.
Implement Scikit-Learn wrapper for regression and classification. No ranking yet.
Implement DaskDMatrix as a proxy for underlying distributed DMatrix.

Closes #4814.

trivialfis · 2019-08-30T07:14:32Z

@RAMitchell @hcho3 @CodingCat I can further abstract DMatrix to accept dask.dataframe, but with considerably more changes. WDYT?

trivialfis · 2019-08-30T07:20:35Z

I would like to see how much work this has in common with #4656. @thesuperzapper Could you also join the review once this PR is no longer WIP?

RAMitchell · 2019-09-02T03:25:47Z

> @RAMitchell @hcho3 @CodingCat I can further abstract DMatrix to accept dask.dataframe, but with considerably more changes. WDYT?

No rush on this, I think. Let's stabilise the Dask interface first.

What's your plan for assigning gpu_id in this interface?

trivialfis · 2019-09-02T03:50:43Z

@RAMitchell I have another branch that uses a utility I implemented. I need to somehow merge it here.

mtjrider

Separate the client from the current Python process by enabling a client override. Remove experimental new Python syntax. Use Dask's method for frame concatenation instead of dispatching the type yourself.
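The client override requested above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from this PR: `FakeClient`, `set_default_client`, `resolve_client`, and this `train` are hypothetical names, and the module-level default only stands in for whatever process-wide client a real Dask setup would provide.

```python
# Hypothetical stand-in for a process-wide default client.
_default_client = None


class FakeClient:
    """Hypothetical stand-in for a cluster client object."""

    def __init__(self, name):
        self.name = name


def set_default_client(client):
    """Register a process-wide default client (illustration only)."""
    global _default_client
    _default_client = client


def resolve_client(client=None):
    """Prefer an explicitly passed client; fall back to the default."""
    if client is not None:
        return client
    if _default_client is None:
        raise ValueError("No client passed and no default client is set")
    return _default_client


def train(params, data, client=None):
    """Hypothetical entry point that accepts a client override."""
    client = resolve_client(client)
    return f"training on {client.name}"
```

With this shape, a caller that manages several clusters can pass `client=` explicitly, while code running inside a single session can rely on the default.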

python-package/xgboost/distributed.py

mrocklin · 2019-09-11T01:29:50Z

@RAMitchell asked me for my thoughts on the use of the distributed_dispatch decorator here.

Strictly from a Dask user's perspective I think that it's nice to have first class support for Dask within XGBoost. However, I can see how from an XGBoost maintainer's point of view, this might be concerning.

In general I think that API extensibility is good. For example it's nice that I can use np.sum on a numpy array, cupy array, dask array, and so on. However usually when making an API extensible it's common to establish a contract that anyone can implement, without having to directly modify code within the core project. For example maybe xgb.train looks for a .__xgb_train__ method on the passed-in arguments and calls that. That way, if some other project comes along and implements things correctly then they can be first class as well, even if they don't have strong ties to the XGBoost maintainers. Protocols like this tend to level the playing field.

That's all maybe a bit philosophical. I'm happy to get more into details here if folks want.
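A rough sketch of the protocol idea described above, assuming a hook named `__xgb_train__` as floated in the comment. Nothing here is an existing XGBoost API: `LocalData`, `ThirdPartyData`, `_train_local`, and this `train` are hypothetical stand-ins that return plain dictionaries instead of running any training.

```python
class LocalData:
    """Hypothetical stand-in for an in-memory dataset."""

    def __init__(self, rows):
        self.rows = rows


class ThirdPartyData:
    """Hypothetical container owned by another project."""

    def __init__(self, partitions):
        self.partitions = partitions

    def __xgb_train__(self, params, **kwargs):
        # The third-party project decides how training runs on its data.
        n_rows = sum(len(p) for p in self.partitions)
        return {"backend": "third-party", "rows": n_rows, "params": params}


def _train_local(params, data, **kwargs):
    """Default path for data that does not implement the hook."""
    return {"backend": "local", "rows": len(data.rows), "params": params}


def train(params, data, **kwargs):
    """Call the data's own hook if its type defines one."""
    hook = getattr(type(data), "__xgb_train__", None)
    if hook is not None:
        return hook(data, params, **kwargs)
    return _train_local(params, data, **kwargs)
```

The core function never imports or names the third-party type, which is the property the comment argues for: any project that implements the hook gets the same treatment.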

mrocklin · 2019-09-11T01:30:50Z

I'll also say, it would be nice to find protocols that would allow us to pass in Dask dataframes or Dask arrays directly, without creating the DaskDMatrix object. I'm not sure how easy that would be. Maybe using something like functools.singledispatch or multipledispatch?
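The `functools.singledispatch` suggestion could look something like the sketch below. It only illustrates type-based dispatch on the first argument: `FakeDaskArray` and `prepare_data` are hypothetical, and the return values are placeholders rather than real DMatrix objects.

```python
from functools import singledispatch


class FakeDaskArray:
    """Hypothetical stand-in for a partitioned collection."""

    def __init__(self, chunks):
        self.chunks = chunks


@singledispatch
def prepare_data(data):
    """Fallback for types nobody has registered."""
    raise TypeError(f"Unsupported input type: {type(data).__name__}")


@prepare_data.register(list)
def _prepare_list(data):
    # In-memory data is treated as a single partition.
    return ("single-node", [data])


@prepare_data.register(FakeDaskArray)
def _prepare_partitioned(data):
    # Partitioned data keeps one entry per chunk.
    return ("distributed", list(data.chunks))
```

A caller would then pass either kind of input to `prepare_data` directly, and another project could register its own type without editing this module.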

python-package/xgboost/distributed.py

mrocklin · 2019-09-11T01:38:35Z

The actual implementation looks like what we've all done historically. It's not super clean but there doesn't seem to be a better solution at the moment. I think it would be worth thinking about how to improve these sorts of workflows within Dask.

My guess is that that's out of scope for folks here (some core Dask devs probably need to spend some serious time thinking about that). Mostly I mention this so that you don't grow too attached to this implementation, and remain open to some day replacing it with something else.

trivialfis · 2019-09-11T04:29:09Z

> The actual implementation looks like what we've all done historically

Yeah. I learned a lot from it, especially the part about obtaining worker-local data.

> My guess is that that's out of scope for folks here

I'm interested in learning more about Dask, so please keep me in the loop.

> Mostly I mention this so that you don't grow too attached to this implementation

That won't be a problem. If there's an improvement we can change it anytime. I don't expect one commit to get everything right. ;-)

As for the sklearn interface, I'm working on it.

* Consider data locality in distributing data.
* New interface that looks similar to the one with single node.
* Pass explicit client object.
* Support evaluation history.

codecov-io · 2019-09-22T16:27:05Z

Codecov Report

Merging #4819 into master will decrease coverage by 5.91%.
The diff coverage is 23.97%.

@@            Coverage Diff             @@
##           master    #4819      +/-   ##
==========================================
- Coverage   77.63%   71.72%   -5.92%     
==========================================
  Files          11       11              
  Lines        2039     2281     +242     
==========================================
+ Hits         1583     1636      +53     
- Misses        456      645     +189

| Impacted Files | Coverage Δ | |
|---|---|---|
| python-package/xgboost/sklearn.py | `87.69% <ø> (-0.04%)` | ⬇️ |
| python-package/xgboost/__init__.py | `89.47% <100%> (ø)` | ⬆️ |
| python-package/xgboost/dask.py | `19.13% <19.13%> (-4.31%)` | ⬇️ |
| python-package/xgboost/core.py | `75.57% <25%> (-0.77%)` | ⬇️ |
| python-package/xgboost/compat.py | `54.4% <70.37%> (+2.97%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fc8c9b0...8be19b0.

trivialfis · 2019-09-23T04:54:10Z

@mrocklin @RAMitchell @mt-jones Please take a look again.

python-package/xgboost/dask.py

demo/dask/gpu_training.py

RAMitchell

LGTM. You will have to make sure all of these functions appear in the Python API documentation.

trivialfis · 2019-09-24T03:46:45Z

@mrocklin @TomAugspurger The original implementation of dask-xgboost is acknowledged in both the documentation and the code header. Thank you so much!

trivialfis · 2019-09-24T05:35:15Z

@CodingCat @hcho3 Would you like to take a look?

mtjrider

Looks great to me.

trivialfis · 2019-09-25T05:30:01Z

Merging so that we have more time to test it. @hcho3 @CodingCat Try it out. I have grown to like Dask now. ;-)

trivialfis · 2019-09-25T05:30:38Z

@mt-jones @mrocklin @RAMitchell Thanks!

trivialfis requested a review from RAMitchell August 30, 2019 07:01

trivialfis changed the title ~~[WIP] Rewrite Dask support.~~ [WIP] Rewrite Dask interface. Aug 30, 2019

trivialfis force-pushed the slurm-dask branch 2 times, most recently from c5dc6df to 08ca328 August 30, 2019 14:22

trivialfis mentioned this pull request Aug 30, 2019

Making sure to have right mapping between predictors and output for dask interface. #4814

Closed

trivialfis force-pushed the slurm-dask branch 3 times, most recently from b46bb59 to 1688432 September 9, 2019 09:38

mtjrider suggested changes Sep 10, 2019

python-package/xgboost/distributed.py (outdated)

mrocklin reviewed Sep 11, 2019

python-package/xgboost/distributed.py (outdated)

mrocklin reviewed Sep 11, 2019

python-package/xgboost/distributed.py (outdated)

mrocklin reviewed Sep 11, 2019

python-package/xgboost/distributed.py (outdated)

trivialfis force-pushed the slurm-dask branch from 61b4051 to 551696a September 18, 2019 07:35

Rewrite of dask interface.

d029b9f

* Consider data locality in distributing data.
* New interface that looks similar to the one with single node.
* Pass explicit client object.
* Support evaluation history.

trivialfis force-pushed the slurm-dask branch from fc26061 to d029b9f September 20, 2019 08:16

Import dask in init.

4b6c213

mrocklin mentioned this pull request Sep 20, 2019

Generalizing Dask-XGBoost dask/distributed#3075

Open

trivialfis added 2 commits September 22, 2019 11:41

Fix prediction, use dask concat.

a10c666

Copy some documents.

afb1af9

trivialfis added 2 commits September 22, 2019 14:28

Force passing client in predict.

5a7b9e9

Run tests on Travis.

04e2f02

trivialfis added 4 commits September 22, 2019 23:26

Implement base class for better documentation.

e933644

Force client parameter in DaskDMatrix.

da0cd5a

Add document.

4482d0c

Fix building doc, don't import complete pandas.

d395551

trivialfis marked this pull request as ready for review September 23, 2019 04:53

trivialfis changed the title ~~[WIP] Rewrite Dask interface.~~ Rewrite Dask interface. Sep 23, 2019

Remove duplicated doc.

e53844b

mtjrider suggested changes Sep 24, 2019

python-package/xgboost/dask.py (outdated)

demo/dask/gpu_training.py

RAMitchell reviewed Sep 24, 2019

trivialfis added 3 commits September 23, 2019 23:10

Fix doc related issues.

0fd9ff1

Address reviewers' comments and better doc string.

ee4cadc

Small changes to doc.

8be19b0

Document the current limitations.

73a4af3

mtjrider reviewed Sep 24, 2019

trivialfis merged commit b8433c4 into dmlc:master Sep 25, 2019

trivialfis deleted the slurm-dask branch September 25, 2019 11:57

lock bot locked as resolved and limited conversation to collaborators Dec 24, 2019
