Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite Dask interface. #4819

Merged
merged 15 commits into from
Sep 25, 2019
Merged

Rewrite Dask interface. #4819

merged 15 commits into from
Sep 25, 2019

Conversation

trivialfis
Copy link
Member

@trivialfis trivialfis commented Aug 30, 2019

This PR rewrites the dask interface based on dask-xgboost, with added support for evaluation metrics and slightly different interface.

  • Implement functional train and predict.
  • Implement Scikit-Learn wrapper for regression and classification. No ranking yet.
  • Implement DaskDMatrix as a proxy for underlying distributed DMatrix.

Closes #4814.

@trivialfis trivialfis changed the title [WIP] Rewrite Dask support. [WIP] Rewrite Dask interface. Aug 30, 2019
@trivialfis
Copy link
Member Author

@RAMitchell @hcho3 @CodingCat I can further abstract DMatrix to accept dask.dataframe, but with considerably more changes. WDYT?

@trivialfis
Copy link
Member Author

I would like to see how many works are in common with #4656 . @thesuperzapper Could you also join the review after this PR no longer being WIP?

@RAMitchell
Copy link
Member

@RAMitchell @hcho3 @CodingCat I can further abstract DMatrix to accept dask.dataframe, but with considerably more changes. WDYT?

No rush on this I think. Lets stabilise the dask interface first.

Whats your plan for assigning gpu_id in this interface?

@trivialfis
Copy link
Member Author

@RAMitchell I have another branch that uses a utility I implemented. I need to somehow merge it here.

@trivialfis trivialfis force-pushed the slurm-dask branch 3 times, most recently from b46bb59 to 1688432 Compare September 9, 2019 09:38
Copy link
Contributor

@mtjrider mtjrider left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate the client from the current Python process by enabling a client override. Remove experimental new Python syntax. Use Dask's method for frame concatenation instead of dispatching the type yourself.

python-package/xgboost/distributed.py Outdated Show resolved Hide resolved
python-package/xgboost/distributed.py Outdated Show resolved Hide resolved
python-package/xgboost/distributed.py Outdated Show resolved Hide resolved
@mrocklin
Copy link
Contributor

@RAMitchell asked me for my thoughts on the use of the distributed_dispatch decorator here.

Strictly from a Dask user's perspective I think that it's nice to have first class support for Dask within XGBoost. However, I can see how from an XGBoost maintainer's point of view, this might be concerning.

In general I think that API extensibility is good. For example it's nice that I can use np.sum on a numpy array, cupy array, dask array, and so on. However usually when making an API extensible it's common to establish a contract that anyone can implement, without having to directly modify code within the core project. For example maybe xgb.train looks for a .__xgb_train__ method on the passed-in arguments and calls that. That way, if some other project comes along and implements things correctly then they can be first class as well, even if they don't have strong ties to the XGBoost maintainers. Protocols like this tend to level the playing field.

That's all maybe a bit philosophical. I'm happy to get more into details here if folks want.

@mrocklin
Copy link
Contributor

I'll also say, it would be nice to find protocols that would allow us to pass in Dask dataframes or Dask arrays directly, without creating the DaskDMatrix object. I'm not sure how easy that would be. Maybe using something like functools.singledispatch or multipledispatch?

@mrocklin
Copy link
Contributor

The actual implementation looks like what we've all done historically. It's not super clean but there doesn't seem to be a better solution at the moment. I think it would be worth thinking about how to improve these sorts of workflows within Dask.

My guess is that that's out of scope for folks here (some core Dask devs probably need to spend some serious time thinking about that). Mostly I mention this so that you don't grow too attached to this implementation, and remain open to some day replacing it with something else.

@trivialfis
Copy link
Member Author

trivialfis commented Sep 11, 2019

The actual implementation looks like what we've all done historically

Yeah. I learned a lot from it. Especially the part of obtaining worker local data.

My guess is that that's out of scope for folks here

I'm interested in getting to know more about dask. So please keep me in the loop.

Mostly I mention this so that you don't grow too attached to this implementation

That won't be a problem. If there's an improvement we can change it anytime. I don't expect one commit can get everything right. ;-)

As for sklearn interface. I'm working on it.

* Consider data locality in distributing data.
* New interface that looks similar to the one with single node.
* Pass explicit client object.
* Support evaluation history.
@codecov-io
Copy link

codecov-io commented Sep 22, 2019

Codecov Report

Merging #4819 into master will decrease coverage by 5.91%.
The diff coverage is 23.97%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4819      +/-   ##
==========================================
- Coverage   77.63%   71.72%   -5.92%     
==========================================
  Files          11       11              
  Lines        2039     2281     +242     
==========================================
+ Hits         1583     1636      +53     
- Misses        456      645     +189
Impacted Files Coverage Δ
python-package/xgboost/sklearn.py 87.69% <ø> (-0.04%) ⬇️
python-package/xgboost/__init__.py 89.47% <100%> (ø) ⬆️
python-package/xgboost/dask.py 19.13% <19.13%> (-4.31%) ⬇️
python-package/xgboost/core.py 75.57% <25%> (-0.77%) ⬇️
python-package/xgboost/compat.py 54.4% <70.37%> (+2.97%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc8c9b0...8be19b0. Read the comment docs.

@trivialfis trivialfis marked this pull request as ready for review September 23, 2019 04:53
@trivialfis trivialfis changed the title [WIP] Rewrite Dask interface. Rewrite Dask interface. Sep 23, 2019
@trivialfis
Copy link
Member Author

@mrocklin @RAMitchell @mt-jones Please take a look again.

python-package/xgboost/dask.py Outdated Show resolved Hide resolved
python-package/xgboost/dask.py Outdated Show resolved Hide resolved
demo/dask/gpu_training.py Show resolved Hide resolved
Copy link
Member

@RAMitchell RAMitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. You will have to make sure all of these functions appear in the Python API documentation.

@trivialfis
Copy link
Member Author

@mrocklin @TomAugspurger The original implementation of dask-xgboost is acknowledged in both doc and code header. Thank you so much!

@trivialfis
Copy link
Member Author

trivialfis commented Sep 24, 2019

@CodingCat @hcho3 Would you like to take a look?

Copy link
Contributor

@mtjrider mtjrider left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me.

@trivialfis
Copy link
Member Author

Merging as we will have more time to test it. @hcho3 @CodingCat Try it out, I have grown to like dask now. ;-)

@trivialfis trivialfis merged commit b8433c4 into dmlc:master Sep 25, 2019
@trivialfis
Copy link
Member Author

@mt-jones @mrocklin @RAMitchell Thanks!

@trivialfis trivialfis deleted the slurm-dask branch September 25, 2019 11:57
@lock lock bot locked as resolved and limited conversation to collaborators Dec 24, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Making sure to have right mapping between predictors and output for dask interface.
5 participants