Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support DataTable in Dask #3830

Closed
StrikerRUS opened this issue Jan 24, 2021 · 4 comments
Closed

Support DataTable in Dask #3830

StrikerRUS opened this issue Jan 24, 2021 · 4 comments
Labels

Comments

@StrikerRUS
Copy link
Collaborator

Summary

Dask estimators should support input in a form of H2O DataTable .

Motivation

This change would bring the Dask interface closer to full feature parity with the non-Dask interface.

Description

Initial step can be supporting DataTable via converting it into Numpy array.

References

#3515 (comment)

data : string, numpy array, pandas DataFrame, H2O DataTable's Frame, scipy.sparse or list of numpy arrays
Data source of Dataset.
If string, it represents the path to txt file.

elif isinstance(data, DataTable):
preds, nrow = self.__pred_for_np2d(data.to_numpy(), start_iteration, num_iteration, predict_type)

@StrikerRUS
Copy link
Collaborator Author

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.

@jameslamb
Copy link
Collaborator

I'm -1 on this change. I believe that .fit() in the Dask module should only accept Dask collections (Dask DataFrame and Dask Array).

Type hints and type decisions for Dask (#3756) is my next priority, and in the PR for that I'll propose that we raise an error in .fit() if X, y, or sample_weight are not Dask collections. This is what XGBoost does as well, and I think it's a very good pattern: https://github.com/dmlc/xgboost/blob/a275f4026728ed14fbc70da142ef7a4a1d3de04d/python-package/xgboost/dask.py#L258-L263.

If we don't put in such limitatioons, lightgbm will have to take on responsibility for how to move a non-Dask input out to the Dask cluster. That will introduce a lot of maintenance for not a lot of gain to users. The task of taking a non-Dask input and turning it into a Dask collection is theoretically simple, but there isn't a single "right" way to do it and the best way can depend on the shape of your data and the nature of your task. Consider dask/dask#6833 (comment) and the rest of the discussion on that issue and the linked ones for some examples of how this can be a rough part of the Dask experience.

The comment referenced for this issue, #3515 (comment), was on the internals of _train_part() in the Dask module. The types that function takes are those that make up a single partition of a Dask collection. A Dask DataFrame is a collection of pandas dataframes. A Dask Array is a collection of numpy arrays or scipy sparse matrices. See this famous images from the Dask docs (https://docs.dask.org/en/latest/dataframe.html).

image

@StrikerRUS
Copy link
Collaborator Author

@jameslamb Thanks for the discussion!
Understood!

the Dask module should only accept Dask collections (Dask DataFrame and Dask Array).

Agree with this statement.

I'm going to strike out this issue from feature requests right now. And looking forward for your PR with type decisions!

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

2 participants