Support DataTable in Dask #3830

StrikerRUS · 2021-01-24T01:23:53Z

Summary

Dask estimators should support input in the form of an H2O DataTable.

Motivation

This change would bring the Dask interface closer to full feature parity with the non-Dask interface.

Description

An initial step could be to support DataTable by converting it into a NumPy array.

References

#3515 (comment)

LightGBM/python-package/lightgbm/basic.py

Lines 946 to 948 in da44387

    data : string, numpy array, pandas DataFrame, H2O DataTable's Frame, scipy.sparse or list of numpy arrays
        Data source of Dataset.
        If string, it represents the path to txt file.

LightGBM/python-package/lightgbm/basic.py

Lines 619 to 620 in da44387

    elif isinstance(data, DataTable):
        preds, nrow = self.__pred_for_np2d(data.to_numpy(), start_iteration, num_iteration, predict_type)

StrikerRUS · 2021-01-24T01:25:03Z

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.

jameslamb · 2021-01-24T03:40:55Z

I'm -1 on this change. I believe that .fit() in the Dask module should only accept Dask collections (Dask DataFrame and Dask Array).

Type hints and type decisions for Dask (#3756) is my next priority, and in the PR for that I'll propose that we raise an error in .fit() if X, y, or sample_weight are not Dask collections. This is what XGBoost does as well, and I think it's a very good pattern: https://github.com/dmlc/xgboost/blob/a275f4026728ed14fbc70da142ef7a4a1d3de04d/python-package/xgboost/dask.py#L258-L263.

If we don't put in such limitations, lightgbm will have to take on responsibility for how to move a non-Dask input out to the Dask cluster. That will introduce a lot of maintenance for not a lot of gain to users. The task of taking a non-Dask input and turning it into a Dask collection is theoretically simple, but there isn't a single "right" way to do it, and the best way can depend on the shape of your data and the nature of your task. Consider dask/dask#6833 (comment) and the rest of the discussion on that issue and the linked ones for some examples of how this can be a rough part of the Dask experience.

The comment referenced for this issue, #3515 (comment), was on the internals of _train_part() in the Dask module. The types that function takes are those that make up a single partition of a Dask collection. A Dask DataFrame is a collection of pandas DataFrames. A Dask Array is a collection of numpy arrays or scipy sparse matrices. See this famous image from the Dask docs (https://docs.dask.org/en/latest/dataframe.html).

StrikerRUS · 2021-01-24T13:44:43Z

@jameslamb Thanks for the discussion!
Understood!

> the Dask module should only accept Dask collections (Dask DataFrame and Dask Array).

Agree with this statement.

I'm going to strike this issue from the feature requests right now. Looking forward to your PR with type decisions!

github-actions · 2023-08-23T16:24:26Z

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

StrikerRUS added feature request dask labels Jan 24, 2021

StrikerRUS mentioned this issue Jan 24, 2021

Feature Requests & Voting Hub #2302

Open

StrikerRUS closed this as completed Jan 24, 2021

StrikerRUS removed the feature request label Jan 24, 2021

jameslamb mentioned this issue Jan 26, 2021

[dask] Add type hints in Dask package #3866

Merged

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
