Support all LightGBM parallel tree learners in Dask #3834
@StrikerRUS I disagree with calling this a "good first issue", and believe it should be a feature request. Using a different tree learner isn't as simple as just changing the tree_learner parameter. You have to be sure that the way your data (either Dask DataFrame or Dask Array) is partitioned matches the method you chose. For a dataset with …
Implementing this will require some non-trivial changes to the tests and the Dask module.
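To make the partitioning requirement concrete, here is a minimal sketch (using NumPy arrays as stand-ins for Dask collections, so the names and shapes here are illustrative only): data-parallel and voting-parallel learners expect horizontal (row-wise) partitions, while feature-parallel expects vertical (column-wise) partitions.

```python
import numpy as np

# A toy dataset: 8 rows (samples) x 4 columns (features).
X = np.arange(32).reshape(8, 4)

# Horizontal (row-wise) partitioning: what data-parallel and
# voting-parallel learners expect -- each worker holds all
# features for a subset of rows.
horizontal_parts = np.split(X, 2, axis=0)   # two workers, 4 rows each

# Vertical (column-wise) partitioning: what feature-parallel
# expects -- each worker holds all rows for a subset of features.
vertical_parts = np.split(X, 2, axis=1)     # two workers, 2 columns each

print([p.shape for p in horizontal_parts])  # [(4, 4), (4, 4)]
print([p.shape for p in vertical_parts])    # [(8, 2), (8, 2)]
```

In real Dask terms the analogue is how the collection is chunked: row-wise chunks for data/voting parallel, column-wise chunks for feature parallel.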
I'm going to close this and add it to #2302. Thank you very much for writing this up, and for all of the Dask issues you've written up! Having this all documented in the backlog takes a lot of work, but it's also critical to getting other contributors to come help. Anyone who sees this is welcome to pick up this feature! Leave a comment and we can re-open the issue.
Huh, I was suspecting this but was too shy to say it out loud! 🙂 Given that only the code at LightGBM/python-package/lightgbm/dask.py, lines 254 to 257 in ac706e1 …
I've proposed a PR: #3848
Seems that … (lines 130 to 131 in 7880b79)
Oh wow! For context, looks like that comment was added in #986. So does this mean feature-parallel learning is only supported in the CLI? |
I think so. |
OK, I can make a separate pull request to update the parameter documentation for that, then.
Adding some information for anyone looking to add this. You can see a description at https://lightgbm.readthedocs.io/en/latest/Features.html#optimization-in-network-communication, and much more detail in https://proceedings.neurips.cc/paper/2016/file/10a5ab2db37feedfdeaab192ead4ac0e-Paper.pdf if you're curious. From the paper: …
In theory, adding … (LightGBM/python-package/lightgbm/dask.py, line 274 in 4580393). Voting parallel uses horizontally-partitioned data just like data parallel.
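The candidate-selection step from the PV-Tree paper linked above can be sketched roughly like this (a toy illustration with a hypothetical helper name, not LightGBM's internals): each worker votes for its locally best k features, and only the globally top-2k vote-getters have their full histograms merged across workers.

```python
from collections import Counter

def voting_candidates(local_gains_per_worker, k=2):
    """Toy sketch of voting-parallel candidate selection (PV-Tree):
    each worker votes for its locally best k features; globally,
    the 2k features with the most votes are kept for full histogram
    communication. (Hypothetical helper, not LightGBM's API.)
    """
    votes = Counter()
    for gains in local_gains_per_worker:        # gains: {feature: gain}
        top_k = sorted(gains, key=gains.get, reverse=True)[:k]
        votes.update(top_k)
    return [f for f, _ in votes.most_common(2 * k)]

# Three workers, each with local split gains for five features.
workers = [
    {"f0": 0.9, "f1": 0.1, "f2": 0.8, "f3": 0.2, "f4": 0.3},
    {"f0": 0.7, "f1": 0.2, "f2": 0.9, "f3": 0.1, "f4": 0.4},
    {"f0": 0.8, "f1": 0.3, "f2": 0.2, "f3": 0.9, "f4": 0.1},
]
print(voting_candidates(workers, k=2))  # ['f0', 'f2', 'f3']
```

Because only votes and then a fixed number of candidate histograms are exchanged, the communication cost stays low even with many features, which is the optimization the Features page describes.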
Should I remove this? (LightGBM/python-package/lightgbm/dask.py, lines 265 to 266 in 971b548)
No, I'd prefer to leave that and rely on LightGBM to throw the error mentioned above. If you try to do feature parallel and your data are vertically partitioned, it wouldn't make sense for LightGBM to try to perform data parallel learning (which requires horizontally partitioned data).
Adding … (LightGBM/python-package/lightgbm/dask.py, line 274 in 971b548) just leaves feature parallel to go into the if and trigger this warning first:

UserWarning: Support for tree_learner feature_parallel in lightgbm.dask is experimental and may break in a future release.

and then the error. Maybe we should remove the if and go straight to the error? The warning kind of suggests that there is support for feature_parallel in the Dask API, but there isn't.
Oh, good point! OK, I would support removing that warning entirely at the same time you add support for voting parallel.
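A rough sketch of what the stricter check discussed above could look like (alias sets and function name are hypothetical, not LightGBM's actual code): raise immediately for feature parallel instead of first emitting an "experimental" warning.

```python
import warnings

# Hypothetical aliases, mirroring the discussion above: only data- and
# voting-parallel learners work with lightgbm.dask, so feature-parallel
# should raise right away rather than warn first.
_DATA_ALIASES = {"data", "data_parallel"}
_VOTING_ALIASES = {"voting", "voting_parallel"}
_FEATURE_ALIASES = {"feature", "feature_parallel"}

def check_tree_learner(tree_learner):
    """Sketch of the validation step (not LightGBM's actual code)."""
    if tree_learner in _FEATURE_ALIASES:
        # Go straight to the error: warning first would wrongly
        # suggest feature-parallel is supported in the Dask API.
        raise ValueError(
            f"tree_learner={tree_learner} is not supported in lightgbm.dask"
        )
    if tree_learner not in _DATA_ALIASES | _VOTING_ALIASES:
        warnings.warn(
            f"Unknown tree_learner {tree_learner}, falling back to 'data'"
        )
        return "data"
    return tree_learner
```

With this shape, `check_tree_learner("voting")` passes through, while `check_tree_learner("feature_parallel")` raises a `ValueError` without any misleading warning.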
According to this piece of code (LightGBM/python-package/lightgbm/dask.py, lines 251 to 264 in ac706e1), the Dask module supports all available tree learner types. Refer to https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#choose-appropriate-parallel-algorithm for more details about the differences among tree_learner param values.

However, our CI tests are run only for the data parallel tree learner type (LightGBM/tests/python_package_test/test_dask.py, line 182 in ac706e1).

I believe that parametrizing all tests with different tree learners will improve the tests and make us confident in the Dask module's quality.

How tests can be parametrized:
LightGBM/tests/python_package_test/test_dask.py, line 31 in ac706e1
LightGBM/tests/python_package_test/test_dask.py, line 166 in ac706e1
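A minimal sketch of such parametrization (the test name and params here are hypothetical; the real fixtures and helpers live in tests/python_package_test/test_dask.py):

```python
import pytest

# Hypothetical sketch: parametrize a Dask test over the tree learners
# that should be exercised in CI, rather than hard-coding one.
@pytest.mark.parametrize("tree_learner", ["data", "voting"])
def test_classifier_tree_learners(tree_learner):
    params = {"tree_learner": tree_learner}
    # In the real test, a DaskLGBMClassifier would be fit with these
    # params and its predictions compared against local LightGBM.
    assert params["tree_learner"] in ("data", "voting")
```

pytest then runs one test case per tree learner, so a regression in any one learner shows up as a distinct failing case.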