sync for init score of binary objective function #4332
Conversation
Thank you for your interest in LightGBM!
To ensure an effective review and make the best use of maintainers' limited time, please update this pull request's description with an explanation of why you think this change should be made. If this is related to an existing feature request or bug report, please link to that as well.
When training a binary classifier in distributed mode, the init score should depend on the whole training dataset, not just the local partition of it.
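To make the intent concrete, here is a minimal sketch of deriving the init score from label/weight sums that have been reduced across all machines, rather than from local data alone. It assumes a sum-reduction helper named Network::GlobalSyncUpBySum alongside the mean variant discussed below; the actual diff may differ.

```cpp
#include <cmath>
#include <LightGBM/network.h>  // Network::num_machines(), sync helpers

// Sketch only: suml is this machine's weighted sum of positive labels,
// sumw its sum of weights. Reducing both sums globally makes pavg (and
// therefore the init score) reflect the full training dataset.
double GlobalBinaryInitScore(double suml, double sumw, double sigmoid) {
  if (LightGBM::Network::num_machines() > 1) {
    suml = LightGBM::Network::GlobalSyncUpBySum(suml);  // assumed helper
    sumw = LightGBM::Network::GlobalSyncUpBySum(sumw);  // assumed helper
  }
  const double pavg = suml / sumw;
  return std::log(pavg / (1.0 - pavg)) / sigmoid;
}
```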
Ok, thanks! I'd like to hear what @shiyu1994 says. In the meantime, is it possible to provide a reproducible example demonstrating the bug that this pull request is intended to fix? This project's tests include a relevant check at LightGBM/tests/python_package_test/test_dask.py, lines 292 to 295 (commit 36957ed).
If you can provide a reproducible example, it would also help us understand what those tests are missing. That might also help answer a related question I'd like to investigate.
The models created in distributed mode are identical because GBDT calls Network::GlobalSyncUpByMean to sync the init score by mean (see LightGBM/src/boosting/gbdt.cpp, lines 333 to 342 at 3dd4a3f).
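The referenced pattern looks roughly like this (a condensed sketch following the LightGBM sources, not the exact lines cited):

```cpp
// Condensed sketch: each worker asks the objective for an init score,
// then the per-machine scores are averaged across the cluster.
double init_score = objective_function_->BoostFromScore(class_id);
if (Network::num_machines() > 1) {
  // Average the per-machine init scores (one value per machine).
  init_score = Network::GlobalSyncUpByMean(init_score);
}
train_score_updater_->AddScore(init_score, class_id);
```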
However, in the BinaryLogLoss::BoostFromScore method, initscore and pavg depend only on local data. This confused me when I checked the logs of all machines, because the initscore and pavg printed there are local values, not values synced up by Network::GlobalSyncUpByMean (see LightGBM/src/objective/binary_objective.hpp, lines 155 to 159 at 3dd4a3f).
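A simplified sketch of the local-only computation being described (not the exact referenced lines):

```cpp
#include <cmath>

// Simplified sketch: suml and sumw accumulate over this machine's rows
// only, so pavg and initscore are per-machine values, which is why the
// logged numbers differ from worker to worker.
double LocalBinaryInitScore(const float* label, const float* weights,
                            int num_data, double sigmoid) {
  double suml = 0.0, sumw = 0.0;
  for (int i = 0; i < num_data; ++i) {
    const double w = (weights == nullptr) ? 1.0 : weights[i];
    suml += (label[i] > 0 ? 1.0 : 0.0) * w;  // weighted positive count
    sumw += w;
  }
  const double pavg = suml / sumw;  // fraction of positives, local only
  return std::log(pavg / (1.0 - pavg)) / sigmoid;
}
```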
When the label distributions are nearly the same across machines, reducing the local initscores by mean gives an acceptable approximation of the global initscore. But when the label distributions differ between machines, there can be a large gap between the mean-reduced initscore and the "real" init score computed over all data, which may slow training convergence.
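A hypothetical two-machine example (numbers invented for illustration, sigmoid taken as 1.0) shows how large that gap can be:

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Machine A: 1,000 rows, 90% positive. Machine B: 9,000 rows, 10% positive.
  const double n_a = 1000.0, p_a = 0.9;
  const double n_b = 9000.0, p_b = 0.1;

  // Log-odds transform used for the binary init score (sigmoid = 1.0).
  auto initscore = [](double pavg) { return std::log(pavg / (1.0 - pavg)); };

  // Mean-reduction of local init scores: (+2.197 + -2.197) / 2 = 0.0.
  const double mean_of_local = (initscore(p_a) + initscore(p_b)) / 2.0;

  // "Real" init score: pavg over all 10,000 rows is 0.18, giving ~ -1.52.
  const double global_pavg = (n_a * p_a + n_b * p_b) / (n_a + n_b);
  const double global_init = initscore(global_pavg);

  std::printf("mean of local init scores: %f\n", mean_of_local);
  std::printf("init score over all data:  %f\n", global_init);
  return 0;
}
```

Boosting would start from 0.0 instead of about -1.52 here, so early iterations would spend capacity correcting the initial bias.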
Yes, this issue also exists in the regression objectives and the xentropy objectives.
Thanks for the fix. This is really important for getting consistent results with data-distributed training. I notice that the regression objective needs to synchronize information for the initial score too, but that case is a little more complicated. Maybe we can leave it for another PR.
Got it! @shiyu1994, could you write up a feature request issue describing this bug for the regression and xentropy objectives (and any others you think might need it)? I'd write this up myself, but I think you can probably provide a more useful and specific description of the problem than I can.
Thanks very much! Approving since all the Dask tests are passing and since @shiyu1994 approved.
I've created #4405 to document the future work for other objectives. To keep the conversation in one place, I am locking further discussion on this pull request.
@jameslamb Thank you for writing that. I will follow that thread to synchronize initial scores for the other objectives.