
sync for init score of binary objective function #4332

Merged
merged 1 commit on Jun 25, 2021

Conversation

@loveclj (Contributor) commented Jun 1, 2021

When training a binary classifier in distributed mode, the init score should depend on the full training dataset, not only the local training dataset.

@jameslamb (Collaborator) left a comment

Thank you for your interest in LightGBM!

To ensure an effective review and make the best use of maintainers' limited time, please update this pull request's description with an explanation of why you think this change should be made. If this is related to an existing feature request or bug report, please link to that as well.

@loveclj loveclj requested a review from jameslamb June 1, 2021 05:36
@loveclj (Contributor, Author) left a comment

When training a binary classifier in distributed mode, the init score should depend on the full training dataset, not only the local training dataset.

@StrikerRUS StrikerRUS added the fix label Jun 2, 2021
@jameslamb (Collaborator) commented

Ok, thanks! I'd like to hear what @shiyu1994 says.

In the meantime, is it possible to provide a reproducible example demonstrating the bug that this pull request is intended to fix?

This project's dask module (a Python interface for distributed training) has unit tests that check that a binary classification model created with data-parallel training and a model trained only on local data produce identical predictions:

assert_eq(s1, s2)
assert_eq(p1, p2)
assert_eq(p1, y)
assert_eq(p2, y)

If you can provide a reproducible example, it would also help us to understand what those tests are missing.

That might also be helpful in answering this related question I'd want to investigate:

does this bug only affect the binary objective, or should other LightGBM built-in objectives also be changed in this way?

@loveclj (Contributor, Author) commented Jun 3, 2021

The models created in distributed mode are identical because GBDT calls Network::GlobalSyncUpByMean to sync the init score by mean:

double ObtainAutomaticInitialScore(const ObjectiveFunction* fobj, int class_id) {
  double init_score = 0.0;
  if (fobj != nullptr) {
    init_score = fobj->BoostFromScore(class_id);
  }
  if (Network::num_machines() > 1) {
    init_score = Network::GlobalSyncUpByMean(init_score);
  }
  return init_score;
}

However, in the BinaryLogLoss::BoostFromScore method, initscore and pavg depend only on local data. This confused me when I checked the logs of all machines, because the initscore and pavg printed in the logs are local values, not values synced up by Network::GlobalSyncUpByMean:

double pavg = suml / sumw;
pavg = std::min(pavg, 1.0 - kEpsilon);
pavg = std::max<double>(pavg, kEpsilon);
double initscore = std::log(pavg / (1.0f - pavg)) / sigmoid_;
Log::Info("[%s:%s]: pavg=%f -> initscore=%f", GetName(), __func__, pavg, initscore);

When the label distributions are almost the same across machines, it is acceptable to obtain the global initscore by reducing the local initscores by mean. But when the label distributions differ between machines, there can be a large gap between the mean-reduced initscore and the "real" init score computed from all of the data, which may slow down training convergence.

does this bug only affect the binary objective, or should other LightGBM built-in objectives also be changed in this way?

Yes, this issue also exists in the regression and xentropy objectives:
https://github.com/microsoft/LightGBM/blob/3dd4a3f9339b79c994f0286cdd9cc316782278d6/src/objective/regression_objective.hpp
https://github.com/microsoft/LightGBM/blob/3dd4a3f9339b79c994f0286cdd9cc316782278d6/src/objective/xentropy_objective.hpp

@shiyu1994 (Collaborator) left a comment

Thanks for the fix. This is really important for getting consistent results with data-distributed training. I notice that the regression objective needs to synchronize information for its initial score too, but that case is a little more complicated. Maybe we can leave it for another PR.

@jameslamb (Collaborator) commented

Got it! @shiyu1994 could you write up a feature request issue describing this bug for the regression and xentropy objectives? (and any others you think might need it). I'd write this up but I think you can probably provide a more useful and specific description of the problem than I can.

@jameslamb (Collaborator) left a comment

Thanks very much! Approving since all the Dask tests are passing and since @shiyu1994 approved.

@jameslamb (Collaborator) commented

I've created #4405 to document the future work for other objectives. To keep the conversation in one place, I am locking further discussion on this pull request.

@microsoft microsoft locked as resolved and limited conversation to collaborators Jun 25, 2021
@shiyu1994 (Collaborator) commented

@jameslamb Thank you for writing that. I will follow that thread to synchronize initial scores for other objectives.

@loveclj loveclj deleted the init_score_sync branch March 28, 2022 06:32