
sync for init score of binary objective function #4332

Merged
merged 1 commit on Jun 25, 2021

Conversation

@loveclj (Contributor) commented Jun 1, 2021

When training a binary classifier in distributed mode, the init score should depend on the full training dataset, not only the local training dataset.

@jameslamb (Collaborator) left a comment

Thank you for your interest in LightGBM!

To ensure an effective review and make the best use of maintainers' limited time, please update this pull request's description with an explanation of why you think this change should be made. If this is related to an existing feature request or bug report, please link to that as well.

@loveclj loveclj requested a review from jameslamb June 1, 2021 05:36
@loveclj (Contributor, Author) left a comment

When training a binary classifier in distributed mode, the init score should depend on the full training dataset, not only the local training dataset.

@StrikerRUS StrikerRUS added the fix label Jun 2, 2021
@jameslamb (Collaborator) commented

Ok, thanks! I'd like to hear what @shiyu1994 says.

In the meantime, is it possible to provide a reproducible example demonstrating the bug that this pull request is intended to fix?

This project's dask module (a Python interface for distributed training) has unit tests that check that a binary classification model created with data-parallel training and a model trained only on local data produce identical predictions:

assert_eq(s1, s2)
assert_eq(p1, p2)
assert_eq(p1, y)
assert_eq(p2, y)

If you can provide a reproducible example, it would also help us to understand what those tests are missing.

That might also be helpful in answering this related question I'd want to investigate:

does this bug only affect the binary objective, or should other LightGBM built-in objectives also be changed in this way?

@loveclj (Contributor, Author) commented Jun 3, 2021

The models created in distributed mode are identical because GBDT calls Network::GlobalSyncUpByMean to sync the init score by mean:

double ObtainAutomaticInitialScore(const ObjectiveFunction* fobj, int class_id) {
  double init_score = 0.0;
  if (fobj != nullptr) {
    init_score = fobj->BoostFromScore(class_id);
  }
  if (Network::num_machines() > 1) {
    init_score = Network::GlobalSyncUpByMean(init_score);
  }
  return init_score;
}

However, in the BinaryLogLoss::BoostFromScore method, initscore and pavg depend only on local data. This confused me when I checked the logs of all machines, because the initscore and pavg printed in the logs are local values, not values synced up by Network::GlobalSyncUpByMean:

double pavg = suml / sumw;
pavg = std::min(pavg, 1.0 - kEpsilon);
pavg = std::max<double>(pavg, kEpsilon);
double initscore = std::log(pavg / (1.0f - pavg)) / sigmoid_;
Log::Info("[%s:%s]: pavg=%f -> initscore=%f", GetName(), __func__, pavg, initscore);

When the label distributions are almost the same across machines, it is acceptable to obtain the global initscore by reducing the local initscores by mean. But when the label distributions differ between machines, there can be a large gap between the mean-reduced initscore and the "real" init score computed from all of the data, which may slow down training convergence.

does this bug only affect the binary objective, or should other LightGBM built-in objectives also be changed in this way?

Yes, this issue also exists in the regression and xentropy objectives:
https://github.com/microsoft/LightGBM/blob/3dd4a3f9339b79c994f0286cdd9cc316782278d6/src/objective/regression_objective.hpp
https://github.com/microsoft/LightGBM/blob/3dd4a3f9339b79c994f0286cdd9cc316782278d6/src/objective/xentropy_objective.hpp

@shiyu1994 (Collaborator) left a comment

Thanks for the fix. This is really important for getting consistent results with data-distributed training. I notice that the regression objective needs to synchronize information for its initial score too, but that case is a little more complicated. Maybe we can leave it for another PR.

@jameslamb (Collaborator) commented

Got it! @shiyu1994 could you write up a feature request issue describing this bug for the regression and xentropy objectives? (and any others you think might need it). I'd write this up but I think you can probably provide a more useful and specific description of the problem than I can.

@jameslamb (Collaborator) left a comment

Thanks very much! Approving since all the Dask tests are passing and since @shiyu1994 approved.

@jameslamb (Collaborator) commented

I've created #4405 to document the future work for other objectives. To keep the conversation in one place, I am locking further discussion on this pull request.

@microsoft microsoft locked as resolved and limited conversation to collaborators Jun 25, 2021
@shiyu1994 (Collaborator) commented

@jameslamb Thank you for writing that. I will follow that thread to synchronize initial scores for other objectives.

@loveclj loveclj deleted the init_score_sync branch March 28, 2022 06:32