[fix] Fix bug in data distributed learning with local empty leaf #4185
In data distributed learning (including both `data` and `voting` modes), when the value distribution of a feature differs too much across machines (for example, two machines have totally non-overlapping ranges of the same feature), there can be a leaf which contains 0 local data on a machine during training. (Note that this happens more easily with `pre_partition=true`. With `pre_partition=false`, the data partition is random, so the feature value distributions across machines are likely to be similar.) In that case, the histogram buffers of that leaf won't be cleared, because the `ConstructHistograms` method in `Dataset` will simply exit early (see `LightGBM/include/LightGBM/dataset.h`, lines 470 to 478 at commit 6ad3e6e).
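To make the failure mode concrete, here is a minimal sketch of the early-exit pattern; the signature and types are simplified stand-ins, not LightGBM's actual API:

```cpp
#include <vector>

// Simplified stand-in for a per-leaf histogram bin.
struct HistEntry {
  double sum_gradients;
  double sum_hessians;
};

void ConstructHistograms(int num_data_in_leaf,
                         std::vector<HistEntry>* hist_buffer) {
  if (num_data_in_leaf <= 0) {
    // Early exit: the buffer is neither filled nor cleared, so it still
    // holds whatever the previous iteration accumulated in this slot.
    return;
  }
  // ... otherwise accumulate local gradients/hessians into hist_buffer ...
}
```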
Thus the histogram clear code in this method is not executed (see `LightGBM/src/io/dataset.cpp`, lines 1192 to 1193 at commit 6ad3e6e).
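The skipped clear is essentially a zero-fill over the leaf's bins; a minimal sketch, with an illustrative buffer layout rather than the actual code:

```cpp
#include <cstring>

// Illustrative: zero a leaf's histogram bins, where each bin stores a
// gradient sum and a hessian sum as two doubles.
void ClearLeafHistogram(double* hist_data, int num_bins) {
  std::memset(hist_data, 0, num_bins * 2 * sizeof(double));
}
```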
Nevertheless, with data distributed learning the local histograms are transferred to other machines to compute the global histograms. For example, in `data` mode this happens in `LightGBM/src/treelearner/data_parallel_tree_learner.cpp`, lines 159 to 170 at commit fba18e4.
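A minimal sketch of the effect of that synchronization step, assuming an MPI-style environment (LightGBM actually uses its own reduce-scatter in the network layer; `MPI_Allreduce` is just a stand-in to show the element-wise sum):

```cpp
#include <mpi.h>
#include <vector>

void SyncHistograms(std::vector<double>* local_hist) {
  // Every machine contributes its local histogram; entries are summed
  // element-wise. If one machine's buffer still holds last iteration's
  // values for an empty local leaf, those stale values pollute the
  // global sum.
  MPI_Allreduce(MPI_IN_PLACE, local_hist->data(),
                static_cast<int>(local_hist->size()),
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
```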
Since the histogram buffer is not cleared, content from the previous iteration remains in it, which results in a wrong global histogram after synchronization.
This PR guarantees that, under data distributed learning, when an empty local leaf appears, its histogram content is cleared before synchronization. It should fix issue #4026 and potentially issue #4178.
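A minimal sketch of the idea behind the fix; the function name and buffer layout are hypothetical, not the PR's actual code:

```cpp
#include <cstring>

// Illustrative fix: before synchronization, make sure an empty local
// leaf contributes an all-zero histogram instead of stale values.
void PrepareLeafHistogramForSync(int num_data_in_leaf,
                                 double* hist_data, int num_bins) {
  if (num_data_in_leaf == 0) {
    // Clear stale content so this machine adds zeros to the global sum.
    std::memset(hist_data, 0, num_bins * 2 * sizeof(double));
  }
}
```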