[fix] Fix bug in data distributed learning with local empty leaf #4185
In data distributed learning (including both `data` and `voting` modes), when the value distribution of a feature differs too much across machines (for example, two machines have totally non-overlapping ranges of the same feature), there can be a leaf which contains 0 local data on a machine during training. (Note that this happens more easily with `pre_partition=true`. With `pre_partition=false`, the data partition is random, so the feature value distributions across machines are likely to be similar.) In that case, the histogram buffers of that leaf won't be cleared, because the `ConstructHistograms` method in `Dataset` will simply exit early (see `LightGBM/include/LightGBM/dataset.h`, lines 470 to 478 at commit 6ad3e6e).
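To make the failure mode concrete, here is a minimal sketch of the early-exit pattern; the signature and types are simplified stand-ins, not LightGBM's actual API:

```cpp
#include <vector>

// Simplified stand-in for a per-leaf histogram bin.
struct HistEntry {
  double sum_gradients;
  double sum_hessians;
};

void ConstructHistograms(int num_data_in_leaf,
                         std::vector<HistEntry>* hist_buffer) {
  if (num_data_in_leaf <= 0) {
    // Early exit: the buffer is neither filled nor cleared, so it still
    // holds whatever the previous iteration accumulated in this slot.
    return;
  }
  // ... otherwise accumulate local gradients/hessians into hist_buffer ...
}
```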
Thus the histogram clear code in this method is not executed (see `LightGBM/src/io/dataset.cpp`, lines 1192 to 1193 at commit 6ad3e6e).
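The skipped clear is essentially a zero-fill over the leaf's bins; a minimal sketch, with an illustrative buffer layout rather than the actual code:

```cpp
#include <cstring>

// Illustrative: zero a leaf's histogram bins, where each bin stores a
// gradient sum and a hessian sum as two doubles.
void ClearLeafHistogram(double* hist_data, int num_bins) {
  std::memset(hist_data, 0, num_bins * 2 * sizeof(double));
}
```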
Nevertheless, with data distributed learning the local histograms are transferred to other machines to compute the global histograms. For example, in `data` mode this happens in `LightGBM/src/treelearner/data_parallel_tree_learner.cpp`, lines 159 to 170 at commit fba18e4.
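A minimal sketch of the effect of that synchronization step, assuming an MPI-style environment (LightGBM actually uses its own reduce-scatter in the network layer; `MPI_Allreduce` is just a stand-in to show the element-wise sum):

```cpp
#include <mpi.h>
#include <vector>

void SyncHistograms(std::vector<double>* local_hist) {
  // Every machine contributes its local histogram; entries are summed
  // element-wise. If one machine's buffer still holds last iteration's
  // values for an empty local leaf, those stale values pollute the
  // global sum.
  MPI_Allreduce(MPI_IN_PLACE, local_hist->data(),
                static_cast<int>(local_hist->size()),
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
```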
Since the histogram buffer is not cleared, content from the previous iteration remains in it, which results in a wrong global histogram after synchronization.
This PR guarantees that, under data distributed learning, when an empty local leaf appears, its histogram content is cleared before synchronization. It should fix issue #4026 and potentially issue #4178.
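A minimal sketch of the idea behind the fix; the function name and buffer layout are hypothetical, not the PR's actual code:

```cpp
#include <cstring>

// Illustrative fix: before synchronization, make sure an empty local
// leaf contributes an all-zero histogram instead of stale values.
void PrepareLeafHistogramForSync(int num_data_in_leaf,
                                 double* hist_data, int num_bins) {
  if (num_data_in_leaf == 0) {
    // Clear stale content so this machine adds zeros to the global sum.
    std::memset(hist_data, 0, num_bins * 2 * sizeof(double));
  }
}
```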