CPU optimizations #4433
Conversation
It seems this is based on an old commit. Is it possible to make a clean rebase?
I'll review this pull request after the 0.90 release.
@SmirnovEgorRu Looks like this pull request is breaking some distributed training tests: https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-4433/7/pipeline#step-145-log-581
@hcho3 Yes, now I want to understand how distributed mode works. Do we have any materials about this?
@SmirnovEgorRu The XGBoost paper has some details about distributed training. Roughly speaking, every occurrence of AllReduce causes synchronization of all nodes.
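A minimal sketch of that synchronization point, using the rabit collectives that XGBoost builds on; the histogram size and the Sum reduction here are illustrative assumptions, not values from this PR:

```cpp
#include <rabit/rabit.h>
#include <vector>

int main(int argc, char* argv[]) {
  rabit::Init(argc, argv);
  std::vector<double> hist(256, 0.0);  // assumed per-worker partial histogram
  // ... each worker accumulates statistics from its local rows into hist ...
  // Every worker blocks here until all workers reach this call; afterwards
  // each one holds the element-wise sum across the whole cluster.
  rabit::Allreduce<rabit::op::Sum>(hist.data(), hist.size());
  rabit::Finalize();
  return 0;
}
```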
Let me look at this over the weekend and see why it's breaking distributed training.
Hi @SmirnovEgorRu @hcho3 @RAMitchell, I want to revert part of #4310. That PR improved performance by pre-allocating a buffer, specifically: xgboost/src/common/hist_util.cc, line 111 in 8ddd271.
This makes it very hard to load sparse datasets. For example, the first line of a concatenated URL dataset from #2326 has largest index
@hcho3, looks like all issues are resolved, including distributed mode.
Thanks for the PR. We need to do some benchmarks on different datasets. But before that, please
- make a clean rebase
- add unit test cases for new/changed functions to show their correctness.
Performance is good, but please don't pursue it at the expense of maintainability. There are many improvements pending for the hist algorithm, so please review your PR first and see if there's a cleaner way of doing things. ;-)
Will add a detailed review once the code becomes cleaner.
};

namespace tree {
class SplitEvaluator;
Is this used somewhere in this file?
row_ptr_[nid] = data_.size();
data_.resize(data_.size() + nbins_);
if (data_arr_[nid] == nullptr) {
  data_arr_[nid] = new std::vector<GradStatHist>;
Why do you need to use new? I would try to avoid raw new and delete altogether.
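One way to do that (a sketch only; the `EnsureNodeHist` helper and the minimal `GradStatHist` stand-in are assumptions, not code from this PR) is to let the container own the per-node histograms through std::unique_ptr:

```cpp
#include <memory>
#include <vector>

struct GradStatHist { double sum_grad{0.0}, sum_hess{0.0}; };  // minimal stand-in

// The vector owns each per-node histogram; destruction is automatic and
// no raw new/delete appears at any call site.
std::vector<std::unique_ptr<std::vector<GradStatHist>>> data_arr_;

void EnsureNodeHist(size_t nid, size_t nbins) {
  if (nid >= data_arr_.size()) data_arr_.resize(nid + 1);
  if (!data_arr_[nid]) {
    // std::make_unique needs C++14; data_arr_[nid].reset(new ...) works in C++11.
    data_arr_[nid] = std::make_unique<std::vector<GradStatHist>>(nbins);
  }
}
```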
}

void BuildBlockHist(const std::vector<GradientPair>& gpair,
const RowSetCollection::Elem row_indices,
The indentation is incorrect.
@@ -206,7 +214,8 @@ class QuantileHistMock : public QuantileHistMaker {
  }

  /* Now compare against result given by EvaluateSplit() */
  RealImpl::EvaluateSplit(0, gmat, hist_, *(*dmat), tree);
  EvaluateSplitsBatch(nodes, gmat, **dmat, hist_is_init, hist_buffers, const_cast<RegTree*>(&tree));
Why is a const_cast necessary?
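For context, the usual way to avoid the cast is to let the test own a mutable tree and pass a non-const pointer through. A generic sketch, where `RegTree` is a stand-in struct and `GrowOneNode` a hypothetical mutator, not this PR's code:

```cpp
#include <cassert>

struct RegTree { int num_nodes{1}; };  // stand-in, not xgboost's RegTree

// If a function must mutate the tree, take RegTree* in its signature
// instead of casting constness away at the call site.
void GrowOneNode(RegTree* tree) { ++tree->num_nodes; }

int main() {
  RegTree tree;        // owned non-const by the test itself
  GrowOneNode(&tree);  // no const_cast needed
  assert(tree.num_nodes == 2);
  return 0;
}
```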
@@ -15,54 +15,23 @@
#include <vector>
#include <string>
#include <queue>
#include <deque>
I couldn't find where you use `deque`.
const uint32_t idx_bin = 2*index[j];
const size_t idx_gh = 2*rid[i];
grad_stat.sum_grad += pgh[idx_gh];
grad_stat.sum_hess += pgh[idx_gh+1];
Please give some notes about memory layout.
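Something like the following could serve as those notes; the layout described is inferred from the indexing in the snippet above, not stated anywhere in the PR:

```cpp
// Assumed memory layout, inferred from the indexing above:
//
//   pgh  : gradient pairs flattened as [g_0, h_0, g_1, h_1, ...], so row
//          r's gradient is pgh[2*r] and its hessian is pgh[2*r + 1]
//          (hence idx_gh = 2 * rid[i]).
//
//   hist : one (sum_grad, sum_hess) pair per bin, interleaved the same
//          way, which is why idx_bin = 2 * index[j].
//
// Interleaving keeps each (grad, hess) pair adjacent in memory, so a
// histogram update touches one cache line instead of two.
```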
*/
using GHistRow = Span<tree::GradStats>;
struct GradStatHist {
So can we remove GradStats?
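For reference, a sketch of what the new struct appears to hold, with fields inferred from the `grad_stat.sum_grad` / `grad_stat.sum_hess` updates shown earlier (the PR's real struct may carry more members). If it fully subsumes tree::GradStats, the latter could indeed go:

```cpp
// Sketch only; inferred from usage, not copied from the PR.
struct GradStatHist {
  double sum_grad{0.0};  // gradient sum over rows mapped to one bin
  double sum_hess{0.0};  // hessian sum over the same rows

  void Add(double grad, double hess) {
    sum_grad += grad;
    sum_hess += hess;
  }
};
```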
  global_start = dmlc::GetTime();
}

inline void EndPerfMonitor() {
This is from an old commit. It's no longer here. Could you make a clean rebase?
const size_t n_local_histograms = std::min(nthread, n_local_blocks);

for (size_t j = 0; j < n_local_blocks; ++j) {
  task_nid.push_back(nid);
Can this be done by one resize?
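For reference, std::vector::resize takes a fill value, so a loop of push_backs can collapse into one call. A sketch (the `AppendTasks` wrapper is hypothetical, not code from this PR):

```cpp
#include <vector>

void AppendTasks(std::vector<int>* task_nid, size_t n_local_blocks, int nid) {
  // One grow-and-fill instead of n_local_blocks separate push_backs:
  task_nid->resize(task_nid->size() + n_local_blocks, nid);
}
```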
hist_is_init->resize(nodes.size());

// input data for tasks
int32_t n_tasks = 0;
What task?
Commits
- 2 additional digits are not needed to guarantee that casting the decimal representation will result in the same float, see #3980 (comment)
- Make AUCPR work with multiple query groups; check AUCPR <= 1.0 in the distributed setting
- In AUC and AUCPR metrics, detect whether weights are per-instance or per-group (#4216); fix C++ style check; add a test for weighted AUC
- Change memory dump size in R test
- Add tutorial on missing values and how to handle those within XGBoost
- Add `BUILD_WITH_SHARED_NCCL` to CMake
- Automatically set maximize_evaluation_metrics if not explicitly given in XGBoost4J-Spark (#4446); when custom_eval is set, require maximize_evaluation_metrics; update documents on early stopping in XGBoost4J-Spark; fix code error
- [CI] Upgrade to Spark 2.4.2; pass Spark version to build script; allow multiple --build-arg in ci_build.sh; fix syntax and container names; update pom.xml, Jenkinsfile, and Dockerfile.jvm_cross
- Multi-GPU prediction using explicit sharding
- [CI] Build XGBoost wheels with CUDA 9.0; do not call archiveArtifacts for the 8.0 wheel
- Fix #4462: use the /MT flag consistently for the MSVC target; first attempt at Windows CI; distinguish stages in Linux and Windows pipelines; try running CMake in the Windows pipeline; add build step
- Fix dask API sphinx docstrings; update GPU docs page
- Only define `gpu_id` and `n_gpus` in `LearnerTrainParam`; pass LearnerTrainParam through XGBoost via factory method; disable all GPU usage when GPU-related parameters are not specified (fixes XGBoost choosing GPU too aggressively); test learner train param I/O; fix GPU pickling
- Training with external memory, part 1 of 2: this PR focuses on computing the quantiles using multiple GPUs on a dataset that uses the external cache capabilities; a follow-up PR will support creation of histogram indices on large datasets as well (both changes are required to support training with external memory). The sparse pages in DMatrix are taken in batches and the cut matrices are built incrementally. Also snuck in some perf changes to sketch aggregation among multiple features across multiple sparse page batches: instead of aggregating the summary inside each device and merging later, it is aggregated in place when the device is working on different rows but the same feature
- Simplify and revise config.h (format, whitespace, and namespace fixes); revert submodule changes; use correct cub submodule (Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>)
- Smarter choice of histogram construction for distributed gpu_hist (#4519); limit omp team size in ExecuteShards
This PR contains the changes from #4278 without the pre-processing optimizations (those have already been merged in #4310).
In addition, I refactored the code to make it simpler. In the previous version I used omp tasks, but they are not supported by MSVC by default, so I replaced them with a simple omp parallel for; a sketch of the substitution follows below.
The code changes look massive, but it is hard to split them into smaller parts.
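A minimal sketch of that substitution (the `BuildHistForBlock` work item and the block count are illustrative assumptions, not this PR's actual code):

```cpp
#include <omp.h>
#include <cstdio>

// Hypothetical per-block work; stands in for the PR's histogram building.
void BuildHistForBlock(int block) {
  std::printf("block %d on thread %d\n", block, omp_get_thread_num());
}

int main() {
  const int n_blocks = 16;

  // The omp-task formulation needs OpenMP 3.0 tasks, which MSVC's
  // OpenMP 2.0 support does not provide:
  //
  //   #pragma omp parallel
  //   #pragma omp single nowait
  //   for (int i = 0; i < n_blocks; ++i) {
  //     #pragma omp task firstprivate(i)
  //     BuildHistForBlock(i);
  //   }

  // The portable replacement: a plain parallel for over the blocks.
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < n_blocks; ++i) {
    BuildHistForBlock(i);
  }
  return 0;
}
```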