
CPU optimizations #4433

Closed · wants to merge 57 commits

Conversation

SmirnovEgorRu (Contributor)

This PR contains the changes from #4278 without the pre-processing optimizations (those have already been merged in #4310).

In addition, I refactored the code to make it simpler. In the previous version I used OpenMP tasks, but they are not supported by MSVC by default, so I replaced them with a simple omp parallel for.
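
For illustration only, a minimal sketch (hypothetical function names, not this PR's actual code) of the kind of replacement described: OpenMP tasks swapped for a plain `omp parallel for`, since MSVC's default OpenMP 2.0 support does not include tasks.

```cpp
// Sketch only: hypothetical names, not the PR's actual code.
#include <cstdio>
#include <vector>

void BuildHistForNode(int nid) {          // hypothetical per-node work item
  std::printf("building histogram for node %d\n", nid);
}

void BuildHistograms(const std::vector<int>& nodes) {
  // Before: each node was an "#pragma omp task" inside a single region,
  // which MSVC's OpenMP 2.0 support does not provide.
  // After: the same work items are distributed with a plain parallel for.
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < static_cast<int>(nodes.size()); ++i) {
    BuildHistForNode(nodes[i]);
  }
}
```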

The code changes look massive, but it is hard to split them into smaller parts.

@trivialfis (Member) commented May 6, 2019

It seems this is based on an old commit. Is it possible to make a clean rebase?

@hcho3 (Collaborator) commented May 6, 2019

I'll review this pull request after the 0.90 release

@hcho3 (Collaborator) commented May 23, 2019

@SmirnovEgorRu Looks like this pull request is breaking some distributed training tests: https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-4433/7/pipeline#step-145-log-581

@SmirnovEgorRu (Contributor, Author)

@hcho3 Yes, now I want to understand how distributed mode works. Do we have any materials about this?
Potentially, my changes should optimize distributed mode too, but the code needs to be changed.

@hcho3 (Collaborator) commented May 24, 2019

@SmirnovEgorRu The XGBoost paper has some details about distributed training. Roughly speaking, every occurrence of AllReduce causes a synchronization of all nodes.
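
For context, a minimal sketch (not code from this PR) of the kind of synchronization point meant here, using rabit's Allreduce: the call is a blocking collective, so every worker must execute the same sequence of Allreduce calls.

```cpp
#include <rabit/rabit.h>
#include <vector>

// Each worker accumulates a local histogram, then Allreduce sums the buffers
// element-wise across all workers. The call only returns once every worker
// has reached it, so mismatched call sequences stall or break training.
void SyncHistogram(std::vector<double>* hist) {
  rabit::Allreduce<rabit::op::Sum>(hist->data(), hist->size());
}
```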

@hcho3 (Collaborator) commented May 24, 2019

Let me look at this over the weekend and see why it's breaking distributed training.

@trivialfis (Member) commented May 25, 2019

Hi @SmirnovEgorRu, @hcho3, @RAMitchell, I want to revert part of #4310. That PR improved performance by allocating a per-thread buffer, specifically:

buff[tid].resize(block_size * ncol);

This makes it very hard to load sparse datasets. For example, the first line of the concatenated URL dataset from #2326 has a largest index of 3224681 but only 108 non-zero columns. For a 16-thread machine, this loop will allocate 16 * 3224681 * 512 * 2 * sizeof(float) bytes (about 196.8 GiB), which means my machine won't even be able to load a single line of data. And just for future reference, allocating memory based on the feature index might not be a good idea. :-(
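
For reference, a small standalone reproduction of the arithmetic above (values taken from the comment: 16 threads, block size 512, largest feature index 3224681, 2 floats per bin); the factors are the ones quoted, so this is only a check of the quoted numbers, not of the PR's actual allocation code.

```cpp
#include <cstdio>

int main() {
  // nthread * max_feature_index * block_size * (grad, hess) * sizeof(float)
  const double nthread = 16, max_index = 3224681, block_size = 512, per_bin = 2;
  const double bytes = nthread * max_index * block_size * per_bin * sizeof(float);
  std::printf("%.1f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));  // prints ~196.8 GiB
  return 0;
}
```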

@SmirnovEgorRu (Contributor, Author)

@hcho3, it looks like all issues are resolved, including distributed mode.

@trivialfis (Member) left a comment:

Thanks for the PR. We need to do some benchmarks on different datasets. But before that, please

  1. make a clean rebase
  2. add unit test cases for new/changed functions to show their correctness.

Performance is good, but please don't achieve it at the expense of maintainability. There are many improvements pending for the hist algorithm, so please review your PR first and see if there's a cleaner way of doing things. ;-)

I will add a detailed review once the code becomes cleaner.

};

namespace tree {
class SplitEvaluator;
Reviewer (Member):

Is this used somewhere in this file?

row_ptr_[nid] = data_.size();
data_.resize(data_.size() + nbins_);
if (data_arr_[nid] == nullptr) {
data_arr_[nid] = new std::vector<GradStatHist>;
Reviewer (Member):

Why do you need to use new? I would try to avoid raw new and delete altogether.
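
For illustration, one way to avoid the raw new, sketched under the assumption that data_arr_ owns per-node histogram buffers (hypothetical simplified types, not the PR's actual code):

```cpp
#include <memory>
#include <vector>

struct GradStatHist { double sum_grad{0.0}, sum_hess{0.0}; };  // simplified stand-in

// Owning container: no explicit new/delete, buffers are freed automatically.
std::vector<std::unique_ptr<std::vector<GradStatHist>>> data_arr_;

void EnsureNodeBuffer(size_t nid, size_t nbins) {
  if (data_arr_.size() <= nid) data_arr_.resize(nid + 1);
  if (!data_arr_[nid]) {
    data_arr_[nid] = std::make_unique<std::vector<GradStatHist>>(nbins);
  }
}
```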

}

void BuildBlockHist(const std::vector<GradientPair>& gpair,
const RowSetCollection::Elem row_indices,
Reviewer (Member):

The indentation is incorrect.

@@ -206,7 +214,8 @@ class QuantileHistMock : public QuantileHistMaker {
}

/* Now compare against result given by EvaluateSplit() */
RealImpl::EvaluateSplit(0, gmat, hist_, *(*dmat), tree);
EvaluateSplitsBatch(nodes, gmat, **dmat, hist_is_init, hist_buffers, const_cast<RegTree*>(&tree));
Reviewer (Member):

Why is a const_cast necessary?

@@ -15,54 +15,23 @@
#include <vector>
#include <string>
#include <queue>
#include <deque>
Reviewer (Member):

I couldn't find where you use deque

const uint32_t idx_bin = 2*index[j];
const size_t idx_gh = 2*rid[i];
grad_stat.sum_grad += pgh[idx_gh];
grad_stat.sum_hess += pgh[idx_gh+1];
Reviewer (Member):

Please add some notes about the memory layout.
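
For what it's worth, the indexing above suggests an interleaved (grad, hess) layout; below is a standalone sketch of that reading (an inference from this snippet, not a confirmed description of the PR):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Inferred layout (assumption): pgh is a flat float array of per-row
// (gradient, hessian) pairs, and the histogram stores per-bin pairs the
// same way, so each logical element occupies two consecutive slots.
//   pgh[2*rid + 0]  -> gradient of row rid
//   pgh[2*rid + 1]  -> hessian  of row rid
//   hist[2*bin + 0] -> accumulated gradient of bin
//   hist[2*bin + 1] -> accumulated hessian  of bin
void AccumulateOne(const std::vector<float>& pgh, std::size_t rid,
                   std::uint32_t bin, std::vector<double>* hist) {
  const std::uint32_t idx_bin = 2 * bin;
  const std::size_t idx_gh = 2 * rid;
  (*hist)[idx_bin]     += pgh[idx_gh];       // gradient
  (*hist)[idx_bin + 1] += pgh[idx_gh + 1];   // hessian
}
```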

*/
using GHistRow = Span<tree::GradStats>;
struct GradStatHist {
Reviewer (Member):

So can we remove GradStats?

global_start = dmlc::GetTime();
}

inline void EndPerfMonitor() {
Reviewer (Member):

This is from an old commit. It's no longer here. Could you make a clean rebase?

const size_t n_local_histograms = std::min(nthread, n_local_blocks);

for (size_t j = 0; j < n_local_blocks; ++j) {
task_nid.push_back(nid);
Reviewer (Member):

Can this be done with one resize?
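
For illustration, a minimal sketch of the suggestion (names borrowed from the snippet above, otherwise hypothetical): grow the vector once instead of calling push_back inside the loop.

```cpp
#include <cstddef>
#include <vector>

// One resize with a fill value replaces n_local_blocks push_back calls,
// avoiding repeated capacity checks and reallocations inside the loop.
void AppendTaskNodes(std::vector<int>* task_nid, int nid, std::size_t n_local_blocks) {
  task_nid->resize(task_nid->size() + n_local_blocks, nid);
}
```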

hist_is_init->resize(nodes.size());

// input data for tasks
int32_t n_tasks = 0;
Reviewer (Member):

What task?

rongou and others added 18 commits June 2, 2019 16:45
2 additional digits are not needed to guarantee that casting the decimal representation will result in the same float, see #3980 (comment)
* Make AUCPR work with multiple query groups

* Check AUCPR <= 1.0 in distributed setting
…per-group (#4216)

* In AUC and AUCPR metrics, detect whether weights are per-instance or per-group

* Fix C++ style check

* Add a test for weighted AUC
* Change memory dump size in R test.
Add tutorial on missing values and how to handle those within XGBoost.
* Add `BUILD_WITH_SHARED_NCCL` to CMake.
…xplicitly given in XGBoost4J-Spark (#4446)

* Automatically set maximize_evaluation_metrics if not explicitly given.

* When custom_eval is set, require maximize_evaluation_metrics.

* Update documents on early stop in XGBoost4J-Spark.

* Fix code error.
* [CI] Upgrade to Spark 2.4.2

* Pass Spark version to build script

* Allow multiple --build-arg in ci_build.sh

* Fix syntax

* Fix container name

* Update pom.xml

* Fix container name

* Update Jenkinsfile

* Update pom.xml

* Update Dockerfile.jvm_cross
* mgpu prediction using explicit sharding
* [CI] Build XGBoost wheels with CUDA 9.0

* Do not call archiveArtifacts for 8.0 wheel
* Fix #4462: Use /MT flag consistently for MSVC target

* First attempt at Windows CI

* Distinguish stages in Linux and Windows pipelines

* Try running CMake in Windows pipeline

* Add build step
SmirnovEgorRu and others added 26 commits June 2, 2019 16:45
* Fix dask API sphinx docstrings

* Update GPU docs page
* Only define `gpu_id` and `n_gpus` in `LearnerTrainParam`
* Pass LearnerTrainParam through XGBoost via factory method.
* Disable all GPU usage when GPU-related parameters are not specified (fixes XGBoost choosing GPU over-aggressively).
* Test learner train param io.
* Fix gpu pickling.
* - training with external memory part 1 of 2
   - this pr focuses on computing the quantiles using multiple gpus on a
     dataset that uses the external cache capabilities
   - there will be a follow-up PR soon after this that will support creation
     of histogram indices on large dataset as well
   - both of these changes are required to support training with external memory
   - the sparse pages in dmatrix are taken in batches and the cut matrices
     are incrementally built
   - also snuck in some (perf) changes related to sketch aggregation amongst multiple
     features across multiple sparse page batches. Instead of aggregating the summary
     inside each device and merging it later, it is aggregated in place when the device
     is working on different rows but the same feature
* simplify the config.h file

* revise config.h

* revised config.h

* revise format

* revise format issues

* revise whitespace issues

* revise whitespace namespace format issues

* revise namespace format issues

* format issues

* format issues

* format issues

* format issues

* Revert submodule changes

* minor change

* Update src/common/config.h

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* address format issue from trivialfis

* Use correct cub submodule
…4519)

* Smarter choice of histogram construction for distributed gpu_hist

* Limit omp team size in ExecuteShards
The lock bot locked this as resolved and limited the conversation to collaborators on Aug 31, 2019.