
Add device memory usage profiling. #3795

Closed · wants to merge 1 commit

Conversation

trivialfis (Member)

This is another small step toward optimizing the memory usage of the gpu_hist method. There might be a better way to handle the backtrace information than printing everything; suggestions are welcome.

hcho3 (Collaborator) commented Oct 17, 2018

@trivialfis I'm curious, what is the benefit of building a custom monitor as opposed to using nvprof?

trivialfis (Member, Author)

@hcho3 It's strange: with nvprof I can find all kinds of detailed information, but not memory usage.

hcho3 (Collaborator) commented Oct 17, 2018

@trivialfis So nvprof doesn't give you memory usage information?

trivialfis (Member, Author)

@hcho3 As far as I know, it doesn't.

hcho3 (Collaborator) commented Oct 17, 2018

@trivialfis That's interesting. I now see why you are working on this pull request.

trivialfis (Member, Author) commented Oct 17, 2018

@hcho3 Hi, do you have any insight into how to force the linking in the makefile? Currently the R test fails with:

```
dlopen(/Users/travis/build/dmlc/xgboost/xgboost/xgboost.Rcheck/xgboost/libs/xgboost.so, 6): Symbol not found: __ZN7xgboost6common34__dmlc_registry_file_tag_monitor__Ev
```

Actually, the added dmlc tag was my attempt to solve the problem; previously the error was that some member functions' definitions went missing.

One way to solve it is to inline everything, but that would put a heavy burden on compilation, since the log messages of Monitor and DeviceMemStat change frequently; even adding a period to a message would force a recompilation of the whole project.

One odd thing I notice is that two underscores are prefixed to the missing symbol, where there should be only one. If there's something in the makefile I'm missing, please let me know; otherwise I'll have to inline all these functions before finishing this PR. Thanks!
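
For context, this is roughly how the dmlc-core tag pair is meant to work (macro names from dmlc-core's registry.h; the placement below is illustrative, not the actual patch):

```cpp
// In src/common/monitor.cc: defines a uniquely named int function; its
// mangled name is exactly the symbol reported missing by dlopen above.
DMLC_REGISTRY_FILE_TAG(monitor);

// In a translation unit that is always linked: declares that function and
// calls it from a static initializer, forcing the linker to keep monitor.o.
DMLC_REGISTRY_LINK_TAG(monitor);
```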

hcho3 (Collaborator) commented Oct 17, 2018

@trivialfis You should add your source file to amalgamation/xgboost-all0.cc, since this is how the R package links all source files.
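
An illustrative sketch of that fix, assuming the new file is src/common/monitor.cc (the path listed in the Codecov report below). The R build compiles this single amalgamated translation unit, so any source file missing from it never ends up in xgboost.so:

```cpp
// amalgamation/xgboost-all0.cc (illustrative excerpt)
#include "../src/common/common.cc"
#include "../src/common/monitor.cc"  // new file from this PR; without this
                                     // line the R package never links it
```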

trivialfis (Member, Author) commented Oct 17, 2018

@hcho3 Thanks! You just saved my day.

trivialfis force-pushed the profiling branch 2 times, most recently from e6c4758 to ebcdce0 on October 18, 2018 09:46.
trivialfis (Member, Author) commented Oct 19, 2018

@RAMitchell @hcho3
Initial results from simple testing with the epsilon dataset (500000 x 2000, dense), using:

```python
params = {'n_gpus': -1, 'debug_verbose': 5,
          'tree_method': 'gpu_hist',
          'objective': 'binary:logistic'}
num_boost_round = 256
```

```
Device memory usage summary: ===========
Peak usage: 9.177299 GB.
Number of allocations: 72 times.
```

The biggest memory consumer seems to be the single-GPU predictor, which also causes extremely imbalanced memory usage across GPUs. Here is the trace obtained for the related memory allocation:

```
Pointer: 0x7fedde000000
Peak usage: 5.963445 GB
Traces:
  XGBoosterUpdateOneIter+0x48
  dmlc::StackTrace[abi:cxx11]()+0x53
  ffi_call_unix64+0x4c
  xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x220
  xgboost::common::DeviceMemoryStat::Allocate(void*, unsigned long)+0x1ec
  xgboost::gbm::GBTree::CommitModel(std::vector<std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >, std::allocator<std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > > > >&&)+0x2d0
  xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::ObjFunction*)+0x85b
  xgboost::predictor::DeviceMatrix::DeviceMatrix(xgboost::DMatrix*, int, bool)+0x16f
  xgboost::predictor::GPUPredictor::DevicePredictInternal(xgboost::DMatrix*, xgboost::HostDeviceVector<float>*, xgboost::gbm::GBTreeModel const&, unsigned long, unsigned long)+0xef0
  xgboost::predictor::GPUPredictor::UpdatePredictionCache(xgboost::gbm::GBTreeModel const&, std::vector<std::unique_ptr<xgboost::TreeUpdater, std::default_delete<xgboost::TreeUpdater> >, std::allocator<std::unique_ptr<xgboost::TreeUpdater, std::default_delete<xgboost::TreeUpdater> > > >*, int)+0xad
```

We should try it again after #3738 is merged.

trivialfis (Member, Author)

New profiling result with the multi-GPU predictor in place:
usage.txt

codecov-io commented Nov 5, 2018

Codecov Report

Merging #3795 into master will decrease coverage by 0.19%.
The diff coverage is 20.51%.


```diff
@@             Coverage Diff             @@
##             master    #3795     +/-   ##
===========================================
- Coverage     51.95%   51.76%   -0.2%
+ Complexity      203      196      -7
===========================================
  Files           182      182
  Lines         14500    14479     -21
  Branches        495      489      -6
===========================================
- Hits           7534     7495     -39
- Misses         6728     6750     +22
+ Partials        238      234      -4
```
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| src/gbm/gblinear.cc | 14.45% <ø> (ø) | 0 <0> (ø) |
| src/gbm/gbtree.cc | 18.67% <ø> (ø) | 0 <0> (ø) |
| src/linear/updater_coordinate.cc | 100% <ø> (ø) | 0 <0> (ø) |
| src/learner.cc | 30.35% <100%> (+2.73%) | 0 <0> (ø) |
| src/common/transform.h | 88.88% <100%> (-0.4%) | 0 <0> (ø) |
| src/common/common.cc | 100% <100%> (ø) | 0 <0> (ø) |
| src/common/monitor.cc | 4.91% <4.91%> (ø) | 0 <0> (?) |
| src/common/monitor.h | 72.41% <72.41%> (ø) | 0 <0> (?) |
| tests/cpp/helpers.cc | 90.47% <0%> (-0.44%) | 0% <0%> (ø) |
| tests/cpp/data/test_simple_dmatrix.cc | 97.56% <0%> (-0.17%) | 0% <0%> (ø) |

... and 22 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Powered by Codecov. Last update 1698fe6...2164176.

hcho3 (Collaborator) commented Nov 12, 2018

@trivialfis FYI, I am going to update the dmlc-core submodule after dmlc/dmlc-core#481 is merged. Then we can set the stacktrace size at runtime.

trivialfis (Member, Author)

@hcho3 The code itself is ready now; we're just not sure whether this chunk of code is worth merging. The wrappers around memory allocations are not trivial.
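
To illustrate what "wrappers around memory allocations" means here, a minimal sketch of an allocation recorder in the spirit of the DeviceMemoryStat seen in the trace above. Only the names DeviceMemoryStat and Allocate appear in the actual trace; the class body and the ProfiledMalloc/ProfiledFree helpers are hypothetical:

```cpp
#include <cuda_runtime.h>
#include <dmlc/logging.h>  // dmlc::StackTrace(), as seen in the trace above

#include <algorithm>
#include <map>
#include <mutex>

// Hypothetical sketch: records every device allocation so that peak usage,
// allocation counts, and backtraces can be reported later.
class DeviceMemoryStat {
 public:
  static DeviceMemoryStat& Ins() {
    static DeviceMemoryStat inst;  // singleton shared by all wrappers
    return inst;
  }
  void Allocate(void* ptr, size_t size) {
    std::lock_guard<std::mutex> lock(mutex_);
    allocations_[ptr] = size;
    usage_ += size;
    peak_usage_ = std::max(peak_usage_, usage_);
    LOG(INFO) << "Allocated " << size << " bytes:\n" << dmlc::StackTrace();
  }
  void Deallocate(void* ptr) {
    std::lock_guard<std::mutex> lock(mutex_);
    usage_ -= allocations_[ptr];
    allocations_.erase(ptr);
  }
  size_t PeakUsage() const { return peak_usage_; }

 private:
  std::mutex mutex_;
  size_t usage_{0};
  size_t peak_usage_{0};
  std::map<void*, size_t> allocations_;  // live allocations by pointer
};

// The non-trivial part: every cudaMalloc/cudaFree (and every thrust/cub
// allocator) in the code base has to be routed through wrappers like these.
inline cudaError_t ProfiledMalloc(void** ptr, size_t size) {
  cudaError_t err = cudaMalloc(ptr, size);
  if (err == cudaSuccess) DeviceMemoryStat::Ins().Allocate(*ptr, size);
  return err;
}
inline cudaError_t ProfiledFree(void* ptr) {
  DeviceMemoryStat::Ins().Deallocate(ptr);
  return cudaFree(ptr);
}
```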

hcho3 (Collaborator) commented Nov 13, 2018

@trivialfis Okay, maybe we can come back to it after merging the other refactoring PRs.

trivialfis (Member, Author) commented Nov 13, 2018

@RAMitchell @hcho3 Good news: the large memory usage from GPUPredictor::UpdatePredictionCache in the previous profiling was mainly because I put too much data in evals. I don't know why those datasets have prediction caches, but for simple training without an eval set, the memory usage is quite modest.

trivialfis (Member, Author)

Closing for now.

trivialfis closed this on Dec 26, 2018.