Objective function evaluation on GPU with minimal PCIe transfers #2935

Merged
merged 7 commits into dmlc:master on Jan 12, 2018

Conversation

teju85
Contributor

@teju85 teju85 commented Dec 8, 2017

  • xgboost::dhvec syncs automatically between host and device (see the sketch just after this list)
  • no-copy interfaces have been added
  • existing use-cases will continue to just sync the data to host and call the implementations with std::vector
  • the GPU objective function, predictor, and histogram updater process data directly on the GPU, without any back-and-forth between CPU and GPU
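
To make the first bullet concrete, here is a minimal sketch of the host/device synced-vector idea. The names (SyncedVector, HostVector, DevicePointer) and the dirty-flag bookkeeping are illustrative only, not the exact dhvec interface added by this PR:

#include <cstddef>
#include <vector>

// Illustrative only: a vector that keeps a host copy and a device copy and
// syncs lazily, so callers on either side never trigger a PCIe copy unless
// the other side actually modified the data.
template <typename T>
class SyncedVector {
 public:
  // Host view; refreshes from the device first if the device copy is newer.
  std::vector<T>& HostVector() {
    if (device_dirty_) { CopyDeviceToHost(); device_dirty_ = false; }
    host_dirty_ = true;  // the caller may now modify the host copy
    return host_;
  }

  // Device view; refreshes from the host first if the host copy is newer.
  T* DevicePointer() {
    if (host_dirty_) { CopyHostToDevice(); host_dirty_ = false; }
    device_dirty_ = true;  // the caller may now modify the device copy
    return device_ptr_;
  }

 private:
  // In the real class these would wrap cudaMemcpy (or thrust copies).
  void CopyHostToDevice() {}
  void CopyDeviceToHost() {}

  std::vector<T> host_;
  T* device_ptr_ = nullptr;  // device allocation management elided
  bool host_dirty_ = false;
  bool device_dirty_ = false;
};

With this shape, a GPU objective function can ask for the device pointer every iteration, and no PCIe copy happens unless some CPU code has touched the host side in between.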

Here's a perf/test-error comparison between the AVX-enabled and GPU objective functions.
Summary: the GPU objective function is ~1.32x faster than the AVX-enabled case.
The test-error also appears to be slightly lower with the GPU objective function.
(This was measured on a DGX-1 box, CUDA v8.0, running on only a single P100.)

# objective function using AVX
$ OMP_PROC_BIND=TRUE python ../tests/benchmark/benchmark.py --rows 10000000 --columns 20 --iterations 100
...
[97]    test-error:0.020449
[98]    test-error:0.020347
[99]    test-error:0.02011
Train Time: 12.854377985 seconds

# gpu objective function + no pcie transfers
$ OMP_PROC_BIND=TRUE python ../tests/benchmark/benchmark.py --rows 10000000 --columns 20 --iterations 100 --params "{'objective': 'gpu:binary:logistic'}"
...
[97]    test-error:0.019104
[98]    test-error:0.019026
[99]    test-error:0.019026
Train Time: 9.67761611938 seconds

- xgboost::dhvec<T> syncs automatically between host and device
- no-copy interfaces have been added
- default implementations just sync the data to host
  and call the implementations with std::vector
- GPU objective function, predictor, histogram updater process data
  directly on GPU
Contributor

@pseudotensor pseudotensor left a comment


Looks great! I think at a higher level, it's best to expose to users by default just (say) an n_gpus parameter, which would be 0 for CPU, >=1 for GPU, or -1 for all GPUs. Then automatically choose a good backend.

@RAMitchell However, we need the ability to run cpu_predictor or other CPU code paths when using the GPU tree_updater, because some things are often very slow to do on the GPU. So having fine-grained control is actually good as well.
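
For illustration only, the suggested default resolution might look roughly like this; the helper and the exact parameter values are hypothetical, and explicit user parameters would still override it:

#include <string>

// Hypothetical: map a single n_gpus setting to per-component defaults.
struct BackendDefaults {
  std::string tree_method;
  std::string predictor;
  std::string objective;
};

BackendDefaults ResolveDefaults(int n_gpus) {
  if (n_gpus == 0) {  // CPU only
    return {"hist", "cpu_predictor", "binary:logistic"};
  }
  // n_gpus >= 1 selects the GPU paths; -1 would mean "all available GPUs".
  return {"gpu_hist", "gpu_predictor", "gpu:binary:logistic"};
}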

@teju85
Contributor Author

teju85 commented Dec 10, 2017

@pseudotensor

Then automatically choose the good back end.

Are you referring to the extra params we exposed for choosing between the CPU and GPU objective functions? I have a point to make:
We now have three kinds of objective function evaluation flows: the existing CPU-based one, Rory's AVX gradients, and our GPU version. Each of these can have its own effect on the final test-error (as noted above in this PR's description, as well as in the PR for the AVX gradients). So isn't it better to expose this to our users as an option? What do you think?

However, need ability to run cpu_predictor or cpu things when doing gpu tree_updater,

I'm confused here (especially after reading the previous paragraph). Are you talking about objective function evaluation or prediction for the next iteration? In general, don't you think staying close to the location of evaluation (GPU or CPU) is better than back-and-forth data movement?

@pseudotensor
Contributor

Yes, expose everything, but I'm suggesting we need some reasonable defaults.

Member

@RAMitchell RAMitchell left a comment


I think we need a single vector class that can be used across XGBoost instead of having two function signatures for everything.

I don't like the addition of many different parameters with the GPU tag, but you are correct that using the GPU or the CPU can give slightly different results. For this reason I think we have to use this approach instead of automatically selecting the backend.

In general it means that the XGBoost parameters are more powerful and low level but less intuitive to inexperienced users.

@@ -70,6 +70,10 @@ class GradientBooster {
  virtual void DoBoost(DMatrix* p_fmat,
                       std::vector<bst_gpair>* in_gpair,
                       ObjFunction* obj = nullptr) = 0;
  virtual void DoBoost(DMatrix* p_fmat,
Member


Is it possible to only have a single function signature here? i.e. can we use the dhvec for everything?

Contributor


Having a single function signature here (and in other headers) would require modifying a large number of additional files and tests. For example, DoBoost() is also defined by GBLinear, which is not otherwise modified in this pull request.

I'd prefer to make such a change in a separate pull request, if possible.
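
For reference, the transitional dual-signature pattern under discussion looks roughly like this; the signatures are simplified (the real methods also take DMatrix* and ObjFunction*, and use bst_gpair rather than this stand-in GradientPair):

#include <vector>

struct GradientPair { float grad; float hess; };

// Minimal stand-in for the host/device synced vector used in this PR.
template <typename T>
class HostDeviceVector {
 public:
  std::vector<T>& HostVector() { return host_; }  // the real class syncs from the GPU
 private:
  std::vector<T> host_;
};

class GradientBooster {
 public:
  virtual ~GradientBooster() = default;

  // Existing CPU interface, unchanged.
  virtual void DoBoost(std::vector<GradientPair>* in_gpair) = 0;

  // New no-copy interface. The default implementation just syncs to host and
  // forwards to the std::vector overload, so boosters such as GBLinear need
  // no changes in this PR.
  virtual void DoBoost(HostDeviceVector<GradientPair>* in_gpair) {
    DoBoost(&in_gpair->HostVector());
  }
};

GPU-aware boosters override the second overload and work on the device copy directly; everything else keeps compiling against the std::vector overload.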

Member


This could be a possibility. My main concern here is that Nvidia gets other priorities internally and I am left to clean it up :)

@RAMitchell
Member

Can I get some review comments from another committer please - @hcho3 ? @khotilov ? This PR is critical to continued GPU performance improvements in xgboost but requires significant architectural changes.

@khotilov
Member

I would also prefer not to duplicate function signatures. However, dhvec would then need to fully act as a std::vector in a CPU-only setting. And I wonder if that would limit the possibilities for picking and choosing which parts run on the CPU or the GPU (in the sense @pseudotensor mentioned earlier).

Also, what does the "dhvec" actually stand for: 'device helper', 'device hybrid', 'Del the funky Homosapien' vector?

@teju85
Contributor Author

teju85 commented Dec 18, 2017

@khotilov

Also, what does the "dhvec" actually stand for: 'device helper', 'device hybrid', 'Del the funky Homosapien' vector?

lol! dhvec stands for 'device-and-host vector'.
You make a good point, though; it's better to mention this explicitly in the header file to avoid any confusion.

And I wonder if that would limit the possibilities for picking-and-choosing what parts to run on CPU or on GPU

The current design of dhvec exposes explicit interfaces for querying the 'host' or 'device' vector.
Internally, it makes sure to copy the data either way, if and when required.
So I don't think it will limit our choices.
The only thing it will amount to is large-scale changes to xgboost's interfaces, which risks regressions.
Hence, we want to propose doing these changes over multiple PRs. What do you say?

@RAMitchell
Member

Let's go with HostDeviceVector? Google code style says no abbreviations in naming.

@khotilov So far we have been exposing the gpu/cpu choice through the hyperparameter interface; e.g. you can select gpu_hist as the tree method but cpu_predictor to keep prediction on the CPU. This PR follows that convention, so it allows full control. I think this is fine for now.

And yes, the assumption is that 'dhvec' will be able to substitute for std::vector completely as the interface is replaced.

@teju85
Contributor Author

teju85 commented Dec 18, 2017

Let's go with HostDeviceVector?
HostDeviceVector documentation

Done and Done.

@codecov-io

codecov-io commented Dec 18, 2017

Codecov Report

Merging #2935 into master will decrease coverage by 0.15%.
The diff coverage is 30.5%.


@@             Coverage Diff              @@
##             master    #2935      +/-   ##
============================================
- Coverage     43.19%   43.03%   -0.16%     
  Complexity      207      207              
============================================
  Files           155      158       +3     
  Lines         11606    11669      +63     
  Branches       1159     1159              
============================================
+ Hits           5013     5022       +9     
- Misses         6265     6319      +54     
  Partials        328      328
Impacted Files Coverage Δ Complexity Δ
include/xgboost/gbm.h 14.28% <ø> (ø) 0 <0> (ø) ⬇️
include/xgboost/predictor.h 75% <ø> (ø) 0 <0> (ø) ⬇️
src/objective/regression_obj.cc 81.92% <ø> (-2.24%) 0 <0> (ø)
include/xgboost/tree_updater.h 20% <ø> (ø) 0 <0> (ø) ⬇️
src/learner.cc 25.41% <0%> (-0.26%) 0 <0> (ø)
include/xgboost/objective.h 33.33% <0%> (-6.67%) 0 <0> (ø)
src/gbm/gbtree.cc 19.55% <0%> (-0.58%) 0 <0> (ø)
src/gbm/gbm.cc 25% <0%> (-12.5%) 0 <0> (ø)
src/predictor/cpu_predictor.cc 68.75% <0%> (-0.88%) 0 <0> (ø)
src/objective/objective.cc 66.66% <0%> (-24.25%) 0 <0> (ø)
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 76f8f51...fd55cff.

@RAMitchell
Member

@tqchen any comments? I would like to merge this.

@RAMitchell
Member

@teju85 @canonizer some more thoughts about this:

Can we have HostDeviceVector store the array partitioned across multiple GPUs? My vision for multi-GPU parallelism is to partition the training data by row across the GPUs. If the predictions/gradients are given to an object on one GPU, it will need to transfer them to the other GPUs before doing work, which rather defeats the purpose of minimising PCIe transfers.
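
As a rough illustration of that row-partitioned layout (hypothetical names, and not a proposal for the final multi-GPU HostDeviceVector design):

#include <algorithm>
#include <cstddef>
#include <vector>

// One shard of a logically contiguous vector, owned by a single GPU.
template <typename T>
struct DeviceShard {
  int device_id;      // CUDA device that owns this shard
  std::size_t begin;  // first global row held on this device
  std::size_t end;    // one past the last global row
  T* data;            // device allocation of size (end - begin), elided here
};

// Splits n rows roughly evenly across the given devices (assumes at least one
// device), so e.g. gradients for rows [0, n/2) live on GPU 0 and [n/2, n) on GPU 1.
template <typename T>
std::vector<DeviceShard<T>> PartitionRows(std::size_t n,
                                          const std::vector<int>& devices) {
  std::vector<DeviceShard<T>> shards;
  std::size_t per_dev = (n + devices.size() - 1) / devices.size();
  for (std::size_t i = 0; i < devices.size(); ++i) {
    std::size_t begin = std::min(i * per_dev, n);
    std::size_t end = std::min(begin + per_dev, n);
    shards.push_back({devices[i], begin, end, nullptr});
  }
  return shards;
}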

@teju85
Contributor Author

teju85 commented Jan 8, 2018

@RAMitchell HostDeviceVector already has minimal multi-GPU support through the 'device' integer argument. What needs to change are the places that use objects of this class, so that they partition the data accordingly. That was going to be our next item (along with API cleanups). Hence, I'd really prefer to do this as a separate PR. What do you say?

@tqchen Did you get a chance to look at this proposed change? Do you have any comments?

@RAMitchell RAMitchell merged commit 84ab74f into dmlc:master Jan 12, 2018
@tqchen
Member

tqchen commented Jan 12, 2018

I like the direction of having a data structure that can hold both CPU and GPU arrays.

Moving forward, I hope we can have a common DataArray abstraction: a single virtual class that can be passed along the interface. Note that a virtual class is usually better kept non-templated, so that you can query things like:

class DataArray {
 public:
  // Non-template virtual hook; implemented by CPU/GPU backends.
  virtual void* device_ptr_internal(int type_code) = 0;

  // Typed accessor built on top of the virtual hook.
  template <typename T>
  inline const T* device_ptr() {
    return static_cast<const T*>(device_ptr_internal(type_code<T>::value));
  }
};

I think the goal would be to simply replace the interface with such a DataArray abstraction, and to provide a CPU-only implementation backed by std::vector and a GPU implementation backed by HostDeviceVector.
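
For instance, the pieces the snippet above assumes could be filled in roughly as follows; the type_code trait and the vector-backed CPU implementation are illustrative only, repeating the DataArray sketch for completeness:

#include <utility>
#include <vector>

// Illustrative runtime type tags for the template-to-virtual bridge above.
template <typename T> struct type_code;
template <> struct type_code<float>  { static constexpr int value = 0; };
template <> struct type_code<double> { static constexpr int value = 1; };

class DataArray {
 public:
  virtual ~DataArray() = default;
  virtual void* device_ptr_internal(int code) = 0;

  template <typename T>
  const T* device_ptr() {
    return static_cast<const T*>(device_ptr_internal(type_code<T>::value));
  }
};

// CPU-only backend: the "device" pointer is simply the host buffer.
class CpuFloatArray : public DataArray {
 public:
  explicit CpuFloatArray(std::vector<float> data) : data_(std::move(data)) {}

  void* device_ptr_internal(int code) override {
    return code == type_code<float>::value ? data_.data() : nullptr;
  }

 private:
  std::vector<float> data_;
};

A GPU-backed implementation would instead return a real device pointer, e.g. from a HostDeviceVector.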

@tqchen
Member

tqchen commented Jan 12, 2018

Maybe @RAMitchell and @hcho3 can put together a proposal for the data structure.

@RAMitchell
Member

@teju85 @canonizer this PR has caused a segfault in gpu_hist when running on 4 GPUs on a p3.8xlarge. The bug may be reproduced by running the python-gpu unit tests. I suspect it has something to do with the way the gradients are copied to each device. Can you please take a look?

@canonizer
Contributor

Looking into this.

@canonizer
Contributor

#3068 should fix this.

@hcho3
Collaborator

hcho3 commented Jan 30, 2018

@tqchen I like the idea. I have finally settled down at my new place of employment, so I should be able to take a look at it. Let me come up with a provisional data structure design by the end of this week.

@hcho3
Collaborator

hcho3 commented Feb 5, 2018

Reading through the code, it occurred to me that we should refactor all updaters to use a common data array abstraction to hold 1) the DMatrix and 2) the gradient vectors. It could look like:

class DataVector {
  /* CPU array internals */
  /* GPU array internals (if any): should keep track of updates to sync between CPU and GPU */
  /* CPU / GPU accessors */
  ...
};

class DataMatrix {
  /* CPU array internals */
  /* GPU array internals (if any): should keep track of updates to sync between CPU and GPU */
  /* CPU / GPU accessors */
  ...
};

/* We could simply use one DataArray class with a dimension field,
   but having separate vector and matrix classes helps readability, I think. */

This solves two issues simultaneously: 1) having to maintain two interfaces for processing gradients, and 2) having to manually copy the DMatrix to the GPU inside the updater using InitData() and a hidden flag. Solving 2) will also allow running multiple GPU updaters on the same GPU DMatrix without redundant copies. My only concern is that a lot of existing code depends on DMatrix. @RAMitchell, what are your thoughts on this?

@RAMitchell
Member

I like the idea. Refactoring DMatrix would be very challenging. Another option is to add methods to DMatrix to get GPU-allocated data in CSC or CSR format. The data would be lazily allocated when the method is first called, like the current DMatrix column iterator.

Another major challenge of this refactor will be that we will often want data allocated across multiple GPUs; e.g. a vector of 100 gradients may have 50 allocated on one GPU and 50 on another. Any interface or data structure we design should consider this.
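
A sketch of that lazy-allocation idea, with hypothetical names (the real change would live behind DMatrix and do actual device allocation and copying):

#include <cstddef>

// View of a CSR matrix whose buffers live on the GPU; allocation and the
// host-to-device copy are elided here.
struct DeviceCSRView {
  const float* values = nullptr;
  const unsigned* col_idx = nullptr;
  const std::size_t* row_ptr = nullptr;
  std::size_t n_rows = 0;
};

class LazyDeviceMatrix {
 public:
  // Builds and uploads the device CSR only on the first request, much like
  // the existing DMatrix column iterator lazily builds its column blocks,
  // so updaters that never ask for it pay nothing.
  const DeviceCSRView& GetDeviceCSR() {
    if (!initialized_) {
      UploadToDevice();
      initialized_ = true;
    }
    return view_;
  }

 private:
  void UploadToDevice() { /* allocate device buffers and copy the host CSR */ }

  DeviceCSRView view_;
  bool initialized_ = false;
};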

@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019