Objective function evaluation on GPU with minimal PCIe transfers #2935

Merged
merged 7 commits into dmlc:master on Jan 12, 2018

Conversation

teju85
Contributor

@teju85 teju85 commented Dec 8, 2017

  • xgboost::dhvec syncs automatically between host and device (see the sketch just after this list)
  • no-copy interfaces have been added
  • existing use-cases will continue to just sync the data to host and call the implementations with std::vector
  • the GPU objective function, predictor, and histogram updater process data directly on the GPU, without any back-and-forth between CPU and GPU
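
To make the first bullet concrete, here is a minimal sketch of the host/device synced-vector idea. The names (SyncedVector, HostVector, DevicePointer) and the dirty-flag bookkeeping are illustrative only, not the exact dhvec interface added by this PR:

#include <cstddef>
#include <vector>

// Illustrative only: a vector that keeps a host copy and a device copy and
// syncs lazily, so callers on either side never trigger a PCIe copy unless
// the other side actually modified the data.
template <typename T>
class SyncedVector {
 public:
  // Host view; refreshes from the device first if the device copy is newer.
  std::vector<T>& HostVector() {
    if (device_dirty_) { CopyDeviceToHost(); device_dirty_ = false; }
    host_dirty_ = true;  // the caller may now modify the host copy
    return host_;
  }

  // Device view; refreshes from the host first if the host copy is newer.
  T* DevicePointer() {
    if (host_dirty_) { CopyHostToDevice(); host_dirty_ = false; }
    device_dirty_ = true;  // the caller may now modify the device copy
    return device_ptr_;
  }

 private:
  // In the real class these would wrap cudaMemcpy (or thrust copies).
  void CopyHostToDevice() {}
  void CopyDeviceToHost() {}

  std::vector<T> host_;
  T* device_ptr_ = nullptr;  // device allocation management elided
  bool host_dirty_ = false;
  bool device_dirty_ = false;
};

With this shape, a GPU objective function can ask for the device pointer every iteration, and no PCIe copy happens unless some CPU code has touched the host side in between.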

Here's a perf/test-error comparison between the AVX-enabled and GPU objective functions.
Summary: the GPU objective function is ~1.32x faster than the AVX-enabled case.
The test-error also appears to be slightly lower with the GPU objective function.
(This was measured on a DGX-1 box, CUDA v8.0, running on only a single P100.)

# objective function using AVX
$ OMP_PROC_BIND=TRUE python ../tests/benchmark/benchmark.py --rows 10000000 --columns 20 --iterations 100
...
[97]    test-error:0.020449
[98]    test-error:0.020347
[99]    test-error:0.02011
Train Time: 12.854377985 seconds

# gpu objective function + no pcie transfers
$ OMP_PROC_BIND=TRUE python ../tests/benchmark/benchmark.py --rows 10000000 --columns 20 --iterations 100 --params "{'objective': 'gpu:binary:logistic'}"
...
[97]    test-error:0.019104
[98]    test-error:0.019026
[99]    test-error:0.019026
Train Time: 9.67761611938 seconds

- xgboost::dhvec<T> syncs automatically between host and device
- no-copy interfaces have been added
- default implementations just sync the data to host
  and call the implementations with std::vector
- GPU objective function, predictor, histogram updater process data
  directly on GPU
Contributor

@pseudotensor pseudotensor left a comment


Looks great! I think at a higher level, it's best to expose to users by default just (say) an n_gpus parameter, which would be 0 for CPU, >=1 for GPU, or -1 for all GPUs. Then automatically choose a good backend.

@RAMitchell However, we need the ability to run cpu_predictor or other CPU code paths when using the GPU tree_updater, because some things are often very slow to do on the GPU. So having fine-grained control is actually good as well.
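
For illustration only, the suggested default resolution might look roughly like this; the helper and the exact parameter values are hypothetical, and explicit user parameters would still override it:

#include <string>

// Hypothetical: map a single n_gpus setting to per-component defaults.
struct BackendDefaults {
  std::string tree_method;
  std::string predictor;
  std::string objective;
};

BackendDefaults ResolveDefaults(int n_gpus) {
  if (n_gpus == 0) {  // CPU only
    return {"hist", "cpu_predictor", "binary:logistic"};
  }
  // n_gpus >= 1 selects the GPU paths; -1 would mean "all available GPUs".
  return {"gpu_hist", "gpu_predictor", "gpu:binary:logistic"};
}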

@teju85
Contributor Author

teju85 commented Dec 10, 2017

@pseudotensor

Then automatically choose the good back end.

Are you referring to the extra params we exposed for choosing between the CPU and GPU objective functions? I have a point to make:
We now have three kinds of objective function evaluation flows: the existing CPU-based one, Rory's AVX gradients, and our GPU version. Each of these can have its own effect on the final test-error (as noted above in this PR's description, as well as in the PR for the AVX gradients). So isn't it better to expose this to our users as an option? What do you think?

However, need ability to run cpu_predictor or cpu things when doing gpu tree_updater,

I'm confused here (especially after reading the previous paragraph). Are you talking about objective function evaluation or prediction for the next iteration? In general, don't you think staying close to the location of evaluation (GPU or CPU) is better than back-and-forth data movement?

@pseudotensor
Contributor

Yes, expose everything, but I'm suggesting we need some reasonable defaults.

Member

@RAMitchell RAMitchell left a comment


I think we need a single vector class that can be used across XGBoost instead of having two function signatures for everything.

I don't like the addition of many different parameters with the GPU tag, but you are correct that using the GPU or the CPU can give slightly different results. For this reason I think we have to use this approach instead of automatically selecting the backend.

In general it means that the XGBoost parameters are more powerful and low level but less intuitive to inexperienced users.

@@ -70,6 +70,10 @@ class GradientBooster {
  virtual void DoBoost(DMatrix* p_fmat,
                       std::vector<bst_gpair>* in_gpair,
                       ObjFunction* obj = nullptr) = 0;
  virtual void DoBoost(DMatrix* p_fmat,
Member


Is it possible to only have a single function signature here? i.e. can we use the dhvec for everything?

Contributor


Having a single function signature here (and in other headers) would require modifying a large number of additional files and tests. For example, DoBoost() is also defined by GBLinear, which is not otherwise modified in this pull request.

I'd prefer to make such a change in a separate pull request, if possible.
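
For reference, the transitional dual-signature pattern under discussion looks roughly like this; the signatures are simplified (the real methods also take DMatrix* and ObjFunction*, and use bst_gpair rather than this stand-in GradientPair):

#include <vector>

struct GradientPair { float grad; float hess; };

// Minimal stand-in for the host/device synced vector used in this PR.
template <typename T>
class HostDeviceVector {
 public:
  std::vector<T>& HostVector() { return host_; }  // the real class syncs from the GPU
 private:
  std::vector<T> host_;
};

class GradientBooster {
 public:
  virtual ~GradientBooster() = default;

  // Existing CPU interface, unchanged.
  virtual void DoBoost(std::vector<GradientPair>* in_gpair) = 0;

  // New no-copy interface. The default implementation just syncs to host and
  // forwards to the std::vector overload, so boosters such as GBLinear need
  // no changes in this PR.
  virtual void DoBoost(HostDeviceVector<GradientPair>* in_gpair) {
    DoBoost(&in_gpair->HostVector());
  }
};

GPU-aware boosters override the second overload and work on the device copy directly; everything else keeps compiling against the std::vector overload.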

Member


This could be a possibility. My main concern here is that Nvidia gets other priorities internally and I am left to clean it up :)

@RAMitchell
Member

Can I get some review comments from another committer please - @hcho3 ? @khotilov ? This PR is critical to continued GPU performance improvements in xgboost but requires significant architectural changes.

@khotilov
Member

I would also prefer not to duplicate function signatures. However, dhvec would then need to fully act as a std::vector in a CPU-only setting. And I wonder if that would limit the possibilities for picking and choosing which parts run on the CPU or the GPU (in the sense @pseudotensor mentioned earlier).

Also, what does the "dhvec" actually stand for: 'device helper', 'device hybrid', 'Del the funky Homosapien' vector?

@teju85
Contributor Author

teju85 commented Dec 18, 2017

@khotilov

Also, what does the "dhvec" actually stand for: 'device helper', 'device hybrid', 'Del the funky Homosapien' vector?

lol! dhvec stands for 'device-and-host vector'.
You make a good point, though; it's better to mention this explicitly in the header file to avoid any confusion.

And I wonder if that would limit the possibilities for picking-and-choosing what parts to run on CPU or on GPU

The current design of dhvec exposes explicit interfaces for querying the 'host' or 'device' vector.
Internally, it makes sure to copy the data either way, if and when required.
So I don't think it will limit our choices.
The only thing it will amount to is large-scale changes to xgboost's interfaces, which risks regressions.
Hence, we want to propose doing these changes over multiple PRs. What do you say?

@RAMitchell
Member

Let's go with HostDeviceVector? Google code style says no abbreviations in naming.

@khotilov So far we have been exposing the gpu/cpu choice through the hyperparameter interface; e.g. you can select gpu_hist as the tree method but cpu_predictor to keep prediction on the CPU. This PR follows that convention, so it allows full control. I think this is fine for now.

And yes, the assumption is that 'dhvec' will be able to substitute for std::vector completely as the interface is replaced.

@teju85
Contributor Author

teju85 commented Dec 18, 2017

Let's go with HostDeviceVector?
HostDeviceVector documentation

Done and Done.

@codecov-io

codecov-io commented Dec 18, 2017

Codecov Report

Merging #2935 into master will decrease coverage by 0.15%.
The diff coverage is 30.5%.


@@             Coverage Diff              @@
##             master    #2935      +/-   ##
============================================
- Coverage     43.19%   43.03%   -0.16%     
  Complexity      207      207              
============================================
  Files           155      158       +3     
  Lines         11606    11669      +63     
  Branches       1159     1159              
============================================
+ Hits           5013     5022       +9     
- Misses         6265     6319      +54     
  Partials        328      328
Impacted Files Coverage Δ Complexity Δ
include/xgboost/gbm.h 14.28% <ø> (ø) 0 <0> (ø) ⬇️
include/xgboost/predictor.h 75% <ø> (ø) 0 <0> (ø) ⬇️
src/objective/regression_obj.cc 81.92% <ø> (-2.24%) 0 <0> (ø)
include/xgboost/tree_updater.h 20% <ø> (ø) 0 <0> (ø) ⬇️
src/learner.cc 25.41% <0%> (-0.26%) 0 <0> (ø)
include/xgboost/objective.h 33.33% <0%> (-6.67%) 0 <0> (ø)
src/gbm/gbtree.cc 19.55% <0%> (-0.58%) 0 <0> (ø)
src/gbm/gbm.cc 25% <0%> (-12.5%) 0 <0> (ø)
src/predictor/cpu_predictor.cc 68.75% <0%> (-0.88%) 0 <0> (ø)
src/objective/objective.cc 66.66% <0%> (-24.25%) 0 <0> (ø)
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 76f8f51...fd55cff.

@RAMitchell
Member

@tqchen any comments? I would like to merge this.

@RAMitchell
Member

@teju85 @canonizer some more thoughts about this:

Can we have HostDeviceVector store the array partitioned across multiple GPUs? My vision for multi-GPU parallelism is to partition the training data by row across the GPUs. If the predictions/gradients are given to an object on one GPU, it will need to transfer them to the other GPUs before doing work, which rather defeats the purpose of minimising PCIe transfers.
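
As a rough illustration of that row-partitioned layout (hypothetical names, and not a proposal for the final multi-GPU HostDeviceVector design):

#include <algorithm>
#include <cstddef>
#include <vector>

// One shard of a logically contiguous vector, owned by a single GPU.
template <typename T>
struct DeviceShard {
  int device_id;      // CUDA device that owns this shard
  std::size_t begin;  // first global row held on this device
  std::size_t end;    // one past the last global row
  T* data;            // device allocation of size (end - begin), elided here
};

// Splits n rows roughly evenly across the given devices (assumes at least one
// device), so e.g. gradients for rows [0, n/2) live on GPU 0 and [n/2, n) on GPU 1.
template <typename T>
std::vector<DeviceShard<T>> PartitionRows(std::size_t n,
                                          const std::vector<int>& devices) {
  std::vector<DeviceShard<T>> shards;
  std::size_t per_dev = (n + devices.size() - 1) / devices.size();
  for (std::size_t i = 0; i < devices.size(); ++i) {
    std::size_t begin = std::min(i * per_dev, n);
    std::size_t end = std::min(begin + per_dev, n);
    shards.push_back({devices[i], begin, end, nullptr});
  }
  return shards;
}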

@teju85
Contributor Author

teju85 commented Jan 8, 2018

@RAMitchell HostDeviceVector already has minimal multi-GPU support through the 'device' integer argument. What needs to change are the places that use objects of this class, so that they partition the data accordingly. That was going to be our next item (along with API cleanups). Hence, I'd really prefer to do this as a separate PR. What do you say?

@tqchen Did you get a chance to look at this proposed change? Do you have any comments?

@RAMitchell RAMitchell merged commit 84ab74f into dmlc:master Jan 12, 2018
@tqchen
Member

tqchen commented Jan 12, 2018

I like the direction of having a data structure that can hold both CPU and GPU arrays.

Moving forward, I hope we can have a common DataArray abstraction: a single virtual class that can be passed along the interface. Note that a virtual class is usually better kept non-templated, so that you can query things like:

class DataArray {
 public:
  // Non-template virtual hook; implemented by CPU/GPU backends.
  virtual void* device_ptr_internal(int type_code) = 0;

  // Typed accessor built on top of the virtual hook.
  template <typename T>
  inline const T* device_ptr() {
    return static_cast<const T*>(device_ptr_internal(type_code<T>::value));
  }
};

I think the goal would be to simply replace the interface with such a DataArray abstraction, and to provide a CPU-only implementation backed by std::vector and a GPU implementation backed by HostDeviceVector.
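
For instance, the pieces the snippet above assumes could be filled in roughly as follows; the type_code trait and the vector-backed CPU implementation are illustrative only, repeating the DataArray sketch for completeness:

#include <utility>
#include <vector>

// Illustrative runtime type tags for the template-to-virtual bridge above.
template <typename T> struct type_code;
template <> struct type_code<float>  { static constexpr int value = 0; };
template <> struct type_code<double> { static constexpr int value = 1; };

class DataArray {
 public:
  virtual ~DataArray() = default;
  virtual void* device_ptr_internal(int code) = 0;

  template <typename T>
  const T* device_ptr() {
    return static_cast<const T*>(device_ptr_internal(type_code<T>::value));
  }
};

// CPU-only backend: the "device" pointer is simply the host buffer.
class CpuFloatArray : public DataArray {
 public:
  explicit CpuFloatArray(std::vector<float> data) : data_(std::move(data)) {}

  void* device_ptr_internal(int code) override {
    return code == type_code<float>::value ? data_.data() : nullptr;
  }

 private:
  std::vector<float> data_;
};

A GPU-backed implementation would instead return a real device pointer, e.g. from a HostDeviceVector.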

@tqchen
Member

tqchen commented Jan 12, 2018

Maybe @RAMitchell and @hcho3 can put together a proposal for the data structure.

@RAMitchell
Member

@teju85 @canonizer this PR has caused a segfault in gpu_hist when running on 4 GPUs on a p3.8xlarge. The bug may be reproduced by running the python-gpu unit tests. I suspect it has something to do with the way the gradients are copied to each device. Can you please take a look?

@canonizer
Contributor

Looking into this.

@canonizer
Contributor

#3068 should fix this.

@hcho3
Collaborator

hcho3 commented Jan 30, 2018

@tqchen I like the idea. I have finally settled down at my new place of employment, so I should be able to take a look at it. Let me come up with a provisional data structure design by the end of this week.

@hcho3
Collaborator

hcho3 commented Feb 5, 2018

Reading through the code, it occurred to me that we should refactor all updaters to use a common data array abstraction to hold 1) the DMatrix and 2) the gradient vectors. It could look like:

class DataVector {
  /* CPU array internals */
  /* GPU array internals (if any): should keep track of updates to sync between CPU and GPU */
  /* CPU / GPU accessors */
  ...
};

class DataMatrix {
  /* CPU array internals */
  /* GPU array internals (if any): should keep track of updates to sync between CPU and GPU */
  /* CPU / GPU accessors */
  ...
};

/* We could simply use one DataArray class with a dimension field,
   but having separate vector and matrix classes helps readability, I think. */

This solves two issues simultaneously: 1) having to maintain two interfaces for processing gradients, and 2) having to manually copy the DMatrix to the GPU inside the updater using InitData() and a hidden flag. Solving 2) will also allow running multiple GPU updaters on the same GPU DMatrix without redundant copies. My only concern is that a lot of existing code depends on DMatrix. @RAMitchell, what are your thoughts on this?

@RAMitchell
Member

I like the idea. Refactoring DMatrix would be very challenging. Another option is to add methods to DMatrix to get GPU-allocated data in CSC or CSR format. The data would be lazily allocated when the method is first called, like the current DMatrix column iterator.

Another major challenge of this refactor will be that we will often want data allocated across multiple GPUs; e.g. a vector of 100 gradients may have 50 allocated on one GPU and 50 on another. Any interface or data structure we design should consider this.
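
A sketch of that lazy-allocation idea, with hypothetical names (the real change would live behind DMatrix and do actual device allocation and copying):

#include <cstddef>

// View of a CSR matrix whose buffers live on the GPU; allocation and the
// host-to-device copy are elided here.
struct DeviceCSRView {
  const float* values = nullptr;
  const unsigned* col_idx = nullptr;
  const std::size_t* row_ptr = nullptr;
  std::size_t n_rows = 0;
};

class LazyDeviceMatrix {
 public:
  // Builds and uploads the device CSR only on the first request, much like
  // the existing DMatrix column iterator lazily builds its column blocks,
  // so updaters that never ask for it pay nothing.
  const DeviceCSRView& GetDeviceCSR() {
    if (!initialized_) {
      UploadToDevice();
      initialized_ = true;
    }
    return view_;
  }

 private:
  void UploadToDevice() { /* allocate device buffers and copy the host CSR */ }

  DeviceCSRView view_;
  bool initialized_ = false;
};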

@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019