Objective function evaluation on GPU with minimal PCIe transfers #2935
Conversation
- xgboost::dhvec<T> syncs automatically between host and device
- no-copy interfaces have been added
- default implementations just sync the data to host and call the implementations with std::vector
- GPU objective function, predictor, histogram updater process data directly on GPU
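For readers unfamiliar with the pattern, here is a minimal CPU-compilable sketch of the lazy host/device synchronization described above. All names (dhvec_sketch, HostVector, DevicePointer) and the dirty-flag scheme are illustrative assumptions rather than the PR's actual implementation, and the CUDA copies are reduced to comments:

```cpp
#include <vector>

template <typename T>
class dhvec_sketch {
 public:
  // Host access: make sure the host copy is current, then hand out the vector.
  std::vector<T>& HostVector() {
    if (device_dirty_) { /* cudaMemcpy device -> host would go here */ device_dirty_ = false; }
    host_dirty_ = true;  // the caller may mutate the data through this reference
    return host_;
  }
  // Device access: make sure the device copy is current, then hand out a raw pointer.
  T* DevicePointer(int device) {
    (void)device;  // only needed in the real CUDA path (cudaSetDevice)
    if (host_dirty_) { /* cudaMemcpy host -> device would go here */ host_dirty_ = false; }
    device_dirty_ = true;
    return device_ptr_;
  }
 private:
  std::vector<T> host_;
  T* device_ptr_ = nullptr;    // owned device allocation in a real implementation
  bool host_dirty_ = false;    // host copy modified since the last sync
  bool device_dirty_ = false;  // device copy modified since the last sync
};
```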
Looks great! At a higher level, I think it would be best for the user-facing interface to expose, by default, just a single n_gpus parameter: 0 for CPU, >=1 for that many GPUs, and -1 for all GPUs. The backend could then be chosen automatically.
@RAMitchell However, we also need the ability to run cpu_predictor (or other CPU components) while using a GPU tree_updater, because some operations are often very slow on the GPU. So having fine-grained control is actually good as well.
Are you referring to the extra params we exposed for choosing between cpu/gpu objective functions? I have a point to make:
I'm confused here (especially after reading the previous paragraph). Are you talking about objective function evaluation or prediction for the next iteration? In general, don't you think it is better to keep the computation close to where the data is evaluated (GPU or CPU) rather than moving it back and forth?
Yes, expose all of them, but I'm suggesting we need some reasonable defaults.
I think we need a single vector class that can be used across XGBoost instead of having two function signatures for everything.
I don't like adding many different parameters with the GPU tag, but you are correct that using the GPU or the CPU can give slightly different results. For this reason I think we have to use this approach instead of automatically selecting the backend.
In general it means that the XGBoost parameters are more powerful and low level but less intuitive to inexperienced users.
```diff
@@ -70,6 +70,10 @@ class GradientBooster {
   virtual void DoBoost(DMatrix* p_fmat,
                        std::vector<bst_gpair>* in_gpair,
                        ObjFunction* obj = nullptr) = 0;
+  virtual void DoBoost(DMatrix* p_fmat,
```
Is it possible to only have a single function signature here? i.e. can we use the dhvec for everything?
Having a single function signature here (and in other headers) would require modifying a large number of additional files and tests. For example, DoBoost() is also defined by GBLinear, which is not otherwise modified in this pull request. I'd prefer to make such a change in a separate pull request, if possible.
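For illustration, here is a self-contained sketch of the delegation pattern the PR description refers to ("default implementations just sync the data to host and call the implementations with std::vector"). The stub types and the HostVector() accessor name are assumptions, not the real xgboost headers:

```cpp
#include <vector>

// Stand-in stubs for the real xgboost types, only so the sketch is self-contained.
struct DMatrix;
struct ObjFunction;
struct bst_gpair { float grad, hess; };

template <typename T>
struct HostDeviceVectorStub {
  std::vector<T>& HostVector() { return host_; }  // hypothetical accessor name
  std::vector<T> host_;
};

class GradientBoosterSketch {
 public:
  virtual ~GradientBoosterSketch() = default;
  // Existing host-only signature.
  virtual void DoBoost(DMatrix* p_fmat, std::vector<bst_gpair>* in_gpair,
                       ObjFunction* obj = nullptr) = 0;
  // New no-copy overload; a default body can simply sync to host and delegate,
  // so CPU-only boosters such as GBLinear need no further changes.
  virtual void DoBoost(DMatrix* p_fmat, HostDeviceVectorStub<bst_gpair>* in_gpair,
                       ObjFunction* obj = nullptr) {
    DoBoost(p_fmat, &in_gpair->HostVector(), obj);
  }
};
```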
This could be a possibility. My main concern here is that Nvidia gets other priorities internally and I am left to clean it up :)
I would also prefer not to duplicate function signatures. However, the dhvec would then need to fully act as an std::vector in a CPU-only setting, and I wonder if that would limit the possibilities for picking and choosing which parts to run on CPU or on GPU (in the sense @pseudotensor mentioned earlier). Also, what does "dhvec" actually stand for: 'device helper', 'device hybrid', 'Del the funky Homosapien' vector?
lol! dhvec stands for 'device-and-host vector'.
The current design of dhvec exposes explicit interfaces for users to query the 'host' or 'device' vector.
Let's go with HostDeviceVector? The Google code style says no abbreviations in naming. @khotilov So far we have been exposing the gpu/cpu choice through the hyperparameter interface: e.g. you can select gpu_hist as the tree method but cpu_predictor to keep prediction on the CPU. This PR follows that convention, so it allows full control. I think this is fine for now. And yes, the assumption is that 'dhvec' will be able to substitute for std::vector completely as the interface is replaced.
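As a concrete example of that mix-and-match convention, a rough C-API sketch (the input file name and round count are invented, and error codes are ignored for brevity):

```cpp
#include <xgboost/c_api.h>

int main() {
  DMatrixHandle dtrain;
  XGDMatrixCreateFromFile("train.libsvm", 0, &dtrain);  // hypothetical input file

  BoosterHandle booster;
  XGBoosterCreate(&dtrain, 1, &booster);
  // Mix-and-match: build trees on the GPU, but keep prediction on the CPU.
  XGBoosterSetParam(booster, "tree_method", "gpu_hist");
  XGBoosterSetParam(booster, "predictor", "cpu_predictor");

  for (int iter = 0; iter < 10; ++iter) {  // error codes omitted for brevity
    XGBoosterUpdateOneIter(booster, iter, dtrain);
  }

  XGBoosterFree(booster);
  XGDMatrixFree(dtrain);
  return 0;
}
```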
Done and Done.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #2935      +/-   ##
============================================
- Coverage     43.19%   43.03%   -0.16%
  Complexity      207      207
============================================
  Files           155      158       +3
  Lines         11606    11669      +63
  Branches       1159     1159
============================================
+ Hits           5013     5022       +9
- Misses         6265     6319      +54
  Partials        328      328
```
Continue to review full report at Codecov.
@tqchen any comments? I would like to merge this.
@teju85 @canonizer some more thoughts about this: can we have the HostDeviceVector store the array partitioned across multiple GPUs? My vision for multi-GPU parallelism is to partition the training data by row across the GPUs. If the predictions/gradients are given to an object on one GPU, it will need to transfer them to the other GPUs before doing work, which somewhat defeats the purpose of minimising PCIe transfers.
@RAMitchell HostDeviceVector already has minimal multi-GPU support through the 'device' integer argument. What needs to change are the places that use objects of this class, so that they partition the data accordingly. That was going to be our next item (along with API cleanups), so I'd really prefer to do this as a separate PR. What do you say? @tqchen Did you get a chance to look at this proposed change? Do you have any comments?
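Purely as a hypothetical sketch of the row-partitioning direction under discussion (none of these names exist in the PR), each device could own a contiguous shard of the gradient vector:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One GPU's shard of a row-partitioned vector.
struct Shard {
  int device;              // CUDA device ordinal
  std::size_t begin, end;  // half-open element range owned by this device
};

// Split n elements as evenly as possible across the given devices.
std::vector<Shard> RowPartition(std::size_t n, const std::vector<int>& devices) {
  std::vector<Shard> shards;
  const std::size_t chunk = (n + devices.size() - 1) / devices.size();
  std::size_t begin = 0;
  for (int d : devices) {
    const std::size_t end = std::min(n, begin + chunk);
    shards.push_back({d, begin, end});
    begin = end;
  }
  return shards;  // e.g. 100 gradients over {0, 1} -> [0, 50) on GPU 0, [50, 100) on GPU 1
}
```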
I like the direction of having a data structure that can hold both CPU and GPU arrays. Moving forward, I hope we have a common DataArray abstraction: a single virtual class that can be passed along the interface. Note that a virtual class is usually better kept non-template, so you can query things like:

```cpp
class DataArray {
 public:
  virtual void* device_ptr_internal(int type_code);
  template <typename T>
  inline const T* device_ptr() {
    return static_cast<T*>(device_ptr_internal(type_code<T>::value));
  }
};
```

I think the goal would be to simply replace the interface with such a DataArray abstraction, and provide a CPU-only implementation backed by vector and a GPU implementation backed by DeviceHostVector.
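To make the suggestion above concrete, here is a hedged sketch of the CPU-only backend backed by std::vector; the type_code trait and the CPUDataArray name are assumptions not defined in this thread:

```cpp
#include <vector>

// Hypothetical type-code trait; the snippet above assumes something like this exists.
template <typename T> struct type_code;
template <> struct type_code<float> { static constexpr int value = 0; };

class DataArray {
 public:
  virtual ~DataArray() = default;
  virtual void* device_ptr_internal(int code) = 0;
  template <typename T>
  const T* device_ptr() {
    return static_cast<T*>(device_ptr_internal(type_code<T>::value));
  }
};

// CPU-only backend: the "device" pointer is simply the host buffer.
class CPUDataArray : public DataArray {
 public:
  explicit CPUDataArray(std::vector<float> data) : data_(std::move(data)) {}
  void* device_ptr_internal(int code) override {
    return code == type_code<float>::value ? static_cast<void*>(data_.data()) : nullptr;
  }
 private:
  std::vector<float> data_;
};
```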
Maybe @RAMitchell and @hcho3 can put together a proposal for the data structure.
@teju85 @canonizer this PR has caused a segfault in gpu_hist when running on 4 GPUs on a p3.8xlarge. The bug may be reproduced by running the python-gpu unit tests. I suspect it has something to do with the way the gradients are copied to each device. Can you please take a look?
Looking into this.
#3068 should fix this.
@tqchen I like the idea. I have finally settled down at my new place of employment, so I should be able to take a look at it. Let me come up with a provisional data structure design by the end of this week.
Reading through the code, it came to my mind that we should refactor all updaters to use a common data array abstraction to hold 1) the DMatrix and 2) the gradient vectors. It could look like:

```cpp
class DataVector {
  /* CPU array internals */
  /* GPU array internals (if any): should keep track of updates to sync between CPU and GPU */
  /* CPU / GPU accessors */
  ...
};

class DataMatrix {
  /* CPU array internals */
  /* GPU array internals (if any): should keep track of updates to sync between CPU and GPU */
  /* CPU / GPU accessors */
  ...
};

/* We could simply use one DataArray class with a dimension field,
   but having separate vector and matrix classes helps readability, I think */
```

This solves two issues simultaneously: 1) having to maintain two interfaces for processing gradients, and 2) having to manually copy the DMatrix to the GPU inside the updater using ...
I like the idea. Refactoring DMatrix would be very challenging. Another option is to add methods to the DMatrix to get GPU-allocated data in CSC or CSR format. The data would be lazily allocated when the method is first called, like the current DMatrix column iterator. Another major challenge of this refactor is that we will often want data allocated across multiple GPUs: e.g. a vector of 100 gradients may be allocated 50 on one GPU and 50 on another. Any interface or data structure we design should consider this.
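A rough sketch of that lazy-allocation idea (hypothetical names; the real DMatrix internals are more involved, and the device copy is reduced to a comment):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Placeholder for device-resident CSR buffers; a real version would hold
// GPU allocations for row offsets, column indices and values.
struct DeviceCSR {
  std::size_t n_rows = 0;
  std::size_t n_nonzero = 0;
};

class LazyDeviceCSR {
 public:
  // First call: allocate and copy; later calls: return the cached copy.
  const DeviceCSR& Get(const std::vector<std::size_t>& row_ptr) {
    if (!cache_) {
      cache_ = std::make_unique<DeviceCSR>();
      cache_->n_rows = row_ptr.empty() ? 0 : row_ptr.size() - 1;
      cache_->n_nonzero = row_ptr.empty() ? 0 : row_ptr.back();
      /* cudaMalloc + cudaMemcpy of the CSR arrays would go here */
    }
    return *cache_;
  }
 private:
  std::unique_ptr<DeviceCSR> cache_;
};
```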
Here's a performance/test-error comparison between the AVX-enabled and GPU objective functions.
Summary: the GPU objective function is ~1.32x faster than the AVX-enabled version.
Also, the test error appears to be slightly lower with the GPU objective function.
(This was measured on a DGX-1 box, CUDA v8.0, running only on 1x P100.)