ranking metric computation acceleration on gpu #5326
Conversation
- This PR accelerates the map, ndcg, precision, auc, and aucpr metrics on the GPU.
- Further, it also accelerates the auc and aucpr metrics on the CPU.

I'll post the performance numbers shortly.
Performance numbers: test environment, test, and results (details collapsed).
Is there any way that we can dispatch the computation kernel based on

Clarification: the following shows the
@trivialfis - are you asking if it is possible to separate the CPU and GPU implementations completely into separate files, without having the

The simplest thing I can think of (for metric
Yup. ;-) My suggestion is simply two functions called

@RAMitchell might have better suggestions.
Another thing is we want to replace the
- This is to avoid including the CUDA file in the .cc file, thus cluttering the code with #ifdef gimmicks.
- This approach can be used when the two implementations diverge fairly considerably.
- It is possible to use such an approach elsewhere (objectives and such) where the implementations diverge considerably.
- There is no impact on performance.
@trivialfis I have now decoupled the CPU and GPU implementations so that they reside independently, without resorting to the include trick (as you allude to). Please see if this is palatable; if it is, I can roll this out to other areas where there is a significant divergence between the two implementations. @RAMitchell I would appreciate your input on this as well when you get a chance.
- This PR should not have affected this functionality; perhaps a latent issue is showing up now.
Sorry for the delay. Added some comments for now, will need to go more in depth. You can help by isolating any independent parts and we can do them one by one in separate PRs.
// Accessors that return device pointers
inline const T *GetItemsPtr() const { return ditems_.data().get(); }
Pass Span rather than pointers.
// Accessors that return device pointers
inline const T *GetItemsPtr() const { return ditems_.data().get(); }
inline uint32_t GetNumItems() const { return ditems_.size(); }
inline const caching_device_vector<T> &GetItems() const {
Span over const device vectors.
Metric *
GPUMetric::CreateGPUMetric(const std::string& name, GenericParameter const* tparam) {
Not sure why this is needed. I don't recall having to do this for other factory methods in GPU code.
The main idea is to decouple the CPU and GPU implementations when there is significant divergence, without the CUDA file include trick. @trivialfis raised this issue, and the discussion is subsumed in this PR. More specifically, the following change may shed some light:
the GPU implementation is now dynamically looked up from the registry and dispatched when there is a valid device present and xgboost is GPU-enabled.
*/
// When device ordinal is present, we would want to build the metrics on the GPU. It is possible
This needs a little more thought from us if we want to use this approach going forward. Ideally we would have a consistent configuration logic when choosing GPU algorithms.
for (omp_ulong i = 0; i < ndata; ++i) {
  exp_p_sum += h_preds[i];

// calculate AUC
double sum_pospair = 0.0;
There is a bit much for me to review in one go. In particular, when things get moved it's harder to see where the algorithm changed in the diff. Can we try to separate this into a few PRs: maybe one for moving things around, one for the logic for launching/configuring GPU metrics, and one per changed metric? Given that AUC is such a key metric, I really want to digest what's going on here.
@RAMitchell - I'll go through this and see if I can split it into multiple digestible PRs. In the interim, I'll comment on the few things you have raised.
- Move segment sorter to common: this is the first of a handful of PRs that splits the larger PR #5326.
- It moves this facility to common (from the ranking objective class) so that it can be used for metric computation.
- It also wraps all the bald device pointers in Span.
This PR accelerates the map, ndcg, precision, auc, and aucpr metrics on GPU, and the auc and aucpr metrics on CPU. The performance numbers for ranking and non-ranking datasets are below.
@RAMitchell @trivialfis @rongou - please review