
ranking metric computation acceleration on gpu #5326

Closed
wants to merge 12 commits

Conversation


@sriramch sriramch commented Feb 19, 2020

  • this pr accelerates map, ndcg, precision, auc and aucpr metrics on gpu
  • further, it also accelerates the auc and aucpr metrics on cpu

the performance numbers for ranking and non-ranking datasets are below.

@RAMitchell @trivialfis @rongou - please review

   - this pr accelerates map, ndcg, precision, auc and aucpr metrics on gpu
  • further, it also accelerates the auc and aucpr metrics on cpu

i'll post the performance numbers shortly
@sriramch sriramch changed the title - ranking metric computation acceleration on gpu [wip] ranking metric computation acceleration on gpu Feb 19, 2020
@sriramch (Contributor Author)

performance numbers

test environment

  • 1 socket
  • 6 cores/socket
  • 2 threads/core
  • 80 gb system memory
  • v100 gpu

test

  • uses all cpu threads
  • builds 100 trees
  • runs the different objective functions rank:pairwise, rank:ndcg and rank:map
  • computes metrics map, ndcg, pre@2, auc and aucpr
  • uses a small and a large mslr ranking benchmark dataset, with the following characteristics:
    • small mslr dataset containing ~ 3.6m training instances distributed over ~ 10k groups consuming 3.9 gb disk space
    • large mslr dataset containing ~ 11.3m training instances distributed over ~ 95k groups consuming 13 gb disk space
  • metric eval times are reported below

results

  • no additional gpu memory was used
  • all times are in seconds
  • first column is the eval time on cpu
  • second column is the eval time on gpu
small dataset (gpu => train, rank):

algo/metric       eval on cpu   eval on gpu
pairwise/map      2.51          0.39
pairwise/ndcg     6             0.57
pairwise/pre@2    2.38          0.37
pairwise/auc      21.27         0.48
pairwise/aucpr    27.92         0.61
ndcg/map          2.57          0.38
ndcg/ndcg         6.02          0.56
ndcg/pre@2        2.31          0.35
ndcg/auc          21.09         0.47
ndcg/aucpr        27.31         0.59
map/map           2.43          0.38
map/ndcg          5.84          0.56
map/pre@2         2.36          0.35
map/auc           21.06         0.47
map/aucpr         27.16         0.59

large dataset (gpu => train, rank):

algo/metric       eval on cpu   eval on gpu
pairwise/map      9.69          1.47
pairwise/ndcg     19.04         1.97
pairwise/pre@2    9.02          1.4
pairwise/auc      80.29         1.75
pairwise/aucpr    115.1         1.87
ndcg/map          9.53          1.4
ndcg/ndcg         18.86         1.9
ndcg/pre@2        8.85          1.33
ndcg/auc          79.01         1.68
ndcg/aucpr        111.41        1.79
map/map           9.57          1.41
map/ndcg          18.89         1.91
map/pre@2         8.82          1.34
map/auc           79.11         1.69
map/aucpr         111.46        1.79

@trivialfis (Member) commented Feb 20, 2020

Is there any way we can dispatch the computation kernel based on generic_parameter_.gpu_id? If we can't unify the kernel like the other element-wise metrics/objectives, I would like a clear separation between the two implementations, with the .cu files containing only GPU kernels.

Clarification:
By kernel, I mean any core computation function, not necessarily __global__. And my point is that it would be better if we could avoid the include trick. I created it when porting the element-wise objectives, where I could write the code once and have both host and device implementations. If that's not the case here, I think the fewer tricks, the better.

@sriramch (Contributor Author)

the following shows the auc and aucpr metric computation improvements on cpu for a non-ranking dataset, in this case higgs.

  • same host spec as above and builds 100 trees
  • train on gpu, metric computation/eval on cpu
  • all times are in seconds
metric   master   this pr
auc      82.95    68.7
aucpr    142.08   124.12

@sriramch sriramch changed the title [wip] ranking metric computation acceleration on gpu ranking metric computation acceleration on gpu Feb 20, 2020
@sriramch (Contributor Author)

And my point is it would be better if we can avoid the include trick.

@trivialfis - are you asking if it is possible to separate the cpu and gpu implementations completely into separate files without having the .cc file include the .cu file?

the simplest thing i can think of (for metric x) is this:

  • register metric x referencing the cpu implementation from the .cc file if it's a cpu-only build
  • if it's a gpu build, register x-cpu referencing the cpu implementation from the .cc file, and register metric x referencing the gpu implementation from the .cu file
    • the gpu implementation will then delegate responsibility to x-cpu when no device ordinal is set (by looking at the generic_parameter_.gpu_id that you mention)

@trivialfis (Member)

are you asking if it is possible to separate the cpu and gpu implementations completely into separate files without having the .cc file include the .cu file?

Yup. ;-)

My suggestion is simply two functions called CPUAUCKernel and GPUAUCKernel, very much like your EvalRankOnGPU/CPU, but defined as plain old functions instead of class members. Put the GPU function declaration in a header or the .cc file, then put the implementation in a .cu file. We may need to pass some extra arguments like tparam_; less nice, but plain old.

@RAMitchell might have better suggestions.

@trivialfis (Member)

Another thing is that we want to replace the Allreduce-of-mean-values approach with a sketch algorithm that approximates global sorting in a distributed environment. It's not strictly related to this PR, but it seems appropriate to leave a note here.

  - this is to avoid including the cuda file in the cc file, thus cluttering the code
    with ifdef gimmicks
  - this approach can be used when the 2 implementations diverge considerably
  - it is possible for such an approach to be used elsewhere (objectives and such) where
    the implementations diverge considerably
  - there is no impact on performance
@sriramch (Contributor Author)

@trivialfis i have now decoupled the cpu and the gpu implementations such that they reside independently, without resorting to the include trick (as you suggested). please see if this is palatable. if it is, i can roll this out to other areas where there is significant divergence between the two implementations.

@RAMitchell i would appreciate your input as well on this when you get a chance.

@RAMitchell (Member) left a comment

Sorry for the delay. Added some comments for now, will need to go more in depth. You can help by isolating any independent parts and we can do them one by one in separate PRs.

src/common/device_helpers.cuh
}

// Accessors that returns device pointer
inline const T *GetItemsPtr() const { return ditems_.data().get(); }
Member:

Pass Span rather than pointers.

// Accessors that returns device pointer
inline const T *GetItemsPtr() const { return ditems_.data().get(); }
inline uint32_t GetNumItems() const { return ditems_.size(); }
inline const caching_device_vector<T> &GetItems() const {
Member:

Span over const device vectors.

}

Metric *
GPUMetric::CreateGPUMetric(const std::string& name, GenericParameter const* tparam) {
Member:

Not sure why this is needed. I don't recall having to do this for other factory methods in GPU code.

Contributor Author:

the main idea is to decouple the cpu and gpu implementations without the cuda file include trick when there is significant divergence. @trivialfis raised this issue, and the discussion is captured earlier in this pr. more specifically, the following change may shed some light.

the gpu implementation is now dynamically looked up from the registry and dispatched when there is a valid device present and xgboost is gpu-enabled.

*/
// When device ordinal is present, we would want to build the metrics on the GPU. It is possible
Member:

This needs a little more thought from us if we want to use this approach going forward. Ideally we would have a consistent configuration logic when choosing GPU algorithms.

for (omp_ulong i = 0; i < ndata; ++i) {
exp_p_sum += h_preds[i];
// calculate AUC
double sum_pospair = 0.0;
Member:

There is a bit too much for me to review in one go. In particular, when stuff gets moved it's harder to see where the algorithm changed in the diff. Can we try to separate this into a few PRs: maybe one for moving things around, one for the logic for launching/configuring GPU metrics, and one per changed metric? Given that AUC is such a key metric, I really want to digest what's going on here.

@sriramch (Contributor Author)

@RAMitchell - i'll go through this and see if i can split this into multiple digestible pr's. in the interim, i'll comment on the few things you have raised.

sriramch added a commit to sriramch/xgboost that referenced this pull request Feb 28, 2020
  - this is the first of a handful of pr's that splits the larger pr dmlc#5326
  - it moves this facility to common (from ranking objective class), so that it can be
    used for metric computation
  - it also wraps all the bare device pointers in span
trivialfis pushed a commit that referenced this pull request Feb 29, 2020
- move segment sorter to common
- this is the first of a handful of pr's that splits the larger pr #5326
- it moves this facility to common (from ranking objective class), so that it can be
    used for metric computation
- it also wraps all the bare device pointers in span.
sriramch added a commit to sriramch/xgboost that referenced this pull request Mar 6, 2020
  - this is the last part of dmlc#5326 that has been split
  - this also includes dmlc#5387
@RAMitchell RAMitchell closed this Mar 28, 2020