
ranking metric computation acceleration on gpu #5326

Closed
wants to merge 12 commits

Conversation


@sriramch sriramch commented Feb 19, 2020

  • this pr accelerates map, ndcg, precision, auc and aucpr metrics on gpu
  • further, it also accelerates the auc and aucpr metrics on cpu

the performance numbers for ranking and non-ranking datasets are below.

@RAMitchell @trivialfis @rongou - please review

   - this pr accelerates map, ndcg, precision, auc and aucpr metrics on gpu
  • further, it also accelerates the auc and aucpr metrics on cpu

i'll post the performance numbers shortly
@sriramch sriramch changed the title - ranking metric computation acceleration on gpu [wip] ranking metric computation acceleration on gpu Feb 19, 2020
@sriramch (Contributor Author)

performance numbers

test environment

  • 1 socket
  • 6 cores/socket
  • 2 threads/core
  • 80 gb system memory
  • v100 gpu

test

  • uses all cpu threads
  • builds 100 trees
  • runs the different objective functions rank:pairwise, rank:ndcg and rank:map
  • computes metrics map, ndcg, pre@2, auc and aucpr
  • uses a small and a large mslr ranking benchmark dataset, with the following characteristics:
    • small mslr dataset containing ~ 3.6m training instances distributed over ~ 10k groups consuming 3.9 gb disk space
    • large mslr dataset containing ~ 11.3m training instances distributed over ~ 95k groups consuming 13 gb disk space
  • metric eval times are reported below

results

  • no additional gpu memory was used
  • all times are in seconds
  • first column is the eval time on cpu
  • second column is the eval time on gpu
small dataset (gpu => train, rank):

algo/metric       eval on cpu   eval on gpu
pairwise/map      2.51          0.39
pairwise/ndcg     6             0.57
pairwise/pre@2    2.38          0.37
pairwise/auc      21.27         0.48
pairwise/aucpr    27.92         0.61
ndcg/map          2.57          0.38
ndcg/ndcg         6.02          0.56
ndcg/pre@2        2.31          0.35
ndcg/auc          21.09         0.47
ndcg/aucpr        27.31         0.59
map/map           2.43          0.38
map/ndcg          5.84          0.56
map/pre@2         2.36          0.35
map/auc           21.06         0.47
map/aucpr         27.16         0.59

large dataset (gpu => train, rank):

algo/metric       eval on cpu   eval on gpu
pairwise/map      9.69          1.47
pairwise/ndcg     19.04         1.97
pairwise/pre@2    9.02          1.4
pairwise/auc      80.29         1.75
pairwise/aucpr    115.1         1.87
ndcg/map          9.53          1.4
ndcg/ndcg         18.86         1.9
ndcg/pre@2        8.85          1.33
ndcg/auc          79.01         1.68
ndcg/aucpr        111.41        1.79
map/map           9.57          1.41
map/ndcg          18.89         1.91
map/pre@2         8.82          1.34
map/auc           79.11         1.69
map/aucpr         111.46        1.79

@trivialfis (Member) commented Feb 20, 2020

Is there any way we can dispatch the computation kernel based on generic_parameter_.gpu_id? If we can't unify the kernel like the other element-wise metrics/objectives, I would like a clear separation between the two implementations, with the .cu files containing only GPU kernels.

Clarification:
By kernel, I mean any core computation function, not necessarily __global__. And my point is that it would be better if we could avoid the include trick. I created it when porting the element-wise objectives, where I could write the code once and have both host and device implementations. If that's not the case here, I think the fewer tricks, the better.

@sriramch (Contributor Author)

the following shows the auc and aucpr metric computation improvements on cpu for a non-ranking dataset, in this case higgs.

  • same host spec as above and builds 100 trees
  • train on gpu, metric computation/eval on cpu
  • all times are in seconds
metric   master   this pr
auc      82.95    68.7
aucpr    142.08   124.12

@sriramch sriramch changed the title [wip] ranking metric computation acceleration on gpu ranking metric computation acceleration on gpu Feb 20, 2020
@sriramch (Contributor Author)

And my point is it would be better if we can avoid the include trick.

@trivialfis - are you asking if it is possible to separate the cpu and gpu implementations completely into separate files without having the .cc file include the .cu file?

the simplest thing i can think of (for metric x) is this:

  • register metric x referencing the cpu implementation from the .cc file if it's a cpu-only build
  • if it's a gpu build, register x-cpu referencing the cpu implementation from the .cc file, and register metric x referencing the gpu implementation from the .cu file
    • the gpu implementation will then delegate responsibility to x-cpu when no device ordinal is set (by looking at the generic_parameter_.gpu_id that you mention)

@trivialfis (Member)

are you asking if it is possible to separate the cpu and gpu implementations completely into separate files without having the .cc file include the .cu file?

Yup. ;-)

My suggestion is simply two functions called CPUAUCKernel and GPUAUCKernel, very much like your EvalRankOnGPU/CPU, but defined as plain old functions instead of class members. Put the GPU function declaration in a header or the .cc file, then put the implementation in a .cu file. We may need to pass some extra arguments like tparam_; less nice, but plain old.

@RAMitchell might have better suggestions.

@trivialfis (Member)

Another thing is that we want to replace the Allreduce-of-mean-values approach with a sketch algorithm that approximates global sorting in a distributed environment. It's not strictly related to this PR, but it seems appropriate to leave a note here.

  - this is to avoid including the cuda file in the cc file, thus cluttering the code
    with ifdef gimmicks
  - this approach can be used when the 2 implementations diverge considerably
  - it is possible for such an approach to be used elsewhere (objectives and such) where
    the implementations diverge considerably
  - there is no impact on performance
@sriramch (Contributor Author)

@trivialfis i have now decoupled the cpu and the gpu implementations such that they reside independently, without resorting to the include trick (as you suggested). please see if this is palatable. if it is, i can roll this out to other areas where there is significant divergence between the two implementations.

@RAMitchell i would appreciate your input as well on this when you get a chance.

@RAMitchell (Member) left a comment

Sorry for the delay. Added some comments for now, will need to go more in depth. You can help by isolating any independent parts and we can do them one by one in separate PRs.

src/common/device_helpers.cuh
}

// Accessors that returns device pointer
inline const T *GetItemsPtr() const { return ditems_.data().get(); }
Member:

Pass Span rather than pointers.

// Accessors that returns device pointer
inline const T *GetItemsPtr() const { return ditems_.data().get(); }
inline uint32_t GetNumItems() const { return ditems_.size(); }
inline const caching_device_vector<T> &GetItems() const {
Member:

Span over const device vectors.

}

Metric *
GPUMetric::CreateGPUMetric(const std::string& name, GenericParameter const* tparam) {
Member:

Not sure why this is needed. I don't recall having to do this for other factory methods in GPU code.

Contributor Author:

the main idea is to decouple the cpu and gpu implementations without the cuda file include trick when there is significant divergence. @trivialfis raised this issue, and the discussion is captured earlier in this pr. more specifically, the following change may shed some light.

the gpu implementation is now dynamically looked up from the registry and dispatched when there is a valid device present and xgboost is gpu-enabled.

*/
// When device ordinal is present, we would want to build the metrics on the GPU. It is possible
Member:

This needs a little more thought from us if we want to use this approach going forward. Ideally we would have a consistent configuration logic when choosing GPU algorithms.

for (omp_ulong i = 0; i < ndata; ++i) {
exp_p_sum += h_preds[i];
// calculate AUC
double sum_pospair = 0.0;
Member:

There is a bit too much for me to review in one go. In particular, when stuff gets moved it's harder to see where the algorithm changed in the diff. Can we try to separate this into a few PRs: maybe one for moving things around, one for the logic for launching/configuring GPU metrics, and one per changed metric? Given that AUC is such a key metric, I really want to digest what's going on here.

@sriramch (Contributor Author)

@RAMitchell - i'll go through this and see if i can split this into multiple digestible pr's. in the interim, i'll comment on the few things you have raised.

sriramch added a commit to sriramch/xgboost that referenced this pull request Feb 28, 2020
  - this is the first of a handful of pr's that splits the larger pr dmlc#5326
  - it moves this facility to common (from ranking objective class), so that it can be
    used for metric computation
  - it also wraps all the bare device pointers in span
trivialfis pushed a commit that referenced this pull request Feb 29, 2020
- move segment sorter to common
- this is the first of a handful of pr's that splits the larger pr #5326
- it moves this facility to common (from ranking objective class), so that it can be
    used for metric computation
- it also wraps all the bare device pointers in span.
sriramch added a commit to sriramch/xgboost that referenced this pull request Mar 6, 2020
  - this is the last part of dmlc#5326 that has been split
  - this also includes dmlc#5387
@RAMitchell RAMitchell closed this Mar 28, 2020