ndcg ltr implementation on gpu #5004
Conversation
sriramch
commented
Nov 1, 2019
- this is a follow-up to the pairwise ltr implementation
- one more remains after this (which will follow this pr)
- i'll work on getting the performance numbers
@sriramch Which reference did you use for the ranking algorithm? I was looking at Chris J.C. Burges's paper and I personally find it hard to understand. I'm wondering if you have another reference I can look at.
@hcho3 - the primary ones i looked at are the following (for pairwise): http://wwwconference.org/proceedings/www2011/proceedings/p377.pdf that said, i'm primarily looking at the cpu implementation as a reference implementation to model this. i haven't come across the paper you referenced.
performance numbers

[tables: test environment; results on the small mslr dataset; results on the large mslr dataset]
Codecov Report

@@           Coverage Diff            @@
##             master    #5004   +/-   ##
=========================================
  Coverage          ?    71.52%
=========================================
  Files             ?        11
  Lines             ?      2311
  Branches          ?         0
=========================================
  Hits              ?      1653
  Misses            ?       658
  Partials          ?         0
@RAMitchell @trivialfis @rongou please review when you get a chance.
I haven't made an effort to understand the algorithm but have left some general comments. Speedups look amazing and your code/documentation is great. We will need more systematic integration tests as these new ranking objectives are added.
// If left is false, find the number of elements < v; 0 if nothing is lesser
template <typename T>
__device__ __forceinline__ uint32_t
CountNumItemsImpl(bool left, const T * __restrict__ items, uint32_t n, T v) {
This function could possibly be generalised with dh::UpperBound into a binary search function; bool left would become a templated comparison operator. Returning the index as well as the element would give you the count.
Just an idea, not necessary for this PR, but it could be nice if you plan to use this kind of operator in future.
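A minimal sketch of that generalisation (names here are illustrative, not dh::UpperBound's actual signature): a device-side binary search parameterised on a comparator, where a less-than comparator yields the lower bound (count of items < v) and a less-or-equal comparator yields the upper bound, covering both halves of CountNumItemsImpl's left flag.

// Returns the first index at which comp(items[i], v) fails, which is also
// the count of elements preceding v under comp
template <typename T, typename Comp>
__device__ __forceinline__ uint32_t
CountPreceding(const T * __restrict__ items, uint32_t n, T v, Comp comp) {
  uint32_t lo = 0;
  uint32_t hi = n;
  while (lo < hi) {
    uint32_t mid = lo + (hi - lo) / 2;
    if (comp(items[mid], v)) {
      lo = mid + 1;  // items[mid] precedes v under comp; discard left half
    } else {
      hi = mid;      // items[mid] does not precede v; discard right half
    }
  }
  return lo;
}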
@RAMitchell thanks for your suggestion. let me work on this as a follow-up pr.
src/objective/rank_obj.cu
Outdated
dh::safe_cuda(cudaGetDevice(&device_id));
uint32_t *dindexable_sorted_preds_pos_ptr = dindexable_sorted_preds_pos_ptr_;
// Sort the positions (as indices), and group its indices as sorted prediction positions
dh::LaunchN(device_id, dindexable_sorted_preds_pos_->size(),
I think this is just a thrust scatter operation. Prefer thrust for readability.
Not sure if you have used permutation iterators before, but they are also useful for these types of situations. You can create a 'virtual' permutation on an array without actually moving data around.
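For illustration, a minimal self-contained sketch of both alternatives (the data and function names here are made up for the example, not taken from the PR):

#include <cstdint>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/scatter.h>

void PermutationExample() {
  std::vector<float> h_preds{10.f, 20.f, 30.f, 40.f};
  std::vector<uint32_t> h_pos{2, 0, 3, 1};  // pos[i]: the slot element i should land in
  thrust::device_vector<float> preds(h_preds.begin(), h_preds.end());
  thrust::device_vector<uint32_t> pos(h_pos.begin(), h_pos.end());

  // thrust::scatter materialises the permutation: out[pos[i]] = preds[i]
  thrust::device_vector<float> out(4);
  thrust::scatter(preds.begin(), preds.end(), pos.begin(), out.begin());

  // A permutation iterator applies an index map lazily: reading position i
  // yields preds[pos[i]] without moving any data until it is consumed
  auto permuted = thrust::make_permutation_iterator(preds.begin(), pos.begin());
  thrust::device_vector<float> gathered(4);
  thrust::copy(permuted, permuted + 4, gathered.begin());
}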
done.
@@ -76,4 +76,33 @@ TEST(Objective, DeclareUnifiedTest(PairwiseRankingGPairSameLabels)) {
  ASSERT_NO_THROW(obj->DefaultEvalMetric());
}

TEST(Objective, DeclareUnifiedTest(NDCGRankingGPair)) {
These tests look a little sparse considering the amount of changes. e.g. what if problems occur only when more than one GPU block is launched?
Consider some more c++ tests and integration tests in python (or scala if you prefer).
fair point. i'll work on this next in the context of this pr.
src/objective/rank_obj.cu
Outdated
// Wait until the computations done by the kernel are complete
dh::safe_cuda(cudaStreamSynchronize(nullptr));

weight_computer.ReleaseResources();
weight_computer.ReleaseResources();
I am trying to move away from having objects in half-configured states in xgboost. For example, if you use an object in the wrong way or at the wrong time, you get a seg fault or some cryptic change of behaviour.
Instead we want to create an object on the stack when we need it, guarantee that it is in a usable state after construction, then let it be destroyed. This was previously hard to do without performance penalties because GPU memory allocation is expensive, but I think it is possible with caching allocators.
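A minimal sketch of that pattern, reusing the dh::caching_device_vector that already appears in this PR (the class and function names here are illustrative, not the PR's actual code):

#include <cstddef>

class LambdaWeightComputer {
 public:
  explicit LambdaWeightComputer(std::size_t n) : weights_(n) {}
  float *Weights() { return weights_.data().get(); }
 private:
  // dh::caching_device_vector allocates through the caching allocator, so
  // repeated construct/destroy cycles avoid raw cudaMalloc/cudaFree costs
  dh::caching_device_vector<float> weights_;
};

void GetGradient() {
  LambdaWeightComputer computer(1 << 16);  // fully usable right after construction
  // ... launch kernels that read/write computer.Weights() ...
}  // memory returns to the pool when `computer` goes out of scope; no ReleaseResources() call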
please look. i have introduced an abstraction that lets the types be reference counted in fcb55e1. this will let the objects themselves keep track of how many copies they have.
@sriramch That's impressive. Never thought about adding an intrusive ptr in kernel.
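For readers unfamiliar with the idea, a host-side sketch of intrusive reference counting (the abstraction in fcb55e1 additionally makes the counting work from device code; this simplified version is illustrative only, and copy assignment is omitted for brevity):

#include <cstddef>
#include <cuda_runtime.h>

class RefCountedDeviceBuffer {
 public:
  explicit RefCountedDeviceBuffer(std::size_t n) : count_(new int(1)) {
    cudaMalloc(&data_, n * sizeof(float));
  }
  // Shallow copy: both instances share the buffer and bump the shared counter
  RefCountedDeviceBuffer(const RefCountedDeviceBuffer &other)
      : data_(other.data_), count_(other.count_) {
    ++(*count_);
  }
  ~RefCountedDeviceBuffer() {
    if (--(*count_) == 0) {  // 'this' is the only instance left: release
      cudaFree(data_);
      delete count_;
    }
  }
  float *Data() const { return data_; }
 private:
  float *data_{nullptr};
  int *count_{nullptr};
};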
src/objective/rank_obj.cu
Outdated
// Need this on the device as it is used in the kernels
dh::caching_device_vector<uint32_t> dgroups_;  // Group information on device

dh::XGBCachingDeviceAllocator<char> alloc_;  // Allocator to be used by sort for managing
Just a note that this allocator does not necessarily need to be a member variable; it can be created and destroyed whenever you need without performance penalty.
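For instance (a sketch; thrust::cuda::par(alloc) is the usual way xgboost hands the caching allocator to thrust, and the function name here is made up):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

void SortPredictions(thrust::device_vector<float> *preds) {
  // Constructed on demand: the allocator is a cheap handle over a shared
  // memory pool, so there is no need to keep it as a member variable
  dh::XGBCachingDeviceAllocator<char> alloc;
  thrust::sort(thrust::cuda::par(alloc), preds->begin(), preds->end());
}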
done
Will look into this tomorrow.
I read …
Wrong link. Fixed. Sorry. Some slides are here if you are not into the whole book.
- create a new abstraction that lets types holding expensive resources be shallow copied and reference counted on both kernels and hosts
- the types holding these resources can then decide when it is appropriate to release those resources - which will be when 'this' is the *only* instance left
@hcho3 is there a way to get the nvidia driver version on windows? i looked at the build logs, but couldn't find it (could only find the nvcc compiler version). typically, nvidia-smi reports this.
- this is to check and see if it is running a version that has issues with unified memory addressing
thanks @trivialfis for your tip. i'll try now...
…e easy to detect the path on windows where nvidia-smi is installed
I think I see what you are doing with RefCountable. The problem is that we have a memory-owning class on the host that we would like to use in a kernel, but the device vector objects cannot be shallow copied and used in the kernel unless we construct them as pointers.

My solution to this is normally: create a nested class inside NDCGLambdaWeightComputer, call it NDCGLambdaWeightComputerDeviceAccessor. This nested class contains pointers to the device vectors in the parent class and methods to be called inside the kernel. This class is trivially copyable. Before launching the kernel, construct one of these accessors and pass it to the kernel by value.

So basically we create thin wrapper classes for use in kernels with some clear semantics on how they are related to memory-owning classes that exist on the host. WDYT?
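A minimal sketch of that pattern (the method and member names are assumptions for illustration, following the naming in the comment above):

#include <cstdint>
#include <vector>
#include <thrust/device_vector.h>

class NDCGLambdaWeightComputer {
 public:
  // Trivially copyable view over the parent's device memory: safe to pass
  // to a kernel by value
  struct DeviceAccessor {
    const float *labels;
    uint32_t n;
    __device__ float LabelAt(uint32_t i) const { return labels[i]; }
  };

  explicit NDCGLambdaWeightComputer(const std::vector<float> &h_labels)
      : dlabels_(h_labels.begin(), h_labels.end()) {}

  DeviceAccessor GetDeviceAccessor() const {
    return DeviceAccessor{dlabels_.data().get(),
                          static_cast<uint32_t>(dlabels_.size())};
  }

 private:
  thrust::device_vector<float> dlabels_;  // owning storage stays on the host object
};

__global__ void SomeKernel(NDCGLambdaWeightComputer::DeviceAccessor acc) {
  uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < acc.n) {
    float label = acc.LabelAt(i);  // device-side access through raw pointers
    (void)label;
  }
}

// Launch site: the accessor is copied into the kernel by value
// SomeKernel<<<blocks, threads>>>(computer.GetDeviceAccessor());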
device addresses that can be managed/copied on demand cheaply - using managed memory on windows requires a bit more work and isn't as seamless as it is on *nix
[sc] sounds good to me. using managed memory needs a little bit more work to be used seamlessly x-platform. as i realized after reading the cuda documentation, data movement between host and device isn't as transparent on windows as it is on linux: windows always uses the pre-6.x unified memory model, even when the device's compute capability is 6.x or greater. the upshot is that periodic …
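To make the caveat concrete, a small plain-CUDA sketch of the synchronization this model forces (illustrative only, not the PR's code):

#include <cuda_runtime.h>

__global__ void Touch(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.f;
}

void ManagedExample() {
  const int n = 1024;
  float *data = nullptr;
  cudaMallocManaged(&data, n * sizeof(float));
  Touch<<<(n + 255) / 256, 256>>>(data, n);
  // Under the pre-6.x unified memory model (always the case on windows),
  // the host may not touch managed memory while the GPU owns it, so an
  // explicit synchronize is required before any host access
  cudaDeviceSynchronize();
  float first = data[0];  // safe only after the synchronize
  (void)first;
  cudaFree(data);
}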
- put it through the build system wringer to see what comes up
LGTM. Amazing work!
…ion algorithm to pick a pairwise label within its group. this will get the metrics closer - use stable_sort in the cpu version to match the gpu version
@RAMitchell i have addressed your comments. please look. the build failure seems to be a transient error, which i presume will go away with a restart of the build pipeline (or that specific piece of that pipeline). @trivialfis thanks for your review.
Restarted the build.
const uint32_t *dgroups = dgroups_.data().get();
uint32_t ngroups = dgroups_.size();
int device_id = -1;
dh::safe_cuda(cudaGetDevice(&device_id));
Actually it occurred to me that the device might not be appropriately set here yet, as gradient computation comes before training.
wouldn't learner configure the generic parameter, which is then passed to the objective function when it is created? further, this method is invoked on a cpu thread that already sets the device by extracting it from the generic parameter in ComputeGradientsOnGPU.
Yes. But I remember the learner class doesn't set the device with cuda, as it's a cc file. So I'm a little worried. I guess it's fine to recover it from thread-local cuda state if you are sure it's set.
Can one of you please confirm this before we merge; otherwise I'm happy.
@RAMitchell - yes, this works as expected. i explicitly tried setting the gpu_id to a non-zero value on a multi gpu node and was able to see the right gpu being used.
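For reference, a minimal sketch of the two options that were weighed here (illustrative; tparam_->gpu_id is an assumed spelling of the configured parameter, not necessarily the PR's exact code):

// Option 1 (what the PR relies on): recover the device from thread-local
// CUDA state, which works because ComputeGradientsOnGPU has already called
// cudaSetDevice on this thread
int device_id = -1;
dh::safe_cuda(cudaGetDevice(&device_id));

// Option 2 (the defensive alternative): set the device explicitly from the
// configured parameter before doing any device work
// dh::safe_cuda(cudaSetDevice(tparam_->gpu_id));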