
ndcg ltr implementation on gpu #5004

Merged: 19 commits merged into dmlc:master on Nov 12, 2019
Conversation

sriramch (Contributor) commented Nov 1, 2019

  • this is a follow-up to the pairwise ltr implementation
  • one more is remaining after this (which will follow this pr)
  • i'll work on getting the performance numbers
hcho3 (Collaborator) commented Nov 1, 2019

@sriramch Which reference did you use for the ranking algorithm? I was looking at Chris J.C. Burges's paper and I personally find it hard to understand. I'm wondering if you have another reference I can look at.

sriramch (Contributor, Author) commented Nov 1, 2019

> @sriramch Which reference did you use for the ranking algorithm? I was looking at Chris J.C. Burges's paper and I personally find it hard to understand. I'm wondering if you have another reference I can look at.

@hcho3 - the primary ones i looked at are the following (for pairwise):

http://wwwconference.org/proceedings/www2011/proceedings/p377.pdf
http://159.226.40.238/~junxu/publications/IRJ2010_LETOR.pdf

that said, i'm primarily using the cpu implementation as the reference to model this work.

i haven't come across the paper you referenced.

sriramch (Contributor, Author) commented Nov 1, 2019

performance numbers

test environment

  • 1 socket
  • 6 cores/socket
  • 2 threads/core
  • 80 gb system memory
  • v100 gpu

test

  • tree method is gpu_hist
  • uses all cpu threads
  • 100 trees
  • objective is rank:ndcg

small mslr dataset

  • ~ 3.6m training instances
  • distributed over ~ 10k groups
  • consuming 3.9 gb disk space
  • master does the gradient computation on cpu
  • fork does the gradient computation on gpu - i.e. this pr
| branch | peak memory usage (MB) | time (s) / MAP metric | time (s) / RMSE metric | time (s) / AUC metric |
| --- | --- | --- | --- | --- |
| master | 2163 | 9.75s / .642747 | 9.82s / 1.104549 | 9.97s / 1.007689 |
| fork | 2177 | 0.61s / .642742 | 0.61s / 1.109999 | 0.62s / 1.004864 |

large mslr dataset

  • ~ 11.3m training instances
  • distributed over ~ 95k groups
  • consuming 13 gb disk space
  • master does the gradient computation on cpu
  • fork does the gradient computation on gpu - i.e. this pr
| branch | peak memory usage (MB) | time (s) / MAP metric | time (s) / RMSE metric | time (s) / AUC metric |
| --- | --- | --- | --- | --- |
| master | 4328 | 28.63s / .651674 | 28.2s / 1.084873 | 28.71s / 1.037019 |
| fork | 4371 | 1.62s / .650833 | 1.62s / 1.081308 | 1.62s / 1.035145 |

sriramch (Contributor, Author) commented Nov 1, 2019

@hcho3 - is the last test failure a transient error that plausibly could resolve itself with a restart?

the pre-lint/clang commit passed this stage, hence the question.

hcho3 (Collaborator) commented Nov 1, 2019

@sriramch I restarted it. We have a backlog item to remove Travis CI: #4498

codecov-io commented Nov 1, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@1733c9e).
The diff coverage is n/a.


@@            Coverage Diff            @@
##             master    #5004   +/-   ##
=========================================
  Coverage          ?   71.52%           
=========================================
  Files             ?       11           
  Lines             ?     2311           
  Branches          ?        0           
=========================================
  Hits              ?     1653           
  Misses            ?      658           
  Partials          ?        0

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1733c9e...ad7547e.

sriramch (Contributor, Author) commented Nov 1, 2019

> @sriramch I restarted it. We have a backlog item to remove Travis CI: #4498

@hcho3 thanks for your help!

sriramch (Contributor, Author) commented Nov 1, 2019

@RAMitchell @trivialfis @rongou please review when you get a chance.

RAMitchell (Member) left a comment

I haven't made an effort to understand the algorithm but have left some general comments. Speedups look amazing and your code/documentation is great. We will need more systematic integration tests as these new ranking objectives are being added.

// If left is false, find the number of elements < v; 0 if nothing is lesser
template <typename T>
__device__ __forceinline__ uint32_t
CountNumItemsImpl(bool left, const T * __restrict__ items, uint32_t n, T v) {
RAMitchell (Member):

This function could possibly be generalised with dh::UpperBound into a binary search function; bool left would become a templated comparison operator. Returning the index as well as the element would give you the count.

Just an idea, not necessary for this PR but could be nice if you plan to use this kind of operator in future.
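A rough sketch of what such a generalisation could look like (the function name, signature, and comparator are illustrative, not code from this PR): a device binary search parameterised on a comparison functor, whose returned index doubles as the count.

```cuda
// Hypothetical sketch: with comp = thrust::less<T>{} this returns the
// lower-bound index, i.e. the number of elements < v in a sorted range;
// other functors count elements satisfying comp(item, v).
template <typename T, typename Comp>
__device__ __forceinline__ uint32_t
CountItemsIf(const T *__restrict__ items, uint32_t n, T v, Comp comp) {
  uint32_t first = 0;
  while (n > 0) {
    uint32_t half = n >> 1;
    uint32_t mid = first + half;
    if (comp(items[mid], v)) {
      first = mid + 1;  // everything up to and including mid satisfies comp
      n -= half + 1;
    } else {
      n = half;         // keep searching the left half
    }
  }
  return first;
}
```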

sriramch (Contributor, Author):

@RAMitchell thanks for your suggestion. let me work on this as a follow-up pr.

dh::safe_cuda(cudaGetDevice(&device_id));
uint32_t *dindexable_sorted_preds_pos_ptr = dindexable_sorted_preds_pos_ptr_;
// Sort the positions (as indices), and group its indices as sorted prediction positions
dh::LaunchN(device_id, dindexable_sorted_preds_pos_->size(),
RAMitchell (Member):

I think this is just a thrust scatter operation. Prefer thrust for readability.

Not sure if you have used permutation iterators before, but they are also useful for these types of situations. You can create a 'virtual' permutation on an array without actually moving data around.
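For reference, a minimal sketch of both ideas (variable and function names are illustrative, not this PR's code):

```cuda
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/scatter.h>

// Same effect as the hand-rolled LaunchN loop: for every i, write i into
// (*out)[sorted_pos[i]].
void BuildIndexablePositions(const thrust::device_vector<uint32_t> &sorted_pos,
                             thrust::device_vector<uint32_t> *out) {
  thrust::scatter(
      thrust::counting_iterator<uint32_t>(0),
      thrust::counting_iterator<uint32_t>(static_cast<uint32_t>(sorted_pos.size())),
      sorted_pos.begin(), out->begin());
}

// A permutation iterator gives a 'virtual' reordering of preds through
// sorted_pos without moving any data:
//   auto view = thrust::make_permutation_iterator(preds.begin(),
//                                                 sorted_pos.begin());
```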

sriramch (Contributor, Author):

done.

@@ -76,4 +76,33 @@ TEST(Objective, DeclareUnifiedTest(PairwiseRankingGPairSameLabels)) {
ASSERT_NO_THROW(obj->DefaultEvalMetric());
}

TEST(Objective, DeclareUnifiedTest(NDCGRankingGPair)) {
RAMitchell (Member):

These tests look a little sparse considering the amount of changes. e.g. what if problems occur only when more than one GPU block is launched?

Consider some more c++ tests and integration tests in python (or scala if you prefer).

sriramch (Contributor, Author):

fair point. i'll work on this next in the context of this pr.

// Wait until the computations done by the kernel is complete
dh::safe_cuda(cudaStreamSynchronize(nullptr));

weight_computer.ReleaseResources();
RAMitchell (Member):

weight_computer.ReleaseResources();

I am trying to move away from having objects in half configured states in xgboost. For example if you use an object in the wrong way or at the wrong time you would get a seg fault or some cryptic change of behaviour.

Instead we want to create an object on the stack when we need it, guarantee that it is in a usable state after construction, then let it be destroyed. This was previously hard to do without performance penalties because GPU memory allocation is expensive, but I think it is possible with caching allocators.
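A minimal sketch of that pattern (the class and its use are illustrative, assuming dh::caching_device_vector keeps repeated allocation cheap):

```cuda
// Fully usable right after construction; device memory goes back to the
// caching allocator in the destructor. No ReleaseResources() and no
// half-configured intermediate state.
class WeightComputerScratch {
 public:
  explicit WeightComputerScratch(size_t n) : scratch_(n) {}

  uint32_t *Data() { return scratch_.data().get(); }

 private:
  dh::caching_device_vector<uint32_t> scratch_;
};

// Usage sketch: construct on the stack only for the duration of the
// computation (LaunchWeightKernel is a hypothetical kernel launcher).
//   {
//     WeightComputerScratch scratch(nitems);
//     LaunchWeightKernel(scratch.Data(), nitems);
//   }  // memory returned to the cache here
```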

sriramch (Contributor, Author):

please take a look. i have introduced an abstraction in fcb55e1 that lets the types be reference counted. this lets the objects themselves keep track of how many copies of them exist.
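The gist of the idea, as a very rough host-side sketch (this is not the actual RefCountable from fcb55e1, which also works inside kernels; assignment operators are omitted for brevity):

```cuda
#include <atomic>
#include <cuda_runtime.h>

// Shallow copies share a counter; only the last surviving copy releases the
// expensive device resource.
class SharedDeviceBuffer {
 public:
  explicit SharedDeviceBuffer(size_t n) : count_(new std::atomic<int>(1)) {
    cudaMalloc(&data_, n * sizeof(float));
  }
  SharedDeviceBuffer(const SharedDeviceBuffer &other)
      : count_(other.count_), data_(other.data_) {
    count_->fetch_add(1);
  }
  ~SharedDeviceBuffer() {
    if (count_->fetch_sub(1) == 1) {  // 'this' is the *only* instance left
      cudaFree(data_);
      delete count_;
    }
  }
  float *Data() { return data_; }

 private:
  std::atomic<int> *count_;
  float *data_{nullptr};
};
```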

trivialfis (Member) commented Nov 5, 2019:

@sriramch That's impressive. Never thought about adding an intrusive ptr in kernel.

// Need this on the device as it is used in the kernels
dh::caching_device_vector<uint32_t> dgroups_; // Group information on device

dh::XGBCachingDeviceAllocator<char> alloc_; // Allocator to be used by sort for managing
RAMitchell (Member):

Just a note that this allocator does not necessarily need to be a member variable; it can be created and destroyed whenever you need it, without a performance penalty.
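For instance (a sketch; the surrounding call is illustrative and assumes xgboost's device_helpers.cuh is in scope):

```cuda
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

void SortKeys(dh::caching_device_vector<uint32_t> *dkeys) {
  // Created only for the duration of the sort; allocations come from a
  // caching pool, so constructing it here is cheap.
  dh::XGBCachingDeviceAllocator<char> alloc;
  thrust::sort(thrust::cuda::par(alloc), dkeys->begin(), dkeys->end());
}
```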

sriramch (Contributor, Author):

done

trivialfis (Member) commented:
Will look into this tomorrow.

trivialfis (Member) commented Nov 3, 2019

@hcho3

> Which reference did you use for the ranking algorithm?

I read Learning to Rank for Information Retrieval by Tie-Yan Liu to get a sense of what learning to rank is, and then the paper you referred to. Hope that helps.

trivialfis (Member) commented:
Wrong link. Fixed. Sorry. Some slides are here if you are not into the whole book.

  - create a new abstraction that lets types holding expensive resources be shallow copied
    and reference counted on both kernels and hosts
  - the types holding these resources can then decide when it is appropriate to release those
    resources - which will be when 'this' is the *only* instance left
sriramch (Contributor, Author) commented Nov 5, 2019

@hcho3 is there a way to get the nvidia driver version on windows? i looked at the build logs, but couldn't find it (i could only find the nvcc compiler version). typically, nvidia-smi provides this information.

trivialfis (Member) commented:
@sriramch You can add a line to

bat "nvcc --version"

  - this is to check and see if it is running a version that has issues with unified
    memory addressing
sriramch (Contributor, Author) commented Nov 5, 2019

> @sriramch You can add a line to
>
> bat "nvcc --version"

thanks @trivialfis for your tip. i'll try now...

…e easy to detect the path on windows where nvidia-smi is installed
RAMitchell (Member) commented Nov 5, 2019

I think I see what you are doing with RefCountable. The problem is that we have a memory owning class on the host that we would like to use in a kernel, but the device vector objects cannot be shallow copied and used in the kernel unless we construct them as pointers. My solution to this is normally:

Create a nested class inside NDCGLambdaWeightComputer, call it NDCGLambdaWeightComputerDeviceAccessor. This nested class contains pointers to the device vectors in the parent class and methods to be called inside the kernel. This class is trivially copyable.

Before launching the kernel call
auto accessor = weight_computer.GetAccessor()
And copy this into the kernel by lambda capture.

So basically we create thin wrapper classes for use in kernels with some clear semantics on how they are related to memory owning classes that exist on the host.

WDYT?
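Roughly the shape being suggested, as a sketch (the nested class, members, and helper names here are illustrative rather than the PR's final code):

```cuda
class NDCGLambdaWeightComputer {
 public:
  // Trivially copyable view over the owner's device memory; safe to capture
  // by value inside a __device__ lambda.
  struct DeviceAccessor {
    const uint32_t *dgroups;
    uint32_t ngroups;
    __device__ uint32_t GroupBegin(uint32_t gidx) const { return dgroups[gidx]; }
  };

  DeviceAccessor GetAccessor() const {
    return DeviceAccessor{dgroups_.data().get(),
                          static_cast<uint32_t>(dgroups_.size())};
  }

 private:
  dh::caching_device_vector<uint32_t> dgroups_;  // owned only on the host side
};

// Before launching the kernel:
//   auto accessor = weight_computer.GetAccessor();
//   dh::LaunchN(device_id, n, [=] __device__(size_t idx) {
//     // ... accessor.GroupBegin(...) ...
//   });
```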

  device addresses that can be managed/copied on demand cheaply
  - using managed memory on windows requires a bit more work and isn't as seamless
    as it is on *nix systems
sriramch (Contributor, Author) commented Nov 6, 2019

> So basically we create thin wrapper classes for use in kernels with some clear semantics on how they are related to memory owning classes that exist on the host.
>
> WDYT?

[sc] sounds good to me. using managed memory needs a little more work to be used seamlessly cross-platform: as i realized after reading the cuda documentation, data movement between host and device isn't as transparent on windows as it is on linux. the unified memory model on windows behaves as pre-6.x, even when the device's compute capability is 6.x or greater. the upshot is that a periodic cudaDeviceSynchronize may be required to move data between the device and host.
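A small standalone illustration of that point (not code from this PR): on windows the host must synchronize before touching managed memory written by a kernel.

```cuda
#include <cuda_runtime.h>

__global__ void FillKernel(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = i;
}

void ManagedMemoryExample() {
  int *data = nullptr;
  cudaMallocManaged(&data, 1024 * sizeof(int));
  FillKernel<<<4, 256>>>(data, 1024);
  // On linux with compute capability >= 6.x, page faulting makes host access
  // largely transparent; on windows (pre-6.x behaviour) the host must not
  // touch 'data' until the device work has been synchronized.
  cudaDeviceSynchronize();
  int first = data[0];  // safe only after the synchronize above
  (void)first;
  cudaFree(data);
}
```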

trivialfis (Member) left a comment

LGTM. Amazing work!

…ion algorithm to pick a pairwise label within its group. this will get the metrics closer
- use stable_sort in the cpu version to match the gpu version
sriramch (Contributor, Author) commented:
@RAMitchell i have addressed your comments. please look.

the build failure seems to be a transient error, which i presume will go away with a restart of the build pipeline (or that specific piece of that pipeline).

@trivialfis thanks for your review.

trivialfis (Member) commented:
Restarted the build.

const uint32_t *dgroups = dgroups_.data().get();
uint32_t ngroups = dgroups_.size();
int device_id = -1;
dh::safe_cuda(cudaGetDevice(&device_id));
trivialfis (Member):

Actually it occurred to me that the device might not be appropriately set here yet, as the gradient computation comes before training.

sriramch (Contributor, Author):

wouldn't the learner configure the generic parameter, which is then passed to the objective function when it is created? further, this method is invoked on a cpu thread that has already set the device by extracting it from the generic parameter in ComputeGradientsOnGPU.

trivialfis (Member):

Yes. But I remember the learner class doesn't set the device with cuda, as it's a .cc file, so I was a little worried. I guess it's fine to recover it from the thread-local cuda state if you are sure it's set.

RAMitchell (Member):

Can one of you please confirm this before we merge? Otherwise I'm happy.

sriramch (Contributor, Author):

@RAMitchell - yes, this works as expected. i explicitly tried setting the gpu_id to a non-zero value on a multi-gpu node and was able to see the right gpu being used.
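For completeness, the behaviour relied on in this thread, as a tiny sketch: cudaGetDevice reads the calling CPU thread's current device, so it recovers whatever gpu_id was selected earlier on that same thread.

```cuda
#include <cuda_runtime.h>

int RecoverDeviceOnThisThread() {
  // Elsewhere on this same CPU thread, e.g. when the configured gpu_id was
  // applied:  cudaSetDevice(gpu_id);
  int device_id = -1;
  cudaGetDevice(&device_id);  // thread-local current device
  return device_id;
}
```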

src/objective/rank_obj.cu (review thread resolved)
RAMitchell merged commit 2abe69d into dmlc:master on Nov 12, 2019
The lock bot locked this conversation as resolved and limited it to collaborators on Feb 10, 2020