ndcg ltr implementation on gpu #5004
Conversation
sriramch
commented
Nov 1, 2019
- this is a follow-up to the pairwise ltr implementation
- one more remains after this (which will follow this pr)
- i'll work on getting the performance numbers
@sriramch Which reference did you use for the ranking algorithm? I was looking at Chris J.C. Burges's paper and I personally find it hard to understand. I'm wondering if you have another reference I can look at.
@hcho3 - the primary ones i looked at are the following (for pairwise): http://wwwconference.org/proceedings/www2011/proceedings/p377.pdf that said, i'm primarily looking at the cpu implementation as a reference implementation to model this. i haven't come across the paper you referenced.
performance numbers

[tables: test environment; results on the small mslr dataset; results on the large mslr dataset]
Codecov Report

@@           Coverage Diff            @@
##             master    #5004   +/-   ##
=========================================
  Coverage          ?    71.52%
=========================================
  Files             ?        11
  Lines             ?      2311
  Branches          ?         0
=========================================
  Hits              ?      1653
  Misses            ?       658
  Partials          ?         0
@RAMitchell @trivialfis @rongou please review when you get a chance.
I haven't made an effort to understand the algorithm but have left some general comments. Speedups look amazing and your code/documentation is great. We will need more systematic integration tests as these new ranking objectives are added.
// If left is false, find the number of elements < v; 0 if nothing is lesser
template <typename T>
__device__ __forceinline__ uint32_t
CountNumItemsImpl(bool left, const T * __restrict__ items, uint32_t n, T v) {
This function could possibly be generalised with dh::UpperBound into a binary search function; bool left would become a templated comparison operator. Returning the index as well as the element would give you the count.
Just an idea, not necessary for this PR, but it could be nice if you plan to use this kind of operator in future.
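A minimal sketch of that generalisation (names here are illustrative, not dh::UpperBound's actual signature): a device-side binary search parameterised on a comparator, where a less-than comparator yields the lower bound (count of items < v) and a less-or-equal comparator yields the upper bound, covering both halves of CountNumItemsImpl's left flag.

// Returns the first index at which comp(items[i], v) fails, which is also
// the count of elements preceding v under comp
template <typename T, typename Comp>
__device__ __forceinline__ uint32_t
CountPreceding(const T * __restrict__ items, uint32_t n, T v, Comp comp) {
  uint32_t lo = 0;
  uint32_t hi = n;
  while (lo < hi) {
    uint32_t mid = lo + (hi - lo) / 2;
    if (comp(items[mid], v)) {
      lo = mid + 1;  // items[mid] precedes v under comp; discard left half
    } else {
      hi = mid;      // items[mid] does not precede v; discard right half
    }
  }
  return lo;
}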
@RAMitchell thanks for your suggestion. let me work on this as a follow-up pr.
src/objective/rank_obj.cu
Outdated
dh::safe_cuda(cudaGetDevice(&device_id));
uint32_t *dindexable_sorted_preds_pos_ptr = dindexable_sorted_preds_pos_ptr_;
// Sort the positions (as indices), and group its indices as sorted prediction positions
dh::LaunchN(device_id, dindexable_sorted_preds_pos_->size(),
I think this is just a thrust scatter operation. Prefer thrust for readability.
Not sure if you have used permutation iterators before, but they are also useful for these types of situations. You can create a 'virtual' permutation on an array without actually moving data around.
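For illustration, a minimal self-contained sketch of both alternatives (the data and function names here are made up for the example, not taken from the PR):

#include <cstdint>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/scatter.h>

void PermutationExample() {
  std::vector<float> h_preds{10.f, 20.f, 30.f, 40.f};
  std::vector<uint32_t> h_pos{2, 0, 3, 1};  // pos[i]: the slot element i should land in
  thrust::device_vector<float> preds(h_preds.begin(), h_preds.end());
  thrust::device_vector<uint32_t> pos(h_pos.begin(), h_pos.end());

  // thrust::scatter materialises the permutation: out[pos[i]] = preds[i]
  thrust::device_vector<float> out(4);
  thrust::scatter(preds.begin(), preds.end(), pos.begin(), out.begin());

  // A permutation iterator applies an index map lazily: reading position i
  // yields preds[pos[i]] without moving any data until it is consumed
  auto permuted = thrust::make_permutation_iterator(preds.begin(), pos.begin());
  thrust::device_vector<float> gathered(4);
  thrust::copy(permuted, permuted + 4, gathered.begin());
}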
done.
@@ -76,4 +76,33 @@ TEST(Objective, DeclareUnifiedTest(PairwiseRankingGPairSameLabels)) {
  ASSERT_NO_THROW(obj->DefaultEvalMetric());
}

TEST(Objective, DeclareUnifiedTest(NDCGRankingGPair)) {
These tests look a little sparse considering the amount of changes. e.g. what if problems occur only when more than one GPU block is launched?
Consider some more c++ tests and integration tests in python (or scala if you prefer).
fair point. i'll work on this next in the context of this pr.
src/objective/rank_obj.cu
Outdated
// Wait until the computations done by the kernel are complete
dh::safe_cuda(cudaStreamSynchronize(nullptr));

weight_computer.ReleaseResources();
weight_computer.ReleaseResources();
I am trying to move away from having objects in half-configured states in xgboost. For example, if you use an object in the wrong way or at the wrong time, you get a seg fault or some cryptic change of behaviour.
Instead we want to create an object on the stack when we need it, guarantee that it is in a usable state after construction, then let it be destroyed. This was previously hard to do without performance penalties because GPU memory allocation is expensive, but I think it is possible with caching allocators.
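A minimal sketch of that pattern, reusing the dh::caching_device_vector that already appears in this PR (the class and function names here are illustrative, not the PR's actual code):

#include <cstddef>

class LambdaWeightComputer {
 public:
  explicit LambdaWeightComputer(std::size_t n) : weights_(n) {}
  float *Weights() { return weights_.data().get(); }
 private:
  // dh::caching_device_vector allocates through the caching allocator, so
  // repeated construct/destroy cycles avoid raw cudaMalloc/cudaFree costs
  dh::caching_device_vector<float> weights_;
};

void GetGradient() {
  LambdaWeightComputer computer(1 << 16);  // fully usable right after construction
  // ... launch kernels that read/write computer.Weights() ...
}  // memory returns to the pool when `computer` goes out of scope; no ReleaseResources() call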
please look. i have introduced an abstraction that lets the types be reference counted in fcb55e1. this will let the objects themselves keep track of how many copies they have.
@sriramch That's impressive. Never thought about adding an intrusive ptr in kernel.
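For readers unfamiliar with the idea, a host-side sketch of intrusive reference counting (the abstraction in fcb55e1 additionally makes the counting work from device code; this simplified version is illustrative only, and copy assignment is omitted for brevity):

#include <cstddef>
#include <cuda_runtime.h>

class RefCountedDeviceBuffer {
 public:
  explicit RefCountedDeviceBuffer(std::size_t n) : count_(new int(1)) {
    cudaMalloc(&data_, n * sizeof(float));
  }
  // Shallow copy: both instances share the buffer and bump the shared counter
  RefCountedDeviceBuffer(const RefCountedDeviceBuffer &other)
      : data_(other.data_), count_(other.count_) {
    ++(*count_);
  }
  ~RefCountedDeviceBuffer() {
    if (--(*count_) == 0) {  // 'this' is the only instance left: release
      cudaFree(data_);
      delete count_;
    }
  }
  float *Data() const { return data_; }
 private:
  float *data_{nullptr};
  int *count_{nullptr};
};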
src/objective/rank_obj.cu
Outdated
// Need this on the device as it is used in the kernels
dh::caching_device_vector<uint32_t> dgroups_;  // Group information on device

dh::XGBCachingDeviceAllocator<char> alloc_;  // Allocator to be used by sort for managing
Just a note that this allocator does not necessarily need to be a member variable; it can be created and destroyed whenever you need without performance penalty.
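For instance (a sketch; thrust::cuda::par(alloc) is the usual way xgboost hands the caching allocator to thrust, and the function name here is made up):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

void SortPredictions(thrust::device_vector<float> *preds) {
  // Constructed on demand: the allocator is a cheap handle over a shared
  // memory pool, so there is no need to keep it as a member variable
  dh::XGBCachingDeviceAllocator<char> alloc;
  thrust::sort(thrust::cuda::par(alloc), preds->begin(), preds->end());
}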
done
Will look into this tomorrow.
I read …
Wrong link. Fixed. Sorry. Some slides are here if you are not into the whole book.
- create a new abstraction that lets types holding expensive resources be shallow copied and reference counted on both kernels and hosts
- the types holding these resources can then decide when it is appropriate to release those resources - which will be when 'this' is the *only* instance left
@hcho3 is there a way to get the nvidia driver version on windows? i looked at the build logs, but couldn't find it (could only find the nvcc compiler version). typically, nvidia-smi reports this.
- this is to check and see if it is running a version that has issues with unified memory addressing
thanks @trivialfis for your tip. i'll try now...
…e easy to detect the path on windows where nvidia-smi is installed
I think I see what you are doing with RefCountable. The problem is that we have a memory-owning class on the host that we would like to use in a kernel, but the device vector objects cannot be shallow copied and used in the kernel unless we construct them as pointers.

My solution to this is normally: create a nested class inside NDCGLambdaWeightComputer, call it NDCGLambdaWeightComputerDeviceAccessor. This nested class contains pointers to the device vectors in the parent class and methods to be called inside the kernel. This class is trivially copyable. Before launching the kernel, construct one of these accessors and pass it to the kernel by value.

So basically we create thin wrapper classes for use in kernels with some clear semantics on how they are related to memory-owning classes that exist on the host. WDYT?
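A minimal sketch of that pattern (the method and member names are assumptions for illustration, following the naming in the comment above):

#include <cstdint>
#include <vector>
#include <thrust/device_vector.h>

class NDCGLambdaWeightComputer {
 public:
  // Trivially copyable view over the parent's device memory: safe to pass
  // to a kernel by value
  struct DeviceAccessor {
    const float *labels;
    uint32_t n;
    __device__ float LabelAt(uint32_t i) const { return labels[i]; }
  };

  explicit NDCGLambdaWeightComputer(const std::vector<float> &h_labels)
      : dlabels_(h_labels.begin(), h_labels.end()) {}

  DeviceAccessor GetDeviceAccessor() const {
    return DeviceAccessor{dlabels_.data().get(),
                          static_cast<uint32_t>(dlabels_.size())};
  }

 private:
  thrust::device_vector<float> dlabels_;  // owning storage stays on the host object
};

__global__ void SomeKernel(NDCGLambdaWeightComputer::DeviceAccessor acc) {
  uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < acc.n) {
    float label = acc.LabelAt(i);  // device-side access through raw pointers
    (void)label;
  }
}

// Launch site: the accessor is copied into the kernel by value
// SomeKernel<<<blocks, threads>>>(computer.GetDeviceAccessor());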
device addresses that can be managed/copied on demand cheaply - using managed memory on windows requires a bit more work and isn't as seamless as it is on *nix
[sc] sounds good to me. using managed memory needs a little bit more work to be used seamlessly x-platform. as i realized after reading the cuda documentation, data movement between host and device isn't as transparent on windows as it is on linux: windows always uses the pre-6.x unified memory model, even when the device's compute capability is 6.x or greater. the upshot is that periodic …
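To make the caveat concrete, a small plain-CUDA sketch of the synchronization this model forces (illustrative only, not the PR's code):

#include <cuda_runtime.h>

__global__ void Touch(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.f;
}

void ManagedExample() {
  const int n = 1024;
  float *data = nullptr;
  cudaMallocManaged(&data, n * sizeof(float));
  Touch<<<(n + 255) / 256, 256>>>(data, n);
  // Under the pre-6.x unified memory model (always the case on windows),
  // the host may not touch managed memory while the GPU owns it, so an
  // explicit synchronize is required before any host access
  cudaDeviceSynchronize();
  float first = data[0];  // safe only after the synchronize
  (void)first;
  cudaFree(data);
}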
- put it through the build system wringer to see what comes up
LGTM. Amazing work!
…ion algorithm to pick a pairwise label within its group. this will get the metrics closer - use stable_sort in the cpu version to match the gpu version
@RAMitchell i have addressed your comments. please look. the build failure seems to be a transient error, which i presume will go away with a restart of the build pipeline (or that specific piece of that pipeline). @trivialfis thanks for your review.
Restarted the build.
const uint32_t *dgroups = dgroups_.data().get();
uint32_t ngroups = dgroups_.size();
int device_id = -1;
dh::safe_cuda(cudaGetDevice(&device_id));
Actually it occurred to me that the device might not be appropriately set here yet, as gradient computation comes before training.
wouldn't learner configure the generic parameter, which is then passed to the objective function when it is created? further, this method is invoked on a cpu thread that already sets the device by extracting it from the generic parameter in ComputeGradientsOnGPU.
Yes. But I remember the learner class doesn't set the device with cuda, as it's a cc file. So I'm a little worried. I guess it's fine to recover it from thread-local cuda state if you are sure it's set.
Can one of you please confirm this before we merge; otherwise I'm happy.
@RAMitchell - yes, this works as expected. i explicitly tried setting the gpu_id to a non-zero value on a multi gpu node and was able to see the right gpu being used.
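For reference, a minimal sketch of the two options that were weighed here (illustrative; tparam_->gpu_id is an assumed spelling of the configured parameter, not necessarily the PR's exact code):

// Option 1 (what the PR relies on): recover the device from thread-local
// CUDA state, which works because ComputeGradientsOnGPU has already called
// cudaSetDevice on this thread
int device_id = -1;
dh::safe_cuda(cudaGetDevice(&device_id));

// Option 2 (the defensive alternative): set the device explicitly from the
// configured parameter before doing any device work
// dh::safe_cuda(cudaSetDevice(tparam_->gpu_id));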