Multi-GPU support in GPUPredictor. #3738
Conversation
- GPUPredictor is multi-GPU
- removed DeviceMatrix, as it has been made obsolete by using HostDeviceVector in DMatrix
src/predictor/gpu_predictor.cu
Outdated
auto& offsets = *out_offsets;
offsets.resize(devices.Size() + 1);
offsets[0] = 0;
#pragma omp parallel for schedule(static, 1) if (devices.Size() > 1)
Sorry for dropping by. Just to be safe, it might be better to save the current device before spawning threads that can change it. Otherwise, subsequent code could potentially access memory on the wrong device.
xgboost/src/common/device_helpers.cuh, line 947 in baef574:
class SaveCudaContext {
cudaSetDevice() changes the device of the caller thread only, so it does not alter the device in a different thread. In case only a single device is used (and no threads are spawned), cudaSetDevice() will be called with that device. Otherwise, the code using a GPU is responsible for setting the device being used. This means calling cudaSetDevice() in public methods, in shards, and in OpenMP loops (with 1 iteration per device). Private methods can assume that the right device has already been set.
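For illustration, a minimal sketch of that convention, mirroring the OpenMP pragma shown in the snippet above (dh::safe_cuda is xgboost's CUDA error-checking helper; the DeviceId accessor used here is an assumption):

// Sketch only: one loop iteration per shard/device. Each OpenMP thread selects
// its own device first, so everything else in that iteration runs on the right GPU.
#pragma omp parallel for schedule(static, 1) if (devices.Size() > 1)
for (int shard = 0; shard < static_cast<int>(devices.Size()); ++shard) {
  dh::safe_cuda(cudaSetDevice(devices.DeviceId(shard)));  // affects this thread only
  // ... per-device allocations, copies and kernel launches for this shard ...
}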
Just had a discussion with @trivialfis about this. My view is that @canonizer's approach is fine for now, although it would be good to look at better ways of managing the global multi-GPU state.
We have already had one difficult-to-find bug because the active device was not what was expected. Perhaps there is a way to manage this so we can explicitly prevent kernels from being called with the incorrect active device index.
@canonizer Can you add a test for multi-GPU prediction? I am about to add a multi-GPU slave worker to the Jenkins CI server. The multi-GPU tests will run as a separate task from the single-GPU tests.
As mentioned by @hcho3, it would be good if you could add explicit multi-GPU testing. We now have multi-GPU machines running on Jenkins.
You will also need to rebase this due to some minor conflicts with my recent DMatrix changes.
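A rough sketch of the kind of test being asked for, modelled on the existing gpu_predictor test quoted later in this thread (the test name and the n_gpus value are illustrative, not the test that actually landed):

// Sketch: configure gpu_predictor to use multiple GPUs and compare its output
// against cpu_predictor on the same data.
TEST(gpu_predictor, MGPU_Test) {
  std::unique_ptr<Predictor> gpu_predictor(Predictor::Create("gpu_predictor"));
  std::unique_ptr<Predictor> cpu_predictor(Predictor::Create("cpu_predictor"));

  gpu_predictor->Init({std::pair<std::string, std::string>("n_gpus", "-1")}, {});
  cpu_predictor->Init({}, {});

  // ... build a small model and DMatrix, call PredictBatch with both
  // predictors, and assert that the two prediction vectors agree ...
}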
src/predictor/gpu_predictor.cu
Outdated
@@ -143,19 +100,21 @@ struct DevicePredictionNode {

struct ElementLoader {
  bool use_shared;
  size_t* d_row_ptr;
  Entry* d_data;
  const size_t* d_row_ptr;
Would be good to use span instead of raw pointers here.
Done.
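For reference, the shape of the change, using xgboost's common::Span in place of the raw pointers (a sketch, not necessarily the exact code that landed):

// Sketch: a Span carries the pointer and element count together, so the loader
// no longer holds raw pointers whose bounds it cannot check.
struct ElementLoader {
  bool use_shared;
  common::Span<const size_t> d_row_ptr;  // CSR row offsets for this shard
  common::Span<const Entry> d_data;      // CSR entries for this shard
  // ...
};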
src/predictor/gpu_predictor.cu
Outdated
auto begin_ptr = d_data + d_row_ptr[ridx];
auto end_ptr = d_data + d_row_ptr[ridx + 1];
Entry* previous_middle = nullptr;
auto begin_ptr = d_data + d_row_ptr[ridx] - entry_start;
Can we generalise our other binary search code and use that here?
Yes, but probably in a different pull request.
If you want it in this pull request, feel free to comment on this, and I'll get back to it on Thursday.
No need to rush into cleaning up. We can do it later.
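For context, the code in question binary-searches a row's entries (sorted by feature index); a shared device-side helper could look roughly like this (a sketch, not the helper that was eventually factored out):

// Sketch of a reusable device-side lower_bound: returns the first position in
// [first, last) whose value is not less than the query.
template <typename IterT, typename ValueT>
__device__ IterT LowerBound(IterT first, IterT last, const ValueT& value) {
  while (first < last) {
    IterT middle = first + (last - first) / 2;
    if (*middle < value) {
      first = middle + 1;
    } else {
      last = middle;
    }
  }
  return first;
}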
src/predictor/gpu_predictor.cu
Outdated
@@ -225,14 +184,15 @@ __device__ float GetLeafWeight(bst_uint ridx, const DevicePredictionNode* tree,
template <int BLOCK_THREADS>
__global__ void PredictKernel(const DevicePredictionNode* d_nodes,
                              float* d_out_predictions, size_t* d_tree_segments,
                              int* d_tree_group, size_t* d_row_ptr,
                              Entry* d_data, size_t tree_begin,
                              int* d_tree_group, const size_t* d_row_ptr,
We might as well upgrade all of these raw pointers to spans.
Done.
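For reference, the upgraded kernel signature would look roughly along these lines (a sketch following the argument order in the snippet above; the remaining scalar parameters stay as they were):

// Sketch: read-only inputs become Span<const T>; the output stays a mutable Span.
template <int BLOCK_THREADS>
__global__ void PredictKernel(common::Span<const DevicePredictionNode> d_nodes,
                              common::Span<float> d_out_predictions,
                              common::Span<const size_t> d_tree_segments,
                              common::Span<const int> d_tree_group,
                              common::Span<const size_t> d_row_ptr,
                              common::Span<const Entry> d_data,
                              size_t tree_begin /* ... remaining scalar parameters ... */);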
src/predictor/gpu_predictor.cu
Outdated
auto& offsets = *out_offsets;
offsets.resize(devices.Size() + 1);
offsets[0] = 0;
#pragma omp parallel for schedule(static, 1) if (devices.Size() > 1)
Any updates on multi-GPU tests?
Added a multi-GPU test for GPUPredictor and addressed reviewers' comments.
src/predictor/gpu_predictor.cu
Outdated
auto begin_ptr = d_data + d_row_ptr[ridx];
auto end_ptr = d_data + d_row_ptr[ridx + 1];
Entry* previous_middle = nullptr;
auto begin_ptr = d_data.begin() + d_row_ptr[ridx] - entry_start;
Changing

auto begin_ptr = d_data.begin() + d_row_ptr[ridx] - entry_start;

to

auto begin_ptr = d_data.begin() + (d_row_ptr[ridx] - entry_start);

should pass the multi-GPU test.
I think you can make this change yourself now that you are a member.
src/predictor/gpu_predictor.cu
Outdated
auto end_ptr = d_data + d_row_ptr[ridx + 1];
Entry* previous_middle = nullptr;
auto begin_ptr = d_data.begin() + d_row_ptr[ridx] - entry_start;
auto end_ptr = d_data.begin() + d_row_ptr[ridx + 1] - entry_start;
Change

auto end_ptr = d_data.begin() + d_row_ptr[ridx + 1] - entry_start;

to

auto end_ptr = d_data.begin() + (d_row_ptr[ridx + 1] - entry_start);
And this one. :)
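For what it's worth, the parentheses likely matter because d_row_ptr holds global offsets into the full data, while d_data is only this shard's slice starting at entry_start; the local offset has to be formed before it is added to d_data.begin(), otherwise the intermediate iterator can point past the end of the shard's span. A made-up illustration:

// Illustrative numbers only: on a later shard, the global row offset exceeds
// the size of the shard's span.
//   d_row_ptr[ridx] == 900, entry_start == 800, d_data.size() == 200
auto bad  = d_data.begin() + d_row_ptr[ridx] - entry_start;    // forms begin() + 900 first
auto good = d_data.begin() + (d_row_ptr[ridx] - entry_start);  // forms begin() + 100 only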
Okay, I fixed a bug that was caught by my profiling script, but I can't reproduce the one on Jenkins.
It may be an out-of-memory error? Let me run it on my end using the same instance type.
@hcho3 Thanks! I ran cuda-memcheck and Sanitizer with no luck so far. The test requires only a very small amount of memory.
I compiled and ran this pull request on my p2.8xlarge instance and got the same error. I will run it through cuda-gdb to see if it helps.
@trivialfis @canonizer I got a backtrace by running cuda-gdb. Maybe this line is problematic? xgboost/src/predictor/gpu_predictor.cu, lines 233 to 236 in 10bc710
I added some diagnostic printout before the cudaMemcpy line. The last cudaMemcpy call fails because the device pointer for shard 6 is null.
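The printout itself isn't reproduced above; a diagnostic of roughly this shape (variable names are illustrative) is enough to expose a null shard pointer before the copy:

// Illustrative only: log the pointers and size for each shard before copying
// its predictions back to the host.
fprintf(stderr, "shard %d: dst=%p src=%p bytes=%zu\n", shard_idx,
        static_cast<void*>(dst), static_cast<const void*>(src), n_bytes);
dh::safe_cuda(cudaMemcpy(dst, src, n_bytes, cudaMemcpyDeviceToHost));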
model.CommitModel(std::move(trees), 0);
model.param.num_output_group = 1;

int n_row = 5;
Changing this line to

int n_row = 8;

gets rid of the segmentation fault.
The p2.8xlarge instance has 8 GPUs, so with a 5-row matrix some of the GPUs were getting 0 rows. We should handle this edge case either by restricting the number of devices when too few rows are given, or by correctly handling zero-row shards.
@trivialfis How should we handle this edge case?
@hcho3 cuda-gdb doesn't work for me with NCCL, neither on Fedora nor on Ubuntu. :(
GPUSet::All() has an optional parameter specifying the number of rows.
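So the first option from above reduces to passing the row count when the device set is built, something along these lines (the exact signature of GPUSet::All is assumed here):

// Sketch: let GPUSet::All cap the number of devices at the number of rows,
// so no shard ends up with zero rows.
GPUSet devices = GPUSet::All(param.n_gpus, dmat->Info().num_row_);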
Done.
* Reinitialize shards when GPUSet is changed.
* Tests range of data.
@hcho3 This again comes down to changing parameters. We need to handle the situation when they change between calls.
@canonizer, @hcho3, @RAMitchell I tried to overcome it with a check to see if the GPUSet has changed.
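The check being described might look roughly like this inside the predictor (member and method names are illustrative, and GPUSet is assumed to be comparable):

// Sketch: rebuild the per-device shards only when the requested device set
// differs from the one the shards were created for.
void ConfigureShards(GPUSet devices) {
  if (devices_ == devices) { return; }  // nothing changed, keep existing shards
  devices_ = devices;
  shards_.clear();
  shards_.resize(devices_.Size());
  // ... per-shard initialisation (device id, row ranges, buffers) goes here ...
}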
Codecov Report
@@ Coverage Diff @@
## master #3738 +/- ##
============================================
- Coverage 52.09% 52.06% -0.04%
- Complexity 196 203 +7
============================================
Files 181 181
Lines 14341 14358 +17
Branches 489 495 +6
============================================
+ Hits 7471 7475 +4
- Misses 6636 6645 +9
- Partials 234 238 +4
Continue to review full report at Codecov.
LGTM
@hcho3 Hi, could you take another look before I merge it?
src/objective/multiclass_obj.cu
Outdated
out_gpair->Reshard(GPUSet::Empty());
preds.Reshard(GPUSet::Empty());
// out_gpair->Reshard(GPUSet::Empty());
// preds.Reshard(GPUSet::Empty());
Why are we commenting out lines? If they are not needed, we should just remove them.
Done.
TEST(gpu_predictor, Test) {
  std::unique_ptr<Predictor> gpu_predictor =
      std::unique_ptr<Predictor>(Predictor::Create("gpu_predictor"));
  std::unique_ptr<Predictor> cpu_predictor =
      std::unique_ptr<Predictor>(Predictor::Create("cpu_predictor"));

  // gpu_predictor->Init({std::pair<std::string, std::string>("n_gpus", "1")}, {});
Remove this line too
Thanks for cleaning up for me.
Thanks everyone!
* Multi-GPU support in GPUPredictor.
  - GPUPredictor is multi-GPU
  - removed DeviceMatrix, as it has been made obsolete by using HostDeviceVector in DMatrix
* Replaced pointers with spans in GPUPredictor.
* Added a multi-GPU predictor test.
* Fix multi-gpu test.
* Fix n_rows < n_gpus.
* Reinitialize shards when GPUSet is changed.
* Tests range of data.
* Remove commented code.
* Remove commented code.
Closes #3756