
Multi-GPU support in GPUPredictor. #3738

Merged: 8 commits merged into dmlc:master on Oct 24, 2018

Conversation

canonizer (Contributor) commented Sep 28, 2018

  • GPUPredictor is multi-GPU
  • removed DeviceMatrix, as it has been made obsolete by using HostDeviceVector in DMatrix

Closes #3756

auto& offsets = *out_offsets;
offsets.resize(devices.Size() + 1);
offsets[0] = 0;
#pragma omp parallel for schedule(static, 1) if (devices.Size() > 1)
Member

Sorry for dropping by. Just to be safe, it might be better to save the current device before spawning threads that can change it. Otherwise, subsequent code could potentially access memory on the wrong device.

class SaveCudaContext {

Contributor Author

cudaSetDevice() changes the device of the calling thread only, so it does not alter the device in a different thread.

If only a single device is used (and no threads are spawned), cudaSetDevice() will be called with that device.

Otherwise, the code using a GPU is responsible for setting the device being used. This means calling cudaSetDevice() in public methods, in shards, and in OpenMP loops (with one iteration per device). Private methods can assume that the right device has already been set.
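For illustration, a minimal sketch of that pattern, with one OpenMP iteration per device and each thread setting its own device before touching it (the function name and shard details here are hypothetical, not the actual GPUPredictor code):

#include <cuda_runtime.h>
#include <omp.h>
#include <vector>

// Hypothetical per-device loop: cudaSetDevice() only affects the calling
// thread, so each OpenMP thread can safely select its own device.
void RunOnAllDevices(const std::vector<int>& devices) {
  int n = static_cast<int>(devices.size());
#pragma omp parallel for schedule(static, 1) if (n > 1)
  for (int i = 0; i < n; ++i) {
    cudaSetDevice(devices[i]);  // affects this thread only
    // ... launch kernels / copy memory for shard i on devices[i] ...
  }
}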

Member

Just had a discussion with @trivialfis about this. My view is that @canonizer's approach is fine for now, although it would be good to look at better ways of managing the global multi-GPU state.

We have already had one difficult-to-find bug because the active device was not what was expected. Perhaps there is a way to manage this so that we can explicitly prevent kernels from being called with the incorrect active device index.
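For illustration, one way to make the active device explicit is an RAII guard along the lines of the SaveCudaContext mentioned above; this is only a hedged sketch, not xgboost's actual implementation:

#include <cuda_runtime.h>

// Hypothetical guard: remembers the caller's current device, switches to the
// requested one, and restores the original device when the scope ends.
class DeviceGuard {
 public:
  explicit DeviceGuard(int device) {
    cudaGetDevice(&saved_);
    cudaSetDevice(device);
  }
  ~DeviceGuard() { cudaSetDevice(saved_); }

 private:
  int saved_ {0};
};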

hcho3 (Collaborator) commented Sep 28, 2018

@canonizer Can you add a test for multi-GPU prediction? I am about to add a multi-GPU slave worker to the Jenkins CI server. The multi-GPU tests will run as a separate task from the single-GPU tests.

RAMitchell (Member) left a comment


As mentioned by @hcho3, it would be good if you could add explicit multi-GPU testing. We now have multi-GPU machines running on Jenkins.

You will also need to rebase this due to some minor conflicts with my recent dmatrix changes.

@@ -143,19 +100,21 @@ struct DevicePredictionNode {

struct ElementLoader {
bool use_shared;
size_t* d_row_ptr;
Entry* d_data;
const size_t* d_row_ptr;
Member

Would be good to use span instead of raw pointers here.

Contributor Author

Done.

auto begin_ptr = d_data + d_row_ptr[ridx];
auto end_ptr = d_data + d_row_ptr[ridx + 1];
Entry* previous_middle = nullptr;
auto begin_ptr = d_data + d_row_ptr[ridx] - entry_start;
Member

Can we generalise our other binary search code and use that here?

Contributor Author

Yes, but probably in a different pull request.

If you want it in this pull request, feel free to comment on this, and I'll get back to it on Thursday.

Member

No need to rush into cleaning up. We can do it later.
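For reference, a generic device-side binary search of the kind being discussed might look like the following; this is only a hedged sketch, not code from this pull request:

#include <cuda_runtime.h>

// Hypothetical generic device-side binary search: returns the first position
// in [first, last) whose element compares greater than value (an upper bound),
// assuming the range is sorted in ascending order.
template <typename IterT, typename ValueT>
__device__ IterT UpperBound(IterT first, IterT last, ValueT value) {
  while (first < last) {
    IterT middle = first + (last - first) / 2;
    if (*middle <= value) {
      first = middle + 1;
    } else {
      last = middle;
    }
  }
  return first;
}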

@@ -225,14 +184,15 @@ __device__ float GetLeafWeight(bst_uint ridx, const DevicePredictionNode* tree,
template <int BLOCK_THREADS>
__global__ void PredictKernel(const DevicePredictionNode* d_nodes,
float* d_out_predictions, size_t* d_tree_segments,
int* d_tree_group, size_t* d_row_ptr,
Entry* d_data, size_t tree_begin,
int* d_tree_group, const size_t* d_row_ptr,
Member

We might as well upgrade all of these raw pointers to spans.

Contributor Author

Done.


hcho3 (Collaborator) commented Oct 10, 2018

Any updates on multi-GPU tests?

canonizer (Contributor Author)

Added a multi-GPU test for GPUPredictor and addressed reviewers' comments.

auto begin_ptr = d_data + d_row_ptr[ridx];
auto end_ptr = d_data + d_row_ptr[ridx + 1];
Entry* previous_middle = nullptr;
auto begin_ptr = d_data.begin() + d_row_ptr[ridx] - entry_start;
Member

Changing to

Suggested change
auto begin_ptr = d_data.begin() + d_row_ptr[ridx] - entry_start;
auto begin_ptr = d_data.begin() + (d_row_ptr[ridx] - entry_start);

should pass the multi-GPU test. (With a bounds-checked span, the parentheses matter: adding the global offset d_row_ptr[ridx] before subtracting entry_start would step past the end of the shard's span.)

Member

I think you can make this change yourself now that you are a member.

auto end_ptr = d_data + d_row_ptr[ridx + 1];
Entry* previous_middle = nullptr;
auto begin_ptr = d_data.begin() + d_row_ptr[ridx] - entry_start;
auto end_ptr = d_data.begin() + d_row_ptr[ridx + 1] - entry_start;
Member

Suggested change
auto end_ptr = d_data.begin() + d_row_ptr[ridx + 1] - entry_start;
auto end_ptr = d_data.begin() + (d_row_ptr[ridx + 1] - entry_start);

And this one. :)

trivialfis (Member)

Okay, I fixed a bug that was caught by my profiling script, but I can't reproduce the one on Jenkins.

hcho3 (Collaborator) commented Oct 23, 2018

Could it be an out-of-memory error? Let me run it on my end using the same instance type.

trivialfis (Member)

@hcho3 Thanks! I ran cuda-memcheck and the sanitizer with no luck so far. The test requires only a very small amount of memory.

hcho3 (Collaborator) commented Oct 23, 2018

I compiled and ran this pull request on my p2.8xlarge instance and got the same error. I will run it through cuda-gdb to see if it helps.

hcho3 (Collaborator) commented Oct 23, 2018

@trivialfis @canonizer I got this backtrace by running testxgboost through gdb:

#0  __memmove_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:136
#1  0x00007ffff5958bdf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffff5a123ee in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007ffff5a126ed in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ffff5a139b6 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007ffff59256be in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007ffff59259d8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007ffff5a761d5 in cuMemcpy () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x000000000088f022 in cudart::driverHelper::memcpyDispatch(void*, void const*, unsigned long, cudaMemcpyKind, bool) ()
#9  0x000000000086f896 in cudart::cudaApiMemcpy(void*, void const*, unsigned long, cudaMemcpyKind) ()
#10 0x0000000000891ed8 in cudaMemcpy ()
#11 0x0000000000779b52 in xgboost::predictor::GPUPredictor::DeviceOffsets () at /home/ubuntu/xgboost/src/predictor/gpu_predictor.cu:234
#12 0x00007ffff712d43e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#13 0x00007ffff6cf26ba in start_thread (arg=0x7ffff25d7700) at pthread_create.c:333
#14 0x00007ffff6a2841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Maybe this line is problematic?

// copy the last element from every shard
dh::safe_cuda(cudaMemcpy(&offsets[shard + 1],
                         data.DevicePointer(device) + data.DeviceSize(device) - 1,
                         sizeof(size_t), cudaMemcpyDefault));

hcho3 (Collaborator) commented Oct 23, 2018

I added some diagnostic output before the cudaMemcpy line:

[08:44:06] /home/ubuntu/xgboost/src/predictor/gpu_predictor.cu:233: cudaMemcpy(&offsets[0 + 1], 0x7c136c0200 + 2 - 1, sizeof(size_t), cudaMemcpyDefault));
[08:44:06] /home/ubuntu/xgboost/src/predictor/gpu_predictor.cu:233: cudaMemcpy(&offsets[1 + 1], 0x7c13ec0200 + 2 - 1, sizeof(size_t), cudaMemcpyDefault));
[08:44:06] /home/ubuntu/xgboost/src/predictor/gpu_predictor.cu:233: cudaMemcpy(&offsets[2 + 1], 0x7c13ac0200 + 2 - 1, sizeof(size_t), cudaMemcpyDefault));
[08:44:06] /home/ubuntu/xgboost/src/predictor/gpu_predictor.cu:233: cudaMemcpy(&offsets[3 + 1], 0x7c132c0200 + 2 - 1, sizeof(size_t), cudaMemcpyDefault));
[08:44:06] /home/ubuntu/xgboost/src/predictor/gpu_predictor.cu:233: cudaMemcpy(&offsets[4 + 1], 0x7c142c0200 + 2 - 1, sizeof(size_t), cudaMemcpyDefault));
[08:44:06] /home/ubuntu/xgboost/src/predictor/gpu_predictor.cu:233: cudaMemcpy(&offsets[5 + 1], 0x7c18380000 + 1 - 1, sizeof(size_t), cudaMemcpyDefault));
[08:44:07] /home/ubuntu/xgboost/src/predictor/gpu_predictor.cu:233: cudaMemcpy(&offsets[6 + 1], 0 + 0 - 1, sizeof(size_t), cudaMemcpyDefault));

The last cudaMemcpy call fails because the device pointer for shard 6 is null.

model.CommitModel(std::move(trees), 0);
model.param.num_output_group = 1;

int n_row = 5;
Collaborator


Changing this line to

Suggested change
int n_row = 5;
int n_row = 8;

gets rid of the segmentation fault.

The p2.8xlarge instance has 8 GPUs, so with a 5-row matrix some of the GPUs were getting 0 rows. We should handle this edge case, either by restricting the number of devices when too few rows are given or by correctly handling zero-row shards.
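For illustration, a minimal sketch of the second option (skipping the device copy for zero-size shards); the function below is a simplified, sequential stand-in for the actual DeviceOffsets loop, and the signature is hypothetical:

#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical, simplified offsets computation: a shard that received zero
// rows may have a null device pointer, so carry the previous offset forward
// instead of issuing a cudaMemcpy for it.
void ComputeShardOffsets(const std::vector<const size_t*>& shard_row_ptrs,
                         const std::vector<size_t>& shard_sizes,
                         std::vector<size_t>* out_offsets) {
  auto& offsets = *out_offsets;
  offsets.assign(shard_row_ptrs.size() + 1, 0);
  for (size_t shard = 0; shard < shard_row_ptrs.size(); ++shard) {
    if (shard_sizes[shard] == 0) {
      offsets[shard + 1] = offsets[shard];  // empty shard: nothing to copy
      continue;
    }
    // copy the last element of this shard's row pointer array from the device
    cudaMemcpy(&offsets[shard + 1],
               shard_row_ptrs[shard] + shard_sizes[shard] - 1,
               sizeof(size_t), cudaMemcpyDefault);
  }
}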

Collaborator

@trivialfis How should we handle this edge case?

Member

@hcho3 cuda-gdb doesn't work for me with NCCL, on either Fedora or Ubuntu. :(
GPUSet::All() has an optional parameter specifying the number of rows.

Member

Done.

* Reinitialize shards when GPUSet is changed.
* Tests range of data.
trivialfis (Member)

@hcho3 This again comes down to changing parameters. We need to handle the situation where n_gpus is limited, and hence changed, by the n_rows of the input data.

trivialfis (Member)

@canonizer, @hcho3, @RAMitchell I tried to overcome it with a check of whether the GPUSet has changed; if so, all DeviceShards are rebuilt. It might not be a nice solution, so suggestions are welcome.
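For illustration, a rough sketch of such a check; the class and member names below are hypothetical, not the actual GPUPredictor members:

#include <vector>

// Hypothetical shard management: rebuild the per-device shards only when the
// set of devices actually changes (e.g. when n_gpus gets clamped by n_rows).
struct DeviceShard {
  int device_id;
  // ... per-device prediction state ...
};

class ShardedPredictor {
 public:
  void ConfigureShards(const std::vector<int>& devices) {
    if (devices == devices_) return;  // unchanged device set: keep the shards
    devices_ = devices;
    shards_.clear();
    for (int d : devices_) {
      shards_.push_back(DeviceShard{d});  // re-initialize one shard per device
    }
  }

 private:
  std::vector<int> devices_;
  std::vector<DeviceShard> shards_;
};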

codecov-io commented Oct 23, 2018

Codecov Report

Merging #3738 into master will decrease coverage by 0.03%.
The diff coverage is 60%.


@@             Coverage Diff              @@
##             master    #3738      +/-   ##
============================================
- Coverage     52.09%   52.06%   -0.04%     
- Complexity      196      203       +7     
============================================
  Files           181      181              
  Lines         14341    14358      +17     
  Branches        489      495       +6     
============================================
+ Hits           7471     7475       +4     
- Misses         6636     6645       +9     
- Partials        234      238       +4
Impacted Files Coverage Δ Complexity Δ
src/common/span.h 98.63% <ø> (ø) 0 <0> (ø) ⬇️
src/gbm/gbtree.cc 18.67% <0%> (ø) 0 <0> (ø) ⬇️
src/objective/multiclass_obj.cu 93.75% <100%> (+0.41%) 0 <0> (ø) ⬇️
src/objective/hinge.cu 82.35% <100%> (ø) 0 <0> (ø) ⬇️
src/objective/regression_obj.cu 87.46% <100%> (ø) 0 <0> (ø) ⬇️
src/common/host_device_vector.h 75% <0%> (-5%) 0% <0%> (ø)
.../src/main/java/ml/dmlc/xgboost4j/java/XGBoost.java 84.13% <0%> (-2.59%) 41% <0%> (+7%)
src/common/host_device_vector.cc 63.88% <0%> (-1.39%) 0% <0%> (ø)
...oost4j/scala/spark/params/LearningTaskParams.scala 81.08% <0%> (-1.28%) 0% <0%> (ø)
.../scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala 74.9% <0%> (-0.86%) 0% <0%> (ø)
... and 3 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update abf2f66...96fe214.

RAMitchell (Member) left a comment

LGTM

trivialfis (Member)

@hcho3 Hi, could you take another look before I merge it?

out_gpair->Reshard(GPUSet::Empty());
preds.Reshard(GPUSet::Empty());
// out_gpair->Reshard(GPUSet::Empty());
// preds.Reshard(GPUSet::Empty());
Collaborator

Why are we commenting out lines? If they are not needed, we should just remove them.

Member

Done.

TEST(gpu_predictor, Test) {
std::unique_ptr<Predictor> gpu_predictor =
std::unique_ptr<Predictor>(Predictor::Create("gpu_predictor"));
std::unique_ptr<Predictor> cpu_predictor =
std::unique_ptr<Predictor>(Predictor::Create("cpu_predictor"));

// gpu_predictor->Init({std::pair<std::string, std::string>("n_gpus", "1")}, {});
Collaborator

Remove this line too.

Member

Thanks for cleaning up for me.

hcho3 (Collaborator) commented Oct 24, 2018

Thanks everyone!

@hcho3 hcho3 merged commit 2a59ff2 into dmlc:master Oct 24, 2018
CodingCat pushed a commit to CodingCat/xgboost that referenced this pull request Oct 25, 2018
* Multi-GPU support in GPUPredictor.

- GPUPredictor is multi-GPU
- removed DeviceMatrix, as it has been made obsolete by using HostDeviceVector in DMatrix

* Replaced pointers with spans in GPUPredictor.

* Added a multi-GPU predictor test.

* Fix multi-gpu test.

* Fix n_rows < n_gpus.

* Reinitialize shards when GPUSet is changed.
* Tests range of data.

* Remove commented code.

* Remove commented code.
alois-bissuel pushed a commit to criteo-forks/xgboost that referenced this pull request Dec 4, 2018
lock bot locked as resolved and limited the conversation to collaborators on Jan 22, 2019