
Fluid distributed training performance is terrible using GPU #8119

Closed
typhoonzero opened this issue Feb 5, 2018 · 7 comments

@typhoonzero
Contributor

typhoonzero commented Feb 5, 2018

Running vgg16 with the cifar10 dataset. Used kubectl to submit a Fluid cluster job with 5 pservers and 5 trainers. Each trainer requests 1 GPU via alpha.kubernetes.io/nvidia-gpu: 1.

CUDA: 8
cuDNN: 5
driver version: 375.26
GPU: P40
HostNetwork

Additional information: I see that CPU usage stays at 100% for a long time in the container; maybe the CPU is the bottleneck?

Per mini-batch time: around 60s.
With CPU only, it's around 10s.

------------------------->     Profiling Report     <-------------------------
Time unit: ms
Sorted by total time in descending order in the same thread
Event                            Calls       Total       Min.        Max.        Ave.
thread0::split                   5865        1.11765e+07 2.63047     6652.4      1905.63
thread0::concat                  5865        1.09052e+07 2.61659     6175.19     1859.36
thread0::send                    391         2.29786e+06 327.89      13663.4     5876.87
thread0::conv2d_grad             5083        893.141     0.065567    104.159     0.175711
thread0::conv2d                  5083        807.148     0.051993    11.0981     0.158794
thread0::fill_zeros_like         25806       562.583     0.012788    11.0516     0.0218005
thread0::batch_norm              5474        525.538     0.055927    6.09994     0.0960062
thread0::batch_norm_grad         5474        346.792     0.044622    9.09123     0.0633526
thread0::elementwise_add_grad    6256        341.264     0.037377    8.06849     0.0545499
thread0::elementwise_add         6256        295.606     0.024       7.93195     0.0472516
thread0::dropout                 3910        191.713     0.033447    6.07088     0.0490315
thread0::pool2d                  1955        183.506     0.036702    9.3676      0.0938649
thread0::mul                     1173        158.41      0.035665    8.38415     0.135047
thread0::pool2d_grad             1955        151.755     0.041505    8.08374     0.0776243
thread0::relu                    5474        143.952     0.015749    5.04012     0.0262974
thread0::dropout_grad            3910        131.019     0.022926    5.03294     0.0335086
thread0::relu_grad               5474        130.262     0.016308    0.196264    0.0237965
thread0::mul_grad                1173        125.795     0.055516    3.11973     0.107242
thread0::cast                    782         34.6569     0.019949    0.660779    0.0443183
thread0::softmax                 391         33.7846     0.043393    0.707031    0.0864057
thread0::fetch                   782         27.9014     0.02143     0.06991     0.0356795
thread0::elementwise_mul         391         22.6289     0.029586    0.658594    0.0578745
thread0::sum                     782         21.7157     0.015956    0.057544    0.0277695
thread0::mean                    391         20.8824     0.017462    0.680731    0.0534077
thread0::cross_entropy           391         18.9823     0.023271    7.04633     0.0485482
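
For reference, a report like the one above can be captured with the fluid profiler context manager. A minimal sketch, assuming the paddle.v2.fluid.profiler API of that time (argument names may differ); exe, feeder, avg_cost and train_reader are the usual objects from the vgg16 training script:

import paddle.v2.fluid as fluid
import paddle.v2.fluid.profiler as profiler

# Profile a bounded number of batches and sort events by total time,
# which matches the "Sorted by total time" header in the report above.
with profiler.profiler('GPU', 'total'):
    for batch_id, data in enumerate(train_reader()):
        exe.run(fluid.default_main_program(),
                feed=feeder.feed(data),
                fetch_list=[avg_cost])
        if batch_id >= 100:
            break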
@typhoonzero typhoonzero self-assigned this Feb 5, 2018
@typhoonzero
Contributor Author

Related: #8139

@typhoonzero
Contributor Author

Tested split_op with a unit test, splitting a 100x2x512 tensor into 10 tensors, running on both CPU and GPU 5000 times:

('check on ', <paddle.v2.fluid.core.CPUPlace object at 0x7f0737f3c5a0>, ' spent ', 0.0024118423461914062)
('check on ', <paddle.v2.fluid.core.CUDAPlace object at 0x7f0737f3c990>, ' spent ', 0.003762960433959961)
('time spent for split ', 35.22474503517151)

So in isolation the operator is not spending 1.9s per run. Need to test more.
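
A rough sketch of this micro-benchmark (hypothetical harness; it assumes fluid.layers.split and the Executor API of that era, so exact names and signatures may differ):

import time
import numpy as np
import paddle.v2.fluid as fluid

# Program that splits a 100x2x512 tensor into 10 pieces along dim 0.
x = fluid.layers.data(name='x', shape=[100, 2, 512], append_batch_size=False)
outs = fluid.layers.split(x, num_or_sections=10, dim=0)

data = np.random.random((100, 2, 512)).astype('float32')

def time_split(place, iters=5000):
    exe = fluid.Executor(place)
    start = time.time()
    for _ in range(iters):
        exe.run(fluid.default_main_program(),
                feed={'x': data}, fetch_list=outs)
    return time.time() - start

for place in (fluid.CPUPlace(), fluid.CUDAPlace(0)):
    print('check on', place, 'spent', time_split(place))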

@helinwang
Contributor

helinwang commented Feb 5, 2018

Maybe this testing order could help us spot the problem (see the sketch after this list):

  1. single parameter server.
  2. multiple parameter servers, without splitting a single tensor (each tensor is placed whole on a single parameter server).
  3. multiple parameter servers, splitting a single tensor and placing its pieces onto different pservers.
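
To make (2) and (3) concrete, a toy illustration (plain Python, not PaddlePaddle code; parameter names and sizes are made up) of whole-tensor placement versus split placement:

pservers = ["ps0:6174", "ps1:6174", "ps2:6174", "ps3:6174"]
params = [("conv1_w", 1728), ("fc1_w", 25690112), ("fc2_w", 262144)]  # made-up sizes

# (2) whole-tensor placement: each parameter lives entirely on one pserver.
whole_tensor = {name: pservers[i % len(pservers)]
                for i, (name, _) in enumerate(params)}

# (3) split placement: each parameter is sliced into one block per pserver;
# this is what the split/concat ops in the profile above do on the trainer side.
def split_blocks(name, numel):
    block = numel // len(pservers)
    return [(name + ".block%d" % i, ps, block) for i, ps in enumerate(pservers)]

print(whole_tensor)
print(split_blocks("fc1_w", 25690112))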

@helinwang
Contributor

helinwang commented Feb 5, 2018

Strange. I would guess that in this profiling result split and concat are running on the CPU (which is normal), but aren't we running the same split and concat OPs on the CPU in the CPU-only case as well?

I think we need to know whether split is running the CPU kernel or the GPU kernel.

Maybe we can turn on the NVIDIA profiler and check which kernel/CUDA API is consuming the most time. I am guessing there could be excessive calls to cudaMemcpy. One possibility I can think of: split runs a CPU kernel, but its input is a GPU tensor, so the input gets copied to a CPU tensor; the next OP is a GPU OP, so the CPU tensor gets copied back to a GPU tensor; and finally the send OP is a CPU OP, so the GPU tensor gets copied to a CPU tensor once again.
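
A sketch of one way to do that, assuming the cuda_profiler helper in paddle.v2.fluid.profiler of that time (the output mode and config keys shown are assumptions and may differ):

import paddle.v2.fluid as fluid
import paddle.v2.fluid.profiler as profiler

# Wrap the training loop so the CUDA profiler can attribute time to individual
# kernels and cudaMemcpy calls; exe, feeder, avg_cost and train_reader come
# from the usual vgg16 training script.
with profiler.cuda_profiler("cuda_profile.csv", 'csv',
                            ['gpustarttimestamp', 'gpuendtimestamp', 'streamid']):
    for batch_id, data in enumerate(train_reader()):
        exe.run(fluid.default_main_program(),
                feed=feeder.feed(data),
                fetch_list=[avg_cost])
        if batch_id >= 20:
            break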

@typhoonzero
Contributor Author

I'm profiling with nvprof. I'll update this issue once I've got some results.

@Yancey1989
Contributor

Yancey1989 commented Feb 7, 2018

From @helinwang:

Maybe we can turn on the NVIDIA profiler and check which kernel/CUDA API is consuming the most time. I am guessing there could be excessive calls to cudaMemcpy.

You're right: split and concat call cudaMemcpy or std::memcpy excessively.

This is because split and concat use StridedMemcpy to copy memory between the input and output tensors, and StridedMemcpy issues a separate cudaMemcpy or std::memcpy call many times for each output variable.

template <typename T, int Rank>
struct StridedMemcpyFunctor {
  void operator()(const platform::DeviceContext& dev_ctx, const T* src,
                  framework::Dim<Rank> src_stride, framework::Dim<Rank> dst_dim,
                  framework::Dim<Rank> dst_stride, T* dst) const {
    // Recurse one rank at a time: each level loops over its head dimension,
    // so only the innermost rank issues an actual cudaMemcpy/std::memcpy.
    for (int64_t i = 0; i < dst_dim.head; ++i) {
      StridedMemcpyFunctor<T, Rank - 1> func;
      func(dev_ctx, src, src_stride.tail, dst_dim.tail, dst_stride.tail, dst);
      src += src_stride.head;
      dst += dst_stride.head;
    }
  }
};

I think we need to reduce the number of memcpy calls.
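
As a back-of-the-envelope check (plain Python arithmetic, not PaddlePaddle code): for the 100x2x512 -> 10 outputs split discussed earlier, the recursive functor above issues a contiguous copy only at the innermost rank, so the call count is the product of all but the last dimension of each output:

def strided_copy_calls(dst_dim):
    # one memcpy per innermost row: product of all dims except the last
    calls = 1
    for d in dst_dim[:-1]:
        calls *= d
    return calls

outputs = 10
per_output_dim = (10, 2, 512)   # each output of splitting 100x2x512 along dim 0
print(strided_copy_calls(per_output_dim) * outputs)   # 200 small memcpy calls per split op

# Coalescing the contiguous trailing dims (2x512) would cut this to 10 per
# output, and since a dim-0 split produces fully contiguous slabs, a single
# memcpy per output (10 in total) would be enough.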

@typhoonzero
Contributor Author

After the recent fix was merged, I tested several rounds of vgg16:

4 pservers + 4 trainers, 1 GPU per trainer:

------------------------->     Profiling Report     <-------------------------
Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread
Event                            Calls       Total       Min.        Max.        Ave.
thread0::send                    391         400535      354.304     56871.3     1024.39
thread0::elementwise_add_grad    6256        13039.2     0.010048    11.7761     2.08427
thread0::conv2d_grad             5083        9707.2      0.484544    4.32714     1.90974
thread0::conv2d                  5083        5886.27     0.220096    3.36442     1.15803
thread0::dropout                 3910        4092.42     0.03776     5.31005     1.04665
thread0::batch_norm_grad         5474        1329.41     0.037216    0.801248    0.24286
thread0::batch_norm              5474        1241.89     0.065536    0.67136     0.226871
thread0::relu_grad               5474        757.701     0.006176    0.461344    0.138418
thread0::pool2d_grad             1955        601.043     0.106848    0.61744     0.307439
thread0::concat                  5865        587.793     0.015968    40.9061     0.10022
thread0::relu                    5474        578.35      0.009152    36.185      0.105654
thread0::elementwise_add         6256        566.953     0.004864    0.433312    0.0906255
thread0::dropout_grad            3910        424.318     0.00592     0.457312    0.108521
thread0::split                   5865        296.092     0.01328     0.118048    0.0504845
thread0::fill_zeros_like         25806       229.97      0.004032    0.13248     0.0089115
thread0::pool2d                  1955        176.405     0.015072    0.213472    0.0902326
thread0::fetch                   782         84.889      0.026432    49.0703     0.108554
thread0::mul_grad                1173        48.398      0.02352     0.060448    0.0412601
thread0::mul                     1173        24.4906     0.016128    0.03984     0.0208786
thread0::feed                    782         10.2206     0.004736    0.211456    0.0130698
thread0::softmax                 391         8.23811     0.019808    0.03424     0.0210693
thread0::accuracy                391         4.90198     0.011424    0.014144    0.012537
thread0::top_k                   391         4.83814     0.010432    0.021952    0.0123738
thread0::softmax_grad            391         4.62147     0.011168    0.022208    0.0118196
thread0::cast                    782         3.99926     0.004576    0.006112    0.00511415
thread0::mean                    391         3.99443     0.008608    0.01872     0.0102159
thread0::sum                     782         3.99309     0.004448    0.006144    0.00510625
thread0::cross_entropy_grad      391         3.18061     0.007328    0.009152    0.00813455
thread0::cross_entropy           391         2.21078     0.005056    0.017472    0.00565418
thread0::mean_grad               391         2.02422     0.004928    0.006496    0.00517704
thread0::fill_constant           391         2.00957     0.004416    0.017504    0.00513956

The most time-consuming operator is now send, which is reasonable since we have to copy the data to CPU memory and do the communication without GPUDirect.
