
Fluid distributed training performance is terrible using GPU #8119

Closed
typhoonzero opened this issue Feb 5, 2018 · 7 comments

@typhoonzero
Contributor

typhoonzero commented Feb 5, 2018

Running vgg16 with the cifar10 dataset. Used kubectl to submit a Fluid cluster job with 5 pservers and 5 trainers. Each trainer requests 1 GPU via alpha.kubernetes.io/nvidia-gpu: 1.

CUDA: 8
cuDNN: 5
driver version: 375.26
GPU: P40
HostNetwork

Additional information: I see that CPU usage stays at 100% for a long time in the container; maybe the CPU is the bottleneck?

Per mini-batch time: around 60s.
With CPU only, it's around 10s.

------------------------->     Profiling Report     <-------------------------
Time unit: ms
Sorted by total time in descending order in the same thread
Event                            Calls       Total       Min.        Max.        Ave.
thread0::split                   5865        1.11765e+07 2.63047     6652.4      1905.63
thread0::concat                  5865        1.09052e+07 2.61659     6175.19     1859.36
thread0::send                    391         2.29786e+06 327.89      13663.4     5876.87
thread0::conv2d_grad             5083        893.141     0.065567    104.159     0.175711
thread0::conv2d                  5083        807.148     0.051993    11.0981     0.158794
thread0::fill_zeros_like         25806       562.583     0.012788    11.0516     0.0218005
thread0::batch_norm              5474        525.538     0.055927    6.09994     0.0960062
thread0::batch_norm_grad         5474        346.792     0.044622    9.09123     0.0633526
thread0::elementwise_add_grad    6256        341.264     0.037377    8.06849     0.0545499
thread0::elementwise_add         6256        295.606     0.024       7.93195     0.0472516
thread0::dropout                 3910        191.713     0.033447    6.07088     0.0490315
thread0::pool2d                  1955        183.506     0.036702    9.3676      0.0938649
thread0::mul                     1173        158.41      0.035665    8.38415     0.135047
thread0::pool2d_grad             1955        151.755     0.041505    8.08374     0.0776243
thread0::relu                    5474        143.952     0.015749    5.04012     0.0262974
thread0::dropout_grad            3910        131.019     0.022926    5.03294     0.0335086
thread0::relu_grad               5474        130.262     0.016308    0.196264    0.0237965
thread0::mul_grad                1173        125.795     0.055516    3.11973     0.107242
thread0::cast                    782         34.6569     0.019949    0.660779    0.0443183
thread0::softmax                 391         33.7846     0.043393    0.707031    0.0864057
thread0::fetch                   782         27.9014     0.02143     0.06991     0.0356795
thread0::elementwise_mul         391         22.6289     0.029586    0.658594    0.0578745
thread0::sum                     782         21.7157     0.015956    0.057544    0.0277695
thread0::mean                    391         20.8824     0.017462    0.680731    0.0534077
thread0::cross_entropy           391         18.9823     0.023271    7.04633     0.0485482
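
For reference, a report like the one above can be captured with the fluid profiler context manager. A minimal sketch, assuming the paddle.v2.fluid.profiler API of that time (argument names may differ); exe, feeder, avg_cost and train_reader are the usual objects from the vgg16 training script:

import paddle.v2.fluid as fluid
import paddle.v2.fluid.profiler as profiler

# Profile a bounded number of batches and sort events by total time,
# which matches the "Sorted by total time" header in the report above.
with profiler.profiler('GPU', 'total'):
    for batch_id, data in enumerate(train_reader()):
        exe.run(fluid.default_main_program(),
                feed=feeder.feed(data),
                fetch_list=[avg_cost])
        if batch_id >= 100:
            break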
@typhoonzero typhoonzero self-assigned this Feb 5, 2018
@typhoonzero
Contributor Author

Related: #8139

@typhoonzero
Contributor Author

Tested split_op with a unit test, splitting a 100x2x512 tensor into 10 tensors, running on both CPU and GPU 5000 times:

('check on ', <paddle.v2.fluid.core.CPUPlace object at 0x7f0737f3c5a0>, ' spent ', 0.0024118423461914062)
('check on ', <paddle.v2.fluid.core.CUDAPlace object at 0x7f0737f3c990>, ' spent ', 0.003762960433959961)
('time spent for split ', 35.22474503517151)

So in isolation the operator is not spending 1.9s per run. Need to test more.
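
A rough sketch of this micro-benchmark (hypothetical harness; it assumes fluid.layers.split and the Executor API of that era, so exact names and signatures may differ):

import time
import numpy as np
import paddle.v2.fluid as fluid

# Program that splits a 100x2x512 tensor into 10 pieces along dim 0.
x = fluid.layers.data(name='x', shape=[100, 2, 512], append_batch_size=False)
outs = fluid.layers.split(x, num_or_sections=10, dim=0)

data = np.random.random((100, 2, 512)).astype('float32')

def time_split(place, iters=5000):
    exe = fluid.Executor(place)
    start = time.time()
    for _ in range(iters):
        exe.run(fluid.default_main_program(),
                feed={'x': data}, fetch_list=outs)
    return time.time() - start

for place in (fluid.CPUPlace(), fluid.CUDAPlace(0)):
    print('check on', place, 'spent', time_split(place))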

@helinwang
Contributor

helinwang commented Feb 5, 2018

Maybe this testing order could help us spot the problem (see the sketch after this list):

  1. single parameter server.
  2. multiple parameter servers, without splitting a single tensor (each tensor is placed whole on a single parameter server).
  3. multiple parameter servers, splitting a single tensor and placing its pieces onto different pservers.
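
To make (2) and (3) concrete, a toy illustration (plain Python, not PaddlePaddle code; parameter names and sizes are made up) of whole-tensor placement versus split placement:

pservers = ["ps0:6174", "ps1:6174", "ps2:6174", "ps3:6174"]
params = [("conv1_w", 1728), ("fc1_w", 25690112), ("fc2_w", 262144)]  # made-up sizes

# (2) whole-tensor placement: each parameter lives entirely on one pserver.
whole_tensor = {name: pservers[i % len(pservers)]
                for i, (name, _) in enumerate(params)}

# (3) split placement: each parameter is sliced into one block per pserver;
# this is what the split/concat ops in the profile above do on the trainer side.
def split_blocks(name, numel):
    block = numel // len(pservers)
    return [(name + ".block%d" % i, ps, block) for i, ps in enumerate(pservers)]

print(whole_tensor)
print(split_blocks("fc1_w", 25690112))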

@helinwang
Contributor

helinwang commented Feb 5, 2018

Strange. I would guess that in this profiling result split and concat are running on the CPU (which is normal), but aren't we running the same split and concat OPs on the CPU in the CPU-only case as well?

I think we need to know whether split is running the CPU kernel or the GPU kernel.

Maybe we can turn on the NVIDIA profiler and check which kernel/CUDA API is consuming the most time. I am guessing there could be excessive calls to cudaMemcpy. One possibility I can think of: split runs a CPU kernel, but its input is a GPU tensor, so the input gets copied to a CPU tensor; the next OP is a GPU OP, so the CPU tensor gets copied back to a GPU tensor; and finally the send OP is a CPU OP, so the GPU tensor gets copied to a CPU tensor once again.
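
A sketch of one way to do that, assuming the cuda_profiler helper in paddle.v2.fluid.profiler of that time (the output mode and config keys shown are assumptions and may differ):

import paddle.v2.fluid as fluid
import paddle.v2.fluid.profiler as profiler

# Wrap the training loop so the CUDA profiler can attribute time to individual
# kernels and cudaMemcpy calls; exe, feeder, avg_cost and train_reader come
# from the usual vgg16 training script.
with profiler.cuda_profiler("cuda_profile.csv", 'csv',
                            ['gpustarttimestamp', 'gpuendtimestamp', 'streamid']):
    for batch_id, data in enumerate(train_reader()):
        exe.run(fluid.default_main_program(),
                feed=feeder.feed(data),
                fetch_list=[avg_cost])
        if batch_id >= 20:
            break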

@typhoonzero
Contributor Author

I'm profiling with nvprof. I'll update this issue once I've got some results.

@Yancey1989
Contributor

Yancey1989 commented Feb 7, 2018

From @helinwang:

Maybe we can turn on the NVIDIA profiler and check which kernel/CUDA API is consuming the most time. I am guessing there could be excessive calls to cudaMemcpy.

You're right: split and concat call cudaMemcpy or std::memcpy excessively.

This is because split and concat use StridedMemcpy to copy memory between the input and output tensors, and StridedMemcpy issues a separate cudaMemcpy or std::memcpy call many times for each output variable.

template <typename T, int Rank>
struct StridedMemcpyFunctor {
  void operator()(const platform::DeviceContext& dev_ctx, const T* src,
                  framework::Dim<Rank> src_stride, framework::Dim<Rank> dst_dim,
                  framework::Dim<Rank> dst_stride, T* dst) const {
    // Recurse one rank at a time: each level loops over its head dimension,
    // so only the innermost rank issues an actual cudaMemcpy/std::memcpy.
    for (int64_t i = 0; i < dst_dim.head; ++i) {
      StridedMemcpyFunctor<T, Rank - 1> func;
      func(dev_ctx, src, src_stride.tail, dst_dim.tail, dst_stride.tail, dst);
      src += src_stride.head;
      dst += dst_stride.head;
    }
  }
};

I think we need to reduce the number of memcpy calls.
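
As a back-of-the-envelope check (plain Python arithmetic, not PaddlePaddle code): for the 100x2x512 -> 10 outputs split discussed earlier, the recursive functor above issues a contiguous copy only at the innermost rank, so the call count is the product of all but the last dimension of each output:

def strided_copy_calls(dst_dim):
    # one memcpy per innermost row: product of all dims except the last
    calls = 1
    for d in dst_dim[:-1]:
        calls *= d
    return calls

outputs = 10
per_output_dim = (10, 2, 512)   # each output of splitting 100x2x512 along dim 0
print(strided_copy_calls(per_output_dim) * outputs)   # 200 small memcpy calls per split op

# Coalescing the contiguous trailing dims (2x512) would cut this to 10 per
# output, and since a dim-0 split produces fully contiguous slabs, a single
# memcpy per output (10 in total) would be enough.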

@typhoonzero
Contributor Author

After the recent fix was merged, I tested several rounds of vgg16:

4 pservers + 4 trainers, 1 GPU per trainer:

------------------------->     Profiling Report     <-------------------------
Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread
Event                            Calls       Total       Min.        Max.        Ave.
thread0::send                    391         400535      354.304     56871.3     1024.39
thread0::elementwise_add_grad    6256        13039.2     0.010048    11.7761     2.08427
thread0::conv2d_grad             5083        9707.2      0.484544    4.32714     1.90974
thread0::conv2d                  5083        5886.27     0.220096    3.36442     1.15803
thread0::dropout                 3910        4092.42     0.03776     5.31005     1.04665
thread0::batch_norm_grad         5474        1329.41     0.037216    0.801248    0.24286
thread0::batch_norm              5474        1241.89     0.065536    0.67136     0.226871
thread0::relu_grad               5474        757.701     0.006176    0.461344    0.138418
thread0::pool2d_grad             1955        601.043     0.106848    0.61744     0.307439
thread0::concat                  5865        587.793     0.015968    40.9061     0.10022
thread0::relu                    5474        578.35      0.009152    36.185      0.105654
thread0::elementwise_add         6256        566.953     0.004864    0.433312    0.0906255
thread0::dropout_grad            3910        424.318     0.00592     0.457312    0.108521
thread0::split                   5865        296.092     0.01328     0.118048    0.0504845
thread0::fill_zeros_like         25806       229.97      0.004032    0.13248     0.0089115
thread0::pool2d                  1955        176.405     0.015072    0.213472    0.0902326
thread0::fetch                   782         84.889      0.026432    49.0703     0.108554
thread0::mul_grad                1173        48.398      0.02352     0.060448    0.0412601
thread0::mul                     1173        24.4906     0.016128    0.03984     0.0208786
thread0::feed                    782         10.2206     0.004736    0.211456    0.0130698
thread0::softmax                 391         8.23811     0.019808    0.03424     0.0210693
thread0::accuracy                391         4.90198     0.011424    0.014144    0.012537
thread0::top_k                   391         4.83814     0.010432    0.021952    0.0123738
thread0::softmax_grad            391         4.62147     0.011168    0.022208    0.0118196
thread0::cast                    782         3.99926     0.004576    0.006112    0.00511415
thread0::mean                    391         3.99443     0.008608    0.01872     0.0102159
thread0::sum                     782         3.99309     0.004448    0.006144    0.00510625
thread0::cross_entropy_grad      391         3.18061     0.007328    0.009152    0.00813455
thread0::cross_entropy           391         2.21078     0.005056    0.017472    0.00565418
thread0::mean_grad               391         2.02422     0.004928    0.006496    0.00517704
thread0::fill_constant           391         2.00957     0.004416    0.017504    0.00513956

The most time-consuming operator is now send, which is reasonable since we have to copy the data to CPU memory and do the communication without GPUDirect.
