Fluid distributed training performance is terrible using GPU #8119
Related: #8139
Tested: it's not spending 1.9s on each run of the operator. Need to test more.
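For reference, a minimal sketch (not from the issue; `DummyOp` is a hypothetical stand-in for the operator under test) of how a per-run GPU time like the 1.9s figure could be checked with CUDA events, so that asynchronous kernel launches are not mistaken for the operator finishing early:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for "one run of the operator".
__global__ void DummyOp(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
  const int n = 1 << 20;
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  DummyOp<<<(n + 255) / 256, 256>>>(d, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);  // kernel launches are async; wait before reading the timer

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("one run: %.3f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d);
  return 0;
}
```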
Maybe this testing order could help us spot the problem: ...
Strange. From the profiling result, I think what we need to know is which kernel/CUDA API is consuming the most time. Maybe we can turn on the NVIDIA profiler and check. I am guessing there could be excessive calls of ...
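As a rough illustration of that suggestion (assumptions: the toy `DummyStep` kernel stands in for one training iteration; the real run would be the fluid trainer), the nvprof capture can be narrowed to steady-state iterations with `cudaProfilerStart`/`cudaProfilerStop` and `nvprof --profile-from-start off`, so one-time allocations don't dominate the kernel/API statistics:

```cpp
// Build with nvcc, then run as: nvprof --profile-from-start off ./a.out
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

__global__ void DummyStep(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;  // stand-in for one iteration's work
}

int main() {
  const int n = 1 << 20;
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));

  DummyStep<<<(n + 255) / 256, 256>>>(d, n);  // warm-up, outside the capture

  cudaProfilerStart();  // begin capture only around the region of interest
  for (int step = 0; step < 10; ++step) {
    DummyStep<<<(n + 255) / 256, 256>>>(d, n);
  }
  cudaDeviceSynchronize();  // make sure all queued kernels are accounted for
  cudaProfilerStop();       // end capture

  cudaFree(d);
  return 0;
}
```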
I'm running profiling using ...
FROM @helinwang
You're right. Because of Paddle/paddle/operators/detail/strided_memcpy.h (lines 51 to 62 in b41205d), I think we need to reduce the number of calls to ...
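To illustrate the point (this is not the actual strided_memcpy implementation; names and shapes below are made up), copying a strided block slice-by-slice issues one device copy call per slice, while the same transfer can be batched into a single `cudaMemcpy2DAsync` call:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// One copy call per slice: API/launch overhead grows linearly with `rows`.
void CopySliceBySlice(char* dst, size_t dst_pitch, const char* src,
                      size_t src_pitch, size_t row_bytes, size_t rows,
                      cudaStream_t stream) {
  for (size_t i = 0; i < rows; ++i) {
    cudaMemcpyAsync(dst + i * dst_pitch, src + i * src_pitch, row_bytes,
                    cudaMemcpyDeviceToDevice, stream);
  }
}

// The same strided copy batched into a single call.
void CopyStrided(char* dst, size_t dst_pitch, const char* src, size_t src_pitch,
                 size_t row_bytes, size_t rows, cudaStream_t stream) {
  cudaMemcpy2DAsync(dst, dst_pitch, src, src_pitch, row_bytes, rows,
                    cudaMemcpyDeviceToDevice, stream);
}

int main() {
  const size_t rows = 1024, row_bytes = 256, pitch = 512;
  char *src = nullptr, *dst = nullptr;
  cudaMalloc(&src, rows * pitch);
  cudaMalloc(&dst, rows * pitch);

  CopySliceBySlice(dst, pitch, src, pitch, row_bytes, rows, /*stream=*/0);
  CopyStrided(dst, pitch, src, pitch, row_bytes, rows, /*stream=*/0);
  cudaDeviceSynchronize();

  cudaFree(src);
  cudaFree(dst);
  return 0;
}
```

Fewer, larger copy calls amortize the per-call overhead, which matters most when the copied slices are small.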
After the recent fix was merged, I've tested several rounds of vgg16 with 4 pservers + 4 trainers, 1 GPU per trainer:
The most time-consuming operator becomes ...
Running vgg16 with the cifar10 dataset. Using kubectl to submit a fluid cluster job with 5 pservers and 5 trainers. Trainers request 1 GPU each using alpha.kubernetes.io/nvidia-gpu: 1.
CUDA: 8
cuDNN: 5
driver version: 375.26
GPU: P40
HostNetwork
Additional information: I see that CPU usage stays at 100% for a long time in the container; maybe the CPU is the bottleneck?
Per mini-batch time: around 60s
With CPU only, it's around 10s.