Add v2 dist benchmark vgg #7539
Changes from 37 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
FROM python:2.7.14 | ||
RUN pip install -U kubernetes opencv-python && apt-get update -y && apt-get install -y iputils-ping libgtk2.0-dev | ||
# NOTE: By default, CI-built wheel packages are built with WITH_DISTRIBUTE=OFF, | ||
# so we must build one with distributed support to install in this image. | ||
RUN pip install paddlepaddle | ||
Review comment: Does this
Reply: No. The lines below change often during debugging and downloading the dataset is slow, so this line was added to make debugging faster.
||
RUN sh -c 'echo "import paddle.v2 as paddle\npaddle.dataset.cifar.train10()" | python' | ||
RUN pip uninstall -y paddlepaddle | ||
|
||
# The lines below may change a lot during debugging | ||
ADD https://raw.githubusercontent.com/PaddlePaddle/cloud/develop/docker/paddle_k8s /usr/bin | ||
ADD https://raw.githubusercontent.com/PaddlePaddle/cloud/develop/docker/k8s_tools.py /root | ||
ADD *.whl / | ||
RUN pip install /*.whl && rm -f /*.whl && \ | ||
chmod +x /usr/bin/paddle_k8s | ||
ENV LD_LIBRARY_PATH=/usr/local/lib | ||
ADD vgg16_fluid.py vgg16_v2.py /workspace/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# Performance for distributed vgg16 | ||
|
||
## Test Result | ||
|
||
### Hardware Information | ||
|
||
- CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | ||
- cpu MHz : 2101.000 | ||
- cache size : 20480 KB | ||
|
||
### Single Node Single Thread | ||
|
||
- PServer Count: 10 | ||
- Trainer Count: 20 | ||
- Metrics: samples / sec | ||
|
||
| Batch Size | 32 | 64 | 128 | 256 | | ||
| -- | -- | -- | -- | -- | | ||
| PaddlePaddle Fluid | 15.44 | 16.32 | 16.74 | 16.79 | | ||
| PaddlePaddle v2 | 15.97 | 17.04 | 17.60 | 17.83 | | ||
| TensorFlow | - | - | - | - | | ||
|
||
### Different Batch Size | ||
|
||
- PServer Count: 10 | ||
- Trainer Count: 20 | ||
- Per trainer CPU Core: 1 | ||
- Metrics: samples / sec | ||
|
||
| Batch Size | 32 | 64 | 128 | 256 | | ||
| -- | -- | -- | -- | -- | | ||
| PaddlePaddle Fluid | 190.20 | 222.15 | 247.40 | 258.18 | | ||
| PaddlePaddle v2 | 170.96 | 233.71 | 256.14 | 329.23 | | ||
| TensorFlow | - | - | - | - | | ||
|
||
|
||
### Accelerate Rate | ||
|
||
- PServer Count: 20 | ||
- Batch Size: 128 | ||
- Metrics: samples / sec | ||
|
||
| Trainer Count | 20 | 40 | 80 | 100 | | ||
| -- | -- | -- | -- | -- | | ||
| PaddlePaddle Fluid | 263.29 (78.64%) | 518.80 (77.47%) | 836.26 (62.44%) | 1019.29 (60.89%) | | ||
| PaddlePaddle v2 (need more tests) | 326.85 (92.85%) | 534.58 (75.93%) | 853.30 (60.60%) | 1041.99 (59.20%) | | ||
| TensorFlow | - | - | - | - | | ||
|
||
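The percentages in the table above are not defined in the text. They appear to be consistent with a parallel efficiency measured against the single-thread throughput at batch size 128 from the first table (16.74 samples/sec for Fluid, 17.60 for v2). That reading is an inference from the numbers, not something stated in this PR. A minimal sketch under that assumption, where `parallel_efficiency` and `efficiency_table` are hypothetical helper names:

```python
def parallel_efficiency(throughput, trainer_count, single_trainer_throughput):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    ideal = trainer_count * single_trainer_throughput
    return throughput / ideal


# Assumed baselines: single-thread samples/sec at batch size 128.
FLUID_BASELINE = 16.74
V2_BASELINE = 17.60

# Measured samples/sec from the Accelerate Rate table, keyed by trainer count.
fluid = {20: 263.29, 40: 518.80, 80: 836.26, 100: 1019.29}
v2 = {20: 326.85, 40: 534.58, 80: 853.30, 100: 1041.99}


def efficiency_table(measured, baseline):
    """Map trainer count -> efficiency in percent."""
    return {n: 100.0 * parallel_efficiency(t, n, baseline)
            for n, t in measured.items()}


if __name__ == "__main__":
    for name, measured, baseline in (("Fluid", fluid, FLUID_BASELINE),
                                     ("v2", v2, V2_BASELINE)):
        for n, pct in sorted(efficiency_table(measured, baseline).items()):
            print("%s, %d trainers: %.2f%%" % (name, n, pct))
```

Under this reading the computed values agree with the table to within rounding, which suggests (but does not confirm) that this is how the percentages were derived.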
### Different PServer Count | ||
|
||
- Trainer Count: 60 | ||
- Batch Size: 128 | ||
- Metrics: mini-batch / sec | ||
Review comment: Do you mean
||
|
||
| PServer Count | 3 | 6 | 10 | 20 | | ||
| -- | -- | -- | -- | -- | | ||
| PaddlePaddle Fluid (should fix in next PR) | 589.1 | 592.6 | 656.4 | 655.8 | | ||
| PaddlePaddle v2 | 593.4 | 791.3 | 729.7 | 821.7 | | ||
| TensorFlow | - | - | - | - | | ||
|
||
*The performance gap between Fluid and v2 comes from network interference.* | ||
|
||
|
||
## Steps to run the performance test | ||
|
||
1. Re-compile PaddlePaddle with `-DWITH_DISTRIBUTE=ON` to build it with distributed support. | ||
1. When the build finishes, copy the output `whl` package from `build/python/dist` to the current directory. | ||
1. Run `docker build -t [image:tag] .` to build the Docker image, then run `docker push [image:tag]` to push it to a repository where Kubernetes can find it. | ||
1. Run `kubectl create -f pserver.yaml && kubectl create -f trainer.yaml` to start the job on your Kubernetes cluster (you must configure the `kubectl` client before this step). | ||
1. Run `kubectl get po` to list the running pods, and run `kubectl logs [podID]` to fetch the logs of the pservers and trainers. | ||
|
||
Check the logs for the distributed training progress and analyze the performance. | ||
|
||
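The docker and kubectl steps above can be collected into a small driver script. This is only a sketch: `benchmark_commands` and `run_all` are hypothetical helper names, the image tag is a placeholder, and the manifests in this PR reference their own hard-coded image, so the tag you push would need to match what the YAML files pull.

```python
import subprocess


def benchmark_commands(image_tag,
                       pserver_manifest="pserver.yaml",
                       trainer_manifest="trainer.yaml"):
    """Return the docker/kubectl invocations from the steps as argv lists."""
    return [
        ["docker", "build", "-t", image_tag, "."],
        ["docker", "push", image_tag],
        ["kubectl", "create", "-f", pserver_manifest],
        ["kubectl", "create", "-f", trainer_manifest],
        ["kubectl", "get", "po"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check_call raises on a non-zero exit code, so later steps
            # are skipped if an earlier one fails (like `&&` in a shell).
            subprocess.check_call(argv)


if __name__ == "__main__":
    # Placeholder tag, not a real registry path.
    run_all(benchmark_commands("example.registry/fluid_benchmark:vgg16"))
```

The script defaults to a dry run so the commands can be reviewed first. Fetching logs with `kubectl logs [podID]` is left manual because the pod IDs are only known after the pods are created.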
## Enable verbose logs | ||
|
||
Edit `pserver.yaml` and `trainer.yaml` and add the environment variable `GLOG_v=3` to see in detail what happened. | ||
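As an illustration of the edit described above, the sketch below adds a `GLOG_v` entry to a container `env` list represented as plain Python dictionaries in the same name/value shape the manifests use. `with_glog_verbosity` is a hypothetical helper, and the sample `env` list is abbreviated. The value is stored as a string because the manifests quote all numeric environment values.

```python
import copy


def with_glog_verbosity(env, level=3):
    """Return a copy of a container env list with GLOG_v set to `level`."""
    # Drop any existing GLOG_v entry so the variable is not defined twice.
    updated = [copy.deepcopy(e) for e in env if e.get("name") != "GLOG_v"]
    updated.append({"name": "GLOG_v", "value": str(level)})
    return updated


# Abbreviated env list in the shape used by pserver.yaml.
env = [
    {"name": "PADDLE_JOB_NAME", "value": "vgg16job"},
    {"name": "TRAINING_ROLE", "value": "PSERVER"},
]

new_env = with_glog_verbosity(env)
print(new_env[-1])
```

The same entry would be added to the `env` list of both the pserver and the trainer container.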
Review comment: I'm not sure whether we need to add
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
apiVersion: extensions/v1beta1 | ||
kind: ReplicaSet | ||
metadata: | ||
name: vgg16job-pserver | ||
spec: | ||
replicas: 10 | ||
template: | ||
metadata: | ||
labels: | ||
paddle-job-pserver: vgg16job | ||
spec: | ||
hostNetwork: true | ||
imagePullSecrets: | ||
- name: job-registry-secret | ||
containers: | ||
- name: pserver | ||
image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16" | ||
imagePullPolicy: Always | ||
ports: | ||
- name: jobport-30236 | ||
containerPort: 30236 | ||
env: | ||
- name: PADDLE_JOB_NAME | ||
value: vgg16job | ||
- name: MKL_NUM_THREADS | ||
value: "1" | ||
- name: TRAINING_ROLE | ||
value: "PSERVER" | ||
- name: TRAINERS | ||
value: "20" | ||
- name: PSERVERS | ||
value: "10" | ||
- name: TOPOLOGY | ||
value: "" | ||
- name: ENTRY | ||
value: "MKL_NUM_THREADS=1 python /workspace/vgg16_fluid.py --local 0" | ||
- name: TRAINER_PACKAGE | ||
value: "/workspace" | ||
- name: PADDLE_INIT_PORT | ||
value: "30236" | ||
- name: PADDLE_INIT_NICS | ||
value: "xgbe0" | ||
- name: PADDLE_INIT_TRAINER_COUNT | ||
value: "1" | ||
- name: PADDLE_INIT_PORTS_NUM | ||
value: "1" | ||
- name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE | ||
value: "1" | ||
- name: PADDLE_INIT_NUM_GRADIENT_SERVERS | ||
value: "20" | ||
- name: PADDLE_INIT_NUM_PASSES | ||
value: "1" | ||
- name: PADDLE_INIT_USE_GPU | ||
value: "0" | ||
- name: LD_LIBRARY_PATH | ||
value: "/usr/local/lib:/usr/local/nvidia/lib64" | ||
- name: NAMESPACE | ||
valueFrom: | ||
fieldRef: | ||
fieldPath: "metadata.namespace" | ||
- name: POD_IP | ||
valueFrom: | ||
fieldRef: | ||
fieldPath: "status.podIP" | ||
command: ["paddle_k8s", "start_fluid"] | ||
resources: | ||
requests: | ||
memory: 10Gi | ||
cpu: 4 | ||
limits: | ||
memory: 10Gi | ||
cpu: 4 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
apiVersion: batch/v1 | ||
kind: Job | ||
metadata: | ||
name: vgg16job-trainer | ||
spec: | ||
parallelism: 20 | ||
completions: 20 | ||
template: | ||
metadata: | ||
labels: | ||
paddle-job: vgg16job | ||
spec: | ||
imagePullSecrets: | ||
- name: job-registry-secret | ||
hostNetwork: true | ||
containers: | ||
- name: trainer | ||
image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16" | ||
imagePullPolicy: Always | ||
command: ["paddle_k8s", "start_fluid"] | ||
env: | ||
- name: PADDLE_JOB_NAME | ||
value: vgg16job | ||
- name: TRAINING_ROLE | ||
value: "TRAINER" | ||
- name: TRAINERS | ||
value: "20" | ||
- name: PSERVERS | ||
value: "10" | ||
- name: TOPOLOGY | ||
value: "" | ||
- name: ENTRY | ||
value: "MKL_NUM_THREADS=1 python /workspace/vgg16_fluid.py --local 0 --batch_size 128" | ||
- name: TRAINER_PACKAGE | ||
value: "/workspace" | ||
- name: PADDLE_INIT_PORT | ||
value: "30236" | ||
- name: PADDLE_INIT_NICS | ||
value: "xgbe0" | ||
- name: PADDLE_INIT_TRAINER_COUNT | ||
value: "1" | ||
- name: PADDLE_INIT_PORTS_NUM | ||
value: "1" | ||
- name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE | ||
value: "1" | ||
- name: PADDLE_INIT_NUM_GRADIENT_SERVERS | ||
value: "20" | ||
- name: PADDLE_INIT_NUM_PASSES | ||
value: "1" | ||
- name: PADDLE_INIT_USE_GPU | ||
value: "0" | ||
- name: LD_LIBRARY_PATH | ||
value: "/usr/local/lib:/usr/local/nvidia/lib64" | ||
- name: NAMESPACE | ||
valueFrom: | ||
fieldRef: | ||
fieldPath: "metadata.namespace" | ||
- name: POD_IP | ||
valueFrom: | ||
fieldRef: | ||
fieldPath: "status.podIP" | ||
resources: | ||
requests: | ||
memory: 40Gi | ||
cpu: 2 | ||
limits: | ||
memory: 40Gi | ||
cpu: 2 | ||
restartPolicy: Never |
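The pserver and trainer manifests above repeat the same counts in several places: `replicas: 10` alongside `PSERVERS: "10"`, and `parallelism: 20` alongside `TRAINERS: "20"`. Assuming these are meant to agree (the PR does not say so explicitly), a minimal consistency check over already-parsed manifests could look like the sketch below. `env_value` and `check_job_counts` are hypothetical helpers, and the dictionaries are abbreviated stand-ins for the real containers.

```python
def env_value(container, name):
    """Look up an environment variable in a container spec, or None."""
    for entry in container.get("env", []):
        if entry.get("name") == name:
            return entry.get("value")
    return None


def check_job_counts(pserver_replicas, trainer_parallelism, containers):
    """Return a list of mismatch messages (empty if everything agrees)."""
    problems = []
    expected = {
        "PSERVERS": str(pserver_replicas),
        "TRAINERS": str(trainer_parallelism),
    }
    for container in containers:
        for var, want in sorted(expected.items()):
            got = env_value(container, var)
            if got != want:
                problems.append("%s: %s is %r, expected %r"
                                % (container.get("name"), var, got, want))
    return problems


# Abbreviated containers mirroring the values in the two manifests.
pserver = {"name": "pserver", "env": [{"name": "TRAINERS", "value": "20"},
                                      {"name": "PSERVERS", "value": "10"}]}
trainer = {"name": "trainer", "env": [{"name": "TRAINERS", "value": "20"},
                                      {"name": "PSERVERS", "value": "10"}]}

print(check_job_counts(10, 20, [pserver, trainer]))
```

A check like this is mainly useful when the trainer or pserver counts are changed between benchmark runs, since each count has to be edited in more than one place.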
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
apiVersion: extensions/v1beta1 | ||
kind: ReplicaSet | ||
metadata: | ||
name: vgg16v2job-pserver | ||
spec: | ||
replicas: 10 | ||
template: | ||
metadata: | ||
labels: | ||
paddle-job-pserver: vgg16v2job | ||
spec: | ||
hostNetwork: true | ||
imagePullSecrets: | ||
- name: job-registry-secret | ||
containers: | ||
- name: pserver | ||
image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16" | ||
imagePullPolicy: Always | ||
ports: | ||
- name: jobport-30236 | ||
containerPort: 30236 | ||
env: | ||
- name: PADDLE_JOB_NAME | ||
value: vgg16v2job | ||
- name: TRAINERS | ||
value: "20" | ||
- name: PSERVERS | ||
value: "10" | ||
- name: TOPOLOGY | ||
value: "" | ||
- name: ENTRY | ||
value: "python train.py" | ||
- name: TRAINER_PACKAGE | ||
value: "/workspace" | ||
- name: PADDLE_INIT_PORT | ||
value: "30236" | ||
- name: PADDLE_INIT_NICS | ||
value: "xgbe0" | ||
- name: PADDLE_INIT_TRAINER_COUNT | ||
value: "1" | ||
- name: PADDLE_INIT_PORTS_NUM | ||
value: "1" | ||
- name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE | ||
value: "1" | ||
- name: PADDLE_INIT_NUM_GRADIENT_SERVERS | ||
value: "20" | ||
- name: PADDLE_INIT_NUM_PASSES | ||
value: "1" | ||
- name: PADDLE_INIT_USE_GPU | ||
value: "0" | ||
- name: LD_LIBRARY_PATH | ||
value: "/usr/local/lib:/usr/local/nvidia/lib64" | ||
- name: NAMESPACE | ||
valueFrom: | ||
fieldRef: | ||
fieldPath: "metadata.namespace" | ||
command: ["paddle_k8s", "start_pserver"] | ||
resources: | ||
requests: | ||
memory: 10Gi | ||
cpu: 4 | ||
limits: | ||
memory: 10Gi | ||
cpu: 4 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
apiVersion: batch/v1 | ||
kind: Job | ||
metadata: | ||
name: vgg16v2job-trainer | ||
spec: | ||
parallelism: 20 | ||
completions: 20 | ||
template: | ||
metadata: | ||
labels: | ||
paddle-job: vgg16v2job | ||
spec: | ||
imagePullSecrets: | ||
- name: job-registry-secret | ||
hostNetwork: true | ||
containers: | ||
- name: trainer | ||
image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16" | ||
imagePullPolicy: Always | ||
command: ["paddle_k8s", "start_trainer", "v2"] | ||
env: | ||
- name: PADDLE_JOB_NAME | ||
value: vgg16v2job | ||
- name: BATCH_SIZE | ||
value: "256" | ||
- name: TRAINERS | ||
value: "20" | ||
- name: PSERVERS | ||
value: "10" | ||
- name: TOPOLOGY | ||
value: "" | ||
- name: ENTRY | ||
value: "cd /workspace && MKL_NUM_THREADS=1 python /workspace/vgg16_v2.py" | ||
- name: TRAINER_PACKAGE | ||
value: "/workspace" | ||
- name: PADDLE_INIT_PORT | ||
value: "30236" | ||
- name: PADDLE_INIT_NICS | ||
value: "xgbe0" | ||
- name: PADDLE_INIT_TRAINER_COUNT | ||
value: "1" | ||
- name: PADDLE_INIT_PORTS_NUM | ||
value: "1" | ||
- name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE | ||
value: "1" | ||
- name: PADDLE_INIT_NUM_GRADIENT_SERVERS | ||
value: "20" | ||
- name: PADDLE_INIT_NUM_PASSES | ||
value: "2" | ||
- name: PADDLE_INIT_USE_GPU | ||
value: "0" | ||
- name: LD_LIBRARY_PATH | ||
value: "/usr/local/lib:/usr/local/nvidia/lib64" | ||
- name: NAMESPACE | ||
valueFrom: | ||
fieldRef: | ||
fieldPath: "metadata.namespace" | ||
resources: | ||
requests: | ||
memory: 40Gi | ||
cpu: 2 | ||
limits: | ||
memory: 40Gi | ||
cpu: 2 | ||
restartPolicy: Never |
Review comment: Since this is a test, I think it would be better to use paddle:dev instead of this.