Add v2 dist benchmark vgg #7539
… dist_train_benchmark_vgg16
…onzero/Paddle into dist_train_benchmark_vgg16
benchmark/cluster/vgg16/README.md
Outdated
| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | 247.40 | - | - |
It seems Fluid's performance is `247.40/64=3.866` batches per second, and v2's performance is `256.14/128=2.001` batches per second.
The difference seems huge; do you have an idea why? (Also, could you please check whether my math is correct?)
Sorry, wrong column. I'll update this PR with full test result.
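The arithmetic in the question above can be checked with a short sketch. It assumes, as the question does, that the table values are samples per second; the helper name `batches_per_second` is hypothetical and not part of the PR.

```python
def batches_per_second(samples_per_second, batch_size):
    """Convert a samples/sec throughput into mini-batches/sec."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return samples_per_second / float(batch_size)


# Figures quoted in the review comment; the units are an assumption.
fluid_rate = batches_per_second(247.40, 64)
v2_rate = batches_per_second(256.14, 128)

print("Fluid: %.3f batches/sec" % fluid_rate)
print("v2:    %.3f batches/sec" % v2_rate)
```

The division itself is right; the mismatch came from reading the Fluid figure from the wrong batch-size column.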
As discussed offline, we should think about how to avoid duplicating content that already exists in PaddleCloud.
Thanks! Looks like we have a nice improvement over v2 at batch size 256!
benchmark/cluster/vgg16/Dockerfile
Outdated
#RUN mkdir -p /workspace
#ADD reader.py /workspace/
#RUN python /workspace/reader.py
FROM python:2.7.14
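Since this is for testing, I think it would be better to use paddle:dev instead of this image.
- There is no need to install other dependencies.
- When debugging, you can enter the container and use various commands to inspect the system state.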
benchmark/cluster/vgg16/Dockerfile
Outdated
RUN pip install /*.whl && rm -f /*.whl
ENV LD_LIBRARY_PATH=/usr/local/lib
ADD reader.py /workspace/
RUN python /workspace/reader.py
This download basically never succeeds, so we need to add a note telling users to use a proxy.
- name: TOPOLOGY
value: ""
- name: ENTRY
value: "cd /workspace && MKL_NUM_THREADS=1 python /workspace/vgg16_v2.py"
Use `python -u` to force log output (unbuffered stdout and stderr).
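`python -u` turns off stdout/stderr buffering for the whole interpreter, so log lines reach the container logs as they are written. A minimal in-script sketch of the same idea follows. The `log` helper is hypothetical and not part of this PR; it flushes after every line.

```python
import sys


def log(message, stream=None):
    """Write one line and flush at once so it shows up in container logs."""
    stream = stream if stream is not None else sys.stdout
    stream.write(message + "\n")
    stream.flush()


log("pass 0, batch 0 started")
```

Changing the `ENTRY` command to use `python -u` is less invasive, since it needs no changes to the training script.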
- name: TOPOLOGY
value: ""
- name: ENTRY
value: "python train.py"
Use `python -u` here as well to force log output.
… dist_train_benchmark_vgg16
benchmark/cluster/vgg16/README.md
Outdated
| PaddlePaddle v2 | 15.97 | 17.04 | 17.60 | 17.83 |
| TensorFlow | - | - | - | - |

### different batch size
`different batch size` => `Different Batch Size`
benchmark/cluster/vgg16/README.md
Outdated
| TensorFlow | - | - | - | - |

### Accelerate rate
Accelerate Rate
benchmark/cluster/vgg16/README.md
Outdated
| PaddlePaddle v2 (need more tests) | 326.85 | 534.58 | 853.30 | 1041.99 |
| TensorFlow | - | - | - | - |

### different pserver number
Different PServer Count
benchmark/cluster/vgg16/README.md
Outdated
| TensorFlow | - | - | - | - |

### Accelerate rate
This is a rate metric, so maybe we need to calculate this value as described in https://github.com/PaddlePaddle/Paddle/tree/develop/benchmark/cluster#measure-parallel-efficiency-by-increasing-trainer-count ?
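For reference, a rate-style acceleration metric is usually derived from measured throughput as a speedup and a parallel efficiency. The sketch below assumes those conventional definitions (speedup = throughput with N trainers / throughput with 1 trainer; efficiency = speedup / N). The linked README section may define the metric differently, and the numbers are hypothetical, not results from this PR.

```python
def speedup(throughput_n, throughput_1):
    """Ratio of N-trainer throughput to single-trainer throughput."""
    return throughput_n / float(throughput_1)


def parallel_efficiency(throughput_n, throughput_1, n_trainers):
    """Speedup divided by trainer count; 1.0 means perfect linear scaling."""
    return speedup(throughput_n, throughput_1) / n_trainers


# Hypothetical figures for illustration only.
single_trainer = 10.0   # samples/sec with 1 trainer
eight_trainers = 70.0   # samples/sec with 8 trainers

s = speedup(eight_trainers, single_trainer)
e = parallel_efficiency(eight_trainers, single_trainer, 8)
print("speedup: %.2fx, efficiency: %.1f%%" % (s, e * 100))
```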
Add results.
…onzero/Paddle into dist_train_benchmark_vgg16
LGTM. Please refine the titles using this website: http://www.titlecase.com
benchmark/cluster/vgg16/README.md
Outdated
- Trainer Count: 60
- Batch Size: 128
- Metrics: mini-batch / sec
`mini-batch / sec`: do you mean `samples / sec`?
benchmark/cluster/vgg16/README.md
Outdated
## Enable verbos logs

Edit `pserver.yaml` and `trainer.yaml` and add an environment variable `GLOG_v=3` to see what happend in detail.
I'm not sure whether we also need to add `GLOG_logtostderr=1`. If you have tested it, please ignore this comment.
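One way to try both settings locally before editing the YAML files is to build the environment in Python and launch the training script with it. This is only a sketch: `glog_env` is a hypothetical helper, and whether `GLOG_logtostderr=1` is actually required is the open question in the comment above.

```python
import os


def glog_env(verbosity=3, log_to_stderr=True, base_env=None):
    """Return a copy of the environment with glog settings added."""
    env = dict(os.environ if base_env is None else base_env)
    env["GLOG_v"] = str(verbosity)
    if log_to_stderr:
        # Assumption: send glog output to stderr instead of log files.
        env["GLOG_logtostderr"] = "1"
    return env


env = glog_env(base_env={})
print(sorted(env.items()))
```

The resulting dictionary can be passed as the `env` argument of `subprocess.Popen` when starting a trainer or pserver process by hand.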
benchmark/cluster/vgg16/Dockerfile
Outdated
RUN pip install -U kubernetes opencv-python && apt-get update -y && apt-get install -y iputils-ping libgtk2.0-dev
# NOTE: By default CI built wheel packages turn WITH_DISTRIBUTE=OFF,
# so we must build one with distribute support to install in this image.
RUN pip install paddlepaddle
Is this `pip install` redundant? Could the dataset download be moved to after line 12?
No. The lines below change frequently and downloading the dataset is slow, so I added this line to make debugging faster.
@@ -1,3 +1,16 @@
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
The copyright message is duplicated.
LGTM!!
No description provided.