distributed_tensorflow_toy

A toy example of distributed TensorFlow

Requirements:

  • NVIDIA driver libraries available in /var/lib/nvidia on the host node
  • The NFS/CephFS share mounted at the same directory (e.g. /mnt/cephfs) on every node

Build the image:

docker build -t toy .

The Dockerfile builds the toy Docker image on top of 10.10.10.94:5000/tensorflow:0.9.0-gpu, creates an example folder, and copies toy.py into the image for testing.
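
The Dockerfile itself is not reproduced here; a minimal sketch consistent with the description above might look like this (the absolute path /example and the WORKDIR line are assumptions):

FROM 10.10.10.94:5000/tensorflow:0.9.0-gpu
# Folder for the test script (absolute path assumed)
RUN mkdir -p /example
# Copy the toy linear-regression script into the image
COPY toy.py /example/
WORKDIR /example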

Push the image to the private Docker registry so that the other machines can pull it for testing:

docker tag toy 10.10.10.94:5000/toy
docker push 10.10.10.94:5000/toy

Test Guide:

  • The test uses two GPU nodes, 10.10.10.94 and 10.10.10.191.
  • One node acts as the parameter server (ps) and the other as the worker: 10.10.10.94 runs the ps and 10.10.10.191 runs the worker.
  • The worker runs a linear-fitting experiment that estimates the weight and the offset of the fit over 10 iterations (see the sketch after this list).
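
toy.py itself is not reproduced in this README. The following is a minimal sketch of what such a script could look like with the TensorFlow 0.9 distributed API: the flag names match the commands used below, while the synthetic data (a line with weight 2 and offset 10), the learning rate, and the print format are assumptions chosen to be consistent with the sample output further down.

import numpy as np
import tensorflow as tf

# Flags match the command-line arguments used in this README
tf.app.flags.DEFINE_string("ps_hosts", "", "comma-separated ps host:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "", "comma-separated worker host:port pairs")
tf.app.flags.DEFINE_string("job_name", "", "'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "index of the task within its job")
FLAGS = tf.app.flags.FLAGS

def main(_):
    cluster = tf.train.ClusterSpec({"ps": FLAGS.ps_hosts.split(","),
                                    "worker": FLAGS.worker_hosts.split(",")})
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        # The parameter server only hosts the variables and blocks here
        server.join()
        return

    # Synthetic data from an assumed ground-truth line y = 2*x + 10
    x_data = np.random.rand(100).astype(np.float32)
    y_data = 2.0 * x_data + 10.0

    # Variables are placed on the ps, compute ops on this worker
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
        b = tf.Variable(tf.zeros([1]))
        y = W * x_data + b
        loss = tf.reduce_mean(tf.square(y - y_data))
        train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
        init_op = tf.initialize_all_variables()  # TF 0.9-era initializer

    with tf.Session(server.target) as sess:
        sess.run(init_op)
        for step in range(10):
            print("=========> step: %d" % step)
            _, w_val, b_val = sess.run([train_op, W, b])
            print(w_val[0])  # estimated weight
            print(b_val[0])  # estimated offset

if __name__ == "__main__":
    tf.app.run()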

Start parameter server:

Start on 10.10.10.94:

docker run --net=host --privileged \
-v /var/lib/nvidia:/usr/local/nvidia/lib64 \
-v /mnt/cephfs:/mnt/cephfs -it 10.10.10.94:5000/toy /bin/bash

Inside the Docker container, start toy.py with --job_name=ps --task_index=0:

python toy.py  \
--ps_hosts=10.10.10.94:2222  --worker_hosts=10.10.10.191:2222 \
--job_name=ps --task_index=0

Partial output:

...
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 1 2 3 4 5 6 7
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y Y Y Y N N N N
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 1:   Y Y Y Y N N N N
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 2:   Y Y Y Y N N N N
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 3:   Y Y Y Y N N N N
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 4:   N N N N Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 5:   N N N N Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 6:   N N N N Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 7:   N N N N Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:04:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX TITAN X, pci bus id: 0000:08:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX TITAN X, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:4) -> (device: 4, name: GeForce GTX TITAN X, pci bus id: 0000:84:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:5) -> (device: 5, name: GeForce GTX TITAN X, pci bus id: 0000:85:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:6) -> (device: 6, name: GeForce GTX TITAN X, pci bus id: 0000:88:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:7) -> (device: 7, name: GeForce GTX TITAN X, pci bus id: 0000:89:00.0)
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {localhost:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {10.10.10.191:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222

Start worker:

Start on 10.10.10.191:

docker run --net=host --privileged \
-v /var/lib/nvidia:/usr/local/nvidia/lib64 \
-v /mnt/cephfs:/mnt/cephfs -it 10.10.10.94:5000/toy /bin/bash

Inside the Docker container, start toy.py with --job_name=worker --task_index=0:

python toy.py  \
--ps_hosts=10.10.10.94:2222  --worker_hosts=10.10.10.191:2222 \
--job_name=worker --task_index=0

Partial output:

...
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 1:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 2:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 3:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980, pci bus id: 0000:04:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 980, pci bus id: 0000:08:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 980, pci bus id: 0000:09:00.0)
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {10.10.10.94:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {localhost:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222
=========> step: 0
-0.863596
9.73686
=========> step: 1
0.318785
10.474
=========> step: 2
1.12191
10.315
=========> step: 3
1.55386
10.1671
=========> step: 4
1.77544
10.0838
=========> step: 5
1.88783
10.0406
=========> step: 6
1.94467
10.0187
=========> step: 7
1.9734
10.0076
=========> step: 8
1.98791
10.0019
=========> step: 9
1.99525
9.99909

After 10 iterations, the weight and the offset converge to approximately 2 and 10, respectively.

References