fluid_cluster_train_cn.md #8656

seiriosPlus · 2018-02-28T09:34:17Z

Yancey1989 · 2018-02-28T11:08:58Z

doc/howto/cluster/fluid_cluster_train_cn.md

@@ -0,0 +1,126 @@
+# Fluid 分布式版本使用指南
+> 本篇文章将说明在Paddle Fluid 版本下进行分布式训练的配置和执行


This paragraph is not a quotation, so the blockquote marker `>` can be removed.

Yancey1989 · 2018-02-28T11:10:11Z

doc/howto/cluster/fluid_cluster_train_cn.md

+* 可用的集群
+    包含一个或多个计算节点的集群，每一个节点都能够执行PaddlePaddle的训练任务且拥有唯一的IP地址，集群内的所有计算节点可以通过网络相互通信。
+* 安装PaddlePaddle Fluid with Distribute 版本
+    所有的计算节点上均需要按照分布式版本的PaddlePaddle, 在用于GPU等设备的机器上还需要额外安装好相应的驱动程序和CUDA的库。


Typo: `按照` ("according to") should be `安装` ("install").

Yancey1989 · 2018-02-28T11:11:09Z

doc/howto/cluster/fluid_cluster_train_cn.md

+    包含一个或多个计算节点的集群，每一个节点都能够执行PaddlePaddle的训练任务且拥有唯一的IP地址，集群内的所有计算节点可以通过网络相互通信。
+* 安装PaddlePaddle Fluid with Distribute 版本
+    所有的计算节点上均需要按照分布式版本的PaddlePaddle, 在用于GPU等设备的机器上还需要额外安装好相应的驱动程序和CUDA的库。
+    **注意：**当前对外提供的PaddlePaddle版本并不支持分布式，需要通过源码重新编译。编译和安装方法参见[URL](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/build_and_install/index_en.html)。


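The line starting with `**注意：**当前对外提供的PaddlePaddle版本` needs a blank line between it and the preceding paragraph; otherwise it does not render correctly.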

Yancey1989 · 2018-02-28T11:12:15Z

doc/howto/cluster/fluid_cluster_train_cn.md

+    所有的计算节点上均需要按照分布式版本的PaddlePaddle, 在用于GPU等设备的机器上还需要额外安装好相应的驱动程序和CUDA的库。
+    **注意：**当前对外提供的PaddlePaddle版本并不支持分布式，需要通过源码重新编译。编译和安装方法参见[URL](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/build_and_install/index_en.html)。
+    cmake编译命令中需要将WITH_DISTRIBUTE设置为ON，下面是一个最小化的cmake编译指令：
+``` 


Code blocks that contain Linux commands should preferably open with a `` ```bash `` fence.

Yancey1989 · 2018-02-28T11:13:10Z

doc/howto/cluster/fluid_cluster_train_cn.md

+```
+## 更新训练脚本
+这里，我们以[Deep Learing 101](http://www.paddlepaddle.org/docs/develop/book/01.fit_a_line/index.html)课程中的第一章 fit a line 为例。
+### 非分布式训练脚本


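Renaming `非分布式训练脚本` ("non-distributed training script") to `单机训练脚本` ("single-machine training script") would be more intuitive.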

Yancey1989 · 2018-02-28T11:13:46Z

doc/howto/cluster/fluid_cluster_train_cn.md

+我们创建了一个简单的全连接神经网络程序，并且通过fluid的Executor执行了100次迭代。
+现在我们需要将该非分布式版本的程序更新为分布式版本的程序。
+
+### 介绍parameter server


parameter server

=>

Parameter Server

Yancey1989 · 2018-02-28T11:16:28Z

doc/howto/cluster/fluid_cluster_train_cn.md

+启动顺序，先启动全部的PSERVER后，再启动TRAINER。
+**其中：training_role 是用来区分当前所起服务的角色的，用于训练程序中，用户可根据需要自行定义，其他参数为fluid.DistributeTranspiler的transpile函数所需要，需要在调用函数前进行定义，至于如何从外部环境传入，用户可自定义。**
+
+### DEMO


Since this is the Chinese document, the headings could consistently be in Chinese:

DEMO

=>

样例
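The hunk above says that all PSERVERs must be started before any TRAINER. A minimal sketch of how a launcher could enforce that ordering is shown below. The helper name `wait_for_pservers` and its `timeout`/`interval` parameters are hypothetical and not part of Paddle, and a successful TCP connection only indicates that the port is open, not that the parameter server has finished initializing.

```python
import socket
import time


def wait_for_pservers(endpoints, timeout=60.0, interval=1.0):
    """Return True once every "ip:port" endpoint accepts a TCP connection.

    Hypothetical helper: it only checks that the port is reachable.
    """
    deadline = time.time() + timeout
    pending = list(endpoints)
    while pending:
        host, port = pending[0].rsplit(":", 1)
        try:
            # Try to connect; on success, move on to the next endpoint.
            with socket.create_connection((host, int(port)), timeout=interval):
                pending.pop(0)
                continue
        except OSError:
            pass  # Not listening yet; retry until the deadline.
        if time.time() >= deadline:
            return False
        time.sleep(interval)
    return True
```

A launcher would call `wait_for_pservers(["192.168.1.2:6174"])` and start the trainer processes only if it returns `True`.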

Yancey1989 · 2018-02-28T11:17:57Z

doc/howto/cluster/fluid_cluster_train_cn.md

+![](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/cluster/src/trainer.png)
+**因此，在分布式的Fluid环境中，我们有两个角色需要创建，分别是 Parameter Server 和 Trainer**
+
+### 程序分片


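The heading `程序分片` ("program sharding") sounds a bit odd. Could it be renamed to something else, for example `分布式程序` ("distributed program")?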

Yancey1989 · 2018-03-02T03:23:40Z

doc/howto/cluster/fluid_cluster_train_cn.md

+            exit(0)  # if avg cost less than 10.0, we think our code is good.
+exit(1)
+```
+我们创建了一个简单的全连接神经网络程序，并且通过fluid的Executor执行了100次迭代。


fluid => Fluid

Yancey1989 · 2018-03-02T03:24:05Z

doc/howto/cluster/fluid_cluster_train_cn.md

+我们创建了一个简单的全连接神经网络程序，并且通过fluid的Executor执行了100次迭代。
+现在我们需要将该单机训练版本的程序更新为分布式版本的程序。
+
+### 介绍Parameter server


Parameter server

=>
Parameter Server

Yancey1989 · 2018-03-02T03:26:36Z

doc/howto/cluster/fluid_cluster_train_cn.md

+PADDLE_INIT_PORT=6174 PADDLE_INIT_PSERVERS=192.168.1.2 TRAINERS=2 POD_IP=192.168.1.2 PADDLE_INIT_TRAINER_ID=1 TRAINING_ROLE=PSERVER python test_fit_a_line.py
+```
+执行命令后请等待出现提示： ``` Server listening on 192.168.1.2:6174 ```
+第二步：启动trainer, 启动trainer的命令：


A blank line is needed above the line `第二步：启动trainer, 启动trainer的命令：`; otherwise it renders on the same line as the previous one.
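The launch command in the hunk above passes all cluster settings through environment variables. As a rough sketch, the training script could collect them as follows. The variable names are taken from that command; the function name `read_cluster_config`, the dictionary keys, and the default values are assumptions for illustration, and how the values are handed to `fluid.DistributeTranspiler` is not shown in this thread and is left out here.

```python
import os


def read_cluster_config(env=None):
    """Collect launch settings from environment variables (illustrative only)."""
    env = os.environ if env is None else env
    port = env.get("PADDLE_INIT_PORT", "6174")  # default is an assumption
    pserver_ips = env.get("PADDLE_INIT_PSERVERS", "")
    # Build one "ip:port" endpoint per comma-separated parameter server IP.
    endpoints = [
        ip.strip() + ":" + port for ip in pserver_ips.split(",") if ip.strip()
    ]
    return {
        "role": env.get("TRAINING_ROLE", "TRAINER"),
        "trainers": int(env.get("TRAINERS", "1")),
        "trainer_id": int(env.get("PADDLE_INIT_TRAINER_ID", "0")),
        "pserver_endpoints": ",".join(endpoints),
        # Endpoint of the current node; only meaningful for the PSERVER role.
        "current_endpoint": env.get("POD_IP", "127.0.0.1") + ":" + port,
    }
```

The script can then branch on `config["role"]` to decide whether to run as a parameter server or as a trainer.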

typhoonzero · 2018-03-20T06:43:02Z

@seiriosPlus can you update this PR or create a new one?

Yancey1989 requested review from Yancey1989 and typhoonzero February 28, 2018 11:04

Yancey1989 reviewed Feb 28, 2018


Yancey1989 reviewed Mar 2, 2018


seiriosPlus closed this Mar 21, 2018

seiriosPlus force-pushed the develop branch from b9e6adc to 873cb9b Compare March 21, 2018 03:03
