
add block execution doc #4125

Closed
wants to merge 10 commits into from

Conversation

QiJune
Member

@QiJune QiJune commented Sep 15, 2017

It is easier to review here.


4. Data parallelism feature (run sequentially on each device)

5. Implement a data dependency analysis engine
Contributor

dependency engine?

Member Author

Yes, just the same as MXNet.

@@ -0,0 +1,77 @@
## Overview
Collaborator

I think we can move this file into /doc/design/block_execution.md.

Member Author

Done

@@ -0,0 +1,77 @@
## Overview

We use `Block` to describe our neural network's topology. Actually, a neural network will be executed in various hardware environment. Generally, we will implement data parallelism and model parallelism to get high performance and scalability.
Collaborator

Actually, a neural network will be executed in various hardware environment. Generally, we will implement data parallelism and model parallelism to get high performance and scalability.

==>

There are some challenges here:

  1. Execution Plan.
    It is often that a kind of device cannot accelerate all operators in a block. This requires an execution plan to use multiple devices. For example, some computational operators run on the GPU and cross-node data communication operators like SendOp and RecvOp run on the CPU.

  2. Parallel Execution.
    It is often that a computer has more than one acceleration device. For example, most servers and portable computers like PX2 have more than one CUDA GPU. We want to make full use of them. Intuitively there are two approaches: (1) running operators in a block on multiple devices, and (2) duplicating the block and running the copies on multiple devices. The former is also known as model parallelism and the latter data parallelism.

  3. Variable Places.
    Devices often have their own onboard memories, and it is usually much more efficient for them to operate on variables in the onboard memories. The execution plan should include the placement of variables in a block.

Member Author

Done

Here are some features we want to support:

- Baseline
- one thread in a stand-alone machine: A neural network is executed sequentially in a single CPU or GPU.
Collaborator

one thread in a stand-alone machine: A neural network is executed sequentially in a single CPU or GPU.

=>

a single thread runs a block on a single device, e.g., a CPU or a GPU.

Member Author

Done

@@ -0,0 +1,77 @@
## Overview

We use `Block` to describe our neural network's topology. Actually, a neural network will be executed in various hardware environment. Generally, we will implement data parallelism and model parallelism to get high performance and scalability.
Collaborator

We might need to be clear about the problem setting here.

We use Block to describe our neural network's topology.

==>

PaddlePaddle represents a neural network as a block. A block is like a pair of curly braces in programming languages like C++ and Java; it includes some variables and a sequence of instructions, or operators.

Usually we run a block using a CPU thread. But we want to accelerate the execution of some operators using devices like GPU and FPGA.
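As a rough illustration of this block-as-scope idea, here is a minimal C++ sketch; `VarDesc`, `OpDesc`, and `BlockDesc` are stand-in types for the sketch, not the framework's real classes:

```
// A minimal sketch of a block as a scope: it owns its variables and an
// ordered list of operators, like code between a pair of curly braces.
#include <string>
#include <vector>

struct VarDesc { std::string name; };  // a variable declared in this scope
struct OpDesc  { std::string type; };  // one instruction in the sequence

struct BlockDesc {
  std::vector<VarDesc> vars;    // variables visible inside the braces
  std::vector<OpDesc> ops;      // operators, executed in declaration order
  BlockDesc* parent = nullptr;  // the enclosing block, like an outer scope
};
```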

Member Author

Done

@QiJune QiJune changed the title from "add block roadmap doc" to "add block execution doc" Sep 18, 2017
Contributor

@helinwang helinwang left a comment

Thanks for the PR! I think the device ID and the executor concepts are very important.

1. Execution Plan. It is often that a kind of device cannot accelerate all operators in a block. This requires an execution plan to use multiple devices. For example, some computational operators run on the GPU and cross-node data communication operators like SendOp and RecvOp run on the CPU.

2. Parallel Execution.
It is often that a computer has more than one acceleration device. For example, most servers and portable computers like PX2 have more than one CUDA GPU. We want to make full use of them. Intuitively there are two approaches: (1) running operators in a block on multiple devices, and (2) duplicating the block and running the copies on multiple devices. The former is also known as model parallelism and the latter data parallelism.
Contributor

Maybe we can add an example of CPU with multiple cores. This happens almost everywhere.

Member Author

Exactly, we can implement both data parallelism and model parallelism on a CPU with multiple cores first.

- a single thread runs a block on a single device, e.g., a CPU or a GPU.

2. Data Parallelism
- multi-threads in a stand-alone machine: Only one copy of parameters needs to be storaged in CPU memory.
Contributor

storaged -> stored

- multi-threads/multi-GPUs in a cluster: Each node in the cluster supports multi-threads/multi-GPUs.

3. Model Parallelism
- Operators in a block can be located on different devices (CPU/GPU/FPGA) in a stand-alone machine. Users have to set device id for every operator manually.
Contributor

Users have to set device id for every operator manually.

I think we need to change this line to: "Users can optionally set device id for operators."

Because forcing the user to set every OP manually

  1. adds too much effort for the user, and
  2. some of the OPs are not visible to the user, for example, gradient calculation OPs and send / recv OPs.

Member Author

Yes, we should allow users to set a specific device id for operators.

3. Variable Places. Devices often have their own onboard memories, and it is usually much more efficient for them to operate on variables in the onboard memories. The execution plan should include the placement of variables in a block.


Here are the features we want to support:
Contributor

Are the features below sorted by priority? It looks like it because of the numbering (1. 2. 3. ...). But the second point in "3. Model Parallelism" seems very important.

Member Author

They are sorted by priority; data parallelism has a higher priority than model parallelism.

## Analysis


Since a block is actually a graph, the data members of a graph should be Nodes and Edges. Every Node must have a device id describing the place information. Please refer to the design of [TensorFlow](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/node_def.proto#L48). In our current block design, the data members are VarDesc and OpDesc. So, VarDesc and OpDesc must have a field to describe concrete device information.
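A hypothetical sketch of what such a device field could look like, loosely analogous to TensorFlow's `NodeDef.device`; these structs are illustrative stand-ins, not the real `framework.proto` definitions:

```
// Hypothetical descriptors with a device field; not the real classes.
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::string> inputs, outputs;
  std::string device;  // e.g. "/gpu:0"; empty could mean "use default device"
};

struct VarDesc {
  std::string name;
  std::string device;  // where the variable's memory is placed
};
```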
Contributor

I am not sure if all VarDescs need device information; maybe only parameter vars need it?
Note that TensorFlow's Var is different from PaddlePaddle's Var, PaddlePaddle's Var includes the edges in the graph while TF does not.

Member Author

In general, the device information of a VarDesc is the same as that of its OpDesc. We have to figure out the definition of Graph first, and then we can decide whether VarDesc needs device information.


- Data Parallelism
- stand-alone machine: Users have to set several device ids, and each device will execute the same block.
- cluster: Users have to set node ids and device ids, and each device will execute the same block.
Contributor

I think in a cluster the user should not set any node ID, since with elastic cluster scheduling the user does not know how many nodes are available. Even if the user knows how many nodes are available, we don't want the user to change source code to make a model that runs on 5 machines run on 10 machines.


- A simple device id setting rule
- Firstly, users have to set one or several default device ids for a topology. These devices are of the same kind and usually perform most of the computation of a topology. We mainly consider data parallelism in this step.
- Secondly, users need to set a device id for each operator in a topology. If users set nothing, the operator will just use the default device id. When devices have to be switched, users have to add a copy operator manually.
Contributor

@helinwang helinwang Sep 25, 2017

users have to add a copy operator manually

I think the copy operator should be transparent to the user.

In my opinion, we need an interface that allows users to specify what they want, not how they want things to be done. Only in this way will the framework have the freedom to do optimizations.

Member Author

@QiJune QiJune Sep 25, 2017

We can just let users set it manually before we find a better automatic optimization method.
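A minimal sketch of the manual fallback rule described above, with a hypothetical helper (not an actual Paddle API): an operator's own device id wins; otherwise the topology's default device is used.

```
// Hypothetical helper: per-op device id overrides the topology default.
#include <string>

std::string ResolveDevice(const std::string& op_device,
                          const std::string& default_device) {
  return op_device.empty() ? default_device : op_device;
}
```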


1. Baseline feature

2. Divide `Block` into two concepts, `Block` and `Executor`
Contributor

@helinwang helinwang Sep 25, 2017

Maybe we can add a simple Executor that implements the Executor interface, so that in the future we only need to swap the Executor implementation.
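A hedged sketch of this suggestion, assuming hypothetical `ProgramDesc` and `Scope` stand-ins: an abstract `Executor` interface plus a trivial sequential implementation that could later be swapped out.

```
// Hypothetical interface plus a trivial implementation; not the real classes.
class ProgramDesc;
class Scope;

class Executor {
 public:
  virtual ~Executor() = default;
  // Runs one program against one scope; implementations decide how.
  virtual void Run(const ProgramDesc& prog, Scope* scope) = 0;
};

class SimpleExecutor : public Executor {
 public:
  void Run(const ProgramDesc& prog, Scope* scope) override {
    // A real implementation would iterate the block's operators in order:
    // for (auto& op : prog.Block(0).ops) RunOp(op, scope);
  }
};
```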

out_cpu = pd.v2.copy_op(out, device='cpu')
return out_cpu

pd.run(fpga_infer, devices=['fpga:0', 'fpga:1'])
Contributor

What is `with pd.device(fpga_id)` used for? Adding a name-prefix for variables?

Actually, we can have both data parallelism and model parallelism. In order to support mixing parallelism, we have to propose these two strategies:


- A simple device id setting rule
Contributor

It's better if the user has to change nothing when switching from a single device to multi-device or multi-node.

@@ -0,0 +1,113 @@
## Overview
PaddlePaddle represents a neural network as a `Block`. The words block and graph are interchangeable in the design of PaddlePaddle. A block is like a pair of curly braces in programming languages like C++ and Java; it includes some variables and a sequence of instructions, or operators. Usually we run a block using a CPU thread. But we want to accelerate the execution of some operators using devices like GPU and FPGA.
Contributor

@QiJune @helinwang
we discussed face to face and reached some agreements.
1. To unify our discussion results, we define some common concepts to represent these modules.

  • Placement Policy
    When an operator's run call happens, it needs to be assigned to a specific device. We define the Placement Policy as the operator assignment policy. Currently, we only need a simple priority rule to implement the simplest version.
  • Scheduler
    In a multi-node or multi-device environment, we have more than one block and need a module to manage the op running process. We define the Scheduler as the op run manager.
  • Converter
    The user-defined block is different from the one that is really executed. There is a gap between the original block and the optimized block. We define the Converter as the optimizer module.

2. Block should not contain a Run interface.
Currently, the Block design contains a Run interface, which is inherited from OpBase. This interface overlaps with the Scheduler module.

```
void Run(const framework::Scope& scope,
         const platform::DeviceContext& dev_ctx) const override {
  PADDLE_ENFORCE(symbols_ready_,
                 "operators and variables should be created first.");
  // Run every operator in the block sequentially on a single device context.
  for (auto& op : runtime_table_.ops()) {
    op->Run(scope, dev_ctx);
  }
}
```

When the block replaces NetOp, there are two concerns.
First, the backward module uses NetOp to represent a nested network and a group of operators; the block contains much more than the group of operators that is needed in the backward process. Second, some Ops like FC use NetOp to contain more than one operator, and these also need to be replaced.

Contributor

Sorry, I didn't catch up with your conclusions.

  1. In my thoughts, placement should only describe variables, not operators. The placement is where to store the variable, and then we can know where to run the operators for sure. This is "storage defines the computation".
  2. A "scheduler" often schedules executions on limited resources, but when doing multi-node or multi-device running, the nodes, CPU resources, and GPU resources are already scheduled by the Linux kernel or Kubernetes; we only need to do thread/process creation for each block and let Linux or Kubernetes handle them.
  3. "Converter" is definitely needed, but the name "converter" is too generic; it should be named something like "BlockOptimizer". What do you think?

Member Author

  1. One operator may have several input and output variables. We must ensure these variables are on the same device, so we take this device as the operator's device. If one variable is not on the same device, users should add a copy operator (Paddle can do this automatically later).
  2. Maybe Executor is a better name?
  3. Yes, BlockOptimizer is more specific.

Contributor

[image: block diagram]

This picture shows how to define variable places and add copy ops automatically.

Member Author

@typhoonzero So, where will the operator run, on gpu:0 or gpu:1? Where do we insert the copy operator, before or after the calculation? I think we have to specify a concrete device for an operator. Please refer to the design of TensorFlow. We should have a GraphView of Block, whose data members are Nodes and Edges instead of vectors of VarDesc and OpDesc. The device info will be located in the Node in the GraphView of Block.

Contributor

@helinwang helinwang Sep 27, 2017

@typhoonzero @QiJune

we only need to do thread/process creation for each block and let Linux or Kubernetes handle them.

We still need scheduling: schedule the execution of an OP only when it's ready to run (i.e. when all dependencies are done). One example would be when a recv OP on node 1 is waiting for the send OP on node 0; the recv OP is not ready until the send OP is done. If the scheduler schedules the recv OP before it's ready, it will be waiting for the send OP and block the running thread in the thread pool unnecessarily.
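A minimal sketch of this ready-queue idea, with hypothetical types: an op enters the queue only after all of its producers have finished, so no worker thread blocks on an unfinished dependency.

```
// Dependency-counting scheduling over hypothetical op nodes.
#include <queue>
#include <vector>

struct OpNode {
  int pending_deps = 0;        // unfinished producer ops
  std::vector<int> consumers;  // ops that read this op's outputs
};

void RunAll(std::vector<OpNode>& ops) {
  std::queue<int> ready;
  for (int i = 0; i < static_cast<int>(ops.size()); ++i)
    if (ops[i].pending_deps == 0) ready.push(i);
  while (!ready.empty()) {
    int cur = ready.front();
    ready.pop();
    // RunOp(cur) would be dispatched to a thread pool here.
    for (int next : ops[cur].consumers)
      if (--ops[next].pending_deps == 0) ready.push(next);
  }
}
```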

Contributor

@QiJune I see your point; I agree with defining a device place for each "Node" (or operator). The reason is that people define the graph by ops, not by every variable; e.g., the output of two ops is not defined by the user.

When we know the variable's device place, we can easily figure out where the operator runs, but this is not the reason, I think.

Contributor

If the scheduler schedules the recv OP before it's ready, it will be waiting for the send OP and block the running thread in the thread pool unnecessarily.

@helinwang Is this scheduler only for scheduling on the GPU? Because when using the CPU, the Linux kernel will put the thread to sleep in this situation and not consume CPU resources.

Contributor

@typhoonzero Sorry, my example is misleading. I mean we need a scheduler to put an OP into the "run queue" only when the dependencies of the OP are done. Because by definition the OP should only run when all of its dependencies are done.


## Conclusion

We will split our implementation of `Block` into the following steps:
Contributor

This section seems like a plan, and I think we should not put a plan into a design doc, because when we implement part of the design we would have to change this design doc.

Design docs should be more "static"; putting plans into issues or projects may be better.

We can merge this section into the features we are going to implement.

Member Author

@QiJune QiJune Sep 26, 2017

A project is more suitable.

```
class Executor {
 public:
  Executor(ProgramDesc*);
```
Contributor

PaddlePaddle will eval() a ProgramDesc very often; should we create an Executor every time a ProgramDesc is evaluated?

In my mind, one local session will create and own a single Executor, and ProgramDesc will be sent to the Executor to run.

Member Author

Eval will pass a ProgramDesc pointer to the Executor. It consumes little time.

Contributor

Does Executor create the device contexts? If so, creating the Executor is not cheap.


Each pass will modify the global ProgramDesc.

### Run-time analysis
Contributor

Maybe mention that the "Converter" in the unified-pass-interface section is the module that handles the analysis.

Member Author

Done

```
};
```

#### Placement Policy
Contributor

I thought the placement policy is evaluated by the "Converter", so perhaps put this as a subsection of "Unified Pass Interface".

Member Author

Done


Please refer to the survey [doc](https://github.com/QiJune/Paddle/blob/e90ec7783a1abe7f7627f97559cc46488e41cc7e/doc/design/graph_survey.md) on Computation Graph. Users will write a neural network topology with the Symbolic API, and the composition of operators should be done at compile time at this level too.

#### Unified Pass Interface
Contributor

@helinwang helinwang Sep 28, 2017

Need a description of what "Unified Pass Interface" means. Maybe we can remove "Unified Pass Interface", change the title to "Converter", and describe below what "Converter" means.

Member Author

Done
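A hedged sketch of what a Converter with such a pass interface might look like; `Pass`, `Converter`, and `ProgramDesc` here are illustrative stand-ins, not the real classes.

```
// Hypothetical pass pipeline run by a Converter; not the real classes.
#include <memory>
#include <utility>
#include <vector>

class ProgramDesc;

class Pass {
 public:
  virtual ~Pass() = default;
  // Each pass rewrites the ProgramDesc in place, e.g. assigning devices
  // or inserting copy ops.
  virtual void Apply(ProgramDesc* prog) = 0;
};

class Converter {
 public:
  void AddPass(std::unique_ptr<Pass> pass) { passes_.push_back(std::move(pass)); }
  void Convert(ProgramDesc* prog) {
    for (auto& p : passes_) p->Apply(prog);
  }

 private:
  std::vector<std::unique_ptr<Pass>> passes_;
};
```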

```
class DAGExecutor : public Executor {
 public:
  void Transform();
```
Contributor

@helinwang helinwang Sep 28, 2017

I think it's the Converter's job to optimize the ProgramDesc; the Executor's job is to run the optimized ProgramDesc efficiently, so the Executor is not responsible for Transform.

Member Author

Because ProgramDesc uses a linear list to represent a topology, it's hard to do further optimization upon ProgramDesc directly.
SimpleExecutor will parse the ProgramDesc sequentially. DAGExecutor will parse the ProgramDesc into a Graph, and more optimization will be done upon the Graph structure.
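A hedged sketch of the parsing step described here: build dependency edges from the linear op list by tracking the last writer of each variable. `OpDesc` is a stand-in for the real descriptor.

```
// Turn a linear op list into dependency edges (hypothetical types).
#include <string>
#include <unordered_map>
#include <vector>

struct OpDesc {
  std::vector<std::string> inputs, outputs;
};

// deps[i] lists the ops that must finish before op i can run.
std::vector<std::vector<int>> BuildDAG(const std::vector<OpDesc>& ops) {
  std::unordered_map<std::string, int> last_writer;
  std::vector<std::vector<int>> deps(ops.size());
  for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
    for (const auto& in : ops[i].inputs) {
      auto it = last_writer.find(in);
      if (it != last_writer.end()) deps[i].push_back(it->second);
    }
    for (const auto& out : ops[i].outputs) last_writer[out] = i;
  }
  return deps;
}
```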

Contributor

@helinwang helinwang Sep 29, 2017

DAGExecutor will parse the ProgramDesc into a Graph, and more optimization will be done upon the Graph structure.

As in my comment above, I think the Executor should not do optimization; it's the Converter's job to optimize the ProgramDesc.

@luotao1
Contributor

luotao1 commented Feb 1, 2019

Thanks for contributing to PaddlePaddle! Since documents have been moved to the FluidDoc repo, we are closing this PR. Welcome to contribute to the FluidDoc repo.

@luotao1 luotao1 closed this Feb 1, 2019