
add block execution doc #4125

Closed
wants to merge 10 commits into from

Conversation

QiJune
Member

@QiJune QiJune commented Sep 15, 2017

It is easier to review here.


4. Data parallelism feature (run sequentially on each device)

5. Implement a data dependency analysis engine
Contributor

dependency engine?

Member Author

Yes, just the same as MXNet.

@@ -0,0 +1,77 @@
## Overview
Collaborator

I think we can move this file into /doc/design/block_execution.md.

Member Author

Done

@@ -0,0 +1,77 @@
## Overview

We use `Block` to describe our neural network's topology. Actually, a neural network will be executed in various hardware environment. Generally, we will implement data parallelism and model parallelism to get high performance and scalability.
Collaborator

Actually, a neural network will be executed in various hardware environment. Generally, we will implement data parallelism and model parallelism to get high performance and scalability.

==>

There are some challenges here:

  1. Execution Plan.
    It is often that a kind of device cannot accelerate all operators in a block. This requires an execution plan to use multiple devices. For example, some computational operators run on the GPU and cross-node data communication operators like SendOp and RecvOp run on the CPU.

  2. Parallel Execution.
    It is often that a computer has more than one acceleration device. For example, most servers and portable computers like PX2 have more than one CUDA GPU. We want to make full use of them. Intuitively there are two approaches: (1) running operators in a block on multiple devices, and (2) duplicating the block and running the copies on multiple devices. The former is also known as model parallelism and the latter data parallelism.

  3. Variable Places.
    Devices often have their own onboard memories, and it is usually much more efficient for them to operate on variables in the onboard memories. The execution plan should include the placement of variables in a block.

Member Author

Done

Here are some features we want to support:

- Baseline
- one thread in a stand-alone machine: A neural network is executed sequentially in a single CPU or GPU.
Collaborator

one thread in a stand-alone machine: A neural network is executed sequentially in a single CPU or GPU.

=>

a single thread runs a block on a single device, e.g., a CPU or a GPU.

Member Author

Done

@@ -0,0 +1,77 @@
## Overview

We use `Block` to describe our neural network's topology. Actually, a neural network will be executed in various hardware environment. Generally, we will implement data parallelism and model parallelism to get high performance and scalability.
Collaborator

We might need to be clear about the problem setting here.

We use Block to describe our neural network's topology.

==>

PaddlePaddle represents a neural network as a block. A block is like a pair of curly braces in programming languages like C++ and Java; it includes some variables and a sequence of instructions, or operators.

Usually we run a block using a CPU thread. But we want to accelerate the execution of some operators using devices like GPU and FPGA.
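As a rough illustration of this block-as-scope idea, here is a minimal C++ sketch; `VarDesc`, `OpDesc`, and `BlockDesc` are stand-in types for the sketch, not the framework's real classes:

```
// A minimal sketch of a block as a scope: it owns its variables and an
// ordered list of operators, like code between a pair of curly braces.
#include <string>
#include <vector>

struct VarDesc { std::string name; };  // a variable declared in this scope
struct OpDesc  { std::string type; };  // one instruction in the sequence

struct BlockDesc {
  std::vector<VarDesc> vars;    // variables visible inside the braces
  std::vector<OpDesc> ops;      // operators, executed in declaration order
  BlockDesc* parent = nullptr;  // the enclosing block, like an outer scope
};
```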

Member Author

Done

@QiJune QiJune changed the title from "add block roadmap doc" to "add block execution doc" Sep 18, 2017
Contributor

@helinwang helinwang left a comment

Thanks for the PR! I think the device ID and the executor concepts are very important.

1. Execution Plan. It is often that a kind of device cannot accelerate all operators in a block. This requires an execution plan to use multiple devices. For example, some computational operators run on the GPU and cross-node data communication operators like SendOp and RecvOp run on the CPU.

2. Parallel Execution.
It is often that a computer has more than one acceleration device. For example, most servers and portable computers like PX2 have more than one CUDA GPU. We want to make full use of them. Intuitively there are two approaches: (1) running operators in a block on multiple devices, and (2) duplicating the block and running the copies on multiple devices. The former is also known as model parallelism and the latter data parallelism.
Contributor

Maybe we can add an example of CPU with multiple cores. This happens almost everywhere.

Member Author

Exactly, we can implement both data parallelism and model parallelism on a CPU with multiple cores first.

- a single thread runs a block on a single device, e.g., a CPU or a GPU.

2. Data Parallelism
- multi-threads in a stand-alone machine: Only one copy of parameters needs to be storaged in CPU memory.
Contributor

storaged -> stored

- multi-threads/multi-GPUs in a cluster: Each node in the cluster supports multi-threads/multi-GPUs.

3. Model Parallelism
- Operators in a block can be located on different devices (CPU/GPU/FPGA) in a stand-alone machine. Users have to set device id for every operator manually.
Contributor

Users have to set device id for every operator manually.

I think we need to change this line to: "Users can optionally set device id for operators."

Because forcing the user to set every OP manually

  1. adds too much effort for the user, and
  2. some of the OPs are not visible to the user, for example, gradient calculation OPs and send / recv OPs.

Member Author

Yes, we should allow users to set a specific device id for operators.

3. Variable Places. Devices often have their own onboard memories, and it is usually much more efficient for them to operate on variables in the onboard memories. The execution plan should include the placement of variables in a block.


Here are the features we want to support:
Contributor

Are the features below sorted by priority? It looks like it because of the numbering (1. 2. 3. ...). But the second point in "3. Model Parallelism" seems very important.

Member Author

They are sorted by priority; data parallelism has a higher priority than model parallelism.

## Analysis


Since a block is actually a graph, the data members of a graph should be Nodes and Edges. Every Node must have a device id describing the place information. Please refer to the design of [TensorFlow](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/node_def.proto#L48). In our current block design, the data members are VarDesc and OpDesc. So, VarDesc and OpDesc must have a field to describe concrete device information.
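A hypothetical sketch of what such a device field could look like, loosely analogous to TensorFlow's `NodeDef.device`; these structs are illustrative stand-ins, not the real `framework.proto` definitions:

```
// Hypothetical descriptors with a device field; not the real classes.
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::string> inputs, outputs;
  std::string device;  // e.g. "/gpu:0"; empty could mean "use default device"
};

struct VarDesc {
  std::string name;
  std::string device;  // where the variable's memory is placed
};
```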
Contributor

I am not sure if all VarDescs need device information; maybe only parameter vars need it?
Note that TensorFlow's Var is different from PaddlePaddle's Var, PaddlePaddle's Var includes the edges in the graph while TF does not.

Member Author

In general, the device information of a VarDesc is the same as that of its OpDesc. We have to figure out the definition of Graph first, and then we can decide whether VarDesc needs device information.


- Data Parallelism
- stand-alone machine: Users have to set several device ids, and each device will execute the same block.
- cluster: Users have to set node ids and device ids, and each device will execute the same block.
Contributor

I think in a cluster the user should not set any node ID, since with elastic cluster scheduling the user does not know how many nodes are available. Even if the user knows how many nodes are available, we don't want the user to change source code to make a model that runs on 5 machines run on 10 machines.


- A simple device id setting rule
- Firstly, users have to set one or several default device ids for a topology. These devices are of the same kind and usually perform most of the computation of a topology. We mainly consider data parallelism in this step.
- Secondly, users need to set a device id for each operator in a topology. If users set nothing, the operator will just use the default device id. When devices have to be switched, users have to add a copy operator manually.
Contributor

@helinwang helinwang Sep 25, 2017

users have to add a copy operator manually

I think the copy operator should be transparent to the user.

In my opinion, we need an interface that allows users to specify what they want, not how they want things to be done. Only in this way will the framework have the freedom to do optimizations.

Member Author

@QiJune QiJune Sep 25, 2017

We can just let users set it manually before we find a better automatic optimization method.
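A minimal sketch of the manual fallback rule described above, with a hypothetical helper (not an actual Paddle API): an operator's own device id wins; otherwise the topology's default device is used.

```
// Hypothetical helper: per-op device id overrides the topology default.
#include <string>

std::string ResolveDevice(const std::string& op_device,
                          const std::string& default_device) {
  return op_device.empty() ? default_device : op_device;
}
```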


1. Baseline feature

2. Divide `Block` into two concepts, `Block` and `Executor`
Contributor

@helinwang helinwang Sep 25, 2017

Maybe we can add a simple Executor that implements the Executor interface, so that in the future we only need to swap the Executor implementation.
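A hedged sketch of this suggestion, assuming hypothetical `ProgramDesc` and `Scope` stand-ins: an abstract `Executor` interface plus a trivial sequential implementation that could later be swapped out.

```
// Hypothetical interface plus a trivial implementation; not the real classes.
class ProgramDesc;
class Scope;

class Executor {
 public:
  virtual ~Executor() = default;
  // Runs one program against one scope; implementations decide how.
  virtual void Run(const ProgramDesc& prog, Scope* scope) = 0;
};

class SimpleExecutor : public Executor {
 public:
  void Run(const ProgramDesc& prog, Scope* scope) override {
    // A real implementation would iterate the block's operators in order:
    // for (auto& op : prog.Block(0).ops) RunOp(op, scope);
  }
};
```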

out_cpu = pd.v2.copy_op(out, device='cpu')
return out_cpu

pd.run(fpga_infer, devices=['fpga:0', 'fpga:1'])
Contributor

What is `with pd.device(fpga_id)` used for? Adding a name-prefix for variables?

Actually, we can have both data parallelism and model parallelism. In order to support mixing parallelism, we have to propose these two strategies:


- A simple device id setting rule
Contributor

It's better if the user has to change nothing when switching from a single device to multi-device or multi-node.

@@ -0,0 +1,113 @@
## Overview
PaddlePaddle represents a neural network as a `Block`. The words block and graph are interchangeable in the design of PaddlePaddle. A block is like a pair of curly braces in programming languages like C++ and Java; it includes some variables and a sequence of instructions, or operators. Usually we run a block using a CPU thread. But we want to accelerate the execution of some operators using devices like GPU and FPGA.
Contributor

@QiJune @helinwang
we discussed face to face and reached some agreements.
1. To unify our discussion results, we define some common concepts to represent these modules.

  • Placement Policy
    When an operator's run call happens, it needs to be assigned to a specific device. We define the Placement Policy as the operator assignment policy. Currently, we only need a simple priority rule to implement the simplest version.
  • Scheduler
    In a multi-node or multi-device environment, we have more than one block and need a module to manage the op running process. We define the Scheduler as the op run manager.
  • Converter
    The user-defined block is different from the one that is really executed. There is a gap between the original block and the optimized block. We define the Converter as the optimizer module.

2. Block should not contain a Run interface.
Currently, the Block design contains a Run interface, which is inherited from OpBase. This interface overlaps with the Scheduler module.

```
void Run(const framework::Scope& scope,
         const platform::DeviceContext& dev_ctx) const override {
  PADDLE_ENFORCE(symbols_ready_,
                 "operators and variables should be created first.");
  // Run every operator in the block sequentially on a single device context.
  for (auto& op : runtime_table_.ops()) {
    op->Run(scope, dev_ctx);
  }
}
```

When the block replaces NetOp, there are two concerns.
First, the backward module uses NetOp to represent a nested network and a group of operators; the block contains much more than the group of operators that is needed in the backward process. Second, some Ops like FC use NetOp to contain more than one operator, and these also need to be replaced.

Contributor

Sorry, I didn't catch up with your conclusions.

  1. In my thoughts, placement should only describe variables, not operators. The placement is where to store the variable, and then we can know where to run the operators for sure. This is "storage defines the computation".
  2. A "scheduler" often schedules executions on limited resources, but when doing multi-node or multi-device running, the nodes, CPU resources, and GPU resources are already scheduled by the Linux kernel or Kubernetes; we only need to do thread/process creation for each block and let Linux or Kubernetes handle them.
  3. "Converter" is definitely needed, but the name "converter" is too generic; it should be named something like "BlockOptimizer". What do you think?

Member Author

  1. One operator may have several input and output variables. We must ensure these variables are on the same device, so we take this device as the operator's device. If one variable is not on the same device, users should add a copy operator (Paddle can do this automatically later).
  2. Maybe Executor is a better name?
  3. Yes, BlockOptimizer is more specific.

Contributor

[image: block diagram]

This picture shows how to define variable places and add copy ops automatically.

Member Author

@typhoonzero So, where will the operator run, on gpu:0 or gpu:1? Where do we insert the copy operator, before or after the calculation? I think we have to specify a concrete device for an operator. Please refer to the design of TensorFlow. We should have a GraphView of Block, whose data members are Nodes and Edges instead of vectors of VarDesc and OpDesc. The device info will be located in the Node in the GraphView of Block.

Contributor

@helinwang helinwang Sep 27, 2017

@typhoonzero @QiJune

we only need to do thread/process creation for each block and let Linux or Kubernetes handle them.

We still need scheduling: schedule the execution of an OP only when it's ready to run (i.e. when all dependencies are done). One example would be when a recv OP on node 1 is waiting for the send OP on node 0; the recv OP is not ready until the send OP is done. If the scheduler schedules the recv OP before it's ready, it will be waiting for the send OP and block the running thread in the thread pool unnecessarily.
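A minimal sketch of this ready-queue idea, with hypothetical types: an op enters the queue only after all of its producers have finished, so no worker thread blocks on an unfinished dependency.

```
// Dependency-counting scheduling over hypothetical op nodes.
#include <queue>
#include <vector>

struct OpNode {
  int pending_deps = 0;        // unfinished producer ops
  std::vector<int> consumers;  // ops that read this op's outputs
};

void RunAll(std::vector<OpNode>& ops) {
  std::queue<int> ready;
  for (int i = 0; i < static_cast<int>(ops.size()); ++i)
    if (ops[i].pending_deps == 0) ready.push(i);
  while (!ready.empty()) {
    int cur = ready.front();
    ready.pop();
    // RunOp(cur) would be dispatched to a thread pool here.
    for (int next : ops[cur].consumers)
      if (--ops[next].pending_deps == 0) ready.push(next);
  }
}
```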

Contributor

@QiJune I see your point; I agree with defining a device place for each "Node" (or operator). The reason is that people define the graph by ops, not by every variable; e.g., the output of two ops is not defined by the user.

When we know the variable's device place, we can easily figure out where the operator runs, but this is not the reason, I think.

Contributor

If the scheduler schedules the recv OP before it's ready, it will be waiting for the send OP and block the running thread in the thread pool unnecessarily.

@helinwang Is this scheduler only for scheduling on the GPU? Because when using the CPU, the Linux kernel will put the thread to sleep in this situation and not consume CPU resources.

Contributor

@typhoonzero Sorry, my example is misleading. I mean we need a scheduler to put an OP into the "run queue" only when the dependencies of the OP are done. Because by definition the OP should only run when all of its dependencies are done.


## Conclusion

We will split our implementation of `Block` into the following steps:
Contributor

This section seems like a plan, and I think we should not put a plan into a design doc, because when we implement part of the design we would have to change this design doc.

Design docs should be more "static"; putting plans into issues or projects may be better.

We can merge this section into the features we are going to implement.

Member Author

@QiJune QiJune Sep 26, 2017

A project is more suitable.

```
class Executor {
 public:
  Executor(ProgramDesc*);
```
Contributor

PaddlePaddle will eval() a ProgramDesc very often; should we create an Executor every time a ProgramDesc is evaluated?

In my mind, one local session will create and own a single Executor, and ProgramDesc will be sent to the Executor to run.

Member Author

Eval will pass a ProgramDesc pointer to the Executor. It consumes little time.

Contributor

Does Executor create the device contexts? If so, creating the Executor is not cheap.


Each pass will modify the global ProgramDesc.

### Run-time analysis
Contributor

Maybe mention that the "Converter" in the unified-pass-interface section is the module that handles the analysis.

Member Author

Done

```
};
```

#### Placement Policy
Contributor

I thought the placement policy is evaluated by the "Converter", so perhaps put this as a subsection of "Unified Pass Interface".

Member Author

Done


Please refer to the survey [doc](https://github.com/QiJune/Paddle/blob/e90ec7783a1abe7f7627f97559cc46488e41cc7e/doc/design/graph_survey.md) on Computation Graph. Users will write a neural network topology with the Symbolic API, and the composition of operators should be done at compile time at this level too.

#### Unified Pass Interface
Contributor

@helinwang helinwang Sep 28, 2017

Need a description of what "Unified Pass Interface" means. Maybe we can remove "Unified Pass Interface", change the title to "Converter", and describe below what "Converter" means.

Member Author

Done
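A hedged sketch of what a Converter with such a pass interface might look like; `Pass`, `Converter`, and `ProgramDesc` here are illustrative stand-ins, not the real classes.

```
// Hypothetical pass pipeline run by a Converter; not the real classes.
#include <memory>
#include <utility>
#include <vector>

class ProgramDesc;

class Pass {
 public:
  virtual ~Pass() = default;
  // Each pass rewrites the ProgramDesc in place, e.g. assigning devices
  // or inserting copy ops.
  virtual void Apply(ProgramDesc* prog) = 0;
};

class Converter {
 public:
  void AddPass(std::unique_ptr<Pass> pass) { passes_.push_back(std::move(pass)); }
  void Convert(ProgramDesc* prog) {
    for (auto& p : passes_) p->Apply(prog);
  }

 private:
  std::vector<std::unique_ptr<Pass>> passes_;
};
```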

```
class DAGExecutor : public Executor {
 public:
  void Transform();
```
Contributor

@helinwang helinwang Sep 28, 2017

I think it's the Converter's job to optimize the ProgramDesc; the Executor's job is to run the optimized ProgramDesc efficiently, so the Executor is not responsible for Transform.

Member Author

Because ProgramDesc uses a linear list to represent a topology, it's hard to do further optimization upon ProgramDesc directly.
SimpleExecutor will parse the ProgramDesc sequentially. DAGExecutor will parse the ProgramDesc into a Graph, and more optimization will be done upon the Graph structure.
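A hedged sketch of the parsing step described here: build dependency edges from the linear op list by tracking the last writer of each variable. `OpDesc` is a stand-in for the real descriptor.

```
// Turn a linear op list into dependency edges (hypothetical types).
#include <string>
#include <unordered_map>
#include <vector>

struct OpDesc {
  std::vector<std::string> inputs, outputs;
};

// deps[i] lists the ops that must finish before op i can run.
std::vector<std::vector<int>> BuildDAG(const std::vector<OpDesc>& ops) {
  std::unordered_map<std::string, int> last_writer;
  std::vector<std::vector<int>> deps(ops.size());
  for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
    for (const auto& in : ops[i].inputs) {
      auto it = last_writer.find(in);
      if (it != last_writer.end()) deps[i].push_back(it->second);
    }
    for (const auto& out : ops[i].outputs) last_writer[out] = i;
  }
  return deps;
}
```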

Contributor

@helinwang helinwang Sep 29, 2017

DAGExecutor will parse the ProgramDesc into a Graph, and more optimization will be done upon the Graph structure.

As in my comment above, I think the Executor should not do optimization; it's the Converter's job to optimize the ProgramDesc.

@luotao1
Contributor

luotao1 commented Feb 1, 2019

Thanks for contributing to PaddlePaddle! Since documents have been moved to the FluidDoc repo, we are closing this PR. Welcome to contribute to the FluidDoc repo.

@luotao1 luotao1 closed this Feb 1, 2019