
Execute the program with multi threads #6223

Merged: 9 commits merged into PaddlePaddle:develop from multi_cpu_design, Dec 20, 2017

Conversation

@Yancey1989 (Contributor):

Fixed #6209

The `MultiCPUExecutor` will call `block.clone()` and `scope.clone()` to make
a list of blocks and scopes whose size equals the thread count, and then execute
the graph in each thread:
1. Collect the gradients and update the parameters
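For illustration, a minimal sketch of this clone-and-run scheme, assuming simplified stand-in types (`BlockDesc`, `Scope`, and `Executor` here are placeholders, not the real framework classes):

```cpp
#include <thread>
#include <vector>

// Illustrative stand-in types; the real framework classes differ.
struct BlockDesc { BlockDesc Clone() const { return *this; } };
struct Scope { Scope Clone() const { return *this; } };
struct Executor { void Run(const BlockDesc&, Scope*) { /* run ops */ } };

void MultiCPURun(const BlockDesc& block, const Scope& scope,
                 int thread_count) {
  std::vector<std::thread> workers;
  for (int i = 0; i < thread_count; ++i) {
    // Each thread executes its own cloned block in its own cloned scope.
    workers.emplace_back([&block, &scope] {
      BlockDesc local_block = block.Clone();
      Scope local_scope = scope.Clone();
      Executor().Run(local_block, &local_scope);
    });
  }
  for (auto& w : workers) w.join();
  // After all threads join: collect gradients and update the parameters.
}
```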
@QiJune (Member) commented Dec 4, 2017:

With multiple threads, we need only one copy of the parameters in CPU memory. It's different from multi-GPU, where each GPU holds the parameters in its own GPU memory.

@Yancey1989 (Contributor, Author):

Got it. There would be one copy of the parameters, and it could live in the global scope. On the other hand, the gradients differ per thread, so maybe we also need to create the gradients in multiple scopes.

MULTIPLE_NODE = 3
};

class ExecutionPlan {
Contributor:

These definitions should be in #6078

@Yancey1989 (Contributor, Author):

Sure, I'll delete this description.


For data parallelism, we need to pass the attributes `start` and `end`,
the indices of the mini-batch, which will be calculated in the optimizer step.
1. `multiCPUExecutor` executes the ExecutionPlan whose type is Multi CPU
@typhoonzero (Contributor):

We don't need a multiCPUExecutor; just check trainer_count in the current Executor and do the copy if trainer_count > 1.

@dzhwinter (Contributor):

Same as @typhoonzero. I think we do not need a MultiCPUExecutor: since every thread can access the same memory, I cannot see any benefit in splitting thread execution into separate scopes/blocks.
And what about when the math library has built-in OpenMP? Should we classify that as a single-thread executor or a MultiCPUExecutor?

@Yancey1989 (Contributor, Author):

As in this comment: #6223 (comment). In my understanding, each thread will calculate the gradient independently, and the gradients have the same variable name, so maybe they should be created in different scopes. Or is my understanding wrong somewhere?

Contributor:

@Yancey1989 you are right, scopes can have child scopes to resolve this.
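For illustration, a minimal sketch of the parent/child scope idea, with simplified stand-in types (a hypothetical simplification, not Paddle's actual framework::Scope): shared parameters live in the parent scope, while each thread keeps its gradients in its own child scope:

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Variable { /* would hold a tensor; omitted */ };

class Scope {
 public:
  Scope() = default;  // the global (root) scope

  // Create a variable locally, e.g., a per-thread gradient dW.
  Variable* Var(const std::string& name) { return &vars_[name]; }

  // Lookup walks up to the parent, so a thread's child scope sees the
  // shared parameters (e.g., W) while keeping its own gradients private.
  Variable* Find(const std::string& name) {
    auto it = vars_.find(name);
    if (it != vars_.end()) return &it->second;
    return parent_ ? parent_->Find(name) : nullptr;
  }

  // Each worker thread gets its own child scope.
  Scope& NewScope() {
    children_.emplace_back(new Scope(this));
    return *children_.back();
  }

 private:
  explicit Scope(Scope* parent) : parent_(parent) {}

  Scope* parent_ = nullptr;
  std::unordered_map<std::string, Variable> vars_;
  std::vector<std::unique_ptr<Scope>> children_;
};
```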

@dzhwinter (Contributor):

Definitely. At the very least, there should be different gradient variables (not only different names). A Scope just groups them somewhere; for example, creating different gradient variables with a specific name prefix in the global scope is a viable solution. Whether to create a new Scope or not isn't the key issue here.

I believe our Executor should be able to run in a multithreaded environment. We do not need the concept of a MultiCPUExecutor.

@Yancey1989 (Contributor, Author):

@dzhwinter

And what about when the math library has built-in OpenMP?

I think that is independent of data parallelism; users can set the OpenMP thread count and the executor thread count themselves.

Whether to create a new Scope or not isn't the key issue here.

Got it. And if we need multiple variables with different prefixes, the idea from @helinwang is nice: #6223 (comment)

@Yancey1989 (Contributor, Author):

@dzhwinter

I believe our Executor should be able to run in a multithreaded environment. We do not need the concept of a MultiCPUExecutor.

It's a good idea: we can pass a thread_count parameter to the Executor, and the Executor will execute the ExecutionPlan with multiple threads. In addition, I think we also need to insert some sync Ops around Split X and Reduce dW.

@helinwang (Contributor) commented Dec 4, 2017:

Thanks for the PR, the graph is well done!

Several thoughts (CC: @typhoonzero @dzhwinter @QiJune @reyoung ):

  1. The ExecutionPlan should never care about how many threads are running it. The thread count is a property of the Executor. The Executor should be able to distribute the computation inside the ExecutionPlan to different threads automatically.
  2. We should only have one Executor implementation (e.g., no multiCPUExecutor) which can run any ExecutionPlan.
  3. The Executor should not know anything about different roles (e.g., whether it is running as a trainer or a pserver); thus the Executor should not know about trainer_count.

Other details:

Given that the ExecutionPlan should not know how many threads are running it, the many places in this PR describing what different threads should do are probably unnecessary.

Here is what I envision (only drew the forward pass to save space):

[screenshot: screen shot 2017-12-04 at 1 17 23 pm]

In the above image, the program on the left is converted to the program on the right for data parallelism (2x parallelism). X0, X1, Y0, Y1 are all generated variable names; they are not named X and Y because within one ProgramDesc we cannot have multiple variables with the same name.

The converted program is still a single program; it does not know how many threads will run it. The converted program can better utilize two threads, but an Executor with any number of threads can run it.

@typhoonzero (Contributor):
We need a way to sync threads; maybe using "Channels" is a good idea. For example, when we need to merge updated weights, each thread can push a signal message to a channel, and the Executor waits to receive n signals for each mini-batch.
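For illustration, a minimal sketch of such a signal channel, assuming C++11; illustrative only, not an actual fluid Channel type:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class Channel {
 public:
  // A worker pushes a message (e.g., a "weights updated" signal).
  void Send(T v) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      q_.push(std::move(v));
    }
    cv_.notify_one();
  }

  // The receiver blocks until a message arrives.
  T Recv() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !q_.empty(); });
    T v = std::move(q_.front());
    q_.pop();
    return v;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<T> q_;
};
```

The Executor would then call `Recv()` n times per mini-batch, once per worker signal.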

@Yancey1989 (Contributor, Author) commented Dec 5, 2017:

Thanks @helinwang, one ProgramDesc is a good idea! And I agree with points 1 and 2.
For point 3:

The Executor should not know anything about different roles (e.g., whether it is running as a trainer or a pserver); thus the Executor should not know about trainer_count.

I'm confused about this point. If the Executor does not care about trainer_count, do you mean the Executor just executes the graph, and there is another MultiThreadExecutor (or some other name) that creates multiple threads and calls the Executor?

@typhoonzero

We need a way to sync threads; maybe using "Channels" is a good idea. For example, when we need to merge updated weights, each thread can push a signal message to a channel, and the Executor waits to receive n signals for each mini-batch.

Channels are nice. And how about implementing thread synchronization with some operators, so that we can create the condition variable? I think the benefits are:

  1. We can describe the thread synchronization in the graph.
  2. The Executor does not need to care about when to sync; that is defined by the graph.

@helinwang (Contributor) commented Dec 5, 2017:

@Yancey1989

I'm confused about this point. If the Executor does not care about trainer_count, do you mean the Executor just executes the graph, and there is another MultiThreadExecutor (or some other name) that creates multiple threads and calls the Executor?

Sorry, maybe I did not explain clearly. There will only be one executor implementation, and how many threads it runs can be configured. There will be one executor per trainer, so the executor should not care about trainer_count.

@helinwang (Contributor) commented Dec 5, 2017:

@typhoonzero

We need a way to sync threads; maybe using "Channels" is a good idea. For example, when we need to merge updated weights, each thread can push a signal message to a channel, and the Executor waits to receive n signals for each mini-batch.

There will be only one executor, and the executor can be configured to run multiple threads. Each OP will be scheduled to run only after all OPs it depends on have finished. Yes, to make sure "all OPs it depends on have finished", we need proper synchronization.
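For illustration, a minimal sketch of this dependency-driven scheduling, using per-op pending-dependency counters; the types are illustrative, not Paddle's real operator or executor classes:

```cpp
#include <atomic>
#include <functional>
#include <vector>

struct OpNode {
  std::function<void()> run;         // the op's kernel
  std::vector<int> consumers;        // ops that consume this op's outputs
  std::atomic<int> pending_deps{0};  // producers not yet finished
};

// Called from a worker thread when op `i` finishes. Any consumer whose
// last dependency just completed becomes ready and is handed to
// `schedule` (e.g., a thread pool's enqueue function).
void OnOpFinished(std::vector<OpNode>& ops, int i,
                  const std::function<void(int)>& schedule) {
  for (int c : ops[i].consumers) {
    // fetch_sub returns the previous value: 1 means this was the last dep.
    if (ops[c].pending_deps.fetch_sub(1) == 1) {
      schedule(c);
    }
  }
}
```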

@Yancey1989 changed the title from "Add multi cpu design" to "Execute the program with multi threads" on Dec 7, 2017
@@ -0,0 +1,49 @@
# Design Doc: Execute the Program with Multi Thread

In PaddlePaddle, the user could declare **wich** operators will be
Contributor:
wich -> which

with parallel.do(thread_count=N):
    y_predict = fluid.fc(input=x, size=1)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(x=cost)
@helinwang (Contributor) Dec 11, 2017:

I understand you have put a lot of time into thinking about the API, but as a user looking at the API I have several questions:

  • x is split by fluid.split(x, N); what about y_predict, y, and avg_cost, does the user need to split them?
  • If avg_cost is split into N, can the user set thread_count to M in:
    with parallel.do(thread_count=M):
      sgd_optimizer.minimize(avg_cost)
    
  • Why should the user call fluid.merge_grad_avg(x, N)? What will happen if the user forgets to call it, and how can the user debug that problem?

From the above questions, I think the Python API for parallel.Do is very hard to design and hard for the user to understand. I briefly talked with @wangkuiyi last week, and we agreed that the Python API is lower priority compared to the auto transpiler.

So maybe we should postpone the Python API design for parallel.Do for now (and maybe focus more on the transpiler in this PR?) and concentrate on the automatic program transpiler. That way the user doesn't even need to worry about the Python API for parallel.Do. If there are power users who want to use the Python API (which I doubt), we can add it at that time.

@Yancey1989 (Contributor, Author) Dec 12, 2017:

Why should the user call fluid.merge_grad_avg(x, N)? What will happen if the user forgets to call it, and how can the user debug that problem?

Sorry, x should be w; this merges the dW coming from all parallel threads (each thread runs a block in its own scope).
And I think that in data parallelism the user should specify how to merge the gradients (or average_gradients by default?); this is the same as with multi-GPU.

So maybe we should postpone the Python API design for parallel.Do for now (and maybe focus more on the transpiler in this PR?)

Got it, will do. Maybe we can discuss the Python API in #6394.

Contributor:

Thanks!


## Operator Kernel

- We need a global threadpool, and initialize the threadpool when the
Contributor:

Since an OP could create an Executor (e.g., the while OP), Executor creation needs to be very lightweight. So perhaps the thread pool initialization should not be tied to the Executor.

@Yancey1989 (Contributor, Author) Dec 12, 2017:

You mean initialize the ThreadPool in the Parallel Op? If so, I agree with you.

Contributor:

I think the ThreadPool will be a singleton (only one instance per process), so it could be initialized at the beginning when the application starts, or lazily the first time it gets used.

@Yancey1989 (Contributor, Author):

Sure, lazy initialization is great :)
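For illustration, a minimal sketch of a lazily-initialized, process-wide ThreadPool singleton, assuming C++11; the interface is illustrative, not Paddle's actual ThreadPool API:

```cpp
#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
 public:
  // Meyers singleton: constructed on first use; thread-safe since C++11.
  static ThreadPool& Instance() {
    static ThreadPool pool(std::max(1u, std::thread::hardware_concurrency()));
    return pool;
  }

  // Enqueue a task; one idle worker will pick it up.
  void Run(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  explicit ThreadPool(unsigned num_threads) {
    for (unsigned i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // Run outside the lock so workers stay parallel.
        }
      });
    }
  }

  ~ThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> workers_;
  bool stop_ = false;
};
```

The Meyers-singleton `Instance()` gives exactly the lazy behavior discussed: nothing is constructed until the first caller needs the pool.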


- We need a global threadpool, and initialize the threadpool when the
Executor runs the Parallel Op for the first time.
- The Op kernel will create an Executor instance, and send the blocks which number
@helinwang (Contributor) Dec 11, 2017:

By "Op kernel" do you mean the parallel.Do op? If so, I think it should just create N (N=thread_count) executors, and run them (rather than create one executor and letting it know N). How the synchronization is done depends on the answer to this question: #6394 (comment)

Sorry I changed my mind about "there should only be a single instance of Executor", since we decided to separate Executor and thread pool in the last meeting. I think we can have as many executor as we want, and a single thread pool instance.

@Yancey1989 (Contributor, Author):

Will follow #6394 (comment).

I think we can have as many executors as we want, and a single thread pool instance.

Agreed.

@Yancey1989 (Contributor, Author):
Update: auto-transpile the user-defined graph into a multi-threaded graph.

Op graph to a multi-thread Op graph, and run `ParallelDo` Op to run the
multi-thread graph.

## Graph Converter
Contributor:

Should be renamed to Transpiler.

@Yancey1989 (Contributor, Author):
Done.


- `Multi-Thread Transpiler` will convert the graph into a multi-threaded graph
which will be executed with multiple threads.
- `BlockingCounter` will `Init`/`Decrement` a condition variable, and blocking `Wait`
Contributor:

Naming it ConditionVariable should be fine. A ConditionVariable can store any type of "condition", not only a counter.

@Yancey1989 (Contributor, Author) Dec 18, 2017:

I think the ConditionVariable is a great idea. How about adding this type to Variable in fluid? It also needs some Ops like condition_variable.wait()/notify_all()/notify_one(). How about adding a design doc for ConditionVariable first?

- Use a list of block ids as the input, and create multiple Executors to run
these Blocks with multiple threads.
- Initialize a `BlockingCounter` instance and wait until all threads are done.
- `Split` Operator will split the Input Tensor into N slices.
Contributor:

Can split and merge reuse existing operators, e.g., split_op for splitting and mean_op for merging?

@Yancey1989 (Contributor, Author) Dec 18, 2017:

Maybe we need some enhancements, such as how to split a LoDTensor, and Merge may include mean and other mathematical methods such as sum/max ..., but you're right, mean_op/split_op could be reused.
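For illustration, a minimal sketch of splitting a batch into N roughly equal row ranges, as a stand-in for splitting a (LoD)Tensor; the helper name is hypothetical:

```cpp
#include <utility>
#include <vector>

// Divide batch_size rows into n contiguous [start, end) ranges,
// spreading any remainder over the first few slices.
std::vector<std::pair<int, int>> SplitRanges(int batch_size, int n) {
  std::vector<std::pair<int, int>> ranges;
  int base = batch_size / n, rem = batch_size % n;
  int start = 0;
  for (int i = 0; i < n; ++i) {
    int len = base + (i < rem ? 1 : 0);
    ranges.emplace_back(start, start + len);
    start += len;
  }
  return ranges;
}
```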


After conversion:

<img src="src/multi-threads/multi-threads@3x.png" width="1000">
Contributor:

Can this figure also show the scopes these blocks use? And how do we ensure thread-safety when doing the merge?

@Yancey1989 (Contributor, Author):

I think the Merge Op should wait until all forward/backward Ops have completed.

- `BlockingCounter` will `Init`/`Decrement` a condition variable, and blocking `Wait`
until the condition variable becomes `0`:
```cpp
BlockingCounter bc(thread_count);
Contributor:

Is it the same as WaitGroup?

@Yancey1989 (Contributor, Author):

Yes, they are the same; WaitGroup is here.
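// For illustration, a minimal sketch of the BlockingCounter used in this
// snippet, assuming C++11; equivalent in spirit to Go's sync.WaitGroup.
// Illustrative only, not Paddle's actual implementation.
#include <condition_variable>
#include <mutex>

class BlockingCounter {
 public:
  explicit BlockingCounter(int count) : count_(count) {}

  // Each worker thread calls Decrement() when its block finishes.
  void Decrement() {
    std::lock_guard<std::mutex> lock(mu_);
    if (--count_ == 0) cv_.notify_all();
  }

  // The main thread blocks until the counter reaches zero.
  void Wait() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return count_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int count_;
};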


<img src="src/multi-threads/multi-threads@3x.png" width="1000">

## Implement
Contributor:

My question is the same as in today's meeting. How about having two kinds of threads, say aggregating threads and computing threads? Then we can use thread mutexes and condition variables without designing them at the op level.

Another question: if we keep the threads the same as in your post, which thread should run the merge op (Block 0)? If the optimization step is very heavy, should the gradient-computing threads wait for it?

Third question: some optimization algorithms, like L-BFGS, need more than one pass; how should we distribute the computing threads and the optimizing threads?
L-BFGS => http://pytorch.org/docs/master/optim.html#optimizer-step-closure

@Yancey1989 (Contributor, Author) Dec 18, 2017:

My question is the same as in today's meeting. How about having two kinds of threads, say aggregating threads and computing threads? Then we can use thread mutexes and condition variables without designing them at the op level.

I think aggregating threads and computing threads are a great idea, but I'm not sure they will improve performance: we need to know when the dW from all threads has been calculated, this needs more context switches, and fewer threads would do the computing. How about adding this optimization as a TODO?

Another question: if we keep the threads the same as in your post, which thread should run the merge op (Block 0)? If the optimization step is very heavy, should the gradient-computing threads wait for it?

I think we can improve performance by executing the optimizer with multiple threads, such as distributing the parameters W0, W1, dW0, dW1 ... to different threads and executing the optimizer in different threads. I will add a TODO named Execute the optimizer with multi-threads.

Third question: some optimization algorithms, like L-BFGS, need more than one pass; how should we distribute the computing threads and the optimizing threads?

Same as above: maybe we can improve the performance in the future, but for now we could execute the optimizer in a single thread.

}
bc.Wait();
```
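Following up on the multi-threaded-optimizer discussion above, a minimal sketch of sharding parameter updates across threads, using a plain SGD update; names and layout are illustrative, not the PR's design:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Apply w -= lr * g to each parameter shard (W0, W1, ...) in its own
// thread, so a heavy optimizer step does not serialize on one thread.
void ParallelSGD(std::vector<std::vector<float>*>& params,
                 const std::vector<std::vector<float>*>& grads,
                 float lr) {
  std::vector<std::thread> workers;
  for (std::size_t p = 0; p < params.size(); ++p) {
    workers.emplace_back([&params, &grads, lr, p] {
      std::vector<float>& w = *params[p];
      const std::vector<float>& g = *grads[p];
      for (std::size_t i = 0; i < w.size(); ++i) w[i] -= lr * g[i];
    });
  }
  for (auto& t : workers) t.join();
}
```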
- `ParallelDo` Operator
Contributor:

In https://github.com/PaddlePaddle/Paddle/pull/6394/files#r157330955, we design for the future that we should have an API for parallel synchronization. Go's concurrency model follows CSP, but deep learning communication is more suited to BSP, which needs a Barrier.

But if we are considering developing a more "general programming language", e.g., enabling users to define data streams, CSP is better.

@Yancey1989 (Contributor, Author):

It's a good topic. I think the key point is that we need to choose between:

  1. a general programming language: describe a program with the fluid API.
  2. a model language: describe a model (a data stream) with the fluid API.

And we can continue the discussion at https://github.com/PaddlePaddle/Paddle/pull/6394/files#r157330955

@dzhwinter (Contributor) left a comment:

The Multi-CPU design given here is relevant to parallel.for. We have had some discussions about the optimizer design; I believe a fast implementation will lead us to a better result.

@Yancey1989 Yancey1989 merged commit 2d5ec16 into PaddlePaddle:develop Dec 20, 2017
@Yancey1989 Yancey1989 deleted the multi_cpu_design branch December 20, 2017 05:37