Design doc: operator based parameter server. #3747
doc/design/ops/dist_train.md
Outdated
## Abstract

We propose an approach to implment the parameter server. In this
implement
Fixed, thanks!
doc/design/ops/dist_train.md
Outdated
Below is an example of converting the user defined graph to the
sub-graphs for the trainer and the parameter server:

<img src="src/local-graph.png" width="300"/>
W is also an input of the parameter update op.
If I understand correctly, the SEND op only sends the graph, and the RECEIVE op only receives gradients. Any given worker sees only part of the whole graph, so how would the training data travel through the whole graph? Would some part of the graph sit idle until data reaches its parent?
doc/design/ops/dist_train.md
Outdated
1. The parameter variable W and its optimizer subgraph are placed on the parameter server.
1. Operators are added to the sub-graphs.
   - *send* operator sends data and sender's address to the destination.
Question: if there are multiple parameters (or variables) to send to the parameter server, do we:
- create multiple `Send` operators, one for each variable, or
- create one `Hash` operator to divide the parameters equally and one `Sender` operator to do the send?
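For reference, the second option could look roughly like the sketch below. It is only an illustration of hash-based sharding, not code from this PR: `shard_by_hash` and the variable names are hypothetical, and hashing balances the number of variables per server only approximately, not by size.

```python
import zlib
from collections import defaultdict


def shard_by_hash(var_names, num_pservers):
    """Assign each variable to a parameter server index by hashing its name."""
    shards = defaultdict(list)
    for name in var_names:
        # crc32 is stable across processes, unlike Python's built-in hash().
        idx = zlib.crc32(name.encode("utf-8")) % num_pservers
        shards[idx].append(name)
    return dict(shards)


# Hypothetical variable names, two parameter servers.
shards = shard_by_hash(["W1", "W2", "b1", "b2"], num_pservers=2)
```

With this scheme a single sender per trainer would ship each shard to its server, so the number of connections depends on the number of servers rather than the number of variables.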
Also, maybe add some description of the `send` and `recv` operators, like:
- Send:
  - Inputs:
  - Outputs:
  - Description:
I have the same confusion as @typhoonzero; maybe we need an operator to shard the parameters.
On the other point from @typhoonzero:
> create multiple Send operators for each variable

Maybe we only need one `Sender` operator. If we have too many parameters, too many `Sender` operators will cause too many connections to the parameter server.
@typhoonzero @Yancey1989 Sorry, the PR could be clearer.
In short, the answer is "create multiple Send operators for each variable".
From the graph's perspective, there is one `Send` and one `Recv` OP for each variable (but not one per replica: different (trainer) replicas share one `Send` and `Recv` for each variable).
In the implementation, we could group the send implementations into a single port handler.
In this design the variable placement is done by the graph converter before the graph is sent to the worker, so it's not a runtime concept like a `Hash` operator. I think the "Hash" solution is for the simplest element-wise optimization case. If we want the parameter server to do more than element-wise operations, we need to decide the parameter variable and OP placement before the graph is sent to the worker.
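To make the "placement is decided by the graph converter" point concrete, here is a minimal sketch. It is an illustration under assumptions, not the converter proposed in this PR: the `Op` record, the `convert` function, the op type names, and the `@GRAD` naming are all hypothetical. It places optimizer ops on the parameter server, keeps everything else on the trainer, and adds one `send`/`recv` pair for each variable that crosses the boundary.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Op:
    type: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def convert(ops, optimizer_types=("sgd",)):
    """Split a graph into trainer and pserver sub-graphs before execution."""
    trainer = [op for op in ops if op.type not in optimizer_types]
    pserver = [op for op in ops if op.type in optimizer_types]

    def crossing(producers, consumers):
        # Variables written on one side and read on the other.
        produced = {v for op in producers for v in op.outputs}
        consumed = {v for op in consumers for v in op.inputs}
        return sorted(produced & consumed)

    to_pserver = crossing(trainer, pserver)  # gradients in this example
    to_trainer = crossing(pserver, trainer)  # updated parameters in this example

    def send(v):
        return Op("send", (v,), ())

    def recv(v):
        return Op("recv", (), (v,))

    trainer_graph = [recv(v) for v in to_trainer] + trainer + [send(v) for v in to_pserver]
    pserver_graph = [recv(v) for v in to_pserver] + pserver + [send(v) for v in to_trainer]
    return trainer_graph, pserver_graph


user_graph = [
    Op("mul", ("X", "W"), ("Y",)),
    Op("mul_grad", ("X", "Y@GRAD"), ("W@GRAD",)),
    Op("sgd", ("W", "W@GRAD"), ("W",)),
]
trainer_graph, pserver_graph = convert(user_graph)
```

Because the split happens before the graph is sent to any worker, the parameter server sub-graph can contain arbitrary optimizer ops, which a runtime hash over parameter elements could not express.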
Understood. You mean that people who develop graphs do not need to look into the implementation details of how we actually send and recv variables; the graph shows how the calculations flow logically. But when we build and optimize the graph, we can make the actual send operation one per trainer.
Will you add some implementation thoughts in this PR or in another one?
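One way to picture that optimization is the sketch below. It is a hypothetical illustration, not an implementation from this PR: `group_sends`, the endpoint names, and the `placement` mapping are assumptions. Logical per-variable sends are batched so that each trainer performs one physical send per destination.

```python
from collections import defaultdict


def group_sends(placement, send_vars):
    """Batch per-variable logical sends into one physical send per destination."""
    batches = defaultdict(list)
    for var in send_vars:
        batches[placement[var]].append(var)
    return {dest: sorted(names) for dest, names in batches.items()}


# Hypothetical placement decided ahead of time: variable -> parameter server.
placement = {"W1@GRAD": "ps0", "W2@GRAD": "ps1", "W3@GRAD": "ps0"}
batches = group_sends(placement, ["W1@GRAD", "W2@GRAD", "W3@GRAD"])
```

With a single parameter server this reduces to one send per trainer; with several servers it is one per trainer and destination.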
@typhoonzero Thanks for the reminder about the implementation thoughts! That's a good idea. I will probably not mention implementation details in this PR, but will create a separate issue to discuss them. After receiving everyone's comments, I have some points I need to rethink; I will update this PR and create the implementation detail issue at that time.
@putcn Sorry, my PR could be clearer, the
From @Superjom: after the sub-graphs are stitched back together, the result should still be a usable graph.
LGTM!
According to the current design, there are more concepts that need to be clarified in this design doc.
LGTM
Here could be easier to review