-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fault tolerant distributed training, just work version, with etcd #2849
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM++ few minor comments.
go/master/client.go
Outdated
@@ -38,7 +39,8 @@ func (c *Client) getRecords() { | |||
if err != nil { | |||
// TODO(helin): wait before move on with next |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove the TODO, since it's completed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
etcd_endpoint = "http://" + etcd_ip + ":2379" | ||
|
||
|
||
def cloud_reader(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can put this into paddle.reader.recordio
in the future, so paddle.reader.recordio
can support both local and cloud reader. (not necessary in this PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Will do this in next PR.
@@ -51,7 +51,8 @@ def __init__(self, | |||
update_equation, | |||
extra_layers=None, | |||
is_local=True, | |||
pserver_spec=None): | |||
pserver_spec=None, | |||
use_etcd=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can make the default requires less dependency, by defaulting use_etcd
to False.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use_etcd
has default value to be True
currently
import os | ||
import cPickle as pickle | ||
|
||
etcd_ip = os.getenv("MASTER_IP", "127.0.0.1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
不好意思,这里上次没看见,是不是应该把“MASTER_IP”改成"ETCD_IP"? master貌似跟etcd没有关系。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
目前paddlecloud启动job时,etcd是启动在master pod里的。后续考虑多个job使用同一个etcd的情况下修改下。
parameterClient_(-1), | ||
newParameters_(nullptr), | ||
newGradients_(nullptr), | ||
pserverSpec_(pserverSpec), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里有时候是pserver的addresses,有时候是etcd的address,是不是该改成remoteAddr之类的。
This is a "just work" version, supporting distributed training with etcd and master process.