[discussion] How to set clusterspec #369

Closed
gaocegege opened this issue Feb 4, 2018 · 3 comments

@gaocegege
Member

Currently we set the TF_CONFIG environment variable so that the training code can obtain the cluster spec. This follows the convention used by Google Cloud Machine Learning Engine (Cloud ML Engine).

The TensorFlow distributed training docs use command-line arguments instead. We cannot support both approaches, since we want to hide the service discovery layer from users, so I think we should discuss which one to support. Perhaps AI engineers could give us more input.
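
For concreteness, here is a minimal sketch of what the TF_CONFIG approach means on the controller side: every replica receives the same `cluster` map (job name to a list of `host:port` addresses) plus its own `task` entry (`type` and `index`). The layout follows the commonly documented TF_CONFIG convention; the `build_tf_config` helper and the `myjob-*` service names are hypothetical, not the operator's actual code.

```python
import json


def build_tf_config(cluster, task_type, task_index):
    """Return the TF_CONFIG JSON string for one replica.

    `cluster` maps a job name (e.g. "ps", "worker") to a list of
    "host:port" addresses, following the common TF_CONFIG layout.
    """
    if task_type not in cluster:
        raise ValueError("unknown task type: %s" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index out of range: %d" % task_index)
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Hypothetical service names; a controller would derive these from its own
# service discovery layer, which stays hidden from the user.
cluster = {
    "ps": ["myjob-ps-0:2222"],
    "worker": ["myjob-worker-0:2222", "myjob-worker-1:2222"],
}

# Every replica gets the same cluster map; only the "task" entry differs.
for job, hosts in sorted(cluster.items()):
    for i in range(len(hosts)):
        print(job, i, build_tf_config(cluster, job, i))
```

The user code never needs to know how the addresses were discovered; it only reads one environment variable.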

/cc @DjangoPeng @ScorpioCPH

@jlewi
Contributor

jlewi commented Feb 4, 2018

TF_CONFIG is a TensorFlow convention: TF APIs such as the Estimator API read it to learn about the runtime environment and configure the job accordingly.
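
On the training side, Estimator users get this without extra code, since `tf.estimator.RunConfig` reads TF_CONFIG when it is constructed. Code that builds its own servers can parse the variable directly. A minimal sketch, assuming the same `cluster`/`task` layout (the `read_tf_config` helper and the addresses are hypothetical); the returned `cluster` dict is in the job-to-addresses form that `tf.train.ClusterSpec` accepts.

```python
import json
import os


def read_tf_config(environ=os.environ):
    """Parse TF_CONFIG into (cluster, task_type, task_index).

    Returns ({}, None, 0) when TF_CONFIG is unset or empty, i.e. a local,
    non-distributed run.
    """
    raw = environ.get("TF_CONFIG")
    if not raw:
        return {}, None, 0
    config = json.loads(raw)
    task = config.get("task", {})
    return config.get("cluster", {}), task.get("type"), int(task.get("index", 0))


# Example: what a worker replica might see (addresses are made up).
fake_env = {"TF_CONFIG": json.dumps({
    "cluster": {"ps": ["ps-0:2222"], "worker": ["worker-0:2222", "worker-1:2222"]},
    "task": {"type": "worker", "index": 1},
})}
cluster, task_type, task_index = read_tf_config(fake_env)
print(task_type, task_index, cluster["worker"][task_index])  # worker 1 worker-1:2222
```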

If users want to use command-line arguments, they can write a launcher script that parses TF_CONFIG and sets the command-line arguments as needed. This is what we do in the TFCNN example; see https://github.com/kubeflow/kubeflow/blob/master/tf-controller-examples/tf-cnn/launcher.py
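
A launcher along these lines can be sketched as follows. It assumes a parameter-server layout with `ps` and `worker` jobs and emits the `--ps_hosts`, `--worker_hosts`, `--job_name` and `--task_index` flags used in the TensorFlow distributed training how-to. Those flag names are an assumption about what the training script expects, and the linked launcher.py may work differently; `tf_config_to_flags` and `main` are hypothetical names.

```python
import json
import os
import sys


def tf_config_to_flags(raw):
    """Translate a TF_CONFIG JSON string into --flag=value arguments.

    Flag names follow the ps/worker example from TensorFlow's distributed
    training how-to; a real training script may expect different names.
    """
    config = json.loads(raw)
    cluster = config["cluster"]
    task = config["task"]
    return [
        "--ps_hosts=" + ",".join(cluster.get("ps", [])),
        "--worker_hosts=" + ",".join(cluster.get("worker", [])),
        "--job_name=" + task["type"],
        "--task_index=" + str(task["index"]),
    ]


def main():
    # Hypothetical usage: python launcher.py python trainer.py [extra args...]
    if len(sys.argv) < 2:
        sys.exit("usage: launcher.py COMMAND [ARGS...]")
    command = sys.argv[1:] + tf_config_to_flags(os.environ["TF_CONFIG"])
    # Replace this process with the training command plus the derived flags.
    os.execvp(command[0], command)
```

A container entrypoint would call `main()`. The translation step is exactly where per-project conventions creep in: another script might want `--cluster_spec` or different job names, whereas TF_CONFIG stays the same for every job.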

I think the problem with command-line arguments is that everyone will use slightly different conventions.

@gaocegege
Member Author

SGTM, WDYT @DjangoPeng

@gaocegege
Member Author

I am closing this since it is stale. If there are further comments, I will reopen it. Thanks.
