-
Notifications
You must be signed in to change notification settings - Fork 504
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor argparser to be consistent with torch.distributed #4149
Conversation
PR overall LGTM, do you have a chance to test the pr on a pod? This part will not be tested on our CI so I want to be careful not to introduce any regression. |
@JackCaoG is there a particular way to setup/install my branch across the pod that I'm doing incorrectly? I tried doing: gcloud compute tpus tpu-vm ssh zach-accelerate-v3-32 --zone {ZONE} --command cd /usr/share; pip3 install git+https://github.com/muellerzr/xla@refactor-args --worker all But I hit this error during install: ERROR: Command errored out with exit status 1:
command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-_bjvr1g8/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-_bjvr1g8/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-req-build-_bjvr1g8/pip-egg-info
cwd: /tmp/pip-req-build-_bjvr1g8/
Complete output (8 lines):
File "/tmp/pip-req-build-_bjvr1g8/scripts/gen_lazy_tensor.py", line 33
def lowering_function(self, f: Union[NativeFunctionsGroup,
^
SyntaxError: invalid syntax
Failed to generate lazy files: ['python', '/tmp/pip-req-build-_bjvr1g8/scripts/gen_lazy_tensor.py']
Building torch_xla version: 1.14
XLA Commit ID: a227933f6bb4b4ff4d93412307052393211cca88
PyTorch Commit ID:
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
##### Command execution on worker 3 failed with exit status 1. Continuing. (repeat n times for all the workers) In case it may be a python version issue all the pod nodes are running py 3.8.10 |
you don't this change on every host, |
@JackCaoG thanks! Can confirm it works just fine by just copy/pasting the file over there. |
can you fix the linter and we are good to merge. |
@JackCaoG sorry about that! Missed a line :) |
Thanks @muellerzr ! |
In 🤗 Accelerate we do our best not to rely on subprocess too heavily when calling argparsed scripts and also want to make sure it's possible to filter and sort based on the arguments presented in them (see here for an example).
Since we want
accelerate launch
to be able to launch TPU Pods, a refactor like this is an easy QOL for us to make it simple and helps making it consistent with torch distributed (see here), as well as being able to start the script without the use of argparseThis PR also makes it so that the inner functionality to spawn the TPU cluster and resolve the VM is in its own function and can be imported/used directly
Let me know what you think, thanks!