
Refactor argparser to be consistent with torch.distributed #4149

Merged — 6 commits merged into pytorch:master on Nov 3, 2022

Conversation

@muellerzr (Contributor) commented Nov 2, 2022

In 🤗 Accelerate we do our best not to rely on subprocess too heavily when calling argparse-based scripts, and we also want to make sure it's possible to filter and sort based on the arguments they expose (see here for an example).

Since we want `accelerate launch` to be able to launch TPU Pods, a refactor like this is an easy quality-of-life improvement for us: it keeps things simple, makes the CLI consistent with torch.distributed (see here), and makes it possible to start the script without going through argparse.

This PR also moves the inner functionality that spawns the TPU cluster and resolves the VM into its own function, so it can be imported and used directly.
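The importable-function pattern described above can be sketched roughly like this. All names here (`ClusterConfig`, `resolve_and_run`) are illustrative placeholders, not the actual pytorch/xla API introduced by this PR:

```python
# Hypothetical sketch: exposing the cluster-spawn logic as an importable
# function, so a caller like 🤗 Accelerate doesn't need to build an argv
# list and shell out to an argparse-driven script via subprocess.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClusterConfig:
    tpu_name: str
    zone: str
    docker_image: Optional[str] = None
    env: Dict[str, str] = field(default_factory=dict)


def resolve_and_run(config: ClusterConfig, cmd: List[str]) -> int:
    """Resolve the TPU VM cluster and launch `cmd` on every worker.

    A library can call this directly instead of invoking the launcher
    script in a subprocess and parsing its CLI flags.
    """
    # ... resolve workers, open SSH sessions, run cmd on each ...
    print(f"launching {cmd!r} on TPU pod {config.tpu_name} in {config.zone}")
    return 0


# Direct, argparse-free usage:
exit_code = resolve_and_run(
    ClusterConfig(tpu_name="my-pod", zone="us-central1-a"),
    ["python3", "train.py"],
)
```

The payoff of this design is that arguments become ordinary typed function parameters, which a caller can inspect, filter, and validate programmatically.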

Let me know what you think, thanks!

@JackCaoG (Collaborator) commented Nov 3, 2022

The PR overall LGTM. Have you had a chance to test it on a pod? That path won't be exercised by our CI, so I want to be careful not to introduce a regression.

@muellerzr (Contributor, Author) commented Nov 3, 2022

@JackCaoG is there a particular way to set up/install my branch across the pod that I'm doing incorrectly? I tried:

gcloud compute tpus tpu-vm ssh zach-accelerate-v3-32 --zone {ZONE} --worker all \
    --command "cd /usr/share; pip3 install git+https://github.com/muellerzr/xla@refactor-args"

But I hit this error during install:

    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-_bjvr1g8/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-_bjvr1g8/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-req-build-_bjvr1g8/pip-egg-info
         cwd: /tmp/pip-req-build-_bjvr1g8/
    Complete output (8 lines):
      File "/tmp/pip-req-build-_bjvr1g8/scripts/gen_lazy_tensor.py", line 33
        def lowering_function(self, f: Union[NativeFunctionsGroup,
                                     ^
    SyntaxError: invalid syntax
    Failed to generate lazy files: ['python', '/tmp/pip-req-build-_bjvr1g8/scripts/gen_lazy_tensor.py']
    Building torch_xla version: 1.14
    XLA Commit ID: a227933f6bb4b4ff4d93412307052393211cca88
    PyTorch Commit ID:
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
##### Command execution on worker 3 failed with exit status 1. Continuing.

(repeated once per worker)

In case it's a Python version issue: all the pod nodes are running Python 3.8.10.

@JackCaoG (Collaborator) commented Nov 3, 2022

You don't need this change on every host; xla_dist is only invoked on the master host (whichever host you ssh into and run `python -m xla_dist` from). So you can just check out this branch and rebuild, or simply download the diff and manually patch the pytorch/xla install in the system directory.
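For the manual-patch route, a quick way to find where the installed module lives on the master host is to ask Python for its spec. This is a hedged helper sketch: the module path `torch_xla.distributed.xla_dist` is an assumption here, and the layout may differ on your install:

```python
# Locate the installed xla_dist source file so a patched copy can be
# dropped over it. Pure stdlib; degrades gracefully if torch_xla is
# absent or the module path differs.
import importlib.util

try:
    spec = importlib.util.find_spec("torch_xla.distributed.xla_dist")
    origin = spec.origin if spec is not None else None
except ModuleNotFoundError:
    origin = None

if origin:
    print("patch this file:", origin)
else:
    print("torch_xla not found on this host (or the module path differs)")
```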

@muellerzr (Contributor, Author)

@JackCaoG thanks! Can confirm it works just fine by just copy/pasting the file over there.

@JackCaoG (Collaborator) commented Nov 3, 2022

Can you fix the linter? Then we're good to merge.

@muellerzr (Contributor, Author)

@JackCaoG sorry about that! Missed a line :)

@JackCaoG JackCaoG merged commit bde1bc6 into pytorch:master Nov 3, 2022
@JackCaoG (Collaborator) commented Nov 3, 2022

Thanks @muellerzr !
