Questions wrt training on TPU Pod #501
Hi there!
|
Hi @sgugger I started trying this out and the first thing that popped up is this error. How do I get past it? This will probably be a lengthy thread with all the errors; I would be happy to document all the info once it's fixed |
@sumanthd17 can you provide the full error it gives for you there? (And nothing wrong with this being lengthy, usually these make great educational material eventually as well 😄 ) |
|
I created a new TPU VM and installed the following libraries: torch==1.11+cu10.2. After this I tried to run
|
I was able to get past the above issue and start training on 8 cores (changed versions, attached below). But more cores are still an issue. Any workaround for this?
|
We're actively working on this, give us a bit please 😃 |
For now though consider this unsupported |
Is "for now" more like days, weeks, or months? ^^ |
Since we were gonna make this a lengthy, educational thread... ;) I'm facing a similar issue to the one described in your post above and in huunguyen10's post. I've been seeing this error quite a bit while working with pods and code that was originally meant to run on a single TPU VM. Which of the settings do you think is the important one? I'm thinking xla/pytorch might need to be at least 1.11 to fix this issue, but I'm not completely sure yet. I'd also love to hear where the Hugging Face team is at, and whether they consider this a target or more of a nice-to-have if things happen to fall into place, so I can adjust my expectations and my own efforts accordingly. Thanks for all the insights offered here so far! Hoping for more lengthy posts (or copy-pasteable solutions will work as well 😇) Edit, things to consider:
|
@muellerzr @sgugger Thanks for the great accelerate library, it is super reliable on 8 cores! Can I ask when accelerate will support training on TPU VMs with more than 8 cores? We are eager to try accelerate on more TPUs :) |
@jianguoz It's not a priority for now, as we have no means of testing the solution (our request for access to a free small TPU pod to maintain Accelerate was denied). Of course, if lots of users show interest, we'll reconsider! |
@sgugger Thanks very much for your quick update :). We have several colleagues interested in deploying Accelerate on more cores. Looking forward to the future release :) |
To help us properly gauge the need for this feature, if you are actively trying to train on a TPU pod with PyTorch could you react with a 👍 to this message? 😄 Thanks! |
@Ontopic @sumanthd17 Hi there, please react to the message above with a 👍🏻 if you want to train models on more than 8 TPU cores in the future |
@sherlock42 check this out for training on TPU VMs with accelerate |
@jianguoz what modifications did you make to run accelerate on 8 Google Cloud TPU cores? It would help if you have any code that you could share. |
You can take a look at https://huggingface.co/docs/accelerate/index and especially the examples inside https://github.com/huggingface/transformers/tree/main/examples/pytorch. It is pretty simple to run accelerate on TPUs. |
We also have a doc specifically on tpu best practices: https://huggingface.co/docs/accelerate/concept_guides/training_tpu |
@muellerzr There are several likes showing interest in training with Accelerate on TPU VMs with more than 8 cores, and I think many people may have the same request to scale up their training with Accelerate but have not yet noticed this GitHub issue. Do you have any plans to prioritize this request? We could provide feedback on a TPU VM v3-32. Thanks :) |
@jianguoz in about two weeks or so I'll be able to look at this, as yes this seems to be a quite desired feature 😄 Thanks for your patience everyone! |
Hi @jianguoz et al., I am happy to say we're at a place where you can beta-test the new pod launcher! Currently it only supports GCP-based TPU pods, as this is what we can test on currently. Here are the steps to try this out (assume all commands are run solely from the main ssh instance/worker you are working off of unless specified otherwise):
1. Either install torch_xla from their latest nightly, or find where torch_xla is installed on the main instance (do pip show torch_xla to find it) and put the latest xla_dist.py there to replace torch_xla.distributed.xla_dist. E.g. it could look like: wget https://raw.githubusercontent.com/pytorch/xla/master/torch_xla/distributed/xla_dist.py -O /usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_dist.py
2. Install accelerate with sudo pip3 install git+https://github.com/huggingface/accelerate@tpu-pod-launch (for the commands to be available; sudo is scary, I know!)
3. Run accelerate config and answer the new prompts, or modify your existing default_config.yaml (which can be found in .cache/huggingface/accelerate/) to include:
   - tpu_name: SOME_TPU_NAME (this TPU name should align with how it is registered in GCP, so what you pass in when calling gcloud compute tpus tpu-vm ssh {SOME_TPU_NAME})
   - tpu_zone: SOME_TPU_ZONE (the zone your TPU pod lives in, such as europe-west4-a)
   - tpu_cluster: true (this will make sure you're enabling the cluster launcher)
4. Make sure the workers in your pod are configured to ssh into each other by performing gcloud compute config-ssh. If you can't do this, you may need to log in with gcloud auth login first.
5. Using the new accelerate tpu-config command, download the script you wish to run and store it in /usr/share/. For example I did: accelerate tpu-config --command "sudo wget https://gist.githubusercontent.com/muellerzr/a85c9692101d47a9264a27fb5478225a/raw/bbdfff6868cbf61fcc0dcff8b76fe64b06fe43ab/xla_script.py" (I'll look into whether a better way to upload a file to all of them without using wget is possible, but for now this is what the API is :) ). This command will also start in /usr/share, hence why there is no need to cd. Alternatively, make sure that the script you wish to run is available on every worker however you see fit; just make sure the file is there and available!
6. Accelerate needs to be installed on each worker, so do accelerate tpu-config --command "sudo pip3 install git+https://github.com/huggingface/accelerate@tpu-pod-launch". In the future this will just be accelerate tpu-config --install_accelerate.
7. From there, just run accelerate launch /usr/share/{script_name} and it should launch on the pod for you! E.g. accelerate launch /usr/share/xla_script.py if you are following the above.
Please let me know how this experience works for you, and what feedback you may have on it! |
Hi Zachary, thanks for the great update! We are currently trying the new launcher on a v3-32. Will share some feedback soon :) |
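For reference, the three pod-specific entries from step 3 could be patched into an existing default_config.yaml with a few lines of Python. This is a hypothetical sketch (the helper name and plain string-based approach are mine, not part of Accelerate; it avoids needing a YAML library):

```python
# Hypothetical helper (not part of Accelerate) that sets the three
# pod-specific keys in an existing default_config.yaml's text.

def add_pod_config(config_text: str, tpu_name: str, tpu_zone: str) -> str:
    """Return config_text with tpu_name/tpu_zone/tpu_cluster set."""
    # Drop any stale pod keys, then append fresh ones.
    kept = [line for line in config_text.splitlines()
            if not line.startswith(("tpu_name:", "tpu_zone:", "tpu_cluster:"))]
    kept += [
        f"tpu_name: {tpu_name}",   # name as registered in GCP
        f"tpu_zone: {tpu_zone}",   # e.g. europe-west4-a
        "tpu_cluster: true",       # enable the cluster launcher
    ]
    return "\n".join(kept) + "\n"

if __name__ == "__main__":
    base = "compute_environment: LOCAL_MACHINE\ndistributed_type: TPU\n"
    print(add_pod_config(base, "my-pod", "europe-west4-a"))
```

Running `accelerate config` and answering the prompts is the supported route; this only illustrates what the resulting file should contain.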
@muellerzr Hi Zachary, sorry for the late reply (just restored access to the TPUs). When I run |
Hi @jianguoz! Sorry you responded just as I was off for vacation that week. As the directions state you should run Though I would have thought |
@muellerzr Thanks for your instructions! We have tried the above steps, and most commands, such as |
Hey @jianguoz, glad to hear |
Hi @jianguoz, apologies this has taken a month to get to; I understand that can be quite frustrating 😢 I was indeed able to recreate your issue on a new TPU pod, so let's try some different instructions that worked for me:
startup.sh:
#!/bin/bash
wget https://raw.githubusercontent.com/pytorch/xla/master/torch_xla/distributed/xla_dist.py -O /usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_dist.py
sudo pip3 install git+https://github.com/huggingface/accelerate@tpu-pod-launch
Then create the pod with the startup script attached:
gcloud compute tpus tpu-vm create {{INSERT_NAME_HERE}} --zone {{INSERT_ZONE_HERE}} --version tpu-vm-pt-1.12 --project={{INSERT_PROJECT_HERE}} --accelerator-type v3-32 --metadata startup-script="$(cat startup.sh)"
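The gcloud invocation above could also be assembled programmatically, which helps avoid quoting mistakes when inlining the startup script. A minimal sketch (the function name and defaults are illustrative; the flags mirror the command shown above):

```python
# Hypothetical sketch: build the argv for `gcloud compute tpus tpu-vm create`
# with the startup script inlined via --metadata, as in the instructions.

def build_create_cmd(name, zone, project, startup_script,
                     accelerator="v3-32", version="tpu-vm-pt-1.12"):
    """Return the argv list for creating a TPU pod with a startup script."""
    return [
        "gcloud", "compute", "tpus", "tpu-vm", "create", name,
        "--zone", zone,
        "--version", version,
        f"--project={project}",
        "--accelerator-type", accelerator,
        # The startup script runs on every worker at boot, replacing the
        # manual per-worker setup steps.
        "--metadata", f"startup-script={startup_script}",
    ]

if __name__ == "__main__":
    demo = build_create_cmd("my-pod", "europe-west4-a", "my-project",
                            "#!/bin/bash\necho hi\n")
    print(" ".join(demo[:6]))
```

Passing the argv list to subprocess.run (rather than a shell string) keeps the multi-line startup script intact as a single --metadata value.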
Can you confirm if this works for you and that you can recreate it? I'm also looking into a better way to send the script across the pods today. Thanks for your patience 🙏 If you don't want to set up a new instance and want to follow the old directions, replace |
Hi @muellerzr, thanks for your detailed instructions. Whether we create a new pod following the above process or start from Step 3, Step 5 still does not work: it cannot connect to the pod workers, i.e., Step 3 does not help. Can you try it again to see if there are any potential issues? Thanks so much :). If you have time, we could also schedule a quick meeting to accelerate the process. |
@jianguoz I did it from a fresh instance when I posted those instructions and did not face any issues. Is it still that "fail to execute" issue? |
Hi @jianguoz, thanks for the wonderful debugging session! Let's try running through this all again; please follow these steps:
startup.sh:
#!/bin/bash
wget https://raw.githubusercontent.com/pytorch/xla/master/torch_xla/distributed/xla_dist.py -O /usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_dist.py
sudo pip3 install git+https://github.com/huggingface/accelerate@tpu-pod-launch
gcloud compute tpus tpu-vm create {{INSERT_NAME_HERE}} --zone {{INSERT_ZONE_HERE}} --version tpu-vm-pt-1.12 --project={{INSERT_PROJECT_HERE}} --accelerator-type v3-32 --metadata startup-script="$(cat startup.sh)"
Let me know if that works for you or if you face trouble! |
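Once the pod is up, the same shell command still has to reach every worker via accelerate tpu-config. A small sketch of how those invocations could be built (the --command flag comes from the thread; the wrapper itself is a hypothetical illustration):

```python
# Hypothetical helper for broadcasting a shell command to every pod worker
# via `accelerate tpu-config --command "..."`. Only the CLI shape is taken
# from the thread; the function itself is illustrative.

def tpu_config_argv(command: str) -> list:
    """Return the argv that runs `command` on all pod workers."""
    return ["accelerate", "tpu-config", "--command", command]

# e.g. installing the beta branch everywhere, as in the earlier steps:
install = tpu_config_argv(
    "sudo pip3 install git+https://github.com/huggingface/accelerate@tpu-pod-launch")
```

Keeping the command as a single argv element avoids the nested-quoting issues that arise when composing such calls in a shell one-liner.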
Hi @muellerzr, thanks for the debugging session! When I run Step 6 of the new instructions, I face the errors below:
Did you forget to modify the launch file accordingly, or did I miss something? Thanks :) |
Thanks @jianguoz, please try again by downloading the latest commit I pushed (just wget that file again). Wound up needing one more thing I forgot to do, name duplication 😆 |
Thanks @muellerzr! It raises another issue:
Below is the output on the new pod:
I tested the above commands and found that Step 1 is okay. However, after I replace the |
Thanks @jianguoz, will give it a peek tomorrow and see if I can solve it! |
This has now been introduced in #1049. Please follow the new
The example script I use is located here: We have also introduced a I did not notice your issue @jianguoz, so do let me know if it is still present after this |
Hi Accelerate Team,
I'm looking to use run_mlm_no_trainer.py on a TPU v3-128 pod. I have a few questions before I get started with the process. Thanks in advance.
cc: @sgugger @muellerzr