[DeepSpeed] [success] trained t5-11b on 1x 40GB gpu #9996
Well, I'm closing this right away, since it's not a bug, but feel free to comment or ask questions below.
(I'm adding to this issue, even though it's closed, because it's directly related.) I am seeing OOM trying to get this to work: 1 GPU, SeqLength 128 (originally tried 256), buffers {2e8, 3e8, 5e8} (just changes when the OOM occurs), BS=1. @stas00, I kept track of the GPU memory (as reported in nvidia-smi) to see if it's a progressive memory leak, but I don't think it is:
Runscript:
Conda environment:
The monster output (just the last bit of it):
Thank you for the report and the details, @PeterAJansen. In the future, let's try to have a dedicated issue for each unique problem, but since the OP wasn't really an issue, it is now ;) so all is good. Let me see if I can reproduce the problem with your changes; perhaps my data sample was too short. The other difference I see is that you're not using [...]. Let me see.
These are normal, not a problem.
OK, I'm able to reproduce it. The GPU memory usage grows slowly at some times and jumps up by several GBs in quick bumps at other times. I used buffers of 1e8 and this cmd:
Which means that either transformers (the trainer or the model) or DeepSpeed or both leak memory. I'm going to switch to a much smaller model size, as with this model it takes ages just to start - can't develop like this - and try to detect where the leak is coming from. BTW, here is a tip. Currently transformers does a silly thing - it inits the model, inits the weights, and then overwrites all this work with the pretrained weights. With this model that takes like 10 minutes. You can shortcut it with:
which skips 90% of the pointless weight inits. I'm trying to advocate for this to be a feature here: #9205
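For readers who want a concrete picture of this shortcut, here is a minimal sketch (an illustration of the idea, not the exact snippet from the comment above - it no-ops a private hook, so treat it as a temporary hack; recent transformers releases implement a fast-init path for this, so it mostly matters on older versions):

```python
# Hedged sketch: skip the redundant random weight init that from_pretrained()
# will overwrite with the checkpoint anyway, by no-op'ing the model class's
# private _init_weights hook. A hack for older transformers versions.
from transformers import T5ForConditionalGeneration

T5ForConditionalGeneration._init_weights = lambda self, module: None
model = T5ForConditionalGeneration.from_pretrained("t5-11b")  # skips the ~10 min init pass
```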
Heh, we were assuming it was OOM, but it got SIGSEGV - I didn't bother to look closer - so pytorch w/ Deepspeed segfaults pretty much at step 22. Investigating...

No useful info in the core backtrace - stripped binaries. I eliminated the possibility that the issue could be with pytorch. Most likely a regression in DS. Downgrading DeepSpeed makes the problem go away - I must have been using an old DS yesterday and that's why it was working for me. Trying to locate the faulty commit in DS.

And the reason it was happening always at step 22 was because AdamW wasn't running until this step - these are all those "skipping step" overflow reports:
As soon as it ran, it segfaulted. Hopefully we will have a fix soon, but until then please use deepspeed==0.3.10.
Thanks @stas00! I have downgraded to deepspeed 0.3.10 and I'm going to leave Transformers running overnight on a proper training job to see if it crashes (it's currently about 20% completed, so that's promising). Though it does appear that the GPU memory usage periodically moves from ~34GB up to nearly the entire 40GB minus a few hundred MB, so it's a real nail-biter watching it: Transformers+DeepSpeed really doesn't believe in wasting RAM... :)
Update: DeepSpeed yanked 0.3.11 from PyPI, so a normal pip install should now result in a good working 0.3.10 being installed, until this issue is fixed.
Update on my end: with DeepSpeed 0.3.10 it ran successfully through the night on a full job, training and generating the predictions. Amazing work @stas00 et al.
@stas00 I'm not sure if this is a bug or if I'm just not doing it correctly given how fast most of this is moving, but I'm trying to evaluate/generate predictions post-training and getting not-on-device errors. I should note that it worked fine when I did the whole thing in one command (train/eval/predict) overnight, but now I'm trying to use the fine-tuned model to generate predictions on other data. I have (a) just removed the --do_train flag from the call to finetune_trainer (and set the model path to the output path of the fine-tuned model), and this gives an error (below). I've also (b) tried CPU-based eval (--device cpu) with the official instructions in examples/seq2seq/, which gave a different error (but I've not done non-CUDA eval before, so that might be my issue). Here's the error from (a):
Are you on master, and not by chance on my experimental t5-pipeline branch? If it's the latter then it's very likely that you'd hit that "not on the current device" error. Please make sure you're using the master branch.
Definitely on the master :)
Update: I did figure out the CPU eval error -- I had --fp16 set (as in the example script), which currently throws an esoteric pytorch error on CPU ("threshold_cpu" not implemented for 'Half'). Removing this lets it run on CPU, but with 64 cores T5-11B is evaluating at 150 seconds per generation, instead of less than 1 sec with the GPU, so I think I'll kill that.
It's AMD. I'm using Peter's machine for debugging this, so you can ask me anything. @PeterAJansen, glad you sorted it out - let me see if I can reproduce that, and we could ensure that we prevent the erroneous fp16/CPU combination in the first place. Update on DeepSpeed: it looks like the segfault over CPU ADAM problem is specific to AMD, which is the case on your computer, so the DeepSpeed team are working on figuring that out and hopefully will have a new release some time soon that will do the right thing on AMD and be fast too.
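The guard hinted at above could be as simple as something like this (a sketch; args is a hypothetical namespace with device and fp16 attributes, not the actual Trainer arguments object):

```python
# Hedged sketch of the suggested safety check: reject the fp16 + CPU
# combination up front instead of letting pytorch fail later with an obscure
# "not implemented for 'Half'" error.
import torch

def check_precision_args(args) -> None:
    if args.fp16 and torch.device(args.device).type == "cpu":
        raise ValueError(
            "--fp16 requires a CUDA device; half-precision ops are not "
            "implemented on CPU. Drop --fp16 for CPU evaluation."
        )
```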
Thank you!
@PeterAJansen, for the future let's do this:
Then:
;)
Thanks!
Apologies, I think in my exhilaration that it's running T5-11B on 40GB cards I forgot proper issue submission procedures. The --fp16 error is submitted as issue #10040 :)
I'd love to answer your question, @benathi, but I haven't had a chance to experiment with this feature yet. Perhaps ask at https://discuss.huggingface.co/? The HF arsenal has several models that implement sparse attention natively: https://huggingface.co/blog/long-range-transformers

Deepspeed implements sparse attention, but I am not sure how we would plug it into HF Transformers. That is, it has this section of the config file, but I think it only works with some of their internal features. I don't know. It might be a good idea to ask at https://github.com/microsoft/DeepSpeed whether we could integrate that into Transformers - I'd love to know the answer myself. If you'd like to take the lead on the research I'd be happy to help integrate it. If you ask, please tag me as well. Thank you!
@stas00 I see the ds_config.json uses "auto" casting. I cannot train a 13B multilingual mT5-xxl model on 8x 40GB A100s on AWS.
"auto" just allows converting I made a possible workaround for t5/mt5 overflows which worked some and not for others, you may want to try: Ideally, especially since you're using A100, you should train in bf16 mixed precision, the work is being done on it here: But deepspeed doesn't yet support bf16 - perhaps it'd be beneficial to ask Deepspeed about supporting bf16 by opening a feature request at https://github.com/microsoft/DeepSpeed/issues - If you feel inspired to do so?
If fairscale gives a working solution then by all means use it. Does it? I just don't know the answer.

Megatron-LM released a t5 model recently, but it doesn't yet support pipeline parallelism, so if tensor parallelism is sufficient for your setup it might do the trick (transformers will have TP shortly as well). You can ping them asking when PP will be added - I doubt it'll happen any time soon if nobody asks. Their bert/gpt2 have full DP/TP/PP support, but not yet t5.

Finally, try activating gradient checkpointing, which should help a lot to lower memory usage:
Thanks a lot @stas00 for your reply. |
Glad to hear that!
DS uses their own mixed precision, which doesn't lend itself to users overriding it. But it should be possible to add a code branch so that, if the code is running under deepspeed, we manually upcast to fp32 and then downcast back to fp16 before handing things back to deepspeed. Let me know if you need help with that; this would require no deepspeed understanding, I believe. And I haven't tried that, so it's possible that my idea may or may not work.
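To illustrate the pattern being described (a sketch only - the flag and function names are made up, and this is not an actual patch to the model code):

```python
# Hedged sketch: do the overflow-prone math in fp32 and hand fp16 back to the
# fp16 engine. `under_deepspeed_fp16` is a hypothetical flag the surrounding
# code would set when running under DeepSpeed's fp16 mixed precision.
import torch
import torch.nn.functional as F

def overflow_safe_activation(x: torch.Tensor, under_deepspeed_fp16: bool) -> torch.Tensor:
    if under_deepspeed_fp16 and x.dtype == torch.float16:
        return F.gelu(x.float()).to(torch.float16)  # upcast, compute, downcast
    return F.gelu(x)
```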
Do you mean the sharded DDP (ZeRO@fairscale)? Do let us know - I have no idea what the state of that project is nowadays.
@stas00 any idea about this? I keep getting overflow. Using version 0.5.3 of deepspeed due to torch restrictions:

[2021-11-13 19:22:08,401] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16.0, reducing to 8.0
This looks like an issue to report on the deepspeed side, @tuhinjubcse: https://github.com/microsoft/DeepSpeed/issues
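For context, those OVERFLOW messages come from DeepSpeed's dynamic loss scaler, which is configured in the fp16 section of the DeepSpeed config. A hedged illustration follows - the key names are from the DeepSpeed config documentation, while the values and file name are just examples:

```python
# Hedged example: the fp16 section controls the dynamic loss scaler that
# prints the "OVERFLOW! ... reducing to N" messages. Values are illustrative.
import json

fp16_section = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 = dynamic loss scaling
        "initial_scale_power": 16,  # start the scale at 2**16
        "loss_scale_window": 1000,  # overflow-free steps before raising the scale
        "hysteresis": 2,
        "min_loss_scale": 1,
    }
}

with open("ds_config_fp16.json", "w") as f:  # hypothetical file name
    json.dump(fp16_section, f, indent=2)
```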
@stas00 could you confirm your torch / deepspeed / apex / transformers versions?
Please see: #9996 (comment)
@stas00 Thanks so much
I used LR = 1e-3 previously without deepspeed and it worked perfectly. I am doing generation, but now when using deepspeed the loss seems noisy. Anything you recommend?

{'loss': 5.4677, 'learning_rate': 0.0, 'epoch': 0.02}
Oh, that was a totally random setting which has no impact on what it was testing (memory usage). I use the same scripts to test many models, and most of the time I only care about them working and/or fitting into memory when I do that particular type of work. I train them for like 50 iterations... Of course, when training for real, I pay attention to the recommended hparam settings. So please don't take any of the lr-like hparams in my memory-fitting examples as a recommendation for real training.

But let's not mix unrelated things in the same thread. If you'd like to discuss a different topic please kindly open a new issue and we can discuss it there.
@stas00 Hopefully this is relevant. I know you had success on an A100 40GB GPU. I am using deepspeed on 4 GPUs and I receive OOM after training for several hours. Any idea as to what I can do here?
My script
My config
Are you monitoring the memory consumption over the duration of the training - is it borderline OOM from the get-go, or is the memory usage slowly creeping up?

But regardless, you're using only stage-2, and you want stage-3 in this situation, since if you're not sharding the params you get only 12 out of 18 bytes sharded per param. Stage-3 is slower than stage-2 since it has to do more work, but if you can't fit into your GPUs, stage-3 is what you want. Note that I'm using stage 3 here: #9996 (comment)
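To make the stage-2 vs stage-3 point concrete, here is a hedged sketch of the zero_optimization section for stage 3 with CPU offload (key names follow the DeepSpeed documentation; the values are illustrative and a real ds_config.json will contain more than this):

```python
# Hedged sketch: ZeRO stage 3 shards the parameters as well as the gradients
# and optimizer states; offloading both to CPU frees still more GPU memory.
zero3_section = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    }
}
print(zero3_section)  # merge into your ds_config.json
```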
So this is the state at the beginning of the training, right? Then check it, say, once every 30 min and note the differences - if your application is well written then it shouldn't grow after, say, a few hundred iterations, assuming the longest seqlen with the widest batch size has been consumed already.

I'm also noticing that you're using a very old version of our examples -
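One simple way to do that kind of periodic check from inside a training loop (a sketch using only standard torch.cuda counters; where and how often you call it is up to you):

```python
# Hedged sketch: log allocated and peak GPU memory every `log_every` steps to
# see whether usage is flat after warm-up or creeping upward (a possible leak).
import torch

def log_gpu_memory(step: int, log_every: int = 1000) -> None:
    if step % log_every == 0 and torch.cuda.is_available():
        alloc = torch.cuda.memory_allocated() / 2**30
        peak = torch.cuda.max_memory_allocated() / 2**30
        print(f"step {step}: allocated {alloc:.1f} GiB, peak {peak:.1f} GiB")
```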
The snapshot I sent you was after 5 hrs of training. I have 7M samples, and I reduced the max seq len to 64 from 128, so I'm hoping it works this time. Last time it failed around 40% of the way through training. It's at 22% now. Yes, if I still can't make it work I will switch to a recent version of the software.
Right, I'm not sure my message is coming across - I'm suggesting to monitor the memory usage throughout the training, and that if it OOMs you need to switch to ZeRO-3, after which you should be able to train with a much longer seqlen. Enabling https://huggingface.co/transformers/performance.html#gradient-checkpointing is another technique to allow for a much longer seqlen.
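A minimal sketch of enabling gradient checkpointing with a recent transformers release (the API names here are current ones and may differ from the older example scripts used in this thread):

```python
# Hedged sketch: gradient checkpointing trades extra compute for a large drop
# in activation memory, allowing longer seqlen and/or bigger batches.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # small model just for illustration
model.gradient_checkpointing_enable()
# or via the Trainer: TrainingArguments(..., gradient_checkpointing=True)
```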
@stas00 many thanks for your guidance. I could finetune for 1 epoch. I converted the model to fp32, looked at the output, and noticed it's generating garbled text. Now of course this could be because it's only 1 epoch, but I trained on 772073 samples. Just to be clear, I have a T5 3B model trained on the same data but using different code and it works perfectly, so I'm assuming my data is fine. It generated something, but I am wondering what could be the reason. One thing I suspect is the learning rate - why is it showing 0.0?
Why is your learning rate 0.0?
@stas00 that's something I don't understand. As you can see in my script I specified 1e-3.
Someone here said the same.
I'd be happy to debug this with you, but let's first switch to the current example, which is https://github.com/huggingface/transformers/blob/master/examples/pytorch/translation/run_translation.py - it should be mostly the same, with some args renamed - see the README.md for details: https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation

E.g. my staple cmd that I use is:
Additionally, please open a new Issue, since this discussion is now taking over this already closed issue - let's give it a dedicated space. Just don't forget to tag me in the new Issue.
How did you run inference, bro?
Could you please tell me where I can find ds_config.json and finetune_trainer.py? Thank you!
The examples have been renamed and re-organized since the time of this thread; you can find them all under examples/ in the repo. E.g. the translation example is now at examples/pytorch/translation/run_translation.py. For deepspeed, please see the DeepSpeed integration section of the transformers docs.
@stas00 sorry for such a question, but do I understand correctly that every training example executed in 5 seconds? If yes, approximately how long do you think it would take to train T5-11B from scratch on such hardware?
Multiply the iteration time by how many batches you plan to feed the model and you will get the total time needed to train any model. As I wasn't part of the t5 training, I don't know what their numbers were.
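As a concrete, made-up example of that arithmetic (none of these numbers come from this thread):

```python
# Hedged back-of-the-envelope: total training time ~= seconds per step * steps.
secs_per_step = 20.0        # assumed time for one optimizer step on your hardware
num_samples   = 30_000_000  # assumed training set size
batch_size    = 8           # assumed effective batch size per step
epochs        = 1

steps = epochs * num_samples / batch_size
days = steps * secs_per_step / 3600 / 24
print(f"~{steps:,.0f} steps -> ~{days:.1f} days")
```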
Managed to train t5-11b on 1x 40GB gpu w/ Deepspeed (A100-SXM4-40GB)
Thank you, @PeterAJansen, for letting me use your hardware!
Thank you, @jeffra and @samyam, for not believing that it is not possible to train t5-11b on 1x 40GB gpu w/ Deepspeed, and for supporting me, which led me to find a few bugs in the integration.
Sharing details for those who need them.
If you want to try this at home, please make sure you use transformers master, as some bug fixes were just merged in.
Well, it's similar to the t5-3b on 24GB success reported here and here.
But this time t5-11b on 1x 40GB gpu (or 4x if you wanted things faster)
As someone asked me before, you need a huge amount of general RAM to use ZeRO-Offload for a huge model. I was using /usr/bin/time -v program to get the peak memory measurement - it's the Maximum resident set size entry in the final report.

Question: I don't think /usr/bin/time does the right thing for multi-process - I think it only measures the parent process. E.g. with 4x gpus it reported only 102GB RAM, but I clearly saw in top that it was around 240GB. If you have an easy way to measure peak memory that takes into account forked processes I'm all ears (one possible approach is sketched at the end of this post).

Batch sizes on one gpu: I'm referring to these batch sizes in ds_config.json:

And I tested 2x and 4x DDP as well; BS=16 OOMed, BS=8 was good so I used that - but could probably squeeze some more.

edit1: later tests show that my test was too short and wasn't letting the CPU Adam optimizer kick in, as it skips the first 20 or so steps because of the overflow. Once it kicks in it takes more GPU memory, so the practical BS is much smaller - I think around 2 on this setup. So most likely you will need to use BS=2 for real work, until things get optimized even more.

edit2: things are getting re-shuffled in the tests, so the default ds_config.json file has moved in master to a new, hopefully permanent home. It's now at examples/tests/deepspeed/ds_config.json, so you will need to adjust the command line to reflect this new location, or simply copy it over to where the old one used to be.

Here is the full benchmark:
Checkpointing should allow making even bigger batch sizes.
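Following up on the open question above about measuring peak memory across forked processes, here is one possible approach (a sketch using psutil, not what was used for the numbers in this thread; because it samples, very short spikes can be missed):

```python
# Hedged sketch: sample the combined RSS of a process tree (the launcher plus
# the per-GPU workers it forks) and keep the running peak. /usr/bin/time -v
# only reports the parent, which is why it under-reports for multi-GPU runs.
import sys
import time

import psutil

def watch_peak_rss(pid: int, interval: float = 1.0) -> None:
    peak = 0
    root = psutil.Process(pid)
    while root.is_running():
        try:
            procs = [root] + root.children(recursive=True)
        except psutil.NoSuchProcess:
            break
        rss = 0
        for p in procs:
            try:
                rss += p.memory_info().rss
            except psutil.NoSuchProcess:
                pass  # a worker exited between listing and sampling
        peak = max(peak, rss)
        print(f"current {rss / 2**30:.1f} GiB, peak {peak / 2**30:.1f} GiB")
        time.sleep(interval)

if __name__ == "__main__":
    watch_peak_rss(int(sys.argv[1]))  # pass the training launcher's PID
```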