Out of memory error running on 2 A100s with 80 GB each #8
Comments
@joris-sense I have added some lines to properly set up the CUDA devices in both the script and the notebook, can you check? |
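The lines that were added aren't quoted in this thread; below is a minimal sketch of what setting up the CUDA devices for a 2-GPU machine usually looks like (the device indices and the device_map suggestion are assumptions, not the notebook's exact code):

```python
import os
import torch

# Make both A100s visible to the process; adjust the indices to your machine.
# This must run before CUDA is initialized, i.e. before any tensor touches the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

print(torch.cuda.device_count())  # should report 2

# When loading the model, device_map="auto" lets accelerate shard the weights
# across the visible GPUs instead of placing everything on cuda:0.
```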
The cell needs to be moved to after the point where the model is defined. Then I tried running it on 2x A100 again (I was renting these from vast), and I got an error about FlashAttention not being found, even though the command installing it was run and the kernel was reloaded. I'll stop investigating this for now, because it does work for me on an H100 (using quanto instead of BitsAndBytes), and those are actually cheaper on vast than A100s...
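For context on the quanto-instead-of-BitsAndBytes route mentioned above, loading with quanto quantization in recent transformers versions looks roughly like this (the model id and the int8 setting are illustrative, not the notebook's exact configuration):

```python
from transformers import AutoModelForVision2Seq, QuantoConfig

# Quanto int8 weight quantization as an alternative to a BitsAndBytes 4-bit setup.
quanto_config = QuantoConfig(weights="int8")

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",  # illustrative model id
    quantization_config=quanto_config,
    device_map="auto",
)
```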
|
Yes, it is defined within the fine-tuning case and not under the LoRA part for that reason. I am currently a bit busy (this is a side project at work), so I will define a requirements file for each of the scripts, but you should install flash attention with |
That code block appears twice in the notebook now, the first time before "We will load VQAv2 dataset"... that's the part where it breaks. Yeah, I tried installing flash-attn in several variants (the Jupyter notebook contains a !pip install command for that as well; these parts seemed to work for me before), but somehow it didn't work this time. Anyway, thanks for updating these so far! |
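For anyone hitting the same thing: the exact install command referenced above didn't survive this extract, but the command documented by the flash-attn project is typically the following (it assumes a CUDA toolchain that matches your PyTorch build):

```
pip install flash-attn --no-build-isolation
```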
@joris-sense hmm interesting, maybe there's an issue with the env. Can you try to do `import flash_attn` and then `print(flash_attn.__version__)`? If nothing comes up, can you check if |
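A slightly fuller version of that check, which also asks transformers whether it can actually use FlashAttention 2 on the current stack (a sketch; the helper name is from recent transformers releases):

```python
import flash_attn
print(flash_attn.__version__)

# Recent transformers versions expose a helper that reports whether
# FlashAttention 2 is usable with the installed torch/CUDA combination.
from transformers.utils import is_flash_attn_2_available
print(is_flash_attn_2_available())
```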
Sorry, I already destroyed the 2x A100 instance now... Anyway, it (at least my current notebook, which is based on your previous version) does work perfectly fine on an H100 now, so I guess I'll pass the torch to whoever else finds this issue with a similar setup.
|
@joris-sense thanks a lot for taking the time to open the issues 💗 I think it happens from time to time; above is how I debug it. |
On another machine, I got that "flash attention 2 is not available" error again, and commenting out |
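The argument being commented out isn't shown here, but the usual suspect is the attention backend requested at load time. A hedged sketch of falling back automatically when FlashAttention 2 isn't available (the model id is illustrative; attn_implementation and "sdpa" follow the transformers from_pretrained API):

```python
import torch
from transformers import AutoModelForVision2Seq
from transformers.utils import is_flash_attn_2_available

# Use FlashAttention 2 when it is usable, otherwise fall back to PyTorch's SDPA
# instead of hard-coding attn_implementation="flash_attention_2".
attn_impl = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",  # illustrative model id
    torch_dtype=torch.bfloat16,
    attn_implementation=attn_impl,
)
```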
Hey, thanks for creating these notebooks! But I am trying to run Idefics_FT, and unfortunately it isn't working... I run into an out-of-memory error when calling trainer.train(), even though I am running on a machine with 2 A100s with 80 GB each. Any idea what the problem could be? I did make sure not to initialize the model twice, and just loading the model seems to take only about 10 GB, as expected.
The trainer seems to get up to step 3/2144, and the cell running trainer.train() produces some additional warnings.
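Not part of the original report, but a quick way to check whether the load is actually spread over both GPUs before calling trainer.train() (a minimal sketch):

```python
import torch

# Print per-GPU memory usage to verify the model is sharded across both A100s
# rather than sitting entirely on cuda:0.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i}: allocated {allocated:.1f} GiB, reserved {reserved:.1f} GiB")
```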