Running on H100? #9

Open
joris-sense opened this issue Sep 5, 2024 · 6 comments
Comments

@joris-sense

Hey, when trying to run Idefics_FT.ipynb on an H100 machine, I seem to be getting the problem described here. Is there a way around this, perhaps by using something other than bitsandbytes?

@merveenoyan
Owner

@joris-sense I ran it on an A100 instance, not an H100 :( Can't you do LoRA only or full FT, since you have access to an H100?

@joris-sense
Author

joris-sense commented Sep 6, 2024

I am sitting on it right now, and the training loop seems to work when I replace bitsandbytes with quanto =)

So I use

from transformers import QuantoConfig

if USE_QLORA:
    quanto_config = QuantoConfig(weights="int4")
    model = Idefics3ForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        quantization_config=quanto_config if USE_QLORA else None,  # bnb_config
        _attn_implementation="flash_attention_2",
    )

Does this have disadvantages compared to bitsandbytes, or is there something else I should use?

@merveenoyan
Owner

@joris-sense I think they're the same thing; if anything, Quanto is more up to date.
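
For reference, the bitsandbytes path the notebook originally took would look roughly like this (a sketch only, assuming the notebook's model_id and USE_QLORA variables, and picking typical QLoRA-style NF4 settings that may differ from the exact flags in Merve's version):

import torch
from transformers import BitsAndBytesConfig, Idefics3ForConditionalGeneration

# 4-bit NF4 quantization with bf16 compute, roughly analogous to QuantoConfig(weights="int4")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,  # assumed to be defined earlier in the notebook
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=bnb_config if USE_QLORA else None,
    _attn_implementation="flash_attention_2",
)

One difference worth noting: bitsandbytes uses the NF4 data type from the QLoRA paper, while Quanto's int4 is a plain integer scheme; both cut weight memory by roughly the same factor.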

@joris-sense
Author

joris-sense commented Sep 6, 2024

> @joris-sense I ran it on an A100 instance, not an H100 :( Can't you do LoRA only or full FT, since you have access to an H100?

My understanding was that the main advantage of LoRA/QLoRA is the reduced memory requirement rather than improved speed? In any case, trying it out, the H100 runs at similar speed for all three methods.

Thinking about it, why does the Jupyter notebook take 50 GB of VRAM even when training a QLoRA model with 8B parameters? Shouldn't the weights take a lot less at 4 bits, on the order of 4 GB?
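
As a rough back-of-envelope check (a sketch with assumed round numbers, not measurements from the notebook), the 4-bit weights alone should indeed only be a few GB; the rest of the footprint has to come from whatever stays in higher precision, plus activations and optimizer state:

# Rough memory estimate for QLoRA on an ~8B-parameter model (assumed round numbers).
params = 8e9

weights_4bit_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes per weight -> ~4 GB
weights_bf16_gb = params * 2.0 / 1e9   # if the base weights stay in bf16 -> ~16 GB

# Assume ~0.002% of params are trainable (the fraction mentioned in the next comment).
lora_params = 0.00002 * params
lora_state_gb = lora_params * (2 + 4 + 4 + 4) / 1e9  # bf16 weight + fp32 grad + Adam m/v

print(f"4-bit base weights:  ~{weights_4bit_gb:.1f} GB")
print(f"bf16 base weights:   ~{weights_bf16_gb:.1f} GB")
print(f"LoRA training state: ~{lora_state_gb:.3f} GB")

# Activations scale with batch size, sequence length and image resolution and can
# easily dominate; gradient checkpointing trades them for extra compute.

Anything far beyond those numbers usually points to the base weights not actually being stored quantized, or to activations from large image inputs.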

@merveenoyan
Owner

@joris-sense I forgot to mention it there, but my training setup was only freezing the image encoder, not doing LoRA training. I have now uploaded new versions of the notebook and script that are much more QLoRA-focused, and it takes around 17 GB of VRAM with 0.002% of the params being trained. Can you try?
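
A quick way to verify that trainable fraction on any setup (a minimal sketch; with a PEFT-wrapped model you can also just call model.print_trainable_parameters()):

# Count trainable vs. total parameters after freezing / applying LoRA.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.4f}%)")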

@joris-sense
Author

That still seems like a lot to me; it looks like the model's weights are stored unquantized, since this is more than 2 * 8 GB of VRAM (as I understand it, with QLoRA the model weights should be stored quantized as well).

I didn't get your new script to work on my machine with QLoRA, so I'm sticking to full fine-tuning for now. I seem to get different errors on each run, but one of them was that FlashAttention complains during inference (which I copied from your last version -- I am also missing an inference part in the new one) that the model's weights are stored in float32, as seen by

dtype = next(model.parameters()).dtype
print(dtype)

and it throws the error "FlashAttention only support fp16 and bf16 data type". If I convert the weights, the model does infer, but the output seems unrelated to what it was trained on. I also didn't get below 50 GB of VRAM and sometimes still get out-of-memory errors, and with your new global variables the script didn't seem to find a GPU on a one-H100 setup (I know this is probably too vague to act upon; maybe I will figure out more and make a more reproducible report over the weekend). Also note that your notebook's default settings are still USE_LORA=False and USE_QLORA=False, and that it still references model before it is defined in cell 8.
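
For what it's worth, the cast that got inference past the FlashAttention dtype check in this situation was simply the following (a sketch of the workaround described above, not necessarily the intended fix for the notebook):

import torch

# FlashAttention-2 only supports fp16/bf16, so cast a model that was loaded in fp32.
if next(model.parameters()).dtype == torch.float32:
    model = model.to(torch.bfloat16)

print(next(model.parameters()).dtype)  # should now be torch.bfloat16

Note that this only applies to an unquantized (full fine-tuning) model; a 4-bit quantized model cannot simply be cast with .to().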
