google/flan-t5-xl - A100 80GB out of memory - LoRA #945

Closed
tomekrut opened this issue Sep 18, 2023 · 4 comments

@tomekrut

Hi
I have a model based on T5, but wrapped in another class, as there are many customizations done at the header level for a text classification use case. All in all, within the main model class I call

self.t5 = T5Model.from_pretrained(
    'google/flan-t5-xl',
    from_tf=False,
    config=self.config,
    cache_dir=self.cache_dir,
)

Another object within the main model class is
self.header = MyCustomClass()

As it is a custom model, I looked at the following places:

  1. https://github.com/huggingface/peft/blob/main/examples/sequence_classification/LoRA.ipynb
  2. https://www.philschmid.de/fine-tune-flan-t5-peft
  3. https://github.com/huggingface/peft/blob/main/examples/multilayer_perceptron/multilayer_perceptron_lora.ipynb

When I call the following:

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    modules_to_save=["model.header.____", "model.header.____", ...],
    bias="none",
)

The point is that the trainable parameter count drops from 2.7B to 17M. All is great, but in terms of GPU memory: the model takes about 20GB once loaded, and as soon as training starts it goes OOM. I would imagine that if the weights take around 20GB, and any ~20M-parameter model usually needs around 2GB of GPU memory for training, then in total the training should stay well below 30GB.

Here are the key lines of what happens to the model:

model = MyModel()  # inside there is the self.t5 object
model = get_peft_model(model, lora_config)
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
for i, batch in enumerate(train_dataloader):
    outputs = model(batch)
    loss = outputs[0]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
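
As a side note, peak memory per step can be tracked with something like this (a rough sketch, assuming a single CUDA device; the loop mirrors the snippet above):

import torch

model.print_trainable_parameters()  # PEFT helper; confirms the ~17M figure

torch.cuda.reset_peak_memory_stats()
for i, batch in enumerate(train_dataloader):
    outputs = model(batch)
    loss = outputs[0]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # report the high-water mark of allocated GPU memory after each step
    print(f"step {i}: peak {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")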

What am I doing wrong? And I'm not even talking about the XXL model, which should fit on an A100 according to tutorial 2.

BTW - in the third tutorial above, I see that the count of trainable parameters does not match the expected number:
trainable params: 56,164 || all params: 4,100,164 || trainable%: 1.369798866581922
but the number of parameters actually updated in cells 16 and 17 is 16,000 x 3 + 160 + 4,000 + 2 = 52,162 instead of 56,164. Does anyone know why?

The attached LoRA.py can easily run on a single A100 GPU though. Any thoughts?
lora.txt

@BenjaminBossan
Member

All is great, but in terms of GPU memory: the model takes about 20GB once loaded, and as soon as training starts it goes OOM. I would imagine that if the weights take around 20GB, and any ~20M-parameter model usually needs around 2GB of GPU memory for training, then in total the training should stay well below 30GB.

I'm not sure if your assumption holds. According to this estimator, I get 10 GB for loading google/flan-t5-xl and 41 GB for training it with float32 and Adam. This does not take batch size or LoRA into consideration, so the real value could be lower, but not necessarily below 30 GB. Even with most of the model parameters frozen, they'll still require memory for calculating the activations (though not the gradients).
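
A rough back-of-the-envelope version of that estimate (assuming ~2.7B parameters in float32, Adam keeping two extra state tensors per trainable parameter, and activations ignored):

# Back-of-the-envelope memory estimate; parameter counts are taken from the
# numbers mentioned in this thread, not measured.
params = 2.7e9           # full model
trainable = 17e6         # LoRA adapters
bytes_per_float = 4      # float32

weights = params * bytes_per_float                   # ~10.8 GB, close to the ~10 GB load figure
full_ft = weights + 3 * params * bytes_per_float     # + grads + 2 Adam states -> ~43 GB
lora    = weights + 3 * trainable * bytes_per_float  # grads/Adam states only for trainable params -> ~11 GB

print(f"full fine-tuning ~ {full_ft / 1e9:.0f} GB, LoRA ~ {lora / 1e9:.0f} GB (activations excluded)")

The activations come on top in both cases, and that part grows with batch size and sequence length.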

in the third tutorial above I see that the count of trainable parameters does not match the expected number

Yes, there is a bug: for modules_to_save, we don't set requires_grad=False on the original weights, which means they are counted towards the number of trainable parameters even though they're not trained. This doesn't really matter in this case, and it's being addressed in #905.
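
With the concrete numbers from that notebook (taken from the comment above), that extra copy would account exactly for the gap:

expected  = 16_000 * 3 + 160 + 4_000 + 2   # 52,162: LoRA A/B matrices + the modules_to_save copy
duplicate = 4_000 + 2                      # the original layer, counted again because of the bug
print(expected + duplicate)                # 56,164, the number reported by PEFT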

@tomekrut
Author

tomekrut commented Sep 19, 2023

@BenjaminBossan Thank you for keeping this conversation going. I really appreciate your further comments (help). Here are more details:

  • Great tool (the estimator). Bear in mind that in my case T5Model.from_pretrained(...) is a sub-component of a bigger model. But let's agree we are more or less on the same page here - my GPU has 80GB, so I don't care whether it is 10GB or 20GB.
  • For the sake of argument, let's stick to your numbers - normal training with e.g. BATCH_SIZE = 2 should be around 41GB (so 80GB is definitely enough to train a 2.7B model).
  • When I apply LoRA I get, as I wrote before, a reduction of trainable parameters from 2.7B to 17M. So if the model weights are 10GB by your numbers, and on top of that LoRA trains a 17M-parameter sub-model, I should still be able to train with BATCH_SIZE=2, float32 and Adam very easily.
  • Training any 17M-parameter model takes around 2GB of GPU memory in my experience.
  • 10GB for flan-t5-xl (weights) + 2GB for the 17M sub-model, yet I go OOM on an 80GB GPU.
    Either I am doing something terribly wrong or there is a bug. That's my point.

BTW - when I trained "google/flan-t5-large"

  • without LoRA - 74GB of GPU memory, around 800M trainable parameters
  • with LoRA (everything else the same, LoRA configured as in the first post) - memory consumption is still pretty high at 67GB, with fewer than 20M trainable parameters
  • I would expect the memory consumption to be significantly lower.
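
For reference, PyTorch's allocator report can show what dominates after a step (weights vs. gradients vs. activations); a minimal sketch, assuming device 0:

import torch

# Print a breakdown of allocated/reserved GPU memory after one training step.
print(torch.cuda.memory_summary(device=0, abbreviated=True))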

@BenjaminBossan
Member

Yes, I totally agree that a model of this size should be trainable on 80 GB. You mentioned that you made some custom changes. Although it is quite unlikely that they're the reason, it could still be insightful to compare those vs a "vanilla" flan-t5 on a similar task.

BTW - when I trained "google/flan-t5-large"
without LoRA - 74GB of GPU memory, around 800M trainable parameters

This feels really excessive - is this vanilla flan-t5, or does it also include your changes? Do you do anything else that might influence memory usage, like accelerate, mixed precision, quantization, etc.?

with LoRA (everything else the same, LoRA configured as in the first post) - memory consumption is still pretty high at 67GB, with fewer than 20M trainable parameters

Yes, it seems pretty high. Do you train a single LoRA adapter or are there multiple? Could you try a test with a non-adaptive optimizer like SGD?
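
A quick check along those lines (plain SGD without momentum keeps no per-parameter optimizer state, so a large memory difference vs. AdamW would point at optimizer state):

# Swap the optimizer only; everything else stays as in the original loop.
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)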

@tomekrut
Author

Thank you @BenjaminBossan for all the info. In simple terms, I found the problem: it is the size of the input window (sequence length). Most of the demo use cases you provide use a window size of 128 or 256; when it is increased to e.g. 2000, memory usage pretty much explodes. Thanks for all your help.
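
For anyone hitting the same thing: activation memory grows quickly with sequence length (the attention matrices alone scale quadratically with it), so capping the input length and enabling gradient checkpointing are the usual levers. A rough sketch - the tokenizer call and the max_length value are illustrative assumptions, not my actual code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
texts = ["an example input sentence"]

# Cap the encoder input length so activation memory stays bounded.
inputs = tokenizer(texts, truncation=True, max_length=512,
                   padding=True, return_tensors="pt")

# If the wrapped T5 module is reachable (the exact attribute path depends on
# the wrapper), gradient checkpointing trades compute for activation memory:
# model.t5.gradient_checkpointing_enable()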
