I've been running some experiments on KTO using LoRA and noticed a large disparity between peak allocated and peak reserved memory, which I suspect is a memory leak.
To my understanding, when using LoRA with KTOTrainer, the policy model is also used as the reference model for calculating the reward and the approximate KL divergence, by disabling the adapter weights. I'm not familiar with how gradients and optimizer states differ between KTO and SFT, but intuitively they shouldn't differ much, if at all. So if we aren't loading a separate reference model, memory usage should not be this high.
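To illustrate what I mean, this is roughly how I picture the reference pass when no ref_model is supplied (just a sketch of my understanding, not TRL's actual code; compute_reference_logits and batch are illustrative names):

import torch

def compute_reference_logits(peft_model, batch):
    # Sketch: temporarily switch off the LoRA adapter so the frozen base
    # weights act as the reference policy; no second model is kept in memory.
    with torch.no_grad():
        with peft_model.disable_adapter():  # PEFT context manager
            return peft_model(**batch).logits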
For reference, I am running KTO with the following parameters, using a variant of Llama-3-8B-Instruct ('aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct'):
Thanks for reporting. Please share your system info.
Thanks for looking into this!
System:
Debian 11
Python 3.10
1x A100-80GB
NVIDIA driver 550.90.07, CUDA 12.4
(running on a GCP Compute Engine instance based on the c0-deeplearning-common-cu123-v20240922-debian-11-py310 image)
Tracking memory usage with torch.cuda utilities, for KTO I have:

[KTO memory readings]

but for SFT on the same dataset with the same batch size, etc.:

[SFT memory readings]
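A small helper along these lines (not in my original script, just a sketch) could help localize where the gap opens up, by resetting the peak counters between phases:

import torch

def snapshot_and_reset(label):
    # Print the peaks for the phase that just finished, then reset the peak
    # counters so the next phase is measured on its own.
    print(f"{label}: "
          f"peak allocated {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GB, "
          f"peak reserved {torch.cuda.max_memory_reserved() / 1024 ** 3:.2f} GB")
    torch.cuda.reset_peak_memory_stats()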
Reproduction
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig
from trl import KTOConfig, KTOTrainer

def torch_print_gpu_memory():
    # Current and peak GPU memory as seen by the PyTorch allocator, in GB
    print(f"Memory allocated: {torch.cuda.memory_allocated() / 1024 ** 3:.4f} GB")
    print(f"Memory reserved: {torch.cuda.memory_reserved() / 1024 ** 3:.4f} GB")
    print(f"Peak memory allocated: {torch.cuda.max_memory_allocated() / 1024 ** 3:.4f} GB")
    print(f"Peak memory reserved: {torch.cuda.max_memory_reserved() / 1024 ** 3:.4f} GB")
print('Before loading base model')
torch_print_gpu_memory()
print('\n----------\n')
model = AutoModelForCausalLM.from_pretrained(
    'aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct',
    trust_remote_code=True,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
print('\n----------\n')
print('After loading base model')
torch_print_gpu_memory()
print('\n----------\n')
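# LoRA adapters of rank 128 on all linear layers (alpha 128, dropout 0.05)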
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules='all-linear',
)
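# KTO training arguments: bf16 + gradient checkpointing, effective batch size 4 x 4 = 16, beta = 0.5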
training_args = KTOConfig(
    num_train_epochs=2,
    per_device_train_batch_size=4,
    remove_unused_columns=False,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    evaluation_strategy="steps",
    beta=0.5,
    max_length=1000,
    gradient_checkpointing=True,
    bf16=True,
    save_total_limit=0,
    undesirable_weight=1.0,
    max_prompt_length=800,
    max_completion_length=200,
    report_to="none",
)
print('Before KTOTrainer Instantiation')
torch_print_gpu_memory()
print('\n----------\n')
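# tokenizer and dataset are prepared earlier in my script (omitted here).
# No explicit ref_model is passed: with peft_config, my understanding is the
# trainer reuses the base weights (adapter disabled) as the reference.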
trainer = KTOTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=lora_config,
)
print('\n----------\n')
print('After KTOTrainer Instantiation')
torch_print_gpu_memory()
print('\n----------\n')
print('\n----------\n')
print('Before KTO Train')
torch_print_gpu_memory()
print('\n----------\n')
trainer.train()
print('\n----------\n')
print('After KTO Train')
torch_print_gpu_memory()
print('\n----------\n')
Expected behavior
Peak reserved memory when using KTOTrainer with PEFT should be significantly closer to peak allocated memory.
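If it helps with triage, one thing I can try on my end (assuming, and this is just my guess, that the gap comes from fragmentation in the caching allocator rather than a true leak) is the expandable-segments allocator mode; torch.cuda.memory_summary() also gives a more detailed allocator breakdown than the four numbers above.

import os
# Assumption: if the reserved vs allocated gap is allocator fragmentation
# rather than a leak, expandable segments should shrink it. Must be set
# before CUDA is initialized (i.e., before the first CUDA call).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"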