Is your feature request related to a problem? Please describe.
TensorRT-LLM only accepts a single-rank .nemo LoRA checkpoint (in my case, for Llama 3.1 8B). Therefore, the only way to use my fine-tuned model with the TensorRT-LLM backend is to merge my distributed LoRA checkpoints into the base model using the scripts/nlp_language_modeling/merge_lora_weights/merge.py script. However, that produces one full-size merged model per adapter, which adds up quickly when I fine-tune for multiple downstream tasks.
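For reference, my current workaround looks roughly like the following. The Hydra override names are my best recollection of the script's config and may differ between NeMo versions, so please treat them as assumptions and check them against merge.py in your tree:

```bash
# Workaround: bake the LoRA weights into the base model,
# producing one full-size .nemo per downstream task.
# Override names are assumptions; verify against merge.py's Hydra config.
python scripts/nlp_language_modeling/merge_lora_weights/merge.py \
    trainer.accelerator=gpu \
    tensor_model_parallel_size=1 \
    pipeline_model_parallel_size=1 \
    gpt_model_file=/models/llama3_1_8b_base.nemo \
    lora_model_path=megatron_gpt_peft_lora_tuning.nemo \
    merged_model_path=/models/llama3_1_8b_task_a_merged.nemo
```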
More specifically, after training with TP=2 and PP=2, the contents of my megatron_gpt_peft_lora_tuning.nemo LoRA checkpoint file look like this:
Describe the solution you'd like
It would be nice if we could merge the distributed LoRA weights into a .nemo LoRA checkpoint file that contains weights for only a single rank. That way, the LoRA adapter would stay compatible with TensorRT-LLM even when training on multiple GPUs.
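To illustrate what I have in mind, here is a minimal sketch of the consolidation step. It assumes the per-rank shards have already been extracted from the .nemo archive into plain PyTorch state dicts, that replicated tensors (e.g. the LoRA A matrix) are bitwise identical across TP ranks, and that sharded tensors (e.g. a column-parallel LoRA B matrix) are split along their output dimension. The function name, the sharding assumptions, and the handling of PP stages (whose disjoint key sets would simply need to be unioned, omitted here) are mine, not NeMo API:

```python
import torch

def merge_tp_lora_shards(shards):
    """Merge per-TP-rank LoRA state dicts into a single-rank state dict."""
    merged = {}
    for key in shards[0]:
        tensors = [shard[key] for shard in shards]
        if all(torch.equal(tensors[0], t) for t in tensors[1:]):
            # Identical on every rank -> replicated (e.g. the LoRA A matrix
            # under the assumptions above); keep one copy.
            merged[key] = tensors[0]
        else:
            # Differs per rank -> assume sharded along the output dimension
            # (dim 0), as a column-parallel LoRA B matrix would be.
            merged[key] = torch.cat(tensors, dim=0)
    return merged

# Hypothetical usage with two TP shards extracted from the .nemo archive:
# shards = [torch.load(f"mp_rank_{r:02d}/model_weights.ckpt", map_location="cpu")
#           for r in range(2)]
# torch.save(merge_tp_lora_shards(shards), "lora_single_rank_weights.ckpt")
```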
Thanks in advance!
Best regards,
John