This repository contains a script for training Molmo Series with using HuggingFace.
However the model uploaded at the huggingfece hub is a sort of a preview version that has few limitations.
- Only supports fp32 (Can supprot fp16 or bf16 however not stable)
- Only single-image is supported
- Grad Checkpointing disabled
- Flash-attention and sdpa disabled
These makes the code uses a lot of vram, that you should use some other techniques like Adam8Bit
, LOMO
and some other things.
I'll update some other features for memory efficient ways. (Also when the official repo is updated.)
[Phi3-Vision Finetuning]
[Qwen2-VL Finetuning]
[LLAMA3.2-Vision Finetuning]
- Deepspeed
- LoRA, QLoRA
- Full-finetuning
Install the required packages using environment.yaml
.
conda env create -f environment.yaml
conda activate molmo
The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided --image_folder
.
When using a multi-image dataset, the image tokens should all be <image>
, and the image file names should have been in a list.
Please see the example below and follow format your data.
Example for single image dataset
[
{
"id": "000000033471",
"image": "000000033471.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
},
{
"from": "human",
"value": "What feature can be seen on the back of the bus?"
},
{
"from": "gpt",
"value": "The back of the bus features an advertisement."
},
{
"from": "human",
"value": "Is the bus driving down the street or pulled off to the side?"
},
{
"from": "gpt",
"value": "The bus is driving down the street, which is crowded with people and other vehicles."
}
]
}
...
]
Note: The model was updated to use bf16 or fp16 however, the output could be chagned compared to fp32.
To run the training script, use the following command:
bash scripts/finetune.sh
If you want to train only the language model with LoRA and perform full training for the vision model:
bash scripts/finetune_lora.sh
If you want to train both the language model and the vision model with LoRA:
bash scripts/finetune_lora_vision.sh
IMPORTANT: If you want to tune the embed_token
with LoRA, You need to tune lm_head
together.
Training arguments
--deepspeed
(str): Path to DeepSpeed config file (default: "scripts/zero2.json").--data_path
(str): Path to the LLaVA formatted training data (a JSON file). (Required)--image_folder
(str): Path to the images folder as referenced in the LLaVA formatted training data. (Required)--model_id
(str): Path to the Llama3.2-Vision model. (Required)--output_dir
(str): Output directory for model checkpoints--num_train_epochs
(int): Number of training epochs (default: 1).--per_device_train_batch_size
(int): Training batch size per GPU per forwarding step.--gradient_accumulation_steps
(int): Gradient accumulation steps (default: 4).--freeze_vision_tower
(bool): Option to freeze vision_model (default: False).--tune_projector
(bool): Option to tune projector (default: True).--num_lora_modules
(int): Number of target modules to add LoRA (-1 means all layers).--vision_lr
(float): Learning rate for vision_model.--projector_lr
(float): Learning rate for projector.--learning_rate
(float): Learning rate for language module.--bf16
(bool): Option for using bfloat16.--fp16
(bool): Option for using fp16.--lora_namespan_exclude
(str): Exclude modules with namespans to add LoRA.--max_seq_length
(int): Maximum sequence length (default: 128K).--bits
(int): Quantization bits (default: 16).--disable_flash_attn2
(bool): Disable Flash Attention 2.--report_to
(str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').--logging_dir
(str): Logging directory (default: "./tf-logs").--lora_rank
(int): LoRA rank (default: 128).--lora_alpha
(int): LoRA alpha (default: 256).--lora_dropout
(float): LoRA dropout (default: 0.05).--logging_steps
(int): Logging steps (default: 1).--dataloader_num_workers
(int): Number of data loader workers (default: 4).
Note: The learning rate of vision_model
should be 10x ~ 5x smaller than the language_model
.
If you run out of vram, you can use zero3_offload instead of zero3. However, using zero3 is preferred.
bash scripts/merge_lora.sh
Note: Remember to replace the paths in finetune.sh
or finetune_lora.sh
with your specific paths. (Also in merge_lora.sh
when using LoRA.)
Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
You could run unset LD_LIBRARY_PATH
for this error.
You could see this issue
- Update easy use of adam-8bit
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.
If you find this repository useful in your project, please consider giving a ⭐ and citing:
@misc{Molmo-Finetuning,
author = {Yuwon Lee},
title = {Molmo-Finetune},
year = {2024},
publisher = {GitHub},
url = {https://github.com/2U1/Molmo-Finetune}
}
This project is based on
- LLaVA-NeXT: An amazing open-source project of LMM.
- Molmo Series: Awesome pretrained MLLM by AllenAI.