MLM Filter

Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".

Release

  • [12/30/2024] 🔥 We released a new-generation MLM-Filter model based on Qwen2.5-1.5B, mlm-filter-qwen2.5-1.5b-gpt4o. The instruction data were re-generated with GPT-4o. With the much smaller LLM backbone, inference is significantly faster. The standalone LLaVA codebase for mlm-filter model inference has been removed from this repository and integrated into LLaVA-Unified.
  • [10/24/2024] 🔥 We released two new MLM-Filter models based on Llama 3, mlm-filter-llama-3-8b and mlm-filter-llama-3.2-3b.
  • [2/25/2024] 🔥 We released Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters. We propose to adopt fine-tuned Multimodal Language Models as effective and efficient data filters to select high-quality image-text pairs from large-scale web-crawled image-text data. Check out the paper.

Project Structure

Install

We highly suggest using python==3.10, e.g.,

conda create -n mlm_filter python=3.10
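
Activate the environment before installing dependencies (the name matches the environment created above):

conda activate mlm_filter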

Then install the dependencies for quality score generation:

pip install git+https://github.com/Victorwz/LLaVA-Unified.git

Quality Score Generation

Inference on Single Image

python mlm_filter_scoring_single_image.py --image-path /path/to/image --caption "text caption"

Parameters to note:

  • --metric: quality scoring metric for generation; select among image_text_matching, object_detail_fulfillment, caption_text_quality, semantic_understanding, and all
  • --image-path: path to a local image file or an image URL
  • --caption: the text caption to be scored
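
For example, to generate all four quality scores for one local image (the image path and caption below are placeholders):

python mlm_filter_scoring_single_image.py --image-path ./images/example.jpg --caption "a dog playing in the park" --metric all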

Inference on Webdataset Large-Scale Data

bash run_inference.sh ${GPU_START_ID} ${Metric} ${Model_Path} ${Data_Path} ${Tars_Per_GPU} ${Num_GPU}

Parameters to note:

  • GPU_START_ID: for large-scale score generation across multiple machines, the index of the current machine
  • Metric: quality scoring metric for generation, select among image_text_matching, object_detail_fulfillment, caption_text_quality, semantic_understanding, all
  • Model_Path: path to the mlm filter model checkpoint
  • Data_Path: path to the webdataset image-text tars
  • Tars_Per_GPU: the number of webdataset image-text tars for a single GPU to run inference on
  • Num_GPU: the number of GPUs for one machine, e.g. 1, 8, 16
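
For example, to score a directory of webdataset tars with the image-text matching metric on a single 8-GPU machine (the paths are placeholders and 128 tars per GPU is only illustrative):

bash run_inference.sh 0 image_text_matching /path/to/mlm-filter-checkpoint /path/to/webdataset-tars 128 8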

Fine-Tuning MLM as Data Filter

  1. Prepare data

Please download the 50k multimodal instruction data and save the file to ./data/mlm_filter_instruct_50k_gpt4v_cc12m_4k.json.

Please download the images from the constituent datasets listed in the directory layout below.

After downloading all of them, organize the data as follows in ./data/images:

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
└── cc12m

The OCR-VQA images are repacked by ourselves to ensure that no images included in the LLaVA-v1.5-665k instruction dataset are missing due to download failures.
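
A minimal sketch to verify that the expected image directories are in place before training, assuming the layout above under ./data/images:

import os

root = "./data/images"
# Expected sub-directories under ./data/images (see the layout above).
expected = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "cc12m",
]
missing = [d for d in expected if not os.path.isdir(os.path.join(root, d))]
print("missing directories:", missing if missing else "none")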

  2. Start training!

Please refer to LLaVA-Unified for more fine-tuning guidance.

Training script with DeepSpeed ZeRO-3: LLaVA_Unified/scripts/mlm_filter/finetune.sh.
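
For example, once the data above is prepared, fine-tuning can be launched with the provided script (you will likely need to adjust the data and checkpoint paths inside the script for your setup first):

bash LLaVA_Unified/scripts/mlm_filter/finetune.sh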

Our Best CLIP Model on DataComp-Medium

We also open-sourced our pre-trained CLIP-ViT-B/32 checkpoint under the DataComp-Medium Benchmark Controlled Setting at weizhiwang/clip_datacomp_medium_itm_th_66_AND_odf_th_20_gpt4v. Our best model is trained on the data filtered by both the ITM and ODF quality scores.
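
A minimal sketch for loading this checkpoint with the open_clip library, assuming the Hugging Face repository follows OpenCLIP's hub format (otherwise, download the checkpoint file and pass it locally via the pretrained= argument):

import open_clip

# Load the released CLIP-ViT-B/32 checkpoint from the Hugging Face Hub
# (assumes the repo is published in OpenCLIP's hf-hub layout).
repo = "hf-hub:weizhiwang/clip_datacomp_medium_itm_th_66_AND_odf_th_20_gpt4v"
model, _, preprocess = open_clip.create_model_and_transforms(repo)
tokenizer = open_clip.get_tokenizer(repo)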

License

MIT License

Contacts

For any question or issue, please feel free to contact weizhiwang@ucsb.edu or submit a GitHub issue.

Citation

Please cite our paper if you find this repository interesting or helpful in your research:

@article{mlm-filter,
    title={Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters}, 
    author={Wang, Weizhi and Mrini, Khalil and Yang, Linjie and Kumar, Sateesh and Tian, Yu and Yan, Xifeng and Wang, Heng},
    journal={arXiv preprint arXiv:2403.02677},
    year={2024},
}

Credits

MLM-Filter is developed based on

  • Vicuna: foundation language model for LLaVA
  • LLaVA: the codebase for fine-tuning LLaVA as image-text data filters
  • DataComp: the codebase for data filtering and CLIP pre-training
