# Awesome-Multimodal-Large-Language-Models-Supervised-Finetuning

A curated list of supervised finetuning (SFT) and preference datasets for Multimodal Large Language Models (MLLMs), together with the papers that introduced them.

## Table of Contents

- [🔥 MLLM Supervised Finetuning Dataset](#-mllm-supervised-finetuning-dataset)
  - [MLLM SFT Training Set](#mllm-sft-training-set)
  - [MLLM Preference Training Set](#mllm-preference-training-set)
- [🔥 Paper List](#-paper-list)
  - [SFT](#sft)
  - [Preference](#preference)

## 🔥 MLLM Supervised Finetuning Dataset

### MLLM SFT Training Set

| Dataset | Model | Modality | Quantity | Notes | Link |
| --- | --- | --- | --- | --- | --- |
| LLaVA-Instruct-150K | LLaVA | Image | 150K | | LLaVA-Instruct-150K |
| LLaVA-Instruct-665K | LLaVA-1.5 | Image | 665K | | LLaVA-Instruct-665K |
| CogVLM-SFT-311K | CogVLM | Image | 311K | English & Chinese | CogVLM-SFT-311K |
| LLaVA-OneVision-Data | LLaVA-OneVision | Image, Video | 1.6M | | LLaVA-OneVision-Data |
| ShareGPT4V | ShareGPT4V | Image | 1.2M | | ShareGPT4V |
| ShareGPT4Video | ShareGPT4Video | Video | 4.8M | | ShareGPT4Video |
| Infinity-MM | Aquila-VL | Image | 34.7M | | Infinity-MM |
| LLaVA-Video-178K | LLaVA-OneVision (SI) | Video | 178K | Generated by GPT-4o | LLaVA-Video-178K |
| M4-Instruct-Data | LLaVA-NeXT-Interleave | Image, Video | 1177.6K | Generated by GPT-4V | M4-Instruct-Data |
| InternVL-Chat-V1-2-SFT-Data | InternVL-Chat-V1-2 | Image | 1.2M | | InternVL-Chat-V1-2 |
| Cambrian-10M | Cambrian-1 | Image | 10M | | Cambrian-10M |
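
Most of the image-instruction sets above ship as JSON files of conversation records in the LLaVA-Instruct format. Below is a minimal sketch of inspecting one record, assuming a locally downloaded copy of LLaVA-Instruct-150K; the filename is illustrative, the field names follow the format published in the LLaVA repo, and other datasets may use different schemas.

```python
import json

# Illustrative local path: download the JSON from the dataset release first.
PATH = "llava_instruct_150k.json"

with open(PATH, "r", encoding="utf-8") as f:
    records = json.load(f)  # a list of conversation records

sample = records[0]
# An LLaVA-Instruct record pairs one image with a multi-turn dialogue:
#   "id"            - unique sample id
#   "image"         - image filename, resolved against the source image corpus
#   "conversations" - alternating turns; "from" is "human" or "gpt", and the
#                     first human turn carries the "<image>" placeholder token
print(sample["id"], sample["image"])
for turn in sample["conversations"]:
    print(f'{turn["from"]}: {turn["value"][:80]}')
```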

### MLLM Preference Training Set

| Dataset | Model | Modality | Quantity | Notes | Link |
| --- | --- | --- | --- | --- | --- |
| RLHF-V-Dataset | MiniCPM-V 2.0 | Image | 5.7K | | RLHF-V-Dataset |
| RLAIF-V-Dataset | MiniCPM-Llama3-V 2.5 | Image | 83K | | RLAIF-V-Dataset |
| VLFeedback | Silkie | Image | 380K | | VLFeedback |
| SPA-VL | SPA-VL-DPO | Image | 100K | Safety | SPA-VL |
| MMPR | InternVL2 | Image | 3M | | MMPR |
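
These preference sets store chosen/rejected response pairs and are typically consumed by DPO-style trainers (MMPR's Mixed Preference Optimization also builds on a DPO term). Below is a minimal sketch of the DPO objective on a single pair, assuming per-sequence log-probabilities under the policy and a frozen reference model have already been computed; `beta` and the example numbers are illustrative.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (chosen_margin - rejected_margin)).

    Each margin is the policy-to-reference log-ratio on the same response, so
    minimizing the loss widens the policy's gap between the chosen and rejected
    answers while penalizing drift away from the reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logits))  # == -log(sigmoid(logits))

# Illustrative numbers: the policy already slightly prefers the chosen answer.
print(dpo_loss(policy_logp_chosen=-42.0, policy_logp_rejected=-45.0,
               ref_logp_chosen=-43.0, ref_logp_rejected=-44.0))
```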

## 🔥 Paper List

### SFT

Visual Instruction Tuning, NeurIPS 2023 Oral -> LLaVA-Instruct-150K

Paper | Github | Project

Improved Baselines with Visual Instruction Tuning, CVPR 2024 Highlight -> LLaVA-Instruct-665K

Paper | Github | Project

CogVLM: Visual Expert for Pretrained Language Models, 2023 -> CogVLM-SFT-311K

Paper | Github

LLaVA-OneVision: Easy Visual Task Transfer, 2024 -> LLaVA-OneVision-Data

Paper | Github | Project

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions, ECCV 2024 -> ShareGPT4V

Paper | Github | Project

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions, NeurIPS 2024 -> ShareGPT4Video

Paper | Github | Project

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data, 2024 -> Infinity-MM

Paper

Video Instruction Tuning with Synthetic Data, 2024 -> LLaVA-Video-178K

Paper | Github | Project

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models, 2024 -> M4-Instruct-Data

Paper | Github | Project

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, 2024 -> Cambrian-10M

Paper | Github | Project

### Preference

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback, CVPR 2024 -> RLHF-V-Dataset

Paper | Github | Project

RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness, 2024 -> RLAIF-V-Dataset

Paper | Github

Silkie: Preference Distillation for Large Visual Language Models, CoRR 2023 -> VLFeedback

Paper | Github | Project

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model, 2024 -> SPA-VL

Paper | Github | Project

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization, 2024 -> MMPR

Paper | Github | Project
