A curated list of datasets for supervised fine-tuning (SFT) of Multimodal Large Language Models.

Dataset | Model | Modality | Quantity | Notes | Link |
---|---|---|---|---|---|
LLaVA-Instruct-150K | LLaVA | Image | 150k | | LLaVA-Instruct-150K |
LLaVA-Instruct-665K | LLaVA-1.5 | Image | 665k | | LLaVA-Instruct-665K |
CogVLM-SFT-311K | CogVLM | Image | 311k | English & Chinese | CogVLM-SFT-311K |
LLaVA-OneVision-Data | LLaVA-OneVision | Image, Video | 1.6M | | LLaVA-OneVision-Data |
ShareGPT4V | ShareGPT4V | Image | 1.2M | | ShareGPT4V |
ShareGPT4Video | ShareGPT4Video | Video | 4.8M | | ShareGPT4Video |
Infinity-MM | Aquila-VL | Image | 34.7M | | Infinity-MM |
LLaVA-Video-178K | LLaVA-OneVision (SI) | Video | 178k | Generated by GPT-4o | LLaVA-Video-178K |
M4-Instruct-Data | LLaVA-NeXT-Interleave | Image, Video | 1177.6k | Generated by GPT-4V | M4-Instruct-Data |
InternVL-Chat-V1-2-SFT-Data | InternVL-Chat-V1-2 | Image | 1.2M | | InternVL-Chat-V1-2 |
Cambrian-10M | Cambrian-1 | Image | 10M | | Cambrian-10M |

Dataset | Model | Modality | Quantity | Notes | Link |
---|---|---|---|---|---|
RLHF-V-Dataset | MiniCPM-V 2.0 | Image | 5.7k | | RLHF-V-Dataset |
RLAIF-V-Dataset | MiniCPM-Llama3-V 2.5 | Image | 83k | | RLAIF-V-Dataset |
VLFeedback | Silkie | Image | 380k | | VLFeedback |
SPA-VL | SPA-VL-DP | Image | 100k | Safety | SPA-VL |
MMPR | InternVL2 | Image | 3M | | MMPR |