By default, we evaluate on the COCO Caption validation set (5,000 samples) and report the following metrics: `BLEU-1`, `BLEU-4`, `CIDEr`, and `ROUGE-L`.
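These metrics can be reproduced with the scorers from the `pycocoevalcap` package. Below is a minimal sketch; the example captions are illustrative only, and the official COCO toolkit additionally runs `PTBTokenizer` (which requires a local Java runtime) before scoring.

```python
# pip install pycocoevalcap
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# gts: image_id -> list of reference captions (COCO provides several per image)
# res: image_id -> single-element list holding the model's caption
# (captions below are illustrative placeholders, not real COCO annotations)
gts = {"184613": ["a child holding a flowered umbrella and petting a yak",
                  "a young man holding an umbrella next to a herd of cattle"]}
res = {"184613": ["a boy with an umbrella standing beside a yak"]}

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # list: BLEU-1 .. BLEU-4
rouge_score, _ = Rouge().compute_score(gts, res)
# CIDEr is corpus-level (IDF over all references), so in practice it should
# be computed on the full validation set, not a single image as shown here.
cider_score, _ = Cider().compute_score(gts, res)

print(f"BLEU-1: {bleu_scores[0]:.3f}  BLEU-4: {bleu_scores[3]:.3f}")
print(f"ROUGE-L: {rouge_score:.3f}  CIDEr: {cider_score:.3f}")
```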
We use the following prompt to evaluate all VLMs; the same prompt is shared across models, with no model-specific variants:

> Please describe this image in general. Directly provide the description, do not include prefix like "This image depicts".
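During evaluation, each model receives this prompt together with one image at a time. A minimal collection loop might look like the sketch below; `model.generate` is a hypothetical placeholder for whatever inference API a given VLM exposes, and the resulting dict feeds directly into the scoring snippet above.

```python
# Hypothetical harness: `model.generate` stands in for each VLM's own
# inference API; only the fixed prompt is taken verbatim from above.
PROMPT = ('Please describe this image in general. Directly provide the '
          'description, do not include prefix like "This image depicts".')

def collect_captions(model, images):
    """Query one VLM with the shared prompt; `images` maps image_id -> file path."""
    res = {}
    for image_id, path in images.items():
        caption = model.generate(image=path, prompt=PROMPT)  # hypothetical call
        res[image_id] = [caption.strip()]  # single-element list, as the scorers expect
    return res
```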
Model | BLEU-4 | BLEU-1 | ROUGE-L | CIDEr | Word Count (mean) | Word Count (std) |
---|---|---|---|---|---|---|
EMU2-Chat | 38.7 | 78.2 | 56.9 | 109.2 | 9.6 | 1.1 |
Qwen-VL-Chat | 34.0 | 75.8 | 54.9 | 98.9 | 10.0 | 1.7 |
IDEFICS-80B-Instruct | 32.5 | 76.1 | 54.1 | 94.9 | 9.7 | 3.2 |
IDEFICS-9B-Instruct | 29.4 | 72.7 | 53.4 | 90.4 | 10.5 | 4.4 |
InstructBLIP-7B | 20.9 | 56.8 | 39.9 | 58.1 | 11.6 | 5.9 |
InstructBLIP-13B | 16.9 | 50.0 | 37.0 | 52.4 | 11.8 | 12.8 |
InternLM-XComposer-VL | 12.4 | 38.3 | 37.9 | 41.0 | 26.3 | 22.2 |
GeminiProVision | 8.4 | 33.2 | 31.2 | 9.7 | 35.2 | 15.7 |
LLaVA-v1.5-7B (QLoRA) | 7.2 | 25.0 | 36.6 | 43.2 | 48.8 | 42.9 |
mPLUG-Owl2 | 7.1 | 25.8 | 33.6 | 35.0 | 45.8 | 32.1 |
LLaVA-v1-7B | 6.7 | 27.3 | 26.7 | 6.1 | 40.9 | 16.1 |
VisualGLM | 5.4 | 28.6 | 23.6 | 0.2 | 41.5 | 11.5 |
LLaVA-v1.5-13B (QLoRA) | 5.3 | 19.6 | 25.8 | 17.8 | 72.2 | 39.4 |
LLaVA-v1.5-13B | 5.1 | 20.7 | 21.2 | 0.3 | 70.6 | 22.3 |
LLaVA-v1.5-7B | 4.6 | 19.6 | 19.9 | 0.1 | 72.5 | 21.7 |
PandaGPT-13B | 4.6 | 19.9 | 19.3 | 0.1 | 65.4 | 16.6 |
MiniGPT-4-v1-13B | 4.4 | 20.0 | 19.8 | 1.3 | 64.4 | 30.5 |
MiniGPT-4-v1-7B | 4.3 | 19.6 | 17.5 | 0.8 | 61.9 | 30.6 |
LLaVA-InternLM-7B (QLoRA) | 4.0 | 17.3 | 17.2 | 0.1 | 82.3 | 21.0 |
LLaVA-InternLM2-20B (QLoRA) | 4.0 | 17.9 | 17.3 | 0.0 | 83.2 | 20.4 |
CogVLM-17B-Chat | 3.6 | 21.3 | 20.0 | 0.1 | 56.2 | 13.7 |
Qwen-VL | 3.5 | 11.6 | 30.0 | 41.1 | 46.6 | 105.2 |
GPT-4V (detail: low) | 3.3 | 18.0 | 18.1 | 0.0 | 77.8 | 20.4 |
TransCore-M | 2.1 | 14.2 | 13.8 | 0.2 | 92.0 | 6.7 |
ShareGPT4V-7B | 1.4 | 9.7 | 10.6 | 0.1 | 147.9 | 45.4 |
MiniGPT-4-v2 | 1.4 | 12.6 | 13.3 | 0.1 | 83.0 | 27.1 |
OpenFlamingo v2 | 1.3 | 6.4 | 15.8 | 14.9 | 60.0 | 81.9 |
ShareCaptioner | 1.0 | 8.8 | 9.2 | 0.0 | 164.2 | 31.6 |
We notice that VLMs generating longer image descriptions tend to achieve lower scores under these caption metrics. This is consistent with the word-count columns above: the top-scoring models produce captions of roughly 10 words, close to the typical length of COCO reference captions, while n-gram-based metrics penalize much longer outputs.
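As a quick sanity check on this trend, one can correlate the table's mean word counts with CIDEr. The sketch below uses a handful of rows (chosen for brevity) and `scipy.stats.spearmanr`:

```python
from scipy.stats import spearmanr

# (mean word count, CIDEr) pairs taken from a few rows of the table above
rows = {
    "EMU2-Chat":       (9.6, 109.2),
    "Qwen-VL-Chat":    (10.0, 98.9),
    "InstructBLIP-7B": (11.6, 58.1),
    "GeminiProVision": (35.2, 9.7),
    "LLaVA-v1.5-13B":  (70.6, 0.3),
    "ShareGPT4V-7B":   (147.9, 0.1),
}
word_counts, ciders = zip(*rows.values())
rho, p = spearmanr(word_counts, ciders)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")  # strongly negative: longer captions, lower CIDEr
```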
TBD.