Commit 34a8100

update README (#959)
1 parent 9d3a709 commit 34a8100

2 files changed: +58 -56 lines changed

README.md

+28 -27
@@ -24,6 +24,7 @@

 ## News 🚀🚀🚀

+- `2025/03/13`: 🔥 We introduce [VisualPRM](https://huggingface.co/OpenGVLab/VisualPRM-8B), an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the overall reasoning performance of InternVL2.5-8B and InternVL2.5-78B by 8.4 and 5.9 points, respectively. The training data for this model, termed [VisualPRM400K](https://huggingface.co/datasets/OpenGVLab/VisualPRM400K), is also open-sourced. Please refer to our [paper](https://huggingface.co/papers/2503.10291) and [project page](https://internvl.github.io/blog/2025-03-13-VisualPRM/) for more details.
 - `2024/12/20`: 🔥 We release the [InternVL2.5-MPO](https://internvl.github.io/blog/2024-12-20-InternVL-2.5-MPO/), which is finetuned with [Mixed Preference Optimization](https://huggingface.co/papers/2411.10442) on [MMPR-v1.1](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.1). **The resulting models outperform their counterparts without MPO by an average of 2 points across all model scales on the OpenCompass leaderboard.** These models are available at [HF link](https://huggingface.co/collections/OpenGVLab/internvl25-mpo-6753fed98cd828219b12f849).
 - `2024/12/17`: 🚀 [InternVL2/2.5](https://github.com/PaddlePaddle/PaddleMIX/tree/develop/paddlemix/examples/internvl2) is supported in [PaddleMIX](https://github.com/PaddlePaddle/PaddleMIX) by Paddle Team.
 - `2024/12/05`: 🚀 We release the [InternVL2.5](https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c), an advanced multimodal large language model (MLLM) series with parameter coverage ranging from 1B to 78B. [InternVL2_5-78B](https://huggingface.co/OpenGVLab/InternVL2_5-78B) is the first open-source MLLMs to achieve over **70%** on the **MMMU benchmark**, matching the performance of leading closed-source commercial models like GPT-4o. These models are available at [HF link](https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c).
@@ -426,14 +427,14 @@

 ViT-22B uses the private JFT-3B dataset.

-| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
-| ------------------- | :----: | :---: | :-----: | :---: | :--: | :--: | :-------: |
-| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
-| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
-| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
-| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
-| ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
-| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
+| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
+| ------------------- | :----: | :---: | :-----: | :---: | :---: | :---: | :-------: |
+| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
+| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
+| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
+| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
+| ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
+| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |

 - Semantic Segmentation [\[see details\]](./segmentation#-evaluation)

@@ -449,12 +450,12 @@

 - Zero-Shot Image Classification [\[see details\]](./clip_benchmark#imagenet-variants-and-objectnet)

-| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
-| ----------------- | :---: | :--: | :--: | :---: | :-------: | :-------: |
-| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
-| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
-| ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
-| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
+| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
+| ----------------- | :---: | :---: | :---: | :---: | :-------: | :-------: |
+| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
+| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
+| ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
+| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |

 - Multilingual Zero-Shot Image Classification [\[see details\]](./clip_benchmark#multilingual-imagenet-1k)

@@ -472,13 +473,13 @@

 - Zero-Shot Video Classification

-| method | #frame | K400 | K600 | K700 |
-| ----------------- | :----: | :--: | :--: | :--: |
-| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
-| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
-| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
-| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
-| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
+| method | #frame | K400 | K600 | K700 |
+| ----------------- | :----: | :---: | :---: | :---: |
+| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
+| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
+| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
+| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
+| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |

 </details>

@@ -699,12 +700,12 @@

 - Multilingual Zero-Shot Image-Text Retrieval on XTD [\[see details\]](./clip_benchmark#xtd)

-| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
-| ----------------- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :-----: |
-| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
-| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
-| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
-| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
+| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
+| ----------------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-----: |
+| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
+| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
+| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
+| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |

 </details>

README_zh.md

+30 -29
@@ -24,8 +24,9 @@

 ## 最新消息 🚀🚀🚀

-- `2024/12/20`: 🔥 我们发布了 [InternVL2.5-MPO系列](https://internvl.github.io/blog/2024-12-20-InternVL-2.5-MPO/)。该系列通过 [Mixed Preference Optimization](https://huggingface.co/papers/2411.10442) 算法和 [MMPR-v1.1](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.1) 数据集微调得到。**该系列模型在OpenCompass评测榜单中的整体性能超过MPO训练前两个百分点。** 这些模型可在 [HF 链接](https://huggingface.co/collections/OpenGVLab/internvl25-mpo-6753fed98cd828219b12f849)中下载.
-- `2024/12/17`: 🚀 Paddle团队已在[PaddleMIX](https://github.com/PaddlePaddle/PaddleMIX)框架中适配[InternVL2/2.5](https://github.com/PaddlePaddle/PaddleMIX/tree/develop/paddlemix/examples/internvl2).
+- `2025/03/13`: 🔥 我们发布了 [VisualPRM](https://huggingface.co/OpenGVLab/VisualPRM-8B),一个8B参数两的多模态过程奖励模型(PRM)。该模型在 Best-of-8 的评测设置下使得 InternVL2.5-8B 和 InternVL2.5-78B 在七个多模态推理评测基准上的综合性能分别提升了 8.4 和 5.9 分。该模型的训练数据 [VisualPRM400K](https://huggingface.co/datasets/OpenGVLab/VisualPRM400K)也已经开源。请参考我们的[论文](https://huggingface.co/papers/2503.10291)[项目主页](https://internvl.github.io/blog/2025-03-13-VisualPRM/)来了解更多细节。
+- `2024/12/20`: 🔥 我们发布了 [InternVL2.5-MPO系列](https://internvl.github.io/blog/2024-12-20-InternVL-2.5-MPO/)。该系列通过 [Mixed Preference Optimization](https://huggingface.co/papers/2411.10442) 算法和 [MMPR-v1.1](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.1) 数据集微调得到。**该系列模型在OpenCompass评测榜单中的整体性能超过MPO训练前两个百分点。** 这些模型可在 [HF 链接](https://huggingface.co/collections/OpenGVLab/internvl25-mpo-6753fed98cd828219b12f849)中下载。
+- `2024/12/17`: 🚀 Paddle团队已在[PaddleMIX](https://github.com/PaddlePaddle/PaddleMIX)框架中适配[InternVL2/2.5](https://github.com/PaddlePaddle/PaddleMIX/tree/develop/paddlemix/examples/internvl2)
 - `2024/12/05`: 🚀 我们发布了 InternVL2.5 系列,覆盖了从1B参数到78B参数的多模态大语言模型。[InternVL2_5-78B](https://huggingface.co/OpenGVLab/InternVL2_5-78B) 是首个在MMMU benchmark上得分超过70的开源模型。 这些模型可在 [HF 链接](https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c) 中下载。
 - `2024/11/14`: 我们发布了 [MMPR](https://huggingface.co/datasets/OpenGVLab/MMPR),一个高质量、大规模的多模态推理偏好数据集,以及 [MPO](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/internvl2.0_mpo),一种高效的偏好优化算法。由此训练的模型 [InternVL2-8B-MPO](https://huggingface.co/OpenGVLab/InternVL2-8B-MPO) 在 MathVista 上取得了 67.0 的准确率。更多详情请参阅我们的[论文](https://arxiv.org/abs/2411.10442)[项目主页](https://internvl.github.io/blog/2024-11-14-InternVL-2.0-MPO/)[文档](https://internvl.readthedocs.io/en/latest/internvl2.0/preference_optimization.html)
 - `2024/10/21`: 我们发布了 Mini-InternVL 系列。这些模型在保持极小模型体积的同时实现了出色的性能:4B 模型仅用 5% 的模型大小便达到了 90% 的性能。有关更多详细信息,请查看我们的 [项目页面](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/mini_internvl)[文档](https://internvl.readthedocs.io/en/latest/internvl2.0/domain_adaptation.html)
@@ -426,14 +427,14 @@

 ViT-22B uses the private JFT-3B dataset.

-| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
-| ------------------- | :----: | :---: | :-----: | :---: | :--: | :--: | :-------: |
-| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
-| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
-| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
-| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
-| ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
-| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
+| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
+| ------------------- | :----: | :---: | :-----: | :---: | :---: | :---: | :-------: |
+| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
+| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
+| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
+| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
+| ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
+| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |

 - 语义分割 [\[查看详情\]](./segmentation#-evaluation)

@@ -449,12 +450,12 @@

 - 零样本图像分类 [\[查看详情\]](./clip_benchmark#imagenet-variants-and-objectnet)

-| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
-| ----------------- | :---: | :--: | :--: | :---: | :-------: | :-------: |
-| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
-| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
-| ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
-| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
+| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
+| ----------------- | :---: | :---: | :---: | :---: | :-------: | :-------: |
+| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
+| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
+| ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
+| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |

 - 多语言零样本图像分类 [\[查看详情\]](./clip_benchmark#multilingual-imagenet-1k)

@@ -472,13 +473,13 @@

 - 零样本视频分类

-| method | #frame | K400 | K600 | K700 |
-| ----------------- | :----: | :--: | :--: | :--: |
-| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
-| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
-| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
-| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
-| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
+| method | #frame | K400 | K600 | K700 |
+| ----------------- | :----: | :---: | :---: | :---: |
+| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
+| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
+| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
+| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
+| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |

 </details>

@@ -699,12 +700,12 @@

 - 多语言零样本图文对检索 [\[查看详情\]](./clip_benchmark#xtd)

-| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
-| ----------------- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :-----: |
-| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
-| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
-| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
-| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
+| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
+| ----------------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-----: |
+| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
+| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
+| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
+| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |

 </details>

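Because this commit re-emits whole tables, a quick sanity check on the numbers can help confirm that only formatting changed. The sketch below assumes the `average` column of the XTD retrieval table is the unweighted mean of the eight per-language scores, rounded to one decimal; the README does not state this. The names `xtd_scores`, `reported_average`, and `check_averages` are hypothetical and not part of the repository.

```python
# Per-language scores copied from the XTD table in this diff,
# in column order EN, ES, FR, ZH, IT, KO, RU, JP.
xtd_scores = {
    "AltCLIP": [95.4, 94.1, 92.9, 95.1, 94.2, 94.4, 91.8, 91.7],
    "OpenCLIP-XLM-R-H": [97.3, 96.1, 94.5, 94.7, 96.0, 90.2, 93.9, 94.0],
    "InternVL-C (ours)": [97.3, 95.7, 95.1, 95.6, 96.0, 92.2, 93.3, 95.5],
    "InternVL-G (ours)": [98.6, 97.7, 96.5, 96.7, 96.9, 95.1, 94.8, 96.1],
}

# Values from the table's "average" column.
reported_average = {
    "AltCLIP": 93.7,
    "OpenCLIP-XLM-R-H": 94.6,
    "InternVL-C (ours)": 95.1,
    "InternVL-G (ours)": 96.6,
}


def check_averages(scores, reported, tolerance=0.05 + 1e-9):
    """Compare the unweighted mean of each row with the reported average.

    Assumption: the reported value is the mean rounded to one decimal,
    so the two may differ by up to 0.05 (plus floating-point slack).
    Returns {method: (computed_mean, reported_value, within_tolerance)}.
    """
    result = {}
    for method, values in scores.items():
        computed = sum(values) / len(values)
        ok = abs(computed - reported[method]) <= tolerance
        result[method] = (computed, reported[method], ok)
    return result


if __name__ == "__main__":
    for method, (computed, rep, ok) in check_averages(xtd_scores, reported_average).items():
        status = "consistent" if ok else "MISMATCH"
        print(f"{method}: computed {computed:.3f}, reported {rep} ({status})")
```

The tolerance is set to half a unit in the last reported decimal place, so a row is flagged only when the reported average cannot be explained by rounding.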