Update README.md #959

Merged
merged 1 commit into from
Mar 20, 2025
55 changes: 28 additions & 27 deletions README.md
@@ -24,6 +24,7 @@

## News 🚀🚀🚀

- `2025/03/13`: 🔥 We introduce [VisualPRM](https://huggingface.co/OpenGVLab/VisualPRM-8B), an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the overall reasoning performance of InternVL2.5-8B and InternVL2.5-78B by 8.4 and 5.9 points, respectively. The training data for this model, termed [VisualPRM400K](https://huggingface.co/datasets/OpenGVLab/VisualPRM400K), is also open-sourced. Please refer to our [paper](https://huggingface.co/papers/2503.10291) and [project page](https://internvl.github.io/blog/2025-03-13-VisualPRM/) for more details.
- `2024/12/20`: 🔥 We release the [InternVL2.5-MPO](https://internvl.github.io/blog/2024-12-20-InternVL-2.5-MPO/), which is finetuned with [Mixed Preference Optimization](https://huggingface.co/papers/2411.10442) on [MMPR-v1.1](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.1). **The resulting models outperform their counterparts without MPO by an average of 2 points across all model scales on the OpenCompass leaderboard.** These models are available at [HF link](https://huggingface.co/collections/OpenGVLab/internvl25-mpo-6753fed98cd828219b12f849).
- `2024/12/17`: 🚀 [InternVL2/2.5](https://github.com/PaddlePaddle/PaddleMIX/tree/develop/paddlemix/examples/internvl2) is supported in [PaddleMIX](https://github.com/PaddlePaddle/PaddleMIX) by Paddle Team.
- `2024/12/05`: 🚀 We release [InternVL2.5](https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c), an advanced multimodal large language model (MLLM) series with parameter coverage ranging from 1B to 78B. [InternVL2_5-78B](https://huggingface.co/OpenGVLab/InternVL2_5-78B) is the first open-source MLLM to achieve over **70%** on the **MMMU benchmark**, matching the performance of leading closed-source commercial models like GPT-4o. These models are available at [HF link](https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c).
@@ -426,14 +427,14 @@

ViT-22B uses the private JFT-3B dataset.

| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| ------------------- | :----: | :---: | :-----: | :---: | :--: | :--: | :-------: |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| ------------------- | :----: | :---: | :-----: | :---: | :---: | :---: | :-------: |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |

- Semantic Segmentation [\[see details\]](./segmentation#-evaluation)

@@ -449,12 +450,12 @@

- Zero-Shot Image Classification [\[see details\]](./clip_benchmark#imagenet-variants-and-objectnet)

| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| ----------------- | :---: | :--: | :--: | :---: | :-------: | :-------: |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| ----------------- | :---: | :---: | :---: | :---: | :-------: | :-------: |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |

- Multilingual Zero-Shot Image Classification [\[see details\]](./clip_benchmark#multilingual-imagenet-1k)

@@ -472,13 +473,13 @@

- Zero-Shot Video Classification

| method | #frame | K400 | K600 | K700 |
| ----------------- | :----: | :--: | :--: | :--: |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
| method | #frame | K400 | K600 | K700 |
| ----------------- | :----: | :---: | :---: | :---: |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |

</details>

@@ -699,12 +700,12 @@

- Multilingual Zero-Shot Image-Text Retrieval on XTD [\[see details\]](./clip_benchmark#xtd)

| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
| ----------------- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :-----: |
| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
| ----------------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-----: |
| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |

</details>

59 changes: 30 additions & 29 deletions README_zh.md
@@ -24,8 +24,9 @@

## News 🚀🚀🚀

- `2024/12/20`: 🔥 We released the [InternVL2.5-MPO series](https://internvl.github.io/blog/2024-12-20-InternVL-2.5-MPO/). The series is fine-tuned with the [Mixed Preference Optimization](https://huggingface.co/papers/2411.10442) algorithm on the [MMPR-v1.1](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.1) dataset. **On the OpenCompass leaderboard, these models outperform their counterparts without MPO training by two points overall.** The models can be downloaded from the [HF link](https://huggingface.co/collections/OpenGVLab/internvl25-mpo-6753fed98cd828219b12f849).
- `2024/12/17`: 🚀 The Paddle team has added support in the [PaddleMIX](https://github.com/PaddlePaddle/PaddleMIX) framework for [InternVL2/2.5](https://github.com/PaddlePaddle/PaddleMIX/tree/develop/paddlemix/examples/internvl2).
- `2025/03/13`: 🔥 We released [VisualPRM](https://huggingface.co/OpenGVLab/VisualPRM-8B), a multimodal Process Reward Model (PRM) with 8B parameters. Under the Best-of-8 evaluation setting, it improves the overall performance of InternVL2.5-8B and InternVL2.5-78B across seven multimodal reasoning benchmarks by 8.4 and 5.9 points, respectively. The model's training data, [VisualPRM400K](https://huggingface.co/datasets/OpenGVLab/VisualPRM400K), is also open-sourced. Please refer to our [paper](https://huggingface.co/papers/2503.10291) and [project page](https://internvl.github.io/blog/2025-03-13-VisualPRM/) for more details.
- `2024/12/20`: 🔥 We released the [InternVL2.5-MPO series](https://internvl.github.io/blog/2024-12-20-InternVL-2.5-MPO/). The series is fine-tuned with the [Mixed Preference Optimization](https://huggingface.co/papers/2411.10442) algorithm on the [MMPR-v1.1](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.1) dataset. **On the OpenCompass leaderboard, these models outperform their counterparts without MPO training by two points overall.** The models can be downloaded from the [HF link](https://huggingface.co/collections/OpenGVLab/internvl25-mpo-6753fed98cd828219b12f849)。
- `2024/12/17`: 🚀 The Paddle team has added support in the [PaddleMIX](https://github.com/PaddlePaddle/PaddleMIX) framework for [InternVL2/2.5](https://github.com/PaddlePaddle/PaddleMIX/tree/develop/paddlemix/examples/internvl2)。
- `2024/12/05`: 🚀 We released the InternVL2.5 series, multimodal large language models ranging from 1B to 78B parameters. [InternVL2_5-78B](https://huggingface.co/OpenGVLab/InternVL2_5-78B) is the first open-source model to score above 70 on the MMMU benchmark. The models can be downloaded from the [HF link](https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c).
- `2024/11/14`: We released [MMPR](https://huggingface.co/datasets/OpenGVLab/MMPR), a high-quality, large-scale multimodal reasoning preference dataset, and [MPO](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/internvl2.0_mpo), an efficient preference optimization algorithm. The resulting model, [InternVL2-8B-MPO](https://huggingface.co/OpenGVLab/InternVL2-8B-MPO), achieves an accuracy of 67.0 on MathVista. For more details, see our [paper](https://arxiv.org/abs/2411.10442), [project page](https://internvl.github.io/blog/2024-11-14-InternVL-2.0-MPO/), and [documentation](https://internvl.readthedocs.io/en/latest/internvl2.0/preference_optimization.html).
- `2024/10/21`: We released the Mini-InternVL series. These models achieve strong performance while remaining very small: the 4B model reaches 90% of the performance with only 5% of the model size. For more details, see our [project page](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/mini_internvl) and [documentation](https://internvl.readthedocs.io/en/latest/internvl2.0/domain_adaptation.html).
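
The Best-of-8 setting named in the VisualPRM entry can be illustrated with a minimal sketch of Best-of-N selection: sample N candidate responses, score each reasoning step with a reward model, and keep the candidate with the highest aggregate score. Everything below is an assumption made for illustration. `best_of_n` and `toy_scorer` are hypothetical names, the toy scorer stands in for a real PRM, and averaging the step scores is only one possible aggregation rule; none of it is taken from the VisualPRM code.

```python
from statistics import mean
from typing import Callable, List, Sequence

# A step scorer maps the reasoning steps of one response to one score per step.
# In practice this would wrap a process reward model; here it is a placeholder.
StepScorer = Callable[[Sequence[str]], Sequence[float]]


def best_of_n(candidates: List[List[str]], score_steps: StepScorer) -> int:
    """Return the index of the candidate with the highest mean step score."""
    if not candidates:
        raise ValueError("need at least one candidate response")

    def response_score(steps: Sequence[str]) -> float:
        scores = list(score_steps(steps))
        # Assumption: a response with no steps gets the lowest possible score.
        return mean(scores) if scores else float("-inf")

    return max(range(len(candidates)), key=lambda i: response_score(candidates[i]))


def toy_scorer(steps: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for a PRM: rewards steps that mention a check."""
    return [0.9 if "check" in step else 0.4 for step in steps]


candidates = [
    ["read the chart", "guess the answer"],
    ["read the chart", "check the axis units", "compute the answer"],
]
best_index = best_of_n(candidates, toy_scorer)
print(candidates[best_index])
```

With N = 8, `candidates` would hold eight sampled responses from the policy model and `toy_scorer` would be replaced by calls to the reward model.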
@@ -426,14 +427,14 @@

ViT-22B uses the private JFT-3B dataset.

| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| ------------------- | :----: | :---: | :-----: | :---: | :--: | :--: | :-------: |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| ------------------- | :----: | :---: | :-----: | :---: | :---: | :---: | :-------: |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |

- Semantic Segmentation [\[see details\]](./segmentation#-evaluation)

@@ -449,12 +450,12 @@

- Zero-Shot Image Classification [\[see details\]](./clip_benchmark#imagenet-variants-and-objectnet)

| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| ----------------- | :---: | :--: | :--: | :---: | :-------: | :-------: |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| ----------------- | :---: | :---: | :---: | :---: | :-------: | :-------: |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |

- Multilingual Zero-Shot Image Classification [\[see details\]](./clip_benchmark#multilingual-imagenet-1k)

@@ -472,13 +473,13 @@

- Zero-Shot Video Classification

| method | #frame | K400 | K600 | K700 |
| ----------------- | :----: | :--: | :--: | :--: |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
| method | #frame | K400 | K600 | K700 |
| ----------------- | :----: | :---: | :---: | :---: |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |

</details>

@@ -699,12 +700,12 @@

- Multilingual Zero-Shot Image-Text Retrieval on XTD [\[see details\]](./clip_benchmark#xtd)

| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
| ----------------- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :-----: |
| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
| method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
| ----------------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-----: |
| AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
| OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
| InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |

</details>
