Add dataset collection #253

Merged 29 commits on Jan 3, 2025
45 changes: 31 additions & 14 deletions README.md
@@ -24,14 +24,16 @@
> ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!

## 📋 Contents
- [Introduction](#introduction)
- [News](#News)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Introduction](#-introduction)
- [News](#-news)
- [Installation](#️-installation)
- [Quick Start](#-quick-start)
- [Evaluation Backend](#evaluation-backend)
- [Custom Dataset Evaluation](#custom-dataset-evaluation)
- [Model Serving Performance Evaluation](#Model-Serving-Performance-Evaluation)
- [Arena Mode](#arena-mode)
- [Custom Dataset Evaluation](#️-custom-dataset-evaluation)
- [Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
- [Arena Mode](#-arena-mode)
- [Contribution](#️-contribution)
- [Roadmap](#-roadmap)


## 📝 Introduction
@@ -72,11 +74,15 @@ Please scan the QR code below to join our community groups:


## 🎉 News
- 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations; refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html). Also adds support for custom mixed-dataset evaluations, enabling more comprehensive model evaluation with less data; refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
- 🔥 **[2024.12.13]** Model evaluation optimization: the `--template-type` parameter no longer needs to be passed, and evaluations can be started with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details; a minimal invocation sketch appears after this news section.
- 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
- 🔥 **[2024.10.31]** The best practice for evaluating Multimodal-RAG has been updated, please check the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag) for more details.
- 🔥 **[2024.10.23]** Supports multimodal RAG evaluation, including the assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
- 🔥 **[2024.10.8]** Support for RAG evaluation, including independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html).

<details><summary>More</summary>

- 🔥 **[2024.09.18]** Our documentation has been updated to include a blog module, featuring some technical research and discussions related to evaluations. We invite you to [📖 read it](https://evalscope.readthedocs.io/en/refact_readme/blog/index.html).
- 🔥 **[2024.09.12]** Support for LongWriter evaluation, targeting generation of 10,000+ words. You can use the [LongBench-Write](evalscope/third_party/longbench_write/README.md) benchmark to measure both long-output quality and output length.
- 🔥 **[2024.08.30]** Support for custom dataset evaluations, including text datasets and multimodal image-text datasets.
@@ -88,7 +94,7 @@ Please scan the QR code below to join our community groups:
- 🔥 **[2024.06.13]** EvalScope seamlessly integrates with the fine-tuning framework SWIFT, providing full-chain support from LLM training to evaluation.
- 🔥 **[2024.06.13]** Integrated the Agent evaluation dataset ToolBench.


</details>
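
For orientation, a minimal `evalscope eval` invocation might look like the sketch below. The model ID, dataset names, and sample limit are illustrative assumptions rather than recommended settings; see the User Guide linked above for the full argument list.

```bash
# Hedged sketch: evaluate a chat model on two benchmarks with a small
# sample cap for a quick smoke test. The model ID and dataset names are
# placeholders; check `evalscope eval --help` for the supported arguments.
evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --datasets gsm8k arc \
  --limit 10
```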

## 🛠️ Installation
### Method 1: Install Using pip
@@ -278,7 +284,7 @@ EvalScope supports using third-party evaluation frameworks to initiate evaluation tasks, which we call Evaluation Backend.
- **ThirdParty**: Third-party evaluation tasks, such as [ToolBench](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) and [LongBench-Write](https://evalscope.readthedocs.io/en/latest/third_party/longwriter.html).
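
As a rough sketch, the non-native backends are typically installed as optional extras of the `evalscope` package. The extra names below are assumptions drawn from the installation docs and may differ between releases.

```bash
# Hedged sketch: install EvalScope together with an optional evaluation
# backend. The extra names are assumptions and should be verified against
# the current installation guide.
pip install 'evalscope[opencompass]'   # OpenCompass backend
pip install 'evalscope[vlmeval]'       # VLMEvalKit backend
pip install 'evalscope[rag]'           # RAGEval backend
```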


## Model Serving Performance Evaluation
## 📈 Model Serving Performance Evaluation
A stress testing tool focused on large language models, which can be customized to support various dataset formats and different API protocol formats.

Reference: Performance Testing [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html)
@@ -303,19 +309,32 @@ Speed Benchmark Results:
+---------------+-----------------+----------------+
```
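
As a rough illustration, a stress test against an OpenAI-compatible endpoint could be launched as below. The endpoint URL, model name, and numeric values are placeholders, and the exact flag names should be confirmed in the performance testing guide linked above.

```bash
# Hedged sketch: benchmark an OpenAI-compatible serving endpoint.
# Endpoint URL, model name, request count, and concurrency are placeholders.
evalscope perf \
  --url http://127.0.0.1:8000/v1/chat/completions \
  --api openai \
  --model qwen2.5 \
  --dataset openqa \
  --number 20 \
  --parallel 2
```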

## Custom Dataset Evaluation
## 🖊️ Custom Dataset Evaluation
EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/index.html).


## Arena Mode
## 🏟️ Arena Mode
Arena mode evaluates multiple candidate models through pairwise battles; the results can be judged either by the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or by human annotators to produce the final evaluation report.

Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)

## 👷‍♂️ Contribution

EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn), is continuously optimizing its benchmark evaluation features! We invite you to refer to the [Contribution Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html) to easily add your own evaluation benchmarks and share your contributions with the community. Let’s work together to support the growth of EvalScope and make our tools even better! Join us now!

<a href="https://github.com/modelscope/evalscope/graphs/contributors" target="_blank">
<table>
<tr>
<th colspan="2">
<br><img src="https://contrib.rocks/image?repo=modelscope/evalscope"><br><br>
</th>
</tr>
</table>
</a>

## TO-DO List
## 🔜 Roadmap
- [ ] Support for better evaluation report visualization
- [x] Support for mixed evaluations across multiple datasets
- [x] RAG evaluation
- [x] VLM evaluation
- [x] Agents evaluation
@@ -326,8 +345,6 @@ Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)
- [ ] GAIA
- [ ] GPQA
- [x] MBPP
- [ ] Auto-reviewer
- [ ] Qwen-max


## Star History
50 changes: 34 additions & 16 deletions README_zh.md
@@ -25,14 +25,15 @@
> ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!

## 📋 Contents
- [Introduction](#简介)
- [News](#新闻)
- [Installation](#环境准备)
- [Quick Start](#快速开始)
- [Using Other Evaluation Backends](#使用其他评测后端)
- [Custom Dataset Evaluation](#自定义数据集评测)
- [Arena Mode](#竞技场模式)
- [Performance Evaluation Tool](#推理性能评测工具)
- [Introduction](#-简介)
- [News](#-新闻)
- [Installation](#️-环境准备)
- [Quick Start](#-快速开始)
- [Other Evaluation Backends](#-其他评测后端)
- [Custom Dataset Evaluation](#-自定义数据集评测)
- [Arena Mode](#-竞技场模式)
- [Performance Evaluation Tool](#-推理性能评测工具)
- [Contribution](#️-贡献)



@@ -78,11 +79,14 @@ EvalScope is also suitable for a variety of evaluation scenarios, such as end-to-end RAG evaluation and arena mode


## 🎉 News
- 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations; refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/add_benchmark.html). Also adds support for custom mixed-dataset evaluations, enabling more comprehensive model evaluation with less data; refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/collection/index.html)
- 🔥 **[2024.12.13]** Model evaluation optimization: the `--template-type` parameter no longer needs to be passed, and evaluations can be started with `evalscope eval --args`; refer to the [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/get_started/basic_usage.html)
- 🔥 **[2024.11.26]** The model inference stress testing tool has been refactored: it supports starting a local inference service and Speed Benchmark, and asynchronous call error handling has been improved; refer to the [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/index.html)
- 🔥 **[2024.10.31]** The best practice for multimodal RAG evaluation has been published; refer to the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag)
- 🔥 **[2024.10.23]** Support for multimodal RAG evaluation, including image-text retrieval evaluation with [CLIP_Benchmark](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and an extension of [RAGAS](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
- 🔥 **[2024.10.8]** Support for RAG evaluation, including independent evaluation of embedding models and rerankers with [MTEB/CMTEB](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/mteb.html), and end-to-end evaluation with [RAGAS](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/ragas.html).
<details> <summary>More</summary>

- 🔥 **[2024.09.18]** Our documentation now includes a blog module featuring technical research and discussions related to evaluation; you are welcome to [📖 read it](https://evalscope.readthedocs.io/zh-cn/latest/blog/index.html)
- 🔥 **[2024.09.12]** Support for LongWriter evaluation, targeting generation of 10,000+ words. You can use the [LongBench-Write](evalscope/third_party/longbench_write/README.md) benchmark to measure both long-output quality and output length.
- 🔥 **[2024.08.30]** Support for custom dataset evaluation, including text datasets and multimodal image-text datasets.
@@ -93,7 +97,7 @@ EvalScope is also suitable for a variety of evaluation scenarios, such as end-to-end RAG evaluation and arena mode
- 🔥 **[2024.06.29]** Support for **OpenCompass** as a third-party evaluation framework, with a high-level wrapper that supports pip installation and simplifies evaluation task configuration.
- 🔥 **[2024.06.13]** EvalScope integrates seamlessly with the fine-tuning framework SWIFT, providing full-pipeline support from LLM training to evaluation.
- 🔥 **[2024.06.13]** Integrated the Agent evaluation dataset ToolBench.

</details>

## 🛠️ Installation
### Method 1: Install with pip
@@ -277,15 +281,15 @@ evalscope eval \
Reference: [Full Parameter Description](https://evalscope.readthedocs.io/zh-cn/latest/get_started/parameters.html)


## Other Evaluation Backends
## 🧪 Other Evaluation Backends
EvalScope supports initiating evaluation tasks with third-party evaluation frameworks, which we call Evaluation Backends. Currently supported Evaluation Backends include:
- **Native**: EvalScope's own **default evaluation framework**, supporting multiple evaluation modes, including single-model evaluation, arena mode, and baseline model comparison mode.
- [OpenCompass](https://github.com/open-compass/opencompass): Initiate OpenCompass evaluation tasks through EvalScope; lightweight, easy to customize, and seamlessly integrated with the LLM fine-tuning framework [ms-swift](https://github.com/modelscope/swift): [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/opencompass_backend.html)
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): Initiate VLMEvalKit multimodal evaluation tasks through EvalScope; supports a wide range of multimodal models and datasets, and integrates seamlessly with the LLM fine-tuning framework [ms-swift](https://github.com/modelscope/swift): [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/vlmevalkit_backend.html)
- **RAGEval**: Initiate RAG evaluation tasks through EvalScope, supporting independent evaluation of embedding models and rerankers with [MTEB/CMTEB](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation with [RAGAS](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/ragas.html): [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/rageval_backend/index.html)
- **ThirdParty**: Third-party evaluation tasks, such as [ToolBench](https://evalscope.readthedocs.io/zh-cn/latest/third_party/toolbench.html) and [LongBench-Write](https://evalscope.readthedocs.io/zh-cn/latest/third_party/longwriter.html).

## Inference Performance Evaluation Tool
## 📈 Inference Performance Evaluation Tool
A stress testing tool focused on large language models, which can be customized to support various dataset formats and different API protocol formats.

Reference: Performance Testing [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/index.html)
@@ -311,15 +315,30 @@ Speed Benchmark Results:
```


## Custom Dataset Evaluation
## 🖊️ Custom Dataset Evaluation
EvalScope supports custom dataset evaluation. For details, refer to the Custom Dataset Evaluation [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/custom_dataset/index.html)


## Arena Mode
## 🏟️ Arena Mode
Arena mode evaluates multiple candidate models through pairwise battles; the results can be judged either by the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or by human annotators to produce the final evaluation report. Reference: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/arena.html)

## 👷‍♂️ Contribution

## TO-DO List
EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn), is continuously optimizing its benchmark evaluation features! We invite you to refer to the [Contribution Guide](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/add_benchmark.html) to easily add your own evaluation benchmarks and share your contributions with the community. Let's work together to support the growth of EvalScope and make our tools even better! Join us now!

<a href="https://github.com/modelscope/evalscope/graphs/contributors" target="_blank">
<table>
<tr>
<th colspan="2">
<br><img src="https://contrib.rocks/image?repo=modelscope/evalscope"><br><br>
</th>
</tr>
</table>
</a>

## 🔜 Roadmap
- [ ] Support for better evaluation report visualization
- [x] Support for mixed evaluations across multiple datasets
- [x] RAG evaluation
- [x] VLM evaluation
- [x] Agents evaluation
@@ -330,8 +349,7 @@ EvalScope supports custom dataset evaluation; for details, refer to the Custom Dataset Evaluation [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/custom_dataset/index.html)
- [ ] GAIA
- [ ] GPQA
- [x] MBPP
- [ ] Auto-reviewer
- [ ] Qwen-max



## Star History