⏬ Data • 📃 Paper • 🏆 Leaderboard
AC-EVAL is a comprehensive evaluation suite for assessing Large Language Models (LLMs) on ancient Chinese, covering eras from the Pre-Qin period to the Qing dynasty. It comprises 3,245 multiple-choice questions spanning three difficulty levels and 13 diverse tasks. Please see our paper for more details.

Our aim is to facilitate the assessment of LLMs' capabilities in understanding and processing ancient Chinese language and knowledge.

Our leaderboard, updated regularly, reports both zero-shot and five-shot accuracies of various models under two settings: Answer-Only (AO) and Chain-of-Thought (COT).
#### Zero-shot AO

Model | General Historical Knowledge | Short Text Understanding | Long Text Understanding | Average |
---|---|---|---|---|
ERNIE-Bot 4.0 | 77.54 | 68.11 | 66.42 | 70.69 |
GLM-4 | 76.63 | 66.66 | 67.70 | 70.33 |
Qwen-max | 70.77 | 64.88 | 63.84 | 67.50 |
GLM-3-Turbo | 75.21 | 60.52 | 59.77 | 65.17 |
Qwen-72B-Chat | 71.25 | 61.48 | 59.80 | 64.18 |
Yi-34B-Chat | 72.66 | 61.33 | 58.36 | 64.12 |
Qwen-14B-Chat | 69.51 | 56.53 | 57.38 | 61.14 |
GPT-4 | 66.11 | 55.11 | 47.38 | 56.20 |
ERNIE-Bot | 57.80 | 51.81 | 51.47 | 53.69 |
Qwen-7B-Chat | 62.74 | 48.76 | 44.97 | 52.16 |
Yi-6B-Chat | 60.70 | 47.79 | 39.49 | 51.33 |
Baichuan2-7B-Chat | 64.38 | 46.77 | 40.33 | 50.49 |
Baichuan2-13B-Chat | 65.57 | 49.24 | 35.40 | 50.07 |
ChatGLM3-6B | 58.04 | 43.01 | 39.73 | 46.93 |
Xunzi-Qwen-Chat | 60.20 | 44.31 | 30.87 | 45.13 |
GPT-3.5 Turbo | 53.50 | 43.72 | 36.94 | 44.72 |
LLaMA2-70B | 33.55 | 36.29 | 30.72 | 33.54 |
#### Five-shot AO

Model | General Historical Knowledge | Short Text Understanding | Long Text Understanding | Average |
---|---|---|---|---|
ERNIE-Bot 4.0 | 75.69 | 69.59 | 66.12 | 70.47 |
GLM-4 | 74.89 | 65.48 | 69.07 | 69.81 |
Qwen-max | 75.29 | 65.48 | 66.99 | 69.25 |
GLM-3-Turbo | 72.99 | 59.48 | 59.66 | 64.04 |
Qwen-72B-Chat | 71.67 | 61.30 | 57.07 | 63.35 |
ERNIE-Bot | 68.81 | 57.62 | 50.36 | 58.93 |
GPT-4 | 65.91 | 58.07 | 48.36 | 57.45 |
Qwen-14B-Chat | 70.60 | 53.73 | 45.91 | 56.75 |
Yi-34B-Chat | 66.62 | 52.57 | 41.90 | 53.70 |
Baichuan2-7B-Chat | 63.37 | 45.91 | 39.94 | 49.74 |
Baichuan2-13B-Chat | 63.75 | 45.86 | 32.74 | 47.45 |
Qwen-7B-Chat | 61.42 | 45.98 | 30.78 | 46.06 |
ChatGLM3-6B | 55.74 | 42.92 | 38.45 | 45.71 |
GPT-3.5 Turbo | 53.99 | 43.21 | 36.40 | 44.54 |
Xunzi-Qwen-Chat | 51.30 | 41.25 | 29.84 | 40.80 |
Yi-6B-Chat | 55.76 | 35.97 | 28.48 | 40.07 |
#### Zero-shot COT

Model | General Historical Knowledge | Short Text Understanding | Long Text Understanding | Average |
---|---|---|---|---|
Qwen-max | 75.10 | 66.72 | 61.03 | 67.62 |
Qwen-72B-Chat | 74.79 | 65.25 | 56.78 | 65.61 |
Qwen-14B-Chat | 67.51 | 54.64 | 46.12 | 56.09 |
Qwen-7B-Chat | 61.54 | 44.97 | 40.21 | 48.91 |
#### Five-shot COT

Model | General Historical Knowledge | Short Text Understanding | Long Text Understanding | Average |
---|---|---|---|---|
Qwen-max | 74.30 | 65.94 | 61.46 | 67.23 |
Qwen-72B-Chat | 71.79 | 61.62 | 57.66 | 63.69 |
Qwen-14B-Chat | 67.49 | 51.51 | 39.93 | 52.97 |
Qwen-7B-Chat | 59.37 | 47.71 | 35.36 | 47.48 |
The dev set is available in the data directory. For access to the test set, please contact us by email (yuting_wei@bupt.edu.cn). We plan to incorporate the dataset into Hugging Face Datasets in the future.
Download and unzip the data, then load it with pandas:
import os
import pandas as pd

File_Dir = "data"
# Load the dev split for one subject. The filename pattern
# "<subject>_dev.xlsx" is an assumption; adjust it to the actual files in data/dev.
subject = "art_and_cultural_heritage"
dev_df = pd.read_excel(os.path.join(File_Dir, "dev", subject + "_dev.xlsx"))
To facilitate usage, we have organized the supercategory labels and English/Chinese names for the 13 subjects; please refer to subject_mapping.json for details. The format is:
{
"art_and_cultural_heritage": {
"English": "Art and Cultural Heritage",
"Chinese": "艺术和文化遗产",
"Supercategory": "General Historical Knowledge"
},
...
"filename":{
"English": English Name,
"Chinese": Chinese Name,
"Supercatagory": Supercatagory Label (General Historical Knowledge/Short Text Understanding/Long Text Understanding)"
}
}
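For example, the mapping can be used to group per-subject results into the three supercategories shown on the leaderboard. Below is a minimal sketch, assuming subject_mapping.json sits in the repository root and follows the format above:

```python
import json
from collections import defaultdict

# Load the subject metadata shipped with the repo.
with open("subject_mapping.json", encoding="utf-8") as f:
    subject_mapping = json.load(f)

# Group the 13 subjects under their supercategory labels.
subjects_by_super = defaultdict(list)
for filename, info in subject_mapping.items():
    subjects_by_super[info["Supercategory"]].append(info["English"])

for supercategory, names in subjects_by_super.items():
    print(f"{supercategory}: {', '.join(names)}")
```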
Each subject has two splits: dev and test. The dev set for each subject contains five exemplars with explanations for few-shot evaluation; the test set is used for model evaluation. Labels for the test split are not released; users must submit their results to obtain test accuracy automatically. How to submit?
Below is a dev example from art and cultural heritage:
Question | A | B | C | D | Answer | Explanation |
---|---|---|---|---|---|---|
五代南唐时期著名画家顾闳中的绘画名作是?(The famous painting masterpiece of Gu Hongzhong, a renowned painter in the Southern Tang Dynasty during the Five Dynasties, is?) | 《女史箴图》(Admonitions of the Instructress to the Court Ladies) | 《五牛图》(Five Buffaloes) | 《簪花仕女图》(Ladies with Flowers) | 《韩熙载夜宴图》(Han Xizai Giving a Night Banquet) | D | 让我们逐步分析。顾闳中的绘画名作是《韩熙载夜宴图》。《五牛图》是韩滉的作品,《簪花仕女图》是周昉的作品,《女史箴图》是顾恺之的作品。 (Let's analyze step by step. The famous painting by Gu Hongzhong is 'Han Xizai Giving a Night Banquet.' 'Five Buffaloes' is a work by Han Huang, 'Ladies with Flowers' is by Zhou Fang, and 'Admonitions of the Instructress to the Court Ladies' is by Gu Kaizhi.) |
We implement automatic answer extraction using regular expressions; the evaluation code for each model is located in the src directory.
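The exact patterns are model-specific and live in src; the snippet below is only a simplified sketch of the idea, assuming the reply states the answer as 答案是A (or similar) or contains a bare option letter:

```python
import re

def extract_answer(response):
    """Extract the chosen option letter (A-D) from a model response.

    Simplified sketch; the patterns in src are more thorough.
    """
    # Prefer an explicit statement such as "答案是A" / "答案:A".
    match = re.search(r"答案[是为::\s]*([ABCD])", response)
    if match:
        return match.group(1)
    # Fall back to the first standalone option letter, e.g. "D. ……".
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None

print(extract_answer("让我们逐步分析。……所以答案是D。"))  # D
```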
We use the following prompts when evaluating the models. In the templates, {主题} is the subject name, {测试题目} is the test question, {选项A}–{选项D} are the four options, 示例1:{题目1} reads "Example 1: {question 1}", 答案: reads "Answer:", 答案解析: reads "Answer analysis:", 让我们逐步分析。{解析过程} reads "Let's analyze step by step. {analysis}", and 所以答案是A。 reads "So the answer is A."
#### Zero-shot AO

以下是中国古代{主题}领域的单项选择题,请直接给出正确答案对应的选项字母。
(The following is a multiple-choice question in the field of ancient Chinese {subject}. Please directly give the option letter of the correct answer.)
{测试题目}
A. {选项A}
B. {选项B}
C. {选项C}
D. {选项D}
答案:
#### Few-shot AO

以下是中国古代{主题}领域的单项选择题示例。在查看这些示例之后,请直接给出接下来一道题目的正确答案所对应的选项字母。
(The following are example multiple-choice questions in the field of ancient Chinese {subject}. After reviewing these examples, please directly give the option letter of the correct answer to the next question.)
示例1:{题目1}
A. {选项A}
B. {选项B}
C. {选项C}
D. {选项D}
答案:A
[k-shot demo, note that k is 0 in the zero-shot case]
{测试题目}
A. {选项A}
B. {选项B}
C. {选项C}
D. {选项D}
答案:
#### Zero-shot COT

以下是中国古代{主题}领域的单项选择题,请逐步分析并给出正确答案对应的选项。
(The following is a multiple-choice question in the field of ancient Chinese {subject}. Please analyze it step by step and give the option corresponding to the correct answer.)
{测试题目}
A. {选项A}
B. {选项B}
C. {选项C}
D. {选项D}
答案:
#### Few-shot COT

以下是中国古代{主题}领域的单项选择题示例。在查看这些示例之后,请逐步分析接下来一道题目并给出正确答案所对应的选项字母。
(The following are example multiple-choice questions in the field of ancient Chinese {subject}. After reviewing these examples, please analyze the next question step by step and give the option letter of the correct answer.)
示例1:{题目1}
A. {选项A}
B. {选项B}
C. {选项C}
D. {选项D}
答案解析:
让我们逐步分析。{解析过程}
所以答案是A。
[k-shot demo, note that k is 0 in the zero-shot case]
{测试题目}
A. {选项A}
B. {选项B}
C. {选项C}
D. {选项D}
答案:
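To make the templates concrete, the sketch below assembles a few-shot AO prompt from a subject's dev exemplars. It is a minimal illustration rather than the code in src, and the dev column names ("question", "A"–"D", "answer") are assumptions based on the dev example shown earlier:

```python
FEW_SHOT_AO_HEADER = (
    "以下是中国古代{subject}领域的单项选择题示例。"
    "在查看这些示例之后,请直接给出接下来一道题目的正确答案所对应的选项字母。"
)

def build_few_shot_ao_prompt(subject, dev_df, test_question, test_options, k=5):
    # Header with the subject name filled into the {主题} slot.
    parts = [FEW_SHOT_AO_HEADER.format(subject=subject)]
    # k dev exemplars, each followed by its gold answer letter.
    for i, (_, row) in enumerate(dev_df.head(k).iterrows()):
        parts.append(f"示例{i + 1}:{row['question']}")
        parts.extend(f"{letter}. {row[letter]}" for letter in "ABCD")
        parts.append(f"答案:{row['answer']}")
    # Finally, the test question whose answer the model must give.
    parts.append(test_question)
    parts.extend(f"{letter}. {opt}" for letter, opt in zip("ABCD", test_options))
    parts.append("答案:")
    return "\n".join(parts)
```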
First, prepare a UTF-8 encoded JSON file in the following format; please refer to submission_example.json for details.
{
"historical_facts": {
"0": "A",
"1": "B",
"2": "B",
...
},
"subject_name":{
"0":"ans_0",
"1":"ans_1",
...
}
....
}
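The snippet below shows one way to produce such a file. It is a minimal sketch; the predictions dict (subject filename mapped to answer letters in test-set order) is a hypothetical intermediate, not part of the repo:

```python
import json

# Hypothetical per-subject predictions, keyed by subject filename.
predictions = {
    "historical_facts": ["A", "B", "B"],
    # ... one entry per subject ...
}

# Convert to the required {subject: {"0": ans, "1": ans, ...}} layout.
submission = {
    subject: {str(i): ans for i, ans in enumerate(answers)}
    for subject, answers in predictions.items()
}

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False, indent=2)
```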
Then, submit the prepared JSON file to yuting_wei@bupt.edu.cn. Please indicate the type of experiment you conducted in the subject line of your email, using one of the following labels: [zero-shot-AO, few-shot-AO, zero-shot-COT, few-shot-COT].
- add evaluation code into src
- add breakdown results
- incorporate into Hugging Face datasets
This work is licensed under the MIT License.
The AC-EVAL dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Please cite our paper if you use our dataset.
@misc{wei2024aceval,
title={AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large Language Models},
author={Yuting Wei and Yuanxing Xu and Xinru Wei and Simin Yang and Yangfu Zhu and Yuqing Li and Di Liu and Bin Wu},
year={2024},
eprint={2403.06574},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This project was inspired by and based on the structure of C-Eval. We are grateful for their work and would like to acknowledge their significant contributions to the community.