CIF-Bench

This is the official repo for the paper CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models. The project page has a more user-friendly UI for reading.

Results

Private Split

In the private split, there 5 instructions used for each task, hence derive $5 \times 150 \times 50=37500$ data instances for each model in evaluation.

Model Name	Overall	Chinese Culture	Classification	Code	Commonsense	Creative NLG	Evaluation	Grammar	Linguistic	Motion Detection	NER	NLI	QA	Reasoning	Role Playing	Sentiment	Structured Data	Style Transfer	Summarization	Toxic	Translation
Baichuan2-13B-Chat	.529	.520	.674	.333	.641	.497	.686	.542	.528	.578	.563	.632	.569	.515	.752	.624	.459	.462	.332	.441	.273
Qwen-72B-Chat	.519	.486	.630	.296	.634	.508	.634	.458	.520	.494	.550	.626	.565	.528	.762	.613	.496	.459	.282	.608	.271
Yi-34B-Chat	.512	.483	.606	.347	.623	.497	.598	.480	.490	.575	.525	.619	.554	.494	.757	.580	.472	.439	.346	.514	.259
Qwen-14B-Chat	.500	.481	.582	.307	.614	.494	.645	.428	.475	.496	.513	.616	.548	.507	.764	.583	.469	.453	.283	.575	.262
Deepseek-Llm-67B-Chat	.471	.467	.571	.259	.577	.486	.549	.442	.476	.475	.509	.566	.496	.439	.711	.546	.409	.436	.262	.570	.235
Baichuan-13B-Chat	.450	.408	.491	.286	.552	.439	.670	.417	.422	.482	.486	.565	.505	.377	.704	.552	.387	.402	.350	.431	.304
Chatglm3-6B	.436	.381	.439	.330	.541	.452	.577	.310	.358	.436	.453	.544	.503	.414	.762	.560	.446	.402	.321	.391	.270
Yi-6B-Chat	.417	.402	.454	.313	.523	.425	.506	.383	.383	.487	.396	.523	.457	.369	.754	.482	.401	.380	.310	.455	.227
Baichuan2-7B-Chat	.412	.437	.647	.160	.520	.402	.580	.511	.444	.455	.407	.489	.395	.406	.670	.517	.342	.298	.101	.463	.138
Chatglm2-6B	.352	.278	.469	.346	.403	.424	.535	.274	.397	.406	.240	.397	.352	.326	.714	.438	.298	.313	.320	.461	.190
Chatglm-6B-Sft	.349	.265	.454	.365	.385	.462	.554	.296	.379	.427	.232	.380	.321	.292	.718	.415	.296	.333	.351	.441	.190
Chinese-Llama2-Linly-13B	.344	.250	.462	.311	.399	.429	.557	.273	.358	.385	.268	.390	.330	.313	.653	.433	.279	.332	.292	.457	.181
Gpt-3.5-Turbo-Sft	.343	.269	.427	.298	.389	.395	.575	.325	.365	.389	.226	.382	.394	.345	.710	.433	.324	.266	.290	.397	.225
Chinese-Alpaca-2-13B	.341	.242	.421	.356	.382	.442	.602	.256	.363	.430	.210	.376	.334	.317	.714	.459	.299	.316	.308	.452	.200
Chinese-Alpaca-13B	.334	.250	.399	.348	.364	.435	.616	.275	.349	.421	.223	.370	.309	.319	.724	.426	.285	.307	.298	.445	.181
Chinese-Alpaca-7B	.334	.216	.412	.378	.381	.425	.576	.265	.359	.393	.243	.383	.326	.295	.710	.409	.301	.327	.325	.405	.186
Chinese-Llama2-Linly-7B	.333	.218	.451	.330	.396	.427	.583	.248	.350	.410	.231	.367	.345	.276	.698	.433	.259	.315	.310	.469	.168
Tigerbot-13B-Chat	.331	.205	.397	.309	.385	.420	.614	.310	.379	.341	.276	.363	.329	.301	.694	.419	.280	.310	.283	.393	.186
Telechat-7B	.329	.267	.338	.321	.420	.404	.420	.272	.265	.327	.320	.388	.355	.244	.672	.344	.334	.335	.299	.364	.184
Ziya-Llama-13B	.329	.196	.402	.324	.341	.428	.616	.312	.349	.400	.228	.351	.279	.313	.721	.468	.311	.291	.278	.431	.175
Chinese-Alpaca-33B	.326	.234	.370	.372	.364	.429	.614	.246	.318	.377	.221	.368	.300	.314	.713	.428	.288	.303	.295	.401	.199
Tigerbot-7B-Chat	.325	.218	.395	.306	.370	.413	.631	.294	.370	.368	.215	.355	.313	.292	.713	.415	.283	.315	.290	.389	.171
Chinese-Alpaca-2-7B	.323	.215	.374	.335	.366	.415	.546	.257	.326	.395	.215	.375	.318	.289	.698	.417	.285	.303	.312	.439	.193
Aquilachat-7B	.309	.162	.234	.291	.320	.437	.344	.135	.266	.309	.287	.337	.342	.236	.609	.255	.249	.400	.527	.430	.306
Moss-Moon-003-Sft	.302	.214	.405	.274	.347	.380	.448	.305	.341	.378	.232	.317	.321	.267	.694	.375	.251	.259	.288	.424	.152
Qwen-7B-Chat	.301	.211	.410	.289	.349	.391	.531	.219	.387	.404	.208	.325	.297	.278	.681	.419	.266	.251	.248	.371	.157
Belle-13B-Sft	.264	.198	.307	.285	.316	.349	.409	.237	.305	.222	.177	.317	.284	.242	.631	.299	.244	.222	.234	.296	.133
Cpm-Bee-10B	.244	.234	.377	.024	.278	.311	.255	.302	.278	.327	.148	.286	.224	.147	.603	.277	.117	.263	.220	.352	.125

Public Split

In the public split, there is only one instruction used for each task, hence derive $1 \times 150 \times 50=7500$ data instances for each model in evaluation.

Model Name	Overall	Chinese Culture	Classification	Code	Commonsense	Creative NLG	Evaluation	Grammar	Linguistic	Motion Detection	NER	NLI	QA	Reasoning	Role Playing	Sentiment	Structured Data	Style Transfer	Summarization	Toxic	Translation
Qwen-72B-Chat	.589	.512	.716	.444	.706	.587	.661	.424	.521	.694	.515	.695	.668	.539	.752	.637	.505	.587	.609	.671	.466
Qwen-14B-Chat	.564	.481	.678	.416	.657	.567	.669	.396	.485	.663	.486	.647	.609	.498	.757	.638	.460	.610	.629	.691	.467
Deepseek-LLM-67B-Chat	.526	.477	.617	.364	.609	.559	.573	.374	.458	.631	.493	.588	.624	.444	.694	.592	.384	.576	.594	.666	.439
gpt-3.5-Public-SFT	.522	.316	.611	.492	.578	.538	.639	.377	.447	.580	.492	.587	.565	.498	.745	.583	.444	.501	.620	.643	.452
Yi-34B-Chat	.516	.452	.607	.437	.624	.516	.545	.254	.382	.671	.398	.631	.592	.460	.761	.566	.440	.551	.610	.608	.408
Baichuan2-13B-Chat	.512	.446	.623	.403	.600	.505	.582	.352	.423	.633	.435	.600	.591	.474	.751	.597	.434	.525	.572	.494	.372
Tigerbot-13B-Chat	.494	.350	.558	.447	.599	.528	.707	.352	.447	.551	.498	.571	.569	.413	.732	.560	.365	.502	.607	.601	.306
Chinese-Alpaca-2-13B	.492	.260	.572	.434	.533	.562	.574	.318	.417	.624	.467	.566	.545	.420	.712	.595	.382	.488	.641	.740	.347
Chinese-Alpaca-33B	.484	.274	.546	.470	.527	.540	.703	.332	.382	.582	.464	.550	.506	.423	.732	.548	.342	.494	.629	.648	.334
Ziya-Llama-13B	.479	.287	.550	.422	.523	.551	.650	.294	.384	.610	.437	.546	.499	.404	.749	.582	.367	.499	.629	.722	.313
Chinese-Llama2-Linly-13B	.479	.286	.623	.439	.549	.535	.626	.286	.403	.587	.468	.563	.524	.411	.676	.561	.359	.482	.602	.696	.313
Tigerbot-7B-Chat	.478	.354	.528	.440	.570	.540	.708	.314	.430	.528	.413	.532	.554	.393	.731	.583	.351	.519	.630	.614	.291
ChatGLM3-6B	.472	.321	.488	.436	.527	.503	.588	.290	.328	.574	.415	.557	.526	.397	.749	.612	.431	.529	.620	.589	.392
Chinese-Alpaca-13B	.471	.264	.553	.443	.495	.525	.587	.334	.394	.653	.457	.524	.513	.402	.726	.526	.323	.486	.628	.702	.336
ChatGLM2-6B	.464	.334	.532	.436	.522	.527	.651	.314	.395	.536	.402	.520	.533	.407	.725	.506	.363	.480	.627	.661	.303
Chinese-Alpaca-7B	.452	.237	.536	.438	.484	.502	.672	.318	.389	.652	.394	.504	.501	.351	.699	.543	.365	.478	.623	.711	.328
Chinese-Alpaca-2-7B	.448	.251	.472	.435	.480	.532	.577	.268	.348	.596	.431	.509	.493	.344	.703	.510	.334	.483	.637	.596	.343
Chinese-Llama2-Linly-7B	.443	.264	.558	.419	.497	.522	.664	.236	.381	.593	.381	.496	.546	.350	.713	.559	.323	.495	.603	.584	.293
Qwen-7B-Chat	.442	.313	.549	.404	.520	.515	.646	.244	.411	.570	.368	.489	.514	.384	.713	.563	.328	.463	.576	.639	.281
ChatGLM-6B	.440	.311	.499	.446	.484	.548	.558	.278	.382	.484	.386	.480	.483	.353	.738	.460	.346	.480	.633	.543	.322
Baichuan-13B-Chat	.426	.355	.416	.361	.516	.416	.564	.324	.374	.380	.394	.531	.584	.339	.668	.478	.402	.459	.559	.497	.392
Yi-6B-Chat	.420	.320	.439	.395	.489	.449	.493	.230	.293	.587	.341	.496	.516	.344	.742	.488	.348	.498	.627	.510	.285
CPM-Bee-10B	.415	.382	.455	.284	.431	.508	.300	.317	.367	.494	.397	.451	.472	.304	.647	.329	.284	.538	.534	.486	.305
Moss-Moon-003-SFT	.399	.233	.465	.389	.427	.482	.509	.274	.369	.526	.385	.403	.457	.325	.712	.450	.304	.435	.594	.542	.308
Belle-SFT-Public	.397	.196	.503	.376	.426	.472	.543	.269	.371	.512	.356	.450	.430	.338	.645	.426	.300	.398	.558	.683	.224
Telechat-7B	.350	.172	.299	.438	.386	.456	.400	.138	.202	.412	.322	.375	.414	.261	.660	.341	.320	.462	.639	.494	.304
Aquilachat-7B	.350	.203	.270	.357	.404	.449	.394	.090	.260	.348	.322	.385	.426	.274	.595	.308	.267	.434	.607	.409	.355
Baichuan2-7B-Chat	.339	.345	.595	.154	.455	.327	.523	.362	.354	.466	.233	.414	.349	.339	.673	.429	.300	.246	.097	.357	.130

TODOs

Inference code for demo model.
Evaluation code and prompts.
Public split data.

Others

Contacts for any discussions can be made to Yizhi Li and Ge Zhang

@article{li2024cifbench,
      title={CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models}, 
      author={Yizhi LI and Ge Zhang and Xingwei Qu and Jiali Li and Zhaoqun Li and Zekun Wang and Hao Li and Ruibin Yuan and Yinghao Ma and Kai Zhang and Wangchunshu Zhou and Yiming Liang and Lei Zhang and Lei Ma and Jiajun Zhang and Zuowen Li and Stephen W. Huang and Chenghua Lin and Wenhu Chen and Jie Fu},
      year={2024},
      eprint={2402.13109},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
c_static		c_static
dataset		dataset
static		static
CITATION.bib		CITATION.bib
README.md		README.md
csv_to_html.py		csv_to_html.py
index.html		index.html
leaderboard.csv		leaderboard.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

CIF-Bench

Results

Private Split

Public Split

TODOs

Others

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

yizhilll/CIF-Bench

Folders and files

Latest commit

History

Repository files navigation

CIF-Bench

Results

Private Split

Public Split

TODOs

Others

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages