pfnet-research/pfgen-bench

Preferred Generation Benchmark

pfgen-benchmark is a benchmark for evaluating Japanese text generation, designed specifically for pretrained models. Unlike conventional benchmarks, which use templates containing instructions, this benchmark provides only a large number of examples. The examples alone convey that the task is question answering, that responses should be about 100 characters long, and that outputs should resemble formal public documents, which minimizes the influence of differences in instructions or templates. Outputs are evaluated with n-gram-based methods, so evaluation is quick, cost-effective, and deterministic, unlike the LLM-as-a-judge approach.
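
The two ideas above, prompting by examples only and scoring by n-gram overlap, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Q:`/`A:` layout, the function names, and the character-trigram recall score are hypothetical and are not the benchmark's actual prompt format or metric.

```python
from collections import Counter


def build_few_shot_prompt(examples, question):
    """Join question/answer examples and end with the new question.

    No instruction text is added; the examples alone signal the task.
    The "Q:"/"A:" layout is a hypothetical format chosen for this sketch.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def char_ngrams(text, n):
    """Count the character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_overlap(candidate, reference, n=3):
    """Fraction of the reference's n-grams that also occur in the candidate.

    This is a simple recall-style score, not the benchmark's own formula.
    """
    ref = char_ngrams(reference, n)
    cand = char_ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total


if __name__ == "__main__":
    # Hypothetical example pairs, used only to show the prompt shape.
    examples = [("What is 1 + 1?", "It is 2."), ("What color is the sky?", "It is blue.")]
    print(build_few_shot_prompt(examples, "What is the capital of Japan?"))
    print(ngram_overlap("Tokyo is the capital.", "The capital is Tokyo."))
```

Because the score depends only on the two strings, running it twice on the same output always gives the same number, which is the property that makes n-gram evaluation deterministic.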

To enable comparisons across as many models as possible, the leaderboard actively includes a wide range of models: openly accessible models, models mentioned in academic papers, and models announced by companies through press releases. Contributions of model outputs are encouraged, and results can be submitted via pull requests. For detailed instructions, refer to the "How to contribute" section.

See more details: TBD (arxiv)

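pfgen-benchmark is a benchmark that evaluates generated Japanese text and is designed for pretrained models. Ordinary benchmarks use templates that contain instructions, but this benchmark provides only a large number of examples. By conveying solely through examples that the task is question answering, that answers should be about 100 characters long, and that the output should be close to official-document style, it reduces the effect of differences in instructions and templates. Output text is evaluated with an n-gram-based method, which, unlike LLM as a Judge methods, makes evaluation fast, low-cost, and deterministic.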

See more details: Jxiv preprint

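So that as many models as possible can be compared on the same axis, the leaderboard actively lists a large number of models. If you think a model is worth comparing, such as an openly accessible model, a model mentioned in a paper, or a model a company has announced in a press release, please add its output via a pull request. See "How to contribute" for how to add it.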

License of LLM output

Everything in this repository except the LLM outputs is licensed under the Apache License, Version 2.0. Each LLM output is subject to the license of the model that produced it.

How to evaluate model

You can evaluate a model with run-hf.py (which uses transformers) or run-vllm.py (which uses vLLM). For the full list of parameters, run either script with --help. The --num-trials parameter sets the number of patterns for which the model generates answers; choose it by weighing execution time against the accuracy you need.

# Run a model with the Hugging Face transformers library (run-vllm.py uses vLLM instead).
python ./run-hf.py --model=pfnet/plamo-13b --num-trials=5

# Evaluate output and update leaderboard.
make

How to contribute

Follow the instructions in the "How to evaluate model" section to run the evaluation. This generates config.json and trials.jsonl.xz under the result directory. Please create a pull request containing only these two files.
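
Before opening a pull request, it can help to confirm that both files load. The sketch below assumes only what the file extensions imply: config.json is a JSON document and trials.jsonl.xz is xz-compressed JSON Lines. The function names and the directory argument are hypothetical, and no particular record schema is assumed.

```python
import json
import lzma
from pathlib import Path


def read_jsonl_xz(path):
    """Yield one parsed JSON value per non-empty line of an xz-compressed file."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_result_dir(result_dir):
    """Load config.json and count the records in trials.jsonl.xz.

    result_dir is whichever directory holds the two generated files;
    its exact location under the result directory is not assumed here.
    """
    result_dir = Path(result_dir)
    config = json.loads((result_dir / "config.json").read_text(encoding="utf-8"))
    num_records = sum(1 for _ in read_jsonl_xz(result_dir / "trials.jsonl.xz"))
    return config, num_records
```

Calling `summarize_result_dir` on the generated directory raises an exception if either file is missing or malformed, which makes it a quick sanity check.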

To rank models more accurately, set --num-trials as high as is practical, up to the limit of 100 trials.
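
The leaderboard reports each score in the form (±0.0083/√10). Assuming the first number is the standard deviation across trials and the number under the root is the trial count (an interpretation, not something stated here), the uncertainty of the mean score shrinks with the square root of --num-trials. A minimal sketch of that arithmetic, using a hypothetical standard deviation of 0.02:

```python
import math


def standard_error(std_dev, num_trials):
    """Standard error of the mean: std_dev / sqrt(num_trials)."""
    if num_trials < 1:
        raise ValueError("num_trials must be at least 1")
    return std_dev / math.sqrt(num_trials)


if __name__ == "__main__":
    # 0.02 is a hypothetical per-trial standard deviation, not a measured value.
    for n in (1, 10, 100):
        print(f"--num-trials={n:3d}: standard error = {standard_error(0.02, n):.4f}")
```

Under this assumption, going from 10 to 100 trials reduces the uncertainty by a factor of about 3.2 (√10) at roughly ten times the run time, which is the trade-off to weigh when choosing --num-trials.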

Leaderboard

Rank Score Model Length Fluency Truthfulness Helpfulness
N/A 1.0501 (±0.0000/√1) 👑 system/ground-truth 100.0 (±0.0) 1.155 0.996 1.000
1 0.9303 (±0.0083/√10) 💬 anthropic/claude-3-5-sonnet-20240620 102.2 (±10.4) 0.949 0.959 0.883
2 0.9144 (±0.0037/√2) 💬 deepseek-ai/DeepSeek-V3 87.4 (±14.9) 0.960 0.983 0.800
3 0.8615 (±0.0092/√10) 💬 openai/gpt-4o 84.5 (±18.6) 0.919 0.980 0.686
N/A 0.8494 (±0.0253/√1000) 🎯 system/criteria 100.0 (±3.4) 0.936 0.978 0.505
4 0.8270 (±0.0229/√10) 💬 anthropic/claude-3-opus-20240229 102.3 (±9.5) 0.911 0.944 0.627
5 0.8059 (±0.0169/√5) 💬 google/gemini-2.0-flash-exp 68.0 (±17.7) 0.834 0.984 0.600
6 0.8036 (±0.0133/√10) 💬 openai/gpt-4-turbo 86.5 (±17.4) 0.820 0.959 0.632
7 0.7916 (±0.0146/√10) 💬 openai/gpt-4 107.2 (±11.6) 0.888 0.951 0.536
8 0.7827 (±0.0129/√100) 💬 Qwen/Qwen2.5-72B-Instruct 98.7 (±14.8) 0.871 0.936 0.540
9 0.7789 (±0.0213/√100) 🟢 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 109.1 (±36.8) 0.890 0.941 0.506
10 0.7782 (±0.0154/√100) 💬 Qwen/Qwen2.5-72B-Instruct 96.5 (±17.8) 0.847 0.939 0.549
11 0.7773 (±0.0168/√100) 💬 pfnet/plamo-1.0-prime 178.2 (±114.5) 0.874 0.942 0.516
12 0.7768 (±0.0113/√5) 💬 mlx-community/Qwen2.5-72B-Instruct-4bit 100.8 (±17.7) 0.860 0.933 0.538
13 0.7766 (±0.0276/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-hf 104.1 (±17.9) 0.884 0.938 0.507
14 0.7756 (±0.0264/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-instruc... 104.1 (±18.5) 0.878 0.938 0.510
15 0.7748 (±0.0000/√1) 💬 openai/chatgpt-o1 76.3 (±17.7) 0.755 0.960 0.610
16 0.7650 (±0.0263/√100) 🟢 tokyotech-llm/Swallow-70b-instruct-hf 102.5 (±14.4) 0.872 0.929 0.494
17 0.7643 (±0.0000/√1) 💬 openai/chatgpt-o1-pro 79.5 (±17.3) 0.748 0.955 0.590
18 0.7628 (±0.0275/√100) 🟢 tokyotech-llm/Swallow-70b-hf 103.5 (±16.1) 0.876 0.930 0.483
19 0.7601 (±0.0289/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-v0.1 106.3 (±21.0) 0.864 0.925 0.492
20 0.7538 (±0.0251/√100) 🟢 turing-motors/Llama-3-heron-brain-70B... 101.1 (±16.9) 0.857 0.925 0.479
21 0.7501 (±0.0237/√100) 💬 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 181.0 (±87.4) 0.847 0.923 0.480
22 0.7469 (±0.0270/√100) 🟢 pfnet/plamo-100b-base 115.2 (±64.0) 0.861 0.920 0.460
23 0.7444 (±0.0260/√100) 🟢 sbintuitions/sarashina2-70b 120.0 (±49.4) 0.825 0.923 0.485
24 0.7423 (±0.0302/√100) 💬 cyberagent/Llama-3.1-70B-Japanese-Ins... 199.2 (±110.3) 0.817 0.905 0.505
25 0.7392 (±0.0232/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-I... 93.6 (±23.5) 0.847 0.941 0.429
26 0.7370 (±0.0217/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-I... 97.5 (±19.8) 0.846 0.932 0.433
27 0.7365 (±0.0218/√100) 🟢 CohereForAI/c4ai-command-r-plus 107.5 (±42.3) 0.818 0.913 0.478
28 0.7336 (±0.0254/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-v0.1 108.2 (±24.7) 0.837 0.908 0.456
29 0.7320 (±0.0201/√10) 💬 anthropic/claude-3-sonnet-20240229 114.3 (±18.9) 0.810 0.910 0.476
30 0.7317 (±0.0101/√100) 💬 microsoft/phi-4 111.7 (±29.4) 0.833 0.913 0.449
31 0.7261 (±0.0169/√100) 💬 microsoft/phi-4 107.6 (±27.9) 0.829 0.922 0.426
32 0.7249 (±0.0247/√100) 💬 cyberagent/calm3-22b-chat 136.8 (±46.7) 0.813 0.907 0.455
33 0.7246 (±0.0250/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... 89.8 (±33.9) 0.812 0.940 0.422
34 0.7242 (±0.0156/√100) 🟢 microsoft/phi-4 102.5 (±12.7) 0.864 0.924 0.385
35 0.7217 (±0.0219/√100) 🟢 cyberagent/calm3-22b-chat 105.0 (±13.1) 0.824 0.916 0.425
36 0.7194 (±0.0321/√10) 💬 google/text-bison 77.6 (±31.9) 0.790 0.968 0.401
37 0.7185 (±0.0000/√1) 💬 elyza/Llama-3-ELYZA-JP-70B 98.6 (±33.8) 0.837 0.931 0.388
38 0.7175 (±0.0257/√100) 🟢 nvidia/nemotron-4-340b-instruct 107.3 (±28.4) 0.816 0.908 0.429
39 0.7084 (±0.0207/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 95.9 (±19.7) 0.835 0.930 0.360
40 0.7046 (±0.0248/√100) 💬 nvidia/nemotron-4-340b-instruct 94.5 (±39.1) 0.768 0.910 0.435
41 0.7024 (±0.0238/√100) 🟢 rinna/nekomata-14b 104.3 (±18.0) 0.812 0.912 0.383
42 0.7023 (±0.0271/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-v0.2 112.6 (±33.2) 0.818 0.901 0.388
43 0.7008 (±0.0318/√100) 🟢 tokyotech-llm/Swallow-13b-instruct-hf 104.5 (±13.0) 0.812 0.898 0.392
44 0.6990 (±0.0288/√100) 🟢 tokyotech-llm/Swallow-13b-NVE-hf 106.2 (±19.2) 0.820 0.906 0.371
45 0.6980 (±0.0252/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 98.7 (±50.0) 0.798 0.927 0.369
46 0.6958 (±0.0236/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 92.9 (±20.0) 0.814 0.931 0.343
47 0.6945 (±0.0300/√100) 🟢 sbintuitions/sarashina2-13b 107.8 (±28.3) 0.794 0.900 0.390
48 0.6938 (±0.0217/√100) 🟢 weblab-GENIAC/Tanuki-8B-dpo-v1.0 111.5 (±22.8) 0.800 0.893 0.389
49 0.6924 (±0.0232/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... 74.1 (±31.4) 0.755 0.948 0.373
50 0.6891 (±0.0255/√100) 🟢 tokyotech-llm/Swallow-13b-hf 104.8 (±17.7) 0.811 0.901 0.355
51 0.6853 (±0.0201/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 96.6 (±18.8) 0.815 0.919 0.322
52 0.6794 (±0.0243/√100) 🟢 cyberagent/Llama-3.1-70B-Japanese-Ins... 128.8 (±72.2) 0.764 0.883 0.391
53 0.6759 (±0.0232/√10) 🟢 meta-llama/Meta-Llama-3.1-405B 101.2 (±15.1) 0.767 0.892 0.368
54 0.6745 (±0.0152/√10) 💬 google/gemini-1.5-pro-001 52.4 (±15.0) 0.666 0.980 0.377
55 0.6737 (±0.0276/√100) 🟢 sbintuitions/sarashina1-13b 105.4 (±23.4) 0.775 0.882 0.364
56 0.6715 (±0.0284/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-v0.1 107.5 (±22.2) 0.787 0.881 0.347
57 0.6697 (±0.0277/√100) 🟢 nvidia/nemotron-4-340b-base 106.9 (±26.5) 0.768 0.884 0.357
58 0.6677 (±0.0250/√100) 🟢 llm-jp/llm-jp-3-13b 101.1 (±9.7) 0.770 0.884 0.349
59 0.6673 (±0.0225/√100) 🟢 sbintuitions/sarashina1-65b 104.2 (±20.0) 0.776 0.894 0.332
60 0.6663 (±0.0262/√100) 🟢 tokyotech-llm/Swallow-7b-plus-hf 106.1 (±18.1) 0.780 0.880 0.339
61 0.6656 (±0.0169/√10) 💬 google/gemini-1.5-flash-001 55.1 (±21.7) 0.687 0.967 0.342
62 0.6625 (±0.0140/√10) 💬 anthropic/claude-3-haiku-20240307 81.9 (±31.0) 0.747 0.943 0.298
63 0.6590 (±0.0133/√10) 💬 google/gemini-2.0-flash-thinking-exp-... 49.8 (±11.0) 0.639 0.984 0.354
64 0.6572 (±0.0518/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 108.9 (±63.7) 0.764 0.895 0.313
65 0.6473 (±0.0182/√100) 💬 Qwen/Qwen2-72B-Instruct 108.7 (±24.8) 0.703 0.853 0.386
66 0.6456 (±0.0255/√100) 🟢 sbintuitions/sarashina2-7b 105.6 (±22.8) 0.746 0.874 0.316
67 0.6447 (±0.0251/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 74.3 (±31.3) 0.706 0.934 0.294
68 0.6445 (±0.0241/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-v0.1 110.3 (±28.4) 0.748 0.867 0.319
69 0.6406 (±0.0139/√100) 💬 Qwen/QwQ-32B-Preview 119.1 (±72.2) 0.730 0.897 0.294
70 0.6399 (±0.1763/√100) 💬 turing-motors/Llama-3-heron-brain-70B... 155.4 (±101.8) 0.718 0.805 0.397
71 0.6368 (±0.0207/√100) 🟢 tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 105.5 (±21.0) 0.753 0.870 0.287
72 0.6350 (±0.0260/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-instruct... 104.0 (±16.9) 0.755 0.863 0.287
73 0.6337 (±0.0265/√100) 🟢 tokyotech-llm/Swallow-7b-hf 106.5 (±18.7) 0.746 0.866 0.289
74 0.6335 (±0.0252/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 103.2 (±16.6) 0.766 0.872 0.263
75 0.6318 (±0.0264/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-Ins... 119.2 (±74.3) 0.724 0.861 0.311
76 0.6310 (±0.0127/√100) 💬 Qwen/Qwen2.5-32B-Instruct 75.4 (±19.3) 0.634 0.898 0.360
77 0.6303 (±0.0252/√100) 🟢 cyberagent/calm2-7b-chat-dpo-experime... 110.0 (±24.3) 0.735 0.863 0.293
78 0.6297 (±0.0150/√100) 💬 Qwen/Qwen2.5-32B-Instruct 71.1 (±18.7) 0.634 0.906 0.349
79 0.6291 (±0.0207/√100) 💬 Qwen/QwQ-32B-Preview 229.6 (±135.9) 0.719 0.867 0.301
80 0.6285 (±0.0239/√100) 🟢 pfnet/nekomata-14b-pfn-qfin-inst-merge 124.7 (±47.2) 0.725 0.866 0.295
81 0.6279 (±0.0252/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-hf 108.1 (±24.5) 0.747 0.870 0.267
82 0.6274 (±0.0772/√100) 🟢 rinna/nekomata-14b-instruction 98.3 (±24.2) 0.732 0.855 0.295
83 0.6267 (±0.0263/√100) 🟢 sbintuitions/sarashina1-7b 106.7 (±25.1) 0.737 0.866 0.276
84 0.6252 (±0.0246/√100) 🟢 karakuri-ai/karakuri-lm-70b-v0.1 106.0 (±27.0) 0.713 0.852 0.310
85 0.6214 (±0.0063/√10) 💬 google/gemini-1.0-pro-001 47.4 (±15.2) 0.635 0.976 0.254
86 0.6202 (±0.0251/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.3 (±19.2) 0.733 0.848 0.280
87 0.6197 (±0.0258/√100) 🟢 stockmark/stockmark-13b 108.9 (±49.3) 0.727 0.860 0.272
88 0.6191 (±0.0284/√100) 🟢 stockmark/stockmark-13b-instruct 108.0 (±46.8) 0.720 0.859 0.278
89 0.6178 (±0.0230/√100) 🟢 karakuri-ai/karakuri-lm-70b-chat-v0.1 104.7 (±27.5) 0.706 0.842 0.306
90 0.6176 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-7b-instruct-hf 106.3 (±17.8) 0.716 0.851 0.285
91 0.6149 (±0.0153/√100) 💬 Qwen/Qwen2.5-14B-Instruct 76.5 (±18.4) 0.644 0.893 0.308
92 0.6136 (±0.0143/√10) 💬 openai/gpt-35-turbo 64.0 (±22.2) 0.658 0.944 0.239
93 0.6095 (±0.0225/√100) 💬 rinna/llama-3-youko-70b-instruct 135.3 (±46.8) 0.683 0.817 0.328
94 0.6091 (±0.0277/√100) 🟢 pfnet/nekomata-14b-pfn-qfin 85.1 (±28.4) 0.672 0.893 0.262
95 0.6087 (±0.1545/√100) 💬 tokyotech-llm/Swallow-70b-NVE-instruc... 135.7 (±74.0) 0.678 0.804 0.344
96 0.6063 (±0.0213/√100) 💬 Qwen/Qwen2.5-14B-Instruct 80.0 (±21.8) 0.639 0.889 0.290
97 0.6060 (±0.0238/√100) 🟢 Qwen/Qwen2-72B 105.5 (±23.5) 0.703 0.836 0.279
98 0.6037 (±0.0239/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-instruct-hf 105.7 (±16.4) 0.719 0.847 0.245
99 0.6030 (±0.0287/√100) 💬 karakuri-ai/karakuri-lm-8x7b-instruct... 197.4 (±72.1) 0.703 0.832 0.274
100 0.6029 (±0.0223/√100) 🟢 Qwen/Qwen2-72B-Instruct 106.0 (±26.7) 0.684 0.825 0.299
101 0.5987 (±0.0264/√100) 🟢 cyberagent/calm2-7b-chat 107.5 (±20.8) 0.701 0.843 0.253
102 0.5971 (±0.0235/√100) 🟢 stockmark/stockmark-100b 107.2 (±24.7) 0.709 0.842 0.240
103 0.5945 (±0.1370/√100) 💬 tokyotech-llm/Swallow-13b-instruct-hf 167.3 (±116.4) 0.670 0.790 0.323
104 0.5921 (±0.0211/√100) 🟢 elyza/Llama-3-ELYZA-JP-8B 115.6 (±44.8) 0.685 0.831 0.260
105 0.5832 (±0.0220/√100) 🟢 augmxnt/shisa-gamma-7b-v1 106.7 (±21.8) 0.706 0.831 0.213
106 0.5825 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-MS-7b-v0.1 106.4 (±25.9) 0.702 0.828 0.218
107 0.5811 (±0.0218/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 103.6 (±15.6) 0.675 0.816 0.252
108 0.5808 (±0.0220/√100) 🟢 stabilityai/japanese-stablelm-base-ga... 106.9 (±17.2) 0.690 0.822 0.230
109 0.5783 (±0.0217/√100) 🟢 microsoft/Phi-3-medium-4k-instruct 105.9 (±20.0) 0.675 0.826 0.234
110 0.5777 (±0.0228/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 105.2 (±14.5) 0.675 0.811 0.247
111 0.5754 (±0.0182/√100) 🟢 Xwin-LM/Xwin-LM-70B-V0.1 105.4 (±26.8) 0.681 0.833 0.213
112 0.5737 (±0.0209/√100) 🟢 microsoft/Phi-3-medium-128k-instruct 107.7 (±24.7) 0.674 0.825 0.223
113 0.5735 (±0.0216/√100) 🟢 google/gemma-2-9b-it 95.9 (±22.0) 0.674 0.837 0.209
114 0.5734 (±0.1980/√100) 💬 tokyotech-llm/Swallow-70b-instruct-hf 130.9 (±105.0) 0.636 0.758 0.326
115 0.5724 (±0.0209/√100) 🟢 rinna/llama-3-youko-70b 104.6 (±20.6) 0.681 0.826 0.210
116 0.5716 (±0.0230/√100) 🟢 sbintuitions/sarashina2.1-1b 116.9 (±41.3) 0.668 0.821 0.226
117 0.5712 (±0.0194/√100) 💬 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 244.4 (±49.3) 0.678 0.816 0.220
118 0.5710 (±0.0226/√100) 🟢 rinna/llama-3-youko-8b-instruct 111.6 (±23.4) 0.672 0.809 0.232
119 0.5659 (±0.0234/√100) 🟢 meta-llama/Meta-Llama-3.1-70B 103.7 (±20.1) 0.665 0.822 0.211
120 0.5656 (±0.0226/√100) 💬 meta-llama/Meta-Llama-3-70B-Instruct 110.2 (±36.4) 0.665 0.777 0.254
121 0.5646 (±0.0240/√100) 💬 microsoft/Phi-3-medium-4k-instruct 131.3 (±50.6) 0.633 0.807 0.253
122 0.5642 (±0.0261/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.1 (±19.5) 0.646 0.799 0.247
123 0.5620 (±0.0254/√100) 🟢 meta-llama/Meta-Llama-3-70B 102.0 (±17.2) 0.664 0.809 0.213
124 0.5588 (±0.0230/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.6 (±17.0) 0.673 0.812 0.191
125 0.5574 (±0.0216/√100) 🟢 rinna/nekomata-7b 108.4 (±18.0) 0.678 0.816 0.178
126 0.5569 (±0.0244/√100) 🟢 rinna/llama-3-youko-8b 104.9 (±17.0) 0.670 0.813 0.188
127 0.5568 (±0.0200/√100) 🟢 meta-llama/Meta-Llama-3-70B-Instruct 111.8 (±55.9) 0.655 0.780 0.236
128 0.5562 (±0.0952/√100) 💬 stockmark/stockmark-13b-instruct 137.2 (±89.6) 0.633 0.798 0.238
129 0.5537 (±0.0204/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-Inst... 114.4 (±48.5) 0.657 0.812 0.192
130 0.5516 (±0.1016/√100) 💬 cyberagent/calm2-7b-chat-dpo-experime... 181.1 (±120.1) 0.644 0.775 0.236
131 0.5511 (±0.0203/√100) 🟢 google/gemma-2-27b-it 110.3 (±56.8) 0.599 0.836 0.218
132 0.5500 (±0.0605/√100) 💬 tokyotech-llm/Llama-3-Swallow-70B-Ins... 156.5 (±106.5) 0.633 0.780 0.237
133 0.5500 (±0.0467/√100) 💬 tokyotech-llm/Swallow-7b-instruct-hf 121.9 (±77.3) 0.612 0.812 0.225
134 0.5437 (±0.0218/√100) 💬 Xwin-LM/Xwin-LM-70B-V0.1 200.7 (±63.1) 0.652 0.782 0.198
135 0.5436 (±0.0246/√100) 🟢 llm-jp/llm-jp-3-3.7b 101.3 (±10.4) 0.646 0.795 0.189
136 0.5432 (±0.0208/√100) 💬 CohereForAI/c4ai-command-r-plus 48.9 (±16.5) 0.505 0.931 0.194
137 0.5429 (±0.0238/√100) 🟢 meta-llama/Meta-Llama-3.1-70B-Instruct 157.6 (±221.7) 0.636 0.770 0.222
138 0.5387 (±0.0269/√100) 💬 rinna/llama-3-youko-8b-instruct 265.4 (±104.1) 0.635 0.771 0.210
139 0.5386 (±0.0215/√100) 💬 microsoft/Phi-3-medium-128k-instruct 91.9 (±44.7) 0.589 0.834 0.193
140 0.5377 (±0.0481/√100) 💬 meta-llama/Meta-Llama-3.1-70B-Instruct 135.8 (±194.8) 0.617 0.779 0.218
141 0.5349 (±0.0203/√100) 💬 google/gemma-2-27b-it 74.7 (±42.7) 0.545 0.874 0.186
142 0.5347 (±0.0188/√100) 🟢 rinna/youri-7b 107.6 (±16.3) 0.654 0.802 0.148
143 0.5316 (±0.0273/√100) 💬 lightblue/karasu-7B-chat 111.8 (±46.5) 0.621 0.800 0.174
144 0.5301 (±0.0476/√100) 💬 lightblue/karasu-7B-chat-plus 107.1 (±46.7) 0.615 0.798 0.178
145 0.5283 (±0.0585/√100) 💬 lightblue/karasu-7B-chat-plus-unleashed 104.6 (±45.3) 0.614 0.794 0.177
146 0.5179 (±0.0264/√100) 🟢 cyberagent/calm2-7b 106.0 (±26.2) 0.601 0.770 0.182
147 0.5164 (±0.0209/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 109.3 (±33.5) 0.606 0.788 0.155
148 0.5143 (±0.0212/√100) 🟢 llm-jp/llm-jp-13b-v2.0 104.1 (±11.2) 0.604 0.760 0.180
149 0.5143 (±0.0170/√100) 🟢 moneyforward/houou-instruction-7b-v3 112.2 (±37.8) 0.629 0.778 0.135
150 0.5122 (±0.0132/√100) 💬 Qwen/Qwen2.5-7B-Instruct 69.5 (±28.7) 0.557 0.847 0.132
151 0.5085 (±0.0160/√100) 🟢 moneyforward/houou-instruction-7b-v1 105.9 (±41.0) 0.617 0.781 0.128
152 0.5080 (±0.0306/√100) 💬 stabilityai/japanese-stablelm-instruc... 111.3 (±58.3) 0.548 0.782 0.195
153 0.5073 (±0.0208/√100) 💬 Qwen/Qwen2-57B-A14B-Instruct 154.8 (±89.5) 0.615 0.734 0.173
154 0.5045 (±0.0208/√100) 🟢 Qwen/Qwen2-57B-A14B 106.7 (±22.5) 0.617 0.757 0.139
155 0.5041 (±0.0225/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 106.2 (±29.3) 0.579 0.778 0.155
156 0.5022 (±0.0221/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 95.0 (±36.2) 0.579 0.795 0.132
157 0.5013 (±0.0196/√100) 🟢 google/gemma-2-9b 107.3 (±26.0) 0.595 0.761 0.148
158 0.5013 (±0.0375/√100) 💬 karakuri-ai/karakuri-lm-70b-chat-v0.1 427.4 (±151.5) 0.579 0.723 0.202
159 0.5002 (±0.0218/√100) 🟢 Qwen/Qwen-72B-Chat 223.0 (±258.3) 0.614 0.716 0.171
160 0.4995 (±0.0211/√100) 💬 Qwen/Qwen1.5-72B-Chat 119.3 (±58.1) 0.582 0.708 0.208
161 0.4970 (±0.0117/√100) 💬 Qwen/Qwen2.5-7B-Instruct 65.0 (±22.0) 0.535 0.858 0.098
162 0.4963 (±0.0189/√100) 🟢 Qwen/Qwen1.5-72B-Chat 128.1 (±77.7) 0.586 0.698 0.206
163 0.4959 (±0.0235/√100) 🟢 llm-jp/llm-jp-13b-v1.0 115.0 (±40.9) 0.576 0.756 0.156
164 0.4953 (±0.0203/√100) 🟢 meta-llama/Llama-2-70b-hf 110.4 (±25.8) 0.596 0.745 0.145
165 0.4949 (±0.0177/√100) 💬 moneyforward/houou-instruction-7b-v1 180.5 (±66.6) 0.604 0.734 0.146
166 0.4931 (±0.0247/√100) 🟢 Rakuten/RakutenAI-7B-instruct 105.6 (±33.1) 0.598 0.750 0.132
167 0.4921 (±0.0219/√100) 🟢 Rakuten/RakutenAI-7B-chat 114.9 (±44.7) 0.592 0.760 0.124
168 0.4916 (±0.0201/√100) 🟢 moneyforward/houou-instruction-7b-v2 104.7 (±41.2) 0.588 0.770 0.116
169 0.4895 (±0.0440/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 268.1 (±133.1) 0.548 0.722 0.199
170 0.4872 (±0.0237/√100) 🟢 lightblue/karasu-7B 110.1 (±19.0) 0.586 0.739 0.137
171 0.4870 (±0.0215/√100) 🟢 Qwen/Qwen-72B 134.6 (±114.6) 0.593 0.715 0.152
172 0.4868 (±0.0163/√100) 💬 google/gemma-2-9b-it 47.6 (±14.6) 0.477 0.880 0.104
173 0.4863 (±0.1167/√100) 💬 pfnet/nekomata-14b-pfn-qfin-inst-merge 93.4 (±55.0) 0.544 0.721 0.194
174 0.4862 (±0.0221/√100) 🟢 Qwen/Qwen2-57B-A14B-Instruct 116.9 (±82.5) 0.601 0.734 0.124
175 0.4857 (±0.0168/√100) 💬 moneyforward/houou-instruction-7b-v2 207.0 (±57.3) 0.591 0.719 0.147
176 0.4829 (±0.0211/√100) 🟢 Qwen/Qwen1.5-72B 136.2 (±85.6) 0.591 0.705 0.153
177 0.4827 (±0.0464/√100) 💬 llm-jp/llm-jp-13b-instruct-full-ac_00... 269.1 (±131.5) 0.542 0.716 0.191
178 0.4762 (±0.0810/√100) 💬 stabilityai/japanese-stablelm-instruc... 126.2 (±67.4) 0.545 0.726 0.158
179 0.4746 (±0.0210/√100) 🟢 rinna/youri-7b-chat 102.1 (±16.4) 0.571 0.752 0.100
180 0.4744 (±0.0227/√100) 🟢 pfnet/plamo-13b 108.2 (±28.5) 0.558 0.749 0.116
181 0.4743 (±0.0987/√100) 💬 tokyotech-llm/Swallow-7b-NVE-instruct-hf 129.0 (±72.8) 0.535 0.725 0.163
182 0.4730 (±0.0166/√100) 🟢 Xwin-LM/Xwin-LM-13B-V0.2 109.7 (±27.4) 0.582 0.723 0.114
183 0.4723 (±0.0204/√100) 💬 Rakuten/RakutenAI-7B-chat 233.0 (±133.0) 0.565 0.734 0.118
184 0.4723 (±0.0808/√100) 💬 tokyotech-llm/Llama-3-Swallow-8B-Inst... 199.3 (±155.6) 0.563 0.699 0.154
185 0.4698 (±0.0200/√100) 🟢 Rakuten/RakutenAI-7B 105.4 (±25.6) 0.576 0.721 0.113
186 0.4692 (±0.0161/√100) 🟢 shisa-ai/shisa-v1-qwen2-7b 109.0 (±23.9) 0.563 0.712 0.133
187 0.4661 (±0.0210/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 111.6 (±44.2) 0.536 0.756 0.106
188 0.4659 (±0.0438/√100) 💬 deepseek-ai/deepseek-llm-67b-chat 146.0 (±62.1) 0.555 0.703 0.139
189 0.4659 (±0.0202/√100) 🟢 llm-jp/llm-jp-3-1.8b 105.0 (±16.9) 0.568 0.725 0.105
190 0.4648 (±0.1659/√100) 💬 cyberagent/calm2-7b-chat 124.7 (±95.9) 0.536 0.688 0.171
191 0.4622 (±0.0195/√100) 🟢 Qwen/Qwen-14B-Chat 135.5 (±84.3) 0.572 0.718 0.097
192 0.4619 (±0.0162/√100) 💬 lmsys/vicuna-13b-v1.5-16k 126.5 (±48.4) 0.574 0.715 0.097
193 0.4609 (±0.0113/√10) 🟢 google/gemma-2-2b-jpn-it 69.4 (±24.1) 0.509 0.805 0.069
194 0.4607 (±0.0165/√100) 🟢 SakanaAI/EvoLLM-JP-v1-7B 111.2 (±30.4) 0.579 0.708 0.095
195 0.4601 (±0.0184/√100) 🟢 shisa-ai/shisa-v1-llama3-8b 112.9 (±31.4) 0.557 0.703 0.120
196 0.4597 (±0.0268/√100) 🟢 CohereForAI/c4ai-command-r-v01 179.2 (±166.3) 0.590 0.592 0.197
197 0.4586 (±0.0141/√100) 🟢 google/gemma-2-2b-it 88.2 (±30.8) 0.536 0.761 0.079
198 0.4561 (±0.0202/√100) 🟢 pfnet/plamo-13b-instruct 144.0 (±147.7) 0.532 0.763 0.073
199 0.4559 (±0.0201/√100) 🟢 pfnet/plamo-13b-instruct-nc 156.0 (±183.1) 0.523 0.768 0.077
200 0.4558 (±0.0156/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 75.3 (±26.6) 0.488 0.804 0.076
201 0.4543 (±0.0217/√100) 🟢 rinna/youri-7b-instruction 96.2 (±29.5) 0.530 0.743 0.090
202 0.4535 (±0.0348/√100) 💬 Rakuten/RakutenAI-7B-instruct 128.6 (±83.2) 0.527 0.726 0.108
203 0.4535 (±0.0183/√100) 🟢 THUDM/glm-4-9b 110.3 (±36.9) 0.554 0.689 0.118
204 0.4527 (±0.0146/√100) 🟢 lmsys/vicuna-13b-v1.5-16k 107.9 (±25.9) 0.576 0.708 0.075
205 0.4504 (±0.0224/√100) 🟢 rinna/nekomata-7b-instruction 96.4 (±23.7) 0.528 0.734 0.089
206 0.4486 (±0.0161/√100) 💬 Qwen/Qwen2-7B-Instruct 163.6 (±61.4) 0.547 0.688 0.111
207 0.4484 (±0.0191/√100) 💬 SakanaAI/EvoLLM-JP-v1-7B 123.9 (±68.1) 0.545 0.706 0.094
208 0.4477 (±0.0205/√100) 🟢 rinna/llama-3-youko-70b-instruct 130.7 (±95.3) 0.527 0.670 0.146
209 0.4426 (±0.0204/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-inst... 111.1 (±28.2) 0.544 0.687 0.097
210 0.4409 (±0.1064/√100) 💬 lightblue/karasu-7B 138.1 (±92.9) 0.512 0.679 0.131
211 0.4404 (±0.0146/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 75.9 (±22.7) 0.493 0.773 0.056
212 0.4387 (±0.0655/√100) 💬 Qwen/Qwen-72B-Chat 117.7 (±137.1) 0.541 0.632 0.143
213 0.4385 (±0.0285/√100) 💬 rinna/youri-7b-chat 95.4 (±41.1) 0.500 0.733 0.083
214 0.4377 (±0.0107/√100) 🟢 google/gemma-1.1-7b-it 86.8 (±21.4) 0.509 0.732 0.072
215 0.4374 (±0.0217/√100) 🟢 Qwen/Qwen1.5-32B-Chat 127.0 (±57.0) 0.538 0.642 0.133
216 0.4336 (±0.0168/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.1 (±17.2) 0.539 0.689 0.073
217 0.4335 (±0.0221/√100) 🟢 Qwen/Qwen-14B 118.1 (±71.6) 0.530 0.675 0.096
218 0.4332 (±0.0164/√100) 🟢 Qwen/Qwen2-7B-Instruct 119.1 (±45.7) 0.531 0.670 0.098
219 0.4330 (±0.0149/√100) 💬 google/gemma-2-2b-it 56.0 (±27.8) 0.445 0.788 0.066
220 0.4320 (±0.0171/√100) 🟢 Qwen/Qwen2-7B 109.1 (±40.1) 0.532 0.671 0.093
221 0.4296 (±0.0322/√100) 💬 Qwen/Qwen-14B-Chat 159.0 (±69.7) 0.522 0.675 0.092
222 0.4295 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-instruct 111.5 (±31.4) 0.530 0.676 0.083
223 0.4292 (±0.0181/√100) 💬 Xwin-LM/Xwin-LM-13B-V0.2 240.7 (±48.4) 0.533 0.670 0.085
224 0.4282 (±0.0193/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 110.8 (±26.0) 0.518 0.688 0.078
225 0.4272 (±0.0273/√100) 🟢 mistralai/Mistral-Nemo-Instruct-2407 155.8 (±132.8) 0.548 0.611 0.122
226 0.4265 (±0.0115/√100) 💬 google/gemma-1.1-7b-it 78.7 (±28.4) 0.475 0.739 0.066
227 0.4256 (±0.0270/√100) 🟢 rinna/japanese-gpt-neox-3.6b 129.8 (±73.4) 0.485 0.685 0.106
228 0.4228 (±0.0185/√100) 🟢 stabilityai/japanese-stablelm-base-ja... 110.4 (±28.6) 0.528 0.668 0.073
229 0.4222 (±0.0138/√100) 🟢 Xwin-LM/Xwin-LM-7B-V0.2 110.6 (±29.3) 0.520 0.677 0.070
230 0.4220 (±0.0185/√100) 🟢 lmsys/vicuna-7b-v1.5-16k 111.8 (±31.8) 0.522 0.670 0.074
231 0.4207 (±0.0189/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 112.8 (±27.0) 0.507 0.683 0.072
232 0.4201 (±0.0177/√100) 💬 lmsys/vicuna-7b-v1.5-16k 128.1 (±52.5) 0.514 0.668 0.078
233 0.4164 (±0.0244/√100) 🟢 google/gemma-7b 135.5 (±132.3) 0.533 0.631 0.085
234 0.4150 (±0.0212/√100) 💬 Qwen/Qwen1.5-32B-Chat 125.7 (±250.5) 0.496 0.620 0.130
235 0.4149 (±0.0375/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 186.6 (±108.4) 0.469 0.685 0.090
236 0.4144 (±0.0149/√100) 💬 01-ai/Yi-1.5-34B-Chat 170.6 (±47.1) 0.514 0.628 0.101
237 0.4140 (±0.0208/√100) 🟢 meta-llama/Meta-Llama-3-8B-Instruct 116.8 (±44.3) 0.523 0.637 0.082
238 0.4125 (±0.0303/√100) 💬 CohereForAI/c4ai-command-r-v01 137.7 (±324.6) 0.519 0.562 0.157
239 0.4122 (±0.0199/√100) 🟢 rinna/bilingual-gpt-neox-4b 121.0 (±43.6) 0.485 0.660 0.092
240 0.4097 (±0.0187/√100) 🟢 meta-llama/Meta-Llama-3.1-8B 108.7 (±35.4) 0.512 0.650 0.068
241 0.4087 (±0.0201/√100) 🟢 meta-llama/Llama-2-70b-chat-hf 161.3 (±140.8) 0.519 0.608 0.099
242 0.4087 (±0.0146/√100) 🟢 microsoft/Phi-3-small-8k-instruct 109.1 (±24.1) 0.514 0.644 0.068
243 0.4076 (±0.0142/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast-... 109.0 (±32.9) 0.503 0.644 0.076
244 0.4074 (±0.0207/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-inst... 156.6 (±65.9) 0.490 0.646 0.086
245 0.4073 (±0.0175/√100) 🟢 stabilityai/japanese-stablelm-instruc... 110.0 (±26.5) 0.490 0.663 0.070
246 0.4058 (±0.0295/√100) 💬 rinna/youri-7b-instruction 97.0 (±57.0) 0.439 0.713 0.065
247 0.4050 (±0.0191/√100) 🟢 mistralai/Mixtral-8x22B-v0.1 115.6 (±55.4) 0.517 0.615 0.084
248 0.4048 (±0.0175/√100) 🟢 meta-llama/Meta-Llama-3-8B 109.0 (±19.8) 0.505 0.641 0.068
249 0.4045 (±0.0186/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 133.1 (±57.4) 0.475 0.678 0.061
250 0.4042 (±0.0131/√100) 🟢 microsoft/Orca-2-13b 115.5 (±42.6) 0.510 0.630 0.073
251 0.4041 (±0.0218/√100) 💬 meta-llama/Meta-Llama-3-8B-Instruct 131.4 (±88.3) 0.508 0.614 0.090
252 0.4035 (±0.0151/√100) 🟢 SakanaAI/EvoLLM-JP-A-v1-7B 110.4 (±31.3) 0.508 0.633 0.069
253 0.4033 (±0.0164/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast... 107.2 (±28.5) 0.495 0.643 0.072
254 0.4032 (±0.0237/√100) 🟢 Qwen/Qwen1.5-32B 150.3 (±104.8) 0.505 0.605 0.100
255 0.4024 (±0.0187/√100) 🟢 01-ai/Yi-1.5-34B 109.9 (±28.2) 0.493 0.631 0.083
256 0.4011 (±0.0236/√100) 🟢 cyberagent/open-calm-7b 143.8 (±97.0) 0.472 0.641 0.091
257 0.4006 (±0.0166/√100) 💬 microsoft/Phi-3-small-8k-instruct 189.7 (±84.1) 0.500 0.630 0.073
258 0.4001 (±0.0199/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 117.6 (±48.9) 0.464 0.684 0.052
259 0.3985 (±0.0161/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b 138.4 (±51.8) 0.493 0.634 0.069
260 0.3960 (±0.0199/√100) 🟢 line-corporation/japanese-large-lm-1.7b 179.2 (±174.5) 0.474 0.650 0.065
261 0.3949 (±0.0193/√100) 💬 meta-llama/Meta-Llama-3.1-8B-Instruct 216.6 (±345.2) 0.487 0.624 0.074
262 0.3948 (±0.0190/√100) 💬 Qwen/Qwen1.5-14B-Chat 127.9 (±50.6) 0.500 0.604 0.080
263 0.3946 (±0.0201/√100) 🟢 Qwen/Qwen1.5-14B 130.9 (±67.8) 0.509 0.609 0.066
264 0.3934 (±0.0201/√100) 🟢 stabilityai/japanese-stablelm-instruc... 107.8 (±38.0) 0.466 0.648 0.066
265 0.3914 (±0.0172/√100) 🟢 mistralai/Mixtral-8x7B-Instruct-v0.1 95.1 (±25.2) 0.488 0.636 0.050
266 0.3863 (±0.0160/√100) 🟢 Qwen/Qwen1.5-14B-Chat 131.4 (±55.8) 0.491 0.593 0.075
267 0.3837 (±0.0188/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 117.4 (±42.4) 0.462 0.649 0.041
268 0.3823 (±0.0645/√100) 💬 mistralai/Mistral-Nemo-Instruct-2407 157.9 (±140.3) 0.484 0.563 0.100
269 0.3822 (±0.0647/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 97.6 (±76.2) 0.397 0.664 0.086
270 0.3819 (±0.0265/√100) 🟢 google/gemma-2-27b 214.2 (±183.3) 0.450 0.608 0.087
271 0.3804 (±0.0161/√100) 🟢 Qwen/Qwen-7B-Chat 140.8 (±65.1) 0.485 0.612 0.045
272 0.3803 (±0.0249/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-instruct 136.4 (±70.7) 0.452 0.619 0.070
273 0.3772 (±0.0162/√100) 💬 microsoft/Phi-3-small-128k-instruct 199.7 (±111.9) 0.473 0.590 0.069
274 0.3760 (±0.0236/√100) 🟢 cyberagent/open-calm-3b 123.2 (±79.0) 0.442 0.624 0.062
275 0.3759 (±0.0149/√100) 🟢 lmsys/longchat-7b-v1.5-32k 116.9 (±31.6) 0.474 0.609 0.045
276 0.3740 (±0.0164/√100) 🟢 meta-llama/Llama-2-13b-hf 108.5 (±21.8) 0.474 0.603 0.045
277 0.3737 (±0.0197/√100) 🟢 meta-llama/Meta-Llama-3.1-8B-Instruct 204.5 (±303.4) 0.478 0.589 0.055
278 0.3720 (±0.0622/√100) 💬 Xwin-LM/Xwin-LM-7B-V0.2 205.3 (±79.1) 0.466 0.590 0.060
279 0.3720 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast 177.5 (±147.2) 0.458 0.598 0.061
280 0.3699 (±0.0345/√100) 💬 Qwen/Qwen-7B-Chat 182.9 (±110.3) 0.468 0.600 0.042
281 0.3694 (±0.0103/√100) 🟢 google/gemma-7b-it 89.7 (±21.6) 0.446 0.640 0.022
282 0.3685 (±0.0173/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b 140.0 (±52.8) 0.462 0.596 0.047
283 0.3673 (±0.0089/√100) 💬 google/gemma-7b-it 110.0 (±47.6) 0.448 0.633 0.020
284 0.3655 (±0.0116/√100) 🟢 deepseek-ai/deepseek-llm-7b-chat 113.9 (±24.7) 0.474 0.579 0.043
285 0.3642 (±0.0165/√100) 🟢 llm-jp/llm-jp-1.3b-v1.0 134.0 (±62.6) 0.437 0.612 0.044
286 0.3637 (±0.0223/√100) 🟢 cyberagent/open-calm-large 122.3 (±73.9) 0.424 0.611 0.056
287 0.3637 (±0.0152/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast 168.0 (±77.4) 0.452 0.587 0.052
288 0.3632 (±0.0237/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-fast-... 178.6 (±113.6) 0.443 0.582 0.064
289 0.3628 (±0.0145/√100) 🟢 Qwen/Qwen-7B 117.3 (±39.0) 0.468 0.582 0.039
290 0.3554 (±0.0178/√100) 🟢 meta-llama/Llama-2-7b-chat-hf 139.3 (±93.1) 0.464 0.570 0.031
291 0.3545 (±0.0445/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 48.8 (±50.1) 0.283 0.723 0.058
292 0.3543 (±0.0439/√100) 💬 lmsys/longchat-7b-v1.5-32k 160.1 (±73.5) 0.448 0.572 0.043
293 0.3538 (±0.0175/√100) 🟢 01-ai/Yi-1.5-9B 113.0 (±29.4) 0.457 0.555 0.050
294 0.3531 (±0.0159/√100) 🟢 mistralai/Mixtral-8x7B-v0.1 94.3 (±20.8) 0.450 0.573 0.037
295 0.3514 (±0.0102/√100) 🟢 google/gemma-1.1-2b-it 80.4 (±21.6) 0.404 0.625 0.025
296 0.3495 (±0.0268/√100) 🟢 cyberagent/open-calm-1b 141.3 (±110.0) 0.412 0.578 0.059
297 0.3471 (±0.0131/√100) 🟢 microsoft/Orca-2-7b 131.1 (±70.7) 0.447 0.555 0.039
298 0.3465 (±0.0202/√100) 💬 deepseek-ai/deepseek-llm-7b-chat 167.2 (±76.5) 0.435 0.562 0.042
299 0.3463 (±0.0178/√100) 💬 mistralai/Mixtral-8x7B-Instruct-v0.1 147.1 (±111.8) 0.448 0.548 0.043
300 0.3449 (±0.0986/√100) 💬 stabilityai/japanese-stablelm-instruc... 109.4 (±66.2) 0.397 0.585 0.053
301 0.3440 (±0.0978/√100) 💬 stabilityai/japanese-stablelm-3b-4e1t... 127.8 (±80.5) 0.401 0.576 0.055
302 0.3436 (±0.0126/√100) 💬 01-ai/Yi-1.5-9B-Chat 143.6 (±60.1) 0.438 0.540 0.053
303 0.3428 (±0.0163/√100) 🟢 meta-llama/Llama-2-7b-hf 112.3 (±28.0) 0.440 0.550 0.038
304 0.3408 (±0.0225/√100) 🟢 anthracite-org/magnum-32b-v2 191.9 (±223.2) 0.442 0.507 0.073
305 0.3393 (±0.0225/√100) 🟢 stockmark/gpt-neox-japanese-1.4b 92.2 (±63.7) 0.351 0.641 0.025
306 0.3322 (±0.0151/√100) 🟢 Qwen/Qwen1.5-7B-Chat 127.7 (±117.0) 0.431 0.520 0.045
307 0.3315 (±0.0203/√100) 🟢 Qwen/Qwen1.5-7B 141.8 (±126.5) 0.445 0.504 0.046
308 0.3313 (±0.0115/√100) 🟢 google/gemma-2b-it 85.9 (±24.7) 0.393 0.577 0.024
309 0.3293 (±0.0252/√100) 💬 Qwen/Qwen1.5-7B-Chat 195.7 (±113.1) 0.429 0.503 0.056
310 0.3276 (±0.0709/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-fast... 134.0 (±98.8) 0.395 0.543 0.045
311 0.3272 (±0.0101/√100) 💬 01-ai/Yi-1.5-6B-Chat 194.4 (±75.0) 0.426 0.530 0.025
312 0.3187 (±0.0142/√100) 🟢 Qwen/Qwen2-1.5B-Instruct 131.4 (±46.7) 0.421 0.513 0.022
313 0.3172 (±0.0150/√100) 🟢 Qwen/Qwen2-1.5B 120.9 (±30.7) 0.422 0.511 0.019
314 0.3161 (±0.0119/√100) 🟢 deepseek-ai/deepseek-llm-7b-base 113.7 (±21.6) 0.424 0.501 0.024
315 0.3147 (±0.0175/√100) 💬 Qwen/Qwen2-1.5B-Instruct 180.7 (±101.0) 0.408 0.511 0.025
316 0.3078 (±0.0195/√100) 🟢 cyberagent/open-calm-medium 117.3 (±59.4) 0.363 0.537 0.024
317 0.3058 (±0.1106/√100) 💬 rinna/nekomata-7b-instruction 61.2 (±57.0) 0.307 0.567 0.043
318 0.3053 (±0.0177/√100) 🟢 google/gemma-2b 151.5 (±113.6) 0.410 0.480 0.026
319 0.3050 (±0.0190/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B 146.4 (±90.3) 0.412 0.468 0.035
320 0.2993 (±0.0095/√100) 🟢 01-ai/Yi-1.5-6B-Chat 133.3 (±46.2) 0.394 0.481 0.022
321 0.2993 (±0.0107/√100) 🟢 tiiuae/falcon-11B 121.6 (±31.5) 0.398 0.483 0.016
322 0.2957 (±0.0641/√100) 💬 meta-llama/Llama-2-13b-chat-hf 305.2 (±299.7) 0.402 0.453 0.032
323 0.2953 (±0.0442/√100) 🟢 augmxnt/shisa-base-7b-v1 200.4 (±160.3) 0.378 0.478 0.030
324 0.2924 (±0.0506/√100) 💬 Qwen/Qwen1.5-MoE-A2.7B-Chat 245.1 (±209.1) 0.381 0.453 0.043
325 0.2914 (±0.0133/√100) 🟢 mistralai/Mistral-7B-v0.1 117.4 (±40.4) 0.402 0.454 0.018
326 0.2907 (±0.0175/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B-Chat 149.8 (±91.0) 0.388 0.448 0.036
327 0.2853 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B-Chat 127.8 (±71.2) 0.395 0.441 0.019
328 0.2809 (±0.0133/√100) 🟢 Qwen/Qwen1.5-1.8B-Chat 178.3 (±92.0) 0.381 0.445 0.017
329 0.2770 (±0.0131/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.2 146.2 (±70.1) 0.387 0.419 0.024
330 0.2769 (±0.0324/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 16.9 (±24.6) 0.125 0.693 0.013
331 0.2769 (±0.1029/√100) 💬 stabilityai/japanese-stablelm-instruc... 117.0 (±115.0) 0.307 0.489 0.035
332 0.2666 (±0.0241/√100) 🟢 deepseek-ai/deepseek-llm-67b-chat 140.2 (±83.0) 0.351 0.440 0.009
333 0.2661 (±0.0128/√100) 🟢 Qwen/Qwen1.5-1.8B 129.7 (±65.7) 0.360 0.424 0.014
334 0.2613 (±0.0136/√100) 🟢 Qwen/Qwen2-0.5B-Instruct 176.8 (±98.9) 0.351 0.426 0.007
335 0.2604 (±0.0148/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.1 139.8 (±101.3) 0.367 0.400 0.014
336 0.2598 (±0.0129/√100) 🟢 Qwen/Qwen2-0.5B 122.7 (±43.5) 0.350 0.420 0.009
337 0.2581 (±0.0196/√100) 🟢 cyberagent/open-calm-small 119.1 (±54.1) 0.310 0.460 0.004
338 0.2555 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B 149.2 (±76.6) 0.363 0.388 0.015
339 0.2543 (±0.0266/√100) 🟢 mosaicml/mpt-30b-chat 121.3 (±46.4) 0.327 0.428 0.008
340 0.2414 (±0.0281/√100) 💬 Qwen/Qwen1.5-1.8B-Chat 480.0 (±210.3) 0.329 0.392 0.003
341 0.2394 (±0.0745/√100) 💬 Qwen/Qwen1.5-4B-Chat 105.3 (±104.1) 0.307 0.390 0.021
342 0.2317 (±0.0455/√100) 💬 mistralai/Mistral-7B-Instruct-v0.1 202.3 (±153.9) 0.320 0.362 0.012
343 0.2231 (±0.0166/√100) 💬 mistralai/Mistral-7B-Instruct-v0.2 261.2 (±166.3) 0.316 0.334 0.019
344 0.2182 (±0.0152/√100) 🟢 microsoft/phi-1 47.6 (±34.3) 0.234 0.420 0.000
345 0.2177 (±0.0110/√100) 🟢 Qwen/Qwen1.5-0.5B-Chat 143.4 (±52.1) 0.317 0.327 0.009
346 0.2169 (±0.0561/√100) 💬 Qwen/Qwen2-0.5B-Instruct 129.5 (±114.3) 0.265 0.379 0.006
347 0.2169 (±0.0218/√100) 🟢 mosaicml/mpt-30b-instruct 109.8 (±36.1) 0.274 0.370 0.008
348 0.2146 (±0.0151/√100) 🟢 microsoft/phi-2 78.0 (±31.4) 0.287 0.356 0.001
349 0.2061 (±0.0820/√100) 💬 meta-llama/Llama-2-70b-chat-hf 523.3 (±444.5) 0.271 0.303 0.045
350 0.2040 (±0.0152/√100) 🟢 Qwen/Qwen1.5-0.5B 138.6 (±55.9) 0.296 0.314 0.003
351 0.2038 (±0.0538/√100) 🟢 mosaicml/mpt-30b 236.5 (±433.3) 0.271 0.334 0.007
352 0.1885 (±0.0194/√100) 🟢 microsoft/phi-1_5 77.5 (±33.6) 0.258 0.306 0.001
353 0.1833 (±0.0406/√100) 💬 google/gemma-1.1-2b-it 32.6 (±26.7) 0.171 0.376 0.003
354 0.1765 (±0.0439/√100) 💬 Qwen/Qwen1.5-0.5B-Chat 214.3 (±172.6) 0.251 0.276 0.002
355 0.1687 (±0.0172/√100) 🟢 upstage/SOLAR-10.7B-v1.0 171.0 (±87.1) 0.265 0.237 0.004
356 0.1544 (±0.0132/√100) 🟢 01-ai/Yi-1.5-34B-Chat 730.0 (±533.6) 0.201 0.256 0.006
357 0.1475 (±0.0826/√100) 💬 mosaicml/mpt-30b-chat 112.2 (±112.4) 0.182 0.254 0.007
358 0.1241 (±0.0558/√100) 💬 google/gemma-2b-it 24.1 (±24.6) 0.115 0.257 0.000
359 0.1226 (±0.0240/√100) 🟢 Deci/DeciLM-7B 174.0 (±165.5) 0.190 0.174 0.003
360 0.1160 (±0.0081/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 212.1 (±148.9) 0.153 0.195 0.000
361 0.1009 (±0.0846/√100) 💬 meta-llama/Llama-2-7b-chat-hf 241.5 (±336.2) 0.136 0.158 0.009
362 0.1004 (±0.0094/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 123.1 (±128.8) 0.119 0.182 0.000
363 0.0987 (±0.0145/√100) 🟢 deepseek-ai/deepseek-llm-67b-base 154.2 (±77.3) 0.174 0.121 0.000
364 0.0982 (±0.1596/√100) 💬 rinna/nekomata-14b-instruction 16.0 (±38.1) 0.115 0.141 0.039
365 0.0955 (±0.0102/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 129.5 (±141.0) 0.116 0.170 0.000
366 0.0939 (±0.0064/√100) 🟢 sbintuitions/tiny-lm-chat 250.2 (±275.6) 0.133 0.149 0.000
367 0.0936 (±0.0082/√100) 💬 sbintuitions/tiny-lm-chat 276.7 (±209.6) 0.135 0.145 0.000
368 0.0921 (±0.0058/√100) 🟢 sbintuitions/tiny-lm 471.9 (±199.0) 0.135 0.142 0.000
369 0.0880 (±0.0334/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 134.0 (±144.7) 0.105 0.159 0.000
370 0.0762 (±0.0033/√100) 🟢 line-corporation/japanese-large-lm-3.6b 1066.6 (±31.6) 0.125 0.103 0.000
371 0.0760 (±0.0032/√100) 🟢 line-corporation/japanese-large-lm-3.... 1066.4 (±31.8) 0.125 0.103 0.000
372 0.0758 (±0.0034/√100) 💬 line-corporation/japanese-large-lm-3.... 1067.2 (±31.8) 0.125 0.102 0.000
373 0.0673 (±0.0085/√100) 🟢 moneyforward/houou-instruction-7b-v3 143.2 (±112.2) 0.098 0.104 0.000
374 0.0625 (±0.0169/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 31.6 (±10.3) 0.088 0.099 0.000
375 0.0429 (±0.0440/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 31.7 (±54.7) 0.045 0.084 0.000
376 0.0406 (±0.0028/√100) 🟢 microsoft/Phi-3-small-128k-instruct 268.1 (±123.4) 0.083 0.039 0.000
377 0.0337 (±0.0026/√100) 🟢 augmxnt/shisa-7b-v1 590.7 (±238.2) 0.076 0.025 0.000
378 0.0284 (±0.0012/√100) 🟢 lightblue/karasu-7B-chat-plus 285.1 (±53.8) 0.080 0.005 0.000
379 0.0225 (±0.0702/√100) 💬 SakanaAI/EvoLLM-JP-A-v1-7B 5.9 (±27.6) 0.026 0.037 0.005
380 0.0180 (±0.0039/√100) 🟢 mistralai/Mistral-Nemo-Base-2407 607.5 (±344.5) 0.039 0.015 0.000
381 0.0047 (±0.0024/√100) 🟢 ai-forever/mGPT-13B 321.1 (±266.7) 0.008 0.006 0.000
382 0.0022 (±0.0006/√100) 🟢 lightblue/qarasu-14B-chat-plus-unleashed 937.5 (±557.0) 0.004 0.002 0.000
383 0.0019 (±0.0002/√100) 🟢 01-ai/Yi-1.5-9B-Chat 1440.0 (±51.9) 0.005 0.001 0.000
384 0.0018 (±0.0004/√100) 🟢 CohereForAI/aya-23-8B 1676.6 (±351.0) 0.004 0.002 0.000
385 0.0006 (±0.0002/√100) 🟢 meta-llama/Llama-2-13b-chat-hf 1523.9 (±43.5) 0.001 0.001 0.000
386 0.0000 (±0.0000/√100) 🟢 01-ai/Yi-1.5-6B 0.0 (±0.0) 0.000 0.000 0.000
387 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-1.1B 0.0 (±0.0) 0.000 0.000 0.000
388 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat-plus-unleashed 0.0 (±0.0) 0.000 0.000 0.000
389 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat 0.0 (±0.0) 0.000 0.000 0.000
390 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-japanese 300.0 (±0.0) 0.000 0.000 0.000
391 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-multilingual 300.0 (±0.0) 0.000 0.000 0.000
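
The rows above can be read programmatically. The following is a minimal sketch, not part of the repository: it assumes each row follows the layout `rank score (±std/√trials) marker model length (±length_std)` followed by three sub-scores, and that `±std/√trials` denotes a standard deviation divided by the square root of the number of trials. The names `ROW`, `parse_row`, and the dictionary keys are hypothetical.

```python
import math
import re

# Hypothetical pattern for one flattened leaderboard row, e.g.
# "303 0.3428 (±0.0163/√100) 🟢 meta-llama/Llama-2-7b-hf 112.3 (±28.0) 0.440 0.550 0.038"
ROW = re.compile(
    r"^(?P<rank>\d+)\s+"
    r"(?P<score>\d+\.\d+)\s+\(±(?P<std>\d+\.\d+)/√(?P<trials>\d+)\)\s+"
    r"(?P<marker>\S+)\s+(?P<model>\S+)\s+"
    r"(?P<length>\d+\.\d+)\s+\(±(?P<length_std>\d+\.\d+)\)\s+"
    r"(?P<s1>\d+\.\d+)\s+(?P<s2>\d+\.\d+)\s+(?P<s3>\d+\.\d+)$"
)


def parse_row(line):
    """Parse one leaderboard row into a dict (assumed column layout)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    std = float(m["std"])
    trials = int(m["trials"])
    return {
        "rank": int(m["rank"]),
        "score": float(m["score"]),
        "std": std,
        "trials": trials,
        # Assumption: "±std/√trials" is the standard error of the mean score.
        "std_err": std / math.sqrt(trials),
        "marker": m["marker"],
        "model": m["model"],  # may be truncated with "..." in the table
        "length": float(m["length"]),
        "length_std": float(m["length_std"]),
        # The three trailing columns are kept unnamed because the header
        # is not reproduced in this part of the table.
        "subscores": (float(m["s1"]), float(m["s2"]), float(m["s3"])),
    }
```

For example, `parse_row("303 0.3428 (±0.0163/√100) 🟢 meta-llama/Llama-2-7b-hf 112.3 (±28.0) 0.440 0.550 0.038")` yields a score of 0.3428 with a standard error of roughly 0.0016 under the assumption stated above.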

Citation

If you use this repository, please cite the following paper:

@preprint{Imos2024-pre-pfgen,
  title={{pfgen-bench: 日本語事前学習モデルのための文章生成性能評価ベンチマーク}},
  author={今城, 健太郎 and 平野, 正徳 and 鈴木, 脩司 and 三上, 裕明},
  doi={10.51094/jxiv.1008},
  year={2024}
}

Or cite this repository directly:

@misc{imajo2024-pfgen,
    title={{Preferred Generation Benchmark}},
    author={Kentaro Imajo and Masanori Hirano and Shuji Suzuki and Hiroaki Mikami},
    year={2024},
    url = {https://github.com/pfnet-research/pfgen-bench}
}
