CoT-based-Synthesizer

This repo is the official implementation of the paper "CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis".

Introduction

We present CoT-based Synthesizer, a method that leverages CoT reasoning to integrate information from multiple candidate responses, generating a more accurate and refined synthesis response even when all candidate responses are flawed. We also develop an automated data generation pipeline, which enables the training of smaller, cost-efficient models that can effectively enhance the integration performance of larger models.

[Figure: Method overview]

Experiment

We conduct experiments on mathematical reasoning and table question answering tasks across four benchmarks: MATH, GSM8k, WikiTQ, and FeTaQA. The evaluation results are shown below:

GSM8k

| Method | GLM-4-Plus | GPT-4o | Llama3.1-70B | Llama3.1-8B | Llama3-8B | Qwen2-7B | Qwen2.5-14B | Average |
|---|---|---|---|---|---|---|---|---|
| CoT-prompting | 88.6 | 91.4 | 92.7 | 81.9 | 73.4 | 82.0 | 91.2 | 85.9 |
| Self-consistency | 90.1 | 92.4 | 93.9 | 85.1 | 80.9 | 84.3 | 92.3 | 88.4 |
| Universal Self-consistency (Llama3.1-70B) | 90.1 | 92.3 | 93.5 | 85.4 | 82.0 | 84.9 | 92.3 | 88.6 |
| LMCOR (Llama3-8B) | 88.9 | 90.4 | 90.1 | 83.1 | 79.4 | 84.8 | 89.5 | 86.6 |
| ArmoRM (Llama3-8B) | 90.3 | 91.6 | 93.3 | 85.5 | 82.4 | 86.1 | 92.1 | 88.8 |
| Scalar RM (Llama3-8B) | 89.1 | 91.9 | 93.3 | 85.6 | 81.6 | 85.8 | 91.4 | 88.4 |
| Ours (Llama3.1-70B) | 91.2 | 92.6 | 93.6 | 86.9 | 83.5 | 88.3 | 92.3 | 89.8 |
| Ours (Synthesizer-8B) | 91.4 | 93.0 | 94.0 | 86.1 | 81.3 | 86.4 | 92.7 | 89.3 |

MATH

| Method | GLM-4-Plus | GPT-4o | Llama3.1-70B | Llama3.1-8B | Llama3-8B | Qwen2-7B | Qwen2.5-14B | Average |
|---|---|---|---|---|---|---|---|---|
| CoT-prompting | 54.8 | 62.5 | 66.6 | 46.5 | 24.2 | 57.3 | 74.4 | 55.2 |
| Self-consistency | 63.0 | 68.7 | 68.8 | 55.4 | 32.4 | 61.0 | 76.6 | 60.8 |
| Universal Self-consistency (Llama3.1-70B) | 62.6 | 67.3 | 68.4 | 52.8 | 35.4 | 62.2 | 78.2 | 61.0 |
| LMCOR (Llama3-8B) | 52.4 | 61.2 | 57.6 | 44.8 | 33.6 | 51.6 | 64.0 | 52.2 |
| ArmoRM (Llama3-8B) | 60.6 | 67.5 | 69.4 | 52.6 | 32.8 | 60.2 | 77.2 | 60.0 |
| Scalar RM (Llama3-8B) | 61.4 | 65.9 | 66.8 | 52.8 | 34.2 | 59.4 | 77.6 | 59.7 |
| Ours (Llama3.1-70B) | 64.2 | 75.5 | 69.6 | 52.8 | 38.8 | 63.6 | 79.0 | 63.4 |
| Ours (Synthesizer-8B) | 64.4 | 72.9 | 69.6 | 54.6 | 36.0 | 62.4 | 78.2 | 62.6 |

WikiTQ

| Method | GLM-4-Plus | GPT-4o | Llama3.1-70B | Llama3.1-8B | Llama3-8B | Qwen2-7B | Qwen2.5-14B | Average |
|---|---|---|---|---|---|---|---|---|
| CoT-prompting | 90.1 | 89.9 | 86.7 | 72.4 | 71.7 | 63.8 | 77.9 | 78.9 |
| Universal Self-consistency (Llama3.1-70B) | 91.6 | 91.8 | 88.3 | 79.6 | 76.3 | 69.2 | 81.5 | 82.6 |
| LMCOR (Llama3-8B) | 88.8 | 90.4 | 87.8 | 77.3 | 75.4 | 69.2 | 81.4 | 81.5 |
| ArmoRM (Llama3-8B) | 91.0 | 91.8 | 87.5 | 77.9 | 73.8 | 69.4 | 81.2 | 81.8 |
| Scalar RM (Llama3-8B) | 91.8 | 90.5 | 87.5 | 77.6 | 74.9 | 69.8 | 80.1 | 81.7 |
| Ours (Llama3.1-70B) | 91.9 | 92.3 | 88.3 | 83.4 | 82.2 | 78.0 | 84.2 | 85.8 |
| Ours (Synthesizer-8B) | 92.1 | 91.9 | 88.9 | 79.9 | 77.7 | 72.2 | 82.4 | 83.6 |

FeTaQA

| Method | GLM-4-Plus | GPT-4o | Llama3.1-70B | Llama3.1-8B | Llama3-8B | Qwen2-7B | Qwen2.5-14B | Average |
|---|---|---|---|---|---|---|---|---|
| CoT-prompting | 86.4 | 86.3 | 85.6 | 82.6 | 82.2 | 73.5 | 82.7 | 82.8 |
| Universal Self-consistency (Llama3.1-70B) | 87.1 | 87.0 | 86.1 | 84.3 | 83.9 | 77.5 | 84.1 | 84.3 |
| LMCOR (Llama3-8B) | 86.0 | 84.7 | 83.0 | 84.7 | 83.8 | 79.9 | 83.2 | 83.6 |
| ArmoRM (Llama3-8B) | 87.5 | 86.1 | 86.0 | 83.2 | 82.5 | 76.1 | 82.9 | 83.5 |
| Scalar RM (Llama3-8B) | 87.4 | 85.5 | 85.3 | 83.0 | 82.3 | 75.1 | 83.3 | 83.1 |
| Ours (Llama3.1-70B) | 87.0 | 86.8 | 86.6 | 84.9 | 85.1 | 82.3 | 84.1 | 85.3 |
| Ours (Synthesizer-8B) | 87.5 | 87.9 | 87.5 | 84.7 | 85.9 | 82.1 | 86.6 | 86.0 |

Dataset

We use four benchmarks for evaluation; the original files of these benchmarks are available in the data folder. Their detailed sources are as follows:

  • GSM8k: The original GSM8k test set in full.
  • MATH: A subset of MATH, using the same 500 samples as PRM800K.
  • WikiTQ: A subset of WikiTQ, using the same 633 samples as TableLLM.
  • FeTaQA: A subset of FeTaQA, using the same 753 samples as TableLLM.

Prompt Template

The prompt we use for generating the synthesis answer is shown below.

Please act as an excellent summarizer and summarize the following AI responses to the question. Your summary should fully consider the connection between the question and the AI responses, resulting in a correct, high-quality answer. In most cases, the response that appears most often among the candidates may be the correct answer. If you find that there is no correct answer, please try to generate a correct answer yourself. Do not copy the candidates' answers; give your summarized answer and reasons, and give the correct answer at the end of the sentence in the format: The answer is...

[The Start of Original Question]
{question}
[The End of Original Question]

[The Start of AI Responses]
{responses}
[The End of AI Responses]
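
As a concrete illustration of how this template is used, here is a minimal sketch that fills the two slots and queries an OpenAI-compatible endpoint (e.g., a local vLLM server). The base_url, api_key, and model name are placeholders rather than values from this repo, and the response numbering scheme is an assumption:

```python
# Minimal sketch of using the template above. The endpoint, api_key, and
# model name are placeholders (e.g., a local vLLM server), not repo values.
from openai import OpenAI

INSTRUCTION = "Please act as an excellent summarizer..."  # paste the full instruction text shown above

def build_prompt(question: str, candidates: list[str]) -> str:
    # Concatenate the candidates into the {responses} slot; the numbering
    # scheme here is an assumption, not taken from the repo.
    responses = "\n\n".join(f"Response {i + 1}: {r}" for i, r in enumerate(candidates))
    return (
        f"{INSTRUCTION}\n\n"
        f"[The Start of Original Question]\n{question}\n[The End of Original Question]\n\n"
        f"[The Start of AI Responses]\n{responses}\n[The End of AI Responses]"
    )

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def synthesize(question: str, candidates: list[str]) -> str:
    out = client.chat.completions.create(
        model="Synthesizer-8B",  # placeholder model name
        messages=[{"role": "user", "content": build_prompt(question, candidates)}],
        temperature=0.0,
    )
    return out.choices[0].message.content
```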

Environment Setup

Run the following commands to set up your environment:

git clone https://github.com/RUCKBReasoning/CoT-based-Synthesizer
cd CoT-based-Synthesizer
conda create --name synthesizer --yes python=3.11
conda activate synthesizer
pip install -r requirements.txt

For the evaluation of MATH, we follow the evaluation of DART-Math, with the environment set up as follows:

git clone https://github.com/hkust-nlp/dart-math.git && cd dart-math
pip install -e "."

Inference

Example commands for synthesizer inference (e.g., on MATH) are shown below:

cd inference

python synthesis_infer.py --model_name <model_name> --input_file ../data/MATH/math_llama3-8b_ans.jsonl --dataset MATH --output_file math_llama3-8b_summary
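
If you want to check what the input file looks like before running inference, a small sketch like the following (not part of the repo) prints the schema of the first JSONL record; it assumes nothing about the field names, since they are read from the file itself:

```python
# Print the schema of the first record in the candidate-answer file, so you
# can see what fields the inference script consumes. Field names are taken
# from the file itself; none are assumed.
import json

with open("../data/MATH/math_llama3-8b_ans.jsonl", encoding="utf-8") as f:
    first_record = json.loads(f.readline())

print(sorted(first_record.keys()))
```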

Evaluation

The Python code in the evaluation folder is used to reproduce the evaluation results. You can run the following commands (e.g., for MATH):

cd evaluation

python eval_math_task.py --dataset MATH --input_file math_llama3-8b_summary --dataset_name MATH
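
Since the synthesis prompt asks the model to end with "The answer is ...", evaluation hinges on extracting that final answer from the synthesis response. The following is an illustrative sketch of such an extractor, not the actual logic of eval_math_task.py:

```python
# Illustrative sketch (not the repo's eval_math_task.py): extract the final
# answer from a synthesis response, which the prompt asks to end with
# "The answer is ...".
import re

def extract_final_answer(text: str) -> str | None:
    # Take the last occurrence, in case intermediate reasoning repeats the phrase.
    matches = re.findall(r"The answer is\s*(.+)", text, flags=re.IGNORECASE)
    if not matches:
        return None
    return matches[-1].strip().rstrip(".")

print(extract_final_answer("Both responses agree. The answer is 42."))  # -> 42
```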

Data Pipeline

The Python code in the pipeline folder is designed for generating training data. We first use the sampling.py script to generate candidate responses, and then use the synthesizer script to generate corresponding diverse synthesis answers. The filtered, correct synthesis responses are then used for training.
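
To make the filtering step concrete, here is a sketch of the idea under assumed field names (question, candidates, syntheses, and gold are hypothetical; see the actual pipeline scripts for the real schema): only synthesis answers whose final answer matches the gold label are kept as training targets.

```python
# Sketch of the filtering idea only; the field names (question, candidates,
# syntheses, gold) are hypothetical, not the pipeline's real schema.

def final_answer(synthesis: str) -> str:
    # The prompt asks responses to end with "The answer is ...".
    return synthesis.rsplit("The answer is", 1)[-1].strip().rstrip(".")

def filter_for_training(records: list[dict]) -> list[dict]:
    kept = []
    for rec in records:
        for syn in rec["syntheses"]:
            # Keep only synthesis answers whose final answer matches the label.
            if final_answer(syn) == rec["gold"]:
                kept.append({
                    "question": rec["question"],
                    "candidates": rec["candidates"],
                    "target": syn,
                })
    return kept
```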

Citation

If you find our paper helpful, please cite it as follows:

@misc{zhang2025cotbasedsynthesizerenhancingllm,
      title={CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis}, 
      author={Bohan Zhang and Xiaokang Zhang and Jing Zhang and Jifan Yu and Sijia Luo and Jie Tang},
      year={2025},
      eprint={2501.01668},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.01668}, 
}

Contact

If you have any questions, feel free to open a GitHub issue or contact us at zbhmint@ruc.edu.cn.
