This paper investigates an under-explored challenge in large language models (LLMs): chain-of-thought prompting with noisy rationales, which include irrelevant or inaccurate reasoning thoughts within examples used for in-context learning. We construct the NoRa dataset, tailored to evaluate the robustness of reasoning in the presence of noisy rationales. Our findings on the NoRa dataset reveal a prevalent vulnerability to such noise among current LLMs, with existing robust methods such as self-correction and self-consistency showing limited efficacy. Notably, compared to prompting with clean rationales, GPT-3.5 drops by 1.4%-19.8% in accuracy with irrelevant thoughts and, more drastically, by 2.2%-40.4% with inaccurate thoughts.
Addressing this challenge necessitates external supervision that should be accessible in practice. Here, we propose the method of contrastive denoising with noisy chain-of-thought (CD-CoT). It enhances LLMs’ denoising-reasoning capabilities by contrasting noisy rationales with only one clean rationale, which can be the minimal requirement for denoising-purpose prompting. This method follows a principle of exploration and exploitation: (1) rephrasing and selecting rationales in the input space to achieve explicit denoising and (2) exploring diverse reasoning paths and voting on answers in the output space. Empirically, CD-CoT demonstrates an average improvement of 17.8% in accuracy over the base model and shows significantly stronger denoising capabilities than baseline methods.
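As a rough illustration, the exploration-and-exploitation loop can be sketched in a few lines of Python. This is a schematic of the procedure described above, not the repository's implementation; the `rephrase` and `reason` callables, the candidate-selection step, and the default counts are all placeholders:

```python
from collections import Counter
from typing import Callable, List

def cd_cot(
    question: str,
    noisy_rationales: List[str],
    clean_rationale: str,
    rephrase: Callable[[str, str], str],      # (noisy, clean) -> denoised rationale
    reason: Callable[[str, List[str]], str],  # (question, examples) -> answer
    n_rephrase: int = 3,
    n_select: int = 2,
    n_vote: int = 5,
) -> str:
    # (1) Input space: rephrase each noisy rationale by contrasting it with the
    # single clean rationale (explicit denoising), then keep a few candidates.
    candidates = [
        rephrase(noisy, clean_rationale)
        for noisy in noisy_rationales
        for _ in range(n_rephrase)
    ]
    selected = candidates[:n_select]  # placeholder for the actual selection step

    # (2) Output space: sample diverse reasoning paths and vote on the answer.
    answers = [reason(question, selected) for _ in range(n_vote)]
    return Counter(answers).most_common(1)[0][0]
```

The key constraint reflected here is that only one clean rationale is required: it serves as the contrast target for every rephrasing call.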
Choose one of the following installation commands based on your OpenAI API version:
# For the new OpenAI API (openai >= 1.0):
pip install openai requests pandas nltk pyyaml scikit-learn tiktoken python-dotenv
# For the legacy OpenAI API:
pip install openai==0.28 requests pandas nltk pyyaml scikit-learn tiktoken python-dotenv
You can set up your OpenAI API credentials using either environment variables or a configuration file:
export OPENAI_API_KEY=[YOUR_API_KEY_HERE]
# if you have multiple API keys
export OPENAI_API_KEY=[key1]:[key2]:[key3]...
# if you require a specific API base
export OPENAI_API_BASE=[YOUR_API_BASE_HERE]
Alternatively, create a .env file in the project root directory:
OPENAI_API_KEY=[YOUR_API_KEY_HERE]
# if you have multiple keys
OPENAI_API_KEY=[key1]:[key2]:[key3]...
# if you require a specific API base
OPENAI_API_BASE=[YOUR_API_BASE_HERE]
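For reference, here is a minimal sketch of how the colon-separated key list could be read back at runtime, assuming the repository relies on `python-dotenv` (installed above); the helper below is illustrative, not the repository's actual loader:

```python
import os

from dotenv import load_dotenv  # from the python-dotenv package installed above

load_dotenv()  # reads .env from the project root, if present

def get_api_keys():
    """Hypothetical helper: split the colon-separated OPENAI_API_KEY list."""
    raw = os.environ.get("OPENAI_API_KEY", "")
    return [key for key in raw.split(":") if key]

print(f"Loaded {len(get_api_keys())} API key(s)")
```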
The repository is organized as follows:
- `data/`: Raw and preprocessed datasets
  - Original datasets used for generation
  - Preprocessed datasets ready for experiments
- `data_process/`: Libraries and utilities for dataset processing and manipulation
- `method/`: Implementations of the noise-handling methods, i.e., the various approaches for handling noisy rationales in chain-of-thought prompting
- `llm_model/`: Interfaces, wrappers, and utilities for interacting with various LLMs
- `noise_test.py`: Main experiment script for testing noisy-rationale handling
- `config.yml`: Configuration file for experiment settings
  - Model parameters
  - Dataset options
  - Testing configurations
NoRa supports three task categories with corresponding subtasks:
- Math: base-9, base-11
- Symbolic: equal, longer
- Commonsense: (no subtasks)
Configure experiment settings in `config.yml` and run:
python noise_test.py
# Zero-shot testing
python noise_test.py -task [task]_[subtask]_zeroshot -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
# Clean ICL testing
python noise_test.py -task [task]_[subtask]_clean -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
# Noisy ICL testing
python noise_test.py -task [task]_[subtask]_[irrelevant|inaccurate]_[easy|medium|hard] -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
Available arguments:
- `task_subtask`: math_base-9, math_base-11, symbolic_equal, symbolic_longer, commonsense
- `method`: basemodel, CD-CoT, smoothllm, selfdenoise, selfpolish, contrastivecot, ISC, SCO, BT
- `noise-type`: irrelevant, inaccurate
- `difficulty`: easy, medium, hard
- `model`: e.g., gpt-3.5-turbo-0125
- `test_num`: e.g., 100
Examples:
# Clean data testing
python noise_test.py -task math_base-9_clean -method basemodel
# Noisy data testing with different configurations
python noise_test.py -task math_base-9_inaccurate_easy -method basemodel
python noise_test.py -task symbolic_longer_irrelevant_easy -method CD-CoT
python noise_test.py -task commonsense_inaccurate_hard -method contrastivecot
Results are written to `./results/{task}/{subtask}/{model}/{method}/`.
Log files are named `log_[ICL_|][n_clean_shots]clean_[noise_[n_noisy_shots][inaccurate|irrelevant]_[fixed|random]_ratio[ratio]|origin]_case[cases_num]_temp[temperature]_n[reasoning_times].json`.
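As an example, a finished run can be inspected with standard-library Python; the directory below is illustrative, and the JSON structure depends on the task and method you ran:

```python
import json
from pathlib import Path

# Illustrative results directory following the layout above.
results_dir = Path("results/math/base-9/gpt-3.5-turbo-0125/basemodel")

for log_file in sorted(results_dir.glob("log_*.json")):
    with log_file.open() as f:
        records = json.load(f)
    # Print the file name and top-level structure before parsing further.
    print(log_file.name, type(records).__name__)
```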
| Category | Parameter | Sub-Parameter | Description | Examples |
|---|---|---|---|---|
| Model | model | | LLM model name | "gpt-3.5-turbo", "gemini-pro", "mixtral", "llama-2-70b" |
| Dataset | dataset | | The dataset used for the experiment | "base_math", "symbolic", "commonsense" |
| | start_num | | The starting index of the experiment | 0 |
| | test_num | | The number of test instances | 200 |
| | batch_size | | The number of instances processed per batch | 1, 5 |
| Task Config | math | subtask | The subtask of NoRa-Math | base-9, base-11 |
| | symbolic | subtask | The subtask of NoRa-Symbolic | equal, longer |
| Generation | use_processed_dataset | | Whether to use a processed dataset or to generate tests from the detailed settings below | True, False |
| | processed_dataset_options | processed_dataset_path | Path to a processed dataset, or a default dataset | a processed dataset path, or one of "default-zeroshot", "default-clean", "default-(irrelevant,inaccurate)-(easy,medium,hard)-(fixed,random)" |
| | | n_shots | Number of shots | 1, 2, 3, 4, 5 |
| | | using_subset | | |
| | raw_dataset_options | if_in_context | Whether to use in-context shots for reasoning | True, False |
| | | n_shots | The number of clean-rationale shots | 0, 1, 2, 3, 4... |
| | | if_noise | Whether noisy shots are included | True, False |
| | | n_noisy_shots | The number of noisy-rationale shots | 1, 2, 3, 4... |
| | | noisy_type | The type of noisy-rationale shot | irrelevant, inaccurate |
| | | noisy_ratio | The probability of inserting a noisy thought after a clean thought | 0-1 |
| | | noise_distribution | random: each clean thought is followed by a noisy thought with probability noisy_ratio; fixed: each shot gets n_clean_thoughts * noisy_ratio noisy thoughts (illustrated in the sketch after this table) | random, fixed |
| | | prefix_context | Whether to put in-context shots into the prompt prefix or to interleave them as a messages list | True, False |
| Method | method | | The method used to process the reasoning | CD-CoT, basemodel, smoothllm, selfdenoise, selfpolish, contrastivecot, ISC, SCO, BT |
| | temperature_reason | | The reasoning temperature (available when method is not CD-CoT) | 0-1 |
| | n_reason | | The number of reasoning repetitions (available when method is not CD-CoT) | 1, 2, 3, 4, 5... |
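To make the `noisy_ratio` and `noise_distribution` semantics concrete, here is a minimal, hypothetical sketch of how noisy thoughts could be injected into a shot's clean thoughts; the noise pool and insertion format are made up, and this is not the repository's actual generator:

```python
import math
import random

# Made-up pool of distractor thoughts for illustration only.
NOISE_POOL = ["An irrelevant aside.", "A distracting fact."]

def inject_noise(clean_thoughts, noisy_ratio, noise_distribution="random"):
    thoughts = []
    if noise_distribution == "random":
        # random: each clean thought is followed by a noisy thought
        # with probability noisy_ratio.
        for thought in clean_thoughts:
            thoughts.append(thought)
            if random.random() < noisy_ratio:
                thoughts.append(random.choice(NOISE_POOL))
    else:
        # fixed: each shot gets exactly n_clean_thoughts * noisy_ratio noisy
        # thoughts, placed after randomly chosen clean thoughts.
        n_noisy = math.floor(len(clean_thoughts) * noisy_ratio)
        positions = set(random.sample(range(len(clean_thoughts)), n_noisy))
        for i, thought in enumerate(clean_thoughts):
            thoughts.append(thought)
            if i in positions:
                thoughts.append(random.choice(NOISE_POOL))
    return thoughts
```

For example, with three clean thoughts and noisy_ratio=0.5, random mode yields anywhere from zero to three noisy thoughts per shot, while fixed mode always yields exactly one.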
If you find our work helpful, please cite our paper:
@inproceedings{zhou2024can,
title={Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?},
author={Zhou, Zhanke and Tao, Rong and Zhu, Jianing and Luo, Yiwen and Wang, Zengmao and Han, Bo},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024}
}