This paper investigates an under-explored challenge in large language models (LLMs): chain-of-thought prompting with noisy rationales, which include irrelevant or inaccurate reasoning thoughts within examples used for in-context learning. We construct the NoRa dataset, tailored to evaluate the robustness of reasoning in the presence of noisy rationales. Our findings on the NoRa dataset reveal a prevalent vulnerability to such noise among current LLMs, with existing robust methods such as self-correction and self-consistency showing limited efficacy. Notably, compared to prompting with clean rationales, GPT-3.5 drops by 1.4%-19.8% in accuracy with irrelevant thoughts and, more drastically, by 2.2%-40.4% with inaccurate thoughts.
Addressing this challenge necessitates external supervision that should be accessible in practice. Here, we propose the method of contrastive denoising with noisy chain-of-thought (CD-CoT). It enhances LLMs’ denoising-reasoning capabilities by contrasting noisy rationales with only one clean rationale, which can be the minimal requirement for denoising-purpose prompting. This method follows a principle of exploration and exploitation: (1) rephrasing and selecting rationales in the input space to achieve explicit denoising and (2) exploring diverse reasoning paths and voting on answers in the output space. Empirically, CD-CoT demonstrates an average improvement of 17.8% in accuracy over the base model and shows significantly stronger denoising capabilities than baseline methods.
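As a rough illustration, the exploration-and-exploitation loop can be sketched in a few lines of Python. This is a schematic of the procedure described above, not the repository's implementation; the `rephrase` and `reason` callables, the candidate-selection step, and the default counts are all placeholders:

```python
from collections import Counter
from typing import Callable, List

def cd_cot(
    question: str,
    noisy_rationales: List[str],
    clean_rationale: str,
    rephrase: Callable[[str, str], str],      # (noisy, clean) -> denoised rationale
    reason: Callable[[str, List[str]], str],  # (question, examples) -> answer
    n_rephrase: int = 3,
    n_select: int = 2,
    n_vote: int = 5,
) -> str:
    # (1) Input space: rephrase each noisy rationale by contrasting it with the
    # single clean rationale (explicit denoising), then keep a few candidates.
    candidates = [
        rephrase(noisy, clean_rationale)
        for noisy in noisy_rationales
        for _ in range(n_rephrase)
    ]
    selected = candidates[:n_select]  # placeholder for the actual selection step

    # (2) Output space: sample diverse reasoning paths and vote on the answer.
    answers = [reason(question, selected) for _ in range(n_vote)]
    return Counter(answers).most_common(1)[0][0]
```

The key constraint reflected here is that only one clean rationale is required: it serves as the contrast target for every rephrasing call.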
Choose one of the following installation commands based on your OpenAI API version:
# For the new OpenAI API (openai >= 1.0):
pip install openai requests pandas nltk pyyaml scikit-learn tiktoken python-dotenv
# For the legacy OpenAI API:
pip install openai==0.28 requests pandas nltk pyyaml scikit-learn tiktoken python-dotenv
You can set up your OpenAI API credentials using either environment variables or a configuration file:
export OPENAI_API_KEY=[YOUR_API_KEY_HERE]
# if you have multiple API keys
export OPENAI_API_KEY=[key1]:[key2]:[key3]...
# if you require a specific API base
export OPENAI_API_BASE=[YOUR_API_BASE_HERE]
Alternatively, create a .env file in the project root directory:
OPENAI_API_KEY=[YOUR_API_KEY_HERE]
# if you have multiple keys
OPENAI_API_KEY=[key1]:[key2]:[key3]...
# if you require a specific API base
OPENAI_API_BASE=[YOUR_API_BASE_HERE]
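For reference, here is a minimal sketch of how the colon-separated key list could be read back at runtime, assuming the repository relies on `python-dotenv` (installed above); the helper below is illustrative, not the repository's actual loader:

```python
import os

from dotenv import load_dotenv  # from the python-dotenv package installed above

load_dotenv()  # reads .env from the project root, if present

def get_api_keys():
    """Hypothetical helper: split the colon-separated OPENAI_API_KEY list."""
    raw = os.environ.get("OPENAI_API_KEY", "")
    return [key for key in raw.split(":") if key]

print(f"Loaded {len(get_api_keys())} API key(s)")
```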
The repository is organized as follows:
- `data/`: Raw and preprocessed datasets
  - Original datasets used for generation
  - Preprocessed datasets ready for experiments
- `data_process/`: Libraries and utilities for dataset processing and manipulation
- `method/`: Implementations of the noise-handling methods, i.e., the various approaches for handling noisy rationales in chain-of-thought prompting
- `llm_model/`: Interfaces, wrappers, and utilities for interacting with various LLMs
- `noise_test.py`: Main experiment script for testing noisy-rationale handling
- `config.yml`: Configuration file for experiment settings
  - Model parameters
  - Dataset options
  - Testing configurations
NoRa supports three task categories with corresponding subtasks:
- Math: base-9, base-11
- Symbolic: equal, longer
- Commonsense: (no subtasks)
Configure experiment settings in `config.yml` and run:
python noise_test.py
# Zero-shot testing
python noise_test.py -task [task]_[subtask]_zeroshot -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
# Clean ICL testing
python noise_test.py -task [task]_[subtask]_clean -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
# Noisy ICL testing
python noise_test.py -task [task]_[subtask]_[irrelevant|inaccurate]_[easy|medium|hard] -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
Available arguments:
- `task_subtask`: math_base-9, math_base-11, symbolic_equal, symbolic_longer, commonsense
- `method`: basemodel, CD-CoT, smoothllm, selfdenoise, selfpolish, contrastivecot, ISC, SCO, BT
- `noise-type`: irrelevant, inaccurate
- `difficulty`: easy, medium, hard
- `model`: e.g., gpt-3.5-turbo-0125
- `test_num`: e.g., 100
Examples:
# Clean data testing
python noise_test.py -task math_base-9_clean -method basemodel
# Noisy data testing with different configurations
python noise_test.py -task math_base-9_inaccurate_easy -method basemodel
python noise_test.py -task symbolic_longer_irrelevant_easy -method CD-CoT
python noise_test.py -task commonsense_inaccurate_hard -method contrastivecot
Results are written to `./results/{task}/{subtask}/{model}/{method}/`.
Log files are named `log_[ICL_|][n_clean_shots]clean_[noise_[n_noisy_shots][inaccurate|irrelevant]_[fixed|random]_ratio[ratio]|origin]_case[cases_num]_temp[temperature]_n[reasoning_times].json`.
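As an example, a finished run can be inspected with standard-library Python; the directory below is illustrative, and the JSON structure depends on the task and method you ran:

```python
import json
from pathlib import Path

# Illustrative results directory following the layout above.
results_dir = Path("results/math/base-9/gpt-3.5-turbo-0125/basemodel")

for log_file in sorted(results_dir.glob("log_*.json")):
    with log_file.open() as f:
        records = json.load(f)
    # Print the file name and top-level structure before parsing further.
    print(log_file.name, type(records).__name__)
```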
| Category | Parameter | Sub-Parameter | Description | Examples |
|---|---|---|---|---|
| Model | model | | LLM model name | "gpt-3.5-turbo", "gemini-pro", "mixtral", "llama-2-70b" |
| Dataset | dataset | | The dataset used for the experiment | "base_math", "symbolic", "commonsense" |
| | start_num | | The starting index of the experiment | 0 |
| | test_num | | The number of test instances | 200 |
| | batch_size | | The number of instances processed per batch | 1, 5 |
| Task Config | math | subtask | The subtask of NoRa-Math | base-9, base-11 |
| | symbolic | subtask | The subtask of NoRa-Symbolic | equal, longer |
| Generation | use_processed_dataset | | Whether to use a processed dataset or to generate tests from the detailed settings below | True, False |
| | processed_dataset_options | processed_dataset_path | Path to a processed dataset, or a default dataset | a processed dataset path, or one of "default-zeroshot", "default-clean", "default-(irrelevant,inaccurate)-(easy,medium,hard)-(fixed,random)" |
| | | n_shots | Number of shots | 1, 2, 3, 4, 5 |
| | | using_subset | | |
| | raw_dataset_options | if_in_context | Whether to use in-context shots for reasoning | True, False |
| | | n_shots | The number of clean-rationale shots | 0, 1, 2, 3, 4... |
| | | if_noise | Whether noisy shots are included | True, False |
| | | n_noisy_shots | The number of noisy-rationale shots | 1, 2, 3, 4... |
| | | noisy_type | The type of noisy-rationale shot | irrelevant, inaccurate |
| | | noisy_ratio | The probability of inserting a noisy thought after a clean thought | 0-1 |
| | | noise_distribution | random: each clean thought is followed by a noisy thought with probability noisy_ratio; fixed: each shot gets n_clean_thoughts * noisy_ratio noisy thoughts (illustrated in the sketch after this table) | random, fixed |
| | | prefix_context | Whether to put in-context shots into the prompt prefix or to interleave them as a messages list | True, False |
| Method | method | | The method used to process the reasoning | CD-CoT, basemodel, smoothllm, selfdenoise, selfpolish, contrastivecot, ISC, SCO, BT |
| | temperature_reason | | The reasoning temperature (available when method is not CD-CoT) | 0-1 |
| | n_reason | | The number of reasoning repetitions (available when method is not CD-CoT) | 1, 2, 3, 4, 5... |
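To make the `noisy_ratio` and `noise_distribution` semantics concrete, here is a minimal, hypothetical sketch of how noisy thoughts could be injected into a shot's clean thoughts; the noise pool and insertion format are made up, and this is not the repository's actual generator:

```python
import math
import random

# Made-up pool of distractor thoughts for illustration only.
NOISE_POOL = ["An irrelevant aside.", "A distracting fact."]

def inject_noise(clean_thoughts, noisy_ratio, noise_distribution="random"):
    thoughts = []
    if noise_distribution == "random":
        # random: each clean thought is followed by a noisy thought
        # with probability noisy_ratio.
        for thought in clean_thoughts:
            thoughts.append(thought)
            if random.random() < noisy_ratio:
                thoughts.append(random.choice(NOISE_POOL))
    else:
        # fixed: each shot gets exactly n_clean_thoughts * noisy_ratio noisy
        # thoughts, placed after randomly chosen clean thoughts.
        n_noisy = math.floor(len(clean_thoughts) * noisy_ratio)
        positions = set(random.sample(range(len(clean_thoughts)), n_noisy))
        for i, thought in enumerate(clean_thoughts):
            thoughts.append(thought)
            if i in positions:
                thoughts.append(random.choice(NOISE_POOL))
    return thoughts
```

For example, with three clean thoughts and noisy_ratio=0.5, random mode yields anywhere from zero to three noisy thoughts per shot, while fixed mode always yields exactly one.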
If you find our work helpful, please cite our paper:
@inproceedings{zhou2024can,
title={Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?},
author={Zhou, Zhanke and Tao, Rong and Zhu, Jianing and Luo, Yiwen and Wang, Zengmao and Han, Bo},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024}
}