
Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?



*Figure: Exemplars of noisy questions and noisy rationales (our new research problem).*

Abstract

This paper investigates an under-explored challenge in large language models (LLMs): chain-of-thought prompting with noisy rationales, which include irrelevant or inaccurate reasoning thoughts within examples used for in-context learning. We construct the NoRa dataset, which is tailored to evaluate the robustness of reasoning in the presence of noisy rationales. Our findings on the NoRa dataset reveal a prevalent vulnerability to such noise among current LLMs, with existing robust methods like self-correction and self-consistency showing limited efficacy. Notably, compared to prompting with clean rationales, GPT-3.5 drops by 1.4%-19.8% in accuracy with irrelevant thoughts and more drastically by 2.2%-40.4% with inaccurate thoughts.

Addressing this challenge necessitates external supervision that should be accessible in practice. Here, we propose the method of contrastive denoising with noisy chain-of-thought (CD-CoT). It enhances LLMs’ denoising-reasoning capabilities by contrasting noisy rationales with only one clean rationale, which can be the minimal requirement for denoising-purpose prompting. This method follows a principle of exploration and exploitation: (1) rephrasing and selecting rationales in the input space to achieve explicit denoising and (2) exploring diverse reasoning paths and voting on answers in the output space. Empirically, CD-CoT demonstrates an average improvement of 17.8% in accuracy over the base model and shows significantly stronger denoising capabilities than baseline methods.
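
To make the method concrete, the following is a minimal Python sketch of the CD-CoT loop described above. It is not the repository's implementation: `llm` stands in for any text-completion callable, the prompts are hypothetical paraphrases of the paper's description, and the candidate-selection step is deliberately simplified.

```python
# Illustrative sketch of CD-CoT, NOT the repository's implementation.
from collections import Counter
from typing import Callable, List

def cd_cot(llm: Callable[[str], str], clean_demo: str, noisy_demos: List[str],
           question: str, n_rephrase: int = 5, n_select: int = 2,
           n_reason: int = 5) -> str:
    # (1) Input-space denoising: rephrase each noisy rationale by
    # contrasting it with the single clean rationale.
    candidates: List[str] = []
    for noisy in noisy_demos:
        for _ in range(n_rephrase):
            prompt = (f"Here is a correct example:\n{clean_demo}\n\n"
                      "Rewrite the following example so its reasoning is as "
                      f"clean and accurate as the one above:\n{noisy}")
            candidates.append(llm(prompt))
    # Keep a few rephrased rationales (a stand-in criterion; the paper
    # selects rationales more carefully).
    selected = candidates[:n_select]
    # (2) Output-space exploration: sample several reasoning paths with the
    # denoised demonstrations and majority-vote the final answer.
    context = "\n\n".join([clean_demo] + selected)
    answers = [llm(f"{context}\n\nQ: {question}\nA:") for _ in range(n_reason)]
    return Counter(answers).most_common(1)[0][0]
```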

Getting Started

1. Install Dependencies

Choose one of the following installation commands based on the version of the OpenAI Python SDK you use:

## For the new OpenAI Python SDK (openai>=1.0):
pip install openai requests pandas nltk pyyaml scikit-learn tiktoken python-dotenv

## For the legacy OpenAI Python SDK (openai==0.28):
pip install openai==0.28 requests pandas nltk pyyaml scikit-learn tiktoken python-dotenv

2. Configure OpenAI API

You can set up your OpenAI API credentials using either environment variables or a configuration file:

Option A: Using Environment Variables

export OPENAI_API_KEY=[YOUR_API_KEY_HERE]
# if you have multiple API keys
export OPENAI_API_KEY=[key1]:[key2]:[key3]...
# if you require a specific API base
export OPENAI_API_BASE=[YOUR_API_BASE_HERE]

Option B: Using Configuration File

Create a .env file in the project root directory:

OPENAI_API_KEY=[YOUR_API_KEY_HERE]
# if you have multiple keys
OPENAI_API_KEY=[key1]:[key2]:[key3]...
# if you require a specific API base
OPENAI_API_BASE=[YOUR_API_BASE_HERE]
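
Either way, the variables are read at runtime. Below is a minimal sketch of consuming them in Python via the python-dotenv dependency listed above; the repository's actual loading logic may differ, and the colon-splitting convention is only the one implied by the `[key1]:[key2]` format:

```python
# A sketch, not the repository's loader: assumes only the colon-separated
# OPENAI_API_KEY format documented above.
import os
from dotenv import load_dotenv  # provided by the python-dotenv dependency

load_dotenv()  # reads a .env file from the working directory, if present
api_keys = (os.getenv("OPENAI_API_KEY") or "").split(":")
api_base = os.getenv("OPENAI_API_BASE")  # None unless explicitly set
print(f"loaded {len(api_keys)} API key(s); base = {api_base or 'default'}")
```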

Project Structure

The repository is organized as follows:

  • data/: Contains raw datasets and preprocessed datasets

    • Original datasets used for generation
    • Pre-processed datasets ready for experiments
  • data_process/: Libraries and utilities for dataset processing and manipulation

  • method/: Implementation of different noise handling methods

    • Various approaches for handling noisy rationales in chain-of-thought prompting
  • llm_model/: Interfaces for different large language models

    • Wrappers and utilities for interacting with various LLMs
  • noise_test.py: Main experiment script for testing noisy-rationale handling

  • config.yml: Configuration file for experiment settings

    • Model parameters
    • Dataset options
    • Testing configurations

Run Experiments

NoRa supports three task categories with corresponding subtasks:

  • Math: base-9, base-11
  • Symbolic: equal, longer
  • Commonsense: (no subtasks)

Running Options

Option 1: Using Config File (Recommended)

Configure experiment settings in config.yml and run:

python noise_test.py

Option 2: Command Line Arguments (for quick start)

# Zero-shot testing
python noise_test.py -task [task]_[subtask]_zeroshot -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
# Clean ICL testing
python noise_test.py -task [task]_[subtask]_clean -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
# Noisy ICL testing
python noise_test.py -task [task]_[subtask]_[irrelevant|inaccurate]_[easy|medium|hard] -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]

Available arguments:

  • task_subtask: math_base-9, math_base-11, symbolic_equal, symbolic_longer, commonsense
  • method: basemodel, CD-CoT, smoothllm, selfdenoise, selfpolish, contrastivecot, ISC, SCO, BT
  • noise-type: irrelevant, inaccurate
  • difficulty: easy, medium, hard
  • model: e.g., gpt-3.5-turbo-0125
  • test_num: e.g., 100

Examples:

# Clean data testing
python noise_test.py -task math_base-9_clean -method basemodel

# Noisy data testing with different configurations
python noise_test.py -task math_base-9_inaccurate_easy -method basemodel
python noise_test.py -task symbolic_longer_irrelevant_easy -method CD-CoT
python noise_test.py -task commonsense_inaccurate_hard -method contrastivecot
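
For larger sweeps, the same CLI can be driven programmatically. The sketch below only shells out to noise_test.py with the argument format documented above; the particular sweep values are arbitrary examples:

```python
# Sweep the documented noise types and difficulty levels for one task.
# Assumes noise_test.py and its CLI flags exactly as documented above.
import subprocess

for noise in ["irrelevant", "inaccurate"]:
    for level in ["easy", "medium", "hard"]:
        subprocess.run(
            ["python", "noise_test.py",
             "-task", f"math_base-9_{noise}_{level}",
             "-method", "CD-CoT",
             "-model", "gpt-3.5-turbo-0125",
             "-test_num", "100"],
            check=True,  # stop the sweep if any run fails
        )
```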

Results

Results will appear in ./results/{task}/{subtask}/{model}/{method}/

Each log file is named log_[ICL_|][n_clean_shots]clean_[noise_[n_noisy_shots][inaccurate|irrelevant]_[fixed|random]_ratio[ratio]|origin]_case[cases_num]_temp[temperature]_n[reasoning_times].json
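
The logs can then be globbed and loaded as JSON for aggregation. This sketch assumes only the directory layout and file extension above; the internal schema of each log is not documented here, so the snippet just prints the top-level structure (the example path components are illustrative):

```python
# Collect result logs under the documented directory layout.
import json
from pathlib import Path

# results/{task}/{subtask}/{model}/{method}/log_*.json
for log_path in Path("results").glob("math/base-9/gpt-3.5-turbo-0125/*/log_*.json"):
    with open(log_path) as f:
        log = json.load(f)
    # The log schema is experiment-dependent; inspect it before parsing further.
    print(log_path, type(log).__name__,
          list(log)[:5] if isinstance(log, dict) else len(log))
```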

Config Introduction

| Category | Parameter | Sub-Parameter | Description | Examples |
|----------|-----------|---------------|-------------|----------|
| Model | model | | LLM model name | "gpt-3.5-turbo", "gemini-pro", "mixtral", "llama-2-70b" |
| Dataset | dataset | | the dataset used for the experiment | "base_math", "symbolic", "commonsense" |
| | start_num | | the starting index of the experiment | 0 |
| | test_num | | the number of test instances | 200 |
| | batch_size | | the amount of data processed per batch | 1, 5 |
| Task Config | math | subtask | the subtask of NoRa-Math | base-9, base-11 |
| | symbolic | subtask | the subtask of NoRa-Symbolic | equal, longer |
| Generation | use_processed_dataset | | whether to use a processed dataset, or to generate test data from the detailed settings below | True, False |
| | processed_dataset_options | processed_dataset_path | path to a processed dataset, or one of the default datasets | a dataset path, or one of "default-zeroshot", "default-clean", "default-(irrelevant,inaccurate)-(easy,medium,hard)-(fixed,random)" |
| | | n_shots | the number of shots | 1, 2, 3, 4, 5 |
| | | using_subset | | |
| | raw_dataset_options | if_in_context | whether to use in-context shots for reasoning | True, False |
| | | n_shots | the number of clean-rationale shots | 0, 1, 2, 3, 4, ... |
| | | if_noise | whether noisy shots exist | True, False |
| | | n_noisy_shots | the number of noisy-rationale shots | 1, 2, 3, 4, ... |
| | | noisy_type | the type of noisy-rationale shots | irrelevant, inaccurate |
| | | noisy_ratio | the probability of inserting a noisy thought after a clean thought | 0-1 |
| | | noise_distribution | random: each clean thought is followed by a noisy thought with probability noisy_ratio; fixed: each shot has n_clean_thought * ratio noisy thoughts | random, fixed |
| | | prefix_context | whether to put in-context shots into the prompt prefix or to mix them into a messages list | True, False |
| | method | | the method used to process the reasoning | CD-CoT, basemodel, smoothllm, selfdenoise, selfpolish, contrastivecot, ISC, SCO, BT |
| | temperature_reason | | the reasoning temperature (available when method is not CD-CoT) | 0-1 |
| | n_reason | | the number of reasoning repetitions (available when method is not CD-CoT) | 1, 2, 3, 4, 5, ... |
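
To adjust these settings programmatically, config.yml can be read and rewritten with PyYAML (a listed dependency). The key names below come from the table above, but the shipped file may nest them differently, so treat this as illustrative only:

```python
# A sketch of editing config.yml with PyYAML; key placement is assumed
# from the table above and may not match the shipped file's nesting.
import yaml

with open("config.yml") as f:
    cfg = yaml.safe_load(f)

cfg["test_num"] = 100      # number of test instances
cfg["method"] = "CD-CoT"   # one of the methods listed above

with open("config.yml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```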

Citation

If you find our work helpful, please kindly cite our paper:

@inproceedings{zhou2024can,
  title={Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?},
  author={Zhou, Zhanke and Tao, Rong and Zhu, Jianing and Luo, Yiwen and Wang, Zengmao and Han, Bo},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}