[EMNLP 2024] This is the official implementation of the paper "A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners" in PyTorch.
10-11-2024 We presented the work at the Wharton AI & Analytics Initiative's Research & Education Symposium.
10-07-2024 Great to see that Apple's trending paper GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models follows up on our work to question the reasoning capabilities of LLMs. It references our work multiple times, while generalizing mathematical reasoning problems into symbolic templates. Definitely worth checking out both!
News! 09-21-2024 A short version of this work has been accepted to the EMNLP 2024 GenBench Workshop.
News! 09-20-2024 The full paper has been accepted to the EMNLP 2024 Main Conference.
09-20-2024 We presented the work at the Penn ASSET & Warren Center research mixer.
07-08-2024 We released a short video on Twitter. Enjoy!
06-17-2024 A short version of this work has been accepted to the ICML 2024 Workshop on LLMs and Cognition.
06-16-2024 We released the paper on arXiv.
Large language models (LLMs) have achieved remarkable progress in understanding and generating human-like text, but there is ongoing debate about whether LLMs possess genuine reasoning capabilities. This work reconceptualizes the evaluation of LLMs' reasoning capabilities into a general and rigorous testing framework with statistical guarantees.
We say that an LLM is subject to token bias in a reasoning task if systematic changes to some or all tokens in the task description, while keeping the underlying logic intact, allow us to predict the direction of the shift in the model's output. A strong token bias suggests that the LLM is relying on superficial patterns in the input rather than truly understanding the underlying reasoning task, leading to brittle performance that fails to generalize well. Let us look at the following classic "twenty-five horses" problem in graph theory:
You want to find the fastest 3 horses in a group of 25 horses. You can only race 5 horses at a time. You don't have a stopwatch, so you can only know the ranking of each horse within each race. How many races do you need?
GPT-4 and Claude-3-opus achieve accuracies of nearly 98.5% and 40.5%, respectively, on this question. However, simply changing "horses" to "bunnies", a perturbation that shouldn't affect the logical essence, systematically decreases their accuracy to 85.0% and 30.0%, respectively. Further changing "25" to other values decreases their accuracy to 46.0% and 24.0%. These observations indicate strong token biases toward the frequently used tokens "horses" and "25" in such problems, and suggest that the LLMs do not genuinely understand how to solve them.
You want to find the fastest 3 bunnies in a group of 25 bunnies. You can only race 5 bunnies at a time. You don't have a stopwatch, so you can only know the ranking of each bunny within each race. How many races do you need?
We take the classic Linda Problem in Psychology as another example. Below is the original problem statement.
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?
(a) Linda is a bank teller.
(b) Linda is a bank teller and is active in the feminist movement.
Experiments in behavioral psychology reveal that people typically believe the second option is more likely than the first, but this contradicts the conjunction rule of probability. Advanced LLMs like GPT-4 can typically recognize this fallacy, since it is a classic problem that appears frequently in the cognitive science literature. However, altering seemingly irrelevant tokens, such as changing the name "Linda" -> "Luna" in the problem statement while maintaining the same logical structure, surprisingly confuses most LLMs. With one-shot learning, the accuracy of GPT-4 and Claude-3-opus decreases from 100.0% to 72.0% and from 95.0% to 32.0%, respectively (see the paper for the detailed experimental setups).
Luna is 29 years old, married, deeply passionate about environmental conservation and transgender rights, and volunteers their weekends at local park clean-ups. They studied physics and applied math in college, and held several campaigns to reduce the campus's carbon footprint. Which is more probable?
(a) Luna is an assistant professor in aerospace engineering and is an active member of an environmental advocacy group.
(b) Luna is an assistant professor in aerospace engineering.
In our paper, we explore many other token biases in logical reasoning, set theory, and mathematical reasoning problems. We reconceptualize the evaluation of reasoning capabilities into a general and rigorous statistical testing framework, moving beyond accuracy. We conclude, with statistical guarantees, that LLMs do not consistently apply genuine reasoning in their decision-making process, but primarily rely on token bias for response generation. Therefore, we raise concerns about the extent to which LLMs truly engage in reasoning. Any robust evaluation of an LLM's generalization should account for the fundamental impact of token bias hidden in current benchmark problems.
All images are generated by OpenAI GPT-4o. When we requested 'lop-eared bunnies', the model even displayed a visual token bias by generating bunnies with four ears (both lop and erect), suggesting that it associates the term 'bunnies' with two erect ears without genuine logical understanding.
All twenty-five bunnies above (🐰 x25) will be happy if you cite our work. Thank you!
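To make the idea of testing beyond raw accuracy concrete, here is a minimal sketch of a one-sided two-proportion z-test that asks whether accuracy on the original problems is significantly higher than on the perturbed ones. This is a generic textbook test that assumes independent samples, not necessarily the procedure used in the paper; the function name and the sample size of 200 per condition are assumptions made only for illustration.

```python
import math


def one_sided_two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Test H0: p_a <= p_b against H1: p_a > p_b.

    Generic pooled z-test for two independent proportions; an illustrative
    stand-in, not the paper's exact statistical framework.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    # Upper-tail probability of a standard normal: P(Z >= z)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 200 trials per condition is an assumption.
    # 197/200 and 170/200 correspond to 98.5% and 85.0% accuracy.
    z, p_value = one_sided_two_proportion_z_test(197, 200, 170, 200)
    print(f"z = {z:.2f}, one-sided p-value = {p_value:.2e}")
```

A small p-value here would indicate that the drop after the token perturbation is unlikely to be due to sampling noise alone, which is the kind of statement accuracy numbers by themselves cannot support.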
@article{jiang2024peek,
title={A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners},
author={Jiang, Bowen and Xie, Yangxinyu and Hao, Zhuoqun and Wang, Xiaomeng and Mallick, Tanwi and Su, Weijie J and Taylor, Camillo J and Roth, Dan},
journal={arXiv preprint arXiv:2406.11050},
year={2024}
}
- Add the twenty-five horses problem to the paper
- Evaluate the new GPT-o1 reasoning model
Please check requirements.txt. You can run the following commands to create a virtual environment and install all the requirements:
python -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
We provide our synthetic dataset under data/, which contains a comprehensive set of logical-fallacy problems. The dataset file is in JSON format, and each item is a dictionary containing question_id, question, target_answer, and incorrect_answer. You can also follow the instructions below to generate more synthetic data on the fly.
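As a quick sanity check of the format described above, the following sketch loads a dataset file and verifies that every item carries the four documented fields. It assumes the top level of the JSON file is a list of dictionaries; the helper names load_dataset and missing_keys are hypothetical and not part of this repository.

```python
import json
from pathlib import Path

# The four fields documented for each dataset item.
REQUIRED_KEYS = ("question_id", "question", "target_answer", "incorrect_answer")


def missing_keys(item):
    """Return the documented fields that are absent from one item."""
    return [key for key in REQUIRED_KEYS if key not in item]


def load_dataset(path):
    """Load a dataset file (assumed to be a JSON list of dicts) and validate it."""
    with Path(path).open(encoding="utf-8") as f:
        items = json.load(f)
    for index, item in enumerate(items):
        missing = missing_keys(item)
        if missing:
            raise ValueError(f"item {index} is missing fields: {missing}")
    return items


if __name__ == "__main__":
    # File name taken from the inference example in this README;
    # its location under data/ is an assumption.
    example_path = Path("data/synthetic_dataset_linda_original_gold.json")
    if example_path.exists():
        dataset = load_dataset(example_path)
        print(f"Loaded {len(dataset)} problems")
        print(dataset[0]["question"])
```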
Always set up OpenAI ChatGPT models. Please follow its Developer quickstart to set up your OpenAI API, create a new api_tokens/openai_key.txt file, and copy and paste your API key into it.
To use Google Gemini models with an API for inference, follow the instructions on Google Vertex AI in the Try Gemini 1.0 Pro (Python) section. Note that your school's Gmail account may not allow you to make payments.
- Step 1: According to their instructions, you need to first install the Vertex AI client libraries, create a project with a project ID, enable the Vertex AI API, create a service account, and generate your account key. You don't need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS, since our code in query_llm.py already does that for you.
- Step 2: Install or update the Vertex AI SDK for Python.
- Step 3: Authenticate to Vertex AI and set up Application Default Credentials.
  - Follow the Local development environment - Provide user credentials for your Google Account section to install and initialize the gcloud CLI. This step will download a folder google-cloud-sdk to your project's top directory.
  - After installation, run
    gcloud init
    to initialize the gcloud CLI. You will be able to choose your account and project ID. Create a new api_tokens/gemini_project_id.txt file, and copy and paste your project ID into it.
  - To create your credential file, run
    gcloud auth application-default login
    You will see a message like
    Credentials saved to file: [/path/to/your/home/.config/gcloud/application_default_credentials.json]
  - Because of the credential file path we set in our config.yaml, run
    mv /path/to/your/home/.config/gcloud/application_default_credentials.json google-cloud-sdk/google_gemini_credential.json
To use Meta Llama models with an API for inference, follow the instructions in the Running Llama 3 with Python section of Replicate's Run Llama 3 with an API guide to set up your API tokens, create a new api_tokens/llama_key.txt file, and copy and paste your tokens into it.
To use Anthropic Claude models with an API for inference, follow its Quickstart Guide to install the Anthropic Python SDK, set up an account with API access, get your API key, create a new api_tokens/claude_key.txt file, and copy and paste your key into it. You don't need to set the environment variable ANTHROPIC_API_KEY.
To use Mistral models with an API for inference, follow its Quickstart to install the mistralai library, set up an account with API access, get your API key, create a new api_tokens/mistral_key.txt file, and copy and paste your key into it. You don't need to set the environment variable MISTRAL_API_KEY.
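The key files above all follow the same pattern, so a quick way to confirm they are in place is a small reader like the sketch below. The helper load_api_key is hypothetical and is not the loader used in query_llm.py; it only assumes the api_tokens/<provider>_key.txt naming described above (the Gemini project ID lives in a differently named file and is not covered here).

```python
from pathlib import Path


def load_api_key(provider, token_dir="api_tokens"):
    """Read the key stored in <token_dir>/<provider>_key.txt.

    Hypothetical helper for checking the setup; provider is one of
    "openai", "llama", "claude", or "mistral" per the file names above.
    """
    key_path = Path(token_dir) / f"{provider}_key.txt"
    if not key_path.is_file():
        raise FileNotFoundError(f"expected an API key file at {key_path}")
    # Strip the trailing newline that editors and copy-paste often add.
    key = key_path.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{key_path} is empty")
    return key


if __name__ == "__main__":
    for provider in ("openai", "llama", "claude", "mistral"):
        try:
            load_api_key(provider)
            print(f"{provider}: key file found")
        except (FileNotFoundError, ValueError) as error:
            print(f"{provider}: {error}")
```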
The command-line argument parser accepts the following arguments:
- --model to select the LLM for inference. The list below was last updated on 06-29-2024, but our code should be compatible with more recent model names.
  - OpenAI ChatGPT family. Check OpenAI's continuous model upgrades.
    - gpt3.5 or equivalently gpt-3.5-turbo, gpt-3.5-turbo-0125
    - gpt-3.5-turbo-1106
    - gpt-3.5-turbo-0613
    - gpt-4o
    - gpt4 or equivalently gpt-4-turbo, gpt-4-turbo-2024-04-09
    - gpt-4-0125-preview
    - gpt-4-1106-preview
    - gpt-4-0613
  - Google Gemini family. Check Gemini model versions and lifecycle. Note that Google currently imposes a relatively low requests-per-minute limit on API usage, so you may encounter related errors when running the inference code.
    - gemini or equivalently gemini-1.0-pro, gemini-1.0-pro-002
    - gemini-1.0-pro-001
    - gemini-1.5-pro-preview-0409
  - Meta Llama family. Check Choosing which model to use for Llama-3 and Llama-2.
    - llama or equivalently llama3-70b, meta-llama-3-70b-instruct
    - llama3-8b or equivalently meta-llama-3-8b-instruct
    - llama-2-70b-chat
    - llama-2-13b-chat
    - llama-2-7b-chat
  - Anthropic Claude family. Check Models overview.
    - claude or equivalently claude-3-opus-20240229
    - claude-3-sonnet-20240229
    - claude-3-haiku-20240307
  - Mistral family. Check API versioning.
    - mistral or equivalently mistral-large-latest, mistral-large-2402
    - mistral-medium-latest or equivalently mistral-medium-2312
    - mistral-small-latest or equivalently mistral-small-2402
    - open-mixtral-8x22b or equivalently open-mixtral-8x22b-2404
    - open-mixtral-8x7b or equivalently mistral-small-2312
    - open-mistral-7b or equivalently mistral-tiny-2312
- --task to specify data to generate synthetic datasets or inference to evaluate the LLM's ability to answer the questions.
- --verbose to print detailed data information and model responses during inference.
- [For Data Generation Only] --fallacy to select the type of logical fallacy. We currently support linda for the Linda Problem and its variants and sets for the syllogistic problems.
- [For Data Generation Only] --gen_mode to select the mode of generating the synthetic dataset when task is data. Options are
  - baseline: simple in-context learning with limited instructions
  - control: step-by-step guidance to generate both gold samples and random samples with irrelevant info
- [For Data Generation Only] --variant to select the variant of the Linda problems, such as the default original, variant_one, variant_two, ..., variant_six. Detailed information about each variant can be found in the def linda_problem() function in prompts.py. Include this argument iff --fallacy is linda.
- [For Data Generation Only] --conn to select the logical connecting word, such as because, sothat, or to, used to generate new data. Add this argument iff --fallacy is linda and --variant is variant_one or variant_two.
- [For Data Generation Only] --n to set the number of synthetic data problems to generate.
- [For Inference Only] --data_file to set the data file path for inference.
- [For Inference Only] --eval_mode to set the evaluation mode for the model to answer questions. Options are
  - baseline for direct prompting
  - zs_cot for zero-shot chain-of-thought (CoT) prompting
  - os for one-shot in-context learning (ICL) prompting with the original Linda Problem (default)
  - os_cot for one-shot ICL plus CoT prompting
  - os_bob for one-shot ICL prompting but with a rephrased Bob Problem
  - os_bob_cot for one-shot ICL prompting plus CoT but with a rephrased Bob Problem
  - os_incorrect for one-shot ICL but with an incorrect answer and a rephrased Bob Problem
  - os_incorrect_cot for one-shot ICL plus CoT but with an incorrect answer and a rephrased Bob Problem
  - fs for few-shot ICL prompting
  - fs_cot for few-shot ICL plus CoT prompting
  - weak_control_zs_cot for weakly controlled zero-shot CoT prompting, leaking the hint that it is a Linda Problem but without detailed instructions
  - weak_control_os_cot for weakly controlled one-shot CoT prompting, leaking the hint that it is a Linda Problem but without detailed instructions
  - control_zs_cot for controlled zero-shot CoT prompting, leaking the hint that it is a Linda Problem with detailed and carefully curated instructions
  - control_os_cot for controlled one-shot CoT prompting, leaking the hint that it is a Linda Problem with detailed and carefully curated instructions
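The arguments listed above map naturally onto Python's argparse module. The sketch below is an illustrative reconstruction from this README only, not the actual parser in main.py: the build_parser name is hypothetical, and apart from the two documented defaults (original for --variant and os for --eval_mode) every default is an assumption.

```python
import argparse

# Evaluation modes as listed in this README.
EVAL_MODES = [
    "baseline", "zs_cot", "os", "os_cot", "os_bob", "os_bob_cot",
    "os_incorrect", "os_incorrect_cot", "fs", "fs_cot",
    "weak_control_zs_cot", "weak_control_os_cot",
    "control_zs_cot", "control_os_cot",
]


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(description="Token bias experiments (sketch)")
    parser.add_argument("--model", type=str, help="LLM name, e.g. gpt3.5 or claude")
    parser.add_argument("--task", choices=["data", "inference"])
    parser.add_argument("--verbose", action="store_true")
    # Data generation only
    parser.add_argument("--fallacy", choices=["linda", "sets"])
    parser.add_argument("--gen_mode", choices=["baseline", "control"])
    parser.add_argument("--variant", type=str, default="original")
    parser.add_argument("--conn", type=str, help="e.g. because, sothat, or to")
    parser.add_argument("--n", type=int, help="number of problems to generate")
    # Inference only
    parser.add_argument("--data_file", type=str)
    parser.add_argument("--eval_mode", choices=EVAL_MODES, default="os")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Restricting --eval_mode and --gen_mode with choices makes a mistyped option fail immediately instead of partway through a paid API run.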
For example, you can run
python main.py --model gpt3.5 --task data --fallacy linda --gen_mode control --variant original --n 100 --verbose
in the command line and adjust model, fallacy, gen_mode, variant, and n accordingly. All the other hyper-parameters can be set in config.yaml.
Generated files will be saved to the data/ directory.
To start the inference, run
python main.py --model gpt3.5 --task inference --fallacy linda --eval_mode os_cot --data_file synthetic_dataset_linda_original_gold.json --verbose
in the command line and adjust model, eval_mode, and data_file accordingly.
To efficiently run the evaluation with multiple prompting methods, models, and/or data files in parallel, please modify the number of GPU devices available and adjust the code in run.sh. Then run
bash run.sh
All results and final accuracies will be automatically saved to the outputs/ directory.
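If you want to preview which runs a sweep would launch before editing run.sh, the sketch below enumerates one inference command per combination of model, evaluation mode, and data file. It is a hypothetical helper, not a replacement for run.sh: it only prints commands in the form shown in the inference example above and does not execute anything or handle GPU assignment.

```python
import itertools
import shlex


def build_inference_commands(models, eval_modes, data_files):
    """Return one main.py inference command (as an argument list) per combination."""
    commands = []
    for model, eval_mode, data_file in itertools.product(models, eval_modes, data_files):
        commands.append([
            "python", "main.py",
            "--model", model,
            "--task", "inference",
            "--fallacy", "linda",
            "--eval_mode", eval_mode,
            "--data_file", data_file,
        ])
    return commands


if __name__ == "__main__":
    sweep = build_inference_commands(
        models=["gpt3.5", "claude"],
        eval_modes=["baseline", "os_cot"],
        data_files=["synthetic_dataset_linda_original_gold.json"],
    )
    for command in sweep:
        # shlex.join quotes arguments safely for pasting into a shell (Python 3.8+).
        print(shlex.join(command))
```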