This repository contains the code and datasets necessary to reproduce the results presented in the NeurIPS 2024 paper "Truth is Universal: Robust Detection of Lies in LLMs".
We recommend using conda for Python installation. While we used Python 3.11.9, other versions should be compatible. It's advisable to create a new Python environment before installing the required packages. Create and activate the environment:
conda create --name truth_is_universal python=3.11
conda activate truth_is_universal
Here, python=3.11 is optional; other versions should be compatible as well.
Navigate to your preferred repository location, then clone the repository, enter it, and install the requirements:
git clone git@github.com:sciai-lab/Truth_is_Universal.git
cd Truth_is_Universal
pip install -r requirements.txt
This repository provides all datasets used in the paper, but not the associated activation vectors, which are too large to include. You need to generate these activations before running any other code. This requires model weights from the Llama3, Llama2, Gemma or Gemma2 model family. We suggest obtaining these weights from Hugging Face (e.g. Llama3-8B-Instruct). Insert the paths to the weight folders into config.ini.
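As a rough illustration of how paths stored in an INI file can be looked up, here is a minimal sketch using Python's standard configparser. The section name, the key name and the path are hypothetical placeholders; the actual layout of config.ini in this repository may differ, so check the file itself before editing it.

```python
import configparser

# Hypothetical layout: one section per model family, one key per model variant.
# The real config.ini in the repository may use different names.
EXAMPLE_CONFIG = """
[Llama3]
8B_chat = /path/to/llama3-8b-instruct
"""


def weight_path(config_text: str, family: str, variant: str) -> str:
    """Return the weight folder configured for a model family and variant."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser[family][variant]


print(weight_path(EXAMPLE_CONFIG, "Llama3", "8B_chat"))
```

Note that configparser treats option names case-insensitively by default, so 8B_chat and 8b_chat refer to the same entry.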
Then, run generate_acts.py to generate the activations. For example, to generate activations for the cities and neg_cities datasets for Llama3-8B-Instruct in layers 11 and 12:
python generate_acts.py --model_family Llama3 --model_size 8B --model_type chat --layers 11 12 --datasets cities neg_cities --device cuda:0
The model runs in float16 precision, so at least 16 GB of GPU RAM is required to run Llama3-8B. Running Gemma2 requires a GPU that supports torch.bfloat16 precision. The activations will be stored in the acts folder. You can generate the activations for all layers by setting --layers -1, for all topic-specific datasets (defined in the paper) by setting --datasets all_topic_specific, and for all datasets by setting --datasets all.
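The 16 GB figure follows from simple arithmetic: float16 stores each parameter in 2 bytes. The sketch below reproduces that estimate, assuming a nominal count of 8 billion parameters and counting only the weights; activations and framework overhead come on top, so treat the result as a lower bound.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes).

    bytes_per_param=2 corresponds to float16 or bfloat16.
    """
    return n_params * bytes_per_param / 1e9


# Assumption: a nominal 8e9 parameters for an "8B" model.
print(weight_memory_gb(8e9))  # 16.0
```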
- generate_acts.py: For generating activations as described above.
- utils.py: Contains various helper functions, e.g. for loading the activations.
- probes.py: Different classifiers that can be trained on the internal model activations to classify statements as true or false.
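To make the idea of a classifier on activations concrete, here is a minimal sketch of a difference-of-means probe on synthetic vectors. It is not the code in probes.py and not one of the classifiers evaluated in the paper; the function names and the synthetic data are assumptions chosen purely for illustration.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fit_mean_difference_probe(true_acts, false_acts):
    """Return (direction, threshold): the difference of the class means and
    the projection of their midpoint onto that direction."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    return direction, dot(direction, midpoint)


def predict(direction, threshold, act):
    """Label an activation vector as true if it projects above the threshold."""
    return dot(direction, act) > threshold


# Synthetic stand-in for activations: two tight clusters that differ
# along the first coordinate only (hypothetical data, not model output).
rng = random.Random(0)


def synthetic(center, n=20, dim=4):
    return [
        [(center if i == 0 else 0.0) + rng.uniform(-0.1, 0.1) for i in range(dim)]
        for _ in range(n)
    ]


true_acts = synthetic(2.0)
false_acts = synthetic(-2.0)
direction, threshold = fit_mean_difference_probe(true_acts, false_acts)

correct = sum(predict(direction, threshold, a) for a in true_acts)
correct += sum(not predict(direction, threshold, a) for a in false_acts)
accuracy = correct / (len(true_acts) + len(false_acts))
print(accuracy)
```

The clusters here are separated by construction, so the probe labels every point correctly; real activations are far noisier and higher-dimensional.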
Jupyter Notebooks:
- truth_directions.ipynb: Code for generating figures from the first four paper sections, from learning truth directions to exploring the dimensionality of the truth subspace. You need to generate the following activations to run this notebook (e.g. for Llama3-8B-Instruct):
python generate_acts.py --model_family Llama3 --model_size 8B --model_type chat --layers 12 --datasets all_topic_specific --device cuda:0
and
python generate_acts.py --model_family Llama3 --model_size 8B --model_type chat --layers -1 --datasets cities neg_cities sp_en_trans neg_sp_en_trans --device cuda:0
- generate_lies.ipynb: For generating the LLM responses (lies) to the real-world scenarios. Responses generated by Llama3-8B-Instruct are already in the datasets folder and have been manually categorized by the first author as either an honest reply or a lie.
- lie_detection.ipynb: Three classifiers (TTPD, LR and CCS) are used to classify statements as true or false based on the internal LLM activations. We examine their ability to generalize to unseen topics, unseen types of statements, and real-world lies. This code reproduces the results of Section 5 of the paper. The activations of all datasets in the datasets folder (in one layer) are needed to run this notebook. You can generate these activations, e.g. for Llama3-8B-Instruct, via the following command:
python generate_acts.py --model_family Llama3 --model_size 8B --model_type chat --layers 12 --datasets all --device cuda:0
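The generate_acts.py invocations above all share the same flags. If you script several runs, a small helper like the following sketch can assemble the argument list. The helper name build_generate_acts_command is hypothetical and not part of this repository; only the flags themselves are taken from the commands shown above.

```python
import shlex


def build_generate_acts_command(model_family, model_size, model_type,
                                layers, datasets, device="cuda:0"):
    """Assemble the argv list for one generate_acts.py run."""
    return [
        "python", "generate_acts.py",
        "--model_family", model_family,
        "--model_size", model_size,
        "--model_type", model_type,
        "--layers", *[str(layer) for layer in layers],
        "--datasets", *datasets,
        "--device", device,
    ]


cmd = build_generate_acts_command("Llama3", "8B", "chat", [12], ["all"])
print(shlex.join(cmd))
# To execute it from the repository root, one could pass the list to
# subprocess.run(cmd, check=True).
```

Passing the arguments as a list rather than a single string avoids shell quoting issues when dataset names or paths contain unusual characters.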
The DataManager class in utils.py, the script generate_acts.py and the CCS implementation in probes.py are (up to some modifications) from the Geometry of Truth GitHub repository by Samuel Marks.
The datasets in the datasets folder were primarily collected from previous papers, all of which are referenced in our paper.