THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models.


This is the official implementation of the NeurIPS 2024 SOLAR Workshop Paper: THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models. THaMES is a comprehensive framework for generating, mitigating, and evaluating hallucinations in Large Language Models (LLMs). The framework provides an end-to-end pipeline with support for:

  • Ingesting various document formats (PDF, CSV, TXT, DOCX)
  • Generating diverse question-answer pairs
  • Evaluating model hallucination propensity
  • Applying state-of-the-art mitigation strategies

THaMES Framework Overview

Installation

$ git clone https://github.com/Liangmf11/THaMES
$ cd THaMES
$ pip install -r requirements.txt

Setup

In .env, set the following keys:

OPENAI_API_KEY=<YOUR-OPENAI-API-KEY-HERE>
AZURE_OPENAI_ENDPOINT=<YOUR-OPENAI-API-ENDPOINT-HERE>
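
If you want to sanity-check the configuration programmatically, the keys can be loaded with the standard python-dotenv package. This is a minimal sketch, not part of the THaMES codebase:

# Minimal sketch: load and check the .env keys (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for key in ("OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"):
    # Fail early with a clear message if a key is missing.
    assert os.getenv(key), f"{key} is not set in .env"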

Quick Start

The easiest way to use THaMES is through the unified pipeline interface:

# Run full pipeline with interactive prompts
$ poetry run python src/pipeline.py

# Run with specific arguments
$ poetry run python src/pipeline.py --categories 1,2 --question_types 1 2 3 --num_questions 10 --model_type azure --model_name gpt-4 --mitigation_techniques CoVe RAG --evaluation_type HALUEVAL

Pipeline Components

THaMES consists of three main components, each of which can also be run independently:

1. Question-Answer Generation

$ poetry run python src/qa_pair_generator.py --categories 1,2 --filename academic_dataset --question-types 1,2,3 --num-questions 10 --hallucination y

2. Model Evaluation

# Evaluate with HaluEval
$ poetry run python src/evaluate_refactored.py --model_type azure --dataset ./output/final/academic_dataset --model_name gpt-4 --mitigation_techniques CoVe --evaluation_type HALUEVAL

# Evaluate with RAGAS
$ poetry run python src/evaluate_refactored.py --model_type ollama --dataset ./output/final/academic_dataset --model_name llama2 --mitigation_techniques RAG --evaluation_type RAGAS

Features

Document Processing

  • Support for multiple document formats (PDF, CSV, TXT, DOCX); a loading sketch follows this list
  • Automatic document categorization
  • Batch processing capabilities
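
The loaders THaMES actually uses are internal to the pipeline; the following is only an illustrative sketch of extension-based routing for the four supported formats (it assumes the third-party packages pypdf and python-docx are installed):

# Illustrative extension-based document routing; not the framework's actual loaders.
import csv
from pathlib import Path

def load_text(path: Path) -> str:
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        from pypdf import PdfReader  # third-party: pypdf
        return "\n".join(page.extract_text() or "" for page in PdfReader(str(path)).pages)
    if suffix == ".docx":
        from docx import Document  # third-party: python-docx
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            return "\n".join(", ".join(row) for row in csv.reader(f))
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")
    raise ValueError(f"Unsupported format: {suffix}")

# Batch processing: load every supported file under test_docs/.
supported = {".pdf", ".docx", ".csv", ".txt"}
corpus = {p.name: load_text(p) for p in Path("test_docs").rglob("*") if p.suffix.lower() in supported}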

Question Generation

Questions are broken down into the following categories (a sketch of how a generated pair might be represented follows the list):

  • Simple: Basic questions that do not require complex reasoning or multiple contexts, with straightforward answers contained directly in the knowledge base.

  • Reasoning: Questions that require reasoning to answer effectively (at least one inferential leap is needed to connect the question to the correct information in the knowledge base).

  • Multi-Context: Questions that necessitate information from multiple related sections or chunks to formulate an answer.

  • Situational: Questions that include user context, evaluating the generator's ability to produce an answer relevant to that context (the first part of the question establishes some correct or distracting context before the question itself).

  • Distracting: Questions constructed to confuse the retrieval component of a RAG system with an element drawn from the knowledge base that is distracting but irrelevant to the question (designed to mislead embedding-based retrieval, leaving more reasoning work for the LLM).

  • Double: Questions with two distinct parts, used to evaluate the capabilities of a RAG system's query rewriter.

  • Conditional: Questions that introduce a conditional element, adding complexity to the question.
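
For illustration, a generated pair together with its category might be represented roughly as follows; the field names are hypothetical and do not reflect the exact schema written to output/final/:

# Hypothetical QA-pair record; field names are illustrative, not the actual output schema.
from dataclasses import dataclass, field

QUESTION_TYPES = [
    "simple", "reasoning", "multi_context", "situational",
    "distracting", "double", "conditional",
]

@dataclass
class QAPair:
    question: str
    answer: str
    question_type: str              # one of QUESTION_TYPES
    context: list = field(default_factory=list)  # source chunks the pair was generated from
    is_hallucinated: bool = False   # True if the answer was deliberately corrupted

example = QAPair(
    question="Which mitigation strategy does the paper recommend for small models, and why?",
    answer="<model-generated answer>",
    question_type="reasoning",
    context=["<knowledge-base chunk>"],
)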

Usage Notes

  • Batch sizes and question counts are flexible; they can be specified by the user or left at the standard defaults.
  • The generator includes a quality filtering pipeline (sketched below) that flags various keywords as indicators of poor question quality, then recursively re-evaluates flagged questions until the batch is no longer flagged.
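
A rough sketch of that flag-and-regenerate loop; the keyword list and the regenerate() hook are placeholders, not the values used by qa_pair_generator.py:

# Sketch of the keyword-based quality filter; BAD_KEYWORDS and regenerate() are
# placeholders, not the generator's actual values.
BAD_KEYWORDS = ("as an ai", "i cannot", "the provided text", "not mentioned")

def regenerate(question: str) -> str:
    """Hypothetical hook that re-prompts the LLM for a replacement question."""
    raise NotImplementedError

def filter_batch(questions: list, max_rounds: int = 5) -> list:
    for _ in range(max_rounds):
        flagged = {q for q in questions if any(k in q.lower() for k in BAD_KEYWORDS)}
        if not flagged:
            return questions  # batch is clean
        # Replace only the flagged questions, then re-check the whole batch.
        questions = [regenerate(q) if q in flagged else q for q in questions]
    return questions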

Hallucination Mitigation

Mitigation Strategies

  • Chain-of-Verification (CoVe): A technique that breaks down response generation into multiple verification steps, where each step validates the previous conclusions against the source material (see the sketch after this list). [paper]

  • Retrieval-Augmented Generation (RAG): Enhances LLM responses by retrieving relevant context from a knowledge base (in this case, formed out of previously-incorrect/low-scoring questions) before generation, combining the model's parametric knowledge with non-parametric information retrieval. [paper]

  • Parameter-Efficient Fine-Tuning (PEFT): A method that adapts pre-trained models to specific tasks while updating only a small subset of parameters (updated based on a corpus of previously-incorrect questions), helping to maintain factual consistency. [paper]
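
As an illustration of the CoVe strategy above, the verification loop can be sketched as follows; the prompts and the llm() helper are placeholders for whichever chat-completion client is configured, not the framework's actual prompts:

# Schematic Chain-of-Verification loop; prompts and llm() are illustrative placeholders.
def llm(prompt: str) -> str:
    """Stand-in for the configured chat-completion client (e.g. Azure OpenAI or Ollama)."""
    raise NotImplementedError

def chain_of_verification(question: str, context: str) -> str:
    # 1. Draft an initial answer from the source material.
    draft = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    # 2. Plan short verification questions about each factual claim in the draft.
    checks = llm(f"List short questions that verify each factual claim in:\n{draft}").splitlines()
    # 3. Answer each verification question independently against the context.
    findings = [llm(f"Context:\n{context}\n\nVerify: {c}") for c in checks if c.strip()]
    # 4. Revise the draft so it is consistent with the verification findings.
    return llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        "Verification results:\n" + "\n".join(findings) + "\nRevised answer:"
    )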

Evaluation Methods

  • Hallucination Identification: A multiple-choice evaluation measuring factual consistency, relevance, and semantic accuracy of model outputs against reference documents. The model is shown either a correct or an incorrect question-answer pair and prompted to judge whether the pair is accurate (see the sketch after this list); based on [paper].

  • Text Generation: An answer-synthesis evaluation framework that scores responses on multiple metrics, including faithfulness, context relevancy, and answer relevancy, quantifying both retrieval quality and generation accuracy; based on [paper].

  • Per-category and aggregate metrics: Detailed performance analysis broken down by question categories (Simple, Reasoning, Multi-Context, etc.) as well as overall system performance metrics including:

    • Faithfulness Score
    • Context Relevance
    • Answer Relevance
    • Response Coherence
    • Hallucination Rate
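
A rough sketch of how the multiple-choice judging and the per-category aggregation fit together; the prompt wording, field names, and pandas aggregation are assumptions for illustration, not the logic in evaluate_refactored.py (assumes pandas is installed):

# Illustrative HaluEval-style judging with per-category aggregation; prompt wording
# and field names are assumptions, not the evaluate_refactored.py implementation.
import pandas as pd

def llm(prompt: str) -> str:
    """Stand-in for the configured chat-completion client."""
    raise NotImplementedError

def judge(question: str, answer: str) -> bool:
    """Ask the model whether a QA pair is factually correct; True means it answered Yes."""
    reply = llm(
        f"Question: {question}\nAnswer: {answer}\n"
        "Is this answer factually correct? Reply with Yes or No."
    )
    return reply.strip().lower().startswith("yes")

def per_category_hallucination_rate(pairs: list) -> pd.Series:
    # Each pair dict carries a ground-truth flag saying whether its answer is hallucinated.
    rows = [
        {
            "category": p["question_type"],
            # The judgement is correct when the model's verdict matches the ground truth.
            "correct": judge(p["question"], p["answer"]) == (not p["is_hallucinated"]),
        }
        for p in pairs
    ]
    df = pd.DataFrame(rows)
    return 1.0 - df.groupby("category")["correct"].mean()  # per-category error rate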

Usage Modes

THaMES supports two usage modes:

1. Interactive Prompt Mode

$ poetry run python src/pipeline.py

Follow the interactive prompts to configure:

  • Document categories
  • Question types
  • Model selection
  • Mitigation strategies
  • Evaluation methods

2. Argument Mode

$ poetry run python src/pipeline.py --categories 1,2 \
                        --question_types 1 2 3 \
                        --num_questions 10 \
                        --model_type azure \
                        --model_name gpt-4 \
                        --mitigation_techniques CoVe RAG \
                        --evaluation_type HALUEVAL

Tutorials

Detailed tutorials will soon be available in the tutorials/ directory, including:

  • Basic Pipeline Usage across all pathways
  • Generating a basic testset
  • Comparing Evaluation Results
  • Understanding Model Performance

Directory Structure

thames/
├── src/
│   ├── pipeline.py
│   ├── evaluate_refactored.py
│   └── qa_pair_generator.py
├── test_docs/
├── output/
│   └── final/
├── tutorials/
└── README.md

Citation

If you use this work, please cite the following paper:

@misc{liang2024thamesendtoendtoolhallucination,
      title={THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models}, 
      author={Mengfei Liang and Archish Arun and Zekun Wu and Cristian Munoz and Jonathan Lutch and Emre Kazim and Adriano Koshiyama and Philip Treleaven},
      year={2024},
      eprint={2409.11353},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.11353}, 
}

License

THaMES is licensed under the MIT License - see LICENSE for more details.
