diff --git a/README.md b/README.md
index ebd49d1..a9d8476 100644
--- a/README.md
+++ b/README.md
@@ -1,12 +1,14 @@
 
-Detect hallucinated content in generated answers for any RAG system.
+Detect hallucinated content in generated answers for any RAG system. A live demo of this tool can be
+tried [here](https://interaktiv.brdata-dev.de/second-opinion-demo/).
 
 :warning: This is only a Proof of Concept and not ready for production use.
 
 ## Installation
 
-This API is tested with Python version 3.11 on Debian but should run on most recent Python versions and operation systems.
+This API is tested with Python version 3.11 on Debian but should run on most recent Python versions and operating
+systems.
 
 1. Create virtual environment `pyenv virtualenv 3.11.7 NAME && pyenv activate NAME`
 2. Install dependencies `pip install -r requirements.txt`
@@ -19,7 +21,7 @@ You can now view the Swagger documentation of the API in your browser under `loc
 
 ## Fact check
 
-The endpoint `check` performs a check if a `sentence` is contained in the `source`.
+The endpoint `check` checks whether a `sentence` is contained in the `source`.
 If this test is passed, the endpoint returns a boolean value `result` as `true`. If the information from the sentence is
 not contained in the source, it will return false.
@@ -27,7 +29,7 @@ not contained in the source, it will return false.
 Besides this boolean value, the endpoint returns an array `answers`, which spell out for each sentence in the source why
 or why not it is contained in the source.
 
-As an URL paramater you can pass the threshold, a lower threshold means higher latency and possibly better accuracy.
+As a URL parameter you can pass the threshold; a lower threshold means higher latency and possibly better accuracy.
 The default value is 0.65.
 
 ```shell
@@ -42,20 +44,20 @@ curl -X 'POST' \
 ```
 
 ## Evaluation
 
-This repository contains two scripts designed to evaluate and enhance the accuracy of our hallucination detection systems.
+
+This repository contains two scripts designed to evaluate and enhance the accuracy of our hallucination detection
+systems.
 The script `evaluation.py` aims to validate the effectiveness of the system by comparing its predictions with the gold
 standard dataset, ultimately providing a measure of accuracy.
-The script `predictor.py` focuses on processing the test data set using the provided API to create set to validate against.
-
-:warning: This test set is designed to detect wrong answers to questions if the source is given. We think the two tasks
-are similar enough to use this on our domain as well.
+The script `predictor.py` focuses on processing the test data set using the provided API to create a set to validate
+against.
 
 ### Available Test- and Training Data
 
-The test and training data ist purely synthetic. It is generated by a random dump from our vector store containing
+The test and training data is purely synthetic. It is generated by a random dump from our vector store containing
 BR24 articles, split by paragraphs. For the test set 150 of those paragraphs are randomly sampled and saved to
-`data/test.csv`.
+`data/test.csv`.
 
 This file is used by `create_training_data.py` to generate a question which can be answered given the paragraph.
@@ -64,7 +66,7 @@ the LLM is explicitly asked to add wrong but plausible content to the answer.
 
 ## Hypothesis data
 
-Your hypothesis data should be placed in the data folder and be suffixed with `_result.jsonl`. Each row shall contain a
+Your hypothesis data should be placed in the data folder and be suffixed with `_result.jsonl`. Each row shall contain a
 JSON object with the structure as follows:
 
 ```json
@@ -76,7 +78,6 @@ JSON object with the structure as follows:
 ```
 
 ## Evaluation
-To run the evaluation simply run `python evaluate.py` after you've placed your results in the data folder.
-The evaluation script calculates the accuracy - e.g. the percentage of correctly predicted samples.
-The current
+To run the evaluation simply run `python evaluate.py` after you've placed your results in the data folder.
+The evaluation script calculates the accuracy, i.e. the percentage of correctly predicted samples.
\ No newline at end of file
diff --git a/data/DatasetCardHallucinations.md b/evaluation/dataset/DatasetCardHallucinations.md
similarity index 100%
rename from data/DatasetCardHallucinations.md
rename to evaluation/dataset/DatasetCardHallucinations.md
diff --git a/data/create_hallucination_training_data.py b/evaluation/dataset/create_hallucination_training_data.py
similarity index 100%
rename from data/create_hallucination_training_data.py
rename to evaluation/dataset/create_hallucination_training_data.py
diff --git a/data/train.jsonl b/evaluation/dataset/train.jsonl
similarity index 100%
rename from data/train.jsonl
rename to evaluation/dataset/train.jsonl
diff --git a/data/verified.txt b/evaluation/dataset/verified.txt
similarity index 100%
rename from data/verified.txt
rename to evaluation/dataset/verified.txt
diff --git a/src/prompts.py b/src/prompts.py
index f77528c..f9511db 100644
--- a/src/prompts.py
+++ b/src/prompts.py
@@ -11,7 +11,7 @@
 - Vorhandensein und Korrektheit von Quellenangaben
 - Inhaltliche Übereinstimmung der Informationen mit dem Ausgangstext
 
-Falls einer dieser Aspekte misachtet wird, gilt die Aussage als nicht unterstützt.
+Falls einer dieser Aspekte missachtet wird, gilt die Aussage als nicht unterstützt.
 
 Für jede Faktenprüfungsaufgabe erhalten Sie:
 Einen Satz, der eine oder mehrere zu überprüfende Behauptungen enthält
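The README hunks above describe the `check` endpoint: a POST request carrying `sentence` and `source`, a `threshold` query parameter defaulting to 0.65, and a response with `result` and `answers`. A minimal client sketch follows; note the base URL and port are assumptions (the diff truncates the actual host shown in the Swagger-docs line), so adjust them to your deployment:

```python
import json
import urllib.request

# Hypothetical base URL: the diff only shows "localhost" truncated as `loc`,
# so host and port here are placeholders -- adjust to your setup.
BASE_URL = "http://localhost:8000"


def build_check_request(sentence: str, source: str, threshold: float = 0.65):
    """Build the URL and JSON body for the `check` endpoint.

    The `sentence`/`source` fields and the `threshold` query parameter come
    from the README; 0.65 is the documented default threshold.
    """
    url = f"{BASE_URL}/check?threshold={threshold}"
    body = json.dumps({"sentence": sentence, "source": source}).encode("utf-8")
    return url, body


def check(sentence: str, source: str, threshold: float = 0.65) -> dict:
    """POST to `check`; per the README the response carries a boolean
    `result` and an `answers` array explaining each decision."""
    url, body = build_check_request(sentence, source, threshold)
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A lower `threshold` trades latency for possibly better accuracy, so exposing it as a keyword argument mirrors the URL-parameter design of the API.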
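The evaluation section states that `evaluate.py` computes accuracy as the percentage of correctly predicted samples over `_result.jsonl` files in the data folder. The diff elides the exact JSON structure of those rows, so the field names in this sketch (`label`, `prediction`) are placeholders, not the repository's actual schema:

```python
import json
from pathlib import Path


def accuracy(gold: list[bool], predicted: list[bool]) -> float:
    """Accuracy as described in the README: the share of samples whose
    prediction matches the gold label."""
    assert len(gold) == len(predicted) and gold, "need equal-length, non-empty lists"
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def load_result_rows(data_dir: str = "data") -> list[dict]:
    """Collect rows from every `*_result.jsonl` hypothesis file.

    The per-row keys are not specified in the diff; downstream code would
    pull gold labels and predictions out of whatever fields the real
    schema defines.
    """
    rows = []
    for path in Path(data_dir).glob("*_result.jsonl"):
        with path.open(encoding="utf-8") as fh:
            rows.extend(json.loads(line) for line in fh if line.strip())
    return rows
```

For example, `accuracy([True, False, True, True], [True, True, True, False])` returns `0.5`, i.e. two of four samples predicted correctly.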