
Commit

Merge branch 'develop' of github.com:br-data/second-opinion into develop
Niels Ringler committed Oct 7, 2024
2 parents 9ae84a4 + e000960 commit a07b45d
Showing 6 changed files with 17 additions and 16 deletions.
31 changes: 16 additions & 15 deletions README.md
![logo](assets/logoSecondOpinion.gif)

Detect hallucinated content in generated answers for any RAG system. A live demo of this tool can be tried [here](https://interaktiv.brdata-dev.de/second-opinion-demo/).

:warning: This is only a Proof of Concept and not ready for production use.

## Installation

This API is tested with Python 3.11 on Debian but should run on most recent Python versions and operating systems.

1. Create a virtual environment: `pyenv virtualenv 3.11.7 NAME && pyenv activate NAME`
2. Install dependencies: `pip install -r requirements.txt`

You can now view the Swagger documentation of the API in your browser under `loc…`

## Fact check

The endpoint `check` checks whether a `sentence` is contained in the `source`.

If this test is passed, the endpoint returns the boolean value `result` as `true`. If the information from the sentence is
not contained in the source, it returns `false`.

Besides this boolean value, the endpoint returns an array `answers`, which spells out for each sentence in the source
why it is or is not contained in the source.

As a URL parameter you can pass the `threshold`; a lower threshold means higher latency and possibly better accuracy.
The default value is 0.65.

```shell
curl -X 'POST' \
```
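
A minimal sketch of the same request from Python: the base URL, port, and exact request schema are assumptions for illustration; only the `check` endpoint, the `sentence` and `source` inputs, the `threshold` parameter, and the `result`/`answers` fields are documented above.

```python
import requests

# Sketch under assumptions: host, port, and the body schema may differ
# from the running API; consult its Swagger documentation.
BASE_URL = "http://localhost:8000"

payload = {
    "sentence": "The capital of Bavaria is Nuremberg.",
    "source": "Munich is the capital of the Free State of Bavaria.",
}

response = requests.post(f"{BASE_URL}/check", params={"threshold": 0.65}, json=payload)
response.raise_for_status()

data = response.json()
print(data["result"])   # expected: False, since the claim contradicts the source
print(data["answers"])  # per-sentence explanations
```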

## Evaluation

This repository contains two scripts designed to evaluate and enhance the accuracy of our hallucination detection systems.

The script `evaluation.py` aims to validate the effectiveness of the system by comparing its predictions with the
gold standard dataset, ultimately providing a measure of accuracy.

The script `predictor.py` focuses on processing the test data set using the provided API to create a set to validate against.

:warning: This test set is designed to detect wrong answers to questions when the source is given. We think the two tasks
are similar enough to use this approach on our domain as well.
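
A rough sketch of what such a prediction run could look like; the file layout, column names, and output fields are illustrative assumptions, and the actual `predictor.py` may differ.

```python
import csv
import json

import requests

# Assumptions for illustration: data/test.csv has "sentence" and "source" columns,
# the API runs locally, and the output field names below are hypothetical.
API_URL = "http://localhost:8000/check"

with open("data/test.csv", newline="", encoding="utf-8") as src, \
        open("data/my_run_result.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        resp = requests.post(
            API_URL,
            params={"threshold": 0.65},
            json={"sentence": row["sentence"], "source": row["source"]},
        )
        resp.raise_for_status()
        # One JSON object per line; the real `_result.jsonl` schema is not shown here.
        record = {"sentence": row["sentence"], "prediction": resp.json()["result"]}
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```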

### Available Test and Training Data

The test and training data is purely synthetic. It is generated from a random dump of our vector store containing
BR24 articles, split by `<p>`aragraphs. For the test set, 150 of those paragraphs are randomly sampled and saved to
`data/test.csv`.
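
As an illustration of that sampling step, here is a short sketch; the dump file name and its format are assumptions and may differ from the actual vector store export.

```python
import pandas as pd

# Assumption for illustration: the vector store dump is available as a JSONL file
# with one paragraph per row; the real dump format may differ.
paragraphs = pd.read_json("data/vector_store_dump.jsonl", lines=True)

# Randomly sample 150 paragraphs for the test set, as described above.
test_set = paragraphs.sample(n=150, random_state=42)
test_set.to_csv("data/test.csv", index=False)
```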

This file is used by `create_training_data.py` to generate a question which can be answered given the paragraph.

[…] the LLM is explicitly asked to add wrong but plausible content to the answer.

## Hypothesis data

Your hypothesis data should be placed in the data folder and be suffixed with `_result.jsonl`. Each row shall contain a
JSON object with the structure as follows:

```json
```

## Running the evaluation
To run the evaluation, simply run `python evaluate.py` after you've placed your results in the data folder.
The evaluation script calculates the accuracy, i.e. the percentage of correctly predicted samples.
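
A minimal sketch of that accuracy calculation under assumed field names; `prediction` and `label` are hypothetical keys, and `evaluate.py` may read the gold standard and the hypothesis files differently.

```python
import glob
import json

correct = 0
total = 0

# Read every hypothesis file in the data folder.
for path in glob.glob("data/*_result.jsonl"):
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            row = json.loads(line)
            total += 1
            # "prediction" and "label" are hypothetical field names.
            if row["prediction"] == row["label"]:
                correct += 1

accuracy = correct / total if total else 0.0
print(f"Accuracy: {accuracy:.2%}")  # share of correctly predicted samples
```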
4 files renamed without changes.
2 changes: 1 addition & 1 deletion src/prompts.py
- Presence and correctness of source citations
- Agreement of the information's content with the source text
If any of these aspects is violated, the statement is considered unsupported.
For each fact-checking task you receive:
A sentence containing one or more claims to be verified
