
Commit

Merge branch 'develop' of github.com:br-data/second-opinion into develop
Niels Ringler committed Oct 7, 2024
2 parents 9ae84a4 + e000960 commit a07b45d
Showing 6 changed files with 17 additions and 16 deletions.
31 changes: 16 additions & 15 deletions README.md
![logo](assets/logoSecondOpinion.gif)

Detect hallucinated content in generated answers for any RAG system. A live demo of this tool can be tried [here](https://interaktiv.brdata-dev.de/second-opinion-demo/).

:warning: This is only a Proof of Concept and not ready for production use.

## Installation

This API is tested with Python 3.11 on Debian but should run on most recent Python versions and operating systems.

1. Create a virtual environment: `pyenv virtualenv 3.11.7 NAME && pyenv activate NAME`
2. Install dependencies: `pip install -r requirements.txt`

You can now view the Swagger documentation of the API in your browser under `loc…`

## Fact check

The endpoint `check` checks whether a `sentence` is contained in the `source`.

If this test is passed, the endpoint returns the boolean value `result` as `true`. If the information from the sentence is
not contained in the source, it returns `false`.

Besides this boolean value, the endpoint returns an array `answers`, which spells out for each sentence in the source
why it is or is not contained in the source.

As a URL parameter you can pass the `threshold`; a lower threshold means higher latency and possibly better accuracy.
The default value is 0.65.

```shell
curl -X 'POST' \
```
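
A minimal sketch of the same request from Python: the base URL, port, and exact request schema are assumptions for illustration; only the `check` endpoint, the `sentence` and `source` inputs, the `threshold` parameter, and the `result`/`answers` fields are documented above.

```python
import requests

# Sketch under assumptions: host, port, and the body schema may differ
# from the running API; consult its Swagger documentation.
BASE_URL = "http://localhost:8000"

payload = {
    "sentence": "The capital of Bavaria is Nuremberg.",
    "source": "Munich is the capital of the Free State of Bavaria.",
}

response = requests.post(f"{BASE_URL}/check", params={"threshold": 0.65}, json=payload)
response.raise_for_status()

data = response.json()
print(data["result"])   # expected: False, since the claim contradicts the source
print(data["answers"])  # per-sentence explanations
```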

## Evaluation

This repository contains two scripts designed to evaluate and enhance the accuracy of our hallucination detection systems.

The script `evaluation.py` aims to validate the effectiveness of the system by comparing its predictions with the
gold standard dataset, ultimately providing a measure of accuracy.

The script `predictor.py` focuses on processing the test data set using the provided API to create a set to validate against.

:warning: This test set is designed to detect wrong answers to questions when the source is given. We think the two tasks
are similar enough to use this approach on our domain as well.
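
A rough sketch of what such a prediction run could look like; the file layout, column names, and output fields are illustrative assumptions, and the actual `predictor.py` may differ.

```python
import csv
import json

import requests

# Assumptions for illustration: data/test.csv has "sentence" and "source" columns,
# the API runs locally, and the output field names below are hypothetical.
API_URL = "http://localhost:8000/check"

with open("data/test.csv", newline="", encoding="utf-8") as src, \
        open("data/my_run_result.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        resp = requests.post(
            API_URL,
            params={"threshold": 0.65},
            json={"sentence": row["sentence"], "source": row["source"]},
        )
        resp.raise_for_status()
        # One JSON object per line; the real `_result.jsonl` schema is not shown here.
        record = {"sentence": row["sentence"], "prediction": resp.json()["result"]}
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```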

### Available Test and Training Data

The test and training data is purely synthetic. It is generated from a random dump of our vector store containing
BR24 articles, split by `<p>`aragraphs. For the test set, 150 of those paragraphs are randomly sampled and saved to
`data/test.csv`.
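
As an illustration of that sampling step, here is a short sketch; the dump file name and its format are assumptions and may differ from the actual vector store export.

```python
import pandas as pd

# Assumption for illustration: the vector store dump is available as a JSONL file
# with one paragraph per row; the real dump format may differ.
paragraphs = pd.read_json("data/vector_store_dump.jsonl", lines=True)

# Randomly sample 150 paragraphs for the test set, as described above.
test_set = paragraphs.sample(n=150, random_state=42)
test_set.to_csv("data/test.csv", index=False)
```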

This file is used by `create_training_data.py` to generate a question which can be answered given the paragraph.

[…] the LLM is explicitly asked to add wrong but plausible content to the answer.

## Hypothesis data

Your hypothesis data should be placed in the data folder and be suffixed with `_result.jsonl`. Each row shall contain a
JSON object with the structure as follows:

```json
```

## Running the evaluation
To run the evaluation, simply run `python evaluate.py` after you've placed your results in the data folder.
The evaluation script calculates the accuracy, i.e. the percentage of correctly predicted samples.
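
A minimal sketch of that accuracy calculation under assumed field names; `prediction` and `label` are hypothetical keys, and `evaluate.py` may read the gold standard and the hypothesis files differently.

```python
import glob
import json

correct = 0
total = 0

# Read every hypothesis file in the data folder.
for path in glob.glob("data/*_result.jsonl"):
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            row = json.loads(line)
            total += 1
            # "prediction" and "label" are hypothetical field names.
            if row["prediction"] == row["label"]:
                correct += 1

accuracy = correct / total if total else 0.0
print(f"Accuracy: {accuracy:.2%}")  # share of correctly predicted samples
```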
4 files renamed without changes.
2 changes: 1 addition & 1 deletion src/prompts.py
- Presence and correctness of source citations
- Agreement of the information's content with the source text
If any of these aspects is violated, the statement is considered unsupported.
For each fact-checking task you receive:
A sentence containing one or more claims to be verified
