![logo](assets/logoSecondOpinion.gif)

Detect hallucinated content in generated answers for any RAG system.
This repo provides the summarisation functionality for the second opinion demo.

:warning: This is only a Proof of Concept and not ready for production use.

Since this API is also designed for demonstration purposes, it is possible to enforce hallucinations by setting the
`honest` parameter to `false`. As the model you can choose either `gpt-3.5-turbo` or `gpt-4-turbo`.
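
As an illustration, a dishonest run could be requested like this. This is only a sketch: the `/summarise` path, the
placement of the `honest` and `model` query parameters, and the request body are assumptions, since the full request
format is not shown here.

```shell
# Sketch only: endpoint path, query parameters and request body are assumptions.
curl -X 'POST' \
  'http://localhost:3000/summarise?honest=false&model=gpt-4-turbo' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "source": "string"
}'
```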


## Fact check

The `check` endpoint verifies whether a `sentence` is contained in the `source`.

If this check passes, the endpoint returns the boolean field `result` as `true`. If the information from the sentence is
not contained in the source, it returns `false`.

Besides this boolean value, the endpoint returns an array `answers`, which spells out for each sentence of the source
why the claim is or is not contained in it.

As a URL parameter you can pass the `semantic_similarity_threshold`; a lower threshold means higher latency and
possibly better accuracy. The default value is 0.65.

```shell
curl -X 'POST' \
'http://localhost:3000/check?semantic_similarity_threshold=0.65&model=gpt-3.5-turbo' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"source": "string",
"sentence": "string"
}'
```
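
An illustrative response shape, based on the description above. Only the `result` and `answers` fields are taken from
the text; the wording and the exact structure of the `answers` entries are made up here.

```json
{
  "result": false,
  "answers": [
    "Sentence 1 of the source talks about a different topic and does not contain the claim.",
    "Sentence 2 of the source mentions the entity but not the claimed fact."
  ]
}
```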

## Evaluation
This repository contains two scripts designed to evaluate and enhance the accuracy of our hallucination detection systems.
The script `evaluation.py` aims to validate the effectiveness of the system by comparing its predictions with the
gold standard dataset, ultimately providing a measure of accuracy.

The script `predictor.py` processes the test data set using the provided API to create a result set to validate against.
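
A minimal sketch of such a prediction run, assuming the `check` endpoint shown above and the `_result.jsonl` format
described below under "Hypothesis data". The CSV column names (`id`, `source`, `answer`) and the way `prob` is filled
are assumptions.

```python
# Sketch only: CSV column names and the meaning of "prob" are assumptions.
import csv
import json

import requests

API = "http://localhost:3000/check?semantic_similarity_threshold=0.65&model=gpt-3.5-turbo"

with open("data/test.csv", newline="") as src, open("data/baseline_result.jsonl", "w") as out:
    for row in csv.DictReader(src):
        resp = requests.post(
            API,
            json={"source": row["source"], "sentence": row["answer"]},
        ).json()
        # The API reports whether the answer is contained in the source;
        # a hallucination is the negation of that result.
        hallucination = not resp["result"]
        out.write(json.dumps({
            "id": row["id"],
            "hallucination": hallucination,
            "prob": 0.99 if hallucination else 0.01,  # placeholder confidence
        }) + "\n")
```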

### Available Test and Training Data

The test and training data is purely synthetic. It is generated from a random dump of our vector store containing
BR24 articles, split into paragraphs at `<p>` tags. For the test set, 150 of those paragraphs are randomly sampled and
saved to `data/test.csv`.

This file is used by `create_training_data.py` to generate, for each paragraph, a question that can be answered from it.

Using this question and the paragraph, GPT-3.5 Turbo is then used to generate an answer. In some cases the LLM is
explicitly asked to add wrong but plausible content to the answer.
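
A rough sketch of that generation step, assuming the OpenAI chat completions client. The prompts and the way the
dishonest cases are selected are assumptions; `create_training_data.py` may do this differently.

```python
# Sketch only: prompts and selection of the dishonest cases are assumptions.
import random

from openai import OpenAI

client = OpenAI()

def generate_answer(paragraph: str, question: str, honest: bool) -> str:
    instruction = "Answer the question using only the information in the paragraph."
    if not honest:
        instruction += " Additionally, add one wrong but plausible detail to the answer."
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": f"Paragraph:\n{paragraph}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content

# Roughly half of the answers could be generated with honest=False:
# answer = generate_answer(paragraph, question, honest=random.random() < 0.5)
```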

## Hypothesis data

Your hypothesis data should be placed in the data folder and be suffixed with `_result.jsonl`. Each row must contain a
JSON object with the following structure:

```json
{
"id": "string",
"hallucination": true,
"prob": 0.01
}
```

## Running the Evaluation
To run the evaluation, simply run `python evaluate.py` after you've placed your results in the data folder.
The evaluation script calculates the accuracy, i.e. the percentage of correctly predicted samples.
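
In essence, the accuracy computation boils down to something like the following sketch. The location and column name
of the gold labels (`hallucination` in `data/test.csv`) are assumptions.

```python
# Sketch only: the gold labels are assumed to live in data/test.csv
# in a boolean column named "hallucination".
import csv
import json

gold = {}
with open("data/test.csv", newline="") as f:
    for row in csv.DictReader(f):
        gold[row["id"]] = row["hallucination"] == "True"

correct = total = 0
with open("data/baseline_result.jsonl") as f:
    for line in f:
        pred = json.loads(line)
        correct += pred["hallucination"] == gold[pred["id"]]
        total += 1

print(f"Accuracy: {correct / total:.2%}")
```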

The current
