
Documentation #11

Merged · 6 commits · Oct 8, 2024
4 changes: 4 additions & 0 deletions .dockerignore
@@ -16,3 +16,7 @@ __pycache__/
*.tmp
*.bak

# Data
/data
*.csv

6 changes: 4 additions & 2 deletions .gitignore
@@ -119,11 +119,13 @@ dmypy.json
scripts/rag_evaluation/data/flat_files/

# data
*/data
app/data/*
/app/data/resources_summarized.csv
!/app/data/batch_cn1d3YOzng9mkfZawyqfkL1k_output_33a6_ids_dates_no_urls.jsonl
!/app/data/batch_O9aiHyHpLDHCaaxZzcw1dqDS_output_no_ids_dates_no_urls.jsonl
scripts/rag_evaluation/evaluate_generation/generation_speed/data/*.csv
scripts/rag_evaluation/evaluate_retrieval/data/output/*csv

/models
/data/
!/evaluation/data/
evaluation/evaluation_dataset/legacy_fix
7 changes: 3 additions & 4 deletions Dockerfile
@@ -1,12 +1,11 @@
FROM python:3.10-slim

WORKDIR /app
WORKDIR /

COPY requirements.txt requirements.txt
COPY app app

COPY app /app/

RUN pip install --no-cache-dir -r requirements.txt

EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
Contributor: Why is this being removed?

Contributor (Author): I placed it in the Docker Compose file so that it is not executed when the container is built. Now the command runs after all the containers are up.

59 changes: 35 additions & 24 deletions README.md
@@ -1,6 +1,6 @@
# Fasten Answers AI

This project uses Llama3 8B powered by llama.cpp to provide LLM answers to users' questions using documents indexed in Elasticsearch if available. The architecture includes a data flow that allows document uploads, indexing in Elasticsearch, and response generation to queries using a large language model (LLM).
This project uses llama.cpp to provide LLM answers to users' questions using documents indexed in Elasticsearch if available. The architecture includes a data flow that allows document uploads, indexing in Elasticsearch, and response generation to queries using a large language model (LLM).


## Project Summary
@@ -13,27 +13,41 @@ It uses the following components:

## Data Flow Architecture

1. **Document Upload**:
- PDF documents are uploaded via a FastAPI endpoint.
- The document text is extracted and split into chunks using `langchain`.
1. **Indexing in Elasticsearch**:
- FHIR resources must be in JSON format.
   - As the project is still in development, we are testing different approaches for storing FHIR resources in the vector database. The alternatives we have tested include: (i) saving each resource as a string, divided into chunks with overlap; (ii) flattening each resource before chunking with overlap; and (iii) summarizing each FHIR resource using the OpenAI API or a local LLM with llama.cpp. You can find more details on each approach in the [indexing strategies documentation](./docs/indexing_strategies.md).
- Text embeddings are generated using `sentence-transformers` and stored in Elasticsearch.

2. **Indexing in Elasticsearch**:
- The text chunks are indexed in Elasticsearch.
- Text embeddings are generated using `sentence-transformers` and stored in Elasticsearch to facilitate search.

3. **Response Generation**:
- Queries are sent through a FastAPI endpoint.
2. **Response Generation**:
- Queries are sent through a FastAPI endpoint.
- Relevant results are retrieved from Elasticsearch.
- An LLM, served by llama.cpp, generates a response based on the retrieved results.
   - You can find more details on how to set up the generation in the [generation strategies documentation](./docs/indexing_strategies.md).

## Hardware and technical recommendations

Since the project utilizes llama.cpp for LLM execution, it’s crucial to configure the command properly to optimize performance based on the available hardware resources. All parameters can be found in the [llama.cpp server documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md), but for this project, we use the following configuration:

| Parameter | Explanation |
| -------- | ------- |
| -n, --predict, --n-predict N | number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled) |
| -c, --ctx-size N | size of the prompt context as a power of 2 (default: 0, 0 = loaded from model) |
| -t, --threads N | number of threads to use during generation (default: -1). Higher values generally speed up inference, up to the number of physical CPU cores. |
| -np, --parallel N (optional) | number of parallel sequences to decode (default: 1). If, for example, you decide to process 4 sequences in a batch in parallel, you should set -np 4. |

To configure these parameters based on the available resources of the local machine, modify the `command` line in the `llama` service section within the [docker-compose.yml](docker-compose.yml) file directly.
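
For illustration, a hypothetical `command` for the `llama` service could combine these parameters as follows. This is only a sketch: the binary name, model filename, and parameter values are placeholders to adapt to your hardware and the model you downloaded, not the project's actual configuration.

```sh
# Hypothetical example only: adjust the model path and values to your hardware.
# -c 4096  -> prompt context of 4096 tokens
# -t 8     -> use 8 CPU threads during generation
# -np 4    -> decode up to 4 sequences in parallel (optional)
./llama-server -m /models/your-model-Q4_K_M.gguf \
  -c 4096 -t 8 -np 4 \
  --host 0.0.0.0 --port 8080
```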

If the hardware has sufficient resources to run llama.cpp in parallel, it is recommended to include the -np parameter in the Docker Compose configuration before running `docker compose up`. In this scenario, the endpoints that support parallel execution, as specified in [routes](./app/routes/), are documented in [evaluate_generation.md](./docs/evaluate_generation.md) and [evaluate_retrieval.md](./docs/evaluate_retrieval.md).

## Running the Project

### Prerequisites

- Docker
- Docker Compose
- Docker.
- Docker Compose.
- LLM models should be downloaded and stored in the [./models](./models/) folder in .gguf format. We have tested the performance of Phi 3.5 Mini and Llama 3.1 in various quantization formats. The prompts for conversation and summary generation are configured in the [prompts folder](./app/config/prompts/). If you want to add a new model with a different prompt, you must update the prompt files in that directory and place the corresponding model in the models folder. An example download command is sketched below.
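
For example, one way (among others) to fetch a quantized model into the models folder is with `huggingface-cli`; the repository id and filename below are placeholders rather than a specific recommendation:

```sh
# Placeholder repository id and filename: substitute the .gguf model you actually want to use.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <repo-id> <model-file>.gguf --local-dir ./models
```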

### Instructions
### Instructions to Launch the RAG System

1. **Clone the repository**:

@@ -42,17 +56,15 @@ It uses the following components:
cd fasten-answers-ai
```

2. **Modify the `docker-compose.yml` file variables** (if necessary):
2. **Modify the app environment variables in the `docker-compose.yml` file** (if necessary):

```sh
ES_HOST=http://elasticsearch:9200
ES_USER=elastic
ES_PASSWORD=changeme
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
LLAMA_HOST=http://llama:8080
LLAMA_PROMPT="A chat between a curious user and an intelligent, polite medical assistant. The assistant provides detailed, helpful answers to the user's medical questions, including accurate references where applicable."
INDEX_NAME=fasten-index
UPLOAD_DIR=/app/data/
ES_INDEX_NAME: fasten-index
EMBEDDING_MODEL_NAME: all-MiniLM-L6-v2
LLM_HOST: http://llama:9090
```

3. **Start the services with Docker Compose**:
@@ -63,11 +63,10 @@ It uses the following components:

This command will start the following services (a few optional health checks are sketched below the list):
- **Elasticsearch**: Available at `http://localhost:9200`
- **Llama3**: Served by llama.cpp at `http://localhost:8080`
- **Llama**: Served by llama.cpp at `http://localhost:8080`
- **FastAPI Application**: Available at `http://localhost:8000`
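
Once the stack is up, a few optional sanity checks can confirm each service is responding. The ports and the Elasticsearch credentials below are the defaults used in this README and in `app/config/settings.py`; adjust them if you changed the configuration.

```sh
# Optional sanity checks; ports and credentials are the defaults assumed above.
curl -u elastic:changeme http://localhost:9200   # Elasticsearch cluster info
curl http://localhost:8080/health                # llama.cpp server health endpoint
curl http://localhost:8000/docs                  # FastAPI interactive docs (HTML page)
```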

## Generating CURL Requests

### Generate CURL Requests with `curl_generator`
## Running the evaluations

To facilitate generating CURL requests to the 8000 port, you can use the `curl_generator.py` script located in the scripts folder.
* [Retrieval evaluation](./docs/evaluate_retrieval.md)
* [Generation evaluation](./docs/evaluate_generation.md)
4 changes: 2 additions & 2 deletions app/config/settings.py
@@ -7,7 +7,7 @@ def __init__(self):
self.host = os.getenv("ES_HOST", "http://localhost:9200")
self.user = os.getenv("ES_USER", "elastic")
self.password = os.getenv("ES_PASSWORD", "changeme")
self.index_name = os.getenv("INDEX_NAME", "fasten-index")
self.index_name = os.getenv("ES_INDEX_NAME", "fasten-index")


class ModelsSettings:
@@ -17,7 +17,7 @@ def __init__(self):
# Embedding model
self.embedding_model_name = os.getenv("EMBEDDING_MODEL_NAME", "all-MiniLM-L6-v2")
# LLM host
self.llm_host = os.getenv("LLAMA_HOST", "http://localhost:9090")
self.llm_host = os.getenv("LLM_HOST", "http://localhost:9090")
# Conversation prompts
self.conversation_model_prompt = {
"llama3.1": self.load_prompt(os.path.join(base_dir, "prompts/conversation_model_prompt_llama3.1-instruct.txt")),
22 changes: 17 additions & 5 deletions app/evaluation/generation/correctness.py
@@ -13,7 +13,7 @@
from tqdm import tqdm
import pandas as pd

from evaluation.core.openai.openai import get_chat_completion
from app.services.openai import OpenAIHandler


CORRECTNESS_SYS_TMPL = """
@@ -73,7 +73,7 @@ def __init__(self, openai_api_key, model="gpt-4o-mini-2024-07-18", threshold=4.0, max_tokens=30
self.threshold = threshold
self.max_tokens = max_tokens

def run_correctness_eval(self, query_str: str, reference_answer: str, generated_answer: str):
def run_correctness_eval(self, query_str: str, reference_answer: str, generated_answer: str, openai_handler: OpenAIHandler):
"""
Evaluates the correctness of a generated answer against a reference answer.

@@ -92,9 +92,14 @@ def run_correctness_eval(self, query_str: str, reference_answer: str, generated_

system_prompt = CORRECTNESS_SYS_TMPL

open_ai_response = get_chat_completion(
self.openai_api_key, user_prompt, system_prompt, ANSWER_JSON_SCHEMA, model=self.model, max_tokens=self.max_tokens
open_ai_response = openai_handler.get_chat_completion(
openai_api_key=self.openai_api_key,
user_prompt=user_prompt,
system_prompt=system_prompt,
answer_json_schema=ANSWER_JSON_SCHEMA,
max_tokens=self.max_tokens,
)

json_answer = json.loads(open_ai_response.get("choices")[0].get("message").get("content"))

score = json_answer["score"]
@@ -138,6 +143,8 @@ def evaluate_dataset(
Returns:
- pd.DataFrame, the original dataframe with additional columns for score, reasoning, and passing status.
"""
openai_handler = OpenAIHandler(model=self.model)

fieldnames = [resource_id_column, "score", "reasoning", "passing"]

with open(output_file, mode="w", newline="") as file:
@@ -149,9 +156,14 @@ def evaluate_dataset(
try:
for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing correctness"):
result = self.run_correctness_eval(
row[query_column], row[reference_answer_column], row[generated_answer_column]
row[query_column],
row[reference_answer_column],
row[generated_answer_column],
openai_handler=openai_handler,
)

result[resource_id_column] = row[resource_id_column]

# Write the result to the CSV file
writer.writerow(result)

20 changes: 15 additions & 5 deletions app/evaluation/generation/faithfulness.py
@@ -11,7 +11,7 @@

from tqdm import tqdm

from evaluation.core.openai.openai import get_chat_completion
from app.services.openai import OpenAIHandler


FAITHFULLNESS_SYS_TMPL = """
@@ -77,7 +77,7 @@ def __init__(self, openai_api_key, model="gpt-4o-mini-2024-07-18", max_tokens=30
self.model = model
self.max_tokens = max_tokens

def run_faithfulness_eval(self, generated_answer: str, contexts: str):
def run_faithfulness_eval(self, generated_answer: str, contexts: str, openai_handler: OpenAIHandler):
"""
Evaluates the faithfulness of a generated answer against provided contexts based on three aspects.

@@ -92,8 +92,12 @@ def run_faithfulness_eval(self, generated_answer: str, contexts: str):
user_prompt = FAITHFULLNESS_USER_TMPL.format(generated_answer=generated_answer, contexts=contexts)
system_prompt = FAITHFULLNESS_SYS_TMPL

open_ai_response = get_chat_completion(
self.openai_api_key, user_prompt, system_prompt, ANSWER_JSON_SCHEMA, model=self.model, max_tokens=self.max_tokens
open_ai_response = openai_handler.get_chat_completion(
openai_api_key=self.openai_api_key,
user_prompt=user_prompt,
system_prompt=system_prompt,
answer_json_schema=ANSWER_JSON_SCHEMA,
max_tokens=self.max_tokens,
)

json_answer = json.loads(open_ai_response.get("choices")[0].get("message").get("content"))
@@ -146,6 +150,8 @@ def evaluate_dataset(
Returns:
- pd.DataFrame, the original dataframe with additional columns for relevancy, accuracy, conciseness and pertinence, and reasoning.
"""
openai_handler = OpenAIHandler(model=self.model)

fieldnames = [resource_id_column, "relevancy", "accuracy", "conciseness_and_pertinence", "reasoning"]

with open(output_file, mode="w", newline="") as file:
@@ -155,8 +161,12 @@ def evaluate_dataset(

try:
for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing faithfulness"):
result = self.run_faithfulness_eval(row[generated_answer_column], row[contexts_column])
result = self.run_faithfulness_eval(
row[generated_answer_column], row[contexts_column], openai_handler=openai_handler
)

result[resource_id_column] = row[resource_id_column]

# Write the result to the CSV file
writer.writerow(result)

17 changes: 9 additions & 8 deletions app/routes/evaluation_endpoints.py
@@ -168,7 +168,7 @@ async def evaluate_generation(
task.connect(params)

evaluator = CorrectnessEvaluator(openai_api_key, openai_model, threshold=correctness_threshold, max_tokens=max_tokens)
evaluation_metrics = evaluator.evaluate_dataset(
evaluation_metrics_correctness = evaluator.evaluate_dataset(
df=df,
output_file=output_file_correctness,
query_column=query_column,
@@ -178,35 +178,36 @@ async def evaluate_generation(
)

evaluator = FaithfulnessEvaluator(openai_api_key, openai_model, max_tokens=max_tokens)
evaluation_metrics = evaluator.evaluate_dataset(
evaluation_metrics_faithfulness = evaluator.evaluate_dataset(
df=df,
output_file=output_file_faithfulness,
generated_answer_column=generated_answer_column,
contexts_column=contexts_column,
resource_id_column=resource_id_column,
)

combined_metrics = {**evaluation_metrics_correctness, **evaluation_metrics_faithfulness}

# Upload metrics to clearml
if clearml_track_experiment and task:
for series_name, value in evaluation_metrics.items():
for series_name, value in combined_metrics.items():
task.get_logger().report_single_value(name=series_name, value=value)
task.close()

return evaluation_metrics
return combined_metrics

except Exception as e:
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=f"Error during generation evaluation: {str(e)}"
)


@router.post("/batch_generation")
async def batch_generation(
@router.post("/dataset_generation")
async def dataset_generation(
file: UploadFile = File(...),
limit: int = File(None),
question_column: str = File("openai_query"),
model_prompt: str = Form("llama3.1"),
llm_model: str = Form("llama3.1"),
search_text_boost: float = Form(1),
search_embedding_boost: float = Form(1),
k: int = Form(5),
@@ -241,7 +242,7 @@ async def batch_generation(
k=k,
text_boost=search_text_boost,
embedding_boost=search_embedding_boost,
llm_model=llm_model,
llm_model=model_prompt,
process=process,
job=job,
)