
Documentation #11

Merged · 6 commits · Oct 8, 2024
4 changes: 4 additions & 0 deletions .dockerignore
@@ -16,3 +16,7 @@ __pycache__/
*.tmp
*.bak

# Data
/data
*.csv

6 changes: 4 additions & 2 deletions .gitignore
@@ -119,11 +119,13 @@ dmypy.json
scripts/rag_evaluation/data/flat_files/

# data
*/data
app/data/*
/app/data/resources_summarized.csv
!/app/data/batch_cn1d3YOzng9mkfZawyqfkL1k_output_33a6_ids_dates_no_urls.jsonl
!/app/data/batch_O9aiHyHpLDHCaaxZzcw1dqDS_output_no_ids_dates_no_urls.jsonl
scripts/rag_evaluation/evaluate_generation/generation_speed/data/*.csv
scripts/rag_evaluation/evaluate_retrieval/data/output/*csv

/models
/data/
!/evaluation/data/
evaluation/evaluation_dataset/legacy_fix
7 changes: 3 additions & 4 deletions Dockerfile
@@ -1,12 +1,11 @@
FROM python:3.10-slim

WORKDIR /app
WORKDIR /

COPY requirements.txt requirements.txt
COPY app app

COPY app /app/

RUN pip install --no-cache-dir -r requirements.txt

EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
Contributor: Why is this being removed?

Contributor (Author): I placed it in the Docker Compose file so that it is not executed when the container is built. Now the command runs after all the containers are up.

59 changes: 35 additions & 24 deletions README.md
@@ -1,6 +1,6 @@
# Fasten Answers AI

This project uses Llama3 8B powered by llama.cpp to provide LLM answers to users' questions using documents indexed in Elasticsearch if available. The architecture includes a data flow that allows document uploads, indexing in Elasticsearch, and response generation to queries using a large language model (LLM).
This project uses llama.cpp to provide LLM answers to users' questions using documents indexed in Elasticsearch if available. The architecture includes a data flow that allows document uploads, indexing in Elasticsearch, and response generation to queries using a large language model (LLM).


## Project Summary
@@ -13,27 +13,41 @@ It uses the following components:

## Data Flow Architecture

1. **Document Upload**:
- PDF documents are uploaded via a FastAPI endpoint.
- The document text is extracted and split into chunks using `langchain`.
1. **Indexing in Elasticsearch**:
- FHIR resources must be in JSON format.
   - As the project is still in development, we are testing different approaches for storing FHIR resources in the vector database. The alternatives we have tested include: (i) saving each resource as a string, divided into chunks with overlap; (ii) flattening each resource before chunking with overlap; and (iii) summarizing each FHIR resource using the OpenAI API or a local LLM with llama.cpp. You can find more details on each approach in the [indexing strategies documentation](./docs/indexing_strategies.md).
- Text embeddings are generated using `sentence-transformers` and stored in Elasticsearch.

2. **Indexing in Elasticsearch**:
- The text chunks are indexed in Elasticsearch.
- Text embeddings are generated using `sentence-transformers` and stored in Elasticsearch to facilitate search.

3. **Response Generation**:
- Queries are sent through a FastAPI endpoint.
2. **Response Generation**:
- Queries are sent through a FastAPI endpoint.
- Relevant results are retrieved from Elasticsearch.
- An LLM, served by llama.cpp, generates a response based on the retrieved results.
   - You can find more details on how to set up the generation in the [generation strategies documentation](./docs/indexing_strategies.md).

## Hardware and technical recommendations

Since the project utilizes llama.cpp for LLM execution, it’s crucial to configure the command properly to optimize performance based on the available hardware resources. All parameters can be found in the [llama.cpp server documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md), but for this project, we use the following configuration:

| Parameter | Explanation |
| -------- | ------- |
| -n, --predict, --n-predict N | number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled) |
| -c, --ctx-size N | size of the prompt context as a power of 2 (default: 0, 0 = loaded from model) |
| -t, --threads N | number of threads to use during generation (default: -1). Higher values generally speed up inference, up to the number of physical CPU cores. |
| -np, --parallel N (optional) | number of parallel sequences to decode (default: 1). If, for example, you decide to process 4 sequences in a batch in parallel, you should set -np 4. |

To configure these parameters based on the available resources of the local machine, modify the `command` line in the `llama` service section within the [docker-compose.yml](docker-compose.yml) file directly.
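
For illustration, a hypothetical `command` for the `llama` service could combine these parameters as follows. This is only a sketch: the binary name, model filename, and parameter values are placeholders to adapt to your hardware and the model you downloaded, not the project's actual configuration.

```sh
# Hypothetical example only: adjust the model path and values to your hardware.
# -c 4096  -> prompt context of 4096 tokens
# -t 8     -> use 8 CPU threads during generation
# -np 4    -> decode up to 4 sequences in parallel (optional)
./llama-server -m /models/your-model-Q4_K_M.gguf \
  -c 4096 -t 8 -np 4 \
  --host 0.0.0.0 --port 8080
```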

If the hardware has sufficient resources to run llama.cpp in parallel, it is recommended to include the -np parameter in the Docker Compose configuration before running `docker compose up`. In this scenario, the endpoints that support parallel execution, as specified in [routes](./app/routes/), are documented in [evaluate_generation.md](./docs/evaluate_generation.md) and [evaluate_retrieval.md](./docs/evaluate_retrieval.md).

## Running the Project

### Prerequisites

- Docker
- Docker Compose
- Docker.
- Docker Compose.
- LLM models should be downloaded and stored in the [./models](./models/) folder in .gguf format. We have tested the performance of Phi 3.5 Mini and Llama 3.1 in various quantization formats. The prompts for conversation and summary generation are configured in the [prompts folder](./app/config/prompts/). If you want to add a new model with a different prompt, you must update the prompt files in that directory and place the corresponding model in the models folder. An example download command is sketched below.
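
For example, one way (among others) to fetch a quantized model into the models folder is with `huggingface-cli`; the repository id and filename below are placeholders rather than a specific recommendation:

```sh
# Placeholder repository id and filename: substitute the .gguf model you actually want to use.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <repo-id> <model-file>.gguf --local-dir ./models
```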

### Instructions
### Instructions to Launch the RAG System

1. **Clone the repository**:

@@ -42,17 +56,15 @@ It uses the following components:
cd fasten-answers-ai
```

2. **Modify the `docker-compose.yml` file variables** (if necessary):
2. **Modify the app environment variables in the `docker-compose.yml` file** (if necessary):

```sh
ES_HOST=http://elasticsearch:9200
ES_USER=elastic
ES_PASSWORD=changeme
EMBEDDING_MODEL_NAME=all-MiniLM-L6-v2
LLAMA_HOST=http://llama:8080
LLAMA_PROMPT="A chat between a curious user and an intelligent, polite medical assistant. The assistant provides detailed, helpful answers to the user's medical questions, including accurate references where applicable."
INDEX_NAME=fasten-index
UPLOAD_DIR=/app/data/
ES_INDEX_NAME: fasten-index
EMBEDDING_MODEL_NAME: all-MiniLM-L6-v2
LLM_HOST: http://llama:9090
```

3. **Start the services with Docker Compose**:
@@ -63,11 +63,10 @@ It uses the following components:

This command will start the following services (a few optional health checks are sketched below the list):
- **Elasticsearch**: Available at `http://localhost:9200`
- **Llama3**: Served by llama.cpp at `http://localhost:8080`
- **Llama**: Served by llama.cpp at `http://localhost:8080`
- **FastAPI Application**: Available at `http://localhost:8000`
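
Once the stack is up, a few optional sanity checks can confirm each service is responding. The ports and the Elasticsearch credentials below are the defaults used in this README and in `app/config/settings.py`; adjust them if you changed the configuration.

```sh
# Optional sanity checks; ports and credentials are the defaults assumed above.
curl -u elastic:changeme http://localhost:9200   # Elasticsearch cluster info
curl http://localhost:8080/health                # llama.cpp server health endpoint
curl http://localhost:8000/docs                  # FastAPI interactive docs (HTML page)
```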

## Generating CURL Requests

### Generate CURL Requests with `curl_generator`
## Running the evaluations

To facilitate generating CURL requests to the 8000 port, you can use the `curl_generator.py` script located in the scripts folder.
* [Retrieval evaluation](./docs/evaluate_retrieval.md)
* [Generation evaluation](./docs/evaluate_generation.md)
4 changes: 2 additions & 2 deletions app/config/settings.py
@@ -7,7 +7,7 @@ def __init__(self):
self.host = os.getenv("ES_HOST", "http://localhost:9200")
self.user = os.getenv("ES_USER", "elastic")
self.password = os.getenv("ES_PASSWORD", "changeme")
self.index_name = os.getenv("INDEX_NAME", "fasten-index")
self.index_name = os.getenv("ES_INDEX_NAME", "fasten-index")


class ModelsSettings:
@@ -17,7 +17,7 @@ def __init__(self):
# Embedding model
self.embedding_model_name = os.getenv("EMBEDDING_MODEL_NAME", "all-MiniLM-L6-v2")
# LLM host
self.llm_host = os.getenv("LLAMA_HOST", "http://localhost:9090")
self.llm_host = os.getenv("LLM_HOST", "http://localhost:9090")
# Conversation prompts
self.conversation_model_prompt = {
"llama3.1": self.load_prompt(os.path.join(base_dir, "prompts/conversation_model_prompt_llama3.1-instruct.txt")),
22 changes: 17 additions & 5 deletions app/evaluation/generation/correctness.py
@@ -13,7 +13,7 @@
from tqdm import tqdm
import pandas as pd

from evaluation.core.openai.openai import get_chat_completion
from app.services.openai import OpenAIHandler


CORRECTNESS_SYS_TMPL = """
@@ -73,7 +73,7 @@ def __init__(self, openai_api_key, model="gpt-4o-mini-2024-07-18", threshold=4.0, max_tokens=30
self.threshold = threshold
self.max_tokens = max_tokens

def run_correctness_eval(self, query_str: str, reference_answer: str, generated_answer: str):
def run_correctness_eval(self, query_str: str, reference_answer: str, generated_answer: str, openai_handler: OpenAIHandler):
"""
Evaluates the correctness of a generated answer against a reference answer.

@@ -92,9 +92,14 @@ def run_correctness_eval(self, query_str: str, reference_answer: str, generated_

system_prompt = CORRECTNESS_SYS_TMPL

open_ai_response = get_chat_completion(
self.openai_api_key, user_prompt, system_prompt, ANSWER_JSON_SCHEMA, model=self.model, max_tokens=self.max_tokens
open_ai_response = openai_handler.get_chat_completion(
openai_api_key=self.openai_api_key,
user_prompt=user_prompt,
system_prompt=system_prompt,
answer_json_schema=ANSWER_JSON_SCHEMA,
max_tokens=self.max_tokens,
)

json_answer = json.loads(open_ai_response.get("choices")[0].get("message").get("content"))

score = json_answer["score"]
@@ -138,6 +143,8 @@ def evaluate_dataset(
Returns:
- pd.DataFrame, the original dataframe with additional columns for score, reasoning, and passing status.
"""
openai_handler = OpenAIHandler(model=self.model)

fieldnames = [resource_id_column, "score", "reasoning", "passing"]

with open(output_file, mode="w", newline="") as file:
@@ -149,9 +156,14 @@ def evaluate_dataset(
try:
for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing correctness"):
result = self.run_correctness_eval(
row[query_column], row[reference_answer_column], row[generated_answer_column]
row[query_column],
row[reference_answer_column],
row[generated_answer_column],
openai_handler=openai_handler,
)

result[resource_id_column] = row[resource_id_column]

# Write the result to the CSV file
writer.writerow(result)

20 changes: 15 additions & 5 deletions app/evaluation/generation/faithfulness.py
@@ -11,7 +11,7 @@

from tqdm import tqdm

from evaluation.core.openai.openai import get_chat_completion
from app.services.openai import OpenAIHandler


FAITHFULLNESS_SYS_TMPL = """
@@ -77,7 +77,7 @@ def __init__(self, openai_api_key, model="gpt-4o-mini-2024-07-18", max_tokens=30
self.model = model
self.max_tokens = max_tokens

def run_faithfulness_eval(self, generated_answer: str, contexts: str):
def run_faithfulness_eval(self, generated_answer: str, contexts: str, openai_handler: OpenAIHandler):
"""
Evaluates the faithfulness of a generated answer against provided contexts based on three aspects.

@@ -92,8 +92,12 @@ def run_faithfulness_eval(self, generated_answer: str, contexts: str):
user_prompt = FAITHFULLNESS_USER_TMPL.format(generated_answer=generated_answer, contexts=contexts)
system_prompt = FAITHFULLNESS_SYS_TMPL

open_ai_response = get_chat_completion(
self.openai_api_key, user_prompt, system_prompt, ANSWER_JSON_SCHEMA, model=self.model, max_tokens=self.max_tokens
open_ai_response = openai_handler.get_chat_completion(
openai_api_key=self.openai_api_key,
user_prompt=user_prompt,
system_prompt=system_prompt,
answer_json_schema=ANSWER_JSON_SCHEMA,
max_tokens=self.max_tokens,
)

json_answer = json.loads(open_ai_response.get("choices")[0].get("message").get("content"))
@@ -146,6 +150,8 @@ def evaluate_dataset(
Returns:
- pd.DataFrame, the original dataframe with additional columns for relevancy, accuracy, conciseness and pertinence, and reasoning.
"""
openai_handler = OpenAIHandler(model=self.model)

fieldnames = [resource_id_column, "relevancy", "accuracy", "conciseness_and_pertinence", "reasoning"]

with open(output_file, mode="w", newline="") as file:
@@ -155,8 +161,12 @@ def evaluate_dataset(

try:
for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing faithfulness"):
result = self.run_faithfulness_eval(row[generated_answer_column], row[contexts_column])
result = self.run_faithfulness_eval(
row[generated_answer_column], row[contexts_column], openai_handler=openai_handler
)

result[resource_id_column] = row[resource_id_column]

# Write the result to the CSV file
writer.writerow(result)

17 changes: 9 additions & 8 deletions app/routes/evaluation_endpoints.py
@@ -168,7 +168,7 @@ async def evaluate_generation(
task.connect(params)

evaluator = CorrectnessEvaluator(openai_api_key, openai_model, threshold=correctness_threshold, max_tokens=max_tokens)
evaluation_metrics = evaluator.evaluate_dataset(
evaluation_metrics_correctness = evaluator.evaluate_dataset(
df=df,
output_file=output_file_correctness,
query_column=query_column,
@@ -178,35 +178,36 @@ async def evaluate_generation(
)

evaluator = FaithfulnessEvaluator(openai_api_key, openai_model, max_tokens=max_tokens)
evaluation_metrics = evaluator.evaluate_dataset(
evaluation_metrics_faithfulness = evaluator.evaluate_dataset(
df=df,
output_file=output_file_faithfulness,
generated_answer_column=generated_answer_column,
contexts_column=contexts_column,
resource_id_column=resource_id_column,
)

combined_metrics = {**evaluation_metrics_correctness, **evaluation_metrics_faithfulness}

# Upload metrics to clearml
if clearml_track_experiment and task:
for series_name, value in evaluation_metrics.items():
for series_name, value in combined_metrics.items():
task.get_logger().report_single_value(name=series_name, value=value)
task.close()

return evaluation_metrics
return combined_metrics

except Exception as e:
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=f"Error during generation evaluation: {str(e)}"
)


@router.post("/batch_generation")
async def batch_generation(
@router.post("/dataset_generation")
async def dataset_generation(
file: UploadFile = File(...),
limit: int = File(None),
question_column: str = File("openai_query"),
model_prompt: str = Form("llama3.1"),
llm_model: str = Form("llama3.1"),
search_text_boost: float = Form(1),
search_embedding_boost: float = Form(1),
k: int = Form(5),
@@ -241,7 +242,7 @@ async def batch_generation(
k=k,
text_boost=search_text_boost,
embedding_boost=search_embedding_boost,
llm_model=llm_model,
llm_model=model_prompt,
process=process,
job=job,
)