Facing issue generating correct evaluation report for RAG application #1997

Abhinay-14 · 2024-08-09T08:08:38Z

Hello,

I'm currently evaluating my RAG application, which uses a WebSocket for interaction. I successfully created the knowledge base, generated the test set, and performed the evaluation. However, when I check the evaluation report, I see the following:

Generator: 0%
Evaluator: 0%
Rewriter: 0%
Router: 100%
Knowledge Base: 100%
Overall Correctness Score: 0%

I’ve implemented an asynchronous method to call the WebSocket API and get the response. This same method is used in the get_answer function.

This is the code I am currently using:
```python
import pandas as pd
import giskard
import websockets
import json
import asyncio
import openai
import os
import fitz

from dotenv import load_dotenv, find_dotenv
from giskard.rag import generate_testset, KnowledgeBase
from giskard.rag import QATestset
from giskard.rag import evaluate
from giskard.llm.client.openai import OpenAIClient
from giskard.llm import set_llm_model, set_default_client, set_llm_api

_ = load_dotenv(find_dotenv())

openai.api_key = os.getenv("OPENAI_API_KEY")

set_llm_api("openai")
llmClient = OpenAIClient(model="gpt-3.5-turbo-1106")
set_default_client(llmClient)

sid = "a3bf-cac7defc43aa"
event = "message"
partitions = ["partitions"]
endpoint = "ws://127.0.0.1/chat"

# -------- To generate test suite and evaluate the application ----------

CHUNK_SIZE = 800
OVERLAP_SIZE = 100

def split_into_chunks(content, chunk_size, overlap_size):
    return [content[i:i+chunk_size] for i in range(0, len(content), chunk_size - overlap_size)]

def read_files_from_folder(folder_path):
    file_data = []

    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)

        if file_name.endswith(".pdf"):
            doc = fitz.open(file_path)
            content = ""
            for page_num in range(len(doc)):
                page = doc.load_page(page_num)
                content += page.get_text()
            doc.close()
        else:
            continue

        chunks = split_into_chunks(content, CHUNK_SIZE, OVERLAP_SIZE)
        for chunk in chunks:
            file_data.append({"file_name": file_name, "content": chunk})

    return file_data

folder_path = "src/main/giskardevaluation/docs"
file_data = read_files_from_folder(folder_path)

df = pd.DataFrame(file_data, columns=["file_name", "content"])

knowledge_base = KnowledgeBase(df)
print("knowledge base: ", knowledge_base)

testset = generate_testset(
    knowledge_base,
    num_questions=15,
    language='en',
    agent_description="A chatbot to answer questions based on the provided company relevant documents. The documents can include annual reports, financial reports, future plans and many more related to that specific company.",
)

testset.save("src/main/giskardevaluation/my_testset.jsonl")

loaded_testset = QATestset.load("src/main/giskardevaluation/my_testset.jsonl")

# ------ Function to call websocket api and get response -----------
async def fetch_response_from_websocket(question, endpoint, sid, event, partitions):
    message = {
        "question": question,
        "sid": sid,
        "event": event,
        "partitions": partitions
    }
    async with websockets.connect(endpoint) as websocket:
        await websocket.send(json.dumps(message))
        return await receive_complete_response(websocket)

async def receive_complete_response(websocket):
    complete_response = ""
    while True:
        try:
            response = await websocket.recv()
            if response.startswith("data:"):
                response = response[5:]
            data = json.loads(response)
            if data["type"] == "answer":
                complete_response += data["value"]
            elif data["type"] == "metadata":
                break
        except json.JSONDecodeError:
            print(f"Received non-JSON response: {response}")
            continue
        except Exception as e:
            print(f"Error occurred: {e}")
            break
    print("websocket response:")
    print(complete_response)
    return complete_response

async def get_answer_from_agent(messages):
    user_input = messages[-1]["content"]
    print('question: ', user_input)
    return await fetch_response_from_websocket(user_input, endpoint, sid, event, partitions)

# ----------- get answer function ---------------

def get_answer_fn(question: str, history=None) -> str:
    """A function representing your RAG agent."""
    async def answer_fn():
        messages = history if history else []
        messages.append({"role": "user", "content": question})
        answer = await get_answer_from_agent(messages)
        return answer

    return asyncio.run(answer_fn())

report = evaluate(get_answer_fn, testset=loaded_testset, knowledge_base=knowledge_base)

report.to_html("src/main/giskardevaluation/rag_eval_report.html")
```

Could anyone please help me understand why the generator, evaluator, and rewriter metrics are showing 0%, while the router and knowledge base are at 100%? Any guidance on where I might be going wrong would be greatly appreciated.

Thank you!

Answered by pierlj

pierlj (Maintainer) · 2024-08-12T08:59:40Z

Hello, can you manually check that get_answer_fn returns correct answers when you call it? I suspect that you are not getting any answer at all.
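One way to run that manual check, as a rough sketch: loop over a few questions and collect the ones that come back empty. The helper `find_empty_answers` and the stand-in `fake_answer_fn` are hypothetical names used only for illustration; in practice you would pass your real get_answer_fn and a handful of questions from your test set.

```python
from typing import Callable, Iterable, List


def find_empty_answers(answer_fn: Callable[[str], str], questions: Iterable[str]) -> List[str]:
    """Return the questions for which answer_fn gave no usable text."""
    failed = []
    for question in questions:
        answer = answer_fn(question)
        # Treat None, non-strings, and whitespace-only strings as "no answer".
        if not isinstance(answer, str) or not answer.strip():
            failed.append(question)
    return failed


# Hypothetical stand-in for the real get_answer_fn, only here so the sketch runs
# without a WebSocket server. It returns an empty string for one question.
def fake_answer_fn(question: str, history=None) -> str:
    return "" if "revenue" in question else "The report covers fiscal year 2023."


sample_questions = ["What was the revenue?", "Which year does the report cover?"]
print(find_empty_answers(fake_answer_fn, sample_questions))
```

If the returned list contains most or all of your questions, the problem is in the WebSocket call rather than in the evaluation.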

When all answers are incorrect, it is expected that the Knowledge Base and Router components are at 100%. For example, the Knowledge Base score is computed as 1 minus the gap between the topics with the best and worst correctness, so if every answer is incorrect, the gap is 0 and the score is 100%. The other components are computed directly from the correctness of the answers and are therefore at 0% when every answer is wrong.
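To make the arithmetic concrete, here is a toy illustration of the Knowledge Base score as described above (1 minus the gap between the best and worst topic). This is not Giskard's actual implementation; the function name and the topic names are made up for the example.

```python
from typing import Dict


def knowledge_base_score(correctness_by_topic: Dict[str, float]) -> float:
    """Toy version of the rule: 1 minus (best topic correctness - worst topic correctness)."""
    best = max(correctness_by_topic.values())
    worst = min(correctness_by_topic.values())
    return 1.0 - (best - worst)


# Every answer wrong: all topics sit at 0.0, so the gap is 0 and the score is 1.0 (100%).
all_wrong = {"Annual report": 0.0, "Financials": 0.0, "Future plans": 0.0}

# Uneven correctness across topics: the gap is 0.9 - 0.4, so the score drops.
mixed = {"Annual report": 0.9, "Financials": 0.4, "Future plans": 0.7}

print(knowledge_base_score(all_wrong))
print(knowledge_base_score(mixed))
```

So a 100% Knowledge Base score next to a 0% overall correctness only says that all topics failed equally, not that retrieval is working.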

I also see that your agent does not support history. By default, the generate_testset function generates some questions split into two messages (conversational questions). Your agent will not handle these correctly, since it cannot receive the context message before the question, which might reduce the score of some components.
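If the agent itself cannot take a history, one possible workaround is to fold the earlier messages into the text that is sent as the question. The sketch below assumes that history is a list of dicts with a "content" key, which is how the code in the question already reads it. `make_history_aware` and `ask_agent` are hypothetical names, and `ask_agent` stands in for whatever function sends a single string to the agent. Whether the agent handles a multi-line question well is something to verify.

```python
from typing import Callable, List, Optional


def make_history_aware(ask_agent: Callable[[str], str]) -> Callable[..., str]:
    """Wrap a single-message agent so earlier messages are prepended to the question."""

    def answer_fn(question: str, history: Optional[List[dict]] = None) -> str:
        # Build a new list of texts; the caller's history list is left unmodified.
        previous = [message["content"] for message in (history or [])]
        prompt = "\n".join(previous + [question])
        return ask_agent(prompt)

    return answer_fn


# Hypothetical stub agent that echoes what it received, to show the combined prompt.
def echo_agent(prompt: str) -> str:
    return prompt


answer_fn = make_history_aware(echo_agent)
history = [{"role": "user", "content": "I am asking about the 2023 annual report."}]
print(answer_fn("What was the net income?", history))
```

With this wrapper, the context message of a conversational question reaches the agent together with the question instead of being dropped.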
