
updated the eval section in the README and batch sizes #67

Merged · 3 commits merged into main from ReadMe-Eval on Oct 23, 2023

Conversation

shamanez (Member)

No description provided.

README.md Outdated

For all available arguments and options, see `dalm train-rag-e2e --help`

## Evaluation

Here's a summary of the evaluation results obtained on a 200,000-line test CSV of patent abstracts. We assessed the system's performance using top-K hit rate and recall as evaluation metrics. To keep the evaluation fair, we ensured that the documents behind the query-passage pairs in the evaluation dataset were not present in the training dataset. Given the evaluation dataset, our first step was to encode the passage set into vectors and index them using the HNSW library. We then calculated the metrics based on approximate nearest neighbor similarity search.
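As a rough illustration of those two steps, indexing with hnswlib might look like the sketch below. The encoder name is a placeholder, not necessarily the retriever this repo trains:

```python
import hnswlib
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder encoder; substitute the trained retriever's passage encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
passages = ["First patent abstract ...", "Second patent abstract ..."]

# Encode the passage set into vectors (normalized so that inner product
# equals cosine similarity).
embeddings = encoder.encode(passages, normalize_embeddings=True)

# Index the vectors with HNSW for approximate nearest neighbor search.
index = hnswlib.Index(space="ip", dim=embeddings.shape[1])
index.init_index(max_elements=len(passages), ef_construction=200, M=16)
index.add_items(embeddings, np.arange(len(passages)))
index.set_ef(50)  # query-time accuracy/speed trade-off
```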
@tleyden (Contributor) commented on Oct 23, 2023:
This seems like a good start as it provides some concrete steps. For my own better understanding, here's a stab at a detailed explanation:

So you have a trained retriever, let's call that R. This retriever is trained to be good at finding the most relevant passages in a corpus given a question.

You have a ground-truth test dataset that was not seen during training, which consists of question/passage pairs. For example, 200K rows like the following:

| Question | Passage |
| --- | --- |
| What is the difference between NLP transformers and recurrent neural networks? | With the advent of transformers, the limits of scaling deep learning NLP were massively increased due to the parallelization opportunities .. etc |
| How do multi-modal systems like CLIP work? | The main insight from CLIP was to mix image embeddings with text embeddings, creating a hybrid embedding space .. etc |

To evaluate the retriever on this dataset, the fundamental question you're asking is: "Given a question and a corpus of passages, can it find the best matching passages?" To answer that, you would ideally do the following (sketched in code after the list):

  1. Use the trained retriever R to encode all passages into a vector store
  2. Grab a question, for example the first question in the table above
  3. Use the question encoder to transform it into an embedding vector: QE
  4. For each encoded passage (PE) in the vector store, compute the nearest-neighbor similarity score between QE and PE
  5. Find the top-K (e.g., top 5) best matches based on those similarity scores
  6. Compare them against the ground-truth top-K best passages to calculate precision and recall.
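Putting steps 2-6 into rough code, just a sketch for my own understanding: `index` is an hnswlib-style ANN index over the encoded passages, and `ground_truth_topk` is a hypothetical set of relevant passage ids (which, as noted below, we don't actually have):

```python
def precision_recall_at_k(index, question_encoder, question, ground_truth_topk, k=5):
    # Step 3: encode the question into an embedding vector QE.
    qe = question_encoder.encode([question], normalize_embeddings=True)
    # Steps 4-5: approximate nearest-neighbor search for the top-k passages.
    labels, _ = index.knn_query(qe, k=k)
    retrieved = set(labels[0].tolist())
    # Step 6: compare against the ground-truth top-k set.
    relevant = set(ground_truth_topk)
    hits = len(retrieved & relevant)
    return hits / k, hits / len(relevant)  # precision@k, recall@k
```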

However, we have to simplify the above evaluation steps due to a constraint on our ground-truth dataset: we don't actually have the top-K best matching passages for a given question; we only have the top-1 best passage match. You can see this constraint in the fact that the dataset consists of Question/Single-Passage pairs, not Question/Best-5-Passage pairs.

So the evaluation only considers whether the best matching passage, according to the ground-truth dataset, was found in the top-K matches from the actual vector store search described in step 5 above. If it was indeed found, it increases the overall evaluation score; otherwise it decreases the recall score. The higher the matched passage sits in the top-K ranking, the more it increases the evaluation score.
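In code, the simplified metric would roughly be a hit rate over the single ground-truth matches. Again just a sketch; `pairs` stands in for the question/gold-passage-id test set:

```python
def hit_rate_at_k(index, question_encoder, pairs, k=5):
    """Fraction of (question, gold_passage_id) pairs whose single
    ground-truth passage appears among the top-k ANN matches; with one
    relevant passage per question this doubles as recall@k."""
    hits = 0
    for question, gold_passage_id in pairs:
        qe = question_encoder.encode([question], normalize_embeddings=True)
        labels, _ = index.knn_query(qe, k=k)
        if gold_passage_id in labels[0]:
            hits += 1
    return hits / len(pairs)
```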

@metric-space metric-space merged commit 335a1b8 into main Oct 23, 2023
@metric-space metric-space deleted the ReadMe-Eval branch October 23, 2023 23:47