
updated the eval section in the README and batch sizes #67

Merged · 3 commits merged into main from ReadMe-Eval on Oct 23, 2023

Conversation

shamanez (Member)

No description provided.

README.md Outdated

For all available arguments and options, see `dalm train-rag-e2e --help`

## Evaluation

Here's a summary of the evaluation results obtained on a 200,000-line test CSV of patent abstracts. We assessed the system's performance using top-K hit rate and recall as evaluation metrics. To keep the evaluation fair, we ensured that the documents behind the query-passage pairs in the evaluation dataset were not present in the training dataset. Given the evaluation dataset, our first step was to encode the passage set into vectors and index them using the HNSW library. We then calculated the metrics based on approximate nearest neighbor similarity search.
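As a rough illustration of those two steps, indexing with hnswlib might look like the sketch below. The encoder name is a placeholder, not necessarily the retriever this repo trains:

```python
import hnswlib
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder encoder; substitute the trained retriever's passage encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
passages = ["First patent abstract ...", "Second patent abstract ..."]

# Encode the passage set into vectors (normalized so that inner product
# equals cosine similarity).
embeddings = encoder.encode(passages, normalize_embeddings=True)

# Index the vectors with HNSW for approximate nearest neighbor search.
index = hnswlib.Index(space="ip", dim=embeddings.shape[1])
index.init_index(max_elements=len(passages), ef_construction=200, M=16)
index.add_items(embeddings, np.arange(len(passages)))
index.set_ef(50)  # query-time accuracy/speed trade-off
```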
@tleyden (Contributor) commented on Oct 23, 2023:
This seems like a good start as it provides some concrete steps. For my own better understanding, here's a stab at a detailed explanation:

So you have a trained retriever, let's call that R. This retriever is trained to be good at finding the most relevant passages in a corpus given a question.

You have a ground-truth test dataset that was not seen during training, which consists of question/passage pairs. For example, 200K rows like the following:

| Question | Passage |
| --- | --- |
| What is the difference between NLP transformers and recurrent neural networks? | With the advent of transformers, the limits of scaling deep learning NLP were massively increased due to the parallelization opportunities .. etc |
| How do multi-modal systems like CLIP work? | The main insight from CLIP was to mix image embeddings with text embeddings, creating a hybrid embedding space .. etc |

To evaluate the retriever on this dataset, the fundamental question you're asking is: "Given a question and a corpus of passages, can it find the best matching passages?" To answer that, you would ideally do the following (sketched in code after the list):

  1. Use the trained retriever R to encode all passages into a vector store
  2. Grab a question, for example the first question in the table above
  3. Use the question encoder to transform it into an embedding vector: QE
  4. For each encoded passage (PE) in the vector store, compute the nearest-neighbor similarity score between QE and PE
  5. Find the top-K (e.g., top 5) best matches based on those similarity scores
  6. Compare them against the ground-truth top-K best passages to calculate precision and recall.
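Putting steps 2-6 into rough code, just a sketch for my own understanding: `index` is an hnswlib-style ANN index over the encoded passages, and `ground_truth_topk` is a hypothetical set of relevant passage ids (which, as noted below, we don't actually have):

```python
def precision_recall_at_k(index, question_encoder, question, ground_truth_topk, k=5):
    # Step 3: encode the question into an embedding vector QE.
    qe = question_encoder.encode([question], normalize_embeddings=True)
    # Steps 4-5: approximate nearest-neighbor search for the top-k passages.
    labels, _ = index.knn_query(qe, k=k)
    retrieved = set(labels[0].tolist())
    # Step 6: compare against the ground-truth top-k set.
    relevant = set(ground_truth_topk)
    hits = len(retrieved & relevant)
    return hits / k, hits / len(relevant)  # precision@k, recall@k
```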

However, we have to simplify the above evaluation steps due to a constraint on our ground-truth dataset: we don't actually have the top-K best matching passages for a given question; we only have the top-1 best passage match. You can see this constraint in the fact that the dataset consists of Question/Single-Passage pairs, not Question/Best-5-Passage pairs.

So the evaluation only considers whether the best matching passage, according to the ground-truth dataset, was found in the top-K matches from the actual vector store search described in step 5 above. If it was indeed found, it increases the overall evaluation score; otherwise it decreases the recall score. The higher the matched passage sits in the top-K ranking, the more it increases the evaluation score.
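In code, the simplified metric would roughly be a hit rate over the single ground-truth matches. Again just a sketch; `pairs` stands in for the question/gold-passage-id test set:

```python
def hit_rate_at_k(index, question_encoder, pairs, k=5):
    """Fraction of (question, gold_passage_id) pairs whose single
    ground-truth passage appears among the top-k ANN matches; with one
    relevant passage per question this doubles as recall@k."""
    hits = 0
    for question, gold_passage_id in pairs:
        qe = question_encoder.encode([question], normalize_embeddings=True)
        labels, _ = index.knn_query(qe, k=k)
        if gold_passage_id in labels[0]:
            hits += 1
    return hits / len(pairs)
```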

@metric-space metric-space merged commit 335a1b8 into main Oct 23, 2023
@metric-space metric-space deleted the ReadMe-Eval branch October 23, 2023 23:47