Context Precision score heavily influenced by position of relevant contexts in array #2013
Comments
The implementation of the context precision metric in the RAGAS library is correct; there is no bug here. I think the issue is with the incorrect definition of the context precision metric given in the RAGAS documentation: "Context Precision is a metric that measures the proportion of relevant chunks in the retrieved contexts." The actual definition is: "The Context Precision metric evaluates the retriever's ability to rank relevant chunks higher than irrelevant ones for a given query. Specifically, it assesses whether the relevant chunks in the retrieved contexts are prioritized at the top of the ranking."

Context Precision (denoted Context Precision@K) is computed using the formula:

$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@k} \times v_k\right)}{\text{Total number of relevant items in the top } K \text{ results}}$$

$$\text{Precision@k} = \frac{\text{true positives@k}}{\text{true positives@k} + \text{false positives@k}}$$

Where:

- $K$ is the total number of chunks in `retrieved_contexts`
- $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$
Context Precision Metric Computation

Let us compute the context precision metric for your example step by step (four retrieved contexts, of which exactly one is relevant).

Case-1: Relevant chunk at the top

Example: the single relevant chunk is at rank 1, followed by three irrelevant chunks.

Step-by-Step Computation

- $v_1 = 1$, $v_2 = v_3 = v_4 = 0$
- Precision@1 = 1/1 = 1.0; ranks 2 through 4 contribute nothing because their $v_k$ is 0
- Context Precision@4 = (1.0 × 1) / 1 = 1.0

Final Answer

The Context Precision score is 1.0, as the only relevant chunk is ranked at the top of the retrieved contexts.

Case-2: Relevant chunk at the bottom

Example: the same four contexts, but the single relevant chunk is moved to rank 4.

Step-by-Step Computation

- $v_1 = v_2 = v_3 = 0$, $v_4 = 1$
- Precision@4 = 1/4 = 0.25; ranks 1 through 3 contribute nothing because their $v_k$ is 0
- Context Precision@4 = (0.25 × 1) / 1 = 0.25

Final Answer

The Context Precision score is 0.25, as the only relevant chunk is ranked at the bottom of the retrieved contexts.

I will create a PR to modify the incorrect definition in the RAGAS documentation.
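To make the arithmetic above explicit, here is a small, library-free sketch of the formula. The function name and the 0/1 relevance lists are illustrative only; this is not the RAGAS implementation.

```python
# Minimal sketch of Context Precision@K, where relevance[k-1] is 1 if the
# chunk at rank k is relevant and 0 otherwise.
def context_precision_at_k(relevance: list[int]) -> float:
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    for k in range(1, len(relevance) + 1):
        precision_at_k = sum(relevance[:k]) / k   # relevant chunks in top-k / k
        score += precision_at_k * relevance[k - 1]  # only relevant ranks contribute
    return score / total_relevant

print(context_precision_at_k([1, 0, 0, 0]))  # Case-1: relevant chunk first -> 1.0
print(context_precision_at_k([0, 0, 0, 1]))  # Case-2: relevant chunk last  -> 0.25
```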
I hope this explanation clarifies your doubt, @mshobana1990. Please let me know if you need any further clarification.
Hi @KalyanKS-NLP. But I wanted to evaluate the noise in the retrieved contexts: I want to understand how much of the retrieved contexts is relevant and how much is not.
Yeah, something like context relevance is missing here, if I'm not wrong.
[ ] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug
The ContextPrecision evaluator in RAGAS produces inconsistent scores that are strongly influenced by the position of relevant contexts in the input array. When a relevant context appears as the first element in the retrieved_contexts array, the score is artificially inflated (often >0.9) even when most contexts are irrelevant, and the score gradually decreases as the relevant context moves further down the array.
Ragas version: 0.2.14
Python version: 3.11
Code to Reproduce
The test below fails with a score of 0.9999999999.
WHEREAS
The test below passes. Here I have only moved the relevant context to the last position.
Also:
When the relevant context is in the second position, the score is 0.49999999995.
When it is in the third position, the score is 0.3333333333.
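The original code snippets are not reproduced above, so here is a minimal sketch of the kind of test described, assuming the LLMContextPrecisionWithoutReference variant of the metric and an OpenAI evaluator LLM; the question and context strings are made up for illustration and are not the reporter's actual test data.

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextPrecisionWithoutReference

# Evaluator LLM (the issue reports the same behaviour with GPT-4o,
# Claude 3.7 Sonnet on Bedrock, and Azure OpenAI GPT-4o).
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

# One clearly relevant chunk plus three clearly irrelevant ones (illustrative text).
relevant = "The Eiffel Tower is located in Paris, France."
irrelevant = [
    "Photosynthesis converts sunlight into chemical energy in plants.",
    "The 2008 financial crisis originated in the US housing market.",
    "Basalt is a common extrusive igneous rock.",
]


async def score_with_relevant_at(index: int) -> float:
    """Score a sample whose single relevant chunk sits at the given index."""
    contexts = list(irrelevant)
    contexts.insert(index, relevant)
    sample = SingleTurnSample(
        user_input="Where is the Eiffel Tower located?",
        response="The Eiffel Tower is located in Paris, France.",
        retrieved_contexts=contexts,
    )
    return await context_precision.single_turn_ascore(sample)


async def main() -> None:
    for index in range(4):
        score = await score_with_relevant_at(index)
        # Reported scores: ~1.0, ~0.5, ~0.33, ~0.25 as the relevant chunk moves down.
        print(f"relevant chunk at index {index}: context_precision = {score:.4f}")


asyncio.run(main())
```

Under this setup, the observations above correspond to scores of roughly 1.0, 0.5, 0.33, and 0.25 as the single relevant chunk moves from index 0 to index 3.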
Error trace
Not applicable.
Expected behavior
The score should not be affected by the position of the relevant context.
Additional context
The issue is consistent across OpenAI GPT-4o, AWS Bedrock Claude 3.7 Sonnet, and Azure OpenAI GPT-4o. This position bias severely impacts the reliability of the Context Precision metric for evaluating RAG systems. In real-world applications, the ordering of retrieved contexts should not affect the precision score, as the goal is to measure how many of the retrieved contexts are actually relevant to the question.