Context Precision score heavily influenced by position of relevant contexts in array #2013


Open
mshobana1990 opened this issue Apr 15, 2025 · 4 comments
Labels
bug (Something isn't working) · module-metrics (this is part of the metrics module)

Comments


mshobana1990 commented Apr 15, 2025

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
The ContextPrecision evaluator in RAGAS produces scores that are strongly influenced by the position of relevant contexts in the input array. When a relevant context appears as the first element of the retrieved_contexts array, the score is inflated (often >0.9) even when most contexts are irrelevant, and the score gradually decreases as the relevant context moves further down the array.

Ragas version: 0.2.14
Python version: 3.11

Code to Reproduce

The test below fails with a score of 0.9999999999.

from ragas import SingleTurnSample            # imports assumed for ragas 0.2.x
from ragas.metrics import ContextPrecision

def test_context_precision():
    sample = SingleTurnSample(
        reference="The capital of France is Paris.",
        retrieved_contexts=[
            "The capital of France is Paris",
            "Bahmni is a comprehensive, easy-to-use, and fully open-source Hospital Information System (HIS)",
            "it suitable for a wide range of healthcare facilities, from small clinics to large hospitals.",
            "A resilient EMR and hospital management system built on reliable open-source components.",
        ],
        user_input="What is the capital of France?",
    )

    # evaluator_llm is assumed to be a configured evaluator LLM wrapper defined elsewhere
    context_precision = ContextPrecision(llm=evaluator_llm)
    score = context_precision.single_turn_score(sample)
    print("Context Precision score: ", score)
    assert score < 0.3

Whereas

The test below passes; here I have only changed the position of the relevant context, moving it to the last element.

def test_context_precision():
    sample = SingleTurnSample(
        reference="The capital of France is Paris.",
        retrieved_contexts=[
            "Bahmni is a comprehensive, easy-to-use, and fully open-source Hospital Information System (HIS)",
            "it suitable for a wide range of healthcare facilities, from small clinics to large hospitals.",
            "A resilient EMR and hospital management system built on reliable open-source components.",
            "The capital of France is Paris",
        ],
        user_input="What is the capital of France?",
    )

    context_precision = ContextPrecision(llm=evaluator_llm)
    score = context_precision.single_turn_score(sample)
    print("Context Precision score: ", score)
    assert score < 0.3

Also,
When the relevant context is in the second position, the score is 0.49999999995.
When it is in the third position, the score is 0.3333333333.

Error trace
Not applicable.

Expected behavior
The score should not be affected by the position of the relevant context.

Additional context
The issue is consistent across OpenAI GPT-4o, AWS Bedrock Claude 3.7 Sonnet, and Azure OpenAI GPT-4o. This position bias severely impacts the reliability of the Context Precision metric for evaluating RAG systems. In real-world applications, the ordering of retrieved contexts should not affect the precision score, as the goal is to measure how many of the retrieved contexts are actually relevant to the question.

@mshobana1990 added the bug label on Apr 15, 2025
@dosubot (bot) added the module-metrics label on Apr 15, 2025
@KalyanKS-NLP
Contributor

The implementation of the context precision metric in the RAGAS library is correct; there is no bug here.

I think the issue is with the incorrect definition of the context precision metric given in the RAGAS documentation: "Context Precision is a metric that measures the proportion of relevant chunks in the retrieved contexts."

The actual definition of the context precision metric is: "The Context Precision metric evaluates the retriever's ability to rank relevant chunks higher than irrelevant ones for a given query. Specifically, it assesses whether the relevant chunks in the retrieved context are prioritized at the top of the ranking."

The Context Precision (denoted as Context Precision@K) is computed using the formula:

$$ \text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left( \text{Precision@k} \times v_k \right)}{\text{Total number of relevant chunks in the top } K \text{ chunks}} $$

Where:

  • K: The total number of chunks in the retrieved contexts.
  • $v_k$: A relevance indicator at rank $k$, where $v_k = 1$ if the chunk is relevant and $v_k = 0$ if it is not.
  • $\text{Precision@k} = \frac{\text{true positives@k}}{\text{true positives@k} + \text{false positives@k}}$: The precision at rank $k$, calculated as the ratio of relevant chunks up to position $k$ to the total number of chunks up to $k$.
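
To make this concrete, here is a minimal Python sketch of the formula above. It is not the RAGAS implementation; it assumes the per-chunk relevance verdicts $v_k$ have already been produced by the evaluator LLM and are passed in as a list of 0/1 values.

def context_precision_at_k(relevance: list[int]) -> float:
    """Compute Context Precision@K from 0/1 relevance indicators ordered by rank."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # no relevant chunks among the retrieved contexts
    weighted_sum = 0.0
    for k in range(1, len(relevance) + 1):
        v_k = relevance[k - 1]
        precision_at_k = sum(relevance[:k]) / k  # relevant chunks in top k, divided by k
        weighted_sum += precision_at_k * v_k
    return weighted_sum / total_relevant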

Context Precision Metric Computation

Let us compute the context precision metric for your example step by step.

Case-1: Relevant chunk at the top

Example

  • Question: "What is the capital of France?"
  • Ground Truth: "The capital of France is Paris."
  • Retrieved Contexts:
    1. "The capital of France is Paris"
    2. "Bahmni is a comprehensive, easy-to-use, and fully open-source Hospital Information System (HIS)"
    3. "It suitable for a wide range of healthcare facilities, from small clinics to large hospitals."
    4. "A resilient EMR and hospital management system built on reliable open-source components."

Step-by-Step Computation

  1. Relevance Check:

    • Chunk 1: Relevant – directly answers the query by stating the capital of France is Paris.
    • Chunk 2: Irrelevant – discusses Bahmni, unrelated to the capital of France.
    • Chunk 3: Irrelevant – describes healthcare facilities, unrelated to the query.
    • Chunk 4: Irrelevant – mentions an EMR system, unrelated to the query.
  2. Assign Relevance Indicators:

    • Chunk 1: $$v_1 = 1$$ (Relevant)
    • Chunk 2: $$v_2 = 0$$ (Irrelevant)
    • Chunk 3: $$v_3 = 0$$ (Irrelevant)
    • Chunk 4: $$v_4 = 0$$ (Irrelevant)
  3. Compute Precision@k:

    • At $$k = 1$$: 1 relevant / 1 total = $$\frac{1}{1} = 1.0$$
    • At $$k = 2$$: 1 relevant / 2 total = $$\frac{1}{2} = 0.5$$
    • At $$k = 3$$: 1 relevant / 3 total = $$\frac{1}{3} \approx 0.333$$
    • At $$k = 4$$: 1 relevant / 4 total = $$\frac{1}{4} = 0.25$$
  4. Compute Weighted Sum:

    • Multiply each Precision@k by its corresponding $$v_k$$:
      • For $$k = 1$$: $$1.0 \times v_1 = 1.0 \times 1 = 1.0$$
      • For $$k = 2$$: $$0.5 \times v_2 = 0.5 \times 0 = 0$$
      • For $$k = 3$$: $$0.333 \times v_3 = 0.333 \times 0 = 0$$
      • For $$k = 4$$: $$0.25 \times v_4 = 0.25 \times 0 = 0$$
    • Sum: $$1.0 + 0 + 0 + 0 = 1.0$$
  5. Compute Score:

    • Total number of relevant items in the retrieved contexts = 1 (only the first chunk is relevant).
    • Context Precision = $$\frac{\text{Weighted Sum}}{\text{Total Relevant Items}} = \frac{1.0}{1} = 1.0$$

Final Answer

The Context Precision score is 1.0, as the only relevant chunk is ranked at the top of the retrieved contexts.
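
With the sketch above, this case is simply:

# Case-1: the only relevant chunk is at rank 1
# (RAGAS reports 0.9999999999 rather than exactly 1.0, presumably due to a tiny smoothing term)
print(context_precision_at_k([1, 0, 0, 0]))  # 1.0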

Case-2: Relevant chunk at the bottom

Example

  • Question: "What is the capital of France?"
  • Ground Truth: "The capital of France is Paris."
  • Retrieved Contexts:
    1. "Bahmni is a comprehensive, easy-to-use, and fully open-source Hospital Information System (HIS)"
    2. "It suitable for a wide range of healthcare facilities, from small clinics to large hospitals."
    3. "A resilient EMR and hospital management system built on reliable open-source components."
    4. "The capital of France is Paris"

Step-by-Step Computation

  1. Relevance Check:

    • Chunk 1: Irrelevant – discusses Bahmni, unrelated to the capital of France.
    • Chunk 2: Irrelevant – describes healthcare facilities, unrelated to the query.
    • Chunk 3: Irrelevant – mentions an EMR system, unrelated to the query.
    • Chunk 4: Relevant – directly answers the query by stating the capital of France is Paris.
  2. Assign Relevance Indicators:

    • Chunk 1: $$v_1 = 0$$ (Irrelevant)
    • Chunk 2: $$v_2 = 0$$ (Irrelevant)
    • Chunk 3: $$v_3 = 0$$ (Irrelevant)
    • Chunk 4: $$v_4 = 1$$ (Relevant)
  3. Compute Precision@k:

    • At $$k = 1$$: 0 relevant / 1 total = $$\frac{0}{1} = 0.0$$
    • At $$k = 2$$: 0 relevant / 2 total = $$\frac{0}{2} = 0.0$$
    • At $$k = 3$$: 0 relevant / 3 total = $$\frac{0}{3} = 0.0$$
    • At $$k = 4$$: 1 relevant / 4 total = $$\frac{1}{4} = 0.25$$
  4. Compute Weighted Sum:

    • Multiply each Precision@k by its corresponding $$v_k$$:
      • For $$k = 1$$: $$0.0 \times v_1 = 0.0 \times 0 = 0.0$$
      • For $$k = 2$$: $$0.0 \times v_2 = 0.0 \times 0 = 0.0$$
      • For $$k = 3$$: $$0.0 \times v_3 = 0.0 \times 0 = 0.0$$
      • For $$k = 4$$: $$0.25 \times v_4 = 0.25 \times 1 = 0.25$$
    • Sum: $$0.0 + 0.0 + 0.0 + 0.25 = 0.25$$
  5. Compute Score:

    • Total number of relevant items in the retrieved contexts = 1 (only the fourth chunk is relevant).
    • Context Precision = $$\frac{\text{Weighted Sum}}{\text{Total Relevant Items}} = \frac{0.25}{1} = 0.25$$

Final Answer

The Context Precision score is 0.25, as the only relevant chunk is ranked at the bottom of the retrieved contexts.
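
The same sketch reproduces this case and the intermediate positions reported earlier in the thread:

# Case-2: the only relevant chunk is at rank 4
print(context_precision_at_k([0, 0, 0, 1]))  # 0.25

# Relevant chunk at rank 2 or rank 3, matching the ~0.5 and ~0.333 scores reported above
print(context_precision_at_k([0, 1, 0, 0]))  # 0.5
print(context_precision_at_k([0, 0, 1, 0]))  # 0.3333...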

I will create a PR to modify the incorrect definition in the RAGAS documentation.

@KalyanKS-NLP
Contributor

I hope this explanation clarifies your doubt, @mshobana1990. Please let me know if you need any further clarification.

@mshobana1990
Author

Hi @KalyanKS-NLP
It is pretty clear now! Thanks for the explanation!

But I wanted to evaluate noise in the retrieved contexts, i.e., to understand how many of the retrieved contexts are actually relevant.
Does RAGAS provide a way for this evaluation?

@Nithanaroy

Yeah, something like context relevance is missing here, if I'm not wrong.
