Conversation

@stefanoamorelli stefanoamorelli commented Dec 7, 2025

Description

Adds ContextualFaithfulnessEvaluator for RAG systems to detect hallucinations by validating whether response claims are grounded in retrieval context. This differs from the existing FaithfulnessEvaluator which checks against conversation history rather than retrieved documents.

The evaluator uses a 4-tier scoring system mapped to numeric values:

  • Not Faithful (0.0)
  • Partially Faithful (0.33)
  • Mostly Faithful (0.67)
  • Fully Faithful (1.0)

Also adds an optional retrieval_context field to Case and EvaluationData for passing retrieved data through the evaluation pipeline.
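
For reference, the tier-to-score mapping above corresponds to something like this minimal sketch (the constant and function names are hypothetical, not identifiers from this PR):

```python
# Illustrative sketch only: the tier labels and numeric values come from this PR's
# description; the constant and function names are hypothetical.
FAITHFULNESS_SCORES = {
    "Not Faithful": 0.0,
    "Partially Faithful": 0.33,
    "Mostly Faithful": 0.67,
    "Fully Faithful": 1.0,
}

def tier_to_score(tier: str) -> float:
    """Map a faithfulness tier label to its numeric score."""
    return FAITHFULNESS_SCORES[tier]

assert tier_to_score("Mostly Faithful") == 0.67
```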

Related Issues

#65

Documentation PR

Type of Change

New feature

Testing

New unit tests introduced

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


Commits

Adds an optional retrieval_context field to EvaluationData. This field stores documents from vector stores or retrieval systems, enabling RAG evaluation workflows where responses need validation against source context. [1]

[1]: https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-evaluate.html

Adds the same optional retrieval_context field to Case, mirroring the EvaluationData field so users can provide retrieved documents when defining test cases.

Passes retrieval_context from Case through to EvaluationData in both the sync and async paths so evaluators can access it.
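
Roughly, the pass-through works like the sketch below, shown with stand-in dataclasses (the library's actual Case and EvaluationData types and call sites will differ):

```python
# Stand-in types to illustrate the pass-through; not the library's actual classes.
from dataclasses import dataclass

@dataclass
class Case:
    input: str
    retrieval_context: list[str] | None = None  # new optional field

@dataclass
class EvaluationData:
    input: str
    response: str
    retrieval_context: list[str] | None = None  # mirrors the Case field

def to_evaluation_data(case: Case, response: str) -> EvaluationData:
    # Sync path: forward retrieval_context so evaluators can read it.
    return EvaluationData(
        input=case.input,
        response=response,
        retrieval_context=case.retrieval_context,
    )

async def to_evaluation_data_async(case: Case, response: str) -> EvaluationData:
    # Async path performs the same forwarding.
    return to_evaluation_data(case, response)
```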

Defines a 4-tier rating scale from Not Faithful to Fully Faithful, with guidance on evaluating factual claims against retrieval context, based on faithfulness metrics from the RAG evaluation literature. [2]

[2]: https://arxiv.org/abs/2309.01431
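
For intuition, a judge prompt built around such a scale could look something like this sketch (the wording is illustrative, not the prompt added by this PR):

```python
# Rough sketch of a prompt built around the 4-tier scale; wording is illustrative.
def build_faithfulness_prompt(response: str, retrieval_context: list[str]) -> str:
    context_block = "\n".join(f"- {doc}" for doc in retrieval_context)
    return (
        "Rate how faithful the response is to the retrieved context using one of:\n"
        "Not Faithful, Partially Faithful, Mostly Faithful, Fully Faithful.\n"
        "Count a claim as supported only if the context states or implies it.\n\n"
        f"Retrieved context:\n{context_block}\n\n"
        f"Response:\n{response}"
    )
```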

Adds ContextualFaithfulnessEvaluator, which validates whether response claims are grounded in retrieval context and is designed specifically for RAG systems. It uses structured output with faithfulness tiers mapped to scores [0.0, 0.33, 0.67, 1.0]. This differs from FaithfulnessEvaluator, which checks conversation history rather than retrieved documents. [3]

[3]: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/
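
One way the structured output could be modeled, sketched with pydantic (field names are assumptions, not necessarily what the PR uses):

```python
# Sketch of the structured verdict a judge model could be asked to return;
# not the PR's actual model.
from typing import Literal

from pydantic import BaseModel


class FaithfulnessVerdict(BaseModel):
    # The evaluator maps the chosen tier onto 0.0 / 0.33 / 0.67 / 1.0
    # (see the mapping in the PR description above).
    tier: Literal[
        "Not Faithful", "Partially Faithful", "Mostly Faithful", "Fully Faithful"
    ]
    reasoning: str
```
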
Tests the new field in Case and updates experiment serialization
assertions to include retrieval_context.

Adds unit tests for ContextualFaithfulnessEvaluator covering initialization, score mapping across all faithfulness tiers, input validation, prompt formatting behavior, async evaluation, and serialization.

Updates the documentation to show RAG evaluation with retrieval_context and to list ContextualFaithfulnessEvaluator and HarmfulnessEvaluator among the available evaluators.

@stefanoamorelli stefanoamorelli changed the title feat(types): add retrieval_context field to EvaluationData feat: add ContextualFaithfulnessEvaluator Dec 7, 2025
@stefanoamorelli stefanoamorelli marked this pull request as ready for review December 7, 2025 15:46