| Developed by | Arize AI |
| --- | --- |
| Date of development | August 6, 2024 |
| Validator type | RAG, LLM Judge |
| Blog | https://docs.arize.com/arize/large-language-models/guardrails |
| License | Apache 2 |
| Input/Output | RAG Retrieval or Output |
Given a RAG application, this Guard will use an LLM Judge to decide whether the LLM response is acceptable. Users can instantiate the Guard with one of the Arize off-the-shelf evaluators (Context Relevancy, Hallucination, or QA Correctness), which match our off-the-shelf RAG evaluators in Phoenix.

Alternatively, users can customize the Guard with their own LLM Judge by writing a custom prompt class that inherits from the `ArizeRagEvalPromptBase(ABC)` class.
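For illustration, here is a minimal sketch of such a custom prompt class. The import path, the method name `generate_prompt`, and its argument names are assumptions modeled on the built-in prompt classes, and the `CitationSupportPrompt` class with its "supported"/"unsupported" labels is hypothetical; consult the validator source for the exact abstract interface.

```python
from guardrails.hub import ArizeRagEvalPromptBase  # assumed import path; see the validator source

class CitationSupportPrompt(ArizeRagEvalPromptBase):
    """Hypothetical custom judge: asks whether every claim in the response
    is backed by the retrieved context, answering 'supported' or 'unsupported'."""

    def generate_prompt(self, user_input_message: str, reference_text: str, llm_response: str) -> str:
        # Assumed hook: return the full prompt string sent to the Judge LLM.
        return (
            "You are grading a RAG response.\n"
            f"Question: {user_input_message}\n"
            f"Retrieved context: {reference_text}\n"
            f"Proposed response: {llm_response}\n"
            "Reply with exactly one word, supported or unsupported: "
            "is every claim in the response backed by the context?"
        )
```

An instance of such a class would then be passed as `eval_llm_prompt_generator`, paired with `llm_evaluator_fail_response="unsupported"` and `llm_evaluator_pass_response="supported"`.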
For the off-the-shelf Guards, we have benchmarked results on public datasets.

We benchmarked the Context Relevancy Guard on the "wiki_qa-train" benchmark dataset in `benchmark_context_relevancy_prompt.py`.
Model: gpt-4o-mini

Guard Results

```
              precision    recall  f1-score   support

    relevant       0.70      0.86      0.77        93
   unrelated       0.85      0.68      0.76       107

    accuracy                           0.77       200
   macro avg       0.78      0.77      0.76       200
weighted avg       0.78      0.77      0.76       200
```

Latency (seconds)

```
count    200.000000
mean       2.812122
std        1.753805
min        1.067620
25%        1.708051
50%        2.248962
75%        3.321251
max       14.102804
Name: guard_latency_gpt-4o-mini, dtype: float64
```

median latency: 2.2489616039965767
This Guard was benchmarked on the "halueval_qa_data" from the HaluEval benchmark:
Model: gpt-4o-mini

Guard Results

```
              precision    recall  f1-score   support

     factual       0.79      0.97      0.87       129
hallucinated       0.96      0.73      0.83       121

    accuracy                           0.85       250
   macro avg       0.87      0.85      0.85       250
weighted avg       0.87      0.85      0.85       250
```

Latency (seconds)

```
count    250.000000
mean       1.865513
std        0.603700
min        1.139974
25%        1.531160
50%        1.758210
75%        2.026153
max        6.403010
Name: guard_latency_gpt-4o-mini, dtype: float64
```

median latency: 1.7582097915001214
We benchmarked the QA Correctness Guard on version 2.0 of the large-scale Stanford Question Answering Dataset (SQuAD 2.0): https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15785042.pdf
Model: gpt-4o-mini

Guard Results

```
              precision    recall  f1-score   support

     correct       1.00      0.96      0.98       133
   incorrect       0.96      1.00      0.98       117

    accuracy                           0.98       250
   macro avg       0.98      0.98      0.98       250
weighted avg       0.98      0.98      0.98       250
```

Latency (seconds)

```
count    250.000000
mean       2.610912
std        1.415877
min        1.148114
25%        1.678278
50%        2.263149
75%        2.916726
max       10.625763
Name: guard_latency_gpt-4o-mini, dtype: float64
```

median latency: 2.263148645986803
```bash
guardrails hub install hub://arize-ai/llm_rag_evaluator
```
In this example, we apply the validator to a string output generated by an LLM.
```python
# Import Guard and Validator
from guardrails.hub import LlmRagEvaluator, HallucinationPrompt
from guardrails import Guard

# Setup Guard
guard = Guard().use(
    LlmRagEvaluator(
        eval_llm_prompt_generator=HallucinationPrompt(prompt_name="hallucination_judge_llm"),
        llm_evaluator_fail_response="hallucinated",
        llm_evaluator_pass_response="factual",
        llm_callable="gpt-4o-mini",
        on_fail="exception",
        on="prompt",
    ),
)

metadata = {
    "user_message": "User message",
    "context": "Context retrieved from RAG application",
    "llm_response": "Proposed response from LLM before Guard is applied",
}

guard.validate(llm_output="Proposed response from LLM before Guard is applied", metadata=metadata)
```
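Because the Guard above uses `on_fail="exception"`, a response the judge labels "hallucinated" raises an error instead of passing through. Here is a minimal handling sketch; the broad `except` clause is deliberate, since the exact exception class can vary across guardrails versions.

```python
# With on_fail="exception", a failing judgment surfaces as a raised error.
try:
    guard.validate(
        llm_output="Proposed response from LLM before Guard is applied",
        metadata=metadata,
    )
    # Validation passed: safe to return the proposed response downstream.
except Exception as err:
    # Validation failed: log and fall back rather than serving the response.
    print(f"Guard blocked the response: {err}")
```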
`__init__(self, on_fail="noop")`

Initializes a new instance of the LlmRagEvaluator class.

Parameters

- `eval_llm_prompt_generator` (Type[ArizeRagEvalPromptBase]): Child class that will use a fixed interface to generate a prompt for an LLM Judge given the retrieved context, user input message, and proposed LLM response. Off-the-shelf child classes include `QACorrectnessPrompt`, `HallucinationPrompt`, and `ContextRelevancyPrompt`.
- `llm_evaluator_fail_response` (str): Expected string output from the Judge LLM when the validator fails, e.g. "hallucinated".
- `llm_evaluator_pass_response` (str): Expected string output from the Judge LLM when the validator passes, e.g. "factual".
- `llm_callable` (str): Callable LLM string used to instantiate the LLM Judge, such as `gpt-4o-mini`.
- `on_fail` (str, Callable): The policy to enact when a validator fails. If `str`, must be one of `reask`, `fix`, `filter`, `refrain`, `noop`, `exception`, or `fix_reask`. Otherwise, must be a function that is called when the validator fails.
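As a second configuration example, the same wiring works with the QA correctness judge. This is a sketch rather than canonical usage: the `prompt_name` value is arbitrary, and the pass/fail strings simply mirror the "correct"/"incorrect" labels from the SQuAD 2.0 benchmark above.

```python
from guardrails.hub import LlmRagEvaluator, QACorrectnessPrompt
from guardrails import Guard

# Same pattern as the usage example, swapping in the QA correctness judge.
qa_guard = Guard().use(
    LlmRagEvaluator(
        eval_llm_prompt_generator=QACorrectnessPrompt(prompt_name="qa_correctness_judge_llm"),
        llm_evaluator_fail_response="incorrect",  # judge output that fails validation
        llm_evaluator_pass_response="correct",    # judge output that passes validation
        llm_callable="gpt-4o-mini",
        on_fail="noop",  # record the failure without raising
    ),
)
```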
`validate(self, value, metadata) -> ValidationResult`

Validates the given `value` using the rules defined in this validator, relying on the `metadata` provided to customize the validation process. This method is automatically invoked by `guard.parse(...)`, ensuring the validation logic is applied to the input data.

Note:

1. This method should not be called directly by the user. Instead, invoke `guard.parse(...)`, where this method will be called internally for each associated Validator.
2. When invoking `guard.parse(...)`, ensure to pass the appropriate `metadata` dictionary that includes keys and values required by this validator. If `guard` is associated with multiple validators, combine all necessary metadata into a single dictionary.

Parameters

- `value` (Any): The input value to validate.
- `metadata` (dict): A dictionary containing metadata required for validation. Keys and values must match the expectations of this validator.

| Key | Type | Description | Default |
| --- | --- | --- | --- |
| `user_message` | String | User input message to RAG application. | N/A |
| `context` | String | Retrieved context from RAG application. | N/A |
| `llm_response` | String | Proposed response from the LLM used in the RAG application. | N/A |
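To make the multi-validator note above concrete, here is a small sketch. The three RAG keys are the ones from the table; the extra key, and the second validator it would feed, are purely hypothetical.

```python
# Keys required by LlmRagEvaluator, merged with a hypothetical key that a
# second validator attached to the same guard would consume.
combined_metadata = {
    "user_message": "What is the capital of France?",
    "context": "Paris has been the capital of France since 508 CE.",
    "llm_response": "The capital of France is Paris.",
    "other_validator_key": "value for another validator",  # hypothetical
}

# guard.parse(...) calls validate(...) internally for each attached validator,
# handing every validator the same merged metadata dictionary.
guard.parse(llm_output="The capital of France is Paris.", metadata=combined_metadata)
```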