A comprehensive framework for evaluating GenAI applications.
This is a WIP. We’re actively adding features, fixing issues, and expanding examples. Please give it a try, share feedback, and report bugs.
- Multi-Framework Support: Seamlessly use metrics from Ragas, DeepEval, and custom implementations
- Turn & Conversation-Level Evaluation: Support for both individual queries and multi-turn conversations
- Evaluation Types: Response, Context, Tool Call, Overall Conversation, and Script-based evaluation
- LLM Provider Flexibility: OpenAI, Watsonx, Gemini, vLLM, and others
- API Integration: Direct integration with an external API for real-time data generation (if enabled)
- Setup/Cleanup Scripts: Support for running setup and cleanup scripts before/after each conversation evaluation (applicable when API is enabled)
- Flexible Configuration: Configurable environment & metric metadata
- Rich Output: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)
- Early Validation: Catch configuration errors before expensive LLM calls
- Statistical Analysis: Statistics for every metric with score distribution analysis
# From Git
pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git
# Local Development
pip install uv
uv sync
# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Optional: For script-based evaluations requiring Kubernetes access
export KUBECONFIG="/path/to/your/kubeconfig"
# Run evaluation
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --output-dir <OUTPUT_DIR>
Please make any necessary modifications to system.yaml and evaluation_data.yaml. The evaluation_data.yaml file includes sample data for guidance.
# Set required environment variable(s) for both Judge-LLM and API authentication (for MCP)
export OPENAI_API_KEY="your-evaluation-llm-key"
export API_KEY="your-api-endpoint-key"
# Ensure API is running at configured endpoint
# Default: http://localhost:8080
# Run with API-enabled configuration
lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Use configuration with api.enabled: false
# Pre-fill response, contexts & tool_calls data in YAML
lightspeed-eval --system-config config/system_api_disabled.yaml --eval-data config/evaluation_data.yaml
- Ragas -- docs on the Ragas website
  - Response Evaluation
  - Context Evaluation
- Custom
  - Response Evaluation
    - answer_correctness - Response correctness evaluation against the expected response
    - intent_eval - Evaluates whether the response demonstrates the expected intent or purpose
    - keywords_eval - Keywords evaluation with alternatives (ALL keywords must match, case insensitive)
  - Tool Evaluation
    - tool_eval - Validates tool calls and arguments with regex pattern matching
- Script-based
  - Action Evaluation
    - script:action_eval - Executes verification scripts to validate actions (e.g., infrastructure changes)
- DeepEval -- docs on the DeepEval website
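Metrics are referenced by framework-prefixed identifiers (ragas:, deepeval:, custom:, script:) wherever metrics are listed in the system or evaluation data configuration. Below is a minimal sketch of how identifiers from different frameworks can be combined in evaluation data, using only field names that appear in the examples further down; the conversation id and values are illustrative.

```yaml
# Sketch: mixing metrics from different frameworks in evaluation data
- conversation_group_id: "metrics_demo"        # illustrative conversation id
  conversation_metrics:
    - "deepeval:conversation_completeness"     # DeepEval, conversation level
  turns:
    - turn_id: demo1
      query: What is OpenShift Virtualization?
      turn_metrics:
        - "ragas:response_relevancy"           # Ragas, response evaluation
        - "custom:keywords_eval"               # custom, keywords evaluation
        - "custom:intent_eval"                 # custom, intent evaluation
      expected_keywords: [["virtualization"]]  # required by custom:keywords_eval
      expected_intent: "explain a concept"     # required by custom:intent_eval
```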
# Core evaluation parameters
core:
# Maximum number of threads, set to null for Python default.
# 50 is OK on a typical laptop. Check your Judge-LLM service for max requests per minute
max_threads: 50
# Judge-LLM Configuration
llm:
provider: openai # openai, watsonx, azure, gemini etc.
model: gpt-4o-mini # Model name for the provider
temperature: 0.0 # Generation temperature
max_tokens: 512 # Maximum tokens in response
timeout: 300 # Request timeout in seconds
num_retries: 3 # Retry attempts
# Lightspeed API Configuration for Real-time Data Generation
api:
enabled: true # Enable/disable API calls
api_base: http://localhost:8080 # Base API URL
endpoint_type: streaming # streaming or query endpoint
timeout: 300 # API request timeout in seconds
provider: openai # LLM provider for API queries (optional)
model: gpt-4o-mini # Model to use for API queries (optional)
no_tools: null # Whether to bypass tools (optional)
system_prompt: null # Custom system prompt (optional)
# Metrics Configuration with thresholds and defaults
metrics_metadata:
turn_level:
"ragas:response_relevancy":
threshold: 0.8
description: "How relevant the response is to the question"
default: true # Used by default when turn_metrics is null
"ragas:faithfulness":
threshold: 0.8
description: "How faithful the response is to the provided context"
default: false # Only used when explicitly specified
"custom:intent_eval":
threshold: 1 # Binary evaluation (0 or 1)
description: "Intent alignment evaluation using custom LLM evaluation"
"custom:tool_eval":
description: "Tool call evaluation comparing expected vs actual tool calls (regex for arguments)"
"custom:keywords_eval": # Binary evaluation (0 or 1)
description: "Keywords evaluation (ALL match) with sequential alternate checking (case insensitive)"
conversation_level:
"deepeval:conversation_completeness":
threshold: 0.8
description: "How completely the conversation addresses user intentions"
# Output Configuration
output:
output_dir: ./eval_output
base_filename: evaluation
enabled_outputs: # Enable specific output types
- csv # Detailed results CSV
- json # Summary JSON with statistics
- txt # Human-readable summary
# Visualization Configuration
visualization:
figsize: [12, 8] # Graph size (width, height)
dpi: 300 # Image resolution
enabled_graphs:
- "pass_rates" # Pass rate bar chart
- "score_distribution" # Score distribution box plot
- "conversation_heatmap" # Heatmap of conversation performance
- "status_breakdown" # Pie chart for pass/fail/error breakdown# Judge-LLM Google Gemini
llm:
provider: "gemini"
model: "gemini-1.5-pro"
temperature: 0.0
max_tokens: 512
timeout: 120
num_retries: 3
# Embeddings for Judge-LLM
# provider: "huggingface" or "openai"
# model: model name
# provider_kwargs: additional arguments,
# for examples see https://docs.ragas.io/en/stable/references/embeddings/#ragas.embeddings.HuggingfaceEmbeddings
embedding:
provider: "huggingface"
model: "sentence-transformers/all-mpnet-base-v2"
provider_kwargs:
# cache_folder: <path_for_downloaded_model>
model_kwargs:
device: "cpu"
...
- conversation_group_id: "test_conversation"
description: "Sample evaluation"
# Optional: Environment setup/cleanup scripts, when API is enabled
setup_script: "scripts/setup_env.sh" # Run before conversation
cleanup_script: "scripts/cleanup_env.sh" # Run after conversation
# Conversation-level metrics
conversation_metrics:
- "deepeval:conversation_completeness"
conversation_metrics_metadata:
"deepeval:conversation_completeness":
threshold: 0.8
turns:
- turn_id: id1
query: What is OpenShift Virtualization?
response: null # Populated by API if enabled, otherwise provide
contexts:
- OpenShift Virtualization is an extension of the OpenShift ...
attachments: [] # Attachments (Optional)
expected_keywords: [["virtualization"], ["openshift"]] # For keywords_eval evaluation
expected_response: OpenShift Virtualization is an extension of the OpenShift Container Platform that allows running virtual machines alongside containers
expected_intent: "explain a concept" # Expected intent for intent evaluation
# Per-turn metrics (overrides system defaults)
turn_metrics:
- "ragas:faithfulness"
- "custom:keywords_eval"
- "custom:answer_correctness"
- "custom:intent_eval"
# Per-turn metric configuration
turn_metrics_metadata:
"ragas:faithfulness":
threshold: 0.9 # Override system default
# turn_metrics: null (omitted) → Use system defaults (metrics with default=true)
- turn_id: id2
query: Skip this turn evaluation
turn_metrics: [] # Skip evaluation for this turn
- turn_id: id3
query: Create a namespace called test-ns
verify_script: "scripts/verify_namespace.sh" # Script-based verification
turn_metrics:
- "script:action_eval" # Script-based evaluation (if API is enabled)- Real-time data generation: Queries are sent to external API
- Dynamic responses:
responseandtool_callsfields populated by API - Conversation context: Conversation context is maintained across turns
- Authentication: Use
API_KEYenvironment variable - Data persistence: Saves amended
response/tool_callsdata to output directory so it can be used with API disabled
- Static data mode: Use pre-filled
responseandtool_callsdata - Faster execution: No external API calls
- Reproducible results: Same data used across runs
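Both modes are selected through the api.enabled flag in the system configuration shown earlier; apart from pre-filling data when the API is off, the evaluation data does not need to change. A minimal sketch, reusing fields from the api block above:

```yaml
# Sketch: api.enabled selects between the two modes (fields from the api block above)
api:
  enabled: true                     # true  -> real-time: queries are sent to the API and
                                    #          response/tool_calls are populated per turn
                                    # false -> static: pre-filled response/tool_calls in the
                                    #          evaluation data YAML are used instead
  api_base: http://localhost:8080
```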
| Field | Type | Required | Description |
|---|---|---|---|
| conversation_group_id | string | ✅ | Unique identifier for conversation |
| description | string | ❌ | Optional description |
| setup_script | string | ❌ | Path to setup script (Optional, used when API is enabled) |
| cleanup_script | string | ❌ | Path to cleanup script (Optional, used when API is enabled) |
| conversation_metrics | list[string] | ❌ | Conversation-level metrics (Optional, if override is required) |
| conversation_metrics_metadata | dict | ❌ | Conversation-level metric config (Optional, if override is required) |
| turns | list[TurnData] | ✅ | List of conversation turns |
| Field | Type | Required | Description | API Populated |
|---|---|---|---|---|
| turn_id | string | ✅ | Unique identifier for the turn | ❌ |
| query | string | ✅ | The question/prompt to evaluate | ❌ |
| response | string | 📋 | Actual response from system | ✅ (if API enabled) |
| contexts | list[string] | 📋 | Context information for evaluation | ✅ (if API enabled) |
| attachments | list[string] | ❌ | Attachments | ❌ |
| expected_keywords | list[list[string]] | 📋 | Expected keywords for keyword evaluation (list of alternatives) | ❌ |
| expected_response | string | 📋 | Expected response for comparison | ❌ |
| expected_intent | string | 📋 | Expected intent for intent evaluation | ❌ |
| expected_tool_calls | list[list[list[dict]]] | 📋 | Expected tool call sequences (multiple alternative sets) | ❌ |
| tool_calls | list[list[dict]] | ❌ | Actual tool calls from API | ✅ (if API enabled) |
| verify_script | string | 📋 | Path to verification script | ❌ |
| turn_metrics | list[string] | ❌ | Turn-specific metrics to evaluate | ❌ |
| turn_metrics_metadata | dict | ❌ | Turn-specific metric configuration | ❌ |
📋 Required based on metrics: Some fields are required only when using specific metrics.
Examples:
- expected_keywords: Required for custom:keywords_eval (case insensitive matching)
- expected_response: Required for custom:answer_correctness
- expected_intent: Required for custom:intent_eval
- expected_tool_calls: Required for custom:tool_eval (multiple alternative sets format)
- verify_script: Required for script:action_eval (used when API is enabled)
- response: Required for most metrics (auto-populated if API enabled)
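Putting those requirements together, a turn that enables a metric must carry the matching expected_* field. A minimal sketch of a single turn entry (values are illustrative):

```yaml
# Sketch: expected_* fields paired with the metrics that require them (illustrative values)
- turn_id: fields_demo
  query: What is OpenShift Virtualization?
  expected_keywords: [["virtualization"], ["openshift"]]  # custom:keywords_eval
  expected_intent: "explain a concept"                    # custom:intent_eval
  expected_response: OpenShift Virtualization is an extension of the OpenShift Container Platform
  turn_metrics:
    - "custom:keywords_eval"
    - "custom:intent_eval"
    - "custom:answer_correctness"                         # needs expected_response
```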
| Override Value | Behavior |
|---|---|
| null (or omitted) | Use system defaults (metrics with default: true) |
| [] (empty list) | Skip evaluation for this turn |
| ["metric1", ...] | Use specified metrics only |
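The same three behaviors expressed directly in evaluation data (turn ids and queries are illustrative):

```yaml
# Sketch: the three turn_metrics override behaviors from the table above
turns:
  - turn_id: defaults_turn
    query: Answer using system default metrics
    # turn_metrics omitted (null) -> metrics with default: true are used
  - turn_id: skipped_turn
    query: Skip this turn evaluation
    turn_metrics: []                  # empty list -> evaluation skipped for this turn
  - turn_id: explicit_turn
    query: Evaluate with specific metrics only
    turn_metrics:
      - "ragas:faithfulness"          # only the listed metrics run
```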
The custom:tool_eval metric supports flexible matching with multiple alternative patterns:
- Format: [[[tool_calls, ...]], [[tool_calls]], ...] (list of list of list)
- Matching: Tries each alternative until one matches
- Use Cases: Optional tools, multiple approaches, default arguments, skip scenarios
- Empty Sets: [] represents "no tools" and must come after primary alternatives
# Multiple alternative sets format: [[[tool_calls, ...]], [[tool_calls]], ...]
expected_tool_calls:
- # Alternative 1: Primary approach
- # Sequence 1
- tool_name: oc_get
arguments:
kind: pod
name: openshift-light* # Regex patterns supported
- # Sequence 2 (if multiple parallel tool calls needed)
- tool_name: oc_describe
arguments:
kind: pod
- # Alternative 2: Different approach
- # Sequence 1
- tool_name: kubectl_get
arguments:
resource: pods
- # Alternative 3: Skip scenario (optional)
[] # When model has information from previous conversation
The framework supports script-based evaluations. Note: Scripts only execute when the API is enabled - they are designed to test against actual environment changes.
- Setup scripts: Run before conversation evaluation (e.g., create failed deployment for troubleshoot query)
- Cleanup scripts: Run after conversation evaluation (e.g., cleanup failed deployment)
- Verify scripts: Run per turn for the script:action_eval metric (e.g., validate whether a pod has been created)
# Example: evaluation_data.yaml
- conversation_group_id: infrastructure_test
setup_script: ./scripts/setup_cluster.sh
cleanup_script: ./scripts/cleanup_cluster.sh
turns:
- turn_id: turn_id
query: Create a new cluster
verify_script: ./scripts/verify_cluster.sh
turn_metrics:
- script:action_eval
Script Path Resolution
Script paths in evaluation data can be specified in multiple ways:
- Relative Paths: Resolved relative to the evaluation data YAML file location, not the current working directory
- Absolute Paths: Used as-is
- Home Directory Paths: Expanded to the user's home directory
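A sketch of how the three path styles might appear in evaluation data (paths are illustrative):

```yaml
# Sketch: the three script path styles (illustrative paths)
- conversation_group_id: path_demo
  setup_script: scripts/setup_env.sh            # relative: resolved against the eval data YAML's directory
  cleanup_script: /opt/eval/cleanup_env.sh      # absolute: used as-is
  turns:
    - turn_id: t1
      query: Create a namespace called test-ns
      verify_script: ~/scripts/verify_namespace.sh  # home directory path: ~ is expanded
```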
# Hosted vLLM (provider: hosted_vllm)
export HOSTED_VLLM_API_KEY="your-key"
export HOSTED_VLLM_API_BASE="https://your-vllm-endpoint/v1"
# OpenAI (provider: openai)
export OPENAI_API_KEY="your-openai-key"
# IBM Watsonx (provider: watsonx)
export WATSONX_API_KEY="your-key"
export WATSONX_API_BASE="https://us-south.ml.cloud.ibm.com"
export WATSONX_PROJECT_ID="your-project-id"
# Gemini (provider: gemini)
export GEMINI_API_KEY="your-key"
# API authentication for external system (MCP)
export API_KEY="your-api-endpoint-key"
- CSV: Detailed results with status, scores, reasons
- JSON: Summary statistics with score distributions
- TXT: Human-readable summary
- PNG: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown)
- PASS/FAIL/ERROR: Status based on thresholds
- Actual Reasons: DeepEval provides LLM-generated explanations; custom metrics provide detailed reasoning
- Score Statistics: Mean, median, standard deviation, min/max for every metric
uv sync --group dev
make format
make pylint
make pyright
make docstyle
make check-types
uv run pytest tests --cov=src
For generating answers (optional), refer to README-generate-answers.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Contributions welcome - see development setup above for code quality tools.