

Amazon Bedrock Agent Evaluation

Bedrock Agent Evaluation is an evaluation framework for Amazon Bedrock agents that assesses tool use and chain-of-thought reasoning.

Features

  • Test your own Bedrock Agent with custom questions (see the invocation sketch after this list)
  • Built-in evaluations for RAG, Text2SQL, and chain-of-thought reasoning
  • Extensible with custom tool evaluations
  • Integrated with Langfuse for easy observability of evaluation results
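
The trace data these evaluations rely on comes from the standard Amazon Bedrock Agents runtime API. The snippet below is a context sketch using plain boto3 (the agent IDs are placeholders), not framework code; it shows where the tool-use and chain-of-thought trace events originate.

```python
# Context sketch (not framework code), assuming plain boto3 and placeholder IDs.
import uuid

import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_agent(
    agentId="YOUR_AGENT_ID",              # placeholder -- use your own agent
    agentAliasId="YOUR_AGENT_ALIAS_ID",   # placeholder
    sessionId=str(uuid.uuid4()),
    inputText="What were the top five products by revenue last quarter?",
    enableTrace=True,                     # emit reasoning / tool-use trace events
)

answer, trace_events = "", []
for event in response["completion"]:      # streaming EventStream
    if "chunk" in event:
        answer += event["chunk"]["bytes"].decode("utf-8")
    elif "trace" in event:
        trace_events.append(event["trace"])  # rationale, tool invocations, etc.

print(answer)
print(f"collected {len(trace_events)} trace events")
```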

(Include a demo video here)

Screenshots / code blocks to include:

  1. Driver code snippet?
  2. Run_Evaluation function code snippet?
  3. Evaluator tool function code snippet?
  4. Screenshots of the Langfuse dashboard
  5. Screenshot of a Langfuse trace and its evaluation metrics

How to use

Deployment environment options

  1. Clone this repo in a SageMaker notebook (Link to how to do it)
  2. Clone this repo locally and set up AWS CLI credentials for your AWS account (Link to how to do it)

Pre-Requisites for Running

  1. Install the required dependencies for the framework from requirements.txt
  2. Set up a Langfuse account and create a project, using either Langfuse Cloud (Link to langfuse) or the self-hosted option for AWS (Link to aws self hosted langfuse repo); a credentials sketch follows this list
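
A minimal sketch of the Langfuse credential setup, assuming the standard environment variables read by the Langfuse Python SDK. How this framework consumes them (environment variables vs. config.py) may differ; the key values come from your Langfuse project settings.

```python
# Minimal sketch: point the Langfuse SDK at your project. The LANGFUSE_* variable
# names are the standard ones the Langfuse Python SDK reads; the key values come
# from your Langfuse project settings.
import os

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."             # placeholder
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."             # placeholder
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # or your self-hosted URL

from langfuse import Langfuse

langfuse = Langfuse()          # picks up the variables above
assert langfuse.auth_check()   # fails fast if the keys or host are wrong
```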

Option 1: Bring your own agent to evaluate

  1. Bring the existing agent you want to evaluate (RAG and Text2SQL evaluations are currently built in)
  2. Create a dataset file for the evaluations, either manually or with the generator (refer to sample_data_file.json for the required format; an illustrative sketch follows this list)
  3. Copy config_tpl.py into a 'config.py' configuration file
  4. Run driver.py to start the evaluation job
  5. Check the Langfuse console to see the traces
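
An illustrative sketch of steps 2-4. The dataset field names below are assumptions made purely for illustration; sample_data_file.json is the authoritative format, and config_tpl.py lists the actual settings to fill in.

```python
# Illustrative only -- the dataset field names are assumptions; follow
# sample_data_file.json for the real format.
import json
import shutil
import subprocess

# Step 2: write a small evaluation dataset.
dataset = [
    {
        "question_id": 1,                 # hypothetical field names
        "question_type": "RAG",           # e.g. RAG or TEXT2SQL
        "question": "What does the 2023 annual report say about revenue growth?",
        "ground_truth": "Revenue grew roughly 12% year over year.",
    },
]
with open("my_eval_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)

# Step 3: start from the template, then edit config.py with your agent ID,
# agent alias ID, dataset path, and Langfuse keys.
shutil.copy("config_tpl.py", "config.py")

# Step 4: run the evaluation job.
subprocess.run(["python", "driver.py"], check=True)
```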

Option 2: Create Sample Agents to run Evaluations

Follow the instructions in README.md in the blog_sample_agents folder

Navigating the Langfuse Traces and Dashboard

  1. How each trace is structured and sent
  2. What is included at each hierarchy level: trace / generation / span
  3. Which tags are included and how to filter by them
  4. How to use the dashboard
  5. How to compare evaluation scores (a query sketch follows this list)
  6. How to compare model latency
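
For comparing evaluation scores and latency outside the dashboard, here is a hedged sketch that assumes the Langfuse v2 Python SDK; the tag name is illustrative, so check the tags your traces actually carry and adjust the method names if you use a different SDK version.

```python
# Hedged sketch: assumes the Langfuse v2 Python SDK and an illustrative "RAG" tag.
from langfuse import Langfuse

langfuse = Langfuse()  # credentials come from the LANGFUSE_* environment variables

# List recent traces that carry a given tag.
traces = langfuse.fetch_traces(tags=["RAG"], limit=25).data

for t in traces:
    detail = langfuse.fetch_trace(t.id).data             # full trace, incl. scores
    scores = {s.name: s.value for s in detail.scores}
    latency = f"{t.latency:.2f}s" if t.latency is not None else "n/a"
    print(t.name, latency, scores)
```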

Documentation

Version 1: What to include in documentation:

  1. How the framework is structured (each component/file of the framework): driver, evaluators, Langfuse setup, custom evaluators
  2. How the evaluations are implemented and the overall workflow
  3. How to add custom evaluators (a hypothetical sketch follows this list)
  4. How to modify the Langfuse traces
  5. How to modify the evaluation logic for the existing evaluations
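
As a hypothetical illustration of item 3, a custom evaluator could look roughly like the following. The class shape and method names are assumptions rather than the framework's actual interface, so mirror the built-in RAG and Text2SQL evaluators in this repo when writing a real one.

```python
# Hypothetical sketch -- not the framework's actual base class or interface.
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str      # metric name as it should appear in Langfuse
    value: float   # score in [0, 1]
    comment: str = ""


class CustomToolEvaluator:
    """Scores a single agent response for a custom tool."""

    def evaluate(self, question: str, agent_response: str, ground_truth: str) -> EvalResult:
        # Replace with your own metric (LLM-as-judge, exact match, SQL result diff, ...).
        hit = ground_truth.strip().lower() in agent_response.lower()
        return EvalResult(
            name="custom_tool_correctness",
            value=1.0 if hit else 0.0,
            comment="naive substring match; swap in a real metric",
        )
```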

Future iteration plan:

  1. Move the framework toward trajectory-based evaluation rather than per-question evaluation
  2. Use OpenTelemetry collectors for more accurate tracing
  3. Abstract out the trace-parsing logic for Bedrock calls
  4. Make it platform agnostic? (Might require multiple importable versions)
  5. Shift evaluation toward trajectory, goal fulfillment, and tool adherence rather than individual tools, and provide built-in metrics that can be chosen per evaluation instead of a separate evaluation for each tool
  6. For the Langfuse evaluators, integrate more deeply with Langfuse capabilities: datasets for running evaluations in Langfuse, and human annotation for human-in-the-loop review
  7. A Streamlit UI to demo the online evaluation feature

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.