Bedrock Agent Evaluation is an evaluation framework for Amazon Bedrock agents' tool use and chain-of-thought reasoning.
- Test your own Bedrock Agent with custom questions
- Includes built-in evaluations for RAG, Text2SQL, and chain-of-thought reasoning
- Extend the framework with custom tool evaluations
- Integrated with Langfuse for easy observability of evaluation results
(Include a demo video here)
Screenshots / code blocks to include:
- Driver code snippet? (a hypothetical placeholder sketch follows this list)
- Run_Evaluation function code snippet?
- Evaluator tool function code snippet?
- Screenshots of the Langfuse dashboard
- Screenshot of a Langfuse trace and its evaluation metrics
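As a placeholder for the driver and Run_Evaluation snippets above, here is a hypothetical sketch of what such a loop could look like: invoke the Bedrock agent for each dataset question, score the response, and log the result to Langfuse. The dataset fields, agent IDs, and the run_evaluation helper are illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch only -- the actual driver.py / Run_Evaluation API may differ.
import json

import boto3
from langfuse import Langfuse


def run_evaluation(agent_answer: str, expected_answer: str) -> dict:
    """Placeholder evaluator: naive string match against the ground truth."""
    return {"answer_correctness": 1.0 if expected_answer.lower() in agent_answer.lower() else 0.0}


def main():
    langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST
    agent_runtime = boto3.client("bedrock-agent-runtime")

    with open("sample_data_file.json") as f:
        dataset = json.load(f)  # assumed: a list of {"question", "ground_truth"} items

    for i, item in enumerate(dataset):
        # Invoke the Bedrock agent under test (IDs are placeholders).
        response = agent_runtime.invoke_agent(
            agentId="YOUR_AGENT_ID",
            agentAliasId="YOUR_AGENT_ALIAS_ID",
            sessionId=f"eval-{i}",
            inputText=item["question"],
        )
        completion = "".join(
            event["chunk"]["bytes"].decode("utf-8")
            for event in response["completion"]
            if "chunk" in event
        )

        # Score the response and attach the result to a Langfuse trace.
        scores = run_evaluation(completion, item["ground_truth"])
        trace = langfuse.trace(name=f"question-{i}", input=item["question"], output=completion)
        for name, value in scores.items():
            trace.score(name=name, value=value)

    langfuse.flush()


if __name__ == "__main__":
    main()
```

The real driver.py and evaluator snippets should replace this sketch once they are finalized.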
- Clone this repo in a SageMaker notebook (Link to how to do it)
- Clone this repo locally and set up AWS CLI credentials for your AWS account (Link to how to do it)
- Install the required dependencies for the framework from requirements.txt
- Set up a Langfuse account and create a project using Langfuse Cloud (Link to Langfuse) or the self-hosted option for AWS (Link to AWS self-hosted Langfuse repo)
- Bring the existing agent you want to evaluate (RAG and Text2SQL evaluations are currently built in)
- Create a dataset file for evaluations, either manually or with the generator (refer to sample_data_file.json for the required format)
- Copy config_tpl.py into a config.py configuration file (see the sketch after this list)
- Run driver.py to start the evaluation job
- Check the Langfuse console to see the traces
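The config_tpl.py contents are not reproduced here, so the following is only a hypothetical sketch of the kinds of values config.py is likely to need based on the steps above (Langfuse credentials, the agent under test, and the dataset path); refer to the actual template for the real setting names.

```python
# config.py -- hypothetical sketch; see config_tpl.py for the actual template.

# Langfuse project credentials (from the Langfuse project settings page).
LANGFUSE_PUBLIC_KEY = "pk-lf-..."
LANGFUSE_SECRET_KEY = "sk-lf-..."
LANGFUSE_HOST = "https://cloud.langfuse.com"  # or your self-hosted endpoint

# Bedrock agent under test.
AGENT_ID = "YOUR_AGENT_ID"
AGENT_ALIAS_ID = "YOUR_AGENT_ALIAS_ID"

# Evaluation dataset (see sample_data_file.json for the required format).
DATA_FILE_PATH = "sample_data_file.json"
```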
Follow the instructions in README.md in the blog_sample_agents folder
- How each trace is structured and sent
- What is included at each hierarchy level: trace / generation / span (see the sketch after this list)
- Which tags are included and how to filter on them
- How to use the dashboard
- How to compare evaluation scores
- How to compare model latency
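To illustrate the trace / generation / span hierarchy, tagging, and scoring described above, here is a minimal sketch using the Langfuse Python SDK (v2-style client); the trace names, tags, and score names are illustrative, not the framework's actual conventions.

```python
from langfuse import Langfuse

langfuse = Langfuse()  # LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from env

# Top level: one trace per evaluated question, tagged so it can be filtered in the UI.
trace = langfuse.trace(
    name="rag-question-1",
    tags=["RAG", "claude-3-sonnet"],  # illustrative tags
    metadata={"question_id": 1},
)

# Span: a logical step of the agent run, e.g. the agent invocation.
span = trace.span(name="agent-invocation", input={"question": "What is ...?"})

# Generation: an individual model call nested under the trace, with model metadata.
trace.generation(
    name="chain-of-thought",
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    input="What is ...?",
    output="The answer is ...",
)
span.end(output={"completion": "The answer is ..."})

# Scores: evaluation metrics attached to the trace, comparable across runs in the dashboard.
trace.score(name="answer-correctness", value=0.9)
trace.score(name="latency-seconds", value=2.4)

langfuse.flush()
```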
Version 1: What to include in documentation:
- How the framework is structured (each component/file of the framework): driver, evaluators, Langfuse setup, custom evaluators
- How evaluations are implemented and how the workflow runs
- How to add custom evaluators (see the sketch after this list)
- How to modify Langfuse traces
- How to modify the evaluation logic of the existing evaluations
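As a starting point for the custom-evaluator documentation, here is a hypothetical example of the shape a custom evaluator could take; the base-class name and method signature are assumptions, since the framework's real evaluator interface lives in its evaluator modules.

```python
# Hypothetical custom evaluator sketch -- the framework's real evaluator
# interface (names, signatures) may differ.
from abc import ABC, abstractmethod


class BaseEvaluator(ABC):
    """Assumed base class: one evaluator produces one or more named scores."""

    @abstractmethod
    def evaluate(self, question: str, agent_answer: str, ground_truth: str) -> dict[str, float]:
        ...


class KeywordCoverageEvaluator(BaseEvaluator):
    """Example custom metric: fraction of ground-truth keywords present in the answer."""

    def evaluate(self, question: str, agent_answer: str, ground_truth: str) -> dict[str, float]:
        keywords = {w.lower() for w in ground_truth.split() if len(w) > 3}
        if not keywords:
            return {"keyword_coverage": 0.0}
        hits = sum(1 for kw in keywords if kw in agent_answer.lower())
        return {"keyword_coverage": hits / len(keywords)}
```

A custom evaluator along these lines would then be wired into the driver so its scores are logged to Langfuse alongside the built-in metrics.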
Future iteration plan:
- Change the framework to be trajectory-based rather than per-question
- Use OpenTelemetry collectors for more accurate tracing
- Abstract out the trace-parsing logic for Bedrock calls
- Make it platform-agnostic? (This might require multiple importable versions.)
- Make evaluation focus on trajectory, goal fulfillment, and tool adherence rather than individual tools, and provide built-in metrics that can be chosen per evaluation instead of a separate evaluation for each tool
- For Langfuse evaluators, integrate more deeply with Langfuse capabilities: datasets to run evaluations in Langfuse, and human annotation for human-in-the-loop review
- A Streamlit UI to demo the online evaluation feature
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.