A framework for optimizing Large Language Model tool-calling through dual-stage finetuning and intelligent routing
Features • Architecture • Demo • Quick Start • ConsumerBench • Documentation
AgentFlux is a novel framework that significantly improves LLM tool-calling performance through a two-stage optimization approach: a specialized classifier model for tool selection, and individual tool-adapter models for precise argument generation. This dual-optimization strategy achieves superior accuracy while remaining more cost-efficient than traditional monolithic approaches.
Modern LLM applications increasingly rely on tool-calling capabilities to interact with external APIs, databases, and services. However, traditional approaches face several challenges:
- Inefficient Tool Selection: Large models waste compute on simple routing decisions
- Suboptimal Argument Generation: Generic models struggle with tool-specific parameter formatting
- High Latency & Cost: Every request requires full model inference
- Poor Scalability: Adding new tools degrades performance across all tools
AgentFlux solves these problems by separating concerns: our decoupled fine-tuning framework, DualTune, trains a lightweight classifier that rapidly selects the appropriate tool and a specialized adapter that then generates precise arguments.
- Synthetic Data Generation: Automatically generate high-quality training data using GPT-5
- Intelligent Data Validation: Comprehensive argument validation and trajectory cleaning
- Dual-Model Training: Simultaneous training of classifier and tool-specific adapters
- Built on Unsloth: Leverages state-of-the-art LoRA optimization
- Smart Routing: FastAPI-based proxy with intelligent request classification
- Tool Specialization: Per-tool finetuned models for optimal argument generation
- Built on Rena Core: Production-grade orchestration framework (Rust + Python)
- Multi-Category Benchmarks: Evaluate across filesys, Monday.com, Notion, and custom MCP tools
- Automated Judging: LLM-based evaluation with ground truth comparison
- Detailed Metrics: Track accuracy, latency, and cost metrics across the pipeline
AgentFlux consists of three integrated components working in harmony:
┌───────────────────────────────────────────────────────────────┐
│                       AgentFlux System                        │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────────┐   ┌──────────────────┐   ┌───────────┐  │
│  │     DualTune     │   │                  │   │           │  │
│  │    Finetuning    │   │    AgentFlux     │   │   Rena    │  │
│  │     Pipeline     │──▶│    Inference     │──▶│   Core    │  │
│  │                  │   │                  │   │  (Eval)   │  │
│  └──────────────────┘   └──────────────────┘   └───────────┘  │
│                                                               │
└───────────────────────────────────────────────────────────────┘
Generates optimized models through a multi-stage process:
Query Generation → Trajectory Collection → Data Cleaning → Model Training
     (GPT-5)              (GPT-5)           (Validation)   (Unsloth LoRA)
Components:
- Query Generator: Creates diverse, realistic user queries per tool category
- Trajectory Collector: Records complete tool-calling conversations with a frontier model (GPT-5)
- Data Processor: Validates arguments, checks types, splits train/eval/test sets
- Model Trainer: Finetunes both classifier and per-tool adapters using LoRA
Technical Specifications:
- Training Method: LoRA (rank=32, alpha=64, dropout=0.1)
- Optimization: RSLoRA with AdamW-8bit optimizer
- Context Length: 32,768 tokens
- Learning Rate: 5e-6 with cosine scheduling
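For concreteness, here is a minimal sketch of how these hyperparameters map onto an Unsloth + TRL training run. This is illustrative only: the real entry point is finetune/unsloth-cli-split.py, and the dataset path below is a placeholder.

```python
# Sketch only, not the actual training script. Mirrors the hyperparameters
# listed above; the dataset path is a placeholder and the dataset is assumed
# to be already preprocessed into text form by data_prepare.py.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    max_seq_length=32768,  # context length from the specs above
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32, lora_alpha=64, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_rslora=True,                       # rank-stabilized LoRA
    use_gradient_checkpointing="unsloth",  # Unsloth-mode checkpointing
)
train_dataset = load_dataset(  # placeholder path
    "json", data_files="results/trajectories/train.jsonl", split="train")
SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    args=TrainingArguments(
        learning_rate=5e-6,
        lr_scheduler_type="cosine",     # cosine scheduling
        optim="adamw_8bit",             # AdamW-8bit optimizer
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        output_dir="results/finetune_output",
    ),
).train()
```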
Production-ready inference system with intelligent routing:
User Request → Proxy Server → Classifier → Tool Adapter → API Response
                    ↓              ↓              ↓
               [Tool Cap]    [Tool Select]    [Args Gen]
Request Flow:
- Proxy Layer (proxy.py): FastAPI server receives chat completion requests
- Classification (classifier.py): Lightweight model predicts the tool from context (n=10 samples, temperature=1.0)
- Adaptation (tool_adaptor.py): Specialized model generates precise arguments
- Execution: Formatted tool call sent to the target API
Key Innovations:
- Multi-Sample Classification: Aggregates predictions from 10 samples for robustness
- Tool-Specific Chat Templates: Custom Jinja2 templates optimize each tool's behavior
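As a rough illustration of the second point, the sketch below renders a per-tool chat template with Jinja2. The template string is invented for this example; the repository's actual templates are finetune/classifier.jinja and finetune/tool_template.jinja.

```python
# Illustrative template only; not the repository's tool_template.jinja.
from jinja2 import Template

TOOL_TEMPLATE = Template(
    "<|im_start|>system\n"
    "You call the `{{ tool.name }}` tool. Parameters: {{ tool.params }}<|im_end|>\n"
    "{% for m in messages %}"
    "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
    "<|im_start|>assistant\n"
)

prompt = TOOL_TEMPLATE.render(
    tool={"name": "read_file", "params": '{"path": "string"}'},
    messages=[{"role": "user", "content": "Read the contents of README.md"}],
)
print(prompt)
```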
Production-grade evaluation infrastructure:
┌───────────────────┐           ┌────────────────────┐
│   rena-browserd   │──────────▶│    rena-runtime    │
│    (Rust Core)    │   gRPC    │    (Python MCP)    │
│                   │           │                    │
│  Process Manager  │           │   Tool Execution   │
│  Docker Control   │           │    MCP Protocol    │
└───────────────────┘           └────────────────────┘
Components:
- rena-browserd (Rust): Manages MCP server lifecycle, Docker containers, process orchestration
- rena-runtime (Python): Executes MCP protocol, handles tool invocations, logs trajectories
- browserd-cli: Command-line interface for running queries and managing apps
- browserd-eval: Evaluation harness for benchmarking tool-calling performance
Evaluation Pipeline:
- Query Generation → Category-specific test queries
- Trajectory Generation → Run queries through AgentFlux or baseline
- Automated Judging → Compare outputs against ground truth
- Score Calculation → Aggregate accuracy and success metrics
AgentFlux/
├── finetune/                          # 🎯 Model Finetuning Pipeline
│   ├── unsloth-cli-split.py           # Main training script (Unsloth + LoRA)
│   ├── data_prepare.py                # Data validation & train/eval/test splitting
│   ├── gen_queries.py                 # Synthetic query generation (GPT-5)
│   ├── gen_trajs.py                   # Trajectory collection from baseline
│   ├── gen_tool_template.py           # Generate Jinja2 templates per tool
│   ├── classifier.jinja               # Chat template for classifier model
│   ├── tool_template.jinja            # Chat template for tool adapters
│   ├── results/                       # Training outputs, logs, model checkpoints
│   ├── base_models/                   # Base-model-specific chat templates
│   └── scripts/                       # 🔧 Automation Scripts
│       ├── finetune.sh                # Complete finetuning pipeline
│       ├── finetune_classifier.sh     # Train classifier only
│       └── finetune_tool_adaptors.sh  # Train tool adapters only
│
├── inference/agentflux/               # ⚡ AgentFlux Inference System
│   ├── agentflux/
│   │   ├── proxy.py                   # FastAPI proxy server (port 8030)
│   │   ├── classifier.py              # Finetuned & GPT classifier implementations
│   │   ├── tool_adaptor.py            # Finetuned & GPT tool adapter implementations
│   │   └── utils/logging_setup.py     # Logging configuration
│   └── pyproject.toml                 # Package configuration
│
├── orchestration-framework/           # 🔬 Evaluation Infrastructure
│   ├── rena-core/                     # Core orchestration framework
│   │   ├── rena-browserd/             # Rust: Process & Docker management
│   │   │   ├── browserd/              # Core library
│   │   │   ├── browserd-cli/          # CLI interface
│   │   │   └── browserd-eval/         # Evaluation harness
│   │   └── rena-runtime/              # Python: MCP protocol execution
│   ├── evaluation/                    # Evaluation scripts
│   │   ├── run_agentflux.py           # Run with AgentFlux proxy
│   │   ├── run_baseline.py            # Run with baseline GPT
│   │   ├── gen_queries.py             # Generate test queries
│   │   ├── score.py                   # Calculate final metrics
│   │   ├── filesys/judge.py           # Filesystem category judge
│   │   ├── monday/judge.py            # Monday.com category judge
│   │   └── notion/judge.py            # Notion category judge
│   └── __init__.py                    # Helper functions
│
└── ConsumerBench                      # Benchmarking framework for measuring system efficiency
See AgentFlux in action powering a Coinbase trading agent:
demos/agentflux-coinbase-demo.mp4
- Python: 3.8 or higher
- Rust: 1.70 or higher
- CUDA: 11.8+ (for GPU acceleration)
- Docker: Latest stable version (for Rena Core)
- OpenAI API Key: Required for query/trajectory generation and baseline evaluation
export OPENAI_API_KEY="your-api-key-here"
git clone https://github.com/yourusername/AgentFlux.git
cd AgentFlux

The finetuning pipeline requires Unsloth for efficient LoRA training:
# Install Unsloth (recommended: use conda/mamba)
conda create -n dualtune python=3.10
conda activate dualtune
# Install Unsloth with CUDA support
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# Install additional dependencies
pip install transformers datasets trl accelerate peft
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

cd inference/agentflux
pip install -e .
cd ../..
# Install inference dependencies
pip install fastapi uvicorn httpx aioconsole

cd orchestration-framework/rena-core
# Install Rust dependencies and build
make setup
# Install Python runtime dependencies
cd rena-runtime
pip install -e .
cd ../../..

Run the complete finetuning pipeline for a category (e.g., filesys, monday, notion):
# Full pipeline: query gen β trajectory collection β training
cd finetune
bash scripts/finetune.sh filesys

This will:
- Generate 1000+ synthetic queries using GPT-5
- Collect tool-calling trajectories from baseline model
- Clean and validate data (argument checking, type validation)
- Split into train/eval/test sets (80/10/10)
- Train classifier model → finetune/filesys/results/finetune_output/classifier/
- Train per-tool adapters → finetune/filesys/results/finetune_output/tool_adaptors/{tool_name}/
Customization:
# Custom hyperparameters: category, batch_size, grad_accumulation, epochs
bash scripts/finetune_classifier.sh filesys 8 2 3
bash scripts/finetune_tool_adaptors.sh filesys 8 2 3

Training Outputs:
- Model checkpoints: finetune/{category}/results/finetune_output/
- Training logs: finetune/{category}/results/log/
- Processed data: finetune/{category}/results/trajectories/
Deploy your finetuned models via the AgentFlux proxy server:
# Start the finetuned model servers (classifier and tool adapters)
cd ../inference/agentflux
bash scripts/vllm.sh &
# Start AgentFlux proxy
bash scripts/proxy.sh &

Now send requests to http://localhost:8030/v1/chat/completions using the OpenAI SDK format!
Example Request:
import openai
client = openai.OpenAI(
base_url="http://localhost:8030/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="agentflux", # Model name doesn't matter, proxy handles routing
messages=[
{"role": "user", "content": "Read the contents of README.md"}
],
tools=[...] # Your MCP tool definitions
)
print(response.choices[0].message.tool_calls)

Benchmark your finetuned models against the baseline:
# Set workspace for filesys category
export WORKSPACE=/path/to/test/workspace
# Run complete evaluation pipeline
cd ../orchestration-framework/evaluation
bash scripts/evaluate.sh filesys

This executes:
- Query generation for test set
- Trajectory generation using AgentFlux proxy
- Automated judging against ground truth
- Score calculation and metrics reporting
Output:
- Trajectories: orchestration-framework/evaluation/{category}/eval-results/
- Judgments: orchestration-framework/evaluation/{category}/judge-results/
- Final scores printed to console
Manual Evaluation Steps:
# 1. Generate test queries
python orchestration-framework/evaluation/gen_queries.py --category filesys
# 2. Run with AgentFlux
python orchestration-framework/evaluation/run_agentflux.py filesys \
--classifier config/filesys/classifier.json \
--tool_adapters config/filesys/tool_adapters.json \
--query orchestration-framework/evaluation/filesys/queries/fuzzing_queries.txt \
--output orchestration-framework/evaluation/filesys/eval-results/trajs.jsonl
# 3. Judge results
python orchestration-framework/evaluation/filesys/judge.py \
--trajs orchestration-framework/evaluation/filesys/eval-results/trajs.jsonl \
--output orchestration-framework/evaluation/filesys/judge-results/judged.jsonl
# 4. Calculate scores
python orchestration-framework/evaluation/score.py \
--llm_judge_path orchestration-framework/evaluation/filesys/judge-results/judged.jsonl

ConsumerBench is a benchmarking framework for measuring the system efficiency of local AI models running concurrently with each other. It supports MCP workflows and reports the system efficiency, GPU utilization, and power consumption of DualTune workflows.
Each tool category requires configuration in inference/agentflux/config/{category}/:

- tool_list.json: MCP tool definitions (OpenAI format)
- classifier.json: Classifier model endpoint and configuration
  { "model": "filesys-classifier", "port": 8001, "tools": ["read_file", "write_file", "list_directory", ...] }
- tool_adapters.json: Per-tool adapter configurations
  { "read_file": {"model": "read_file-adapter", "port": 8002}, "write_file": {"model": "write_file-adapter", "port": 8003}, ... }
- query_generation_template.txt: Prompt template for generating training queries
- judge_sys_prompt.txt: System prompt for evaluation judging
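As a sketch of how these files might be consumed, the hypothetical helper below resolves a classified tool name to its adapter endpoint using tool_adapters.json. The function is our illustration, not part of the agentflux package.

```python
# Hypothetical helper, not the repository's API: maps a tool name to its
# adapter endpoint using tool_adapters.json as described above.
import json
from pathlib import Path

def adapter_endpoint(category: str, tool_name: str) -> str:
    config_dir = Path("inference/agentflux/config") / category
    adapters = json.loads((config_dir / "tool_adapters.json").read_text())
    entry = adapters[tool_name]  # e.g. {"model": "read_file-adapter", "port": 8002}
    return f"http://localhost:{entry['port']}/v1"

print(adapter_endpoint("filesys", "read_file"))  # -> http://localhost:8002/v1
```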
1. Create Configuration:

   cd inference
   mkdir -p agentflux/config/my_category
   # Add tool_list.json, classifier.json, tool_adapters.json

2. Generate Training Data:

   cd ../finetune
   mkdir -p my_category
   # Add query_generation_template.txt
   bash scripts/finetune.sh my_category

3. Create Judge Script:

   # orchestration-framework/evaluation/my_category/judge.py
   # Implement category-specific validation logic

4. Run Evaluation:

   cd ../orchestration-framework/evaluation
   bash scripts/evaluate.sh my_category
Finetuning Hyperparameters (scripts/finetune_classifier.sh, scripts/finetune_tool_adaptors.sh):
- batch_size: Per-device batch size (default: 4)
- accumulate_step: Gradient accumulation steps (default: 4)
- num_train_epochs: Training epochs (default: 4)
- Learning rate: 5e-6 (fixed in script)
- Scheduler: Cosine annealing
Data Validation (finetune/data_prepare.py):
- Validates required vs optional parameters
- Type checking (string, number, boolean, array, object)
- Rejects unexpected arguments
- Filters conversations exceeding 32,768 tokens
- Binary search removal of problematic entries
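The sketch below shows the spirit of these checks in miniature. It is our own simplified version, not the actual data_prepare.py logic:

```python
# Simplified argument validation in the spirit of data_prepare.py; not the
# actual implementation. Checks required parameters, JSON types, and
# rejects unexpected arguments.
JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool,
              "array": list, "object": dict}

def validate_call(args: dict, schema: dict) -> bool:
    props = schema.get("properties", {})
    if not set(schema.get("required", [])).issubset(args):
        return False                           # missing required parameter
    for name, value in args.items():
        if name not in props:
            return False                       # unexpected argument
        if not isinstance(value, JSON_TYPES[props[name]["type"]]):
            return False                       # type mismatch
    return True

schema = {"properties": {"path": {"type": "string"}}, "required": ["path"]}
assert validate_call({"path": "README.md"}, schema)
assert not validate_call({"path": 42}, schema)   # wrong type
assert not validate_call({"mode": "r"}, schema)  # unexpected + missing required
```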
Caching Behavior (inference/agentflux/tool_adaptor.py):
- Tracks up to 5 unique function calls per request
- Blocks after 10 identical calls (prevents loops)
- Clears cache when "summarize" tool is called
- Uses SHA256 hashing for call deduplication
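A hedged sketch of this loop guard, using the thresholds listed above (the class is our illustration, not the code in tool_adaptor.py):

```python
# Illustration of the loop-prevention cache described above; not the actual
# tool_adaptor.py code. SHA-256 hashes each (tool, args) pair and blocks a
# call once the same hash has been seen more than 10 times; a "summarize"
# call clears the cache.
import hashlib, json
from collections import Counter

class CallCache:
    def __init__(self, block_after: int = 10):
        self.counts: Counter = Counter()
        self.block_after = block_after

    def allow(self, tool: str, args: dict) -> bool:
        if tool == "summarize":
            self.counts.clear()       # summarize resets the cache
            return True
        key = hashlib.sha256(
            json.dumps([tool, args], sort_keys=True).encode()).hexdigest()
        self.counts[key] += 1
        return self.counts[key] <= self.block_after

cache = CallCache()
assert all(cache.allow("read_file", {"path": "a.txt"}) for _ in range(10))
assert not cache.allow("read_file", {"path": "a.txt"})  # 11th identical call blocked
```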
- Base Model: Qwen2.5-7B-Instruct (32K context)
- Framework: Unsloth (optimized Hugging Face Transformers)
- Method: LoRA (Low-Rank Adaptation)
- Rank: 32
- Alpha: 64 (scaling factor)
- Dropout: 0.1
- Target modules: Q, K, V, O, Gate, Up, Down projections
- Optimizer: AdamW-8bit (memory efficient)
- Scheduler: Cosine annealing with warmup
- Training: RSLoRA enabled, gradient checkpointing (Unsloth mode)
Proxy Server (proxy.py):
- FastAPI application on port 8030
- OpenAI-compatible /v1/chat/completions endpoint
- Automatic tool list substitution
- Error handling and logging
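A minimal sketch of an endpoint in this style follows; it is illustrative only, with stub helpers standing in for the classifier and adapter calls, while the real proxy.py adds tool-list substitution, routing, and error handling:

```python
# Minimal OpenAI-compatible proxy sketch; not the actual proxy.py.
# classify_tool / generate_args are stand-ins for classifier.py and
# tool_adaptor.py.
from fastapi import FastAPI, Request
import uvicorn

app = FastAPI()

async def classify_tool(messages: list) -> str:
    return "read_file"                      # stub: classifier logic goes here

async def generate_args(tool: str, body: dict) -> dict:
    return {"choices": [{"message": {       # stub: adapter logic goes here
        "tool_calls": [{"function": {"name": tool, "arguments": "{}"}}]}}]}

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    tool = await classify_tool(body["messages"])   # stage 1: tool selection
    return await generate_args(tool, body)         # stage 2: argument generation

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8030)
```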
Classification Strategy (classifier.py):
- Generates 10 completions with temperature=1.0
- Extracts tool names from <tool_call> tags
- Votes by frequency (most common tool wins)
- Falls back to "summarize" if no tools detected
Adaptation Strategy (tool_adaptor.py):
- Single tool selection via the tool_choice parameter
- Custom chat templates per tool
- Retry logic (max 3 attempts)
- Response validation and error handling
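A sketch of this call pattern in OpenAI-SDK terms (the endpoint and model names are placeholders, and the validation here is intentionally basic):

```python
# Sketch of the adapter call pattern: force one tool via tool_choice and
# retry up to 3 times. Endpoint and model names are placeholders.
import openai

client = openai.OpenAI(base_url="http://localhost:8002/v1", api_key="not-needed")

def call_adapter(messages: list, tool_def: dict, max_attempts: int = 3):
    name = tool_def["function"]["name"]
    for _ in range(max_attempts):
        resp = client.chat.completions.create(
            model=f"{name}-adapter",   # placeholder adapter model name
            messages=messages,
            tools=[tool_def],
            tool_choice={"type": "function", "function": {"name": name}},
        )
        calls = resp.choices[0].message.tool_calls
        if calls:                      # basic response validation
            return calls[0]
    raise RuntimeError(f"{name} adapter failed after {max_attempts} attempts")
```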
rena-browserd (Rust):
- Docker API integration (bollard)
- Async process management (tokio)
- gRPC server for runtime communication
- Structured logging (tracing)
rena-runtime (Python):
- MCP protocol implementation (mcp SDK)
- Tool invocation and response handling
- Trajectory logging (JSONL format)
- Container lifecycle management
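As a sketch of the JSONL trajectory logging mentioned above, one record per tool invocation might be appended like this (the record fields are our assumption, not rena-runtime's actual schema):

```python
# Illustration of JSONL trajectory logging; the record fields are our
# assumption, not rena-runtime's actual schema.
import json
import time

def log_invocation(path: str, tool: str, args: dict, result: str) -> None:
    record = {"ts": time.time(), "tool": tool, "args": args, "result": result}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")   # one JSON object per line

log_invocation("trajs.jsonl", "read_file", {"path": "README.md"}, "ok")
```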
We welcome contributions! Here are some areas where you can help:
- New Tool Categories: Add support for additional MCP tool sets (GitHub, Slack, Google Drive, etc.)
- Evaluation Metrics: Implement new judging criteria and success metrics
- Model Architectures: Experiment with different base models and training techniques
- Optimization: Improve inference speed, memory usage, or training efficiency
- Documentation: Enhance guides, add tutorials, create example notebooks
To Contribute:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Please ensure:
- Code follows existing style conventions
- Tests pass (if applicable)
- Documentation is updated
- Commit messages are descriptive
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Issues: GitHub Issues
- Email: rohankad@cs.washington.edu
If you use AgentFlux in your research, please cite:
@article{kadekodi2025dualtune,
title={DualTune: Decoupled Fine-Tuning for On-Device Agentic Systems},
author={Kadekodi, Rohan and Jin, Zhan and Kamahori, Keisuke and Gu, Yile and Khatiri, Sean and Bayindirli, Noah H and Gorbunov, Sergey and Kasikci, Baris},
journal={arXiv preprint arXiv:2510.00229},
year={2025}
}

⭐ Star us on GitHub if AgentFlux helps your project! ⭐