
AgentFlux: Framework for Privacy-Preserving On-Device Agentic Systems

A framework for optimizing Large Language Model tool-calling through dual-stage finetuning and intelligent routing


Features • Architecture • Demo • Quick Start • ConsumerBench • Documentation


🎯 Overview

AgentFlux is a framework that significantly improves LLM tool-calling performance through a two-stage optimization approach: a specialized classifier model for tool selection, and individual tool-adapter models for precise argument generation. This dual optimization achieves higher accuracy than traditional monolithic approaches while remaining cost-efficient.

Why AgentFlux?

Modern LLM applications increasingly rely on tool-calling capabilities to interact with external APIs, databases, and services. However, traditional approaches face several challenges:

  • Inefficient Tool Selection: Large models waste compute on simple routing decisions
  • Suboptimal Argument Generation: Generic models struggle with tool-specific parameter formatting
  • High Latency & Cost: Every request requires full model inference
  • Poor Scalability: Adding new tools degrades performance across all tools

AgentFlux solves these problems by separating concerns: its decoupled fine-tuning framework, DualTune, trains a lightweight classifier that rapidly selects the appropriate tool, and a specialized per-tool adapter that generates precise arguments for it.
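
As a rough sketch of this separation of concerns (class and function names here are illustrative stand-ins, not the actual AgentFlux API):

from typing import Dict, List

class Classifier:
    """Stage 1: a lightweight finetuned model that only picks the tool name."""
    def select_tool(self, messages: List[dict], tools: List[dict]) -> str:
        # Real implementation: sample a small finetuned LLM and vote.
        return tools[0]["function"]["name"]

class ToolAdapter:
    """Stage 2: a per-tool finetuned model that only fills in the arguments."""
    def generate_args(self, messages: List[dict]) -> Dict:
        # Real implementation: prompt the adapter finetuned for this tool.
        return {"path": "README.md"}

def handle_request(messages, tools, classifier, adapters):
    tool_name = classifier.select_tool(messages, tools)      # cheap routing
    arguments = adapters[tool_name].generate_args(messages)  # precise arguments
    return {"name": tool_name, "arguments": arguments}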


🚀 Key Features

🎓 DualTune: Automated Decoupled Finetuning Pipeline

  • Synthetic Data Generation: Automatically generate high-quality training data using GPT-5
  • Intelligent Data Validation: Comprehensive argument validation and trajectory cleaning
  • Dual-Model Training: Simultaneous training of classifier and tool-specific adapters
  • Built on Unsloth: Leverages state-of-the-art LoRA optimization

⚡ AgentFlux Inference Framework

  • Smart Routing: FastAPI-based proxy with intelligent request classification
  • Tool Specialization: Per-tool finetuned models for optimal argument generation

🔬 Comprehensive Evaluation Suite

  • Built on Rena Core: Production-grade orchestration framework (Rust + Python)
  • Multi-Category Benchmarks: Evaluate across filesys, Monday.com, Notion, and custom MCP tools
  • Automated Judging: LLM-based evaluation with ground truth comparison
  • Detailed Metrics: Track accuracy, latency, and cost metrics across the pipeline

πŸ“ Architecture

AgentFlux consists of three integrated components working in harmony:

┌─────────────────────────────────────────────────────────────────┐
│                         AgentFlux System                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────┐   ┌──────────────────┐   ┌────────────┐   │
│  │    DualTune      │   │                  │   │            │   │
│  │   Finetuning     │   │   AgentFlux      │   │    Rena    │   │
│  │    Pipeline      │──▶│   Inference      │──▶│    Core    │   │
│  │                  │   │                  │   │  (Eval)    │   │
│  └──────────────────┘   └──────────────────┘   └────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

1. DualTune Finetuning Pipeline

Generates optimized models through a multi-stage process:

Query Generation → Trajectory Collection → Data Cleaning → Model Training
     (GPT-5)            (GPT-5)            (Validation)    (Unsloth LoRA)

Components:

  • Query Generator: Creates diverse, realistic user queries per tool category
  • Trajectory Collector: Records complete tool-calling conversations with a frontier model (GPT-5)
  • Data Processor: Validates arguments, checks types, splits train/eval/test sets
  • Model Trainer: Finetunes both classifier and per-tool adapters using LoRA

Technical Specifications (see the sketch after this list):

  • Training Method: LoRA (rank=32, alpha=64, dropout=0.1)
  • Optimization: RSLoRA with AdamW-8bit optimizer
  • Context Length: 32,768 tokens
  • Learning Rate: 5e-6 with cosine scheduling
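
For concreteness, here is a minimal sketch of how these specifications map onto Unsloth's LoRA setup (the exact invocation lives in finetune/unsloth-cli-split.py; this is an approximation, not a copy):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-7B-Instruct",  # base model, see Technical Details
    max_seq_length=32768,                   # 32K context
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                                   # LoRA rank
    lora_alpha=64,                          # scaling factor
    lora_dropout=0.1,
    use_rslora=True,                        # rank-stabilized LoRA
    use_gradient_checkpointing="unsloth",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# The AdamW-8bit optimizer, 5e-6 learning rate, and cosine schedule are
# passed separately through the trainer's TrainingArguments.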

2. AgentFlux Inference Framework

Production-ready inference system with intelligent routing:

User Request → Proxy Server → Classifier → Tool Adapter → API Response
                   ↓              ↓             ↓
              [Tool Cap]    [Tool Select]  [Args Gen]

Request Flow:

  1. Proxy Layer (proxy.py): FastAPI server receives chat completion requests
  2. Classification (classifier.py): Lightweight model predicts tool from context (n=10 samples, temperature=1.0)
  3. Adaptation (tool_adaptor.py): Specialized model generates precise arguments
  4. Execution: Formatted tool call sent to target API

Key Innovations:

  • Multi-Sample Classification: Aggregates predictions from 10 samples for robustness
  • Tool-Specific Chat Templates: Custom Jinja2 templates optimize each tool's behavior (toy example below)
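
A toy illustration of a per-tool Jinja2 chat template (the template text here is hypothetical; the real templates live in finetune/tool_template.jinja and the per-tool variants generated by gen_tool_template.py):

from jinja2 import Template

TOOL_TEMPLATE = Template(
    "<|im_start|>system\n"
    "You call the `{{ tool_name }}` tool. Emit only a JSON object with its "
    "arguments.<|im_end|>\n"
    "{% for m in messages %}"
    "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
)

prompt = TOOL_TEMPLATE.render(
    tool_name="read_file",
    messages=[{"role": "user", "content": "Read the contents of README.md"}],
)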

3. Rena Core Orchestration Framework

Production-grade evaluation infrastructure:

┌─────────────────┐         ┌──────────────────┐
│  rena-browserd  │◄───────►│  rena-runtime    │
│   (Rust Core)   │  gRPC   │  (Python MCP)    │
│                 │         │                  │
│ Process Manager │         │  Tool Execution  │
│ Docker Control  │         │  MCP Protocol    │
└─────────────────┘         └──────────────────┘

Components:

  • rena-browserd (Rust): Manages MCP server lifecycle, Docker containers, process orchestration
  • rena-runtime (Python): Executes MCP protocol, handles tool invocations, logs trajectories
  • browserd-cli: Command-line interface for running queries and managing apps
  • browserd-eval: Evaluation harness for benchmarking tool-calling performance

Evaluation Pipeline:

  1. Query Generation → Category-specific test queries
  2. Trajectory Generation → Run queries through AgentFlux or baseline
  3. Automated Judging → Compare outputs against ground truth
  4. Score Calculation → Aggregate accuracy and success metrics

πŸ—οΈ Repository Structure

AgentFlux/
├── finetune/                           # 🎓 Model Finetuning Pipeline
│   ├── unsloth-cli-split.py            #    Main training script (Unsloth + LoRA)
│   ├── data_prepare.py                 #    Data validation & train/eval/test splitting
│   ├── gen_queries.py                  #    Synthetic query generation (GPT-5)
│   ├── gen_trajs.py                    #    Trajectory collection from baseline
│   ├── gen_tool_template.py            #    Generate Jinja2 templates per tool
│   ├── classifier.jinja                #    Chat template for classifier model
│   ├── tool_template.jinja             #    Chat template for tool adapters
│   ├── results/                        #    Training outputs, logs, model checkpoints
│   ├── base_models/                    #    Base-model-specific chat templates
│   └── scripts/                        # 🔧 Automation Scripts
│       ├── finetune.sh                 #    Complete finetuning pipeline
│       ├── finetune_classifier.sh      #    Train classifier only
│       └── finetune_tool_adaptors.sh   #    Train tool adapters only
│
├── inference/agentflux/                # ⚡ AgentFlux Inference System
│   ├── agentflux/
│   │   ├── proxy.py                    #    FastAPI proxy server (port 8030)
│   │   ├── classifier.py               #    Finetuned & GPT classifier implementations
│   │   ├── tool_adaptor.py             #    Finetuned & GPT tool adapter implementations
│   │   └── utils/logging_setup.py      #    Logging configuration
│   └── pyproject.toml                  #    Package configuration
│
├── orchestration-framework/            # 🔬 Evaluation Infrastructure
│   ├── rena-core/                      #    Core orchestration framework
│   │   ├── rena-browserd/              #    Rust: Process & Docker management
│   │   │   ├── browserd/               #       Core library
│   │   │   ├── browserd-cli/           #       CLI interface
│   │   │   └── browserd-eval/          #       Evaluation harness
│   │   └── rena-runtime/               #    Python: MCP protocol execution
│   ├── evaluation/                     #    Evaluation scripts
│   │   ├── run_agentflux.py            #    Run with AgentFlux proxy
│   │   ├── run_baseline.py             #    Run with baseline GPT
│   │   ├── gen_queries.py              #    Generate test queries
│   │   ├── score.py                    #    Calculate final metrics
│   │   ├── filesys/judge.py            #    Filesystem category judge
│   │   ├── monday/judge.py             #    Monday.com category judge
│   │   └── notion/judge.py             #    Notion category judge
│   └── __init__.py                     #    Helper functions
│
└── ConsumerBench/                      #    Benchmarking framework for measuring system efficiency

🎥 Demo

See AgentFlux in action powering a Coinbase trading agent:

demos/agentflux-coinbase-demo.mp4


🎬 Quick Start

Prerequisites

System Requirements

  • Python: 3.8 or higher
  • Rust: 1.70 or higher
  • CUDA: 11.8+ (for GPU acceleration)
  • Docker: Latest stable version (for Rena Core)

API Keys

  • OpenAI API Key: Required for query/trajectory generation and baseline evaluation
    export OPENAI_API_KEY="your-api-key-here"

Installation

1. Clone the Repository

git clone https://github.com/yourusername/AgentFlux.git
cd AgentFlux

2. Install DualTune Finetuning Dependencies

The finetuning pipeline requires Unsloth for efficient LoRA training:

# Install Unsloth (recommended: use conda/mamba)
conda create -n dualtune python=3.10
conda activate dualtune

# Install Unsloth with CUDA support
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Install additional dependencies
pip install transformers datasets trl accelerate peft
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

3. Install AgentFlux Inference System

cd inference/agentflux
pip install -e .
cd ../..

# Install inference dependencies
pip install fastapi uvicorn httpx aioconsole

4. Setup Rena Core (Optional - for evaluation)

cd orchestration-framework/rena-core

# Install Rust dependencies and build
make setup

# Install Python runtime dependencies
cd rena-runtime
pip install -e .
cd ../../..

Usage

🎓 Finetuning Your Models

Run the complete finetuning pipeline for a category (e.g., filesys, monday, notion):

# Full pipeline: query gen → trajectory collection → training
cd finetune
bash scripts/finetune.sh filesys

This will:

  1. Generate 1000+ synthetic queries using GPT-5
  2. Collect tool-calling trajectories from baseline model
  3. Clean and validate data (argument checking, type validation)
  4. Split into train/eval/test sets (80/10/10)
  5. Train classifier model → finetune/filesys/results/finetune_output/classifier/
  6. Train per-tool adapters → finetune/filesys/results/finetune_output/tool_adaptors/{tool_name}/

Customization:

# Custom hyperparameters: category, batch_size, grad_accumulation, epochs
bash scripts/finetune_classifier.sh filesys 8 2 3
bash scripts/finetune_tool_adaptors.sh filesys 8 2 3

Training Outputs:

  • Model checkpoints: finetune/{category}/results/finetune_output/
  • Training logs: finetune/{category}/results/log/
  • Processed data: finetune/{category}/results/trajectories/

⚡ Running AgentFlux Inference

Deploy your finetuned models via the AgentFlux proxy server:

# Start the vLLM model servers (classifier and tool adapters)
cd ../inference/agentflux
bash scripts/vllm.sh &

# Start AgentFlux proxy
bash scripts/proxy.sh &

Now send requests to http://localhost:8030/v1/chat/completions using the OpenAI SDK format!

Example Request:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8030/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="agentflux",  # Model name doesn't matter, proxy handles routing
    messages=[
        {"role": "user", "content": "Read the contents of README.md"}
    ],
    tools=[...]  # Your MCP tool definitions
)

print(response.choices[0].message.tool_calls)

🔬 Evaluation

Benchmark your finetuned models against baseline:

# Set workspace for filesys category
export WORKSPACE=/path/to/test/workspace

# Run complete evaluation pipeline
cd ../../orchestration-framework/evaluation
bash scripts/evaluate.sh filesys

This executes:

  1. Query generation for test set
  2. Trajectory generation using AgentFlux proxy
  3. Automated judging against ground truth
  4. Score calculation and metrics reporting

Output:

  • Trajectories: orchestration-framework/evaluation/{category}/eval-results/
  • Judgments: orchestration-framework/evaluation/{category}/judge-results/
  • Final scores printed to console

Manual Evaluation Steps:

# 1. Generate test queries
python orchestration-framework/evaluation/gen_queries.py --category filesys

# 2. Run with AgentFlux
python orchestration-framework/evaluation/run_agentflux.py filesys \
  --classifier config/filesys/classifier.json \
  --tool_adapters config/filesys/tool_adapters.json \
  --query orchestration-framework/evaluation/filesys/queries/fuzzing_queries.txt \
  --output orchestration-framework/evaluation/filesys/eval-results/trajs.jsonl

# 3. Judge results
python orchestration-framework/evaluation/filesys/judge.py \
  --trajs orchestration-framework/evaluation/filesys/eval-results/trajs.jsonl \
  --output orchestration-framework/evaluation/filesys/judge-results/judged.jsonl

# 4. Calculate scores
python orchestration-framework/evaluation/score.py \
  --llm_judge_path orchestration-framework/evaluation/filesys/judge-results/judged.jsonl

🚀 ConsumerBench

ConsumerBench is a benchmarking framework for measuring the system efficiency of local AI models running concurrently with each other. It supports MCP workflows and reports system efficiency, GPU utilization, and power consumption for DualTune workflows.


📚 Documentation

Configuration Files

Each tool category requires configuration in inference/agentflux/config/{category}/ (see the loader sketch after this list):

  • tool_list.json: MCP tool definitions (OpenAI format)
  • classifier.json: Classifier model endpoint and configuration
    {
      "model": "filesys-classifier",
      "port": 8001,
      "tools": ["read_file", "write_file", "list_directory", ...]
    }
  • tool_adapters.json: Per-tool adapter configurations
    {
      "read_file": {"model": "read_file-adapter", "port": 8002},
      "write_file": {"model": "write_file-adapter", "port": 8003},
      ...
    }
  • query_generation_template.txt: Prompt template for generating training queries
  • judge_sys_prompt.txt: System prompt for evaluation judging
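
A hypothetical loader for these files, assuming the directory layout above (helper name and return shape are illustrative, not part of the AgentFlux API):

import json
from pathlib import Path

def load_category_config(category: str, root: str = "inference/agentflux/config") -> dict:
    """Read the three JSON config files for one tool category."""
    base = Path(root) / category
    return {
        "tools": json.loads((base / "tool_list.json").read_text()),
        "classifier": json.loads((base / "classifier.json").read_text()),
        "adapters": json.loads((base / "tool_adapters.json").read_text()),
    }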

Adding a New Tool Category

  1. Create Configuration:

    cd inference
    mkdir -p agentflux/config/my_category
    # Add tool_list.json, classifier.json, tool_adapters.json
  2. Generate Training Data:

    cd ../finetune
    mkdir -p my_category
    # Add query_generation_template.txt
    bash scripts/finetune.sh my_category
  3. Create Judge Script:

    # orchestration-framework/evaluation/my_category/judge.py
    # Implement category-specific validation logic
  4. Run Evaluation:

    cd ../orchestration-framework/evaluation
    bash scripts/evaluate.sh my_category

Advanced Configuration

Finetuning Hyperparameters (scripts/finetune_classifier.sh, scripts/finetune_tool_adaptors.sh):

  • batch_size: Per-device batch size (default: 4)
  • accumulate_step: Gradient accumulation steps (default: 4)
  • num_train_epochs: Training epochs (default: 4)
  • Learning rate: 5e-6 (fixed in script)
  • Scheduler: Cosine annealing

Data Validation (finetune/data_prepare.py; sketched after this list):

  • Validates required vs optional parameters
  • Type checking (string, number, boolean, array, object)
  • Rejects unexpected arguments
  • Filters conversations exceeding 32,768 tokens
  • Binary search removal of problematic entries
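
A minimal sketch of the kind of argument checks described above (hand-rolled for illustration; the actual logic lives in finetune/data_prepare.py):

JSON_TYPES = {
    "string": str, "number": (int, float), "boolean": bool,
    "array": list, "object": dict,
}

def validate_args(args: dict, schema: dict) -> bool:
    """Check a tool call's arguments against a JSON-schema-style tool spec."""
    props = schema.get("properties", {})
    # Reject unexpected arguments
    if any(name not in props for name in args):
        return False
    # Require all mandatory parameters
    if any(req not in args for req in schema.get("required", [])):
        return False
    # Type-check every provided value against its declared JSON type
    return all(
        isinstance(value, JSON_TYPES[props[name]["type"]])
        for name, value in args.items()
    )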

Caching Behavior (inference/agentflux/tool_adaptor.py; sketched after this list):

  • Tracks up to 5 unique function calls per request
  • Blocks after 10 identical calls (prevents loops)
  • Clears cache when "summarize" tool is called
  • Uses SHA256 hashing for call deduplication
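
A sketch of SHA256-based call deduplication with a loop guard (thresholds mirror the description above; the code itself is illustrative, not a copy of tool_adaptor.py):

import hashlib
import json
from collections import Counter

MAX_IDENTICAL_CALLS = 10
call_counts: Counter = Counter()

def call_fingerprint(tool_name: str, args: dict) -> str:
    """Stable hash of a tool call, used as the deduplication key."""
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def should_block(tool_name: str, args: dict) -> bool:
    if tool_name == "summarize":                  # summarize clears the cache
        call_counts.clear()
        return False
    key = call_fingerprint(tool_name, args)
    call_counts[key] += 1
    return call_counts[key] > MAX_IDENTICAL_CALLS  # break repetition loops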

πŸ› οΈ Technical Details

Finetuning Stack

  • Base Model: Qwen2.5-7B-Instruct (32K context)
  • Framework: Unsloth (optimized Hugging Face Transformers)
  • Method: LoRA (Low-Rank Adaptation)
    • Rank: 32
    • Alpha: 64 (scaling factor)
    • Dropout: 0.1
    • Target modules: Q, K, V, O, Gate, Up, Down projections
  • Optimizer: AdamW-8bit (memory efficient)
  • Scheduler: Cosine annealing with warmup
  • Training: RSLoRA enabled, gradient checkpointing (Unsloth mode)

AgentFlux Architecture

Proxy Server (proxy.py):

  • FastAPI application on port 8030
  • OpenAI-compatible /v1/chat/completions endpoint (sketched after this list)
  • Automatic tool list substitution
  • Error handling and logging
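
A minimal sketch of an OpenAI-compatible endpoint in FastAPI (illustrative only; the real proxy.py adds classification, adapter routing, tool-list substitution, and logging):

from fastapi import FastAPI, Request

app = FastAPI()

def classify(messages: list, tools: list) -> str:
    # Stand-in for the classifier stage described below.
    return tools[0]["function"]["name"] if tools else "summarize"

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    tool = classify(body.get("messages", []), body.get("tools", []))
    # The real proxy forwards to the per-tool adapter; here we just echo.
    return {"choices": [{"message": {"role": "assistant",
                                     "tool_calls": [{"type": "function",
                                                     "function": {"name": tool}}]}}]}

# Run with: uvicorn proxy_sketch:app --port 8030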

Classification Strategy (classifier.py; sketched after this list):

  • Generates 10 completions with temperature=1.0
  • Extracts tool names from <tool_call> tags
  • Votes by frequency (most common tool wins)
  • Falls back to "summarize" if no tools detected
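
A sketch of the voting logic (assumes the Qwen-style <tool_call>{...}</tool_call> JSON format; not the verbatim classifier.py):

import json
import re
from collections import Counter

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def vote_tool(completions, fallback="summarize"):
    names = []
    for text in completions:                  # one entry per sampled completion
        for body in TOOL_CALL_RE.findall(text):
            try:
                names.append(json.loads(body)["name"])
            except (json.JSONDecodeError, KeyError):
                continue                      # skip malformed tool calls
    if not names:
        return fallback                       # no tool detected in any sample
    return Counter(names).most_common(1)[0][0]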

Adaptation Strategy (tool_adaptor.py; sketched after this list):

  • Single tool selection via tool_choice parameter
  • Custom chat templates per tool
  • Retry logic (max 3 attempts)
  • Response validation and error handling
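
A sketch of forcing a single tool with retries via the OpenAI SDK (endpoint and model names are illustrative, taken from the adapter config example above):

import openai

client = openai.OpenAI(base_url="http://localhost:8002/v1", api_key="not-needed")

def generate_tool_call(messages, tool_def, max_attempts=3):
    name = tool_def["function"]["name"]
    for _ in range(max_attempts):
        response = client.chat.completions.create(
            model="read_file-adapter",
            messages=messages,
            tools=[tool_def],
            # Force this specific tool instead of letting the model choose
            tool_choice={"type": "function", "function": {"name": name}},
        )
        calls = response.choices[0].message.tool_calls
        if calls:              # response validated; return the parsed call
            return calls[0]
    raise RuntimeError(f"adapter for {name} failed after {max_attempts} attempts")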

Rena Core Implementation

rena-browserd (Rust):

  • Docker API integration (bollard)
  • Async process management (tokio)
  • gRPC server for runtime communication
  • Structured logging (tracing)

rena-runtime (Python):

  • MCP protocol implementation (mcp SDK)
  • Tool invocation and response handling
  • Trajectory logging (JSONL format; example record after this list)
  • Container lifecycle management
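
The JSONL records written by the runtime look roughly like this (field names are hypothetical; the actual schema is defined by rena-runtime):

import json

record = {
    "query": "Read the contents of README.md",
    "tool_calls": [{"name": "read_file", "arguments": {"path": "README.md"}}],
    "response": "...",
}
with open("trajectory.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")   # one JSON object per line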

🤝 Contributing

We welcome contributions! Here are some areas where you can help:

  • New Tool Categories: Add support for additional MCP tool sets (GitHub, Slack, Google Drive, etc.)
  • Evaluation Metrics: Implement new judging criteria and success metrics
  • Model Architectures: Experiment with different base models and training techniques
  • Optimization: Improve inference speed, memory usage, or training efficiency
  • Documentation: Enhance guides, add tutorials, create example notebooks

To Contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Please ensure:

  • Code follows existing style conventions
  • Tests pass (if applicable)
  • Documentation is updated
  • Commit messages are descriptive

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


🌟 Citation

If you use AgentFlux in your research, please cite:

@article{kadekodi2025dualtune,
  title={DualTune: Decoupled Fine-Tuning for On-Device Agentic Systems},
  author={Kadekodi, Rohan and Jin, Zhan and Kamahori, Keisuke and Gu, Yile and Khatiri, Sean and Bayindirli, Noah H and Gorbunov, Sergey and Kasikci, Baris},
  journal={arXiv preprint arXiv:2510.00229},
  year={2025}
}

⭐ Star us on GitHub if AgentFlux helps your project! ⭐
