|
1 | | -# Baseten Plugin for Vision Agents |
| 1 | +# Qwen3-VL hosted on Baseten |
| 2 | +Qwen3-VL is the latest open-source Video Language Model (VLM) from Alibaba. This plugin allows developers to easily run the model hosted on [Baseten](https://www.baseten.co/) with Vision Agents. The model accepts text and video and responds with text vocalised with the TTS service of your choice. |
2 | 3 |
|
3 | | -LLM integrations for the models hosted on Baseten for Vision Agents framework. |
| 4 | +## Features |
4 | 5 |
|
5 | | -TODO |
| 6 | +- **Video understanding**: Automatically buffers and forwards video frames to Baseten-hosted VLM models |
| 7 | +- **Streaming responses**: Supports streaming text responses with real-time chunk events |
| 8 | +- **Frame buffering**: Configurable frame rate and buffer duration for optimal performance |
| 9 | +- **Event-driven**: Emits LLM events (chunks, completion, errors) for integration with other components |
6 | 10 |
|
7 | 11 | ## Installation |
8 | 12 |
|
9 | 13 | ```bash |
10 | | -pip install vision-agents-plugins-baseten |
| 14 | +uv add vision-agents[baseten] |
11 | 15 | ``` |
12 | 16 |
|
13 | | -## Usage |
| 17 | +## Quick Start |
14 | 18 |
|
15 | 19 | ```python |
| 20 | +from vision_agents.core import Agent, User |
| 21 | +from vision_agents.plugins import baseten, getstream, deepgram, elevenlabs, vogent |
16 | 22 |
|
| 23 | +async def create_agent(**kwargs) -> Agent: |
| 24 | + # Initialize the Baseten VLM |
| 25 | + llm = baseten.VLM(model="qwen3vl") |
| 26 | + |
| 27 | + # Create an agent with video understanding capabilities |
| 28 | + agent = Agent( |
| 29 | + edge=getstream.Edge(), |
| 30 | + agent_user=User(name="Video Assistant", id="agent"), |
| 31 | + instructions="You're a helpful video AI assistant. Analyze the video frames and respond to user questions about what you see.", |
| 32 | + llm=llm, |
| 33 | + stt=deepgram.STT(), |
| 34 | + tts=elevenlabs.TTS(), |
| 35 | + turn_detection=vogent.TurnDetection(), |
| 36 | + processors=[], |
| 37 | + ) |
| 38 | + return agent |
| 39 | + |
| 40 | +async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None: |
| 41 | + await agent.create_user() |
| 42 | + call = await agent.create_call(call_type, call_id) |
| 43 | + |
| 44 | + with await agent.join(call): |
| 45 | + # The agent will automatically process video frames and respond to user input |
| 46 | + await agent.finish() |
17 | 47 | ``` |
18 | 48 |
|
| 49 | +## Configuration |
| 50 | + |
| 51 | +### Environment Variables |
| 52 | + |
| 53 | +- **`BASETEN_API_KEY`**: Your Baseten API key (required) |
| 54 | +- **`BASETEN_BASE_URL`**: The base URL for your Baseten API endpoint (required) |
| 55 | + |
| 56 | +### Initialization Parameters |
| 57 | + |
| 58 | +```python |
| 59 | +baseten.VLM( |
| 60 | + model: str, # Baseten model name (e.g., "qwen3vl") |
| 61 | + api_key: Optional[str] = None, # API key (defaults to BASETEN_API_KEY env var) |
| 62 | + base_url: Optional[str] = None, # Base URL (defaults to BASETEN_BASE_URL env var) |
| 63 | + fps: int = 1, # Frames per second to process (default: 1) |
| 64 | + frame_buffer_seconds: int = 10, # Seconds of video to buffer (default: 10) |
| 65 | + client: Optional[AsyncOpenAI] = None, # Custom OpenAI client (optional) |
| 66 | +) |
| 67 | +``` |
| 68 | + |
| 69 | +### Parameters |
| 70 | + |
| 71 | +- **`model`**: The name of the Baseten-hosted model to use. Must be a vision-capable model. |
| 72 | +- **`api_key`**: Your Baseten API key. If not provided, reads from `BASETEN_API_KEY` environment variable. |
| 73 | +- **`base_url`**: The base URL for Baseten API. If not provided, reads from `BASETEN_BASE_URL` environment variable. |
| 74 | +- **`fps`**: Number of video frames per second to capture and send to the model. Lower values reduce API costs but may miss fast-moving content. Default is 1 fps. |
| 75 | +- **`frame_buffer_seconds`**: How many seconds of video to buffer. Total buffer size = `fps * frame_buffer_seconds`. Default is 10 seconds. |
| 76 | +- **`client`**: Optional pre-configured `AsyncOpenAI` client. If provided, `api_key` and `base_url` are ignored. |
| 77 | + |
| 78 | +## How It Works |
| 79 | + |
| 80 | +1. **Video Frame Buffering**: The plugin automatically subscribes to video tracks when the agent joins a call. It buffers frames at the specified FPS for the configured duration. |
| 81 | + |
| 82 | +2. **Frame Processing**: When responding to user input, the plugin: |
| 83 | + - Converts buffered video frames to JPEG format |
| 84 | + - Resizes frames to 800x600 (maintaining aspect ratio) |
| 85 | + - Encodes frames as base64 data URLs |
| 86 | + |
| 87 | +3. **API Request**: Sends the conversation history (including system instructions) along with all buffered frames to the Baseten model. |
| 88 | + |
| 89 | +4. **Streaming Response**: Processes the streaming response and emits events for each chunk and completion. |
| 90 | + |
| 91 | +## Events |
| 92 | + |
| 93 | +The plugin emits the following events: |
| 94 | + |
| 95 | +- **`LLMResponseChunkEvent`**: Emitted for each text chunk in the streaming response |
| 96 | +- **`LLMResponseCompletedEvent`**: Emitted when the response stream completes |
| 97 | +- **`LLMErrorEvent`**: Emitted if an API request fails |
19 | 98 |
|
20 | 99 | ## Requirements |
| 100 | + |
21 | 101 | - Python 3.10+ |
22 | | -- `openai` |
23 | | -- GetStream SDK |
| 102 | +- `openai>=2.5.0` |
| 103 | +- `vision-agents` (core framework) |
| 104 | +- Baseten API key and base URL |
| 105 | + |
| 106 | +## Notes |
| 107 | + |
| 108 | +- **Frame Rate**: The default FPS of 1 is optimized for VLM use cases. Higher FPS values will increase API costs and latency. |
| 109 | +- **Frame Size**: Frames are automatically resized to 800x600 pixels while maintaining aspect ratio to optimize API payload size. |
| 110 | +- **Buffer Duration**: The 10-second default buffer provides context for the model while keeping memory usage reasonable. |
| 111 | +- **Tool Calling**: Tool/function calling support is not yet implemented (see TODOs in code). |
| 112 | + |
| 113 | +## Troubleshooting |
24 | 114 |
|
25 | | -## License |
26 | | -MIT |
| 115 | +- **No video processing**: Ensure the agent has joined a call with video tracks available. The plugin automatically subscribes to video when tracks are added. |
| 116 | +- **API errors**: Verify your `BASETEN_API_KEY` and `BASETEN_BASE_URL` are set correctly and the model name is valid. |
| 117 | +- **High latency**: Consider reducing `fps` or `frame_buffer_seconds` to decrease the number of frames sent per request. |
0 commit comments