Write realtime agent websocket blogpost #332

Merged
merged 13 commits into from
Jan 8, 2025
12 changes: 6 additions & 6 deletions notebook/agentchat_realtime_websocket.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install \"ag2\" \"fastapi>=0.115.0,<1\" \"uvicorn>=0.30.6,<1\""
"!pip install \"ag2\" \"fastapi>=0.115.0,<1\" \"uvicorn>=0.30.6,<1\" \"jinja2\""
]
},
{
Expand All @@ -65,7 +65,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
Expand Down Expand Up @@ -171,7 +171,7 @@
"\n",
"1. **Define Port**: Sets the `PORT` variable to `5050`, which will be used for the server.\n",
"2. **Initialize FastAPI App**: Creates a `FastAPI` instance named `app`, which serves as the main application.\n",
"3. **Define Root Endpoint**: Adds a `GET` endpoint at the root URL (`/`). When accessed, it returns a JSON response with the message `\"Websocket Audio Stream Server is running!\"`.\n",
"3. **Define Root Endpoint**: Adds a `GET` endpoint at the root URL (`/`). When accessed, it returns a JSON response with the message `\"WebSocket Audio Stream Server is running!\"`.\n",
"\n",
"This sets up a basic FastAPI server and provides a simple health-check endpoint to confirm that the server is operational."
]
Expand All @@ -189,7 +189,7 @@
"\n",
"@app.get(\"/\", response_class=JSONResponse)\n",
"async def index_page():\n",
" return {\"message\": \"Websocket Audio Stream Server is running!\"}"
" return {\"message\": \"WebSocket Audio Stream Server is running!\"}"
]
},
{
Expand Down Expand Up @@ -310,7 +310,7 @@
]
},
"kernelspec": {
"display_name": ".venv-3.9",
"display_name": ".venv",
"language": "python",
"name": "python3"
},
Expand All @@ -324,7 +324,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.20"
"version": "3.10.16"
}
},
"nbformat": 4,
311 changes: 311 additions & 0 deletions website/blog/2025-01-08-RealtimeAgent-over-websocket/index.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,311 @@
---
title: Real-Time Voice Interactions with the WebSocket Audio Adapter
authors:
- marklysze
- sternakt
- davorrunje
- davorinrusevljan
tags: [Realtime API, Voice Agents, AI Tools]

---

<div class="blog-authors">
<p class="authors">Authors:</p>
<CardGroup cols={2}>
<Card href="https://github.com/marklysze">
<div class="col card">
<div class="img-placeholder">
<img noZoom src="https://github.com/marklysze.png" />
</div>
<div>
<p class="name">Mark Sze</p>
<p>Software Engineer at AG2.ai</p>
</div>
</div>
</Card>
<Card href="https://github.com/sternakt">
<div class="col card">
<div class="img-placeholder">
<img noZoom src="https://github.com/sternakt.png" />
</div>
<div>
<p class="name">Tvrtko Sternak</p>
<p>Machine Learning Engineer at Airt</p>
</div>
</div>
</Card>
<Card href="https://github.com/davorrunje">
<div class="col card">
<div class="img-placeholder">
<img noZoom src="https://github.com/davorrunje.png" />
</div>
<div>
<p class="name">Davor Runje</p>
<p>CTO at Airt</p>
</div>
</div>
</Card>
<Card href="https://github.com/davorinrusevljan">
<div class="col card">
<div class="img-placeholder">
<img noZoom src="https://github.com/davorinrusevljan.png" />
</div>
<div>
<p class="name">Davorin Ruševljan</p>
<p>Developer</p>
</div>
</div>
</Card>
</CardGroup>
</div>

![Realtime agent communication over websocket](img/websocket_communication_diagram.png)

**TL;DR:**
- **Demo implementation**: Implement a website using websockets and communicate using voice with the [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent)
- **Introducing [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter)**: Stream audio directly from your browser using [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/).
- **Simplified Development**: Connect to real-time agents quickly and effortlessly with minimal setup.

# **Realtime over WebSockets**

In our [previous blog post](/blog/2024-12-20-RealtimeAgent/index), we introduced a way to interact with the [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent) using [**`TwilioAudioAdapter`**](/docs/reference/agentchat/realtime_agent/twilio_audio_adapter#twilioaudioadapter). While effective, this approach required a setup-intensive process involving [Twilio](https://www.twilio.com/) integration, account configuration, number forwarding, and other complexities. Today, we're excited to introduce the [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter), a streamlined approach to real-time audio streaming directly via a web browser.

This post explores the features, benefits, and implementation of the [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter), showing how it transforms the way we connect with real-time agents.

## **Why We Built the [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter)**
### **Challenges with Existing Solutions**
While the previously introduced [**`TwilioAudioAdapter`**](/docs/reference/agentchat/realtime_agent/twilio_audio_adapter#twilioaudioadapter) provides a robust way to connect to your [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent), it comes with challenges:
- **Browser Limitations**: For teams building web-first applications, integrating with a telephony platform can feel redundant.
- **Complex Setup**: Configuring Twilio accounts, verifying numbers, and setting up forwarding can be time-consuming.
- **Platform Dependency**: This solution requires developers to rely on an external API, which adds latency and costs.

### **Our Solution**
The [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter) eliminates these challenges by allowing direct audio streaming over [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/). It integrates seamlessly with modern web technologies, enabling real-time voice interactions without external telephony platforms.

## **How It Works**
At its core, the [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter) leverages [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/) to handle real-time audio streaming. This means your browser becomes the communication bridge, sending audio packets to a server where a [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent) processes them.

Here’s a quick overview of its components and how they fit together:

1. **WebSocket Connection**:
- The adapter establishes a [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) connection between the client (browser) and the server.
- Audio packets are streamed in real time through this connection.

2. **Integration with FastAPI**:
- Using Python's [FastAPI](https://fastapi.tiangolo.com/) framework, developers can easily set up endpoints for handling [WebSocket](https://fastapi.tiangolo.com/advanced/websockets/) traffic.

3. **Powered by Realtime Agents**:
- The audio adapter integrates with an AI-powered [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent), allowing the agent to process audio inputs and respond intelligently.
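The flow above can be sketched without any network code. The snippet below is only an illustration: the JSON message shape (`event`, `payload`) and the helper names are assumptions made for this example, not the adapter's actual wire format. It shows one common way to carry binary audio in text WebSocket frames, by base64-encoding each chunk.

```python
import base64
import json


def encode_audio_chunk(chunk: bytes) -> str:
    """Wrap raw audio bytes in a JSON text message (hypothetical format)."""
    payload = base64.b64encode(chunk).decode("ascii")
    return json.dumps({"event": "media", "payload": payload})


def decode_audio_chunk(message: str) -> bytes:
    """Recover the raw audio bytes from a message built by encode_audio_chunk."""
    data = json.loads(message)
    if data.get("event") != "media":
        raise ValueError(f"unexpected event: {data.get('event')!r}")
    return base64.b64decode(data["payload"])


# Round trip: the bytes that go in on the browser side come out on the server side.
chunk = bytes(range(8))
message = encode_audio_chunk(chunk)
assert decode_audio_chunk(message) == chunk
```

Base64 adds roughly a third to the payload size, which is usually an acceptable trade for being able to mix audio and control messages in one JSON stream.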

## **Key Features**
### **1. Simplified Setup**
Unlike [**`TwilioAudioAdapter`**](/docs/reference/agentchat/realtime_agent/twilio_audio_adapter#twilioaudioadapter), the [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter) requires no phone numbers, no telephony configuration, and no external accounts. It's a plug-and-play solution.

### **2. Real-Time Performance**
By streaming audio over [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/), the adapter ensures low latency, making conversations feel natural and seamless.

### **3. Browser-Based**
Everything happens within the user's browser, meaning no additional software is required. This makes it ideal for web applications.

### **4. Flexible Integration**
Whether you're building a chatbot, a voice assistant, or an interactive application, the adapter can integrate easily with existing frameworks and AI systems.

## **Example: Build a Voice-Enabled Weather Bot**
Let’s walk through a practical example where we use the [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter) to create a voice-enabled weather bot.
You can find the full example [here](https://github.com/ag2ai/realtime-agent-over-websockets/tree/main).

To run the demo example, follow these steps:

### **1. Clone the Repository**
```bash
git clone https://github.com/ag2ai/realtime-agent-over-websockets.git
cd realtime-agent-over-websockets
```

### **2. Set Up Environment Variables**
Create an `OAI_CONFIG_LIST` file based on the provided `OAI_CONFIG_LIST_sample`:
```bash
cp OAI_CONFIG_LIST_sample OAI_CONFIG_LIST
```
In the `OAI_CONFIG_LIST` file, update the `api_key` to your OpenAI API key.
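For orientation, here is a small sketch of checking such a file before starting the server. It assumes the file is a JSON list of objects that each carry an `api_key` field (the only field named above). The `model` value, the placeholder string, and the helper name are made up for this example, so check `OAI_CONFIG_LIST_sample` for the real layout.

```python
import json


def find_unset_keys(
    config_text: str, placeholder: str = "<your OpenAI API key here>"
) -> list[int]:
    """Return indexes of entries whose api_key is missing, empty, or a placeholder."""
    entries = json.loads(config_text)
    bad = []
    for i, entry in enumerate(entries):
        key = entry.get("api_key", "")
        if not key or key == placeholder:
            bad.append(i)
    return bad


# Hypothetical file contents: the first entry still has the placeholder key.
sample = json.dumps(
    [
        {"model": "example-realtime-model", "api_key": "<your OpenAI API key here>"},
        {"model": "example-realtime-model", "api_key": "sk-example"},
    ]
)
print(find_unset_keys(sample))  # prints [0]
```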

### (Optional) Create and use a virtual environment

To avoid cluttering your global Python environment, you can create a virtual environment. On your command line, enter:

```bash
python3 -m venv env
source env/bin/activate
```

### **3. Install Dependencies**
Install the required Python packages using `pip`:
```bash
pip install -r requirements.txt
```

### **4. Start the Server**
Run the application with Uvicorn:
```bash
uvicorn realtime_over_websockets.main:app --port 5050
```

After you start the server you should see your application running in the logs:

```bash
INFO: Started server process [64425]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5050 (Press CTRL+C to quit)
```
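To confirm the server is reachable before opening the browser, you can query the root endpoint, which returns a JSON message. The following is a minimal sketch using only the standard library. The function names are made up for this example, and the URL assumes the default `localhost:5050` setup used in this post.

```python
import json
from urllib.request import urlopen

EXPECTED = "WebSocket Audio Stream Server is running!"


def is_healthy(body: bytes) -> bool:
    """Return True if the response body carries the expected health message."""
    try:
        return json.loads(body).get("message") == EXPECTED
    except (ValueError, AttributeError):
        # Not JSON at all, or JSON that is not an object.
        return False


def check_server(url: str = "http://localhost:5050/") -> bool:
    """Fetch the root endpoint and validate its body (needs a running server)."""
    with urlopen(url, timeout=5) as response:
        return is_healthy(response.read())
```

Calling `check_server()` while Uvicorn is running should return `True`; if the server is not up, `urlopen` raises an error instead.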

### Ready to Chat? 🚀
Now you can simply open [**localhost:5050/start-chat**](http://localhost:5050/start-chat) in your browser, and dive into an interactive conversation with the [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent)! 🎤✨

To get started, simply speak into your microphone and ask a question. For example, you can say:

**"What's the weather like in Seattle?"**

This initial question will activate the agent, and it will respond, showcasing its ability to understand and interact with you in real time.

## Code review
Let’s dive in and break down how this example works—from setting up the server to handling real-time audio streaming with [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/).

### **Set Up the FastAPI app**
We use [FastAPI](https://fastapi.tiangolo.com/) to serve the chat interface and handle WebSocket connections. A key part is configuring the server to load and render HTML templates dynamically for the user interface.

- **Template Loading**: Use `Jinja2Templates` to load `chat.html` from the `templates` directory. The template is dynamically rendered with variables like the server's `port`.
- **Static Files**: Serve assets (e.g., JavaScript, CSS) from the `static` directory.

```python
from pathlib import Path

from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse, JSONResponse
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates

app = FastAPI()


# Health-check endpoint: confirms the server is up.
@app.get("/", response_class=JSONResponse)
async def index_page() -> dict[str, str]:
    return {"message": "WebSocket Audio Stream Server is running!"}


website_files_path = Path(__file__).parent / "website_files"

# Serve JavaScript, CSS, and other assets under /static.
app.mount(
    "/static", StaticFiles(directory=website_files_path / "static"), name="static"
)

templates = Jinja2Templates(directory=website_files_path / "templates")


@app.get("/start-chat/", response_class=HTMLResponse)
async def start_chat(request: Request) -> HTMLResponse:
    """Endpoint to return the HTML page for audio chat."""
    port = request.url.port
    return templates.TemplateResponse("chat.html", {"request": request, "port": port})
```
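One detail worth knowing: `request.url.port` is `None` when the URL carries no explicit port, for example behind a proxy on the default port 80 or 443. The sketch below shows one way a page could build the WebSocket URL for `/media-stream` with that in mind. The helper name and the fallback behaviour are assumptions for illustration; the demo's `chat.html` may handle this differently.

```python
from typing import Optional


def build_ws_url(host: str, port: Optional[int], secure: bool = False) -> str:
    """Build the WebSocket URL for the /media-stream route (hypothetical helper)."""
    scheme = "wss" if secure else "ws"
    # Omit the port when it is unknown so the browser uses the scheme default.
    netloc = host if port is None else f"{host}:{port}"
    return f"{scheme}://{netloc}/media-stream"


print(build_ws_url("localhost", 5050))  # ws://localhost:5050/media-stream
print(build_ws_url("example.com", None, secure=True))  # wss://example.com/media-stream
```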

### Defining the WebSocket Endpoint

The `/media-stream` WebSocket route is where real-time audio interaction is processed and streamed to the AI assistant. Let’s break it down step-by-step:

1. **Accept the WebSocket Connection**
The WebSocket connection is established when a client connects to `/media-stream`. Using `await websocket.accept()`, we ensure the connection is live and ready for communication.

```python
@app.websocket("/media-stream")
async def handle_media_stream(websocket: WebSocket) -> None:
    """Handle WebSocket connections providing audio stream and OpenAI."""
    await websocket.accept()
```

2. **Initialize Logging**
A logger instance (`getLogger("uvicorn.error")`) is set up to monitor and debug the server's activities, helping track events during the connection and interaction process.

```python
logger = getLogger("uvicorn.error")
```
3. **Set Up the `WebSocketAudioAdapter`**
The [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter) bridges the client’s audio stream with the [**`RealtimeAgent`**](/docs/reference/agentchat/realtime_agent/realtime_agent). It streams audio data over [WebSockets](https://fastapi.tiangolo.com/advanced/websockets/) in real time, ensuring seamless communication between the browser and the agent.

```python
audio_adapter = WebSocketAudioAdapter(websocket, logger=logger)
```

4. **Configure the Realtime Agent**
The `RealtimeAgent` is the AI assistant driving the interaction. Key parameters include:
- **Name**: The agent identity, here called `"Weather Bot"`.
- **System Message**: The instructions that define the agent's behavior and opening line.
- **Language Model Configuration**: Defined by `realtime_llm_config` for LLM settings.
- **Audio Adapter**: Connects the [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter) for handling audio.
- **Logger**: Logs the agent's activities for better observability.

```python
realtime_agent = RealtimeAgent(
    name="Weather Bot",
    system_message="Hello there! I am an AI voice assistant powered by Autogen and the OpenAI Realtime API. You can ask me about weather, jokes, or anything you can imagine. Start by saying 'How can I help you'?",
    llm_config=realtime_llm_config,
    audio_adapter=audio_adapter,
    logger=logger,
)
```

5. **Define a Custom Realtime Function**
The `get_weather` function is registered as a realtime callable function. When the user asks about the weather, the agent can call the function and respond based on the result. In this demo the report is hard-coded:
- Returns `"The weather is cloudy."` for `"Seattle"`.
- Returns `"The weather is sunny."` for other locations.

```python
@realtime_agent.register_realtime_function(  # type: ignore [misc]
    name="get_weather", description="Get the current weather"
)
def get_weather(location: Annotated[str, "city"]) -> str:
    return (
        "The weather is cloudy."
        if location == "Seattle"
        else "The weather is sunny."
    )
```

6. **Run the Realtime Agent**
Awaiting `realtime_agent.run()` starts the agent, which handles incoming audio streams, processes user queries, and responds in real time.

Here is the full code for the `/media-stream` endpoint:

```python
@app.websocket("/media-stream")
async def handle_media_stream(websocket: WebSocket) -> None:
    """Handle WebSocket connections providing audio stream and OpenAI."""
    await websocket.accept()

    logger = getLogger("uvicorn.error")

    audio_adapter = WebSocketAudioAdapter(websocket, logger=logger)

    realtime_agent = RealtimeAgent(
        name="Weather Bot",
        system_message="Hello there! I am an AI voice assistant powered by Autogen and the OpenAI Realtime API. You can ask me about weather, jokes, or anything you can imagine. Start by saying 'How can I help you'?",
        llm_config=realtime_llm_config,
        audio_adapter=audio_adapter,
        logger=logger,
    )

    @realtime_agent.register_realtime_function(  # type: ignore [misc]
        name="get_weather", description="Get the current weather"
    )
    def get_weather(location: Annotated[str, "city"]) -> str:
        return (
            "The weather is cloudy."
            if location == "Seattle"
            else "The weather is sunny."
        )

    await realtime_agent.run()
```
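Because `get_weather` is plain Python, its logic can be checked without a microphone or an API key. The sketch below repeats the function body without the `register_realtime_function` decorator, so it does not depend on `ag2`, and shows that the comparison is an exact, case-sensitive match. The city `"Zagreb"` is just an arbitrary non-Seattle input.

```python
from typing import Annotated


def get_weather(location: Annotated[str, "city"]) -> str:
    # Same body as in the endpoint, minus the registration decorator.
    return (
        "The weather is cloudy."
        if location == "Seattle"
        else "The weather is sunny."
    )


assert get_weather("Seattle") == "The weather is cloudy."
assert get_weather("Zagreb") == "The weather is sunny."
# Exact match only: a lowercase city name falls through to the default.
assert get_weather("seattle") == "The weather is sunny."
```

If the model may pass variations such as `"seattle"` or `"Seattle, WA"`, normalizing the input before comparing is probably worthwhile in a real function.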

## **Benefits in Action**
- **Quick Prototyping**: Spin up a real-time voice application in minutes.
- **Cost Efficiency**: Eliminate third-party telephony costs.
- **User-Friendly**: Runs in the browser, making it accessible to anyone with a microphone.

## **Conclusion**
The [**`WebSocketAudioAdapter`**](/docs/reference/agentchat/realtime_agent/websocket_audio_adapter#websocketaudioadapter) marks a shift toward simpler, more accessible real-time audio solutions. It empowers developers to build and deploy voice applications faster and more efficiently. Whether you're creating an AI assistant, a voice-enabled app, or an experimental project, this adapter is your go-to tool for real-time audio streaming.

Try it out and bring your voice-enabled ideas to life!
1 change: 1 addition & 0 deletions website/mint.json
Original file line number Diff line number Diff line change
Expand Up @@ -507,6 +507,7 @@
{
"group": "Recent posts",
"pages": [
"blog/2025-01-08-RealtimeAgent-over-websocket/index",
"blog/2024-12-20-RealtimeAgent/index",
"blog/2024-12-20-Tools-interoperability/index",
"blog/2024-12-20-Reasoning-Update/index",