A reference implementation demonstrating a multi-agent architecture using Deepgram's Voice Agent API, where specialized agents handle different phases of customer interactions through seamless handoffs.
Traditional single-agent voice systems face fundamental limitations as complexity grows. This implementation demonstrates how to overcome these challenges by treating the conversation as a sequence of specialized phases/states, where each phase has its own:
- System instructions (the agent's prompt)
- Transformed context (summarized from previous conversation)
- Specific tools (2-4 focused functions per agent)
This approach solves critical problems:
- Context Management: Each agent starts fresh rather than accumulating entire conversation history
- Focused Responsibility: Agents excel at their specific task instead of juggling everything
- Better Reliability: Fewer functions per agent means clearer decision-making for the LLM
- Easier Debugging: Issues are isolated to specific agents and transitions
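Conceptually, each phase is a small bundle of prompt, context, and tools. The dataclass below is an illustrative sketch only; `AgentPhase` is not a class from this repo:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a conversation phase; AgentPhase is an illustrative
# name, not a class from this repository.
@dataclass
class AgentPhase:
    name: str
    prompt: str        # system instructions for this phase
    context: str = ""  # summarized context handed over from the previous phase
    functions: list = field(default_factory=list)  # 2-4 focused tools

QUALIFIER = AgentPhase(
    name="qualifier",
    prompt="You qualify inbound leads: collect name, location, and need.",
    functions=["handoff_to_next_agent", "end_conversation"],
)
```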
┌─────────────────────────────────────────┐
│ Twilio Phone Call (Persistent) │
│ WebSocket Connection │
└──────────────────┬──────────────────────┘
│ Audio Stream
┌─────────▼──────────┐
│ CallOrchestrator │ ← Central coordinator
│ ┌────────────────┐ │
│ │ Audio Forward │ │ ← Persistent task
│ │ Task │ │
│ └────────────────┘ │
└─────────┬──────────┘
│ Manages lifecycle
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Qualifier│──►│ Advisor │──►│ Closer │
│ Agent │ │ Agent │ │ Agent │ ← Ephemeral Voice Agent sessions
└─────────┘ └─────────┘ └─────────┘
│ │ │
└──────────────┼──────────────┘
│
Groq API (Llama 3.3 70B by default)
Context Summarization
Key Architecture Points:
- Persistent: Twilio WebSocket and audio forwarding task remain active throughout
- Ephemeral: Voice Agent sessions are created/destroyed per agent
- Orchestration: CallOrchestrator manages all transitions and state
- Context Transfer: A small and fast LLM (via Groq API) summarizes conversation between agents
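The context-transfer step can be sketched as follows. `build_summary_messages` and its prompt are hypothetical helpers, not the repo's actual code; the commented-out call assumes the official `groq` Python client:

```python
# Illustrative sketch of context transfer: a small, fast LLM condenses the
# finished phase's transcript into a short handoff summary for the next agent.

def build_summary_messages(transcript):
    """Turn a transcript (list of {'role', 'text'} dicts) into a chat payload."""
    lines = "\n".join("{}: {}".format(t["role"], t["text"]) for t in transcript)
    return [
        {
            "role": "system",
            "content": (
                "Summarize this call in 1-2 sentences for the next agent. "
                "Keep names, locations, and the customer's stated need."
            ),
        },
        {"role": "user", "content": lines},
    ]

# With the official groq client, the call would look roughly like:
#   from groq import Groq
#   client = Groq()  # reads GROQ_API_KEY from the environment
#   resp = client.chat.completions.create(
#       model="llama-3.3-70b-versatile",
#       messages=build_summary_messages(transcript),
#   )
#   summary = resp.choices[0].message.content
```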
This implementation creates separate Voice Agent sessions for each specialized agent. The CallOrchestrator (orchestrator/call_orchestrator.py) manages these transitions while maintaining the Twilio connection.
A critical implementation detail: the audio forwarding task starts once and persists throughout all agent transitions. This ensures the Twilio WebSocket remains active while agents switch:
# Audio task starts ONCE and persists
if not self.audio_task or self.audio_task.done():
self.audio_task = asyncio.create_task(self.forward_twilio_audio())
Voice Agent connections are async context managers that require manual lifecycle management:
# Creating connection
self.current_agent_context = self.current_client.agent.v1.connect()
self.current_agent_connection = await self.current_agent_context.__aenter__()
# Closing connection (keep audio task alive during transitions)
await self.close_current_agent(keep_audio_task=True)
Functions must be called with proper timing to maintain conversation flow:
# 1. Send response to current agent
response = AgentV1FunctionCallResponseMessage(...)
await self.current_agent_connection.send_function_call_response(response)
# 2. Brief pause for processing
await asyncio.sleep(0.5)
# 3. Then perform the action (e.g., transition)
await self.transition_to_agent(next_agent)
Each agent represents a distinct conversation phase with focused responsibilities:
Qualifier Agent
- Purpose: Initial contact and lead qualification
- Functions: handoff_to_next_agent, end_conversation
- Collects: Name, location, specific needs

Advisor Agent
- Purpose: Provide consultation and recommendations
- Functions: handoff_to_next_agent, end_conversation
- Context: Receives summary from qualifier

Closer Agent
- Purpose: Schedule follow-up and gather feedback
- Functions: schedule_followup, record_satisfaction, end_conversation
- Context: Receives summary from advisor
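The three phases form a simple linear flow. A sketch of that flow (the dict layout is hypothetical; the agent names and function lists come from the descriptions above):

```python
# Illustrative phase graph: who hands off to whom, and which tools each
# agent gets. The data structure is hypothetical, not the repo's actual code.
AGENT_FLOW = {
    "qualifier": {"next": "advisor",
                  "functions": ["handoff_to_next_agent", "end_conversation"]},
    "advisor":   {"next": "closer",
                  "functions": ["handoff_to_next_agent", "end_conversation"]},
    "closer":    {"next": None,  # final phase: no further handoff
                  "functions": ["schedule_followup", "record_satisfaction",
                                "end_conversation"]},
}

def next_agent(current):
    """Return the agent that follows `current`, or None if the flow ends."""
    return AGENT_FLOW[current]["next"]
```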
When an agent calls handoff_to_next_agent, the orchestrator:
- Summarizes the current conversation via the Groq API
- Closes the current Voice Agent session (but keeps audio task running)
- Starts a new Voice Agent session with the summarized context
- Continues audio forwarding to the new agent seamlessly
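The four handoff steps can be sketched with stubbed helpers (all names here are illustrative; the real logic lives in orchestrator/call_orchestrator.py):

```python
import asyncio

# Stubbed sketch of the handoff sequence; each helper just records that it
# ran, so the ordering of the four steps is visible.
class HandoffDemo:
    def __init__(self):
        self.log = []

    async def summarize(self, transcript):
        self.log.append("summarize")  # 1. condense transcript via the LLM
        return "John Smith from Seattle needs retirement planning advice"

    async def close_current_agent(self, keep_audio_task=True):
        # 2. close the Voice Agent session; the audio task is NOT cancelled
        self.log.append("close(keep_audio_task={})".format(keep_audio_task))

    async def start_agent(self, name, context):
        self.log.append("start({})".format(name))  # 3. new session with context

    async def transition_to_agent(self, name, transcript):
        context = await self.summarize(transcript)
        await self.close_current_agent(keep_audio_task=True)
        await self.start_agent(name, context)
        # 4. the persistent audio task keeps forwarding to the new session

demo = HandoffDemo()
asyncio.run(demo.transition_to_agent("advisor", []))
print(demo.log)
# -> ['summarize', 'close(keep_audio_task=True)', 'start(advisor)']
```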
Each agent is configured with settings for STT, LLM, TTS, and functions:
from deepgram.extensions.types.sockets import AgentV1SettingsMessage
def get_qualifier_config(context: str = "") -> AgentV1SettingsMessage:
return AgentV1SettingsMessage(
audio=AgentV1AudioConfig(...), # Audio encoding settings
agent=AgentV1Agent(
listen=AgentV1Listen(...), # Deepgram Flux STT
think=AgentV1Think( # LLM configuration
model="gpt-4o-mini",
prompt=QUALIFIER_PROMPT,
functions=QUALIFIER_FUNCTIONS
),
speak=AgentV1SpeakProviderConfig(...), # Deepgram Aura TTS
greeting="Hi, this is Alex..."
)
)
Functions include detailed descriptions to guide the LLM's behavior. See agents/shared/functions.py for examples. For comprehensive function definition best practices, refer to docs/FUNCTION_GUIDE.md.
Key pattern: Functions should wait for customer confirmation:
HANDOFF_FUNCTION = AgentV1Function(
name="handoff_to_next_agent",
description="""...
CORRECT PATTERN:
1. You ask: "I can connect you with [next agent]. Would that work?"
2. WAIT for customer response
3. Customer says: "Yes"
4. You IMMEDIATELY call this function WITHOUT additional text
"""
)
- Python 3.8+
- Twilio account with phone number (with outbound calling enabled)
- Deepgram API key
- Groq API key (free tier available at https://console.groq.com/)
- Public tunnel (ngrok, zrok, etc.)
This implementation uses the Deepgram Python SDK:
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install packages
pip install -r requirements.txt
Copy .env.example to .env and add your API keys:
cp .env.example .env
The key environment variables (config.py manages these):
- DEEPGRAM_API_KEY - Your Deepgram API key
- GROQ_API_KEY - Groq API key for conversation summarization
- TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, TWILIO_PHONE_NUMBER - Twilio credentials
- LEAD_SERVER_EXTERNAL_URL - Your public tunnel URL
- LEAD_PHONE_NUMBER - Phone number to call
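Filled in, a minimal .env might look like this (all values below are placeholders, not real credentials):

```
DEEPGRAM_API_KEY=your_deepgram_api_key
GROQ_API_KEY=your_groq_api_key
TWILIO_ACCOUNT_SID=your_twilio_account_sid
TWILIO_AUTH_TOKEN=your_twilio_auth_token
TWILIO_PHONE_NUMBER=+15551234567
LEAD_SERVER_EXTERNAL_URL=https://your-tunnel.example.com
LEAD_PHONE_NUMBER=+15557654321
```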
# Using zrok (free, easy)
zrok share public localhost:8000
# Or ngrok
ngrok http 8000 --scheme=ws
Copy the public URL and update LEAD_SERVER_EXTERNAL_URL in your .env file.
python main.py
The system will:
- Start a WebSocket server on port 8000
- Initiate an outbound call to LEAD_PHONE_NUMBER
- Begin the conversation with the Qualifier agent
Here's a simplified flow showing agent transitions and function calls:
Agent: "Hi, this is Alex calling from our advisory services.
Is now a good time to chat briefly?"
Customer: "Sure, I have a few minutes"
Agent: "Great! May I get your name?"
Customer: "John Smith"
Agent: "Where are you located?"
Customer: "Seattle"
Agent: "What brings you to call us today?"
Customer: "I need help with retirement planning"
Agent: "I can connect you with one of our advisors. Would that work?"
Customer: "Yes, that'd be great"
[Agent calls handoff_to_next_agent function]
- Groq summarizes: "John Smith from Seattle needs retirement planning advice"
- Qualifier session closes
- Advisor session starts with context
Agent: "Hi John! I understand you're interested in retirement planning.
How can I help?"
Customer: "I'm 45 and want to retire by 60"
Agent: "Based on your timeline, I recommend a formal consultation.
Can I connect you with our team to schedule that?"
Customer: "Yes please"
[Agent calls handoff_to_next_agent function]
- Groq summarizes: "John Smith from Seattle would like to schedule a formal consultation"
- Advisor session closes
- Closer session starts with context
Agent: "Thanks for speaking with our advisor.
When would work best for your consultation?"
Customer: "Next Wednesday afternoon"
[Agent calls schedule_followup function]
Agent: "Perfect! How would you rate your experience today from 1 to 5?"
Customer: "5"
[Agent calls record_satisfaction function]
Agent: "Thank you! Have a great day!"
[Agent calls end_conversation function]
deepgram-voice-agent-multi-agent-/
├── main.py # Entry point - starts server, initiates calls
├── config.py # Environment variable management
│
├── orchestrator/
│ └── call_orchestrator.py # Core orchestration logic
│
├── agents/
│ ├── shared/
│ │ └── functions.py # Shared function definitions
│ ├── qualifier/ # Qualifier agent configuration
│ ├── advisor/ # Advisor agent configuration
│ └── closer/ # Closer agent configuration
│
├── utils/
│ └── context_summarizer.py # Groq LLM summarization
│
├── call_handling/
│ └── twilio_client.py # Twilio API wrapper
│
└── docs/
├── PROMPT_GUIDE.md # Voice agent prompt best practices
└── FUNCTION_GUIDE.md # Function definition best practices
Edit the prompts in each agent's config file:
Qualifier (agents/qualifier/config.py):
- Change greeting message
- Adjust information gathering flow
- Modify qualification criteria
Advisor (agents/advisor/config.py):
- Customize consultation approach
- Change expertise area (retirement, investments, etc.)
- Adjust handoff triggers
Closer (agents/closer/config.py):
- Modify scheduling questions
- Change satisfaction survey format
- Customize closing message
Important: See docs/PROMPT_GUIDE.md for voice-specific prompt engineering best practices.
Quick overview:
- Create agents/your_agent/config.py with agent configuration
- Update orchestrator/call_orchestrator.py transition logic
- Add agent-specific function handlers if needed
Edit utils/context_summarizer.py to:
- Switch LLM models (currently Llama 3.3 70B on Groq); see Groq's documentation for other available models
- Modify summarization prompts for better context extraction
- Change which data points are captured
- docs/PROMPT_GUIDE.md - Voice agent prompt engineering best practices
- docs/FUNCTION_GUIDE.md - Function definition best practices