BUD-E (Buddy) is an open-source voice assistant framework designed to facilitate seamless interaction with AI models and APIs. It enables the creation and integration of diverse skills for educational and research applications.
BUD-E V1.0 operates on a client-server architecture, allowing users to communicate with the assistant on edge devices while the main computation runs on a server, which can be either cloud-based or a local machine equipped with a powerful GPU.

The responsibilities are split as follows:
- Server: Handles main computation (speech recognition, language processing, text-to-speech, vision processing).
- Client: Manages user interactions (audio recording, playback, clipboard management).
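As a rough illustration of this split, the snippet below shows a client sending recorded audio to the server and receiving synthesized speech back. This is a minimal sketch only: the endpoint path, payload format, and transport are assumptions for illustration, not the actual BUD-E protocol.

```python
# Minimal sketch of the client/server split. The URL, route, and payload below are
# hypothetical; the real BUD-E client and server define their own protocol.
import requests

SERVER_URL = "http://192.168.1.50:8000"  # assumed address of the BUD-E server

def send_audio_to_server(wav_bytes: bytes) -> bytes:
    """Send recorded audio to the server and return the synthesized reply audio."""
    response = requests.post(
        f"{SERVER_URL}/process_audio",  # hypothetical route
        files={"audio": ("input.wav", wav_bytes, "audio/wav")},
        timeout=60,
    )
    response.raise_for_status()
    return response.content  # assumed: server replies with TTS audio bytes

if __name__ == "__main__":
    with open("recording.wav", "rb") as f:
        reply = send_audio_to_server(f.read())
    with open("reply.wav", "wb") as f:
        f.write(reply)
```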
The server integrates four core components:
- Automatic Speech Recognition (ASR)
- Language Model (LLM)
- Text-to-Speech (TTS)
- Vision Processing (Image Captioning and OCR)
Two clients are currently available:
- Python Desktop Client (Windows and Linux)
- School BUD-E Web Interface

Note: macOS support for the desktop client is waiting for you to build it. :)
To set up the server:

- Clone the repository:

  ```bash
  git clone https://github.com/LAION-AI/BUD-E_V1.0.git
  cd BUD-E_V1.0/server
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure the components in their respective files (an illustrative adapter sketch follows these steps):
  - ASR: `bud_e_transcribe.py`
  - LLM: `bud_e_llm.py`
  - TTS: `bud_e_tts.py`
  - Vision: `bud_e_captioning_with_ocr.py`

- Start the server:

  ```bash
  python bud_e-server.py
  ```
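To give a sense of what goes into these configuration files, the sketch below shows an LLM adapter pointed at an OpenAI-compatible endpoint (such as a self-hosted vLLM or Ollama server). The function name, variable names, and endpoint are illustrative assumptions, not the actual contents of `bud_e_llm.py`.

```python
# Hypothetical LLM adapter sketch; bud_e_llm.py defines its own names and settings.
import requests

API_BASE = "http://localhost:8000/v1"  # assumed OpenAI-compatible endpoint (e.g. vLLM, Ollama)
API_KEY = "YOUR_API_KEY"               # placeholder key
MODEL = "your-model-name"              # placeholder model identifier

def generate_reply(system_prompt: str, user_message: str) -> str:
    """Send a chat-completion request and return the assistant's reply text."""
    response = requests.post(
        f"{API_BASE}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": MODEL,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```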
To set up the client:

- Navigate to the client directory:

  ```bash
  cd ../client
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure the client (illustrative values are sketched after these steps):
  - Edit `bud_e_client.py` to set the server IP and port.
  - Obtain a Porcupine API key for wake word detection.

- Run the client:

  ```bash
  python bud_e_client.py
  ```
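For orientation, the snippet below shows the kind of values this configuration involves and a quick check that a Porcupine key works. The variable names are illustrative assumptions; check `bud_e_client.py` for the settings it actually expects.

```python
# Illustrative client settings (hypothetical names; bud_e_client.py defines its own).
SERVER_IP = "192.168.1.50"   # machine running bud_e-server.py
SERVER_PORT = 8000           # port the server listens on
PORCUPINE_ACCESS_KEY = "YOUR_PORCUPINE_API_KEY"  # from the Picovoice console

# Quick sanity check that the Porcupine key is valid (uses the pvporcupine package).
import pvporcupine

porcupine = pvporcupine.create(
    access_key=PORCUPINE_ACCESS_KEY,
    keywords=["computer"],   # one of Porcupine's built-in wake words
)
print("Porcupine ready, expected frame length:", porcupine.frame_length)
porcupine.delete()
```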
BUD-E's functionality can be extended through a skill system. Skills are Python functions that can be activated in two ways:
- Keyword Activation
- Language Model (LM) Activation
To create a new skill:

- Create a Python file in the `client/skills` folder.
- Define the skill function with this structure (a complete keyword-activated example follows this list):

  ```python
  def skill_name(transcription_response, client_session, LMGeneratedParameters=""):
      # Skill logic
      return skill_response, client_session
  ```

- Add a skill description comment above the function.

  For keyword-activated skills:

  ```python
  # KEYWORD ACTIVATED SKILL: [["keyword1"], ["keyword2", "keyword3"], ["phrase1"]]
  ```

  For LM-activated skills:

  ```python
  # LM ACTIVATED SKILL: SKILL TITLE: Skill Name DESCRIPTION: What the skill does. USAGE INSTRUCTIONS: How to use the skill.
  ```
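For example, a minimal keyword-activated skill following this structure could look like the sketch below. This particular skill (telling the current time) is a hypothetical illustration and not part of the repository.

```python
# KEYWORD ACTIVATED SKILL: [["what time is it"], ["current time"]]
from datetime import datetime

def tell_time(transcription_response, client_session, LMGeneratedParameters=""):
    # Build a spoken reply containing the current local time.
    now = datetime.now().strftime("%H:%M")
    skill_response = f"It is currently {now}."
    return skill_response, client_session
```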
Here's an example of an LM-activated skill that changes the assistant's voice:
```python
# LM ACTIVATED SKILL: SKILL TITLE: Change Voice DESCRIPTION: This skill changes the text-to-speech voice for the assistant's responses. USAGE INSTRUCTIONS: To change the voice, use the following format: <change_voice>voice_name</change_voice>. Replace 'voice_name' with one of the available voices: Stella, Stefanie, Florian, or Thorsten. For example, to change the voice to Stefanie, you would use: <change_voice>Stefanie</change_voice>. The assistant will confirm the voice change or provide an error message if an invalid voice is specified.
def server_side_execution_change_voice(user_input, client_session, params):
    voice_name = params.strip('()')
    valid_voices = ['Stella', 'Stefanie', 'Florian', 'Thorsten']
    if voice_name not in valid_voices:
        return f"Invalid voice. Please choose from: {', '.join(valid_voices)}.", client_session
    if 'TTS_Config' not in client_session:
        client_session['TTS_Config'] = {}
    client_session['TTS_Config']['voice'] = voice_name
    print(f"Voice changed to {voice_name}")
    return f"Voice successfully changed to {voice_name}.", client_session
```
This skill demonstrates how LM-activated skills work:
- The skill description provides instructions for the language model on how to use the skill.
- The skill expects the LM to generate a parameter enclosed in specific tags (e.g., `<change_voice>Stefanie</change_voice>`).
- The `params` argument in the function receives the content within these tags.
- The skill processes this input and updates the client session accordingly.
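To make the tag mechanism concrete, the parameter can be pulled out of the LM's output with a simple regular expression. The snippet below is an illustration of the idea only, not the actual dispatch code in the BUD-E server:

```python
import re

def extract_skill_parameter(lm_output, tag):
    """Return the text between <tag>...</tag> in the LM output, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", lm_output, re.DOTALL)
    return match.group(1).strip() if match else None

# Example: the LM decided to invoke the Change Voice skill in its reply.
lm_output = "Sure, switching now. <change_voice>Stefanie</change_voice>"
print(extract_skill_parameter(lm_output, "change_voice"))  # -> "Stefanie"
```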
BUD-E supports integration with various AI model providers:
- ASR: Local Whisper models or cloud services (e.g., Deepgram)
- LLM: Commercial APIs (e.g., Groq, OpenAI) or self-hosted models (e.g., VLLM, Ollama)
- TTS: Cloud services or local solutions (e.g., FishTTS, StyleTTS 2)
- Vision: Custom models or cloud APIs
Refer to the configuration files for integration examples.
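As one concrete example of such an integration, a local Whisper backend using the `openai-whisper` package can be as simple as the sketch below. This is illustrative only; `bud_e_transcribe.py` defines its own interface around whichever ASR backend you configure.

```python
# Sketch of a local Whisper transcription backend (assumes `pip install openai-whisper`
# and ffmpeg available on PATH); illustrative, not the actual bud_e_transcribe.py.
import whisper

model = whisper.load_model("base")  # larger models trade speed for accuracy

def transcribe(audio_path):
    """Transcribe an audio file and return the recognized text."""
    result = model.transcribe(audio_path)
    return result["text"].strip()

if __name__ == "__main__":
    print(transcribe("recording.wav"))
```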
Common issues and potential solutions:
- Dependency installation failures: Try using `conda` for problematic packages.
- API connection errors: Verify API keys, endpoint URLs, and network connectivity (a quick reachability check is sketched below).
- Wake word detection issues: Ensure correct Porcupine API key configuration.
- Performance issues: For local setups, ensure adequate GPU capabilities or optimize model sizes.
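For connection errors, a quick reachability check from the machine running the client or server can help separate network problems from credential problems. The URL below is a placeholder; substitute whichever endpoint you configured:

```python
# Generic endpoint reachability check (placeholder URL; not part of BUD-E itself).
import requests

endpoint = "http://192.168.1.50:8000"  # replace with your configured server or API endpoint
try:
    response = requests.get(endpoint, timeout=5)
    print("Reachable, HTTP status:", response.status_code)
except requests.RequestException as exc:
    print("Not reachable:", exc)
```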
The best place to get help and discuss BUD-E is our Discord community: https://discord.gg/pCPJJXP7Qx
License: Apache 2.0
BUD-E builds on the following projects and services:
- Porcupine for wake word detection
- Whisper for speech recognition
- FishTTS and StyleTTS 2 for text-to-speech capabilities
- Groq, Hyperlab, and other API providers for AI model access