An OpenAI API compatible text to speech server.
- Compatible with the OpenAI audio/speech API
- Serves the /v1/audio/speech endpoint
- Not affiliated with OpenAI in any way, does not require an OpenAI API Key
- A free, private, text-to-speech server with custom voice cloning
Full Compatibility:
- tts-1: alloy, echo, fable, onyx, nova, and shimmer (configurable)
- tts-1-hd: alloy, echo, fable, onyx, nova, and shimmer (configurable, uses OpenAI samples by default)
- response_format: mp3, opus, aac, flac, wav and pcm
- speed 0.25-4.0 (and more)
Details:
- Model tts-1 via piper tts (very fast, runs on cpu)
  - You can map your own piper voices via the voice_to_speaker.yaml configuration file
- Model tts-1-hd via coqui-ai/TTS xtts_v2 voice cloning (fast, but requires around 4GB GPU VRAM)
  - Custom cloned voices can be used for tts-1-hd, see: Custom Voices Howto
  - 🌐 Multilingual support with XTTS voices, the language is automatically detected if not set
  - Custom fine-tuned XTTS model support
  - Configurable generation parameters
  - Streamed output while generating
- Occasionally, certain words or symbols may sound incorrect; you can fix them with regex via pre_process_map.yaml
- Tested with python 3.9-3.11; piper does not install on python 3.12 yet
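The regex fix-ups mentioned above can be pictured as an ordered list of pattern/replacement pairs applied to the input text before synthesis. The following is a minimal sketch of that idea, not the server's actual code; the function name `apply_pre_process_map` and the sample rules are assumptions made for illustration.

```python
import re

# Hypothetical rules in the same [pattern, replacement] shape as the
# entries in pre_process_map.yaml; these examples are illustrative only.
EXAMPLE_RULES = [
    (r"&", " and "),
    (r"\bDr\.", "Doctor"),
    (r"\s+", " "),
]


def apply_pre_process_map(text, rules=EXAMPLE_RULES):
    """Apply each (pattern, replacement) pair in order with re.sub."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()
```

Because the rules run in order, a later rule (here the whitespace cleanup) sees the output of the earlier ones.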
If you find a better voice match for tts-1 or tts-1-hd, please let me know so I can update the defaults.
Version 0.18.2, 2024-08-16
- Fix docker building for amd64, refactor github actions again, free up more disk space
Version 0.18.1, 2024-08-15
- refactor github actions
Version 0.18.0, 2024-08-15
- Allow folders of wav samples in xtts. Samples will be combined, allowing for mixed voices and collections of small samples. Still limited to 30 seconds total. Thanks @nathanhere.
- Fix missing yaml requirement in -min image
- fix fr_FR-tom-medium and other 44khz piper voices (detect non-default sample rates)
- minor updates
Version 0.17.2, 2024-07-01
- fix -min image (re: langdetect)
Version 0.17.1, 2024-07-01
- fix ROCm (add langdetect to requirements-rocm.txt)
- Fix zh-cn for xtts
Version 0.17.0, 2024-07-01
- Automatic language detection, thanks @RodolfoCastanheira
Version 0.16.0, 2024-06-29
- Multi-client safe version. Audio generation is synchronized in a single process. The estimated 'realtime' factor of XTTS on a GPU is roughly 1/3, which means that multiple simultaneous streams, or speed over 2, may experience audio underrun (delays or pauses in playback). This makes multiple clients possible and safe, but in practice 2 or 3 simultaneous streams is the maximum without audio underrun.
Version 0.15.1, 2024-06-27
- Remove deepspeed from requirements.txt, it's too complex for typical users. A more detailed deepspeed install document will be required.
Version 0.15.0, 2024-06-26
- Switch to coqui-tts (updated fork), updated simpler dependencies, torch 2.3, etc.
- Resolve cuda threading issues
Version 0.14.1, 2024-06-26
- Make deepspeed possible (--use-deepspeed), but not enabled in pre-built docker images (too large). Requires the cuda-toolkit installed, see the Dockerfile comment for details
Version 0.14.0, 2024-06-26
- Added response_format: wav and pcm support
- Output streaming (while generating) for tts-1 and tts-1-hd
- Enhanced generation parameters for xtts models (temperature, top_p, etc.)
- Idle unload timer (optional) - doesn't work perfectly yet
- Improved error handling
Version 0.13.0, 2024-06-25
- Added Custom fine-tuned XTTS model support
- Initial prebuilt arm64 image support (Apple M-series, Raspberry Pi - MPS is not supported in XTTS/torch), thanks @JakeStevenson, @hchasens
- Initial attempt at AMD GPU (ROCm 5.7) support
- Parler-tts support removed
- Move the *.default.yaml to the root folder
- Run the docker as a service by default (restart: unless-stopped)
- Added audio_reader.py for streaming text input and reading long texts
Version 0.12.3, 2024-06-17
- Additional logging details for BadRequests (400)
Version 0.12.2, 2024-06-16
- Fix :min image requirements (numpy<2?)
Version 0.12.0, 2024-06-16
- Improved error handling and logging
- Restore the original alloy tts-1-hd voice by default, use alloy-alt for the old voice.
Version 0.11.0, 2024-05-29
- 🌐 Multilingual support (16 languages) with XTTS
- Remove high Unicode filtering from the default config/pre_process_map.yaml
- Update Docker build & app startup. thanks @justinh-rahb
- Fix: "Plan failed with a cudnnException"
- Remove piper cuda support
Version: 0.10.1, 2024-05-05
- Remove runtime: nvidia from docker-compose.yml, this assumes nvidia/cuda compatible runtime is available by default. thanks @jmtatsch
Version: 0.10.0, 2024-04-27
- Pre-built & tested docker images, smaller docker images (8GB or 860MB)
- Better upgrades: reorganize config files under config/, voice models under voices/
- Compatibility! If you customized your voice_to_speaker.yaml or pre_process_map.yaml you need to move them to the config/ folder.
- default listen host to 0.0.0.0
Version: 0.9.0, 2024-04-23
- Fix bug with yaml and loading UTF-8
- New sample text-to-speech application say.py
- Smaller docker base image
- Add beta parler-tts support (you can describe very basic features of the speaker voice). See https://www.text-description-to-speech.com/ for some examples of how to describe voices. Voices can be defined in the voice_to_speaker.default.yaml. Two example parler-tts voices are included in the voice_to_speaker.default.yaml file. parler-tts is experimental software and is kind of slow. The exact voice will be slightly different each generation but should be similar to the basic description.
...
Version: 0.7.3, 2024-03-20
- Allow different xtts versions per voice in voice_to_speaker.yaml, ex. xtts_v2.0.2
- Quality: Fix xtts sample rate (24000 vs. 22050 for piper) and pops
Copy the sample.env to speech.env (customize if needed)
cp sample.env speech.env
TTS_HOME=voices
HF_HOME=voices
#PRELOAD_MODEL=xtts
#PRELOAD_MODEL=xtts_v2.0.2
#EXTRA_ARGS=--log-level DEBUG --unload-timer 300
#USE_ROCM=1
# install curl and ffmpeg
sudo apt install curl ffmpeg
# Create & activate a new virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate
# Install the Python requirements
# - use requirements-rocm.txt for AMD GPU (ROCm support)
# - use requirements-min.txt for piper only (CPU only)
pip install -U -r requirements.txt
# run the server
bash startup.sh
On first run, the voice models will be downloaded automatically. This might take a while depending on your network connection.
docker compose up
docker compose -f docker-compose.rocm.yml up
XTTS only has CPU support here and will be very slow. You can use the Nvidia image for XTTS with CPU (slow), or use the piper only image (recommended).
For a minimal docker image with only piper support (<1GB vs. 8GB):
docker compose -f docker-compose.min.yml up
usage: speech.py [-h] [--xtts_device XTTS_DEVICE] [--preload PRELOAD] [--unload-timer UNLOAD_TIMER] [--use-deepspeed] [--no-cache-speaker] [-P PORT] [-H HOST]
[-L {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
OpenedAI Speech API Server
options:
-h, --help show this help message and exit
--xtts_device XTTS_DEVICE
Set the device for the xtts model. The special value of 'none' will use piper for all models. (default: cuda)
--preload PRELOAD Preload a model (Ex. 'xtts' or 'xtts_v2.0.2'). By default it's loaded on first use. (default: None)
--unload-timer UNLOAD_TIMER
Idle unload timer for the XTTS model in seconds, Ex. 900 for 15 minutes (default: None)
--use-deepspeed Use deepspeed with xtts (this option is unsupported) (default: False)
--no-cache-speaker Don't use the speaker wav embeddings cache (default: False)
-P PORT, --port PORT Server tcp port (default: 8000)
-H HOST, --host HOST Host to listen on, Ex. 0.0.0.0 (default: 0.0.0.0)
-L {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set the log level (default: INFO)
You can use it like this:
curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
"model": "tts-1",
"input": "The quick brown fox jumped over the lazy dog.",
"voice": "alloy",
"response_format": "mp3",
"speed": 1.0
}' > speech.mp3
Or just like this:
curl -s http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
"input": "The quick brown fox jumped over the lazy dog."}' > speech.mp3
Or like this example from the OpenAI Text to speech guide:
import openai

client = openai.OpenAI(
    # This part is not needed if you set these environment variables before import openai
    # export OPENAI_API_KEY=sk-11111111111
    # export OPENAI_BASE_URL=http://localhost:8000/v1
    api_key="sk-111111111",
    base_url="http://localhost:8000/v1",
)

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Today is a wonderful day to build something people love!"
) as response:
    response.stream_to_file("speech.mp3")
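If you would rather not install the openai package, the same endpoint can be called with the Python standard library alone. The sketch below mirrors the curl examples above; the helper names `build_speech_request` and `save_speech` are made up for this example, and it assumes the server is listening on the default http://localhost:8000.

```python
import json
import urllib.request

# The response formats listed in this README.
ALLOWED_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_speech_request(text, model="tts-1", voice="alloy",
                         response_format="mp3", speed=1.0,
                         base_url="http://localhost:8000/v1"):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(text, path, **kwargs):
    """Send the request and write the returned audio bytes to path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

With a server running, `save_speech("Hello there.", "hello.flac", response_format="flac")` should write the generated audio to hello.flac.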
Also see the say.py sample application for an example of how to use the openai-python API.
# play the audio, requires 'pip install playsound'
python say.py -t "The quick brown fox jumped over the lazy dog." -p
# save to a file in flac format
python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac
You can also try the included audio_reader.py for listening to longer text and streamed input.
Example usage:
python audio_reader.py -s 2 < LICENSE # read the software license - fast
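For your own scripts that read long texts, one common approach is to send the text in sentence-sized chunks so playback can start before the whole text is generated. The helper below is only a sketch of that idea; it is not how audio_reader.py is implemented, and the name `split_into_chunks` and the 200 character default are assumptions.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as the input of a separate /v1/audio/speech request and played in order.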
- Select the piper voice and model from the piper samples
- Update the config/voice_to_speaker.yaml with a new section for the voice, for example:
...
tts-1:
  ryan:
    model: voices/en_US-ryan-high.onnx
    speaker: # default speaker
- New models will be downloaded as needed, or you can download them in advance with download_voices_tts-1.sh. For example:
bash download_voices_tts-1.sh en_US-ryan-high
Coqui XTTS v2 voice cloning can work with as little as 6 seconds of clear audio. To create a custom voice clone, you must prepare a WAV file sample of the voice.
- Mono (single channel) 22050 Hz WAV file
- 6-30 seconds long - longer isn't always better (I've had some good results with as little as 4 seconds)
- Low noise (no hiss or hum)
- No partial words, breathing, laughing, music or background sounds
- An even speaking pace with a variety of words is best, like in interviews or audiobooks.
- Audio longer than 30 seconds will be silently truncated.
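The measurable items in this checklist (channels, sample rate, length) can be verified before you copy a sample into place. Here is a small sketch using only the standard library `wave` module; the function name `check_voice_sample` is hypothetical, and it cannot judge noise or speaking pace.

```python
import wave


def check_voice_sample(path, min_seconds=6.0, max_seconds=30.0):
    """Return a list of problems found in a WAV voice sample (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / float(rate)
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != 22050:
        problems.append(f"expected 22050 Hz, found {rate} Hz")
    if seconds < min_seconds:
        problems.append(f"only {seconds:.1f}s long, {min_seconds:.0f}s or more is suggested")
    if seconds > max_seconds:
        problems.append(f"{seconds:.1f}s long, audio past {max_seconds:.0f}s is truncated")
    return problems
```

For example, `check_voice_sample("voices/me.wav")` returns an empty list when the file meets the measurable requirements.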
You can use FFmpeg to prepare your audio files, here are some examples:
# convert a multi-channel audio file to mono, set sample rate to 22050 hz, trim to 6 seconds, and output as WAV file.
ffmpeg -i input.mp3 -ac 1 -ar 22050 -t 6 -y me.wav
# use a simple noise filter to clean up audio, and select a start time for sampling.
ffmpeg -i input.wav -af "highpass=f=200, lowpass=f=3000" -ac 1 -ar 22050 -ss 00:13:26.2 -t 6 -y me.wav
# A more complex noise reduction setup, including volume adjustment
ffmpeg -i input.mkv -af "highpass=f=200, lowpass=f=3000, volume=5, afftdn=nf=25" -ac 1 -ar 22050 -ss 00:13:26.2 -t 6 -y me.wav
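If you prepare many samples, the FFmpeg commands above can be assembled from a script. This sketch builds the same argument lists as the first two examples; the function names are made up for illustration, and `make_sample` assumes ffmpeg is installed and on your PATH.

```python
import subprocess


def ffmpeg_sample_args(src, dst, start=None, duration=6, audio_filter=None):
    """Build an ffmpeg argument list for a mono 22050 Hz WAV voice sample."""
    args = ["ffmpeg", "-i", src]
    if audio_filter:
        args += ["-af", audio_filter]
    args += ["-ac", "1", "-ar", "22050"]
    if start:
        args += ["-ss", start]
    args += ["-t", str(duration), "-y", dst]
    return args


def make_sample(src, dst, **kwargs):
    """Run ffmpeg and raise CalledProcessError if it fails."""
    subprocess.run(ffmpeg_sample_args(src, dst, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.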
Once your WAV file is prepared, save it in the /voices/ directory and update the config/voice_to_speaker.yaml file with the new file name.
For example:
...
tts-1-hd:
  me:
    model: xtts
    speaker: voices/me.wav # this could be you
You can also use a sub folder for multiple audio samples to combine small samples or to mix different samples together.
For example:
...
tts-1-hd:
  mixed:
    model: xtts
    speaker: voices/mixed
Where the voices/mixed/ folder contains multiple wav files. The total audio length is still limited to 30 seconds.
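To see how close a sample folder is to the 30 second limit, you can add up the length of its wav files. This is a rough sketch with a hypothetical helper name, and it assumes the samples sit directly inside the folder.

```python
import wave
from pathlib import Path


def total_sample_seconds(folder):
    """Sum the duration of every .wav file directly inside folder."""
    total = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            total += wav.getnframes() / float(wav.getframerate())
    return total
```

For example, `total_sample_seconds("voices/mixed") > 30` indicates that some of the audio will not be used.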
Multilingual cloning support was added in version 0.11.0 and is available only with the XTTS v2 model. To use multilingual voices with piper simply download a language specific voice.
Coqui XTTSv2 has support for multiple languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Hungarian (hu), Korean (ko), Japanese (ja), and Hindi (hi). When not set, an attempt will be made to automatically detect the language, falling back to English (en).
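The selection order described above (configured language first, then detection, then English) can be written out as a short sketch. This illustrates the described behavior and is not the server's code; the function name `choose_language` and the choice to raise an error for an unknown configured code are assumptions.

```python
# Language codes listed above for Coqui XTTSv2.
XTTS_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja", "hi",
}


def choose_language(configured=None, detected=None):
    """Pick the language code: configured value, else detection, else 'en'."""
    if configured:
        # Assumption: an unknown configured code is treated as an error.
        if configured not in XTTS_LANGUAGES:
            raise ValueError(f"unsupported XTTS language: {configured}")
        return configured
    if detected in XTTS_LANGUAGES:
        return detected
    return "en"
```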
Unfortunately the OpenAI API does not support language, but you can create your own custom speaker voice and set the language for that.
- Create the WAV file for your speaker, as in Custom Voices Howto
- Add the voice to config/voice_to_speaker.yaml and include the correct Coqui language code for the speaker. For example:
xunjiang:
  model: xtts
  speaker: voices/xunjiang.wav
  language: zh-cn
- Don't remove high unicode characters in your config/pre_process_map.yaml! If you have these lines, you will need to remove them. For example:
Remove:
- - '[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U00002702-\U000027B0\U000024C2-\U0001F251]+'
  - ''
These lines were added to the config/pre_process_map.yaml config file by default before version 0.11.0.
- Your new multi-lingual speaker voice is ready to use!
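To see why the old high unicode rule has to go, note that its last range runs from U+24C2 to U+1F251, which covers the CJK Unified Ideographs block (U+4E00 to U+9FFF) along with other scripts. The sketch below applies the same character class in Python; the function name `strip_high_unicode` is made up for the demonstration.

```python
import re

# The character class from the old default pre_process_map.yaml entry.
HIGH_UNICODE = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF"
    "\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF"
    "\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF"
    "\U00002702-\U000027B0\U000024C2-\U0001F251]+"
)


def strip_high_unicode(text):
    """Apply the old rule: delete every character matched by the pattern."""
    return HIGH_UNICODE.sub("", text)


# Emoji are removed as intended, but so is the Chinese text.
print(repr(strip_high_unicode("Hello \U0001F600")))
print(repr(strip_high_unicode("你好, world")))
```

With the rule in place, Chinese input would reach the XTTS model with its characters already deleted.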
Adding a custom xtts model is simple. Here is an example of how to add a custom fine-tuned 'halo' XTTS model.
- Save the model folder under voices/ (all 4 files are required, including the vocab.json from the model)
openedai-speech$ ls voices/halo/
config.json vocab.json model.pth sample.wav
- Add the custom voice entry under the tts-1-hd section of config/voice_to_speaker.yaml:
tts-1-hd:
  ...
  halo:
    model: halo # This name is required to be unique
    speaker: voices/halo/sample.wav # voice sample is required
    model_path: voices/halo
- The model will be loaded when you access the voice for the first time (--preload doesn't work with custom models yet)
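Since all four files are required, a quick check of the model folder can save a failed first request. This is a sketch with a hypothetical helper name; the file names follow the 'halo' listing above, and the sample name is a parameter because your voice sample may be named differently.

```python
from pathlib import Path

# Model files shown in the 'halo' example listing above.
REQUIRED_MODEL_FILES = ("config.json", "vocab.json", "model.pth")


def missing_model_files(model_path, sample_name="sample.wav"):
    """Return the names of required files that are absent from model_path."""
    folder = Path(model_path)
    expected = REQUIRED_MODEL_FILES + (sample_name,)
    return [name for name in expected if not (folder / name).is_file()]
```

For example, `missing_model_files("voices/halo")` returns an empty list when the folder is complete.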
The generation of XTTSv2 voices can be fine tuned with the following options (defaults included below):
tts-1-hd:
  alloy:
    model: xtts
    speaker: voices/alloy.wav
    enable_text_splitting: True
    length_penalty: 1.0
    repetition_penalty: 10
    speed: 1.0
    temperature: 0.75
    top_k: 50
    top_p: 0.85
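One way to think about these options is as per-voice overrides laid over the defaults: any option you leave out keeps its default value. The sketch below shows that merge under this assumption; it is not taken from the server code, and the function name `generation_params` is hypothetical.

```python
# Default values as listed in the example above.
XTTS_DEFAULTS = {
    "enable_text_splitting": True,
    "length_penalty": 1.0,
    "repetition_penalty": 10,
    "speed": 1.0,
    "temperature": 0.75,
    "top_k": 50,
    "top_p": 0.85,
}


def generation_params(voice_config):
    """Overlay a voice's settings on the defaults, ignoring other keys."""
    params = dict(XTTS_DEFAULTS)
    params.update({k: v for k, v in voice_config.items() if k in XTTS_DEFAULTS})
    return params
```

For example, a voice entry that only sets `temperature: 0.6` would still use the default `top_p` of 0.85.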