Speech Synthesis (TTS)
After installing the required dependencies and configuring conf.yaml, enable the corresponding speech synthesis engine by setting the tts_model option in conf.yaml.
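For example, switching engines is a one-line change (a minimal sketch; the exact location of this key inside conf.yaml may vary between versions, so match it against your own config file):

tts_model: edge_tts # set this to the engine you want, e.g. edge_tts or fish_api_tts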
sherpa-onnx (Local & Recommended)
Available since version v0.5.0-alpha.1 (PR #50)
sherpa-onnx is a powerful inference engine that supports multiple TTS models. Support for it is built in, and it uses CPU inference by default.
Configuration Steps:
- Download the required model from sherpa-onnx TTS models
- Modify conf.yaml, referring to the configuration examples in config_alts (an illustrative sketch follows below)
For GPU inference (CUDA only), please refer to CUDA Inference.
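As an illustration only, a sherpa-onnx entry typically points the engine at the files of the downloaded model. The engine name and field names below are assumptions for a VITS-style model, not authoritative keys; copy the matching example from config_alts instead:

tts_model: sherpa_onnx_tts # assumed engine name; confirm against config_alts
sherpa_onnx_tts:
  vits_model: '/path/to/model/model.onnx' # assumed key: path to the downloaded .onnx model
  vits_lexicon: '/path/to/model/lexicon.txt' # assumed key: lexicon file, if the model ships one
  vits_tokens: '/path/to/model/tokens.txt' # assumed key: tokens file from the model archive
  provider: 'cpu' # assumed key: CPU inference by default; see CUDA Inference for GPU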
Edge TTS (Online, No API Key Required)
- Features:
- Fast response speed
- Requires an active network connection
- Configuration: Set tts_model: edge_tts in conf.yaml
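A minimal sketch (the edge_tts section and its voice field are assumptions; check the commented defaults in conf.yaml for the exact keys):

tts_model: edge_tts
edge_tts:
  voice: 'en-US-AvaMultilingualNeural' # assumed key: any voice name supported by Edge TTS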
Fish Audio TTS (Online, API Key Required)
Available since version v0.3.0-beta
- Install dependencies:
uv pip install fish-audio-sdk
- Configuration steps:
- Register an account on Fish Audio and obtain an API key
- Select the desired voice and copy its Reference ID
- In conf.yaml, set tts_model: fish_api_tts
- Fill in api_key and reference_id in the fish_api_tts section (see the example below)
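Putting the steps together, the relevant part of conf.yaml looks roughly like this (the api_key and reference_id keys come from the steps above; the values are placeholders):

tts_model: fish_api_tts
fish_api_tts:
  api_key: 'your-fish-audio-api-key' # API key from your Fish Audio account
  reference_id: 'your-voice-reference-id' # Reference ID of the voice you selected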
Azure TTS (Online, API Key Required)
This is the same TTS service used by neuro-sama.
- Obtain an API key for the text-to-speech service from Azure
- Fill in the relevant configuration in the azure_tts section of conf.yaml (a sketch follows below)
Since version v0.2.5, api_key.py has been deprecated. Please make sure to set the API key in conf.yaml.
The default voice configured in conf.yaml is the same one used by neuro-sama.
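A rough sketch of the azure_tts section (only api_key is named above; the other keys are assumptions, so treat the commented defaults already present in conf.yaml as the authoritative reference):

tts_model: azure_tts # assumed engine name
azure_tts:
  api_key: 'your-azure-speech-api-key' # key for the Azure text-to-speech service
  region: 'eastus' # assumed key: the Azure region of your Speech resource
  voice: '' # assumed key: leave the shipped default to keep the same voice as neuro-sama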
SiliconFlow TTS (Online, API Key Required)
An online text-to-speech service provided by SiliconFlow, supporting custom audio models and voice configuration.
Configuration Steps
- Upload Reference Audio:
  SiliconFlow currently offers models like FunAudioLLM/CosyVoice2-0.5B. To use them, upload reference audio via their official platform: https://docs.siliconflow.cn/cn/api-reference/audio/upload-voice
- Fill in conf.yaml:
  In the siliconflow_tts section of the configuration file, configure the parameters as follows (example):
siliconflow_tts:
  api_url: "https://api.siliconflow.cn/v1/audio/speech" # Service endpoint (fixed value)
  api_key: "sk-yourkey" # API key obtained from SiliconFlow's official website
  default_model: "FunAudioLLM/CosyVoice2-0.5B" # Audio model name (check official docs for supported models)
  default_voice: "speech:Dreamflowers:aaaaaaabvbbbasdas" # Voice ID (generated after uploading custom voice on the official site)
  sample_rate: 32000 # Output sample rate; adjust if audio is distorted (e.g., 16000, 44100)
  response_format: "mp3" # Audio format (e.g., mp3, wav)
  stream: true # Enable streaming mode
  speed: 1 # Speaking speed (range: 0.5–2.0; 1 = default)
  gain: 0 # Volume gain (range: -10–10; 0 = default)
MiniMax TTS (Online, API Key Required)
MiniMax provides an online TTS service where models like speech-02-turbo offer powerful TTS capabilities with customizable voice options.
Configuration Steps
- Obtain group_id and api_key:
  You can register on the MiniMax official website to get your group_id and api_key (see the Official Documentation).
- Fill in the conf.yaml configuration:
  In the minimax_tts section of the configuration file, enter parameters in the following format (example):
minimax_tts:
  group_id: '' # Your minimax group_id
  api_key: '' # Your minimax api_key
  # Supported models: 'speech-02-hd', 'speech-02-turbo' (recommended: 'speech-02-turbo')
  model: 'speech-02-turbo' # minimax model name
  voice_id: 'female-shaonv' # minimax voice id, default is 'female-shaonv'
  # Custom pronunciation dictionary, default empty.
  # Example: '{"tone": ["测试/(ce4)(shi4)", "危险/dangerous"]}'
  pronunciation_dict: ''
The voice_id parameter can be set to different voice tones; see the voice ID query section in the official documentation for a complete list of supported voices. The pronunciation_dict supports custom pronunciation rules - for example, you can define a rule to pronounce "牛肉" as "neuro" using the format shown in the example above.
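For instance, following the documented format, such a rule would look like this (illustrative only; keep the rest of the minimax_tts section unchanged):

minimax_tts:
  # ... other fields as shown above ...
  pronunciation_dict: '{"tone": ["牛肉/neuro"]}'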
ElevenLabs TTS (Online, API Key Required)
Available since version v1.2.1
ElevenLabs provides high-quality, natural-sounding text-to-speech with support for multiple languages and voice cloning capabilities.
Features
- High-Quality Audio: Industry-leading speech synthesis quality
- Multi-language Support: Supports English, Chinese, Japanese, Korean, and many other languages
- Voice Cloning: Upload audio samples to clone voices
- Rich Voice Library: Multiple preset voices and community voices available
- Real-time Generation: Low-latency speech synthesis
Configuration Steps
- Register and Get API Key
  - Visit ElevenLabs to register an account
  - Get your API key from the ElevenLabs dashboard
- Choose a Voice
  - Browse available voices in the ElevenLabs dashboard
  - Copy the Voice ID of your preferred voice
  - You can also upload audio samples for voice cloning
- Configure conf.yaml
  In the elevenlabs_tts section of your configuration file, enter parameters as follows:
elevenlabs_tts:
  api_key: 'your_elevenlabs_api_key' # Required: Your ElevenLabs API key
  voice_id: 'JBFqnCBsd6RMkjVDRZzb' # Required: ElevenLabs Voice ID
  model_id: 'eleven_multilingual_v2' # Model ID (default: eleven_multilingual_v2)
  output_format: 'mp3_44100_128' # Output audio format (default: mp3_44100_128)
  stability: 0.5 # Voice stability (0.0 to 1.0, default: 0.5)
  similarity_boost: 0.5 # Voice similarity boost (0.0 to 1.0, default: 0.5)
  style: 0.0 # Voice style exaggeration (0.0 to 1.0, default: 0.0)
  use_speaker_boost: true # Enable speaker boost for better quality (default: true)
Parameter Descriptions
- api_key (required): Your ElevenLabs API key
- voice_id (required): Unique identifier for the voice, found in your ElevenLabs dashboard
- model_id: TTS model to use. Available options:
  - eleven_multilingual_v2 (default) - Supports multiple languages
  - eleven_monolingual_v1 - English only
  - eleven_turbo_v2 - Faster generation
- output_format: Audio output format. Common options:
  - mp3_44100_128 (default) - MP3, 44.1kHz, 128kbps
  - mp3_44100_192 - MP3, 44.1kHz, 192kbps
  - pcm_16000 - PCM, 16kHz
  - pcm_22050 - PCM, 22.05kHz
  - pcm_24000 - PCM, 24kHz
  - pcm_44100 - PCM, 44.1kHz
- stability: Controls voice consistency (0.0 = more variable, 1.0 = more consistent)
- similarity_boost: Enhances similarity to the original voice (0.0 to 1.0)
- style: Controls style exaggeration (0.0 = neutral, 1.0 = more expressive)
- use_speaker_boost: Enables speaker boost for improved audio quality
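As a concrete example of how these parameters interact, a configuration tuned for a more expressive, variable delivery might look like this (values are illustrative, within the documented 0.0 to 1.0 ranges):

elevenlabs_tts:
  api_key: 'your_elevenlabs_api_key'
  voice_id: 'JBFqnCBsd6RMkjVDRZzb'
  model_id: 'eleven_multilingual_v2'
  stability: 0.3 # lower value: more variable delivery
  similarity_boost: 0.7 # stay closer to the original voice
  style: 0.6 # more expressive than the neutral default
  use_speaker_boost: true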
Usage Tips
- Voice Selection: Try preset voices first, then consider voice cloning for custom voices
- Parameter Tuning: Adjust stability and similarity_boost for optimal results
- Cost Management: ElevenLabs charges based on usage; test before committing to heavy use
- Network Requirements: Stable internet connection required for service availability
ElevenLabs offers free trial credits, so you can test the quality before purchasing a paid plan.