Speech Synthesis (TTS)
After installing the required dependencies and configuring conf.yaml, enable the corresponding speech synthesis engine by setting the tts_model option in conf.yaml.
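For example, switching engines is a one-line change (a minimal sketch; the exact location of this key inside conf.yaml may vary between versions, so match it against your own config file):

tts_model: edge_tts # set this to the engine you want, e.g. edge_tts or fish_api_tts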
sherpa-onnx (Local & Recommended)
Available since version v0.5.0-alpha.1 (PR #50)
sherpa-onnx is a powerful inference engine that supports multiple TTS models. Support for it is built in, and it uses CPU inference by default.
Configuration Steps:
- Download the required model from sherpa-onnx TTS models
- Modify conf.yaml, referring to the configuration examples in config_alts (an illustrative sketch follows below)
For GPU inference (CUDA only), please refer to CUDA Inference.
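As an illustration only, a sherpa-onnx entry typically points the engine at the files of the downloaded model. The engine name and field names below are assumptions for a VITS-style model, not authoritative keys; copy the matching example from config_alts instead:

tts_model: sherpa_onnx_tts # assumed engine name; confirm against config_alts
sherpa_onnx_tts:
  vits_model: '/path/to/model/model.onnx' # assumed key: path to the downloaded .onnx model
  vits_lexicon: '/path/to/model/lexicon.txt' # assumed key: lexicon file, if the model ships one
  vits_tokens: '/path/to/model/tokens.txt' # assumed key: tokens file from the model archive
  provider: 'cpu' # assumed key: CPU inference by default; see CUDA Inference for GPU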
Edge TTS (Online, No API Key Required)
- Features:
- Fast response speed
- Requires an active network connection
- Configuration: Set tts_model: edge_tts in conf.yaml
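A minimal sketch (the edge_tts section and its voice field are assumptions; check the commented defaults in conf.yaml for the exact keys):

tts_model: edge_tts
edge_tts:
  voice: 'en-US-AvaMultilingualNeural' # assumed key: any voice name supported by Edge TTS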
Fish Audio TTS (Online, API Key Required)
Available since version v0.3.0-beta
- Install dependencies:
uv pip install fish-audio-sdk
- Configuration steps:
- Register an account on Fish Audio and obtain an API key
- Select the desired voice and copy its Reference ID
- In conf.yaml, set tts_model: fish_api_tts
- Fill in api_key and reference_id in the fish_api_tts section (see the example below)
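Putting the steps together, the relevant part of conf.yaml looks roughly like this (the api_key and reference_id keys come from the steps above; the values are placeholders):

tts_model: fish_api_tts
fish_api_tts:
  api_key: 'your-fish-audio-api-key' # API key from your Fish Audio account
  reference_id: 'your-voice-reference-id' # Reference ID of the voice you selected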
Azure TTS (Online, API Key Required)
This is the same TTS service used by neuro-sama.
- Obtain an API key for the text-to-speech service from Azure
- Fill in the relevant configuration in the azure_tts section of conf.yaml (a sketch follows below)
Since version v0.2.5, api_key.py has been deprecated. Please make sure to set the API key in conf.yaml.
The default voice configured in conf.yaml is the same one used by neuro-sama.
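A rough sketch of the azure_tts section (only api_key is named above; the other keys are assumptions, so treat the commented defaults already present in conf.yaml as the authoritative reference):

tts_model: azure_tts # assumed engine name
azure_tts:
  api_key: 'your-azure-speech-api-key' # key for the Azure text-to-speech service
  region: 'eastus' # assumed key: the Azure region of your Speech resource
  voice: '' # assumed key: leave the shipped default to keep the same voice as neuro-sama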
SiliconFlow TTS (Online, API Key Required)
An online text-to-speech service provided by SiliconFlow, supporting custom audio models and voice configuration.
Configuration Steps
- Upload Reference Audio:
  SiliconFlow currently offers models like FunAudioLLM/CosyVoice2-0.5B. To use them, upload reference audio via their official platform: https://docs.siliconflow.cn/cn/api-reference/audio/upload-voice
- Fill in conf.yaml:
  In the siliconflow_tts section of the configuration file, configure the parameters as follows (example):
siliconflow_tts:
  api_url: "https://api.siliconflow.cn/v1/audio/speech" # Service endpoint (fixed value)
  api_key: "sk-yourkey" # API key obtained from SiliconFlow's official website
  default_model: "FunAudioLLM/CosyVoice2-0.5B" # Audio model name (check official docs for supported models)
  default_voice: "speech:Dreamflowers:aaaaaaabvbbbasdas" # Voice ID (generated after uploading custom voice on the official site)
  sample_rate: 32000 # Output sample rate; adjust if audio is distorted (e.g., 16000, 44100)
  response_format: "mp3" # Audio format (e.g., mp3, wav)
  stream: true # Enable streaming mode
  speed: 1 # Speaking speed (range: 0.5–2.0; 1 = default)
  gain: 0 # Volume gain (range: -10–10; 0 = default)
MiniMax TTS (Online, API Key Required)
MiniMax provides an online TTS service where models like speech-02-turbo offer powerful TTS capabilities with customizable voice options.
Configuration Steps
- Obtain group_id and api_key:
  You can register on the MiniMax official website to get your group_id and api_key (see the Official Documentation).
- Fill in the conf.yaml configuration:
  In the minimax_tts section of the configuration file, enter parameters in the following format (example):
minimax_tts:
  group_id: '' # Your minimax group_id
  api_key: '' # Your minimax api_key
  # Supported models: 'speech-02-hd', 'speech-02-turbo' (recommended: 'speech-02-turbo')
  model: 'speech-02-turbo' # minimax model name
  voice_id: 'female-shaonv' # minimax voice id, default is 'female-shaonv'
  # Custom pronunciation dictionary, default empty.
  # Example: '{"tone": ["测试/(ce4)(shi4)", "危险/dangerous"]}'
  pronunciation_dict: ''
The voice_id parameter can be set to different voice tones; see the voice ID query section in the official documentation for a complete list of supported voices. The pronunciation_dict supports custom pronunciation rules - for example, you can define a rule to pronounce "牛肉" as "neuro" using the format shown in the example above.
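For instance, following the documented format, such a rule would look like this (illustrative only; keep the rest of the minimax_tts section unchanged):

minimax_tts:
  # ... other fields as shown above ...
  pronunciation_dict: '{"tone": ["牛肉/neuro"]}'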
ElevenLabs TTS (Online, API Key Required)
Available since version v1.2.1
ElevenLabs provides high-quality, natural-sounding text-to-speech with support for multiple languages and voice cloning capabilities.
Features
- High-Quality Audio: Industry-leading speech synthesis quality
- Multi-language Support: Supports English, Chinese, Japanese, Korean, and many other languages
- Voice Cloning: Upload audio samples to clone voices
- Rich Voice Library: Multiple preset voices and community voices available
- Real-time Generation: Low-latency speech synthesis
Configuration Steps
- Register and Get API Key
  - Visit ElevenLabs to register an account
  - Get your API key from the ElevenLabs dashboard
- Choose a Voice
  - Browse available voices in the ElevenLabs dashboard
  - Copy the Voice ID of your preferred voice
  - You can also upload audio samples for voice cloning
- Configure conf.yaml
  In the elevenlabs_tts section of your configuration file, enter parameters as follows:
elevenlabs_tts:
  api_key: 'your_elevenlabs_api_key' # Required: Your ElevenLabs API key
  voice_id: 'JBFqnCBsd6RMkjVDRZzb' # Required: ElevenLabs Voice ID
  model_id: 'eleven_multilingual_v2' # Model ID (default: eleven_multilingual_v2)
  output_format: 'mp3_44100_128' # Output audio format (default: mp3_44100_128)
  stability: 0.5 # Voice stability (0.0 to 1.0, default: 0.5)
  similarity_boost: 0.5 # Voice similarity boost (0.0 to 1.0, default: 0.5)
  style: 0.0 # Voice style exaggeration (0.0 to 1.0, default: 0.0)
  use_speaker_boost: true # Enable speaker boost for better quality (default: true)
Parameter Descriptions
- api_key (required): Your ElevenLabs API key
- voice_id (required): Unique identifier for the voice, found in your ElevenLabs dashboard
- model_id: TTS model to use. Available options:
  - eleven_multilingual_v2 (default) - Supports multiple languages
  - eleven_monolingual_v1 - English only
  - eleven_turbo_v2 - Faster generation
- output_format: Audio output format. Common options:
  - mp3_44100_128 (default) - MP3, 44.1kHz, 128kbps
  - mp3_44100_192 - MP3, 44.1kHz, 192kbps
  - pcm_16000 - PCM, 16kHz
  - pcm_22050 - PCM, 22.05kHz
  - pcm_24000 - PCM, 24kHz
  - pcm_44100 - PCM, 44.1kHz
- stability: Controls voice consistency (0.0 = more variable, 1.0 = more consistent)
- similarity_boost: Enhances similarity to the original voice (0.0 to 1.0)
- style: Controls style exaggeration (0.0 = neutral, 1.0 = more expressive)
- use_speaker_boost: Enables speaker boost for improved audio quality
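As a concrete example of how these parameters interact, a configuration tuned for a more expressive, variable delivery might look like this (values are illustrative, within the documented 0.0 to 1.0 ranges):

elevenlabs_tts:
  api_key: 'your_elevenlabs_api_key'
  voice_id: 'JBFqnCBsd6RMkjVDRZzb'
  model_id: 'eleven_multilingual_v2'
  stability: 0.3 # lower value: more variable delivery
  similarity_boost: 0.7 # stay closer to the original voice
  style: 0.6 # more expressive than the neutral default
  use_speaker_boost: true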
Usage Tips
- Voice Selection: Try preset voices first, then consider voice cloning for custom voices
- Parameter Tuning: Adjust stability and similarity_boost for optimal results
- Cost Management: ElevenLabs charges based on usage; test before committing to heavy use
- Network Requirements: Stable internet connection required for service availability
ElevenLabs offers free trial credits, so you can test the quality before purchasing a paid plan.