Speech Synthesis (TTS)

After installing the required dependencies, select a speech synthesis engine by setting the tts_model option in conf.yaml, then fill in that engine's section of the same file.

sherpa-onnx TTS (Local)

Available since version v0.5.0-alpha.1 (PR#50)

sherpa-onnx is an inference engine that supports multiple TTS models. Support for it is built in, and it runs inference on the CPU by default.

Configuration Steps:

  1. Download the required model from the sherpa-onnx TTS models page
  2. Edit conf.yaml, using the configuration examples in config_alts as a reference

Tip: For GPU inference (CUDA only), see CUDA Inference.

Edge TTS (Online, No API Key Required)

  • Features:
    • Fast response speed
    • Requires an active network connection
  • Configuration: Set tts_model: edge_tts in conf.yaml

Fish Audio TTS (Online, API Key Required)

Available since version v0.3.0-beta

  1. Install dependencies:
     uv pip install fish-audio-sdk
  2. Configuration steps:
    • Register an account on Fish Audio and obtain an API key
    • Select the desired voice and copy its Reference ID
    • In conf.yaml, set:
      • tts_model: fish_api_tts
      • Fill in api_key and reference_id in the fish_api_tts section
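
The two required values above can be checked before starting the server. The following is a minimal sketch, assuming conf.yaml has already been parsed into a Python dict with fish_api_tts as a top-level section; the function name and the flat layout are hypothetical and not part of the project.

```python
def missing_fish_fields(conf: dict) -> list[str]:
    """Return the names of fish_api_tts fields that are absent or blank."""
    section = conf.get("fish_api_tts") or {}
    required = ("api_key", "reference_id")
    return [name for name in required if not str(section.get(name) or "").strip()]


# Hypothetical parsed conf.yaml content; the flat layout is an assumption.
conf = {
    "tts_model": "fish_api_tts",
    "fish_api_tts": {"api_key": "your-api-key", "reference_id": ""},
}
print(missing_fish_fields(conf))  # ['reference_id']
```

An empty list means both values are present; it does not confirm that the key or the Reference ID is valid on the Fish Audio side.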

Azure TTS (Online, API Key Required)

This is the same TTS service that neuro-sama uses.

  1. Obtain an API key for the text-to-speech service from Azure
  2. Fill in the relevant configuration in the azure_tts section of conf.yaml

Warning: api_key.py has been deprecated since version v0.2.5. Set the API key in conf.yaml instead.

Tip: The default voice configured in conf.yaml is the same one neuro-sama uses.

SiliconFlow TTS (Online, API Key Required)

An online text-to-speech service provided by SiliconFlow, supporting custom audio models and voice configuration.

Configuration Steps

  1. Upload Reference Audio
    SiliconFlow currently offers models like FunAudioLLM/CosyVoice2-0.5B. To use them, upload reference audio via their official platform:
    https://docs.siliconflow.cn/cn/api-reference/audio/upload-voice

  2. Fill in conf.yaml
    In the siliconflow_tts section of the configuration file, configure parameters as follows (example):

siliconflow_tts:
  api_url: "https://api.siliconflow.cn/v1/audio/speech" # Service endpoint (fixed value)
  api_key: "sk-yourkey" # API key obtained from SiliconFlow's official website
  default_model: "FunAudioLLM/CosyVoice2-0.5B" # Audio model name (check official docs for supported models)
  default_voice: "speech:Dreamflowers:aaaaaaabvbbbasdas" # Voice ID (generated after uploading custom voice on the official site)
  sample_rate: 32000 # Output sample rate; adjust if audio is distorted (e.g., 16000, 44100)
  response_format: "mp3" # Audio format (e.g., mp3, wav)
  stream: true # Enable streaming mode
  speed: 1 # Speaking speed (range: 0.5–2.0; 1 = default)
  gain: 0 # Volume gain (range: -10–10; 0 = default)
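
To illustrate how these settings might map onto a request, here is a minimal sketch that checks the documented speed and gain ranges and assembles a JSON body. The function name is hypothetical, and the payload field names (model, voice, input, and so on) are assumptions that mirror the conf.yaml keys; consult the SiliconFlow API reference for the authoritative schema. The sketch only builds the payload and does not send it.

```python
import json


def build_speech_request(cfg: dict, text: str) -> dict:
    """Validate speed/gain against the documented ranges and build a request body."""
    speed = float(cfg.get("speed", 1))
    gain = float(cfg.get("gain", 0))
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"speed must be within 0.5-2.0, got {speed}")
    if not -10 <= gain <= 10:
        raise ValueError(f"gain must be within -10-10, got {gain}")
    return {
        "model": cfg["default_model"],   # assumed payload field name
        "voice": cfg["default_voice"],   # assumed payload field name
        "input": text,                   # assumed payload field name
        "response_format": cfg.get("response_format", "mp3"),
        "sample_rate": cfg.get("sample_rate", 32000),
        "stream": cfg.get("stream", True),
        "speed": speed,
        "gain": gain,
    }


# Values copied from the example siliconflow_tts section.
cfg = {
    "default_model": "FunAudioLLM/CosyVoice2-0.5B",
    "default_voice": "speech:Dreamflowers:aaaaaaabvbbbasdas",
    "sample_rate": 32000,
    "response_format": "mp3",
    "stream": True,
    "speed": 1,
    "gain": 0,
}
payload = build_speech_request(cfg, "Hello!")
print(json.dumps(payload, indent=2))
```

Rejecting out-of-range values locally gives a clearer error than waiting for the remote service to refuse the request.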

MiniMax TTS (Online, API Key Required)

MiniMax provides an online TTS service. Models such as speech-02-turbo support customizable voice options.

Configuration Steps

  1. Obtain group_id and api_key
    Register on the MiniMax official website to get your group_id and api_key (see the Official Documentation).

  2. Fill in conf.yaml
    In the minimax_tts section of the configuration file, enter parameters in the following format (example):

minimax_tts:
  group_id: '' # Your minimax group_id
  api_key: '' # Your minimax api_key
  # Supported models: 'speech-02-hd', 'speech-02-turbo' (recommended: 'speech-02-turbo')
  model: 'speech-02-turbo' # minimax model name
  voice_id: 'female-shaonv' # minimax voice id, default is 'female-shaonv'
  # Custom pronunciation dictionary, default empty.
  # Example: '{"tone": ["测试/(ce4)(shi4)", "危险/dangerous"]}'
  pronunciation_dict: ''

The voice_id parameter selects the voice. See the voice ID query section in the official documentation for the complete list of supported voices. The pronunciation_dict parameter accepts custom pronunciation rules: for example, you can define a rule that pronounces "牛肉" as "neuro", using the format shown in the example above.
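
Writing the pronunciation_dict string by hand makes quoting mistakes easy. A minimal sketch of generating it with Python's json module follows; the helper name is hypothetical, and the "word/reading" entry format is taken from the example comment in the configuration above.

```python
import json


def make_pronunciation_dict(rules: dict[str, str]) -> str:
    """Build the pronunciation_dict string from {word: reading} pairs."""
    if not rules:
        return ""  # matches the documented default of an empty string
    tone = [f"{word}/{reading}" for word, reading in rules.items()]
    # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes
    return json.dumps({"tone": tone}, ensure_ascii=False)


value = make_pronunciation_dict({"牛肉": "neuro", "测试": "(ce4)(shi4)"})
print(value)  # {"tone": ["牛肉/neuro", "测试/(ce4)(shi4)"]}
```

Paste the printed string into conf.yaml wrapped in single quotes, as in the example, so the inner double quotes need no escaping.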

ElevenLabs TTS (Online, API Key Required)

Available since version v1.2.1

ElevenLabs provides high-quality, natural-sounding text-to-speech with support for multiple languages and voice cloning capabilities.

Features

  • High-Quality Audio: Industry-leading speech synthesis quality
  • Multi-language Support: Supports English, Chinese, Japanese, Korean, and many other languages
  • Voice Cloning: Upload audio samples to clone voices
  • Rich Voice Library: Multiple preset voices and community voices available
  • Real-time Generation: Low-latency speech synthesis

Configuration Steps

  1. Register and Get API Key

    • Visit ElevenLabs to register an account
    • Get your API key from the ElevenLabs dashboard
  2. Choose a Voice

    • Browse available voices in the ElevenLabs dashboard
    • Copy the Voice ID of your preferred voice
    • You can also upload audio samples for voice cloning
  3. Configure conf.yaml

    In the elevenlabs_tts section of your configuration file, enter parameters as follows:

elevenlabs_tts:
  api_key: 'your_elevenlabs_api_key' # Required: Your ElevenLabs API key
  voice_id: 'JBFqnCBsd6RMkjVDRZzb' # Required: ElevenLabs Voice ID
  model_id: 'eleven_multilingual_v2' # Model ID (default: eleven_multilingual_v2)
  output_format: 'mp3_44100_128' # Output audio format (default: mp3_44100_128)
  stability: 0.5 # Voice stability (0.0 to 1.0, default: 0.5)
  similarity_boost: 0.5 # Voice similarity boost (0.0 to 1.0, default: 0.5)
  style: 0.0 # Voice style exaggeration (0.0 to 1.0, default: 0.0)
  use_speaker_boost: true # Enable speaker boost for better quality (default: true)

Parameter Descriptions

  • api_key (required): Your ElevenLabs API key
  • voice_id (required): Unique identifier for the voice, found in your ElevenLabs dashboard
  • model_id: TTS model to use. Available options:
    • eleven_multilingual_v2 (default) - Supports multiple languages
    • eleven_monolingual_v1 - English only
    • eleven_turbo_v2 - Faster generation
  • output_format: Audio output format. Common options:
    • mp3_44100_128 (default) - MP3, 44.1kHz, 128kbps
    • mp3_44100_192 - MP3, 44.1kHz, 192kbps
    • pcm_16000 - PCM, 16kHz
    • pcm_22050 - PCM, 22.05kHz
    • pcm_24000 - PCM, 24kHz
    • pcm_44100 - PCM, 44.1kHz
  • stability: Controls voice consistency (0.0 = more variable, 1.0 = more consistent)
  • similarity_boost: Enhances similarity to the original voice (0.0 to 1.0)
  • style: Controls style exaggeration (0.0 = neutral, 1.0 = more expressive)
  • use_speaker_boost: Enables speaker boost for improved audio quality
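
The parameter descriptions above can be turned into a quick sanity check. The sketch below uses two hypothetical helpers that are not part of the project or the ElevenLabs SDK: one splits an output_format value such as mp3_44100_128 into its parts, following the naming pattern of the options listed above, and the other reports which of the 0.0 to 1.0 voice settings fall outside that range.

```python
def parse_output_format(fmt: str) -> dict:
    """Split e.g. 'mp3_44100_128' into codec, sample rate (Hz), and bitrate (kbps)."""
    parts = fmt.split("_")
    if len(parts) not in (2, 3) or not all(p.isdigit() for p in parts[1:]):
        raise ValueError(f"unrecognized output_format: {fmt!r}")
    info = {"codec": parts[0], "sample_rate_hz": int(parts[1])}
    if len(parts) == 3:
        info["bitrate_kbps"] = int(parts[2])  # PCM formats carry no bitrate part
    return info


def out_of_range(settings: dict) -> list[str]:
    """Return the voice settings whose values lie outside 0.0 to 1.0."""
    names = ("stability", "similarity_boost", "style")
    return [n for n in names if n in settings and not 0.0 <= float(settings[n]) <= 1.0]


print(parse_output_format("mp3_44100_128"))
print(parse_output_format("pcm_16000"))
print(out_of_range({"stability": 1.5, "similarity_boost": 0.5, "style": 0.0}))
```

The parser only checks the shape of the string; whether a given format is actually available depends on the ElevenLabs service and plan.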

Usage Tips

  • Voice Selection: Try preset voices first, then consider voice cloning for custom voices
  • Parameter Tuning: Adjust stability and similarity_boost for optimal results
  • Cost Management: ElevenLabs charges based on usage, so test before heavy use
  • Network Requirements: A stable internet connection is required

Tip: ElevenLabs offers free trial credits, so you can test the quality before purchasing a paid plan.