Speech Recognition (ASR)

Speech Recognition (ASR, Automatic Speech Recognition) converts user speech to text. This project supports multiple speech recognition model implementations.

ASR-related configuration items are under asr_config in conf.yaml.
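
For orientation, here is a rough sketch of the overall shape of that section. Only asr_config, the per-model section names, and the provider / API key / region settings are mentioned on this page; the other key names are assumptions, and the comments in conf.yaml are authoritative:

```yaml
asr_config:
  # Selects which of the implementations described below is active
  # ('sherpa_onnx_asr' is the project default). Key name is illustrative.
  asr_model: 'sherpa_onnx_asr'

  # Each implementation has its own settings block, for example:
  sherpa_onnx_asr:
    provider: 'cpu'   # 'cuda' for GPU inference (see below)
  groq_whisper_asr:
    api_key: ''
  azure_asr:
    api_key: ''
    region: ''
```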

Here are the speech recognition options you can choose from:

sherpa_onnx_asr (Local & Project Default)

note

(Added in v0.5.0-alpha.1 PR: Add sherpa-onnx support #50)

sherpa-onnx is a feature-rich inference tool that can run various speech recognition (ASR) models.

info

Starting from version v1.0.0, this project uses sherpa-onnx to run the SenseVoiceSmall (int8 quantized) model as the default speech recognition solution. This is an out-of-the-box configuration - you don't need any additional setup. The system will automatically download and extract model files to the project's models directory on first run.

Recommended for:

  • All users (hence it's the default)
  • Especially Mac users (due to limited options)
  • Non-NVIDIA GPU users
  • Chinese users

Advantages:

  • Fast CPU inference

Disadvantages:

  • The SenseVoiceSmall model may have average English performance.

Configuration difficulty: No configuration needed, as it's the project default.


CUDA Inference

sherpa-onnx supports both CPU and CUDA inference. While the default SenseVoiceSmall model performs well on CPU, if you have an NVIDIA GPU, you can enable CUDA inference for better performance by following these steps:

  1. First uninstall the CPU version dependencies:

```bash
uv remove sherpa-onnx onnxruntime
```

Note that sherpa-onnx is installed via pre-built wheels in step 2 below, which means you need to install CUDA Toolkit 11.x + cuDNN 8.x for CUDA 11.x (and add %SystemDrive%\Program Files\NVIDIA\CUDNN\v8.x\bin to your PATH, where x is your cuDNN minor version number, e.g., write v8.9 for version v8.9.7) so that the wheels link against the correct CUDA environment.

If you don't want to use the official NVIDIA installer or set PATH manually, consider using pixi to manage a local conda environment. This approach doesn't require you to install the dependencies via uv:

```bash
pixi remove --pypi onnxruntime sherpa-onnx
pixi add --pypi onnxruntime-gpu==1.17.1 pip
pixi run python -m pip install sherpa-onnx==1.10.39+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html
```
  2. Install the CUDA versions of the sherpa-onnx and onnxruntime-gpu dependencies:

```bash
# sherpa-onnx provided pre-built wheels are compatible with onnxruntime-gpu==1.17.1
uv add onnxruntime-gpu==1.17.1 sherpa-onnx==1.10.39+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html
```
  3. Modify the configuration file: in conf.yaml, find the sherpa_onnx_asr section and set provider to cuda, as sketched below.
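
As a sketch, assuming the layout shown earlier (only the provider key itself is named on this page; the surrounding structure follows the comments in conf.yaml):

```yaml
asr_config:
  sherpa_onnx_asr:
    # Switch the ONNX Runtime execution provider from 'cpu' to 'cuda'.
    # This takes effect only after installing the CUDA builds above.
    provider: 'cuda'
```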

Using Other sherpa-onnx Models

If you want to try other speech recognition models:

  1. Download the required model from sherpa-onnx ASR models
  2. Place the model files in the project's models directory
  3. Modify the relevant configuration of sherpa_onnx_asr according to the instructions in conf.yaml
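
As an illustrative sketch only, this is roughly what pointing sherpa_onnx_asr at a manually downloaded model could look like. The model_type value, path keys, and file names below are hypothetical examples rather than the project's confirmed schema; follow the instructions in conf.yaml for the real key names:

```yaml
asr_config:
  sherpa_onnx_asr:
    # Hypothetical example: a paraformer model downloaded from the
    # sherpa-onnx ASR models page into the project's models directory.
    model_type: 'paraformer'
    paraformer: './models/sherpa-onnx-paraformer-zh-2023-09-14/model.int8.onnx'
    tokens: './models/sherpa-onnx-paraformer-zh-2023-09-14/tokens.txt'
    provider: 'cpu'
```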

groq_whisper_asr (Online, requires API key, but easy to register with generous free quota)

Groq's Whisper endpoint is very accurate (supports multiple languages) and fast, with a generous number of free uses per day. It's pre-installed. Get an API key from groq and add it to the groq_whisper_asr settings in conf.yaml. Users in mainland China and other unsupported regions need a proxy to use it (the Hong Kong region may not be supported).
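
A minimal sketch of those settings, assuming an api_key field like the other online options (the exact key names are assumptions; check the comments in conf.yaml):

```yaml
asr_config:
  asr_model: 'groq_whisper_asr'

  groq_whisper_asr:
    # Paste the key obtained from groq here (illustrative key name).
    api_key: 'your-groq-api-key'
```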

Recommended for:

  • Users who accept using online speech recognition
  • Multilingual users

Advantages:

  • No local computation, very fast speed (depends on your network speed)

Configuration difficulty: Simple.


azure_asr (Online, requires API key)

  • Azure Speech Recognition
  • Configure the API key and region under the azure_asr option (see the sketch below the warning)
warning

api_key.py has been deprecated after v0.2.5. Please set API keys in conf.yaml.
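
A minimal sketch of the azure_asr settings (api_key and region are the settings named above; the surrounding layout follows the comments in conf.yaml):

```yaml
asr_config:
  asr_model: 'azure_asr'

  azure_asr:
    # Credentials of your Azure Speech resource.
    api_key: 'your-azure-speech-key'
    region: 'eastus'   # example region
```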

Recommended for:

  • People who have Azure API keys (Azure accounts are not easy to register)
  • Multilingual users

Advantages:

  • No local computation, very fast speed (depends on your network speed)

Configuration difficulty: Simple.