Speech Recognition (ASR)
Speech Recognition (ASR, Automatic Speech Recognition) converts user speech to text. This project supports multiple speech recognition model implementations.
ASR-related configuration items are under `asr_config` in `conf.yaml`.
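The exact fields depend on which implementation you pick; the sketch below only illustrates the overall shape, and every field name in it is an assumption to verify against the comments in your own `conf.yaml`:

```yaml
# Illustrative shape only -- field names are assumptions, check conf.yaml
asr_config:
  asr_model: sherpa_onnx_asr   # which ASR implementation to use
  sherpa_onnx_asr:
    provider: cpu              # cpu or cuda (see CUDA Inference below)
```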
Here are the speech recognition options you can choose from:
sherpa_onnx_asr (Local & Project Default)
(Added in v0.5.0-alpha.1, PR: Add sherpa-onnx support #50)
sherpa-onnx is a feature-rich inference tool that can run various speech recognition (ASR) models.
Starting from version v1.0.0, this project uses sherpa-onnx to run the SenseVoiceSmall (int8 quantized) model as the default speech recognition solution. This is an out-of-the-box configuration - you don't need any additional setup. The system will automatically download and extract model files to the project's `models` directory on first run.
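If you want to sanity-check the downloaded model outside this project, here is a minimal standalone sketch using sherpa-onnx's Python API directly. The model directory name is an assumption based on the default SenseVoiceSmall download (adjust it to whatever was actually extracted under `models/`), and `soundfile` is used only to load a test WAV:

```python
# Minimal standalone SenseVoiceSmall check via sherpa-onnx's Python API.
# Paths below are assumptions -- point them at the files extracted under models/.
import sherpa_onnx
import soundfile as sf

model_dir = "models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17"  # assumed dir name
recognizer = sherpa_onnx.OfflineRecognizer.from_sense_voice(
    model=f"{model_dir}/model.int8.onnx",
    tokens=f"{model_dir}/tokens.txt",
    use_itn=True,  # inverse text normalization: numbers, punctuation, etc.
)

# Load a mono test clip; soundfile returns float32 samples in [-1, 1]
audio, sample_rate = sf.read("test.wav", dtype="float32")

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, audio)
recognizer.decode_stream(stream)
print(stream.result.text)
```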
Recommended Users
- All users (hence it's the default)
- Especially Mac users (due to limited options)
- Non-NVIDIA GPU users
- Chinese users
- Fast CPU inference
- Configuration difficulty: No configuration needed as it's the project default
The SenseVoiceSmall model may have average English performance.
CUDA Inference
sherpa-onnx supports both CPU and CUDA inference. While the default SenseVoiceSmall model performs well on CPU, if you have an NVIDIA GPU, you can enable CUDA inference for better performance by following these steps:
- First uninstall the CPU version dependencies:

```bash
uv remove sherpa-onnx onnxruntime
```

- Install the CUDA versions of the `sherpa-onnx` and `onnxruntime-gpu` dependencies:

```bash
# sherpa-onnx provided pre-built wheels are compatible with onnxruntime-gpu==1.17.1
uv add onnxruntime-gpu==1.17.1 sherpa-onnx==1.10.39+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html
```

Note that sherpa-onnx is installed via pre-built wheels in this example, which means you need to install CUDA Toolkit 11.x + cuDNN 8.x for CUDA 11.x so that it links against the correct CUDA environment (and add `%SystemDrive%\Program Files\NVIDIA\CUDNN\v8.x\bin` to your `PATH`, where `x` is your cuDNN minor version number; e.g., for version v8.9.7, write `v8.9` here).

If you don't want to use the official NVIDIA installer or set `PATH` manually, consider using `pixi` to manage a local conda environment. This approach doesn't require you to install the dependencies via uv:

```bash
pixi remove --pypi onnxruntime sherpa-onnx
pixi add --pypi onnxruntime-gpu==1.17.1 pip
pixi run python -m pip install sherpa-onnx==1.10.39+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html
```
- Modify the configuration file: in `conf.yaml`, find the `sherpa_onnx_asr` section and set `provider` to `cuda` (see the sketch below).
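A minimal sketch of the resulting block; only `provider` comes from the instructions above, so leave the rest of your existing settings untouched:

```yaml
sherpa_onnx_asr:
  # ... your existing model settings stay as they are ...
  provider: cuda   # was: cpu
```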
Using Other sherpa-onnx Models
If you want to try other speech recognition models:
- Download the required model from sherpa-onnx ASR models (see the download example below)
- Place the model files in the project's `models` directory
- Modify the relevant configuration of `sherpa_onnx_asr` according to the instructions in `conf.yaml`
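For example, fetching and unpacking a model by hand could look like the following; the archive name is one entry from the sherpa-onnx release page and is shown only for illustration, so substitute the model you actually want:

```bash
# Download a model archive from the sherpa-onnx ASR model releases and
# extract it into the project's models directory
curl -L -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2023-09-14.tar.bz2
tar xjf sherpa-onnx-paraformer-zh-2023-09-14.tar.bz2 -C models/
```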
groq_whisper_asr (Online, requires API key, but easy to register with generous free quota)
Groq's Whisper endpoint is very accurate (supports multiple languages) and fast, with a generous number of free requests per day. It's pre-installed. Get an API key from Groq and add it to the `groq_whisper_asr` settings in `conf.yaml`. Users in mainland China and other unsupported regions need a proxy to use it (the Hong Kong region may not be supported).
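A minimal sketch of the relevant block; apart from `api_key`, the field names and values here are assumptions, so follow the comments in your `conf.yaml`:

```yaml
groq_whisper_asr:
  api_key: "gsk_..."          # from https://console.groq.com
  model: "whisper-large-v3"   # assumed default, check conf.yaml
  lang: ""                    # assumed: empty for auto-detection
```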
Recommended Users
- Users who accept using online speech recognition
- Multilingual users
- No local computation; very fast (depends on your network speed)
- Configuration difficulty: Simple
azure_asr (Online, requires API key)
- Azure Speech Recognition
- Configure the API key and region under the `azure_asr` option
`api_key.py` has been deprecated after v0.2.5. Please set API keys in `conf.yaml`.
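A minimal sketch based on the two settings named above (the field spellings are assumptions; check the `azure_asr` comments in `conf.yaml`):

```yaml
azure_asr:
  api_key: "your-azure-speech-key"
  region: "eastus"   # the region your Azure Speech resource was created in
```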
Recommended Users
- People who have Azure API keys (Azure accounts are not easy to register)
- Multilingual users
- No local computation; very fast (depends on your network speed)
- Configuration difficulty: Simple