Finetuning Speech-to-Text models: a Blueprint by Mozilla.ai for building your own STT/ASR dataset & model
This Blueprint enables you to create your own Speech-to-Text dataset and model, optimized for your specific language and use case. Everything runs locally, even on your laptop, so your data stays private. You can finetune a model using your own data or leverage the Common Voice dataset, which covers a wide range of languages. To see the full list of supported languages, visit the CommonVoice website.
📘 To explore this project further and discover other Blueprints, visit the Blueprints Hub.
📖 For more detailed guidance on using this project, please visit our Docs here
Example input speech audio (see the demo clip in the repo) and text output:
| Ground Truth | openai/whisper-small | mozilla-ai/whisper-small-gl * |
| --- | --- | --- |
| O Comité Económico e Social Europeo deu luz verde esta terza feira ao uso de galego, euskera e catalán nas súas sesións plenarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. | O Comité Económico Social Europeo de Uluz Verde está terza feira a Ousse de Gallego e Uskera e Catalan a súas asesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. | O Comité Económico Social Europeo deu luz verde esta terza feira ao uso de galego e usquera e catalán nas súas sesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. |
* Finetuned on the Galician set of Common Voice 17.0
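To put a number on the quality gap visible in the table, you can compute the word error rate (WER). Below is a minimal sketch using the Hugging Face `evaluate` library; this is an illustration, not one of this repo's scripts, and the strings are excerpts from the table above.

```python
# pip install evaluate jiwer
import evaluate

wer_metric = evaluate.load("wer")

# Excerpts from the table above: ground truth vs. the base model's output
references = ["O Comité Económico e Social Europeo deu luz verde esta terza feira"]
predictions = ["O Comité Económico Social Europeo de Uluz Verde está terza feira"]

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2%}")  # fraction of words substituted, inserted or deleted
```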
| Finetune an STT model on Google Colab | Transcribe using a HuggingFace model | Explore all the functionality on GitHub Codespaces |
| --- | --- | --- |
The same instructions apply to the GitHub Codespaces option.
- Use a virtual environment and install dependencies: `pip install -e .`, as well as ffmpeg (e.g. for Ubuntu: `sudo apt install ffmpeg`, for Mac: `brew install ffmpeg`)
- Simply execute: `python demo/transcribe_app.py`
- Add the HF model id of your choice (for a programmatic alternative, see the sketch after this list)
- Record a sample of your voice and get the transcribed text back
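If you prefer to transcribe from a script instead of the app, here is a minimal sketch using the `transformers` pipeline. The model id is the finetuned example from above (any HF model id works), and the audio filename is hypothetical.

```python
# pip install transformers torch
from transformers import pipeline

# "mozilla-ai/whisper-small-gl" is the finetuned example shown earlier;
# swap in the HF model id of your choice.
asr = pipeline("automatic-speech-recognition", model="mozilla-ai/whisper-small-gl")

# "my_recording.wav" is a hypothetical local audio file (ffmpeg handles decoding)
result = asr("my_recording.wav")
print(result["text"])
```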
- Create your own, local dataset by running this command and following the instructions: `python src/speech_to_text_finetune/make_local_dataset_app.py`
- Configure `config.yaml` with the model, local data directory and hyperparameters of your choice. Note that if you select `push_to_hub: True`, you need to have an HF account and log in locally (a login sketch follows this list).
- Finetune a model by running: `python src/speech_to_text_finetune/finetune_whisper.py`
- Test the finetuned model in the transcription app: `python demo/transcribe_app.py`
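If you set `push_to_hub: True`, you need a logged-in Hugging Face session. Besides `huggingface-cli login` (covered in the next section), a minimal sketch of doing the same from Python with the `huggingface_hub` library; this is an alternative convenience, not a required step of this repo's flow.

```python
from huggingface_hub import login

# Prompts for an Access Token; use one with write permissions
# if you want the finetuning script to push your model to the Hub.
login()
```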
Note: A Hugging Face account is required!

- Go to the Common Voice dataset repo and request explicit access (it should be approved instantly).
- On Hugging Face, create an Access Token.
- In your terminal, run `huggingface-cli login` and follow the instructions to log in to your account.
- Configure `config.yaml` with the model, the HF repo id of the Common Voice dataset, and the hyperparameters of your choice (a sketch of loading the dataset directly appears after this list).
- Finetune a model by running: `python src/speech_to_text_finetune/finetune_whisper.py`
- Test the finetuned model in the transcription app: `python demo/transcribe_app.py`
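For reference, here is a minimal sketch of loading a Common Voice split with the `datasets` library. This illustrates the data source, not necessarily how `finetune_whisper.py` loads it internally; "gl" (Galician) matches the example earlier in this README.

```python
# pip install datasets
from datasets import load_dataset

# Requires accepting the dataset terms on the Hub and being logged in.
# Depending on your datasets version, trust_remote_code=True may also be needed.
common_voice = load_dataset("mozilla-foundation/common_voice_17_0", "gl", split="train")

sample = common_voice[0]
print(sample["sentence"])  # the transcription paired with sample["audio"]
```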
If you run into issues or bugs, check our Troubleshooting section before opening a new issue.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
Contributions are welcome! To get started, you can check out the CONTRIBUTING.md file.