mozilla-ai/speech-to-text-finetune

Finetuning Speech-to-Text models: a Blueprint by Mozilla.ai for building your own STT/ASR dataset & model

This blueprint enables you to create your own Speech-to-Text dataset and model, optimizing performance for your specific language and use case. Everything runs locally, even on your laptop, so your data stays private. You can finetune a model using your own data or leverage the Common Voice dataset, which supports a wide range of languages. To see the full list of supported languages, visit the Common Voice website.

speech-to-text-finetune Diagram

📘 To explore this project further and discover other Blueprints, visit the Blueprints Hub.

📖 For more detailed guidance on using this project, please visit our Docs here

Example result on Galician

Input Speech audio:

audio.online-video-cutter.com.mp4

Text output:

| Ground Truth | openai/whisper-small | mozilla-ai/whisper-small-gl * |
| --- | --- | --- |
| O Comité Económico e Social Europeo deu luz verde esta terza feira ao uso de galego, euskera e catalán nas súas sesións plenarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. | O Comité Económico Social Europeo de Uluz Verde está terza feira a Ousse de Gallego e Uskera e Catalan a súas asesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. | O Comité Económico Social Europeo deu luz verde esta terza feira ao uso de galego e usquera e catalán nas súas sesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. |

* Finetuned on the Galician set of Common Voice 17.0
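The difference between the base and finetuned transcriptions above is usually quantified with word error rate (WER): the word-level edit distance between the hypothesis and the reference, divided by the reference length. A minimal pure-Python sketch of the metric (real evaluations typically use a library such as `evaluate` or `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words (classic dynamic programming).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[-1][-1] / max(len(ref), 1)


# A perfect transcription scores 0.0; one wrong word out of three scores ~0.33.
print(wer("deu luz verde esta terza feira", "de uluz verde está terza feira"))
```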

Built with

Python Hugging Face Gradio Common Voice

Quick-start

| Finetune an STT model on Google Colab | Transcribe using a HuggingFace model | Explore all the functionality on GitHub Codespaces |
| --- | --- | --- |
| Try Finetuning on Colab | Try on Spaces | Try on Codespaces |

Try it locally

The same instructions apply for the GitHub Codespaces option.

Setup

  1. Use a virtual environment and install the package: pip install -e .
  2. Install ffmpeg, e.g. on Ubuntu: sudo apt install ffmpeg, on macOS: brew install ffmpeg

Evaluate existing STT models from the Hugging Face Hub

  1. Simply execute: python demo/transcribe_app.py
  2. Add the HF model id of your choice
  3. Record a sample of your voice and get the transcribed text back
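Under the hood, transcription apps like this typically reduce to a Hugging Face `automatic-speech-recognition` pipeline. A minimal sketch, not the app's actual code: the function name and defaults here are illustrative, and running it requires `transformers` and `ffmpeg` to be installed.

```python
def transcribe(audio_path: str, model_id: str = "openai/whisper-small") -> str:
    """Illustrative sketch: transcribe an audio file with a HF ASR pipeline."""
    # Imported inside the function so the sketch itself has no hard dependency
    # on transformers until it is actually called.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]
```

Pointing `model_id` at your finetuned checkpoint (local path or Hub id) is all that is needed to compare it against the base model on the same recording.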

Making your own STT model using Local Data

  1. Create your own, local dataset by running this command and following the instructions: python src/speech_to_text_finetune/make_local_dataset_app.py
  2. Configure config.yaml with the model, local data directory and hyperparameters of your choice. Note that if you select push_to_hub: True you need to have an HF account and log in locally.
  3. Finetune a model by running: python src/speech_to_text_finetune/finetune_whisper.py
  4. Test the finetuned model in the transcription app: python demo/transcribe_app.py
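As a rough illustration of step 2, a config.yaml could look like the sketch below. All field names here are illustrative assumptions, not the project's exact schema; the config.yaml shipped in the repository is the authoritative template.

```yaml
# Illustrative sketch only; consult the repository's config.yaml for the real schema.
model_id: openai/whisper-small   # base model to finetune
dataset_id: ./local_data         # local data directory (or a Common Voice repo id)
language: Galician
training_hp:
  push_to_hub: false             # true requires an HF account and a local login
  max_steps: 300
  learning_rate: 1.0e-5
```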

Making your own STT model using Common Voice

Note: A Hugging Face account is required!

  1. Go to the Common Voice dataset repo and request explicit access (approval should be instant).
  2. On Hugging Face, create an Access Token
  3. In your terminal, run the command huggingface-cli login and follow the instructions to log in to your account.
  4. Configure config.yaml with the model, the Hugging Face repo id of the Common Voice dataset, and hyperparameters of your choice.
  5. Finetune a model by running: python src/speech_to_text_finetune/finetune_whisper.py
  6. Test the finetuned model in the transcription app: python demo/transcribe_app.py
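The dataset side of the steps above can be sketched with the `datasets` library. This is an illustrative helper, not project code; it assumes `datasets` is installed, a prior `huggingface-cli login`, and accepted access terms on the Common Voice 17.0 repo on the Hub.

```python
def load_common_voice(language_code: str = "gl", split: str = "train"):
    """Illustrative sketch: load one language split of Common Voice 17.0.

    Requires `pip install datasets`, a prior `huggingface-cli login`,
    and accepted access terms on the dataset repo.
    """
    # Imported inside the function so the sketch itself has no hard dependency
    # on datasets until it is actually called.
    from datasets import load_dataset

    return load_dataset(
        "mozilla-foundation/common_voice_17_0", language_code, split=split
    )
```

The language code ("gl" for Galician here) selects the dataset configuration; any language listed on the Common Voice website can be used the same way.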

Troubleshooting

If you run into issues or bugs, check our Troubleshooting section before opening a new issue.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Contributing

Contributions are welcome! To get started, you can check out the CONTRIBUTING.md file.