mozilla-ai/speech-to-text-finetune

Finetuning Speech-to-Text models: a Blueprint by Mozilla.ai for building your own STT/ASR dataset & model

This blueprint enables you to create your own Speech-to-Text dataset and model, optimizing performance for your specific language and use case. Everything runs locally, even on your laptop, so your data stays private. You can finetune a model using your own data or leverage the Common Voice dataset, which supports a wide range of languages. To see the full list of supported languages, visit the Common Voice website.

speech-to-text-finetune Diagram

📘 To explore this project further and discover other Blueprints, visit the Blueprints Hub.

📖 For more detailed guidance on using this project, please visit our Docs here

Example result on Galician

Input Speech audio:

audio.online-video-cutter.com.mp4

Text output:

| Ground Truth | openai/whisper-small | mozilla-ai/whisper-small-gl * |
| --- | --- | --- |
| O Comité Económico e Social Europeo deu luz verde esta terza feira ao uso de galego, euskera e catalán nas súas sesións plenarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. | O Comité Económico Social Europeo de Uluz Verde está terza feira a Ousse de Gallego e Uskera e Catalan a súas asesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. | O Comité Económico Social Europeo deu luz verde esta terza feira ao uso de galego e usquera e catalán nas súas sesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. |

* Finetuned on the Galician set of Common Voice 17.0
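The difference between the base and finetuned transcriptions above is usually quantified with word error rate (WER): the word-level edit distance between the hypothesis and the reference, divided by the reference length. A minimal pure-Python sketch of the metric (real evaluations typically use a library such as `evaluate` or `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words (classic dynamic programming).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[-1][-1] / max(len(ref), 1)


# A perfect transcription scores 0.0; one wrong word out of three scores ~0.33.
print(wer("deu luz verde esta terza feira", "de uluz verde está terza feira"))
```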

Built with

Python Hugging Face Gradio Common Voice

Quick-start

| Finetune an STT model on Google Colab | Transcribe using a HuggingFace model | Explore all the functionality on GitHub Codespaces |
| --- | --- | --- |
| Try Finetuning on Colab | Try on Spaces | Try on Codespaces |

Try it locally

The same instructions apply for the GitHub Codespaces option.

Setup

  1. Use a virtual environment and install the package: pip install -e .
  2. Install ffmpeg, e.g. on Ubuntu: sudo apt install ffmpeg, on macOS: brew install ffmpeg

Evaluate existing STT models from the Hugging Face Hub

  1. Simply execute: python demo/transcribe_app.py
  2. Add the HF model id of your choice
  3. Record a sample of your voice and get the transcribed text back
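Under the hood, transcription apps like this typically reduce to a Hugging Face `automatic-speech-recognition` pipeline. A minimal sketch, not the app's actual code: the function name and defaults here are illustrative, and running it requires `transformers` and `ffmpeg` to be installed.

```python
def transcribe(audio_path: str, model_id: str = "openai/whisper-small") -> str:
    """Illustrative sketch: transcribe an audio file with a HF ASR pipeline."""
    # Imported inside the function so the sketch itself has no hard dependency
    # on transformers until it is actually called.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]
```

Pointing `model_id` at your finetuned checkpoint (local path or Hub id) is all that is needed to compare it against the base model on the same recording.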

Making your own STT model using Local Data

  1. Create your own, local dataset by running this command and following the instructions: python src/speech_to_text_finetune/make_local_dataset_app.py
  2. Configure config.yaml with the model, local data directory and hyperparameters of your choice. Note that if you select push_to_hub: True you need to have an HF account and log in locally.
  3. Finetune a model by running: python src/speech_to_text_finetune/finetune_whisper.py
  4. Test the finetuned model in the transcription app: python demo/transcribe_app.py
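As a rough illustration of step 2, a config.yaml could look like the sketch below. All field names here are illustrative assumptions, not the project's exact schema; the config.yaml shipped in the repository is the authoritative template.

```yaml
# Illustrative sketch only; consult the repository's config.yaml for the real schema.
model_id: openai/whisper-small   # base model to finetune
dataset_id: ./local_data         # local data directory (or a Common Voice repo id)
language: Galician
training_hp:
  push_to_hub: false             # true requires an HF account and a local login
  max_steps: 300
  learning_rate: 1.0e-5
```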

Making your own STT model using Common Voice

Note: A Hugging Face account is required!

  1. Go to the Common Voice dataset repo and request explicit access (approval should be instant).
  2. On Hugging Face, create an Access Token
  3. In your terminal, run the command huggingface-cli login and follow the instructions to log in to your account.
  4. Configure config.yaml with the model, the Hugging Face repo id of the Common Voice dataset, and hyperparameters of your choice.
  5. Finetune a model by running: python src/speech_to_text_finetune/finetune_whisper.py
  6. Test the finetuned model in the transcription app: python demo/transcribe_app.py
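The dataset side of the steps above can be sketched with the `datasets` library. This is an illustrative helper, not project code; it assumes `datasets` is installed, a prior `huggingface-cli login`, and accepted access terms on the Common Voice 17.0 repo on the Hub.

```python
def load_common_voice(language_code: str = "gl", split: str = "train"):
    """Illustrative sketch: load one language split of Common Voice 17.0.

    Requires `pip install datasets`, a prior `huggingface-cli login`,
    and accepted access terms on the dataset repo.
    """
    # Imported inside the function so the sketch itself has no hard dependency
    # on datasets until it is actually called.
    from datasets import load_dataset

    return load_dataset(
        "mozilla-foundation/common_voice_17_0", language_code, split=split
    )
```

The language code ("gl" for Galician here) selects the dataset configuration; any language listed on the Common Voice website can be used the same way.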

Troubleshooting

If you run into issues or bugs, check our Troubleshooting section before opening a new issue.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Contributing

Contributions are welcome! To get started, you can check out the CONTRIBUTING.md file.