This is a repository for our paper, 🐤 Nix-TTS (Accepted to IEEE SLT 2022). We released the pretrained models, an interactive demo, and audio samples below.
[[📄 Paper Link](Coming Soon!)] [🤗 Interactive Demo] [📢 Audio Samples]
Abstract: Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches a non-optimal size or use neural architecture search, which often incurs high training costs. We present Nix-TTS, a lightweight TTS model obtained via knowledge distillation from a high-quality yet large, non-autoregressive, end-to-end (vocoder-free) TTS teacher model. Specifically, we offer module-wise distillation, enabling flexible and independent distillation of the encoder and decoder modules. The resulting Nix-TTS inherits the advantageous properties of being non-autoregressive and end-to-end from the teacher, yet is significantly smaller, with only 5.23M parameters, a reduction of up to 89.34% relative to the teacher model. It also achieves inference speedups of over 3.04× and 8.36× on an Intel i7 CPU and a Raspberry Pi 3B, respectively, while retaining fair voice naturalness and intelligibility compared to the teacher model.
Clone the nix-tts repository and move to its directory
git clone https://github.com/rendchevi/nix-tts.git
cd nix-tts
Install the dependencies
- Install Python dependencies. We recommend Python >= 3.8.
pip install -r requirements.txt
- Install espeak on your device (for text tokenization).
sudo apt-get install espeak
If that doesn't work, follow the official espeak installation instructions.
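Before running the model, it can help to confirm that the espeak binary is actually visible on your PATH. The snippet below is a minimal sketch using only the Python standard library; the helper name `find_espeak` is hypothetical and not part of Nix-TTS, and it only checks that an executable exists, not that the tokenizer can use it.

```python
import shutil


def find_espeak(candidates=("espeak",)):
    """Return the path of the first candidate executable found on PATH, else None.

    `find_espeak` is an illustrative helper, not part of the Nix-TTS code base.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    location = find_espeak()
    if location is None:
        print("espeak not found on PATH; install it before tokenizing text.")
    else:
        print(f"espeak found at: {location}")
```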
Download your chosen pre-trained model here.
Model | Num. of Params | Faster than real-time* (CPU Intel-i7) | Faster than real-time* (RasPi Model 3B) |
---|---|---|---|
Nix-TTS (ONNX) | 5.23 M | 11.9x | 0.50x |
Nix-TTS w/ Stochastic Duration (ONNX) | 6.03 M | 10.8x | 0.50x |
* Here, "faster than real-time" is computed as the inverse of the Real Time Factor (RTF). The complete table of speedups for all models is given in the paper.
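To make the footnote concrete: RTF is the time spent synthesizing divided by the duration of the audio produced, so its inverse is the number of seconds of speech generated per second of compute. The sketch below is illustrative only; the function names and the example timings are made up and are not the measurements from the paper.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return synthesis_seconds / audio_seconds


def faster_than_real_time(synthesis_seconds, audio_seconds):
    """Inverse of RTF: seconds of audio produced per second of compute."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return audio_seconds / synthesis_seconds


# Hypothetical numbers: 88200 samples at 22050 Hz is 4 s of audio,
# and we pretend synthesis took 0.5 s of wall-clock time.
audio_seconds = 88200 / 22050
synthesis_seconds = 0.5

rtf = real_time_factor(synthesis_seconds, audio_seconds)
speedup = faster_than_real_time(synthesis_seconds, audio_seconds)
print(f"RTF = {rtf}, faster than real-time = {speedup}x")
```

A value below 1, such as the 0.50x reported for the Raspberry Pi above, means synthesis takes longer than the audio it produces.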
And running Nix-TTS is as easy as:
from nix.models.TTS import NixTTSInference
from IPython.display import Audio
# Initiate Nix-TTS
nix = NixTTSInference(model_dir = "<path_to_the_downloaded_model>")
# Tokenize input text
c, c_length, phoneme = nix.tokenize("Born to multiply, born to gaze into night skies.")
# Convert text to raw speech
xw = nix.vocalize(c, c_length)
# Listen to the generated speech
Audio(xw[0,0], rate = 22050)
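Outside a notebook there is no `Audio` widget, so you may want to write the waveform to disk instead. The following is a minimal sketch using only the standard library. It assumes `xw[0, 0]` is a one-dimensional sequence of float samples roughly in [-1, 1] at 22050 Hz (the rate used above), which is our reading of the example rather than something the repository documents. `save_wav` is a hypothetical helper, not part of Nix-TTS.

```python
import struct
import wave


def save_wav(samples, path, rate=22050):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file.

    Returns the number of frames written. Values outside [-1, 1] are clipped.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        # Scale to signed 16-bit little-endian PCM.
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2


# Assumed usage with the output of nix.vocalize(...) from the example above:
# save_wav(xw[0, 0], "nix_tts_output.wav", rate=22050)
```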
- This research is fully and exclusively funded by Kata.ai, where the authors work as part of the Kata.ai Research Team.
- Some of the complex parts of our model, as mentioned in the paper, are adapted from the original implementation of VITS and Comprehensive-Transformer-TTS.