We're hiring! If you like what we're building here, come join us at LMNT.
DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech via iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. a log-scaled Mel spectrogram). The model and architecture details are described in "DiffWave: A Versatile Diffusion Model for Audio Synthesis".
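To make "iterative refinement" concrete, here is a minimal sketch of a generic DDPM-style reverse sampling loop in plain Python. It is not this package's code: the noise schedule values are illustrative assumptions, and `placeholder_noise_model` is a hypothetical stand-in for the trained network, which would also take the conditioning spectrogram as input.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    """Evenly spaced noise levels (the values are illustrative assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def placeholder_noise_model(x, t):
    """Hypothetical stand-in for the trained network: predicts zero noise."""
    return [0.0] * len(x)


def sample(num_samples, betas, noise_model=placeholder_noise_model, seed=0):
    """Run the reverse diffusion process from pure noise down to step 0."""
    rng = random.Random(seed)
    alphas = [1.0 - b for b in betas]

    # Cumulative products of alpha, one entry per diffusion step.
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # Start from Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

    for t in reversed(range(len(betas))):
        eps = noise_model(x, t)
        c1 = 1.0 / math.sqrt(alphas[t])
        c2 = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Remove the predicted noise component for this step.
        x = [c1 * (xi - c2 * ei) for xi, ei in zip(x, eps)]
        if t > 0:
            # Re-inject a smaller amount of noise on all but the final step.
            sigma = math.sqrt(
                (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
            )
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


waveform = sample(num_samples=8, betas=linear_beta_schedule(50))
```

With a real model in place of the placeholder, each pass through the loop moves the signal a little further from noise and closer to speech.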
- unconditional waveform synthesis (thanks to Andrechang!)
- fast sampling algorithm based on v3 of the DiffWave paper
- new pretrained model trained for 1M steps
- updated audio samples with output from new model
- fast inference procedure
- stable training
- high-quality synthesis
- mixed-precision training
- multi-GPU training
- command-line inference
- programmatic inference API
- PyPI package
- audio samples
- pretrained models
- unconditional waveform synthesis
Big thanks to Zhifeng Kong (lead author of DiffWave) for pointers and bug fixes.
22.05 kHz pretrained model (31 MB, SHA256: d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8)
This pretrained model synthesizes speech with a real-time factor of 0.87 (smaller is faster).
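As a quick illustration of what that number means, the real-time factor is the synthesis time divided by the duration of the generated audio. The helper names below are hypothetical and the timings are made-up inputs, not measurements.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Ratio of compute time to audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds, rtf=0.87):
    """Rough compute time needed to generate audio_seconds of speech at a given RTF."""
    return audio_seconds * rtf


# Hypothetical timing: 8.7 s of compute for a 10 s clip.
rtf = real_time_factor(8.7, 10.0)

# At an RTF of 0.87, one minute of speech would take roughly 52 s to generate.
estimate = estimated_synthesis_seconds(60.0)
```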
- trained on 4x 1080Ti
- default parameters
- single precision floating point (FP32)
- trained on LJSpeech dataset excluding LJ001* and LJ002*
- trained for 1000578 steps (1273 epochs)
Install using pip:
pip install diffwave
or from GitHub:
git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .
Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. LJSpeech, VCTK). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit params.py.
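If you are unsure whether a dataset meets these requirements, a small check like the following can help before preprocessing. It is a sketch using only Python's standard `wave` module and is not part of this package; the function names are hypothetical, and the expected rate of 22050 Hz mirrors the default mentioned above.

```python
import os
import wave


def check_wav(path, expected_rate=22050):
    """Return a list of problems found in one .wav file (empty if it looks fine)."""
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            if wav.getnchannels() != 1:
                problems.append("expected mono, found %d channels" % wav.getnchannels())
            if wav.getsampwidth() != 2:
                problems.append("expected 16-bit, found %d-bit" % (8 * wav.getsampwidth()))
            if wav.getframerate() != expected_rate:
                problems.append(
                    "expected %d Hz, found %d Hz" % (expected_rate, wav.getframerate())
                )
    except wave.Error as err:
        # Raised for files the wave module cannot parse, e.g. non-PCM encodings.
        problems.append("unreadable: %s" % err)
    return problems


def check_dataset(root, expected_rate=22050):
    """Walk root recursively and map each problematic .wav path to its problems."""
    report = {}
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".wav"):
                path = os.path.join(dirpath, name)
                problems = check_wav(path, expected_rate)
                if problems:
                    report[path] = problems
    return report
```

Calling `check_dataset('/path/to/dir/containing/wavs')` returns an empty dictionary when every file is 16-bit mono at the expected sample rate.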
python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs
# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all
You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).
By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
Basic usage:
from diffwave.inference import predict as diffwave_predict
model_dir = '/path/to/model/dir'
spectrogram = ...  # placeholder: replace with a spectrogram tensor in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)
# audio is a GPU tensor in [N,T] format.
python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav
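When using the programmatic API rather than the command line, the returned audio still has to be written to disk. One way to do that with only the standard library is sketched below. It assumes the samples are floats in the range [-1, 1], which this document does not state, and `write_wav_16bit_mono` is a hypothetical helper, not part of this package.

```python
import struct
import wave


def write_wav_16bit_mono(path, samples, sample_rate):
    """Write a sequence of floats (assumed to lie in [-1, 1]) as 16-bit mono PCM."""
    pcm = []
    for s in samples:
        # Clip out-of-range values so the integer conversion cannot overflow.
        s = max(-1.0, min(1.0, float(s)))
        pcm.append(int(round(s * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        # WAV stores PCM samples as little-endian signed 16-bit integers.
        wav.writeframes(struct.pack("<%dh" % len(pcm), *pcm))


# With the tensor from the basic usage example, the first item of the batch
# could be saved like this (moved to the CPU and converted to a list first):
# write_wav_16bit_mono('output.wav', audio[0].cpu().tolist(), sample_rate)
```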