The goal of this repository is to share the toolkits, scripts and configurations used to extend the Blizzard 2013-EH2 by adding modern neural voices.
This repository is organized around the following structure:
.
├── bc_extension_cpu.yaml
├── bc_extension_gpu.yaml
├── README.org
├── references.bib
├── evaluation
│   ├── run.sh
│   └── scripts
├── helpers
│   ├── backup_models.sh
│   └── render.sh
├── prepare_data
│   ├── 01_extract_acoustic.sh
│   ├── 02_extract_training_labels.sh
│   ├── 03_extract_synthesis_labels.sh
│   ├── configurations
│   └── scripts
├── src
│   ├── ph_list
│   ├── test
│   └── train
└── toolkits
    ├── fastpitch
    ├── marytts
    ├── parallel_wavegan
    ├── tacotron
    └── wavenet
Here is the description of the key files/directories:
- bc_extension_gpu.yaml :: conda environment configuration assuming GPUs are accessible
- bc_extension_cpu.yaml :: conda environment configuration when no GPUs are available
- README.org :: this file
- references.bib :: the bibtex file containing all the references used in this repository
- prepare_data :: the directory providing the scripts and configurations needed to prepare the data (i.e. extract features, prompts, …) to run the training and the synthesis
- helpers :: the directory containing additional scripts used to back up the models, carry out the synthesis, …
- evaluation :: the directory containing the resources to analyze the subjective evaluation results and compute some additional objective evaluation metrics
- toolkits :: the directory containing the toolkits necessary to conduct the experiments (see below for how they are used)
- src :: the directory containing the data needed to run the experiments
Following the shutdown of Bintray, the configuration of MaryTTS has been updated.
- install sdkman (see https://sdkman.io/install/) and maven (google to find this for your operating system)
- Activate the environment (see the note after these steps)
sdk env activate
- Install JTok
(cd toolkits/marytts/jtok; mvn install)
- Install MaryTTS
(cd toolkits/marytts/marytts; ./gradlew publishToMavenLocal)
MaryTTS should now be ready to use!
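For reference, sdk env reads a .sdkmanrc file at the root of the repository. The snippet below is a purely hypothetical example pinning java 11; the exact distribution identifier is an assumption, so use the file shipped with the repository if one is provided:
# .sdkmanrc (hypothetical content, for illustration only)
java=11.0.21-tem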
This repository relies on java and gradle to extract the labels, as well as python:
- The code has been tested using java 11 (this is a strict requirement). You can install it using sdkman.
- gradle uses wrappers, so no dependencies have to be installed explicitly.
- For python, it is easier to create a conda environment (see the sketch after this list):
  - bc_extension_gpu.yaml :: defines the environment for use on GPU (recommended)
  - bc_extension_cpu.yaml :: defines the environment for use on CPU (for testing the synthesis)
- Additional python packages which need to be installed in the environment (so after activating it!):
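As for creating the environment itself, a minimal sketch is given below; the environment name is whatever is declared inside the YAML file, so bc_extension is only an assumption here:
# Create the environment from the GPU configuration (use bc_extension_cpu.yaml on a CPU-only machine)
conda env create -f bc_extension_gpu.yaml
# Activate it ("bc_extension" is an assumed name; use the name declared in the YAML file)
conda activate bc_extension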
Simply go to the directory prepare_data and run the following commands:
# To extract the mel spectrograms
bash ./01_extract_acoustic.sh
# To get the labels, the prompts, the F0 (FastPitch), the duration (FastPitch) and the attention guides (Tacotron)
bash ./02_extract_training_labels.sh
The results will be available in the directory output/training.
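A quick sanity check (still from prepare_data) is to make sure the per-toolkit subsets used by the training recipes below are present:
# The training recipes below expect the subdirectories fastpitch, tacotron, wg and wn
ls output/training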
Simply go to the directory prepare_data and run the following command:
bash ./03_extract_synthesis_labels.sh
The results will be available in the directory output/synthesis.
For all of this part, we assume that the conda environment is activated! We also assume that the data preparation has been run (if not, go to the previous section!).
For WaveNet, the training happens in the directory toolkits/wavenet/egs/bc_2013.
The first thing to do is to link the dataset to what has been extracted during the data preparation:
ln -s $PWD/../../../../prepare_data/output/training/wn $PWD/dump
Then you can start the training as follows:
bash run.sh
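Before starting the training, it can be useful to check that the dump link actually resolves to the extracted WaveNet features:
# Optional sanity check: dump should point at prepare_data/output/training/wn
readlink -f dump
ls dump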
For Parallel WaveGAN, the training happens in the directory toolkits/parallel_wavegan/egs/bc_2013/voc1.
The first thing to do is to link the dataset to what has been extracted during the data preparation:
ln -s $PWD/../../../../../prepare_data/output/training/wg $PWD/dump
Then you can start the training as follows:
bash run.sh
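The recipe follows the layout of the upstream Parallel WaveGAN egs. Assuming the upstream run.sh options are unchanged, a single stage can be (re-)run in isolation, for example:
# Assumption: the upstream --stage/--stop_stage options are still available;
# this would run only the network training stage
bash run.sh --stage 2 --stop_stage 2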
For FastPitch, the training happens in the directory toolkits/fastpitch.
The first thing to do is to link the dataset to what has been extracted during the data preparation:
mkdir bc_2013
ln -s $PWD/bc_2013/../../../prepare_data/output/training/fastpitch $PWD/bc_2013/dataset
Then you can start the training as follows:
NUM_GPUS=1 BS=16 PH_DICT=bc_2013/dataset/ph_list bash scripts/train.sh
Here is a description of the variables used (see also the example right after this list):
- NUM_GPUS :: the number of GPUs used for the training
- BS :: the batch size
- PH_DICT :: the path to the list of phonemes used in the corpus (if not defined, it will default to RADIO_ARPABET & ARCTIC)
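For instance, to train on more GPUs with a larger batch size (the values below are purely illustrative):
# Illustrative values only: 4 GPUs and a batch size of 32, with the same phoneme list as above
NUM_GPUS=4 BS=32 PH_DICT=bc_2013/dataset/ph_list bash scripts/train.sh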
For Tacotron, the training happens in the directory toolkits/tacotron.
The first thing to do is to link the dataset to what has been extracted during the data preparation:
mkdir bc_2013
ln -s $PWD/bc_2013/../../../prepare_data/output/training/tacotron $PWD/bc_2013/data
Then you can start the training as follows:
python train_pag.py -d bc_2013/data/ph_list
The last step is to back up the files so that they are compatible with the synthesis script. To do so, run the following command:
bash helpers/backup_models.sh models
With this command, the models will be backed up in the directory models. Change the argument if you want to use a different backup directory.
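For example, to back the models up somewhere else (the directory name below is arbitrary):
# Back the models up in my_backup instead of models
bash helpers/backup_models.sh my_backup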
The synthesis itself is then carried out with the render.sh helper:
EXPES="fp tac wg wn" bash helpers/render.sh
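Assuming render.sh treats EXPES as a space-separated list of experiments to render (presumably fp = FastPitch, tac = Tacotron, wg = Parallel WaveGAN, wn = WaveNet), a subset of the models can be rendered as well:
# Assumption: EXPES is a space-separated list of experiments; here only FastPitch and WaveNet are rendered
EXPES="fp wn" bash helpers/render.sh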
Simply go to the directory evaluation and run:
bash run.sh
The results will be available in the directory output.
The models obtained for the experiments are available at https://www.cstr.ed.ac.uk/projects/blizzard/ under the section "models" (to access these models, you need to obtain a license for the [[https://www.cstr.ed.ac.uk/projects/blizzard/2013/lessac_blizzard2013/][English audiobook data for the Blizzard Challenge 2013]] and then use the same credentials).
The samples and the subjective evaluation results are available at https://data.cstr.ed.ac.uk/blizzard/wavs_and_scores/2013-EH2-EXT.tar.gz
@article{LeMaguer2024,
  title = {The limits of the Mean Opinion Score for speech synthesis evaluation},
  author = {Sébastien {Le Maguer} and Simon King and Naomi Harte},
  year = 2024,
  journal = {Computer Speech \& Language},
  volume = 84,
  pages = 101577,
  doi = {10.1016/j.csl.2023.101577},
  issn = {0885-2308},
  url = {https://www.sciencedirect.com/science/article/pii/S0885230823000967},
}
The citation keys are given to avoid wasting too much space. Please refer to the bibtex file references.bib to access the full entry.
| Architecture | Reference | Implementation |
|---|---|---|
| Tacotron | [cite:@Wang2017] | /~https://github.com/cassiavb/Tacotron/commit/946408f8cd7b5fe9c53931c631267ba2a723910d |
| FastPitch | [cite:@Lancucki2021] | /~https://github.com/NVIDIA/DeepLearningExamples/commit/6a642837c471c596aab7edf204384f66e9483ab2 |
| WaveNet | [cite:@Oord2016] | /~https://github.com/r9y9/wavenet_vocoder/commit/a35fff76ea3687b05e1a10023cad3f7f64fa25a3 |
| Parallel WaveGAN | [cite:@Yamamoto2020] | /~https://github.com/kan-bayashi/ParallelWaveGAN/commit/6d4411b65f9487de5ec49dabf029dc107f23192d |
The citation keys are given to avoid wasting too much space. Please refer to the bibtex file references.bib to access the full entry.
| Software | Reference | Implementation |
|---|---|---|
| MaryTTS | [cite:@Steiner2018] | /~https://github.com/marytts/marytts |
| JTok | | /~https://github.com/DFKI-MLT/JTok |
| Pyworld/World | [cite:@Morise2016] | /~https://github.com/mmorise/World, /~https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder |
| FlexEval | [cite:@Fayet2020] | https://gitlab.inria.fr/expression/tools/FlexEval |