This project partially reflects best practices in speaker verification as of 2020.
For more details, see the VoxSRC Workshop 2020.
Training data are prepared following the Kaldi voxceleb recipe: the training set consists of VoxCeleb2_dev plus four augmented versions generated with the RIRS_NOISES and MUSAN datasets.
80-dimensional (or 40-dimensional) FBANKs are extracted.
Speaker augmentation is not applied.
Online data augmentation is not applied.
1D-conv model: 512-d TDNN model following the Kaldi voxceleb recipe.
2D-conv models: Res2Net model and DPN (Dual Path Networks) model.
Large margin finetuning (LMFT).
Mixed precision training and distributed training.
Cosine similarity scoring.
Adaptive symmetric normalization (asnorm) on cosine scores.
Quality measure function (QMF) or score calibration is not applied.
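The two scoring steps listed above, cosine similarity and adaptive symmetric normalization, can be illustrated with a minimal NumPy sketch. It is not the repository's implementation: the function names, the cohort array, and the `top_n=300` default are assumptions made for illustration, and the formula is the commonly used AS-norm definition (each side is normalized by the mean and standard deviation of its highest-scoring cohort comparisons, and the two results are averaged).

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def cosine_score(enroll, test):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(l2_normalize(enroll), l2_normalize(test)))


def asnorm_score(enroll, test, cohort, top_n=300):
    """Adaptive symmetric normalization of one cosine score.

    cohort: (num_cohort, dim) array of cohort embeddings (hypothetical input).
    top_n:  number of highest cohort scores kept per side (assumed value).
    """
    cohort = l2_normalize(cohort)
    raw = cosine_score(enroll, test)
    normalized = []
    for emb in (enroll, test):
        # Scores of this side against every cohort embedding.
        scores = cohort @ l2_normalize(emb)
        # Keep only the top_n closest cohort entries (the "adaptive" part).
        top = np.sort(scores)[-top_n:]
        normalized.append((raw - top.mean()) / top.std())
    # Average the enrollment-side and test-side normalizations.
    return 0.5 * (normalized[0] + normalized[1])


# Toy usage with random embeddings.
rng = np.random.default_rng(0)
enroll, test = rng.normal(size=(2, 256))
cohort = rng.normal(size=(1000, 256))
print(cosine_score(enroll, test), asnorm_score(enroll, test, cohort))
```

In practice the cohort would be drawn from training speakers; which embeddings are used and how many are kept are choices this sketch does not take from the repository.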
Data preparation takes a long time, so running the data preparation script on a server is recommended. On an Azure D48s_v5 virtual machine (48 vCPUs, 192 GB memory), the whole process takes about 10 hours.
For the smallest Res2Net model, res2net50_w8_s6_c16, training takes about 52 + 5 hours on a single NVIDIA GeForce RTX 4090 GPU.
For the largest Res2Net model, res2net200_w8_s6_c16, training takes about 76 + 8 hours on eight NVIDIA GeForce RTX 3090 GPUs.
Install TensorFlow 1.x and the Kaldi toolkit manually,
or set up a Docker container with the Dockerfile (recommended):
docker build -t xx205/tensorflow:23.03-tf1-py3-kaldi -f Dockerfile .
or pull the image directly (recommended):
docker pull xx205/tensorflow:23.03-tf1-py3-kaldi
Run the Docker container:
docker run --rm -it --gpus device=all -v `pwd`:`pwd` --ipc=host --ulimit memlock=-1 xx205/tensorflow:23.03-tf1-py3-kaldi
Clone this repository:
git clone /~
Run data preparation script (note, 80-dimensional FBANKs are extracted by default. FBANK dimension can be set in
cd voxsrc2020_speaker_verification bash
Go to working directory:
cd tensorflow
Train a Res2Net model with JTBD's VoxSRC2020 setup on VoxCeleb2_dev_aug training set:
bash models.res2net_model res2net50_w24_s4_c32
Export trained checkpoint to pb file for inference:
bash models.res2net_model res2net50_w24_s4_c32 exp/voxceleb2_dev_aug/res2net50_w24_s4_c32_sc_cm_linear_frames200_scale32_margin0.2_8GPUs 3 sc_cm_linear/kernel
Evaluate trained Res2Net model performance on VoxCeleb1 Test/Extended/Hard trials:
export saved_model=exp/voxceleb2_dev_aug/res2net50_w24_s4_c32_sc_cm_linear_frames200_scale32_margin0.2_8GPUs_122636.pb
bash ${saved_model} 3
for partition in T E H; do
    for approach in cosine snorm; do
        python3 \
            --trial ../data/voxceleb1_trials/list_test_${partition}.txt \
            --score ${saved_model%.pb}_embeddings/voxceleb1/${approach}_${partition}.txt
    done
done
Finetune the Res2Net model with LMFT on VoxCeleb2_dev training set:
bash models.res2net_model res2net50_w24_s4_c32
Export finetuned checkpoint to pb file for inference:
bash models.res2net_model res2net50_w24_s4_c32 exp/voxceleb2_dev/res2net50_w24_s4_c32_sc_cm_linear_frames600_scale32_margin0.4_8GPUs 3 sc_cm_linear/kernel
Evaluate finetuned Res2Net model performance on VoxCeleb1 Test/Extended/Hard trials:
export saved_model=exp/voxceleb2_dev/res2net50_w24_s4_c32_sc_cm_linear_frames600_scale32_margin0.4_8GPUs_127968.pb
bash ${saved_model} 3
for partition in T E H; do
    for approach in cosine snorm; do
        python3 \
            --trial ../data/voxceleb1_trials/list_test_${partition}.txt \
            --score ${saved_model%.pb}_embeddings/voxceleb1/${approach}_${partition}.txt
    done
done
model | loss function | subcenter | sample length | margin | # parameters | pooling function |
tdnn_voxsrc2020 | cm_linear_voxsrc2020 | ✗ | 320 | (0.2, 0.1) | 3.5 M | Stats Pool |
dpn68_voxsrc2020 | cm_linear_voxsrc2020 | ✗ | 320 | (0.2, 0.1) | 13.9 M | Stats Pool |
model (scoring) | Vox1_Test (EER/minDCF0.01) |
tdnn_voxsrc2020 (cosine) | 3.4398%/0.3339 |
dpn68_voxsrc2020 (cosine) | 2.0894%/0.2544 |
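The result tables report the equal error rate (EER, in percent) and the minimum detection cost (minDCF) at a target prior of 0.01. The sketch below shows one common way to compute both from a list of trial scores and labels. It is not the repository's evaluation script: the function name is hypothetical, the costs C_miss = C_fa = 1 are assumed (the usual VoxCeleb convention), and tied scores are not given special treatment.

```python
import numpy as np


def eer_and_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Return (EER, normalized minDCF); labels are 1 for target, 0 for non-target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    labels = labels[np.argsort(scores)]  # sort trials by ascending score
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # Entry k corresponds to rejecting the k lowest-scoring trials.
    # Misses are rejected targets; false alarms are accepted non-targets.
    p_miss = np.concatenate(([0.0], np.cumsum(labels) / n_target))
    p_fa = np.concatenate(([1.0], 1.0 - np.cumsum(1 - labels) / n_nontarget))
    # EER: operating point where the two error rates are closest.
    idx = np.argmin(np.abs(p_miss - p_fa))
    eer = 0.5 * (p_miss[idx] + p_fa[idx])
    # Detection cost, normalized by the cost of the best trivial system.
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    min_dcf = dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))
    return eer, min_dcf


# Toy usage: four trials, two target and two non-target.
eer, min_dcf = eer_and_min_dcf([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(f"EER = {100 * eer:.2f}%, minDCF = {min_dcf:.4f}")
```

The EER is returned as a fraction, so it is multiplied by 100 to match the percentages in the tables.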
model | loss function | subcenter | sample length | margin | # parameters | pooling function |
dpn68_voxsrc2020 | cm_linear_voxsrc2020 | ✗ | 320 | (0.2, 0.1) | 13.9 M | Stats Pool |
res2net50_w24_s4_c64_voxsrc2020 | cm_linear_voxsrc2020 | ✗ | 320 | (0.2, 0.1) | 26.9 M | Stats Pool |
model (scoring) | Vox1_Test (EER/minDCF0.01) | Extended (EER/minDCF0.01) | Hard (EER/minDCF0.01) |
dpn68_voxsrc2020 (cosine) | 0.9517%/0.0884 | 1.0217%/0.1125 | 1.8649%/0.1833 |
dpn68_voxsrc2020 (asnorm) [1] | 0.8347%/0.0879 | 0.9452%/0.0996 | 1.6401%/0.1527 |
res2net50_w24_s4_c64_voxsrc2020 (cosine) | 0.9942%/0.1163 | 0.9924%/0.1152 | 1.8387%/0.1857 |
res2net50_w24_s4_c64_voxsrc2020 (asnorm) [2] | 0.8400%/0.0931 | 0.9051%/0.0970 | 1.5976%/0.1471 |
model | loss function | subcenter | sample length | margin | # parameters | pooling function |
dpn68 | sc_cm_linear | ✓ | 200 | 0.2 | 13.9 M | Stats Pool |
dpn68+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 13.9 M | Stats Pool |
res2net50_w24_s4_c64 | sc_cm_linear | ✓ | 200 | 0.2 | 26.9 M | Stats Pool |
res2net50_w24_s4_c64+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 26.9 M | Stats Pool |
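The "subcenter" and "margin" columns describe the classification head used during training. As a rough illustration only, the sketch below computes logits for a sub-center classifier with an ArcFace-style additive angular margin. The exact form of the repository's `sc_cm_linear` loss is not given here and may differ (for example, it may combine several margin types); the function name and array layout are hypothetical. The scale of 32 and the margins 0.2 (training) and 0.4 (LMFT) are taken from the experiment directory names and the table above.

```python
import numpy as np


def subcenter_margin_logits(emb, weights, label, scale=32.0, margin=0.2):
    """Logits for one embedding under a sub-center, angular-margin classifier.

    emb:     (dim,) embedding.
    weights: (num_classes, num_subcenters, dim) class sub-centers.
    label:   index of the true speaker.
    """
    emb = emb / np.linalg.norm(emb)
    weights = weights / np.linalg.norm(weights, axis=-1, keepdims=True)
    # Cosine to every sub-center, then keep the closest sub-center per class.
    cos = (weights @ emb).max(axis=1)
    # Add the angular margin to the true class only.
    theta = np.arccos(np.clip(cos[label], -1.0, 1.0))
    cos_m = cos.copy()
    cos_m[label] = np.cos(theta + margin)
    return scale * cos_m


# Toy usage: 10 speakers, 3 sub-centers each, 64-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=64)
weights = rng.normal(size=(10, 3, 64))
print(subcenter_margin_logits(emb, weights, label=4, margin=0.2))
print(subcenter_margin_logits(emb, weights, label=4, margin=0.4))  # larger margin, as in LMFT
```

These logits would then go through a softmax cross-entropy loss. A larger margin lowers the true-class logit further, which is the extra pressure LMFT applies, together with longer training samples.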
model (scoring) | Vox1_Test (EER/minDCF0.01) | Extended (EER/minDCF0.01) | Hard (EER/minDCF0.01) |
dpn68 (cosine) | 1.0952%/0.1052 | 1.1725%/0.1322 | 2.0998%/0.2053 |
dpn68 (asnorm) | 0.9783%/0.1081 | 1.0707%/0.1184 | 1.8656%/0.1742 |
dpn68+LMFT (cosine) | 0.8666%/0.0824 | 0.9634%/0.1058 | 1.6910%/0.1641 |
dpn68+LMFT (asnorm) | 0.8081%/0.0735 | 0.8934%/0.0925 | 1.5076%/0.1414 |
res2net50_w24_s4_c64 (cosine) | 1.0580%/0.1089 | 1.0728%/0.1226 | 1.9778%/0.1911 |
res2net50_w24_s4_c64 (asnorm) | 0.9038%/0.0989 | 0.9824%/0.1086 | 1.7269%/0.1621 |
res2net50_w24_s4_c64+LMFT (cosine) | 0.9198%/0.1005 | 0.9258%/0.1006 | 1.6957%/0.1650 |
res2net50_w24_s4_c64+LMFT (asnorm) | 0.7922%/0.0804 | 0.8482%/0.0903 | 1.4854%/0.1353 |
model | loss function | subcenter | sample length | margin | # parameters | pooling function |
res2net50_w24_s4_c64 | sc_cm_linear | ✓ | 200 | 0.2 | 32.2 M | Stats Pool |
res2net50_w24_s4_c64+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 32.2 M | Stats Pool |
res2net50_w24_s4_c32 | sc_cm_linear | ✓ | 200 | 0.2 | 17.7 M | Stats Pool |
res2net50_w24_s4_c32+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 17.7 M | Stats Pool |
res2net50_w8_s6_c16 | sc_cm_linear | ✓ | 200 | 0.2 | 4.8 M | Stats Pool |
res2net50_w8_s6_c16+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 4.8 M | Stats Pool |
res2net101_w24_s4_c32_att | sc_cm_linear | ✓ | 200 | 0.2 | 29.3 M | Att Stats Pool |
res2net101_w24_s4_c32_att+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 29.3 M | Att Stats Pool |
res2net152_w24_s4_c32_att | sc_cm_linear | ✓ | 200 | 0.2 | 32.9 M | Att Stats Pool |
res2net152_w24_s4_c32_att+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 32.9 M | Att Stats Pool |
res2net200_w24_s4_c32_att | sc_cm_linear | ✓ | 200 | 0.2 | 35.5 M | Att Stats Pool |
res2net200_w24_s4_c32_att+LMFT | sc_cm_linear | ✓ | 600 | 0.4 | 35.5 M | Att Stats Pool |
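The "pooling function" column distinguishes plain statistics pooling from its attentive variant. Both turn a variable-length sequence of frame-level features into one fixed-size vector. The NumPy sketch below follows the standard definitions and is not the repository's TensorFlow code; the function names are hypothetical, and the attention scores are passed in directly, whereas a real model would produce them with a small attention network.

```python
import numpy as np


def stats_pool(frames, eps=1e-8):
    """frames: (num_frames, feat_dim) -> (2 * feat_dim,) mean and std over time."""
    mean = frames.mean(axis=0)
    std = np.sqrt(frames.var(axis=0) + eps)
    return np.concatenate([mean, std])


def att_stats_pool(frames, att_logits, eps=1e-8):
    """Attentive statistics pooling with one weight per frame.

    att_logits: (num_frames,) unnormalized attention scores (hypothetical input).
    """
    # Softmax over time turns the scores into weights that sum to 1.
    w = np.exp(att_logits - att_logits.max())
    w = (w / w.sum())[:, None]
    mean = (w * frames).sum(axis=0)
    # Weighted variance: E[x^2] - E[x]^2, clipped to guard against rounding.
    var = (w * frames ** 2).sum(axis=0) - mean ** 2
    std = np.sqrt(np.clip(var, 0.0, None) + eps)
    return np.concatenate([mean, std])


# Toy usage: 200 frames of 80-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 80))
print(stats_pool(frames).shape, att_stats_pool(frames, rng.normal(size=200)).shape)
```

With equal attention scores for all frames, the attentive version reduces to plain statistics pooling.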
model (scoring) | Vox1_Test (EER/minDCF0.01) | Extended (EER/minDCF0.01) | Hard (EER/minDCF0.01) | VoxSRC2022_dev (EER/minDCF0.05) |
res2net50_w24_s4_c64 (cosine) | 0.9304%/0.0855 | 1.0345%/0.1109 | 1.8228%/0.1777 | |
res2net50_w24_s4_c64 (asnorm) | 0.7656%/0.0662 | 0.8989%/0.0964 | 1.5381%/0.1423 | |
res2net50_w24_s4_c64+LMFT (cosine) | 0.7762%/0.0702 | 0.8589%/0.0933 | 1.5236%/0.1506 | |
res2net50_w24_s4_c64+LMFT (asnorm) | 0.6805%/0.0579 | 0.8024%/0.0827 | 1.3703%/0.1210 | |
res2net50_w24_s4_c32 (cosine) | 0.8347%/0.0771 | 0.9893%/0.1104 | 1.8242%/0.1794 | |
res2net50_w24_s4_c32 (asnorm) | 0.7018%/0.0673 | 0.9062%/0.0986 | 1.5860%/0.1466 | |
res2net50_w24_s4_c32+LMFT (cosine) | 0.6327%/0.0622 | 0.8500%/0.0937 | 1.5352%/0.1549 | |
res2net50_w24_s4_c32+LMFT (asnorm) | 0.5529%/0.0625 | 0.8003%/0.0820 | 1.3751%/0.1271 | |
res2net50_w8_s6_c16 (cosine) | 1.0474%/0.1133 | 1.1721%/0.1288 | 2.0929%/0.2074 | |
res2net50_w8_s6_c16 (asnorm) | 0.9889%/0.1024 | 1.0659%/0.1164 | 1.8627%/0.1777 | |
res2net50_w8_s6_c16+LMFT (cosine) | 0.8666%/0.0839 | 0.9772%/0.1019 | 1.7360%/0.1678 | |
res2net50_w8_s6_c16+LMFT (asnorm) | 0.7869%/0.0821 | 0.8903%/0.0917 | 1.5370%/0.1414 | |
res2net101_w24_s4_c32_att (cosine) | 0.6539%/0.0655 | 0.7837%/0.0823 | 1.4125%/0.1395 | |
res2net101_w24_s4_c32_att (asnorm) | 0.5742%/0.0664 | 0.7230%/0.0749 | 1.2679%/0.1187 | |
res2net101_w24_s4_c32_att+LMFT (cosine) | 0.5795%/0.0512 | 0.6526%/0.0677 | 1.2099%/0.1174 | |
res2net101_w24_s4_c32_att+LMFT (asnorm) | 0.5210%/0.0534 | 0.6081%/0.0618 | 1.0940%/0.0999 | |
res2net152_w24_s4_c32_att (cosine) | 0.5476%/0.0507 | 0.7313%/0.0811 | 1.3453%/0.1331 | |
res2net152_w24_s4_c32_att (asnorm) | 0.4891%/0.0530 | 0.6754%/0.0729 | 1.1957%/0.1101 | |
res2net152_w24_s4_c32_att+LMFT (cosine) | 0.4732%/0.0425 | 0.6516%/0.0676 | 1.1957%/0.1119 | |
res2net152_w24_s4_c32_att+LMFT (asnorm) | 0.4572%/0.0463 | 0.6099%/0.0587 | 1.0737%/0.0965 | |
res2net200_w24_s4_c32_att (cosine) | 0.4944%/0.0418 | 0.7137%/0.0780 | 1.2897%/0.1267 | |
res2net200_w24_s4_c32_att (asnorm) | 0.4200%/0.0501 | 0.6658%/0.0690 | 1.1496%/0.1019 | |
res2net200_w24_s4_c32_att+LMFT (cosine) | 0.4041%/0.0390 | 0.6330%/0.0672 | 1.1641%/0.1113 | |
res2net200_w24_s4_c32_att+LMFT (asnorm) | 0.3668%/0.0388 | 0.5930%/0.0581 | 1.0330%/0.0912 | 1.5017%/0.0974 |
[1] DPN68 model with B+α+1 configuration.
[2] Res2Net50 model with B+α+5 configuration.