MFCCs (Mel-frequency cepstral coefficients) are a widely used representation of audio data in ASR (automatic speech recognition), and are generally considered a better approximation of the human auditory system's response than the linearly spaced frequency spectrum. Many ASR systems have achieved state-of-the-art performance by taking advantage of MFCCs. Since Deep Speech 2 takes only the power spectrum as its input feature, it is worth evaluating how MFCCs perform on the same network.
The experimental results will be continuously updated in this issue.
The MFCC feature used here is a 39-dimensional vector, consisting of the 13 basic cepstral coefficients plus their first- and second-order derivatives, with the first component replaced by the energy of the frame. In the first attempt, the training process followed the default settings in train.py exactly, except that the kernel and padding sizes in the conv layers were adjusted to fit the new feature dimension, but convergence was somewhat slow. Then, inspired by Wav2letter, the model was retrained with no striding in the feature dimension, and relatively better convergence appeared, as shown in the figure below.
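For concreteness, here is a minimal sketch of how such a 39-dimensional feature can be computed, assuming the python_speech_features package (the repo's own featurizer may differ); the input file name and windowing parameters are illustrative only.

```python
# Sketch of the 39-dim MFCC feature described above, using
# python_speech_features (an assumption -- the repo may use its own
# featurizer): 13 base coefficients with the 0th replaced by the log
# frame energy, plus first- and second-order derivatives.
import numpy as np
import soundfile as sf
from python_speech_features import mfcc, delta

samples, sample_rate = sf.read("utterance.wav")  # hypothetical input file

# 13 cepstral coefficients; appendEnergy=True replaces the 0th
# coefficient with the log energy of each frame.
base = mfcc(samples, samplerate=sample_rate, numcep=13, appendEnergy=True)

d1 = delta(base, 2)  # first-order derivatives
d2 = delta(d1, 2)    # second-order derivatives

features = np.hstack([base, d1, d2])  # shape: (num_frames, 39)
```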
The validation cost does not decay significantly toward the end, so training is continuing with a smaller learning rate after pass 25. The remaining learning curves will be appended later.
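To illustrate the no-striding change mentioned above, here is a small sketch (written in PyTorch for brevity, not the repo's PaddlePaddle code) showing how striding only along the time axis preserves the 39-dimensional feature axis; the kernel and channel sizes are hypothetical.

```python
# Illustration of striding in time only, so the 39-dim MFCC feature
# axis keeps its full resolution through the conv layer.
import torch
import torch.nn as nn

# Input layout assumed as (batch, channel, feature_dim, time).
x = torch.randn(1, 1, 39, 200)

# Conv striding in both feature and time dimensions.
conv_strided = nn.Conv2d(1, 32, kernel_size=(11, 11),
                         stride=(2, 2), padding=(5, 5))

# Wav2letter-inspired variant: no striding in the feature dimension.
conv_no_freq_stride = nn.Conv2d(1, 32, kernel_size=(11, 11),
                                stride=(1, 2), padding=(5, 5))

print(conv_strided(x).shape)         # torch.Size([1, 32, 20, 100])
print(conv_no_freq_stride(x).shape)  # torch.Size([1, 32, 39, 100])
```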
Hello, this issue has not been updated in the past month, so we will close it today. If you still need to follow up after it is closed, please feel free to reopen it, and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
Hey Kuke, can you share more details of the experiment with MFCC features as input? I am currently using MelSpec features for DeepSpeech2, but the best validation CER I get is 9%. Could you share your setup and results if possible?