WP2000 - Codec 2 Algorithm Description #31

Merged: 41 commits, Dec 11, 2023

Changes from 3 commits

Commits (41)
112f3b5
kicking off Codec 2 documentation
drowe67 Nov 17, 2023
9bc86bc
building up plot support
drowe67 Nov 18, 2023
cef07b4
drafted time-freq speech section, building up sinusoidal model figure
drowe67 Nov 18, 2023
ce5e8ba
macro for sinusoid
drowe67 Nov 18, 2023
def80d4
building up sinusoid figure
drowe67 Nov 18, 2023
f778670
sinusoidal figure OK
drowe67 Nov 19, 2023
24d7b22
parameter updates
drowe67 Nov 19, 2023
3d9443f
building up encoder block diagram
drowe67 Nov 19, 2023
1b311ba
encoder block diagram
drowe67 Nov 19, 2023
4d2492d
building up detailed design intro
drowe67 Nov 22, 2023
3dca356
building up NLP figure
drowe67 Nov 22, 2023
70bf39e
inserted DC notch into NLP
drowe67 Nov 23, 2023
04ebf69
Mooneer's suggestions - thanks
drowe67 Nov 23, 2023
ed463b0
moved some introductory info from DD to Intro
drowe67 Nov 23, 2023
17a30f0
first draft of NLP section, Glossary
drowe67 Nov 23, 2023
f95b590
sinusoidal encoder block diagram
drowe67 Nov 23, 2023
97b20b4
drafted sinusoidal analysis section
drowe67 Nov 24, 2023
899fce8
building up synthesis section
drowe67 Nov 24, 2023
0b6a207
first pass of synthesis section
drowe67 Nov 25, 2023
125a169
sinusoidal synthesiser figure
drowe67 Nov 25, 2023
b3ed577
rough draft of phase synthesis copied from source
drowe67 Nov 25, 2023
12bbb03
first draft of voicing estimation
drowe67 Nov 27, 2023
9a18256
make notation more consistent across sections
drowe67 Nov 27, 2023
ba7321c
draft of phase synthesis section
drowe67 Nov 28, 2023
fbbea09
phase synthesis edits
drowe67 Nov 28, 2023
f3b4305
phase model edits and LPC/LSP encoder block diagram
drowe67 Nov 29, 2023
067eaa7
LPC/LSP encoder description, decoder block diagram
drowe67 Dec 1, 2023
43defe5
decoder description, mode table
drowe67 Dec 2, 2023
0098976
building up 700C section
drowe67 Dec 6, 2023
71b86a8
mic EQ and VQ mean removal maths
drowemyriota Dec 6, 2023
670b278
aligning 700C figures with maths
drowemyriota Dec 8, 2023
348f68f
added LPC/LSP and LPC post figure figures, plus code to generate them
drowe67 Dec 9, 2023
c27e56d
oops we forgot to rm this in recent clean up
drowe67 Dec 10, 2023
8a9b13e
removed newamp2 code
drowe67 Dec 10, 2023
d1c085a
Added a list of source files; edited Further Work section
drowe67 Dec 10, 2023
05110e5
first pass at Makefile to build doc
drowe67 Dec 10, 2023
7e88771
proof read, minor edits, update symbol glossary
drowe67 Dec 10, 2023
ea0379f
ctest, README.md, first pass at github action
drowe67 Dec 10, 2023
21dd265
way to run doc ctest without overwriting codec2.doc
drowe67 Dec 11, 2023
18c5e48
exclude test_codec2_doc when running tests on github actions
drowe67 Dec 11, 2023
b8e4527
don't need tex packages as we've excluded that test for now
drowe67 Dec 11, 2023
Binary file modified doc/codec2.pdf
Binary file not shown.
129 changes: 109 additions & 20 deletions doc/codec2.tex
@@ -51,13 +51,6 @@
\draw (#1,#2-0.25) -- (#1,#2+0.25);
}

% tikz: draw a multiplier
\newcommand{\drawMultiplier}[2]{% x, y
\draw (#1,#2) circle (0.5);
\draw (#1-0.25,#2-0.25) -- (#1+0.25,#2+0.25);
\draw (#1-0.25,#2+0.25) -- (#1+0.25,#2-0.25);
}

\maketitle

\section{Introduction}
Expand All @@ -66,10 +59,10 @@ \section{Introduction}

Key features include:
\begin{enumerate}
\item A range of modes supporting different bit rates, currently (Nov 2023): 3200, 2400, 1600, 1400, 1300, 1200, 700C. The number is the bit rate, and the supplementary letter the version (700C replaced the earlier 700, 700A, 700B versions). These are referred to as ``Codec 2 3200", ``Codec 2 700C"" etc.
\item A range of modes supporting different bit rates, currently (Nov 2023): 3200, 2400, 1600, 1400, 1300, 1200, 700C. The number is the bit rate, and the supplementary letter the version (700C replaced the earlier 700, 700A, 700B versions). These are referred to as ``Codec 2 3200", ``Codec 2 700C" etc.
\item Modest CPU (a few 10s of MIPs) and memory (a few 10s of kbytes of RAM) requirements such that it can run on stm32 class microcontrollers with hardware FPU.
\item Codec 2 has been designed for digital voice over radio applications, and retains intelligible speech at a few percent bit error rate.
\item An open source reference implementation in the C language for C99/gcc compilers, and a \emph{cmake} build and test framework that runs on Linux/MinGW. Also included is a cross compiled stm32 reference implementation.
\item An open source reference implementation in the C language for C99/gcc compilers, and a \emph{cmake} build and test framework that runs on Linux. Also included is a cross compiled stm32 reference implementation.
\item Ports to non-C99 compilers (e.g. MSVC, some microcontrollers, native builds on Windows) are left to third party developers - we recommend the tests also be ported and pass before considering the port successful.
\end{enumerate}

Expand Down Expand Up @@ -100,7 +93,7 @@ \subsection{Speech in Time and Frequency}

Note that each harmonic has its own amplitude, which varies across frequency. The red line plots the amplitude of each harmonic. In this example there is a peak around 500 Hz, and another, broader peak around 2300 Hz. The ear perceives speech by the location of these peaks and troughs.

\begin{figure}[H]
\begin{figure}
\caption{ A 40ms segment from the word ``these'' from a female speaker, sampled at 8kHz. Top is a plot against time, bottom (blue) is a plot against frequency. The waveform repeats itself every 4.3ms ($F_0=230$ Hz); this is the ``pitch period'' of this segment.}
Collaborator:

Double check all usages of double-quotes and replace opening quotes with `` as appropriate.

\label{fig:hts2a_time}
\begin{center}
Expand Down Expand Up @@ -276,9 +269,98 @@ \subsection{Overview}
\item A post filter that enhances the speech quality of the baseline codec, especially for low pitched (male) speakers.
\end{enumerate}

\subsection{Naming Conventions}
\subsection{Sinusoidal Analysis}

Both voiced and unvoiced speech is represented using a harmonic sinusoidal model:
\begin{equation}
\hat{s}(n) = \sum_{m=1}^L A_m \cos(\omega_0 m n + \theta_m)
\end{equation}
where the parameters $A_m, \theta_m, m=1...L$ represent the magnitudes and phases of each sinusoid, $\omega_0$ is the fundamental frequency in radians/sample, and $L=\lfloor \pi/\omega_0 \rfloor$ is the number of harmonics.
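As a concrete illustration, the model above can be evaluated directly. The following is a minimal C99 sketch (not part of the reference implementation; the function and array names are illustrative) that synthesises one frame from a set of model parameters:

\begin{verbatim}
#include <math.h>

/* Evaluate s_hat(n) = sum_{m=1}^{L} A[m]*cos(w0*m*n + theta[m])
   for n = 0..N-1.  A[] and theta[] are indexed 1..L, so the caller
   allocates L+1 elements.  w0 is in radians/sample. */
void synth_frame(float s_hat[], int N, float w0, int L,
                 const float A[], const float theta[])
{
    for (int n = 0; n < N; n++) {
        float acc = 0.0f;
        for (int m = 1; m <= L; m++)
            acc += A[m] * cosf(w0 * m * n + theta[m]);
        s_hat[n] = acc;
    }
}
\end{verbatim}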

In Codec 2, signals are frequently moved between the time and frequency domain. In the source code and this document, time domain signals generally have the subscript $n$, and frequency domain signals the subscript $\omega$, for example $S_n$ and $S_\omega$ represent the same speech expressed in the time and frequency domain. Section \ref{sect:glossary} contains a glossary of symbols.
Figure \ref{fig:analysis} illustrates the processing steps in the sinusoidal analysis system at the core of the Codec 2 encoder. The algorithm described in this section is based on the work in \cite{rowe1997techniques}, with some changes in notation.

\begin{figure}[h]
\caption{Sinusoidal Analysis}
\label{fig:analysis}
\begin{center}
\begin{tikzpicture}[auto, node distance=2cm,>=triangle 45,x=1.0cm,y=1.0cm, align=center]

\node [input] (rinput) {};
\node [tmp, right of=rinput,node distance=0.5cm] (z) {};
\node [block, right of=z,node distance=1.5cm] (window) {Window};
\node [block, right of=window,node distance=2.5cm] (dft) {DFT};
\node [block, right of=dft,node distance=3cm,text width=2cm] (est) {Est Amp and Phase};
\node [block, below of=window] (nlp) {NLP};
\node [output, right of=est,node distance=2cm] (routput) {};

\draw [->] node[align=left,text width=2cm] {$s(n)$} (rinput) -- (window);
\draw [->] (z) |- (nlp);
\draw [->] (window) -- node[below] {$s_w(n)$} (dft);
\draw [->] (dft) -- node[below] {$S_\omega(k)$} (est);
\draw [->] (nlp) -| node[below] {$\omega_0$} (est) ;
\draw [->] (est) -- (routput) node[right] {$\{A_m\}$ \\ $\{\theta_m\}$};

\end{tikzpicture}
\end{center}
\end{figure}

For the purposes of speech analysis the time domain speech signal $s(n)$ is divided into overlapping analysis windows (frames) of $N_w=279$ samples. The centre of each analysis window is separated by $N=80$ samples, an internal frame rate of 10ms. To analyse the $l$-th frame it is convenient to convert the fixed time reference to a sliding time reference centred on the current analysis window:
\begin{equation}
s_w(n) = s(lN + n) w(n), \quad n = - N_{w2} ... N_{w2}
\end{equation}
where $w(n)$ is a tapered even window of $N_w$ ($N_w$ odd) samples with:
\begin{equation}
N_{w2} = \left \lfloor \frac{N_w}{2} \right \rfloor
\end{equation}
A suitable window function is a shifted Hann window:
\begin{equation}
w(n) = \frac{1}{2} - \frac{1}{2} \cos \left(\frac{2 \pi (n- N_{w2})}{N_w-1} \right)
\end{equation}
where the energy in the window is normalised such that:
\begin{equation}
\sum_{n=0}^{N_w-1}w^2(n) = \frac{1}{N_{dft}}
\end{equation}
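A sketch of the window construction may help. The C99 fragment below (illustrative only; $N_{dft}=512$ is an assumption for this sketch) builds the shifted Hann window with indices $0...N_w-1$, equivalent to the $-N_{w2}...N_{w2}$ form above, then applies the energy normalisation:

\begin{verbatim}
#include <math.h>

#define NW   279   /* analysis window length (odd), from the text */
#define NDFT 512   /* DFT length; an assumption for this sketch   */

void make_window(float w[NW])
{
    /* shifted Hann window; with n' = n + Nw2 the cosine argument
       2*pi*(n - Nw2)/(NW-1) reduces to 2*pi*n'/(NW-1)            */
    float e = 0.0f;
    for (int n = 0; n < NW; n++) {
        w[n] = 0.5f - 0.5f * cosf(2.0f * (float)M_PI * n / (NW - 1));
        e += w[n] * w[n];
    }
    /* scale so that the sum of w^2(n) equals 1/NDFT              */
    float g = sqrtf(1.0f / (NDFT * e));
    for (int n = 0; n < NW; n++)
        w[n] *= g;
}
\end{verbatim}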
To analyse $s(n)$ in the frequency domain the $N_{dft}$ point Discrete Fourier Transform (DFT) can be computed:
\begin{equation}
S_w(k) = \sum_{n=-N_{w2}}^{N_{w2}} s_w(n) e^{-j 2 \pi k n / N_{dft}}
\end{equation}
The magnitude and phase of each harmonic is given by:
\begin{equation}
\begin{split}
A_m &= \sqrt{\sum_{k=a_m}^{b_m-1} |S_w(k)|^2 } \\
\theta_m &= \arg \left( S_w(m \omega_0 N_{dft} / 2 \pi) \right) \\
a_m &= \left \lfloor \frac{(m - 0.5)\omega_0 N_{dft}}{2 \pi} + 0.5 \right \rfloor \\
b_m &= \left \lfloor \frac{(m + 0.5)\omega_0 N_{dft}}{2 \pi} + 0.5 \right \rfloor
\end{split}
\end{equation}
The DFT indexes $a_m, b_m$ select the band of $S_w(k)$ containing the $m$-th harmonic. The magnitude $A_m$ is the RMS level of the energy in the band containing the harmonic. This method of estimating $A_m$ is relatively insensitive to small errors in $F_0$ estimation and works equally well for voiced and unvoiced speech. The phase is sampled at the centre of the band. For all practical Codec 2 modes the phase is not transmitted to the decoder, so it does not need to be computed. However, speech synthesised using the phase is useful as a control during development, and is available using the \emph{c2sim} utility.
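The band-energy magnitude and centre-bin phase estimates translate directly to code. A hedged C99 sketch follows (illustrative names; a real implementation would obtain $S_w(k)$ from an FFT):

\begin{verbatim}
#include <complex.h>
#include <math.h>

/* Estimate A[m] and theta[m], m = 1..L, from the ndft-point DFT
   Sw[] of the windowed frame.  w0 is in radians/sample. */
void estimate_amp_phase(const float complex Sw[], int ndft, float w0,
                        int L, float A[], float theta[])
{
    for (int m = 1; m <= L; m++) {
        int am = (int)floorf((m - 0.5f) * w0 * ndft / (2.0f * (float)M_PI) + 0.5f);
        int bm = (int)floorf((m + 0.5f) * w0 * ndft / (2.0f * (float)M_PI) + 0.5f);

        float e = 0.0f;                 /* energy in band am..bm-1 */
        for (int k = am; k < bm; k++)
            e += crealf(Sw[k])*crealf(Sw[k]) + cimagf(Sw[k])*cimagf(Sw[k]);
        A[m] = sqrtf(e);

        /* phase sampled at the bin nearest the band centre        */
        int kc = (int)floorf(m * w0 * ndft / (2.0f * (float)M_PI) + 0.5f);
        theta[m] = cargf(Sw[kc]);
    }
}
\end{verbatim}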

\subsection{Sinusoidal Synthesis}

Synthesis is achieved by constructing an estimate of the original speech spectrum using the sinusoidal model parameters for the current frame. This information is then transformed to the time domain using an Inverse DFT (IDFT). To produce a continuous time domain waveform the IDFTs from adjacent frames are smoothly interpolated using a weighted overlap add procedure \cite{mcaulay1986speech}.

The synthetic speech spectrum is constructed using the sinusoidal model parameters by populating a DFT array $\hat{S}_w(k)$ with weighted impulses at the harmonic centres:
\begin{equation}
\begin{split}
\hat{S}_w(k) &= \begin{cases}
A_m e^{j \theta_m}, & m=1...L \\
0, & otherwise
\end{cases} \\
k &= \left \lfloor \frac{m \omega_0 N_{dft}}{2 \pi} + 0.5 \right \rfloor
\end{split}
\end{equation}

As we wish to synthesise a real time domain signal, $\hat{S}_w(k)$ is defined to be conjugate symmetric:
\begin{equation}
\hat{S}_w(N_{dft}-k) = \hat{S}_w^{*}(k), \quad k = 1,.. N_{dft}/2-1
\end{equation}
where $\hat{S}_w^*(k)$ is the complex conjugate of $\hat{S}_w(k)$. This signal is converted to the time domain
using the IDFT:
\begin{equation}
\hat{s}_w(n) = \frac{1}{N_{dft}}\sum_{k=0}^{N_{dft}-1} \hat{S}_w(k) e^{j 2 \pi k n / N_{dft}}
\end{equation}
We introduce the notation $\hat{s}_w^l(n)$ to denote the synthesised speech for the $l$-th frame. To reconstruct a continuous synthesised speech waveform, we need to smoothly connect adjacent synthesised frames of speech. This is performed by windowing each frame, then shifting and superimposing adjacent frames using an overlap add algorithm.
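Putting the steps together (impulses at the harmonic centres, conjugate symmetry, IDFT), a minimal C99 sketch of one frame of synthesis, before the overlap add step, might look as follows. An $O(N^2)$ IDFT is used for clarity where a real implementation would use an IFFT; $N_{dft}=512$ is an assumption:

\begin{verbatim}
#include <complex.h>
#include <math.h>

#define NDFT 512   /* assumed DFT length for this sketch */

void synth_frame_freq(float s[NDFT], float w0, int L,
                      const float A[], const float theta[])
{
    float complex S[NDFT] = {0};

    /* one weighted impulse per harmonic, plus its mirror image so
       the result is real */
    for (int m = 1; m <= L; m++) {
        int k = (int)floorf(m * w0 * NDFT / (2.0f * (float)M_PI) + 0.5f);
        S[k] = A[m] * cexpf(I * theta[m]);
        S[NDFT - k] = conjf(S[k]);
    }

    /* O(N^2) IDFT for clarity; use an IFFT in practice */
    for (int n = 0; n < NDFT; n++) {
        float complex acc = 0.0f;
        for (int k = 0; k < NDFT; k++)
            acc += S[k] * cexpf(I * 2.0f * (float)M_PI * k * n / NDFT);
        s[n] = crealf(acc) / NDFT;
    }
}
\end{verbatim}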

\subsection{Non-Linear Pitch Estimation}

@@ -327,19 +409,20 @@ \subsection{Non-Linear Pitch Estimation}
\end{center}
\end{figure}

Before transforming the squared signal to the frequency domain, the signal is low pass filtered and decimated by a factor of 5. This operation is performed to limit the bandwidth of the squared signal to the approximate range of the fundamental frequency. All energy in the squared signal above 400 Hz is superfluous and would lower the resolution of the frequency domain peak picking stage. The low pass filter used for decimation is an FIR type with 48 taps and a cut off frequency of 600 Hz. The decimated signal is then windowed and the $N_{dft} = 512$ point DFT power spectrum $F_\omega(k)$ is computed by zero padding the decimated signal, where $k$ is the DFT bin.
Before transforming the squared signal to the frequency domain, the signal is low pass filtered and decimated by a factor of 5. This operation is performed to limit the bandwidth of the squared signal to the approximate range of the fundamental frequency. All energy in the squared signal above 400 Hz is superfluous and would lower the resolution of the frequency domain peak picking stage. The low pass filter used for decimation is an FIR type with 48 taps and a cut off frequency of 600 Hz. The decimated signal is then windowed and the $N_{dft} = 512$ point DFT power spectrum $F_w(k)$ is computed by zero padding the decimated signal, where $k$ is the DFT bin.
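A sketch of this preprocessing step is shown below, assuming the actual 48 filter coefficients are available elsewhere (here declared extern as a placeholder):

\begin{verbatim}
#define DEC   5    /* decimation factor, from the text */
#define NTAPS 48   /* FIR length, from the text        */

extern const float nlp_fir[NTAPS];  /* 600 Hz cut-off coefficients,
                                       placeholder for real values */

/* Square the speech, low pass filter, and keep every 5th output
   sample (the FIR is only evaluated at retained samples). */
void nlp_preprocess(const float s[], int n_in, float dec[])
{
    for (int i = 0; i < n_in / DEC; i++) {
        float acc = 0.0f;
        for (int j = 0; j < NTAPS; j++) {
            int idx = i * DEC - j;
            float x = (idx >= 0) ? s[idx] * s[idx] : 0.0f;
            acc += nlp_fir[j] * x;
        }
        dec[i] = acc;
    }
}
\end{verbatim}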

The DFT power spectrum of the squared signal $F_\omega(k)$ generally contains several local maxima. In most cases, the global maxima will correspond to $F_0$, however occasionally the global maxima $|F_\omega(k_{max})|$ corresponds to a spurious peak or multiple of $F_0$ . Thus it is not appropriate to simply choose the global maxima as the fundamental estimate for this frame. Instead, we look at submultiples of the global maxima frequency $k_{max}/2, k_{max}/3,... k_{min}$ for local maxima. If local maxima exists and is above an experimentally derived threshold we choose the submultiple as the $F_0$ estimate. The threshold is biased down for $F_0$ candidates near the previous frames $F_0$ estimate, a form of backwards pitch tracking.
The DFT power spectrum of the squared signal $F_w(k)$ generally contains several local maxima. In most cases the global maximum will correspond to $F_0$, however occasionally it corresponds to a spurious peak or a multiple of $F_0$. Thus it is not appropriate to simply choose the global maximum as the fundamental estimate for this frame. Instead, we look at submultiples of the global maximum frequency $k_{max}/2, k_{max}/3,... k_{min}$ for local maxima. If a local maximum exists and is above an experimentally derived threshold we choose the submultiple as the $F_0$ estimate. The threshold is biased down for $F_0$ candidates near the previous frame's $F_0$ estimate, a form of backwards pitch tracking.
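The submultiple search can be sketched in C99 as follows. This is a simplified stand-in, not the reference logic: the thresholds 0.4 and 0.6 are illustrative, where the reference implementation uses experimentally derived values:

\begin{verbatim}
#include <stdlib.h>

/* Coarse F0 search over submultiples k_global/2, k_global/3, ...
   of the global maximum bin k_global of Fw[] (nbins bins).
   k_prev is the previous frame's estimate. */
int coarse_f0_bin(const float Fw[], int nbins, int k_global,
                  int k_min, int k_prev)
{
    int best = k_global;
    for (int div = 2; k_global / div >= k_min; div++) {
        int k = k_global / div;

        int k_local = k;   /* local maximum near the submultiple */
        for (int j = k - 2; j <= k + 2; j++)
            if (j >= 0 && j < nbins && Fw[j] > Fw[k_local])
                k_local = j;

        /* bias the threshold down near the previous frame's
           estimate: backwards pitch tracking */
        float t = (abs(k_local - k_prev) <= 2) ? 0.4f : 0.6f;
        if (Fw[k_local] > t * Fw[k_global])
            best = k_local;
    }
    return best;
}
\end{verbatim}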

The accuracy of the pitch estimate is then refined by maximising the function:
\begin{equation}
E(\omega_0)=\sum_{m=1}^L|S_{\omega}(b m \omega_0)|^2
E(\omega_0)=\sum_{m=1}^L|S_w(b m \omega_0)|^2
\end{equation}
where the $\omega_0=2 \pi F_0 /F_s$ is the normalised angular fundamental frequency in radians/sample, $b$ is a constant that maps a frequency in radians/sample to a DFT bin, and $S_\omega$ is the DFT of the speech spectrum for the current frame. This function will be maximised when $mF_0$ samples the peak of each harmonic, corresponding with an accurate pitch estimate. It is evaluated in a small range about the coarse $F_0$ estimate.
where $\omega_0=2 \pi F_0 /F_s$ is the normalised angular fundamental frequency in radians/sample, $b$ is a constant that maps a frequency in radians/sample to a DFT bin, and $S_w$ is the DFT of the speech spectrum for the current frame. This function will be maximised when $mF_0$ aligns with the peak of each harmonic, corresponding with an accurate pitch estimate. It is evaluated in a small range about the coarse $F_0$ estimate.
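A sketch of the refinement search (the $\pm 3\%$ range and step size here are illustrative assumptions, not the reference values):

\begin{verbatim}
#include <complex.h>
#include <math.h>

/* Refine w0 by maximising E(w0) over a small range about the
   coarse estimate.  Sw[] is the ndft-point DFT of the current
   speech frame; b = ndft/(2*pi) maps radians/sample to a bin. */
float refine_w0(const float complex Sw[], int ndft, float w0_coarse)
{
    const float b = ndft / (2.0f * (float)M_PI);
    float best_w0 = w0_coarse, best_e = -1.0f;

    for (float w0 = 0.97f * w0_coarse; w0 <= 1.03f * w0_coarse;
         w0 += 0.001f * w0_coarse) {
        int L = (int)((float)M_PI / w0);
        float e = 0.0f;
        for (int m = 1; m <= L; m++) {
            int k = (int)(b * m * w0 + 0.5f);
            e += crealf(Sw[k])*crealf(Sw[k]) + cimagf(Sw[k])*cimagf(Sw[k]);
        }
        if (e > best_e) { best_e = e; best_w0 = w0; }
    }
    return best_w0;
}
\end{verbatim}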

There is nothing particularly unique about this pitch estimator or its performance. There are occasional artefacts in the synthesised speech that can be traced to ``gross'' and ``fine'' pitch estimator errors. In the real world no pitch estimator is perfect, partially because the model assumptions around pitch break down (e.g. in transition regions or unvoiced speech). The NLP algorithm could benefit from additional review, tuning and better pitch tracking. However it appears sufficient for the use case of a communications quality speech codec, and is a minor source of artefacts in the synthesised speech. Other pitch estimators could also be used, provided they have practical, real world implementations that offer comparable performance and CPU/memory requirements.

\subsection{Sinusoidal Analysis and Synthesis}

\subsection{Voicing Estimation}

\subsection{LPC/LSP based modes}

@@ -364,14 +447,20 @@ \section{Glossary}
\hline
Symbol & Description & Units \\
\hline
$a_m$ & lower DFT index of current band \\
$b_m$ & upper DFT index of current band \\
$b$ & Constant that maps a frequency in radians/sample to a DFT bin \\
$\{A_m\}$ & Set of spectral amplitudes $m=1,...L$ & dB \\
$\{A_m\}$ & Set of harmonic magnitudes $m=1,...L$ & dB \\
$F_0$ & Fundamental frequency (pitch) & Hz \\
$F_s$ & Sample rate (usually 8 kHz) & Hz \\
$F_\omega(k)$ & DFT of squared speech signal in NLP pitch estimator \\
$L$ & Number of harmonics & \\
$F_w(k)$ & DFT of squared speech signal in NLP pitch estimator \\
$L$ & Number of harmonics \\
$P$ & Pitch period & ms or samples \\
$\{\theta_m\}$ & Set of harmonic phases $m=1,...L$ & radians \\
$\omega_0$ & Fundamental frequency (pitch) & radians/sample \\
$s(n)$ & Input speech \\
$s_w(n)$ & Time domain windowed input speech \\
$S_w(k)$ & Frequency domain windowed input speech \\
\hline
\end{tabular}
\caption{Glossary of Symbols}