WP2000 - Codec 2 Algorithm Description #31

Merged: 41 commits, Dec 11, 2023

Changes from 3 commits

Commits (41)
112f3b5
kicking off Codec 2 documentation
drowe67 Nov 17, 2023
9bc86bc
building up plot support
drowe67 Nov 18, 2023
cef07b4
drafted time-freq speech section, building up sinusoidal model figure
drowe67 Nov 18, 2023
ce5e8ba
macro for sinusoid
drowe67 Nov 18, 2023
def80d4
building up sinusoid figure
drowe67 Nov 18, 2023
f778670
sinusoidal figure OK
drowe67 Nov 19, 2023
24d7b22
parameter updates
drowe67 Nov 19, 2023
3d9443f
building up encoder block diagram
drowe67 Nov 19, 2023
1b311ba
encoder block diagram
drowe67 Nov 19, 2023
4d2492d
building up detailed design intro
drowe67 Nov 22, 2023
3dca356
building up NLP figure
drowe67 Nov 22, 2023
70bf39e
inserted DC notch into NLP
drowe67 Nov 23, 2023
04ebf69
Mooneer's suggestions - thanks
drowe67 Nov 23, 2023
ed463b0
moved some introductory info from DD to Intro
drowe67 Nov 23, 2023
17a30f0
first draft of NLP section, Glossary
drowe67 Nov 23, 2023
f95b590
sinusoidal encoder block diagram
drowe67 Nov 23, 2023
97b20b4
drafted sinusoidal analysis section
drowe67 Nov 24, 2023
899fce8
building up synthesis section
drowe67 Nov 24, 2023
0b6a207
first pass of synthesis section
drowe67 Nov 25, 2023
125a169
sinusoidal synthesiser figure
drowe67 Nov 25, 2023
b3ed577
rough draft of phase synthesis copied from source
drowe67 Nov 25, 2023
12bbb03
first draft of voicing estimation
drowe67 Nov 27, 2023
9a18256
make notation more consistent across sections
drowe67 Nov 27, 2023
ba7321c
draft of phase synthesis section
drowe67 Nov 28, 2023
fbbea09
phase synthesis edits
drowe67 Nov 28, 2023
f3b4305
phase model edits and LPC/LSP encoder block diagram
drowe67 Nov 29, 2023
067eaa7
LPC/LSP encoder description, decoder block diagram
drowe67 Dec 1, 2023
43defe5
decoder description, mode table
drowe67 Dec 2, 2023
0098976
building up 700C section
drowe67 Dec 6, 2023
71b86a8
mic EQ and VQ mean removal maths
drowemyriota Dec 6, 2023
670b278
aligning 700C figures with maths
drowemyriota Dec 8, 2023
348f68f
added LPC/LSP and LPC post figure figures, plus code to generate them
drowe67 Dec 9, 2023
c27e56d
oops we forgot to rm this in recent clean up
drowe67 Dec 10, 2023
8a9b13e
removed newamp2 code
drowe67 Dec 10, 2023
d1c085a
Added a list of source files; edited Further Work section
drowe67 Dec 10, 2023
05110e5
first pass at Makefile to build doc
drowe67 Dec 10, 2023
7e88771
proof read, minor edits, update symbol glossary
drowe67 Dec 10, 2023
ea0379f
ctest, README.md, first pass at github action
drowe67 Dec 10, 2023
21dd265
way to run doc ctest without overwriting codec2.doc
drowe67 Dec 11, 2023
18c5e48
exclude test_codec2_doc when running tests on github actions
drowe67 Dec 11, 2023
b8e4527
don't need tex packages as we've excluded that test for now
drowe67 Dec 11, 2023
Binary file modified doc/codec2.pdf
Binary file not shown.
129 changes: 109 additions & 20 deletions doc/codec2.tex
@@ -51,13 +51,6 @@
\draw (#1,#2-0.25) -- (#1,#2+0.25);
}

% tikz: draw a multiplier
\newcommand{\drawMultiplier}[2]{% x, y
\draw (#1,#2) circle (0.5);
\draw (#1-0.25,#2-0.25) -- (#1+0.25,#2+0.25);
\draw (#1-0.25,#2+0.25) -- (#1+0.25,#2-0.25);
}

\maketitle

\section{Introduction}
Expand All @@ -66,10 +59,10 @@ \section{Introduction}

Key features include:
\begin{enumerate}
\item A range of modes supporting different bit rates, currently (Nov 2023): 3200, 2400, 1600, 1400, 1300, 1200, 700C. The number is the bit rate, and the supplementary letter the version (700C replaced the earlier 700, 700A, 700B versions). These are referred to as ``Codec 2 3200", ``Codec 2 700C"" etc.
\item A range of modes supporting different bit rates, currently (Nov 2023): 3200, 2400, 1600, 1400, 1300, 1200, 700C. The number is the bit rate, and the supplementary letter the version (700C replaced the earlier 700, 700A, 700B versions). These are referred to as ``Codec 2 3200", ``Codec 2 700C" etc.
\item Modest CPU (a few 10s of MIPs) and memory (a few 10s of kbytes of RAM) requirements such that it can run on stm32 class microcontrollers with hardware FPU.
\item Codec 2 has been designed for digital voice over radio applications, and retains intelligible speech at a few percent bit error rate.
\item An open source reference implementation in the C language for C99/gcc compilers, and a \emph{cmake} build and test framework that runs on Linux/MinGW. Also included is a cross compiled stm32 reference implementation.
\item An open source reference implementation in the C language for C99/gcc compilers, and a \emph{cmake} build and test framework that runs on Linux. Also included is a cross compiled stm32 reference implementation.
\item Ports to non-C99 compilers (e.g. MSVC, some microcontrollers, native builds on Windows) are left to third party developers - we recommend the tests also be ported and pass before considering the port successful.
\end{enumerate}

Expand Down Expand Up @@ -100,7 +93,7 @@ \subsection{Speech in Time and Frequency}

Note that each harmonic has its own amplitude, which varies across frequency. The red line plots the amplitude of each harmonic. In this example there is a peak around 500 Hz, and another, broader peak around 2300 Hz. The ear perceives speech by the location of these peaks and troughs.

\begin{figure}[H]
\begin{figure}
\caption{ A 40ms segment from the word ``these'' from a female speaker, sampled at 8kHz. Top is a plot against time, bottom (blue) is a plot against frequency. The waveform repeats itself every 4.3ms ($F_0=230$ Hz); this is the ``pitch period'' of this segment.}
Collaborator:

Double check all usages of double-quotes and replace opening quotes with `` as appropriate.

\label{fig:hts2a_time}
\begin{center}
Expand Down Expand Up @@ -276,9 +269,98 @@ \subsection{Overview}
\item A post filter that enhances the speech quality of the baseline codec, especially for low pitched (male) speakers.
\end{enumerate}

\subsection{Naming Conventions}
\subsection{Sinusoidal Analysis}

Both voiced and unvoiced speech is represented using a harmonic sinusoidal model:
\begin{equation}
\hat{s}(n) = \sum_{m=1}^L A_m \cos(\omega_0 m n + \theta_m)
\end{equation}
where the parameters $A_m, \theta_m, m=1...L$ represent the magnitudes and phases of each sinusoid, $\omega_0$ is the fundamental frequency in radians/sample, and $L=\lfloor \pi/\omega_0 \rfloor$ is the number of harmonics.
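As a concrete illustration, the model above can be evaluated directly. The following is a minimal C99 sketch (not part of the reference implementation; the function and array names are illustrative) that synthesises one frame from a set of model parameters:

\begin{verbatim}
#include <math.h>

/* Evaluate s_hat(n) = sum_{m=1}^{L} A[m]*cos(w0*m*n + theta[m])
   for n = 0..N-1.  A[] and theta[] are indexed 1..L, so the caller
   allocates L+1 elements.  w0 is in radians/sample. */
void synth_frame(float s_hat[], int N, float w0, int L,
                 const float A[], const float theta[])
{
    for (int n = 0; n < N; n++) {
        float acc = 0.0f;
        for (int m = 1; m <= L; m++)
            acc += A[m] * cosf(w0 * m * n + theta[m]);
        s_hat[n] = acc;
    }
}
\end{verbatim}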

In Codec 2, signals are frequently moved between the time and frequency domain. In the source code and this document, time domain signals generally have the subscript $n$, and frequency domain signals the subscript $\omega$, for example $S_n$ and $S_\omega$ represent the same speech expressed in the time and frequency domain. Section \ref{sect:glossary} contains a glossary of symbols.
Figure \ref{fig:analysis} illustrates the processing steps in the sinusoidal analysis system at the core of the Codec 2 encoder. The algorithm described in this section is based on the work in \cite{rowe1997techniques}, with some changes in notation.

\begin{figure}[h]
\caption{Sinusoidal Analysis}
\label{fig:analysis}
\begin{center}
\begin{tikzpicture}[auto, node distance=2cm,>=triangle 45,x=1.0cm,y=1.0cm, align=center]

\node [input] (rinput) {};
\node [tmp, right of=rinput,node distance=0.5cm] (z) {};
\node [block, right of=z,node distance=1.5cm] (window) {Window};
\node [block, right of=window,node distance=2.5cm] (dft) {DFT};
\node [block, right of=dft,node distance=3cm,text width=2cm] (est) {Est Amp and Phase};
\node [block, below of=window] (nlp) {NLP};
\node [output, right of=est,node distance=2cm] (routput) {};

\draw [->] node[align=left,text width=2cm] {$s(n)$} (rinput) -- (window);
\draw [->] (z) |- (nlp);
\draw [->] (window) -- node[below] {$s_w(n)$} (dft);
\draw [->] (dft) -- node[below] {$S_\omega(k)$} (est);
\draw [->] (nlp) -| node[below] {$\omega_0$} (est) ;
\draw [->] (est) -- (routput) node[right] {$\{A_m\}$ \\ $\{\theta_m\}$};

\end{tikzpicture}
\end{center}
\end{figure}

For the purposes of speech analysis the time domain speech signal $s(n)$ is divided into overlapping analysis windows (frames) of $N_w=279$ samples. The centre of each analysis window is separated by $N=80$ samples, an internal frame rate of 10ms. To analyse the $l$-th frame it is convenient to convert the fixed time reference to a sliding time reference centred on the current analysis window:
\begin{equation}
s_w(n) = s(lN + n) w(n), \quad n = - N_{w2} ... N_{w2}
\end{equation}
where $w(n)$ is a tapered even window of $N_w$ ($N_w$ odd) samples with:
\begin{equation}
N_{w2} = \left \lfloor \frac{N_w}{2} \right \rfloor
\end{equation}
A suitable window function is a shifted Hann window:
\begin{equation}
w(n) = \frac{1}{2} - \frac{1}{2} \cos \left(\frac{2 \pi (n- N_{w2})}{N_w-1} \right)
\end{equation}
where the energy in the window is normalised such that:
\begin{equation}
\sum_{n=0}^{N_w-1}w^2(n) = \frac{1}{N_{dft}}
\end{equation}
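A sketch of the window construction may help. The C99 fragment below (illustrative only; $N_{dft}=512$ is an assumption for this sketch) builds the shifted Hann window with indices $0...N_w-1$, equivalent to the $-N_{w2}...N_{w2}$ form above, then applies the energy normalisation:

\begin{verbatim}
#include <math.h>

#define NW   279   /* analysis window length (odd), from the text */
#define NDFT 512   /* DFT length; an assumption for this sketch   */

void make_window(float w[NW])
{
    /* shifted Hann window; with n' = n + Nw2 the cosine argument
       2*pi*(n - Nw2)/(NW-1) reduces to 2*pi*n'/(NW-1)            */
    float e = 0.0f;
    for (int n = 0; n < NW; n++) {
        w[n] = 0.5f - 0.5f * cosf(2.0f * (float)M_PI * n / (NW - 1));
        e += w[n] * w[n];
    }
    /* scale so that the sum of w^2(n) equals 1/NDFT              */
    float g = sqrtf(1.0f / (NDFT * e));
    for (int n = 0; n < NW; n++)
        w[n] *= g;
}
\end{verbatim}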
To analyse $s(n)$ in the frequency domain the $N_{dft}$ point Discrete Fourier Transform (DFT) can be computed:
\begin{equation}
S_w(k) = \sum_{n=-N_{w2}}^{N_{w2}} s_w(n) e^{-j 2 \pi k n / N_{dft}}
\end{equation}
The magnitude and phase of each harmonic is given by:
\begin{equation}
\begin{split}
A_m &= \sqrt{\sum_{k=a_m}^{b_m-1} |S_w(k)|^2 } \\
\theta_m &= \arg \left( S_w(m \omega_0 N_{dft} / 2 \pi) \right) \\
a_m &= \left \lfloor \frac{(m - 0.5)\omega_0 N_{dft}}{2 \pi} + 0.5 \right \rfloor \\
b_m &= \left \lfloor \frac{(m + 0.5)\omega_0 N_{dft}}{2 \pi} + 0.5 \right \rfloor
\end{split}
\end{equation}
The DFT indexes $a_m, b_m$ select the band of $S_w(k)$ containing the $m$-th harmonic. The magnitude $A_m$ is the RMS level of the energy in the band containing the harmonic. This method of estimating $A_m$ is relatively insensitive to small errors in $F_0$ estimation and works equally well for voiced and unvoiced speech. The phase is sampled at the centre of the band. For all practical Codec 2 modes the phase is not transmitted to the decoder, so it does not need to be computed. However, speech synthesised using the phase is useful as a control during development, and is available using the \emph{c2sim} utility.
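The band-energy magnitude and centre-bin phase estimates translate directly to code. A hedged C99 sketch follows (illustrative names; a real implementation would obtain $S_w(k)$ from an FFT):

\begin{verbatim}
#include <complex.h>
#include <math.h>

/* Estimate A[m] and theta[m], m = 1..L, from the ndft-point DFT
   Sw[] of the windowed frame.  w0 is in radians/sample. */
void estimate_amp_phase(const float complex Sw[], int ndft, float w0,
                        int L, float A[], float theta[])
{
    for (int m = 1; m <= L; m++) {
        int am = (int)floorf((m - 0.5f) * w0 * ndft / (2.0f * (float)M_PI) + 0.5f);
        int bm = (int)floorf((m + 0.5f) * w0 * ndft / (2.0f * (float)M_PI) + 0.5f);

        float e = 0.0f;                 /* energy in band am..bm-1 */
        for (int k = am; k < bm; k++)
            e += crealf(Sw[k])*crealf(Sw[k]) + cimagf(Sw[k])*cimagf(Sw[k]);
        A[m] = sqrtf(e);

        /* phase sampled at the bin nearest the band centre        */
        int kc = (int)floorf(m * w0 * ndft / (2.0f * (float)M_PI) + 0.5f);
        theta[m] = cargf(Sw[kc]);
    }
}
\end{verbatim}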

\subsection{Sinusoidal Synthesis}

Synthesis is achieved by constructing an estimate of the original speech spectrum using the sinusoidal model parameters for the current frame. This information is then transformed to the time domain using an Inverse DFT (IDFT). To produce a continuous time domain waveform the IDFTs from adjacent frames are smoothly interpolated using a weighted overlap add procedure \cite{mcaulay1986speech}.

The synthetic speech spectrum is constructed using the sinusoidal model parameters by populating a DFT array $\hat{S}_w(k)$ with weighted impulses at the harmonic centres:
\begin{equation}
\begin{split}
\hat{S}_w(k) &= \begin{cases}
A_m e^{j \theta_m}, & m=1...L \\
0, & otherwise
\end{cases} \\
k &= \left \lfloor \frac{m \omega_0 N_{dft}}{2 \pi} + 0.5 \right \rfloor
\end{split}
\end{equation}

As we wish to synthesise a real time domain signal, $\hat{S}_w(k)$ is defined to be conjugate symmetric:
\begin{equation}
\hat{S}_w(N_{dft}-k) = \hat{S}_w^{*}(k), \quad k = 1,.. N_{dft}/2-1
\end{equation}
where $\hat{S}_w^*(k)$ is the complex conjugate of $\hat{S}_w(k)$. This signal is converted to the time domain
using the IDFT:
\begin{equation}
\hat{s}_w(n) = \frac{1}{N_{dft}}\sum_{k=0}^{N_{dft}-1} \hat{S}_w(k) e^{j 2 \pi k n / N_{dft}}
\end{equation}
We introduce the notation $\hat{s}_w^l(n)$ to denote the synthesised speech for the $l$-th frame. To reconstruct a continuous synthesised speech waveform, we need to smoothly connect adjacent synthesised frames of speech. This is performed by windowing each frame, then shifting and superimposing adjacent frames using an overlap add algorithm.
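Putting the steps together (impulses at the harmonic centres, conjugate symmetry, IDFT), a minimal C99 sketch of one frame of synthesis, before the overlap add step, might look as follows. An $O(N^2)$ IDFT is used for clarity where a real implementation would use an IFFT; $N_{dft}=512$ is an assumption:

\begin{verbatim}
#include <complex.h>
#include <math.h>

#define NDFT 512   /* assumed DFT length for this sketch */

void synth_frame_freq(float s[NDFT], float w0, int L,
                      const float A[], const float theta[])
{
    float complex S[NDFT] = {0};

    /* one weighted impulse per harmonic, plus its mirror image so
       the result is real */
    for (int m = 1; m <= L; m++) {
        int k = (int)floorf(m * w0 * NDFT / (2.0f * (float)M_PI) + 0.5f);
        S[k] = A[m] * cexpf(I * theta[m]);
        S[NDFT - k] = conjf(S[k]);
    }

    /* O(N^2) IDFT for clarity; use an IFFT in practice */
    for (int n = 0; n < NDFT; n++) {
        float complex acc = 0.0f;
        for (int k = 0; k < NDFT; k++)
            acc += S[k] * cexpf(I * 2.0f * (float)M_PI * k * n / NDFT);
        s[n] = crealf(acc) / NDFT;
    }
}
\end{verbatim}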

\subsection{Non-Linear Pitch Estimation}

@@ -327,19 +409,20 @@ \subsection{Non-Linear Pitch Estimation}
\end{center}
\end{figure}

Before transforming the squared signal to the frequency domain, the signal is low pass filtered and decimated by a factor of 5. This operation is performed to limit the bandwidth of the squared signal to the approximate range of the fundamental frequency. All energy in the squared signal above 400 Hz is superfluous and would lower the resolution of the frequency domain peak picking stage. The low pass filter used for decimation is an FIR type with 48 taps and a cut off frequency of 600 Hz. The decimated signal is then windowed and the $N_{dft} = 512$ point DFT power spectrum $F_\omega(k)$ is computed by zero padding the decimated signal, where $k$ is the DFT bin.
Before transforming the squared signal to the frequency domain, the signal is low pass filtered and decimated by a factor of 5. This operation is performed to limit the bandwidth of the squared signal to the approximate range of the fundamental frequency. All energy in the squared signal above 400 Hz is superfluous and would lower the resolution of the frequency domain peak picking stage. The low pass filter used for decimation is an FIR type with 48 taps and a cut off frequency of 600 Hz. The decimated signal is then windowed and the $N_{dft} = 512$ point DFT power spectrum $F_w(k)$ is computed by zero padding the decimated signal, where $k$ is the DFT bin.
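A sketch of this preprocessing step is shown below, assuming the actual 48 filter coefficients are available elsewhere (here declared extern as a placeholder):

\begin{verbatim}
#define DEC   5    /* decimation factor, from the text */
#define NTAPS 48   /* FIR length, from the text        */

extern const float nlp_fir[NTAPS];  /* 600 Hz cut-off coefficients,
                                       placeholder for real values */

/* Square the speech, low pass filter, and keep every 5th output
   sample (the FIR is only evaluated at retained samples). */
void nlp_preprocess(const float s[], int n_in, float dec[])
{
    for (int i = 0; i < n_in / DEC; i++) {
        float acc = 0.0f;
        for (int j = 0; j < NTAPS; j++) {
            int idx = i * DEC - j;
            float x = (idx >= 0) ? s[idx] * s[idx] : 0.0f;
            acc += nlp_fir[j] * x;
        }
        dec[i] = acc;
    }
}
\end{verbatim}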

The DFT power spectrum of the squared signal $F_\omega(k)$ generally contains several local maxima. In most cases, the global maxima will correspond to $F_0$, however occasionally the global maxima $|F_\omega(k_{max})|$ corresponds to a spurious peak or multiple of $F_0$ . Thus it is not appropriate to simply choose the global maxima as the fundamental estimate for this frame. Instead, we look at submultiples of the global maxima frequency $k_{max}/2, k_{max}/3,... k_{min}$ for local maxima. If local maxima exists and is above an experimentally derived threshold we choose the submultiple as the $F_0$ estimate. The threshold is biased down for $F_0$ candidates near the previous frames $F_0$ estimate, a form of backwards pitch tracking.
The DFT power spectrum of the squared signal $F_w(k)$ generally contains several local maxima. In most cases the global maximum will correspond to $F_0$, however occasionally it corresponds to a spurious peak or a multiple of $F_0$. Thus it is not appropriate to simply choose the global maximum as the fundamental estimate for this frame. Instead, we look at submultiples of the global maximum frequency $k_{max}/2, k_{max}/3,... k_{min}$ for local maxima. If a local maximum exists and is above an experimentally derived threshold we choose the submultiple as the $F_0$ estimate. The threshold is biased down for $F_0$ candidates near the previous frame's $F_0$ estimate, a form of backwards pitch tracking.
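The submultiple search can be sketched in C99 as follows. This is a simplified stand-in, not the reference logic: the thresholds 0.4 and 0.6 are illustrative, where the reference implementation uses experimentally derived values:

\begin{verbatim}
#include <stdlib.h>

/* Coarse F0 search over submultiples k_global/2, k_global/3, ...
   of the global maximum bin k_global of Fw[] (nbins bins).
   k_prev is the previous frame's estimate. */
int coarse_f0_bin(const float Fw[], int nbins, int k_global,
                  int k_min, int k_prev)
{
    int best = k_global;
    for (int div = 2; k_global / div >= k_min; div++) {
        int k = k_global / div;

        int k_local = k;   /* local maximum near the submultiple */
        for (int j = k - 2; j <= k + 2; j++)
            if (j >= 0 && j < nbins && Fw[j] > Fw[k_local])
                k_local = j;

        /* bias the threshold down near the previous frame's
           estimate: backwards pitch tracking */
        float t = (abs(k_local - k_prev) <= 2) ? 0.4f : 0.6f;
        if (Fw[k_local] > t * Fw[k_global])
            best = k_local;
    }
    return best;
}
\end{verbatim}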

The accuracy of the pitch estimate is then refined by maximising the function:
\begin{equation}
E(\omega_0)=\sum_{m=1}^L|S_{\omega}(b m \omega_0)|^2
E(\omega_0)=\sum_{m=1}^L|S_w(b m \omega_0)|^2
\end{equation}
where the $\omega_0=2 \pi F_0 /F_s$ is the normalised angular fundamental frequency in radians/sample, $b$ is a constant that maps a frequency in radians/sample to a DFT bin, and $S_\omega$ is the DFT of the speech spectrum for the current frame. This function will be maximised when $mF_0$ samples the peak of each harmonic, corresponding with an accurate pitch estimate. It is evaluated in a small range about the coarse $F_0$ estimate.
where $\omega_0=2 \pi F_0 /F_s$ is the normalised angular fundamental frequency in radians/sample, $b$ is a constant that maps a frequency in radians/sample to a DFT bin, and $S_w$ is the DFT of the speech spectrum for the current frame. This function will be maximised when $mF_0$ aligns with the peak of each harmonic, corresponding with an accurate pitch estimate. It is evaluated in a small range about the coarse $F_0$ estimate.
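A sketch of the refinement search (the $\pm 3\%$ range and step size here are illustrative assumptions, not the reference values):

\begin{verbatim}
#include <complex.h>
#include <math.h>

/* Refine w0 by maximising E(w0) over a small range about the
   coarse estimate.  Sw[] is the ndft-point DFT of the current
   speech frame; b = ndft/(2*pi) maps radians/sample to a bin. */
float refine_w0(const float complex Sw[], int ndft, float w0_coarse)
{
    const float b = ndft / (2.0f * (float)M_PI);
    float best_w0 = w0_coarse, best_e = -1.0f;

    for (float w0 = 0.97f * w0_coarse; w0 <= 1.03f * w0_coarse;
         w0 += 0.001f * w0_coarse) {
        int L = (int)((float)M_PI / w0);
        float e = 0.0f;
        for (int m = 1; m <= L; m++) {
            int k = (int)(b * m * w0 + 0.5f);
            e += crealf(Sw[k])*crealf(Sw[k]) + cimagf(Sw[k])*cimagf(Sw[k]);
        }
        if (e > best_e) { best_e = e; best_w0 = w0; }
    }
    return best_w0;
}
\end{verbatim}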

There is nothing particularly unique about this pitch estimator or its performance. There are occasional artefacts in the synthesised speech that can be traced to ``gross'' and ``fine'' pitch estimator errors. In the real world no pitch estimator is perfect, partially because the model assumptions around pitch break down (e.g. in transition regions or unvoiced speech). The NLP algorithm could benefit from additional review, tuning and better pitch tracking. However it appears sufficient for the use case of a communications quality speech codec, and is a minor source of artefacts in the synthesised speech. Other pitch estimators could also be used, provided they have practical, real world implementations that offer comparable performance and CPU/memory requirements.

\subsection{Sinusoidal Analysis and Synthesis}

\subsection{Voicing Estimation}

\subsection{LPC/LSP based modes}

@@ -364,14 +447,20 @@ \section{Glossary}
\hline
Symbol & Description & Units \\
\hline
$a_m$ & lower DFT index of current band \\
$b_m$ & upper DFT index of current band \\
$b$ & Constant that maps a frequency in radians/sample to a DFT bin \\
$\{A_m\}$ & Set of spectral amplitudes $m=1,...L$ & dB \\
$\{A_m\}$ & Set of harmonic magnitudes $m=1,...L$ & dB \\
$F_0$ & Fundamental frequency (pitch) & Hz \\
$F_s$ & Sample rate (usually 8 kHz) & Hz \\
$F_\omega(k)$ & DFT of squared speech signal in NLP pitch estimator \\
$L$ & Number of harmonics & \\
$F_w(k)$ & DFT of squared speech signal in NLP pitch estimator \\
$L$ & Number of harmonics \\
$P$ & Pitch period & ms or samples \\
$\{\theta_m\}$ & Set of harmonic phases $m=1,...L$ & radians \\
$\omega_0$ & Fundamental frequency (pitch) & radians/sample \\
$s(n)$ & Input speech \\
$s_w(n)$ & Time domain windowed input speech \\
$S_w(k)$ & Frequency domain windowed input speech \\
\hline
\end{tabular}
\caption{Glossary of Symbols}