WP2000 - Codec 2 Algorithm Description #31

Merged: 41 commits, Dec 11, 2023
Commits
112f3b5
kicking off Codec 2 documentation
drowe67 Nov 17, 2023
9bc86bc
building up plot support
drowe67 Nov 18, 2023
cef07b4
drafted time-freq speech section, building up sinsuoidal model figure
drowe67 Nov 18, 2023
ce5e8ba
macro for sinusoid
drowe67 Nov 18, 2023
def80d4
building up sinusoid figure
drowe67 Nov 18, 2023
f778670
sinusoidal figure OK
drowe67 Nov 19, 2023
24d7b22
parameter updates
drowe67 Nov 19, 2023
3d9443f
building up encoder block diagram
drowe67 Nov 19, 2023
1b311ba
encoder block diagram
drowe67 Nov 19, 2023
4d2492d
building up detailed design intro
drowe67 Nov 22, 2023
3dca356
building up NLP figure
drowe67 Nov 22, 2023
70bf39e
inserted DC notch into NLP
drowe67 Nov 23, 2023
04ebf69
Mooneer's suggestions - thanks
drowe67 Nov 23, 2023
ed463b0
moved some introductory info from DD to Intro
drowe67 Nov 23, 2023
17a30f0
first draft of NLP section, Glossary
drowe67 Nov 23, 2023
f95b590
sinusoidal encoder block diagram
drowe67 Nov 23, 2023
97b20b4
drafted sinusoidal analysis section
drowe67 Nov 24, 2023
899fce8
building up synthesis section
drowe67 Nov 24, 2023
0b6a207
first pass of synthesis section
drowe67 Nov 25, 2023
125a169
sinusoidal synthesiser figure
drowe67 Nov 25, 2023
b3ed577
rough draft of phase synthesis copied from source
drowe67 Nov 25, 2023
12bbb03
first draft of voicing estimation
drowe67 Nov 27, 2023
9a18256
make notation more consistent across sections
drowe67 Nov 27, 2023
ba7321c
draft of phase synthesis section
drowe67 Nov 28, 2023
fbbea09
phase synthesis edits
drowe67 Nov 28, 2023
f3b4305
phase model edits and LPC/LSP encoder block diagram
drowe67 Nov 29, 2023
067eaa7
LPC/LSP enocder description, decoder block diagram
drowe67 Dec 1, 2023
43defe5
decoder description, mode table
drowe67 Dec 2, 2023
0098976
building up 700C section
drowe67 Dec 6, 2023
71b86a8
mic EQ and VQ mean removal maths
drowemyriota Dec 6, 2023
670b278
aligning 700C figures with maths
drowemyriota Dec 8, 2023
348f68f
added LPC/LSP and LPC post figure figures, plus code to generate them
drowe67 Dec 9, 2023
c27e56d
oops we forgot to rm this in recent clean up
drowe67 Dec 10, 2023
8a9b13e
removed newamp2 code
drowe67 Dec 10, 2023
d1c085a
Added a list or source files; edited Further Work section
drowe67 Dec 10, 2023
05110e5
first pass at Makefile to build doc
drowe67 Dec 10, 2023
7e88771
proof read, minor edits, update symbol glossary
drowe67 Dec 10, 2023
ea0379f
ctest, README.md, first pass at github action
drowe67 Dec 10, 2023
21dd265
way to run doc ctest without over writing codec2.doc
drowe67 Dec 11, 2023
18c5e48
exclude test_codec2_doc when running tests on github actions
drowe67 Dec 11, 2023
b8e4527
don't need tex packages as we've excluded that test for now
drowe67 Dec 11, 2023
Binary file modified doc/codec2.pdf
Binary file not shown.
137 changes: 119 additions & 18 deletions doc/codec2.tex
@@ -43,7 +43,7 @@ \section{Introduction}

The Codec 2 project was started in 2009 in response to the problem of closed source, patented, proprietary voice codecs in the sub-5 kbit/s range, in particular for use in the Amateur Radio service.

This document describes Codec 2 at two levels. Section \ref{sect:overview} is a high level overview aimed at the Radio Amateur, while Section \ref{sect:details} contains a more detailed description with math and signal processing theory. This document is not a concise algorithmic description, instead the algorithm is defined by the reference C99 source code and automated tests (ctests).
This document describes Codec 2 at two levels. Section \ref{sect:overview} is a high level overview aimed at the Radio Amateur, while Section \ref{sect:details} contains a more detailed description with math and signal processing theory. Combined with the C source code, it is intended to give the reader enough information to understand the operation of Codec 2 in detail and embark on source code level projects, such as improvements, ports to other languages, or student and academic research projects. Issues with the current algorithms and topics for further work are also included.

The production of this document was kindly supported by an ARDC grant \cite{ardc2023}. As an open source project, many people have contributed to Codec 2 over the years - we deeply appreciate all of your support.

@@ -52,7 +52,7 @@ \section{Codec 2 for the Radio Amateur}

\subsection{Model Based Speech Coding}

A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at an 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of "what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of "what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
Collaborator:

Suggested change
A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of "what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of ``what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.


At such low bit rates we use a speech production ``model". The input speech is analysed, and we extract model parameters, which are then sent over the channel. An example of a model based parameter is the pitch of the person speaking. We estimate the pitch of the speaker, quantise it to a 7 bit number, and send that over the channel every 20ms.
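
To make that concrete, here is a minimal C sketch of quantising a pitch estimate to a 7 bit index. The 50-400 Hz range and logarithmic spacing are assumptions for illustration, not the exact Codec 2 quantiser:

```c
#include <math.h>

/* Sketch: map a pitch estimate F0 (Hz) to a 7 bit index and back.
   The 50-400 Hz range and log spacing are illustrative assumptions. */
int quantise_f0(float f0_hz) {
    const float f_min = 50.0f, f_max = 400.0f;
    if (f0_hz < f_min) f0_hz = f_min;
    if (f0_hz > f_max) f0_hz = f_max;
    float norm = log2f(f0_hz / f_min) / log2f(f_max / f_min); /* 0..1 */
    return (int)(norm * 127.0f + 0.5f);   /* 0..127 fits in 7 bits */
}

float dequantise_f0(int index) {
    const float f_min = 50.0f, f_max = 400.0f;
    return f_min * powf(f_max / f_min, index / 127.0f);
}
```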

@@ -83,7 +83,7 @@ \subsection{Sinusoidal Speech Coding}
A sinewave will cause a spike or spectral line on a spectrum plot, so we can see each spike as a small sine wave generator. Each sine wave generator has its own frequency, and these frequencies are all multiples of the fundamental pitch frequency (e.g. $230, 460, 690,...$ Hz). Each also has its own amplitude and phase. If we add all the sine waves together (Figure \ref{fig:sinusoidal_model}) we can produce reasonable quality synthesised speech. This is called sinusoidal speech coding and is the speech production ``model" at the heart of Codec 2.
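
The sinusoidal model is compact enough to sketch directly. A minimal C version of the synthesis sum follows; the variable names and the 10ms/8kHz framing are illustrative, not the reference code:

```c
#include <math.h>

#define N 80   /* one 10 ms frame at Fs = 8 kHz */

/* Sum L harmonics of the fundamental to synthesise one frame.
   Wo is the fundamental in radians/sample (2*pi*F0/Fs); A[] and
   phi[] hold the amplitude and phase of harmonics 1..L. */
void synth_frame(float s[N], float Wo, int L,
                 const float A[], const float phi[]) {
    for (int n = 0; n < N; n++) {
        s[n] = 0.0f;
        for (int m = 1; m <= L; m++)
            s[n] += A[m] * cosf(m * Wo * n + phi[m]);
    }
}
```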

\begin{figure}[h]
\caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves.}
\caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves we can fit in 4kHz.}
Collaborator:

Adding spacing before KHz to be consistent with other mentions:

Suggested change
\caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves we can fit in 4kHz.}
\caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves we can fit in 4 kHz.}

\label{fig:sinusoidal_model}
\begin{center}
\begin{tikzpicture}[>=triangle 45,x=1.0cm,y=1.0cm]
@@ -114,41 +114,43 @@ \subsection{Sinusoidal Speech Coding}
\draw [->] (0.45,0.7) -- (2.05,1.8);
\draw [->] (0.3,-2.1) -- (2.2,1.6);

% output speec
% output speech
\draw [->] (3,2) -- (4,2);
\draw [xshift=4.2cm,yshift=2cm,color=blue] plot[smooth] file {hts2a_37_sn.txt};

\end{tikzpicture}
\end{center}
\end{figure}

The model parameters evolve over time, but can generally be considered constant for short snap shots in time (a few 10s of ms). For example pitch evolves time, moving up or down as a word is articulated.
The model parameters evolve over time, but can generally be considered constant for short time window (a few 10s of ms). For example pitch evolves over time, moving up or down as a word is articulated.
Collaborator:

Suggested change
The model parameters evolve over time, but can generally be considered constant for short time window (a few 10s of ms). For example pitch evolves over time, moving up or down as a word is articulated.
The model parameters evolve over time, but can generally be considered constant for short time windows (a few 10s of ms). For example pitch evolves over time, moving up or down as a word is articulated.


As the model parameters change over time, we need to keep updating them. The rate at which we update them is known as the \emph{frame rate} of the codec, which can be expressed in terms of frequency (Hz) or time (ms). For sampling model parameters Codec 2 uses a frame rate of 10ms. For transmission over the channel we reduce this to 20-40ms in order to lower the bit rate. The trade off with a lower frame rate is reduced speech quality.

The parameters of the sinusoidal model are:
\begin{enumerate}
\item The frequency of each sine wave. As they are all harmonics of $F_0$ we can just send $F_0$ to the decoder, and it can reconstruct the frequency of each harmonic as $F_0,2F_0,3F_0,...,LF_0$. We used 5-7 bits/frame to represent the $F_0$ in Codec 2.
\item The magnitude of each sine wave, $A_1,A_2,...,A_L$. These ``spectral magnitudes" are really important as they convey the information the ear needs to understand speech. Most of the bits are used for spectral magnitude information. Codec 2 uses between 20 and 36 bits/frame for spectral amplitude information.
\item Voicing information. Speech can be approximated into voiced speech (vowels) and unvoiced speech (like consonants), or some mixture of the two. The example in Figure \ref{fig:hts2a_time} above is for voiced speech. So we need some way to describe voicing to the decoder. This requires just a few bits/frame.
\item The phase of each sine wave Codec 2 discards the phases of each harmonic and reconstruct them at the decoder using an algorithm, so no bits are required for phases. This results in some drop in speech quality.
\item The frequency of each sine wave. As they are all harmonics of $F_0$ we can just send $F_0$ to the decoder, and it can reconstruct the frequency of each harmonic as $F_0,2F_0,3F_0,...,LF_0$. We use 5-7 bits/frame to represent $F_0$ in Codec 2.
\item The amplitude of each sine wave, $A_1,A_2,...,A_L$. These ``spectral amplitudes" are really important as they convey the information the ear needs to understand speech. Most of the bits are used for spectral amplitude information. Codec 2 uses between 18 and 50 bits/frame for spectral amplitude information.
\item Voicing information. Speech can be approximated into voiced speech (vowels) and unvoiced speech (like consonants), or some mixture of the two. The example in Figure \ref{fig:hts2a_time} above is voiced speech. So we need some way to describe voicing to the decoder. This requires just a few bits/frame.
\item The phase of each sine wave. Codec 2 discards the phases of each harmonic at the encoder and reconstructs them at the decoder using an algorithm, so no bits are required for phases. This results in some drop in speech quality.
\end{enumerate}
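
The four parameter types above map naturally onto a per-frame structure. The reference C sources use a similar MODEL struct; the field names and the harmonic limit below are illustrative assumptions:

```c
#define MAX_HARM 80   /* assumed upper bound on harmonics below 4 kHz */

typedef struct {
    float Wo;                 /* fundamental frequency, radians/sample */
    int   L;                  /* number of harmonics, roughly pi/Wo    */
    float A[MAX_HARM + 1];    /* amplitude of harmonics 1..L           */
    float phi[MAX_HARM + 1];  /* phases: discarded at the encoder,     */
                              /* synthesised again at the decoder      */
    int   voiced;             /* 1 = voiced, 0 = unvoiced              */
} MODEL;
```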

\subsection{Codec 2 Block Diagram}
\subsection{Codec 2 Encoder and Decoder}

This section explains how the Codec 2 encoder and decoder work, using block diagrams.

\begin{figure}[h]
\caption{Codec 2 Encoder.}
\caption{Codec 2 Encoder}
\label{fig:codec2_encoder}
\begin{center}
\begin{tikzpicture}[auto, node distance=2cm,>=triangle 45,x=1.0cm,y=1.0cm,align=center,text width=2cm]

\node [input] (rinput) {};
\node [input, right of=rinput,node distance=1cm] (z) {};
\node [block, right of=z] (pitch_est) {Pitch Estimator};
\node [input, right of=rinput,node distance=0.5cm] (z) {};
\node [block, right of=z,node distance=1.5cm] (pitch_est) {Pitch Estimator};
\node [block, below of=pitch_est] (fft) {FFT};
\node [block, right of=fft,node distance=3cm] (est_am) {Estimate Amplitudes};
\node [block, below of=est_am] (est_v) {Estimate Voicing};
\node [block, right of=est_am,node distance=3cm] (quant) {Quantise};
\node [block, right of=est_am,node distance=3cm] (quant) {Decimate Quantise};
\node [output, right of=quant,node distance=2cm] (routput) {};

\draw [->] node[align=left] {Input Speech} (rinput) -- (pitch_est);
@@ -159,25 +161,124 @@ \subsection{Codec 2 Block Diagram}
\draw [->] (pitch_est) -| (quant);
\draw [->] (est_am) -- (quant);
\draw [->] (est_v) -| (quant);
\draw [->] (quant) -- (routput) node[right, align=left, text width=1.5cm] {Bit Stream};

\end{tikzpicture}
\end{center}
\end{figure}

The encoder is presented in Figure \ref{fig:codec2_encoder}. Frames of input speech samples are passed to a Fast Fourier Transform (FFT), which converts the time domain samples to the frequency domain. The same frame of input samples is used to estimate the pitch of the current frame. We then use the pitch and frequency domain speech to estimate the amplitude of each sine wave.
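
One common way to estimate each amplitude, and a reasonable sketch of this step (though the exact band edges and FFT size here are assumptions), is to take the RMS energy of the FFT bins spanning each harmonic:

```c
#include <math.h>

#define NFFT 512   /* assumed FFT size */

/* mag2[] holds |S(k)|^2 for bins 0..NFFT/2-1 of the input frame.
   Each harmonic m owns the band (m-0.5)Wo .. (m+0.5)Wo; its
   amplitude estimate is the RMS energy over that band. */
void estimate_amplitudes(const float mag2[NFFT / 2], float Wo, int L,
                         float A[]) {
    float r = NFFT / (2.0f * M_PI);   /* radians/sample -> FFT bins */
    for (int m = 1; m <= L; m++) {
        int lo = (int)((m - 0.5f) * Wo * r + 0.5f);
        int hi = (int)((m + 0.5f) * Wo * r + 0.5f);
        float e = 0.0f;
        for (int k = lo; k < hi && k < NFFT / 2; k++)
            e += mag2[k];
        A[m] = sqrtf(e);
    }
}
```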

Yet another algorithm is used to determine if the frame is voiced or unvoiced. This works by comparing the spectrum to what we would expect for voiced speech (e.g. lots of spectral lines). If the energy is more random and continuous rather than discrete lines, we consider it unvoiced.
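
A crude sketch of that decision: measure how much of the spectral energy sits in narrow bands around the harmonics of $F_0$. The actual estimator is more careful (an MBE-style comparison against a synthetic harmonic spectrum); the band width and 0.5 threshold below are assumptions:

```c
#include <math.h>

#define NFFT 512   /* assumed FFT size */

/* Returns 1 (voiced) if most of the energy is concentrated within
   +/-1 bin of each harmonic of the pitch estimate. */
int is_voiced(const float mag2[NFFT / 2], float Wo, int L) {
    float e_harm = 0.0f, e_total = 1e-6f;
    for (int k = 0; k < NFFT / 2; k++)
        e_total += mag2[k];
    for (int m = 1; m <= L; m++) {
        int centre = (int)(m * Wo * NFFT / (2.0f * M_PI) + 0.5f);
        for (int k = centre - 1; k <= centre + 1; k++)
            if (k >= 0 && k < NFFT / 2)
                e_harm += mag2[k];
    }
    return (e_harm / e_total) > 0.5f;   /* assumed threshold */
}
```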

Up until this point the processing happens at a 10ms frame rate. However in the next step we ``decimate" the model parameters - this means we discard some of the model parameters to lower the frame rate, which helps us lower the bit rate. Decimating to 20ms (throwing away every 2nd set of model parameters) doesn't have much effect, but beyond that the speech quality starts to degrade. So there is a trade off between decimation rate and bit rate over the channel.
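
Decimation itself is simple - the interesting part is the quality trade off, not the code. A sketch using the MODEL struct above:

```c
/* Keep every 2nd 10 ms frame to reach a 20 ms channel frame rate;
   the decoder interpolates the discarded frames back. */
void decimate_frames(const MODEL in[], MODEL out[], int n_in) {
    for (int i = 0; i < n_in / 2; i++)
        out[i] = in[2 * i];
}
```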

Once we have the desired frame rate, we ``quantise" each model parameter. This means we use a fixed number of bits to represent it, so we can send the bits over the channel. Parameters like pitch and voicing are fairly easy, but quite a bit of DSP goes into quantising the spectral amplitudes. For the higher bit rate Codec 2 modes, we design a filter that matches the spectral amplitudes, then send a quantised version of the filter over the channel. Using the example in Figure \ref{fig:hts2a_time} - the filter would have band pass peaks at 500 and 2300 Hz. Its frequency response would follow the red line. The filter is time varying - we redesign it for every frame.

You'll notice the term "estimate" being used a lot. One of the problems with model based speech coding is the algorithms we use to extract the model parameters are not perfect. Occasionally the algorithms get it wrong. Look at the red crosses on the bottom plot of Figure \ref{fig:hts2a_time}. These mark the amplitude estimate of each harmonic. If you look carefully, you'll see that above 2000Hz, the crosses fall a little short of the exact centre of each harmonic. This is an example of a ``fine" pitch estimator error, a little off the correct value.
Collaborator:

Suggested change
You'll notice the term "estimate" being used a lot. One of the problems with model based speech coding is the algorithms we use to extract the model parameters are not perfect. Occasionally the algorithms get it wrong. Look at the red crosses on the bottom plot of Figure \ref{fig:hts2a_time}. These mark the amplitude estimate of each harmonic. If you look carefully, you'll see that above 2000Hz, the crosses fall a little short of the exact centre of each harmonic. This is an example of a ``fine" pitch estimator error, a little off the correct value.
You'll notice the term ``estimate" being used a lot. One of the problems with model based speech coding is the algorithms we use to extract the model parameters are not perfect. Occasionally the algorithms get it wrong. Look at the red crosses on the bottom plot of Figure \ref{fig:hts2a_time}. These mark the amplitude estimate of each harmonic. If you look carefully, you'll see that above 2000Hz, the crosses fall a little short of the exact centre of each harmonic. This is an example of a ``fine" pitch estimator error, a little off the correct value.


Often the errors interact, for example the fine pitch error shown above will mean the amplitude estimates are a little bit off as well. Fortunately these errors tend to be temporary, and are sometimes not even noticeable to the listener - remember this codec is often used for HF/VHF radio where channel noise is part of the normal experience.

\begin{figure}[h]
\caption{Codec 2 Decoder}
\label{fig:codec2_decoder}
\begin{center}
\begin{tikzpicture}[auto, node distance=2cm,>=triangle 45,x=1.0cm,y=1.0cm,align=center,text width=2cm]

\node [input] (rinput) {};
\node [block, right of=rinput,node distance=2cm] (dequantise) {Dequantise Interpolate};
\node [block, right of=dequantise,node distance=3cm] (recover) {Recover Amplitudes};
\node [block, right of=recover,node distance=3cm] (synthesise) {Synthesise Speech};
\node [block, above of=synthesise] (phase) {Synthesise Phases};
\node [output, right of=synthesise,node distance=2cm] (routput) {};

\draw [->] node[align=left, text width=1.5cm] {Bit Stream} (rinput) -- (dequantise);
\draw [->] (dequantise) -- (recover);
\draw [->] (recover) -- (synthesise);
\draw [->] (recover) |- (phase);
\draw [->] (phase) -- (synthesise);
\draw [->] (synthesise) -- (routput) node[right, align=left, text width=1.5cm] {Output Speech};

\end{tikzpicture}
\end{center}
\end{figure}

Figure \ref{fig:codec2_decoder} shows the operation of the Codec 2 decoder. We take the sequence of bits received from the channel and recover the quantised model parameters: pitch, spectral amplitudes, and voicing. We then resample the model parameters back up to the 10ms frame rate using a technique called interpolation. For example, say we receive an $F_0=200$ Hz pitch value, then 20ms later $F_0=220$ Hz. We can use the average $F_0=210$ Hz for the middle 10ms frame.
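
A sketch of that interpolation step for the $F_0=200/220$ Hz example, reusing the MODEL struct from earlier. Real decoders also interpolate energy and spectral amplitudes; only the fundamental is shown:

```c
#include <math.h>

/* Construct the middle 10 ms frame between two received 20 ms
   frames by averaging the fundamental. */
MODEL interpolate_middle(const MODEL *prev, const MODEL *next) {
    MODEL mid = *prev;
    mid.Wo = 0.5f * (prev->Wo + next->Wo);
    mid.L = (int)(M_PI / mid.Wo);   /* harmonics that fit in 4 kHz */
    return mid;
}
```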

The phases of each harmonic are generated using the other model parameters and some DSP. It turns out that if you know the amplitude spectrum, you can determine a ``reasonable" phase spectrum using some DSP operations, which in practice is implemented with a couple of FFTs. We also use the voicing information - for unvoiced speech we use random phases (a good way to synthesise noise-like signals) - and for voiced speech we make sure the phases are chosen so the synthesised speech transitions smoothly from one frame to the next.
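
A sketch of the voicing-dependent part of phase synthesis: a running excitation phase keeps voiced frames continuous across frame boundaries, while unvoiced harmonics get random phases. The amplitude-spectrum-derived phase component (the FFT part) is omitted here, and this simplified form is an assumption:

```c
#include <math.h>
#include <stdlib.h>

/* ex_phase is persistent state: the excitation phase track. */
void synthesise_phases(MODEL *model, float *ex_phase, int frame_len) {
    *ex_phase += model->Wo * frame_len;             /* advance track */
    *ex_phase -= 2.0f * M_PI * floorf(*ex_phase / (2.0f * M_PI));
    for (int m = 1; m <= model->L; m++) {
        if (model->voiced)
            model->phi[m] = m * (*ex_phase);        /* coherent */
        else
            model->phi[m] = 2.0f * M_PI * rand() / RAND_MAX;  /* noise */
    }
}
```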

Frames of speech are synthesised using an inverse FFT. We take a blank array of FFT samples, and at intervals of $F_0$ insert samples with the amplitude and phase for each harmonic. We then inverse FFT to create a frame of time domain samples. These frames of synthesised speech samples are carefully aligned with the previous frame to ensure smooth frame-frame transitions, and output to the listener.
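
A sketch of the IFFT synthesis idea, with assumed sizes and an assumed external ifft() routine (e.g. from a library such as kiss_fft); the windowing and overlap-add with the previous frame is omitted:

```c
#include <complex.h>
#include <math.h>

#define NFFT 512   /* assumed synthesis FFT size */

void ifft(float complex x[], int n);   /* assumed external inverse FFT */

/* Place each harmonic at its nearest FFT bin (mirrored so the
   output is real), then inverse FFT to get time domain samples. */
void synthesise_frame(float s[NFFT], const MODEL *model) {
    float complex S[NFFT] = {0};
    for (int m = 1; m <= model->L; m++) {
        int b = (int)(m * model->Wo * NFFT / (2.0f * M_PI) + 0.5f);
        S[b] = model->A[m] * cexpf(I * model->phi[m]);
        S[NFFT - b] = conjf(S[b]);
    }
    ifft(S, NFFT);
    for (int n = 0; n < NFFT; n++)
        s[n] = crealf(S[n]);
}
```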

\subsection{Bit Allocation}

\section{Signal Processing Details}
Table \ref{tab:bit_allocation} presents the bit allocation for two popular Codec 2 modes. One additional parameter is the frame energy, which is the average level of the spectral amplitudes, or ``AF gain" of the speech frame.

At very low bit rates such as 700C, we use Vector Quantisation (VQ) to represent the spectral amplitudes. We construct a table such that each row of the table has a set of spectral amplitude samples. In Codec 2 700C the table has 512 rows. During the quantisation process, we choose the table row that best matches the spectral amplitudes for this frame, then send the \emph{index} of the table row. The decoder has a similar table, so can use the index to look up the output values. If the table is 512 rows, we can use a 9 bit number to quantise the spectral amplitudes. In Codec 2 700C, we use two tables of 512 entries each (18 bits total), the second one helps fine tune the quantisation from the first table.
Collaborator:

This is the first mention of specific modes rather than just bit rates that I could tell offhand. This should be reworded as well as the modes themselves introduced earlier in the document so that the reader can properly associate e.g. 700C with "very low bit rate".

Collaborator:

Actually, "Detailed Design" below talks about specific modes. Maybe this should just be "700 bits/second", i.e.

Suggested change
At very low bit rates such as 700C, we use Vector Quantisation (VQ) to represent the spectral amplitudes. We construct a table such that each row of the table has a set of spectral amplitude samples. In Codec 2 700C the table has 512 rows. During the quantisation process, we choose the table row that best matches the spectral amplitudes for this frame, then send the \emph{index} of the table row. The decoder has a similar table, so can use the index to look up the output values. If the table is 512 rows, we can use a 9 bit number to quantise the spectral amplitudes. In Codec 2 700C, we use two tables of 512 entries each (18 bits total), the second one helps fine tune the quantisation from the first table.
At very low bit rates such as 700 bits/second, we use Vector Quantisation (VQ) to represent the spectral amplitudes. We construct a table such that each row of the table has a set of spectral amplitude samples. In Codec 2 700C the table has 512 rows. During the quantisation process, we choose the table row that best matches the spectral amplitudes for this frame, then send the \emph{index} of the table row. The decoder has a similar table, so can use the index to look up the output values. If the table is 512 rows, we can use a 9 bit number to quantise the spectral amplitudes. In Codec 2 700C, we use two tables of 512 entries each (18 bits total), the second one helps fine tune the quantisation from the first table.

Owner Author:

Yes I think I'll move some of the DD intro info right up the top, and explain the modes/700C nomenclature there.


Vector Quantisation can only represent what is present in the tables, so if it sees anything unusual (for example a different microphone frequency response or background noise), the quantisation can become very rough and speech quality poor. We train the tables at design time using a database of speech samples and a training algorithm - an early form of machine learning.
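
The search itself is a straightforward nearest-neighbour scan over the table. A sketch follows; the vector dimension of 20 is an assumption, and the second stage runs the same search on the residual (target minus the first stage's chosen row):

```c
#include <float.h>

#define VQ_ROWS 512
#define VQ_DIM  20   /* assumed samples per table row */

/* Return the 9 bit index of the table row with the smallest
   squared error against the target spectral amplitude vector. */
int vq_search(const float target[VQ_DIM],
              const float table[VQ_ROWS][VQ_DIM]) {
    int best = 0;
    float best_e = FLT_MAX;
    for (int i = 0; i < VQ_ROWS; i++) {
        float e = 0.0f;
        for (int j = 0; j < VQ_DIM; j++) {
            float d = target[j] - table[i][j];
            e += d * d;
        }
        if (e < best_e) { best_e = e; best = i; }
    }
    return best;   /* 0..511 fits in 9 bits */
}
```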

Codec 2 3200 uses the method of fitting a filter to the spectral amplitudes; this approach tends to be more forgiving of small variations in the input speech spectrum, but is not as efficient in terms of bit rate.

\begin{table}[H]
\label{tab:bit_allocation}
\centering
\begin{tabular}{l c c }
\hline
Parameter & 3200 & 700C \\
\hline
Pitch $F_0$ & 7 & 5 \\
Spectral Amplitudes $\{A_m\}$ & 50 & 18 \\
Energy & 5 & 3 \\
Voicing & 2 & 1 \\
Bits/frame & 64 & 28 \\
Frame Rate & 20ms & 40ms \\
Bit rate & 3200 & 700 \\
\hline
\end{tabular}
\caption{Bit allocation of the 3200 and 700C modes}
\end{table}
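
The bit rates in the table follow directly from bits/frame divided by the frame period:

\[
\frac{64\ \mbox{bits}}{0.02\ \mbox{s}} = 3200\ \mbox{bits/s},
\qquad
\frac{28\ \mbox{bits}}{0.04\ \mbox{s}} = 700\ \mbox{bits/s}
\]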

\section{Detailed Design}
\label{sect:details}

\cite{griffin1988multiband}
Codec 2 is based on sinusoidal \cite{mcaulay1986speech} and Multi-Band Excitation (MBE) \cite{griffin1988multiband} vocoders that were first developed in the late 1980s. Descendants of the MBE vocoders (IMBE, AMBE etc) have enjoyed widespread use in many applications such as VHF/UHF hand held radios and satellite communications. In the 1990s the author studied sinusoidal speech coding \cite{rowe1997techniques}, which provided the skill set and a practical, patent free baseline for starting the Codec 2 project.

Some features of Codec 2:
\begin{enumerate}
\item A range of modes supporting different bit rates, currently (Nov 2023): 3200, 2400, 1600, 1400, 1300, 1200, 700 bits/s. These are referred to as ``Codec 2 3200", ``Codec 2 700C" etc.
Collaborator:

The use of "C" in "Codec2 700C" isn't exactly clear, especially since some Codec2 modes use letters and some don't. Suggest clarifying here.

\item Modest CPU (a few 10s of MIPs) and memory (a few 10s of kbytes of RAM) requirements such that it can run on stm32 class microcontrollers with hardware FPU.
\item An open source reference implementation in the C language for C99/gcc compilers, and a \emph{cmake} build and test framework that runs on Linux. Also included is a cross compiled stm32 reference implementation.
\item Ports to non-C99 compilers (e.g. MSVC, some microcontrollers, native builds on Windows) are left to third party developers - we recommend the tests also be ported and pass before considering the port successful.
\item Codec 2 has been designed for digital voice over radio applications, and retains intelligible speech at a few percent bit error rate.
\item A suite of automated tests used to verify the implementation.
\item A pitch estimator based on a 2nd order non-linearity developed by the author.
\item A single voiced/unvoiced binary voicing model.
\item A frequency domain IFFT/overlap-add synthesis model for voiced and unvoiced speech.
\item For the higher bit rate modes, spectral amplitudes are represented using LPCs extracted from time domain analysis and scalar LSP quantisation.
\item For Codec 2 700C, vector quantisation of resampled spectral amplitudes in the log domain.
\item Minimal interframe prediction in order to minimise error propagation and maximise robustness to channel errors.
\item A post filter that enhances the speech quality of the baseline codec, especially for low pitched (male) speakers.
\end{enumerate}

\subsection{Non-Linear Pitch Estimation}

The Non-Linear Pitch (NLP) pitch estimator was developed by the author, and is described in detail in chapter 4 of \cite{rowe1997techniques}. There is nothing particularly unique about this pitch estimator or its performance. Other pitch estimators could also be used, provided they have practical, real world implementations that offer comparable performance and CPU/memory requirements. This section presents an overview of the NLP algorithm extracted from \cite{rowe1997techniques}.
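
A heavily simplified sketch of the core idea: squaring the speech (the 2nd order non-linearity) concentrates energy at $F_0$, which can then be located with a spectral peak search. The real algorithm uses a DC notch filter, low pass filtering, decimation, an FFT and sub-multiple post processing; the direct DFT search over 50-400 Hz below is an assumption for clarity:

```c
#include <math.h>

float nlp_sketch(const float speech[], int n, float fs) {
    float sq[n], mean = 0.0f;   /* C99 VLA */
    for (int i = 0; i < n; i++) {
        sq[i] = speech[i] * speech[i];   /* 2nd order non-linearity */
        mean += sq[i];
    }
    mean /= n;
    for (int i = 0; i < n; i++)
        sq[i] -= mean;   /* crude DC notch: squaring creates a large DC term */

    /* direct DFT magnitude search for the pitch peak */
    float best_f = 50.0f, best_mag = 0.0f;
    for (float f = 50.0f; f <= 400.0f; f += 1.0f) {
        float re = 0.0f, im = 0.0f, w = 2.0f * M_PI * f / fs;
        for (int i = 0; i < n; i++) {
            re += sq[i] * cosf(w * i);
            im -= sq[i] * sinf(w * i);
        }
        float mag = re * re + im * im;
        if (mag > best_mag) { best_mag = mag; best_f = f; }
    }
    return best_f;   /* F0 estimate in Hz */
}
```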


\subsection{Sinusoidal Analysis and Synthesis}

\subsection{LPC/LSP based modes}

\subsection{Codec 2 700C}

\section{Further Work}

\begin{enumerate}
\item Using c2sim to extract and plot model parameters
\item some examples aimed at the experimenter - e.g. using c2sim to extract and plot model parameters
\item How to use tools to single step through codec operation
\item table summarising source files with one line description
\item Add doc license (Creative Commons?)
\end{enumerate}


10 changes: 10 additions & 0 deletions doc/codec2_refs.bib
@@ -22,3 +22,13 @@ @misc{ardc2023
note = {\url{https://www.ardc.net/apply/grants/2023-grants/enhancing-hf-digital-voice-with-freedv/}}
}

@article{mcaulay1986speech,
title={Speech analysis/synthesis based on a sinusoidal representation},
author={McAulay, Robert and Quatieri, Thomas},
journal={IEEE Transactions on Acoustics, Speech, and Signal Processing},
volume={34},
number={4},
pages={744--754},
year={1986},
publisher={IEEE}
}