WP2000 - Codec 2 Algorithm Description #31
Conversation
Looks good currently. Will other sections be handled in this PR too or another one?
@tmiw - oh I'm sorry, did I press the review button by mistake? Or maybe GitHub requests a review automatically? Anyway, I've only just started. It's very much WIP, not needing any review at this stage.
I think I get emails whenever a PR gets created, not just when requested for review. Sorry for the confusion!
That's fine. Actually at some stage it would be good to get feedback from Hams on what they would like to know about Codec 2, so I can answer the most common questions. Perhaps after a first draft of the document is ready.
Added comments based on this initial pass.
doc/codec2.tex
@@ -52,7 +52,7 @@ \section{Codec 2 for the Radio Amateur}

\subsection{Model Based Speech Coding}

- A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at an 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of "what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
+ A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of "what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
Suggested change:
- A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of "what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
+ A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of ``what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
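A quick worked check of the figures in this paragraph (illustrative arithmetic only): 16 bit samples at 8 kHz give the quoted 128 kbit/s raw rate, and the 700 bit/s HF rate then implies a compression ratio of roughly 180:1.

```latex
% Raw A/D rate and compression ratio for the 700 bit/s HF example
\[ 16~\textrm{bits} \times 8000~\textrm{Hz} = 128\,000~\textrm{bit/s},
   \qquad \frac{128\,000}{700} \approx 183:1 \]
```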
doc/codec2.tex
@@ -83,7 +83,7 @@ \subsection{Sinusoidal Speech Coding}
A sinewave will cause a spike or spectral line on a spectrum plot, so we can see each spike as a small sine wave generator. Each sine wave generator has it's own frequency that are all multiples of the fundamental pitch frequency (e.g. $230, 460, 690,...$ Hz). They will also have their own amplitude and phase. If we add all the sine waves together (Figure \ref{fig:sinusoidal_model}) we can produce reasonable quality synthesised speech. This is called sinusoidal speech coding and is the speech production ``model" at the heart of Codec 2.

\begin{figure}[h]
- \caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves.}
+ \caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves we can fit in 4kHz.}
Adding spacing before kHz to be consistent with other mentions:
Suggested change:
- \caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves we can fit in 4kHz.}
+ \caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves we can fit in 4 kHz.}
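The model in this caption translates directly into code. Below is a minimal sketch in C of summing $L$ harmonics of the fundamental; the function name, frame length, and parameter layout are illustrative assumptions, not the actual Codec 2 source.

```c
#include <math.h>

#define N_SAMP 80  /* one 10 ms frame at 8 kHz (illustrative frame size) */

/* Sum L harmonics to synthesise one frame of speech, following the
   sinusoidal model s[n] = sum_{m=1}^{L} A[m]*cos(m*Wo*n + phi[m]).
   Wo is the fundamental in radians/sample; A[] and phi[] are indexed
   1..L, so the caller supplies arrays of at least L+1 entries. */
void synthesise_frame(float s[], int L, float Wo,
                      const float A[], const float phi[])
{
    for (int n = 0; n < N_SAMP; n++) {
        s[n] = 0.0f;
        for (int m = 1; m <= L; m++)
            s[n] += A[m] * cosf(m * Wo * n + phi[m]);
    }
}
```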
doc/codec2.tex
\draw [->] (3,2) -- (4,2);
\draw [xshift=4.2cm,yshift=2cm,color=blue] plot[smooth] file {hts2a_37_sn.txt};

\end{tikzpicture}
\end{center}
\end{figure}

- The model parameters evolve over time, but can generally be considered constant for short snap shots in time (a few 10s of ms). For example pitch evolves time, moving up or down as a word is articulated.
+ The model parameters evolve over time, but can generally be considered constant for short time window (a few 10s of ms). For example pitch evolves over time, moving up or down as a word is articulated.
Suggested change:
- The model parameters evolve over time, but can generally be considered constant for short time window (a few 10s of ms). For example pitch evolves over time, moving up or down as a word is articulated.
+ The model parameters evolve over time, but can generally be considered constant for short time windows (a few 10s of ms). For example pitch evolves over time, moving up or down as a word is articulated.
doc/codec2.tex
Once we have the desired frame rate, we ``quantise"" each model parameter. This means we use a fixed number of bits to represent it, so we can send the bits over the channel. Parameters like pitch and voicing are fairly easy, but quite a bit of DSP goes into quantising the spectral amplitudes. For the higher bit rate Codec 2 modes, we design a filter that matches the spectral amplitudes, then send a quantised version of the filter over the channel. Using the example in Figure \ref{fig:hts2a_time} - the filter would have a band pass peaks at 500 and 2300 Hz. It's frequency response would follow the red line. The filter is time varying - we redesign it for every frame.

You'll notice the term "estimate" being used a lot. One of the problems with model based speech coding is the algorithms we use to extract the model parameters are not perfect. Occasionally the algorithms get it wrong. Look at the red crosses on the bottom plot of Figure \ref{fig:hts2a_time}. These mark the amplitude estimate of each harmonic. If you look carefully, you'll see that above 2000Hz, the crosses fall a little short of the exact centre of each harmonic. This is an example of a ``fine" pitch estimator error, a little off the correct value.
Suggested change:
- You'll notice the term "estimate" being used a lot. One of the problems with model based speech coding is the algorithms we use to extract the model parameters are not perfect. Occasionally the algorithms get it wrong. Look at the red crosses on the bottom plot of Figure \ref{fig:hts2a_time}. These mark the amplitude estimate of each harmonic. If you look carefully, you'll see that above 2000Hz, the crosses fall a little short of the exact centre of each harmonic. This is an example of a ``fine" pitch estimator error, a little off the correct value.
+ You'll notice the term ``estimate" being used a lot. One of the problems with model based speech coding is the algorithms we use to extract the model parameters are not perfect. Occasionally the algorithms get it wrong. Look at the red crosses on the bottom plot of Figure \ref{fig:hts2a_time}. These mark the amplitude estimate of each harmonic. If you look carefully, you'll see that above 2000Hz, the crosses fall a little short of the exact centre of each harmonic. This is an example of a ``fine" pitch estimator error, a little off the correct value.
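To make the ``fixed number of bits" idea in the quantisation paragraph above concrete, here is a minimal sketch of a uniform scalar quantiser in C. The value range and bit width are placeholders; Codec 2's actual quantisers are more elaborate.

```c
/* Uniformly quantise x in [min,max] to 'bits' bits, returning the index
   that would be sent over the channel.  Illustrative only. */
int quantise(float x, float min, float max, int bits)
{
    int   levels = 1 << bits;
    float step   = (max - min) / levels;
    int   index  = (int)((x - min) / step);
    if (index < 0)          index = 0;
    if (index > levels - 1) index = levels - 1;
    return index;
}

/* Decoder side: map the received index back to the mid-point of its step. */
float dequantise(int index, float min, float max, int bits)
{
    float step = (max - min) / (1 << bits);
    return min + (index + 0.5f) * step;
}
```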
doc/codec2.tex
Table \ref{tab:bit_allocation} presents the bit allocation for two popular Codec 2 modes. One additional parameter is the frame energy, this is the average level of the spectral amplitudes, or ``AF gain" of the speech frame.

At very low bit rates such as 700C, we use Vector Quantisation (VQ) to represent the spectral amplitudes. We construct a table such that each row of the table has a set of spectral amplitude samples. In Codec 2 700C the table has 512 rows. During the quantisation process, we choose the table row that best matches the spectral amplitudes for this frame, then send the \emph{index} of the table row. The decoder has a similar table, so can use the index to look up the output values. If the table is 512 rows, we can use a 9 bit number to quantise the spectral amplitudes. In Codec 2 700C, we use two tables of 512 entries each (18 bits total), the second one helps fine tune the quantisation from the first table.
This is the first mention of specific modes rather than just bit rates that I could tell offhand. This should be reworded as well as the modes themselves introduced earlier in the document so that the reader can properly associate e.g. 700C with "very low bit rate".
Actually, "Detailed Design" below talks about specific modes. Maybe this should just be "700 bits/second", i.e.
Suggested change:
- At very low bit rates such as 700C, we use Vector Quantisation (VQ) to represent the spectral amplitudes. We construct a table such that each row of the table has a set of spectral amplitude samples. In Codec 2 700C the table has 512 rows. During the quantisation process, we choose the table row that best matches the spectral amplitudes for this frame, then send the \emph{index} of the table row. The decoder has a similar table, so can use the index to look up the output values. If the table is 512 rows, we can use a 9 bit number to quantise the spectral amplitudes. In Codec 2 700C, we use two tables of 512 entries each (18 bits total), the second one helps fine tune the quantisation from the first table.
+ At very low bit rates such as 700 bits/second, we use Vector Quantisation (VQ) to represent the spectral amplitudes. We construct a table such that each row of the table has a set of spectral amplitude samples. In Codec 2 700C the table has 512 rows. During the quantisation process, we choose the table row that best matches the spectral amplitudes for this frame, then send the \emph{index} of the table row. The decoder has a similar table, so can use the index to look up the output values. If the table is 512 rows, we can use a 9 bit number to quantise the spectral amplitudes. In Codec 2 700C, we use two tables of 512 entries each (18 bits total), the second one helps fine tune the quantisation from the first table.
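For readers, the table search this paragraph describes reduces to a nearest-neighbour scan. A minimal sketch in C, assuming a squared-error distance and an illustrative vector dimension (the 18 bit two-stage case would run a second search of this kind on the residual):

```c
#include <float.h>

#define VQ_ROWS 512  /* 2^9 rows, so the index fits in 9 bits */
#define VQ_DIM  20   /* spectral amplitude samples per row (illustrative) */

/* Return the index of the table row closest (squared error) to the target
   spectral amplitude vector.  The encoder sends this 9 bit index; the
   decoder uses it to look up the same table. */
int vq_search(const float table[VQ_ROWS][VQ_DIM], const float target[VQ_DIM])
{
    int   best   = 0;
    float best_e = FLT_MAX;
    for (int i = 0; i < VQ_ROWS; i++) {
        float e = 0.0f;
        for (int j = 0; j < VQ_DIM; j++) {
            float d = target[j] - table[i][j];
            e += d * d;
        }
        if (e < best_e) { best_e = e; best = i; }
    }
    return best;
}
```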
Yes I think I'll move some of the DD intro info right up the top, and explain the modes/700C nomenclature there.
doc/codec2.tex
Some features of Codec 2:
\begin{enumerate}
\item A range of modes supporting different bit rates, currently (Nov 2023): 3200, 2400, 1600, 1400, 1300, 1200, 700 bits/s. These are referred to as ``Codec 2 3200", ``Codec 700C"" etc.
The use of "C" in "Codec2 700C" isn't exactly clear, especially since some Codec2 modes use letters and some don't. Suggest clarifying here.
doc/codec2.tex
@@ -100,7 +93,7 @@ \subsection{Speech in Time and Frequency}

Note that each harmonic has it's own amplitude, that varies across frequency. The red line plots the amplitude of each harmonic. In this example there is a peak around 500 Hz, and another, broader peak around 2300 Hz. The ear perceives speech by the location of these peaks and troughs.

- \begin{figure}[H]
+ \begin{figure}
\caption{ A 40ms segment from the word "these" from a female speaker, sampled at 8kHz. Top is a plot again time, bottom (blue) is a plot against frequency. The waveform repeats itself every 4.3ms ($F_0=230$ Hz), this is the "pitch period" of this segment.}
Double check all usages of double-quotes and replace opening quotes with `` as appropriate.
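As a quick consistency check on the caption's numbers, the 4.3 ms pitch period and the stated fundamental agree:

```latex
\[ F_0 = \frac{1}{T_0} = \frac{1}{4.3\times 10^{-3}~\textrm{s}} \approx 230~\textrm{Hz} \]
```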
@tmiw - I've set up a
@Tyrbiter - I've kicked off another review WP in #37, or feel free to review in this PR. I'm doing some proof reading myself, but I know I'll miss stuff.
OK, I've added the doc building as a ctest, as it needs
FWIW, freedv-gui uses GitHub actions for this but doesn't actually generate the PDF/HTML for the user manual until PRs are merged to master to avoid having to constantly deal with merge conflicts. It would probably be a good idea to also either have a ctest to verify document changes or somehow suppress the additional check-in for the module in the GitHub action unless the PR is being merged.
@tmiw - The doc ctest is bombing on GitHub, but works OK for me at home on three Ubuntu 22 machines. Could you pls take a look and see if there are any obvious issues?
The doc has its own Makefile, so I was thinking of:
@drowe67, did you still want me to investigate why the ctest for the documentation isn't running in the GitHub environment? I haven't had time to get around to it yet but just noticed that you merged.