Diarization Research Notes

Google Cloud

Diarization can easily be added to the Google Speech-to-Text API like so:

from google.cloud import speech

client = speech.SpeechClient()

# Diarization set-up.
diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=2,
    max_speaker_count=10,
)

# Speech-to-text request set-up. freq, numChannels, and audio are derived
# from the input clip (see the sketch below).
input_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=freq,
    audio_channel_count=numChannels,
    language_code="en",
    diarization_config=diarization_config,
)
response = client.recognize(config=input_config, audio=audio)
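
For completeness, here's a minimal sketch of how freq, numChannels, and audio could be produced from a local clip (audio_path is a hypothetical name; the file is assumed to be 16-bit PCM WAV to match LINEAR16):

import wave

from google.cloud import speech

audio_path = "conversation.wav"  # Hypothetical path to the test clip.

with wave.open(audio_path, "rb") as wav_file:
    freq = wav_file.getframerate()         # Sample rate in Hz.
    numChannels = wav_file.getnchannels()  # Channel count.
    content = wav_file.readframes(wav_file.getnframes())

# Wrap the raw bytes for the synchronous recognize() call.
audio = speech.RecognitionAudio(content=content)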

For a 25-second audio clip, the diarization process took 8.069 seconds.

The output of the model can be fetched with the following code:

result = response.results[-1]
words_info = result.alternatives[0].words

# Printing out the output.
for word_info in words_info:
    print(f"word: '{word_info.word}', speaker_tag: {word_info.speaker_tag}")

We take results[-1] because the results array is populated with intermediate transcripts; the complete conversation can be found in the last element.
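
A quick sketch to verify this, printing the word count of each result (reusing response from above):

# Earlier results hold partial transcripts; the last holds every word.
for i, result in enumerate(response.results):
    words = result.alternatives[0].words
    print(f"result {i}: {len(words)} words")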

A snippet of the word/speaker_tag output is as follows:

word: 'how's', speaker_tag: 3
word: 'your', speaker_tag: 3
word: 'semester', speaker_tag: 3
word: 'going', speaker_tag: 3
word: 'so', speaker_tag: 3
word: 'far', speaker_tag: 3
word: 'it's', speaker_tag: 2
word: 'going', speaker_tag: 2
word: 'pretty', speaker_tag: 2
word: 'good', speaker_tag: 2
word: 'how's', speaker_tag: 2
word: 'your', speaker_tag: 2
word: 'semester', speaker_tag: 2
word: 'going', speaker_tag: 2
word: 'so', speaker_tag: 2
word: 'far', speaker_tag: 2

Each unique speaker is assigned a number, which can be used to filter their words. The model seems to struggle with overlapping speech and tends to ignore it. Accuracy also seems slightly worse with diarization enabled -> more words are incorrectly transcribed / omitted from the output.
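
As a sketch, those tags make it easy to group the transcript by speaker (reusing words_info from above):

from collections import defaultdict

# Group transcribed words by their speaker tag.
words_by_speaker = defaultdict(list)
for word_info in words_info:
    words_by_speaker[word_info.speaker_tag].append(word_info.word)

for tag, words in sorted(words_by_speaker.items()):
    print(f"speaker {tag}: {' '.join(words)}")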

Note that, for some reason, setting min_speaker_count to 2 caused both speakers to be counted as one person. Increasing it to 3 allowed the model to correctly separate the speakers, as shown below. This could be because the audio clip I originally used was a conversation between my brother and me (slightly similar voices).
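
For reference, the adjusted diarization config implied by that work-around:

# Raising min_speaker_count from 2 to 3 made the model separate the
# two (similar-sounding) speakers in my test clip.
diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=3,
    max_speaker_count=10,
)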

Adding the following options to the speech.RecognitionConfig seems to boost accuracy; however, it still sometimes struggles to differentiate my brother's voice from mine.

    use_enhanced=True,
    model="phone_call",

Running the same input again will produce new tags, and the first speaker isn't always assigned tag 1. The tags are effectively random and cannot be used to determine who is the "owner" of the phone and who is the "conversation partner".
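
One workaround (a sketch, not something the API provides) is to remap the raw tags to sequential labels in order of first appearance, which at least makes the labelling deterministic within a single response:

# Remap raw speaker tags to sequential labels in order of first appearance.
tag_to_label = {}
for word_info in words_info:
    if word_info.speaker_tag not in tag_to_label:
        tag_to_label[word_info.speaker_tag] = len(tag_to_label) + 1
    print(f"word: '{word_info.word}', speaker: {tag_to_label[word_info.speaker_tag]}")

This still cannot tell us which label is the phone's owner; it only stabilizes the numbering for one run.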

Android App

Currently we are using the ACTION_RECOGNIZE_SPEECH speech recognition intent within the Android app to perform speech-to-text. This implementation seems to have speaker diarization already in place, as it will ignore subsequent speakers after the first speaker has begun.

This will be useful if we only care about one speaker -> we capture only the "conversation partner's" message to send to the LLM.

One downside is that this model is a black box: not much is said about where the recognition takes place. The official documentation says, "The implementation of this API is likely to stream audio to remote servers to perform speech recognition." We also don't have access to any settings to customize its behaviour.

We could explore ACTION_VOICE_SEARCH_HANDS_FREE as an alternative; its documentation seems promising for our usage.

NeMo

  • Setup is hard :(