
High latency from around 700 ms to 2000 ms - code running in India and using centralindia deployment #2730

Open
vivekfog opened this issue Jan 23, 2025 · 3 comments

@vivekfog

I am using Azure Speech Services for speech-to-text and I am getting very high latency.

I am using the centralindia deployment of the Speech service.

pip show azure-cognitiveservices-speech ⬇

Name: azure-cognitiveservices-speech
Version: 1.41.1
Summary: Microsoft Cognitive Services Speech SDK for Python
Home-page: None
Author: Microsoft
Author-email: None
License: None
Location: /Users/vivekkumar/Documents/temporary_holder/azure_stt/vocode_env/lib/python3.8/site-packages
Requires: 
Required-by: vocode

My code looks like this 🔽

import json
import time

import azure.cognitiveservices.speech as speechsdk
from tabulate import tabulate

# speech_key and service_region are placeholders for my real key and region
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
# speech_config.speech_recognition_language = "en-IN"

# Request word-level timestamps and detailed output
speech_config.set_property(speechsdk.PropertyId.SpeechServiceResponse_RequestWordLevelTimestamps, "true")
speech_config.set_property(speechsdk.PropertyId.SpeechServiceResponse_OutputFormatOption, "detailed")
# Continuous language identification between en-IN and hi-IN, with translation to English
auto_detect_source_language_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(languages=["en-IN", "hi-IN"])
speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_LanguageIdMode, "Continuous")
speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_TranslationToLanguages, "en-IN")

# Creates a recognizer with the given settings
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, auto_detect_source_language_config=auto_detect_source_language_config)
start_time = time.time()
results = []
# speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

# Shared callback for both hypothesis (recognizing) and final (recognized) events
def handle_recognized(evt, recognized):
    try:
        global start_time, results
        latency = 0
        if hasattr(evt.result, "properties"):
            latency = evt.result.properties.get(
                speechsdk.PropertyId.SpeechServiceResponse_RecognitionLatencyMs, 0
            )
        # Parse the service JSON with json.loads instead of eval()
        result_json = json.loads(evt.result.json)
        # print(result_json)
        if recognized:
            Offset = result_json["SpeechPhrase"]["Offset"]
            Duration = result_json["SpeechPhrase"]["Duration"]
            detected_language = result_json["SpeechPhrase"]["PrimaryLanguage"]["Language"].split("-")[0]
            Text = result_json["SpeechPhrase"]["DisplayText"]
            TranslatedText = result_json["Translations"][0]["DisplayText"]
            Confidence = result_json["SpeechPhrase"]["NBest"][0]["Confidence"]
        else:
            Offset = result_json["SpeechHypothesis"]["Offset"]
            Duration = result_json["SpeechHypothesis"]["Duration"]
            detected_language = result_json["SpeechHypothesis"]["PrimaryLanguage"]["Language"].split("-")[0]
            Text = result_json["SpeechHypothesis"]["Text"]
            TranslatedText = result_json["Translations"][0]["DisplayText"]
            Confidence = -1
        results.append([
                time.time() - start_time,
                latency,
                Offset / 10000,
                Duration / 10000,
                (Offset + Duration) / 10000,
                ((Offset + Duration) / 10000) + int(latency),
                detected_language,
                Text,
                TranslatedText,
                Confidence
            ])
        print_table()
        # print(time.time() - start_time, Offset/10000, Duration/10000, (Offset+Duration)/10000, ((Offset+Duration)/10000) + latency , detected_language, Text, TranslatedText, Confidence, latency)
    except Exception as e:
        print("Error: {}".format(e))


def handle_no_match(evt):
    print(f"silence detected {evt=}")
        

# Define a callback for session start
def handle_session_started(evt):
    print("Session started: {}".format(evt))
    global start_time
    start_time = time.time()

# Define a callback for session stop
def handle_session_stopped(evt):
    print("Session stopped: {}".format(evt))
    print("Stopping continuous recognition...")
    speech_recognizer.stop_continuous_recognition()

def print_table():
    global results
    headers = ["Time", "Latency", "Offset", "Duration", "End", "End+Latency", "Language", "Text", "TranslatedText", "Confidence"]
    print(tabulate(results, headers=headers, tablefmt="grid"))

# Define a callback for canceled recognition
def handle_canceled(evt):
    print("Recognition canceled: {}".format(evt.reason))
    if evt.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(evt.error_details))
    speech_recognizer.stop_continuous_recognition()

# Attach the event handlers
print(dir(speech_recognizer))
speech_recognizer.recognized.connect(lambda x: handle_recognized(x, recognized=True))
speech_recognizer.recognizing.connect(lambda x: handle_recognized(x, recognized=False))
# speech_recognizer.detected.connect(lambda x: handle_no_match(x))
speech_recognizer.session_started.connect(handle_session_started)
speech_recognizer.session_stopped.connect(handle_session_stopped)
speech_recognizer.canceled.connect(handle_canceled)

# Start continuous recognition
print("Say something...")
speech_recognizer.start_continuous_recognition()

# Keep the program running while recognition occurs
try:
    while True:
        time.sleep(0.1)  # sleep briefly instead of busy-waiting, which pegs a CPU core
except KeyboardInterrupt:
    print("Stopping...")
    speech_recognizer.stop_continuous_recognition()

Attached is the output with latency and other parameters 🔽
azure_stt_test.txt

@BrianMouncer (Contributor)

I'm not sure what type of latency you are trying to measure: network latency, latency between recognized events, etc.

Most people asking about speech latency are looking at user-perceived latency, which is the time from when they start speaking until the first hypothesis event is received back from the service. To measure that, you would need to modify your code somewhat.

I'd probably modify your wait loop to take keyboard input and reset start_time = time.time() each time a key is pressed. That way you can press a key as you begin speaking, and then print out the difference both when you receive the hypothesized (recognizing) event and when you receive the final recognized event.

This shows the latency the user will see between speaking and the application responding to them. Generally the hypothesized event is used to update some UI so the user knows they are being listened to and what is being recognized, and the final recognized event is used to take actual action on what the user said. Logging the time difference for both events gives you an idea of your application's latency both for initial user feedback and for the point at which your application can reliably know what the user asked.
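
A minimal sketch of that measurement, assuming the same speech_recognizer as in the code above (the Enter-key trigger and the on_recognizing/on_recognized/user_start names are illustrative, not SDK APIs):

import time

user_start = {"t": None}  # set when the user signals they started speaking

def on_recognizing(evt):
    # First hypothesis after the keypress approximates user-perceived latency
    if user_start["t"] is not None:
        print(f"hypothesis after {time.time() - user_start['t']:.3f}s: {evt.result.text}")

def on_recognized(evt):
    if user_start["t"] is not None:
        print(f"final after {time.time() - user_start['t']:.3f}s: {evt.result.text}")
        user_start["t"] = None  # re-arm for the next utterance

speech_recognizer.recognizing.connect(on_recognizing)
speech_recognizer.recognized.connect(on_recognized)
speech_recognizer.start_continuous_recognition()

try:
    while True:
        input("Press Enter as you begin speaking...")
        user_start["t"] = time.time()
except KeyboardInterrupt:
    speech_recognizer.stop_continuous_recognition()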

@BrianMouncer (Contributor)

If you need to speed up the latency of that very first recognition, you can bypass the initial network latency caused by DNS lookup, connecting to the service, certificate validation, upgrading from HTTP to WebSockets, and so on:

# Creates a recognizer with the given settings
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, auto_detect_source_language_config=auto_detect_source_language_config)

# Get the connection object from the recognizer
connection = speechsdk.Connection.from_recognizer(speech_recognizer)

# Open the connection ahead of time to reduce latency
connection.open(True)

But that will only help speed up the very first recognition.
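
If it is unclear whether the pre-open actually happened, the Connection object also exposes connected and disconnected events that can be logged; a small sketch, assuming the connection object from the snippet above:

connection.connected.connect(lambda evt: print("Connection opened: {}".format(evt)))
connection.disconnected.connect(lambda evt: print("Connection dropped: {}".format(evt)))
connection.open(True)  # pre-open before start_continuous_recognition()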

@vivekfog (Author)

Hi @BrianMouncer
Thanks for the response.

The latency I am talking about here is the latency reported by the SDK, which is as good as user-perceived latency. It comes as a property (SpeechServiceResponse_RecognitionLatencyMs) on the speech-to-text results.

if hasattr(evt.result, "properties"):
    latency = evt.result.properties.get(
        speechsdk.PropertyId.SpeechServiceResponse_RecognitionLatencyMs
    )

Strangely, it comes out around 700 ms to 2 seconds, so it can't be used in real-time use cases like a voicebot.

Going through the code, do you think there is any change I can make to reduce latency? I want to auto-detect the language (hi-IN or en-IN) throughout the session and translate it to English using the Azure Speech Services API. I am interested in reducing the latency not only for the first recognition but for all recognitions.
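
For example, would tuning something like this help? A sketch from my side; the 400 ms value is a guess, and AtStart mode would only work if I give up mid-session language switching:

# Emit final results sooner after the speaker pauses (milliseconds; value is a guess,
# and this property may require a recent SDK version)
speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationSilenceTimeoutMs, "400")

# AtStart language identification instead of Continuous: the language is detected
# once per connection, which is generally lower-latency than Continuous
speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_LanguageIdMode, "AtStart")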
