
High latency from around 700 ms to 2000 ms - code running in India and using centralindia deployment #2730

Open
vivekfog opened this issue Jan 23, 2025 · 3 comments

@vivekfog

I am using Azure Speech Services for speech-to-text and I am getting very high latency.

I am using the centralindia deployment of the Speech service.

pip show azure-cognitiveservices-speech ⬇

Name: azure-cognitiveservices-speech
Version: 1.41.1
Summary: Microsoft Cognitive Services Speech SDK for Python
Home-page: None
Author: Microsoft
Author-email: None
License: None
Location: /Users/vivekkumar/Documents/temporary_holder/azure_stt/vocode_env/lib/python3.8/site-packages
Requires: 
Required-by: vocode

My code looks like this 🔽

import json
import time

import azure.cognitiveservices.speech as speechsdk
from tabulate import tabulate

# speech_key and service_region are placeholders for my real key and region
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
# speech_config.speech_recognition_language = "en-IN"

# Request word-level timestamps and detailed output
speech_config.set_property(speechsdk.PropertyId.SpeechServiceResponse_RequestWordLevelTimestamps, "true")
speech_config.set_property(speechsdk.PropertyId.SpeechServiceResponse_OutputFormatOption, "detailed")
# Continuous language identification between en-IN and hi-IN, with translation to English
auto_detect_source_language_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(languages=["en-IN", "hi-IN"])
speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_LanguageIdMode, "Continuous")
speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_TranslationToLanguages, "en-IN")

# Creates a recognizer with the given settings
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, auto_detect_source_language_config=auto_detect_source_language_config)
start_time = time.time()
results = []
# speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

# Shared callback for both hypothesis (recognizing) and final (recognized) events
def handle_recognized(evt, recognized):
    try:
        global start_time, results
        latency = 0
        if hasattr(evt.result, "properties"):
            latency = evt.result.properties.get(
                speechsdk.PropertyId.SpeechServiceResponse_RecognitionLatencyMs, 0
            )
        # Parse the service JSON with json.loads instead of eval()
        result_json = json.loads(evt.result.json)
        # print(result_json)
        if recognized:
            Offset = result_json["SpeechPhrase"]["Offset"]
            Duration = result_json["SpeechPhrase"]["Duration"]
            detected_language = result_json["SpeechPhrase"]["PrimaryLanguage"]["Language"].split("-")[0]
            Text = result_json["SpeechPhrase"]["DisplayText"]
            TranslatedText = result_json["Translations"][0]["DisplayText"]
            Confidence = result_json["SpeechPhrase"]["NBest"][0]["Confidence"]
        else:
            Offset = result_json["SpeechHypothesis"]["Offset"]
            Duration = result_json["SpeechHypothesis"]["Duration"]
            detected_language = result_json["SpeechHypothesis"]["PrimaryLanguage"]["Language"].split("-")[0]
            Text = result_json["SpeechHypothesis"]["Text"]
            TranslatedText = result_json["Translations"][0]["DisplayText"]
            Confidence = -1
        results.append([
                time.time() - start_time,
                latency,
                Offset / 10000,
                Duration / 10000,
                (Offset + Duration) / 10000,
                ((Offset + Duration) / 10000) + int(latency),
                detected_language,
                Text,
                TranslatedText,
                Confidence
            ])
        print_table()
        # print(time.time() - start_time, Offset/10000, Duration/10000, (Offset+Duration)/10000, ((Offset+Duration)/10000) + latency , detected_language, Text, TranslatedText, Confidence, latency)
    except Exception as e:
        print("Error: {}".format(e))


def handle_no_match(evt):
    print(f"silence detected {evt=}")
        

# Define a callback for session start
def handle_session_started(evt):
    print("Session started: {}".format(evt))
    global start_time
    start_time = time.time()

# Define a callback for session stop
def handle_session_stopped(evt):
    print("Session stopped: {}".format(evt))
    print("Stopping continuous recognition...")
    speech_recognizer.stop_continuous_recognition()

def print_table():
    global results
    headers = ["Time", "Latency", "Offset", "Duration", "End", "End+Latency", "Language", "Text", "TranslatedText", "Confidence"]
    print(tabulate(results, headers=headers, tablefmt="grid"))

# Define a callback for canceled recognition
def handle_canceled(evt):
    print("Recognition canceled: {}".format(evt.reason))
    if evt.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(evt.error_details))
    speech_recognizer.stop_continuous_recognition()

# Attach the event handlers
print(dir(speech_recognizer))
speech_recognizer.recognized.connect(lambda x: handle_recognized(x, recognized=True))
speech_recognizer.recognizing.connect(lambda x: handle_recognized(x, recognized=False))
# speech_recognizer.detected.connect(lambda x: handle_no_match(x))
speech_recognizer.session_started.connect(handle_session_started)
speech_recognizer.session_stopped.connect(handle_session_stopped)
speech_recognizer.canceled.connect(handle_canceled)

# Start continuous recognition
print("Say something...")
speech_recognizer.start_continuous_recognition()

# Keep the program running while recognition occurs
try:
    while True:
        time.sleep(0.1)  # sleep briefly instead of busy-waiting, which pegs a CPU core
except KeyboardInterrupt:
    print("Stopping...")
    speech_recognizer.stop_continuous_recognition()

Attached is the output with latency and other parameters 🔽
azure_stt_test.txt

@BrianMouncer (Contributor)

I'm not sure what type of latency you are trying to measure: network latency, latency between recognized events, etc.

Most people asking about speech latency are looking at user-perceived latency, which is the time from when they start speaking until the first hypothesis event is received back from the service. To measure that, you would need to modify your code somewhat.

I'd probably modify your wait loop to take keyboard input and reset start_time = time.time() each time a key is pressed. That way you can press a key as you begin speaking, and then print out the difference both when you receive the hypothesized (recognizing) event and when you receive the final recognized event.

This shows the latency the user will see between speaking and the application responding to them. Generally the hypothesized event is used to update some UI so the user knows they are being listened to and what is being recognized, and the final recognized event is used to take actual action on what the user said. Logging the time difference for both events gives you an idea of your application's latency both for initial user feedback and for the point at which your application can reliably know what the user asked.
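
A minimal sketch of that measurement, assuming the same speech_recognizer as in the code above (the Enter-key trigger and the on_recognizing/on_recognized/user_start names are illustrative, not SDK APIs):

import time

user_start = {"t": None}  # set when the user signals they started speaking

def on_recognizing(evt):
    # First hypothesis after the keypress approximates user-perceived latency
    if user_start["t"] is not None:
        print(f"hypothesis after {time.time() - user_start['t']:.3f}s: {evt.result.text}")

def on_recognized(evt):
    if user_start["t"] is not None:
        print(f"final after {time.time() - user_start['t']:.3f}s: {evt.result.text}")
        user_start["t"] = None  # re-arm for the next utterance

speech_recognizer.recognizing.connect(on_recognizing)
speech_recognizer.recognized.connect(on_recognized)
speech_recognizer.start_continuous_recognition()

try:
    while True:
        input("Press Enter as you begin speaking...")
        user_start["t"] = time.time()
except KeyboardInterrupt:
    speech_recognizer.stop_continuous_recognition()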

@BrianMouncer (Contributor)

If you need to speed up the latency of that very first recognition, you can bypass the initial network latency caused by DNS lookup, connecting to the service, certificate validation, upgrading from HTTP to WebSockets, and so on:

# Creates a recognizer with the given settings
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, auto_detect_source_language_config=auto_detect_source_language_config)

# Get the connection object from the recognizer
connection = speechsdk.Connection.from_recognizer(speech_recognizer)

# Open the connection ahead of time to reduce latency
connection.open(True)

But that will only help speed up the very first recognition.
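
If it is unclear whether the pre-open actually happened, the Connection object also exposes connected and disconnected events that can be logged; a small sketch, assuming the connection object from the snippet above:

connection.connected.connect(lambda evt: print("Connection opened: {}".format(evt)))
connection.disconnected.connect(lambda evt: print("Connection dropped: {}".format(evt)))
connection.open(True)  # pre-open before start_continuous_recognition()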

@vivekfog (Author)

Hi @BrianMouncer
Thanks for the response.

The latency I am talking about here is the latency reported by the SDK, which is as good as user-perceived latency. It comes as a property (SpeechServiceResponse_RecognitionLatencyMs) on the speech-to-text results.

if hasattr(evt.result, "properties"):
    latency = evt.result.properties.get(
        speechsdk.PropertyId.SpeechServiceResponse_RecognitionLatencyMs
    )

Strangely, it comes out around 700 ms to 2 seconds, so it can't be used in real-time use cases like a voicebot.

Going through the code, do you think there is any change I can make to reduce latency? I want to auto-detect the language (hi-IN or en-IN) throughout the session and translate it to English using the Azure Speech Services API. I am interested in reducing the latency not only for the first recognition but for all recognitions.
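
For example, would tuning something like this help? A sketch from my side; the 400 ms value is a guess, and AtStart mode would only work if I give up mid-session language switching:

# Emit final results sooner after the speaker pauses (milliseconds; value is a guess,
# and this property may require a recent SDK version)
speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationSilenceTimeoutMs, "400")

# AtStart language identification instead of Continuous: the language is detected
# once per connection, which is generally lower-latency than Continuous
speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_LanguageIdMode, "AtStart")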
