Distilled Model Repeats Words When Translating to Arabic Script (Urdu or Kashmiri) #105

Open
UmerTariq1 opened this issue Dec 11, 2024 · 2 comments

@UmerTariq1
UmerTariq1 commented Dec 11, 2024

Hi,
While testing the distilled model for English-to-Urdu and English-to-Kashmiri (Arabic script) translations, I observed a recurring issue where the model repeats words excessively and fails to generate correct output. This behavior does not occur with the base model or when translating to other scripts (e.g., Kashmiri in Devanagari script).

Observations:

  1. Affected Models: Distilled model (ai4bharat/indictrans2-en-indic-dist-200M).
  2. Languages Affected: Urdu and Kashmiri (Arabic script).
  3. Issue Description: The model gets stuck and repeats the same word multiple times when translating specific sentences.
  4. When the same sentence is split into smaller parts, the issue does not occur.
  5. Roughly 0.1% of sentences in my dataset exhibit this behavior. (500 / 450k sentences)
  6. Comparison: The base model produces correct output for the same inputs.
  7. Some translations stop right after the repeated word (the output ends there), while others finish the sentence correctly after repeating the word a few times. See the example at the end.
  8. This behavior is inconsistent, and I have been unable to identify a specific pattern.
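
Observation 4 suggests a possible stopgap: split long inputs into sentences, translate each piece, and join the results. The sketch below is only an illustration of that idea. `split_sentences` and `translate_in_parts` are hypothetical helpers, and `translate_batch` stands in for any callable that maps a list of source strings to a list of translations (for example, a thin wrapper around `batch_translate`). The regex split is naive and will also break on abbreviations such as "e.g.".

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_in_parts(text: str, translate_batch: Callable[[List[str]], List[str]]) -> str:
    """Translate each sentence separately and join the pieces with spaces.

    translate_batch is a hypothetical stand-in: list of sources in, list of translations out.
    """
    pieces = split_sentences(text)
    return " ".join(translate_batch(pieces))


if __name__ == "__main__":
    # Dummy "translator" that upper-cases its input, used only to show the data flow.
    print(translate_in_parts("Seizures differ. Some are mild.", lambda xs: [x.upper() for x in xs]))
```

Translating sentence by sentence loses cross-sentence context, so this hides the symptom rather than explaining why the distilled model loops on the longer input.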

Input Example:

English Sentence:

"The brain controls how the body moves by sending out small electrical signals through the nerves to the muscles. Seizures, or convulsions, occur when abnormal signals from the brain change the way the body functions. Seizures are different from person to person. Some people have only slight shaking of a hand and do not lose consciousness. Other people may become unconscious and have violent shaking of the entire body. Shaking of the body, either mild or violent, does not always occur with seizures."

Base Model Output (Correct):

دماغ اس بات کو کنٹرول کرتا ہے کہ اعصاب کے ذریعے پٹھوں کو چھوٹے برقی اشارے بھیج کر جسم کس طرح حرکت کرتا ہے۔ دورے، یا آتشزدگی، اس وقت ہوتی ہے جب دماغ سے غیر معمولی اشارے جسم کے کام کرنے کے طریقے کو تبدیل کرتے ہیں۔ دورے ایک شخص سے دوسرے میں مختلف ہوتے ہیں۔ کچھ لوگوں کے ہاتھ ہلکے ہلکے ہوتے ہیں اور وہ ہوش نہیں کھوتے۔ دوسرے لوگ بے ہوش ہو سکتے ہیں اور پورے جسم کو پرتشدد طور پر ہلا سکتے ہیں۔ جسم کا ہلنا، ہلکا یا پرتشدد، ہمیشہ دوروں کے ساتھ نہیں ہوتا ہے۔

Distilled Model Output (Incorrect):
دماغ یہ کنٹرول کرتا ہے کہ جسم اعصاب کے ذریعے پٹھوں تک چھوٹے برقی اشارے بھیج کر کیسے حرکت کرتا ہے۔ جب دماغ سے آنے والے غیر معمولی اشارے جسم کے کام کرنے کے طریقے کو بدل دیتے ہیں تو دورے، یا الجھاو، ہوتے ہیں۔ دورے ایک شخص سے دوسرے شخص میں مختلف ہوتے ہیں۔ کچھ لوگوں کو صرف ایک ہاتھ کا ہلکا سا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہلکا ہل

(The repeated word is "ہلکا", which translates to "slight" in this context and comes from the part "some people have only slight shaking". The translation stops after the repeated word.)
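
To find affected outputs in a large dataset without reading them by hand, one option is to flag translations that contain a long run of the same token. This is a minimal sketch under that assumption: `longest_token_run` and `looks_degenerate` are hypothetical helpers, tokens are split on whitespace only, and the threshold of 4 is an arbitrary choice. It needs to stay above 2, because legitimate reduplication such as "ہلکے ہلکے" appears in the correct base model output.

```python
from typing import Tuple


def longest_token_run(text: str) -> Tuple[str, int]:
    """Return (token, length) for the longest run of identical consecutive tokens."""
    best_tok, best_len = "", 0
    prev, run = None, 0
    for tok in text.split():
        # Extend the current run if the token repeats, otherwise start a new run.
        run = run + 1 if tok == prev else 1
        prev = tok
        if run > best_len:
            best_tok, best_len = tok, run
    return best_tok, best_len


def looks_degenerate(text: str, max_run: int = 4) -> bool:
    """Flag a translation whose longest token run exceeds max_run (assumed threshold)."""
    return longest_token_run(text)[1] > max_run


if __name__ == "__main__":
    sample = "ہلکا سا " + "ہلکا " * 10
    print(longest_token_run(sample), looks_degenerate(sample))
```

Running this over all translations gives a reproducible count of affected sentences and also picks up cases where the model recovers after a few repetitions, as long as the run is longer than the threshold.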

Example Code:

```python
# initialize_model_and_tokenizer, batch_translate, IndicProcessor and `quantization`
# are set up as in the Hugging Face interface notebook linked in this repo.
en_indic_ckpt_dir = "ai4bharat/indictrans2-en-indic-dist-200M"  # distilled checkpoint
en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(en_indic_ckpt_dir, quantization)

ip = IndicProcessor(inference=True)

en_sents = [
    "The brain controls how the body moves by sending out small electrical signals through the nerves to the muscles. Seizures, or convulsions, occur when abnormal signals from the brain change the way the body functions. Seizures are different from person to person. Some people have only slight shaking of a hand and do not lose consciousness. Other people may become unconscious and have violent shaking of the entire body. Shaking of the body, either mild or violent, does not always occur with seizures.",
]

# English (Latin script) to Urdu (Arabic script)
src_lang, tgt_lang = "eng_Latn", "urd_Arab"
ur_translations = batch_translate(en_sents, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip)

print(f"\n{src_lang} - {tgt_lang}")
for input_sentence, translation in zip(en_sents, ur_translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
```

Other Example Sentences:

  • Shoulder dislocation Shoulder dislocation can cause sharp armpit and arm pain, weakens, numbness, swelling. It might also cause nerve damages. Viral infections Infections such as AIDS, chickenpox, typhoid, measles and other infections caused by a virus can cause a dull pain in the armpit [med-health.net].
    (The model repeats a word and doesn't complete the translation.)
@VarunGumma
Collaborator

Hi @UmerTariq1, thank you for the detailed bug report, we really appreciate it!

I see that you are using our distilled models from HF. Can you please check whether the same issue persists with the fairseq models? Also, we have recently developed RoPE-based IT2 models, which are more robust to longer inputs and perform better on low-resource languages. Can you please check whether the issue persists with those models as well? These new models should fit seamlessly into your current code, with just a name change.

Also, can you please share your generation_config and quantization? We expect best performance for all these models with beam=5 and fp16. Additionally, for low-resource languages like Kashmiri and Urdu, you can use higher beam sizes like 10 with the distilled models which might give better results.
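
As a rough sketch of how the two suggested beam sizes could be compared, the helper below builds keyword arguments for Hugging Face `generate`. `beam_search_kwargs` is a hypothetical helper, and the `max_length` default of 256 is an assumption rather than a value stated in this thread. `num_beams`, `num_return_sequences` and `max_length` are standard `generate` arguments.

```python
def beam_search_kwargs(num_beams: int = 5, max_length: int = 256) -> dict:
    """Build beam-search keyword arguments for model.generate (max_length is assumed)."""
    if num_beams < 1:
        raise ValueError("num_beams must be at least 1")
    return {
        "num_beams": num_beams,
        "num_return_sequences": 1,
        "max_length": max_length,
    }


if __name__ == "__main__":
    # Compare the recommended default (5) with the larger beam suggested for low-resource languages.
    for beams in (5, 10):
        print(beam_search_kwargs(beams))
    # Intended use, assuming `model` and tokenized `inputs` already exist:
    #   outputs = model.generate(**inputs, **beam_search_kwargs(10))
```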

@UmerTariq1
Author

Hi @VarunGumma, thanks for getting back.

  • I tested the RoPE-based IT2 models, both base and distilled. The problem persists with the distilled RoPE model, while the base RoPE model translates correctly.

  • My generation_config and quantization config were the same as in the Hugging Face interface notebook that's linked in this repo (so yes, beam=5 and fp16).

  • I also tested with beam 10 and the problem is still there.

  • I don't have fairseq set up, so I can't test with it. Maybe you can check on your end?

Here is the colab notebook for reference:
https://colab.research.google.com/drive/1V0KJ1tyeL5r_Y9yrS_1dAqzCToRKIRYw?usp=sharing
