Not able to translate long protein sequences to 3Di #38

Open
cheche0109 opened this issue Jan 27, 2025 · 0 comments

Comments

@cheche0109

My observation is that when ProstT5 translates a protein sequence longer than roughly 500 amino acids, it is very likely to turn the 1D protein sequence into a 3Di string that is almost entirely d, which is nonsensical. When I shortened the same sequence to 480 amino acids, the 3Di representation looked reasonable. The three runs below use the same protein at full length (2940 residues) and truncated to 900 and 480 residues.
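One way to put a number on the "all d" behaviour is to measure how much of a predicted 3Di string is the d state and how long the longest uninterrupted d run is. The helper below is only a sketch for comparing outputs across sequence lengths. The function name `d_collapse_stats` is hypothetical and not part of the ProstT5 repository, and no particular threshold is implied.

```python
from itertools import groupby


def d_collapse_stats(three_di: str) -> dict:
    """Summarise how strongly a 3Di string is dominated by the 'd' state.

    Hypothetical helper for illustration; not part of ProstT5.
    """
    s = three_di.strip().lower()
    if not s:
        return {"length": 0, "d_fraction": 0.0, "longest_d_run": 0}
    # Length of the longest consecutive run of 'd' characters (0 if none).
    longest = max(
        (sum(1 for _ in run) for state, run in groupby(s) if state == "d"),
        default=0,
    )
    return {
        "length": len(s),
        "d_fraction": s.count("d") / len(s),
        "longest_d_run": longest,
    }


if __name__ == "__main__":
    # Paste a predicted 3Di string here to inspect it.
    print(d_collapse_stats("ddpvvvddddd"))
```

Running this on the 3Di output of each run makes it easy to compare the 2940, 900, and 480 residue cases side by side.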

python translate.py -i ./large_5.faa -o out --half 1 --is_3Di 0

Using device: cuda:0

Result directory already exists! - Watch out to not overwriting existing results!

is_3Di is False. (0=expect input to be AA, 1= input is 3Di

##########################

Loading model from: Rostlab/ProstT5

##########################

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565

##########################

Input is 3Di: False . Sequence below should be lower-case if input is 3Di.

Example sequence: >GUT_GENOME107594_03343

MDNDKRKKVYDVLTSKTGYKDSFDDFNKFMDDGAARRKVYDVLREQTGYRDSFEDFDKFMSSSHQGEMPVPIVPMRPTEPDLSFSNPSAGKFVQNGLDANEELLNDVKAEPVKDGMGRSYFPVKPQKVMEDEIRREYQVEPIESQINNAIVANEEAIRRIEQGKTEDYDRKAEEHPFLHALGGFARQHEGRVPDVNPSNDKEALRNLYAERNKLEEAKKVIEASRTDGILSNISAGVKDAVTDMGFYDFGMTEMSDNSRLLGIKAKLDNKQPLTEDEQKLLNAAALAQGVQSSHQDRISPWYTAGQTTANMAPFMAEMLVNPAAGLGKAAQKAVTKKVSGLVGKSLTRKMLPKLSRVAGDVAGASVMTGTTGIMRTVADAEGRMLGDVNSSIRDGEIVGNGFSGGLDAGEASAKAFGANTIENWSEMLGEYFAPALRGMGMVADKGMRKMGLGRVSDFISDINSTSLARGIDDFLEKTQWNGPIGEIAEEEAGIIANSYITGDNKLSDLTDPRLQLDIVLGVGLFGGFVSGMKTIGYRSPSKIAEKDLKRAEKNASSMFDNWEDIRLEIENTDEEQLPDTLNSIIGHAAGNDNAKKAAIVDYAYSLQKWRGVNAAKLKQTVENPAEAQNLVENEENGTGVEAIPEFDKVSVYRNFKRAERKVAQALPNTPIEKIEDVTDVDKFAADNGLNEEQKSAVTDYMAAKEPYSIYQSDVEARKEVVKSNAREQAVKDAERTSNPDTGFITQVKRKFSETPVYLVGGNLSFGDDGLLDRSNSTETVYYLDENGKRLPAPAEDFDSIVSQSSKEELVANAEAQATADFDAQENESLASPEILPPGIGETVSLDGGTYVIEGADNDNPGNFSALKLNTDGEIEVTPGLSEQISLSPDEYYEAKETELWKNDGIQPEQPVGKVDEVVAPALVQGVDKMPEISAENTSGPEMEDNKVSEETPEQRLQKVVDSLPKKKDGNIDYKALTPQQRFDYTSAAESPEVAIEDLKSDVAVKNEELEKINARLEKAIGGERVELRDTIRSKKKELDELTAFFQSVVPEQPDTSVNEEQPSVPEDVRTDEDYVEWVADNSDDAEEVLGAYSVAKELASHEQTLKPWQRELLGRKVSTSSFVRFGDRNHITGTLAKGWLRKDGEEIDSIAQELTANGVDVSEQDIVDFILDNPSNRVSEISDTMRSLSSRFSEIATKETGIPVGGPESNTGKLYIQLKEANKKIDELTDEQKADMRNALIADMDASDVQRSGDYYESLADYAEQYDRFRDEMNAEEADEAVIRQMEAESPTLYHGGFTADELDDIYSQIENDNGTERQTEDSGETQPALSGDEIEQREEPGIPDAVGAEDSESEGQDSGVVPDIEEIQDNTLDNEDESLSLHLNSKEEENGTISESVPQGERREETPQNGSLEEATDRLRERSRANEEARSRSGKTLSFQEKLAEEARETEAYAKEKGSWIPMSKVFDLGASGPSGNEADTYISQDNHIYKVNNLMNSKGILPLLERVALHNAIFPSSQYELTGFTGFEGGNVYPVLRQRYVPNATLSSPEEIDSYMRSLGFKQTGEAAYSNGDVVISDLRPRNVLKDTDGDLYVVDADFKKEDAVSFEASPISPGENVLDYAERISREKEMHDVRQSVDTNPTDAQKEAGNYKKGHIRLDGYDITIENPKGSERSGTDAKGGKWSVTMNNDYGYIRGTQGVDGDHIDVFLSDDPTTGNVYVIDQVKEDGSFDEHKVMYGFGSALAAKRAYLSNYSKGWNGLGKITQVSKDEFRKWVNSSRRKTKPFAEYKSVKMESDVRTDRQGNPVDADGKLIIEGNRLVTDKRYAELLERMRKKLGGQMNMGVDPEILAIGTEMAVYHIEKGARKFAEYAKAMIADLGDVIRPYLKSFYNGARDLPEMQELAKNMDSYNDVSSFDVVNFDKVIPDVINGIATMAEEKEIKRQADVANAAIKKVRSKNKKKNNNVSLPLGDLFNQNIEEYGKEQRKESDSGSEGNQG
TNGQLGEGAWEEDRKSGLQGETGSVSGRDGADADRGGRVHGVSVGRQSSVKRNRNNYSFGDSHIDVPSGDVAKLKANVAAIRTLKEIEESGLPATDEQKAILAKYSGWGGLSNALNDEKYNARKSYYGADKNWNEKYLPYYEQLIELLTPEEFRSAVQSTTTSHYTPETVIRSMWDIAGRIGVKGGDVSEPAMGIGRIIGLMPDETSSRSRISGYEIDSLSGRISKALYPDANIKVQGYETEFFPQSKDLVITNVPFGKQAPYDKALEKTLKKQMKGAYNLHNYFIAKSLLELKEGGIGIFVTSSATMDGASSRFREFASSGGFDLVGAIRLPNDAFQKDAGTSVTADVLVFRRRKSGEKPNEINFISTTQIGEGNYQENGETRTKPIMVNEYFASHPEMMLGEMMTAYDAGSGGLYSGASQTLKAKPGMDLQKALDAAVKKLTENVNIGIENADSRLENTEKEQTTLKNGTLSVKDGKVYVAMNGVLEEIAVKDKFVYSGKTRKTADAVNEYNELKSTLRELISEEQKKGGNPEPLRKKLNEQYDGFVGKYGTLNRNKALDDVFDEDFEHNLPLSLETVRRVPSPTGKSMVYEVEKGKGILDKRVNYPVEEPTKADSVKDAINISRSYKGNIDIPYIARLTGKGEEEVTEEMLRDGSAYRDPLTGTLVDRATYLSGNVKSKLEDARAMAENDPAFDKNVADLEKVQPETIRFGDISYRLGTPWIPAQYINEFAENVLGISGVDVTYMPSLNEFVVGKHARISDFEKSGAIGTDRVGAIDLFAYAINQRKPKIYDEHTEYGPSGSIKVRTVNEAETQAAAEKIMEISDKFIEYIDGRKGIHRELERIYNDKYNNYVLKKYELPSFSHMEKDEDGKEKMVTHYPNSNTSISMREHQARAIQRSIEGSTLLAHQVGTGKTFTMITTAMEMRRLGLARKPMIV

##########################

Average sequence length: 2940 measured over 1 sequences

Parameters used for generation: {'do_sample': True, 'num_beams': 3, 'top_p': 0.95, 'temperature': 1.2, 'top_k': 6, 'repetition_penalty': 1.2}

Example generation for 0_GUT_GENOME107594_03343:

seqs[batch_idx]

ddpvvvvvvvvvvcvppvddddpvvvvvvvpdpvvvvvvvvvvcvppvddddpvvvvvvvvppddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd
dddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd

Translating 1 proteins with an avg. length of 2940 took 2.7[m] (164.3[s/protein])

Writing results ...

python translate.py -i ./large_5.faa -o out --half 1 --is_3Di 0

Using device: cuda:0

Result directory already exists! - Watch out to not overwriting existing results!

is_3Di is False. (0=expect input to be AA, 1= input is 3Di

##########################

Loading model from: Rostlab/ProstT5

##########################

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565

##########################

Input is 3Di: False . Sequence below should be lower-case if input is 3Di.

Example sequence: >GUT_GENOME107594_03343

MDNDKRKKVYDVLTSKTGYKDSFDDFNKFMDDGAARRKVYDVLREQTGYRDSFEDFDKFMSSSHQGEMPVPIVPMRPTEPDLSFSNPSAGKFVQNGLDANEELLNDVKAEPVKDGMGRSYFPVKPQKVMEDEIRREYQVEPIESQINNAIVANEEAIRRIEQGKTEDYDRKAEEHPFLHALGGFARQHEGRVPDVNPSNDKEALRNLYAERNKLEEAKKVIEASRTDGILSNISAGVKDAVTDMGFYDFGMTEMSDNSRLLGIKAKLDNKQPLTEDEQKLLNAAALAQGVQSSHQDRISPWYTAGQTTANMAPFMAEMLVNPAAGLGKAAQKAVTKKVSGLVGKSLTRKMLPKLSRVAGDVAGASVMTGTTGIMRTVADAEGRMLGDVNSSIRDGEIVGNGFSGGLDAGEASAKAFGANTIENWSEMLGEYFAPALRGMGMVADKGMRKMGLGRVSDFISDINSTSLARGIDDFLEKTQWNGPIGEIAEEEAGIIANSYITGDNKLSDLTDPRLQLDIVLGVGLFGGFVSGMKTIGYRSPSKIAEKDLKRAEKNASSMFDNWEDIRLEIENTDEEQLPDTLNSIIGHAAGNDNAKKAAIVDYAYSLQKWRGVNAAKLKQTVENPAEAQNLVENEENGTGVEAIPEFDKVSVYRNFKRAERKVAQALPNTPIEKIEDVTDVDKFAADNGLNEEQKSAVTDYMAAKEPYSIYQSDVEARKEVVKSNAREQAVKDAERTSNPDTGFITQVKRKFSETPVYLVGGNLSFGDDGLLDRSNSTETVYYLDENGKRLPAPAEDFDSIVSQSSKEELVANAEAQATADFDAQENESLASPEILPPGIGETVSLDGGTYVIEGADNDNPGNFSALKLNTDGEIEVTPGLSEQISLSPDEYYEAKETELW

##########################

Average sequence length: 900 measured over 1 sequences

Parameters used for generation: {'do_sample': True, 'num_beams': 3, 'top_p': 0.95, 'temperature': 1.2, 'top_k': 6, 'repetition_penalty': 1.2}

Example generation for 0_GUT_GENOME107594_03343:

seqs[batch_idx]

ddpvvlvvvvvvvcvppvddddsvvvvvcvvdpvslvvvvvvvcvvpvddddsvvvvvvvvvvpdddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddplvvlvvllvvlvvvlvvlvvvvvvvvvvvvpdddddddddddddddddddddddpvvsvvvnvvsvvvsvvsvvvsvvvvvvvpddlvvllvlllvvlcvvvvvddpppvvvvlvvllvvlvvcvvvvhdddpvsvvsvvvvvvvvvvvvvvvvpddpssvvsnvvnvcvvvvvvcvvdvppdddpvvvvvvvvvvvvpddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd

Translating 1 proteins with an avg. length of 900 took 0.6[m] (33.7[s/protein])

Writing results ...

python translate.py -i ./large_5.faa -o out --half 1 --is_3Di 0

Using device: cuda:0

Result directory already exists! - Watch out to not overwriting existing results!

is_3Di is False. (0=expect input to be AA, 1= input is 3Di

##########################

Loading model from: Rostlab/ProstT5

##########################

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565

##########################

Input is 3Di: False . Sequence below should be lower-case if input is 3Di.

Example sequence: >GUT_GENOME107594_03343

MDNDKRKKVYDVLTSKTGYKDSFDDFNKFMDDGAARRKVYDVLREQTGYRDSFEDFDKFMSSSHQGEMPVPIVPMRPTEPDLSFSNPSAGKFVQNGLDANEELLNDVKAEPVKDGMGRSYFPVKPQKVMEDEIRREYQVEPIESQINNAIVANEEAIRRIEQGKTEDYDRKAEEHPFLHALGGFARQHEGRVPDVNPSNDKEALRNLYAERNKLEEAKKVIEASRTDGILSNISAGVKDAVTDMGFYDFGMTEMSDNSRLLGIKAKLDNKQPLTEDEQKLLNAAALAQGVQSSHQDRISPWYTAGQTTANMAPFMAEMLVNPAAGLGKAAQKAVTKKVSGLVGKSLTRKMLPKLSRVAGDVAGASVMTGTTGIMRTVADAEGRMLGDVNSSIRDGEIVGNGFSGGLDAGEASAKAFGANTIENWSEMLGEYFAPALRGMGMVADKGMRKMGLGRVSDFISDINSTSLARGIDDFLEKTQW

##########################

Average sequence length: 480 measured over 1 sequences

Parameters used for generation: {'do_sample': True, 'num_beams': 3, 'top_p': 0.95, 'temperature': 1.2, 'top_k': 6, 'repetition_penalty': 1.2}

Example generation for 0_GUT_GENOME107594_03343:

seqs[batch_idx]

ddpvvlvvvvvvvcvvpvddddsvvvvvcvvdpvslvvvvvvvcvvpvddddsvvvvvvvvvvpdddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddpvvvvvvvvvvppppdllvvlvvllvvlvvvlvvlvvvvvvvvvvvcvvcvpvvvvvvvvvvpppddppppvvvsvvvnvvsvvvsvlsvllsvlvvvlvpddlvvllvlllvllcvvvvvddpcllvllpdpllvvlvvcvvvvhdddpvsvssvssnvssvvsvvvvvvsddpssvvsnvcnvcvvvvvvcvvcvppppppvvvvvvvvvvvvpddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd
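Since the 480-residue run looks plausible, a possible stopgap is to cut long proteins into overlapping windows of at most 480 residues and translate each window separately. The sketch below only does the splitting. The function name `split_into_windows` and the overlap of 80 are assumptions chosen for illustration, the window size of 480 is taken from the run above, and nothing here is confirmed by the maintainers. Windowing also removes long-range context, so predictions near window boundaries may differ from a full-length prediction, and how to stitch the per-window 3Di strings back together is left open.

```python
from typing import Iterator, Tuple


def split_into_windows(
    seq: str, window: int = 480, overlap: int = 80
) -> Iterator[Tuple[int, str]]:
    """Yield (start_index, subsequence) pairs covering seq with overlapping windows.

    Hypothetical helper; window and overlap are illustrative assumptions.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if not seq:
        return
    step = window - overlap
    start = 0
    while True:
        yield start, seq[start:start + window]
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(seq):
            break
        start += step


if __name__ == "__main__":
    toy = "M" * 1000  # placeholder sequence, not real data
    for start, chunk in split_into_windows(toy):
        print(start, len(chunk))
```

Each window could then be written as its own FASTA record (for example with the start index appended to the header) and passed to translate.py as usual.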
