How to prepare gene annotations for gprofile

Serghei Mangul edited this page Jul 30, 2016 · 5 revisions

This is an internal document. Use it if you are planning to prepare gene annotations for a new organism. Note that human and mouse annotations are already prepared.

To extract the transcript identifiers (the transcript_id values) from the GTF file:

awk -F "transcript_id" '{print $2}' genes.gtf | awk -F "transcript_name" '{print $1}' | sed 's/"//g' | sed 's/;//' >transcripts.txt

To extract the gene identifiers (the gene_id values) from the GTF file:

awk -F "gene_id" '{print $2}' genes.gtf | awk -F "gene_name" '{print $1}' | sed 's/"//g' | sed 's/;//' >genes.txt

Merge them into a single file:

paste genes.txt transcripts.txt | awk '{print $1","$2}' >genes_transcripts.txt
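
The commands above split each line on fixed attribute names, so they depend on the attribute order in the GTF: if gene_id is not followed directly by gene_name, extra tokens end up in genes.txt and the merged pairs can come out wrong. As a minimal sketch of an alternative, assuming a standard nine-column, tab-separated GTF, the function below pulls both attributes out of column 9 in one pass regardless of their order. The name extract_gene_transcript_pairs is hypothetical and not part of the original workflow. Unlike the commands above, it skips comment lines and records that lack either attribute.

```shell
# Hypothetical helper (an assumption, not part of the original workflow).
# Prints "gene_id,transcript_id" for every GTF record that carries both attributes.
extract_gene_transcript_pairs() {
    awk -F '\t' '
        /^#/ { next }    # skip header and comment lines
        {
            gene = ""; tx = ""
            # gene_id "VALUE": the value starts 9 characters after RSTART
            if (match($9, /gene_id "[^"]+"/))
                gene = substr($9, RSTART + 9, RLENGTH - 10)
            # transcript_id "VALUE": the value starts 15 characters after RSTART
            if (match($9, /transcript_id "[^"]+"/))
                tx = substr($9, RSTART + 15, RLENGTH - 16)
            if (gene != "" && tx != "")
                print gene "," tx
        }
    ' "$1"
}
```

For example, `extract_gene_transcript_pairs genes.gtf >genes_transcripts.txt` would write one pair per GTF record, in the same comma-separated layout as the merged file above.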

Please download the UTR3, UTR5, and CDS BED files from here.

Prepare them in the correct format:

awk -F "_" '{print $1}' CDS_NCBIM37.bed >CDS_NCBIM37_v2.bed
awk -F "_" '{print $1}' UTR5_NCBIM37.bed >UTR5_NCBIM37_v2.bed
awk -F "_" '{print $1}' UTR3_NCBIM37.bed >UTR3_NCBIM37_v2.bed
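
The commands above keep everything before the first underscore on each line. That works when the first underscore falls in the name column, but it would cut a line short if the chromosome name itself contains an underscore (for example chr1_random) and would shorten identifiers that contain one. As a hedged sketch, assuming tab-separated BED input whose fourth column looks like `<transcript>_cds_0_0_...`, the function below trims only that column and prints the first four columns. The name trim_bed_name is hypothetical and not part of the original workflow.

```shell
# Hypothetical helper (an assumption, not part of the original workflow).
# Strips everything from the first underscore onward in BED column 4 only,
# then prints columns 1-4 separated by tabs.
trim_bed_name() {
    awk 'BEGIN { FS = OFS = "\t" }
         { sub(/_.*/, "", $4); print $1, $2, $3, $4 }' "$1"
}
```

It could be used as `trim_bed_name CDS_NCBIM37.bed >CDS_NCBIM37_v2.bed`, and likewise for the UTR5 and UTR3 files.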