INTRODUCTION
This set of two Perl scripts calculates conserved branch length (BL) and probability of conserved targeting (PCT),
both as described in Friedman et al., 2008. This code produces essentially the same output as displayed at
TargetScanHuman Release 7.0.
The code is the same as for Release 5, except that (a) thresholds for branch length bins have been modified to those
used in TargetScan Release 6 and (b) targetscan_70_BL_PCT.pl has been modified to accept updated site type names.
The code for Release 7 is also designed for use with an expanded set of species (such as those in UCSC Bioinformatics'
100-way multiz alignments).
All species must be identified by their NCBI Taxonomy IDs, such as
8364 9031 9258 9361 9365 9371 9544 9598 9706 9615 9685 9785 9796 9913 9986 10090 10116 10141 13616 28377 30611 37347 42254
See NCBI Taxonomy or http://www.targetscan.org/vert_70/docs/species100.html for the meaning of each of these.
Any other species will be ignored during BL calculations.
Some Perl modules are also required:
Statistics::Lite (for summary statistics)
Bio::TreeIO (part of BioPerl, for parsing tree structure)
You may get several warnings like
(in cleanup) Can't call method ... on an undefined value at ... Bio/Tree/Node.pm ... during global destruction.
These can be safely ignored.
These scripts are normally run after predicting miRNA targets with targetscan_70.pl
(the first zip archive link at the bottom of http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_70).
Context score calculation (with targetscan_70_context_scores.pl) is not a prerequisite.
All files in the directory "PCT_parameters" were generated by Vikram Agarwal.
========================
targetscan_70_BL_bins.pl
The first script is targetscan_70_BL_bins.pl. It must be run first.
It uses a UTR alignment to measure relative conservation across species for that gene by calculating a mean BL,
which is then used to assign the UTR to a bin (numbered 1 - 10). This information is needed before
the next script can calculate a BL (and PCT, if the site is for a conserved miRNA family) for each predicted miRNA site.
Input file for targetscan_70_BL_bins.pl:
1) a UTR file = a tab-delimited multiple sequence alignment of the 3' UTRs of genes from the desired species,
   which is the same as the input file for targetscan_70.pl.
   Each line of this UTR alignment file (ex: test/input/UTR_Sequences_sample.txt) consists of 3 tab-separated entries:
      1) Gene symbol or transcript ID
      2) Species ID (which should match species IDs in miRNA input file)
      3) Sequence
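As an illustration of the three-column UTR format described above, here is a minimal Python sketch that groups aligned sequences by gene. It is not part of the TargetScan scripts. The function name read_utr_alignment and the sample gene IDs and sequences are invented for this example, and the sketch assumes the file has no header line.

```python
from collections import defaultdict

def read_utr_alignment(lines):
    """Group aligned UTR sequences by gene: {gene_id: {species_id: sequence}}."""
    utrs = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Exactly three tab-separated entries are expected per line.
        gene_id, species_id, sequence = line.split("\t")
        utrs[gene_id][species_id] = sequence
    return dict(utrs)

# Made-up records, only to show the layout (species IDs are NCBI Taxonomy IDs).
sample = [
    "GENE_A\t9606\tAUGC--UUA\n",
    "GENE_A\t10090\tAUGCAAUUA\n",
    "GENE_B\t9606\tGGC-UA\n",
]
utrs = read_utr_alignment(sample)
for gene_id, by_species in utrs.items():
    print(gene_id, sorted(by_species))
```

With a real file, the same function can be called as read_utr_alignment(open("test/input/UTR_Sequences_sample.txt")).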
This script also requires a directory called "PCT_parameters" containing the following file:
Tree.generic.txt
This is a tree file describing the phylogeny of all TargetScanHuman 7 species as defined by the conservation of their UTRs.
The output file (sample: test/output/UTRs_median_BLs_bins.txt) has three tab-delimited fields:
1) gene identifier
2) mean branch length for this gene's UTR. This is printed but ignored by the next script.
3) UTR bin (1 - 10) indicating degree of conservation
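For readers who want to inspect this output outside of Perl, the following Python sketch loads the three fields into a gene-to-bin lookup. It is only an illustration. The name read_utr_bins and the sample values are hypothetical, and skipping lines whose third field is not an integer (for example a possible header) is an assumption, not documented behavior.

```python
def read_utr_bins(lines):
    """Map gene identifier -> UTR bin (int); the mean BL column is not used."""
    bins = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            continue  # assumption: ignore anything that is not a 3-field record
        gene_id, _mean_bl, utr_bin = fields
        if not utr_bin.isdigit():
            continue  # assumption: ignore a header or malformed line
        bins[gene_id] = int(utr_bin)
    return bins

# Made-up values, only to show the layout.
sample = ["GENE_A\t1.234\t3\n", "GENE_B\t4.5\t10\n"]
bins = read_utr_bins(sample)
print(bins)
```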
The script can be executed in 2 different ways:
1) Running the script without any arguments (./targetscan_70_BL_bins.pl) will print out a help screen.
2) Running the script with input filenames and output file will perform the analysis. Ex:
./targetscan_70_BL_bins.pl test/input/UTR_Sequences_sample.txt > UTRs_median_BLs_bins.txt
========================
targetscan_70_BL_PCT.pl
The second script is targetscan_70_BL_PCT.pl. It must be run after targetscan_70_BL_bins.pl.
This goes through the file of predicted miRNA target sites, as generated by targetscan_70.pl
(from http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_70) and, for each group
of overlapping sites, calculates a branch length (BL) indicating the degree of conservation across species,
taking into account the degree of UTR conservation as indicated by the bin (calculated by targetscan_70_BL_bins.pl).
If the site is for a highly conserved miRNA family, probability of conserved targeting (PCT) is also
calculated. Given the site-type-specific thresholds for BL, each site can then be classified as
conserved or poorly conserved.
This script also requires a 'sort' application accepting Linux-like arguments. If this is not on your system,
you may sort the "predicted targets" input file yourself numerically by the 8th column and rename it,
including "sort" somewhere in the name. If this program sees the word "sort" somewhere in this filename,
it assumes the file has already been sorted and skips sorting it.
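If no suitable 'sort' program is available, the manual pre-sort described above could be done along the following lines. This is a sketch in Python, not the method the script itself uses. The function name sort_by_column is invented here, and keeping lines with a non-numeric 8th field (such as a possible header) at the top is an assumption.

```python
def sort_by_column(lines, column=8):
    """Return lines sorted numerically by a 1-based tab-delimited column."""
    header, records = [], []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        key = fields[column - 1] if len(fields) >= column else ""
        try:
            records.append((float(key), line))
        except ValueError:
            # Assumption: lines without a numeric key (e.g. a header) stay on top.
            header.append(line)
    records.sort(key=lambda pair: pair[0])  # stable: ties keep their input order
    return header + [line for _, line in records]

def row(group):
    """Build a placeholder 9-column record whose 8th column is `group`."""
    return "\t".join(["."] * 7 + [str(group), "."]) + "\n"

sorted_rows = sort_by_column([row(10), row(2), row(7)])
print([line.split("\t")[7] for line in sorted_rows])
```

The sorted lines would then be written to a file with "sort" in its name (for example targetscan_70_output.sorted.txt, a hypothetical name) so that the script skips its own sorting step.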
Input files for targetscan_70_BL_PCT.pl:
1) miRNA file
   The same miRNA file used by targetscan_70.pl (sample: test/input/miR_Family_info_sample.txt).
   Each line of the miRNA seed sequence file consists of 3 tab-separated entries:
      1) Name of the miRNA family
      2) The 7-nucleotide seed region sequence
      3) Species ID of this miRNA family (which should match species IDs in UTR input file)
   The third column is ignored by this script.
2) predicted targets
   The output file generated by targetscan_70.pl (sample: test/output/targetscan_70_output.txt)
3) UTR bin info
   The output file generated by targetscan_70_BL_bins.pl (sample: test/output/UTRs_median_BLs_bins.txt)
This script also requires a directory called "PCT_parameters" containing the following files:
Three files indicating background frequencies of all highly conserved seed regions
7mer_1a_PCT_parameters.txt
7mer_m8_PCT_parameters.txt
8mer_PCT_parameters.txt
Ten files, each a tree file describing the phylogeny of one bin's UTRs
Tree.bin_*.txt
The output file (sample: test/output/targetscan_70_output.BL_PCT.txt) has 14 fields:
Fields 1 - 11 are the same as in the "predicted targets" input file
12 = branch length (calculated for all sites)
13 = PCT (probability of conserved targeting) IF the miRNA family is highly conserved (or "NA" if not)
14 = "x" if the site is conserved, based on site-type-specific BL thresholds:
8mer >= 0.8; 7mer-m8 >= 1.3; 7mer-1A >= 1.6
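The threshold rule for field 14 can be expressed as a small lookup. The Python sketch below is illustrative only. The site type strings are copied from this README and may not match the exact names written in the output file, the function name conserved_flag is invented, and returning an empty string for non-conserved or unlisted site types is an assumption.

```python
# Branch length thresholds per site type, as listed in this README.
BL_THRESHOLDS = {"8mer": 0.8, "7mer-m8": 1.3, "7mer-1A": 1.6}

def conserved_flag(site_type, branch_length):
    """Return "x" if branch_length meets the threshold for site_type, else ""."""
    threshold = BL_THRESHOLDS.get(site_type)
    if threshold is None:
        return ""  # assumption: site types without a listed threshold are never flagged
    return "x" if branch_length >= threshold else ""

for site_type, bl in [("8mer", 0.95), ("7mer-m8", 1.0), ("7mer-1A", 1.6)]:
    print(site_type, bl, repr(conserved_flag(site_type, bl)))
```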
The script can be executed in 2 different ways:
1) Running the script without any arguments (./targetscan_70_BL_PCT.pl) will print out a help screen.
2) Running the script with input filenames and output file will perform the analysis. Ex:
./targetscan_70_BL_PCT.pl test/input/miR_Family_info_sample.txt test/output/targetscan_70_output.txt test/output/UTRs_median_BLs_bins.txt > targetscan_70_output.BL_PCT.txt
NOTES
These scripts were designed on a Linux platform. When running these scripts on Windows or Mac platforms, make sure to call the native perl binary.
This can be done explicitly by executing the script as 'perl targetscan_70_BL_bins.pl' or by changing the first line of the script to point to the native binary.
Use all input files with Unix-style end-of-line characters.
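Input files saved on Windows (CRLF) or old Mac systems (CR) can be converted to Unix-style line endings before running the scripts. One possible way is sketched below in Python. The function names to_unix_eol and convert_file are invented for this example.

```python
def to_unix_eol(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Replace CRLF first so that it does not become two line feeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def convert_file(src_path, dst_path):
    """Write a copy of src_path with Unix-style line endings to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_unix_eol(data))

print(to_unix_eol(b"GENE_A\t9606\tAUGC\r\nGENE_B\t9606\tGGCU\r\n"))
```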
QUESTIONS/SUGGESTIONS:
Please direct all correspondence to wibr-bioinformatics@wi.mit.edu