RNAcode -- Analyze the protein coding potential in multiple sequence alignments
0. About RNAcode
================
RNAcode predicts protein coding regions in an alignment of homologous
nucleotide sequences. The prediction is based on evolutionary
signatures typical for protein genes, i.e. the presence of
synonymous/conservative nucleotide mutations, conservation of the
reading frame and absence of stop codons.
RNAcode does not rely on any species specific sequence characteristics
whatsoever and does not use any machine learning techniques. The only
input required for RNAcode is a multiple sequence alignment either in
MAF or Clustal W format. RNAcode reports local regions of unusually high
coding potential together with an associated p-value.
1. Installation
===============
You can compile and install RNAcode like this:
# ./configure
# make
# make install (as root)
See INSTALL for details and more advanced installation options.
2. Usage
========
2.1 Synopsis
------------
Analyze an alignment (Clustal W or MAF format) with standard options and
print a detailed results page:
# RNAcode data.aln
Analyze an alignment and show the best non-overlapping hits below a
p-value cutoff of 0.01 in GTF format:
# RNAcode --outfile results.gtf --gtf --best-region --cutoff 0.01 data.aln
Create color annotations for high scoring coding segments:
# RNAcode --eps data.aln
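The calls above can also be assembled from a script. The following is a minimal sketch, assuming the RNAcode binary is on the PATH; the helper names build_rnacode_command and run_rnacode are hypothetical and not part of RNAcode. Only options documented in this README are used.

```python
import subprocess


def build_rnacode_command(alignment, outfile=None, cutoff=None,
                          gtf=False, best_region=False, eps=False):
    """Assemble an RNAcode argument list (hypothetical helper)."""
    cmd = ["RNAcode"]
    if outfile is not None:
        cmd += ["--outfile", outfile]
    if cutoff is not None:
        cmd += ["--cutoff", str(cutoff)]
    if gtf:
        cmd.append("--gtf")
    if best_region:
        cmd.append("--best-region")
    if eps:
        cmd.append("--eps")
    cmd.append(alignment)  # the alignment file is given last
    return cmd


def run_rnacode(alignment, **options):
    """Run RNAcode and return the completed process.

    Assumes an executable named RNAcode is on the PATH; raises
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    return subprocess.run(build_rnacode_command(alignment, **options),
                          check=True, capture_output=True, text=True)
```

For example, build_rnacode_command("data.aln", gtf=True, cutoff=0.01) returns the argument list for a GTF run with a p-value cutoff of 0.01.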
2.2 Input alignment
-------------------
The input alignment needs to be formatted in ClustalW format or MAF
format (http://genome.ucsc.edu/FAQ/FAQformat#format5). The latter
format can include genomic coordinates, which can be used to
produce annotation files.
Important: RNAcode uses the first sequence as reference sequence,
i.e. all results and reported coding regions apply to this reference
sequence.
Currently the alignment has to contain at least 3 sequences. Gaps
have to be given as dashes ('-'). Unspecified letters given as 'N' are
allowed and treated neutrally during all calculations. No difference is
made between uppercase and lowercase input, i.e. soft-masked
sequences, which use lowercase letters for repeat-masked regions, are
treated the same way as unmasked sequences.
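The requirements above can be checked before RNAcode is run. This is a minimal sketch, not part of RNAcode: the function name check_alignment is hypothetical, it works on sequences that have already been read into memory, and the allowed alphabet (A, C, G, T, U, N and '-') is an assumption, since this README only mentions 'N' and '-' explicitly.

```python
def check_alignment(rows):
    """Return a list of problems found in an alignment.

    rows is a list of (name, aligned_sequence) pairs with the
    reference sequence first. An empty list means no problem was found.
    """
    problems = []
    if len(rows) < 3:
        problems.append("need at least 3 sequences, got %d" % len(rows))
    # All rows of an alignment must have the same number of columns.
    if len({len(seq) for _, seq in rows}) > 1:
        problems.append("aligned sequences differ in length")
    allowed = set("ACGTUN-")  # assumed alphabet, see the note above
    for name, seq in rows:
        # Case is ignored, so soft-masked (lowercase) input passes.
        bad = sorted(set(seq.upper()) - allowed)
        if bad:
            problems.append("%s: unexpected characters %s" % (name, "".join(bad)))
    return problems
```

A gap written as '.' instead of '-' would be reported here as an unexpected character.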
2.3 Command line
----------------
RNAcode is invoked as follows:
# RNAcode [OPTIONS] alignment.aln
alignment.aln is the alignment file and OPTIONS is any combination of
the following command-line options, given either in one-letter form
with a single dash or as long options with a double dash:
--outfile -o (default: stdout)
File to which the output is written. Defaults to standard output.
--cutoff -p (default: 1.0)
Show only regions that have a p-value below the given number. By
default all hits are shown.
--num-samples -n (default: 100)
Number of random alignments that are sampled to estimate the p-value,
i.e. the significance of a coding prediction. The default of 100
gives reasonably stable p-values that are useful for assessing the
relevance of a prediction.
--stop-early -s
Setting this option stops the sampling process as soon as it is clear
that the best hit will not fall below the given p-value cutoff. For
example, assume a p-value cutoff of 0.05 (see --cutoff) and a sample
size of 1000 is given (see --num-samples). As soon as 50 random
samples score better than the original alignment, the process is
stopped and all hits in the original alignment are reported as p>0.05
(or by convention as 1.0 in gtf and tabular output).
--best-region -r Show only best non-overlapping hits
By default all positive scoring segments are shown in the output if
they fall below the given p-value cutoff. If two hits overlap
(different frame or different strand) and --best-region is given only
the hit with the highest score is shown. Strong coding regions often
lead to statistically significant signals also in other frames. These
hits are suppressed by this option and only the correct reading frame
is reported.
--best-only -b Show only best hit
This option shows only the single best hit for each alignment.
--pars -c
Scoring parameters as comma separated string:
"DELTA,OMEGA,omega,stop_penalty"
See the appendix of the paper (section "Citing RNAcode") for an
explanation of these parameters. Default: "-10.0,-4.0,-2.0,-8.0"
--gtf -g
--tabular -t
Changes the default output to two different machine readable formats
(see next section).
--eps -e
Create colored plots in EPS format. The generated plots are resolution
independent vector graphics that can be included in any graphics
software. For each high scoring segment below a given cutoff (see
--eps-cutoff) a file named hss-N.eps is created (N is the running
number of the high scoring segment). See documents/color-legend.pdf
for an explanation of the color scheme.
--eps-cutoff -i
Create plots only for high scoring segments with a p-value below this
cutoff (default: 0.05)
--eps-dir -d
Directory to save EPS files in. Default: "eps"
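The interplay of --num-samples, --cutoff and --stop-early described above can be illustrated with a small sketch. It is not RNAcode's implementation: the function sampled_p_value and its argument sample_score (a callable that returns the best score of one random alignment) are hypothetical stand-ins, and this README does not describe how the random alignments are generated.

```python
def sampled_p_value(observed_score, sample_score, num_samples=100, cutoff=None):
    """Estimate a p-value by sampling, with optional early stopping.

    The p-value is the fraction of random samples that score equally
    well or better than the observed hit. If cutoff is given, sampling
    stops as soon as the estimate can no longer fall below it.
    Returns (p_value, stopped_early).
    """
    better = 0
    # Count of equal-or-better samples at which the cutoff is reached,
    # e.g. 0.05 * 1000 = 50 in the example given for --stop-early.
    limit = None if cutoff is None else cutoff * num_samples
    for _ in range(num_samples):
        if sample_score() >= observed_score:
            better += 1
            if limit is not None and better >= limit:
                # Reported as 1.0, following the convention for
                # gtf and tabular output.
                return 1.0, True
    return better / num_samples, False
```

With cutoff=None every sample is drawn, which corresponds to running without --stop-early.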
2.4 Output format
-----------------
In the default output each prediction is reported on one line with 10 fields.
1. HSS id: Unique running number for each high scoring segment
   predicted in one RNAcode call.
2. Frame: The reading frame phasing relative to the starting
   nucleotide position in the reference sequence. +1 means
   that the first nucleotide in the reference sequence is in
   the same frame as the predicted coding region. Negative
   frames indicate that the predicted regions are on the
   reverse complement strand.
3. Length: The length of the predicted region in amino acids.
4. From: The position of the first amino acid in the translated
   nucleotide sequence of the reference sequence, starting with 1.
5. To: The position of the last amino acid in the translated
   nucleotide sequence of the reference sequence, starting with 1.
6. Name: The name of the reference sequence as given in the input alignment.
7. Start: The nucleotide position in the reference sequence at which
   the predicted coding region starts.
8. End: The nucleotide position in the reference sequence at which
   the predicted coding region ends.
   If no genomic coordinates are given (i.e. if you provide a
   Clustal W alignment as input), the first nucleotide position in
   the reference sequence is set to 1. Otherwise Start and End are
   the 1-based genomic coordinates as given in the input MAF file.
9. Score: The coding potential score. High scores indicate high coding potential.
10. P: The p-value associated with the score. This is the probability
   that a random alignment with the same properties contains an equally
   good or better hit.
If --tabular is given, the output is printed as a tab-delimited list
without a header or any other output. With --gtf the output is formatted
as a GTF genome annotation file.
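Tab-delimited output is easy to read from a script. The sketch below assumes that each --tabular line carries the ten fields listed above, in that order; this README does not state the column layout explicitly, so check it against the output of your RNAcode version. The names Hit, parse_hit and strand are hypothetical.

```python
from collections import namedtuple

# Field names follow the ten fields of the default output (assumed order).
Hit = namedtuple("Hit", "hss_id frame length aa_from aa_to name start end score p")


def parse_hit(line):
    """Parse one tab-delimited line into a Hit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError("expected 10 tab-separated fields, got %d" % len(fields))
    hss_id, frame, length, aa_from, aa_to, name, start, end, score, p = fields
    return Hit(int(hss_id), int(frame), int(length), int(aa_from), int(aa_to),
               name, int(start), int(end), float(score), float(p))


def strand(hit):
    """Negative frames lie on the reverse complement strand."""
    return "-" if hit.frame < 0 else "+"
```

A line with a different number of columns raises a ValueError rather than being silently misread.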
3. Citing RNAcode
=================
RNAcode: robust discrimination of coding and noncoding regions in
comparative sequence data
Washietl S, Findeiss S, Muller S, Kalkhof S, von Bergen M, Hofacker IL, Stadler PF, Goldman N
RNA (2011), in revision
4. Contact
==========
Stefan Washietl <wash@mit.edu>