FlowClus
This program can both filter and denoise reads produced by
454 and Ion Torrent sequencing technologies. It was written
in C and compiled with gcc (version 4.8.2) on Ubuntu Linux.
This file describes the usage and parameters of the program
in great detail.
Parameter names are subject to change -- run ./FlowClus -h
for the latest.
***************************************************************
Usage:
./FlowClus {-m master.csv} [optional parameters]
The order of the parameters does not matter.
Parameters specified as "options" below do not require
arguments. Other parameters should be followed by the
appropriate type argument (integer, floating-point value,
or string).
***************************************************************
-h Help option
This option will cause the program to print brief usage
information to stderr and exit.
More complete descriptions of the parameters are
found in this README file.
***************************************************************
Required parameter:
-m <str> Input master file with primer and mid tag sequences
(one per line, comma- or tab-delimited)
Lines for primers should begin with "primer", followed
by the name of the primer and its sequence.
Lines for mid tags should begin with "midtag", followed
by the name of the sample and the sequence. They
should be listed following the primer with which
they were used. The sample names should NOT include
a space or a hyphen (' ' or '-'); either character
may cause an error if the filtering and denoising
steps are run separately.
The sequences can contain IUPAC ambiguous DNA codes,
but should not use regular expressions.
Here is a sample master file for an amplicon sequenced
bidirectionally from its two primers (341F and 926R)
generated from two samples (MID12 and MID13):
primer,341F,CCTACGGGAGGCAGCAG
midtag,sample1,ATATCGCGAG
midtag,sample2,AGACTCGACGT
primer,926R,CCGTCAATTCMTTTGAGTTT
midtag,sample1,AGCACTGTAG
midtag,sample2,ACGTCGGGTCT
Note that it is perfectly acceptable to use the same
sample name with different primers.
To search for the reverse primer at the 3' end of a
read (see Sequence Analysis, below), add one extra
line after each "primer" line. The extra line should
begin with "reverse", followed by the sequence of the
reverse primer (5' to 3', not reverse-complemented):
primer,341F,CCTACGGGAGGCAGCAG
reverse,CCGTCAATTCMTTTGAGTTT
midtag,sample1,ATATCGCGAG
midtag,sample2,AGACTCGACGT
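As an illustration of the layout described above, the
following Python sketch reads a master file into a simple
structure. It is not part of FlowClus: the function name
parse_master and the record fields are hypothetical, and it
assumes that "reverse" and "midtag" lines always follow the
"primer" line to which they belong.

```python
import re

def parse_master(text):
    """Parse master-file text into a list of primer records (illustrative only)."""
    primers = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        fields = re.split(r"[,\t]", line)  # comma- or tab-delimited
        kind = fields[0]
        if kind == "primer":
            primers.append({"name": fields[1], "seq": fields[2],
                            "reverse": None, "midtags": []})
            continue
        if not primers:
            raise ValueError("line precedes any primer: " + line)
        if kind == "reverse":
            primers[-1]["reverse"] = fields[1]
        elif kind == "midtag":
            name, seq = fields[1], fields[2]
            # Sample names must not contain a space or a hyphen.
            if " " in name or "-" in name:
                raise ValueError("bad sample name: " + name)
            primers[-1]["midtags"].append((name, seq))
        else:
            raise ValueError("unknown line type: " + kind)
    return primers

example = (
    "primer,341F,CCTACGGGAGGCAGCAG\n"
    "reverse,CCGTCAATTCMTTTGAGTTT\n"
    "midtag,sample1,ATATCGCGAG\n"
    "midtag,sample2,AGACTCGACGT\n"
)
records = parse_master(example)
```

Here records holds one entry for primer 341F, with its
reverse primer and two (sample, mid tag) pairs.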
***************************************************************
Analysis options:
-st Option to print status updates to stdout while FlowClus
is running
-a Option to filter only
-b Option to denoise only
-ab Option to filter and denoise (default)
If you choose to filter the reads, you must specify an
input sff.txt file (see next).
If you choose "denoise only", you must have reads that
have already been filtered by this program (or have
the same format) as input files.
-i <str> Input sff.txt file
This is the file that is converted from a 454 .sff
file using the 454 Tools program "sffinfo" (results
have also been good with sff.txt files produced by
the script process_sff.py in QIIME). It is required
in order to perform filtering.
The label names that are used in the analysis of this
file are given in FlowClus.h. If these do not match
those used in your sff.txt file, they will need to
be changed accordingly.
Filtering can be performed on only one sff.txt file at
a time. You may wish to concatenate multiple .sff
files using the 454 Tools program "sfffile" before
converting to text and filtering with FlowClus
(provided that no two samples from different .sff
files share the same mid tag sequence).
-f <str> File extension for the filtered flowgrams
(default: ".flow")
If filtering is performed, separate output files will
be produced for each of the primers. The files
will be the primer name (as given in the master
file) followed by this file extension.
If the "denoise only" option is specified, these
files will be used as the inputs.
The format of a filtered flowgram file is as follows:
The first line indicates the number of flows and
the flow order. After that, each line will contain
the filtered flowgram for a read. It will begin
with the read header (454 accession number), the
name of its sample, and optionally the flow start
number (if the flow order is irregular). Then the
flows will be listed, beginning with the flow of
the last base of the primer -- this is due to the
way the flowgram is denoised by FlowClus. The
flowgram will end with the last "good" flow of the
sequence -- either the flow corresponding to the
last base, or the flow immediately prior to where
the read was truncated.
Flow values greater than the maximum specified (see
-u below) will be changed to that maximum value.
-o <str> Output fasta file after denoising
(default: "denoised.fasta")
If denoising is performed, this output fasta file
is produced.
NOTE: For this, and any other output files, if a
file of that name already exists, you will be
prompted to overwrite. If you decline to
overwrite, the program will terminate so you can
specify a new file name.
***************************************************************
Other input/output files and options:
-e <str> Output fasta file after filtering
If a file name is specified, a fasta file will be
produced for the reads after the filtering step.
The flowgrams will not be re-interpreted at this
stage, so the reads should not be altered except
for possibly missing bases at the 3' end (the
"3' gap" as a wise man once termed it).
-x Option to produce "QIIME-style" fasta files
If this option is specified, the output fasta
file(s) (after filtering and/or after denoising)
will have the same format as that produced by
split_libraries.py or inflate_denoiser_output.py
in QIIME. The reads can be used immediately in
an OTU clustering, such as by pick_otus.py.
Each new fasta header will contain the sample name,
followed by "_", a unique number, a space, and
the read's 454 accession number.
The mid tag and primer sequences will be removed
from the 5' ends of the reads.
The output cleaned flowgrams will not be affected.
-c <str> Output file for counts from filtering step
The filtering step allows for a variety of criteria
with which to eliminate or to truncate reads. For
each criterion specified by the user, a tally of
the number of reads eliminated or truncated for
that reason will be printed to the given file.
This is followed by a tally of the number of mid
tag - primer matches and reads printed for each
of the primers. Then a full breakdown of the
counts for each sample is printed.
-cv <str> Output file for detailed filtering information
for each read
In the output file, each read that matches a mid tag
and primer will be listed, along with its status
("Eliminated", "Truncated", or "Passed"), the
criterion that resulted in its elimination or
truncation, and its length before and after
filtering. If the read is eliminated, its length
after filtering will not be listed, and if the
read is not truncated, its length before filtering
will not be listed. Note that these lengths
include the mid tags and primers, so the output
fasta files may contain sequences whose lengths
do not exactly match those listed in this file
(e.g. if the -x option is specified).
-d <str> Output file for denoising "misses"
Denoising in FlowClus consists of matching a read's
flow values to those of each cluster. If a pair
of flow values is sufficiently distinct, the read
does not join that cluster.
If a file is specified with -d, that pair of flow
values is recorded. At the conclusion of
denoising, the read vs. cluster flow values are
printed to the given file in comma-separated form.
The file can then be visualized as a levelplot in
R to give some indications of how well the
denoising worked and if the parameters should be
adjusted.
-v Option to produce consensus flowgram and
mapping files after denoising
-vf <str> File extension for denoised flowgrams
(default: ".den")
-vm <str> File extension for mapping files
(default: ".map")
The denoising process of FlowClus relies on creating,
for each cluster, consensus flowgrams and a set of
reads that map to that cluster. These are used to
produce the denoised fasta file.
If the -v option is specified, the consensus
flowgrams and read headers for each cluster will
be printed for each primer. The file names will
be the primer name followed by the specified (or
default) file extensions.
If denoising is performed using a trie (see -tr below),
a mapping file will not be produced. Instead, one
can use the mapping file generated for chimera-
checking (see -cm, below), although this may list a
read from an internal node more than once.
-ch Option to produce output fasta files that can be
analyzed by a de novo chimera-checking program
after denoising
-cu <str> File extension for these files that can be
analyzed by UCHIME (default: ".chfasta")
-cp <str> File extension for these files that can be
analyzed by Perseus
-cm <str> File extension for the output mapping files that
indicate the reads in each cluster
(default: ".chmap")
Since FlowClus does nothing to remove PCR chimeras,
one may still wish to use a program to do so.
If the -ch option is specified, such fasta files will
be produced for each primer. The files will
contain a single sequence for each cluster -- the
longest sequence -- and the fasta header will
indicate how many reads are in the cluster.
Mapping files will also be produced. They will list
each cluster's representative read header, followed
by a comma-separated list of all of the cluster's
reads' headers. This can be used to remove the
reads from any cluster that is judged to be
chimeric.
The default is to produce files of the form required
by UCHIME (v4.2.40). The alternative is to produce
files of the form required by Perseus. Either
option may be specified, but not both.
If denoising is performed using a trie (see -tr below),
only the leaf nodes will have a sequence printed to
the fasta file. The mapping file will contain
the read headers for that leaf node AND those of
any ancestor nodes. Therefore, a read from an
internal node may appear more than once in the
mapping file.
-sd <str> Input file containing distances for each flow
value (default: "stddev.txt")
FlowClus allows one to denoise with variable values
(confidence interval widths) at each flow value.
"stddev.txt" provides a list of such distances
based on the standard deviations produced by
Balzer et al. (Bioinformatics, 2010). This file
lists the standard deviations for each flow value
(from 0.00 to 19.99, one per line).
The -sd option allows one to specify an alternative
file to use.
***************************************************************
Filtering options:
Reads can be filtered based on sequence, quality scores, and
flowgrams -- or any combination thereof. But before a read
is analyzed, it must match a mid tag and primer given in
the master file.
-em <int> Number of mismatches to the mid tag sequence to
allow for a read (default: 0)
-ep <int> Number of mismatches to the primer sequence to
allow for a read (default: 0)
Each read will be tested against each of the mid
tag - primer sequences in order. These parameters
allow for some mismatches to the target sequences.
Only substitutions are allowed, not insertions or
deletions.
The -ep allowance does not apply to the 3' base of the
primer, i.e. that base must be a match. This is due
to the way the flowgram is analyzed by the program.
Be careful not to specify too large a number for the
mid tag parameter, or you may find "sample
switching" of your reads if the mid tags are not
sufficiently distinct.
Only reads that have a match and contain at least one
additional base (past the mid tag and primer) will
be further analyzed by the program.
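The matching rules above can be sketched in Python as
follows. This is only an illustration, not the FlowClus
implementation: the function names are hypothetical, the read
is assumed to be upper-case with its key sequence already
removed, and an N in the read is simply counted as a mismatch.

```python
# IUPAC ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_mismatches(target, bases):
    """Count substitutions between a target (may hold IUPAC codes) and read bases."""
    return sum(1 for t, b in zip(target, bases) if b not in IUPAC[t])

def matches(read, midtag, primer, max_mid=0, max_primer=0):
    """True if the read starts with midtag + primer within the mismatch limits."""
    total = len(midtag) + len(primer)
    if len(read) <= total:
        return False  # at least one base past the primer is required
    mid_part = read[:len(midtag)]
    primer_part = read[len(midtag):total]
    if primer_part[-1] not in IUPAC[primer[-1]]:
        return False  # the 3' base of the primer must always match
    return (count_mismatches(midtag, mid_part) <= max_mid
            and count_mismatches(primer, primer_part) <= max_primer)

mid = "ATATCGCGAG"
primer = "CCGTCAATTCMTTTGAGTTT"  # M stands for A or C
read_ok = mid + "CCGTCAATTCATTTGAGTTT" + "ACGT"
read_one_sub = mid + "CCGTCAATTCGTTTGAGTTT" + "ACGT"
read_bad_3prime = mid + "CCGTCAATTCATTTGAGTTA" + "ACGT"
```

With these inputs, read_ok matches with the default limits,
read_one_sub matches only if max_primer is at least 1, and
read_bad_3prime never matches.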
NOTE: A given read is either eliminated, truncated, or
neither. If multiple criteria would have eliminated or
truncated a read, the criterion credited is the first in the
categories listed below (Sequence, Quality Scores, Flowgram).
In analyzing the Sequence and Flowgram of a read, it must
first pass the min./max. length restrictions. After this,
the sequence/flowgram is examined 5' to 3', and the criterion
that is violated first is credited with the elimination or
truncation.
Sequence Analysis:
NOTE: The bases of the "key sequence" that begins every
read (e.g. "tcag") are not analyzed.
-l <int> Minimum sequence length (bp) (default [and min.]:
<length of mid tag - primer> + 1)
A read shorter than this minimum length will be
eliminated.
This parameter also applies after all of the
various truncation criteria, i.e. if a read's
truncation causes its length to fall below the
minimum specified here, it will be eliminated.
The specified minimum length includes the length of
the mid tag and the primer (but not the key
sequence).
-L <int> Maximum sequence length (bp)
A read longer than this maximum will be eliminated.
The specified maximum length includes the length of
the mid tag and the primer.
-t <int> Maximum sequence length for truncation (bp)
A read will be truncated to the maximum specified.
The specified maximum length includes the length of
the mid tag and the primer.
NOTE: The truncated bases, as well as the quality
scores and flow values, will NOT be further
analyzed, even if the read should have been
eliminated due to its truncated end.
For example: A read should be eliminated due to
an ambiguous base at position 410 and the -N 0
option. If -t 400 is specified, the read is
truncated at 400bp, so the ambiguous base is
ignored and the read is not eliminated.
This criterion should not result in any eliminations,
except in a case where the truncated sequence
results in a flowgram that is too short for the
-lf parameter. If this occurs, you should
reconsider your choice of parameters.
The resulting flowgram might not match the sequence
perfectly, if the truncation divides a homopolymer.
Note that -L and -t can be used together. For example, one
can eliminate any read that is longer than 600bp, but
truncate the rest to 400bp by specifying -L 600 -t 400.
-N <int> Maximum number of ambiguous bases to allow in a
read
A read containing more than the specified number of
Ns will be eliminated.
-n <int> Maximum number of ambiguous bases to allow in a
read before truncating it
A read containing more than the specified number of
Ns will be truncated immediately prior to the
offending N.
This may result in the elimination of the read due to
the minimum size parameters (-l or -lf).
Note that -N and -n can be used together. For example, one
can truncate the reads before the first N (thus removing
all Ns) but also eliminate any read that has more than
two Ns by specifying -N 2 -n 0.
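A minimal Python sketch of how -N and -n might interact is
given below. It is an illustration under stated assumptions,
not the program's code: the function name is hypothetical, and
it assumes that the elimination count (-N) is taken over the
whole read before any truncation (-n) is applied, as the
example above suggests.

```python
def apply_n_filters(seq, max_elim=None, max_trunc=None):
    """Return (status, sequence) after the ambiguous-base checks."""
    positions = [i for i, base in enumerate(seq) if base == "N"]
    if max_elim is not None and len(positions) > max_elim:
        return ("Eliminated", "")
    if max_trunc is not None and len(positions) > max_trunc:
        # Cut immediately before the first N beyond the allowance.
        cut = positions[max_trunc]
        return ("Truncated", seq[:cut])
    return ("Passed", seq)

one_n = apply_n_filters("ACGTNACGT", max_elim=2, max_trunc=0)
three_n = apply_n_filters("ANCNGNT", max_elim=2, max_trunc=0)
no_n = apply_n_filters("ACGT", max_elim=2, max_trunc=0)
```

With -N 2 -n 0, the first read is truncated to "ACGT", the
second is eliminated, and the third passes unchanged. A
truncated read would still be subject to -l and -lf.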
-G <int> Maximum homopolymer length to allow in a read
A read that contains a homopolymer run longer than
the specified value will be eliminated.
-g <int> Maximum homopolymer length to allow in a read
before truncating it
A read that contains a homopolymer run longer than
the specified value will be truncated immediately
prior to the homopolymer run.
This may result in the elimination of the read due to
the minimum size parameters (-l or -lf).
Note that -G and -g can be used together. For example, one
can truncate a read prior to a homopolymer longer than 6
but also eliminate any read that has a homopolymer longer
than 10 by specifying -G 10 -g 6.
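The homopolymer criteria can be sketched the same way. The
Python below is an assumption-laden illustration (hypothetical
function name; elimination by -G is assumed to be checked over
the whole read before truncation by -g), not the FlowClus
source.

```python
from itertools import groupby

def apply_homopolymer_filters(seq, max_elim=None, max_trunc=None):
    """Return (status, sequence) after the homopolymer checks."""
    runs = []  # (start index, run length) for each run of identical bases
    pos = 0
    for _base, group in groupby(seq):
        length = len(list(group))
        runs.append((pos, length))
        pos += length
    if max_elim is not None and any(n > max_elim for _, n in runs):
        return ("Eliminated", "")
    if max_trunc is not None:
        for start, n in runs:
            if n > max_trunc:
                # Cut immediately before the offending run.
                return ("Truncated", seq[:start])
    return ("Passed", seq)

run_of_7 = apply_homopolymer_filters("ACGT" + "A" * 7 + "CG", 10, 6)
run_of_11 = apply_homopolymer_filters("ACGT" + "C" * 11 + "G", 10, 6)
```

With -G 10 -g 6, the read with a run of 7 is truncated to
"ACGT" and the read with a run of 11 is eliminated.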
-r Option to remove the reverse primer from reads
-rq Option to require the reverse primer in reads,
which will then be removed
-er <int> Number of mismatches to allow in the search for
the reverse primer (default: 0)
If either option is specified, the reads will be
checked for the presence of the reverse-complement
of the "reverse" primer specified in the master
file. The search will allow for the specified
number of mismatches (not including the 3' base,
which must be a match).
If found, the reads will be truncated immediately
prior to the reverse primer. This may result in
the elimination of the read, depending on the
minimum length parameters.
If the "require" option is specified, the read will
be eliminated if the reverse primer is not found.
This is equivalent to QIIME's "-z truncate_remove"
option in split_libraries.py.
The less strict "remove only" option (-r) is equivalent
to "-z truncate_only" in split_libraries.py.
Either option may be specified, but not both.
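To make the reverse-primer search concrete, here is a hedged
Python sketch. The function names are hypothetical, and one
interpretation is assumed: since the reverse primer appears
reverse-complemented in the read, its 3' base corresponds to
the first base of the searched-for target, so that position is
required to match.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]

def find_reverse_primer(read, reverse_primer, max_mismatch=0):
    """Return the start index of the reverse primer in the read, or -1."""
    target = reverse_complement(reverse_primer)
    for start in range(len(read) - len(target) + 1):
        window = read[start:start + len(target)]
        if window[0] not in IUPAC[target[0]]:
            continue  # assumed: this base (3' base of the primer) must match
        mismatches = sum(1 for t, b in zip(target, window) if b not in IUPAC[t])
        if mismatches <= max_mismatch:
            return start
    return -1

target = reverse_complement("CCGTCAATTCMTTTGAGTTT")
read = "ACGTACGT" + "AAACTCAAATGAATTGACGG" + "TT"
start = find_reverse_primer(read, "CCGTCAATTCMTTTGAGTTT")
trimmed = read[:start] if start >= 0 else read
```

Under -r, a read without a hit (start of -1) would be kept
as-is; under -rq it would be eliminated.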
Quality Score Analysis:
NOTE: The quality scores for bases removed by truncation
due to any of the above criteria are NOT analyzed.
Also, the quality scores of the "key sequence" that
begins every read are not analyzed.
-s <float> Minimum average quality score
A read whose average quality score is not at least
the value specified will be eliminated.
-wl <int> Length of sliding window of quality scores
-wq <float> Minimum average quality score of sliding window
-wx Option to eliminate a read with a bad quality
window (default: do not eliminate)
A read containing a "window" (of the length specified
by -wl) of consecutive bases whose average quality
score is less than that specified by -wq will be
truncated immediately prior to the window.
This may result in the elimination of the read due to
the minimum size parameters (-l or -lf).
If the option -wx is specified, the read will be
eliminated regardless of where the window occurs
(this is equivalent to QIIME's -w, -g combination
in split_libraries.py).
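The sliding-window check can be illustrated with the short
Python sketch below. It is not the FlowClus implementation;
the function name is hypothetical, and the quality scores are
assumed to exclude the key sequence.

```python
def first_bad_window(quals, window_len, min_avg):
    """Return the start of the first window whose average is below min_avg, or -1."""
    if window_len <= 0 or len(quals) < window_len:
        return -1
    total = sum(quals[:window_len])
    for start in range(len(quals) - window_len + 1):
        if start > 0:
            # Slide the window one base to the right.
            total += quals[start + window_len - 1] - quals[start - 1]
        if total / window_len < min_avg:
            return start
    return -1

quals = [30] * 10 + [10] * 5
cut = first_bad_window(quals, window_len=5, min_avg=20)
kept = quals[:cut] if cut >= 0 else quals
```

Without -wx, the read would be truncated to the bases before
the returned index; with -wx, any result other than -1 would
eliminate the read.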
Note that -s and -wl/-wq/-wx can be used together. For
example, one can truncate a read prior to a window of
50bp whose average quality score is below 20, but also
eliminate a read whose overall average quality score is
below 25, by specifying -s 25 -wl 50 -wq 20.
Flowgram Analysis:
NOTE: The flow values of the "key sequence" that begins
every read are not analyzed.
-u <float> Absolute maximum flow value (default: 19.99)
This program requires a maximum flow value that can
be analyzed. Flow values greater than the
maximum will be changed to the maximum.
To truncate a flowgram prior to a flow value larger
than a specified maximum, use the -z parameter
(see below).
-lf <int> Minimum number of flows (default [and min.]:
number of flows corresponding to minimum
sequence length of <length of key - mid
tag - primer> + 1)
A read whose flowgram is shorter than this minimum
length will be eliminated.
This parameter also applies after all of the
various truncation criteria, i.e. if a read's
truncation causes its flowgram's length to fall
below the minimum specified here, it will be
eliminated.
The specified minimum length includes the flows of
the mid tag and the primer, as well as the key
sequence. Depending on the first base of the
mid tag, the standard key "tcag" utilizes 7-10
flows before the actual sequence begins.
-Lf <int> Maximum number of flows
A flowgram will be truncated to the maximum
specified.
This criterion should not result in any eliminations,
except in a case where the truncated flowgram
results in a read that is too short for the
-l parameter. If this occurs, you should
reconsider your choice of parameters.
-p <float> Noisy interval flow value minimum
-q <float> Noisy interval flow value maximum
A flowgram will be truncated immediately prior
to a flow whose value falls in the interval
defined by -p and -q (inclusive).
In the CleanMinMax.pl script of AmpliconNoise,
this interval is 0.50 - 0.70.
This may result in the elimination of the read due to
the minimum size parameters (-l or -lf).
-z <float> Maximum flow value (for truncation)
A flowgram will be truncated immediately prior
to a flow whose value is greater than the
specified value (not inclusive).
In the CleanMinMax.pl script of AmpliconNoise,
this value is 6.49.
This may result in the elimination of the read due to
the minimum size parameters (-l or -lf).
Note that this is not the absolute maximum flow value
(see -u above).
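The difference between -u (clamping) and -p/-q/-z (truncation)
can be sketched in Python as follows. This is an illustrative
sketch with a hypothetical function name; the flow list is
assumed to exclude the key sequence.

```python
def truncate_flowgram(flows, noisy_min=None, noisy_max=None,
                      max_flow=None, abs_max=19.99):
    """Keep flows up to the first noisy or too-large value, clamping to abs_max."""
    kept = []
    for value in flows:
        if (noisy_min is not None and noisy_max is not None
                and noisy_min <= value <= noisy_max):
            break  # -p/-q: the interval is inclusive
        if max_flow is not None and value > max_flow:
            break  # -z: strictly greater than the maximum
        kept.append(min(value, abs_max))  # -u: clamp rather than truncate
    return kept

noisy = truncate_flowgram([1.02, 0.05, 2.10, 0.62, 1.00],
                          noisy_min=0.50, noisy_max=0.70)
large = truncate_flowgram([1.00, 6.49, 6.50, 1.00], max_flow=6.49)
clamped = truncate_flowgram([1.00, 25.00])
```

The first call stops before 0.62, the second stops before
6.50 (6.49 itself is kept), and the third keeps both flows but
changes 25.00 to 19.99.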
-y <float> Minimum flow value for 4 straight flows
A flowgram will be truncated immediately prior to
a set of 4 flows whose values are not at least
the minimum specified.
This may result in the elimination of the read due to
the minimum size parameters (-l or -lf).
This criterion is based on observations of Reeder
and Knight (Nature Methods, 2010), confirmed by
me (JMG), that a flowgram may contain 4 or more
consecutive flows with almost no signal, but the
corresponding sequence will not contain any Ns.
I would recommend a small value for this parameter,
such as -y 0.25.
Note that specifying -y 0.50 will NOT truncate reads
prior to all Ns, because only three flows with
insufficient signal are required for an N to be
called. If you wish to truncate prior to all Ns,
use the -n 0 option.
This criterion has not been modified for use with
an irregular flow pattern ("flow pattern B").
It still evaluates every set of 4 consecutive
flows, regardless of the actual flow order.
Be cautious when using this criterion with such
data.
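A minimal Python sketch of the -y check is shown below. The
function name is hypothetical, and, like the criterion itself,
the sketch looks at every set of 4 consecutive flows without
regard to the flow order.

```python
def first_low_signal_run(flows, min_value, run_len=4):
    """Return the start of the first run of run_len flows all below min_value, or -1."""
    for start in range(len(flows) - run_len + 1):
        if all(v < min_value for v in flows[start:start + run_len]):
            return start
    return -1

flows = [1.0, 0.1, 0.0, 0.2, 1.1, 0.1, 0.05, 0.0, 0.12, 0.9]
cut = first_low_signal_run(flows, min_value=0.25)
kept = flows[:cut] if cut >= 0 else flows
```

Here the three weak flows at positions 1-3 are not enough to
trigger truncation; the run of four beginning at position 5
is, so the flowgram would be cut to its first five flows.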
***************************************************************
Denoising options:
If denoising is to be performed, one of the following
two parameters (-j or -k) must be specified, but not both.
These parameters define the width of the "confidence interval"
around each flow value. If a flow value from a read falls
outside of the confidence interval for the corresponding
flow value of a cluster, the flow values will be considered
sufficiently distinct, and the read will not join the
cluster.
-j <float> Constant value
This parameter defines a constant value (plus or
minus) for all flow values.
For example, if "-j 0.50" is specified, the
interval for a flow value of 1.09 will range from
0.59 to 1.59 (inclusive). Flow values inside
that interval will not be considered significantly
distinct from 1.09.
The parameter must be strictly positive. To run
with a distance of 0, use -j 0.001.
In essence, the 454 base caller uses a constant
value of 0.50. The important difference is that
454 calls bases using integers (0, 1, 2, ...) as
its reference values. FlowClus uses a cluster's
(floating-point) flow values as the references.
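The constant-width interval can be written out as a small
Python sketch. The function names are hypothetical, and the
small tolerance eps is an assumption added here only to keep
the inclusive comparison robust to floating-point rounding.

```python
def interval_constant(ref, j):
    """Return the (low, high) interval around a cluster flow value for -j."""
    if j <= 0:
        raise ValueError("-j must be strictly positive")
    return (ref - j, ref + j)

def within(query, interval, eps=1e-9):
    """True if the query flow value lies inside the interval (inclusive)."""
    low, high = interval
    return low - eps <= query <= high + eps

low, high = interval_constant(1.09, 0.50)
joins = within(0.59, (low, high))
rejected = not within(1.60, (low, high))
```

For -j 0.50 and a cluster flow value of 1.09, the interval
runs from 0.59 to 1.59, so a read flow value of 0.59 is
accepted and 1.60 is not.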
-k <float> Number of distances
This parameter defines a multiplier for the distances
given for each flow value in an external file
("stddev.txt" by default, or the file specified by
the -sd parameter; see above). It must be positive.
For example, the distance (given in "stddev.txt") for
the flow value 0.93 is 0.11781817. If "-k 5" is
specified, the interval for 0.93 will range from
0.34090915 to 1.51909085.
The distances given in "stddev.txt" are based on the
standard deviations produced by Balzer et al.
(Bioinformatics, 2010). They more naturally
reflect flowgram noise in that they increase with
larger flow values.
If you have a set of desired distances for each flow
value, write those distances to a separate file
(one per line) and specify -sd <file> -k 1. There
should be one distance for each flow value, from
0.00 to the maximum specified by -u (or 19.99, by
default).
For asymmetric intervals, each line should contain
two positive floating-point distances (comma- or
tab-delimited) -- the first for the negative
distance, and the second for the positive distance.
For example, if the distances for the flow value of
0.58 are given as "0.35,0.56", the interval of query
flow values will range from 0.23 to 1.14 (assuming
-k 1 is specified).
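The two worked examples above can be reproduced with the
following Python sketch. It is illustrative only: the function
names are hypothetical, and it assumes that line i of the
distance file (counting from 0) corresponds to the flow value
i/100.

```python
import re

def load_distances(lines):
    """Read one or two distances per line into (negative, positive) pairs."""
    table = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = [float(x) for x in re.split(r"[,\t]", line)]
        neg = parts[0]
        pos = parts[1] if len(parts) > 1 else parts[0]  # symmetric if only one
        table.append((neg, pos))
    return table

def interval_scaled(ref, table, k):
    """Return the (low, high) interval around a cluster flow value for -k."""
    neg, pos = table[round(ref * 100)]
    return (ref - k * neg, ref + k * pos)

# Toy distance file: placeholder values except for the two documented lines.
lines = ["0.1"] * 100
lines[93] = "0.11781817"
lines[58] = "0.35,0.56"
table = load_distances(lines)
sym = interval_scaled(0.93, table, 5)
asym = interval_scaled(0.58, table, 1)
```

The symmetric case gives roughly 0.34090915 to 1.51909085 for
-k 5, and the asymmetric case gives 0.23 to 1.14 for -k 1,
matching the figures quoted above.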
-tr Option to denoise using a trie
If this option is selected, denoising will be
performed using a trie data structure. The reads
will not be clustered; instead they will be placed
into the trie according to their flow values and
the distance(s) given by -j or -k.
This will use less memory and have a shorter run-time,
but will be less precise than the default denoising.
It is recommended for very large datasets.
Since clusters are not formed, some of the output
files are different, as outlined in various sections
above.
***************************************************************
Congratulations. You made it to the end. Hopefully this is
not because you still have questions about the program and
how it is used. But if you do have questions, or if you
found any bugs, please let me know.
John M. Gaspar (jsh58@wildcats.unh.edu)
June 2013 (updated Feb. 2014)