############################################################################################
################################ PEMA's PARAMETERS #####################################
############################################################################################
#
# for version 2.1.4
#
# This file lists all the parameters that need to be assigned every time PEMA is
# about to run. The parameters listed here are not the only parameters of the tools
# invoked by PEMA. Hence, we strongly encourage you to study the manual of each tool and
# tune its parameters as well as possible for your specific experiment.
#
# ATTENTION!
# After each variable name, leave EXACTLY ONE (1) TAB and then fill in the value of the parameter.
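# For example (an illustration only; <TAB> stands for one literal tab character, not the text "<TAB>"):
#     strictness<TAB>0.8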
#
#
############################################################################################
########################## The parameter setting starts from here! ########################
############################################################################################
#
# Give each unique experiment a NAME, so that a separate output folder is created for each of them.
#
outputFolderName test_Swarm_16S
#
#
# PEMA needs the names of the samples to be in the ENA format (e.g. "ERR1021912"). Often, however,
# they look the way the Illumina sequencer returned them (e.g. "FdCor_1.fast.gz").
# If your sample names are not in the ENA format, PEMA has to change them to what they should be,
# and you need to set the "EnaData" variable as 'No'.
EnaData Yes
#
# Each sequencer has a specific combination of letters with which all its reads start.
# For example, MiSeq has "@M0". Please provide this pattern to PEMA so it can convert your data to the ENA
# format. If you do not know it, you can either look it up online or gunzip one of your fastq.gz files and provide
# the letters between the "@" and the first ":" characters.
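# As an illustration (the file name below is hypothetical), the first header line of a sample can be printed with:
#     gunzip -c my_sample_1.fastq.gz | head -n 1
# The characters right after the "@" in that line show the pattern to look for.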
sequencerPrefix M0
#
#############################################################
################# fastqc (v0.11.8) #####################
#############################################################
#
# No parameters here!
#
#
#############################################################
################ trimmomatic (v.0.38) ##################### // http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
#############################################################
#
# Performs an adaptive quality trim, balancing the benefits of retaining longer reads against the costs of retaining bases with errors.
# It needs to be set either as 'Yes' or 'No'.
#
maxInfo Yes
#
#
############ for MAXINFO ####################
# Specifies the read length which is likely to allow the location of the read within the target sequence to be determined.
# It needs an integer to be set as a value.
#
targetLength 100
#
#
# This value specifies the balance between preserving as much read length as possible vs. removal of incorrect bases.
# It can take values between 0 and 1.
# A low value of this parameter (<0.2) favours longer reads, while a high value (>0.8) favours read correctness.
#
strictness 0.8
#
#
############ for ILLUMINACLIP ##################
# Specifies the path to a fasta file containing all the adapters, PCR sequences etc.
# The naming of the various sequences within this file determines how they are used.
#
adapters TruSeq2-PE.fa
#
#
# This parameter specifies the maximum mismatch count which will still allow a full match to be performed.
# It needs an integer to be set as a value.
#
seedMismatches 0
#
#
# This parameter specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
# It needs an integer to be set as a value.
#
palindromeClipThreshold 20
#
#
# It specifies how accurate the match between any adapter etc. sequence must be against a read.
# It needs an integer to be set as a value.
#
simpleClipThreshold 30
#
#
############ for LEADING ##########################
# The LEADING module removes low quality bases from the beginning of a read.
# As long as a base has a value below this threshold (value of the 'leading' parameter) the base is removed and the next base will be investigated.
# It needs an integer to be set as a value.
#
leading 20
#
#
############ for TRAILING #############################
# This module of Trimmomatic removes low quality bases from the end.
# As long as a base has a value below this threshold (value of the 'trailing' parameter), the base is removed
# and the next base (which, as Trimmomatic starts from the 3' end, is the base preceding the one just removed) will be investigated.
# It needs an integer to be set as a value.
#
trailing 2
#
#
############ for MINLEN ################################
# This module removes reads that fall below the specified minimal length. If required, it should normally be after all other processing steps.
# Reads removed by this step will be counted and included in the "dropped reads" count presented in the trimmomatic summary.
# It needs an integer to be set as a value.
#
minlen 100
#
#
# Finally, you need to set how many threads you want Trimmomatic to use.
#
threadsTrimmomatic 8
#
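# For orientation only: written in Trimmomatic's own step syntax, the values set above would read as follows.
# How PEMA assembles and orders these steps internally is an assumption here, not something stated in this file.
#     ILLUMINACLIP:TruSeq2-PE.fa:0:20:30    (adapters:seedMismatches:palindromeClipThreshold:simpleClipThreshold)
#     LEADING:20                            (leading)
#     TRAILING:2                            (trailing)
#     MAXINFO:100:0.8                       (targetLength:strictness)
#     MINLEN:100                            (minlen)
#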
#
###############################################################
######### BayesHammer (from SPAdes: v3.13.0) ###############
###############################################################
#
# This algorithm is part of the SPAdes tool package. You can find its manual here: http://cab.spbu.ru/files/release3.10.1/manual.html#sec1
#
# no parameters here!
#
#
################################################################
################# PANDAseq (v. 2.11) ################### // https://storage.googleapis.com/pandaseq/pandaseq.html
################################################################
#
# PANDAseq is the algorithm that PEMA uses in order to merge the paired-end reads.
# PANDAseq offers more than one merging algorithm.
#
# Here, we set the algorithm used for assembly. The most common of them are:
# pear --> uses the formula described in the PEAR paper (Zhang 2013), optionally with the probability of a random base (q) provided.
# simple_bayesian --> uses the formula described in the original paper (Masella 2012), optionally with an error estimation (ε) provided.
# Other options are stitch, flash and more that you can find in the above link.
#
pandaseqAlgorithm simple_bayesian
#
#
# PANDAseq is an I/O-bound algorithm. That means that it spends considerable time handling the input and output files,
# while the processing itself is quite fast. However, it does support multithreading and here you can set the number of threads it is going to use.
#
pandaseqThreads 8
#
#
# The 'minlen' parameter sets the minimum length for a sequence, after primers are removed.
# By default, all sequences are kept. With this option, sequences shorter than desired can be discarded.
# In case you need to use this parameter, be sure you leave a tab after 'pandaseqMinlen' and set it like this: '-l 80'
# If you do not want to use this parameter, please remove everything after 'pandaseqMinlen'.
#
pandaseqMinlen
#
#
# The 'minoverlap' parameter sets the minimum overlap between forward and reverse reads.
# By default, this is at least one nucleotide of overlap.
# Raising this number does not generally increase the quality of the output as alignments with small overlaps tend to score poorly and are discarded anyway.
#
minoverlap 1
#
#
# The 'threshold' parameter sets the score, between zero and one, that a sequence must meet to be kept in the output.
# Any alignments lower than this will be discarded as low quality.
# Increasing this number will not necessarily prevent uncalled bases (Ns) from appearing in the final sequence.
# It is also used as the threshold to match primers, if primers are supplied. The default value is 0.6.
#
threshold 0.6
#
#
# The '-N' parameter eliminates all sequences with uncalled nucleotides in the output.
# Otherwise, during assembly, uncalled bases (Ns) from unpaired regions may be emitted.
# If you need -N to be on your analysis, please add '-N' after 'elimination'. Please make sure you leave a tab.
# If you do not want the parameter to be on, please make sure there is nothing after the 'elimination' parameter.
#
elimination
#
#
## PEMA runs the PANDAseq algorithm with the -a and the -B parameters also on.
## These strip the primers after assembly, rather than before, and allow input sequences to lack a barcode/tag, respectively.
#
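# For orientation only: on the PANDAseq command line, the settings above correspond to the options below.
# The exact call that PEMA builds is not shown in this file, so treat this as a sketch rather than PEMA's actual command.
#     -A simple_bayesian   (pandaseqAlgorithm)
#     -T 8                 (pandaseqThreads)
#     -l 80                (pandaseqMinlen, only when it is set)
#     -o 1                 (minoverlap)
#     -t 0.6               (threshold)
#     -N                   (elimination, only when it is set)
#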
#
###############################################################
########### obiuniq (from OBITools: v1.2.12) ###############
###############################################################
#
# This algorithm is part of the OBITools package. You can find more about them here: https://pythonhosted.org/OBITools/
#
# no parameters here!
#
#
###############################################################
################ VSEARCH (v2.9.1) ######################
###############################################################
#
# VSEARCH is the main algorithm used for a lot of steps in the case of the 16S marker gene.
# Set how many threads you want VSEARCH to use.
# It needs an integer to be set as a value.
#
vsearchThreads 8
#
#
# Here you need to set the identity threshold for the clustering step of the VSEARCH algorithm.
# A read is not added to a certain cluster if its pairwise identity with the cluster's centroid is lower than the value of the 'vsearchId' parameter.
# The pairwise identity is defined as (matching columns) / (alignment length - terminal gaps).
# It needs a real number to be set as a value, ranging from 0.0 to 1.0.
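# A worked example with made-up numbers: an alignment of length 210 with 10 terminal gap columns and 190 matching columns
# gives 190 / (210 - 10) = 0.95, which is not lower than a 'vsearchId' of 0.95, so the read would be added to the cluster.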
#
vsearchId 0.95
#
#
################################################################################################
#//////////////////////////////////////////////////////////////////////////////////////////////
############################ GENE - dependent parameters #####################################
#//////////////////////////////////////////////////////////////////////////////////////////////
###############################################################################################
#
# The marker gene you have is really important for both the clustering & chimera removal procedure and the
# taxonomy assignment. By default, the pipeline runs for 16S. If COI is your marker gene, substitute '16S' with 'COI':
# write it down after the underscore (_) and do not erase the underscore!
#
gene gene_16S
#
#
##########################################################################################################
# Here are some parameters needed when the metabarcoding analysis is about the 16S/18S marker genes     #
##########################################################################################################
#
#
# If your marker gene is 16S, you can choose between 2 different approaches to taxonomy assignment (alignment-based and phylogeny-based).
# For an alignment-based taxonomy assignment, which is based on SILVA and CREST (version 3.0), set this parameter as 'alignment'.
# However, you can also get a phylogeny-based assignment by putting 'phylogeny' in this parameter. In that case, a reference tree we created is used, as well as RAxML.
#
taxonomyAssignmentMethod alignment
#
#
# If you choose the phylogeny-based taxonomy assignment, then you will need to run PaPaRa.
# Please fill in how many cores PaPaRa is able to use.
#
numberOfCoresForPapara 7
#
#
# When you use the alignment-based taxonomy assignment, the LCAClassifier from the CREST algorithm uses a Silva version for the assignment.
# PEMA allows you to choose between the two latest versions of Silva. Hence, set the "referenceDb" parameter either as 'silva_128' or as 'silva_132',
# depending on the version of your choice.
# In case you are running 18S rRNA data, you may also use the PR2 database, by setting the referenceDb parameter as 'pr2'.
#
referenceDb silva_132
#
#
# As you may need a series of taxonomy assignments when you use the alignment-based method, please give a different name to the
# taxonomy output folder of the CREST algorithm each time you are about to use it.
# CREST creates an output folder every time, and if a folder with the same name already exists, it is going to abort the task!
# You need to set the value of this parameter in case of 16S/18S rRNA and ITS marker genes.
#
taxonomyFolderName my_taxon_assign
#
#
##########################################################################################################
####################### For the case of the ITS marker gene ######################################
##########################################################################################################
#
# For the case of ITS there is an extra problem with respect to the primers used.
# Please complete the next two variables with the primers you used
#
forwardITSPrimer GATGAAGAACGYAGYRAA
reverseITSPrimer CTBTTVCCKCTTCACTCG
#
#
##########################################################################################################
######## Here are some parameters needed with respect to clustering algorithms ###############
##########################################################################################################
#
#
# For the case of the 16S and 18S rRNA marker genes, you can either get an OTU-table using the VSEARCH algorithm
# or you can get an ASV-table by taking advantage of the SWARM algorithm.
# If your marker gene is COI or ITS, you can choose between 2 different approaches of clustering.
# Depending on which of them you choose, you get either a robust output in a short time (Swarm) or a non-robust output that requires considerably more time (CROP).
# Please fill in according to your marker gene and needs (write "Swarm" or "vsearch" or "CROP" after algo_).
#
clusteringAlgo algo_Swarm
#
#
# In case of SWARM, the user needs to specify the value of the "d" parameter, i.e. the
# maximum number of differences allowed between two amplicons, meaning that two amplicons
# will be grouped if they have d (or fewer) differences. This is swarm's most important
# parameter.
#
d 15
#
#
# when using the option --fastidious (-f), define the minimum mass of a large ASV.
# By default, an ASV with a mass of 3 or more is considered large.
# Conversely, an ASV is small if it has a mass of less than 3, meaning that it is composed of either one amplicon of abundance 2, or two amplicons of abundance 1.
# Any positive value greater than 1 can be specified. Using higher boundary values will speed up the second pass, but also reduce the taxonomical resolution of swarm
# results.
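# A worked example with made-up abundances: an ASV composed of three amplicons with abundances 5, 1 and 1 has a mass of 5 + 1 + 1 = 7,
# so with a boundary of 3 it counts as large, whereas an ASV made of a single amplicon of abundance 2 (mass 2) counts as small.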
#
boundary 3
#
#
# You also need to set the number of threads that Swarm is able to use
#
swarmThreads 20
#
#
## SWARM tends to create a great number of ASVs, especially when d takes a low value.
## Would you like to remove the singletons (ASVs that appear only once with abundance equal to 1) ?
#
removeSingletons Yes
#
#
# CROP is able to run on multiple CPUs. As PEMA needs to update an environmental variable to do thi$
#
omp_num_threads 20
#
#
# PEMA invokes the UCHIME_DENOVO3 algorithm for the chimera removal in the case of the COI marker gene.
# This is a quite common algorithm for that step, but you still need to choose the minimum abundance skew value.
# By default, this value is equal to 16. However, the only difference between UCHIME_DENOVO2 and 3 is that in version 2
# the abskew value used to be equal to 2. So, be my guest and pick your value, as I really have no clue about that.
# Probably, for environmental studies a low abskew is better, while in more specific studies a larger one would fit best.
#
abskew 2
#
#
#########################################################################################################
#
# For the taxonomy assignment of the COI marker gene
#
#########################################################################################################
#
# Since PEMA v.2.0 both Midori and Midori2 are supported. The second version of Midori includes more than
# 180,000 unique species and expands beyond Metazoa. However, the computational time needed increases to
# a great extent when Midori2 is the one selected. Choose wisely, especially in case you have a lot of query sequences.
# If you want Midori2 for your analysis, replace "1" with "2" after the "midori_" prefix.
#
midori_version midori_1
#
#
#########################################################################################################
#
###########################################################################################################
#
# In-house (local) reference database for the taxonomy assignment step
#
###########################################################################################################
# In case you would prefer to use a reference database of your own for the taxonomy step, PEMA will ask you for the
# files necessary to train each of the two classifiers supported. You need to provide those in the directory called
# `local_ref_db` and, depending on which classifier is to be trained, the files have to be in a different format.
# More specifically, if you are working with COI, then PEMA uses the RDPClassifier and you need to provide a taxonomy
# and a sequence file, as described here: https://john-quensen.com/tutorials/training-the-rdp-classifier/
# If you are working with 16S, 18S or ITS, then PEMA uses the CREST classifier for the taxonomy step
# and you need to provide the corresponding two files as described at: https://github.com/lanzen/CREST
# (see after the Output section).
# As a container is lost once you exit it, you will have to train the classifier every time you run a new PEMA container.
# If you are about to use a custom ref db, set the following parameter as 'Yes'. Otherwise, it must be set as 'No'.
# The `name_of_custom_db` may be left empty or not, depending on whether you will use a custom db.
#
custom_ref_db No
name_of_custom_db partialCustomdb
#
#######################################################################################################
#
# In case you would like to link the OTU/ASV assignment to its closest NCBI Taxonomy Id
# set the following parameter as `Yes`
#
getNCBITaxId No
#
######################################################################################################
#
# The following parameters are only for the case that the Phyloseq R package is about to run
#
######################################################################################################
#
# If you wish to use Phyloseq in order to analyse your returned data, then set the following parameter 'phyloseq' as 'Yes'.
# In order to do that, PEMA needs an MSA, which is returned by the MAFFT (v7.427) aligner, and a phylogenetic tree of the OTUs found, which is built by the RAxML-ng algorithm.
# Please remember that in order to use phyloseq, a "metadata.tsv" file needs to be part of your analysis folder.
#
phyloseq Yes
#
#
# The phyloseq object can handle phylogenetic trees as well. PEMA uses RAxML-ng in order to build such trees.
# Do you want to create such a tree with your OTUs? In case you build this once, you can use it as many times as you want.
#
tree Yes
#
# In case you are about to use the phyloseq option, then a phylogeny tree has to be built. Hence, PEMA invokes the RAxML-ng algorithm
# which is able to run on more than one thread. Please set the number of threads RAxML-ng is able to use.
#
raxmlThreads 5
#
#
# You can also set the number of the parsimony-based starting trees for the RAxML-ng
#
parsTrees 1
#
#
# And finally, the number of the bootstrap trees
#
bootstrapTrees 1
#
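# For orientation only: in RAxML-ng's own command-line options, the three settings above correspond to
#     --threads 5   --tree pars{1}   --bs-trees 1
# This is a sketch; the full call that PEMA builds is not shown in this file.
#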
###########################################################################################################
#
# Finally, do you want your raw data to be moved to another folder, leaving the "rawData" folder empty, and all the checkpoints of PEMA to be moved to an extra folder as well?
# Switch 'Yes' to 'No' if you wish so. Be very careful when you do that, as you need to remember that if you want to analyze another dataset through P.E.M.A
# you will have to remove the first one manually.
#
emptyRawDataFile Yes
emptyCheckpoints Yes
#
#
# Classifier to use, parameter values ("CREST","phylogeny","RDPClassifier")
# If you are about to use a version of the Silva db, the Unite db, or a custom db for which you have prepared
# the required files accordingly
# (see https://hariszaf.github.io/pema_documentation/training_crest_classifier/),
# you may set this parameter as "CREST".
# In case you are running 16S data and you would like a phylogeny-based taxonomy assignment,
# you may set this parameter as "phylogeny".
# In case you are about to use a version of Midori or a custom db for which you have prepared the
# required files (see https://hariszaf.github.io/pema_documentation/training_rdpclassifier/) you
# may set this parameter as "RDPClassifier"
classifierAlgo CREST
#
#
#