forked from Geo-omics/scripts
contents.list
# Lines that end with ":" are folder names. The lines that follow a folder name are the scripts in that folder.
# To add a description to a script, insert a TAB after the name and describe it on a single line.
# DO NOT include ":" or a newline character in your description.
# Use "crawler.pl" to update all the scripts.
AssemblyTools:
addFileName2header.pl - Add the name of the fasta/fastq file (without the extension) to the header. Useful when you merge multiple fasta/fastq files and wish to keep track of sequences easily
batchBlast.pl - Run multiple Blasts at once in an "embarrassingly" parallel manner
calcN50.pl - Calculate N50 and L50 values for a fasta file
chopper.pl - Chop a file (fasta/fastq/tab-delimited/multiple-line) into multiple parts.
createFastq.pl - Use the fasta and quality files to produce fastq files
CRISPR_spacer_extractor.pl - Given a fasta file with repeat sequences and a contig fasta file, get the positions of these repeats in the contigs and find the coordinates of spacers
dereplicate.pl - Remove sequences that are exactly the same and maintain a record of the clusters.
extractSeqs.pl - Given a list of sequence names, extract the sequences from a fasta/fastq file
findStretchesOfNs.pl - Go through the fasta file and find the sequences with a stretch of 100 or more N's.
gcSkew.pl - Calculate the GC skew for a fasta file
getRandomData.pl - Extract a random % of data (fasta/fastq)
interleave.pl - Take the forward(1) and reverse(2) fasta/fastq files and arrange them such that all odd sequences are forward and evens are reverse.
kmerFreq.pl - Calculate any kmer frequency from the given fasta
length+GC.pl - Calculate the length and GC content from the fasta file.
limit2Length.pl - Remove sequences from a fasta file that do not pass the length threshold set by the user. Print length distribution to the screen
separateInterleaved.pl - Separate interleaved files.
usageStats.pl - Monitor a given process and email report when process finishes
BamTools:
bamTools.pl - Calculate average coverage for a given list of contig names. Samtools required.
coveragePerScaffold.pl - Use the GenomeCoverageBed default output to calculate the coverage per scaffold and for the whole genome.
derep_getReadAbundance.pl -
getBwaMappedReadList.pl -
BashScripts:
do2folder.sh - Do something to the entire folder. Just add the command into the script.
do2list.sh - Do something to a list of file/folder locations. Just add the command into the script.
ssh2sameFolder.sh - As the name suggests, SSH to your current location in a different server. The path should exist on the other server.
firefox_already_running.sh - Helps fight that pesky 'firefox already running' error. You know which one.
mapping.sh - BWA mapping pipeline. Open the file in a text editor for help.
qc.sh - NGS QC pipeline
assemble.sh - NGS de novo assembly pipeline. Expects the QC to be done using qc.sh.
BlastTools:
basicHF.pl
batchBlast.pl - Run multiple Blasts at once in an "embarrassingly" parallel manner
blastDensityPlot.pl - Given multiple blasts to the same reference database, make density plot based on tabular blast outputs.
extract_Blast_Hits_Of_Interest.pl
extractSubSeq.pl
fragRec.pl
getSciNames.pl
mapper.pl
parseBlastXML.pl
plotCoverage.pl
postBlast.pl
removeBlastSubj.pl
removeCommentLines.pl
silvaTaxonAppend.pl
top5.pl
DerepTools:
derep_ClusterMap.pl
derep_getReadAbundance.pl
dereplicate.pl
inflate.pl - Use this script if you've used dereplicate.pl to remove duplicate reads and wish to know what the original number of reads would have been had you used the original dataset.
GeneralTools:
folderLevelSize.pl
usageStats.pl
firefox_already_running.sh - Helps fight that pesky 'firefox already running' error. You know which one.
HomologyTools:
basicHF.pl
uClustHomology.pl
JGITools:
binTablesForIMG.pl - Given a list of contigs, reformat it for upload to the IMG Scaffold workspace
consolidateJGIdata.pl - Consolidate all the data generated by IMG annotations into one (or many, by bins) tab-delimited file.
extractGenomes.pl - Extract genomes from NCBI databases (nr/nt) using NCBI Taxon ID, curate and concatenate them to form your own customized database.
img_Bin_Classifier.pl - Use the IMG taxonomic classification of contigs/scaffolds to get the taxonomic makeup of each bin.
getGFF.pl - Given a list of contig names extract GFF data.
gff2tbl.pl - Read JGI's GFF3, Contigs and Gene_product files and produce a usable output in NCBI's ridiculous '.tbl' format.
measureCompleteness.pl - Given a fasta file and IMG consolidated data, look for the 36 essential bacterial housekeeping genes (Ciccarelli et al. 2006)
map_project_names.pl - Map IMG project names to your own. Creates symbolic links with your project names to extracted IMG tarballs.
separateInterleaved.pl - Separate interleaved files.
MapperTools:
getQueryList.pl
getSciNames.pl
itemize.pl - Use this script if you used mapper.pl for multiple datasets and wish to get a comparison for each dataset.
mapper_getQueryList.pl
mapper.pl
silvaTaxonAppend.pl
Modified_Source_Code_Tools:
extractContigReads.pl
NCBITools:
Ebot.Output.Extract.Gi.Title.Rev3.py - This script takes an NCBI GenBank file, extracts the gi#, the study name, the journal it came from and the sequence, and builds a tab-delimited output for each GenBank entry.
extractGenomes.pl - Extract genomes from NCBI databases (nr/nt) using NCBI Taxon ID, curate and concatenate them to form your own customized database.
extractGenbankMetadata.pl - Extract metadata and protein translations from a GenBank format file.
gbk2fna.pl - Read Genbank file and convert it to a Nucleotide Fasta file.
getFastaFromAccNos.pl
getGFF.pl - Given a list of contig names extract GFF data.
getGIAnnotation.sh
getGiInfo.pl
getGI.pl
getLineage.pl
getSciNames.pl
gff2tbl.pl - Read GFF3 (tested with JGI's GFF), Contigs and Gene_product files and produce a usable output in NCBI's ridiculous '.tbl' format.
GI_info_XMLParser.pl
PPTTools:
ppt_getGI.pl - Get GI numbers for a search term
ppt_getXML.pl - Get TinySeq XML for the GI numbers
parseTinySeqXML.xslt - Reformat TinySeq XML to tabular data
derep+alias.pl - Remove duplicates and assign unique ids to sequences; print output in fasta format along with other metadata. (legacy)
createPhgDB.sh - Bash wrapper that uses these scripts to create the NCBI portion of the PhgDB.
RibopickerTools:
silvaTaxonAppend.pl
taxonDist.pl
SeqTools:
addFileName2header.pl - Add the name of the fasta/fastq file (without the extension) to the header. Useful when you merge multiple fasta/fastq files and wish to keep track of sequences easily
batchBlast.pl - Run multiple Blasts at once in an "embarrassingly" parallel manner
calcN50.pl - Calculate N50 and L50 values for a fasta file
chopper.pl - Chop a file (fasta/fastq/tab-delimited/multiple-line) into multiple parts.
createFastq.pl - Use the fasta and quality files to produce fastq files
CRISPR_spacer_extractor.pl - Given a fasta file with repeat sequences and a contig fasta file, get the positions of these repeats in the contigs and find the coordinates of spacers
curateDB.pl
dereplicate.pl
extractSeqs.pl
extractSubSeq.pl
findStretchesOfNs.pl
gcSkew.pl
genomeCheck.pl
getRandomData.pl
iClust.pl
interleave.pl
kmerFreq.pl
length+GC.pl
limit2Length.pl
mapper.pl
parseFastq.pl
renameHeaders.pl
sangerSeqParser.pl
separateInterleaved.pl - Separate interleaved files.
tetramer_freqs_esom.pl
toPhylipAndBack.pl
tinySeq2table.xslt - restructure TinySeq format XML file to a *tab-delimited* file
tinySeq2fasta.xslt - restructure TinySeq format XML file to a *FASTA* format file
triage.pl
U2T.pl
TabTools:
countInstances.pl
fileChopper.pl
getCol.pl
getSciNames.pl
TabTools/Tally_Compare:
getMasterList.pl
tally.pl
tallyWrap.pl
weave.pl
VelvetTools:
contigMetadata.pl
extractContigReads.pl
getMyContigs.pl
TinySeqTools:
tinySeq2table.xslt - restructure TinySeq format XML file to a *tab-delimited* file
tinySeq2fasta.xslt - restructure TinySeq format XML file to a *FASTA* format file
OmicsDBTools/Parsers:
gff2neo.pl - Read a GFF(version 3) file and create nodes and relationship files for upload to a GraphDB. Tested on Neo4j v2.2.3.
OmicsDBTools/Put_Data:
createNodes.pl - Read the output from the parsers and create nodes in the Neo4j database (not a real script; placeholder)
WebTools:
twitterscript.xml - Add twitter feed to the lab website
wrappers/Assembly:
assemble.pl
calcN50.pl
findStretchesOfNs.pl
interleave.pl
limit2Length.pl
mapping.sh - BWA mapping pipeline. Open the file in a text editor for help.
qc.sh - NGS QC pipeline
qc_no_derep.sh - NGS QC Pipeline without derep
qc_no_derep_no_interleave.sh - NGS QC Pipeline without derep or interleave
assemble.sh - NGS de novo assembly pipeline. Expects the QC to be done using qc.sh.
coveragePerScaffold.pl - Use the GenomeCoverageBed default output to calculate the coverage per scaffold and for the whole genome.
coveragePerBin.pl - Use the Conf and Scaffold files to get the coverage per bin.
wrappers/ESOM:
changeClasses.pl - Edit the *.cls file to change the class numbers for a given list of contig names.
esomCodonMod.pl
# esomTrain.pl - Normalize and train the ESOM. Use this after the esomWrapper.pl.
esomWrapper.pl
getClassFasta.pl
tetramer_freqs_esom.pl
addInfo2lrn.pl - Adds additional content to the lrn file.
img_Bin_Classifier.pl - Use the IMG taxonomic classification of contigs/scaffolds to get the taxonomic makeup of each bin.
wrappers/AntiSmash:
parallel_antiSmash.pl - Takes a multifasta file and runs antismash, single fasta at a time in an embarrassingly parallel manner. Requires "antiSmash v2.0.2" to be installed.
summarize_antiSmash.pl - Takes the antismash project directory as input and produces a tabular summary of gene clusters and smcogs in the form of counts and sequence names.
antiSmash_summary.sh - Summarizes the output from parallel_antiSmash.pl. Needs to run from within the main output folder for parallel_antiSmash.pl.