-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSowe-MAGs-Anvio.Rmd
652 lines (519 loc) · 34.6 KB
/
Sowe-MAGs-Anvio.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
---
title: Sowe MAGs Anvi'o
output:
html_document:
toc: yes
toc_float: yes
pdf_document:
toc: yes
fig_width: 10
fig_height: 10
---
Note that the first part of this was carried out using Anvi'o 6.2 and then this was upgraded partway through.
# Set up environment
```{bash, eval=FALSE}
conda activate anvio-6.2
```
# Assemble contigs from reads
This step used megahit to co-assemble contigs from all samples, keeping only contigs with a length of at least 1000 bp.
```{bash, eval=FALSE}
R1=$( ls trimmed_reads/*_R1.fq.gz | tr '\n' ',' | sed 's/,$//' )
R2=$( ls trimmed_reads/*_R2.fq.gz | tr '\n' ',' | sed 's/,$//' )
megahit -1 $R1 \
-2 $R2 \
--min-contig-len 1000 \
--num-cpu-threads 12 \
--presets meta-large \
--memory 0.3 \
-o anvio/megahit_out \
--verbose
# Remove intermediate contigs to save space
rm -r megahit_out_by_set/*/intermediate_contigs
```
# Reformat fasta for Anvi'o
```{bash, eval=FALSE}
anvi-script-reformat-fasta megahit_out/final.contigs.fa \
--simplify-names \
--min-len 1000 \
-o megahit_out/final.contigs.fixed.fa
```
# Make Anvi'o database
```{bash, eval=FALSE}
mkdir anvio_databases
anvi-gen-contigs-database -f megahit_out/final.contigs.fixed.fa \
-o anvio_databases/CONTIGS.db \
-n sowe
```
# Run HMMs to identify genes
```{bash, eval=FALSE}
anvi-run-hmms -c anvio_databases/CONTIGS.db --num-threads 12
anvi-get-sequences-for-gene-calls -c anvio_databases/CONTIGS.db -o anvio_databases/gene_calls.fa
```
# Classify contigs using Kaiju
```{bash, eval=FALSE}
# install kaiju
# conda install -c bioconda kaiju
kaiju -t /scratch/db/kaiju_db_nr_euk/nodes.dmp \
-f /scratch/db/kaiju_db_nr_euk/kaiju_db_nr_euk.fmi \
-i anvio_databases/gene_calls.fa \
-o anvio_databases/gene_calls.nr.out \
-z 10 \
-v
kaiju-addTaxonNames -t /scratch/db/kaiju_db_nr_euk/nodes.dmp \
-n /scratch/db/kaiju_db_nr_euk/names.dmp \
-i anvio_databases/gene_calls.nr.out \
-o anvio_databases/gene_calls.nr.names \
-r superkingdom,phylum,class,order,family,genus,species
anvi-import-taxonomy-for-genes -i anvio_databases/gene_calls.nr.names \
-c anvio_databases/CONTIGS.db \
-p kaiju \
--just-do-it
```
# Get the abundance of contigs in each sample
(by mapping with Bowtie2 back to the reads)
Make the bowtie2 database of the contigs:
```{bash, eval=FALSE}
bowtie2-build megahit_out/final.contigs.fixed.fa megahit_out/final.contigs.fixed
mkdir bam_files
```
Make the sample_ids.txt file (in Python):
```{python, eval=FALSE}
import os
files = ['D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10', 'D11', 'D12']
with open('sample_ids.txt', 'w') as f:
for sample in files:
f.write(sample+'\n')
```
Map the contigs to the reads:
```{bash, eval=FALSE}
for SAMPLE in `awk '{print $1}' sample_ids.txt`
do
# do the bowtie mapping to get the SAM file:
bowtie2 --threads 12 \
-x anvio/megahit_out/final.contigs.fixed \
-1 "trimmed_reads/"$SAMPLE"_R1.fq.gz" \
-2 "trimmed_reads/"$SAMPLE"_R2.fq.gz" \
--no-unal \
-S anvio/bam_files/$SAMPLE.sam
# covert the resulting SAM file to a BAM file:
samtools view -F 4 -bS anvio/bam_files/$SAMPLE.sam > anvio/bam_files/$SAMPLE-RAW.bam
# sort and index the BAM file:
samtools sort anvio/bam_files/$SAMPLE-RAW.bam -o anvio/bam_files/$SAMPLE.bam
samtools index anvio/bam_files/$SAMPLE.bam
# remove temporary files:
rm anvio/bam_files/$SAMPLE.sam anvio/bam_files/$SAMPLE-RAW.bam
done
```
# Make the Anvi'o profile databases that contain coverage and detection statistics
```{bash, eval=FALSE}
mkdir anvio_databases/profiles
for SAMPLE in `awk '{print $1}' sample_ids.txt`
do
anvi-profile -c anvio_databases/CONTIGS.db \
-i bam_files/$SAMPLE.bam \
--num-threads 12 \
-o anvio_databases/profiles/$SAMPLE
done
```
# Merge the sample profiles
```{bash, eval=FALSE}
anvi-merge -c anvio_databases/CONTIGS.db \
-o anvio_databases/merged_profiles \
anvio_databases/profiles/D*/PROFILE.db
```
# Upgrade to Anvi'o 7
```{bash, eval=FALSE}
anvi-migrate --migrate-dbs-safely anvio_databases/CONTIGS.db anvio_databases/merged_profiles/PROFILE.db
```
# Analyses with contigs >5000 bp {.tabset}
Note that these analyses with contigs >5000 bp were not used for the final MAGs, although some of the annotations with KEGG and with the SCG taxonomy were applied to all contigs so these were what was used for the final >2500 bp MAGs.
## Cluster contigs 5000 bp minimum
```{bash, eval=FALSE}
anvi-cluster-contigs -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
--collection-name "merged_concoct_5000" \
--driver concoct \
--length-threshold 5000 \
--num-threads 12 \
--just-do-it
```
This initially gave 155 bins, 72 of which were >50% completion, which we needed to summarise and then manually refine.
## Summarise
```{bash, eval=FALSE}
mkdir concoct_summary_5000
anvi-summarize -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "merged_concoct_5000" \
-o concoct_summary_5000/summary
```
## Estimate completeness
```{bash, eval=FALSE}
anvi-estimate-genome-completeness -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "merged_concoct_5000"
```
## Interactive refinement
```{bash, eval=FALSE}
anvi-refine -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "merged_concoct_5000" \
-b Bin_131_3 --server-only -P 8082
#in second window
ssh -L 8082:localhost:8082 vinko@kronos.pharmacology.dal.ca
#go to http://localhost:8082/ in browser
```
These steps were repeated for all bins, somtimes multiple times, to get redundancy <10% for each bin. After this we had 91 bins that were >50% completion.
## Rename bins
```{bash, eval=FALSE}
anvi-rename-bins -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
--collection-to-read merged_concoct_5000 \
--collection-to-write FINAL \
--size-for-MAG 1 \
--min-completion-for-MAG 50 \
--max-redundancy-for-MAG 10 \
--prefix SOWE \
--report-file renaming_bins.txt
```
## Make new collection with these
This should have also removed bins that were <50% completion or >10% redundancy, but for some reason these weren't removed, so we made a new collection with these removed:
```{bash, eval=FALSE}
anvi-rename-bins -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
--collection-to-read merged_concoct_5000 \
--collection-to-write FINAL \
--size-for-MAG 1 \
--min-completion-for-MAG 50 \
--max-redundancy-for-MAG 10 \
--prefix SOWE \
--report-file renaming_bins.txt
anvi-summarize -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "FINAL" \
-o SUMMARY/summary
anvi-export-collection -C FINAL \
-p anvio_databases/merged_profiles/PROFILE.db
```
Make new collection using Python:
```{python, eval=FALSE}
import pandas as pd
rename = pd.read_csv('renaming_bins.txt', header=0, sep='\t')
rename = rename.loc[rename['completion'] >= 50]
rename = rename.loc[rename['redundancy'] <= 10]
rename = rename.loc[rename['size_in_Mbp'] >= 1]
rename = rename.set_index('new_bin_name')
bins_keeping = list(rename.index.values)
final = pd.read_csv('collection-FINAL.txt', sep='\t', header=None)
keeping = []
for row in final.index.values:
if final.loc[row, 1] in bins_keeping:
keeping.append(row)
filtered = final.loc[keeping, :]
filtered.to_csv('collection-FINAL-filtered.txt', header=False, index=False, sep='\t')
quit()
```
Import the reduced collection:
```{bash, eval=FALSE}
anvi-import-collection -C FINALfiltered \
-p anvio_databases/merged_profiles/PROFILE.db \
-c anvio_databases/CONTIGS.db \
collection-FINAL-filtered.txt
```
## Get taxonomy for all SCG
Then we got the taxonomy of all single copy genes within each bin, following the instructions [here](https://merenlab.org/2019/10/08/anvio-scg-taxonomy/):
```{bash, eval=FALSE}
anvi-get-sequences-for-hmm-hits -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "FINALfiltered" \
-o hmm_hits/"concat-proteins.fa" \
--hmm-source 'Bacteria_71' \
--return-best-hit \
--get-aa-sequences \
--concatenate
anvi-setup-scg-taxonomy
anvi-run-scg-taxonomy -c anvio_databases/CONTIGS.db \
-T 12
anvi-estimate-scg-taxonomy -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
--update-profile-db-with-taxonomy \
--compute-scg-coverages \
--metagenome-mode
anvi-summarize -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "FINALfiltered" \
-o FINAL/summary
```
Now get taxonomy with CheckM:
```{bash, eval=FALSE}
mkdir MAGs
for i in FINAL/summary/bin_by_bin/* ;
do
MAG="$(basename -- $i)"
anvi-script-reformat-fasta FINAL/summary/bin_by_bin/$MAG/$MAG-contigs.fa \
--simplify-names \
--prefix $MAG \
-o MAGs/$MAG.fa
done
checkm tree MAGs -x .fa -t 20 `pwd`/MAGs-CHECKM-TREE
checkm tree_qa `pwd`/MAGs-CHECKM-TREE -f MAGs-CHECKM.txt
```
Reformat CheckM taxonomy:
```{python, eval=FALSE}
import pandas as pd
mags = pd.read_csv('MAGs-CHECKM.txt', header=None)
file = list(mags.loc[:, 0])
new_file = []
for row in file:
row = row.split(' ')
new_row = [r for r in row if r != '']
new_file.append(new_row)
new_file = new_file[3:-1]
new_file_df = pd.DataFrame(new_file, columns=['Bin Id', '# unique markers (of 43)', '# multi-copy', 'Taxonomy'])
new_file_df.to_csv('MAGs-CHECKM-reformat.txt', index=False, sep='\t')
```
## Annotate with KEGG
Following the instructions [here](https://merenlab.org/software/anvio/help/7/programs/anvi-setup-kegg-kofams/) and [here](https://merenlab.org/software/anvio/help/7/programs/anvi-run-kegg-kofams/).
```{bash, eval=FALSE}
anvi-setup-kegg-kofams
anvi-run-kegg-kofams -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "FINALfiltered" \
-T 20
```
## .
At this point we realised that we didn't have any Pseudomonas, that we were particularly interested in because these were so abundant in the weathered LDPE samples according to the Kraken2 output. So we decided to go back and refine the bins that were created using all contigs >2500 bp.
# Analyses with contigs >2500 bp
## Cluster contigs
```{bash, eval=FALSE}
anvi-cluster-contigs -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
--collection-name "merged_concoct_2500" \
--driver concoct \
--length-threshold 2500 \
--num-threads 12 \
--just-do-it
```
This gave 279 bins, 155 of which had completion >50%.
## Summarise
```{bash, eval=FALSE}
anvi-summarize -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "merged_concoct_2500" \
-o concoct_summary_2500/summary
```
## Manually refine
```{bash, eval=FALSE}
anvi-refine -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "merged_concoct_2500" \
-b Bin_65 --server-only -P 8082
#In separate window
ssh -L 8082:localhost:8082 vinko@kronos.pharmacology.dal.ca
#go to http://localhost:8082/ in browser
```
These steps were repeated multiple times to get redundancy <10% in all bins >50% completion. This gave a final 215 bins.
## Rename bins
```{bash, eval=FALSE}
anvi-rename-bins -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
--collection-to-read merged_concoct_2500 \
--collection-to-write FINAL_2500 \
--size-for-MAG 1 \
--min-completion-for-MAG 50 \
--max-redundancy-for-MAG 10 \
--prefix SOWE_2500 \
--report-file renaming_bins.txt
anvi-export-collection -C FINAL_2500 \
-p anvio_databases/merged_profiles/PROFILE.db
```
## Reduce collection
```{python, eval=FALSE}
import pandas as pd
rename = pd.read_csv('renaming_bins.txt', header=0, sep='\t')
rename = rename.loc[rename['completion'] >= 50]
rename = rename.loc[rename['redundancy'] <= 10]
rename = rename.loc[rename['size_in_Mbp'] >= 1]
rename = rename.set_index('new_bin_name')
bins_keeping = list(rename.index.values)
final = pd.read_csv('collection-FINAL_2500.txt', sep='\t', header=None)
keeping = []
for row in final.index.values:
if final.loc[row, 1] in bins_keeping:
keeping.append(row)
filtered = final.loc[keeping, :]
filtered.to_csv('collection-FINAL_2500-filtered.txt', header=False, index=False, sep='\t')
```
## Import reduced collection
```{bash, eval=FALSE}
anvi-import-collection -C FINAL_2500_filtered \
-p anvio_databases/merged_profiles/PROFILE.db \
-c anvio_databases/CONTIGS.db \
collection-FINAL_2500-filtered.txt
anvi-summarize -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "FINAL_2500_filtered" \
-o FINAL_2500_filtered/summary
```
## Get taxonomy
```{bash, eval=FALSE}
anvi-estimate-scg-taxonomy -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C FINAL_2500_filtered \
--compute-scg-coverages \
-o FINAL_2500_filtered_scg-taxonomy.txt
```
## Rename using taxonomy
```{bash, eval=FALSE}
anvi-export-collection -p anvio_databases/merged_profiles/PROFILE.db \
-C FINAL_2500_filtered \
-O FINAL_2500_filtered_scg-export
perl -e '$namesfile=shift; open(F,$namesfile); $count=1;while ($line = <F>){next if $line =~ /^bin_name/; chomp $line; @temp=split("\t",$line); $species_name=$temp[7]."_".$temp[9]."_".$count; $count++;$species_name=~s/\s+/_/g; $tax_id=$temp[0]; $hash{$tax_id}=$species_name;} $splitfile=shift; open(S,$splitfile); while ($s = <S>){chomp $s; @temp1=split("\t",$s); $t_id=$temp1[1]; $tax_name=$hash{$t_id}; print "$temp1[0]\t$tax_name\n";}' FINAL_2500_filtered_scg-taxonomy.txt FINAL_2500_filtered_scg-export.txt > FINAL_2500_filtered_scg-export-wtax.txt
#Got an error after this:
#This was the error:
#Config Error: Sorry, bin name can't start with a digit. Long story. Please specify a name that
# starts with an ASCII letter.
#So this is because one of the family names starts with a number
#Just rerunning the perl code with a prefix for the name
perl -e '$namesfile=shift; open(F,$namesfile); $count=1;while ($line = <F>){next if $line =~ /^bin_name/; chomp $line; @temp=split("\t",$line); $species_name="SOWE_".$temp[7]."_".$temp[9]."_".$count; $count++;$species_name=~s/\s+/_/g; $tax_id=$temp[0]; $hash{$tax_id}=$species_name;} $splitfile=shift; open(S,$splitfile); while ($s = <S>){chomp $s; @temp1=split("\t",$s); $t_id=$temp1[1]; $tax_name=$hash{$t_id}; print "$temp1[0]\t$tax_name\n";}' FINAL_2500_filtered_scg-taxonomy.txt FINAL_2500_filtered_scg-export.txt > FINAL_2500_filtered_scg-export-wtax.txt
anvi-import-collection FINAL_2500_filtered_scg-export-wtax.txt \
-p anvio_databases/merged_profiles/PROFILE.db \
-c anvio_databases/CONTIGS.db \
-C FINAL_2500_filtered_wtax
```
## Interactive view of MAGs
```{bash, eval=FALSE}
anvi-interactive -p anvio_databases/merged_profiles/PROFILE.db \
-c anvio_databases/CONTIGS.db \
-C FINAL_2500_filtered_wtax \
--server-only -P 8082
anvi-summarize -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "FINAL_2500_filtered_wtax" \
-o FINAL_2500_filtered_wtax/summary
```
## Get fasta file of each MAG
```{bash, eval=FALSE}
mkdir MAGs-2
for i in FINAL_2500_filtered_wtax/summary/bin_by_bin/* ;
do
MAG="$(basename -- $i)"
anvi-script-reformat-fasta FINAL_2500_filtered_wtax/summary/bin_by_bin/$MAG/$MAG-contigs.fa \
--simplify-names \
--prefix $MAG \
-o MAGs/$MAG.fa
done
```
This didn't work for the MAGs that had a hyphen in the name, as it said that this wasn't an OK name. So redoing these in Python:
```{python, eval=FALSE}
import os
MAGs = os.listdir('FINAL_2500_filtered_wtax/summary/bin_by_bin/')
MAGs = [MAG for MAG in MAGs if '-' in MAG]
for MAG in MAGs:
new_MAG = MAG.replace('-', '_')
os.system('anvi-script-reformat-fasta FINAL_2500_filtered_wtax/summary/bin_by_bin/'+MAG+'/'+MAG+'-contigs.fa --simplify-names --prefix '+new_MAG+' -o MAGs/'+new_MAG+'.fa')
```
## Get CheckM taxonomy
```{bash, eval=FALSE}
checkm tree MAGs -x .fa -t 20 `pwd`/MAGs-CHECKM-TREE
checkm tree_qa `pwd`/MAGs-CHECKM-TREE -f MAGs-CHECKM.txt
```
## Put MAGs into tree
Check which single copy genes we have in a good number of the MAGs (note we can choose multiple genes to use):
```{bash, eval=FALSE}
anvi-get-sequences-for-hmm-hits -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "FINAL_2500_filtered_wtax" \
--hmm-source 'Bacteria_71' \
--list-available-gene-names
* Bacteria_71 [type: singlecopy]: ADK, AICARFT_IMPCHas, ATP-synt, ATP-synt_A,
Chorismate_synt, EF_TS, Exonuc_VII_L, GrpE, Ham1p_like, IPPT, OSCP, PGK,
Pept_tRNA_hydro, RBFA, RNA_pol_L, RNA_pol_Rpb6, RRF, RecO_C, Ribonuclease_P,
Ribosom_S12_S23, Ribosomal_L1, Ribosomal_L13, Ribosomal_L14, Ribosomal_L16,
Ribosomal_L17, Ribosomal_L18p, Ribosomal_L19, Ribosomal_L2, Ribosomal_L20,
Ribosomal_L21p, Ribosomal_L22, Ribosomal_L23, Ribosomal_L27, Ribosomal_L27A,
Ribosomal_L28, Ribosomal_L29, Ribosomal_L3, Ribosomal_L32p, Ribosomal_L35p,
Ribosomal_L4, Ribosomal_L5, Ribosomal_L6, Ribosomal_L9_C, Ribosomal_S10,
Ribosomal_S11, Ribosomal_S13, Ribosomal_S15, Ribosomal_S16, Ribosomal_S17,
Ribosomal_S19, Ribosomal_S2, Ribosomal_S20p, Ribosomal_S3_C, Ribosomal_S6,
Ribosomal_S7, Ribosomal_S8, Ribosomal_S9, RsfS, RuvX, SecE, SecG, SecY, SmpB,
TsaE, UPF0054, YajC, eIF-1a, ribosomal_L24, tRNA-synt_1d, tRNA_m1G_MT,
Adenylsucc_synt
```
Get amino acid sequences for these genes (I'm just using the ones from the example for now, and will see how they do!):
```{bash, eval=FALSE}
anvi-get-sequences-for-hmm-hits -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "FINAL_2500_filtered_wtax" \
-o concatenated-proteins_all_genes.fa \
--hmm-source Bacteria_71 \
--gene-names Chorismate_synt,EF_TS,Exonuc_VII_L,GrpE,Ham1p_like,IPPT,OSCP,PGK,Pept_tRNA_hydro,RBFA,RNA_pol_L,RNA_pol_Rpb6,RRF,RecO_C,Ribonuclease_P,Ribosom_S12_S23,Ribosomal_L1,Ribosomal_L13,Ribosomal_L14,Ribosomal_L16,Ribosomal_L17,Ribosomal_L18p,Ribosomal_L19,Ribosomal_L2,Ribosomal_L20,Ribosomal_L21p,Ribosomal_L22,Ribosomal_L23,Ribosomal_L27,Ribosomal_L27A,Ribosomal_L28,Ribosomal_L29,Ribosomal_L3,Ribosomal_L32p,Ribosomal_L35p,Ribosomal_L4,Ribosomal_L5,Ribosomal_L6,Ribosomal_L9_C,Ribosomal_S10,Ribosomal_S11,Ribosomal_S13,Ribosomal_S15,Ribosomal_S16,Ribosomal_S17,Ribosomal_S19,Ribosomal_S2,Ribosomal_S20p,Ribosomal_S3_C,Ribosomal_S6,Ribosomal_S7,Ribosomal_S8,Ribosomal_S9,RsfS,RuvX,SecE,SecG,SecY,SmpB,TsaE,UPF0054,YajC,eIF-1a,ribosomal_L24,tRNA-synt_1d,tRNA_m1G_MT,Adenylsucc_synt \
--return-best-hit \
--get-aa-sequences \
--concatenate
anvi-get-sequences-for-hmm-hits -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "FINAL_2500_filtered_wtax" \
-o concatenated-nucleotides_all_ribosomalS.fa \
--hmm-source Bacteria_71 \
--gene-names Ribosomal_S10,Ribosomal_S11,Ribosomal_S13,Ribosomal_S15,Ribosomal_S16,Ribosomal_S17,Ribosomal_S19,Ribosomal_S2,Ribosomal_S20p,Ribosomal_S3_C,Ribosomal_S6,Ribosomal_S7,Ribosomal_S8,Ribosomal_S9 \
--return-best-hit \
--concatenate
anvi-get-sequences-for-hmm-hits -c anvio_databases/CONTIGS.db \
-p anvio_databases/merged_profiles/PROFILE.db \
-C "FINAL_2500_filtered_wtax" \
-o concatenated-proteins_all_ribosomalS.fa \
--hmm-source Bacteria_71 \
--gene-names Ribosomal_S10,Ribosomal_S11,Ribosomal_S13,Ribosomal_S15,Ribosomal_S16,Ribosomal_S17,Ribosomal_S19,Ribosomal_S2,Ribosomal_S20p,Ribosomal_S3_C,Ribosomal_S6,Ribosomal_S7,Ribosomal_S8,Ribosomal_S9 \
--return-best-hit \
--get-aa-sequences \
--concatenate
```
This used Muscle.
Now make the tree (Newick format):
```{bash, eval=FALSE}
anvi-gen-phylogenomic-tree -f concatenated-proteins_all_genes.fa \
-o phylogenomic-tree_all_genes.txt
anvi-gen-phylogenomic-tree -f concatenated-proteins_ribosomalS.fa \
-o phylogenomic-tree_ribosomalS.txt
```
# Run PathoFact using 2500 bp MAGs {.tabset}
Installed following the instructions [here](https://git-r3lab.uni.lu/laura.denies/PathoFact/-/tree/master). Note that we weren't successfully able to run `git lfs install` but everything else worked so we assume this didn't matter.
The config.yaml file:
```{python, eval=FALSE}
pathofact:
sample: ['SOWE_Burkholderiaceae_Limnohabitans_sp005789825_186-contigs', 'SOWE_Moraxellaceae_None_3-contigs', 'SOWE_Methylophilaceae_None_85-contigs', 'SOWE_Burkholderiaceae_None_66-contigs', 'SOWE_Aerococcaceae_None_8-contigs', 'SOWE_Burkholderiaceae_None_161-contigs', 'SOWE_Moraxellaceae_HYN0046_sp003351745_131-contigs', 'SOWE_Bacteriovoracaceae_None_211-contigs', 'SOWE_UBA7239_UBA7239_sp002333095_194-contigs', 'SOWE_Methylophilaceae_Methylopumilus_universalis_77-contigs', 'SOWE_Ilumatobacteraceae_UBA2093_sp005788815_149-contigs', 'SOWE_None_None_20-contigs', 'SOWE_Nitrospiraceae_Nitrospira_sp900170025_162-contigs', 'SOWE_Moraxellaceae_Alkanindiges_sp001982605_78-contigs', 'SOWE_Burkholderiaceae_Sphaerotilus_natans_193-contigs', 'SOWE_Burkholderiaceae_JOSHI-001_sp005403045_74-contigs', 'SOWE_Burkholderiaceae_None_213-contigs', 'SOWE_SG8-38_SG8-38_sp001303415_13-contigs', 'SOWE_Opitutaceae_Lacunisphaera_sp900104925_46-contigs', 'SOWE_Crocinitomicaceae_M0103_sp006227105_111-contigs', 'SOWE_Akkermansiaceae_UBA956_sp002293145_137-contigs', 'SOWE_Spirosomaceae_None_201-contigs', 'SOWE_None_None_196-contigs', 'SOWE_None_None_32-contigs', 'SOWE_Spirosomaceae_Flectobacillus_major_26-contigs', 'SOWE_Burkholderiaceae_Rhodoferax_sp003415675_139-contigs', 'SOWE_Moraxellaceae_Agitococcus_sp002333125_59-contigs', 'SOWE_Leptospiraceae_UBA2033_sp002333425_47-contigs', 'SOWE_Microbacteriaceae_Rhodoglobus_sp004297555_166-contigs', 'SOWE_Burkholderiaceae_None_151-contigs', 'SOWE_Burkholderiaceae_None_33-contigs', 'SOWE_Burkholderiaceae_JOSHI-001_sp005403045_172-contigs', 'SOWE_Burkholderiaceae_None_70-contigs', 'SOWE_Rhodobacteraceae_Pseudorhodobacter_sp000176015_108-contigs', 'SOWE_EnvOPS12_None_160-contigs', 'SOWE_Burkholderiaceae_None_86-contigs', 'SOWE_None_None_157-contigs', 'SOWE_Burkholderiaceae_None_133-contigs', 'SOWE_Burkholderiaceae_Rubrivivax_sp001464055_169-contigs', 'SOWE_Spirosomaceae_Arcicella_aurantiaca_54-contigs', 'SOWE_Nitrososphaeraceae_TA-21_sp005877345_106-contigs', 'SOWE_Saprospiraceae_OLB8_sp001567405_112-contigs', 'SOWE_Burkholderiaceae_None_102-contigs', 'SOWE_Polyangiaceae_None_123-contigs', 'SOWE_Sphingobacteriaceae_None_146-contigs', 'SOWE_Flavobacteriaceae_None_159-contigs', 'SOWE_Pseudomonadaceae_None_140-contigs', 'SOWE_Moraxellaceae_HYN0046_sp003351745_176-contigs', 'SOWE_Rhodocyclaceae_None_14-contigs', 'SOWE_Weeksellaceae_Kaistella_chaponense_28-contigs', 'SOWE_BACL12_UBA7236_sp002473275_5-contigs', 'SOWE_Burkholderiaceae_None_116-contigs', 'SOWE_Xanthomonadaceae_Arenimonas_oryziterrae_91-contigs', 'SOWE_None_None_122-contigs', 'SOWE_None_None_132-contigs', 'SOWE_Burkholderiaceae_None_64-contigs', 'SOWE_None_None_95-contigs', 'SOWE_B-17BO_UBA2475_sp002319075_43-contigs', 'SOWE_Rhodocyclaceae_None_152-contigs', 'SOWE_Polyangiaceae_Polyangium_fumosum_181-contigs', 'SOWE_B-17BO_UBA4416_sp002420145_101-contigs', 'SOWE_Spirosomaceae_None_73-contigs', 'SOWE_None_None_130-contigs', 'SOWE_UBA953_None_55-contigs', 'SOWE_None_None_167-contigs', 'SOWE_Arcobacteraceae_Aliarcobacter_caeni_200-contigs', 'SOWE_Burkholderiaceae_None_141-contigs', 'SOWE_Verrucomicrobiaceae_None_127-contigs', 'SOWE_None_None_190-contigs', 'SOWE_None_None_125-contigs', 'SOWE_None_None_174-contigs', 'SOWE_None_None_61-contigs', 'SOWE_Burkholderiaceae_Sphaerotilus_natans_180-contigs', 'SOWE_None_None_148-contigs', 'SOWE_Moraxellaceae_None_214-contigs', 'SOWE_Xanthomonadaceae_Arenimonas_sp001801685_94-contigs', 'SOWE_Burkholderiaceae_None_185-contigs', 'SOWE_Leptotrichiaceae_None_45-contigs', 'SOWE_Burkholderiaceae_None_178-contigs', 'SOWE_None_None_63-contigs', 'SOWE_Burkholderiaceae_None_118-contigs', 'SOWE_Flavobacteriaceae_None_184-contigs', 'SOWE_Pseudomonadaceae_None_79-contigs', 'SOWE_Moraxellaceae_UBA2031_sp003543245_56-contigs', 'SOWE_None_None_170-contigs', 'SOWE_B-17BO_UBA4416_sp002420145_144-contigs', 'SOWE_None_None_113-contigs', 'SOWE_Polyangiaceae_Polyangium_fumosum_34-contigs', 'SOWE_B-17BO_UBA2475_sp002319075_49-contigs', 'SOWE_Rhabdochlamydiaceae_Rhabdochlamydia_sp901000775_22-contigs', 'SOWE_Weeksellaceae_None_12-contigs', 'SOWE_Burkholderiaceae_None_52-contigs', 'SOWE_Burkholderiaceae_JOSHI-001_sp001770815_44-contigs', 'SOWE_Verrucomicrobiaceae_Prosthecobacter_sp003506995_188-contigs', 'SOWE_Rhodocyclaceae_Zoogloea_ramigera_195-contigs', 'SOWE_Rhodocyclaceae_Azonexus_aromatica_205-contigs', 'SOWE_Rhodocyclaceae_Zoogloea_ramigera_134-contigs', 'SOWE_Arcobacteraceae_Aliarcobacter_cryaerophilus_41-contigs', 'SOWE_Gallionellaceae_None_37-contigs', 'SOWE_Sphi$gomonadaceae_None_110-contigs', 'SOWE_Burkholderiaceae_None_136-contigs', 'SOWE_Burkholderiaceae_Rhodoferax_sp002381045_147-contigs', 'SOWE_Burkholder$aceae_None_206-contigs', 'SOWE_Burkholderiaceae_None_99-contigs', 'SOWE_Sphingomonadaceae_None_2-contigs', 'SOWE_None_None_105-contigs', 'SOWE_Sphingo$onadaceae_Sphingorhabdus_contaminans_87-contigs', 'SOWE_None_None_168-contigs', 'SOWE_Palsa-1005_REAM01_sp004293675_29-contigs', 'SOWE_Burkholderiaceae_Giesbergeria_psychrophila_198-contigs', 'SOWE_Burkholderiaceae_JOSHI-001_sp001770815_179-contigs', 'SOWE_Burkholderiaceae_Giesbergeria_suum_38-contigs', 'SOWE_AWTP1-31_AWTP1-31_sp003962975_72-contigs', 'SOWE_Moraxellaceae_UBA2031_sp002333145_4-contigs', 'SOWE_Burkholderiaceae_None_155-contigs', 'SOWE_Burkholderiaceae_None_69-contigs', 'SOWE_Hyphomonadaceae_UBA7672_sp002483135_171-contigs', 'SOWE_None_None_98-contigs', 'SOWE_Burkholderiaceae_Rhizobacter_sp001425865_135-contigs', 'SOWE_Burkholderiaceae_Hydromonas_sp003339525_92-contigs', 'SOWE_Burkholderiaceae_None_210-contigs', 'SOWE_Flavobacteriaceae_None_35-contigs', 'SOWE_Diplorickettsiaceae_Rickettsiella_sp002290645_58-contigs', 'SOWE_UBA10799_UBA10799_sp003452655_57-contigs', 'SOWE_AWTP1-31_AWTP1-31_sp003962975_128-contigs', 'SOWE_Burkholderiaceae_JOSHI-001_sp005403045_209-contigs', 'SOWE_Polyangiaceae_None_82-contigs', 'SOWE_Caedimonadaceae_Caedimonas_varicaedens_53-contigs', 'SOWE_Rhodocyclaceae_None_10-contigs', 'SOWE_Saprospiraceae_Haliscomenobacter_hydrossis_153-contigs', 'SOWE_Burkholderiaceae_Hydromonas_sp003339525_11-contigs', 'SOWE_UBA953_UBA8828_sp003485955_16-contigs', 'SOWE_None_None_163-contigs', 'SOWE_None_None_104-contigs', 'SOWE_Burkholderiaceae_None_145-contigs', 'SOWE_Thiotrichaceae_Thiolinea_eikelboomii_6-contigs', 'SOWE_UBA4408_UBA4408_sp002389785_40-contigs', 'SOWE_Sphingomonadaceae_None_67-contigs', 'SOWE_None_None_75-contigs', 'SOWE_UBA955_UBA955_sp002293105_36-contigs', 'SOWE_Sphingomonadaceae_Polymorphobacter_sp004681125_115-contigs', 'SOWE_None_None_187-contigs', 'SOWE_Nannocystaceae_Enhygromyxa_salina_31-contigs', 'SOWE_Burkholderiaceae_Leptothrix_cholodnii_65-contigs', 'SOWE_Burkholderiaceae_None_177-contigs', 'SOWE_UBA2999_Gp6-AA45_sp003222535_7-contigs', 'SOWE_Arcobacteraceae_Aliarcobacter_cloacae_191-contigs', 'SOWE_Burkholderiaceae_None_202-contigs', 'SOWE_Burkholderiaceae_None_124-contigs', 'SOWE_Flavobacteriaceae_Flavobacterium_glycines_21-contigs', 'SOWE_None_None_18-contigs', 'SOWE_Burkholderiaceae_None_208-contigs', 'SOWE_Chitinophagaceae_Ferruginibacter_sp003455235_30-contigs', 'SOWE_Nitrospiraceae_None_97-contigs', 'SOWE_Rhodocyclaceae_None_23-contigs', 'SOWE_Opitutaceae_None_76-contigs', 'SOWE_Burkholderiaceae_None_154-contigs', 'SOWE_None_None_84-contigs', 'SOWE_Rhodocyclaceae_None_89-contigs', 'SOWE_Burkholderiaceae_None_39-contigs', 'SOWE_Rhodanobacteraceae_Dokdonella_sp001899855_19-contigs', 'SOWE_SG8-38_UBA1660_sp002320815_51-contigs', 'SOWE_EnvOPS12_None_121-contigs', 'SOWE_Burkholderiaceae_None_83-contigs', 'SOWE_Xanthomonadaceae_Thermomonas_sp006337105_164-contigs', 'SOWE_Rhodobacteraceae_None_143-contigs', 'SOWE_Burkholderiaceae_None_142-contigs', 'SOWE_None_None_199-contigs', 'SOWE_Ilumatobacteraceae_UBA2093_sp005788815_150-contigs', 'SOWE_Burkholderiaceae_None_120-contigs', 'SOWE_Burkholderiaceae_Vitreoscilla_sp004359425_183-contigs', 'SOWE_Aeromonadaceae_None_107-contigs', 'SOWE_Streptococcaceae_None_62-contigs', 'SOWE_Burkholderiaceae_None_24-contigs', 'SOWE_Burkholderiaceae_None_100-contigs', 'SOWE_Methylophilaceae_Methylopumilus_universalis_42-contigs', 'SOWE_Propionibacteriaceae_Micropruina_glycogenica_204-contigs', 'SOWE_Crocinitomicaceae_M0103_sp006227105_175-contigs', 'SOWE_Nanopelagicaceae_Planktophila_limnetica_197-contigs', 'SOWE_Burkholderiaceae_Malikia_spinosa_215-contigs', 'SOWE_Spirosomaceae_Arcicella_aurantiaca_48-contigs', 'SOWE_Pseudomonadaceae_None_158-contigs', 'SOWE_2013-40CM-41-45_SHNE01_sp004295015_68-contigs', 'SOWE_Burkholderiaceae_None_93-contigs', 'SOWE_UBA11063_UBA11063_sp003963245_60-contigs', 'SOWE_Burkholderiaceae_Rhodoferax_sp002413825_81-contigs', 'SOWE_Burkholderiaceae_None_173-contigs', 'SOWE_Cyclobacteriaceae_None_9-contigs', 'SOWE_Moraxellaceae_None_27-contigs', 'SOWE_None_None_129-contigs', 'SOWE_Flavobacteriaceae_None_189-contigs', 'SOWE_Chitinophagaceae_None_114-contigs', 'SOWE_None_None_90-contigs', 'SOWE_Turneriellaceae_Turneriella_parva_50-contigs', 'SOWE_Burkholderiaceae_Rhizobacter_sp001425865_156-contigs', 'SOWE_None_None_119-contigs', 'SOWE_Chitinophagaceae_None_15-contigs', 'SOWE_Burkholderiaceae_None_203-contigs', 'SOWE_Chitinophagaceae_None_192-contigs', 'SOWE_Burkholderiaceae_Aquabacterium_sp004297345_138-contigs', 'SOWE_Palsa-1005_REAM01_sp004293675_182-contigs', 'SOWE_Burkholderiaceae_Rhodoferax_sp005793295_212-contigs', 'SOWE_Spirosomaceae_Arcicella_aurantiaca_71-contigs', 'SOWE_Burkholderiaceae_None_96-contigs', 'SOWE_Burkholderiaceae_Hydrogenophaga_sp002001205_88-contigs', 'SOWE_Burkholderiaceae_Undibacterium_sp003970805_126-contigs', 'SOWE_Enterobacteriaceae_None_1-contigs', 'SOWE_Burkholderiaceae_None_117-contigs', 'SOWE_UBA953_UBA953_sp002293065_25-contigs', 'SOWE_Akkermansiaceae_UBA956_sp002293145_165-contigs', 'SOWE_Burkholderiaceae_Hydromonas_sp003339525_103-contigs', 'SOWE_Burkholderiaceae_None_207-contigs', 'SOWE_Burkholderiaceae_None_109-contigs', 'SOWE_None_None_17-contigs', 'SOWE_UBA7239_UBA7239_sp002333095_80-contigs'] # requires user input
project: PathoFact_results # requires user input
datadir: /home/vinko/contigs_bins/ # requires user input
workflow: "complete" #options: "complete", "AMR", "Tox", "Vir"
size_fasta: 10000 #Adjustable to preference
scripts: "scripts"
signalp: "/home/vinko/tools/signalp-5.0b/bin" # requires user input
deepvirfinder: "submodules/DeepVirFinder/dvf.py"
tox_hmm: "databases/toxins/combined_Toxin.hmm"
tox_lib: "databases/library_HMM_Toxins.csv"
tox_threshold: 40 #Bitscore threshold of the toxin prediction, adjustable by user to preference
vir_hmm: "databases/virulence/Virulence_factor.hmm"
vir_domains: "databases/models_and_domains"
plasflow_threshold: 0.7
plasflow_minlen: 1000
runtime:
short: "00:10:00"
medium: "01:00:00"
long: "02:00:00"
mem:
normal_mem_per_core_gb: "4G"
big_mem_cores: 24
big_mem_per_core_gb: "30G"
```
PathoFact was then run using:
```{bash, eval=FALSE}
conda activate PathoFact
snakemake -s Snakefile --use-conda --reason --cores 10 -p
```
Note that we were unable to run this using more threads as it then requires very large amounts of memory (>500GB), I think for SignalP.
# GapSeq
Installed as instructions.
Run:
```{bash, eval=FALSE}
./gapseq doall /home/robyn/vinko/gapseq/SOWE_Pseudomonadaceae_None_140-contigs.fa
```