Time | Activity | Slides | Hands-on |
---|---|---|---|
Morning | MAG annotation and downstream analyses (Part 1) | Link here | Link here |
Afternoon | MAG annotation and downstream analyses (Part 2) | Link here | |
Afternoon | Closing remarks and open discussion |
First login to the Amazon Cloud and pull possible changes on our GitHub page:
cd physalia_metagenomics
git pull
First let's take a better look at what has REALLY been done in anvi'o
behind the scenes to prepare the data for binning.
Let's open the script:
less Scripts/ANVIO_MEGAHIT.sh
And after several rounds of binning and refining with anvi-interactive
and anvi-refine
, we have finally decided to call it a day.
We have then renamed the bins and made a summary of them with anvi-summarize
.
You can find the output of everything that has been done with the bins in ~/Share/BINNING_MEGAHIT
.
For the moment, let's take the summary file for each of the four samples and copy them to a new folder:
mkdir MAGs
for SAMPLE in Sample01 Sample02 Sample03 Sample04; do
cp ~/Share/BINNING_MEGAHIT/$SAMPLE/MAGsSummary/bins_summary.txt MAGs/$SAMPLE.bins_summary.txt
done
Since we're at it let's also take a couple of files summarizing the abundance of the MAGs across the different samples:
for SAMPLE in Sample01 Sample02 Sample03 Sample04; do
cp ~/Share/BINNING_MEGAHIT/$SAMPLE/MAGsSummary/bins_across_samples/mean_coverage.txt MAGs/$SAMPLE.mean_coverage.txt
cp ~/Share/BINNING_MEGAHIT/$SAMPLE/MAGsSummary/bins_across_samples/detection.txt MAGs/$SAMPLE.detection.txt
done
Later on we might import these summary tables to R.
For now let's take a look at how many MAGs we have:
cat MAGs/*bins_summary.txt | grep _MAG_ | wc -l
Not too shabby, innit?
Normally, one thing that we want to learn more about is the taxonomy of our MAGs.
Although anvi'o
gives us a preliminary idea, we can use a more dedicated platform for taxonomic assignment of MAGs.
Here we will use GTDBtk
, a tool to infer taxonomy for MAGs based on the GTDB database (you can - and probably should - read more about GTDB here).
We have already run GTDBtk
for you for; let's take a look at what it produced:
ls -lt ~/Share/GTDB
Let's copy the two summary files:
cp ~/Share/GTDB/gtdbtk.bac120.summary.tsv MAGs
cp ~/Share/GTDB/gtdbtk.ar122.summary.tsv MAGs
I particularly am curious about Sample03Short_MAG_00001
, the nice bin from yesterday that had no taxonomic assignment.
I wonder if it's an archaeon?
grep Sample03Short_MAG_00001 MAGs/gtdbtk.ar122.summary.tsv
It doesn't look like it, no...
Let's see then in the bacterial classification summary:
grep Sample03Short_MAG_00001 MAGs/gtdbtk.bac120.summary.tsv
And what other taxa we have?
Let's take a quick look with some bash
magic:
cut -f 2 gtdbtk.bac120.summary.tsv | sed '1d' | sort | uniq -c | sort
Later on, let's see if we can do some more analyses on R.
Let's now annotate the MAGs against databases of functional genes to try to get an idea of their metabolic potential.
As everything else, there are many ways we can annotate our MAGs.
Here, let's take advantage of anvi'o
for this as well.
Annotation usually takes some time to run, so we won't do it here.
But let's take a look below at how you could achieve this:
conda activate anvio-7
for SAMPLE in Sample01 Sample02 Sample03 Sample04; do
anvi-run-ncbi-cogs --contigs-db $SAMPLE/CONTIGS.db \
--num-threads 4
anvi-run-kegg-kofams --contigs-db $SAMPLE/CONTIGS.db \
--num-threads 4
anvi-run-pfams --contigs-db $SAMPLE/CONTIGS.db \
--num-threads 4
done
These steps have been done by us already, and the annotations have been stored inside the CONTIGS.db
of each assembly.
What we need now is to get our hands on a nice table that we can then later import to R.
We can achieve this by running anvi-export-functions
:
for SAMPLE in Sample01 Sample02 Sample03 Sample04; do
anvi-export-functions --contigs-db ~/Share/BINNING_MEGAHIT/$SAMPLE/CONTIGS.db \
--output-file MAGs/$SAMPLE.gene_annotation.txt
done
Since we're at it, let's also recover the information about i) the genes found in each split and ii) which splits belong to wihch bin/MAG.
I don't think there's a straightforward way to get this using anvi'o
commands, but because CONTIGS.db
and PROFILES.db
are SQL databases, we can access information within them using sqlite3
:
for SAMPLE in Sample01 Sample02 Sample03 Sample04; do
# Get list of gene calls per split
printf '%s|%s|%s|%s|%s\n' splits gene_callers_id start stop percentage > MAGs/$SAMPLE.genes_per_split.txt
sqlite3 ~/Share/BINNING_MEGAHIT/$SAMPLE/CONTIGS.db 'SELECT * FROM genes_in_splits' >> MAGs/$SAMPLE.genes_per_split.txt
# Get splits per bin
printf '%s|%s|%s\n' collection splits bins > MAGs/$SAMPLE.splits_per_bin.txt
sqlite3 ~/Share/BINNING_MEGAHIT/$SAMPLE/MERGED_PROFILES/PROFILE.db 'SELECT * FROM collections_of_splits' | grep 'MAGs|' >> MAGs/$SAMPLE.splits_per_bin.txt
done
Now let's get all these data into R to explore the MAGs taxonomic identity and functional potential.
First, download the MAGs
folder to your computer using FileZilla.