A collection of all scripts and resulting files related to "Challenges in capturing the mycobiome from shotgun metagenomic data: lack of software and databases".
- ./Mock_fungal/mock_conda:
  a) snakemake_conda.yaml: configuration file for the conda environment with Snakemake and all dependencies (ART/Kraken2/MetaPhlAn4) for mock data generation and Kraken2/MetaPhlAn4 taxonomy analysis
  b) eukdetect_equal_{reads/coverage}.yaml: configuration files used for taxonomic classification with EukDetect
- ./Mock_fungal/config: configuration file for the Snakemake workflow
- ./Mock_fungal/mock_profiles:
  a) unicellular_classes.tsv: List of taxIDs for unicellular classes within Ascomycota and Basidiomycota that were initially searched for
  b) manually_added_taxa.csv: List of fungal species manually added to the list of taxIDs
  c) genomes_to_download.csv: List of species IDs that were searched for in the NCBI RefSeq database - generated using ./Mock_fungal/scripts/data_analysis/prepare_fastas.py
  d) final_genomes_summary.csv: List of NCBI RefSeq genomes that were used for the mock communities - generated using ./Mock_fungal/scripts/data_analysis/prepare_fastas.py
  e) metadata_ufcg.csv: Genome metadata file prepared for the UFCG phylogenetic tree generation
  f) /profiles: Equal reads mock community profiles - generated using ./Mock_fungal/scripts/data_analysis/prepare_fastas.py
  g) /profiles_equal_cov: Equal coverage mock community profiles - generated using ./Mock_fungal/scripts/data_analysis/prepare_fastas.py
- ./Mock_fungal/workflow: Scripts for generating mock community sequencing data, taxonomic classification, and data analysis. Detailed information is provided in Data analysis.
- ./Mock_fungal/data: All figures and tables generated during the analysis of the mock community data
  a) /kraken - Kraken2 reports and their summary
  b) /metaphlan - MetaPhlAn4 reports and their summary
  c) /eukdetect - EukDetect reports and their summary
  d) /HMS - MycobiomeScan2.0 reports and their summary
  e) /funomic - FunOMIC reports and their summary
  f) /micop - MiCoP reports and their summary
  g) /all_methods_summary - Statistics, summary reports and figures
All Python scripts were executed using Spyder IDE v5.5.4 (Python v3.12). Scripts are located in ./Mock_fungal/workflow/scripts and ./Mock_fungal/workflow/scripts/data_analysis.
- prepare_fastas.py - Download species taxIDs based on class taxIDs; download existing NCBI RefSeq genomes; concatenate contigs in each FASTA file; and create profiles for the mock communities. Note that the script depends on the ncbi-datasets conda environment
- Read simulation with ART, generation of the mock community metagenome data, and its classification with Kraken2 and MetaPhlAn4 are wrapped into a Snakemake v8 workflow; the conda environment is provided in ./Mock_fungal/mock_conda/snakemake_conda.yaml. When all FASTA files, profiles, and Kraken2/MetaPhlAn4 databases are available, provide the paths to the data in ./Mock_fungal/config/config.yaml and run
    cd ./Mock_fungal
    snakemake
- The EukDetect and HumanMycobiomeScan (MycobiomeScan v2.0) classifications are performed independently using the original pipelines/commands. Configuration files for EukDetect are located in ./Mock_fungal/mock_conda/eukdetect_equal_{reads/coverage}.yaml
- {kraken/metaphlan/hms/eukdetect/funomic/micop}_summary.py - Summarize taxonomy predictions at the species, genus, and family levels; identify true and false positives. These summaries are used in all subsequent analyses
- check_presence_database.py - Find which mock community species are deposited in the databases; summarize genome characteristics (Figure 1)
- abundance_estimation.py - Calculate relative abundance at different taxonomic levels; calculate RMSE; compare equal reads vs equal coverage predictions; generate boxplot (Figure 2b)
- summarize_prec_recall.py - Calculate precision, recall, and F1 score for all tools; compare equal reads vs equal coverage predictions; generate boxplot (Figure 2a); calculate Pearson correlation with community richness (Figure 2c). Note that this step requires the RMSE results from abundance_estimation.py
- create_datasets_itol.py - Generate files for visualization of the phylogenetic tree with iTOL (used for Figure 3a)
- taxa_detection.py - Find which species/genera/families were differentially detected between the equal reads and equal coverage mock communities by the tested tools; which genera are not detected when there is more than one species per genus (Figure 3b); presence of unidentified genera in the databases
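
For orientation, the metrics named in the script list (precision, recall, F1 score, and RMSE of relative abundance) can be sketched as follows. This is a minimal illustration and not the code from summarize_prec_recall.py or abundance_estimation.py: the function names, the input layout (taxon names mapped to relative abundances), and the choice to treat taxa missing from one side as abundance 0 are all assumptions made for the example, and the values are invented.

```python
import math


def precision_recall_f1(expected, predicted):
    """Return (precision, recall, F1) for two collections of taxon names."""
    expected, predicted = set(expected), set(predicted)
    tp = len(expected & predicted)   # taxa correctly reported by the tool
    fp = len(predicted - expected)   # taxa reported but absent from the mock
    fn = len(expected - predicted)   # mock taxa the tool missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def abundance_rmse(expected, predicted):
    """RMSE between two {taxon: relative abundance} dicts.

    Assumption: the error is taken over the union of taxa, and a taxon
    missing from either dict counts as abundance 0.
    """
    taxa = set(expected) | set(predicted)
    if not taxa:
        return 0.0
    squared = [(expected.get(t, 0.0) - predicted.get(t, 0.0)) ** 2 for t in taxa]
    return math.sqrt(sum(squared) / len(taxa))


# Toy example with made-up values (not results from the study).
truth = {"Candida albicans": 0.5, "Saccharomyces cerevisiae": 0.5}
calls = {"Candida albicans": 0.6, "Malassezia restricta": 0.4}

print(precision_recall_f1(truth, calls))
print(abundance_rmse(truth, calls))
```

The same two functions can be applied at genus or family level by first collapsing the taxon names to that rank and summing their abundances.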