Mock mycobiome communities - metagenomic data generation and analysis

A collection of all scripts and resulting files related to "Challenges in capturing the mycobiome from shotgun metagenomic data: lack of software and databases".

Folder structure

./Mock_fungal/mock_conda:

a) snakemake_conda.yaml: conda environment file with Snakemake and all dependencies (ART, Kraken2, MetaPhlAn4) for mock data generation and Kraken2/MetaPhlAn4 taxonomic classification

b) eukdetect_equal_{reads/coverage}.yaml: configuration files used for taxonomic classification with EukDetect

./Mock_fungal/config: configuration file for the Snakemake workflow

./Mock_fungal/mock_profiles:

a) unicellular_classes.tsv: List of taxIDs for unicellular classes within Ascomycota and Basidiomycota that were initially searched for

b) manually_added_taxa.csv: List of fungal species manually added to the list of taxIDs

c) genomes_to_download.csv: List of species IDs that were searched through the NCBI RefSeq database - generated using ./Mock_fungal/scripts/data_analysis/prepare_fastas.py

d) final_genomes_summary.csv: List of NCBI RefSeq genomes that were used for the mock communities* - generated using ./Mock_fungal/scripts/data_analysis/prepare_fastas.py

e) metadata_ufcg.csv: Genome metadata file prepared for the UFCG phylogenetic tree generation

f) /profiles: Equal reads mock community profiles - generated using ./Mock_fungal/scripts/data_analysis/prepare_fastas.py

g) /profiles_equal_cov: Equal coverage mock community profiles - generated using ./Mock_fungal/scripts/data_analysis/prepare_fastas.py

./Mock_fungal/workflow: Scripts for mock community sequencing data generation, taxonomic classification, and data analysis. Details are provided in the Data analysis section.

./Mock_fungal/data: All figures and tables generated during the analysis of the mock community data

a) /kraken - Kraken2 reports and their summary

b) /metaphlan - MetaPhlAn4 reports and their summary

c) /eukdetect - EukDetect reports and their summary

d) /HMS - MycobiomeScan2.0 reports and their summary

e) /funomic - FunOMIC reports and their summary

f) /micop - MiCoP reports and their summary

g) /all_methods_summary - Statistics, summary reports and figures

Data analysis

All Python scripts were executed using the Spyder IDE v5.5.4 (Python v3.12). Scripts are located in ./Mock_fungal/workflow/scripts and ./Mock_fungal/workflow/scripts/data_analysis.

prepare_fastas.py - Download species taxIDs based on class taxIDs; download the available NCBI RefSeq genomes; concatenate the contigs in each FASTA file; and create profiles for the mock communities. Note that the script depends on the ncbi-datasets conda environment.
Read simulation with ART, generation of the mock community metagenomic data, and classification with Kraken2 and MetaPhlAn4 are wrapped into a Snakemake v8 workflow; the conda environment is provided in ./Mock_fungal/mock_conda/snakemake_conda.yaml. Once all FASTA files, profiles, and Kraken2/MetaPhlAn4 databases are available, provide the paths to the data in ./Mock_fungal/config/config.yaml and run:

    cd ./Mock_fungal
    snakemake
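
As an illustration of the contig concatenation step that prepare_fastas.py performs before the workflow is run, the sketch below merges all records of a multi-FASTA input into a single record. It is not the code from prepare_fastas.py: the function names `read_fasta` and `concatenate_contigs`, the `new_id` header, and the optional `spacer` argument are hypothetical, and it assumes that "concatenate" means joining the contig sequences end to end in file order.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def concatenate_contigs(lines: Iterable[str], new_id: str, spacer: str = "") -> str:
    """Join all contig sequences, in input order, into one FASTA record.

    `spacer` is a hypothetical option (for example a run of N characters)
    and is empty by default, so contigs are joined directly.
    """
    sequences = [seq for _, seq in read_fasta(lines)]
    return f">{new_id}\n{spacer.join(sequences)}\n"


# Toy input: two contigs, the first one wrapped over two lines.
example = [">contig_1", "ACGT", "AC", ">contig_2", "GGTT"]
merged = concatenate_contigs(example, "genome_A")
print(merged, end="")
```

For a real genome, the same function could be fed an open file handle instead of the toy list, since it only iterates over lines.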

The EukDetect and HumanMycobiomeScan (MycobiomeScan v2.0) classifications are performed independently using the original pipelines/commands. Configuration files for EukDetect are located in ./Mock_fungal/mock_conda/eukdetect_equal_{reads/coverage}.yaml.
{kraken/metaphlan/hms/eukdetect/funomic/micop}_summary.py - Summarize taxonomy predictions at the species, genus, and family levels; identify true and false positives. These summaries are used in all subsequent analyses
check_presence_database.py - Find which mock community species are present in the databases; summarize genome characteristics (Figure 1)
abundance_estimation.py - Calculate relative abundance at different taxonomic levels; calculate RMSE; compare equal reads vs equal coverage predictions; generate boxplot (Figure 2b)
summarize_prec_recall.py - Calculate precision, recall and F1 score for all tools; compare equal reads vs equal coverage predictions; generate boxplot (Figure 2a); calculate the Pearson correlation with community richness (Figure 2c). Note that this step requires the RMSE results from abundance_estimation.py
create_datasets_itol.py - Generate files for visualization of the phylogenetic tree with iTOL (used for Figure 3a)
taxa_detection.py - Find which species/genera/families were differentially detected between the equal reads and equal coverage mock communities by the tested tools; which genera are not detected when a genus contains more than one species (Figure 3b); presence of unidentified genera in the databases
FunOMIC.sh, buglist.py, coverageFilter.py and functional_profiling.py are scripts belonging to the FunOMIC tool
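
The metrics named in the script descriptions above (true and false positives, precision, recall, F1 score, and RMSE of relative abundance) can be sketched as follows. This is a minimal illustration using the standard definitions, not the code from summarize_prec_recall.py or abundance_estimation.py. The function names and the toy taxa are hypothetical, and the RMSE here is taken over the union of expected and predicted taxa with missing taxa counted as zero abundance, which is an assumption about how the comparison is set up.

```python
import math
from typing import Dict, Iterable, Tuple


def precision_recall_f1(expected: Iterable[str], predicted: Iterable[str]) -> Tuple[float, float, float]:
    """Compare predicted taxa against the known mock community composition."""
    expected_set, predicted_set = set(expected), set(predicted)
    tp = len(expected_set & predicted_set)   # true positives: correctly detected
    fp = len(predicted_set - expected_set)   # false positives: not in the mock
    fn = len(expected_set - predicted_set)   # false negatives: missed taxa
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rmse(expected: Dict[str, float], predicted: Dict[str, float]) -> float:
    """Root mean square error between expected and predicted relative abundances.

    Assumption: taxa absent from one profile count as 0.0 in that profile.
    """
    taxa = set(expected) | set(predicted)
    if not taxa:
        raise ValueError("both abundance profiles are empty")
    squared_errors = [(predicted.get(t, 0.0) - expected.get(t, 0.0)) ** 2 for t in taxa]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example with hypothetical taxa: 2 true positives, 1 false positive, 2 false negatives.
p, r, f1 = precision_recall_f1(
    expected=["sp_A", "sp_B", "sp_C", "sp_D"],
    predicted=["sp_A", "sp_B", "sp_E"],
)
error = rmse({"sp_A": 0.5, "sp_B": 0.5}, {"sp_A": 0.5, "sp_C": 0.5})
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} RMSE={error:.3f}")
```

Working on sets of taxon names keeps the same function usable at the species, genus, or family level; only the input labels change.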
