This repository contains the analysis workflow for Debelius et al (doi: 10.1101/2021.03.23.436606).
The analysis was done based on a qiime2-2020.11 environment with RESRIPt and q2-sidle installed.
To build this environment, install qiime2-202011 according to your system requirements. Then, follow the q2-sidle installation instructions. You will also need snakemake > 5.3.
conda install --channel bioconda snakemake=5.3
Finally, you will need to clone and download the repository in whatever way works best for you.
To run the simulation, you will need to download the 100nt OTU table (ID 45113, CRC32: 7066cf83) and metadata from Qiita Study 850. The OTU table should be imported into QIIME 2 using the command,
qiime tools import \
--type 'FeatureTable[Frequency]' \
--source-fomat BIOMV210Format \
--input-path 45113_otu_table.biom \
--output-path start-table.qza
And placed in the folder specified as the config file (input_dir
under simulation
). (By default, this folder is ipynb/data/inputs/simulation
).
The metadata should also be downloaded from the qiita study (found under the sample information tab), and renamed to start-metadata.txt
and moved to the simulation input folder ipynb/data/inputs/simulation
.
Real data can be downloaded from ENA study PRJEB37382, accessions ERR4704801-ERR4704848. The should be placed in the location specified in the config file. By default, they are expected in the ipynb/data/inputs/real/seqs
directory. Metadata is inferred from the filenames.
The analysis is currently based on the Silva 128 QIIME release, available here. The files should be imported into QIIME 2 and saved to the ipynb/data/reference/silva-ori/
folder. The SEPP reference tree can be downloaded from the qiime2 resources page. The formatted Optivag and Greengenes databases are already in the repository.
You can be through easily through snakemake based on the current repository set up.
snakemake
-
Reference Simulation. The greengenes database is used to filter the reference OTU table to exclude sequences with more than 5 degenerates, and then a reference dataset is simulated from the original data by averaging abundance across multiple individuals of the same age group.
filter_degenerete_reference
filter_ref_table
simulate_reference_samples
-
Regional Simulation. Regions of the 16s gene are extracted and combined with the full length reference table to build a regional "ASV" table and representative sequence set. These are the starting points for our three test methods.
extract_regions
generate_asvs_and_derep_tables
-
Database Preparation. The database is prepared by filtering to remove sequences with a large number of degenerate nucleotides from the reference and database sequences which do not have class-level annotations. These are used to generate per-region
filter_silva_reference
extract_sidle_regions
prepare_extracted_region
-
Reference Reconstruction. The simulated reference table, the greengenes 13_8 taxonomy and the greengenes 13_8 tree are copied from their respective locations.
copy_sim_reference
-
OTU Clustering. The table is generated by merging all ASV representative sequences and the feature tables from all the individual regions and then clustering them closed reference at 99% identity against the Silva 128 reference database
concatenate_single_asv_counts
concatenate_single_rep_seqs
cluster_otus
get_otu_tree
get_otu_taxonomy
-
ASV denosing. The table is generated by merging all ASV counts. Taxonomy comes from naive bayesian classification of all sequences; sequences which don't have phylum level annotation are excluded. The tree is built by fragment insertion of the merged sequences
concatenate_single_asv_counts
concatenate_single_rep_seqs
classify_taxonomy
get_asv_rep_seq
get_asv_table
build_insertion_tree
-
Sidle. For Sidle, the regional sequences are aligned to the regional kmers. The alignments, regional database maps, and feature counts are used to reconstruct the table. Taxonomy is reconstructed using the database map to identify unresolved sequences. To build the phylogenetic tree, we generate consensus fragments and then insert them into them in the Silva 128 backbone.
align_sidle_regions
reconstruct_table_single
reconstruct_taxonomy
reconstruct_fragments_single
reconstructed_fragment_insertion
-
Alpha Diversity. Within sample (alpha) diversity is calculated from multiple rarefaction. We look at observed features, Shannon, Pielou's evenness and faith's diversity. The results are summarized using the
Table1-CompareAlpha.ipynb
notebook. This generated Table 1.rarefy_iteration
alpha_iter
alpha_table
-
Beta Diversity. The between sample (beta) diversity is also calculated from rarified tables, and then, we look at three feature-based metrics (unweighted UniFrac, weighted UniFrac, and Bray-Curtis) as well as a collapsed metric (genus level Bray-Curtis). The results are summarized in the
ipynb/Table2-CompareBeta.ipynb
notebook. This generates Table 2.rarefy_iteration
genus_braycurtis
beta_phylogenetic
braycurtis
adonis
beta_table
-
Taxonomic Comparison. We compared the taxonomy at class level using the merged, unnormalized tables through the
ipynb/Figure1-Taxonomy.ipynb
. This generates Figure 2 and Table S3.compare_taxonomy
-
Data Preparation. We import the sequences into QIIME 2, and then denoise them using dada2 before denoising the sequences.
build_manifest
import_seqs
trim_data_v13
denoise_v13
trim_data_v34
denoise_v34
trim_posthoc
-
Database Preparation. For benchmarking the greengenes database needs to be prepared (
filter_greengenes_reference
), otherwise, the Optivag database is assumed to be tidied. Then, we extract the regions and prepare them from sidle alignment.filter_greengenes_reference
extract_sidle_regions
prepare_extracted_region
-
Reference Reconstruction. The V34 region alone was used as the reference for the simulation. We used the paired end table, classified the taxonomy using a naive bayesian classifier for the full length sequences, and then filtered the table to remove anything missing a phylum level or kingdom level designation.
classify_regional_taxonomy
get_real_reference_data
get_real_representative_seqs
-
OTU Clustering. The table is generated by merging all ASV representative sequences and the feature tables from all the individual regions and then clustering them closed reference at 99% identity
concatenate_single_asv_counts
concatenate_single_rep_seqs
cluster_otus
get_otu_taxonomy
-
ASV denosing. The table is generated by merging all ASV counts. Taxonomy comes from naive bayesian classification of all sequences; sequences which don't have phylum level annotation are excluded. The tree is built by fragment insertion of the merged sequences
concatenate_single_asv_counts
concatenate_single_rep_seqs
classify_taxonomy
get_asv_rep_seq
get_asv_table
-
Sidle. For Sidle, the regional sequences are aligned to the regional kmers. The alignments, regional database maps, and feature counts are used to reconstruct the table. Taxonomy is reconstructed using the database map to identify unresolved sequences.
align_sidle_regions
reconstruct_table_single
reconstruct_taxonomy
-
Data Synthesis. The taxonomic annotation was checked and the tables evaluated using the
ipynb/Figure2-RealData.ipynb
notebook.real_data_figure
The snake file contains a section dedicated to preparing the sidle reads, and it will output a table. This will generate a series of benchmarking files, in the benchmark folder. The Matlab
folder contains the modified m scripts to run with SMURF.