Avengers Assemble

This repository contains the analysis workflow for Debelius et al (doi: 10.1101/2021.03.23.436606).

Installation

The analysis was done based on a qiime2-2020.11 environment with RESRIPt and q2-sidle installed.

To build this environment, install qiime2-202011 according to your system requirements. Then, follow the q2-sidle installation instructions. You will also need snakemake > 5.3.

conda install --channel bioconda snakemake=5.3

Finally, you will need to clone and download the repository in whatever way works best for you.

Input Files

Simulation

To run the simulation, you will need to download the 100nt OTU table (ID 45113, CRC32: 7066cf83) and metadata from Qiita Study 850. The OTU table should be imported into QIIME 2 using the command,

qiime tools import \
 --type 'FeatureTable[Frequency]' \
 --source-fomat BIOMV210Format \
 --input-path 45113_otu_table.biom \
 --output-path start-table.qza

And placed in the folder specified as the config file (input_dir under simulation). (By default, this folder is ipynb/data/inputs/simulation).

The metadata should also be downloaded from the qiita study (found under the sample information tab), and renamed to start-metadata.txt and moved to the simulation input folder ipynb/data/inputs/simulation.

Real Data

Real data can be downloaded from ENA study PRJEB37382, accessions ERR4704801-ERR4704848. The should be placed in the location specified in the config file. By default, they are expected in the ipynb/data/inputs/real/seqs directory. Metadata is inferred from the filenames.

Database Files

The analysis is currently based on the Silva 128 QIIME release, available here. The files should be imported into QIIME 2 and saved to the ipynb/data/reference/silva-ori/ folder. The SEPP reference tree can be downloaded from the qiime2 resources page. The formatted Optivag and Greengenes databases are already in the repository.

Running the workflow

You can be through easily through snakemake based on the current repository set up.

snakemake

Workflow steps

Simulation

Reference Simulation. The greengenes database is used to filter the reference OTU table to exclude sequences with more than 5 degenerates, and then a reference dataset is simulated from the original data by averaging abundance across multiple individuals of the same age group.
- filter_degenerete_reference
- filter_ref_table
- simulate_reference_samples
Regional Simulation. Regions of the 16s gene are extracted and combined with the full length reference table to build a regional "ASV" table and representative sequence set. These are the starting points for our three test methods.
- extract_regions
- generate_asvs_and_derep_tables
Database Preparation. The database is prepared by filtering to remove sequences with a large number of degenerate nucleotides from the reference and database sequences which do not have class-level annotations. These are used to generate per-region
- filter_silva_reference
- extract_sidle_regions
- prepare_extracted_region
Reference Reconstruction. The simulated reference table, the greengenes 13_8 taxonomy and the greengenes 13_8 tree are copied from their respective locations.
- copy_sim_reference
OTU Clustering. The table is generated by merging all ASV representative sequences and the feature tables from all the individual regions and then clustering them closed reference at 99% identity against the Silva 128 reference database
- concatenate_single_asv_counts
- concatenate_single_rep_seqs
- cluster_otus
- get_otu_tree
- get_otu_taxonomy
ASV denosing. The table is generated by merging all ASV counts. Taxonomy comes from naive bayesian classification of all sequences; sequences which don't have phylum level annotation are excluded. The tree is built by fragment insertion of the merged sequences
- concatenate_single_asv_counts
- concatenate_single_rep_seqs
- classify_taxonomy
- get_asv_rep_seq
- get_asv_table
- build_insertion_tree
Sidle. For Sidle, the regional sequences are aligned to the regional kmers. The alignments, regional database maps, and feature counts are used to reconstruct the table. Taxonomy is reconstructed using the database map to identify unresolved sequences. To build the phylogenetic tree, we generate consensus fragments and then insert them into them in the Silva 128 backbone.
- align_sidle_regions
- reconstruct_table_single
- reconstruct_taxonomy
- reconstruct_fragments_single
- reconstructed_fragment_insertion
Alpha Diversity. Within sample (alpha) diversity is calculated from multiple rarefaction. We look at observed features, Shannon, Pielou's evenness and faith's diversity. The results are summarized using the Table1-CompareAlpha.ipynb notebook. This generated Table 1.
- rarefy_iteration
- alpha_iter
- alpha_table
Beta Diversity. The between sample (beta) diversity is also calculated from rarified tables, and then, we look at three feature-based metrics (unweighted UniFrac, weighted UniFrac, and Bray-Curtis) as well as a collapsed metric (genus level Bray-Curtis). The results are summarized in the ipynb/Table2-CompareBeta.ipynb notebook. This generates Table 2.
- rarefy_iteration
- genus_braycurtis
- beta_phylogenetic
- braycurtis
- adonis
- beta_table
Taxonomic Comparison. We compared the taxonomy at class level using the merged, unnormalized tables through the ipynb/Figure1-Taxonomy.ipynb. This generates Figure 2 and Table S3.
- compare_taxonomy

Real Data Analysis

Data Preparation. We import the sequences into QIIME 2, and then denoise them using dada2 before denoising the sequences.
- build_manifest
- import_seqs
- trim_data_v13
- denoise_v13
- trim_data_v34
- denoise_v34
- trim_posthoc
Database Preparation. For benchmarking the greengenes database needs to be prepared (filter_greengenes_reference), otherwise, the Optivag database is assumed to be tidied. Then, we extract the regions and prepare them from sidle alignment.
- filter_greengenes_reference
- extract_sidle_regions
- prepare_extracted_region
Reference Reconstruction. The V34 region alone was used as the reference for the simulation. We used the paired end table, classified the taxonomy using a naive bayesian classifier for the full length sequences, and then filtered the table to remove anything missing a phylum level or kingdom level designation.
- classify_regional_taxonomy
- get_real_reference_data
- get_real_representative_seqs
OTU Clustering. The table is generated by merging all ASV representative sequences and the feature tables from all the individual regions and then clustering them closed reference at 99% identity
- concatenate_single_asv_counts
- concatenate_single_rep_seqs
- cluster_otus
- get_otu_taxonomy
ASV denosing. The table is generated by merging all ASV counts. Taxonomy comes from naive bayesian classification of all sequences; sequences which don't have phylum level annotation are excluded. The tree is built by fragment insertion of the merged sequences
- concatenate_single_asv_counts
- concatenate_single_rep_seqs
- classify_taxonomy
- get_asv_rep_seq
- get_asv_table
Sidle. For Sidle, the regional sequences are aligned to the regional kmers. The alignments, regional database maps, and feature counts are used to reconstruct the table. Taxonomy is reconstructed using the database map to identify unresolved sequences.
- align_sidle_regions
- reconstruct_table_single
- reconstruct_taxonomy
Data Synthesis. The taxonomic annotation was checked and the tables evaluated using the ipynb/Figure2-RealData.ipynb notebook.
- real_data_figure

Benchmarking

The snake file contains a section dedicated to preparing the sidle reads, and it will output a table. This will generate a series of benchmarking files, in the benchmark folder. The Matlab folder contains the modified m scripts to run with SMURF.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
MATLAB		MATLAB
bin		bin
ipynb		ipynb
.gitignore		.gitignore
README.md		README.md
Snakefile		Snakefile
config.yaml		config.yaml
license.md		license.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Avengers Assemble

Installation

Input Files

Simulation

Real Data

Database Files

Running the workflow

Workflow steps

Simulation

Real Data Analysis

Benchmarking

About

Releases

Packages

Languages

License

jwdebelius/avengers-assemble

Folders and files

Latest commit

History

Repository files navigation

Avengers Assemble

Installation

Input Files

Simulation

Real Data

Database Files

Running the workflow

Workflow steps

Simulation

Real Data Analysis

Benchmarking

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages