Skip to content

Commit c68d7e5

Browse files
mcmk3antgonzaElDeveloper
authored
Added three tutorial sections to the Qiita documentation (#3032)
* Added three tutorial sections to the Qiita documentation: 'Retrieving Public Data for Own Analysis' and 'Processing public data retrieved with redbiom' to the redbiom tab, and 'Statistical Analysis to Justify Clinical Trial Sample Size Tutorial' to the analyzing samples tab. * Update redbiom.rst * Update redbiom.rst * Update redbiom.rst * Further updates to redbiom.rst and the Stats tutorial. * update redbiom.rst * Finished proof-reading * Placed all three tutorials/sections together under Introduction to the download and analysis of public Qiita data * added a new introduction, with links to the three sections * Added figures to stats tutorial and contexts explanation * Added figures to stats tutorial and contexts explanation * Apply suggestions from code review [skip ci] Co-authored-by: Yoshiki Vázquez Baeza <yoshiki@ucsd.edu> Co-authored-by: Antonio Gonzalez <antgonza@gmail.com> Co-authored-by: Yoshiki Vázquez Baeza <yoshiki@ucsd.edu>
1 parent 5a308ea commit c68d7e5

File tree

10 files changed

+1229
-2
lines changed

10 files changed

+1229
-2
lines changed

qiita_pet/support_files/doc/source/downloading.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -109,7 +109,7 @@ off the artifact ID and is present in the artifact and metadata filenames.
109109
Finding Samples Based On Their Metadata
110110
---------------------------------------
111111

112-
For help on doing complex searches for samples go to :doc:`../redbiom`. Redbiom
112+
For help on doing complex searches for samples go to :doc:`./redbiom`. Redbiom
113113
helps users find samples based on their metadata, a specific taxon or feature
114114
of interest via a simple Qiita GUI or the command line (more powerful).
115115

qiita_pet/support_files/doc/source/index.rst

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -52,6 +52,12 @@ Looking for information about analyzing your data? Please see the document here:
5252

5353
analyzingsamples/index.rst
5454

55+
Interested in how you can access and use public data available on Qiita? Please see the document here:
56+
57+
.. toctree::
58+
59+
reanalysis/reanalysis.rst
60+
5561
Looking for available guides? Please see these documents:
5662

5763
.. toctree::
Loading
Loading
Loading
Loading
Loading

qiita_pet/support_files/doc/source/reanalysis/reanalysis.rst

Lines changed: 1124 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 93 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,93 @@
1+
An explanation of contexts in redbiom and Qiita.
2+
================================================
3+
4+
In both redbiom and Qiita study data is grouped into contexts representing different processing pipelines/methods. Processing and bioinformatic techniques can cause inherent biases in datasets, however if data is grouped by the protocol used to obtain the samples and extract data from them then, within any one such group (context), data is expected to be comparable. This is because the data in such a group (context) should share the same biases/limitations. Therefore, when retrieving data found in a redbiom search a context must be specified; so ensuring the retrieved data is comparable. If you are familiar with the methods and processes used in sample processing pipelines you should be able to interpret contexts as a context ultimately represents a processing pipeline. The general syntax of the context is <binning strategy>-<reference database>-<sequencing strategy>-<region sequenced>-<(if trimmed) trimming length>-<id> and this section should help familiarize you with these considerations therefore providing an understanding of contexts in both redbiom and Qiita.
5+
6+
.. figure:: contexts.jpg
7+
:align: center
8+
9+
Diagram of general workflow for obtaining and processing data (green = data state; grey = process; white = example) [Produced at http://interactive.blockdiag.com/]
10+
11+
Obtaining Data
12+
--------------
13+
14+
The first step of any pipeline is obtaining the data for the pipeline. As with any other stage of the pipeline this process could introduce biases or differences in the processed data, and needs to be accounted for. For instance, the region of the genome sequenced will determine whether two studies are directly comparable - use of different marker genes/regions will mean that the sequences (raw data) are not comparable. While the use of different reference databases to taxonomically classify the sequences may allow comparison at the taxonomic level, use of different databases could lead to biases in which sequences are classified.
15+
16+
Sequencing techniques
17+
---------------------
18+
19+
There are several next generation and third generation sequencing techniques on the market, each with their own advantages and limitations. Illumina sequencing is probably the most common high throughput sequencing technique. Below is a short description of some of the major techniques found in redbiom and Qiita contexts.
20+
21+
* Illumina sequencing is a 'massively parallel' (high throughput) sequencing by synthesis technique, but has relatively short read lengths, a problem for assembly of repetitive regions. Note, however, that repetitive regions are less common in more economical genomes such as those of many bacteria, and are not a problem when assembly is not required such as when sequencing short marker gene fragments.
22+
* Ion torrent is another ‘sequencing by synthesis’ method, which is less high throughput but cheaper as it does not require modified nucleotides.
23+
* 454 sequencing has been out-competed, and the sequencers are no longer in production. It was a sequencing by synthesis method, which utilised polonies (like Illumina) and used imaging (pyrosequencing). While cheaper, it has difficulties distinguishing the number of bases in a run of identical bases because it detects incorporation of nucleotides passed over the cell a single type at a time (which also slows the process).
24+
25+
Onywera and Meiring [1]_ compared Ion torrent and Illumina and suggest (with caution) that Illumina sequencing may produce reads of higher quality (2.4-fold more quality-filtered reads) with greater efficiency when sequencing 16S variable regions in metagenomic samples.
26+
27+
Region sequenced for species identification
28+
-------------------------------------------
29+
30+
Marker genes versus whole genome
31+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
32+
33+
Factors such as cost, time, and complexity all affect the choice of which genomic regions in a sample will actually be sequenced. Whole genome sequencing (WGS) provides the most information, it can also potentially overwhelm an analysis, especially taking into consideration the high frequency of horizontal spread (of genes) among many prokaryotes, as well as varying genome size, gene duplication and deletion. Furthermore, WGS involves greatly increased costs, time and use of resources (sequencing and assembling the entire genomes). Therefore, microbiome studies generally use a marker gene, which allows for rapid and cost efficient identification of the species within a sample, and so analysis of features such as community structure.
34+
35+
16S rDNA (encoding the 16S rRNA of the ribosome, and thus universally present in prokaryotes) is currently the most commonly used marker gene. The gene's highly conserved function leads to a relatively low observed mutation rate in certain regions, allowing it to be used as a guide to phylogeny and taxonomic identity. More conserved regions of the gene allow reliable amplification (conserved sequences are needed for designing primers capable of annealing sequences from a wide variety of taxa) while hypervariable regions allow greater resolution. Note, however, that there is no consensus quantitative definition of genus or species based on 16S rRNA gene sequence
36+
37+
Another marker region is the internal transcribed spacer DNA (ITS) between the large and small subunit rRNA-encoding DNA. As with rDNA, it is readily amplified due to both being present in multiple copies, and having conserved flanking regions (the rDNA).
38+
39+
Within the 16S rDNA sequence, various variable regions can be amplified. V4 is a commonly used variable region of the 16S rDNA sequence, which, in total, contains nine hypervariable regions of varying conservation. Use of a variable region allows the detection of finer scale differences between taxa. Lower levels of variability are useful in determining higher-ranking taxa. Bukin et al [2]_ suggest that the V2-V3 region has the highest resolution for lower-rank taxa (species, genera) and so exploratory and diversity studies may wish to use this region. Conversely, Yang et al [3]_ come to a different conclusion, preferring V4-V6 for phylogenetic analysis. However, traditionally, V3 and V4 have been the most commonly used fragments.
40+
41+
Processing Data
42+
---------------
43+
44+
Processing data can again lead to both differences or biases in data which could affect comparison. Raw sequences undergo quality control, and are often trimmed. Note that trimming is controversial but is required to allow use of data across studies because the same length sequences are required for comparison between studies. Whether trimming occurs, and trimming length, depends on the study; however deblur and DADA require trimmed input sequences because they require input data to be of the same length (as both algorithms consider single nucleotide differences, and differences in length are therefore distinguished). Illumina sequencing technology can theoretically cover as much as 300 nucleotides, though in reality 200 is more common (leading to between 400 and 600 nt for a paired end read), trimming of the ends of this data removes adapter sequences. Common trimming lengths in the Qiita database are 90, 100, 150 and 200 nucleotides. Once quality control is completed, the raw data can then be processed in various manners.
45+
46+
Assigning sequences to taxa
47+
^^^^^^^^^^^^^^^^^^^^^^^^^^^
48+
49+
Both the strategy of taxa assignment, and the reference database, if used, can lead to potential biases in data.
50+
51+
**Binning strategies**
52+
53+
Reads are representative of taxa in the sample and must be assigned appropriately to allow further analysis. There are two main strategies for assigning reads to a particular taxa, either reads are clustered into operational taxonomic units (OTUs) when above some threshold similarity; or specialised processing algorithms are used to yield putatively error free sequences which can then be aligned to a reference database.
54+
55+
Picked OTUs, the former, involves clustering sequence data into clades based on similarity to a reference database in closed-reference picking, or each other in de novo clustering. Reads that do not cluster are discarded in closed-reference picking. Open picking has a second round of clustering in which reads that do not cluster with reference sequences are clustered against one another (de-novo clustering). The threshold similarity for clustering will determine what taxonomic level clusters represent [4]_ .
56+
57+
Two common algorithms representative of the second strategy are Deblur and DADA2, which can be used to resolve sequence data to the sub-operational taxonomic unit (sOTU) level (where sequences may differ by only a nucleotide). At such high (single-nucleotide) resolution a specialised programme is necessary to retrieve putative error-free sequences. As both packages infer exact amplicon sequence variants (ASVs) they give higher resolution composition data output than OTU methods [5]_ [6]_ . Furthermore, ASVs are comparable across studies because they represent exact sequences rather than per-study OTU IDs. For this reason there is also no absolute requirement for using a reference database, eliminating a possible source of bias. For these, and further reasons, Callahan et al [7]_ suggest that ASVs (and the associated methods to obtain them) should replace OTUs.
58+
59+
An independent study comparing DADA2, Deblur and an OTU picking pipeline found that all resulted in similar general community structure, but varied considerably in run times and the number of ASVs/OTUs called per sample (and so the resulting alpha-diversity metrics) [8]_ . OTU clustering tended to exaggerate the number of unique organisms. However, it should be noted that variation between treatments rather than per sample is a more informative metric to consider.
60+
61+
**Reference taxonomy**
62+
63+
If the picked OTUs are closed or open reference then the database they were scored against must also be specified (as with all other processing steps, it represents a possible source of bias). The same considerations hold for any microbial data that has had taxa assigned by alignment to a reference database. For example, if deblur generated ASVs have been mapped to a reference database to allow characterisation of known species in the sample, the database used should be specified.
64+
65+
The most common reference database for 16S rDNA is Greengenes. However, this database has not been actively maintained since 2013. Other databases include:
66+
67+
* Silva, which contains not only 16S but also 18S and 23S/28S data, and is actively maintained. Note it contains data for bacteria (16S, 23S), archaea and eukarya (18S, 28S).
68+
* Unite, a database for ITS (it does not contain any other regions).
69+
* GTDB, the newest database for Bacterial and Archeal sequences, this database also includes representative sequences not yet assigned a species name, and contains whole (and partial) genomes.
70+
71+
Conclusion
72+
----------
73+
74+
This article should have given you a brief overview of some of the major processes and considerations in processing pipelines for microbiome data and so the ability to interpret redbiom and Qiita contexts. The general syntax of the context is <binning strategy>-<reference database>-<sequencing strategy>-<region sequenced><trimming length (if trimmed>-<id>. Each of these processes or considerations can lead to differences or biases in the data, and thus is included in contexts.
75+
76+
Bibliography
77+
------------
78+
79+
.. [1] Onywera H, Meiring TL. 2020 Comparative analyses of Ion Torrent V4 and Illumina V3-V4 16S rRNA gene metabarcoding methods for characterization of cervical microbiota: taxonomic and functional profiling. Sci. Afr. 7, e00278. (doi:10.1016/j.sciaf.2020.e00278)
80+
81+
.. [2] Bukin YS, Galachyants YP, Morozov IV, Bukin SV, Zakharenko AS, Zemskaya TI. 2019 The effect of 16S rRNA region choice on bacterial community metabarcoding results. Sci. Data 6, 190007. (doi:10.1038/sdata.2019.7)
82+
83+
.. [3] Yang B, Wang Y, Qian P-Y. 2016 Sensitivity and correlation of hypervariable regions in 16S rRNA genes in phylogenetic analysis. BMC Bioinformatics 17. (doi:10.1186/s12859-016-0992-y)
84+
85+
.. [4] OTU picking strategies in QIIME Homepage. See http://qiime.org/tutorials/otu_picking.html
86+
87+
.. [5] Amir A et al. 2017 Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns. mSystems 2. (doi:10.1128/mSystems.00191-16)
88+
89+
.. [6] Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. 2016 DADA2: High-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581583. (doi:10.1038/nmeth.3869)
90+
91+
.. [7] Callahan BJ, McMurdie PJ, Holmes SP. 2017 Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J. 11, 26392643. (doi:10.1038/ismej.2017.119)
92+
93+
.. [8] Nearing JT, Douglas GM, Comeau AM, Langille MGI. 2018 Denoising the Denoisers: an independent evaluation of microbiome sequence error-correction approaches. PeerJ 6, e5364. (doi:10.7717/peerj.5364)

qiita_pet/support_files/doc/source/redbiom.rst

100755100644
Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,10 @@ redbiom
77
* Allows you to search through public studies to find comparable data to your own
88
* Can search by: metadata, feature, or taxon
99

10+
redbiom is a cache service which can be used to search databases for samples which contain particular taxonomic units or given features (e.g. environmental or clinical factors); either as a Qiita plugin, or as a separately installed command line package. redbiom can therefore be used to identify samples for a meta-analysis focused on some particular factor (or factors). The utility of searching by metadata or feature is that it allows the discovery and subsequent use of a potentially wide variety of samples. This may include data from studies with completely different research goals, which one would have otherwise been unlikely to realize could be used.
11+
12+
Note that a cache service is a high-speed data storage layer which stores a subset of data allowing a more rapid searching and retrieval experience. redbiom has a much smaller, sparse vector version of the Qiita database containing mainly metadata, which thus allows rapid searching. Samples found in a redbiom search can then be retrieved from the Qiita database through redbiom for subsequent analysis.
13+
1014
For more information, advanced queries and generating
1115
`BIOM <http://biom-format.org/>`__ files go to the
1216
`redbiom github page <https://github.com/biocore/redbiom/blob/master/README.md>`__.
@@ -35,7 +39,7 @@ Search Options
3539
will find all samples which have the *qiita_study_id* metadata category, and in which the value for that sample is *10317*.
3640

3741
* **Examples:**
38-
42+
3943
* Find all samples in which both the word 'infant', as well as 'antibiotics' exist, and where the infants are under a year old:
4044

4145
.. code-block:: bash

0 commit comments

Comments
 (0)