🔥📝 Reformat benchmarking docs #215

Merged · 3 commits · Jan 7, 2022
4 changes: 2 additions & 2 deletions autometa/validation/benchmark.py
@@ -21,9 +21,9 @@
along with Autometa. If not, see <http://www.gnu.org/licenses/>.
COPYRIGHT

Autometa taxon-profiling, clustering and binning-classification evaluation benchmarking.

Script to benchmark Autometa taxon-profiling, clustering and binning-classification results using clustering and classification evaluation metrics.
"""


322 changes: 226 additions & 96 deletions docs/source/benchmarking.rst
@@ -1,113 +1,210 @@
============
Benchmarking
============

.. headers/sections formatting
   See: https://docs.typo3.org/m/typo3/docs-how-to-document/main/en-us/WritingReST/HeadlinesAndSection.html

.. note::

    The most recent Autometa benchmarking results covering multiple modules and input parameters are hosted on our
    `KwanLab/metaBenchmarks <https://github.com/KwanLab/metaBenchmarks>`_ GitHub repository and provide a range of
    analyses covering multiple stages and parameter sets. These benchmarks are available with their own respective
    modules so that the community may easily assess how Autometa's novel (``taxon-profiling``, ``clustering``,
    ``binning``, ``refinement``) algorithms perform compared to current state-of-the-art methods. Tools were selected for
    benchmarking based on their relevance to environmental, single-assembly, reference-free binning pipelines.

Benchmarking with the ``autometa-benchmark`` module
===================================================

Autometa includes the ``autometa-benchmark`` entrypoint, a script to benchmark Autometa taxon-profiling, clustering
and binning-classification prediction results using clustering and classification evaluation metrics. To select the
appropriate benchmarking method, supply the ``--benchmark`` parameter with the respective choice. The three benchmarking
methods are detailed below.
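
As a general synopsis (a sketch assembled from the worked examples below, not an exhaustive listing of options), an
invocation takes the form:

.. code:: bash

    autometa-benchmark \
        --benchmark <classification|clustering|binning-classification> \
        --predictions <predictions.tsv.gz> \
        --reference <reference_assignments.tsv.gz> \
        --output-wide <benchmarks.wide.tsv.gz> \
        --output-long <benchmarks.long.tsv.gz>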

.. note::

    If you'd like to follow along with the benchmarking commands, you may download the test datasets using:

    .. code:: bash

        autometa-download-dataset \
            --community-type simulated \
            --community-sizes 78Mbp \
            --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
            --dir-path $HOME/Autometa/autometa/datasets/simulated

    This will download three files:

    - ``reference_assignments.tsv.gz``: tab-delimited file containing contigs with their reference genome assignments. ``cols: [contig, reference_genome, taxid, organism_name, ftp_path, length]``
    - ``binning.tsv.gz``: tab-delimited file containing contigs with Autometa binning predictions. ``cols: [contig, cluster]``
    - ``taxonomy.tsv.gz``: tab-delimited file containing contigs with Autometa taxon-profiling predictions. ``cols: [contig, kingdom, phylum, class, order, family, genus, species, taxid]``
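
As a quick sanity check (a sketch; adjust the paths if you downloaded to a different location), you can peek at the
first rows of each table to confirm the expected columns:

.. code:: bash

    # Inspect the headers of the downloaded tables
    zcat $HOME/Autometa/autometa/datasets/simulated/78Mbp/binning.tsv.gz | head -n 5
    zcat $HOME/Autometa/autometa/datasets/simulated/78Mbp/taxonomy.tsv.gz | head -n 5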


Taxon-profiling
---------------

Example benchmarking with simulated communities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    # Set community size (see above for selection/download of other community types)
    community_size=78Mbp

    # Inputs
    ## NOTE: predictions and reference were downloaded using autometa-download-dataset
    predictions="$HOME/Autometa/autometa/datasets/simulated/${community_size}/taxonomy.tsv.gz" # required columns -> contig, taxid
    reference="$HOME/Autometa/autometa/datasets/simulated/${community_size}/reference_assignments.tsv.gz"
    ncbi=$HOME/Autometa/autometa/databases/ncbi

    # Outputs
    output_wide="${community_size}.taxon_profiling_benchmarks.wide.tsv.gz" # file path
    output_long="${community_size}.taxon_profiling_benchmarks.long.tsv.gz" # file path
    reports="${community_size}_taxon_profiling_reports" # directory path

    autometa-benchmark \
        --benchmark classification \
        --predictions $predictions \
        --reference $reference \
        --ncbi $ncbi \
        --output-wide $output_wide \
        --output-long $output_long \
        --output-classification-reports $reports

.. note::

    Using ``--benchmark=classification`` requires the path to a directory containing files (nodes.dmp, names.dmp, merged.dmp)
    from NCBI's taxdump tarball. This should be supplied using the ``--ncbi`` parameter.
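
If you don't already have a local copy of the taxdump files, a minimal sketch for fetching them (assuming NCBI's
standard taxdump URL and the ``$ncbi`` path used in the example above) is:

.. code:: bash

    # Download and unpack NCBI's taxdump into the directory passed via --ncbi
    mkdir -p $HOME/Autometa/autometa/databases/ncbi
    wget -qO- https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz \
        | tar -xzf - -C $HOME/Autometa/autometa/databases/ncbi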

Clustering
----------

Example benchmarking with simulated communities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    # Set community size (see above for selection/download of other community types)
    community_size=78Mbp

    # Inputs
    ## NOTE: predictions and reference were downloaded using autometa-download-dataset
    predictions="$HOME/Autometa/autometa/datasets/simulated/${community_size}/binning.tsv.gz" # required columns -> contig, cluster
    reference="$HOME/Autometa/autometa/datasets/simulated/${community_size}/reference_assignments.tsv.gz"

    # Outputs
    output_wide="${community_size}.clustering_benchmarks.wide.tsv.gz"
    output_long="${community_size}.clustering_benchmarks.long.tsv.gz"

    autometa-benchmark \
        --benchmark clustering \
        --predictions $predictions \
        --reference $reference \
        --output-wide $output_wide \
        --output-long $output_long
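
Continuing in the same shell session, you can take a quick look at the results (a sketch; the exact metric columns
depend on your Autometa version):

.. code:: bash

    # Peek at the wide- and long-format benchmark tables
    zcat $output_wide | head
    zcat $output_long | head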


Binning
-------

Example benchmarking with simulated communities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    # Set community size (see above for selection/download of other community types)
    community_size=78Mbp

    # Inputs
    ## NOTE: predictions and reference were downloaded using autometa-download-dataset
    predictions="$HOME/Autometa/autometa/datasets/simulated/${community_size}/binning.tsv.gz" # required columns -> contig, cluster
    reference="$HOME/Autometa/autometa/datasets/simulated/${community_size}/reference_assignments.tsv.gz"

    # Outputs
    output_wide="${community_size}.binning_benchmarks.wide.tsv.gz"
    output_long="${community_size}.binning_benchmarks.long.tsv.gz"

    autometa-benchmark \
        --benchmark binning-classification \
        --predictions $predictions \
        --reference $reference \
        --output-wide $output_wide \
        --output-long $output_long

Autometa Test Datasets
======================

Descriptions
------------

Simulated Communities
~~~~~~~~~~~~~~~~~~~~~

.. csv-table:: Autometa Simulated Communities
    :file: simulated_community.csv
    :header-rows: 1

You can download all of the simulated communities using this `link <https://drive.google.com/drive/folders/1JFjVb-pfQTv4GXqvqRuTOZTfKdT0MwhN?usp=sharing>`__.
Individual communities can be downloaded using the links in the above table.

For more information on simulated communities,
check the `README.md <https://drive.google.com/file/d/1Ti05Qp13FleuMQdnp3C5L-sXnIM25EZE/view?usp=sharing>`__
located in the ``simulated_communities`` directory.

Synthetic Communities
~~~~~~~~~~~~~~~~~~~~~

Fifty-one bacterial isolates were assembled into a synthetic community that we've titled ``MIX51``.

The community was prepared as a mixture of the fifty-one isolates, and its DNA was extracted for sequencing, assembly and binning.

You can download the MIX51 community using this `link <https://drive.google.com/drive/folders/1x8d0o6HO5N72j7p_D_YxrSurBfpi9zmK?usp=sharing>`__.

Download
--------

Using ``autometa-download-dataset``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Autometa is packaged with a built-in module that allows any user to download any of the available test datasets.
To retrieve these datasets, simply run the ``autometa-download-dataset`` command.

For example, to download the reference assignments for a simulated community as well as the most recent Autometa
binning and taxon-profiling predictions for this community, provide the following parameters:

.. code:: bash

    # choices for simulated: 78Mbp,156Mbp,312Mbp,625Mbp,1250Mbp,2500Mbp,5000Mbp,10000Mbp
    autometa-download-dataset \
        --community-type simulated \
        --community-sizes 78Mbp \
        --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
        --dir-path simulated

This will download ``reference_assignments.tsv.gz``, ``binning.tsv.gz`` and ``taxonomy.tsv.gz`` to the ``simulated/78Mbp`` directory.

- ``reference_assignments.tsv.gz``: tab-delimited file containing contigs with their reference genome assignments. ``cols: [contig, reference_genome, taxid, organism_name, ftp_path, length]``
- ``binning.tsv.gz``: tab-delimited file containing contigs with Autometa binning predictions. ``cols: [contig, cluster]``
- ``taxonomy.tsv.gz``: tab-delimited file containing contigs with Autometa taxon-profiling predictions. ``cols: [contig, kingdom, phylum, class, order, family, genus, species, taxid]``
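
To confirm the files landed where expected (a quick sketch):

.. code:: bash

    # List the downloaded tables
    ls -lh simulated/78Mbp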

Using ``gdrive``
~~~~~~~~~~~~~~~~

You can download the individual assemblies of the different datasets from the command line with the help of ``gdown``
(this is what ``autometa-download-dataset`` uses behind the scenes). If you have installed ``autometa`` using
``conda``, then ``gdown`` should already be installed. If not, you can install it using
``conda install -c conda-forge gdown`` or ``pip install gdown``.

Example for the 78Mbp simulated community
"""""""""""""""""""""""""""""""""""""""""

1. Navigate to the 78Mbp community dataset using the `link <https://drive.google.com/drive/u/2/folders/1McxKviIzkPyr8ovj8BG7n_IYk-QfHAgG>`_ mentioned above.
2. Get the file ID by navigating to any of the files and right-clicking, then selecting the ``get link`` option.
   This will have a ``copy link`` button that you should use. The link for the metagenome assembly
   (i.e. ``metagenome.fna.gz``) should look like this: ``https://drive.google.com/file/d/15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing``
3. The file ID is within the ``/`` forward slashes between ``file/d/`` and ``/``, e.g.:

.. code:: bash
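
    # A hedged sketch (the collapsed diff below hides this block's original body):
    # pass the extracted file ID to gdown to fetch the assembly
    gdown --id 15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y -O metagenome.fna.gz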
@@ -133,25 +230,68 @@ Example for the 78Mbp simulated community
addressing this specific issue which we are keeping a close eye on and will update this documentation when it is merged.


Advanced
========

Data Handling
-------------

Aggregating benchmarking results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When dataset index is unique
""""""""""""""""""""""""""""

.. code:: python

    import pandas as pd
    import glob

    # Stack the per-community long-format tables into one frame, keyed by dataset
    df = pd.concat([
        pd.read_csv(fp, sep="\t", index_col="dataset")
        for fp in glob.glob("*.clustering_benchmarks.long.tsv.gz")
    ])
    df.to_csv("benchmarks.tsv", sep='\t', index=True, header=True)

When dataset index is `not` unique
""""""""""""""""""""""""""""""""""

.. code:: python

    import pandas as pd
    import os
    import glob

    # Normalize dataset index values to basenames before concatenating
    dfs = []
    for fp in glob.glob("*.clustering_benchmarks.long.tsv.gz"):
        df = pd.read_csv(fp, sep="\t", index_col="dataset")
        df.index = df.index.map(lambda fpath: os.path.basename(fpath))
        dfs.append(df)
    df = pd.concat(dfs)
    df.to_csv("benchmarks.tsv", sep='\t', index=True, header=True)


Downloading multiple test datasets at once
------------------------------------------

To download all of the simulated communities' reference binning/taxonomy assignments, as well as the Autometa
v2.0 binning/taxonomy predictions, all at once, you can provide multiple arguments to ``--community-sizes``.

e.g. ``--community-sizes 78Mbp 156Mbp 312Mbp 625Mbp 1250Mbp 2500Mbp 5000Mbp 10000Mbp``

An example of this is shown in the bash script below:

.. code:: bash

    # choices: 78Mbp,156Mbp,312Mbp,625Mbp,1250Mbp,2500Mbp,5000Mbp,10000Mbp
    community_sizes=(78Mbp 156Mbp 312Mbp 625Mbp 1250Mbp 2500Mbp 5000Mbp 10000Mbp)

    autometa-download-dataset \
        --community-type simulated \
        --community-sizes ${community_sizes[@]} \
        --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
        --dir-path simulated

Generating new simulated communities
------------------------------------

Communities were simulated using `ART <https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm>`__,
a sequencing read simulator, with a collection of 3000 randomly retrieved bacterial genomes.
@@ -173,14 +313,4 @@ e.g. ``-l 1250`` would translate to 1250Mbp as the sum of total lengths for all
    # -s : the standard deviation of DNA/RNA fragment size for paired-end simulations.
    # -l : the length of reads to be simulated
    # coverage = (250 * reads) / (length * 1000000)
    # (assuming $reads and $length are set earlier; bc keeps the fractional part)
    coverage=$(bc -l <<< "(250 * $reads) / ($length * 1000000)")
    art_illumina -p -ss HS25 -l 125 -f $coverage -o simulated_reads -m 275 -s 90 -i asm_path
