diff --git a/autometa/validation/benchmark.py b/autometa/validation/benchmark.py
index 41037f499..8908aa3f4 100644
--- a/autometa/validation/benchmark.py
+++ b/autometa/validation/benchmark.py
@@ -21,9 +21,9 @@
along with Autometa. If not, see .
COPYRIGHT
-Autometa clustering evaluation benchmarking.
+Autometa taxon-profiling, clustering and binning-classification evaluation benchmarking.
-Script to benchmark Autometa clustering results using clustering evaluation metrics.
+Script to benchmark Autometa taxon-profiling, clustering and binning-classification results using clustering and classification evaluation metrics.
"""
diff --git a/docs/source/benchmarking.rst b/docs/source/benchmarking.rst
index c280b0819..f58eab3e6 100644
--- a/docs/source/benchmarking.rst
+++ b/docs/source/benchmarking.rst
@@ -1,113 +1,210 @@
-************
+============
Benchmarking
-************
+============
+
+.. headers/sections formatting
+ See: https://docs.typo3.org/m/typo3/docs-how-to-document/main/en-us/WritingReST/HeadlinesAndSection.html
.. note::
- The most recent benchmarking results are hosted on our `KwanLab/metaBenchmarks `_ Github repository
- and provide a range of analyses covering multiple stages and parameter sets. These benchmarks are available with their own respective
- modules so that the community may easily assess how their novel (``taxon-profiling``, ``clustering``, ``binning``, ``refinement``) algorithms
- perform compared to current state-of-the-art methods. Tools were selected for benchmarking based on their relevance
- to environmental, single-assembly, reference-free binning pipelines.
+ The most recent Autometa benchmarking results covering multiple modules and input parameters are hosted on our
+ `KwanLab/metaBenchmarks `_ Github repository and provide a range of
+ analyses covering multiple stages and parameter sets. These benchmarks are available with their own respective
+ modules so that the community may easily assess how Autometa's novel (``taxon-profiling``, ``clustering``,
+ ``binning``, ``refinement``) algorithms perform compared to current state-of-the-art methods. Tools were selected for
+ benchmarking based on their relevance to environmental, single-assembly, reference-free binning pipelines.
Benchmarking with the ``autometa-benchmark`` module
===================================================
+Autometa includes the ``autometa-benchmark`` entrypoint, a script to benchmark Autometa taxon-profiling, clustering
+and binning-classification prediction results using clustering and classification evaluation metrics. To select the
+appropriate benchmarking method, supply the ``--benchmark`` parameter with the respective choice. The three benchmarking
+methods are detailed below.
+
+.. note::
+ If you'd like to follow along with the benchmarking commands, you may download the test datasets
+ using:
+
+ .. code:: bash
+
+ autometa-download-dataset \
+ --community-type simulated \
+ --community-sizes 78Mbp \
+ --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
+ --dir-path $HOME/Autometa/autometa/datasets/simulated
+
+ This will download three files:
+
+ - ``reference_assignments``: tab-delimited file containing contigs with their reference genome assignments. ``cols: [contig, reference_genome, taxid, organism_name, ftp_path, length]``
+ - ``binning.tsv.gz``: tab-delimited file containing contigs with Autometa binning predictions, ``cols: [contig, cluster]``
+ - ``taxonomy.tsv.gz``: tab-delimited file containing contigs with Autometa taxon-profiling predictions ``cols: [contig, kingdom, phylum, class, order, family, genus, species, taxid]``
+
+
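To sanity-check the downloads, the tables can be opened with pandas. Below is a minimal sketch (the contig and cluster names are made up for illustration) that round-trips a toy ``binning.tsv.gz`` in the same tab-delimited, gzipped layout described above:

```python
import pandas as pd

# Build a toy version of the binning table using the columns described above
# (contig names and cluster labels here are invented), then round-trip it the
# way the real files are stored: tab-delimited and gzip-compressed.
binning = pd.DataFrame(
    {"contig": ["NODE_1", "NODE_2"], "cluster": ["bin_0001", "bin_0002"]}
)
binning.to_csv("binning.tsv.gz", sep="\t", index=False)

# pandas infers gzip compression from the .gz extension on read as well.
df = pd.read_csv("binning.tsv.gz", sep="\t")
print(df.columns.tolist())  # ['contig', 'cluster']
```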
+Taxon-profiling
+---------------
+
Example benchmarking with simulated communities
------------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code:: bash
+
+ # Set community size (see above for selection/download of other community types)
+ community_size=78Mbp
+
+ # Inputs
+ ## NOTE: predictions and reference were downloaded using autometa-download-dataset
+ predictions="$HOME/Autometa/autometa/datasets/simulated/${community_size}/taxonomy.tsv.gz" # required columns -> contig, taxid
+ reference="$HOME/Autometa/autometa/datasets/simulated/${community_size}/reference_assignments.tsv.gz"
+ ncbi=$HOME/Autometa/autometa/databases/ncbi
+
+ # Outputs
+ output_wide="${community_size}.taxon_profiling_benchmarks.wide.tsv.gz" # file path
+ output_long="${community_size}.taxon_profiling_benchmarks.long.tsv.gz" # file path
+ reports="${community_size}_taxon_profiling_reports" # directory path
+
+ autometa-benchmark \
+ --benchmark classification \
+ --predictions $predictions \
+ --reference $reference \
+ --ncbi $ncbi \
+ --output-wide $output_wide \
+ --output-long $output_long \
+ --output-classification-reports $reports
-Download all of the simulated communities and their reference assignments
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. note::
+ Using ``--benchmark=classification`` requires the path to a directory containing files (nodes.dmp, names.dmp, merged.dmp)
+ from NCBI's taxdump tarball. This should be supplied using the ``--ncbi`` parameter.
+
+Clustering
+----------
+
+Example benchmarking with simulated communities
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: bash
- community_sizes=(78Mbp 156Mbp 312Mbp 625Mbp 1250Mbp 2500Mbp 5000Mbp 10000Mbp)
- autometa-download-dataset \
- --community-type simulated \
- --community-sizes ${community_sizes[@]} \
- --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
- --dir-path simulated
+ # Set community size (see above for selection/download of other community types)
+ community_size=78Mbp
+
+ # Inputs
+ ## NOTE: predictions and reference were downloaded using autometa-download-dataset
+ predictions="$HOME/Autometa/autometa/datasets/simulated/${community_size}/binning.tsv.gz" # required columns -> contig, cluster
+ reference="$HOME/Autometa/autometa/datasets/simulated/${community_size}/reference_assignments.tsv.gz"
+
+ # Outputs
+ output_wide="${community_size}.clustering_benchmarks.wide.tsv.gz"
+ output_long="${community_size}.clustering_benchmarks.long.tsv.gz"
+
+ autometa-benchmark \
+ --benchmark clustering \
+ --predictions $predictions \
+ --reference $reference \
+ --output-wide $output_wide \
+ --output-long $output_long
+
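For intuition about what a clustering benchmark measures, here is an illustrative scikit-learn sketch. Autometa's exact metric set is not listed here; adjusted Rand index and normalized mutual information are standard examples of clustering evaluation metrics, and the labelings below are invented:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy data: reference genome per contig vs. predicted cluster per contig
# (labels are made up for illustration).
reference = ["genome_A", "genome_A", "genome_B", "genome_B", "genome_C"]
predicted = ["bin_1", "bin_1", "bin_2", "bin_2", "bin_2"]

# Both metrics compare the two partitions without requiring matching label names.
ari = adjusted_rand_score(reference, predicted)
nmi = normalized_mutual_info_score(reference, predicted)
print(f"ARI={ari:.3f} NMI={nmi:.3f}")
```

Because both metrics are label-invariant, ``bin_1`` never needs to be renamed ``genome_A`` for the comparison to work.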
-Benchmark all of the simulated communities and their reference assignments
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Binning
+-------
+
+Example benchmarking with simulated communities
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: bash
- for community_size in ${community_sizes[@]};do
- autometa-benchmark \
- --benchmark clustering \
- --predictions simulated/${community_size}/binning.tsv.gz \
- --reference simulated/${community_size}/reference_assignments.tsv.gz \
- --output-wide ${community_size}.clustering_benchmarks.wide.tsv.gz \
- --output-long ${community_size}.clustering_benchmarks.long.tsv.gz
- done
+ # Set community size (see above for selection/download of other community types)
+ community_size=78Mbp
+
+ # Inputs
+ ## NOTE: predictions and reference were downloaded using autometa-download-dataset
+ predictions="$HOME/Autometa/autometa/datasets/simulated/${community_size}/binning.tsv.gz" # required columns -> contig, cluster
+ reference="$HOME/Autometa/autometa/datasets/simulated/${community_size}/reference_assignments.tsv.gz"
+
+ # Outputs
+ output_wide="${community_size}.binning_benchmarks.wide.tsv.gz"
+ output_long="${community_size}.binning_benchmarks.long.tsv.gz"
+
+ autometa-benchmark \
+ --benchmark binning-classification \
+ --predictions $predictions \
+ --reference $reference \
+ --output-wide $output_wide \
+ --output-long $output_long
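The binning-classification benchmark scores clusters as class predictions against the reference assignments. As a rough illustration of that idea (not Autometa's exact procedure; the data below are invented), one can map each cluster to its majority reference genome and score contigs against that mapping:

```python
from collections import Counter

# Toy data: (predicted cluster, true reference genome) per contig (invented).
pairs = [
    ("bin_1", "genome_A"), ("bin_1", "genome_A"), ("bin_1", "genome_B"),
    ("bin_2", "genome_B"), ("bin_2", "genome_B"),
]

# Map each cluster to its majority reference genome, then count a contig as
# correctly classified when its cluster's majority genome matches its own.
majority = {}
for cluster in {c for c, _ in pairs}:
    genomes = [g for c, g in pairs if c == cluster]
    majority[cluster] = Counter(genomes).most_common(1)[0][0]

correct = sum(majority[c] == g for c, g in pairs)
accuracy = correct / len(pairs)
print(f"accuracy={accuracy:.2f}")  # 4 of 5 contigs match their cluster's majority genome
```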
-Aggregate across simulated communities (when dataset index is unique)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-.. code:: python
+Autometa Test Datasets
+======================
- import pandas as pd
- import glob
- df = pd.concat([
- pd.read_csv(fp, sep="\t", index_col="dataset")
- for fp in glob.glob("*.clustering_benchmarks.long.tsv.gz")
- ])
- df.to_csv("benchmarks.tsv", sep='\t', index=True, header=True)
+Descriptions
+------------
-Aggregate across simulated communities (when dataset index is `not` unique)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Simulated Communities
+~~~~~~~~~~~~~~~~~~~~~
-.. code:: python
+.. csv-table:: Autometa Simulated Communities
+ :file: simulated_community.csv
+ :header-rows: 1
- import pandas as pd
- import os
- import glob
- dfs = []
- for fp in glob.glob("*.clustering_benchmarks.long.tsv.gz"):
- df = pd.read_csv(fp, sep="\t", index_col="dataset")
- df.index = df.index.map(lambda fpath: os.path.basename(fpath))
- dfs.append(df)
- df = pd.concat(dfs)
- df.to_csv("benchmarks.tsv", sep='\t', index=True, header=True)
+You can download all of the simulated communities using this `link `__.
+Individual communities can be downloaded using the links in the above table.
+
+For more information on simulated communities,
+check the `README.md `__
+located in the ``simulated_communities`` directory.
+
+Synthetic Communities
+~~~~~~~~~~~~~~~~~~~~~
+
+A synthetic community, which we've titled ``MIX51``, was prepared using a mixture of fifty-one bacterial isolates.
+
+The synthetic community's DNA was extracted for sequencing, assembly and binning.
+
+You can download the MIX51 community using this `link `__.
-Downloading Test Datasets
-=========================
+Download
+--------
-Using the built-in ``autometa`` module
---------------------------------------
+Using ``autometa-download-dataset``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Autometa is packaged with a built-in module that allows any user to download any of the available test datasets.
-To use these utilities simply run the command ``autometa-download-dataset``.
+To retrieve these datasets, simply run the ``autometa-download-dataset`` command.
-For example, to download all of the simulated communities reference binning/taxonomy assignments as well as the Autometa
-v2.0 binning/taxonomy predictions:
+For example, to download the reference assignments for a simulated community as well as the most recent Autometa
+binning and taxon-profiling predictions for this community, provide the following parameters:
.. code:: bash
- # Note: community is the test dataset that was used for clustering or classification. e.g.
- # choices: 78Mbp,156Mbp,312Mbp,625Mbp,1250Mbp,2500Mbp,5000Mbp,10000Mbp
- community_sizes=(78Mbp 156Mbp 312Mbp 625Mbp 1250Mbp 2500Mbp 5000Mbp 10000Mbp)
-
+ # choices for simulated: 78Mbp,156Mbp,312Mbp,625Mbp,1250Mbp,2500Mbp,5000Mbp,10000Mbp
autometa-download-dataset \
- --community-type simulated \
- --community-sizes ${community_sizes[@]} \
- --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
- --dir-path simulated
+ --community-type simulated \
+ --community-sizes 78Mbp \
+ --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
+ --dir-path simulated
+
+This will download ``reference_assignments.tsv.gz``, ``binning.tsv.gz`` and ``taxonomy.tsv.gz`` to the ``simulated/78Mbp`` directory.
-Using ``gdrive`` via command line
----------------------------------
+- ``reference_assignments.tsv.gz``: tab-delimited file containing contigs with their reference genome assignments. ``cols: [contig, reference_genome, taxid, organism_name, ftp_path, length]``
+- ``binning.tsv.gz``: tab-delimited file containing contigs with Autometa binning predictions. ``cols: [contig, cluster]``
+- ``taxonomy.tsv.gz``: tab-delimited file containing contigs with Autometa taxon-profiling predictions. ``cols: [contig, kingdom, phylum, class, order, family, genus, species, taxid]``
-You can download the individual assemblies of different datasests with the help of ``gdown`` using command line.
-If you have installed ``autometa`` using ``conda`` then ``gdown`` should already be installed.
-If not, you can install it using ``conda install -c conda-forge gdown`` or ``pip install gdown``.
+Using ``gdrive``
+~~~~~~~~~~~~~~~~
+
+You can download the individual assemblies of different datasets with the help of ``gdown`` on the command line
+(this is what ``autometa-download-dataset`` uses behind the scenes). If you have installed ``autometa`` using
+``conda`` then ``gdown`` should already be installed. If not, you can install it using
+``conda install -c conda-forge gdown`` or ``pip install gdown``.
Example for the 78Mbp simulated community
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+"""""""""""""""""""""""""""""""""""""""""
1. Navigate to the 78Mbp community dataset using the `link `_ mentioned above.
-2. Get the file ID by navigating to any of the files and right clicking, then selecting the ``get link`` option. This will have a ``copy link`` button that you should use. The link for the metagenome assembly (ie. ``metagenome.fna.gz``) should look like this : ``https://drive.google.com/file/d/15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing``
+2. Get the file ID by navigating to any of the files and right clicking, then selecting the ``get link`` option.
+   This will have a ``copy link`` button that you should use. The link for the metagenome assembly
+   (i.e. ``metagenome.fna.gz``) should look like this: ``https://drive.google.com/file/d/15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing``
3. The file ID is within the ``/`` forward slashes between ``file/d/`` and ``/``, e.g:
.. code:: bash
@@ -133,25 +230,68 @@ Example for the 78Mbp simulated community
addressing this specific issue which we are keeping a close eye on and will update this documentation when it is merged.
-Autometa Test Datasets
-======================
+Advanced
+========
-Simulated Communities
----------------------
+Data Handling
+-------------
-.. csv-table:: Autometa Simulated Communities
- :file: simulated_community.csv
- :header-rows: 1
+Aggregating benchmarking results
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-You can download all the Simulated communities using this `link `__.
-Individual communities can be downloaded using the links in the above table.
+When dataset index is unique
+""""""""""""""""""""""""""""
-For more information on simulated communities,
-check the `README.md `__
-located in the ``simulated_communities`` directory.
+.. code:: python
-Generating New Simulated Communities
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ import pandas as pd
+ import glob
+ df = pd.concat([
+ pd.read_csv(fp, sep="\t", index_col="dataset")
+ for fp in glob.glob("*.clustering_benchmarks.long.tsv.gz")
+ ])
+ df.to_csv("benchmarks.tsv", sep='\t', index=True, header=True)
+
+When dataset index is `not` unique
+""""""""""""""""""""""""""""""""""
+
+.. code:: python
+
+ import pandas as pd
+ import os
+ import glob
+ dfs = []
+ for fp in glob.glob("*.clustering_benchmarks.long.tsv.gz"):
+ df = pd.read_csv(fp, sep="\t", index_col="dataset")
+ df.index = df.index.map(lambda fpath: os.path.basename(fpath))
+ dfs.append(df)
+ df = pd.concat(dfs)
+ df.to_csv("benchmarks.tsv", sep='\t', index=True, header=True)
+
+
+Downloading multiple test datasets at once
+------------------------------------------
+
+To download all of the simulated communities' reference binning/taxonomy assignments as well as the Autometa
+v2.0 binning/taxonomy predictions at once, you can provide multiple arguments to ``--community-sizes``.
+
+e.g. ``--community-sizes 78Mbp 156Mbp 312Mbp 625Mbp 1250Mbp 2500Mbp 5000Mbp 10000Mbp``
+
+An example of this is shown in the bash script below:
+
+.. code:: bash
+
+ # choices: 78Mbp,156Mbp,312Mbp,625Mbp,1250Mbp,2500Mbp,5000Mbp,10000Mbp
+ community_sizes=(78Mbp 156Mbp 312Mbp 625Mbp 1250Mbp 2500Mbp 5000Mbp 10000Mbp)
+
+ autometa-download-dataset \
+ --community-type simulated \
+ --community-sizes ${community_sizes[@]} \
+ --file-names reference_assignments.tsv.gz binning.tsv.gz taxonomy.tsv.gz \
+ --dir-path simulated
+
+Generating new simulated communities
+------------------------------------
Communities were simulated using `ART `__,
a sequencing read simulator, with a collection of 3000 bacteria randomly retrieved.
@@ -173,14 +313,4 @@ e.g. ``-l 1250`` would translate to 1250Mbp as the sum of total lengths for all
# -s : the standard deviation of DNA/RNA fragment size for paired-end simulations.
# -l : the length of reads to be simulated
$ coverage = ((250 * reads) / (length * 1000000))
- $ art_illumina -p -ss HS25 -l 125 -f $coverage -o simulated_reads -m 275 -s 90 -i asm_path
-
-Synthetic Communities
----------------------
-
-51 bacterial isolates were assembled into synthetic communities which we've titled ``MIX51``.
-
-The initial synthetic community was prepared using a mixture of fifty-one bacterial isolates.
-The synthetic community's DNA was extracted for sequencing, assembly and binning.
-
-You can download the MIX51 community using this `link `__.
+ $ art_illumina -p -ss HS25 -l 125 -f $coverage -o simulated_reads -m 275 -s 90 -i asm_path
\ No newline at end of file
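The coverage arithmetic from the ``art_illumina`` snippet above can be captured in a small helper. This is a convenience sketch, not part of Autometa; it interprets ``length`` in Mbp, per the ``length * 1000000`` term in the formula, and assumes 250 simulated bases per read pair (paired 125 bp reads):

```python
def fold_coverage(read_pairs: int, genome_length_mbp: float, bases_per_pair: int = 250) -> float:
    """Fold coverage following the formula above:
    coverage = (250 * reads) / (length * 1000000), with length in Mbp.
    bases_per_pair defaults to 250 for paired 125 bp reads (2 x 125)."""
    return (bases_per_pair * read_pairs) / (genome_length_mbp * 1_000_000)

# e.g. 1,000,000 read pairs over a 5 Mbp genome -> 50x coverage
print(fold_coverage(1_000_000, 5))  # 50.0
```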