sourmash-bio · luizirber · Apr 28, 2020 · Feb 2, 2020 · Apr 19, 2020 · Apr 23, 2020
diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md
@@ -25,7 +25,8 @@ One useful modification to `search` is to calculate containment with
 matches where the query is contained within the subject, but the
 subject may have many other k-mers in it. For example, if you are using
 a plasmid as a query, you would use `--containment` to find genomes
-that contained that plasmid.
+that contained that plasmid.  `gather` (discussed below) uses containment
+analysis only.
 
 See [the main sourmash
 tutorial](http://sourmash.readthedocs.io/en/latest/tutorials.html#make-and-search-a-database-quickly)
@@ -39,17 +40,17 @@ mixture of many different genomes.  While you might use containment to
 see if a query genome is present in one or more metagenomes, a common
 question to ask is the reverse: **what genomes are in my metagenome?**
 
-We have implemented two algorithms in sourmash to do this.
+We have implemented two approaches in sourmash to do this.
 
-One algorithm uses taxonomic information from e.g. GenBank to classify
+One approach uses taxonomic information from e.g. GenBank to classify
 individual k-mers, and then infers taxonomic distributions of
 metagenome contents from the presence of these individual
 k-mers. (This is the approach pioneered by
-[Kraken](https://ccb.jhu.edu/software/kraken/) and many other tools.)
+[Kraken](https://ccb.jhu.edu/software/kraken/) and used by many other tools.)
 `sourmash lca` can be used to classify individual genome bins with
 `classify`, or summarize metagenome taxonomy with `summarize`.  The
 [sourmash lca tutorial](http://sourmash.readthedocs.io/en/latest/tutorials-lca.html)
-shows how to use the `lca classify` and `summarize` commands, and also
+shows how to use the `lca classify` and `lca summarize` commands, and also
 provides guidance on building your own database.
 
 The other approach, `gather`, breaks a metagenome down into individual
@@ -93,13 +94,77 @@ that the particular species isn't known.
 
 ## Abundance weighting
 
+By default, sourmash tracks k-mer presence, *not* their abundance. The
+proportions and fractions reported also ignore abundance. So, if
+`sourmash gather` reports that a genome is 5% of a metagenome, it is
+reporting Jaccard containment of that genome in the metagenome, and it
+is ignoring information like the number of reads in the metagenome
+that come from that genome.  Similarly, when `sourmash compare`
+compares genome or metagenome signatures, it's reporting Jaccard
+similarity *without* abundance.
+
+However, it is possible to take into account abundance information by
+computing signatures with `--track-abundance`. The abundance
+information will be used if it's present in the signature, and it can
+be ignored with `--ignore-abundance` in any signature comparison.
+
+There are two ways that abundance weighting can be used. One is in
+containment queries for metagenomes, e.g. with `sourmash
+gather`, and the other is in comparisons of abundance-weighted signatures,
+e.g. with `sourmash search` and `sourmash compare`.  Below, we refer to the
+first as "abundance projection" and the second as "angular similarity".
+
+### Projecting abundances in `sourmash gather`:
+
+`sourmash gather` can report approximate abundance information for
+containment queries against genome databases.  This will give you
+numbers that (approximately) match what you get from counting mapped
+reads.
+
 If you compute your input signatures with `--track-abundance`, both
 `sourmash gather` and `sourmash lca gather` will use that information
-to calculate an abundance-weighted result.  Briefly, this will weight
+to calculate an abundance-weighted result.  This will weight
 each match to a hash value by the multiplicity of the hash value in
 the query signature.  You can turn off this behavior with
 `--ignore-abundance`.
 
+For example, if you have a metagenome composed of two equal sized genomes
+A and B, with A present at 10 times the abundance of B, `gather` on
+abundance-weighted signatures will report that approximately 91% of the
+metagenome is A and approximately 9% is B. (If you use `--ignore-abundance`,
+then `gather` will report approximately 50:50, since the genomes are equal
+sized.)
+
+You can also get count-like information from the CSV output of `sourmash
+gather`; the column `median_abund` contains the median abundance of the k-mers
+in the match to the given genome.
+
+**Buyer beware:** There are substantial challenges in doing this kind
+of analysis on real metagenomic samples, relating to genome representation
+and strain overlap; see [this issue](https://github.com/dib-lab/sourmash/issues/461) for a discussion.
+
+### Computing signature similarity with angular similarity.
+
+If signatures that have abundance information are compared with
+`sourmash search` or `sourmash compare`, the default comparison is
+done with
+[angular similarity](https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity). This
+is a distance metric based on cosine similarity, and it is suitable
+for use in clustering.
+
+For more information on the value of this kind of comparison for
+metagenomics, please see the simka paper,
+[Multiple comparative metagenomics using multiset k-mer counting](https://peerj.com/articles/cs-94/),
+Benoit et al., 2016. Initial comparisons of metagenome similarity
+approximations computed with sourmash to the output of simka suggest a
+significant correlation.
+
+**Implementation note:** Angular similarity searches cannot be done on
+SBT or LCA databases currently; you have to provide lists of signature
+files to `sourmash search` and `sourmash compare`.  sourmash will
+provide a warning if you run `sourmash search` on an LCA or SBT with
+an abundance-weighted query, and automatically apply `--ignore-abundance`.
+
 ## What commands should I use?
 
 It's not always easy to figure that out, we know! We're thinking about

diff --git a/doc/conf.py b/doc/conf.py
@@ -59,7 +59,7 @@
 
 # General information about the project.
 project = 'sourmash'
-copyright = '2016, C. Titus Brown and Luiz Irber'
+copyright = '2016-2020, C. Titus Brown and Luiz Irber'
 author = 'C. Titus Brown and Luiz Irber'
 
 # The version info for the project you're documenting, acts as replacement for

diff --git a/tests/test_sourmash.py b/tests/test_sourmash.py
@@ -2998,110 +2998,141 @@ def test_gather_deduce_moltype():
         assert '1.9 kbp      100.0%  100.0%' in out
 
 
-def test_gather_abund_1_1():
+@utils.in_thisdir
+def test_gather_abund_1_1(c):
+    #
+    # make r1.fa with 1x coverage of genome s10
+    # make r2.fa with 10x coverage of genome s10.
+    # make r3.fa with 1x coverage of genome s11.
+    #
     # nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s10.fa.gz > r1.fa
     # nullgraph/make-reads.py -S 1 -r 200 -C 20 tests/test-data/genome-s10.fa.gz > r2.fa
     # nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s11.fa.gz > r3.fa
+
+    #
+    # make signature s10-s11 with r1 and r3, i.e. 1:1 abundance
+    # make signature s10x10-s11 with r2 and r3, i.e. 10:1 abundance
+    #
     # ./sourmash compute -k 21 --scaled 1000 --merge=1-1 -o reads-s10-s11.sig r[13].fa --track-abundance
     # ./sourmash compute -k 21 --scaled 1000 --merge=10-1 -o reads-s10x10-s11.sig r[23].fa --track-abundance
 
-    with utils.TempDirectory() as location:
-        query = utils.get_test_data('gather-abund/reads-s10-s11.sig')
-        against_list = ['genome-s10', 'genome-s11', 'genome-s12']
-        against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
-                         for i in against_list ]
-        against_list = [ utils.get_test_data(i) for i in against_list ]
+    query = utils.get_test_data('gather-abund/reads-s10-s11.sig')
+    against_list = ['genome-s10', 'genome-s11', 'genome-s12']
+    against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
+                     for i in against_list ]
+    against_list = [ utils.get_test_data(i) for i in against_list ]
 
-        status, out, err = utils.runscript('sourmash',
-                                           ['gather', query] + against_list,
-                                           in_directory=location)
+    status, out, err = c.run_sourmash('gather', query, *against_list)
 
-        print(out)
-        print(err)
+    print(out)
+    print(err)
 
-        assert '49.6%   78.5%       1.8    tests/test-data/genome-s10.fa.gz' in out
-        assert '50.4%   80.0%       1.9    tests/test-data/genome-s11.fa.gz' in out
-        assert 'genome-s12.fa.gz' not in out
+    # when we project s10-s11 (r1+r3), 1:1 abundance,
+    # onto s10 and s11 genomes with gather, we get:
+    # * approximately 50% of each query matching (first column, p_query)
+    # * approximately 80% of subject genomes contents being matched
+    #   (this is due to the low coverage of 2 used to build queries)
+    # * approximately 2.0 abundance (third column, avg_abund)
 
+    assert '49.6%   78.5%       1.8    tests/test-data/genome-s10.fa.gz' in out
+    assert '50.4%   80.0%       1.9    tests/test-data/genome-s11.fa.gz' in out
+    assert 'genome-s12.fa.gz' not in out
 
-def test_gather_abund_10_1():
+
+@utils.in_tempdir
+def test_gather_abund_10_1(c):
+    # see comments in test_gather_abund_1_1, above.
     # nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s10.fa.gz > r1.fa
     # nullgraph/make-reads.py -S 1 -r 200 -C 20 tests/test-data/genome-s10.fa.gz > r2.fa
     # nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s11.fa.gz > r3.fa
     # ./sourmash compute -k 21 --scaled 1000 --merge=1-1 -o reads-s10-s11.sig r[13].fa --track-abundance
     # ./sourmash compute -k 21 --scaled 1000 --merge=10-1 -o reads-s10x10-s11.sig r[23].fa --track-abundance
 
-    with utils.TempDirectory() as location:
-        query = utils.get_test_data('gather-abund/reads-s10x10-s11.sig')
-        against_list = ['genome-s10', 'genome-s11', 'genome-s12']
-        against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
-                         for i in against_list ]
-        against_list = [ utils.get_test_data(i) for i in against_list ]
+    query = utils.get_test_data('gather-abund/reads-s10x10-s11.sig')
+    against_list = ['genome-s10', 'genome-s11', 'genome-s12']
+    against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
+                     for i in against_list ]
+    against_list = [ utils.get_test_data(i) for i in against_list ]
 
-        status, out, err = utils.runscript('sourmash',
-                                           ['gather', query, '-o', 'xxx.csv'] \
-                                            + against_list,
-                                           in_directory=location)
+    status, out, err = c.run_sourmash('gather', query, '-o', 'xxx.csv',
+                                      *against_list)
 
-        print(out)
-        print(err)
-        assert '91.0%  100.0%      14.5    tests/test-data/genome-s10.fa.gz' in out
-        assert '9.0%   80.0%       1.9    tests/test-data/genome-s11.fa.gz' in out
-        assert 'genome-s12.fa.gz' not in out
+    print(out)
+    print(err)
 
-        # check the calculations behind the above output by looking into
-        # the CSV.
-        with open(os.path.join(location, 'xxx.csv'), 'rt') as fp:
-            r = csv.DictReader(fp)
+    # when we project s10x10-s11 (r2+r3), 10:1 abundance,
+    # onto s10 and s11 genomes with gather, we get:
+    # * approximately 91% of s10 matching
+    # * approximately 9% of s11 matching
+    # * approximately 100% of the high coverage genome being matched,
+    #       with only 80% of the low coverage genome
+    # * approximately 2.0 abundance (third column, avg_abund) for s11,
+    #   and (very) approximately 20x abundance for genome s10.
+
+    assert '91.0%  100.0%      14.5    tests/test-data/genome-s10.fa.gz' in out
+    assert '9.0%   80.0%       1.9    tests/test-data/genome-s11.fa.gz' in out
+    assert 'genome-s12.fa.gz' not in out
+
+    # check the calculations behind the above output by looking into
+    # the CSV.
+    with open(c.output('xxx.csv'), 'rt') as fp:
+        r = csv.DictReader(fp)
 
-            overlaps = []
-            f_weighted_list = []
-            average_abunds = []
+        overlaps = []
+        f_weighted_list = []
+        average_abunds = []
 
-            for row in r:
-                overlap = float(row['intersect_bp'])
-                f_weighted = float(row['f_unique_weighted'])
-                average_abund = float(row['average_abund'])
+        for row in r:
+            overlap = float(row['intersect_bp'])
+            f_weighted = float(row['f_unique_weighted'])
+            average_abund = float(row['average_abund'])
 
-                overlaps.append(overlap)
-                f_weighted_list.append(f_weighted)
-                average_abunds.append(average_abund)
+            overlaps.append(overlap)
+            f_weighted_list.append(f_weighted)
+            average_abunds.append(average_abund)
 
-            weighted_calc = []
-            for (overlap, average_abund) in zip(overlaps, average_abunds):
-                prod = overlap*average_abund
-                weighted_calc.append(prod)
+        weighted_calc = []
+        for (overlap, average_abund) in zip(overlaps, average_abunds):
+            prod = overlap*average_abund
+            weighted_calc.append(prod)
 
-            total_weighted = sum(weighted_calc)
-            for prod, f_weighted in zip(weighted_calc, f_weighted_list):
-                assert prod / total_weighted == f_weighted, (prod, f_weighted)
+        total_weighted = sum(weighted_calc)
+        for prod, f_weighted in zip(weighted_calc, f_weighted_list):
+            assert prod / total_weighted == f_weighted, (prod, f_weighted)
 
 
-def test_gather_abund_10_1_ignore_abundance():
+@utils.in_thisdir
+def test_gather_abund_10_1_ignore_abundance(c):
+    # see comments in test_gather_abund_1_1, above.
     # nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s10.fa.gz > r1.fa
     # nullgraph/make-reads.py -S 1 -r 200 -C 20 tests/test-data/genome-s10.fa.gz > r2.fa
     # nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s11.fa.gz > r3.fa
     # ./sourmash compute -k 21 --scaled 1000 --merge=1-1 -o reads-s10-s11.sig r[13].fa --track-abundance
     # ./sourmash compute -k 21 --scaled 1000 --merge=10-1 -o reads-s10x10-s11.sig r[23].fa --track-abundance
 
-    with utils.TempDirectory() as location:
-        query = utils.get_test_data('gather-abund/reads-s10x10-s11.sig')
-        against_list = ['genome-s10', 'genome-s11', 'genome-s12']
-        against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
-                         for i in against_list ]
-        against_list = [ utils.get_test_data(i) for i in against_list ]
+    query = utils.get_test_data('gather-abund/reads-s10x10-s11.sig')
+    against_list = ['genome-s10', 'genome-s11', 'genome-s12']
+    against_list = [ 'gather-abund/' + i + '.fa.gz.sig' \
+                     for i in against_list ]
+    against_list = [ utils.get_test_data(i) for i in against_list ]
 
-        status, out, err = utils.runscript('sourmash',
-                                           ['gather', query] \
-                                            + ['--ignore-abundance'] + \
-                                            against_list,
-                                           in_directory=location)
+    status, out, err = c.run_sourmash('gather', query,
+                                      '--ignore-abundance',
+                                      *against_list)
 
-        print(out)
-        print(err)
-        assert all(('57.2%  100.0%', 'tests/test-data/genome-s10.fa.gz' in out))
-        assert all(('42.8%   80.0%', 'tests/test-data/genome-s11.fa.gz' in out))
-        assert 'genome-s12.fa.gz' not in out
+    print(out)
+    print(err)
+
+    # when we project s10x10-s11 (r2+r3), 10:1 abundance,
+    # onto s10 and s11 genomes with gather --ignore-abundance, we get:
+    # * approximately 50% of s10 and s11 matching (first column)
+    # * approximately 100% of the high coverage genome being matched,
+    #       with only 80% of the low coverage genome
+    # no abundance-weighted information is provided here. @CTB check?
+
+    assert all(('57.2%  100.0%', 'tests/test-data/genome-s10.fa.gz' in out))
+    assert all(('42.8%   80.0%', 'tests/test-data/genome-s11.fa.gz' in out))
+    assert 'genome-s12.fa.gz' not in out
 
 
 @utils.in_tempdir