Discrepancies between exact counts and approximate counts #1619
I should note that I'm working off of branch
No good idea off the top of my head. Is there a way to reproduce this with a smaller dataset/less RAM?
You would expect quite a few false positives from billions of k-mers, no? The undercounts are MUCH more concerning to me.
Yep, this is running on a large human sample. I'll do a similar comparison with E. coli and see what I come up with.
Yeah, this is very surprising. Either I'm doing something horribly wrong or we've got some serious work to do. The procedure shown above has been running overnight and is not yet finished, but here is the current distribution of false negatives being reported: column 2 is approximate abundance (khmer) - exact abundance (jellyfish), column 1 is frequency (number of k-mers).
Indeed, but with an FPR ≈ 0.1, would you expect anything to be overcounted by 10? 20?
FPR -> false positive rate. I'm not sure how it's defined for counting Bloom filters, but if the naming is correct I'd assume it means the structure gets a count wrong at that rate, without telling you anything about the size of the error?
Correct, as far as I understand. That is, FPR doesn't give you any direct information about the size of the error. Depending on the size of the table, some k-mers will have more collisions than average and some will have fewer (or none). It just seems extremely improbable that all N=4 buckets associated with a k-mer would be inflated to ≈20 when in reality the k-mer occurs once in the input. Either there's a mistake in the code, or there is some effect that violates our assumption of random collisions. Both @titus and @camillescott have suggested looking at the handling of non-ACGT characters (N --> A could definitely explain something like this), but the
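For intuition on the "all N buckets inflated" argument, here's a toy Monte Carlo sketch (my own, not khmer's code, and it assumes idealized uniform hashing rather than khmer's actual hash functions): each k-mer increments one bucket in each of N tables, the reported count is the minimum over its N buckets, so an overcount of c requires every one of the N buckets to be inflated by at least c at once.

```python
import random

def simulate(n_kmers, table_size, n_tables, trials=1000, seed=1):
    """Count each of n_kmers singleton k-mers in a count-min-style table
    and return the overcount (estimate - 1) for `trials` sampled k-mers."""
    random.seed(seed)
    tables = [[0] * table_size for _ in range(n_tables)]
    # each k-mer hashes to one bucket per table (idealized uniform hashing)
    kmers = [tuple(random.randrange(table_size) for _ in range(n_tables))
             for _ in range(n_kmers)]
    for buckets in kmers:                  # count every k-mer exactly once
        for table, b in zip(tables, buckets):
            table[b] += 1
    # reported count is the min over the k-mer's buckets; true count is 1
    return [min(t[b] for t, b in zip(tables, kmers[i])) - 1
            for i in random.sample(range(n_kmers), trials)]
```

At low occupancy the sampled overcounts are essentially all zero; only when the tables are loaded well past their size do singleton k-mers start reporting large minima.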
Ok, I just re-ran this procedure on SRR030252 (an E. coli data set). Used branch

Download and trim
Jellyfish
khmer

>>> from __future__ import print_function
>>> import khmer
>>>
>>> ct = khmer.Counttable(19, 2.5e8 / 4, 4)
>>> ct.consume_fasta('SRR030252.trim.fq.gz')
>>> ct.save('SRR030252.k19.counttable')
>>> fpr = khmer.calc_expected_collisions(ct)
>>> print('FPR', fpr)

Comparison

Saved as
#!/usr/bin/env python
from __future__ import print_function, division
import argparse
import khmer
import subprocess
import sys
import time

parser = argparse.ArgumentParser()
parser.add_argument('--max-diff', type=int, default=2, metavar='DIFF',
                    help='report any k-mers where the difference between the '
                         'approximate and exact counts is > DIFF; default is 2')
parser.add_argument('--debug-threshold', type=int, default=500000, metavar='N',
                    help='print a debugging update every N k-mers')
parser.add_argument('jf', help='jellyfish database')
parser.add_argument('kf', help='khmer counttable file')
args = parser.parse_args()

start = time.time()
kcounts = khmer._Counttable(1, [1])
kcounts.load(args.kf)
elapsed = time.time() - start
print('DEBUG loaded count table in {:.3f} seconds'.format(elapsed), file=sys.stderr)

undercounts = 0
overcounts = 0
cmd = 'jellyfish dump --column --tab'.split()
cmd += [args.jf]
jpipe = subprocess.Popen(cmd, stdout=subprocess.PIPE)

current_threshold = args.debug_threshold
allstart = time.time()
start = time.time()
for n, line in enumerate(jpipe.stdout):
    if n > 0 and n >= current_threshold:
        current_threshold += args.debug_threshold
        print('DEBUG loaded {} k-mers in {} seconds'.format(n, time.time() - start), file=sys.stderr)
        start = time.time()
    values = line.strip().split('\t')
    kmer = values[0]
    abund = int(values[1])
    kabund = kcounts.get(kmer)
    if kabund < abund:
        undercounts += 1
        print('UNDERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
    elif kabund - abund > args.max_diff:
        overcounts += 1
        print('OVERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
print('DEBUG loaded all {} k-mers in {} seconds'.format(n, time.time() - allstart), file=sys.stderr)
print('Overcounts', overcounts, 'Undercounts', undercounts)

Results

Distribution of undercounts.
Distribution of overcounts.
Ok, looking at 10 randomly selected undercounted k-mers (all undercounted by 1), as well as some cherry-picked k-mers undercounted by 2, the explanation is clear (and not surprising in hindsight): these k-mers all occur in reads that contain Ns and are thus discarded by khmer. When a k-mer is undercounted by 1, there is 1 read containing an N; when it's undercounted by 2, there are 2 reads with Ns. Seems consistent.
Prefiltering reads to remove those containing non-ACGT characters, we get no undercounts. Phew! The distribution of overcounts is identical.
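For reference, the prefilter can be as simple as dropping any FASTQ record whose sequence contains a non-ACGT character. This is a minimal sketch of that idea (my own, with hypothetical file names; the actual run may have used different tooling):

```python
import gzip

ACGT = set('ACGT')

def filter_acgt(infile, outfile):
    """Copy FASTQ records whose sequence is pure ACGT; drop the rest.

    Assumes plain 4-line FASTQ records in gzipped files.
    Returns (kept, dropped) record counts.
    """
    kept = dropped = 0
    with gzip.open(infile, 'rt') as fin, gzip.open(outfile, 'wt') as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:          # end of file
                break
            if set(record[1].strip().upper()) <= ACGT:
                fout.writelines(record)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```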
The 250 MB of memory used in the E. coli example above yielded an FPR of 0.025 (less than the ≈0.1 FPR I've been using for the human stuff). When I increased memory to 2.5 GB the FPR dropped to 5.93e-06, and ALL of the overcounts disappeared. I've been skeptical that low FPRs could result in k-mers being overcounted by as much as 20, but considering the total data volume this all seems fairly reasonable.
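As a sanity check on why more memory helps so dramatically, here is my own back-of-the-envelope model (assuming idealized uniform hashing, four tables, and one byte per bucket; khmer's calc_expected_collisions may compute its estimate differently): the collision rate falls roughly as the fourth power of the per-table occupancy.

```python
from math import exp

def expected_fpr(n_distinct_kmers, mem_bytes, n_tables=4):
    """Chance a random absent k-mer collides in all n_tables tables,
    under an idealized model with one byte per bucket."""
    per_table_buckets = mem_bytes / n_tables
    occupancy = 1.0 - exp(-n_distinct_kmers / per_table_buckets)
    return occupancy ** n_tables

# With a fixed number of distinct k-mers, a 10x memory increase cuts the
# per-table occupancy by roughly 10x, so the combined collision rate
# drops by roughly four orders of magnitude.
```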
yay w00t
Seeing the actual overcount data, I'm no longer surprised. Somehow, when you initially said FPR=0.1, I was thinking 0.1%, blerp.
— Camille Scott
In #1590, we are removing the code that ignores sequences with Ns in them.
Given the strange results I've been getting with kevlar, I felt I had no choice but to compute exact k-mer counts for one of my human samples, compare these exact counts to the approximate counts given by khmer, and find how often big differences occur.
The procedure has not yet completed, but I'm already seeing some alarming results. I hope these are all just some big mistake on my part, but...maybe not? Just in the first part of the data, I've observed:
I don't know what to believe anymore.
Step 1: exact counts with jellyfish
Step 2: approximate counts with khmer
Step 3: compare counts