Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discrepancies between exact counts and approximate counts #1619

Closed
standage opened this issue Feb 14, 2017 · 14 comments
Closed

Discrepancies between exact counts and approximate counts #1619

standage opened this issue Feb 14, 2017 · 14 comments

Comments

@standage
Copy link
Member

Given the strange results I've been getting with kevlar, I felt I had no choice but to compute exact k-mer counts for one of my human samples, compare these exact counts to the approximate counts given by khmer, and find how often big differences occur.

The procedure has not yet completed, but I'm already seeing some alarming results. I hope these are all just some big mistake on my part, but...maybe not? Just in the first part of the data, I've observed:

  • thousands of k-mers whose khmer-computed abundances are inflated by 5-10
  • dozens of k-mers inflated by > 10 (counttable FPR=0.1).
  • tens of thousands of khmer-computed k-mer abundances that are lower than the exact count. What?!?! Vast majority of these have a difference of 1, but some as many as 5 to 7.

I don't know what to believe anymore.


Step 1: exact counts with jellyfish

jellyfish count \
    --mer-len=31 \
    --size=32G \
    --threads=16 \
    --output=NA19238.k31.jellyfish.counts \
    --out-counter-len=1 \
    --canonical \
    --timing=NA19238.k31.jellyfish/timing \
    --generator=<(echo 'gunzip -c NA19238.trim.fq.gz') \
    > jellyfish.log \
    2>&1

Step 2: approximate counts with khmer

>>> from __future__ import print_function
>>> import khmer
>>>
>>> ct = khmer.Counttable(31, 64e9 / 4, 4)
>>> ct.consume_fasta('NA19238.trim.fq.gz')
>>> ct.save('NA19238.k31.counttable')
>>> fpr = khmer.calc_expected_collisions(ct)
>>> print('FPR', fpr)

Step 3: compare counts

#!/usr/bin/env python
from __future__ import print_function, division
import argparse
import khmer
import subprocess
import sys
import time

parser = argparse.ArgumentParser()
parser.add_argument('--max-diff', type=int, default=2, metavar='DIFF',
                    help='report any k-mers where the difference between the '
                    'appoximate and exact counts is > DIFF; default is 2')
parser.add_argument('--debug-threshold', type=int, default=500000, metavar='N',
                    help='print a debugging update every N k-mers')
parser.add_argument('jf', help='jellyfish database')
parser.add_argument('kf', help='khmer counttable file')
args = parser.parse_args()

start = time.time()
kcounts = khmer._Counttable(1, [1])
kcounts.load(args.kf)
elapsed = time.time() - start
print('DEBUG loaded count table in {:.3f} seconds'.format(elapsed), file=sys.stderr)

undercounts = 0
overcounts = 0

cmd = 'jellyfish dump --column --tab'.split()
cmd += [args.jf]
jpipe = subprocess.Popen(cmd, stdout=subprocess.PIPE)
current_threshold = args.debug_threshold
allstart = time.time()
start = time.time()
for n, line in enumerate(jpipe.stdout):
    if n > 0 and n >= current_threshold:
        current_threshold += args.debug_threshold
        print('DEBUG loaded {} k-mers in {} seconds'.format(n, time.time() - start), file=sys.stderr)
        start = time.time()
    values = line.strip().split('\t')
    kmer = values[0]
    abund = int(values[1])
    kabund = kcounts.get(kmer)
    
    if kabund < abund:
        undercounts += 1
        print('UNDERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
    elif kabund - abund > args.max_diff:
        overcounts += 1
        print('OVERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
print('DEBUG loaded all {} kmers in {} seconds'.format(n, time.time() - allstart), file=sys.stderr)
print('Overcounts', overcounts, 'Undercounts', undercounts)
@standage
Copy link
Member Author

I should note that I'm working off of branch research/banding_and_k32+. This is essentially a merge of refactor/hashing2 and feature/consume_bitsplit. The latter contains only additions, which should not affect the behavior of these tests.

@betatim
Copy link
Member

betatim commented Feb 14, 2017

No good idea off the top of my head. Is there a way to reproduce this with a smaller dataset/less RAM?

@ctb
Copy link
Member

ctb commented Feb 14, 2017

You would expect quite a few false positives from billions of k-mers, no?

The undercounts are MUCH more concerning to me.

@standage
Copy link
Member Author

Is there a way to reproduce this with a smaller dataset/less RAM?

Yep, this is running on a large human sample. I'll do a similar comparison with E. coli and see what I come up with.

The undercounts are MUCH more concerning to me.

Yeah, this is very surprising. Either I'm doing something horribly wrong or we've got some serious work to do.

The procedure shown above has been running overnight and is not yet finished, but here is the current distribution of false negatives being reported: column 2 is approximate abundance (khmer) - exact abundance (jellyfish), column 1 is frequency (number of k-mers).

5564172 -1
 326417 -2
  19459 -3
   1889 -4
    363 -5
     82 -6
     26 -7
      5 -8
      1 -9
      2 -11

@standage
Copy link
Member Author

You would expect quite a few false positives from billions of k-mers, no?

Indeed, but with a FPR ≈ 0.1 would you expect anything to be overcounted by 10? 20?

@betatim
Copy link
Member

betatim commented Feb 14, 2017

Indeed, but with a FPR ≈ 0.1 would you expect anything to be overcounted by 10? 20?

FPR -> false positive rate, not sure for counting bloom filters but if the naming is correct I would assume it means it gets it wrong at that rate but doesn't tell you anything about the size of the error?

@standage
Copy link
Member Author

...but doesn't tell you anything about the size of the error?

Correct, as far as I understand. That is , FPR doesn't give you any direct information about the size of the error. Depending on the size of the table, some k-mers will have more collisions than average and some will have fewer than average (or none). It just seems extremely improbable that all N=4 buckets associated with a k-mer would be inflated to ≈20 when in reality the k-mer occurs once in the input. Either there's a mistake in the code, or there is some effect that violates our assumption of random collisions. Both @titus and @camillescott have suggested to look at handling of non-ACGT characters (N --> A could definitely explain something like this), but the consume_fasta functions currently discard any reads with non-ACGT.

@standage
Copy link
Member Author

Ok, I just re-ran this procedure on SRR030252 (an E. coli data set). Used branch refactor/hashing2 this time.


Download and trim

fastq-dump \
        --split-files \
        --defline-seq '@$ac.$si.$sg/$ri' \
        --defline-qual '+' \
        -Z --gzip SRR030252 \
    > SRR030252.fq.gz

trim-low-abund.py \
        -k 17 -M 250M --variable-coverage \
        -o SRR030252.trim.fq.gz --gzip \
        SRR030252.fq.gz \
    2> SRR030252.trim.log &

Jellyfish

jellyfish count \
    --mer-len=19 \
    --size=1G \
    --threads=4 \
    --output=SRR030252.k19.jellyfish.counts \
    --out-counter=1 \
    --canonical \
    --timing=SRR030252.k19.jellyfish.timing \
    --generator=<(echo 'gunzip -c SRR030252.trim.fq.gz') \
    > SRR030252.k19.jellyfish.log \
    2>&1

khmer

>>> from __future__ import print_function
>>> import khmer
>>> 
>>> ct = khmer.Counttable(19, 2.5e8 / 4, 4)
>>> ct.consume_fasta('SRR030252.trim.fq.gz')
>>> ct.save('SRR030252.k19.counttable')
>>> fpr = khmer.calc_expected_collisions(ct)
>>> print('FPR', fpr)

Comparison

Saved as compare.py, executed with:

./compare.py \
        --debug-threshold 1000000 \
        SRR030252.k19.jellyfish.counts \
        SRR030252.k19.counttable \
    > SRR030252.k19.compare.txt 2>&1
#!/usr/bin/env python
from __future__ import print_function, division
import argparse
import khmer
import subprocess
import sys
import time

parser = argparse.ArgumentParser()
parser.add_argument('--max-diff', type=int, default=2, metavar='DIFF',
                    help='report any k-mers where the difference between the '
                    'appoximate and exact counts is > DIFF; default is 2')
parser.add_argument('--debug-threshold', type=int, default=500000, metavar='N',
                    help='print a debugging update every N k-mers')
parser.add_argument('jf', help='jellyfish database')
parser.add_argument('kf', help='khmer counttable file')
args = parser.parse_args()

start = time.time()
kcounts = khmer._Counttable(1, [1])
kcounts.load(args.kf)
elapsed = time.time() - start
print('DEBUG loaded count table in {:.3f} seconds'.format(elapsed), file=sys.stderr)

undercounts = 0
overcounts = 0

cmd = 'jellyfish dump --column --tab'.split()
cmd += [args.jf]
jpipe = subprocess.Popen(cmd, stdout=subprocess.PIPE)
current_threshold = args.debug_threshold
allstart = time.time()
start = time.time()
for n, line in enumerate(jpipe.stdout):
    if n > 0 and n >= current_threshold:
        current_threshold += args.debug_threshold
        print('DEBUG loaded {} k-mers in {} seconds'.format(n, time.time() - start), file=sys.stderr)
        start = time.time()
    values = line.strip().split('\t')
    kmer = values[0]
    abund = int(values[1])
    kabund = kcounts.get(kmer)
    
    if kabund < abund:
        undercounts += 1
        print('UNDERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
    elif kabund - abund > args.max_diff:
        overcounts += 1
        print('OVERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
print('DEBUG loaded all {} kmers in {} seconds'.format(n, time.time() - allstart), file=sys.stderr)
print('Overcounts', overcounts, 'Undercounts', undercounts)

Results

Distribution of undercounts.

$ grep UNDERCOUNT SRR030252.k19.compare.txt  \
        | cut -f 5 -d ' ' \
        | sort | uniq -c \
        | grep -v UNDER \
        | sort -rn -k2,2 
  33460 -1
     69 -2
      2 -3

Distribution of overcounts.

grep OVERCOUNT SRR030252.k19.compare.txt \
        | cut -f 5 -d ' ' \
        | sort | uniq -c \
        | sort -n -k2,2
    784 3
     90 4
     27 5
     14 6
     23 7
     32 8
     34 9
     50 10
     52 11
     62 12
     87 13
     77 14
     67 15
     72 16
     64 17
     38 18
     31 19
     19 20
      9 21
      9 22
      5 23
      3 24
      3 25
      1 26

@standage
Copy link
Member Author

Ok, looking at 10 randomly selected undercounted kmers (all undercounted by 1), as well as some cherry-picked kmers undercounted by two, the explanation is clear (and not surprising in hindsight): these k-mers are all included in reads that contain Ns and are thus discarded by khmer. When it's undercounted by 1, there is 1 read containing an N. When it's undercounted by 2, there are 2 reads with Ns. Seems consistent.

@standage
Copy link
Member Author

Prefiltering reads to remove those containing non-ACGT, we get no undercounts. Phew! The distribution of overcounts is identical.

@standage
Copy link
Member Author

standage commented Feb 14, 2017

The 250M memory used in the E. coli example above yielded an FPR of 0.025 (less than the ≈0.1 FPR I've been using for the human stuff). When I increased memory to 2.5 G the FPR dropped to 5.93e-06, and ALL of the overcounts disappeared.

I've been skeptical that low FPRs could result in k-mers being overcounted by as much as 20, but I guess considering total data volume this all seems fairly reasonable.

@ctb
Copy link
Member

ctb commented Feb 14, 2017 via email

@camillescott
Copy link
Member

camillescott commented Feb 14, 2017 via email

@ctb
Copy link
Member

ctb commented Feb 14, 2017

In #1590, we are removing the code that ignores sequences with Ns in them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants