Discrepancies between exact counts and approximate counts #1619
I should note that I'm working off of branch
No good idea off the top of my head. Is there a way to reproduce this with a smaller dataset/less RAM?
You would expect quite a few false positives from billions of k-mers, no? The undercounts are MUCH more concerning to me.
Yep, this is running on a large human sample. I'll do a similar comparison with E. coli and see what I come up with.
Yeah, this is very surprising. Either I'm doing something horribly wrong or we've got some serious work to do. The procedure shown above has been running overnight and is not yet finished, but here is the current distribution of false negatives being reported: column 2 is approximate abundance (khmer) - exact abundance (jellyfish), column 1 is frequency (number of k-mers).
Indeed, but with an FPR ≈ 0.1, would you expect anything to be overcounted by 10? 20?
FPR -> false positive rate. I'm not sure how it's defined for counting Bloom filters, but if the naming is correct I'd assume it means the structure gets a count wrong at that rate, without telling you anything about the size of the error?
Correct, as far as I understand. That is, FPR doesn't give you any direct information about the size of the error. Depending on the size of the table, some k-mers will have more collisions than average and some will have fewer (or none). It just seems extremely improbable that all N=4 buckets associated with a k-mer would be inflated to ≈20 when in reality the k-mer occurs once in the input. Either there's a mistake in the code, or there is some effect that violates our assumption of random collisions. Both @titus and @camillescott have suggested looking at the handling of non-ACGT characters (N --> A could definitely explain something like this), but the
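For intuition on the "all N buckets inflated" argument, here's a toy Monte Carlo sketch (my own, not khmer's code, and it assumes idealized uniform hashing rather than khmer's actual hash functions): each k-mer increments one bucket in each of N tables, the reported count is the minimum over its N buckets, so an overcount of c requires every one of the N buckets to be inflated by at least c at once.

```python
import random

def simulate(n_kmers, table_size, n_tables, trials=1000, seed=1):
    """Count each of n_kmers singleton k-mers in a count-min-style table
    and return the overcount (estimate - 1) for `trials` sampled k-mers."""
    random.seed(seed)
    tables = [[0] * table_size for _ in range(n_tables)]
    # each k-mer hashes to one bucket per table (idealized uniform hashing)
    kmers = [tuple(random.randrange(table_size) for _ in range(n_tables))
             for _ in range(n_kmers)]
    for buckets in kmers:                  # count every k-mer exactly once
        for table, b in zip(tables, buckets):
            table[b] += 1
    # reported count is the min over the k-mer's buckets; true count is 1
    return [min(t[b] for t, b in zip(tables, kmers[i])) - 1
            for i in random.sample(range(n_kmers), trials)]
```

At low occupancy the sampled overcounts are essentially all zero; only when the tables are loaded well past their size do singleton k-mers start reporting large minima.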
Ok, I just re-ran this procedure on SRR030252 (an E. coli data set). Used branch

Download and trim
Jellyfish
khmer

>>> from __future__ import print_function
>>> import khmer
>>>
>>> ct = khmer.Counttable(19, 2.5e8 / 4, 4)
>>> ct.consume_fasta('SRR030252.trim.fq.gz')
>>> ct.save('SRR030252.k19.counttable')
>>> fpr = khmer.calc_expected_collisions(ct)
>>> print('FPR', fpr)

Comparison

Saved as
#!/usr/bin/env python
from __future__ import print_function, division
import argparse
import khmer
import subprocess
import sys
import time

parser = argparse.ArgumentParser()
parser.add_argument('--max-diff', type=int, default=2, metavar='DIFF',
                    help='report any k-mers where the difference between the '
                         'approximate and exact counts is > DIFF; default is 2')
parser.add_argument('--debug-threshold', type=int, default=500000, metavar='N',
                    help='print a debugging update every N k-mers')
parser.add_argument('jf', help='jellyfish database')
parser.add_argument('kf', help='khmer counttable file')
args = parser.parse_args()

start = time.time()
kcounts = khmer._Counttable(1, [1])
kcounts.load(args.kf)
elapsed = time.time() - start
print('DEBUG loaded count table in {:.3f} seconds'.format(elapsed), file=sys.stderr)

undercounts = 0
overcounts = 0
cmd = 'jellyfish dump --column --tab'.split()
cmd += [args.jf]
jpipe = subprocess.Popen(cmd, stdout=subprocess.PIPE)

current_threshold = args.debug_threshold
allstart = time.time()
start = time.time()
for n, line in enumerate(jpipe.stdout):
    if n > 0 and n >= current_threshold:
        current_threshold += args.debug_threshold
        print('DEBUG loaded {} k-mers in {} seconds'.format(n, time.time() - start), file=sys.stderr)
        start = time.time()
    values = line.strip().split('\t')
    kmer = values[0]
    abund = int(values[1])
    kabund = kcounts.get(kmer)
    if kabund < abund:
        undercounts += 1
        print('UNDERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
    elif kabund - abund > args.max_diff:
        overcounts += 1
        print('OVERCOUNT', kmer, abund, kabund, kabund - abund, file=sys.stderr)
print('DEBUG loaded all {} k-mers in {} seconds'.format(n, time.time() - allstart), file=sys.stderr)
print('Overcounts', overcounts, 'Undercounts', undercounts)

Results

Distribution of undercounts.
Distribution of overcounts.
Ok, looking at 10 randomly selected undercounted k-mers (all undercounted by 1), as well as some cherry-picked k-mers undercounted by 2, the explanation is clear (and not surprising in hindsight): these k-mers all occur in reads that contain Ns and are thus discarded by khmer. When a k-mer is undercounted by 1, there is 1 read containing an N; when it's undercounted by 2, there are 2 reads with Ns. Seems consistent.
Prefiltering reads to remove those containing non-ACGT characters, we get no undercounts. Phew! The distribution of overcounts is identical.
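For reference, the prefilter can be as simple as dropping any FASTQ record whose sequence contains a non-ACGT character. This is a minimal sketch of that idea (my own, with hypothetical file names; the actual run may have used different tooling):

```python
import gzip

ACGT = set('ACGT')

def filter_acgt(infile, outfile):
    """Copy FASTQ records whose sequence is pure ACGT; drop the rest.

    Assumes plain 4-line FASTQ records in gzipped files.
    Returns (kept, dropped) record counts.
    """
    kept = dropped = 0
    with gzip.open(infile, 'rt') as fin, gzip.open(outfile, 'wt') as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:          # end of file
                break
            if set(record[1].strip().upper()) <= ACGT:
                fout.writelines(record)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```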
The 250 MB of memory used in the E. coli example above yielded an FPR of 0.025 (less than the ≈0.1 FPR I've been using for the human stuff). When I increased memory to 2.5 GB the FPR dropped to 5.93e-06, and ALL of the overcounts disappeared. I've been skeptical that low FPRs could result in k-mers being overcounted by as much as 20, but considering the total data volume this all seems fairly reasonable.
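As a sanity check on why more memory helps so dramatically, here is my own back-of-the-envelope model (assuming idealized uniform hashing, four tables, and one byte per bucket; khmer's calc_expected_collisions may compute its estimate differently): the collision rate falls roughly as the fourth power of the per-table occupancy.

```python
from math import exp

def expected_fpr(n_distinct_kmers, mem_bytes, n_tables=4):
    """Chance a random absent k-mer collides in all n_tables tables,
    under an idealized model with one byte per bucket."""
    per_table_buckets = mem_bytes / n_tables
    occupancy = 1.0 - exp(-n_distinct_kmers / per_table_buckets)
    return occupancy ** n_tables

# With a fixed number of distinct k-mers, a 10x memory increase cuts the
# per-table occupancy by roughly 10x, so the combined collision rate
# drops by roughly four orders of magnitude.
```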
yay w00t
Seeing the actual overcount data, I'm no longer surprised. Somehow, when you initially said FPR=0.1, I was thinking 0.1%, blerp.
— Camille Scott
In #1590, we are removing the code that ignores sequences with Ns in them.
Given the strange results I've been getting with kevlar, I felt I had no choice but to compute exact k-mer counts for one of my human samples, compare these exact counts to the approximate counts given by khmer, and find how often big differences occur.
The procedure has not yet completed, but I'm already seeing some alarming results. I hope these are all just some big mistake on my part, but...maybe not? Just in the first part of the data, I've observed:
I don't know what to believe anymore.
Step 1: exact counts with jellyfish
Step 2: approximate counts with khmer
Step 3: compare counts