-
Notifications
You must be signed in to change notification settings - Fork 295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unique-kmers.py ignores the entirety of any sequence with non-ACGTN in it #1540
Comments
Note also that there's a divide by zero if no k-mers are consumed ;). |
Discussion from back in September is relevant: #1036. Since then, whenever I've been consuming genomic sequences, I just preprocess to split on non-ACGTs and filter the resulting fragments by some minimum length. But it would be nice to have khmer handle this nicely. |
Also, some genomic sequences have long internal or terminal stretches of Ns. We'd have to consider whether we want to consume all of these as converted As or ignore (and how exactly that would work). |
I'm fine with N->A conversion generally; history/experience suggests that most sequence has 'em anyway so no harm done. But could easily trash any collection of Ns that is longer than k-mer size, for example. Also, a useful snippet of code that I just used in sourmash:
|
Regarding the comment above by @standage and my code block: I think we should provide a function in |
See #1541 (comment). |
I mean, I think eventually we might want all of the approaches implemented at the C++ level, but...tradeoffs. |
e.g. if a chromosome from a genome sequence has a single non-ACGTN in it, the entire sequence will be ignored. This is unexpected.
This should probably be changed so that such k-mers can be skipped over, or optionally raise an exception - e.g. in the sbt_search branch of sourmash compute, I now convert non-ACGT bases into 'N', unless
--check-sequence
is specified. The khmer-consistent approach would be to convert them into 'A' (seekhmer.utils.clean_input_reads
).The text was updated successfully, but these errors were encountered: