unique-kmers.py ignores the entirety of any sequence with non-ACGTN in it #1540

ctb · 2016-11-24T14:47:52Z

e.g. if a chromosome from a genome sequence has a single non-ACGTN in it, the entire sequence will be ignored. This is unexpected.

This should probably be changed so that such k-mers can be skipped over, or optionally raise an exception - e.g. in the sbt_search branch of sourmash compute, I now convert non-ACGT bases into 'N', unless --check-sequence is specified. The khmer-consistent approach would be to convert them into 'A' (see khmer.utils.clean_input_reads).
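A minimal sketch of the convert-or-raise behaviour described above, not the actual sourmash or khmer code. The function name `sanitize_sequence`, its `replacement` and `check_sequence` parameters, and the choice to uppercase the input first are all assumptions made for illustration.

```python
import re

# Anything that is not an unambiguous DNA base.
_NON_ACGT = re.compile(r'[^ACGT]')


def sanitize_sequence(seq, replacement='N', check_sequence=False):
    """Replace non-ACGT characters, or raise if check_sequence is set.

    Hypothetical helper: the name and signature are illustrative only.
    """
    seq = seq.upper()  # assumption: lowercase (soft-masked) bases are valid
    match = _NON_ACGT.search(seq)
    if match and check_sequence:
        raise ValueError("non-ACGT character {!r} at position {}".format(
            match.group(), match.start()))
    return _NON_ACGT.sub(replacement, seq)
```

Passing `replacement='A'` would give the khmer-style conversion instead of the 'N' conversion.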

ctb · 2016-11-24T14:49:27Z

Note also that there's a divide by zero if no k-mers are consumed ;).

standage · 2016-11-24T16:34:59Z

Discussion from back in September is relevant: #1036. Since then, whenever I've been consuming genomic sequences, I just preprocess to split on non-ACGTs and filter the resulting fragments by some minimum length. But it would be nice to have khmer handle this nicely.
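Roughly, the preprocessing step looks like the sketch below. The name `split_on_non_acgt` and the decision to report each fragment's offset are illustrative assumptions, not existing khmer functionality.

```python
import re


def split_on_non_acgt(seq, min_length):
    """Yield (offset, fragment) for each ACGT-only run of >= min_length bases.

    Hypothetical helper; uppercasing the input first is an assumption.
    """
    for match in re.finditer(r'[ACGT]+', seq.upper()):
        fragment = match.group()
        if len(fragment) >= min_length:
            yield match.start(), fragment
```

Setting `min_length` to the k-mer size keeps only fragments that contain at least one complete k-mer.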

standage · 2016-11-24T16:36:13Z

Also, some genomic sequences have long internal or terminal stretches of Ns. We'd have to consider whether we want to consume all of these as converted As or ignore (and how exactly that would work).

ctb · 2016-11-25T19:58:11Z

I'm fine with N->A conversion generally; history/experience suggests that most sequence has 'em anyway so no harm done. But we could easily trash any run of Ns that is longer than the k-mer size, for example.

Also, a useful snippet of code that I just used in sourmash:

seq = re.sub('[^ACGT]', 'A', seq)
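Combining that substitution with the idea of trashing long runs could look something like this. It is only a sketch: `clean_and_split` and its `ksize` parameter are hypothetical names, and treating "longer than the k-mer size" as a strict `>` is an assumption.

```python
import re


def clean_and_split(seq, ksize):
    """Split at non-ACGT runs longer than ksize; convert shorter runs to 'A'.

    Hypothetical helper, not part of khmer or sourmash.
    """
    seq = seq.upper()
    fragments = []
    start = 0
    for match in re.finditer(r'[^ACGT]+', seq):
        if match.end() - match.start() > ksize:
            # Long run: discard it and cut the sequence here.
            fragments.append(seq[start:match.start()])
            start = match.end()
    fragments.append(seq[start:])
    # Drop fragments too short to hold a k-mer, then convert leftovers to 'A'.
    return [re.sub('[^ACGT]', 'A', frag)
            for frag in fragments if len(frag) >= ksize]
```

Short ambiguous stretches are kept as 'A', so only long runs break a sequence into pieces.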

ctb · 2017-01-26T18:54:54Z

Regarding the comment above by @standage and my code block: I think we should provide a function in khmer/utils.py that does both the split that @standage does and the 're.sub' above. This would give people an arsenal of tools with which to pre-treat their sequence. We could also put this in screed instead of khmer. Thoughts as to which, or if this is even a good idea?

standage · 2017-01-27T00:46:18Z

See #1541 (comment).

standage · 2017-01-27T00:47:19Z

I mean, I think eventually we might want all of the approaches implemented at the C++ level, but...tradeoffs.

ctb assigned luizirber Nov 24, 2016

ctb mentioned this issue Nov 24, 2016

More thoughts on non-ACGT characters. #1541

Open

ctb mentioned this issue Jan 26, 2017

Deal with some hashing issues. #1596

Merged

2 tasks
