Deal with k-mers containing non-ACTG sensibly #394

ctb · 2014-04-19T22:35:15Z

In particular, what do we do with 'N's? (See Aaron Liston e-mail to khmer@lists.idyll.org list, 30 Jan 2014). This is a systemic flaw in khmer currently and needs to be addressed at a fairly fundamental level, perhaps in the hash function (which could simply be extended to deal with arbitrary characters, or ACTGactgNn, or...), but it needs to be thought through in terms of implications. Yech.

Definitely a 3.0 kinda issue.

See also #370.

wltrimbl · 2014-07-22T16:48:57Z

A handful of options:

1. Ignore all kmers which contain any IUPAC-ambiguous bases (N, RYSWKM....)
2. Replace ambiguous symbols with an unambiguous symbol. (all Ns replaced by "A" for instance)
3. Replace ambiguous symbols with a random character.
4. Store some kind of fractional symbol in every possible place. (ANA gets 1/4 of a vote in AAA, ACA, AGA, ATA) This is the quake/q-mer approach.

Facts:

- Modern instruments produce Ns but not other ambiguous codes. The ambiguous codes are inferences made by assemblers and multiple sequence alignments.
- A single erroneous base corrupts k entries in the table.
- kmers prove themselves by recurring; our scientific results never hinge on an inference of a single ambiguous base. The q-mer approach ends up downweighting the singleton observations a bit while having no effect on the solid kmers, at the cost of maintaining nightmarish floating-point count tables.

Opinions:

- Option 3 is a bad idea; adding k kmers that are incorrect 3/4 of the time is unwise--it injects avoidable noise into the assembly graph.
- Option 4 is even worse--adding 4k partial-count entries, and expanding your representation from the natural numbers to some floating-point tally of observed kmers.
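
To make the counting above concrete, here is a minimal Python sketch. It is not khmer code, and the `kmers` and `fractional_votes` helper names are made up for illustration. It shows that a single N in a read lands in k overlapping kmers, and that the fractional scheme of option 4 turns each such kmer into four quarter-weight entries.

```python
from itertools import product


def kmers(seq, k):
    """Yield every overlapping window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def fractional_votes(kmer):
    """Option 4 sketch: split one observation evenly over all ACGT fillings of the Ns."""
    n_positions = [i for i, ch in enumerate(kmer) if ch == "N"]
    weight = 1.0 / (4 ** len(n_positions))
    votes = {}
    for fill in product("ACGT", repeat=len(n_positions)):
        chars = list(kmer)
        for pos, base in zip(n_positions, fill):
            chars[pos] = base
        votes["".join(chars)] = weight
    return votes


read = "ACGTNACGT"
k = 3

# One N sits inside k overlapping windows.
affected = [km for km in kmers(read, k) if "N" in km]
print(affected)                 # ['GTN', 'TNA', 'NAC']

# Each affected kmer becomes four fractional entries.
print(fractional_votes("ANA"))  # {'AAA': 0.25, 'ACA': 0.25, 'AGA': 0.25, 'ATA': 0.25}
```

With k = 3 the one N touches 3 kmers, and expanding each of them gives 4 * 3 = 12 partial-count entries, which is the "4k" figure in the opinion above.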

I would recommend deliberately ignoring (failing to hash) all kmers containing N and throwing an error for IUPAC nucleotide symbols other than N.
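
A rough sketch of that policy in Python, purely illustrative and not how khmer's hash function is written. The `count_kmers` name, the upper-casing of input, and rejecting every character outside ACGTN (rather than only the IUPAC codes) are assumptions made for the example.

```python
from collections import Counter

# Assumption: anything outside these five symbols is rejected, which covers
# the IUPAC ambiguity codes (R, Y, S, W, K, M, ...) and any other stray character.
ALLOWED = set("ACGTN")


def count_kmers(seq, k):
    """Count kmers in seq, skipping any kmer that contains an N."""
    seq = seq.upper()  # assumption: lower-case bases are treated like upper-case
    unexpected = set(seq) - ALLOWED
    if unexpected:
        raise ValueError(
            "unsupported symbols in sequence: " + "".join(sorted(unexpected))
        )

    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue  # deliberately fail to hash kmers containing N
        counts[kmer] += 1
    return counts


print(count_kmers("ACGTNACGT", 3))  # only ACG and CGT are counted, twice each
```

Counts stay plain integers here, and the kmers on either side of the N are still recorded; only the k windows that overlap the N are dropped.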

ctb · 2014-07-22T17:11:29Z

sounds good to me! this would need to be a 2.0 thing tho.

mr-c · 2014-07-22T17:26:03Z

@wltrimbl +1

@ctb It is probably time to start a 2.0 branch

ctb · 2014-07-22T17:29:36Z

topic for further discussion :)

ctb · 2015-06-12T20:25:08Z

Conversation continued (with more specificity) in #1036. Closing this 'un; we're not handling IUPAC anytime soon :)

mr-c · 2015-06-12T20:36:17Z

Should we have a separate list of requested but ignored features?

mr-c · 2015-06-12T20:36:51Z

Or there could be a label we could apply 'not-planned'

ctb · 2015-06-12T20:38:03Z

Surely the search function can be used for this?

mr-c · 2015-06-12T20:39:13Z

Right, it would be https://github.com/dib-lab/khmer/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aclosed+label%3A%22not+planned%22+

ctb · 2015-06-12T20:50:00Z

-0

ctb added the discussion-needed label Apr 19, 2014

mr-c added this to the 2.0 milestone Jul 22, 2014

ctb closed this as completed Jun 12, 2015

mr-c added not-planned and removed discussion-needed labels Jun 12, 2015
