Deal with k-mers containing non-ACTG sensibly #394

Closed
ctb opened this issue Apr 19, 2014 · 10 comments

@ctb
Member

ctb commented Apr 19, 2014

In particular, what do we do with 'N's? (See Aaron Liston e-mail to khmer@lists.idyll.org list, 30 Jan 2014). This is a systemic flaw in khmer currently and needs to be addressed at a fairly fundamental level, perhaps in the hash function (which could simply be extended to deal with arbitrary ch, or ACTGactgNn, or...), but it needs to be thought through in terms of implications. Yech.

Definitely a 3.0 kinda issue.

See also #370.

@wltrimbl
Collaborator

A handful of options:

  1. Ignore all kmers which contain any IUPAC-ambiguous bases (N, RYSWKM....)
  2. Replace ambiguous symbols with an unambiguous symbol. (all Ns replaced by "A" for instance)
  3. Replace ambiguous symbols with a random character.
  4. Store some kind of fractional symbol in every possible place. (ANA gets 1/4 of a vote in AAA, ACA, AGA, ATA) This is the quake/q-mer approach.

Facts:

  1. Modern instruments produce Ns but not other ambiguous codes. The ambiguous codes are inferences made by assemblers and multiple sequence alignments.
  2. A single erroneous base corrupts k entries in the table.
  3. kmers prove themselves by recurring; our scientific results never hinge on an inference of a single ambiguous base. The q-mer approach ends up downweighting the singleton observations a bit while having no effect on the solid kmers, at the cost of maintaining nightmarish floating-point count tables.

Opinions:

Option 3 is a bad idea; adding k kmers that are incorrect 3/4 of the time is unwise--it injects avoidable noise into the assembly graph.

Option 4 is even worse--adding 4k partial-count entries, and expanding your representation from the natural numbers to some floating-point tally of observed kmers.
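
To make the "k entries" and "4k partial-count entries" figures concrete, here is a minimal sketch. It is not khmer code: `kmers`, `kmers_touching_n` and `fractional_counts` are hypothetical names, the example read is made up, and the fractional tally assumes at most one N per k-mer.

```python
from collections import defaultdict
from fractions import Fraction


def kmers(seq, k):
    """Yield every k-mer (length-k substring) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmers_touching_n(seq, k):
    """Return the k-mers of seq that contain at least one N."""
    return [km for km in kmers(seq, k) if "N" in km]


def fractional_counts(seq, k):
    """Option 4 (q-mer style): a k-mer with one N gives 1/4 of a vote
    to each of its four possible resolutions.

    Assumption for this sketch: no k-mer contains more than one N.
    """
    counts = defaultdict(Fraction)
    for km in kmers(seq, k):
        if "N" not in km:
            counts[km] += 1
            continue
        for base in "ACGT":
            counts[km.replace("N", base)] += Fraction(1, 4)
    return counts


# Hypothetical read with a single N well away from both ends.
read = "ACGTACGTNACGTACGT"
k = 4

affected = kmers_touching_n(read, k)       # k k-mers overlap the N
counts = fractional_counts(read, k)
partial = [km for km, c in counts.items() if c.denominator != 1]

print(len(affected), len(partial))
```

For this read the single N touches k = 4 k-mers, and the fractional scheme leaves 4k = 16 table entries holding non-integer counts, which is why the table can no longer be stored as natural numbers.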

I would recommend deliberately ignoring (failing to hash) all kmers containing N and throwing an error for IUPAC nucleotide symbols other than N.
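
A minimal sketch of that policy, assuming nothing about khmer's real hashing code: `hashable_kmers` is a hypothetical helper, and upper-casing the input before checking is an assumption of this sketch, not something decided in this thread.

```python
# IUPAC ambiguity codes other than N (two- and three-base codes).
AMBIGUOUS_NON_N = set("RYSWKMBDHV")


def hashable_kmers(seq, k):
    """Return the k-mers of seq that would be passed on to hashing.

    k-mers containing N are silently skipped; any other non-ACGT symbol
    raises ValueError. Case folding is an assumption of this sketch.
    """
    seq = seq.upper()
    symbols = set(seq)

    ambiguous = symbols & AMBIGUOUS_NON_N
    if ambiguous:
        raise ValueError(
            "IUPAC ambiguity codes other than N are not supported: "
            + "".join(sorted(ambiguous))
        )

    unknown = symbols - set("ACGTN")
    if unknown:
        raise ValueError(
            "unexpected symbols in sequence: " + "".join(sorted(unknown))
        )

    result = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue  # deliberately not hashed
        result.append(kmer)
    return result


print(hashable_kmers("ACGTNACGT", 4))
```

With this behaviour a read containing one N simply loses the k-mers that overlap it, while a read containing, say, R or Y is rejected outright rather than counted incorrectly.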

@ctb
Member Author

ctb commented Jul 22, 2014

On Tue, Jul 22, 2014 at 09:48:57AM -0700, Will Trimble wrote:

> I would recommend deliberately ignoring (failing to hash) all kmers containing N and throwing an error for IUPAC nucleotide symbols other than N.

sounds good to me! this would need to be a 2.0 thing tho.

@mr-c
Contributor

mr-c commented Jul 22, 2014

@wltrimbl +1

@ctb It is probably time to start a 2.0 branch

@mr-c mr-c added this to the 2.0 milestone Jul 22, 2014
@ctb
Member Author

ctb commented Jul 22, 2014

On Tue, Jul 22, 2014 at 10:26:03AM -0700, Michael R. Crusoe wrote:

> @wltrimbl +1
>
> @ctb It is probably time to start a 2.0 branch

topic for further discussion :)

@ctb
Member Author

ctb commented Jun 12, 2015

Conversation continued (with more specificity) in #1036. Closing this 'un; we're not handling IUPAC anytime soon :)

@ctb ctb closed this as completed Jun 12, 2015
@mr-c
Contributor

mr-c commented Jun 12, 2015

Should we have a separate list of requested but ignored features?

@mr-c
Contributor

mr-c commented Jun 12, 2015

Or there could be a label we could apply: 'not-planned'.

@ctb
Member Author

ctb commented Jun 12, 2015

Surely the search function can be used for this?

@mr-c
Contributor

mr-c commented Jun 12, 2015

@ctb
Member Author

ctb commented Jun 12, 2015 via email
