Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More thoughts on non-ACGT characters. #1541

Open
ctb opened this issue Nov 24, 2016 · 21 comments
Open

More thoughts on non-ACGT characters. #1541

ctb opened this issue Nov 24, 2016 · 21 comments

Comments

@ctb
Copy link
Member

ctb commented Nov 24, 2016

Spurred by #1540, I'm converging on two principles regarding non-ACGT:

  • we shouldn't ever silently ignore an entire sequence record, because that leads to surprisingly inconsistent behavior between short reads and genomic sequences, as in the above where an entire genome vanishes from the k-mer counts. So we should either raise an exception OR automatically convert.
  • automatic conversion to As or something is probably the right default behavior, unless --check-sequence or --pedantic is specified. In part this is because even NCBI records have non-ACGT IUPAC bases in them.

There is a fun cautionary tale in this twitter exchange that is relevant, I think; we want to get the k-mer breakdown at least approximately correct :). This is another reason I think automatic conversion is appropriate - if the sequence type is completely wrong (e.g. protein, or RNA, or text) then the k-mer counts will be visibly off, but this is rare compared to the situation where you are looking at mostly legit DNA sequence and there's an S in the middle of a 5 Mbp genome. (I used to think we should automatically convert until some threshold was reached, like "1% of your sequence is bad... STOP", but that's overly clever and will no doubt bite us.

Last but not least, we should include a script (in khmer? screed?) that does a breakdown of the non-ACGT content of your data set so that people can debug more easily.

@betatim
Copy link
Member

betatim commented Nov 25, 2016

How do the non-ACGT letters get into the data in the first place?

@ctb
Copy link
Member Author

ctb commented Nov 25, 2016 via email

@betatim
Copy link
Member

betatim commented Nov 25, 2016

So there is the case where the user is probably not doing what they think they are doing (feeding not DNA sequence into khmer) and the cleverness case. For the latter, should we be clever and sample a replacement from the possibilities (e.g. replace S with random.sample(['G', 'C']))? Or even add one kmer with each option? Or is that (trying to be clever)**2?

@ctb
Copy link
Member Author

ctb commented Nov 25, 2016 via email

@ctb
Copy link
Member Author

ctb commented Dec 19, 2016

I wonder if we can look at the beginning of the file and do some simple statistics (on, say, the first 100kb of sequence) and then barf if it looks bad?

This might be a good use for issues with bad pairing as well - scripts that are told to expect pairing info and don't get any pairs could barf.

After the first 100kb we could then assume that all the data is "good format" and go from there. Will probably catch 90%+ of problems...

@standage
Copy link
Member

Another option is to simply ignore any k-mers containing a non-ACGT character. I don't know how this will affect performance of processing short reads, but it's essentially what I do now for long genomic sequences (by splitting chromosome sequences by [^ACGT] in preprocessing).

@ctb
Copy link
Member Author

ctb commented Dec 22, 2016

@standage, that could solve counting/graph issues, but we should maybe still warn the user if they submit a grossly incorrect file.

more generally, we have a few different things we do with input data:

  1. load it into count tables and graphs
  2. trim reads
  3. discard reads

ignoring non-ACGT works for 1, and maybe 3, but not necessarily for 2.

ugh, bioinformatics :)

@ctb
Copy link
Member Author

ctb commented Jan 24, 2017

Spurred by this comment and previous --

DNA validity checking is done in the following places:

To my surprise, the CPython function Hashtable::consume does no checking...

Some comments:

My biggest concern is how to figure out (for ourselves), test, and explain (to others) what we're actually doing with sequence preprocessing. One thought is to provide a function purely for debugging that takes an input file, runs it through our sequence processing code, and then outputs it with those changes; then we can write tests against that behavior as well. May be pointless but may also be friendly.

If someone were to want to get started on removing the check_and* code a good start would be to just remove it from the various functions and see if the md5sum tests pass - that would show that it's not actually needed any more (which is my suspicion).

@ctb
Copy link
Member Author

ctb commented Jan 24, 2017

Started removing some code in #1590; it seems likely that the check_and* code can be removed from consume_fasta too.

@ctb
Copy link
Member Author

ctb commented Jan 24, 2017

...as indeed it can. Anyway, #1590.

@standage
Copy link
Member

@standage in kevlar should do his own danged preprocessing if he doesn't want to use our bulk loading code ;)

:)

I use the bulk loading code for...loading count tables/graphs, but I iterate over screed records and k-mers to do abundance checking (which is where I ran into issues). I added some temporary basic preprocessing to kevlar last night, but if khmer's get/get_count and hash/hash_dna functions are explicitly not going to support non ACGT characters, there's no problem making these changes to kevlar permanent.

@ctb
Copy link
Member Author

ctb commented Jan 24, 2017 via email

@standage
Copy link
Member

I'm adding some tests to kevlar to verify assumptions regarding non-ACGT are met. I'm getting the following error.

>>> import khmer
>>> ct = khmer.Counttable(15, 1e6, 3)
>>> ct.consume('TTGACTTGACTTCAG')
1
>>> ct.consume('TTGACTTGACTTCAG')
1
>>> assert ct.get('TTGACTTGACTTCAG') > 0
>>>
>>> ct.consume('TTCAGNNNNNTTCAG')
0
>>> try:
...     _ = ct.get('TTCAGNNNNNTTCAG')
... except e:
...     print(str(e))
...
libc++abi.dylib: terminating with uncaught exception of type khmer::khmer_exception: Invalid base in read
Abort trap: 6

It looks like we need to make sure C++ exceptions get handed off to Python so that they can be handled at the Python level.

@ctb
Copy link
Member Author

ctb commented Jan 26, 2017

More thoughts: I think one reason to settle on not doing anything to invalid DNA is that we are seeing a plethora of use cases for different approaches: ignoring non-ACTG, splitting on non-ACGT, replacing non-ACGT with N. So that's good not to dictate internally. But we can provide fast defaults with bulk loading, which is good, I think.

@standage
Copy link
Member

Yeah, it we want performance AND flexibility we need to implement the different use cases at the C++ level. Using a performant and reasonable default for bulk loading but providing different (maybe slower) options at the Python level is a fine compromise for now I think.

For instance, I'm mostly interested in the skipping / splitting approach for long genomic sequences, which are typically much smaller in total volume than Fastq/BAM files full of millions of reads. Sequence handling in Python is fine for the former, but I want to avoid it when possible for the latter.

@betatim
Copy link
Member

betatim commented Jan 27, 2017

Why not provide some standard/frequently used strategies as options for methods that consume a sequence or a file?

consume_fasta(fname, stragety=None) which continues to do what is currently the default, and then provide 'split', 'error', 'ignore', etc a little bit like the unicode en/decoding in python.

@ctb
Copy link
Member Author

ctb commented Jan 27, 2017 via email

@betatim
Copy link
Member

betatim commented Jan 27, 2017

If methods that accept single kmers assume that you know what you are doing then you can implement any non-standard processing in python and pass in the kmers.

Does that work for you @standage ?

@standage
Copy link
Member

👍

@ctb
Copy link
Member Author

ctb commented Mar 20, 2017

Three things left before we can close this issue:

@swamidass
Copy link

Seems that the way sourmash handled it is correct.

sourmash-bio/sourmash#137

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants