-
Notifications
You must be signed in to change notification settings - Fork 295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More thoughts on non-ACGT characters. #1541
Comments
How do the non-ACGT letters get into the data in the first place? |
Various intentional and unintentional bits of cleverness. Some software
uses [IUPAC codes](http://www.bioinformatics.org/sms2/iupac.html) to indicate
observed ambiguity, for example.
|
So there is the case where the user is probably not doing what they think they are doing (feeding not DNA sequence into khmer) and the cleverness case. For the latter, should we be clever and sample a replacement from the possibilities (e.g. replace S with |
I think that's trying to be too clever ;).
|
I wonder if we can look at the beginning of the file and do some simple statistics (on, say, the first 100kb of sequence) and then barf if it looks bad? This might be a good use for issues with bad pairing as well - scripts that are told to expect pairing info and don't get any pairs could barf. After the first 100kb we could then assume that all the data is "good format" and go from there. Will probably catch 90%+ of problems... |
Another option is to simply ignore any k-mers containing a non-ACGT character. I don't know how this will affect performance of processing short reads, but it's essentially what I do now for long genomic sequences (by splitting chromosome sequences by [^ACGT] in preprocessing). |
@standage, that could solve counting/graph issues, but we should maybe still warn the user if they submit a grossly incorrect file. more generally, we have a few different things we do with input data:
ignoring non-ACGT works for 1, and maybe 3, but not necessarily for 2. ugh, bioinformatics :) |
Spurred by this comment and previous -- DNA validity checking is done in the following places:
To my surprise, the CPython function Some comments:
My biggest concern is how to figure out (for ourselves), test, and explain (to others) what we're actually doing with sequence preprocessing. One thought is to provide a function purely for debugging that takes an input file, runs it through our sequence processing code, and then outputs it with those changes; then we can write tests against that behavior as well. May be pointless but may also be friendly. If someone were to want to get started on removing the |
Started removing some code in #1590; it seems likely that the |
...as indeed it can. Anyway, #1590. |
:) I use the bulk loading code for...loading count tables/graphs, but I iterate over screed records and k-mers to do abundance checking (which is where I ran into issues). I added some temporary basic preprocessing to kevlar last night, but if khmer's |
I use the bulk loading code for...loading count tables/graphs, but I iterate over screed records and k-mers to do abundance checking (which is where I ran into issues). I added some temporary basic preprocessing to kevlar last night, but if khmer's `get/get_count` and `hash/hash_dna` functions are explicitly *not* going to support non ACGT characters, there's no problem making these changes to kevlar permanent.
yep.
|
I'm adding some tests to kevlar to verify assumptions regarding non-ACGT are met. I'm getting the following error. >>> import khmer
>>> ct = khmer.Counttable(15, 1e6, 3)
>>> ct.consume('TTGACTTGACTTCAG')
1
>>> ct.consume('TTGACTTGACTTCAG')
1
>>> assert ct.get('TTGACTTGACTTCAG') > 0
>>>
>>> ct.consume('TTCAGNNNNNTTCAG')
0
>>> try:
... _ = ct.get('TTCAGNNNNNTTCAG')
... except e:
... print(str(e))
...
libc++abi.dylib: terminating with uncaught exception of type khmer::khmer_exception: Invalid base in read
Abort trap: 6 It looks like we need to make sure C++ exceptions get handed off to Python so that they can be handled at the Python level. |
More thoughts: I think one reason to settle on not doing anything to invalid DNA is that we are seeing a plethora of use cases for different approaches: ignoring non-ACTG, splitting on non-ACGT, replacing non-ACGT with N. So that's good not to dictate internally. But we can provide fast defaults with bulk loading, which is good, I think. |
Yeah, it we want performance AND flexibility we need to implement the different use cases at the C++ level. Using a performant and reasonable default for bulk loading but providing different (maybe slower) options at the Python level is a fine compromise for now I think. For instance, I'm mostly interested in the skipping / splitting approach for long genomic sequences, which are typically much smaller in total volume than Fastq/BAM files full of millions of reads. Sequence handling in Python is fine for the former, but I want to avoid it when possible for the latter. |
Why not provide some standard/frequently used strategies as options for methods that consume a sequence or a file?
|
yes, I think we're heading in this direction.
|
If methods that accept single kmers assume that you know what you are doing then you can implement any non-standard processing in python and pass in the kmers. Does that work for you @standage ? |
👍 |
Three things left before we can close this issue:
|
Seems that the way sourmash handled it is correct. |
Spurred by #1540, I'm converging on two principles regarding non-ACGT:
--check-sequence
or--pedantic
is specified. In part this is because even NCBI records have non-ACGT IUPAC bases in them.There is a fun cautionary tale in this twitter exchange that is relevant, I think; we want to get the k-mer breakdown at least approximately correct :). This is another reason I think automatic conversion is appropriate - if the sequence type is completely wrong (e.g. protein, or RNA, or text) then the k-mer counts will be visibly off, but this is rare compared to the situation where you are looking at mostly legit DNA sequence and there's an S in the middle of a 5 Mbp genome. (I used to think we should automatically convert until some threshold was reached, like "1% of your sequence is bad... STOP", but that's overly clever and will no doubt bite us.
Last but not least, we should include a script (in khmer? screed?) that does a breakdown of the non-ACGT content of your data set so that people can debug more easily.
The text was updated successfully, but these errors were encountered: