More thoughts on non-ACGT characters. #1541

ctb · 2016-11-24T15:03:04Z

Spurred by #1540, I'm converging on two principles regarding non-ACGT:

we shouldn't ever silently ignore an entire sequence record, because that leads to surprisingly inconsistent behavior between short reads and genomic sequences, as in the above where an entire genome vanishes from the k-mer counts. So we should either raise an exception OR automatically convert.
automatic conversion to As or something is probably the right default behavior, unless --check-sequence or --pedantic is specified. In part this is because even NCBI records have non-ACGT IUPAC bases in them.

There is a fun cautionary tale in this twitter exchange that is relevant, I think; we want to get the k-mer breakdown at least approximately correct :). This is another reason I think automatic conversion is appropriate - if the sequence type is completely wrong (e.g. protein, or RNA, or text) then the k-mer counts will be visibly off, but this is rare compared to the situation where you are looking at mostly legit DNA sequence and there's an S in the middle of a 5 Mbp genome. (I used to think we should automatically convert until some threshold was reached, like "1% of your sequence is bad... STOP", but that's overly clever and will no doubt bite us.

Last but not least, we should include a script (in khmer? screed?) that does a breakdown of the non-ACGT content of your data set so that people can debug more easily.

The text was updated successfully, but these errors were encountered:

betatim · 2016-11-25T13:28:50Z

How do the non-ACGT letters get into the data in the first place?

ctb · 2016-11-25T18:50:12Z

Various intentional and unintentional bits of cleverness. Some software uses [IUPAC codes](http://www.bioinformatics.org/sms2/iupac.html) to indicate observed ambiguity, for example.

betatim · 2016-11-25T19:28:06Z

So there is the case where the user is probably not doing what they think they are doing (feeding not DNA sequence into khmer) and the cleverness case. For the latter, should we be clever and sample a replacement from the possibilities (e.g. replace S with random.sample(['G', 'C']))? Or even add one kmer with each option? Or is that (trying to be clever)**2?

ctb · 2016-11-25T19:29:55Z

I think that's trying to be too clever ;).

ctb · 2016-12-19T14:51:02Z

I wonder if we can look at the beginning of the file and do some simple statistics (on, say, the first 100kb of sequence) and then barf if it looks bad?

This might be a good use for issues with bad pairing as well - scripts that are told to expect pairing info and don't get any pairs could barf.

After the first 100kb we could then assume that all the data is "good format" and go from there. Will probably catch 90%+ of problems...

standage · 2016-12-21T05:48:12Z

Another option is to simply ignore any k-mers containing a non-ACGT character. I don't know how this will affect performance of processing short reads, but it's essentially what I do now for long genomic sequences (by splitting chromosome sequences by [^ACGT] in preprocessing).

ctb · 2016-12-22T14:52:49Z

@standage, that could solve counting/graph issues, but we should maybe still warn the user if they submit a grossly incorrect file.

more generally, we have a few different things we do with input data:

load it into count tables and graphs
trim reads
discard reads

ignoring non-ACGT works for 1, and maybe 3, but not necessarily for 2.

ugh, bioinformatics :)

ctb · 2017-01-24T15:02:33Z

Spurred by this comment and previous --

DNA validity checking is done in the following places:

fix lowercase actgn input handling. #1435 and expose khmer.Read to python and use ReadParser in scripts #1491 collectively provided functionality to check ACTG when using screed/ReadParser bulk loading code. This is what all scripts should be using (the cleaned_seq attribute).
is_valid_dna is called in check_and_* functions, used in consume_fasta*, abundance_distribution, trim_below_abundance, trim_on_abundance, find_spectral_error_positions, trim_on_stoptags, and HLLCounter::consume_string. I think most of this is now redundant with ReadParser. The trim* and find* calls should be able to be removed; and the HLLCounter stuff is a problem and the source of a bug (unique-kmers.py ignores the entirety of any sequence with non-ACGTN in it #1540).
Following through on Eliminate ThreadedSequenceProcessor from the codebase. #1470 to remove ThreadedSequenceProcessor would simplify the overall discussion.

To my surprise, the CPython function Hashtable::consume does no checking...

Some comments:

One of the goals of Migrate read handling code to utils code. #1466 was to simplify a lot of this; where are we with that, @betatim?
To reiterate, it seems likely that much of the check_and* code can be removed which probably deals with @betatim's emotive complaint in Add hash functions supporting k > 32 to Counttable #1511 ;).
@standage in kevlar should do his own danged preprocessing if he doesn't want to use our bulk loading code ;).

My biggest concern is how to figure out (for ourselves), test, and explain (to others) what we're actually doing with sequence preprocessing. One thought is to provide a function purely for debugging that takes an input file, runs it through our sequence processing code, and then outputs it with those changes; then we can write tests against that behavior as well. May be pointless but may also be friendly.

If someone were to want to get started on removing the check_and* code a good start would be to just remove it from the various functions and see if the md5sum tests pass - that would show that it's not actually needed any more (which is my suspicion).

ctb · 2017-01-24T15:30:18Z

Started removing some code in #1590; it seems likely that the check_and* code can be removed from consume_fasta too.

ctb · 2017-01-24T15:35:50Z

...as indeed it can. Anyway, #1590.

standage · 2017-01-24T17:07:21Z

@standage in kevlar should do his own danged preprocessing if he doesn't want to use our bulk loading code ;)

:)

I use the bulk loading code for...loading count tables/graphs, but I iterate over screed records and k-mers to do abundance checking (which is where I ran into issues). I added some temporary basic preprocessing to kevlar last night, but if khmer's get/get_count and hash/hash_dna functions are explicitly not going to support non ACGT characters, there's no problem making these changes to kevlar permanent.

ctb · 2017-01-24T17:08:31Z

I use the bulk loading code for...loading count tables/graphs, but I iterate over screed records and k-mers to do abundance checking (which is where I ran into issues). I added some temporary basic preprocessing to kevlar last night, but if khmer's `get/get_count` and `hash/hash_dna` functions are explicitly *not* going to support non ACGT characters, there's no problem making these changes to kevlar permanent.

yep.

standage · 2017-01-24T22:20:25Z

I'm adding some tests to kevlar to verify assumptions regarding non-ACGT are met. I'm getting the following error.

>>> import khmer
>>> ct = khmer.Counttable(15, 1e6, 3)
>>> ct.consume('TTGACTTGACTTCAG')
1
>>> ct.consume('TTGACTTGACTTCAG')
1
>>> assert ct.get('TTGACTTGACTTCAG') > 0
>>>
>>> ct.consume('TTCAGNNNNNTTCAG')
0
>>> try:
...     _ = ct.get('TTCAGNNNNNTTCAG')
... except e:
...     print(str(e))
...
libc++abi.dylib: terminating with uncaught exception of type khmer::khmer_exception: Invalid base in read
Abort trap: 6

It looks like we need to make sure C++ exceptions get handed off to Python so that they can be handled at the Python level.

ctb · 2017-01-26T21:28:43Z

More thoughts: I think one reason to settle on not doing anything to invalid DNA is that we are seeing a plethora of use cases for different approaches: ignoring non-ACTG, splitting on non-ACGT, replacing non-ACGT with N. So that's good not to dictate internally. But we can provide fast defaults with bulk loading, which is good, I think.

standage · 2017-01-27T00:45:33Z

Yeah, it we want performance AND flexibility we need to implement the different use cases at the C++ level. Using a performant and reasonable default for bulk loading but providing different (maybe slower) options at the Python level is a fine compromise for now I think.

For instance, I'm mostly interested in the skipping / splitting approach for long genomic sequences, which are typically much smaller in total volume than Fastq/BAM files full of millions of reads. Sequence handling in Python is fine for the former, but I want to avoid it when possible for the latter.

betatim · 2017-01-27T16:30:40Z

Why not provide some standard/frequently used strategies as options for methods that consume a sequence or a file?

consume_fasta(fname, stragety=None) which continues to do what is currently the default, and then provide 'split', 'error', 'ignore', etc a little bit like the unicode en/decoding in python.

ctb · 2017-01-27T16:42:30Z

yes, I think we're heading in this direction.

betatim · 2017-01-27T16:45:55Z

If methods that accept single kmers assume that you know what you are doing then you can implement any non-standard processing in python and pass in the kmers.

Does that work for you @standage ?

standage · 2017-01-27T18:51:43Z

👍

ctb · 2017-03-20T14:13:25Z

Three things left before we can close this issue:

[MRG] remove redundant ACGT-checking code. #1590 should be merged;
behavior laid out in [MRG] remove redundant ACGT-checking code. #1590 should be documented;
write up this comment as an issue

swamidass · 2020-11-02T04:29:33Z

Seems that the way sourmash handled it is correct.

sourmash-bio/sourmash#137

standage mentioned this issue Jan 24, 2017

Add hash functions supporting k > 32 to Counttable #1511

Merged

9 tasks

ctb mentioned this issue Jan 24, 2017

[MRG] remove redundant ACGT-checking code. #1590

Merged

8 tasks

standage mentioned this issue Jan 25, 2017

Some additions to refactor/hashing2 #1593

Merged

standage mentioned this issue Jan 27, 2017

unique-kmers.py ignores the entirety of any sequence with non-ACGTN in it #1540

Open

ctb mentioned this issue Feb 14, 2017

HLL refactor: C++11 simplifications #1316

Merged

8 tasks

ctb mentioned this issue Feb 26, 2017

Ambiguous Nucleotide handling question? sourmash-bio/sourmash#137

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More thoughts on non-ACGT characters. #1541

More thoughts on non-ACGT characters. #1541

ctb commented Nov 24, 2016

betatim commented Nov 25, 2016

ctb commented Nov 25, 2016 via email •

edited

Loading

betatim commented Nov 25, 2016

ctb commented Nov 25, 2016 via email

ctb commented Dec 19, 2016

standage commented Dec 21, 2016

ctb commented Dec 22, 2016

ctb commented Jan 24, 2017

ctb commented Jan 24, 2017

ctb commented Jan 24, 2017

standage commented Jan 24, 2017

ctb commented Jan 24, 2017 via email

standage commented Jan 24, 2017

ctb commented Jan 26, 2017

standage commented Jan 27, 2017

betatim commented Jan 27, 2017

ctb commented Jan 27, 2017 via email

betatim commented Jan 27, 2017 •

edited by ctb

Loading

standage commented Jan 27, 2017

ctb commented Mar 20, 2017

swamidass commented Nov 2, 2020

More thoughts on non-ACGT characters. #1541

More thoughts on non-ACGT characters. #1541

Comments

ctb commented Nov 24, 2016

betatim commented Nov 25, 2016

ctb commented Nov 25, 2016 via email • edited Loading

betatim commented Nov 25, 2016

ctb commented Nov 25, 2016 via email

ctb commented Dec 19, 2016

standage commented Dec 21, 2016

ctb commented Dec 22, 2016

ctb commented Jan 24, 2017

ctb commented Jan 24, 2017

ctb commented Jan 24, 2017

standage commented Jan 24, 2017

ctb commented Jan 24, 2017 via email

standage commented Jan 24, 2017

ctb commented Jan 26, 2017

standage commented Jan 27, 2017

betatim commented Jan 27, 2017

ctb commented Jan 27, 2017 via email

betatim commented Jan 27, 2017 • edited by ctb Loading

standage commented Jan 27, 2017

ctb commented Mar 20, 2017

swamidass commented Nov 2, 2020

ctb commented Nov 25, 2016 via email •

edited

Loading

betatim commented Jan 27, 2017 •

edited by ctb

Loading