[MRG] remove redundant ACGT-checking code. #1590
Although reading #1491 I'm wondering if we should be using |
Codecov Report
@@ Coverage Diff @@
## master #1590 +/- ##
=========================================
+ Coverage 0.05% 0.05% +<.01%
=========================================
Files 91 91
Lines 11500 11483 -17
Branches 3063 3056 -7
=========================================
Hits 6 6
+ Misses 11494 11477 -17
Also see #1619 (comment) |
Also see #1595 -- some more sophisticated alphabet checking was added to the parsers. Basically, you can give it one of the alphabets (as defined in |
And, looking at #1511 closer, moving that checking code into the parsers is already becoming SOP. Yay. |
yeppers.
|
@standage note that one thing the partitioning code does is fail to output reads containing Ns. |
I wonder if this code is any faster now that we don't iterate across every sequence 2 or 3 times? |
Time to review? |
On Wed, Mar 22, 2017 at 07:47:42AM -0700, Tim Head wrote:
Time to review?
yes! but this will not be merged until after release.
|
Ran |
@betatim @luizirber @camillescott @standage this PR is now ready for merge into master and (IMO) should be given high priority for review. @camillescott's prediction of many merge conflicts was mistaken, thank goodness - only one small conflict to be resolved! This merge significantly changes the details around ACTGN handling and will require a bump to khmer 3.0. |
Tests pass! |
Looks good to me. As I understand it, khmer now makes no attempt to clean up k-mers that are passed to the sketch.add() function, and does a [^ACGT] --> A conversion for any bulk sequence loading code. Correct?
# because different hash functions do different things with
# non-ACTG characters. So all we want to do is verify that the
# functions execute w/o error on the k-mers before the "bad" DNA,
# and don't return positions in the "good" DNA.
The meaning of the phrase "and don't return positions in the 'good' DNA" is unclear to me.
The behavior of the hash functions on ACTG should all be the same with respect to "good" sequences, and there should be no errors in there to trim at.
Also, does the partitioning code now output reads containing Ns @ctb? |
@standage yes; re partitioning and ignoring of reads containing N, see: this removed line |
This removes check_and_process_read from Hashtable, and eliminates sequence cleaning from non-bulk-loading code. All such Python-accessible code should use the cleaned_seq attribute that is available from both the ReadParser code / Read objects and the khmer.utils.broken_paired_reader code.

See #1541 (comment) for background context.
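As a rough illustration of the caller-side pattern described above, the sketch below uses a hypothetical stand-in `Read` class (it is not khmer's `Read` object) whose `cleaned_seq` property applies an assumed cleaning rule: upper-case the sequence, then map every non-ACGT character to `A`. The `kmers` helper is also hypothetical and does no cleaning of its own, so the caller decides which attribute to pass in.

```python
import re
from dataclasses import dataclass

# Assumed cleaning rule, based on the review discussion: [^ACGT] --> A.
_NON_ACGT = re.compile(r"[^ACGT]")


@dataclass
class Read:
    """Hypothetical stand-in for a parsed read; not khmer's Read class."""
    name: str
    sequence: str

    @property
    def cleaned_seq(self) -> str:
        # Upper-case first, then replace anything outside ACGT with 'A'.
        return _NON_ACGT.sub("A", self.sequence.upper())


def kmers(seq: str, k: int):
    """Yield every k-mer of seq without validating or cleaning it."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


read = Read("r1", "acgtNacgt")
raw = list(kmers(read.sequence, 4))        # may contain lower-case and N
clean = list(kmers(read.cleaned_seq, 4))   # only upper-case ACGT
```

Under these assumptions, non-bulk code that wants valid DNA would pass `read.cleaned_seq` rather than `read.sequence` to whatever consumes the k-mers.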
Prior to this PR,
consume_fasta & other bulk sequence loading functions automatically upper-cased DNA and ignored any sequence (no matter how long) that contained non-ACGT.
trim_on_abundance and trim_below_abundance similarly upper-cased DNA and ignored non-ACGT-containing strings.
find_spectral_error_positions raised an exception on non-ACGT-containing strings.

This PR updates the test code introduced in #1633 to match the new behavior.
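To make the behavior change for bulk loading concrete, here is a minimal sketch contrasting the two policies. Both functions are hypothetical illustrations, not khmer code: `consume_old` follows the pre-PR description (upper-case, then skip any read containing non-ACGT), while `consume_new` assumes the replacement rule summarized in review (non-ACGT becomes `A`).

```python
from collections import Counter

VALID = set("ACGT")


def _count_kmers(seq, k, counts):
    # Add every k-mer of seq to the running tally.
    counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))


def consume_old(seqs, k):
    """Pre-PR policy as described: skip whole reads that contain non-ACGT."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        if set(seq) - VALID:
            continue  # the entire read is ignored, however long it is
        _count_kmers(seq, k, counts)
    return counts


def consume_new(seqs, k):
    """Assumed post-PR policy: convert non-ACGT to 'A' and keep the read."""
    counts = Counter()
    for seq in seqs:
        cleaned = "".join(c if c in VALID else "A" for c in seq.upper())
        _count_kmers(cleaned, k, counts)
    return counts


reads = ["ACGTAC", "acgtnacgt"]
old = consume_old(reads, 4)
new = consume_new(reads, 4)
```

With these two reads, the old policy counts k-mers from the first read only, while the assumed new policy also counts the k-mers of the second read after its `n` is converted.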
This PR also includes #1661.
make test: Did it pass the tests?
make clean diff-cover: If it introduces new functionality in scripts/ is it tested?
make format diff_pylint_report cppcheck doc pydocstyle: Is it well formatted?
additions are allowed without a major version increment. Changing file
formats also requires a major version number increment.
documented in CHANGELOG.md? See keepachangelog for more details.
changes were made?
tested for streaming IO?)