Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] remove redundant ACGT-checking code. #1590

Merged
merged 55 commits into from
Jun 2, 2017
Merged

[MRG] remove redundant ACGT-checking code. #1590

merged 55 commits into from
Jun 2, 2017

Conversation

ctb
Copy link
Member

@ctb ctb commented Jan 24, 2017

This removes check_and_process_read from Hashtable, and eliminates sequence cleaning from non-bulk-loading code. All such Python-accessible code should use the cleaned_seq attribute that is available from both the ReadParser code / Read objects and the khmer.utils.broken_paired_reader code.


See #1541 (comment) for background context.

Prior to this PR,

  • consume_fasta & other bulk sequence loading functions automatically upper-cased DNA and ignored any sequence (no matter how long) that contained non-ACGT.
  • trim_on_abundance and trim_below_abundance similarly upper-cased DNA and ignored non-ACGT-containing strings.
  • find_spectral_error_positions raised an exception on non-ACGT-containing strings.

This PR updates the test code introduced in #1633 to match the new behavior.

This PR also includes #1661.


  • Is it mergeable?
  • make test Did it pass the tests?
  • make clean diff-cover If it introduces new functionality in
    scripts/ is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle Is it well
    formatted?
  • Did it change the command-line interface? Only backwards-compatible
    additions are allowed without a major version increment. Changing file
    formats also requires a major version number increment.
  • For substantial changes or changes to the command-line interface, is it
    documented in CHANGELOG.md? See keepachangelog
    for more details.
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)

@ctb ctb changed the title remove unneeded check_and_normalize_read calls remove redundant ACGT-checking code. Jan 24, 2017
@ctb
Copy link
Member Author

ctb commented Jan 24, 2017

Although reading #1491 I'm wondering if we should be using cleaned_seq or equivalent in the C++ bulk consume_fasta code, instead of read.sequence. If so, then we really need to add some tests so that the current code breaks :).

@codecov-io
Copy link

codecov-io commented Jan 24, 2017

Codecov Report

Merging #1590 into master will increase coverage by <.01%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #1590      +/-   ##
=========================================
+ Coverage    0.05%   0.05%   +<.01%     
=========================================
  Files          91      91              
  Lines       11500   11483      -17     
  Branches     3063    3056       -7     
=========================================
  Hits            6       6              
+ Misses      11494   11477      -17
Impacted Files Coverage Δ
include/oxli/hashtable.hh 0% <ø> (ø) ⬆️
src/oxli/subset.cc 0% <0%> (ø) ⬆️
src/oxli/hashtable.cc 0% <0%> (ø) ⬆️
src/oxli/labelhash.cc 0% <0%> (ø) ⬆️
src/oxli/hashgraph.cc 0% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4fc53f6...1a54b45. Read the comment docs.

@ctb ctb mentioned this pull request Jan 26, 2017
8 tasks
@ctb ctb mentioned this pull request Feb 14, 2017
8 tasks
@ctb
Copy link
Member Author

ctb commented Feb 14, 2017

Also see #1619 (comment)

@camillescott
Copy link
Member

Also see #1595 -- some more sophisticated alphabet checking was added to the parsers. Basically, you can give it one of the alphabets (as defined in alphabets.hh) by name, and it'll use that for its checks; as perhaps implied by that, I'm in favor of moving this sort of checking code into the parsers. Alternatively, the graphs could take a parameter for whether they should run any checks on what they consume (also using the alphabets).

@camillescott
Copy link
Member

And, looking at #1511 closer, moving that checking code into the parsers is already becoming SOP. Yay.

@ctb
Copy link
Member Author

ctb commented Feb 14, 2017 via email

@ctb ctb mentioned this pull request Feb 16, 2017
12 tasks
@ctb ctb changed the base branch from master to remove/is_valid_dna_tests February 19, 2017 20:18
@ctb
Copy link
Member Author

ctb commented Feb 20, 2017

@standage note that one thing the partitioning code does is fail to output reads containing Ns.

@ctb ctb changed the base branch from remove/is_valid_dna_tests to master March 20, 2017 14:04
@ctb
Copy link
Member Author

ctb commented Mar 20, 2017

I wonder if this code is any faster now that we don't iterate across every sequence 2 or 3 times?

@betatim
Copy link
Member

betatim commented Mar 22, 2017

Time to review?

@ctb
Copy link
Member Author

ctb commented Mar 22, 2017 via email

@betatim
Copy link
Member

betatim commented Mar 22, 2017

Ran ./scripts/abundance-dist-single.py -x 1e8 -b ecoli_ref-5m.fastq somehisto.hist -k 31 -s which takes about 120s on my machine both on master and this branch :-/

@ctb ctb changed the title remove redundant ACGT-checking code. [MRG] remove redundant ACGT-checking code. Jun 2, 2017
@ctb
Copy link
Member Author

ctb commented Jun 2, 2017

@betatim @luizirber @camillescott @standage this PR is now ready for merge into master and (IMO) should be given high priority for review. @camillescott's prediction of many merge conflicts was mistaken, thank goodness - only one small conflict to be resolved!

This merge significantly changes the details around ACTGN handling and will require a bump to khmer 3.0.

@ctb
Copy link
Member Author

ctb commented Jun 2, 2017

Tests pass!

Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. As I understand it, khmer now makes no attempt to clean up kmers that are passed to the sketch.add() function, and does a [^ACGT] --> A conversion for any bulk sequence loading code. Correct?

# because different hash functions do different things with
# non-ACTG characters. So all we want to do is verify that the
# functions execute w/o error on the k-mers before the "bad" DNA,
# and don't return positions in the "good" DNA.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of the phrase and don't return positions in the "good" DNA is unclear to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of the hash functions on ACTG should all be the same with respect to "good" sequences, and there should be no errors in there to trim at.

@standage
Copy link
Member

standage commented Jun 2, 2017

Also, does the partitioning code now output reads containing Ns @ctb?

@ctb
Copy link
Member Author

ctb commented Jun 2, 2017

@standage yes; re partitioning and ignoring of reads containing N, see: this removed line

@ctb ctb merged commit e768617 into master Jun 2, 2017
@ctb ctb deleted the remove/is_valid_dna branch June 2, 2017 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants