Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some bulk sequence loading tests that nail down current ACGTN behavior. #1633

Merged
merged 36 commits into from
Mar 20, 2017

Conversation

ctb
Copy link
Member

@ctb ctb commented Feb 19, 2017

These tests nail down behavior prior to #1590, which, when merged, will alter how we handle non-ACTG characters. Note, no behavior is changed in this PR; it's just (lots of) new tests.

This explicitly puts in place tests for sequences that contain one of lowercase, Ns, and non-ACGTN characters, for:

  • consume_fasta and all other bulk-sequence loading functions on Hashtables and derived classes;
  • trim_on_abundance, trim_below_abundance and find_spectral_error_positions

This PR includes #1661.

Adds new test file tests/test_sequence_validation.py and data file tests/test-data/valid-read-testing.fq.


  • Is it mergeable?
  • make test Did it pass the tests?
  • make clean diff-cover If it introduces new functionality in
    scripts/ is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle Is it well
    formatted?
  • Did it change the command-line interface? Only backwards-compatible
    additions are allowed without a major version increment. Changing file
    formats also requires a major version number increment.
  • For substantial changes or changes to the command-line interface, is it
    documented in CHANGELOG.md? See keepachangelog
    for more details.
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)

@codecov-io
Copy link

codecov-io commented Feb 19, 2017

Codecov Report

Merging #1633 into master will increase coverage by 0.1%.
The diff coverage is 25%.

@@            Coverage Diff            @@
##           master    #1633     +/-   ##
=========================================
+ Coverage   69.82%   69.93%   +0.1%     
=========================================
  Files          66       66             
  Lines        8974     8976      +2     
  Branches     3060     3062      +2     
=========================================
+ Hits         6266     6277     +11     
+ Misses       1025     1018      -7     
+ Partials     1683     1681      -2
Impacted Files Coverage Δ
lib/read_parsers.hh 71.42% <25%> (+5.8%) ⬆️
khmer/_khmer.cc 57.48% <0%> (+0.05%) ⬆️
khmer/_cpy_hashgraph.hh 54.15% <0%> (+0.07%) ⬆️
lib/hashgraph.cc 46.96% <0%> (+0.34%) ⬆️
lib/hashtable.cc 57.54% <0%> (+2.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f00640...3cb741b. Read the comment docs.

@ctb
Copy link
Member Author

ctb commented Feb 19, 2017

Wow, these tests are like a showcase of horrible inconsistency in loading sequences.

kmer = "caggcgcccaccacc".upper()
assert x.get(kmer) == 1

# the 2nd read with this k-mer in it has an N in it; 'consume' will ignore.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rephrase the comment. At the moment I am very puzzled after reading it (ah its being ignored so count should be one, but it is asserted to be two ...??)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "consume will ignore the invalid base and continue consuming the read, so this kmer after the N should have abundance 2"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in fe48787

@betatim
Copy link
Member

betatim commented Feb 20, 2017

I'd move these tests to a new file maybe test_sequence_loading.py? Then we have one file that deals exclusively with verifying the assumptions we make about who cleans what, when and how.

Because it is easy to do I would parametrise the tests (that make sense) on the class so we test all combinations of Count/Node and table/graph. In which case we should get the parametrisation stuff from test_tabletype.py (or put these tests there, I'm -0)


x.output_partitions(infile, savepath)

read_names = [ read.name for read in ReadParser(savepath) ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pep8 🚓 doesn't like the extra white space (hence the travis failure)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed 756454e

@betatim
Copy link
Member

betatim commented Feb 20, 2017

+1 on merging this before 2.1, then #1590 can update the behaviour and fix these tests (remove the "in the future ..." comments)

@betatim betatim added this to the 2.1 milestone Feb 20, 2017
@ctb
Copy link
Member Author

ctb commented Feb 25, 2017

Tests moved to test_sequence_validation.py and parameterized as per @betatim suggestion! Of course now a bunch of tests are failing... :)

@betatim
Copy link
Member

betatim commented Feb 28, 2017

Linux build fails because of a pep8 violation. The OSX build fails because some tests fail and then we exit ungracefully because there are too many open files. Will take a look at the latter.

@betatim
Copy link
Member

betatim commented Feb 28, 2017

Ha, locally on OSX all tests pass as well and no "too many open files" warnings.

With the fixture we can explicitly close all the ReadParser which
might help with the too many open files error on OSX Travis
@betatim
Copy link
Member

betatim commented Feb 28, 2017

Switched to using a fixture for ReadParser which allows us to explicitly close it after using it. Not sure this fixes the problem, but it is my best guess as I can't reproduce it locally.

@betatim
Copy link
Member

betatim commented Feb 28, 2017

🎉

@ctb
Copy link
Member Author

ctb commented Mar 19, 2017

This is now ready for review & merge (although #1661 should be merged first :)

@luizirber @betatim @standage

@ctb
Copy link
Member Author

ctb commented Mar 19, 2017

(Upon merge, we should switch #1590 over to be against master.)

savepath = utils.get_temp_filename('foo')

# read this in using "approved good" behavior w/cleaned_seq
x = _Nodegraph(8, PRIMES_1m)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> graphtype?

Copy link
Member

@betatim betatim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, except for that one comment. Nitpick: if you feel like switching graphtype to Graphtype so that types start with an upper case letter I would ❤️ that.

return request.param


@pytest.fixture
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I've looked this up: can we change this back to yield and change the decorator to @pytest.yield_fixture? Link to the pytest 2.9 docs: http://doc.pytest.org/en/2.9.2/fixture.html#fixture-finalization-executing-teardown-code

@ctb
Copy link
Member Author

ctb commented Mar 20, 2017 via email

@ctb
Copy link
Member Author

ctb commented Mar 20, 2017 via email

@ctb ctb merged commit 6d6ef9f into master Mar 20, 2017
@betatim
Copy link
Member

betatim commented Mar 20, 2017

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants