Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix lowercase actgn input handling. #1435

Merged
merged 28 commits into from
Oct 3, 2016
Merged

fix lowercase actgn input handling. #1435

merged 28 commits into from
Oct 3, 2016

Conversation

ctb
Copy link
Member

@ctb ctb commented Sep 3, 2016

This is a fix for #1434.

Adds a 'cleaned_seq' attribute to screed records that should be used whenever k-mer operations are performed, and propagates this attribute through the codebase with associated refactorings of ReadBundle, trim-low-abund, and normalize-by-median.

The key bugfix commit is 7b40857, which makes the key change to normalize-by-median to uppercase sequences before adding them to the graph. Prior to this, all k-mers containing lower-case characters would simply be ignored. Note that this changes output formats, so the md5 hashes in test_script_output.py have been updated. Since this is a bug and not a new feature, I think semantic versioning will permit a 2.x release. Also note that the output of trim-low-abund.py matches the output of normalize-by-median.py despite quite different implementations, suggesting that both are correct :).

  • Is it mergeable?
  • make test Did it pass the tests?
  • make clean diff-cover If it introduces new functionality in
    scripts/ is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle Is it well
    formatted?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
    http://en.wikipedia.org/wiki/Changelog#Format
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)
  • Is the Copyright year up to date?

@ctb ctb changed the title force verbose_loader to uppercase DNA uppercase DNA sequence coming in from screed Sep 3, 2016
@codecov-io
Copy link

codecov-io commented Sep 3, 2016

Current coverage is 77.18% (diff: 100%)

No coverage report found for master at 3d3a2ca.

Powered by Codecov. Last update 3d3a2ca...5fe8f65

@ctb ctb changed the title uppercase DNA sequence coming in from screed make the DNA sequence coming in from screed -> uppercase Sep 4, 2016
@ctb ctb changed the base branch from master to update/filter_abund October 1, 2016 18:26
@ctb ctb changed the base branch from update/filter_abund to master October 1, 2016 20:35
ctb added 6 commits October 2, 2016 08:53
Specifically, uppercase sequences before adding them to the graph
(without modifying the output sequences). Prior to this, k-mers containing
lowercase characters were simply ignored.

This matches the new behavior of trim-low-abund with ReadBundle usage.

Also, update md5 hashes in tests/test_script_output.py to reflect changes
in output due to changes in graph structure from adding the new k-mers.
@ctb ctb changed the title make the DNA sequence coming in from screed -> uppercase fix lowercase actgn input handling. Oct 2, 2016
@ctb
Copy link
Member Author

ctb commented Oct 2, 2016

  • TODO: add explicit tests of appropriate behavior into broken_paired_reader, other functions.

@ctb
Copy link
Member Author

ctb commented Oct 2, 2016

ready for skeptical review - I think this is for @standage :)

Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments. I think we need to be more explicit and clean about what code is doing what, and make sure the documentation reflects this accurately.

# if any in batch have coverage below desired coverage, consume &yield
if not batch.coverages_at_least(self.countgraph, desired_coverage):
for record in batch.reads:
self.countgraph.consume(record.cleaned_seq)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaner and more concise. I like it!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) yes, a little complicated on the side of double and triple negatives when you dig into it, but nice and concise now that it's done!

class ReadBundle(object):
def __init__(self, *raw_records):
self.reads = [i for i in raw_records if i]
self.cleaned_seqs = [r.sequence.replace('N', 'A') for r in self.reads]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm fine with moving the read cleaning code to a function rather than a class. (Having the cleaned seq as part of the original record is better organization anyway, IMO.) But then that leaves the question of what the ReadBundle class is really for. Just aggregation? If so, we need to update the docs from my last PR to make sure we're clear about what code is doing what.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

Yes, the ReadBundle class is about aggregation (pairs/singletons of reads). I'll go update the docs.

@standage
Copy link
Member

standage commented Oct 3, 2016

s/clean/clear/. Won't let me edit review comment. :-/

@ctb
Copy link
Member Author

ctb commented Oct 3, 2016

Ready for review again @standage. (I wasn't 100% sure of the details of the new github review approach, so ... all you wanted me to do was update the documentation, right?)

@standage
Copy link
Member

standage commented Oct 3, 2016

Yep al teh wrods is gud,

@standage standage merged commit 6e4ffc5 into master Oct 3, 2016
@standage standage deleted the fix/cleandna branch October 3, 2016 20:46
@ctb
Copy link
Member Author

ctb commented Oct 4, 2016

Note, this refactored code from #1458; just want to link in :). Also bears on #1262 and #1036.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants