fix lowercase actgn input handling. #1435
Current coverage is 77.18% (diff: 100%)
Specifically, uppercase sequences before adding them to the graph (without modifying the output sequences). Prior to this, k-mers containing lowercase characters were simply ignored. This matches the new behavior of trim-low-abund with ReadBundle usage. Also, update md5 hashes in tests/test_script_output.py to reflect changes in output due to changes in graph structure from adding the new k-mers.
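To make the behavior change concrete, here is a minimal sketch that uses a toy dictionary-based k-mer counter rather than khmer's actual graph. The `count_kmers` helper, and its rule of skipping any k-mer with characters outside ACGT, are assumptions made for illustration; they stand in for the "ignored" behavior described above.

```python
from collections import Counter

VALID = set("ACGT")


def count_kmers(seq, k):
    """Count k-mers, skipping any that contain characters outside ACGT.

    Hypothetical helper for illustration only; not part of khmer.
    """
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID:
            counts[kmer] += 1
    return counts


read = "ACGTacgtACGT"

# Old behavior (as modeled here): k-mers touching lowercase bases are dropped.
before = count_kmers(read, 4)

# New behavior: uppercase a copy for counting; the read itself is untouched,
# so whatever is written to output keeps its original case.
after = count_kmers(read.upper(), 4)
```

In this toy model only the two all-uppercase k-mers are counted from the raw read, while all nine k-mers are counted once the sequence is uppercased.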
ready for skeptical review - I think this is for @standage :)
See comments. I think we need to be more explicit and clean about what code is doing what, and make sure the documentation reflects this accurately.
# if any in batch have coverage below desired coverage, consume & yield
if not batch.coverages_at_least(self.countgraph, desired_coverage):
    for record in batch.reads:
        self.countgraph.consume(record.cleaned_seq)
Cleaner and more concise. I like it!
:) yes, a little complicated on the side of double and triple negatives when you dig into it, but nice and concise now that it's done!
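For readers untangling the negatives, the batch check can be sketched roughly as follows. This is a toy model, not khmer's implementation: `ToyCountgraph`, the standalone `coverages_at_least` function, and `normalize` are hypothetical stand-ins, and a batch here is just a list of cleaned sequences.

```python
from statistics import median


class ToyCountgraph:
    """Hypothetical stand-in for a count graph; not khmer's API."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def _kmers(self, seq):
        return [seq[i:i + self.k] for i in range(len(seq) - self.k + 1)]

    def consume(self, seq):
        for kmer in self._kmers(seq):
            self.counts[kmer] = self.counts.get(kmer, 0) + 1

    def median_at_least(self, seq, cutoff):
        kmers = self._kmers(seq)
        if not kmers:
            return False
        return median(self.counts.get(km, 0) for km in kmers) >= cutoff


def coverages_at_least(graph, cleaned_seqs, cutoff):
    """True only if every sequence in the batch already has enough coverage."""
    return all(graph.median_at_least(s, cutoff) for s in cleaned_seqs)


def normalize(batches, graph, desired_coverage):
    """Consume and yield each batch in which any read is below coverage."""
    for batch in batches:
        if not coverages_at_least(graph, batch, desired_coverage):
            for seq in batch:
                graph.consume(seq)
            yield batch


graph = ToyCountgraph(k=4)
kept = list(normalize([["ACGTAC"]] * 5, graph, desired_coverage=2))
```

With five identical single-read batches and a desired coverage of 2, only the first two are kept; after that the median k-mer count has reached the cutoff and later copies are skipped.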
class ReadBundle(object):
    def __init__(self, *raw_records):
        self.reads = [i for i in raw_records if i]
        self.cleaned_seqs = [r.sequence.replace('N', 'A') for r in self.reads]
I guess I'm fine with moving the read cleaning code to a function rather than a class. (Having the cleaned seq as part of the original record is better organization anyway, IMO.) But then that leaves the question of what the ReadBundle class is really for. Just aggregation? If so, we need to update the docs from my last PR to make sure we're clear about what code is doing what.
Agreed!
Yes, the ReadBundle class is about aggregation (pairs/singletons of reads). I'll go update the docs.
s/clean/clear/. Won't let me edit review comment. :-/
Ready for review again @standage. (I wasn't 100% sure of the details of the new github review approach, so ... all you wanted me to do was update the documentation, right?)
Yep al teh wrods is gud,
This is a fix for #1434.
Adds a 'cleaned_seq' attribute to screed records that should be used whenever k-mer operations are performed, and propagates this attribute through the codebase with associated refactorings of ReadBundle, trim-low-abund, and normalize-by-median.
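As a rough sketch of the idea (not the PR's actual code), cleaning can be done in a generator that attaches `cleaned_seq` to each record while leaving `sequence` alone. The `Record` class and the `clean_reads` name below are hypothetical; the uppercase step and the N-to-A replacement follow the behavior described in this PR and the diff above.

```python
class Record:
    """Hypothetical minimal record; screed's real record type is richer."""

    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence


def clean_reads(records):
    """Attach a k-mer-safe 'cleaned_seq' to each record and yield it.

    The original 'sequence' is left as-is so output keeps its case and Ns.
    """
    for record in records:
        record.cleaned_seq = record.sequence.upper().replace("N", "A")
        yield record


records = list(clean_reads([Record("r1", "acgtNnAC")]))
```

K-mer operations would then read `record.cleaned_seq`, while anything written back out would use `record.sequence`.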
The key bugfix commit is 7b40857, which changes normalize-by-median to uppercase sequences before adding them to the graph. Prior to this, all k-mers containing lowercase characters were simply ignored. Note that this changes the output, so the md5 hashes in test_script_output.py have been updated. Since this is a bug fix and not a new feature, I think semantic versioning will permit a 2.x release. Also note that the output of trim-low-abund.py matches the output of normalize-by-median.py despite quite different implementations, suggesting that both are correct :).
make test: Did it pass the tests?
make clean diff-cover: If it introduces new functionality in scripts/, is it tested?
make format diff_pylint_report cppcheck doc pydocstyle: Is it well formatted?
without a major version increment. Changing file formats also requires a major version number increment.
ChangeLog? http://en.wikipedia.org/wiki/Changelog#Format
changes were made?
tested for streaming IO?)