fix lowercase actgn input handling. #1435

ctb · 2016-09-03T16:27:52Z

This is a fix for #1434.

Adds a 'cleaned_seq' attribute to screed records that should be used whenever k-mer operations are performed, and propagates this attribute through the codebase with associated refactorings of ReadBundle, trim-low-abund, and normalize-by-median.

The key bugfix is commit 7b40857, which changes normalize-by-median to uppercase sequences before adding them to the graph. Prior to this, all k-mers containing lowercase characters were simply ignored. Note that this changes the output, so the md5 hashes in test_script_output.py have been updated. Since this is a bug fix and not a new feature, I think semantic versioning permits a 2.x release. Also note that the output of trim-low-abund.py matches the output of normalize-by-median.py despite quite different implementations, suggesting that both are correct :).
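To make the bug concrete, here is a minimal sketch of the idea behind a 'cleaned_seq' attribute. It is not the khmer or screed implementation: `ToyRecord`, `add_cleaned_seq`, and `count_valid_kmers` are hypothetical names, and the N-to-A replacement is assumed from the `replace('N', 'A')` line that this PR removes from ReadBundle. The toy counter only mimics the "ignore k-mers that are not pure uppercase ACGT" behavior described above.

```python
from dataclasses import dataclass

VALID_BASES = set("ACGT")


@dataclass
class ToyRecord:
    """Hypothetical stand-in for a screed record."""
    name: str
    sequence: str          # original sequence, written to output unchanged
    cleaned_seq: str = ""  # uppercase copy, used only for k-mer operations


def add_cleaned_seq(record):
    """Attach an uppercased, N-free copy of the sequence to the record."""
    record.cleaned_seq = record.sequence.upper().replace("N", "A")
    return record


def count_valid_kmers(seq, k):
    """Count k-mers made up only of uppercase A, C, G, T."""
    return sum(
        1
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= VALID_BASES
    )


rec = add_cleaned_seq(ToyRecord("r1", "acgtNacgt"))
print(count_valid_kmers(rec.sequence, 4))     # lowercase input: every k-mer skipped
print(count_valid_kmers(rec.cleaned_seq, 4))  # cleaned input: every k-mer counted
```

Keeping the original `sequence` untouched is what lets the scripts write reads back out exactly as they came in, while the graph only ever sees the cleaned copy.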

- Is it mergeable?
- make test: Did it pass the tests?
- make clean diff-cover: If it introduces new functionality in scripts/, is it tested?
- make format diff_pylint_report cppcheck doc pydocstyle: Is it well formatted?
- Did it change the command-line interface? Only additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
- Is it documented in the ChangeLog? http://en.wikipedia.org/wiki/Changelog#Format
- Was a spellchecker run on the source code and documentation after changes were made?
- Do the changes respect streaming IO? (Are they tested for streaming IO?)
- Is the Copyright year up to date?


codecov-io · 2016-09-03T22:42:38Z

Current coverage is 77.18% (diff: 100%)

No coverage report found for master at 3d3a2ca.

Powered by Codecov. Last update 3d3a2ca...5fe8f65


Specifically, uppercase sequences before adding them to the graph (without modifying the output sequences). Prior to this, k-mers containing lowercase characters were simply ignored. This matches the new behavior of trim-low-abund with ReadBundle usage. Also, update md5 hashes in tests/test_script_output.py to reflect changes in output due to changes in graph structure from adding the new k-mers.

ctb · 2016-10-02T16:51:16Z

TODO: add explicit tests of appropriate behavior into broken_paired_reader, other functions.

ctb · 2016-10-02T17:11:31Z

ready for skeptical review - I think this is for @standage :)

standage

See comments. I think we need to be more explicit and clean about what code is doing what, and make sure the documentation reflects this accurately.

standage · 2016-10-03T18:06:57Z

scripts/normalize-by-median.py

+        # if any in batch have coverage below desired coverage, consume &yield
+        if not batch.coverages_at_least(self.countgraph, desired_coverage):
+            for record in batch.reads:
+                self.countgraph.consume(record.cleaned_seq)


Cleaner and more concise. I like it!

:) yes, a little complicated on the side of double and triple negatives when you dig into it, but nice and concise now that it's done!
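For readers untangling the negatives, the control flow in the diff above can be sketched roughly as follows. This is a toy model under stated assumptions, not khmer's code: `ToyCountGraph`, its `median_count` method, and the free functions `coverages_at_least` and `normalize` are hypothetical, and the sketch assumes that "coverage" means the median k-mer count of a read's `cleaned_seq`.

```python
from collections import Counter
from statistics import median
from types import SimpleNamespace


class ToyCountGraph:
    """Hypothetical exact k-mer counter standing in for a count graph."""

    def __init__(self, k):
        self.k = k
        self.counts = Counter()

    def _kmers(self, seq):
        return [seq[i:i + self.k] for i in range(len(seq) - self.k + 1)]

    def consume(self, seq):
        """Add every k-mer of seq to the counts."""
        self.counts.update(self._kmers(seq))

    def median_count(self, seq):
        """Median count over the k-mers of seq (0 if seq is shorter than k)."""
        kmers = self._kmers(seq)
        return median(self.counts[km] for km in kmers) if kmers else 0


def coverages_at_least(reads, graph, cutoff):
    """True only if every read in the batch already has coverage >= cutoff."""
    return all(graph.median_count(r.cleaned_seq) >= cutoff for r in reads)


def normalize(batches, graph, cutoff):
    """Keep a batch if any of its reads is still below the cutoff."""
    for reads in batches:
        if not coverages_at_least(reads, graph, cutoff):
            for r in reads:
                graph.consume(r.cleaned_seq)  # the graph sees the cleaned copy
            yield from reads                  # the caller gets the original records


read = SimpleNamespace(sequence="acgtacgt", cleaned_seq="ACGTACGT")
graph = ToyCountGraph(k=4)
kept = list(normalize([[read], [read], [read]], graph, cutoff=2))
print(len(kept), kept[0].sequence)
```

In this sketch the third copy of the read is dropped because its median k-mer count has already reached the cutoff, and the kept records still carry their original lowercase sequence.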

standage · 2016-10-03T18:19:34Z

khmer/utils.py

 class ReadBundle(object):
    def __init__(self, *raw_records):
        self.reads = [i for i in raw_records if i]
-        self.cleaned_seqs = [r.sequence.replace('N', 'A') for r in self.reads]


I guess I'm fine with moving the read cleaning code to a function rather than a class. (Having the cleaned seq as part of the original record is better organization anyway, IMO.) But then that leaves the question of what the ReadBundle class is really for. Just aggregation? If so, we need to update the docs from my last PR to make sure we're clear about what code is doing what.

Agreed!

Yes, the ReadBundle class is about aggregation (pairs/singletons of reads). I'll go update the docs.
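As a rough illustration of the "aggregation only" role described here, a bundle could look something like the sketch below. `ToyReadBundle` and its `num_reads` property are hypothetical and only mirror the `__init__` shown in the diff; the cleaned sequence is read from each record rather than computed by the bundle.

```python
from types import SimpleNamespace


class ToyReadBundle:
    """Hypothetical aggregate of one read (singleton) or two reads (pair)."""

    def __init__(self, *raw_records):
        # Drop missing mates (e.g. None) so singletons and pairs look alike.
        self.reads = [r for r in raw_records if r]

    @property
    def num_reads(self):
        return len(self.reads)


r1 = SimpleNamespace(name="r1/1", sequence="acgn", cleaned_seq="ACGA")
bundle = ToyReadBundle(r1, None)  # a singleton: the second mate is missing
print(bundle.num_reads)
print([r.cleaned_seq for r in bundle.reads])  # cleaning lives on the records
```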

standage · 2016-10-03T18:21:39Z

s/clean/clear/. Won't let me edit review comment. :-/

ctb · 2016-10-03T20:16:08Z

Ready for review again @standage. (I wasn't 100% sure of the details of the new github review approach, so ... all you wanted me to do was update the documentation, right?)

standage · 2016-10-03T20:20:17Z

Yep al teh wrods is gud,

ctb · 2016-10-04T13:58:57Z

Note: this PR refactored code from #1458; just want to link it in :). It also bears on #1262 and #1036.

ctb added 2 commits September 3, 2016 09:26

force verbose_loader to uppercase DNA

9e27de6

change broken_paired_reader to uppercase sequences

13284fb

ctb changed the title ~~force verbose_loader to uppercase DNA~~ uppercase DNA sequence coming in from screed Sep 3, 2016

ctb added 3 commits September 3, 2016 09:42

update the hashes for backwards-compatibility checking

4146358

replaced collections.namedtuple with screed.Record in test_functions

a94218d

make test sequences upper case in tests to match new uppercasing beha…

5fe8f65

…vior

ctb changed the title ~~uppercase DNA sequence coming in from screed~~ make the DNA sequence coming in from screed -> uppercase Sep 4, 2016

Merge branch "master" into "fix/cleandna", fix FakeFastaRead

38b2b8e

ctb changed the base branch from master to update/filter_abund October 1, 2016 18:26

ctb added 7 commits October 1, 2016 11:33

remove legacy code for ignoring short reads

079cea4

Merge branch 'fix/cleandna' of github.com:dib-lab/khmer into fix/clea…

e196cf7

…ndna

Merge branch 'update/filter_abund' into fix/cleandna

decd944

undo uppercasing artifacts

b486264

revert md5sum changes

3e60bcf

have ReadBundle rely on sequence loading to do the cleaning

768d92f

Merge branch 'master' of github.com:dib-lab/khmer into fix/cleandna

286574a

ctb changed the base branch from update/filter_abund to master October 1, 2016 20:35

ctb added 6 commits October 2, 2016 08:53

fix normalize-by-median to use cleaned_seq attribute on reads

95b3d0f

refactor trim-low-abund to use more efficient median function

6f52164

fix read bundling issue

cd72c8f

added utils.clean_input_reads; associated refactorings

72167c9

delete redundant sequence uppercasing in thread_utils

974ced3

ctb changed the title ~~make the DNA sequence coming in from screed -> uppercase~~ fix lowercase actgn input handling. Oct 2, 2016

ctb added 4 commits October 2, 2016 09:56

check cleaned_seq behavior in broken_paired_iter directly

4bc1135

fixed pep8

2e5e84b

refactored to use cleaned_seq attribute

c162bdd

updated ChangeLog

e7258b4

ctb added 2 commits October 2, 2016 10:08

fix pep8

ba24ff7

fix copyright year

e6524e4

standage requested changes Oct 3, 2016


ctb added 2 commits October 3, 2016 13:12

update dev docs

0ac2942

update ChangeLog

9cff349

update teh wrods

df2eaa4

standage approved these changes Oct 3, 2016


standage merged commit 6e4ffc5 into master Oct 3, 2016

standage deleted the fix/cleandna branch October 3, 2016 20:46

ctb mentioned this pull request Oct 3, 2016

screed doesn't uppercase DNA in loaded records #1434

Closed

standage mentioned this pull request Oct 4, 2016

Migrate read handling code to utils code. #1466

Open

betatim mentioned this pull request Oct 4, 2016

twobit_repr as lookup table #1438

Closed

9 tasks

This was referenced Oct 6, 2016

make C++ code that mimics the utils sequence handling code #1483

Open

Factor out sequence preprocessing of reads with 'N's in them #986

Closed

ctb mentioned this pull request Dec 21, 2016

fix case-sensitivity of python-facing k-mer functions #370

Open

ctb mentioned this pull request Jan 24, 2017

More thoughts on non-ACGT characters. #1541

Open
