[MRG] remove redundant ACGT-checking code. #1590

ctb · 2017-01-24T15:28:08Z

This removes check_and_process_read from Hashtable, and eliminates sequence cleaning from non-bulk-loading code. All such Python-accessible code should use the cleaned_seq attribute that is available from both the ReadParser code / Read objects and the khmer.utils.broken_paired_reader code.

See #1541 (comment) for background context.

Prior to this PR,

- consume_fasta & other bulk sequence loading functions automatically upper-cased DNA and ignored any sequence (no matter how long) that contained non-ACGT characters.
- trim_on_abundance and trim_below_abundance similarly upper-cased DNA and ignored non-ACGT-containing strings.
- find_spectral_error_positions raised an exception on non-ACGT-containing strings.
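The pre-PR behaviors listed above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not khmer's C++ code: the function names `old_style_filter` and `require_acgt` are hypothetical, and the use of `ValueError` is an assumption (the list only says that an exception was raised).

```python
# Illustrative model only; these helpers are hypothetical and not part of khmer.
VALID_BASES = frozenset("ACGT")


def old_style_filter(sequences):
    """Yield upper-cased sequences, silently skipping any that contain
    a non-ACGT character (the old bulk-loading and trim behavior)."""
    for seq in sequences:
        upper = seq.upper()
        if set(upper) <= VALID_BASES:
            yield upper


def require_acgt(seq):
    """Raise if seq contains a non-ACGT character (a model of the old
    find_spectral_error_positions check; exception type is assumed)."""
    if not set(seq) <= VALID_BASES:
        raise ValueError("sequence contains non-ACGT characters")
```

With this model, a single `N` anywhere in a read causes the whole read to be dropped, regardless of its length.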

This PR updates the test code introduced in #1633 to match the new behavior.

This PR also includes #1661.

- Is it mergeable?
- `make test` Did it pass the tests?
- `make clean diff-cover` If it introduces new functionality in scripts/ is it tested?
- `make format diff_pylint_report cppcheck doc pydocstyle` Is it well formatted?
- Did it change the command-line interface? Only backwards-compatible additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
- For substantial changes or changes to the command-line interface, is it documented in CHANGELOG.md? See keepachangelog for more details.
- Was a spellchecker run on the source code and documentation after changes were made?
- Do the changes respect streaming IO? (Are they tested for streaming IO?)

ctb · 2017-01-24T15:40:14Z

Although, reading #1491, I'm wondering whether we should be using cleaned_seq or an equivalent in the C++ bulk consume_fasta code instead of read.sequence. If so, then we really need to add some tests so that the current code breaks :).

codecov-io · 2017-01-24T16:54:38Z

Codecov Report

Merging #1590 into master will increase coverage by <.01%.
The diff coverage is 0%.

@@            Coverage Diff            @@
##           master   #1590      +/-   ##
=========================================
+ Coverage    0.05%   0.05%   +<.01%     
=========================================
  Files          91      91              
  Lines       11500   11483      -17     
  Branches     3063    3056       -7     
=========================================
  Hits            6       6              
+ Misses      11494   11477      -17

Impacted Files	Coverage Δ
include/oxli/hashtable.hh	`0% <ø> (ø)`	⬆️
src/oxli/subset.cc	`0% <0%> (ø)`	⬆️
src/oxli/hashtable.cc	`0% <0%> (ø)`	⬆️
src/oxli/labelhash.cc	`0% <0%> (ø)`	⬆️
src/oxli/hashgraph.cc	`0% <0%> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4fc53f6...1a54b45.

ctb · 2017-02-14T20:49:04Z

Also see #1619 (comment)

camillescott · 2017-02-14T21:22:20Z

Also see #1595 -- some more sophisticated alphabet checking was added to the parsers. Basically, you can give it one of the alphabets (as defined in alphabets.hh) by name, and it'll use that for its checks; as perhaps implied by that, I'm in favor of moving this sort of checking code into the parsers. Alternatively, the graphs could take a parameter for whether they should run any checks on what they consume (also using the alphabets).
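As a rough illustration of the "check against a named alphabet" idea described in this comment, here is a minimal Python sketch. It is not the parser code from #1595: the alphabet names (`"dna"`, `"dna_n"`), the `ALPHABETS` table, and `check_sequence` are all hypothetical stand-ins for whatever alphabets.hh actually defines.

```python
# Hypothetical sketch; names and alphabet contents are assumptions,
# not the definitions in alphabets.hh.
ALPHABETS = {
    "dna": "ACGT",
    "dna_n": "ACGTN",
}


def check_sequence(seq, alphabet_name="dna"):
    """Return True if every character of seq is in the named alphabet."""
    allowed = frozenset(ALPHABETS[alphabet_name])
    return all(ch in allowed for ch in seq)
```

A caller would pick the alphabet once, by name, and the parser (or graph) would apply the same check to everything it consumes, for example `check_sequence(read, "dna_n")` to tolerate Ns.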

camillescott · 2017-02-14T21:25:01Z

And, looking at #1511 closer, moving that checking code into the parsers is already becoming SOP. Yay.

ctb · 2017-02-14T21:26:21Z

yeppers.

ctb · 2017-02-20T02:20:20Z

@standage note that one thing the partitioning code does is fail to output reads containing Ns.

ctb · 2017-03-20T14:14:18Z

I wonder if this code is any faster now that we don't iterate across every sequence 2 or 3 times?

betatim · 2017-03-22T14:47:42Z

Time to review?

ctb · 2017-03-22T15:06:52Z

On Wed, Mar 22, 2017 at 07:47:42AM -0700, Tim Head wrote: Time to review?

yes! but this will not be merged until after release.

betatim · 2017-03-22T15:09:18Z

Ran `./scripts/abundance-dist-single.py -x 1e8 -b ecoli_ref-5m.fastq somehisto.hist -k 31 -s`, which takes about 120s on my machine on both master and this branch :-/
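For anyone repeating this kind of comparison, a small wall-clock timing helper is one way to do it. This is a generic sketch using only the Python standard library; `time_command` is a hypothetical helper name, and nothing here is specific to khmer.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd (a list of program arguments) to completion and return
    the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command fails,
    # so a crashed run is never mistaken for a fast one.
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start
```

Passing the command above as a list of arguments and calling the helper once per checked-out branch gives directly comparable numbers; repeating each run a few times helps separate real differences from noise.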

ctb · 2017-06-02T13:20:12Z

@betatim @luizirber @camillescott @standage this PR is now ready for merge into master and (IMO) should be given high priority for review. @camillescott's prediction of many merge conflicts was mistaken, thank goodness - only one small conflict to be resolved!

This merge significantly changes the details around ACTGN handling and will require a bump to khmer 3.0.

ctb · 2017-06-02T15:44:18Z

Tests pass!

standage

Looks good to me. As I understand it, khmer now makes no attempt to clean up kmers that are passed to the sketch.add() function, and does a [^ACGT] --> A conversion for any bulk sequence loading code. Correct?
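The `[^ACGT] --> A` conversion described in this comment can be sketched in plain Python as follows. This is a model of the behavior as understood here, not khmer's implementation: `clean_seq` is a hypothetical name, and upper-casing before substitution is an assumption carried over from the upper-casing behavior described at the top of the PR.

```python
import re

# Matches any single character that is not A, C, G, or T.
_NON_ACGT = re.compile(r"[^ACGT]")


def clean_seq(seq):
    """Upper-case seq (assumed), then replace every non-ACGT character
    with 'A'. The result always has the same length as the input."""
    return _NON_ACGT.sub("A", seq.upper())
```

Because the length is preserved, k-mer positions in the cleaned sequence line up with positions in the original read; the cost is that k-mers overlapping a replaced character no longer reflect the original bases.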

standage · 2017-06-02T16:40:57Z

tests/test_sequence_validation.py

+    # because different hash functions do different things with
+    # non-ACTG characters.  So all we want to do is verify that the
+    # functions execute w/o error on the k-mers before the "bad" DNA,
+    # and don't return positions in the "good" DNA.


The meaning of the phrase "and don't return positions in the 'good' DNA" is unclear to me.

The behavior of the hash functions on ACTG should all be the same with respect to "good" sequences, and there should be no errors in there to trim at.

standage · 2017-06-02T16:45:58Z

Also, does the partitioning code now output reads containing Ns @ctb?

ctb · 2017-06-02T17:10:42Z

@standage yes; re partitioning and ignoring of reads containing N, see: this removed line

remove unneeded check_and_normalize_read calls

83c755a

ctb mentioned this pull request Jan 24, 2017

More thoughts on non-ACGT characters. #1541

Open

remove unneeded check_and_process_read code

7d46f07

ctb changed the title ~~remove unneeded check_and_normalize_read calls~~ remove redundant ACGT-checking code. Jan 24, 2017

ctb mentioned this pull request Jan 26, 2017

Add cleaned_seq attribute to Read class #1591

Merged

8 tasks

ctb mentioned this pull request Feb 14, 2017

HLL refactor: C++11 simplifications #1316

Merged

8 tasks

ctb added 2 commits February 14, 2017 15:50

Merge branch 'master' of github.com:dib-lab/khmer into remove/is_vali…

db51c34

…d_dna

Merge branch 'master' of github.com:dib-lab/khmer into remove/is_vali…

ffe5076

…d_dna

ctb mentioned this pull request Feb 14, 2017

Discrepancies between exact counts and approximate counts #1619

Closed

remove check_and_normalize_read calls from abundance_distribution fun…

b9b708d

…ctions

ctb mentioned this pull request Feb 16, 2017

Introduce consume_seqfile_banding method #1571

Closed

12 tasks

ctb added 3 commits February 19, 2017 11:18

Merge branch 'master' of github.com:dib-lab/khmer into remove/is_vali…

976cf30

…d_dna

remove outdated comment ref merge

55f36e1

fix another comment

4cbe6bd

ctb mentioned this pull request Feb 19, 2017

some bulk sequence loading tests that nail down current ACGTN behavior. #1633

Merged

8 tasks

Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna

a34ae08

ctb changed the base branch from master to remove/is_valid_dna_tests February 19, 2017 20:18

ctb added 3 commits February 19, 2017 12:19

update consume_seqfile to use cleaned_seq

26180cd

Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna

fde5960

update for add'l tests

8029d19

ctb added 3 commits February 19, 2017 18:20

Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna

19e014a

abundance foo

1b5c03f

Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna

f91ebee

ctb added 9 commits March 19, 2017 07:10

update trim functions on bad dna to be properly undefined

10ce9bd

Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna

98cd756

Merge branch 'fix/consume_partitioned_err' into remove/is_valid_dna

8a683c8

Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna

fc63010

Merge branch 'fix/consume_partitioned_err' into remove/is_valid_dna

01c3b44

Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna

cb3d6cc

add partition ID to new 'bad' sequence

9a5080b

adjust tests for new 'bad dna' sequence

3f6a673

Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna

d4eb8a5

ctb changed the base branch from remove/is_valid_dna_tests to master March 20, 2017 14:04

Merge branch 'master' of github.com:dib-lab/khmer into remove/is_vali…

cf8ebc0

…d_dna

Merge branch 'master' into remove/is_valid_dna

122258c

This was referenced May 18, 2017

[MRG] Clean up "introduction" text in docs #1684

Merged

Split CPython and begin the Cython revolution #1595

Merged

Merge branch 'master' of github.com:dib-lab/khmer into remove/is_vali…

c031499

…d_dna_merge

ctb changed the title ~~remove redundant ACGT-checking code.~~ [MRG] remove redundant ACGT-checking code. Jun 2, 2017

ctb added 2 commits June 2, 2017 06:22

update Changelog

4b3d4ed

grr typo

1a54b45

standage approved these changes Jun 2, 2017

ctb merged commit e768617 into master Jun 2, 2017

ctb deleted the remove/is_valid_dna branch June 2, 2017 17:14

ctb mentioned this pull request Mar 22, 2020

Khmer (specifically reverse_complement) only intended for upper-case sequences? #1904

Open
