Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expose khmer.Read to python and use ReadParser in scripts #1491

Merged
merged 11 commits into from
Nov 15, 2016

Conversation

betatim
Copy link
Member

@betatim betatim commented Oct 24, 2016

Fix #1098
and related to #1483

After getting started on this I realised I wasn't sure how we wanted to use this. So far I am assuming the use case is to replace utils.ReadBundle as used in for example trim-low-abund.py.

Is it really worth it? We will end up turning all the screed Records into read_parser::Reads via various unicode -> bytes stuff, to then do some computation on them. In the current setup we pass the cleaned sequence to get_median_count and friends. Maybe there is something I am missing.

  • Is it mergeable?
  • make test Did it pass the tests?
  • make clean diff-cover If it introduces new functionality in
    scripts/ is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle Is it well
    formatted?
  • Is it documented in the ChangeLog?
    http://en.wikipedia.org/wiki/Changelog#Format
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)
  • Is the Copyright year up to date?

ReadBundles group several Reads together.

@ctb
Copy link
Member

ctb commented Oct 24, 2016

Some more background/thoughts/components of this --

  • we don't have broken-paired reading functionality in the C++ layer, and the Python version is kinda slow I think; or, at least, screed loading of sequences seems way faster when you don't use broken_paired_reader; benchmarking could be a place to start to see if I'm right :).
  • we also need identical read cleaning behavior in C++ and Python levels, so e.g. the clean_input_reads function could be replicated over to or simply moved into the C++ layer;
  • the C++ Read / screed.Record functionality should be synchronized (Synchronize screed and khmer read record details #769)

So it's half yard cleaning and half optimization, I think.

Make sense?

@betatim
Copy link
Member Author

betatim commented Oct 24, 2016

Ok, so as a simple minded person I will take that to mean: we want a faster way of doing what broken_pair_reader does :)

Read cleaning in C++ vs python: did some exploring and came across Hashtable::check_and_normalize_read which we should keep track of for the consolidation effort.

@betatim
Copy link
Member Author

betatim commented Oct 24, 2016

Having re-read what Hashtable::check_and_normalize_read does it doesn't improve your read (by upper casing) it just tells you that it isn't a good read.

@ctb
Copy link
Member

ctb commented Oct 25, 2016

On Mon, Oct 24, 2016 at 07:04:36AM -0700, Tim Head wrote:

Ok, so as a simple minded person I will take that to mean: we want a faster way of doing what broken_pair_reader does :)

Read cleaning in C++ vs python: did some exploring and came across Hashtable::check_and_normalize_read which we should keep track of for the consolidation effort.

yep, and making this code faster and more sensible is also a +1.

@betatim betatim force-pushed the feature/c-land-readbundle branch from 0bc426e to 96e749d Compare November 8, 2016 16:03
@betatim
Copy link
Member Author

betatim commented Nov 9, 2016

Two notes to myself:

  • test_streaming_io mostly works, except where it fails because of a missing attribute (expected).
  • seqan internally does something different if the input pathname is - or a file name. This means testing with a fifo which we pass as a pathname doesn't exercise the right code. In particular for paths it opens the file, reads a few bytes, closes it again, and re-opens later for the real reading. For a FIFO that shouldn't do the right thing.
  • the test that is failing looks to be some deadlock issue with opening the fifo in the right order or so.

@betatim
Copy link
Member Author

betatim commented Nov 9, 2016

However, replacing screed.open with khmer.ReadParser does speed up by ~2 the broken_pair_reader with a one line change 😀

@ctb
Copy link
Member

ctb commented Nov 9, 2016

On Wed, Nov 09, 2016 at 04:11:59AM -0800, Tim Head wrote:

Two notes to myself:

  • test_streaming_io mostly works, except where it fails because of a missing attribute (expected).

exellent!

  • seqan internally does something different if the input pathname is - or a file name. This means testing with a fifo which we pass as a pathname doesn't exercise the right code. In particular for paths it opens the file, reads a few bytes, closes it again, and re-opens later for the real reading. For a FIFO that shouldn't do the right thing.
  • the test that is failing looks to be some deadlock issue with opening the fifo in the right order or so.

Do you think the test should be discarded? We have the test_streaming_io tests
which actullay test live streaming (which we didn't have at the time of the
FIFO test writing).

@betatim
Copy link
Member Author

betatim commented Nov 9, 2016

the test that is failing looks to be some deadlock issue with opening the fifo in the right order or so.

Do you think the test should be discarded? We have the test_streaming_io tests
which actullay test live streaming (which we didn't have at the time of the
FIFO test writing).

Probably yes. It is super hard to debug because not only is the deadlock happening in another thread, but in some c++ code that is being called by another script being exec'ed from another thread ... so if the streaming IO tests cover this/can be made to test this I am 👍 for that. Will check tomorrow morning.

@betatim betatim force-pushed the feature/c-land-readbundle branch from ea9c94d to f7bcbc0 Compare November 10, 2016 11:04
@ctb
Copy link
Member

ctb commented Nov 10, 2016

Looking good on a quick skim.

One comment - does eliminating the FIFO-based streaming tests drop coverage at all?

@codecov-io
Copy link

codecov-io commented Nov 14, 2016

Current coverage is 95.74% (diff: 88.23%)

Merging #1491 into master will decrease coverage by 0.06%

@@             master      #1491   diff @@
==========================================
  Files            36         36          
  Lines          2938       2938          
  Methods           0          0          
  Messages          0          0          
  Branches        448        449     +1   
==========================================
- Hits           2815       2813     -2   
- Misses           54         55     +1   
- Partials         69         70     +1   

Powered by Codecov. Last update 283835a...18eab38

@betatim
Copy link
Member Author

betatim commented Nov 14, 2016

I removed the stuff that attempted to add the ReadBundle type. It was half-baked and the PR ended up focussing on exposing the Read type and speeding up a bunch of scripts by using khmer.ReadParser.

We can go back to #1483 if we really want to add the ReadBundle. Not sure it will do much in terms of speed though. The next slowest thing in broken_pair_reader is check_is_pair so I'd C++'ify that or speed it up otherwise.

@betatim betatim changed the title Adding a ReadBundle type expose khmer.Read to python and use ReadParser in scripts Nov 14, 2016
@betatim
Copy link
Member Author

betatim commented Nov 14, 2016

No drop in coverage.

$ make diff-cover
diff-cover coverage-gcovr.xml coverage.xml
-------------
Diff Coverage
Diff: origin/master...HEAD, staged, and unstaged changes
-------------
khmer/_khmer.cc (100%)
scripts/extract-paired-reads.py (100%)
scripts/filter-abund-single.py (100%)
scripts/filter-abund.py (100%)
scripts/sample-reads-randomly.py (75.0%): Missing lines 154
scripts/split-paired-reads.py (100%)
scripts/trim-low-abund.py (100%)
-------------
Total:   37 lines
Missing: 1 line
Coverage: 97%
-------------

@betatim betatim force-pushed the feature/c-land-readbundle branch 2 times, most recently from 5e7dcbf to fb4aa0e Compare November 14, 2016 12:17
@ctb
Copy link
Member

ctb commented Nov 14, 2016

On Mon, Nov 14, 2016 at 03:05:32AM -0800, Tim Head wrote:

I removed the stuff that attempted to add the ReadBundle type. It was half-baked and the PR ended up focussing on exposing the Read type and speeding up a bunch of scripts by using khmer.ReadParser.

We can go back to #1483 if we really want to add the ReadBundle. Not sure it will do much in terms of speed though. The next slowest thing in broken_pair_reader is check_is_pair so I'd C++'ify that or speed it up otherwise.

agreed.

@betatim
Copy link
Member Author

betatim commented Nov 14, 2016

This should pass on both legacy python and python any minute now.

@betatim
Copy link
Member Author

betatim commented Nov 14, 2016

Ready for review! @luizirber, @camillescott, @standage , or @ctb

Copy link
Member

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice; LGTM.

@ctb
Copy link
Member

ctb commented Nov 14, 2016

Let's go ahead and merge when the tests pass!

@betatim do you think you could go through the various Read-related issues and update them with this new functionality?

@betatim
Copy link
Member Author

betatim commented Nov 14, 2016

@betatim do you think you could go through the various Read-related issues and update them with this new functionality?

I'll give it a go. Do you have something particular in mind? Otherwise I'll locate read related issues by searching, etc

@ctb
Copy link
Member

ctb commented Nov 14, 2016

On Mon, Nov 14, 2016 at 05:30:35AM -0800, Tim Head wrote:

@betatim do you think you could go through the various Read-related issues and update them with this new functionality?

I'll give it a go. Do you have something particular in mind? Otherwise I'll locate read related issues by searching, etc

they're all linked issues, if only transitively :)

#1483
#791
#1098
#1466

ReadBundles group several Reads together.
The test gets stuck in a deadlock, seqan does subtly different
things when reading from a path that is not '-' and this is
covered by test_streaming_io.
@betatim betatim force-pushed the feature/c-land-readbundle branch from 508af89 to 18eab38 Compare November 15, 2016 07:10
@betatim betatim merged commit 347430a into master Nov 15, 2016
@betatim betatim deleted the feature/c-land-readbundle branch November 15, 2016 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants