Junction count assembly & CPython wrappers for assemblers #1503

camillescott · 2016-11-03T18:31:04Z

Implements the JunctionCountAssembler which optimizes the basic approach behind the labeled assembler, and wraps the Assembler classes in CPython.

- Is it mergeable?
- make test: Did it pass the tests?
- make clean diff-cover: If it introduces new functionality in scripts/ is it tested?
- make format diff_pylint_report cppcheck doc pydocstyle: Is it well formatted?
- Did it change the command-line interface? Only additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
- Is it documented in the ChangeLog? http://en.wikipedia.org/wiki/Changelog#Format
- Was a spellchecker run on the source code and documentation after changes were made?
- Do the changes respect streaming IO? (Are they tested for streaming IO?)
- Is the Copyright year up to date?

betatim · 2016-11-04T14:35:59Z

khmer/_khmer.cc

+typedef struct {
+    PyObject_HEAD
+    LinearAssembler * assembler;
+} khmer_KLinearAssembler_Object;


For my education: do you know what the logic is behind when that extra K is used (khmer_KLinearAssembler_Object vs khmer_LinearAssembler_Object)?

camillescott · 2016-11-04

So far as I know, it's a holdover. I've just followed the pattern of the existing class defs.

ctb · 2016-11-10

Probably my fault originally. I've maintained it in #1504 (shrug).

betatim · 2016-11-04T14:38:42Z

khmer/_khmer.cc

+    LinearAssembler * assembler;
+} khmer_KLinearAssembler_Object;
+
+#define is_linearassembler_obj(v)  (Py_TYPE(v) == &khmer_KLinearAssembler_Type)


What is this used for?

camillescott · 2016-11-04

Copypasta cruft ;) But it's there to check whether a PyObject * you get from Python-land is of type khmer_KLinearAssembler_Type.

betatim · 2016-11-07

I figured out the latter part :) Just couldn't work out what it was good for :)

ctb · 2016-11-10

We used to use these things more in the code, but I don't think they're used anywhere any more. Could be removed.

betatim · 2016-11-04T15:28:01Z

tests/test_sandbox_scripts.py

@@ -72,36 +72,37 @@ def _sandbox_scripts():


 @pytest.mark.parametrize("filename", _sandbox_scripts())
-def test_import_succeeds(filename):
+def test_import_succeeds(filename, tmpdir):


(sometimes I wonder if pytest contains too much magic...)

By the way, why do we need to change CWD?

camillescott · 2016-11-04

Isolation. Some sandbox scripts were opening files in the root dir and not cleaning up their mess. I think I actually fixed that, but the extra isolation is probably a good idea anyway.

standage

After initial review, my main questions are about the mechanics and concepts underlying the junction assembler (see comment). Sorry if I'm a bit slow on the uptake, but @camillescott if it's not too much trouble a brief explanation would help a lot!

standage · 2016-11-07T21:44:51Z

tests/test_assembly.py

+class TestJunctionCountAssembler:
+
+    def test_beginning_to_end_across_tip(self, right_tip_structure):
+        # assemble entire contig, ignoring branch point b/c of labels


I understand what this test is testing for, but I'm still fuzzy on the junction counting and label implementation. Perhaps a brief explanation of how the assembler handles this simple example will help me wrap my mind around the more abstract concepts.

ctb · 2016-11-10

The general idea behind the labeled assembler, of which this is an optimized version, is to use known paths to connect across high degree nodes. One primary use case is where we want to thread the graph with reads, so that we can resolve across repetitive k-mers.

A bit more background: De Bruijn graphs break input data up into k-mers, and this loses the longer range information present in any input data. For example, if you used k=31 and input a 1kb contig, the fact that the 1kb contig was actually a known valid path through the data would be lost because you were breaking the contig up into 31-mers. If the 31-mers form a path that doesn't branch, then we're fine - but of course in any real data set there will be high-degree nodes formed by repeats and sequencing errors.
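To make the point about lost long-range information concrete, here is a minimal Python sketch. It is not khmer code: the helper names (kmers, build_graph, branch_points), the toy sequences, and K=4 are all made up for illustration, and it ignores reverse complements, which a real De Bruijn graph implementation has to handle.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_graph(seqs, k):
    """Map each k-mer to the set of k-mers that directly follow it in some input."""
    successors = defaultdict(set)
    for seq in seqs:
        prev = None
        for kmer in kmers(seq, k):
            successors[kmer]  # touch the key so terminal k-mers are nodes too
            if prev is not None:
                successors[prev].add(kmer)
            prev = kmer
    return successors


def branch_points(successors):
    """Return the k-mers with more than one successor (high out-degree nodes)."""
    return sorted(node for node, nxt in successors.items() if len(nxt) > 1)


K = 4
contig_a = "ACGTTGCAAT"
contig_b = "GGGTTGCCCA"  # shares GTTG and TTGC with contig_a

graph = build_graph([contig_a, contig_b], K)
print(branch_points(graph))    # ['TTGC']
print(sorted(graph["TTGC"]))   # ['TGCA', 'TGCC']
```

Once both contigs are in the graph, nothing records that the walk arriving from ACGT should leave TTGC via TGCA rather than TGCC. That missing link is what the labels (or junction counts) are meant to restore.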

There are a number of approaches to dealing with this, but the approach that @camillescott has implemented (first in the labelled assembler and then in the JunctionCount assembler) is to track k-mers across these high degree nodes. The labelled assembler does so by labeling the graph nodes on either side of the high degree node with the same label; the junction count assembler uses a trickier approach that is an optimized version of the same idea.

Please correct me if I'm wrong, @camillescott :)

standage · 2016-11-07T21:46:32Z

khmer/_khmer.cc

+        }
+
+        try {
+       std::cout << "New Assembler: " << hashtable << std::endl;


There are a few such debugging print statements outside #if DEBUG_ASSEMBLY blocks. Did you intend to leave these in, or should they be removed or marked?

standage · 2016-11-07T21:47:08Z

lib/kmer_filters.cc

+                                     CountingHash * junctions,
+                                     const unsigned int min_cov)
+{
+    KmerFilter filter = [=] (const Kmer& dst_node) {


Whoa, unfamiliar with this syntax. Is it a functional programming construct?

luizirber · 2016-11-08

That's a lambda function capturing by copy ([=]) any variable used inside the function body. KmerFilter is a type for a function that takes a Kmer and returns true or false, defined at khmer/lib/khmer.hh, line 166 in 4657245:

typedef std::function<bool (const Kmer&)> KmerFilter;
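For readers more at home on the Python side of khmer, a rough analogue of this pattern is a factory function that returns a closure. The sketch below is only an illustration: make_low_coverage_filter and the threshold test in its body are assumptions, since the diff excerpt shows the filter's parameters (junctions, min_cov) but not what the real lambda computes.

```python
from typing import Callable, Mapping

# Python stand-in for: typedef std::function<bool (const Kmer&)> KmerFilter;
KmerFilter = Callable[[str], bool]


def make_low_coverage_filter(junction_counts: Mapping[str, int],
                             min_cov: int) -> KmerFilter:
    """Build a filter that remembers junction_counts and min_cov.

    The inner function closes over the arguments of this call, much as the
    C++ lambda copies the variables it uses via [=]. Returning True is taken
    here to mean "reject this k-mer"; that convention is an assumption.
    """
    def kmer_filter(dst_node: str) -> bool:
        return junction_counts.get(dst_node, 0) < min_cov

    return kmer_filter


reject = make_low_coverage_filter({"ACGT": 5, "CGTA": 1}, min_cov=2)
print(reject("ACGT"), reject("CGTA"), reject("TTTT"))  # False True True
```

The caller only ever sees a one-argument function, so traversal code can apply any number of such filters without knowing what state each one carries.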

ctb · 2016-11-10T22:23:25Z

OK, I've updated this branch with the latest master. On my laptop, at least, all tests pass...

I'll take a look at this PR and try to answer @standage's questions ;). Assuming all looks OK, are there any questions or objections to merging that haven't shown up in the comments so far?

camillescott · 2016-11-10T23:02:51Z

Yup, that's a good summary. The detail here is that the junction counting
method just counts how many reads span a particular branch, rather than
tracking which specific reads span a particular branch as in the labeling
method. The latter can allow for longer-range threading, but the former is
ridiculously faster, and good enough for some basic operations like tip and
error removal.
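The trade-off described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the khmer implementation: a "junction" is modelled here as a pair of adjacent k-mers, counts live in a plain Counter rather than a CountingHash, and index_reads and best_exit are hypothetical helpers.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the k-mers of seq in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def index_reads(reads, k):
    """Index junctions (adjacent k-mer pairs) both ways for comparison."""
    counts = Counter()         # junction -> how many reads span it
    labels = defaultdict(set)  # junction -> which reads span it
    for read_id, read in enumerate(reads):
        ks = kmers(read, k)
        for junction in set(zip(ks, ks[1:])):  # count each junction once per read
            counts[junction] += 1
            labels[junction].add(read_id)
    return counts, labels


def best_exit(counts, node, min_cov=1):
    """Pick the successor of node spanned by the most reads, or None.

    Ties are broken by insertion order, which is arbitrary for this sketch.
    """
    candidates = {v: c for (u, v), c in counts.items()
                  if u == node and c >= min_cov}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


K = 4
reads = ["ACGTTGCAAT", "GTTGCAAT", "GGTTGCCC"]
counts, labels = index_reads(reads, K)

print(counts[("TTGC", "TGCA")], counts[("TTGC", "TGCC")])  # 2 1
print(sorted(labels[("TTGC", "TGCA")]))                    # [0, 1]
print(best_exit(counts, "TTGC"))                           # TGCA
```

The counts table needs one integer per junction, while the labels table grows with the number of reads. Only the labels can say whether the read that entered a repeat is the same one that leaves it, which is the longer-range threading mentioned above.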


ctb · 2016-11-10T23:16:53Z

khmer/_khmer.cc

+    {
+        "assemble",
+        (PyCFunction)junctioncountassembler_assemble, METH_VARARGS | METH_KEYWORDS,
+        "Assemble a path linearly until a branch is reached."


Note to self: this docstring is incorrect.

ctb · 2016-11-10T23:18:53Z

lib/assembler.cc

+}
+
+// Starting from the given seed k-mer, assemble all maximal linear paths in
+// both directions, using labels to skip over tricky bits.


This comment is also wrong - it's using junction counters to skip over tricky bits :)

ctb · 2016-11-10T23:21:26Z

TODO:

- remove/ifdef debug prints
- run the sandbox scripts!
- remove is_X_obj stuff since no longer used
- look into whether our new, saner inheritance hierarchy lets us do better hashtable extraction from passed-in objects.
- fix docstring for junction count assembler CPython function

I think I understand most everything that @camillescott is doing here & it looks great to me! I'll fix it up a bit & merge tomorrow morning unless there are objections.

ctb

LGTM; I'll add the necessary ChangeLog stuff and merge tomorrow morning.

codecov-io · 2016-11-11T12:15:15Z

Current coverage is 95.81% (diff: 100%)

No coverage report found for master at c114d14.

Powered by Codecov. Last update c114d14...cfd6ed3

ctb · 2016-11-11T12:15:56Z

Thanks @camillescott !

camillescott · 2016-11-11T20:10:55Z

Thanks for taking care of the merge!


ctb and others added 30 commits June 9, 2016 08:03

Merge branch 'update/remove_labelptr' into feature/pathlink

5e41d57

basic compilation works

8d77afe

track labels in collapsed linear paths

b1e99ed

cleanup & commenting

a432f0e

add assemble_labeled_path

b3a7c3f

compile, basic fns & API

ddedd03

changed fn signature of assemble_labeled_right

8c6d069

seems to be ...working?

d58a139

made left + right assembly work

a707942

a first attempt at streaming assembly

3740422

Merge branch 'update/remove_labelptr' into feature/pathlink

dc517f8

basic stuff is working

fcf61c0

fix print function py2 issue

5e08729

track visited & avoid infinite loops

a3e6cfa

add extract-unassembled-reads*py

de261f6

Merge branch 'master' of github.com:dib-lab/khmer into feature/pathlink

04168a6

Merge branch 'feature/assemble' into feature/pathlink

22393ea

Merge branch 'update/remove_labelptr' into feature/pathlink

7125a33

Merge branch 'master' of github.com:dib-lab/khmer into feature/pathlink

6e9650e

fix typo

c63f31e

Merge branch 'master' of github.com:dib-lab/khmer into feature/pathlink

250a997

fixed bug with 'temporary' use of std:string

e994c91

Merge branch 'fix/readpairiter_error' into feature/pathlink

8b3a22b

Merge branch 'master' of https://github.com/dib-lab/khmer into featur…

1a61e5c

…e/pathlink

update extract-unassembled-reads* sandbox scripts for python3

aa7e7b6

fix @ljcohen short sequence issue?

fd9d82c

Make build_kmer const method

1ea39f1

First pass assembler class

da2408a

Add an assemble_left function to match assemble_right

55350d1

TEMPORARILY add github target for screed bugfix to pass tests

e33cef4

camillescott and others added 11 commits October 15, 2016 19:35

Add first iteration of JunctionCountAssembler

a16dbcd

mofidy output of assembly script

7ced7d3

start convert to junction assembler

44d792c

Merge branch 'feature/assembly/junction_count' of github.com:ged-lab/…

d89af42

…khmer into feature/assembly/junction_count

Merge branch 'master' into feature/review/pathlink

aaa970f

Merge branch 'master' into feature/pathlink

c5b8d52

Formatting only changes via make format

62dd982

fix pep8

361511a

Basic junctioncounting assembler and script working

8ead746

Update from feature/review/pathlink in prep for master

2228041

Merge in master and remove compiler warnings from kwnames in khmer.cc

4657245

betatim reviewed Nov 4, 2016

standage reviewed Nov 7, 2016

Merge remote-tracking branch 'origin/master' into feature/assembly/ju…

d40e4aa

…nction_count-merge-master

ctb mentioned this pull request Nov 10, 2016

A reconciliation branch for the storage/hashgraph refactoring & junction count stuff. #1508

Merged

ctb reviewed Nov 10, 2016

minor cleanup

d963b7f

ctb approved these changes Nov 10, 2016

ctb added 2 commits November 10, 2016 15:40

add ChangeLog, whitespace cleanup

8fe1a96

fix pep8

cfd6ed3

ctb merged commit 2d67bdb into master Nov 11, 2016
