Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Junction count assembly & CPython wrappers for assemblers #1503

Merged
merged 114 commits into from
Nov 11, 2016

Conversation

camillescott
Copy link
Member

Implements the JunctionCountAssembler which optimizes the basic approach behind the labeled assembler, and wraps the Assembler classes in CPython.

  • Is it mergeable?
  • make test Did it pass the tests?
  • make clean diff-cover If it introduces new functionality in
    scripts/ is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle Is it well
    formatted?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
    http://en.wikipedia.org/wiki/Changelog#Format
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)
  • Is the Copyright year up to date?

ctb and others added 30 commits June 9, 2016 08:03
typedef struct {
PyObject_HEAD
LinearAssembler * assembler;
} khmer_KLinearAssembler_Object;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my education: do you know what the system behind using or not that extra K is (khmer_KLinearAssembler_Object vs khmer_LinearAssembler_Object)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far as I know, it's a holdover. I've just followed the pattern of the existing class defs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably my fault originally. I've maintained it in #1504 shrug

LinearAssembler * assembler;
} khmer_KLinearAssembler_Object;

#define is_linearassembler_obj(v) (Py_TYPE(v) == &khmer_KLinearAssembler_Type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this used for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copypasta cruft ;) But, it's to check if a PyObject * you get from Python-land is of type khmer_KLinearAssembler_Type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured out the latter part :) Just couldn't work out what it was good for :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to use these things more in the code, but I don't think they're used anywhere any more. Could be removed.

@@ -72,36 +72,37 @@ def _sandbox_scripts():


@pytest.mark.parametrize("filename", _sandbox_scripts())
def test_import_succeeds(filename):
def test_import_succeeds(filename, tmpdir):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(sometimes I wonder if pytest contains too much magic...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, why do we need to change CWD?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isolation. There were some sandbox scripts were opening files in the root dir and not cleaning up their mess. I think I actually fixed that, but the extra isolation is probably a good idea anyway.

Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After initial review, my main questions are about the mechanics and concepts underlying the junction assembler (see comment). Sorry if I'm a bit slow on the uptake, but @camillescott if it's not too much trouble a brief explanation would help a lot!

class TestJunctionCountAssembler:

def test_beginning_to_end_across_tip(self, right_tip_structure):
# assemble entire contig, ignoring branch point b/c of labels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what this test is testing for, but I'm still fuzzy on the junction counting and label implementation. Perhaps a brief explanation of how the assembler handles this simple example will help me wrap my mind around the more abstract concepts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general idea behind the labeled assembler, of which this is an optimized version, is to use known paths to connect across high degree nodes. One primary use case is where we want to thread the graph with reads, so that we can resolve across repetitive k-mers.

A bit more background: De Bruijn graphs break input data up into k-mers, and this loses the longer range information present in any input data. For example, if you used k=31 and input a 1kb contig, the fact that the 1kb contig was actually a known valid path through the data would be lost because you were breaking the contig up into 31-mers. If the 31-mers form a path that doesn't branch, then we're fine - but of course in any real data set there will be high-degree nodes formed by repeats and sequencing errors.

There are a number of approaches to dealing with this, but the approach that @camillescott has implemented (first in the labelled assembler and then in the JunctionCount assembler) is to track k-mers across these high degree nodes. The labelled assembler does so by labeling the graph nodes on either side of the high degree node with the same label; the junction count assembler uses a trickier approach that is an optimized version of the same idea.

Please correct me if I'm wrong, @camillescott :)

}

try {
std::cout << "New Assembler: " << hashtable << std::endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few such debugging print statements outside #if DEBUG_ASSEMBLY blocks. Did you intend to leave these in, or should they be removed or marked?

CountingHash * junctions,
const unsigned int min_cov)
{
KmerFilter filter = [=] (const Kmer& dst_node) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoa, unfamiliar with this syntax. Is it a functional programming construct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lambda function capturing by-copy ([=]) any variable used inside the function body. And KmerFilter is a type for a function that takes a Kmer and returns True or False (defined at

typedef std::function<bool (const Kmer&)> KmerFilter;
): typedef std::function<bool (const Kmer&)> KmerFilter;

@ctb
Copy link
Member

ctb commented Nov 10, 2016

OK, I've updated this branch with the latest master. On my laptop, at least, all tests pass...

I'll take a look at this PR and try to answer @standage's questions ;). Assuming all looks OK, are there any questions or objections to merging that haven't shown up in the comments so far?

@camillescott
Copy link
Member Author

Yup, that's a good summary. The detail here is that the junction counting
method just counts how many reads span a particular branch, rather than
tracking which specific reads span a particular branch as in the labeling
method. The latter can allow for longer-range threading, but the former is
ridiculously faster, and good enough for some basic operations like tip and
error removal.

On Thu, Nov 10, 2016 at 2:55 PM, C. Titus Brown notifications@github.com
wrote:

@ctb commented on this pull request.

In tests/test_assembly.py #1503:

     assert len(paths) == 1
     # There are K-1 k-mers spanning the junction between
     # the beginning and end of the repeat
     assert len(paths[0]) == len(repeat) + K - 1

+class TestJunctionCountAssembler:
+

  • def test_beginning_to_end_across_tip(self, right_tip_structure):
  •    # assemble entire contig, ignoring branch point b/c of labels
    

The general idea behind the labeled assembler, of which this is an
optimized version, is to use known paths to connect across high degree
nodes. One primary use case is where we want to thread the graph with
reads, so that we can resolve across repetitive k-mers.

A bit more background: De Bruijn graphs break input data up into k-mers,
and this loses the longer range information present in any input data. For
example, if you used k=31 and input a 1kb contig, the fact that the 1kb
contig was actually a known valid path through the data would be lost
because you were breaking the contig up into 31-mers. If the 31-mers form a
path that doesn't branch, then we're fine - but of course in any real data
set there will be high-degree nodes formed by repeats and sequencing errors.

There are a number of approaches to dealing with this, but the approach
that @camillescott https://github.com/camillescott has implemented
(first in the labelled assembler and then in the JunctionCount assembler)
is to track k-mers across these high degree nodes. The labelled assembler
does so by labeling the graph nodes on either side of the high degree node
with the same label; the junction count assembler uses a trickier approach
that is an optimized version of the same idea.

Please correct me if I'm wrong, @camillescott
https://github.com/camillescott :)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#1503, or mute the thread
https://github.com/notifications/unsubscribe-auth/ACwxraTo238t2_maJr7kjWw3YuhsWeXRks5q86DWgaJpZM4KowY7
.

Camille Scott

Graduate Group for Computer Science
Lab for Data Intensive Biology
University of California, Davis

camille.scott.w@gmail.com

{
"assemble",
(PyCFunction)junctioncountassembler_assemble, METH_VARARGS | METH_KEYWORDS,
"Assemble a path linearly until a branch is reached."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: this docstring is incorrect.

}

// Starting from the given seed k-mer, assemble all maximal linear paths in
// both directions, using labels to skip over tricky bits.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is also wrong - it's using junction counters to skip over tricky bits :)

@ctb
Copy link
Member

ctb commented Nov 10, 2016

TODO:

  • remove/ifdef debug prints
  • run the sandbox scripts!
  • remove s_X_obj stuff since no longer used
  • look into whether our new, saner inheritance hierarchy lets us do better hashtable extraction from passed-in objects.
  • fix docstring for junction count assembler CPython function

I think I understand most everything that @camillescott is doing here & it looks great to me! I'll fixi t up a bit & merge tomorrow morning unless there are objections.

Copy link
Member

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; I'll add the necessary ChangeLog stuff and merge tomorrow morning.

@codecov-io
Copy link

codecov-io commented Nov 11, 2016

Current coverage is 95.81% (diff: 100%)

No coverage report found for master at c114d14.

Powered by Codecov. Last update c114d14...cfd6ed3

@ctb ctb merged commit 2d67bdb into master Nov 11, 2016
@ctb
Copy link
Member

ctb commented Nov 11, 2016

Thanks @camillescott !

@camillescott
Copy link
Member Author

Thanks for taking care of the merge!

On Nov 11, 2016 4:15 AM, "C. Titus Brown" notifications@github.com wrote:

Thanks @camillescott https://github.com/camillescott !


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#1503 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACwxratRip4-gkziO120KrzQ_Gm6XjFuks5q9Fx8gaJpZM4KowY7
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants