-
Notifications
You must be signed in to change notification settings - Fork 295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Junction count assembly & CPython wrappers for assemblers #1503
Conversation
…khmer into feature/assembly/junction_count
typedef struct { | ||
PyObject_HEAD | ||
LinearAssembler * assembler; | ||
} khmer_KLinearAssembler_Object; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my education: do you know what the system behind using or not that extra K is (khmer_KLinearAssembler_Object
vs khmer_LinearAssembler_Object
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far as I know, it's a holdover. I've just followed the pattern of the existing class defs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably my fault originally. I've maintained it in #1504 shrug
LinearAssembler * assembler; | ||
} khmer_KLinearAssembler_Object; | ||
|
||
#define is_linearassembler_obj(v) (Py_TYPE(v) == &khmer_KLinearAssembler_Type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copypasta cruft ;) But, it's to check if a PyObject *
you get from Python-land is of type khmer_KLinearAssembler_Type
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured out the latter part :) Just couldn't work out what it was good for :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used to use these things more in the code, but I don't think they're used anywhere any more. Could be removed.
@@ -72,36 +72,37 @@ def _sandbox_scripts(): | |||
|
|||
|
|||
@pytest.mark.parametrize("filename", _sandbox_scripts()) | |||
def test_import_succeeds(filename): | |||
def test_import_succeeds(filename, tmpdir): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(sometimes I wonder if pytest contains too much magic...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, why do we need to change CWD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isolation. There were some sandbox scripts were opening files in the root dir and not cleaning up their mess. I think I actually fixed that, but the extra isolation is probably a good idea anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After initial review, my main questions are about the mechanics and concepts underlying the junction assembler (see comment). Sorry if I'm a bit slow on the uptake, but @camillescott if it's not too much trouble a brief explanation would help a lot!
class TestJunctionCountAssembler: | ||
|
||
def test_beginning_to_end_across_tip(self, right_tip_structure): | ||
# assemble entire contig, ignoring branch point b/c of labels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what this test is testing for, but I'm still fuzzy on the junction counting and label implementation. Perhaps a brief explanation of how the assembler handles this simple example will help me wrap my mind around the more abstract concepts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general idea behind the labeled assembler, of which this is an optimized version, is to use known paths to connect across high degree nodes. One primary use case is where we want to thread the graph with reads, so that we can resolve across repetitive k-mers.
A bit more background: De Bruijn graphs break input data up into k-mers, and this loses the longer range information present in any input data. For example, if you used k=31 and input a 1kb contig, the fact that the 1kb contig was actually a known valid path through the data would be lost because you were breaking the contig up into 31-mers. If the 31-mers form a path that doesn't branch, then we're fine - but of course in any real data set there will be high-degree nodes formed by repeats and sequencing errors.
There are a number of approaches to dealing with this, but the approach that @camillescott has implemented (first in the labelled assembler and then in the JunctionCount assembler) is to track k-mers across these high degree nodes. The labelled assembler does so by labeling the graph nodes on either side of the high degree node with the same label; the junction count assembler uses a trickier approach that is an optimized version of the same idea.
Please correct me if I'm wrong, @camillescott :)
} | ||
|
||
try { | ||
std::cout << "New Assembler: " << hashtable << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few such debugging print statements outside #if DEBUG_ASSEMBLY
blocks. Did you intend to leave these in, or should they be removed or marked?
CountingHash * junctions, | ||
const unsigned int min_cov) | ||
{ | ||
KmerFilter filter = [=] (const Kmer& dst_node) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoa, unfamiliar with this syntax. Is it a functional programming construct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a lambda function capturing by-copy ([=]
) any variable used inside the function body. And KmerFilter
is a type for a function that takes a Kmer
and returns True
or False
(defined at
Line 166 in 4657245
typedef std::function<bool (const Kmer&)> KmerFilter; |
typedef std::function<bool (const Kmer&)> KmerFilter;
…nction_count-merge-master
OK, I've updated this branch with the latest master. On my laptop, at least, all tests pass... I'll take a look at this PR and try to answer @standage's questions ;). Assuming all looks OK, are there any questions or objections to merging that haven't shown up in the comments so far? |
Yup, that's a good summary. The detail here is that the junction counting On Thu, Nov 10, 2016 at 2:55 PM, C. Titus Brown notifications@github.com
Camille Scott Graduate Group for Computer Science |
{ | ||
"assemble", | ||
(PyCFunction)junctioncountassembler_assemble, METH_VARARGS | METH_KEYWORDS, | ||
"Assemble a path linearly until a branch is reached." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: this docstring is incorrect.
} | ||
|
||
// Starting from the given seed k-mer, assemble all maximal linear paths in | ||
// both directions, using labels to skip over tricky bits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is also wrong - it's using junction counters to skip over tricky bits :)
TODO:
I think I understand most everything that @camillescott is doing here & it looks great to me! I'll fixi t up a bit & merge tomorrow morning unless there are objections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; I'll add the necessary ChangeLog stuff and merge tomorrow morning.
Current coverage is 95.81% (diff: 100%)
|
Thanks @camillescott ! |
Thanks for taking care of the merge! On Nov 11, 2016 4:15 AM, "C. Titus Brown" notifications@github.com wrote:
|
Implements the JunctionCountAssembler which optimizes the basic approach behind the labeled assembler, and wraps the Assembler classes in CPython.
make test
Did it pass the tests?make clean diff-cover
If it introduces new functionality inscripts/
is it tested?make format diff_pylint_report cppcheck doc pydocstyle
Is it wellformatted?
without a major version increment. Changing file formats also requires a
major version number increment.
ChangeLog
?http://en.wikipedia.org/wiki/Changelog#Format
changes were made?
tested for streaming IO?)