-
Notifications
You must be signed in to change notification settings - Fork 63
Conversation
This is great @kescobo! |
I think this probably fits with the methods I've been writing in Var for |
@kescobo My question to you and @kmsquire is: This and CountMin.jl both rely on Sketches, so are there commonalities between these two sets of code which can be "merged" within Bio and a consistent module/methods/types for dealing with hashes and sketches implemented that fulfils the needs of the two? |
MinHash is use in many applications and I'd say +1 for adding this in Bio.jl. I think this will fit in Bio.Seq. |
@Ward9250 I might be wrong, but I think CountMin.jl is doing something different. If I understand correctly, a Count-Min sketch gives counts of I looked at CountMin.jl and didn't see an obvious way that it would be used in this context, but would be happy for @kdmurray91 to take a look and let me know if I missed something. |
Codecov Report
@@ Coverage Diff @@
## master #415 +/- ##
==========================================
+ Coverage 77.6% 77.61% +<.01%
==========================================
Files 129 131 +2
Lines 9480 9546 +66
==========================================
+ Hits 7357 7409 +52
- Misses 2123 2137 +14
Continue to review full report at Codecov.
|
I think this definitely should go in Bio.jl, but I think the code may need to be split. There are several aspects of this PR, so I can imagine creating MinHashes of sequences goes in @kescobo I think you are right, I've also had a look at CountMin in the past 10 minutes and can see indeed it is different. |
@Ward9250 That makes sense. To me, Phylo makes the most sense for MASH distance, but you have a much better sense of that package and Var, so I'll defer to your judgement there. |
In that case, the first step to refinement of organisation in this PR is to move purely MinHash code to |
@Ward9250 upon further reflection and examination of what's in phylo and var, I changed my mind. I've added the MASH distance functions to var, but I'm not 100% sure how to incorporate it into what you've currently got going on in I know you're actively working on that, I'm wondering if I should let you handle that integration, and if so, wait on tests/docs. |
I'm happy for you to continue with tests and docs and I can then deal with integration subsequently if you'd like - I'm easy 👍 |
@Ward9250, I think you meant to ping @kdmurray91. |
So I did! 😆 |
src/seq/minhash.jl
Outdated
* .sketch: a sorted set of hashes | ||
* .kmersize: the length of kmers used to generate the sketch | ||
""" | ||
type MinHashSketch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be immutable
if possible.
src/seq/minhash.jl
Outdated
|
||
Base.getindex(s::MinHashSketch, part) = getindex(s.sketch, part) | ||
Base.length(s::MinHashSketch) = length(s.sketch) | ||
Base.size(s::MinHashSketch) = (length(s), s.kmersize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks an abuse of overloading. Base.size
should return the dimensions (or sizes) of an array-like data structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there something that would be better (something like dims()
, or should I just not bother?
src/seq/minhash.jl
Outdated
# A seqence and its reverse complement should be the same, so take the smallest | ||
# hash of a seq or its reverse complement. | ||
function revcomphash(kmer::Kmer) | ||
k = minimum((kmer, reverse_complement(kmer))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k = min(kmer, reverse_complement(kmer))
would be better.
src/seq/minhash.jl
Outdated
end | ||
|
||
|
||
function kmerminhash(seq::BioSequence, kmerset, kmerhashes::Vector{UInt64}, k::Int, s::Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By passing k
as a usual argument, you cannot enjoy the full optimization of Julia. kmerminhash{k}(::Type{DNAKmer{k}}, seq, kmerset, kmerhashes, s)
may significantly improve the performance of this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this function updates arguments, the name should end with !
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're misunderstanding this function or I'm misunderstanding you. The k
argument here isn't a kmer, it's an argument to specify the length of kmers used to build the minhash sketch. And it doesn't modify the sequence, it generates all possible kmers from a sequence and builds a sketch using the smallest s
hashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it clear, I wrote a function that is ~10x faster than your implementation: https://gist.github.com/bicycle1885/23f15ac022cff46cb454c9be1a585987. In this function, k
is passed as a type argument and it gives Julia more opportunities to generate specialized code that is specific to the length of kmers. Also, this function modifies the kmerset
and kmerhashes
arguments and hence this should be names like kmerminhash!
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bicycle1885 To your second point (kmerminhash!()
) - I woke up in the middle of the night realizing what you meant! I think my python brain was in force and only thinking about the first argument.
To your first point, it's no surprise that it was me misunderstanding you. I'm not clear why that provides such a performance boost, but the evidence is clear. And that also makes point 2 moot, since you're not doing the weirdness with passing sets around. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bicycle1885 Though now I'm going to modify the code, what about the minhash()
functions that take groups of sequences rather than just one? Will it still be performant if I do something like:
function kmerminhash{k, T<:BioSequence}(::Type{DNAKmer{k}}, seqs::Vector{T}, s::Int)
# generate first `s` kmers
kmerhashes = UInt64[]
iter = [each(DNAKmer{k}, seq) for seq in seqs]
...
?
Or would it make more sense to make a separate iterator function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the purpose of minhash
functions taking multiple sequences is. I thought minhash
is a kind of summary of a sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bicycle1885 That's true, but if the sequence you want a summary of is something like a genome that has multiple contigs, then you need the minhash of all of those contigs together (in principle, you could calculate the minhash of each sequence, and then merge and take the bottom s
hashes from that, but that seems inefficient).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it makes sense. So, what about having an update function like minhash!(sketch::MinHashSketch, seq::BioSequence)
that adds a new sequence to sketch
?
src/seq/minhash.jl
Outdated
h = revcomphash(kmer[2]) | ||
if h < kmerhashes[end] | ||
i = searchsortedlast(kmerhashes, h) | ||
if i == 0 && i != kmerhashes[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i != kmerhashes[1]
compares an index and a hash value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! That should be if i == 0 && h != kmerhashes[1]
.
src/var/mash.jl
Outdated
# License is MIT: https://github.com/BioJulia/Bio.jl/blob/master/LICENSE.md | ||
|
||
function jaccarddistance(sketch1::MinHashSketch, sketch2::MinHashSketch) | ||
d = length(setdiff(sketch1.sketch, sketch2.sketch)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since sketch
es are sorted, I think d
can be computed more efficiently (in O(M+N)
) like the merge step of the merge sort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bicycle1885 I'm not sure about merge sort, but do you mean something like: Iterating through the arrays independently in a loop, and since you know they're sorted you know based on greater than, less than and so on, whether elements are in array A and are not in array B?
@kescobo Note that setdiff gets the values that are in sketch1 and not sketch2, but not the other way around, values in sketch2 that are not in sketch1 will not be returned. If this is what you want all is well, but I thought I'd mention it just in case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bicycle1885 I initially wrote something out to do something like what you're suggesting (you can look here starting at ln 117 for what I tried), but it didn't seem to be any faster. Probably my lack of knowledge for how to do this efficiently is to blame - happy to take guidance on that.
@Ward9250 the calculation is the length of the intersect over the length of the union. setdiff
tells me how many items are in one but not the other, and since we know the size of each set, that's enough information to know how many are shared as well (we don't need to know what's shared, just the number).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kescobo I think the reason it may not have been much faster is the use of shift!
and array copying? I think you can probably do the task you are tying to do in 117 without all the array manipulation, by just keeping track of two indicies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ward9250 Yeah, that makes sense. I tried a number of different data structures, and I think at one point I was trying to get a vector of matches out of that function, but later morphed into just a calculation of the number. I can try to re-work that code and see if it improves things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be great if it's possible to eke out a bit more speed. I drew a draft version of a function that does the same job as setdiff
but is specific to sorted arrays, when I checked out setdiff
with @less
the other day, I can make a gist out of it if it helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ward9250 @bicycle1885 Alas, this change is only 40x faster o_O
https://gist.github.com/kescobo/7341eed840b56ba90d5661da2d745cac
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kescobo That's it! idxes vs. shifts and slices ftw! 😆
src/seq/minhash.jl
Outdated
""" | ||
type MinHashSketch | ||
sketch::Vector{UInt64} | ||
kmersize::Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if having kmer size as a parameter of the type (like array dimensions are a parameter) will afford any benefit here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expound on this? I'm not familiar with making something a parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean doing something like the following gist: https://gist.github.com/Ward9250/1c550e345b742f3287da4f9a2c2fda08
Hi all, Sorry, I am just checking in here. I'll review the comments (& code) here properly tomorrow, but the short answer is that @kescobo is right about CountMin.jl. This PR happens to be the essence of my whole PhD, so expect some verbosity from me in the (aussie) morning! For now, you can take a look at kWIP for our approach to this problem using count-min sketches. I have a preliminary julia re-write of kWIP too, FWIW. Cheers, |
@kdmurray91 Awesome - would be great to have your insight. I'm a total newb at this :-) |
Hey guys, I think this is close - just need some advice on how to move forward with generating sketches that span multiple sequences using @bicycle1885's speedy approach (see here). Does it make sense to extend the |
src/seq/minhash.jl
Outdated
# A seqence and its reverse complement should be the same, so take the smallest | ||
# hash of a seq or its reverse complement. | ||
function revcomphash(kmer::Kmer) | ||
k = min((kmer, reverse_complement(kmer))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k = min(kmer, reverse_complement(kmer))
?
src/seq/minhash.jl
Outdated
function MinHashSketch(sketch::Vector, kmersize::Int) | ||
length(sketch) > 0 || error("Sketch cannot be empty") | ||
kmersize > 0 || error("Kmersize must be greater than 0") | ||
new(sketch, kmersize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use return
when returning a value to clarify your intention.
src/seq/minhash.jl
Outdated
Base.done(s::MinHashSketch, state) = done(s.sketch, state) | ||
|
||
function Base.:(==)(a::MinHashSketch, b::MinHashSketch) | ||
a.kmersize == b.kmersize && a.sketch == b.sketch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return
src/seq/minhash.jl
Outdated
|
||
Generate a MinHash sketch of size `s` for kmers of length `k`. | ||
""" | ||
function minhash(seq::BioSequence, k::Int, s::Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Integer
for k
and s
would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this also true for other places where Int
is passed as the type parameter?
Also, if you don't mind explaining, is Integer
better because it's more generic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is the reason. I'm sometimes frustrated by this kind of restrictions. Most integer values can be converted to Int
, so I think it would be reasonable to make argument types as loose as possible. This is also described in the style guide: http://docs.julialang.org/en/stable/manual/style-guide/#avoid-writing-overly-specific-types.
src/var/mash.jl
Outdated
end | ||
end | ||
end | ||
matches == sketchlen ? (return 1.0) : return matches / (2 * sketchlen - matches) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the if-else clause would be more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, though I thought using the ternary operator is more "standard" in julia.
It's worth noting that this syntax is currently in a bunch of places around the Bio.jl code. Admittedly in most cases, they're in quite short and clear places, but in others, not so much. Eg in banded_needleman_wunch.jl:
@update2_nextcol (i == m ? end_gap_init_a : middle_gap_init_a) (i == m ? end_gap_extend_a : middle_gap_extend_a)
As someone coming from python, I was really confused by this syntax initially, and it's not easy to find info about it. I fully support using if-else in most circumstances, but in that case we should probably be more consistent (and maybe add something to CONTRIBUTING.md if we're breaking from standard julia style).
src/seq/minhash.jl
Outdated
@@ -18,12 +18,12 @@ hashes for kmers of a given length. The type contains two parameters: | |||
""" | |||
immutable MinHashSketch | |||
sketch::Vector{UInt64} | |||
kmersize::Int | |||
kmersize::Integer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be kept as Int
. Loosening the type of a field will propagate to other variables and result in performance degradation. In Julia, calling new
automatically converts the types of arguments to those of fields. So, even if the constructor takes an Integer
value, the value will be converted to Int
when a sketch object is created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I see, but just to be sure, you're saying line 21 there should be Int
, but the constructor on line 23 should be Integer
, yes?
So as a general rule, type fields should be specific where possible, but function arguments should be more permissive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and yes. Function arguments are for users, but field values are for internal use. Arguments should be generic for usability and fields should be specific (or parameterized by type parameters) for performance.
OK, I've implemented methods for Oh, and when I opened this PR I said
I just repeated these procedures, and now going from the FASTA files to a MASH distance is ~10x faster (~250ms) as is getting MASH distance from the sketch (~13 μs). You guys are great! (thanks especially @bicycle1885). |
Weee! It works! @bicycle1885 - I think I managed to address all of your concerns, right? @Ward9250 I played around a bit with your suggested to make |
Sorry for the delayed response all... Regarding the term "sketch": In my mind, this is a generic term for a data structure that holds the essence of some dataset using as few resources as possible. These include bloom filters for presence/absence queries, hyperloglogs for cardinaltiy queries, Count-min sketches (a la countmin.jl) for counting queries, and min-hash or bottom sketches for "summary" queries. So no, the MinHash datastructure is not really a good fit for CountMin.jl, nor is CountMin.jl useful for this PR. However, one thing I've been thinking of is making a SketchingDataStructures.jl package, that would hold a bloom filter, hyperloglog, min-hash and count-min sketch, with a consistent API, and importantly with good serialisation support via HDF5. PhD-related time poverty has prevented this thus far, sorry! As to how mash fits in to all this, MASH's key (IMHO) innovation is applying min-hash sketches to alignment free seq. comparison. They use a metric derived from AAF, which does a similar thing, but using jellyfish to count the entire dataset, not a subset. For full disclosure, I have a paper using a completely different metric derived from the D2 statistic, and using (what amount to) count-min sketches. As I said, I'm currently writing a thesis on this field, so ping any Q's my way, and I'll try not to take so long to get back to you. I'll take a look through the code now, but it sounds as though @bicycle1885 as already been over it with a sharper eye than I could :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, though a couple of features seem duplicated.
src/seq/minhash.jl
Outdated
|
||
# A seqence and its reverse complement should be the same, so take the smallest | ||
# hash of a seq or its reverse complement. | ||
function revcomphash(kmer::Kmer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that implented somewhere as canonical(kmer::Kmer)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly, though now that I look at it, I realize that the original MASH paper did the lexicographic minimum - eg the lowest by alphabetical order. I didn't look at the min()
function as it applies to kmers - might have to convert them to strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would take more memory and add conversion cost though :( Can we implement lexicographic sorting functionality for the kmers with methods that are used by the julia sort interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ward9250 I suspect we can- it would just require a custom sort order to make sure A < C < G < T.
But in any case, this is only necessary to make our sketches compatible with sketches generated by other software, and this is a bigger project that also requires the ability to select the specific hash function (see eg here). I don't think it's necessary to implement in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it just be hash(canonical(kmer))
? The canonical kmer is the lexographically smaller of the kmer & revcomp(kmer), by definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# This file is a part of BioJulia. | ||
# License is MIT: https://github.com/BioJulia/Bio.jl/blob/master/LICENSE.md | ||
|
||
function jaccarddistance(sketch1::MinHashSketch, sketch2::MinHashSketch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coudn't this be obtained from Distances.jl? Then it could be jaccarddistance(s1, s2) = Distances.jaccard(s1.sketch, s2.sketch)
or whatever
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the way Distances.jl implements this is 1 - sum(min(x, y)) / sum(max(x, y))
, which I think will be less efficient here. The version written here takes advantage of the fact that we know sketches are (1) sets, (2) the same length and (3) sorted. Besides, probably not worth adding another dependency for this one function right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kescobo I expected the Distances methods would be slower since your function is tailored to the type.
Just to be clear, I think this PR is ready pending @bicycle1885 checking that I addressed the concerns he raised in his review. |
Looks good @kescobo 👍 |
LGTM 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still wondering whether there is a better interface for multiple sequences, but I think it's fine to merge this now. Can you update NEWS.md?
# | ||
# This file is a part of BioJulia. | ||
# License is MIT: https://github.com/BioJulia/Bio.jl/blob/master/LICENSE.md | ||
using DataStructures: SortedSet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed?
The NEWS update would have caused a conflict with the ABIF News update, so I rebased and fixed. |
This is a PR for an idea I brought up months ago in gitter. I'd like to add a small module to calculate the MASH distance between two genomes.
Briefly, the theory is that you generate a hash for all kmers of size
k
, but only store a "bottom sketch" of a particular genome of sizes
- that is thes
smallest hashes. If you store the hashes in sorted order, you only need to check if each subsequent hash is larger than a single value, so this is quite fast. The genome distance for two genomes is then essentially the Jaccard index of the minhashes, with some additional modification to account for the size of the kmers used.I have an implementation that I've validated and is quite fast - from two FASTA files of ~5 Mbp each to MASH distance takes ~2 sec on my 3 year old macbook pro. Most of this time is spent generating the sketch, but once a sketch is made, the comparison between two genomes takes less than a millisecond (so adding additional genomes scales very well).
I'm ready to start adding documentation and tests, but thought I should open a PR and get some guidance on where this fits in Bio.jl, and if I should add anything else to the module.
Minhash
MASH Distance
Still to-do