[SPARK-18334] MinHash should use binary hash distance #15800
Conversation
|
Using this as the hash distance for near-neighbor search doesn't make sense to me. If there aren't enough candidates where the distance is zero, we'll select some candidates whose distance is one. But these are just random candidates, since a distance of one doesn't correspond to being similar at all, if my understanding is correct. Does MinHash really fit the abstraction of multi-probing? I notice that they only use the hyperplane projection method in this paper. |
|
@sethah Not exactly. Based on the logic of the current implementation, I don't think multi-probing works well on MinHash. Multi-probing mostly benefits LSH families like hyperplane projection and bit sampling. |
|
Good point. Maybe we can log a warning when multi-probing is called with MinHash, to say that it will result in running brute-force kNN when there aren't enough candidates.
|
(Updated after reading the multi-probe LSH paper) I agree "multiple probing" does not really make sense with MinHash, since "multiple probing" relies on perturbations of the hash function. However, I still believe that averaging indicators is the best option for computing hashDistance. The argument below is still valid: For each hash function, the probability of a 1 (collision) is equal to the Jaccard similarity coefficient: [https://en.wikipedia.org/wiki/MinHash], which is the exact distance we want to compute. Taking the average of these 0/1 collision indicators gives you an estimate of the probability of a collision for a random hash function, i.e., the Jaccard similarity, which is exactly what we want to sort by to find neighbors. (This is why I suggested using a sum of indicators for hashDistance.) |
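For concreteness, a minimal sketch of what an averaged-indicators `hashDistance` could look like (a sketch only, not the code in this PR; it mirrors the dense-vector assumption of the existing MinHash implementation):

```scala
import org.apache.spark.ml.linalg.Vector

// Sketch: hashDistance as the fraction of hash functions on which the two
// keys do NOT collide. Each indicator is 1.0 when the corresponding MinHash
// values differ, so the mean estimates 1 - (Jaccard similarity).
def hashDistance(x: Vector, y: Vector): Double = {
  val xs = x.toDense.values
  val ys = y.toDense.values
  xs.zip(ys).map { case (a, b) => if (a == b) 0.0 else 1.0 }.sum / xs.length
}
```

Sorting candidates by this value in ascending order is equivalent to sorting by the estimated Jaccard similarity in descending order.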
  * Reference:
  * [[https://en.wikipedia.org/wiki/Perfect_hash_function Wikipedia on Perfect Hash Function]]
  *
  * a perfect hash.
@jkbradley is it not more confusing to not have any reference or further explanation here? You mentioned in #15148 (comment) to remove this, but should we not have some doc here or a better reference instead of nothing?
I don't think it's technically a perfect hash function, because whether it is "perfect" depends on the number of buckets used, right?
(But I do like the idea of having more references.)
Maybe:
/**
* Model produced by [[MinHash]], where multiple hash functions are stored. Each hash function is
* a perfect hash function for a specific set `S` with cardinality equal to `numEntries`:
 * `h_i(x) = ((x \cdot k_i) \mod prime) \mod numEntries`
 */
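For intuition only, a tiny standalone sketch of how that per-element hash would be combined into one MinHash value (the `prime` value, the coefficient, and the min-over-elements step are illustrative assumptions here, not Spark's exact constants or code):

```scala
// Illustrative sketch of h_i(x) = ((x * k_i) mod prime) mod numEntries,
// applied to every element index of a set and reduced with min.
val prime = 2038074743L    // stand-in large prime, not necessarily Spark's
val numEntries = 1 << 20   // stand-in number of buckets

def minHashValue(elemIndices: Set[Int], k: Long): Long =
  elemIndices.map(x => ((x * k) % prime) % numEntries).min

// e.g. minHashValue(Set(1, 5, 9), k = 123456789L)
```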
Fixed. Thanks for all of your suggestions!
|
jenkins add to whitelist |
|
Test build #68405 has finished for PR 15800 at commit
|
|
@jkbradley Your updated summary above is in line with my view as well - that "multi-probing" as described in the paper doesn't translate exactly to MinHash, but that it does make sense to use "nearby" points as candidates for the distance measure you proposed. |
|
@jkbradley Averaging indicators makes more sense for an AND-amplified MinHash function. The hash distance is 0 when all hash values are equal, and grows as more hash values differ. As we are moving to … |
|
I'm not convinced that it is useful to think about AND and OR amplification for MinHash for approxNearestNeighbors. Do you have a reference describing it? I just can't think of a better method than averaging indicators. |
|
Test build #68426 has finished for PR 15800 at commit
|
|
@jkbradley There are 2 reasons I don't think averaging indicators is a good hashDistance for the current implementation. When going with …, the current implementation is the case where the size of the Vector = 1; in other words, it takes the minimum over indicators of whether corresponding hash values are equal. |
|
Test build #68428 has finished for PR 15800 at commit
|
|
Limiting this discussion to MinHash: say we construct a hash table of L x K functions (for doing both OR and AND amplification). For approxNearestNeighbors, we need to ask: "What is the best scalar value we can compute from these L*K values to approximate the distance between keys?"
Assume that our L x K functions are chosen independently at random (which is what we are doing now). Then I claim that, to compare distances between 2 points (1 query + 1 in the dataset), there is no better way to combine these functions than to average all L*K 0/1 collision indicators into a single estimate.
Rough argument:
|
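Spelling out my reading of that intuition in equations (assuming the L x K hash functions are drawn independently; this is a paraphrase, not text from the paper):

```latex
I_{l,k} = \mathbf{1}\left[ h_{l,k}(q) = h_{l,k}(x) \right],
\qquad \mathbb{E}\left[ I_{l,k} \right] = J(q, x)

\hat{J}(q, x) = \frac{1}{LK} \sum_{l=1}^{L} \sum_{k=1}^{K} I_{l,k},
\qquad \operatorname{Var}\left[ \hat{J}(q, x) \right] = \frac{J(q,x)\,\bigl(1 - J(q,x)\bigr)}{LK}
```

Each collision indicator is Bernoulli with mean equal to the Jaccard similarity, so pooling all L*K indicators gives an unbiased estimate whose variance shrinks as the number of hash functions grows.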
|
Hi @jkbradley,
For MinHash, we can increase … For example, in a 10 * 10 MinHash for MultiProbe NN search, my understanding is that your method will only have 1 … In general, I agree with your rough argument, but I want to add the following:
|
  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
    // Since it's generated by hashing, it will be a pair of dense vectors.
-   x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+   if (x.toDense.values.zip(y.toDense.values).exists(pair => pair._1 == pair._2)) {
Why just 0 and 1? I think the more pairs of values that are the same, the closer the two vectors are, right?
See discussion above :)
I think I agree more with the comment from @jkbradley at #15800 (comment), if I understand some of the terms here correctly.
Does "indicator" mean a match between the hash values of the two vectors under one hash function, i.e., h_i?
If that understanding is correct, I think averaging indicators should be the right way to compute MinHash's hash distance.
|
I agree with @jkbradley's suggested approach. One key point here (for MinHash): if a query point vector q hashes to some MinHash vector [5.0, 22.0, 13.0], the best candidates will be ones that hash to that same vector - I think we all agree. Now, if we wish to search for other candidates that are similar to q but do not hash to exactly that hash vector, we should not think of searching "nearby" buckets. A vector x1 which hashes to [5.0, 23.0, 13.0] is no closer than a vector x2 which hashes to [5.0, 739.0, 13.0], though they are both more likely to be near-neighbors than something which has zero bucket collisions. The individual values have binary similarities, but looking at the entire vector we can use the total number of individual collisions as an aggregate measure of closeness. This is my understanding, and I think Joseph's suggestions are correct. Though I did not follow the second half of @Yunni's post... |
My second half is suggesting: if a query point vector q hashes to a MinHash array …
… to increase the chance of finding the exact k nearest neighbors. |
|
I think that we would have the following hash distance signature: `def hashDistance(x: Vector, y: Vector): Double`. Then in … |
|
I agree @sethah and I are on the same page. Two clarifications about @Yunni's post:
I also just realized something else: For approxNearestNeighbors with multiple probing, we should not sort the entire dataset. Shall we switch to something else which will avoid sorting all rows, such as using approxQuantiles to pick a threshold? I'm OK with this improvement coming in a later release. If you agree, I'll make a JIRA. |
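If we went the quantile route, it could look roughly like this (a sketch under assumptions: the column name, the 5% fraction, and the helper name are placeholders, not existing Spark ML code):

```scala
import org.apache.spark.sql.DataFrame

// Sketch: pick a hashDistance threshold with approxQuantile instead of
// sorting every row, then keep only the rows below that threshold.
def selectCandidates(withDist: DataFrame, hashDistCol: String): DataFrame = {
  val Array(threshold) =
    withDist.stat.approxQuantile(hashDistCol, Array(0.05), 0.01)  // 5th percentile, 1% relative error
  withDist.filter(withDist(hashDistCol) <= threshold)
}
```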
|
@sethah That sounds good to me, except that there is no …
Yes, I mean we average K 0/1 indicators for each Vector, and get L averaged indicators. Do you agree with this part? |
I'm afraid I still disagree. There's a fundamental difference between how we are computing nearest neighbors and how the LSH literature computes nearest neighbors: the literature builds explicit hash tables and probes buckets in sequence, whereas our implementation computes a hashDistance for every row and selects candidates in a single pass.
Say we take the L x K indicators and can either: (a) average the K indicators within each of the L hash vectors to get L separate estimates, selecting candidates for each one, or (b) average all L*K indicators to get a single pooled estimate and select candidates once.
With this setup, we do the same amount of computation and work with the same number of selected points. The question is then: which of (a) or (b) will give higher precision and recall? I'll just give an intuitive argument.
|
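To make the comparison concrete, a rough sketch of the two combiners over a precomputed L x K indicator matrix (plain Scala; the names and layout are mine, not Spark API):

```scala
// indicators(l)(k) == 1.0 iff the query and the candidate row collide on the
// k-th hash function of the l-th hash vector.

// (a) L separate estimates: average the K indicators within each hash vector,
//     then select candidates per hash vector.
def perVectorEstimates(indicators: Array[Array[Double]]): Array[Double] =
  indicators.map(row => row.sum / row.length)

// (b) one pooled estimate: average all L*K indicators and select candidates once.
def pooledEstimate(indicators: Array[Array[Double]]): Double = {
  val flat = indicators.flatten
  flat.sum / flat.length
}
```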
|
@sethah I misread what you wrote earlier: I still want to compute a single estimate, rather than L separate ones. |
I think this might be the place where we are not on the same page. I consider the output of (a)/(b) as our "probing sequence" (or "probing buckets"), and in the next step we pick and return the k keys with the smallest distance in those buckets. Do you agree with this part? If you agree, then I claim more duplicates (it's actually redundancy rather than duplication) bring a better chance of finding the correct k nearest neighbors, because we enlarge our search range. If you disagree, then I think we are not discussing the same NN search implementation (one that differs from the current implementation), and I would like to know how you would return the k nearest neighbors after (b). |
|
You're talking about enlarging search ranges, or iterations, a few times, but I really do not think it makes sense to have multiple iterations in Spark. Iterations are useful for avoiding an expensive sort. But in Spark, it'd be more efficient to compute approximate quantiles to avoid the sort. In both (a) and (b), you come up with some set of candidates. I was assuming we would compute keyDistance for those candidates and pick the top ones, just as in the current implementation. |
|
@jkbradley I agree with your idea to get rid of the full sorting and use …
Enlarging search ranges does not necessarily mean iterations. The same threshold logic for (a) gives a larger search range than for (b). Do you agree with this?
Agreed on this part. BTW, as one concrete example, you can run …
|
|
@Yunni Spark DF should have a … |
|
@jkbradley Thanks for clarifying, I see your argument now. I agree that it makes sense from a statistical perspective. Still, I have not seen a single paper that describes anything quite exactly like what we're proposing. I would be ok disabling the multi-probe option for the 2.1 release, so we could carry on this discussion and continue hashing out (pun intended :) the APIs. It is my understanding that the main benefit of multi-probe described in the reference paper is to cut down the storage space required by computing many hash tables, but we are not actually storing the entire hash table as a data structure so our implementation is a bit different. I think there's room for discussion/tests about what the benefits are and how drastically they impact performance. |
|
@MLnick Thanks! That's very good to know! @sethah I agree with your comments. @jkbradley If you have no objections, shall I remove MultiProbe NN Search and … |
|
@Yunni I guess we should remove it from the public API. I'm OK with leaving the code there and making it private for now. One response:
If you use the same threshold for both, then I agree. But that's not a reasonable comparison, since (a) will do many times more work and communicate many times more data (up to L times more). This will happen when you do posexplode. If you compare the two where each selects the same number of rows (on which to compute the keyDistance and select neighbors), then (b) will select many more candidates since it will not have duplicates.

Also, one new comment: I'm testing vs the current implementation (min(abs(query bucket - row bucket))). Weirdly, the current one is getting consistently better results than my proposal, even though this does not make sense to me statistically (and even though the current implementation isn't what any of us are proposing to use!). I'm still banging my head against this...

UPDATE: Problem solved. The difference is that, for the current approxNearestNeighbors unit test, the current hashDistance function results in 56 rows being considered here: [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala#L153]. This is because lots of rows have the same hashDistance of 0.0 with the current implementation. When modified to use the average of indicators, there is a much broader range of distances, so only 26 rows are considered.

Proposal: I suggest modifying the above line to limit to numNearestNeighbors, or some multiple of numNearestNeighbors. Otherwise, approxNearestNeighbors could compute keyDistance for the entire dataset (in the extreme case). |
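A sketch of that proposal (the 2x multiplier and the column name are placeholders I chose, not something from LSH.scala):

```scala
import org.apache.spark.sql.DataFrame

// Sketch: cap the candidate set at a small multiple of numNearestNeighbors
// instead of keeping every row that beats the hashDistance threshold.
def capCandidates(withDist: DataFrame,
                  hashDistCol: String,
                  numNearestNeighbors: Int): DataFrame =
  withDist
    .sort(hashDistCol)                 // smallest hashDistance first
    .limit(2 * numNearestNeighbors)    // 2x is an arbitrary safety margin
```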
|
OK. Abandoning this PR, since we are making MultiProbe NN Search and … |
What changes were proposed in this pull request?
MinHash currently uses the same `hashDistance` function as RandomProjection. This does not make sense for MinHash, because the Jaccard distance of two sets is not related to the absolute distance between their hash bucket indices. This bug could affect the accuracy of multi-probe NN search for MinHash.
The MinHash hash distance should just be binary, since there is no meaningful distance between bucket indices.
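For reference, the binary distance amounts to something like the following sketch (consistent with the diff shown in the review thread above, though the exact code may differ):

```scala
import org.apache.spark.ml.linalg.Vector

// Binary hash distance: 0.0 if the two keys collide on at least one hash
// function, 1.0 otherwise.
def hashDistance(x: Vector, y: Vector): Double = {
  val collides = x.toDense.values.zip(y.toDense.values).exists { case (a, b) => a == b }
  if (collides) 0.0 else 1.0
}
```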
How was this patch tested?
An incorrect unit test had been introduced previously; it is fixed in this PR.