Conversation

@Yunni (Contributor) commented Nov 7, 2016

What changes were proposed in this pull request?

MinHash currently uses the same hashDistance function as RandomProjection. This does not make sense for MinHash, because the Jaccard distance between two sets is unrelated to the absolute distance between their hash bucket indices. This bug could affect the accuracy of multi-probe NN search for MinHash.

MinHash's hash distance should just be binary, since there is no notion of distance between buckets.

How was this patch tested?

An existing unit test was incorrect; it is fixed in this PR.

@sethah (Contributor) commented Nov 8, 2016

Using this as the hashing distance for near-neighbor search doesn't make sense to me. If there aren't enough candidates where the distance is zero, we'll select some candidates with distance one. But these are just random candidates, since a distance of one doesn't correspond to being similar at all, if my understanding is correct. Does MinHash really fit the abstraction of multi-probing? I notice that they only use the hyperplane projection method in this paper.

@Yunni (Contributor, Author) commented Nov 8, 2016

@sethah Not exactly. Based on the logic in approxNearestNeighbors, if there aren't enough candidates where the distance is zero, we'll scan the whole dataset.

I don't think multi-probing works well on MinHash. Multi-probing mostly benefits LSH families like hyperplane projection and bit sampling.

@sethah (Contributor) commented Nov 8, 2016

Good point. Maybe we can log a warning when multi-probing is called with MinHash, saying that it will fall back to running brute-force kNN when there aren't enough candidates.

@jkbradley (Member) commented Nov 8, 2016

(Updated after reading the multi-probe LSH paper)

I agree "multiple probing" does not really make sense with MinHash since "multiple probing" relies on perturbations of the hash function.

However, I still believe that averaging indicators is the best option for computing hashDistance. The argument below still holds:

For each hash function, the probability of a 1 (collision) is equal to the Jaccard similarity coefficient: [https://en.wikipedia.org/wiki/MinHash], which is the exact distance we want to compute.

Taking the average of these 0/1 collision indicators gives you an estimate of the probability of a collision for a random hash function, i.e., the Jaccard similarity, which is exactly what we want to sort by to find neighbors. (This is why I suggested using a sum of indicators for hashDistance.)
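The estimator described above can be sketched in Python (illustrative only; `jaccard`, `minhash`, and `estimated_jaccard` are hypothetical helper names, and a simple multiplicative hash stands in for Spark's actual MinHash implementation):

```python
import random

PRIME = 2038074743  # a large prime, chosen here for illustration

def jaccard(a, b):
    # Exact Jaccard similarity of two sets
    return len(a & b) / len(a | b)

def minhash(s, seed):
    # One MinHash value: minimum of a random multiplicative hash over the set
    k = random.Random(seed).randrange(1, PRIME)
    return min((x * k) % PRIME for x in s)

def estimated_jaccard(a, b, num_hashes=500):
    # Average of 0/1 collision indicators over independent hash functions
    hits = sum(minhash(a, i) == minhash(b, i) for i in range(num_hashes))
    return hits / num_hashes

a = set(range(1, 61))   # {1, ..., 60}
b = set(range(31, 91))  # {31, ..., 90}; true Jaccard = 30/90 = 1/3
```

With enough hash functions, `estimated_jaccard(a, b)` concentrates around the true Jaccard similarity, which is exactly the quantity we want to sort neighbors by.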

* Reference:
* [[https://en.wikipedia.org/wiki/Perfect_hash_function Wikipedia on Perfect Hash Function]]
*
* a perfect hash.
Contributor:

@jkbradley is it not more confusing to not have any reference or further explanation here? You mentioned in #15148 (comment) to remove this, but should we not have some doc here or a better reference instead of nothing?

Member:

I don't think it's technically a perfect hash function, because its being "perfect" depends on the number of buckets used, right?

@jkbradley (Member) commented Nov 9, 2016

(But I do like the idea of having more references.)

Contributor:

Maybe:

/**
 * Model produced by [[MinHash]], where multiple hash functions are stored. Each hash function is
 * a perfect hash function for a specific set `S` with cardinality equal to `numEntries`:
 *    `h_i(x) = ((x \cdot k_i) \mod prime) \mod numEntries`
 */
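As a side note, that hash family can be sketched in Python (hypothetical names; the prime below is arbitrary and the real implementation's constants may differ):

```python
import random

PRIME = 2038074743  # a large prime; chosen here for illustration only

def min_hash_function(seed, num_entries):
    # One function from the family h_i(x) = ((x * k_i) % prime) % numEntries,
    # applied to each set element and reduced with min.
    k = random.Random(seed).randrange(1, PRIME)
    return lambda indices: min(((x * k) % PRIME) % num_entries for x in indices)

h = min_hash_function(seed=42, num_entries=1 << 20)
bucket = h([1, 5, 9])  # MinHash bucket of the set with indices {1, 5, 9}
```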

@Yunni (Contributor, Author):

Fixed. Thanks for all of your suggestions!

@MLnick (Contributor) commented Nov 9, 2016

jenkins add to whitelist

@SparkQA commented Nov 9, 2016

Test build #68405 has finished for PR 15800 at commit a3cd928.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Nov 9, 2016

@jkbradley Your updated summary above is in line with my view as well - that "multi-probing" as described in the paper doesn't translate exactly to MinHash, but that it does make sense to use "nearby" points as candidates for the distance measure you proposed.

@Yunni (Contributor, Author) commented Nov 9, 2016

@jkbradley Averaging indicators makes more sense for an AND-amplified MinHash function: the hash distance is 0 when all hash values are equal, and grows as more hash values differ.

As we are moving to Array[Vector] as our output type, I think averaging indicators is good for comparing within a Vector (AND-amplification), but we still need binary distance across the Array (OR-amplification).

@jkbradley (Member):

I'm not convinced that it is useful to think about AND and OR amplification for MinHash for approxNearestNeighbors. Do you have a reference describing it? I just can't think of a better method than averaging indicators.

@SparkQA commented Nov 9, 2016

Test build #68426 has finished for PR 15800 at commit c8243c7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Yunni (Contributor, Author) commented Nov 10, 2016

@jkbradley There are two reasons I don't think averaging indicators is a good hashDistance for the current implementation:
(1) SingleProbe NN performance relies on OR-amplification; changing to averaging indicators will increase the false negative rate and hurt the accuracy of SingleProbe.
(2) Amplification is a construction method for any LSH family (see 3.6.3 of http://infolab.stanford.edu/~ullman/mmds/ch3.pdf). I think it's a good abstraction to consider the current implementation as OR-amplification and then move to an AND/OR compound.

When going with Array[Vector] as our output type, I think we need to change hashDistance(x: Array[Vector], y: Array[Vector]) to the following:
(1) ScalarRandomProjectionLSH: the minimum of the Euclidean distances between corresponding hash vectors
(2) MinHashLSH: the minimum of the averaged indicators between corresponding hash vectors

The current implementation is the case where each Vector has size 1; in other words, the minimum of whether corresponding hash values are equal.
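The two proposed hashDistance variants can be sketched in Python (hypothetical function names; averaged indicators are expressed as a distance, i.e. the fraction of mismatching values):

```python
import math

def hash_distance_brp(x, y):
    # ScalarRandomProjectionLSH: minimum Euclidean distance over corresponding hash vectors
    return min(math.dist(xv, yv) for xv, yv in zip(x, y))

def hash_distance_minhash(x, y):
    # MinHashLSH: minimum over vectors of (1 - average of 0/1 match indicators)
    return min(sum(a != b for a, b in zip(xv, yv)) / len(xv)
               for xv, yv in zip(x, y))

x = [[5.0, 10.0], [22.0, 7.0]]
y = [[5.0, 11.0], [9.0, 7.0]]
# Each vector pair above matches on exactly one of its two values,
# so the MinHash distance is 0.5 for both pairs.
```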

@SparkQA commented Nov 10, 2016

Test build #68428 has finished for PR 15800 at commit 6aac8b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member):

Limiting this discussion to MinHash:

Say we construct a hash table of L x K functions (for doing both OR and AND amplification). For approxNearestNeighbors, we need to ask, "What is the best scalar value we can compute from these L*K values to approximate the distance between keys?"

  • I think this is one point of confusion in previous discussion: I am talking only about nearest neighbors, which sorts on a scalar estimate of distance. Things are different when you want to build a data structure with physical hash buckets.

Assume that our LxK functions are chosen independently at random (which is what we are doing now).

Then I claim that, to compare distances between 2 points (1 query + 1 in the dataset), there is no better way to combine these functions than to:

  • Compute LxK binary 0/1 indicators of whether the 2 points match on each hash function
  • Average these LxK 0/1 indicators

Rough argument:

  • For a given scalar hash function (without amplification), there is no concept of "distance" between hash buckets. (This is different from P-Stable LSH.)
  • The distance metric we want to sort by is Jaccard similarity.
  • The average of these indicators is an efficient estimator (in the statistical sense) of the Jaccard similarity.

@Yunni (Contributor, Author) commented Nov 10, 2016

Hi @jkbradley,
I agree with your claim about estimating Jaccard similarity, but it looks like your L and K have the same effect on performance. Consider a case where we want to trade running time (#rows to sort) for more accurate k nearest neighbors. In this case,

  • Using the average of the indicators as the probing sequence can minimize the rows to sort, but may have a higher false negative rate.
  • False negatives always have more negative impact on accuracy than false positives.

For MinHash, we can increase L (the dimension of OR-amplification) and do the following:
(1) For the search key, check all L buckets that contain the key.
(2) If there are not enough candidates, check all buckets that are 1 step away from the buckets in (1).
(3) If there are still not enough, search buckets that are 2 steps away from the buckets in (2).
......
In each step, we are searching all probable buckets, which gives us a better chance of including the exact k-NN in our search range.

For example, in 10 * 10 MinHash for multi-probe NN search, my understanding is that your method will have only 1 indicator for each row and do the following:
(1) Search all rows with indicator = 0.00
(2) Search all rows with indicator = 0.01
I would suggest we have 10 indicators for each row, and do the following:
(1) Search all rows with any indicator = 0.0
(2) Search all rows with any indicator = 0.1
Even if an indicator is an outlier, we still won't miss it in this case.

In general, I agree with your rough argument, but I want to add the following:

  • An efficient estimator of similarity is not necessarily the best probing sequence. A probing sequence with a larger search range (more bucket collisions) can help us get higher accuracy.
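The two candidate-selection rules being debated can be sketched in Python (illustrative, hypothetical names): pooling all L*K indicators into one distance per row, versus admitting a row if ANY of its L per-vector distances clears the threshold:

```python
def candidates_single_estimate(query, rows, threshold):
    # (b)-style: pool all L*K indicators into one distance per row
    out = []
    for rid, hashes in rows:
        mismatches = sum(a != b for hv, qv in zip(hashes, query)
                         for a, b in zip(hv, qv))
        total = sum(len(hv) for hv in hashes)
        if mismatches / total <= threshold:
            out.append(rid)
    return out

def candidates_any_vector(query, rows, threshold):
    # (a)-style: keep a row if ANY of its L per-vector distances is within threshold
    out = []
    for rid, hashes in rows:
        dists = [sum(a != b for a, b in zip(hv, qv)) / len(hv)
                 for hv, qv in zip(hashes, query)]
        if min(dists) <= threshold:
            out.append(rid)
    return out

query = [[5.0, 10.0], [22.0, 7.0]]
rows = [("r1", [[5.0, 10.0], [0.0, 0.0]]),  # one hash vector matches exactly
        ("r2", [[5.0, 0.0], [22.0, 0.0]])]  # partial matches in both vectors
```

At threshold 0.0 the any-vector rule already admits "r1" (one exact bucket collision), while the pooled rule admits nothing; this is the enlarged search range being argued for.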

override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
  // Since it's generated by hashing, it will be a pair of dense vectors.
-  x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+  if (x.toDense.values.zip(y.toDense.values).exists(pair => pair._1 == pair._2)) {
Member:

Why just 0 and 1? I think the more pairs of values that are the same, the closer the two vectors are, right?

Contributor:

See discussion above :)

Member:

I think I agree more with the comment from @jkbradley at #15800 (comment), if I understand some of the terms here correctly.

Does "indicator" mean a match between the hash values of two vectors under one hash function, i.e., h_i? If that understanding is correct, I think averaging indicators is the right way to compute MinHash's hash distance.

@sethah (Contributor) commented Nov 10, 2016

I agree with @jkbradley's suggested approach. One key point here (for MinHash):

If a query point vector q hashes to some MinHash vector [5.0, 22.0, 13.0], the best candidates will be the ones that hash to that same vector - I think we all agree. Now, if we wish to search for other candidates that are similar to q but do not hash to exactly that vector, we should not think of searching "nearby" buckets: a vector x1 that hashes to [5.0, 23.0, 13.0] is no closer than a vector x2 that hashes to [5.0, 739.0, 13.0], though both are more likely to be near-neighbors than something with zero bucket collisions. The individual values have binary similarities, but looking at the entire vector we can use the total number of individual collisions as an aggregate measure of closeness.

This is my understanding, and I think Joseph's suggestions are correct. Though I did not follow the second half of @Yunni's post...

@Yunni (Contributor, Author) commented Nov 10, 2016

If a query point vector q hashes to some MinHash Vector [5.0, 22.0, 13.0] the best candidates will be ones that hash to that same vector.

My second half is suggesting: If a query point vector q hashes to MinHash Array [[5.0, 10.0], [22.0, 7.0], [13.0, 25.0]], the buckets to search will be ones that hash to any of the MinHash vectors like

  • [[5.0, 10.0], [39.0, 7.0], [18.0, 99.0]]
  • [[100.0, 77.0], [22.0, 7.0], [13.0, 2.0]]
  • [[78.0, 40.0], [96.0, 55.0], [13.0, 25.0]]
  • ......

This increases the chance of finding the exact k nearest neighbors.

@sethah (Contributor) commented Nov 10, 2016

I think that we would have the following hash distance signature:

def hashDistance(x: Vector, y: Vector): Double

Then in approxNearestNeighbors we would explode the Array[Vector] column, apply the hash distance, and sort on that distance. That way we first select any point where ANY of the L g_l(x) vectors match, and only after that do we consider hashes that are close in the distance measure.
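That explode-then-sort flow could look roughly like this single-machine Python sketch (hypothetical names; the real version would use DataFrame operations):

```python
def frac_mismatch(a, b):
    # Per-vector MinHash distance: fraction of hash values that differ
    return sum(x != y for x, y in zip(a, b)) / len(a)

def approx_nn_by_explode(query_hashes, rows, hash_distance, k):
    # "Explode" each row's Array of hash vectors, score each (row, position)
    # pair against the query vector at the same position, keep each row's best
    # (minimum) distance, and sort rows by it.
    best = {}
    for rid, hashes in rows:
        for pos, hv in enumerate(hashes):
            d = hash_distance(hv, query_hashes[pos])
            best[rid] = min(best.get(rid, float("inf")), d)
    return sorted(best, key=best.get)[:k]

query = [[5.0, 10.0], [22.0, 7.0]]
rows = [("a", [[5.0, 10.0], [1.0, 1.0]]),  # one exact vector match
        ("b", [[5.0, 1.0], [22.0, 1.0]]),  # partial matches only
        ("c", [[1.0, 1.0], [1.0, 1.0]])]   # no matches
```

Row "a" sorts first because one of its vectors matches exactly, before partially matching rows are considered.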

@jkbradley (Member):

I agree @sethah and I are on the same page. Two clarifications about @Yunni's post:

  • I'm not sure what you mean by "your method will only have 1 indicator for each row." I'm proposing to compute some number of buckets (which I called "LxK" above), computing indicators for each, and averaging the indicators.
  • I am not proposing multiple iterations of searching, but sorting by hash distance would effectively do those iterations in a single sort.

I also just realized something else: For approxNearestNeighbors with multiple probing, we should not sort the entire dataset. Shall we switch to something else which will avoid sorting all rows, such as using approxQuantiles to pick a threshold? I'm OK with this improvement coming in a later release. If you agree, I'll make a JIRA.
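The quantile-based alternative to a full sort could look like this rough single-machine sketch (hypothetical names; Spark's approxQuantile works on a DataFrame column, and the sampling here stands in for its sketch-based algorithm):

```python
import random

def select_by_approx_quantile(rows_with_dist, fraction, seed=0, sample_size=1000):
    # Pick an approximate distance cutoff from a sample, then filter in a
    # single pass instead of sorting the whole dataset.
    rng = random.Random(seed)
    dists = [d for _, d in rows_with_dist]
    sample = dists if len(dists) <= sample_size else rng.sample(dists, sample_size)
    cutoff = sorted(sample)[int(fraction * (len(sample) - 1))]
    return [rid for rid, d in rows_with_dist if d <= cutoff]

rows = [(i, i / 100) for i in range(100)]
selected = select_by_approx_quantile(rows, fraction=0.1)  # roughly the closest 10%
```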

@Yunni (Contributor, Author) commented Nov 10, 2016

@sethah That sounds good to me, except that there is no posexplode() in Spark AFAIK. Do you think hashDistance(x: Array[Vector], y: Array[Vector]) is a better workaround, or should we still use hashDistance(x: Vector, y: Vector) and implement our own posexplode()?

@jkbradley

I'm proposing to compute some number of buckets (which I called "LxK" above), computing indicators for each, and averaging the indicators.

Yes, I mean we average K 0/1 indicators for each Vector, and get L averaged indicators. Do you agree with this part?

@jkbradley (Member):

Yes, I mean we average K 0/1 indicators for each Vector, and get L averaged indicators. Do you agree with this part?

I'm afraid I still disagree. There's a fundamental difference between how we are computing nearest neighbors and how the LSH literature computes nearest neighbors:

  • LSH: This literature is all about data structures where you want to look in a set of buckets and nowhere else. There is no sorting.
  • Us: We are sorting the entire dataset based on a distance estimate.

Say we take LxK indicators and can either:

  • (a) average groups of K to get L distance estimates. Then select the top M points for each estimate, i.e. L*M points total.
  • (b) average all of them to get 1 distance estimate. Then select the top L*M points total.

With this setup, we do the same amount of computation and work with the same amount of selected points.

The question is then: Which of (a) or (b) will give higher precision and recall?

I'll just give an intuitive argument.

  • One way to look at it is that (a) will contain many duplicates in the L sets of points, so (b) is more likely to have higher precision and recall.
  • Another is:
    • We are effectively trying to lay out all keys on a number line, where we are placing them at our estimate of the Jaccard similarity. We then set a threshold and pick all keys to the right of that threshold.
    • Method (b) is putting each key at the estimate computed from L*K indicators.
    • Method (a) is putting each key at the estimate computed as max (over L estimates) of the estimate computed from K indicators.
    • Said this way, it is clear that (b) gives a better estimate.
    • Assuming we pick L*M candidate keys for each method, then (a) will have to use a much higher threshold, making it have a much higher false negative rate.
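A quick simulation makes the bias concrete (illustrative Python; it treats indicators as i.i.d. Bernoulli with success probability equal to the true Jaccard similarity). The pooled estimate (b) is unbiased, while the max-over-L estimate (a) is biased upward, which is why (a) needs a much higher threshold for the same number of candidates:

```python
import random

def simulate(p, L, K, trials, seed=1):
    # p = true collision probability (Jaccard similarity) for each indicator
    rng = random.Random(seed)
    pooled, max_of_l = [], []
    for _ in range(trials):
        groups = [[rng.random() < p for _ in range(K)] for _ in range(L)]
        per_group = [sum(g) / K for g in groups]
        pooled.append(sum(per_group) / L)  # (b): one estimate from all L*K indicators
        max_of_l.append(max(per_group))    # (a): max over L estimates of K each
    return sum(pooled) / trials, sum(max_of_l) / trials

mean_b, mean_a = simulate(p=0.3, L=10, K=10, trials=2000)
# mean_b stays near the true value 0.3; mean_a is substantially larger.
```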

@jkbradley (Member):

@sethah I misread what you wrote earlier: I still want to compute a single estimate, rather than L separate ones.

@Yunni (Contributor, Author) commented Nov 10, 2016

One way to look at it is that (a) will contain many duplicates in the L sets of points, so (b) is more likely to have higher precision and recall.

I think this might be where we are not on the same page. I consider the output of (a)/(b) as our "probing sequence" (or "probing buckets"), and in the next step we pick and return the k keys with the smallest distance in those buckets. Do you agree with this part?

If you agree, then I claim that more duplicates (it's actually redundancy rather than duplication) bring a better chance of finding the correct k nearest neighbors, because we enlarge our search range.

If you disagree, then we are not discussing the same NN search implementation (one that differs from the current implementation). I would like to know how you return the k nearest neighbors after (b).

@jkbradley (Member):

You're talking about enlarging search ranges, or iterating, a few times, but I really do not think multiple iterations make sense in Spark. Iterations are useful for avoiding an expensive sort, but in Spark it would be more efficient to compute approximate quantiles to avoid the sort.

In both (a) and (b), you come up with some set of candidates. I was assuming we would compute keyDistance for those candidates and pick the top ones, just as in the current implementation.

@Yunni (Contributor, Author) commented Nov 11, 2016

@jkbradley I agree with your idea to get rid of full sorting and use approxQuantile to find the threshold. Doing a full sort on whole dataset hurts performance a lot. Please file a ticket for this.

You're talking about enlarging search ranges, or iterations, a few times.

Enlarging search ranges does not necessarily mean iterations. The same threshold logic for (a) gives a larger search range than for (b). Do you agree with this?

In both (a) and (b), you come up with some set of candidates. I was assuming we would compute keyDistance for those candidates and pick the top ones, just as in the current implementation.

Agree with this part.

BTW, as one concrete example, you can run approxNearestNeighbors for MinHash in MinHashSuite.scala with singleProbe = false:

  • hashDistance in (a) gives precision/recall of (0.95, 0.95) while searching 56 rows
  • hashDistance in (b) gives precision/recall of (0.6, 0.6) while searching only 26 rows

@MLnick (Contributor) commented Nov 11, 2016

@Yunni Spark DF should have a posexplode:

scala> val df = Seq((0, Array(Vectors.dense(1, 2), Vectors.dense(5, 4))), (1, Array(Vectors.dense(3, 2), Vectors.dense(1, 2)))).toDF("id", "hash")
df: org.apache.spark.sql.DataFrame = [id: int, hash: array<vector>]

scala> df.show
+---+--------------------+
| id|                hash|
+---+--------------------+
|  0|[[1.0,2.0], [5.0,...|
|  1|[[3.0,2.0], [1.0,...|
+---+--------------------+


scala> df.select(posexplode(df("hash"))).show
+---+---------+
|pos|      col|
+---+---------+
|  0|[1.0,2.0]|
|  1|[5.0,4.0]|
|  0|[3.0,2.0]|
|  1|[1.0,2.0]|
+---+---------+

@sethah (Contributor) commented Nov 11, 2016

@jkbradley Thanks for clarifying, I see your argument now. I agree that it makes sense from a statistical perspective. Still, I have not seen a single paper that describes anything quite like what we're proposing. I would be OK with disabling the multi-probe option for the 2.1 release, so we can carry on this discussion and continue hashing out (pun intended :) the APIs.

It is my understanding that the main benefit of multi-probe described in the reference paper is to cut down the storage space required by computing many hash tables, but we are not actually storing the entire hash table as a data structure so our implementation is a bit different. I think there's room for discussion/tests about what the benefits are and how drastically they impact performance.

@Yunni (Contributor, Author) commented Nov 11, 2016

@MLnick Thanks! That's very good to know!

@sethah I agree with your comments. @jkbradley If you have no objection, shall I remove multi-probe NN search and hashDistance, so we can revisit multi-probing and do more performance testing in a later version?

@jkbradley (Member) commented Nov 12, 2016

@Yunni I guess we should remove it from the public API. I'm OK with leaving the code there and making it private for now.

One response:

Enlarging search ranges does not necessarily mean iterations. The same threshold logic for (a) gives a larger search range than for (b). Do you agree with this?

If you use the same threshold for both, then I agree. But that's not a reasonable comparison since (a) will do many times more work and communicate many times more data (up to L times more). This will happen when you do posexplode.

If you compare the 2 where each selects the same number of rows (on which to compute the keyDistance and select neighbors), then (b) will select many more candidates since it will not have duplicates.

Also, one new comment:

I'm testing vs the current implementation (min(abs(query bucket - row bucket))). Weirdly, the current one is getting consistently better results than my proposal...even though this does not make sense to me statistically (and even though the current implementation isn't what any of us are proposing to use!). I'm still banging my head against this...

UPDATE: Problem solved. The difference is that, for the current approxNearestNeighbors unit test, the current hashDistance function results in 56 rows being considered here: [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala#L153]. This is because lots of rows have the same hashDistance of 0.0 with the current implementation. When modified to use the average of indicators, there is a much broader range of distances, so only 26 rows are considered.

Proposal: I suggest modifying the above line to limit to numNearestNeighbors or some multiple of numNearestNeighbors. Otherwise, approxNearestNeighbors could compute keyDistance for the entire dataset (in the extreme case).

@Yunni (Contributor, Author) commented Nov 14, 2016

OK, abandoning this PR since we are making multi-probe NN search and hashDistance private. Related changes are included in #15874.
