[SPARK-17548] [MLlib] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector #15105
Test build #65409 has finished for PR 15105 at commit
|
Previously, the `findSynonyms` method in `Word2VecModel` rejected the closest-matching vector. This was typically correct in cases where we were searching for synonyms of a word, but was incorrect in cases where we were searching for words most similar to a given vector, since the given vector might not correspond to a word in the model's vocabulary. With this commit, `findSynonyms` will not discard the best matching term unless the best matching word is also the query word (or is maximally similar to the query vector).
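The corrected selection logic can be sketched in isolation. This is a minimal sketch, not the PR's actual code: `findNearest` is a hypothetical stand-in for `Word2VecModel.findSynonyms`, and `scored` plays the role of the (word, cosineSimilarity) pairs computed from the model.

```scala
// Minimal sketch of the corrected filtering, outside of Spark.
// `findNearest` is a hypothetical stand-in for Word2VecModel.findSynonyms.
object SynonymSketch {
  def findNearest(
      scored: Seq[(String, Double)],
      num: Int,
      wordOpt: Option[String]): Array[(String, Double)] = {
    val sorted = scored.sortBy(-_._2)
    val candidates = wordOpt match {
      // Queried with a word: drop only that word, not whatever ranks first.
      case Some(word) => sorted.take(num + 1).filter { case (w, _) => w != word }
      // Queried with a raw vector: keep the best match, even at similarity ~1.0.
      case None => sorted
    }
    candidates.take(num).toArray
  }
}
```

Querying with `Some("b")` drops only `"b"` from the candidates; querying with `None` keeps the top-ranked word, which is the behavior this PR fixes.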
Force-pushed from 757ce7c to 832ed41
Test build #65411 has finished for PR 15105 at commit
Hi @willb Good catch. This is a valid issue.
Thanks, @hhbyyh. This is what I get for running …
```scala
      .sortBy(-_._2)
      .take(num + 1)
      .tail
      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
```
Due to floating-point error, tup._2 may not be exactly 1.0d even for the same vector.
And I'm not sure it's always proper to reject the identical vector.
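The floating-point concern can be seen with a tiny self-contained check (toy vector and a naive cosine helper, not the model's code): comparing a similarity against exactly 1.0d is fragile, because even a vector's similarity with itself can pick up rounding error.

```scala
// Naive cosine similarity; illustrates why `tup._2 != 1.0d` is a fragile test.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

val v = Array(0.1, 0.3, 0.7)
// selfSim is extremely close to 1.0, but `selfSim == 1.0d` is not guaranteed.
val selfSim = cosine(v, v)
```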
Yes, those are both valid points. I went back and forth on this one, but I think we could actually argue that rejecting the identical vector doesn't make sense in any case.
It's probably easier still to leave the current code in place, up to "tail". If wordOpt is defined, then apply filter and take(num). If not, apply tail.
@hhbyyh So one of the python doctests depends on a word not being a synonym of its vector representation. I think since this is the documented behavior now, that's the direction the fix should go as well, but I'll use Sean's suggestion instead of checking similarity in any case.
@srowen This code in general kind of bothers me (I'd rather see a single pass through the tuples with a bounded priority queue keeping track of the num + 1 candidates than converting to a sequence and then allocating an array to sort in place). But I'm inclined to get some numbers showing that that is a good idea and make it a separate PR unless this is a good time to fold it in (so to speak).
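The single-pass idea mentioned above could be sketched like this. This is a hypothetical illustration of the technique, not code from the PR; the helper name and toy data are made up.

```scala
import scala.collection.mutable

// Hypothetical sketch of the single-pass idea: keep only the best k
// candidates in a size-bounded min-heap instead of sorting everything.
def topK(scored: Iterator[(String, Double)], k: Int): Array[(String, Double)] = {
  // Min-heap ordered by similarity: the worst retained candidate sits on top.
  val heap = mutable.PriorityQueue.empty[(String, Double)](Ordering.by(p => -p._2))
  scored.foreach { p =>
    if (heap.size < k) heap.enqueue(p)
    else if (p._2 > heap.head._2) { heap.dequeue(); heap.enqueue(p) }
  }
  heap.dequeueAll.reverse.toArray // best match first
}
```

This does one pass over the candidates with O(n log k) work and O(k) memory, versus materializing and sorting the whole vocabulary.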
I wasn't sure if the todo was referring to the sorting or to earlier optimizations. I'll get it started as a separate issue; thanks!
@srowen So actually reverting to the old code but filtering only if wordOpt is defined doesn't handle the original case I was considering here, where you pass in a vector that is very similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary.
It should be equivalent to what you suggest. Something like...
```scala
val topn1 = wordList.zip(cosVec).toSeq.sortBy(-_._2).take(num + 1)
if (wordOpt.isDefined) {
  topn1.filter(tup => !wordOpt.get.equals(tup._1)).take(num)
} else {
  topn1.tail
}
```
OK that's a bit different than what you suggested but does that help a bit?
Consider the case where you're passing in a vector that is extremely similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary. (A concrete example is in this test I added.) In this case, wordOpt is not defined (because you are querying for words whose vector representations are similar to a given vector) but you nonetheless are not interested in discarding the best match because it is not a trivial match (that is, it is not going to be your query term).
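A toy version of that scenario (made-up words and scores, already sorted by similarity) shows what the unconditional `.tail` discards:

```scala
// Made-up ranked results for a query *vector* that is near "china"'s vector
// but is not itself any word's vector. Here num = 2.
val ranked = Seq(("china", 0.97), ("japan", 0.88), ("korea", 0.86))
val oldBehavior = ranked.take(2 + 1).tail.map(_._1) // unconditionally drops "china"
val newBehavior = ranked.take(2).map(_._1)          // keeps the legitimate best match
```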
Related to your other comment (and discussion elsewhere on this PR), I think we could make a case for changing the documented behavior (especially since it is only documented as such in pyspark) in the case where findSynonyms is invoked with the vector representation of a word that is in the vocabulary. Instead of rejecting the best match in that case, we could return it. The argument there is that such a result is telling you something you didn't necessarily know (otherwise, you'd probably be querying for a word and not a vector) and that there is an easy way to identify that such a match is trivially the best match. I recognize that changing documented behavior is a big deal, but in this case it seems like it could be the best way to address a minor bug.
Oops, "tail" is wrong, yeah, because it would give you the bottom n out of n+1, when you just want the first n out of the n+1. Otherwise I think this works?
Anyway, agree with the change you propose. If specifying a vector I would not expect any filtering of the first element. We're changing the behavior no matter what here but it's a fix.
```scala
      .take(num + 1)
      .tail
      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
      .take(num)
```
Just for performance:

```scala
.take(num + 1)
.filter(...)
```

may be faster.
We'd actually need .take(num + 1).filter(...).take(num) to properly handle the cases where the filter isn't rejecting anything. I'm assuming the filter is fairly cheap but you're right that it doesn't make sense to do it any more than is necessary.
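A quick sketch of why the trailing `take(num)` still matters (toy data, with `num = 2`):

```scala
// If the query word does not appear in the top num + 1 results, the filter
// removes nothing and num + 1 candidates survive, so a final take(num) is
// still required to return exactly num synonyms.
val top = Seq(("a", 0.9), ("b", 0.8), ("c", 0.7))     // already took num + 1 = 3
val kept = top.filter { case (w, _) => w != "zebra" } // query word not among them
val result = kept.take(2)                             // back down to num results
```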
```scala
      ("korea", Array(0.45f, 0.60f, 0.60f, 0.60f))
    )
    val model = new Word2VecModel(word2VecMap)
    val syms = model.findSynonyms(Vectors.dense(Array(0.52d, 0.50d, 0.50d, 0.50d)), num)
```
Nit, but I think "d" is redundant. Also "0.5" seems clearer than "0.50" but this is truly up to taste.
Yes to both (the two significant digits was simply following the style earlier in the test).
Test build #65443 has finished for PR 15105 at commit
Can we assume in general that distinct words will not have identical (viz., by …) vector representations? This seems like a mostly reasonable assumption, but it is definitely a corner case that we might want to be robust to (or at least take note of). If we can't accept this assumption, then we should figure out whether or not changing the expected behavior in the documentation is acceptable.
Force-pushed from e33343f to 49c0288
I don't think you can assume that, though it will almost always be true. But I don't think the goal is to filter vectors of other words that happen to be identical. The point was just to filter the word itself from the list of synonyms because it's always perfectly similar to itself and that's implicitly not what the caller wants. So, no you don't necessarily want to filter out all "1.0" similarity.
@srowen Yes. But if we're querying for words similar to a given vector, then there's no word to filter out (and, indeed, no way to know which word we might want to filter out if multiple words might map to the same vector representation).
Test build #65448 has finished for PR 15105 at commit
Force-pushed from 49c0288 to 6df898d
Test build #65451 has finished for PR 15105 at commit
Update PySpark docstring and scaladocs to reflect fixed behavior.
Force-pushed from 6df898d to 08424f4
Test build #65453 has finished for PR 15105 at commit
```diff
  @Since("1.5.0")
  def findSynonyms(word: String, num: Int): DataFrame = {
-   findSynonyms(wordVectors.transform(word), num)
+   findSynonyms(wordVectors.transform(word), num, Some(word))
```
I think you don't need to or want to change this file or the wrapper class below. They can continue to plumb through the API calls as before because in the underlying class you handle both cases. You might update docs in these classes, however, to match your change to the .mllib class.
In this case (and similarly in Word2VecModelWrapper) I opted to call the three-argument version because the wrappers both explicitly convert their argument to a vector before calling findSynonyms on the underlying model (and so wordOpt would not be defined if the wrapper were invoked with a word). If we were to make the three-argument findSynonyms private we wouldn't be able to share a code path in the wrapper classes and would need to duplicate the code to tidy and reformat results in both methods (data frame creation in this case, unzipping and asJava in the Python model wrapper) or factor it out to a separate method. Let me know how you want me to proceed here.
I agree that updating the docs makes sense and will make it clearer to future maintainers as well.
I see. I agree that something gets a little bit duplicated no matter what. Given your change, it seems easiest to pass the string vs vector argument all the way down, even if that means in these other two classes you duplicate a little code to transform the dataframe, etc. If it's nontrivial, sure, a little helper method seems reasonable. That would help keep this layered more cleanly IMHO.
```scala
   * @param wordOpt optionally, a word to reject from the results list
   * @return array of (word, cosineSimilarity)
   */
  private[spark] def findSynonyms(
```
This can be private then, I think.
```scala
      ind += 1
    }

    // NB: This code (and the documented behavior of findSynonyms
```
Oh, is the problem that several words, including the word itself, may have 1.0 similarity? and so the word itself may not sort as the top result among the 1.0 results? Yeah, then the code below doesn't quite handle that case. (But below you wouldn't need the take(num + 1) now I think?)
What about just ...
```scala
val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
val filtered = wordOpt match {
  case Some(word) => scored.take(num + 1).filter(p => word != p._1)
  case None => scored
}
filtered.take(num).toArray
```
* changed filtering code path
* made `Word2Vec.findSynonyms(Vector, Int, Option[String])` private
* refactored ML pipeline and Python Word2Vec model wrappers to use public APIs
Test build #65490 has finished for PR 15105 at commit
Test build #65492 has finished for PR 15105 at commit
…rejects the best match when invoked with a vector

## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.

Author: William Benton <willb@redhat.com>

Closes #15105 from willb/fix/findSynonyms.

(cherry picked from commit 25cbbe6)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Merged to master/2.0