Conversation

@willb (Contributor) commented Sep 14, 2016

What changes were proposed in this pull request?

This pull request changes the behavior of Word2VecModel.findSynonyms so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where findSynonyms is invoked with a word) or that has an identical angle to the query vector.
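The before-and-after logic can be sketched in Python (a hypothetical mirror of the Scala change; `scored` stands in for the (word, cosine similarity) pairs the model computes):

```python
def find_synonyms(scored, num, query_word=None):
    """scored: list of (word, cosine_similarity) pairs, unsorted."""
    ranked = sorted(scored, key=lambda t: -t[1])
    if query_word is not None:
        # Word query: drop the query word itself wherever it ranks,
        # then keep the top num of what remains.
        return [t for t in ranked[:num + 1] if t[0] != query_word][:num]
    # Vector query: the best match is a genuine answer, so keep it.
    # (The old code unconditionally dropped the top result.)
    return ranked[:num]
```

A vector query now returns the nearest word even when it is an extremely close match, while a word query still excludes the query word itself.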

How was this patch tested?

I added a test to Word2VecSuite to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.

SparkQA commented Sep 14, 2016

Test build #65409 has finished for PR 15105 at commit 757ce7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Previously, the `findSynonyms` method in `Word2VecModel` rejected the
closest-matching vector.  This was typically correct in cases where we
were searching for synonyms of a word, but was incorrect in cases
where we were searching for words most similar to a given vector,
since the given vector might not correspond to a word in the model's
vocabulary.

With this commit, `findSynonyms` will not discard the best matching
term unless the best matching word is also the query word (or is
maximally similar to the query vector).
SparkQA commented Sep 15, 2016

Test build #65411 has finished for PR 15105 at commit 832ed41.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh (Contributor) commented Sep 15, 2016

Hi @willb, good catch. This is a valid issue.

@willb (Contributor, Author) commented Sep 15, 2016

Thanks, @hhbyyh. This is what I get for running `sbt "testOnly ..."`. I'll push a fix.

.sortBy(-_._2)
.take(num + 1)
.tail
.filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
Contributor:

Due to floating-point calculation error, tup._2 may not be 1.0d even for the same vector. And I'm not sure it's always proper to reject the identical vector.
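The floating-point concern is easy to demonstrate with plain double arithmetic (illustrative only, not code from the PR): two algebraically identical sums need not compare equal, so a `tup._2 != 1.0` test can misfire in either direction.

```python
# The same three terms, summed in two different orders.
a = 0.1 + 0.2 + 0.3   # 0.6000000000000001
b = 0.3 + 0.2 + 0.1   # 0.6
print(a == b)          # False
```

By the same mechanism, a cosine similarity computed for the very same vector may come out fractionally below 1.0, so an exact-equality filter is unreliable.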

Contributor (Author):

Yes, those are both valid points. I went back and forth on this one, but I think we could actually argue that rejecting the identical vector doesn't make sense in any case.

Member:

It's probably easier still to leave the current code in place, up to "tail". If wordOpt is defined, then apply filter and take(num). If not, apply tail.

@willb (Contributor, Author), Sep 15, 2016:

@hhbyyh So one of the Python doctests depends on a word not being a synonym of its own vector representation. Since this is the documented behavior now, I think that's the direction the fix should go as well, but I'll use Sean's suggestion instead of checking similarity in any case.

Contributor (Author):

@srowen This code in general kind of bothers me: I'd rather see a single pass through the tuples, with a bounded priority queue keeping track of the num + 1 candidates, than converting to a sequence and then allocating an array to sort in place. But I'm inclined to get some numbers showing that it's a good idea and make it a separate PR, unless this is a good time to fold it in (so to speak).
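The single-pass idea can be sketched with a heap instead of a full sort (hypothetical Python; the Scala code could use a bounded priority queue to the same effect):

```python
import heapq

def top_synonyms(scored, num, query_word=None):
    # One pass over the (word, similarity) pairs, keeping only the
    # num + 1 best candidates instead of sorting the whole vocabulary.
    top = heapq.nlargest(num + 1, scored, key=lambda t: t[1])
    if query_word is not None:
        top = [t for t in top if t[0] != query_word]
    return top[:num]
```

For a vocabulary of size V this is O(V log(num + 1)) rather than O(V log V), which matters when num is small and V is large.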

Contributor (Author):

I wasn't sure if the todo was referring to the sorting or to earlier optimizations. I'll get it started as a separate issue; thanks!

Contributor (Author):

@srowen So actually reverting to the old code but filtering only if wordOpt is defined doesn't handle the original case I was considering here, where you pass in a vector that is very similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary.

Member:

It should be equivalent to what you suggest. Something like...

val topn1 = wordList.zip(cosVec).toSeq.sortBy(-_._2).take(num + 1)
if (wordOpt.isDefined) {
  topn1.filter(tup => !wordOpt.get.equals(tup._1)).take(num)
} else {
  topn1.tail
}

OK that's a bit different than what you suggested but does that help a bit?

@willb (Contributor, Author), Sep 15, 2016:

Consider the case where you're passing in a vector that is extremely similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary. (A concrete example is in this test I added.) In this case, wordOpt is not defined (because you are querying for words whose vector representations are similar to a given vector) but you nonetheless are not interested in discarding the best match because it is not a trivial match (that is, it is not going to be your query term).

Related to your other comment (and discussion elsewhere on this PR), I think we could make a case for changing the documented behavior (especially since it is only documented as such in pyspark) in the case where findSynonyms is invoked with the vector representation of a word that is in the vocabulary. Instead of rejecting the best match in that case, we could return it. The argument there is that such a result is telling you something you didn't necessarily know (otherwise, you'd probably be querying for a word and not a vector) and that there is an easy way to identify that such a match is trivially the best match. I recognize that changing documented behavior is a big deal, but in this case it seems like it could be the best way to address a minor bug.

Member:

Oops, "tail" is wrong, yeah, because it would give you the bottom n out of n+1, when you just want the first n out of the n+1. Otherwise I think this works?

Anyway, agree with the change you propose. If specifying a vector I would not expect any filtering of the first element. We're changing the behavior no matter what here but it's a fix.

.take(num + 1)
.tail
.filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
.take(num)
@hhbyyh (Contributor), Sep 15, 2016:

Just for performance: `.take(num + 1).filter(...)` may be faster.

Contributor (Author):

We'd actually need .take(num + 1).filter(...).take(num) to properly handle the cases where the filter isn't rejecting anything. I'm assuming the filter is fairly cheap but you're right that it doesn't make sense to do it any more than is necessary.
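The corner case is easy to see with concrete numbers (illustrative Python, not the PR's code): when the filter rejects nothing, only a trailing take restores the num bound.

```python
num = 2
top = [("a", 0.9), ("b", 0.8), ("c", 0.7)]        # num + 1 candidates
kept = [t for t in top if t[0] != "not-present"]  # filter rejects nothing
assert len(kept) == num + 1    # one result too many without a final take
assert len(kept[:num]) == num  # the trailing take(num) restores the bound
```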


("korea", Array(0.45f, 0.60f, 0.60f, 0.60f))
)
val model = new Word2VecModel(word2VecMap)
val syms = model.findSynonyms(Vectors.dense(Array(0.52d, 0.50d, 0.50d, 0.50d)), num)
Member:

Nit, but I think "d" is redundant. Also "0.5" seems clearer than "0.50" but this is truly up to taste.

Contributor (Author):

Yes to both (the two significant digits was simply following the style earlier in the test).

SparkQA commented Sep 15, 2016

Test build #65443 has finished for PR 15105 at commit e33343f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@willb (Contributor, Author) commented Sep 15, 2016

Can we assume in general that distinct words will not have identical (viz., by .equals) vector representations? I ask because the behavior in the current documentation (in that pyspark test I linked above) assumes that you don't want a word to turn up as a synonym for its own vector representation. But this is impossible to enforce if we aren't guaranteed that distinct words won't have identical vector representations.

This seems like a mostly reasonable assumption but it is definitely a corner case that we might want to be robust to (or at least take note of). If we can't accept this assumption, then we should figure out whether or not changing the expected behavior in the documentation is acceptable.

@srowen (Member) commented Sep 15, 2016

I don't think you can assume that, though it will almost always be true. But I don't think the goal is to filter vectors of other words that happen to be identical. The point was just to filter the word itself from the list of synonyms because it's always perfectly similar to itself and that's implicitly not what the caller wants.

So, no you don't necessarily want to filter out all "1.0" similarity.

@willb (Contributor, Author) commented Sep 15, 2016

@srowen Yes. But if we're querying for words similar to a given vector, then there's no word to filter out (and, indeed, no way to know which word we might want to filter out if multiple words might map to the same vector representation).

SparkQA commented Sep 15, 2016

Test build #65448 has finished for PR 15105 at commit 49c0288.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 15, 2016

Test build #65451 has finished for PR 15105 at commit 6df898d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Update PySpark docstring and scaladocs to reflect fixed behavior.
SparkQA commented Sep 15, 2016

Test build #65453 has finished for PR 15105 at commit 08424f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Since("1.5.0")
def findSynonyms(word: String, num: Int): DataFrame = {
findSynonyms(wordVectors.transform(word), num)
findSynonyms(wordVectors.transform(word), num, Some(word))
Member:

I think you don't need to or want to change this file or the wrapper class below. They can continue to plumb through the API calls as before because in the underlying class you handle both cases. You might update docs in these classes, however, to match your change to the .mllib class.

Contributor (Author):

In this case (and similarly in Word2VecModelWrapper) I opted to call the three-argument version because the wrappers both explicitly convert their argument to a vector before calling findSynonyms on the underlying model (and so wordOpt would not be defined if the wrapper were invoked with a word). If we were to make the three-argument findSynonyms private we wouldn't be able to share a code path in the wrapper classes and would need to duplicate the code to tidy and reformat results in both methods (data frame creation in this case, unzipping and asJava in the Python model wrapper) or factor it out to a separate method. Let me know how you want me to proceed here.

I agree that updating the docs makes sense and will make it clearer to future maintainers as well.

Member:

I see. I agree that something gets a little bit duplicated no matter what. Given your change, it seems easiest to pass the string vs vector argument all the way down, even if that means in these other two classes you duplicate a little code to transform the dataframe, etc. If it's nontrivial, sure, a little helper method seems reasonable. That would help keep this layered more cleanly IMHO.

* @param wordOpt optionally, a word to reject from the results list
* @return array of (word, cosineSimilarity)
*/
private[spark] def findSynonyms(
Member:

This can be private then, I think.

ind += 1
}

// NB: This code (and the documented behavior of findSynonyms
Member:

Oh, is the problem that several words, including the word itself, may have 1.0 similarity? and so the word itself may not sort as the top result among the 1.0 results? Yeah, then the code below doesn't quite handle that case. (But below you wouldn't need the take(num + 1) now I think?)

What about just ...

val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
val filtered = wordOpt match {
  case Some(word) => scored.take(num + 1).filter(p => word != p._1)
  case None => scored
}
filtered.take(num).toArray

* changed filtering code path
* made `Word2Vec.findSynonyms(Vector, Int, Option[String])` private
* refactored ML pipeline and Python Word2Vec model wrappers to use public APIs
@willb (Contributor, Author) commented Sep 16, 2016

Thanks for the feedback, @srowen! I think 18e6bfe addresses everything from a code perspective, but it missed removing the comment about assuming that distinct words have distinct vector representations, so I've pushed another commit (624c0f8) that removes that comment.

SparkQA commented Sep 16, 2016

Test build #65490 has finished for PR 15105 at commit 18e6bfe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 16, 2016

Test build #65492 has finished for PR 15105 at commit 624c0f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Sep 17, 2016
…rejects the best match when invoked with a vector


Author: William Benton <willb@redhat.com>

Closes #15105 from willb/fix/findSynonyms.

(cherry picked from commit 25cbbe6)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@srowen (Member) commented Sep 17, 2016

Merged to master/2.0

@asfgit asfgit closed this in 25cbbe6 Sep 17, 2016
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…rejects the best match when invoked with a vector


Author: William Benton <willb@redhat.com>

Closes apache#15105 from willb/fix/findSynonyms.