[SPARK-17548] [MLlib] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector #15105
Test build #65409 has finished for PR 15105 at commit
|
Previously, the `findSynonyms` method in `Word2VecModel` rejected the closest-matching vector. This was typically correct in cases where we were searching for synonyms of a word, but was incorrect in cases where we were searching for words most similar to a given vector, since the given vector might not correspond to a word in the model's vocabulary. With this commit, `findSynonyms` will not discard the best matching term unless the best matching word is also the query word (or is maximally similar to the query vector).
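The corrected selection logic can be sketched in isolation. This is a minimal sketch, not the PR's actual code: `findNearest` is a hypothetical stand-in for `Word2VecModel.findSynonyms`, and `scored` plays the role of the (word, cosineSimilarity) pairs computed from the model.

```scala
// Minimal sketch of the corrected filtering, outside of Spark.
// `findNearest` is a hypothetical stand-in for Word2VecModel.findSynonyms.
object SynonymSketch {
  def findNearest(
      scored: Seq[(String, Double)],
      num: Int,
      wordOpt: Option[String]): Array[(String, Double)] = {
    val sorted = scored.sortBy(-_._2)
    val candidates = wordOpt match {
      // Queried with a word: drop only that word, not whatever ranks first.
      case Some(word) => sorted.take(num + 1).filter { case (w, _) => w != word }
      // Queried with a raw vector: keep the best match, even at similarity ~1.0.
      case None => sorted
    }
    candidates.take(num).toArray
  }
}
```

Querying with `Some("b")` drops only `"b"` from the candidates; querying with `None` keeps the top-ranked word, which is the behavior this PR fixes.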
Force-pushed from 757ce7c to 832ed41
Test build #65411 has finished for PR 15105 at commit
Hi @willb Good catch. This is a valid issue.
Thanks, @hhbyyh. This is what I get for running …
```scala
      .sortBy(-_._2)
      .take(num + 1)
      .tail
      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
```
Due to floating-point error, tup._2 may not be exactly 1.0d even for the same vector.
And I'm not sure it's always proper to reject the identical vector.
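The floating-point concern can be seen with a tiny self-contained check (toy vector and a naive cosine helper, not the model's code): comparing a similarity against exactly 1.0d is fragile, because even a vector's similarity with itself can pick up rounding error.

```scala
// Naive cosine similarity; illustrates why `tup._2 != 1.0d` is a fragile test.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

val v = Array(0.1, 0.3, 0.7)
// selfSim is extremely close to 1.0, but `selfSim == 1.0d` is not guaranteed.
val selfSim = cosine(v, v)
```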
Yes, those are both valid points. I went back and forth on this one, but I think we could actually argue that rejecting the identical vector doesn't make sense in any case.
It's probably easier still to leave the current code in place, up to "tail". If wordOpt is defined, then apply filter and take(num). If not, apply tail.
@hhbyyh So one of the python doctests depends on a word not being a synonym of its vector representation. I think since this is the documented behavior now, that's the direction the fix should go as well, but I'll use Sean's suggestion instead of checking similarity in any case.
@srowen This code in general kind of bothers me (I'd rather see a single pass through the tuples with a bounded priority queue keeping track of the num + 1 candidates than converting to a sequence and then allocating an array to sort in place). But I'm inclined to get some numbers showing that that is a good idea and make it a separate PR unless this is a good time to fold it in (so to speak).
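The single-pass idea mentioned above could be sketched like this. This is a hypothetical illustration of the technique, not code from the PR; the helper name and toy data are made up.

```scala
import scala.collection.mutable

// Hypothetical sketch of the single-pass idea: keep only the best k
// candidates in a size-bounded min-heap instead of sorting everything.
def topK(scored: Iterator[(String, Double)], k: Int): Array[(String, Double)] = {
  // Min-heap ordered by similarity: the worst retained candidate sits on top.
  val heap = mutable.PriorityQueue.empty[(String, Double)](Ordering.by(p => -p._2))
  scored.foreach { p =>
    if (heap.size < k) heap.enqueue(p)
    else if (p._2 > heap.head._2) { heap.dequeue(); heap.enqueue(p) }
  }
  heap.dequeueAll.reverse.toArray // best match first
}
```

This does one pass over the candidates with O(n log k) work and O(k) memory, versus materializing and sorting the whole vocabulary.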
I wasn't sure if the todo was referring to the sorting or to earlier optimizations. I'll get it started as a separate issue; thanks!
@srowen So actually reverting to the old code but filtering only if wordOpt is defined doesn't handle the original case I was considering here, where you pass in a vector that is very similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary.
It should be equivalent to what you suggest. Something like...
```scala
val topn1 = wordList.zip(cosVec).toSeq.sortBy(-_._2).take(num + 1)
if (wordOpt.isDefined) {
  topn1.filter(tup => !wordOpt.get.equals(tup._1)).take(num)
} else {
  topn1.tail
}
```
OK that's a bit different than what you suggested but does that help a bit?
Consider the case where you're passing in a vector that is extremely similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary. (A concrete example is in this test I added.) In this case, wordOpt is not defined (because you are querying for words whose vector representations are similar to a given vector) but you nonetheless are not interested in discarding the best match because it is not a trivial match (that is, it is not going to be your query term).
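A toy version of that scenario (made-up words and scores, already sorted by similarity) shows what the unconditional `.tail` discards:

```scala
// Made-up ranked results for a query *vector* that is near "china"'s vector
// but is not itself any word's vector. Here num = 2.
val ranked = Seq(("china", 0.97), ("japan", 0.88), ("korea", 0.86))
val oldBehavior = ranked.take(2 + 1).tail.map(_._1) // unconditionally drops "china"
val newBehavior = ranked.take(2).map(_._1)          // keeps the legitimate best match
```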
Related to your other comment (and discussion elsewhere on this PR), I think we could make a case for changing the documented behavior (especially since it is only documented as such in pyspark) in the case where findSynonyms is invoked with the vector representation of a word that is in the vocabulary. Instead of rejecting the best match in that case, we could return it. The argument there is that such a result is telling you something you didn't necessarily know (otherwise, you'd probably be querying for a word and not a vector) and that there is an easy way to identify that such a match is trivially the best match. I recognize that changing documented behavior is a big deal, but in this case it seems like it could be the best way to address a minor bug.
Oops, "tail" is wrong, yeah, because it would give you the bottom n out of n+1, when you just want the first n out of the n+1. Otherwise I think this works?
Anyway, agree with the change you propose. If specifying a vector I would not expect any filtering of the first element. We're changing the behavior no matter what here but it's a fix.
```scala
      .take(num + 1)
      .tail
      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
      .take(num)
```
Just for performance:

```scala
.take(num + 1)
.filter(...)
```

may be faster.
We'd actually need .take(num + 1).filter(...).take(num) to properly handle the cases where the filter isn't rejecting anything. I'm assuming the filter is fairly cheap but you're right that it doesn't make sense to do it any more than is necessary.
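A quick sketch of why the trailing `take(num)` still matters (toy data, with `num = 2`):

```scala
// If the query word does not appear in the top num + 1 results, the filter
// removes nothing and num + 1 candidates survive, so a final take(num) is
// still required to return exactly num synonyms.
val top = Seq(("a", 0.9), ("b", 0.8), ("c", 0.7))     // already took num + 1 = 3
val kept = top.filter { case (w, _) => w != "zebra" } // query word not among them
val result = kept.take(2)                             // back down to num results
```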
```scala
      ("korea", Array(0.45f, 0.60f, 0.60f, 0.60f))
    )
    val model = new Word2VecModel(word2VecMap)
    val syms = model.findSynonyms(Vectors.dense(Array(0.52d, 0.50d, 0.50d, 0.50d)), num)
```
Nit, but I think "d" is redundant. Also "0.5" seems clearer than "0.50" but this is truly up to taste.
Yes to both (the two significant digits was simply following the style earlier in the test).
Test build #65443 has finished for PR 15105 at commit
Can we assume in general that distinct words will not have identical (viz., by …) vector representations? This seems like a mostly reasonable assumption, but it is definitely a corner case that we might want to be robust to (or at least take note of). If we can't accept this assumption, then we should figure out whether or not changing the expected behavior in the documentation is acceptable.
Force-pushed from e33343f to 49c0288
I don't think you can assume that, though it will almost always be true. But I don't think the goal is to filter vectors of other words that happen to be identical. The point was just to filter the word itself from the list of synonyms because it's always perfectly similar to itself and that's implicitly not what the caller wants. So, no you don't necessarily want to filter out all "1.0" similarity.
@srowen Yes. But if we're querying for words similar to a given vector, then there's no word to filter out (and, indeed, no way to know which word we might want to filter out if multiple words might map to the same vector representation).
Test build #65448 has finished for PR 15105 at commit
Force-pushed from 49c0288 to 6df898d
Test build #65451 has finished for PR 15105 at commit
Update PySpark docstring and scaladocs to reflect fixed behavior.
Force-pushed from 6df898d to 08424f4
Test build #65453 has finished for PR 15105 at commit
```diff
  @Since("1.5.0")
  def findSynonyms(word: String, num: Int): DataFrame = {
-   findSynonyms(wordVectors.transform(word), num)
+   findSynonyms(wordVectors.transform(word), num, Some(word))
```
I think you don't need to or want to change this file or the wrapper class below. They can continue to plumb through the API calls as before because in the underlying class you handle both cases. You might update docs in these classes, however, to match your change to the .mllib class.
In this case (and similarly in Word2VecModelWrapper) I opted to call the three-argument version because the wrappers both explicitly convert their argument to a vector before calling findSynonyms on the underlying model (and so wordOpt would not be defined if the wrapper were invoked with a word). If we were to make the three-argument findSynonyms private we wouldn't be able to share a code path in the wrapper classes and would need to duplicate the code to tidy and reformat results in both methods (data frame creation in this case, unzipping and asJava in the Python model wrapper) or factor it out to a separate method. Let me know how you want me to proceed here.
I agree that updating the docs makes sense and will make it clearer to future maintainers as well.
I see. I agree that something gets a little bit duplicated no matter what. Given your change, it seems easiest to pass the string vs vector argument all the way down, even if that means in these other two classes you duplicate a little code to transform the dataframe, etc. If it's nontrivial, sure, a little helper method seems reasonable. That would help keep this layered more cleanly IMHO.
```scala
   * @param wordOpt optionally, a word to reject from the results list
   * @return array of (word, cosineSimilarity)
   */
  private[spark] def findSynonyms(
```
This can be private then, I think.
```scala
      ind += 1
    }

    // NB: This code (and the documented behavior of findSynonyms
```
Oh, is the problem that several words, including the word itself, may have 1.0 similarity? and so the word itself may not sort as the top result among the 1.0 results? Yeah, then the code below doesn't quite handle that case. (But below you wouldn't need the take(num + 1) now I think?)
What about just ...
```scala
val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
val filtered = wordOpt match {
  case Some(word) => scored.take(num + 1).filter(p => word != p._1)
  case None => scored
}
filtered.take(num).toArray
```
* changed filtering code path
* made `Word2Vec.findSynonyms(Vector, Int, Option[String])` private
* refactored ML pipeline and Python Word2Vec model wrappers to use public APIs
Test build #65490 has finished for PR 15105 at commit
Test build #65492 has finished for PR 15105 at commit
…rejects the best match when invoked with a vector

## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.

Author: William Benton <willb@redhat.com>

Closes #15105 from willb/fix/findSynonyms.

(cherry picked from commit 25cbbe6)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Merged to master/2.0