[SPARK-17595] [MLLib] Use a bounded priority queue to find synonyms in Word2VecModel #15150
Conversation
```scala
  override def compare(x: (String, Double), y: (String, Double)): Int = x._2.compareTo(y._2)
}

val pq = new BoundedPriorityQueue(num + 1)(ord)
```
I think you can just pass `Ordering.by(_._2)` instead of defining a function.
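The two forms are equivalent; a minimal standalone sketch (the values here are illustrative sample data, not Spark's):

```scala
import scala.math.Ordering

// Explicit Ordering, as written in the patch before this review comment:
val explicitOrd = new Ordering[(String, Double)] {
  override def compare(x: (String, Double), y: (String, Double)): Int =
    x._2.compareTo(y._2)
}

// The suggested one-liner:
val byScore: Ordering[(String, Double)] = Ordering.by(_._2)

// Both order (word, similarity) pairs by the similarity component.
val pairs = Seq(("b", 0.2), ("a", 0.9), ("c", 0.5))
assert(pairs.sorted(explicitOrd) == pairs.sorted(byScore))
```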
```scala
val pq = new BoundedPriorityQueue(num + 1)(ord)

wordList.zip(cosVec).foreach(tup => pq += tup)
```
`pq ++=` should be able to add a whole collection?
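For illustration, a self-contained sketch with `scala.collection.mutable.PriorityQueue` standing in for Spark's (private) `BoundedPriorityQueue`; `wordList` and `cosVec` here are made-up sample data:

```scala
import scala.collection.mutable

val wordList = Array("a", "b", "c")
val cosVec = Array(0.9, 0.2, 0.5)

val pq = mutable.PriorityQueue.empty[(String, Double)](Ordering.by[(String, Double), Double](_._2))

// Element-by-element, as in the patch at this point:
//   wordList.zip(cosVec).foreach(tup => pq += tup)
// Adding the whole collection in one call, as suggested:
pq ++= wordList.zip(cosVec)

assert(pq.size == 3)
assert(pq.head == ("a", 0.9)) // head is the pair with the highest score
```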
Test build #65601 has finished for PR 15150 at commit
Thanks for the feedback, @srowen! I've made the changes.
Test build #65603 has finished for PR 15150 at commit
```scala
val scored = pq.toSeq.sortBy(-_._2)

val filtered = wordOpt match {
  case Some(w) => scored.take(num + 1).filter(tup => w != tup._1)
```
Minor: is `take` still necessary?
Yep, good point: the queue already holds at most num + 1 elements.
Thanks, @hhbyyh!
Test build #65632 has finished for PR 15150 at commit

Test build #3281 has finished for PR 15150 at commit
```scala
val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))

pq ++= wordList.zip(cosVec)
```
I'm OK to merge this as-is, as it is an improvement. I know one of the original purposes was to avoid copies. I suppose it's a little more verbose, but avoids a collection copy, to do:

```scala
for (i <- cosVec.indices) {
  pq += (wordList(i), cosVec(i))
}
```

I don't feel strongly about it.
Yeah, I guess I figured that since we were allocating the tuples anyway a single copy of the array wasn't a lot of extra overhead vs. having slightly cleaner code. But I'm happy to make the change if you think it's a good idea. I agree that allocating an array just to iterate through it isn't ideal.
(I'm ambivalent, partially because I don't have a great sense for the vocabulary sizes people typically use this code for in the wild. For my example corpus, my patch as-is, zipping collection iterators, and explicit iteration over indices are all more or less equivalent in time performance. My intuition is that allocating even the single array from zip is a bad deal if we're dealing with a very large vocabulary but probably not if the typical case is on the order of 10^5 words or less.)
I also don't know... when I've used it it has been with vocabs of tens of thousands of words. From others' emails I think some people do use it with very large vocabs. If you have a minute, while we're here, might as well take it one more step towards optimized?
Agreed. I'll push as soon as I finish running tests locally.
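A sketch of the index-based variant discussed above, using `scala.collection.mutable.PriorityQueue` as a stand-in for Spark's private `BoundedPriorityQueue` (the eviction logic and sample data here are illustrative, not Spark's actual implementation):

```scala
import scala.collection.mutable

val wordList = Array("a", "b", "c", "d")
val cosVec = Array(0.9, 0.2, 0.5, 0.7)
val cap = 3 // plays the role of num + 1 in the patch

// Min-heap on score: the head is the weakest element currently kept.
val pq = mutable.PriorityQueue.empty[(String, Double)](Ordering.by[(String, Double), Double](t => -t._2))
for (i <- cosVec.indices) { // iterate over indices: no intermediate zipped array
  if (pq.size < cap) {
    pq.enqueue((wordList(i), cosVec(i)))
  } else if (cosVec(i) > pq.head._2) {
    pq.dequeue() // drop the current weakest
    pq.enqueue((wordList(i), cosVec(i)))
  }
}

assert(pq.size == cap)
assert(pq.toSeq.map(_._1).toSet == Set("a", "c", "d")) // "b" (0.2) was evicted
```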
Test build #65661 has finished for PR 15150 at commit

Merged to master
What changes were proposed in this pull request?
The code in `Word2VecModel.findSynonyms` that chooses the vocabulary elements with the highest similarity to the query vector currently sorts the collection of similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary, and that is exactly what this patch does.

How was this patch tested?

This patch adds no user-visible functionality and its correctness should be exercised by existing tests. To ensure that this approach is actually faster, I made a microbenchmark for `findSynonyms`. I ran this test on a model generated from the complete works of Jane Austen and found that the new approach was over 3x faster than the old approach. (If the `num` argument to `findSynonyms` is very close to the vocabulary size, the new approach will have less of an advantage over the old one.)
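To make the description concrete, here is a self-contained sketch (not Spark's actual code: the bounded queue is emulated with `scala.collection.mutable.PriorityQueue` and the words and scores are made up) showing that a single bounded-heap pass selects the same top-`num` results as a full sort:

```scala
import scala.collection.mutable

val wordList = Array("austen", "emma", "darcy", "bennet", "weston")
val cosVec = Array(0.91, 0.34, 0.78, 0.55, 0.12)
val num = 3

// Old approach: sort every (word, similarity) pair, then take the top `num`.
val bySort = wordList.zip(cosVec).toSeq.sortBy(-_._2).take(num)

// New approach: one pass, keeping at most `num` elements in a min-heap on
// score, so the head is always the weakest element currently retained.
val pq = mutable.PriorityQueue.empty[(String, Double)](Ordering.by[(String, Double), Double](t => -t._2))
for (i <- cosVec.indices) {
  if (pq.size < num) {
    pq.enqueue((wordList(i), cosVec(i)))
  } else if (cosVec(i) > pq.head._2) {
    pq.dequeue()
    pq.enqueue((wordList(i), cosVec(i)))
  }
}
val byQueue = pq.toSeq.sortBy(-_._2)

// Both approaches agree; the heap version avoids sorting the whole vocabulary.
assert(bySort == byQueue)
```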