[SPARK-17595] [MLLib] Use a bounded priority queue to find synonyms in Word2VecModel #15150
Conversation
```scala
  override def compare(x: (String, Double), y: (String, Double)): Int = x._2.compareTo(y._2)
}

val pq = new BoundedPriorityQueue(num + 1)(ord)
```
I think you can just pass `Ordering.by(_._2)` instead of defining a function.
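The two forms are equivalent; a minimal standalone sketch (the values here are illustrative sample data, not Spark's):

```scala
import scala.math.Ordering

// Explicit Ordering, as written in the patch before this review comment:
val explicitOrd = new Ordering[(String, Double)] {
  override def compare(x: (String, Double), y: (String, Double)): Int =
    x._2.compareTo(y._2)
}

// The suggested one-liner:
val byScore: Ordering[(String, Double)] = Ordering.by(_._2)

// Both order (word, similarity) pairs by the similarity component.
val pairs = Seq(("b", 0.2), ("a", 0.9), ("c", 0.5))
assert(pairs.sorted(explicitOrd) == pairs.sorted(byScore))
```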
```scala
val pq = new BoundedPriorityQueue(num + 1)(ord)

wordList.zip(cosVec).foreach(tup => pq += tup)
```
`pq ++=` should be able to add a whole collection?
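For illustration, a self-contained sketch with `scala.collection.mutable.PriorityQueue` standing in for Spark's (private) `BoundedPriorityQueue`; `wordList` and `cosVec` here are made-up sample data:

```scala
import scala.collection.mutable

val wordList = Array("a", "b", "c")
val cosVec = Array(0.9, 0.2, 0.5)

val pq = mutable.PriorityQueue.empty[(String, Double)](Ordering.by[(String, Double), Double](_._2))

// Element-by-element, as in the patch at this point:
//   wordList.zip(cosVec).foreach(tup => pq += tup)
// Adding the whole collection in one call, as suggested:
pq ++= wordList.zip(cosVec)

assert(pq.size == 3)
assert(pq.head == ("a", 0.9)) // head is the pair with the highest score
```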
Test build #65601 has finished for PR 15150 at commit
Thanks for the feedback, @srowen! I've made the changes.
Test build #65603 has finished for PR 15150 at commit
```scala
val scored = pq.toSeq.sortBy(-_._2)

val filtered = wordOpt match {
  case Some(w) => scored.take(num + 1).filter(tup => w != tup._1)
```
Minor: is `take` still necessary?
Yep, good point: the queue already holds at most num + 1 elements.
Thanks, @hhbyyh!
Test build #65632 has finished for PR 15150 at commit

Test build #3281 has finished for PR 15150 at commit
```scala
val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))

pq ++= wordList.zip(cosVec)
```
I'm OK to merge this as-is, as it is an improvement. I know one of the original purposes was to avoid copies. I suppose it's a little more verbose, but avoids a collection copy, to do:

```scala
for (i <- cosVec.indices) {
  pq += (wordList(i), cosVec(i))
}
```

I don't feel strongly about it.
Yeah, I guess I figured that since we were allocating the tuples anyway a single copy of the array wasn't a lot of extra overhead vs. having slightly cleaner code. But I'm happy to make the change if you think it's a good idea. I agree that allocating an array just to iterate through it isn't ideal.
(I'm ambivalent, partially because I don't have a great sense for the vocabulary sizes people typically use this code for in the wild. For my example corpus, my patch as-is, zipping collection iterators, and explicit iteration over indices are all more or less equivalent in time performance. My intuition is that allocating even the single array from zip is a bad deal if we're dealing with a very large vocabulary but probably not if the typical case is on the order of 10^5 words or less.)
I also don't know... when I've used it it has been with vocabs of tens of thousands of words. From others' emails I think some people do use it with very large vocabs. If you have a minute, while we're here, might as well take it one more step towards optimized?
Agreed. I'll push as soon as I finish running tests locally.
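A sketch of the index-based variant discussed above, using `scala.collection.mutable.PriorityQueue` as a stand-in for Spark's private `BoundedPriorityQueue` (the eviction logic and sample data here are illustrative, not Spark's actual implementation):

```scala
import scala.collection.mutable

val wordList = Array("a", "b", "c", "d")
val cosVec = Array(0.9, 0.2, 0.5, 0.7)
val cap = 3 // plays the role of num + 1 in the patch

// Min-heap on score: the head is the weakest element currently kept.
val pq = mutable.PriorityQueue.empty[(String, Double)](Ordering.by[(String, Double), Double](t => -t._2))
for (i <- cosVec.indices) { // iterate over indices: no intermediate zipped array
  if (pq.size < cap) {
    pq.enqueue((wordList(i), cosVec(i)))
  } else if (cosVec(i) > pq.head._2) {
    pq.dequeue() // drop the current weakest
    pq.enqueue((wordList(i), cosVec(i)))
  }
}

assert(pq.size == cap)
assert(pq.toSeq.map(_._1).toSet == Set("a", "c", "d")) // "b" (0.2) was evicted
```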
Test build #65661 has finished for PR 15150 at commit

Merged to master
What changes were proposed in this pull request?
The code in `Word2VecModel.findSynonyms` that chooses the vocabulary elements with the highest similarity to the query vector currently sorts the collection of similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary, and that is exactly what this patch does.

How was this patch tested?

This patch adds no user-visible functionality and its correctness should be exercised by existing tests. To ensure that this approach is actually faster, I made a microbenchmark for `findSynonyms`. I ran this test on a model generated from the complete works of Jane Austen and found that the new approach was over 3x faster than the old approach. (If the `num` argument to `findSynonyms` is very close to the vocabulary size, the new approach will have less of an advantage over the old one.)
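To make the description concrete, here is a self-contained sketch (not Spark's actual code: the bounded queue is emulated with `scala.collection.mutable.PriorityQueue` and the words and scores are made up) showing that a single bounded-heap pass selects the same top-`num` results as a full sort:

```scala
import scala.collection.mutable

val wordList = Array("austen", "emma", "darcy", "bennet", "weston")
val cosVec = Array(0.91, 0.34, 0.78, 0.55, 0.12)
val num = 3

// Old approach: sort every (word, similarity) pair, then take the top `num`.
val bySort = wordList.zip(cosVec).toSeq.sortBy(-_._2).take(num)

// New approach: one pass, keeping at most `num` elements in a min-heap on
// score, so the head is always the weakest element currently retained.
val pq = mutable.PriorityQueue.empty[(String, Double)](Ordering.by[(String, Double), Double](t => -t._2))
for (i <- cosVec.indices) {
  if (pq.size < num) {
    pq.enqueue((wordList(i), cosVec(i)))
  } else if (cosVec(i) > pq.head._2) {
    pq.dequeue()
    pq.enqueue((wordList(i), cosVec(i)))
  }
}
val byQueue = pq.toSeq.sortBy(-_._2)

// Both approaches agree; the heap version avoids sorting the whole vocabulary.
assert(bySort == byQueue)
```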