@willb
Contributor

willb commented Sep 19, 2016

What changes were proposed in this pull request?

The code in Word2VecModel.findSynonyms to choose the vocabulary elements with the highest similarity to the query vector currently sorts the collection of similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary, and that is exactly what this patch does.
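The idea described above can be sketched with the standard library alone. This is a minimal illustration, not the code in the patch: the patch uses Spark's own `BoundedPriorityQueue`, whereas this sketch assumes a plain `scala.collection.mutable.PriorityQueue` with a reversed ordering, and the names `TopKSketch` and `topK` are hypothetical.

```scala
import scala.collection.mutable

object TopKSketch {
  // Hypothetical helper, not part of Spark: returns the `k` highest-scoring
  // (word, score) pairs in descending score order, using one pass over the input.
  def topK(scored: Iterator[(String, Double)], k: Int): Seq[(String, Double)] = {
    // Reverse the score ordering so the queue's head is the lowest retained score.
    val minFirst: Ordering[(String, Double)] =
      Ordering.by[(String, Double), Double](_._2).reverse
    val heap = mutable.PriorityQueue.empty[(String, Double)](minFirst)

    for (pair <- scored) {
      if (heap.size < k) {
        // Still below capacity: keep everything.
        heap.enqueue(pair)
      } else if (heap.nonEmpty && pair._2 > heap.head._2) {
        // At capacity: evict the current minimum only if the new score beats it.
        heap.dequeue()
        heap.enqueue(pair)
      }
    }

    // Only the (at most k) survivors are sorted, not the whole vocabulary.
    heap.toSeq.sortBy(-_._2)
  }
}
```

With a vocabulary of n words, this keeps at most k pairs in memory at a time and sorts only those k at the end, which is where the saving over sorting all n similarities would come from.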

How was this patch tested?

This patch adds no user-visible functionality and its correctness should be exercised by existing tests. To ensure that this approach is actually faster, I made a microbenchmark for findSynonyms:

object W2VTiming {
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.feature.Word2VecModel

  // Times findSynonyms for every word in the model's vocabulary.
  def run(modelPath: String, scOpt: Option[SparkContext] = None): Unit = {
    val sc = scOpt.getOrElse(new SparkContext(new SparkConf(true).setMaster("local[*]").setAppName("test")))
    val model = Word2VecModel.load(sc, modelPath)
    val keys = model.getVectors.keys
    val start = System.currentTimeMillis
    for (key <- keys) {
      // Query several result sizes for each word.
      model.findSynonyms(key, 5)
      model.findSynonyms(key, 10)
      model.findSynonyms(key, 25)
      model.findSynonyms(key, 50)
    }
    val finish = System.currentTimeMillis
    println("run completed in " + (finish - start) + "ms")
  }
}

I ran this test on a model generated from the complete works of Jane Austen and found that the new approach was over 3x faster than the old approach. (If the num argument to findSynonyms is very close to the vocabulary size, the new approach will have less of an advantage over the old one.)

@willb changed the title from "Use a bounded priority queue to find synonyms in Word2VecModel" to "[SPARK-17595] [MLLib] Use a bounded priority queue to find synonyms in Word2VecModel" on Sep 19, 2016
override def compare(x: (String, Double), y: (String, Double)): Int = x._2.compareTo(y._2)
}

val pq = new BoundedPriorityQueue(num + 1)(ord)
Member

I think you can just pass Ordering.by(_._2) instead of defining a function.
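To illustrate the suggestion, here is a small sketch comparing the two forms. The object and value names (`OrderingSketch`, `explicit`, `derived`) are hypothetical, and the explicit version is only an approximation of the snippet under review, which is truncated above.

```scala
object OrderingSketch {
  // Explicit form: an anonymous Ordering that compares pairs by their score.
  val explicit: Ordering[(String, Double)] = new Ordering[(String, Double)] {
    override def compare(x: (String, Double), y: (String, Double)): Int =
      java.lang.Double.compare(x._2, y._2)
  }

  // Compact form: derive an equivalent ordering from the score field.
  // The expected type lets the compiler infer the element type for `_._2`.
  val derived: Ordering[(String, Double)] = Ordering.by(_._2)
}
```

Both values order pairs by the second element only, so either could be passed wherever an `Ordering[(String, Double)]` is expected.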


val pq = new BoundedPriorityQueue(num + 1)(ord)

wordList.zip(cosVec).foreach(tup => pq += tup)
Member

pq ++= should be able to add a whole collection?

@SparkQA

SparkQA commented Sep 19, 2016

Test build #65601 has finished for PR 15150 at commit ddba657.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@willb
Contributor Author

willb commented Sep 19, 2016

Thanks for the feedback, @srowen! I've made the changes.

@SparkQA

SparkQA commented Sep 19, 2016

Test build #65603 has finished for PR 15150 at commit 93ebb94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val scored = pq.toSeq.sortBy(-_._2)

val filtered = wordOpt match {
case Some(w) => scored.take(num + 1).filter(tup => w != tup._1)
Contributor

minor: Is take still necessary?

Member

Yep good point, there are already <= num+1 elements.

Contributor Author

Thanks, @hhbyyh!

@SparkQA

SparkQA commented Sep 20, 2016

Test build #65632 has finished for PR 15150 at commit f7311a2.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 20, 2016

Test build #3281 has finished for PR 15150 at commit f7311a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))

pq ++= wordList.zip(cosVec)
Member

I'm OK to merge this as-is, since it is an improvement. I know one of the original purposes was to avoid copies. I suppose it's a little more verbose, but the following would avoid a collection copy:

for (i <- cosVec.indices) {
  // Extra parentheses pass one explicit tuple rather than two arguments to +=.
  pq += ((wordList(i), cosVec(i)))
}

I don't feel strongly about it.

Contributor Author

Yeah, I guess I figured that since we were allocating the tuples anyway a single copy of the array wasn't a lot of extra overhead vs. having slightly cleaner code. But I'm happy to make the change if you think it's a good idea. I agree that allocating an array just to iterate through it isn't ideal.

(I'm ambivalent, partially because I don't have a great sense for the vocabulary sizes people typically use this code for in the wild. For my example corpus, my patch as-is, zipping collection iterators, and explicit iteration over indices are all more or less equivalent in time performance. My intuition is that allocating even the single array from zip is a bad deal if we're dealing with a very large vocabulary but probably not if the typical case is on the order of 10^5 words or less.)
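The three alternatives mentioned here can be put side by side. This is a rough sketch for comparison only: it assumes an unbounded `scala.collection.mutable.PriorityQueue` as a stand-in for Spark's `BoundedPriorityQueue`, and the names `FeedSketch`, `viaZip`, `viaIteratorZip`, and `viaIndices` are hypothetical.

```scala
import scala.collection.mutable

object FeedSketch {
  type Scored = (String, Double)

  // Stand-in consumer: an unbounded max-first queue ordered by score.
  private def newQueue(): mutable.PriorityQueue[Scored] =
    mutable.PriorityQueue.empty[Scored](Ordering.by[Scored, Double](_._2))

  // 1. Strict zip: builds an intermediate Array[(String, Double)] before adding.
  def viaZip(words: Array[String], scores: Array[Double]): Scored = {
    val pq = newQueue()
    pq ++= words.zip(scores)
    pq.head
  }

  // 2. Iterator zip: pairs are produced one at a time, with no intermediate array.
  def viaIteratorZip(words: Array[String], scores: Array[Double]): Scored = {
    val pq = newQueue()
    pq ++= words.iterator.zip(scores.iterator)
    pq.head
  }

  // 3. Explicit index loop: also avoids the intermediate array.
  def viaIndices(words: Array[String], scores: Array[Double]): Scored = {
    val pq = newQueue()
    for (i <- scores.indices) {
      pq += ((words(i), scores(i)))
    }
    pq.head
  }
}
```

All three allocate one tuple per vocabulary word; only the first also allocates an array holding every tuple at once, which is the overhead being weighed in this thread.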

Member

I also don't know. When I've used it, it has been with vocabularies of tens of thousands of words. From others' emails, I think some people do use it with very large vocabularies. If you have a minute, while we're here, we might as well take it one more step towards optimized?

Contributor Author

Agreed. I'll push as soon as I finish running tests locally.

@SparkQA

SparkQA commented Sep 20, 2016

Test build #65661 has finished for PR 15150 at commit 4b235dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Sep 21, 2016

Merged to master

@asfgit asfgit closed this in 7654385 Sep 21, 2016