[SPARK-19066][SparkR] SparkR LDA doesn't set optimizer correctly #16464
Conversation
Test build #70840 has finished for PR 16464 at commit
Test build #70845 has finished for PR 16464 at commit
Test build #70863 has started for PR 16464 at commit
Test build #70889 has finished for PR 16464 at commit
cc @felixcheung @yanboliang A bug fix. Thanks!
since the model is referenced and persisted, is there a need to handle trainingLogLikelihood and logPrior separately like this, writing them to metadata, instead of just getting them from the model when fetching the summary?
In the first version, I got them from the model in the LDAWrapper class. However, when I read logPrior, I found that the loaded logPrior is not the same as the value before saving. So, I followed the logLikelihood and logPerplexity approach and saved it in the metadata.
With the same dataset, Scala side tests:
Original LogPrior: -3.3387459952856338
LogPrior from saved model: -0.9202435107654922
that's odd, is that an issue with model persistence?
LogPrior is calculated based on the serialized topics etc., which are also used by trainingLogLikelihood. But trainingLogLikelihood is the same for both the original and the loaded model. Let me debug more; it looks like a bug. The original MLlib implementation doesn't serialize the two parameters, as they can be calculated from other saved values. In addition, there is no unit test comparing the two values, which could be why this issue was not caught.
shouldn't it just set optimizer - without having to check if it is == em?
If it is "online", it is not necessary to set the optimizer. But setting it anyway will make the code clean. I will do it.
can you have a test for when these are not nan? like here
nm, saw the test above.
I print out each step of
Test build #70988 has finished for PR 16464 at commit
I print out the values of each step. Actually, the value of Vertex 2 is the same in both sequences (same for Vertex -3 too). It seems that
I think it is a bug. I will file a PR to fix it. Thanks!
@felixcheung PR #16491 is filed.
let's get 16491 through so we could simply use the model instead of saving extra copies
Sure. I will revert it to the previous commit once 16491 is in.
…or for original and loaded model
## What changes were proposed in this pull request?
While adding the DistributedLDAModel training summary for SparkR, I found that the logPrior for the original and loaded models is different.
For example, in the test("read/write DistributedLDAModel"), I added the test:
val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
assert(logPrior === logPrior2)
The test fails:
-4.394180878889078 did not equal -4.294290536919573
The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns different `logPrior`.
Please refer to #16464 for details.
## How was this patch tested?
Add a new unit test for testing logPrior.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #16491 from wangmiao1981/ldabug.
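To see the aggregate pitfall behind this fix, here is a sketch on plain Scala collections (not the actual GraphX code touched by #16491): a seqOp that ignores its accumulator keeps only one element per fold, so the result depends on traversal order and, for an RDD, on partitioning.

```scala
val values = Seq(1.0, 2.0, 3.0)

// Buggy seqOp: discards the accumulator, so only the last element folded
// in survives. On an RDD the result would vary with partitioning, which
// matches the order-dependent logPrior described above.
val buggy = values.aggregate(0.0)((acc, v) => v, _ + _) // 3.0

// Correct seqOp: folds every element into the accumulator.
val correct = values.aggregate(0.0)((acc, v) => acc + v, _ + _) // 6.0
```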
Test build #71093 has finished for PR 16464 at commit
@felixcheung I made modifications and no longer save the two metrics of the DistributedLDAModel. Thanks!
import LDAWrapper._
private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
private val distributedMoel = lda.isDistributed match {
distributedModel?
Fixed. Thanks!
LGTM.
Test build #71149 has finished for PR 16464 at commit
R/pkg/R/mllib_clustering.R (Outdated)
topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
vocabulary <- callJMethod(jobj, "vocabulary")
trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
logPrior <- callJMethod(jobj, "logPrior")
I think it's more appropriate to return NULL rather than NaN for a local LDA model, since logPrior does not exist rather than being not a number.
BTW, I think we can return NULL directly according to isDistributed; otherwise, call the corresponding Scala methods. This should reduce the complexity of LDAWrapper and reduce the communication between R and Scala.
I'd prefer NA rather than NULL.
If it's numeric I think NaN is appropriate, no?
In MLlib, if the model is DistributedLDAModel, it has variables called trainingLogLikelihood and logPrior. If the model is LocalLDAModel, those variables do not exist, so we cannot tell users whether they are numeric. So I'd prefer NULL to represent non-existence.
My take is the convention in R for missing values in a structure (list, data.frame) is NA.
And since we are returning a list, we could also simply omit it if the value is not applicable or available (i.e., don't add trainingLogLikelihood to the list in L421).
In any case, what does it look like when printing the summary returned here for these values (NULL, NA) we are proposing?
When returning NULL:
stats
$docConcentration
[1] 0.09117984 0.09117993 0.09117964 0.09117969 0.09117991 0.13010149
[7] 0.09117986 0.09117983 0.09192744 0.09117969
$topicConcentration
[1] 0.1
$logLikelihood
[1] -395.4505
$logPerplexity
[1] 2.995837
$isDistributed
[1] FALSE
$vocabSize
[1] 10
$topics
SparkDataFrame[topic:int, term:array<string>, termWeights:array<double>]
$vocabulary
[1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"
$trainingLogLikelihood
NULL
$logPrior
NULL
When returning NA:
> stats
$docConcentration
[1] 0.09116572 0.13013764 0.09116557 0.09116559 0.09116584 0.09116617
[7] 0.09116576 0.09116573 0.09116569 0.09116564
$topicConcentration
[1] 0.1
$logLikelihood
[1] -392.9573
$logPerplexity
[1] 2.976949
$isDistributed
[1] FALSE
$vocabSize
[1] 10
$topics
SparkDataFrame[topic:int, term:array<string>, termWeights:array<double>]
$vocabulary
[1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"
$trainingLogLikelihood
[1] NA
$logPrior
[1] NA
They look very similar. I am fine with either of the two. Thanks!
Great!
Another data point: NA fits in as just another element of a vector or list in R, but NULL does not (NULL is usually skipped):
> as.data.frame(list(1, 2, 2))
X1 X2 X2.1
1 1 2 2
> as.data.frame(list(1, NULL, 2))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0
> as.data.frame(list(1, NA, 2))
X1 NA. X2
1 1 NA 2
In Spark SQL, we convert NULL from JVM to NA in DataFrame
@felixcheung I will update it to NA. Thanks!
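A minimal sketch of that direction (class and member names here are assumptions for illustration, not the merged LDAWrapper source): read the distributed-only metrics straight from the model and fall back to NaN for a local model, which the R side can then surface as NA based on isDistributed.

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDAModel}

// Sketch only: structure assumed, not the actual LDAWrapper.
class LDAWrapperSketch(lda: LDAModel) {
  // Read distributed-only metrics from the model itself, so nothing
  // extra has to be written to metadata on save.
  lazy val trainingLogLikelihood: Double = lda match {
    case m: DistributedLDAModel => m.trainingLogLikelihood
    case _ => Double.NaN // local model: R side can substitute NA
  }

  lazy val logPrior: Double = lda match {
    case m: DistributedLDAModel => m.logPrior
    case _ => Double.NaN
  }
}
```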
R/pkg/R/mllib_clustering.R (Outdated)
#' \item{\code{trainingLogLikelihood}}{Log likelihood of the observed tokens in the training set,
#'   given the current parameter estimates:
#'   log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters)
#'   It is only for \code{DistributedLDAModel} (i.e., optimizer = "em")}
\code{DistributedLDAModel} should be converted to a text description, since there is no class called DistributedLDAModel in SparkR.
Test build #71301 has started for PR 16464 at commit
Test build #71302 has started for PR 16464 at commit
Test build #71303 has started for PR 16464 at commit
Jenkins, retest this please.
Test build #71338 has finished for PR 16464 at commit
Test build #71383 has finished for PR 16464 at commit
expect_equal(vocabSize, 11)
expect_true(is.null(vocabulary))
expect_true(trainingLogLikelihood <= 0 & !is.nan(trainingLogLikelihood))
expect_true(logPrior <= 0 & !is.nan(logPrior))
shouldn't these two tests be !is.na?
This is great, just one minor test comment.
Test build #71418 has finished for PR 16464 at commit
LGTM, merged into master. Thanks to all of you.
## What changes were proposed in this pull request?
spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, when it should be TRUE based on the Scala code. In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.
## How was this patch tested?
Manual tests by comparing with the Scala example.
Modified the current unit test: fixed the incorrect unit test and added the necessary tests for the `summary` method.
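As a quick sanity check of the behavior this PR relies on (a sketch; `dataset` is an assumed DataFrame with a "features" vector column): fitting with optimizer "em" yields a DistributedLDAModel, so isDistributed should be TRUE and the two training metrics should be available for the summary.

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.sql.DataFrame

// `dataset` is an assumption: a DataFrame with a "features" vector column.
def checkEmModel(dataset: DataFrame): Unit = {
  val model = new LDA().setK(3).setMaxIter(10).setOptimizer("em").fit(dataset)
  assert(model.isDistributed) // "em" trains a DistributedLDAModel
  val dist = model.asInstanceOf[DistributedLDAModel]
  // These two metrics exist only on the distributed model.
  println(s"trainingLogLikelihood = ${dist.trainingLogLikelihood}")
  println(s"logPrior = ${dist.logPrior}")
}
```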