Conversation

@wangmiao1981
Contributor

What changes were proposed in this pull request?

spark.lda passes the optimizer ("em" or "online") as a string to the backend. However, LDAWrapper does not set the optimizer based on the value passed from R. As a result, with optimizer "em" the isDistributed field is FALSE, when it should be TRUE according to the Scala code.

In addition, the summary method should also return the results specific to DistributedLDAModel.
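
For context, a minimal sketch of the wrapper-side change being described (the values for `k`, `maxIter`, and `optimizer` below are assumptions for illustration, not the exact patch):

```scala
import org.apache.spark.ml.clustering.LDA

// Assumed values standing in for what the R call passes through.
val k = 10
val maxIter = 20
val optimizer = "em" // or "online", as passed from spark.lda in R

// Sketch of the fix: forward the optimizer string to the estimator so that
// "em" produces a DistributedLDAModel (isDistributed == TRUE) instead of
// silently falling back to the default "online" optimizer.
val lda = new LDA()
  .setK(k)
  .setMaxIter(maxIter)
  .setOptimizer(optimizer)
```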

How was this patch tested?

Manual tests comparing the results with the Scala example.
Modified the existing unit tests: fixed the incorrect test and added the necessary tests for the summary method.

@SparkQA

SparkQA commented Jan 4, 2017

Test build #70840 has finished for PR 16464 at commit 2777636.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 4, 2017

Test build #70845 has finished for PR 16464 at commit 6c096bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 4, 2017

Test build #70863 has started for PR 16464 at commit 14bafc1.

@SparkQA

SparkQA commented Jan 4, 2017

Test build #70889 has finished for PR 16464 at commit 26fa14b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor Author

cc @felixcheung @yanboliang A bug fix. Thanks!

Member

Since the model is referenced and persisted, is there a need to handle trainingLogLikelihood and logPrior separately like this and write them to metadata, instead of just getting them from the model when fetching the summary?

Contributor Author

In the first version, I got them from the model in the LDAWrapper class. However, when I read logPrior, I found that the loaded value is not the same as the value before saving. So I followed the logLikelihood and logPerplexity approach and saved them in the metadata.

Contributor Author

With the same dataset, Scala-side tests:
Original logPrior: -3.3387459952856338
logPrior from saved model: -0.9202435107654922

Member

that's odd, is that an issue with model persistence?

Contributor Author

logPrior is calculated from the serialized topics etc., which are also used by trainingLogLikelihood. But trainingLogLikelihood is the same for both the original and the loaded model. Let me debug more; it looks like a bug. The original MLlib implementation doesn't serialize the two parameters, as they can be calculated from other saved values. In addition, there is no unit test comparing the two values, which could be why this issue was not caught.

Member

@felixcheung Jan 6, 2017

Shouldn't it just set the optimizer, without having to check whether it is == "em"?

Contributor Author

If it is "online", it is not strictly necessary to set the optimizer, but setting it anyway makes the code cleaner. I will do it (see the sketch below).
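
A sketch of the two options under discussion (hedged; `lda` and `optimizer` are the assumed wrapper-side names from the snippet above):

```scala
// Conditional version: only handles the non-default case.
if (optimizer == "em") lda.setOptimizer("em")

// Unconditional version: simply forward whatever R passed, which covers
// both "em" and "online" and keeps the code cleaner.
lda.setOptimizer(optimizer)
```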

Member

@felixcheung Jan 6, 2017

Can you add a test for when these are not NaN? Like here.
Never mind, saw the test above.

@wangmiao1981
Contributor Author

When calculating logPrior, graph.vertices.aggregate(0.0)(seqOp, _ + _) gets different values for the two models, even though all parameters of the model are the same.

I printed out each step of seqOp, and the steps differ between the two models. I am trying to understand the logic.

@wangmiao1981
Contributor Author

wangmiao1981 commented Jan 6, 2017

In graph.vertices.aggregate(0.0)(seqOp, _ + _), the vertex sequence differs between the two models.
VertexID sequence for the original model:
-8 -6 -3 10 -5 4 11 -1 0 1 -2 6 -7 7 8 -10 -9 9 -4 3 -11 5 2
VertexID sequence for the loaded model:
3 5 2 -11 -5 4 11 -1 0 1 -2 6 -7 -8 7 -10 8 9 -9 -4 10 -6 -3

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70988 has finished for PR 16464 at commit 5f1c644.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor Author

wangmiao1981 commented Jan 6, 2017

I printed out the values at each step of seqOp: graph.vertices.aggregate(0.0)(seqOp, _ + _) just returns the seqOp value of the last vertex.
VertexID sequence for the original model:
-8 -6 -3 10 -5 4 11 -1 0 1 -2 6 -7 7 8 -10 -9 9 -4 3 -11 5 2
For this sequence, it returns the value of Vertex 2.
VertexID sequence for the loaded model:
3 5 2 -11 -5 4 11 -1 0 1 -2 6 -7 -8 7 -10 8 9 -9 -4 10 -6 -3
For this sequence, it returns the value of Vertex -3.

Actually, the value of Vertex 2 is the same in both sequences (and the same holds for Vertex -3).

It seems that graph.vertices.aggregate(0.0)(seqOp, _ + _) doesn't return the aggregated value it is supposed to return (i.e., sum(seqOp(Vertex_i))).
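
A minimal, self-contained illustration of this failure mode (an assumed toy setup, not the LDA code itself): if a seqOp ignores its accumulator, aggregate keeps only the last element's contribution within each partition, so the result depends on traversal order.

```scala
// Assumes an existing SparkContext `sc`; a single partition makes the
// per-partition fold easy to trace.
val rdd = sc.parallelize(Seq(1.0, 2.0, 3.0), numSlices = 1)

// Buggy pattern: the seqOp drops the running accumulator, so the
// partition-level fold ends up as just the last element visited.
val buggy = rdd.aggregate(0.0)((_, v) => v, _ + _)           // 3.0

// Correct pattern: the seqOp folds the accumulator, giving the true sum
// regardless of the order in which vertices/elements are visited.
val correct = rdd.aggregate(0.0)((acc, v) => acc + v, _ + _) // 6.0
```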

@wangmiao1981
Contributor Author

I think it is a bug. I will file a PR to fix it. Thanks!

@wangmiao1981
Contributor Author

@felixcheung PR #16491 is filed.

@felixcheung
Member

Let's get #16491 through so we can simply use the model instead of saving extra copies.

@wangmiao1981
Contributor Author

Sure. I will revert to the previous commit once #16491 is in.

asfgit pushed a commit that referenced this pull request Jan 7, 2017
…or for original and loaded model

## What changes were proposed in this pull request?

While adding the DistributedLDAModel training summary for SparkR, I found that the logPrior for the original and loaded models is different.
For example, in test("read/write DistributedLDAModel"), I added the test:
val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
assert(logPrior === logPrior2)
The test fails:
-4.394180878889078 did not equal -4.294290536919573

The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns different `logPrior`.

Please refer to #16464 for details.
## How was this patch tested?
Add a new unit test for testing logPrior.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16491 from wangmiao1981/ldabug.
asfgit pushed a commit that referenced this pull request Jan 7, 2017
…or for original and loaded model

Same change as above (cherry picked from commit 036b503, signed off by Joseph K. Bradley <joseph@databricks.com>).
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Jan 9, 2017
…or for original and loaded model

Same change as above; closes apache#16491.
@SparkQA

SparkQA commented Jan 9, 2017

Test build #71093 has finished for PR 16464 at commit 456e06d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor Author

@felixcheung I made the modifications and no longer save the two metrics of the distributed model.

Thanks!

import LDAWrapper._

private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
private val distributedMoel = lda.isDistributed match {
Member

distributedModel?

Contributor Author

Fixed. Thanks!

@felixcheung
Member

LGTM.

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71149 has finished for PR 16464 at commit aee8da5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
vocabulary <- callJMethod(jobj, "vocabulary")
trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
logPrior <- callJMethod(jobj, "logPrior")
Contributor

I think it's more appropriate to return NULL rather than NaN for a local LDA model, since logPrior does not exist rather than being not-a-number.
BTW, I think we can return NULL directly based on isDistributed and otherwise call the corresponding Scala methods. This should reduce the complexity of LDAWrapper and the communication between R and Scala.

Member

I'd prefer NA rather than NULL.
If it's numeric I think NaN is appropriate, no?

Contributor

In MLlib, if the model is a DistributedLDAModel, it has variables called trainingLogLikelihood and logPrior. If the model is a LocalLDAModel, those variables do not exist, so we cannot tell users whether they are numeric. That is why I'd prefer NULL, to represent nonexistence.

Member

My take is that the convention in R for missing values in a structure (list, data.frame) is NA.
And since we are returning a list, we could also simply omit the entry if the value is not applicable or available (i.e., don't add trainingLogLikelihood to the list in L421).

In any case, what does it look like when printing the summary returned here for the values (NULL, NA) we are proposing?

Contributor Author

When returning NULL:
stats
$docConcentration
[1] 0.09117984 0.09117993 0.09117964 0.09117969 0.09117991 0.13010149
[7] 0.09117986 0.09117983 0.09192744 0.09117969

$topicConcentration
[1] 0.1

$logLikelihood
[1] -395.4505

$logPerplexity
[1] 2.995837

$isDistributed
[1] FALSE

$vocabSize
[1] 10

$topics
SparkDataFrame[topic:int, term:array, termWeights:array]

$vocabulary
[1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"

$trainingLogLikelihood
NULL

$logPrior
NULL

Contributor Author

When returning NA:
> stats
$docConcentration
[1] 0.09116572 0.13013764 0.09116557 0.09116559 0.09116584 0.09116617
[7] 0.09116576 0.09116573 0.09116569 0.09116564

$topicConcentration
[1] 0.1

$logLikelihood
[1] -392.9573

$logPerplexity
[1] 2.976949

$isDistributed
[1] FALSE

$vocabSize
[1] 10

$topics
SparkDataFrame[topic:int, term:array, termWeights:array]

$vocabulary
[1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"

$trainingLogLikelihood
[1] NA

$logPrior
[1] NA

Contributor Author

They look very similar. I am fine with either of the two. Thanks!

Member

@felixcheung Jan 14, 2017

Great!
Another data point: NA fits in as just another element of a vector or list in R, but NULL does not (NULL is usually skipped):

> as.data.frame(list(1, 2, 2))
  X1 X2 X2.1
1  1  2    2
> as.data.frame(list(1, NULL, 2))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  :
  arguments imply differing number of rows: 1, 0
> as.data.frame(list(1, NA, 2))
  X1 NA. X2
1  1  NA  2

In Spark SQL, we convert NULL from the JVM to NA in a DataFrame.

Contributor Author

@felixcheung I will update it to NA. Thanks!

#' \item{\code{trainingLogLikelihood}}{Log likelihood of the observed tokens in the training set,
#' given the current parameter estimates:
#' log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters)
#' It is only for \code{DistributedLDAModel} (i.e., optimizer = "em")}
Contributor

\code{DistributedLDAModel} should be converted to a text description, since there is no class called DistributedLDAModel in SparkR.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71301 has started for PR 16464 at commit 0134a26.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71302 has started for PR 16464 at commit b72592c.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71303 has started for PR 16464 at commit 882c70d.

@wangmiao1981
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71338 has finished for PR 16464 at commit 882c70d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 14, 2017

Test build #71383 has finished for PR 16464 at commit 95a6910.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

expect_equal(vocabSize, 11)
expect_true(is.null(vocabulary))
expect_true(trainingLogLikelihood <= 0 & !is.nan(trainingLogLikelihood))
expect_true(logPrior <= 0 & !is.nan(logPrior))
Member

Shouldn't these two tests use !is.na?

@felixcheung
Member

This is great, just one minor test comment.
@yanboliang what do you think?

@SparkQA

SparkQA commented Jan 16, 2017

Test build #71418 has finished for PR 16464 at commit e133ee6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 12c8c21 Jan 16, 2017
@yanboliang
Contributor

LGTM, merged into master. Thanks for all your work.
@wangmiao1981 Could you prepare another PR for branch-2.1? That PR should only include the part that sets the correct optimizer.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…or for original and loaded model

Same change as above; closes apache#16491.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

spark.lda passes the optimizer ("em" or "online") as a string to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, when it should be TRUE according to the Scala code.

In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.

## How was this patch tested?
Manual tests comparing the results with the Scala example.
Modified the existing unit tests: fixed the incorrect test and added the necessary tests for the `summary` method.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes apache#16464 from wangmiao1981/new.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017

Same change as above; closes apache#16464.