[SPARK-19066][SparkR] SparkR LDA doesn't set optimizer correctly #16464
Conversation
Test build #70840 has finished for PR 16464 at commit
Test build #70845 has finished for PR 16464 at commit
Test build #70863 has started for PR 16464 at commit
Test build #70889 has finished for PR 16464 at commit
cc @felixcheung @yanboliang A bug fix. Thanks!
since the model is referenced and persisted, is there a need to handle trainingLogLikelihood and logPrior separately like this, writing them to metadata, instead of just getting them from the model when fetching the summary?
In the first version, I got them from the model in the LDAWrapper class. However, when I read logPrior, I found that the loaded logPrior is not the same as the value before saving. So, I followed the logLikelihood and logPerplexity approach and saved it in the metadata.
With the same dataset, Scala side tests:
Original LogPrior: -3.3387459952856338
LogPrior from saved model: -0.9202435107654922
that's odd, is that an issue with model persistence?
LogPrior is calculated based on the serialized topics etc., which are also used by trainingLogLikelihood. But trainingLogLikelihood is the same for both the original and the loaded model. Let me debug more; it looks like a bug. The original MLlib implementation doesn't serialize the two parameters, as they can be calculated from other saved values. In addition, there is no unit test comparing the two values, which could be why this issue was not caught.
shouldn't it just set optimizer - without having to check if it is == em?
If it is "online", it is not necessary to set the optimizer. But setting it anyway will make the code clean. I will do it.
can you have a test for when these are not nan? like here
nm, saw the test above.
I print out each step of
Test build #70988 has finished for PR 16464 at commit
I print out the values of each step. Actually, the value of Vertex 2 is the same in both sequences (same for Vertex -3 too). It seems that
I think it is a bug. I will file a PR to fix it. Thanks!
@felixcheung PR #16491 is filed.
let's get 16491 through so we could simply use the model instead of saving extra copies
Sure. I will revert it to the previous commit once 16491 is in.
…or for original and loaded model
## What changes were proposed in this pull request?
While adding the DistributedLDAModel training summary for SparkR, I found that the logPrior for the original and loaded models is different.
For example, in the test("read/write DistributedLDAModel"), I added the test:
val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
assert(logPrior === logPrior2)
The test fails:
-4.394180878889078 did not equal -4.294290536919573
The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns different `logPrior`.
Please refer to #16464 for details.
## How was this patch tested?
Add a new unit test for testing logPrior.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #16491 from wangmiao1981/ldabug.
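To see the aggregate pitfall behind this fix, here is a sketch on plain Scala collections (not the actual GraphX code touched by #16491): a seqOp that ignores its accumulator keeps only one element per fold, so the result depends on traversal order and, for an RDD, on partitioning.

```scala
val values = Seq(1.0, 2.0, 3.0)

// Buggy seqOp: discards the accumulator, so only the last element folded
// in survives. On an RDD the result would vary with partitioning, which
// matches the order-dependent logPrior described above.
val buggy = values.aggregate(0.0)((acc, v) => v, _ + _) // 3.0

// Correct seqOp: folds every element into the accumulator.
val correct = values.aggregate(0.0)((acc, v) => acc + v, _ + _) // 6.0
```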
Test build #71093 has finished for PR 16464 at commit
@felixcheung I made modifications and no longer save the two metrics of the DistributedLDAModel. Thanks!
import LDAWrapper._
private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
private val distributedMoel = lda.isDistributed match {
distributedModel?
Fixed. Thanks!
LGTM.
Test build #71149 has finished for PR 16464 at commit
R/pkg/R/mllib_clustering.R (Outdated)
topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
vocabulary <- callJMethod(jobj, "vocabulary")
trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
logPrior <- callJMethod(jobj, "logPrior")
I think it's more appropriate to return NULL rather than NaN for a local LDA model, since logPrior does not exist rather than being not a number.
BTW, I think we can return NULL directly according to isDistributed; otherwise, call the corresponding Scala methods. This should reduce the complexity of LDAWrapper and reduce the communication between R and Scala.
I'd prefer NA rather than NULL.
If it's numeric I think NaN is appropriate, no?
In MLlib, if the model is DistributedLDAModel, it has variables called trainingLogLikelihood and logPrior. If the model is LocalLDAModel, those variables do not exist, so we cannot tell users whether they are numeric. So I'd prefer NULL to represent non-existence.
My take is the convention in R for missing values in a structure (list, data.frame) is NA.
And since we are returning a list, we could also simply omit it if the value is not applicable or available (i.e., don't add trainingLogLikelihood to the list in L421).
In any case, what does it look like when printing the summary returned here for these values (NULL, NA) we are proposing?
When returning NULL:
stats
$docConcentration
[1] 0.09117984 0.09117993 0.09117964 0.09117969 0.09117991 0.13010149
[7] 0.09117986 0.09117983 0.09192744 0.09117969
$topicConcentration
[1] 0.1
$logLikelihood
[1] -395.4505
$logPerplexity
[1] 2.995837
$isDistributed
[1] FALSE
$vocabSize
[1] 10
$topics
SparkDataFrame[topic:int, term:array<string>, termWeights:array<double>]
$vocabulary
[1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"
$trainingLogLikelihood
NULL
$logPrior
NULL
When returning NA:
> stats
$docConcentration
[1] 0.09116572 0.13013764 0.09116557 0.09116559 0.09116584 0.09116617
[7] 0.09116576 0.09116573 0.09116569 0.09116564
$topicConcentration
[1] 0.1
$logLikelihood
[1] -392.9573
$logPerplexity
[1] 2.976949
$isDistributed
[1] FALSE
$vocabSize
[1] 10
$topics
SparkDataFrame[topic:int, term:array<string>, termWeights:array<double>]
$vocabulary
[1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"
$trainingLogLikelihood
[1] NA
$logPrior
[1] NA
They look very similar. I am fine with either of the two. Thanks!
Great!
Another data point: NA fits in as just another element of a vector or list in R, but NULL does not (NULL is usually skipped):
> as.data.frame(list(1, 2, 2))
X1 X2 X2.1
1 1 2 2
> as.data.frame(list(1, NULL, 2))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0
> as.data.frame(list(1, NA, 2))
X1 NA. X2
1 1 NA 2
In Spark SQL, we convert NULL from JVM to NA in DataFrame
@felixcheung I will update it to NA. Thanks!
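A minimal sketch of that direction (class and member names here are assumptions for illustration, not the merged LDAWrapper source): read the distributed-only metrics straight from the model and fall back to NaN for a local model, which the R side can then surface as NA based on isDistributed.

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDAModel}

// Sketch only: structure assumed, not the actual LDAWrapper.
class LDAWrapperSketch(lda: LDAModel) {
  // Read distributed-only metrics from the model itself, so nothing
  // extra has to be written to metadata on save.
  lazy val trainingLogLikelihood: Double = lda match {
    case m: DistributedLDAModel => m.trainingLogLikelihood
    case _ => Double.NaN // local model: R side can substitute NA
  }

  lazy val logPrior: Double = lda match {
    case m: DistributedLDAModel => m.logPrior
    case _ => Double.NaN
  }
}
```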
R/pkg/R/mllib_clustering.R (Outdated)
#' \item{\code{trainingLogLikelihood}}{Log likelihood of the observed tokens in the training set,
#'   given the current parameter estimates:
#'   log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters)
#'   It is only for \code{DistributedLDAModel} (i.e., optimizer = "em")}
\code{DistributedLDAModel} should be converted to a text description, since there is no class called DistributedLDAModel in SparkR.
Test build #71301 has started for PR 16464 at commit
Test build #71302 has started for PR 16464 at commit
Test build #71303 has started for PR 16464 at commit
Jenkins, retest this please.
Test build #71338 has finished for PR 16464 at commit
Test build #71383 has finished for PR 16464 at commit
expect_equal(vocabSize, 11)
expect_true(is.null(vocabulary))
expect_true(trainingLogLikelihood <= 0 & !is.nan(trainingLogLikelihood))
expect_true(logPrior <= 0 & !is.nan(logPrior))
shouldn't these two tests be !is.na?
This is great, just one minor test comment.
Test build #71418 has finished for PR 16464 at commit
LGTM, merged into master. Thanks to all of you.
## What changes were proposed in this pull request?
spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, when it should be TRUE based on the Scala code. In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.
## How was this patch tested?
Manual tests by comparing with the Scala example.
Modified the current unit test: fixed the incorrect unit test and added the necessary tests for the `summary` method.
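As a quick sanity check of the behavior this PR relies on (a sketch; `dataset` is an assumed DataFrame with a "features" vector column): fitting with optimizer "em" yields a DistributedLDAModel, so isDistributed should be TRUE and the two training metrics should be available for the summary.

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.sql.DataFrame

// `dataset` is an assumption: a DataFrame with a "features" vector column.
def checkEmModel(dataset: DataFrame): Unit = {
  val model = new LDA().setK(3).setMaxIter(10).setOptimizer("em").fit(dataset)
  assert(model.isDistributed) // "em" trains a DistributedLDAModel
  val dist = model.asInstanceOf[DistributedLDAModel]
  // These two metrics exist only on the distributed model.
  println(s"trainingLogLikelihood = ${dist.trainingLogLikelihood}")
  println(s"logPrior = ${dist.logPrior}")
}
```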