[SPARK-20082][ml] LDA incremental model learning #17461
Conversation
hhbyyh
left a comment
Thanks for the PR. I appreciate your work.
Several primary things for your consideration:
-
Since spark.ml is the primary API for MLlib, I would recommend an integrated change that also covers LDA in spark.ml.
-
It would be confusing if we only support an initial model for the online optimizer; please try to include DistributedLDAModel (EMLDAOptimizer) in the change. Regarding your question in the JIRA, I'm not sure whether it theoretically makes sense to add new documents after setting the initial model for EMLDAOptimizer (I'm leaning towards no...). @jkbradley to confirm.
-
More unit tests should be added, but I understand we should settle on the solution first.
We should probably settle the primary issues first, and I can help follow up with more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking we should move all the parameter checks into run, since users may set parameters in different orders.
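The point about setter order can be illustrated with a minimal sketch (class shape and field names here are illustrative, not the PR's actual code): validating dependent parameters inside run() means the check sees the final configuration, regardless of the order in which setters were called.

```scala
// Sketch only: a cross-parameter check placed in run() instead of in a setter.
class SketchLDA {
  private var optimizer: String = "em"
  private var initialModel: Option[String] = None

  def setOptimizer(o: String): this.type = { optimizer = o; this }

  // If the check lived here, calling setInitialModel before setOptimizer("online")
  // would fail spuriously, because `optimizer` still holds its default value.
  def setInitialModel(path: String): this.type = { initialModel = Some(path); this }

  def run(): Unit = {
    // Checked once, after all setters have been applied.
    require(initialModel.isEmpty || optimizer == "online",
      "initialModel is currently only supported by the online optimizer")
    // ... training would follow ...
  }
}
```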
@mdespriee is it still active? Could you address the comment above?
Yes, still active. I've been very busy lately.
Regarding 1/ -> will do
2/ -> still waiting for a comment from @jkbradley actually.
3/ -> will do
I'll try to move on in the coming days.
I made some manual tests as well; see here: https://gist.github.com/mdespriee/8ae604036732f39f6345ee91acf777a0 This code could be added to spark-examples, just tell me.
Putting [WIP] back, as there is a problem with serialization of …
I am not reviewing this. I think @hhbyyh is.
Hi @hhbyyh, @jkbradley
For the initial model, I think you can just use a String param for the model path; refer to https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L66
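The suggestion above — exposing the initial model as a String path param in the KMeans style — could be sketched roughly like this (trait and param names are illustrative, not the PR's final API; requires spark-mllib on the classpath):

```scala
import org.apache.spark.ml.param.{Param, Params}

// Sketch: a shared param trait holding the path to a previously saved model,
// following the spark.ml String-param pattern the reviewer points to.
trait HasInitialModelPath extends Params {

  // The path, not the model object itself, is the param value, which keeps
  // the param serializable for Pipeline persistence.
  final val initialModel: Param[String] = new Param[String](this, "initialModel",
    "path to a previously saved LocalLDAModel used to initialize training")

  final def getInitialModel: String = $(initialModel)
}
```

The estimator would then load the model from this path at fit time, rather than carrying a model object in its param map.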
Ok, will do with a model path. I'll push an update shortly, and I think
this PR will be ready.
(In KMeans, the initialModel API is available in mllib but not at the ML level
yet. Could be another JIRA...)
Hi @hhbyyh,
Hi @hhbyyh, @jkbradley
Got it. Will make a pass today.
hhbyyh
left a comment
Thanks for the PR. This should be a good feature to add.
I need to download the code and run some tests to confirm the implementation. Will continue the review tomorrow.
docs/mllib-clustering.md (outdated)
    checkpointing can help reduce shuffle file sizes on disk and help with
    failure recovery.
    * `initialModel`: this parameter, only supported by `OnlineLDAOptimizer`,
      specifies a previously trained LDA model as a start point instead of
previously trained LocalLDAModel
     * bin/run-example ml.LDAIncrementalExample
     * }}}
     */
    object LDAIncrementalExample {
Maybe OnlineLDAIncrementalExample ?
    val spark = SparkSession
      .builder()
      .master("local[*]")
Other examples usually do not specify the master.
      .master("local[*]")
      .appName(s"${this.getClass.getSimpleName}")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
Maybe remove this line.
    import spark.implicits._

    val dataset = spark.read.text("/home/mde/workspaces/spark-project/spark/docs/*md").toDF("doc")
Can we use the LDA sample data that ships with Spark?
I updated the example following your suggestion. It's more consistent with LDAExample this way.
hhbyyh
left a comment
Hi @mdespriee, in #18610 @yanboliang provided some infrastructure for initial models.
We may follow the general practice in #18610 after it's merged. I'd like to know your opinion.
    EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel,
    LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
    OnlineLDAOptimizer => OldOnlineLDAOptimizer}
    import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel, EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel, LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel, OnlineLDAOptimizer => OldOnlineLDAOptimizer}
Better to follow the original import formatting.
     * For Online optimizer only (currently): [[optimizer]] = "online".
     * An initial model to be used as a starting point for the learning, instead of a random
     * initialization. Provide the path to a serialized trained LDAModel.
LDAModel => LocalLDAModel
Hi @hhbyyh
After a quick look, it seems ok to follow this. I'll try to merge locally and see how it fits. I'll keep you updated. Do you think #18610 will be merged shortly? (cc @yanboliang)
May I know when this change will be included in an official release? I downloaded Spark 2.3.1 and still do NOT find that this method (lda.setInitialModel) has been added.
Hi @sprintcheng,
Can one of the admins verify this patch?
What changes were proposed in this pull request?
This PR proposes the ability to re-use a previously trained LDA model as a starting point for the online optimizer, for incremental learning.
It adds an initialModel parameter at the mllib level, which is used to initialize the alpha and lambda (doc concentration, topic matrix) of the OnlineLDAOptimizer instead of a random initialization.
How was this patch tested?
Unit tests: mllib, ml, java api, python api
Manual tests: see added example "LDAIncrementalExample.scala"
NB: This is my first contribution; my apologies if I miss something in the PR process or Spark standards.
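The incremental workflow the PR description outlines can be sketched as follows. Note that setInitialModel is the parameter proposed in this PR and is NOT part of any released Spark version; the save path and the re-use of the same sample file for "batch 2" are purely illustrative.

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.SparkSession

// Sketch of the proposed incremental-learning flow (hypothetical API).
val spark = SparkSession.builder().appName("IncrementalLDA").getOrCreate()

// First batch: train from the default random initialization.
val batch1 = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")
val model1 = new LDA()
  .setK(10)
  .setOptimizer("online")
  .fit(batch1)
model1.save("/tmp/lda-model-1")  // illustrative path

// Later batch: initialize alpha/lambda from the saved model instead of
// starting over, so training continues from where the first run left off.
val batch2 = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")
val model2 = new LDA()
  .setK(10)
  .setOptimizer("online")
  .setInitialModel("/tmp/lda-model-1")  // parameter proposed by this PR
  .fit(batch2)
```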