@mdespriee mdespriee commented Mar 28, 2017

What changes were proposed in this pull request?

This PR proposes the ability to reuse a previously trained LDA model as a starting point for the online optimizer, enabling incremental learning.

  • I add an initialModel parameter at the mllib level, which is used to initialize the alpha and lambda (doc concentration, topic matrix) of the OnlineLDAOptimizer instead of random initialization.
  • It's only supported for the online optimizer; any use with the EM optimizer will throw an exception.
  • Consistency of LDA parameters is checked between models (same k, vocabulary size, ...)
  • At the ml API level, initialModel is provided as a path to a serialized trained model (a path rather than an LDAModel object, to fit better with the Params API)
  • I reflected the change at the ml API level, Python API, and Java API
  • I added a mention of this in the docs
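To make the intended workflow concrete, here is a minimal sketch of incremental training under the proposed API. This is hypothetical usage: `setInitialModel` was never merged into Spark, and the sample-data path assumes a Spark source checkout.

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.SparkSession

object IncrementalLdaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IncrementalLdaSketch").getOrCreate()

    // "features" must be a vector column of term counts; secondBatch stands in
    // for newly arrived documents (here just reusing the sample data).
    val firstBatch = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")
    val secondBatch = firstBatch

    // First pass: train with the online optimizer and persist the model.
    val first = new LDA().setK(10).setOptimizer("online").fit(firstBatch)
    first.save("/tmp/lda-model-v1")

    // Later pass: warm-start from the saved model instead of random init.
    val second = new LDA()
      .setK(10)                             // must match the initial model
      .setOptimizer("online")               // only supported for "online"
      .setInitialModel("/tmp/lda-model-v1") // proposed param: path to a saved model
      .fit(secondBatch)

    spark.stop()
  }
}
```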

How was this patch tested?

Unit tests: mllib, ml, java api, python api
Manual tests: see added example "LDAIncrementalExample.scala"

NB: This is my first contribution, so please forgive me if I miss something in the PR process or Spark standards.

Contributor

@hhbyyh hhbyyh left a comment

Thanks for the PR. Appreciate your work.

Several primary things for your consideration:

  1. As spark.ml API is the primary API for MLlib, I would recommend an integrated change that includes LDA in spark.ml.

  2. It would bring some confusion if we only support an initial model for the online optimizer; please try to include DistributedModel (EMOptimizer) in the change. Regarding your question in the JIRA, I'm not sure it theoretically makes sense to add new documents after setting the initial model for EMOptimizer (I'm leaning towards no...). @jkbradley to confirm

  3. More unit tests should be added, but I understand we should settle on the approach first.

We should probably settle the primary issues first, and I can help follow up with more details.

Contributor

I'm thinking we should move all the parameter checks into run, since users may set parameters in different orders.

@HyukjinKwon
Member

@mdespriee is this still active? Could you address the comments above?

@mdespriee
Author

mdespriee commented May 11, 2017 via email

@mdespriee mdespriee force-pushed the SPARK-20082_LDA_online_learning branch from 707286c to fa14304 on May 24, 2017
@mdespriee mdespriee force-pushed the SPARK-20082_LDA_online_learning branch from fa14304 to f5b5d38 on June 28, 2017
@mdespriee mdespriee changed the title [SPARK-20082][ml][WIP] LDA incremental model learning [SPARK-20082][ml] LDA incremental model learning Jun 28, 2017
@mdespriee
Author

I ran some manual tests as well, see here: https://gist.github.com/mdespriee/8ae604036732f39f6345ee91acf777a0

This code could be added to spark-examples, just tell me.

@mdespriee mdespriee changed the title [SPARK-20082][ml] LDA incremental model learning [SPARK-20082][ml][WIP] LDA incremental model learning Jul 2, 2017
@mdespriee
Author

mdespriee commented Jul 2, 2017

Putting [WIP] back, as there is a problem with the serialization of the initialModel param.
I think I underestimated the impact of using an object like an LDAModel as a plain Param: it would require implementing a dedicated JsonEncoder for it in Param.
That's a bit tricky if I don't want to introduce a dependency from Param on LDA...
@HyukjinKwon any advice?

@HyukjinKwon
Member

I am not reviewing this. I think @hhbyyh is.

@mdespriee
Author

Hi @hhbyyh, @jkbradley
a gentle ping on this PR. Could you have a look at the code and give me your opinion on my question above (the use of the Param API to provide a previous model, and its impact on serialization)? Thanks!

@hhbyyh
Contributor

hhbyyh commented Jul 20, 2017

For the initial model, I think you can just use a String param for the model path. Refer to https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L66
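Following the KMeans pattern referenced above, a String-typed Param holding the model path might be declared roughly as follows. This is an illustrative sketch only; the trait name, param name, and doc text are assumptions, not the actual Spark code.

```scala
import org.apache.spark.ml.param.{Param, Params}

// Hypothetical trait mixing in an initial-model-path param. A plain String
// param avoids serializing an LDAModel object: only the path is persisted
// alongside the other params, and the model is loaded inside fit()/run().
trait HasInitialModelPath extends Params {

  final val initialModel: Param[String] = new Param[String](this,
    "initialModel", "path to a previously saved LocalLDAModel used to " +
      "initialize the online optimizer instead of random initialization")

  final def getInitialModel: String = $(initialModel)
}
```

Loading the model lazily at fit time also sidesteps the JsonEncoder issue raised earlier, since a String already has a JSON representation.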

@mdespriee
Author

mdespriee commented Jul 21, 2017 via email

@mdespriee mdespriee changed the title [SPARK-20082][ml][WIP] LDA incremental model learning [SPARK-20082][ml] LDA incremental model learning Aug 6, 2017
@mdespriee
Author

Hi @hhbyyh,
This PR is ready for a review. Thanks !

@mdespriee
Author

Hi @hhbyyh, @jkbradley
a gentle ping on this PR. It's not WIP anymore, and ready for a review.

@hhbyyh
Contributor

hhbyyh commented Aug 25, 2017

Got it. Will make a pass today.

Contributor

@hhbyyh hhbyyh left a comment

Thanks for the PR. This should be a good feature to add.
I need to download the code and run some tests to confirm the implementation. Will continue the review tomorrow.

checkpointing can help reduce shuffle file sizes on disk and help with
failure recovery.
* `initialModel`: this parameter, only supported by `OnlineLDAOptimizer`,
specifies a previously trained LDA model as a start point instead of
Contributor

previously trained LocalLDAModel

* bin/run-example ml.LDAIncrementalExample
* }}}
*/
object LDAIncrementalExample {
Contributor

Maybe OnlineLDAIncrementalExample ?


val spark = SparkSession
.builder()
.master("local[*]")
Contributor

Other examples usually do not specify a master.

.master("local[*]")
.appName(s"${this.getClass.getSimpleName}")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
Contributor

Maybe remove this line.


import spark.implicits._

val dataset = spark.read.text("/home/mde/workspaces/spark-project/spark/docs/*md").toDF("doc")
Contributor

can we use LDA sample data in Spark?

@mdespriee
Author

I updated the example following your suggestion. It's more consistent with LDAExample this way.

remove setLogLevel
review doc
@mdespriee mdespriee force-pushed the SPARK-20082_LDA_online_learning branch from 88a17c7 to 31cd11b on August 27, 2017
Contributor

@hhbyyh hhbyyh left a comment

Hi @mdespriee, in #18610 @yanboliang provided some infrastructure for initial models.
We may follow the general practice in #18610 after it's merged. I'd like to know your opinion.

EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel,
LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
OnlineLDAOptimizer => OldOnlineLDAOptimizer}
import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel, EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel, LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel, OnlineLDAOptimizer => OldOnlineLDAOptimizer}
Contributor

Better follow the original format

* For Online optimizer only (currently): [[optimizer]] = "online".
* An initial model to be used as a starting point for the learning, instead of a random
* initialization. Provide the path to a serialized trained LDAModel.
Contributor

LDAModel => LocalLDAModel

Author

Hi @hhbyyh
After a quick look, it seems OK to follow this approach. I'll try to merge it locally and see how it fits, and I'll keep you updated. Do you think #18610 will be merged soon? (cc @yanboliang)

@sprintcheng

May I know when this change will be included in an official release? I downloaded Spark 2.3.1 and still do NOT find that this method (lda.setInitialModel) has been added.

@mdespriee
Author

Hi @sprintcheng,
This PR is stale and no longer mergeable. I haven't had any feedback from the Spark maintainers in more than a year. @hhbyyh suggested waiting for #18610, which is also stale.
If there is enough interest in this feature, I could update it quickly and get it ready to merge.
I suggest we discuss this in the original JIRA (https://issues.apache.org/jira/browse/SPARK-20082)

@AmplabJenkins

Can one of the admins verify this patch?

@mdespriee mdespriee closed this Jun 13, 2019