[SPARK-20082][ml] LDA incremental model learning #17461
Conversation
hhbyyh
left a comment
Thanks for the PR. I appreciate your work.
Several primary things for your consideration:
-
Since spark.ml is the primary API for MLlib, I would recommend an integrated change that also covers LDA in spark.ml.
-
It would be confusing if we only support an initial model for the online optimizer; please try to include DistributedLDAModel (EMLDAOptimizer) in the change. Regarding your question in the JIRA, I'm not sure whether it theoretically makes sense to add new documents after setting the initial model for EMLDAOptimizer (I'm leaning towards no...). @jkbradley to confirm.
-
More unit tests should be added, but I understand we should settle on the solution first.
We should probably settle the primary issues first, and I can help follow up with more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking we should move all the parameter checks into run, since users may set parameters in different orders.
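The point about setter order can be illustrated with a minimal sketch (class shape and field names here are illustrative, not the PR's actual code): validating dependent parameters inside run() means the check sees the final configuration, regardless of the order in which setters were called.

```scala
// Sketch only: a cross-parameter check placed in run() instead of in a setter.
class SketchLDA {
  private var optimizer: String = "em"
  private var initialModel: Option[String] = None

  def setOptimizer(o: String): this.type = { optimizer = o; this }

  // If the check lived here, calling setInitialModel before setOptimizer("online")
  // would fail spuriously, because `optimizer` still holds its default value.
  def setInitialModel(path: String): this.type = { initialModel = Some(path); this }

  def run(): Unit = {
    // Checked once, after all setters have been applied.
    require(initialModel.isEmpty || optimizer == "online",
      "initialModel is currently only supported by the online optimizer")
    // ... training would follow ...
  }
}
```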
@mdespriee is it still active? Could you address the comment above?
Yes, still active. I've been very busy lately.
Regarding 1/ -> will do
2/ -> still waiting for a comment from @jkbradley actually.
3/ -> will do
I'll try to move on in the coming days.
I made some manual tests as well; see here: https://gist.github.com/mdespriee/8ae604036732f39f6345ee91acf777a0 This code could be added to spark-examples, just tell me.
Putting [WIP] back, as there is a problem with serialization of …
I am not reviewing this. I think @hhbyyh is.
Hi @hhbyyh, @jkbradley
For the initial model, I think you can just use a String param for the model path; refer to https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L66
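The suggestion above — exposing the initial model as a String path param in the KMeans style — could be sketched roughly like this (trait and param names are illustrative, not the PR's final API; requires spark-mllib on the classpath):

```scala
import org.apache.spark.ml.param.{Param, Params}

// Sketch: a shared param trait holding the path to a previously saved model,
// following the spark.ml String-param pattern the reviewer points to.
trait HasInitialModelPath extends Params {

  // The path, not the model object itself, is the param value, which keeps
  // the param serializable for Pipeline persistence.
  final val initialModel: Param[String] = new Param[String](this, "initialModel",
    "path to a previously saved LocalLDAModel used to initialize training")

  final def getInitialModel: String = $(initialModel)
}
```

The estimator would then load the model from this path at fit time, rather than carrying a model object in its param map.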
Ok, will do with a model path. I'll push an update shortly, and I think
this PR will be ready.
(In KMeans, the initialModel API is available in mllib but not at the ML level
yet. Could be another JIRA...)
Hi @hhbyyh,
Hi @hhbyyh, @jkbradley
Got it. Will make a pass today.
hhbyyh
left a comment
Thanks for the PR. This should be a good feature to add.
I need to download the code and run some tests to confirm the implementation. Will continue the review tomorrow.
docs/mllib-clustering.md (outdated)
    checkpointing can help reduce shuffle file sizes on disk and help with
    failure recovery.
    * `initialModel`: this parameter, only supported by `OnlineLDAOptimizer`,
      specifies a previously trained LDA model as a start point instead of
previously trained LocalLDAModel
     * bin/run-example ml.LDAIncrementalExample
     * }}}
     */
    object LDAIncrementalExample {
Maybe OnlineLDAIncrementalExample ?
    val spark = SparkSession
      .builder()
      .master("local[*]")
Other examples usually do not specify the master.
      .master("local[*]")
      .appName(s"${this.getClass.getSimpleName}")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
Maybe remove this line.
    import spark.implicits._

    val dataset = spark.read.text("/home/mde/workspaces/spark-project/spark/docs/*md").toDF("doc")
Can we use the LDA sample data that ships with Spark?
I updated the example following your suggestion. It's more consistent with LDAExample this way.
hhbyyh
left a comment
Hi @mdespriee, in #18610 @yanboliang provided some infrastructure for initial models.
We may follow the general practice in #18610 after it's merged. I'd like to know your opinion.
    EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel,
    LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
    OnlineLDAOptimizer => OldOnlineLDAOptimizer}
    import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel, EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel, LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel, OnlineLDAOptimizer => OldOnlineLDAOptimizer}
Better to follow the original import formatting.
     * For Online optimizer only (currently): [[optimizer]] = "online".
     * An initial model to be used as a starting point for the learning, instead of a random
     * initialization. Provide the path to a serialized trained LDAModel.
LDAModel => LocalLDAModel
Hi @hhbyyh
After a quick look, it seems ok to follow this. I'll try to merge locally and see how it fits. I'll keep you updated. Do you think #18610 will be merged shortly? (cc @yanboliang)
May I know when this change will be included in an official release? I downloaded Spark 2.3.1 and still do NOT find that this method (lda.setInitialModel) has been added.
Hi @sprintcheng,
Can one of the admins verify this patch?
What changes were proposed in this pull request?
This PR proposes the ability to re-use a previously trained LDA model as a starting point for the online optimizer, for incremental learning.
It adds an initialModel parameter at the mllib level, which is used to initialize the alpha and lambda (doc concentration, topic matrix) of the OnlineLDAOptimizer instead of a random initialization.
How was this patch tested?
Unit tests: mllib, ml, java api, python api
Manual tests: see added example "LDAIncrementalExample.scala"
NB: This is my first contribution; my apologies if I miss something in the PR process or Spark standards.
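The incremental workflow the PR description outlines can be sketched as follows. Note that setInitialModel is the parameter proposed in this PR and is NOT part of any released Spark version; the save path and the re-use of the same sample file for "batch 2" are purely illustrative.

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.SparkSession

// Sketch of the proposed incremental-learning flow (hypothetical API).
val spark = SparkSession.builder().appName("IncrementalLDA").getOrCreate()

// First batch: train from the default random initialization.
val batch1 = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")
val model1 = new LDA()
  .setK(10)
  .setOptimizer("online")
  .fit(batch1)
model1.save("/tmp/lda-model-1")  // illustrative path

// Later batch: initialize alpha/lambda from the saved model instead of
// starting over, so training continues from where the first run left off.
val batch2 = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")
val model2 = new LDA()
  .setK(10)
  .setOptimizer("online")
  .setInitialModel("/tmp/lda-model-1")  // parameter proposed by this PR
  .fit(batch2)
```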