Comments: PySpark with Latent Dirichlet Allocation #2

Comments for my blog post on Latent Dirichlet Allocation with PySpark: https://sean.lane.sh/blog/2016/PySpark_and_LDA
Hi,

A very nice post, thank you very much! Do you know how to get the topic distribution for each training document? I read a bit of https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda, and based on that it should be possible to get the topic distribution for each training document through topTopicsPerDocument, but I got an error saying that there is no such attribute.

Cheers,
Hi Doudou,

Unfortunately, as of the current version of PySpark (2.1), there isn't a way to do that. However, it is possible with the regular Spark project. PySpark is largely a Python wrapper over the original Apache Spark project, which is written in Scala. You can use the topTopicsPerDocument method found here: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel

And while it currently isn't possible in PySpark, you can follow the same process I've written about in this post in Spark to achieve the same results. Good luck!

Sean Lane
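For context, here is a minimal PySpark sketch of the limitation described above, assuming Spark 2.x and the RDD-based MLlib API; the toy corpus, app name, and parameter values are illustrative and not taken from the post. Topic-level summaries are exposed through the Python wrapper, but the per-document distributions are not.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="lda-mllib-sketch")

# Toy corpus: [document id, vector of term counts] -- illustrative data only.
corpus = sc.parallelize([
    [0, Vectors.dense([1.0, 2.0, 0.0, 1.0])],
    [1, Vectors.dense([0.0, 1.0, 3.0, 0.0])],
    [2, Vectors.dense([2.0, 0.0, 1.0, 1.0])],
])

model = LDA.train(corpus, k=2)

# Topic-level summaries are available from the Python wrapper...
print(model.describeTopics(maxTermsPerTopic=3))
print(model.topicsMatrix())

# ...but per-document topic distributions (topTopicsPerDocument) live only on
# the Scala DistributedLDAModel and cannot be reached from this wrapper in 2.1.
```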
Hi Sean Lane,

Thank you very much for your reply! I tried 'transform' after fitting the LDA as described on https://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda. It sometimes works but sometimes does not. Do you have a clue why that is?

All the best,
Doudou
It's hard to tell, but I would suggest reading through the errors that occur when the transform call fails.
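For reference, a minimal sketch of the DataFrame-based approach Doudou describes, assuming PySpark 2.x; the toy documents, column names, and parameter values below are illustrative, not taken from the post.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-transform-sketch").getOrCreate()

# Toy documents -- illustrative data only.
docs = spark.createDataFrame(
    [(0, "spark pyspark lda topic modelling"),
     (1, "topic distributions for each training document"),
     (2, "latent dirichlet allocation in spark")],
    ["id", "text"],
)

# Tokenize and build term-count vectors for LDA's featuresCol.
tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
cv_model = CountVectorizer(inputCol="words", outputCol="features").fit(tokens)
vectorized = cv_model.transform(tokens)

# Fit the DataFrame-based LDA and call transform() on the training data.
lda = LDA(k=3, maxIter=20, featuresCol="features")
model = lda.fit(vectorized)

# transform() appends a "topicDistribution" column: one probability per topic, per document.
model.transform(vectorized).select("id", "topicDistribution").show(truncate=False)
```

Running the fit/transform pair end to end on a small sample like this can help narrow down whether intermittent failures come from the input data or from the model itself.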
Hi Sean Lane,

This is a great post, and I am thinking about including a Java version of it in my upcoming book, Data Algorithms, 2nd Edition. One question: in the last line of your code you refer to "topic_val", which is not defined anywhere in the code. Should that be "topic_indices"?

Thank you,
Hi Mahmoud,

Thank you for pointing that out. It was actually a holdover from an earlier iteration of this code, where I was experimenting with a different number of topics on a different dataset. I have corrected the error, along with some other inconsistencies that I noticed, and it should be correct now. Good luck with your book!

Thanks,
Sean
Thank you very much, Sean!