Comments: PySpark with Latent Dirichlet Allocation #2

seanlane · 2016-05-10T12:59:08Z

Comments for my blog post on Latent Dirichlet Allocation with PySpark: https://sean.lane.sh/blog/2016/PySpark_and_LDA

DoudouT · 2017-02-01T13:12:13Z

Hi,

A very nice post! Thank you very much!

Do you know how to get the topic distribution for each training document? I read a bit on https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda. Based on that, it should be possible to get the topic distribution for each training document through topTopicsPerDocument, but I got an error saying that there is no such attribute.

Cheers,
Doudou

seanlane · 2017-02-05T22:57:44Z

Hi Doudou,

Unfortunately, as of the current version of PySpark (2.1), there isn't a way to do that. However, it is possible with the regular Spark project. PySpark is largely a Python wrapper over the original Apache Spark project, which is written in Scala. You can use the topTopicsPerDocument method found here: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel

And while it currently isn't possible in PySpark, you can follow the same process I've written about in this post in Spark to achieve the same results. Good luck!

Sean Lane

DoudouT · 2017-02-07T10:21:19Z

Hi Sean Lane,

Thank you very much for your reply! I tried 'transform' after fitting the LDA model as described at https://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda. It sometimes works and sometimes does not. Do you have any idea why that is?

All the best,
Doudou

seanlane · 2017-02-07T18:02:32Z

It's hard to tell, but I would suggest reading through the errors raised when the transform method fails. They should help you understand what is going wrong. Good luck!
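
For reference, here is a minimal sketch of the 'transform' route mentioned above. It assumes the DataFrame-based pyspark.ml API (not the RDD-based pyspark.mllib API linked earlier in this thread) and an input DataFrame with a `tokens` column of word lists. The function names, the column names other than the default `topicDistribution`, and the parameter values are illustrative assumptions, not code from the blog post. The pure-Python helper mimics what topTopicsPerDocument returns by ranking each document's topic weights.

```python
def top_topics(distribution, k=3):
    """Return the k highest-weighted (topic_index, weight) pairs, heaviest first."""
    ranked = sorted(enumerate(distribution), key=lambda pair: pair[1], reverse=True)
    return [(index, float(weight)) for index, weight in ranked[:k]]


def topics_per_document(tokenized_df, num_topics=3, k=2):
    """Fit LDA on a DataFrame with a `tokens` column (array of strings) and
    return (tokens, top topics) for every training document.

    Assumption: `tokenized_df` and the `tokens` column name are hypothetical.
    """
    # Imported here so that top_topics() stays usable without Spark installed.
    from pyspark.ml.clustering import LDA
    from pyspark.ml.feature import CountVectorizer

    # Turn each token list into a sparse term-count vector.
    vectorizer = CountVectorizer(inputCol="tokens", outputCol="features")
    features = vectorizer.fit(tokenized_df).transform(tokenized_df)

    model = LDA(k=num_topics, maxIter=20, seed=1).fit(features)

    # transform() appends a `topicDistribution` vector column to each row.
    scored = model.transform(features)
    return [
        (row["tokens"], top_topics(row["topicDistribution"].toArray(), k))
        for row in scored.select("tokens", "topicDistribution").collect()
    ]
```

Collecting every row to the driver is only reasonable for small corpora; for larger ones the ranking would more likely be applied on the cluster rather than after collect().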

mahmoudparsian · 2017-07-06T20:47:23Z

Hi Sean Lane,

This is a great post and I am thinking about posting the Java version of it in my upcoming book: Data Algorithms, 2nd Edition.

One question: in the last line of your code you refer to "topic_val", which is not defined anywhere in the code. Should that be "topic_indices"?

Thank you,
best regards,
Mahmoud Parsian

seanlane · 2017-07-07T03:39:31Z

Hi Mahmoud,

Thank you for pointing that out. It was a holdover from an earlier iteration of this code, where I was experimenting with different numbers of topics on a different dataset. I have corrected the error, along with some other inconsistencies that I noticed, so it should be correct now. Good luck with your book!

Thanks,
Sean
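
To illustrate the step being discussed, here is a small sketch of how a variable like topic_indices is typically consumed. It assumes topic_indices has the shape that PySpark's LDAModel.describeTopics returns, a list of (term indices, term weights) pairs, one per topic. The vocabulary, the weights, and the helper name label_topics are made-up stand-ins for illustration, not the code from the post.

```python
# Hypothetical stand-ins: in a real job `vocab` would come from the corpus and
# `topic_indices` from LDAModel.describeTopics(maxTermsPerTopic=...).
vocab = ["spark", "python", "topic", "model", "data"]
topic_indices = [
    ([0, 1, 4], [0.5, 0.3, 0.2]),
    ([2, 3, 4], [0.6, 0.3, 0.1]),
]


def label_topics(topic_indices, vocab):
    """Map each (term indices, term weights) pair to a list of (word, weight) pairs."""
    return [
        [(vocab[i], w) for i, w in zip(indices, weights)]
        for indices, weights in topic_indices
    ]


topics = label_topics(topic_indices, vocab)
for number, terms in enumerate(topics):
    # Print only the words; the weights stay available in `topics`.
    print("Topic {}: {}".format(number, ", ".join(word for word, _ in terms)))
```

Using one consistently named variable for the describeTopics result avoids the kind of undefined-name error reported above.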

mahmoudparsian · 2017-07-09T19:56:38Z

Thank you very much Sean!
