
Latent Dirichlet Allocation

This project is based on the paper "Latent Dirichlet Allocation" by D. Blei, A. Ng, and M. Jordan: https://dl.acm.org/doi/pdf/10.5555/944919.944937. Latent Dirichlet Allocation (LDA) estimates per-document topic distributions and per-topic word distributions in a generative model, which can then be used to infer topic distributions and word-topic assignments for new documents. We use this modeling capability to evaluate LDA for document classification with a significantly reduced number of features compared to a bag-of-words classification method.
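Concretely, the generative process the paper describes can be sketched as follows (a minimal illustration of the smoothed variant with assumed prior values; this is not code from this repo):

import numpy as np

rng = np.random.default_rng(0)
num_topics, vocab_size, doc_len = 10, 5000, 100
alpha, beta = 0.1, 0.01  # symmetric Dirichlet priors (assumed values)

# Topic-word distributions: one categorical distribution over the vocabulary per topic.
phi = rng.dirichlet(np.full(vocab_size, beta), size=num_topics)
# Topic distribution for a single document.
theta = rng.dirichlet(np.full(num_topics, alpha))
# Draw a topic assignment for each word position, then a word from that topic.
z = rng.choice(num_topics, size=doc_len, p=theta)
word_ids = [rng.choice(vocab_size, p=phi[k]) for k in z]

Inference runs this process in reverse: given only the observed words, it recovers theta (usable as a low-dimensional feature vector) and the assignments z.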

Team Members

  1. Ed Pureza, epureza2 (captain)
  2. Dan Qian, dansiq2
  3. Joe Everton, everton2

Files

  • Project Proposal: Project Proposal_Reproduce Latent aspect rating analysis without aspect keyword supervision.pdf. Original project proposal submitted on October 24, 2020.
  • Progress Report: ProgressReport.pdf. Progress report with accomplishments, challenges, and remaining planned activities as of November 29, 2020.
  • Project Documentation: ProjectDocumentation.pdf. Project documentation submitted December 8, 2020.
  • Project Video Walk-through: https://mediaspace.illinois.edu/media/t/1_jbzbbspv. Video presentation of the project.
  • Project Tutorial: ProjectTutorial.pdf. Project tutorial for reproducing the experiments (also outlined below).
  • LDA without Smoothing: lda_var_inf_without_smoothing.py. Code for running LDA using variational inference and a gensim-based alpha update method.
  • LDA without Smoothing v2: lda_var_inf_without_smoothing_v2.py. Code for running LDA using variational inference; use this if Python environment setup issues are encountered.
  • LDA with Collapsed Gibbs Sampling: lda_gibbs_sampling.py. LDA implementation using collapsed Gibbs sampling.
  • Original LDA Code with Variational Inference: lda_var_inf.py. First attempt at implementing LDA with the variational inference method.
  • Fake News Dataset: FA-KES-Dataset.csv. Input dataset with news articles classified as fake news or not fake news.
  • Spam Dataset: spam.csv. Input dataset with messages classified as spam or ham (not spam).

How to Use

Programming Language and Packages

  • Python 3.x
  • Packages: pandas, numpy, scipy, sklearn (scikit-learn); the standard-library modules math, re, random, and time are also used
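If needed, the third-party packages can be installed with pip (math, re, random, and time ship with Python):

pip install pandas numpy scipy scikit-learn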

Executing Code

Fork or download the GitHub repo.

Open the project in an IDE and run the file(s), or use a command prompt (e.g., python lda_var_inf_without_smoothing.py). Start with lda_var_inf_without_smoothing.py. If unexpected results are encountered, try lda_var_inf_without_smoothing_v2.py. Optionally, you can also try the other variations, lda_gibbs_sampling.py and lda_var_inf.py.

To use a different input dataset, your CSV file needs a text column and a classification (label) column. Modify the source file (input_path) and column settings (text_column, label_column) in the load_csv function call:

(vocabulary_size,
 training_term_doc_matrix,
 training_labels,
 testing_term_doc_matrix,
 testing_labels,
 vocabulary) = load_csv(input_path='FA-KES-Dataset.csv',
                        test_set_size=100,
                        training_set_size=200,
                        num_stop_words=50,
                        min_word_freq=5,
                        text_column='article_content',
                        label_column='labels',
                        label_dict={'1': 1, '0': 0})

lda_var_inf_without_smoothing_v2.py includes load_csv calls for both datasets (fake news and spam); comment/uncomment to switch between them.
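For orientation, here is a minimal sketch of what a loader with this signature plausibly does; the repo's actual load_csv may differ, and the body below (including the name load_csv_sketch) is an assumption:

import re
from collections import Counter

import numpy as np
import pandas as pd

def load_csv_sketch(input_path, test_set_size, training_set_size,
                    num_stop_words, min_word_freq,
                    text_column, label_column, label_dict):
    # Read the CSV; encoding is a guess for datasets like spam.csv.
    df = pd.read_csv(input_path, encoding='latin-1')
    docs = [re.findall(r'[a-z]+', str(t).lower()) for t in df[text_column]]
    labels = np.array([label_dict[str(l)] for l in df[label_column]])

    # Vocabulary: drop the num_stop_words most frequent words and any word
    # appearing fewer than min_word_freq times in the corpus.
    counts = Counter(w for doc in docs for w in doc)
    stop_words = {w for w, _ in counts.most_common(num_stop_words)}
    vocabulary = sorted(w for w, c in counts.items()
                        if c >= min_word_freq and w not in stop_words)
    index = {w: i for i, w in enumerate(vocabulary)}

    def term_doc_matrix(subset):
        # Rows are documents, columns are vocabulary words, cells are counts.
        m = np.zeros((len(subset), len(vocabulary)), dtype=int)
        for d, doc in enumerate(subset):
            for w in doc:
                if w in index:
                    m[d, index[w]] += 1
        return m

    train = slice(0, training_set_size)
    test = slice(training_set_size, training_set_size + test_set_size)
    return (len(vocabulary), term_doc_matrix(docs[train]), labels[train],
            term_doc_matrix(docs[test]), labels[test], vocabulary)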

Setting Parameters

Set the following parameters to tune the model (a full train call is shown below):

  • num_topics: number of topics to model

lda.train(num_topics=10,
          term_doc_matrix=training_term_doc_matrix,
          iterations=20,
          e_iterations=10,
          e_epsilon=0.1,
          initial_training_set_size=50,
          initial_training_iterations=20)
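After training, the classification evaluation described above uses the per-document topic distributions as classifier features. A sketch of that downstream step, assuming hypothetical accessor names (document_topic_distribution, infer) that may not match the scripts' actual API:

from sklearn.linear_model import LogisticRegression

# Placeholder names below; substitute whatever the chosen script actually
# exposes for per-document topic proportions.
train_theta = lda.document_topic_distribution    # shape: (num_train_docs, num_topics)
test_theta = lda.infer(testing_term_doc_matrix)  # topic proportions for held-out docs

classifier = LogisticRegression(max_iter=1000).fit(train_theta, training_labels)
print('test accuracy:', classifier.score(test_theta, testing_labels))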

See the video walk-through for additional information.
