Code for a paper on subreddit classification for the Text and Multimedia Mining course '19-'20

gijshendriksen/subreddit-classification

Code for the research paper 'Subreddit classification of text posts', written for the TxMM '19-'20 course. This repository contains the following modules:

  • crawler.py retrieves the dataset using the Python Reddit API Wrapper (PRAW).
  • statistics.py is a utility module that computes statistics over the dataset, such as the most discriminative terms for each subreddit.
  • document.py contains a wrapper for documents that performs simple tasks like tokenization and POS tagging once, so these operations don't have to be repeated for each feature type that uses them.
  • features.py contains a list of all feature types and the means to extract them from a document. It uses both the statistics and document helper modules to compute each feature vector.
  • main.py contains the experiments, in which we create the models and perform the classification.
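
The idea behind the document wrapper can be illustrated with a minimal sketch. This is not the code in document.py: the names `Document`, `tokens`, `token_count` and `type_token_ratio` are hypothetical, a simple regex stands in for the real tokenizer, and POS tagging is left out to keep the example free of external dependencies. It only shows how caching lets several feature extractors share one tokenization pass.

```python
import re
from functools import cached_property  # requires Python 3.8+


class Document:
    """Hypothetical wrapper that tokenizes a post at most once."""

    def __init__(self, text):
        self.text = text
        self.tokenize_calls = 0  # only here to make the caching visible

    @cached_property
    def tokens(self):
        # Assumption: a lowercase regex tokenizer replaces the real one.
        self.tokenize_calls += 1
        return re.findall(r"[a-z0-9']+", self.text.lower())


def token_count(doc):
    """Hypothetical feature: number of tokens in the document."""
    return len(doc.tokens)


def type_token_ratio(doc):
    """Hypothetical feature: distinct tokens divided by all tokens."""
    tokens = doc.tokens
    return len(set(tokens)) / len(tokens) if tokens else 0.0


doc = Document("The cat sat on the mat")
# Both features read doc.tokens, but the tokenizer runs only once.
features = [token_count(doc), type_token_ratio(doc)]
```

After both features are computed, `doc.tokenize_calls` is still 1, which is the saving the wrapper is meant to provide when many feature types need the same preprocessing.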
