R Project: Text Mining business texts

This project was part of our examination for the Data Mining course at UNIMIB (Università degli Studi di Milano-Bicocca). We tried to extract information from business news articles in the BBC archive. The final aim was to compare the performance of a Latent Dirichlet Allocation (LDA) topic modeling algorithm against a baseline created ad hoc by the authors.

Among the things we learned: document clustering, topic modeling, semantic coherence, stemming algorithms, web scraping.

Prerequisites

To run the R script you need R >= 3.4.0 and Python 2.7+. We also make use of the following R packages:

tm
SnowballC
wordcloud
syuzhet
ggplot2
topicmodels
tidytext
dplyr
cluster
fpc
proxy
here
reticulate

and of the following Python packages:

wordcloud
Pillow
numpy
watson_developer_cloud

You may need to install them if you don't have them already. Unfortunately, the R language does not come with a reliable dependency manager.
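
A minimal sketch of one way to find and install whatever is missing, using only base R. The helper name `missing_packages` is made up for illustration; the package list is the one given above, with CRAN's case-sensitive spelling.

```r
# R packages used by the project (CRAN names are case-sensitive).
required <- c("tm", "SnowballC", "wordcloud", "syuzhet", "ggplot2",
              "topicmodels", "tidytext", "dplyr", "cluster", "fpc",
              "proxy", "here", "reticulate")

# Hypothetical helper: returns the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install them from CRAN:
  # install.packages(to_install)
}
```

The install call is left commented out so that running the snippet only reports what is missing.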

We use the R package "reticulate" to communicate between the two languages. You will find chunks of Python code in R scripts and seemingly nonsensical standalone Python files :).
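
As a rough illustration of what calling Python from R looks like with reticulate (a generic sketch, not code taken from this project), the snippet below runs one line of Python and reads the result back in R. The variable names `total` and `py_total` are placeholders, and the guard simply skips the demo when reticulate or a Python interpreter is unavailable.

```r
py_total <- NULL

# Run only if reticulate and a Python interpreter are available.
if (requireNamespace("reticulate", quietly = TRUE) &&
    reticulate::py_available(initialize = TRUE)) {
  # Execute Python source in the embedded interpreter's main module.
  reticulate::py_run_string("total = sum([1, 2, 3])")
  # `py` exposes that main module; values are converted to R types.
  py_total <- reticulate::py$total
  print(py_total)
}
```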

Corpus

Because pre-processing the text involves a couple of computationally intensive tasks, we have bundled the pre-processed documents in an RData file. The pre-processing includes stopword removal, stemming, and stem completion. To load the file, run:

load('rdata_files/docs.RData')
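
`load()` restores objects under the names they were saved with and returns those names invisibly. The sketch below loads a bundle into a separate environment, so you can see what it contains before it touches your workspace. The object names inside `docs.RData` are not documented here, so the example first writes a small stand-in file; `demo_docs` and the temporary path are placeholders.

```r
# Stand-in for rdata_files/docs.RData, so the sketch runs anywhere.
demo_docs <- c("share rise profit", "bank cut rate")
demo_path <- file.path(tempdir(), "docs_demo.RData")
save(demo_docs, file = demo_path)
rm(demo_docs)

# Load into a fresh environment instead of the global one.
bundle <- new.env()
loaded_names <- load(demo_path, envir = bundle)

print(loaded_names)             # names of the restored objects
str(bundle[[loaded_names[1]]])  # structure of the first restored object
```

With the real file, replace `demo_path` with `'rdata_files/docs.RData'` and skip the stand-in step.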

Replicability

We've tried to ensure replicability of the project, but despite setting seeds, various R modeling functions still seem to give different results from run to run. We will correct the code if we work it out.
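
One possible cause, not verified for this project: `topicmodels::LDA()` takes its seed through the `control` argument, and for VEM estimation the fitting runs in external code that a plain `set.seed()` call does not reach. The sketch below uses the `AssociatedPress` data shipped with `topicmodels`, not the project's corpus, and the values `k = 3` and `seed = 1234` are arbitrary.

```r
fits_match <- NA

if (requireNamespace("topicmodels", quietly = TRUE)) {
  library(topicmodels)
  data("AssociatedPress", package = "topicmodels")
  dtm <- AssociatedPress[1:20, ]  # small subset, just to keep the demo fast

  # Pass the seed via `control` rather than relying on set.seed().
  fit_a <- LDA(dtm, k = 3, control = list(seed = 1234))
  fit_b <- LDA(dtm, k = 3, control = list(seed = 1234))

  # Compare the top terms per topic across the two fits.
  fits_match <- identical(terms(fit_a, 5), terms(fit_b, 5))
  print(fits_match)
}
```

Other modeling functions in the pipeline may have their own seeding conventions, so this addresses at most one source of variation.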

rmaganza/TextMiningBBCBusiness