Leeds School of Business, University of Colorado Boulder
COVID-19 has had a profound impact on human lives. Our mission is to analyze how COVID-19 has affected the newspaper industry.
- AWS cluster with a PySpark (Python 3) environment
- pandas, pyspark, and numpy for data processing
- tmtoolkit and nltk for text mining and topic modeling
- matplotlib, wordcloud, and seaborn for data visualization
pip install --user pandas pyspark tmtoolkit nltk numpy wordcloud matplotlib seaborn
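After installing, a quick sanity check can confirm that everything is importable. This is a minimal sketch; the package names are taken straight from the pip command above:

```python
# Report any of the required packages that failed to install.
import importlib.util

required = ["pandas", "pyspark", "tmtoolkit", "nltk",
            "numpy", "wordcloud", "matplotlib", "seaborn"]
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]
print("missing packages:", missing if missing else "none")
```

If any package is reported missing, rerun the pip command above before proceeding.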
For local environment:
- All files except word_count_pyspark.ipynb can be executed on your local machine. Be sure to adjust the path for reading news.csv.
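The local path adjustment can be sketched as follows. The column names here (date, publisher, text) are placeholders, not the actual schema of news.csv, and an in-memory buffer stands in for the file so the example is self-contained:

```python
import io
import pandas as pd

# In-memory stand-in for news.csv; locally you would instead write
# df = pd.read_csv("/your/local/path/news.csv")
sample = io.StringIO("date,publisher,text\n"
                     "2020-03-01,Example Daily,Early COVID-19 coverage\n")
df = pd.read_csv(sample)
print(df.shape)  # (1, 3)
```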
For cluster platform:
- All files can be executed on any cloud service; below is an example of running them on an AWS cluster.
- Adjusting file-read paths is a pain, we totally understand that. So if you want to run our code on AWS, we strongly recommend saving news.csv in the same folder as all the other .ipynb files. That way, the only change needed is to point the read path at the bare filename, news.csv.
- You can use the AWS cluster's Jupyter Notebook interface to interact with our code;
- You can also use spark-submit to submit our work to your own cluster and check the results through the provided Hadoop web UI link. Note that spark-submit runs .py files, not notebooks, so convert each .ipynb first (e.g. with jupyter nbconvert --to script):
spark-submit --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 1G --executor-cores 1 --driver-memory 1G /aws_cluster_hadoop_path/file_to_execute.py
Authors:
- Yongbo Shu (Yongbo.Shu@colorado.edu)
- Katie Greenfield (Kathryn.Greenfield@colorado.edu)
- Madison Moye (Madison.Moye@colorado.edu)
- Dylan Bernstein (Dylan.Bernstein@colorado.edu)
- Jennifer Dickson (Jennifer.Dickson@colorado.edu)
Instructor:
- Peigang Zhang (Peigang.Zhang@colorado.edu)