Leeds School of Business, University of Colorado Boulder
COVID-19 has had a profound impact on human lives. Our mission is to analyze how COVID-19 has affected the newspaper industry.
- AWS cluster with a PySpark (Python 3) environment
- pandas, pyspark, and numpy for data processing
- tmtoolkit and nltk for text mining and topic modeling
- matplotlib, wordcloud, and seaborn for data visualization
pip install --user pandas pyspark tmtoolkit nltk numpy wordcloud matplotlib seaborn
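After installing, a quick sanity check can confirm that everything is importable. This is a minimal sketch; the package names are taken straight from the pip command above:

```python
# Report any of the required packages that failed to install.
import importlib.util

required = ["pandas", "pyspark", "tmtoolkit", "nltk",
            "numpy", "wordcloud", "matplotlib", "seaborn"]
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]
print("missing packages:", missing if missing else "none")
```

If any package is reported missing, rerun the pip command above before proceeding.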
For local environment:
- All files except word_count_pyspark.ipynb can be executed on your local machine. Be sure to adjust the path for reading news.csv.
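The local path adjustment can be sketched as follows. The column names here (date, publisher, text) are placeholders, not the actual schema of news.csv, and an in-memory buffer stands in for the file so the example is self-contained:

```python
import io
import pandas as pd

# In-memory stand-in for news.csv; locally you would instead write
# df = pd.read_csv("/your/local/path/news.csv")
sample = io.StringIO("date,publisher,text\n"
                     "2020-03-01,Example Daily,Early COVID-19 coverage\n")
df = pd.read_csv(sample)
print(df.shape)  # (1, 3)
```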
For cluster platform:
- All files can be executed on any cloud service; below is an example of running them on an AWS cluster.
- Adjusting file-read paths is a pain, we totally understand that. So if you want to run our code on AWS, we strongly recommend saving news.csv in the same folder as all the other .ipynb files. That way, the only change needed is to point the read path at the bare filename, news.csv.
- You can use the AWS cluster's Jupyter Notebook interface to interact with our code;
- You can also use spark-submit to submit our work to your own cluster and check the results through the provided Hadoop web UI link. Note that spark-submit runs .py files, not notebooks, so convert each .ipynb first (e.g. with jupyter nbconvert --to script):
spark-submit --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 1G --executor-cores 1 --driver-memory 1G /aws_cluster_hadoop_path/file_to_execute.py
Authors:
- Yongbo Shu (Yongbo.Shu@colorado.edu)
- Katie Greenfield (Kathryn.Greenfield@colorado.edu)
- Madison Moye (Madison.Moye@colorado.edu)
- Dylan Bernstein (Dylan.Bernstein@colorado.edu)
- Jennifer Dickson (Jennifer.Dickson@colorado.edu)
Instructor:
- Peigang Zhang (Peigang.Zhang@colorado.edu)