This is Team 03's submission of Assignment 1: Analysis of Media and Semantic Forensics in Scientific Literature from USC Viterbi's INF550: Data Science at Scale, Spring 2020 course. The contents of this project are listed below:
- analysis: directory with similarity scores, visualizations, and R code for cluster analysis
- data-modification: directory with original Bik .tsv, and R code that joins in results from our other datasets and performs cleanup for analysis
- rselenium-chrome: directory for our web-scraping scripts
- requirements.R: an R script that installs all packages used to develop this project
- TEAM_03_BIGDATA.pdf: our final report
- TEAM_03_UPDATED_DATSET.tsv: our updated dataset
- README.txt: .txt version of this readme
- README.md: this file; readme for GitHub
The rselenium-chrome folder has its own README as well, which details its configuration and contents (including test files to demonstrate the crawler).
Our project also involved modified files from Tika-Similarity, but we could not get those to appear in the GitHub repository for this project (most likely because Tika-Similarity is its own Git repository). The modifications are available in this pull request: chrismattmann/tika-similarity#97
- Cherry, Carlin - ccherry@usc.edu - 8211265507
- Lee, Matthew - mdlee@usc.edu - 4356300240
- May, David - davidmay@usc.edu - 5801939142
For any inquiries about the code, please contact Matt at mdlee@usc.edu. Thank you!
- R 3.6.2
- RStudio
- Rtools 3.5.0.4 (if on Windows)
- Docker Desktop
- Python 2.7
- Tika-Python
- Tika-Similarity
- D3 (built-in to Tika-Similarity)
A list of all packages installed into RStudio for the development of this project, with a brief description of what each package was for. The file `requirements.R`, included in the main project directory, contains code to install all the necessary packages. To install all packages from that file, open RStudio, change the working directory to the location of `requirements.R`, and run `source("requirements.R")`.
- `tidyverse`: functions to manipulate datasets, plotting through ggplot2, pipe operator
- `lubridate`: for formatting dates
- `RSelenium`: bindings for running Selenium through R (and Docker)
- `getProxy`: for "free" proxies; experimental
- `keyring`: for password management [1]
- `jsonlite`: for reading and writing JSON
- `gtrendsR`: for querying from Google Trends API
- `naniar`: for missingness map plot
- `factoextra`: for principal component analysis plot
- `cluster`: for distance measures and clustering algorithms
- `Rtsne`: for t-SNE plots, used to visualize clusters
- `arsenal`: for summary tables of clusters
- `gridExtra`: for plotting
[1] requires some configuration, but we only used it for scraping LinkedIn with an alternate account; see `r-keyring-tutorial.R` in the rselenium-chrome/keyring folder.
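
For readers who would rather not reinstall packages they already have, the install step described above can be approximated with an install-if-missing helper. This is only a sketch: the functions `missing_packages` and `install_missing` are hypothetical names, the real `requirements.R` may be organized differently, and the code assumes every package is available from your configured repository.

```r
# Hypothetical install-if-missing sketch; not the contents of requirements.R.
required_packages <- c(
  "tidyverse", "lubridate", "RSelenium", "getProxy", "keyring",
  "jsonlite", "gtrendsR", "naniar", "factoextra", "cluster",
  "Rtsne", "arsenal", "gridExtra"
)

# Return the subset of `pkgs` that is not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  pkgs[!pkgs %in% installed]
}

# Install whatever is missing; does nothing when everything is present.
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Uncomment to run (needs network access and a configured repository):
# install_missing(required_packages)
```

Calling `missing_packages(required_packages)` on its own is a quick way to see what is absent without installing anything.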
A list of features in our updated dataset. Features in the original dataset with no changes have the note "[no change]"; otherwise, there is a brief description of what we modified (if a modified original feature) or where the feature is from (if added from data blending).
A ^ before the name indicates a feature that we excluded from the final clustering results (i.e. not used to calculate the distance measure for determining clusters; we still examined these features in exploring the clusters). Some exclusions are for improving accuracy; others were due to time constraints and the need to push forward with the analysis.
- ^`authors`: [no change]
- ^`title`: [no change]
- ^`citation`: [no change]
- ^`doi`: [no change]
- `year`: [no change]
- ^`month`: extracted text from `citation`
- `simple_duplication`: renamed from `0`; replaced NA with 0
- `reposition_duplication`: renamed from `1`; replaced NA with 0
- `alteration_duplication`: renamed from `2`; replaced NA with 0
- `cuts_and_beautification`: renamed from `3`; replaced NA with 0
- `findings`: [no change]
- ^`reported`: replaced NA with 0
- `correction_date`: [no change]
- `retraction`: replaced NA with 0
- `correction`: replaced NA with 0
- `no_action`: replaced NA with 0
- ^`completed`: replaced NA with 0
- ^`first_author`: extracted text from `authors`
- `affiliation`: from Microsoft Academic; author's current associated organization
- `publications`: from Microsoft Academic; times author has published a paper
- `citations`: from Microsoft Academic; times any of author's papers have been cited
- `year_start`: from Microsoft Academic; year of first publication
- `year_end`: from Microsoft Academic; year of last publication
- `top_topics`: from Microsoft Academic; fields of study associated with given author
- `publication_types`: from Microsoft Academic; mediums of publication from given author
- `top_authors`: from Microsoft Academic; other authors associated with given author
- `top_journals`: from Microsoft Academic; journals given author has published in
- `top_institutions`: from Microsoft Academic; institutions associated with given author
- `top_conferences`: from Microsoft Academic; conferences given author has presented at
- `lab_size_approx`: count of distinct, non-NA values in `top_authors`
- `publication_variety`: count of distinct, non-NA values in `publication_types`
- `journal_variety`: count of distinct, non-NA values in `top_journals`
- `institution_variety`: count of distinct, non-NA values in `top_institutions`
- `conference_variety`: count of distinct, non-NA values in `top_conferences`
- `biology`: 1 if `top_topics` contains "biology"; 0 otherwise
- `medicine`: 1 if `top_topics` contains "medicine"; 0 otherwise
- `immunology`: 1 if `top_topics` contains "immunology"; 0 otherwise
- `cancer`: 1 if `top_topics` contains "cancer"; 0 otherwise
- `biochemistry`: 1 if `top_topics` contains "biochemistry"; 0 otherwise
- `virology`: 1 if `top_topics` contains "virology"; 0 otherwise
- `career_duration`: 2020 - `year_start`
- `publication_rate`: `publications` / `career_duration`
- `highest_degree`: from LinkedIn; degree title of first education record
- `degree_area`: from LinkedIn; field of study of first education record
- `degree_level`: `highest_degree` expressed as a number, with more weight on higher honors
- `total_cites`: from InCites; total citations of a given journal
- `impact_factor`: from InCites; a measure of journal prestige within its field
- `eigenfactor`: from InCites; importance of journal considering citations
- `web_interest`: from Google Trends; interest score for web searches
- `images_interest`: from Google Trends; interest score for image searches
- `youtube_interest`: from Google Trends; interest score for video searches
- ^`academic_reputation`: from USNews; score for university reputation
- ^`affiliation_funding`: from manual research; whether a university is public or private
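
To make the derived features above concrete, here is a minimal base-R sketch of a few of them on toy rows. It is not the code in data-modification: the values are made up, the helpers `count_distinct` and `topic_flag` are hypothetical, and it assumes the `top_*` fields are stored as `;`-separated strings and that topic matching is case-insensitive, neither of which is confirmed here.

```r
# Toy rows with made-up values; column names follow the feature list above.
authors <- data.frame(
  publications = c(120, 45),
  year_start   = c(1990, 2005),
  top_topics   = c("Biology; Cancer research; Immunology", "Virology; Medicine"),
  top_journals = c("Nature; Cell; Nature", NA),
  retraction   = c(1, NA),
  stringsAsFactors = FALSE
)

# Count distinct, non-NA entries in a delimited string.
# ASSUMPTION: entries are separated by ";" (the real storage format may differ).
count_distinct <- function(x, sep = ";") {
  vapply(strsplit(x, sep, fixed = TRUE), function(items) {
    items <- trimws(items)
    length(unique(items[!is.na(items) & items != ""]))
  }, integer(1))
}

# 1 if the topic string mentions `keyword`, 0 otherwise.
# ASSUMPTION: matching ignores case.
topic_flag <- function(topics, keyword) {
  as.integer(!is.na(topics) & grepl(keyword, topics, ignore.case = TRUE))
}

# "replaced NA with 0"
authors$retraction[is.na(authors$retraction)] <- 0

# "count of distinct, non-NA values in top_journals"
authors$journal_variety <- count_distinct(authors$top_journals)

# "1 if top_topics contains 'cancer'; 0 otherwise"
authors$cancer <- topic_flag(authors$top_topics, "cancer")

# "2020 - year_start" and "publications / career_duration"
authors$career_duration  <- 2020 - authors$year_start
authors$publication_rate <- authors$publications / authors$career_duration
```

One thing to watch with this formulation: an author whose `year_start` is 2020 gets a `career_duration` of 0, so `publication_rate` becomes `Inf` (or `NaN` when `publications` is also 0) and would need handling before any distance calculation.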
Some of our analyses involve similarity/clustering techniques not mentioned in lecture (and accumulated gradually from years of reading StackExchange posts...). These resources provide some context for such terms.