In this project, I performed the following steps:
- Imported the data and cleaned the HTML out of the text
- Used scikit-learn's `CountVectorizer` to generate word frequencies for a word cloud
- Used spaCy's part-of-speech tagging to generate noun frequencies for a word cloud of nouns
- Generated seaborn plots to visualize the response lengths and the class balance of the sentiment labels
- Used scikit-learn's `TfidfVectorizer` to transform the reviews into a sparse TF-IDF matrix
- Used scikit-learn's `validation_curve` to cross-validate hyperparameters for logistic regression and k-nearest neighbors, and plotted how accuracy changes across the hyperparameter settings
- Trained scikit-learn's `LinearSVC` to compare against the best-scoring models
- Concluded that logistic regression (with C ≈ 5) is the best-performing model
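
The TF-IDF and `validation_curve` steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `reviews` and `labels` toy data, the range of `C` values, the 3-fold split, and the choice to wrap the vectorizer and classifier in a single pipeline are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned review text and sentiment labels.
reviews = [
    "great movie loved it", "wonderful acting and a great plot",
    "loved every minute", "excellent film truly wonderful",
    "a great and moving story", "fantastic and excellent cast",
    "terrible movie hated it", "awful acting and a boring plot",
    "hated every minute", "boring film truly awful",
    "a terrible and dull story", "dreadful and boring cast",
]
labels = np.array([1] * 6 + [0] * 6)

# Assumption: TF-IDF lives inside the pipeline so it is refit on each training fold.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Assumed search grid for the inverse regularization strength C.
c_values = np.logspace(-2, 2, 5)

# Scores come back with shape (number of C values, number of CV folds).
train_scores, cv_scores = validation_curve(
    pipeline,
    reviews,
    labels,
    param_name="logisticregression__C",  # step name generated by make_pipeline
    param_range=c_values,
    cv=3,
    scoring="accuracy",
)

# Average over folds and pick the C with the highest mean validation accuracy.
mean_cv = cv_scores.mean(axis=1)
best_c = c_values[mean_cv.argmax()]

for c, score in zip(c_values, mean_cv):
    print(f"C={c:g}: mean CV accuracy {score:.3f}")
print(f"Best C on this toy data: {best_c:g}")
```

Plotting `mean_cv` against `c_values` on a logarithmic x-axis gives the kind of accuracy curve described above. The same call should work for k-nearest neighbors by swapping in `KNeighborsClassifier` and tuning `kneighborsclassifier__n_neighbors` instead of `C`.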