This is a simple NLP project based on the NLP section of the A-Z Machine Learning course on Udemy.
- Find the GitHub Pages site for this repository at https://abd-shoumik.github.io/Restaurant-Review-Classifier/
- Importing the libraries and dataset
- Text cleaning or pre-processing
- Tokenization
- Creating bag of words model
- Splitting Dataset into training set and test set
- Fitting to a predictive model
- Predicting final result
- Accuracy Measurement
- Deploying it on Heroku
- Import `numpy`, `pandas` and `matplotlib.pyplot`.
- Import the dataset `Restaurant_Reviews.tsv` in your editor. We use a `.tsv` (tab-separated values) file here because its columns are separated by tabs. Datasets usually come as `.csv` (comma-separated values) files, but commas can appear inside reviews, so using a comma as the delimiter can cause errors when separating columns. Tabs are used for separation instead, which is why we import a `.tsv` file.
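To illustrate why the tab delimiter matters, here is a minimal sketch that parses two made-up rows in a tab-separated layout using only the standard library. The column names `Review` and `Liked`, the sample rows, and the helper `read_reviews` are assumptions for illustration, not taken from the actual file.

```python
import csv
import io

# Hypothetical two-row sample in a tab-separated layout; the column names
# "Review" and "Liked" are assumptions, not read from the real dataset.
sample_tsv = (
    "Review\tLiked\n"
    "Wow, loved this place.\t1\n"
    "Crust is not good, and the service was slow.\t0\n"
)

def read_reviews(handle):
    """Parse tab-separated rows into (review_text, label) tuples."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row
    return [(text, int(label)) for text, label in reader]

# Commas inside the reviews stay part of the text because the delimiter is a tab.
rows = read_reviews(io.StringIO(sample_tsv))
```

With pandas, the same file layout can be read with `pd.read_csv("Restaurant_Reviews.tsv", sep="\t")`.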
- Remove punctuation and numbers: punctuation and numbers don’t help much in processing the given text. If included, they just increase the size of the bag of words that we will create later and decrease the efficiency of the algorithm.
- Stemming: reduce each word to its root form.
- Convert each word to lowercase: it is useless to keep the same word in different cases (e.g. ‘good’ and ‘GOOD’).
- Tokenization involves splitting the body of the text into sentences and words.
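The cleaning steps above can be sketched roughly as follows. This uses only the standard library, and `simple_stem` is a deliberately crude, hypothetical suffix stripper standing in for a real stemmer (such as a Porter stemmer); it is not the stemmer used in this project.

```python
import re

def simple_stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_review(text):
    """Strip non-letters, lowercase, tokenize on whitespace, then stem."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # drop punctuation and digits
    tokens = letters_only.lower().split()           # lowercase and tokenize
    return [simple_stem(token) for token in tokens]

# Made-up review used only for illustration.
tokens = clean_review("Waited 20 minutes, but the BURGERS looked great!")
```

A real stemmer handles many more word endings; the point here is only the order of the steps.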
- Take all the distinct words from the reviews in the dataset, without repetition.
- Create one column for each word, so there will be many columns.
- Rows are reviews.
- If a word appears in a review, its count is stored in that review's row of the bag of words, under the column for that word.
For this purpose we need the `CountVectorizer` class from `sklearn.feature_extraction.text`.
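To make the structure of the bag of words concrete, here is a small plain-Python sketch of the matrix described above. The function `build_bag_of_words` and the sample reviews are hypothetical; in practice `CountVectorizer().fit_transform(corpus)` builds the same kind of count matrix (as a sparse matrix) for you.

```python
def build_bag_of_words(tokenized_reviews):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in review i."""
    # Distinct words across all reviews, sorted so the column order is stable.
    vocabulary = sorted({word for review in tokenized_reviews for word in review})
    column_of = {word: index for index, word in enumerate(vocabulary)}

    matrix = []
    for review in tokenized_reviews:
        row = [0] * len(vocabulary)       # one column per distinct word
        for word in review:
            row[column_of[word]] += 1     # count occurrences in this review
        matrix.append(row)
    return vocabulary, matrix

# Made-up, already-cleaned reviews used only for illustration.
reviews = [["good", "food", "good", "service"], ["bad", "food"]]
vocabulary, matrix = build_bag_of_words(reviews)
```

Each row of `matrix` corresponds to one review and each column to one word in `vocabulary`.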
- Split the corpus into a training set and a test set. For this, we need the `train_test_split` function from `sklearn.model_selection`. The split can be 70/30, 80/20, 85/15 or 75/25; here I chose 75/25 via `test_size`.
- Here I used `NaiveBayesClassifier`, as I got the best accuracy out of it. Fit the model via the `.fit()` method with the arguments `X_train` and `y_train`.
y_pred = model.predict(X_test)
- I used the Naive Bayes classifier, which gave me an accuracy of 73 percent. There are 1000 reviews in the dataset. Training it on a bigger dataset and tuning the parameters should improve this accuracy.
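The split, fit, predict and accuracy steps can be sketched together as below. This is a minimal example under assumptions: the eight reviews and labels are made up, `MultinomialNB` is one possible scikit-learn Naive Bayes variant (the exact class used in this project is not stated here), and `random_state=0` is only there to make the split repeatable.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up reviews for illustration only; 1 = liked, 0 = not liked.
corpus = [
    "loved the food", "great service", "tasty and fresh", "amazing place",
    "bad food", "slow service", "cold and bland", "terrible place",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag of words: one row per review, one column per distinct word.
X = CountVectorizer().fit_transform(corpus)

# 75/25 split, chosen via test_size.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)

# Assumption: MultinomialNB stands in for the Naive Bayes model.
model = MultinomialNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)  # fraction of correct predictions
```

On a toy dataset this small the accuracy value is not meaningful; the sketch only shows how the pieces connect.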
- In `review_app.py` I load the pickled objects from `text_transform.pkl` and `nlp_review_classifier.pkl` and set up the routes that render the `home.html` and `result.html` templates.
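As a rough illustration of the pickling idea, the sketch below saves an object to a `.pkl` file and loads it back. The dictionary `fitted_object` is a placeholder standing in for a fitted text transformer or classifier, and the temporary folder is used only so the example does not touch the real project files.

```python
import os
import pickle
import tempfile

# Placeholder for a fitted vectorizer or classifier; any picklable object works.
fitted_object = {"vocabulary": ["bad", "food", "good"], "note": "placeholder"}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "text_transform.pkl")

    # Training side: write the object to disk once.
    with open(path, "wb") as handle:
        pickle.dump(fitted_object, handle)

    # App side: read it back at startup instead of retraining.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)
```

Loading pickled objects this way lets the web app make predictions without retraining the model on every start.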
Then I deployed it on the Heroku platform using this GitHub repository; you can visit the app at restaurantreviewclassifier
*** If you want to deploy this to your own Heroku account, make sure all the files from this git repo are uploaded to your GitHub repository ***