E-commerce product classification

With the growing popularity of e-commerce websites like Amazon, Flipkart, and Myntra, the number of products available on these platforms has increased significantly. As a result, the product classification problem has gained practical significance in the industry. This project addresses the problem using natural language processing (NLP) techniques. NLP is first applied to obtain a clean dataset of product descriptions by eliminating common words, stop words, and generic product categories. Count vectorizers are then used to convert the text data into vector representations. Finally, various machine learning algorithms are applied to the vectorized data to classify products accurately in catalogues containing lakhs (hundreds of thousands) of products. The project has direct application in the currently booming e-commerce industry.

Literature Surveyed

In order to understand the previous approaches towards solving this problem we went through the following literature:

Dataset

The dataset for the product classification task was taken from the Amazon product classification challenge on Kaggle. The dataset can be found here.

Data-Preprocessing

In the initial stages of the project we performed data pre-processing using an NLP-based approach.

First, null-valued columns were removed from the data. Each data point was then converted into an independent document: a tuple combining the product title, description, and bullet points.
Each text document was tokenized, which enabled POS tagging and lemmatization on the dataset based on the category. The words in each category description were tagged with their parts of speech for further evaluation.
We used the WordNet lemmatizer to lemmatize the categorical data. Every document tuple was then cleaned by removing stop words and punctuation.
In the dataset, products with no defined category were labeled as 'Generic'. These 'Generic' products were removed.
Words that appeared in about 95% of the data points were removed from the dataset to improve model performance.
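The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual code: the field names (`title`, `description`, `bullet_points`, `category`), the tiny stop-word set, and the helper functions are all hypothetical. Null fields are simply skipped per record, and POS tagging and WordNet lemmatization are left out because they need NLTK and its corpora.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a full one.
STOP_WORDS = {"a", "an", "the", "and", "for", "with", "of", "in", "to", "is"}


def build_document(record):
    """Join title, description, and bullet points into one text document."""
    parts = (record.get("title"), record.get("description"), record.get("bullet_points"))
    return " ".join(p for p in parts if p)  # null fields are skipped


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # A lemmatizer (e.g. NLTK's WordNetLemmatizer) would be applied here.
    return [t for t in tokens if t not in STOP_WORDS]


def drop_frequent_words(token_lists, max_doc_fraction=0.95):
    """Remove words that occur in at least max_doc_fraction of the documents."""
    n_docs = len(token_lists)
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    too_common = {w for w, df in doc_freq.items() if df / n_docs >= max_doc_fraction}
    return [[t for t in tokens if t not in too_common] for tokens in token_lists]


def preprocess(records):
    """Return (token_lists, labels) for records whose category is not 'Generic'."""
    kept = [r for r in records if r.get("category") and r["category"] != "Generic"]
    token_lists = [tokenize(build_document(r)) for r in kept]
    return drop_frequent_words(token_lists), [r["category"] for r in kept]


# Toy records for illustration only; not taken from the dataset.
records = [
    {"title": "Product: red cotton shirt", "description": "Soft shirt for men",
     "bullet_points": None, "category": "Clothing"},
    {"title": "Product: steel water bottle", "description": "Keeps water cold",
     "bullet_points": "Leak-proof lid", "category": "Kitchen"},
    {"title": "Product: mystery item", "description": None,
     "bullet_points": None, "category": "Generic"},
]
docs, labels = preprocess(records)
```

In this toy input the word "product" occurs in every remaining document, so it is dropped by the frequency filter, and the 'Generic' record is discarded before tokenization.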

Models employed

Logistic Regression
SGD classifier
Bernoulli Naive Bayes
Multinomial Naive Bayes
Decision Tree
Random Forest
SVM
MLP
K-NN Classifier
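As a rough illustration of how such classifiers can be trained on count-vectorized text, the sketch below fits two of the listed models with scikit-learn. It is an assumption-laden example, not the project's training code: the toy documents, the choice of two models, and the hyperparameters (such as `max_iter=1000`) are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data for illustration only; not drawn from the project's dataset.
train_docs = [
    "red cotton shirt", "blue denim jeans", "wool winter jacket", "cotton summer dress",
    "steel water bottle", "nonstick frying pan", "glass mixing bowl", "steel kitchen knife",
]
train_labels = ["Clothing"] * 4 + ["Kitchen"] * 4

# Two of the listed models; the others can be added to this dict the same way.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
}

fitted = {}
for name, clf in candidates.items():
    # CountVectorizer turns each document into a vector of word counts.
    pipeline = make_pipeline(CountVectorizer(), clf)
    pipeline.fit(train_docs, train_labels)
    fitted[name] = pipeline

test_docs = ["cotton shirt", "steel pan"]
predictions = {name: list(p.predict(test_docs)) for name, p in fitted.items()}
```

Bundling the vectorizer and the classifier in one pipeline keeps the vocabulary learned at training time attached to the model, so raw text can be passed directly to `predict`.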

Running the model

To demonstrate the working, clone the repository.
Load the model
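The trained models are saved in the models directory. If a model file is compressed, uncompress it first, then load it with joblib, passing the path to the model file:

import joblib
model = joblib.load(model_path)  # model_path: path to a model file in the models directory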

Testing the model
Predict the output on the testing data using the following code:

y_pred = model.predict(x_test)

Getting the accuracy
Compute the model's accuracy on the testing data by passing both the test inputs and their true labels:

score = model.score(x_test, y_test)
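The load, predict, and score steps can be tried end to end with the sketch below. It is only an illustration under assumptions: the toy data, the file name `naive_bayes.joblib`, and the use of a pipeline that includes the vectorizer are hypothetical, and the saved models in the repository may expect already-vectorized input instead.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data for illustration only.
x_train = ["red cotton shirt", "blue denim jeans", "steel water bottle", "glass mixing bowl"]
y_train = ["Clothing", "Clothing", "Kitchen", "Kitchen"]
x_test = ["cotton jeans", "steel bowl"]
y_test = ["Clothing", "Kitchen"]

trained = make_pipeline(CountVectorizer(), MultinomialNB()).fit(x_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; compress=3 writes a zlib-compressed file.
    model_path = os.path.join(tmp_dir, "naive_bayes.joblib")
    joblib.dump(trained, model_path, compress=3)
    # joblib.load reads files written by joblib.dump(compress=...) directly.
    model = joblib.load(model_path)

y_pred = model.predict(x_test)
score = model.score(x_test, y_test)  # mean accuracy; requires the true labels
```

Files compressed by `joblib.dump` itself need no manual decompression, whereas an archive created with an external tool (for example a zip file) has to be extracted before calling `joblib.load`.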

To run the full pipeline, including all pre-processing steps and algorithms, navigate to the final code directory and run final_code.ipynb.


harshita19244/CSE_343-ML_Project
