This project analyzes IMDB movie reviews to classify their sentiment as positive or negative. It covers data preprocessing and text cleaning, TF-IDF feature extraction, and model training with Naive Bayes and Support Vector Machine (SVM) classifiers.
- Type: Natural Language Processing (NLP)
- Language: Python
- Project Overview
- Libraries Used
- Dataset
- Steps
- Features
- Usage
- Modeling
- Evaluation
- Results
- Contributing
- Acknowledgements
The project relies on the following libraries:
- Pandas: For data manipulation and analysis
- NumPy: For efficient numerical operations
- Matplotlib: For data visualization
- Scikit-Learn: To implement and evaluate machine learning models
- NLTK: For natural language processing tasks such as tokenization and lemmatization
- Regular Expressions (re): To clean and refine text data
We’re working with the IMDB Movie Reviews Dataset, a collection of labeled movie reviews. The dataset file, IMDB Dataset.csv, includes:
- review: The actual movie review text
- sentiment: The sentiment label (positive or negative)
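As a rough illustration of loading and inspecting a file with this layout, the sketch below reads two made-up rows from an in-memory buffer instead of the real CSV. The sample rows are hypothetical; only the column names `review` and `sentiment` come from the dataset description.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real file (assumption for illustration).
# To read the actual dataset, pass the path to the CSV file instead of the buffer.
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"One of the best films I have seen.<br /><br />Highly recommended.",positive\n'
    '"Dull plot, flat characters.",negative\n'
)
df = pd.read_csv(sample_csv)

print(df.shape)                        # number of rows and columns
print(df.isnull().sum())               # missing values per column
print(df["sentiment"].value_counts())  # class distribution
```

The same three checks (shape, missing values, label counts) apply unchanged once the buffer is replaced with the real file.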
The project follows these steps:
- Import Libraries: Get the essential tools ready for data processing, visualization, and machine learning.
- Load and Inspect Data: Peek into the dataset, check for any missing values, and understand the data distribution.
- Data Preprocessing: Transform text to lowercase, clean out HTML tags, tokenize reviews, and perform lemmatization.
- Data Preparation: Split the data into training and testing sets, encode labels, and convert text into TF-IDF features.
- Model Training and Evaluation: Train and test Naive Bayes and Support Vector Machine models, then evaluate their performance with accuracy scores, confusion matrices, and classification reports.
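The text-cleaning part of the preprocessing step can be sketched with the standard library alone. This is a minimal version under stated assumptions: the function name `clean_review` and both regular expressions are illustrative rather than taken from the project, and lemmatization is left out because it needs NLTK's WordNet data to be downloaded first.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")       # matches HTML tags such as <br />
TOKEN_RE = re.compile(r"[a-z0-9']+")  # runs of lowercase letters, digits, apostrophes


def clean_review(text: str) -> list[str]:
    """Strip HTML tags, lowercase the text, and split it into tokens."""
    text = TAG_RE.sub(" ", text)  # replace each tag with a space
    text = html.unescape(text)    # turn entities like &amp; back into characters
    text = text.lower()
    return TOKEN_RE.findall(text)


tokens = clean_review("A <b>GREAT</b> film!<br />Loved it.")
print(tokens)  # ['a', 'great', 'film', 'loved', 'it']
```

In a fuller pipeline, a lemmatizer such as NLTK's `WordNetLemmatizer` could be applied to each token before the tokens are joined back into a string for TF-IDF vectorization.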
The project provides the following features:
- Data Preprocessing: Clean and tokenize text, strip HTML tags, and normalize text.
- Feature Extraction: Convert text into numerical features using TF-IDF vectorization.
- Model Training: Build and train Naive Bayes and SVM classifiers.
- Evaluation: Assess model performance with accuracy scores, confusion matrices, and detailed classification reports.
- Preprocess the data: Clean and tokenize the text data.
- Train the model: Fit a machine learning model on the training data.
- Evaluate the model: Test the model on the test data and calculate metrics like accuracy, precision, recall, etc.
- Predict sentiment: Use the trained model to predict the sentiment of new reviews.
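The usage steps above might look roughly like the following with scikit-learn. This is a sketch, not the project's actual script: the eight toy reviews are made up, `LinearSVC` is assumed as the SVM variant, and string labels are passed directly instead of being encoded first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data standing in for the `review` and `sentiment` columns.
reviews = [
    "great film with a moving story",
    "great cast and a clever script",
    "great pacing and lovely music",
    "great fun from start to finish",
    "awful film with a boring story",
    "awful cast and a clumsy script",
    "awful pacing and grating music",
    "awful mess from start to finish",
]
sentiments = ["positive"] * 4 + ["negative"] * 4

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, sentiments, test_size=0.25, stratify=sentiments, random_state=0
)

# Each pipeline converts text to TF-IDF features, then fits a classifier.
models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "svm": make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0)),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Predict the sentiment of reviews the models have not seen.
new_reviews = ["a great evening at the cinema", "an awful waste of time"]
predictions = {
    name: list(model.predict(new_reviews)) for name, model in models.items()
}
print(accuracies)
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is learned from the training split only, which avoids leaking test data into the features.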
The project explores several machine learning models, including:
- Logistic Regression
- Support Vector Machines (SVM)
- Naive Bayes
- Random Forest
We also experimented with hyperparameter tuning to improve model performance.
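One common way to tune hyperparameters in scikit-learn is a cross-validated grid search. The sketch below shows the idea on made-up toy data. The grid values, the step names `tfidf` and `svm`, and the choice of `LinearSVC` are all assumptions for illustration, not the settings used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy data standing in for the cleaned reviews and their labels.
reviews = [
    "great film with a moving story",
    "great cast and a clever script",
    "great pacing and lovely music",
    "great fun from start to finish",
    "awful film with a boring story",
    "awful cast and a clumsy script",
    "awful pacing and grating music",
    "awful mess from start to finish",
]
sentiments = ["positive"] * 4 + ["negative"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC(random_state=0)),
])

# Hypothetical grid: parameter names use the "<step>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy set is tiny; 5 or more folds is more typical.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(reviews, sentiments)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds the pipeline refit on all the data with the best parameter combination.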
The performance of each model is evaluated using metrics such as:
- Accuracy
- Precision
- Recall
- F1 Score
The confusion matrix is also used to visualize the performance of the models.
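To make these metrics concrete, the sketch below computes them with scikit-learn on a small hand-made set of labels. The labels are invented for illustration and are not results from this project; `positive` is treated as the positive class.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Invented labels: 4 truly positive and 4 truly negative reviews.
y_true = ["positive"] * 4 + ["negative"] * 4
y_pred = [
    "positive", "positive", "positive", "negative",  # 3 hits, 1 miss
    "negative", "negative", "positive", "positive",  # 2 hits, 2 false alarms
]

accuracy = accuracy_score(y_true, y_pred)                         # 5 / 8 = 0.625
precision = precision_score(y_true, y_pred, pos_label="positive")  # 3 / 5 = 0.6
recall = recall_score(y_true, y_pred, pos_label="positive")        # 3 / 4 = 0.75
f1 = f1_score(y_true, y_pred, pos_label="positive")                # about 0.667

# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])

print(accuracy, precision, recall, f1)
print(cm)  # [[2 2]
           #  [1 3]]
print(classification_report(y_true, y_pred))
```

Reading the matrix: the top row shows 2 negatives classified correctly and 2 mislabeled as positive, while the bottom row shows 1 positive missed and 3 found.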
The models are compared on the test set using accuracy, confusion matrices, and classification reports.
Contributions are welcome! If you have suggestions for improvements, feel free to fork the repository and create a pull request.
A big shoutout to:
- Dataset: The amazing IMDB movie reviews dataset, courtesy of Kaggle.
- Libraries: Our project’s backbone includes pandas, numpy, matplotlib, scikit-learn, and nltk.
- Inspiration: Inspired by fantastic sentiment analysis tutorials and groundbreaking NLP research.
- Santhosh VS - Connect with me on LinkedIn
Got questions or feedback? Drop me a line at santhosh02vs@gmail.com. I’d love to hear from you!