Duplicate Question Pairs Checker (NLP Project)

Team Size: 4

Description

The Duplicate Question Pairs Checker is an NLP project that builds a model to identify duplicate question pairs in a dataset. It uses natural language processing (NLP) techniques such as tokenization, stemming, and part-of-speech tagging to preprocess the questions and measure the semantic similarity of each pair. Identifying duplicate question pairs has applications in search engines, question-answering systems, and content recommendation platforms.

Key Skills

NLP (Natural Language Processing)
Machine Learning
Feature Selection
Evaluation
Data Cleaning

Installation

To run the Duplicate Question Pairs Checker project on your local machine, follow these steps:

Clone the repository:

git clone https://github.com/aadithlasar/LPTESTQ.git
cd LPTESTQ

Set up a virtual environment (optional but recommended):
```
python3 -m venv venv
source venv/bin/activate
```
Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

Prepare your dataset in a suitable format (CSV, JSON, etc.).
Preprocess the data using the provided data cleaning scripts (see Data Cleaning).
Implement and apply feature selection techniques (see Feature Selection).
Train the duplicate question pairs identification model (see Model Training).
Evaluate the model's performance (see Evaluation).
Use the trained model to identify duplicate question pairs in new datasets.

Data Cleaning

Data cleaning is a crucial step to ensure the quality and reliability of the model. Various data cleaning techniques are applied to the raw data, including:

Removing duplicate entries
Handling missing values
Text normalization (lowercasing, removing punctuation, etc.)
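
The three steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual cleaning script: the function names `normalize` and `clean_pairs` are hypothetical, and the row layout `(question1, question2, label)` is an assumption about how the data is organized.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def clean_pairs(rows):
    """Clean an iterable of (question1, question2, label) rows (assumed layout).

    Rows with a missing or empty question are dropped, text is normalized,
    and repeated question pairs are kept only once.
    """
    seen = set()
    cleaned = []
    for q1, q2, label in rows:
        if not q1 or not q2:      # missing value: drop the row
            continue
        q1, q2 = normalize(q1), normalize(q2)
        if not q1 or not q2:      # nothing left after normalization
            continue
        key = (q1, q2)
        if key in seen:           # duplicate entry: keep the first occurrence
            continue
        seen.add(key)
        cleaned.append((q1, q2, label))
    return cleaned
```

Dropping rows with missing values is only one possible policy; filling them with an empty string is another, depending on how much data can be spared.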

Feature Selection

Effective feature selection is essential for building a robust model. The text representations used to derive features in this project include:

TF-IDF (Term Frequency-Inverse Document Frequency)
Word embeddings (Word2Vec, GloVe, etc.)
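
To make the TF-IDF idea concrete, the sketch below weights each term by its frequency in a question and its rarity across all questions, then compares two questions by cosine similarity. It is a standard-library illustration under assumptions: the function names are hypothetical, tokenization is a plain whitespace split, and the smoothed IDF formula is one common variant rather than necessarily the one this project uses.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The similarity score of a question pair can then serve as one input feature for the classifier, alongside any embedding-based features.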

Model Training

The model is trained using a labeled dataset containing pairs of questions labeled as duplicates or non-duplicates. Techniques used during model training include:

Creating a suitable training-validation split
Building and training a deep learning or machine learning model
Fine-tuning hyperparameters to improve performance
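
The training-validation split can be sketched as follows. This is an illustrative helper, not the project's code: the name `train_validation_split`, the 20% validation fraction, and the fixed seed are all assumptions chosen for the example.

```python
import random


def train_validation_split(pairs, validation_fraction=0.2, seed=42):
    """Shuffle labeled pairs reproducibly and split them into two lists.

    Returns (train, validation). The fixed seed makes the split repeatable,
    so results from different hyperparameter settings stay comparable.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(pairs)                     # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]
```

If duplicates are much rarer than non-duplicates in the dataset, a stratified split that preserves the label ratio in both parts may be a better choice than this plain shuffle.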

Evaluation

Model performance is evaluated using metrics such as accuracy, precision, recall, and F1-score, along with ROC curves. These show how effectively the model identifies duplicate question pairs.
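
As a concrete reference for the threshold-based metrics, the sketch below computes accuracy, precision, recall, and F1-score from true and predicted labels. The function name `classification_metrics` is hypothetical, and the encoding of duplicates as `1` and non-duplicates as `0` is an assumption.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = duplicate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # duplicates found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # duplicates missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct rejections

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Precision and recall matter more than accuracy when the classes are imbalanced, since a model that always predicts "non-duplicate" can still reach high accuracy.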

License

This project is licensed under the MIT License.

Feel free to contribute, open issues, and submit pull requests to help improve this project!
