ScrapeSense is an NLP platform for scraping YouTube comments from any video and analyzing the sentiment of Arabic comments.
It automates the scraping, cleaning, embedding, and sentiment-modeling tasks for YouTube comments on any given subject.
This guide walks you through the setup, deployment, and interaction with the ScrapeSense application.
- Next.js
- Flask
- Google Colab
- Ngrok
- YouTube Data API v3
- Vercel
- TF-IDF: Traditional embedding method based on word frequency in documents and within the corpus.
- FastText: Non-contextual embedding method from Meta for word representation.
- AraBERT: Transformer-based embedding method for Arabic comments, capturing dialects and meaning.
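For instance, TF-IDF vectors can be produced with scikit-learn in a few lines. The sketch below is illustrative only; the sample comments and default vectorizer settings are not taken from the app's code:

```python
# Minimal TF-IDF sketch with scikit-learn (illustrative, not the app's actual code).
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "هذا الفيديو رائع جدا",   # "This video is great"
    "لم يعجبني المحتوى",      # "I did not like the content"
]

vectorizer = TfidfVectorizer()          # word-frequency weighting over the corpus
X = vectorizer.fit_transform(comments)  # sparse matrix: one row per comment
print(X.shape, len(vectorizer.get_feature_names_out()))
```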
- Naive Bayes: For sentiment classification using basic probabilistic modeling.
- FastText: For training sentiment analysis models based on non-contextual word embeddings.
- AraBERT: For fine-tuning sentiment analysis models using contextual embeddings of Arabic comments.
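As an illustration of the FastText route, the library's supervised mode trains a classifier directly from labeled text. A minimal sketch, assuming a FastText-format training file; the file name, labels, and hyperparameters are placeholders, not the app's actual setup:

```python
# Hypothetical FastText supervised-classification sketch (pip install fasttext).
import fasttext

# train.txt holds one labeled example per line, e.g.:
# __label__positive هذا الفيديو رائع جدا
# __label__negative لم يعجبني المحتوى
model = fasttext.train_supervised(input="train.txt", epoch=25, lr=0.5)

labels, probs = model.predict("فيديو ممتاز")  # returns the top label(s) and probabilities
print(labels, probs)
```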
- YouTube Comments Scraping: Scrape comments from any YouTube video. The data can be exported as a CSV file for easy analysis.
- Data Cleaning: Clean the scraped data by removing duplicates, handling missing values, and standardizing formats to prepare it for further processing (see the cleaning sketch after this list).
- Word Embedding: Convert words into vector representations using techniques like TF-IDF, FastText, and AraBERT for better text analysis and sentiment understanding.
- Data Modeling (Sentiment Analysis): Perform sentiment analysis on Arabic comments, classifying them as positive, negative, or neutral, using:
  - machine learning models (the Naive Bayes algorithm),
  - the integrated classifier of the FastText model,
  - the AraBERT transformer.
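The cleaning step could look roughly like the pandas sketch below. The column names `Comment` and `Label` follow the CSV format mentioned in the troubleshooting section; the exact cleaning rules in the app may differ:

```python
# Rough pandas cleaning sketch (column names assumed from the CSV format described below).
import pandas as pd

df = pd.read_csv("comments.csv")            # expects Comment and Label columns
df = df.drop_duplicates(subset="Comment")   # remove duplicate comments
df = df.dropna(subset=["Comment"])          # drop rows with missing text
df["Comment"] = df["Comment"].str.strip()   # standardize whitespace
df.to_csv("comments_clean.csv", index=False)
```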
Ngrok is used in the ScrapeSense app to create a tunnel, allowing the Flask app running in Google Colab to connect to the ScrapeSense user interface.
- Visit the Ngrok website and create an account.
- After signing up, go to your dashboard and find your Authtoken under the "Auth" section.
- Copy the Authtoken, as you'll need it to connect Ngrok to your local server (in Google Colab).
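If you want to see what that configuration looks like in code, one common approach in a Colab cell is the pyngrok package; this is a sketch only, and the notebook itself may configure ngrok differently:

```python
# One way to register the Authtoken from a Colab cell.
# pyngrok is an assumption; the ScrapeSense notebook may use another method.
from pyngrok import ngrok

ngrok.set_auth_token("YOUR_NGROK_AUTHTOKEN")  # paste the token from your Ngrok dashboard
```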
YouTube Data API v3 is used in the ScrapeSense app to access and scrape comments from YouTube videos.
- Go to the Google Cloud Console, and create a new project if you don't have one already.
- Navigate to API & Services > Library, search for YouTube Data API v3, and enable it for your project.
- Go to Credentials, click Create Credentials, select API Key, and copy it for the ScrapeSense setup.
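For reference, a standalone comment fetch with the official Python client looks roughly like the sketch below; the app's own scraping code may differ, and the key, video ID, and parameters are placeholders:

```python
# Standalone sketch of fetching comments with google-api-python-client
# (pip install google-api-python-client); values are illustrative only.
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")
response = youtube.commentThreads().list(
    part="snippet",
    videoId="VIDEO_ID",
    maxResults=100,
    textFormat="plainText",
).execute()

for item in response["items"]:
    print(item["snippet"]["topLevelComment"]["snippet"]["textDisplay"])
```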
- Open the ScrapeSense-FlaskApp notebook in Google Colab.
- After creating your Ngrok account, copy your Authtoken.
- In the ScrapeSense-FlaskApp notebook, paste your Authtoken in the code section provided in the first step. This will configure Ngrok to connect your local server to the public internet.
- First, run the initial part of the code. This may take a few minutes to set up the Flask app and install the required libraries and models.
- Then, run the second part of the code. This step is usually faster and sets up all the app routes.
- Once the second part runs, you will see the Ngrok URL in the output. It will look something like https://xyz.ngrok-free.app. Use this URL to interact with the ScrapeSense app.
- Alternatively, you can access the generated Ngrok URL directly from your Ngrok Agents dashboard: click on the running agent and copy the URL provided.
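Under the hood, exposing a Colab-hosted Flask server through an Ngrok tunnel follows this general pattern; the sketch is illustrative, and the `/health` route is a placeholder rather than one of the notebook's actual routes:

```python
# General pattern for exposing a Flask app from Colab through an Ngrok tunnel.
# The /health route is a placeholder; the real notebook defines its own routes.
from flask import Flask
from pyngrok import ngrok

app = Flask(__name__)

@app.route("/health")
def health():
    return {"status": "ok"}

tunnel = ngrok.connect(5000)        # opens the public tunnel
print("Public URL:", tunnel.public_url)  # e.g. https://xyz.ngrok-free.app
app.run(port=5000)
```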
- Open the ScrapeSense GUI.
- Ngrok URL:
  - Paste the Ngrok URL generated in the previous steps by the ScrapeSense-FlaskApp. This URL connects your Flask server in Colab to the ScrapeSense GUI.
- YouTube API Key:
  - Paste the YouTube API Key that you obtained from the Google Cloud Console. This key allows the ScrapeSense app to access YouTube comments and scrape data.
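Before pasting the URL into the GUI, you can sanity-check that the tunnel responds. The endpoint path below is hypothetical; any route your notebook exposes will do:

```python
# Quick sanity check that the Ngrok tunnel is reachable (endpoint path is hypothetical).
import requests

resp = requests.get("https://xyz.ngrok-free.app/health", timeout=10)
print(resp.status_code)
```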
Click "Start" to begin the scraping and processing workflow. The app will go through 4 pipelines:
- The app will scrape comments from any YouTube video using the YouTube API.
- The app will clean the Arabic comments by removing duplicates and fixing any missing or incorrect values.
- The app will apply NLP techniques to process and embed the words in the dataset for analysis.
- The app will split the data into training and testing sets and apply basic machine learning models for analysis.
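The last pipeline corresponds roughly to a standard scikit-learn workflow like the one below; this is an illustrative baseline, and the app's exact models, features, and parameters may differ:

```python
# Illustrative train/test split + Naive Bayes baseline (not the app's exact code).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

df = pd.read_csv("comments_clean.csv")   # Comment / Label columns
X_train, X_test, y_train, y_test = train_test_split(
    df["Comment"], df["Label"], test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(X_train), y_train)
pred = clf.predict(vectorizer.transform(X_test))
print("Accuracy:", accuracy_score(y_test, pred))
```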
If you run into any issues, try the following steps:
- Flask App Status: If the Flask app is down, it's likely because the Ngrok agent is down. Check the Ngrok Agents dashboard for its status. If it's down, restart the ScrapeSense-FlaskApp (Step 3 in the setup process).
  Note: If you receive an error saying that the Colab notebook has crashed, refresh the page and rerun the notebook; this usually solves the problem.
- API Key Errors: Double-check that your API key is correct and that the YouTube Data API v3 is enabled on Google Cloud.
- Dataset Issues: If the dataset isn't loading or cleaning properly, ensure that the file is in CSV format and that it contains only the two required columns: Comment and Label.
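A quick way to confirm your file matches that format (the column names are taken from the requirement above; the file name is a placeholder):

```python
# Quick check that the dataset has exactly the two required columns.
import pandas as pd

df = pd.read_csv("dataset.csv")
assert list(df.columns) == ["Comment", "Label"], f"Unexpected columns: {list(df.columns)}"
print(df.isna().sum())   # spot missing values before uploading
```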
- GitHub: @bensaied
Contributions are always welcome! Feel free to fork this repository and submit pull requests.
To support the project, don't forget to leave a star ⭐️.
This project is MIT licensed.