ScrapeSense is an NLP platform for scraping YouTube comments from any video and analyzing the sentiment of Arabic comments.
It automates the scraping, cleaning, embedding, and sentiment-modeling tasks for YouTube comments on any given subject.
This guide walks you through the setup, deployment, and interaction with the ScrapeSense application.
- Next.js
- Flask
- Google Colab
- Ngrok
- YouTube Data API v3
- Vercel
- TF-IDF: Traditional embedding method based on word frequency in documents and within the corpus.
- FastText: Non-contextual embedding method from Meta for word representation.
- AraBERT: Transformer-based embedding method for Arabic comments, capturing dialects and meaning.
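For instance, TF-IDF vectors can be produced with scikit-learn in a few lines. The sketch below is illustrative only; the sample comments and default vectorizer settings are not taken from the app's code:

```python
# Minimal TF-IDF sketch with scikit-learn (illustrative, not the app's actual code).
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "هذا الفيديو رائع جدا",   # "This video is great"
    "لم يعجبني المحتوى",      # "I did not like the content"
]

vectorizer = TfidfVectorizer()          # word-frequency weighting over the corpus
X = vectorizer.fit_transform(comments)  # sparse matrix: one row per comment
print(X.shape, len(vectorizer.get_feature_names_out()))
```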
- Naive Bayes: For sentiment classification using basic probabilistic modeling.
- FastText: For training sentiment analysis models based on non-contextual word embeddings.
- AraBERT: For fine-tuning sentiment analysis models using contextual embeddings of Arabic comments.
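As an illustration of the FastText route, the library's supervised mode trains a classifier directly from labeled text. A minimal sketch, assuming a FastText-format training file; the file name, labels, and hyperparameters are placeholders, not the app's actual setup:

```python
# Hypothetical FastText supervised-classification sketch (pip install fasttext).
import fasttext

# train.txt holds one labeled example per line, e.g.:
# __label__positive هذا الفيديو رائع جدا
# __label__negative لم يعجبني المحتوى
model = fasttext.train_supervised(input="train.txt", epoch=25, lr=0.5)

labels, probs = model.predict("فيديو ممتاز")  # returns the top label(s) and probabilities
print(labels, probs)
```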
- YouTube Comments Scraping: Scrape comments from any YouTube video. The data can be exported as a CSV file for easy analysis.
- Data Cleaning: Clean the scraped data by removing duplicates, handling missing values, and standardizing formats to prepare it for further processing (see the cleaning sketch after this list).
- Word Embedding: Convert words into vector representations using techniques like TF-IDF, FastText, and AraBERT for better text analysis and sentiment understanding.
- Data Modeling (Sentiment Analysis): Perform sentiment analysis on Arabic comments, classifying them as positive, negative, or neutral, using:
  - machine learning models (the Naive Bayes algorithm),
  - the integrated classifier of the FastText model,
  - the AraBERT transformer.
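The cleaning step could look roughly like the pandas sketch below. The column names `Comment` and `Label` follow the CSV format mentioned in the troubleshooting section; the exact cleaning rules in the app may differ:

```python
# Rough pandas cleaning sketch (column names assumed from the CSV format described below).
import pandas as pd

df = pd.read_csv("comments.csv")            # expects Comment and Label columns
df = df.drop_duplicates(subset="Comment")   # remove duplicate comments
df = df.dropna(subset=["Comment"])          # drop rows with missing text
df["Comment"] = df["Comment"].str.strip()   # standardize whitespace
df.to_csv("comments_clean.csv", index=False)
```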
Ngrok is used in the ScrapeSense app to create a tunnel, allowing the Flask app running in Google Colab to connect to the ScrapeSense user interface.
- Visit the Ngrok website and create an account.
- After signing up, go to your dashboard and find your Authtoken under the "Auth" section.
- Copy the Authtoken, as you'll need it to connect Ngrok to your local server (in Google Colab).
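If you want to see what that configuration looks like in code, one common approach in a Colab cell is the pyngrok package; this is a sketch only, and the notebook itself may configure ngrok differently:

```python
# One way to register the Authtoken from a Colab cell.
# pyngrok is an assumption; the ScrapeSense notebook may use another method.
from pyngrok import ngrok

ngrok.set_auth_token("YOUR_NGROK_AUTHTOKEN")  # paste the token from your Ngrok dashboard
```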
YouTube Data API v3 is used in the ScrapeSense app to access and scrape comments from YouTube videos.
- Go to the Google Cloud Console, and create a new project if you don't have one already.
- Navigate to API & Services > Library, search for YouTube Data API v3, and enable it for your project.
- Go to Credentials, click Create Credentials, select API Key, and copy it for the ScrapeSense setup.
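For reference, a standalone comment fetch with the official Python client looks roughly like the sketch below; the app's own scraping code may differ, and the key, video ID, and parameters are placeholders:

```python
# Standalone sketch of fetching comments with google-api-python-client
# (pip install google-api-python-client); values are illustrative only.
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")
response = youtube.commentThreads().list(
    part="snippet",
    videoId="VIDEO_ID",
    maxResults=100,
    textFormat="plainText",
).execute()

for item in response["items"]:
    print(item["snippet"]["topLevelComment"]["snippet"]["textDisplay"])
```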
- Open the ScrapeSense-FlaskApp notebook in Google Colab.
- After creating your Ngrok account, copy your Authtoken.
- In the ScrapeSense-FlaskApp notebook, paste your Authtoken in the code section provided in the first step. This will configure Ngrok to connect your local server to the public internet.
- First, run the initial part of the code. This may take a few minutes to set up the Flask app and install the required libraries and models.
- Then, run the second part of the code. This step is usually faster and sets up all the app routes.
- Once the second part runs, you will see the Ngrok URL in the output. It will look something like https://xyz.ngrok-free.app. Use this URL to interact with the ScrapeSense app.
- Alternatively, you can access the generated Ngrok URL directly from your Ngrok Agents dashboard: click on the running agent and copy the URL provided.
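Under the hood, exposing a Colab-hosted Flask server through an Ngrok tunnel follows this general pattern; the sketch is illustrative, and the `/health` route is a placeholder rather than one of the notebook's actual routes:

```python
# General pattern for exposing a Flask app from Colab through an Ngrok tunnel.
# The /health route is a placeholder; the real notebook defines its own routes.
from flask import Flask
from pyngrok import ngrok

app = Flask(__name__)

@app.route("/health")
def health():
    return {"status": "ok"}

tunnel = ngrok.connect(5000)        # opens the public tunnel
print("Public URL:", tunnel.public_url)  # e.g. https://xyz.ngrok-free.app
app.run(port=5000)
```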
- Open the ScrapeSense GUI.
- Ngrok URL:
  - Paste the Ngrok URL generated in the previous steps by the ScrapeSense-FlaskApp. This URL connects your Flask server in Colab to the ScrapeSense GUI.
- YouTube API Key:
  - Paste the YouTube API Key that you obtained from the Google Cloud Console. This key allows the ScrapeSense app to access YouTube comments and scrape data.
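Before pasting the URL into the GUI, you can sanity-check that the tunnel responds. The endpoint path below is hypothetical; any route your notebook exposes will do:

```python
# Quick sanity check that the Ngrok tunnel is reachable (endpoint path is hypothetical).
import requests

resp = requests.get("https://xyz.ngrok-free.app/health", timeout=10)
print(resp.status_code)
```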
Click "Start" to begin the scraping and processing workflow. The app will go through 4 pipelines:
- The app will scrape comments from any YouTube video using the YouTube API.
- The app will clean the Arabic comments by removing duplicates and fixing any missing or incorrect values.
- The app will apply NLP techniques to process and embed the words in the dataset for analysis.
- The app will split the data into training and testing sets and apply basic machine learning models for analysis.
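The last pipeline corresponds roughly to a standard scikit-learn workflow like the one below; this is an illustrative baseline, and the app's exact models, features, and parameters may differ:

```python
# Illustrative train/test split + Naive Bayes baseline (not the app's exact code).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

df = pd.read_csv("comments_clean.csv")   # Comment / Label columns
X_train, X_test, y_train, y_test = train_test_split(
    df["Comment"], df["Label"], test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(X_train), y_train)
pred = clf.predict(vectorizer.transform(X_test))
print("Accuracy:", accuracy_score(y_test, pred))
```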
If you run into any issues, try the following steps:
- Flask App Status: If the Flask app is down, it's likely because the Ngrok agent is down. Check the Ngrok Agents dashboard for its status. If it's down, restart the ScrapeSense-FlaskApp (Step 3 in the setup process).
  Note: If you receive an error saying that the Colab notebook has crashed, refresh the page and rerun the notebook; this usually solves the problem.
- API Key Errors: Double-check that your API key is correct and that the YouTube Data API v3 is enabled on Google Cloud.
- Dataset Issues: If the dataset isn't loading or cleaning properly, ensure that the file is in CSV format and that it contains only the two required columns: Comment and Label.
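A quick way to confirm your file matches that format (the column names are taken from the requirement above; the file name is a placeholder):

```python
# Quick check that the dataset has exactly the two required columns.
import pandas as pd

df = pd.read_csv("dataset.csv")
assert list(df.columns) == ["Comment", "Label"], f"Unexpected columns: {list(df.columns)}"
print(df.isna().sum())   # spot missing values before uploading
```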
- GitHub: @bensaied
Contributions are always welcome! Feel free to fork this repository and submit pull requests.
To support the project, don't forget to leave a star ⭐️.
This project is MIT licensed.