This repository contains scripts for collecting YouTube video data and uploading it to an S3 bucket.
- **Set Up Virtual Environment and Install Dependencies:** Create a virtual environment and install the necessary Python packages using `requirements.txt`.

  ```bash
  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```
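
  The packages themselves are pinned in the repository's `requirements.txt`; as a rough sketch, a minimal file for these scripts would cover at least the YouTube Data API client and boto3 (the exact entries and versions below are assumptions, not taken from the repository):

  ```text
  # Hypothetical example -- use the requirements.txt shipped with the repository.
  google-api-python-client
  boto3
  psycopg2-binary
  ```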
- **Collect YouTube Video Data:** Use `video_data_collector.py` to fetch data from a specific YouTube channel. Provide the channel ID as an argument; this generates a `video_data.json` file.

  ```bash
  python video_data_collector.py --channel-id <channel-id>
  ```
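
  The collector script in the repository defines exactly which fields are stored; the sketch below only illustrates one plausible way to pull per-video statistics for a channel with `google-api-python-client`. The `YOUTUBE_API_KEY` variable, the 50-video limit (no pagination), and the engagement-rate formula are assumptions for illustration.

  ```python
  import json
  import os

  from googleapiclient.discovery import build  # pip install google-api-python-client

  # Assumption: the API key is supplied via an environment variable.
  youtube = build("youtube", "v3", developerKey=os.environ["YOUTUBE_API_KEY"])

  def collect_channel_videos(channel_id):
      """Fetch up to 50 recent videos for a channel and compute a simple engagement rate."""
      search = youtube.search().list(
          part="id", channelId=channel_id, type="video", order="date", maxResults=50
      ).execute()
      video_ids = [item["id"]["videoId"] for item in search.get("items", [])]

      videos = []
      if video_ids:
          stats = youtube.videos().list(
              part="snippet,statistics", id=",".join(video_ids)
          ).execute()
          for item in stats.get("items", []):
              s = item["statistics"]
              views = int(s.get("viewCount", 0))
              likes = int(s.get("likeCount", 0))
              comments = int(s.get("commentCount", 0))
              videos.append({
                  "video_id": item["id"],
                  "title": item["snippet"]["title"],
                  "description": item["snippet"]["description"],
                  "publish_time": item["snippet"]["publishedAt"],
                  "view_count": views,
                  "like_count": likes,
                  "comment_count": comments,
                  # Illustrative definition: interactions per view.
                  "engagement_rate": (likes + comments) / views if views else 0.0,
              })
      return videos

  if __name__ == "__main__":
      with open("video_data.json", "w") as f:
          json.dump(collect_channel_videos("<channel-id>"), f, indent=2)
  ```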
- **Configure AWS Credentials:** Before running `upload_to_s3.py`, configure your AWS credentials. Replace the placeholders with your actual AWS keys and bucket name.

  ```python
  aws_access_key_id = 'your-access-key-id'
  aws_secret_access_key = 'your-secret-access-key'
  aws_region = 'your-aws-region'
  s3_bucket_name = 'your-s3-bucket-name'
  ```
- **Upload Data to S3:** Upload the collected video data to your specified S3 bucket.

  ```bash
  python upload_to_s3.py
  ```
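
  The upload logic lives in `upload_to_s3.py`; the snippet below is only a minimal sketch of what such a script typically does with boto3, using the placeholders from the previous step (the object key `video_data.json` is an assumption):

  ```python
  import boto3  # pip install boto3

  # Placeholders as described above -- replace with your own values.
  aws_access_key_id = 'your-access-key-id'
  aws_secret_access_key = 'your-secret-access-key'
  aws_region = 'your-aws-region'
  s3_bucket_name = 'your-s3-bucket-name'

  # Create an S3 client with the configured credentials and region.
  s3 = boto3.client(
      "s3",
      aws_access_key_id=aws_access_key_id,
      aws_secret_access_key=aws_secret_access_key,
      region_name=aws_region,
  )

  # Upload the collected data; the object key is assumed here.
  s3.upload_file("video_data.json", s3_bucket_name, "video_data.json")
  print(f"Uploaded video_data.json to s3://{s3_bucket_name}/video_data.json")
  ```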
- **Optional Steps:**
  - Schedule `video_data_collector.py` to run regularly using cron jobs or Airflow to keep the data up to date.
  - Use AWS Glue for ETL operations and to maintain an up-to-date database via scheduled crawlers and jobs.
- **AWS Glue and RDS Setup:**
  - Set up AWS Glue to read data from S3 and store it in AWS RDS.
  - Provision an AWS RDS instance using PostgreSQL.
  - Connect to RDS with `psql` and create the required tables.

  ```bash
  sudo yum install postgresql15
  psql -h <endpoint-url> -U <username> -d <database-name>
  ```

  ```sql
  CREATE TABLE video_data (
      video_id VARCHAR(255) PRIMARY KEY,
      title TEXT NOT NULL,
      description TEXT,
      publish_time TIMESTAMP NOT NULL,
      view_count INT,
      like_count INT,
      comment_count INT,
      engagement_rate FLOAT
  );

  CREATE TABLE top_videos (
      video_id VARCHAR(255) PRIMARY KEY,
      title TEXT NOT NULL,
      description TEXT,
      publish_time TIMESTAMP NOT NULL,
      view_count INT,
      like_count INT,
      comment_count INT,
      engagement_rate FLOAT
  );
  ```
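
  How rows actually reach RDS is determined by your Glue job or Airflow task; the following is only a sketch of loading `video_data.json` into the `video_data` table with `psycopg2`. The connection details and the `ON CONFLICT DO NOTHING` handling are assumptions.

  ```python
  import json

  import psycopg2  # pip install psycopg2-binary

  # Assumed connection details -- use your RDS endpoint, database, and credentials.
  conn = psycopg2.connect(
      host="<endpoint-url>",
      dbname="<database-name>",
      user="<username>",
      password="<password>",
  )

  with open("video_data.json") as f:
      videos = json.load(f)

  insert_sql = """
      INSERT INTO video_data
          (video_id, title, description, publish_time,
           view_count, like_count, comment_count, engagement_rate)
      VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
      ON CONFLICT (video_id) DO NOTHING
  """

  # "with conn" wraps the inserts in a single transaction and commits on success.
  with conn, conn.cursor() as cur:
      cur.executemany(insert_sql, [
          (v["video_id"], v["title"], v["description"], v["publish_time"],
           v["view_count"], v["like_count"], v["comment_count"], v["engagement_rate"])
          for v in videos
      ])

  conn.close()
  ```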
- **Set Up Airflow:** Install Airflow and its Amazon provider package to schedule and automate ETL processes.

  ```bash
  pip install "apache-airflow[celery]==2.8.4" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.4/constraints-3.8.txt"
  pip install apache-airflow-providers-amazon
  airflow standalone
  ```

  Access the Airflow webserver at http://localhost:8080/ and configure environment variables for AWS and Google API keys.
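
  The DAG file in the repository is authoritative; the sketch below only illustrates the general shape of a `youtube_data_pipeline` DAG wiring up the two task IDs used in the test commands below. The callables, dependency order, and daily schedule are placeholders, not the repository's actual implementation.

  ```python
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.python import PythonOperator

  def preprocess_data(**context):
      """Placeholder: clean and enrich the collected video data."""
      ...

  def save_to_rds(**context):
      """Placeholder: write the processed rows into the RDS tables."""
      ...

  with DAG(
      dag_id="youtube_data_pipeline",
      start_date=datetime(2023, 1, 1),
      schedule="@daily",  # assumed cadence
      catchup=False,
  ) as dag:
      preprocess = PythonOperator(task_id="preprocess_data", python_callable=preprocess_data)
      save_to_rds_task = PythonOperator(task_id="save_to_rds_task", python_callable=save_to_rds)

      preprocess >> save_to_rds_task
  ```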

  To debug specific tasks within the Airflow pipeline, you can manually test them:

  ```bash
  airflow tasks test youtube_data_pipeline preprocess_data '2023-01-01'
  airflow tasks test youtube_data_pipeline save_to_rds_task '2023-01-01'
  ```