Here’s an updated version of the README file that incorporates the new functionalities you’ve added, including clearing the output directory before generating the new transcript markdown files, and handling GPT summaries.
This tool allows you to download transcripts and audio files from YouTube videos, generate summaries using OpenAI's GPT models, and handle files in a more structured way. You can download transcripts and audio for a single video, multiple videos, all videos from a YouTube channel, or all videos from a YouTube playlist.
- Download transcripts as Markdown files
- Generate summary files for transcripts using GPT
- Download audio files as MP3
- Support for single video, multiple videos, YouTube channel, and YouTube playlist
- Option to include timestamps in the downloaded transcripts
- Automatic clearing of the transcript directory to ensure only the relevant files are processed for GPT summary generation
- Use of customizable templates for transcript and summary files
Before using this tool, make sure you have the following:
- Python 3.x installed
- Google YouTube API key
- OpenAI API key for GPT summaries
- Clone the repository:
git clone https://github.com/Netajam/Youtube-Video-Transcript-Fetcher.git
- Navigate to the project directory:
cd Youtube-Video-Transcript-Fetcher
- (Optional but recommended) Create a Python virtual environment:
python3 -m venv venv
- Activate the virtual environment:
- On Windows:
venv\Scripts\activate
- On macOS and Linux:
source venv/bin/activate
- On Windows:
- Install the required dependencies:
pip install -r requirements.txt
- Set up the environment variables:
- Create a
.env
file in the root of the project directory and add your YouTube API and OpenAI API keys:youtube_api_key = "YOUR_YOUTUBE_API_KEY" openai_api_key = "YOUR_OPENAI_API_KEY"
- Create a
Configure the tool by setting the video, channel, or playlist details in config.py
:
# config.py
# Single video ID
video_id = 'O9upVbGSBFo'
# Multiple video IDs
video_ids = ['abc123', 'def456', 'ghi789']
# Channel ID
channel_id = 'UCQNsB8n2QhRzjdfjyZ_2S_g'
# Playlist IDs
playlist_id = 'PLAzuPFpwy9ZLIG5hDwjGyccJr6FpTsrEr'
# File paths
transcript_files_path = 'transcripts' # Path where transcript files will be saved
SUMMARY_FILES_PATH = 'summaries' # Path where summary files will be saved
AUDIO_FILES_PATH = 'audio' # Path where audio files will be saved
TEMPLATES_PATH = 'templates' # Path where template files are located
TRANSCRIPT_PARTS_DIR = 'temp/transcript_parts' # Path for temporary transcript parts
gpt_summaries = 'summaries_gpt' # Path for GPT summary files
# Template file names
transcript_template_file = 'transcript_template.md'
SUMMARY_TEMPLATE_FILE = 'summary_template.md'
# Transcript settings
language_to_fetch = 'en'
# GPT model configuration
OPENAI_MODEL = "gpt-4o"
The templates for the transcript and summary markdown files are located in the templates
folder:
transcript_template.md
: Template for the transcript file.summary_template.md
: Template for the summary file.
You can customize these templates to change the structure of the generated markdown files.
-
Clearing Output Directory: Before generating transcript markdown files, the tool ensures that the output directory (e.g.,
temp/transcript_prompts
) is cleared of all files and subdirectories. This ensures that only the current transcript parts are processed.This prevents mixing old and new files, which is important when generating GPT summaries, ensuring that the files processed are only related to the current task.
-
Generate GPT Summaries: After generating transcript markdown files, the tool can send requests to OpenAI's API to generate summaries for the transcript parts. It monitors the OpenAI API's token limits to ensure compliance.
- Set the
video_id
variable inconfig.py
to the desired video ID. - Run the following command to download the transcript and generate the summary file:
python main.py
- Run the following command to download the audio:
python main.py -ad
- Set the
video_ids
variable inconfig.py
to a list of video IDs. - Run the following command:
python main.py -mv
- Set the
channel_id
variable inconfig.py
to the desired YouTube channel ID. - Run the following command:
python main.py -c
- Set the
playlist_id
variable inconfig.py
to the desired YouTube playlist ID. - Run the following command:
python main.py -pl
To generate summaries using GPT after the transcripts have been downloaded and processed:
- Ensure the OpenAI API key is set in the
.env
file. - Set the
OPENAI_MODEL
and other GPT-related configurations inconfig.py
. - Run the script to generate summaries:
python main.py --gpt
The script will clear the output directory, generate new transcript parts, and send them to OpenAI for summarization.
- Make sure you have a valid Google YouTube API key and OpenAI API key.
- The tool uses the
youtube_transcript_api
library to fetch transcripts. If a transcript is not available for a video, it will skip that video and move on to the next one. - The audio files are downloaded using the
pytube
library. Ensure it's installed by runningpip install pytube
. - The filenames of the Markdown files are sanitized to remove any characters that are not allowed in filenames.
- The transcript and summary files are generated based on the templates in the
templates
folder. You can customize these templates to suit your needs.
If you have any questions or issues, feel free to open an issue in the repository.