Implement video_livechat command #7


Open
wants to merge 67 commits into
base: develop
67 commits
2ecbda4
Add console_scripts to setup.cfg
turicas May 20, 2024
252ff46
Implement draft CLI module
turicas May 20, 2024
dcc9e2f
Add old/draft CLI search code
turicas May 20, 2024
f2540a8
Add useful scripts (to be added to utils and CLI)
turicas Jun 8, 2024
079e5ee
- Add argparse integration and command handling for Youtube CLI Tool
aninhasalesp Jun 25, 2024
4c5d151
- Implemented method to extract URLs from a CSV file;
aninhasalesp Jun 25, 2024
943f6b0
- Implemented command to extract YouTube channel IDs from a list of U…
aninhasalesp Jun 25, 2024
b4f82e5
- Added to the list;
aninhasalesp Jun 25, 2024
525015e
Update cli.py
aninhasalesp Jun 26, 2024
4fba6d4
- Removed the type annotation from the method;
aninhasalesp Jun 27, 2024
2ba79df
- Add changed the method signature in the class to accept (**kwargs…
aninhasalesp Jun 27, 2024
8ab5185
- Fixed typing error in all in the file.
aninhasalesp Jun 28, 2024
6b28320
Add updates docstrings
aninhasalesp Jun 28, 2024
dfc2011
Update import
aninhasalesp Jul 2, 2024
b1b3367
Add update command into the file
aninhasalesp Jul 2, 2024
28b2574
Add update
aninhasalesp Jul 2, 2024
fe180fb
Add improvements to the file
aninhasalesp Jul 2, 2024
d4e66b4
Add test for cli file
aninhasalesp Jul 2, 2024
4bf29ff
Add test for base file
aninhasalesp Jul 2, 2024
216e5f2
Add test for channel_id command
aninhasalesp Jul 2, 2024
1b335b7
add docstrings
aninhasalesp Jul 5, 2024
c5ad8fd
- Implement ChannelInfo class to fetch YouTube channel information fr…
aninhasalesp Jun 25, 2024
e718d4a
- Included ChannelInfo in the list of commands in COMMANDS.
aninhasalesp Jun 25, 2024
7dc7b8d
Add updates docstrings
aninhasalesp Jun 28, 2024
ed012e5
Add updates docstrings
aninhasalesp Jun 28, 2024
9a5fe66
- Add updates
aninhasalesp Jun 28, 2024
8ba47cf
- Add updates
aninhasalesp Jun 28, 2024
c08e4ec
- Add test for channel_info command;
aninhasalesp Jul 4, 2024
a5bb13d
add docstrings
aninhasalesp Jul 5, 2024
923170e
fix
aninhasalesp Jul 6, 2024
d77a36b
Add new command in list
aninhasalesp Jun 30, 2024
734e981
- Implement CSV input processing for video IDs and URLs in VideoInfo …
aninhasalesp Jun 30, 2024
e9000ca
Add updates docstrings
aninhasalesp Jun 28, 2024
d4327d1
Add updates docstrings
aninhasalesp Jun 28, 2024
c4134e0
- Add updates
aninhasalesp Jun 28, 2024
0683403
- Add updates
aninhasalesp Jun 28, 2024
60bd144
Add update
aninhasalesp Jul 2, 2024
916d633
add config optional argmuments
aninhasalesp Jul 2, 2024
65f44fb
add not implemented error
aninhasalesp Jul 2, 2024
eddfb96
add video_id_from_url static method
aninhasalesp Jul 3, 2024
b344a72
add video-info from url case
aninhasalesp Jul 3, 2024
64252d7
- Add test for channel_info command;
aninhasalesp Jul 4, 2024
801e5f3
Add test for video_info command;
aninhasalesp Jul 4, 2024
9572aed
fix
aninhasalesp Jul 4, 2024
5de983d
add docstrings
aninhasalesp Jul 5, 2024
e8ab076
Add updates docstrings
aninhasalesp Jun 28, 2024
cac8aae
- Add updates
aninhasalesp Jun 28, 2024
797b4cb
Add test for base file
aninhasalesp Jul 2, 2024
3061ca8
Add update
aninhasalesp Jul 2, 2024
ca3edc1
Add video_search command
aninhasalesp Jul 3, 2024
fb6391e
Fix
aninhasalesp Jul 3, 2024
301a2e0
Add update
aninhasalesp Jul 3, 2024
68d5ea5
add video-search from url case
aninhasalesp Jul 3, 2024
6ab1762
- Add test for channel_info command;
aninhasalesp Jul 4, 2024
937ad3d
- Add test for video_search command;
aninhasalesp Jul 5, 2024
00a3097
add docstrings
aninhasalesp Jul 5, 2024
0552a62
add updates channel_info test
aninhasalesp Jul 6, 2024
4dc8f34
Remove unnecessary comment
aninhasalesp Jul 3, 2024
cfa0532
Add video_comments command
aninhasalesp Jul 3, 2024
221770c
Add update
aninhasalesp Jul 3, 2024
cfcccbc
- Add test for video_search command;
aninhasalesp Jul 5, 2024
4112d02
- Add test for video_comments command
aninhasalesp Jul 5, 2024
59d1ad5
add docstrings
aninhasalesp Jul 5, 2024
bd78a70
Add video_livechat command
aninhasalesp Jul 3, 2024
bc96643
- Add test for video_search command;
aninhasalesp Jul 5, 2024
6db0448
- Add test for video_livechat command
aninhasalesp Jul 5, 2024
be977aa
add docstrings
aninhasalesp Jul 5, 2024
187 changes: 187 additions & 0 deletions scripts/channel_data.py
@@ -0,0 +1,187 @@
# pip install youtool[livechat,transcription]
import argparse
import csv  # needed by CsvLazyDictWriter (csv.DictWriter)
import os
from pathlib import Path

from chat_downloader.errors import ChatDisabled, LoginRequired, NoChatReplay
from tqdm import tqdm
from youtool import YouTube


class CsvLazyDictWriter:  # Got and adapted from <https://github.com/turicas/rows>
    """Lazy CSV dict writer, so you don't need to specify field names beforehand

    This class is almost the same as `csv.DictWriter` with the following
    differences:

    - You don't need to pass `fieldnames` (it's extracted on the first
      `.writerow` call);
    - You can pass either a filename or a fobj (like `sys.stdout`);
    """

    def __init__(self, filename_or_fobj, encoding="utf-8", *args, **kwargs):
        self.writer = None
        self.filename_or_fobj = filename_or_fobj
        self.encoding = encoding
        self._fobj = None
        self.writer_args = args
        self.writer_kwargs = kwargs

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

    @property
    def fobj(self):
        if self._fobj is None:
            if getattr(self.filename_or_fobj, "read", None) is not None:
                self._fobj = self.filename_or_fobj
            else:
                self._fobj = open(
                    self.filename_or_fobj, mode="w", encoding=self.encoding
                )

        return self._fobj

    def writerow(self, row):
        if self.writer is None:
            self.writer = csv.DictWriter(
                self.fobj,
                fieldnames=list(row.keys()),
                *self.writer_args,
                **self.writer_kwargs
            )
            self.writer.writeheader()

        self.writerow = self.writer.writerow
        return self.writerow(row)

    def __del__(self):
        self.close()

    def close(self):
        if self._fobj and not self._fobj.closed:
            self._fobj.close()


# TODO: add options to get only part of the data (not all steps)
parser = argparse.ArgumentParser()
parser.add_argument("--api-key", default=os.environ.get("YOUTUBE_API_KEY"), help="Comma-separated list of YouTube API keys to use")
parser.add_argument("username_or_channel_url", type=str)
parser.add_argument("data_path", type=Path)
# Optional flag: a positional named "language-code" would be required and would not be reachable as `args.language_code`
parser.add_argument("--language-code", default="pt-orig", help="See the list by running `yt-dlp --list-subs <video-URL>`")
args = parser.parse_args()

if not args.api_key:
    import sys

    print("ERROR: API key must be provided either by `--api-key` or `YOUTUBE_API_KEY` environment variable", file=sys.stderr)
    sys.exit(1)
api_keys = [key.strip() for key in args.api_key.split(",") if key.strip()]


username = args.username_or_channel_url  # the positional argument defined above
if username.startswith("https://"):
    channel_url = username
    # Use the last non-empty path segment of the URL as the file name prefix
    username = [item for item in username.split("/") if item][-1]
else:
    channel_url = f"https://www.youtube.com/@{username}"
data_path = args.data_path
channel_csv_filename = data_path / f"{username}-channel.csv"
playlist_csv_filename = data_path / f"{username}-playlist.csv"
playlist_video_csv_filename = data_path / f"{username}-playlist-video.csv"
video_csv_filename = data_path / f"{username}-video.csv"
comment_csv_filename = data_path / f"{username}-comment.csv"
livechat_csv_filename = data_path / f"{username}-livechat.csv"
language_code = args.language_code
video_transcription_path = data_path / Path(f"{username}-transcriptions")

yt = YouTube(api_keys, disable_ipv6=True)
video_transcription_path.mkdir(parents=True, exist_ok=True)
channel_writer = CsvLazyDictWriter(channel_csv_filename)
playlist_writer = CsvLazyDictWriter(playlist_csv_filename)
video_writer = CsvLazyDictWriter(video_csv_filename)
comment_writer = CsvLazyDictWriter(comment_csv_filename)
livechat_writer = CsvLazyDictWriter(livechat_csv_filename)
playlist_video_writer = CsvLazyDictWriter(playlist_video_csv_filename)

print("Retrieving channel info")
channel_id = yt.channel_id_from_url(channel_url)
channel_info = list(yt.channels_infos([channel_id]))[0]
channel_writer.writerow(channel_info)
channel_writer.close()

main_playlist = {
    "id": channel_info["playlist_id"],
    "title": "Uploads",
    "description": channel_info["description"],
    "videos": channel_info["videos"],
    "channel_id": channel_id,
    "channel_title": channel_info["title"],
    "published_at": channel_info["published_at"],
    "thumbnail_url": channel_info["thumbnail_url"],
}
playlist_writer.writerow(main_playlist)
playlist_ids = [channel_info["playlist_id"]]
for playlist in tqdm(yt.channel_playlists(channel_id), desc="Retrieving channel playlists"):
    playlist_writer.writerow(playlist)
    playlist_ids.append(playlist["id"])
playlist_writer.close()

video_ids = []
for playlist_id in tqdm(playlist_ids, desc="Retrieving playlists' videos"):
    for video in yt.playlist_videos(playlist_id):
        if video["id"] not in video_ids:
            video_ids.append(video["id"])
        row = {
            "playlist_id": playlist_id,
            "video_id": video["id"],
            "video_status": video["status"],
            "channel_id": video["channel_id"],
            "channel_title": video["channel_title"],
            "playlist_channel_id": video["playlist_channel_id"],
            "playlist_channel_title": video["playlist_channel_title"],
            "title": video["title"],
            "description": video["description"],
            "published_at": video["published_at"],
            "added_to_playlist_at": video["added_to_playlist_at"],
            "tags": video["tags"],
        }
        playlist_video_writer.writerow(row)
playlist_video_writer.close()

videos = []
for video in tqdm(yt.videos_infos(video_ids), desc="Retrieving detailed video information"):
    videos.append(video)
    video_writer.writerow(video)
video_writer.close()

for video_id in tqdm(video_ids, desc="Retrieving video comments"):
    try:
        for comment in yt.video_comments(video_id):
            comment_writer.writerow(comment)
    except RuntimeError:  # Comments disabled
        continue
comment_writer.close()

print("Retrieving transcriptions")
yt.videos_transcriptions(
    video_ids,
    language_code=language_code,
    path=video_transcription_path,
    skip_downloaded=True,
    batch_size=10,
)

# TODO: live chat code will freeze if it's not available
for video_id in tqdm(video_ids, desc="Retrieving live chat"):
    try:
        for comment in yt.video_livechat(video_id):
            livechat_writer.writerow(comment)
    except (LoginRequired, NoChatReplay, ChatDisabled):
        continue
livechat_writer.close()
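The `CsvLazyDictWriter` used throughout this script defers building the `csv.DictWriter` until the first row arrives, taking the header from that row's keys. Below is a minimal standalone sketch of that idea, writing to an in-memory buffer. `LazyDictWriter` and `demo` are hypothetical names introduced only for illustration; they are not part of youtool or rows, and the sketch leaves out the file-opening and closing logic of the class above.

```python
import csv
import io


class LazyDictWriter:
    """Hypothetical, trimmed-down illustration of the lazy-header idea."""

    def __init__(self, fobj):
        self.fobj = fobj
        self.writer = None  # created on the first writerow() call

    def writerow(self, row):
        if self.writer is None:
            # Field names are taken from the keys of the first row
            self.writer = csv.DictWriter(self.fobj, fieldnames=list(row.keys()))
            self.writer.writeheader()
        return self.writer.writerow(row)


def demo():
    # Write two rows without declaring the columns beforehand
    buffer = io.StringIO()
    writer = LazyDictWriter(buffer)
    writer.writerow({"id": "abc", "title": "First"})
    writer.writerow({"id": "def", "title": "Second"})
    return buffer.getvalue()
```

One consequence of this design: the columns are fixed by the first row, so a later row carrying a key the first row did not have raises `ValueError` under `csv.DictWriter`'s default `extrasaction="raise"`.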
43 changes: 43 additions & 0 deletions scripts/clean_vtt.py
@@ -0,0 +1,43 @@
# pip install webvtt-py tqdm
import argparse
import io
from pathlib import Path

import webvtt
from tqdm import tqdm


def vtt_clean(vtt_content, same_line=False):
    result_lines, last_line = [], None
    for caption in webvtt.read_buffer(io.StringIO(vtt_content)):
        new_lines = caption.text.strip().splitlines()
        for line in new_lines:
            line = line.strip()
            if not line or line == last_line:
                continue
            result_lines.append(f"{str(caption.start).split('.')[0]} {line}\n" if not same_line else f"{line} ")
            last_line = line
    return "".join(result_lines)


parser = argparse.ArgumentParser()
parser.add_argument("input_path", type=Path)
parser.add_argument("output_path", type=Path)
args = parser.parse_args()

for filename in tqdm(args.input_path.glob("*.vtt")):
    new_filename = args.output_path / filename.name
    if new_filename.exists():
        continue
    with filename.open() as fobj:
        data = fobj.read()
    result = vtt_clean(data)
    with new_filename.open(mode="w") as fobj:
        fobj.write(result)
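The core of `vtt_clean` is dropping blank lines and lines that repeat the previously kept line, which is common in auto-generated captions where each cue re-displays the previous one. A rough sketch of that filtering on plain `(start, text)` pairs, so it runs without `webvtt-py`. `drop_repeated_lines` and its input shape are assumptions made for illustration; they are not part of the script.

```python
def drop_repeated_lines(captions):
    """Hypothetical stand-in for vtt_clean: `captions` is an iterable of
    (start_timestamp, text) pairs instead of webvtt caption objects."""
    result, last_line = [], None
    for start, text in captions:
        for line in text.strip().splitlines():
            line = line.strip()
            # Skip empty lines and lines identical to the last kept line
            if not line or line == last_line:
                continue
            # Keep only HH:MM:SS from the timestamp (drop milliseconds)
            result.append(f"{start.split('.')[0]} {line}")
            last_line = line
    return result


captions = [
    ("00:00:01.000", "hello"),
    ("00:00:02.500", "hello\nworld"),
    ("00:00:04.000", " \nworld\nagain"),
]
cleaned = drop_repeated_lines(captions)
```

Note that only consecutive repeats are removed: a line that reappears after a different line is kept.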
4 changes: 4 additions & 0 deletions setup.cfg
@@ -24,6 +24,10 @@ packages = find:
python_requires = >=3.7
install_requires = file: requirements/base.txt

[options.entry_points]
console_scripts =
    youtool = youtool:cli

[options.extras_require]
cli = file: requirements/cli.txt
dev = file: requirements/dev.txt
Empty file added tests/commands/__init__.py
Empty file.
29 changes: 29 additions & 0 deletions tests/commands/conftest.py
@@ -0,0 +1,29 @@
import pytest


@pytest.fixture
def channels_urls():
    return [
        "https://www.youtube.com/@Turicas/featured",
        "https://www.youtube.com/c/PythonicCaf%C3%A9"
    ]


@pytest.fixture
def videos_ids():
    return [
        "video_id_1",
        "video_id_2"
    ]


@pytest.fixture
def videos_urls(videos_ids):
    return [
        f"https://www.youtube.com/?v={video_id}" for video_id in videos_ids
    ]


@pytest.fixture
def usernames():
    return ["Turicas", "PythonicCafe"]
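The `videos_urls` fixture builds URLs that carry the video ID in the `v` query parameter. The commit list mentions a `video_id_from_url` static method, but its implementation is not shown in this diff, so the following is only a sketch of how such an ID could be recovered with the standard library. `video_id_from_query` is a hypothetical helper, not the PR's method.

```python
from urllib.parse import parse_qs, urlparse


def video_id_from_query(url):
    """Hypothetical helper: return the `v` query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


# URLs shaped like the ones produced by the `videos_urls` fixture
urls = [f"https://www.youtube.com/?v={video_id}" for video_id in ["video_id_1", "video_id_2"]]
ids = [video_id_from_query(url) for url in urls]
```

This handles only the query-string form used by the fixture; other URL forms would need separate handling.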