Upload Tasks: CinePile #343
Hi @JARVVVIS, all files look good. Could you post a screenshot of the evaluation results you get when running lmms-eval
on your dataset, for future reference? Then I will merge this PR. Thank you!
* Added CinePile
* corrected linting errors
@JARVVVIS I keep getting |
Hi @JohnlNguyen. I think this might be caused by how LMMS handles downloading YouTube videos here. One way I addressed this issue was to download the videos with a more robust method and then manually move them to the location in the cache where LMMS expects them. Specifically, LMMS downloads each video at --
Sharing below the script that works well for me for downloading the videos. Please set the ROOT_DIR and subdir variables according to your target download location:
import os
import tqdm
from datasets import load_dataset
import yt_dlp


def download_video(video_url, filename, root, subdir):
    """
    Downloads a video from the given URL using yt_dlp and saves it to the specified root directory.
    """
    dir_path = f"{root}/{subdir}"
    os.makedirs(dir_path, exist_ok=True)
    output_path = f"{dir_path}/{filename}.mp4"
    ydl_opts = {
        "format": "bestvideo[height<=224][ext=mp4]+bestaudio[ext=m4a]/best[height<=224][ext=mp4]/best[ext=mp4]/best",
        "outtmpl": output_path,
        "merge_output_format": "mp4",
    }
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            print(f"Attempting to download: {video_url}")
            print(f"Saving path: {output_path}")
            ydl.download([video_url])
        if os.path.exists(output_path):
            print(f"Downloaded: {output_path}; {video_url}")
            return output_path, True
        else:
            print(f"Failed to download {video_url}.")
            return None, False
    except Exception as e:
        print(f"Exception during download of {video_url}: {e}")
        return None, False


def main():
    """
    Main function to download videos listed in the CinePile dataset.
    """
    cinepile = load_dataset("tomg-group-umd/cinepile", split="test")
    eval_df = cinepile.to_pandas()
    ROOT_DIR = (
        ""  ## TODO: Set this to the root directory where you want to save the videos
    )
    subdir = ""  ## TODO: Set this to the subdirectory where you want to save the videos
    for idx, row in tqdm.tqdm(eval_df.iterrows(), total=len(eval_df), leave=True):
        yt_link = row["yt_clip_link"]
        video_filename = f"{row['movie_name']}_{yt_link.split('/')[-1]}"
        local_video_path = f"{ROOT_DIR}/{subdir}/{video_filename}.mp4"
        try:
            if not os.path.isfile(local_video_path):
                video_path, did_download = download_video(
                    yt_link, video_filename, root=ROOT_DIR, subdir=subdir
                )
                assert video_path is not None and did_download
            else:
                print(f"Skipping download. Video already exists at {local_video_path}.")
        except Exception as e:
            print(f"Got Exception for video {yt_link}: {e}")


if __name__ == "__main__":
    main()

yt-dlp version: "yt-dlp==2024.8.6"
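The comment above mentions manually moving the downloaded videos into the cache location that LMMS expects, but the thread does not show that step. Below is a minimal sketch of one way to do it, under stated assumptions: `move_videos_to_cache` is a hypothetical helper (not part of lmms-eval), both directory paths are supplied by you, and the file names are kept unchanged, which only helps if they already match what LMMS looks for.

```python
import shutil
from pathlib import Path


def move_videos_to_cache(download_dir, cache_dir):
    """Move every .mp4 file from download_dir into cache_dir.

    Hypothetical helper for illustration only. Files that already exist
    in cache_dir are left untouched. Returns the destination paths of
    the files that were actually moved.
    """
    src = Path(download_dir)
    dst = Path(cache_dir)
    dst.mkdir(parents=True, exist_ok=True)

    moved = []
    for video in sorted(src.glob("*.mp4")):
        target = dst / video.name
        if target.exists():
            # Keep the copy that is already in the cache.
            continue
        shutil.move(str(video), str(target))
        moved.append(target)
    return moved
```

A call might look like `move_videos_to_cache(f"{ROOT_DIR}/{subdir}", cache_dir)`, where `cache_dir` is the directory LMMS reads videos from on your machine. Skipping existing files makes the helper safe to re-run after a partially failed download.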
Added support for evaluating models on CinePile.
CinePile is a question-answering-based, long-form video understanding dataset. It was created using advanced large language models (LLMs) in a human-in-the-loop pipeline that leverages existing human-generated raw data. It consists of approximately 300,000 training data points and 5,000 test data points.