Upload Tasks: CinePile #343

Merged
merged 3 commits into from
Oct 24, 2024

JARVVVIS
Contributor

Added support for evaluating models on CinePile.

CinePile is a question-answering-based, long-form video understanding dataset. It was created using advanced large language models (LLMs) in a human-in-the-loop pipeline that leverages existing human-generated raw data. It consists of approximately 300,000 training data points and 5,000 test data points.

Collaborator

@kcz358 kcz358 left a comment

Hi @JARVVVIS, all files look good. Can you post a screenshot of the evaluation result you get when running lmms-eval on your dataset, for future reference? Then I will merge this PR. Thank you!

@JARVVVIS
Contributor Author

For command:

python3 -m accelerate.commands.launch     --num_processes=1     -m lmms_eval     --model video_llava  --tasks cinepile     --batch_size 1     --log_samples     --log_samples_suffix video_llava_cinepile     --output_path ./logs/ --verbosity=DEBUG
[Screenshot of the evaluation results, 2024-10-23 at 6:47:07 PM]

@kcz358 kcz358 merged commit 100ab6f into EvolvingLMMs-Lab:main Oct 24, 2024
1 check passed
KairuiHu pushed a commit that referenced this pull request Oct 24, 2024
* Added CinePile

* corrected linting errors
@JohnlNguyen

@JARVVVIS I keep getting the error "Sign in to confirm you’re not a bot. This helps protect our community. Learn more." How do I sign in so that I can download the videos?

@JARVVVIS
Contributor Author

JARVVVIS commented Nov 9, 2024

Hi @JohnlNguyen.

I think this might be occurring due to how LMMS handles downloading YouTube videos here.

One way I addressed this issue was by using a more robust method to download the videos and then manually moving them to the location in the cache where LMMS expects them. Specifically, LMMS downloads the videos to video_path = os.path.join(hf_home, task), where hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/") and task = 'cinepile' (as defined in the config). However, this is not the cleanest solution, since LMMS will still attempt to download the videos: it does not check whether the files already exist at those paths, but instead relies on a {task}_download_status.json file. Once LMMS completes its download attempts for all videos, it should still function correctly. Alternatively, you could modify the lmms_eval/api/task.py file to handle this more effectively.

Below is the script that works well for me for downloading the videos. Please set the ROOT_DIR and subdir variables based on the target download location:

import os
import tqdm
from datasets import load_dataset
import yt_dlp


def download_video(video_url, filename, root, subdir):
   """
   Downloads a video from the given URL using yt_dlp and saves it to the specified root directory.
   """
   dir_path = f"{root}/{subdir}"
   os.makedirs(dir_path, exist_ok=True)

   output_path = f"{dir_path}/{filename}.mp4"

   ydl_opts = {
       "format": "bestvideo[height<=224][ext=mp4]+bestaudio[ext=m4a]/best[height<=224][ext=mp4]/best[ext=mp4]/best",
       "outtmpl": output_path,
       "merge_output_format": "mp4",
   }

   try:
       with yt_dlp.YoutubeDL(ydl_opts) as ydl:
           print(f"Attempting to download: {video_url}")
           print(f"Saving path: {output_path}")
           ydl.download([video_url])

       if os.path.exists(output_path):
           print(f"Downloaded: {output_path}; {video_url}")
           return output_path, True
       else:
           print(f"Failed to download {video_url}.")
           return None, False

   except Exception as e:
       print(f"Exception during download of {video_url}: {e}")
       return None, False


def main():
   """
   Main function to download videos listed in the CinePile dataset.
   """
   cinepile = load_dataset("tomg-group-umd/cinepile", split="test")
   eval_df = cinepile.to_pandas()

   ROOT_DIR = ""  # TODO: Set this to the root directory where you want to save the videos
   subdir = ""  # TODO: Set this to the subdirectory where you want to save the videos

   for idx, row in tqdm.tqdm(eval_df.iterrows(), total=len(eval_df), leave=True):
       yt_link = row["yt_clip_link"]
       video_filename = f"{row['movie_name']}_{yt_link.split('/')[-1]}"
       local_video_path = f"{ROOT_DIR}/{subdir}/{video_filename}.mp4"

       try:
           if not os.path.isfile(local_video_path):
               video_path, did_download = download_video(
                   yt_link, video_filename, root=ROOT_DIR, subdir=subdir
               )
               assert video_path is not None and did_download
           else:
               print(f"Skipping download. Video already exists at {local_video_path}.")
       except Exception as e:
           print(f"Got Exception for video {yt_link}: {e}")


if __name__ == "__main__":
   main()

yt-dlp version: "yt-dlp==2024.8.6"
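
For the "manually moving them" step described above, a minimal sketch could look like the following. It only mirrors the path expression quoted earlier, os.path.join(hf_home, task). The helper names expected_video_dir and stage_videos are hypothetical and not part of lmms-eval, and the assumptions that "~" should be expanded and that the videos belong directly in that directory (rather than in a subfolder) should be checked against lmms_eval/api/task.py and the task config before relying on it.

```python
import os
import shutil
from pathlib import Path


def expected_video_dir(task="cinepile"):
    """Mirror the expression quoted above: os.path.join(hf_home, task).

    Assumption: "~" in the default HF_HOME value is expanded to the home directory.
    """
    hf_home = os.path.expanduser(os.getenv("HF_HOME", "~/.cache/huggingface/"))
    return Path(hf_home) / task


def stage_videos(download_dir, task="cinepile"):
    """Copy every .mp4 found under download_dir into the expected cache directory.

    Hypothetical helper: files that already exist at the destination are skipped.
    Returns the list of newly copied destination paths.
    """
    target = expected_video_dir(task)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(download_dir).rglob("*.mp4")):
        dst = target / src.name
        if not dst.exists():
            shutil.copy2(src, dst)  # copy rather than move, so the originals are kept
            copied.append(dst)
    return copied
```

Calling stage_videos with the directory the download script wrote to (ROOT_DIR joined with subdir) would place the files under the cache directory. This does not touch the {task}_download_status.json file, so LMMS may still attempt its own downloads as noted above.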

ZhaoCinyu pushed a commit to ZhaoCinyu/lmms-eval that referenced this pull request Dec 9, 2024
* Added CinePile

* corrected linting errors
3 participants