This README provides a detailed walkthrough of our proposed quantitative benchmarking framework. The framework enables an in-depth evaluation of video-based conversational models through two types of assessments:
- Video-based Generative Performance Benchmarking
- Zero-Shot Question-Answer Evaluation
Our framework introduces a benchmark designed to assess the text generation performance of video-based conversational models. We leverage a test set of 500 samples curated from the ActivityNet-200 videos for this purpose.
You can download the videos from here and the corresponding human-generated detailed descriptions from here.
Our benchmarks cover five key aspects:
- Correctness of Information
- Detail Orientation
- Contextual Understanding
- Temporal Understanding
- Consistency
Evaluation Aspect | Video Chat | LLaMA Adapter | Video LLaMA | Video-ChatGPT |
---|---|---|---|---|
Correctness of Information | 2.23 | 2.03 | 1.96 | 2.40 |
Detail Orientation | 2.50 | 2.32 | 2.18 | 2.52 |
Contextual Understanding | 2.53 | 2.30 | 2.16 | 2.62 |
Temporal Understanding | 1.94 | 1.98 | 1.82 | 1.98 |
Consistency | 2.24 | 2.15 | 1.79 | 2.37 |
We generate task-specific question-answers by querying the GPT-3.5-Turbo model using the human-generated detailed video descriptions. The generated question-answer pairs are available for download here.
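As an illustration of this step, the snippet below sketches how question-answer pairs could be generated from a detailed description with the legacy (pre-1.0) `openai` Python client. The prompt wording, the `generate_qa_pairs` helper, and the output format are assumptions for illustration only, not the exact prompts used to build the released benchmark files.

```python
import json
import openai  # legacy (pre-1.0) openai client interface

openai.api_key = "<openai-api-key>"

def generate_qa_pairs(description):
    """Ask GPT-3.5-Turbo for question-answer pairs grounded in a video description.
    Hypothetical helper for illustration; the real prompts differ per criterion."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You generate question-answer pairs about a video, based only "
                        "on the provided description. Reply as a JSON list of "
                        "{\"Q\": ..., \"A\": ...} objects."},
            {"role": "user", "content": f"Video description: {description}"},
        ],
    )
    return json.loads(response["choices"][0]["message"]["content"])

qa_pairs = generate_qa_pairs("A man is performing a skateboard trick in a park ...")
print(qa_pairs[0])
```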
Follow the steps below to perform the quantitative benchmarking:
Step 1: Run the inference using the provided question-answer pairs for each criterion.
python video_chatgpt/eval/run_inference_benchmark_general.py \
--video_dir <path-to-directory-containing-videos> \
--gt_file <ground-truth-file-containing-question-answer-pairs> \
--output_dir <output-dir-path> \
--output_name <output-file-name> \
--model-name <path-to-LLaVA-Lightening-7B-v1-1> \
--projection_path <path-to-Video-ChatGPT-weights>
- Note that the question-answer pairs (gt_file) are the same for the correctness, detail orientation, and contextual understanding criteria.
- For temporal understanding and consistency, separate question-answer pairs are provided.
Step 2: Execute the corresponding evaluation script to perform benchmarking.
For example, for the correctness criterion:
python quantitative_evaluation/evaluate_benchmark_1_correctness.py \
--pred_path <path-to-prediction-file-generated-using-inference-script> \
--output_dir <output-directory-path> \
--output_json <path-to-save-annotation-final-combined-json-file> \
--api_key <openai-api-key-to-access-GPT3.5-Turbo-model>
For evaluation on all five criteria, you can use:
bash quantitative_evaluation/evaluate_benchmark.sh
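The evaluation scripts use GPT-3.5-Turbo as a judge for each criterion. The sketch below outlines the general idea for the correctness criterion, assuming the legacy `openai` client; the actual judge prompt and response parsing in our scripts differ, so treat this only as an illustration.

```python
import ast
import openai  # legacy (pre-1.0) openai client interface

openai.api_key = "<openai-api-key>"

def judge_correctness(question, answer, prediction):
    """Ask GPT-3.5-Turbo to rate the factual correctness of a predicted answer (1-5).
    Illustrative only; the prompt wording is an assumption, not the shipped prompt."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You evaluate the factual correctness of answers to video-based "
                        "questions. Return a Python dictionary such as {'score': 4}, "
                        "where score is an integer from 1 to 5."},
            {"role": "user",
             "content": f"Question: {question}\n"
                        f"Correct Answer: {answer}\n"
                        f"Predicted Answer: {prediction}"},
        ],
    )
    return ast.literal_eval(response["choices"][0]["message"]["content"])["score"]
```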
Note: To further understand how the question-answer annotations are prepared for the benchmarking, refer to: benchmark_dataset_generation.
Our framework facilitates zero-shot evaluation on four standard open-ended question-answer datasets: MSRVTT-QA, MSVD-QA, TGIF-QA, and ActivityNet-QA. For brevity, we present the evaluation method on ActivityNet-QA. The evaluation protocol remains the same for all datasets, except for some dataset-specific changes related to videos and annotations.
Model | MSVD-QA Accuracy | MSVD-QA Score | MSRVTT-QA Accuracy | MSRVTT-QA Score | TGIF-QA Accuracy | TGIF-QA Score | ActivityNet-QA Accuracy | ActivityNet-QA Score |
---|---|---|---|---|---|---|---|---|
FrozenBiLM | 32.2 | -- | 16.8 | -- | 41.0 | -- | 24.7 | -- |
Video Chat | 56.3 | 2.8 | 45.0 | 2.5 | 34.4 | 2.3 | 26.5 | 2.2 |
LLaMA Adapter | 54.9 | 3.1 | 43.8 | 2.7 | -- | -- | 34.2 | 2.7 |
Video LLaMA | 51.6 | 2.5 | 29.6 | 1.8 | -- | -- | 12.4 | 1.1 |
Video-ChatGPT | 64.9 | 3.3 | 49.3 | 2.8 | 51.4 | 3.0 | 35.2 | 2.7 |
Follow these steps to conduct the evaluation:
Step 1: Run the inference. You'll need the following:
a) Videos: Download the videos for ActivityNet-QA from here.
b) Question and answer annotations: You can obtain these from the official GitHub repository, or download from here.
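If you want to inspect the annotations before running inference, the snippet below shows one way to pair questions with answers. The filenames match the placeholders in the command below, but the field names (`question_id`, `question`, `answer`) are assumptions based on the common ActivityNet-QA layout and may differ in the files you download.

```python
import json

# Hypothetical keys; adjust to match the downloaded annotation files.
with open("test_q.json") as f:
    questions = json.load(f)
with open("test_a.json") as f:
    answers = {a["question_id"]: a for a in json.load(f)}

# Print a few question-answer pairs joined on question_id.
for q in questions[:3]:
    a = answers[q["question_id"]]
    print(q["question"], "->", a["answer"])
```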
Run the command:
python video_chatgpt/eval/run_inference_activitynet_qa.py \
--video_dir <path-to-video-dir> \
--gt_file_question <test_q.json> \
--gt_file_answers <test_a.json> \
--output_dir <path-to-out-dir> \
--output_name video_chatgpt_activitynet_qa_preds \
--model-name <path-to-LLaVA-Lightening-7B-v1-1> \
--projection_path <path-to-video-chat-gpt-checkpoint>
This will generate a JSON file containing the model's predicted responses.
Step 2: Evaluate the predicted responses. The evaluation process computes the accuracy and assigns a score on a scale of 1-5. This step requires the predictions from Step 1, the question-answer pair annotations, and an OpenAI API key.
Run the command:
python quantitative_evaluation/evaluate_activitynet_qa.py \
--pred_path <video_chatgpt_activitynet_qa_preds> \
--output_dir <path-to-out-dir> \
--output_json <video_chatgpt_activitynet_qa_results> \
--api_key <your-openai-api_key> \
--num_tasks 1
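For reference, accuracy here is the fraction of predictions the GPT-3.5-Turbo judge marks as correct, and the reported score is the average of its 1-5 ratings. The sketch below shows how such per-question judgments could be aggregated once collected; the `judgments` structure with `pred`/`score` keys is an assumed format for illustration, not necessarily what the evaluation script writes to disk.

```python
# Minimal aggregation sketch. Assumes each judgment is a dict like
# {"pred": "yes" | "no", "score": 1..5}, which is an illustrative format.
def aggregate(judgments):
    correct = sum(1 for j in judgments if j["pred"].lower() == "yes")
    accuracy = correct / len(judgments)
    average_score = sum(j["score"] for j in judgments) / len(judgments)
    return accuracy, average_score

accuracy, average_score = aggregate([
    {"pred": "yes", "score": 4},
    {"pred": "no", "score": 2},
    {"pred": "yes", "score": 5},
])
print(f"Accuracy: {accuracy:.1%}, Average score: {average_score:.2f}")
```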